Bug #2291
Panic in lwkt_remove_tdallq
| Status: | Closed | Start date: | 01/26/2012 | |
|---|---|---|---|---|
| Priority: | Normal | Due date: | ||
| Assignee: | - | % Done: | 0% |
|
| Category: | - | |||
| Target version: | - |
Description
With a few days old master and under some I/O load I get the mentioned panic (the actual panic message is illegable, only the trace can be read) ... was unable to get a core dump (panic in ddb didn't produce a dump, only rebooted the machine). The panic seems to be easily repeatable (unfortunately this is the first time I actually saw a trace, because I had DDB_UNATTENDED switched on before + was in X, so could not see the message)
Part of the backtrace (only the function names, I managed to take pictures of the backtrace itself, attached to this report):
panic()
lwkt_remove_tdallq()
free_lock()
acquire_lock()
softdep_disk_io_initiation()
devfs_spec_strategy()
vop_strategy()
vn_strategy()
ufs_strategy()
ufs_vnoperate()
vop_strategy()
vn_strategy()
bawrite()
...
syncer_thread()
syncer_thread_start()
kthread_exit()
Related todos
Related issues
History
Updated by vsrinivas about 1 year ago
Also seen on 3.0.2 by tuxillo: http://leaf.dragonflybsd.org/~tuxillo/archive/pics/2291/panic1.png
And on -master by marino and vsrinivas. Callchain can be rooted at kern_exit instead of syncer.
Perhaps some blame goes to softdep's locks. It uses mplock + critical section around softdep callbacks.
Updated by vsrinivas about 1 year ago
http://leaf.dragonflybsd.org/~marino/install_panic_1.jpg is marino's panic.
All of these panics are actually: 'td_critcount would go negative', from crit_panic. This appears to come from FREE_LOCK(), which exits a critical section "softupdates"; if one were to exit a critical section one more time than they entered it, that would cause this panic.
Updated by vsrinivas about 1 year ago
I believe
will correct the problem. Any testing highly appreciated.
Basically, softdep_disk_write_complete() was blocking, losing its lock, but leaving itself marked as the lock-holder. ACQUIRE_LOCK detected this and panic-ed. This patch switches to using lockmgr locks for softdep, which are hard locks and not lost on blocking conditions.
Updated by vsrinivas about 1 year ago
I take that back; my patch introduces a deadlock between BUFWAIT and getting the ffs_softdep lock. Working on it now.
Updated by vsrinivas about 1 year ago
Second try: http://leaf.dragonflybsd.org/~vsrinivas/ffs03.patch
Reviews/testing very much appreciated!
Updated by vsrinivas about 1 year ago
Commit 8e90f899fdf61479c5e76faa87e7ff716982ed08 _should_ fix the problem.
Updated by vsrinivas about 1 year ago
There are remaining issues with this work, uncovered by fsstress:
1)
Fatal trap 12: page fault while in kernel mode
cpuid = 4
fault virtual address = 0x391
fault code = supervisor read, page not present
instruction pointer = 0x2b:0x506bfc
stack pointer = 0x10:0x1002c197538
frame pointer = 0x10:0x1002c1978c0
processor eflags = interrupt enabled, resume, IOPL = 0
current process = Idle
current thread = pri 12 (CRIT)
<- SMP: XXX
kernel: type 12 trap, code=0
CPU4 stopping CPUs: 0x00000000000000ef
stopped
Stopped at 0x506bfc: testb $0x20,0x391(%rdi)
db> where
bwrite() at 0x506bfc
softdep_fsync_mountdev() at 0x5f097e
buf_rb_tree_RB_SCAN() at 0x51b9aa
softdep_sync_metadata() at 0x5f0395
brelvp() at 0x51b171
vfsync() at 0x51bb9a
ffs_mountfs() at 0x5f6322
vop_fsync() at 0x52bae1
vinvalbuf() at 0x51bec4
vfs_getvfs() at 0x51ef57
mountlist_scan() at 0x51f8f6
mountlist_scan() at 0x51fb7c
db>
---
This panic appears to arise from us getting a NULL worklist buffer. This shouldn't be possible, it needs to be examined further.
2)
panic: handle_written_inodeblock: not started
cpuid = 0
Trace beginning at frame 0x10005d039e0
panic() at 0x4b1983
panic() at 0x4b1983
softdep_change_directoryentry_offset() at 0x5f1d31
bpdone() at 0x508429
biodone() at 0x5089ac
cluster_awrite() at 0x50f59c
biodone() at 0x508990
gptinit() at 0x671b7d
register_swi() at 0x48a3bd
Debugger("panic")
Updated by vsrinivas about 1 year ago
- Status changed from New to Resolved
Commit 24624a1562837ae797e4c1b05689f6f5b56006d9 prevents the latter two panics. I believe this issue is resolved.
Updated by rumcic about 1 year ago
- Status changed from Resolved to Closed
Unable to reproduce, I believe this issue can be closed