Panic in lwkt_remove_tdallq
With a few-days-old master and under some I/O load I get the panic named in the title. The actual panic message is illegible; only the trace can be read. I was unable to get a core dump: calling panic from ddb didn't produce a dump, it only rebooted the machine. The panic seems to be easily repeatable. Unfortunately this is the first time I actually saw a trace, because I had DDB_UNATTENDED switched on before and was in X, so I could not see the message.
Part of the backtrace (function names only; I managed to take pictures of the backtrace itself, which are attached to this report):
#1 Updated by vsrinivas almost 4 years ago
Also seen on 3.0.2 by tuxillo: http://leaf.dragonflybsd.org/~tuxillo/archive/pics/2291/panic1.png
And on -master by marino and vsrinivas. Callchain can be rooted at kern_exit instead of syncer.
Perhaps some blame goes to softdep's locks. It uses mplock + critical section around softdep callbacks.
#2 Updated by vsrinivas almost 4 years ago
http://leaf.dragonflybsd.org/~marino/install_panic_1.jpg is marino's panic.
All of these panics are actually 'td_critcount would go negative', raised from crit_panic. This appears to come from FREE_LOCK(), which exits the "softupdates" critical section; exiting a critical section one more time than it was entered causes this panic.
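To illustrate the unbalanced-exit condition described above, here is a minimal user-space sketch. It is not the DragonFly kernel code: toy_thread, toy_crit_enter, toy_crit_exit and toy_run are hypothetical names, and the panicked flag only stands in for the point where the kernel would call crit_panic.

```c
#include <stdbool.h>

/* Toy user-space model; NOT the DragonFly kernel implementation. */
struct toy_thread {
    int  critcount;  /* stands in for td_critcount */
    bool panicked;   /* set where the kernel would call crit_panic() */
};

static void toy_crit_enter(struct toy_thread *td)
{
    ++td->critcount;
}

static void toy_crit_exit(struct toy_thread *td)
{
    /* Exiting with a zero count would drive the counter negative. */
    if (td->critcount == 0) {
        td->panicked = true;
        return;
    }
    --td->critcount;
}

/* Enter `enters` times, exit `exits` times; report whether we "panicked". */
static bool toy_run(int enters, int exits)
{
    struct toy_thread td = { 0, false };
    int i;

    for (i = 0; i < enters; i++)
        toy_crit_enter(&td);
    for (i = 0; i < exits; i++)
        toy_crit_exit(&td);
    return td.panicked;
}
```

In this model, balanced calls such as toy_run(2, 2) never set the flag, while toy_run(1, 2) does, which is the shape of the failure attributed to FREE_LOCK() above.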
#3 Updated by vsrinivas almost 4 years ago
will correct the problem. Any testing highly appreciated.
Basically, softdep_disk_write_complete() was blocking, losing its lock, but leaving itself marked as the lock-holder. ACQUIRE_LOCK detected this and panicked. This patch switches to using lockmgr locks for softdep, which are hard locks and are not lost on blocking conditions.
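The stale lock-holder condition can be sketched in user space roughly as follows. This is only a toy model under assumptions, not the softdep code or the patch: toy_softlock, toy_acquire, toy_release and toy_block are hypothetical names, and a false return from toy_acquire stands in for the panic raised by ACQUIRE_LOCK.

```c
#include <stdbool.h>

#define TOY_NOHOLDER (-1)

/* Toy model of a "soft" lock; NOT the real softdep lock. */
struct toy_softlock {
    int  holder;     /* thread id recorded as holder, or TOY_NOHOLDER */
    bool in_force;   /* whether exclusion is actually being enforced */
};

/* Returns false where the real code would panic: a holder is still recorded. */
static bool toy_acquire(struct toy_softlock *lk, int tid)
{
    if (lk->holder != TOY_NOHOLDER)
        return false;
    lk->holder = tid;
    lk->in_force = true;
    return true;
}

static void toy_release(struct toy_softlock *lk)
{
    lk->holder = TOY_NOHOLDER;
    lk->in_force = false;
}

/* Blocking with a soft lock: exclusion is lost, the holder field is not cleared. */
static void toy_block(struct toy_softlock *lk)
{
    lk->in_force = false;
}

/* Thread 1 acquires and blocks; thread 2 then runs and tries to acquire. */
static bool toy_stale_holder_detected(void)
{
    struct toy_softlock lk = { TOY_NOHOLDER, false };

    toy_acquire(&lk, 1);
    toy_block(&lk);               /* lock lost, holder still says 1 */
    return !toy_acquire(&lk, 2);  /* stale holder seen */
}

/* Normal path: acquire, release, acquire again. */
static bool toy_clean_handoff(void)
{
    struct toy_softlock lk = { TOY_NOHOLDER, false };

    toy_acquire(&lk, 1);
    toy_release(&lk);
    return toy_acquire(&lk, 2);
}
```

A hard lock such as lockmgr avoids this state because blocking does not drop the lock, so the holder field and the actual exclusion cannot drift apart in this way.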
#5 Updated by vsrinivas almost 4 years ago
Reviews/testing very much appreciated!
#7 Updated by vsrinivas almost 4 years ago
There are remaining issues with this work, uncovered by fsstress:
Fatal trap 12: page fault while in kernel mode
cpuid = 4
fault virtual address = 0x391
fault code = supervisor read, page not present
instruction pointer = 0x2b:0x506bfc
stack pointer = 0x10:0x1002c197538
frame pointer = 0x10:0x1002c1978c0
processor eflags = interrupt enabled, resume, IOPL = 0
current process = Idle
current thread = pri 12 (CRIT)
<- SMP: XXX
kernel: type 12 trap, code=0
CPU4 stopping CPUs: 0x00000000000000ef
Stopped at 0x506bfc: testb $0x20,0x391(%rdi)
bwrite() at 0x506bfc
softdep_fsync_mountdev() at 0x5f097e
buf_rb_tree_RB_SCAN() at 0x51b9aa
softdep_sync_metadata() at 0x5f0395
brelvp() at 0x51b171
vfsync() at 0x51bb9a
ffs_mountfs() at 0x5f6322
vop_fsync() at 0x52bae1
vinvalbuf() at 0x51bec4
vfs_getvfs() at 0x51ef57
mountlist_scan() at 0x51f8f6
mountlist_scan() at 0x51fb7c
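This panic appears to arise from us getting a NULL worklist buffer: the faulting instruction reads 0x391(%rdi) and the fault address is 0x391, which is consistent with %rdi being 0. This shouldn't be possible and needs to be examined further.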
panic: handle_written_inodeblock: not started
cpuid = 0
Trace beginning at frame 0x10005d039e0
panic() at 0x4b1983
panic() at 0x4b1983
softdep_change_directoryentry_offset() at 0x5f1d31
bpdone() at 0x508429
biodone() at 0x5089ac
cluster_awrite() at 0x50f59c
biodone() at 0x508990
gptinit() at 0x671b7d
register_swi() at 0x48a3bd