panic: assertion "lp->lwp_qcpu == dd->cpuid" failed in dfly_acquire_curproc
On current master, changing the cpumask while using the dfly scheduler can result in a panic.
The problem occurs on both DragonFly i386 and x86_64.
Scheduler bsd4 doesn't have this problem.
E.g., on an 8-core system, running 'usched dfly:3 true' a few times while doing a buildkernel triggers the panic.
A core dump is available on request.
Unread portion of the kernel message buffer:
panic: assertion "lp->lwp_qcpu == dd->cpuid" failed in dfly_acquire_curproc at /usr/src/sys/kern/usched_dfly.c:382
cpuid = 0
Trace beginning at frame 0xe4347c54
panic(ffffffff,0,c0396874,e4347c88,d92e1b80) at panic+0x1a8 0xc01bf150
panic(c0396874,c03af2b4,c03af386,c03af174,17e) at panic+0x1a8 0xc01bf150
dfly_acquire_curproc(daee0e00,e4347d00,10,0,0) at dfly_acquire_curproc+0x1ca 0xc01ca47b
syscall2(e4347d40) at syscall2+0x420 0xc037b5af
Xint0x80_syscall() at Xint0x80_syscall+0x36 0xc034c246
CPU0 stopping CPUs: 0x000000fe
_get_mycpu () at ./machine/thread.h:79
79 __asm ("movl %%fs:globaldata,%0" : "=r" (gd) : "m"(__mycpu__dummy));
#0 _get_mycpu () at ./machine/thread.h:79
#1 md_dumpsys (di=0xc079d820)
#2 0xc01be8fe in dumpsys () at /usr/src/sys/kern/kern_shutdown.c:925
#3 0xc015938a in db_fncall (dummy1=-1070290686, dummy2=0,
dummy3=-1072326021, dummy4=0xe4347ae4 "4m4\300\037\361<\300")
#4 0xc015986f in db_command (aux_cmd_tablep_end=0xc03ee69c,
aux_cmd_tablep=0xc03ee698, cmd_table=<optimized out>,
last_cmdp=<optimized out>) at /usr/src/sys/ddb/db_command.c:401
#5 db_command_loop () at /usr/src/sys/ddb/db_command.c:467
#6 0xc015c3ce in db_trap (type=3, code=0) at /usr/src/sys/ddb/db_trap.c:71
#7 0xc034ac75 in kdb_trap (type=3, code=0, regs=0xe4347c04)
#8 0xc037ae34 in trap (frame=0xe4347c04)
#9 0xc034c197 in calltrap ()
#10 0xc034a902 in breakpoint () at ./cpu/cpufunc.h:72
#11 Debugger (msg=0xc03ad70a "panic")
#12 0xc01bf165 in panic (
fmt=0xc0396874 "assertion \"%s\" failed in %s at %s:%u")
#13 0xc01ca47b in dfly_acquire_curproc (lp=0xdaee0e00)
#14 0xc037b5af in userexit (lp=<optimized out>)
#15 syscall2 (frame=0xe4347d40) at /usr/src/sys/platform/pc32/i386/trap.c:1419
#16 0xc034c246 in Xint0x80_syscall ()
#17 0x0000001f in ?? ()
Backtrace stopped: previous frame inner to this frame (corrupt stack?)
I have been bitten by the same assertion on a 32-thread dual-CPU system.
Running poudriere with more than 10 jobs is enough to panic the machine in less than one hour.
- File 0001-usched_dfly-Remove-an-assert-known-to-cause-panics.patch added
- % Done changed from 0 to 10
The assert line looks suspiciously like a copy/paste from a previous scheduling case.
Removing it with the attached patch allowed my previously crashing system to become stable under heavy loads.
Well, the assertion protects against the scheduling metrics being applied to the wrong cpu, which over time would cause cpus to be improperly weighted and potentially locked out for no reason. The assertion itself is correct; the question is why it is being hit.
So far I haven't had any luck tracking down code paths where qcpu would be wrong. The only possibility I can think of is that the system call the thread was making (I can't tell which one from the backtrace) may have moved the thread to a different current cpu without telling the scheduler.
I need a kernel core to determine whether that is the case or not.
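To make the invariant concrete, here is a minimal user-space sketch of what the assertion guards against. It is not the usched_dfly.c code: the names model_pcpu, model_lwp, uload, weight and all model_* functions are hypothetical, and the real scheduler's accounting is considerably more involved. It only models the suspected case, an lwp that ends up on a different cpu while its qcpu still points at the old one.

```c
#include <assert.h>

#define MODEL_NCPU 4

/* Hypothetical stand-in for the per-cpu scheduler data ("dd" in the assertion). */
struct model_pcpu {
    int cpuid;  /* index of this cpu */
    int uload;  /* load currently attributed to this cpu */
};

/* Hypothetical stand-in for an lwp; only the fields needed for the model. */
struct model_lwp {
    int qcpu;    /* cpu the scheduler believes this lwp is accounted on */
    int weight;  /* load this lwp contributes */
};

struct model_pcpu model_cpus[MODEL_NCPU];

/* Reset the model: every cpu knows its own id and carries no load. */
void model_reset(void)
{
    for (int i = 0; i < MODEL_NCPU; i++) {
        model_cpus[i].cpuid = i;
        model_cpus[i].uload = 0;
    }
}

/* Account lp's load on the cpu recorded in lp->qcpu. */
void model_charge(struct model_lwp *lp)
{
    assert(lp->qcpu >= 0 && lp->qcpu < MODEL_NCPU);
    model_cpus[lp->qcpu].uload += lp->weight;
}

/*
 * Remove lp's load from the cpu it is actually running on (dd).
 * With check != 0 a qcpu/cpuid mismatch is reported instead of applied;
 * this is the point where the real kernel would panic.
 * Returns 0 on success, -1 on mismatch.
 */
int model_uncharge(struct model_lwp *lp, struct model_pcpu *dd, int check)
{
    if (check && lp->qcpu != dd->cpuid)
        return -1;
    dd->uload -= lp->weight;
    return 0;
}

/*
 * Scenario: lp is charged on cpu 3, then ends up running on cpu 0
 * without lp->qcpu being updated. Returns the result of the uncharge
 * attempted on cpu 0.
 */
int model_stale_qcpu(int check)
{
    struct model_lwp lp = { .qcpu = 3, .weight = 10 };

    model_reset();
    model_charge(&lp);
    return model_uncharge(&lp, &model_cpus[0], check);
}
```

In this model, model_stale_qcpu(1) reports the mismatch and leaves the counters untouched, while model_stale_qcpu(0) silently leaves cpu 0 at -10 and cpu 3 at +10. That permanent skew is the kind of improper weighting that simply removing the assertion would allow to accumulate.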