Bug #2436

panic: assertion "lp->lwp_qcpu == dd->cpuid" failed in dfly_acquire_curproc

Added by thomas.nikolajsen about 2 years ago. Updated over 1 year ago.

Status: New
Start date: 10/21/2012
Priority: Normal
Due date:
Assignee: -
% Done: 10%
Category: -
Target version: -

Description

On current master, changing the cpumask while using the dfly scheduler can result in a panic.
The problem occurs on both DragonFly i386 and x86_64.
The bsd4 scheduler doesn't have this problem.

E.g. on an 8-core system, running 'usched dfly:3 true' a few times during a buildkernel triggers the panic.
A core dump is available on request.

-thomas
-
Unread portion of the kernel message buffer:
panic: assertion "lp->lwp_qcpu == dd->cpuid" failed in dfly_acquire_curproc at /usr/src/sys/kern/usched_dfly.c:382
cpuid = 0
Trace beginning at frame 0xe4347c54
panic(ffffffff,0,c0396874,e4347c88,d92e1b80) at panic+0x1a8 0xc01bf150
panic(c0396874,c03af2b4,c03af386,c03af174,17e) at panic+0x1a8 0xc01bf150
dfly_acquire_curproc(daee0e00,e4347d00,10,0,0) at dfly_acquire_curproc+0x1ca 0xc01ca47b
syscall2(e4347d40) at syscall2+0x420 0xc037b5af
Xint0x80_syscall() at Xint0x80_syscall+0x36 0xc034c246
Debugger("panic")

CPU0 stopping CPUs: 0x000000fe
stopped
..
_get_mycpu () at ./machine/thread.h:79
79 __asm ("movl %%fs:globaldata,%0" : "=r" (gd) : "m"(__mycpu__dummy));
(kgdb) bt
#0 _get_mycpu () at ./machine/thread.h:79
#1 md_dumpsys (di=0xc079d820)
at /usr/src/sys/platform/pc32/i386/dump_machdep.c:266
#2 0xc01be8fe in dumpsys () at /usr/src/sys/kern/kern_shutdown.c:925
#3 0xc015938a in db_fncall (dummy1=-1070290686, dummy2=0,
dummy3=-1072326021, dummy4=0xe4347ae4 "4m4\300\037\361<\300")
at /usr/src/sys/ddb/db_command.c:539
#4 0xc015986f in db_command (aux_cmd_tablep_end=0xc03ee69c,
aux_cmd_tablep=0xc03ee698, cmd_table=<optimized out>,
last_cmdp=<optimized out>) at /usr/src/sys/ddb/db_command.c:401
#5 db_command_loop () at /usr/src/sys/ddb/db_command.c:467
#6 0xc015c3ce in db_trap (type=3, code=0) at /usr/src/sys/ddb/db_trap.c:71
#7 0xc034ac75 in kdb_trap (type=3, code=0, regs=0xe4347c04)
at /usr/src/sys/platform/pc32/i386/db_interface.c:151
#8 0xc037ae34 in trap (frame=0xe4347c04)
at /usr/src/sys/platform/pc32/i386/trap.c:850
#9 0xc034c197 in calltrap ()
at /usr/src/sys/platform/pc32/i386/exception.s:787
#10 0xc034a902 in breakpoint () at ./cpu/cpufunc.h:72
#11 Debugger (msg=0xc03ad70a "panic")
at /usr/src/sys/platform/pc32/i386/db_interface.c:333
#12 0xc01bf165 in panic (
fmt=0xc0396874 "assertion \"%s\" failed in %s at %s:%u")
at /usr/src/sys/kern/kern_shutdown.c:822
#13 0xc01ca47b in dfly_acquire_curproc (lp=0xdaee0e00)
at /usr/src/sys/kern/usched_dfly.c:382
#14 0xc037b5af in userexit (lp=<optimized out>)
at /usr/src/sys/platform/pc32/i386/trap.c:362
#15 syscall2 (frame=0xe4347d40) at /usr/src/sys/platform/pc32/i386/trap.c:1419
#16 0xc034c246 in Xint0x80_syscall ()
at /usr/src/sys/platform/pc32/i386/exception.s:878
#17 0x0000001f in ?? ()
Backtrace stopped: previous frame inner to this frame (corrupt stack?)

0001-usched_dfly-Remove-an-assert-known-to-cause-panics.patch - Remove assert line known to cause panics (739 Bytes) ftigeot, 01/23/2013 03:01 AM


Related issues

Related to Bug #2402: Showstopper panics for Release 3.2 New 08/15/2012

History

#1 Updated by ftigeot over 1 year ago

I have been bitten by the same assertion on a 32-thread dual-CPU system.

Running poudriere with more than 10 jobs is enough to panic the machine in less than one hour.

#2 Updated by ftigeot over 1 year ago

The assert line looks suspiciously like a copy/paste from a previous scheduling case.

Removing it with the attached patch allowed my previously crashing system to become stable under heavy loads.

#3 Updated by dillon over 1 year ago

Well, the assertion protects against the scheduling metrics being applied to the wrong cpu, which over time would cause cpus to be improperly weighted and potentially locked out for no reason. The assertion itself is correct; the question is why it is being hit.

So far I haven't had any luck tracking down code paths where qcpu would be wrong. The only possibility I can think of is that whichever system call the thread was making (I can't tell which from the backtrace) may have moved the thread to a different current cpu without telling the scheduler.

I need a kernel core to determine whether that is the case or not.

-Matt
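
To illustrate the point above about scheduling metrics being applied to the wrong cpu, here is a minimal toy model in C. It is not the usched_dfly code: `toy_pcpu`, `toy_lwp`, the `load` counter and every `toy_*` function are hypothetical names made up for this sketch. Only the field name `qcpu` and the shape of the check mirror the assertion text. The one assumption is that each cpu keeps a counter that goes up when an lwp is charged to it and down when the lwp is released.

```c
#include <assert.h>
#include <stdbool.h>

#define NCPU 8  /* hypothetical; matches the 8-core box in the report */

/* Toy stand-in for a per-cpu scheduler structure (not the real one). */
struct toy_pcpu {
    int cpuid;
    int load;   /* number of toy lwps currently charged to this cpu */
};

/* Toy stand-in for an lwp; only the field named in the assertion. */
struct toy_lwp {
    int qcpu;   /* cpu whose load counter this lwp is charged to */
};

static struct toy_pcpu toy_cpus[NCPU];

/* Reset all counters and give each slot its own cpuid. */
static void toy_init(void)
{
    for (int i = 0; i < NCPU; i++) {
        toy_cpus[i].cpuid = i;
        toy_cpus[i].load = 0;
    }
}

/* Charge the lwp to a cpu and remember which cpu was charged. */
static void toy_charge(struct toy_lwp *lp, int cpu)
{
    assert(cpu >= 0 && cpu < NCPU);
    lp->qcpu = cpu;
    toy_cpus[cpu].load++;
}

/* Well-behaved migration: the charge moves together with the lwp. */
static void toy_migrate(struct toy_lwp *lp, int newcpu)
{
    toy_cpus[lp->qcpu].load--;
    toy_charge(lp, newcpu);
}

/* The invariant: an lwp seen on curcpu must be charged to curcpu. */
static bool toy_invariant_holds(const struct toy_lwp *lp, int curcpu)
{
    return lp->qcpu == toy_cpus[curcpu].cpuid;
}

/*
 * Release the charge on the cpu the lwp is observed on, without
 * consulting lp->qcpu. If the lwp was moved behind the scheduler's
 * back, this credits the wrong cpu and the counters drift apart.
 */
static void toy_uncharge_on(const struct toy_lwp *lp, int curcpu)
{
    (void)lp;
    toy_cpus[curcpu].load--;
}
```

For example, after `toy_init()` and `toy_charge(&lp, 0)`, calling `toy_uncharge_on(&lp, 5)` leaves cpu 0 with a load of 1 and cpu 5 with a load of -1, and `toy_invariant_holds(&lp, 5)` returns false. That kind of drift is what the assertion is meant to catch early, which is why removing it may hide the symptom rather than fix the cause.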
