Bug #2436

open

panic: assertion "lp->lwp_qcpu == dd->cpuid" failed in dfly_acquire_curproc

Added by thomas.nikolajsen over 12 years ago. Updated almost 12 years ago.

Status: New
Priority: Normal
Assignee: -
Category: -
Target version:
Start date: 10/21/2012
Due date:
% Done: 10%
Estimated time:

Description

On current master, changing the cpumask while using the dfly scheduler can result in a panic.
The problem occurs on both DragonFly i386 and x86_64.
The bsd4 scheduler doesn't have this problem.

E.g. on an 8-core system, running 'usched dfly:3 true' a few times during a buildkernel triggers the panic.
A core dump is available on request.

thomas

Unread portion of the kernel message buffer:
panic: assertion "lp->lwp_qcpu == dd->cpuid" failed in dfly_acquire_curproc at /usr/src/sys/kern/usched_dfly.c:382
cpuid = 0
Trace beginning at frame 0xe4347c54
panic(ffffffff,0,c0396874,e4347c88,d92e1b80) at panic+0x1a8 0xc01bf150
panic(c0396874,c03af2b4,c03af386,c03af174,17e) at panic+0x1a8 0xc01bf150
dfly_acquire_curproc(daee0e00,e4347d00,10,0,0) at dfly_acquire_curproc+0x1ca 0xc01ca47b
syscall2(e4347d40) at syscall2+0x420 0xc037b5af
Xint0x80_syscall() at Xint0x80_syscall+0x36 0xc034c246
Debugger("panic")

CPU0 stopping CPUs: 0x000000fe
stopped
..
_get_mycpu () at ./machine/thread.h:79
79 __asm ("movl %%fs:globaldata,%0" : "=r" (gd) : "m"(__mycpu__dummy));
(kgdb) bt
#0 _get_mycpu () at ./machine/thread.h:79
#1 md_dumpsys (di=0xc079d820)
at /usr/src/sys/platform/pc32/i386/dump_machdep.c:266
#2 0xc01be8fe in dumpsys () at /usr/src/sys/kern/kern_shutdown.c:925
#3 0xc015938a in db_fncall (dummy1=-1070290686, dummy2=0,
dummy3=-1072326021, dummy4=0xe4347ae4 "4m4\300\037\361<\300")
at /usr/src/sys/ddb/db_command.c:539
#4 0xc015986f in db_command (aux_cmd_tablep_end=0xc03ee69c,
aux_cmd_tablep=0xc03ee698, cmd_table=<optimized out>,
last_cmdp=<optimized out>) at /usr/src/sys/ddb/db_command.c:401
#5 db_command_loop () at /usr/src/sys/ddb/db_command.c:467
#6 0xc015c3ce in db_trap (type=3, code=0) at /usr/src/sys/ddb/db_trap.c:71
#7 0xc034ac75 in kdb_trap (type=3, code=0, regs=0xe4347c04)
at /usr/src/sys/platform/pc32/i386/db_interface.c:151
#8 0xc037ae34 in trap (frame=0xe4347c04)
at /usr/src/sys/platform/pc32/i386/trap.c:850
#9 0xc034c197 in calltrap ()
at /usr/src/sys/platform/pc32/i386/exception.s:787
#10 0xc034a902 in breakpoint () at ./cpu/cpufunc.h:72
#11 Debugger (msg=0xc03ad70a "panic")
at /usr/src/sys/platform/pc32/i386/db_interface.c:333
#12 0xc01bf165 in panic (
fmt=0xc0396874 "assertion \"%s\" failed in %s at %s:%u")
at /usr/src/sys/kern/kern_shutdown.c:822
#13 0xc01ca47b in dfly_acquire_curproc (lp=0xdaee0e00)
at /usr/src/sys/kern/usched_dfly.c:382
#14 0xc037b5af in userexit (lp=<optimized out>)
at /usr/src/sys/platform/pc32/i386/trap.c:362
#15 syscall2 (frame=0xe4347d40) at /usr/src/sys/platform/pc32/i386/trap.c:1419
#16 0xc034c246 in Xint0x80_syscall ()
at /usr/src/sys/platform/pc32/i386/exception.s:878
#17 0x0000001f in ?? ()
Backtrace stopped: previous frame inner to this frame (corrupt stack?)


Files

0001-usched_dfly-Remove-an-assert-known-to-cause-panics.patch (739 Bytes): Remove assert line known to cause panics (ftigeot, 01/23/2013 03:01 AM)

Related issues 1 (0 open, 1 closed)

Related to Bug #2402: Showstopper panics for Release 3.2 (Closed, tuxillo, 08/15/2012)

#1

Updated by ftigeot about 12 years ago

I have been bitten by the same assertion on a 32-thread dual-CPU system.

Running poudriere with more than 10 jobs is enough to panic the machine in less than one hour.

#2

Updated by ftigeot almost 12 years ago

The assert line looks suspiciously like a copy/paste from a previous scheduling case.

Removing it with the attached patch allowed my previously crashing system to become stable under heavy loads.

#3

Updated by dillon almost 12 years ago

Well, the assertion protects against the scheduling metrics being applied to the wrong cpu, which over time would cause cpus to be improperly weighted and potentially locked out for no reason. The assertion itself is correct; the question is why it is being hit.

So far I haven't had any luck tracking down code paths where qcpu would be wrong. The only possibility I can think of is that the system call the thread was making (I can't tell which one from the backtrace) moved the thread to a different current cpu without telling the scheduler.

I need a kernel core to determine whether that is the case or not.

-Matt
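
Editorial illustration (not part of the original thread): the hypothesis above can be sketched as a toy user-space model in C. Everything here is an assumption made for illustration. The struct and function names (`toy_lwp`, `toy_pcpu`, `toy_migrate_tracked`, `toy_release`) are hypothetical, and only the field names `lwp_qcpu` and `cpuid` are taken from the assertion text. It is not the code in usched_dfly.c. It only shows why a thread that changes cpu without the scheduler being told would trip a check of the form `lp->lwp_qcpu == dd->cpuid`, and how the load metrics would drift if the check were simply removed.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Toy model only. These types are hypothetical stand-ins; the real
 * kernel structures are far larger and are not reproduced here.
 */
struct toy_lwp {
    int lwp_qcpu;   /* cpu whose scheduler accounting includes this thread */
};

struct toy_pcpu {
    int cpuid;      /* cpu this per-cpu scheduler data belongs to */
    int load;       /* simplified per-cpu load metric */
};

/* The condition the panic message reports as failed. */
bool toy_invariant_holds(const struct toy_lwp *lp, const struct toy_pcpu *dd)
{
    return lp->lwp_qcpu == dd->cpuid;
}

/* Scheduler-aware migration: the accounting follows the thread. */
void toy_migrate_tracked(struct toy_lwp *lp, struct toy_pcpu *from,
                         struct toy_pcpu *to)
{
    assert(toy_invariant_holds(lp, from));
    from->load--;
    to->load++;
    lp->lwp_qcpu = to->cpuid;
}

/*
 * Remove the thread's load from dd, the data of the cpu it is running on.
 * Returns whether the invariant held. When it did not, the decrement has
 * been applied to a cpu that never counted this thread.
 */
bool toy_release(const struct toy_lwp *lp, struct toy_pcpu *dd)
{
    bool ok = toy_invariant_holds(lp, dd);
    dd->load--;
    return ok;
}
```

In this model, a thread accounted on cpu 0 that ends up running on cpu 3 without a call to `toy_migrate_tracked` makes `toy_release` return false: cpu 3's load goes negative while cpu 0 keeps a load it no longer has. That is the kind of improper weighting the assertion is described as guarding against.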
