Bug #1971

panic in lwpsignal

Added by y0n3t4n1 about 3 years ago. Updated about 3 years ago.

Status: Closed
Priority: Normal
Assignee: -
Category: -
Target version: -
Start date:
Due date:
% Done: 0%
Description

Hi.

I started getting this panic recently while under load. It took me
a couple of days to get the kernel dump because the system still locks up
while dumping. The line number is slightly off because I inserted
a KKASSERT() right before the if statement to validate the array index
used in the SIGISMEMBER() macro.

The kernel image and the dump are available on leaf as
~y0netan1/crash/{kern,vmcore}.26 . The kernel is built from the source
as of e45c80940 + some local non-kernel modifications.
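
For readers unfamiliar with this kind of check, here is a minimal user-space sketch of a signal-set membership test with the index validated first. It is not the DragonFly definition of SIGISMEMBER() or KKASSERT(): the `sk_` names, the 128-signal limit, and the 32-bit word layout are all assumptions made for illustration.

```c
#include <assert.h>
#include <stdint.h>

#define SK_NSIG   128               /* hypothetical number of signals */
#define SK_WORDS  (SK_NSIG / 32)    /* 32 signals per word */

/* Hypothetical signal set: one bit per signal, signal 1 is bit 0. */
typedef struct { uint32_t bits[SK_WORDS]; } sk_sigset_t;

/* Valid signal numbers are 1..SK_NSIG; anything else would index out of range. */
static int sk_sig_valid(int sig)
{
    return sig > 0 && sig <= SK_NSIG;
}

static void sk_sigaddset(sk_sigset_t *set, int sig)
{
    assert(sk_sig_valid(sig));      /* plays the role of the added assertion */
    set->bits[(sig - 1) >> 5] |= (uint32_t)1 << ((sig - 1) & 31);
}

static int sk_sigismember(const sk_sigset_t *set, int sig)
{
    assert(sk_sig_valid(sig));      /* check the index before touching the array */
    return (int)((set->bits[(sig - 1) >> 5] >> ((sig - 1) & 31)) & 1u);
}
```

A check like this only guards the index. It does not help when the pointer to the set is itself invalid, which is what the kgdb session below shows (p->p_sigacts is 0x0 while sig=20 is in range).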

#8 0xffffffff804a6d0e in calltrap ()
at /usr/src/sys/platform/pc64/x86_64/exception.S:180
#9 0xffffffff802a7d21 in lwpsignal (p=0xffffffe05b1ff3f0, lp=0x0, sig=20)
at /usr/src/sys/kern/kern_sig.c:1048
#10 0xffffffff802a8040 in ksignal (p=0x0, sig=0)
at /usr/src/sys/kern/kern_sig.c:998
#11 0xffffffff80293230 in exit1 (rv=<value optimized out>)
at /usr/src/sys/kern/kern_exit.c:534
#12 0xffffffff80293322 in sys_exit (uap=<value optimized out>)
at /usr/src/sys/kern/kern_exit.c:121
#13 0xffffffff804ae2a2 in syscall2 (frame=0xffffffe072c5fc08)
at /usr/src/sys/platform/pc64/x86_64/trap.c:1182
#14 0xffffffff804a6f4f in Xfast_syscall ()
at /usr/src/sys/platform/pc64/x86_64/exception.S:313
#15 0x000000000000002b in ?? ()
Backtrace stopped: previous frame inner to this frame (corrupt stack?)
(kgdb) fr 9
#9 0xffffffff802a7d21 in lwpsignal (p=0xffffffe05b1ff3f0, lp=0x0, sig=20)
at /usr/src/sys/kern/kern_sig.c:1048
1048 if (SIGISMEMBER(p->p_sigignore, sig)) {
(kgdb) p p->p_sigignore
There is no member named p_sigignore.
(kgdb) p p_sigacts->ps_sigignore
No symbol "p_sigacts" in current context.
(kgdb) p p->p_sigacts->ps_sigignore
Cannot access memory at address 0xc00
(kgdb) p p->p_sigacts
$1 = (struct sigacts *) 0x0
(kgdb) p *p
$2 = {p_list = {le_next = 0xffffffe05b1f3ff0, le_prev = 0xdeadc0dedeadc0de},
p_ucred = 0xdeadc0dedeadc0de, p_fd = 0xdeadc0dedeadc0de,
p_fdtol = 0xdeadc0dedeadc0de, p_limit = 0xdeadc0dedeadc0de,
p_stats = 0xdeadc0dedeadc0de, p_mqueue_cnt = 3735929054, p_pad0 = 0x0,
p_sigacts = 0x0, p_flag = 16801792, p_stat = SZOMB, p_pad1 = "\000\000",
:
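
The repeated 0xdeadc0dedeadc0de values in the structure above, and p_mqueue_cnt = 3735929054 (which is 0xdeadc0de in decimal), are consistent with the memory having been overwritten with a fill pattern after the structure was released. A small sketch of how such values can be recognized when scanning a dump by hand or by script follows; the helper names are hypothetical and the assumption that 0xdeadc0de is the fill value is taken only from the output above.

```c
#include <assert.h>
#include <stdint.h>

/* Fill value as seen in the dump above (assumed, not taken from kernel source). */
#define SK_POISON32 0xdeadc0deU

/* True if a 32-bit field holds the fill value. */
static int sk_is_poisoned32(uint32_t v)
{
    return v == SK_POISON32;
}

/* True if both halves of a 64-bit field (e.g. a pointer) hold the fill value. */
static int sk_is_poisoned64(uint64_t v)
{
    return sk_is_poisoned32((uint32_t)v) &&
           sk_is_poisoned32((uint32_t)(v >> 32));
}
```

Applied to the dump, p_ucred and p_fd would be flagged, while p_list.le_next (0xffffffe05b1f3ff0) would not.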

History

#1 Updated by dillon about 3 years ago

:Hi.
:
:I started getting this panic recently while under load. It took me
:a couple of days to get the kernel dump because it still locks up
:while dumping. The line number is slightly off because I inserted
:a KKASSERT() right before the if statement to validate the array index
:used in SIGISMEMBER() macro.
:
:The kernel image and the dump is available on leaf as
:~y0netan1/crash/{kern,vmcore}.26 . The kernel is built from the source
:as of e45c80940 + some local non-kernel modifications.

Interesting. It looks like it is trying to signal the parent
process:

#11 0xffffffff80293230 in exit1 (rv=<value optimized out>)
at /usr/src/sys/kern/kern_exit.c:534
534 ksignal(p->p_pptr, p->p_sigparent);
(kgdb)

However, p->p_pptr is ripped out from under the ksignal() call because the
child is reparented to process 1 on another cpu. So p->p_pptr in the
dump points to process 1, but the process passed to ksignal() is the
'old' parent, which is now gone (0xdeadc0de fill in the structure).

I'm guessing this is an mplock vs proc_token issue or some other blocking
issue. Let's do this patch to start with, but it is also possible
that we might have to hold proc_token... that the exit1() code's
use of the mplock (mp_token now) is not sufficient.

fetch http://apollo.backplane.com/DFlyMisc/exit01.patch

-Matt
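
(Editorial illustration, not part of the original comment.) The race described above can be modelled in user space roughly as follows. This is not the content of exit01.patch, which is not reproduced in this thread; it is only a sketch of one common way to avoid signalling a stale parent pointer: read the pointer and take a hold on the parent under the same lock that reparenting uses, then signal, then release. All `sk_` names are hypothetical, and a pthread mutex stands in for the kernel token.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical, heavily simplified process structure. */
struct sk_proc {
    struct sk_proc *pptr;   /* parent; may change when the process is reparented */
    int refs;               /* hold count; the structure may be freed only at 0 */
    int pending_sig;        /* last signal delivered, for illustration only */
};

/* Stand-in for the lock that protects pptr and refs. */
static pthread_mutex_t sk_proc_lock = PTHREAD_MUTEX_INITIALIZER;

/* Read the parent pointer and take a hold on it in one critical section. */
static struct sk_proc *sk_hold_parent(struct sk_proc *p)
{
    pthread_mutex_lock(&sk_proc_lock);
    struct sk_proc *pp = p->pptr;
    if (pp != NULL)
        pp->refs++;
    pthread_mutex_unlock(&sk_proc_lock);
    return pp;
}

/* Drop a hold taken by sk_hold_parent(). */
static void sk_rele(struct sk_proc *p)
{
    pthread_mutex_lock(&sk_proc_lock);
    p->refs--;
    pthread_mutex_unlock(&sk_proc_lock);
}

/* Reparenting takes the same lock, so it cannot interleave with the read above. */
static void sk_reparent(struct sk_proc *p, struct sk_proc *newpp)
{
    pthread_mutex_lock(&sk_proc_lock);
    p->pptr = newpp;
    pthread_mutex_unlock(&sk_proc_lock);
}

/* Signal whichever process is the parent at the time of the locked read. */
static int sk_signal_parent(struct sk_proc *p, int sig)
{
    struct sk_proc *pp = sk_hold_parent(p);
    if (pp == NULL)
        return -1;
    pp->pending_sig = sig;  /* stands in for ksignal(pp, sig) */
    sk_rele(pp);
    return 0;
}
```

In this model the hold keeps the parent structure alive for the duration of the signal delivery even if the child is reparented immediately after the locked read; whether the real fix uses a hold, a token, or both is not stated in this thread.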

#2 Updated by y0n3t4n1 about 3 years ago

:
> I'm guessing this is a mplock vs proc_token issue or other blocking
> issue. Lets do this patch to start with, but it is also possible
> that we might have to hold proc_token... that the exit1() code's
> use of the mplock (mp_token now) is not sufficient.
>
> fetch http://apollo.backplane.com/DFlyMisc/exit01.patch
>
> -Matt

Hi,

I've applied this patch and restarted the system with the new kernel
about 13 hours ago. It's still running happily, but it probably needs
another half a day or so before I can say `it seems to be fixed.'

Thanks.

#3 Updated by y0n3t4n1 about 3 years ago

The fix was committed as f2f3db5c5f5d7b57e0475e6661e100e9136a4e7d.
