Bug #1996

panic: assertion: p->p_lock == 0 in kern_wait

Added by y0n3t4n1 over 3 years ago. Updated over 3 years ago.

Status:NewStart date:
Priority:NormalDue date:
Assignee:-% Done:

0%

Category:-
Target version:-

Description

Hi.
Apparently a few hours of pbulk test can trigger this panic (this is
on x86_64 and the kernel is built from source as of 5347900e6).
As opposed to what KKASSERT claims, p->p_lock doesn't hold the non-zero
value in the frame 5:

#4 0xffffffff802ad259 in panic (fmt=0xffffffff804f212a "assertion: %s in %s")
at /usr/src/sys/kern/kern_shutdown.c:799
#5 0xffffffff802989a1 in kern_wait (pid=<value optimized out>,
status=0xffffffe05e997a74, options=1528637672, rusage=0x0,
res=0xffffffe05e997b58) at /usr/src/sys/kern/kern_exit.c:901
#6 0xffffffff80298c4e in sys_wait4 (uap=0xffffffe05e997b58)
at /usr/src/sys/kern/kern_exit.c:754
#7 0xffffffff804b580c in syscall2 (frame=0xffffffe05e997c08)
at /usr/src/sys/platform/pc64/x86_64/trap.c:1182
#8 0xffffffff804ae53f in Xfast_syscall ()
at /usr/src/sys/platform/pc64/x86_64/exception.S:318
#9 0x000000000000002b in ?? ()
Backtrace stopped: previous frame inner to this frame (corrupt stack?)
(kgdb) fr 5
#5 0xffffffff802989a1 in kern_wait (pid=<value optimized out>,
status=0xffffffe05e997a74, options=1528637672, rusage=0x0,
res=0xffffffe05e997b58) at /usr/src/sys/kern/kern_exit.c:901
901 KKASSERT(p->p_lock == 0);
(kgdb) p p->p_lock
$1 = 0

I'm not sure if this is a problem in kgdb or a result of optimization,
or it's an MP race.

0001-Remove-double-semi-colon.patch Magnifier (892 Bytes) y0n3t4n1, 04/25/2011 12:15 PM

0002-kernel-Replace-LW-P-HOLD-RELE-to-use-refcount-APIs.patch Magnifier (5.79 KB) y0n3t4n1, 04/25/2011 12:15 PM

History

#1 Updated by y0n3t4n1 over 3 years ago

:
> (kgdb) fr 5
> #5 0xffffffff802989a1 in kern_wait (pid=<value optimized out>,
> status=0xffffffe05e997a74, options=1528637672, rusage=0x0,
> res=0xffffffe05e997b58) at /usr/src/sys/kern/kern_exit.c:901
> 901 KKASSERT(p->p_lock == 0);
> (kgdb) p p->p_lock
> $1 = 0

After poking here and the, I think this KKASSERT() can simply go away
as proc_remove_zombie() will wait for p->p_lock to drop to zero anyway.

/*
* Unlink the proc from its process group so that
* the following operations won't lead to an
* inconsistent state for processes running down
* the zombie list.
*/
KKASSERT(p->p_lock == 0);
proc_remove_zombie(p);
leavepgrp(p);

#2 Updated by y0n3t4n1 over 3 years ago

On Thu, Feb 24, 2011 at 09:29:16AM +0900, YONETANI Tomokazu wrote:
> On Mon, Feb 21, 2011 at 01:41:26PM +0900, YONETANI Tomokazu wrote:
> > Apparently a few hours of pbulk test can trigger this panic (this is
> > on x86_64 and the kernel is built from source as of 5347900e6).
> > As opposed to what KKASSERT claims, p->p_lock doesn't hold the non-zero
> > value in the frame 5:
> >
> :
> > (kgdb) fr 5
> > #5 0xffffffff802989a1 in kern_wait (pid=<value optimized out>,
> > status=0xffffffe05e997a74, options=1528637672, rusage=0x0,
> > res=0xffffffe05e997b58) at /usr/src/sys/kern/kern_exit.c:901
> > 901 KKASSERT(p->p_lock == 0);
> > (kgdb) p p->p_lock
> > $1 = 0
>
> After poking here and the, I think this KKASSERT() can simply go away
> as proc_remove_zombie() will wait for p->p_lock to drop to zero anyway.
>
> /*
> * Unlink the proc from its process group so that
> * the following operations won't lead to an
> * inconsistent state for processes running down
> * the zombie list.
> */
> KKASSERT(p->p_lock == 0);
> proc_remove_zombie(p);
> leavepgrp(p);

The following is what I have in my tree. What it does is to hold proc_token
while waiting for p->p_lock to drop, just as done in proc_remove_zombie().
If I don't hold the proc_token as in the first chunk, I see the

proc_remove_zombie: waiting for %p->p_lock to drop

message a few times every hour on the console. I guess it may also
imply that the race is between a code which holds proc_token for
a long time but not p->p_token, like all*_scan().

diff --git a/sys/kern/kern_exit.c b/sys/kern/kern_exit.c
index 1e5a110..30fc0af 100644
--- a/sys/kern/kern_exit.c
+++ b/sys/kern/kern_exit.c
@@ -867,8 +867,10 @@ loop:
* put a hold on the process for short periods of
* time.
*/
+ lwkt_gettoken(&proc_token);
while (p->p_lock)
tsleep(p, 0, "reap3", hz);
+ lwkt_reltoken(&proc_token);

/* Take care of our return values. */
*res = p->p_pid;
@@ -898,7 +900,6 @@ loop:
* inconsistent state for processes running down
* the zombie list.
*/
- KKASSERT(p->p_lock == 0);
proc_remove_zombie(p);
leavepgrp(p);

diff --git a/sys/kern/kern_proc.c b/sys/kern/kern_proc.c
index cb067f0..048a619 100644
--- a/sys/kern/kern_proc.c
+++ b/sys/kern/kern_proc.c
@@ -661,6 +661,7 @@ proc_remove_zombie(struct proc *p)
{
lwkt_gettoken(&proc_token);
while (p->p_lock) {
+ kprintf("%s: waiting for %p->p_lock to drop\n", __func__, p);
tsleep(p, 0, "reap1", hz / 10);
}
LIST_REMOVE(p, p_list); /* off zombproc */
--
1.7.3.2

#3 Updated by dillon over 3 years ago

:> After poking here and the, I think this KKASSERT() can simply go away
:> as proc_remove_zombie() will wait for p->p_lock to drop to zero anyway.
:>
:...
:The following is what I have in my tree. What it does is to hold proc_token
:while waiting for p->p_lock to drop, just as done in proc_remove_zombie().
:If I don't hold the proc_token as in the first chunk, I see the
:
: proc_remove_zombie: waiting for %p->p_lock to drop
:
:message a few times every hour on the console. I guess it may also
:imply that the race is between a code which holds proc_token for
:a long time but not p->p_token, like all*_scan().

It looks good, I would make two changes. One would be to shortcut
the case where p->p_lock is already 0 to avoid unnecessary contention
with proc_token in the critical exit path.

if (p->p_lock) {
lwkt_gettoken(&proc_token);
while (p->p_lock)
tsleep(p, 0, "reap3", hz);
lwkt_reltoken(&proc_token);
}

:...
:@@ -661,6 +661,7 @@ proc_remove_zombie(struct proc *p)
: {
: lwkt_gettoken(&proc_token);
: while (p->p_lock) {
:+ kprintf("%s: waiting for %p->p_lock to drop\n", __func__, p);
: tsleep(p, 0, "reap1", hz / 10);
: }
: LIST_REMOVE(p, p_list); /* off zombproc */
:--
:1.7.3.2

This one may unnecessarily spam the kprintf since the wait is 1/10
of a second. Maybe conditionalize it with a variable so it only issues
one kprintf().

--

With regards to getting rid of the timeout in the tsleep and using a
proactive wakeup(), we have to avoid calling wakeup() for 1->0
transitions unless someone is known to be waiting on p_lock. This
can be implementing by adding a WAITING flag to the field and using
atomic_cmpset_int() to handle the (WAITING | 1) -> (0) transition and
then calling wakeup() if WAITING was set.

I will augment the sys/refcount.h API and add refcount_wait() and
refcount_release_wakeup() which encapsulate the appropriate atomic
ops. I will leave it up to you if you want to then use the new API
functions for PHOLD/PRELE, which would give the tsleep case a
proactive wakeup() instead of having to wait for it to timeout.

-Matt
Matthew Dillon
<>

#4 Updated by y0n3t4n1 over 3 years ago

On Thu, Apr 21, 2011 at 11:36:10AM -0700, Matthew Dillon wrote:
> :> After poking here and the, I think this KKASSERT() can simply go away
> :> as proc_remove_zombie() will wait for p->p_lock to drop to zero anyway.
> :>
> :...
> :The following is what I have in my tree. What it does is to hold proc_token
> :while waiting for p->p_lock to drop, just as done in proc_remove_zombie().
> :If I don't hold the proc_token as in the first chunk, I see the
> :
> : proc_remove_zombie: waiting for %p->p_lock to drop
> :
> :message a few times every hour on the console. I guess it may also
> :imply that the race is between a code which holds proc_token for
> :a long time but not p->p_token, like all*_scan().
>
> It looks good, I would make two changes. One would be to shortcut
> the case where p->p_lock is already 0 to avoid unnecessary contention
> with proc_token in the critical exit path.
>
> if (p->p_lock) {
> lwkt_gettoken(&proc_token);
> while (p->p_lock)
> tsleep(p, 0, "reap3", hz);
> lwkt_reltoken(&proc_token);
> }

The problem here is that p->p_lock can become non-zero shortly after
being observed as 0 on this reap3 loop (which originally caused the panic
on KKASSERT), so I think this shortcut logic almost defeats that purpose.
However, even without this shortcut logic, the added kprintf() still very
occasionally shows the diagnostic message.

> :...
> :@@ -661,6 +661,7 @@ proc_remove_zombie(struct proc *p)
> : {
> : lwkt_gettoken(&proc_token);
> : while (p->p_lock) {
> :+ kprintf("%s: waiting for %p->p_lock to drop\n", __func__, p);
> : tsleep(p, 0, "reap1", hz / 10);
> : }
> : LIST_REMOVE(p, p_list); /* off zombproc */
> :--
> :1.7.3.2
>
> This one may unnecessarily spam the kprintf since the wait is 1/10
> of a second. Maybe conditionalize it with a variable so it only issues
> one kprintf().

I'll put it under bootverbose.

> With regards to getting rid of the timeout in the tsleep and using a
> proactive wakeup(), we have to avoid calling wakeup() for 1->0
> transitions unless someone is known to be waiting on p_lock. This
> can be implementing by adding a WAITING flag to the field and using
> atomic_cmpset_int() to handle the (WAITING | 1) -> (0) transition and
> then calling wakeup() if WAITING was set.
>
> I will augment the sys/refcount.h API and add refcount_wait() and
> refcount_release_wakeup() which encapsulate the appropriate atomic
> ops. I will leave it up to you if you want to then use the new API
> functions for PHOLD/PRELE, which would give the tsleep case a
> proactive wakeup() instead of having to wait for it to timeout.

So what I need to do is to change PHOLD/PRELE to use refcount_acquire/
refcount_release_wakeup and replace p->p_lock loop with
refcount_release_wakeup? I'll give it a try.

Best Regards,
YONETANI Tomokazu.

#5 Updated by y0n3t4n1 over 3 years ago

On Sun, Apr 24, 2011 at 11:36:27AM +0900, YONETANI Tomokazu wrote:
> > With regards to getting rid of the timeout in the tsleep and using a
> > proactive wakeup(), we have to avoid calling wakeup() for 1->0
> > transitions unless someone is known to be waiting on p_lock. This
> > can be implementing by adding a WAITING flag to the field and using
> > atomic_cmpset_int() to handle the (WAITING | 1) -> (0) transition and
> > then calling wakeup() if WAITING was set.
> >
> > I will augment the sys/refcount.h API and add refcount_wait() and
> > refcount_release_wakeup() which encapsulate the appropriate atomic
> > ops. I will leave it up to you if you want to then use the new API
> > functions for PHOLD/PRELE, which would give the tsleep case a
> > proactive wakeup() instead of having to wait for it to timeout.
>
> So what I need to do is to change PHOLD/PRELE to use refcount_acquire/
> refcount_release_wakeup and replace p->p_lock loop with
> refcount_release_wakeup? I'll give it a try.

I've been running the kernel with patch(es) attached to this message
and so far it's running fine under load. It reduced the number of
non-zero p->p_lock just before calling proc_remove_zombie() even without
holding proc_token around the first wait loop. However I've observed
a panic when a signal is sent to a process group, and I need to determine
if it's jail related or GNU screen related.

Best Regards,
YONETANI Tomokazu.

#6 Updated by y0n3t4n1 over 3 years ago

On Mon, Apr 25, 2011 at 09:13:04PM +0900, YONETANI Tomokazu wrote:
> On Sun, Apr 24, 2011 at 11:36:27AM +0900, YONETANI Tomokazu wrote:
> > > With regards to getting rid of the timeout in the tsleep and using a
> > > proactive wakeup(), we have to avoid calling wakeup() for 1->0
> > > transitions unless someone is known to be waiting on p_lock. This
> > > can be implementing by adding a WAITING flag to the field and using
> > > atomic_cmpset_int() to handle the (WAITING | 1) -> (0) transition and
> > > then calling wakeup() if WAITING was set.
> > >
> > > I will augment the sys/refcount.h API and add refcount_wait() and
> > > refcount_release_wakeup() which encapsulate the appropriate atomic
> > > ops. I will leave it up to you if you want to then use the new API
> > > functions for PHOLD/PRELE, which would give the tsleep case a
> > > proactive wakeup() instead of having to wait for it to timeout.
> >
> > So what I need to do is to change PHOLD/PRELE to use refcount_acquire/
> > refcount_release_wakeup and replace p->p_lock loop with
> > refcount_release_wakeup? I'll give it a try.
>
> I've been running the kernel with patch(es) attached to this message
> and so far it's running fine under load. It reduced the number of
> non-zero p->p_lock just before calling proc_remove_zombie() even without
> holding proc_token around the first wait loop.

I added a small code to PHOLD/PRELE to leave the last p->p_lock holder in
p->p_pad0 (well, far from perfect but better than nothing) and found that
it's always sysctl_kern_proc() who calls PHOLD() at a bad timing.
I guessed that's probably because it walks through zombproc and PHOLD()'s
on the processes, some of which are just about to be reaped. So I added
the following code to skip such processes; the relavant part in kern_wait()
waits for processes whose p->p_nthreads > 0, so I thought it should be fine,
no?
I think I need to wait for a few more days before pbulk can spot other
possible bad callers of PHOLD().

Best Regards,
YONETANI Tomokazu.

diff --git a/sys/kern/kern_proc.c b/sys/kern/kern_proc.c
index 6d760e2..942ce6b 100644
--- a/sys/kern/kern_proc.c
+++ b/sys/kern/kern_proc.c
@@ -945,6 +945,11 @@ sysctl_kern_proc(SYSCTL_HANDLER_ARGS)

if (!PRISON_CHECK(cr1, p->p_ucred))
continue;
+
+ /* don't touch processes about to be reaped */
+ if (p->p_nthreads == 0)
+ continue;
+
PHOLD(p);
error = sysctl_out_proc(p, req, flags);
PRELE(p);

#7 Updated by vsrinivas over 3 years ago

Hi,

I just saw a patch, 49aa3df0ca3e226c0a0d7097863a2426ee6fd534, go in to fix this
issue; it adds:

+
+ /*
+ * Temporary refs may still have been acquired while
+ * we removed the process, make sure they are all
+ * gone before kfree()ing. Now that the process has
+ * been removed from all lists and all references to
+ * it have gone away, no new refs can occur.
+ */
+ while (p->p_lock)
+ tsleep(p, 0, "reap4", hz);
kfree(p, M_PROC);

First, is anything required to ensure that p->p_lock is really loaded each loop
iteration? Is the compiler allowed to optimize away the load after the first loop?

Second, I don't understand how this is safe; the problem here is that another
code path obtained a reference to this process and was using it when the kfree()
happened. What prevents this?

A B
...
vm_waitproc(p)

while(p->p_lock)
tsleep(...)
/* get reference to process */
PHOLD(p)
kfree(p)
/* HEY! */

Thanks,
-- vs

#8 Updated by y0n3t4n1 over 3 years ago

On Mon, Jun 06, 2011 at 05:33:02AM +0000, Venkatesh Srinivas (via DragonFly issue tracker) wrote:
>
> Venkatesh Srinivas <> added the comment:
>
> Hi,
>
> I just saw a patch, 49aa3df0ca3e226c0a0d7097863a2426ee6fd534, go in to fix this
> issue; it adds:

Actually the real fix was 4a7e6f553, which removed the KKASSERT()
right before calling proc_remove_zombie(). I forgot mentioning
the issue1996 in the commit log, though.

>
>
> +
> + /*
> + * Temporary refs may still have been acquired while
> + * we removed the process, make sure they are all
> + * gone before kfree()ing. Now that the process has
> + * been removed from all lists and all references to
> + * it have gone away, no new refs can occur.
> + */
> + while (p->p_lock)
> + tsleep(p, 0, "reap4", hz);
> kfree(p, M_PROC);
>
> First, is anything required to ensure that p->p_lock is really loaded each loop
> iteration? Is the compiler allowed to optimize away the load after the first loop?

This is not an answer to your question, but there are similar wait loops
all over the kernel without `volatile' keywords on the pointer or the
struct member, so if such an optimization is actually allowed, the kernel
breaks in many ways.

> Second, I don't understand how this is safe; the problem here is that another
> code path obtained a reference to this process and was using it when the kfree()
> happened. What prevents this?

But for such a thing to happen, B needs to obtain the address of p before
it's completely removed from the lists held in other processes, lwp's,
or td's, and wait until A reaches kfree() below. By the time A reaches
here, the process is (supposed to be) removed from all the lists, so B
can't find it right before calling PHOLD(). It may be possible if B
is using a cached pointer somewhere, but I haven't found such a code (yet),
at least not without holding tokens.

> A B
> ...
> vm_waitproc(p)
>
> while(p->p_lock)
> tsleep(...)
> /* get reference to process */
> PHOLD(p)
> kfree(p)
> /* HEY! */
>
> Thanks,
> -- vs
>
> _____________________________________________________
> DragonFly issue tracker <>
> <http://bugs.dragonflybsd.org/issue1996>
> _____________________________________________________

#9 Updated by dillon over 3 years ago

:> + while (p->p_lock)
:> + tsleep(p, 0, "reap4", hz);
:> kfree(p, M_PROC);
:>

The reason it is safe is that tsleep() is a global procedure call
which GCC is unable to optimize. Because of this GCC does not
know whether any non-local variables will be modified or not, so
it can't make that optimization. In this case p->p_lock is non-local,
so GCC will reload it across any procedure call.

GCC will still optimize any local variable use across the procedure
call as long as the address of the local variable (e.g. &i) isn't
specified anywhere.

The rules apply to the automatic registerization of locals too, though
GCC isn't above temporarily placing such variables into registers. This
is why you sometimes see things like: 'fubar(&tvp); vp = tvp;',
because the programmer wants GCC to be able to optimize 'vp' into a
register.

If tsleep() were declared in the same source file (whether static or
not) then the assumption would not work and a cpu_ccfence() would
be needed to force the compiler to do the right thing.

I use cpu_ccfence() in numerous situations where reloads have to be
forced across inlined/static function calls.

-Matt

Also available in: Atom PDF