Bug #1137

Process stuck with an empty STATE

Added by qhwt+dfly about 6 years ago. Updated over 5 years ago.

Status: Closed
Start date:
Priority: Normal
Due date:
Assignee: -
% Done: 0%
Category: -
Target version: -
Description

Hello.
I very occasionally (only twice, as far as I recall) see this in `top'
output on 2.0-RELEASE or -DEVELOPMENT (the following is from my mail server
running 2.0-RELEASE):

load averages: 0.00, 0.00, 0.00 up 13+15:09:03 11:26:24
79 processes: 79 running
CPU states: % user, % nice, % system, % interrupt, % idle
Mem: 83M Active, 241M Inact, 130M Wired, 26M Cache, 59M Buf, 1116K Free
Swap: 2048M Total, 340K Used, 2048M Free

PID USERNAME PRI NICE SIZE RES STATE TIME WCPU CPU COMMAND
44758 qhwt 230 0 0K 4K 0:06 182.89% 77.44% git

`kill' and other tools can't locate this process, and I can't find it under
the /proc filesystem either. Unfortunately I forgot to plug in the keyboard
when it booted, so I probably can't drop into DDB. There are no `cache_lock'
diagnostic messages on the console.
The `git' command was probably running on a HAMMER filesystem, but it's been
regularly pruned and reblocked, and I think it has enough room:

Filesystem Size Used Avail Capacity Mounted on
/dev/ad0s1a 496M 364M 93M 80% /
/dev/ad0s1e 7.9G 5.0G 2.3G 69% /var
/dev/ad0s1f 7.9G 1.2G 6.0G 17% /u
HAMMER 100G 23G 77G 23% /HAMMER
/dev/ad0s1h 22G 11G 9.5G 53% /backup
procfs 4.0K 4.0K 0B 100% /proc
/HAMMER/@@-1:1 100G 23G 77G 23% /home
/HAMMER/@@-1:2 100G 23G 77G 23% /home/source
/HAMMER/@@-1:4 100G 23G 77G 23% /usr/obj

It's not limited to the `git' command; I've seen a similar stuck process
(cc1) during buildworld loops on another machine running -DEVELOPMENT.
The kernel and world were built on August 24, so they're not the
latest. Does anyone know if it's already been resolved?

Thanks.

zombie.txt.bz2 (1.02 KB) qhwt+dfly, 01/25/2009 09:07 AM

History

#1 Updated by dillon about 6 years ago

:It's not limited to `git' command; I've seen a similar stuck process
:(cc1) during buildworld loops on another machine running -DEVELOPMENT.
:The kernel and the world has been built on August 24, so it's not the
:latest. Does anyone know if it's already been resolved?
:
:Thanks.

New one to me. I've never seen this.

I think your best bet is to kgdb the live kernel and see if it shows
up on 'info threads'. If you can get a process or thread address out
of ps it should be possible to track down the actual state of the
process.

-Matt
Matthew Dillon

#2 Updated by qhwt+dfly about 6 years ago

No luck with kgdb; the psx macro doesn't show process 44758.
`info threads' doesn't show any relevant threads, either.

$ ps -o pid,tid,tt,stat,start,command -x |grep -e 44758
44758 0 p3- DEL Fri08AM (git)
56964 0 p8 RL+ 12:37PM grep -e 44758

#3 Updated by corecode over 5 years ago

does this still happen?

#4 Updated by qhwt+dfly over 5 years ago

Yes, it's happening now. My mail server is running a UP kernel built from
941f5de0 (around Oct. 10), but I can't find a relevant fix on the
DragonFly_RELEASE_2_0 branch. I have another box running -DEVELOPMENT,
but its uptime is probably not long enough to reproduce it.

$ top -n -U qhwt 4
load averages: 0.00, 0.00, 0.00 up 100+00:41:22 11:01:26
78 processes: 78 running

Mem: 116M Active, 210M Inact, 138M Wired, 16M Cache, 59M Buf, 1220K Free
Swap: 2048M Total, 15M Used, 2033M Free

PID USERNAME PRI NICE SIZE RES STATE TIME WCPU CPU COMMAND
63039 qhwt 179 0 2084K 1136K RUN 0:00 1161.00% 56.69% top
18535 qhwt 159 0 0K 4K 0:00 0.00% 3.91% git
18527 qhwt 159 0 0K 4K 0:00 0.00% 3.91% git
18517 qhwt 157 0 0K 4K 0:00 0.00% 1.95% git

I played with kgdb a bit and found that these 3 processes are on zombproc
but with p_stat==SACTIVE and p_lock==1. I have no idea how it happened.
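
The inconsistency described above (an entry sitting on the zombie list while
its state is not SZOMB) can be expressed as a simple list check. The following
is only a toy sketch in user-space C, not kernel code: `struct toy_proc`, the
`TOY_*` state values and `count_inconsistent_zombies()` are hypothetical names
made up for illustration, and the real kernel structures differ.

```c
#include <stddef.h>

/* Toy stand-ins for kernel process states; the numeric values are illustrative. */
enum toy_stat { TOY_SIDL = 1, TOY_SACTIVE, TOY_SSTOP, TOY_SZOMB };

/* Hypothetical, heavily reduced process record. */
struct toy_proc {
    int pid;
    enum toy_stat p_stat;   /* scheduling state */
    int p_lock;             /* hold count */
    struct toy_proc *next;  /* next entry on the toy zombie list */
};

/* Count entries on the toy zombie list whose state is not TOY_SZOMB. */
static int count_inconsistent_zombies(const struct toy_proc *head)
{
    int bad = 0;

    for (const struct toy_proc *p = head; p != NULL; p = p->next) {
        if (p->p_stat != TOY_SZOMB)
            bad++;
    }
    return bad;
}
```

A list holding the three stuck entries reported here (state SACTIVE, hold
count 1) would make this check return 3; a healthy zombie list returns 0.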

#5 Updated by dillon over 5 years ago

:941f5de0 (around Oct. 10), but I can't find a relevant fix on
:DragonFly_RELEASE_2_0 branch. I have another box running -DEVELOPMENT,
:but its uptime is probably not long enough to reproduce it.
:
:$ top -n -U qhwt 4
:load averages: 0.00, 0.00, 0.00 up 100+00:41:22 11:01:26
:78 processes: 78 running
:
:Mem: 116M Active, 210M Inact, 138M Wired, 16M Cache, 59M Buf, 1220K Free
:Swap: 2048M Total, 15M Used, 2033M Free
:
:
: PID USERNAME PRI NICE SIZE RES STATE TIME WCPU CPU COMMAND
:63039 qhwt 179 0 2084K 1136K RUN 0:00 1161.00% 56.69% top
:18535 qhwt 159 0 0K 4K 0:00 0.00% 3.91% git
:18527 qhwt 159 0 0K 4K 0:00 0.00% 3.91% git
:18517 qhwt 157 0 0K 4K 0:00 0.00% 1.95% git
:
:I played with kgdb a bit and found that these 3 processes are on zombproc
:but with p_stat==SACTIVE and p_lock==1. I have no idea how it happened.

This should hopefully be fixed with this commit I made in December:

commit 2e425d87dc98885c44799de0327ab0013a7c34d6
Author: Matthew Dillon
Date: Thu Dec 18 20:20:15 2008 -0800

    Close a possible bug where the p_lock for a new process inherits a
    non-zero value from its parent on fork(), preventing the process
    from being able to exit later on.

-Matt
Matthew Dillon
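
The class of bug named in that commit message (a child inheriting a non-zero
hold count from its parent on fork) can be illustrated with a toy model. This
is a sketch under assumptions, not the actual kernel change: `struct toy_proc`,
`toy_fork_buggy()`, `toy_fork_fixed()` and `toy_can_finish_exit()` are
hypothetical, and the rule "a process can only finish exiting when p_lock is
zero" is assumed here to match the commit description.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical, heavily reduced process record. */
struct toy_proc {
    int pid;
    int p_lock;   /* hold count; assumed to block final exit while non-zero */
    int p_flags;  /* placeholder for state that should be inherited */
};

/* Buggy variant: copies the whole record, including the parent's hold count. */
static void toy_fork_buggy(const struct toy_proc *parent,
                           struct toy_proc *child, int pid)
{
    assert(parent != NULL && child != NULL);
    memcpy(child, parent, sizeof(*child));
    child->pid = pid;
}

/* Fixed variant: same copy, then reset the field that must start fresh. */
static void toy_fork_fixed(const struct toy_proc *parent,
                           struct toy_proc *child, int pid)
{
    assert(parent != NULL && child != NULL);
    memcpy(child, parent, sizeof(*child));
    child->pid = pid;
    child->p_lock = 0;
}

/* Assumed rule: exit can complete only when nobody holds the process. */
static int toy_can_finish_exit(const struct toy_proc *p)
{
    return p->p_lock == 0;
}
```

If the parent happens to be held (p_lock == 1) at the moment of the fork, the
buggy variant produces a child that can never satisfy toy_can_finish_exit(),
while the fixed variant produces one that can.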

#6 Updated by qhwt+dfly over 5 years ago

Ok, I'll try this on top of R2.0, or the pre-release of R2.2,
depending on when the new branch becomes available.

Thanks.

#7 Updated by qhwt+dfly over 5 years ago

I found a similar zombie process on a PC running -DEVELOPMENT
(as of 89f297df...), with p_stat==SACTIVE and p_lock==1. The stuck
process was git again. The previous ones were `git log', and this time
it was `git show', all of which involve $PAGER. I tried issuing the same
command several times, but I haven't been able to reproduce another zombie yet.

#8 Updated by dillon over 5 years ago

:I found a similar zombie process on a PC running -DEVELOPMENT
:(as of 89f297df...), with p_stat==SACTIVE and p_lock==1. The stuck
:process was git again. The previous ones were `git log', and this time
:`git show', all of which involve $PAGER. I tried issuing the same command
:several times, but I couldn't reproduce another zombie yet.

Check p_xstat, see if it is SIGSTOP or SIGTSTP. I think it's going
from SZOMB -> SSTOP -> SACTIVE (on kill or cont) due to the signal.
The p_lock == 1 is due to the exiting LWP and is correct.

I just made another commit which should catch zombied processes which
are improperly resurrected by a stop signal. However, I couldn't
reproduce it to test the fix... the window of opportunity is fairly
small.

-Matt
Matthew Dillon
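
The suspected transition SZOMB -> SSTOP -> SACTIVE can be modelled as a tiny
state machine. The sketch below is a user-space toy, not the commit mentioned
above: the `toy_*` names and `TOY_*` state values are hypothetical, and the
guard shown (ignore stop signals once the process is a zombie) is one
plausible way to block the resurrection, assumed for illustration.

```c
#include <signal.h>

/* Toy stand-ins for kernel process states; the numeric values are illustrative. */
enum toy_stat { TOY_SACTIVE = 1, TOY_SSTOP, TOY_SZOMB };

/* Hypothetical, heavily reduced process record. */
struct toy_proc {
    enum toy_stat p_stat;
    int p_xstat;  /* last stop signal (or exit status) */
};

/* Unguarded stop: moves any process to SSTOP, even a zombie. */
static void toy_stop_unguarded(struct toy_proc *p, int sig)
{
    p->p_stat = TOY_SSTOP;
    p->p_xstat = sig;
}

/* Guarded stop: a zombie is never stopped, so it can never be continued. */
static void toy_stop_guarded(struct toy_proc *p, int sig)
{
    if (p->p_stat == TOY_SZOMB)
        return;
    p->p_stat = TOY_SSTOP;
    p->p_xstat = sig;
}

/* Continue: a stopped process becomes active again. */
static void toy_cont(struct toy_proc *p)
{
    if (p->p_stat == TOY_SSTOP)
        p->p_stat = TOY_SACTIVE;
}
```

With the unguarded variant, a zombie that receives SIGTSTP and is then
continued ends up in TOY_SACTIVE with p_xstat == SIGTSTP, which mirrors the
symptom reported in this thread. With the guarded variant it stays TOY_SZOMB.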

#9 Updated by qhwt+dfly over 5 years ago

Yes, that's SIGTSTP (result of `p *zombproc->lh_first' attached).

I figured out how I created the zombie:
----
$ unset PAGER (or unsetenv if you like *csh)
$ cd /path/to/git/repo
$ git show 55a9cd0fa2b75e61230e2802b78eaec8937a1e42
Here the pager displays 7 lines of commit log and the patch, with `(END)'
at the end. Suspend it by pressing Ctrl+Z, then type fg to resume,
and press `q' in the pager to leave it.

$ ps x
Now you see another zombie.
----

I'll try this on the new kernel and see if I can still reproduce it.

Thanks.
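
The expected stop/continue/exit lifecycle can also be checked from user space
with plain POSIX calls. The sketch below does not reproduce the race from the
steps above (it involves no terminal, pager or job-control shell), and it uses
SIGSTOP rather than SIGTSTP so that it behaves the same without a controlling
terminal. `run_stop_cont_cycle()` is a hypothetical helper: it stops a child,
continues it, and verifies that the child can then exit and be reaped.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Fork a child that stops itself, continue it, and reap it.
 * Returns the child's exit code, or -1 if any step did not behave
 * as expected.
 */
static int run_stop_cont_cycle(int exit_code)
{
    int status;
    pid_t pid = fork();

    if (pid < 0)
        return -1;

    if (pid == 0) {
        raise(SIGSTOP);    /* child: stop until the parent sends SIGCONT */
        _exit(exit_code);  /* child: exit once resumed */
    }

    /* Wait for the child to report that it has stopped. */
    if (waitpid(pid, &status, WUNTRACED) != pid)
        return -1;
    if (!WIFSTOPPED(status))
        return -1;         /* child ended instead of stopping; already reaped */

    /* Resume the child, then wait for it to exit and reap it. */
    if (kill(pid, SIGCONT) != 0)
        return -1;
    if (waitpid(pid, &status, 0) != pid)
        return -1;
    if (!WIFEXITED(status))
        return -1;

    return WEXITSTATUS(status);
}
```

Calling run_stop_cont_cycle(7) on a healthy system should return 7. A process
left behind in the state described in this thread would instead show up as
the final waitpid() not returning.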

#10 Updated by qhwt+dfly over 5 years ago

After commit 81b18e51, I can no longer reproduce the zombies, so I'd say
it's resolved now. 2.0-RELEASE is affected by this issue, so 81b18e51 and
related commits should be backported, too.

Thanks.
