Bug #1137
Process stuck with an empty STATE
Status: closed
Description
Hello.
I very occasionally (only twice, as far as I recall) see this in `top' output
on 2.0-RELEASE or -DEVELOPMENT (the following is from my mail server running
2.0-RELEASE):
load averages: 0.00, 0.00, 0.00 up 13+15:09:03 11:26:24
79 processes: 79 running
CPU states: % user, % nice, % system, % interrupt, % idle
Mem: 83M Active, 241M Inact, 130M Wired, 26M Cache, 59M Buf, 1116K Free
Swap: 2048M Total, 340K Used, 2048M Free
  PID USERNAME PRI NICE  SIZE   RES STATE TIME     WCPU    CPU COMMAND
44758 qhwt     230    0    0K    4K       0:06  182.89% 77.44% git
`kill' and other tools can't locate this process, and I can't find it under
the /proc filesystem either. Unfortunately I forgot to plug in the keyboard
when it booted, so I probably can't drop into DDB. There are no `cache_lock'
diagnostic messages on the console.
The `git' command was probably running on the HAMMER filesystem, which has
been regularly pruned and reblocked, and I think it has enough room:
Filesystem      Size  Used Avail Capacity  Mounted on
/dev/ad0s1a     496M  364M   93M      80%  /
/dev/ad0s1e     7.9G  5.0G  2.3G      69%  /var
/dev/ad0s1f     7.9G  1.2G  6.0G      17%  /u
HAMMER          100G   23G   77G      23%  /HAMMER
/dev/ad0s1h      22G   11G  9.5G      53%  /backup
procfs          4.0K  4.0K    0B     100%  /proc
/HAMMER/@@-1:1  100G   23G   77G      23%  /home
/HAMMER/@@-1:2  100G   23G   77G      23%  /home/source
/HAMMER/@@-1:4  100G   23G   77G      23%  /usr/obj
It's not limited to the `git' command; I've seen a similar stuck process
(cc1) during buildworld loops on another machine running -DEVELOPMENT.
The kernel and world were built on August 24, so they're not the
latest. Does anyone know if this has already been resolved?
Thanks.
Updated by dillon about 16 years ago
:It's not limited to the `git' command; I've seen a similar stuck process
:(cc1) during buildworld loops on another machine running -DEVELOPMENT.
:The kernel and world were built on August 24, so they're not the
:latest. Does anyone know if this has already been resolved?
:
:Thanks.
New one to me. I've never seen this.
I think your best bet is to kgdb the live kernel and see if it shows
up on 'info threads'. If you can get a process or thread address out
of ps it should be possible to track down the actual state of the
process.
-Matt
Matthew Dillon
<dillon@backplane.com>
Updated by qhwt+dfly about 16 years ago
No luck with kgdb; the psx macro doesn't show process 44758.
`info threads' doesn't show any relevant threads, either.
$ ps o pid,tid,tt,stat,start,command -x |grep -e 44758
44758   0  p3  DEL  Fri08AM (git)
56964   0  p8  RL+  12:37PM grep -e 44758
Updated by qhwt+dfly almost 16 years ago
Yes, it's happening now. My mail server is running a UP kernel built from
941f5de0 (around Oct. 10), but I can't find a relevant fix on the
DragonFly_RELEASE_2_0 branch. I have another box running -DEVELOPMENT,
but its uptime is probably not long enough to reproduce it.
$ top -n -U qhwt 4
load averages: 0.00, 0.00, 0.00 up 100+00:41:22 11:01:26
78 processes: 78 running
Mem: 116M Active, 210M Inact, 138M Wired, 16M Cache, 59M Buf, 1220K Free
Swap: 2048M Total, 15M Used, 2033M Free
  PID USERNAME PRI NICE  SIZE   RES STATE TIME     WCPU    CPU COMMAND
63039 qhwt     179    0 2084K 1136K RUN   0:00 1161.00% 56.69% top
18535 qhwt     159    0    0K    4K       0:00    0.00%  3.91% git
18527 qhwt     159    0    0K    4K       0:00    0.00%  3.91% git
18517 qhwt     157    0    0K    4K       0:00    0.00%  1.95% git
I played with kgdb a bit and found that these 3 processes are on zombproc
but with p_stat==SACTIVE and p_lock==1. I have no idea how it happened.
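The inconsistency described above can be stated as a simple invariant: every entry on the zombie list should be in state SZOMB. Below is a minimal sketch of that check in C, under stated assumptions. `struct toy_proc`, `enum pstate` and `count_inconsistent` are hypothetical stand-ins invented for illustration, not the kernel's real `struct proc` or any existing helper; only the names p_stat, p_lock and the state names are taken from this thread.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified process states; the names mirror the thread. */
enum pstate { SIDL, SACTIVE, SSTOP, SZOMB };

/* Toy stand-in for a process entry. This is NOT the kernel's struct proc. */
struct toy_proc {
    int pid;
    enum pstate p_stat;
    int p_lock;              /* hold count; never negative in this model */
    struct toy_proc *next;   /* next entry on the (toy) zombie list */
};

/* Count entries on the zombie list whose state is not SZOMB. */
static int count_inconsistent(const struct toy_proc *zomb_head)
{
    int n = 0;

    for (const struct toy_proc *p = zomb_head; p != NULL; p = p->next) {
        assert(p->p_lock >= 0);
        if (p->p_stat != SZOMB)
            n++;
    }
    return n;
}
```

In this model, a list holding three entries that are all SACTIVE, as in the report, yields a count of 3, while a healthy zombie list yields 0.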
Updated by dillon almost 16 years ago
:941f5de0 (around Oct. 10), but I can't find a relevant fix on the
:DragonFly_RELEASE_2_0 branch. I have another box running -DEVELOPMENT,
:but its uptime is probably not long enough to reproduce it.
:
:$ top -n -U qhwt 4
:load averages: 0.00, 0.00, 0.00 up 100+00:41:22 11:01:26
:78 processes: 78 running
:
:Mem: 116M Active, 210M Inact, 138M Wired, 16M Cache, 59M Buf, 1220K Free
:Swap: 2048M Total, 15M Used, 2033M Free
:
:
:  PID USERNAME PRI NICE  SIZE   RES STATE TIME     WCPU    CPU COMMAND
:63039 qhwt     179    0 2084K 1136K RUN   0:00 1161.00% 56.69% top
:18535 qhwt     159    0    0K    4K       0:00    0.00%  3.91% git
:18527 qhwt     159    0    0K    4K       0:00    0.00%  3.91% git
:18517 qhwt     157    0    0K    4K       0:00    0.00%  1.95% git
:
:I played with kgdb a bit and found that these 3 processes are on zombproc
:but with p_stat==SACTIVE and p_lock==1. I have no idea how it happened.
This should hopefully be fixed with this commit I made in December:
commit 2e425d87dc98885c44799de0327ab0013a7c34d6
Author: Matthew Dillon <dillon@apollo.backplane.com>
Date: Thu Dec 18 20:20:15 2008 -0800
Close a possible bug where the p_lock for a new process inherits a
non-zero value from its parent on fork(), preventing the process
from being able to exit later on.
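To illustrate the failure mode the commit message describes, here is a toy model in C. It is a sketch only and assumes the child starts as a byte copy of the parent; the real fork path and the actual change in commit 2e425d87 are not shown in this thread. `struct toy_proc`, `toy_fork_buggy`, `toy_fork_fixed` and `toy_can_finish_exit` are hypothetical names.

```c
#include <assert.h>
#include <string.h>

/* Toy stand-in for a process; NOT the kernel's struct proc. */
struct toy_proc {
    int pid;
    int p_lock;   /* hold count; exit cannot complete while it is non-zero */
};

/* Buggy variant: the child is a plain copy of the parent, so a p_lock
 * that happens to be held on the parent leaks into the child. */
static void toy_fork_buggy(const struct toy_proc *parent,
                           struct toy_proc *child, int pid)
{
    assert(parent != NULL && child != NULL);
    memcpy(child, parent, sizeof(*child));
    child->pid = pid;
}

/* Fixed variant: same copy, but the hold count is explicitly reset. */
static void toy_fork_fixed(const struct toy_proc *parent,
                           struct toy_proc *child, int pid)
{
    assert(parent != NULL && child != NULL);
    memcpy(child, parent, sizeof(*child));
    child->pid = pid;
    child->p_lock = 0;
}

/* In this model, exit can only complete once nobody holds the process. */
static int toy_can_finish_exit(const struct toy_proc *p)
{
    return p->p_lock == 0;
}
```

With a parent whose p_lock is 1 at fork time, the buggy variant produces a child that can never finish exiting in this model, while the fixed variant produces one that can.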
-Matt
Matthew Dillon
<dillon@backplane.com>
Updated by qhwt+dfly almost 16 years ago
Ok, I'll try this on top of R2.0, or the pre-release of R2.2,
depending on when the new branch becomes available.
Thanks.
Updated by qhwt+dfly almost 16 years ago
I found a similar zombie process on a PC running -DEVELOPMENT
(as of 89f297df...), with p_stat==SACTIVE and p_lock==1. The stuck
process was git again. The previous ones were `git log', and this time
`git show', all of which involve $PAGER. I tried issuing the same command
several times, but I haven't been able to reproduce another zombie yet.
Updated by dillon almost 16 years ago
:I found a similar zombie process on a PC running -DEVELOPMENT
:(as of 89f297df...), with p_stat==SACTIVE and p_lock==1. The stuck
:process was git again. The previous ones were `git log', and this time
:`git show', all of which involve $PAGER. I tried issuing the same command
:several times, but I haven't been able to reproduce another zombie yet.
Check p_xstat, see if it is SIGSTOP or SIGTSTP. I think it's going
from SZOMB -> SSTOP -> SACTIVE (on kill or cont) due to the signal.
The p_lock == 1 is due to the exiting LWP and is correct.
I just made another commit which should catch zombied processes which
are improperly resurrected by a stop signal. However, I couldn't
reproduce it to test the fix... the window of opportunity is fairly
small.
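The suspected SZOMB -> SSTOP -> SACTIVE transition can be sketched as a tiny state machine in C. This is a toy model under stated assumptions, not the kernel's signal code and not the content of the commit mentioned above: `struct toy_proc`, `toy_stop_unchecked`, `toy_stop_checked` and `toy_cont` are hypothetical names, and the only thing taken from this thread is the sequence of states.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified process states; the names mirror the thread. */
enum pstate { SACTIVE, SSTOP, SZOMB };

/* Toy stand-in for a process; NOT the kernel's struct proc. */
struct toy_proc {
    enum pstate p_stat;
};

/* Stop handling that never looks at the current state: a zombie that
 * receives a stop signal is moved to SSTOP. */
static void toy_stop_unchecked(struct toy_proc *p)
{
    assert(p != NULL);
    p->p_stat = SSTOP;
}

/* Stop handling that refuses to touch a process that is already a zombie. */
static void toy_stop_checked(struct toy_proc *p)
{
    assert(p != NULL);
    if (p->p_stat != SZOMB)
        p->p_stat = SSTOP;
}

/* Continue (or kill) handling: a stopped process becomes active again. */
static void toy_cont(struct toy_proc *p)
{
    assert(p != NULL);
    if (p->p_stat == SSTOP)
        p->p_stat = SACTIVE;
}
```

Starting from SZOMB, the unchecked stop followed by a continue leaves the toy process in SACTIVE, which matches the state seen on zombproc; with the checked stop it stays in SZOMB.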
-Matt
Matthew Dillon
<dillon@backplane.com>
Updated by qhwt+dfly almost 16 years ago
Yes, that's SIGTSTP (result of `p *zombproc->lh_first' attached).
I found out how I created the zombie:
----
$ unset PAGER (or unsetenv if you like *csh)
$ cd /path/to/git/repo
$ git show 55a9cd0fa2b75e61230e2802b78eaec8937a1e42
Here the pager displays 7 lines of commit log and the patch, with `(END)'
at the end. Suspend it by pressing ctrl+Z, then type fg to resume,
and press `q' in the pager to leave it.
$ ps x
Now you see another zombie.
----
I'll try this on the new kernel and see if I can still reproduce it.
Thanks.
Updated by qhwt+dfly almost 16 years ago
After commit 81b18e51, I can no longer reproduce the zombies, so I'd say
it's resolved now. 2.0-RELEASE is affected by this issue, so 81b18e51 and
related commits should be backported, too.
Thanks.