Bug #24

waitpid() with WUNTRACED flag? (was Re: Hang on ctrl+Z after the MPSAFE tsleep/wakeup commit)

Added by qhwt+dfly over 8 years ago. Updated almost 8 years ago.

Status:ClosedStart date:
Priority:NormalDue date:
Assignee:-% Done:

0%

Category:-
Target version:-

Description

On Fri, Dec 02, 2005 at 10:08:59AM -0800, Matthew Dillon wrote:
> :Hello.
> :After the following commit, stopping a user process with SIGSTOP
> :does not give control to the shell:
> :http://leaf.dragonflybsd.org/mailarchive/commits/2005-11/msg00109.html
> :It only happens with a program which does fork() inside it. Here's
> :a simple program to demonstrate this.
> :
> :$ gcc -g -Wall a.c && ./a.out cat
> :(press ctrl+Z here, and it accepts no other signals until you send
> :SIGCONT to the child process from another screen)
> :
> :`ps' command shows that the stuck processes are marked as "TL+"
> :(meaning that they didn't give up even after having been stopped?).
>
> Ah ha! I have reproduced it. I was running './a.out cat' separately,
> from csh, while you were running it from /bin/sh.
>
> When I changed over to /bin/sh the problem reproduced.
>
> I'll get it fixed today, hopefully soon.

While this has been fixed, I realized that I overlooked another
similar problem; try running vipw, suspend it with ctrl-Z and
continue with `fg'; the shell immediately reports that vipw has been
suspended(processes are marked as `TL' by ps command); another `fg'
seems to continue it, but vi(or whatever editor invoked by vipw)
doesn't handle the screen properly, even though it's responding to
the keystrokes.
Attached is a small code fragment that demonstrates the problem.
BTW, it behaves differently on -CURRENT and 1.2.6-RELEASE;
on -CURRENT, waitpid() returns after `fg' command followed by a ctrl-Z,
while on 1.2.6-RELEASE it never returns when suspended or continued.
Without WUNTRACED flag, the code seems to work the same way on both
versions of the OS.
Too bad I couldn't find it before the new release.

a.c Magnifier (977 Bytes) qhwt+dfly, 12/27/2005 12:32 AM

History

#1 Updated by dillon over 8 years ago

:While this has been fixed, I realized that I overlooked another
:similar problem; try running vipw, suspend it with ctrl-Z and
:continue with `fg'; the shell immediately reports that vipw has been
:suspended(processes are marked as `TL' by ps command); another `fg'
:seems to continue it, but vi(or whatever editor invoked by vipw)
:doesn't handle the screen properly, even though it's responding to
:the keystrokes.
:Attached is a small code fragment that demonstrates the problem.
:BTW, it behaves differently on -CURRENT and 1.2.6-RELEASE;
:on -CURRENT, waitpid() returns after `fg' command followed by a ctrl-Z,
:while on 1.2.6-RELEASE it never returns when suspended or continued.
:Without WUNTRACED flag, the code seems to work the same way on both
:versions of the OS.
:Too bad I couldn't find it before the new release.

We haven't released yet! I can reproduce the same problem so I'll
track it down and get it fixed before the release.

-Matt
Matthew Dillon
<>

#2 Updated by qhwt+dfly over 8 years ago

On Mon, Dec 26, 2005 at 11:16:10AM -0800, Matthew Dillon wrote:
>
> :While this has been fixed, I realized that I overlooked another
> :similar problem; try running vipw, suspend it with ctrl-Z and
> :continue with `fg'; the shell immediately reports that vipw has been
> :suspended(processes are marked as `TL' by ps command); another `fg'
> :seems to continue it, but vi(or whatever editor invoked by vipw)
> :doesn't handle the screen properly, even though it's responding to
> :the keystrokes.
> :Attached is a small code fragment that demonstrates the problem.
> :BTW, it behaves differently on -CURRENT and 1.2.6-RELEASE;
> :on -CURRENT, waitpid() returns after `fg' command followed by a ctrl-Z,
> :while on 1.2.6-RELEASE it never returns when suspended or continued.
> :Without WUNTRACED flag, the code seems to work the same way on both
> :versions of the OS.
> :Too bad I couldn't find it before the new release.
>
> We haven't released yet! I can reproduce the same problem so I'll
> track it down and get it fixed before the release.

Oops, I was supposed to say `before branching.' Anyway, I'll play with
signal a bit more but hopefully this is the last one.
By the way, here's the code I forgot attaching in the previous message.
It behaves consistently on Linux, FreeBSD, and NetBSD boxes around me,
but differently on the recent DragonFly.

#3 Updated by qhwt+dfly over 8 years ago

*blush*

#4 Updated by dillon over 8 years ago

:Content-Type: text/plain; charset=us-ascii
:Content-Disposition: attachment; filename="a.c"
:...

I think it's the same issue. The ^Z effects all processes in
the process group. waitpid in the parent tries to return the
stopped state of the child, but the parent itself is stopped
and the status is not returned until you 'fg'. This starts
both parent and child, but then the parent processes the
return from the waitpid and stops itself again.

I'm not sure if there's a good solution to the problem, or how
much code uses the same sort of construct. Even the original code
still had this race condition, it was just a much smaller window
of opportunity.

One possible solution is to have the kernel wait4() check for a pending
stop request on the parent process after being woken up but before it
checks the state of other processes. This would reduce (but not
eliminate) the window of opportunity back to what it was before
the tsleep work went in.

-Matt

: for (;;) {
: caught = waitpid(pid, &st, WUNTRACED);
: warnx("waitpid returned %d\n", caught);
: errno = WEXITSTATUS(st);
: if (caught == -1)
: return 1;
: else if (WIFSTOPPED(st)) {
: sig = WSTOPSIG(st);
: warnx("WSTOPSIG(%d) = %d\n", st, sig);
: sig = SIGSTOP;
: raise(sig);
: warnx("after raise(%d)\n", sig);
: }

#5 Updated by qhwt+dfly over 8 years ago

On Mon, Dec 26, 2005 at 11:27:44PM -0800, Matthew Dillon wrote:
> :Content-Type: text/plain; charset=us-ascii
> :Content-Disposition: attachment; filename="a.c"
> :...
>
> I think it's the same issue. The ^Z effects all processes in
> the process group. waitpid in the parent tries to return the
> stopped state of the child, but the parent itself is stopped
> and the status is not returned until you 'fg'. This starts
> both parent and child, but then the parent processes the
> return from the waitpid and stops itself again.

Ok, the code doesn't really reflect my question here; on computers here
running other OSes, including DragonFly 1.2.x-RELEASE, waitpid() (or wait4)
doesn't return after ^Z+`fg' even with WUNTRACED (I can confirm this by the
fact that warnx() doesn't print "WSTOPSIG(%d) = %d\n" message), thus no such
problem exists. I don't know if it's an (POSIX-ly or SUS-ly?) expected
behavior, but are we going to change that?

> : for (;;) {
> : caught = waitpid(pid, &st, WUNTRACED);
> : warnx("waitpid returned %d\n", caught);
> : errno = WEXITSTATUS(st);
> : if (caught == -1)
> : return 1;
> : else if (WIFSTOPPED(st)) {
> : sig = WSTOPSIG(st);
> : warnx("WSTOPSIG(%d) = %d\n", st, sig);
> : sig = SIGSTOP;
> : raise(sig);
> : warnx("after raise(%d)\n", sig);
> : }

#6 Updated by dillon over 8 years ago

:Ok, the code doesn't really reflect my question here; on computers here
:running other OSes, including DragonFly 1.2.x-RELEASE, waitpid() (or wait4)
:doesn't return after ^Z+`fg' even with WUNTRACED (I can confirm this by the
:fact that warnx() doesn't print "WSTOPSIG(%d) = %d\n" message), thus no such
:problem exists. I don't know if it's an (POSIX-ly or SUS-ly?) expected
:behavior, but are we going to change that?

I don't know what the expected behavior is, but the only reason the
parent isn't seeing the STOP signal on the child in FreeBSD is due to
a fluke in the way the parent's own STOP signal is handled, which is
causing the parent to stop before it's wait*() scans the list of
child processes. If you were to wake up the parent on FreeBSD with a
kill -CONT from another shell, it will see the vi's stop signal.

When you 'fg' the process group, both parent and child are being
restarted at the same time. The parent then starts its scan of the list
of child processes and of course no longer sees the child in a stopped
state, so it only reports the SIGCONT, instead of a SIGSTOP followed by
a SIGCONT. (The stop/cont state is held in a single variable in the
proc structure so multiple stops and conts are not queued to the parent's
wait*()).

DragonFly is currently scanning the list of child processes *BEFORE* it
stops, so it sees the stopped child and tries to return its status,
but then the parent process stops before it returns to user mode.
When you 'fg' the process group, DragonFly proceeds to return the
original STOP status that it has already processed (so it doesn't get
overwritten by the new CONT status), then the parent program loops on
the waitpid() and DragonFly sees the CONT status.

Which is more correct? I don't know. I think the userland program
itself (the parent, as in your test program and in vipw) might be
incorrect.

-Matt

#7 Updated by corecode over 8 years ago

On 27.12.2005, at 19:57, Matthew Dillon wrote:
> Which is more correct? I don't know. I think the userland program
> itself (the parent, as in your test program and in vipw) might be
> incorrect.

I think from outside it usually should each be an atomar operation:
first stopping all foreground processes and later resuming all them at
the same time. So in the common case I think the parent shouldn't
notice either, but I know that it's not possible in the way we process
suspend signals now. Except, of course, if we add another sleep point
*in* wait. Or am I talking nonsense? (Few sleep at the moment)

cheers
simon

#8 Updated by dillon over 8 years ago

:the same time. So in the common case I think the parent shouldn't=20
:notice either, but I know that it's not possible in the way we process=20=
:
:suspend signals now. Except, of course, if we add another sleep point=20=
:
:*in* wait. Or am I talking nonsense? (Few sleep at the moment)
:
:cheers
: simon

That is precisely what we would have to do, as I indicated earlier
(I worded it differently but it amounts to the same thing). We would
have to put a P_STOPPED check in the kernel wait*() code.

Note that even on FreeBSD the SIGCONT is still picked up. It's in fact
the reason why the SIGSTOP is NOT picked up by the parent, because
the SIGCONT overwrites the SIGSTOP in the chlid's proc.

Making wait*() work like it did before is a two-line fix. Basically just
this after the loop: label in kern_wait():

while (q->p_flag & P_STOPPED)
tstop(q);

But I don't know if I want to actually do that. I think the original
userland code is what was broken, not the kernel.

-Matt

#9 Updated by corecode over 8 years ago

fixed by dillon

Also available in: Atom PDF