Bug #24: waitpid() with WUNTRACED flag? (was Re: Hang on ctrl+Z after the MPSAFE tsleep/wakeup commit) - DragonFlyBSD - DragonFlyBSD bugtracker

Actions

Copy link

Bug #24

closed

waitpid() with WUNTRACED flag? (was Re: Hang on ctrl+Z after the MPSAFE tsleep/wakeup commit)

Added by qhwt+dfly over 19 years ago. Updated almost 19 years ago.

Status:

Closed

Priority:

Normal

Assignee:

Category:

Target version:

Start date:

Due date:

% Done:

Estimated time:

Description

On Fri, Dec 02, 2005 at 10:08:59AM -0800, Matthew Dillon wrote:

:Hello.
:After the following commit, stopping a user process with SIGSTOP
:does not give control to the shell:
:http://leaf.dragonflybsd.org/mailarchive/commits/2005-11/msg00109.html
:It only happens with a program which does fork() inside it. Here's
:a simple program to demonstrate this.
:
:$ gcc -g -Wall a.c && ./a.out cat
:(press ctrl+Z here, and it accepts no other signals until you send
:SIGCONT to the child process from another screen)
:
:`ps' command shows that the stuck processes are marked as "TL+"
:(meaning that they didn't give up even after having been stopped?).

Ah ha! I have reproduced it. I was running './a.out cat' separately,
from csh, while you were running it from /bin/sh.

When I changed over to /bin/sh the problem reproduced.

I'll get it fixed today, hopefully soon.

While this has been fixed, I realized that I overlooked another
similar problem; try running vipw, suspend it with ctrl-Z and
continue with `fg'; the shell immediately reports that vipw has been
suspended(processes are marked as `TL' by ps command); another `fg'
seems to continue it, but vi(or whatever editor invoked by vipw)
doesn't handle the screen properly, even though it's responding to
the keystrokes.
Attached is a small code fragment that demonstrates the problem.
BTW, it behaves differently on -CURRENT and 1.2.6-RELEASE;
on -CURRENT, waitpid() returns after `fg' command followed by a ctrl-Z,
while on 1.2.6-RELEASE it never returns when suspended or continued.
Without WUNTRACED flag, the code seems to work the same way on both
versions of the OS.
Too bad I couldn't find it before the new release.

Files

a.c (977 Bytes) a.c

qhwt+dfly, 12/27/2005 12:32 AM

Actions

Copy link

Updated by dillon over 19 years ago

:While this has been fixed, I realized that I overlooked another
:similar problem; try running vipw, suspend it with ctrl-Z and
:continue with `fg'; the shell immediately reports that vipw has been
:suspended(processes are marked as `TL' by ps command); another `fg'
:seems to continue it, but vi(or whatever editor invoked by vipw)
:doesn't handle the screen properly, even though it's responding to
:the keystrokes.
:Attached is a small code fragment that demonstrates the problem.
:BTW, it behaves differently on -CURRENT and 1.2.6-RELEASE;
:on -CURRENT, waitpid() returns after `fg' command followed by a ctrl-Z,
:while on 1.2.6-RELEASE it never returns when suspended or continued.
:Without WUNTRACED flag, the code seems to work the same way on both
:versions of the OS.
:Too bad I couldn't find it before the new release.

We haven't released yet!  I can reproduce the same problem so I'll
    track it down and get it fixed before the release.

-Matt
                    Matthew Dillon 
                    &lt;dillon@backplane.com&gt;

Actions

Copy link

Updated by qhwt+dfly over 19 years ago

On Mon, Dec 26, 2005 at 11:16:10AM -0800, Matthew Dillon wrote:

:While this has been fixed, I realized that I overlooked another
:similar problem; try running vipw, suspend it with ctrl-Z and
:continue with `fg'; the shell immediately reports that vipw has been
:suspended(processes are marked as `TL' by ps command); another `fg'
:seems to continue it, but vi(or whatever editor invoked by vipw)
:doesn't handle the screen properly, even though it's responding to
:the keystrokes.
:Attached is a small code fragment that demonstrates the problem.
:BTW, it behaves differently on -CURRENT and 1.2.6-RELEASE;
:on -CURRENT, waitpid() returns after `fg' command followed by a ctrl-Z,
:while on 1.2.6-RELEASE it never returns when suspended or continued.
:Without WUNTRACED flag, the code seems to work the same way on both
:versions of the OS.
:Too bad I couldn't find it before the new release.

We haven't released yet! I can reproduce the same problem so I'll
track it down and get it fixed before the release.

Oops, I was supposed to say `before branching.' Anyway, I'll play with
signal a bit more but hopefully this is the last one.
By the way, here's the code I forgot attaching in the previous message.
It behaves consistently on Linux, FreeBSD, and NetBSD boxes around me,
but differently on the recent DragonFly.

Actions

Copy link

Updated by qhwt+dfly over 19 years ago

blush

Actions

Copy link

Updated by dillon over 19 years ago

:Content-Type: text/plain; charset=us-ascii
:Content-Disposition: attachment; filename="a.c"
:...

I think it's the same issue.  The ^Z effects all processes in
    the process group.  waitpid in the parent tries to return the
    stopped state of the child, but the parent itself is stopped
    and the status is not returned until you 'fg'.  This starts
    both parent and child, but then the parent processes the
    return from the waitpid and stops itself again.

I'm not sure if there's a good solution to the problem, or how
    much code uses the same sort of construct.  Even the original code
    still had this race condition, it was just a much smaller window
    of opportunity.

One possible solution is to have the kernel wait4() check for a pending
    stop request on the parent process after being woken up but before it 
    checks the state of other processes.  This would reduce (but not
    eliminate) the window of opportunity back to what it was before
    the tsleep work went in.

-Matt

: for (;;) {
: caught = waitpid(pid, &st, WUNTRACED);
: warnx("waitpid returned %d\n", caught);
: errno = WEXITSTATUS;
: if (caught == -1)
: return 1;
: else if (WIFSTOPPED) {
: sig = WSTOPSIG;
: warnx("WSTOPSIG = %d\n", st, sig);
: sig = SIGSTOP;
: raise(sig);
: warnx("after raise(%d)\n", sig);
: }

Actions

Copy link

Updated by qhwt+dfly over 19 years ago

On Mon, Dec 26, 2005 at 11:27:44PM -0800, Matthew Dillon wrote:

:Content-Type: text/plain; charset=us-ascii
:Content-Disposition: attachment; filename="a.c"
:...

I think it's the same issue. The ^Z effects all processes in
the process group. waitpid in the parent tries to return the
stopped state of the child, but the parent itself is stopped
and the status is not returned until you 'fg'. This starts
both parent and child, but then the parent processes the
return from the waitpid and stops itself again.

Ok, the code doesn't really reflect my question here; on computers here
running other OSes, including DragonFly 1.2.x-RELEASE, waitpid() (or wait4)
doesn't return after ^Z+`fg' even with WUNTRACED (I can confirm this by the
fact that warnx() doesn't print "WSTOPSIG = %d\n" message), thus no such
problem exists. I don't know if it's an (POSIX-ly or SUS-ly?) expected
behavior, but are we going to change that?

: for (;;) {
: caught = waitpid(pid, &st, WUNTRACED);
: warnx("waitpid returned %d\n", caught);
: errno = WEXITSTATUS;
: if (caught == -1)
: return 1;
: else if (WIFSTOPPED) {
: sig = WSTOPSIG;
: warnx("WSTOPSIG = %d\n", st, sig);
: sig = SIGSTOP;
: raise(sig);
: warnx("after raise(%d)\n", sig);
: }

Actions

Copy link

Updated by dillon over 19 years ago

:Ok, the code doesn't really reflect my question here; on computers here
:running other OSes, including DragonFly 1.2.x-RELEASE, waitpid() (or wait4)
:doesn't return after ^Z+`fg' even with WUNTRACED (I can confirm this by the
:fact that warnx() doesn't print "WSTOPSIG = %d\n" message), thus no such
:problem exists. I don't know if it's an (POSIX-ly or SUS-ly?) expected
:behavior, but are we going to change that?

I don't know what the expected behavior is, but the only reason the
    parent isn't seeing the STOP signal on the child in FreeBSD is due to
    a fluke in the way the parent's own STOP signal is handled, which is
    causing the parent to stop before it's wait*() scans the list of
    child processes.  If you were to wake up the parent on FreeBSD with a
    kill -CONT from another shell, it will see the vi's stop signal.

When you 'fg' the process group, both parent and child are being
    restarted at the same time.  The parent then starts its scan of the list
    of child processes and of course no longer sees the child in a stopped
    state, so it only reports the SIGCONT, instead of a SIGSTOP followed by
    a SIGCONT.  (The stop/cont state is held in a single variable in the
    proc structure so multiple stops and conts are not queued to the parent's
    wait*()).

DragonFly is currently scanning the list of child processes BEFORE it
    stops, so it sees the stopped child and tries to return its status,
    but then the parent process stops before it returns to user mode.
    When you 'fg' the process group, DragonFly proceeds to return the
    original STOP status that it has already processed (so it doesn't get
    overwritten by the new CONT status), then the parent program loops on
    the waitpid() and DragonFly sees the CONT status.

Which is more correct?  I don't know.  I think the userland program
    itself (the parent, as in your test program and in vipw) might be
    incorrect.

-Matt

Actions

Copy link

Updated by corecode over 19 years ago

On 27.12.2005, at 19:57, Matthew Dillon wrote:

Which is more correct? I don't know. I think the userland program
itself (the parent, as in your test program and in vipw) might be
incorrect.

I think from outside it usually should each be an atomar operation:
first stopping all foreground processes and later resuming all them at
the same time. So in the common case I think the parent shouldn't
notice either, but I know that it's not possible in the way we process
suspend signals now. Except, of course, if we add another sleep point
in wait. Or am I talking nonsense? (Few sleep at the moment)

cheers
simon

Actions

Copy link

Updated by dillon over 19 years ago

:the same time. So in the common case I think the parent shouldn't=20
:notice either, but I know that it's not possible in the way we process=20=
:
:suspend signals now. Except, of course, if we add another sleep point=20=
:
:*in* wait. Or am I talking nonsense? (Few sleep at the moment)
:
:cheers
: simon

That is precisely what we would have to do, as I indicated earlier
    (I worded it differently but it amounts to the same thing).  We would
    have to put a P_STOPPED check in the kernel wait*() code.

Note that even on FreeBSD the SIGCONT is still picked up.  It's in fact
    the reason why the SIGSTOP is NOT picked up by the parent, because
    the SIGCONT overwrites the SIGSTOP in the chlid's proc.

Making wait*() work like it did before is a two-line fix.  Basically just
    this after the loop: label in kern_wait():

while (q->p_flag & P_STOPPED)  
        tstop(q);

But I don't know if I want to actually do that.  I think the original
    userland code is what was broken, not the kernel.

-Matt

Actions

Copy link

Updated by corecode over 19 years ago

fixed by dillon

Actions

Copy link

Also available in: Atom PDF

Project

General

Profile

DragonFlyBSD

Bug #24

waitpid() with WUNTRACED flag? (was Re: Hang on ctrl+Z after the MPSAFE tsleep/wakeup commit)

Updated by dillon over 19 years ago

Updated by qhwt+dfly over 19 years ago

Updated by qhwt+dfly over 19 years ago

Updated by dillon over 19 years ago

Updated by qhwt+dfly over 19 years ago

Updated by dillon over 19 years ago

Updated by corecode over 19 years ago

Updated by dillon over 19 years ago

Updated by corecode over 19 years ago