Bug #198

Unexpected soft update inconsistency

Added by wa1ter almost 8 years ago. Updated over 7 years ago.

Status:ClosedStart date:
Priority:NormalDue date:
Assignee:-% Done:

0%

Category:-
Target version:-

Description

I've been seeing 'write error' while compiling firefox. After
each write error I reboot and run fsck which produces the same
error each time:

LINK COUNT DIR I=2 OWNER=root MODE=40755
SIZE=512 MTIME=[date] COUNT 21 SHOULD BE 22
LINK COUNT INCREASING
UNEXPECTED SOFT UPDATE INCONSISTENCY

I tell it to correct the error and all seems fine until the
next 'write error' while trying to build firefox. The error
always occurs just after the message 'leaving directory <foo>'
so it's not entirely random the way an overheating CPU might
behave.

Any ideas how I can debug this?

Anyone else seeing the same thing?

History

#1 Updated by schlesinger almost 8 years ago

Am Freitag, 9. Juni 2006 18:35 schrieb walt:
> I've been seeing 'write error' while compiling firefox. After
> each write error I reboot and run fsck which produces the same
> error each time:
>
> LINK COUNT DIR I=2 OWNER=root MODE=40755
> SIZE=512 MTIME=[date] COUNT 21 SHOULD BE 22
> LINK COUNT INCREASING
> UNEXPECTED SOFT UPDATE INCONSISTENCY
>
> I tell it to correct the error and all seems fine until the
> next 'write error' while trying to build firefox. The error
> always occurs just after the message 'leaving directory <foo>'
> so it's not entirely random the way an overheating CPU might
> behave.
>
> Any ideas how I can debug this?
>
> Anyone else seeing the same thing?

At least the "write errors" do I have, too, when building certain packages
with bmake. When I do a "bmake install clean" again, bmake continues building
the package.

Thomas

#2 Updated by dillon almost 8 years ago

:> SIZE=512 MTIME=[date] COUNT 21 SHOULD BE 22
:> LINK COUNT INCREASING
:> UNEXPECTED SOFT UPDATE INCONSISTENCY
:>
:> I tell it to correct the error and all seems fine until the
:> next 'write error' while trying to build firefox. The error
:> always occurs just after the message 'leaving directory <foo>'
:> so it's not entirely random the way an overheating CPU might
:> behave.
:>
:> Any ideas how I can debug this?
:>
:> Anyone else seeing the same thing?
:
:At least the "write errors" do I have, too, when building certain packages
:with bmake. When I do a "bmake install clean" again, bmake continues building
:the package.
:
:Thomas

The 'write error' from the bmake is different from a disk write
error, which I think walt is getting. But a disk write error should
produce a console message (something in the dmesg output and on the
console).

It might be possible to trace out the cause of the bmake write error
by ktraceing the build. Something like this:

ktrace -i -t cns bmake

-Matt
Matthew Dillon
<>

#3 Updated by wa1ter almost 8 years ago

On Fri, 9 Jun 2006, Matthew Dillon wrote:

[...]

> It might be possible to trace out the cause of the bmake write error
> by ktraceing the build. Something like this:
>
> ktrace -i -t cns bmake

Heh. I tried your suggestion and I used up 2.2 Gigs of disk space
and then got a (legitimate) 'disk full' error. ktrace.out had
grown to 2GB.

The bmake of firefox usually runs for only five minutes or so
before crapping out with 'write error' -- so my gut tells me
that using ktrace probably prevented some race condition from
causing the 'write error'. But of course I'm only guessing.

#4 Updated by dillon almost 8 years ago

:Heh. I tried your suggestion and I used up 2.2 Gigs of disk space
:and then got a (legitimate) 'disk full' error. ktrace.out had
:grown to 2GB.
:
:The bmake of firefox usually runs for only five minutes or so
:before crapping out with 'write error' -- so my gut tells me
:that using ktrace probably prevented some race condition from
:causing the 'write error'. But of course I'm only guessing.

Welll.. were you writing the ktrace.out file to the same partition
that bmake puts its objects on ?

If there is really a bad sector on the disk, the existance of the ktrace
file would have definitely changed the block allocations.

-Matt
Matthew Dillon
<>

#5 Updated by wa1ter almost 8 years ago

Matthew Dillon wrote:
[...]
> If there is really a bad sector on the disk...

I've been testing all day, and found some things which may be
significant, dunno.

1) I built pkgsrc/firefox on NetBSD-current which is installed
on the same disk as DF -- I had no problems of any kind.

2) I built pkgsrc/firefox on my *other* DF installation (also on
the same disk) and got the same 'write error' just after
'leaving directory <foo>'. *But* there was one big difference:
when I rebooted and ran fsck there were *no* errors found on
the filesystem. I restarted the firefox bmake from where it
stopped and got a second 'write error'. I rebooted again, ran
fsck again, and again it found *no* error. (This DF install was
last updated on April 25.)

I'm thinking that the soft-update-inconsistency error may have been
just a red-herring due to a bad sector -- but the 'write error'
is also occurring in the absence of bad sectors, and was already
happening on DF-current as long ago as April 25.

As an aside -- firefox pulled in and installed 37 other pkgsrc
packages as dependencies with not one error, before starting to
build firefox itself. The only 'write error' I've seen is during
the building of firefox.

#6 Updated by dillon almost 8 years ago

:2) I built pkgsrc/firefox on my *other* DF installation (also on
: the same disk) and got the same 'write error' just after
: 'leaving directory <foo>'. *But* there was one big difference:
: when I rebooted and ran fsck there were *no* errors found on
: the filesystem. I restarted the firefox bmake from where it
: stopped and got a second 'write error'. I rebooted again, ran
: fsck again, and again it found *no* error. (This DF install was
: last updated on April 25.)
:
:I'm thinking that the soft-update-inconsistency error may have been
:just a red-herring due to a bad sector -- but the 'write error'
:is also occurring in the absence of bad sectors, and was already
:happening on DF-current as long ago as April 25.
:
:As an aside -- firefox pulled in and installed 37 other pkgsrc
:packages as dependencies with not one error, before starting to
:build firefox itself. The only 'write error' I've seen is during
:the building of firefox.

Ok, so try rebuilding just firefox (since now you have all the
other dependancies installed) over and over again and see if you
can consistently get the 'write error'.

I still think ktrace may be our best bet. It sounds more like an
application bug then a kernel bug.

I will attempt to reproduce the problem.

-Matt
Matthew Dillon
<>

#7 Updated by qhwt+dfly almost 8 years ago

On Fri, Jun 09, 2006 at 09:35:38AM -0700, walt wrote:
> LINK COUNT DIR I=2 OWNER=root MODE=40755
> SIZE=512 MTIME=[date] COUNT 21 SHOULD BE 22
> LINK COUNT INCREASING
> UNEXPECTED SOFT UPDATE INCONSISTENCY

Does this problem persist even if you do the following:
1. reboot to single-user mode
2. run fsck at least once with an '-f' flag
3. go to multi-user mode
4. try building firefox

the reason I care about this is because it seems to me that fsck
sometimes leaves inconsistencies unfixed, especially with -p flag.

> I tell it to correct the error and all seems fine until the
> next 'write error' while trying to build firefox. The error
> always occurs just after the message 'leaving directory <foo>'
> so it's not entirely random the way an overheating CPU might
> behave.

Is this the same problem you posted on users@ last month?
http://leaf.dragonflybsd.org/mailarchive/users/2006-05/msg00017.html
if so, have you tried reinstalling gmake (as I wrote in the same thread)?

#8 Updated by qhwt+dfly almost 8 years ago

On Sun, Jun 11, 2006 at 05:06:39PM +0900, YONETANI Tomokazu wrote:
> > I tell it to correct the error and all seems fine until the
> > next 'write error' while trying to build firefox. The error
> > always occurs just after the message 'leaving directory <foo>'
> > so it's not entirely random the way an overheating CPU might
> > behave.
>
> Is this the same problem you posted on users@ last month?
> http://leaf.dragonflybsd.org/mailarchive/users/2006-05/msg00017.html
> if so, have you tried reinstalling gmake (as I wrote in the same thread)?

Please disregard this part: reproduced with a fresh install of pkgsrc.
I'm running DragonFly_RELEASE_1_4 on an SMP machine. As I wrote on users@,
the 'write error' message comes from gmake (it calls itself `make' because
it's invoked via a symlink).

make[3]: Leaving directory `/var/obj/pkg.rel14/www/firefox/work/mozilla/xpinstall/public'
make[3]: write error
make[2]: *** [libs] Error 1
make[2]: Leaving directory `/var/obj/pkg.rel14/www/firefox/work/mozilla/xpinstall'
make[1]: *** [tier_50] Error 2
make[1]: Leaving directory `/var/obj/pkg.rel14/www/firefox/work/mozilla'
gmake: *** [default] Error 2
*** Error code 2

#9 Updated by joerg almost 8 years ago

On Fri, Jun 09, 2006 at 07:30:50PM +0200, Thomas Schlesinger wrote:
> At least the "write errors" do I have, too, when building certain packages
> with bmake. When I do a "bmake install clean" again, bmake continues building
> the package.

bmake or gmake? The errors I know of are from gmake and I'm working on a
fix for those. Rumko is waiting for it as well...

Joerg

#10 Updated by wa1ter almost 8 years ago

YONETANI Tomokazu wrote:
> On Fri, Jun 09, 2006 at 09:35:38AM -0700, walt wrote:
>> UNEXPECTED SOFT UPDATE INCONSISTENCY

> Does this problem persist even if you do the following:
> 1. reboot to single-user mode
> 2. run fsck at least once with an '-f' flag
> 3. go to multi-user mode
> 4. try building firefox

Yes, I always run fsck -f twice, to make sure that any errors
were really fixed -- and in this case they were fixed.

> Is this the same problem you posted on users@ last month?
> http://leaf.dragonflybsd.org/mailarchive/users/2006-05/msg00017.html
> if so, have you tried reinstalling gmake (as I wrote in the same thread)?

Yes, the same problem, and rebuilding gmake was the first thing
I tried -- it didn't change anything. I made one big mistake in
my original post -- the build does *not* always stop in the same
place. It does always stop just after printing 'leaving directory
foo', so the error is not completely random (just partly random).

I've spent all morning trying to get a ktrace of the problem
without success. ktrace.out keeps using up all my free disk
space -- I made room for a 10GB partition, and that just got
filled up. I've been running kdump on the result for 35 minutes
now and it just keeps scrolling by on my screen. Dunno if I'll
be able to tell anything when it finally stops. This process
does not scale well!

#11 Updated by dillon almost 8 years ago

:I've spent all morning trying to get a ktrace of the problem
:without success. ktrace.out keeps using up all my free disk
:space -- I made room for a 10GB partition, and that just got
:filled up. I've been running kdump on the result for 35 minutes
:now and it just keeps scrolling by on my screen. Dunno if I'll
:be able to tell anything when it finally stops. This process
:does not scale well!

Have you tried trimming the ktrace options? e.g.

ktrace [-f outputfile] -i -t cns ...

Only kdump if you actually get the error, and then you'd probably want
to do something like 'kdump [-f inputfile] | tail -5000 | less'

Be sure you ktrace to a different partition then the one the objects
are stored on.

-Matt
Matthew Dillon
<>

#12 Updated by dillon almost 8 years ago

Well, looking at the 'gmake' source, there are a couple of places
where it can generate 'write error', primarily when it is closing
stdout.

I can't imagine how stdout could have an error, perhaps gmake is doing
a double fclose() on stdout and picking up garbage memory somewhere.
It could also be closing descriptor 1 somehow, then a later attempt
to fclose() stdout fails. I just don't know.

-Matt
Matthew Dillon
<>

#13 Updated by joerg almost 8 years ago

On Sun, Jun 11, 2006 at 02:03:32PM -0700, Matthew Dillon wrote:
> Well, looking at the 'gmake' source, there are a couple of places
> where it can generate 'write error', primarily when it is closing
> stdout.

That's exactly were it is bailing out. The problem comes from earlier
write failures when a program set non-blocking operation on the stdout
descriptor. I have an initial patch for that, but Rumko reported
segfaults from it, which I haven't had time to investigate yet.

Joerg

#14 Updated by dillon almost 8 years ago

:
:On Sun, Jun 11, 2006 at 02:03:32PM -0700, Matthew Dillon wrote:
:> Well, looking at the 'gmake' source, there are a couple of places
:> where it can generate 'write error', primarily when it is closing
:> stdout.
:
:That's exactly were it is bailing out. The problem comes from earlier
:write failures when a program set non-blocking operation on the stdout
:descriptor. I have an initial patch for that, but Rumko reported
:segfaults from it, which I haven't had time to investigate yet.
:
:Joerg

Non-blocking? Why would the stdout descriptor be set to non-blocking?
Is gmake doing that or is it the program fork/exec'ing gmake ?

-Matt
Matthew Dillon
<>

#15 Updated by joerg almost 8 years ago

On Sun, Jun 11, 2006 at 04:11:58PM -0700, Matthew Dillon wrote:
>
> :
> :On Sun, Jun 11, 2006 at 02:03:32PM -0700, Matthew Dillon wrote:
> :> Well, looking at the 'gmake' source, there are a couple of places
> :> where it can generate 'write error', primarily when it is closing
> :> stdout.
> :
> :That's exactly were it is bailing out. The problem comes from earlier
> :write failures when a program set non-blocking operation on the stdout
> :descriptor. I have an initial patch for that, but Rumko reported
> :segfaults from it, which I haven't had time to investigate yet.
> :
> :Joerg
>
> Non-blocking? Why would the stdout descriptor be set to non-blocking?
> Is gmake doing that or is it the program fork/exec'ing gmake ?

libc_r is doing that. E.g. gmake calls a threaded program, the program
sets non-blocking mode and gmake tries to write something to stdout when
it is blocking. This results in the error bit getting set and the error
2 to occur during cleanup.

Joerg

#16 Updated by wa1ter almost 8 years ago

wrote:
> On Sun, Jun 11, 2006 at 04:11:58PM -0700, Matthew Dillon wrote:
>> :
>> :On Sun, Jun 11, 2006 at 02:03:32PM -0700, Matthew Dillon wrote:
>> :> Well, looking at the 'gmake' source, there are a couple of places
>> :> where it can generate 'write error', primarily when it is closing
>> :> stdout.
>> :
>> :That's exactly were it is bailing out. The problem comes from earlier
>> :write failures when a program set non-blocking operation on the stdout
>> :descriptor. I have an initial patch for that, but Rumko reported
>> :segfaults from it, which I haven't had time to investigate yet.
>> :
>> :Joerg
>>
>> Non-blocking? Why would the stdout descriptor be set to non-blocking?
>> Is gmake doing that or is it the program fork/exec'ing gmake ?
>
> libc_r is doing that. E.g. gmake calls a threaded program, the program
> sets non-blocking mode and gmake tries to write something to stdout when
> it is blocking. This results in the error bit getting set and the error
> 2 to occur during cleanup.

Matt and Joerg:

I finally captured an error and I uploaded it to:
leaf.dragonflybsd.org/~wa1ter/err.txt.bz2

It appears to my ignorant eye that the operating system is
denying the existence of directories which really do exist.

If I'm misinterpreting the error messages please correct me!
I need to become a more competent debugger.

#17 Updated by dillon almost 8 years ago

:libc_r is doing that. E.g. gmake calls a threaded program, the program
:sets non-blocking mode and gmake tries to write something to stdout when
:it is blocking. This results in the error bit getting set and the error
:2 to occur during cleanup.
:
:Joerg

The FNONBLOCK flag is a per-file-pointer flag. Hmm. I see two possible
solutions:

(1) A per-descriptor flag which libc_r can set, like close-on-exec
(FD_CLOEXEC). We already have most of the infrastructure required,
including a flags variable that is passed to the FP read and write
code (used with FOF_OFFSET).

(2) new system call entry points for non-blocking read and write, then
libc_r can completely ignore FNONBLOCK and just use the new entry
points.

Otherwise this little problem is going to bite us over and over again
as time passes.

I'm thinking (2) might be the only real solution.

-Matt
Matthew Dillon
<>

#18 Updated by dillon almost 8 years ago

: (2) new system call entry points for non-blocking read and write, then
: libc_r can completely ignore FNONBLOCK and just use the new entry
: points.
:
: Otherwise this little problem is going to bite us over and over again
: as time passes.
:
: I'm thinking (2) might be the only real solution.

Something like this:

nbpread(int d, void *buf, size_t nbytes, off_t offset)
nbpread(int d, const void *buf, size_t nbytes, off_t offset)

(same arguments as pread/pwrite). offset can be -1, indicating that the
current offset should be used and updated. The function would perform
the operation non-blocking regardless of the non blocking setting on the
file pointer.

libc_r would then use the new entry points and would not mess with
the non-blocking flag on the FP at all.

-Matt
Matthew Dillon
<>

#19 Updated by joerg almost 8 years ago

On Sun, Jun 11, 2006 at 05:47:31PM -0700, Matthew Dillon wrote:
> libc_r would then use the new entry points and would not mess with
> the non-blocking flag on the FP at all.

While libc_r is the most typical source of non-blocking descriptors, it
is not that uncommon. I prefer to fix the stdio code to properly deal
with the situation.

I just haven't had the time to polish it enough (yet).

Joerg

#20 Updated by dillon almost 8 years ago

:While libc_r is the most typical source of non-blocking descriptors, it
:is not that uncommon. I prefer to fix the stdio code to properly deal
:with the situation.
:
:I just haven't had the time to polish it enough (yet).
:
:Joerg

I agree stdio has to be fixed. But we need a solution that works
with read() and write() too.

We have a general problem with anyone setting O_NONBLOCK. We have
a specific problem with libc_r in that the program abstraction is
blocking, but libc_r sets O_NONBLOCK.

I wonder... maybe we should physically duplicate the file pointer
when an 'exec' occurs rather then share it.

We might have to do both, but at the very least in order to maintain
the blocking abstraction libc_r can't use O_NONBLOCK.

-Matt
Matthew Dillon
<>

#21 Updated by cedric almost 8 years ago

Matthew Dillon wrote:

> : (2) new system call entry points for non-blocking read and write, then
> : libc_r can completely ignore FNONBLOCK and just use the new entry
> : points.
> :
> : Otherwise this little problem is going to bite us over and over again
> : as time passes.
> :
> : I'm thinking (2) might be the only real solution.
>
> Something like this:
>
> nbpread(int d, void *buf, size_t nbytes, off_t offset)
> nbpread(int d, const void *buf, size_t nbytes, off_t offset)
>
> (same arguments as pread/pwrite). offset can be -1, indicating that the
> current offset should be used and updated. The function would perform
> the operation non-blocking regardless of the non blocking setting on the
> file pointer.
>
> libc_r would then use the new entry points and would not mess with
> the non-blocking flag on the FP at all.

That would be fantastic.

I've banged my head with this problem often enough.

And I'm not the only one, of course:

http://cr.yp.to/unix/nonblock.html

Cedric

#22 Updated by dillon almost 8 years ago

I'm adding 'force' flagged system calls to the kernel now to allow
blocking I/O or non-blocking I/O to be forced.

pread and pwrite happened to already have a pad argument that libc
always passes as 0, so no new system call will be required there.
I am going to convert that argument to a flags argument and add
O_F* 'force' defines.

preadv and pwritev will need new system calls.

I am adding similar MSG_F* 'force' flags for sendmsg and recvmsg.

If I don't get this committed tonight I'll get it committed tomorrow.
Everything is straightforward through the socket NBIO code needs a bit
of rewriting.

-Matt

#23 Updated by dillon almost 8 years ago

:On Sun, Jun 11, 2006 at 05:47:31PM -0700, Matthew Dillon wrote:
:> libc_r would then use the new entry points and would not mess with
:> the non-blocking flag on the FP at all.
:
:While libc_r is the most typical source of non-blocking descriptors, it
:is not that uncommon. I prefer to fix the stdio code to properly deal
:with the situation.
:
:I just haven't had the time to polish it enough (yet).
:
:Joerg

Ok. We have a three-pronged approach. I just committed the kernel side
of things and I'll 'fix' libc_r tomorrow. Joerg will continue to work
on fixes to STDIO.

Basically libc_r will only have to set O_NONBLOCK for accept() and
connect() and that's it. read/write/pread/pwrite will be converted
to __pread and __pwrite calls, readv/writev/preadv/pwritev will be
converted to __preadv/__pwritev, and all the socket message calls
(recv*(), send*()) will use the new MSG_ flags.

-Matt
Matthew Dillon
<>

#24 Updated by schlesinger almost 8 years ago

Am Sonntag, 11. Juni 2006 00:17 schrieb walt:

<snip>

>
> I'm thinking that the soft-update-inconsistency error may have been
> just a red-herring due to a bad sector -- but the 'write error'
> is also occurring in the absence of bad sectors, and was already
> happening on DF-current as long ago as April 25.
>
> As an aside -- firefox pulled in and installed 37 other pkgsrc
> packages as dependencies with not one error, before starting to
> build firefox itself. The only 'write error' I've seen is during
> the building of firefox.

Walt,

I don't know, if it is interesting for you anymore, but I've had a "write
error" yesterday while doing an "cvs up -dP" in /usr/pkgsrc, though there was
enough place on /usr.

Thomas

#25 Updated by wa1ter almost 8 years ago

Thomas Schlesinger wrote:
> Am Sonntag, 11. Juni 2006 00:17 schrieb walt:
>
> <snip>
>
>> I'm thinking that the soft-update-inconsistency error may have been
>> just a red-herring due to a bad sector -- but the 'write error'
>> is also occurring in the absence of bad sectors...

> I don't know, if it is interesting for you anymore, but I've had a "write
> error" yesterday while doing an "cvs up -dP" in /usr/pkgsrc, though there was
> enough place on /usr.

Have you done an 'fsck -f' on your /usr partition? I've been doing
it on every reboot and I haven't seen any more errors. Also, are you
running the latest -CURRENT?

#26 Updated by schlesinger almost 8 years ago

Am Dienstag, 27. Juni 2006 19:34 schrieb walt:
> Thomas Schlesinger wrote:
> > Am Sonntag, 11. Juni 2006 00:17 schrieb walt:
> >
> > <snip>
> >
> >> I'm thinking that the soft-update-inconsistency error may have been
> >> just a red-herring due to a bad sector -- but the 'write error'
> >> is also occurring in the absence of bad sectors...
> >
> > I don't know, if it is interesting for you anymore, but I've had a "write
> > error" yesterday while doing an "cvs up -dP" in /usr/pkgsrc, though there
> > was enough place on /usr.
>
> Have you done an 'fsck -f' on your /usr partition? I've been doing
> it on every reboot and I haven't seen any more errors. Also, are you
> running the latest -CURRENT?

Walt,

I haven't done a 'fsck -f' on my /usr partition, before the error occured. I'm
using -Development, last checked out and built on Jun 25.

Thomas

#27 Updated by dillon almost 8 years ago

:> I don't know, if it is interesting for you anymore, but I've had a "write
:> error" yesterday while doing an "cvs up -dP" in /usr/pkgsrc, though there was
:> enough place on /usr.
:
:Have you done an 'fsck -f' on your /usr partition? I've been doing
:it on every reboot and I haven't seen any more errors. Also, are you
:running the latest -CURRENT?

I will note that all the filesystem fixes are in HEAD as of Preview,
but a lot of them didn't make it to 1.4 due to their complexity.

I have an active bug report regarding a bitmap assertion that occured
once, but it is unclear whether it was left over from a previously fixed
bug or whether it is an indication that there are still FS bugs in the
system.

I am particularly interested in reports of any new ufs_dirbad panics
on HEAD. I've been chasing that one forever and anything >= Preview
will have yet another possible fix for it.

-Matt
Matthew Dillon
<>

#28 Updated by corecode over 7 years ago

can we close this?

Also available in: Atom PDF