Bug #1872: nfs stall (fix pushed) - DragonFlyBSD - DragonFlyBSD bugtracker

Added by thomas.nikolajsen over 14 years ago. Updated over 14 years ago.

Status: Closed
Priority: Normal

Description

I have been experiencing nfs stalls with current master over the last few weeks.

Stall is seen during buildkernel, using nfs mount for /usr/src and /usr/obj.

/usr/src is on 2.6.3-rel dfly UP system
/usr/obj is on current master dfly UP system
nfs client is on current master dfly SMP system

All three systems have 2GB RAM and are using HAMMER.

I forced a 'panic' by manually escaping to the debugger on the nfs client;
the core dump is being uploaded to 23/ on my leaf account.

I'm not sure whether this is usable for debugging the problem;
please suggest other things if needed.

-thomas

Updated by thomas.nikolajsen over 14 years ago

It occurred to me that I do have another issue on the system
which nfs exports /usr/obj mount:
I see some CRC errors from disk.

This is a SATA disk in a (cheap) enclosure using eSATA to host.

Will investigate whether these issues are related.

Updated by thomas.nikolajsen over 14 years ago

This issue seems to be real
(not caused by the disk issue mentioned in the previous message).

Connecting the disk directly to the host (vs. via the disk enclosure w/ sata-sata bridge)
doesn't change the symptoms regarding nfs: it still stalls.

The nfs stall is usually seen after building a few kernels,
especially if the nfs client also makes other use of nfs mounts:
e.g. `du /usr/src; du /usr/obj'.

A new forced crash dump will be uploaded later today;
this is a test case without disk CRC errors.
(I guess the present crash dump (*.23) reflects the nfs/networking problem,
i.e. that it isn't affected by the disk CRC errors, though.)

Updated by thomas.nikolajsen over 14 years ago

I did some more tests: issue not seen w/ UP kernel (GENERIC) on client.

Other things which didn't help:
- disable swapcache(8)
- using UDP for nfs mount

All builds are '-j 4' (or '-j 3'; actually '-j NCPU+2').

I also had a panic while doing 'reboot' on the client during an nfs stall:
Fatal trap 12: page fault while in kernel mode
..
db> trace
nfs_removerpc(..)
..
(I jotted down more details; please respond if they are needed)

On most occasions doing 'reboot' during nfs stall just stalled.

Updated by thomas.nikolajsen over 14 years ago

Btw: nfs client forced crash dump done during nfs stall (w/o disk issue)
uploaded to *.24 on leaf.

Updated by dillon over 14 years ago

:Thomas Nikolajsen <thomas.nikolajsen@mail.dk> added the comment:
:
:Btw: nfs client forced crash dump done during nfs stall (w/o disk issue)
:uploaded to *.24 on leaf.

Thomas, please turn off the fairq disk scheduler and see if the
problem persists.

-Matt
                    Matthew Dillon
                    <dillon@backplane.com>

Updated by dillon over 14 years ago

:Thomas Nikolajsen <thomas.nikolajsen@mail.dk> added the comment:
:
:Btw: nfs client forced crash dump done during nfs stall (w/o disk issue)
:uploaded to *.24 on leaf.

Thomas, I think I have this reproduced reliably.  I should be able
to get it fixed today.

-Matt
                    Matthew Dillon
                    <dillon@backplane.com>

Updated by dillon over 14 years ago

: Thomas, I think I have this reproduced reliably. I should be able
: to get it fixed today.
:

Ok, false alarm... I am not reproducing it :-(.  It's still possibly
the fair queue disk queue.

What NFS version are your mounts?  They should be v3.  I found a bug
in the NFSv2 write code which I am going to fix, but it should not affect
v3.

-Matt
                    Matthew Dillon
                    <dillon@backplane.com>

Updated by dillon over 14 years ago

Thomas, please test the latest master or the latest 2.6.x. I've fixed
multiple bugs.

The server bug existed on both 2.6 and 2.7, so you will have to upgrade
your 2.6 server to test the fix.  I did MFC just the server cache fix
to 2.6, so you can upgrade to the latest on the 2.6 branch
(DragonFly_RELEASE_2_6); you do not have to upgrade your server to
master if you do not want to.

Please test.

-Matt

Updated by thomas.nikolajsen over 14 years ago

I will test on Saturday (on vacation).
Thanks for fixing!

#10

Updated by thomas.nikolajsen over 14 years ago

Re. NFS options:
I am using default NFS mount options
(nfsv3, rdirplus, tcp
on current master on DragonFly)

#11

Updated by thomas.nikolajsen over 14 years ago

I have updated to latest master / 2.6:
I still see nfs stalls, but not so often,
now with the message below on console.

Is NFS supposed to recover from this?

I use default values for these sysctls (2M, 64);
which values would be applicable?

As all systems have plenty of RAM (2GB) for the use,
it would be nice if values were auto tuned so this didn't happen.

I have a forced core dump; please say if I should upload it.

- console
Warning: NFS: Insufficient sendspace (2123008):
You must increase vfs.nfs.soreserve or decrease vfs.nfs.maxasyncbio

#12

Updated by thomas.nikolajsen over 14 years ago

I reduced vfs.nfs.maxasyncbio to 16 (from default of 64),
and haven't seen the 'Insufficient sendspace' since.

NFS stall can still be experienced, if I do 'too many' things at once,
e.g. doing `git status' with repo and checkout on NFS mounts while
doing `make -j3 buildworld' did stall NFS.

Forced core dump avail. on request.

An x86_64 system also still stalls NFS (above was i386);
here the forced core dump didn't work out: `panic' failed to generate a core dump.
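
The combined load described above (a `git status' on an NFS-mounted checkout while a parallel build runs) could be scripted roughly as follows. This is only a sketch: the helper name run_concurrently and the working directories shown in the comment are assumptions for illustration, not part of the reported setup.

```python
import subprocess


def run_concurrently(commands):
    """Start every shell command at once and return their exit codes in order."""
    # All children are started before any of them is waited for,
    # so the commands really do overlap in time.
    procs = [subprocess.Popen(cmd, shell=True) for cmd in commands]
    return [p.wait() for p in procs]


# Possible use on the NFS client (hypothetical paths; adjust to the real
# location of the git checkout and the source tree):
#
#   run_concurrently([
#       "cd /usr/src && git status",
#       "cd /usr/src && make -j3 buildworld",
#   ])
```

If NFS stalls, the call simply never returns, which is the symptom being reproduced.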

#13

Updated by dillon over 14 years ago

:Thomas Nikolajsen <thomas.nikolajsen@mail.dk> added the comment:
:
:I reduced vfs.nfs.maxasyncbio to 16 (from default of 64),
:and haven't seen the 'Insufficient sendspace' since.
:
:NFS stall can still be experienced, if I do 'too many' things at once,
:e.g. doing `git status' with repo and checkout on NFS mounts while
:doing `make -j3 buildworld' did stall NFS.
:
:Forced core dump avail. on request.
:
:Using x86_64 system also still stall NFS (above was i386);
:here forced core dump didn't work out, `panic' failed to generate core dump.

Ok, I'm attempting to reproduce the issue.  It sounds like there is a
code path that has a bug in it that isn't normally executed unless the
socket buffer fills up.  Theoretically it should be ok for the socket
buffer to limit out but it kinda sounds like data is being lost when
that happens and causing rpcs to be lost.

-Matt
                    Matthew Dillon
                    <dillon@backplane.com>

#14

Updated by dillon over 14 years ago

:I reduced vfs.nfs.maxasyncbio to 16 (from default of 64),
:and haven't seen the 'Insufficient sendspace' since.
:
:NFS stall can still be experienced, if I do 'too many' things at once,
:e.g. doing `git status' with repo and checkout on NFS mounts while
:doing `make -j3 buildworld' did stall NFS.
:
:Forced core dump avail. on request.
:
:Using x86_64 system also still stall NFS (above was i386);
:here forced core dump didn't work out, `panic' failed to generate core dump.

Are you always getting the 'NFS: Insufficient sendspace' message
when a stall occurs or do stalls occur sometimes without the
message?

The Insufficient sendspace message can only occur with UDP.  What
happens if you use a TCP NFS mount?  Are you able to stall the mount
with a TCP NFS mount?

-Matt
                    Matthew Dillon
                    <dillon@backplane.com>

#15

Updated by dillon over 14 years ago

::I reduced vfs.nfs.maxasyncbio to 16 (from default of 64),
::and haven't seen the 'Insufficient sendspace' since.
::
::NFS stall can still be experienced, if I do 'too many' things at once,
::e.g. doing `git status' with repo and checkout on NFS mounts while
::doing `make -j3 buildworld' did stall NFS.
::
::Forced core dump avail. on request.

I have pushed a fix to 2.8 and master for UDP mounts sometimes
losing track of rpc replies.  This only applies to UDP mounts.
TCP mounts should not have had this issue.

It is not 100% tested yet but I'd like to see if it solves the
UDP NFS mount issue you are reporting.

-Matt
                    Matthew Dillon
                    <dillon@backplane.com>

#16

Updated by thomas.nikolajsen over 14 years ago

I do use one UDP NFS mount (/usr/obj); the others are TCP.
(I had all TCP, but changed some to UDP for debugging, and forgot about it..)

'Insufficient sendspace' warning normally isn't seen on stall;
actually I only saw it once (as far as I remember).

I am testing latest change at the moment.

#17

Updated by thomas.nikolajsen over 14 years ago

Result from tests:
I still get nfs stalls, both with TCP only and with mixed UDP & TCP nfs use.

As written earlier: the first fix pushed for this issue did
reduce the problem considerably.

#18

Updated by dillon over 14 years ago

:
:Thomas Nikolajsen <thomas.nikolajsen@mail.dk> added the comment:
:
:Result from tests:
:I still get nfs stall, for both TCP only, and UDP & TCP nfs use.
:
:As written earlier: the first fix pushed for this issue did
:reduce the problem considerably.

Ok, I'm back to trying to fix this last niggling issue.  What test
regimen are you using to reproduce the problem and are you using any
special mount options?

-Matt
                    Matthew Dillon
                    <dillon@backplane.com>

#19

Updated by thomas.nikolajsen over 14 years ago

Test setup is:

NFS using TCP
/usr/src is on current 2.8 dfly UP system
/usr/obj is on current master dfly UP system
NFS client is on current master dfly SMP system

NFS exports using:
/DIR -maproot=nfsroot -alldirs -network=192.168.1.128 -mask=255.255.255.128

Client is using su to nfsroot during build.
All three systems have 2GB RAM and are using HAMMER.

swapcache is used on client.
dsched fq is used on all systems.

Stall is seen during buildkernel, using NFS mount for /usr/src and /usr/obj.

Stall still seen with:
- UDP mount
- dsched noop

No stall with:
- UP kernel on client

(will retest w/o swapcache)
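
As a side note on the setup above, the -network=192.168.1.128 -mask=255.255.255.128 pair in the exports line limits the export to the upper half of 192.168.1.0/24. A small sketch using Python's standard ipaddress module shows the range; the helper name client_allowed is made up for illustration, and only the network and mask come from the exports line.

```python
import ipaddress

# Network and mask copied from the exports line; everything else is illustrative.
EXPORT_NET = ipaddress.ip_network("192.168.1.128/255.255.255.128")


def client_allowed(addr):
    """Return True if addr lies inside the exported network."""
    return ipaddress.ip_address(addr) in EXPORT_NET


hosts = list(EXPORT_NET.hosts())
print(EXPORT_NET)            # 192.168.1.128/25
print(hosts[0], hosts[-1])   # 192.168.1.129 192.168.1.254
```

So a client must have an address from 192.168.1.129 to 192.168.1.254 to match this export.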

#20

Updated by thomas.nikolajsen over 14 years ago

A few questions from IRC:
<dillon> is it basically /usr/src and /usr/obj ?
<dillon> or do you have things like /tmp NFS mounted too ?
I have a bunch of other NFS mounts (e.g. my home dir.), not /tmp or /var/tmp
though; they are on tmpfs and HAMMER respectively.
The traffic on the other NFS mounts is quite minimal during test.

<dillon> when a program stalls can you still access the mount points(s) ?
This is a bit mixed I think: on some stalls even a new root login stalls, though this
uses no NFS (root home dir. is on local UFS or HAMMER);
on some stalls other NFS mounts do respond up to some point, after which they also
stall.
On a stall I often (but not always) get console messages; I don't remember the exact
wording, but it is in the core dumps uploaded previously for this issue.
Should I upload some fresh forced core dumps?

<dillon> I also need to know how much memory the client & server have, maybe
my test box just has too much memory to hit the conditions right
Both client and servers have 2GB RAM.

<dillon> I'm not sure if it is a server-side issue or a client-side issue,
which is making it difficult to track down
Any hint on how I can help narrow this down?

#21

Updated by thomas.nikolajsen over 14 years ago

Some more info:
console messages are like:
got bad cookie vp 0xe09ddf38 bp 0xc480aa6c
[diagnostic] cache_lock: blocked on 0xe32b26d8 "rate_sample"

The bad cookie lines don't seem to cause a stall, only the cache_lock lines.

The stall seems to be on /usr/src, as doing 'du /usr/obj' on the same NFS client
works after a stall, but 'du /usr/src' stalls, with the same cache_lock console
message, when it reaches the file.

From another NFS client no problem is seen, e.g. installkernel / installworld
worked.
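
To check which mounts still respond during a stall without hanging the checking shell itself, a probe along these lines might help. It is a rough sketch, not a tested diagnostic: the function name mount_responds and the 5 second default are assumptions, and it only distinguishes "directory listing finished in time" from everything else.

```python
import subprocess
import sys


def mount_responds(path, timeout=5.0):
    """Return True if listing `path` succeeds within `timeout` seconds.

    Returns False if the listing fails or does not finish in time.
    """
    # A child process does the access, so a hung NFS call blocks the child
    # rather than this script.
    child = subprocess.Popen(
        [sys.executable, "-c", "import os, sys; os.listdir(sys.argv[1])", path],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    try:
        return child.wait(timeout=timeout) == 0
    except subprocess.TimeoutExpired:
        # On a hard mount the child may not be killable; send the signal
        # but do not wait for it, or this script would hang as well.
        child.kill()
        return False


# Possible use on the client: mount_responds("/usr/src"), mount_responds("/usr/obj")
```

A child stuck on a hard NFS mount may linger until the mount recovers, so the probe should be used sparingly.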

#22

Updated by dillon over 14 years ago

:Thomas Nikolajsen <thomas.nikolajsen@mail.dk> added the comment:
:
:Some more info:
:console messages are like:
:got bad cookie vp 0xe09ddf38 bp 0xc480aa6c
:[diagnostic] cache_lock: blocked on 0xe32b26d8 "rate_sample"
:
:The bad cookie lines doesn't seem to cause stall, only the cache_lock lines.

The cache_lock lines are side effects: whatever is getting stalled
is probably also holding a vnode lock.

So what we have here is a likely situation where the SMP NFS client
is to blame.  With that in mind a forced core dump would be useful,
but to make it easier to track down please try to kill all the normal
processes on the box, leaving just the ones that cannot be killed.

Note that normal nfs mounts must be used.  Do not use the 'intr' or 'soft'
options or even the stalled processes might be killable, which is not
what we want.

-Matt

#23

Updated by thomas.nikolajsen over 14 years ago

I have uploaded a forced crash dump to ~/crash/33 on my leaf account.

I did kill some unrelated procs; some more might still have been killable.

Hope this can narrow down the problem somewhat.

#24

Updated by dillon over 14 years ago

:I have uploaded a forced crash dump to ~/crash/33 on my leaf account.
:
:Did kill some unrelated procs, some more might still have killable.
:
:Hope this can narrow problem somewhat.

Yes, I think I found it.  Please try this patch:

fetch http://apollo.backplane.com/DFlyMisc/nfs04.patch

-Matt
                    Matthew Dillon
                    <dillon@backplane.com>

#25

Updated by dillon over 14 years ago

::I have uploaded a forced crash dump to ~/crash/33 on my leaf account.
::
::Did kill some unrelated procs, some more might still have killable.
::
::Hope this can narrow problem somewhat.
:

Here's an updated patch, adding a lock around one little bit that
I missed in the first patch:

fetch http://apollo.backplane.com/DFlyMisc/nfs05.patch

                    Matthew Dillon
                    <dillon@backplane.com>

#26

Updated by dillon over 14 years ago

Updated patch #3, fixes syntax error with patch #2.

fetch http://apollo.backplane.com/DFlyMisc/nfs06.patch

For some reason this patch also seems to fix the seg-fault
issues I've been having on x86-64 for ages and ages (where a buildworld
would occasionally seg-fault).  My /usr/src has always been mounted NFS
on my test boxes, but I never expected it could be the cause of the
seg-fault bug.  So far my buildworld loop has run 84+ iterations with
no faults.

-Matt
                    Matthew Dillon
                    <dillon@backplane.com>

#27

Updated by thomas.nikolajsen over 14 years ago

This fixes my problem too :)
Thanks for fixing the issue!
