Bug #1872
nfs stall (fix pushed)
Status: closed
Description
I have been experiencing nfs stalls with current master during the last few weeks.
Stall is seen during buildkernel, using nfs mount for /usr/src and /usr/obj.
/usr/src is on 2.6.3-rel dfly UP system
/usr/obj is on current master dfly UP system
nfs client is on current master dfly SMP system
All three systems have 2GB RAM and are using HAMMER.
I forced a 'panic' using manual escape to the debugger on the nfs client;
the core dump is uploading to my leaf account, to 23/.
I'm not sure whether this is usable for debugging the problem;
please suggest other things if needed.
-thomas
Updated by thomas.nikolajsen about 14 years ago
It occurred to me that I do have another issue on the system
which nfs exports /usr/obj mount:
I see some CRC errors from disk.
This is a SATA disk in a (cheap) enclosure using eSATA to host.
Will investigate whether these issues are related.
Updated by thomas.nikolajsen about 14 years ago
This issue seems to be real
(not caused by the disk issue mentioned in the previous message).
Connecting disk directly to host (vs via disk enclosure w/ sata-sata bridge),
doesn't change symptoms regarding nfs: it still stalls.
Nfs stall is usually seen after building a few kernels,
especially if the nfs client also makes other use of the nfs mounts:
e.g. `du /usr/src; du /usr/obj'.
A new forced crash dump will be uploaded later today,
this is test case without disk CRC errors.
(I guess the present crash dump (*.23) reflects the nfs/networking problem,
i.e. that it isn't affected by the disk CRC errors, though)
Updated by thomas.nikolajsen about 14 years ago
I did some more tests: issue not seen w/ UP kernel (GENERIC) on client.
Other things which didn't help:
- disable swapcache(8)
- using UDP for nfs mount
All builds are '-j 4' (or '-j 3'; actually '-j NCPU+2').
I also had a panic while doing 'reboot' on client during a nfs stall:
Fatal trap 12: page fault while in kernel mode
..
db> trace
nfs_removerpc(..)
..
(I jotted down more details; please respond if they are needed)
On most occasions doing 'reboot' during nfs stall just stalled.
Updated by thomas.nikolajsen about 14 years ago
Btw: nfs client forced crash dump done during nfs stall (w/o disk issue)
uploaded to *.24 on leaf.
Updated by dillon about 14 years ago
:Thomas Nikolajsen <thomas.nikolajsen@mail.dk> added the comment:
:
:Btw: nfs client forced crash dump done during nfs stall (w/o disk issue)
:uploaded to *.24 on leaf.
Thomas, please turn off the fairq disk scheduler and see if the
problem persists.
-Matt
Matthew Dillon
<dillon@backplane.com>
Updated by dillon about 14 years ago
:Thomas Nikolajsen <thomas.nikolajsen@mail.dk> added the comment:
:
:Btw: nfs client forced crash dump done during nfs stall (w/o disk issue)
:uploaded to *.24 on leaf.
Thomas, I think I have this reproduced reliably. I should be able
to get it fixed today.
-Matt
Matthew Dillon
<dillon@backplane.com>
Updated by dillon about 14 years ago
: Thomas, I think I have this reproduced reliably. I should be able
: to get it fixed today.
:
Ok, false alarm... I am not reproducing it :-(. It's still possibly
the fair queue disk queue.
What version nfs are your mounts? They should be v3. I found a bug
in the NFSv2 write code which I am going to fix but it should not affect
v3.
-Matt
Matthew Dillon
<dillon@backplane.com>
Updated by dillon about 14 years ago
Thomas, please test the latest master or the latest 2.6.x. I've fixed
multiple bugs.
The server bug existed on both 2.6 and 2.7 so you will have to upgrade
your 2.6 server to test the fix. I did MFC just the server cache fix
to 2.6 so you can upgrade to the latest on the 2.6 branch
(DragonFly_RELEASE_2_6), you do not have to upgrade your server to
master if you do not want to.
Please test.
-Matt
Updated by thomas.nikolajsen about 14 years ago
I will test on Saturday (on vacation).
Thanks for fixing!
Updated by thomas.nikolajsen about 14 years ago
Re. NFS options:
I am using default NFS mount options
(nfsv3, rdirplus, tcp
on current master on DragonFly)
Updated by thomas.nikolajsen about 14 years ago
I have updated to latest master / 2.6:
I still see nfs stall, but not so often,
now with message below on console.
Is NFS supposed to recover from this?
I use default values for these sysctls (2M, 64),
which values would be applicable?
As all systems have plenty RAM (2GB) for the use,
it would be nice if values were auto tuned so this didn't happen.
I have a forced core dump; please say if I should upload it.
- console
Warning: NFS: Insufficient sendspace (2123008):
You must increase vfs.nfs.soreserve or decrease vfs.nfs.maxasyncbio
Updated by thomas.nikolajsen about 14 years ago
I reduced vfs.nfs.maxasyncbio to 16 (from default of 64),
and haven't seen the 'Insufficient sendspace' since.
NFS stall can still be experienced, if I do 'too many' things at once,
e.g. doing `git status' with repo and checkout on NFS mounts while
doing `make -j3 buildworld' did stall NFS.
Forced core dump avail. on request.
Using an x86_64 system also still stalls NFS (above was i386);
here the forced core dump didn't work out: `panic' failed to generate a core dump.
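The sysctl change described above can be sketched as the commands below. The two sysctl names come from the console warning; the value 16 is the one reported in this comment. Persisting the setting in /etc/sysctl.conf is an assumption added for illustration and is not something the thread mentions. The commands need root on the NFS client.

```shell
# Show the current values of the two sysctls named in the console warning.
sysctl vfs.nfs.soreserve vfs.nfs.maxasyncbio

# Lower the async bio limit at runtime (64 was the default reported above).
sysctl vfs.nfs.maxasyncbio=16

# Assumption: keep the setting across reboots via /etc/sysctl.conf.
echo 'vfs.nfs.maxasyncbio=16' >> /etc/sysctl.conf
```

As the rest of this comment notes, this only silenced the warning; it did not remove the stall.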
Updated by dillon about 14 years ago
:Thomas Nikolajsen <thomas.nikolajsen@mail.dk> added the comment:
:
:I reduced vfs.nfs.maxasyncbio to 16 (from default of 64),
:and haven't seen the 'Insufficient sendspace' since.
:
:NFS stall can still be experienced, if I do 'too many' things at once,
:e.g. doing `git status' with repo and checkout on NFS mounts while
:doing `make -j3 buildworld' did stall NFS.
:
:Forced core dump avail. on request.
:
:Using an x86_64 system also still stalls NFS (above was i386);
:here the forced core dump didn't work out: `panic' failed to generate a core dump.
Ok, I'm attempting to reproduce the issue. It sounds like there is a
code path that has a bug in it that isn't normally executed unless the
socket buffer fills up. Theoretically it should be ok for the socket
buffer to limit out but it kinda sounds like data is being lost when
that happens and causing rpcs to be lost.
-Matt
Matthew Dillon
<dillon@backplane.com>
Updated by dillon about 14 years ago
:I reduced vfs.nfs.maxasyncbio to 16 (from default of 64),
:and haven't seen the 'Insufficient sendspace' since.
:
:NFS stall can still be experienced, if I do 'too many' things at once,
:e.g. doing `git status' with repo and checkout on NFS mounts while
:doing `make -j3 buildworld' did stall NFS.
:
:Forced core dump avail. on request.
:
:Using an x86_64 system also still stalls NFS (above was i386);
:here the forced core dump didn't work out: `panic' failed to generate a core dump.
Are you always getting the 'NFS: Insufficient sendspace' message
when a stall occurs or do stalls occur sometimes without the
message?
The Insufficient sendspace message can only occur with UDP. What
happens if you use a TCP NFS mount? Are you able to stall the mount
with a TCP NFS mount?
-Matt
Matthew Dillon
<dillon@backplane.com>
Updated by dillon about 14 years ago
::I reduced vfs.nfs.maxasyncbio to 16 (from default of 64),
::and haven't seen the 'Insufficient sendspace' since.
::
::NFS stall can still be experienced, if I do 'too many' things at once,
::e.g. doing `git status' with repo and checkout on NFS mounts while
::doing `make -j3 buildworld' did stall NFS.
::
::Forced core dump avail. on request.
I have pushed a fix to 2.8 and master for UDP mounts sometimes
losing track of rpc replies. This only applies to UDP mounts.
TCP mounts should not have had this issue.
It is not 100% tested yet but I'd like to see if it solves the
UDP NFS mount issue you are reporting.
-Matt
Matthew Dillon
<dillon@backplane.com>
Updated by thomas.nikolajsen about 14 years ago
I do use one UDP NFS mount (/usr/obj), the others are TCP.
(I had all TCP, but changed some to UDP for debugging, and forgot about it..)
'Insufficient sendspace' warning normally isn't seen on stall;
actually I only saw it once (as far as I remember).
I am testing latest change at the moment.
Updated by thomas.nikolajsen about 14 years ago
Result from tests:
I still get nfs stall, for both TCP only, and UDP & TCP nfs use.
As written earlier: the first fix pushed for this issue did
reduce the problem considerably.
Updated by dillon about 14 years ago
:
:Thomas Nikolajsen <thomas.nikolajsen@mail.dk> added the comment:
:
:Result from tests:
:I still get nfs stall, for both TCP only, and UDP & TCP nfs use.
:
:As written earlier: the first fix pushed for this issue did
:reduce the problem considerably.
Ok, I'm back to trying to fix this last niggling issue. What test
regimen are you using to reproduce the problem and are you using any
special mount options?
-Matt
Matthew Dillon
<dillon@backplane.com>
Updated by thomas.nikolajsen about 14 years ago
Test setup is:
NFS using TCP
/usr/src is on current 2.8 dfly UP system
/usr/obj is on current master dfly UP system
NFS client is on current master dfly SMP system
NFS exports using:
/DIR -maproot=nfsroot -alldirs -network=192.168.1.128 -mask=255.255.255.128
Client is using su to nfsroot during build.
All three systems have 2GB RAM and are using HAMMER.
swapcache is used on client.
dsched fq is used on all systems.
Stall is seen during buildkernel, using NFS mount for /usr/src and /usr/obj.
Stall still seen with:
- UDP mount
- dsched noop
No stall with:
- UP kernel on client
(will retest w/o swapcache)
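As a side note on the exports line in this setup: the -network/-mask pair restricts the export to clients in 192.168.1.128 through 192.168.1.255. A minimal sketch of that check in plain sh follows; the helper names ip_to_int and in_network and the two sample client addresses are made up for illustration and are not part of any NFS tooling.

```shell
# Convert a dotted-quad IPv4 address to a 32-bit integer.
ip_to_int() {
    old_ifs=$IFS
    IFS=.
    set -- $1            # split the address on dots into $1..$4
    IFS=$old_ifs
    echo $(( ($1 << 24) | ($2 << 16) | ($3 << 8) | $4 ))
}

# Succeed if address $1 lies in network $2 with netmask $3.
in_network() {
    addr=$(ip_to_int "$1")
    net=$(ip_to_int "$2")
    mask=$(ip_to_int "$3")
    [ $(( addr & mask )) -eq $(( net & mask )) ]
}

# Hypothetical client addresses checked against the export from this comment.
in_network 192.168.1.130 192.168.1.128 255.255.255.128 && echo "allowed"
in_network 192.168.1.20  192.168.1.128 255.255.255.128 || echo "not covered"
```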
Updated by thomas.nikolajsen about 14 years ago
A few questions from IRC:
<dillon> is it basically /usr/src and /usr/obj ?
<dillon> or do you have things like /tmp NFS mounted too ?
I have a bunch of other NFS mounts (e.g. my home dir.), not /tmp or /var/tmp
though, they are on tmpfs and HAMMER respectively.
The traffic on the other NFS mounts is quite minimal during test.
<dillon> when a program stalls can you still access the mount point(s) ?
This is a bit mixed, I think: on some stalls even a new root login stalls,
although this uses no NFS (root home dir. is on local UFS or HAMMER);
on other stalls the other NFS mounts do respond up to some point, after which
they also stall.
On stall I often (but not always) get console messages; don't remember exact
wording, it is in core dumps uploaded previously for this issue.
Should I upload some fresh forced core dumps?
<dillon> I also need to know how much memory the client & server have, maybe
my test box just has too much memory to hit the conditions right
Both client and servers have 2GB RAM.
<dillon> I'm not sure if it is a server-side issue or a client-side issue,
which is making it difficult to track down
Any hint on how I can help narrow this down?
Updated by thomas.nikolajsen about 14 years ago
Some more info:
console messages are like:
got bad cookie vp 0xe09ddf38 bp 0xc480aa6c
[diagnostic] cache_lock: blocked on 0xe32b26d8 "rate_sample"
The bad cookie lines don't seem to cause a stall, only the cache_lock lines.
The stall seems to be on /usr/src, as doing 'du /usr/obj' on the same NFS client
works after a stall, but 'du /usr/src' stalls, with the same cache_lock console
message, when it reaches the file.
From other NFS client no problem is seen, e.g. installkernel / installworld
worked.
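The per-mount check described above (one mount still answers while the other hangs) could be scripted roughly as follows. This is only a sketch: the helper name responds_within and the 10 second limit are assumptions. The probe is deliberately left running in the background when it does not finish, since a process stuck on a hard NFS mount may not be killable.

```shell
# Report whether a command finishes within a number of seconds.
# Usage: responds_within SECONDS command [args...]
responds_within() {
    limit=$1
    shift
    "$@" >/dev/null 2>&1 &     # run the probe in the background
    pid=$!
    waited=0
    while kill -0 "$pid" 2>/dev/null; do
        if [ "$waited" -ge "$limit" ]; then
            return 1           # still running: treat as stalled
        fi
        sleep 1
        waited=$((waited + 1))
    done
    return 0
}

# Example use against the two mounts from this report:
# for m in /usr/src /usr/obj; do
#     responds_within 10 ls "$m" || echo "$m appears stalled"
# done
```

A probe such as `ls` on the mount point only touches the top-level directory; the report above suggests the hang may only show up once a specific file is reached, so a deeper walk (for example `du`) with a longer limit may be needed.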
Updated by dillon about 14 years ago
:Thomas Nikolajsen <thomas.nikolajsen@mail.dk> added the comment:
:
:Some more info:
:console messages are like:
:got bad cookie vp 0xe09ddf38 bp 0xc480aa6c
:[diagnostic] cache_lock: blocked on 0xe32b26d8 "rate_sample"
:
:The bad cookie lines don't seem to cause a stall, only the cache_lock lines.
The cache lock lines are a side effect of whatever is getting
stalled probably also holding a vnode lock.
So what we have here is a likely situation where the SMP NFS client
is to blame. With that in mind a forced core dump would be useful
but to make it easier to track down please try to kill all the normal
processes on the box, leaving just the ones that cannot be killed.
Note that normal nfs mounts must be used. Do not use the 'intr' or 'soft'
options or even the stalled processes might be killable, which is not
what we want.
-Matt
Updated by thomas.nikolajsen about 14 years ago
I have uploaded a forced crash dump to ~/crash/33 on my leaf account.
Did kill some unrelated procs, some more might still have killable.
Hope this can narrow problem somewhat.
Updated by dillon about 14 years ago
:I have uploaded a forced crash dump to ~/crash/33 on my leaf account.
:
:Did kill some unrelated procs; some more might still have been killable.
:
:Hope this can narrow problem somewhat.
Yes, I think I found it. Please try this patch:
fetch http://apollo.backplane.com/DFlyMisc/nfs04.patch
-Matt
Matthew Dillon
<dillon@backplane.com>
Updated by dillon about 14 years ago
::I have uploaded a forced crash dump to ~/crash/33 on my leaf account.
::
::Did kill some unrelated procs; some more might still have been killable.
::
::Hope this can narrow problem somewhat.
:
Here's an updated patch, adding a lock around one little bit that
I missed in the first patch:
fetch http://apollo.backplane.com/DFlyMisc/nfs05.patch
Matthew Dillon
<dillon@backplane.com>
Updated by dillon about 14 years ago
Updated patch #3, fixes syntax error with patch #2.
fetch http://apollo.backplane.com/DFlyMisc/nfs06.patch
For some reason this patch also seems to fix the seg-fault
issues I've been having on x86-64 for ages and ages (where a buildworld
would occasionally seg-fault). My /usr/src has always been mounted NFS
on my test boxes, but I never expected it could be the cause of the
seg-fault bug. So far my buildworld loop has run 84+ iterations with
no faults.
-Matt
Matthew Dillon
<dillon@backplane.com>
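A buildworld loop like the one mentioned above can be sketched in plain sh as below. The helper name loop_until_fail, the iteration limit, and the -j value in the usage comment are assumptions for illustration; the actual loop used for the 84+ iterations is not shown in this thread.

```shell
# Run a command repeatedly until it fails or a limit is reached.
# Usage: loop_until_fail MAX_ITERATIONS command [args...]
loop_until_fail() {
    max=$1
    shift
    i=0
    while [ "$i" -lt "$max" ]; do
        if ! "$@"; then
            echo "failed on iteration $((i + 1))"
            return 1
        fi
        i=$((i + 1))
    done
    echo "completed $i iterations with no failures"
}

# Example invocation (assumed path and job count):
# cd /usr/src && loop_until_fail 100 make -j3 buildworld
```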
Updated by thomas.nikolajsen about 14 years ago
This fixes my problem too :)
Thanks for fixing issue!