Bug #1715

Deadlock on NFS server

Added by ftigeot about 4 years ago. Updated over 3 years ago.

Status: Closed
Start date:
Priority: Normal
Due date:
Assignee: -
% Done: 0%
Category: -
Target version: -

Description

I just got a new sort of deadlock on an NFS server.

During the extraction of a tarball archive by a FreeBSD client, the server
became unresponsive.

It could still answer pings, but all other sorts of network activity failed.
SSH sessions hung, there was no answer to telnet commands, etc.

On the console there was a single message:

Warning, objcache(mbuf pkt hdr): Exhausted!

I could switch virtual terminals with ALT-F1 and so on. The keyboard appeared at
first to function as usual: I could enter a user name and a password, but as
soon as I pressed Enter for the second time, the machine hung for good.

I was able to obtain a crash dump by escaping to the debugger with
ctrl-alt-esc. The files are available here:

http://www.wolfpond.org/crash.dfly/

History

#1 Updated by dillon about 4 years ago

:
:I just got a new sort of deadlock on an NFS server.
:
:During the extraction of a tarball archive by a FreeBSD client, the server
:became unresponsive.
:
:It could still answer pings, but all other sorts of network activity failed.
:SSH sessions hung, there was no answer to telnet commands, etc.
:...
:On the console there was a single message:
:
: Warning, objcache(mbuf pkt hdr): Exhausted!

How much memory does the server have? What does 'netstat -m' say
on the server?

-Matt

#2 Updated by ftigeot about 4 years ago

On Tue, Apr 06, 2010 at 10:35:37AM -0700, Matthew Dillon wrote:
> :
> :I just got a new sort of deadlock on an NFS server.
> :
> :During the extraction of a tarball archive by a FreeBSD client, the server
> :became unresponsive.
> :
> :It could still answer pings, but all other sorts of network activity failed.
> :SSH sessions hung, there was no answer to telnet commands, etc.
> :...
> :On the console there was a single message:
> :
> : Warning, objcache(mbuf pkt hdr): Exhausted!
>
> How much memory does the server have? What does 'netstat -m' say
> on the server?

Actually, it is the same machine which had all sorts of deadlock troubles with
the hammer cleanup code.

The main memory is 2GB.

# netstat -m
7/13312 mbufs in use (current/max):
469/6656 mbuf clusters in use (current/max)
257 mbufs and mbuf clusters allocated to data
6 mbufs and mbuf clusters allocated to packet headers
939 Kbytes allocated to network (5% of mb_map in use)
260 requests for memory denied
0 requests for memory delayed
0 calls to protocol drain routines

The NFS service has not really been used since the last reboot. Should I try
to reproduce the deadlock conditions and monitor the netstat -m output?
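
One way to do that kind of monitoring is to sample the output periodically and watch whether the "requests for memory denied" counter grows. Below is a minimal Python sketch, assuming the output keeps the layout shown in the listing above. The helper names parse_netstat_m and denied_increase are hypothetical and not part of any existing tool.

```python
import re

# Sample copied from the netstat -m listing in this report.
SAMPLE = """\
7/13312 mbufs in use (current/max):
469/6656 mbuf clusters in use (current/max)
257 mbufs and mbuf clusters allocated to data
6 mbufs and mbuf clusters allocated to packet headers
939 Kbytes allocated to network (5% of mb_map in use)
260 requests for memory denied
0 requests for memory delayed
0 calls to protocol drain routines
"""


def parse_netstat_m(text):
    """Extract a few counters from `netstat -m` text (hypothetical helper).

    Assumes the line layout shown in SAMPLE; lines that do not match
    are simply left out of the result.
    """
    stats = {}
    m = re.search(r"(\d+)/(\d+) mbufs in use", text)
    if m:
        stats["mbufs_cur"] = int(m.group(1))
        stats["mbufs_max"] = int(m.group(2))
    m = re.search(r"(\d+)/(\d+) mbuf clusters in use", text)
    if m:
        stats["clusters_cur"] = int(m.group(1))
        stats["clusters_max"] = int(m.group(2))
    m = re.search(r"(\d+) requests for memory denied", text)
    if m:
        stats["denied"] = int(m.group(1))
    return stats


def denied_increase(before, after):
    """How many more allocation requests were denied between two samples."""
    return after.get("denied", 0) - before.get("denied", 0)
```

On a live system the text could come from something like subprocess.run(["netstat", "-m"], capture_output=True, text=True).stdout, sampled every few seconds while the tarball extraction runs. A growing "denied" value between samples would point at mbuf exhaustion rather than some other kind of lock-up.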

#3 Updated by qhwt.dfly over 3 years ago

Hi.
I've been experiencing similar lock-ups on a DragonFly NFS server since
the recent changes to the network code. Similar, but without the objcache
warnings from the previous messages. I can reproduce it on multiple
machines, and it occurs even when mounted via the loopback interface:

$ sudo mkdir -m1777 /test
$ sudo mount -tnfs 127.0.0.1:/test /mnt
$ env MAKEOBJDIRPREFIX=/mnt make -sj300 buildworld
(leave it for a couple of hours to find many processes stuck
in ZOMB state)

It can probably be reproduced with a much lower -j number passed to make.

Best Regards.

#4 Updated by dillon over 3 years ago

:Hi.
:I've been experiencing similar lock-ups on a DragonFly NFS server since
:the recent changes to the network code. Similar, but without the objcache
:warnings from the previous messages. I can reproduce it on multiple
:machines, and it occurs even when mounted via the loopback interface:
:
: $ sudo mkdir -m1777 /test
: $ sudo mount -tnfs 127.0.0.1:/test /mnt
: $ env MAKEOBJDIRPREFIX=/mnt make -sj300 buildworld
: (leave it for a couple of hours to find many processes stuck
: in ZOMB state)
:
:It can probably be reproduced with a much lower -j number passed to make.
:
:Best Regards.

Hmm. I was able to get a bunch of processes stuck in 'clock'
using that setup. That in turn caused their children to get
stuck as zombies, but it looks like the primary issue is the
parents getting stuck in 'clock'.

This particular deadlock is probably related to the localhost
mount vs a remote mount. I'll see if I can track it down today.

Have you had any issues with remote NFS mounts? Note that the
most recent fixes went in on Friday and are server-side.

-Matt
Matthew Dillon

#5 Updated by dillon over 3 years ago

Bingo. _cache_cleanneg() was calling cache_zap() without specifying
non-blocking operation. Since _cache_cleanneg() tries to clean up
ncps in no particular order and out of context, it must specify
non-blocking. I have committed a fix.
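
The reasoning can be illustrated outside the kernel: a cleaner that visits entries in arbitrary order cannot safely wait on a lock that another thread holds, so it should try the lock and skip the entry if that fails. Below is a minimal Python sketch of that pattern. It is not the kernel code; Entry, zap, and clean_negative are hypothetical names used only for illustration.

```python
import threading


class Entry:
    """Stand-in for a name cache entry (not the kernel's structure)."""

    def __init__(self, name):
        self.name = name
        self.lock = threading.Lock()


def zap(entry, nonblock):
    """Try to dispose of an entry; return True if it was processed.

    With nonblock set, an entry whose lock is already held is skipped
    instead of waited on, so the caller can never deadlock here.
    """
    if nonblock:
        if not entry.lock.acquire(blocking=False):
            return False  # busy: leave it for a later pass
    else:
        entry.lock.acquire()  # may wait forever if lock order is violated
    try:
        return True  # a real cleaner would free the entry here
    finally:
        entry.lock.release()


def clean_negative(entries, nonblock=True):
    """Visit entries in arbitrary order, skipping the ones that are busy."""
    zapped, skipped = [], []
    for e in entries:
        (zapped if zap(e, nonblock) else skipped).append(e.name)
    return zapped, skipped
```

With one entry locked elsewhere, clean_negative() processes the free entries and reports the busy one as skipped rather than stalling. The blocking variant would wait on that entry, which is the behaviour that becomes a deadlock when the holder is in turn waiting on something the cleaner owns.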

Negative ncp entries are seriously exercised by a buildworld due to
the compiler trying to find #include files.

I don't think this is related to Francois Tigeot's hammer cleanup
deadlock report though, or Matthias Schmidt's hammer cleanup / crypto
deadlock report. I will continue to try to reproduce those test cases.

-Matt

#6 Updated by qhwt.dfly over 3 years ago

On Mon, Sep 27, 2010 at 08:20:18AM -0700, Matthew Dillon wrote:
> Bingo. _cache_cleanneg() was calling cache_zap() without specifying
> non-blocking operation. Since _cache_cleanneg() tries to clean up
> ncps in no particular order and out of context, it must specify
> non-blocking. I have committed a fix.
>
> Negative ncp entries are seriously exercised by a buildworld due to
> the compiler trying to find #include files.

Oh, just changing a zero to one and it's done! I was looking at
completely different places in the code, but glad to see this fixed
anyway :) Does this also affect 2.6?

Thanks.

#7 Updated by eocallaghan over 3 years ago

Fix committed on head, b8587f8c68a1a39ba3e0175b3c228b1431a133f8.

Status: Chatting->Resolved.

qhwt.dfly, I don't believe this particular deadlock affects 2.6-release.

Cheers,
Edward.

#8 Updated by dillon over 3 years ago

:Oh, just changing a zero to one and it's done! I was looking at
:completely different places in the code, but glad to see this fixed
:anyway :) Does this also affect 2.6?
:
:Thanks.

Yes, it probably does. The fix would be the same.

-Matt
Matthew Dillon
