Bug #1715: Deadlock on NFS server - DragonFlyBSD - DragonFlyBSD bugtracker

Added by ftigeot over 15 years ago. Updated almost 15 years ago.

Status: Closed
Priority: Normal

Description

I just got a new sort of deadlock on an NFS server.

During the extraction of a tarball archive by a FreeBSD client, the server
became unresponsive.

It could still answer pings but all other sorts of network activity failed.
Ssh sessions hung, there was no answer to telnet commands, etc.

On the console there was a single message:

Warning, objcache(mbuf pkt hdr): Exhausted!

I could switch virtual terminals with ALT-F1 and so on. The keyboard appeared at
first to function as usual: I could enter a user name and a password but as
soon as I pressed enter for the second time, the machine hung for good.

I was able to obtain a crash dump by escaping to the debugger with
ctrl-alt-esc. The files are available here:

http://www.wolfpond.org/crash.dfly/

Updated by dillon over 15 years ago

:
:I just got a new sort of deadlock on an NFS server.
:
:During the extraction of a tarball archive by a FreeBSD client, the server
:became unresponsive.
:
:It could still answer pings but all other sorts of network activity failed.
:Ssh sessions hung, there was no answer to telnet commands, etc.
:...
:On the console there was a single message:
:
: Warning, objcache(mbuf pkt hdr): Exhausted!

How much memory does the server have?  What does 'netstat -m' say
    on the server?

-Matt

Updated by ftigeot over 15 years ago

On Tue, Apr 06, 2010 at 10:35:37AM -0700, Matthew Dillon wrote:

:
:I just got a new sort of deadlock on an NFS server.
:
:During the extraction of a tarball archive by a FreeBSD client, the server
:became unresponsive.
:
:It could still answer pings but all other sorts of network activity failed.
:Ssh sessions hung, there was no answer to telnet commands, etc.
:...
:On the console there was a single message:
:
: Warning, objcache(mbuf pkt hdr): Exhausted!

How much memory does the server have? What does 'netstat -m' say
on the server?

Actually, it is the same machine that had all sorts of deadlock troubles with
the hammer cleanup code.

The main memory is 2GB.

netstat -m
7/13312 mbufs in use (current/max):
469/6656 mbuf clusters in use (current/max)
257 mbufs and mbuf clusters allocated to data
6 mbufs and mbuf clusters allocated to packet headers
939 Kbytes allocated to network (5% of mb_map in use)
260 requests for memory denied
0 requests for memory delayed
0 calls to protocol drain routines

The NFS service has not really been used since the last reboot. Should I try
to reproduce the deadlock conditions and monitor the netstat -m output?
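
One way to monitor that output during a reproduction attempt is to pull the counters out of the text and watch whether the denied count grows. The following is a minimal sketch, assuming the line wording shown in the output above; `parse_netstat_m` and `PATTERNS` are hypothetical names, not part of any DragonFly tool.

```python
import re

# Patterns match the `netstat -m` lines quoted in this report; other
# releases may word these lines differently (assumption).
PATTERNS = {
    "mbufs": r"(\d+)/(\d+) mbufs in use",
    "clusters": r"(\d+)/(\d+) mbuf clusters in use",
    "denied": r"(\d+) requests for memory denied",
    "delayed": r"(\d+) requests for memory delayed",
}

def parse_netstat_m(text):
    """Return the counters found in `netstat -m` output as a dict.

    Pairs such as current/max become tuples, single counters become ints.
    """
    stats = {}
    for key, pattern in PATTERNS.items():
        match = re.search(pattern, text)
        if match:
            values = tuple(int(g) for g in match.groups())
            stats[key] = values if len(values) > 1 else values[0]
    return stats

# Sample taken from the output posted in this report.
SAMPLE = """\
7/13312 mbufs in use (current/max):
469/6656 mbuf clusters in use (current/max)
257 mbufs and mbuf clusters allocated to data
6 mbufs and mbuf clusters allocated to packet headers
939 Kbytes allocated to network (5% of mb_map in use)
260 requests for memory denied
0 requests for memory delayed
0 calls to protocol drain routines
"""

stats = parse_netstat_m(SAMPLE)
print(stats)
```

On a live server the text could come from `subprocess.run(["netstat", "-m"], capture_output=True, text=True).stdout`, sampled every few seconds and logged with a timestamp so the last values before a hang survive.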

Updated by qhwt.dfly almost 15 years ago

Hi.
I've been experiencing similar lock-ups on a DragonFly NFS server since
the recent changes to the network code. Similar, but without objcache
warnings as in the previous messages. I can reproduce it on multiple
machines and it occurs even when mounted via the loopback interface:

  $ sudo mkdir -m1777 /test
  $ sudo mount -tnfs 127.0.0.1:/test /mnt
  $ env MAKEOBJDIRPREFIX=/mnt make -sj300 buildworld
  (leave it for a couple of hours to find many processes stuck
   in ZOMB state)

It can probably be reproduced with a much lower -j value for the make command.

Best Regards.

Updated by dillon almost 15 years ago

:Hi.
:I've been experiencing similar lock-ups on a DragonFly NFS server since
:the recent changes to the network code. Similar, but without objcache
:warnings as in the previous messages. I can reproduce it on multiple
:machines and it occurs even when mounted via the loopback interface:
:
: $ sudo mkdir -m1777 /test
: $ sudo mount -tnfs 127.0.0.1:/test /mnt
: $ env MAKEOBJDIRPREFIX=/mnt make -sj300 buildworld
: (leave it for a couple of hours to find many processes stuck
: in ZOMB state)
:
:It can probably be reproduced with a much lower -j value for the make command.
:
:Best Regards.

Hmm.  I was able to get a bunch of processes stuck in 'clock'
    using that setup.  That in turn caused their children to get
    stuck as zombies but it looks like the primary issue is the
    parents getting stuck in clock.

This particular deadlock is probably related to the localhost
    mount vs a remote mount.  I'll see if I can track it down today.

Have you had any issues with remote NFS mounts?  Note that the
    most recent fixes went in on Friday and are server-side.

-Matt
                    Matthew Dillon
                    <dillon@backplane.com>

Updated by dillon almost 15 years ago

Bingo. _cache_cleanneg() was calling cache_zap() without specifying
non-blocking operation. Since cache_cleanneg() tries to clean up
ncps in no particular order and out-of-context it must specify
non-blocking. I have committed a fix.

Negative ncp entries are seriously exercised by a buildworld due to
    the compiler trying to find #include files.
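
The locking pattern described above can be illustrated with a minimal user-space sketch. It is not the kernel code: `Entry`, `zap` and `clean_negative` are hypothetical stand-ins for a namecache entry, `cache_zap()` and `_cache_cleanneg()`, and a `threading.Lock` stands in for the per-entry lock. The point is only that a cleaner walking entries in arbitrary order should try each lock and skip busy entries rather than sleep on them.

```python
import threading

class Entry:
    """Hypothetical stand-in for a namecache entry (ncp)."""
    def __init__(self, name):
        self.name = name
        self.lock = threading.Lock()
        self.zapped = False

def zap(entry, nonblock):
    """Try to destroy an entry and return True on success.

    With nonblock set, give up immediately if the lock is held
    instead of sleeping until the holder releases it.
    """
    if not entry.lock.acquire(blocking=not nonblock):
        return False
    try:
        entry.zapped = True
        return True
    finally:
        entry.lock.release()

def clean_negative(entries):
    """Out-of-context cleanup in arbitrary order: skip busy entries."""
    return [e.name for e in entries if zap(e, nonblock=True)]

entries = [Entry("a.h"), Entry("b.h"), Entry("c.h")]
entries[1].lock.acquire()      # simulate another context holding b.h
cleaned = clean_negative(entries)
entries[1].lock.release()
print(cleaned)                 # ['a.h', 'c.h']
```

Calling `zap(e, nonblock=False)` on the held entry would block forever in this sketch, which mirrors the hang: the cleaner waits on a lock whose holder may itself be waiting on something the cleaner owns.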

I don't think this is related to Francois Tigeot's hammer cleanup
    deadlock report though, or Matthias Schmidt's hammer cleanup / crypto
    deadlock report.  I will continue to try to reproduce those test cases.

-Matt

Updated by qhwt.dfly almost 15 years ago

On Mon, Sep 27, 2010 at 08:20:18AM -0700, Matthew Dillon wrote:

Bingo. _cache_cleanneg() was calling cache_zap() without specifying
non-blocking operation. Since cache_cleanneg() tries to clean up
ncps in no particular order and out-of-context it must specify
non-blocking. I have committed a fix.

Negative ncp entries are seriously exercised by a buildworld due to
the compiler trying to find #include files.

Oh, just changing a zero to one and it's done! I was looking at
completely different places in the code, but glad to see this fixed
anyway :) Does this also affect 2.6?

Thanks.

Updated by eocallaghan almost 15 years ago

Fix committed on head, b8587f8c68a1a39ba3e0175b3c228b1431a133f8.

Status: Chatting->Resolved.

qhwt.dfly, I don't believe this particular deadlock affects 2.6-release.

Cheers,
Edward.

Updated by dillon almost 15 years ago

:Oh, just changing a zero to one and it's done! I was looking at
:completely different places in the code, but glad to see this fixed
:anyway :) Does this also affect 2.6?
:
:Thanks.

Yes, it probably does.  The fix would be the same.

-Matt
                    Matthew Dillon
                    <dillon@backplane.com>
