Bug #952
system hang with rsync
Status: Closed
Description
Hello.
Doing a backup to our DragonFly file server, using rsync, hangs the
server under certain conditions.
It only happens when I use the --delete option to rsync and seems to
only be on directories with large multi-GB files.  The directory that
consistently reproduces it is my tv recording directory where I have
multiple files ranging from 1 to 4 GB.  The total directory on the
server is about 67 GB and includes old files that need to be deleted.  The
directory on the client machine is currently about 79 GB.
While the server is hung, I can still ping it and switch virtual
consoles but, other than that, all consoles are just frozen.  While it
is frozen, if I hit <cntl>c I see the '^C' on the screen but get no
other response.
When I run top in another console, it also freezes and stops
updating, always showing zero or very little process load.  On this last
test, after killing rsync on the client side with <cntl>c, top briefly
updated after 3 minutes, then stayed frozen for 3.5 more minutes.  After
a total of 6.5 minutes, the server came alive again.
It does not seem to be related to ssh because I ran the rsync daemon on
the server and ran the same test without ssh and got the same results.
Here is the output of the last test on the client side.
$ rsync -HOav -x --delete . alexandria::tv/recordings
  /tv/recordings
  building file list ... done
  deleting The Universe (Jupiter: The Giant Planet).info
	Hangs here. I waited a while and finally hit <cntl>c.
^Crsync error: received SIGUSR1 or SIGINT (code 20) at rsync.c(163)
	The .info file and the .avi file were both gone on the server after that
but I am not sure if the .avi file was deleted on one of the other
tests.
I did get a couple of entries in /var/log/messages on the server with the
following error when I ran it with the rsync daemon:
rsyncd[3933]: rsync error: error in rsync protocol data stream (code 12) at io.c(453) [receiver=2.6.9]
	I did not get that error when run under ssh so I don't know if it has to
do with the freezing problem.
On the server, I replaced the recording directory with one that had
a subset of the files, about 5 recordings that needed to be deleted, and it
worked fine.  Then I hard linked all files in the directory from another
directory, and it still worked fine.
i.e.
    mv recordings recordings.bak
    mkdir recordings
    ln recordings.bak/* recordings
So far as I can tell, it only happens if there is over a certain amount of
data in the directory and rsync has to actually delete the files, not just
unlink a secondary hard link.
I have been able to back up just about every other directory on our
client machines without any problems.
Does anybody have any theories about what might be happening?
The client machine is a NetBSD machine.
The DragonFly server is running version 1.10.1-RELEASE.  I also
tested with a 1.11.0-DEVELOPMENT kernel and got the same results.
      
      Updated by dillon over 17 years ago
      
    
    :Hello.
:
:Doing a backup to our DragonFly file server, using rsync, hangs the
:server under certain conditions.  
:
:It only happens when I use the --delete option to rsync and seems to
:only be on directories with large multi-GB files.  The directory that
:consistently reproduces it is my tv recording directory where I have
:multiple files ranging from 1 to 4 GB.  The total directory on the
:server is about 67 GB and includes old files that need to be deleted.  The
:directory on the client machine is currently about 79 GB. 
:
:While the server is hung, I can still ping it and switch virtual
:consoles but, other than that, all consoles are just frozen.  While it
:is frozen, if I hit <cntl>c I see the '^C' on the screen but get no
:other response.
My guess is that there is a buffer cache issue on the server that
    is causing a deadlock.
	If you sync every second on the server while running your test, does
    it still hang?
	-Matt
      
      Updated by vince.dragonfly over 17 years ago
      
    
    Sorry it is taking so long.  I will try to get on this and run this test
in the next day or two.
      
      Updated by dillon over 17 years ago
      
    
    :On Sat, Feb 23, 2008 at 01:00:05PM -0800, Matthew Dillon wrote:
:>     If you sync every second on the server while running your test, does
:>     it still hang?
:
:Sorry it is taking so long.  I will try to get on this and run this test
:in the next day or two.
If the sync test succeeds in preventing the crash then I have a pretty
    good idea where the issue might be, but a crash dump would tell
    me precisely.
	If that doesn't work out, try to describe the rsyncd server setup
    (approximate number of files and directories being rsynced, approximate
    size of the files, amount of memory the machine has, etc) and I will
    try to reproduce it.
	-Matt
                    Matthew Dillon 
                    <dillon@backplane.com>
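A reproduction attempt along the lines Matt describes would start from a directory of large files on the server. The following is only a minimal sketch of that setup: the helper name make_test_files, the file names, and the sizes are hypothetical, and nothing in this thread confirms that files created this way trigger the hang.

```shell
#!/bin/sh
# make_test_files DIR COUNT SIZE_MB
# Create COUNT zero-filled files of SIZE_MB megabytes each in DIR.
# (Hypothetical helper for illustration; not part of rsync or DragonFly.)
make_test_files() {
    dir=$1
    count=$2
    size_mb=$3
    mkdir -p "$dir" || return 1
    i=1
    while [ "$i" -le "$count" ]; do
        # bs is given in bytes because the "1m"/"1M" suffix differs between dd variants
        dd if=/dev/zero of="$dir/recording$i.mpeg" bs=1048576 count="$size_mb" \
            2>/dev/null || return 1
        i=$((i + 1))
    done
}

# Example, on the server (10 files of about 2 GB each):
#   make_test_files /tv/recordings 10 2048
# Then, from a client directory that does not contain those files,
# the command from the original report:
#   rsync -HOav -x --delete . alexandria::tv/recordings
```

Because the report says the hang only appears when files are actually deleted, the test files need to exist on the server and be absent on the client.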
      
      Updated by vince.dragonfly over 17 years ago
      
    
    Hi Matt.
I finally ran the tests today. Sorry it took so long to get to it.
The sync did not prevent it.  It is not actually crashing, but just
hanging.  After I suspend or kill the rsync process on the client
machine (running NetBSD), the server stays hung for several minutes and
then eventually comes back alive.  Top never shows any process load.
I also tried syncing every .1 seconds.  Same result but it gave me
a more fine grained resolution of the freeze pattern.  I printed a '.'
for every sync.  The moment I run the rsync command on the client it
freezes (no more dots).  Some of the time it will freeze for a few
seconds, print a few more dots, freeze again for much longer, sometimes
print another dot or two and then stay frozen.  I also tried nicing the
sync to -10.  Same result.
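For anyone repeating this test, the dot-per-sync loop can be sketched roughly as below. This is an illustration, not the exact script used here: the function name sync_ticker is hypothetical, and the fractional interval assumes a sleep that accepts fractional seconds.

```shell
#!/bin/sh
# sync_ticker COUNT INTERVAL
# Run sync COUNT times, printing one '.' after each call and sleeping
# INTERVAL seconds in between. Gaps between dots show when the system stalls.
# (Hypothetical helper for illustration.)
sync_ticker() {
    count=$1
    interval=$2
    i=0
    while [ "$i" -lt "$count" ]; do
        sync
        printf '.'
        sleep "$interval"
        i=$((i + 1))
    done
    printf '\n'
}

# Example: about one minute of syncing every 0.1 seconds
#   sync_ticker 600 0.1
```

To raise the priority as in the niced test, the script could be started as root under nice -n -10.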
As I mentioned, it is only with the '--delete' rsync option.  The freeze
takes place while deleting large files and seems to only be when there
are more than a certain number of them to delete.  When I tested a while
back with fewer files to delete, it did not do it.  I don't know where
the threshold is of how many files triggers it.  However, the freeze
seems to always be while deleting the first file.  After killing it and
waiting several minutes for it to come back alive, the first file in the
list does seem to be deleted.
The directory size on the DragonFly server is 45GB with most of the file
sizes ranging between 2GB and 4GB.  I can provide you with the full
directory listing of both the server and the client if you like.  Either
email it directly to you or post it to the list.  Let me know.
There are 36 files on the server.
Here is a sample listing of a few of the files.
-rw-r--r--  1 vince  wheel  -        356 Apr 20  2007 Earth Revealed: Introductory Geology [101].info
-rw-r--r--  1 vince  wheel  - 1961564096 Apr 21  2007 Earth Revealed: Introductory Geology [101].mpeg
-rw-r--r--  1 vince  wheel  -        490 May 26  2007 From the Earth to the Moon (1968) [04].info
-rw-r--r--  1 vince  wheel  - 3926833088 May 29  2007 From the Earth to the Moon (1968) [04].mpeg
-rw-r--r--  1 vince  wheel  -  915902498 Jul 15  2007 Man vs. Wild (Everglades).avi
-rw-r--r--  1 vince  wheel  -        387 Jul 13  2007 Man vs. Wild (Everglades).info
Server machine:
    kern.osrelease: 1.11.0-DEVELOPMENT
    kern.osrevision: 200708
    hw.physmem: 528859136
    hw.model: Intel(R) Pentium(R) 4 CPU 1.60GHz
- swapinfo
Device        1K-blocks   Used    Avail  Capacity  Type
/dev/ad0s1b      511936     40   511896        0%  Interleaved
      
      Updated by dillon over 17 years ago
      
    
    :The sync did not prevent it.  It is not actually crashing, but just
:hanging.  After I suspend or kill the rsync process on the client
:machine (running NetBSD), the server stays hung for several minutes and
:then eventually comes back alive.  Top never shows any process load.
:
:I also tried syncing every .1 seconds.  Same result but it gave me
:a more fine grained resolution of the freeze pattern.  I printed a '.'
:for every sync.  The moment I run the rsync command on the client it
:freezes (no more dots).  Some of the time it will freeze for a few
:seconds, print a few more dots, freeze again for much longer, sometimes
:print another dot or two and then stay frozen.  I also tried nicing the
:sync to -10.  Same result.
:
:As I mentioned, it is only with the '--delete' rsync option.  The freeze
:takes place while deleting large files and seems to only be when there
:are more than a certain number of them to delete.  When I tested a while
:back with fewer files to delete, it did not do it.  I don't know where
:the threshold is of how many files triggers it.  However, the freeze
:seems to always be while deleting the first file.  After killing it and
:waiting several minutes for it to come back alive, the first file in the
:list does seem to be deleted.
:
:The directory size on the DragonFly server is 45GB with most of the file
:sizes ranging between 2GB and 4GB.  I can provide you with the full
:directory listing of both the server and the client if you like.  Either
:email it directly to you or post it to the list.  Let me know.
:
:There are 36 files on the server.
:Here is a sample listing of a few of the files.
Do you have a console on the server?  Can you break into DDB?
	When it hangs you can break into DDB with control-alt-escape and do a
    'ps' to see what (probably many) processes are stuck on.  That will
    give us a starting point.  You can then 'cont' to continue operation.
	I'm sure you probably do not want to crash the box but if you don't
    mind and have core dumps enabled and can get a core while it is 'stuck',
    that will give us the most information.  If you want to go that route
    I can give you a leaf.dragonflybsd.org account to upload the core into
    so the developers can have a look at it (just supply me with your public
    dsa key for ssh and your desired username and I will create the leaf
    account).
	-Matt
      
      Updated by vince.dragonfly over 17 years ago
      
    
    Hi Matt.
Thanks for the instructions.  It made us a little nervous :-), but
I went ahead and crashed it and tried to get a core dump.  However, our
swap partition was apparently not big enough :-(.  We set up a test
machine with another SATA hard drive (to duplicate the environment as
closely as possible, in case hardware type has anything to do with the
problem) with a bigger swap partition.  Luckily, we had a spare drive we
had not installed yet.  We are in the process of copying the 45 GB file
system to it over the LAN.  Hopefully we can reproduce the problem and
get a kernel core dump on that machine.  If so, I will send it to you.
It may be at least another day or two.
If we successfully reproduce the problem and get a kernel core dump,
I will go ahead and send you a username and dsa key in a private email
for a leaf account.
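The swap shortfall can be checked against the numbers posted earlier in this thread (hw.physmem and the swapinfo line for /dev/ad0s1b). The sketch below assumes that a full kernel dump needs a dump device at least as large as physical memory. That requirement is an assumption here, not something stated in the thread.

```shell
#!/bin/sh
# Values copied from the server details posted earlier in the thread.
physmem_bytes=528859136   # hw.physmem
swap_kb=511936            # 1K-blocks for /dev/ad0s1b from swapinfo

# Convert RAM to 1K blocks so both values use the same unit.
physmem_kb=$((physmem_bytes / 1024))
shortfall_kb=$((physmem_kb - swap_kb))

# Assumption: the dump device must hold a complete image of RAM.
if [ "$shortfall_kb" -gt 0 ]; then
    echo "swap is ${shortfall_kb} KB smaller than RAM (${physmem_kb} KB)"
else
    echo "swap is large enough to hold a full memory image"
fi
```

With these values RAM is 516464 KB against 511936 KB of swap, a gap of 4528 KB, which is consistent with the failed dump.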
      
      Updated by vince.dragonfly about 16 years ago
      
    
    On Mon, Aug 24, 2009 at 11:21:38AM +0000, Antonio Huete Jimenez (via DragonFly issue tracker) wrote:
> Antonio Huete Jimenez <tuxillo@quantumachine.net> added the comment:
> Could this be reproduced still?
Sorry about the delayed response.  It may be a while before I am set up
to test it.  We are low on file space on the file server until we add
another hard drive.  Also, the DragonFly installation is getting pretty
old (1.11.0-DEVELOPMENT).  Probably not much point in trying to
reproduce it until we upgrade the machine as well.  I don't know how
long it is going to be before we are set up to test it properly again.
If you are wanting to go ahead and close the ticket, I guess I can
always open a new one, or re-open the old one if that is possible, if
I am ever able to reproduce it.
      
      Updated by ftigeot almost 14 years ago
      
    
    - Description updated (diff)
 - Status changed from New to Resolved
 
Closing due to lack of recent feedback
      
      Updated by ftigeot almost 14 years ago
      
    
    - Status changed from Resolved to Closed
 - Assignee deleted (0)