Bug #952

system hang with rsync

Added by vince.dragonfly over 6 years ago. Updated almost 3 years ago.

Status: Closed
Start date:
Priority: Normal
Due date:
Assignee: -
% Done: 0%
Category: -
Target version: -

Description

Hello.

Doing a backup to our DragonFly file server, using rsync, hangs the
server under certain conditions.

It only happens when I use the *--delete* option to rsync and seems to
only be on directories with large multi-GB files. The directory that
consistently reproduces it is my tv recording directory where I have
multiple files ranging from 1 to 4 GB. The directory on the server
totals about 67 GB, including old files that need to be deleted. The
directory on the client machine is currently about 79 GB.

While the server is hung, I can still ping it and switch virtual
consoles but, other than that, all consoles are just frozen. While it
is frozen, if I hit <cntl>c I see the '^C' on the screen but get no
other response.

When I run *top* in another console, it also freezes and stops
updating, always with zero or very little process load. On this last
test, after killing rsync on the client side with <cntl>c, *top* briefly
updated after 3 minutes, then stayed frozen for 3.5 more minutes. After
a total of 6.5 minutes, the server came alive again.

It does not seem to be related to ssh because I ran the rsync daemon on
the server and ran the same test without ssh and got the same results.
Here is the output of the last test on the client side.

$ rsync -HOav -x --delete . alexandria::tv/recordings
/tv/recordings
building file list ... done
deleting The Universe (Jupiter: The Giant Planet).info

Hangs here. I waited a while and finally hit <cntl>c.

^Crsync error: received SIGUSR1 or SIGINT (code 20) at rsync.c(163)

The .info file and the .avi file were both gone on the server after that
but I am not sure if the .avi file was deleted on one of the other
tests.

I did get a couple of entries in /var/log/messages on the server with the
following error when I ran it with the rsync daemon:

rsyncd[3933]: rsync error: error in rsync protocol data stream (code 12) at io.c(453) [receiver=2.6.9]

I did not get that error when run under ssh so I don't know if it has to
do with the freezing problem.

On the server, I replaced the recording directory with one that had
a subset of the files, about 5 recordings that needed to be deleted, and it
worked fine. Then I hard linked all files in the directory from another
directory, and it still worked fine.

i.e.
mv recordings recordings.bak
mkdir recordings
ln recordings.bak/* recordings

As far as I can tell, it only happens when the directory holds more than
a certain amount of data and rsync has to actually delete the files, not
just unlink a secondary hard link.

I have been able to back up just about every other directory on our
client machines without any problems.

Does anybody have any theories about what might be happening?

The client machine is a NetBSD machine.
The DragonFly server is running version 1.10.1-RELEASE. I also
tested with a 1.11.0-DEVELOPMENT kernel and got the same results.

History

#1 Updated by dillon over 6 years ago

:Hello.
:
:Doing a backup to our DragonFly file server, using rsync, hangs the
:server under certain conditions.
:
:It only happens when I use the *--delete* option to rsync and seems to
:only be on directories with large multi-GB files. The directory that
:consistently reproduces it is my tv recording directory where I have
:multiple files ranging from 1 to 4 GB. The directory on the server
:totals about 67 GB, including old files that need to be deleted. The
:directory on the client machine is currently about 79 GB.
:
:While the server is hung, I can still ping it and switch virtual
:consoles but, other than that, all consoles are just frozen. While it
:is frozen, if I hit <cntl>c I see the '^C' on the screen but get no
:other response.

My guess is that there is a buffer cache issue on the server that
is causing a deadlock.

If you sync every second on the server while running your test, does
it still hang?

-Matt

#2 Updated by vince.dragonfly over 6 years ago

Sorry it is taking so long. I will try to get on this and run this test
in the next day or two.

#3 Updated by dillon over 6 years ago

:On Sat, Feb 23, 2008 at 01:00:05PM -0800, Matthew Dillon wrote:
:> If you sync every second on the server while running your test, does
:> it still hang?
:
:Sorry it is taking so long. I will try to get on this and run this test
:in the next day or two.

If the sync test succeeds in preventing the crash then I have a pretty
good idea where the issue might be, but a crash dump would tell
me precisely.

If that doesn't work out, try to describe the rsyncd server setup
(approximate number of files and directories being rsynced, approximate
size of the files, amount of memory the machine has, etc) and I will
try to reproduce it.

-Matt
Matthew Dillon

#4 Updated by vince.dragonfly over 6 years ago

Hi Matt.

I finally ran the tests today. Sorry it took so long to get to it.

The sync did not prevent it. It is not actually crashing, but just
hanging. After I suspend or kill the rsync process on the client
machine (running NetBSD), the server stays hung for several minutes and
then eventually comes back alive. Top never shows any process load.

I also tried syncing every .1 seconds. Same result but it gave me
a more fine grained resolution of the freeze pattern. I printed a '.'
for every sync. The moment I run the rsync command on the client it
freezes (no more dots). Some of the time it will freeze for a few
seconds, print a few more dots, freeze again for much longer, sometimes
print another dot or two and then stay frozen. I also tried nicing the
sync to -10. Same result.
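
For anyone repeating this test, a loop along the following lines would
produce the dot-per-sync trace described above. It is only a sketch, not
the script actually used in this report: the function name sync_dots and
its two arguments are made up for illustration, and a fractional interval
such as 0.1 assumes a sleep(1) that accepts fractions, which POSIX does
not guarantee.

```shell
#!/bin/sh
# sync_dots COUNT INTERVAL   (hypothetical helper, not from the report)
# Call sync(8) COUNT times, printing one '.' after each call returns,
# and sleeping INTERVAL seconds between calls.
sync_dots() {
    i=0
    while [ "$i" -lt "$1" ]; do
        sync            # flush dirty buffers to disk
        printf '.'      # one dot per completed sync
        sleep "$2"
        i=$(( i + 1 ))
    done
    echo                # finish the line of dots
}

# Example: one sync per second for a minute.
# sync_dots 60 1
```

A gap in the dots then marks a period during which the sync call, or the
shell itself, was not getting to run. Raising the priority as described
above would mean running the script under nice with a negative increment,
which normally requires root.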

As I mentioned, it is only with the '--delete' rsync option. The freeze
takes place while deleting large files and seems to only be when there
are more than a certain number of them to delete. When I tested a while
back with fewer files to delete, it did not hang. I don't know where
the threshold is for how many files trigger it. However, the freeze
seems to always be while deleting the first file. After killing it and
waiting several minutes for it to come back alive, the first file in the
list does seem to be deleted.

The directory size on the DragonFly server is 45GB with most of the file
sizes ranging between 2GB and 4GB. I can provide you with the full
directory listing of both the server and the client if you like. Either
email it directly to you or post it to the list. Let me know.

There are 36 files on the server.
Here is a sample listing of a few of the files.

-rw-r--r-- 1 vince wheel - 356 Apr 20 2007 Earth Revealed: Introductory Geology [101].info
-rw-r--r-- 1 vince wheel - 1961564096 Apr 21 2007 Earth Revealed: Introductory Geology [101].mpeg
-rw-r--r-- 1 vince wheel - 490 May 26 2007 From the Earth to the Moon (1968) [04].info
-rw-r--r-- 1 vince wheel - 3926833088 May 29 2007 From the Earth to the Moon (1968) [04].mpeg
-rw-r--r-- 1 vince wheel - 915902498 Jul 15 2007 Man vs. Wild (Everglades).avi
-rw-r--r-- 1 vince wheel - 387 Jul 13 2007 Man vs. Wild (Everglades).info

Server machine:
kern.osrelease: 1.11.0-DEVELOPMENT
kern.osrevision: 200708
hw.physmem: 528859136
hw.model: Intel(R) Pentium(R) 4 CPU 1.60GHz

# swapinfo
Device 1K-blocks Used Avail Capacity Type
/dev/ad0s1b 511936 40 511896 0% Interleaved

#5 Updated by dillon over 6 years ago

:The sync did not prevent it. It is not actually crashing, but just
:hanging. After I suspend or kill the rsync process on the client
:machine (running NetBSD), the server stays hung for several minutes and
:then eventually comes back alive. Top never shows any process load.
:
:I also tried syncing every .1 seconds. Same result but it gave me
:a more fine grained resolution of the freeze pattern. I printed a '.'
:for every sync. The moment I run the rsync command on the client it
:freezes (no more dots). Some of the time it will freeze for a few
:seconds, print a few more dots, freeze again for much longer, sometimes
:print another dot or two and then stay frozen. I also tried nicing the
:sync to -10. Same result.
:
:As I mentioned, it is only with the '--delete' rsync option. The freeze
:takes place while deleting large files and seems to only be when there
:are more than a certain number of them to delete. When I tested a while
:back with fewer files to delete, it did not hang. I don't know where
:the threshold is for how many files trigger it. However, the freeze
:seems to always be while deleting the first file. After killing it and
:waiting several minutes for it to come back alive, the first file in the
:list does seem to be deleted.
:
:The directory size on the DragonFly server is 45GB with most of the file
:sizes ranging between 2GB and 4GB. I can provide you with the full
:directory listing of both the server and the client if you like. Either
:email it directly to you or post it to the list. Let me know.
:
:There are 36 files on the server.
:Here is a sample listing of a few of the files.

Do you have a console on the server? Can you break into DDB?

When it hangs you can break into DDB with control-alt-escape and do a
'ps' to see what (probably many) processes are stuck on. That will
give us a starting point. You can then 'cont' to continue operation.

I'm sure you probably do not want to crash the box but if you don't
mind and have core dumps enabled and can get a core while it is 'stuck',
that will give us the most information. If you want to go that route
I can give you a leaf.dragonflybsd.org account to upload the core into
so the developers can have a look at it (just supply me with your public
dsa key for ssh and your desired username and I will create the leaf
account).

-Matt

#6 Updated by vince.dragonfly over 6 years ago

Hi Matt.

Thanks for the instructions. It made us a little nervous :-), but
I went ahead and crashed it and tried to get a core dump. However, our
swap partition was apparently not big enough :-(. We set up a test
machine with another SATA hard drive (to duplicate the environment as
closely as possible, in case hardware type has anything to do with the
problem) with a bigger swap partition. Luckily, we had a spare drive we
had not installed yet. We are in the process of copying the 45 GB file
system to it over the LAN. Hopefully we can reproduce the problem and
get a kernel core dump on that machine. If so, I will send it to you.
It may be at least another day or two.

If we successfully reproduce the problem and get a kernel core dump,
I will go ahead and send you a username and dsa key in a private email
for a leaf account.
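
The figures posted in comment #4 are consistent with the swap partition
being too small. The check below is a rough sketch that assumes a full
kernel dump needs a dump device at least as large as physical memory;
that requirement is an assumption here, not something stated in this
thread, and the helper name swap_shortfall_kb is made up for illustration.

```shell
#!/bin/sh
# swap_shortfall_kb PHYSMEM_BYTES SWAP_KBLOCKS   (hypothetical helper)
# Print how many 1K-blocks the swap device falls short of physical
# memory, or 0 if the swap device is at least as large.
swap_shortfall_kb() {
    physmem_kb=$(( ($1 + 1023) / 1024 ))   # round bytes up to 1K-blocks
    short=$(( physmem_kb - $2 ))
    [ "$short" -gt 0 ] || short=0
    echo "$short"
}

# Values from comment #4: hw.physmem, and the 1K-blocks column of
# swapinfo for /dev/ad0s1b.
swap_shortfall_kb 528859136 511936   # prints 4528
```

Under that assumption the original swap partition was about 4.4 MB
smaller than the reported physical memory.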

#7 Updated by tuxillo about 5 years ago

Could this be reproduced still?

#8 Updated by vince.dragonfly about 5 years ago

On Mon, Aug 24, 2009 at 11:21:38AM +0000, Antonio Huete Jimenez (via DragonFly issue tracker) wrote:
>
> Antonio Huete Jimenez added the comment:
>
> Could this be reproduced still?

Sorry about the delayed response. It may be a while before I am set up
to test it. We are low on file space on the file server until we add
another hard drive. Also, the DragonFly installation is getting pretty
old (1.11.0-DEVELOPMENT). There is probably not much point in trying to
reproduce it until we upgrade the machine as well. I don't know how
long it is going to be before we are set up to test it properly again.
If you want to go ahead and close the ticket, I guess I can
always open a new one, or re-open the old one if that is possible, if
I am ever able to reproduce it.

#9 Updated by ftigeot almost 3 years ago

  • Description updated
  • Status changed from New to Resolved

Closing due to lack of recent feedback.

#10 Updated by ftigeot almost 3 years ago

  • Status changed from Resolved to Closed
  • Assignee deleted (0)
