Bug #2365

Hammer pfs-destroy and prune-everything can cause network loss

Added by t_dfbsd over 2 years ago. Updated over 1 year ago.

Status: Closed
Start date: 05/09/2012
Priority: Normal
Due date: -
Assignee: -
% Done: 0%
Category: -
Target version: -

Description

Fairly consistently, when I run hammer prune-everything, at some point in the process my ssh session stalls and does not recover, and the machine becomes unavailable on the network. It eventually returns to normal and I can reconnect to it. Today, I ran a hammer pfs-destroy on a 1.3TB PFS and the same thing happened. While the network was out, the following messages were being written to the log:

swap_pager: indefinite wait buffer: offset: 6489661440, size: 4096
swap_pager: indefinite wait buffer: offset: 23187279872, size: 4096
swap_pager: indefinite wait buffer: offset: 6642294784, size: 4096
swap_pager: indefinite wait buffer: offset: 6536355840, size: 4096
swap_pager: indefinite wait buffer: offset: 6500122624, size: 4096
swap_pager: indefinite wait buffer: offset: 6489661440, size: 4096
swap_pager: indefinite wait buffer: offset: 6725648384, size: 4096
swap_pager: indefinite wait buffer: offset: 23187279872, size: 4096
swap_pager: indefinite wait buffer: offset: 6642294784, size: 4096
swap_pager: indefinite wait buffer: offset: 6536355840, size: 4096
swap_pager: indefinite wait buffer: offset: 6500122624, size: 4096
swap_pager: indefinite wait buffer: offset: 861982720, size: 4096
swap_pager: indefinite wait buffer: offset: 6489661440, size: 4096
swap_pager: indefinite wait buffer: offset: 6725648384, size: 4096
swap_pager: indefinite wait buffer: offset: 23187279872, size: 4096
swap_pager: indefinite wait buffer: offset: 6642294784, size: 4096
swap_pager: indefinite wait buffer: offset: 1027108864, size: 4096
swap_pager: indefinite wait buffer: offset: 6536355840, size: 4096
swap_pager: indefinite wait buffer: offset: 6500122624, size: 4096
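For reference, the operations that trigger the hang are of this general form (the path below is a placeholder, not the actual PFS softlink on my system):

    hammer prune-everything /pfs/backup    # prune all fine-grained history on the PFS
    hammer pfs-destroy /pfs/backup         # permanently destroy the PFS and its data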

indefinite_wait_buffer.png (73.8 KB) swildner, 08/25/2012 01:50 AM

indefinite_wait_buffer2.png (49.8 KB) swildner, 08/25/2012 01:50 AM

indefinite_wait_buffer3.png (65.6 KB) swildner, 08/25/2012 01:50 AM

History

#1 Updated by swildner over 2 years ago

Is this on i386 or x86_64? I've seen the issue here too, although not triggered by prune-everything or pfs-destroy. It's just something that happens from time to time on my i386 box.

http://87.78.98.243/tmp/IMG_20120424_220035.jpg

In my case the trigger is not clear.

And I've never seen it on any x86_64 box.

#2 Updated by t_dfbsd over 2 years ago

This is on x86_64 AMD. I should add that those swap_pager messages were happening during the pfs-destroy, but I don't know whether or not that was the trigger. More concerning to me is the loss of network connectivity.

#3 Updated by swildner over 2 years ago

I just had the "indefinite wait buffer" issue in an i386 VM and captured the beginning of it. I was building in pkgsrc and then thought I should clean up my HAMMER filesystem, so I ran /etc/periodic/daily/160.clean-hammer. It went fine until it got to /home (second image). Although none of the package building took place on /home (afaics), it hung there and soon the "indefinite wait buffer" messages started to appear on the console (first image). At first the offsets were all the same, but later there were several different ones (third image).

Maybe this gives a better clue? Until now I had never witnessed it as it happened.
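For anyone trying to reproduce what I saw: as far as I understand, the periodic script essentially runs HAMMER's regular maintenance (hammer cleanup) over the mounted HAMMER filesystems, so the same workload can be driven by hand; /home below is just an example path:

    /etc/periodic/daily/160.clean-hammer    # the standard daily HAMMER maintenance script
    hammer cleanup /home                    # run the snapshot/prune/reblock cycle on a single filesystem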

#4 Updated by t_dfbsd about 2 years ago

I upgraded my system (x86_64) to include all the recent scheduler changes and tried hammer prune-everything, and for the first time in a long time it didn't kill the network. I did notice a slowdown, but that was all. Could those changes have fixed this problem? I'll do more testing.

#5 Updated by t_dfbsd over 1 year ago

As recently as one month ago, this was still a problem. In fact, it happened very consistently with hammer prune-everything. Tonight I ran it twice and network connections to the box didn't drop. Could recent commits have fixed this?

#6 Updated by t_dfbsd over 1 year ago

  • Status changed from New to Closed

After a good bit of testing, I'm ready to declare this resolved. Again, as of approximately one month ago this was still very much an issue, so some recent commit or combination of commits must have fixed it.
