Bug #2039

closed

Sometimes, DragonFly 2.9 systems can not reboot

Added by ftigeot over 13 years ago. Updated over 13 years ago.

Status: Closed
Priority: Urgent
Assignee: -
Category: -
Target version: -
Start date:
Due date:
% Done: 0%
Estimated time:

Description

uname -a output:
DragonFly dfly32.zefyris.com 2.9-DEVELOPMENT DragonFly
v2.9.1.1027.gb133d-DEVELOPMENT #3: Sun Apr 3 10:24:40 CEST 2011
:/usr/obj/usr/src/sys/GENERIC_SMP i386

The problem may be present on older versions of 2.9.

After issuing a "shutdown -r now" command as superuser, the system starts the
shutdown process.

The last console messages are:

Syncing disks...
done.

The machine then hangs forever; it can only be rebooted with a hard reset.

I'm setting the priority to critical, thinking about poor souls having to manage
remotely colocated servers.

Actions #1

Updated by sepherosa over 13 years ago

On Sun, Apr 3, 2011 at 5:00 PM, Francois Tigeot (via DragonFly issue
tracker) wrote:

After issuing a "shutdown -r now" command as superuser, the system starts the
shutdown process

sysctl hw.acpi.handle_reboot=1 && shutdown -r now

Does the above help?

Best Regards,
sephe

Actions #2

Updated by ftigeot over 13 years ago

On Tue, Apr 05, 2011 at 09:43:36AM +0000, Sepherosa Ziehau (via DragonFly issue tracker) wrote:

sysctl hw.acpi.handle_reboot=1 && shutdown -r now

Does the above help?

Sorry, I can't test: I do not have access to this machine anymore.

Actions #3

Updated by sepherosa over 13 years ago

I have found a reliable way to trigger it:
switch to single user mode
mount -a
cd /usr/src
make installworld && make upgrade && reboot

print_uptime() has not been called in my case.

Best Regards,
sephe

Actions #4

Updated by dillon over 13 years ago

:I have found a reliable way to trigger it:
:switch to single user mode
:mount -a
:cd /usr/src
:make installworld && make upgrade && reboot
:
:print_uptime() has not been called in my case.
:
:Best Regards,
:sephe

Hmm.  If the uptime is not displayed, this implies that the
vfs_unmountall() call is getting stuck.

Try with the latest master; I made some adjustments that might affect
raw device closes.  If swapcache is turned on, try turning it off
(though my latest patch turns it off on shutdown automatically, too).

So far I cannot replicate the issue on my test box.  I did have
reboot issues in the past related to swapcache, but those are gone now
w/my recent commits.

Other possible causes: tmpfs, vn, usb mounts, procfs, etc.

If you can reliably replicate the problem you may have to add a bunch
of kprintf()'s to the umountall iterator to track down which filesystem
is getting stuck.  I have included a df of my test box below.

-Matt

Filesystem                          1K-blocks      Used      Avail Capacity  Mounted on
ROOT                                195452928  64848560  130604368      33%  /
devfs                                       1         1          0     100%  /dev
/dev/serno/L41K2H5G.s1a                774094    216628     495540      30%  /boot
/pfs/@-1:00001                      195452928  64848560  130604368      33%  /var
/pfs/@-1:00002                      195452928  64848560  130604368      33%  /tmp
/pfs/@-1:00003                      195452928  64848560  130604368      33%  /usr
/pfs/@-1:00004                      195452928  64848560  130604368      33%  /home
/pfs/@-1:00005                      195452928  64848560  130604368      33%  /usr/obj
/pfs/@-1:00006                      195452928  64848560  130604368      33%  /var/crash
/pfs/@@-1:00007                     195452928  64848560  130604368      33%  /var/tmp
BUILD                               104398848  42510688   61888160      41%  /build3
procfs                                      4         4          0     100%  /proc
apollo.backplane.com:/usr/src      1934024704 292436496 1641588208      15%  /usr/src
apollo.backplane.com:/usr/src-misc 1934024704 292436496 1641588208      15%  /usr/src-misc
apollo.backplane.com:/usr/pkgsrc   1934024704 292436496 1641588208      15%  /usr/pkgsrc
apollo.backplane.com:/netboot1     1934024704 292436496 1641588208      15%  /netboot1
test29#
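
The kprintf() tracing suggested in this comment can be illustrated with a small userland sketch. This is not the kernel's vfs_unmountall() code: struct fake_mount, trace_unmount_all() and the two callbacks are hypothetical names made up for illustration, and walking the list in reverse order is an assumption of the sketch. The only point is that one log line is emitted before and one after each unmount call, so if a call never returns, the last "start" line without a matching "done" line names the filesystem that is stuck.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical stand-in for a mounted filesystem (not the kernel's struct mount). */
struct fake_mount {
    const char *path;                        /* mount point, used only for logging */
    int (*unmount)(struct fake_mount *mp);   /* returns 0 on success, non-zero on error */
};

/* Illustrative callbacks: one succeeds, one reports an error code. */
static int unmount_ok(struct fake_mount *mp)   { (void)mp; return 0; }
static int unmount_busy(struct fake_mount *mp) { (void)mp; return 16; }

/*
 * Walk the mount list from last to first (assumed order for this sketch),
 * logging before and after every unmount callback. The log is flushed
 * after each line so nothing is lost if a callback never returns.
 * Returns the number of callbacks that reported an error.
 */
static int trace_unmount_all(struct fake_mount *mounts, size_t n, FILE *log)
{
    int failures = 0;

    assert(log != NULL);
    for (size_t i = n; i-- > 0; ) {
        fprintf(log, "unmount start: %s\n", mounts[i].path);
        fflush(log);

        int error = mounts[i].unmount(&mounts[i]);

        fprintf(log, "unmount done:  %s (error %d)\n", mounts[i].path, error);
        fflush(log);
        if (error != 0)
            failures++;
    }
    return failures;
}
```

In a real kernel the two fprintf() calls would be kprintf() calls placed around the per-filesystem unmount in the iterator; the exact location depends on the source tree being debugged.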

Actions #5

Updated by sepherosa over 13 years ago

On Wed, Apr 6, 2011 at 2:16 PM, Matthew Dillon wrote:

   Hmm.  If the uptime is not displayed this implies that the
   vfs_unmountall() call is getting stuck.

Yes, vfs_unmountall() is what blocks the reboot.

With the following patch:
http://leaf.dragonflybsd.org/~sephe/umountall_print.diff

In single user mode:
make installworld && make upgrade && reboot
...
...
hammer callback start
<---------- ("nobusy callback done" is not logged, and reboot stops here)

Best Regards,
sephe

Actions #6

Updated by sepherosa over 13 years ago

On Mon, Apr 11, 2011 at 4:46 PM, Sepherosa Ziehau wrote:

In single user mode:
make installworld && make upgrade && reboot
...
...
hammer callback start
<---------- ("nobusy callback done" is not logged, and reboot stops here)

The output of df(1):
http://leaf.dragonflybsd.org/~sephe/df.txt

One more thing: if I run sync(8) several times after installworld &&
upgrade, the reboot does not seem to hang.

Best Regards,
sephe

Actions #7

Updated by dillon over 13 years ago

I've figured it out.  I looked at the core Francois Tigeot provided
(sorry if others were provided before, it was on my list!)... in
any case, there is a bug in the HAMMER flusher which can cause it
to lose track of the flush sequence number, which umountall can
trigger due to the extra flushes HAMMER does on unmount.

Commit e86903d84f840af38d1b452a6a6c624702373751 should fix it.

-Matt

Actions #8

Updated by eocallaghan over 13 years ago

Confirm fix?

Actions #9

Updated by sepherosa over 13 years ago

On Tue, Apr 12, 2011 at 11:30 PM, Edward O'Callaghan (via DragonFly
issue tracker) wrote:

Confirm fix?

Yeah, it is fixed.

----------
status: chatting -> testing

_____________________________________________
DragonFly issue tracker
<http://bugs.dragonflybsd.org/issue2039>
_____________________________________________

Actions #10

Updated by eocallaghan over 13 years ago

Status: testing->resolved.

Cheers.
