Bug #2039: Sometimes, DragonFly 2.9 systems can not reboot - DragonFlyBSD - DragonFlyBSD bugtracker

Bug #2039

closed

Sometimes, DragonFly 2.9 systems can not reboot

Added by ftigeot almost 15 years ago. Updated almost 15 years ago.

Status: Closed
Priority: Urgent

Description

uname -a output:
DragonFly dfly32.zefyris.com 2.9-DEVELOPMENT DragonFly
v2.9.1.1027.gb133d-DEVELOPMENT #3: Sun Apr 3 10:24:40 CEST 2011
ftigeot@dfly32.zefyris.com:/usr/obj/usr/src/sys/GENERIC_SMP i386

The problem may be present on older versions of 2.9.

After issuing a "shutdown -r now" command as superuser, the system starts the
shutdown process.

The last console messages are:

Syncing disks...
  done.

The machine then waits forever; it can only be rebooted with a hard reset.

I'm setting the priority to critical, thinking about poor souls having to manage
remotely colocated servers.

History

Updated by sepherosa almost 15 years ago

sysctl hw.acpi.handle_reboot=1 && shutdown -r now

Does the above help?

Best Regards,
sephe
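
For context: hw.acpi.handle_reboot is read here as asking the kernel to reboot through ACPI rather than the default method. That reading is an assumption based on the sysctl's name, not something stated in this thread. Below is a minimal sketch of a guarded form of the command above. The helper name enable_acpi_reboot is hypothetical; only the sysctl name and the shutdown command come from the comment.

```shell
#!/bin/sh
# Hypothetical wrapper (not from the thread): set hw.acpi.handle_reboot=1
# only if the running kernel exposes that sysctl, and report failure otherwise.
enable_acpi_reboot() {
    # "sysctl -n NAME" prints just the value; a non-zero status is taken
    # to mean the sysctl is missing (or sysctl itself is unavailable).
    if ! sysctl -n hw.acpi.handle_reboot >/dev/null 2>&1; then
        echo "hw.acpi.handle_reboot not available" >&2
        return 1
    fi
    sysctl hw.acpi.handle_reboot=1
}

# Example (as root):
#   enable_acpi_reboot && shutdown -r now
```

The guard only avoids calling shutdown when the setting could not be applied; it does not tell you whether ACPI reboot works on a given machine.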


Updated by ftigeot almost 15 years ago

On Tue, Apr 05, 2011 at 09:43:36AM +0000, Sepherosa Ziehau (via DragonFly issue tracker) wrote:

> sysctl hw.acpi.handle_reboot=1 && shutdown -r now
>
> Does the above help?

Sorry, I can't test: I do not have access to this machine anymore.


Updated by sepherosa almost 15 years ago

I have found a reliable way to trigger it:
switch to single user mode
mount -a
cd /usr/src
make installworld && make upgrade && reboot

print_uptime() has not been called in my case.

Best Regards,
sephe


Updated by dillon almost 15 years ago

:I have found a reliable way to trigger it:
:switch to single user mode
:mount -a
:cd /usr/src
:make installworld && make upgrade && reboot
:
:print_uptime() has not been called in my case.
:
:Best Regards,
:sephe

Hmm.  If the uptime is not displayed this implies that the
    vfs_unmountall() call is getting stuck.

Try with the latest master; I made some adjustments that might affect
    raw device closes.  If swapcache is turned on, try turning it off
    (though my latest patch turns it off on shutdown automatically, too).

So far I cannot replicate the issue on my test box.  I did have
    reboot issues in the past related to swapcache, but those are gone now
    with my recent commits.

Other possible causes: tmpfs, vn, usb mounts, procfs, etc.

If you can reliably replicate the problem you may have to add a bunch
    of kprintf()'s to the umountall iterator to track down which filesystem
    is getting stuck.  I have included a df of my test box below.

-Matt

Filesystem                          1K-blocks       Used       Avail  Capacity  Mounted on
ROOT                                195452928   64848560   130604368       33%  /
devfs                                       1          1           0      100%  /dev
/dev/serno/L41K2H5G.s1a                774094     216628      495540       30%  /boot
/pfs/@@-1:00001                     195452928   64848560   130604368       33%  /var
/pfs/@@-1:00002                     195452928   64848560   130604368       33%  /tmp
/pfs/@@-1:00003                     195452928   64848560   130604368       33%  /usr
/pfs/@@-1:00004                     195452928   64848560   130604368       33%  /home
/pfs/@@-1:00005                     195452928   64848560   130604368       33%  /usr/obj
/pfs/@@-1:00006                     195452928   64848560   130604368       33%  /var/crash
/pfs/@@-1:00007                     195452928   64848560   130604368       33%  /var/tmp
BUILD                               104398848   42510688    61888160       41%  /build3
procfs                                      4          4           0      100%  /proc
apollo.backplane.com:/usr/src      1934024704  292436496  1641588208       15%  /usr/src
apollo.backplane.com:/usr/src-misc 1934024704  292436496  1641588208       15%  /usr/src-misc
apollo.backplane.com:/usr/pkgsrc   1934024704  292436496  1641588208       15%  /usr/pkgsrc
apollo.backplane.com:/netboot1     1934024704  292436496  1641588208       15%  /netboot1
test29#
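
Matt's suggestion is to instrument the kernel's unmount iterator with kprintf() calls. A rough user-space approximation of the same idea, which is not the kernel change he describes, is to walk the mount points from df(1) in reverse and log a start/done pair around each unmount. The helper names reverse_mounts and trace_unmounts and the UMOUNT_CMD variable are hypothetical. The sketch assumes df lists filesystems in mount order and that no mount point contains spaces.

```shell
#!/bin/sh
# Hypothetical helper (not from the thread): read df(1) output on stdin and
# print the mount points in reverse order, last listed first.
# The mount point is taken to be the last whitespace-separated field; the
# header line and short lines (such as a trailing prompt) are skipped.
reverse_mounts() {
    awk 'NR > 1 && NF >= 6 { m[++n] = $NF }
         END { for (i = n; i >= 1; i--) print m[i] }'
}

# Print a start/done pair around each unmount, so the last "start" without
# a matching "done" names the filesystem that hangs.
# UMOUNT_CMD defaults to "echo umount" (a dry run); set UMOUNT_CMD=umount
# to really unmount.
trace_unmounts() {
    reverse_mounts | while read -r mp; do
        echo "umount start: $mp"
        ${UMOUNT_CMD:-echo umount} "$mp"
        echo "umount done:  $mp"
    done
}

# Example (dry run):
#   df | trace_unmounts
```

A manual umount does not necessarily follow the same code path as the unmount done at shutdown, so a clean run here does not rule out a hang in vfs_unmountall().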


Updated by sepherosa almost 15 years ago

On Wed, Apr 6, 2011 at 2:16 PM, Matthew Dillon
<dillon@apollo.backplane.com> wrote:

> Hmm. If the uptime is not displayed this implies that the
> vfs_unmountall() call is getting stuck.

Yes, vfs_unmountall() is what blocks the reboot.

With the following patch:
http://leaf.dragonflybsd.org/~sephe/umountall_print.diff

In single user mode:
make installworld && make upgrade && reboot
...
...
hammer callback start
<---------- ("nobusy callback done" is not logged, and reboot stops here)

Best Regards,
sephe


Updated by sepherosa almost 15 years ago

The output of df(1):
http://leaf.dragonflybsd.org/~sephe/df.txt

One more thing: if I run sync(8) several times after installworld &&
upgrade, the reboot does not seem to hang.

Best Regards,
sephe
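
The sync(8) observation above can be turned into a small stopgap. This is a minimal sketch, not a fix: the helper name sync_passes and the SYNC_CMD and SYNC_DELAY variables are hypothetical, and the default of three passes is an arbitrary choice, since the comment only says "several".

```shell
#!/bin/sh
# Hypothetical helper (not from the thread): run sync(8) a few times,
# pausing between passes.
# SYNC_CMD defaults to "sync"; override it (e.g. SYNC_CMD=true) for a dry run.
# SYNC_DELAY is the pause in seconds between passes (default 1).
sync_passes() {
    passes=${1:-3}
    i=1
    while [ "$i" -le "$passes" ]; do
        echo "sync pass $i/$passes"
        ${SYNC_CMD:-sync} || return 1
        sleep "${SYNC_DELAY:-1}"
        i=$((i + 1))
    done
}

# Example, after "make installworld && make upgrade":
#   sync_passes 3 && reboot
```

Repeated syncs only make the hang less likely in this report; they do not address the underlying cause.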


Updated by dillon almost 15 years ago

I've figured it out. I looked at the core Francois Tigeot provided
(sorry if others were provided before, it was on my list!). In
any case, there is a bug in the HAMMER flusher which can cause it
to lose track of the flush sequence number, and umountall can
trigger it due to the extra flushes HAMMER does on unmount.

Commit e86903d84f840af38d1b452a6a6c624702373751 should fix it.

-Matt


Updated by eocallaghan almost 15 years ago

Confirm fix?


Updated by sepherosa almost 15 years ago

On Tue, Apr 12, 2011 at 11:30 PM, Edward O'Callaghan (via DragonFly
issue tracker) <sinknull@leaf.dragonflybsd.org> wrote:

> Edward O'Callaghan <eocallaghan@auroraux.org> added the comment:
>
> Confirm fix?

Yeah, it is fixed.

----------
status: chatting -> testing

_____________________________________________
DragonFly issue tracker <bugs@lists.dragonflybsd.org>
<http://bugs.dragonflybsd.org/issue2039>
_____________________________________________


Updated by eocallaghan almost 15 years ago

Status: testing->resolved.

Cheers.
