Project

General

Profile

Actions

Bug #2823

closed

Intel Q9400 PC fails to boot waiting for the following device xpt interrupt routing is probably hosed

Added by davshao almost 9 years ago. Updated almost 9 years ago.

Status:
Closed
Priority:
Normal
Assignee:
-
Category:
-
Target version:
-
Start date:
06/07/2015
Due date:
% Done:

100%

Estimated time:

Description

For one Intel Core 2 Q9400 machine, booting fails with console output listed below. The same problem has not been observed on a couple of newer generation Intel-based PCs. Bisection seems to indicate the problem began between, inclusive, master commit

commit 6f1a15dc79a822710cc37e99f6a8bd9910e5e3f1
Date: Sat Jun 6 16:17:41 2015 -0700

sysctl - SMP performance work

and

commit bb06d795144b04574aa0b5fdb94e95ab77360a1a
Date: Sun Jun 7 04:16:12 2015 +0200

<sys/sysctl.h>: Include <sys/lock.h> only for the kernel (unbreaks world).

Attached is a full verbose dmesg from a successful boot with a previous kernel.

Console output for unsuccessful boot:

CAM: finished configuring all busses
WARNING waiting for the following device to finish configuring:
xpt: func=0xffffffff8029ea32 arg=0
WARNING waiting for the following device to finish configuring:
xpt: func=0xffffffff8029ea32 arg=0
WARNING waiting for the following device to finish configuring:
xpt: func=0xffffffff8029ea32 arg=0
WARNING waiting for the following device to finish configuring:
xpt: func=0xffffffff8029ea32 arg=0
WARNING waiting for the following device to finish configuring:
xpt: func=0xffffffff8029ea32 arg=0
Giving up, interrupt routing is probably hosed
no B_DEVMAGIC (bootdev=0)
ATA PseudoRAID loaded
Mounting root from hammer:serno/WD-WMATV8216330.s1d
no disk named 'serno/WD-WMATV8216330.s1d'
hammer_mountroot: can't find devvp
kthread 0xffffffe056318c80 syncer3 has exited
Root mount failed:6

Manual root filesystem specification
<fstype>:<device> Specify root (e.g. ufs:da0s1a)
? List valid disk boot devices
panic Just panic
abort Abort manual input

mountroot>

Dmesg output for earlier kernel, successful boot:

sl0: bpf attached
hpt27xx: no controller detected.
hptrr: no controller detected.
CAM: Configuring bus: ahci0
CAM: Configuring bus: ahci0
CAM: Configuring bus: ahci0
CAM: Configuring bus: ahci0
CAM: Configuring bus: ahci0
CAM: Configuring bus: ahci0
CAM: Configuring 6 busses
CAM: Finished configuring bus: ahci0
CAM: Finished configuring bus: ahci0
CAM: Finished configuring bus: ahci0
CAM: Finished configuring bus: ahci0
CAM: Finished configuring bus: ahci0
CAM: Finished configuring bus: ahci0
CAM: finished configuring all busses
pass0 at ahci0 bus 1 target 0 lun 0
pass0: <SATA WDC WD1002FBYS-0 03.0> Fixed Direct Access SCSI-4 device
pass0: Serial Number WD-WMATV5981256
pass0: 300.000MB/s transfers
pass1 at ahci0 bus 2 target 0 lun 0
pass1: <SATA WDC WD1002FBYS-0 03.0> Fixed Direct Access SCSI-4 device
pass1: Serial Number WD-WMATV8216330
pass1: 300.000MB/s transfers


Files

q9400_dmesg_20150607.txt (53.3 KB) q9400_dmesg_20150607.txt davshao, 06/07/2015 08:36 AM
revertsysctlsmp.diff (2.9 KB) revertsysctlsmp.diff davshao, 06/11/2015 05:57 AM
camsysctlfix.diff (2.36 KB) camsysctlfix.diff davshao, 06/21/2015 01:17 AM
Actions #1

Updated by davshao almost 9 years ago

Reverting almost all of

commit 6f1a15dc79a822710cc37e99f6a8bd9910e5e3f1
Date: Sat Jun 6 16:17:41 2015 -0700

sysctl - SMP performance work

restores booting on the Intel Q9400 Eaglelake machine. The diff used is attached.

The good news is that reverting back to the usage of exclusive locks is not what corrected the booting problem. Therefore the problem might not be one of locking logic. Instead, what corrected booting was reverting adding the struct lock field to struct sysctl_oid:

diff --git a/sys/sys/sysctl.h b/sys/sys/sysctl.h
index eb10fa8..0c08764 100644
--- a/sys/sys/sysctl.h
++ b/sys/sys/sysctl.h
@ -153,7 +153,7 @ struct sysctl_oid {
int oid_refcnt;
u_int oid_running;
const char oid_descr;
- struct lock oid_lock; /
per-node lock /
/ struct lock oid_lock; / / per-node lock */
};

Considering I have had no problems whatsoever booting other machines running latest master, I wonder if a crucial difference may be that the partitions in the slice where DragonFly is installed on the Q9400 machine were formatted years before the partitions of the other machines. Is it possible that years ago, the layout of struct sysctl_oid was somehow hard-coded into the layout of partitions or how hammer was initially formatted?

Or alternately the partition containing /boot in the older installation on the Q9400 machine had the default back then of only being around 750m:

$ df -h
Filesystem Size Used Avail Capacity Mounted on
ROOT 130G 75G 55G 58% /
devfs 1.0K 1.0K 0B 100% /dev
/dev/serno/WD-WMATV8216330.s1a 756M 619M 77M 89% /boot
/pfs/@-1:00001 130G 75G 55G 58% /var
/pfs/
@-1:00002 130G 75G 55G 58% /tmp
/pfs/@-1:00003 130G 75G 55G 58% /usr
/pfs/
@-1:00004 130G 75G 55G 58% /home
/pfs/@-1:00005 130G 75G 55G 58% /usr/obj
/pfs/
@-1:00006 130G 75G 55G 58% /var/crash
/pfs/@@-1:00007 130G 75G 55G 58% /var/tmp
procfs 4.0K 4.0K 0B 100% /proc

Actions #2

Updated by davshao almost 9 years ago

Attached is a partial fix. There are two initializations that seem to be missing, a M_ZERO for a kmalloc for a cam_eb struct in cam_xpt.c and a LOCK_INITIALIZER for the new lock field in struct sysctl_oid in the macro SYSCTL_OID. The attached patch allows the Core 2 Intel Q9400 to boot in ordinary but not in bootverbose mode. When booting with bootverbose there is an error message about "can't re-use a leaf (WD_WMATV5981256)" where that is apparently part of the configuration of da0:

da0 at ahci0 bus 1 target 0 lun 0
da0: <SATA WDC WD1002FBYS-0 03.0> Fixed Direct Access SCSI-4 device
da0: Serial Number WD-WMATV5981256
da0: 300.000MB/s transfers

The missing M_ZERO for allocating a new struct cam_eb appears to make the initial value of the counted_to_config field random. I have included additional debugging information in the bootverbose sections of cam_xpt.c that allowed me to see that 6 busses ahci0.0 through ahci0.5 were being asked to configured by function xptconfigfunc() in cam_xpt.c, but only 5 were being counted by xptconfigbuscountfunc(), with ahci0.0 missing.

I can only conjecture as to the necessity of LOCK_INITIALIZER for the lock field of struct sysctl_oid, but it appears to me that macro SYSCTL_OID from sysctl.h in sys/dev/raid/hptmv/hptproc.c is used to initialize a node hptmv.status in the sysctl tree whose root hptmv is somewhat an outlier. This may also be related to why I have only seen this difficulty booting one one older machine.

Actions #3

Updated by sepherosa almost 9 years ago

Only zeroing counted_to_config is committed.

The LOCK_INITIALIZER is not necessary, since it will be initialized by
the sysctl reg function.

And I think the only rest of the bug is the 'sysctl reuse leaf'. Did
you configure any disk schedulers? However, I don't think the code
path generating the warning will cause any harm, e.g. prevent your
machine from booting.

Thanks,
sephe

On Sun, Jun 21, 2015 at 4:17 PM,
<> wrote:

Issue #2823 has been updated by davshao.

File camsysctlfix.diff added

Attached is a partial fix. There are two initializations that seem to be missing, a M_ZERO for a kmalloc for a cam_eb struct in cam_xpt.c and a LOCK_INITIALIZER for the new lock field in struct sysctl_oid in the macro SYSCTL_OID. The attached patch allows the Core 2 Intel Q9400 to boot in ordinary but not in bootverbose mode. When booting with bootverbose there is an error message about "can't re-use a leaf (WD_WMATV5981256)" where that is apparently part of the configuration of da0:

da0 at ahci0 bus 1 target 0 lun 0
da0: <SATA WDC WD1002FBYS-0 03.0> Fixed Direct Access SCSI-4 device
da0: Serial Number WD-WMATV5981256
da0: 300.000MB/s transfers

The missing M_ZERO for allocating a new struct cam_eb appears to make the initial value of the counted_to_config field random. I have included additional debugging information in the bootverbose sections of cam_xpt.c that allowed me to see that 6 busses ahci0.0 through ahci0.5 were being asked to configured by function xptconfigfunc() in cam_xpt.c, but only 5 were being counted by xptconfigbuscountfunc(), with ahci0.0 missing.

I can only conjecture as to the necessity of LOCK_INITIALIZER for the lock field of struct sysctl_oid, but it appears to me that macro SYSCTL_OID from sysctl.h in sys/dev/raid/hptmv/hptproc.c is used to initialize a node hptmv.status in the sysctl tree whose root hptmv is somewhat an outlier. This may also be related to why I have only seen this difficulty booting one one older machine.

----------------------------------------
Bug #2823: Intel Q9400 PC fails to boot waiting for the following device xpt interrupt routing is probably hosed
http://bugs.dragonflybsd.org/issues/2823#change-12678

  • Author: davshao
  • Status: New
  • Priority: Normal
  • Assignee:
  • Category:
  • Target version:
    ----------------------------------------
    For one Intel Core 2 Q9400 machine, booting fails with console output listed below. The same problem has not been observed on a couple of newer generation Intel-based PCs. Bisection seems to indicate the problem began between, inclusive, master commit

commit 6f1a15dc79a822710cc37e99f6a8bd9910e5e3f1
Date: Sat Jun 6 16:17:41 2015 -0700

sysctl - SMP performance work

and

commit bb06d795144b04574aa0b5fdb94e95ab77360a1a
Date: Sun Jun 7 04:16:12 2015 +0200

<sys/sysctl.h>: Include <sys/lock.h> only for the kernel (unbreaks world).

Attached is a full verbose dmesg from a successful boot with a previous kernel.

Console output for unsuccessful boot:

CAM: finished configuring all busses
WARNING waiting for the following device to finish configuring:
xpt: func=0xffffffff8029ea32 arg=0
WARNING waiting for the following device to finish configuring:
xpt: func=0xffffffff8029ea32 arg=0
WARNING waiting for the following device to finish configuring:
xpt: func=0xffffffff8029ea32 arg=0
WARNING waiting for the following device to finish configuring:
xpt: func=0xffffffff8029ea32 arg=0
WARNING waiting for the following device to finish configuring:
xpt: func=0xffffffff8029ea32 arg=0
Giving up, interrupt routing is probably hosed
no B_DEVMAGIC (bootdev=0)
ATA PseudoRAID loaded
Mounting root from hammer:serno/WD-WMATV8216330.s1d
no disk named 'serno/WD-WMATV8216330.s1d'
hammer_mountroot: can't find devvp
kthread 0xffffffe056318c80 syncer3 has exited
Root mount failed:6

Manual root filesystem specification
<fstype>:<device> Specify root (e.g. ufs:da0s1a)
? List valid disk boot devices
panic Just panic
abort Abort manual input

mountroot>

Dmesg output for earlier kernel, successful boot:

sl0: bpf attached
hpt27xx: no controller detected.
hptrr: no controller detected.
CAM: Configuring bus: ahci0
CAM: Configuring bus: ahci0
CAM: Configuring bus: ahci0
CAM: Configuring bus: ahci0
CAM: Configuring bus: ahci0
CAM: Configuring bus: ahci0
CAM: Configuring 6 busses
CAM: Finished configuring bus: ahci0
CAM: Finished configuring bus: ahci0
CAM: Finished configuring bus: ahci0
CAM: Finished configuring bus: ahci0
CAM: Finished configuring bus: ahci0
CAM: Finished configuring bus: ahci0
CAM: finished configuring all busses
pass0 at ahci0 bus 1 target 0 lun 0
pass0: <SATA WDC WD1002FBYS-0 03.0> Fixed Direct Access SCSI-4 device
pass0: Serial Number WD-WMATV5981256
pass0: 300.000MB/s transfers
pass1 at ahci0 bus 2 target 0 lun 0
pass1: <SATA WDC WD1002FBYS-0 03.0> Fixed Direct Access SCSI-4 device
pass1: Serial Number WD-WMATV8216330
pass1: 300.000MB/s transfers

---Files--------------------------------
q9400_dmesg_20150607.txt (53.3 KB)
revertsysctlsmp.diff (2.9 KB)
camsysctlfix.diff (2.36 KB)

--
You have received this notification because you have either subscribed to it, or are involved in it.
To change your notification preferences, please click here: http://bugs.dragonflybsd.org/my/account

--
Tomorrow Will Never Die

Actions #4

Updated by davshao almost 9 years ago

I have not experienced any problems for about a week whether booting a master kernel with or without boot_verbose=YES on the affected machine. I conclude at least one of the extensive series of bug fixes repaired the underlying race condition.

While the problem existed, I had added some print statements to get an error message such as:

da0: 953869MB (1953525168 512 byte sectors: 255H 63S/T 121601C)
ERROR sysctl_add_oid()
can't re-use a leaf (WD-WMATV5981256)!
kind C2000003, refcount 1
descr (policy)
fmt(A)
da0 at ahci0 bus 1 target 0 lun 0

The relevant sysctl leaf appears to have been:

dsched.policy.WD-WMATV5981256: noop

As the bug no longer seems to manifest after a decent length of time, I consider it closed. Thanks for the reply that indicated a similar hypothesis for the problem.

Actions #5

Updated by davshao almost 9 years ago

  • Status changed from New to Closed
  • % Done changed from 0 to 100
Actions

Also available in: Atom PDF