Bug #2652

189a0ff3761b47 ... ix: Implement MSI-X support locks up Lenovo S10 Intel Atom n270

Added by davshao 10 months ago. Updated 7 months ago.

Status: New
Start date: 03/03/2014
Priority: Normal
Due date:
Assignee: -
% Done: 0%
Category: -
Target version: -
Description

On an i386 Lenovo S10 netbook (Intel Atom N270), bisection indicates that commit
189a0ff3761b47 ... ix: Implement MSI-X support and enable multiple TX rings
locks up the machine during boot at this point:

...
md0: invalid primary partition table: no magic
Math emulator present
hptrr: RocketRAID 17xx/2xxx SATA controller driver v1.2
hpt27xx: RocketRAID 27xx controller driver v1.0 (Feb 28 2014 21:38:19)

Attached is a full verbose dmesg from the same machine running master
just before the above commit. The machine only boots fully with
ACPI disabled using hint.acpi.0.disabled=1, but even with ACPI enabled the
lockup caused by the problematic commit occurs earlier than the usual lockup
seen with ACPI enabled. "Normally" on this machine, booting with ACPI enabled halts at

acpi0.nexus0.root0
acpi0: <LENOVO CB-01> [tentative] on motherboard
ACPI: All ACPI Tables successfully acquired
ACPI FADT: SCI testing interrupt mode ...
ACPI FADT: SCI testing level/high
IOAPIC: irq 9, gsi 9 edge/high -> level/high

lenovo_s10_dmesg.txt (36.5 KB) davshao, 03/03/2014 01:12 PM

xpt_init_tsleep.diff (1.26 KB) davshao, 03/07/2014 03:42 PM

History

#1 Updated by swildner 10 months ago

On Mon, 03 Mar 2014 22:20:17 +0100,
<> wrote:

> Issue #2652 has been reported by davshao.
>
> ----------------------------------------
> Bug #2652: 189a0ff3761b47 ... ix: Implement MSI-X support locks up
> Lenovo S10 Intel Atom n270
> http://bugs.dragonflybsd.org/issues/2652
>
> * Author: davshao
> * Status: New
> * Priority: Normal
> * Assignee:
> * Category:
> * Target version:
> ----------------------------------------
> For a i386 Lenovo S10 Intel Atom n270 netbook, bisection indicates using
> 189a0ff3761b47 ... ix: Implement MSI-X support and enable multiple TX
> rings
> locks up the machine on booting at the point:
>
> ...
> md0: invalid primary partition table: no magic
> Math emulator present
> hptrr: RocketRAID 17xx/2xxx SATA controller driver v1.2
> hpt27xx: RocketRAID 27xx controller driver v1.0 (Feb 28 2014 21:38:19)

I don't think the ix commit is related to it. This looks like the old i386
bug we have, where booting randomly fails sometimes. Here it happens
occasionally (more often after changes to /boot/loader.conf, it seems) and
after a few retries (reset button) it starts booting again.

What it is, I don't know, but I'm relatively sure that it is not related
to either the ix commit (which at this early point of the boot will not
affect anything) or the hpt27xx driver.

Sascha

#2 Updated by davshao 10 months ago

I built the kernel multiple times, using git reset --hard to step through the commits one by one, and the problems began exactly with the mentioned commit.

Furthermore, I would not assume that the hpt27xx driver can't be partially responsible. Doesn't the hpt_init() call in sys/dev/raid/hpt27xx/osm_bsd.c call an init_config() function that is, I believe, in turn defined as hpt27xx_init_config in hpt27xx_config.h? But hpt27xx_init_config() is not defined in the source. Is it a function called from the HighPoint binary blob?

#3 Updated by swildner 10 months ago

On Thu, 06 Mar 2014 08:38:14 +0100,
<> wrote:

> Issue #2652 has been updated by davshao.
>
>
> I built the kernel multiple times git reset --hard-ing commits
> one-by-one and the problems began exactly with the mentioned commit.
>
> Furthermore I would not make any assumption the hpt27xx driver can't be
> partially responsible as does not the hpt_init() call in
> sys/dev/raid/hpt27xx/osm_bsd.c call an init_config() function that is in
> turn I believed defined as hpt27xx_init_config in hpt27xx_config.h? But
> hpt27xx_init_config() is not defined in the source -- is it a function
> called from the HighPoint binary blob?

Yes, but I've had the exact same issue (random hangs at boot) independent
of whether the hpt* drivers were compiled in or not. It would just hang at
the "md0: ..." message then.

But a check would be easy. Just move the os_printk() in hpt_init() below
init_config(), or add another message after hpt_init(), and check, when it
hangs, whether that message was printed.
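
The check described above amounts to bracketing the suspect call with messages. A minimal userland sketch in C, under stated assumptions: `checkpoint()`, `init_config_stub()` and `hpt_init_sketch()` are hypothetical names used only for illustration, and `printf()` stands in for the driver's `os_printk()`. It is not the driver's actual code.

```c
#include <assert.h>
#include <stdio.h>

/* Number of checkpoint messages printed so far (illustration only). */
static int checkpoints_seen;

/* Print a numbered marker and flush, so it is visible even if the next call hangs. */
static void
checkpoint(const char *msg)
{
	assert(msg != NULL);
	printf("checkpoint %d: %s\n", ++checkpoints_seen, msg);
	fflush(stdout);
}

/* Hypothetical stand-in for the init_config() call into the binary blob. */
static void
init_config_stub(void)
{
}

/* Returns how many checkpoints were reached; 2 means the suspect call returned. */
static int
hpt_init_sketch(void)
{
	checkpoint("before init_config");
	init_config_stub();
	checkpoint("after init_config");
	return checkpoints_seen;
}
```

If only the first marker appears on the console before the hang, the bracketed call is the suspect. If both appear, the hang is further along in the boot sequence.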

Sascha

#4 Updated by davshao 10 months ago

Here's what I have determined using kprintf's.

The hang occurs in xpt_init() which is called by cam_module_event_handler() on MOD_LOAD in sys/bus/cam/cam_xpt.c.

The hang occurs at the register_swi() call in xpt_init().

The hang occurs at the call to int_moveto_destcpu() in register_int() in sys/kern/kern_intr.c.

The Intel Atom N270 has hyper-threading and thus has ncpus == 2. Perhaps it is a somewhat unfortunate accident
that SWI_CAMBIO == 195 (FIRST_SOFTINT + 3), forcing the moveto.

The hang occurs at the call to lwkt_migratecpu() in int_moveto_destcpu() in sys/kern/kern_intr.c.

The hang occurs at the call to lwkt_setcpu_self() in lwkt_migratecpu() in sys/kern/lwkt_thread.c.
(td->td_gd != rgd) == 1
(td->td_release != 0) == 0
td->td_flags == 3

The hang occurs at the call to lwkt_switch() in lwkt_setcpu_self() in sys/kern/lwkt_thread.c.
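
One possible reading of the "unfortunate accident" remark can be sketched in C. This assumes that a software interrupt registered without an explicit CPU is placed on `intr % ncpus`, and that the registering thread runs on CPU 0; neither is confirmed in this thread. `FIRST_SOFTINT` is taken as 192 only because the comment gives SWI_CAMBIO == 195 == FIRST_SOFTINT + 3, and the function names are hypothetical.

```c
#include <assert.h>

/* From the comment above: SWI_CAMBIO == 195 == FIRST_SOFTINT + 3. */
#define FIRST_SOFTINT	192
#define SWI_CAMBIO	(FIRST_SOFTINT + 3)

/* Assumed policy (not confirmed here): the default target CPU is intr % ncpus. */
static int
swi_default_cpu(int intr, int ncpus)
{
	assert(ncpus > 0);
	return intr % ncpus;
}

/* Nonzero when a thread on cur_cpu would have to migrate to register the handler. */
static int
swi_needs_migration(int intr, int ncpus, int cur_cpu)
{
	return swi_default_cpu(intr, ncpus) != cur_cpu;
}
```

Under these assumptions, 195 is odd, so with ncpus == 2 the handler lands on CPU 1 and a thread on CPU 0 has to migrate. With a single CPU no migration would be needed.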

#5 Updated by swildner 10 months ago

Can you post a diff with your kprintf()s somewhere? I don't quite understand yet why it would initialize CAM at this point.

#6 Updated by davshao 10 months ago

I got the idea for examining cam_xpt.c because I saw in
sys/dev/raid/hpt27xx/osm_bsd.c the line

SYSINIT(hptinit, SI_SUB_CONFIGURE, SI_ORDER_FIRST, hpt_init, NULL);

Then doing a search for all other uses of SI_SUB_CONFIGURE, I saw in

sys/bus/cam/cam_xpt.c

DECLARE_MODULE(cam, cam_moduledata, SI_SUB_CONFIGURE, SI_ORDER_SECOND);

and observed the manpage for DECLARE_MODULE said that SYSINIT was called.
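
The ordering argument can be sketched as a sort on (subsystem, order). This is a simplified userland model, not the kernel's SYSINIT implementation: `struct init_entry`, the `*_SKETCH` constants and their numeric values are hypothetical, and only their relative order matters.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical numeric values; only their relative order matters here. */
enum { SUB_CONFIGURE_SKETCH = 0x3800000 };
enum { ORDER_FIRST_SKETCH = 0, ORDER_SECOND_SKETCH = 1 };

struct init_entry {
	unsigned int subsystem;	/* which boot phase the entry belongs to */
	unsigned int order;	/* position within that phase */
	const char *name;	/* label for illustration */
};

/* Compare by subsystem first, then by order within the subsystem. */
static int
init_entry_cmp(const void *a, const void *b)
{
	const struct init_entry *x = a, *y = b;

	if (x->subsystem != y->subsystem)
		return (x->subsystem > y->subsystem) - (x->subsystem < y->subsystem);
	return (x->order > y->order) - (x->order < y->order);
}

/* Sort entries the way a SYSINIT-style boot loop would walk them. */
static void
sort_init_entries(struct init_entry *v, size_t n)
{
	assert(v != NULL);
	qsort(v, n, sizeof(*v), init_entry_cmp);
}
```

With one entry for hptinit (first) and one for cam (second) in the same subsystem, hptinit sorts ahead of cam. That would be consistent with the CAM initialization running right after the hpt banners are printed.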

By the way, FreeBSD 10.0 at least has a

TUNABLE_INT("kern.cam.boot_delay", &xsoftc.boot_delay);
SYSCTL_INT(_kern_cam, OID_AUTO, boot_delay, CTLFLAG_RDTUN,
&xsoftc.boot_delay, 0, "Bus registration wait time");

in its implementation of cam_xpt.c

#7 Updated by davshao 10 months ago

The following patch to sys/bus/cam/cam_xpt.c that adds a tsleep before the call to register_swi() enables the Lenovo S10 (i386 Intel Atom N270) with

set hint.acpi.0.disabled=1
set kern.cam.init_delay=3000

to boot successfully again. Considering other problems such as

http://bugs.dragonflybsd.org/issues/2653

this won't be the last tsleep that needs to be added for this particular machine.

#8 Updated by swildner 9 months ago

Can you check if the two commits by Matt from today, 7da5c4fd138dc6e5035f5d7aff78a14b09b296e8 and mainly 902419bf6d9fd0f80afc9d07cd4b3e99d20f23ca, change anything about this issue?

Thanks,
Sascha

#9 Updated by swildner 7 months ago

Is this issue still happening to you? If not, can you close this ticket please?
