Bug #1077

Kernel lockup with SMP on -HEAD

Added by russiane39 about 6 years ago. Updated over 5 years ago.

Status:ClosedStart date:
Priority:HighDue date:
Assignee:-% Done:

0%

Category:-
Target version:-

Description

I've got a buddy @ Intel, and I've grabbed some platforms /
motherboards from him to test Dragonfly hardware compatilibilty, the
results shocked me - only one motherboard was compatible with
DragonFly, another 8 just hang kernel when I turn SMP on, /me tired
with ACPICA_IO turned on and turned off - nothing changes. The only
one motherboard, which works with DragonyFLY is rather old ASUS
workstation model, based on i975 chipset, another 8 motherboards was
from Intel,
5 motherboards from S5000 series, S3210SHLX, D975XBX2 and one SR1500
Intel platform. CPU's used in test - Core Quad Q6600 for LGA775
platform, Intel Xeon 5310 for LGA771 platform. All motherboards was
with latest bios firmware.
The hang I've noticed is extremely tricky to debug, I'm trying to find
problem for 4 days already, but can't find solution. Hang usually
occurs (as I could see visually) after detecting all devices, and
before (or maybe slightly after) calling /sbin/init, at least before
filesystems is mounted. I've talked with sephe about this issue, first
we thought it was bug of nata driver, but I've redone test with
Symbiosis SCSI controller, hang occurs absolutely at same place. All
my attempts at remote kgdb sessions failed, so the only meaningful
output from db> I'm sending to you. The problem with kgdb was
following:
kgdb -r /dev/cuaa0 kernel
Initiating remote session (or smthing like it)
Reply contains invalid hex digit 45
Session ended.
Feel free to contact me, so I can assist you in fixing that awful
issue. I should note, that DragonFly works nicely on all those systems
w/o SMP, and works only on one system with SMP.

db> ps
pid lwp uid ppid pgrp pflag lflag stat wmesg wchan cmd
1 cbdfafd0 0 0 0 000200 000000 1 1 swapper
0 c02f9c00 0 0 0 000200 000000 2 3 conifhk c02e0e60 swapper
oops, ran out of processes early!
cpu 0 tdrunqmask 10000000 curthread 0xc02fae78 reqflags 0012
INCOMING IPIQS:
tdq thread pid flags pri/cs/mp sp wmesg comm
28 0xc02fae78 -1 00210403 28/01/02 0xd269bce4 - ithread 1

tdq thread pid flags pri/cs/mp sp wmesg comm
32 0xff8003a4 -1 00200000 12/01/00 0xff807d30 - idle_0
32 0xc03037c0 -1 00010400 12/01/01 0xcbdaad88 - ithread emerg
32 0xc02fad64 -1 00210400 28/01/00 0xcbdadd88 - ithread 0
32 0xc030e6d4 -1 00010400 14/02/01 0xcbdb0d18 - softclock 0
32 0xc02ff6b4 -1 00210400 14/01/00 0xcbdbcd88 - ithread 68
32 0xc03064c4 -1 00000400 12/01/01 0xcbdbfd88 - dsched 0
32 0xc03055a0 -1 00000400 12/01/01 0xcfc5ad88 - usched 0
32 0xd24d8ac0 1 00000600 10/01/01 0xd2470d34 - swapper
32 0xc0311ec0 -1 00040400 12/02/01 0xd2473cc4 waitport netisr_cpu 0
32 0xc0310d20 -1 00040400 12/02/00 0xd247fcb8 waitport ifnet 0
32 0xd24d8bb0 -1 00000e80 12/03/01 0xd24ebc8c ccb_scan xpt_thrd
32 0xc02ff5a0 -1 00210400 14/02/01 0xd24eece4 - ithread 67
32 0xd24d8ca0 -1 00000680 12/03/01 0xd24f1c9c tqthr taskqueue 0
32 0xc02ff7c8 -1 00210400 14/02/01 0xd24fdce4 - ithread 69
32 0xc02fb718 -1 00210400 28/01/00 0xd2642d88 - ithread 9
32 0xc02fb940 -1 00210400 28/01/00 0xd2665d88 - ithread 11
32 0xc02fb82c -1 00210400 28/02/01 0xd2688ce4 - ithread 10
32 0xc02fb3dc -1 00210400 28/02/01 0xd2697ce4 - ithread 6
32 0xc02fae78 -1 00210403 28/01/02 0xd269bce4 - ithread 1
32 0xc02ff264 -1 00210400 14/01/00 0xd269ed88 - ithread 64
32 0xc02fb1b4 -1 00210400 28/01/00 0xd27e2d88 - ithread 4
32 0xd24d9330 -1 00000680 12/02/01 0xd2af5c80 rwait random
32 0xc0337fa0 -1 00040400 12/02/01 0xd2afccc4 waitport udp_thread 0
32 0xc0337120 -1 00040400 12/02/01 0xd2dcbcc0 waitport tcp_thread 0
32 0xd24d9510 -1 00040600 12/02/01 0xd2dd7cc0 waitport rtable_cpu 0
32 0xc02f9d20 0 00000080 12/02/01 0xc0377ca0 conifhk swapper
cpu 1 tdrunqmask 00000000 curthread 0xff8083a4 reqflags 0000
INCOMING IPIQS:
tdq thread pid flags pri/cs/mp sp wmesg comm
tdq thread pid flags pri/cs/mp sp wmesg comm
32 0xff8083a4 -1 00200001 12/01/00 0xff80fd30 - idle_1
32 0xc030e7d0 -1 00010400 14/02/01 0xcbdb3d18 - softclock 1
32 0xc03065b4 -1 00000400 12/01/01 0xcfc51d88 - dsched 1
32 0xc0305690 -1 00000400 12/01/01 0xcfc5dd88 - usched 1
32 0xc0311fa8 -1 00040400 12/02/01 0xd2476cc4 waitport netisr_cpu 1
32 0xc0310e08 -1 00040400 12/02/00 0xd24e2cb8 waitport ifnet 1
32 0xd24d8d90 -1 00000680 12/03/01 0xd24f4c9c tqthr taskqueue 1
32 0xc0338088 -1 00040400 12/02/01 0xd2affcc4 waitport udp_thread 1
32 0xc0337208 -1 00040400 12/02/01 0xd2dcecc0 waitport tcp_thread 1
32 0xd44a9700 -1 00040600 12/02/01 0xd2ddacc0 waitport rtable_cpu 1
cpu 2 tdrunqmask 00000000 curthread 0xff8103a4 reqflags 0000
INCOMING IPIQS:
tdq thread pid flags pri/cs/mp sp wmesg comm
tdq thread pid flags pri/cs/mp sp wmesg comm
32 0xff8103a4 -1 00200001 12/01/00 0xff817d30 - idle_2
32 0xc030e8cc -1 00010400 14/02/01 0xcbdb6d18 - softclock 2
32 0xc03066a4 -1 00000400 12/01/01 0xcfc54d88 - dsched 2
32 0xc0305780 -1 00000400 12/01/01 0xcfc82d88 - usched 2
32 0xc0312090 -1 00040400 12/02/01 0xd2479cc4 waitport netisr_cpu 2
32 0xc0310ef0 -1 00040400 12/02/00 0xd24e5cb8 waitport ifnet 2
32 0xd24d8e80 -1 00000680 12/03/01 0xd24f7c9c tqthr taskqueue 2
32 0xc0338170 -1 00040400 12/02/01 0xd2dc5cc4 waitport udp_thread 2
32 0xc03372f0 -1 00040400 12/02/01 0xd2dd1cc0 waitport tcp_thread 2
32 0xd44e9700 -1 00040600 12/02/01 0xd2dddcc0 waitport rtable_cpu 2
cpu 3 tdrunqmask 00000000 curthread 0xff8183a4 reqflags 0000
INCOMING IPIQS:
tdq thread pid flags pri/cs/mp sp wmesg comm
tdq thread pid flags pri/cs/mp sp wmesg comm
32 0xff8183a4 -1 00200001 12/01/00 0xff81fd30 - idle_3
32 0xc030e9c8 -1 00010400 12/01/01 0xcbdb9d88 - softclock 3
32 0xc0306794 -1 00000400 12/01/01 0xcfc57d88 - dsched 3
32 0xc0305870 -1 00000400 12/01/01 0xcfc85d88 - usched 3
32 0xc0312178 -1 00040400 12/02/01 0xd247ccc4 waitport netisr_cpu 3
32 0xc0310fd8 -1 00040400 12/02/00 0xd24e8cb8 waitport ifnet 3
32 0xd24d8f70 -1 00000680 12/03/01 0xd24fac9c tqthr taskqueue 3
32 0xc0338258 -1 00040400 12/02/01 0xd2dc8cc4 waitport udp_thread 3
32 0xc03373d8 -1 00040400 12/02/01 0xd2dd4cc0 waitport tcp_thread 3
32 0xd4529700 -1 00040600 12/02/01 0xd382bcc0 waitport rtable_cpu 3
CURCPU 0 CURTHREAD 0xc02fae78 (-1)
db>
pid lwp uid ppid pgrp pflag lflag stat wmesg wchan cmd
1 cbdfafd0 0 0 0 000200 000000 1 1 swapper
0 c02f9c00 0 0 0 000200 000000 2 3 conifhk c02e0e60 swapper
oops, ran out of processes early!
cpu 0 tdrunqmask 10000000 curthread 0xc02fae78 reqflags 0012
INCOMING IPIQS:
tdq thread pid flags pri/cs/mp sp wmesg comm
28 0xc02fae78 -1 00210403 28/01/02 0xd269bce4 - ithread 1

tdq thread pid flags pri/cs/mp sp wmesg comm
32 0xff8003a4 -1 00200000 12/01/00 0xff807d30 - idle_0
32 0xc03037c0 -1 00010400 12/01/01 0xcbdaad88 - ithread emerg
32 0xc02fad64 -1 00210400 28/01/00 0xcbdadd88 - ithread 0
32 0xc030e6d4 -1 00010400 14/02/01 0xcbdb0d18 - softclock 0
32 0xc02ff6b4 -1 00210400 14/01/00 0xcbdbcd88 - ithread 68
32 0xc03064c4 -1 00000400 12/01/01 0xcbdbfd88 - dsched 0
32 0xc03055a0 -1 00000400 12/01/01 0xcfc5ad88 - usched 0
32 0xd24d8ac0 1 00000600 10/01/01 0xd2470d34 - swapper
32 0xc0311ec0 -1 00040400 12/02/01 0xd2473cc4 waitport netisr_cpu 0
32 0xc0310d20 -1 00040400 12/02/00 0xd247fcb8 waitport ifnet 0
32 0xd24d8bb0 -1 00000e80 12/03/01 0xd24ebc8c ccb_scan xpt_thrd
32 0xc02ff5a0 -1 00210400 14/02/01 0xd24eece4 - ithread 67
32 0xd24d8ca0 -1 00000680 12/03/01 0xd24f1c9c tqthr taskqueue 0
32 0xc02ff7c8 -1 00210400 14/02/01 0xd24fdce4 - ithread 69
32 0xc02fb718 -1 00210400 28/01/00 0xd2642d88 - ithread 9
32 0xc02fb940 -1 00210400 28/01/00 0xd2665d88 - ithread 11
32 0xc02fb82c -1 00210400 28/02/01 0xd2688ce4 - ithread 10
32 0xc02fb3dc -1 00210400 28/02/01 0xd2697ce4 - ithread 6
32 0xc02fae78 -1 00210403 28/01/02 0xd269bce4 - ithread 1
32 0xc02ff264 -1 00210400 14/01/00 0xd269ed88 - ithread 64
32 0xc02fb1b4 -1 00210400 28/01/00 0xd27e2d88 - ithread 4
32 0xd24d9330 -1 00000680 12/02/01 0xd2af5c80 rwait random
32 0xc0337fa0 -1 00040400 12/02/01 0xd2afccc4 waitport udp_thread 0
32 0xc0337120 -1 00040400 12/02/01 0xd2dcbcc0 waitport tcp_thread 0
32 0xd24d9510 -1 00040600 12/02/01 0xd2dd7cc0 waitport rtable_cpu 0
32 0xc02f9d20 0 00000080 12/02/01 0xc0377ca0 conifhk swapper
cpu 1 tdrunqmask 00000000 curthread 0xff8083a4 reqflags 0000
cpu 1 tdrunqmask 00000000 curthread 0xff8083a4 reqflags 0000
INCOMING IPIQS:
tdq thread pid flags pri/cs/mp sp wmesg comm
--More--
tdq thread pid flags pri/cs/mp sp wmesg comm
32 0xff8083a4 -1 00200001 12/01/00 0xff80fd30 - idle_1
32 0xc030e7d0 -1 00010400 14/02/01 0xcbdb3d18 - softclock 1
32 0xc03065b4 -1 00000400 12/01/01 0xcfc51d88 - dsched 1
32 0xc0305690 -1 00000400 12/01/01 0xcfc5dd88 - usched 1
32 0xc0311fa8 -1 00040400 12/02/01 0xd2476cc4 waitport netisr_cpu 1
32 0xc0310e08 -1 00040400 12/02/00 0xd24e2cb8 waitport ifnet 1
32 0xd24d8d90 -1 00000680 12/03/01 0xd24f4c9c tqthr taskqueue 1
32 0xc0338088 -1 00040400 12/02/01 0xd2affcc4 waitport udp_thread 1
32 0xc0337208 -1 00040400 12/02/01 0xd2dcecc0 waitport tcp_thread 1
32 0xd44a9700 -1 00040600 12/02/01 0xd2ddacc0 waitport rtable_cpu 1
cpu 2 tdrunqmask 00000000 curthread 0xff8103a4 reqflags 0000
INCOMING IPIQS:
tdq thread pid flags pri/cs/mp sp wmesg comm
--More--
tdq thread pid flags pri/cs/mp sp wmesg comm
32 0xff8103a4 -1 00200001 12/01/00 0xff817d30 - idle_2
32 0xc030e8cc -1 00010400 14/02/01 0xcbdb6d18 - softclock 2
32 0xc03066a4 -1 00000400 12/01/01 0xcfc54d88 - dsched 2
32 0xc0305780 -1 00000400 12/01/01 0xcfc82d88 - usched 2
32 0xc0312090 -1 00040400 12/02/01 0xd2479cc4 waitport netisr_cpu 2
32 0xc0310ef0 -1 00040400 12/02/00 0xd24e5cb8 waitport ifnet 2
32 0xd24d8e80 -1 00000680 12/03/01 0xd24f7c9c tqthr taskqueue 2
32 0xc0338170 -1 00040400 12/02/01 0xd2dc5cc4 waitport udp_thread 2
32 0xc03372f0 -1 00040400 12/02/01 0xd2dd1cc0 waitport tcp_thread 2
32 0xd44e9700 -1 00040600 12/02/01 0xd2dddcc0 waitport rtable_cpu 2
cpu 3 tdrunqmask 00000000 curthread 0xff8183a4 reqflags 0000
INCOMING IPIQS:
tdq thread pid flags pri/cs/mp sp wmesg comm
--More--
tdq thread pid flags pri/cs/mp sp wmesg comm
32 0xff8183a4 -1 00200001 12/01/00 0xff81fd30 - idle_3
32 0xc030e9c8 -1 00010400 12/01/01 0xcbdb9d88 - softclock 3
32 0xc0306794 -1 00000400 12/01/01 0xcfc57d88 - dsched 3
32 0xc0305870 -1 00000400 12/01/01 0xcfc85d88 - usched 3
32 0xc0312178 -1 00040400 12/02/01 0xd247ccc4 waitport netisr_cpu 3
32 0xc0310fd8 -1 00040400 12/02/00 0xd24e8cb8 waitport ifnet 3
32 0xd24d8f70 -1 00000680 12/03/01 0xd24fac9c tqthr taskqueue 3
32 0xc0338258 -1 00040400 12/02/01 0xd2dc8cc4 waitport udp_thread 3
32 0xc03373d8 -1 00040400 12/02/01 0xd2dd4cc0 waitport tcp_thread 3
32 0xd4529700 -1 00040600 12/02/01 0xd382bcc0 waitport rtable_cpu 3
CURCPU 0 CURTHREAD 0xc02fae78 (-1)

History

#1 Updated by russiane39 about 6 years ago

this ps obtained after system locking, "trace" from db> aren't give any
meaningful output

#2 Updated by dillon about 6 years ago

:..
:results shocked me - only one motherboard was compatible with
:DragonFly, another 8 just hang kernel when I turn SMP on, /me tired
:with ACPICA_IO turned on and turned off - nothing changes. The only
:one motherboard, which works with DragonyFLY is rather old ASUS
:workstation model, based on i975 chipset, another 8 motherboards was
:from Intel,
:5 motherboards from S5000 series, S3210SHLX, D975XBX2 and one SR1500
:...
:with latest bios firmware.
:The hang I've noticed is extremely tricky to debug, I'm trying to find
:problem for 4 days already, but can't find solution. Hang usually
:occurs (as I could see visually) after detecting all devices, and
:before (or maybe slightly after) calling /sbin/init, at least before

It kinda sounds like an interrupt routing issue. Interrupts are a
nightmare on the intel architecture and we pretty much depend on the
BIOS to do the routing as our ACPI support doesn't do it. Usually
one can get a SMP kernel working either with the APIC_IO option or
without the APIC_IO kernel compile option... one of the two combinations
usually works.

My guess is that the failure is occuring when the kernel enables
interrupts, which usually occurs after the initial bus and device
scan. I can only guess, though.

If you can boot it with a UP kernel it might be possible to compare
the dmesg output from the UP boot with the dmesg output from a SMP
boot.

-Matt

#3 Updated by dillon about 6 years ago

: 0 c02f9c00 0 0 0 000200 000000 2 3 conifhk c02e0e60 swapper
:...
:Michael Kosarev <> added the comment:
:
:this ps obtained after system locking, "trace" from db> aren't give any
:meaningful output

This rings a bell. conifhk == the system is waiting on the autoconf
to finish initializing the mass storage devices.

If you happen to have a DVD drive connected, disconnect it. Try
putting a CD drive in there instead. In particular, we've had
reports of SATA DVD and SATA CD drives possibly locking up the NATA
driver during boot.

-Matt
Matthew Dillon
<>

#4 Updated by russiane39 about 6 years ago

Matt, box haven't got any CD/DVD drives at all. Just SCSI HDD and nothing more,
SATA and USB controllers disabled in BIOS and removed in my kernel config also.

#5 Updated by dillon about 6 years ago

:Michael Kosarev <> added the comment:
:
:Matt, box haven't got any CD/DVD drives at all. Just SCSI HDD and nothing m=
:ore,
:SATA and USB controllers disabled in BIOS and removed in my kernel config a=
:lso.

Hmm. I'd try reversing it. Maybe the issue is the SCSI driver. Try
disabling the SCSI and connecting a disk via SATA.

Also try booting with a month-old snapshot CD and see if that recognizes
the SCSI. Another possibility (though not very likely) is that my CAM
fixes blew up the SCSI controller.

How'd you load DragonFly up in the first place without a CD drive?

-Matt
Matthew Dillon
<>

#6 Updated by russiane39 about 6 years ago

Matt, you understood me wrong probably - I've tried to say, that issue with hang
probably not because of storage drivers, I've checked nata, sym, usual ata
drivers. With any of them kernel hang still present.

#7 Updated by XPinguin about 6 years ago

Good day.
As I have grabbed russiane39's machine at my placement, I have performed some
diagnostics.

First of all, HDD connected via SATA in IDE mode ('ad' device) in this box.

Second, I have compiled the latest kernel (cvsuped just some hours ago) with SMP
and still receive the same problem, but occasionally I get lock-up in the
completely different place (before lo0 initialization), sadly just once...

Third thing I should notice is that after lock-up and pressing Ctrl+Alt+Del I
could see some shutdown actions performed by init process:
"
lo0: flags=8049<UP,LOOPBACK,RUNNING,MULTICAST> mtu 16384
inet 127.0.0.1 netmask 0xff000000
inetd not running? (check /var/run/inetd.pid).
Shutting down daemon processes:.
cron not running? (check /var/run/cron.pid).
Shutting down local daemons:.
.
" (and continuation of the lock-up).

These are booting logs.
1. Odd one lock-up: http://pastebin.ca/1077610
2. Common lock-up (with 'ps' and 'trace'): http://pastebin.ca/1077611
3. Working non-SMP kernel: http://pastebin.ca/1077609

Now I am trying to get functions call backtrace using the kgdb through the
serial port, but I couldn't fight up the problem with connection speed (noticed
early by russiane39). I've found a possible solution in one manual which
recommended typing 'set baudrate 9600' but didn't specify where to type this
line. I've tried doing so in system shell (setting environment variable) and in
the gdb shell (it doesn't know such command at all) but with no results...

#8 Updated by russiane39 about 6 years ago

Sephe suggested a patch for nata, which probably could solve that problem, but
it is not working.
He tried to fight with that issue: intr 14 at 40001/40000 hz, livelocked limit
engaged!
http://leaf.dragonflybsd.org/~sephe/ata-chipset.c.diff

#9 Updated by dillon about 6 years ago

:...
:
:These are booting logs.
:1=2E Odd one lock-up: http://pastebin.ca/1077610
:2=2E Common lock-up (with 'ps' and 'trace'): http://pastebin.ca/1077611
:3=2E Working non-SMP kernel: http://pastebin.ca/1077609
:

The boot logs are very useful. It looks like the network is the problem.
It's probably dying trying to configure em1, after lo0. It is getting an
interrupt storm on IRQ 14, which no device is attached to. I'm guessing
that the BIOS is lying about EM1 being on IRQ 11 and it is actually on
IRQ 14.

Try building a SMP kernel *without* options APIC_IO.

If that doesn't work try configuring EM1 for polling operation only.
Get it configured and working in a polling-only mode with a UP kernel,
then try booting a SMP kernel and see if that works.

It may also be possible to get through the SMP boot sequence by setting
kern.emergency_intr_enable=1 in /boot/loader.conf, but this would purely
be a stopgap.

-Matt
Matthew Dillon
<>

#10 Updated by XPinguin about 6 years ago

Thanks Matt, will try.
By the way, you are probably right about 'em1' as I have noticed it doesn't work
(can't assign an address to it or whatever) but didn't make sense to it...

#11 Updated by XPinguin about 6 years ago

Matt, I have rebuilt kernel without IO_APIC and it works.
So I think I will try porting FreeBSD APIC there...

#12 Updated by dillon about 6 years ago

:Vladimir <> added the comment:
:
:Matt, I have rebuilt kernel without IO_APIC and it works.
:So I think I will try porting FreeBSD APIC there...

The issue here is that for IO_APIC to work, we really need the
acpica/ACPI stuff from FreeBSD. A complete, fresh porting of that
entire subsystem would be needed as well as a major revamping of the
interrupt assignment infrastructure.

It's a very large job, one of two big-ticket items we really need to
get into DragonFly (the other being AMD64 support, which is being
worked on as a SOC project).

If you want to try your hand and getting that mess working, it would
be great! I won't hold you to it, though, it's a massive job.

I'm happy you got it working with IO_APIC commented out. That's what
most of us have had to do. It's a huge band-aid.

-Matt
Matthew Dillon
<>

#13 Updated by chi about 6 years ago

hi folks:
trid the first five bootoptions of the liveDVD:
(normal, noacpi, smp normal, smp noapic, save mode)
intr 14 at 40001/40000 hz, livelocked limit
and some "can't read temparature" issue where present at all times.
machine: intel q6600 + abitIP35pro + sata hdd + sata dvd.
same DVD runs fine in VMware.

=> in linux i fixed simular issues with grub-boot-option "all_generic_ide".
WHEREAS simular means noone knew why linux dropped me to busybox(recovery-shell).

good luck

#14 Updated by corecode over 5 years ago

APIC_IO problem

Also available in: Atom PDF