Bug #1449: AHCI panic on Intel 6321ESB AHCI - DragonFlyBSD - DragonFlyBSD bugtracker

Bug #1449

closed

AHCI panic on Intel 6321ESB AHCI

Added by polachok almost 16 years ago. Updated almost 16 years ago.

Status: Closed
Priority: Normal
Assignee: -
Category: -
Target version: -
% Done: 0%
Description

I have a panic here with AHCI enabled. See screenshot here:
http://omploader.org/vMjRueA. Console gets completely unusable and fills with
garbage.

DragonFly dmesg with ahci disabled:
http://leaf.dragonflybsd.org/~polachok/dmesg.dfly

OpenBSD dmesg with ahci enabled:
http://leaf.dragonflybsd.org/~polachok/dmesg.obsd

#1

Updated by dillon almost 16 years ago

:New submission from Alexander Polakov <polachok@gmail.com>:
:
:I have a panic here with AHCI enabled. See screenshot here:
:http://omploader.org/vMjRueA. Console gets completely unusable and fills with
:garbage.
:
:DragonFly dmesg with ahci disabled:
:http://leaf.dragonflybsd.org/~polachok/dmesg.dfly
:
:OpenBSD dmesg with ahci enabled:
:http://leaf.dragonflybsd.org/~polachok/dmesg.obsd

Under what circumstances does the panic occur?  During boot?

The softreset sequence is supposed to be serialized, so SACT
   shouldn't have any bits set (indicating active requests), let alone
   7 bits set!
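
A minimal sketch, not taken from the driver, of the sanity check implied here: if commands are issued one at a time, a slot bitmap should never show more than one bit. The helper names are hypothetical, and `0x7f` below is just an example value with seven bits set.

```c
#include <stdbool.h>
#include <stdint.h>

/* True when at most one bit is set in a 32-bit command-slot mask. */
static bool at_most_one_slot(uint32_t mask)
{
    return (mask & (mask - 1)) == 0;
}

/* Number of set bits, i.e. slots the controller reports as active. */
static int active_slots(uint32_t mask)
{
    int n = 0;

    while (mask) {
        mask &= mask - 1;   /* clear the lowest set bit */
        n++;
    }
    return n;
}
```

With a serialized sequence, `at_most_one_slot()` would be expected to hold for any value read back; a value such as `0x7f` fails it and `active_slots()` reports 7.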

-Matt

#2

Updated by polachok almost 16 years ago

> Under what circumstances does the panic occur? During boot?

Yes, during boot.

#3

Updated by dillon almost 16 years ago

:Alexander Polakov <polachok@gmail.com> added the comment:
:
:>Under what circumstances does the panic occur? During boot?
:
:Yes, during boot.

Hmm.  Your AHCI part does not support NCQ so there is no SACT register
    at all on-chip.  Here's a patch which should conditionalize-out the
    SACT register accesses:

fetch http://apollo.backplane.com/DFlyMisc/ahci07.patch
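
A rough sketch of what conditionalizing the SACT access could look like, assuming the AHCI layout where bit 30 of the CAP register (SNCQ) advertises NCQ support. The struct and function names are hypothetical and are not the contents of ahci07.patch.

```c
#include <stdint.h>

#define CAP_SNCQ (1u << 30)   /* CAP bit 30: controller supports NCQ */

/* Hypothetical snapshot of the registers the check needs. */
struct port_regs {
    uint32_t cap;    /* HBA capabilities register */
    uint32_t ci;     /* per-port command issue bitmap */
    uint32_t sact;   /* per-port SATA active bitmap, meaningful only with NCQ */
};

/* Bitmap of slots considered busy; SACT is ignored on non-NCQ parts. */
static uint32_t port_active_mask(const struct port_regs *r)
{
    uint32_t active = r->ci;

    if (r->cap & CAP_SNCQ)
        active |= r->sact;
    return active;
}
```

On a part without NCQ, whatever is read back from the SACT offset no longer influences the result.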

The actual crash is an assertion on the CI register, though, so my
    expectation is that it will still crash.

The ahci port stop/start sequence is supposed to clear CI.  Another
    thing I tried in the patch was to put a delay between the port stop
    and port start, and also after the port start.

Let's see if either of these adjustments fixes your problem.  If they
    do, I'd like you to then remove the ahci_os_sleep() calls I added
    in the patch and test again, so I can determine what actual fix is
    needed.

If the problem still persists I'm a bit at a loss as the port stop/start
    sequence is always supposed to reset CI to 0.

-Matt
                    Matthew Dillon 
                    <dillon@backplane.com>

#4

Updated by polachok almost 16 years ago

Still crashes: http://omploader.org/vMjR4Mg

#5

Updated by dillon almost 16 years ago

:Alexander Polakov <polachok@gmail.com> added the comment:
:
:Still crashes: http://omploader.org/vMjR4Mg

See if you can get the messages before the crash backtrace (see if
    you can get a crash that does not have a secondary NMI crash which
    scrolls the extra information I printed out off the screen).

-Matt
                    Matthew Dillon 
                    <dillon@backplane.com>

#6

Updated by dillon almost 16 years ago

: See if you can get the messages before the crash backtrace (see if
: you can get a crash that does not have a secondary NMI crash which
: scrolls the extra information I printed out off the screen).

Also see if you can get anything more out of it with a verbose boot.

-Matt
                    Matthew Dillon 
                    <dillon@backplane.com>

#7

Updated by polachok almost 16 years ago

I built a UP kernel, now it fails like this:

http://omploader.org/vMjUxcw
or
http://omploader.org/vMjUxdw

It doesn't look AHCI-related at all to me, but it happens only with AHCI enabled.

With boot -v it scrolls too fast, so I cannot capture anything useful.

#8

Updated by dillon almost 16 years ago

Ok, here's another AHCI patch to try.

fetch http://apollo.backplane.com/DFlyMisc/ahci08.patch

It is a bit of a long-shot but I'm crossing my fingers.  Your AHCI
    chipset is running ahci-1.1 and does not have the PMD bit set.  It's
    possible that the disk identify command is overlapping a page-boundary
    and the AHCI chip is blowing up the machine when that occurs.
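
A small sketch of how one might test whether a buffer straddles a page boundary. The 4096-byte page size and the function name are assumptions for illustration; this is not the code in ahci08.patch.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define PAGE_SIZE_ASSUMED 4096u   /* assumption: 4 KiB pages */

/* True when [addr, addr + len) touches more than one page. */
static bool crosses_page(uintptr_t addr, size_t len)
{
    if (len == 0)
        return false;
    return (addr / PAGE_SIZE_ASSUMED) !=
           ((addr + len - 1) / PAGE_SIZE_ASSUMED);
}
```

A 512-byte identify buffer at offset 0x1e00 within a page still fits, while the same buffer at 0x1f00 spills into the next page.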

I'm a bit at a loss as the problem appears to be a bit
    non-deterministic.

-Matt

#9

Updated by dillon almost 16 years ago

Also be sure you aren't special-casing your kernel build. i.e. not
doing extra optimizations or anything like that.

(I'm grasping at straws)

-Matt
                    Matthew Dillon 
                    <dillon@backplane.com>

#10

Updated by dillon almost 16 years ago

I put a slightly updated patch up, same URL. I noticed a bug in the
chip reset sequence. I don't know if it has anything to do with the
problem, but it is possible that the chip did not get properly reset from
a previous BIOS-created compatibility mode.

-Matt
                    Matthew Dillon 
                    <dillon@backplane.com>

#11

Updated by polachok almost 16 years ago

Now it looks like this (different on each boot, sometimes everything scrolls
quickly and fills the screen with crap, sometimes not):
http://omploader.org/vMjVnZA

> Also be sure you aren't special-casing your kernel build. i.e. not
> doing extra optimizations or anything like that.

No.
$ cat /etc/make.conf
CITRUS_MODS=UTF8
KERNCONF=XEON

#12

Updated by dillon almost 16 years ago

:Alexander Polakov <polachok@gmail.com> added the comment:

Hmm.  If the chip reset change didn't fix it... the CI register just
    could not possibly have a value of 0x00439cdb.  It's impossible.
    So it must not be mapped properly or in the right mode.

Here is a new patch to try:

fetch http://apollo.backplane.com/DFlyMisc/ahci09.patch

This one plays some intel magic that may get the chipset into the
    correct operating mode.   I wish intel wouldn't play these games,
    AHCI is supposed to be AHCI, not some bastardized compatibility mode
    that requires screwing around with to make it AHCI.  The AE/HR sequence
    is supposed to do the job.

-Matt
                    Matthew Dillon 
                    <dillon@backplane.com>

#13

Updated by polachok almost 16 years ago

http://omploader.org/vMjVpZg

#14

Updated by dillon almost 16 years ago

:
:Alexander Polakov <polachok@gmail.com> added the comment:
:
:http://omploader.org/vMjVpZg

Ok, try this.

fetch http://apollo.backplane.com/DFlyMisc/ahci10.patch

Hopefully I coded it properly.  I reordered some of the reset
    sequencing and also reordered the Intel hocus pocus, using the
    linux driver as a template.

-Matt
                    Matthew Dillon 
                    <dillon@backplane.com>

#15

Updated by polachok almost 16 years ago

http://omploader.org/vMjVqOQ

#16

Updated by dillon almost 16 years ago

:Alexander Polakov <polachok@gmail.com> added the comment:
:
:http://omploader.org/vMjVqOQ

Is there just one hard drive in the system?  This time it failed
    on port 4.  The previous time it failed on port 3.  The time before
    that it failed on port 0.

Reboot a couple of times.  Is it failing on the same port every time
    or is it jumping around?

It shouldn't even be reaching the softreset code for the ports that
    do not have connected devices.

-Matt
                    Matthew Dillon 
                    <dillon@backplane.com>

#17

Updated by polachok almost 16 years ago

> Is there just one hard drive in the system?

Second one is connected to mpt (see http://bugs.dragonflybsd.org/issue1451).

> Reboot a couple of times. Is it failing on the same port every time
> or is it jumping around?

Finally, I opened the case and started to switch ports. It works okay (YEH!)
with ports 2 to 5 (I tried each one twice). With port 1 it hung once and
worked the second time. With port zero (where it was connected initially) it
fails in different ways, like:
1. ahci0.0: <skipped> ci 00004014
2. alignment fault while in user mode (screen filled with garbage)
3. hangs (I waited for 5 minutes).

It's late here, so I'm going to sleep now and will continue my experiments
tomorrow.

#18

Updated by dillon almost 16 years ago

:Alexander Polakov <polachok@gmail.com> added the comment:
:
:) it
:fails in different manners, like
:1.ahci0.0: <skipped> ci 00004014
:2.alignment fault while in user mode (screen filled with garbage)
:3.hangs (I waited for 5 minutes).
:
:It's late here, so I'm going to sleep now and will continue my experiments
:tomorrow.

You know what that sounds like?  That sounds like the chip is still
    going through a reset sequence.  When you wake up tomorrow try adding
    various ahci_os_sleep() calls, starting with the one on line 143 of
    ahci.c.  Try increasing that from 100 to 1000.

Then try adding another ahci_os_sleep() call near the end of ahci_init(),
    around line 202.  And maybe a few others.

I sure hope that winds up being the case, though if it is I'll be
    really pissed at Intel because the chip reset is fully handshaked.

-Matt
                    Matthew Dillon 
                    <dillon@backplane.com>

#19

Updated by polachok almost 16 years ago

I changed value to 1000 on line 143 and added ahci_os_sleep(1000) before return
in ahci_init. Still panics.

#20

Updated by dillon almost 16 years ago

:Alexander Polakov <polachok@gmail.com> added the comment:
:
:I changed value to 1000 on line 143 and added ahci_os_sleep(1000) before return
:in ahci_init. Still panics.

Rumko is now getting a panic on an AHCI 1.0 Intel chipset with the
    changes applied, one that he was NOT getting before the changes.

The results are very similar to what you got... what appears to be
    random memory corruption and in Rumko's case an immediate machine
    reboot and the BIOS configuration got messed up as well.

I feel they must be related issues.  If Rumko and I can figure out
    which change caused his machine to stop booting it may give me a
    good idea where to look for the problem you are reporting.

What is really annoying to me is that the NATA driver is able to
    attach with its AHCI sub-driver.  I don't know what I am doing
    differently that is causing the breakage.

-Matt
                    Matthew Dillon 
                    <dillon@backplane.com>

#21

Updated by TGEN almost 16 years ago

Matthew Dillon wrote:

:Alexander Polakov <polachok@gmail.com> added the comment:
:
:I changed value to 1000 on line 143 and added ahci_os_sleep(1000) before return
:in ahci_init. Still panics.

> Rumko is now getting a panic with a AHCI1.0 Intel chipset with the
> changes that he was NOT getting before the changes.
>
> The results are very similar to what you got... what appears to be
> random memory corruption and in Rumko's case an immediate machine
> reboot and the BIOS configuration got messed up as well.
>
> I feel they must be related issues. If Rumko and I can figure out
> which change caused his machine to stop booting it may give me a
> good idea where to look for the problem you are reporting.
>
> What is really annoying to me is that the NATA driver is able to
> attach with its AHCI sub-driver. I don't know what I am doing
> different that is causing the breakage.

Perhaps your "if intel but AHCI is not enabled then write some value to
a particular config register" change. I'm thinking there's more work to
do to kick the chip into AHCI mode and not confuse the BIOS; besides
that, I think it's not clean. If the device doesn't advertise itself as
being an AHCI subclass, then don't try to force it.

Cheers,
--
Thomas E. Spanjaard
tgen@netphreax.net
tgen@deepbone.net

#22

Updated by dillon almost 16 years ago

:Perhaps your "if intel but AHCI is not enabled then write some value to
:a particular config register" change. I'm thinking there's more work to
:do to kick the chip into AHCI mode and not confuse the BIOS; besides
:that, I think it's not clean. If the device doesn't advertise itself as
:being an AHCI subclass, then don't try to force it.

I don't know what that kick code is for but the BIOS is already
    advertising the device as AHCI in the PCI configuration space.
    The AHCI driver only picks it up if it is advertised as AHCI.

I think you are onto something regarding the BIOS handoff, though.
    Combined with Rumko's report that the HR reset sequence seems to be
    the core of the issue, that seems to indicate an issue with
    the BIOS supervisory code running in ring -1.

A third possibility is that the HR sequence is bricking the chip's
    PCI physical interface for a short while and that ANY access to the
    chip registers just after HR is set is blowing the system up.

A fourth possibility is that the HR sequence is not clearing the AHCI
    enable bit as it is supposed to, and perhaps cycling the AE bit will
    deal with the case.

So I would like to try two more things.  First to try this patch:

fetch http://apollo.backplane.com/DFlyMisc/ahci11.patch

If the patch does not work, then modify line 141 from:

ahci_write(sc, AHCI_REG_GHC, AHCI_REG_GHC_AE | AHCI_REG_GHC_HR);

To:

ahci_write(sc, AHCI_REG_GHC, AHCI_REG_GHC_HR);

If it can be made to work then I also want to try to reduce those
    500ms delays I have in there to something more reasonable, like 200ms,
    and see if it continues to work.
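
For reference, a minimal sketch of the general shape of an HR reset followed by re-enabling AE, run against a fake in-memory register so it can execute without hardware. It assumes the AHCI layout where HR is GHC bit 0 and AE is GHC bit 31. `struct fake_hba`, `ghc_read()`, `ghc_write()` and `hba_reset()` are hypothetical and are not the code in ahci.c or ahci11.patch.

```c
#include <stdbool.h>
#include <stdint.h>

#define GHC_HR (1u << 0)    /* HBA reset, cleared by the controller when done */
#define GHC_AE (1u << 31)   /* AHCI enable */

/* Fake controller: HR self-clears after a number of register reads. */
struct fake_hba {
    uint32_t ghc;
    int polls_until_done;
};

static uint32_t ghc_read(struct fake_hba *h)
{
    /* In this model a finished reset clears the whole register, AE included. */
    if ((h->ghc & GHC_HR) && h->polls_until_done-- <= 0)
        h->ghc = 0;
    return h->ghc;
}

static void ghc_write(struct fake_hba *h, uint32_t v)
{
    h->ghc = v;
}

/* Request a reset, poll until HR clears, then turn AE back on. */
static bool hba_reset(struct fake_hba *h, int max_polls)
{
    ghc_write(h, GHC_AE | GHC_HR);
    for (int i = 0; i < max_polls; i++) {
        /* A real driver would sleep between polls. */
        if ((ghc_read(h) & GHC_HR) == 0) {
            ghc_write(h, GHC_AE);
            return true;
        }
    }
    return false;   /* HR never cleared: treat the controller as hung */
}
```

The poll limit stands in for the delay being tuned above; a controller that never drops HR is reported as a failure rather than being touched further.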

-Matt

#23

Updated by dillon almost 16 years ago

Please try the latest master. With Rumko's help testing all sorts of
combinations I think we found the bricking issue.

Basically Intel screwed up, but the bit in question is ancillary to the
    main operation of the chip so we can just not use it.  It turns off
    the Phy on a port.

-Matt

#24

Updated by rumcic almost 16 years ago

Matthew Dillon wrote:

> Please try the latest master. With Rumko's help testing all sorts of
> combinations I think we found the bricking issue.
>
> Basically Intel screwed up, but the bit in question is ancillary to the
> main operation of the chip so we can just not use it. It turns off
> the Phy on a port.
>
> -Matt

Confirming that the latest master does work.
--
Regards,
Rumko

#25

Updated by polachok almost 16 years ago

Still panics with "ci not 0".

#26

Updated by dillon almost 16 years ago

:Alexander Polakov <polachok@gmail.com> added the comment:
:
:Still panics with "ci not 0".
:

Well, half the problem is solved.  Does plugging the drive into
    different ports still help like it did before?

-Matt
                    Matthew Dillon 
                    <dillon@backplane.com>

#27

Updated by dillon almost 16 years ago

Maybe we can track this down more quickly via IRC instead of
email, on #dragonflybsd. Once Rumko got online we were able
to run a ton of test cases in less than an hour.

If the chipset crash has been fixed maybe cycling the port will
    fix the CI issue this time, so let's try that again.  It also looks
    to me like there is a possible interrupt race.  So try this patch:

fetch http://apollo.backplane.com/DFlyMisc/ahci12.patch

-Matt
                    Matthew Dillon 
                    <dillon@backplane.com>

#28

Updated by polachok almost 16 years ago

> Does plugging the drive into
> different ports still help like it did before?

No. Now it panics with any port.

#29

Updated by polachok almost 16 years ago

> fetch http://apollo.backplane.com/DFlyMisc/ahci12.patch

Holy crap! Works now! http://leaf.dragonflybsd.org/~polachok/dmesg.dfly

#30

Updated by dillon almost 16 years ago

:Alexander Polakov <polachok@gmail.com> added the comment:
:
:>fetch http://apollo.backplane.com/DFlyMisc/ahci12.patch
:Holy crap! Works now! http://leaf.dragonflybsd.org/~polachok/dmesg.dfly
:

Holy crap!  That means it was an interrupt race with the 
    initialization code.  That patch makes sure that AHCI interrupts
    are not processed until all ports have gotten past the port
    initialization code.
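
One way to picture that kind of gate, as a sketch only: the interrupt handler refuses to do any work until every port has reported that its initialization is finished. The struct and function names are hypothetical and this is not the code in ahci12.patch; a real driver would also need proper locking or atomics around the counter.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-controller state. */
struct hba_softc {
    int ports_total;    /* ports being initialized */
    int ports_ready;    /* ports that finished initialization */
    uint32_t handled;   /* interrupts actually processed */
};

/* Called once by each port when its init sequence completes. */
static void port_init_done(struct hba_softc *sc)
{
    if (sc->ports_ready < sc->ports_total)
        sc->ports_ready++;
}

/* Interrupt entry point: ignore interrupts until all ports are ready. */
static bool hba_intr(struct hba_softc *sc)
{
    if (sc->ports_ready < sc->ports_total)
        return false;   /* still initializing, do not touch port state */
    sc->handled++;
    return true;
}
```

With six ports, an interrupt arriving after only five have finished is dropped, so the handler cannot race the sixth port's setup.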

Ok, I'll commit it.  It took too long for me to find that bug.

-Matt
                    Matthew Dillon 
                    <dillon@backplane.com>

#31

Updated by corecode almost 16 years ago

fixed
