Bug #1449

AHCI panic on Intel 6321ESB AHCI

Added by polachok about 5 years ago. Updated about 5 years ago.

Status: Closed
Priority: Normal
Assignee: -
% Done: 0%
Category: -
Target version: -

Description

I have a panic here with AHCI enabled. See screenshot here:
http://omploader.org/vMjRueA. Console gets completely unusable and fills with
garbage.

DragonFly dmesg with ahci disabled:
http://leaf.dragonflybsd.org/~polachok/dmesg.dfly

OpenBSD dmesg with ahci enabled:
http://leaf.dragonflybsd.org/~polachok/dmesg.obsd

History

#1 Updated by dillon about 5 years ago

:New submission from Alexander Polakov <>:
:
:I have a panic here with AHCI enabled. See screenshot here:
:http://omploader.org/vMjRueA. Console gets completely unusable and fills with
:garbage.
:
:DragonFly dmesg with ahci disabled:
:http://leaf.dragonflybsd.org/~polachok/dmesg.dfly
:
:OpenBSD dmesg with ahci enabled:
:http://leaf.dragonflybsd.org/~polachok/dmesg.obsd

Under what circumstances does the panic occur? During boot?

The softreset sequence is supposed to be serialized, so SACT
shouldn't have any bits set (indicating active requests), let alone
7 bits set!

-Matt

#2 Updated by polachok about 5 years ago

>Under what circumstances does the panic occur? During boot?

Yes, during boot.

#3 Updated by dillon about 5 years ago

:Alexander Polakov <> added the comment:
:
:>Under what circumstances does the panic occur? During boot?
:
:Yes, during boot.

Hmm. Your AHCI part does not support NCQ so there is no SACT register
at all on-chip. Here's a patch which should conditionalize-out the
SACT register accesses:

fetch http://apollo.backplane.com/DFlyMisc/ahci07.patch
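
As a rough illustration of what "conditionalizing out" the SACT accesses could look like, here is a minimal sketch. It is not taken from ahci07.patch; the struct and function names are hypothetical, and only the CAP.SNCQ bit position (bit 30) comes from the AHCI specification.

```c
#include <assert.h>
#include <stdint.h>

/* Bit 30 of the HBA CAP register: Supports Native Command Queuing. */
#define CAP_SNCQ (1u << 30)

/* Hypothetical stand-in for per-port register state (not the driver's struct). */
struct fake_port {
    uint32_t cap;   /* copy of the HBA capabilities register */
    uint32_t sact;  /* PxSACT contents; not meaningful without NCQ */
    uint32_t ci;    /* PxCI contents */
};

/* Return the mask of busy command slots, consulting SACT only with NCQ. */
static uint32_t port_active_slots(const struct fake_port *p)
{
    uint32_t active = p->ci;

    if (p->cap & CAP_SNCQ)
        active |= p->sact;
    return active;
}
```

With a check like this, garbage read back from a SACT register that the chip does not implement would never be mistaken for active requests.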

The actual crash is an assertion on the CI register, though, so my
expectation is that it will still crash.

The ahci port stop/start sequence is supposed to clear CI. Another
thing I tried in the patch was to put a delay between the port stop
and port start, and also after the port start.
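
The stop/delay/start idea can be sketched as follows. This is only an illustration against a toy hardware model, not the code from the patch: fake_port, fake_hw_settle() and fake_sleep() are hypothetical, and only the PxCMD bit positions (ST is bit 0, CR is bit 15) are taken from the AHCI specification.

```c
#include <assert.h>
#include <stdint.h>

#define PXCMD_ST (1u << 0)   /* start: HBA may process the command list */
#define PXCMD_CR (1u << 15)  /* command list running */

struct fake_port { uint32_t cmd; uint32_t ci; };

/* Toy hardware model (assumption): CR follows ST, CI clears once stopped. */
static void fake_hw_settle(struct fake_port *p)
{
    if (p->cmd & PXCMD_ST) {
        p->cmd |= PXCMD_CR;
    } else {
        p->cmd &= ~PXCMD_CR;
        p->ci = 0;
    }
}

/* Stand-in for a delay such as ahci_os_sleep(); here it lets the model settle. */
static void fake_sleep(struct fake_port *p, int ms)
{
    (void)ms;
    fake_hw_settle(p);
}

/* Stop the port, wait for CR to drop, pause, restart. Returns 0 if CI is 0. */
static int port_cycle(struct fake_port *p)
{
    int tries;

    p->cmd &= ~PXCMD_ST;
    for (tries = 0; tries < 50 && (p->cmd & PXCMD_CR); tries++)
        fake_sleep(p, 10);
    if (p->cmd & PXCMD_CR)
        return -1;              /* port never stopped */
    fake_sleep(p, 10);          /* extra delay between stop and start */
    p->cmd |= PXCMD_ST;
    fake_sleep(p, 10);          /* and another after the start */
    return (p->ci == 0) ? 0 : -1;
}
```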

Let's see if either of these adjustments fixes your problem. If they
do, I'd like you to then remove the ahci_os_sleep() calls I added
in the patch and test again, so I can determine what actual fix is
needed.

If the problem still persists I'm a bit at a loss as the port stop/start
sequence is always supposed to reset CI to 0.

-Matt
Matthew Dillon

#4 Updated by polachok about 5 years ago

#5 Updated by dillon about 5 years ago

:Alexander Polakov <> added the comment:
:
:Still crashes: http://omploader.org/vMjR4Mg

See if you can get the messages before the crash backtrace (see if
you can get a crash that does not have a secondary NMI crash which
scrolls the extra information I printed out off the screen).

-Matt
Matthew Dillon

#6 Updated by dillon about 5 years ago

: See if you can get the messages before the crash backtrace (see if
: you can get a crash that does not have a secondary NMI crash which
: scrolls the extra information I printed out off the screen).

Also see if you can get anything more out of it with a verbose boot.

-Matt
Matthew Dillon

#7 Updated by polachok about 5 years ago

I built a UP kernel, now it fails like this:

http://omploader.org/vMjUxcw
or
http://omploader.org/vMjUxdw

It doesn't look AHCI-related at all to me, but it happens only with ahci enabled.

With boot -v it scrolls too fast, so I cannot capture anything useful.

#8 Updated by dillon about 5 years ago

Ok, here's another AHCI patch to try.

fetch http://apollo.backplane.com/DFlyMisc/ahci08.patch

It is a bit of a long-shot but I'm crossing my fingers. Your AHCI
chipset is running ahci-1.1 and does not have the PMD bit set. It's
possible that the disk identify command is overlapping a page-boundary
and the AHCI chip is blowing up the machine when that occurs.
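
For reference, the condition being suspected here (a DMA buffer straddling a page boundary) is easy to test for. The helper below is a hypothetical sketch, not code from the driver, and it assumes 4 KiB pages; the 512-byte size in the usage note is the length of ATA IDENTIFY data.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PAGE_SIZE_ASSUMED 4096u  /* assumption: 4 KiB pages */

/* Return 1 if [addr, addr + len) spans more than one page, else 0. */
static int crosses_page(uintptr_t addr, size_t len)
{
    if (len == 0)
        return 0;
    return (addr / PAGE_SIZE_ASSUMED) !=
           ((addr + len - 1) / PAGE_SIZE_ASSUMED);
}
```

A 512-byte identify buffer at offset 0x1e00 stays inside one page, while the same buffer at 0x1f00 spills into the next one.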

I'm a bit at a loss as the problem appears to be a bit
non-deterministic.

-Matt

#9 Updated by dillon about 5 years ago

Also be sure you aren't special-casing your kernel build. i.e. not
doing extra optimizations or anything like that.

(I'm grasping at straws)

-Matt
Matthew Dillon

#10 Updated by dillon about 5 years ago

I put a slightly updated patch up, same URL. I noticed a bug in the
chip reset sequence. I don't know if it has anything to do with the
problem but it is possible that chip did not get properly reset from
a previous BIOS-created compatibility mode.

-Matt
Matthew Dillon

#11 Updated by polachok about 5 years ago

Now it looks like this (different on each boot, sometimes everything scrolls
quickly and fills the screen with crap, sometimes not):
http://omploader.org/vMjVnZA

>Also be sure you aren't special-casing your kernel build. i.e. not
>doing extra optimizations or anything like that.
No.
$ cat /etc/make.conf
CITRUS_MODS=UTF8
KERNCONF=XEON

#12 Updated by dillon about 5 years ago

:Alexander Polakov <> added the comment:

Hmm. If the chip reset change didn't fix it... the CI register just
could not possibly have a value of 0x00439cdb. It's impossible.
So it must not be mapped properly or in the right mode.

Here is a new patch to try:

fetch http://apollo.backplane.com/DFlyMisc/ahci09.patch

This one plays some intel magic that may get the chipset into the
correct operating mode. I wish intel wouldn't play these games,
AHCI is supposed to be AHCI, not some bastardized compatibility mode
that requires screwing around with to make it AHCI. The AE/HR sequence
is supposed to do the job.

-Matt
Matthew Dillon

#14 Updated by dillon about 5 years ago

:
:Alexander Polakov <> added the comment:
:
:http://omploader.org/vMjVpZg

Ok, try this.

fetch http://apollo.backplane.com/DFlyMisc/ahci10.patch

Hopefully I coded it properly. I reordered some of the reset
sequencing and also reordered the Intel hocus pocus, using the
linux driver as a template.

-Matt
Matthew Dillon

#16 Updated by dillon about 5 years ago

:Alexander Polakov <> added the comment:
:
:http://omploader.org/vMjVqOQ

Is there just one hard drive in the system? This time it failed
on port 4. The previous time it failed on port 3. The time before
that it failed on port 0.

Reboot a couple of times. Is it failing on the same port every time
or is it jumping around?

It shouldn't even be reaching the softreset code for the ports that
do not have connected devices.

-Matt
Matthew Dillon

#17 Updated by polachok about 5 years ago

>Is there just one hard drive in the system?

Second one is connected to mpt (see http://bugs.dragonflybsd.org/issue1451).

>Reboot a couple of times. Is it failing on the same port every time
>or is it jumping around?
Finally, I opened the case and started to switch ports. It works okay (YEH!)
with ports 2 to 5 (I tried each twice); with port 1 it hung once and
worked the second time. With port zero (where it was connected initially) it
fails in different ways, such as:
1. ahci0.0: <skipped> ci 00004014
2. alignment fault while in user mode (screen filled with garbage)
3. hangs (I waited for 5 minutes).

It's late here, so I'm going to sleep now and will continue my experiments
tomorrow.

#18 Updated by dillon about 5 years ago

:Alexander Polakov <> added the comment:
:
:) it
:fails in different manners, like
:1.ahci0.0: <skipped> ci 00004014
:2.alignment fault while in user mode (screen filled with garbage)
:3.hangs (I waited for 5 minutes).
:
:It's late here, so I'm going to sleep now and will continue my experiments
:tomorrow.

You know what that sounds like? That sounds like the chip is still
going through a reset sequence. When you wake up tomorrow try adding
various ahci_os_sleep() calls, starting with the one on line 143 of
ahci.c. Try increasing that from 100 to 1000.

Then try adding another ahci_os_sleep() call near the end of ahci_init(),
around line 202. And maybe a few others.

I sure hope that winds up being the case, though if it is I'll be
really pissed at Intel because the chip reset is fully handshaked.

-Matt
Matthew Dillon

#19 Updated by polachok about 5 years ago

I changed value to 1000 on line 143 and added ahci_os_sleep(1000) before return
in ahci_init. Still panics.

#20 Updated by dillon about 5 years ago

:Alexander Polakov <> added the comment:
:
:I changed value to 1000 on line 143 and added ahci_os_sleep(1000) before return
:in ahci_init. Still panics.

Rumko is now getting a panic on an AHCI 1.0 Intel chipset with the
changes, one that he was NOT getting before the changes.

The results are very similar to what you got... what appears to be
random memory corruption and in Rumko's case an immediate machine
reboot and the BIOS configuration got messed up as well.

I feel they must be related issues. If Rumko and I can figure out
which change caused his machine to stop booting it may give me a
good idea where to look for the problem you are reporting.

What is really annoying to me is that the NATA driver is able to
attach with its AHCI sub-driver. I don't know what I am doing
differently that is causing the breakage.

-Matt
Matthew Dillon

#21 Updated by TGEN about 5 years ago

Matthew Dillon wrote:
> :Alexander Polakov <> added the comment:
> :
> :I changed value to 1000 on line 143 and added ahci_os_sleep(1000) before return
> :in ahci_init. Still panics.
>
> Rumko is now getting a panic with a AHCI1.0 Intel chipset with the
> changes that he was NOT getting before the changes.
>
> The results are very similar to what you got... what appears to be
> random memory corruption and in Rumko's case an immediate machine
> reboot and the BIOS configuration got messed up as well.
>
> I feel they must be related issues. If Rumko and I can figure out
> which change caused his machine to stop booting it may give me a
> good idea where to look for the problem you are reporting.
>
> What is really annoying to me is that the NATA driver is able to
> attach with its AHCI sub-driver. I don't know what I am doing
> different that is causing the breakage.

Perhaps your "if intel but AHCI is not enabled then write some value to
a particular config register" change. I'm thinking there's more work to
do to kick the chip into AHCI mode and not confuse the BIOS; besides
that, I think it's not clean. If the device doesn't advertise itself as
being an AHCI subclass, then don't try to force it.

Cheers,
--
Thomas E. Spanjaard

#22 Updated by dillon about 5 years ago

:Perhaps your "if intel but AHCI is not enabled then write some value to
:a particular config register" change. I'm thinking there's more work to
:do to kick the chip into AHCI mode and not confuse the BIOS; besides
:that, I think it's not clean. If the device doesn't advertise itself as
:being an AHCI subclass, then don't try to force it.

I don't know what that kick code is for but the BIOS is already
advertising the device as AHCI in the PCI configuration space.
The AHCI driver only picks it up if it is advertised as AHCI.

I think you are onto something regarding the BIOS handoff, though.
Combined with Rumko's report that the HR reset sequence seems to be
the core of the issue, that seems to indicate an issue with
the BIOS supervisory code running in ring -1.

A third possibility is that the HR sequence is bricking the chip's
PCI physical interface for a short while and that ANY access to the
chip registers just after HR is set is blowing the system up.

A fourth possibility is that the HR sequence is not clearing the AHCI
enable bit as it is supposed to, and perhaps cycling the AE bit will
deal with the case.

So I would like to try two more things. First to try this patch:

fetch http://apollo.backplane.com/DFlyMisc/ahci11.patch

If the patch does not work, then modify line 141 from:

ahci_write(sc, AHCI_REG_GHC, AHCI_REG_GHC_AE | AHCI_REG_GHC_HR);

To:

ahci_write(sc, AHCI_REG_GHC, AHCI_REG_GHC_HR);

If it can be made to work, I also want to try to reduce those
500ms delays I have in there to something more reasonable, like 200ms,
and see if it continues to work.
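
To make the HR handshake being discussed concrete, here is a minimal sketch against a toy model of the GHC register. It is not the code from ahci.c or ahci11.patch: fake_hba, fake_read_ghc() and hba_reset() are hypothetical, and the model assumes that a completed reset clears every GHC bit, including AE. Only the bit positions (HR is bit 0, AE is bit 31) come from the AHCI specification.

```c
#include <assert.h>
#include <stdint.h>

#define GHC_HR (1u << 0)   /* HBA reset; hardware clears it when done */
#define GHC_AE (1u << 31)  /* AHCI enable */

/* Toy HBA: the reset completes after busy_polls register reads. */
struct fake_hba { uint32_t ghc; int busy_polls; };

static uint32_t fake_read_ghc(struct fake_hba *h)
{
    if ((h->ghc & GHC_HR) && h->busy_polls-- <= 0)
        h->ghc = 0;   /* assumption: reset done, all GHC bits cleared */
    return h->ghc;
}

/* Request a reset, poll until HR clears, then re-enable AHCI mode. */
static int hba_reset(struct fake_hba *h, int max_polls)
{
    int i;

    h->ghc = GHC_AE | GHC_HR;
    for (i = 0; i < max_polls; i++) {
        if ((fake_read_ghc(h) & GHC_HR) == 0) {
            h->ghc |= GHC_AE;   /* AE must be set again after the reset */
            return 0;
        }
    }
    return -1;                  /* handshake never completed */
}
```

In a real driver each poll would be separated by a delay, which is where the 500ms versus 200ms question above comes in.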

-Matt

#23 Updated by dillon about 5 years ago

Please try the latest master. With Rumko's help testing all sorts of
combinations I think we found the bricking issue.

Basically Intel screwed up, but the bit in question is ancillary to the
main operation of the chip so we can just not use it. It turns off
the Phy on a port.

-Matt

#24 Updated by rumcic about 5 years ago

Matthew Dillon wrote:

> Please try the latest master. With Rumko's help testing all sorts of
> combinations I think we found the bricking issue.
>
> Basically Intel screwed up, but the bit in question is ancillary to the
> main operation of the chip so we can just not use it. It turns off
> the Phy on a port.
>
> -Matt

Confirming that the latest master does work.
--
Regards,
Rumko

#25 Updated by polachok about 5 years ago

Still panics with "ci not 0".

#26 Updated by dillon about 5 years ago

:Alexander Polakov <> added the comment:
:
:Still panics with "ci not 0".
:

Well, half the problem is solved. Does plugging the drive into
different ports still help like it did before?

-Matt
Matthew Dillon

#27 Updated by dillon about 5 years ago

Maybe we can track this down more quickly via IRC, on #dragonflybsd,
instead of email. Once Rumko got online we were able
to run a ton of test cases in less than an hour.

If the chipset crash has been fixed maybe cycling the port will
fix the CI issue this time, so let's try that again. It also looks
to me that there is a possible interrupt race. So try this patch:

fetch http://apollo.backplane.com/DFlyMisc/ahci12.patch

-Matt
Matthew Dillon

#28 Updated by polachok about 5 years ago

>Does plugging the drive into
>different ports still help like it did before?
No. Now it panics with any port.

#30 Updated by dillon about 5 years ago

:Alexander Polakov <> added the comment:
:
:>fetch http://apollo.backplane.com/DFlyMisc/ahci12.patch
:Holy crap! Works now! http://leaf.dragonflybsd.org/~polachok/dmesg.dfly
:

Holy crap! That means it was an interrupt race with the
initialization code. That patch makes sure that AHCI interrupts
are not processed until all ports have gotten past the port
initialization code.
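
One way to picture that kind of gating is the sketch below. The contents of ahci12.patch are not shown in this thread, so this is only an assumed shape for the idea: all names are hypothetical, and a real driver would also need locking around the shared state.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical controller state (not the driver's softc). */
struct fake_sc {
    int ports_total;
    int ports_ready;      /* ports that have finished initialization */
    uint32_t pending_is;  /* interrupt status latched while init runs */
    int handled;          /* number of port interrupts processed */
};

/* Interrupt handler: defer everything until all ports are initialized. */
static void fake_intr(struct fake_sc *sc, uint32_t is)
{
    uint32_t bit;

    if (sc->ports_ready < sc->ports_total) {
        sc->pending_is |= is;   /* remember it, but do not process yet */
        return;
    }
    is |= sc->pending_is;
    sc->pending_is = 0;
    for (bit = 0; bit < 32; bit++) {
        if (is & (1u << bit))
            sc->handled++;
    }
}

/* Called when one port finishes init; the last port replays deferred work. */
static void fake_port_init_done(struct fake_sc *sc)
{
    if (++sc->ports_ready == sc->ports_total)
        fake_intr(sc, 0);
}
```

The point of the design is that the handler never touches a port whose state is still being set up by the initialization path.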

Ok, I'll commit it. It took too long for me to find that bug.

-Matt
Matthew Dillon

#31 Updated by corecode about 5 years ago

fixed
