Bug #1460

panic: ahci_put_err_ccb(1) but CI 00000002 != 0

Added by qhwt+dfly over 5 years ago. Updated over 4 years ago.

Status: Closed
Priority: Normal
Assignee: -
% Done: 0%
Category: -
Target version: -

Description

Hi.
After seeing major fixes to the AHCI driver committed, I decided to try it
on my PC, which has never managed to boot DragonFly with AHCI enabled.
Unfortunately it still won't. Here's what I get when I enable SATA mode
in the BIOS:
pcib3: requested memory range 0xff3fe000-0xff3fffff: good
ahci0.pci3.pcib3.pci0.pcib0.legacy0.nexus0.root0
ahci0: <AHCI-PCI-SATA> [tentative] port 0x9400-0x940f,0x9480-0x9483,0x9800-0x9807,0x9880-0x9883,0x9c00-0x9c07 mem 0xff3fe000-0xff3fffff irq 24 at device 0.0 on pci3
ahci0: Reserved 0x2000 bytes for rid 0x24 type 3 at 0xff3fe000
cpu1: Invalid FID 0xc [0xe, 0xe]
ACPI: domain0 P-State configuration check failed
cpu0: Invalid FID 0xc [0xe, 0xe]
ACPI: domain0 P-State configuration check failed
ahci0: AHCI 1.0 capabilities 0xc722f00<S64A,NCQ,SALP,SAL,SCLO,SPM,PMD,SSC,PSC>, 1 ports, 32 tags/port, gen 1 (1.5Gbps) and 2 (3Gbps)
IOAPIC: try clearing IRR for irq 24
ahci0.0: START HARDRESET
ahci0.0: Transient Errors: 400000<PRCS>
ahci0.0: Restart 00000002
ahci0.0.15: Poll timeout slot 1 CMD: 64010<HPCP,PMA,FR,FRE> TFD: 0x77<ERR> SERR: 40000<DIAG.W>
ahci0.0: PMPROBE First FIS failed
panic: ahci_put_err_ccb(1) but CI 00000002 != 0 (act=00000000 sact=00000000)

mp_lock = 00000000; cpuid = 0

If I disable ACPI, the lines beginning with ACPI, cpu0, or cpu1 disappear,
but the rest of the lines remain the same. I tried booting Ubuntu from a USB
memory stick and it at least recognized the drive (though with irq 35 instead
of 24; I don't know whether that matters). I also tried booting a GENERIC
kernel minus the nata driver, which ended up with the same panic. Is there
anything else I can try, or any suggestions on where to look? The kernel
is compiled from the source code as of a22da047710, so I believe it has all
AHCI-related fixes.

Best Regards.

History

#1 Updated by sepherosa over 5 years ago

On Sat, Aug 22, 2009 at 12:36 PM, YONETANI Tomokazu wrote:
> Hi.
> After seeing major fixes on AHCI driver committed, I decided to try it
> on my PC, which never managed to boot DragonFly with AHCI enabled.
> Unfortunately it still won't.  Here's what I get when I set SATA mode
> enabled in the BIOS:
>  pcib3: requested memory range 0xff3fe000-0xff3fffff: good
>  ahci0.pci3.pcib3.pci0.pcib0.legacy0.nexus0.root0
>  ahci0: <AHCI-PCI-SATA> [tentative] port 0x9400-0x940f,0x9480-0x9483,0x9800-0x9807,0x9880-0x9883,0x9c00-0x9c07 mem 0xff3fe000-0xff3fffff irq 24 at device 0.0 on pci3
>  ahci0: Reserved 0x2000 bytes for rid 0x24 type 3 at 0xff3fe000
>  cpu1: Invalid FID 0xc [0xe, 0xe]
>  ACPI: domain0 P-State configuration check failed
>  cpu0: Invalid FID 0xc [0xe, 0xe]
>  ACPI: domain0 P-State configuration check failed
>  ahci0: AHCI 1.0 capabilities 0xc722f00<S64A,NCQ,SALP,SAL,SCLO,SPM,PMD,SSC,PSC>, 1 ports, 32 tags/port, gen 1 (1.5Gbps) and 2 (3Gbps)
>  IOAPIC: try clearing IRR for irq 24
>  ahci0.0: START HARDRESET
>  ahci0.0: Transient Errors: 400000<PRCS>
>  ahci0.0: Restart 00000002
>  ahci0.0.15: Poll timeout slot 1 CMD: 64010<HPCP,PMA,FR,FRE> TFD: 0x77<ERR> SERR: 40000<DIAG.W>
>  ahci0.0: PMPROBE First FIS failed
>  panic: ahci_put_err_ccb(1) but CI 00000002 != 0 (act=00000000 sact=00000000)
>
>  mp_lock = 00000000; cpuid = 0
>
> If I disabled ACPI, the lines beginning with ACPI, cpu0, or cpu1 disappear,

Those are error messages logged by the ACPI P-State support. Since it is
deferred from the acpi attach path, it could run at any time.
It shouldn't have much to do with the ahci code.

Best Regards,
sephe

#2 Updated by dillon over 5 years ago

: ahci0: AHCI 1.0 capabilities 0xc722f00<S64A,NCQ,SALP,SAL,SCLO,SPM,PMD,SSC,PSC>, 1 ports, 32 tags/port, gen 1 (1.5Gbps) and 2 (3Gbps)
: IOAPIC: try clearing IRR for irq 24
: ahci0.0: START HARDRESET
: ahci0.0: Transient Errors: 400000<PRCS>
: ahci0.0: Restart 00000002
: ahci0.0.15: Poll timeout slot 1 CMD: 64010<HPCP,PMA,FR,FRE> TFD: 0x77<ERR> SERR: 40000<DIAG.W>
: ahci0.0: PMPROBE First FIS failed
: panic: ahci_put_err_ccb(1) but CI 00000002 != 0 (act=00000000 sact=00000000)
:
: mp_lock = 00000000; cpuid = 0
:
:..
:kernel minus nata driver, which ended up with the same panic. Is there
:anything else I can try or any suggestions on where to look at? The kernel
:is compiled from the source code as of a22da047710, so I believe it has all
:AHCI-related fixes.
:
:Best Regards.

Hmm. Very interesting. It is getting a PRCS interrupt while it
is trying to send a software reset, then attempting to restart
the software reset. Oh joy. I clear any pending PRCS from the port
hardreset sequence so that means the target device is resetting
its PHY when we try to send a device reset to it, which it is not
supposed to do.

Try this patch. It is completely untested (other than a compile test):

fetch http://apollo.backplane.com/DFlyMisc/ahci14.patch

If that doesn't work try also (with patch still applied) increasing
the timeout in the ahci_poll() call in ahci_pm_port_probe(), around
line 145, from 1000 to 5000.

-Matt

#3 Updated by qhwt+dfly over 5 years ago

On Sat, Aug 22, 2009 at 10:05:01AM -0700, Matthew Dillon wrote:
> Hmm. Very interesting. It is getting a PRCS interrupt while it
> is trying to send a software reset, then attempting to restart
> the software reset. Oh joy. I clear any pending PRCS from the port
> hardreset sequence so that means the target device is resetting
> its PHY when we try to send a device reset to it, which it is not
> supposed to do.
>
> Try this patch. It is completely untested (other than a compile test):
>
> fetch http://apollo.backplane.com/DFlyMisc/ahci14.patch
>
> If that doesn't work try also (with patch still applied) increasing
> the timeout in the ahci_poll() call in ahci_pm_port_probe(), around
> line 145, from 1000 to 5000.

It doesn't panic anymore, but it says `Device on port is bricked'.
Increasing the timeout in ahci_poll() command on line 145 doesn't help
(same console message).
By the way, the controller is a JMB360 from JMicron, and I found a very
old patch against the Linux kernel:
http://lkml.org/lkml/2006/1/29/2

but the driver has been quite reorganized since then, so I have no idea
how this patch fits in our ahci driver.

ahci0.pci3.pcib3.pci0.pcib0.legacy0.nexus0.root0
ahci0: <AHCI-PCI-SATA> [tentative] port 0x9400-0x940f,0x9480-0x9483,0x9800-0x9807,0x9880-0x9883,0x9c00-0x9c07 mem 0xff3fe000-0xff3fffff irq 24 at device 0.0 on pci3
ahci0: Reserved 0x2000 bytes for rid 0x24 type 3 at 0xff3fe000
ahci0: AHCI 1.0 capabilities 0xc722ff00<S64A,NCQ,SALP,SAL,SCLO,SPM,PMD,SSC,PSC>, 1 ports 32 tags/port, gen 1 (1.5Gbps) and 2 (3Gbps)
IOAPIC: try clearing IRR for irq 24
ahci0.0: START HARDRESET
ahci0.0: Transient Errors during reset: 0 (ignored)
ahci0.0.15: Poll timeout slot 1 CMD: 6c111<HPCP,PMA,CR,FR,FRE,ST> TFD: 0x77<ERR> SERR: 40000<DIAG.W>
ahci0.0: PMPROBE First FIS failed
ahci0.0.15: Poll timeout slot 1 CMD: 6c111<HPCP,PMA,CR,FR,FRE,ST> TFD: 0x77<ERR> SERR: 40000<DIAG.W>
ahci0.0: PMPROBE First FIS failed
ahci0.0: Device on port is bricked
ahci0.0: END HARDRESET 16
ahci0.0: Failing all commands
(probe0:ahci0:0:0:0) error 22
(probe0:ahci0:0:0:0) Unretryable Error
.. this repeats for ahci0:0:14:0

#4 Updated by dillon over 5 years ago

:It doesn't panic anymore, but it says `Device on port is bricked'.
:Increasing the timeout in ahci_poll() command on line 145 doesn't help
:(same console message).

The actual source of the error is ahci_port_hardreset() in ahci.c.
I would like you to keep the patch applied, restore the timeout
in the ahci_pm.c code to what it was, and then modify the
ahci_port_hardreset() code as follows:

    /*
     * We got something that definitely looks like a device. Give
     * the device time to send us its first D2H FIS. Waiting for
     * BSY to clear accomplishes this.
     *
     * NOTE that a port multiplier may or may not clear BSY here,
     * depending on what is sitting in target 0 behind it.
     */
    ahci_os_sleep(100);         <<<<<<<<<<<<<<<< ADD
    ahci_flush_tfd(ap);

                         CHANGE TIMEOUT
                              vvvv
    if (ahci_pwait_clr_to(ap, 7000, AHCI_PREG_TFD,
            AHCI_PREG_TFD_STS_BSY | AHCI_PREG_TFD_STS_DRQ)) {
        error = EBUSY;
    } else {

If that works please try going back to the 3000 timeout that was there,
but leaving the ahci_os_sleep() added in. You can also try increasing
the timeout value in the ahci_os_sleep() from 100 to 1000.

Either the PHY is cycling during the hardreset, which is the device's
fault (not the chipset), or the HCA is not masking status changes during
the phy DETect phase, or the device is taking extra-long to come out
of reset.

:By the way, the controller is a JMB360 from JMicron, and I found a very
:old patch against the Linux kernel:
: http://lkml.org/lkml/2006/1/29/2
:
:but the driver has been quite reorganized since then, so I have no idea
:how this patch fits in our ahci driver.

Hmm. It doesn't look applicable but you could always try it. We
already do the Intel mod. We don't do that device-specific mod
and I would be a bit leery of adding it.

-Matt
Matthew Dillon

#5 Updated by qhwt+dfly over 5 years ago

Ok, OpenBSD has a driver named jmb(4) and apparently it's doing special
initialization for JMicron devices. I should've looked at the OpenBSD source
code first, rather than Googling around.

Cheers.

#6 Updated by alexh over 4 years ago

What's the outcome of this? Does AHCI now work with JMicron devices or not?

Cheers,
Alex Hornung

#7 Updated by qhwt+dfly over 4 years ago

Yes! I haven't tried AHCI since then, and I can't find anything specific
in the commits@ list, but now it happily boots with AHCI. It's a JMB360 on
an ASRock mainboard, but maybe other products will work now too.

Thanks.

#8 Updated by qhwt+dfly over 4 years ago

On Sat, Apr 03, 2010 at 02:40:29AM +0000, YONETANI Tomokazu (via DragonFly issue tracker) wrote:
> Yes! I haven't tried AHCI since then, and I can't find anything specific
> in the commits@ list, but now it happily boots with AHCI. It's a JMB360 on
> an ASRock mainboard, but maybe other products will work now too.

Just to add a note on this: if I disable loading the ACPI driver, the
same panic still occurs. I don't remember whether I tried both with and
without ACPI at that time, but I do recall having to turn off ACPI because
of very occasional lockups of this system (which turned out to be an issue
when acpi_timer is enabled, but it took me months to figure that out).
