Bug #563

strange bug with USB hdd

Added by aix-d over 7 years ago. Updated about 5 years ago.

Status: Closed
Start date:
Priority: High
Due date:
Assignee: -
% Done: 0%
Category: -
Target version: -
Description

I have a strange and awful bug:

I installed DF on USB HDD #1, then booted DF from USB HDD #1 and installed DF to
USB HDD #2 (make installworld DESTDIR=/mnt/da1). This operation damages the
filesystem of USB HDD #1.

Both USB HDDs are on the same USB controller:

# usbdevs
addr 1: UHCI root hub, Intel
addr 1: UHCI root hub, Intel
addr 1: UHCI root hub, Intel
addr 1: UHCI root hub, Intel
addr 1: EHCI root hub, Intel
addr 2: JM20338 SATA, USB Combo, JMicron
addr 3: USB TO IDE, vendor 0x05e3

TGEN suspects a bug in the USB stack.

dmesg.boot (8.24 KB) aix-d, 07/25/2007 09:44 AM

History

#1 Updated by corecode about 7 years ago

can you reproduce this on -DEVEL?

#2 Updated by aix-d about 7 years ago

well, I shall try one of these days

#3 Updated by aix-d about 7 years ago

Yes, I can reproduce the bug in -DEVEL (10 Jul 2007). But I cannot tell whether
it is the same bug or not.

Condition for 100% bug reproducibility:

1. System installed to USB HDD (USB box vendor 0x05e3) and booted from it. USB
box connected to right USB slot of Dell Latitude X1 laptop.

2. USB box JMicron connected to left USB slot, file system UFS created and
mounted to /mnt.

3. cp -R /home/dcvs /mnt process started.

The result is a kernel panic "vm_fault: unrecoverable fault at 0x**** in entry 0x****"
or many "bad block ****, ino ****" messages after 2-15 minutes.
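A copy like the one in step 3 can also be checked for silent corruption by hashing the tree before and after. The following is only a minimal sketch in Python, not something used in this report: the function names are hypothetical, the destination directory is assumed not to exist yet, and a read-back right after the copy may be served from the buffer cache rather than from the disk itself.

```python
import hashlib
import os
import shutil


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of one file, read in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def tree_digests(root):
    """Map each regular file's path (relative to root) to its digest."""
    digests = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            # Skip symlinks and special files; only file contents are hashed.
            if os.path.isfile(full) and not os.path.islink(full):
                digests[os.path.relpath(full, root)] = file_digest(full)
    return digests


def copy_and_verify(src, dst):
    """Copy src to dst (dst must not exist) and report files that differ."""
    before = tree_digests(src)
    shutil.copytree(src, dst, symlinks=True)
    after_src = tree_digests(src)
    after_dst = tree_digests(dst)
    return {
        # Files on the source whose contents changed during the copy.
        "source_changed": sorted(
            p for p in before if after_src.get(p) != before[p]
        ),
        # Files on the destination that do not match the original source.
        "copy_mismatch": sorted(
            p for p in before if after_dst.get(p) != before[p]
        ),
    }


# Example call, using the paths from step 3 (destination name is assumed):
# print(copy_and_verify("/home/dcvs", "/mnt/dcvs"))
```

An empty "source_changed" list would suggest the source filesystem was not touched, while entries in "copy_mismatch" would point at the destination drive or the path to it.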

There are no errors in FreeBSD 6.2 or OpenBSD 4.1 (each tested for some hours),
except for 1 occurrence in OpenBSD (after many hours of an endless loop of "mv
/home/dcvs /mnt; mv /mnt /home/dcvs"):

sd0(umass0:1:0): Check Condition (error 0x70) on opcode 0x28
SENSE KEY: Hardware Error
ASC/ASCQ: Data Phase Error

and the cp process was terminated with the message: Input/Output error.
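For readers unfamiliar with the OpenBSD message above, the fields can be decoded as in the following Python sketch. It makes several assumptions: the numeric codes come from the SCSI standards rather than from the log (which only prints the text form), the lookup tables cover just the values relevant here, and the sample buffer is constructed for illustration, not captured from the failing drive.

```python
# Small lookup tables covering only the values relevant to this report.
OPCODES = {0x28: "READ(10)"}
RESPONSE_CODES = {
    0x70: "current error, fixed format",
    0x71: "deferred error, fixed format",
}
SENSE_KEYS = {0x0: "No Sense", 0x3: "Medium Error", 0x4: "Hardware Error"}
ASC_ASCQ = {(0x4B, 0x00): "Data Phase Error"}


def decode_fixed_sense(sense):
    """Decode the main fields of fixed-format SCSI sense data (bytes)."""
    if len(sense) < 14:
        raise ValueError("need at least 14 bytes of fixed-format sense data")
    response_code = sense[0] & 0x7F  # bit 7 is the VALID flag
    if response_code not in (0x70, 0x71):
        raise ValueError("not fixed-format sense data")
    sense_key = sense[2] & 0x0F      # upper bits carry other flags
    asc, ascq = sense[12], sense[13]
    return {
        "response_code": RESPONSE_CODES[response_code],
        "sense_key": SENSE_KEYS.get(sense_key, hex(sense_key)),
        "asc_ascq": ASC_ASCQ.get((asc, ascq), f"{asc:02x}/{ascq:02x}"),
    }


# Illustrative buffer (an assumption, not taken from the report):
# response code 0x70, sense key 0x4, ASC/ASCQ 0x4B/0x00.
sample = bytes([0x70, 0, 0x04, 0, 0, 0, 0, 0x0A, 0, 0, 0, 0, 0x4B, 0x00, 0, 0, 0, 0])
decoded = decode_fixed_sense(sample)
print(OPCODES[0x28], decoded)
```

Read this way, the message says that a READ(10) command failed with a hardware-level error during the data transfer phase.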

#4 Updated by corecode about 7 years ago

thanks for your report. this indeed looks very serious.

is any of the source filesystems damaged after that?

cheers
simon

#5 Updated by aix-d about 7 years ago

> is any of the source filesystems damaged after that?

no

#6 Updated by corecode about 7 years ago

This is very strange. Do you have a guess what is broken, then? I was under the impression that copying to /mnt overwrote /home, but it doesn't seem to be like this?

cheers
simon

#7 Updated by aix-d about 7 years ago

If it helps, here are photos of the kernel panic and 'bad block' messages:
http://hep.msu.dubna.ru/~shiryaev/files/563.tar
Sorry for the bad quality.

#8 Updated by corecode about 7 years ago

does openbsd also report a failure or what is it I am supposed to see in the last picture?

cheers
simon

#9 Updated by dillon about 7 years ago

A data phase error is typically an indication of a bad cable.

A bad block is usually an indication of a bad block on the hard drive.
It is possible that bad blocks are being reported due to the
cabling/protocol issue but not likely. One major side effect of a
bad block error is that the drive may report old data for the contents
of the block, leading to corruption.

The VM fault is a software bug in the kernel, but it could be related
to the cable/protocol errors.

I will run some life-testing between two USB drives.

Are these drives connected via UHCI or EHCI? Post your dmesg output
after booting is complete.

-Matt
Matthew Dillon

#10 Updated by dillon about 7 years ago

Here's a question: Are your USB drives bus-powered or externally
powered?

-Matt

#11 Updated by dillon about 7 years ago

I am going to assume these are bus-powered USB hard drives. I got
two and found that my test box does not produce enough power to be
able to operate both at the same time.

I happen to have four USB ports on two controllers on this test box.
When I put both HDs on the same controller and load both down at once
one invariably shuts down. These USB HDs are laptop HDs that probably
have voltage droop protection, hence they shut down if the usb
bus overcurrents instead of trying to run with a haywire voltage.

When I put the two HDs on different USB controllers they can operate
simultaneously.

The I/O errors and block errors are almost certainly due to voltage
droop. Your laptop probably can't produce sufficient current to
operate both USB HDs at the same time and I'm guessing your HDs don't
have voltage droop protection, so they try to keep running even when
the bus is overcurrented.

USB controllers do have current limiting and the protocol has a way
to specify current draw, but nobody's drivers (us, NetBSD, FreeBSD, or
I think Linux) actually check whether all the devices on a USB bus
add up to more current than the USB bus can handle.
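The kind of check described here could look roughly like the Python sketch below. It is not code from any of the drivers mentioned. It assumes the USB 2.0 convention that a configuration descriptor's bMaxPower field is expressed in 2 mA units; the device list and the 500 mA budget are hypothetical (500 mA is the USB 2.0 limit for a single high-power port, and what a particular machine can actually supply is not stated in this thread).

```python
def max_power_ma(b_max_power):
    """Convert a USB 2.0 bMaxPower field (2 mA units) to milliamps."""
    return b_max_power * 2


def check_power_budget(devices, budget_ma=500):
    """Sum the declared bus current of all devices and compare to a budget.

    devices is a list of (name, bMaxPower) pairs; budget_ma is an assumed
    figure for what the bus can supply. Returns (total_ma, over_budget).
    """
    total_ma = sum(max_power_ma(b_max_power) for _name, b_max_power in devices)
    return total_ma, total_ma > budget_ma


# Hypothetical example: two bus-powered drives, each declaring 500 mA.
drives = [("usb-hdd-1", 250), ("usb-hdd-2", 250)]
total, over = check_power_budget(drives)
print(f"declared draw: {total} mA, over budget: {over}")
```

A check like this only sees what the devices declare; a drive that draws more than it advertises would still pass.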

-Matt

#12 Updated by aix-d about 7 years ago

corecode:

> does openbsd also report a failure or what is it I am supposed to see in the
> last picture?

yes, but only after some hours of copying, whereas in DragonFly BSD it takes
only some minutes (see msg3318)

dillon:

I don't think that there are bad blocks on the hard drive, and the cables seem to be ok.

The drives are connected via EHCI, and both are externally powered (3.5 inch drives).

#13 Updated by dillon about 7 years ago

:Alexander Shiryaev <> added the comment:
:
:corecode:
:
:> does openbsd also report a failure or what is it I am supposed to see in
:> the last picture?
:
:yes, but in time of some hours of copying process, whereas in some minutes
:in DragonFly BSD (see msg3318)
:
:dillon:
:
:I don't think what there is bad blocks on hard drive, and cables seems to be
:ok.
:
:Drives connected via EHCI, both externally powered (3.5 inch drives).

Externally powered means it can't be an overcurrent issue. Shoot.

It could be an EHCI issue, or it could really be bad blocks on the
drive (though that seems less likely). Do you have any problems when
you do not load EHCI and just use OHCI ?

-Matt

#14 Updated by aix-d about 7 years ago

I'm sorry, it's probably a hardware problem: similar problems occur in WinXP
(after some hours of testing). But why does it occur so often in DragonFly BSD?

Is it necessary to test without EHCI, or will you close it?

#15 Updated by tuxillo about 5 years ago

As it is a hardware failure (tested on several OSes with the same result), I
think we should close this. Also, a faulty disk has probably died by now, since
this message is two years old. There are no means of testing this in any way.

If you can still test it with the latest HEAD, please just tell us.

What do you think, guys?

#16 Updated by alexh about 5 years ago

I think this can be closed. The reporter doesn't seem to be around anymore.
