Bug #563

closed

strange bug with USB hdd

Added by aix-d about 17 years ago. Updated over 14 years ago.

Status: Closed
Priority: High
Assignee: -
Category: -
Target version: -
Start date:
Due date:
% Done: 0%
Estimated time:

Description

I have a strange and awful bug:

I installed DF on USB HDD #1, then booted DF from USB HDD #1 and installed DF
to USB HDD #2 (make installworld DESTDIR=/mnt/da1). This operation damages the
filesystem of USB HDD #1.

Both USB HDDs are on the same USB controller:

  # usbdevs
    addr 1: UHCI root hub, Intel
    addr 1: UHCI root hub, Intel
    addr 1: UHCI root hub, Intel
    addr 1: UHCI root hub, Intel
    addr 1: EHCI root hub, Intel
    addr 2: JM20338 SATA, USB Combo, JMicron
    addr 3: USB TO IDE, vendor 0x05e3

TGEN suspects a bug in the USB stack.


Files

dmesg.boot (8.24 KB), added by aix-d, 07/25/2007 09:44 AM
Actions #1

Updated by corecode almost 17 years ago

can you reproduce this on -DEVEL?

Actions #2

Updated by aix-d almost 17 years ago

well, I shall try one of these days

Actions #3

Updated by aix-d almost 17 years ago

Yes, I can reproduce the bug in -DEVEL (10 Jul 2007). But I cannot tell whether
it is the same bug or not.

Conditions for 100% bug reproducibility:

1. The system is installed on a USB HDD (USB box, vendor 0x05e3) and booted
from it. This USB box is connected to the right USB slot of a Dell Latitude X1
laptop.

2. The JMicron USB box is connected to the left USB slot, with a UFS file system
created on it and mounted on /mnt.

3. A cp -R /home/dcvs /mnt process is started.

After 2-15 minutes there is either a kernel panic "vm_fault: unrecoverable
fault at 0x**** in entry 0x****" or many "bad block *, ino *" messages.

There are no errors in FreeBSD 6.2 or in OpenBSD 4.1 (tested for some hours),
except for 1 occurrence in OpenBSD (after many hours of an endless loop "mv
/home/dcvs /mnt; mv /mnt /home/dcvs"):

sd0(umass0:1:0): Check Condition (error 0x70) on opcode 0x28
SENSE KEY: Hardware Error
ASC/ASCQ: Data Phase Error

and the cp process was terminated with the message: Input/Output error.
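
A minimal sketch of a self-checking version of the copy test above, assuming Python is available on the test system. The function names `tree_digest` and `copy_and_verify` are hypothetical and are not part of any tool mentioned in this report; the idea is only to detect silent corruption on either side instead of waiting for a panic or "bad block" messages.

```python
import hashlib
import os
import shutil


def tree_digest(root):
    """Return one SHA-256 digest over the relative paths and contents
    of all regular files under root (symlinks are skipped)."""
    digest = hashlib.sha256()
    for dirpath, dirnames, filenames in os.walk(root):
        dirnames.sort()  # make the traversal order deterministic
        for name in sorted(filenames):
            path = os.path.join(dirpath, name)
            if os.path.islink(path) or not os.path.isfile(path):
                continue
            digest.update(os.path.relpath(path, root).encode())
            with open(path, "rb") as fh:
                for chunk in iter(lambda: fh.read(1 << 20), b""):
                    digest.update(chunk)
    return digest.hexdigest()


def copy_and_verify(src, dst, rounds=1):
    """Copy src to dst `rounds` times. Return the number of the first
    round in which source or destination no longer matches the initial
    source digest, or None if every round matched."""
    expected = tree_digest(src)
    for n in range(1, rounds + 1):
        if os.path.exists(dst):
            shutil.rmtree(dst)
        shutil.copytree(src, dst, symlinks=True)
        if tree_digest(dst) != expected or tree_digest(src) != expected:
            return n
    return None
```

For the setup described here the call would be something like `copy_and_verify("/home/dcvs", "/mnt/dcvs", rounds=100)`. Reading the data back right after the copy may be served from the buffer cache, so unmounting and remounting the target before verifying would probably give a stronger check.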

Actions #4

Updated by corecode almost 17 years ago

thanks for your report. this indeed looks very serious.

is any of the source filesystems damaged after that?

cheers
simon

Actions #5

Updated by aix-d almost 17 years ago

> is any of the source filesystems damaged after that?

no

Actions #6

Updated by corecode almost 17 years ago

This is very strange. Do you have a guess what is broken, then? I was under the impression that copying to /mnt overwrote /home, but it doesn't seem to be like this?

cheers
simon

Actions #7

Updated by aix-d almost 17 years ago

If it can help, these are photos of the kernel panic and 'bad block' messages:
http://hep.msu.dubna.ru/~shiryaev/files/563.tar
Sorry for bad quality.

Actions #8

Updated by corecode almost 17 years ago

does openbsd also report a failure or what is it I am supposed to see in the last picture?

cheers
simon

Actions #9

Updated by dillon over 16 years ago

A data phase error is typically an indication of a bad cable.

A bad block is usually an indication of a bad block on the hard drive.
It is possible that bad blocks are being reported due to the
cabling/protocol issue but not likely. One major side effect of a
bad block error is that the drive may report old data for the contents
of the block, leading to corruption.

The VM fault is a software bug in the kernel, but it could be related
to the cable/protocol errors.

I will run some life-testing between two USB drives.

Are these drives connected via UHCI or EHCI?  Post your dmesg output
after booting is complete.

-Matt
Matthew Dillon

Actions #10

Updated by dillon over 16 years ago

Here's a question: Are your USB drives bus-powered or externally
powered?

-Matt
Actions #11

Updated by dillon over 16 years ago

I am going to assume these are bus-powered USB hard drives. I got
two and found that my test box does not produce enough power to be
able to operate both at the same time.

I happen to have four USB ports on two controllers on this test box.
When I put both HDs on the same controller and load both down at once
one invariably shuts down. These USB HDs are laptop HDs that probably
have voltage droop protection, hence they shut down if the usb
bus overcurrents instead of trying to run with a haywire voltage.

When I put the two HDs on different USB controllers they can operate
simultaneously.

The I/O errors and block errors are almost certainly due to voltage
droop. Your laptop probably can't produce sufficient current to
operate both USB HDs at the same time and I'm guessing your HDs don't
have voltage droop protection, so they try to keep running even when
the bus is overcurrented.

USB controllers do have current limiting and the protocol has a way
to specify current draw, but nobody's drivers (us, NetBSD, FreeBSD, or
I think Linux) actually check whether all the devices on a USB bus
add up to more current than the USB bus can handle.

-Matt
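
The missing check described above could look roughly like the following sketch. It only illustrates the arithmetic: the `UsbDevice` type, the function names and the `budget_ma` parameter are hypothetical, and no driver discussed in this thread is claimed to work this way. The one fact taken from the USB 2.0 specification is that a configuration descriptor reports `bMaxPower` in units of 2 mA.

```python
from dataclasses import dataclass

# USB 2.0 configuration descriptors express bMaxPower in 2 mA units.
UNIT_MA = 2


@dataclass
class UsbDevice:
    name: str          # label used only for this illustration
    b_max_power: int   # raw bMaxPower value from the descriptor


def bus_draw_ma(devices):
    """Sum the declared maximum bus current of all devices, in mA."""
    return sum(d.b_max_power * UNIT_MA for d in devices)


def over_budget(devices, budget_ma):
    """True if the declared draw exceeds the assumed supply budget."""
    return bus_draw_ma(devices) > budget_ma
```

For example, two bus-powered drives that each declare a `bMaxPower` of 250 (500 mA) would add up to 1000 mA, so `over_budget(drives, 500)` would be true under an assumed 500 mA budget. The real budget depends on the host hardware and is not given anywhere in this report.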
Actions #12

Updated by aix-d over 16 years ago

corecode:

> does openbsd also report a failure or what is it I am supposed to see in the
> last picture?

Yes, but only after some hours of copying, whereas in DragonFly BSD it takes
some minutes (see msg3318).

dillon:

I don't think that there are bad blocks on the hard drive, and the cables seem
to be OK.

The drives are connected via EHCI, and both are externally powered (3.5 inch
drives).

Actions #13

Updated by dillon over 16 years ago

:Alexander Shiryaev added the comment:
:
:corecode:
:
:> does openbsd also report a failure or what is it I am supposed to see in the
:> last picture?
:
:yes, but in time of some hours of copying process, whereas in some minutes in
:DragonFly BSD (see msg3318)
:
:dillon:
:
:I don't think what there is bad blocks on hard drive, and cables seems to be ok.
:
:Drives connected via EHCI, both externally powered (3.5 inch drives).

Externally powered means it can't be an overcurrent issue.  Shoot.
It could be an EHCI issue, or it could really be bad blocks on the
drive (though that seems less likely). Do you have any problems when
you do not load EHCI and just use OHCI ?
-Matt
Actions #14

Updated by aix-d over 16 years ago

I'm sorry, it's probably a hardware problem: similar problems occur in WinXP
(after some hours of testing). But why does it occur so often in DragonFly BSD?

Is it necessary to test without EHCI, or will you close it?

Actions #15

Updated by tuxillo over 14 years ago

As it is a hardware failure (tested on several OSes with the same result), I
think we should close this. Also, a faulty disk has probably died already,
since this message is two years old. There is no way of testing this any more.

If you can still test it with the latest HEAD, please tell us.

What do you think, guys?

Actions #16

Updated by alexh over 14 years ago

I think this can be closed. reporter doesn't seem to be around anymore.
