Bug #563: strange bug with USB hdd - DragonFlyBSD - DragonFlyBSD bugtracker

Added by aix-d almost 19 years ago. Updated over 16 years ago.

Status: Closed
Priority: High

Description

I have a strange and awful bug:

I installed DF on USB HDD #1, then booted DF from USB HDD #1 and installed DF to
USB HDD #2 (make installworld DESTDIR=/mnt/da1). This operation damaged the
filesystem of USB HDD #1.

Both USB HDDs are on the same USB controller:

usbdevs
addr 1: UHCI root hub, Intel
addr 1: UHCI root hub, Intel
addr 1: UHCI root hub, Intel
addr 1: UHCI root hub, Intel
addr 1: EHCI root hub, Intel
addr 2: JM20338 SATA, USB Combo, JMicron
addr 3: USB TO IDE, vendor 0x05e3

TGEN suspects a bug in the USB stack.

Files

dmesg.boot (8.24 KB), uploaded by aix-d, 07/25/2007 09:44 AM

Updated by corecode over 18 years ago

can you reproduce this on -DEVEL?

Updated by aix-d over 18 years ago

well, I shall try one of these days

Updated by aix-d over 18 years ago

Yes, I can reproduce a bug in -DEVEL (10 Jul 2007), but I cannot tell whether it
is the same bug or not.

Conditions for 100% bug reproducibility:

1. The system is installed to a USB HDD (USB box, vendor 0x05e3) and booted from it.
This USB box is connected to the right USB slot of a Dell Latitude X1 laptop.

2. The JMicron USB box is connected to the left USB slot, with a UFS file system
created on it and mounted on /mnt.

3. A cp -R /home/dcvs /mnt process is started.

After 2-15 minutes there is a kernel panic "vm_fault: unrecoverable fault at 0x****
in entry 0x****" or many "bad block *, ino *" messages.

There are no errors in FreeBSD 6.2 or in OpenBSD 4.1 (each tested for some hours),
except for 1 occurrence in OpenBSD (after many hours of an endless loop "mv
/home/dcvs /mnt; mv /mnt /home/dcvs"):

sd0(umass0:1:0): Check Condition (error 0x70) on opcode 0x28
SENSE KEY: Hardware Error
ASC/ASCQ: Data Phase Error

and the cp process was terminated with the message: Input/Output error.
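To tell silent corruption apart from errors the kernel actually reports, the copied tree can be compared against the source by checksum. The following is only a minimal sketch of that idea, not something that was run for this report; the function names `tree_digests` and `diff_trees` are made up for illustration, and the paths in the usage note assume the layout from the steps above.

```python
import hashlib
import os


def tree_digests(root):
    """Map each regular file's path (relative to root) to its SHA-256 hex digest."""
    digests = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            # Skip broken symlinks, sockets, etc.; only file contents are compared.
            if not os.path.isfile(path):
                continue
            h = hashlib.sha256()
            with open(path, "rb") as f:
                # Read in 1 MiB chunks so large files do not have to fit in memory.
                for chunk in iter(lambda: f.read(1 << 20), b""):
                    h.update(chunk)
            digests[os.path.relpath(path, root)] = h.hexdigest()
    return digests


def diff_trees(src, dst):
    """Return the sorted relative paths that are missing from dst or differ in content."""
    a, b = tree_digests(src), tree_digests(dst)
    return sorted(p for p in a if b.get(p) != a[p])
```

After `cp -R /home/dcvs /mnt`, calling `diff_trees("/home/dcvs", "/mnt/dcvs")` would list the files that did not arrive intact. Saving the result of `tree_digests("/home/dcvs")` before the copy and comparing it with a fresh run afterwards would show whether the source disk was damaged as well. Reading back immediately may be served from the buffer cache rather than the disk, so unmounting and remounting the target first is probably wise.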

Updated by corecode over 18 years ago

thanks for your report. this indeed looks very serious.

is any of the source filesystems damaged after that?

cheers
simon

Updated by aix-d over 18 years ago

> is any of the source filesystems damaged after that?

Updated by corecode over 18 years ago

This is very strange. Do you have a guess what is broken, then? I was under the impression that copying to /mnt overwrote /home, but that doesn't seem to be the case?

cheers
simon

Updated by aix-d over 18 years ago

If it helps, here are photos of the kernel panic and 'bad block' messages:
http://hep.msu.dubna.ru/~shiryaev/files/563.tar
Sorry for bad quality.

Updated by corecode over 18 years ago

does openbsd also report a failure or what is it I am supposed to see in the last picture?

cheers
simon

Updated by dillon over 18 years ago

A data phase error is typically an indication of a bad cable.

A bad block is usually an indication of a bad block on the hard drive.
    It is possible that bad blocks are being reported due to the
    cabling/protocol issue but not likely.  One major side effect of a 
    bad block error is that the drive may report old data for the contents
    of the block, leading to corruption.

The VM fault is a software bug in the kernel, but it could be related
    to the cable/protocol errors.

I will run some life-testing between two USB drives.

Are these drives connected via UHCI or EHCI?  Post your dmesg output
    after booting is complete.

-Matt
                    Matthew Dillon 
                    <dillon@backplane.com>

#10

Updated by dillon over 18 years ago

Here's a question: Are your USB drives bus-powered or externally
powered?

-Matt

#11

Updated by dillon over 18 years ago

I am going to assume these are bus-powered USB hard drives. I got
two and found that my test box does not produce enough power to be
able to operate both at the same time.

I happen to have four USB ports on two controllers on this test box.
    When I put both HDs on the same controller and load both down at once
    one invariably shuts down.  These USB HDs are laptop HDs that probably
    have voltage droop protection, hence they shut down if the usb
    bus overcurrents instead of trying to run with a haywire voltage.

When I put the two HDs on different USB controllers they can operate
    simultaneously.

The I/O errors and block errors are almost certainly due to voltage
    droop.  Your laptop probably can't produce sufficient current to
    operate both USB HDs at the same time and I'm guessing your HDs don't
    have voltage droop protection, so they try to keep running even when
    the bus is overcurrented.

USB controllers do have current limiting and the protocol has a way
    to specify current draw, but nobody's drivers (us, NetBSD, FreeBSD, or
    I think linux) actually check whether all the devices on a USB bus
    add up to more current than the USB bus can handle.

-Matt
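The missing check described above can be sketched as follows. This is only an illustration, not code from any of the drivers mentioned. It assumes USB 2.0 configuration descriptors, where `bMaxPower` is expressed in units of 2 mA; the function names, the device names and the `budget_ma` parameter are hypothetical. 500 mA is the USB 2.0 limit for a single high-power port, and what a particular laptop can really supply across several ports is board-specific and not known here.

```python
def max_power_ma(b_max_power):
    """Convert a USB 2.0 bMaxPower descriptor value (units of 2 mA) to milliamps."""
    return b_max_power * 2


def check_budget(devices, budget_ma=500):
    """Sum the declared draw of all devices and compare it with the supply budget.

    devices maps a device name to its raw bMaxPower value.
    budget_ma is an assumed shared supply limit in mA (hypothetical parameter).
    Returns (total_ma, over_budget).
    """
    total_ma = sum(max_power_ma(v) for v in devices.values())
    return total_ma, total_ma > budget_ma


# Hypothetical example: two bus-powered drives that each declare the maximum draw.
total, over = check_budget({"hdd_a": 250, "hdd_b": 250})
print(total, over)
```

A declared `bMaxPower` is only what the device claims; a drive spinning up may briefly draw more than it declares, so even a check like this would not rule out voltage droop.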

#12

Updated by aix-d over 18 years ago

corecode:

> does openbsd also report a failure or what is it I am supposed to see in the
> last picture?

yes, but only after some hours of copying, whereas in DragonFly BSD it takes
only some minutes (see msg3318)

dillon:

I don't think there are bad blocks on the hard drive, and the cables seem to be ok.

The drives are connected via EHCI, and both are externally powered (3.5 inch drives).

#13

Updated by dillon over 18 years ago

:Alexander Shiryaev <coumarin@gmail.com> added the comment:
:
:corecode:
:
:> does openbsd also report a failure or what is it I am supposed to see in the
:> last picture?
:
:yes, but in time of some hours of copying process, whereas in some minutes in
:DragonFly BSD (see msg3318)
:
:dillon:
:
:I don't think what there is bad blocks on hard drive, and cables seems to be ok.
:
:Drives connected via EHCI, both externally powered (3.5 inch drives).

Externally powered means it can't be an overcurrent issue.  Shoot.

It could be an EHCI issue, or it could really be bad blocks on the
    drive (though that seems less likely).  Do you have any problems when
    you do not load EHCI and just use OHCI ?

-Matt

#14

Updated by aix-d over 18 years ago

I'm sorry, it's probably a hardware problem: similar problems occur in winxp (after
some hours of testing). But why does it occur so often in DragonFly BSD?

Is it still necessary to test without EHCI, or will you close it?

#15

Updated by tuxillo over 16 years ago

As it is a hardware failure (tested on several OSes with the same result), I
think we should close this. Also, a faulty disk has probably died already,
since this message is two years old, so there is no means of testing this
in any way.

Please, if you can still test it with the latest HEAD, just tell us.

What do you think, guys?

#16

Updated by alexh over 16 years ago

I think this can be closed. The reporter doesn't seem to be around anymore.
