Bug #563
closed
strange bug with USB hdd
Description
I have a strange and awful bug:
I installed DF on USB HDD #1, then booted DF from USB HDD #1 and installed DF to
USB HDD #2 (make installworld DESTDIR=/mnt/da1). This operation damages the
filesystem of USB HDD #1.
Both USB HDDs are on the same USB controller:
- usbdevs
 addr 1: UHCI root hub, Intel
 addr 1: UHCI root hub, Intel
 addr 1: UHCI root hub, Intel
 addr 1: UHCI root hub, Intel
 addr 1: EHCI root hub, Intel
 addr 2: JM20338 SATA, USB Combo, JMicron
 addr 3: USB TO IDE, vendor 0x05e3
TGEN suspects a bug in the USB stack
       Updated by aix-d over 18 years ago
    Yes, I can reproduce the bug in -DEVEL (10 Jul 2007), but I cannot tell whether it
is the same bug or not.
Conditions for 100% reproducibility:
1. The system is installed to a USB HDD (USB box, vendor 0x05e3) and booted from it. The USB
box is connected to the right USB slot of a Dell Latitude X1 laptop.
2. The JMicron USB box is connected to the left USB slot; a UFS file system is created on it and
mounted on /mnt.
3. A cp -R /home/dcvs /mnt process is started.
After 2-15 minutes there is either a kernel panic "vm_fault: unrecoverable fault at 0x**** in entry 0x****"
or many "bad block *, ino *" messages.
There are no errors in FreeBSD 6.2 or in OpenBSD 4.1 (tested for some hours),
except for one occurrence in OpenBSD (after many hours of an endless loop of "mv
/home/dcvs /mnt; mv /mnt /home/dcvs"):
sd0(umass0:1:0): Check Condition (error 0x70) on opcode 0x28
    SENSE KEY: Hardware Error
     ASC/ASCQ: Data Phase Error
and the cp process was terminated with the message: Input/Output error.
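One way to tell silent corruption apart from outright I/O errors in a test like the one above is to compare checksums of the source and destination trees after the copy. The following is a minimal Python sketch and not part of the original report; the function names are hypothetical. Note that a read-back immediately after the copy may be served from the buffer cache rather than from the disk, so unmounting and remounting the target before verifying gives a stronger check.

```python
import hashlib
import shutil
from pathlib import Path


def file_digest(path):
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()


def verify_tree(src, dst):
    """Return relative paths of files under src that are missing or differ under dst."""
    src, dst = Path(src), Path(dst)
    mismatches = []
    for p in sorted(src.rglob("*")):
        if p.is_file():
            q = dst / p.relative_to(src)
            if not q.is_file() or file_digest(p) != file_digest(q):
                mismatches.append(str(p.relative_to(src)))
    return mismatches


def copy_and_verify(src, dst):
    """Copy src to dst (dst must not exist yet), then verify the copy."""
    shutil.copytree(src, dst)
    return verify_tree(src, dst)
```

For the setup in this report, a call such as copy_and_verify("/home/dcvs", "/mnt/dcvs") would stand in for the cp -R step; an empty list means every file read back identically.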
       Updated by corecode over 18 years ago
    thanks for your report. this indeed looks very serious.
is any of the source filesystems damaged after that?
cheers
  simon
       Updated by aix-d over 18 years ago
    > is any of the source filesystems damaged after that?
no
       Updated by corecode over 18 years ago
    This is very strange. Do you have a guess what is broken, then? I was under the impression that copying to /mnt overwrote /home, but that doesn't seem to be the case?
cheers
  simon
       Updated by aix-d over 18 years ago
    In case it helps, here are photos of the kernel panic and 'bad block' messages:
http://hep.msu.dubna.ru/~shiryaev/files/563.tar
Sorry for bad quality.
       Updated by corecode over 18 years ago
    does openbsd also report a failure or what is it I am supposed to see in the last picture?
cheers
  simon
       Updated by dillon over 18 years ago
    A data phase error is typically an indication of a bad cable.
A bad block is usually an indication of a bad block on the hard drive.
    It is possible that bad blocks are being reported due to the
    cabling/protocol issue but not likely.  One major side effect of a
    bad block error is that the drive may report old data for the contents
    of the block, leading to corruption.
The VM fault is a software bug in the kernel, but it could be related
    to the cable/protocol errors.
I will run some life-testing between two USB drives.
Are these drives connected via UHCI or EHCI?  Post your dmesg output
    after booting is complete.
-Matt
                    Matthew Dillon 
                    <dillon@backplane.com>
       Updated by dillon over 18 years ago
    Here's a question:  Are your USB drives bus-powered or externally
    powered?
-Matt
       Updated by dillon over 18 years ago
    I am going to assume these are bus-powered USB hard drives.  I got
    two and found that my test box does not produce enough power to be
    able to operate both at the same time.
I happen to have four USB ports on two controllers on this test box.
    When I put both HDs on the same controller and load both down at once
    one invariably shuts down.  These USB HDs are laptop HDs that probably
    have voltage droop protection, hence they shut down if the usb
    bus overcurrents instead of trying to run with a haywire voltage.
When I put the two HDs on different USB controllers they can operate
    simultaneously.
The I/O errors and block errors are almost certainly due to voltage
    droop.  Your laptop probably can't produce sufficient current to
    operate both USB HDs at the same time and I'm guessing your HDs don't
    have voltage droop protection, so they try to keep running even when
    the bus is overcurrented.
USB controllers do have current limiting and the protocol has a way
    to specify current draw, but nobody's drivers (us, NetBSD, FreeBSD, or
    I think linux) actually check whether all the devices on a USB bus
    add up to more current than the USB bus can handle.
-Matt
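The kind of check described as missing from the drivers can be sketched roughly as follows. This is an illustrative Python sketch, not DragonFly code: the UsbDevice class, the function names, the example devices, and the 500 mA budget are all hypothetical. The only fact taken from the USB 2.0 specification is that the bMaxPower field of a configuration descriptor is expressed in 2 mA units.

```python
from dataclasses import dataclass


@dataclass
class UsbDevice:
    name: str
    b_max_power: int  # bMaxPower from the configuration descriptor, in 2 mA units


def total_draw_ma(devices):
    """Sum the declared maximum current draw of all devices, in mA."""
    return sum(d.b_max_power * 2 for d in devices)


def over_budget(devices, budget_ma):
    """True if the declared draw exceeds what the bus is assumed to supply."""
    return total_draw_ma(devices) > budget_ma


# Two hypothetical bus-powered drives, each declaring the 500 mA maximum.
drives = [UsbDevice("usb-hdd-1", 250), UsbDevice("usb-hdd-2", 250)]
print(total_draw_ma(drives), over_budget(drives, 500))  # prints "1000 True"
```

A check like this only sees what each device declares; a drive that draws more than its declared bMaxPower during spin-up would still go unnoticed.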
       Updated by aix-d over 18 years ago
    corecode:
> does openbsd also report a failure or what is it I am supposed to see in the
> last picture?
yes, but only after some hours of copying, whereas in DragonFly BSD it takes
some minutes (see msg3318)
dillon:
I don't think there are bad blocks on the hard drive, and the cables seem to be OK.
The drives are connected via EHCI; both are externally powered (3.5 inch drives).
       Updated by dillon over 18 years ago
    :Alexander Shiryaev <coumarin@gmail.com> added the comment:
:
:corecode:
:
:> does openbsd also report a failure or what is it I am supposed to see in the
:last picture?
:
:yes, but in time of some hours of copying process, whereas in some minutes in
:DragonFly BSD (see msg3318)
:
:dillon:
:
:I don't think what there is bad blocks on hard drive, and cables seems to be ok.
:
:Drives connected via EHCI, both externally powered (3.5 inch drives).
Externally powered means it can't be an overcurrent issue.  Shoot.
It could be an EHCI issue, or it could really be bad blocks on the
    drive (though that seems less likely).  Do you have any problems when
    you do not load EHCI and just use OHCI ?
-Matt
       Updated by aix-d about 18 years ago
    I'm sorry, it's probably a hardware problem: similar problems occur in WinXP (after some
hours of testing). But why does it occur so often in DragonFly BSD?
Is it still necessary to test without EHCI, or will you close it?
       Updated by tuxillo about 16 years ago
    As it is a hardware failure (tested on several OSes with the same result), I
think we should close this. Also, the faulty disk has probably died already,
since this report is two years old, and we have no means of testing this
ourselves.
Please, if you can still test it with the latest HEAD, just tell us.
What do you think, guys?
       Updated by alexh about 16 years ago
    I think this can be closed. The reporter doesn't seem to be around anymore.