Bug #311

Lockups possibly related to IDE issues?

Added by nospam over 8 years ago. Updated about 8 years ago.

Status: Closed
Priority: Normal
Assignee: -
Category: -
Target version: -
Start date:
Due date:
% Done: 0%

Description

I'm actually not sure if this is a bug or something I've messed up.

Things were working until I apparently had a hard drive failure and replaced
the hard drive (drive as master and CDROM as the slave, both on IDE0).

There is no sound card, there is only a 3Com PCI card plus the other components
of a bare machine. (VGA card and the motherboard IDE, there aren't any
USB ports that I can see)

The drive is a MAXTOR 200G, with the first 50G being used for the OS (primary
partition). Set up as master = LBA, slave = AUTO in BIOS. (The BIOS can't see the
full drive.)
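
For context, here is a quick arithmetic sketch of the two classic disk addressing ceilings. The limits themselves are standard (legacy INT 13h CHS and 28-bit LBA, both with 512-byte sectors). Whether this particular BIOS is hitting either of them is an assumption, not something established in this report, and the drive size is the nominal "200G" taken as 200 * 10^9 bytes.

```python
SECTOR_BYTES = 512  # standard sector size for ATA drives of this era

# Legacy INT 13h CHS: 1024 cylinders, 255 heads (the value used in practice),
# 63 sectors per track.
chs_limit_bytes = 1024 * 255 * 63 * SECTOR_BYTES

# 28-bit LBA: 2**28 addressable sectors.
lba28_limit_bytes = 2**28 * SECTOR_BYTES

# Nominal drive size from the report (assumed to mean decimal gigabytes).
drive_bytes = 200 * 10**9

for name, limit in (("INT 13h CHS", chs_limit_bytes),
                    ("28-bit LBA", lba28_limit_bytes)):
    print(f"{name}: {limit:,} bytes, drive fits: {drive_bytes <= limit}")
```

Both ceilings are below 200G, which would be consistent with a BIOS that cannot see the full drive, though the report does not say which limit (if either) applies here.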

It seemed to lock up at random times during the install process, almost always when
copying files from the distribution CD. Eventually I was able to get it
installed, only to see it lock up when it probed the CDROM (by this time,
booting from hard drive).

I inserted a CD in the drive, wondering if it was choking on the no media found,
booted again, this time it loaded and I got to the login prompt. (again, booting
from hard drive)

Then it locked up just as I entered the password.

This seemed like defective hardware, so I installed slackware linux, compiled
the slackware kernel WHILE copying files to and from the CDROM, (intending to
give the IDE's a full workout) running interactive programs at the same time.
(Really gave it a tough workout)

Linux did not lock up, FWIW: I did confirm the files were in fact copied.

I don't think there is anything physically defective with the actual machine. Linux
didn't indicate any error messages, nothing like that.

So.. this tells me it's probably hardware related but strictly in the context
of DragonflyBSD, as linux was not affected. (FWIW: I can't even ping the machine,
it's definitely a hard lockup)

As per the guide, I did a 'vmstat -i' several times; there does NOT appear to
be an IRQ storm. The only thing that changed much was the clk interrupt (which
I kind of expected :-) ).
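
The "run it several times and compare" check can be sketched as a small script that diffs two interrupt-count snapshots. This is only an illustration: the column layout (name, then total, then rate) is an assumption about what 'vmstat -i' prints, the sample snapshots are made-up numbers rather than output from this machine, and the storm threshold is arbitrary.

```python
def parse_counts(text):
    """Return {source: total} from lines shaped like '<name...> <total> <rate>'."""
    counts = {}
    for line in text.splitlines():
        fields = line.split()
        # Skip headers and anything whose last two fields are not integers.
        if len(fields) < 3 or not (fields[-1].isdigit() and fields[-2].isdigit()):
            continue
        counts[" ".join(fields[:-2])] = int(fields[-2])
    return counts


def interrupt_rates(before, after, seconds):
    """Interrupts per second for each source between two snapshots."""
    return {name: (after[name] - before.get(name, 0)) / seconds for name in after}


def storm_suspects(rates, threshold_per_sec):
    """Sources whose rate exceeds the (arbitrary, assumed) threshold."""
    return sorted(name for name, rate in rates.items() if rate > threshold_per_sec)


# Made-up snapshots taken 10 seconds apart; not from the reporter's machine.
first = parse_counts("interrupt total rate\nclk 100000 100\nata0 5000 5\n")
second = parse_counts("interrupt total rate\nclk 101000 100\nata0 5020 5\n")
rates = interrupt_rates(first, second, seconds=10)
print(rates)
print(storm_suspects(rates, threshold_per_sec=10000))
```

A steadily ticking clock source with everything else near idle, as described above, would leave the suspect list empty.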

Basically, it's a machine I can mess with for awhile if it'll help you folks
uncover anything. Eventually I'll need it to do real work, but for the interim,
nothing stored on it is critical.

This machine seems to have a pretty decent BIOS.

Bios settings:

Power Management: OFF
ACPI I/O Device Node is OFF. (I've experimented with this, so far, no dif.)
PNP OS is switched ON.

Anything I should try?

Jamie
--
http://www.geniegate.com Custom web programming
(rot13) User Management Solutions

History

#1 Updated by qhwt+dfly over 8 years ago

On Fri, Sep 08, 2006 at 12:13:13PM +0000, Jamie wrote:
> I'm actually not sure if this is a bug or something I've messed up.
>
> Things were working until I apparently had a hard drive failure, replaced
> Hard drive (drive as master and CDROM as the slave, both on IDE0).
>
> There is no sound card, there is only a 3Com PCI card plus the other components
> of a bare machine. (VGA card and the motherboard IDE, there aren't any
> USB ports that I can see)
>
> The drive is a MAXTOR 200G, with the first 50G being used for the OS (primary
> partition) Setup as master = LBA, slave = AUTO in BIOS. (The BIOS can't see the
> full drive.)
>
> Seemed to lock up at random times during the install process, almost always when
> copying files from the distribution CD, eventually I was able to get it
> installed, only to see it lock up when it probed the CDROM (by this time,
> booting from hard drive)
>
> I inserted a CD in the drive, wondering if it was choking on the no media found,
> booted again, this time it loaded and I got to the login prompt. (again, booting
> from hard drive)
>
> Then it locked up just as I entered the password.

I assume you're using 1.6.x-RELEASE, right?

:

> This machine seems to have a pretty decent BIOS.
>
> Bios settings:
>
> Power Management: OFF
> ACPI I/O Device Node is OFF. (I've experimented with this, so far, no dif.)

Does this actually turn off ACPI driver? When ACPI driver is disabled,
sysctl hw.acpi gives you nothing:
# sysctl hw.acpi
sysctl: unknown oid 'hw.acpi'

If not, choose "Boot DragonFly with ACPI disabled" from the boot menu,
or type 'unset acpi_load' on the boot loader prompt:
OK unset acpi_load

Cheers.

#2 Updated by justin over 8 years ago

On Fri, September 8, 2006 8:13 am, Jamie wrote:
> I'm actually not sure if this is a bug or something I've messed up.
>
> Things were working until I apparently had a hard drive failure, replaced
> Hard drive (drive as master and CDROM as the slave, both on IDE0).

If it was working normally before, and the only changed hardware was the
drive, then it sounds like something with this drive, specifically.

> This seemed like defective hardware, so I installed slackware linux,

When you installed slackware, where did you install it on the disk? i.e.
was it in the same 50G region as the previous install of DragonFly?

If you can get a dmesg out of the dragonfly install, that may help.

As a sanity check, are all the cables well-seated?

If you have another computer available, can you boot from your DragonFly
installation CD (not for installation, but just as a live cd) and see if
it works?

These are somewhat wild guesses; I don't think I'll have an answer - but
this may get us closer.

#3 Updated by nospam over 8 years ago

YONETANI Tomokazu mentions:
>On Fri, Sep 08, 2006 at 12:13:13PM +0000, Jamie wrote:
>> I'm actually not sure if this is a bug or something I've messed up.
>>
>> Things were working until I apparently had a hard drive failure, replaced
>> Hard drive (drive as master and CDROM as the slave, both on IDE0).
>>
>> There is no sound card, there is only a 3Com PCI card plus the other components
>> of a bare machine. (VGA card and the motherboard IDE, there aren't any
>> USB ports that I can see)
>>
>> The drive is a MAXTOR 200G, with the first 50G being used for the OS (primary
>> partition) Setup as master = LBA, slave = AUTO in BIOS. (The BIOS can't see the
>> full drive.)
>>
>> Seemed to lock up at random times during the install process, almost always when
>> copying files from the distribution CD, eventually I was able to get it
>> installed, only to see it lock up when it probed the CDROM (by this time,
>> booting from hard drive)
>>
>> I inserted a CD in the drive, wondering if it was choking on the no media found,
>> booted again, this time it loaded and I got to the login prompt. (again, booting
>> from hard drive)
>>
>> Then it locked up just as I entered the password.
>
>I assume you're using 1.6.x-RELEASE, right?

Yes, the most current CD ISO image. (I'm unable to get far enough along to update
the software)

It's this image: dfly-1.6.0_REL.iso

>> This machine seems to have a pretty decent BIOS.
>>
>> Bios settings:
>>
>> Power Management: OFF
>> ACPI I/O Device Node is OFF. (I've experimented with this, so far, no dif.)
>
>Does this actually turn off ACPI driver? When ACPI driver is disabled,
>sysctl hw.acpi gives you nothing:
># sysctl hw.acpi
>sysctl: unknown oid 'hw.acpi'

That is correct:

sysctl: unknown oid 'hw.acpi' (Able to do this only with the boot CD)

(How in the world do you remember all those variables? :-) )

Someone else asked if I had Linux installed in the same partition:

Linux is on /dev/hda2 and DragonFly is /dev/hda1 (not sure exactly what
those equate to in terms of dragonflyBSD devices, it's the first partition
though)

On a whim, I tried disconnecting the CDROM entirely, no difference. Locked up
part way in to the system checks. (trying to correct itself from the time it
froze on the password login I'd imagine)

Is there a limitation on the size of disklabel slices that I missed?
---------------|partition 1| (disk label)-----
4G /
8G /var
1G /tmp
12G /usr
768M swap
24-25G /home
-----------------------
Total: 50G on partition 1.

I've downloaded FreeBSD version 6.1 beta (6.1-BETA1-i386-disc1.iso) I'll try
installing that and see if it goes. Maybe that'll show something?

Jamie

#4 Updated by justin over 8 years ago

On Fri, September 8, 2006 11:09 am, Jamie wrote:

> Linux is on /dev/hda2 and DragonFly is /dev/hda1 (not sure exactly what
> those equate to in terms of dragonflyBSD devices, it's the first partition
> though)

If there's a problem with the disk, and it's in the sectors being used for
DragonFly, that would explain why the Linux install isn't having any
noticeable problems. If you have time, it may be worth trying to install
Linux over the DragonFly partition, to see if the same symptoms happen.

#5 Updated by nospam over 8 years ago

Justin C. Sherrill mentions:
>On Fri, September 8, 2006 11:09 am, Jamie wrote:
>
>> Linux is on /dev/hda2 and DragonFly is /dev/hda1 (not sure exactly what
>> those equate to in terms of dragonflyBSD devices, it's the first partition
>> though)
>
>If there's a problem with the disk, and it's in the sectors being used for
>DragonFly, that would explain why the Linux install isn't having any
>noticeable problems. If you have time, it may be worth trying to install
>Linux over the DragonFly partition, to see if the same symptoms happen.

Was in the process of doing that when I read a warning from the FreeBSD
install process:

The Geometry:
24792 Cyls/255 Heads/63 Sectors/398_283_480 is incorrect, please
specify what the BIOS thinks is the correct geometry.

Linux "cfdisk" reported:

Heads/16 Cyl/395136 Sectors-track: 63 Size 203_928_109_056

The two fdisks gave different geometries.
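
A quick check of the two sets of numbers, assuming 512-byte sectors (the thread does not state the sector size), suggests the tools are describing the same disk with different head counts rather than disagreeing about its size:

```python
SECTOR_BYTES = 512  # assumed sector size


def total_sectors(cylinders, heads, sectors_per_track):
    return cylinders * heads * sectors_per_track


freebsd = total_sectors(24792, 255, 63)   # geometry from the FreeBSD warning
linux = total_sectors(395136, 16, 63)     # geometry from cfdisk

print(freebsd)               # 398283480, the sector count in the warning
print(linux * SECTOR_BYTES)  # 203928109056, the size cfdisk reported
print(linux - freebsd)       # 13608 sectors, less than one 255x63 cylinder
print(linux // (255 * 63))   # 24792, the cylinder count in the warning
```

In other words, 24792 is what is left after dividing the cfdisk sector count by 255 * 63 and dropping the remainder, so the difference between the two reports is a rounding artifact of the translated geometry.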

In my BIOS, it's "Auto" (blank, I can key in the geometry manually, I tried,
but there isn't enough space in the fields)

Awhile ago, I recall reading that linux ignored the BIOS for disk geometry
and instead just read the drive itself.

So...

What is BIOS's version of the geometry in relation to the DragonFly booting
process?

I'm a little concerned about entering the wrong settings, as I really don't
want anything to attempt access to portions of the drive that are invalid,
possibly physically destroying the drive itself. (I'm NOT worried about the
data at this point, there isn't any.. but physical damage is a problem.)

As luck would have it, the specifications for the drive don't actually
give the geometry, neither does their website. I can't be sure what the
correct one is.

I'll need to do some more reading before attempting a test install of
FreeBSD, I really want to make sure it's "safe".

Jamie

#6 Updated by dillon over 8 years ago

How are the hard drive and CD cabled to the motherboard? Are they
going to a separate controller or are they sharing a cable?

-Matt

#7 Updated by nospam over 8 years ago

Matthew Dillon mentions:
> How is the hard drive and CD cabled to the motherboard? Are they
> going to a separate controller or are they sharing a cable ?

They are sharing a cable. (Just as before) Hard drive in 'LBA mode'
(But I've tried variations of "NORMAL/LARGE")

I've tried disconnecting the CDROM entirely as well as disabling the CDROM in
the BIOS. (as well as both) Same result. (I thought it was a conflict at first,
I've seen linux complain about /dev/hdb on other machines before.)

The old drive (the one that worked) was in "backwards": it was plugged in to the
first plug on the IDE cable. The instructions told me to put it on the last.
I've tried both, same lockup.

While booting from the hard drive it does tend to get past the startup
messages, hardware detection, etc.. usually tends to lockup just as it starts
to run the file system checks.

One thing I am doing differently here is not allocating the full drive.

At the boot prompt, lsdev gave this (don't know if this is useful info or not):

disk0 BIOS drive A:
disk1 BIOS drive C:
disk1s1a FFS <-- loaddev
disk1s1b swap
disk1s1{d,e,f,g} FFS

disk1s2 ext2fs
disk1s5 Linux swap
disk1s{6,7,8,9} ext2fs

The warning I got when trying to load FreeBSD may have given some insight, but,
if I set BIOS to "autodetect" and it finds a geometry that is obviously
incorrect, is there any danger of having the physical device damaged? (is the
microcode in a hard drive smart enough to avoid doing things that are
physically dangerous?) If I can be sure of this, I'd try different settings
there and see what happens.

Does dragonflybsd pay attention to BIOS's idea of the hard drive geometry, even
after the booting stage? It's getting well past the bootblocks.

Totally unrelated...

How come Dragonfly uses vixiecron instead of Dillon Cron? :-) (I always remembered
you as "that cron guy" before this, the cron that does cron and ONLY cron.)

Jamie

#8 Updated by nospam over 8 years ago

Just an update:

Tried to install FreeBSD in the exact same configuration, same lockup.

So, I deleted the partitions and tried two 1GB partitions (on the off chance
one was bad).

Tried installing FreeBSD on both partitions, same result, lockup.

Deleted all data, reset BIOS to defaults (keeping the partitions intact though).

Installed FreeBSD (minimal) on the first 1GB partition and.. success.

Followed the exact same procedure with dragonflybsd, success.

Ran: cat /dev/zero >/$partition$/scratch (filling up the available space)
on *each* disk slice (/usr /var /home /tmp /). Each time, the filesystem
filled up, told me it filled up and gave me a chance to correct it. NO
lockup.

(Was trying to "prove" that all mounted space was accessible; it apparently
was, at least in terms of writing. I didn't attempt to read from the same
files, but I doubt it was a hardware problem at this point.)
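
Since the fill test above only exercised writes, a read-back pass could be sketched roughly as follows. This is a minimal illustration, not what was run in this report: the chunk size, the test pattern and the file name are arbitrary choices, and the demo writes to a temporary directory rather than to one of the slices.

```python
import hashlib
import os
import tempfile

CHUNK = 1024 * 1024  # 1 MiB per write; arbitrary choice


def pattern_chunk(index):
    """Deterministic, non-zero test data for chunk number `index`."""
    seed = hashlib.sha256(str(index).encode()).digest()  # 32 bytes
    return seed * (CHUNK // len(seed))


def write_and_verify(path, chunks):
    """Write `chunks` pattern chunks to `path`, then read them back and compare."""
    with open(path, "wb") as f:
        for i in range(chunks):
            f.write(pattern_chunk(i))
        f.flush()
        os.fsync(f.fileno())  # ask the OS to push the data out to the device
    with open(path, "rb") as f:
        for i in range(chunks):
            if f.read(CHUNK) != pattern_chunk(i):
                return False
    return True


# Small demo in a temporary directory; point the path at the slice under test instead.
with tempfile.TemporaryDirectory() as d:
    ok = write_and_verify(os.path.join(d, "scratch"), chunks=4)
    print("read-back matches:", ok)
```

A read issued shortly after a write may be served from the operating system's buffer cache, so a test meant to exercise the disk itself would need to write noticeably more data than the machine has RAM.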

I'll try testing out various partition sizes and see if there is any
difference, or, perhaps mucking about with the BIOS settings maybe discover
what went wrong... if someone else has problems, at least for now, the "solution"
was to completely reset the BIOS.

I don't think this is so much a "bug" as it is, maybe something that should be
discovered, just in case someone else stumbles upon whatever it is I did. (was
really puzzling as the only thing that changed was the hard drive, same settings
otherwise)

I'd be willing to turn this machine over to a developer for the weekend,
with a serial port connection if they want to run any tests/things/whatever
if that would help. (I really _really_ like the ideas here)

I'd probably have to set up some kind of mechanism so you could access minicom,
etc.. (and some sort of instant message, so you could tell me when to hit reset,
adjust the firewall, etc..)

Just an offer if it'd be at all helpful in anything you're doing. Feel free to
toast the machine at will. :-)

Meanwhile... I'll try and figure out what happened.

Jamie

#9 Updated by erik-wikstrom over 8 years ago

On 2006-09-09 06:33, Jamie wrote:
> Totally unrelated...
>
> How come Dragonfly uses vixiecron instead of Dillon Cron? :-) (I always remembered
> you as "that cron guy" before this, the cron that does cron and ONLY cron.)

This question came up some time ago and the answer Matt gave then was:
There are a bunch of things that vixie cron does that dcron does
not, like deal with environment variables and execute-on-startup
directives in the crontab and TZ overrides. If someone wants to
work on those items, and whatever else is missing, then I wouldn't
mind replacing it. But I don't have time to work on it myself.

#10 Updated by nospam over 8 years ago

In <45028ba8$0$790$415eb37d@crater_reader.dragonflybsd.org>,
Erik Wikström mentions:
>On 2006-09-09 06:33, Jamie wrote:
>> Totally unrelated...
>>
>> How come Dragonfly uses vixiecron instead of Dillon Cron? :-) (I always remembered
>> you as "that cron guy" before this, the cron that does cron and ONLY cron.)
>
>This question came up some time ago and the answer Matt gave then was:
> There are a bunch of things that vixie cron does that dcron does
> not, like deal with environment variables and execute-on-startup
> directives in the crontab and TZ overrides. If someone wants to
> work on those items, and whatever else is missing, then I wouldn't
> mind replacing it. But I don't have time to work on it myself.

This is kind of funny, the thing I liked about Dillon's cron was exactly
that it did NOT do all these things. Just ran cron, no more, no less.

Incidentally...

Here was the problem for anyone else with the same headache:

In BIOS:

Block Mode needs to be set to Disable
AND
UDMA needs to be set to Disable.

Neither of these two things by themselves would produce a consistent lockup.
(What's more, I have no idea how they were set that way.)

The partition size wasn't related in any way.

Wish I knew of a good source of info for what BIOS settings mean, I have a
general idea, but nothing in depth.

Invitation is still open for any developer if they want a machine to toast for awhile,
I don't know if it'd be helpful or not. If I don't hear from anyone, I'll start setting
it up.

Jamie

#11 Updated by dillon over 8 years ago

:...
:if I set BIOS to "autodetect" and it finds a geometry that is obviously
:incorrect, is there any danger of having the physical device damaged? (is the
:microcode in a hard drive smart enough to avoid doing things that are
:physically dangerous?) If I can be sure of this, I'd try different settings
:there and see what happens.
:
:Does dragonflybsd pay attention to BIOS's idea of the hard drive geometry, even
:after the booting stage? It's getting well past the bootblocks.

DragonFly uses 'packet mode' BIOS commands, which are independent of the
drive geometry.
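
To illustrate why geometry-independent addressing matters, here is a sketch of the standard CHS to LBA conversion applied with the two geometries reported earlier in this thread (255 heads and 16 heads, both with 63 sectors per track). It assumes 'packet mode' refers to BIOS calls that take a linear block number; the sample CHS address is arbitrary, and nothing here models what this particular BIOS or the ATA driver actually does.

```python
def chs_to_lba(c, h, s, heads, sectors_per_track):
    """Standard conversion; sectors are numbered from 1, cylinders and heads from 0."""
    return (c * heads + h) * sectors_per_track + (s - 1)


def lba_to_chs(lba, heads, sectors_per_track):
    """Inverse of chs_to_lba for the given geometry."""
    c, rem = divmod(lba, heads * sectors_per_track)
    h, s0 = divmod(rem, sectors_per_track)
    return c, h, s0 + 1


address = (100, 5, 1)  # arbitrary cylinder, head, sector

as_255 = chs_to_lba(*address, heads=255, sectors_per_track=63)
as_16 = chs_to_lba(*address, heads=16, sectors_per_track=63)
print(as_255, as_16)  # the same CHS tuple names two different sectors

# The sector the 255-head geometry meant, expressed in the 16-head geometry.
print(lba_to_chs(as_255, heads=16, sectors_per_track=63))
```

The same CHS tuple lands on different sectors depending on which geometry is assumed, whereas a linear block number means the same thing under either one.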

The problem is that the BIOS still screws up even packet mode requests,
and if you leave it in 'Auto' mode it seems to be able to put the disk
into some sort of crazy mode that BSDs can't seem to get it out of,
resulting in an inability to access the whole disk.

Basically you need to program the BIOS to access the disk in LBA or LARGE
mode. If you leave it in Auto the BIOS will mess everything up.

I would love to know how to 'fix' our ATA controller to fix the
access problem when the BIOS misprograms the drive. A RESET doesn't
seem to do it.

--

Insofar as cabling goes, you basically must be sure that the hard drive
is set up as the IDE master and that the CDROM is set up as the IDE
slave. If you have a twisty-cable, then things get more complicated
because the two device connectors on the cable act differently (sometimes
even if the drive or CDRom is jumpered explicitly for master/slave).
I recommend *NOT* using a twisty cable. Just use a straight-through
cable (both connectors on the cable act the same) and jumper the drive
as a master and the CDROM as a slave. Put the drive on the END of the
cable, not the middle (this is because IDE hard drives almost always
terminate the cable properly while many CDROM drives do not, and this
is important at higher IDE speeds).

:Totally unrelated...
:
:How come Dragonfly uses vixiecron instead of Dillon Cron? :-) (I always remembered
:you as "that cron guy" before this, the cron that does cron and ONLY cron.)
:
:Jamie

Someone else posted the excerpt from a posting I made a while ago, which
is still essentially true today. Users depend on the environment variable
support, shell specification, and boot-time job specifications that
DCron does not have. If someone were to add those features to DCron and
we made some minor adjustments to the spool directories, we could use
dcron in the base system instead of vixiecron.

-Matt

#12 Updated by wbh over 8 years ago

Matthew Dillon wrote:

*SNIP*

> slave. If you have a twisty-cable, then things get more complicated
> because the two device connectors on the cable act differently (sometimes
> even if the drive or CDRom is jumpered explicitly for master/slave).

ACK - they would do.
The 'twisty cable' should have the devices set to 'CS' (Cable Select) ONLY.

> I recommend *NOT* using a twisty cable.

Agreed. But largely 'coz they are not easy to find - or even easy to make in
UDMA-grade (shielded).

Their primary value is in making pre-configured spares for IDE hot swap not need
jumper changes, so most of these exist in bespoke housings, much as
externally-set SCSI-ID did.

> Just use a straight-through
> cable (both connectors on the cable act the same) and jumper the drive
> as a master and the CDROM as a slave. Put the drive on the END of the
> cable, not the middle (this is because IDE hard drives almost always
> terminate the cable properly while many CDROM drives do not, and this
> is important at higher IDE speeds).

ALSO:

- IF using HDD on *both* PATA connectors of a channel, do not expect that the
'slave' units will in all cases work well/at all if the 'master' units go
flakey. Not HDD anyway...

- When possible, keep CD & DVD on separate IDE PATA controllers from HDD entirely.

- I strongly recommend NOT using PATA secondary channels for IDE pseudo-RAID
(ATACONTROL, GMIRROR) - or even *any* critical-use device. Often better to add a
controller and waste those connectors.

- 'Proper' SATA controllers (not all are such) get around all this, as each
connection is an independent animal, just 'mapped' to traditional IDE dev ID's.

- Use of a SATA controller with a SATA-to-PATA terminating adaptor to a PATA
device can solve several of the above issues rather nicely if you have a need
for supporting the odd legacy PATA device in an otherwise SATA box.
'GigaByte' packages some that have a pull-off loop and fit available space well.

Beware the junk SATA-PATA that is also on the market as unpackaged PCB's.

PATA -> SATA whole different can of worms.

;-)

*snip* (cron issues)

JFWIW...

Bill Hacker

#13 Updated by corecode about 8 years ago

weird BIOS behaviour
