Bug #1463

Mountroot before drives are initialized

Added by elekktretterr about 5 years ago. Updated over 3 years ago.

Status: New
Start date:
Priority: Normal
Due date:
Assignee: -
% Done: 0%
Category: -
Target version: -

Description

Hi,

As per Sacha's comment about splitting my previous bug report into separate
ones, here is an issue that has been bugging me on this box (HP ProLiant
DL180 G5).

It seems that mountroot is run before da0 (and cd0) are initialized on
this box. mountroot then complains that da0s1 is missing.

Here is a verbose dmesg:

http://leaf.dragonflybsd.org/mailarchive/bugs/2009-08/msg00128.html.

Note that this problem seems to happen randomly when booting from the live CD,
but it happens every time when booting from disk.

Petr

dmesg.fine.out (26.1 KB) eric.j.christeson, 08/25/2009 08:17 PM

dmesg.hang (6.92 KB) eric.j.christeson, 08/25/2009 08:18 PM

dmesg.devfs.2 (23.7 KB) eric.j.christeson, 08/25/2009 09:49 PM

History

#1 Updated by eric.j.christeson about 5 years ago

I am also running into this problem on a Dell OptiPlex GX270 (P4 2.26 GHz). I
have a SCSI drive as my boot/root drive, plus an IDE drive and a CD drive.

I've been tracking HEAD and first noticed the problem after the devfs changes.
At the time I was going a few days between rebuilds, so I can't easily pinpoint
when it started.

I noticed a few interesting things:

1. Booting in verbose mode does NOT result in a mountroot failure.
2. At the mountroot prompt, ? doesn't list da0 (the root device) the first time,
but does list it on subsequent tries.
3. At the mountroot prompt, specifying the root device doesn't work the first
time. If I type ? first, or specify the root device twice, it works.
4. Booting with or without a CD in the CD-ROM drive gives the same results.

I've got a couple of hours, so I may try to look at this.

Included files:
dmesg.fine.out: verbose boot, no mountroot hang.
dmesg.hang: standard boot; note the failure after the first time specifying
rootdev, the strange cd0: message after ?, and root being found after
specifying rootdev again.

#2 Updated by eric.j.christeson about 5 years ago

Guess you can't submit more than one file at a time.

#3 Updated by dillon about 5 years ago

Do a verbose boot so we can see when CAM starts its probes.

It is probably the SYM driver rejecting the initial bus scan
from CAM, and then later (after it is too late) notifying CAM that
a new bus and/or devices are present asynchronously.

I'm not sure how easy it will be to fix, the SYM driver is 10,000
lines and it will take a few hours to figure out how it deals
with the SCSI bus scan.

-Matt

#4 Updated by eric.j.christeson about 5 years ago

Verbose boot doesn't fail.
Likewise, if I set vfs.devfs.debug=3 it doesn't fail.
Setting vfs.devfs.debug=2 does fail.
dmesg.devfs.2 is a dump with vfs.devfs.debug=2.

#5 Updated by eric.j.christeson about 5 years ago

Don't know if this info helps, but setting SCSI_DELAY to 10000 or 20000 had no
effect.

#6 Updated by elekktretterr about 5 years ago

> Do a verbose boot so we can see when CAM starts its probes.
>
> It is probably the SYM driver rejecting the initial bus scan
> from CAM, and then later (after it is too late) notifying CAM that
> a new bus and/or devices are present asynchronously.
>
> I'm not sure how easy it will be to fix, the SYM driver is 10,000
> lines and it will take a few hours to figure out how it deals
> with the SCSI bus scan.

Hi Matt,
I attached my verbose dmesg to the original email. Do you think our
problems are the same? This server uses the ciss driver.

Petr

#7 Updated by elekktretterr about 5 years ago

Also, is there maybe a way to teach mountroot to wait until all drives
are initialized?

Petr
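One way to approach Petr's question, sketched here as a user-space model rather than a real kernel patch: instead of looking up the root device exactly once, keep retrying for a bounded time. The function names, the callback types, and the simulated clock are all assumptions made for illustration; in a kernel, lookup() would presumably be some device lookup and pause_ms() something like tsleep().

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Hypothetical sketch, not DragonFly code: retry the root-device lookup
 * for up to timeout_ms instead of giving up after the first attempt.
 * lookup() and pause_ms() are injected so the loop can run in user space.
 */
typedef bool (*lookup_fn)(const char *devname, void *ctx);
typedef void (*pause_fn)(int ms, void *ctx);

/*
 * Returns how many milliseconds were spent waiting before the device
 * appeared, or -1 if it still was not there after timeout_ms.
 */
int
wait_for_root_device(const char *devname, int timeout_ms, int step_ms,
    lookup_fn lookup, pause_fn pause_ms, void *ctx)
{
	int waited = 0;

	assert(step_ms > 0);	/* a zero step would never advance time */
	for (;;) {
		if (lookup(devname, ctx))
			return waited;
		if (waited >= timeout_ms)
			return -1;
		pause_ms(step_ms, ctx);
		waited += step_ms;
	}
}

/* Simulated environment: the device "attaches" at appears_at_ms. */
struct sim_env {
	int now_ms;
	int appears_at_ms;
};

bool
sim_lookup(const char *devname, void *ctx)
{
	const struct sim_env *env = ctx;

	(void)devname;	/* the simulation only models one device */
	return env->now_ms >= env->appears_at_ms;
}

void
sim_pause(int ms, void *ctx)
{
	struct sim_env *env = ctx;

	env->now_ms += ms;
}
```

For example, with a simulated da0s1 that attaches 250 ms after the first attempt and a 100 ms step, wait_for_root_device("da0s1", 5000, 100, sim_lookup, sim_pause, &env) returns 300. The timeout is the important design choice: if the device never shows up, the user still ends up at the mountroot prompt instead of waiting forever.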

#8 Updated by elekktretterr about 5 years ago

>
> Eric J. Christeson <> added the comment:
>
> Don't know if this info helps, but setting SCSI_DELAY to 10000 or 20000
> had no effect.

Hi Eric,

I should point out that this problem was already happening before we put
the ciss RAID card in it. It was happening with a USB-attached CD drive
too. I saw it being initialized AFTER mountroot was run.

Petr

#9 Updated by eric.j.christeson about 5 years ago

I've been booting with various levels of CAMDEBUG (and it boots fine, since the
output adds enough delay for initialization to finish), and something occurred
to me. Do you also have (n)atapicam compiled in? I noticed that sometimes I
would see messages like this:

**WARNING** waiting for the following device to finish configuring:
xpt: func=0xc0144b4f arg=0

With CAMDEBUG I see why: xpt has to enumerate all the SCSI, ATA, and USB
buses/devices in the system. xpt has to initialize before any of the SCSI
devices (or at least it does, even if it doesn't _have_ to), so I wonder if
some of the delay comes from there. I'm going to take out natapicam and see
whether things improve.
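The warning quoted above appears to come from boot waiting on a configuration hook, the mechanism (config_intrhook_establish() and friends in the BSD kernels) that lets a subsystem such as CAM's xpt hold boot until its asynchronous probing completes. Below is a simplified user-space model of that gate; hook_establish(), hook_disestablish() and hooks_pending() are hypothetical stand-ins, not the kernel's API. The model also shows the limitation relevant to this bug: the gate only covers work finished before a hook is released, so a device that completes attaching afterwards can still lose the race with mountroot.

```c
#include <assert.h>
#include <stdio.h>

/*
 * User-space model (hypothetical names) of boot-time configuration
 * hooks: a subsystem registers a hook before starting an asynchronous
 * probe and removes it when the probe is done.  Boot proceeds toward
 * mountroot only when no hooks remain.
 */
#define MAX_HOOKS 8

struct config_hook {
	const char *desc;	/* e.g. "xpt" */
	int active;		/* 1 while the subsystem is still configuring */
};

static struct config_hook hooks[MAX_HOOKS];

/* Register a pending hook; returns its slot, or -1 if the table is full. */
int
hook_establish(const char *desc)
{
	for (int i = 0; i < MAX_HOOKS; i++) {
		if (!hooks[i].active) {
			hooks[i].desc = desc;
			hooks[i].active = 1;
			return i;
		}
	}
	return -1;
}

/* Called by the subsystem once its bus scan has completed. */
void
hook_disestablish(int slot)
{
	assert(slot >= 0 && slot < MAX_HOOKS && hooks[slot].active);
	hooks[slot].active = 0;
}

/* Number of hooks still outstanding; boot keeps waiting while nonzero. */
int
hooks_pending(void)
{
	int n = 0;

	for (int i = 0; i < MAX_HOOKS; i++)
		n += hooks[i].active;
	return n;
}

/* Print a report in the spirit of the boot-time warning quoted above. */
void
report_pending(FILE *out)
{
	fprintf(out, "**WARNING** waiting for the following device to finish configuring:\n");
	for (int i = 0; i < MAX_HOOKS; i++)
		if (hooks[i].active)
			fprintf(out, "  %s\n", hooks[i].desc);
}
```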

#10 Updated by elekktretterr about 5 years ago

I haven't tried it without natapicam. I'm going to have to install FBSD 7 on
it for now, as we are rushing to put this box in the datacentre. I realized
I can't put DragonFly on it because it's going to run pgpool-II. The other
boxes all run a 64-bit OS, but all DragonFly builds are currently still only
32-bit, and I was told that pgpool recovery (which uses PostgreSQL PITR) is
architecture-specific.

Petr

#11 Updated by alexh almost 5 years ago

Can you please try to boot again with commit
8c05caabb07caf24fd0dfab4f1497fb58a8c31e0 and type "set kern.disk_debug=1"
at the bootloader prompt?
It should give more insight into the source of the problem. If you feel
like it, you can also set it to 2, so it gives some info on partition probing
for each slice.

Cheers,
Alex Hornung

#12 Updated by hasso almost 5 years ago

I have the same issue and am also using ciss(4) (HP ProLiant DL360 G6). A
vanilla kernel fails 100% of the time here to mount root from the hard disk,
but with kern.disk_debug=1 it succeeds.

#13 Updated by dillon almost 5 years ago

:Hasso Tepper <> added the comment:
:
:I have the same issue and also using ciss(4) (HP Proliant DL360 G6). Vanilla
:kernel fails 100% here to mount root from harddisk, but with kern.disk_debug=1
:it succeeds.

If Alex doesn't come up with something in the next week or so, we will
add a straight-up delay before mountroot.

In fact, could you test whether a straight-out delay before mountroot works?
Here's a patch.

-Matt
Matthew Dillon
<>

diff --git a/sys/kern/vfs_conf.c b/sys/kern/vfs_conf.c
index a159afc..8bdea67 100644
--- a/sys/kern/vfs_conf.c
+++ b/sys/kern/vfs_conf.c
@@ -109,8 +109,9 @@ SYSINIT(mountroot, SI_SUB_MOUNT_ROOT, SI_ORDER_SECOND, vfs_mountroot, NULL);
 static void
 vfs_mountroot(void *junk)
 {
-	int i;
 	cdev_t save_rootdev = rootdev;
+	int i;
+	int dummy;
 
 	/*
 	 * Make sure all disk devices created so far have also been probed,
@@ -121,6 +122,8 @@ vfs_mountroot(void *junk)
 	 * coverage.
 	 */
 	sync_devs();
+	tsleep(&dummy, 0, "syncer", hz*2);
+
 
 	/*
 	 * The root filesystem information is compiled in, and we are

#14 Updated by dillon almost 5 years ago

(Oh, when testing the tsleep delay for the mountroot fix, do
it with kern.disk_debug=0 of course).

-Matt
Matthew Dillon
<>

#15 Updated by alexh almost 5 years ago

I don't know how to approach this. The solution lies within CAM and scsi_da, I
think. The disk is created in time (disk_create), but setdiskinfo does NOT
happen early enough to trigger probing before mountroot.
I'll continue investigating, but I'd welcome any ideas on how to solve this in
that direction.

Cheers,
Alex Hornung
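Alex's description can be restated as a small state model. This is an illustration under assumptions, not the actual disk or devfs code: the three states and the model_* function names are hypothetical, and the rule that the sync step only finishes disks that already have media info is an interpretation of the comment above, not something taken from the kernel source.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Hypothetical state model of the ordering described in comment #15:
 *   DISK_CREATED - disk_create() has run: da0 exists, but no media info
 *   DISK_INFO    - the setdiskinfo step has run: the disk can be probed
 *   DISK_PROBED  - slices/partitions (e.g. da0s1) have been created
 * Only a DISK_PROBED disk has slice devices that mountroot can find.
 */
enum disk_state { DISK_CREATED, DISK_INFO, DISK_PROBED };

struct model_disk {
	const char *name;
	enum disk_state state;
};

/* Stand-in for the setdiskinfo step: media size/geometry become known. */
void
model_setdiskinfo(struct model_disk *d)
{
	assert(d->state == DISK_CREATED);
	d->state = DISK_INFO;
}

/*
 * Stand-in for sync_devs(): finish probing every disk that already has
 * its info.  Disks still in DISK_CREATED are left alone, which is the
 * gap described in the comment.
 */
void
model_sync_devs(struct model_disk *disks, int n)
{
	for (int i = 0; i < n; i++)
		if (disks[i].state == DISK_INFO)
			disks[i].state = DISK_PROBED;
}

/* Would mountroot see this disk's slices right now? */
bool
model_slices_visible(const struct model_disk *d)
{
	return d->state == DISK_PROBED;
}
```

In the model, running the sync step while da0 is still in DISK_CREATED leaves its slices invisible, which matches the "da0s1 is missing" symptom; once the setdiskinfo step has happened, the same sync step makes them visible.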

#16 Updated by eric.j.christeson almost 5 years ago

This is no longer an issue for me as of DragonFly v2.3.2.780.g8f3b7d-DEVELOPMENT,
which I built on Friday. It boots fine, for now; it will probably crop up again.
I would be happy to test patches.

#17 Updated by alexh almost 5 years ago

This is temporarily fixed in a2b579620dd7947b8f90d1311c7e87a57ec9f1ea. Leaving it
open because the underlying problem hasn't been addressed.

Cheers,
Alex Hornung

#18 Updated by alexh over 3 years ago

Are there any suggestions on how to deal with this? The commit is just a hack
that delays the whole boot. Is there any way to know whether something is
still being probed, or is waiting to be probed?

Regards,
Alex
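One possible answer to Alex's question, sketched under assumptions: keep a global count of disks whose probe has been requested but not yet finished, and let mountroot wait until that count drops to zero, with a timeout as a safety net. The counter and every function name here are hypothetical; in a kernel the wait would presumably be a tsleep() on the counter with a wakeup() from probe_end(), which this user-space model replaces with a tick() callback. Note that this only covers disks that already exist: a device that CAM has not announced at all would not be counted.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Hypothetical sketch: number of disks whose probe has been requested
 * (e.g. at disk creation) but has not finished yet.  Zero means nothing
 * is being probed or waiting to be probed.
 */
static int probes_pending;

void
probe_begin(void)
{
	probes_pending++;
}

void
probe_end(void)
{
	assert(probes_pending > 0);	/* unbalanced begin/end is a bug */
	probes_pending--;
}

bool
probes_idle(void)
{
	return probes_pending == 0;
}

/*
 * Wait until all pending probes are done, giving up after max_ticks.
 * tick() stands in for sleeping one tick.  Returns true if everything
 * finished in time.
 */
bool
wait_for_probes(int max_ticks, void (*tick)(void *), void *ctx)
{
	for (int t = 0; t < max_ticks && !probes_idle(); t++)
		tick(ctx);
	return probes_idle();
}

/* Simulation helper: one pending probe completes every `every` ticks. */
struct probe_sim {
	int elapsed;
	int every;
};

void
sim_tick(void *ctx)
{
	struct probe_sim *s = ctx;

	s->elapsed++;
	if (s->elapsed % s->every == 0 && !probes_idle())
		probe_end();
}
```

With two probes pending and one finishing every 3 ticks, wait_for_probes(100, sim_tick, &s) returns true after 6 ticks. Unlike a fixed delay, the wait ends as soon as the last probe completes.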
