Bug #390: ATA identify retries exceeded - kernel panic - DragonFlyBSD - DragonFlyBSD bugtracker

Actions

Copy link

Bug #390

closed

ATA identify retries exceeded - kernel panic

Added by nonsolosoft over 19 years ago. Updated over 19 years ago.

Status:

Closed

Priority:

Normal

Assignee:

Category:

Target version:

Start date:

Due date:

% Done:

Estimated time:

Description

On DFBSD 1.7.0-DEVELOPMENT #0: Sat Oct 7 00:27:36 CEST 2006,
panic on boot.

ad9: READ command timeout tag=0 serv=0 - resetting
ad9: try fallback to PIO mode
at4: resetting devices .. ata4-master: ATA identify retries exceeded
done

Fatal trap 12: page fault while in kernel mode
fault virtual address = 0x41455378
fault code = supervisor write, page not present
instruction pointer = 0x8:0xc019d92b
stack pointer = 0x10:0xcfc94cdc
frame pointer = 0x10:0xcfc94d0c
code segment = base 0x0, limit 0xfffff, type 0x1b = DPL 0, pres 1, def32 1, gran 1
processor eflags = interrupt enabled, resume, IOPL=0
current process = Idle
current threads = pri 60 (CRIT)

kernel: type 12 trap, code=2
Stopped at ad_interrupt+x038a: movl %eax,0x148(%edx)
db> trace
ad_interrupt(cfc1e848,c1a816c0,cfc94d84,c029da35,c1d246e8) at ad_interrupt+0x38a
ata_intr(c1d246e8,0,0,0,13) at ata_intr+0x10a
ithread_handler(a,0,0,0,0) at ithread_handler+0x9d
lwkt_exit() at lwkt_exit

Bye, \fer

Actions

Copy link

Updated by TGEN over 19 years ago

Could you make a coredump of that, then examine whatever it tries to do
with kgdb? (up to the faulting instruction, then list, the print any
variables you see to find dead pointers?)

Cheers,
--
Thomas E. Spanjaard
tgen@netphreax.net

Actions

Copy link

Updated by dillon over 19 years ago

:...
:> kernel: type 12 trap, code=2
:> Stopped at ad_interrupt+x038a: movl %eax,0x148(%edx)
:> db> trace
:> ad_interrupt(cfc1e848,c1a816c0,cfc94d84,c029da35,c1d246e8) at
:> ad_interrupt+0x38a
:> ata_intr(c1d246e8,0,0,0,13) at ata_intr+0x10a
:> ithread_handler(a,0,0,0,0) at ithread_handler+0x9d
:> lwkt_exit() at lwkt_exit

This panic has plagued us for a long time.  There's something broken
    in the retry code.

-Matt
                    Matthew Dillon 
                    &lt;dillon@backplane.com&gt;

Actions

Copy link

Updated by TGEN over 19 years ago

Fwiw, I haven't seen such yet with the new ATA code, except in the case
when I forgot to zero the structs I pulled out of my objcache. It might
be nice to have objcache_get() look at oc_flags and zero if & M_ZERO.

In any case, I got that panic because I didn't have a zeroed struct, so
a certain variable was set where it shouldn't have been, resulting in
other code kfree()-ing another struct, thereby making a pointer in that
struct point to 0xdeadc0de (the slab alloc does this). If that's the
case, we need to look for inadvertent frees.

Cheers,
--
Thomas E. Spanjaard
tgen@netphreax.net

Actions

Copy link

Updated by dillon over 19 years ago

:Fwiw, I haven't seen such yet with the new ATA code, except in the case
:when I forgot to zero the structs I pulled out of my objcache. It might
:be nice to have objcache_get() look at oc_flags and zero if & M_ZERO.
:
:In any case, I got that panic because I didn't have a zeroed struct, so
:a certain variable was set where it shouldn't have been, resulting in
:other code kfree()-ing another struct, thereby making a pointer in that
:struct point to 0xdeadc0de (the slab alloc does this). If that's the
:case, we need to look for inadvertent frees.
:
:Cheers,
:--
: Thomas E. Spanjaard
: tgen@netphreax.net

I agree that we should make M_ZERO semantics work more as expected...
    or at least panic() if they aren't allowed.

Here's the issue... The objcache has two layers... it has a freelist
    caching layer and it has an actual allocation and disposal layer.

The idea behind the freelist caching layer is to be able to hold the
    structures in a form that reduces the amount of initialization required.
    This is primarily used by the mbuf system, where certain associations
    (such as mbufs with mbuf cluster buffers) are retained in the freelist
    caching layer and only disentangled in the allocation and dispoal layer.

So we can't just unconditionally bzero() in objcache_get() if M_ZERO is
    set because it would destroy these associations.  I think what we need
    to do is give the objcache some indication that M_ZERO is allowed, then
    either bzero() or panic.

Since bzero is a procedure call, perhaps the solution here is to add
    another function pointer for 'zeroing' a structure when M_ZERO is
    specified.  This function would be specified in objcache_create() and
    default to a bzero if NULL.  This way M_ZERO will work as expected with
    simple objcache setups, and panic the system if someone tries to use
    it with more complex objcache setups (like mbufs).

-Matt
                    Matthew Dillon 
                    &lt;dillon@backplane.com&gt;

Actions

Copy link

Updated by TGEN over 19 years ago

I first attempted to do the zeroing in a ctor I passed to
objcache_create(), but that only works the first time my data is
objcache_get() from the objcache. If it's objcache_put() again, and then
objcache_get(), it would contain the old data. So I also set up a
zeroing dtor, but then I thought it to be too much effort for the common
case (usually, there aren't many objects out of the cache at the same
time in my case, so a limited number is reused all the time), so I
wrapped objcache_get() in a function that does the zeroing. I expect
that to result in less calls to bzero() overall.

Regarding the extra function pointer, that sounds good, but I'd like it
more generic: that the extra pointer is for generic objcache_get()-time
related actions, not just zeroing based on M_ZERO set in oc_flags. The
default could well do something like that, but it shouldn't be
documented as being a func pointer for zeroing funcs only.

On the other hand, one could argue that if one specifies M_ZERO in
oc_flags, one expects a completely zeroed object, with no initialization
whatsoever (panic if M_ZERO is set and a ctor is specified?).

Cheers,
--
Thomas E. Spanjaard
tgen@netphreax.net

Actions

Copy link

Updated by corecode over 19 years ago

yes, i agree.

I don't agree. let's not make it more complex than it should be. how about: if there is no ctor you can specify M_ZERO. if you specify a ctor, M_ZERO is a reason to panic.

cheers
simon

Actions

Copy link

Updated by dillon over 19 years ago

:I don't agree. let's not make it more complex than it should be. how ab=
:out: if there is no ctor you can specify M_ZERO. if you specify a ctor,=
: M_ZERO is a reason to panic.
:
:cheers
: simon

I like it.  I'll do it right now.

-Matt
                    Matthew Dillon 
                    &lt;dillon@backplane.com&gt;

Actions

Copy link

Updated by dillon over 19 years ago

Ok, how about this. If the simple form of objcache_create is used
we install simple_objsize and allow M_ZERO. Otherwise we panic if
M_ZERO is specified.

I let the second layer case fall through... if the allocator happens
    to be malloc, then M_ZERO will work.  If it doesn't, then the problem
    will be caught later on when the first layer has cached objects and
    the caller tries to get a new object.  Not perfect but it should catch
    any problems.

-Matt

Index: kern_objcache.c ===================================================================
RCS file: /cvs/src/sys/kern/kern_objcache.c,v
retrieving revision 1.12
diff u -r1.12 kern_objcache.c
-- kern_objcache.c 20 Nov 2006 22:19:54 -0000 1.12
++ kern_objcache.c 29 Nov 2006 23:33:00 -0000
@ -129,6 +129,7 @
objcache_alloc_fn *alloc;
objcache_free_fn *free;
void *allocator_args;
size_t simple_objsize;

SLIST_ENTRY(objcache)    oc_next;

@ -228,6 +229,12 @
null_ctor, null_dtor, NULL,
objcache_malloc_alloc, objcache_malloc_free,
margs);

/*
+ * This indicates that we are a simple objcache and allows
+ * objcache_get() calls with M_ZERO.
+ */
+ oc->simple_objsize = objsize;
return (oc);
}

@ -264,7 +271,14 @
loadedmag = cpucache->loaded_magazine;
if (MAGAZINE_NOTEMPTY(loadedmag)) {
obj = loadedmag->objects[--loadedmag->rounds];
done:
crit_exit();
if (ocflags & M_ZERO) {
+ if (oc->simple_objsize)
+ bzero(obj, oc->simple_objsize);
+ else
+ panic("objcache_get(): M_ZERO illegal here");
+ }
return (obj);
}

@ -275,8 +289,7 @
swap(cpucache->loaded_magazine, cpucache->previous_magazine);
loadedmag = cpucache->loaded_magazine;
obj = loadedmag->objects[--loadedmag->rounds];
- crit_exit();
- return (obj);
+ goto done;
}

/*
@ -284,6 +297,8 @
 * move one of the empty ones to the depot.
 *
 * Obtain the depot spinlock.
+     *
+     * NOTE: Beyond this point, M_ZERO is handled via oc->alloc()
      */
     depot = &oc->depot[myclusterid];
     spin_lock_wr(&depot->spin);

Actions

Copy link

Updated by TGEN over 19 years ago

Isn't just checking whether the ctor pointer is NULL enough instead of
oc->simple_objsize?

Cheers,
--
Thomas E. Spanjaard
tgen@netphreax.net

Actions

Copy link

#10

Updated by dillon over 19 years ago

:Isn't just checking whether the ctor pointer is NULL enough instead of
:oc->simple_objsize?
:
:Cheers,
:--
: Thomas E. Spanjaard
: tgen@netphreax.net

It's never NULL.  The functions default out to dummy routines.  I
    suppose I could check if (oc->ctor == null_ctor) but I think its
    more generic to just have a separate supporting field with the 
    object size, since you need the object size anyway to do the bzero().

-Matt

Actions

Copy link

#11

Updated by TGEN over 19 years ago

After some discussion with Jeffrey Hsu and Simon Schubert, it was
concluded that this incurs a penalty on the hot path of objcache_get(),
which should be avoided. Best to have the consumers handle any zero'ing,
in order to not nerf the consumers who rely on the speed. In that case,
it pollutes the instruction cache, and could worsen performance
significantly in the case of branch misprediction. Especially on the
long pipelines CPUs have these days (though with the Core 2 family,
Intel has cut back on the 30+ stages of the Prescott/Presler...).

Cheers,
--
Thomas E. Spanjaard
tgen@netphreax.net

Actions

Copy link

#12

Updated by justin over 19 years ago

As a followup, does the changes discussed in this thread solve the original
poster's problem?

Actions

Copy link

#13

Updated by TGEN over 19 years ago

No changes relevant to the issue have been made, in fact, the bug hasn't
been pinpointed yet. It would help if we could get a coredump on leaf?

Cheers,
--
Thomas E. Spanjaard
tgen@netphreax.net

Actions

Copy link

#14

Updated by nonsolosoft over 19 years ago

Hi Thomas,

I'll try to send you something tomorrow.

Bye, \fer

Actions

Copy link

#15

Updated by nonsolosoft over 19 years ago

Hi Thomas,

I've rebuild world and kernel at the latest version to make tests but
I'm not able to installkernel:

atl# bmake installkernel
cd /usr/obj/usr/src/sys/GENERIC; make KERNEL=kernel install
chflags noschg /kernel
objcopy --strip-debug /kernel /kernel.old
install m 555 -o root -g wheel -fschg kernel.debug /kernel
set - /modules/*; if [ -f "$1" ]; then mkdir -p /modules.old; for
file; do objcopy --strip-debug $file /modules.old/${file##*/}; done; fi

Error code 1

Stop in /usr/obj/usr/src/sys/GENERIC.

Error code 1

Stop in /usr/src.

Error code 1

Stop in /usr/src.

I've buildworld and buildkernel twice, but it stops at the same point.

Bye, \fer

Actions

Copy link

#16

Updated by swildner over 19 years ago

Hmm, I'm not sure if bmake is supposed to work properly with our {build,install}{world,kernel} targets.

Could you try again with make?

Sascha

Actions

Copy link

#17

Updated by bastyaelvtars over 19 years ago

It's make, not bmake, use it like:

make installkernel KERNCONF=MYKERNELCONFIG

Actions

Copy link

#18

Updated by joerg over 19 years ago

It isn't.

Joerg

Actions

Copy link

#19

Updated by TGEN over 19 years ago

I'm wondering whether he used bmake or make for build{world,kernel}... :).

As an aside, does anyone know what syntax differences between our make
and NetBSD's make are holding compatibility back? Haven't seen Max for a
while either...

Cheers,
--
Thomas E. Spanjaard
tgen@netphreax.net

Actions

Copy link

#20

Updated by joerg over 19 years ago

Well, the biggest difference is that operators like :S are lazy in bmake
and non-lazy in make. IIRC there was at least one which conflicted as
well.

Joerg

Actions

Copy link

#21

Updated by nonsolosoft over 19 years ago

OK. I've found!

It was not a problem with make but there was a power failure during
previous make installkernel.

Bye, \fer

Actions

Copy link

Also available in: Atom PDF

Project

General

Profile

DragonFlyBSD

Bug #390

ATA identify retries exceeded - kernel panic

Updated by TGEN over 19 years ago

Updated by dillon over 19 years ago

Updated by TGEN over 19 years ago

Updated by dillon over 19 years ago

Updated by TGEN over 19 years ago

Updated by corecode over 19 years ago

Updated by dillon over 19 years ago

Updated by dillon over 19 years ago

Updated by TGEN over 19 years ago

Updated by dillon over 19 years ago

Updated by TGEN over 19 years ago

Updated by justin over 19 years ago

Updated by TGEN over 19 years ago

Updated by nonsolosoft over 19 years ago

Updated by nonsolosoft over 19 years ago

Updated by swildner over 19 years ago

Updated by bastyaelvtars over 19 years ago

Updated by joerg over 19 years ago

Updated by TGEN over 19 years ago

Updated by joerg over 19 years ago

Updated by nonsolosoft over 19 years ago