https://bugs.dragonflybsd.org/https://bugs.dragonflybsd.org/favicon.ico?16293952082006-11-27T22:00:01ZDragonFlyBSD bugtrackerDragonFlyBSD - Bug #390: ATA identify retries exceeded - kernel panichttps://bugs.dragonflybsd.org/issues/390?journal_id=16502006-11-27T22:00:01ZTGEN
<ul></ul><p>Could you make a coredump of that, then examine whatever it tries to do <br />with kgdb? (up to the faulting instruction, then list, the print any <br />variables you see to find dead pointers?)</p>
<p>Cheers,<br />-- <br /> Thomas E. Spanjaard<br /> <a class="email" href="mailto:tgen@netphreax.net">tgen@netphreax.net</a></p> DragonFlyBSD - Bug #390: ATA identify retries exceeded - kernel panichttps://bugs.dragonflybsd.org/issues/390?journal_id=16582006-11-30T01:34:00Zdillon
<ul></ul><p>:...<br />:> kernel: type 12 trap, code=2<br />:> Stopped at ad_interrupt+x038a: movl %eax,0x148(%edx)<br />:> db> trace<br />:> ad_interrupt(cfc1e848,c1a816c0,cfc94d84,c029da35,c1d246e8) at <br />:> ad_interrupt+0x38a<br />:> ata_intr(c1d246e8,0,0,0,13) at ata_intr+0x10a<br />:> ithread_handler(a,0,0,0,0) at ithread_handler+0x9d<br />:> lwkt_exit() at lwkt_exit</p>
<pre><code>This panic has plagued us for a long time. There's something broken<br /> in the retry code.</code></pre>
<pre><code>-Matt<br /> Matthew Dillon <br /> &lt;<a class="email" href="mailto:dillon@backplane.com">dillon@backplane.com</a>&gt;</code></pre> DragonFlyBSD - Bug #390: ATA identify retries exceeded - kernel panichttps://bugs.dragonflybsd.org/issues/390?journal_id=16592006-11-30T02:58:00ZTGEN
<ul></ul><p>Fwiw, I haven't seen such yet with the new ATA code, except in the case <br />when I forgot to zero the structs I pulled out of my objcache. It might <br />be nice to have objcache_get() look at oc_flags and zero if & M_ZERO.</p>
<p>In any case, I got that panic because I didn't have a zeroed struct, so <br />a certain variable was set where it shouldn't have been, resulting in <br />other code kfree()-ing another struct, thereby making a pointer in that <br />struct point to 0xdeadc0de (the slab alloc does this). If that's the <br />case, we need to look for inadvertent frees.</p>
<p>Cheers,<br />-- <br /> Thomas E. Spanjaard<br /> <a class="email" href="mailto:tgen@netphreax.net">tgen@netphreax.net</a></p> DragonFlyBSD - Bug #390: ATA identify retries exceeded - kernel panichttps://bugs.dragonflybsd.org/issues/390?journal_id=16612006-11-30T05:56:01Zdillon
<ul></ul><p>:Fwiw, I haven't seen such yet with the new ATA code, except in the case <br />:when I forgot to zero the structs I pulled out of my objcache. It might <br />:be nice to have objcache_get() look at oc_flags and zero if & M_ZERO.<br />:<br />:In any case, I got that panic because I didn't have a zeroed struct, so <br />:a certain variable was set where it shouldn't have been, resulting in <br />:other code kfree()-ing another struct, thereby making a pointer in that <br />:struct point to 0xdeadc0de (the slab alloc does this). If that's the <br />:case, we need to look for inadvertent frees.<br />:<br />:Cheers,<br />:-- <br />: Thomas E. Spanjaard<br />: <a class="email" href="mailto:tgen@netphreax.net">tgen@netphreax.net</a></p>
<pre><code>I agree that we should make M_ZERO semantics work more as expected...<br /> or at least panic() if they aren't allowed.</code></pre>
<pre><code>Here's the issue... The objcache has two layers... it has a freelist<br /> caching layer and it has an actual allocation and disposal layer.</code></pre>
<pre><code>The idea behind the freelist caching layer is to be able to hold the<br /> structures in a form that reduces the amount of initialization required.<br /> This is primarily used by the mbuf system, where certain associations<br /> (such as mbufs with mbuf cluster buffers) are retained in the freelist<br /> caching layer and only disentangled in the allocation and dispoal layer.</code></pre>
<pre><code>So we can't just unconditionally bzero() in objcache_get() if M_ZERO is<br /> set because it would destroy these associations. I think what we need<br /> to do is give the objcache some indication that M_ZERO is allowed, then<br /> either bzero() or panic.</code></pre>
<pre><code>Since bzero is a procedure call, perhaps the solution here is to add<br /> another function pointer for 'zeroing' a structure when M_ZERO is<br /> specified. This function would be specified in objcache_create() and<br /> default to a bzero if NULL. This way M_ZERO will work as expected with<br /> simple objcache setups, and panic the system if someone tries to use<br /> it with more complex objcache setups (like mbufs).</code></pre>
<pre><code>-Matt<br /> Matthew Dillon <br /> &lt;<a class="email" href="mailto:dillon@backplane.com">dillon@backplane.com</a>&gt;</code></pre> DragonFlyBSD - Bug #390: ATA identify retries exceeded - kernel panichttps://bugs.dragonflybsd.org/issues/390?journal_id=16622006-11-30T06:11:00ZTGEN
<ul></ul><p>I first attempted to do the zeroing in a ctor I passed to <br />objcache_create(), but that only works the first time my data is <br />objcache_get() from the objcache. If it's objcache_put() again, and then <br />objcache_get(), it would contain the old data. So I also set up a <br />zeroing dtor, but then I thought it to be too much effort for the common <br />case (usually, there aren't many objects out of the cache at the same <br />time in my case, so a limited number is reused all the time), so I <br />wrapped objcache_get() in a function that does the zeroing. I expect <br />that to result in less calls to bzero() overall.</p>
<p>Regarding the extra function pointer, that sounds good, but I'd like it <br />more generic: that the extra pointer is for generic objcache_get()-time <br />related actions, not just zeroing based on M_ZERO set in oc_flags. The <br />default could well do something like that, but it shouldn't be <br />documented as being a func pointer for zeroing funcs only.</p>
<p>On the other hand, one could argue that if one specifies M_ZERO in <br />oc_flags, one expects a completely zeroed object, with no initialization <br />whatsoever (panic if M_ZERO is set and a ctor is specified?).</p>
<p>Cheers,<br />-- <br /> Thomas E. Spanjaard<br /> <a class="email" href="mailto:tgen@netphreax.net">tgen@netphreax.net</a></p> DragonFlyBSD - Bug #390: ATA identify retries exceeded - kernel panichttps://bugs.dragonflybsd.org/issues/390?journal_id=16632006-11-30T06:32:01Zcorecode
<ul></ul><p>yes, i agree.</p>
<p>I don't agree. let's not make it more complex than it should be. how about: if there is no ctor you can specify M_ZERO. if you specify a ctor, M_ZERO is a reason to panic.</p>
<p>cheers<br /> simon</p> DragonFlyBSD - Bug #390: ATA identify retries exceeded - kernel panichttps://bugs.dragonflybsd.org/issues/390?journal_id=16652006-11-30T07:34:01Zdillon
<ul></ul><p>:I don't agree. let's not make it more complex than it should be. how ab=<br />:out: if there is no ctor you can specify M_ZERO. if you specify a ctor,=<br />: M_ZERO is a reason to panic.<br />:<br />:cheers<br />: simon</p>
<pre><code>I like it. I'll do it right now.</code></pre>
<pre><code>-Matt<br /> Matthew Dillon <br /> &lt;<a class="email" href="mailto:dillon@backplane.com">dillon@backplane.com</a>&gt;</code></pre> DragonFlyBSD - Bug #390: ATA identify retries exceeded - kernel panichttps://bugs.dragonflybsd.org/issues/390?journal_id=16662006-11-30T08:06:01Zdillon
<ul></ul><p>Ok, how about this. If the simple form of objcache_create is used<br /> we install simple_objsize and allow M_ZERO. Otherwise we panic if<br /> M_ZERO is specified.</p>
<pre><code>I let the second layer case fall through... if the allocator happens<br /> to be malloc, then M_ZERO will work. If it doesn't, then the problem<br /> will be caught later on when the first layer has cached objects and<br /> the caller tries to get a new object. Not perfect but it should catch<br /> any problems.</code></pre>
<pre><code>-Matt</code></pre>
<p>Index: kern_objcache.c
===================================================================<br />RCS file: /cvs/src/sys/kern/kern_objcache.c,v<br />retrieving revision 1.12<br />diff <del>u -r1.12 kern_objcache.c<br />--</del> kern_objcache.c 20 Nov 2006 22:19:54 -0000 1.12<br />+<ins>+ kern_objcache.c 29 Nov 2006 23:33:00 -0000<br /><code>@ -129,6 +129,7 </code>@<br /> objcache_alloc_fn *alloc;<br /> objcache_free_fn *free;<br /> void *allocator_args;<br /></ins> size_t simple_objsize;</p>
<pre><code>SLIST_ENTRY(objcache) oc_next;</code></pre>
<p><code>@ -228,6 +229,12 </code>@<br /> null_ctor, null_dtor, NULL,<br /> objcache_malloc_alloc, objcache_malloc_free,<br /> margs);<br /><ins><br /></ins> /*<br />+ * This indicates that we are a simple objcache and allows<br />+ * objcache_get() calls with M_ZERO.<br />+ */<br />+ oc->simple_objsize = objsize;<br /> return (oc);<br /> }</p>
<p><code>@ -264,7 +271,14 </code>@<br /> loadedmag = cpucache->loaded_magazine;<br /> if (MAGAZINE_NOTEMPTY(loadedmag)) {<br /> obj = loadedmag->objects[--loadedmag->rounds];<br /><ins>done:<br /> crit_exit();<br /></ins> if (ocflags & M_ZERO) {<br />+ if (oc->simple_objsize)<br />+ bzero(obj, oc->simple_objsize);<br />+ else<br />+ panic("objcache_get(): M_ZERO illegal here");<br />+ }<br /> return (obj);<br /> }</p>
<p><code>@ -275,8 +289,7 </code>@<br /> swap(cpucache->loaded_magazine, cpucache->previous_magazine);<br /> loadedmag = cpucache->loaded_magazine;<br /> obj = loadedmag->objects[--loadedmag->rounds];<br />- crit_exit();<br />- return (obj);<br />+ goto done;<br /> }</p>
<pre><code>/*<br /><code>@ -284,6 +297,8 </code>@
* move one of the empty ones to the depot.
*
* Obtain the depot spinlock.<br />+ *<br />+ * NOTE: Beyond this point, M_ZERO is handled via oc->alloc()<br /> */<br /> depot = &oc->depot[myclusterid];<br /> spin_lock_wr(&depot->spin);</code></pre> DragonFlyBSD - Bug #390: ATA identify retries exceeded - kernel panichttps://bugs.dragonflybsd.org/issues/390?journal_id=16672006-11-30T10:21:02ZTGEN
<ul></ul><p>Isn't just checking whether the ctor pointer is NULL enough instead of <br />oc->simple_objsize?</p>
<p>Cheers,<br />-- <br /> Thomas E. Spanjaard<br /> <a class="email" href="mailto:tgen@netphreax.net">tgen@netphreax.net</a></p> DragonFlyBSD - Bug #390: ATA identify retries exceeded - kernel panichttps://bugs.dragonflybsd.org/issues/390?journal_id=16682006-11-30T10:25:00Zdillon
<ul></ul><p>:Isn't just checking whether the ctor pointer is NULL enough instead of <br />:oc->simple_objsize?<br />:<br />:Cheers,<br />:-- <br />: Thomas E. Spanjaard<br />: <a class="email" href="mailto:tgen@netphreax.net">tgen@netphreax.net</a></p>
<pre><code>It's never NULL. The functions default out to dummy routines. I<br /> suppose I could check if (oc->ctor == null_ctor) but I think its<br /> more generic to just have a separate supporting field with the <br /> object size, since you need the object size anyway to do the bzero().</code></pre>
<pre><code>-Matt</code></pre> DragonFlyBSD - Bug #390: ATA identify retries exceeded - kernel panichttps://bugs.dragonflybsd.org/issues/390?journal_id=17102006-12-07T20:59:01ZTGEN
<ul></ul><p>After some discussion with Jeffrey Hsu and Simon Schubert, it was <br />concluded that this incurs a penalty on the hot path of objcache_get(), <br />which should be avoided. Best to have the consumers handle any zero'ing, <br />in order to not nerf the consumers who rely on the speed. In that case, <br />it pollutes the instruction cache, and could worsen performance <br />significantly in the case of branch misprediction. Especially on the <br />long pipelines CPUs have these days (though with the Core 2 family, <br />Intel has cut back on the 30+ stages of the Prescott/Presler...).</p>
<p>Cheers,<br />-- <br /> Thomas E. Spanjaard<br /> <a class="email" href="mailto:tgen@netphreax.net">tgen@netphreax.net</a></p> DragonFlyBSD - Bug #390: ATA identify retries exceeded - kernel panichttps://bugs.dragonflybsd.org/issues/390?journal_id=17332006-12-12T11:53:27Zjustin
<ul></ul><p>As a followup, does the changes discussed in this thread solve the original<br />poster's problem?</p> DragonFlyBSD - Bug #390: ATA identify retries exceeded - kernel panichttps://bugs.dragonflybsd.org/issues/390?journal_id=17662006-12-13T23:20:00ZTGEN
<ul></ul><p>No changes relevant to the issue have been made, in fact, the bug hasn't <br />been pinpointed yet. It would help if we could get a coredump on leaf?</p>
<p>Cheers,<br />-- <br /> Thomas E. Spanjaard<br /> <a class="email" href="mailto:tgen@netphreax.net">tgen@netphreax.net</a></p> DragonFlyBSD - Bug #390: ATA identify retries exceeded - kernel panichttps://bugs.dragonflybsd.org/issues/390?journal_id=17702006-12-14T02:23:01Znonsolosoft
<ul></ul><p>Hi Thomas,</p>
<p>I'll try to send you something tomorrow.</p>
<p>Bye, \fer</p> DragonFlyBSD - Bug #390: ATA identify retries exceeded - kernel panichttps://bugs.dragonflybsd.org/issues/390?journal_id=17862006-12-15T16:41:01Znonsolosoft
<ul></ul><p>Hi Thomas,</p>
<p>I've rebuild world and kernel at the latest version to make tests but <br />I'm not able to installkernel:</p>
atl# bmake installkernel<br />cd /usr/obj/usr/src/sys/GENERIC; make KERNEL=kernel install<br />chflags noschg /kernel<br />objcopy --strip-debug /kernel /kernel.old<br />install <del>m 555 -o root -g wheel -fschg kernel.debug /kernel<br />set -</del> /modules/*; if [ -f "$1" ]; then mkdir -p /modules.old; for <br />file; do objcopy --strip-debug $file /modules.old/${file##*/}; done; fi
<ul>
<li>Error code 1</li>
</ul>
Stop in /usr/obj/usr/src/sys/GENERIC.
<ul>
<li>Error code 1</li>
</ul>
Stop in /usr/src.
<ul>
<li>Error code 1</li>
</ul>
<p>Stop in /usr/src.</p>
<p>I've buildworld and buildkernel twice, but it stops at the same point.</p>
<p>Bye, \fer</p> DragonFlyBSD - Bug #390: ATA identify retries exceeded - kernel panichttps://bugs.dragonflybsd.org/issues/390?journal_id=17872006-12-15T17:03:00Zswildner
<ul></ul><p>Hmm, I'm not sure if bmake is supposed to work properly with our
{build,install}{world,kernel} targets.</p>
<p>Could you try again with make?</p>
<p>Sascha</p> DragonFlyBSD - Bug #390: ATA identify retries exceeded - kernel panichttps://bugs.dragonflybsd.org/issues/390?journal_id=17882006-12-15T18:41:01Zbastyaelvtars
<ul></ul><p>It's make, not bmake, use it like:</p>
<ol>
<li>make installkernel KERNCONF=MYKERNELCONFIG</li>
</ol> DragonFlyBSD - Bug #390: ATA identify retries exceeded - kernel panichttps://bugs.dragonflybsd.org/issues/390?journal_id=17892006-12-15T22:01:01Zjoerg
<ul></ul><p>It isn't.</p>
<p>Joerg</p> DragonFlyBSD - Bug #390: ATA identify retries exceeded - kernel panichttps://bugs.dragonflybsd.org/issues/390?journal_id=17902006-12-16T00:33:01ZTGEN
<ul></ul><p>I'm wondering whether he used bmake or make for build{world,kernel}... :).</p>
<p>As an aside, does anyone know what syntax differences between our make <br />and NetBSD's make are holding compatibility back? Haven't seen Max for a <br />while either...</p>
<p>Cheers,<br />-- <br /> Thomas E. Spanjaard<br /> <a class="email" href="mailto:tgen@netphreax.net">tgen@netphreax.net</a></p> DragonFlyBSD - Bug #390: ATA identify retries exceeded - kernel panichttps://bugs.dragonflybsd.org/issues/390?journal_id=17912006-12-16T01:16:01Zjoerg
<ul></ul><p>Well, the biggest difference is that operators like :S are lazy in bmake<br />and non-lazy in make. IIRC there was at least one which conflicted as<br />well.</p>
<p>Joerg</p> DragonFlyBSD - Bug #390: ATA identify retries exceeded - kernel panichttps://bugs.dragonflybsd.org/issues/390?journal_id=17942006-12-16T15:23:01Znonsolosoft
<ul></ul><p>OK. I've found!</p>
<p>It was not a problem with make but there was a power failure during <br />previous make installkernel.</p>
<p>Bye, \fer</p>