Bug #410
closed
PREVIEW (SMP) crash on initializing the bridge
Description
Hello, I compiled a kernel last night, and when I rebooted into it on an
SMP machine, I got a panic when it initialized the bridge0 virtual
interface. Photos of it and the output of trace are available at [1]
(core dumps are in the kernel directory). Reverting to the GENERIC
kernel seems to fix it.
[1] http://rnrdoctor.sytes.net/~szg/dfcrash/
P. S.: my kernel config has the following extra options compared to GENERIC:
- Bridging support
pseudo-device bridge
pseudo-device vlan
device pf
device pflog
device pfsync
- ALTQ
options ALTQ #alternate queueing
options ALTQ_CBQ #class based queueing
options ALTQ_RED #random early detection
options ALTQ_RIO #triple red for diffserv (needs RED)
options ALTQ_HFSC #hierarchical fair service curve
options ALTQ_PRIQ #priority queue
options ALTQ_NOPCC #don't use processor cycle counter
options ALTQ_DEBUG #for debugging
- Symmetric Multiprocessing support
options SMP # Symmetric MultiProcessor Kernel
options APIC_IO # Symmetric (APIC) I/O
Any ideas?
Thanks in advance.
Updated by corecode about 18 years ago
could you please compress kernel + vmcore?
Updated by bastyaelvtars about 18 years ago
http://rnrdoctor.sytes.net/~szg/dfcrash/kernel.tar.gz
damnit, should I use the bugtracker's web interface instead of NNTP?
Updated by corecode about 18 years ago
thanks, i will try to have a look at it tomorrow
No, you're free to choose. btw, you already have a user at the bug tracker (everybody posting to bugs@ or submit@ does). just click on "lost your login?" and enter your email address.
cheers
simon
Updated by bastyaelvtars about 18 years ago
D'uh, I get it on another machine, with or without ACPI. I am building the kernel without APIC_IO right now, let's see if it crashes. Anyway, I have got fresh dumps (w/ photos) from the new machine (with ACPI enabled it spat out weird error messages and wasn't even able to create a coredump). I'll upload them later.
Updated by dillon about 18 years ago
:Hmm. In net/bridge/if_bridge.c, kmalloc is called with just M_RNOWAIT,
:then it checks whether bif is NULL. This seems a bit bogus, because then
:it should rather use M_RNOWAIT|M_NULLOK. Even then, I don't see any
:problem in the code blocking here, why doesn't it use M_WAITOK instead?
:
:Also, in kmem_slab_alloc, if I read things right, none of
:VM_ALLOC_NORMAL, VM_ALLOC_INTERRUPT and VM_ALLOC_SYSTEM are set, thereby
:triggering the KKASSERT in vm_page_alloc? Could you try just adjusting
:the kmalloc flags in if_bridge.c:bridge_ioctl_add() from
:M_RNOWAIT|M_ZERO to M_NOWAIT|M_ZERO?
:
:Cheers,
:--
: Thomas E. Spanjaard
: tgen@netphreax.net
Good catch, Thomas. Those allocation calls are seriously broken.
No code is supposed to use M_RNOWAIT ... it is supposed to be an internal
flag used only by the other #define M_* macros. Without M_NULLOK the
kmalloc() will panic. Without any M_USE_* flags any RNOWAIT will
cause the underlying VM system to be called without the correct VM
allocation flags, and crash precisely due to the reason you cited.
I also agree that M_*NOWAIT should not be used at all there. This is
probably a left-over from FreeBSD, which used M_NOWAIT freely in
initialization code with the expectation that the malloc would only
ever fail due to a lack of resources. In DragonFly, M_NOWAIT really
does mean no-waiting... any blocking condition will cause it to fail.
All of those calls should probably be M_WAITOK. Please go ahead and
make that commit now.
-Matt
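The flag rules described above can be sketched as a small userland model. This is only an illustration of the reasoning in this thread, not kernel code: the `SK_*` names, their bit values, the composition of `SK_NOWAIT`, and the `sk_classify()` helper are all hypothetical, and `SK_USE_RESERVE` simply stands in for the M_USE_* family.

```c
#include <assert.h>

/* Hypothetical flag bits for illustration; not copied from DragonFly's headers. */
#define SK_RNOWAIT     0x0001  /* raw "never block" bit, meant for internal use */
#define SK_WAITOK      0x0002  /* caller may sleep until memory is available */
#define SK_ZERO        0x0100  /* zero the returned memory */
#define SK_USE_RESERVE 0x0200  /* stands in for the M_USE_* family */
#define SK_NULLOK      0x0400  /* caller is prepared for a NULL return */

/* Assumed composition of the public no-wait macro. */
#define SK_NOWAIT (SK_RNOWAIT | SK_NULLOK | SK_USE_RESERVE)

enum sk_outcome {
    SK_OK,                 /* allocation succeeds */
    SK_NULL,               /* allocation fails cleanly, caller sees NULL */
    SK_PANIC_NULL,         /* allocation fails but NULL is not permitted */
    SK_ASSERT_NO_VM_CLASS  /* VM layer reached without an allocation class */
};

/*
 * Classify what a request would do in this simplified model.
 * cache_empty:    the allocator has to go to the VM system for pages.
 * vm_would_block: the VM system cannot satisfy the request immediately.
 */
static enum sk_outcome
sk_classify(int flags, int cache_empty, int vm_would_block)
{
    if (!cache_empty)
        return SK_OK;                   /* served without touching the VM */
    if (flags & SK_WAITOK)
        return SK_OK;                   /* sleeps if needed, then succeeds */
    if (!(flags & SK_USE_RESERVE))
        return SK_ASSERT_NO_VM_CLASS;   /* bare RNOWAIT: no VM flags derived */
    if (!vm_would_block)
        return SK_OK;
    return (flags & SK_NULLOK) ? SK_NULL : SK_PANIC_NULL;
}
```

In this model the bare `SK_RNOWAIT | SK_ZERO` request only misbehaves once the cache is empty, while `SK_WAITOK | SK_ZERO` succeeds in every case, which is the direction suggested above for the bridge code.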
Updated by TGEN about 18 years ago
Hmm. In net/bridge/if_bridge.c, kmalloc is called with just M_RNOWAIT,
then it checks whether bif is NULL. This seems a bit bogus, because then
it should rather use M_RNOWAIT|M_NULLOK. Even then, I don't see any
problem in the code blocking here, why doesn't it use M_WAITOK instead?
Also, in kmem_slab_alloc, if I read things right, none of
VM_ALLOC_NORMAL, VM_ALLOC_INTERRUPT and VM_ALLOC_SYSTEM are set, thereby
triggering the KKASSERT in vm_page_alloc? Could you try just adjusting
the kmalloc flags in if_bridge.c:bridge_ioctl_add() from
M_RNOWAIT|M_ZERO to M_NOWAIT|M_ZERO?
Cheers,
--
Thomas E. Spanjaard
tgen@netphreax.net
Updated by dillon about 18 years ago
:Does this also explain why this problem occurs in SMP only?
:
:--
:Gergo Szakal <bastyaelvtars@gmail.com>
:University Of Szeged, HU
It would, yes, because M_NOWAIT will cause allocation failures far
more often on SMP systems if the allocation subsystem cannot get
the big giant lock when accessing the VM system (which occurs when the
per-cpu cache is empty).
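A toy model of that explanation, assuming a per-cpu cache that is refilled under one global lock. Everything here (`sk_vm`, `sk_cpu_cache`, `sk_alloc_nowait()`, the refill size) is hypothetical and only mirrors the description above, not the real allocator.

```c
#include <assert.h>

#define SK_NO_HOLDER (-1)   /* nobody holds the global lock */
#define SK_REFILL    4      /* objects gained per refill (arbitrary) */

struct sk_vm {
    int lock_holder;        /* cpu id holding the global lock, or SK_NO_HOLDER */
};

struct sk_cpu_cache {
    int cpu;                /* owning cpu id */
    int free_objs;          /* objects available without touching the VM */
};

/*
 * Non-blocking allocation in this model: returns 1 on success, 0 on failure.
 * The request only fails when the cache is empty AND another cpu holds the
 * global lock, because waiting for the lock would mean blocking.
 */
static int
sk_alloc_nowait(struct sk_cpu_cache *c, struct sk_vm *vm)
{
    if (c->free_objs > 0) {
        c->free_objs--;
        return 1;
    }
    if (vm->lock_holder != SK_NO_HOLDER && vm->lock_holder != c->cpu)
        return 0;           /* would have to wait for the lock: fail */
    c->free_objs = SK_REFILL - 1;   /* refill, then hand out one object */
    return 1;
}
```

With a single cpu there is never a different lock holder, so the failure branch is unreachable in this model; with two or more cpus it becomes reachable whenever the cache runs dry at the wrong moment.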
The question is, does the commit just made to HEAD (not PREVIEW) fix
the panics?
-Matt
Matthew Dillon
<dillon@backplane.com>
Updated by bastyaelvtars about 18 years ago
On Thu, 04 Jan 2007 16:41:45 +0100
"Simon 'corecode' Schubert" <corecode@fs.ei.tum.de> wrote:
Aye, sorry, it was wrongly ordered (I suck at quick renaming :-D).
Sidenote: kernel_APICIO_5.JPG just shows that weird error message that occurs with ACPI enabled in the BIOS (don't know if it helps, though).
Updated by corecode about 18 years ago
which panic actually is it? could you picture the beginning of it and not just the trace? the panic is somewhere in the vm subsystem.
cheers
simon
Updated by TGEN about 18 years ago
Hmm, that trace on APICIO_1 is really the next few lines after the
ifconfig output? Not something in between (use scroll lock and arrow-up
to see)?
Cheers,
--
Thomas E. Spanjaard
tgen@netphreax.net
Updated by bastyaelvtars about 18 years ago
On Thu, 04 Jan 2007 17:13:25 +0000
"Thomas E. Spanjaard" <tgen@netphreax.net> wrote:
Nope, we have even avoided touching the keyboard.
At http://rnrdoctor.sytes.net/~szg/dfcrash/ there is the error that made me start the thread and it is very similar. :-)
Updated by bastyaelvtars about 18 years ago
I have new pictures here:
http://rnrdoctor.sytes.net/~szg/dfcrash/20070103/
I have no dumps. I will try a clean PREVIEW install next week, compile the SMP kernel in a chrooted environment (as requested by Matt), and test that machine, because we had to set up a UP machine for now, and I borrowed the SMP one. :-)
The bridge code seems to crash with SMP, regardless of APIC_IO (the pictures that contain APIC_IO are the crashes with APIC_IO enabled).
Updated by dillon about 18 years ago
:Umm, may I apply the diff from ViewVC against the preview tree? I am a bit afraid to mess with HEAD.
:
:--
:Gergo Szakal <bastyaelvtars@gmail.com>
:University Of Szeged, HU
:Faculty Of General Medicine
It's a fairly simple patch, you should be able to apply it manually
if need be.
-Matt
Matthew Dillon
<dillon@backplane.com>
Updated by TGEN about 18 years ago
There's still the original trap message missing, the first thing you see
on console when it breaks into DDB. Could you make a screenshot of that
as well?
Cheers,
--
Thomas E. Spanjaard
tgen@netphreax.net
Updated by TGEN about 18 years ago
Just make sure you get the second commit for the kmalloc in
bridge_rtupdate as well.
Cheers,
--
Thomas E. Spanjaard
tgen@netphreax.net
Updated by bastyaelvtars about 18 years ago
On Fri, 5 Jan 2007 11:09:40 -0800 (PST)
Matthew Dillon <dillon@apollo.backplane.com> wrote:
Umm, may I apply the diff from ViewVC against the preview tree? I am a bit afraid to mess with HEAD.
Updated by bastyaelvtars about 18 years ago
On Thu, 04 Jan 2007 16:47:13 +0000
"Thomas E. Spanjaard" <tgen@netphreax.net> wrote:
Um, well, actually kernel_APICIO_1.JPG and kernel_new1.JPG are the very first messages that we saw (it showed an ifconfig output, and crashed right after). kernel_APICIO_2.JPG and kernel_new2.JPG were taken after entering 'trace'. kernel_APICIO_3.JPG shows that it does not make a dump for some reason with ACPI enabled, kernel_APICIO_4.JPG is just a better quality version of 3, and kernel_APICIO_5.JPG shows the weird error message on booting (from before the crash). kernel_new3.jpg is just a better quality version of 2. I don't know what else I should take a picture of. ;-)
Updated by bastyaelvtars about 18 years ago
On Thu, 4 Jan 2007 10:45:05 -0800 (PST)
Matthew Dillon <dillon@apollo.backplane.com> wrote:
Does this also explain why this problem occurs in SMP only?
Updated by bastyaelvtars about 18 years ago
On Fri, 5 Jan 2007 11:09:40 -0800 (PST)
Matthew Dillon <dillon@apollo.backplane.com> wrote:
The question is, does the commit just made to HEAD (not PREVIEW) fix
the panics?
Umm, I haven't yet had the time to try it out. I'll test this weekend.
Updated by bastyaelvtars about 18 years ago
Updated to latest PREVIEW, nothing has changed. I'll try w/o APIC_IO, will send new dumps tomorrow, stay tuned. :-(
With UP it works.
Updated by justin almost 18 years ago
Updated by bastyaelvtars almost 18 years ago
I have put it into another SMP machine, built the kernel on top of a fresh PREVIEW installation from the source included on the CD, then rebooted, and it crashed as expected. I hacked the sources in single user mode and it seems to work now; at least it gets past the point where it used to crash.
Thanks to all! :-)