Bug #636: kernel panic - DragonFlyBSD - DragonFlyBSD bugtracker

kernel panic

Added by josepht almost 19 years ago. Updated about 17 years ago.

Status: Closed
Priority: Normal

Description

I updated to the latest HEAD as of yesterday (2007-05-09) after the
INET6 fix.

$ uname -a
DragonFly neptune.xenno.com 1.9.0-DEVELOPMENT DragonFly
1.9.0-DEVELOPMENT #135: Wed May 9 23:29:57 EDT 2007
root@neptune.xenno.com:/home/obj/usr/src/sys/NEPTUNE i386

I got the following kernel panic:

Fatal trap 19: non-maskable interrupt trap while in kernel mode
mp_lock = 00000000; cpuid = 0; lapic.id = 00000000
instruction pointer = 0x8:0xc02f6a1e
stack pointer = 0x10:0xcade4a44
frame pointer = 0x10:0xcade4a68
code segment = base 0x0, limit 0xfffff, type 0x1b = DPL 0, pres 1, def32 1, gran 1
processor eflags = interrupt enabled, IOPL = 0
current process = 729 (ifconfig)
current thread = pri 6
<- SMP: XXX

It looks like the stack was corrupted but I was able to get this:

(kgdb) bt
#0 0x00000000 in ?? ()
(kgdb) info locals
No symbol table info available.
(kgdb) x 0xc02f6a1e
0xc02f6a1e <agp_intel_flush_tlb+35>: 0x81028b90

I can upload the kernel and vmcore files if absolutely necessary but
the vmcore file is 1.6GB uncompressed so if I don't need to I will
save the bandwidth.

Perhaps unrelated, but just in case. I got this out of the kernel
buffer:

[diagnostic] cache_lock: blocked on 0xdc583e28 "utils"

I have also had a cvs process hang in the "vnode" state. I was unable
to attach the process with gdb (this just seemed to hang) or get any
output from ktrace.

Thanks,
Joe Talbott

Updated by dillon almost 19 years ago

:I updated to the latest HEAD as of yesterday (2007-05-09) after the
:INET6 fix.
:
:...
:I got the following kernel panic:
:
:Fatal trap 19: non-maskable interrupt trap while in kernel mode
:mp_lock = 00000000; cpuid = 0; lapic.id = 00000000
:instruction pointer = 0x8:0xc02f6a1e
:stack pointer = 0x10:0xcade4a44
:frame pointer = 0x10:0xcade4a68
:...
:processor eflags = interrupt enabled, IOPL = 0
:current process = 729 (ifconfig)
:current thread = pri 6
: <- SMP: XXX
:
:It looks like the stack was corrupted but I was able to get this:
:
:(kgdb) bt
:#0 0x00000000 in ?? ()
:(kgdb) info locals
:No symbol table info available.
:(kgdb) x 0xc02f6a1e
:0xc02f6a1e <agp_intel_flush_tlb+35>: 0x81028b90
:
:I can upload the kernel and vmcore files if absolutely necessary but
:the vmcore file is 1.6GB uncompressed so if I don't need to I will
:save the bandwidth.
:
:Perhaps unrelated, but just in case. I got this out of the kernel
:buffer:
:
:[diagnostic] cache_lock: blocked on 0xdc583e28 "utils"
:
:I have also had a cvs process hang in the "vnode" state. I was unable
:to attach the process with gdb (this just seemed to hang) or get any
:output from ktrace.
:
:Thanks,
:Joe Talbott

NMI traps during device operation usually indicate a bus parity
    failure during DMA.  This often occurs when shared memory on
    the device itself is not properly initialized by the device driver
    and then accessed.

What interface were you ifconfig'ing when the crash occurred?
    And is it repeatable?

-Matt
                    Matthew Dillon 
                    <dillon@backplane.com>

Updated by josepht almost 19 years ago

The strange thing is I was rebooting my laptop (via icewm) when this
occurred. The interface is re(4) according to the kernel buffer output
which follows.

Joe

Here is some kernel buffer output:

<118>Shutting down daemon processes:
<118>.
<118>Stopping cron.
<118>Shutting down local daemons:
<118>.
<118>Terminated
<118>.
<118>Dec 15 16:53:11 neptune syslogd: exiting on signal 15
<118>Enter full pathname of shell or RETURN for /bin/sh:
<118>#
<118>u
<118>m
<118>o
<118>u
<118>n
<118>t
<118>
<118>/
<118>u
<118>s
<118
<118
<118>u
<118>s
<118>r
<118>
<118>/
<118>h
<118>o
<118>m
<118>e
<118>
<118>#
<118>i
<118>f
<118>c
<118>o
<118>n
<118>f
<118>i
<118>g
<118>
<118>r
<118>e
<118>0
<118>
<118>2
<118>0
<118>9
<118>.
<118>1
<118>4
<118>5
<118>.
<118>6
<118>6
<118>.
<118>3
<118>2
<118>

Updated by dillon almost 19 years ago

:The strange thing is I was rebooting my laptop (via icewm) when this
:occurred. The interface is re(4) according to the kernel buffer output
:which follows.
:
:Joe

I'm guessing there's an issue with re_init() or re_stop() that is
    possibly being triggered by setting the IP address.

re_init() for the RE interface looks like it is doing some dangerous
    things... if there is DMA still operating while it is trying to
    reinitialize the device, that could be causing the NMI.  It seems to be
    writing 0x00 to the command register which I guess is supposed to stop
    device operation, but it is not waiting for the device to actually stop
    operating before it begins to free the TX and RX rings.

Most network controllers these days are actually microcontrollers,
    which means that commands do not instantaneously take effect when
    you write to the command register.  Usually only the interrupt
    control registers are hardwired.

I got two questions.  First, when you ifconfig the interface with a
    new IP address does it normally pause before returning?  That would
    indicate that it is in fact doing a full device reset when configuring
    an IP address.  Second, can you reproduce the problem?  Perhaps by
    re-configuring the device's IP address over and over again in a loop?

We may be able to 'fix' the problem simply by introducing a delay
    after writing 0x00 to RE_COMMAND, or by calling re_reset() as part
    of re_stop(), but I'd like a way to verify that doing so will actually
    fix the problem.
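
As a rough illustration of the "wait for the device to actually stop" idea, the sketch below models a controller whose stop command only takes effect after a few register reads. It is not the if_re.c code: `fake_nic`, `fake_write_stop()`, `fake_read_command()`, `nic_stop_and_wait()` and the `FAKE_CMD_RUNNING` bit are all hypothetical names invented for this example, and a real driver would poll the hardware register with a DELAY between reads.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of a NIC that keeps running for a few register
 * reads after the driver asks it to stop. */
#define FAKE_CMD_RUNNING 0x01u

struct fake_nic {
    uint8_t command;      /* value currently latched by the "hardware" */
    int     stop_latency; /* reads left before a pending stop takes effect */
    bool    stop_pending; /* a stop was requested but has not landed yet */
    bool    rings_freed;  /* set once the driver released its rings */
};

/* Driver writes 0x00: the model only records the request. */
static void
fake_write_stop(struct fake_nic *nic)
{
    nic->stop_pending = true;
}

/* Each read advances the model; the stop lands once the latency is used up. */
static uint8_t
fake_read_command(struct fake_nic *nic)
{
    if (nic->stop_pending) {
        if (nic->stop_latency > 0) {
            nic->stop_latency--;
        } else {
            nic->command = 0x00;
            nic->stop_pending = false;
        }
    }
    return nic->command;
}

/*
 * Request a stop, then poll up to max_polls times for the running bit
 * to clear.  The rings are released only if the device really stopped.
 * Returns true on success, false on timeout.
 */
static bool
nic_stop_and_wait(struct fake_nic *nic, int max_polls)
{
    assert(max_polls > 0);
    fake_write_stop(nic);
    for (int i = 0; i < max_polls; i++) {
        /* a real driver would delay here between polls */
        if ((fake_read_command(nic) & FAKE_CMD_RUNNING) == 0) {
            nic->rings_freed = true;
            return true;
        }
    }
    return false;
}
```

The point of the bounded loop is that the rings are never freed while the modelled device may still be running; on timeout the caller can fall back to a harder reset instead of freeing memory the device might still touch.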

-Matt
                    Matthew Dillon 
                    <dillon@backplane.com>

Updated by josepht almost 19 years ago

There is a small delay <2s. I ran a loop that switched between two
IPs for about 15 minutes and nothing happened.

The kernel buffer output in the corefile was from months ago. I only
remembered because I did the same thing this time; shutdown now;
umount /home; ifconfig re0 ... I don't know how this can be in a dump
months after the fact unless there is stale data in my swap partition
from my last coredump that hasn't been overwritten since I don't do
very much swapping. This idea may be completely wrong. I am 100%
certain that I'm not looking at a stale dump as strings on the kernel
and vmcore show them as being from May 9, 2007. I am also certain
that I was not ifconfig'ing any interface when this happened.

Joe

Updated by dillon almost 19 years ago

:There is a small delay <2s. I ran a loop that switched between two
:IPs for about 15 minutes and nothing happened.
:
:The kernel buffer output in the corefile was from months ago. I only
:remembered because I did the same thing this time; shutdown now;
:umount /home; ifconfig re0 ... I don't know how this can be in a dump
:months after the fact unless there is stale data in my swap partition
:from my last coredump that hasn't been overwritten since I don't do
:very much swapping. This idea may be completely wrong. I am 100%
:certain that I'm not looking at a stale dump as strings on the kernel
:and vmcore show them as being from May 9, 2007. I am also certain
:that I was not ifconfig'ing any interface when this happened.
:
:Joe

Sometimes the BIOS clears memory, sometimes it doesn't.  If it doesn't,
    then sometimes the dmesg text from previous boots will remain in
    memory and be available.  That's all.  Power cycle and it all goes poof.

I've googled similar bug reports on FreeBSD, Linux, NetBSD, etc.  I
    have not found much information other than this:

http://lists.freebsd.org/pipermail/freebsd-bugs/2003-September/003012.html

Which seems to indicate that it might be DRM/DRI related.. or perhaps
    just video/DRI related as this person triggered it simply by restarting
    his X server a few times.

BUT, our flush code already uses the changes made to FreeBSD, i.e.
    uses ~(1 << 7), so I am somewhat at a loss.
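
For reference, the ~(1 << 7) idiom clears bit 7 of a value while leaving every other bit alone. The sketch below shows just that bit manipulation, under the assumption that the flush writes the control value once with bit 7 cleared and then restores the original value. A plain struct stands in for the chipset register, and `fake_agpctrl`, `clear_bit7()` and `flush_tlb_sketch()` are hypothetical names for this example, not the kernel's agp_intel_flush_tlb().

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a chipset control register. */
struct fake_agpctrl {
    uint32_t value;      /* current register contents */
    uint32_t writes[2];  /* the two values written during a flush */
};

/* Clear only bit 7, leaving every other bit untouched. */
static uint32_t
clear_bit7(uint32_t val)
{
    return val & ~(UINT32_C(1) << 7);
}

/* Assumed flush sequence: write with bit 7 cleared, then restore. */
static void
flush_tlb_sketch(struct fake_agpctrl *reg)
{
    uint32_t val;

    assert(reg != NULL);
    val = reg->value;

    reg->writes[0] = clear_bit7(val);
    reg->value = reg->writes[0];

    reg->writes[1] = val;
    reg->value = val;
}
```

Masking the value that was read, rather than writing fixed constants, keeps whatever other bits the BIOS or driver had already set in the register.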

-Matt
                    Matthew Dillon 
                    <dillon@backplane.com>

Updated by josepht almost 19 years ago

This is a laptop that has been power cycled at least a hundred times
since that took place so it seems to me there's no way it was coming
from memory. When my re(4) troubles were happening I had hw.physmem
set to 256M to get manageable coredumps. After my troubles were
resolved I removed that entry from my loader.conf. So this time my
dump consisted of 1.5GB as did several re(4) related coredumps prior
to my setting hw.physmem. I assume that the swap space isn't zero'd
or otherwise initialized prior to a page being written to it. I also
assume that a coredump is written sparsely to disk so old data could
remain across coredumps. I guess I'll read the code and see if I can
learn a bit more rather than making assumptions.

Joe

Updated by dillon almost 19 years ago

:This is a laptop that has been power cycled at least a hundred times
:since that took place so it seems to me there's no way it was coming
:from memory. When my re(4) troubles were happening I had hw.physmem
:set to 256M to get manageable coredumps. After my troubles were
:resolved I removed that entry from my loader.conf. So this time my
:dump consisted of 1.5GB as did several re(4) related coredumps prior
:to my setting hw.physmem. I assume that the swap space isn't zero'd
:or otherwise initialized prior to a page being written to it. I also
:assume that a coredump is written sparsely to disk so old data could
:remain across coredumps. I guess I'll read the code and see if I can
:learn a bit more rather than making assumptions.
:
:Joe

It may be worth adding a DELAY in re_stop(), but it will take a
    while to determine whether it does any good if we can't reproduce
    the failure consistently.

-Matt
                    Matthew Dillon 
                    <dillon@backplane.com>

Index: if_re.c
===================================================================
RCS file: /cvs/src/sys/dev/netif/re/if_re.c,v
retrieving revision 1.32
diff -u -r1.32 if_re.c
--- if_re.c	30 Mar 2007 14:15:58 -0000	1.32
+++ if_re.c	13 May 2007 07:33:22 -0000
@@ -2320,6 +2320,7 @@
 	CSR_WRITE_1(sc, RE_COMMAND, 0x00);
 	CSR_WRITE_2(sc, RE_IMR, 0x0000);
 	CSR_WRITE_2(sc, RE_ISR, 0xFFFF);
+	DELAY;
 
 	if (sc->re_head != NULL) {
 		m_freem(sc->re_head);

Updated by josepht almost 19 years ago

I made this change and shortly got another hang. I was in X messing
around with a vkernel and was typing away when everything froze. I
waited a bit after trying CTRL-ALT-ESC, CTRL-ALT-BKSP, CTRL-ALT-DEL,
and anything else I could think of, but the machine was frozen. So I
held the power button down for 5s or so and rebooted. After checking
my filesystems, what do you know another coredump was found and it is
the same as last time. I am positive that I'm not looking at the
wrong vmcore. I am inclined to believe that no coredump was ever
written to my swap partition or perhaps only a part of a coredump was
written before I power cycled the laptop. I'm going to try to trigger
this running from the system console versus in X and see if I can get
into the debugger.

Joe

Updated by dillon almost 19 years ago

:I made this change and shortly got another hang. I was in X messing
:around with a vkernel and was typing away when everything froze. I
:waited a bit after trying CTRL-ALT-ESC, CTRL-ALT-BKSP, CTRL-ALT-DEL,
:and anything else I could think of, but the machine was frozen. So I
:held the power button down for 5s or so and rebooted. After checking
:my filesystems, what do you know another coredump was found and it is
:the same as last time. I am positive that I'm not looking at the
:wrong vmcore. I am inclined to believe that no coredump was ever
:written to my swap partition or perhaps only a part of a coredump was
:written before I power cycled the laptop. I'm going to try to trigger
:this running from the system console versus in X and see if I can get
:into the debugger.
:
:Joe

If you have a serial port you can run a serial console to another 
    machine.  If not you do have another option, and that is to compile
    a kernel with:

options DDB_UNATTENDED

Hopefully when it crashes it will be able to write the core out.  If
    it does, you should see the hard drive light for your laptop flicker
    or go solid for however long it takes to write out the core.

Note: Do NOT hard power cycle your machine while it is writing to the
    hard drive.  That's a great way to destroy the hard drive.

-Matt
                    Matthew Dillon 
                    <dillon@backplane.com>

#10

Updated by josepht over 18 years ago

This has happened again. The same as the first time the machine locks
up hard when doing a 'shutdown -h now' via icewm's shutdown function.
I had DDB_UNATTENDED compiled in my kernel but to no avail. I did
notice the fan ramp up in speed but there was absolutely no HDD
activity indicated by the HDD LED. I certainly can't rule out
icewm/xorg as the culprit.

Joe

#11

Updated by dillon over 18 years ago

:This has happened again. The same as the first time the machine locks
:up hard when doing a 'shutdown -h now' via icewm's shutdown function.
:I had DDB_UNATTENDED compiled in my kernel but to no avail. I did
:notice the fan ramp up in speed but there was absolutely no HDD
:activity indicated by the HDD LED. I certainly can't rule out
:icewm/xorg as the culprit.
:
:Joe

Do you have a serial port on the box and a second machine you can
    connect it to?  That may be the only way to figure out what is
    going on.

I will try to reproduce it over here with my test box.

-Matt
                    Matthew Dillon 
                    <dillon@backplane.com>

#12

Updated by josepht over 18 years ago

No serial port. I don't imagine I can get a serial console attached
via a USB to serial adapter, but if that is possible I can set it up.

Joe

#13

Updated by dillon over 18 years ago

:No serial port. I don't imagine I can get a serial console attached
:via a USB to serial adapter, but if that is possible I can set it up.
:
:Joe

I don't think so.  But it may be possible to set up DCONS, the
    firewire based console, and perhaps get something.  You'd need a
    second machine around.

-Matt
                    Matthew Dillon 
                    <dillon@backplane.com>

#14

Updated by corecode over 18 years ago

is this still present?

#15

Updated by josepht over 18 years ago

Yes.

#16

Updated by corecode over 18 years ago

So is this an AGP or a re(4) problem? I.e. does this also happen without AGP
stuff loaded?

#17

Updated by dillon over 18 years ago

:Simon 'corecode' Schubert <corecode@fs.ei.tum.de> added the comment:
:
:So is this an AGP or a re(4) problem? I.e. does this also happen without AGP
:stuff loaded?

I think it's a PCI/AGP bus snafu of some sort.  It is probably agp
    related.  NMIs can only occur on memory parity/ecc errors or on bus
    parity errors.

-Matt
                    Matthew Dillon 
                    <dillon@backplane.com>

#18

Updated by corecode about 17 years ago

is this still present?

#19

Updated by josepht about 17 years ago

No.

Joe

#20

Updated by corecode about 17 years ago

thanks, closing.
