Bug #636

closed

kernel panic

Added by josepht almost 17 years ago. Updated about 15 years ago.

Status: Closed
Priority: Normal
Assignee: -
Category: -
Target version: -
Start date:
Due date:
% Done: 0%
Estimated time:

Description

I updated to the latest HEAD as of yesterday (2007-05-09) after the
INET6 fix.

$ uname -a
DragonFly neptune.xenno.com 1.9.0-DEVELOPMENT DragonFly
1.9.0-DEVELOPMENT #135: Wed May 9 23:29:57 EDT 2007
:/home/obj/usr/src/sys/NEPTUNE i386

I got the following kernel panic:

Fatal trap 19: non-maskable interrupt trap while in kernel mode
mp_lock = 00000000; cpuid = 0; lapic.id = 00000000
instruction pointer = 0x8:0xc02f6a1e
stack pointer = 0x10:0xcade4a44
frame pointer = 0x10:0xcade4a68
code segment = base 0x0, limit 0xfffff, type 0x1b = DPL 0, pres 1, def32 1, gran 1
processor eflags = interrupt enabled, IOPL = 0
current process = 729 (ifconfig)
current thread = pri 6
<- SMP: XXX

It looks like the stack was corrupted but I was able to get this:

(kgdb) bt
#0 0x00000000 in ?? ()
(kgdb) info locals
No symbol table info available.
(kgdb) x 0xc02f6a1e
0xc02f6a1e <agp_intel_flush_tlb+35>: 0x81028b90

I can upload the kernel and vmcore files if absolutely necessary but
the vmcore file is 1.6GB uncompressed so if I don't need to I will
save the bandwidth.

Perhaps unrelated, but just in case. I got this out of the kernel
buffer:

[diagnostic] cache_lock: blocked on 0xdc583e28 "utils"

I have also had a cvs process hang in the "vnode" state. I was unable
to attach the process with gdb (this just seemed to hang) or get any
output from ktrace.

Thanks,
Joe Talbott

Actions #1

Updated by dillon almost 17 years ago

:I updated to the latest HEAD as of yesterday (2007-05-09) after the
:INET6 fix.
:
:...
:I got the following kernel panic:
:
:Fatal trap 19: non-maskable interrupt trap while in kernel mode
:mp_lock = 00000000; cpuid = 0; lapic.id = 00000000
:instruction pointer = 0x8:0xc02f6a1e
:stack pointer = 0x10:0xcade4a44
:frame pointer = 0x10:0xcade4a68
:...
:processor eflags = interrupt enabled, IOPL = 0
:current process = 729 (ifconfig)
:current thread = pri 6
: <- SMP: XXX
:
:It looks like the stack was corrupted but I was able to get this:
:
:(kgdb) bt
:#0 0x00000000 in ?? ()
:(kgdb) info locals
:No symbol table info available.
:(kgdb) x 0xc02f6a1e
:0xc02f6a1e <agp_intel_flush_tlb+35>: 0x81028b90
:
:I can upload the kernel and vmcore files if absolutely necessary but
:the vmcore file is 1.6GB uncompressed so if I don't need to I will
:save the bandwidth.
:
:Perhaps unrelated, but just in case. I got this out of the kernel
:buffer:
:
:[diagnostic] cache_lock: blocked on 0xdc583e28 "utils"
:
:I have also had a cvs process hang in the "vnode" state. I was unable
:to attach the process with gdb (this just seemed to hang) or get any
:output from ktrace.
:
:Thanks,
:Joe Talbott

NMI traps during device operation usually indicate a bus parity
failure during DMA. This often occurs when shared memory on
the device itself is not properly initialized by the device driver
and then accessed.
What interface were you ifconfig'ing when the crash occurred?
And is it repeatable?
-Matt
Matthew Dillon
Actions #2

Updated by josepht almost 17 years ago

The strange thing is I was rebooting my laptop (via icewm) when this
occurred. The interface is re(4) according to the kernel buffer output
which follows.

Joe

Here is some kernel buffer output:

<118>Shutting down daemon processes:
<118>.
<118>Stopping cron.
<118>Shutting down local daemons:
<118>.
<118>Terminated
<118>.
<118>Dec 15 16:53:11 neptune syslogd: exiting on signal 15
<118>Enter full pathname of shell or RETURN for /bin/sh:
<118>#
<118>u
<118>m
<118>o
<118>u
<118>n
<118>t
<118>
<118>/
<118>u
<118>s
<118
<118
<118>u
<118>s
<118>r
<118>
<118>/
<118>h
<118>o
<118>m
<118>e
<118>
<118>#
<118>i
<118>f
<118>c
<118>o
<118>n
<118>f
<118>i
<118>g
<118>
<118>r
<118>e
<118>0
<118>
<118>2
<118>0
<118>9
<118>.
<118>1
<118>4
<118>5
<118>.
<118>6
<118>6
<118>.
<118>3
<118>2
<118>

Actions #3

Updated by dillon almost 17 years ago

:The strange thing is I was rebooting my laptop (via icewm) when this
:occurred. The interface is re(4) according to the kernel buffer output
:which follows.
:
:Joe

I'm guessing there's an issue with re_init() or re_stop() that is
possibly being triggered by setting the IP address.
re_init() for the RE interface looks like it is doing some dangerous
things... if there is DMA still operating while it is trying to
reinitialize the device, that could be causing the NMI. It seems to be
writing 0x00 to the command register which I guess is supposed to stop
device operation, but it is not waiting for the device to actually stop
operating before it begins to free the TX and RX rings.
Most network controllers these days are actually microcontrollers,
which means that commands do not instantaneously take effect when
you write to the command register. Usually only the interrupt
control registers are hardwired.
I have two questions. First, when you ifconfig the interface with a
new IP address does it normally pause before returning? That would
indicate that it is in fact doing a full device reset when configuring
an IP address. Second, can you reproduce the problem? Perhaps by
re-configuring the device's IP address over and over again in a loop?
We may be able to 'fix' the problem simply by introducing a delay
after writing 0x00 to RE_COMMAND, or by calling re_reset() as part
of re_stop(), but I'd like a way to verify that doing so will actually
fix the problem.
-Matt
Matthew Dillon
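
The concern raised above (writing 0x00 to the command register and then freeing the rings without waiting for the device to stop) can be illustrated with a small user-space sketch. This is not the re(4) driver code: the device is simulated, and `struct fake_dev`, the `FAKE_CMD_*` bits, and `fake_stop_and_wait()` are hypothetical names invented for the illustration. It only shows the general "write stop, then poll until idle, with a bounded retry count" pattern, assuming the hardware exposes some way to observe that its engines have stopped.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical enable bits; NOT taken from the re(4) register layout. */
#define FAKE_CMD_TX_ENB 0x04u
#define FAKE_CMD_RX_ENB 0x08u
#define FAKE_CMD_BUSY   (FAKE_CMD_TX_ENB | FAKE_CMD_RX_ENB)

/*
 * Simulated device. After software writes a new command value, the
 * device keeps reporting its engines as enabled for a few more reads,
 * mimicking a microcontroller that acts on commands with some latency.
 */
struct fake_dev {
    uint8_t command;          /* value software last wrote */
    int     polls_until_idle; /* reads left before the engines really stop */
};

static void fake_write_command(struct fake_dev *d, uint8_t v)
{
    d->command = v;
}

static uint8_t fake_read_command(struct fake_dev *d)
{
    if (d->polls_until_idle > 0) {
        d->polls_until_idle--;
        return FAKE_CMD_BUSY;   /* still running */
    }
    return d->command;
}

/*
 * Ask the device to stop, then poll until it reports idle.
 * Returns true if it went idle within max_polls reads, false otherwise.
 * Only after a true return would it be safe to free DMA rings.
 */
static bool fake_stop_and_wait(struct fake_dev *d, int max_polls)
{
    assert(max_polls > 0);
    fake_write_command(d, 0x00);
    for (int i = 0; i < max_polls; i++) {
        if ((fake_read_command(d) & FAKE_CMD_BUSY) == 0)
            return true;
        /* A real driver would insert a short delay between polls here. */
    }
    return false;
}
```

With a simulated device that needs three reads to settle, `fake_stop_and_wait(&d, 10)` returns true; with one that needs fifty, it returns false and the caller would have to fall back to something stronger, such as a full reset.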
Actions #4

Updated by josepht almost 17 years ago

There is a small delay <2s. I ran a loop that switched between two
IPs for about 15 minutes and nothing happened.

The kernel buffer output in the corefile was from months ago. I only
remembered because I did the same thing this time; shutdown now;
umount /home; ifconfig re0 ... I don't know how this can be in a dump
months after the fact unless there is stale data in my swap partition
from my last coredump that hasn't been overwritten since I don't do
very much swapping. This idea may be completely wrong. I am 100%
certain that I'm not looking at a stale dump as strings on the kernel
and vmcore show them as being from May 9, 2007. I am also certain
that I was not ifconfig'ing any interface when this happened.

Joe

Actions #5

Updated by dillon almost 17 years ago

:There is a small delay <2s. I ran a loop that switched between two
:IPs for about 15 minutes and nothing happened.
:
:The kernel buffer output in the corefile was from months ago. I only
:remembered because I did the same thing this time; shutdown now;
:umount /home; ifconfig re0 ... I don't know how this can be in a dump
:months after the fact unless there is stale data in my swap partition
:from my last coredump that hasn't been overwritten since I don't do
:very much swapping. This idea may be completely wrong. I am 100%
:certain that I'm not looking at a stale dump as strings on the kernel
:and vmcore show them as being from May 9, 2007. I am also certain
:that I was not ifconfig'ing any interface when this happened.
:
:Joe

Sometimes the BIOS clears memory, sometimes it doesn't.  If it doesn't,
then sometimes the dmesg text from previous boots will remain in
memory and be available. That's all. Power cycle and it all goes poof.
I've googled similar bug reports on FreeBSD, Linux, NetBSD, etc.  I
have not found much information other than this:
http://lists.freebsd.org/pipermail/freebsd-bugs/2003-September/003012.html
Which seems to indicate that it might be DRM/DRI related.. or perhaps
just video/DRI related as this person triggered it simply by restarting
his X server a few times.
BUT, our flush code already uses the changes made to FreeBSD, i.e.
uses ~(1 << 7), so I am somewhat at a loss.
-Matt
Matthew Dillon
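
For readers unfamiliar with the `~(1 << 7)` idiom mentioned above, here is a minimal sketch of what such a mask does. It assumes the flush is performed by briefly clearing bit 7 of a control value and then writing the original value back; the actual register, its width, and the write order in the AGP driver are not shown in this thread, so `struct fake_ctrl` and `fake_flush()` are hypothetical stand-ins operating on a plain variable.

```c
#include <assert.h>
#include <stdint.h>

#define FLUSH_BIT (1u << 7)   /* bit 7, i.e. 0x80 */

/* Simulated control register that records the last two writes. */
struct fake_ctrl {
    uint32_t value;       /* current register contents */
    uint32_t history[2];  /* values written, in order */
    int      writes;      /* number of writes recorded so far */
};

static void fake_ctrl_write(struct fake_ctrl *c, uint32_t v)
{
    assert(c->writes < 2);
    c->history[c->writes++] = v;
    c->value = v;
}

/*
 * Toggle bit 7: write the value with bit 7 cleared, then write the
 * original value back. All other bits are left untouched because
 * ~FLUSH_BIT has every bit set except bit 7.
 */
static void fake_flush(struct fake_ctrl *c)
{
    uint32_t val = c->value;

    fake_ctrl_write(c, val & ~FLUSH_BIT);
    fake_ctrl_write(c, val);
}
```

Starting from a value of 0x2280, the first write is 0x2200 (bit 7 cleared) and the second restores 0x2280.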
Actions #6

Updated by josepht almost 17 years ago

This is a laptop that has been power cycled at least a hundred times
since that took place so it seems to me there's no way it was coming
from memory. When my re(4) troubles were happening I had hw.physmem
set to 256M to get manageable coredumps. After my troubles were
resolved I removed that entry from my loader.conf. So this time my
dump consisted of 1.5GB as did several re(4) related coredumps prior
to my setting hw.physmem. I assume that the swap space isn't zero'd
or otherwise initialized prior to a page being written to it. I also
assume that a coredump is written sparsely to disk so old data could
remain across coredumps. I guess I'll read the code and see if I can
learn a bit more rather than making assumptions.

Joe

Actions #7

Updated by dillon almost 17 years ago

:This is a laptop that has been power cycled at least a hundred times
:since that took place so it seems to me there's no way it was coming
:from memory. When my re(4) troubles were happening I had hw.physmem
:set to 256M to get manageable coredumps. After my troubles were
:resolved I removed that entry from my loader.conf. So this time my
:dump consisted of 1.5GB as did several re(4) related coredumps prior
:to my setting hw.physmem. I assume that the swap space isn't zero'd
:or otherwise initialized prior to a page being written to it. I also
:assume that a coredump is written sparsely to disk so old data could
:remain across coredumps. I guess I'll read the code and see if I can
:learn a bit more rather than making assumptions.
:
:Joe

It may be worth adding a DELAY in re_stop(), but it will take a
while to determine whether it does any good if we can't reproduce
the failure consistently.
-Matt
Matthew Dillon

Index: if_re.c
===================================================================
RCS file: /cvs/src/sys/dev/netif/re/if_re.c,v
retrieving revision 1.32
diff -u -r1.32 if_re.c
--- if_re.c 30 Mar 2007 14:15:58 -0000 1.32
+++ if_re.c 13 May 2007 07:33:22 -0000
@@ -2320,6 +2320,7 @@
     CSR_WRITE_1(sc, RE_COMMAND, 0x00);
     CSR_WRITE_2(sc, RE_IMR, 0x0000);
     CSR_WRITE_2(sc, RE_ISR, 0xFFFF);
+    DELAY;

     if (sc->re_head != NULL) {
         m_freem(sc->re_head);

Actions #8

Updated by josepht almost 17 years ago

I made this change and shortly got another hang. I was in X messing
around with a vkernel and was typing away when everything froze. I
waited a bit after trying CTRL-ALT-ESC, CTRL-ALT-BKSP, CTRL-ALT-DEL,
and anything else I could think of, but the machine was frozen. So I
held the power button down for 5s or so and rebooted. After checking
my filesystems, what do you know another coredump was found and it is
the same as last time. I am positive that I'm not looking at the
wrong vmcore. I am inclined to believe that no coredump was ever
written to my swap partition or perhaps only a part of a coredump was
written before I power cycled the laptop. I'm going to try to trigger
this running from the system console versus in X and see if I can get
into the debugger.

Joe

Actions #9

Updated by dillon almost 17 years ago

:I made this change and shortly got another hang. I was in X messing
:around with a vkernel and was typing away when everything froze. I
:waited a bit after trying CTRL-ALT-ESC, CTRL-ALT-BKSP, CTRL-ALT-DEL,
:and anything else I could think of, but the machine was frozen. So I
:held the power button down for 5s or so and rebooted. After checking
:my filesystems, what do you know another coredump was found and it is
:the same as last time. I am positive that I'm not looking at the
:wrong vmcore. I am inclined to believe that no coredump was ever
:written to my swap partition or perhaps only a part of a coredump was
:written before I power cycled the laptop. I'm going to try to trigger
:this running from the system console versus in X and see if I can get
:into the debugger.
:
:Joe

If you have a serial port you can run a serial console to another 
machine. If not you do have another option, and that is to compile
a kernel with:
options DDB_UNATTENDED
Hopefully when it crashes it will be able to write the core out.  If
it does, you should see the hard drive light for your laptop flicker
or go solid for however long it takes to write out the core.
Note: Do NOT hard power cycle your machine while it is writing to the
hard drive. That's a great way to destroy the hard drive.
-Matt
Matthew Dillon
Actions #10

Updated by josepht almost 17 years ago

This has happened again. The same as the first time the machine locks
up hard when doing a 'shutdown -h now' via icewm's shutdown function.
I had DDB_UNATTENDED compiled in my kernel but to no avail. I did
notice the fan ramp up in speed but there was absolutely no HDD
activity indicated by the HDD LED. I certainly can't rule out
icewm/xorg as the culprit.

Joe

Actions #11

Updated by dillon almost 17 years ago

:This has happened again. The same as the first time the machine locks
:up hard when doing a 'shutdown -h now' via icewm's shutdown function.
:I had DDB_UNATTENDED compiled in my kernel but to no avail. I did
:notice the fan ramp up in speed but there was absolutely no HDD
:activity indicated by the HDD LED. I certainly can't rule out
:icewm/xorg as the culprit.
:
:Joe

Do you have a serial port on the box and a second machine you can
connect it to? That may be the only way to figure out what is
going on.
I will try to reproduce it over here with my test box.
-Matt
Matthew Dillon
Actions #12

Updated by josepht almost 17 years ago

No serial port. I don't imagine I can get a serial console attached
via a USB to serial adapter, but if that is possible I can set it up.

Joe

Actions #13

Updated by dillon almost 17 years ago

:No serial port. I don't imagine I can get a serial console attached
:via a USB to serial adapter, but if that is possible I can set it up.
:
:Joe

I don't think so.  But it may be possible to set up DCONS, the
firewire based console, and perhaps get something. You'd need a
second machine around.
-Matt
Matthew Dillon
Actions #14

Updated by corecode almost 17 years ago

is this still present?

Actions #15

Updated by josepht almost 17 years ago

Yes.

Actions #16

Updated by corecode almost 17 years ago

So is this an AGP or a re(4) problem? I.e. does this also happen without AGP
stuff loaded?

Actions #17

Updated by dillon almost 17 years ago

:Simon 'corecode' Schubert added the comment:
:
:So is this an AGP or a re(4) problem? I.e. does this also happen without AGP
:stuff loaded?

I think it's a PCI/AGP bus snafu of some sort. It is probably AGP
related. NMIs can only occur on memory parity/ECC errors or on bus
parity errors.
-Matt
Matthew Dillon
Actions #18

Updated by corecode about 15 years ago

is this still present?

Actions #19

Updated by josepht about 15 years ago

No.

Joe

Actions #20

Updated by corecode about 15 years ago

thanks, closing.
