Bug #599
Status: open
1.9.0 reproducible panic
Description
Here's a panic I'm getting with some pretty serious network (www) load, then
doing a netstat -an:
Unread portion of the kernel message buffer:
panic: m_copydata, negative off -1
mp_lock = 00000000; cpuid = 0; lapic.id = 00000000
boot() called on cpu#0
syncing disks... 5
done
Uptime: 12d22h0m32s
(kgdb) bt
#0 dumpsys () at thread.h:83
#1 0xc01954bb in boot (howto=256) at /usr/src/sys/kern/kern_shutdown.c:370
#2 0xc01957c0 in panic (fmt=Variable "fmt" is not available.
) at /usr/src/sys/kern/kern_shutdown.c:767
#3 0xc01c3a32 in m_copydata (m=0x0, off=0, len=0, cp=0xee9534b0 "\001\001
\b\n\006¦*$\035\bͬ") at /usr/src/sys/kern/uipc_mbuf.c:1014
#4 0xc020fc25 in tcp_output (tp=0xdae0c720)
at /usr/src/sys/netinet/tcp_output.c:690
#5 0xc02152bf in tcp_timer_persist (xtp=0xdae0c720)
at /usr/src/sys/netinet/tcp_timer.c:363
#6 0xc01a6423 in softclock_handler (arg=0xc0386a80)
at /usr/src/sys/kern/kern_timeout.c:307
#7 0xc019d037 in lwkt_deschedule_self (td=Variable "td" is not available.
) at /usr/src/sys/kern/lwkt_thread.c:207
Previous frame inner to this frame (corrupt stack?)
The kernel and vmcore are being uploaded to leaf. The source is from March 28.
--Peter
Updated by pavalos over 17 years ago
I forgot to mention that the kernel and cores are *.3 and *.4.
Updated by dillon over 17 years ago
:New submission from Peter Avalos <pavalos@theshell.com>:
:
:Here's a panic I'm getting with some pretty serious network (www) load, then
:doing a netstat -an:
:
:Unread portion of the kernel message buffer:
:panic: m_copydata, negative off -1
:mp_lock = 00000000; cpuid = 0; lapic.id = 00000000
:boot() called on cpu#0
:
:syncing disks... 5
:done
:Uptime: 12d22h0m32s
Woa. You mean the panic occurs only when you do the netstat -an command
under heavy network load? It doesn't happen any other time?
This is a really odd crash. Somehow tp->snd_nxt has become less
than tp->snd_una, causing 'off' to be calculated as -1.
-Matt
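For readers following the arithmetic: a minimal sketch of how a negative offset can fall out of the two fields named above. It assumes the classic BSD form in which the send offset is the difference between snd_nxt and snd_una; the struct mini_tcpcb, send_offset() and sketch_self_check() names are hypothetical stand-ins for illustration, not the kernel's actual code.

```c
#include <assert.h>
#include <stdint.h>

typedef uint32_t tcp_seq;

/* Wraparound-safe "a is before b", in the style of the BSD SEQ_LT macro. */
#define SEQ_LT(a, b) ((int32_t)((a) - (b)) < 0)

/* Hypothetical, reduced stand-in for the two tcpcb fields discussed. */
struct mini_tcpcb {
	tcp_seq snd_una;	/* oldest unacknowledged sequence number */
	tcp_seq snd_nxt;	/* next sequence number to send */
};

/* Offset into the send buffer of the next byte to transmit (assumed form). */
static int32_t
send_offset(const struct mini_tcpcb *tp)
{
	return (int32_t)(tp->snd_nxt - tp->snd_una);
}

/* Exercises the healthy case, the wraparound case, and the broken case. */
static void
sketch_self_check(void)
{
	struct mini_tcpcb ok = { 1000, 1400 };
	struct mini_tcpcb wrapped = { 0xFFFFFF00u, 0x00000010u };
	struct mini_tcpcb bad = { 1000, 999 };

	/* Normal case: 400 bytes are in flight. */
	assert(send_offset(&ok) == 400);

	/* Sequence space wrapped, but snd_nxt is still ahead of snd_una. */
	assert(!SEQ_LT(wrapped.snd_nxt, wrapped.snd_una));
	assert(send_offset(&wrapped) == 0x110);

	/* snd_nxt has fallen behind snd_una: the offset goes negative. */
	assert(SEQ_LT(bad.snd_nxt, bad.snd_una));
	assert(send_offset(&bad) == -1);
}
```

The last case corresponds to the state described in the reply above: with snd_nxt one behind snd_una, the computed offset is -1, which is the value m_copydata() complains about in the panic message.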
Updated by pavalos over 17 years ago
Correct. Once I execute "netstat -an" it panics. Any ideas?
--Peter
Updated by dillon over 17 years ago
:On Wed, Apr 11, 2007 at 11:38:48AM -0700, Matthew Dillon wrote:
:>
:> Woa. You mean the panic occurs only when you do the netstat -an command
:> under heavy network load? It doesn't happen any other time?
:>
:
:Correct. Once I execute "netstat -an" it panics. Any ideas?
:
:--Peter
That's really odd. I looked at the code and found one possible
place where the field could get out of whack. Try turning off the
tcp limited transmit code:
(in /etc/sysctl.conf):
net.inet.tcp.limitedtransmit=0
and reboot to clean out any preexisting tcp connections (or otherwise
clean them out manually by killing and restarting the services).
I'm a bit at a loss as to why netstat -an would trigger the problem,
though. We do know that anything that accesses /dev/kmem heavily,
like fstat, can crash the machine while chasing down stale pointers
in kernel memory. But this panic seems a bit at odds with the sort
of crash I would expect from stale pointer chasing.
-Matt
Updated by joerg over 17 years ago
netstat -an uses a sysctl interface though.
Joerg
Updated by dillon over 17 years ago
:On Wed, Apr 11, 2007 at 03:52:30PM -0700, Matthew Dillon wrote:
:> I'm a bit at a loss as to why netstat -an would trigger the problem,
:> though. We do know that anything that accesses /dev/kmem heavily,
:> like fstat, can crash the machine while chasing down stale pointers
:> in kernel memory. But this panic seems a bit at odds with the sort
:> of crash I would expect from stale pointer chasing.
:
:netstat -an uses a sysctl interface though.
:
:Joerg
That would make more sense. I was scratching my head at how a KVM
access could cause this; a direct sysctl interface is more likely.
I don't see a whole lot in the sysctl code either, unfortunately
(e.g. tcp_pcblist() in tcp_subr.c). There is one likely possibility.
Because the sysctl is dumping its huge, huge list in one large go
and holding the big giant lock while it does it, it could be
preventing the TCP stack's callouts (which is where the panic occurred)
from running during that period. There could be a race condition there
that we are not handling properly.
So, e.g., some sort of race in softclock_handler() in kern_timeout.c
related to the acquisition of the big giant lock.
-Matt
Matthew Dillon
<dillon@backplane.com>
Updated by pavalos over 17 years ago
I have turned off the sysctl, but I'm having a hard time generating the
traffic needed that originally caused the panic. (I'm estimating it was
about 6000 simultaneous http connections, but I'm not exactly sure since
netstat wasn't working.)
--Peter
Updated by dillon over 17 years ago
:traffic needed that originally caused the panic. (I'm estimating it was
:about 6000 simultaneous http connections, but I'm not exactly sure since
:netstat wasn't working.)
:
:--Peter
Ok, for now please continue with the sysctl turned off (put it in your
/etc/sysctl.conf). That way if the crash occurs again in the future
we can discount that part of the tcp stack.
I am going to go ahead and commit a fix to the possible bug related
to the code in question.
-Matt
Matthew Dillon
<dillon@backplane.com>
Updated by pavalos almost 14 years ago
net.inet.tcp.limitedtransmit=1 definitely causes instability, even on the
latest master:
DragonFly ylem.theshell.com 2.9-DEVELOPMENT DragonFly v2.9.1.321.gadb6af-DEVELOPMENT #31: Tue Dec 21 13:12:36 HST 2010 root@ylem.theshell.com:/usr/obj/usr/src/sys/YLEM i386
Unfortunately I keep getting secondary panics, so I'm not able to get a vmcore.