1.9.0 reproducible panic
Here's a panic I'm getting with some pretty serious network (www) load, then
doing a netstat -an:
Unread portion of the kernel message buffer:
panic: m_copydata, negative off -1
mp_lock = 00000000; cpuid = 0; lapic.id = 00000000
boot() called on cpu#0
syncing disks... 5
#0 dumpsys () at thread.h:83
#1 0xc01954bb in boot (howto=256) at /usr/src/sys/kern/kern_shutdown.c:370
#2 0xc01957c0 in panic (fmt=Variable "fmt" is not available.
) at /usr/src/sys/kern/kern_shutdown.c:767
#3 0xc01c3a32 in m_copydata (m=0x0, off=0, len=0, cp=0xee9534b0 "\001\001
\b\n\006¦*$\035\bÍ¬") at /usr/src/sys/kern/uipc_mbuf.c:1014
#4 0xc020fc25 in tcp_output (tp=0xdae0c720)
#5 0xc02152bf in tcp_timer_persist (xtp=0xdae0c720)
#6 0xc01a6423 in softclock_handler (arg=0xc0386a80)
#7 0xc019d037 in lwkt_deschedule_self (td=Variable "td" is not available.
) at /usr/src/sys/kern/lwkt_thread.c:207
Previous frame inner to this frame (corrupt stack?)
The kernel and vmcore are being uploaded to leaf. The source is from March 28.
#2 Updated by dillon over 9 years ago
:New submission from Peter Avalos <firstname.lastname@example.org>:
:Here's a panic I'm getting with some pretty serious network (www) load, then
:doing a netstat -an:
:Unread portion of the kernel message buffer:
:panic: m_copydata, negative off -1
:mp_lock = 00000000; cpuid = 0; lapic.id = 00000000
:boot() called on cpu#0
:syncing disks... 5
Woa. You mean the panic occurs only when you do the netstat -an command
under heavy network load? It doesn't happen any other time?
This is a really odd crash. Somehow tp->snd_nxt has become less
than tp->snd_una, causing 'off' to be calculated as -1.
#4 Updated by dillon over 9 years ago
:On Wed, Apr 11, 2007 at 11:38:48AM -0700, Matthew Dillon wrote:
:> Woa. You mean the panic occurs only when you do the netstat -an command
:> under heavy network load? It doesn't happen any other time?
:Correct. Once I execute "netstat -an" it panics. Any ideas?
That's really odd. I looked at the code and found one possible
place where the field could get out of whack. Try turning off the
tcp limited transmit code:
    sysctl net.inet.tcp.limitedtransmit=0
and reboot to clean out any preexisting tcp connections (or otherwise
clean them out manually by killing and restarting the services).
I'm a bit at a loss as to why netstat -an would trigger the problem,
though. We do know that anything that accesses /dev/kmem heavily,
like fstat, can crash the machine while chasing down stale pointers
in kernel memory. But this panic seems a bit at odds with the sort
of crash I would expect from stale pointer chasing.
#6 Updated by dillon over 9 years ago
:On Wed, Apr 11, 2007 at 03:52:30PM -0700, Matthew Dillon wrote:
:> I'm a bit at a loss as to why netstat -an would trigger the problem,
:> though. We do know that anything that accesses /dev/kmem heavily,
:> like fstat, can crash the machine while chasing down stale pointers
:> in kernel memory. But this panic seems a bit at odds with the sort
:> of crash I would expect from stale pointer chasing.
:netstat -an uses a sysctl interface though.
That would make more sense. I was scratching my head at how a KVM
access could cause this; a direct sysctl interface is more likely.
I don't see a whole lot in the sysctl code (e.g. tcp_pcblist() in
tcp_subr.c) either, unfortunately. There is one likely possibility.
Because the sysctl is dumping its huge, huge list in one large go
and holding the big giant lock while it does it, it could be
preventing the TCP stack's callouts (which is where the panic occurred)
from running during that period. There could be a race condition there
that we are not handling properly, e.g. some sort of race in
softclock_handler() in kern_timeout.c related to the acquisition of
the big giant lock.
#9 Updated by dillon over 9 years ago
:traffic needed that originally caused the panic. (I'm estimating it was
:about 6000 simultaneous http connections, but I'm not exactly sure since
:netstat wasn't working.)
Ok, for now please continue with the sysctl turned off (put it in your
/etc/sysctl.conf). That way if the crash occurs again in the future
we can discount that part of the tcp stack.
I am going to go ahead and commit a fix to the possible bug related
to the code in question.
#10 Updated by pavalos over 5 years ago
net.inet.tcp.limitedtransmit=1 definitely causes instability, even on a latest
Unfortunately I keep getting secondary panics, so I'm not able to get a vmcore.