Bug #599

1.9.0 reproducible panic

Added by pavalos over 7 years ago. Updated over 3 years ago.

Status: New
Start date:
Priority: Urgent
Due date:
Assignee: -
% Done: 0%
Category: -
Target version: -

Description

Here's a panic I'm getting with some pretty serious network (www) load, then
doing a netstat -an:

Unread portion of the kernel message buffer:
panic: m_copydata, negative off -1
mp_lock = 00000000; cpuid = 0; lapic.id = 00000000
boot() called on cpu#0

syncing disks... 5
done
Uptime: 12d22h0m32s

(kgdb) bt
#0 dumpsys () at thread.h:83
#1 0xc01954bb in boot (howto=256) at /usr/src/sys/kern/kern_shutdown.c:370
#2 0xc01957c0 in panic (fmt=Variable "fmt" is not available.
) at /usr/src/sys/kern/kern_shutdown.c:767
#3 0xc01c3a32 in m_copydata (m=0x0, off=0, len=0, cp=0xee9534b0 "\001\001
\b\n\006¦*$\035\bͬ") at /usr/src/sys/kern/uipc_mbuf.c:1014
#4 0xc020fc25 in tcp_output (tp=0xdae0c720)
at /usr/src/sys/netinet/tcp_output.c:690
#5 0xc02152bf in tcp_timer_persist (xtp=0xdae0c720)
at /usr/src/sys/netinet/tcp_timer.c:363
#6 0xc01a6423 in softclock_handler (arg=0xc0386a80)
at /usr/src/sys/kern/kern_timeout.c:307
#7 0xc019d037 in lwkt_deschedule_self (td=Variable "td" is not available.
) at /usr/src/sys/kern/lwkt_thread.c:207
Previous frame inner to this frame (corrupt stack?)

The kernel and vmcore are being uploaded to leaf. The source is from March 28.

--Peter

History

#1 Updated by pavalos over 7 years ago

I forgot to mention that the kernel and cores are *.3 and *.4.

#2 Updated by dillon over 7 years ago

:New submission from Peter Avalos:
:
:Here's a panic I'm getting with some pretty serious network (www) load, then
:doing a netstat -an:
:
:Unread portion of the kernel message buffer:
:panic: m_copydata, negative off -1
:mp_lock = 00000000; cpuid = 0; lapic.id = 00000000
:boot() called on cpu#0
:
:syncing disks... 5
:done
:Uptime: 12d22h0m32s

Woa. You mean the panic occurs only when you do the netstat -an command
under heavy network load? It doesn't happen any other time?

This is a really odd crash. Somehow tp->snd_nxt has become less
than tp->snd_una, causing 'off' to be calculated as -1.

-Matt

#3 Updated by pavalos over 7 years ago

Correct. Once I execute "netstat -an" it panics. Any ideas?

--Peter

#4 Updated by dillon over 7 years ago

:On Wed, Apr 11, 2007 at 11:38:48AM -0700, Matthew Dillon wrote:
:>
:> Woa. You mean the panic occurs only when you do the netstat -an command
:> under heavy network load? It doesn't happen any other time?
:>
:
:Correct. Once I execute "netstat -an" it panics. Any ideas?
:
:--Peter

That's really odd. I looked at the code and found one possible
place where the field could get out of whack. Try turning off the
tcp limited transmit code:

(in /etc/sysctl.conf):
net.inet.tcp.limitedtransmit=0

and reboot to clean out any preexisting tcp connections (or otherwise
clean them out manually by killing and restarting the services).

I'm a bit at a loss as to why netstat -an would trigger the problem,
though. We do know that anything that accesses /dev/kmem heavily,
like fstat, can crash the machine while chasing down stale pointers
in kernel memory. But this panic seems a bit at odds with the sort
of crash I would expect from stale pointer chasing.

-Matt

#5 Updated by joerg over 7 years ago

netstat -an uses a sysctl interface though.

Joerg

#6 Updated by dillon over 7 years ago

:On Wed, Apr 11, 2007 at 03:52:30PM -0700, Matthew Dillon wrote:
:> I'm a bit at a loss as to why netstat -an would trigger the problem,
:> though. We do know that anything that accesses /dev/kmem heavily,
:> like fstat, can crash the machine while chasing down stale pointers
:> in kernel memory. But this panic seems a bit at odds with the sort
:> of crash I would expect from stale pointer chasing.
:
:netstat -an uses a sysctl interface though.
:
:Joerg

That would make more sense. I was scratching my head at how a KVM
access could cause this; a direct sysctl interface is more likely.

I don't see a whole lot in the sysctl code either, unfortunately
(e.g. tcp_pcblist() in tcp_subr.c). There is one likely possibility.
Because the sysctl is dumping its huge, huge list in one large go
and holding the big giant lock while it does it, it could be
preventing the TCP stack's callouts (which is where the panic occurred)
from running during that period. There could be a race condition there
that we are not handling properly.

So, e.g., there could be some sort of race in softclock_handler() in
kern_timeout.c related to the acquisition of the big giant lock.

-Matt
Matthew Dillon

#7 Updated by dillon over 7 years ago

Any luck turning off that sysctl?

-Matt

#8 Updated by pavalos over 7 years ago

I have turned off the sysctl, but I'm having a hard time generating the
traffic needed that originally caused the panic. (I'm estimating it was
about 6000 simultaneous http connections, but I'm not exactly sure since
netstat wasn't working.)

--Peter

#9 Updated by dillon over 7 years ago

:traffic needed that originally caused the panic. (I'm estimating it was
:about 6000 simultaneous http connections, but I'm not exactly sure since
:netstat wasn't working.)
:
:--Peter

Ok, for now please continue with the sysctl turned off (put it in your
/etc/sysctl.conf). That way if the crash occurs again in the future
we can discount that part of the tcp stack.

I am going to go ahead and commit a fix to the possible bug related
to the code in question.

-Matt
Matthew Dillon

#10 Updated by pavalos over 3 years ago

net.inet.tcp.limitedtransmit=1 definitely causes instability, even on the latest
master:

DragonFly ylem.theshell.com 2.9-DEVELOPMENT DragonFly v2.9.1.321.gadb6af-DEVELOPMENT
#31: Tue Dec 21 13:12:36 HST 2010
:/usr/obj/usr/src/sys/YLEM i386

Unfortunately I keep getting secondary panics, so I'm not able to get a vmcore.
