Issue599

Title 1.9.0 reproducible panic
Priority: critical    Status: resolved
Superseder:           Nosy List: pavalos
Assigned To:          Keywords:

Created on 2007-04-11.03:24:26 by pavalos, last changed by justin.

Messages
msg2519 (view) Author: dillon Date: 2007-04-17.17:33:01
:traffic needed that originally caused the panic.  (I'm estimating it was
:about 6000 simultaneous http connections, but I'm not exactly sure since
:netstat wasn't working.)
:
:--Peter

    Ok, for now please continue with the sysctl turned off (put it in your
    /etc/sysctl.conf).  That way if the crash occurs again in the future
    we can discount that part of the tcp stack.

    I am going to go ahead and commit a fix to the possible bug related
    to the code in question.

					-Matt
					Matthew Dillon 
					<dillon@backplane.com>
msg2518 (view) Author: pavalos Date: 2007-04-16.22:50:01
I have turned off the sysctl, but I'm having a hard time generating the
traffic needed that originally caused the panic.  (I'm estimating it was
about 6000 simultaneous http connections, but I'm not exactly sure since
netstat wasn't working.)

--Peter
msg2517 (view) Author: dillon Date: 2007-04-16.17:42:02
Any luck turning off that sysctl?

						-Matt
msg2511 (view) Author: dillon Date: 2007-04-12.17:19:01
:On Wed, Apr 11, 2007 at 03:52:30PM -0700, Matthew Dillon wrote:
:>     I'm a bit at a loss as to why netstat -an would trigger the problem,
:>     though.  We do know that anything that accesses /dev/kmem heavily,
:>     like fstat, can crash the machine while chasing down stale pointers
:>     in kernel memory.  But this panic seems a bit at odds with the sort
:>     of crash I would expect from stale pointer chasing.
:
:netstat -an uses a sysctl interface though.
:
:Joerg

    That would make more sense.  I was scratching my head at how a KVM
    access could cause this; a direct sysctl interface is more likely.

    I don't see a whole lot in the sysctl code either, unfortunately.
    e.g. tcp_pcblist() in tcp_subr.c.  There is one likely possibility.
    Because the sysctl is dumping its huge, huge list in one large go
    and holding the big giant lock while it does it, it could be
    preventing the TCP stack's callouts (which is where the panic occurred)
    from running during that period.  There could be a race condition there
    that we are not handling properly.

    So, e.g., some sort of race in softclock_handler() in kern_timeout.c
    related to the acquisition of the big giant lock.

					-Matt
					Matthew Dillon 
					<dillon@backplane.com>
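
The deferral described above can be pictured with a toy model. This is only a sketch under stated assumptions, not DragonFly code: time is counted in ticks, one long-running path holds a global lock for a range of ticks, and a callout may only run on a tick where the lock is free. The names `callout_run_tick` and `callout_delay` are hypothetical. It illustrates only how a long lock hold pushes a callout past its due time; it does not reproduce the suspected race itself.

```c
#include <assert.h>

/*
 * Toy model (hypothetical, not kernel code).  A callout is due at tick
 * `due`.  Another code path holds a global lock during the half-open
 * tick interval [hold_start, hold_end).  The callout can only run on a
 * tick where the lock is free.
 *
 * Returns the tick on which the callout actually runs.
 */
int
callout_run_tick(int due, int hold_start, int hold_end)
{
	assert(hold_start <= hold_end);
	if (due >= hold_start && due < hold_end)
		return hold_end;	/* deferred until the lock is released */
	return due;			/* lock was free: runs on time */
}

/* Number of ticks by which the callout fires late. */
int
callout_delay(int due, int hold_start, int hold_end)
{
	return callout_run_tick(due, hold_start, hold_end) - due;
}
```

In this model a callout due at tick 12 while the lock is held over ticks 10 through 19 runs 8 ticks late, whereas one due at tick 5 is unaffected. The longer the dump holds the lock, the larger the window in which timer state and connection state can drift apart.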
msg2508 (view) Author: joerg Date: 2007-04-12.05:57:01
netstat -an uses a sysctl interface though.

Joerg
msg2498 (view) Author: dillon Date: 2007-04-11.22:56:00
:On Wed, Apr 11, 2007 at 11:38:48AM -0700, Matthew Dillon wrote:
:>
:>     Woa.  You mean the panic occurs only when you do the netstat -an command
:>     under heavy network load?  It doesn't happen any other time?
:>
:
:Correct.  Once I execute "netstat -an" it panics.  Any ideas?
:
:--Peter

    That's really odd.  I looked at the code and found one possible
    place where the field could get out of whack.  Try turning off the
    tcp limited transmit code:

    (in /etc/sysctl.conf):
    net.inet.tcp.limitedtransmit=0

    and reboot to clean out any preexisting tcp connections (or otherwise
    clean them out manually by killing and restarting the services).

    I'm a bit at a loss as to why netstat -an would trigger the problem,
    though.  We do know that anything that accesses /dev/kmem heavily,
    like fstat, can crash the machine while chasing down stale pointers
    in kernel memory.  But this panic seems a bit at odds with the sort
    of crash I would expect from stale pointer chasing.

						-Matt
msg2497 (view) Author: pavalos Date: 2007-04-11.22:24:01
Correct.  Once I execute "netstat -an" it panics.  Any ideas?

--Peter
msg2494 (view) Author: dillon Date: 2007-04-11.18:43:01
:New submission from Peter Avalos <pavalos@theshell.com>:
:
:Here's a panic I'm getting with some pretty serious network (www) load, then 
:doing a netstat -an:
:
:Unread portion of the kernel message buffer:
:panic: m_copydata, negative off -1
:mp_lock = 00000000; cpuid = 0; lapic.id = 00000000
:boot() called on cpu#0
:
:syncing disks... 5
:done
:Uptime: 12d22h0m32s

    Woa.  You mean the panic occurs only when you do the netstat -an command
    under heavy network load?  It doesn't happen any other time?

    This is a really odd crash.  Somehow tp->snd_nxt has become less
    than tp->snd_una, causing 'off' to be calculated as -1.

 	 				-Matt
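
For readers following along, here is a minimal userland sketch of how the offset can come out as -1. It assumes the offset is derived the way BSD-style tcp_output() code traditionally does it, as the difference between snd_nxt and snd_una; the struct and function names below are hypothetical stand-ins, not the kernel's definitions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the two tcpcb fields discussed above. */
struct tcp_seq_state {
	uint32_t snd_una;	/* oldest unacknowledged sequence number */
	uint32_t snd_nxt;	/* next sequence number to send */
};

/*
 * Offset into the send buffer of the next byte to transmit: an
 * unsigned 32-bit subtraction whose result is read as a signed value.
 * Sequence-number wraparound is harmless here because the subtraction
 * is done modulo 2^32.
 */
int
send_offset(const struct tcp_seq_state *tp)
{
	assert(tp != NULL);
	return (int)(int32_t)(tp->snd_nxt - tp->snd_una);
}

/* Returns 1 if this state would trip a "negative off" sanity check. */
int
would_panic(const struct tcp_seq_state *tp)
{
	return send_offset(tp) < 0;
}
```

With snd_una = 1000 and snd_nxt = 1500 the offset is 500. If snd_nxt ends up one behind snd_una (999 versus 1000), the offset becomes -1, which matches the value in the panic message.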
msg2486 (view) Author: pavalos Date: 2007-04-11.06:04:56
I forgot to mention that the kernel and cores are *.3 and *.4.
msg2485 (view) Author: pavalos Date: 2007-04-11.03:24:22
Here's a panic I'm getting with some pretty serious network (www) load, then 
doing a netstat -an:

Unread portion of the kernel message buffer:
panic: m_copydata, negative off -1
mp_lock = 00000000; cpuid = 0; lapic.id = 00000000
boot() called on cpu#0

syncing disks... 5
done
Uptime: 12d22h0m32s

(kgdb) bt
#0  dumpsys () at thread.h:83
#1  0xc01954bb in boot (howto=256) at /usr/src/sys/kern/kern_shutdown.c:370
#2  0xc01957c0 in panic (fmt=Variable "fmt" is not available.
) at /usr/src/sys/kern/kern_shutdown.c:767
#3  0xc01c3a32 in m_copydata (m=0x0, off=0, len=0, cp=0xee9534b0 "\001\001
\b\n\006¦*$\035\bͬ") at /usr/src/sys/kern/uipc_mbuf.c:1014
#4  0xc020fc25 in tcp_output (tp=0xdae0c720) 
at /usr/src/sys/netinet/tcp_output.c:690
#5  0xc02152bf in tcp_timer_persist (xtp=0xdae0c720) 
at /usr/src/sys/netinet/tcp_timer.c:363
#6  0xc01a6423 in softclock_handler (arg=0xc0386a80) 
at /usr/src/sys/kern/kern_timeout.c:307
#7  0xc019d037 in lwkt_deschedule_self (td=Variable "td" is not available.
) at /usr/src/sys/kern/lwkt_thread.c:207
Previous frame inner to this frame (corrupt stack?)

The kernel and vmcore are being uploaded to leaf.  The source is from March 28.

--Peter
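
As background on the panic string itself, the following is a simplified userland sketch of an m_copydata-style routine. It is an assumption-laden illustration, not the code in uipc_mbuf.c: `struct toy_mbuf` and `toy_copydata` are hypothetical, and the routine returns -1 in the cases where a kernel implementation would typically panic.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical, stripped-down stand-in for a kernel mbuf. */
struct toy_mbuf {
	struct toy_mbuf *next;	/* next buffer in the chain */
	const char *data;	/* payload of this buffer */
	int len;		/* number of payload bytes */
};

/*
 * Copy `len` bytes, starting `off` bytes into the chain, to `dst`.
 * Returns 0 on success and -1 on a negative offset or length, or when
 * the chain is shorter than requested.
 */
int
toy_copydata(const struct toy_mbuf *m, int off, int len, char *dst)
{
	assert(dst != NULL);
	if (off < 0 || len < 0)
		return -1;	/* the "negative off" case */

	/* Skip whole buffers that lie entirely before the offset. */
	while (m != NULL && off >= m->len) {
		off -= m->len;
		m = m->next;
	}

	/* Copy buffer by buffer until the request is satisfied. */
	while (len > 0) {
		int count;

		if (m == NULL)
			return -1;	/* ran off the end of the chain */
		count = m->len - off;
		if (count > len)
			count = len;
		memcpy(dst, m->data + off, (size_t)count);
		dst += count;
		len -= count;
		off = 0;
		m = m->next;
	}
	return 0;
}
```

The point of the up-front check is that a negative offset can never be satisfied by walking the chain, so the routine rejects it before touching any buffer. The caller that computed the offset (frame #4, tcp_output) is therefore where the bad value originated.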
History
Date                 User     Action  Args
2007-05-19 00:45:26  justin   set     status: done-cbb -> resolved
2007-05-19 00:44:57  justin   set     status: chatting -> done-cbb
2007-04-17 17:33:03  dillon   set     messages: + msg2519
2007-04-16 22:50:03  pavalos  set     messages: + msg2518
2007-04-16 17:42:02  dillon   set     messages: + msg2517
2007-04-12 17:19:04  dillon   set     messages: + msg2511
2007-04-12 05:57:01  joerg    set     messages: + msg2508
2007-04-11 22:56:03  dillon   set     messages: + msg2498
2007-04-11 22:24:01  pavalos  set     messages: + msg2497
2007-04-11 18:43:01  dillon   set     messages: + msg2494
2007-04-11 06:04:58  pavalos  set     status: unread -> chatting; messages: + msg2486
2007-04-11 03:24:26  pavalos  create