Bug #599: 1.9.0 reproducable panic - DragonFlyBSD - DragonFlyBSD bugtracker

Bug #599

open

1.9.0 reproducable panic

Added by pavalos almost 19 years ago. Updated over 15 years ago.

Status: New
Priority: Urgent
Target version: 6.4

Description

Here's a panic I'm getting with some pretty serious network (www) load, then
doing a netstat -an:

Unread portion of the kernel message buffer:
panic: m_copydata, negative off -1
mp_lock = 00000000; cpuid = 0; lapic.id = 00000000
boot() called on cpu#0

syncing disks... 5
done
Uptime: 12d22h0m32s

(kgdb) bt
#0 dumpsys () at thread.h:83
#1 0xc01954bb in boot (howto=256) at /usr/src/sys/kern/kern_shutdown.c:370
#2 0xc01957c0 in panic (fmt=Variable "fmt" is not available.
) at /usr/src/sys/kern/kern_shutdown.c:767
#3 0xc01c3a32 in m_copydata (m=0x0, off=0, len=0, cp=0xee9534b0 "\001\001
\b\n\006¦*$\035\bÍ¬") at /usr/src/sys/kern/uipc_mbuf.c:1014
#4 0xc020fc25 in tcp_output (tp=0xdae0c720)
at /usr/src/sys/netinet/tcp_output.c:690
#5 0xc02152bf in tcp_timer_persist (xtp=0xdae0c720)
at /usr/src/sys/netinet/tcp_timer.c:363
#6 0xc01a6423 in softclock_handler (arg=0xc0386a80)
at /usr/src/sys/kern/kern_timeout.c:307
#7 0xc019d037 in lwkt_deschedule_self (td=Variable "td" is not available.
) at /usr/src/sys/kern/lwkt_thread.c:207
Previous frame inner to this frame (corrupt stack?)

The kernel and vmcore are being uploaded to leaf. The source is from March 28.

--Peter

Updated by pavalos almost 19 years ago

I forgot to mention that the kernel and cores are *.3 and *.4.

Updated by dillon almost 19 years ago

:New submission from Peter Avalos <pavalos@theshell.com>:
:
:Here's a panic I'm getting with some pretty serious network (www) load, then
:doing a netstat -an:
:
:Unread portion of the kernel message buffer:
:panic: m_copydata, negative off -1
:mp_lock = 00000000; cpuid = 0; lapic.id = 00000000
:boot() called on cpu#0
:
:syncing disks... 5
:done
:Uptime: 12d22h0m32s

Woa.  You mean the panic occurs only when you do the netstat -an command
    under heavy network load?  It doesn't happen any other time?

This is a really odd crash.  Somehow tp->snd_nxt has become less
    than tp->snd_una, causing 'off' to be calculated as -1.

-Matt
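
To make the arithmetic above concrete, here is a minimal sketch of how an offset of -1 can arise, assuming (as in BSD-derived stacks) that tcp_output() derives 'off' by subtracting snd_una from snd_nxt. The names struct tcpcb_sketch, seq_lt(), send_offset(), offset_is_valid() and send_offset_demo() are hypothetical stand-ins for illustration only, not the kernel's actual definitions.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical, cut-down stand-in for the kernel's TCP control block. */
typedef uint32_t tcp_seq;

struct tcpcb_sketch {
    tcp_seq snd_una;    /* oldest unacknowledged sequence number */
    tcp_seq snd_nxt;    /* next sequence number to be sent */
};

/* Wraparound-safe "a < b" for 32-bit sequence numbers. */
static int seq_lt(tcp_seq a, tcp_seq b)
{
    return (int32_t)(a - b) < 0;
}

/* Offset into the send buffer where the next segment would start. */
static int32_t send_offset(const struct tcpcb_sketch *tp)
{
    return (int32_t)(tp->snd_nxt - tp->snd_una);
}

/* The kind of sanity check whose failure is reported as "negative off". */
static int offset_is_valid(int32_t off)
{
    return off >= 0;
}

/* Exercises one healthy and one inconsistent connection state. */
void send_offset_demo(void)
{
    struct tcpcb_sketch ok  = { .snd_una = 1000, .snd_nxt = 1460 };
    struct tcpcb_sketch bad = { .snd_una = 1000, .snd_nxt = 999 };

    /* Healthy state: snd_nxt is at or ahead of snd_una. */
    assert(!seq_lt(ok.snd_nxt, ok.snd_una));
    assert(send_offset(&ok) == 460);

    /* Inconsistent state: snd_nxt is one behind snd_una, so off is -1. */
    assert(seq_lt(bad.snd_nxt, bad.snd_una));
    assert(send_offset(&bad) == -1);
    assert(!offset_is_valid(send_offset(&bad)));
}
```

Calling send_offset_demo() from a test program runs both cases. The second case corresponds to the state described above: snd_nxt trails snd_una by one, so the computed offset is exactly the -1 reported in the panic message.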

Updated by pavalos almost 19 years ago

Correct. Once I execute "netstat -an" it panics. Any ideas?

--Peter

Updated by dillon almost 19 years ago

:On Wed, Apr 11, 2007 at 11:38:48AM -0700, Matthew Dillon wrote:
:>
:> Woa. You mean the panic occurs only when you do the netstat -an command
:> under heavy network load? It doesn't happen any other time?
:>
:
:Correct. Once I execute "netstat -an" it panics. Any ideas?
:
:--Peter

That's really odd.  I looked at the code and found one possible
    place where the field could get out of whack.  Try turning off the
    tcp limited transmit code:

(in /etc/sysctl.conf):
    net.inet.tcp.limitedtransmit=0

and reboot to clean out any preexisting tcp connections (or otherwise
    clean them out manually by killing and restarting the services).

I'm a bit at a loss as to why netstat -an would trigger the problem,
    though.  We do know that anything that accesses /dev/kmem heavily,
    like fstat, can crash the machine while chasing down stale pointers
    in kernel memory.  But this panic seems a bit at odds with the sort
    of crash I would expect from stale pointer chasing.

-Matt

Updated by joerg almost 19 years ago

netstat -an uses a sysctl interface though.

Joerg

Updated by dillon almost 19 years ago

:On Wed, Apr 11, 2007 at 03:52:30PM -0700, Matthew Dillon wrote:
:> I'm a bit at a loss as to why netstat -an would trigger the problem,
:> though. We do know that anything that accesses /dev/kmem heavily,
:> like fstat, can crash the machine while chasing down stale pointers
:> in kernel memory. But this panic seems a bit at odds with the sort
:> of crash I would expect from stale pointer chasing.
:
:netstat -an uses a sysctl interface though.
:
:Joerg

That would make more sense.  I was scratching my head at how a KVM
    access could cause this; a direct sysctl interface is more likely.

I don't see a whole lot in the sysctl code either, unfortunately.
    e.g. tcp_pcblist() in tcp_subr.c.  There is one likely possibility.
    Because the sysctl is dumping its huge, huge list in one large go
    and holding the big giant lock while it does it, it could be
    preventing the TCP stack's callouts (which is where the panic occurred)
    from running during that period.  There could be a race condition there
    that we are not handling properly.

So, e.g., some sort of race in softclock_handler() in kern_timeout.c
    related to the acquisition of the big giant lock.

-Matt
                    Matthew Dillon 
                    <dillon@backplane.com>

Updated by dillon almost 19 years ago

Any luck turning off that sysctl?

-Matt

Updated by pavalos almost 19 years ago

I have turned off the sysctl, but I'm having a hard time generating the
traffic needed that originally caused the panic. (I'm estimating it was
about 6000 simultaneous http connections, but I'm not exactly sure since
netstat wasn't working.)

--Peter

Updated by dillon almost 19 years ago

:traffic needed that originally caused the panic. (I'm estimating it was
:about 6000 simultaneous http connections, but I'm not exactly sure since
:netstat wasn't working.)
:
:--Peter

Ok, for now please continue with the sysctl turned off (put it in your
    /etc/sysctl.conf).  That way if the crash occurs again in the future
    we can discount that part of the tcp stack.

I am going to go ahead and commit a fix to the possible bug related
    to the code in question.

-Matt
                    Matthew Dillon 
                    <dillon@backplane.com>

Updated by pavalos over 15 years ago

net.inet.tcp.limitedtransmit=1 definitely causes instability, even on the latest
master:

DragonFly ylem.theshell.com 2.9-DEVELOPMENT DragonFly v2.9.1.321.gadb6af-DEVELOPMENT #31: Tue Dec 21 13:12:36 HST 2010
root@ylem.theshell.com:/usr/obj/usr/src/sys/YLEM i386

Unfortunately I keep getting secondary panics, so I'm not able to get a vmcore.
