Bug #937

tcp_sack related panic

Added by pavalos about 6 years ago. Updated about 5 years ago.

Status: Closed
Start date:
Priority: Normal
Due date:
Assignee: -
% Done: 0%
Category: -
Target version: -

Description

Fatal trap 12: page fault while in kernel mode
mp_lock = 00000000; cpuid = 0; lapic.id = 00000000
fault virtual address = 0x4
fault code = supervisor read, page not present
instruction pointer = 0x8:0xc0233d36
stack pointer = 0x10:0xdaa45a70
frame pointer = 0x10:0xdaa45a80
code segment = base 0x0, limit 0xfffff, type 0x1b
= DPL 0, pres 1, def32 1, gran 1
processor eflags = interrupt enabled, resume, IOPL = 0
current process = Idle
current thread = pri 12
<- SMP: XXX
trap number = 12
panic: page fault
mp_lock = 00000000; cpuid = 0
boot() called on cpu#0
Uptime: 3d11h5m38s

dumping to dev #da/0x20001, blockno 378927

(kgdb) bt
#0 dumpsys () at ./machine/thread.h:83
#1 0xc01a2ea9 in boot (howto=260) at /usr/src/sys/kern/kern_shutdown.c:375
#2 0xc01a316c in panic (fmt=0xc033781c "%s") at /usr/src/sys/kern/kern_shutdown.c:800
#3 0xc0310a61 in trap_fatal (frame=0xdaa45a28, eva=<value optimized out>) at /usr/src/sys/platform/pc32/i386/trap.c:1102
#4 0xc0310b9b in trap_pfault (frame=0xdaa45a28, usermode=0, eva=4) at /usr/src/sys/platform/pc32/i386/trap.c:1003
#5 0xc0311198 in trap (frame=0xdaa45a28) at /usr/src/sys/platform/pc32/i386/trap.c:686
#6 0xc02fe396 in calltrap () at /usr/src/sys/platform/pc32/i386/exception.s:783
#7 0xc0233d36 in sack_block_lookup (scb=0xdace6b0c, seq=1554912228, sb=0xdaa45a90) at /usr/src/sys/netinet/tcp_sack.c:128
#8 0xc0233eda in tcp_sack_nextseg (tp=0xdace6a20, nextrexmt=0xdaa45ad0, plen=0xdaa45ad4, lostdup=0xdaa45acc) at /usr/src/sys/netinet/tcp_sack.c:496
#9 0xc022f603 in tcp_sack_rexmt (tp=0xdace6a20, th=<value optimized out>) at /usr/src/sys/netinet/tcp_input.c:3154
#10 0xc0231aca in tcp_input (m=0xee2c5a00) at /usr/src/sys/netinet/tcp_input.c:1981
#11 0xc0229ae2 in transport_processing_oncpu (m=0xee2c5a00, hlen=20, ip=<value optimized out>, nexthop=0x0) at /usr/src/sys/netinet/ip_input.c:391
#12 0xc022bae0 in ip_input (m=0xee2c5a00) at /usr/src/sys/netinet/ip_input.c:1092
#13 0xc022bbb4 in ip_input_handler (msg0=0xee2c5a18) at /usr/src/sys/netinet/ip_input.c:421
#14 0xc0235653 in tcpmsg_service_loop (dummy=0x0) at /usr/src/sys/netinet/tcp_subr.c:385
#15 0xc01a9fa5 in lwkt_deschedule_self (td=Cannot access memory at address 0x8
) at /usr/src/sys/kern/lwkt_thread.c:214
Backtrace stopped: previous frame inner to this frame (corrupt stack?)

# uname -a
DragonFly ylem.theshell.com 1.11.0-DEVELOPMENT DragonFly 1.11.0-DEVELOPMENT #11: Mon Jan 28 18:13:59 EST 2008 :/usr/obj/usr/src/sys/YLEM i386

History

#1 Updated by dillon about 6 years ago

:#6 0xc02fe396 in calltrap () at /usr/src/sys/platform/pc32/i386/exception.s:783
:#7 0xc0233d36 in sack_block_lookup (scb=0xdace6b0c, seq=1554912228, sb=0xdaa45a90) at /usr/src/sys/netinet/tcp_sack.c:128
:#8 0xc0233eda in tcp_sack_nextseg (tp=0xdace6a20, nextrexmt=0xdaa45ad0, plen=0xdaa45ad4, lostdup=0xdaa45acc) at /usr/src/sys/netinet/tcp_sack.c:496
:#9 0xc022f603 in tcp_sack_rexmt (tp=0xdace6a20, th=<value optimized ou

Hmm. I see two places where a node is removed from the sackblocks list
but lastfound is not cleared on match. I don't know if this is the
issue but it's the most obvious from looking at the failure.

I'll commit this tomorrow if no new developments come up.

-Matt
Matthew Dillon

Index: tcp_sack.c
===================================================================
RCS file: /cvs/src/sys/netinet/tcp_sack.c,v
retrieving revision 1.6
diff -u -p -r1.6 tcp_sack.c
--- tcp_sack.c 22 Apr 2007 01:13:14 -0000 1.6
+++ tcp_sack.c 3 Feb 2008 01:32:16 -0000
@@ -176,7 +176,7 @@
sb = TAILQ_FIRST(&scb->sackblocks);
while (sb && SEQ_LEQ(sb->sblk_end, th_ack)) {
nb = TAILQ_NEXT(sb, sblk_list);
- if (sb == scb->lastfound)
+ if (scb->lastfound == sb)
scb->lastfound = NULL;
TAILQ_REMOVE(&scb->sackblocks, sb, sblk_list);
free_sackblock(sb);
@@ -334,6 +334,8 @@ SEQ_GEQ(workingblock->sblk_end, sb-
struct sackblock *nextblock;

nextblock = TAILQ_NEXT(sb, sblk_list);
+ if (scb->lastfound == sb)
+ scb->lastfound = NULL;
/* Remove completely overlapped block */
TAILQ_REMOVE(&scb->sackblocks, sb, sblk_list);
free_sackblock(sb);
@@ -346,6 +348,8 @@ if (sb != NULL &&
SEQ_GEQ(workingblock->sblk_end, sb->sblk_start)) {
/* Extend new block to cover partially overlapped old block. */
workingblock->sblk_end = sb->sblk_end;
+ if (scb->lastfound == sb)
+ scb->lastfound = NULL;
TAILQ_REMOVE(&scb->sackblocks, sb, sblk_list);
free_sackblock(sb);
--scb->nblocks;
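
For readers following along, here is a minimal userland sketch of the pattern the patch enforces: whenever a block is unlinked from the list, a cached lookup hint that points at it has to be cleared first, otherwise the next lookup starts from freed memory. This is an illustration only, not the kernel code. The struct layouts and the scoreboard_* helper names are hypothetical; only sackblocks, lastfound, nblocks, sblk_start, sblk_end and sblk_list are taken from the diff above. It assumes a <sys/queue.h> that provides the TAILQ macros, and it uses plain integer comparisons where the kernel uses sequence-space macros such as SEQ_LEQ.

```c
#include <assert.h>
#include <stdlib.h>
#include <sys/queue.h>

/* Hypothetical, simplified stand-ins for the kernel structures. */
struct sackblock {
	unsigned int sblk_start;
	unsigned int sblk_end;
	TAILQ_ENTRY(sackblock) sblk_list;
};

struct scoreboard {
	int nblocks;
	TAILQ_HEAD(sackblock_list, sackblock) sackblocks;
	struct sackblock *lastfound;	/* lookup hint, may be NULL */
};

static void
scoreboard_init(struct scoreboard *scb)
{
	scb->nblocks = 0;
	TAILQ_INIT(&scb->sackblocks);
	scb->lastfound = NULL;
}

/* Append a block covering [start, end) and return it. */
static struct sackblock *
scoreboard_add(struct scoreboard *scb, unsigned int start, unsigned int end)
{
	struct sackblock *sb = malloc(sizeof(*sb));

	assert(sb != NULL);
	sb->sblk_start = start;
	sb->sblk_end = end;
	TAILQ_INSERT_TAIL(&scb->sackblocks, sb, sblk_list);
	++scb->nblocks;
	return sb;
}

/* Unlink and free a block, dropping the hint if it points at the block. */
static void
scoreboard_remove(struct scoreboard *scb, struct sackblock *sb)
{
	if (scb->lastfound == sb)
		scb->lastfound = NULL;
	TAILQ_REMOVE(&scb->sackblocks, sb, sblk_list);
	free(sb);
	--scb->nblocks;
}

/*
 * Find the block containing seq. The scan starts at the hint when the
 * hint is set and not already past seq, otherwise at the list head.
 * Sequence wraparound is ignored in this sketch.
 */
static struct sackblock *
scoreboard_lookup(struct scoreboard *scb, unsigned int seq)
{
	struct sackblock *sb = scb->lastfound;

	if (sb == NULL || seq < sb->sblk_start)
		sb = TAILQ_FIRST(&scb->sackblocks);
	for (; sb != NULL; sb = TAILQ_NEXT(sb, sblk_list)) {
		if (seq >= sb->sblk_start && seq < sb->sblk_end) {
			scb->lastfound = sb;
			return sb;
		}
	}
	return NULL;
}
```

Routing every removal through one helper like scoreboard_remove() is a design choice of this sketch. The patch instead adds the check at each removal site.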

#2 Updated by pavalos about 6 years ago

Also just got this with the same sources:

panic: zone: freeing free entry
mp_lock = 00000000; cpuid = 0
boot() called on cpu#0
Uptime: 1d11h35m59s

dumping to dev #da/0x20001, blockno 378927

(kgdb) bt
#0 dumpsys () at ./machine/thread.h:83
#1 0xc01a2ea9 in boot (howto=260) at /usr/src/sys/kern/kern_shutdown.c:375
#2 0xc01a316c in panic (fmt=0xc034328a "zone: freeing free entry") at /usr/src/sys/kern/kern_shutdown.c:800
#3 0xc02a6aa8 in zerror (error=2) at /usr/src/sys/vm/vm_zone.c:567
#4 0xc02a6ff5 in zfree (z=0xd7049438, item=0xdb991760) at /usr/src/sys/vm/vm_zone.c:98
#5 0xc02341ac in tcp_sack_update_scoreboard (tp=0xdad397c0, to=0xdaa45be8) at /usr/src/sys/netinet/tcp_sack.c:165
#6 0xc02318d9 in tcp_input (m=0xeb7df200) at /usr/src/sys/netinet/tcp_input.c:1900
#7 0xc0229ae2 in transport_processing_oncpu (m=0xeb7df200, hlen=20, ip=<value optimized out>, nexthop=0x0) at /usr/src/sys/netinet/ip_input.c:391
#8 0xc022bae0 in ip_input (m=0xeb7df200) at /usr/src/sys/netinet/ip_input.c:1092
#9 0xc022bbb4 in ip_input_handler (msg0=0xeb7df218) at /usr/src/sys/netinet/ip_input.c:421
#10 0xc0235653 in tcpmsg_service_loop (dummy=0x0) at /usr/src/sys/netinet/tcp_subr.c:385
#11 0xc01a9fa5 in lwkt_deschedule_self (td=Cannot access memory at address 0x8
) at /usr/src/sys/kern/lwkt_thread.c:214
Backtrace stopped: previous frame inner to this frame (corrupt stack?)

Do you think it's the same problem?

#3 Updated by pavalos about 6 years ago

FYI, the vmcores are on leaf:~pavalos/crash. The first one is *12 and
the 2nd is *13.

--Peter

#4 Updated by dillon about 6 years ago

:Also just got this with the same sources:
:
:panic: zone: freeing free entry
:mp_lock = 00000000; cpuid = 0
:boot() called on cpu#0
:Uptime: 1d11h35m59s
:...
:#3 0xc02a6aa8 in zerror (error=2) at /usr/src/sys/vm/vm_zone.c:567
:#4 0xc02a6ff5 in zfree (z=0xd7049438, item=0xdb991760) at /usr/src/sys/vm/vm_zone.c:98
:#5 0xc02341ac in tcp_sack_update_scoreboard (tp=0xdad397c0, to=0xdaa45be8) at /usr/src/sys/netinet/tcp_sack.c:165
:#6 0xc02318d9 in tcp_input (m=0xeb7df200) at /usr/src/sys/netinet/tcp_input.c:1900
:#7 0xc0229ae2 in transport_processing_oncpu (m=0xeb7df200, hlen=20, ip=
:
:Do you think it's the same problem?

Same sources prior to the patch? It's quite possible.

I tracked this second crash to line 321 of tcp_sack.c (the kgdb backtrace
is all wrong due to all the inlining). It's freeing 'newblock' here,
which should always succeed at this particular point in the code.

I think this case can only occur if the list had previously been
corrupted due to the hint not getting NULL'd out in those two places.

-Matt
Matthew Dillon
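
As a rough illustration of what a "freeing free entry" style check guards against, here is a toy fixed-size allocator that refuses to free an item that is already on its free list. It is a sketch under stated assumptions and not how vm_zone.c is implemented: the toy_zone and toy_item types, the is_free flag, and the toy_* function names are all hypothetical. A kernel would panic where this sketch returns -1.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_ZONE_ITEMS 4

/* Hypothetical item header: free-list link plus a "currently free" flag. */
struct toy_item {
	struct toy_item *next_free;
	int is_free;
};

struct toy_zone {
	struct toy_item items[TOY_ZONE_ITEMS];
	struct toy_item *free_head;
};

/* Put every item on the free list, lowest index first. */
static void
toy_zone_init(struct toy_zone *z)
{
	int i;

	z->free_head = NULL;
	for (i = TOY_ZONE_ITEMS - 1; i >= 0; --i) {
		z->items[i].is_free = 1;
		z->items[i].next_free = z->free_head;
		z->free_head = &z->items[i];
	}
}

/* Pop one item off the free list, or return NULL if the zone is empty. */
static struct toy_item *
toy_zalloc(struct toy_zone *z)
{
	struct toy_item *it = z->free_head;

	if (it == NULL)
		return NULL;
	z->free_head = it->next_free;
	it->next_free = NULL;
	it->is_free = 0;
	return it;
}

/*
 * Return an item to the free list. Returns 0 on success and -1 if the
 * item is already free, leaving the free list untouched in that case.
 */
static int
toy_zfree(struct toy_zone *z, struct toy_item *it)
{
	assert(z != NULL && it != NULL);
	if (it->is_free)
		return -1;
	it->is_free = 1;
	it->next_free = z->free_head;
	z->free_head = it;
	return 0;
}
```

Without the is_free check, the second free would link the item to itself on the free list, and later allocations would hand out the same item twice.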

#5 Updated by corecode over 5 years ago

did this get committed?

#6 Updated by pavalos about 5 years ago

Committed in 9e3d6c9645ed28ef5b07a9b13e380e13a86deeb8. I haven't seen this
panic in about a year, so let's call it good.
