Bug #1969

pf-related network problem

Added by pavalos about 3 years ago. Updated about 3 years ago.

Status:NewStart date:
Priority:NormalDue date:
Assignee:lentferj% Done:

0%

Category:-
Target version:-

Description

While attempting to reload a huge table (about 300k entries), my network
locks up and I get thousand of these messages on console:

pfr_unroute_kentry: delete failed.

I can't force a core dump in this state, but I believe Jan may have had
a related panic.

My pf.conf and the table file are available here:

http://www.theshell.com/~pavalos/wip/pf.tar.xz

--Peter

History

#1 Updated by lentferj about 3 years ago

the messages pops because rn_delete() returns NULL in pfr_unroute_kentry().

I added some debugging kprintfs to radix.c and as far as I can tell rn_delete
fails at the very beginning because of

line 722 bcmp(key + head_off, tt->rn_key + head_off, klen - head_off))

I don't know yet what the actual problem is here. But after flushing the tables
(pfctl -F Tables) I get a reproducable panic on kldunload pf.

Subject to further investigation.

#2 Updated by lentferj about 3 years ago

I think it boils down to rn_u = {rn_leaf = {rn_Key = 0x4 <Address 0x4 out of
bounds>

I have seen that in 3 or 4 different dumps caused by different rn_ function
calls (e.g. rn_match, rn_walktree) as in:
#0 _get_mycpu (di=0xc06f1bc0) at ./machine/thread.h:83
#1 md_dumpsys (di=0xc06f1bc0) at
/home/lentferj/repo/src/sys/platform/pc32/i386/dump_machdep.c:264
#2 0xc0312cf6 in dumpsys () at /home/lentferj/repo/src/sys/kern/kern_shutdown.c:893
#3 0xc03132b6 in boot (howto=260) at
/home/lentferj/repo/src/sys/kern/kern_shutdown.c:388
#4 0xc0313581 in panic (fmt=0xc05bba44 "%s") at
/home/lentferj/repo/src/sys/kern/kern_shutdown.c:799
#5 0xc056e3c2 in trap_fatal (frame=0xd27356c4, eva=<value optimized out>) at
/home/lentferj/repo/src/sys/platform/pc32/i386/trap.c:1116
#6 0xc056e4f8 in trap_pfault (frame=0xd27356c4, usermode=0, eva=305406860) at
/home/lentferj/repo/src/sys/platform/pc32/i386/trap.c:1018
#7 0xc056ea51 in trap (frame=0xd27356c4) at
/home/lentferj/repo/src/sys/platform/pc32/i386/trap.c:705
#8 0xc05567b7 in calltrap () at
/home/lentferj/repo/src/sys/platform/pc32/i386/exception.s:785
#9 0xc039e06d in rn_walktree (h=0xd0409480, f=0xd362fa90 <pfr_walktree>,
w=0xd273572c) at /home/lentferj/repo/src/sys/net/radix.c:1004
#10 0xd362cad6 in pfr_enqueue_addrs (kt=0xd36549c0, workq=0xd2735760, naddr=0x0,
sweep=0) at /home/lentferj/repo/src/sys/net/pf/pf_table.c:748
#11 0xd362dc7c in pfr_destroy_ktable (kt=0xd36549c0, flushaddr=1) at
/home/lentferj/repo/src/sys/net/pf/pf_table.c:1952
#12 0xd362dea2 in pfr_setflags_ktable (kt=0xd36549c0, newf=0) at
/home/lentferj/repo/src/sys/net/pf/pf_table.c:1861
#13 0xd362df46 in pfr_detach_table (kt=0xd3657f60) at
/home/lentferj/repo/src/sys/net/pf/pf_table.c:2122
#14 0xd3616928 in pf_tbladdr_remove (aw=0xd365d6c0) at
/home/lentferj/repo/src/sys/net/pf/pf.c:1211
#15 0xd3622e79 in pf_rm_rule (rulequeue=0xd3636a8c, rule=0xd365d6c0) at
/home/lentferj/repo/src/sys/net/pf/pf_ioctl.c:433
#16 0xd36230aa in pf_commit_rules (ticket=2, rs_num=1, anchor=0xd2735c6b "") at
/home/lentferj/repo/src/sys/net/pf/pf_ioctl.c:933
#17 0xd3627251 in pf_unload (mod=0xc29e3d70, type=1, data=0x0) at
/home/lentferj/repo/src/sys/net/pf/pf_ioctl.c:3051
#18 pf_modevent (mod=0xc29e3d70, type=1, data=0x0) at
/home/lentferj/repo/src/sys/net/pf/pf_ioctl.c:3316
#19 0xc02f332c in module_unload (mod=0xc29e3d70) at
/home/lentferj/repo/src/sys/kern/kern_module.c:220
#20 0xc02f454c in linker_file_unload (file=0xc2a4bfb8) at
/home/lentferj/repo/src/sys/kern/kern_linker.c:487
#21 0xc02f4e5d in sys_kldunload (uap=0xd2735cf0) at
/home/lentferj/repo/src/sys/kern/kern_linker.c:825
#22 0xc056f1a7 in syscall2 (frame=0xd2735d40) at
/home/lentferj/repo/src/sys/platform/pc32/i386/trap.c:1322
#23 0xc0556866 in Xint0x80_syscall () at
/home/lentferj/repo/src/sys/platform/pc32/i386/exception.s:876
#24 0x0000001f in ?? ()
Backtrace stopped: previous frame inner to this frame (corrupt stack?)
(kgdb) f 10
#10 0xd362cad6 in pfr_enqueue_addrs (kt=0xd36549c0, workq=0xd2735760, naddr=0x0,
sweep=0) at /home/lentferj/repo/src/sys/net/pf/pf_table.c:748
748 if (kt->pfrkt_ip4->rnh_walktree(kt->pfrkt_ip4, pfr_walktree, &w))
(kgdb) p kt->pfrkt_ip4
$1 = (struct radix_node_head *) 0xd0409480
(kgdb) p *kt->pfrkt_ip4
$2 = {rnh_treetop = 0xd04094b8, rnh_addaddr = 0xc039e578 <rn_addroute>,
rnh_deladdr = 0xc039e89e <rn_delete>, rnh_matchaddr = 0xc039dbd2 <rn_match>,
rnh_lookup = 0xc039e848 <rn_lookup>, rnh_walktree = 0xc039e049 <rn_walktree>,
rnh_walktree_from = 0xc039df51 <rn_walktree_from>, rnh_close = 0,
rnh_nodes = {{rn_mklist = 0x0, rn_parent = 0xd3657f78, rn_bit = -33, rn_bmask
= 0 '\000', rn_flags = 6 '\006', rn_u = {rn_leaf = {rn_Key = 0xcd8cf7c0 "",
rn_Mask = 0x0, rn_Dupedkey = 0x0}, rn_node = {rn_Off = -846399552,
rn_L = 0x0, rn_R = 0x0}}}, {rn_mklist = 0x0, rn_parent = 0xd04094b8,
rn_bit = 32, rn_bmask = -128 '\200', rn_flags = 6 '\006', rn_u = {rn_leaf
= {rn_Key = 0x4 <Address 0x4 out of bounds>, rn_Mask = 0xd36582b8 "",
rn_Dupedkey = 0xd04094d0}, rn_node = {rn_Off = 4, rn_L = 0xd36582b8,
rn_R = 0xd04094d0}}}, {rn_mklist = 0x0, rn_parent = 0xd04094b8,
rn_bit = -33, rn_bmask = 0 '\000', rn_flags = 6 '\006', rn_u = {rn_leaf = {
rn_Key = 0xcd8cf7dc '\377' <repeats 28 times>,
"\370\376\214\315\060\370\214\315", rn_Mask = 0x0, rn_Dupedkey = 0x0}, rn_node = {
rn_Off = -846399524, rn_L = 0x0, rn_R = 0x0}}}}, rnh_addrsize = 0,
rnh_pktsize = 0, rnh_addpkt = 0, rnh_delpkt = 0, rnh_matchpkt = 0}

#3 Updated by lentferj about 3 years ago

Jan Lentfer (via DragonFly issue tracker) schrieb:

>
> line 722 bcmp(key + head_off, tt->rn_key + head_off, klen - head_off))
>

That was a wrong trace :(.. too tired I guess. The part that actually
returns NULL in radix.c is in rn_delete():

/*
* Delete our route from mask lists.
*/
if (netmask != NULL) {
if ((x = rn_addmask(netmask, TRUE, head_off)) == NULL)
return (NULL);
netmask = x->rn_key;
while (tt->rn_mask != netmask)
if ((tt = tt->rn_dupedkey) == NULL)
return (NULL);
}

Jan

#4 Updated by lentferj about 3 years ago

Jan Lentfer schrieb:

> That was a wrong trace :(.. too tired I guess. The part that actually
> returns NULL in radix.c is in rn_delete():
>
> /*
> * Delete our route from mask lists.
> */
> if (netmask != NULL) {
> if ((x = rn_addmask(netmask, TRUE, head_off)) == NULL)
> return (NULL);
> netmask = x->rn_key;
> while (tt->rn_mask != netmask)
> if ((tt = tt->rn_dupedkey) == NULL)
> return (NULL);
> }
>
> Jan
>

I put some kprintfs in radix.c's rn_addroute() and rn_delete(), to maybe
better figure out where it goes wrong. The table loaded in pf is made up
of the 1st 100 lines of Peter's p2p file. The ouput is rather long, so I
am just linkin to a pastie here:

http://pastie.org/1504070

Maybe someone can make any sense from that??

TIA

Jan

#5 Updated by josepht about 3 years ago

On Thu, Jan 27, 2011 at 11:17:35PM +0100, Jan Lentfer wrote:
> Jan Lentfer schrieb:
>
> >That was a wrong trace :(.. too tired I guess. The part that actually
> >returns NULL in radix.c is in rn_delete():
> >
> > /*
> > * Delete our route from mask lists.
> > */
> > if (netmask != NULL) {
> > if ((x = rn_addmask(netmask, TRUE, head_off)) == NULL)
> > return (NULL);
> > netmask = x->rn_key;
> > while (tt->rn_mask != netmask)
> > if ((tt = tt->rn_dupedkey) == NULL)
> > return (NULL);
> > }
> >
> >Jan
> >
>
>
> I put some kprintfs in radix.c's rn_addroute() and rn_delete(), to maybe
> better figure out where it goes wrong. The table loaded in pf is made up
> of the 1st 100 lines of Peter's p2p file. The ouput is rather long, so I
> am just linkin to a pastie here:
>
> http://pastie.org/1504070

Jan,

You'll need to print netmask as *((uint32_t *)netmask).

Cheers,
Joe
>
> Maybe someone can make any sense from that??
>
> TIA
>
> Jan

#6 Updated by lentferj about 3 years ago

Jan Lentfer schrieb:
[...]
> I put some kprintfs in radix.c's rn_addroute() and rn_delete(), to maybe
> better figure out where it goes wrong. The table loaded in pf is made up
> of the 1st 100 lines of Peter's p2p file. The ouput is rather long, so I
> am just linkin to a pastie here:
[...]

This actually seems to be an SMP issue. I tried all the same on a UP
VKERNEL and GENERIC with my debugging additions. I can't reproduce any
of the errors or the panics on a UP setup.

Jan

#7 Updated by sepherosa about 3 years ago

On Tue, Jan 25, 2011 at 4:58 PM, Peter Avalos <> wrote:
> While attempting to reload a huge table (about 300k entries), my network
> locks up and I get thousand of these messages on console:
>
> pfr_unroute_kentry: delete failed.

I believe it has been fixed on master:
b4628cf91a5f54965e6a4221f243e315aa7a6860

Best Regards,
sephe

#8 Updated by lentferj about 3 years ago

Sepherosa Ziehau schrieb:
> On Tue, Jan 25, 2011 at 4:58 PM, Peter Avalos <> wrote:
>> While attempting to reload a huge table (about 300k entries), my network
>> locks up and I get thousand of these messages on console:
>>
>> pfr_unroute_kentry: delete failed.
>
> I believe it has been fixed on master:
> b4628cf91a5f54965e6a4221f243e315aa7a6860

I can confirm that the pfctl -ef /etc/pf.conf; pfctl -F Tables cycle now
succeeds without any errors on my SMP VM.

BUT, if I kldunload pf now I get

1 tables deleted.
df386devel# kldunload pf
kthread 0xd2ab5f58 pfpurge has exited

Fatal trap 12: page fault while in kernel mode
cpuid = 1; lapic.id = 01000000
fault virtual address = 0x500ffd7
fault code = supervisor read, page not present
instruction pointer = 0x8:0x500ffd7
stack pointer = 0x10:0xd04bf7c4
frame pointer = 0x10:0xd04bf7d8
code segment = base 0x0, limit 0xfffff, type 0x1b
= DPL 0, pres 1, def32 1, gran 1
processor eflags = interrupt enabled, resume, IOPL = 0
current process = 927 (kldunload)
current thread = pri 10
<- SMP: XXX
trap number = 12
panic: page fault
cpuid = 1
Trace beginning at frame 0xd04bf6d0
panic(ffffffff) at panic+0x164
panic(c05edca4,c061e127,0,0,fffff) at panic+0x164
trap_fatal(500ffd7,0,d2852620,d294ac98,0) at trap_fatal+0x307
trap_pfault(cfc03260,d04bf754,0,0,0) at trap_pfault+0x124
trap(d04bf77c) at trap+0x4df
calltrap() at calltrap+0xd
--- trap 0, eip = 0, esp = 0x10246, ebp = 0 ---
boot() called on cpu#1
Uptime: 1m23s
Physical memory: 1014 MB
Dumping 140 MB:panic: Bad link elm 0xd04bf44c next->prev != elm
cpuid = 1
boot() called on cpu#1
Uptime: 1m23s
spin_lock: 0xcd86c2f4, indefinite wait!
panic: Bad tailq NEXT(0xcd86c2f8->tqh_last) != NULL
cpuid = 1
boot() called on cpu#1
Uptime: 1m23s
spin_lock: 0xcd86c2f4, indefinite wait!
panic: Bad tailq NEXT(0xcd86c2f8->tqh_last) != NULL
cpuid = 1
boot() called on cpu#1
[..]

That does not happen if I do exactly the same but without table <p2p>
persist file "/etc/p2p_small.pf" in pf.conf.

Jan

Also available in: Atom PDF