Bug #957

CARP panic

Added by elekktretterr about 6 years ago. Updated over 5 years ago.

Status: Closed
Start date:
Priority: Normal
Due date:
Assignee: -
% Done: 0%
Category: -
Target version: -

Description

By the way, this is an SMP box.

Petr

panic (59 KB) elekktretterr, 07/24/2008 01:10 AM

ethersubr.txt (856 Bytes) sepherosa, 07/24/2008 02:07 AM

History

#1 Updated by dillon about 6 years ago

:This box is the MASTER in the CARP domain and it panicked today:
:
:http://www.punchyouremployer.com/files/carp_panic.JPG
:
:Petr

Well, I see a problem immediately. The lockmgr macros in ip_carp.c
are specifying non-blocking locks but all the calls to them assume
success.

I'm not sure what the best solution is here. The lockmgr is not supposed
to be called from an interrupt, which is why I think it was coded
LK_NOWAIT in the carp port, but the carp code doesn't check for an
EBUSY return value. That's a real problem.

We can't use spin locks because there are blockable paths called with the
CARP_LOCK held. It looks like our only real choice is to allow
lockmgr locks from interrupts... which by the way precludes being able
to use a FAST interrupt for network interrupts (which we don't at the
moment anyway, but...).

-Matt
Matthew Dillon

#2 Updated by elekktretterr about 6 years ago

What are the implications of that?

Petr

#3 Updated by TGEN about 6 years ago

I'd prefer CARP to grow EBUSY handling rather than changing our locking
primitives...
--
Thomas E. Spanjaard

#4 Updated by elekktretterr about 6 years ago

It would be great if someone fixed this at some point, as at the moment CARP
is basically useless.

Petr

#5 Updated by elekktretterr almost 6 years ago

Hey there,

Can someone please do something about the CARP panic on SMP boxes before the
release? I've got a box waiting to go into production, but I can't deploy it
because of the panic.

Thanks so much to anyone who can do it.
Petr

#6 Updated by sepherosa almost 6 years ago

Since you are using bge(4), I suggest you do the following:
- Add options ETHER_INPUT_CHAIN and options ETHER_INPUT2 to your
kernel config file
- Change line 2119 in netinet/ip_carp.c
from:
#define CARP_LOCK_INIT(cif) lockinit(&(cif)->vhif_lock, "carp_if", 0, LK_NOWAIT);
to:
#define CARP_LOCK_INIT(cif) lockinit(&(cif)->vhif_lock, "carp_if", 0, 0);
i.e. strip the LK_NOWAIT

Hope it will work for you.

Best Regards,
sephe

#7 Updated by dillon almost 6 years ago

:> Hey there,
:>
:> Can someone please do something about the CARP panic on SMP boxes before the
:> release? I've got a box waiting to go into production, but I can't deploy it
:> because of the panic.
:>
:> Thanks so much to anyone who can do it.
:
:Since you are using bge(4), I suggest you do the following:
:- Add options ETHER_INPUT_CHAIN and options ETHER_INPUT2 to your
:kernel config file
:- Change line 2119 in netinet/ip_carp.c
: from:
: #define CARP_LOCK_INIT(cif) lockinit(&(cif)->vhif_lock, "carp_if", 0, LK_NOWAIT);
: to:
: #define CARP_LOCK_INIT(cif) lockinit(&(cif)->vhif_lock, "carp_if", 0, 0);
: i.e. strip the LK_NOWAIT
:
:Hope it will work for you.
:
:Best Regards,
:sephe

If that doesn't work post another traceback and I'll try to track
it down. It kinda looks like one thread is locking and another is
unlocking, which isn't legal without setting the lock holder to a
special value. But it could also have been due to that LK_NOWAIT.

-Matt
Matthew Dillon

#8 Updated by elekktretterr almost 6 years ago

Hey guys,
so far no panic (2 hours and counting) with the patch.
A couple of things, though:

1) The carp interface cannot be configured properly if
net.inet.carp.preempt=1 is set before the carp network interface is up. It
gives an error along the lines of "arp_rtrequest: bad gateway value", so the
setting cannot be put into sysctl.conf.

If it is set after the carp/physical interface is up and configured, it
works fine.

2) Maildirs are stored on and read from an NFS share. If I unplug the cable
to make the master go down, commands run on the command line (e.g. ls
/usr/local/nfs_share) seem to freeze until I plug the cable in again.

Cheers,
petr

#9 Updated by elekktretterr almost 6 years ago

Good news,
so far no panic after 3 days. Please commit a fix to HEAD.

Petr

#10 Updated by elekktretterr almost 6 years ago

Sephe,
Can you please commit the fix?

Cheers,
Petr

#11 Updated by sepherosa almost 6 years ago

Hi,

Using ETHER_INPUT2 is tied to the CARP lock flag change. Enabling
ETHER_INPUT2 by default is a little bit risky for this release, and
ipflow can't be made to work in a compat fashion. You can just keep the
patch locally; if you have other NICs that you want to use with
CARP, please just tell me. Currently the following NICs have ETHER_INPUT2
support (so I can test that code path from day to day before the release
:)
bge(4) bce(4) em(4) et(4) msk(4) nfe(4) re(4) bfe(4) fxp(4) xl(4)

Once the release is done, I will switch to ETHER_INPUT2 in the repo, along
with the CARP lock flag change of course.

Best Regards,
sephe

#12 Updated by elekktretterr almost 6 years ago

Hey,
Can we at least make it an option? If ETHER_INPUT2 is defined, use the patched
CARP; if not, use the old one. I know, it's a bit of a hack, but only temporary.

Cheers,
Petr

#13 Updated by elekktretterr over 5 years ago

Hi Sephe,
I've just got a panic on the mailserver, saying:

panic: assertion, m == NULL in ether_input_chain2

Attached is an image of the panic.
Can you fix it? Thanks a lot.

Petr

#14 Updated by sepherosa over 5 years ago

On Thu, Jul 24, 2008 at 9:05 AM, Petr Janda wrote:
> Hi Sephe,
> I've just got a panic on the mailserver, saying:
>
> panic: assertion, m == NULL in ether_input_chain2
>
> Attached is an image of the panic.
> Can you fix it? Thanks a lot.

Try the attached patch.

Best Regards,
sephe

#15 Updated by elekktretterr over 5 years ago

Hey Sephe,
I had to apply the patch manually as it failed to apply cleanly (on DF
2.0-RELEASE). Unfortunately, the patch does more harm than good. None of
the network interfaces could be worked with, i.e. ifconfig shows that
they exist, but if I did something like ifconfig bge0 or ifconfig carp0 it
would simply say bge0 doesn't exist, and thus there were no IP addresses
associated with the interfaces.

This is the patched section of the file:

static __inline struct lwkt_port *
ether_mport(int num, struct mbuf **m)
{
	if (num == NETISR_MAX) {
		/*
		 * All packets whose target msgports can't be
		 * determined here are dispatched to netisr0,
		 * where further dispatching may happen.
		 */
		return cpu_portfn(0);
	}

	return netisr_find_port(num, m);
}

Did I miss something while patching it manually?

Cheers,
Petr

#16 Updated by sepherosa over 5 years ago

I don't believe that the patch has anything to do with the missing
interfaces. I think that before you applied the patch your tree was at a
point before MPLS was brought in, so the patch did not apply cleanly. I
suspect your kernel is out of sync with your world. Please rebuild
world and kernel.

Yep, that's what it looks like after patching.

Best Regards,
sephe

#17 Updated by elekktretterr over 5 years ago

Shouldn't the patch apply cleanly against 2.0 though, which is what my CVS
tree is?

#18 Updated by sepherosa over 5 years ago

Almost nothing related to networking has changed since the 2.0
branching. No matter what, I suggest you rebuild world and kernel.
If it still does not work, please post the dmesg.

Best Regards,
sephe

#19 Updated by nant over 5 years ago

It applies cleanly on my 2.0 checkout tree.

[tahngarth] /Development/DragonFly-2.0/src/sys/net> patch -C < /tmp/ethersubr.txt
Hmm... Looks like a unified diff to me...
The text leading up to this was:
--------------------------
|Index: if_ethersubr.c
|===================================================================
|RCS file: /dcvs/src/sys/net/if_ethersubr.c,v
|retrieving revision 1.76
|diff -u -p -r1.76 if_ethersubr.c
|--- if_ethersubr.c 8 Jul 2008 13:50:52 -0000 1.76
|+++ if_ethersubr.c 24 Jul 2008 01:56:50 -0000
--------------------------
Patching file if_ethersubr.c using Plan A...
Hunk #1 succeeded at 1658.
Hunk #2 succeeded at 1668.
done

#20 Updated by aoiko over 5 years ago

Is this a real issue?

#21 Updated by elekktretterr over 5 years ago

It's been fixed in master HEAD.

Cheers,
Petr

#22 Updated by aoiko over 5 years ago

Fixed
