Bug #1818

open

panic: Bad tailq NEXT (kqueue issue ?)

Added by ftigeot over 13 years ago. Updated almost 2 years ago.

Status: New
Priority: Normal
Assignee: -
Category: -
Target version:
Start date:
Due date:
% Done: 0%
Estimated time:
Description

I have been running the latest -master for a few hours on my main
desktop box.
It just froze under X11 and rebooted automatically.

On the next boot, the system was able to save a core dump:

panic: Bad tailq NEXT(0xfffffffe5550e190->tqh_last) != NULL

Relevant files are available here:
http://www.wolfpond.org/crash.dfly/

Actions #1

Updated by ftigeot over 13 years ago

On Thu, Sep 02, 2010 at 12:27:39PM +0200, Francois Tigeot wrote:

I have been running the latest -master for a few hours on my main
desktop box.
It just froze under X11 and rebooted automatically.

On the next boot, the system was able to save a core dump:

panic: Bad tailq NEXT != NULL

Relevant files are available here:
http://www.wolfpond.org/crash.dfly/

I believe this bug is a consequence of the recent kqueue work.

The panic originates at line 600 of sys/kern/kern_event.c

The relevant line is part of kern_event():
TAILQ_INSERT_TAIL(&kq->kq_knpend, &marker, kn_tqe);

This function is marked MPSAFE; I'm running an SMP kernel on a Core 2 Duo CPU.

So far, this panic occurs every few hours with the latest kernel.
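
For readers unfamiliar with the message: "Bad tailq NEXT(...->tqh_last) != NULL" appears to come from a consistency check performed when inserting at the tail of a TAILQ. In a healthy tail queue, tqh_last points at the "next" pointer of the last element (or at tqh_first when the list is empty), and that pointer must be NULL. The following is a minimal user-space sketch of that check, assuming a <sys/queue.h> that provides the BSD TAILQ macros. The names `struct item`, `tail_is_sane()` and `checked_insert_tail()` are hypothetical and are not taken from the kernel source.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/queue.h>

/* Hypothetical element type; stands in for struct knote. */
struct item {
	int id;
	TAILQ_ENTRY(item) link;
};
TAILQ_HEAD(item_list, item);

/*
 * Tail invariant: tqh_last points at the last element's "next"
 * pointer (or at tqh_first for an empty list), which must be NULL.
 */
static int
tail_is_sane(struct item_list *head)
{
	return (*head->tqh_last == NULL);
}

/*
 * Insert at the tail only if the invariant holds.
 * Returns 0 on success, -1 where a debug kernel would panic instead.
 */
static int
checked_insert_tail(struct item_list *head, struct item *it)
{
	assert(head != NULL && it != NULL);
	if (!tail_is_sane(head))
		return (-1);
	TAILQ_INSERT_TAIL(head, it, link);
	return (0);
}
```

If the check fails, something has already corrupted the list linkage before the insert; the insert is only where the damage is noticed.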

Actions #2

Updated by dillon over 13 years ago

:>
:> On the next boot, the system was able to save a core dump:
:>
:> panic: Bad tailq NEXT != NULL
:>
:> Relevant files are available here:
:> http://www.wolfpond.org/crash.dfly/
:
:I believe this bug is a consequence of the recent kqueue work.
:
:The panic originates at line 600 of sys/kern/kern_event.c
:
:The relevant line is part of kern_event():
: TAILQ_INSERT_TAIL(&kq->kq_knpend, &marker, kn_tqe);
:
:This function is marked MPSAFE; I'm running an SMP kernel on a Core 2 Duo CPU.
:
:So far, this panic occurs every few hours with the latest kernel.
:
:--
:Francois Tigeot

Hmm. The knote on the knpend list looks good except for its list linkage. It is related to a pipe but that might not be the one messing it up. I'm not sure how the situation can occur.

Try the patch below. All I can think of is that somehow the knote is being double-removed from the list due to knote_remove() blocking on kq_token. If that is the case then this patch should cause it to panic earlier, where the actual double-remove is happening, instead of later.

-Matt
Matthew Dillon
<>

diff --git a/sys/kern/sys_pipe.c b/sys/kern/sys_pipe.c
index 467b95a..d5fed13 100644
--- a/sys/kern/sys_pipe.c
+++ b/sys/kern/sys_pipe.c
@@ -1234,6 +1234,7 @@ filt_pipedetach(struct knote *kn)
 {
     struct pipe *cpipe = (struct pipe *)kn->kn_hook;

+    kn->kn_hook = NULL;
     knote_remove(&cpipe->pipe_kq.ki_note, kn);
 }
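
To illustrate the double-remove theory above, here is a user-space sketch of how removing the same element twice can leave a tail queue in exactly the state the panic complains about. It assumes a <sys/queue.h> whose TAILQ_REMOVE leaves the removed element's link pointers untouched (some debug builds of the header overwrite them, in which case the second removal would fault immediately). `struct note` and `tail_broken_after_double_remove()` are hypothetical names, not kernel code.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/queue.h>

/* Hypothetical element type; stands in for struct knote. */
struct note {
	int id;
	TAILQ_ENTRY(note) link;
};
TAILQ_HEAD(note_list, note);

/*
 * Build the list A, B, C, remove B and C legitimately, then remove B
 * a second time. Returns 1 if the tail invariant (*tqh_last == NULL)
 * is broken afterwards, 0 if it still holds.
 */
static int
tail_broken_after_double_remove(void)
{
	struct note_list head;
	struct note a = { .id = 1 }, b = { .id = 2 }, c = { .id = 3 };

	TAILQ_INIT(&head);
	TAILQ_INSERT_TAIL(&head, &a, link);
	TAILQ_INSERT_TAIL(&head, &b, link);
	TAILQ_INSERT_TAIL(&head, &c, link);

	TAILQ_REMOVE(&head, &b, link);	/* legitimate: list is now A, C */
	TAILQ_REMOVE(&head, &c, link);	/* legitimate: list is now A */
	assert(*head.tqh_last == NULL);	/* still consistent here */

	/*
	 * BUG: B is removed again. Its stale links still say "next is C,
	 * previous is A", so A's next pointer is rewritten to C while
	 * tqh_last keeps pointing at A's next pointer.
	 */
	TAILQ_REMOVE(&head, &b, link);

	return (*head.tqh_last != NULL);
}
```

In this sketch the second removal silently succeeds and the damage only shows up on the next tail insert. That matches the intent of the patch: clearing kn_hook in filt_pipedetach() should make a second detach of the same knote fail at the point of the double-remove rather than at a later, unrelated insert.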

Actions #3

Updated by dillon over 13 years ago

I pushed some socket changes that might have an effect on your crash.
I'm beginning to think that it isn't related to the pipe code at all, but that some other file descriptor with a kqueue on it is causing the corruption.
I'm guessing it's the socket code, but I don't know for sure.

I only give the changes I pushed a 10% chance of fixing the problem, but let's find out.

-Matt

Actions #4

Updated by ftigeot over 13 years ago

On Mon, Sep 06, 2010 at 08:43:00PM +0000, Matthew Dillon (via DragonFly issue tracker) wrote:

Matthew Dillon <> added the comment:

I pushed some socket changes that might have an effect on your crash.
I'm beginning to think that it isn't related to the pipe code at all
but that some other file descriptor w/ kqueue on it is causing
the corruption. I'm guessing it's the socket code but I don't know
for sure.

I only give the changes I pushed a 10% chance of fixing the problem,
but let's find out.

No new crash so far; I guess we will need a few more days to be sure.

Actions #5

Updated by dillon over 13 years ago

:
:On Mon, Sep 06, 2010 at 08:43:00PM +0000, Matthew Dillon (via DragonFly issue tracker) wrote:
:>
:> Matthew Dillon <> added the comment:
:>
:> I pushed some socket changes that might have an effect on your crash.
:> I'm beginning to think that it isn't related to the pipe code at all
:> but that some other file descriptor w/ kqueue on it is causing
:> the corruption. I'm guessing it's the socket code but I don't know
:> for sure.
:>
:> I only give the changes I pushed a 10% chance of fixing the problem,
:> but let's find out.
:
:No new crash so far; I guess we will need a few more days to be sure.
:
:--
:Francois Tigeot

I pushed more changes on the 6th; make sure you have commit
14343ad3b815bafa1bcec3656de2d614fcc75bec or later. Getting everything
through commit 12d442975420e1da3daae44f5c20a3c1dce055df is probably
best.

Absolute latest master will work too if you don't need wireless
(wireless is going to be broken for a day or two while we clean
up the locks).

-Matt
Matthew Dillon
<>

Actions #6

Updated by ftigeot over 13 years ago

On Tue, Sep 07, 2010 at 01:34:40PM -0700, Matthew Dillon wrote:

:No new crash so far; I guess we will need a few more days to be sure.
:
:--
:Francois Tigeot

I pushed more changes on the 6th; make sure you have commit
14343ad3b815bafa1bcec3656de2d614fcc75bec or later. Getting everything
through commit 12d442975420e1da3daae44f5c20a3c1dce055df is probably
best.

Absolute latest master will work too if you don't need wireless.
(wireless is going to be broken for a day or two while we clean
up the locks).

I'm running 2.7.3.865.g12d44-DEVELOPMENT. I think I've got all the latest
changes up to the LWKT_SERIALIZE_INITIALIZER bit (not included).

Actions #7

Updated by ftigeot over 13 years ago

On Tue, Sep 07, 2010 at 09:05:17PM +0200, Francois Tigeot wrote:

On Mon, Sep 06, 2010 at 08:43:00PM +0000, Matthew Dillon (via DragonFly issue tracker) wrote:

Matthew Dillon <> added the comment:

I pushed some socket changes that might have an effect on your crash.
I'm beginning to think that it isn't related to the pipe code at all
but that some other file descriptor w/ kqueue on it is causing
the corruption. I'm guessing it's the socket code but I don't know
for sure.

I only give the changes I pushed a 10% chance of fixing the problem,
but let's find out.

No new crash so far; I guess we will need a few more days to be sure.

I just got a complete system deadlock: no panic, everything suddenly froze in
place.

I was typing a mail under X11 at the time and missed any kernel
messages that may have appeared on the console.

I had to use the reset switch; there was nothing in the logs after the next
reboot.

Actions #8

Updated by sjg over 13 years ago

Francois, can you confirm the tailq panic is fixed? How about the strange lockup?

Actions #9

Updated by ftigeot over 13 years ago

On Wed, Oct 13, 2010 at 02:54:46PM +0000, Samuel J. Greear (via DragonFly issue tracker) wrote:

Samuel J. Greear <> added the comment:

Francois, can you confirm the tailq panic is fixed? How about the strange lockup?

Yes, it is. I have not seen a panic/lockup in a long time.

Actions #10

Updated by rumcic over 13 years ago

I was able to trigger a bad tailq panic, but could not get a dump because the
panic kept repeating itself and the machine ignored all input.

It happened after a reboot from a previous panic (after savecore had finished
its work). Because of the repetition and the ignored input, I could not tell
whether it was the primary panic or a secondary one.

Actions #11

Updated by tuxillo almost 2 years ago

  • Description updated (diff)
  • Assignee deleted (0)