Bug #717: panic: assertion: msg->ms_flags & MSGF_INTRANSIT - DragonFlyBSD - DragonFlyBSD bugtracker

Actions

Copy link

Bug #717

closed

panic: assertion: msg->ms_flags & MSGF_INTRANSIT

Added by pavalos about 18 years ago. Updated about 18 years ago.

Status:

Closed

Priority:

High

Assignee:

Category:

Target version:

Start date:

Due date:

% Done:

Estimated time:

Description

Head sources from the 22nd, I received this panic. No core dump.

panic: assertion: msg->ms_flags & MSGF_INTRANSIT in
lwkt_thread_replyport_remote
mp_lock = 00000000; cpuid = 1; Trace beginning at frame 0xd9e7bc88
panic(c0303d72,1,c02fdd2a,d9e7bcbc,d9e7bcbc) at panic+0x16d
panic(c02fdd2a,c03171e8,c02f2675,d9e7bd00,c3b280d0) at panic+0x16d
lwkt_thread_replyport_remote(e8962c28,0,d9e7bd38,c03300f4,286) at
lwkt_thread_replyport_remote+0x5f
lwkt_process_ipiq_core(0,0,d9e78f00,28095120,808b04e) at
lwkt_process_ipiq_core+0x12d
lwkt_process_ipiq_frame(d9e7bd38,c02cac25,d9e7bd40,7b,808002f) at
lwkt_process_ipiq_frame+0x5b
Xipiq() at Xipiq+0x48
boot() called on cpu#1
Uptime: 2d2h25m28s

dumping to dev #da/0x20001, blockno 378927
dump 2047 panic: ahd_run_qoutfifo recursion
mp_lock = 00000000; cpuid = 1; boot() called on cpu#1
Uptime: 2d2h25m29s
Dump already in progress, bailing...
(da0:ahd1:0:0:0): Synchronize cache failed, status 0xb, scsi status
0x0

Actions

Copy link

Updated by dillon about 18 years ago

:Head sources from the 22nd, I received this panic. No core dump.
:
:panic: assertion: msg->ms_flags & MSGF_INTRANSIT in
:lwkt_thread_replyport_remote
:mp_lock = 00000000; cpuid = 1; Trace beginning at frame 0xd9e7bc88
:panic(c0303d72,1,c02fdd2a,d9e7bcbc,d9e7bcbc) at panic+0x16d
:panic(c02fdd2a,c03171e8,c02f2675,d9e7bd00,c3b280d0) at panic+0x16d
:lwkt_thread_replyport_remote(e8962c28,0,d9e7bd38,c03300f4,286) at
:...

Shoot.  There isn't enough info to track it down.  The only
    thing that can cause this is if a message is replied twice in
    quick succession, or if the message flags are corrupted, or
    the message is freed too early.  We'd need to know where the
    message came from, which requires a core dump.

-Matt
                    Matthew Dillon 
                    &lt;dillon@backplane.com&gt;

Actions

Copy link

Updated by pavalos about 18 years ago

Sucks. Well regarding the dump...Did it not dump because of the panic,
or was it some other bug? Normally I don't have problems getting core
dumps.

--Peter

Actions

Copy link

Updated by dillon about 18 years ago

:Sucks. Well regarding the dump...Did it not dump because of the panic,
:or was it some other bug? Normally I don't have problems getting core
:dumps.
:
:--Peter

It refused to dump because of a recursion in the SCSI disk
    device driver which had an assertion check or a panic on the
    condition.

-Matt

Actions

Copy link

Updated by pavalos about 18 years ago

So is this panic related to the bug I posted earlier regarding ahd?

http://bugs.dragonflybsd.org/issue716

Actions

Copy link

Updated by dillon about 18 years ago

:So is this panic related to the bug I posted earlier regarding ahd?
:
:http://bugs.dragonflybsd.org/issue716

Yes.  Definitely.  I think it is a very long standing issue with
    the AHD driver.  I don't know if its worth checking for updates from
    FreeBSD from when we imported it.

It's possible that the message interlock panic is related to a bug
    in the AHD code too, though that's just a shot in a dark.

-Matt

Actions

Copy link

Updated by pavalos about 18 years ago

Got another one...not sure if it's gunna help though:

panic: assertion: msg->ms_flags & MSGF_INTRANSIT in
lwkt_thread_replyport_remote
mp_lock = 00000000; cpuid = 1; Trace beginning at frame 0xff80fc60
panic(c03040c6,1,c02fe0ea,ff80fc94,ff80fc94) at panic+0x16d
panic(c02fe0ea,c0317584,c02f29f5,ff80fcb8,c3b280d0) at panic+0x16d
lwkt_thread_replyport_remote(db413c28,0,0,ff80fcfc,c019f6a0) at
lwkt_thread_replyport_remote+0x5f
lwkt_process_ipiq_core(0,0,ff8083a4,ff8083a4,ff8083a4) at
lwkt_process_ipiq_core+0x12d
lwkt_process_ipiq(0,ff808000,297,c019e307,0) at lwkt_process_ipiq+0x5d
splz_ipiq(c03949a8,10,10,0,ff8083a4) at splz_ipiq+0xe
lwkt_switch(0,ff8083a4,0,ff8083a4,0) at lwkt_switch+0x4e6
cpu_idle(0,0,0,0,0) at cpu_idle+0x62
boot() called on cpu#1
Uptime: 2d12h22m33s

dumping to dev #da/0x20001, blockno 378927
dump devstat_end_transaction: HELP!! busy_count for da1 is < 0 (-1)!
LWKT_WAIT_IPIQ WARNING! 0 wait 1 (-3)
SECONDARY PANIC ON CPU 0 THREAD 0xc0354e04

Actions

Copy link

Updated by dillon about 18 years ago

:Peter Avalos <pavalos@theshell.com> added the comment:
:
:Got another one...not sure if it's gunna help though:
:...
:boot() called on cpu#1
:Uptime: 2d12h22m33s
:
:dumping to dev #da/0x20001, blockno 378927
:dump devstat_end_transaction: HELP!! busy_count for da1 is < 0 (-1)!
:LWKT_WAIT_IPIQ WARNING! 0 wait 1 (-3)
:SECONDARY PANIC ON CPU 0 THREAD 0xc0354e04

Was this after your ahd logic fix?

That secondary panic can only happen if devstat_end_transaction() is
    called too many times.  The INTRANSIT panic implies a message being
    replied to twice, or a message being corrupted prior to being replied.

This implies that the BIO related to a transaction is being replied
    to twice.  I've never seen this before in my life so I think it must
    be specific to the AHD driver.  I'll bet the driver is trying to
    complete an I/O twice.

Lets try to catch the problem earlier and maybe get a more meaningful
    backtrace.   Make sure you have options DDB_TRACE set (I think you
    do).  I have committed some code to lwkt_msgport.c (1.44) which tries
    to catch the MSGF_INTRANSIT flag on the originating cpu.

--

Another data point... you got the ahd_run_qoutfifo() panic again
    as well, which implies that ahd_run_qoutfifo() was running when the
    original panic occured.  I think there's an issue in AHDs transaction
    processing.  The driver does a LOT of work on a scb before it frees
    it by calling ahc_free_scb().  Somewhere in there a recursion is
    happening.

It may be beneficial to add a 'processing in progress' flag to the scb
    in ahc_done() and assert that the flag is not set to catch double-calls
    to ahd_done() on the same scb earlier.  Something like what I have
    below may help.

-Matt
                    Matthew Dillon 
                    &lt;dillon@backplane.com&gt;

Index: aic79xx.h ===================================================================
RCS file: /cvs/src/sys/dev/disk/aic7xxx/aic79xx.h,v
retrieving revision 1.2
diff u -p -r1.2 aic79xx.h
-- aic79xx.h 17 Jun 2003 04:28:21 0000 1.2
++ aic79xx.h 4 Jul 2007 19:47:50 -0000
@ -589,12 +589,13 @ SCB_EXPECT_PPR_BUSFREE = 0x01000,
SCB_PKT_SENSE = 0x02000,
SCB_CMDPHASE_ABORT = 0x04000,
SCB_ON_COL_LIST = 0x08000,
SCB_SILENT = 0x10000 /*
SCB_SILENT = 0x10000,/* * Be quiet about transmission type * errors. They are expected and we * don't want to upset the user. This * flag is typically used during DV.
*/
+ SCB_RUNNINGDONE = 0x20000
} scb_flag; ===================================================================
RCS file: /cvs/src/sys/dev/disk/aic7xxx/aic79xx_osm.c,v
retrieving revision 1.13
diff u -p -r1.13 aic79xx_osm.c
-- aic79xx_osm.c 4 Jun 2007 17:21:55 -0000 1.13
+++ aic79xx_osm.c 4 Jul 2007 19:49:48 -0000
@ -196,6 +196,9 @ ahd_done(struct ahd_softc *ahd, struct s {
union ccb *ccb;

struct scb {
Index: aic79xx_osm.c

+ KKASSERT == 0);
+ scb->flags |= SCBRUNNINGDONE;
+
CAM_DEBUG(scb->io_ctx->ccb_h.path, CAM_DEBUG_TRACE,
("ahd_done - scb %d\n", SCB_GET_TAG(scb)));

Actions

Copy link

Updated by dillon about 18 years ago

:+ KKASSERT == 0);
:+ scb->flags |= SCBRUNNINGDONE;
:+

Oops, I meant SCB_RUNNINGDONE :-)

-Matt
                    Matthew Dillon 
                    &lt;dillon@backplane.com&gt;

Actions

Copy link

Updated by pavalos about 18 years ago

Before. I'm running the current code now, so we'll see if i get it again.

There have been a bunch of changes in the last 4 years, and I plan on
looking into bringing some of these in.

I'll try that patch as well as the latest code (lwkt_msgport.c).

--Peter

Actions

Copy link

#10

Updated by pavalos about 18 years ago

Well that was quick:

panic: assertion: (msg->ms_flags & (MSGF_DONE|MSGF_QUEUED|MSGF_INTRANSIT)) == 0 in lwkt_thread_replyport
mp_lock = 00000000; cpuid = 0
Trace beginning at frame 0xda6d2d08
panic(c030db06,0,c02fe28a,da6d2d3c,da6d2d3c) at panic+0x16d
panic(c02fe28a,c03178ac,c02f2b7f,0,e897ab78) at panic+0x16d
lwkt_thread_replyport(d9e57310,e897ac28,e897ab78,ff800000,da6d2d84) at lwkt_thread_replyport+0x39
netmsg_so_notify_abort(e897ab78,0,0,c0390cc0,0) at netmsg_so_notify_abort+0x5c
tcpmsg_service_loop(0,0,0,0,0) at tcpmsg_service_loop+0x3a
lwkt_exit() at lwkt_exit
boot() called on cpu#0
Uptime: 1m48s

dumping to dev #da/0x20001, blockno 378927

(kgdb) bt
#0 dumpsys () at thread.h:83
#1 0xc01966d2 in boot (howto=260) at /usr/src/sys/kern/kern_shutdown.c:371
#2 0xc01969ef in panic (fmt=Variable "fmt" is not available.
) at /usr/src/sys/kern/kern_shutdown.c:796
#3 0xc01a19d1 in lwkt_thread_replyport (port=0xd9e57310, msg=0xe897ac28) at /usr/src/sys/kern/lwkt_msgport.c:410
#4 0xc01ca8be in netmsg_so_notify_abort (netmsg=0xe897ab78) at msgport2.h:63
#5 0xc0216127 in tcpmsg_service_loop (dummy=0x0) at /usr/src/sys/netinet/tcp_subr.c:385
#6 0xc019e2b7 in lwkt_deschedule_self (td=Variable "td" is not available.
) at /usr/src/sys/kern/lwkt_thread.c:214
Previous frame inner to this frame (corrupt stack?)
(kgdb)

Actions

Copy link

#11

Updated by pavalos about 18 years ago

Core is on leaf now (*.1).

--Peter

Actions

Copy link

#12

Updated by dillon about 18 years ago

:Well that was quick:
:
:panic: assertion: (msg->ms_flags & (MSGF_DONE|MSGF_QUEUED|MSGF_INTRANSIT)) =
:=3D=3D 0 in lwkt_thread_replyport

Doh!  And there's a bug in netmsg_so_notify_abort() right there in 
    plain sight.  MSGF_DONE is not set until the message reaches the
    target cpu, so there is a window of opportunity where the conditional
    in netmsg_so_notify_abort() will try to reply the message twice.

I'll get that fixed in a sec and we'll see whether it was responsible
    for all your problems or not.

-Matt

Actions

Copy link

#13

Updated by dillon about 18 years ago

kern/uipc_msg.c rev 1.19 has been committed.

-Matt

Actions

Copy link

#14

Updated by dillon about 18 years ago

Fixed race in LWKT message abort processing

Actions

Copy link

Also available in: Atom PDF

Project

General

Profile

DragonFlyBSD

Bug #717

panic: assertion: msg->ms_flags & MSGF_INTRANSIT

Updated by dillon about 18 years ago

Updated by pavalos about 18 years ago

Updated by dillon about 18 years ago

Updated by pavalos about 18 years ago

Updated by dillon about 18 years ago

Updated by pavalos about 18 years ago

Updated by dillon about 18 years ago

Updated by dillon about 18 years ago

Updated by pavalos about 18 years ago

Updated by pavalos about 18 years ago

Updated by pavalos about 18 years ago

Updated by dillon about 18 years ago

Updated by dillon about 18 years ago

Updated by dillon about 18 years ago