Project

General

Profile

Actions

Bug #717

closed

panic: assertion: msg->ms_flags & MSGF_INTRANSIT

Added by pavalos over 14 years ago. Updated over 14 years ago.

Status:
Closed
Priority:
High
Assignee:
-
Category:
-
Target version:
-
Start date:
Due date:
% Done:

0%

Estimated time:

Description

Head sources from the 22nd, I received this panic. No core dump.

panic: assertion: msg->ms_flags & MSGF_INTRANSIT in
lwkt_thread_replyport_remote
mp_lock = 00000000; cpuid = 1; Trace beginning at frame 0xd9e7bc88
panic(c0303d72,1,c02fdd2a,d9e7bcbc,d9e7bcbc) at panic+0x16d
panic(c02fdd2a,c03171e8,c02f2675,d9e7bd00,c3b280d0) at panic+0x16d
lwkt_thread_replyport_remote(e8962c28,0,d9e7bd38,c03300f4,286) at
lwkt_thread_replyport_remote+0x5f
lwkt_process_ipiq_core(0,0,d9e78f00,28095120,808b04e) at
lwkt_process_ipiq_core+0x12d
lwkt_process_ipiq_frame(d9e7bd38,c02cac25,d9e7bd40,7b,808002f) at
lwkt_process_ipiq_frame+0x5b
Xipiq() at Xipiq+0x48
boot() called on cpu#1
Uptime: 2d2h25m28s

dumping to dev #da/0x20001, blockno 378927
dump 2047 panic: ahd_run_qoutfifo recursion
mp_lock = 00000000; cpuid = 1; boot() called on cpu#1
Uptime: 2d2h25m29s
Dump already in progress, bailing...
(da0:ahd1:0:0:0): Synchronize cache failed, status 0xb, scsi status
0x0

Actions #1

Updated by dillon over 14 years ago

:Head sources from the 22nd, I received this panic. No core dump.
:
:panic: assertion: msg->ms_flags & MSGF_INTRANSIT in
:lwkt_thread_replyport_remote
:mp_lock = 00000000; cpuid = 1; Trace beginning at frame 0xd9e7bc88
:panic(c0303d72,1,c02fdd2a,d9e7bcbc,d9e7bcbc) at panic+0x16d
:panic(c02fdd2a,c03171e8,c02f2675,d9e7bd00,c3b280d0) at panic+0x16d
:lwkt_thread_replyport_remote(e8962c28,0,d9e7bd38,c03300f4,286) at
:...

Shoot.  There isn't enough info to track it down.  The only
thing that can cause this is if a message is replied twice in
quick succession, or if the message flags are corrupted, or
the message is freed too early. We'd need to know where the
message came from, which requires a core dump.
-Matt
Matthew Dillon
<>
Actions #2

Updated by pavalos over 14 years ago

Sucks. Well regarding the dump...Did it not dump because of the panic,
or was it some other bug? Normally I don't have problems getting core
dumps.

--Peter

Actions #3

Updated by dillon over 14 years ago

:Sucks. Well regarding the dump...Did it not dump because of the panic,
:or was it some other bug? Normally I don't have problems getting core
:dumps.
:
:--Peter

It refused to dump because of a recursion in the SCSI disk
device driver which had an assertion check or a panic on the
condition.
-Matt
Actions #4

Updated by pavalos over 14 years ago

So is this panic related to the bug I posted earlier regarding ahd?

http://bugs.dragonflybsd.org/issue716

Actions #5

Updated by dillon over 14 years ago

:So is this panic related to the bug I posted earlier regarding ahd?
:
:http://bugs.dragonflybsd.org/issue716

Yes.  Definitely.  I think it is a very long standing issue with
the AHD driver. I don't know if its worth checking for updates from
FreeBSD from when we imported it.
It's possible that the message interlock panic is related to a bug
in the AHD code too, though that's just a shot in a dark.
-Matt
Actions #6

Updated by pavalos over 14 years ago

Got another one...not sure if it's gunna help though:

panic: assertion: msg->ms_flags & MSGF_INTRANSIT in
lwkt_thread_replyport_remote
mp_lock = 00000000; cpuid = 1; Trace beginning at frame 0xff80fc60
panic(c03040c6,1,c02fe0ea,ff80fc94,ff80fc94) at panic+0x16d
panic(c02fe0ea,c0317584,c02f29f5,ff80fcb8,c3b280d0) at panic+0x16d
lwkt_thread_replyport_remote(db413c28,0,0,ff80fcfc,c019f6a0) at
lwkt_thread_replyport_remote+0x5f
lwkt_process_ipiq_core(0,0,ff8083a4,ff8083a4,ff8083a4) at
lwkt_process_ipiq_core+0x12d
lwkt_process_ipiq(0,ff808000,297,c019e307,0) at lwkt_process_ipiq+0x5d
splz_ipiq(c03949a8,10,10,0,ff8083a4) at splz_ipiq+0xe
lwkt_switch(0,ff8083a4,0,ff8083a4,0) at lwkt_switch+0x4e6
cpu_idle(0,0,0,0,0) at cpu_idle+0x62
boot() called on cpu#1
Uptime: 2d12h22m33s

dumping to dev #da/0x20001, blockno 378927
dump devstat_end_transaction: HELP!! busy_count for da1 is < 0 (-1)!
LWKT_WAIT_IPIQ WARNING! 0 wait 1 (-3)
SECONDARY PANIC ON CPU 0 THREAD 0xc0354e04

Actions #7

Updated by dillon over 14 years ago

:Peter Avalos <> added the comment:
:
:Got another one...not sure if it's gunna help though:
:...
:boot() called on cpu#1
:Uptime: 2d12h22m33s
:
:dumping to dev #da/0x20001, blockno 378927
:dump devstat_end_transaction: HELP!! busy_count for da1 is < 0 (-1)!
:LWKT_WAIT_IPIQ WARNING! 0 wait 1 (-3)
:SECONDARY PANIC ON CPU 0 THREAD 0xc0354e04

Was this after your ahd logic fix?
That secondary panic can only happen if devstat_end_transaction() is
called too many times. The INTRANSIT panic implies a message being
replied to twice, or a message being corrupted prior to being replied.
This implies that the BIO related to a transaction is being replied
to twice. I've never seen this before in my life so I think it must
be specific to the AHD driver. I'll bet the driver is trying to
complete an I/O twice.
Lets try to catch the problem earlier and maybe get a more meaningful
backtrace. Make sure you have options DDB_TRACE set (I think you
do). I have committed some code to lwkt_msgport.c (1.44) which tries
to catch the MSGF_INTRANSIT flag on the originating cpu.
--
Another data point... you got the ahd_run_qoutfifo() panic again
as well, which implies that ahd_run_qoutfifo() was running when the
original panic occured. I think there's an issue in AHDs transaction
processing. The driver does a LOT of work on a scb before it frees
it by calling ahc_free_scb(). Somewhere in there a recursion is
happening.
It may be beneficial to add a 'processing in progress' flag to the scb
in ahc_done() and assert that the flag is not set to catch double-calls
to ahd_done() on the same scb earlier. Something like what I have
below may help.
-Matt
Matthew Dillon
&lt;&gt;

Index: aic79xx.h ===================================================================
RCS file: /cvs/src/sys/dev/disk/aic7xxx/aic79xx.h,v
retrieving revision 1.2
diff u -p -r1.2 aic79xx.h
--
aic79xx.h 17 Jun 2003 04:28:21 0000 1.2
++ aic79xx.h 4 Jul 2007 19:47:50 -0000
@ -589,12 +589,13 @ SCB_EXPECT_PPR_BUSFREE = 0x01000,
SCB_PKT_SENSE = 0x02000,
SCB_CMDPHASE_ABORT = 0x04000,
SCB_ON_COL_LIST = 0x08000,
SCB_SILENT = 0x10000 /*
SCB_SILENT = 0x10000,/* * Be quiet about transmission type * errors. They are expected and we * don't want to upset the user. This * flag is typically used during DV.
*/
+ SCB_RUNNINGDONE = 0x20000
} scb_flag; ===================================================================
RCS file: /cvs/src/sys/dev/disk/aic7xxx/aic79xx_osm.c,v
retrieving revision 1.13
diff u -p -r1.13 aic79xx_osm.c
--
aic79xx_osm.c 4 Jun 2007 17:21:55 -0000 1.13
+++ aic79xx_osm.c 4 Jul 2007 19:49:48 -0000
@ -196,6 +196,9 @ ahd_done(struct ahd_softc *ahd, struct s {
union ccb *ccb;

struct scb {
Index: aic79xx_osm.c

+ KKASSERT == 0);
+ scb->flags |= SCBRUNNINGDONE;
+
CAM_DEBUG(scb->io_ctx->ccb_h.path, CAM_DEBUG_TRACE,
("ahd_done - scb %d\n", SCB_GET_TAG(scb)));

Actions #8

Updated by dillon over 14 years ago

:+ KKASSERT == 0);
:+ scb->flags |= SCBRUNNINGDONE;
:+

Oops, I meant SCB_RUNNINGDONE :-)
-Matt
Matthew Dillon
&lt;&gt;
Actions #9

Updated by pavalos over 14 years ago

Before. I'm running the current code now, so we'll see if i get it again.

There have been a bunch of changes in the last 4 years, and I plan on
looking into bringing some of these in.

I'll try that patch as well as the latest code (lwkt_msgport.c).

--Peter

Actions #10

Updated by pavalos over 14 years ago

Well that was quick:

panic: assertion: (msg->ms_flags & (MSGF_DONE|MSGF_QUEUED|MSGF_INTRANSIT)) == 0 in lwkt_thread_replyport
mp_lock = 00000000; cpuid = 0
Trace beginning at frame 0xda6d2d08
panic(c030db06,0,c02fe28a,da6d2d3c,da6d2d3c) at panic+0x16d
panic(c02fe28a,c03178ac,c02f2b7f,0,e897ab78) at panic+0x16d
lwkt_thread_replyport(d9e57310,e897ac28,e897ab78,ff800000,da6d2d84) at lwkt_thread_replyport+0x39
netmsg_so_notify_abort(e897ab78,0,0,c0390cc0,0) at netmsg_so_notify_abort+0x5c
tcpmsg_service_loop(0,0,0,0,0) at tcpmsg_service_loop+0x3a
lwkt_exit() at lwkt_exit
boot() called on cpu#0
Uptime: 1m48s

dumping to dev #da/0x20001, blockno 378927

(kgdb) bt
#0 dumpsys () at thread.h:83
#1 0xc01966d2 in boot (howto=260) at /usr/src/sys/kern/kern_shutdown.c:371
#2 0xc01969ef in panic (fmt=Variable "fmt" is not available.
) at /usr/src/sys/kern/kern_shutdown.c:796
#3 0xc01a19d1 in lwkt_thread_replyport (port=0xd9e57310, msg=0xe897ac28) at /usr/src/sys/kern/lwkt_msgport.c:410
#4 0xc01ca8be in netmsg_so_notify_abort (netmsg=0xe897ab78) at msgport2.h:63
#5 0xc0216127 in tcpmsg_service_loop (dummy=0x0) at /usr/src/sys/netinet/tcp_subr.c:385
#6 0xc019e2b7 in lwkt_deschedule_self (td=Variable "td" is not available.
) at /usr/src/sys/kern/lwkt_thread.c:214
Previous frame inner to this frame (corrupt stack?)
(kgdb)

Actions #11

Updated by pavalos over 14 years ago

Core is on leaf now (*.1).

--Peter

Actions #12

Updated by dillon over 14 years ago

:Well that was quick:
:
:panic: assertion: (msg->ms_flags & (MSGF_DONE|MSGF_QUEUED|MSGF_INTRANSIT)) =
:=3D=3D 0 in lwkt_thread_replyport

Doh!  And there's a bug in netmsg_so_notify_abort() right there in 
plain sight. MSGF_DONE is not set until the message reaches the
target cpu, so there is a window of opportunity where the conditional
in netmsg_so_notify_abort() will try to reply the message twice.
I'll get that fixed in a sec and we'll see whether it was responsible
for all your problems or not.
-Matt
Actions #13

Updated by dillon over 14 years ago

kern/uipc_msg.c rev 1.19 has been committed.

-Matt
Actions #14

Updated by dillon over 14 years ago

Fixed race in LWKT message abort processing

Actions

Also available in: Atom PDF