Bug #2127

regular panics with 2.11.0.749 x86_64 on vmware

Added by isenmann almost 3 years ago. Updated almost 3 years ago.

Status: Closed
Priority: Normal
Assignee: swildner
% Done: 0%
Category: -
Target version: -

Description

Hi!

Since the upgrade to 2.11.0.749 (was 2.10.1) I get regular panics
during make buildworld on this virtual system running on VMware.
It is always this MPT_LOCK_ASSERT in sys/dev/disk/mpt/mpt_cam.c:1242.
I get it between a few minutes and over half an hour after starting
make buildworld. The probability of the panic may depend somewhat
on the activity of the other VMware guests.

-- part of /var/crash/info:

Dump header from device /dev/da0s1b
Architecture: x86_64
Version String: DragonFly v2.11.0.749.gfe750-DEVELOPMENT #0: Thu Aug 25
14:19:10 CEST 2011
_64:/usr/obj/usr/src/sys/X86_64_GENERIC
Panic String: assertion "lockstatus(&(mpt)->mpt_lock, curthread) != 0" failed
in mpt_timeout at /usr/src/sys/dev/disk/mpt/mpt_cam.c:1242

-- part of /var/crash/core.txt
Reading symbols from /boot/kernel/acpi.ko...done.
Loaded symbols for /boot/kernel/acpi.ko
Reading symbols from /boot/kernel/ahci.ko...done.
Loaded symbols for /boot/kernel/ahci.ko
Reading symbols from /boot/kernel/ehci.ko...done.
Loaded symbols for /boot/kernel/ehci.ko
_get_mycpu (di=<value optimized out>) at ./machine/thread.h:69
69 __asm ("movq %%gs:globaldata,%0" : "=r" (gd) : "m"(__mycpu__dummy));
(kgdb) #0 _get_mycpu (di=<value optimized out>) at ./machine/thread.h:69
#1 md_dumpsys (di=<value optimized out>) at
/usr/src/sys/platform/pc64/x86_64/dump_machdep.c:263
#2 0xffffffff804825e5 in dumpsys () at /usr/src/sys/kern/kern_shutdown.c:927
#3 0xffffffff8029a54c in db_fncall (dummy1=<value optimized out>, dummy2=<value
optimized out>, dummy3=<value optimized out>, dummy4=<value optimized out>) at
/usr/src/sys/ddb/db_command.c:542
#4 0xffffffff8029aa1f in db_command () at /usr/src/sys/ddb/db_command.c:344
#5 db_command_loop () at /usr/src/sys/ddb/db_command.c:470
#6 0xffffffff8029d8d1 in db_trap (type=<value optimized out>, code=<value
optimized out>) at /usr/src/sys/ddb/db_trap.c:71
#7 0xffffffff80728c68 in kdb_trap (type=<value optimized out>, code=<value
optimized out>, regs=<value optimized out>) at
/usr/src/sys/platform/pc64/x86_64/db_interface.c:176
#8 0xffffffff8072e7b6 in trap (frame=0xffffffe002294af8) at
/usr/src/sys/platform/pc64/x86_64/trap.c:701
#9 0xffffffff807271ae in calltrap () at
/usr/src/sys/platform/pc64/x86_64/exception.S:187
#10 0xffffffff80728aff in breakpoint (msg=<value optimized out>) at
./cpu/cpufunc.h:73
#11 Debugger (msg=<value optimized out>) at
/usr/src/sys/platform/pc64/x86_64/db_interface.c:361
#12 0xffffffff80482ed4 in panic (fmt=0xffffffff8076eb90 "assertion \"%s\" failed
in %s at %s:%u") at /usr/src/sys/kern/kern_shutdown.c:824
#13 0xffffffff80409668 in mpt_timeout (arg=<value optimized out>) at
/usr/src/sys/dev/disk/mpt/mpt_cam.c:1242
#14 0xffffffff8049e1e9 in softclock_handler (arg=0xffffffff80bd2ca0) at
/usr/src/sys/kern/kern_timeout.c:300
#15 0xffffffff8048dc20 in lwkt_deschedule_self (td=0x0) at
/usr/src/sys/kern/lwkt_thread.c:283
Backtrace stopped: previous frame inner to this frame (corrupt stack?)
(kgdb) (kgdb)
Token collisions owner
pmap_token 0 not held
dev_token 0 not held
vm_token 185286 not held
vmspace_token 2 not held
kvm_token 0 not held
proc_token 1 not held
tty_token 0 not held
vnode_token 0 not held
vmobj_token 524 not held
(kgdb)
cpu pid thread flags comm wchan wmesg
0 -1 80d3d6b8 00000000 idle_0 00000000
0 0 80b59340 00020080 swapper 80c8240c nowork
0 -1 80bd2cc0 00010403 softclock 0 00000000
0 -1 80b9f800 00010400 ithread emerg 00000000
0 -1 80b8f6a8 00010400 ithread 196 00000000
0 -1 80ba41e0 00000400 usched 0 00000000
0 -1 80bb43e8 00000400 dsched 0 00000000
0 -1 0bd63af0 00040600 disk_msg_core 00000000 waitport
0 -1 0bd63f70 00040600 devfs_msg_core 00000000 waitport
0 -1 80c0d8c0 00040400 netisr_cpu 0 00000000 waitport
0 -1 80bfa860 00040400 ifnet 0 00000000 waitport
0 -1 0bd643f0 00020680 taskq_cpu 0 0bb6bad0 tqthr
0 -1 0bd64cf0 00020680 firmware taskq 0bb6bb70 tqthr
0 -1 0bd65170 04020e80 crypto 0 80d320a0 crypto_wait
0 -1 0bd65a70 00020e80 crypto returns 80d33840 crypto_ret_wait
0 -1 80b8faf0 00010400 ithread 197 00000000
0 -1 0bd65ef0 00020e80 xpt_thrd 80b32060 ccb_scanq
0 -1 80b8f260 00010400 ithread 195 00000000
0 -1 0bd66370 00040e00 acpi_task 00000000 waitport
0 -1 80b5d610 00010400 ithread 9 00000000
0 -1 80b5eb78 00010400 ithread 14 00000000
0 -1 80b5efc0 00010400 ithread 15 00000000
0 -1 80b5f850 00010400 ithread 17 00000000
0 -1 0bd667f0 00020e80 mpt_recovery0 10100068 idle
0 -1 80b5fc98 00010400 ithread 18 00000000
0 -1 80b5b3d0 00010400 ithread 1 00000000
0 -1 80b5e2e8 00010400 ithread 12 00000000
0 -1 80b5cd80 00010400 ithread 7 00000000
0 -1 80b8e588 00010400 ithread 192 00000000
0 -1 80b5c0a8 00010400 ithread 4 00000000
0 -1 80b5bc60 00010400 ithread 3 00000000
0 -1 0bd66c70 00000600 random 00000000
0 -1 0bd679f0 00040600 rtable_cpu 0 00000000 waitport
0 -1 0bd68bf0 00020680 hammer-M 106f4ad0 hmrfl1
0 -1 0bd69070 00020680 hammer-S0 0bde45c8 hmrssw
0 -1 0bd694f0 00020680 hammer-S1 0bde4630 hmrssw
0 -1 0bd69970 00020680 hammer-S2 0bde4698 hmrssw
0 -1 0bd69df0 00020680 hammer-S3 0bde4700 hmrssw
0 -1 0bd6a270 00020f80 pagedaemon 80d2d410 psleep
0 -1 0bd6a6f0 00020f80 swapcached 80d2d4a0 csleep
0 -1 0bd6ab70 00020e80 vmdaemon 80d2d46c psleep
0 -1 0bd6b470 00020f80 bufdaemon_hw 80be7134 wdrn1
0 -1 0bd6b8f0 00020f80 bufdaemon 80be7134 wdrn1
0 -1 1072ab70 00020e80 vnlru 1072ab70 vlruwt
0 -1 1072aff0 00020e80 syncer 80be7134 wdrn1
0 -1 1072b470 00020f80 consttyd 1072b470 waiting
0 -1 33788770 00020680 hammer-M 33924160 hmrwwa
0 -1 33787e70 00020680 hammer-S0 0bde4838 hmrssw
0 -1 33787570 00020680 hammer-S1 0bde48a0 hmrssw
0 -1 33786c70 00020680 hammer-S2 0bde4908 hmrssw
0 -1 1072c670 00020680 hammer-S3 0bde4970 hmrssw
0 193 3378a270 00020680 dhclient 33a3eb90 kqread
0 569 3378f370 00820680 dntpd 80bd2c60 nanslp
0 773 33786370 00820680 login 33785ef0 wait
0 1 0bd63670 00020680 init 0bd631f0 wait
0 776 3378cf70 00820680 getty 33b437d8 ttyin
0 779 33789070 00820680 getty 33b43cb8 ttyin
0 780 33796cf0 00820680 getty 33b43e58 ttyin
0 783 1072bd70 00820680 bash 1072b8f0 wait
0 819 337975f0 00820680 sh 107300f0 wait
0 843 337948f0 00820680 tee 33630908 piperd
0 1505 3379bdf0 00820680 make 3379b970 wait
0 1506 3379e670 00820680 sh 107324f0 wait
1 -1 80d476b8 00000001 idle_1 00000000
1 -1 80bd30e8 00010402 softclock 1 00000000
1 -1 80ba45f8 00000400 usched 1 00000000
1 -1 80bb4800 00000400 dsched 1 00000000
1 -1 80c0dcc8 00040400 netisr_cpu 1 00000000 waitport
1 -1 80bfac68 00040400 ifnet 1 00000000 waitport
1 -1 0bd64870 00020680 taskq_cpu 1 0bb6bb20 tqthr
1 -1 0bd655f0 04020e80 crypto 1 80d32100 crypto_wait
1 -1 0bd67e70 00040600 rtable_cpu 1 00000000 waitport
1 -1 0bd6aff0 00020e80 pagezero 80d2d724 pgzero
1 179 3378b470 00020680 dhclient 33a3a190 kqread
1 370 3378c670 00820680 syslogd 33a20390 kqread
1 685 337924f0 00820680 sshd 3381a990 kqread
1 699 33791bf0 00820680 sendmail 33a25190 kqread
1 703 337912f0 00820680 sendmail 0fdea968 pause
1 718 33792df0 00820680 cron 80bd2c60 nanslp
1 775 3378ab70 00820680 getty 33b43638 ttyin
1 774 3378d870 00820680 getty 33b43498 ttyin
1 778 33789970 00820680 getty 33b43b18 ttyin
1 777 3378bd70 00820680 getty 33b43978 ttyin
1 817 337987f0 00820680 ssh-agent 33a2d390 kqread
1 842 33797ef0 00820680 time 107309f0 wait
1 844 337999f0 00820680 make 33799570 wait
1 846 3379a2f0 00820680 make 33799e70 wait
1 868 3379abf0 00820680 make 3379a770 wait
1 1504 3379b4f0 00820680 sh 3379b070 wait
1 1507 33780df0 00020680 sh 00c87880 pgrbwt
(kgdb)
cpu curthread wchan
0 softclock 0 00000000
1 idle_1 00000000
(kgdb)
Filesystem Refs Flags
ufs 1 0x00000000
hammer 2 0x00000000
nfs 0 0x00020000
tmpfs 0 0x00000000
mfs 0 0x00000000
null 7 0x00100000
msdos 0 0x00000000
cd9660 0 0x00040000
procfs 1 0x00080000
devfs 2 0x00080000
(kgdb) Cputime limit exceeded

History

#1 Updated by alexh almost 3 years ago

That assert is completely wrong. It's impossible for the lock to be held there,
since the call comes directly from the softclock handler as a result of a
callout_reset. That assert needs to be changed to explicitly acquire the mpt lock.

Please try changing MPT_LOCK_ASSERT() to MPT_LOCK() on line 1242, function
mpt_timeout().

Cheers,
Alex

#2 Updated by alexh almost 3 years ago

That is, MPT_LOCK_ASSERT(mpt) => MPT_LOCK(mpt)

#3 Updated by alexh almost 3 years ago

Forgot to mention that you also need an unlock at the end of the function to
match that MPT_LOCK().

#4 Updated by swildner almost 3 years ago

OK, I've reverted it to the way it was before my last sync (which wrongly removed
the locking and added the assertion).

Please test!

#5 Updated by isenmann almost 3 years ago

Hi,

I was too quick yesterday. Running a kernel with only the LOCK gave me a
lockup and:

[diagnostic] cache_lock: blocked on 0xffffffe0335ecbe8 "spwd.db".

With a new kernel (commit 2be58998960aaaebcfb7e65fea5bb3d4e9df904f), I
got the following during a "make buildworld ... make update":

Sep 1 08:43:59 dfly-h64 kernel: mpt0: request 0xffffffe010289600:40812 timed
out for ccb 0xffffffe01006c678 (req->ccb 0xffffffe01006c678)
Sep 1 08:44:23 dfly-h64 kernel: mpt0: attempting to abort req
0xffffffe010289600:40812 function 0
Sep 1 08:44:23 dfly-h64 kernel: mpt0: completing timedout/aborted req
0xffffffe010289600:40812
Sep 1 08:44:23 dfly-h64 kernel: mpt0: abort of req 0xffffffe010289600:0 completed

So this system will run the latest master (kernel and world) now, but it
will not be used much.

[ I have used it to port ghc to x86_64 and will use it to test
pkg_rolling-replace from time to time, before doing the same on the
desktop at home. ]

#6 Updated by swildner almost 3 years ago

Hmm, interesting...

I've never seen mpt(4) time out in VMware, but it looks like it is now at
least timing out properly.

#7 Updated by alexh almost 3 years ago

Only a LOCK *and* the corresponding UNLOCK, I assume? It's probably easier to
properly update the sources from git, now that Sascha has committed a fix, and
try that.

Cheers,
Alex

#8 Updated by isenmann almost 3 years ago

After an update to the latest git sources, this VMware guest has been
building world and updating pkgsrc packages without any obvious
problems for some hours now, while at least a second VMware guest is
also busy building packages.

I can see (only) three aborted mpt0 requests now:

Sep 1 08:43:59 dfly-h64 kernel: mpt0: request 0xffffffe010289600:40812 timed
out for ccb 0xffffffe01006c678 (req->ccb 0xffffffe01006c678)
Sep 1 08:44:23 dfly-h64 kernel: mpt0: attempting to abort req
0xffffffe010289600:40812 function 0
Sep 1 08:44:23 dfly-h64 kernel: mpt0: completing timedout/aborted req
0xffffffe010289600:40812
Sep 1 08:44:23 dfly-h64 kernel: mpt0: abort of req 0xffffffe010289600:0 completed
Sep 2 10:17:14 dfly-h64 kernel: mpt0: request 0xffffffe010291130:1633 timed out
for ccb 0xffffffe01006b178 (req->ccb 0xffffffe01006b178)
Sep 2 10:17:14 dfly-h64 kernel: mpt0: attempting to abort req
0xffffffe010291130:1633 function 0
Sep 2 10:17:14 dfly-h64 kernel: mpt0: completing timedout/aborted req
0xffffffe010291130:1633
Sep 2 10:17:14 dfly-h64 kernel: mpt0: abort of req 0xffffffe010291130:0 completed
Sep 2 11:56:41 dfly-h64 kernel: mpt0: request 0xffffffe010290b00:2130 timed out
for ccb 0xffffffe01006b178 (req->ccb 0xffffffe01006b178)
Sep 2 11:57:22 dfly-h64 kernel: mpt0: attempting to abort req
0xffffffe010290b00:2130 function 0
Sep 2 11:57:22 dfly-h64 kernel: mpt0: completing timedout/aborted req
0xffffffe010290b00:2130
Sep 2 11:57:22 dfly-h64 kernel: mpt0: abort of req 0xffffffe010290b00:0 completed

I would have expected to see this more often, since last week I got a
panic between 5 and 30 minutes after starting buildworld four or five
times.

But anyway, I think this problem is resolved, and the mpt
timeout/locking now works properly.
