Bug #2296

panic: assertion "m->wire_count > 0" failed

Added by thomas.nikolajsen almost 3 years ago. Updated over 2 years ago.

Status: In Progress
Start date: 02/01/2012
Priority: High
Due date:
Assignee: -
% Done: 0%
Category: -
Target version: -

Description

With recent (29/1-12) master and rel3_0 I get this panic
during parallel make release and buildworld, e.g.:
'make MAKE_JOBS=10 release' (i.e. make -j10)
i386 STANDARD
(custom kernel, includes INCLUDE_CONFIG_FILE)
on 8 core host (opteron).

Got this panic twice; succeeds w/o MAKE_JOBS

Core dump at leaf: ~thomas:crash/octopus.i386.3


Related issues

Related to Bug #2327: 3.0.2 catchall ticket Closed 03/08/2012
Related to Bug #2336: 3.0.3 catchall Resolved 03/26/2012
Related to Bug #2402: Showstopper panics for Release 3.2 New 08/15/2012

History

#1 Updated by thomas.nikolajsen almost 3 years ago

Some additional observations:
I have only seen this issue on i386, not on x86_64,
so it seems to be specific to i386.

I have only seen issue while doing buildworld
(e.g. as part of make release), not while doing quickworld.
(this & kernel build are the parallel workloads I do most on this host)
As originally noted: issue seen with -j10 on 8 core system using SMP kernel;
not seen w/o -jN; haven't tried lower N.

On UP system running SMP kernel I haven't observed issue;
this was using -j3.

#2 Updated by phma almost 3 years ago

My crash dump 11 on leaf seems to be an instance of this bug.

#3 Updated by thomas.nikolajsen almost 3 years ago

  • Priority changed from Normal to High

Problem still present in rel3_0;
just did test again, as per original description,
panic on 1st try, again :(

x86_64 code in the failing area was changed more recently than the i386 code;
maybe this is related.

http://leaf.dragonflybsd.org/mailarchive/commits/2011-11/msg00147.html
git: kernel - Adjust tlb invalidation in the x86-64 pmap code

- details from current dump (old dump on leaf has similar info)
panic: assertion "m->wire_count > 0" failed in pmap_unwire_pte at /usr/src/sys/platform/pc32/i386/pmap.c:1091
..
CPU5 stopping CPUs: 0x000000df
stopped
SECONDARY PANIC ON CPU 2 THREAD 0xd73e1e60
..
_get_mycpu () at ./machine/thread.h:79
79 __asm ("movl %%fs:globaldata,%0" : "=r" (gd) : "m"(__mycpu__dummy));
(kgdb) bt
#0 _get_mycpu () at ./machine/thread.h:79
#1 md_dumpsys (di=0xc0795d60) at /usr/src/sys/platform/pc32/i386/dump_machdep.c:264
#2 0xc01c0898 in dumpsys () at /usr/src/sys/kern/kern_shutdown.c:925
#3 0xc0159f7a in db_fncall (dummy1=-1070299438, dummy2=0, dummy3=-1072322965, dummy4=0xe844f940 "\324J4\300<\211;\300")
at /usr/src/sys/ddb/db_command.c:539
#4 0xc015a45f in db_command (aux_cmd_tablep_end=0xc03e9884, aux_cmd_tablep=0xc03e9880, cmd_table=<optimized out>,
last_cmdp=<optimized out>) at /usr/src/sys/ddb/db_command.c:401
#5 db_command_loop () at /usr/src/sys/ddb/db_command.c:467
#6 0xc015cfbe in db_trap (type=3, code=0) at /usr/src/sys/ddb/db_trap.c:71
#7 0xc0348a45 in kdb_trap (type=3, code=0, regs=0xe844fa60) at /usr/src/sys/platform/pc32/i386/db_interface.c:151
#8 0xc037829a in trap (frame=0xe844fa60) at /usr/src/sys/platform/pc32/i386/trap.c:838
#9 0xc0349f37 in calltrap () at /usr/src/sys/platform/pc32/i386/exception.s:787
#10 0xc03486d2 in breakpoint () at ./cpu/cpufunc.h:72
#11 Debugger (msg=0xc03aead3 "panic") at /usr/src/sys/platform/pc32/i386/db_interface.c:333
#12 0xc01c10f8 in panic (fmt=0xc03c00c4 "assertion \"%s\" failed in %s at %s:%u") at /usr/src/sys/kern/kern_shutdown.c:822
#13 0xc037417c in pmap_unwire_pte (info=<optimized out>, m=<optimized out>, pmap=<optimized out>)
at /usr/src/sys/platform/pc32/i386/pmap.c:1091
#14 pmap_unuse_pt (pmap=0xdac8e818, va=3225142912, mpte=0xc46a1540, info=0xe844fb28)
at /usr/src/sys/platform/pc32/i386/pmap.c:1131
#15 0xc03743c6 in pmap_remove_all (m=0xc23317e0) at /usr/src/sys/platform/pc32/i386/pmap.c:2038
#16 0xc0374528 in pmap_page_protect (m=0xc23317e0, prot=0 '\000') at /usr/src/sys/platform/pc32/i386/pmap.c:3111
#17 0xc02f19a5 in vm_page_protect (prot=<optimized out>, m=<optimized out>) at /usr/src/sys/vm/vm_page.h:535
#18 vm_fault_object (fs=0xe844fc50, first_pindex=<optimized out>, fault_type=2 '\002') at /usr/src/sys/vm/vm_fault.c:1660
#19 0xc02f27d3 in vm_fault (map=0xdac870f0, vaddr=672600064, fault_type=<optimized out>, fault_flags=12)
at /usr/src/sys/vm/vm_fault.c:497
#20 0xc0377ad3 in trap_pfault (frame=0xe844fd40, usermode=<optimized out>, eva=<optimized out>)
at /usr/src/sys/platform/pc32/i386/trap.c:1006
#21 0xc0377f8a in trap (frame=0xe844fd40) at /usr/src/sys/platform/pc32/i386/trap.c:596
#22 0xc0349f37 in calltrap () at /usr/src/sys/platform/pc32/i386/exception.s:787
#23 0x2805d276 in ?? ()
Backtrace stopped: previous frame inner to this frame (corrupt stack?)
(kgdb) frame 13
#13 0xc037417c in pmap_unwire_pte (info=<optimized out>, m=<optimized out>, pmap=<optimized out>)
at /usr/src/sys/platform/pc32/i386/pmap.c:1091
1091 KKASSERT(m->wire_count > 0);
(kgdb) l
1086 * pmap_release() will catch the case.
1087 */
1088 static PMAP_INLINE int
1089 pmap_unwire_pte(pmap_t pmap, vm_page_t m, pmap_inval_info_t info)
1090 {
1091 KKASSERT(m->wire_count > 0);
1092 if (m->wire_count > 1) {
1093 if (vm_page_unwire_quick(m))
1094 panic("pmap_unwire_pte: Insufficient wire_count");
1095 return 0;
(kgdb) f 14
#14 pmap_unuse_pt (pmap=0xdac8e818, va=3225142912, mpte=0xc46a1540, info=0xe844fb28)
at /usr/src/sys/platform/pc32/i386/pmap.c:1131
1131 return pmap_unwire_pte(pmap, mpte, info);
(kgdb) l
1126 pmap->pm_ptphint = mpte;
1127 vm_page_wakeup(mpte);
1128 }
1129 }
1130
1131 return pmap_unwire_pte(pmap, mpte, info);
1132 }
1133
1134 /*
1135 * Initialize pmap0/vmspace0. This pmap is not added to pmap_list because
(kgdb) p pmap
$2 = (pmap_t) 0xdac8e818
(kgdb) p mpte
$3 = (vm_page_t) 0xc46a1540
(kgdb) p info
$4 = (pmap_inval_info_t) 0xe844fb28
(kgdb) p *mpte
$5 = {pageq = {tqe_next = 0x0, tqe_prev = 0xc0f450e4}, rb_entry = {rbe_left = 0x0, rbe_right = 0xc30a89c0,
rbe_parent = 0x0, rbe_color = 0}, object = 0x0, pindex = 160, phys_addr = 2982219776, md = {pv_list_count = 0,
pv_list = {tqh_first = 0x0, tqh_last = 0xc46a1570}}, queue = 7, pc = 6, act_count = 0 '\000', busy = 0 '\000',
unused01 = 0 '\000', unused02 = 0 '\000', flags = 64, wire_count = 0, hold_count = 0, valid = 0 '\000', dirty = 0 '\000',
ku_pagecnt = 0}
(kgdb) p *pmap
$6 = {pm_pdir = 0xe8a5c000, pm_pdirm = 0xc30a89c0, pm_pteobj = 0xeacc7690, pm_pmnode = {tqe_next = 0xdaab4818,
tqe_prev = 0xdac850a4}, pm_pvlist = {tqh_first = 0x0, tqh_last = 0xdac8e82c}, pm_pvlist_free = {tqh_first = 0x0,
tqh_last = 0xdac8e834}, pm_count = 1, pm_active = 0, pm_cached = 0, pm_filler02 = 0, pm_stats = {resident_count = 1,
wired_count = 0}, pm_ptphint = 0x0, pm_generation = 59768, pm_spin = {counta = 0, countb = 0}, pm_token = {t_count = 0,
t_ref = 0x0, t_collisions = 0, t_desc = 0xc03bdc2c "pmap_tok"}}
(kgdb) p *info
$7 = {pir_flags = 0, pir_va = 672595968, pir_cpusync = {cs_mask = 64, cs_mack = 64,
cs_func = 0xc037574f <pmap_inval_callback>, cs_data = 0xe844fb28}}

#4 Updated by thomas.nikolajsen almost 3 years ago

Did some more tests on i386, reducing the number of CPUs used and/or parallel jobs;
didn't get any crash in 3-4 runs of each:
- on 8 cpu system: 'make MAKE_JOBS=5 release'
- using 4 cpu: (hw.ap_max=3): 'make MAKE_JOBS=10 release'

Looking at build logs: panic reported earlier happened during
build of git (scmgit), or some dependency.

My setup uses NFS for /usr/src and /usr/pkgsrc,
and HAMMER for the rest (e.g. /usr/release).
(don't know if NFS use is important for triggering panic)

#5 Updated by vsrinivas almost 3 years ago

Aha, NFS might be at fault. I'll switch to trying to reproduce it there. I've done five days of kernel building at -j10 on a 2-cpu i386/GENERIC system, haven't managed to reproduce it, but all my filesystems are UFS there.

In the core uploaded, there is a page table page marked as not only not wired, but also on the free queue and PG_ZEROed. That's a pretty unexpected, bad state.
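
To make that reading of the dump concrete, here is a minimal userland sketch (not kernel code) that checks the fields printed by `p *mpte` above. The struct, the function and the ANOM_* names are hypothetical; the only values taken from this ticket are wire_count = 0, flags = 64, object = 0x0 and queue = 7. That flags = 64 means PG_ZERO is an assumption based on the description above, and the queue number is carried along without being decoded.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical mirror of the few vm_page fields shown in the kgdb dump. */
struct page_snapshot {
    const void *object;     /* owning VM object, NULL in the dump */
    uint32_t    flags;      /* page flags, 64 in the dump */
    uint32_t    wire_count; /* 0 in the dump */
    uint16_t    queue;      /* 7 in the dump, not interpreted here */
};

/* Assumption: flags = 64 (0x40) is the PG_ZERO bit, as the comment implies. */
#define SNAP_PG_ZERO 0x40u

enum {
    ANOM_NOT_WIRED = 1 << 0, /* page table page with wire_count == 0 */
    ANOM_ZEROED    = 1 << 1, /* marked as pre-zeroed */
    ANOM_NO_OBJECT = 1 << 2  /* no longer attached to any object */
};

/* Return a bitmask of the anomalies found in one snapshot. */
unsigned
page_anomalies(const struct page_snapshot *p)
{
    unsigned found = 0;

    assert(p != NULL);
    if (p->wire_count == 0)
        found |= ANOM_NOT_WIRED;
    if (p->flags & SNAP_PG_ZERO)
        found |= ANOM_ZEROED;
    if (p->object == NULL)
        found |= ANOM_NO_OBJECT;
    return found;
}

/* Values copied from "p *mpte" in the dump above. */
const struct page_snapshot dumped_mpte = { NULL, 64, 0, 7 };
```

Calling page_anomalies(&dumped_mpte) sets all three bits, which is the "already freed" picture described above: pmap_unuse_pt() was handed a page table page that something else had already released.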

The x86-64 pmap is constructed fairly differently wrt synchronization than the i386 version; the i386 pmap uses the vm_token still, whereas the x86-64 one uses a fine-grained approach at the page level. The vm_token is rather easy to lose in blocking conditions, which might be the issue at fault here (we're losing a token and stuff is getting changed under us).
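
A toy, single-threaded model of that failure mode, for illustration only. It does not use the real LWKT token API or any pmap code; every name below (toy_token, toy_block, toy_unwire, ...) is hypothetical. The "intruder" callback stands in for whatever other thread runs while the token holder is blocked.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct toy_page  { int wire_count; };
struct toy_token { bool held; };

typedef void (*toy_intruder_fn)(struct toy_page *);

enum toy_result { TOY_OK, TOY_SKIPPED, TOY_ASSERT_WOULD_FIRE };

/* Stand-in for another thread freeing the page while we sleep. */
void
toy_free_page(struct toy_page *pg)
{
    pg->wire_count = 0;
}

/*
 * Model of a blocking call: the token is implicitly dropped while the
 * holder sleeps, other code may run, and the token is taken again on wakeup.
 */
void
toy_block(struct toy_token *tok, struct toy_page *pg, toy_intruder_fn intruder)
{
    assert(tok->held);
    tok->held = false;
    if (intruder != NULL)
        intruder(pg);
    tok->held = true;
}

/*
 * Check the page, block, then unwire it. With recheck == false the state
 * seen before blocking is trusted afterwards, which is the hazard; with
 * recheck == true a page that changed while we were blocked is skipped.
 */
enum toy_result
toy_unwire(struct toy_token *tok, struct toy_page *pg,
           toy_intruder_fn intruder, bool recheck)
{
    enum toy_result res;

    tok->held = true;                  /* acquire */
    if (pg->wire_count <= 0) {
        tok->held = false;
        return TOY_SKIPPED;            /* nothing to unwire */
    }

    toy_block(tok, pg, intruder);      /* token lost and regained here */

    if (pg->wire_count > 0) {
        pg->wire_count--;
        res = TOY_OK;
    } else if (recheck) {
        res = TOY_SKIPPED;             /* noticed the change, back off */
    } else {
        res = TOY_ASSERT_WOULD_FIRE;   /* the KKASSERT(wire_count > 0) case */
    }
    tok->held = false;                 /* release */
    return res;
}
```

With toy_free_page as the intruder and recheck off, toy_unwire() returns TOY_ASSERT_WOULD_FIRE; with recheck on it returns TOY_SKIPPED. Whether the real i386 pmap path blocks in a comparable place is exactly the open question here, so this only shows the shape of the suspected bug, not its location.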

#6 Updated by thomas.nikolajsen almost 3 years ago

Got panic while testing w/o NFS use.
This time 1st round (of make release) succeeded;
got panic early in 2nd round (during buildworld 4a).

A wild guess: I usually run shell script doing some sysctl's, to see system load, like cpu freq (running powerd); didn't do that on 1st round, maybe sysctl can be part of trigger (saw other sysctl related bug on bugs@).

Anyway: doing test on x86_64 now to see if it's clean; didn't do much testing for this bug there yet.

#7 Updated by thomas.nikolajsen almost 3 years ago

x86_64 seems clean: 7 rounds succeeded.

This is using setup as in initial description:
'make MAKE_JOBS=10 release' on 8 cpu host
w/ NFS for /usr/src & /usr/pkgsrc.

Seems like we should port our current x86_64
pmap implementation to i386.

#8 Updated by thomas.nikolajsen almost 3 years ago

Note: on v3.0.1 i386 system building dfly v3.0.1
(which I understand we are going to release)
once (to install on UP stable system here)
caused this panic during buildworld.

This was on 8 core system using my usual params,
-j10; did new build using -j4, which succeeded.

This is somewhat ironic, but just what should be expected from earlier observations on this bug.

If we release v3.0.1 we might consider not supporting
i386 w/ more than 4 cores, until this bug is resolved.

#9 Updated by phma almost 3 years ago

I have only 2 cpus and I've seen this bug.

I also found that my ktimetracker file was zeroed on the same day this happened. But I had another crash the same day, and my other zeroed file was zeroed on other days, so I can't be sure that this bug zeroes files.

#10 Updated by marino over 2 years ago

Already reported to vsrinivas -

DragonFly v3.1.0.634.gc6fd7-DEVELOPMENT #3: Sat May 5 09:02:18 CEST 2012 :/usr/obj/usr/src/sys/GENERIC

4-core intel Core-i5

http://leaf.dragonflybsd.org/~marino/core/core.wirecount.txt

core dump located on leaf, ~marino/crash

#11 Updated by vsrinivas over 2 years ago

pkgbox32 has crashed a few times with this panic also; there are recent cores (.15) in its /var/crash.

#12 Updated by marino over 2 years ago

  • Status changed from New to In Progress

#13 Updated by marino over 2 years ago

There is something wrong with the new patch.
A new incremental bulk build did not get very far before panicking:
http://leaf.dragonflybsd.org/~marino/core/core.freeing_wired_page_table_page.txt

The core file will be uploaded to ~marino/crash

#15 Updated by thomas.nikolajsen over 2 years ago

This panic is still seen with current master,
but less often than before the recent commits that were meant to address it.
Should I upload recent core dump?

#16 Updated by dillon over 2 years ago

I believe there is a race in vm_page_protect() --> pmap_page_protect(). I
have not been able to find it. However, all the crashes came from the
same path from vm_fault in a situation where we don't actually have to
call vm_page_protect().

You can try this patch:

fetch http://apollo.backplane.com/DFlyMisc/wire01.patch

This removes the unnecessary call for the code path in question. It
is NOT a fix, but if the crashes go away then I'll know for sure
that it's an issue w/vm_page_protect().

-Matt

#17 Updated by thomas.nikolajsen over 2 years ago

Did tests w/ wire01.patch applied, and it didn't panic.

Did 17 'make -j10 buildworld buildkernel' test runs.
Test was a bit different than prev. run:
used MAKEOBJDIRPREFIX on tmpfs this time, vs nfs or hammer in prev. tests
(for faster tests & less hammer fill-up).
Will do same tests on master (i.e. w/o patch) to see if I get panic on this setup.

#18 Updated by marino over 2 years ago

I *almost* completed a multi-day bulk build, but it panicked with less than 400 packages to go.
This was a very recent stock kernel WITH the wire01.patch applied.

http://leaf.dragonflybsd.org/~marino/core/core.page_not_busy.4.txt

I will upload the core later to ~marino/crash

#19 Updated by thomas.nikolajsen over 2 years ago

Did tests on master (w/o patch), got panic, as expected.

Did 4 'make -j10 buildworld buildkernel' test runs,
had one panic during first run.

Test was same setup as prev. run:
used MAKEOBJDIRPREFIX on tmpfs this time (and /usr/src on NFS, as usual),
on 8 core system w/ 8GB running i386 (i.e. using ~3GB (4GB minus PCI mem. space)).
