Bug #2296: panic: assertion "m->wire_count > 0" failed - DragonFlyBSD - DragonFlyBSD bugtracker

Bug #2296

open

panic: assertion "m->wire_count > 0" failed

Added by thomas.nikolajsen over 13 years ago. Updated almost 13 years ago.

Status: In Progress
Priority: High
Assignee:
Category:
Target version: 6.4
Start date: 02/01/2012
Due date:
% Done:
Estimated time:

Description

With recent (2012-01-29) master and rel3_0 I get this panic
during parallel make release and buildworld, e.g.:
'make MAKE_JOBS=10 release' (i.e. make -j10)
on i386 STANDARD
(custom kernel, includes INCLUDE_CONFIG_FILE)
on an 8-core host (Opteron).

Got this panic twice; succeeds w/o MAKE_JOBS

Core dump at leaf: ~thomas:crash/octopus.i386.3

Related issues 3 (0 open — 3 closed)

Updated by thomas.nikolajsen over 13 years ago

Some additional observations:
I have only seen this issue on i386, not on x86_64,
so it seems to be specific to i386.

I have only seen issue while doing buildworld
(e.g. as part of make release), not while doing quickworld.
(these & kernel builds are the parallel workloads I run most on this host)
As originally noted: issue seen with -j10 on 8 core system using SMP kernel;
not seen w/o -jN; haven't tried lower N.

On UP system running SMP kernel I haven't observed issue;
this was using -j3.

Updated by phma over 13 years ago

My crash dump 11 on leaf seems to be an instance of this bug.

Updated by thomas.nikolajsen over 13 years ago

Priority changed from Normal to High

Problem still present in rel3_0;
just did test again, as per original description,
panic on 1st try, again :(

x86_64 code in failing area is changed more recently than i386,
maybe this is related.

http://leaf.dragonflybsd.org/mailarchive/commits/2011-11/msg00147.html
git: kernel - Adjust tlb invalidation in the x86-64 pmap code

Details from current dump (the old dump on leaf has similar info):
panic: assertion "m->wire_count > 0" failed in pmap_unwire_pte at /usr/src/sys/platform/pc32/i386/pmap.c:1091
..
CPU5 stopping CPUs: 0x000000df
stopped
SECONDARY PANIC ON CPU 2 THREAD 0xd73e1e60
..
_get_mycpu () at ./machine/thread.h:79
79 __asm ("movl %%fs:globaldata,%0" : "=r" (gd) : "m"(_mycpu__dummy));
(kgdb) bt
#0 _get_mycpu () at ./machine/thread.h:79
#1 md_dumpsys (di=0xc0795d60) at /usr/src/sys/platform/pc32/i386/dump_machdep.c:264
#2 0xc01c0898 in dumpsys () at /usr/src/sys/kern/kern_shutdown.c:925
#3 0xc0159f7a in db_fncall (dummy1=-1070299438, dummy2=0, dummy3=-1072322965, dummy4=0xe844f940 "\324J4\300<\211;\300")
at /usr/src/sys/ddb/db_command.c:539
#4 0xc015a45f in db_command (aux_cmd_tablep_end=0xc03e9884, aux_cmd_tablep=0xc03e9880, cmd_table=<optimized out>,
last_cmdp=<optimized out>) at /usr/src/sys/ddb/db_command.c:401
#5 db_command_loop () at /usr/src/sys/ddb/db_command.c:467
#6 0xc015cfbe in db_trap (type=3, code=0) at /usr/src/sys/ddb/db_trap.c:71
#7 0xc0348a45 in kdb_trap (type=3, code=0, regs=0xe844fa60) at /usr/src/sys/platform/pc32/i386/db_interface.c:151
#8 0xc037829a in trap (frame=0xe844fa60) at /usr/src/sys/platform/pc32/i386/trap.c:838
#9 0xc0349f37 in calltrap () at /usr/src/sys/platform/pc32/i386/exception.s:787
#10 0xc03486d2 in breakpoint () at ./cpu/cpufunc.h:72
#11 Debugger (msg=0xc03aead3 "panic") at /usr/src/sys/platform/pc32/i386/db_interface.c:333
#12 0xc01c10f8 in panic (fmt=0xc03c00c4 "assertion \"%s\" failed in %s at %s:%u") at /usr/src/sys/kern/kern_shutdown.c:822
#13 0xc037417c in pmap_unwire_pte (info=<optimized out>, m=<optimized out>, pmap=<optimized out>)
at /usr/src/sys/platform/pc32/i386/pmap.c:1091
#14 pmap_unuse_pt (pmap=0xdac8e818, va=3225142912, mpte=0xc46a1540, info=0xe844fb28)
at /usr/src/sys/platform/pc32/i386/pmap.c:1131
#15 0xc03743c6 in pmap_remove_all (m=0xc23317e0) at /usr/src/sys/platform/pc32/i386/pmap.c:2038
#16 0xc0374528 in pmap_page_protect (m=0xc23317e0, prot=0 '\000') at /usr/src/sys/platform/pc32/i386/pmap.c:3111
#17 0xc02f19a5 in vm_page_protect (prot=<optimized out>, m=<optimized out>) at /usr/src/sys/vm/vm_page.h:535
#18 vm_fault_object (fs=0xe844fc50, first_pindex=<optimized out>, fault_type=2 '\002') at /usr/src/sys/vm/vm_fault.c:1660
#19 0xc02f27d3 in vm_fault (map=0xdac870f0, vaddr=672600064, fault_type=<optimized out>, fault_flags=12)
at /usr/src/sys/vm/vm_fault.c:497
#20 0xc0377ad3 in trap_pfault (frame=0xe844fd40, usermode=<optimized out>, eva=<optimized out>)
at /usr/src/sys/platform/pc32/i386/trap.c:1006
#21 0xc0377f8a in trap (frame=0xe844fd40) at /usr/src/sys/platform/pc32/i386/trap.c:596
#22 0xc0349f37 in calltrap () at /usr/src/sys/platform/pc32/i386/exception.s:787
#23 0x2805d276 in ?? ()
Backtrace stopped: previous frame inner to this frame (corrupt stack?)
(kgdb) frame 13
#13 0xc037417c in pmap_unwire_pte (info=<optimized out>, m=<optimized out>, pmap=<optimized out>)
at /usr/src/sys/platform/pc32/i386/pmap.c:1091
1091 KKASSERT(m->wire_count > 0);
(kgdb) l
1086 * pmap_release() will catch the case.
1087 */
1088 static PMAP_INLINE int
1089 pmap_unwire_pte(pmap_t pmap, vm_page_t m, pmap_inval_info_t info)
1090 {
1091 KKASSERT(m->wire_count > 0);
1092 if (m->wire_count > 1) {
1093 if (vm_page_unwire_quick(m))
1094 panic("pmap_unwire_pte: Insufficient wire_count");
1095 return 0;
(kgdb) f 14
#14 pmap_unuse_pt (pmap=0xdac8e818, va=3225142912, mpte=0xc46a1540, info=0xe844fb28)
at /usr/src/sys/platform/pc32/i386/pmap.c:1131
1131 return pmap_unwire_pte(pmap, mpte, info);
(kgdb) l
1126 pmap->pm_ptphint = mpte;
1127 vm_page_wakeup(mpte);
1128 }
1129 }
1130
1131 return pmap_unwire_pte(pmap, mpte, info);
1132 }
1133
1134 /*
1135 * Initialize pmap0/vmspace0. This pmap is not added to pmap_list because
(kgdb) p pmap
$2 = (pmap_t) 0xdac8e818
(kgdb) p mpte
$3 = (vm_page_t) 0xc46a1540
(kgdb) p info
$4 = (pmap_inval_info_t) 0xe844fb28
(kgdb) p *mpte
$5 = {pageq = {tqe_next = 0x0, tqe_prev = 0xc0f450e4}, rb_entry = {rbe_left = 0x0, rbe_right = 0xc30a89c0,
rbe_parent = 0x0, rbe_color = 0}, object = 0x0, pindex = 160, phys_addr = 2982219776, md = {pv_list_count = 0,
pv_list = {tqh_first = 0x0, tqh_last = 0xc46a1570}}, queue = 7, pc = 6, act_count = 0 '\000', busy = 0 '\000',
unused01 = 0 '\000', unused02 = 0 '\000', flags = 64, wire_count = 0, hold_count = 0, valid = 0 '\000', dirty = 0 '\000',
ku_pagecnt = 0}
(kgdb) p *pmap
$6 = {pm_pdir = 0xe8a5c000, pm_pdirm = 0xc30a89c0, pm_pteobj = 0xeacc7690, pm_pmnode = {tqe_next = 0xdaab4818,
tqe_prev = 0xdac850a4}, pm_pvlist = {tqh_first = 0x0, tqh_last = 0xdac8e82c}, pm_pvlist_free = {tqh_first = 0x0,
tqh_last = 0xdac8e834}, pm_count = 1, pm_active = 0, pm_cached = 0, pm_filler02 = 0, pm_stats = {resident_count = 1,
wired_count = 0}, pm_ptphint = 0x0, pm_generation = 59768, pm_spin = {counta = 0, countb = 0}, pm_token = {t_count = 0,
t_ref = 0x0, t_collisions = 0, t_desc = 0xc03bdc2c "pmap_tok"}}
(kgdb) p *info
$7 = {pir_flags = 0, pir_va = 672595968, pir_cpusync = {cs_mask = 64, cs_mack = 64,
cs_func = 0xc037574f <pmap_inval_callback>, cs_data = 0xe844fb28}}

Updated by thomas.nikolajsen over 13 years ago

Did some more tests on i386, reducing the number of CPUs used and/or parallel jobs;
didn't get any crash in 3-4 runs of each:
- on 8 cpu system: 'make MAKE_JOBS=5 release'
- using 4 cpu: (hw.ap_max=3): 'make MAKE_JOBS=10 release'

Looking at build logs: panic reported earlier happened during
build of git (scmgit), or some dependency.

My setup uses NFS for /usr/src and /usr/pkgsrc,
and HAMMER for the rest (e.g. /usr/release).
(don't know if NFS use is important for triggering panic)

Updated by vsrinivas over 13 years ago

Aha, NFS might be at fault. I'll switch to trying to reproduce it there. I've done five days of kernel building at -j10 on a 2-cpu i386/GENERIC system and haven't managed to reproduce it, but all my filesystems there are UFS.

In the core uploaded, there is a page table page marked as not only not wired, but also on the free queue and PG_ZEROed. That's a pretty unexpected, bad state.

The x86-64 pmap is constructed fairly differently wrt synchronization than the i386 version; the i386 pmap uses the vm_token still, whereas the x86-64 one uses a fine-grained approach at the page level. The vm_token is rather easy to lose in blocking conditions, which might be the issue at fault here (we're losing a token and stuff is getting changed under us).

Updated by thomas.nikolajsen over 13 years ago

Got panic while testing w/o NFS use.
This time 1st round (of make release) succeeded;
got panic early in 2nd round (during buildworld 4a).

A wild guess: I usually run shell script doing some sysctl's, to see system load, like cpu freq (running powerd); didn't do that on 1st round, maybe sysctl can be part of trigger (saw other sysctl related bug on bugs@).

Anyway: doing test on x86_64 now to see if it's clean; didn't do much testing for this bug there yet.

Updated by thomas.nikolajsen over 13 years ago

x86_64 seems clean: 7 rounds succeeded.

This is using setup as in initial description:
'make MAKE_JOBS=10 release' on 8 cpu host
w/ NFS for /usr/src & /usr/pkgsrc.

Seems like we should port our current x86_64
pmap implementation to i386.

Updated by thomas.nikolajsen over 13 years ago

Note: on v3.0.1 i386 system building dfly v3.0.1
(which I understand we are going to release)
once (to install on UP stable system here)
caused this panic during buildworld.

This was on 8 core system using my usual params,
-j10; did a new build using -j4, which succeeded.

This is somewhat ironic, but just what should be expected from earlier observations on this bug.

If we release v3.0.1 we might consider not supporting
i386 w/ more than 4 cores, until this bug is resolved.

Updated by phma over 13 years ago

I have only 2 cpus and I've seen this bug.

I also found that my ktimetracker file was zeroed on the same day this happened. But I had another crash the same day, and my other zeroed file was zeroed on other days, so I can't be sure that this bug zeroes files.

#10

Updated by marino about 13 years ago

Already reported to vsrinivas -

DragonFly v3.1.0.634.gc6fd7-DEVELOPMENT #3: Sat May 5 09:02:18 CEST 2012 root@dracofly.synsport.com:/usr/obj/usr/src/sys/GENERIC

4-core intel Core-i5

http://leaf.dragonflybsd.org/~marino/core/core.wirecount.txt

core dump located on leaf, ~marino/crash

#11

Updated by vsrinivas about 13 years ago

pkgbox32 has crashed a few times with this panic also; there are recent cores (.15) in its /var/crash.

#12

Updated by marino almost 13 years ago

Status changed from New to In Progress

Matt Dillon may have found and fixed the smoking gun:
http://gitweb.dragonflybsd.org/dragonfly.git/commitdiff/2bb9cc6fdf20935fe2e3dfbfedc4eb353034b935

#13

Updated by marino almost 13 years ago

There is something wrong with the new patch.
A new incremental bulk build did not get very far before panicking:
http://leaf.dragonflybsd.org/~marino/core/core.freeing_wired_page_table_page.txt

The core file will be uploaded to ~marino/crash

#14

Updated by marino almost 13 years ago

Promising update to previous patch:
http://gitweb.dragonflybsd.org/dragonfly.git/commitdiff/b148267406ef2d0543d5d87d15c283b2d314516f

#15

Updated by thomas.nikolajsen almost 13 years ago

This panic is still seen with current master,
but less often than before the recent commits that were meant to address it.
Should I upload recent core dump?

#16

Updated by dillon almost 13 years ago

I believe there is a race in vm_page_protect() --> pmap_page_protect(). I
have not been able to find it. However, all the crashes came from the
same path from vm_fault in a situation where we don't actually have to
call vm_page_protect().

You can try this patch:

fetch http://apollo.backplane.com/DFlyMisc/wire01.patch

This removes the unnecessary call for the code path in question. It
is NOT a fix, but if the crashes go away then I'll know for sure
that it's an issue w/vm_page_protect().

-Matt

#17

Updated by thomas.nikolajsen almost 13 years ago

Did tests w/ wire01.patch applied, and it didn't panic.

Did 17 'make -j10 buildworld buildkernel' test runs.
Test was a bit different than prev. run:
used MAKEOBJDIRPREFIX on tmpfs this time, vs nfs or hammer in prev. tests
(for faster tests & less hammer fill-up).
Will do same tests on master (i.e. w/o patch) to see if I get panic on this setup.

#18

Updated by marino almost 13 years ago

I almost completed a multi-day bulk build, but it panicked with less than 400 packages to go.
This was a very recent stock kernel WITH the wire01.patch applied.

http://leaf.dragonflybsd.org/~marino/core/core.page_not_busy.4.txt

I will upload the core later to ~marino/crash

#19

Updated by thomas.nikolajsen almost 13 years ago

Did tests on master (w/o patch), got panic, as expected.

Did 4 'make -j10 buildworld buildkernel' test runs,
had one panic during first run.

Test was same setup as prev. run:
used MAKEOBJDIRPREFIX on tmpfs this time (and /usr/src on NFS, as usual),
on 8 core system w/ 8GB running i386 (i.e. using ~3GB (4GB minus PCI mem. space)).
