Bug #1635
closedpanic: assertion: pmap->pm_stats.resident_count > 0 in pmap_release_free_page
0%
Description
Hello.
I've been seeing this panic since the beginning of this month; it's not
the panic on vkernel itself, but the DragonFly kernel running the vkernel
executable. Basically what I have to do to reproduce the panic is to run
buildworld on vkernel (sometimes buildkernel is enough, sometimes not).
I used -n16 on vkernel's command line argument, but probably I could trigger
the panic with a smaller number.
The first version which triggered the panic was 376534279, one version
earlier than the pmap_map() syntaxis change. The backtrace below was
from the kernel built from the source tree as of a307f7025.
#4 0xc01a14f8 in panic (fmt=0xc03031fe "assertion: %s in %s")
at /usr/src/sys/kern/kern_shutdown.c:743
#5 0xc02d9bee in pmap_release_free_page (pmap=0x0, p=0xc1944010)
at /usr/src/sys/platform/pc32/i386/pmap.c:1157
#6 0xc02d9e26 in pmap_release (pmap=0xd6ea7c14)
at /usr/src/sys/platform/pc32/i386/pmap.c:1356
#7 0xc029c650 in vmspace_terminate (vm=0xd6ea7b90)
at /usr/src/sys/vm/vm_map.c:313
#8 0xc017ee80 in _sysref_put (sr=0xd6ea7c88)
at /usr/src/sys/kern/kern_sysref.c:318
#9 0xc02a5f1e in sysref_put (ve=0xc3223000, vkp=<value optimized out>)
at /usr/src/sys/sys/sysref2.h:85
#10 vmspace_entry_delete (ve=0xc3223000, vkp=<value optimized out>)
at /usr/src/sys/vm/vm_vmspace.c:518
#11 0xc02a6396 in sys_vmspace_destroy (uap=0xde8fccf0)
at /usr/src/sys/vm/vm_vmspace.c:147
#12 0xc02dccfc in syscall2 (frame=0xde8fcd40)
at /usr/src/sys/platform/pc32/i386/trap.c:1359
#13 0xc02c83f6 in Xint0x80_syscall ()
at /usr/src/sys/platform/pc32/i386/exception.s:876
The kernel and vmcore files are in my leaf account,
at ~y0netan1/crash/{kern,vmcore}.18.
Best Regards,
YONETANI Tomokazu
Updated by dillon about 15 years ago
:I used -n16 on vkernel's command line argument, but probably I could trigger
:the panic with a smaller number.
:The first version which triggered the panic was 376534279, one version
:earlier than the pmap_map() syntaxis change. The backtrace below was
:from the kernel built from the source tree as of a307f7025.
I see the syntaxis change but I don't see any version 376534279.
Are you saying the problem occurred before the syntax change was
committed? Or after?
-Matt
Updated by qhwt+dfly about 15 years ago
On Wed, Dec 23, 2009 at 07:15:39PM -0800, Matthew Dillon wrote:
:I used -n16 on vkernel's command line argument, but probably I could trigger
:the panic with a smaller number.
:The first version which triggered the panic was 376534279, one version
:earlier than the pmap_map() syntaxis change. The backtrace below was
:from the kernel built from the source tree as of a307f7025.I see the syntaxis change but I don't see any version 376534279.
Are you saying the problem occurred before the syntax change was
committed? Or after?
Before the syntaxis change was committed, that is, the panic occurs
on the real kernels built from both versions below:
commit 8e5e6f1b1a2869fdaf18f24243c40e756e6e787a
commit 376534279570d134799f19f5b0b0bb5c0cd24516
I'll start bisect'ing later.
Updated by dillon about 15 years ago
:Before the syntaxis change was committed, that is, the panic occurs
:on the real kernels built from both versions below:
:
: commit 8e5e6f1b1a2869fdaf18f24243c40e756e6e787a
: commit 376534279570d134799f19f5b0b0bb5c0cd24516
:
:I'll start bisect'ing later.
Ah, ok. I was only git log'ing /usr/src/sys/, 3765 was a commit
in the main source tree :-)
The panic is due to pmap->pm_stats.resident_count being off by 1.
It should have been left with a count of 1 with only the page
directory page left to purge but the count was 0.
I'm thinking possibly something in the pmap unwiring code, possibly
in _pmap_unwire_pte_hold(), could be racing.
Here is a patch to try. It adds a bunch of assertions in an attempt
to catch the potential race.
-Matt
Matthew Dillon
<dillon@backplane.com>
diff --git a/sys/platform/pc32/i386/pmap.c b/sys/platform/pc32/i386/pmap.c
index 53cc386..b07f2cf 100644
--- a/sys/platform/pc32/i386/pmap.c
++ b/sys/platform/pc32/i386/pmap.c@ -934,6 +934,7
@ _pmap_unwire_pte_hold(pmap_t pmap, vm_page_t m, pmap_inval_info_t info)
*/
vm_page_busy(m);
pmap_inval_add(info, pmap, 1); KKASSERT(pmap>pm_pdir[m->pindex]);
pmap->pm_pdir[m->pindex] = 0;
KKASSERT(pmap->pm_stats.resident_count > 0);
@ -1153,8 +1154,9
@ pmap_release_free_page(struct pmap pmap, vm_page_t p)
/
* Remove the page table page from the processes address space.
*/
- pde[p->pindex] = 0;
KKASSERT(pmap->pm_stats.resident_count > 0);
+ KKASSERT(pde[p->pindex]);
+ pde[p->pindex] = 0;
--pmap->pm_stats.resident_count;
if (p->hold_count) {
@ -1604,6 +1606,7
@ pmap_remove_pte(struct pmap *pmap, unsigned *ptq, vm_offset_t va,
pmap_inval_add(info, pmap, va);
oldpte = loadandclear(ptq);
+ KKASSERT(oldpte);
if (oldpte & PG_W)
pmap->pm_stats.wired_count -= 1;
/*
@ -2850,6 +2853,7
@ pmap_remove_pages(pmap_t pmap, vm_offset_t sva, vm_offset_t eva)
npv = TAILQ_NEXT(pv, pv_plist);
continue;
}
+ KKASSERT(*pte);
tpte = loadandclear(pte);
m = PHYS_TO_VM_PAGE(tpte);
Updated by qhwt+dfly about 15 years ago
On Wed, Dec 23, 2009 at 08:42:47PM -0800, Matthew Dillon wrote:
The panic is due to pmap->pm_stats.resident_count being off by 1.
It should have been left with a count of 1 with only the page
directory page left to purge but the count was 0.I'm thinking possibly something in the pmap unwiring code, possibly
in _pmap_unwire_pte_hold(), could be racing.Here is a patch to try. It adds a bunch of assertions in an attempt
to catch the potential race.
It panicked at almost the same place as before:
(kgdb) bt
:
#4 0xc01a14fc in panic (fmt=0xc030329e "assertion: %s in %s")
at /usr/src/sys/kern/kern_shutdown.c:743
#5 0xc02d9cc4 in pmap_release_free_page (pmap=0x0, p=0xc1026410)
at /usr/src/sys/platform/pc32/i386/pmap.c:1157
#6 0xc02d9f23 in pmap_release (pmap=0xdaebd9f4)
at /usr/src/sys/platform/pc32/i386/pmap.c:1358
#7 0xc029c654 in vmspace_terminate (vm=0xdaebd970)
at /usr/src/sys/vm/vm_map.c:313
#8 0xc017ee84 in _sysref_put (sr=0xdaebda68)
at /usr/src/sys/kern/kern_sysref.c:318
#9 0xc02a5f22 in sysref_put (ve=0xc3223200, vkp=<value optimized out>)
at /usr/src/sys/sys/sysref2.h:85
#10 vmspace_entry_delete (ve=0xc3223200, vkp=<value optimized out>)
at /usr/src/sys/vm/vm_vmspace.c:518
#11 0xc02a639a in sys_vmspace_destroy (uap=0xdbb3dcf0)
at /usr/src/sys/vm/vm_vmspace.c:147
#12 0xc02dcd8c in syscall2 (frame=0xdbb3dd40)
at /usr/src/sys/platform/pc32/i386/trap.c:1359
#13 0xc02c83f6 in Xint0x80_syscall ()
at /usr/src/sys/platform/pc32/i386/exception.s:876
:
(kgdb) fr 5
#5 0xc02d9cc4 in pmap_release_free_page (pmap=0x0, p=0xc1026410)
at /usr/src/sys/platform/pc32/i386/pmap.c:1157
1157 KKASSERT;
(kgdb) l 12,+18>pm_pdir;
1140 static int
1141 pmap_release_free_page(struct pmap pmap, vm_page_t p)
1142 {
1143 unsigned *pde = (unsigned *) pmap
1144 /
1145 * This code optimizes the case of freeing non-busy
1146 * page-table pages. Those pages are zero now, and
1147 * might as well be placed directly into the zero queue.
1148 /
1149 if (vm_page_sleep_busy(p, FALSE, "pmaprl"))
1150 return 0;
1151
1152 vm_page_busy(p);
1153
1154 /
1155 * Remove the page table page from the processes address space.
1156 */
1157 KKASSERT;
1158 KKASSERT;
Updated by dillon about 15 years ago
:It panicked at almost the same place as before:
I swapped around the pde[] assignment to 0 so print out
pde[p->pindex] if you can (or upload the crash dump to leaf).
-Matt
Matthew Dillon
<dillon@backplane.com>
Updated by qhwt+dfly about 15 years ago
On Thu, Dec 24, 2009 at 10:14:06AM -0800, Matthew Dillon wrote:
:It panicked at almost the same place as before:
I swapped around the pde[] assignment to 0 so print out
pde[p->pindex] if you can (or upload the crash dump to leaf).
I'm scp'ing the crash dump now. In the mean while, here's the value.
No idea what to do next, though:
#5 0xc02d9cc4 in pmap_release_free_page (pmap=0x0, p=0xc1026410)
at /usr/src/sys/platform/pc32/i386/pmap.c:1157
1157 KKASSERT;
(kgdb) p/x pde[p->pindex]
$5 = 0x1980e063
Updated by qhwt+dfly about 15 years ago
On Fri, Dec 25, 2009 at 09:46:51AM +0900, YONETANI Tomokazu wrote:
I'm scp'ing the crash dump now.
Done, it's ~y0netan1/crash/{kern,vmcore}.19 on leaf.
Updated by qhwt+dfly about 15 years ago
On Wed, Dec 23, 2009 at 08:42:47PM -0800, Matthew Dillon wrote:
I'm thinking possibly something in the pmap unwiring code, possibly
in _pmap_unwire_pte_hold(), could be racing.Here is a patch to try. It adds a bunch of assertions in an attempt
to catch the potential race.-Matt
Matthew Dillon
<dillon@backplane.com>diff --git a/sys/platform/pc32/i386/pmap.c b/sys/platform/pc32/i386/pmap.c
index 53cc386..b07f2cf 100644
I tried changing the number of CPUs on vkernel from 16 to 4 and reduced
the -j number given to make command, then I get a panic in a different place:
(kgdb) bt
:
#4 0xc01a14fc in panic (fmt=0xc030329e "assertion: %s in %s")
at /usr/src/sys/kern/kern_shutdown.c:743
#5 0xc02d88c5 in pmap_remove_pages (pmap=0xd6ea6fb4, sva=0, eva=3217031168)
at /usr/src/sys/platform/pc32/i386/pmap.c:2856
#6 0xc02a5ee8 in vmspace_entry_delete (ve=0xdae62b40,
vkp=<value optimized out>) at /usr/src/sys/vm/vm_vmspace.c:514
#7 0xc02a639a in sys_vmspace_destroy (uap=0xdbadecf0)
at /usr/src/sys/vm/vm_vmspace.c:147
#8 0xc02dcd8c in syscall2 (frame=0xdbaded40)
at /usr/src/sys/platform/pc32/i386/trap.c:1359
#9 0xc02c83f6 in Xint0x80_syscall ()
at /usr/src/sys/platform/pc32/i386/exception.s:876
:
(kgdb) fr
#5 0xc02d88c5 in pmap_remove_pages (pmap=0xd6ea6fb4, sva=0, eva=3217031168)
at /usr/src/sys/platform/pc32/i386/pmap.c:2856
2856 KKASSERT;
(kgdb) info locals
lp = <value optimized out>
pte = 0xff80cff8
tpte = <value optimized out>
pv = 0xc2ebca9c
npv = 0xdbadec8c
m = 0xdaebbc30
info = {pir_flags = 0, pir_cpusync = {cs_run_func = 0xdaec4d00,
cs_fin1_func = 0x1d, cs_fin2_func = 0, cs_data = 0x0,
cs_maxcount = -609358712, cs_count = -1070966064, cs_mask = 3273535440}}
iscurrentpmap = 0
save_generation = <value optimized out>
func = "pmap_remove_pages"
The kernel and vmcore are in ~y0netan1/crash/{kern,vmcore}.22 .
The kernel is compiled from source as of a307f7025 and your patch applied.
Updated by dillon about 15 years ago
:I tried changing the number of CPUs on vkernel from 16 to 4 and reduced
:the -j number given to make command, then I get a panic in a different place:
That's a better panic, for sure. Here is another patch to try:
fetch http://apollo.backplane.com/DFlyMisc/pmap01.patch
- I fixed one place where pmap->pm_generation was not being bumped.
- I fixed another place where it was looping on the wrong npv
under certain circumstances (but it also kprintf()'s something
in that case and your crash dump didn't have the kprintf).
- I added a few more KKASSERTs closer to the pmap lookup.
- I believe I found a race in pmap_enter()'s call to pmap_remove_pte().
The temporary ptbase can get wiped after the call so we have to reload
the pte pointer.
I'm crossing my fingers. If this fixes the problem I will make the
same changes to the x86_64 and vkernel's pmap code.
-Matt
Updated by qhwt+dfly about 15 years ago
On Mon, Dec 28, 2009 at 09:21:14PM -0800, Matthew Dillon wrote:
That's a better panic, for sure. Here is another patch to try:
Another panic from sys_vmspace_mcontrol by a slightly different workload
on vkernel (actually I caught two panics, but unfortunately the first
one ended up in an endless loop of some spinlock panics then locked up,
so all I could do was to press the reset switch). The kernel and the vmcore
are at ~y0netan1/crash/{kern,vmcore}.23 .
#4 0xc01a14fc in panic (fmt=0xc03032de "assertion: %s in %s")
at /usr/src/sys/kern/kern_shutdown.c:743
#5 0xc02d9376 in pmap_remove_entry (pmap=0xdaebd334,
ptq=<value optimized out>, va=3217022976, info=0xdbb32c30)
at /usr/src/sys/platform/pc32/i386/pmap.c:1559
#6 pmap_remove_pte (pmap=0xdaebd334, ptq=<value optimized out>,
va=3217022976, info=0xdbb32c30)
at /usr/src/sys/platform/pc32/i386/pmap.c:1640
#7 0xc02d991e in pmap_remove_page (pmap=0xdaebd334, sva=3217022976,
eva=3217027072) at /usr/src/sys/platform/pc32/i386/pmap.c:1668
#8 pmap_remove (pmap=0xdaebd334, sva=3217022976, eva=3217027072)
at /usr/src/sys/platform/pc32/i386/pmap.c:1709
#9 0xc029c0d3 in vm_map_madvise (map=0xdaebd2b0, start=3217022976,
end=3217027072, behav=10, value=0) at /usr/src/sys/vm/vm_map.c:1793
#10 0xc02a618a in sys_vmspace_mcontrol (uap=0xdbb32cf0)
at /usr/src/sys/vm/vm_vmspace.c:469
#11 0xc02dcdbc in syscall2 (frame=0xdbb32d40)
at /usr/src/sys/platform/pc32/i386/trap.c:1359
#12 0xc02c83f6 in Xint0x80_syscall ()
at /usr/src/sys/platform/pc32/i386/exception.s:876
(arguments list for the frame #5 is wrong, tho)
(kgdb) fr 6
#6 pmap_remove_pte (pmap=0xdaebd334, ptq=<value optimized out>,
va=3217022976, info=0xdbb32c30)
at /usr/src/sys/platform/pc32/i386/pmap.c:1640
1640 return pmap_remove_entry(pmap, m, va, info);
(kgdb) p *pmap
$7 = {pm_pdir = 0xdbb9b000, pm_pdirm = 0xc12fd490, pm_pteobj = 0xdbb88840,
pm_pmnode = {tqe_next = 0xd6ea6a14, tqe_prev = 0xdaebec00}, pm_pvlist = {
tqh_first = 0xc2f79428, tqh_last = 0xc2f3bc2c}, pm_count = 1,
pm_active = 0, pm_filler02 = 0, pm_stats = {resident_count = 2209,
wired_count = 0}, pm_ptphint = 0xc21ae840, pm_generation = 181731}
(kgdb) p *m
$8 = {pageq = {tqe_next = 0xc1eb7880, tqe_prev = 0xc2053450}, rb_entry = {
rbe_left = 0xc17ea020, rbe_right = 0xc17441c0, rbe_parent = 0xc1767300,
rbe_color = 1}, object = 0xdba3eea0, pindex = 6960, phys_addr = 809406464,
md = {pv_list_count = 2, pv_list = {tqh_first = 0xc2facbf8,
tqh_last = 0xc2f69c20}}, queue = 258, flags = 1200, pc = 41,
act_count = 8 '\b', busy = 0 '\000', wire_count = 0, hold_count = 0,
valid = 255 '\377', dirty = 255 '\377', msf_hint = 0x0, action_list = {
lh_first = 0x0}}
Updated by dillon about 15 years ago
:Another panic from sys_vmspace_mcontrol by a slightly different workload
:on vkernel (actually I caught two panics, but unfortunately the first
:one ended up in an endless loop of some spinlock panics then locked up,
:so all I could do was to press the reset switch). The kernel and the vmcore
:are at ~y0netan1/crash/{kern,vmcore}.23 .
The crash dumps are definitely providing more information now.
It sees a pte entry for which there is no corresponding pv. The
page is mapped to multiple pmaps (but not the particular one that
we paniced on), its a vkernel virtual process stack page, so it
was fork()ed a few times.... hence why more cpus can reproduce
the race.
We caught the problem earlier this time. If it had been allowed
to go through it would have cleared the pte entry and then later
on paniced when it found an zero pte entry. I think there are
two possibilities.
- The first is that the pmap system is allocating a page marked
PG_ZERO which isn't actually an empty page, then encountered
the junk pte later on.
- The second is that there is another race where the temporary
page table mapping gets blown up and causes a pte to be entered
into the wrong pmap, then later we encounter the bad pte and
can't find its corresponding pv.
Here's a new patch to try. I added a bunch more assertions to
try to catch it and I also check to make sure PG_ZERO'd pages
are zero (which is expensive but...).
fetch http://apollo.backplane.com/DFlyMisc/pmap02.patch
-Matt
Updated by qhwt+dfly about 15 years ago
On Tue, Dec 29, 2009 at 10:37:47AM -0800, Matthew Dillon wrote:
Here's a new patch to try. I added a bunch more assertions to
try to catch it and I also check to make sure PG_ZERO'd pages
are zero (which is expensive but...).
I had to update to the latest source to apply it cleanly. The backtrace
seems to remain the same as before. The kernel and vmcore are at
~y0netan1/crash/{kern,vmcore}.24.
#4 0xc01a4b6a in panic (fmt=0xc0316e5e "assertion: %s in %s")
at /usr/src/sys/kern/kern_shutdown.c:744
#5 0xc02ec4d1 in pmap_remove_entry (pmap=0xd6ea5394,
ptq=<value optimized out>, va=711565312, info=0xdbbd2c2c)
at /usr/src/sys/platform/pc32/i386/pmap.c:1578
#6 pmap_remove_pte (pmap=0xd6ea5394, ptq=<value optimized out>, va=711565312,
info=0xdbbd2c2c) at /usr/src/sys/platform/pc32/i386/pmap.c:1662
#7 0xc02eca9d in pmap_remove_page (pmap=0xd6ea5394, sva=711565312,
eva=711569408) at /usr/src/sys/platform/pc32/i386/pmap.c:1690
#8 pmap_remove (pmap=0xd6ea5394, sva=711565312, eva=711569408)
at /usr/src/sys/platform/pc32/i386/pmap.c:1731
#9 0xc02ad2fb in vm_map_madvise (map=0xd6ea5310, start=711565312,
end=711569408, behav=10, value=0) at /usr/src/sys/vm/vm_map.c:1793
#10 0xc02b7ca3 in sys_vmspace_mcontrol (uap=0xdbbd2cf0)
at /usr/src/sys/vm/vm_vmspace.c:470
#11 0xc02f0339 in syscall2 (frame=0xdbbd2d40)
at /usr/src/sys/platform/pc32/i386/trap.c:1361
#12 0xc02db046 in Xint0x80_syscall ()
at /usr/src/sys/platform/pc32/i386/exception.s:876
Updated by dillon about 15 years ago
:On Tue, Dec 29, 2009 at 10:37:47AM -0800, Matthew Dillon wrote:
:> Here's a new patch to try. I added a bunch more assertions to
:> try to catch it and I also check to make sure PG_ZERO'd pages
:> are zero (which is expensive but...).
:>
:> fetch http://apollo.backplane.com/DFlyMisc/pmap02.patch
:
:I had to update to the latest source to apply it cleanly. The backtrace
:seems to remain the same as before. The kernel and vmcore are at
:~y0netan1/crash/{kern,vmcore}.24.
It's a bit different. In this dump I found that for the entry
that asserted the pv was missing from the vm_page_t's pv list
but was present in the pmap's list for that va.
Since the pv is added and removed from both lists in tandem
that shouldn't be possible.
So... more debug code in this next patch. Now I verify that
the pte being added or removed matches the vm_page_t being
associated with the pv entry. Here is the patch:
fetch http://apollo.backplane.com/DFlyMisc/pmap03.patch
There is another possibility... it could be actual bad memory,
or bad timing if you've accelerated your cpus. Try shifting
the memory around and detuning the cpu frequency or FSB frequency
if the BIOS has the ability to do so. I only give it a 15% chance
of being the issue.
-Matt
Matthew Dillon
<dillon@backplane.com>
Updated by dillon about 15 years ago
: There is another possibility... it could be actual bad memory,
: or bad timing if you've accelerated your cpus. Try shifting
: the memory around and detuning the cpu frequency or FSB frequency
: if the BIOS has the ability to do so. I only give it a 15% chance
: of being the issue.
Never mind on the bad memory possibility... I just reproduced it on
my test box using a vkernel -n 8 and a buildworld -j 8.
-Matt
Matthew Dillon
<dillon@backplane.com>
Updated by qhwt+dfly about 15 years ago
On Tue, Dec 29, 2009 at 07:56:31PM -0800, Matthew Dillon wrote:
There is another possibility... it could be actual bad memory,
or bad timing if you've accelerated your cpus. Try shifting
the memory around and detuning the cpu frequency or FSB frequency
if the BIOS has the ability to do so. I only give it a 15% chance
of being the issue.
Although the CPU is not over-clocked, I have three memory modules of
512Mbytes with different speeds(DDR266, DDR333, DDR400) installed
on this system. I believed this should be no problem; the BIOS
screen indicated that all modules are treated as DDR266. But I'll try
with the DDR400 one and see if that helps (I'm not sure 512M is enough
for HAMMER and vkernel, though).
Thanks for the hint.
Updated by dillon about 15 years ago
I think I found it. It's a bug in zalloc, in vm/vm_zone.c.
fetch http://apollo.backplane.com/DFlyMisc/pmap04.patch
Remove all the other patches and just throw in that one. The pmap
code uses zbootinit() and zinitna(). zbootinit() takes a fixed
item array and uses the structure size as-is. zinitna() rounds-up
the structure size to a 32 byte boundary. The pv_entry structure
is 28 bytes.
That blows up the calculation when the zone is expanded and creates
an overlap, thus corrupting one of the pv_entry's.
I found it when I added another field to pv_entry and suddenly the
panic stopped occuring on my test box.
There are still a few bugs in pmap.c which I will commit from the
other patches. As soon as I verify that the bug is in zalloc
I will clean up the pmap.c patch and commit it along with the zalloc
fix.
OMG this one was hard to find. Of course, I haven't verified that
it's the cause yet but it just has to be it. So if it isn't I will
cry.
-Matt
Updated by alexh almost 15 years ago
Matt,
Has this patch been committed? Can we close this?
Updated by qhwt+dfly almost 15 years ago
On Fri, Apr 02, 2010 at 10:47:00PM +0000, Alex Hornung (via DragonFly issue tracker) wrote:
Has this patch been committed? Can we close this?
The same panic isn't triggered anymore. I can't tell exactly which commit
fixed it, but I guess the pmap-related commits starting from 5926987a
at the end of year 2009 were the ones.