Bug #530

crash from latest -HEAD

Added by pavalos about 7 years ago. Updated over 5 years ago.

Status: Closed
Start date:
Priority: High
Due date:
Assignee: -
% Done: 0%
Category: -
Target version: -

Description

Here's a crash from the latest -HEAD. The kernel is already uploaded to
my crash/ on leaf. The vmcore is still in progress. I'll send a follow-up
email when that is complete.

--Peter

Fatal trap 12: page fault while in kernel mode
mp_lock = 00000000; cpuid = 0; lapic.id = 00000000
fault virtual address = 0x24
fault code = supervisor read, page not present
instruction pointer = 0x8:0xc0297da6
stack pointer = 0x10:0xdaa1ecd8
frame pointer = 0x10:0xdaa1ed64
code segment = base 0x0, limit 0xfffff, type 0x1b
= DPL 0, pres 1, def32 1, gran 1
processor eflags = interrupt enabled, resume, IOPL = 0
current process = Idle
current thread = pri 44 (CRIT)
<- SMP: XXX
trap number = 12
panic: page fault
mp_lock = 00000000; cpuid = 0; lapic.id = 00000000
boot() called on cpu#0

syncing disks... 25 1 1 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
done
Uptime: 13h29m22s

dumping to dev #da/0x20001, offset 378927

<snip>

(kgdb) bt
#0 dumpsys () at thread.h:83
#1 0xc0196eac in boot (howto=256) at /usr/src/sys/kern/kern_shutdown.c:360
#2 0xc01975fc in panic (fmt=0xc02ff58e "%s") at /usr/src/sys/kern/kern_shutdown.c:755
#3 0xc02e28ca in trap_fatal (frame=0xdaa1ec90, eva=0) at /usr/src/sys/platform/pc32/i386/trap.c:1090
#4 0xc02e2511 in trap_pfault (frame=0xdaa1ec90, usermode=0, eva=36) at /usr/src/sys/platform/pc32/i386/trap.c:991
#5 0xc02e20ca in trap (frame=
{tf_gs = -1070792688, tf_fs = -626982888, tf_es = 16, tf_ds = 16, tf_edi = 350412, tf_esi = -1064838976, tf_ebp = -626922140, tf_isp = -626922304, tf_ebx = -750159616, tf_edx = -1035246472, tf_ecx = -1031713432, tf_eax = 0, tf_xflags = 0, tf_trapno = 12, tf_err = 0, tf_eip = -1071022682, tf_cs = 8, tf_eflags = 66118, tf_esp = -1031713432, tf_ss = 72}) at /usr/src/sys/platform/pc32/i386/trap.c:674
#6 0xc02cdf05 in calltrap () at /usr/src/sys/platform/pc32/i386/exception.s:782
#7 0xc0297da6 in vm_pageout_scan (pass=0) at /usr/src/sys/vm/vm_pageout.c:787
#8 0xc0298b8d in vm_pageout () at /usr/src/sys/vm/vm_pageout.c:1470
#9 0xc018924f in kthread_create_stk (func=0, arg=0x0, tdp=0xc03545a0, stksize=0, fmt=0x0) at /usr/src/sys/kern/kern_kthread.c:102
Previous frame inner to this frame (corrupt stack?)

History

#1 Updated by pavalos about 7 years ago

Complete.

#2 Updated by dillon about 7 years ago

:Here's a crash from the latest -HEAD. The kernel is already uploaded to
:my crash/ on leaf. The vmcore is still in progress. I'll send a follow-up
:email when that is complete.
:
:--Peter
:...
:#6 0xc02cdf05 in calltrap () at /usr/src/sys/platform/pc32/i386/exception.s:782
:#7 0xc0297da6 in vm_pageout_scan (pass=0) at /usr/src/sys/vm/vm_pageout.c:787
:#8 0xc0298b8d in vm_pageout () at /usr/src/sys/vm/vm_pageout.c:1470
:#9 0xc018924f in kthread_create_stk (func=0, arg=0x0, tdp=0xc03545a0, stksize=0, fmt=0x0) at /usr/src/sys/kern/kern_kthread.c:102
:Previous frame inner to this frame (corrupt stack?)

Hmm. That VM page is sitting on the INACTIVE queue but has no
VM object associated with it.

Is the crash repeatable? Were you using any of the new features, like
the virtual kernel?

-Matt
Matthew Dillon

#3 Updated by pavalos about 7 years ago

I haven't been able to reproduce it. I don't even know what was going on
when it panic'd. Just my normal traffic...

Nothing remarkable about the features. Only weird thing (if you wanna
call it that) is POLLING on em0. If you need info on sysctls, configs,
whatever...lemme know.

--Peter

#4 Updated by pavalos about 7 years ago

Ok, looks like I got another one from Jan 31 sources:

Fatal trap 12: page fault while in kernel mode
mp_lock = 00000000; cpuid = 0; lapic.id = 00000000
fault virtual address = 0x24
fault code = supervisor read, page not present
instruction pointer = 0x8:0xc028f21f
stack pointer = 0x10:0xda9f5cdc
frame pointer = 0x10:0xda9f5d84
code segment = base 0x0, limit 0xfffff, type 0x1b
= DPL 0, pres 1, def32 1, gran 1
processor eflags = interrupt enabled, resume, IOPL = 0
current process = Idle
current thread = pri 44 (CRIT)
<- SMP: XXX
trap number = 12
panic: page fault
mp_lock = 00000000; cpuid = 0; lapic.id = 00000000
boot() called on cpu#0

syncing disks... em0: watchdog timeout -- resetting
5
done
Uptime: 35d3h46m35s

(kgdb) bt
#0 dumpsys () at thread.h:83
#1 0xc01930bb in boot (howto=256) at /usr/src/sys/kern/kern_shutdown.c:370
#2 0xc01933c0 in panic (fmt=Variable "fmt" is not available.
) at /usr/src/sys/kern/kern_shutdown.c:767
#3 0xc02d7070 in trap_fatal (frame=0xda9f5c94, eva=Variable "eva" is not
available.
) at /usr/src/sys/platform/pc32/i386/trap.c:1090
#4 0xc02d71b8 in trap_pfault (frame=0xda9f5c94, usermode=0, eva=36)
at /usr/src/sys/platform/pc32/i386/trap.c:991
#5 0xc02d7861 in trap (frame=0xda9f5c94)
at /usr/src/sys/platform/pc32/i386/trap.c:674
#6 0xc02c38f6 in calltrap ()
at /usr/src/sys/platform/pc32/i386/exception.s:783
#7 0x00000000 in ?? ()
#8 0xc028f21f in vm_pageout () at /usr/src/sys/vm/vm_pageout.c:787
Previous frame inner to this frame (corrupt stack?)

kernel, vmcore, and possible kernel modules are uploaded to my crash/ on
leaf. Look for *.2.

--Peter

#5 Updated by dillon about 7 years ago

:Peter Avalos added the comment:
:
:Ok, looks like I got another one from Jan 31 sources:
:
:...
:at /usr/src/sys/platform/pc32/i386/exception.s:783
:#7 0x00000000 in ?? ()
:#8 0xc028f21f in vm_pageout () at /usr/src/sys/vm/vm_pageout.c:787
:Previous frame inner to this frame (corrupt stack?)
:
:kernel, vmcore, and possible kernel modules are uploaded to my crash/ on
:leaf. Look for *.2.
:
:--Peter

The pageout daemon is conking out on this:

if (m->object->ref_count == 0) {
...
}

m->object is NULL. So the question is: why is there a page without
an associated object sitting on the inactive queue?

I think this is the second time this panic has been reported.

The only code path I see that could put a page in that state is from
vm_object_terminate(), when it encounters a wired page while destroying
an object. I think that code needs to move the page to the HOLD
queue in order to fix the problem, but I would like to catch it in the
act first.

Here is a patch that will kprintf() pages that I think are causing the
problem. If we get this just before a similar crash, and it's the
same page as the one that the pageout code crashed on (the %esi
register in the trapframe points to the vm_page in the crash), then
we will have found it.

Think you can reproduce it? It looks like the issue might be related
to page wiring or a program being terminated while undergoing paging I/O.

-Matt

Index: vm/vm_object.c
===================================================================
RCS file: /cvs/src/sys/vm/vm_object.c,v
retrieving revision 1.29
diff -u -r1.29 vm_object.c
--- vm/vm_object.c 28 Dec 2006 21:24:02 -0000 1.29
+++ vm/vm_object.c 8 Mar 2007 16:51:04 -0000
@@ -460,6 +460,7 @@
vm_page_free(p);
mycpu->gd_cnt.v_pfree++;
} else {
+ kprintf("vm_object_terminate: Warning: Encountered wired page %p\n", p);
vm_page_busy(p);
vm_page_remove(p);
vm_page_wakeup(p);

#6 Updated by pavalos about 7 years ago

Indeed. If you look at issue530, you'll see the entire trail.

I've compiled up a new kernel with that patch, so we'll see. The first time it
happened only after 13 hours of uptime. The second time took 35 days. If you
have any suggestions on how to trigger it, I'll see what I can do.

--Peter

#7 Updated by pavalos about 7 years ago

Now that I'm running with that patch, I'm getting plenty of those warnings.
Is that expected?

--Peter

#8 Updated by dillon about 7 years ago

:On Thu, Mar 08, 2007 at 01:13:36PM -0500, Peter Avalos wrote:
:>
:> I've compiled up a new kernel with that patch, so we'll see. The first time it
:> happened only after 13 hours of uptime. The second time took 35 days. If you
:> have any suggestions on how to trigger it, I'll see what I can do.
:>
:
:Now that I'm running with that patch, I'm getting plenty of those warnings.
:Is that expected?
:
:--Peter

Let's try a slightly modified patch... this one only generates warnings
if the page is on a paging queue.

There is clearly a race happening somewhere, and I think the solution
is to remove the page from the paging queues (since we are trying to
destroy it anyhow), but I'd like to accumulate a bit more evidence.

-Matt
Matthew Dillon

Index: vm_object.c
===================================================================
RCS file: /cvs/src/sys/vm/vm_object.c,v
retrieving revision 1.29
diff -u -r1.29 vm_object.c
--- vm_object.c 28 Dec 2006 21:24:02 -0000 1.29
+++ vm_object.c 9 Mar 2007 05:15:41 -0000
@@ -460,6 +460,8 @@
vm_page_free(p);
mycpu->gd_cnt.v_pfree++;
} else {
+ if (p->queue != PQ_NONE)
+ kprintf("vm_object_terminate: Warning: Encountered wired page %p on queue %d\n", p, p->queue);
vm_page_busy(p);
vm_page_remove(p);
vm_page_wakeup(p);

#9 Updated by pavalos almost 7 years ago

Do you think this was related to the previous network hangup problem I was
having because of those sysctls?

I'm sitting at 34 days of uptime right now.

#10 Updated by dillon almost 7 years ago

:Peter Avalos added the comment:
:
:Do you think this was related to the previous network hangup problem I was
:having because of those sysctls?
:
:I'm sitting at 34 days of uptime right now.

I've lost the context. What sysctls are we talking about?

I don't think we've come to a satisfactory conclusion on bug 530
yet, but I did make a fix to the VM_PAGE handling somewhere (I can't
find the commit), and I made numerous other changes to the VM system
since then so I think we'll have to just wait and see if you get
more panics.

-Matt
Matthew Dillon

#11 Updated by pavalos over 5 years ago

I haven't seen this in a while.
