Bug #749

double fault when nullmounting

Added by corecode over 18 years ago. Updated about 15 years ago.

Status: Closed
Priority: High
Assignee: -
Category: -
Target version: -
Start date:
Due date:
% Done: 0%
Estimated time:

Description

hey,

I just got this double fault when I was doing some layered mounts:

Fatal double fault
eip=0xc034a27a
esp=0xe2426ffc
ebp=0xe2427000

...
trapwrite(0,c64e0760,0,0,0)

I'll look the symbols up. I guess that was a stack overflow, the second one.
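
Supporting that guess: assuming the esp value above was meant to read
0xe2426ffc, it sits exactly one 4-byte push below the page-aligned ebp,
i.e. right where the first write past the bottom of the kernel stack
into the guard page would land:

printf '%#x\n' $((0xe2427000 - 0xe2426ffc))   # prints 0x4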

This is a two processor hyperthreading Xeon (4 logical CPUs), SMP kernel.

The mounting was something like this:

server:/pbulk on /pbulk nfs
/dev/ad6s1b on /pbulk2 ufs
/pbulk2/root on /pbulk/root null
/pbulk2/clients on /pbulk/clients null
/pbulk/root on /pbulk/clients/subdir/root null,ro
/pbulk/clients/subdir/var on /pbulk/clients/subdir/root/var null
/pbulk/clients/subdir/tmp on /pbulk/clients/subdir/root/tmp null
/pbulk/clients/subdir/dev on /pbulk/clients/subdir/root/dev null
/pbulk/clients/subdir/usr.pkg on /pbulk/clients/subdir/root/usr/pkg null

There it panicked. However, I was doing that from a shell script, so maybe something unfinished from before caused it. This was kind of reproducible: I did the same sequence once before and the box just rebooted.

I probably won't be able to reproduce this because the machine is part of our lab's cluster and sits in a server room: other people want to use it, it is cold there, and I have other stuff to do :/

cheers
simon

#1

Updated by dillon over 18 years ago

:hey,
:
:I just got this double fault when I was doing some layered mounts:
:
:Fatal double fault
:eip=0xc034a27a
:esp=0xe2426ffc
:ebp=0xe2427000
:
:...
:trapwrite(0,c64e0760,0,0,0)
:
:I'll look the symbols up. I guess that was a stack overflow, the second one.

The symbols might give me a clue.  Considering the value of %esp it
certainly does look like a stack overflow.

nm -n kernel.debug | less ... and look for the procedure containing the
%eip above.

-Matt
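
For the record, a quick non-interactive way to do that lookup (a sketch,
assuming a 32-bit kernel.debug: nm then prints fixed-width 8-digit hex
addresses, so plain string comparison preserves address order):

#!/bin/sh
# print the symbol whose address range contains the faulting %eip
eip=c034a27a
nm -n kernel.debug | awk -v eip="$eip" '
    NF == 3 && $1 <= eip { sym = $3 }   # remember last symbol at or below eip
    NF == 3 && $1 >  eip { print sym; exit }'
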
#2

Updated by corecode over 18 years ago

I accidentally killed the debug kernel by doing make installkernel :/ I'll see if I can recreate it.

cheers
simon

#3

Updated by corecode over 18 years ago

nm data follows:

c034a26e T vop_nresolve_ap
c034a285 T vop_nlookupdotdot_ap

(so the faulting eip, 0xc034a27a, falls inside vop_nresolve_ap)

Well, trapwrite() just takes one argument, so that would be NULL.

cheers
simon

#4

Updated by dillon over 18 years ago

:nm data follows:
:
:Simon 'corecode' Schubert wrote:
:> Fatal double fault
:> eip=0xc034a27a
:
:c034a26e T vop_nresolve_ap
:c034a285 T vop_nlookupdotdot_ap
:...
:cheers
: simon

I was afraid it was in that code.  I gotta make the resolve code
non-recursive or O(log n) recursive, simple as that.
-Matt
#5

Updated by dillon over 18 years ago

:server:/pbulk on /pbulk nfs
:/dev/ad6s1b on /pbulk2 ufs
:/pbulk2/root on /pbulk/root null
:/pbulk2/clients on /pbulk/clients null
:/pbulk/root on /pbulk/clients/subdir/root null,ro
:/pbulk/clients/subdir/var on /pbulk/clients/subdir/root/var null
:/pbulk/clients/subdir/tmp on /pbulk/clients/subdir/root/tmp null
:/pbulk/clients/subdir/dev on /pbulk/clients/subdir/root/dev null
:/pbulk/clients/subdir/usr.pkg on /pbulk/clients/subdir/root/usr/pkg null

Do you still have the shell script?
Also, a clarification: the panic occurred while you were doing the
mounts, not while the box was doing a build or running utilities or
other things inside the mounts?
-Matt
#6

Updated by corecode over 18 years ago

Nothing was running inside the mounts yet; I was just setting up the chroot for use. The shell script did exactly these steps:

#!/bin/sh
# null-mount a client chroot hierarchy; $1 is the client directory

client=$1

cd "$client" || exit 1

# read-only base root, then the writable per-client pieces
mount -t null -o ro /pbulk/root root
mount -t null var root/var
mount -t null tmp root/tmp
mount -t null dev root/dev
mount -t null usr.pkg root/usr/pkg

# shared pbulk trees
mount -t null /pbulk/scratch root/pbulk/scratch
mount -t null /pbulk/packages root/pbulk/packages
mount -t null /pbulk/distfiles root/pbulk/distfiles
mount -t null /pbulk/bulklog root/pbulk/bulklog
mount -t null -o ro /pbulk/pkgsrc root/usr/pkgsrc

The /pbulk, /pbulk2, /pbulk/root and /pbulk/clients mounts were done manually beforehand.
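
For completeness, the mount(8) listing in the description implies those
manual mounts were approximately:

mount -t nfs server:/pbulk /pbulk
mount -t ufs /dev/ad6s1b /pbulk2
mount -t null /pbulk2/root /pbulk/root
mount -t null /pbulk2/clients /pbulk/clients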

cheers
simon

#7

Updated by dillon over 18 years ago

:Matthew Dillon wrote:
:> :server:/pbulk on /pbulk nfs
:> :/dev/ad6s1b on /pbulk2 ufs
:> :/pbulk2/root on /pbulk/root null
:> :/pbulk2/clients on /pbulk/clients null
:>
:
:cd "$client"
:mount -t null -o ro /pbulk/root root
:mount -t null var root/var
:mount -t null tmp root/tmp
:mount -t null dev root/dev
:mount -t null usr.pkg root/usr/pkg
:mount -t null /pbulk/scratch root/pbulk/scratch
:mount -t null /pbulk/packages root/pbulk/packages
:mount -t null /pbulk/distfiles root/pbulk/distfiles
:mount -t null /pbulk/bulklog root/pbulk/bulklog
:mount -t null -o ro /pbulk/pkgsrc root/usr/pkgsrc
:
:the /pbulk, /pbulk2, /pbulk/root, /pbulk/clients mounts were done manually before.
:
:cheers
: simon

Well, so far I haven't been able to crash anything.  I definitely want
to try to reproduce this.  It isn't what I originally thought it was,
since there are no deep directory recursions occurring here.  So
whatever recursion is creating the issue is a software bug causing
infinite recursion somehow.
How old was the kernel running on the box?
-Matt
#8

Updated by corecode over 18 years ago

Fresh 1.10-RELEASE; the only config changes were SMP and APIC_IO.
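
(For reference, that corresponds roughly to this delta against the stock
kernel configuration, in the usual BSD kernel config syntax:)

options         SMP             # symmetric multiprocessing support
options         APIC_IO         # use the I/O APIC for interrupt routing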

Don't bother too much for now; I'll see if I can set up the same shebang again and this time get a core dump out of it.

cheers
simon

#9

Updated by dillon over 18 years ago

:Fresh 1.10-RELEASE, only used SMP and APIC_IO.
:
:Don't bother too much for now, I'll see if I can setup the same shebang again and this time get a coredump out of it.
:
:cheers
: simon

Ok.  If you can reproduce the double fault we can add a shim to the
code to check whether the stack is too deep or not and panic the system
before it actually hits the double fault. Then a kernel core will be
possible.
-Matt
#10

Updated by tuxillo about 15 years ago

I think Dillon, who's doing a pbulk on AMD64, has a good number of null mounts
without problems, doesn't he?
How could we reproduce this?

#11

Updated by corecode about 15 years ago

Let's close this; if the parallel bulk builders didn't hit this problem, it is
probably fixed.
