Bug #749
closed
double fault when nullmounting
Description
hey,
I just got this double fault when I was doing some layered mounts:
Fatal double fault
eip=0xc034a27a
esp=0xe2426ffc
ebp=0xe2427000
...
trapwrite(0,c64e0760,0,0,0)
I'll look the symbols up. I guess that was a stack overflow, the second one.
This is a two processor hyperthreading Xeon (4 logical CPUs), SMP kernel.
The mounting was something like this:
server:/pbulk on /pbulk nfs
/dev/ad6s1b on /pbulk2 ufs
/pbulk2/root on /pbulk/root null
/pbulk2/clients on /pbulk/clients null
/pbulk/root on /pbulk/clients/subdir/root null,ro
/pbulk/clients/subdir/var on /pbulk/clients/subdir/root/var null
/pbulk/clients/subdir/tmp on /pbulk/clients/subdir/root/tmp null
/pbulk/clients/subdir/dev on /pbulk/clients/subdir/root/dev null
/pbulk/clients/subdir/usr.pkg on /pbulk/clients/subdir/root/usr/pkg null
There it panicked. However, I was doing that from a shell script, so maybe something left unfinished from an earlier step was doing it. This was kind of reproducible: I did the same sequence one time before and the box just rebooted.
I probably won't be able to reproduce this, because the machine is part of our lab's cluster and sits in a server room: other people want to use it, it is cold there, and I have other stuff to do :/
cheers
simon
Updated by dillon over 18 years ago
:hey,
:
:I just got this double fault when I was doing some layered mounts:
:
:Fatal double fault
:eip=0xc034a27a
:esp=0xe2426ffc
:ebp=0xe2427000
:
:...
:trapwrite(0,c64e0760,0,0,0)
:
:I'll look the symbols up. I guess that was a stack overflow, the second one.
The symbols might give me a clue. Considering the value of %esp it
certainly does look like stack overflow.
nm -N kernel.debug | less ... look for procedure containing %eip above.
-Matt
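
A rough way to see why the %esp value hints at a stack overflow is to look at where it sits within its page. The sketch below is only an illustration: it reads the reported esp as the hex address 0xe2426ffc and assumes 4 KiB pages, and the helper names page_offset and distance are made up for this example.

```sh
#!/bin/sh
# Hypothetical helpers, not part of any kernel tooling.

# Print the offset of an address within its 4 KiB page, in hex.
page_offset() {
    printf '%x\n' $(( $1 & 0xfff ))
}

# Print the distance in bytes between two addresses (first minus second).
distance() {
    printf '%d\n' $(( $1 - $2 ))
}

page_offset 0xe2426ffc             # esp: prints ffc
distance 0xe2427000 0xe2426ffc     # ebp - esp: prints 4
```

Under those assumptions esp points at the last word of the page just below the 0xe2427000 boundary, which is what one would expect if the stack had just grown down across a page boundary.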
Updated by corecode over 18 years ago
I accidentally killed the debug kernel by doing make installkernel :/ I'll see if I can recreate the kernel.
cheers
simon
Updated by corecode over 18 years ago
nm data follows:
c034a26e T vop_nresolve_ap
c034a285 T vop_nlookupdotdot_ap
well, trapwrite just takes one argument, so that would be NULL.
cheers
simon
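
For reference, the lookup can be sketched as a small script: given nm-style lines ("address type name") on stdin, pick the symbol with the highest address that does not exceed the faulting eip. The function name find_symbol is hypothetical, and only the two symbols quoted above are fed in, so this is a sketch of the idea rather than a replacement for running nm on the real kernel.debug.

```sh
#!/bin/sh
# find_symbol ADDR
# Reads nm-style lines ("hexaddr type name") from stdin and prints the
# symbol with the highest address <= ADDR, followed by the offset into it.
# Hypothetical helper for illustration only.
find_symbol() {
    target=$(printf '%d' "$1")
    best_name=""
    best_val=-1
    while read -r addr type name; do
        val=$(printf '%d' "0x$addr")
        # Keep the closest symbol at or below the target address.
        if [ "$val" -le "$target" ] && [ "$val" -gt "$best_val" ]; then
            best_name=$name
            best_val=$val
        fi
    done
    [ -n "$best_name" ] || return 1
    printf '%s+0x%x\n' "$best_name" $(( target - best_val ))
}

find_symbol 0xc034a27a <<'EOF'
c034a26e T vop_nresolve_ap
c034a285 T vop_nlookupdotdot_ap
EOF
```

With the two symbols from the nm output this prints vop_nresolve_ap+0xc, i.e. the eip falls 12 bytes into vop_nresolve_ap.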
Updated by dillon over 18 years ago
:nm data follows:
:
:Simon 'corecode' Schubert wrote:
:> Fatal double fault
:> eip=0xc034a27a
:
:c034a26e T vop_nresolve_ap
:c034a285 T vop_nlookupdotdot_ap
:...
:cheers
: simon
I was afraid it was in that code. I gotta make the resolve code
non-recursive or O(LOG(n)) recursive, simple as that.
-Matt
Matthew Dillon
<dillon@backplane.com>
Updated by dillon over 18 years ago
:server:/pbulk on /pbulk nfs
:/dev/ad6s1b on /pbulk2 ufs
:/pbulk2/root on /pbulk/root null
:/pbulk2/clients on /pbulk/clients null
:/pbulk/root on /pbulk/clients/subdir/root null,ro
:/pbulk/clients/subdir/var on /pbulk/clients/subdir/root/var null
:/pbulk/clients/subdir/tmp on /pbulk/clients/subdir/root/tmp null
:/pbulk/clients/subdir/dev on /pbulk/clients/subdir/root/dev null
:/pbulk/clients/subdir/usr.pkg on /pbulk/clients/subdir/root/usr/pkg null
Do you still have the shell script?
Also, a clarification: the panic occurred while you were doing the mounts,
not while it was doing a build or running utilities or other things
inside the mounts?
-Matt
Updated by corecode over 18 years ago
Nothing. I was setting up the chroot for use. The shell script did exactly these steps:
#!/bin/sh
client=$1
cd "$client"
mount -t null -o ro /pbulk/root root
mount -t null var root/var
mount -t null tmp root/tmp
mount -t null dev root/dev
mount -t null usr.pkg root/usr/pkg
mount -t null /pbulk/scratch root/pbulk/scratch
mount -t null /pbulk/packages root/pbulk/packages
mount -t null /pbulk/distfiles root/pbulk/distfiles
mount -t null /pbulk/bulklog root/pbulk/bulklog
mount -t null -o ro /pbulk/pkgsrc root/usr/pkgsrc
the /pbulk, /pbulk2, /pbulk/root, /pbulk/clients mounts were done manually before.
cheers
simon
Updated by dillon over 18 years ago
:Matthew Dillon wrote:
:> :server:/pbulk on /pbulk nfs
:> :/dev/ad6s1b on /pbulk2 ufs
:> :/pbulk2/root on /pbulk/root null
:> :/pbulk2/clients on /pbulk/clients null
:>
:
:cd "$client"
:mount -t null -o ro /pbulk/root root
:mount -t null var root/var
:mount -t null tmp root/tmp
:mount -t null dev root/dev
:mount -t null usr.pkg root/usr/pkg
:mount -t null /pbulk/scratch root/pbulk/scratch
:mount -t null /pbulk/packages root/pbulk/packages
:mount -t null /pbulk/distfiles root/pbulk/distfiles
:mount -t null /pbulk/bulklog root/pbulk/bulklog
:mount -t null -o ro /pbulk/pkgsrc root/usr/pkgsrc
:
:the /pbulk, /pbulk2, /pbulk/root, /pbulk/clients mounts were done manually before.
:
:cheers
: simon
Well, so far I haven't been able to crash anything. I definitely want
to try to reproduce this. It isn't what I originally thought it was,
since there are no deep directory recursions occurring here. So whatever
recursion is creating the issue must be a software bug causing
infinite recursion somehow.
How old was the kernel running on the box?
-Matt
Updated by corecode over 18 years ago
Fresh 1.10-RELEASE, only used SMP and APIC_IO.
Don't bother too much for now, I'll see if I can set up the same shebang again and this time get a coredump out of it.
cheers
simon
Updated by dillon over 18 years ago
:Fresh 1.10-RELEASE, only used SMP and APIC_IO.
:
:Don't bother too much for now, I'll see if I can setup the same shebang again and this time get a coredump out of it.
:
:cheers
: simon
Ok. If you can reproduce the double fault we can add a shim to the
code to check whether the stack is too deep or not and panic the system
before it actually hits the double fault. Then a kernel core will be
possible.
-Matt
Updated by tuxillo over 18 years ago
I think Dillon, who's doing a pbulk on AMD64, has a good number of null mounts
without problems, doesn't he?
How could we reproduce this?
Updated by corecode over 18 years ago
Let's close this. If the parallel bulk builders didn't hit this problem, it is
probably fixed.