double fault when nullmounting
I just got this double fault when I was doing some layered mounts:
Fatal double fault
I'll look the symbols up. I guess that was a stack overflow, the second one.
This is a two processor hyperthreading Xeon (4 logical CPUs), SMP kernel.
The mounting was something like this:
server:/pbulk on /pbulk nfs
/dev/ad6s1b on /pbulk2 ufs
/pbulk2/root on /pbulk/root null
/pbulk2/clients on /pbulk/clients null
/pbulk/root on /pbulk/clients/subdir/root null,ro
/pbulk/clients/subdir/var on /pbulk/clients/subdir/root/var null
/pbulk/clients/subdir/tmp on /pbulk/clients/subdir/root/tmp null
/pbulk/clients/subdir/dev on /pbulk/clients/subdir/root/dev null
/pbulk/clients/subdir/usr.pkg on /pbulk/clients/subdir/root/usr/pkg null
That is where it panicked. However, I was doing that from a shell script, so maybe something unfinished from an earlier step was responsible. This was kind of reproducible: I did the same sequence one time before and the box just rebooted.
I probably won't be able to reproduce this because the machine is part of our lab's cluster and sits in a server room: other people want to use it, it is cold there, and I have other stuff to do :/
#1 Updated by dillon about 7 years ago
:I just got this double fault when I was doing some layered mounts:
:Fatal double fault
:I'll look the symbols up. I guess that was a stack overflow, the second one.
The symbols might give me a clue. Considering the value of %esp it
certainly does look like stack overflow.
nm -N kernel.debug | less ... look for procedure containing %eip above.
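The lookup described above comes down to finding the last symbol whose start address is at or below the faulting %eip. Here is a minimal sketch in C, assuming the nm output has already been parsed into a table sorted by address. The struct sym type and the sym_containing name are made up for illustration and are not part of any kernel or binutils API.

```c
#include <stddef.h>
#include <stdint.h>

/* One line of nm output: a start address and a symbol name (hypothetical type). */
struct sym {
    uint32_t    addr;
    const char *name;
};

/*
 * Return the last symbol whose start address is <= pc, or NULL if pc
 * lies below the first symbol. The table must be sorted by address.
 */
static const struct sym *
sym_containing(const struct sym *tab, size_t n, uint32_t pc)
{
    size_t lo = 0, hi = n;      /* search window is [lo, hi) */

    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;

        if (tab[mid].addr <= pc)
            lo = mid + 1;       /* pc is at or past this symbol */
        else
            hi = mid;           /* pc is before this symbol */
    }
    /* lo now counts the symbols that start at or below pc. */
    return (lo == 0) ? NULL : &tab[lo - 1];
}
```

Given a two-entry table holding c034a26e vop_nresolve_ap and c034a285 vop_nlookupdotdot_ap, an example address such as 0xc034a270 (chosen for illustration, not taken from the fault report) resolves to vop_nresolve_ap.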
#4 Updated by dillon about 7 years ago
:nm data follows:
:Simon 'corecode' Schubert wrote:
:> Fatal double fault
:c034a26e T vop_nresolve_ap
:c034a285 T vop_nlookupdotdot_ap
I was afraid it was in that code. I gotta make the resolve code
non-recursive or O(LOG(n)) recursive, simple as that.
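To illustrate the non-recursive option in general terms, here is a minimal sketch in C of a walk down a chain of stacked layers, written first recursively and then iteratively. The struct layer type and both function names are hypothetical stand-ins; this is not the actual resolve code, only the shape of the transformation.

```c
#include <stddef.h>

/* Hypothetical stand-in for one layer in a stack of mounts; not a kernel type. */
struct layer {
    struct layer *lower;   /* layer this one sits on, NULL at the bottom */
};

/*
 * Recursive form: as written, one stack frame per layer, so stack use
 * grows with the chain length (unless the compiler happens to turn the
 * tail call into a loop).
 */
static struct layer *
bottom_recursive(struct layer *l)
{
    if (l->lower == NULL)
        return l;
    return bottom_recursive(l->lower);
}

/* Iterative form: same result, constant stack use for any chain length. */
static struct layer *
bottom_iterative(struct layer *l)
{
    while (l->lower != NULL)
        l = l->lower;
    return l;
}
```

Both functions return the same bottom layer. Only the iterative one keeps stack use independent of how many layers are stacked, which is the property that matters on a small, fixed-size kernel stack.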
#5 Updated by dillon about 7 years ago
:server:/pbulk on /pbulk nfs
:/dev/ad6s1b on /pbulk2 ufs
:/pbulk2/root on /pbulk/root null
:/pbulk2/clients on /pbulk/clients null
:/pbulk/root on /pbulk/clients/subdir/root null,ro
:/pbulk/clients/subdir/var on /pbulk/clients/subdir/root/var null
:/pbulk/clients/subdir/tmp on /pbulk/clients/subdir/root/tmp null
:/pbulk/clients/subdir/dev on /pbulk/clients/subdir/root/dev null
:/pbulk/clients/subdir/usr.pkg on /pbulk/clients/subdir/root/usr/pkg null
Do you still have the shell script?
Also, a clarification: the panic occurred while you were doing the mounts,
not while it was doing a build or running utilities or other things
inside the mounts?
#6 Updated by corecode about 7 years ago
Nothing. I was setting up the chroot for use. The shell script did exactly these steps:
mount -t null -o ro /pbulk/root root
mount -t null var root/var
mount -t null tmp root/tmp
mount -t null dev root/dev
mount -t null usr.pkg root/usr/pkg
mount -t null /pbulk/scratch root/pbulk/scratch
mount -t null /pbulk/packages root/pbulk/packages
mount -t null /pbulk/distfiles root/pbulk/distfiles
mount -t null /pbulk/bulklog root/pbulk/bulklog
mount -t null -o ro /pbulk/pkgsrc root/usr/pkgsrc
The /pbulk, /pbulk2, /pbulk/root, /pbulk/clients mounts were done manually before.
#7 Updated by dillon about 7 years ago
:Matthew Dillon wrote:
:> :server:/pbulk on /pbulk nfs
:> :/dev/ad6s1b on /pbulk2 ufs
:> :/pbulk2/root on /pbulk/root null
:> :/pbulk2/clients on /pbulk/clients null
:mount -t null -o ro /pbulk/root root
:mount -t null var root/var
:mount -t null tmp root/tmp
:mount -t null dev root/dev
:mount -t null usr.pkg root/usr/pkg
:mount -t null /pbulk/scratch root/pbulk/scratch
:mount -t null /pbulk/packages root/pbulk/packages
:mount -t null /pbulk/distfiles root/pbulk/distfiles
:mount -t null /pbulk/bulklog root/pbulk/bulklog
:mount -t null -o ro /pbulk/pkgsrc root/usr/pkgsrc
:The /pbulk, /pbulk2, /pbulk/root, /pbulk/clients mounts were done manually before.
Well, so far I haven't been able to crash anything. I definitely want
to try to reproduce this. It isn't what I originally thought it was,
since there are no deep directory recursions occurring here. So whatever
recursion is creating the issue must be a software bug causing
infinite recursion somehow.
How old was the kernel running on the box?
#9 Updated by dillon about 7 years ago
:Fresh 1.10-RELEASE, only used SMP and APIC_IO.
:Don't bother too much for now, I'll see if I can set up the same shebang again and this time get a coredump out of it.
Ok. If you can reproduce the double fault we can add a shim to the
code to check whether the stack is too deep or not and panic the system
before it actually hits the double fault. Then a kernel core will be