Bug #749: double fault when nullmounting - DragonFlyBSD - DragonFlyBSD bugtracker

Added by corecode over 18 years ago. Updated over 16 years ago.

Status: Closed
Priority: High

Description

hey,

I just got this double fault when I was doing some layered mounts:

Fatal double fault
eip=0xc034a27a
esp=0xe2426ffc
ebp=0xe2427000

...
trapwrite(0,c64e0760,0,0,0)

I'll look the symbols up. I guess that was a stack overflow, the second one.

This is a two-processor hyperthreading Xeon (4 logical CPUs), running an SMP kernel.

The mounting was something like this:

server:/pbulk on /pbulk nfs
/dev/ad6s1b on /pbulk2 ufs
/pbulk2/root on /pbulk/root null
/pbulk2/clients on /pbulk/clients null
/pbulk/root on /pbulk/clients/subdir/root null,ro
/pbulk/clients/subdir/var on /pbulk/clients/subdir/root/var null
/pbulk/clients/subdir/tmp on /pbulk/clients/subdir/root/tmp null
/pbulk/clients/subdir/dev on /pbulk/clients/subdir/root/dev null
/pbulk/clients/subdir/usr.pkg on /pbulk/clients/subdir/root/usr/pkg null

There it panicked. However, I was doing that from a shell script, so maybe something left unfinished by an earlier step triggered it. This was somewhat reproducible: I ran the same sequence once before and the box just rebooted.

I probably won't be able to reproduce this: the machine is part of our lab's cluster and sits in a server room, so other people want to use it, it is cold there, and I have other stuff to do :/

cheers
simon

Updated by dillon over 19 years ago

:hey,
:
:I just got this double fault when I was doing some layered mounts:
:
:Fatal double fault
:eip=0xc034a27a
:esp=0xe2426ffc
:ebp=0xe2427000
:
:...
:trapwrite(0,c64e0760,0,0,0)
:
:I'll look the symbols up. I guess that was a stack overflow, the second one.

The symbols might give me a clue.  Considering the value of %esp it
    certainly does look like a stack overflow.
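
To make the %esp observation concrete, here is a minimal shell sketch of the arithmetic. It assumes the reported esp is meant to be read as 0xe2426ffc and that the machine uses 4 KiB pages (usual on i386, but not stated in this report); `page_offset` is a hypothetical helper name, not an existing tool.

```sh
#!/bin/sh
# Register values from the report; esp is an assumed reading of the printed value.
esp=0xe2426ffc
ebp=0xe2427000
page_size=4096   # assumption: 4 KiB pages

# page_offset ADDR: print ADDR's byte offset within its page.
page_offset() {
    echo $(( $1 % $page_size ))
}

echo "ebp offset in page: $(page_offset "$ebp")"
echo "esp offset in page: $(page_offset "$esp")"
echo "ebp - esp:          $(( $ebp - $esp )) bytes"
```

Under those assumptions ebp sits exactly on a page boundary and esp is 4 bytes below it, in the last word of the preceding page. That is the kind of picture one would expect if the stack had run off the bottom of its mapping, though the register values alone do not prove it.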

nm -N kernel.debug | less ... look for procedure containing %eip above.

-Matt

Updated by corecode over 19 years ago

I accidentally killed the debug kernel by doing make installkernel :/ I'll see if I can recreate the kernel.

cheers
simon

Updated by corecode over 19 years ago

nm data follows:

c034a26e T vop_nresolve_ap
c034a285 T vop_nlookupdotdot_ap
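
The lookup can be sketched in shell as follows. This is only an illustration: `find_symbol` is a hypothetical helper, it assumes nm-style input lines of the form "address type name", and it simply reports the symbol with the highest address that does not exceed the given one, which is a guess at the containing function rather than a guarantee.

```sh
#!/bin/sh
# find_symbol ADDR: read nm-style lines ("address type name") on stdin and
# print the name of the symbol with the greatest address that is <= ADDR.
find_symbol() {
    target=$((0x${1#0x}))        # accept the address with or without 0x
    best_addr=-1
    best_name=
    while read -r addr type name; do
        # Undefined symbols have no address column, so name is empty: skip.
        [ -n "$name" ] || continue
        value=$((0x$addr))
        if [ "$value" -le "$target" ] && [ "$value" -gt "$best_addr" ]; then
            best_addr=$value
            best_name=$name
        fi
    done
    printf '%s\n' "$best_name"
}

# The two symbols from the nm data above, checked against the reported eip.
printf '%s\n' \
    'c034a26e T vop_nresolve_ap' \
    'c034a285 T vop_nlookupdotdot_ap' | find_symbol 0xc034a27a
```

With the two lines above as input, the reported eip falls after vop_nresolve_ap and before vop_nlookupdotdot_ap, so the sketch prints vop_nresolve_ap. Against a real kernel the input would be the full nm output for kernel.debug piped into the function.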

well, trapwrite just takes one argument, so that would be NULL.

cheers
simon

Updated by dillon over 19 years ago

:nm data follows:
:
:Simon 'corecode' Schubert wrote:
:> Fatal double fault
:> eip=0xc034a27a
:
:c034a26e T vop_nresolve_ap
:c034a285 T vop_nlookupdotdot_ap
:...
:cheers
: simon

I was afraid it was in that code.  I gotta make the resolve code
    non-recursive or O(LOG(n)) recursive, simple as that.

-Matt
                    Matthew Dillon 
                    <dillon@backplane.com>

Updated by dillon over 19 years ago

:server:/pbulk on /pbulk nfs
:/dev/ad6s1b on /pbulk2 ufs
:/pbulk2/root on /pbulk/root null
:/pbulk2/clients on /pbulk/clients null
:/pbulk/root on /pbulk/clients/subdir/root null,ro
:/pbulk/clients/subdir/var on /pbulk/clients/subdir/root/var null
:/pbulk/clients/subdir/tmp on /pbulk/clients/subdir/root/tmp null
:/pbulk/clients/subdir/dev on /pbulk/clients/subdir/root/dev null
:/pbulk/clients/subdir/usr.pkg on /pbulk/clients/subdir/root/usr/pkg null

Do you still have the shell script?

Also clarification:  The panic occurred while you were doing the mounts,
    not while it was doing a build or running utilities or other things
    inside the mounts?

-Matt

Updated by corecode over 19 years ago

Nothing. I was setting up the chroot for use. The shell script did exactly these steps:

#!/bin/sh

client=$1

cd "$client"
mount -t null -o ro /pbulk/root root
mount -t null var root/var
mount -t null tmp root/tmp
mount -t null dev root/dev
mount -t null usr.pkg root/usr/pkg
mount -t null /pbulk/scratch root/pbulk/scratch
mount -t null /pbulk/packages root/pbulk/packages
mount -t null /pbulk/distfiles root/pbulk/distfiles
mount -t null /pbulk/bulklog root/pbulk/bulklog
mount -t null -o ro /pbulk/pkgsrc root/usr/pkgsrc

the /pbulk, /pbulk2, /pbulk/root, /pbulk/clients mounts were done manually before.

cheers
simon

Updated by dillon over 19 years ago

:Matthew Dillon wrote:
:> :server:/pbulk on /pbulk nfs
:> :/dev/ad6s1b on /pbulk2 ufs
:> :/pbulk2/root on /pbulk/root null
:> :/pbulk2/clients on /pbulk/clients null
:>
:
:cd "$client"
:mount -t null -o ro /pbulk/root root
:mount -t null var root/var
:mount -t null tmp root/tmp
:mount -t null dev root/dev
:mount -t null usr.pkg root/usr/pkg
:mount -t null /pbulk/scratch root/pbulk/scratch
:mount -t null /pbulk/packages root/pbulk/packages
:mount -t null /pbulk/distfiles root/pbulk/distfiles
:mount -t null /pbulk/bulklog root/pbulk/bulklog
:mount -t null -o ro /pbulk/pkgsrc root/usr/pkgsrc
:
:the /pbulk, /pbulk2, /pbulk/root, /pbulk/clients mounts were done manually before.
:
:cheers
: simon

Well, so far I haven't been able to crash anything.  I definitely want
    to try to reproduce this.  It isn't what I originally thought it was,
    since there are no deep directory recursions occurring here.  So whatever
    recursion is causing the problem must be a software bug that somehow
    recurses infinitely.

How old was the kernel running on the box?

-Matt

Updated by corecode over 19 years ago

Fresh 1.10-RELEASE, only used SMP and APIC_IO.

Don't bother too much for now, I'll see if I can set up the same shebang again and this time get a coredump out of it.

cheers
simon

Updated by dillon over 19 years ago

:Fresh 1.10-RELEASE, only used SMP and APIC_IO.
:
:Don't bother too much for now, I'll see if I can set up the same shebang again and this time get a coredump out of it.
:
:cheers
: simon

Ok.  If you can reproduce the double fault we can add a shim to the
    code to check whether the stack is too deep or not and panic the system
    before it actually hits the double fault.  Then a kernel core will be
    possible.