Bug #1663: panic: cleaned vnode isn't

Added by ftigeot about 14 years ago. Updated over 13 years ago.

Status: Closed
Priority: Normal
Assignee: -
Category: -
Target version: -
Start date: -
Due date: -
% Done: 0%
Estimated time: -

Description

I just got this panic on a recent 2.5.1

System is DragonFly v2.5.1.631.g711a0-DEVELOPMENT from 2010-01-22 running on a
Core 2 Duo system with 2GB RAM

The machine is an NFS server for pkgsrc distfiles.
It was extracting a distfile locally and a NFS client was also extracting a
different distfile when the panic occurred.

The filesystem is a HAMMER v4 volume created recently. It is not an upgrade from
a previous HAMMER version:

Filesystem Size Used Avail Capacity Mounted on
SAMSUNG.160 144G 22G 122G 16% /

A brief transcription of the panic message:

panic: cleaned vnode isn't
mp_lock = 00000001; cpuid = 1
Trace beginning at
panic
panic
allocvnode
getnewvnode(
hammer_get_vnode
hammer_vop_ncreate
vop_ncreate
vn_open
kern_open
sys_open
syscall2
Xint0x80_syscall
Debugger("panic")

CPU1 stopping CPUs: 0x00000001
stopped

db>

Even though I entered "call dumpsys" on the db prompt, there was no crash dump
after the next boot.

Actions #1

Updated by dillon about 14 years ago

:Even though I entered "call dumpsys" on the db prompt, there was no crash dump
:after the next boot.
:
:--
:Francois Tigeot

If you saw it dump then you should be able to recover the crash dump
by doing a manual savecore as root. If you haven't used any swap yet
that might still work.
Definitely keep a watch on it.  I definitely want to try to get a dump
of it if it occurs again.
-Matt
Matthew Dillon
Actions #2

Updated by ftigeot about 14 years ago

On Sat, Jan 30, 2010 at 10:21:23AM -0800, Matthew Dillon wrote:

:Even though I entered "call dumpsys" on the db prompt, there was no crash dump
:after the next boot.

If you saw it dump then you should be able to recover the crash dump
by doing a manual savecore as root. If you haven't used any swap yet
that might still work.

Definitely keep a watch on it. I definitely want to try to get a dump
of it if it occurs again.

I've got a dump.
The DragonFly version has been updated. It is now v2.5.1.676.gf2b2e-DEVELOPMENT.

You can get the files from here: http://www.wolfpond.org/crash.dfly/

For what it's worth, savecore(8) was trying to read /kernel and exited before
saving the dump:

$ savecore -k -v
savecore: reboot after panic: cleaned vnode isn't
savecore: /kernel: No such file or directory
savecore: writing kernel to kern.0
savecore: /kernel: Bad file descriptor
savecore: WARNING: kernel may be incomplete

I had to create a symbolic link from /boot/kernel to /kernel to make it happy.

Actions #3

Updated by dillon about 14 years ago

:I've got a dump.
:The DragonFly version has been updated. It is now v2.5.1.676.gf2b2e-DEVELOPMENT.

I've downloaded it.  I got the same panic on leaf today as well, so
now I have two nice dumps. I'm working on tracking it down.
-Matt
Actions #4

Updated by dillon about 14 years ago

Please try this patch:

fetch http://apollo.backplane.com/DFlyMisc/lock01.patch
I don't know if this will fix it or not.  There is an issue in
allocfreevnode() where a vnode whose v_lock.lk_flags has
LK_CANRECURSE set can be improperly reallocated while in the middle
of being freed, but only if the filesystem's VOP_RECLAIM code
recurses.
The problem is that really only UFS sets this flag, and UFS doesn't
recurse inside its VOP_RECLAIM code. HAMMER, on the other hand,
does not use this flag but it CAN recurse inside VOP_RECLAIM.
So the only way I can think of for this crash to occur is if UFS
recurses in softupdates and allocates new vnodes while reclaiming
a vnode, the allocate code then reusing a HAMMER vnode and reclaiming
IT, and HAMMER then recursing and trying to allocate a new vnode
itself and winding up reusing the vnode UFS was originally trying to
reclaim. A difficult path to say the least.
Both your crash dump and the one I got from leaf today crashed on
a HAMMER vnode being reallocated with a seemingly impossible state.
Clearly a MP race, but I couldn't find a smoking gun related to
HAMMER itself. Basically vp->v_mount was NULL, the vnode was in
a reclaimed state, but vp->v_data was still pointing at the
HAMMER inode and the HAMMER inode was still pointing back at the
vp. That implies the vnode was reallocated back to the same
HAMMER inode recursively from within the VOP_RECLAIM itself,
which shouldn't be possible.
So, let's see if this patch fixes it or not.
-Matt
Matthew Dillon
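
To make the shape of that race easier to picture, here is a small self-contained
userland model of the scenario described above. It is only a sketch: the
allocfreevnode() and vop_reclaim() functions, the clean flag and the depth
parameter are illustrative stand-ins, not the actual DragonFly kernel code. The
reclaim hook re-enters the allocator, the inner call hands the half-recycled
vnode out again, and the outer call's sanity check then fires, the moral
equivalent of "panic: cleaned vnode isn't":

#include <assert.h>
#include <stdio.h>

/* A minimal stand-in for a kernel vnode. */
struct vnode {
        int clean;              /* 1 = fully torn down, safe to hand out */
};

static struct vnode the_vnode;  /* the only recycle candidate in this model */

static struct vnode *allocfreevnode(int depth);

/*
 * Stand-in for vgone()/VOP_RECLAIM.  On the outermost call it needs
 * another vnode while reclaiming, the way HAMMER can, so it re-enters
 * the allocator before its caller has finished with this vnode.
 */
static void
vop_reclaim(struct vnode *vp, int depth)
{
        vp->clean = 1;                  /* teardown done, vp looks free */
        if (depth == 0)
                (void)allocfreevnode(depth + 1);
}

static struct vnode *
allocfreevnode(int depth)
{
        struct vnode *vp = &the_vnode;  /* pull the recycle candidate */

        if (!vp->clean)
                vop_reclaim(vp, depth); /* may recurse and reuse vp */

        /*
         * The inner (depth 1) call hands the same vnode out again while
         * the outer call still considers it a private candidate, so the
         * outer call's check fails: the analog of "cleaned vnode isn't".
         */
        assert(vp->clean);
        vp->clean = 0;                  /* mark it in use and hand it out */
        return (vp);
}

int
main(void)
{
        printf("recycling a vnode whose reclaim hook recurses...\n");
        (void)allocfreevnode(0);        /* aborts on the assertion above */
        return (0);
}

Compiled and run, the program aborts on the assertion in the outer call, much as
the kernel's allocation path panics when its supposedly clean candidate turns out
to be in use again.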
Actions #5

Updated by ftigeot about 14 years ago

On Fri, Feb 05, 2010 at 04:43:59PM -0800, Matthew Dillon wrote:

Please try this patch:

fetch http://apollo.backplane.com/DFlyMisc/lock01.patch

I don't know if this will fix it or not. There is an issue in
allocfreevnode() where a vnode whose v_lock.lk_flags has
LK_CANRECURSE set can be improperly reallocated while in the middle
of being freed, but only if the filesystem's VOP_RECLAIM code
recurses.

This didn't fix it. There was a new crash last night, possibly during the
daily maintenance window at 3am.

So the only way I can think of for this crash to occur is if UFS
recurses in softupdates and allocates new vnodes while reclaiming
a vnode, the allocate code then reusing a HAMMER vnode and reclaiming
IT, and HAMMER then recursing and trying to allocate a new vnode
itself and winding up reusing the vnode UFS was originally trying to
reclaim. A difficult path to say the least.

Only /boot is UFS on this machine, and it doesn't use softupdates.

Both your crash dump and the one I got from leaf today crashed on
a HAMMER vnode being reallocated with a seemingly impossible state.
Clearly a MP race, but I couldn't find a smoking gun related to
HAMMER itself. Basically vp->v_mount was NULL, the vnode was in
a reclaimed state, but vp->v_data was still pointing at the
HAMMER inode and the HAMMER inode was still pointing back at the
vp. That implies the vnode was reallocated back to the same
HAMMER inode recursively from within the VOP_RECLAIM itself,
which shouldn't be possible.

Most of the crashes I could see occurred during a pkgsrc distfile extraction,
just after I did a pkgsrc cvs update.

I've put the new core dump online.

Actions #6

Updated by dillon about 14 years ago

:This didn't fix it. There was a new crash last night, possibly during the
:daily maintenance window at 3am.

I've gotten a couple of the panics too.  I haven't found the bug yet
but it's definitely still there. I'm still trying to locate it.
Heavy vnode activity triggers it.
This is the only major bug in the development system right now.
Speaking of which, could you try turning off vfs.cache_mpsafe if you
have it on? See if that helps.
-Matt
Actions #7

Updated by ftigeot about 14 years ago

On Wed, Feb 10, 2010 at 07:45:17PM -0800, Matthew Dillon wrote:

:This didn't fix it. There was a new crash last night, possibly during the
:daily maintenance window at 3am.

I've gotten a couple of the panics too. I haven't found the bug yet
but it's definitely still there. I'm still trying to locate it.
Heavy vnode activity triggers it.

This is the only major bug in the development system right now.

Speaking of which, could you try turning off vfs.cache_mpsafe if you
have it on? See if that helps.

Unfortunately, it is still at the default value.

vfs.cache_mpsafe: 0

Actions #8

Updated by dillon about 14 years ago

Ok, try this patch:

fetch http://apollo.backplane.com/DFlyMisc/vnode04.patch
This fixes a weird deactivation race which I thought I had already
coded properly but maybe not.
-Matt
Matthew Dillon
Actions #9

Updated by dillon about 14 years ago

:
: Ok, try this patch:
:
: fetch http://apollo.backplane.com/DFlyMisc/vnode04.patch
:
: This fixes a weird deactivation race which I thought I had already
: coded properly but maybe not.

Ugh. Never mind, the panic still occurs.  I'm getting better at
reproducing it. I should have a solution today.
-Matt
Matthew Dillon
Actions #10

Updated by dillon about 14 years ago

Ok, the fix has been committed to HEAD. This time for sure.
Please retest w/ the latest.

Basically it turned out to be a vgone*() recursion.  allocfreevnode()
vgone*()'s the vnode it is trying to recycle. That called into HAMMER's
hammer_vop_inactive() which then called vrecycle() and recursively
called vgone*() again. By the time the recursion returned, the vnode had
already been destroyed/reused and was being modified after the
fact.
-Matt
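
The commit itself is not quoted in this thread, so as a rough illustration only,
here is one conventional way to break that sort of recursion: the recycle path
checks a "reclaim in progress" marker and returns instead of calling vgone*() a
second time on the same vnode. Everything in this userland sketch (the
VRECLAIMING flag, the function bodies) is hypothetical and is not the actual fix:

#include <stdio.h>

#define VRECLAIMING 0x0001      /* illustrative flag, not a real kernel constant */

struct vnode {
        int flags;
};

static void vgone(struct vnode *vp);

/*
 * Recycle path: refuse to re-enter vgone() on a vnode that is already
 * being torn down, instead of recursing the way the inactive hook
 * otherwise would.
 */
static int
vrecycle(struct vnode *vp)
{
        if (vp->flags & VRECLAIMING)
                return (0);             /* break the recursion here */
        vgone(vp);
        return (1);
}

/* Teardown path: the filesystem's inactive hook may call vrecycle() back. */
static void
vgone(struct vnode *vp)
{
        vp->flags |= VRECLAIMING;
        (void)vrecycle(vp);             /* models the inactive hook calling back */
        vp->flags &= ~VRECLAIMING;
}

int
main(void)
{
        struct vnode vn = { 0 };

        vgone(&vn);                     /* returns instead of recursing forever */
        printf("vgone() completed once, flags=0x%x\n", vn.flags);
        return (0);
}

With such a guard in place, the chain vgone*() -> inactive hook -> vrecycle()
simply unwinds instead of tearing the same vnode down twice.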
Actions #11

Updated by ftigeot about 14 years ago

On Thu, Feb 11, 2010 at 12:49:15PM -0800, Matthew Dillon wrote:

Ok, the fix has been committed to HEAD. This time for sure.
Please retest w/ the latest.

Nice to hear you have finally nailed it :)

I'm upgrading right now; I intend to reproduce the crash conditions with
two or three parallel pkgsrc builds just to be sure.

Actions #12

Updated by tuxillo over 13 years ago

Francois,

Was this eventually solved w/ Matt's commit to HEAD by that time?

Cheers,
Antonio Huete

Actions #13

Updated by ftigeot over 13 years ago

On Wed, Aug 04, 2010 at 07:36:51PM +0000, Antonio Huete Jimenez (via DragonFly issue tracker) wrote:

Antonio Huete Jimenez added the comment:

Was this eventually solved w/ Matt's commit to HEAD by that time?

I guess so; I haven't seen this particular error since then.
