Bug #1729

Hammer REDO recovery panic

Added by vsrinivas over 4 years ago. Updated almost 2 years ago.

Status: Closed
Start date: -
Priority: Normal
Due date: -
Assignee: tuxillo
% Done: 0%
Category: -
Target version: -

Description

Restarting my system after an earlier panic from running fsstress on HAMMER,
fsync_mode=2. Running DragonFly 2.6.0-gfa1ae. Will try to have a dump tomorrow.

Mounting root from hammer:serno/G3H3DSKC.s1d
tryroot serno/G3H2DSKC.s1d
HAMMER(ROOT) recovery check seqno=023f0d60
HAMMER(ROOT) recovery range 3000000003838438-3000000003fee500
HAMMER(ROOT) recovery nexto 3000000003fee500 endseqno=023f6204
HAMMER(ROOT) recovery undo 3000000003838438-30000000003fee500 (8085704 bytes)(RW)
HAMMER(ROOT) Continuing recovery
HAMMER(ROOT) Continuing recovery
HAMMER(ROOT) Continuing recovery
HAMMER(ROOT) Continuing recovery
HAMMER(ROOT) Found REDO_SYNC 30000000025c4a38
HAMMER(ROOT) recovery complete
HAMMER(ROOT) recovery redo 3000000003838438-30000000003fee500 (80857604 bytes)(RW)
HAMMER(ROOT) Find extended redo 30000000025c4a38, 19347968 extbytes
HAMMER(ROOT) Find extended redo failed 34, unable to run REDO
HAMMER(ROOT) End redo recovery
panic: hammer_ref_interlock_true: bad lock 0xc758a688 00000001

Trace beginning at frame 0xc05d4af4
panic(c05d4b18,50000001,c758a680,c02ed226,c05d4b24) at panic+0x8c
panic(c03e2510,c758a688,1,c758a680,c05d4b34) at panic+0x8c
hammer_ref_interlock_true(c758a688,c758a680,c05d4b5c,c02ed31c,c758a680) at hammer_ref_interlock_true+0x25
hammer_unload_buffer(c758a680,0,0,0,c758a600) at hammer_unload_buffer+0x38
hammer_buf_rb_tree_RB_SCAN(c740d034,0,c02ee747,0,c1508040) at hammer_buf_rb_tree_RB_SCAN+0xad
hammer_free_hmp(c740d384,c1225018,1,22,c10c2180) at hammer_free_hmp+0x13b
hammer_vfs_mount(c70d17b8,0,0,c612f5b0,c05d4cec) at hammer_vfs_mount+0xa57
vfs_mount(c70d17b8,0,0,c612f5b0,c10c2180) at vfs_mount+0x32
vfs_mountroot_try(c03bbab0)
vfs_mountroot(0,ffffffff,5d1c00,5df000,5df000) at vfs_mountroot+0x7b
mi_startup(5d1000,0,0,0,0) at mi_startup+0x92
begin() at begin+0x42
Debugger("panic")
Stopped at Debugger+0x34: movb $0,in_Debugger.4308
db>

History

#1 Updated by dillon over 4 years ago

:New submission from Venkatesh Srinivas <>:
:
:Restarting my system after an earlier panic from running fsstress on HAMMER,
:fsync_mode=2. Running DragonFly 2.6.0-gfa1ae. Will try to have a dump tomorrow.
:
:Mounting root from hammer:serno/G3H3DSKC.s1d
:tryroot serno/G3H2DSKC.s1d
:HAMMER(ROOT) recovery check seqno=023f0d60
:HAMMER(ROOT) recovery range 3000000003838438-3000000003fee500
:HAMMER(ROOT) recovery nexto 3000000003fee500 endseqno=023f6204
:HAMMER(ROOT) recovery undo 3000000003838438-30000000003fee500 (8085704
:bytes)(RW)
:HAMMER(ROOT) Continuing recovery
:HAMMER(ROOT) Continuing recovery
:HAMMER(ROOT) Continuing recovery
:HAMMER(ROOT) Continuing recovery
:HAMMER(ROOT) Found REDO_SYNC 30000000025c4a38
:HAMMER(ROOT) recovery complete
:HAMMER(ROOT) recovery redo 3000000003838438-30000000003fee500 (80857604
:bytes)(RW)
:HAMMER(ROOT) Find extended redo 30000000025c4a38, 19347968 extbytes
:HAMMER(ROOT) Find extended redo failed 34, unable to run REDO
:HAMMER(ROOT) End redo recovery
:panic: hammer_ref_interlock_true: bad lock 0xc758a688 00000001
:
:Trace beginning at frame 0xc05d4af4
:panic(c05d4b18,50000001,c758a680,c02ed226,c05d4b24) at panic+0x8c
:...

Ok, I found the bug related to the panic. I have committed a
fix to head and will MFC to 2.6.x.

The REDO error itself is another problem. That error is not supposed
to happen. Please run the command 'hammer -f <device> show-undo' and
put the output on your leaf account. How large is the HAMMER filesystem?
(Hopefully the data hasn't been lost since that time, I'm crossing my
fingers).

You may have issues booting. You need to boot with a fixed kernel to
get past the panic, and that may require booting from a USB stick or
something similar.

-Matt
Matthew Dillon
<>

#2 Updated by vsrinivas over 4 years ago

http://acm.jhu.edu/~me/redo_panic holds the show-undo for the fs (undo.gz) and
the vmcore/kern for the panic that occurs when you attempt to mount it.

#3 Updated by vsrinivas over 4 years ago

Have you had a chance to look at this redo_panic log?

I'd like to repurpose the disk holding this fs soon, but if you'd like it
kept, or think there's a chance to recover the fs, I'll keep it around.

In the future, for fses with a REDO fifo problem, would it make sense to offer a
'really read-only' mount that doesn't attempt to replay the redo fifo? That way,
we'd at least stand a chance at salvaging data...

Thanks,
-- vs

#4 Updated by dillon over 4 years ago

:Have you had a chance to look at this redo_panic log?
:
:I'd like to repurpose the disk holding this fs soon, but if you'd like or think
:there's a chance to recover the fs, I'll keep it around.
:
:In the future, for fses with a REDO fifo problem, would it make sense to offer a
:'really read-only' mount that doesn't attempt to replay the redo fifo? That way,
:we'd at least stand a chance at salvaging data...
:
:Thanks,
:-- vs

The lock panic was due to a mismatched lock/unlock which should have
been fixed.

An UNDO recovery error is fatal, but a REDO recovery error is not
fatal. It does mean the REDO failed but the filesystem itself will
wind up in a working state.

-Matt

#5 Updated by vsrinivas over 4 years ago

Oh, I guess I wasn't clear; even after the lock fix, I am unable to mount the
filesystem. During mount, the failure to run REDO prevents the fs from mounting...

-- vs

#6 Updated by dillon over 4 years ago

:Venkatesh Srinivas <> added the comment:
:
:Oh, I guess I wasn't clear; even after the lock fix, I am unable to mount the
:filesystem. During mount, the failure to run REDO prevents the fs from mounting...
:
:-- vs

No, you were clear. I guess what I need to do is to hack the code
to force a failure during the redo run and track down why it isn't
allowing the mount.

Matthew Dillon
<>

#7 Updated by tuxillo almost 2 years ago

  • Status changed from New to Closed
  • Assignee changed from 0 to tuxillo

Venk,

I think this commit would help you mount in that case: http://gitweb.dragonflybsd.org/dragonfly.git/commit/dbd4f60002b98556e6fc8413e6eacf2aedfce6df
Since the mismatched locking was corrected by Matt, and there's now a workaround to skip the REDO stage (so you are able to mount in that situation), I will close this ticket.
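[Editorial note: a rough sketch of using that workaround. The sysctl name and
values below are an assumption based on the referenced commit; check the commit
itself for the exact knob before relying on this.]

```sh
# Assumed knob from commit dbd4f600...: tell HAMMER to skip the REDO
# stage at mount time so a damaged REDO FIFO can't block the mount.
sysctl vfs.hammer.skip_redo=2   # name/values assumed, verify in the commit

# Then mount the filesystem as usual.
mount -t hammer serno/G3H3DSKC.s1d /mnt
```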

Cheers,
Antonio Huete
