Bug #2341
panic: hammer_io_set_modlist: duplicate entry
Status: Closed
% Done: 60%
Description
For the past few days my amd64 install of DragonFly BSD 3.0.1 has been crashing with the following panic.
Unread portion of the kernel message buffer:
panic: hammer_io_set_modlist: duplicate entry
cpuid = 1
Trace beginning at frame 0xffffffe09e2b0620
panic() at panic+0x1fb 0xffffffff8049a84d
panic() at panic+0x1fb 0xffffffff8049a84d
hammer_io_modify() at hammer_io_modify+0x1ed 0xffffffff8067cc03
hammer_modify_buffer() at hammer_modify_buffer+0x6b 0xffffffff8067da2e
hammer_blockmap_alloc() at hammer_blockmap_alloc+0x725 0xffffffff8066a4d7
hammer_alloc_btree() at hammer_alloc_btree+0x2f 0xffffffff80688ba4
btree_search() at btree_search+0x166e 0xffffffff80670584
hammer_btree_lookup() at hammer_btree_lookup+0x107 0xffffffff80670e72
hammer_ip_sync_record_cursor() at hammer_ip_sync_record_cursor+0x393 0xffffffff80684df7
hammer_sync_record_callback() at hammer_sync_record_callback+0x230 0xffffffff80678331
hammer_rec_rb_tree_RB_SCAN() at hammer_rec_rb_tree_RB_SCAN+0xf6 0xffffffff80682e83
hammer_sync_inode() at hammer_sync_inode+0x28b 0xffffffff80677bd8
hammer_flusher_flush_inode() at hammer_flusher_flush_inode+0x66 0xffffffff80675305
hammer_fls_rb_tree_RB_SCAN() at hammer_fls_rb_tree_RB_SCAN+0xf7 0xffffffff80675a53
hammer_flusher_slave_thread() at hammer_flusher_slave_thread+0x89 0xffffffff80675b45
Debugger("panic")
CPU1 stopping CPUs: 0x00000001
stopped
oops, ran out of processes early!
panic: from debugger
cpuid = 1
boot() called on cpu#1
Uptime: 2h26m4s
Physical memory: 3877 MB
Dumping 1336 MB: 1321 1305 1289 1273 1257 1241 1225 1209 1193 1177 1161 1145 1129 1113 1097 1081 1065 1049 1033 1017 1001 985 969 953 937 921 905 889 873 857 841 825 809 793 777 761 745 729 713 697 681 665 649 633 617 601 585 569 553 537 521 505 489 473 457 441 425 409 393 377 361 345 329 313 297 281 265 249 233 217 201 185 169 153 137 121 105 89 73 57 41 25 9
This box is mirrors.nycbug.org; here are the details on the hardware. This may be a hardware bug, but I am not 100% sure.
HP DL385 G1
2x AMD Opteron 252 CPUs
4G RAM
HP Smart Array 6i RAID
HP Smart Array 6402 RAID
2x HP MSA20 arrays attached to the Smart Array 6402
The HP SA 6i has 2 disks in a RAID 1 holding the OS install on HAMMER (58 GB, 5 GB in use).
The HP SA 6402 has 12 RAID 1 sets set up as one giant HAMMER filesystem (2.7 TB, 248 GB in use).
The box runs periodic rsync jobs to pull from OpenBSD and DragonFly BSD mirrors.
This could be a mistake in how I set up the HAMMER volume on the external storage (the MSA20s).
Here is what I did to set them up:
newfs_hammer -Lexport -f da1s4:da2s4:da3s4:da5s4:da6s4:da7s4:da8s4:da9s4:da10s4:da11s4:da12s4
Then I added this to fstab:
/dev/da1s4:/dev/da2s4:/dev/da3s4:/dev/da4s4:/dev/da5s4:/dev/da6s4:/dev/da7s4:/dev/da8s4:/dev/da9s4:/dev/da10s4:/dev/da11s4:/dev/da12s4 /export hammer rw,noatime 2 2
The text dump is on pastebin: http://pastebin.com/G8vPAvHf
Updated by dillon over 12 years ago
- File hammer_mvol_01.patch hammer_mvol_01.patch added
- Assignee set to dillon
- % Done changed from 0 to 60
Please try the included patch. It looks like the red-black tree code was trying to decode a volume number from the hammer_io->offset field but the volume number is not encoded in that field, so the same offset in two different volumes was resulting in a collision.
This patch is untested.
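For illustration only, here is a minimal userland sketch of the kind of comparator collision described above and the volume-aware key that avoids it. The example_* names are made up and are not the real HAMMER structures; the attached hammer_mvol_01.patch is the actual fix.

#include <stdint.h>
#include <stdio.h>

/* Hypothetical stand-in for a buffer tracked in the modified-buffer RB tree. */
struct example_io {
	int	vol_no;		/* volume the buffer belongs to */
	int64_t	offset;		/* offset within that volume */
};

/*
 * Broken: keying the tree on the offset alone lets two buffers at the
 * same offset on different volumes compare as equal, which is the
 * "duplicate entry" collision described above.
 */
static int
example_cmp_broken(const struct example_io *a, const struct example_io *b)
{
	if (a->offset < b->offset)
		return (-1);
	if (a->offset > b->offset)
		return (1);
	return (0);
}

/*
 * Fixed: compare the volume number first, then the offset, so the key
 * is unique across a multi-volume filesystem.
 */
static int
example_cmp_fixed(const struct example_io *a, const struct example_io *b)
{
	if (a->vol_no != b->vol_no)
		return (a->vol_no < b->vol_no ? -1 : 1);
	if (a->offset < b->offset)
		return (-1);
	if (a->offset > b->offset)
		return (1);
	return (0);
}

int
main(void)
{
	struct example_io a = { .vol_no = 0, .offset = 0x10000 };
	struct example_io b = { .vol_no = 1, .offset = 0x10000 };

	/* Same offset on two different volumes: only the fixed key differs. */
	printf("broken: %d  fixed: %d\n",
	    example_cmp_broken(&a, &b), example_cmp_fixed(&a, &b));
	return (0);
}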
p.s. multi-volume support is not recommended, though other than stupid bugs like this it should work just fine. The main reason is that it doesn't add any redundancy (HAMMER1 never progressed that far, but the HAMMER2 work in progress this year should be able to achieve ZFS-like redundancy).
Also note that we believe there may be a bug in live dedup too. Batch dedup run from the hammer utility works fine, though.
-Matt
Updated by nonesuch over 12 years ago
So far Matt's patch works; the only issue noted is the following messages in dmesg:
Warning: busy page 0xffffffe0005f7928 found in cache
Warning: busy page 0xffffffe0005362f0 found in cache
CRC DATA a000000ad79e0000/65536 FAILED
a000000ad79e0000/65536 FAILED
Warning: busy page 0xffffffe006993a18 found in cache
Warning: busy page 0xffffffe002551eb8 found in cache
CRC DATA
Warning: busy page 0xffffffe000ed5e78 found in cache
CRC DATA a000000ad79e0000/65536 FAILED
a000000ad79e0000/65536 FAILED
CRC DATA
Warning: busy page 0xffffffe004706ba8 found in cache
Warning: busy page 0xffffffe00606b2d8 found in cache
Warning: busy page 0xffffffe0017ea248 found in cache
Warning: busy page 0xffffffe005c2e6e8 found in cache
Warning: busy page 0xffffffe004bba898 found in cache
Warning: busy page 0xffffffe003672a30 found in cache
Warning: busy page 0xffffffe006bc23c0 found in cache
CRC DATA a000000ad79e0000/65536 FAILED
a000000ad79e0000/65536 FAILED
CRC DATA
CRC DATA a000000ad79e0000/65536 FAILED
a000000ad79e0000/65536 FAILED
CRC DATA
CRC DATA a000000ad79e0000/65536 FAILED
a000000ad79e0000/65536 FAILED
CRC DATA
Warning: busy page 0xffffffe000c247d8 found in cache
CRC DATA @ a000000ad79e0000/65536 FAILED
Matt had said we should try a tar -cvf /dev/null /export to see if we could find the files behind the errors noted in dmesg. I was unable to get any useful results from that tar, so I am still left with that issue, which may be unrelated. Is there any way to find what file(s) are referenced by the error "CRC DATA @ a000000ad79e0000/65536 FAILED" other than the tar? I know both ZFS and IFS (Isilon's filesystem) have support for looking up block-to-file mappings. I would love to see HAMMER have a similar feature.
All in all, do you have any insight into the new warnings/errors? I have not seen any major issues develop since they were logged, so I am not sure what we should do.
Updated by vsrinivas over 12 years ago
Uhoh!
HAMMER doesn't have an explicit reverse block map, but during the forward traversal we could definitely log what file was being read/looked up, so that it could be printed on data CRC mismatches. 'hammer show' might also output enough information to construct that reverse map; I don't remember.
As an easy way to find it, just record the current file being read on entry to hammer_vop_read(). HAMMER is under a per-mount token (for uncached reads), so only one file read can proceed at once. bread()'s callbacks are what validate the data CRCs, so you'd need to track the file in the issued buffer, but it is certainly do-able.
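For illustration only, a minimal userland sketch of that bookkeeping, with made-up example_* names standing in for the real mount structure, read entry point, and CRC callback (printf stands in for the kernel's log output):

#include <stdint.h>
#include <stdio.h>

/* Hypothetical stand-in for the per-mount structure. */
struct example_mount {
	const char *last_read_path;	/* file currently being read */
};

/*
 * Would be called on entry to the read path (hammer_vop_read() in the
 * suggestion above).  Only safe because uncached reads are serialized
 * by the per-mount token, so a single read runs at a time.
 */
static void
example_read_enter(struct example_mount *mp, const char *path)
{
	mp->last_read_path = path;
}

/*
 * Would be called from the buffer callback that validates the data CRC;
 * on a mismatch it names the file that was being read.
 */
static void
example_crc_check(struct example_mount *mp, uint64_t data_off, int bytes,
    uint32_t crc, uint32_t expected)
{
	if (crc != expected) {
		printf("CRC DATA @ %016jx/%d FAILED (while reading %s)\n",
		    (uintmax_t)data_off, bytes,
		    mp->last_read_path ? mp->last_read_path : "<unknown>");
	}
}

int
main(void)
{
	struct example_mount mp = { NULL };

	example_read_enter(&mp, "/export/some/file");
	/* Force a mismatch to show the report format. */
	example_crc_check(&mp, 0xa000000ad79e0000ULL, 65536, 0x1234, 0x5678);
	return (0);
}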
Updated by nonesuch over 12 years ago
Matt / Venkatesh
Was the multi-volume patch added to DragonFly?
Updated by vsrinivas over 12 years ago
Yes; it is available in both -master and the head of the DragonFly_RELEASE_3_0 (stable) branch.