Bug #2683

closed

HAMMER data corruption with rebranded LSI RAID adapters

Added by ftigeot over 10 years ago. Updated over 10 years ago.

Status: Resolved
Priority: Normal
Assignee: -
Category: -
Target version: -
Start date: 06/11/2014
Due date:
% Done: 0%
Estimated time:
Description

Two different machines running DragonFly 3.6.x had the kernel report HAMMER CRC32 data corruption and then panic.

After a reboot, the kernel could neither repair nor mount the HAMMER volumes on which these CRC32 errors had occurred, because of massive amounts of data corruption.

The systems had these elements in common:
- DragonFly 3.6.x
- Dell rack servers
- HAMMER filesystem on a hardware RAID volume managed by the mfi(4) LSI MegaRAID SAS driver

The two different machines were apparently using the same kind of rebranded LSI RAID adapter.

PCI id from one of the cards:
mfi0@pci0:3:0:0: class=0x010400 card=0x1f341028 chip=0x005b1000 rev=0x05 hdr=0x00

Actions #1

Updated by swildner over 10 years ago

We don't know if it is really related to the mfi(4) driver. But it might be a theory worth checking out.

I have committed a new driver from FreeBSD for Thunderbolt, Invader and Fury series adapters, mrsas(4):

http://lists.dragonflybsd.org/pipermail/commits/2014-June/270246.html

If you want to try it out, make sure it is loaded, either by putting mrsas_load=yes into your loader.conf or by adding "device mrsas" to the kernel configuration you are using. Then either disable loading or compiling in of mfi(4) entirely, or set hw.mfi.mrsas_enable=1 in /boot/loader.conf, which will allow mrsas(4) to be taken for these adapters.
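
For the module-loading variant, the two settings named above could sit together in /boot/loader.conf roughly as follows. This is only a sketch: the variable names are the ones given in this comment, the quoted "YES" spelling is the usual loader.conf convention rather than something confirmed here, and the kernel-configuration route ("device mrsas") is not shown.

```
# /boot/loader.conf (sketch)

# Load the mrsas(4) module at boot.
mrsas_load="YES"

# Keep mfi(4) available but let mrsas(4) be taken for the
# Thunderbolt, Invader and Fury series adapters.
hw.mfi.mrsas_enable=1
```

If mfi(4) is not loaded or compiled in at all, the second setting should not be needed.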

Note that the disk device nodes for mrsas(4) follow CAM nomenclature and are /dev/da?, while the mfi(4) driver uses /dev/mfid?. It is recommended to use the corresponding nodes in /dev/serno instead to allow for a smooth transition.
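
As an illustration of the /dev/serno suggestion, an /etc/fstab entry might change along these lines. SERIALNUMBER is a placeholder, and the slice and partition suffix (s1d), mount point and options are assumptions for the example, not values from the affected machines; the names that actually appear in /dev/serno on a given system should be checked rather than assumed.

```
# /etc/fstab (sketch)

# Before: node name tied to the mfi(4) driver
# /dev/mfid0s1d                  /    hammer  rw  1  1

# After: node name derived from the volume's serial number
# (SERIALNUMBER is a placeholder)
/dev/serno/SERIALNUMBER.s1d      /    hammer  rw  1  1
```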

Also note that there is no mfiutil(8)-like tool for mrsas(4).

If you need this driver on 3.8, do 'git cherry-pick 6d743f0468a9bd40d1cedc939569228864d0614f' in your 3.8 branch.

Actions #2

Updated by swildner over 10 years ago

I just tried it: the commit can be cleanly cherry-picked to 3.6 too.

Actions #3

Updated by ftigeot over 10 years ago

Some newer Dell H710 firmware releases containing important fixes are available for download.
The release notes make no direct mention of data corruption, but the controllers could hang in some circumstances with older firmware.

Actions #4

Updated by ftigeot over 10 years ago

PCI id from a different adapter:

mfi0@pci0:2:0:0: class=0x010400 card=0x1f341028 chip=0x005b1000 rev=0x05 hdr=0x00

Actions #5

Updated by ftigeot over 10 years ago

There was an interesting discussion thread on the FreeBSD freebsd-scsi list in March 2014 about data corruption with a particular LSI adapter and the mfi(4) driver.
Some of the most pertinent individual mails:

http://lists.freebsd.org/pipermail/freebsd-scsi/2014-March/006289.html
http://lists.freebsd.org/pipermail/freebsd-scsi/2014-March/006308.html

Actions #6

Updated by ftigeot over 10 years ago

A test server using the mrsas(4) driver is still running perfectly after having processed terabytes of database imports and NFS traffic.
At this point, it seems likely that the corruption seen with Thunderbolt LSI RAID adapters is specific to the mfi(4) driver.

Actions #7

Updated by ftigeot over 10 years ago

  • Status changed from New to Resolved

The aforementioned database server has now been running in production for months.
Closing.
