Bug #2683

HAMMER data corruption with rebranded LSI RAID adapters

Added by ftigeot 3 months ago. Updated 10 days ago.

Status: Resolved
Start date: 06/11/2014
Priority: Normal
Due date:
Assignee: -
% Done: 0%
Category: -
Target version: -

Description

Two different machines running DragonFly 3.6.x had the kernel report HAMMER CRC32 data corruption and then panic.

After a reboot, the kernel couldn't repair and mount the HAMMER volumes where these CRC32 errors occurred due to massive amounts of data corruption.

The systems had these elements in common:
- DragonFly 3.6.x
- Dell rack servers
- HAMMER filesystem on a hardware RAID volume managed by the mfi(4) LSI MegaRAID SAS driver

The two different machines were apparently using the same kind of rebranded LSI RAID adapter.

PCI id from one of the cards:
mfi0@pci0:3:0:0: class=0x010400 card=0x1f341028 chip=0x005b1000 rev=0x05 hdr=0x00
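
To make the "rebranded" part concrete, the pciconf line above can be decoded by hand: the chip and card fields each pack a 16-bit device ID in the upper half and a 16-bit vendor ID in the lower half. The following is only an illustrative sketch; the function names and the small vendor table are made up for this example, and the vendor names are the commonly listed ones for IDs 0x1000 (LSI) and 0x1028 (Dell).

```python
import re

# Hypothetical lookup table for this example only; names as commonly
# listed in the public PCI ID registry.
KNOWN_VENDORS = {0x1000: "LSI", 0x1028: "Dell"}


def split_id(value):
    """Split a 32-bit pciconf ID into its (device, vendor) 16-bit halves."""
    return (value >> 16) & 0xFFFF, value & 0xFFFF


def decode_pciconf(line):
    """Extract chip and subsystem (card) IDs from one 'pciconf -l' line."""
    fields = dict(re.findall(r"(\w+)=0x([0-9a-fA-F]+)", line))
    chip_device, chip_vendor = split_id(int(fields["chip"], 16))
    card_device, card_vendor = split_id(int(fields["card"], 16))
    return {
        "chip_vendor": chip_vendor,
        "chip_device": chip_device,
        "card_vendor": card_vendor,
        "card_device": card_device,
    }


line = ("mfi0@pci0:3:0:0: class=0x010400 card=0x1f341028 "
        "chip=0x005b1000 rev=0x05 hdr=0x00")
ids = decode_pciconf(line)
# The chip vendor is LSI while the subsystem (card) vendor is Dell,
# i.e. an LSI controller sold under a Dell part number.
print(KNOWN_VENDORS.get(ids["chip_vendor"]), hex(ids["chip_device"]))
print(KNOWN_VENDORS.get(ids["card_vendor"]), hex(ids["card_device"]))
```

The chip vendor differing from the subsystem vendor is what identifies the card as an LSI design sold by Dell.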

History

#1 Updated by swildner 3 months ago

We don't know if it is really related to the mfi(4) driver. But it might be a theory worth checking out.

I have committed a new driver from FreeBSD for Thunderbolt, Invader and Fury series adapters, mrsas(4):

http://lists.dragonflybsd.org/pipermail/commits/2014-June/270246.html

If you want to try it out, make sure the driver is loaded, either by putting mrsas_load=yes into your loader.conf or by adding "device mrsas" to the kernel configuration you are using. Then either disable loading or compiling in of mfi(4) entirely, or set hw.mfi.mrsas_enable=1 in /boot/loader.conf, which allows mrsas(4) to take over these adapters.
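
For the loader.conf route, the two settings named above would look roughly like this. This is a sketch, assuming mfi(4) remains in the kernel and using the usual quoted "YES" form for module loading; adjust to your own setup.

```
# /boot/loader.conf
# Load the mrsas(4) module at boot.
mrsas_load="YES"
# Let mrsas(4) claim the adapters that mfi(4) would otherwise attach to.
hw.mfi.mrsas_enable=1
```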

Note that the disk device nodes for mrsas(4) follow CAM nomenclature and are /dev/da?, while the mfi(4) driver uses /dev/mfid?. It is recommended to use the corresponding nodes in /dev/serno instead, since they allow for a smooth transition between the two drivers.

Also note that there is no mfiutil(8)-like tool for mrsas(4).

If you need this driver on 3.8, do 'git cherry-pick 6d743f0468a9bd40d1cedc939569228864d0614f' in your 3.8 branch.

#2 Updated by swildner 3 months ago

I just tried it: the commit can be cleanly cherry-picked to 3.6 too.

#3 Updated by ftigeot 3 months ago

Some newer Dell H710 firmware releases containing important fixes are available for download.
The release notes make no direct mention of data corruption, but the controllers could hang in some circumstances with older firmware.

#4 Updated by ftigeot 3 months ago

PCI id from a different adapter:

mfi0@pci0:2:0:0: class=0x010400 card=0x1f341028 chip=0x005b1000 rev=0x05 hdr=0x00

#5 Updated by ftigeot 3 months ago

There was an interesting discussion thread about data corruption with a particular LSI adapter and the mfi(4) driver on FreeBSD in March.
Some of the most pertinent individual mails:

http://lists.freebsd.org/pipermail/freebsd-scsi/2014-March/006289.html
http://lists.freebsd.org/pipermail/freebsd-scsi/2014-March/006308.html

#6 Updated by ftigeot 3 months ago

A test server using the mrsas(4) driver is still running perfectly after having processed terabytes of database imports and NFS traffic.
At this point, it seems likely that the corruption seen with Thunderbolt LSI RAID adapters is specific to the mfi(4) driver.

#7 Updated by ftigeot 10 days ago

  • Status changed from New to Resolved

The aforementioned database server has now been running in production for months.
Closing.
