Something strange is happening to my ZFS pool...

bleomycin · Jul 2, 2014

Hi Everyone,

I've been running ZoL for years now without any trouble...until now.

Code:

zpool status -v
  pool: tank
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
	attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
	using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://zfsonlinux.org/msg/ZFS-8000-9P
  scan: scrub repaired 6.13G in 6h34m with 0 errors on Wed Jul  2 15:47:42 2014
config:

	NAME                                         STATE     READ WRITE CKSUM
	tank                                         ONLINE       0     0     0
	  raidz2-0                                   ONLINE       0     0     0
	    ata-HGST_HDS724040ALE640_PK1334PBG7YPRS  ONLINE       0     0 19.3K
	    ata-HGST_HDS724040ALE640_PK1334PBG70MES  ONLINE       0     0 10.9K
	    ata-HGST_HDS724040ALE640_PK1334PBG7YH9S  ONLINE       0     0 8.53K
	    ata-HGST_HDS724040ALE640_PK2334PBGBPH7T  ONLINE       0     0 11.9K
	    ata-HGST_HDS724040ALE640_PK2334PBGHTY9T  ONLINE       0     0 7.97K
	    ata-HGST_HDS724040ALE640_PK1334PBG79XAS  ONLINE       0     0 28.7K
	  raidz2-1                                   ONLINE       0     0     0
	    ata-HGST_HDS724040ALE640_PK1334PBG1H80X  ONLINE       0     0 23.8K
	    ata-HGST_HDS724040ALE640_PK2334PBG2SR4T  ONLINE       0     0 14.7K
	    ata-HGST_HDS724040ALE640_PK1334PBG70PAS  ONLINE       0     0     6
	    ata-HGST_HDS724040ALE640_PK2334PBG4H0BT  ONLINE       0     0 13.0K
	    ata-HGST_HDS724040ALE640_PK1334PBG9Y2TS  ONLINE       0     0 33.4K
	    ata-HGST_HDS724040ALE640_PK2334PBG4GTMT  ONLINE       0     0 25.2K

errors: No known data errors
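
For comparing scrubs over time, the per-device CKSUM column can be pulled out of this output with a short script. The following is a minimal sketch, not part of the original post: the function names are made up for illustration, and it assumes that `zpool status` abbreviates large counters with binary suffixes (K = 1024), so the expanded numbers are approximate.

```python
import re

# Assumption: zpool abbreviates counters with binary suffixes (K = 1024).
_SUFFIX = {"": 1, "K": 1024, "M": 1024**2, "G": 1024**3, "T": 1024**4}

# Matches rows such as "  ata-HGST_...  ONLINE  0  0  19.3K".
_ROW = re.compile(
    r"^\s*(?P<name>\S+)\s+(?P<state>[A-Z]+)\s+"
    r"(?P<read>[\d.]+[KMGT]?)\s+(?P<write>[\d.]+[KMGT]?)\s+"
    r"(?P<cksum>[\d.]+[KMGT]?)\s*$"
)


def parse_count(text):
    """Expand an abbreviated counter such as '19.3K' into an integer."""
    m = re.fullmatch(r"([\d.]+)([KMGT]?)", text)
    if m is None:
        raise ValueError(f"not a counter: {text!r}")
    return round(float(m.group(1)) * _SUFFIX[m.group(2)])


def cksum_by_device(status_text):
    """Map each pool, vdev, and disk name to its CKSUM count."""
    counts = {}
    for line in status_text.splitlines():
        m = _ROW.match(line)
        if m:
            counts[m.group("name")] = parse_count(m.group("cksum"))
    return counts


# A few rows copied from the output above.
SAMPLE = "\n".join([
    "config:",
    "",
    "\tNAME                                         STATE     READ WRITE CKSUM",
    "\ttank                                         ONLINE       0     0     0",
    "\t  raidz2-1                                   ONLINE       0     0     0",
    "\t    ata-HGST_HDS724040ALE640_PK1334PBG70PAS  ONLINE       0     0     6",
    "\t    ata-HGST_HDS724040ALE640_PK1334PBG9Y2TS  ONLINE       0     0 33.4K",
])

if __name__ == "__main__":
    for name, count in cksum_by_device(SAMPLE).items():
        if count:
            print(name, count)
```

Saving the parsed counts after each scrub makes it easy to see whether the errors follow particular disks or are spread across everything behind one controller.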

Hardware:
Mobo: supermicro X8DTL-iF
CPU: 2x Xeon E5530
Ram: 48GB ECC
HD: 12x Hitachi 4TB sata
HBAs: 2x M1015 flashed to IT
Case: Supermicro 2U, redundant psu, backplane

Software:
OS: Ubuntu 14.04 amd64 Server
ZFS: ZoL 0.6.3

This is the first time I've run into any issues with data integrity, but the fact that every drive is showing errors makes little sense to me. A few days ago I saw my first errors, which were corrected. I ran an additional scrub and all drives showed 0 read, write, or cksum errors. Now, a few days later and after another scrub, they are back! Any help in debugging this would be greatly appreciated. Thanks!

danswartz · Jul 2, 2014

Memory? Controller? Some other infrastructure issue?

bleomycin · Jul 2, 2014

danswartz said:
Memory? Controller? Some other infrastructure issue?

I have 2x M1015s, so it seems really unlikely to me that both of them started to act up at the same time. I started extended SMART tests on all of the drives, but the results won't be available until tomorrow. It's looking like it may be HBA related though: I found these 3 lines in the syslog, and nothing else.

Code:

kernel: [62345.427368] mpt2sas0: log_info(0x31080000): originator(PL), code(0x08), sub_code(0x0000)
kernel: [62345.427607] mpt2sas0: log_info(0x31080000): originator(PL), code(0x08), sub_code(0x0000)
kernel: [62345.427613] mpt2sas0: log_info(0x31080000): originator(PL), code(0x08), sub_code(0x0000)
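
The fields printed after each `log_info` value are just bit slices of the 32-bit word. The sketch below shows how they can be recovered by hand. The layout (top 4 bits bus type, next 4 bits originator, then an 8-bit code and a 16-bit sub-code) and the originator names are assumptions based on how the Linux mpt2sas driver prints these messages; the function name is made up for illustration.

```python
# Assumed originator numbering, matching the names the driver prints.
ORIGINATORS = {0: "IOP", 1: "PL", 2: "IR"}


def decode_log_info(value):
    """Split a 32-bit mpt2sas log_info word into its assumed fields."""
    return {
        "bus_type": (value >> 28) & 0xF,    # bits 31-28
        "originator": ORIGINATORS.get((value >> 24) & 0xF, "unknown"),  # bits 27-24
        "code": (value >> 16) & 0xFF,       # bits 23-16
        "sub_code": value & 0xFFFF,         # bits 15-0
    }


if __name__ == "__main__":
    fields = decode_log_info(0x31080000)
    print(fields["originator"], hex(fields["code"]), hex(fields["sub_code"]))
```

For `0x31080000` this gives the same originator (PL), code (0x08), and sub-code (0x0000) that the kernel already prints, which is a useful sanity check on the assumed layout. It does not say what code 0x08 means; that still has to come from the vendor's documentation or posts like the one linked below.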

A quick Google search for similar errors brings up this: http://serverfault.com/questions/407703/deciphering-continuing-mpt2sas-syslog-messages

I forgot to mention in the original post that ZFS also failed 7 disks the other day. A reboot of the system and a zpool clear -f brought it all back online though.

MarkL · Jul 2, 2014

What were the logs from the other day regarding the failing of the disks? I'd look at those as well and consider it all to be related.

SirMaster · Jul 3, 2014

I would ask this in #zfsonlinux on freenode

You will likely get the attention of ZoL developers who want to help debug and figure it out.

bleomycin · Jul 3, 2014

Thanks guys, I'm going to poke around a bit more before I trouble the ZFS devs. I found this blog post which talks about the exact error I have: http://blog.disksurvey.org/blog/2014/03/27/sata-handling-of-medium-errors-log-info-0x0x31080000/

It's very strange that this is just now showing up after running for years. I'm wondering if a recent kernel/software upgrade in Ubuntu is the cause of this? I've collected all logs that mention mpt2sas, and there has been a lot of activity starting Jun 29th: https://www.zerobin.net/?f123907cd2ca05ca#nTjS5pFQbRADvXCWRzJ3/hMjt43Qfw2Fjkyo/+h8bYw=

mpt2sas0 seems to be the one throwing all of the errors, perhaps it is having trouble? mpt2sas1 is mostly quiet without errors.
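
A quick way to back up the "mpt2sas0 is noisy, mpt2sas1 is quiet" impression is to tally the messages per controller. This is a rough sketch only: it assumes the syslog lines look like the three quoted earlier in the thread, and the function name is invented for illustration.

```python
import re
from collections import Counter

# Assumed line shape: "... mpt2sas0: log_info(0x31080000): originator(PL), ..."
_MSG = re.compile(r"(mpt2sas\d+): log_info\((0x[0-9a-fA-F]+)\)")


def count_log_info(lines):
    """Count log_info messages per (controller, log_info value) pair."""
    counts = Counter()
    for line in lines:
        m = _MSG.search(line)
        if m:
            counts[(m.group(1), m.group(2).lower())] += 1
    return counts


# The three syslog lines quoted earlier in the thread.
SAMPLE_LINES = [
    "kernel: [62345.427368] mpt2sas0: log_info(0x31080000): originator(PL), code(0x08), sub_code(0x0000)",
    "kernel: [62345.427607] mpt2sas0: log_info(0x31080000): originator(PL), code(0x08), sub_code(0x0000)",
    "kernel: [62345.427613] mpt2sas0: log_info(0x31080000): originator(PL), code(0x08), sub_code(0x0000)",
]

if __name__ == "__main__":
    for (controller, info), n in count_log_info(SAMPLE_LINES).most_common():
        print(controller, info, n)
```

Feeding it the full collected log instead of the sample would show whether one HBA accounts for nearly all of the errors, which is worth knowing before swapping cables or controllers.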

I'm not able to decipher exactly what is going on, but it doesn't look good. A kernel upgrade was available for Ubuntu, so I applied that and rebooted. I'm running another scrub now along with long SMART tests to see what happens. Luckily I have backups if things go too sideways...

cookiesowns · Jul 3, 2014

I'm swaying towards a hardware/software fault.

Have you tried power cycling the machine? What about checking the firmware and driver revisions on your M1015s?

Rectal Prolapse · Jul 3, 2014

Power supply starting to fail, causing random errors over the bus and/or memory?

Are the drives being written to frequently? If so, and your power supply is dying, this could be very very bad.

Years ago I had a power supply with bad capacitors (the infamous Antec Phantom with the notorious leaky caps) and over the course of a month it would report random checksum errors. What you saw very closely mirrors my experience.
