Sunday, December 23, 2012

SMART errors and software RAID

For a while now I have been having an intermittent problem with some of my drives. They seem to be working but stop responding to SMART. If I reboot, they no longer show up, but if I power off and back on, they work fine.

I have two 640 GB WD Green drives in a software RAID1 and another WD Blue drive. After a suspend/resume, sometimes one of the drives is MIA. The kernel thinks it is still there, but my SMART tools can't talk to it and start to complain.

The Blue drive has a bad (unreadable) sector but, touch wood, that has not caused a problem yet. SMART knows about this and tells me about it frequently. This, however, does not appear to be the problem. There is a telling message in dmesg:

sd 0:0:0:0: [sda] START_STOP FAILED
sd 0:0:0:0: [sda] Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
PM: Device 0:0:0:0 failed to resume: error 262144

So for some reason, the disk is not responding when the computer resumes. I guess it is a timeout (and I wonder if I can extend it?) Anyway, now I have the situation where my disk is not working even though I know there is nothing actually wrong with it.
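As a rough aid to reading that message: the number in the PM line looks like a packed SCSI result word, which in kernels of this era is laid out as driver byte, host byte, message byte and status byte from the top down. The sketch below unpacks it under that assumption; the function name decode_scsi_result is made up for illustration.

```shell
# Hypothetical helper: split a SCSI result word into its four bytes.
# Assumes the layout driver<<24 | host<<16 | msg<<8 | status.
decode_scsi_result() {
    result=$1
    printf 'driver=0x%02x host=0x%02x msg=0x%02x status=0x%02x\n' \
        $(( (result >> 24) & 0xff )) \
        $(( (result >> 16) & 0xff )) \
        $(( (result >> 8) & 0xff )) \
        $(( result & 0xff ))
}

decode_scsi_result 262144
# prints: driver=0x00 host=0x04 msg=0x00 status=0x00
```

Host byte 0x04 is DID_BAD_TARGET in the kernel's SCSI headers, which matches the hostbyte shown in the Result line. A plain command timeout would normally show up as DID_TIME_OUT (0x03) instead, so the timeout guess may not be the whole story.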

Luckily I am using software RAID and the other disk is working, so I can continue about my business without crashing or losing data. After poking and prodding a few different things I have worked out a solution:

  • Hot-remove the device (from the kernel)
echo 1 > /sys/block/sda/device/delete
  • Rescan for the device to hot-add it to the kernel
echo "- - -" > /sys/class/scsi_host/host0/scan
  • Add the 'failed' drive back into the RAID set
mdadm /dev/md127 --re-add /dev/sda2
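The three steps above could be wrapped in a small function, sketched here. The function name readd_disk, the DRY_RUN switch and the wait loop for the partition node are assumptions added for illustration, not part of the original procedure. By default it only prints what it would do; run it as root with DRY_RUN=0 to actually touch the disk.

```shell
# Hypothetical helper: hot-remove a disk, rescan its host, re-add it to md.
# Usage: readd_disk <disk> <scsi_host> <md_device> <member_partition>
readd_disk() {
    disk=$1; host=$2; md=$3; part=$4

    # Dry run (the default): print the commands instead of running them.
    if [ "${DRY_RUN:-1}" = 1 ]; then
        echo "echo 1 > /sys/block/$disk/device/delete"
        echo "echo '- - -' > /sys/class/scsi_host/$host/scan"
        echo "mdadm $md --re-add $part"
        return 0
    fi

    # Step 1: drop the stale device so the rescan can reuse its name.
    echo 1 > "/sys/block/$disk/device/delete" || return 1

    # Step 2: rescan all channels, targets and LUNs on this host.
    echo "- - -" > "/sys/class/scsi_host/$host/scan" || return 1

    # Assumption: give the kernel up to 10 seconds to recreate the node.
    i=0
    while [ ! -b "$part" ] && [ "$i" -lt 10 ]; do
        sleep 1
        i=$((i + 1))
    done
    [ -b "$part" ] || return 1

    # Step 3: put the partition back into the array.
    mdadm "$md" --re-add "$part"
}

# Example (dry run): readd_disk sda host0 /dev/md127 /dev/sda2
```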

You must remove the existing (sda) device first or the disk will be re-detected and added with a new name (sde in my case).

Because I have a write-intent bitmap, the RAID set knows what has changed since the drive was failed, so only the changes need to be re-synced, which is quite fast.
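Whether an array has a write-intent bitmap can be read from /proc/mdstat, where it appears as a "bitmap:" line under the array's entry. The sketch below checks for that line; the function name has_bitmap is made up for illustration, and the parsing assumes the usual mdstat layout of an "mdX : ..." line followed by indented detail lines and a blank line.

```shell
# Hypothetical helper: succeed if the named array lists a bitmap.
# Reads mdstat-formatted text on stdin.
has_bitmap() {
    awk -v md="$1" '
        $1 == md && $2 == ":"       { in_array = 1; next }  # start of our array
        in_array && NF == 0         { in_array = 0 }        # blank line ends it
        in_array && $1 == "bitmap:" { found = 1 }
        END { exit !found }
    '
}

# Example:
#   has_bitmap md127 < /proc/mdstat && echo "md127 has a bitmap"
# An internal bitmap can usually be added to an existing array with:
#   mdadm --grow --bitmap=internal /dev/md127
```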

There seems to be a 'vibe' that green drives are not good for RAID. I don't really think this is a problem because the drive is green; I think it is a problem because the driver is not trying hard enough to restart the disk.

So in the end this was not a SMART problem after all. Not that there are no bugs to fix there, particularly in udisks-helper-ata-smart-collect, which keeps running and locking up, sending the load average into the hundreds. For a tool designed to detect error conditions, it probably needs a bit more work.

My next job is to select a replacement drive for the faulty WD Blue...


