Sunday, December 23, 2012

SMART errors and software RAID

For a while now I have been having an intermittent problem with some of my drives. They seem to be working but stop responding to SMART. If I reboot they don't show up any more but if I power off and back on they work fine.

I have two 640 GB WD Green drives in a software RAID1 and another WD Blue drive. After a suspend/resume, sometimes one of the drives is MIA. The kernel thinks it is still there but my SMART tools can't talk to it and start to complain.

The Blue drive has a bad (unreadable) sector but, touch wood, that has not caused a problem yet. SMART knows about this and tells me about it frequently. This, however, does not appear to be the problem. There is a telling message in dmesg:

sd 0:0:0:0: [sda] START_STOP FAILED
sd 0:0:0:0: [sda] Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
PM: Device 0:0:0:0 failed to resume: error 262144

So for some reason, the disk is not responding when the computer resumes. I guess it is a timeout (and I wonder if I can extend it). Anyway, now I have the situation where my disk is not working even though I know there is nothing actually wrong with it.

Luckily I am using software RAID and the other disk is working, so I can continue about my business without crashing or losing data. After poking and prodding a few different things I have worked out a solution:

  • Hot-remove the device (from the kernel)
echo 1 > /sys/block/sda/device/delete
  • Rescan for the device to hot-add it to the kernel
echo "- - -" > /sys/class/scsi_host/host0/scan
  • Add the 'failed' drive back into the RAID set
mdadm /dev/md127 --re-add /dev/sda2

You must remove the existing (sda) device first or the disk will be re-detected and added with a new name (sde in my case).
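
The three steps above can be wrapped in a small script. This is only a sketch using the device names from my machine (sda, host0, md127, sda2); yours will differ. The DRY_RUN switch and the five second pause are my own additions for illustration, not something the kernel or mdadm requires. By default it only prints what it would do.

```shell
#!/bin/sh
# Sketch: bring back a disk that failed to resume and re-add it to the array.
# Device names are the ones from this post; override them via the environment.
DISK="${DISK:-sda}"            # kernel name of the unresponsive disk
HOST="${HOST:-host0}"          # SCSI host the disk is attached to
ARRAY="${ARRAY:-/dev/md127}"   # md array that lost a member
MEMBER="${MEMBER:-/dev/sda2}"  # partition to put back into the array
DRY_RUN="${DRY_RUN:-1}"        # hypothetical switch: 1 = only print the plan

# Print the three recovery commands in the order they must run.
plan() {
    printf 'echo 1 > /sys/block/%s/device/delete\n' "$DISK"
    printf 'echo "- - -" > /sys/class/scsi_host/%s/scan\n' "$HOST"
    printf 'mdadm %s --re-add %s\n' "$ARRAY" "$MEMBER"
}

if [ "$DRY_RUN" = 1 ]; then
    plan
else
    # Needs root. Remove the stale device first so the disk keeps its name.
    echo 1 > "/sys/block/$DISK/device/delete"
    echo "- - -" > "/sys/class/scsi_host/$HOST/scan"
    sleep 5   # assumed pause to let the disk reappear; adjust as needed
    mdadm "$ARRAY" --re-add "$MEMBER"
fi
```

Run it once as is to check the printed commands, then as root with DRY_RUN=0 to actually do it.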

Because I have a write-intent bitmap, the RAID set knows what has changed since the drive was failed, so only the changes need to be re-synced, which is quite fast.
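
If you are not sure whether your array has a write-intent bitmap, /proc/mdstat shows a "bitmap:" line under any array that has one. Here is a rough check, assuming the usual /proc/mdstat layout; has_bitmap is just a helper name I made up, and md127 is my array.

```shell
#!/bin/sh
# has_bitmap ARRAY [MDSTAT_FILE]
# Succeeds if the stanza for ARRAY (e.g. md127) contains a "bitmap:" line.
has_bitmap() {
    awk -v name="$1" '
        $2 == ":"                   { in_array = ($1 == name); next }  # stanza header
        NF == 0                     { in_array = 0 }                   # blank line ends it
        in_array && $1 == "bitmap:" { found = 1 }
        END                         { exit (found ? 0 : 1) }
    ' "${2:-/proc/mdstat}"
}

if [ -r /proc/mdstat ]; then
    if has_bitmap md127; then
        echo "md127 has a write-intent bitmap"
    else
        echo "md127 has no write-intent bitmap"
    fi
fi
```

If there is none, mdadm --grow --bitmap=internal /dev/md127 should add an internal one. Without a bitmap, a re-added drive gets a full resync instead of a quick catch-up.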

There seems to be a 'vibe' that green drives are not good for RAID. I don't really think this is a problem because the drive is green; I think it is a problem because the driver is not trying hard enough to restart the disk.

So in the end this was not a SMART problem after all. Not that there are no bugs to fix there, particularly in udisks-helper-ata-smart-collect, which keeps running and locking up, sending the load average into the hundreds. For a tool designed to detect error conditions it probably needs a bit more work.

My next job is to select a replacement drive for the faulty WD Blue...

