John's Linux Blog: Storage

Showing posts with label Storage. Show all posts

Monday, October 21, 2013

Large filesystem on CentOS-6

I had the pleasure of testing out a new server which came with 10 x 4Tb drives. Configured on and HP Smart Array P420 in RAID 50 that gave me a 32Tb drive to use.

I installed CentOS-6 but the installer would not allow me to use the full amount of space. It turns out the maximum ext4 filesystem on CentOS-6 is 16Tb. This is the limit of 32bit block addressing and 4k blocks. (2³² * 4k = 2³² * 2¹²= 2⁴⁴=2⁴⁰ * 2⁴ = 16T)

That did not make for a very exciting test. Only half the disk available was available in a single filesystem. Sure I could create two logical volumes and then create two filesystems but that seemed a bit like a DOS solution.

Some research turned up the options of using ext4 48bit block addressing. This enables a filesysem up to 1Eb in size and allows for future 64bit addressing for an even larger limit.

The catch was of course that 48bit addressing is not supported by the tools which come with CentOS-6.

Fedora 20 (rawhide) does come with the required updates to e2fsprogs (e2fsprogs-1.42.8-3) which enable 48bit addressing (somehow future-proofed by being called 64bit addressing). I then set out to rebuild the fedora 20 rpm for CentOS-6. Amazingly the compile was clean though I did have to disable some tests which is a tad alarming. All in all it did not take long to produce a set of replacement rpms for e2fsprogs.

After installing the new tools I was able to create a new filesystem larger than 16Tb. I started at 17Tb to give me room to play. It seems that the CentOS-6 kernel does already have support for 48bit addressing. I ran a number of workloads and I could not find any problems. Still, I don't know what bugs may be lurking in there.

dumpe2fs shows the new filesystem feature '64bit'.

I also attempted to do an offline resize of the filesystem to the maximum size of my disk. This just worked as expected. Online resize is not available until a much later kernel version. I did not attempt this because there are know bugs.

The last limit I wanted to test was the file size limit. Even on my new filesystem the individual file size limit is 16Tb. It is not every day that I get to make one of them so I did
dd if=/dev/zero of=big bs=1M count=$(expr 16 \* 1024 \* 1024)

At ~500Mb/s it still took 9 hours to complete.
And it turns out that the maximum file size is actually 17592186040320 and not 17592186044416. That is 4k (one block) short of 16Tb. (You can actually check that on any machine with the command truncate big --size 17592186040321 )

That also raised an issue which was a problem on ext3. How long does it take to delete a large file? Well this is the largest possible file and it took 13 minutes.

In conclusion, the 16Tb filesystem limit is easliy raised on CentOS-6 but it comes at the expense of using untested tools and kernel features. Although I did not find any problems in my testing, this could pose a substantial risk if you have over 16Tb of data which you do not want to loose.

If anyone is interested in my rpms you can get them from http://www.chrysocome.net/download

Tuesday, March 26, 2013

scp files with spaces in the filename

scp is a great tool. Built to run over ssh it maintains a good unix design. It does however cause a few problems with it's nonchalant handling of file names.

For example, transferring a file which has a space in the name causes problems because of the amount of escaping required to get the space to persist on the remote side of the connection.

Copying files from local to remote is easy enough, just quote or escape the file name using your shell.
scp Test\ File remote:
scp 'Test File' remote:
scp "Test File" remote:

Copying files from remote to local can be more tricky
scp remtote:Test\ File .
scp: Test: No such file or directory
scp: File: No such file or directory

scp stimulus:"Test File" .
scp: Test: No such file or directory
scp: File: No such file or directory

scp "stimulus:Test File" .
scp: Test: No such file or directory
scp: File: No such file or directory

The escaping is working on your local machine but the remote is still splitting the name at the space.

To solve that, we need an extra level of escape, one for the remote server. Remember that your shell will eat the \ so you need a double \\ to send it to the remote
scp remote:Test\\\ File .
Test File 100% 0 0.0KB/s 00:00

You can also combine the remote escape with local quoting
scp remote:"Test\ File" .
or
scp "remote:Test\ File" .

Best solution of all, avoid spaces in file names. If you can't avoid it, I find the easiest solution is to replace ' ' with '\\\ '.

Sunday, December 23, 2012

SMART errors and software RAID

For a while now I have been having an intermittent problem with some of my drives. They seem to be working but stop responding to SMART. If I reboot they don't show up any more but if I power off and back on they work fine.

I have two 640Gig WD Green drives in a software RAID1 and another WD Blue drive. After a suspend/resume, sometimes one of the drives is MIA. The kernel thinks it is still there but my SMART tools can't talk to it and start to complain.

The Blue drive has a bad (unreadable) sector but touch wood, that has not caused a problem yet. SMART knows about this and tells me about it frequently. This however does not appear to be the problem. There is a telling message is in dmesg:

sd 0:0:0:0: [sda] START_STOP FAILED

sd 0:0:0:0: [sda] Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK

PM: Device 0:0:0:0 failed to resume: error 262144

So for some reason, the disk is not responding when the computer resumes. I guess it is a timeout (and I wonder if I can extend it?) Anyway, now I have the situation where my disk is not working even though I know there is nothing actually wrong with it.

Luckily I am using software RAID and the other disk is working so I can continue about my business without crashing or loosing data. After poking and prodding a few different things I have worked out a solution:

Hot-remove the device (from the kernel)

echo 1 > /sys/block/sda/device/delete

Rescan for the device to hot-add it to the kernel

echo "- - -" > /sys/class/scsi_host/host0/scan

Add the 'failed' drive back into the RAID set

mdadm /dev/md127 --re-add /dev/sda2 

You must remove the existing (sda) device first or the disk will be re-detected and added with a new name (sde in my case).

Because I have a write-intent bitmap, the RAID set knows what has changed since the drive was failed and only the changes must be re-synced which is quite fast.

There seems to be a 'vibe' that green drives are not good for RAID. I don't really think this is a problem because the drive is green, I think it is a problem because the driver is not trying hard enough to restart the disk.

So in the end this was not a SMART problem after all. Not there there are no bugs to fix there. Particularly in udisks-helper-ata-smart-collect which keeps running and locking up sending the load average into the hundreds. For a tool designed to detect error conditions it probably needs a bit more work.

My next job is to select a replacement drive for the faulty WD Blue...

Tuesday, December 4, 2012

FlashCache

For some time I have been using FlashCache on my PC at home. Of course I did it the hard way because I wanted it seamlessly integrated at boot time so I could use the cache on my root filesystem.

My hard work seems to have paid off and I have now officially published the fruits of it at http://chrysocome.net/dracut-flashcache

Now CentOS-6 users can get started with FlashCache with (hopefully) a minimum of fuss.

Just as soon as I get time to upgrade my machine at work I will be running it there too.