Hardisk | Bashing Linux

Being the one that is responsible for the backups at work – I never compromised on anything.
As a SysAdmin you must:

Backup
Backup some more
Test your backups

Shoot to maim

At first we mainly needed to backup our subversion repository. A pretty easy task for any SysAdmin.
What I would do is simply dump the repository at night, and scp it to two other workstations of developers in the company (I didn’t really have much of a choice in terms of other computers in the network).
It worked.

The golden goose is on the loose

After a while I managed to convince our R&D manager it is time for detachable backups. Detachable backups can save you in case the building is on fire or if someone decides to shoot a rocket on your building (unlikely even in Israel, but as a SysAdmin – never take any chances).
With the virtual threat of a virtual rocket that might incinerate all of our important information, we decided that the cheapest and most effective way of action is to purchase a tape drive and a few tapes. Mind you, the year is 2006 and portable HDs are expensive, uncommon and small.

Backing up to a tape has always been something on my TODO list that I had to tick.
During one of my previous jobs, we had a tape archive that took care of it transparently, it was managed by a different team. Ever since I had always had yearnings for the /dev/tape that’s totally sequential.
It was very soon that I’d discovered that these tapes are a plain headache:

It is a mess in general to deal with the tapes as the access is sequential
It’s slow!!!
The only reasonable way to deal with the backups is with dumpe2fs – it’s an archaic tool it’s archaic and work only on the extX filesystem family!
It takes a while to eject the tape, I can still remember the minutes of waiting in the servers room for the tape to eject, so I can deposit it at home
The tapes tend to break! like any tape, the film tends to run away from the bearings, rendering the tape useless

Too bad our backup was far from being able to fit on a DVD media.

The glamour, the fortune, the pain

The tape backup held us good for more than 2 years. I was so happy the solution was robust enough to keep us running for that much time without the need of any major changes.
But portable USB HDs became cheaper and larger and it was time for a change. I was excited to receive two brand new and shiny 500GB HDs. I diligently worked on a new backup script. A backup script that would not be dependant on the filesystem type (hell! i wanted to use JFS!), a backup script that would have snapshots weeks back! a backup script that would rule them all!
This backup script will hopefully be published in one of my next posts.
I felt like king of the world, backups became easy, I was much more confident with the new backup as the files could be seen on the mounted HD easily, in contrast to the sequential tape and the binary filesystem dump.
Backups ran manually by me during the day. I inspected them carefully and was pleased.
It was time for the backup to take place at night. And so it was.

From time to time I would get in the backup log:
Input/output error

At first I didn’t pay much attention.
WTF?! are my HDs broken?! – no way, they are brand new and it happened on both of them. But dmesg also showed some nasty information while accessing the HDs.
I started to trigger the backups manually at day time. Not a single error.

Backups went back to night time.
At the morning I would issue a ls:

# ls /media/backup
Input/output error
# ls /media/backup
daily.1 daily.2 daily.3 daily.4 daily.5 daily.6 daily.7

What the hell is going on around here?! – first command fails but the second succeeds?
First command also used to lag for a while, where the second breezed out. I discovered only later it was a key hint.

My backup creates many links using “cp -aL” (in order to preserve snapshots), I had a speculation I might be messing the filesystem structure with too many links to the same inode – unlikely, but I was shooting at all directions, I was clueless.

So there I go, easing the backups up and eliminating the snapshot functionality. Guess what? – still errors on the backup.

What do I do next? Do I stay up at night just to witness the problem in real time?!, Don’t laugh, a friend of mine actually had to do it once in other occasions.
At this time I already introduced this issue to all of my fellow SysAdmin friends. None of them had any idea. I can’t blame them.
I was frustrated, even the archaic tape backups worked better than the HDs, is newer always better? – perhaps not.
I recreated the filesystem on the portable HDs as ext3 instead of JFS, maybe JFS is buggy?
I’ll save you the trouble. JFS is far from being buggy and it had also nothing to do with crond.

We’ll show the unbelievers

For days I’d watch the nightly email the backup would produce, notice the failure and rerun it manually during the day. Until one day.
It had struck me like a lightning on a sunny day.

The second command would always succeed on the device. What if this HD is a little tired?
What if the portable HD goes to sleep and is having problems waking up?
It’s worth trying.

# sdparm --set=STANDBY=0 /dev/sdb
# sdparm --save /dev/sdb

What do you say? – It worked!
It appears that some USB HDs go to sleep and doesn’t wake up nicely when they should.
Should I file a bug about it? Was it the hardware that malfunctioned?
I was so happy this issue was solved – I never cared about either.
Maybe after crafting this post – it is time to care a little more though.

As the madmen play on words and make us all dance to their song…

I’m sitting at my desk, receiving the nightly email informing me the backup was successful. The portable HDs now also utilize an encrypted filesystem. The backup never fails.
I look at my watch, drink a glass of wine and rejoice.

The following took place more than a year ago, but it is still fresh in my mind. After a few colleagues urged me to write about it, I decided to finally do it. If the output of the commands does not match exactly whatever I had while dealing with it – bear with it. It’s far from being the point.

The horror

I took a break from work last year and decided to go and have some fun in NZ. Oh, did I have fun there!
There’s nothing more frustrating than returning to work, turning on your dusty computer and witnessing the following:

*** An error ocfcurred during the filesystem check.
Give root password for maintenance (or type Control-D to continue):

Investigating it just a bit more I got to a conclusion that my /home is not mounting. OMG!!! all of my personal customizations and some private data is at /home!
I must admit, it’s nothing that couldn’t be reproduced at a reasonable amount of time, but having my neat KDE customizations I didn’t want to start the process from the beginning. Think about yourself losing your /home, it’s no fun. I decided I wanted it back.
OK, so I’m running e2fsck:

# e2fsck /dev/sda5
e2fsck 1.39 (29-May-2006)
e2fsck: No such file or directory while trying to open /dev/sda5

The superblock could not be read or does not describe a correct ext2
filesystem.  If the device is valid and it really contains an ext2
filesystem (and not swap or ufs or something else), then the superblock
is corrupt, and you might try running e2fsck with an alternate superblock:
    e2fsck -b 8193 

#

Oh man, how frustrating, e2fsck can’t read my superblock!
Something I did notice during boot up, is that the HD is very noisy, in addition to very slow boot process. It wasn’t a new HD and it has worked hard (I tested on it numerous times our FS infrastructure of video recording). Probably it’s time has come.

I wanted to feel at /home again.
I decided the smartest thing would be to first try and copy this whole partition aside as I knew there is a hardware problem with the HD. After I’ll solve that one, I could hopefully handle the missing superblock problem much better.

Getting physical

So I quickly inserted a fresh new HD to my machine, disconnected the old faulty HD (it caused the computer to boot so slow because of it’s defects) and issued a network install.
15 minutes later I’m again at linux, with a bare new /home, and the faulty HD connected to it, slowing the computer as hell.
I was sure dd would come for the rescue:

# dd if=/dev/sda5 of=home-sweet-home.ext3 bs=4M

After a few minutes of anticipation and cranky HD noises, I’m with:

dd: reading `/dev/sda5': Input/output error
0+0 records in
0+0 records out
# ls -l /home/home-sweet-home.ext3
-rw-r--r-- 1 root root 0 Jul  10 08:26 home-sweet-home.ext3

Great :(. I’m searching the net, searching for an aggressive dd program, something that instead of giving up on bad sectors, would fill them with zeros and continue on (hoping the defects on the HD are at a very specific place). I must admit I have almost written something by myself, but finally I’ve found dd_rescue.
And off we go:

# dd_rescue -e 1 -A /dev/sda5 /home/home-sweet-home.ext3

It ran for hours! It was 65GB that dd_rescue had to tackle. With a dying HD that could take a lot of time. After more or less 8 hours I was back at my desktop, looking at my old home:

# ls -lh /home/home-sweet-home.ext3
-rw-r--r--   1 root   root        61G Jul  10 20:43 home-sweet-home.ext3
#

Being logical

OK, that’s it, I have my data. Time to dump the old HD and deal with the logical errors I still have with this partition dump. Mounting the partition gave me the same result as I pasted above: no superblock – no fun!
Oh! but ext3 always creates a few backup superblocks, maybe this is my lucky day where I will finally be able to use one of these backups. You are probably familiar with the following output:

# e2fsck /tmp/e2fsck-test
...
27 block groups
8192 blocks per group, 8192 fragments per group
1992 inodes per group
Superblock backups stored on blocks:
        8193, 24577, 40961, 57345, 73729, 204801

Writing inode tables: done
Writing superblocks and filesystem accounting information: done
...

Now go figure out where your backup superblocks are. Trying the obvious of 8193 and 32768 did not work for me. I knew there should be more backup superblocks. Google comes for the rescue again. I was quite close at this time as well to writing a small C program that would search the partition dump for ext3 superblock signatures and tell me where the backup superblocks are. But then again, I thought I’m probably not the first one who needs such a utility, here TestDisk came for the rescue.
I simply ran TestDisk which reveled the remaining trustworthy superblocks on my damaged filesystem.
Later on I discovered that it is also possible to run e2fsck on a partition with the same size and see where the superblocks get written. However, I think that probing the superblocks is much cleaner altogether.

# mkdir /home2
# mount -o loop,ro,sb=884736 /home/home-sweet-home.ext3 /home2
#

Silence.

Did it work?

# ls -l /home2
drwxr-xr-x  41 dan    dan    8192 Feb  5 13:39 dan
drwx------   7 distcc distcc   88 Jul 26 15:24 distcc
drwxrwxrwx   2 nobody nobody    1 Feb  5 05:39 public
#

Wow, it did!
So how much of it was eventually damaged? – less than 1%!
So I’ve found a few garbled files I didn’t need anyway, but I was more than 99% back at /home.

Epilogue

Needless to say that ever since I’m backing up my /home in a strict manner. But this is obviously not the point.
The simplicity of Linux in general and ext2/3 in specific is something we should adore. I wouldn’t want to imagine what would have happened on a different OS or a different filesystem (and please don’t start a flame war about it now…).

Bashing Linux Linux system administration, programming and everything that goes in between…

Archive for the ‘Hardisk’ Tag

Backups… all night? 1 comment