As the one responsible for the backups at work, I never compromised on anything.
As a SysAdmin you must:
- Backup
- Backup some more
- Test your backups
Shoot to maim
At first we mainly needed to back up our Subversion repository. A pretty easy task for any SysAdmin.
What I would do is simply dump the repository at night and scp it to the workstations of two other developers in the company (I didn’t really have much choice in terms of other computers on the network).
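A minimal sketch of such a nightly job, written in Python rather than as the shell script I actually used. The function names, the dated file name and the `host:/directory/` targets are all made up for illustration; only `svnadmin dump` and `scp` come from the setup described above.

```python
import subprocess
from datetime import date
from pathlib import Path


def dump_path(dump_dir, day):
    """Name the dump file after the day it was taken (hypothetical naming scheme)."""
    return Path(dump_dir) / f"svn-{day.isoformat()}.dump"


def scp_commands(dump_file, targets):
    """Build one scp command line per target of the form 'host:/directory/'."""
    return [["scp", "-q", str(dump_file), target] for target in targets]


def nightly_backup(repo, dump_dir, targets, day=None):
    """Dump the repository to a dated file, then copy it to every target."""
    dump_file = dump_path(dump_dir, day or date.today())
    with open(dump_file, "wb") as out:
        # 'svnadmin dump' writes the dump stream to stdout, so redirect it to the file.
        subprocess.run(["svnadmin", "dump", "--quiet", repo], stdout=out, check=True)
    for command in scp_commands(dump_file, targets):
        # check=True makes a failed copy raise instead of passing silently.
        subprocess.run(command, check=True)
    return dump_file
```

Called from cron as something like `nightly_backup("/srv/svn/repo", "/var/backups", ["dev1:/backup/", "dev2:/backup/"])`, where the paths and host names are placeholders. Any failing step raises, so the job cannot quietly produce a half-finished backup.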
It worked.
The golden goose is on the loose
After a while I managed to convince our R&D manager that it was time for detachable backups. Detachable backups can save you if the building catches fire or if someone decides to fire a rocket at your building (unlikely even in Israel, but as a SysAdmin you never take any chances).
With the virtual threat of a virtual rocket that might incinerate all of our important information, we decided that the cheapest and most effective course of action was to purchase a tape drive and a few tapes. Mind you, the year was 2006 and portable HDs were expensive, uncommon and small.
Backing up to a tape has always been something on my TODO list that I had to tick.
During one of my previous jobs we had a tape archive that took care of it transparently; it was managed by a different team. Ever since, I had been yearning for that totally sequential /dev/tape.
Very soon I discovered that these tapes are a plain headache:
- It is a mess in general to deal with the tapes as the access is sequential
- It’s slow!!!
- The only reasonable way to deal with the backups is with dump – it’s an archaic tool and works only on the extX filesystem family!
- It takes a while to eject the tape. I can still remember the minutes of waiting in the server room for the tape to eject so I could deposit it at home
- The tapes tend to break! Like any tape, the film tends to run away from the bearings, rendering the tape useless
Too bad our backup was far too large to fit on a DVD.
The glamour, the fortune, the pain
The tape backup served us well for more than 2 years. I was so happy the solution was robust enough to keep us running for that long without any major changes.
But portable USB HDs became cheaper and larger and it was time for a change. I was excited to receive two brand new and shiny 500GB HDs. I diligently worked on a new backup script. A backup script that would not be dependent on the filesystem type (hell! I wanted to use JFS!), a backup script that would keep snapshots going weeks back! A backup script that would rule them all!
This backup script will hopefully be published in one of my next posts.
I felt like the king of the world. Backups became easy, and I was much more confident with the new backup as the files could easily be seen on the mounted HD, in contrast to the sequential tape and the binary filesystem dump.
I ran the backups manually during the day. I inspected them carefully and was pleased.
It was time for the backup to take place at night. And so it was.
From time to time I would get in the backup log:
Input/output error
At first I didn’t pay much attention.
WTF?! Are my HDs broken?! No way, they are brand new and it happened on both of them. But dmesg also showed some nasty messages while accessing the HDs.
I started to trigger the backups manually during the day. Not a single error.
Backups went back to night time.
In the morning I would issue an ls:
# ls /media/backup
Input/output error
# ls /media/backup
daily.1 daily.2 daily.3 daily.4 daily.5 daily.6 daily.7
What the hell is going on around here?! The first command fails but the second succeeds?
The first command also used to lag for a while, whereas the second breezed through. Only later did I discover that this was a key hint.
My backup creates many hard links using “cp -al” (in order to preserve snapshots). I speculated that I might be messing up the filesystem structure with too many links to the same inode. Unlikely, but I was shooting in all directions; I was clueless.
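For readers unfamiliar with the hard-link snapshot trick, here is a rough Python sketch of the rotation step. It is not my actual backup script; the `daily.N` names follow the directory listing above, while the function name and the `keep` parameter are invented for the example.

```python
import os
import shutil


def rotate_snapshots(root, keep=7):
    """Shift daily.1 -> daily.2 -> ... and hard-link a fresh daily.1 from the old one."""
    oldest = os.path.join(root, f"daily.{keep}")
    if os.path.isdir(oldest):
        shutil.rmtree(oldest)  # the oldest snapshot falls off the end
    for n in range(keep - 1, 0, -1):
        src = os.path.join(root, f"daily.{n}")
        if os.path.isdir(src):
            os.rename(src, os.path.join(root, f"daily.{n + 1}"))
    newest = os.path.join(root, "daily.1")
    previous = os.path.join(root, "daily.2")
    if os.path.isdir(previous):
        # Like a hard-linking recursive copy: directories are recreated,
        # regular files become extra links to the same inodes.
        shutil.copytree(previous, newest, copy_function=os.link, symlinks=True)
    else:
        os.makedirs(newest)  # very first run: nothing to link from yet
    return newest
```

After the rotation, the sync step only has to replace the files that changed in `daily.1`. Unchanged files cost no extra disk space, because every snapshot points at the same inode. The sync tool must replace changed files rather than overwrite them in place, otherwise the older snapshots change too.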
So there I went, simplifying the backups and eliminating the snapshot functionality. Guess what? Still errors on the backup.
What do I do next? Do I stay up at night just to witness the problem in real time?! Don’t laugh, a friend of mine actually had to do that once on another occasion.
By this time I had already presented the issue to all of my fellow SysAdmin friends. None of them had any idea. I can’t blame them.
I was frustrated. Even the archaic tape backups worked better than the HDs. Is newer always better? Perhaps not.
I recreated the filesystem on the portable HDs as ext3 instead of JFS. Maybe JFS is buggy?
I’ll save you the trouble. JFS is far from being buggy, and it also had nothing to do with crond.
We’ll show the unbelievers
For days I’d watch the nightly email the backup would produce, notice the failure and rerun it manually during the day. Until one day.
It struck me like lightning on a sunny day.
The second command would always succeed on the device. What if this HD is a little tired?
What if the portable HD goes to sleep and is having problems waking up?
It’s worth trying.
# sdparm --set=STANDBY=0 --save /dev/sdb
What do you say? – It worked!
It appears that some USB HDs go to sleep and don’t wake up nicely when they should.
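If disabling standby is not an option, the “first access fails, second succeeds” pattern suggests a workaround: touch the disk, and retry if the first attempt returns an I/O error. The following Python sketch illustrates that idea only; I did not run it in production, and the function name, retry count and delay are arbitrary assumptions.

```python
import errno
import os
import time


def list_with_wakeup(path, attempts=3, delay=5.0, listdir=os.listdir, sleep=time.sleep):
    """List a directory, retrying when an access fails with EIO.

    listdir and sleep are injectable so the retry logic can be tried without real hardware.
    """
    for attempt in range(1, attempts + 1):
        try:
            return listdir(path)
        except OSError as exc:
            # Only "Input/output error" is treated as a sleepy disk;
            # anything else, or the final failed attempt, is re-raised.
            if exc.errno != errno.EIO or attempt == attempts:
                raise
            sleep(delay)  # give the drive time to spin up
```

Calling `list_with_wakeup("/media/backup")` at the top of the backup script would absorb the first failed access. It hides the symptom rather than fixing the drive, so fixing the standby setting is still the cleaner solution.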
Should I file a bug about it? Was it the hardware that malfunctioned?
I was so happy this issue was solved – I never cared about either.
Maybe after crafting this post – it is time to care a little more though.
As the madmen play on words and make us all dance to their song…
I’m sitting at my desk, receiving the nightly email informing me the backup was successful. The portable HDs now also utilize an encrypted filesystem. The backup never fails.
I look at my watch, drink a glass of wine and rejoice.
Introduction
Landing in a new startup company has its pros and cons.
The pros being:
- You can do almost whatever you want
The cons:
- You have to do it from scratch!
The Developers
Linux developers are not dumb. They can’t be. If they were dumb, they couldn’t have developed anything on Linux. They might have been called developers on some other platforms.
I was confronted quite early with the question:
“Am I, as a SysAdmin, going to give those Linux developers root access on their machines?”
Why not:
- They can cause a mess and break their system in a second.
Take a fellow developer (the chowner) who ran:
# chown -R his_username:his_group *
He came to me saying “My Linux workstation stopped working well!!!”
Later on I also discovered he was at / when performing this command! 🙂
In his defence he added: “But I stopped the command quickly, right after I saw the mistake!”
- And there’s no number 2. I think this is the only main reason, given that these are people I generally trust.
Why yes:
- They’ll bother me less with small things such as mounting/umounting media.
- If they need to perform any other administrative action – they’ll learn from it.
- Heck, it’s their own workstation, if they really want, they’ll get root access, so who am I to play god with them?
Choosing the latter and letting the developers rejoice in root access on their machines, I had to take some proactive measures to avoid the unwanted situations I might encounter.
Installation
Your flavor of installation should be idempotent: the user may destroy his workstation, yet a reinstall brings it back to the same state.
Let’s take for example the chowner developer. His workstation was ruined. I never even considered restoring the permissions to their originals. It would have caused much more trouble in the long run than any good.
We reinstalled his workstation and after 15 minutes he was happily developing again.
Automatic network installations are very easy to implement on Linux today. If you don’t have one, you must be living in medieval times.
I can give you one suggestion though about partitioning – make sure your developers have a /home on a different partition. It’ll be easier when reinstalling to preserve /home and remove all the rest.
Consolidating software
I consider installing non-packaged software on Linux a very dirty action.
The reasons for that are:
- You can’t uninstall it using standard ways
- You can’t upgrade it using standard ways
- You can’t keep track of it
In addition to installing packaged software, you must also have all your workstations and servers synchronize against the same software repositories.
If user A installs software from repository A and user B from repository B, they might run into different behavior on their software.
Have you ever heard: “How come it works on my computer and doesn’t work on yours??”
As a SysAdmin, you must reduce the chance of this happening to zero.
How do you do it?
Well, using CentOS – use a YUM repository and cache whatever packages you need from the various internet repositories out there.
Debian? – just the same – just with apt.
Remember – if you have any software on workstations that is not well packaged or not well controlled – you’ll run into awkward situations very soon.
Today
To this day the Linux developers in my company still possess their root access, but they barely use it. To be honest I don’t think they even really need it. However, they have it. It is also about educating the developers that they are given root access because they are trusted. If they blow it, it’s mostly their fault, not yours.
I’ll continue to let them be root when needed. They have proved worthy so far.
And I’ll ask you another question: do you really think that someone who can’t handle his own workstation can be a good developer? Think again!
The following took place more than a year ago, but it is still fresh in my mind. After a few colleagues urged me to write about it, I finally decided to do it. If the output of the commands does not exactly match what I had while dealing with it, bear with me. It’s far from being the point.
The horror
I took a break from work last year and decided to go and have some fun in NZ. Oh, did I have fun there!
There’s nothing more frustrating than returning to work, turning on your dusty computer and witnessing the following:
*** An error occurred during the filesystem check.
Give root password for maintenance (or type Control-D to continue):
Investigating a bit more, I came to the conclusion that my /home was not mounting. OMG!!! All of my personal customizations and some private data are in /home!
I must admit, it’s nothing that couldn’t be reproduced in a reasonable amount of time, but with my neat KDE customizations at stake I didn’t want to start the process from the beginning. Think about losing your own /home, it’s no fun. I decided I wanted it back.
OK, so I’m running e2fsck:
# e2fsck /dev/sda5
e2fsck 1.39 (29-May-2006)
e2fsck: No such file or directory while trying to open /dev/sda5
The superblock could not be read or does not describe a correct ext2
filesystem. If the device is valid and it really contains an ext2
filesystem (and not swap or ufs or something else), then the superblock
is corrupt, and you might try running e2fsck with an alternate superblock:
e2fsck -b 8193
#
Oh man, how frustrating, e2fsck can’t read my superblock!
Something I did notice during boot was that the HD was very noisy, in addition to the very slow boot process. It wasn’t a new HD and it had worked hard (I had tested our video recording FS infrastructure on it numerous times). Probably its time had come.
I wanted to feel at /home again.
I decided the smartest thing would be to first try to copy the whole partition aside, as I knew there was a hardware problem with the HD. Once that was solved, I could hopefully handle the missing superblock problem much better.
Getting physical
So I quickly inserted a fresh new HD into my machine, disconnected the old faulty HD (its defects made the computer boot very slowly) and issued a network install.
15 minutes later I was back in Linux, with a bare new /home and the faulty HD connected again, slowing the computer down like hell.
I was sure dd would come to the rescue:
# dd if=/dev/sda5 of=home-sweet-home.ext3 bs=4M
After a few minutes of anticipation and cranky HD noises, I’m with:
dd: reading `/dev/sda5': Input/output error
0+0 records in
0+0 records out
# ls -l /home/home-sweet-home.ext3
-rw-r--r-- 1 root root 0 Jul 10 08:26 home-sweet-home.ext3
Great :(. I searched the net for an aggressive dd program, something that instead of giving up on bad sectors would fill them with zeros and carry on (hoping the defects on the HD were confined to a very specific place). I must admit I almost wrote something myself, but finally I found dd_rescue.
And off we go:
# dd_rescue -e 1 -A /dev/sda5 /home/home-sweet-home.ext3
It ran for hours! It was 65GB that dd_rescue had to tackle. With a dying HD that could take a lot of time. After more or less 8 hours I was back at my desktop, looking at my old home:
# ls -lh /home/home-sweet-home.ext3
-rw-r--r-- 1 root root 61G Jul 10 20:43 home-sweet-home.ext3
#
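The “aggressive dd” I almost wrote boils down to one loop: read a block, and if the read fails, write zeros in its place and move on. Below is a small Python sketch of that idea. It is not how dd_rescue is implemented and it lacks everything that makes the real tool good (adaptive block sizes, retries, logging); the function name and parameters are made up for illustration.

```python
def rescue_copy(src, dst, size, block_size=4096):
    """Copy `size` bytes from src to dst, writing zeros for unreadable blocks.

    src and dst are binary file objects; src must support seek().
    Open a real device unbuffered (buffering=0) so one bad sector
    does not poison a larger read-ahead buffer.
    Returns the number of blocks that could not be read.
    """
    bad_blocks = 0
    offset = 0
    while offset < size:
        want = min(block_size, size - offset)
        try:
            src.seek(offset)
            chunk = src.read(want)
        except OSError:
            # Unreadable block: substitute zeros so later offsets stay aligned.
            chunk = b"\0" * want
            bad_blocks += 1
        if not chunk:
            break  # unexpected end of input
        dst.write(chunk)
        offset += len(chunk)
    return bad_blocks
```

Keeping the offsets aligned is the whole point: a filesystem image with a few zeroed blocks can still be checked and mounted, while an image with missing blocks has everything after the gap shifted and is useless.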
Being logical
OK, that’s it, I have my data. Time to dump the old HD and deal with the logical errors I still have with this partition dump. Mounting the partition gave me the same result as I pasted above: no superblock – no fun!
Oh! but ext3 always creates a few backup superblocks, maybe this is my lucky day where I will finally be able to use one of these backups. You are probably familiar with the following output:
# mke2fs /tmp/e2fsck-test
...
27 block groups
8192 blocks per group, 8192 fragments per group
1992 inodes per group
Superblock backups stored on blocks:
8193, 24577, 40961, 57345, 73729, 204801
Writing inode tables: done
Writing superblocks and filesystem accounting information: done
...
Now go figure out where your backup superblocks are. Trying the obvious 8193 and 32768 did not work for me. I knew there should be more backup superblocks. Google came to the rescue again. At this point, too, I was quite close to writing a small C program that would search the partition dump for ext3 superblock signatures and tell me where the backup superblocks are. But then again, I figured I was probably not the first one to need such a utility, and here TestDisk came to the rescue.
I simply ran TestDisk, which revealed the remaining trustworthy superblocks on my damaged filesystem.
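For the curious, the signature search I had in mind is short. Here is a sketch in Python rather than C, relying on the documented ext2/3 superblock layout: the magic number 0xEF53 sits 56 bytes into the superblock, the block size exponent at byte 24 and the block group number at byte 90. This is not what TestDisk does internally, and the function name is made up.

```python
import struct

EXT2_MAGIC = 0xEF53
LOG_BLOCK_SIZE_OFFSET = 24  # s_log_block_size: block size is 1024 << value
MAGIC_OFFSET = 56           # s_magic
GROUP_NR_OFFSET = 90        # s_block_group_nr: group this copy belongs to


def find_superblocks(image, step=1024):
    """Yield (byte_offset, block_size, group_number) for each superblock candidate.

    image is a seekable binary file object; candidates are probed every `step` bytes.
    """
    offset = 0
    while True:
        image.seek(offset)
        raw = image.read(1024)
        if len(raw) < 1024:
            break  # reached the end of the image
        (magic,) = struct.unpack_from("<H", raw, MAGIC_OFFSET)
        (log_size,) = struct.unpack_from("<I", raw, LOG_BLOCK_SIZE_OFFSET)
        # The plausibility check on the block size weeds out random 0xEF53 hits.
        if magic == EXT2_MAGIC and log_size <= 6:
            (group,) = struct.unpack_from("<H", raw, GROUP_NR_OFFSET)
            yield offset, 1024 << log_size, group
        offset += step
```

Dividing a reported byte offset by 1024 gives the position in the 1 KiB units that mount’s sb= option counts in. Probing a 61G image in 1 KiB steps from Python is slow, so treat this as an illustration rather than a replacement for TestDisk.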
Later on I discovered that it is also possible to run mke2fs on a partition of the same size and see where the superblocks get written. However, I think that probing for the superblocks is much cleaner altogether.
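The backup locations can also be computed by hand. With the sparse_super feature, which I assume here because it is what the listing of backup superblocks earlier in this post reflects, copies live in block group 1 and in groups whose numbers are powers of 3, 5 and 7. A small Python sketch, with made-up function names:

```python
def is_power_of(n, base):
    """True if n is base**k for some k >= 0."""
    while n > 1 and n % base == 0:
        n //= base
    return n == 1


def backup_superblock_blocks(group_count, blocks_per_group, first_data_block):
    """Filesystem block numbers of the backup superblocks (sparse_super layout).

    first_data_block is 1 for 1 KiB blocks and 0 for larger block sizes.
    """
    groups = [
        g for g in range(1, group_count)
        if g == 1 or any(is_power_of(g, base) for base in (3, 5, 7))
    ]
    return [first_data_block + g * blocks_per_group for g in groups]
```

For the example above (27 block groups, 8192 blocks per group, 1 KiB blocks) `backup_superblock_blocks(27, 8192, 1)` yields 8193, 24577, 40961, 57345, 73729 and 204801, the same numbers as in the listing. On a filesystem with 4 KiB blocks and 32768 blocks per group, group 27 lands on block 884736.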
# mkdir /home2
# mount -o loop,ro,sb=884736 /home/home-sweet-home.ext3 /home2
#
Silence.
Did it work?
# ls -l /home2
drwxr-xr-x 41 dan dan 8192 Feb 5 13:39 dan
drwx------ 7 distcc distcc 88 Jul 26 15:24 distcc
drwxrwxrwx 2 nobody nobody 1 Feb 5 05:39 public
#
Wow, it did!
So how much of it was eventually damaged? – less than 1%!
So I’ve found a few garbled files I didn’t need anyway, but I was more than 99% back at /home.
Epilogue
Needless to say, ever since I’ve been backing up my /home in a strict manner. But this is obviously not the point.
The simplicity of Linux in general and ext2/3 in particular is something we should adore. I wouldn’t want to imagine what would have happened on a different OS or a different filesystem (and please don’t start a flame war about it now…).