Rescuing most of the data from a broken tar.bz2 with bzip2recover and some cunning.
I had a bzip2ed tarball from 2005 lying around, for which bzip2 threw a CRC error.
libv@machine:~$ tar -jxvf ../backup-2005.tar.bz2
backup/
backup/file0.txt
backup/file1.pdf
backup/file2.bin

bzip2: Data integrity error when decompressing.
	Input file = (stdin), output file = (stdout)

It is possible that the compressed file(s) have become corrupted.
You can use the -tvv option to test integrity of such files.

You can use the `bzip2recover' program to attempt to recover
data from undamaged sections of corrupted files.

tar: Unexpected EOF in archive
tar: Unexpected EOF in archive
tar: Error is not recoverable: exiting now

I then ran bzip2recover, and this produced 328 files called:
rec00001backup-2005.tar.bz2
rec00002backup-2005.tar.bz2
..
rec00327backup-2005.tar.bz2
rec00328backup-2005.tar.bz2

There is surprisingly little information out there on how to handle this. One stackoverflow commenter suggested "guessing" the size of the undecompressable block, which sounds a bit unscientific.
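For what it's worth, bzip2recover itself works by scanning for the 48-bit magic number 0x314159265359 that starts every compressed block. Those six bytes happen to be the printable string "1AY&SY", so a byte-wise grep can find some boundaries too, with the big caveat that real bzip2 blocks are bit-aligned rather than byte-aligned, which is exactly why bzip2recover exists. A rough sketch of the idea (demo file name made up):

```shell
# Compress something small, then look for byte-aligned block magics.
# "BZh9" occupies bytes 0-3, so the first block magic sits at offset 4.
printf 'hello' | bzip2 > demo.bz2
grep -abo '1AY&SY' demo.bz2
```

grep's -a treats the binary as text, -b prints byte offsets, and -o prints only the match.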
So the next step was to run bunzip2 on all of them, which broke with:
bunzip2: Data integrity error when decompressing.
	Input file = rec00051backup-2005.tar.bz2, output file = rec00051backup-2005.tar

It is possible that the compressed file(s) have become corrupted.
You can use the -tvv option to test integrity of such files.

You can use the `bzip2recover' program to attempt to recover
data from undamaged sections of corrupted files.

bunzip2: Deleting output file rec00051backup-2005.tar, if it exists.
bunzip2: WARNING: some files have not been processed:
bunzip2: 328 specified on command line, 277 not processed yet.

There does not seem to be a flag for bunzip2 to ignore the CRC and soldier on, flagging the block accordingly. This 900kB block should be considered lost. The rest of the data is still valid though, so we need to trick tar into thinking that there is valid data where the breakage was. Sadly, bzip2 does not store the actual uncompressed block size, just an approximation (in this case a "9", for 900kB blocks).
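To find out exactly which of the recovered pieces fail their CRC, bzip2's -t flag tests each piece without writing any output. A quick sketch, using the file names bzip2recover produced above:

```shell
# Test every recovered piece; print the names of the ones failing CRC.
for f in rec*backup-2005.tar.bz2; do
    bzip2 -t "$f" 2>/dev/null || echo "broken: $f"
done
```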
I then bunzipped all the other files, catted the good .tar pieces from before the broken block together into one file, and the good pieces from after it into another.
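In shell form, and given that piece 51 was the one broken block here, that amounts to something like the following sketch (the first.tar/second.tar names are my own):

```shell
# Decompress all pieces; the broken one (rec00051...) fails and is skipped.
for f in rec*backup-2005.tar.bz2; do bunzip2 -k "$f" 2>/dev/null; done

# Concatenate the good pieces from before and after the breakage.
cat rec000{01..50}backup-2005.tar > first.tar
cat rec00{052..328}backup-2005.tar > second.tar
```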
So I quickly read up on the tar header format. It's 512-byte aligned, with the first 100 bytes the name of the file or directory, null padded. All of the other relevant information (mode, uid, gid, size, mtime, checksum, the "ustar" magic, owner names) follows in the rest of the 512-byte header:
00000000  62 61 63 6b 75 70 2f 00  00 00 00 00 00 00 00 00  |backup/.........|
00000010  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
00000060  00 00 00 00 30 30 30 30  37 35 35 00 30 30 30 30  |....0000755.0000|
00000070  30 30 30 00 30 30 30 30  30 30 30 00 30 30 30 30  |000.0000000.0000|
00000080  30 30 30 30 30 30 30 00  31 30 30 32 37 31 37 35  |0000000.10027175|
00000090  33 35 34 00 30 31 32 31  33 30 00 20 35 00 00 00  |354.012130. 5...|
000000a0  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
00000100  00 75 73 74 61 72 20 20  00 6c 69 62 76 00 00 00  |.ustar  .libv...|
00000110  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
00000120  00 00 00 00 00 00 00 00  00 6c 69 62 76 00 00 00  |.........libv...|
00000130  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
00000140  00 00 00 00 00 00 00 00  00 30 30 30 30 30 30 30  |.........0000000|
00000150  00 30 30 30 30 30 30 30  00 00 00 00 00 00 00 00  |.0000000........|
00000160  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
00000200  62 61 63 6b 75 70 2f 66  69 6c 65 30 2e 74 78 74  |backup/file0.txt|
00000210  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
00000220  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
00000260  00 00 00 00 30 30 30 30  37 35 35 00 30 30 30 30  |....0000755.0000|
00000270  30 30 30 00 30 30 30 30  30 30 30 00 30 30 30 31  |000.0000000.0001|
00000280  35 36 37 31 30 30 30 00  31 30 32 34 32 31 30 31  |5671000.10242101|
00000290  30 34 30 00 30 31 35 30  37 35 00 20 30 00 00 00  |040.015075. 0...|
000002a0  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
00000300  00 75 73 74 61 72 20 20  00 6c 69 62 76 00 00 00  |.ustar  .libv...|
00000310  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
00000320  00 00 00 00 00 00 00 00  00 6c 69 62 76 00 00 00  |.........libv...|
00000330  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
00000340  00 00 00 00 00 00 00 00  00 30 30 30 30 30 30 30  |.........0000000|
00000350  00 30 30 30 30 30 30 30  00 00 00 00 00 00 00 00  |.0000000........|
00000360  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
00000400  54 68 69 73 20 69 73 20  79 6f 75 72 20 64 61 74  |This is your dat|

You can see here that the string "00015671000" at 0x27C is the file size that tar expects for file0.txt. This is octal, so you need to run:
echo $((8#00015671000))
3633664

While there is no tar-specific magic at the very start of the header, I was able to search the hexdump for the "ustar" magic, which sits at offset 257 of each header.
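Both tricks are easy to check against a freshly made tar: the name field starts at offset 0 of the header, the size field at offset 124 (0x7C), and the "ustar" magic at offset 257. A small sketch, with a made-up demo file:

```shell
# Build a tiny tar and read its header fields at the offsets used above.
printf 'hello' > file0.txt          # 5 bytes of content
tar -cf demo.tar file0.txt

# Name field: offset 0, up to 100 null-padded bytes.
dd if=demo.tar bs=1 count=100 2>/dev/null | tr -d '\0'; echo

# Size field: offset 124, 12 bytes of octal.
size=$(dd if=demo.tar bs=1 skip=124 count=12 2>/dev/null | tr -d '\0 ')
echo $((8#$size))

# The "ustar" magic sits at offset 257 of every header, so grep -abo
# (binary as text, byte offsets, only the match) locates headers in a blob.
grep -abo 'ustar' demo.tar | head -n 1
```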
I then found that I had gotten lucky, and that a big file was spanning the broken block. Its tar header was at 0x607800 in the first blob:
00607800  62 61 63 6b 75 70 2f 66  69 6c 65 32 2e 62 69 6e  |backup/file2.bin|
00607810  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
00607860  00 00 00 00 30 30 30 30  37 35 35 00 30 30 30 30  |....0000755.0000|
00607870  30 30 30 00 30 30 30 30  30 30 30 00 30 31 30 34  |000.0000000.0104|
00607880  34 32 36 34 30 30 30 00  31 30 32 34 30 32 31 37  |4264000.10240217|
00607890  35 33 34 00 30 32 30 35  30 32 00 20 30 00 00 00  |534.020502. 0...|
006078a0  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
00607900  00 75 73 74 61 72 20 20  00 6c 69 62 76 00 00 00  |.ustar  .libv...|
00607910  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
00607920  00 00 00 00 00 00 00 00  00 6c 69 62 76 00 00 00  |.........libv...|
00607930  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
00607940  00 00 00 00 00 00 00 00  00 30 30 30 30 30 30 30  |.........0000000|
00607950  00 30 30 30 30 30 30 30  00 00 00 00 00 00 00 00  |.0000000........|
00607960  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
00607a00  64 61 74 61 64 61 74 61  64 61 74 61 64 61 74 61  |datadatadatadata|

Because this is in the first blob, before the corruption, everything is still nicely 0x200 aligned. Our file's data begins at 0x607A00 with a file size of 0x8916800 (01044264000 octal -- which is surprisingly 0x200 aligned as well -- is this a filesystem thing?). The first blob's size is 0x2AF3B17 bytes. So the amount of data missing for this file is:
(0x607A00 + 0x8916800) - 0x2AF3B17 = 0x642A6E9

The second blob was slightly more difficult, as there the tar headers are no longer 0x200 aligned. I got unlucky searching the hexdump for "ustar", as the string crossed a 16-byte boundary, but grep gave me a byte offset to look for in the hexdump:
0634eb50  00 00 00 00 00 00 00 00  00 00 00 00 62 61 63 6b  |............back|
0634eb60  75 70 2f 66 69 6c 65 33  2e 73 68 00 00 00 00 00  |up/file3.sh.....|
0634eb70  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*

I then of course made the error of using 0x634EB50 instead of 0x634EB5C (where the filename, and thus the header, actually starts) in my calculation. But on the second attempt I calculated:
0x642A6E9 - 0x634EB5C = 0xDBB8D

So I dd'ed the required amount of padding into a file, catted the three files together, and tar was perfectly happy. I did null out the broken file, and renamed it appropriately.
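Putting the numbers together: the padding needed is what the file is still missing after the end of the first blob, minus what the second blob already holds of it (everything up to file3.sh's header). A sketch of the check and of the final splice, with first.tar and second.tar standing in for the good pieces:

```shell
# Data of file2.bin missing after the end of the first blob...
missing=$(( (0x607A00 + 0x8916800) - 0x2AF3B17 ))
# ...minus what the second blob still holds of it, up to file3.sh's header.
pad=$(( missing - 0x634EB5C ))
printf '0x%X\n' "$pad"

# Then splice it all back together (assuming the pieces from earlier):
#   dd if=/dev/zero of=padding.bin bs=1 count="$pad"
#   cat first.tar padding.bin second.tar > rescued.tar
```

The offsets are of course specific to this particular archive.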
It is not likely that I will need this again any time soon, but if I do, I hope I know where to find the info now.
