Repairing Corrupted ZIP Files by Brute Force Scanning in C

My brother is a wonderful photographer, and took 14 gigabytes of photos at my recent graduation from Columbia, some of which I hope to post on PhotoFloat -- my web 2.0 photo gallery done right via static JSON & dynamic JavaScript.

He was kind enough to upload a ZIP of the RAW (Canon Raw 2 - CR2) photos to my FTP server overnight from his killer 50 Mbps pipe. The next day, he left for a long period of traveling.

I downloaded the ZIP archive, eager to start playing with the photographs, learning about RAW photos, trying out tools like dcraw, lensfun, and ufraw, and seeing if I could forge Canon's "Original Decision Data" tags. To my dismay, the ZIP file was corrupted. I couldn't ask my brother to re-upload it or rsync the changes or anything like that, because he was traveling and it had already been a great burden for him to upload these in the nick of time. I tried zip -F and zip -FF and even a few Windows shareware tools. Nothing worked. So I decided to write my own tool, using nothing more than the official PKZIP spec and man pages.

First, a bit about how ZIP files are structured -- everything here is based on the famous official spec in APPNOTE.TXT. ZIP files are laid out like this:

[local file header 1]
[file data 1]
[data descriptor 1]
. 
.
.
[local file header n]
[file data n]
[data descriptor n]
[archive decryption header] 
[archive extra data record] 
[central directory]
[zip64 end of central directory record]
[zip64 end of central directory locator] 
[end of central directory record]

Generally, unzippers seek to the central directory at the end of the file, which has the locations of all the files in the zip, along with their sizes and names. They read this in, then seek back up to the top to read the files off one by one.
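To make that concrete, here is a minimal sketch of how a reader might locate the end of central directory record and pull the central directory offset out of it. It assumes the whole archive is in memory and ignores zip64 entirely; find_eocd and central_directory_offset are hypothetical names for illustration, not code from any real unzipper.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define EOCD_SIG      0x06054b50u  /* end of central directory signature */
#define EOCD_MIN_SIZE 22u          /* record size with an empty comment */
#define MAX_COMMENT   0xFFFFu      /* comment length is a 16-bit field */

/* Decode 4 little-endian bytes, the byte order ZIP uses on disk. */
static uint32_t eocd_le32(const unsigned char *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* Scan backward from the end of the archive for the EOCD signature.
 * Returns its offset, or -1 if it is not found (hypothetical helper). */
static long find_eocd(const unsigned char *buf, size_t len)
{
    assert(buf != NULL);
    if (len < EOCD_MIN_SIZE)
        return -1;
    size_t i = len - EOCD_MIN_SIZE;
    size_t lowest = (i > MAX_COMMENT) ? i - MAX_COMMENT : 0;
    for (;;) {
        if (eocd_le32(buf + i) == EOCD_SIG)
            return (long)i;
        if (i == lowest)
            return -1;
        i--;
    }
}

/* The central directory's start offset sits 16 bytes into the record. */
static uint32_t central_directory_offset(const unsigned char *eocd)
{
    return eocd_le32(eocd + 16);
}
```

A reader built this way trusts the offsets stored in the central directory completely, which is why an archive with bytes inserted in the middle falls apart.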

The strange thing about my brother's broken file was that the beginning files would work and the end files would work, but the middle 11 gigabytes were broken, with Info-ZIP complaining about wrong offsets and lseeks. I figured that some data had been duplicated/reuploaded at random spots in the middle, so the offsets in the zip file's central directory were broken.

For each file, however, there is a local file header and an optional data descriptor. Each local file header starts with the same signature (0x04034b50), and contains the file name and the size of the file data that follows it. But sometimes the size is not known until the file has already been written into the zip file, in which case the local file header reports "0" for the file size and sets bit 3 in its general purpose bit flag. This indicates that after the file data, of unknown length, there will be a data descriptor that gives the file size. But how do we know where the file ends if we don't know the length beforehand? Well, usually this data is duplicated in the central directory at the end of the zip file, but I wanted to avoid parsing that altogether. Instead, it turns out that the data descriptor usually carries a signature of its own, even though the format did not originally assign it one. APPNOTE.TXT states, "Although not originally assigned a signature, the value 0x08074b50 has commonly been adopted as a signature value for the data descriptor record. Implementers should be aware that ZIP files may be encountered with or without this signature marking data descriptors and should account for either case when reading ZIP files to ensure compatibility. When writing ZIP files, it is recommended to include the signature value marking the data descriptor record." Bingo.
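As an illustration of the header fields described above, here is a minimal sketch that decodes the fixed 30-byte part of a local file header from a byte buffer and checks bit 3. It reads each field byte by byte rather than overlaying a struct, which is one possible approach and not necessarily what my tool does; local_header_info, parse_local_header, and sizes_deferred are names made up for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define LOCAL_HEADER_SIG     0x04034b50u
#define LOCAL_HEADER_SIZE    30u        /* fixed part, before name and extra field */
#define FLAG_DATA_DESCRIPTOR (1u << 3)  /* bit 3 of the general purpose bit flag */

/* Hypothetical in-memory view of the fields a recovery tool cares about. */
struct local_header_info {
    uint16_t flags;
    uint16_t compression;       /* 0 = stored, 8 = deflated */
    uint32_t crc32;
    uint32_t compressed_size;
    uint32_t uncompressed_size;
    uint16_t name_len;
    uint16_t extra_len;
};

static uint16_t read_le16(const unsigned char *p)
{
    return (uint16_t)(p[0] | (p[1] << 8));
}

static uint32_t read_le32(const unsigned char *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* Returns 1 and fills *out if buf starts with a local file header, else 0. */
static int parse_local_header(const unsigned char *buf, size_t len,
                              struct local_header_info *out)
{
    assert(buf != NULL && out != NULL);
    if (len < LOCAL_HEADER_SIZE || read_le32(buf) != LOCAL_HEADER_SIG)
        return 0;
    out->flags             = read_le16(buf + 6);
    out->compression       = read_le16(buf + 8);
    out->crc32             = read_le32(buf + 14);
    out->compressed_size   = read_le32(buf + 18);
    out->uncompressed_size = read_le32(buf + 22);
    out->name_len          = read_le16(buf + 26);
    out->extra_len         = read_le16(buf + 28);
    return 1;
}

/* Nonzero if the sizes live in a trailing data descriptor (bit 3 set). */
static int sizes_deferred(const struct local_header_info *h)
{
    return (h->flags & FLAG_DATA_DESCRIPTOR) != 0;
}
```

The file name starts right after these 30 bytes, and the file data starts name_len + extra_len bytes after that.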

So the recovery algorithm works like this:

1. Look for the local file header signature by reading 4 bytes as an integer, rewinding 3 bytes each time the match fails.
2. Once found, check whether the header contains the size. If it does, read that many bytes of data and write them out to the file path.
3. If the size isn't there, search for the data descriptor signature, again reading 4 bytes and rewinding 3 each time the match fails.
4. When found, rewind to the start of the data segment and read the number of bytes specified in the data descriptor.
5. Rewind to 4 bytes after the local file header signature and repeat the process.
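The scanning part of those steps can be sketched as follows. This version works on an in-memory buffer and slides a 4-byte window forward one byte at a time, which is the equivalent of reading 4 bytes and rewinding 3 on a file. The function names are hypothetical, and the check that the descriptor's recorded compressed size matches the distance scanned is an extra sanity test added for this example to reject signature bytes that merely happen to occur inside compressed data.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define LOCAL_HEADER_SIG    0x04034b50u
#define DATA_DESCRIPTOR_SIG 0x08074b50u
#define DESCRIPTOR_SIZE     16u  /* signature, crc32, compressed, uncompressed */

/* Decode 4 little-endian bytes, the byte order ZIP uses on disk. */
static uint32_t le32_at(const unsigned char *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* Returns the offset of the first occurrence of sig at or after start,
 * or -1 if there is none. */
static long find_signature(const unsigned char *buf, size_t len,
                           size_t start, uint32_t sig)
{
    assert(buf != NULL);
    for (size_t i = start; i + 4 <= len; i++) {
        if (le32_at(buf + i) == sig)
            return (long)i;
    }
    return -1;
}

/* Finds the data descriptor for file data beginning at data_start.
 * A candidate is accepted only if the compressed size it records equals
 * the number of bytes between data_start and the candidate itself. */
static long find_data_descriptor(const unsigned char *buf, size_t len,
                                 size_t data_start)
{
    size_t pos = data_start;
    for (;;) {
        long hit = find_signature(buf, len, pos, DATA_DESCRIPTOR_SIG);
        if (hit < 0 || (size_t)hit + DESCRIPTOR_SIZE > len)
            return -1;
        if (le32_at(buf + hit + 8) == (uint32_t)((size_t)hit - data_start))
            return hit;
        pos = (size_t)hit + 1;  /* false positive: keep scanning */
    }
}
```

Restarting the header search just past the previous signature, instead of past the end of the extracted file, is what lets the scan pick up headers that a wrong size field would otherwise skip over.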

The files may optionally be deflated, so I use zlib inline to inflate them. The advantage is that inflation fails on a malformed stream, which gives me some verification for free, so I don't use zip's crc32 (though I should).
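For reference, the CRC check that I skip would look roughly like this. It is a minimal bitwise sketch of the CRC-32 that ZIP stores in its headers, written out by hand so that it stands alone; a tool that already links zlib would more naturally call zlib's own crc32() function. The names zip_crc32 and crc_matches are made up for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* CRC-32 as used by ZIP: reflected polynomial 0xEDB88320, initial value
 * and final XOR of 0xFFFFFFFF. Bitwise for brevity, so it is slow. */
static uint32_t zip_crc32(const unsigned char *data, size_t len)
{
    assert(data != NULL || len == 0);
    uint32_t crc = 0xFFFFFFFFu;
    for (size_t i = 0; i < len; i++) {
        crc ^= data[i];
        for (int bit = 0; bit < 8; bit++) {
            /* If the low bit is set, shift and fold in the polynomial. */
            crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
        }
    }
    return crc ^ 0xFFFFFFFFu;
}

/* Compare the CRC of the extracted (uncompressed) bytes against the value
 * recorded in the local file header or data descriptor. */
static int crc_matches(const unsigned char *data, size_t len,
                       uint32_t expected)
{
    return zip_crc32(data, len) == expected;
}
```

The CRC in a ZIP entry covers the uncompressed data, so the check has to run on the output of inflate, not on the bytes read from the archive.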

Along the way there is some additional tricky logic for making sure we're always searching with maximum breadth.

The end result of all this was... 100% recovery of the files in the archive, complete with their full file names. Win.

You can check out the code here. Suggestions are welcome. It's definitely a quick hack, but it did the job. It took a lot of fiddling to make it work, especially figuring out __attribute__((packed)) to turn off gcc's alignment padding of struct fields.
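To show why the packed attribute matters, here is a small sketch of a local file header struct declared with and without it. The field names are my own labels for the fields in APPNOTE.TXT, and the struct is written for this example rather than copied from the tool. On common ABIs such as x86-64, the unpacked version gets 2 bytes of padding in front of the crc32 field, so it no longer lines up with the 30 bytes on disk.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Without the attribute, the compiler may insert padding so that each
 * uint32_t field starts on a 4-byte boundary. */
struct local_header_padded {
    uint32_t signature;
    uint16_t version;
    uint16_t flags;
    uint16_t compression;
    uint16_t mod_time;
    uint16_t mod_date;
    uint32_t crc32;             /* on disk this sits at offset 14 */
    uint32_t compressed_size;
    uint32_t uncompressed_size;
    uint16_t name_len;
    uint16_t extra_len;
};

/* With the attribute (a gcc/clang extension), fields are laid out back to
 * back, matching the on-disk layout byte for byte. */
struct local_header_packed {
    uint32_t signature;
    uint16_t version;
    uint16_t flags;
    uint16_t compression;
    uint16_t mod_time;
    uint16_t mod_date;
    uint32_t crc32;
    uint32_t compressed_size;
    uint32_t uncompressed_size;
    uint16_t name_len;
    uint16_t extra_len;
} __attribute__((packed));

/* Runtime check that the packed layout matches the spec's 30-byte header. */
static void check_header_layout(void)
{
    assert(sizeof(struct local_header_packed) == 30);
    assert(offsetof(struct local_header_packed, crc32) == 14);
    assert(offsetof(struct local_header_packed, name_len) == 26);
}
```

Reading straight into the packed struct also assumes a little-endian machine, since ZIP stores its integers in little-endian order.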