Recognise legacy RAZF compression #1244
RAZF is an obsolete predecessor to BGZF, and is similarly a variant of GZIP using an extra header field. It also adds a trailing index table. Adding this htsCompression value does not affect bgzf_read_init()'s detection of BGZF vs plain-GZIP; RAZF remains treated as is_gzip and the trailing index table is not handled well, leading to problems if you try to decompress such a legacy file with e.g. bgzip -d.
|
Thanks, the file detection looks good. Do you think it would be a good idea to make |
|
I had a look at the old razf.c, and I think it may be practical to stop and flush the output when it gets to the trailing index table and thus enable sequential reading of RAZF — and thus make I'll have a slightly deeper look and otherwise have it politely decline. (Unless anyone else wants to have a play with this!) |
|
I think read support might be a bit of effort, especially in the constraints of the existing BGZF struct. Politely refusing is a couple of extra lines in |
|
For your consideration, I've added a commit that enables sequential reading and so makes The last 16 bytes of a RAZF file are the compressed and uncompressed file size, so it would be possible to do this exactly by seeking to EOF when opening the file and reading these values. However this is more work and more infrastructure than is worth adding for this legacy format. Instead we simply detect when we have an error due to reading the index table instead of a GZIP header, and consider it as EOF instead. The first time into This is done only for |
|
Hmmm, I'm not too sure about that. It burns our spare bit on a very rarely encountered format, and the EOF detection looks a bit fragile. The old razf.c implementation stopped at |
|
I could repurpose However the usefulness of this is indeed vanishingly small and it would probably be better to print out instructions for a more reliable approach via |
Instead emit an error message recommending the use of gunzip to decompress the file, in the unlikely event a RAZF file is encountered. If seeking is available, attempt to read the sizes stored at the end of the RAZF trailing index table so that the message can show a truncate command to remove the index table before gunzipping the file.
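The size lookup described above can be sketched roughly as follows. This is an illustrative sketch using plain stdio rather than htslib's hFILE layer, and the function names are hypothetical. It assumes, based on the old razf.c writer, that the two sizes are stored as big-endian 64-bit integers with the uncompressed size first; the discussion above only says that the last 16 bytes hold both sizes, so treat the byte order and field order as assumptions to verify.

```c
#include <stdint.h>
#include <stdio.h>

/* Decode an 8-byte big-endian unsigned integer from p. */
uint64_t sketch_be64(const unsigned char *p)
{
    uint64_t v = 0;
    for (int i = 0; i < 8; i++)
        v = (v << 8) | p[i];
    return v;
}

/* Read the final 16 bytes of fp and decode them as two sizes.
 * ASSUMPTION: big-endian, uncompressed size first, then compressed size.
 * Returns 0 on success, or -1 if the stream cannot be seeked or the
 * 16 bytes cannot be read. */
int sketch_razf_sizes(FILE *fp, uint64_t *usize, uint64_t *csize)
{
    unsigned char buf[16];

    if (fseek(fp, -16L, SEEK_END) != 0)
        return -1;
    if (fread(buf, 1, sizeof buf, fp) != sizeof buf)
        return -1;

    *usize = sketch_be64(buf);
    *csize = sketch_be64(buf + 8);
    return 0;
}
```

A caller would fall back to the generic "use gunzip" message whenever this returns -1, for example when reading from a pipe, and would only print the more specific truncate instructions when both sizes were read successfully.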
|
[That last comment was a bit tongue-in-cheek BTW 😄] Okay, simplified this to just produce a useful error message instead — by seeking to the end of the file if possible so as to produce instructions to avoid the ambiguity of gunzip's “trailing garbage ignored” (was it just the RAZF index table, or was there a glitch earlier in the file?): |
|
Thanks, I think that's a good solution for now. If RAZF ever becomes more popular, we can move on to something more advanced later. |
|
Thanks. I think we can be pretty confident that RAZF will not have a resurgence in popularity 😄 |
Diagnosing samtools/samtools#1387 was hindered by htsfile identifying what was actually an obsolete RAZF-compressed file as plain gzipped. This patch adds basic support for recognising RAZF, similarly to how ef0b40d added basic recognition for bzip2 compression. (RAZF is an obsolete predecessor to BGZF, and is similarly a variant of GZIP using an extra header field. It also adds a trailing index table.)
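As a rough illustration of the recognition step, the sketch below classifies a buffer holding the first bytes of a file. It is not htslib's code: the enum and function names are hypothetical. The gzip magic bytes, the FEXTRA flag bit and the BGZF "BC" subfield come from the gzip and BGZF specifications; the assumption that a RAZF file carries the ASCII bytes "RAZF" at the start of its extra field (offset 12) is based on the old razf.c writer and should be checked against real files.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical names for illustration; not htslib's htsCompression enum. */
enum sketch_compression { SKETCH_NONE, SKETCH_GZIP, SKETCH_BGZF, SKETCH_RAZF };

/* Classify the leading bytes of a file.
 * gzip header layout: ID1 ID2 CM FLG MTIME(4) XFL OS, then, if the FEXTRA
 * flag (bit 2 of FLG) is set, a 2-byte XLEN followed by the extra field,
 * which therefore starts at offset 12. */
enum sketch_compression sketch_classify(const unsigned char *s, size_t len)
{
    if (len < 2 || s[0] != 0x1f || s[1] != 0x8b)
        return SKETCH_NONE;              /* no gzip magic number */

    if (len >= 16 && (s[3] & 4)) {       /* FEXTRA is set */
        /* BGZF: subfield ID 'B','C' with a length of 2 (little-endian). */
        if (memcmp(&s[12], "BC\2\0", 4) == 0)
            return SKETCH_BGZF;
        /* ASSUMPTION: RAZF stores the literal "RAZF" here. */
        if (memcmp(&s[12], "RAZF", 4) == 0)
            return SKETCH_RAZF;
    }
    return SKETCH_GZIP;                  /* plain gzip, or an unknown extra field */
}
```

Checking only the header keeps recognition cheap and independent of whether the stream is seekable, which suits a tool like htsfile that may be reading from a pipe.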
Adding this htsCompression value does not affect bgzf_read_init()'s detection of BGZF vs plain-GZIP; RAZF remains treated as is_gzip and the trailing index table is not handled well, leading to problems if you try to decompress such a legacy file with e.g. bgzip -d, as the OP on that issue discovered.