Skip to content

Recognise legacy RAZF compression#1244

Merged
daviesrob merged 2 commits intosamtools:developfrom
jmarshall:detect-razf
Mar 2, 2021
Merged

Recognise legacy RAZF compression#1244
daviesrob merged 2 commits intosamtools:developfrom
jmarshall:detect-razf

Conversation

@jmarshall
Copy link
Copy Markdown
Member

@jmarshall jmarshall commented Mar 1, 2021

Diagnosing samtools/samtools#1387 was hindered by htsfile identifying what was actually an obsolete RAZF-compressed file as plain gzipped. This patch adds basic support for recognising RAZF, similarly to how ef0b40d added basic recognition for bzip2 compression.

$ htsfile human_g1k_v37.fasta.gz 
human_g1k_v37.fasta.gz:	FASTA legacy-RAZF-compressed data

(RAZF is an obsolete predecessor to BGZF, and is similarly a variant of GZIP using an extra header field. It also adds a trailing index table.)

Adding this htsCompression value does not affect bgzf_read_init()'s detection of BGZF vs plain-GZIP; RAZF remains treated as is_gzip and the trailing index table is not handled well, leading to problems if you try to decompress such a legacy file with e.g. bgzip -d — as the OP on that issue discovered.

RAZF is an obsolete predecessor to BGZF, and is similarly a variant of
GZIP using an extra header field. It also adds a trailing index table.

Adding this htsCompression value does not affect bgzf_read_init()'s
detection of BGZF vs plain-GZIP; RAZF remains treated as is_gzip and
the trailing index table is not handled well, leading to problems
if you try to decompress such a legacy file with e.g. bgzip -d.
@daviesrob
Copy link
Copy Markdown
Member

Thanks, the file detection looks good. Do you think it would be a good idea to make bgzf_read_init() detect the format and politely decline to read it? Given that it seems to produce a truncated output, it might be useful to just recommend decompressing with gzip.

@jmarshall
Copy link
Copy Markdown
Member Author

I had a look at the old razf.c, and I think it may be practical to stop and flush the output when it gets to the trailing index table and thus enable sequential reading of RAZF — and thus make bgzip -d actually work on RAZF files as it does for plain gzip.

I'll have a slightly deeper look and otherwise have it politely decline. (Unless anyone else wants to have a play with this!)

@daviesrob
Copy link
Copy Markdown
Member

I think read support might be a bit of effort, especially in the constraints of the existing BGZF struct. Politely refusing is a couple of extra lines in bgzf_read_init() so much quicker to do as a short-term fix, I think.

@jmarshall
Copy link
Copy Markdown
Member Author

For your consideration, I've added a commit that enables sequential reading and so makes bgzip -d foo.rz work as expected. The problem was indeed limited to falling over when reading encounters the trailing index table.

The last 16 bytes of a RAZF file are the compressed and uncompressed file size, so it would be possible to do this exactly by seeking to EOF when opening the file and reading these values. However this is more work and more infrastructure than is worth adding for this legacy format.

Instead we simply detect when we have an error due to reading the index table instead of a GZIP header, and consider it as EOF instead. The first time into inflate_gzip_block() this fails with a distinctive Zlib error message, so is easy to detect fairly accurately. Unfortunately inflate_gzip_block() is repeatedly called at EOF and on subsequent invocations we get a less informative Zlib error, so we reuse last_block_eof (which is unused for plain GZIP files) to indicate that we've already encountered this and should convert to EOF again.

This is done only for if (fp->is_razf) and only when an error has already occurred, so there is no performance impact on BGZF or plain GZIP files.

@daviesrob
Copy link
Copy Markdown
Member

daviesrob commented Mar 1, 2021

Hmmm, I'm not too sure about that. It burns our spare bit on a very rarely encountered format, and the EOF detection looks a bit fragile. The old razf.c implementation stopped at Z_STREAM_END, which seems like it would be more reliable. It looks like it didn't support concatenation, as far as I can tell.

@jmarshall
Copy link
Copy Markdown
Member Author

I could repurpose is_be instead if you prefer, and the EOF detection is no more fragile than using gunzip and hoping that the first “trailing garbage” encountered is indeed the trailing garbage you were expecting.

However the usefulness of this is indeed vanishingly small and it would probably be better to print out instructions for a more reliable approach via gunzip instead. Revision incoming…

Instead emit an error message recommending the use of gunzip to decompress
the file, in the unlikely event a RAZF file is encountered. If seeking is
available, attempt to read the sizes stored at the end of the RAZF trailing
index table so that the message can show a truncate command to remove the
index table before gunzipping the file.
@jmarshall
Copy link
Copy Markdown
Member Author

[That last comment was a bit tongue-in-cheek BTW 😄] Okay, simplified this to just produce a useful error message instead — by seeking to the end of the file if possible so as to produce instructions to avoid the ambiguity of gunzip's “trailing garbage ignored” (was it just the RAZF index table, or was there a glitch earlier in the file?):

$ bgzip -d human_g1k_v37.fasta.gz 
[E::bgzf_read_init] Cannot decompress legacy RAZF format
[E::razf_info] To decompress this file, use the following commands:
    truncate -s 891946027 human_g1k_v37.fasta.gz
    gunzip human_g1k_v37.fasta.gz
The resulting uncompressed file should be 3153506519 bytes in length.
If you do not have a truncate command, skip that step (though gunzip will
likely produce a "trailing garbage ignored" message, which can be ignored).
[bgzip] Could not open human_g1k_v37.fasta.gz: Exec format error

@daviesrob
Copy link
Copy Markdown
Member

Thanks, I think that's a good solution for now. If RAZF ever becomes more popular, we can move on to something more advanced later.

@daviesrob daviesrob merged commit 2abfea3 into samtools:develop Mar 2, 2021
@jmarshall jmarshall deleted the detect-razf branch March 2, 2021 15:39
@jmarshall
Copy link
Copy Markdown
Member Author

Thanks. I think we can be pretty confident that RAZF will not have a resurgence in popularity 😄

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants