Adds CRAM 3.1 support (experimental) #929
Fixed bugs on Windows / macOS (limited stack sizes) and rebased off develop. Some examples: develop vs this PR using an existing local v3.0 CRAM. Tests read this file (a 3.0 file of 10 million NovaSeq records), convert it to a new BAM/CRAM file, and then read it back again to get encode and decode times. NB: this build has libdeflate support; without it BAM would be around 2x slower.
(Edit: added gzipped SAM for Martin :-)) Note: "fast" mode is not only lightweight compression, but also fine-grained random access (10x finer than normal), so while it doesn't look much faster at streaming, it's potentially considerably faster with some random-access patterns. Example command line to test with:
Happy Birthday :-) Given 1.11 is imminent and this PR has missed two (lengthy!) releases already, I am bumping this and request please that it is reviewed early in the 1.12 release cycle. If not, I'll continue with my previous assertion that all unreviewed PRs of mine after 1 year get merged anyway, under the implicit assumption that no one objects. ;-)
The main issue with this pull request is how we manage the new forced dependency on the htscodecs submodule. This will also need to be checked carefully to ensure it doesn't introduce any regressions, especially on memory safety and handling invalid inputs. Given the size of this change, that might take a while.
The main purpose of htscodecs, as far as I am concerned, is to avoid duplication between io_lib and htslib. I need both because io_lib is much faster moving and has more experimental features (e.g. it's had CRAM 3.1 and 4 for several years).

It doesn't necessarily have to be explicitly done as a separate library, or submodule. E.g. the files could be copied in (klib style), provided we're committed to making changes in the upstream project instead of just forking and diverging. It also doesn't necessarily have to use the htscodecs build system or be a library. It could be done within the htslib Makefile, even if we import it as a separate git module or package. E.g. just add the necessary .c files in the Makefile so they end up being linked in as part of htslib. That won't be flavour of the month with Debian though.

Htscodecs has been through rounds of sanitizers and fuzzing, but yes, it may not be as robust as the OSSFuzz hammering. I know it'll take time, but that's why I'm asking for it to be considered at the start of the next release cycle. It's always been "this is too complex, so we'll punt it to the next one", which basically means never unless we act early on. Hence this reminder.
Also, to be frank, do we want to encourage the continued "cp -r htslib ." cargo-cult style of development? It's a total nightmare for getting bug fixes out there. As long as we make it easy for people to do things the Right Way, then it's not so bad. Autotools also isn't that bad from a user perspective - you don't have to install anything (unlike cmake or the myriad other flavour-of-the-month make replacements). If it's built using --disable-shared then most of the libtool monstrosities go away. :)
Rebased and added some updates to bring this in line with io_lib's CRAM 4 implementation. I've verified that scramble and test_view can now read and write each other's files, with compatibility between them. CRAM 4 is even more experimental than 3.1, but we may as well at least be consistent. This also brings this PR in line with the draft cram4.tex specification, which has had a few tweaks and adjustments since Sep 2019 (when this PR was made). The old branch, before rebasing, can be found here: https://github.com/jkbonfield/htslib/tree/htscodecs.old
(Compare with the 2nd-to-top post above, but note these are different files than before.) Current htscodecs timings - i.e. this PR as it now stands, with the updated htscodecs submodule too. 10 million NovaSeq records:
It's unclear why scramble's 3.1 encode speed is comparable to 3.0, but htslib's is not. It is of note though that scramble's system time is vastly smaller, which is mainly due to it using mallopt to work around inefficiencies in glibc's malloc (which htslib does not do). On 10 million recs from the HiSeq 2k data (9827_2#49.bam):
Some more benchmarks, this time 10 million NovaSeq records. I post these as data to consider for how to progress.

@daviesrob I know your preference is to keep cram/rANS_static.c and have htscodecs as an optional build. IMO this is a big mistake. We don't need more variants and ways to build sub-optimal libraries. Either we support CRAM, or we don't. I'd rather we have a hard dependency.

The other thing is there is a catch-22 issue here. The CRAM 3.1 spec has gained minimal support elsewhere, partly because it's an abstract notion and isn't supported by the main packages. However this PR has basically been ignored for well over a year because CRAM 3.1 is experimental and hasn't been adopted by GA4GH yet. The only way to break this cycle is to lead the community, not sit back and hope it'll all just go away. That way just gives power to the likes of MPEG, fostering commercial alternatives that may be able to legitimately compare against CRAM 3.0 and conveniently pretend 3.1 doesn't exist.

As you may gather, I'm getting somewhat annoyed by the endless delays. PLEASE consider merging this. We're more or less half way through our mythical 3-month release cycle. I don't wish there to be yet another excuse of "we can't merge it this late", as happened for the last 2 releases. The sooner it is merged, the sooner it gets into the standard fuzzing and testing regimes and the more confident we'll be that it is solid.
cram/cram_encode.c (outdated):

    pthread_mutex_lock(&fd->metrics_lock);
    for (i = 0; i < DS_END; i++)
        fd->m[i]->stats = c->stats[i];
This is leading to threading problems. fd->m[i] is shared between all threads, so they fight over which c->stats[i] should be in there. When the pointer is used in cram_compress_block2() it may get stats for the wrong container, or in the worst case the memory may have been released already resulting in a use-after-free bug.
Either cram_compress_block2() needs an extra parameter for c->stats[type] (or c->stats[type]->nvals); or the adjustments to method need to be done in this function so there is no need to pass the stats to cram_compress_block2(). After that I think it should be possible to remove cram_metrics::stats as I don't think it's used anywhere else.
Wrong container doesn't matter much, as the metrics are ongoing trends that aggregate data across containers. However you're correct that having a pointer to it is bad, as it may go out of scope.
I think the correct answer is to set a flag (or boost a counter) in metrics so it doesn't need the pointer, but metrics analysis can still use it for biasing which codecs to try.
This definitely looks like a real problem, although I've tried very hard to recreate it and with 100s of tests still cannot.
Anyway, I've now replaced it with a boolean flag which is all we really need.
The question of how to incorporate building htscodecs into HTSlib's build infrastructure comes down to this: whether there should be an htscodecs entity that is versioned independently from htslib, whether to facilitate distribution packagers making such an independent library, and how to make the codec source code available — both in a Git checkout and in an HTSlib source tarball.

FWIW I agree that avoiding having an independently versioned second library for this is the desirable and simple approach. The purpose of the separate htscodecs repository is simply to share the code between HTSlib and e.g. io_lib, not to have them both depend on another separate moving target.

I'm not usually a fan of Git submodules: they make the subproject operate in absolute lockstep with the superproject, so are not a good fit for e.g. the samtools/bcftools relationship with htslib (in which many adjacent versions of htslib will do for a particular samtools/bcftools commit). However for this case, in which htslib wants to share a particular htscodecs commit with scramble, a Git submodule is exactly what is desired.

Now, how to build the codecs and link their object code directly into htslib.a/.so? HTSlib has its own very capable build system already, so the simplest thing (from HTSlib's point of view) would be to have the submodule's .c files built using rules in HTSlib's Makefile. As the htslib git checkout would reference a particular commit of the codecs submodule, there is no difficulty in having the list of codec source files located in an HTSlib file that would need updating whenever the list of source files in the submodule changes (which would be infrequent anyway). In fact, probably the best way to do this would be to have a new HTSlib source file, e.g. codecs.c, that includes all the desired htscodecs source files:

    // codecs.c -- HTSlib source file that pulls in code from the htscodecs submodule
    #include <config.h> // i.e. *HTSlib's* config.h
    // Any other needed setting up
    #include "htscodecs/arith_dynamic.c"
    #include "htscodecs/fqzcomp_qual.c"
    #include "htscodecs/pack.c"
    #include "htscodecs/rANS_static.c"
    #include "htscodecs/rANS_static4x16pr.c"
    #include "htscodecs/rle.c"
    #include "htscodecs/tokenise_name3.c"

Then HTSlib's Makefile would just list codecs.o in its object list. And that's it. No need for HTSlib's INSTALL or configure.ac to learn about automake or libtool or do anything at all. Not using htscodecs's makefile or build system, as we are literally using only the codec source code from the subproject. Not running or using the htscodecs test suite either, but the expectation would be that people doing development on the codecs themselves would be doing it from a separate checkout of htscodecs itself (rather than as a submodule), so would be building and testing using htscodecs's own build system and test suite — quite separately from HTSlib; and then when complete the new commits could be git fetched/pulled into HTSlib's referenced htscodecs commit.

Finally, what about HTSlib release source tarballs? There is no git checkout to use to supply files from a submodule, so c-maint/mktarball would need to be extended to copy the htscodecs files (or just the .c/.h files actually needed and the basic README, LICENSE etc) from the referenced repository and commit into a newly mkdired subdirectory within the tarball it's creating. Then the existing codecs.c and its build rules would work exactly as they do in a git checkout.

There are a couple of minor organisational changes that would facilitate all this:
The downside of this is that the codecs would be built in HTSlib by a different system from the one they are built by in the native htscodecs project; in particular, with different compile flags than as determined by the htscodecs build system. OTOH they would be built with the same compile flags as the rest of HTSlib — so there are pros and cons here; it's a tradeoff. My suspicion is that the differences would be minimal anyway. The big advantage is that htscodecs is free to use automake and libtool (boo, hiss 😄) if it wants to, and this will have no effect on HTSlib. However perhaps it would be worth revisiting the htscodecs build system to focus on just test harnesses for the various codecs. [Rob]
[James]
Even though Debian have already jumped the gun and drafted a shared library package from James's draft repository, HTSlib and htscodecs have a choice about whether to facilitate that. It's really a question of managing expectations.

Debian perfectly reasonably likes to debundle separate libraries so that the subproject can be independently updated and the superproject gain from that (getting bug fixes etc). However one point of view here is that htscodecs really isn't a separate library — it's just a collection of source files. There's no intrinsic reason why HTSlib or htscodecs need to facilitate using or building htscodecs as a library, when they claim that htscodecs is just a git-native way of sharing some code between HTSlib and io_lib — a more reliable form of klib's copying files around between projects.

The claim is that by only having HTSlib (and io_lib) with released versions, and each HTSlib release determining the htscodecs submodule commit that it's used with, maintenance and support are simplified. (E.g. the need to maintain yet another stable API is avoided entirely!) Behaviour from bug reports is easier to understand because there isn't a random combination of htslib version and codec version.

This point of view will be more palatable if htscodecs doesn't present itself as a library. I haven't really looked at how io_lib uses it, but if the htscodecs build system was changed to just build a bunch of *.o files that can be linked with their test harnesses, then it would be clear that this code was not intended to be separated as a library. There may be temporary squealing 😄 but I suspect packagers would come around to just using HTSlib's / io_lib's native inclusion of these source files. Particularly as HTSlib/io_lib releases should be frequent enough to keep up with the (in the fullness of time) likely small infrequent updates to the codecs.
It's used by a released version of io_lib, so they haven't jumped the gun. This problem has mainly come about due to our tardiness. I wasn't willing to tie my own releases of io_lib (libstaden_read1 on Debian) to waiting years for htslib PRs to get reviewed, and with a lack of any feedback here I just went with something that worked for me. I'd rather not rename it now either, but it does need moving to the samtools organisation (that's not something I have permission to do). It was named after htslib, which doesn't have a hyphen.
It can be thought of either way. I'm open to all ways for htslib to use it. My aim is simply to avoid duplication. I've had enough of having to support two variants of the same code, and occasionally finding subtle bugs that we'd already fixed in the other copy. One source location is absolutely the right solution. As for the tarball, I think it should just be bundled in with it. Definitely for samtools/bcftools, but maybe also for htslib.
I'm just getting around to working out how to integrate this into HTSlib without the need for a recursive make (which I'm not in favour of, due to the pain it has caused in SAMtools). My current thinking is to have configure options that would allow it to be built using either bundled htscodecs code, or linked to an external library as Debian would like. Some other projects work this way; for example, mplayer can build its own copy of various dependencies or link to them externally. It would be useful, as long as it doesn't add too much complication, if it means distributors don't start adding big patches so they can package htscodecs externally. James should be able to transfer the htscodecs repository to the samtools organisation now.
In anticipation of samtools/htslib#929 merge, which adds a submodule. Also makes Appveyor use `autoreconf -i` when setting up HTSlib as that will be required for autoconf 2.70.
In testing I spotted that v2.1 output still uses the rANS codec, which is incorrect. It does this because rANS is enabled when opening a file if the version is >= 3.0, but we set the version after opening the file and that didn't explicitly disable rANS again. Io_lib's Scramble does these bits in reverse (sets the default version and then opens the file) so it worked fine.
The cram/rANS_static.[ch] implementation has been removed. Updates the build system to make the required parts of htscodecs. Co-Authored-By: Rob Davies <rmd+git@sanger.ac.uk>
Note: this is experimental, subject to the 3.1 spec being agreed upon
by the GA4GH File Formats committee.
Example usage (high compression):
./test/test_view -C -o version=3.1 -o use_fqz -o use_bzip2 -o use_arith \
-o seqs_per_slice=100000 /tmp/30.cram -p /tmp/31-max.cram
Also added support for compression profiles.
Primarily these are CRAM only, but for consistency they're also used as
synonyms for compression levels in bgzf.
In the process of adding CRAM 3.1 several improvements and refactoring
took place to the cram encoder (mainly by dint of copying over the
cram_compress_block function from io_lib):
- Added a "Gzip -1" deflate strategy, as sometimes it's smaller
  than gzip or of comparable size (but considerably faster). This is
  purely for metrics and optimisation.
- Refactored the cram_compress_block function so it can deal with many
more compression method / strategies. It now has a number of lookup
tables which need to be kept synchronised (see comments).
- Added a new cram_compress_block2 function which takes a slice. This
is used for the fqzcomp code to obtain the read lengths.
- cram_compress_block now has a consistency metric too. This is used
  to learn when data isn't changing rapidly, so we aren't too quick
  to re-evaluate.
Some rANS codecs require over 2 Mbytes of temporary space for the encoder/decoder state. On some platforms, notably macOS, the default thread stack size is not big enough to hold this. Previously we fixed this by using malloc instead, but that has other undesirable side effects, such as repeated mmap/munmap and wasting time zeroing pages over and over. Our traditional fix here has been to use pthread_once to get a single malloc and then reuse this block, to avoid the mmap issues in glibc. However this is complicated; ensuring the thread stack is sufficiently big is much easier.
This cannot be avoided! It's spam, but it will be removed once the spec becomes official. If there are no changes, that does mean this htslib will be compliant, but spammy, however we can then make a new release.
Not sure we have any files large enough to really put it through its paces, but this is a reasonable start.
We can read/write CRAM 3.1. We can read CRAM 4.0, but writing 4.0 is corrupted.
Normally, sorted data deltas the positions and unsorted data does not. This allows us to explicitly re-enable the AP=1 container header field to do delta encoding, as it's sometimes useful on unsorted data. E.g. pairs are normally at neighbouring positions, so the positions still compress a bit better than leaving them as-is.
This does some basic analysis to spot grouped variants, and only if the run is long enough and scores positively will it switch to the 'b'(?) feature.
- We no longer keep GZIP_RLE when libdeflate is compiled in, because libdeflate doesn't support strategy in the same manner as zlib. Instead we use GZIP level 1.
- Adjusted the fixed small-block compensation code. It's now equivalent to a 60 byte difference required before changing up to the next tier in complexity. This had no impact on my NovaSeq test normally, but with bzip2+lzma added in it saved 6% CPU (at a cost of 0.14% file growth).
Same as io_lib's eda7cfe and 16c19ee. This replaces the length field in containers with a variable size quantity instead of 32-bit int, for consistency with other fields, and marks a few fields as using a signed encoding strategy (necessary for the next commit). [The CRAM 4 specification moved on slightly since the original date of this PR.]
This is the htslib equivalent to io_lib's 090dfd3f and d86b008. It is now in adherence to the draft CRAM 4 spec. It removes the E_SINT and E_SLONG types and replaces E_EXTERNAL with E_BYTE, E_VARINT_SIGNED and E_VARINT_UNSIGNED. Integer data types aren't explicitly signed or unsigned, but simply use the codec that works for them. So if all ref-ids are >= 0 it'll be VARINT_UNSIGNED, otherwise VARINT_SIGNED. This future-proofs things by adding introspection to the format. Also removed HUFFMAN from CRAM 4. It was only being used for storing constant values, so we now have CONST_BYTE and CONST_INT instead (CONST_INT always uses a signed encoding). This should be the final removal of ever encoding negatives as if they were positives (e.g. -1 as 0xFFFFFFFF), and so in turn removes all size limitations on values.
This has been replaced by an "unpackable" flag. This avoids cram_compress_block2 needing to access metrics->stats, which is a pointer to the container->stats struct. In rare cases this could be freed before the compression completes.
Also the public version has had the enum members renamed to include an HTS_ prefix. At present they appear to be unused, although it's hard to know for sure with some of them, due to the common hits you get all over github. Leaving them as they are though is just waiting for a hard problem to arrive. This commit introduces the notion of HTS_COMPAT. The ABI hasn't changed, but the API has (albeit in a way we believe shouldn't affect people). It's a 3+ digit number, with the last two digits being the minor version and the earlier digits the major version. So htslib version 1.12 would be 112. The intention is simply to permit packages with dependencies on specific earlier API names to still compile with the minimum of changes (even if it's just by overriding CFLAGS).
They are included by adding extra dependencies to the "test" and "check" targets when htscodecs is bundled via the submodule.
So that anyone who tries to build the library without setting up the submodule first will get a hint on how to make the build work. The check isn't really needed for anyone building from a release tarball, but it will be harmless as it should always succeed (as long as the tarball was built correctly and includes the htscodecs files).
Currently, if the submodule files are missing, it asks the user to run "git submodule update" to install them. It's possible that the Makefile could run this itself and continue, although I'm not sure if that would be completely safe in parallel builds. The variables listing dependencies for htscodecs header files are moved to htscodecs_bundled.mk to ensure they are only set for submodule builds.
I've squashed in a couple of the bugfix commits, and rebased so this will merge cleanly again.
submodule. Also makes everything use autoreconf -i when setting up HTSlib as that will be required for autoconf 2.70. Note that both Travis and cirrus-ci use the .travis/clone script.
In anticipation of samtools/htslib#929 merge, which adds a submodule. Also makes everything use autoreconf -i when setting up HTSlib as that will be required for autoconf 2.70. Increases the memory limit for the address sanitizer cirrus-ci test. While the average memory used seems to be low, occasionally some test processes may get big enough to trigger an out of memory killer. Note that both Travis and cirrus-ci use the .travis/clone script.
The samtools and bcftools tests should now work with the submodule, so I'll merge this tomorrow morning unless there are any last-minute hitches.
This is here as demonstration of how to add this into Htslib (in addition to the existing support in Scramble).
It uses htscodecs as a submodule, but compiles and links the code directly into htslib so there isn't a second library hanging around. This now means scramble and htslib share the same codec implementation and we don't accidentally only fix bugs in one place.
Unlike Scramble there is nothing to actively warn that this is a technology test of CRAM 3.1 and not an official GA4GH standard. I could add this, but I wasn't envisioning merging it until it's an adopted standard anyway. (It's sufficient to be able to state there is support for this format in Samtools and hence maybe get more people looking into assessing the new spec.)