Adds CRAM 3.1 support (experimental) #929
Fixed bugs on Windows / macOS (limited stack sizes) and rebased off develop. Some examples: develop vs this PR using an existing local v3.0 CRAM. Tests read this file (a 3.0 file of 10 million NovaSeq records), convert it to a new BAM/CRAM file, and then read it back again to get encode and decode times. NB: this build has libdeflate support; without it BAM would be around 2x slower.
(Edit: added gzipped SAM for Martin :-)) Note: "fast" mode is not only lightweight compression, but also fine-grained random access (10x finer than normal), so while it doesn't look much faster at streaming, it's potentially considerably faster with some random-access patterns. Example command line to test with:
Happy Birthday :-) Given 1.11 is imminent and this PR has missed two (lengthy!) releases already, I am bumping this and request please that it is reviewed early in the 1.12 release cycle. If not, I'll continue with my previous assertion that all unreviewed PRs of mine after 1 year get merged anyway, under the implicit assumption that no one objects. ;-)
The main issue with this pull request is how we manage the new forced dependency on the htscodecs submodule. This will also need to be checked carefully to ensure it doesn't introduce any regressions, especially on memory safety and handling invalid inputs. Given the size of this change, that might take a while.
The main purpose of htscodecs, as far as I am concerned, is to avoid duplication between io_lib and htslib. I need both because io_lib is much faster moving and has more experimental features (e.g. it's had CRAM 3.1 and 4 for several years).

It doesn't necessarily have to be explicitly done as a separate library, or submodule. E.g. the files could be copied in (klib style), provided we're committed to making changes in the upstream project instead of just forking and diverging. It also doesn't necessarily have to use the htscodecs build system or be a library. It could be done within the htslib Makefile, even if we import it as a separate git module or package. E.g. just add the necessary .c files in the Makefile so they end up being linked in as part of htslib. That won't be flavour of the month with Debian though.

Htscodecs has been through rounds of sanitizers and fuzzing, but yes, it may not be as robust as the OSSFuzz hammering. I know it'll take time, but that's why I'm asking for it to be considered at the start of the next release cycle. It's always been "this is too complex, so we'll punt it to the next one", which basically means never unless we act early on. Hence this reminder.
Also, to be frank, do we want to encourage the continued "cp -r htslib ." cargo-cult style of development? It's a total nightmare for getting bug fixes out there. As long as we make it easy for people to do things the Right Way, then it's not so bad. Autotools also isn't that bad from a user perspective - you don't have to install anything (unlike cmake or the myriad other flavour-of-the-month make replacements). If it's built using --disable-shared then most of the libtool monstrosities go away. :)
Rebased and added some updates to bring this in line with io_lib's CRAM 4 implementation. I've verified that scramble and test_view can now read and write each other's files, with compatibility between them. CRAM 4 is even more experimental than 3.1, but we may as well at least be consistent. This also brings this PR in line with the draft cram4.tex specification, which has had a few tweaks and adjustments since Sep 2019 (when this PR was made). The old branch, before rebasing, can be found here: https://github.com/jkbonfield/htslib/tree/htscodecs.old
(Compare with the 2nd-to-top post above, but note these are different files than before.) Current htscodecs timings - i.e. this PR as it now stands, with the updated htscodecs submodule too. 10 million NovaSeq records:
It's unclear why scramble's 3.1 encode speed is comparable to 3.0, but htslib's is not. It is of note though that scramble's system time is vastly smaller, which is mainly due to it using mallopt to work around inefficiencies in glibc's malloc (which htslib does not do). On 10 million recs from the HiSeq 2k data (9827_2#49.bam):
Some more benchmarks, this time 10 million NovaSeq records. I post these as data to consider for how to progress.

@daviesrob I know your preference is to keep cram/rANS_static.c and have htscodecs as an optional build. IMO this is a big mistake. We don't need more variants and ways to build sub-optimal libraries. Either we support CRAM, or we don't. I'd rather we have a hard dependency.

The other thing is there is a catch-22 issue here. The CRAM 3.1 spec has gained minimal support elsewhere, partly because it's an abstract notion and isn't supported by the main packages. However this PR has basically been ignored for well over a year because CRAM 3.1 is experimental and hasn't been adopted by GA4GH yet. The only way to break this cycle is to lead the community, not sit back and hope it'll all just go away. That way just gives power to the likes of MPEG, fostering commercial alternatives that may be able to legitimately compare against CRAM 3.0 and conveniently pretend 3.1 doesn't exist.

As you may gather, I'm getting somewhat annoyed by the endless delays. PLEASE consider merging this. We're more or less half way through our mythical 3-month release cycle. I don't wish there to be yet another excuse of "we can't merge it this late", as happened for the last 2 releases. The sooner it is merged, the sooner it gets into the standard fuzzing and testing regimes and the more confident we'll be that it is solid.
cram/cram_encode.c (outdated):

    pthread_mutex_lock(&fd->metrics_lock);
    for (i = 0; i < DS_END; i++)
        fd->m[i]->stats = c->stats[i];
This is leading to threading problems. fd->m[i] is shared between all threads, so they fight over which c->stats[i] should be in there. When the pointer is used in cram_compress_block2() it may get stats for the wrong container, or in the worst case the memory may have been released already resulting in a use-after-free bug.
Either cram_compress_block2() needs an extra parameter for c->stats[type] (or c->stats[type]->nvals); or the adjustments to method need to be done in this function so there is no need to pass the stats to cram_compress_block2(). After that I think it should be possible to remove cram_metrics::stats as I don't think it's used anywhere else.
Wrong container doesn't matter much, as the metrics are ongoing trends that aggregate data across containers. However you're correct that having a pointer to it is bad, as it may go out of scope.
I think the correct answer is to set a flag (or boost a counter) in metrics so it doesn't need the pointer, but metrics analysis can still use it for biasing which codecs to try.
This definitely looks like a real problem, although I've tried very hard to recreate it and with 100s of tests still cannot.
Anyway, I've now replaced it with a boolean flag which is all we really need.
The question of how to incorporate building htscodecs into HTSlib's build infrastructure comes down to this: whether there should be an htscodecs entity that is versioned independently from htslib, whether to facilitate distribution packagers making such an independent library, and how to make the codec source code available — both in a Git checkout and in an HTSlib source tarball.

FWIW I agree that avoiding having an independently versioned second library for this is the desirable and simple approach. The purpose of the separate htscodecs repository is simply to share the code between HTSlib and e.g. io_lib, not to have them both depend on another separate moving target.

I'm not usually a fan of Git submodules: they make the subproject operate in absolute lockstep with the superproject, so are not a good fit for e.g. the samtools/bcftools relationship with htslib (in which many adjacent versions of htslib will do for a particular samtools/bcftools commit). However for this case, in which htslib wants to share a particular htscodecs commit with scramble, a Git submodule is exactly what is desired.

Now, how to build the codecs and link their object code directly into htslib.a/.so? HTSlib has its own very capable build system already, so the simplest thing (from HTSlib's point of view) would be to have the submodule's .c files built using rules in HTSlib's Makefile. As the htslib git checkout would reference a particular commit of the codecs submodule, there is no difficulty in having the list of codec source files located in an HTSlib file that would need updating whenever the list of source files in the submodule changes (which would be infrequent anyway). In fact, probably the best way to do this would be to have a new HTSlib source file, e.g. codecs.c, that includes all the desired htscodecs source files:

    // codecs.c -- HTSlib source file that pulls in code from the htscodecs submodule
    #include <config.h> // i.e. *HTSlib's* config.h
    // Any other needed setting up
    #include "htscodecs/arith_dynamic.c"
    #include "htscodecs/fqzcomp_qual.c"
    #include "htscodecs/pack.c"
    #include "htscodecs/rANS_static.c"
    #include "htscodecs/rANS_static4x16pr.c"
    #include "htscodecs/rle.c"
    #include "htscodecs/tokenise_name3.c"

Then HTSlib's Makefile would just list codecs.o in its object list. And that's it. No need for HTSlib's INSTALL or configure.ac to learn about automake or libtool or do anything at all. Not using htscodecs's makefile or build system, as we are literally using only the codec source code from the subproject. Not running or using the htscodecs test suite either, but the expectation would be that people doing development on the codecs themselves would be doing it from a separate checkout of htscodecs itself (rather than as a submodule), so would be building and testing using htscodecs's own build system and test suite — quite separately from HTSlib; and then when complete the new commits could be git fetched/pulled into HTSlib's referenced htscodecs commit.

Finally, what about HTSlib release source tarballs? There is no git checkout to use to supply files from a submodule, so c-maint/mktarball would need to be extended to copy the htscodecs files (or just the .c/.h files actually needed and the basic README, LICENSE etc) from the referenced repository and commit into a newly mkdired subdirectory within the tarball it's creating. Then the existing codecs.c and its build rules would work exactly as they do in a git checkout.

There are a couple of minor organisational changes that would facilitate all this:
The downside of this is that the codecs would be built in HTSlib by a different system from the one they are built by in the native htscodecs project; in particular, with different compile flags than as determined by the htscodecs build system. OTOH they would be built with the same compile flags as the rest of HTSlib — so there are pros and cons here; it's a tradeoff. My suspicion is that the differences would be minimal anyway. The big advantage is that htscodecs is free to use automake and libtool (boo, hiss 😄) if it wants to, and this will have no effect on HTSlib. However perhaps it would be worth revisiting the htscodecs build system to focus on just test harnesses for the various codecs. [Rob]
[James]
Even though Debian have already jumped the gun and drafted a shared library package from James's draft repository, HTSlib and htscodecs have a choice about whether to facilitate that. It's really a question of managing expectations.

Debian perfectly reasonably likes to debundle separate libraries so that the subproject can be independently updated and the superproject gain from that (getting bug fixes etc). However one point of view here is that htscodecs really isn't a separate library — it's just a collection of source files. There's no intrinsic reason why HTSlib or htscodecs need to facilitate using or building htscodecs as a library, when they claim that htscodecs is just a git-native way of sharing some code between HTSlib and io_lib — a more reliable form of klib's copying files around between projects.

The claim is that by only having HTSlib (and io_lib) with released versions, and each HTSlib release determining the htscodecs submodule commit that it's used with, maintenance and support are simplified. (E.g. the need to maintain yet another stable API is avoided entirely!) Behaviour from bug reports is easier to understand because there isn't a random combination of htslib version and codec version.

This point of view will be more palatable if htscodecs doesn't present itself as a library. I haven't really looked at how io_lib uses it, but if the htscodecs build system was changed to just build a bunch of *.o files that can be linked with their test harnesses, then it would be clear that this code was not intended to be separated as a library. There may be temporary squealing 😄 but I suspect packagers would come around to just using HTSlib's / io_lib's native inclusion of these source files. Particularly as HTSlib/io_lib releases should be frequent enough to keep up with the (in the fullness of time) likely small infrequent updates to the codecs.
It's used by a released version of io_lib, so they haven't jumped the gun. This problem has mainly come about due to our tardiness. I wasn't willing to tie my own releases of io_lib (libstaden_read1 on Debian) to waiting years for htslib PRs to get reviewed, and with a lack of any feedback here I just went with something that worked for me. I'd rather not rename it now either, but it does need moving to the samtools organisation (that's not something I have permission to do). It was named after htslib, which doesn't have a hyphen.
It can be thought of either way. I'm open to all ways for htslib to use it. My aim is simply to avoid duplication. I've had enough of having to support two variants of the same code, and occasionally finding subtle bugs that we'd already fixed in the other copy. One source location is absolutely the right solution. As for the tarball, I think it should just be bundled in with it. Definitely for samtools/bcftools, but maybe also for htslib.
I'm just getting around to working out how to integrate this into HTSlib without the need for a recursive make (which I'm not in favour of, due to the pain it has caused in SAMtools). My current thinking is to have configure options that would allow it to be built using either bundled htscodecs code, or linked to an external library as Debian would like. Some other projects work this way; for example, mplayer can build its own copy of various dependencies or link to them externally. It would be useful, as long as it doesn't add too much complication, if it means distributors don't start adding big patches so they can package htscodecs externally. James should be able to transfer the htscodecs repository to the samtools organisation now.
In anticipation of samtools/htslib#929 merge, which adds a submodule. Also makes Appveyor use `autoreconf -i` when setting up HTSlib as that will be required for autoconf 2.70.
In testing I spotted that v2.1 output still uses the rANS codec, which is incorrect. It does this because rANS is enabled when opening a file if the version is >= 3.0, but we set the version after opening the file and that didn't explicitly disable rANS again. Io_lib's Scramble does these bits in reverse (sets the default version and then opens the file) so it worked fine.
The cram/rANS_static.[ch] implementation has been removed. Updates the build system to make the required parts of htscodecs. Co-Authored-By: Rob Davies <rmd+git@sanger.ac.uk>
Note: this is experimental, subject to the 3.1 spec being agreed upon
by the GA4GH File Formats committee.
Example usage (high compression):
./test/test_view -C -o version=3.1 -o use_fqz -o use_bzip2 -o use_arith \
-o seqs_per_slice=100000 /tmp/30.cram -p /tmp/31-max.cram
Also added support for compression profiles.
Primarily these are CRAM only, but for consistency they're also used as
synonyms for compression levels in bgzf.
In the process of adding CRAM 3.1 several improvements and refactoring
took place to the cram encoder (mainly by dint of copying over the
cram_compress_block function from io_lib):
- Added a "Gzip -1" deflate strategy, as sometimes it's smaller
  than gzip or of comparable size (but considerably faster). This is
  purely for metrics and optimisation.
- Refactored the cram_compress_block function so it can deal with many
more compression method / strategies. It now has a number of lookup
tables which need to be kept synchronised (see comments).
- Added a new cram_compress_block2 function which takes a slice. This
is used for the fqzcomp code to obtain the read lengths.
- cram_compress_block now has a consistency metric too. This is used
  to learn when data isn't changing rapidly, so we aren't too quick
  to re-evaluate.
Some rANS codecs require over 2 Mbytes of temporary space for the encoder/decoder state. On some platforms, notably macOS, the default thread stack size is not big enough to hold this. Previously we fixed this by using malloc instead, but that has other undesirable side effects, such as repeated mmap/munmap and wasting time zeroing pages over and over. Our traditional fix here has been to use pthread_once to get a single malloc and then reuse this block, to avoid the mmap issues in glibc. However this is complicated; ensuring the thread stack is sufficiently big is much easier.
This cannot be avoided! It's spam, but it will be removed once the spec becomes official. If there are no changes, that does mean this htslib will be compliant, but spammy, however we can then make a new release.
Not sure we have any files large enough to really put it through its paces, but this is a reasonable start.
We can read/write CRAM 3.1. We can read CRAM 4.0, but writing 4.0 is corrupted.
Normally, sorted data deltas the positions and unsorted data does not. This allows us to explicitly re-enable the AP=1 container header field to do delta encoding, as it's sometimes useful on unsorted data. E.g. pairs are normally at neighbouring positions, so the positions still compress a bit better than leaving them as-is.
This does some basic analysis to spot grouped variants, and only if the run is long enough and scores positively will it switch to the 'b'(?) feature.
- We no longer keep GZIP_RLE when libdeflate is compiled in, because libdeflate doesn't support strategy in the same manner as zlib. Instead we use GZIP level 1.
- Adjusted the fixed small-block compensation code. It's now equivalent to a 60 byte difference required before changing up to the next tier in complexity. This had no impact on my NovaSeq test normally, but with bzip2+lzma added in it saved 6% CPU (at a cost of 0.14% file growth).
Same as io_lib's eda7cfe and 16c19ee. This replaces the length field in containers with a variable size quantity instead of 32-bit int, for consistency with other fields, and marks a few fields as using a signed encoding strategy (necessary for the next commit). [The CRAM 4 specification moved on slightly since the original date of this PR.]
This is the htslib equivalent to io_lib's 090dfd3f and d86b008. It is now in adherence to the draft CRAM 4 spec. It removes the E_SINT and E_SLONG types and replaces E_EXTERNAL with E_BYTE, E_VARINT_SIGNED and E_VARINT_UNSIGNED. Integer data types aren't explicitly signed or unsigned, but simply use the codec that works for them. So if all ref-ids are >= 0 it'll be VARINT_UNSIGNED, otherwise VARINT_SIGNED. This future-proofs things by adding introspection to the format. Also removed HUFFMAN from CRAM 4. It was only being used for storing constant values, so we now have CONST_BYTE and CONST_INT instead (CONST_INT always uses a signed encoding). This should be the final removal of ever encoding negatives as if they were positives (e.g. -1 as 0xFFFFFFFF), and so in turn removes all size limitations on values.
This has been replaced by an "unpackable" flag. This avoids cram_compress_block2 needing to access metrics->stats, which is a pointer to the container->stats struct. In rare cases this could be freed before the compression completes.
Also the public version has had the enum members renamed to include an HTS_ prefix. At present they appear to be unused, although it's hard to know for sure with some of them, due to the common hits you get all over github. Leaving them as they are though is just waiting for a hard problem to arrive. This commit introduces the notion of HTS_COMPAT. The ABI hasn't changed, but the API has (albeit in a way we believe shouldn't affect people). It's a 3+ digit number, with the last two digits being the minor version and the earlier digits the major version. So htslib version 1.12 would be 112. The intention is simply to permit packages with dependencies on specific earlier API names to still compile with the minimum of changes (even if it's just by overriding CFLAGS).
They are included by adding extra dependencies to the "test" and "check" targets when htscodecs is bundled via the submodule.
So that anyone who tries to build the library without setting up the submodule first will get a hint on how to make the build work. The check isn't really needed for anyone building from a release tarball, but it will be harmless as it should always succeed (as long as the tarball was built correctly and includes the htscodecs files).
Currently, if the submodule files are missing, it asks the user to run "git submodule update" to install them. It's possible that the Makefile could run this itself and continue, although I'm not sure if that would be completely safe in parallel builds. The variables listing dependencies for htscodecs header files are moved to htscodecs_bundled.mk to ensure they are only set for submodule builds.
I've squashed in a couple of the bugfix commits, and rebased so this will merge cleanly again.
submodule. Also makes everything use autoreconf -i when setting up HTSlib as that will be required for autoconf 2.70. Note that both Travis and cirrus-ci use the .travis/clone script.
In anticipation of samtools/htslib#929 merge, which adds a submodule. Also makes everything use autoreconf -i when setting up HTSlib as that will be required for autoconf 2.70. Increases the memory limit for the address sanitizer cirrus-ci test. While the average memory used seems to be low, occasionally some test processes may get big enough to trigger an out of memory killer. Note that both Travis and cirrus-ci use the .travis/clone script.
The samtools and bcftools tests should now work with the submodule, so I'll merge this tomorrow morning unless there are any last-minute hitches.
This is here as demonstration of how to add this into Htslib (in addition to the existing support in Scramble).
It uses htscodecs as a submodule, but compiles and links the code directly into htslib so there isn't a second library hanging around. This now means scramble and htslib share the same codec implementation and we don't accidentally only fix bugs in one place.
Unlike Scramble there is nothing to actively warn that this is a technology test of CRAM 3.1 and not an official GA4GH standard. I could add this, but I wasn't envisioning merging it until it's an adopted standard anyway. (It's sufficient to be able to state there is support for this format in Samtools and hence maybe get more people looking into assessing the new spec.)