Adds linear index support by jkbonfield · Pull Request #1031 · samtools/htslib

jkbonfield · 2020-02-21T17:33:38Z

Adds BAI linear index support for standard iterators. This uses a bit more memory as it doesn't discard the BAI linear index, but permits it to be used for optimising range queries. The effect is minimal on deep data, but profound on shallow.
Also adds BAI linear index support for multi-region iterators. This is harder and requires some rather ghastly hacks to avoid changing the ABI. It's possible there are better methods using the off[] array, but this is rather complex and therefore liable to be error prone given the substantial disconnect between intervals and offsets, which can be many to many and also include partial overlaps - not to mention also being in potentially different orders.
Finally fixes a pre-existing bug on multi-region iterators for CRAM where it could terminate early and omit records.

Benchmarks, from the first commit message for normal iterator:

Indexer     Branch   Time(s) BAM I/O(Mb)  Idx I/O(Mb)
-----------------------------------------------------
htslib-BAI  develop  1.488   164.1        2.43
htsjdk-BAI  develop  0.502    53.6        8.27
htslib-BAI  this     0.498    53.5        2.43
htsjdk-BAI  this     0.530    53.3        8.27
htslib-CSI  this     1.532   163.3        0.49

For multi-region iterator (NB many more regions included in this test):

Indexer     Branch   Time(s) BAM I/O(Mb)  Idx I/O(Mb)
-----------------------------------------------------
htslib-BAI  develop  15.184  1397.0       2.43
htslib-BAI  this      2.991   305.2       2.43

It looks stark, but remember this is due to shallow data - around 2x coverage. For deep data the difference is minimal.

Add method `cram_index_query_last` to retrieve the last slice overlapping a particular end coordinate.

cram_index_query() returns the first bin overlapping the queried position. If the end position is present in two bins, then we wrongly set off[].v to the first bin instead of the last bin. We now scan along all bins starting at the end query to find the last one overlapping it. Example failure on a seqs_per_slice=1000 CRAM copy formatted file: samtools view -M -c 9827_2#49.1k.cram 19:49678007-49685237 (was 104, should be 105)

Previously it used the bin "loff" field for BAI, to match the CSI data structure (which no longer has a linear index). Unfortunately when coupled with the compress_binning() function this causes BAI to seek to data earlier than is necessary when bins have been compressed. This generally only affects low coverage data set and has minimal impact on deep data as the bins are too large to remove. For CSI we have no option of using the linear index, so this does not change behaviour. Sadly this means CSI indices are less efficient to use than BAI for many small queries on shallow data sets. Some benchmarks on a shallow 1.9x coverage data set, using 1000 randomly picked regions on chr1 each between 1 bp and 10k bp in size. Indexer Branch Time(s) BAM I/O(Mb) Idx I/O(Mb) ----------------------------------------------------- htslib-BAI develop 1.488 164.1 2.43 htsjdk-BAI develop 0.502 53.6 8.27 htslib-BAI this 0.498 53.5 2.43 htsjdk-BAI this 0.530 53.3 8.27 htslib-CSI this 1.532 163.3 0.49 For completeness, CRAM with varying seqs_per_slice of 300, 1000 or the default of 10000, demonstrating the usefulness of changing this if you need fine grained random access. CRAM-s10k this 22.12 602.1 0.11 CRAM-s1k this 6.062 92.7 0.93 CRAM-s300 this 5.115 42.1 2.98 Htslib BAI index, samtools 1.10 ------------------------------- $ io_trace -S -r 9827_2 -- samtools1.10 view -c 9827_2#49.bam##idx##9827_2#49.bam.bai $R File: 9827_2#49.bam Num. open 1 Num. close 1 Num. seeks 969 Num. read 6040 Bytes read 164144977 File: 9827_2#49.bam.bai Num. open 1 Num. close 1 Num. read 40 Bytes read 2431088 real 0m1.488s user 0m1.448s sys 0m0.036s Htslib BAI index, this commit ----------------------------- io_trace -S -r 9827_2 -- samtools view -c 9827_2#49.bam##idx##9827_2#49.bam.bai $R File: 9827_2#49.bam Num. open 1 Num. close 1 Num. seeks 951 Num. read 1911 Bytes read 53535251 File: 9827_2#49.bam.bai Num. open 1 Num. close 1 Num. read 40 Bytes read 2431088 real 0m0.498s user 0m0.472s sys 0m0.020s

Revert change to hts_itr_t::off in bfc9f0d so the `max` field can be used to store information about which interval goes with each entry. Don't merge hts_itr_t::off entries for different intervals. Make hts_itr_multi_next() skip offsets that will return no data as they are for intervals that have already been completed. Deal with possible failure of reg2intervals() in hts_itr_multi_bam().

daviesrob · 2020-02-26T10:31:46Z

Rebased with modified multi-region iterator code. The new version passes the information needed to skip file regions in the off[] array, so it keeps within the existing ABI. It also skips a few more locations as it's better at eliminating regions from longer indexing bins that are not really needed.

valeriuo · 2020-03-02T11:24:56Z

hts.c

-                if (iter->seek(fp, iter->curr_off, SEEK_SET) < 0) {
-                    hts_log_error("Seek at offset %" PRIu64 " failed.", iter->curr_off);
-                    return -1;
+            } else if (iter->i < iter->n_off) {


This check is now redundant, but the compiler takes care of it, when using -O2.

valeriuo · 2020-03-02T11:27:46Z

Merged into develop as 059abb9 to 6149ea6

jmarshall · 2020-09-23T13:45:32Z

Note that this changes behaviour for some data files. Specifically: unmapped reads are not added to the linear index, so e.g. if the first read in a query region is a placed unmapped read (i.e., it's filled in with its mate's coordinate) and that read is earlier in the file than the first mapped read in the same linear interval — then using the BAI linear index will cause that first read to be omitted from the iteration output as it skips over it to the linear interval's first mapped read.

Unmapped reads stopped being added to the linear index in samtools/samtools@cb5f449 for not entirely clear reasons.

This has made seven pysam test cases fail (when pysam is updated to use htslib-1.11) as they no longer match their expected output. (See tests/pysam_data/ex2.bam and e.g. testBAM2BAM.) The solution for pysam will probably be to adjust the test data file so the first read isn't unmapped (as this foible isn't what the test case is trying to check!), but the difference may be worth considering in general…

jkbonfield · 2020-09-23T13:53:31Z

Eww, so this isn't so much a flaw in the idea of using the linear index for optimisation, but a bug in our handling of unmapped reads. However that's presumably something which gets baked into the file written to disk, so fixing that bug won't magically fix all those broken BAMs.

I wonder what the reasoning was, or whether it was even intentional. If I'm reading your comment correctly then A-mapped followed by A-placed is fine, while A-placed followed by A-mapped is not as A-placed may be omitted. I'm not sure what the default behaviour is for aligners, but I'm hoping it's the former and this is just a quirk of test data rather than a real-world scenario. Needs checking.

Do we know what htsjdk does? Was this change in behaviour something discussed with all implementors, or just a whim? Annoyingly there's nothing about the change in the commit message, but 2011 was rather wild-west era.

jmarshall · 2020-09-23T14:08:34Z

Yes, I'm inclined to agree that the problem is in index generation. The read that gets omitted and the first-mapped-read-in-interval don't need to be related to each other in any way other than happening to be in the same 2¹⁴ base linear interval, so it's not down to an unlucky quirk or aligner behaviour.

As you say, you get different results if the reads happen to be in the file in this-then-that order compared to morally the same data in that-then-this order. So I guess I'll raise this as an issue…

Haven't checked yet what HTSJDK does. I would be surprised if back in 2011 this wasn't just a slightly unfortunate bug fix to prevent calling insert_offset2(…, -1, …) in the unplaced unmapped case.

jmarshall · 2020-09-23T14:24:53Z

Issue raised, and probably a pretty simple fix (what to do with the other use of is_mapped needs thinking about though). I started testing pysam with 1.11-to-be last week; pity I didn't track down the cause of the test failures a bit sooner… 😄

valeriuo and others added 5 commits February 25, 2020 13:50

Add cram_index_query_last

382bd4c

Add method `cram_index_query_last` to retrieve the last slice overlapping a particular end coordinate.

Add BAI linear index usage for multi-region iterator

d6c07e2

daviesrob force-pushed the linear_index branch from 50cb257 to b037ad1 Compare February 26, 2020 10:25

valeriuo approved these changes Mar 2, 2020

View reviewed changes

valeriuo reviewed Mar 2, 2020

View reviewed changes

valeriuo closed this Mar 2, 2020

valeriuo mentioned this pull request Jul 15, 2020

Linear index is unused when querying with a .bai #557

Closed

jmarshall mentioned this pull request Sep 23, 2020

Unmapped reads are not added to the BAI linear index #1142

Closed

This was referenced Oct 17, 2020

Test failures with samtools/htslib/bcftools 1.11 pysam-developers/pysam#961

Closed

Update imports to htslib/samtools/bcftools 1.11 pysam-developers/pysam#962

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Adds linear index support#1031

Adds linear index support#1031
jkbonfield wants to merge 5 commits intosamtools:developfrom
jkbonfield:linear_index

jkbonfield commented Feb 21, 2020

Uh oh!

daviesrob commented Feb 26, 2020

Uh oh!

valeriuo Mar 2, 2020

Uh oh!

valeriuo commented Mar 2, 2020

Uh oh!

jmarshall commented Sep 23, 2020 •

edited

Loading

Uh oh!

jkbonfield commented Sep 23, 2020

Uh oh!

jmarshall commented Sep 23, 2020

Uh oh!

jmarshall commented Sep 23, 2020

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

Conversation

jkbonfield commented Feb 21, 2020

Uh oh!

daviesrob commented Feb 26, 2020

Uh oh!

valeriuo Mar 2, 2020

Choose a reason for hiding this comment

Uh oh!

valeriuo commented Mar 2, 2020

Uh oh!

jmarshall commented Sep 23, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

jkbonfield commented Sep 23, 2020

Uh oh!

jmarshall commented Sep 23, 2020

Uh oh!

jmarshall commented Sep 23, 2020

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

jmarshall commented Sep 23, 2020 •

edited

Loading