Skip to content

Adds linear index support#1031

Closed
jkbonfield wants to merge 5 commits intosamtools:developfrom
jkbonfield:linear_index
Closed

Adds linear index support#1031
jkbonfield wants to merge 5 commits intosamtools:developfrom
jkbonfield:linear_index

Conversation

@jkbonfield
Copy link
Copy Markdown
Contributor

  • Adds BAI linear index support for standard iterators. This uses a bit more memory as it doesn't discard the BAI linear index, but permits it to be used for optimising range queries. The effect is minimal on deep data, but profound on shallow.

  • Also adds BAI linear index support for multi-region iterators. This is harder and requires some rather ghastly hacks to avoid changing the ABI. It's possible there are better methods using the off[] array, but this is rather complex and therefore liable to be error prone given the substantial disconnect between intervals and offsets, which can be many to many and also include partial overlaps - not to mention also being in potentially different orders.

  • Finally fixes a pre-existing bug on multi-region iterators for CRAM where it could terminate early and omit records.

Benchmarks, from the first commit message for normal iterator:

Indexer     Branch   Time(s) BAM I/O(Mb)  Idx I/O(Mb)
-----------------------------------------------------
htslib-BAI  develop  1.488   164.1        2.43
htsjdk-BAI  develop  0.502    53.6        8.27
htslib-BAI  this     0.498    53.5        2.43
htsjdk-BAI  this     0.530    53.3        8.27
htslib-CSI  this     1.532   163.3        0.49

For multi-region iterator (NB many more regions included in this test):

Indexer     Branch   Time(s) BAM I/O(Mb)  Idx I/O(Mb)
-----------------------------------------------------
htslib-BAI  develop  15.184  1397.0       2.43
htslib-BAI  this      2.991   305.2       2.43

It looks stark, but remember this is due to shallow data - around 2x coverage. For deep data the difference is minimal.

valeriuo and others added 5 commits February 25, 2020 13:50
Add method `cram_index_query_last` to retrieve the last
slice overlapping a particular end coordinate.
cram_index_query() returns the first bin overlapping the queried
position.  If the end position is present in two bins, then we wrongly
set off[].v to the first bin instead of the last bin.  We now scan
along all bins starting at the end query to find the last one
overlapping it.

Example failure on a seqs_per_slice=1000 CRAM copy formatted file:

    samtools view -M -c 9827_2#49.1k.cram 19:49678007-49685237
    (was 104, should be 105)
Previously it used the bin "loff" field for BAI, to match the CSI data
structure (which no longer has a linear index).  Unfortunately when
coupled with the compress_binning() function this causes BAI to seek
to data earlier than is necessary when bins have been compressed.
This generally only affects low coverage data set and has minimal
impact on deep data as the bins are too large to remove.

For CSI we have no option of using the linear index, so this does not
change behaviour.  Sadly this means CSI indices are less efficient to
use than BAI for many small queries on shallow data sets.

Some benchmarks on a shallow 1.9x coverage data set, using 1000
randomly picked regions on chr1 each between 1 bp and 10k bp in size.

Indexer     Branch   Time(s) BAM I/O(Mb)  Idx I/O(Mb)
-----------------------------------------------------
htslib-BAI  develop  1.488   164.1        2.43
htsjdk-BAI  develop  0.502    53.6        8.27
htslib-BAI  this     0.498    53.5        2.43
htsjdk-BAI  this     0.530    53.3        8.27
htslib-CSI  this     1.532   163.3        0.49

For completeness, CRAM with varying seqs_per_slice of 300, 1000 or the
default of 10000, demonstrating the usefulness of changing this if you
need fine grained random access.

CRAM-s10k   this     22.12   602.1	  0.11
CRAM-s1k    this     6.062    92.7	  0.93
CRAM-s300   this     5.115    42.1        2.98

Htslib BAI index, samtools 1.10
-------------------------------

$ io_trace -S -r 9827_2 -- samtools1.10 view -c 9827_2#49.bam##idx##9827_2#49.bam.bai $R

File: 9827_2#49.bam
    Num. open           1
    Num. close          1
    Num. seeks          969
    Num. read           6040
    Bytes read          164144977

File: 9827_2#49.bam.bai
    Num. open           1
    Num. close          1
    Num. read           40
    Bytes read          2431088

real    0m1.488s
user    0m1.448s
sys     0m0.036s

Htslib BAI index, this commit
-----------------------------

io_trace -S -r 9827_2 -- samtools view -c 9827_2#49.bam##idx##9827_2#49.bam.bai $R

File: 9827_2#49.bam
    Num. open           1
    Num. close          1
    Num. seeks          951
    Num. read           1911
    Bytes read          53535251

File: 9827_2#49.bam.bai
    Num. open           1
    Num. close          1
    Num. read           40
    Bytes read          2431088

real    0m0.498s
user    0m0.472s
sys     0m0.020s
Revert change to hts_itr_t::off in bfc9f0d so the `max` field
can be used to store information about which interval goes
with each entry.

Don't merge hts_itr_t::off entries for different intervals.

Make hts_itr_multi_next() skip offsets that will return no
data as they are for intervals that have already been completed.

Deal with possible failure of reg2intervals() in
hts_itr_multi_bam().
@daviesrob
Copy link
Copy Markdown
Member

Rebased with modified multi-region iterator code. The new version passes the information needed to skip file regions in the off[] array, so it keeps within the existing ABI. It also skips a few more locations as it's better at eliminating regions from longer indexing bins that are not really needed.

if (iter->seek(fp, iter->curr_off, SEEK_SET) < 0) {
hts_log_error("Seek at offset %" PRIu64 " failed.", iter->curr_off);
return -1;
} else if (iter->i < iter->n_off) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This check is now redundant, but the compiler takes care of it, when using -O2.

@valeriuo
Copy link
Copy Markdown
Contributor

valeriuo commented Mar 2, 2020

Merged into develop as 059abb9 to 6149ea6

@jmarshall
Copy link
Copy Markdown
Member

jmarshall commented Sep 23, 2020

Note that this changes behaviour for some data files. Specifically: unmapped reads are not added to the linear index, so e.g. if the first read in a query region is a placed unmapped read (i.e., it's filled in with its mate's coordinate) and that read is earlier in the file than the first mapped read in the same linear interval — then using the BAI linear index will cause that first read to be omitted from the iteration output as it skips over it to the linear interval's first mapped read.

Unmapped reads stopped being added to the linear index in samtools/samtools@cb5f449 for not entirely clear reasons.

This has made seven pysam test cases fail (when pysam is updated to use htslib-1.11) as they no longer match their expected output. (See tests/pysam_data/ex2.bam and e.g. testBAM2BAM.) The solution for pysam will probably be to adjust the test data file so the first read isn't unmapped (as this foible isn't what the test case is trying to check!), but the difference may be worth considering in general…

@jkbonfield
Copy link
Copy Markdown
Contributor Author

Eww, so this isn't so much a flaw in the idea of using the linear index for optimisation, but a bug in our handling of unmapped reads. However that's presumably something which gets baked into the file written to disk, so fixing that bug won't magically fix all those broken BAMs.

I wonder what the reasoning was, or whether it was even intentional. If I'm reading your comment correctly then A-mapped followed by A-placed is fine, while A-placed followed by A-mapped is not as A-placed may be omitted. I'm not sure what the default behaviour is for aligners, but I'm hoping it's the former and this is just a quirk of test data rather than a real-world scenario. Needs checking.

Do we know what htsjdk does? Was this change in behaviour something discussed with all implementors, or just a whim? Annoyingly there's nothing about the change in the commit message, but 2011 was rather wild-west era.

@jmarshall
Copy link
Copy Markdown
Member

Yes, I'm inclined to agree that the problem is in index generation. The read that gets omitted and the first-mapped-read-in-interval don't need to be related to each other in any way other than happening to be in the same 214 base linear interval, so it's not down to an unlucky quirk or aligner behaviour.

As you say, you get different results if the reads happen to be in the file in this-then-that order compared to morally the same data in that-then-this order. So I guess I'll raise this as an issue…

Haven't checked yet what HTSJDK does. I would be surprised if back in 2011 this wasn't just a slightly unfortunate bug fix to prevent calling insert_offset2(…, -1, …) in the unplaced unmapped case.

@jmarshall
Copy link
Copy Markdown
Member

Issue raised, and probably a pretty simple fix (what to do with the other use of is_mapped needs thinking about though). I started testing pysam with 1.11-to-be last week; pity I didn't track down the cause of the test failures a bit sooner… 😄

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants