Conversation
Add method `cram_index_query_last` to retrieve the last slice overlapping a particular end coordinate.
cram_index_query() returns the first bin overlapping the queried
position. If the end position is present in two bins, then we wrongly
set off[].v to the first bin instead of the last bin. We now scan
along all bins starting at the end query to find the last one
overlapping it.
Example failure on a seqs_per_slice=1000 CRAM copy formatted file:
samtools view -M -c 9827_2#49.1k.cram 19:49678007-49685237
(was 104, should be 105)
Previously it used the bin "loff" field for BAI, to match the CSI data
structure (which no longer has a linear index). Unfortunately when
coupled with the compress_binning() function this causes BAI to seek
to data earlier than is necessary when bins have been compressed.
This generally only affects low coverage data set and has minimal
impact on deep data as the bins are too large to remove.
For CSI we have no option of using the linear index, so this does not
change behaviour. Sadly this means CSI indices are less efficient to
use than BAI for many small queries on shallow data sets.
Some benchmarks on a shallow 1.9x coverage data set, using 1000
randomly picked regions on chr1 each between 1 bp and 10k bp in size.
Indexer Branch Time(s) BAM I/O(Mb) Idx I/O(Mb)
-----------------------------------------------------
htslib-BAI develop 1.488 164.1 2.43
htsjdk-BAI develop 0.502 53.6 8.27
htslib-BAI this 0.498 53.5 2.43
htsjdk-BAI this 0.530 53.3 8.27
htslib-CSI this 1.532 163.3 0.49
For completeness, CRAM with varying seqs_per_slice of 300, 1000 or the
default of 10000, demonstrating the usefulness of changing this if you
need fine grained random access.
CRAM-s10k this 22.12 602.1 0.11
CRAM-s1k this 6.062 92.7 0.93
CRAM-s300 this 5.115 42.1 2.98
Htslib BAI index, samtools 1.10
-------------------------------
$ io_trace -S -r 9827_2 -- samtools1.10 view -c 9827_2#49.bam##idx##9827_2#49.bam.bai $R
File: 9827_2#49.bam
Num. open 1
Num. close 1
Num. seeks 969
Num. read 6040
Bytes read 164144977
File: 9827_2#49.bam.bai
Num. open 1
Num. close 1
Num. read 40
Bytes read 2431088
real 0m1.488s
user 0m1.448s
sys 0m0.036s
Htslib BAI index, this commit
-----------------------------
io_trace -S -r 9827_2 -- samtools view -c 9827_2#49.bam##idx##9827_2#49.bam.bai $R
File: 9827_2#49.bam
Num. open 1
Num. close 1
Num. seeks 951
Num. read 1911
Bytes read 53535251
File: 9827_2#49.bam.bai
Num. open 1
Num. close 1
Num. read 40
Bytes read 2431088
real 0m0.498s
user 0m0.472s
sys 0m0.020s
Revert change to hts_itr_t::off in bfc9f0d so the `max` field can be used to store information about which interval goes with each entry. Don't merge hts_itr_t::off entries for different intervals. Make hts_itr_multi_next() skip offsets that will return no data as they are for intervals that have already been completed. Deal with possible failure of reg2intervals() in hts_itr_multi_bam().
50cb257 to
b037ad1
Compare
|
Rebased with modified multi-region iterator code. The new version passes the information needed to skip file regions in the |
| if (iter->seek(fp, iter->curr_off, SEEK_SET) < 0) { | ||
| hts_log_error("Seek at offset %" PRIu64 " failed.", iter->curr_off); | ||
| return -1; | ||
| } else if (iter->i < iter->n_off) { |
There was a problem hiding this comment.
This check is now redundant, but the compiler takes care of it, when using -O2.
|
Note that this changes behaviour for some data files. Specifically: unmapped reads are not added to the linear index, so e.g. if the first read in a query region is a placed unmapped read (i.e., it's filled in with its mate's coordinate) and that read is earlier in the file than the first mapped read in the same linear interval — then using the BAI linear index will cause that first read to be omitted from the iteration output as it skips over it to the linear interval's first mapped read. Unmapped reads stopped being added to the linear index in samtools/samtools@cb5f449 for not entirely clear reasons. This has made seven pysam test cases fail (when pysam is updated to use htslib-1.11) as they no longer match their expected output. (See tests/pysam_data/ex2.bam and e.g. |
|
Eww, so this isn't so much a flaw in the idea of using the linear index for optimisation, but a bug in our handling of unmapped reads. However that's presumably something which gets baked into the file written to disk, so fixing that bug won't magically fix all those broken BAMs. I wonder what the reasoning was, or whether it was even intentional. If I'm reading your comment correctly then A-mapped followed by A-placed is fine, while A-placed followed by A-mapped is not as A-placed may be omitted. I'm not sure what the default behaviour is for aligners, but I'm hoping it's the former and this is just a quirk of test data rather than a real-world scenario. Needs checking. Do we know what htsjdk does? Was this change in behaviour something discussed with all implementors, or just a whim? Annoyingly there's nothing about the change in the commit message, but 2011 was rather wild-west era. |
|
Yes, I'm inclined to agree that the problem is in index generation. The read that gets omitted and the first-mapped-read-in-interval don't need to be related to each other in any way other than happening to be in the same 214 base linear interval, so it's not down to an unlucky quirk or aligner behaviour. As you say, you get different results if the reads happen to be in the file in this-then-that order compared to morally the same data in that-then-this order. So I guess I'll raise this as an issue… Haven't checked yet what HTSJDK does. I would be surprised if back in 2011 this wasn't just a slightly unfortunate bug fix to prevent calling |
|
Issue raised, and probably a pretty simple fix (what to do with the other use of |
Adds BAI linear index support for standard iterators. This uses a bit more memory as it doesn't discard the BAI linear index, but permits it to be used for optimising range queries. The effect is minimal on deep data, but profound on shallow.
Also adds BAI linear index support for multi-region iterators. This is harder and requires some rather ghastly hacks to avoid changing the ABI. It's possible there are better methods using the off[] array, but this is rather complex and therefore liable to be error prone given the substantial disconnect between intervals and offsets, which can be many to many and also include partial overlaps - not to mention also being in potentially different orders.
Finally fixes a pre-existing bug on multi-region iterators for CRAM where it could terminate early and omit records.
Benchmarks, from the first commit message for normal iterator:
For multi-region iterator (NB many more regions included in this test):
It looks stark, but remember this is due to shallow data - around 2x coverage. For deep data the difference is minimal.