Adjust the starting offset of the iterator for unmapped placed reads #1155

valeriuo wants to merge 1 commit into samtools:develop
Conversation
This seems to more or less just be the same as undoing 45715e4, plus some added CPU expense to boot.

Some benchmarks on Samtools 1.10 (before the linear index commit), the Samtools dev branch without this commit, and "samtools", which is a checkout of this PR. I/O wise we get the following from io_trace:

It does however help when using the multi-region iterator. I'll do more research and post those results separately.

So sorry, I don't think this is the answer either. :( It looks like it's basically a CPU-expensive way of avoiding using the linear index. However I'm sure there's a way, because htsjdk manages this fine. I haven't looked at how it does it yet though.

EDIT: Argh - my current PR build was without libdeflate for some reason, unlike the other two. Hence probably the same I/O and same CPU. I'll redo that test.
daviesrob left a comment
Unfortunately this does erase much of the benefit from #1031 for alignments with placed, unmapped reads. It is possible to get some, but not all, of the speed back by adjusting the `min_off` value in `hts_itr_query()` like this:
```c
if (idx->lidx[tid].offset
    && beg>>idx->min_shift < idx->lidx[tid].n
    && min_off < idx->lidx[tid].offset[beg>>idx->min_shift]) {
    hts_pos_t bidx = beg>>idx->min_shift;
    min_off = idx->lidx[tid].offset[bidx];
    if (unmapped) {
        while (bidx > 0) {
            if (idx->lidx[tid].offset[bidx] < min_off) break;
            bidx--;
        }
        min_off = idx->lidx[tid].offset[bidx];
    }
}
```
It searches backwards for the previous linear offset position, on the grounds that there must be at least one mapped read between there and the one we started at.
There may well be better ways to do it, and the above code may not be completely correct.
The comments of course apply to the `hts_itr_multi_bam()` version as well.
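The backward search described above can be illustrated with a self-contained sketch. The types and names here (`loff` array, `start_offset`) are simplified stand-ins for illustration, not htslib's actual structures:

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative stand-in for the linear index: loff[i] is the file offset
   of the first alignment overlapping window i (windows of 2^min_shift bp). */
static uint64_t start_offset(const uint64_t *loff, int n, int bidx,
                             uint64_t min_off, int unmapped) {
    if (bidx >= n) return min_off;
    if (min_off < loff[bidx]) {
        min_off = loff[bidx];
        if (unmapped) {
            /* Walk back to the previous distinct linear offset: any
               placed-unmapped reads must follow at least one mapped
               read recorded at that earlier position. */
            while (bidx > 0 && loff[bidx] >= min_off) bidx--;
            min_off = loff[bidx];
        }
    }
    return min_off;
}
```

The cost is linear in the number of identical trailing offsets, which is where the extra CPU expense for sparse data comes from.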
```c
        } while (bin);
        if (bin == 0) k = kh_get(bin, bidx, bin);
        min_off = k != kh_end(bidx)? kh_val(bidx, k).loff : 0;
        if (idx->lidx[tid].offset
```
In tests on a sparse file, it looks like this bit is still needed for BAI/TBI indexes. It appears that you quite often bubble up to the parent bin, which has a much smaller loff. I'm guessing that this is because either the bin was merged with its parent; or it didn't exist at all because there was no data in the bin at the start of the region I requested so it bubbled up until it found one with some coverage.
```c
        }

        k = kh_get(bin, bidx, META_BIN(idx));
        if (k != kh_end(bidx)) unmapped = 1;
```
This should test for `kh_val(h, k).list[1].v > 0` if the META_BIN exists. See `hts_idx_get_stat()`.
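The suggested check can be sketched with a simplified stand-in for the meta pseudo-bin record (the `pair_t`/`meta_bin_t` types here are illustrative, not htslib's khash-backed structures; the `u`/`v` layout of `list[1]` follows `hts_idx_get_stat()`):

```c
#include <assert.h>
#include <stdint.h>

/* Simplified stand-in for the META_BIN record: list[1] holds the mapped (u)
   and unmapped (v) read counts, mirroring hts_idx_get_stat(). */
typedef struct { uint64_t u, v; } pair_t;
typedef struct { pair_t list[2]; int present; } meta_bin_t;

static int has_placed_unmapped(const meta_bin_t *m) {
    /* The mere existence of the meta bin is not enough: the bin is written
       whenever statistics are recorded, so require a non-zero unmapped count. */
    return m->present && m->list[1].v > 0;
}
```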
With the multi-region iterator things are better, as indicated above. Some timings (in less gory detail):

Sparse: averaging 1 region every 110Kbp of reference. Dense: averaging 1 region every 1.2Kbp of reference.

So it's fastest of all on the dense querying. However note that bytes read vs CPU time aren't well correlated. 1.10 vs the dev branch shows more bytes read in dev but much less CPU. I'm guessing this is in part the difference between read-and-decompress without the linear index, vs read-decompress; read-discard (skip); read-decompress. We'd need to profile it and count function calls to know what load it's putting on the decompressor.

Either way, the difference between this PR and dev is small compared to either of them vs 1.10. So in this case the PR works well if we have a densely populated set of regions.
HTSJDK has a slightly different approach: it filters out the iterator chunks that end before the minimum offset of the linear index, and doesn't update the starting offset of any chunk.
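The HTSJDK approach described above could be sketched roughly as follows. The `chunk_t` type and `filter_chunks` name are hypothetical, and the treatment of a chunk ending exactly at the minimum offset is an assumption, not taken from either codebase:

```c
#include <assert.h>
#include <stdint.h>

typedef struct { uint64_t beg, end; } chunk_t; /* BGZF virtual file offsets */

/* Drop chunks that end before the linear-index minimum offset, but leave
   the start offsets of the survivors untouched, so placed-unmapped reads
   sitting at the head of a chunk are not skipped. Compacts in place and
   returns the new chunk count. */
static int filter_chunks(chunk_t *chunks, int n, uint64_t min_off) {
    int i, j = 0;
    for (i = 0; i < n; i++)
        if (chunks[i].end >= min_off)
            chunks[j++] = chunks[i];
    return j;
}
```

This trades a little extra reading (whole surviving chunks) for never clipping a chunk start past an unindexed record.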
Replaced by #1158 |
Unmapped placed reads can be skipped by an iterator read when they fall at the beginning of a BGZF block, since they are not added to the linear index. However, the meta-bin of the binning index records the presence of unmapped reads, and can indicate whether the iterator should start reading from the beginning of a bin chunk (which includes the unmapped reads) or from the linear index offset.
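The start-offset decision described above amounts to something like the following sketch (simplified; `iter_start` and its parameters are illustrative names, not htslib's actual API):

```c
#include <assert.h>
#include <stdint.h>

/* Pick where the iterator should begin reading within a bin chunk.
   Placed-unmapped reads are absent from the linear index, so when the
   meta-bin reports that unmapped reads exist we must not clip the chunk
   start forward to the linear-index offset. */
static uint64_t iter_start(uint64_t chunk_beg, uint64_t linear_off,
                           int has_unmapped) {
    if (has_unmapped) return chunk_beg;
    return chunk_beg > linear_off ? chunk_beg : linear_off;
}
```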
Also, the code that sets `bins_t::loff` to the `lidx_t::offset` was removed, as this check is already performed inside the `update_loff` method. Fixes #1142