
Adjust the starting offset of the iterator for unmapped placed reads #1155

Closed
valeriuo wants to merge 1 commit into samtools:develop from valeriuo:iter_unmapped2

Conversation

@valeriuo (Contributor) commented Oct 8, 2020

Unmapped placed reads can be skipped by an iterator read, when they fall at the beginning of a BGZF block, since they are not added to the linear index. However, the meta-bin from the bin index records the presence of unmapped reads, and can indicate whether the iterator should start reading from the beginning of a bin chunk, which includes the unmapped reads, or from the linear index offset.

Also, the code that set bins_t::loff to lidx_t::offset has been removed, as that check is already performed inside the update_loff method.

Fixes #1142

@jkbonfield (Contributor) commented Oct 12, 2020

This seems to be more or less the same as undoing 45715e4, plus some added CPU expense to boot.

Some benchmarks comparing Samtools 1.10 ("samtools110", before the linear index commit), the Samtools develop branch without this commit ("samtools.dev"), and "samtools", which is a checkout of this PR.

@ deskpro107386[samtools.../samtools]; R=`perl -e 'srand(0);for ($i=0;$i<50000;$i++) {my $p = int(rand(60000000));$e=$p+int(rand(1000));print "20:$p-$e "}'`

@ deskpro107386[samtools.../samtools]; time samtools110 view -c ~/scratch/data/9827_2#49.bam $R
500590
real	0m45.141s
user	0m44.148s
sys	0m0.987s


@ deskpro107386[samtools.../samtools]; time /tmp/samtools.dev view -c ~/scratch/data/9827_2#49.bam $R
500590
real	0m14.350s
user	0m14.069s
sys	0m0.280s


@ deskpro107386[samtools.../samtools]; time samtools view -c ~/scratch/data/9827_2#49.bam $R
500590
real	0m53.349s
user	0m52.236s
sys	0m1.112s

I/O wise we get the following from io_trace:

@ deskpro107386[samtools.../samtools]; io_trace -r 'bam$' -S -- samtools110 view -c ~/scratch/data/9827_2#49.bam $R
...
    Num. seeks   	51656
    Num. read    	294728
    Bytes read   	8084384944

@ deskpro107386[samtools.../samtools]; io_trace -r 'bam$' -S -- /tmp/samtools.dev view -c ~/scratch/data/9827_2#49.bam $R
...
    Num. seeks   	47493
    Num. read    	80083
    Bytes read   	2268340052

@ deskpro107386[samtools.../samtools]; io_trace -r 'bam$' -S -- samtools view -c ~/scratch/data/9827_2#49.bam $R
...
    Num. seeks   	51656
    Num. read    	294728
    Bytes read   	8084384944

It does however help when using the multi-region iterator. I'll do more research and post those results separately.

So sorry, I don't think this is the answer either. :( It looks like it's basically a CPU-expensive way of avoiding the use of the linear index. However, I'm sure there's a way because htsjdk manages this fine. I haven't looked at how it does it yet though.

EDIT: Argh - my current PR build was without libdeflate for some reason, unlike the other two. Hence probably same I/O and same CPU. I'll redo that test.
EDIT2: Now done - it's a bit slower than 110, but not hugely so.

@daviesrob (Member) left a comment

Unfortunately this does erase much of the benefit from #1031 for alignments with placed, unmapped reads. It is possible to get some, but not all, of the speed back by adjusting the min_off value in hts_itr_query() like this:

            if (idx->lidx[tid].offset
                && beg>>idx->min_shift < idx->lidx[tid].n
                && min_off < idx->lidx[tid].offset[beg>>idx->min_shift]) {
                hts_pos_t bidx = beg>>idx->min_shift;
                min_off = idx->lidx[tid].offset[bidx];
                if (unmapped) {
                    while (bidx > 0) {
                        if (idx->lidx[tid].offset[bidx] < min_off) break;
                        bidx--;
                    }
                    min_off = idx->lidx[tid].offset[bidx];
                }
            }

It searches backwards for the previous linear offset position, on the grounds that there must be at least one mapped read between there and the one we started at.

There may well be better ways to do it, and the above code may not be completely correct.

Comments of course apply to the hts_itr_multi_bam() version as well.

} while (bin);
if (bin == 0) k = kh_get(bin, bidx, bin);
min_off = k != kh_end(bidx)? kh_val(bidx, k).loff : 0;
if (idx->lidx[tid].offset
@daviesrob (Member) commented Oct 12, 2020

In tests on a sparse file, it looks like this bit is still needed for BAI/TBI indexes. It appears that you quite often bubble up to the parent bin, which has a much smaller loff. I'm guessing that this is because either the bin was merged with its parent; or it didn't exist at all because there was no data in the bin at the start of the region I requested so it bubbled up until it found one with some coverage.

}

k = kh_get(bin, bidx, META_BIN(idx));
if (k != kh_end(bidx)) unmapped = 1;
A member left a comment:

This should test for kh_val(h, k).list[1].v > 0 if the META_BIN exists. See hts_idx_get_stat().

@jkbonfield (Contributor) commented:
With the multi-region iterator things are better, as indicated above. Some timings (in less gory detail):

Sparse: Averaging 1 region every 110Kbp of reference.

        1.10       dev       This PR
Time    10.423s    2.944s    6.730s
Seeks   1546       7230      4463
Bytes   1456MB     398MB     1003MB

Dense: Averaging 1 region every 1.2Kbp of reference.

        1.10       dev       This PR
Time    6.511s     1.184s    0.919s
Seeks   4423       5468      4527
Bytes   119MB      147MB     122MB

So it's the fastest of all on the dense querying. However, note that bytes read and CPU time aren't well correlated: comparing 1.10 with the dev branch, dev reads more bytes but uses much less CPU. I'm guessing this is partly the difference between reading and decompressing everything (when not using the linear index) versus read-decompress; read-discard (skip); read-decompress. We'd need to profile it and count function calls to know what workload it's putting on the decompressor.

Either way, the difference between this vs dev is small compared to either of them vs 1.10. So in this case the PR works well if we have a densely populated set of regions.

@valeriuo (Contributor, Author) replied:

> However I'm sure there's a way because htsjdk manages this fine. I haven't looked at how it does it yet though.

HTSJDK has a slightly different approach. It filters out the iterator chunks that end before the min offset of the linear index and doesn't update the starting offset of any chunk.

@valeriuo (Contributor, Author) commented:

Replaced by #1158

@valeriuo valeriuo closed this Oct 16, 2020
@valeriuo valeriuo deleted the iter_unmapped2 branch October 20, 2020 08:48


Development

Successfully merging this pull request may close these issues.

Unmapped reads are not added to the BAI linear index

3 participants