
Adjust the starting offset of the iterator for unmapped placed reads #1155

Closed
valeriuo wants to merge 1 commit into samtools:develop from valeriuo:iter_unmapped2

Conversation

@valeriuo (Contributor) commented Oct 8, 2020

Unmapped placed reads can be skipped by an iterator read, when they fall at the beginning of a BGZF block, since they are not added to the linear index. However, the meta-bin from the bin index records the presence of unmapped reads, and can indicate whether the iterator should start reading from the beginning of a bin chunk, which includes the unmapped reads, or from the linear index offset.

Also, the code that set bins_t::loff to lidx_t::offset has been removed, as that check is already performed inside the update_loff method.

Fixes #1142

@jkbonfield (Contributor) commented Oct 12, 2020

This seems to be more or less the same as undoing 45715e4, plus some added CPU expense to boot.

Some benchmarks comparing Samtools 1.10 ("samtools110", before the linear index commit), the Samtools develop branch without this commit ("samtools.dev"), and "samtools", which is a checkout of this PR.

@ deskpro107386[samtools.../samtools]; R=`perl -e 'srand(0);for ($i=0;$i<50000;$i++) {my $p = int(rand(60000000));$e=$p+int(rand(1000));print "20:$p-$e "}'`

@ deskpro107386[samtools.../samtools]; time samtools110 view -c ~/scratch/data/9827_2#49.bam $R
500590
real	0m45.141s
user	0m44.148s
sys	0m0.987s


@ deskpro107386[samtools.../samtools]; time /tmp/samtools.dev view -c ~/scratch/data/9827_2#49.bam $R
500590
real	0m14.350s
user	0m14.069s
sys	0m0.280s


@ deskpro107386[samtools.../samtools]; time samtools view -c ~/scratch/data/9827_2#49.bam $R
500590
real	0m53.349s
user	0m52.236s
sys	0m1.112s

I/O wise we get the following from io_trace:

@ deskpro107386[samtools.../samtools]; io_trace -r 'bam$' -S -- samtools110 view -c ~/scratch/data/9827_2#49.bam $R
...
    Num. seeks   	51656
    Num. read    	294728
    Bytes read   	8084384944

@ deskpro107386[samtools.../samtools]; io_trace -r 'bam$' -S -- /tmp/samtools.dev view -c ~/scratch/data/9827_2#49.bam $R
...
    Num. seeks   	47493
    Num. read    	80083
    Bytes read   	2268340052

@ deskpro107386[samtools.../samtools]; io_trace -r 'bam$' -S -- samtools view -c ~/scratch/data/9827_2#49.bam $R
...
    Num. seeks   	51656
    Num. read    	294728
    Bytes read   	8084384944

It does however help when using the multi-region iterator. I'll do more research and post those results separately.

So sorry, I don't think this is the answer either. :( It looks like it's basically a CPU-expensive way of avoiding the use of the linear index. However, I'm sure there's a way because htsjdk manages this fine. I haven't looked at how it does it yet though.

EDIT: Argh - my current PR build was without libdeflate for some reason, unlike the other two. Hence probably same I/O and same CPU. I'll redo that test.
EDIT2: Now done - it's a bit slower than 110, but not hugely so.

@daviesrob (Member) left a comment

Unfortunately this does erase much of the benefit from #1031 for alignments with placed, unmapped reads. It is possible to get some, but not all, of the speed back by adjusting the min_off value in hts_itr_query() like this:

            if (idx->lidx[tid].offset
                && beg>>idx->min_shift < idx->lidx[tid].n
                && min_off < idx->lidx[tid].offset[beg>>idx->min_shift]) {
                hts_pos_t bidx = beg>>idx->min_shift;
                min_off = idx->lidx[tid].offset[bidx];
                if (unmapped) {
                    while (bidx > 0) {
                        if (idx->lidx[tid].offset[bidx] < min_off) break;
                        bidx--;
                    }
                    min_off = idx->lidx[tid].offset[bidx];
                }
            }

It searches backwards for the previous linear offset position, on the grounds that there must be at least one mapped read between there and the one we started at.

There may well be better ways to do it, and the above code may not be completely correct.

Comments of course apply to the hts_itr_multi_bam() version as well.

} while (bin);
if (bin == 0) k = kh_get(bin, bidx, bin);
min_off = k != kh_end(bidx)? kh_val(bidx, k).loff : 0;
if (idx->lidx[tid].offset
@daviesrob (Member) commented Oct 12, 2020

In tests on a sparse file, it looks like this bit is still needed for BAI/TBI indexes. It appears that you quite often bubble up to the parent bin, which has a much smaller loff. I'm guessing that this is because either the bin was merged with its parent; or it didn't exist at all because there was no data in the bin at the start of the region I requested so it bubbled up until it found one with some coverage.

}

k = kh_get(bin, bidx, META_BIN(idx));
if (k != kh_end(bidx)) unmapped = 1;
A member left a comment:

This should test for kh_val(h, k).list[1].v > 0 if the META_BIN exists. See hts_idx_get_stat().

@jkbonfield (Contributor) commented:
With the multi-region iterator things are better, as indicated above. Some timings (in less gory detail):

Sparse: Averaging 1 region every 110Kbp of reference.

        1.10       dev       This PR
Time    10.423s    2.944s    6.730s
Seeks   1546       7230      4463
Bytes   1456MB     398MB     1003MB

Dense: Averaging 1 region every 1.2Kbp of reference.

        1.10       dev       This PR
Time    6.511s     1.184s    0.919s
Seeks   4423       5468      4527
Bytes   119MB      147MB     122MB

So it's the fastest of all on the dense querying. However, note that bytes read and CPU time aren't well correlated: comparing 1.10 with the dev branch, dev reads more bytes but uses much less CPU. I'm guessing this is partly the difference between reading and decompressing everything (when not using the linear index) versus read-decompress; read-discard (skip); read-decompress. We'd need to profile it and count function calls to know what workload it's putting on the decompressor.

Either way, the difference between this vs dev is small compared to either of them vs 1.10. So in this case the PR works well if we have a densely populated set of regions.

@valeriuo (Contributor, Author) replied:

> However I'm sure there's a way because htsjdk manages this fine. I haven't looked at how it does it yet though.

HTSJDK has a slightly different approach. It filters out the iterator chunks that end before the min offset of the linear index and doesn't update the starting offset of any chunk.

@valeriuo (Contributor, Author) commented:

Replaced by #1158

@valeriuo valeriuo closed this Oct 16, 2020
@valeriuo valeriuo deleted the iter_unmapped2 branch October 20, 2020 08:48


Development

Successfully merging this pull request may close these issues.

Unmapped reads are not added to the BAI linear index

3 participants