Skip to content

Adjust the min offset of a bin when unmapped placed reads detected#1158

Merged
jkbonfield merged 1 commit intosamtools:developfrom
valeriuo:bin_minoff
Oct 21, 2020
Merged

Adjust the min offset of a bin when unmapped placed reads detected#1158
jkbonfield merged 1 commit intosamtools:developfrom
valeriuo:bin_minoff

Conversation

@valeriuo
Copy link
Copy Markdown
Contributor

Unmapped placed reads can be skipped by an iterator read, when they fall at the beginning of a BGZF block, because they are not added to the linear index. However, the meta-bin from the bin index records the presence of unmapped reads and can indicate a potential change of the min_off value.
The logic for adjusting the min_off is to look backwards in the linear index until a smaller offset value is found. This new value is then compared to the smallest starting offset of a chunk inside the respective bin. The largest of the two values is then chosen as the new min_off.

Replaces #1155
Fixes #1142

@jkbonfield
Copy link
Copy Markdown
Contributor

I haven't looked at the code yet, but I can confirm this fixes things in my tests and it retains the speed and I/O reduction from the 1.11.

On my shallow 4.5x coverage set it basically gave identical results (exactly the same level of I/O) as 1.11. On the synthetic test set with unmapped + mapped pairs (in that order), it does different I/O because it's fixing the indexing problem, but was very close to 1.11 still (slightly behind without -M, slightly ahead with -M) and much more performant than 1.10.

Thumbs up from the evaluation side.

hts.c Outdated
&& rel_off < idx->lidx[tid].n
&& min_off < idx->lidx[tid].offset[rel_off])
min_off = idx->lidx[tid].offset[rel_off];
if (unmapped) {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this should be contained within the if statement above, and the line above should be an else clause - mainly because it uses idx->lidx[tid].offset and I think that doesn't exist in CSI?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree with the nesting of if(unmapped) inside if (idx->lidx[tid].offset && rel_off < idx->lidx[tid].n).
I am not sure about the else, though. I need min_off = idx->lidx[tid].offset[rel_off] to be the staring point for searching a new value.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, yes. I see it does use it as a starting point - in which case the loop should probably start at rel_off - 1.

@jkbonfield jkbonfield merged commit 49059d3 into samtools:develop Oct 21, 2020
@jmarshall
Copy link
Copy Markdown
Member

The pysam test case will soon no longer exercise this behaviour (pysam-developers/pysam#962), as that is behaviour outwith pysam so it doesn't want to be affected by it. Does HTSlib need test cases added to ensure this doesn't regress again in future?

@valeriuo valeriuo deleted the bin_minoff branch October 30, 2020 11:48
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Unmapped reads are not added to the BAI linear index

4 participants