Skip to content

Removal the bgzf_idx_flush assertion.#1168

Merged
daviesrob merged 1 commit intosamtools:developfrom
jkbonfield:write-index-fix2
Dec 1, 2020
Merged

Removal the bgzf_idx_flush assertion.#1168
daviesrob merged 1 commit intosamtools:developfrom
jkbonfield:write-index-fix2

Conversation

@jkbonfield
Copy link
Copy Markdown
Contributor

The assumption is that we call bgzf_idx_flush once per outgoing multi-threaded block, and the block numbers will match. The assertion is to validate we're not indexing out of order.

However, very long records (eg with huge aux tags or long seqs) can mean a single record spans multiple blocks and we inherently then skip blocks in bgzf_idx_push calls. Hence the assertion is simply wrong.

Fixes samtools/samtools#1328

@jkbonfield
Copy link
Copy Markdown
Contributor Author

PS. Don't ask how long I spent scratching my head over why indices were different when I did samtools view -@2 vs samtools view. Even -@0 differed...

They carried on differing until I spotted the BAMs also differed, and then remembered to add --no-PG. duh hoisted by our own pitard! :-)

bgzf.c Outdated
assert(mt->idx_cache.nentries == 0 || mt->block_written >= e[0].block_number);

for (i = 0; i < mt->idx_cache.nentries && e[i].block_number == mt->block_written; i++) {
for (i = 0; i < mt->idx_cache.nentries && e[i].block_number <= mt->block_written; i++) {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The value (mt->block_address << 16) + e[i].offset is only valid when e[i].block_number == mt->block_written otherwise an incorrect index entry would be written. Which suggests that using == in the for loop was correct, and the assertion should have been to check that blocks haven't been written without adding their index entries, i.e.:

assert(mt->idx_cache.nentries == 0 || mt->block_written <= e[0].block_number)

I don't think this can happen, but then isn't that the sort of reasoning that the assertion was trying to check?

The assumption is that we call bgzf_idx_flush once per outgoing
multi-threaded block, and the block numbers will match.  The
assertion is to validate we're not indexing out of order.

However, very long records (eg with huge aux tags or long seqs) can
mean a single record spans multiple blocks and we inherently then skip
blocks in bgzf_idx_push calls.  Hence the assertion is simply wrong.

Fixes samtools/samtools#1328
@daviesrob daviesrob merged commit 238600d into samtools:develop Dec 1, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Assertion error when samtools merge is run with threading and auto indexing

2 participants