Skip to content

Speed up Code that Checks Optical Duplicates.#1345

Merged
jkbonfield merged 1 commit intosamtools:developfrom
whitwham:markdup_faster_deep
Jan 14, 2021
Merged

Speed up Code that Checks Optical Duplicates.#1345
jkbonfield merged 1 commit intosamtools:developfrom
whitwham:markdup_faster_deep

Conversation

@whitwham
Copy link
Copy Markdown
Member

Duplicates can appear in any order in a bam file so some degree of backtracking is needed to correctly identify if a duplicate is actually an optical one. The existing code did not scale when a single read can have tens or hundreds of thousands of duplicates.

This should fix #1325, or at least ameliorate it.

@daviesrob
Copy link
Copy Markdown
Member

I'm getting a use-after-free error from address sanitizer - it looks like a stale bam record is on the ori list here. The test harness doesn't show this, so probably isn't doing enough to find the bug. I'll point you at my input file so you can investigate.

@whitwham
Copy link
Copy Markdown
Member Author

I'm getting a use-after-free error from address sanitizer - it looks like a stale bam record is on the ori list here. The test harness doesn't show this, so probably isn't doing enough to find the bug. I'll point you at my input file so you can investigate.

Thanks. I will take a look using your test data.

@whitwham whitwham marked this pull request as ready for review December 3, 2020 14:15
Copy link
Copy Markdown
Contributor

@jkbonfield jkbonfield left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Only very minor code comments.

I'm struggling to get it to work well though on a small test set. Maybe it's something wrong with how I prepared my data. I'll email you the details.


seps = get_coordinate_positions(name, &xpos, &ypos);

if (!(seps == 3 || seps == 4 || seps == 6 || seps == 7)) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It may be good to comment the meaning of 3, 4, 6 and 7.

I see get_coordinate_positions lists 4 and 6 for various Illumina models, but what's 7? Looking at the code I see the 7th is the same as 6 (5=x 6=y) but an additional non-coord value.

I see the code for comparing whether the read names match is just checking the bit prior to xpos, so whatever this 7th element is it's not considered in that step. Does that matter?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the commenting. I guess the answer to "7th element is it's not considered in that step. Does that matter?" is "maybe". That'll be a different issue if done though. (See #1358)

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It doesn't matter yet.


/* Check all duplicates of the highest quality read (the "original") for consistancy. Also
pre-calculate any values for use in check_duplicate_chain later. */
static int check_chain_against_original(md_param_t *param, khash_t(duplicates) *dup_hash, read_queue_t *ori,
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe add a comment on return values as this is a bit subtle.

From my understanding, it has 0 as OK, >0 as failed due to unknown naming scheme (so a warning was given and the caller may continue), and <0 as a genuine error (the caller won't continue).


if (param->tag && have_original) {
if (!(tmp = realloc(list->c, list->size * sizeof(check_t)))) {
fprintf(stderr, "[markdup] error: Unable to expand opt check list.\n");
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Our unwritten policy is all errors in new code should call print_error or print_error_errno rather than fprintf directly.

Similarly elsewhere in this file.

bam_markdup.c Outdated
read_queue_t *current = ori->duplicate;
int xpos;
long x, y;
long tmp_count = 0;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

tmp_count looks redundant now.

bam_markdup.c Outdated
check_t *ac = (check_t *) a;
check_t *bc = (check_t *) b;

return ac->x < bc->x ? -1 : (ac->x == bc->x ? 0 : 1);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note given coordinates are small, this could just be return ac->x - bc->x.

I don't know how significant a chunk of CPU though the sort comparison function is. It may be pointless optimisation. Your call.

bam_markdup.c Outdated
long *optical_pair, const int check_range) {
int ret = 0;
kliter_t(read_queue) *rq;
long tmp_count = 0;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A now-redundant variable.

bam_markdup.c Outdated
tmp_file_t temp;
char *idx_fn = NULL;
int exclude = 0;
check_list_t dup_list;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's a bit hard to validate all the conditionals here, but maybe initialise this to NULL. The goto fail code may happen before this gets set, but the fail code does free it.

As example is when the header has SO:queryname. This triggers a failure with address sanitizer attempting to free dup_list.

}
}

if (!(in_read->b->core.flag & BAM_FDUP) && in_read->duplicate) { // is the head of a duplicate chain
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On a small test file it still seems to spend longer than I expected, with about 50% of time at this line of code surprisingly.

It got about 20% quicker just by swapping the order of conditionals, ie:

if (in_read->duplicate && !(in_read->b->core.flag & BAM_FDUP))

I'm not sure if my test data was abnormal in some way though.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ignore this. While still more efficient the other way around, I think it shouldn't be spending so long in this function. It's a symptom of my input file having already marked for duplicates and not using the "-c" option to clear those flags.

I don't know yet why that makes such a difference though.

@whitwham whitwham force-pushed the markdup_faster_deep branch from f88300b to 9f079a4 Compare January 7, 2021 17:50
@jkbonfield
Copy link
Copy Markdown
Contributor

Thanks for the updates.

I can confirm it now works much faster even on data that's already had markdups run before. However it now appears to duplicate some tag entries. Eg after a 2nd run:

seq     1168    1       9996    16      112S39M *       0       0       <seq> <qual> NM:i:0  MD:Z:39 AS:i:39 XS:i:35 XA:Z:6,-147869,116S35M,0;hs37d5,+10061146,35M116S,0;7,-10002,117S34M,0;15,+102521304,34M117S,0;2,+241161288,34M117S,0;  do:Z:seq2    dt:Z:LB do:Z:seq2    dt:Z:LB

Note do and dt tags are doubled up. Changing all the bam_aux_append calls with bam_aux_update_str seems to cure the problem and I cannot see any impact on the time taken. 1 million seqs with append vs update (uncompressed BAM output) took 4.60s vs 4.65s. The second pass where update actuallys updates instead of appending took 4.35s, so it's now faster (fewer reallocs as a guess). The md5sum output of 1st and 2nd pass is identical, as is 1st past between append vs update.

My diff:

diff --git a/bam_markdup.c b/bam_markdup.c
index 10a3f66..2da184f 100644
--- a/bam_markdup.c
+++ b/bam_markdup.c
@@ -945,7 +945,7 @@ static int mark_duplicates(md_param_t *param, khash_t(duplicates) *dup_hash, bam
     dup->core.flag |= BAM_FDUP;
 
     if (param->tag) {
-        if (bam_aux_append(dup, "do", 'Z', strlen(bam_get_qname(ori)) + 1, (uint8_t*)bam_get_qname(ori))) {
+        if (bam_aux_update_str(dup, "do", strlen(bam_get_qname(ori)) + 1, bam_get_qname(ori))) {
             fprintf(stderr, "[markdup] error: unable to append 'do' tag.\n");
             return -1;
         }
@@ -953,12 +953,12 @@ static int mark_duplicates(md_param_t *param, khash_t(duplicates) *dup_hash, bam
 
     if (param->opt_dist) { // mark optical duplicates
         if (optical_duplicate(ori, dup, param->opt_dist, warn)) {
-            bam_aux_append(dup, "dt", 'Z', 3, (const uint8_t *)"SQ");
+            bam_aux_update_str(dup, "dt", 3, "SQ");
             dup_type = 'O';
             (*optical)++;
         } else {
             // not an optical duplicate
-            bam_aux_append(dup, "dt", 'Z', 3, (const uint8_t *)"LB");
+            bam_aux_update_str(dup, "dt", 3, "LB");
         }
     }
 
@@ -1909,7 +1909,7 @@ static int bam_mark_duplicates(md_param_t *param) {
                     np_duplicate++;
 
                     if (param->tag && kh_val(dup_hash, k).name) {
-                        if (bam_aux_append(b, "do", 'Z', strlen(kh_val(dup_hash, k).name) + 1, (uint8_t*)kh_val(dup_hash, k).name)) {
+                        if (bam_aux_update_str(b, "do", strlen(kh_val(dup_hash, k).name) + 1, (char*)kh_val(dup_hash, k).name)) {
                             fprintf(stderr, "[markdup] error: unable to append supplementary 'do' tag.\n");
                             goto fail;
                         }
@@ -1917,10 +1917,10 @@ static int bam_mark_duplicates(md_param_t *param) {
 
                     if (param->opt_dist) {
                         if (kh_val(dup_hash, k).type) {
-                            bam_aux_append(b, "dt", 'Z', 3, (const uint8_t *)"SQ");
+                            bam_aux_update_str(b, "dt", 3, "SQ");
                             np_opt_duplicate++;
                         } else {
-                            bam_aux_append(b, "dt", 'Z', 3, (const uint8_t *)"LB");
+                            bam_aux_update_str(b, "dt", 3, "LB");
                         }
                     }
                 }

If you wish I can apply this and squash it into the other commits during merge.

Assuming you're happy for the various commits to get squashed together?

@whitwham
Copy link
Copy Markdown
Member Author

whitwham commented Jan 14, 2021

You can do the changes. I guess this is more fallout from not clearing the previously dup marking data. That really should have been the default action.

@jkbonfield
Copy link
Copy Markdown
Contributor

Ok will do.

Existing code did not scale into the tens and hundreds of thousands.
This should work better.

Also fixed a bug in the multiple duplicates chain code where some single duplicate entries where getting lost.
@jkbonfield jkbonfield merged commit c17e914 into samtools:develop Jan 14, 2021
@whitwham whitwham deleted the markdup_faster_deep branch February 14, 2022 14:25
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Samtools markdup with high depth regions

3 participants