Speed up Code that Checks Optical Duplicates. by whitwham · Pull Request #1345 · samtools/samtools

whitwham · 2020-11-20T12:04:11Z

Duplicates can appear in any order in a bam file so some degree of backtracking is needed to correctly identify if a duplicate is actually an optical one. The existing code did not scale when a single read can have tens or hundreds of thousands of duplicates.

This should fix #1325, or at least ameliorate it.

daviesrob · 2020-11-20T18:32:37Z

I'm getting a use-after-free error from address sanitizer - it looks like a stale bam record is on the ori list here. The test harness doesn't show this, so probably isn't doing enough to find the bug. I'll point you at my input file so you can investigate.

whitwham · 2020-11-23T08:42:27Z

I'm getting a use-after-free error from address sanitizer - it looks like a stale bam record is on the ori list here. The test harness doesn't show this, so probably isn't doing enough to find the bug. I'll point you at my input file so you can investigate.

Thanks. I will take a look using your test data.

jkbonfield

Only very minor code comments.

I'm struggling to get it to work well though on a small test set. Maybe it's something wrong with how I prepared my data. I'll email you the details.

jkbonfield · 2020-12-16T10:11:13Z

bam_markdup.c

+
+    seps = get_coordinate_positions(name, &xpos, &ypos);
+
+    if (!(seps == 3 || seps == 4 || seps == 6 || seps == 7)) {


It may be good to comment the meaning of 3, 4, 6 and 7.

I see get_coordinate_positions lists 4 and 6 for various Illumina models, but what's 7? Looking at the code I see the 7th is the same as 6 (5=x 6=y) but an additional non-coord value.

I see the code for comparing whether the read names match is just checking the bit prior to xpos, so whatever this 7th element is it's not considered in that step. Does that matter?

Thanks for the commenting. I guess the answer to "7th element is it's not considered in that step. Does that matter?" is "maybe". That'll be a different issue if done though. (See #1358)

It doesn't matter yet.

jkbonfield · 2020-12-16T11:23:16Z

bam_markdup.c


+/* Check all duplicates of the highest quality read (the "original") for consistancy.  Also
+   pre-calculate any values for use in check_duplicate_chain later. */
+static int check_chain_against_original(md_param_t *param, khash_t(duplicates) *dup_hash, read_queue_t *ori,


Maybe add a comment on return values as this is a bit subtle.

From my understanding, it has 0 as OK, >0 as failed due to unknown naming scheme (so a warning was given and the caller may continue), and <0 as a genuine error (the caller won't continue).

jkbonfield · 2020-12-16T11:28:01Z

bam_markdup.c


-        if (param->tag && have_original) {
+            if (!(tmp = realloc(list->c, list->size * sizeof(check_t)))) {
+                fprintf(stderr, "[markdup] error: Unable to expand opt check list.\n");


Our unwritten policy is all errors in new code should call print_error or print_error_errno rather than fprintf directly.

Similarly elsewhere in this file.

jkbonfield · 2020-12-16T11:44:32Z

bam_markdup.c

+    read_queue_t *current = ori->duplicate;
+    int xpos;
+    long x, y;
+    long tmp_count = 0;


tmp_count looks redundant now.

jkbonfield · 2020-12-16T11:48:36Z

bam_markdup.c

+    check_t *ac = (check_t *) a;
+    check_t *bc = (check_t *) b;
+
+    return ac->x < bc->x ? -1 : (ac->x == bc->x ? 0 : 1);


Note given coordinates are small, this could just be return ac->x - bc->x.

I don't know how significant a chunk of CPU though the sort comparison function is. It may be pointless optimisation. Your call.

jkbonfield · 2020-12-16T12:40:11Z

bam_markdup.c

+                                long *optical_pair, const int check_range) {
+    int ret = 0;
+    kliter_t(read_queue) *rq;
+    long tmp_count = 0;


A now-redundant variable.

jkbonfield · 2020-12-16T12:42:41Z

bam_markdup.c

    tmp_file_t temp;
    char *idx_fn = NULL;
    int exclude = 0;
+    check_list_t dup_list;


It's a bit hard to validate all the conditionals here, but maybe initialise this to NULL. The goto fail code may happen before this gets set, but the fail code does free it.

As example is when the header has SO:queryname. This triggers a failure with address sanitizer attempting to free dup_list.

jkbonfield · 2020-12-16T15:44:48Z

bam_markdup.c

+            }
+        }
+
+        if (!(in_read->b->core.flag & BAM_FDUP) && in_read->duplicate) { // is the head of a duplicate chain


On a small test file it still seems to spend longer than I expected, with about 50% of time at this line of code surprisingly.

It got about 20% quicker just by swapping the order of conditionals, ie:

if (in_read->duplicate && !(in_read->b->core.flag & BAM_FDUP))

I'm not sure if my test data was abnormal in some way though.

Ignore this. While still more efficient the other way around, I think it shouldn't be spending so long in this function. It's a symptom of my input file having already marked for duplicates and not using the "-c" option to clear those flags.

I don't know yet why that makes such a difference though.

jkbonfield · 2021-01-14T10:20:20Z

Thanks for the updates.

I can confirm it now works much faster even on data that's already had markdups run before. However it now appears to duplicate some tag entries. Eg after a 2nd run:

seq     1168    1       9996    16      112S39M *       0       0       <seq> <qual> NM:i:0  MD:Z:39 AS:i:39 XS:i:35 XA:Z:6,-147869,116S35M,0;hs37d5,+10061146,35M116S,0;7,-10002,117S34M,0;15,+102521304,34M117S,0;2,+241161288,34M117S,0;  do:Z:seq2    dt:Z:LB do:Z:seq2    dt:Z:LB

Note do and dt tags are doubled up. Changing all the bam_aux_append calls with bam_aux_update_str seems to cure the problem and I cannot see any impact on the time taken. 1 million seqs with append vs update (uncompressed BAM output) took 4.60s vs 4.65s. The second pass where update actuallys updates instead of appending took 4.35s, so it's now faster (fewer reallocs as a guess). The md5sum output of 1st and 2nd pass is identical, as is 1st past between append vs update.

My diff:

diff --git a/bam_markdup.c b/bam_markdup.c
index 10a3f66..2da184f 100644
--- a/bam_markdup.c
+++ b/bam_markdup.c
@@ -945,7 +945,7 @@ static int mark_duplicates(md_param_t *param, khash_t(duplicates) *dup_hash, bam
     dup->core.flag |= BAM_FDUP;
 
     if (param->tag) {
-        if (bam_aux_append(dup, "do", 'Z', strlen(bam_get_qname(ori)) + 1, (uint8_t*)bam_get_qname(ori))) {
+        if (bam_aux_update_str(dup, "do", strlen(bam_get_qname(ori)) + 1, bam_get_qname(ori))) {
             fprintf(stderr, "[markdup] error: unable to append 'do' tag.\n");
             return -1;
         }
@@ -953,12 +953,12 @@ static int mark_duplicates(md_param_t *param, khash_t(duplicates) *dup_hash, bam
 
     if (param->opt_dist) { // mark optical duplicates
         if (optical_duplicate(ori, dup, param->opt_dist, warn)) {
-            bam_aux_append(dup, "dt", 'Z', 3, (const uint8_t *)"SQ");
+            bam_aux_update_str(dup, "dt", 3, "SQ");
             dup_type = 'O';
             (*optical)++;
         } else {
             // not an optical duplicate
-            bam_aux_append(dup, "dt", 'Z', 3, (const uint8_t *)"LB");
+            bam_aux_update_str(dup, "dt", 3, "LB");
         }
     }
 
@@ -1909,7 +1909,7 @@ static int bam_mark_duplicates(md_param_t *param) {
                     np_duplicate++;
 
                     if (param->tag && kh_val(dup_hash, k).name) {
-                        if (bam_aux_append(b, "do", 'Z', strlen(kh_val(dup_hash, k).name) + 1, (uint8_t*)kh_val(dup_hash, k).name)) {
+                        if (bam_aux_update_str(b, "do", strlen(kh_val(dup_hash, k).name) + 1, (char*)kh_val(dup_hash, k).name)) {
                             fprintf(stderr, "[markdup] error: unable to append supplementary 'do' tag.\n");
                             goto fail;
                         }
@@ -1917,10 +1917,10 @@ static int bam_mark_duplicates(md_param_t *param) {
 
                     if (param->opt_dist) {
                         if (kh_val(dup_hash, k).type) {
-                            bam_aux_append(b, "dt", 'Z', 3, (const uint8_t *)"SQ");
+                            bam_aux_update_str(b, "dt", 3, "SQ");
                             np_opt_duplicate++;
                         } else {
-                            bam_aux_append(b, "dt", 'Z', 3, (const uint8_t *)"LB");
+                            bam_aux_update_str(b, "dt", 3, "LB");
                         }
                     }
                 }

If you wish I can apply this and squash it into the other commits during merge.

Assuming you're happy for the various commits to get squashed together?

whitwham · 2021-01-14T10:24:55Z

You can do the changes. I guess this is more fallout from not clearing the previously dup marking data. That really should have been the default action.

jkbonfield · 2021-01-14T10:31:32Z

Ok will do.

Existing code did not scale into the tens and hundreds of thousands. This should work better. Also fixed a bug in the multiple duplicates chain code where some single duplicate entries where getting lost.

whitwham marked this pull request as ready for review December 3, 2020 14:15

daviesrob assigned jkbonfield Dec 15, 2020

jkbonfield reviewed Dec 16, 2020

View reviewed changes

whitwham force-pushed the markdup_faster_deep branch from f88300b to 9f079a4 Compare January 7, 2021 17:50

Rewrite the code that checks optical duplicates.

c17e914

Existing code did not scale into the tens and hundreds of thousands. This should work better. Also fixed a bug in the multiple duplicates chain code where some single duplicate entries where getting lost.

jkbonfield force-pushed the markdup_faster_deep branch from 9f079a4 to c17e914 Compare January 14, 2021 11:29

jkbonfield merged commit c17e914 into samtools:develop Jan 14, 2021

whitwham deleted the markdup_faster_deep branch February 14, 2022 14:25


		seps = get_coordinate_positions(name, &xpos, &ypos);

		if (!(seps == 3 \|\| seps == 4 \|\| seps == 6 \|\| seps == 7)) {

Conversation

whitwham commented Nov 20, 2020

Uh oh!

daviesrob commented Nov 20, 2020

Uh oh!

whitwham commented Nov 23, 2020

Uh oh!

jkbonfield left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

jkbonfield commented Jan 14, 2021

Uh oh!

whitwham commented Jan 14, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

jkbonfield commented Jan 14, 2021

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

whitwham commented Jan 14, 2021 •

edited

Loading