Speed up Code that Checks Optical Duplicates.#1345
Speed up Code that Checks Optical Duplicates.#1345jkbonfield merged 1 commit intosamtools:developfrom
Conversation
|
I'm getting a use-after-free error from address sanitizer - it looks like a stale bam record is on the |
Thanks. I will take a look using your test data. |
jkbonfield
left a comment
There was a problem hiding this comment.
Only very minor code comments.
I'm struggling to get it to work well though on a small test set. Maybe it's something wrong with how I prepared my data. I'll email you the details.
|
|
||
| seps = get_coordinate_positions(name, &xpos, &ypos); | ||
|
|
||
| if (!(seps == 3 || seps == 4 || seps == 6 || seps == 7)) { |
There was a problem hiding this comment.
It may be good to comment the meaning of 3, 4, 6 and 7.
I see get_coordinate_positions lists 4 and 6 for various Illumina models, but what's 7? Looking at the code I see the 7th is the same as 6 (5=x 6=y) but an additional non-coord value.
I see the code for comparing whether the read names match is just checking the bit prior to xpos, so whatever this 7th element is it's not considered in that step. Does that matter?
There was a problem hiding this comment.
Thanks for the commenting. I guess the answer to "7th element is it's not considered in that step. Does that matter?" is "maybe". That'll be a different issue if done though. (See #1358)
|
|
||
| /* Check all duplicates of the highest quality read (the "original") for consistancy. Also | ||
| pre-calculate any values for use in check_duplicate_chain later. */ | ||
| static int check_chain_against_original(md_param_t *param, khash_t(duplicates) *dup_hash, read_queue_t *ori, |
There was a problem hiding this comment.
Maybe add a comment on return values as this is a bit subtle.
From my understanding, it has 0 as OK, >0 as failed due to unknown naming scheme (so a warning was given and the caller may continue), and <0 as a genuine error (the caller won't continue).
|
|
||
| if (param->tag && have_original) { | ||
| if (!(tmp = realloc(list->c, list->size * sizeof(check_t)))) { | ||
| fprintf(stderr, "[markdup] error: Unable to expand opt check list.\n"); |
There was a problem hiding this comment.
Our unwritten policy is all errors in new code should call print_error or print_error_errno rather than fprintf directly.
Similarly elsewhere in this file.
bam_markdup.c
Outdated
| read_queue_t *current = ori->duplicate; | ||
| int xpos; | ||
| long x, y; | ||
| long tmp_count = 0; |
There was a problem hiding this comment.
tmp_count looks redundant now.
bam_markdup.c
Outdated
| check_t *ac = (check_t *) a; | ||
| check_t *bc = (check_t *) b; | ||
|
|
||
| return ac->x < bc->x ? -1 : (ac->x == bc->x ? 0 : 1); |
There was a problem hiding this comment.
Note given coordinates are small, this could just be return ac->x - bc->x.
I don't know how significant a chunk of CPU though the sort comparison function is. It may be pointless optimisation. Your call.
bam_markdup.c
Outdated
| long *optical_pair, const int check_range) { | ||
| int ret = 0; | ||
| kliter_t(read_queue) *rq; | ||
| long tmp_count = 0; |
There was a problem hiding this comment.
A now-redundant variable.
bam_markdup.c
Outdated
| tmp_file_t temp; | ||
| char *idx_fn = NULL; | ||
| int exclude = 0; | ||
| check_list_t dup_list; |
There was a problem hiding this comment.
It's a bit hard to validate all the conditionals here, but maybe initialise this to NULL. The goto fail code may happen before this gets set, but the fail code does free it.
As example is when the header has SO:queryname. This triggers a failure with address sanitizer attempting to free dup_list.
| } | ||
| } | ||
|
|
||
| if (!(in_read->b->core.flag & BAM_FDUP) && in_read->duplicate) { // is the head of a duplicate chain |
There was a problem hiding this comment.
On a small test file it still seems to spend longer than I expected, with about 50% of time at this line of code surprisingly.
It got about 20% quicker just by swapping the order of conditionals, ie:
if (in_read->duplicate && !(in_read->b->core.flag & BAM_FDUP))
I'm not sure if my test data was abnormal in some way though.
There was a problem hiding this comment.
Ignore this. While still more efficient the other way around, I think it shouldn't be spending so long in this function. It's a symptom of my input file having already marked for duplicates and not using the "-c" option to clear those flags.
I don't know yet why that makes such a difference though.
f88300b to
9f079a4
Compare
|
Thanks for the updates. I can confirm it now works much faster even on data that's already had markdups run before. However it now appears to duplicate some tag entries. Eg after a 2nd run: Note My diff: If you wish I can apply this and squash it into the other commits during merge. Assuming you're happy for the various commits to get squashed together? |
|
You can do the changes. I guess this is more fallout from not clearing the previously dup marking data. That really should have been the default action. |
|
Ok will do. |
Existing code did not scale into the tens and hundreds of thousands. This should work better. Also fixed a bug in the multiple duplicates chain code where some single duplicate entries where getting lost.
9f079a4 to
c17e914
Compare
Duplicates can appear in any order in a bam file so some degree of backtracking is needed to correctly identify if a duplicate is actually an optical one. The existing code did not scale when a single read can have tens or hundreds of thousands of duplicates.
This should fix #1325, or at least ameliorate it.