Skip to content

Extract the CIGAR parsing logic into separate methods#1169

Closed
valeriuo wants to merge 2 commits intosamtools:developfrom
valeriuo:cigar_parser_2
Closed

Extract the CIGAR parsing logic into separate methods#1169
valeriuo wants to merge 2 commits intosamtools:developfrom
valeriuo:cigar_parser_2

Conversation

@valeriuo
Copy link
Copy Markdown
Contributor

The CIGAR parsing code, which was previously embedded in the sam_parse1 method, was moved into a separate static method (parse_cigar). This method is called by two wrapper methods, sam_parse_cigar and sam_parse_cigar_b, which are adapted to the destination type, uint32_t array and bam1_t struct, respectively.

Another verification was also added to the parser, which will mark CIGAR strings with two consecutive non-digit characters as invalid. This change forced the correction of a test file.

Fixes #1147

Copy link
Copy Markdown
Contributor

@jkbonfield jkbonfield left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did test it though and it works and has no detrimental impacts on speed (if anything it's marginally faster for SAM parsing, but minimal diff).

sam.c Outdated
n_cigar = read_ncigar(in);
if (!n_cigar) return 0;
if (n_cigar > (*a_mem - *a_len)) {
uint32_t *a_tmp = realloc(*a_cigar, 1.5*(n_cigar + *a_len)*sizeof(**a_cigar));
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given this isn't a dynamic read-and-grow cycle, I don't see the point of the 1.5*mem bit. It's previously computed n_cigar so this should be known exactly. I'm also not sure why it's n_cigar + *a_len and not just n_cigar?

Or is this for when called in a loop with the same buffers so we can grow to beyiond the biggest we've seen and give room before finding the next biggest? (If so the + *a_len bit looks like it'll grow and grow and grow.) Generally though I think the htslib way for such reuse is kstring, but that doesn't seem to be a hard rule (bam1_t doesn't work that way for example).

The alternative is to do away with read_ncigar and keep the 1.5*mem bit so it dynamically grows as it's parsing the CIGAR string. That makes it one pass instead of two, at the expense of potentially doing memcpy internally during reallocs.

Anyone else have views on this?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wanted to avoid a one pass read of the string, because that would have degraded the performance of sam_parse1 with repeated reallocs. However, since this is a separate API method, we could look into alternative strategies.

sam.c Outdated
}
}

if (!(diff = parse_cigar(in, *a_cigar + *a_len, n_cigar))) return 0;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why *a_cigar + *a_len?

Is this function intented to be ran on multiple CIGAR strings so it can concatenate them? If so the documentation needs updating.

My assumptions are it would fill out a_cigar from pos zero and return *a_len as the length filled out, but that's not what it's doing.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It offers the flexibility of appending to an existing CIGAR array of a_len length. We can drop the feature, if we see no use for it.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Flexibility is good, but I'm struggling to find a use case. Even if we wanted to merge two sequences and their associated CIGAR strings, we'd probably still be wanting to do it manually so we can futz around with merging trailing and leader M ops (for example), or removing hard-clip ops that may end up embedded in the middle otherwise.

So my instinct here is it's not a feature we want. Others care to weigh in?

sam.c Outdated
return n_cigar;
}

size_t sam_parse_cigar_b(const char *in, char **end, bam1_t *b) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't really like the name of this function, but am struggling to think of anything I particularly like.

sam_parse_cigar_bam or sam_parse_cigar2bam maybe?

I'm not really sure it's necessary as a public function though given it could just be a tiny wrapper around sam_parse_cigar or parse_cigar?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I knew the names would be a point of contention, so I am open to alternatives.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Other functions with names like …_cigar2xyz take n_cigar/const uint32_t * as input, so using …cigar2… in the name wouldn't fit that pattern.

bam_parse_cigar() would also be a contender: there are multiple examples of sam_abc() is the general function, bam_abc() is an equivalent limited to BAM in some sense (here, in that it's for bam1_t, like other bam_ functions as discussed in #1159 (comment)).

However there is an extent to which wanting this as a public function really shows a deficiency in the recently-added bam_set1(): perhaps that should be able to take either a size_t n_cigar, const uint32_t *cigar or a const char *cigar_string. It does after all take other unparsed string parameters e.g. seq, qual

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think there is scope for wanting both text construction and binary construction.

Take CRAM for example. It's constructing the cigar data from the feature codes. So it already has data like code = 'D', length = 10, etc. It would be ugly to have to construct "10D20M" etc in a string only to parse it back again. Also consider cases where we want to do manipulation to an existing record - eg turning soft-clips into hard-clips. (The sequence is problematic given it's annoying nibblyness, but we have to live with that so it's inevitable we end up in 1-byte sequence world or have a two code paths for len%2. ) We already have int based cigars, so again having to go into and out of string form would be poor. Other sources of data may differ though and start from strings.

If we only have one interface, the binary form is better as you can first call one function to parse and then pass those parameters into bam_set1.

… two

most common use cases: independent uint32_t array and bam1_t record.
Add documentation.
@jkbonfield
Copy link
Copy Markdown
Contributor

Squashed and rebased. Merged as 9a55e4e

@jkbonfield jkbonfield closed this Nov 18, 2020
@jmarshall
Copy link
Copy Markdown
Member

This PR didn't add any tests for the new API functions.

In particular, it would be good to add a test based on the following test case to test/sam.c:

static void test_cigar_api(void)
{
    uint32_t *buf = NULL;
    uint32_t n = 0;
    fprintf(stderr, "returned %zu for *\n", sam_parse_cigar("*", NULL, &buf, &n));
    // …etc…
}

Ideally sam_parse_cigar() would return ssize_t with -1 for error. Alternatively (if there's much precedent for non-pointer-returning HTSlib functions returning 0 on error?) it would be okay to semi-artificially return number-of-operators = 1 for this "*" case.

@jkbonfield
Copy link
Copy Markdown
Contributor

Duh, I did think about -1 vs 0 and concluded it's rather irrelevant as the nul cigar isn't valid. Why didn't I think about "*"? Too many other things on the go.

Agreed given "*" then we ought to be returning -1. It's a better choice than faking it as 1 as people may use the return value as an alternative indicator for the number of bytes (x4) filled out.

@jmarshall
Copy link
Copy Markdown
Member

Also be sure to consider whether the currently-implemented behaviour for that test case is the desired behaviour 😄

[E::parse_cigar] CIGAR length invalid at position 1 (*)

@jmarshall
Copy link
Copy Markdown
Member

people may use the return value as an alternative indicator for the number of bytes (x4) filled out.

On further thinking about it: a_mem is a max / buffer-capacity measurement, not an n-used measurement. So actually the sam_parse_cigar return value is the only indicator of the amount of space filled out.

daviesrob added a commit to daviesrob/samtools that referenced this pull request Nov 19, 2020
HTSlib refuses to parse these following the merge of PR
samtools/htslib#1169 (Extract the CIGAR parsing logic into separate
methods).  Fix them so the tests work again

Thanks to John Marshall for reporting the problem in samtools.
@valeriuo valeriuo deleted the cigar_parser_2 branch December 3, 2020 13:18
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

CIGAR parsing API

3 participants