Extract the CIGAR parsing logic into separate methods#1169
Extract the CIGAR parsing logic into separate methods#1169valeriuo wants to merge 2 commits intosamtools:developfrom
Conversation
sam.c
Outdated
| n_cigar = read_ncigar(in); | ||
| if (!n_cigar) return 0; | ||
| if (n_cigar > (*a_mem - *a_len)) { | ||
| uint32_t *a_tmp = realloc(*a_cigar, 1.5*(n_cigar + *a_len)*sizeof(**a_cigar)); |
There was a problem hiding this comment.
Given this isn't a dynamic read-and-grow cycle, I don't see the point of the 1.5*mem bit. It's previously computed n_cigar so this should be known exactly. I'm also not sure why it's n_cigar + *a_len and not just n_cigar?
Or is this for when called in a loop with the same buffers so we can grow to beyiond the biggest we've seen and give room before finding the next biggest? (If so the + *a_len bit looks like it'll grow and grow and grow.) Generally though I think the htslib way for such reuse is kstring, but that doesn't seem to be a hard rule (bam1_t doesn't work that way for example).
The alternative is to do away with read_ncigar and keep the 1.5*mem bit so it dynamically grows as it's parsing the CIGAR string. That makes it one pass instead of two, at the expense of potentially doing memcpy internally during reallocs.
Anyone else have views on this?
There was a problem hiding this comment.
I wanted to avoid a one pass read of the string, because that would have degraded the performance of sam_parse1 with repeated reallocs. However, since this is a separate API method, we could look into alternative strategies.
sam.c
Outdated
| } | ||
| } | ||
|
|
||
| if (!(diff = parse_cigar(in, *a_cigar + *a_len, n_cigar))) return 0; |
There was a problem hiding this comment.
Why *a_cigar + *a_len?
Is this function intented to be ran on multiple CIGAR strings so it can concatenate them? If so the documentation needs updating.
My assumptions are it would fill out a_cigar from pos zero and return *a_len as the length filled out, but that's not what it's doing.
There was a problem hiding this comment.
It offers the flexibility of appending to an existing CIGAR array of a_len length. We can drop the feature, if we see no use for it.
There was a problem hiding this comment.
Flexibility is good, but I'm struggling to find a use case. Even if we wanted to merge two sequences and their associated CIGAR strings, we'd probably still be wanting to do it manually so we can futz around with merging trailing and leader M ops (for example), or removing hard-clip ops that may end up embedded in the middle otherwise.
So my instinct here is it's not a feature we want. Others care to weigh in?
sam.c
Outdated
| return n_cigar; | ||
| } | ||
|
|
||
| size_t sam_parse_cigar_b(const char *in, char **end, bam1_t *b) { |
There was a problem hiding this comment.
I don't really like the name of this function, but am struggling to think of anything I particularly like.
sam_parse_cigar_bam or sam_parse_cigar2bam maybe?
I'm not really sure it's necessary as a public function though given it could just be a tiny wrapper around sam_parse_cigar or parse_cigar?
There was a problem hiding this comment.
I knew the names would be a point of contention, so I am open to alternatives.
There was a problem hiding this comment.
Other functions with names like …_cigar2xyz take n_cigar/const uint32_t * as input, so using …cigar2… in the name wouldn't fit that pattern.
bam_parse_cigar() would also be a contender: there are multiple examples of sam_abc() is the general function, bam_abc() is an equivalent limited to BAM in some sense (here, in that it's for bam1_t, like other bam_ functions as discussed in #1159 (comment)).
However there is an extent to which wanting this as a public function really shows a deficiency in the recently-added bam_set1(): perhaps that should be able to take either a size_t n_cigar, const uint32_t *cigar or a const char *cigar_string. It does after all take other unparsed string parameters e.g. seq, qual…
There was a problem hiding this comment.
I think there is scope for wanting both text construction and binary construction.
Take CRAM for example. It's constructing the cigar data from the feature codes. So it already has data like code = 'D', length = 10, etc. It would be ugly to have to construct "10D20M" etc in a string only to parse it back again. Also consider cases where we want to do manipulation to an existing record - eg turning soft-clips into hard-clips. (The sequence is problematic given it's annoying nibblyness, but we have to live with that so it's inevitable we end up in 1-byte sequence world or have a two code paths for len%2. ) We already have int based cigars, so again having to go into and out of string form would be poor. Other sources of data may differ though and start from strings.
If we only have one interface, the binary form is better as you can first call one function to parse and then pass those parameters into bam_set1.
… two most common use cases: independent uint32_t array and bam1_t record. Add documentation.
221a42b to
5f8e4b2
Compare
|
Squashed and rebased. Merged as 9a55e4e |
|
This PR didn't add any tests for the new API functions. In particular, it would be good to add a test based on the following test case to test/sam.c: static void test_cigar_api(void)
{
uint32_t *buf = NULL;
uint32_t n = 0;
fprintf(stderr, "returned %zu for *\n", sam_parse_cigar("*", NULL, &buf, &n));
// …etc…
}Ideally |
|
Duh, I did think about -1 vs 0 and concluded it's rather irrelevant as the nul cigar isn't valid. Why didn't I think about "*"? Too many other things on the go. Agreed given "*" then we ought to be returning -1. It's a better choice than faking it as 1 as people may use the return value as an alternative indicator for the number of bytes (x4) filled out. |
|
Also be sure to consider whether the currently-implemented behaviour for that test case is the desired behaviour 😄 |
On further thinking about it: |
HTSlib refuses to parse these following the merge of PR samtools/htslib#1169 (Extract the CIGAR parsing logic into separate methods). Fix them so the tests work again Thanks to John Marshall for reporting the problem in samtools.
The CIGAR parsing code, which was previously embedded in the
sam_parse1method, was moved into a separate static method (parse_cigar). This method is called by two wrapper methods,sam_parse_cigarandsam_parse_cigar_b, which are adapted to the destination type,uint32_tarray andbam1_tstruct, respectively.Another verification was also added to the parser, which will mark CIGAR strings with two consecutive non-digit characters as invalid. This change forced the correction of a test file.
Fixes #1147