Extract the CIGAR parsing logic into separate methods by valeriuo · Pull Request #1169 · samtools/htslib

valeriuo · 2020-10-30T16:56:27Z

The CIGAR parsing code, which was previously embedded in the sam_parse1 method, was moved into a separate static method (parse_cigar). This method is called by two wrapper methods, sam_parse_cigar and sam_parse_cigar_b, which are adapted to the destination type, uint32_t array and bam1_t struct, respectively.

Another verification was also added to the parser, which will mark CIGAR strings with two consecutive non-digit characters as invalid. This change forced the correction of a test file.

Fixes #1147

jkbonfield

I did test it though and it works and has no detrimental impacts on speed (if anything it's marginally faster for SAM parsing, but minimal diff).

jkbonfield · 2020-11-02T16:05:36Z

sam.c

+    n_cigar = read_ncigar(in);
+    if (!n_cigar) return 0;
+    if (n_cigar > (*a_mem - *a_len)) {
+        uint32_t *a_tmp = realloc(*a_cigar, 1.5*(n_cigar + *a_len)*sizeof(**a_cigar));


Given this isn't a dynamic read-and-grow cycle, I don't see the point of the 1.5*mem bit. It's previously computed n_cigar so this should be known exactly. I'm also not sure why it's n_cigar + *a_len and not just n_cigar?

Or is this for when called in a loop with the same buffers so we can grow to beyiond the biggest we've seen and give room before finding the next biggest? (If so the + *a_len bit looks like it'll grow and grow and grow.) Generally though I think the htslib way for such reuse is kstring, but that doesn't seem to be a hard rule (bam1_t doesn't work that way for example).

The alternative is to do away with read_ncigar and keep the 1.5*mem bit so it dynamically grows as it's parsing the CIGAR string. That makes it one pass instead of two, at the expense of potentially doing memcpy internally during reallocs.

Anyone else have views on this?

I wanted to avoid a one pass read of the string, because that would have degraded the performance of sam_parse1 with repeated reallocs. However, since this is a separate API method, we could look into alternative strategies.

jkbonfield · 2020-11-02T16:10:27Z

sam.c

+        }
+    }
+
+    if (!(diff = parse_cigar(in, *a_cigar + *a_len, n_cigar))) return 0;


Why *a_cigar + *a_len?

Is this function intented to be ran on multiple CIGAR strings so it can concatenate them? If so the documentation needs updating.

My assumptions are it would fill out a_cigar from pos zero and return *a_len as the length filled out, but that's not what it's doing.

It offers the flexibility of appending to an existing CIGAR array of a_len length. We can drop the feature, if we see no use for it.

Flexibility is good, but I'm struggling to find a use case. Even if we wanted to merge two sequences and their associated CIGAR strings, we'd probably still be wanting to do it manually so we can futz around with merging trailing and leader M ops (for example), or removing hard-clip ops that may end up embedded in the middle otherwise.

So my instinct here is it's not a feature we want. Others care to weigh in?

jkbonfield · 2020-11-02T16:11:47Z

sam.c

+    return n_cigar;
+}
+
+size_t sam_parse_cigar_b(const char *in, char **end, bam1_t *b) {


I don't really like the name of this function, but am struggling to think of anything I particularly like.

sam_parse_cigar_bam or sam_parse_cigar2bam maybe?

I'm not really sure it's necessary as a public function though given it could just be a tiny wrapper around sam_parse_cigar or parse_cigar?

I knew the names would be a point of contention, so I am open to alternatives.

Other functions with names like …_cigar2xyz take n_cigar/const uint32_t * as input, so using …cigar2… in the name wouldn't fit that pattern.

bam_parse_cigar() would also be a contender: there are multiple examples of sam_abc() is the general function, bam_abc() is an equivalent limited to BAM in some sense (here, in that it's for bam1_t, like other bam_ functions as discussed in #1159 (comment)).

However there is an extent to which wanting this as a public function really shows a deficiency in the recently-added bam_set1(): perhaps that should be able to take either a size_t n_cigar, const uint32_t *cigar or a const char *cigar_string. It does after all take other unparsed string parameters e.g. seq, qual…

I think there is scope for wanting both text construction and binary construction.

Take CRAM for example. It's constructing the cigar data from the feature codes. So it already has data like code = 'D', length = 10, etc. It would be ugly to have to construct "10D20M" etc in a string only to parse it back again. Also consider cases where we want to do manipulation to an existing record - eg turning soft-clips into hard-clips. (The sequence is problematic given it's annoying nibblyness, but we have to live with that so it's inevitable we end up in 1-byte sequence world or have a two code paths for len%2. ) We already have int based cigars, so again having to go into and out of string form would be poor. Other sources of data may differ though and start from strings.

If we only have one interface, the binary form is better as you can first call one function to parse and then pass those parameters into bam_set1.

… two most common use cases: independent uint32_t array and bam1_t record. Add documentation.

jkbonfield · 2020-11-18T09:50:53Z

Squashed and rebased. Merged as 9a55e4e

jmarshall · 2020-11-18T10:48:17Z

This PR didn't add any tests for the new API functions.

In particular, it would be good to add a test based on the following test case to test/sam.c:

static void test_cigar_api(void)
{
    uint32_t *buf = NULL;
    uint32_t n = 0;
    fprintf(stderr, "returned %zu for *\n", sam_parse_cigar("*", NULL, &buf, &n));
    // …etc…
}

Ideally sam_parse_cigar() would return ssize_t with -1 for error. Alternatively (if there's much precedent for non-pointer-returning HTSlib functions returning 0 on error?) it would be okay to semi-artificially return number-of-operators = 1 for this "*" case.

jkbonfield · 2020-11-18T12:00:36Z

Duh, I did think about -1 vs 0 and concluded it's rather irrelevant as the nul cigar isn't valid. Why didn't I think about "*"? Too many other things on the go.

Agreed given "*" then we ought to be returning -1. It's a better choice than faking it as 1 as people may use the return value as an alternative indicator for the number of bytes (x4) filled out.

jmarshall · 2020-11-18T12:11:45Z

Also be sure to consider whether the currently-implemented behaviour for that test case is the desired behaviour 😄

[E::parse_cigar] CIGAR length invalid at position 1 (*)

jmarshall · 2020-11-18T12:22:19Z

people may use the return value as an alternative indicator for the number of bytes (x4) filled out.

On further thinking about it: a_mem is a max / buffer-capacity measurement, not an n-used measurement. So actually the sam_parse_cigar return value is the only indicator of the amount of space filled out.

HTSlib refuses to parse these following the merge of PR samtools/htslib#1169 (Extract the CIGAR parsing logic into separate methods). Fix them so the tests work again Thanks to John Marshall for reporting the problem in samtools.

jkbonfield reviewed Nov 2, 2020

View reviewed changes

valeriuo added 2 commits November 3, 2020 07:40

Extract the CIGAR parsing logic into separate methods for each of the…

dd4bac3

… two most common use cases: independent uint32_t array and bam1_t record. Add documentation.

Rebase and address some of the review comments.

5f8e4b2

valeriuo force-pushed the cigar_parser_2 branch from 221a42b to 5f8e4b2 Compare November 3, 2020 12:20

daviesrob assigned jkbonfield Nov 17, 2020

jkbonfield closed this Nov 18, 2020

jmarshall mentioned this pull request Nov 18, 2020

List --write-index in samtools sort usage [trivial] samtools/samtools#1342

Merged

daviesrob mentioned this pull request Nov 19, 2020

Fix invalid CIGAR strings samtools/samtools#1343

Merged

valeriuo mentioned this pull request Nov 25, 2020

Adjust parser for undefined CIGARs #1182

Closed

valeriuo deleted the cigar_parser_2 branch December 3, 2020 13:18

Conversation

valeriuo commented Oct 30, 2020

Uh oh!

jkbonfield left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

jkbonfield commented Nov 18, 2020

Uh oh!

jmarshall commented Nov 18, 2020

Uh oh!

jkbonfield commented Nov 18, 2020

Uh oh!

jmarshall commented Nov 18, 2020

Uh oh!

jmarshall commented Nov 18, 2020

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

jkbonfield left a comment •

edited

Loading