Skip to content

Construct a BAM record from the component pieces#1159

Merged
daviesrob merged 13 commits intosamtools:developfrom
anderskaplan:bam_construct
Oct 23, 2020
Merged

Construct a BAM record from the component pieces#1159
daviesrob merged 13 commits intosamtools:developfrom
anderskaplan:bam_construct

Conversation

@anderskaplan
Copy link
Copy Markdown
Contributor

@anderskaplan anderskaplan commented Oct 18, 2020

Added a bam_construct() function as requested and discussed in #789 .

I am a bit confused about what is and what isn't considered to be in the public API. Therefore I kept the existing bam_construct_seq() function with its current behavior, but replaced the implementation with a call to bam_construct(). If possible I'd like to remove that function altogether, as it's largely redundant.

This pull requests also fixes an issue with the return value of cram_to_bam() where the length of the aux data was counted twice.

@daviesrob
Copy link
Copy Markdown
Member

Thanks for this, it generally looks good.

I think one improvement would be to add a check for data_len going over INT32_MAX which is the maximum possible value that can be stored in bam1_t::l_data. Unfortunately this isn't completely trivial because there's a possibility of overflow, especially on 32-bit platforms. The easiest solution might be to keep track of the amount of space left after accounting for each item. You could also have a more restrictive limit on l_seq (which I think should be (INT32_MAX-4)*2/3 = 1431655762, the 4 is a minimal qname) which would help ensure it doesn't overflow when working out how long it is.

On a more pedantic note, qname_nulls should really be qname_nuls as it's referring to the ASCII NUL character \0 rather than the NULL pointer (which is why l_extranul ends in a single "l").

@anderskaplan
Copy link
Copy Markdown
Contributor Author

anderskaplan commented Oct 19, 2020

Nice, I'll add the validation for data_len.

The NUL vs NULL issue, well, I don't think many people make that distinction. According to Internet RFC 20 from 1969, NUL is an abbreviation for NULL just like ASCII 7, BEL, is an abbreviation for BELL, which makes the teletype go bing. 😄

@jmarshall
Copy link
Copy Markdown
Member

What Rob is saying is that the HTSlib source code does make that distinction.

…am_construct().

Changed spelling of nul character.
@anderskaplan
Copy link
Copy Markdown
Contributor Author

anderskaplan commented Oct 20, 2020

Added validation for data_len. And changed the null to nul -- I can't help that it looks like a spelling error to me, but it's more important to stick with the conventions of the code base.

@daviesrob
Copy link
Copy Markdown
Member

Thanks for the update. It might be possible to defeat the test subtract_check_underflow(n_cigar * 4, &limit) on 32-bit platforms if n_cigar * 4 overflows, however as the associated cigar array would have to be bigger than the available address space I expect it would cause a problem before you get to this point - so probably not worth worrying about too much.

Would you like to squash this down before I merge it? Or if you prefer I can do it, but I'll either need to force-push on to your branch so Github records the merge correctly; or squash it into a single commit.

@jmarshall
Copy link
Copy Markdown
Member

Needs tests with both even and odd lengths of seq.

Some error returns (the one due to realloc_bam_data()) set errno. IMHO it would be good if they all did.

sam_parse1() doesn't validate that QNAME contains only valid characters, and HTSlib generally does not validate such things. Is it really useful for this function to do so?

A bit late to the party 😄 but as this function takes an already-set-up bam1_t* (BTW other functions in htslib/sam.h just call this parameter b, not bam), calling it bam_construct() is not a great name especially for C++ programmers — construct means allocating a new object when none existed before.

Better names would be bam_assign() (à la C++) or bam_set1() (à la HTSlib).

@jkbonfield
Copy link
Copy Markdown
Contributor

jkbonfield commented Oct 21, 2020

Coincidentally I was just testing the performance impact of this. Specifically what's the impact on CRAM decoding.
I found that this PR is around 4% slower to decode CRAM than develop. Looking deeper into it, bam_construct is 50% slower than the original bam_construct_seq, and the huge bulk of this is in the qname validation.

4% doesn't sound like a lot, but CRAM's got a lot going on so that's probably the best it could be. On something tighter it may be closer to that 50% performance hit.

We could optimise the validation function. I tried this which for me worked better, but it's still at best maybe only half the slow down recovered.

diff --git a/sam.c b/sam.c
index 7d04713..1397355 100644
--- a/sam.c
+++ b/sam.c
@@ -488,18 +488,48 @@ static void bam_cigar2rqlens(int n_cigar, const uint32_t *
cigar,
 static int validate_qname(size_t l_qname, const char *qname)
 {
     // SAM specification: the query name must be on the format [!-?A-~]{1,254}
-    if (l_qname > 254) {
+    static char vchar[256] = {
+        // 00
+        0, 0, 0, 0,  0, 0, 0, 0,   0, 0, 0, 0,   0, 0, 0, 0,
+        0, 0, 0, 0,  0, 0, 0, 0,   0, 0, 0, 0,   0, 0, 0, 0,
+        0, 1, 1, 1,  1, 1, 1, 1,   1, 1, 1, 1,   1, 1, 1, 1,
+        1, 1, 1, 1,  1, 1, 1, 1,   1, 1, 1, 1,   1, 1, 1, 1,
+        // 40
+        0, 1, 1, 1,  1, 1, 1, 1,   1, 1, 1, 1,   1, 1, 1, 1,
+        1, 1, 1, 1,  1, 1, 1, 1,   1, 1, 1, 1,   1, 1, 1, 1,
+        1, 1, 1, 1,  1, 1, 1, 1,   1, 1, 1, 1,   1, 1, 1, 1,
+        1, 1, 1, 1,  1, 1, 1, 1,   1, 1, 1, 1,   1, 1, 1, 0,
+        // 80
+        0, 0, 0, 0,  0, 0, 0, 0,   0, 0, 0, 0,   0, 0, 0, 0,
+        0, 0, 0, 0,  0, 0, 0, 0,   0, 0, 0, 0,   0, 0, 0, 0,
+        0, 0, 0, 0,  0, 0, 0, 0,   0, 0, 0, 0,   0, 0, 0, 0,
+        0, 0, 0, 0,  0, 0, 0, 0,   0, 0, 0, 0,   0, 0, 0, 0,
+        // C0
+        0, 0, 0, 0,  0, 0, 0, 0,   0, 0, 0, 0,   0, 0, 0, 0,
+        0, 0, 0, 0,  0, 0, 0, 0,   0, 0, 0, 0,   0, 0, 0, 0,
+        0, 0, 0, 0,  0, 0, 0, 0,   0, 0, 0, 0,   0, 0, 0, 0,
+        0, 0, 0, 0,  0, 0, 0, 0,   0, 0, 0, 0,   0, 0, 0, 0,
+    };
+
+    if (l_qname > 254)
         return -1;
-    }
 
-    int i;
-    for (i = 0; i < l_qname; i++) {
-        if (qname[i] < '!' || qname[i] > '~' || qname[i] == '@') {
-            return -1;
-        }
+    int i, v[4] = {1,1,1,1};
+    const unsigned char *uq = (const unsigned char *)qname;
+    for (i = 0; i < (l_qname & ~3); i+=4) {
+        v[0] &= vchar[uq[i+0]];
+        v[1] &= vchar[uq[i+1]];
+        v[2] &= vchar[uq[i+2]];
+        v[3] &= vchar[uq[i+3]];
     }
+    for (; i < l_qname; i++)
+        v[0] &= vchar[uq[i]];
 
-    return 0;
+    v[0] &= v[1];
+    v[0] &= v[2];
+    v[0] &= v[3];
+
+    return v[0] ? 0 : -1;
 }
 
 static int subtract_check_underflow(size_t length, size_t *limit)

If I just commented out the call to validate_qname entirely my CRAM decode was only 0.7% slower, which is only a minor change (although still oddly slower; I'd expect no change really).

@daviesrob
Copy link
Copy Markdown
Member

sam_parse1() arguably should validate QNAMEs, and only doesn't because it prefers speed to strict specification conformance. As this function generates reads de-novo, it's probably a good thing that it doesn't let you create incorrect names.

@jkbonfield
Copy link
Copy Markdown
Contributor

In that case we should have a second implementation for CRAM, because right now if there is a CRAM file with an illegal char in the name we can no longer decode it, which in turn means we can't even pipe it into a sed script to fix it either.

@jkbonfield
Copy link
Copy Markdown
Contributor

If we really care about this, my preference would be a validation function, which can optionally be turned on. It's applied after decoding a BAM object so we can have a custom "samtools validate" or on-the-fly validate via one of the --input-fmt-option parameters.

@jmarshall
Copy link
Copy Markdown
Member

As for bam_construct_seq(): it's not in htslib/*.h and it doesn't have HTSLIB_EXPORT, so it's not in HTSlib's public API. It is in the io_lib compatibility layer, but the two cram_decode.c files have diverged enough that it probably doesn't matter.

I'd suggest dropping the cram/cram_decode.c and cram/cram_samtools.c changes from this PR, as they're not related to the primary purpose of adding a function. When this is merged, make a followup PR that changes cram_to_bam() to use your new function directly instead of bam_construct_seq(), fixes up cram_to_bam()'s return value, and nukes cram/cram_samtools.c entirely.

Edited to add: also maybe split the function added in this PR into an internal function and a public function that does the validation and then calls the internal one; then the later PR can have cram_to_bam() calling the internal function.

@daviesrob
Copy link
Copy Markdown
Member

I doubt many cram files exist with invalid names, if any, given that most names are generated by sequencing software that generally gets it right. I expect the most likely problem might be someone using an @ somewhere in the middle (the first character can never be @ thanks to its use in SAM headers). In case there are any such files around the best solution may be, as @jmarshall suggests, have an internal function that lets you choose if you want to validate the names, then make the public API call it with name validation enabled and the cram one with it disabled.

@anderskaplan
Copy link
Copy Markdown
Contributor Author

anderskaplan commented Oct 23, 2020

Updated according to comments:

  • Changed the name to bam_assign. All the examples similar to bam_set1 that I could find in the code base were on the form bam_set_(something), so "assign" seems like the better choice in this case.
  • Removed the qname validation altogether, both because of its surprisingly high performance hit but also because it might not be the job of this function.
  • Removed the changes to cram_to_bam() and cram_construct_seq(). Will post another PR with the updates to those once this is merged.
  • Added test cases, setting errno.

@jmarshall
Copy link
Copy Markdown
Member

Thanks for that.

It seems to me that bam_set1() would fit in with bam_init1/bam_destroy1/bam_copy1/ bam_dup1 and sam_parse1/sam_format1/sam_read1/sam_write1 very nicely. (And bam_ rather than sam_ as like those other existing bam_*1 functions it's purely about manipulating a bam1_t.) And _set paralleling all those bam_set_somefield functions (btw there are no HTSlib functions currently with “assign” in the name).

@daviesrob
Copy link
Copy Markdown
Member

Thanks for the updates. It might be useful to keep the l_qname <= 254 check, as it's quick and could avoid some bizarre problems if the l_qname field wraps.

@anderskaplan
Copy link
Copy Markdown
Contributor Author

The naming is always the hardest part 😄

@daviesrob daviesrob merged commit c917518 into samtools:develop Oct 23, 2020
@daviesrob
Copy link
Copy Markdown
Member

Thanks very much. Now squashed and merged before anyone else asks to have the name changed.

@jmarshall
Copy link
Copy Markdown
Member

I took the liberty of pushing a followup commit to tidy up the test's temporary filename.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants