Construct a BAM record from the component pieces by anderskaplan · Pull Request #1159 · samtools/htslib

anderskaplan · 2020-10-18T11:57:38Z

Added a bam_construct() function as requested and discussed in #789 .

I am a bit confused about what is and what isn't considered to be in the public API. Therefore I kept the existing bam_construct_seq() function with its current behavior, but replaced the implementation with a call to bam_construct(). If possible I'd like to remove that function altogether, as it's largely redundant.

This pull requests also fixes an issue with the return value of cram_to_bam() where the length of the aux data was counted twice.

…) instead of its own logic.

…aux data was counted twice.

daviesrob · 2020-10-19T16:56:06Z

Thanks for this, it generally looks good.

I think one improvement would be to add a check for data_len going over INT32_MAX which is the maximum possible value that can be stored in bam1_t::l_data. Unfortunately this isn't completely trivial because there's a possibility of overflow, especially on 32-bit platforms. The easiest solution might be to keep track of the amount of space left after accounting for each item. You could also have a more restrictive limit on l_seq (which I think should be (INT32_MAX-4)*2/3 = 1431655762, the 4 is a minimal qname) which would help ensure it doesn't overflow when working out how long it is.

On a more pedantic note, qname_nulls should really be qname_nuls as it's referring to the ASCII NUL character \0 rather than the NULL pointer (which is why l_extranul ends in a single "l").

anderskaplan · 2020-10-19T18:56:44Z

Nice, I'll add the validation for data_len.

The NUL vs NULL issue, well, I don't think many people make that distinction. According to Internet RFC 20 from 1969, NUL is an abbreviation for NULL just like ASCII 7, BEL, is an abbreviation for BELL, which makes the teletype go bing. 😄

jmarshall · 2020-10-19T19:09:15Z

What Rob is saying is that the HTSlib source code does make that distinction.

…am_construct(). Changed spelling of nul character.

anderskaplan · 2020-10-20T19:26:13Z

Added validation for data_len. And changed the null to nul -- I can't help that it looks like a spelling error to me, but it's more important to stick with the conventions of the code base.

daviesrob · 2020-10-21T14:28:41Z

Thanks for the update. It might be possible to defeat the test subtract_check_underflow(n_cigar * 4, &limit) on 32-bit platforms if n_cigar * 4 overflows, however as the associated cigar array would have to be bigger than the available address space I expect it would cause a problem before you get to this point - so probably not worth worrying about too much.

Would you like to squash this down before I merge it? Or if you prefer I can do it, but I'll either need to force-push on to your branch so Github records the merge correctly; or squash it into a single commit.

jmarshall · 2020-10-21T15:04:28Z

Needs tests with both even and odd lengths of seq.

Some error returns (the one due to realloc_bam_data()) set errno. IMHO it would be good if they all did.

sam_parse1() doesn't validate that QNAME contains only valid characters, and HTSlib generally does not validate such things. Is it really useful for this function to do so?

A bit late to the party 😄 but as this function takes an already-set-up bam1_t* (BTW other functions in htslib/sam.h just call this parameter b, not bam), calling it bam_construct() is not a great name especially for C++ programmers — construct means allocating a new object when none existed before.

Better names would be bam_assign() (à la C++) or bam_set1() (à la HTSlib).

jkbonfield · 2020-10-21T15:55:17Z

Coincidentally I was just testing the performance impact of this. Specifically what's the impact on CRAM decoding.
I found that this PR is around 4% slower to decode CRAM than develop. Looking deeper into it, bam_construct is 50% slower than the original bam_construct_seq, and the huge bulk of this is in the qname validation.

4% doesn't sound like a lot, but CRAM's got a lot going on so that's probably the best it could be. On something tighter it may be closer to that 50% performance hit.

We could optimise the validation function. I tried this which for me worked better, but it's still at best maybe only half the slow down recovered.

diff --git a/sam.c b/sam.c
index 7d04713..1397355 100644
--- a/sam.c
+++ b/sam.c
@@ -488,18 +488,48 @@ static void bam_cigar2rqlens(int n_cigar, const uint32_t *
cigar,
 static int validate_qname(size_t l_qname, const char *qname)
 {
     // SAM specification: the query name must be on the format [!-?A-~]{1,254}
-    if (l_qname > 254) {
+    static char vchar[256] = {
+        // 00
+        0, 0, 0, 0,  0, 0, 0, 0,   0, 0, 0, 0,   0, 0, 0, 0,
+        0, 0, 0, 0,  0, 0, 0, 0,   0, 0, 0, 0,   0, 0, 0, 0,
+        0, 1, 1, 1,  1, 1, 1, 1,   1, 1, 1, 1,   1, 1, 1, 1,
+        1, 1, 1, 1,  1, 1, 1, 1,   1, 1, 1, 1,   1, 1, 1, 1,
+        // 40
+        0, 1, 1, 1,  1, 1, 1, 1,   1, 1, 1, 1,   1, 1, 1, 1,
+        1, 1, 1, 1,  1, 1, 1, 1,   1, 1, 1, 1,   1, 1, 1, 1,
+        1, 1, 1, 1,  1, 1, 1, 1,   1, 1, 1, 1,   1, 1, 1, 1,
+        1, 1, 1, 1,  1, 1, 1, 1,   1, 1, 1, 1,   1, 1, 1, 0,
+        // 80
+        0, 0, 0, 0,  0, 0, 0, 0,   0, 0, 0, 0,   0, 0, 0, 0,
+        0, 0, 0, 0,  0, 0, 0, 0,   0, 0, 0, 0,   0, 0, 0, 0,
+        0, 0, 0, 0,  0, 0, 0, 0,   0, 0, 0, 0,   0, 0, 0, 0,
+        0, 0, 0, 0,  0, 0, 0, 0,   0, 0, 0, 0,   0, 0, 0, 0,
+        // C0
+        0, 0, 0, 0,  0, 0, 0, 0,   0, 0, 0, 0,   0, 0, 0, 0,
+        0, 0, 0, 0,  0, 0, 0, 0,   0, 0, 0, 0,   0, 0, 0, 0,
+        0, 0, 0, 0,  0, 0, 0, 0,   0, 0, 0, 0,   0, 0, 0, 0,
+        0, 0, 0, 0,  0, 0, 0, 0,   0, 0, 0, 0,   0, 0, 0, 0,
+    };
+
+    if (l_qname > 254)
         return -1;
-    }
 
-    int i;
-    for (i = 0; i < l_qname; i++) {
-        if (qname[i] < '!' || qname[i] > '~' || qname[i] == '@') {
-            return -1;
-        }
+    int i, v[4] = {1,1,1,1};
+    const unsigned char *uq = (const unsigned char *)qname;
+    for (i = 0; i < (l_qname & ~3); i+=4) {
+        v[0] &= vchar[uq[i+0]];
+        v[1] &= vchar[uq[i+1]];
+        v[2] &= vchar[uq[i+2]];
+        v[3] &= vchar[uq[i+3]];
     }
+    for (; i < l_qname; i++)
+        v[0] &= vchar[uq[i]];
 
-    return 0;
+    v[0] &= v[1];
+    v[0] &= v[2];
+    v[0] &= v[3];
+
+    return v[0] ? 0 : -1;
 }
 
 static int subtract_check_underflow(size_t length, size_t *limit)

If I just commented out the call to validate_qname entirely my CRAM decode was only 0.7% slower, which is only a minor change (although still oddly slower; I'd expect no change really).

daviesrob · 2020-10-21T15:56:00Z

sam_parse1() arguably should validate QNAMEs, and only doesn't because it prefers speed to strict specification conformance. As this function generates reads de-novo, it's probably a good thing that it doesn't let you create incorrect names.

jkbonfield · 2020-10-21T15:58:31Z

In that case we should have a second implementation for CRAM, because right now if there is a CRAM file with an illegal char in the name we can no longer decode it, which in turn means we can't even pipe it into a sed script to fix it either.

jkbonfield · 2020-10-21T15:59:38Z

If we really care about this, my preference would be a validation function, which can optionally be turned on. It's applied after decoding a BAM object so we can have a custom "samtools validate" or on-the-fly validate via one of the --input-fmt-option parameters.

jmarshall · 2020-10-21T16:04:46Z

As for bam_construct_seq(): it's not in htslib/*.h and it doesn't have HTSLIB_EXPORT, so it's not in HTSlib's public API. It is in the io_lib compatibility layer, but the two cram_decode.c files have diverged enough that it probably doesn't matter.

I'd suggest dropping the cram/cram_decode.c and cram/cram_samtools.c changes from this PR, as they're not related to the primary purpose of adding a function. When this is merged, make a followup PR that changes cram_to_bam() to use your new function directly instead of bam_construct_seq(), fixes up cram_to_bam()'s return value, and nukes cram/cram_samtools.c entirely.

Edited to add: also maybe split the function added in this PR into an internal function and a public function that does the validation and then calls the internal one; then the later PR can have cram_to_bam() calling the internal function.

daviesrob · 2020-10-21T17:26:25Z

I doubt many cram files exist with invalid names, if any, given that most names are generated by sequencing software that generally gets it right. I expect the most likely problem might be someone using an @ somewhere in the middle (the first character can never be @ thanks to its use in SAM headers). In case there are any such files around the best solution may be, as @jmarshall suggests, have an internal function that lets you choose if you want to validate the names, then make the public API call it with name validation enabled and the cram one with it disabled.

anderskaplan · 2020-10-23T10:52:25Z

Updated according to comments:

Changed the name to bam_assign. All the examples similar to bam_set1 that I could find in the code base were on the form bam_set_(something), so "assign" seems like the better choice in this case.
Removed the qname validation altogether, both because of its surprisingly high performance hit but also because it might not be the job of this function.
Removed the changes to cram_to_bam() and cram_construct_seq(). Will post another PR with the updates to those once this is merged.
Added test cases, setting errno.

jmarshall · 2020-10-23T11:20:18Z

Thanks for that.

It seems to me that bam_set1() would fit in with bam_init1/bam_destroy1/bam_copy1/ bam_dup1 and sam_parse1/sam_format1/sam_read1/sam_write1 very nicely. (And bam_ rather than sam_ as like those other existing bam_*1 functions it's purely about manipulating a bam1_t.) And _set paralleling all those bam_set_somefield functions (btw there are no HTSlib functions currently with “assign” in the name).

daviesrob · 2020-10-23T11:29:57Z

Thanks for the updates. It might be useful to keep the l_qname <= 254 check, as it's quick and could avoid some bizarre problems if the l_qname field wraps.

anderskaplan · 2020-10-23T11:45:42Z

The naming is always the hardest part 😄

daviesrob · 2020-10-23T13:44:00Z

Thanks very much. Now squashed and merged before anyone else asks to have the name changed.

jmarshall · 2020-10-23T14:28:27Z

I took the liberty of pushing a followup commit to tidy up the test's temporary filename.

anderskaplan added 5 commits October 18, 2020 13:38

Added bam_construct() function and unit tests.

db6f3ca

Updated bam_construct_seq() in cram_samtools.c to call bam_construct(…

b500d00

…) instead of its own logic.

Fixed a faulty return value in cram_to_bam() where the length of the …

e0278f3

…aux data was counted twice.

Fixed compiler warning-as-error

4ec96a7

Fixed memory leak in test case.

4d23a6e

Added size limit validation for the accumulated length of l_data to b…

643a635

…am_construct(). Changed spelling of nul character.

anderskaplan added 5 commits October 23, 2020 12:40

Added test cases for even and odd lengths of seq.

f27474c

Set errno on failure.

34ff7c8

Name change from bam_construct() to bam_assign().

9ea9f30

Removed qname validation.

965db43

Reverted changes to cram_decode.c and cram_samtools.c (for now).

594290e

anderskaplan added 2 commits October 23, 2020 13:43

Revived the qname length check, but not the content check.

669ad3c

Name change to bam_set1().

a13335d

daviesrob merged commit c917518 into samtools:develop Oct 23, 2020

anderskaplan deleted the bam_construct branch October 23, 2020 18:18

anderskaplan mentioned this pull request Oct 24, 2020

Replace bam_construct_seq() with bam_set1() #1164

Closed

jmarshall mentioned this pull request Nov 3, 2020

Extract the CIGAR parsing logic into separate methods #1169

Closed

Conversation

anderskaplan commented Oct 18, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

daviesrob commented Oct 19, 2020

Uh oh!

anderskaplan commented Oct 19, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

jmarshall commented Oct 19, 2020

Uh oh!

anderskaplan commented Oct 20, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

daviesrob commented Oct 21, 2020

Uh oh!

jmarshall commented Oct 21, 2020

Uh oh!

jkbonfield commented Oct 21, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

daviesrob commented Oct 21, 2020

Uh oh!

jkbonfield commented Oct 21, 2020

Uh oh!

jkbonfield commented Oct 21, 2020

Uh oh!

jmarshall commented Oct 21, 2020

Uh oh!

daviesrob commented Oct 21, 2020

Uh oh!

anderskaplan commented Oct 23, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

jmarshall commented Oct 23, 2020

Uh oh!

daviesrob commented Oct 23, 2020

Uh oh!

anderskaplan commented Oct 23, 2020

Uh oh!

daviesrob commented Oct 23, 2020

Uh oh!

jmarshall commented Oct 23, 2020

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

anderskaplan commented Oct 18, 2020 •

edited

Loading

anderskaplan commented Oct 19, 2020 •

edited

Loading

anderskaplan commented Oct 20, 2020 •

edited

Loading

jkbonfield commented Oct 21, 2020 •

edited

Loading

anderskaplan commented Oct 23, 2020 •

edited

Loading