Construct a BAM record from the component pieces#1159
Construct a BAM record from the component pieces#1159daviesrob merged 13 commits intosamtools:developfrom
Conversation
…) instead of its own logic.
…aux data was counted twice.
|
Thanks for this, it generally looks good. I think one improvement would be to add a check for On a more pedantic note, |
|
Nice, I'll add the validation for The NUL vs NULL issue, well, I don't think many people make that distinction. According to Internet RFC 20 from 1969, NUL is an abbreviation for NULL just like ASCII 7, BEL, is an abbreviation for BELL, which makes the teletype go bing. 😄 |
|
What Rob is saying is that the HTSlib source code does make that distinction. |
…am_construct(). Changed spelling of nul character.
|
Added validation for |
|
Thanks for the update. It might be possible to defeat the test Would you like to squash this down before I merge it? Or if you prefer I can do it, but I'll either need to force-push on to your branch so Github records the merge correctly; or squash it into a single commit. |
|
Needs tests with both even and odd lengths of Some error returns (the one due to
A bit late to the party 😄 but as this function takes an already-set-up Better names would be |
|
Coincidentally I was just testing the performance impact of this. Specifically what's the impact on CRAM decoding. 4% doesn't sound like a lot, but CRAM's got a lot going on so that's probably the best it could be. On something tighter it may be closer to that 50% performance hit. We could optimise the validation function. I tried this which for me worked better, but it's still at best maybe only half the slow down recovered. If I just commented out the call to validate_qname entirely my CRAM decode was only 0.7% slower, which is only a minor change (although still oddly slower; I'd expect no change really). |
|
|
|
In that case we should have a second implementation for CRAM, because right now if there is a CRAM file with an illegal char in the name we can no longer decode it, which in turn means we can't even pipe it into a sed script to fix it either. |
|
If we really care about this, my preference would be a validation function, which can optionally be turned on. It's applied after decoding a BAM object so we can have a custom "samtools validate" or on-the-fly validate via one of the --input-fmt-option parameters. |
|
As for I'd suggest dropping the cram/cram_decode.c and cram/cram_samtools.c changes from this PR, as they're not related to the primary purpose of adding a function. When this is merged, make a followup PR that changes Edited to add: also maybe split the function added in this PR into an internal function and a public function that does the validation and then calls the internal one; then the later PR can have |
|
I doubt many cram files exist with invalid names, if any, given that most names are generated by sequencing software that generally gets it right. I expect the most likely problem might be someone using an |
|
Updated according to comments:
|
|
Thanks for that. It seems to me that |
|
Thanks for the updates. It might be useful to keep the |
|
The naming is always the hardest part 😄 |
|
Thanks very much. Now squashed and merged before anyone else asks to have the name changed. |
|
I took the liberty of pushing a followup commit to tidy up the test's temporary filename. |
Added a bam_construct() function as requested and discussed in #789 .
I am a bit confused about what is and what isn't considered to be in the public API. Therefore I kept the existing bam_construct_seq() function with its current behavior, but replaced the implementation with a call to bam_construct(). If possible I'd like to remove that function altogether, as it's largely redundant.
This pull requests also fixes an issue with the return value of cram_to_bam() where the length of the aux data was counted twice.