Skip to content

Prevent corruption of the ID value of a PG line#1256

Merged
daviesrob merged 1 commit intosamtools:developfrom
valeriuo:header_pg_uid
Mar 24, 2021
Merged

Prevent corruption of the ID value of a PG line#1256
daviesrob merged 1 commit intosamtools:developfrom
valeriuo:header_pg_uid

Conversation

@valeriuo
Copy link
Copy Markdown
Contributor

Some applications add supplementary ID tags to a header @PG line. This change prevents the corruption of the proper (first) ID value of a @PG line, when other unrelated ID tags are present.

Fixes samtools/samtools#1393

@jmarshall
Copy link
Copy Markdown
Member

jmarshall commented Mar 22, 2021

This comes under the category of “heuristic to improve matters in this @PG-CL case”, so should be described as such. If a @PG header has two ID fields, there's no a priori reason to believe the first one — that just happens to be the right thing to do to improve matters for this bwa bug. (I think everyone would agree that duplicate tagged fields in headers are / ought to be invalid SAM, though unlike for records the spec does not say so — this is surely a buglet in the spec.)

If this change fixes the samtools merge problem, then it must be because this just makes the HTSlib header API happen to agree with the merge code that the first ID field should be believed. (Also, a more complete header API change would make changes to all three of @SQ-SN, @RG-ID, and @PG-ID processing.)

I was rather expecting the fix to this to involve changing the samtools merge header merging code, which originated prior to the HTSlib header API. How much of that samtools header merging code still does its own textual header manipulation? Could the problem be better fixed by fully converting the samtools merge code to do its work via the header API instead?

@valeriuo
Copy link
Copy Markdown
Contributor Author

This comes under the category of “heuristic to improve matters in this @PG-CL case”, so should be described as such. If a @PG header has two ID fields, there's no a priori reason to believe the first one — that just happens to be the right thing to do to improve matters for this bwa bug. (I think everyone would agree that duplicate tagged fields in headers are / ought to be invalid SAM, though unlike for records the spec does not say so — this is surely a buglet in the spec.)

Indeed. It is a "most common case" heuristic and I've added a warning message to inform the user. The spec does say: Each @PG line must have a unique ID., but it can be argued that the second ID tag belongs to the command that generated the @PG line. So there might be a case for clarification.

If this change fixes the samtools merge problem, then it must be because this just makes the HTSlib header API happen to agree with the merge code that the first ID field should be believed. (Also, a more complete header API change would make changes to all three of @SQ-SN, @RG-ID, and @PG-ID processing.)

The issue was entirely in the header API, due to the different ways sam_hdr_find_tag_pos() and sam_hdr_find_line_id() are searching for a tag value. A secondary PR that deals with the unique identifiers of @SQ and @RG lines will follow shortly.

I was rather expecting the fix to this to involve changing the samtools merge header merging code, which originated prior to the HTSlib header API. How much of that samtools header merging code still does its own textual header manipulation? Could the problem be better fixed by fully converting the samtools merge code to do its work via the header API instead?

The merge code has its own structures and methods that deal with the merging logic, but most of the low level header access has been delegated to the header API.

@jmarshall
Copy link
Copy Markdown
Member

jmarshall commented Mar 24, 2021

The spec does say: Each @PG line must have a unique ID., but it can be argued that the second ID tag belongs to the command that generated the @PG line.

IMHO that spec sentence is intended to mean that the IDs of different @PG lines must be different from each other (and it definitely needs clarifying and expanding). Incidentally, HTSJDK throws an error for the @PG ID:bwa … ID:H009994 line from samtools/samtools#1393 (comment) as it has disallowed duplicate header line fields with different values for over a decade.

BWA might like to think that the second ID tag is subordinate to the CL field, but from HTSlib's point of view they're all just tab-separated fields on a level playing field. Certainly that's how HTSJDK is treating this. Fortunately HTSlib's header API can be improved to work consistently off the first ID field on lines that have more than one, without expressing an opinion on whether such lines are wise or whether there's anything special about the second ID tag when it comes after a CL field.

The issue was entirely in the header API, due to the different ways sam_hdr_find_tag_pos() and sam_hdr_find_line_id() are searching for a tag value.

Okay, thanks — that makes this make sense. So sam_hdr_find_tag_pos() returns the first ID field on the line, therefore making sure the hash contains the first ID field means that sam_hdr_find_line_id() works consistently with sam_hdr_find_tag_pos() even for invalid header lines with multiple ID fields.

This BTW is the sort of information that it is very useful to capture in commit messages! 😄

(There is still a lack of error handling in samtools bam_sort.c: in the “can't happen” case of sam_hdr_find_line_id() failing, no error message is printed. This scenario is now explained, and after this change perhaps it really can't happen…)

…header

and issue a warning when multiple ID tags are encountered on the same line.

This change also enforces consistency across the header API methods, which
use the ID tag as an argument, by making them agree to always return the
first encountered ID value.
@daviesrob
Copy link
Copy Markdown
Member

It looks like this does the trick. It's fairly chatty when running on the test case I made for samtools/samtools#1393, so I expect people will notice multiple ID: keys now (as long as they inspect the output...)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

segfault merge operation

3 participants