
Fix reading UTF-8 encoded sample names when char is signed #1237

Merged
valeriuo merged 1 commit into samtools:develop from daviesrob:utf8-samples on Feb 17, 2021

Conversation

@daviesrob
Member

The trick used in bcf_hdr_parse_sample_line() to rapidly find tabs and newlines could be defeated by UTF-8 characters outside the Basic Latin range on platforms where "char" is signed (like x86). It's currently not clear if VCF intends to allow these, but the 4.3 specification does allow UTF-8 and it's easy enough to support. Fix by casting to unsigned when making the comparison.

Modifies formatcols.vcf to include a UTF-8 character for a round-trip test.

Fixes samtools/bcftools#1408

@valeriuo merged commit 8127bfc into samtools:develop on Feb 17, 2021
@daviesrob deleted the utf8-samples branch on February 17, 2021 16:01

Development

Successfully merging this pull request may close these issues.

Possible bug in htslib/bcftools 1.1: [E::bcf_hdr_add_sample_len] Duplicated sample name
