Skip to content

A fasta/fastq sanitizer that redacts illegal chars.#1314

Merged
daviesrob merged 3 commits intosamtools:developfrom
jkbonfield:fasta-sanitize
Dec 17, 2020
Merged

A fasta/fastq sanitizer that redacts illegal chars.#1314
daviesrob merged 3 commits intosamtools:developfrom
jkbonfield:fasta-sanitize

Conversation

@jkbonfield
Copy link
Copy Markdown
Contributor

The SAM and VCF specs ban certain characters in reference names. Trying to work around this after alignment is problematic for numerous reasons, so it is preferable to rename your reference sequences before alignment.

This tool does this.

I'm not sure on the -t bit to stop at the tab instead of first white-space. With hindsight I think this is a misunderstanding from the test data we were given, which had already been modified to remove the spaces so they didn't need to be processed afterall. (Thank goodness)

I can remove that if needed, or just leave it as a hidden option. The script is pretty naive, but seems to work and it's totally standalone.

The SAM and VCF specs ban certain characters in reference names.
Trying to work around this after alignment is problematic for numerous
reasons, so it is preferable to rename your reference sequences before
alignment.

This tool does this.
@daviesrob daviesrob self-assigned this Dec 15, 2020

my $name_re = $name_whitespace;
if ($ARGV[0] eq "-t") {
pop(@ARGV);
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This doesn't work:

$ ./misc/fasta-sanitize.pl -t /tmp/test.fa
Can't open -t: No such file or directory at ./misc/fasta-sanitize.pl line 68.

I think pop should be shift.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oops, one of those last minute configuration tweaks that I clearly didn't test!

I just ditched it entirely as it was added due to user test data having spaces in names, but it later transpired it broke everything else so they'd already ran a sed script to remove all of those anyway. An ill conceived attempt to work around something that's not an issue.


# Seq
print;
$seq_len += length($_);
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

$_ will include any new-lines, so end-of-qual detection could be defeated in the event that the seq and qual parts of a fastq file are wrapped at different line lengths. It would be safer to call chomp to remove the line ending from the length calculation.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a sneaky attack method!

Fixed. :-)

@daviesrob daviesrob merged commit af811a6 into samtools:develop Dec 17, 2020
@daviesrob
Copy link
Copy Markdown
Member

Thanks. Fixes have been squashed and merged in. The random appveyor failure was unrelated, and went away after rebasing.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants