A fasta/fastq sanitizer that redacts illegal chars. by jkbonfield · Pull Request #1314 · samtools/samtools

jkbonfield · 2020-09-21T08:21:03Z

The SAM and VCF specs ban certain characters in reference names. Trying to work around this after alignment is problematic for numerous reasons, so it is preferable to rename your reference sequences before alignment.

This tool does this.

I'm not sure about the -t option, which stops the name at the first tab instead of the first whitespace. With hindsight I think it came from a misunderstanding of the test data we were given, which had already been modified to remove the spaces, so they didn't need to be processed after all. (Thank goodness)

I can remove that if needed, or just leave it as a hidden option. The script is pretty naive, but seems to work and it's totally standalone.
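As a rough illustration of the renaming described above, here is a minimal sketch and not the script from this PR. It assumes the rule from the SAM specification that reference names may use printable ASCII except backslashes, commas, quotation marks and brackets, and may not start with '*' or '='. The sub names `sanitize_name` and `sanitize_header`, and the choice of '_' as the replacement character, are assumptions made for the example.

```perl
use strict;
use warnings;

# Hypothetical helper: replace characters that the SAM spec bans in
# reference names. The '_' replacement is an assumption for this sketch.
sub sanitize_name {
    my ($name) = @_;
    # Anything outside printable ASCII '!'..'~', plus the banned punctuation.
    $name =~ s/[^\x21-\x7e]|[\\,"`'()\[\]{}<>]/_/g;
    # '*' and '=' are legal, but not as the first character.
    $name =~ s/^[*=]/_/;
    return $name;
}

# Hypothetical helper: rewrite only the name part of a ">name description"
# (or "@name description") header line, up to the first whitespace.
sub sanitize_header {
    my ($line) = @_;
    my ($prefix, $name, $rest) = $line =~ /^([>\@])(\S*)(.*)$/s
        or return $line;    # not a header line: pass it through unchanged
    return $prefix . sanitize_name($name) . $rest;
}
```

For example, `sanitize_header(">seq{1} desc, here\n")` leaves the description alone and rewrites only the name. Note that a FASTQ quality line may also begin with '@', so a real tool needs to track where it is in the record before treating a line as a header.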

daviesrob · 2020-12-15T15:33:37Z

misc/fasta-sanitize.pl

+
+my $name_re = $name_whitespace;
+if ($ARGV[0] eq "-t") {
+    pop(@ARGV);


This doesn't work:

$ ./misc/fasta-sanitize.pl -t /tmp/test.fa
Can't open -t: No such file or directory at ./misc/fasta-sanitize.pl line 68.

I think pop should be shift.
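To spell out the difference: `pop` removes the last element of an array, so here it discards the filename and leaves "-t" behind to be opened as a file, which matches the error above. `shift` removes the first element, which is the option itself. A minimal sketch follows, using a hypothetical `parse_args` helper that works on a copy of the arguments rather than on @ARGV directly:

```perl
use strict;
use warnings;

# Hypothetical helper: returns (tab_mode, remaining arguments).
sub parse_args {
    my @args = @_;
    my $tab_mode = 0;
    if (@args && $args[0] eq "-t") {
        shift(@args);    # drop the FIRST element, i.e. the option itself
        $tab_mode = 1;
    }
    return ($tab_mode, @args);
}

my ($tab_mode, @files) = parse_args("-t", "/tmp/test.fa");
print "tab_mode=$tab_mode files=@files\n";    # prints: tab_mode=1 files=/tmp/test.fa
```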

jkbonfield · 2020-12-16

Oops, one of those last minute configuration tweaks that I clearly didn't test!

I just ditched it entirely. It was added because the user's test data had spaces in names, but it later transpired that those spaces broke everything else, so they'd already run a sed script to remove them all anyway. An ill-conceived attempt to work around something that's not an issue.

daviesrob · 2020-12-15T15:39:26Z

misc/fasta-sanitize.pl

+
+        # Seq
+        print;
+        $seq_len += length($_);


$_ will include any new-lines, so end-of-qual detection could be defeated in the event that the seq and qual parts of a fastq file are wrapped at different line lengths. It would be safer to call chomp to remove the line ending from the length calculation.
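A small sketch of the failure mode, assuming a record whose sequence is wrapped at 4 characters and whose quality string is wrapped at 8. The `total_length` helper is hypothetical and only illustrates the length calculation, not the script's actual parsing loop:

```perl
use strict;
use warnings;

# Hypothetical helper: sum the lengths of some lines, optionally
# stripping the line ending first (as chomp does).
sub total_length {
    my ($strip, @lines) = @_;
    my $len = 0;
    for my $line (@lines) {
        chomp($line) if $strip;    # remove the trailing newline, if any
        $len += length($line);
    }
    return $len;
}

my @seq  = ("ACGT\n", "ACGT\n");    # 8 bases on two lines
my @qual = ("IIIIIIII\n");          # 8 quality values on one line

printf "raw:     seq=%d qual=%d\n", total_length(0, @seq), total_length(0, @qual);
printf "chomped: seq=%d qual=%d\n", total_length(1, @seq), total_length(1, @qual);
# prints:
# raw:     seq=10 qual=9
# chomped: seq=8 qual=8
```

With the newlines counted, the quality length never equals the sequence length, so a parser that waits for the two to match would run past the end of the record. Note that chomp only removes the current input record separator, so CRLF line endings would need separate handling.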

jkbonfield · 2020-12-16

That's a sneaky attack method!

Fixed. :-)

daviesrob · 2020-12-17T12:23:37Z

Thanks. Fixes have been squashed and merged in. The random AppVeyor failure was unrelated, and went away after rebasing.

daviesrob self-assigned this Dec 15, 2020

daviesrob reviewed Dec 15, 2020

jkbonfield added 2 commits December 16, 2020 16:41

Remove ill-conceived -t option

89bacd6

Add chomp to handle differing line wrapping of seq & qual

7c08d79

daviesrob merged commit af811a6 into samtools:develop Dec 17, 2020

daviesrob merged 3 commits into samtools:develop from jkbonfield:fasta-sanitize
