A fasta/fastq sanitizer that redacts illegal chars. #1314
daviesrob merged 3 commits into samtools:develop from
The SAM and VCF specs ban certain characters in reference names. Trying to work around this after alignment is problematic for numerous reasons, so it is preferable to rename your reference sequences before alignment. This tool does this.
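As a rough illustration of the renaming described above, here is a minimal sketch. It is not taken from fasta-sanitize.pl: the helper name `sanitize_name` and the choice of "_" as the replacement character are assumptions made for this example. The character rules follow the SAM spec's reference-name restriction (printable ASCII, excluding backslash, comma, quotes and brackets, and not starting with "*" or "=").

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Hypothetical helper, not the script's actual code.
# Replaces characters that the SAM spec disallows in reference names with "_"
# (the replacement character is an assumption for this sketch).
sub sanitize_name {
    my ($name) = @_;
    # Anything outside printable ASCII [!-~], plus \ , " ` ' ( ) [ ] { } < >
    $name =~ s/[^!-~]|[\\,"`'()\[\]{}<>]/_/g;
    # A name may not begin with "*" or "=".
    $name =~ s/^[*=]/_/;
    return $name;
}

print sanitize_name("chr1,alt(2)"), "\n";   # prints chr1_alt_2_
print sanitize_name("=HLA*01"), "\n";       # prints _HLA*01
```

Note that "*" is only rejected as the first character, so it survives in the second example.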
misc/fasta-sanitize.pl
    my $name_re = $name_whitespace;
    if ($ARGV[0] eq "-t") {
        pop(@ARGV);
This doesn't work:
$ ./misc/fasta-sanitize.pl -t /tmp/test.fa
Can't open -t: No such file or directory at ./misc/fasta-sanitize.pl line 68.
I think pop should be shift.
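To make the difference concrete, here is a small sketch of the suggested change. The `parse_args` wrapper and the `$tab_mode` flag are hypothetical names used only for this example; the script itself works on `@ARGV` directly.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the script's option handling.
sub parse_args {
    my @args = @_;
    my $tab_mode = 0;
    if (@args && $args[0] eq "-t") {
        # shift removes the first element ("-t").
        # pop would remove the last element (the filename) and leave "-t"
        # behind to be opened as a file, which matches the error above.
        shift(@args);
        $tab_mode = 1;
    }
    return ($tab_mode, @args);
}

my ($tab_mode, @files) = parse_args("-t", "/tmp/test.fa");
print "tab_mode=$tab_mode files=@files\n";   # prints tab_mode=1 files=/tmp/test.fa
```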
Oops, one of those last-minute configuration tweaks that I clearly didn't test!
I just ditched it entirely. It was added because the user's test data had spaces in names, but it later transpired that this broke everything else, so they'd already run a sed script to remove all of those anyway. An ill-conceived attempt to work around something that's not an issue.
    # Seq
    print;
    $seq_len += length($_);
$_ will include any newlines, so end-of-qual detection could be defeated if the seq and qual parts of a fastq file are wrapped at different line lengths. It would be safer to call chomp so that the line ending is excluded from the length calculation.
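A small sketch of the problem, assuming end-of-qual is detected by comparing accumulated seq and qual lengths. The helpers `raw_length` and `chomped_length` are hypothetical and exist only for this example.

```perl
use strict;
use warnings;

# Hypothetical helpers, not the script's code.
# Sum of line lengths with the line endings still attached.
sub raw_length {
    my $n = 0;
    $n += length($_) for @_;
    return $n;
}

# Sum of line lengths after chomp has removed each line ending.
sub chomped_length {
    my $n = 0;
    for my $line (@_) {
        my $copy = $line;   # copy so the caller's data is left untouched
        chomp($copy);
        $n += length($copy);
    }
    return $n;
}

my @seq  = ("ACGT\n", "ACGT\n", "AC\n");   # 10 bases wrapped at 4 per line
my @qual = ("IIIIIIIIII\n");               # 10 qualities on a single line

printf "raw:     seq=%d qual=%d\n", raw_length(@seq), raw_length(@qual);
printf "chomped: seq=%d qual=%d\n", chomped_length(@seq), chomped_length(@qual);
# prints:
# raw:     seq=13 qual=11
# chomped: seq=10 qual=10
```

With the raw counts the qual total never reaches the seq total on the last real quality line, so a parser relying on that comparison would keep reading into the next record.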
That's a sneaky attack method!
Fixed. :-)
Thanks. Fixes have been squashed and merged in. The random AppVeyor failure was unrelated, and went away after rebasing.
I'm not sure about the -t option, which stops at the first tab instead of the first whitespace. With hindsight I think this came from a misunderstanding of the test data we were given, which had already been modified to remove the spaces, so they didn't need to be processed after all. (Thank goodness)
I can remove that if needed, or just leave it as a hidden option. The script is pretty naive, but seems to work and it's totally standalone.