Closed
Description
Hi!
I'm trying to extract sample records from a FASTQ file.
All of its record names are unique:
(base) user@host:/folder# seqkit seq -n records.fastq|wc -l
2595
(base) user@host:/folder# seqkit seq -n records.fastq|sort|uniq|wc -l
2595
but seqkit sample -n gives me the wrong number of records:
982
(base) user@host:/folder# N=100;seqkit sample -n $N records.fastq| seqkit seq -n |wc -l
[INFO] sample by number
[INFO] loading all sequences into memory...
[INFO] 85 sequences outputted
85
(base) user@host:/folder# N=800;seqkit sample -n $N records.fastq| seqkit seq -n |wc -l
[INFO] sample by number
[INFO] loading all sequences into memory...
[INFO] 799 sequences outputted
799
(base) user@host:/folder# N=10;seqkit sample -n $N records.fastq| seqkit seq -n |wc -l
[INFO] sample by number
[INFO] loading all sequences into memory...
[INFO] 9 sequences outputted
9
With --two-pass it does a better job, but it still occasionally gives the wrong number:
(base) user@host:/folder# N=10;seqkit sample --two-pass -n $N records.fastq| seqkit seq -n |wc -l
[INFO] sample by number
[INFO] first pass: counting seq number
[INFO] seq number: 2595
[INFO] second pass: reading and sampling
[INFO] 10 sequences outputted
10
(base) user@host:/folder# N=100;seqkit sample --two-pass -n $N records.fastq| seqkit seq -n |wc -l
[INFO] sample by number
[INFO] first pass: counting seq number
[INFO] seq number: 2595
[INFO] second pass: reading and sampling
[INFO] 94 sequences outputted
94
(base) user@host:/folder# N=1000;seqkit sample --two-pass -n $N records.fastq| seqkit seq -n |wc -l
[INFO] sample by number
[INFO] first pass: counting seq number
[INFO] seq number: 2595
[INFO] second pass: reading and sampling
[INFO] 1000 sequences outputted
1000
I'm using seqkit v2.10.0 installed via conda (mamba).
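For comparison, here is a minimal sketch of what I mean by an exact count, using only coreutils rather than seqkit. The function name `sample_fastq_exact` is hypothetical, and the sketch assumes every record is exactly four lines with no tab characters and that GNU `shuf` is available. It is not meant to describe how seqkit samples internally.

```shell
# Hypothetical helper (not part of seqkit): draw exactly N records from a FASTQ file.
# Assumptions: strict 4-line records, no tab characters in any line, GNU shuf installed.
sample_fastq_exact() {
    n="$1"
    fastq="$2"
    # Join each 4-line record into one tab-separated line, pick N of those
    # lines at random without replacement, then split them back into 4 lines.
    paste - - - - < "$fastq" | shuf -n "$n" | tr '\t' '\n'
}
```

With this, `sample_fastq_exact 100 records.fastq | seqkit seq -n | wc -l` should print 100 whenever the file holds at least 100 records, which is the behavior I expected from `-n`.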