seqkit sample -n gives lower amount of samples than asked for #518

@lapotok

Hi!
I'm trying to draw a random sample of records from a FASTQ file.
The file contains 2595 records, and all record names are unique:

(base) user@host:/folder# seqkit seq -n records.fastq|wc -l
2595
(base) user@host:/folder# seqkit seq -n records.fastq|sort|uniq|wc -l
2595

However, seqkit sample -n returns fewer records than requested:

982
(base) user@host:/folder# N=100;seqkit sample -n $N records.fastq| seqkit seq -n |wc -l
[INFO] sample by number
[INFO] loading all sequences into memory...
[INFO] 85 sequences outputted
85
(base) user@host:/folder# N=800;seqkit sample -n $N records.fastq| seqkit seq -n |wc -l
[INFO] sample by number
[INFO] loading all sequences into memory...
[INFO] 799 sequences outputted
799
(base) user@host:/folder# N=10;seqkit sample -n $N records.fastq| seqkit seq -n |wc -l
[INFO] sample by number
[INFO] loading all sequences into memory...
[INFO] 9 sequences outputted
9

With --two-pass the results are better, but the count is still occasionally lower than requested (94 instead of 100 below):

(base) user@host:/folder# N=10;seqkit sample --two-pass -n $N records.fastq| seqkit seq -n |wc -l
[INFO] sample by number
[INFO] first pass: counting seq number
[INFO] seq number: 2595
[INFO] second pass: reading and sampling
[INFO] 10 sequences outputted
10
(base) user@host:/folder# N=100;seqkit sample --two-pass -n $N records.fastq| seqkit seq -n |wc -l
[INFO] sample by number
[INFO] first pass: counting seq number
[INFO] seq number: 2595
[INFO] second pass: reading and sampling
[INFO] 94 sequences outputted
94
(base) user@host:/folder# N=1000;seqkit sample --two-pass -n $N records.fastq| seqkit seq -n |wc -l
[INFO] sample by number
[INFO] first pass: counting seq number
[INFO] seq number: 2595
[INFO] second pass: reading and sampling
[INFO] 1000 sequences outputted
1000

I'm using seqkit v2.10.0 installed via conda (mamba).
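
As a stopgap when an exact count is required, one option is to sample with standard tools instead of seqkit sample. The following is a minimal sketch, not seqkit's own method. It assumes GNU shuf is available, that the FASTQ file uses strict 4-line records (no wrapped sequences), and that no line contains a tab character. The function name sample_fastq_exact is hypothetical.

```shell
# Hypothetical helper: draw exactly N records from a 4-line-per-record FASTQ.
# Assumptions: GNU shuf is installed, sequences are not line-wrapped,
# and no header/sequence/quality line contains a tab.
sample_fastq_exact() {
    n="$1"
    fastq="$2"
    # Join each 4-line record into one tab-separated line, pick n of those
    # lines uniformly at random without replacement, then restore the
    # original one-field-per-line layout.
    paste - - - - < "$fastq" | shuf -n "$n" | tr '\t' '\n'
}
```

It could be checked the same way as the commands above, for example with sample_fastq_exact 100 records.fastq | seqkit seq -n | wc -l. If the file holds fewer than N records, shuf simply returns all of them.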
