
seqkit grep without converting to unique patterns #427

@alvanuffelen


Prerequisites

  • make sure you are using the latest version by running seqkit version
  • read the usage

Describe your issue

I have a FASTQ file from which I would like to subsample very specific sequences by ID. Additionally, some sequences should be subsampled multiple times. However, seqkit grep with --pattern-file extracts each pattern only once, even if it is listed several times.

seqkit grep -f id_list.txt mock.fq
[INFO] 3 patterns loaded from file

The file contains 4 patterns: 3 distinct IDs, one of which (seq1) is listed twice. Only 3 are loaded, so the duplicate is dropped. Would it be possible to add a parameter such that the patterns are not converted to unique patterns?

In contrast, seqtk does not reduce the list to unique IDs:
seqtk subseq mock.fq id_list.txt
All 4 patterns are used, so the output contains 4 sequences.

mock.fq

@seq1
GATCGATCGA
+
IIIIIIIIII
@seq2
AGCTAGCTAG
+
IIIIIIIIII
@seq3
TACGTACGTA
+
IIIIIIIIII
@seq4
CGATCGATCG
+
IIIIIIIIII
@seq5
ATCGATCGAT
+
IIIIIIIIII
@seq6
GCTAGCTAGC
+
IIIIIIIIII
@seq7
CATGCATGCA
+
IIIIIIIIII
@seq8
TGCATGCATG
+
IIIIIIIIII
@seq9
AGCTAGCTAG
+
IIIIIIIIII
@seq10
ATCGATCGAT
+
IIIIIIIIII

id_list.txt

seq1
seq1
seq2
seq3
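
To make the requested behavior concrete, here is a rough sketch in plain awk of what I would expect for the files above: one output record per line of the ID list, in list order, duplicates included. This is not a seqkit or seqtk feature. The function name extract_with_duplicates is made up for illustration, and the sketch assumes strict 4-line FASTQ records, a non-empty ID list, and IDs that match the first word of the header line.

```shell
# Hypothetical helper, not part of seqkit or seqtk.
# Usage: extract_with_duplicates ID_LIST FASTQ
# Assumes 4-line FASTQ records and a non-empty ID list.
extract_with_duplicates() {
    awk '
        # First file (ID list): remember every ID in listed order,
        # duplicates included, and mark which IDs are needed at all.
        NR == FNR {
            if ($1 != "") {
                order[++n] = $1
                need[$1] = 1
            }
            next
        }
        # Second file (FASTQ): every 4th line starting at line 1 is a header.
        FNR % 4 == 1 {
            id = substr($1, 2)          # drop the leading "@"
            keep = (id in need)
            if (keep) rec[id] = $0
            next
        }
        # Remaining three lines of a wanted record.
        keep { rec[id] = rec[id] "\n" $0 }
        # Emit one record per listed ID, so repeated IDs repeat in the output.
        END {
            for (i = 1; i <= n; i++)
                if (order[i] in rec) print rec[order[i]]
        }
    ' "$1" "$2"
}
```

Calling extract_with_duplicates id_list.txt mock.fq on the example files should print seq1 twice, followed by seq2 and seq3. Only the requested records are held in memory, but a built-in option in seqkit grep would still be preferable for large files.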
