|
Raptor 4.0.0-rc.1
A fast and space-efficient pre-filter
|
raptor prepare] -> raptor layout -> raptor build -> raptor searchraptor prepare] -> raptor build -> raptor searchRecommendation: HIBF
Cases in which to consider the IBF:
| (k,k) | (w,k) | |
|---|---|---|
| Index size | ⬆️ | ⬇️ |
| Runtime | ⬆️ | ⬇️ |
| RAM usage | ⬆️ | ⬇️ |
| Thresholding¹ | Exact | Heuristic |
¹ When searching with a given number of errors.
Recommendation: (w,k) with gentle compression (w-k=4)
Requirements:
w > kw ≤ query lengthRecommendation:
w - k = 4w << query lengthExamples:
w = 24, k = 20w = 28, k = 24Also see the figure in usage_w_vs_k_figure.
Requirements:
k ≤ query lengthRecommendation:
k << query lengthExamples (for two errors):
k = 20k = 32Depending on the number of errors that should be accounted for when searching, the kmer-size (k) has to be chosen such that the k-mer lemma still has a positive threshold.
K-mer counting lemma: For a given k and number of errors e, there are \(k_p = |p| - k + 1\) many k-mers in the pattern p and an approximate occurrence of p in text T has to share at least \(t = (k_p - k \cdot e)\) k-mers.
For example, when searching reads of length 100 and allowing 4 errors, k has to be at most 20 (100 − 20 + 1 − 4 · 20 = 1).
Furthermore, k shall be such that a random k-mer match in the database is unlikely. For example, we chose k = 32 for the RefSeq data set. In general, there is no drawback in choosing the (currently supported) maximum k of 32, as long as the aforementioned requirements are fulfilled.