-
Notifications
You must be signed in to change notification settings - Fork 8
Description
Hi John,
Thanks a lot for implementing the calculation of the optimal kmer count threshold (ska cov).
I tested it using ska 0.3.1 and a paired-end sample (fastq files available here: https://drive.google.com/drive/folders/1HVO-6mOd7bh7CPOjXhA3lWAT0GA_8SC8?usp=sharing). My command line was:
./ska_0.3.1 cov reads_kraken/ERR8158003_1.fastq.gz reads_kraken/ERR8158003_2.fastq.gz -k 31 > cov.txt
Given the distribution of kmer counts, the cutoff should be around 4-6, but ska outputs a value of 17. I think this high treeshold is related to the "Error" message below obtained for low kmer sizes during the model fittng.
This is the beginning of the file 'cov.txt':
Count K_mers Mixture_density Component
1 25838523 3.2408096206707826e-1 Error
2 1993278 1.620404810335391e-1 Error
3 129884 5.4013493677846365e-2 Error
4 97141 1.3503373419461602e-2 Error
5 28053 2.7006746838923196e-3 Error
6 35628 4.5011244731538646e-4 Error
7 24702 6.430177818791246e-5 Error
8 35486 8.03772227348904e-6 Error
9 30571 8.93080252609934e-7 Error
10 39312 8.930802526126602e-8 Error
11 40110 8.118911389065957e-9 Error
12 44332 6.765759585478362e-10 Error
13 43658 5.20443537193471e-11 Error
14 49357 3.7176916177683685e-12 Error
15 47325 2.4891833425171915e-13 Error
16 51653 2.009020801325787e-14 Error
17 52514 1.921693705685228e-14 Coverage
18 53472 6.883935794424811e-14 Coverage
19 52873 2.4488922538072723e-13 Coverage
20 53588 8.282019181080119e-13 Coverage
21 54322 2.667583813189892e-12 Coverage
22 54876 8.201562459958256e-12 Coverage
23 55197 2.411959247393554e-11 Coverage
24 54703 6.797667675764525e-11 Coverage
25 54701 1.8391668285808207e-10 Coverage
26 54189 4.784636868075234e-10 Coverage
27 52711 1.1986335327845915e-9 Coverage
28 52430 2.895540187815414e-9 Coverage
29 51131 6.753560645868026e-9 Coverage
30 51195 1.522694413626287e-8 Coverage
31 49749 3.322402658582442e-8 Coverage
32 48342 7.02268990941827e-8 Coverage
33 48280 1.4394306882794722e-7 Coverage
34 47080 2.863604561107574e-7 Coverage
35 46010 5.534089852758182e-7 Coverage
36 45616 1.039788261971188e-6 Coverage
37 45274 1.9008348747272694e-6 Coverage
38 44701 3.383467426815247e-6 Coverage
39 44463 5.868114750185855e-6 Coverage
40 43372 9.922927345843403e-6 Coverage
41 43026 1.637031965870808e-5 Coverage
Also, would it be possible to perform the calculation of the estimated cutoff 'on the fly' while extracting the kmers from fastq files? As it is, we have to run ska twice (first to estimate the cutoff and then to build the kmer dictionary), which kind of defeat the purpose (twice the runtimes).
Thanks and best,
Romain