Background
I'm the maintainer of atropos, a fork of cutadapt that I'm now winding down. Before archiving, I'm surfacing a few features atropos has that cutadapt doesn't, in case any are interesting upstream.
Proposal
Add an empirical per-base error-rate estimator — either as a standalone cutadapt error subcommand or as an automatic pre-pass that informs the default adapter-match error tolerance (-e).
Two possible methods
- Quality-based: sum per-base 10^(-Q/10) across the input, divide by base count. Streams, no calibration needed, but inflated by quality-score miscalibration.
- Wang et al. 2012 shadow regression: regress mismatching-read count against unique-read count across read-length prefixes, solve for per-base error. Works without an alignment reference.
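For concreteness, the quality-based method can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the atropos or cutadapt implementation: the function names are hypothetical, the input is assumed to be plain four-line FASTQ, and qualities are assumed to be Phred+33 encoded.

```python
from typing import Iterable, Tuple

PHRED_OFFSET = 33  # assumption: Phred+33 (Sanger / Illumina 1.8+) encoding


def iter_quality_strings(lines: Iterable[str]) -> Iterable[str]:
    """Yield the quality line of each record (hypothetical helper).

    Assumes strict four-line FASTQ records with no wrapped sequences.
    """
    for line_number, line in enumerate(lines):
        if line_number % 4 == 3:
            yield line.rstrip("\r\n")


def estimate_error_rate(
    qualities: Iterable[str], offset: int = PHRED_OFFSET
) -> Tuple[float, int]:
    """Return (mean per-base error probability, number of bases seen).

    Each quality character encodes Q; its error probability is 10^(-Q/10).
    The estimate is the sum of those probabilities divided by the base count.
    """
    total_error = 0.0
    total_bases = 0
    for qual in qualities:
        for char in qual:
            total_error += 10 ** (-(ord(char) - offset) / 10)
        total_bases += len(qual)
    if total_bases == 0:
        raise ValueError("no bases in input")
    return total_error / total_bases, total_bases
```

Because it only keeps two running totals, the estimate streams over the input in a single pass, e.g. `estimate_error_rate(iter_quality_strings(open("reads.fastq")))`, where `reads.fastq` is a placeholder file name.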
Why this is useful
Users choosing -e currently have to guess. If cutadapt could tell them "the empirical error rate is 1.2%; your -e 0.1 is well above that", stringency choices would be much easier to justify, and obvious run-quality problems would be easier to catch.
Prior art
The atropos error subcommand implements both: https://github.com/jdidion/atropos/blob/master/atropos/commands/error/__init__.py
Happy to help if useful.