Feature request: empirical sequencing-error-rate estimation #881

Description

@jdidion

Background

I'm the maintainer of atropos, a fork of cutadapt that I'm now winding down. Before archiving, I'm surfacing a few features atropos has that cutadapt doesn't, in case any are interesting upstream.

Proposal

Add an empirical per-base error-rate estimator, either as a standalone subcommand (e.g. "cutadapt error") or as an automatic pre-pass that informs the default adapter-match error tolerance (-e).

Two possible methods

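  1. Quality-based: sum the per-base error probabilities 10^(-Q/10) over all input bases and divide by the base count. This streams and needs no reference or separate calibration step, but the estimate is only as good as the quality scores, so miscalibrated scores skew it.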
  2. Wang et al. 2012 shadow regression: regress mismatching-read count against unique-read count across read-length prefixes, solve for per-base error. Works without an alignment reference.
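
To make method 1 concrete, here is a minimal sketch of the quality-based estimate. It is not atropos or cutadapt code: the function name `quality_based_error_rate` is hypothetical, and it assumes uncompressed four-line FASTQ records with Phred+33 quality encoding.

```python
import io
from typing import TextIO


def quality_based_error_rate(fastq: TextIO, phred_offset: int = 33) -> float:
    """Return the mean per-base error probability implied by the quality scores.

    Assumes four-line FASTQ records (no wrapped sequences) and, by default,
    Phred+33 encoding. Both are assumptions of this sketch.
    """
    total_error = 0.0
    total_bases = 0
    for line_number, line in enumerate(fastq):
        # The quality string is the 4th line of each record.
        if line_number % 4 != 3:
            continue
        for char in line.rstrip("\r\n"):
            q = ord(char) - phred_offset
            total_error += 10 ** (-q / 10)  # Phred score -> error probability
            total_bases += 1
    if total_bases == 0:
        raise ValueError("no bases found in input")
    return total_error / total_bases


# Toy input: four bases at Q40 ('I') and two bases at Q10 ('+').
example = "@r1\nACGT\n+\nIIII\n@r2\nAC\n+\n++\n"
rate = quality_based_error_rate(io.StringIO(example))
print(f"{rate:.4f}")  # 0.0334
```

The estimate is a single streaming pass with constant memory, which is why it could plausibly run as a pre-pass; a real implementation would read records through cutadapt's existing FASTQ parser rather than by line position.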

Why this is useful

Users choosing -e currently guess. If cutadapt could tell them "the empirical error rate is 1.2% — your -e 0.1 is well above that", it would make stringency choices much easier to justify and catch obvious run-quality problems.

Prior art

Happy to help if useful.
