edgePython | Bits of DNA

In a paper titled “THEOREMS FOR A PRICE: Tomorrow’s Semi-Rigorous Mathematics Culture” published in 1993, mathematician Doron Zeilberger wrote:

There are writings on the wall that, now that the silicon savior has arrived, a new testament is going to be written. Although there will always be a small group of “rigorous” old-style mathematicians (e.g. [JQ]) who will insist that the true religion is theirs, and that the computer is a false Messiah, they may be viewed by future mainstream mathematicians as a fringe sect of harmless eccentrics, like mathematical physicists are viewed by regular physicists today.

Zeilberger’s prophecy seems to be upon us. But instead of a silicon savior, what seems to be happening is more like a searing Highlander Quickening, whereby we are receiving the knowledge and power that others have obtained throughout their lives.

A Quickening took place sometime between the end of 2025, when Claude Opus 4.5 dropped, and February 5, 2026, when Codex 5.3 was released. On Tuesday February 3rd, I downloaded Claude Opus 4.5 for the first time. My first experiment was to try a port of Sleuth, which my former PhD student Harold Pimentel published in 2017. Sleuth was written in R, and I figured it would be useful to have it in Python, since then it could be adapted to single-cell RNA-seq and work seamlessly with the popular Python-based single-cell ecosystem. Within an hour I was able to port Sleuth to Python and replicate exactly the results in R. I was stunned by the ability of Claude, so I figured I should try something even more ambitious. During a discussion with my former student Sina Booeshaghi, he suggested I tackle edgeR, which consists of 14,808 lines of code across 136 files (117 in R and 19 in C).

On Thursday February 5th I started working on edgePython. This was not just a port for the sake of performing a port. I had three specific goals in mind:

I wanted an edgePython that could interact seamlessly with AnnData objects, so that differential analysis could proceed without exporting the data and reading it back in from R. Seamless is a key word. Along with an AnnData interface, I resolved that the port should stick to Python (possibly with Numba), but not incorporate C.
Python has become the de facto standard for single-cell genomics with a huge ecosystem of methods and tools (to the extent that R users are converting Seurat objects to AnnData), so I wanted to use the edgeR port to develop a new single-cell differential analysis method. The approach for single-cell differential analysis in edgePython seemed obvious: port the single-cell NEBULA method, and then incorporate Empirical Bayes, which has been a hallmark feature of edgeR.
To be useful, the port would have to be comprehensive, covering most of edgeR’s functionality, and it would have to be similar in speed. The latter did not seem to be a trivial goal; I checked and edgeR is 39% C code (so it is really edgeRC), and I wasn’t sure Python could match it.
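The Empirical Bayes idea mentioned in the second goal can be illustrated with a minimal sketch. This is not code from edgePython, edgeR, or NEBULA; it is the generic moderated-variance formula familiar from limma-style Empirical Bayes, in which each per-gene estimate is pulled toward a shared prior value in proportion to the prior degrees of freedom. The function name `squeeze_variances` and its parameters are hypothetical, and the prior value and prior degrees of freedom are supplied as inputs here, whereas in practice they would be estimated from all genes.

```python
from typing import Sequence


def squeeze_variances(
    gene_vars: Sequence[float],
    df_resid: float,
    prior_var: float,
    df_prior: float,
) -> list[float]:
    """Shrink per-gene variance estimates toward a shared prior value.

    Each moderated estimate is a weighted average of the prior value and
    the gene's own estimate, weighted by their degrees of freedom.
    Illustrative only: the prior is assumed known rather than estimated.
    """
    if df_resid < 0 or df_prior < 0 or df_resid + df_prior == 0:
        raise ValueError("degrees of freedom must be non-negative and not both zero")
    total_df = df_prior + df_resid
    return [(df_prior * prior_var + df_resid * v) / total_df for v in gene_vars]


if __name__ == "__main__":
    # Hypothetical per-gene estimates, each based on 2 residual degrees of freedom.
    raw = [0.5, 1.0, 4.0]
    print(squeeze_variances(raw, df_resid=2, prior_var=1.0, df_prior=4))
```

With few replicates per gene (small `df_resid`), the prior dominates and noisy estimates are stabilized; with many replicates, each gene's own estimate dominates.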

It took me about a week to achieve these three goals, and a few extra days to write up a manuscript with the details. The edgePython code is here: https://github.com/pachterlab/edgePython. A preprint describing the work is here: https://doi.org/10.64898/2026.02.16.706223

The sport of predicting dystopian vs. utopian futures of human civilization in light of recent advances in AI is a strange one, where games inevitably result in a loss for both sides. Thus, I won’t play, but I will offer some observations on the impact of AI on my corner of science right now:

A Python port of edgeR has been attempted twice before, to my knowledge. One effort did not succeed. The other, part of the inMoose project, did result in a partial port after years of work by multiple individuals. The difficulty in porting edgeR stems from a complex codebase: Of the 14,808 lines of code, 39% are written in C, and a lot of the code relates to complex numerical statistical algorithms. A Python port of DESeq2 was published in 2023. This port, while functional, was also incomplete. I performed my initial port with Claude Opus 4.5 and 4.6, and more recently worked on finishing off several pieces with Codex GPT-5.3. I find it remarkable that these tools were able to produce an essentially complete Python port in a week. Remarkably, the running time of the Python port is comparable to that of edgeR (in some cases the port is faster). The port does use Numba, but it does not rely on any C code. While I had to work closely with the Claude and Codex tools to make the port work, and some of my contribution was non-trivial (thanks to Sina Booeshaghi for some major assists along the way), it seems to me that much of my involvement could be automated in the future. The acceleration of computational biology is increasing. I would say that even the snap is positive.
A Python implementation of edgeR is valuable because of the single-cell genomics ecosystem that has been developed in Python, largely around AnnData objects, which are part of scverse. Now, with edgePython, differential analysis can be performed directly on AnnData objects without the need for expensive export / import, not to mention the hassle and burden of maintaining a full R/Bioconductor stack and its dependencies. For several of my own projects, edgePython has lowered the barriers to adoption of some of the edgeR algorithms.
I was able to simultaneously extend two different programs, NEBULA and edgeR, to enable single-cell differential analysis with Empirical Bayes. Of course, this extension could have been implemented in R, but the Python single-cell ecosystem makes this method much more likely to be used now that it’s part of edgePython. I find it remarkable that I was able to build on two programs in this way; while they are both open source, the “open” aspect of the codebases has, in many cases, until now been literally true but practically insufficient. To build on a tool such as edgeR, which is 16+ years in the making and consists of a complex codebase with thousands of lines of code, would previously have required years of study and effort to be able to confidently alter the codebase. That is why many bioinformatics tools have effectively “remained lab-bound.” Automated programming democratizes all such projects.
I have many ideas for further extensions of edgePython. Isoform-level quantification is essential for accurate differential analysis, even for gene-level differential analysis (see Pimentel et al., 2017), and while edgeR recently implemented an approach for differential transcript usage based on inferential uncertainty of isoform quantifications, an idea developed in sleuth (Baldoni et al., 2024), there is much work to be done on further methods development. I was also able to fix minutiae in the edgeR implementation, albeit details with serious implications. For example, in reviewing the edgeR code I discovered that the authors handicapped kallisto by not implementing import of kallisto HDF5 files, which is essential for utilizing kallisto bootstraps for isoform quantification. edgePython fixes this.
Thinking about isoforms and their role in differential analysis led me to the following hypothesis: a major challenge that will have to be addressed (by humans) in the near term is how to avoid a proliferation of suboptimal tools (and therefore suboptimal results derived from them). Frequently in computational biology the majority opinion is wrong (for an example see the use of UMAPs). Pushing back against the majority can be difficult for scientists, but it is possible. The empowerment of the majority with accelerants may serve to massively amplify flawed approaches, making it much harder for the minority opinion to prevail, even if it is correct.
Working with Claude and Codex has been exhilarating. Being able to code up ideas quickly and effectively opens up wide vistas for everyone. This is already resulting in a deluge of publications (see, e.g., NeurIPS 2025, which received more than 21,000 submissions). I don’t believe that current publication systems, whether for preprints, conference proceedings, or journals, can withstand the onslaught of work that is already in progress, let alone the amount of material that is to come. This is an opportunity to think of novel ways to distribute and communicate science. I think that machine readability, at least for part of it, will be one aspect of a potential solution (see Booeshaghi, Luebbert, Pachter, 2026).
There is one barrier to exploring ideas with our newfound machine tools: cost. As Doron Zeilberger predicted, it will not be possible to try every idea, because even if costs are lowered, as long as they are nonzero, we will face the problem that $\lim_{n \to \infty} c n$ diverges when c > 0. In other words, we will face not just the reality of what Zeilberger calls semi-rigorous mathematics, but more generally semi-rigorous science. This may require an adjustment in habits and expectations.
I said I wouldn’t play the prediction game but I will step onto the field just for a moment… while working on this project I thought of recent comments I’ve heard such as “Subject X will be the first to go”. This is folly. Nothing is going anywhere. Science is about to get better and it will proceed faster than at any time in history. But if someone is going to insist on looking up their integrals by hand in Abramowitz and Stegun, they may find that the scientific enterprise leaves them behind (although artisanal science may be a satisfying hobby for sure).
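The cost observation above can be made concrete with a one-line bound. Assume, purely for illustration, a fixed budget $B$ and a constant cost $c > 0$ per idea explored (the budget $B$ is an assumption introduced here, not a quantity from Zeilberger's paper). Then the number of ideas $n$ that can be tried satisfies

$$c\,n \le B \iff n \le \frac{B}{c}, \qquad \text{while} \qquad \lim_{n \to \infty} c\,n = \infty \ \text{ for any } c > 0.$$

Lowering $c$ raises the ceiling $\lfloor B/c \rfloor$ but never removes it, which is why some ideas will always go unexplored.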

The Highlander Quickenings are not free. The knowledge and power gained by one immortal comes at a loss for another. Scientists should think long and hard about where the coding superpower they have just received comes from (tl;dr: the enormous corpus of computer code written by software engineers over the past 50+ years, much of it informed by hard work in the sciences). We should all strive to ensure that fruits from the labor of others are turned into meaningful profit for society and not for empty vanity.

edgePython Reference

Pachter L (2026). Differential analysis of genomics count data with edge*. bioRxiv. doi:10.64898/2026.02.16.706223

edgeR References

Robinson MD, Smyth GK (2007). Moderated statistical tests for assessing differences in tag abundance. Bioinformatics, 23(21), 2881-2887. doi:10.1093/bioinformatics/btm453

Robinson MD, Smyth GK (2008). Small-sample estimation of negative binomial dispersion, with applications to SAGE data. Biostatistics, 9(2), 321-332. doi:10.1093/biostatistics/kxm030

Robinson MD, McCarthy DJ, Smyth GK (2010). edgeR: a Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics, 26(1), 139-140. doi:10.1093/bioinformatics/btp616

Robinson MD, Oshlack A (2010). A scaling normalization method for differential expression analysis of RNA-seq data. Genome Biology, 11(3), R25. doi:10.1186/gb-2010-11-3-r25

McCarthy DJ, Chen Y, Smyth GK (2012). Differential expression analysis of multifactor RNA-Seq experiments with respect to biological variation. Nucleic Acids Research, 40(10), 4288-4297. doi:10.1093/nar/gks042

Chen Y, Lun ATL, Smyth GK (2014). Differential expression analysis of complex RNA-seq experiments using edgeR. In Statistical Analysis of Next Generation Sequencing Data, Springer, 51-74. doi:10.1007/978-3-319-07212-8_3

Zhou X, Lindsay H, Robinson MD (2014). Robustly detecting differential expression in RNA sequencing data using observation weights. Nucleic Acids Research, 42(11), e91. doi:10.1093/nar/gku310

Dai Z, Sheridan JM, Gearing LJ, Moore DL, Su S, Wormald S, Wilcox S, O’Connor L, Dickins RA, Blewitt ME, Ritchie ME (2014). edgeR: a versatile tool for the analysis of shRNA-seq and CRISPR-Cas9 genetic screens. F1000Research, 3, 95. doi:10.12688/f1000research.3928.2

Lun ATL, Chen Y, Smyth GK (2016). It’s DE-licious: A recipe for differential expression analyses of RNA-seq experiments using quasi-likelihood methods in edgeR. In Statistical Genomics, Springer, 391-416. doi:10.1007/978-1-4939-3578-9_19

Chen Y, Lun ATL, Smyth GK (2016). From reads to genes to pathways: differential expression analysis of RNA-Seq experiments using Rsubread and the edgeR quasi-likelihood pipeline. F1000Research, 5, 1438. doi:10.12688/f1000research.8987.2

Chen Y, Pal B, Visvader JE, Smyth GK (2018). Differential methylation analysis of reduced representation bisulfite sequencing experiments using edgeR. F1000Research, 6, 2055. doi:10.12688/f1000research.13196.2

Baldoni PL, Chen Y, Hediyeh-zadeh S, Liao Y, Dong X, Ritchie ME, Shi W, Smyth GK (2024). Dividing out quantification uncertainty allows efficient assessment of differential transcript expression with edgeR. Nucleic Acids Research, 52(3), e13. doi:10.1093/nar/gkad1167

Chen Y, Chen L, Lun ATL, Baldoni PL, Smyth GK (2025). edgeR v4: powerful differential analysis of sequencing data with expanded functionality and improved support for small counts and larger datasets. Nucleic Acids Research, 53(2), gkaf018. doi:10.1093/nar/gkaf018

NEBULA Reference

He L, Davila-Velderrain J, Sumida TS, Hafler DA, Kellis M, Kulminski AM (2021). NEBULA is a fast negative binomial mixed model for differential or co-expression analysis of large-scale multi-subject single-cell data. Communications Biology, 4, 629. doi:10.1038/s42003-021-02146-6
