The more papers I read for a review article I'm writing about ML pitfalls in genomics, the more my faith is shaken in the results from papers that apply machine learning to methylation arrays. A salty thread. 1/
They probably go places that will pay them appropriately for their skills. Paying post-docs under $70k is common but obscene in most fields, given how critical they are.
Selecting features using all data before splitting into folds for training/testing is a big source of train-test leakage. To demonstrate, I generated random data and labels, selected down to 25 features, and trained a model. Performance is much better than random due to the leakage.
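A rough sketch of that experiment, for anyone who wants to poke at it. This is not the original code: the sample and feature counts, the mean-difference feature score, and the nearest-centroid classifier are all assumptions picked so it runs on the Python standard library alone. Only the "random data, random labels, 25 features" setup comes from the post.

```python
import random


def make_random_data(n_samples=100, n_features=2000, seed=0):
    """Pure-noise features and balanced random labels: there is no real signal."""
    rng = random.Random(seed)
    X = [[rng.gauss(0.0, 1.0) for _ in range(n_features)] for _ in range(n_samples)]
    y = [i % 2 for i in range(n_samples)]
    rng.shuffle(y)
    return X, y


def select_features(X, y, rows, k=25):
    """Return the k features with the largest class-mean gap, using only `rows`."""
    pos = [i for i in rows if y[i] == 1]
    neg = [i for i in rows if y[i] == 0]

    def score(j):
        pos_mean = sum(X[i][j] for i in pos) / len(pos)
        neg_mean = sum(X[i][j] for i in neg) / len(neg)
        return abs(pos_mean - neg_mean)

    return sorted(range(len(X[0])), key=score, reverse=True)[:k]


def centroid_accuracy(X, y, train, test, feats):
    """Fit a nearest-centroid classifier on `train`, return accuracy on `test`."""
    def centroid(label):
        rows = [i for i in train if y[i] == label]
        return [sum(X[i][j] for i in rows) / len(rows) for j in feats]

    c0, c1 = centroid(0), centroid(1)
    correct = 0
    for i in test:
        d0 = sum((X[i][j] - c) ** 2 for j, c in zip(feats, c0))
        d1 = sum((X[i][j] - c) ** 2 for j, c in zip(feats, c1))
        predicted = 1 if d1 < d0 else 0
        correct += int(predicted == y[i])
    return correct / len(test)


def cross_validate(X, y, leaky, n_folds=5, k=25):
    """Mean k-fold accuracy; `leaky=True` selects features once on ALL rows."""
    all_rows = list(range(len(X)))
    folds = [all_rows[f::n_folds] for f in range(n_folds)]
    leaky_feats = select_features(X, y, all_rows, k) if leaky else None
    scores = []
    for test in folds:
        held_out = set(test)
        train = [i for i in all_rows if i not in held_out]
        # The honest version only ever looks at training rows when selecting.
        feats = leaky_feats if leaky else select_features(X, y, train, k)
        scores.append(centroid_accuracy(X, y, train, test, feats))
    return sum(scores) / len(scores)


X, y = make_random_data()
leaky_acc = cross_validate(X, y, leaky=True)
honest_acc = cross_validate(X, y, leaky=False)
print(f"selection on all data (leaky): {leaky_acc:.2f}")
print(f"selection inside each fold:    {honest_acc:.2f}")
```

The leaky number should land well above 0.5 and the honest one near chance, even though both see the same noise. With scikit-learn, the usual way to get the honest behavior is to put the selector (e.g. `SelectKBest`) and the model in one `Pipeline` and cross-validate the pipeline as a whole.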
The more I use classic bioinformatic tools, e.g. bwa and vcftools, the more I dislike current trends in bioinformatic tooling; pipelines are nice but if I want to test out your method the first step shouldn't be "set up a Terra/GCP/AWS account."
It's frustrating reading comp bio articles these days because many keep falling into the same pitfalls. Hard to know if the method actually works, or whether they messed up the evaluation. Here are some issues I've seen recently (w/o names):
Thrilled to announce that I'll be joining the incredible researchers at @IMPvienna for a year as a visiting scientist and then joining @UMassChan as an assistant professor in Genomics+CompBio in 2025!
At both places, I'll be continuing my work on deep learning + genomics.
Why are you confused? There's just genes. And alternative splicing. And regulatory elements. And regulatory elements in the alternative splicing. And regulatory elements are transcribed. And RNAs can do things. And proteins can fold differently in different cell types. And...
This entire time I knew in the back of my mind that you were a person but, because I've only seen you on Twitter, I just assumed you were a benevolent bird sharing your vast knowledge of biology with us. Illusion shattered by the picture in this article. :(
Me, a former sklearn dev, hiding under the bed:
Armed robber: ...
Me: ...
Armed robber: ....
Me: ....
Armed robber: Logistic regression shouldn't have a default L2 regularization of 1
Me: *still hides*
As a researcher at U of Washington, I remember when @BillGates walked into my lab and said "Stop working on this, we must work on vaccine microchips!" and we dropped all our grant-funded work immediately. We would've gotten away with it too, if you didn't point it out on Twitter.
CS/ML people venturing into biology frequently assume that the data they're given is clean and that all the upstream processing steps have been figured out. This is absolutely not the case.
I would encourage CS/ML people to really look into the gritty details like this.
Sequence-based ML methods (Enformer, ChromBPNet...) are invaluable in genomics but the ecosystem for their *use* after training is less developed.
Introducing `tangermeme`: a PyTorch library for genomics discovery for everything-other-than-the-model. github.com/jmschrei/tange… 1.
Finally out in @NatureRevGenet: Navigating the pitfalls of applying machine learning in genomics! go.nature.com/3D55JuZ w/ @seawhalen et al.
Our key point: you MUST evaluate your models in the same setting you want them to be used or they might not actually work in practice.