Jacob Schreiber (@jmschreiber91) / X

Jacob Schreiber

@jmschreiber91

Programmable Genomics Lab @UMassGCB, Technical Steering Committee @NumFOCUS. Prev @impvienna @StanfordMed. Studying genomics, ML, and fruit. Opinions my own.

Boston, MA

Joined March 2017

Pinned
Jacob Schreiber
@jmschreiber91
Aug 5, 2020
The more papers I read for a review article I'm writing about ML pitfalls in genomics, the more my faith is shaken in the results from papers that apply machine learning to methylation arrays. A salty thread. 1/
Jacob Schreiber
@jmschreiber91
May 21, 2022
They probably go places that will pay them appropriately for their skills. Paying post-docs under $70k is common but obscene in most fields, given how critical they are.
Jacob Schreiber
@jmschreiber91
Jun 8, 2019
Selecting features using all data before splitting into folds for training/testing is a big source of train-test leakage. To demonstrate, I generated random data and labels, selected down to 25 features, and trained a model. Performance was much better than random due to the leakage.
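
The leakage described in this post can be reproduced with a small standard-library sketch. This is not the author's original experiment: the nearest-centroid classifier, the class-mean-gap selection score, the sample and feature counts, and every function name below are assumptions chosen for illustration. Only the idea (random data and labels, 25 selected features, selection done before versus inside the folds) comes from the post.

```python
import random


def class_mean_gap(X, y, rows, j):
    """Class-1 mean minus class-0 mean of feature j over the given rows."""
    sums, counts = [0.0, 0.0], [0, 0]
    for i in rows:
        sums[y[i]] += X[i][j]
        counts[y[i]] += 1
    return sums[1] / counts[1] - sums[0] / counts[0]


def select_features(X, y, rows, k):
    """Pick the k features whose class means differ most on the given rows."""
    gaps = [(abs(class_mean_gap(X, y, rows, j)), j) for j in range(len(X[0]))]
    gaps.sort(reverse=True)
    return [j for _, j in gaps[:k]]


def fit_centroids(X, y, rows, features):
    """Per-class mean vector over the chosen features (nearest-centroid model)."""
    sums = {0: [0.0] * len(features), 1: [0.0] * len(features)}
    counts = {0: 0, 1: 0}
    for i in rows:
        counts[y[i]] += 1
        for pos, j in enumerate(features):
            sums[y[i]][pos] += X[i][j]
    return {label: [s / counts[label] for s in sums[label]] for label in (0, 1)}


def predict(centroids, X, features, i):
    """Assign sample i to the class with the closer centroid."""
    dist = {
        label: sum((X[i][j] - c[pos]) ** 2 for pos, j in enumerate(features))
        for label, c in centroids.items()
    }
    return 0 if dist[0] <= dist[1] else 1


def cross_validate(X, y, folds, k, leak):
    """k-fold accuracy; leak=True selects features on ALL rows, test rows included."""
    all_rows = [i for fold in folds for i in fold]
    correct = 0
    for test_rows in folds:
        held_out = set(test_rows)
        train_rows = [i for i in all_rows if i not in held_out]
        selection_rows = all_rows if leak else train_rows
        features = select_features(X, y, selection_rows, k)
        centroids = fit_centroids(X, y, train_rows, features)
        correct += sum(predict(centroids, X, features, i) == y[i] for i in test_rows)
    return correct / len(all_rows)


def run_experiment(seed=0, n_samples=100, n_features=2000, k=25, n_folds=5):
    """Random features and random balanced labels: there is no real signal to find."""
    rng = random.Random(seed)
    X = [[rng.gauss(0.0, 1.0) for _ in range(n_features)] for _ in range(n_samples)]
    y = [i % 2 for i in range(n_samples)]
    rng.shuffle(y)
    order = list(range(n_samples))
    rng.shuffle(order)
    folds = [order[f::n_folds] for f in range(n_folds)]
    leaky = cross_validate(X, y, folds, k, leak=True)
    honest = cross_validate(X, y, folds, k, leak=False)
    return leaky, honest


leaky_acc, honest_acc = run_experiment()
print(f"selection on all data (leaky): {leaky_acc:.2f}")
print(f"selection inside each fold:    {honest_acc:.2f}")
```

Because the labels are random, any accuracy well above 0.5 in the leaky run comes from the held-out samples having influenced which features were kept. Moving the selection step inside each training fold should bring the estimate back to roughly chance.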
Jacob Schreiber
@jmschreiber91
Jan 6, 2022
The more I use classic bioinformatic tools, e.g. bwa and vcftools, the more I dislike current trends in bioinformatic tooling; pipelines are nice but if I want to test out your method the first step shouldn't be "set up a Terra/GCP/AWS account."
Jacob Schreiber
@jmschreiber91
Feb 13, 2023
It's frustrating reading comp bio articles these days because many keep falling into the same pitfalls. Hard to know if the method actually works, or whether they messed up the evaluation. Here are some issues I've seen recently (w/o names):
Jacob Schreiber
@jmschreiber91
May 31, 2024
Thrilled to announce that I'll be joining the incredible researchers at @IMPvienna for a year as a visiting scientist and then joining @UMassChan as an assistant professor in Genomics+CompBio in 2025! At both places, I'll be continuing my work on deep learning + genomics.
Jacob Schreiber
@jmschreiber91
Sep 22, 2023
Why are you confused? There's just genes. And alternate splicing. And regulatory elements. And regulatory elements in the alternate splicing. And regulatory elements are transcribed. And RNAs can do things. And proteins can fold differently in different cell types. And...
Jacob Schreiber
@jmschreiber91
Aug 1, 2020
Replying to @CT_Bergstrom
This entire time I knew in the back of my mind that you were a person but, because I've only seen you on Twitter, I just assumed you were a benevolent bird sharing your vast knowledge of biology with us. Illusion shattered by the picture in this article. :(
Jacob Schreiber
@jmschreiber91
Jul 10, 2020
Jumping from a successful post-doc into a new PI position.
Caitlin Hudon
@beeonaposy
Jul 9, 2020
jumping from tutorials into your own data
Jacob Schreiber
@jmschreiber91
Sep 3, 2019
Me, a former sklearn dev, hiding under the bed:
Armed robber: ...
Me: ...
Armed robber: ....
Me: ....
Armed robber: Logistic regression shouldn't have a default L2 regularization of 1
Me: *still hides*
Jacob Schreiber
@jmschreiber91
Aug 1, 2020
Replying to @naomirwolf and @BillGates
As a researcher at U of Washington, I remember when @BillGates walked into my lab and said "Stop working on this, we must work on vaccine microchips!" and we dropped all our grant-funded work immediately. We would've gotten away with it too, if you didn't point it out on Twitter.
Jacob Schreiber
@jmschreiber91
Aug 25, 2023
CS/ML people venturing into biology frequently assume that the data they're given is clean and that all the upstream processing steps have been figured out. This is absolutely not the case. I would encourage CS/ML people to really look into the gritty details like this.
Steven Salzberg 💙💛
@StevenSalzberg1
Aug 25, 2023
A very intriguing result in the new Y chromosome paper, one that you might miss unless you read the paper closely... 1/6 nature.com/articles/s4158…
Jacob Schreiber
@jmschreiber91
Apr 2, 2024
Sequence-based ML methods (Enformer, ChromBPNet...) are invaluable in genomics but the ecosystem for their *use* after training is less developed. Introducing `tangermeme`: a PyTorch library for genomics discovery for everything-other-than-the-model. github.com/jmschrei/tange… 1.
GitHub - jmschrei/tangermeme: Biological sequence analysis for the modern age.
Jacob Schreiber
@jmschreiber91
Nov 29, 2021
Finally out in @NatureRevGenet: Navigating the pitfalls of applying machine learning in genomics! go.nature.com/3D55JuZ w/ @seawhalen et al. Our key point: you MUST evaluate your models in the same setting you want them to be used or they might not actually work in practice.