Five years ago, I left a comfortable software engineering job in Big Tech to start a PhD. Last year, I left the PhD to join Datology. Both decisions confused the people around me, and honestly both decisions were about the same thing: I wanted to do research. Not research as in
DatologyAI
188 posts
DatologyAI builds tools to automatically select and optimize the best data on which to train AI models, leading to better, smaller models which train faster.
- Pretraining data curation alone — no SFT, no RL — within 1.8pp of Qwen3-VL-2B at ~87× less train compute. New VLM research from our team. datologyai.com/blog/2020-visi…
- DatologyAI reposted: Super excited to announce our new partnership with DatologyAI following some very impressive mid-training results achieved with minimal insight into our private evals. There’s no doubt that the team is doing the most exciting data-centric work across all of industry & academia. Replying to @datologyai: Read more details about the partnership here: datologyai.com/blog/datologya…
- We’re excited to announce our partnership with Thomson Reuters, a collaboration focused on unlocking the full potential of proprietary data to build the next generation of domain-specific AI. By applying DatologyAI’s data curation pipeline for legal domain adaptation
- DatologyAI reposted: This is p sick and completely surprising to me wowie great work. Quoted post: New Datology Research: We expose "The Finetuner's Fallacy" The standard approach to domain adaptation (pretrain on web data, finetune on your data) is leaving performance on the table. Mixing just 1-5% domain data into pretraining, then finetuning, produces a strictly better
- DatologyAI reposted: If I had to compress my PhD into one idea, it is this: "The data a model sees early in training leaves an imprint on its representations that is very hard to undo later" This thread runs through Rephrasing the Web, Safety Pretraining, and TOFU. This is the Finetuner’s Fallacy 🧵
- New Datology Research: We expose "The Finetuner's Fallacy" The standard approach to domain adaptation (pretrain on web data, finetune on your data) is leaving performance on the table. Mixing just 1-5% domain data into pretraining, then finetuning, produces a strictly better
- DatologyAI reposted: Two nursing home residents are eating lunch. One says, "Boy, the food at this place is terrible." The other says, "Yeah, I know, and such small portions, too." This is the multilingual data problem. The data is bad, AND there's not enough of it. Yesterday at @datologyai we
- Replying to @datologyai, @Arceeai, and @RicardoMonti9: If you're interested in joining our team to do cool stuff like this, head to datologyai.com/careers. And if you need to improve your data (you definitely need to improve your data), get in touch via our website datologyai.com
- Replying to @datologyai and @Arceeai: See @RicardoMonti9's thread for a deep-dive: x.com/RicardoMonti9/… ArXiv: arxiv.org/abs/2602.15210 Blog: datologyai.com/blog/berweb-in… Quoted thread: 1/ People often think better multilingual models must come at the cost of English performance. Not true. The constraint isn’t capacity, it’s data quality, and we can fix it. Today @datologyai shares ÜberWeb: a year of multilingual curation lessons, scaled to 20T+ tokens.
- Replying to @datologyai: This curation approach also powers @arceeai's Trinity Large Base, which shows exceptionally strong multilingual performance relative to its compute budget.















