DatologyAI (@datologyai) / X

188 posts

DatologyAI

@datologyai

DatologyAI builds tools to automatically select and optimize the best data on which to train AI models, leading to better, smaller models which train faster.

Redwood City, CA

datologyai.com

Joined September 2023

DatologyAI reposted
Siddharth Joshi
@sjoshi804
May 13
Five years ago, I left a comfortable software engineering job in Big Tech to start a PhD. Last year, I left the PhD to join Datology. Both decisions confused the people around me, and honestly both decisions were about the same thing: I wanted to do research. Not research as in
791K
DatologyAI
@datologyai
May 13
Pretraining data curation alone — no SFT, no RL — within 1.8pp of Qwen3-VL-2B at ~87× less train compute. New VLM research from our team. datologyai.com/blog/2020-visi…
1.9K
DatologyAI reposted
Jonathan Richard Schwarz
@schwarzjn_
Apr 10
Super excited to announce our new partnership with DatologyAI following some very impressive mid-training results achieved with minimal insight into our private evals. There’s no doubt that the team is doing the most exciting data-centric work across all of industry & academia.
DatologyAI
@datologyai
Apr 9
Replying to @datologyai
Read more details about the partnership here: datologyai.com/blog/datologya…
4.2K
DatologyAI
@datologyai
Apr 9
We’re excited to announce our partnership with Thomson Reuters, a collaboration focused on unlocking the full potential of proprietary data to build the next generation of domain-specific AI. By applying DatologyAI’s data curation pipeline for legal domain adaptation
10K
DatologyAI reposted
Saurabh Shah
@saurabh_shah2
Mar 19
This is p sick and completely surprising to me wowie great work
DatologyAI
@datologyai
Mar 18
New Datology Research: We expose "The Finetuner's Fallacy". The standard approach to domain adaptation (pretrain on web data, finetune on your data) is leaving performance on the table. Mixing just 1-5% domain data into pretraining, then finetuning, produces a strictly better
12K
DatologyAI reposted
Pratyush Maini
@pratyushmaini
Mar 19
If I had to compress my PhD into one idea, it is this:
"The data a model sees early in training leaves an imprint on its representations that is very hard to undo later"
This thread runs through
- Rephrasing the Web
- Safety Pretraining
- TOFU
This is the Finetuner’s Fallacy🧵
58K
DatologyAI
@datologyai
Mar 18
Replying to @datologyai
📄 arxiv.org/abs/2603.16177 📝 datologyai.com/blog/finetuner…
2.2K
DatologyAI reposted
Matthew Leavitt
@leavittron
Feb 19
Two nursing home residents are eating lunch. One says, "Boy, the food at this place is terrible." The other says, "Yeah, I know, and such small portions, too." This is the multilingual data problem. The data is bad, AND there's not enough of it. Yesterday at @datologyai we
3.8K
DatologyAI
@datologyai
Feb 18
Replying to @datologyai @Arceeai and @RicardoMonti9
If you're interested in joining our team to do cool stuff like this, head to datologyai.com/careers. And if you need to improve your data (you definitely need to improve your data), get in touch via our website datologyai.com
410
DatologyAI
@datologyai
Feb 18
Replying to @datologyai and @Arceeai
See @RicardoMonti9's thread for a deep-dive: x.com/RicardoMonti9/…
ArXiv: arxiv.org/abs/2602.15210
Blog: datologyai.com/blog/berweb-in…
Ricardo Monti
@RicardoMonti9
Feb 18
1/ People often think better multilingual models must come at the cost of English performance. Not true. The constraint isn’t capacity, it’s data quality, and we can fix it. Today @datologyai shares ÜberWeb: a year of multilingual curation lessons, scaled to 20T+ tokens.
774
DatologyAI
@datologyai
Feb 18
Replying to @datologyai
This curation approach also powers @arceeai's Trinity Large Base, which shows exceptionally strong multilingual performance relative to its compute budget.
289