Olmo goes multimodal!
We are launching Molmo, an open family of multimodal models that rival the best closed VLMs out there.
We spent the last 9 months meticulously curating PixMo, a dataset of (a) high-quality image-caption pairs and (b) multimodal instruction data.
Luca Soldaini
4,949 posts
data mines are my passion / mts @MicrosoftAI / ex co-lead Olmo @allen_ai / pfp @YanhongLi2062 / thoughts are mine, leave my employer alone
- Announcing Dolma, the dataset for @allen_ai's LLM, OLMo. It's 3+ trillion tokens (web/papers/code/books/wiki). We hope it will facilitate study of LLMs & their behavior! Released on @huggingface w/ ImpACT license: huggingface.co/datasets/allen… Overview/datasheet: blog.allenai.org/dolma-3-trilli…
- OLMo 2 tech report is out! We get in the weeds with this one, with 50+ pages on 4 crucial components of the LLM development pipeline:
- GPT-4o still gets foiled by my favorite tokenization-related question
- @kylelostat and I have just released peS2o, a collection of 40M open-access papers carefully cleaned for LLM training. V1 has been used by @mosaicml to train MPT, and we have a V2! @huggingface page: huggingface.co/datasets/allen… Feedback? github.com/allenai/peS2o/…
- Blows my mind that model souping Just Works™. Same model, same data: train 3-5 times with different seeds, average the weights, get 1-2 extra points on MMLU, Hellaswag, ARC, GSM8k, etc.
- So many tokens in PDFs, yet so hard to extract them. Not anymore! olmOCR gives you a plain text version of any doc you can think of: science papers, old scans, brochures with weird layouts, even handwriting. Try it today!
Introducing olmOCR, our open-source tool to extract clean plain text from PDFs! Built for scale, olmOCR handles many document types with high throughput. Run it on your own GPU for free, at over 3000 tokens/s, equivalent to $190 per million pages, or 1/32 the cost of GPT-4o!
- biggest gift to humanity any frontier lab could do is add a ton of uv examples in their posttraining mix. save the people from python dependency management hell!!
- xAI employees burning out faster than the surface of the sun, please take care of yourselves guys
Replying to @justindross
I have literally never seen a team work this hard in my entire life. 9-2am everyday, weekends and holidays included
- Selecting pretraining data points based on correlation with downstream tasks is an effective data mixing technique. I love papers that are a simple, elegant idea executed rly well! lovely read from @TristanThrush @ChrisGPotts @tatsu_hashimoto: arxiv.org/abs/2409.05816
- guys Deepseek obviously has more than 2048 H800s. that's just the size of their largest cluster. Deepseek 3 model is amazing but imagine having 130+ researchers on just 2K GPUs lmao
- release day release day! OLMo 1b + 7b out today, and 65b coming soon... With OLMo, we are really focused on advancing the study of LLMs. We release **everything**, from the toolkit to create its training dataset (dolma) to training & inference code. More details in thread
- multimodal PDF processing is painful but doesn't have to be! come to our demo at #EMNLP2023 of Papermage, a library for fast manipulation of PDFs (Friday 9/12 @ 9am). we have used it for LLM data cleanup, paper QA, HCI prototypes. github.com/allenai/paperm… aclanthology.org/2023.emnlp-dem…