Wing Lian (caseus)
2,942 posts
@axolotl_ai OSS maintainer. Axolotl AI founder. AI/ML tinkerer. Building tools for everyone.
- I'm up to 96k context for Llama 3 8B. Using PoSE, we did continued pre-training of the base model w 300M tokens to extend the context length to 64k. From there we increased the RoPE theta to further attempt to extend the context length. 🧵
- Llama 3 achieves pretty good recall to 65k context w/ rope_theta set to 16M
  Quoted post: If you set Llama-3's rope_theta to 8M, you can get 100% passkey retrieval across all depths up to 40K context. No continued pre-training needed. Scaling up further leads to much lower retrieval accuracy, but it doesn't completely fail.
- Here's the pre-trained multimodal Mistral-LLaVA projector model since nobody has publicly released one yet. huggingface.co/openaccess-ai-… Thanks to @imhaotian for open-sourcing LLaVA and the rest of the open source community helping to push truly open source AI. Trained with axolotl.
- The sort of Distillation @deepseek_ai used isn't the KD variant using logits, but 800k samples of SFT data generated from their R1 model. We need some better terminology around distillation to avoid the confusion of talking about distillation.
  Quoted post: Distillation has been on the news (!) due to @deepseek_ai. The paper arxiv.org/abs/1503.02531 was actually rejected from NeurIPS 2014 due to lack of novelty 🧐 (true-ish), and lack of impact 🙃. Thanks reviewer#2 (literally), and thanks for @arxiv! @geoffreyhinton @JeffDean
- Axolotl v0.7.0 is out!
  - GRPO support
  - Process Reward Model support
  - KD Training from offline top-k logprobs
  - Multi-GPU LoRA kernels
  - Deploy your training and evaluation workloads straight to Modal from the axolotl CLI
  - Sweeps
  - Chat template parsing improvements
  - Improved
- Thanks to the amazing work of @jeremyphoward, the @answerdotai team (special thanks to @johnowhitaker & @benjamin_warner for walking through the changes with me) for getting FSDP + QLoRA working. We've managed to integrate their findings into @axolotl_ai and now have additional
  Quoted post: Today, with @Tim_Dettmers, @huggingface, & @Mobius_Labs, we're releasing FSDP/QLoRA, a new project that lets you efficiently train very large (70b) models on a home computer with consumer gaming GPUs. 1/🧵 answer.ai/posts/2024-03-…
- GRPO + R1-RL + gsm8k, but learns in about 1/30th the steps. Probably needs some hyperparameter tuning. Will report more when this finishes in the morning 🤞
- What's the trick? DoRA. I don't have a great hypothesis on why it works yet, but I've upstreamed the changes to TRL. The PR merges the LoRA weights into the base model and ships those to vLLM over the Python API and avoids the complexity of the vllm REST API for lora adapters.
  Quoted post: GRPO + R1-RL + gsm8k, but learns in about 1/30th the steps. Probably needs some hyperparameter tuning. Will report more when this finishes in the morning 🤞
- Axolotl v0.8.0 is out today! Major features include support for Sequence Parallelism, Gemma3, Multimodal (beta), Muon optimizer, and a major expansion to our docs! We've worked to make sure that our features are composable leading to 3.6x speedups over vanilla HF+FA2 with >50%
- - 13B parameter BitNet + infini-Attention + DenseFormer + MoD + In Context-Pretraining + 2 stage pretraining - upcycle w c-BTX to an 8 expert sparse MoE + MoA I’m sure I’m missing about 20 other techniques to throw into a pretrained model/architecture 🤓 In all seriousness
- 700 H100 GPU-hrs later, here's what I've learned so far attempting to full finetune Mistral-Nemo 12B.
  - stay away from FSDP. neither 32bit nor 8bit adamw seems to work properly. The model outputs are garbage.
  - Deepspeed zero2 works great, even with 8bit adamw. Thanks to
  Quoted post: Anyone having issues with the outputs when finetuning @MistralAI's Mistral-Nemo-12B? Seems to be tokenizer related, but not 100% sure.
- You'll need at least one A100, but qlora finetuning of Jamba with fast mamba kernels is working in axolotl now.
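
Several posts above talk about raising `rope_theta` (8M, 16M) to stretch Llama 3's usable context. A rough way to see why that helps is to look at the wavelengths of the rotary position embedding frequencies. The sketch below uses the standard RoPE formula and assumes a head dimension of 128 and a stock `rope_theta` of 500,000 for Llama 3 8B. Both values are assumptions taken from the published model config, not from the posts, and the function name is made up for illustration. It says nothing about retrieval accuracy, only about how far apart two positions can be before the slowest-rotating pair wraps around.

```python
import math


def rope_wavelengths(rope_theta: float, head_dim: int = 128) -> list[float]:
    """Wavelength, in tokens, of each RoPE frequency pair.

    Standard RoPE uses inv_freq_i = rope_theta ** (-2i / head_dim)
    for i = 0 .. head_dim/2 - 1, so the wavelength is 2*pi / inv_freq_i.
    head_dim=128 is an assumption (Llama 3 8B: 4096 hidden / 32 heads).
    """
    return [
        2 * math.pi * rope_theta ** (2 * i / head_dim)
        for i in range(head_dim // 2)
    ]


# 500_000 is assumed to be the stock Llama 3 value; 8M and 16M are the
# settings mentioned in the posts.
for theta in (500_000, 8_000_000, 16_000_000):
    longest = rope_wavelengths(theta)[-1]
    print(f"rope_theta={theta:>10,}: longest wavelength ~ {longest:,.0f} tokens")
```

A larger `rope_theta` stretches every wavelength except the first, so positions far beyond the original training length still map to rotation angles the model has seen. That is only the intuition; the passkey-retrieval numbers in the posts come from actually running the model.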









