Wing Lian (caseus) (@winglian) / X

2,942 posts

Wing Lian (caseus)

@winglian

@axolotl_ai OSS maintainer. Axolotl AI founder. AI/ML tinkerer. Building tools for everyone.

Annapolis, MD

Joined March 2009

Wing Lian (caseus)
@winglian
Mar 21, 2024
This just dropped this morning
GitHub - meta-pytorch/torchtune: PyTorch native post-training library
From github.com
Wing Lian (caseus)
@winglian
Apr 25, 2024
I'm up to 96k context for Llama 3 8B. Using PoSE, we did continued pre-training of the base model w 300M tokens to extend the context length to 64k. From there we increased the RoPE theta to further attempt to extend the context length. 🧵
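The PoSE step mentioned above can be sketched roughly as follows. This is a minimal illustration of the skip-wise position-id idea, not the training code used for the run in the post: the function name `pose_position_ids`, the two-chunk default, the fixed zero offset for the first chunk, and the 8192-token training window are all assumptions made for the example.

```python
import random


def pose_position_ids(train_len, target_len, num_chunks=2, rng=None):
    """Sample position ids for one training sequence, PoSE-style (sketch).

    The train_len tokens are split into contiguous chunks, and each chunk is
    shifted by a non-decreasing random offset, so a short sequence exposes the
    model to position ids drawn from the whole range [0, target_len).
    """
    if train_len > target_len:
        raise ValueError("train_len must not exceed target_len")
    rng = rng or random.Random()

    # Pick num_chunks - 1 distinct cut points strictly inside the sequence.
    cuts = sorted(rng.sample(range(1, train_len), num_chunks - 1))
    bounds = [0] + cuts + [train_len]

    # Positions left over once the real tokens are placed.
    slack = target_len - train_len

    # Assumption: the first chunk keeps offset 0; later chunks get sorted
    # random offsets so position ids stay strictly increasing.
    offsets = [0] + sorted(rng.randint(0, slack) for _ in range(num_chunks - 1))

    position_ids = []
    for start, end, offset in zip(bounds, bounds[1:], offsets):
        position_ids.extend(range(start + offset, end + offset))
    return position_ids


# Hypothetical sizes: an 8192-token training window targeting 64k positions.
ids = pose_position_ids(8192, 65536, rng=random.Random(0))
print(ids[0], ids[-1], len(ids))
```

The point of the design is that the sequence fed to the model stays short, so memory use matches ordinary training, while the position ids it sees cover the longer target range.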
Wing Lian (caseus)
@winglian
Apr 24, 2024
Llama 3 achieves pretty good recall to 65k context w/ rope_theta set to 16M
Wing Lian (caseus)
@winglian
Apr 24, 2024
If you set Llama-3's rope_theta to 8M, you can get 100% passkey retrieval across all depths up to 40K context. No continued pre-training needed. Scaling up further leads to much lower retrieval accuracy, but it doesn't completely fail.
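To get a feel for what raising rope_theta changes, here is a small sketch that computes the RoPE wavelengths for a few base values. It assumes the standard RoPE frequency formula and a head dimension of 128 with a default base of 500,000, which I believe match the released Llama 3 8B config (check `config.json` to be sure). The helper name `rope_wavelengths` is made up for this example, and the numbers only show how far the slowest-rotating dimension stretches; they do not by themselves predict retrieval accuracy.

```python
import math


def rope_wavelengths(theta, head_dim=128):
    """Wavelength, in tokens, of each RoPE frequency pair.

    Standard RoPE uses inverse frequencies theta ** (-2i / d) for
    i = 0 .. d/2 - 1, so the wavelength of pair i is 2 * pi * theta ** (2i / d).
    """
    return [2 * math.pi * theta ** (2 * i / head_dim) for i in range(head_dim // 2)]


# Assumed default base (500k) plus the two values mentioned in the posts.
for theta in (500_000.0, 8_000_000.0, 16_000_000.0):
    longest = rope_wavelengths(theta)[-1]
    print(f"rope_theta={theta:>12,.0f}  longest wavelength ~ {longest:,.0f} tokens")
```

A larger base slows the rotation of every dimension except the first, so positions far beyond the original training window still map to angles the model has seen before.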
Wing Lian (caseus)
@winglian
Oct 29, 2023
Here's the pre-trained multimodal Mistral-LLaVA projector model since nobody has publicly released one yet. huggingface.co/openaccess-ai-… Thanks to @imhaotian for open-sourcing LLaVA and the rest of the open source community helping to push truly open source AI. Trained with axolotl.
openaccess-ai-collective/mistral-7b-llava-1_5-pretrained-projector · Hugging Face
From huggingface.co
Wing Lian (caseus)
@winglian
Feb 7, 2025
The sort of Distillation @deepseek_ai used isn't the KD variant using logits, but 800k samples of SFT data generated from their R1 model. We need some better terminology around distillation to avoid the confusion of talking about distillation.
Oriol Vinyals
@OriolVinyalsML
Feb 6, 2025
Distillation has been on the news (!) due to @deepseek_ai. The paper arxiv.org/abs/1503.02531 was actually rejected from NeurIPS 2014 due to lack of novelty 🧐 (true-ish), and lack of impact 🙃. Thanks reviewer#2 (literally), and thanks for @arxiv! @geoffreyhinton @JeffDean
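The terminology gap is easier to see side by side. The sketch below contrasts a logit-based KD loss (temperature-softened KL divergence, in the style of the linked paper) with the plain cross-entropy used when you fine-tune on text a teacher generated. The function names and the temperature of 2.0 are illustrative assumptions, and this is not code from DeepSeek or axolotl.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def logit_kd_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on softened distributions, scaled by T^2.

    Needs the teacher's full (or top-k) output distribution at every position.
    """
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * (math.log(pi) - math.log(qi)) for pi, qi in zip(p, q) if pi > 0)
    return kl * temperature ** 2


def sft_loss(student_logits, target_token):
    """Cross-entropy against one token id taken from teacher-generated text.

    Only needs the sampled text, not the teacher's logits.
    """
    return -math.log(softmax(student_logits)[target_token])


student = [0.5, 0.1, -0.3]
teacher = [2.0, 0.0, -1.0]
print(logit_kd_loss(student, teacher))
print(sft_loss(student, target_token=0))
```

The first loss transfers the teacher's whole distribution over the vocabulary; the second only sees which token the teacher happened to emit, which is why the two are worth naming differently.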
Wing Lian (caseus)
@winglian
Feb 18, 2025
Axolotl v0.7.0 is out!
- GRPO support
- Process Reward Model support
- KD Training from offline top-k logprobs
- Multi-GPU LoRA kernels
- Deploy your training and evaluation workloads straight to Modal from the axolotl CLI
- Sweeps
- Chat template parsing improvements
- Improved …
Wing Lian (caseus)
@winglian
Mar 8, 2024
Thanks to the amazing work of @jeremyphoward, the @answerdotai team (special thanks to @johnowhitaker & @benjamin_warner for walking through the changes with me) for getting FSDP + QLoRA working. We've managed to integrate their findings into @axolotl_ai and now have additional …
Jeremy Howard
@jeremyphoward
Mar 7, 2024
Today, with @Tim_Dettmers, @huggingface, & @Mobius_Labs, we're releasing FSDP/QLoRA, a new project that lets you efficiently train very large (70b) models on a home computer with consumer gaming GPUs. 1/🧵 answer.ai/posts/2024-03-…
Wing Lian (caseus)
@winglian
Feb 10, 2025
GRPO + R1-RL + gsm8k, but learns in about 1/30th the steps. Probably needs some hyperparameter tuning. Will report more when this finishes in the morning 🤞
Wing Lian (caseus)
@winglian
Feb 10, 2025
What's the trick? DoRA. I don't have a great hypothesis on why it works yet, but I've upstreamed the changes to TRL. The PR merges the LoRA weights into the base model and ships those to vLLM over the Python API and avoids the complexity of the vllm REST API for lora adapters.
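For readers unfamiliar with the merge step described in the post, here is a minimal sketch of folding a plain LoRA adapter into a base weight matrix, so an inference engine can load ordinary dense weights instead of a separate adapter. It uses nested lists to stay dependency-free, and `matmul` and `merge_lora` are hypothetical helper names, not TRL or vLLM APIs. DoRA additionally rescales the merged weight with a learned magnitude vector, which is not shown here.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def merge_lora(base_weight, lora_a, lora_b, alpha, rank):
    """Return W + (alpha / rank) * B @ A.

    Shapes: base_weight is (out, in), lora_b is (out, rank), lora_a is (rank, in).
    """
    scale = alpha / rank
    delta = matmul(lora_b, lora_a)
    return [
        [w + scale * d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(base_weight, delta)
    ]


# Tiny illustrative example: a 2x2 base weight with a rank-1 adapter.
base = [[1.0, 0.0], [0.0, 1.0]]
lora_a = [[1.0, 2.0]]    # (rank=1, in=2)
lora_b = [[1.0], [0.0]]  # (out=2, rank=1)
merged = merge_lora(base, lora_a, lora_b, alpha=2.0, rank=1)
print(merged)
```

After merging, the result has the same shape as the base weight, which is what lets it be handed to the inference side as a normal set of model weights.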
Wing Lian (caseus)
@winglian
Apr 2, 2025
Axolotl v0.8.0 is out today! Major features include support for Sequence Parallelism, Gemma3, Multimodal (beta), Muon optimizer, and a major expansion to our docs! We've worked to make sure that our features are composable, leading to 3.6x speedups over vanilla HF+FA2 with >50% …
Wing Lian (caseus)
@winglian
Apr 12, 2024
- 13B parameter BitNet + infini-Attention + DenseFormer + MoD + In Context-Pretraining + 2 stage pretraining
- upcycle w c-BTX to an 8 expert sparse MoE + MoA
I’m sure I’m missing about 20 other techniques to throw into a pretrained model/architecture 🤓 In all seriousness …
Wing Lian (caseus)
@winglian
Jul 22, 2024
700 H100 GPU-hrs later, here's what I've learned so far attempting to full finetune Mistral-Nemo 12B.
- stay away from FSDP. neither 32bit nor 8bit adamw seems to work properly. The model outputs are garbage.
- Deepspeed zero2 works great, even with 8bit adamw. Thanks to …
Wing Lian (caseus)
@winglian
Jul 21, 2024
Anyone having issues with the outputs when finetuning @MistralAI's Mistral-Nemo-12B? Seems to be tokenizer related, but not 100% sure.
Wing Lian (caseus)
@winglian
Mar 29, 2024
You'll need at least one A100, but qlora finetuning of Jamba with fast mamba kernels is working in axolotl now.