Releases: axolotl-ai-cloud/axolotl
v0.16.1
Axolotl v0.16.1 Release Notes
Gemma 4 Support
Example YAML: https://github.com/axolotl-ai-cloud/axolotl/blob/main/examples/gemma4/26b-a4b-moe-qlora.yaml
Full Changelog: v0.16.0...v0.16.1
v0.16.0
Axolotl v0.16.0 Release Notes
We're very excited to share this packed new release, with ~80 new commits since v0.15.0 (March 6, 2026).
Highlights
Async GRPO – Asynchronous Reinforcement Learning Training (#3486)
Full support for asynchronous Group Relative Policy Optimization with vLLM integration. Includes async data producer with replay buffer, streaming partial-batch training, native LoRA weight sync to vLLM, and FP8 compatibility. Supports multi-GPU via FSDP1/FSDP2 and DeepSpeed ZeRO-3.
Achieves up to 58% faster step times (1.59s/step vs 3.79s baseline on Qwen2-0.5B).
| Optimization | Step Time | Improvement |
|---|---|---|
| Baseline | 3.79s | – |
| + Batched weight sync | 2.52s | 34% faster |
| + Liger kernel fusion | 2.01s | 47% faster |
| + Streaming partial batch | 1.79s | 53% faster |
| + Element chunking + re-roll fix (500 steps) | 1.59s | 58% faster |
ScatterMoE + LoRA Fused Triton Kernels (#3513)
Custom fused Triton kernels for training MoE models with LoRA adapters. By fusing the base expert matmul and LoRA computation into a single kernel pass, these kernels achieve up to 15x faster forward passes and 40x less activation memory vs the eager baseline.
Implementation detail
Key innovations include atomic-free split backward kernels (11x faster dA/dB gradients), autotunable tile sizes, fused gather backward (eliminates intermediate grouped-X buffer when workload is small), selective NF4 dequantization (only dequantizes routed experts - up to ~97% weight memory reduction per layer at short sequences where few experts are activated, though the savings diminish at longer contexts as most experts see at least one token), and H200/B200 register pressure tuning.
SonicMoE Fused LoRA (#3519)
LoRA support for SonicMoE (CUTLASS-based MoE kernels for Hopper/Blackwell GPUs).
This results in up to a 1.45x speedup and a 30% reduction over the `grouped_mm` baseline on a 1x H100 SXM GPU (Qwen3.5-35B-A3B 8-bit LoRA tuning).
GRPO Flattening & Packing (#3552)
Enables batch flattening and sample packing for GRPO training, improving token efficiency and training throughput (~10%).
Flash Attention 4 support (#3481)
Support for Flash Attention 4 on Hopper and Blackwell GPUs with automatic fallback to FA2/3 depending on hardware.
We found improvements mainly at higher sequence lengths and on Blackwell GPUs.
NeMo Gym Integration (#3516)
Full integration with NVIDIA NeMo Gym for single-turn and multi-turn RL training. Re-use existing environments in Axolotl now! https://github.com/NVIDIA-NeMo/Gym
Implementation detail
Single-turn mode calls `/verify` endpoints for deterministic rewards (math, logic). Multi-turn mode delegates generation to a NeMo Gym agent server that orchestrates the full agentic loop – the model generates tool calls, the agent executes them against environment servers, feeds results back, and repeats until completion. An `env_mask` separates model-generated tokens (trained on) from tool results (masked out), and IS correction handles the logprob mismatch between agent-side generation and the training policy. Combined with `async_prefetch`, generation and training overlap for ~3x throughput on multi-turn tasks.
Energy-Based Fine-Tuning (EBFT) (#3527)
Novel RL training method that matches generated completions to ground-truth feature embeddings without requiring external reward models. Supports structured (vLLM) and strided (single GPU) modes.
Performance & Kernel Optimizations
- Custom Triton Kernels for RL (#3510) – Fused Triton kernels for `entropy_from_logits` (single-pass online algorithm) and `selective_log_softmax` (fused forward+backward). Both avoid materializing the full `[B, L, V]` softmax intermediate. Benchmarked on Qwen vocab (V=151,936), RTX 5090:
  - `entropy_from_logits`: 5.2x faster, near-zero memory overhead (0.2 MB vs 120 MB); on non-contiguous tensors: 7.3x faster, saves up to 10 GB by avoiding a `.contiguous()` copy
  - `selective_log_softmax`: 2.9x faster forward, 3.2x faster fwd+bwd, memory reduced from ~5 GB to <0.2 MB (at B=8, L=2048) – the original OOMs at B=16 where the Triton version succeeds
- LoRA Kernel Enhancements (#3528) – Extended LoRA kernels to support bias, dropout, DoRA, and embeddings with FSDP2 compatibility, making them production-ready for all LoRA configurations.
- Liger Kernels for Qwen 3.5 (#3531) β Fused RMSNorm + gated activation kernels for Qwen 3.5 and Qwen 3.5 MoE.
- Auto TF32 (#3473) β Automatically enables TF32 matmul/cuDNN when hardware supports it.
- Reduced Autotune Search Space (#3525) β Faster kernel autotuning startup.
New Features
- CPU Layer Offloading (#3512) – Offload frozen decoder layers to CPU during LoRA training, streaming to GPU only during forward/backward. Dramatically reduces VRAM usage for large models.
- Multiple Custom Optimizers (#3457) – End-to-end support for Flash AdamW, Optimi AdamW, ADOPT AdamW, Muon, Dion, Schedule-Free AdamW, and CAME PyTorch.
- Memory-Efficient LoRA Merging (#3095) – Iterate through weight bins without loading entire layers into memory, enabling merging of massive models on resource-constrained systems.
- Custom MoE Routing (#3526) – Support for Ernie 4.5 MoE and Hunyuan V1 MoE custom routing strategies beyond standard softmax-topk.
- Synthetic Benchmarking Datasets (#3518) – Built-in synthetic datasets for benchmarking and testing training pipelines.
- MX Quantization-Aware Training (QAT) (#3553) – Support for MX format QAT with torchao integration, including state dict hooks for saving dequantized safetensors from transformers.
- dtype Propagation for torchao (#3569) – Proper dtype handling through torchao quantization and optimizer paths.
- Lazy Trainer Loading (#3568) – Lazy-load trainer classes to avoid unnecessary imports, improving startup time.
Documentation
- Greatly improved documentation for GRPO and Agents use (#3564) – New dedicated guides for GRPO training, vLLM serving, training stability and monitoring, debugging, and choosing training methods. Adds agent-specific documentation and deduplicates content across pages.
Model & Framework Support
Deprecation
New Model Support
- Mistral Small 4 (#3502)
- Qwen 3.5 / Qwen 3.5 MoE (#3515, #3523, #3554)
- NeMo Super (Nemotron-H) Support (#3508)
Updates
- Transformers 5.4.0 (#3562)
- lm-eval updated for Transformers v5 (#3571)
- Torchao 0.17.0 (#3569)
- Gemma 3 config fixes (#3500)
- Docker builds for py312-cu128-torch2.9.1 (#3489)
Bug Fixes
- Fixed high eval loss with sample packing (#3478)
- Fixed double sequence partition with Context Parallelism (#3498)
- Fixed DPO compatibility with transformers v0.29 (#3560)
- Fixed Ray train crashing after succeeding (#3542)
- Fixed race condition in patching checks (#3543)
- Fixed token state JSON and Mistral tokenizer issue (#3522)
- Fixed async model loading with quantized BNB models (#3477)
- Fixed MoE quant patch for merge mismatches (#3483)
- Fixed connection error handling for user whoami check (#3529)
- Patches only applied when CUDA is available (#3561)
- Fixed shell injection in modal CLI (#3487)
- Fixed DPO tool role KeyError (#3217), dataset hash output_dir (#3303), config validators (#3538)
- Added `precompute_ref_log_probs` to config schema (#3555)
- Added troubleshooting note for GLM4 GGUF MTP mismatch (#3559)
- Allow `bf16` flag with deprecation warning (#3563)
Infrastructure
- Scored rollouts dispatched to plugins with extended external plugin paths (#3549)
- CI improvements: codecov informational only, better error handling (#3534, #3517)
- Make CI fail GitHub Actions on test failures (#3517)
- Consolidate routing behavior in ScatterMoE kernels (#3475)
- `roundup_power2_divisions` no longer needed with newer PyTorch versions (#3540)
- Logging cleanup (#3482)
- Update pre-commit hooks (#3567)
- Update README (#3503)
- Docker: ubuntu user improvements for uv images (#3491, #3492, #3494, #3495, reverted #3496)
What We Tried (So You Don't Have To)
A few optimizations we prototyped but found little-to-no improvement – sharing so others can skip these rabbit holes:
- Fused NF4 dequant inside the ScatterMoE Triton GEMM kernel – reads NF4-packed expert weights directly in the inner loop, avoiding the bf16 intermediate buffer entirely. Correctness-verified (zero numerical error), but 3x slower than the separate dequant + bf16 path. BnB's dequant kernel reads data linearly at full memory bandwidth (~1 TB/s), while the fused kernel's scattered byte addressing and per-element NF4 codebook lookups destroy memory coalescing.
- Batched NF4 dequantization – concatenating gate_up + down_proj packed NF4 data into a single dequant call. Produced identical performance (1.00x) – the kernel is purely bandwidth-bound, so one large call = two smaller calls at the same total bytes.
- Dequant buffer pool for NF4 MoE – pre-allocating reusable bf16 buffers to avoid 512 alloc/free cycles per step during NF4 expert dequantization. Reduced memory fragmentation from 20....
v0.15.0
Axolotl v0.15.0 Release Notes
This release brings new model support, significant MoE improvements, infrastructure updates with Torch 2.10.0 and uv builds, and a collection of quality-of-life fixes across the board.
Major Changes
Torch 2.10.0 & uv Builds
We have upgraded to Torch 2.10.0 and introduced uv-based Docker builds for faster, more reproducible image creation. Python 3.14 is now used in unit tests.
ScatterMoE LoRA & SonicMoE
Added LoRA support for ScatterMoE and introduced SonicMoE as a new MoE kernel option for faster and more memory-efficient MoE training compared to grouped_mm in transformers.
- ScatterMoE LoRA by @winglian in #3410.
- SonicMoE by @NanoCode012 in #3411.
MoE Expert Quantization
Added support for quantizing MoE expert weights in Transformers v5, which dramatically reduces peak reserved memory. For example, GLM-4.7-Flash QLoRA drops from ~127 GiB to ~23 GiB reserved memory. Activate via `quantize_moe_experts: true`. See the expert quantization docs for details.
- Contributed by @NanoCode012 in #3439.
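As a minimal sketch of how this might look in a config – only `quantize_moe_experts` comes from this release; the model name and surrounding keys are illustrative placeholders:

```yaml
# Hypothetical QLoRA config fragment; quantize_moe_experts is the new option,
# base_model is a placeholder and the other keys are illustrative
base_model: your-org/your-moe-model
adapter: qlora
load_in_4bit: true
quantize_moe_experts: true  # quantize MoE expert weights (Transformers v5)
```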
New Features
- New Model Support:
- SageAttention: Added SageAttention integration for efficient attention computation. (#2823 by @NanoCode012)
- MXFP4 Quantization: Added support for MXFP4 quantization. (#3375 by @ved1beta)
- Hub Revision Support: Added `hub_revision` for specifying a branch when pushing checkpoints. (#3387 by @madScientist10)
- Dot-Notation CLI Args: Support for dot-notation CLI arguments for nested config options. (#3419 by @ManasVardhan)
- Sample Generation for SFT: New sample generation support for SFT training. (#3240 by @ved1beta)
- `train_per_sec_per_gpu` Metric: Added a new training throughput metric. (#3364 by @ved1beta)
Breaking Changes
- `dataset_processes` → `dataset_num_proc`: The config field `dataset_processes` has been renamed to `dataset_num_proc`. (#3352 by @tgoab)
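A minimal before/after sketch of the rename (the value is illustrative):

```yaml
# Before (v0.14.x and earlier):
# dataset_processes: 8

# After (v0.15.0+):
dataset_num_proc: 8  # number of processes used for dataset preprocessing
```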
Bug Fixes
- Context Parallel:
- GRPO:
  - Fixed config not accepting `max_prompt_length`. (#3390 by @NanoCode012)
  - Moved rollout logic into `set_training_kwargs`. (#3392 by @ved1beta)
- Telemetry:
- Removed telemetry warning and improved logging. (#3397, #3398 by @NanoCode012)
- Disabled telemetry on non-master ranks. (#3438 by @NanoCode012)
- LoRA Kernels: Improved failure messaging and handling of `trust_remote_code`. (#3378 by @NanoCode012)
- Generation Mode: Fixed `add_special_tokens` handling and enabled test mode for generation. (#3396 by @NanoCode012)
- CCE Generic Patch: Fixed generic patch for Cut Cross Entropy. (#3405 by @winglian)
- MistralProcessor: Updated for Transformers v5 compatibility. (#3423 by @NanoCode012)
- Adapter Config Validation: Set allowed values for `adapter` in Pydantic config. (#3415 by @NanoCode012)
- Tokenizer/Processor Revision: Pass `revision` parameter to tokenizer and processor loaders. (#3388 by @madScientist10)
- Excess Length Truncation: Fixed `excess_length_strategy` truncation method. (#3401 by @rlronan)
- Scheduler Fallback: Use `self.optimizer` if optimizer not passed to `SchedulerMixin.create_scheduler()`. (#3435 by @kallewoof)
- Dataset Deduplication: Fixed saving the de-duplicated dataset during pre-processing. (#3427 by @ManasVardhan)
- FSDP2 Sharding: Fixed FSDP2 sharding and validated AO version for LR groups. (#3403 by @bekk02)
- Qwen3-Next packing: Fixed causal-conv1d call for packing via FLA instead of native PyTorch. (#3437 by @NanoCode012)
- Quantize MoE validation: Fixed validation to prevent `lora_target_linear` from being true when quantizing MoE experts. (#3461 by @NanoCode012)
- TP batch_size calculation: Fixed the TP batch_size calculation. (#3462 by @Yatimai)
Dependency & Infrastructure Updates
- `transformers` to 5.3.0, `accelerate` to 1.13.0, `trl` to 0.29.0, `kernels` to 0.12.2, `torchao` to 0.16.0. (#3407, #3459 by @winglian)
- Trackio 0.16.1: Updated trackio to 0.16.1. (#3425 by @winglian)
- Cut-Cross-Entropy Bump: Bumped cut-cross-entropy to `58d6572`. (#3424 by @NanoCode012)
- BunnyCDN for CI: CI assets now served via BunnyCDN. (#3422 by @winglian)
- Telemetry Params: Included number of params and rounded estimates in PostHog telemetry. (#3455 by @winglian)
Documentation
- Clarified ScatterMoE usage. (#3408 by @NanoCode012)
- Clarified lm_eval plugin usage. (#3404 by @NanoCode012)
- Added limitation note for `unfrozen_parameters`. (#3416 by @NanoCode012)
- Added expert quantization docs, GLM4.5 Air example configs, and updated README. (#3452 by @NanoCode012)
- Add lint information to contributor doc (#3458 by @winglian)
Other Changes
- set 0.15.0.dev0 version by @winglian in #3380
- Fix typo in dataset_processes field by @lorenzbaraldi in #3426
- mark slow tests that are timing out in CI by @winglian in #3428
- don't install torch ao on arm64 by @winglian in #3448
- uv cloud image should use uv w pip by @winglian in #3449
- fix python version typo for building 3.11 by @winglian in #3454
- fix uv cache subcommand by @winglian in #3447
- extend pytest-sdist timeout to 30 min for slow/flaky tests by @winglian in #3456
- chore: update pre-commit hooks by @github-actions[bot] in #3381
New Contributors
- @tgoab made their first contribution in #3352
- @madScientist10 made their first contribution in #3387
- @lorenzbaraldi made their first contribution in #3426
- @ManasVardhan made their first contribution in #3419
- @rlronan made their first contribution in #3401
- @bekk02 made their first contribution in #3403
- @Yatimai made their first contribution in #3444
Full Changelog: v0.14.0...v0.15.0
v0.14.0
This is a major release marking our migration to Transformers v5. Along with this significant core dependency upgrade, we are introducing major performance optimizations for MoE models and new fine-tuning methods.
Major Changes
Transformers v5 Upgrade
We have upgraded the underlying transformers dependency from v4 to v5. This has been a long-term effort to ensure Axolotl remains compatible with the latest ecosystem advancements, stability improvements, and future model architectures.
Faster MoE Training
We've added support for selecting MoE kernels via transformers: batched_mm and grouped_mm. Alongside this, we have added our custom integration for scattermoe. This significantly speeds up training and reduces VRAM usage for Mixture of Experts (MoE) models.
New Features
- EAFT Support: Added support for Efficient Adaptation Fine-Tuning (EAFT). (#3366 by @salmanmohammadi)
- New CCE Support: Added Cut Cross Entropy support for GLM 4.7 Flash, GLM Image, GLM 4.6v, and Exaone 4. (#3373 by @NanoCode012)
Full Changelog: v0.13.2...v0.14.0
v0.13.2
Axolotl v0.13.2 Release Notes
This is a patch release introducing GDPO support and updating core infrastructure, including newer CUDA defaults and Python versions.
New Features
Dependency & Infrastructure Updates
- CUDA 12.9.1: Base images now default to CUDA 12.9.1. (#3367 by @winglian)
- Python 3.12: Added Python 3.12 support to base images. (#3367 by @winglian)
- vLLM Upgrade: Upgraded vLLM dependency to v0.14.0. (#3345 by @winglian)
Other fixes
- Version dev by @winglian in #3365
- strip only the starting 'v' char; e.g. don't strip the 'v' from '.dev' in the version by @winglian in #3368
Full Changelog: v0.13.1...v0.13.2
v0.13.1
This release brings support for PyTorch 2.9.1, expands our ecosystem with new experiment trackers (SwanLab and Trackio), and introduces support for a wide range of new models including Olmo3, Ministral 3, InternVL 3.5, and Kimi. We've also included significant improvements to quantization workflows and metrics logging.
New Features
Expanded Model Support
We've added support for more models!
- Olmo3: including Olmo and Olmo2. (#3275 by @NanoCode012)
- Ministral 3 (#3297, #3300 by @NanoCode012)
- InternVL 3.5: (#3141 by @NanoCode012)
- Kimi: using experimental training code. (#3257 by @NanoCode012)
- Trinity: by ArceeAI. (#3292 by @NanoCode012)
- Exaone 4: (#3279 by @nayohan)
- MiMo & Plano: (#3332 by @NanoCode012)
New Experiment Tracking Integrations
- SwanLab: You can now use SwanLab for experiment tracking. (#3334 by @PraMamba)
- Trackio: Added Trackio validation integration. (#3253 by @abidlabs)
Training & PEFT Improvements
- Liger Kernel for DPO: Added Liger kernel support for DPO training. (#3302 by @ved1beta)
- Distributed Muon: Added support for distributed Muon optimizer. (#3264 by @salmanmohammadi)
- Weight Tying Safety: Added `peft_ensure_weight_tying` to ensure correct parameter handling in PEFT. (#3278 by @NanoCode012)
- Adapter Dtypes: Added `peft_autocast_adapter_dtype` config option for fine-grained control. (#3311 by @xzuyn)
- Cheap PPL Metric: A new metric calculation for perplexity that is less computationally expensive. (#3317 by @xzuyn)
- Scaled Softmax: Scales the softmax calculation by `s * log(n) + b`. (#3338 by @ved1beta)
Deprecations & Warnings
PyTorch 2.7.1 Deprecation
Support for PyTorch 2.7.1 has been deprecated. We recommend upgrading to newer supported versions.
Fixes & Improvements
Quantization & CLI
- Save Processor: The quantizer CLI now properly saves the processor alongside the model. (#3290 by @salmanmohammadi)
- FP8 Checks: Fixed checks for FP8 capability and `load_in_8bit` configurations. (#3324, #3327)
- NVFP4 Configs: Added QAT NVFP4 configs for reference. (#3280 by @salmanmohammadi)
Logging & Metrics
- Metric Rounding: You can now set the env var `AXOLOTL_METRIC_PRECISION` (default 5) to control the rounding of logged metrics. (#3325 by @ved1beta)
- Token Logging: Fixed total/trainable tokens logging logic. (#3344, #3293 by @ved1beta)
- Evaluation Loss: Fixed evaluation loss calculation in the KD trainer. (#3271 by @roycho96)
Data Processing
- Long Sequence Handling: Added an option to raise an error when long sequences are dropped, preventing silent data loss. (#3321 by @kallewoof)
- Qwen3 Tokenization: Fixed an off-by-a-few-tokens issue in Qwen3 Jinja tokenization. (#3295 by @NanoCode012)
Dependency & Infrastructure Updates
- PyTorch 2.9.1: Added base images and support for PyTorch 2.9.x and xformers wheels. (#3268, #3273, #3308 by @winglian)
- CUDA 13.0: Added initial test matrices and wheel support for CUDA 13.0. (#3343, #3342 by @winglian)
- Liger Kernel: Upgraded to 0.6.4. (#3289 by @NanoCode012)
- Pydantic: Upgraded to 2.12. (#3328 by @winglian)
- General Deps: End of year dependency upgrades. (#3299 by @winglian)
- Scikit-learn: Removed unused dependency `scikit-learn==1.4.2`. (#3277 by @ved1beta)
- Transformers and peft: Upgraded to 4.57.6 and 0.17.1 respectively. (#3358, #3361 by @winglian)
Others
- fix bin size by @ved1beta in #3307
- pre-commit hooks update. (#3287 by @github-actions[bot])
- `PYTORCH_CUDA_ALLOC_CONF` deprecation fix to ensure compatibility with future PyTorch versions. (#3313 by @NanoCode012)
- fix: Fix evaluation loss in KD trainer by @roycho96 in #3271
- Fix typo in densemixer RuntimeError by @bethrezen in #3327
- fix preview docs failing due to running out of disk by @winglian in #3326
- feature: raise on long sequence drop by @kallewoof in #3321
- feat: cleanup old flex mask patch, suppress Matmul bnb warn, and misc by @NanoCode012 in #3330
- docs for checkpoint saving by @ved1beta in #3335
- fix: gemma3_text model loading vision config by @NanoCode012 in #3354
- fix syntax for secrets in gha yaml by @winglian in #3355
- Update PR template by @salmanmohammadi in #3349
- fix amd64 and set 2.9.1 as latest cloud image by @winglian in #3356
- don't install deepspeed in arm64 images by @winglian in #3357
- don't install xformers in for arm64 by @winglian in #3359
- set version to v0.13.1 by @winglian in #3363
New Contributors
- @nayohan made their first contribution in #3279
- @roycho96 made their first contribution in #3271
- @bethrezen made their first contribution in #3327
- @abidlabs made their first contribution in #3253
- @PraMamba made their first contribution in #3334
- @1sand0s made their first contribution in #3346
Full Changelog: v0.13.0...v0.13.1
v0.13.0
This release is packed with major new features, including Streaming SFT for massive datasets, a new Text Diffusion training plugin, and a significant upgrade to our Quantization-Aware Training (QAT) capabilities with NVFP4 support. We're also thrilled to announce support for a huge variety of new models like Gemma3, the Qwen3 family, Hunyuan, Granite 4, and many more.
Alongside these headline features, this release brings support for PyTorch 2.9 and CUDA 13, major dependency upgrades, new developer tooling with Ruff, and a host of important bug fixes to improve training stability and user experience.
New Features
Streaming Supervised Finetuning (SFT)
You can now finetune on datasets of any size without needing to pre-process and load the entire dataset into memory. Streaming SFT processes data on the fly, dramatically reducing memory usage and startup times, making it ideal for large-scale training.
Text Diffusion Training Plugin
Explore a new paradigm of training with our Text Diffusion plugin! This allows you to train models on diffusion-based objectives, opening up new possibilities for text generation and manipulation tasks.
Upgraded Quantization-Aware Training (QAT) with NVFP4 Support
We've migrated to a new QAT API and enhanced the axolotl quantize command. This release introduces support for NVFP4, a new 4-bit floating-point format, further pushing the boundaries of model quantization and efficiency.
- Contributed by @salmanmohammadi in #3107.
Expanded Model Support
We've added support for a wide array of new and powerful models:
- Qwen3 Family: `qwen3-next`, `qwen3_vl`, and `qwen3_vl_moe` (#3150, #3178 by @NanoCode012, #3183 by @miketung)
- Granite: `Granite 4` examples and MoE variants `granitemoeshared` & `granitemoehybrid` (#3256, #3178, #3158 by @NanoCode012)
- Gemma3: Support for `gemma3_text` attention handling (#3103 by @NanoCode012)
- Hunyuan: Added support for the Hunyuan v1 model (#3016 by @NanoCode012)
- Magistral: `Magistral-small-2509` and native Mistral3 tokenizer support (#3165 by @NanoCode012)
- LFM2 Family: Support for `LFM2` and the latest MoE model (#3208 by @NanoCode012)
- Other Models: `Apertus` (#3144 by @NanoCode012), `Seed-OSS` (#3104 by @NanoCode012)
Quality-of-Life & Training Improvements
- FSDPv2 Swap Memory: Reduce VRAM usage for FSDP training by swapping memory to the CPU, now with compatibility fixes for QLoRA. (#3167 by @gholmes829)
- Fixed IPO Dataset loading: IPO training will now properly load DPO specific configs. (#3128 by @seungduk-yanolja)
- Modernized Tooling with Ruff: We've replaced `black`, `isort`, `flake8`, and `pylint` with `ruff` for a faster and unified linting and formatting experience. (#3092 by @djsaunde)
- Tokens Per Second Logging: The trainer now logs `tokens_per_second` by default for better performance monitoring. (#3072, #3134 by @salmanmohammadi)
- Single GPU DeepSpeed Zero3: Environment variables are now automatically configured for single-GPU DeepSpeed ZeRO-3 runs, simplifying setup. (#3118 by @winglian)
- JSON String Tool Arguments: Function-calling/tool-use training now supports JSON strings in tool arguments for greater flexibility. (#3136 by @gamersover)
- Refactored FSDP Config: FSDP parameters are now neatly organized under a single `fsdp_config` dictionary for cleaner configurations. (#3170 by @salmanmohammadi)
Important changes
PyTorch 2.6.0 Deprecation
PyTorch 2.8 and 2.9 are now supported in Axolotl. Following our previous deprecation cycle, we will now stop support for PyTorch 2.6. The current Docker image is PyTorch 2.8 CUDA 12.8. We recommend updating to get the newest features and fixes. We currently build base images for PyTorch 2.9 CUDA 12.8 and 13.0.
Introduction of Opt-Out Telemetry
Axolotl now includes opt-out, anonymous telemetry to help us understand usage patterns, such as which features and model architectures are most popular. This data is invaluable for prioritizing future development, bug fixes, and improving the library for everyone. We do not collect any personal or confidential information.
To opt-out, please set the below:
`export AXOLOTL_DO_NOT_TRACK=1`

More info can be found in the docs.
Dependency & Environment Updates
- PyTorch & CUDA Support:
- Core Libraries:
  - `deepspeed` has been upgraded to its latest version for improved performance and stability. (#3261)
  - `transformers` upgraded to 4.57.1 (#3127, #3201, #3214, #3249)
  - `peft` upgraded to 0.23.1 (#3094, #3214)
  - `trl` upgraded to 0.24.0 (#3161, #3230, #3249)
  - `datasets` upgraded to 4.4.1 (#3266)
  - `flash-attn` upgraded to 2.8.3 for GPT-OSS attention sink support (#3082)
  - `liger` upgraded to 0.6.3 (#3230)
  - `numpy` upgraded to 2.3.4 (#3236)
Fixes & Bug Squashing
- Training & Checkpoints:
  - Fixed an issue where sweep runs would overwrite each other by reusing the base `output_dir`. (#3080 by @ginkyenglee)
  - Corrected an issue where `warmup_steps: 0` or `warmup_ratio: 0` did not properly disable warmup. (#3254 by @xzuyn)
  - Addressed a bug in TRL's `enable_sleep_mode`. (#3225 by @matthambrecht)
  - Added a feature to save an initial checkpoint as soon as training starts, protecting against early-stage failures. (#3233 by @ved1beta)
  - Improved handling of LoRA with biases under FSDP2 and file management during checkpoint saving. (#3090)
- Dataloader & Preprocessing:
  - Fixed a broken `Voxtral` preprocessor. (#3255 by @NanoCode012)
  - Refactored multipack sampler patch. (#3096 by @djsaunde)
  - Fixed a bug preventing the chat template jinja file from being loaded during inference. (#3112 by @NanoCode012)
  - Resolved a slow dataloader loading issue by setting `pin_memory=False` when `dataloader_num_workers > 0`. (#3219 by @qywu)
- Distributed Training (DeepSpeed/FSDP):
- Other Improvements:
  - Fixed `load_in_Xbit` deprecation warnings from `transformers`. (#3205 by @NanoCode012)
  - Enhanced logging with a new debug log and ot...
v0.12.2
What's Changed
- Add citation.tff by @salmanmohammadi in #3043
- run monkeypatch tests in separate runner by @winglian in #3047
- update training args check for new defaults by @winglian in #3051
- follow up fix for plugin registration by @winglian in #3054
- fix vllm tagging and add cloud images w/o tmux by @winglian in #3049
- chore: update pre-commit hooks by @github-actions[bot] in #3050
- remove prepare-from-posids patch by @winglian in #3052
- Temporary workaround to unblock docs build in main by @winglian in #3055
- upgrade transformers==4.55.1 and bitsandbytes==0.47.0 by @winglian in #3064
- fix: fsdp_config validation being None by @NanoCode012 in #3061
- use updated patch release transformers 4.55.2 by @winglian in #3066
- Add option to skip slow tests in PRs by @salmanmohammadi in #3060
- Various fixes for VLMs by @winglian in #3063
- [GPT-OSS] improve FSDP shard merging and documentation for GPT-OSS by @winglian in #3073
- [feat] truncation support with excess_length_strategy by @ved1beta in #3068
- data_parallel_size in VllmserveCliArgs by @ved1beta in #3074
Full Changelog: v0.12.1...v0.12.2
v0.12.1
v0.12.0
We're introducing a major upgrade to our distributed training feature set, including support for ND-Parallelism for training at scale, support for DeepSpeed's Auto Tensor Parallelism, and FP8 training. We're also excited to announce support for fine-tuning the latest gpt-oss models (and many more!) and a host of fixes and dependency updates.
New features
ND-Parallel for Advanced Parallelism Strategies
Together with Accelerate, we've introduced ND-Parallel support, allowing you to compose different parallelism techniques like Context Parallelism, Tensor Parallelism, and Fully Sharded Data Parallelism to enable fine-tuning large models at scale. Check out the official Huggingface blogpost for more details!
- Contributed by @salmanmohammadi and @winglian in #2977 and #3019.
Expanded Model Support
We've added support for a new wave of powerful models:
- GPT-OSS (#3020 by @winglian) - get up and running right away with our example configs!
- Gemma 3n (#2852 by @NanoCode012)
- Liquid Foundation Model 2 (#2905 by @winglian)
- Voxtral & Magistral Small 1.1 (#2979 by @NanoCode012)
- Devstral (#2896 by @NanoCode012)
Experimental FP8 Mixed-Precision Training with torchao
Check out experimental FP8 mixed-precision training! By leveraging the torchao library, you can train with FP8 data types and perform gather ops in FP8, leading to significant memory savings and potential speedups. Read the docs to enable it.
Improved Slurm Support
We've fixed some issues that could freeze tasks during preprocessing and included an easy-to-use Slurm example for your large cluster needs. Check out the README and example.
DeepSpeed Auto Tensor Parallelism (AutoTP)
You can now leverage DeepSpeed's Auto Tensor Parallelism to automatically shard your model's layers across multiple GPUs. This dramatically reduces the VRAM requirement for each GPU, enabling you to fine-tune much larger models than previously possible on the same hardware. Enable it in your YAML config by setting `tensor_parallel_size`.
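A minimal config sketch – only `tensor_parallel_size` is named in this release note; the value is illustrative:

```yaml
# Shard model layers across 2 GPUs via DeepSpeed AutoTP (value is an example)
tensor_parallel_size: 2
```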
TiledMLP Now Supports FSDP2 and Single GPU
TiledMLP, which reduces activation memory for long sequences, is now more versatile. It's fully compatible with our new FSDP2 implementation and can now be used on single-GPU setups, making it accessible for a wider range of training scenarios.
Dion Optimizer Support
We've added support for the Dion optimizer, a scalable and communication-efficient optimizer designed to speed up training via parallelism, giving you another tool to fine-tune larger models on your hardware.
Enabled LoRA kernels with FSDP2
FSDP2 and LoRA training is now significantly faster thanks to the integration of optimized kernels, reducing training overhead and speeding up your workflows.
Quality-of-Life & Developer Experience Improvements
- CLI Autocompletion: Speed up your workflow with new tab-completion for the Axolotl CLI. Simply run `axolotl -h` to see how to install it for your shell. (by @winglian in #2955)
- Mid-Training Profiling: You can now start the PyTorch profiler in the middle of a training run, making it easier to debug performance bottlenecks without needing to restart. (by @winglian in #2899)
- Generic Fused Kernels: Applied generic fused CCE and TiledMLP implementations from Liger to support a wider range of arbitrary models automatically. (by @winglian in #2908)
- Activation Offloading with CUDA Streams: Reduced VRAM usage by offloading activations to CPU RAM using non-blocking CUDA streams for better GPU utilization. (by @winglian in #2900, fixed for LoRA in #2928)
- New CLI Launcher: Added a `--launcher` option with support for launcher args, plus cleanup and refactoring. (by @djsaunde in #2924)
- Support for `lora_target_parameters`: Allows targeting parameter names for LoRA, useful when targeting a module name is not possible, as with MoE. (by @winglian in #3006)
- Cut Cross Entropy support for SmollM3, Granite, and GraniteMoE. (by @NanoCode012 in #2993)
- LoRA Kernels Now Support Biases: Our optimized QLoRA kernels can now be applied to bias terms, increasing the flexibility of your low-rank adaptations. (by @djsaunde in #3025)
- Custom Trainer via Module Path: As an alternative to a plugin, you can now define a `trainer_cls` in your YAML config by pointing to it as a module path (e.g., `my_project.trainers.CustomTrainer`). (by @winglian in #3024)
- Prime Intellect Integration: Added support for running jobs on the Prime Intellect platform. (by @winglian in #3021)
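The custom-trainer option can be sketched as follows – the module path reuses the example from the note above and stands in for a user-defined trainer class importable from your PYTHONPATH:

```yaml
# trainer_cls takes a dotted module path to a custom trainer class;
# my_project.trainers.CustomTrainer is a hypothetical user-defined class
trainer_cls: my_project.trainers.CustomTrainer
```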
Dependency Updates
- `peft` upgraded to 0.17.0 (#3006) and `datasets` to 4.0.0. (#2917)
- `trl` upgraded to 0.20.0. (#2892, #2987)
- `accelerate` upgraded to 1.9.0. (#2936)
- `liger` upgraded to 0.6.1. (#2893, #2987)
- `torchao` upgraded to 0.12.0. (#2968)
- `modal` upgraded to 1.0.2. (#2925)
- `transformers` upgraded to 4.55.0. (#2984, #3018)
- `bitsandbytes` upgraded to 0.46.1. (#2992)
Upcoming deprecations
Upgrading from FSDP1 → FSDP2
Axolotl now recommends PyTorch's native FSDP2 instead of FSDP1. This brings performance improvements, better stability, and additional features and compatibility with the latest fine-tuning techniques.
For migration guidance, please refer to our FSDP documentation and the official PyTorch FSDP guide.
- Contributed by @salmanmohammadi and @winglian in #2760, #2910.
Rename of Sequence Parallel config
We have renamed `sequence_parallel_degree` to `context_parallel_size` to be more consistent with the ecosystem naming, by @salmanmohammadi in #2977.
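A minimal before/after sketch of the rename (the degree value is illustrative):

```yaml
# Before (deprecated):
# sequence_parallel_degree: 4

# After:
context_parallel_size: 4  # number of ranks each sequence is split across
```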
π§ Fixes & Improvements
Dataset & Preprocessing
- Improved Dataset Processing: Significantly improved performance for dataset processing, sharding, and multiprocessing, resulting in faster startup times. (by @VarunGumma in #2918)
- Smarter Defaults:
- SimPO Fix: Fixed an issue with using customized datasets with the SimPO trainer. (by @ganler in #2894)
- Tool Usage Fix: Prevented the incorrect merging of tool arguments during data preprocessing. (by @greenhestu in #2909)
Distributed Training & Memory
- DDP & DeepSpeed Fixes: Addressed an issue causing incorrect step calculation with DDP (#2915) and prevented distributed initialization during preprocessing with DeepSpeed for faster startups (#2920).
- Checkpoint Memory: Added garbage collection before saving a checkpoint to reduce peak memory usage and prevent OOM errors. (by @winglian in #2971)
- Plugin Registration: Ensured plugins are correctly registered in Ray workers for more robust distributed setups. (by @drikster80 in #2901)
- `torch.compile`: Removed an extra, unnecessary `torch.compile` call to streamline execution. (by @djsaunde in #2904)
