
Releases: axolotl-ai-cloud/axolotl

v0.16.1

02 Apr 21:47
08fc7de


Axolotl v0.16.1 Release Notes

Gemma 4 Support


Example YAML: https://github.com/axolotl-ai-cloud/axolotl/blob/main/examples/gemma4/26b-a4b-moe-qlora.yaml

Full Changelog: v0.16.0...v0.16.1

v0.16.0

02 Apr 14:25
573726c


Axolotl v0.16.0 Release Notes

We're excited to share this packed release, which includes ~80 new commits since v0.15.0 (March 6, 2026).


Highlights

Async GRPO: Asynchronous Reinforcement Learning Training (#3486)

Full support for asynchronous Group Relative Policy Optimization with vLLM integration. Includes async data producer with replay buffer, streaming partial-batch training, native LoRA weight sync to vLLM, and FP8 compatibility. Supports multi-GPU via FSDP1/FSDP2 and DeepSpeed ZeRO-3.

Achieves up to 58% faster step times (1.59s/step vs 3.79s baseline on Qwen2-0.5B).

Optimization                                  Step Time  Improvement
Baseline                                      3.79s      -
+ Batched weight sync                         2.52s      34% faster
+ Liger kernel fusion                         2.01s      47% faster
+ Streaming partial batch                     1.79s      53% faster
+ Element chunking + re-roll fix (500 steps)  1.59s      58% faster
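The producer/consumer pattern behind async rollout generation can be sketched with a replay buffer. This is a simplified illustration (class and field names are hypothetical, not Axolotl's API), with a thread standing in for the vLLM generation worker:

```python
import random
import threading

class ReplayBuffer:
    """Fixed-capacity rollout buffer; the oldest rollouts are evicted
    as the async producer streams new ones in."""
    def __init__(self, capacity, seed=0):
        self.capacity = capacity
        self.items = []
        self.rng = random.Random(seed)

    def add(self, rollout):
        self.items.append(rollout)
        if len(self.items) > self.capacity:
            self.items.pop(0)  # drop the stalest rollout

    def sample(self, n):
        # The trainer can start on a partial batch before generation
        # finishes, rather than blocking on a full one.
        return self.rng.sample(self.items, min(n, len(self.items)))

def produce(buffer, n_rollouts):
    # Stand-in for the async vLLM generation loop.
    for i in range(n_rollouts):
        buffer.add({"prompt_id": i, "reward": float(i % 3)})

buf = ReplayBuffer(capacity=8)
worker = threading.Thread(target=produce, args=(buf, 20))
worker.start()
worker.join()

batch = buf.sample(4)  # consumed by the training step
```

In the actual implementation, LoRA weights are synced back to vLLM between steps so that rollouts stay near on-policy.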

ScatterMoE + LoRA Fused Triton Kernels (#3513)

Custom fused Triton kernels for training MoE models with LoRA adapters. By fusing the base expert matmul and LoRA computation into a single kernel pass, these kernels achieve up to 15x faster forward passes and 40x less activation memory vs the eager baseline.
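Conceptually, the fused kernel computes the base expert matmul plus the low-rank LoRA update in one pass; a pure-Python reference of that math on toy shapes (all names and shapes illustrative):

```python
def matmul(a, b):
    """Naive matmul for small nested-list matrices."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def madd(a, b):
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def scale(a, s):
    return [[s * x for x in row] for row in a]

def expert_lora_forward(x, W, A, B, s):
    # What the fused kernel computes in a single pass per routed expert:
    #   y = x @ W  +  s * (x @ A) @ B
    return madd(matmul(x, W), scale(matmul(matmul(x, A), B), s))

x = [[1.0, 2.0]]
W = [[1.0, 0.0], [0.0, 1.0]]  # base expert weight (2x2)
A = [[1.0], [0.0]]            # LoRA down-projection (2x1, rank 1)
B = [[0.5, 0.5]]              # LoRA up-projection (1x2)
y = expert_lora_forward(x, W, A, B, s=2.0)
```

The fusion avoids writing the intermediate `x @ W` and `(x @ A) @ B` products to global memory between kernel launches.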

[Figure: ScatterMoE + LoRA speedup benchmarks]
Implementation detail

Key innovations:

  • Atomic-free split backward kernels (11x faster dA/dB gradients)
  • Autotunable tile sizes
  • Fused gather backward, eliminating the intermediate grouped-X buffer when the workload is small
  • Selective NF4 dequantization: only routed experts are dequantized, giving up to ~97% weight-memory reduction per layer at short sequences where few experts are activated (the savings diminish at longer contexts, as most experts see at least one token)
  • H200/B200 register-pressure tuning
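The selective-dequantization idea can be shown with a toy codebook: only experts the router actually selected get expanded from their packed form. Here a 4-entry codebook stands in for NF4's 16 entries, and all names are illustrative:

```python
CODEBOOK = [-1.0, -0.5, 0.0, 0.5]  # stand-in for the 16-entry NF4 codebook

def dequantize(packed):
    """Expand 2-bit codes (stored here as small ints) via codebook lookup."""
    return [CODEBOOK[code] for code in packed]

def selective_dequant(packed_experts, routed_ids):
    # Only experts that received at least one token are dequantized;
    # everything else stays in its packed form.
    return {e: dequantize(packed_experts[e]) for e in set(routed_ids)}

packed = {0: [0, 1], 1: [2, 3], 2: [3, 3], 3: [1, 1]}
routed = [1, 1, 2]  # the router sent tokens to experts 1 and 2 only
weights = selective_dequant(packed, routed)
```

At short sequences few experts appear in `routed`, so most packed weights are never expanded, which is where the memory savings come from.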

SonicMoE Fused LoRA (#3519)

LoRA support for SonicMoE (CUTLASS-based MoE kernels for Hopper/Blackwell GPUs).

[Figure: Qwen3.5 MoE SonicMoE vs grouped_mm comparison]

This results in up to a 1.45x speedup and a 30% reduction over the grouped_mm baseline on a 1xH100 SXM GPU (Qwen3.5-35B-A3B 8-bit LoRA tuning).

GRPO Flattening & Packing (#3552)

Enables batch flattening and sample packing for GRPO training, improving token efficiency and training throughput by ~10%.
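Sample packing itself is simple to illustrate: variable-length sequences are grouped into fixed-length rows so fewer pad tokens are wasted. A minimal first-fit sketch (Axolotl's packer is more sophisticated than this):

```python
def pack(lengths, max_len):
    """First-fit packing of sequence lengths into rows of size max_len."""
    bins = []   # each bin is a list of sequence indices
    space = []  # remaining capacity per bin
    for i, n in enumerate(lengths):
        for b, free in enumerate(space):
            if n <= free:
                bins[b].append(i)
                space[b] -= n
                break
        else:
            # no existing bin fits: open a new one
            bins.append([i])
            space.append(max_len - n)
    return bins

# Five sequences packed into 2048-token rows instead of five padded rows.
bins = pack([512, 1024, 256, 2048, 256], max_len=2048)
```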

Flash Attention 4 support (#3481)

Support for Flash Attention 4 on Hopper and Blackwell GPUs with automatic fallback to FA2/3 depending on hardware.

[Figure: FA2 vs FA4 benchmarks]

We found improvements mainly at longer sequence lengths and on Blackwell GPUs.

NeMo Gym Integration (#3516)

Full integration with NVIDIA NeMo Gym for single-turn and multi-turn RL training. You can now reuse existing NeMo Gym environments in Axolotl: https://github.com/NVIDIA-NeMo/Gym

Implementation detail

Single-turn mode calls /verify endpoints for deterministic rewards (math, logic). Multi-turn mode delegates generation to a NeMo Gym agent server that orchestrates the full agentic loop: the model generates tool calls, the agent executes them against environment servers, feeds results back, and repeats until completion. An env_mask separates model-generated tokens (trained on) from tool results (masked out), and IS correction handles the logprob mismatch between agent-side generation and the training policy. Combined with async_prefetch, generation and training overlap for ~3x throughput on multi-turn tasks.
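A toy sketch of the masking and importance-sampling correction described above (the actual GRPO objective has more terms; names here are illustrative):

```python
import math

def masked_is_loss(advantage, logp_train, logp_gen, env_mask):
    """Mean per-token policy-gradient term: tool-result tokens (mask 0)
    are excluded, and an importance ratio corrects for the agent-side
    generation policy differing from the training policy."""
    total, count = 0.0, 0
    for lt, lg, m in zip(logp_train, logp_gen, env_mask):
        if m == 0:
            continue  # token came from a tool result, not the model
        ratio = math.exp(lt - lg)  # IS correction
        total += -ratio * advantage * lt
        count += 1
    return total / count

loss = masked_is_loss(
    advantage=1.0,
    logp_train=[-0.5, -1.0, -2.0],
    logp_gen=[-0.5, -1.0, -1.0],
    env_mask=[1, 0, 1],  # middle token is a tool result, masked out
)
```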

Energy-Based Fine-Tuning (EBFT) (#3527)

Novel RL training method that matches generated completions to ground-truth feature embeddings without requiring external reward models. Supports structured (vLLM) and strided (single GPU) modes.


Performance & Kernel Optimizations

  • Custom Triton Kernels for RL (#3510): Fused Triton kernels for entropy_from_logits (single-pass online algorithm) and selective_log_softmax (fused forward+backward). Both avoid materializing the full [B, L, V] softmax intermediate. Benchmarked on Qwen vocab (V=151,936), RTX 5090:
    • entropy_from_logits: 5.2x faster, near-zero memory overhead (0.2 MB vs 120 MB); on non-contiguous tensors: 7.3x faster, saving up to 10 GB by avoiding the .contiguous() copy
    • selective_log_softmax: 2.9x faster forward, 3.2x faster fwd+bwd, memory reduced from ~5 GB to <0.2 MB (at B=8, L=2048); the original OOMs at B=16 where the Triton version succeeds
  • LoRA Kernel Enhancements (#3528): Extended LoRA kernels to support bias, dropout, DoRA, and embeddings with FSDP2 compatibility, making them production-ready for all LoRA configurations.
  • Liger Kernels for Qwen 3.5 (#3531): Fused RMSNorm + gated-activation kernels for Qwen 3.5 and Qwen 3.5 MoE.
  • Auto TF32 (#3473): Automatically enables TF32 matmul/cuDNN when the hardware supports it.
  • Reduced Autotune Search Space (#3525): Faster kernel autotuning startup.
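The single-pass entropy trick uses the identity H = logsumexp(z) - sum(softmax(z) * z), maintained with online statistics (running max, sum of exponentials, sum of z * exp) so the softmax is never materialized. A pure-Python sketch of the algorithm (the real kernel does this per row, in Triton):

```python
import math

def entropy_online(logits, chunk=4):
    """Single-pass entropy via online max / sum-exp / sum-(z*exp) stats."""
    m = float("-inf")  # running max
    s = 0.0            # running sum of exp(z - m)
    w = 0.0            # running sum of z * exp(z - m)
    for i in range(0, len(logits), chunk):
        block = logits[i:i + chunk]
        new_m = max(m, max(block))
        r = math.exp(m - new_m)  # rescale factor for old stats (0.0 at start)
        s = s * r + sum(math.exp(z - new_m) for z in block)
        w = w * r + sum(z * math.exp(z - new_m) for z in block)
        m = new_m
    # H = logsumexp(z) - sum(softmax(z) * z) = log(s) + m - w / s
    return math.log(s) + m - w / s

def entropy_reference(logits):
    """Two-pass version that materializes the full distribution."""
    lse = math.log(sum(math.exp(z) for z in logits))
    return lse - sum(z * math.exp(z - lse) for z in logits)

logits = [0.1, 2.0, -1.5, 0.7, 3.2, 0.0]
```

The online version only ever holds one chunk of logits plus three scalars, which is why the kernel's memory overhead is near zero.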

New Features

  • CPU Layer Offloading (#3512): Offload frozen decoder layers to CPU during LoRA training, streaming them to GPU only during forward/backward. Dramatically reduces VRAM usage for large models.
  • Multiple Custom Optimizers (#3457): End-to-end support for Flash AdamW, Optimi AdamW, ADOPT AdamW, Muon, Dion, Schedule-Free AdamW, and CAME PyTorch.
  • Memory-Efficient LoRA Merging (#3095): Iterates through weight bins without loading entire layers into memory, enabling merging of massive models on resource-constrained systems.
  • Custom MoE Routing (#3526): Support for Ernie 4.5 MoE and Hunyuan V1 MoE custom routing strategies beyond standard softmax-topk.
  • Synthetic Benchmarking Datasets (#3518): Built-in synthetic datasets for benchmarking and testing training pipelines.
  • MX Quantization-Aware Training (QAT) (#3553): Support for MX-format QAT with torchao integration, including state-dict hooks for saving dequantized safetensors from transformers.
  • dtype Propagation for torchao (#3569): Proper dtype handling through torchao quantization and optimizer paths.
  • Lazy Trainer Loading (#3568): Lazy-loads trainer classes to avoid unnecessary imports, improving startup time.
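The memory-efficient merge can be sketched as a generator that loads one weight bin at a time, applies W <- W + scaling * B @ A for tensors that have an adapter, and yields the result for writing before the next bin is touched (all names and shapes illustrative, not Axolotl's API):

```python
def merge_lora_streaming(load_shard, shard_names, adapters, scaling):
    """Merge LoRA into base weights one shard at a time, so only a
    single shard's tensors are resident in memory at once."""
    for name in shard_names:
        W = load_shard(name)  # e.g. one safetensors bin
        if name in adapters:
            A, B = adapters[name]
            # W <- W + scaling * B @ A, applied in place per tensor
            for i in range(len(W)):
                for j in range(len(W[0])):
                    W[i][j] += scaling * sum(
                        B[i][k] * A[k][j] for k in range(len(A))
                    )
        yield name, W  # write out, then drop before loading the next shard

base = {"layer0": [[1.0, 0.0], [0.0, 1.0]]}
adapters = {"layer0": ([[1.0, 1.0]], [[0.5], [0.0]])}  # A: 1x2, B: 2x1
merged = dict(
    merge_lora_streaming(lambda n: base[n], ["layer0"], adapters, scaling=2.0)
)
```

Peak memory is bounded by the largest single shard rather than the full model.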

Documentation

  • Greatly improved documentation for GRPO and agent use (#3564): New dedicated guides for GRPO training, vLLM serving, training stability and monitoring, debugging, and choosing training methods. Adds agent-specific documentation and deduplicates content across pages.

Model & Framework Support

Deprecation

  • Deprecated torch 2.8.0 support (#3550)
  • Removed dead SDPA patches (#3488)

New Model Support

Updates

  • Transformers 5.4.0 (#3562)
  • lm-eval updated for Transformers v5 (#3571)
  • Torchao 0.17.0 (#3569)
  • Gemma 3 config fixes (#3500)
  • Docker builds for py312-cu128-torch2.9.1 (#3489)

Bug Fixes

  • Fixed high eval loss with sample packing (#3478)
  • Fixed double sequence partition with Context Parallelism (#3498)
  • Fixed DPO compatibility with trl v0.29 (#3560)
  • Fixed Ray train crashing after succeeding (#3542)
  • Fixed race condition in patching checks (#3543)
  • Fixed token state JSON and Mistral tokenizer issue (#3522)
  • Fixed async model loading with quantized BNB models (#3477)
  • Fixed MoE quant patch for merge mismatches (#3483)
  • Fixed connection error handling for user whoami check (#3529)
  • Patches only applied when CUDA is available (#3561)
  • Fixed shell injection in modal CLI (#3487)
  • Fixed DPO tool role KeyError (#3217), dataset hash output_dir (#3303), config validators (#3538)
  • Added precompute_ref_log_probs to config schema (#3555)
  • Added troubleshooting note for GLM4 GGUF MTP mismatch (#3559)
  • Allow bf16 flag with deprecation warning (#3563)

Infrastructure

  • Scored rollouts dispatched to plugins with extended external plugin paths (#3549)
  • CI improvements: codecov informational only, better error handling (#3534, #3517)
  • Make CI fail GitHub Actions on test failures (#3517)
  • Consolidate routing behavior in ScatterMoE kernels (#3475)
  • roundup_power2_divisions no longer needed with newer PyTorch versions (#3540)
  • Logging cleanup (#3482)
  • Update pre-commit hooks (#3567)
  • Update README (#3503)
  • Docker: ubuntu user improvements for uv images (#3491, #3492, #3494, #3495, reverted #3496)

What We Tried (So You Don't Have To)

A few optimizations we prototyped that yielded little to no improvement; we're sharing them so others can skip these rabbit holes:

  • Fused NF4 dequant inside the ScatterMoE Triton GEMM kernel: reads NF4-packed expert weights directly in the inner loop, avoiding the bf16 intermediate buffer entirely. Correctness-verified (zero numerical error), but 3x slower than the separate dequant + bf16 path. BnB's dequant kernel reads data linearly at full memory bandwidth (~1 TB/s), while the fused kernel's scattered byte addressing and per-element NF4 codebook lookups destroy memory coalescing.

  • Batched NF4 dequantization: concatenating gate_up + down_proj packed NF4 data into a single dequant call. Produced identical performance (1.00x); the kernel is purely bandwidth-bound, so one large call equals two smaller calls at the same total bytes.

  • Dequant buffer pool for NF4 MoE: pre-allocating reusable bf16 buffers to avoid 512 alloc/free cycles per step during NF4 expert dequantization. Reduced memory fragmentation from 20....


v0.15.0

06 Mar 17:55
8f19169


Axolotl v0.15.0 Release Notes

This release brings new model support, significant MoE improvements, infrastructure updates with Torch 2.10.0 and uv builds, and a collection of quality-of-life fixes across the board.

🚀 Major Changes

Torch 2.10.0 & uv Builds

We have upgraded to Torch 2.10.0 and introduced uv-based Docker builds for faster, more reproducible image creation. Python 3.14 is now used in unit tests.

ScatterMoE LoRA & SonicMoE

Added LoRA support for ScatterMoE and introduced SonicMoE as a new MoE kernel option for faster and more memory-efficient MoE training compared to grouped_mm in transformers.

MoE Expert Quantization

Added support for quantizing MoE expert weights in Transformers v5, which dramatically reduces peak reserved memory. For example, GLM-4.7-Flash QLoRA drops from ~127 GiB to ~23 GiB reserved memory. Activate it via quantize_moe_experts: true. See the expert quantization docs for details.
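For reference, turning this on alongside a QLoRA setup is a one-line addition; a minimal config fragment (the surrounding keys are shown for context only):

```yaml
load_in_4bit: true
adapter: qlora
quantize_moe_experts: true  # quantize MoE expert weights (Transformers v5)
```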

🎉 New Features

  • New Model Support:
  • SageAttention: Added SageAttention integration for efficient attention computation. (#2823 by @NanoCode012)
  • MXFP4 Quantization: Added support for MXFP4 quantization. (#3375 by @ved1beta)
  • Hub Revision Support: Added hub_revision for specifying a branch when pushing checkpoints. (#3387 by @madScientist10)
  • Dot-Notation CLI Args: Support for dot-notation CLI arguments for nested config options. (#3419 by @ManasVardhan)
  • Sample Generation for SFT: New sample generation support to SFT training. (#3240 by @ved1beta)
  • train_per_sec_per_gpu Metric: Added a new training throughput metric. (#3364 by @ved1beta)

⚠️ Breaking Changes

  • dataset_processes → dataset_num_proc: The config field dataset_processes has been renamed to dataset_num_proc. (#3352 by @tgoab)

πŸ› Bug Fixes

  • Context Parallel:
  • GRPO:
  • Telemetry:
  • LoRA Kernels: Improved failure messaging and handling of trust_remote_code. (#3378 by @NanoCode012)
  • Generation Mode: Fixed add_special_tokens handling and enabled test mode for generation. (#3396 by @NanoCode012)
  • CCE Generic Patch: Fixed generic patch for Cut Cross Entropy. (#3405 by @winglian)
  • MistralProcessor: Updated for Transformers v5 compatibility. (#3423 by @NanoCode012)
  • Adapter Config Validation: Set allowed values for adapter in Pydantic config. (#3415 by @NanoCode012)
  • Tokenizer/Processor Revision: Pass revision parameter to tokenizer and processor loaders. (#3388 by @madScientist10)
  • Excess Length Truncation: Fixed excess_length_strategy truncation method. (#3401 by @rlronan)
  • Scheduler Fallback: Use self.optimizer if optimizer not passed to SchedulerMixin.create_scheduler(). (#3435 by @kallewoof)
  • Dataset Deduplication: Fixed saving de-duplicated dataset during pre-processing. (#3427 by @ManasVardhan)
  • FSDP2 Sharding: Fixed FSDP2 sharding and validated AO version for LR groups. (#3403 by @bekk02)
  • Qwen3-Next packing: Fixed causal-conv1d call for packing via FLA instead of native pytorch. (#3437 by @NanoCode012)
  • Quantize MoE validation: Fixed validation to prevent lora_target_linear from being set to true when quantizing MoE experts. (#3461 by @NanoCode012)
  • TP batch_size calculation: Fixed the batch_size calculation under tensor parallelism. (#3462 by @Yatimai)

📦 Dependency & Infrastructure Updates

  • transformers to 5.3.0, accelerate to 1.13.0, trl to 0.29.0, kernels to 0.12.2, torchao to 0.16.0. (#3407 #3459 by @winglian)
  • Trackio 0.16.1: Updated trackio to 0.16.1. (#3425 by @winglian)
  • Cut-Cross-Entropy Bump: Bumped cut-cross-entropy to 58d6572. (#3424 by @NanoCode012)
  • BunnyCDN for CI: CI assets now served via BunnyCDN. (#3422 by @winglian)
  • Telemetry Params: Included number of params and rounded estimates in PostHog telemetry. (#3455 by @winglian)

📖 Documentation

Other Changes

New Contributors

Full Changelog: v0.14.0...v0.15.0

v0.14.0

30 Jan 19:10
be00978


This is a major release marking our migration to Transformers v5. Along with this significant core dependency upgrade, we are introducing major performance optimizations for MoE models and new fine-tuning methods.

🚀 Major Changes

Transformers v5 Upgrade

We have upgraded the underlying transformers dependency from v4 to v5. This has been a long-term effort to ensure Axolotl remains compatible with the latest ecosystem advancements, stability improvements, and future model architectures.

Faster MoE Training

We've added support for selecting MoE kernels via transformers: batched_mm and grouped_mm. Alongside this, we have added our custom integration for scattermoe. This significantly speeds up training and reduces VRAM usage for Mixture of Experts (MoE) models.

🎉 New Features

  • EAFT Support: Added support for Efficient Adaptation Fine-Tuning (EAFT). (#3366 by @salmanmohammadi)
  • New CCE Support: Added Cut Cross Entropy support for GLM 4.7 Flash, GLM Image, GLM 4.6v, and Exaone 4. (#3373 by @NanoCode012)

Full Changelog: v0.13.2...v0.14.0

v0.13.2

22 Jan 15:59


Axolotl v0.13.2 Release Notes

This is a patch release introducing GDPO support and updating core infrastructure, including newer CUDA defaults and Python versions.

🎉 New Features

📦 Dependency & Infrastructure Updates

  • CUDA 12.9.1: Base images now default to CUDA 12.9.1. (#3367 by @winglian)
  • Python 3.12: Added Python 3.12 support to base images. (#3367 by @winglian)
  • vLLM Upgrade: Upgraded vLLM dependency to v0.14.0. (#3345 by @winglian)

Other fixes

Full Changelog: v0.13.1...v0.13.2

v0.13.1

20 Jan 13:59
6e42def


This release brings support for PyTorch 2.9.1, expands our ecosystem with new experiment trackers (SwanLab and Trackio), and introduces support for a wide range of new models including Olmo3, Ministral 3, InternVL 3.5, and Kimi. We've also included significant improvements to quantization workflows and metrics logging.

🎉 New Features

Expanded Model Support

We've added support for more models!

New Experiment Tracking Integrations

  • SwanLab: You can now use SwanLab for experiment tracking. (#3334 by @PraMamba)
  • Trackio: Added Trackio validation integration. (#3253 by @abidlabs)

Training & PEFT Improvements

  • Liger Kernel for DPO: Added Liger kernel support for DPO training. (#3302 by @ved1beta)
  • Distributed Muon: Added support for distributed Muon optimizer. (#3264 by @salmanmohammadi)
  • Weight Tying Safety: Added peft_ensure_weight_tying to ensure correct parameter handling in PEFT. (#3278 by @NanoCode012)
  • Adapter Dtypes: Added peft_autocast_adapter_dtype config option for fine-grained control. (#3311 by @xzuyn)
  • Cheap PPL Metric: A new metric calculation for Perplexity that is less computationally expensive. (#3317 by @xzuyn)
  • Scaled Softmax: Scales the softmax calculation by s * log(n) + b. (#3338 by @ved1beta)
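A sketch of what that scaling means, assuming (our reading, not confirmed by the PR text) that the logits are multiplied by s * log(n) + b before the softmax, where n is the sequence length:

```python
import math

def scaled_softmax(logits, s, b):
    """Softmax with logits multiplied by s * log(n) + b, where n is the
    number of entries: a length-adaptive temperature."""
    n = len(logits)
    factor = s * math.log(n) + b
    scaled = [z * factor for z in logits]
    m = max(scaled)  # subtract max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

probs = scaled_softmax([1.0, 2.0, 3.0], s=1.0, b=0.0)
```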

⚠️ Deprecations & Warnings

PyTorch 2.7.1 Deprecation

Support for PyTorch 2.7.1 has been deprecated. We recommend upgrading to newer supported versions.

🔧 Fixes & Improvements

Quantization & CLI

  • Save Processor: The quantizer CLI now properly saves the processor alongside the model. (#3290 by @salmanmohammadi)
  • FP8 Checks: Fixed checks for FP8 capability and load_in_8bit configurations. (#3324, #3327)
  • NVFP4 Configs: Added QAT NVFP4 configs for reference. (#3280 by @salmanmohammadi)

Logging & Metrics

  • Metric Rounding: You can now set the environment variable AXOLOTL_METRIC_PRECISION (default: 5) to control the rounding of logged metrics. (#3325 by @ved1beta)
  • Token Logging: Fixed total/trainable tokens logging logic. (#3344 #3293 by @ved1beta)
  • Evaluation Loss: Fixed evaluation loss calculation in the KD trainer. (#3271 by @roycho96)

Data Processing

  • Long Sequence Handling: Feature added to raise an error on long sequence drops to prevent silent data loss. (#3321 by @kallewoof)
  • Qwen3 Tokenization: Fixed an off-by-a-few-tokens issue in Qwen3 Jinja tokenization. (#3295 by @NanoCode012)

📦 Dependency & Infrastructure Updates

Others

  • fix bin size by @ved1beta in #3307
  • pre-commit hooks update. (#3287 by @github-actions[bot])
  • PYTORCH_CUDA_ALLOC_CONF deprecation fix to ensure compatibility with future PyTorch versions. (#3313 by @NanoCode012)

New Contributors

Full Changelog: v0.13.0...v0.13.1

v0.13.0

02 Dec 15:00
f5f21fb


This release is packed with major new features, including Streaming SFT for massive datasets, a new Text Diffusion training plugin, and a significant upgrade to our Quantization-Aware Training (QAT) capabilities with NVFP4 support. We're also thrilled to announce support for a huge variety of new models like Gemma3, the Qwen3 family, Hunyuan, Granite 4, and many more.

Alongside these headline features, this release brings support for PyTorch 2.9 and CUDA 13, major dependency upgrades, new developer tooling with Ruff, and a host of important bug fixes to improve training stability and user experience.

🎉 New Features

Streaming Supervised Finetuning (SFT)

You can now finetune on datasets of any size without needing to pre-process and load the entire dataset into memory. Streaming SFT processes data on the fly, dramatically reducing memory usage and startup times, making it ideal for large-scale training.
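The on-the-fly idea in miniature: records are tokenized lazily as the trainer pulls them, so the dataset is never fully materialized (the whitespace "tokenizer" here is a stand-in, and the names are illustrative rather than Axolotl's API):

```python
def stream_tokenize(records, tokenize, max_len):
    """Process a dataset lazily: each record is tokenized and truncated
    as it is pulled, so the full dataset never sits in memory."""
    for rec in records:
        ids = tokenize(rec["text"])[:max_len]
        yield {"input_ids": ids, "labels": list(ids)}

# Toy whitespace "tokenizer" and an iterable source (in practice a file
# or a streaming Hugging Face dataset).
toy_tokenize = lambda text: [len(w) for w in text.split()]
source = iter([{"text": "hello streaming world"}, {"text": "tiny example"}])

first = next(stream_tokenize(source, toy_tokenize, max_len=2))
```

Because the pipeline is a generator, startup cost is a single record's worth of work instead of a full preprocessing pass.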

Text Diffusion Training Plugin

Explore a new paradigm of training with our Text Diffusion plugin! This allows you to train models on diffusion-based objectives, opening up new possibilities for text generation and manipulation tasks.

Upgraded Quantization-Aware Training (QAT) with NVFP4 Support

We've migrated to a new QAT API and enhanced the axolotl quantize command. This release introduces support for NVFP4, a new 4-bit floating-point format, further pushing the boundaries of model quantization and efficiency.

Expanded Model Support

We've added support for a wide array of new and powerful models:

Quality-of-Life & Training Improvements

  • FSDPv2 Swap Memory: Reduce VRAM usage for FSDP training by swapping memory to the CPU, now with compatibility fixes for QLoRA. (#3167 by @gholmes829)
  • Fixed IPO Dataset loading: IPO training will now properly load DPO specific configs. (#3128 by @seungduk-yanolja)
  • Modernized Tooling with Ruff: We've replaced black, isort, flake8, and pylint with ruff for a faster and unified linting and formatting experience. (#3092 by @djsaunde)
  • Tokens Per Second Logging: The trainer now logs tokens_per_second by default for better performance monitoring. (#3072, #3134 by @salmanmohammadi)
  • Single GPU DeepSpeed Zero3: Environment variables are now automatically configured for single-GPU DeepSpeed ZeRO-3 runs, simplifying setup. (#3118 by @winglian)
  • JSON String Tool Arguments: Function-calling/tool-use training now supports JSON strings in tool arguments for greater flexibility. (#3136 by @gamersover)
  • Refactored FSDP Config: FSDP parameters are now neatly organized under a single fsdp_config dictionary for cleaner configurations. (#3170 by @salmanmohammadi)

🚨 Important changes

PyTorch 2.6.0 Deprecation

PyTorch 2.8 and 2.9 are now supported in Axolotl. Following our previous deprecation cycle, we are dropping support for PyTorch 2.6. The current Docker image uses PyTorch 2.8 with CUDA 12.8; we recommend updating to get the newest features and fixes. We also build base images for PyTorch 2.9 with CUDA 12.8 and 13.0.

Introduction of Opt-Out Telemetry

Axolotl now includes opt-out, anonymous telemetry to help us understand usage patterns, such as which features and model architectures are most popular. This data is invaluable for prioritizing future development, bug fixes, and improving the library for everyone. We do not collect any personal or confidential information.

To opt out, set the following:

export AXOLOTL_DO_NOT_TRACK=1

More info can be found in the docs.

📦 Dependency & Environment Updates

  • PyTorch & CUDA Support:
    • Container images are now available for PyTorch 2.9.0 (#3221) and CUDA 13.0.0 (#3229).
  • Core Libraries:
    • deepspeed has been upgraded to its latest version for improved performance and stability. (#3261)
    • transformers upgraded to 4.57.1 (#3127, #3201, #3214, #3249)
    • peft upgraded to 0.23.1 (#3094, #3214)
    • trl upgraded to 0.24.0 (#3161, #3230, #3249)
    • datasets upgraded to 4.4.1 (#3266).
    • flash-attn upgraded to 2.8.3 for GPT-OSS attention sink support (#3082).
    • liger upgraded to 0.6.3 (#3230).
    • numpy upgraded to 2.3.4 (#3236).

🔧 Fixes & Bug Squashing

  • Training & Checkpoints:

    • Fixed an issue where sweep runs would overwrite each other by reusing the base output_dir. (#3080 by @ginkyenglee)
    • Corrected an issue where warmup_steps: 0 or warmup_ratio: 0 did not properly disable warmup. (#3254 by @xzuyn)
    • Addressed a bug in TRL's enable_sleep_mode. (#3225 by @matthambrecht)
    • Added a feature to save an initial checkpoint as soon as training starts, protecting against early-stage failures. (#3233 by @ved1beta)
    • Improved handling of LoRA with biases under FSDP2 and file management during checkpoint saving. (#3090)
  • Dataloader & Preprocessing:

    • Fixed a broken Voxtral preprocessor. (#3255 by @NanoCode012)
    • Refactored multipack sampler patch. (#3096 by @djsaunde)
    • Fixed a bug preventing the chat template jinja file from being loaded during inference. (#3112 by @NanoCode012)
    • Resolved a dataloader slow loading issue by setting pin_memory=False when dataloader_num_workers > 0. (#3219 by @qywu)
  • Distributed Training (DeepSpeed/FSDP):

    • Patched ds_grads_remaining in DeepSpeed for better stability. (#3102 by @ved1beta)
    • Fixed a DeepSpeed AttributeError when using Context Parallelism. (#3220 by @NanoCode012)
    • Improved logic for disabling P2P on Runpod and other platforms by using torch.cuda.can_device_access_peer. (#3132, #3209)
  • Other Improvements:

    • Fixed load_in_Xbit deprecation warnings from transformers. (#3205 by @NanoCode012)
    • Enhanced logging with a new debug log and ot...

v0.12.2

18 Aug 14:41


What's Changed

Full Changelog: v0.12.1...v0.12.2

v0.12.1

11 Aug 13:38


v0.12.1 is a patch release to fix a regression when using Ray Trainer from the CLI

What's Changed

  • use exec instead of subprocess to make ctrl+c nicer for cli by @winglian in #3044
  • fix ray train and add fsdp2 smoke test for ray trainer by @winglian in #3053

Full Changelog: v0.12.0...v0.12.1

v0.12.0

08 Aug 12:25
2c8497e


We're introducing a major upgrade to our distributed training feature set, including support for ND-Parallelism for training at scale, support for DeepSpeed's Auto Tensor Parallelism, and FP8 training. We're also excited to announce support for fine-tuning the latest gpt-oss models (and many more!) and a host of fixes and dependency updates.

🎉 New features

ND-Parallel for Advanced Parallelism Strategies

Together with Accelerate, we've introduced ND-Parallel support, allowing you to compose different parallelism techniques like Context Parallelism, Tensor Parallelism, and Fully Sharded Data Parallelism to enable fine-tuning large models at scale. Check out the official Huggingface blogpost for more details!

Expanded Model Support

We've added support for a new wave of powerful models:

Experimental FP8 Mixed-Precision Training with torchao

Check out experimental FP8 mixed-precision training! By leveraging the torchao library, you can train with FP8 data types and perform gather ops in FP8, leading to significant memory savings and potential speedups. Read the docs to enable it.

Improved Slurm Support

We've fixed some issues that could freeze tasks during preprocessing and included an easy-to-use Slurm example for your large-cluster needs. Check out the README and example.

DeepSpeed Auto Tensor Parallelism (AutoTP)

You can now leverage DeepSpeed's Auto Tensor Parallelism to automatically shard your model's layers across multiple GPUs. This dramatically reduces the VRAM requirement for each GPU, enabling you to fine-tune much larger models than previously possible on the same hardware. Enable it in your YAML config by setting tensor_parallel_size.
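A minimal config fragment (the value is illustrative; pick the number of GPUs to shard across):

```yaml
tensor_parallel_size: 4  # shard layers across 4 GPUs via DeepSpeed AutoTP
```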

TiledMLP Now Supports FSDP2 and Single GPU

TiledMLP, which reduces activation memory for long sequences, is now more versatile. It's fully compatible with our new FSDP2 implementation and can now be used on single-GPU setups, making it accessible for a wider range of training scenarios.

Dion Optimizer Support

We've added support for the Dion optimizer, a scalable, communication-efficient optimizer designed to speed up training under parallelism, giving you another tool to fine-tune larger models on your hardware.

Enabled LoRA kernels with FSDP2

FSDP2 and LoRA training is now significantly faster thanks to the integration of optimized kernels, reducing training overhead and speeding up your workflows.

Quality-of-Life & Developer Experience Improvements

  • CLI Autocompletion: Speed up your workflow with new tab-completion for the Axolotl CLI. Simply run axolotl -h to see how to install it for your shell. (by @winglian in #2955)
  • Mid-Training Profiling: You can now start the PyTorch profiler in the middle of a training run, making it easier to debug performance bottlenecks without needing to restart. (by @winglian in #2899)
  • Generic Fused Kernels: Applied generic fused CCE and TiledMLP implementations from Liger to support a wider range of arbitrary models automatically. (by @winglian in #2908)
  • Activation Offloading with CUDA Streams: Reduced VRAM usage by offloading activations to CPU RAM using non-blocking CUDA streams for better GPU utilization. (by @winglian in #2900, fixed for LoRA in #2928)
  • New CLI Launcher: add --launcher option, support launcher args, cleanup, refactor (by @djsaunde in #2924)
  • Support for lora_target_parameters: Allows targeting parameter names for LoRA, useful when targeting a module name is not possible, as with MoE. (by @winglian in #3006)
  • Cut Cross Entropy support for SmollM3, Granite, and GraniteMoE: (by @NanoCode012 in #2993)
  • LoRA Kernels Now Support Biases: Our optimized QLoRA kernels can now be applied to bias terms, increasing the flexibility of your low-rank adaptations. (by @djsaunde in #3025)
  • Custom Trainer via Module Path: As an alternative to plugin, you can now define a trainer_cls in your YAML config by pointing to it as a module path (e.g., my_project.trainers.CustomTrainer). (by @winglian in #3024)
  • Prime Intellect Integration: Added support for running jobs on the Prime Intellect platform. (by @winglian in #3021)
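Using the example module path mentioned above, the custom-trainer config entry looks like:

```yaml
trainer_cls: my_project.trainers.CustomTrainer
```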

📦 Dependency Updates

  • peft upgraded to 0.17.0 (#3006) and datasets to 4.0.0. (#2917)
  • trl upgraded to 0.20.0. (#2892 , #2987)
  • accelerate upgraded to 1.9.0. (#2936)
  • liger upgraded to 0.6.1. (#2893 , #2987)
  • torchao upgraded to 0.12.0. (#2968)
  • modal upgraded to 1.0.2. (#2925)
  • transformers upgraded to 4.55.0. (#2984 , #3018)
  • bitsandbytes upgraded to 0.46.1. (#2992)

🚨 Upcoming deprecations

Upgrading from FSDP1 → FSDP2

Axolotl now recommends PyTorch's native FSDP2 instead of FSDP1. This brings performance improvements, better stability, and additional features and compatibility with the latest fine-tuning techniques.

For migration guidance, please refer to our FSDP documentation and the official PyTorch FSDP guide.

Rename of Sequence Parallel config

We have renamed sequence_parallel_degree to context_parallel_size to be more consistent with the ecosystem naming by @salmanmohammadi in #2977.

🔧 Fixes & Improvements

Dataset & Preprocessing

  • Improved Dataset Processing: Significantly improved performance for dataset processing, sharding, and multiprocessing, resulting in faster startup times. (by @VarunGumma in #2918)
  • Smarter Defaults:
    • The warmup_ratio is now used as a better default over warmup_steps, as it adapts to your dataset size. (by @winglian in #2897)
    • pad_to_sequence_len now defaults to True if sample_packing is True for more consistent and intuitive behavior. (by @winglian in #2941)
  • SimPO Fix: Fixed an issue with using customized datasets with the SimPO trainer. (by @ganler in #2894)
  • Tool Usage Fix: Prevented the incorrect merging of tool arguments during data preprocessing. (by @greenhestu in #2909)

Distributed Training & Memory

  • DDP & DeepSpeed Fixes: Addressed an issue causing incorrect step calculation with DDP (#2915) and prevented distributed initialization during preprocessing with DeepSpeed for faster startups (#2920).
  • Checkpoint Memory: Added garbage collection before saving a checkpoint to reduce peak memory usage and prevent OOM errors. (by @winglian in #2971)
  • Plugin Registration: Ensured plugins are correctly registered in Ray workers for more robust distributed setups. (by @drikster80 in #2901)
  • torch.compile: Removed an extra, unnecessary torch.compile call to streamline execution. (by @djsaunde in [#2904](https://github.com/axolotl-ai-cloud/ax...