
Releases: axolotl-ai-cloud/axolotl

v0.16.1

02 Apr 21:47
08fc7de


Axolotl v0.16.1 Release Notes

Gemma 4 Support


Example YAML: https://github.com/axolotl-ai-cloud/axolotl/blob/main/examples/gemma4/26b-a4b-moe-qlora.yaml

Full Changelog: v0.16.0...v0.16.1

v0.16.0

02 Apr 14:25
573726c


Axolotl v0.16.0 Release Notes

We're excited to share this packed release, which includes ~80 new commits since v0.15.0 (March 6, 2026).


Highlights

Async GRPO: Asynchronous Reinforcement Learning Training (#3486)

Full support for asynchronous Group Relative Policy Optimization with vLLM integration. Includes async data producer with replay buffer, streaming partial-batch training, native LoRA weight sync to vLLM, and FP8 compatibility. Supports multi-GPU via FSDP1/FSDP2 and DeepSpeed ZeRO-3.

Achieves up to 58% faster step times (1.59s/step vs 3.79s baseline on Qwen2-0.5B).

Optimization                                  Step Time  Improvement
Baseline                                      3.79s      -
+ Batched weight sync                         2.52s      34% faster
+ Liger kernel fusion                         2.01s      47% faster
+ Streaming partial batch                     1.79s      53% faster
+ Element chunking + re-roll fix (500 steps)  1.59s      58% faster
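The producer/consumer pattern behind async rollout generation can be sketched with a replay buffer. This is a simplified illustration (class and field names are hypothetical, not Axolotl's API), with a thread standing in for the vLLM generation worker:

```python
import random
import threading

class ReplayBuffer:
    """Fixed-capacity rollout buffer; the oldest rollouts are evicted
    as the async producer streams new ones in."""
    def __init__(self, capacity, seed=0):
        self.capacity = capacity
        self.items = []
        self.rng = random.Random(seed)

    def add(self, rollout):
        self.items.append(rollout)
        if len(self.items) > self.capacity:
            self.items.pop(0)  # drop the stalest rollout

    def sample(self, n):
        # The trainer can start on a partial batch before generation
        # finishes, rather than blocking on a full one.
        return self.rng.sample(self.items, min(n, len(self.items)))

def produce(buffer, n_rollouts):
    # Stand-in for the async vLLM generation loop.
    for i in range(n_rollouts):
        buffer.add({"prompt_id": i, "reward": float(i % 3)})

buf = ReplayBuffer(capacity=8)
worker = threading.Thread(target=produce, args=(buf, 20))
worker.start()
worker.join()

batch = buf.sample(4)  # consumed by the training step
```

In the actual implementation, LoRA weights are synced back to vLLM between steps so that rollouts stay near on-policy.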

ScatterMoE + LoRA Fused Triton Kernels (#3513)

Custom fused Triton kernels for training MoE models with LoRA adapters. By fusing the base expert matmul and LoRA computation into a single kernel pass, these kernels achieve up to 15x faster forward passes and 40x less activation memory vs the eager baseline.
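Conceptually, the fused kernel computes the base expert matmul plus the low-rank LoRA update in one pass; a pure-Python reference of that math on toy shapes (all names and shapes illustrative):

```python
def matmul(a, b):
    """Naive matmul for small nested-list matrices."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def madd(a, b):
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def scale(a, s):
    return [[s * x for x in row] for row in a]

def expert_lora_forward(x, W, A, B, s):
    # What the fused kernel computes in a single pass per routed expert:
    #   y = x @ W  +  s * (x @ A) @ B
    return madd(matmul(x, W), scale(matmul(matmul(x, A), B), s))

x = [[1.0, 2.0]]
W = [[1.0, 0.0], [0.0, 1.0]]  # base expert weight (2x2)
A = [[1.0], [0.0]]            # LoRA down-projection (2x1, rank 1)
B = [[0.5, 0.5]]              # LoRA up-projection (1x2)
y = expert_lora_forward(x, W, A, B, s=2.0)
```

The fusion avoids writing the intermediate `x @ W` and `(x @ A) @ B` products to global memory between kernel launches.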

[Figure: ScatterMoE + LoRA speedup benchmarks]
Implementation detail

Key innovations:

  • Atomic-free split backward kernels (11x faster dA/dB gradients)
  • Autotunable tile sizes
  • Fused gather backward, eliminating the intermediate grouped-X buffer when the workload is small
  • Selective NF4 dequantization: only routed experts are dequantized, giving up to ~97% weight-memory reduction per layer at short sequences where few experts are activated (the savings diminish at longer contexts, as most experts see at least one token)
  • H200/B200 register-pressure tuning
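The selective-dequantization idea can be shown with a toy codebook: only experts the router actually selected get expanded from their packed form. Here a 4-entry codebook stands in for NF4's 16 entries, and all names are illustrative:

```python
CODEBOOK = [-1.0, -0.5, 0.0, 0.5]  # stand-in for the 16-entry NF4 codebook

def dequantize(packed):
    """Expand 2-bit codes (stored here as small ints) via codebook lookup."""
    return [CODEBOOK[code] for code in packed]

def selective_dequant(packed_experts, routed_ids):
    # Only experts that received at least one token are dequantized;
    # everything else stays in its packed form.
    return {e: dequantize(packed_experts[e]) for e in set(routed_ids)}

packed = {0: [0, 1], 1: [2, 3], 2: [3, 3], 3: [1, 1]}
routed = [1, 1, 2]  # the router sent tokens to experts 1 and 2 only
weights = selective_dequant(packed, routed)
```

At short sequences few experts appear in `routed`, so most packed weights are never expanded, which is where the memory savings come from.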

SonicMoE Fused LoRA (#3519)

LoRA support for SonicMoE (CUTLASS-based MoE kernels for Hopper/Blackwell GPUs).

[Figure: Qwen3.5 MoE SonicMoE vs grouped_mm comparison]

This results in up to a 1.45x speedup and a 30% reduction over the grouped_mm baseline on a 1xH100 SXM GPU (Qwen3.5-35B-A3B 8-bit LoRA tuning).

GRPO Flattening & Packing (#3552)

Enables batch flattening and sample packing for GRPO training, improving token efficiency and training throughput by ~10%.
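Sample packing itself is simple to illustrate: variable-length sequences are grouped into fixed-length rows so fewer pad tokens are wasted. A minimal first-fit sketch (Axolotl's packer is more sophisticated than this):

```python
def pack(lengths, max_len):
    """First-fit packing of sequence lengths into rows of size max_len."""
    bins = []   # each bin is a list of sequence indices
    space = []  # remaining capacity per bin
    for i, n in enumerate(lengths):
        for b, free in enumerate(space):
            if n <= free:
                bins[b].append(i)
                space[b] -= n
                break
        else:
            # no existing bin fits: open a new one
            bins.append([i])
            space.append(max_len - n)
    return bins

# Five sequences packed into 2048-token rows instead of five padded rows.
bins = pack([512, 1024, 256, 2048, 256], max_len=2048)
```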

Flash Attention 4 support (#3481)

Support for Flash Attention 4 on Hopper and Blackwell GPUs with automatic fallback to FA2/3 depending on hardware.

[Figure: FA2 vs FA4 benchmarks]

We found improvements mainly at longer sequence lengths and on Blackwell GPUs.

NeMo Gym Integration (#3516)

Full integration with NVIDIA NeMo Gym for single-turn and multi-turn RL training. You can now reuse existing NeMo Gym environments in Axolotl: https://github.com/NVIDIA-NeMo/Gym

Implementation detail

Single-turn mode calls /verify endpoints for deterministic rewards (math, logic). Multi-turn mode delegates generation to a NeMo Gym agent server that orchestrates the full agentic loop: the model generates tool calls, the agent executes them against environment servers, feeds results back, and repeats until completion. An env_mask separates model-generated tokens (trained on) from tool results (masked out), and IS correction handles the logprob mismatch between agent-side generation and the training policy. Combined with async_prefetch, generation and training overlap for ~3x throughput on multi-turn tasks.
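A toy sketch of the masking and importance-sampling correction described above (the actual GRPO objective has more terms; names here are illustrative):

```python
import math

def masked_is_loss(advantage, logp_train, logp_gen, env_mask):
    """Mean per-token policy-gradient term: tool-result tokens (mask 0)
    are excluded, and an importance ratio corrects for the agent-side
    generation policy differing from the training policy."""
    total, count = 0.0, 0
    for lt, lg, m in zip(logp_train, logp_gen, env_mask):
        if m == 0:
            continue  # token came from a tool result, not the model
        ratio = math.exp(lt - lg)  # IS correction
        total += -ratio * advantage * lt
        count += 1
    return total / count

loss = masked_is_loss(
    advantage=1.0,
    logp_train=[-0.5, -1.0, -2.0],
    logp_gen=[-0.5, -1.0, -1.0],
    env_mask=[1, 0, 1],  # middle token is a tool result, masked out
)
```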

Energy-Based Fine-Tuning (EBFT) (#3527)

Novel RL training method that matches generated completions to ground-truth feature embeddings without requiring external reward models. Supports structured (vLLM) and strided (single GPU) modes.


Performance & Kernel Optimizations

  • Custom Triton Kernels for RL (#3510): Fused Triton kernels for entropy_from_logits (single-pass online algorithm) and selective_log_softmax (fused forward+backward). Both avoid materializing the full [B, L, V] softmax intermediate. Benchmarked on Qwen vocab (V=151,936), RTX 5090:
    • entropy_from_logits: 5.2x faster, near-zero memory overhead (0.2 MB vs 120 MB); on non-contiguous tensors: 7.3x faster, saving up to 10 GB by avoiding the .contiguous() copy
    • selective_log_softmax: 2.9x faster forward, 3.2x faster fwd+bwd, memory reduced from ~5 GB to <0.2 MB (at B=8, L=2048); the original OOMs at B=16 where the Triton version succeeds
  • LoRA Kernel Enhancements (#3528): Extended LoRA kernels to support bias, dropout, DoRA, and embeddings with FSDP2 compatibility, making them production-ready for all LoRA configurations.
  • Liger Kernels for Qwen 3.5 (#3531): Fused RMSNorm + gated-activation kernels for Qwen 3.5 and Qwen 3.5 MoE.
  • Auto TF32 (#3473): Automatically enables TF32 matmul/cuDNN when the hardware supports it.
  • Reduced Autotune Search Space (#3525): Faster kernel autotuning startup.
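The single-pass entropy trick uses the identity H = logsumexp(z) - sum(softmax(z) * z), maintained with online statistics (running max, sum of exponentials, sum of z * exp) so the softmax is never materialized. A pure-Python sketch of the algorithm (the real kernel does this per row, in Triton):

```python
import math

def entropy_online(logits, chunk=4):
    """Single-pass entropy via online max / sum-exp / sum-(z*exp) stats."""
    m = float("-inf")  # running max
    s = 0.0            # running sum of exp(z - m)
    w = 0.0            # running sum of z * exp(z - m)
    for i in range(0, len(logits), chunk):
        block = logits[i:i + chunk]
        new_m = max(m, max(block))
        r = math.exp(m - new_m)  # rescale factor for old stats (0.0 at start)
        s = s * r + sum(math.exp(z - new_m) for z in block)
        w = w * r + sum(z * math.exp(z - new_m) for z in block)
        m = new_m
    # H = logsumexp(z) - sum(softmax(z) * z) = log(s) + m - w / s
    return math.log(s) + m - w / s

def entropy_reference(logits):
    """Two-pass version that materializes the full distribution."""
    lse = math.log(sum(math.exp(z) for z in logits))
    return lse - sum(z * math.exp(z - lse) for z in logits)

logits = [0.1, 2.0, -1.5, 0.7, 3.2, 0.0]
```

The online version only ever holds one chunk of logits plus three scalars, which is why the kernel's memory overhead is near zero.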

New Features

  • CPU Layer Offloading (#3512): Offload frozen decoder layers to CPU during LoRA training, streaming them to GPU only during forward/backward. Dramatically reduces VRAM usage for large models.
  • Multiple Custom Optimizers (#3457): End-to-end support for Flash AdamW, Optimi AdamW, ADOPT AdamW, Muon, Dion, Schedule-Free AdamW, and CAME PyTorch.
  • Memory-Efficient LoRA Merging (#3095): Iterates through weight bins without loading entire layers into memory, enabling merging of massive models on resource-constrained systems.
  • Custom MoE Routing (#3526): Support for Ernie 4.5 MoE and Hunyuan V1 MoE custom routing strategies beyond standard softmax-topk.
  • Synthetic Benchmarking Datasets (#3518): Built-in synthetic datasets for benchmarking and testing training pipelines.
  • MX Quantization-Aware Training (QAT) (#3553): Support for MX-format QAT with torchao integration, including state-dict hooks for saving dequantized safetensors from transformers.
  • dtype Propagation for torchao (#3569): Proper dtype handling through torchao quantization and optimizer paths.
  • Lazy Trainer Loading (#3568): Lazy-loads trainer classes to avoid unnecessary imports, improving startup time.
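The memory-efficient merge can be sketched as a generator that loads one weight bin at a time, applies W <- W + scaling * B @ A for tensors that have an adapter, and yields the result for writing before the next bin is touched (all names and shapes illustrative, not Axolotl's API):

```python
def merge_lora_streaming(load_shard, shard_names, adapters, scaling):
    """Merge LoRA into base weights one shard at a time, so only a
    single shard's tensors are resident in memory at once."""
    for name in shard_names:
        W = load_shard(name)  # e.g. one safetensors bin
        if name in adapters:
            A, B = adapters[name]
            # W <- W + scaling * B @ A, applied in place per tensor
            for i in range(len(W)):
                for j in range(len(W[0])):
                    W[i][j] += scaling * sum(
                        B[i][k] * A[k][j] for k in range(len(A))
                    )
        yield name, W  # write out, then drop before loading the next shard

base = {"layer0": [[1.0, 0.0], [0.0, 1.0]]}
adapters = {"layer0": ([[1.0, 1.0]], [[0.5], [0.0]])}  # A: 1x2, B: 2x1
merged = dict(
    merge_lora_streaming(lambda n: base[n], ["layer0"], adapters, scaling=2.0)
)
```

Peak memory is bounded by the largest single shard rather than the full model.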

Documentation

  • Greatly improved documentation for GRPO and agent use (#3564): New dedicated guides for GRPO training, vLLM serving, training stability and monitoring, debugging, and choosing training methods. Adds agent-specific documentation and deduplicates content across pages.

Model & Framework Support

Deprecation

  • Deprecated torch 2.8.0 support (#3550)
  • Removed dead SDPA patches (#3488)

New Model Support

Updates

  • Transformers 5.4.0 (#3562)
  • lm-eval updated for Transformers v5 (#3571)
  • Torchao 0.17.0 (#3569)
  • Gemma 3 config fixes (#3500)
  • Docker builds for py312-cu128-torch2.9.1 (#3489)

Bug Fixes

  • Fixed high eval loss with sample packing (#3478)
  • Fixed double sequence partition with Context Parallelism (#3498)
  • Fixed DPO compatibility with trl v0.29 (#3560)
  • Fixed Ray train crashing after succeeding (#3542)
  • Fixed race condition in patching checks (#3543)
  • Fixed token state JSON and Mistral tokenizer issue (#3522)
  • Fixed async model loading with quantized BNB models (#3477)
  • Fixed MoE quant patch for merge mismatches (#3483)
  • Fixed connection error handling for user whoami check (#3529)
  • Patches only applied when CUDA is available (#3561)
  • Fixed shell injection in modal CLI (#3487)
  • Fixed DPO tool role KeyError (#3217), dataset hash output_dir (#3303), config validators (#3538)
  • Added precompute_ref_log_probs to config schema (#3555)
  • Added troubleshooting note for GLM4 GGUF MTP mismatch (#3559)
  • Allow bf16 flag with deprecation warning (#3563)

Infrastructure

  • Scored rollouts dispatched to plugins with extended external plugin paths (#3549)
  • CI improvements: codecov informational only, better error handling (#3534, #3517)
  • Make CI fail GitHub Actions on test failures (#3517)
  • Consolidate routing behavior in ScatterMoE kernels (#3475)
  • roundup_power2_divisions no longer needed with newer PyTorch versions (#3540)
  • Logging cleanup (#3482)
  • Update pre-commit hooks (#3567)
  • Update README (#3503)
  • Docker: ubuntu user improvements for uv images (#3491, #3492, #3494, #3495, reverted #3496)

What We Tried (So You Don't Have To)

A few optimizations we prototyped that yielded little to no improvement; we're sharing them so others can skip these rabbit holes:

  • Fused NF4 dequant inside the ScatterMoE Triton GEMM kernel: reads NF4-packed expert weights directly in the inner loop, avoiding the bf16 intermediate buffer entirely. Correctness-verified (zero numerical error), but 3x slower than the separate dequant + bf16 path. BnB's dequant kernel reads data linearly at full memory bandwidth (~1 TB/s), while the fused kernel's scattered byte addressing and per-element NF4 codebook lookups destroy memory coalescing.

  • Batched NF4 dequantization: concatenating gate_up + down_proj packed NF4 data into a single dequant call. Produced identical performance (1.00x); the kernel is purely bandwidth-bound, so one large call equals two smaller calls at the same total bytes.

  • Dequant buffer pool for NF4 MoE: pre-allocating reusable bf16 buffers to avoid 512 alloc/free cycles per step during NF4 expert dequantization. Reduced memory fragmentation from 20....


v0.15.0

06 Mar 17:55
8f19169


Axolotl v0.15.0 Release Notes

This release brings new model support, significant MoE improvements, infrastructure updates with Torch 2.10.0 and uv builds, and a collection of quality-of-life fixes across the board.

🚀 Major Changes

Torch 2.10.0 & uv Builds

We have upgraded to Torch 2.10.0 and introduced uv-based Docker builds for faster, more reproducible image creation. Python 3.14 is now used in unit tests.

ScatterMoE LoRA & SonicMoE

Added LoRA support for ScatterMoE and introduced SonicMoE as a new MoE kernel option for faster and more memory-efficient MoE training compared to grouped_mm in transformers.

MoE Expert Quantization

Added support for quantizing MoE expert weights in Transformers v5, which dramatically reduces peak reserved memory. For example, GLM-4.7-Flash QLoRA drops from ~127 GiB to ~23 GiB reserved memory. Activate it via quantize_moe_experts: true. See the expert quantization docs for details.
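For reference, turning this on alongside a QLoRA setup is a one-line addition; a minimal config fragment (the surrounding keys are shown for context only):

```yaml
load_in_4bit: true
adapter: qlora
quantize_moe_experts: true  # quantize MoE expert weights (Transformers v5)
```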

🎉 New Features

  • New Model Support:
  • SageAttention: Added SageAttention integration for efficient attention computation. (#2823 by @NanoCode012)
  • MXFP4 Quantization: Added support for MXFP4 quantization. (#3375 by @ved1beta)
  • Hub Revision Support: Added hub_revision for specifying a branch when pushing checkpoints. (#3387 by @madScientist10)
  • Dot-Notation CLI Args: Support for dot-notation CLI arguments for nested config options. (#3419 by @ManasVardhan)
  • Sample Generation for SFT: New sample generation support to SFT training. (#3240 by @ved1beta)
  • train_per_sec_per_gpu Metric: Added a new training throughput metric. (#3364 by @ved1beta)

⚠️ Breaking Changes

  • dataset_processes → dataset_num_proc: The config field dataset_processes has been renamed to dataset_num_proc. (#3352 by @tgoab)

πŸ› Bug Fixes

  • Context Parallel:
  • GRPO:
  • Telemetry:
  • LoRA Kernels: Improved failure messaging and handling of trust_remote_code. (#3378 by @NanoCode012)
  • Generation Mode: Fixed add_special_tokens handling and enabled test mode for generation. (#3396 by @NanoCode012)
  • CCE Generic Patch: Fixed generic patch for Cut Cross Entropy. (#3405 by @winglian)
  • MistralProcessor: Updated for Transformers v5 compatibility. (#3423 by @NanoCode012)
  • Adapter Config Validation: Set allowed values for adapter in Pydantic config. (#3415 by @NanoCode012)
  • Tokenizer/Processor Revision: Pass revision parameter to tokenizer and processor loaders. (#3388 by @madScientist10)
  • Excess Length Truncation: Fixed excess_length_strategy truncation method. (#3401 by @rlronan)
  • Scheduler Fallback: Use self.optimizer if optimizer not passed to SchedulerMixin.create_scheduler(). (#3435 by @kallewoof)
  • Dataset Deduplication: Fixed saving de-duplicated dataset during pre-processing. (#3427 by @ManasVardhan)
  • FSDP2 Sharding: Fixed FSDP2 sharding and validated AO version for LR groups. (#3403 by @bekk02)
  • Qwen3-Next packing: Fixed causal-conv1d call for packing via FLA instead of native pytorch. (#3437 by @NanoCode012)
  • Quantize MoE validation: Fixed validation to prevent lora_target_linear from being set to true when quantizing MoE experts. (#3461 by @NanoCode012)
  • TP batch_size calculation: Fixed the batch_size calculation under tensor parallelism. (#3462 by @Yatimai)

📦 Dependency & Infrastructure Updates

  • transformers to 5.3.0, accelerate to 1.13.0, trl to 0.29.0, kernels to 0.12.2, torchao to 0.16.0. (#3407 #3459 by @winglian)
  • Trackio 0.16.1: Updated trackio to 0.16.1. (#3425 by @winglian)
  • Cut-Cross-Entropy Bump: Bumped cut-cross-entropy to 58d6572. (#3424 by @NanoCode012)
  • BunnyCDN for CI: CI assets now served via BunnyCDN. (#3422 by @winglian)
  • Telemetry Params: Included number of params and rounded estimates in PostHog telemetry. (#3455 by @winglian)

📖 Documentation

Other Changes

New Contributors

Full Changelog: v0.14.0...v0.15.0

v0.14.0

30 Jan 19:10
be00978


This is a major release marking our migration to Transformers v5. Along with this significant core dependency upgrade, we are introducing major performance optimizations for MoE models and new fine-tuning methods.

🚀 Major Changes

Transformers v5 Upgrade

We have upgraded the underlying transformers dependency from v4 to v5. This has been a long-term effort to ensure Axolotl remains compatible with the latest ecosystem advancements, stability improvements, and future model architectures.

Faster MoE Training

We've added support for selecting MoE kernels via transformers: batched_mm and grouped_mm. Alongside this, we have added our custom integration for scattermoe. This significantly speeds up training and reduces VRAM usage for Mixture of Experts (MoE) models.

🎉 New Features

  • EAFT Support: Added support for Efficient Adaptation Fine-Tuning (EAFT). (#3366 by @salmanmohammadi)
  • New CCE Support: Added Cut Cross Entropy support for GLM 4.7 Flash, GLM Image, GLM 4.6v, and Exaone 4. (#3373 by @NanoCode012)

Full Changelog: v0.13.2...v0.14.0

v0.13.2

22 Jan 15:59


Axolotl v0.13.2 Release Notes

This is a patch release introducing GDPO support and updating core infrastructure, including newer CUDA defaults and Python versions.

🎉 New Features

📦 Dependency & Infrastructure Updates

  • CUDA 12.9.1: Base images now default to CUDA 12.9.1. (#3367 by @winglian)
  • Python 3.12: Added Python 3.12 support to base images. (#3367 by @winglian)
  • vLLM Upgrade: Upgraded vLLM dependency to v0.14.0. (#3345 by @winglian)

Other fixes

Full Changelog: v0.13.1...v0.13.2

v0.13.1

20 Jan 13:59
6e42def


This release brings support for PyTorch 2.9.1, expands our ecosystem with new experiment trackers (SwanLab and Trackio), and introduces support for a wide range of new models including Olmo3, Ministral 3, InternVL 3.5, and Kimi. We've also included significant improvements to quantization workflows and metrics logging.

🎉 New Features

Expanded Model Support

We've added support for more models!

New Experiment Tracking Integrations

  • SwanLab: You can now use SwanLab for experiment tracking. (#3334 by @PraMamba)
  • Trackio: Added Trackio validation integration. (#3253 by @abidlabs)

Training & PEFT Improvements

  • Liger Kernel for DPO: Added Liger kernel support for DPO training. (#3302 by @ved1beta)
  • Distributed Muon: Added support for distributed Muon optimizer. (#3264 by @salmanmohammadi)
  • Weight Tying Safety: Added peft_ensure_weight_tying to ensure correct parameter handling in PEFT. (#3278 by @NanoCode012)
  • Adapter Dtypes: Added peft_autocast_adapter_dtype config option for fine-grained control. (#3311 by @xzuyn)
  • Cheap PPL Metric: A new metric calculation for Perplexity that is less computationally expensive. (#3317 by @xzuyn)
  • Scaled Softmax: Scales the softmax calculation by s * log(n) + b. (#3338 by @ved1beta)
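A sketch of what that scaling means, assuming (our reading, not confirmed by the PR text) that the logits are multiplied by s * log(n) + b before the softmax, where n is the sequence length:

```python
import math

def scaled_softmax(logits, s, b):
    """Softmax with logits multiplied by s * log(n) + b, where n is the
    number of entries: a length-adaptive temperature."""
    n = len(logits)
    factor = s * math.log(n) + b
    scaled = [z * factor for z in logits]
    m = max(scaled)  # subtract max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

probs = scaled_softmax([1.0, 2.0, 3.0], s=1.0, b=0.0)
```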

⚠️ Deprecations & Warnings

PyTorch 2.7.1 Deprecation

Support for PyTorch 2.7.1 has been deprecated. We recommend upgrading to newer supported versions.

🔧 Fixes & Improvements

Quantization & CLI

  • Save Processor: The quantizer CLI now properly saves the processor alongside the model. (#3290 by @salmanmohammadi)
  • FP8 Checks: Fixed checks for FP8 capability and load_in_8bit configurations. (#3324, #3327)
  • NVFP4 Configs: Added QAT NVFP4 configs for reference. (#3280 by @salmanmohammadi)

Logging & Metrics

  • Metric Rounding: You can now set the environment variable AXOLOTL_METRIC_PRECISION (default: 5) to control the rounding of logged metrics. (#3325 by @ved1beta)
  • Token Logging: Fixed total/trainable tokens logging logic. (#3344 #3293 by @ved1beta)
  • Evaluation Loss: Fixed evaluation loss calculation in the KD trainer. (#3271 by @roycho96)

Data Processing

  • Long Sequence Handling: Feature added to raise an error on long sequence drops to prevent silent data loss. (#3321 by @kallewoof)
  • Qwen3 Tokenization: Fixed an off-by-a-few-tokens issue in Qwen3 Jinja tokenization. (#3295 by @NanoCode012)

📦 Dependency & Infrastructure Updates

Others

  • fix bin size by @ved1beta in #3307
  • pre-commit hooks update. (#3287 by @github-actions[bot])
  • PYTORCH_CUDA_ALLOC_CONF deprecation fix to ensure compatibility with future PyTorch versions. (#3313 by @NanoCode012)

New Contributors

Full Changelog: v0.13.0...v0.13.1

v0.13.0

02 Dec 15:00
f5f21fb


This release is packed with major new features, including Streaming SFT for massive datasets, a new Text Diffusion training plugin, and a significant upgrade to our Quantization-Aware Training (QAT) capabilities with NVFP4 support. We're also thrilled to announce support for a huge variety of new models like Gemma3, the Qwen3 family, Hunyuan, Granite 4, and many more.

Alongside these headline features, this release brings support for PyTorch 2.9 and CUDA 13, major dependency upgrades, new developer tooling with Ruff, and a host of important bug fixes to improve training stability and user experience.

🎉 New Features

Streaming Supervised Finetuning (SFT)

You can now finetune on datasets of any size without needing to pre-process and load the entire dataset into memory. Streaming SFT processes data on the fly, dramatically reducing memory usage and startup times, making it ideal for large-scale training.
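The on-the-fly idea in miniature: records are tokenized lazily as the trainer pulls them, so the dataset is never fully materialized (the whitespace "tokenizer" here is a stand-in, and the names are illustrative rather than Axolotl's API):

```python
def stream_tokenize(records, tokenize, max_len):
    """Process a dataset lazily: each record is tokenized and truncated
    as it is pulled, so the full dataset never sits in memory."""
    for rec in records:
        ids = tokenize(rec["text"])[:max_len]
        yield {"input_ids": ids, "labels": list(ids)}

# Toy whitespace "tokenizer" and an iterable source (in practice a file
# or a streaming Hugging Face dataset).
toy_tokenize = lambda text: [len(w) for w in text.split()]
source = iter([{"text": "hello streaming world"}, {"text": "tiny example"}])

first = next(stream_tokenize(source, toy_tokenize, max_len=2))
```

Because the pipeline is a generator, startup cost is a single record's worth of work instead of a full preprocessing pass.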

Text Diffusion Training Plugin

Explore a new paradigm of training with our Text Diffusion plugin! This allows you to train models on diffusion-based objectives, opening up new possibilities for text generation and manipulation tasks.

Upgraded Quantization-Aware Training (QAT) with NVFP4 Support

We've migrated to a new QAT API and enhanced the axolotl quantize command. This release introduces support for NVFP4, a new 4-bit floating-point format, further pushing the boundaries of model quantization and efficiency.

Expanded Model Support

We've added support for a wide array of new and powerful models:

Quality-of-Life & Training Improvements

  • FSDPv2 Swap Memory: Reduce VRAM usage for FSDP training by swapping memory to the CPU, now with compatibility fixes for QLoRA. (#3167 by @gholmes829)
  • Fixed IPO Dataset loading: IPO training will now properly load DPO specific configs. (#3128 by @seungduk-yanolja)
  • Modernized Tooling with Ruff: We've replaced black, isort, flake8, and pylint with ruff for a faster and unified linting and formatting experience. (#3092 by @djsaunde)
  • Tokens Per Second Logging: The trainer now logs tokens_per_second by default for better performance monitoring. (#3072, #3134 by @salmanmohammadi)
  • Single GPU DeepSpeed Zero3: Environment variables are now automatically configured for single-GPU DeepSpeed ZeRO-3 runs, simplifying setup. (#3118 by @winglian)
  • JSON String Tool Arguments: Function-calling/tool-use training now supports JSON strings in tool arguments for greater flexibility. (#3136 by @gamersover)
  • Refactored FSDP Config: FSDP parameters are now neatly organized under a single fsdp_config dictionary for cleaner configurations. (#3170 by @salmanmohammadi)

🚨 Important changes

PyTorch 2.6.0 Deprecation

PyTorch 2.8 and 2.9 are now supported in Axolotl. Following our previous deprecation cycle, we are dropping support for PyTorch 2.6. The current Docker image uses PyTorch 2.8 with CUDA 12.8; we recommend updating to get the newest features and fixes. We also build base images for PyTorch 2.9 with CUDA 12.8 and 13.0.

Introduction of Opt-Out Telemetry

Axolotl now includes opt-out, anonymous telemetry to help us understand usage patterns, such as which features and model architectures are most popular. This data is invaluable for prioritizing future development, bug fixes, and improving the library for everyone. We do not collect any personal or confidential information.

To opt out, set the following:

export AXOLOTL_DO_NOT_TRACK=1

More info can be found in the docs.

📦 Dependency & Environment Updates

  • PyTorch & CUDA Support:
    • Container images are now available for PyTorch 2.9.0 (#3221) and CUDA 13.0.0 (#3229).
  • Core Libraries:
    • deepspeed has been upgraded to its latest version for improved performance and stability. (#3261)
    • transformers upgraded to 4.57.1 (#3127, #3201, #3214, #3249)
    • peft upgraded to 0.23.1 (#3094, #3214)
    • trl upgraded to 0.24.0 (#3161, #3230, #3249)
    • datasets upgraded to 4.4.1 (#3266).
    • flash-attn upgraded to 2.8.3 for GPT-OSS attention sink support (#3082).
    • liger upgraded to 0.6.3 (#3230).
    • numpy upgraded to 2.3.4 (#3236).

🔧 Fixes & Bug Squashing

  • Training & Checkpoints:

    • Fixed an issue where sweep runs would overwrite each other by reusing the base output_dir. (#3080 by @ginkyenglee)
    • Corrected an issue where warmup_steps: 0 or warmup_ratio: 0 did not properly disable warmup. (#3254 by @xzuyn)
    • Addressed a bug in TRL's enable_sleep_mode. (#3225 by @matthambrecht)
    • Added a feature to save an initial checkpoint as soon as training starts, protecting against early-stage failures. (#3233 by @ved1beta)
    • Improved handling of LoRA with biases under FSDP2 and file management during checkpoint saving. (#3090)
  • Dataloader & Preprocessing:

    • Fixed a broken Voxtral preprocessor. (#3255 by @NanoCode012)
    • Refactored multipack sampler patch. (#3096 by @djsaunde)
    • Fixed a bug preventing the chat template jinja file from being loaded during inference. (#3112 by @NanoCode012)
    • Resolved a dataloader slow loading issue by setting pin_memory=False when dataloader_num_workers > 0. (#3219 by @qywu)
  • Distributed Training (DeepSpeed/FSDP):

    • Patched ds_grads_remaining in DeepSpeed for better stability. (#3102 by @ved1beta)
    • Fixed a DeepSpeed AttributeError when using Context Parallelism. (#3220 by @NanoCode012)
    • Improved logic for disabling P2P on Runpod and other platforms by using torch.cuda.can_device_access_peer. (#3132, #3209)
  • Other Improvements:

    • Fixed load_in_Xbit deprecation warnings from transformers. (#3205 by @NanoCode012)
    • Enhanced logging with a new debug log and ot...

v0.12.2

18 Aug 14:41


What's Changed

Full Changelog: v0.12.1...v0.12.2

v0.12.1

11 Aug 13:38


v0.12.1 is a patch release to fix a regression when using Ray Trainer from the CLI

What's Changed

  • use exec instead of subprocess to make ctrl+c nicer for cli by @winglian in #3044
  • fix ray train and add fsdp2 smoke test for ray trainer by @winglian in #3053

Full Changelog: v0.12.0...v0.12.1

v0.12.0

08 Aug 12:25
2c8497e


We're introducing a major upgrade to our distributed training feature set, including support for ND-Parallelism for training at scale, support for DeepSpeed's Auto Tensor Parallelism, and FP8 training. We're also excited to announce support for fine-tuning the latest gpt-oss models (and many more!) and a host of fixes and dependency updates.

🎉 New features

ND-Parallel for Advanced Parallelism Strategies

Together with Accelerate, we've introduced ND-Parallel support, allowing you to compose different parallelism techniques like Context Parallelism, Tensor Parallelism, and Fully Sharded Data Parallelism to enable fine-tuning large models at scale. Check out the official Huggingface blogpost for more details!

Expanded Model Support

We've added support for a new wave of powerful models:

Experimental FP8 Mixed-Precision Training with torchao

Check out experimental FP8 mixed-precision training! By leveraging the torchao library, you can train with FP8 data types and perform gather ops in FP8, leading to significant memory savings and potential speedups. Read the docs to enable it.

Improved Slurm Support

We've fixed some issues that could freeze tasks during preprocessing and included an easy-to-use Slurm example for your large-cluster needs. Check out the README and example.

DeepSpeed Auto Tensor Parallelism (AutoTP)

You can now leverage DeepSpeed's Auto Tensor Parallelism to automatically shard your model's layers across multiple GPUs. This dramatically reduces the VRAM requirement for each GPU, enabling you to fine-tune much larger models than previously possible on the same hardware. Enable it in your YAML config by setting tensor_parallel_size.
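A minimal config fragment (the value is illustrative; pick the number of GPUs to shard across):

```yaml
tensor_parallel_size: 4  # shard layers across 4 GPUs via DeepSpeed AutoTP
```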

TiledMLP Now Supports FSDP2 and Single GPU

TiledMLP, which reduces activation memory for long sequences, is now more versatile. It's fully compatible with our new FSDP2 implementation and can now be used on single-GPU setups, making it accessible for a wider range of training scenarios.

Dion Optimizer Support

We've added support for the Dion optimizer, a scalable, communication-efficient optimizer designed to speed up training under parallelism, giving you another tool to fine-tune larger models on your hardware.

Enabled LoRA kernels with FSDP2

FSDP2 and LoRA training is now significantly faster thanks to the integration of optimized kernels, reducing training overhead and speeding up your workflows.

Quality-of-Life & Developer Experience Improvements

  • CLI Autocompletion: Speed up your workflow with new tab-completion for the Axolotl CLI. Simply run axolotl -h to see how to install it for your shell. (by @winglian in #2955)
  • Mid-Training Profiling: You can now start the PyTorch profiler in the middle of a training run, making it easier to debug performance bottlenecks without needing to restart. (by @winglian in #2899)
  • Generic Fused Kernels: Applied generic fused CCE and TiledMLP implementations from Liger to support a wider range of arbitrary models automatically. (by @winglian in #2908)
  • Activation Offloading with CUDA Streams: Reduced VRAM usage by offloading activations to CPU RAM using non-blocking CUDA streams for better GPU utilization. (by @winglian in #2900, fixed for LoRA in #2928)
  • New CLI Launcher: add --launcher option, support launcher args, cleanup, refactor (by @djsaunde in #2924)
  • Support for lora_target_parameters: Allows targeting parameter names for LoRA, useful when targeting a module name is not possible, as with MoE. (by @winglian in #3006)
  • Cut Cross Entropy support for SmollM3, Granite, and GraniteMoE: (by @NanoCode012 in #2993)
  • LoRA Kernels Now Support Biases: Our optimized QLoRA kernels can now be applied to bias terms, increasing the flexibility of your low-rank adaptations. (by @djsaunde in #3025)
  • Custom Trainer via Module Path: As an alternative to plugin, you can now define a trainer_cls in your YAML config by pointing to it as a module path (e.g., my_project.trainers.CustomTrainer). (by @winglian in #3024)
  • Prime Intellect Integration: Added support for running jobs on the Prime Intellect platform. (by @winglian in #3021)
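Using the example module path mentioned above, the custom-trainer config entry looks like:

```yaml
trainer_cls: my_project.trainers.CustomTrainer
```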

📦 Dependency Updates

  • peft upgraded to 0.17.0 (#3006) and datasets to 4.0.0. (#2917)
  • trl upgraded to 0.20.0. (#2892 , #2987)
  • accelerate upgraded to 1.9.0. (#2936)
  • liger upgraded to 0.6.1. (#2893 , #2987)
  • torchao upgraded to 0.12.0. (#2968)
  • modal upgraded to 1.0.2. (#2925)
  • transformers upgraded to 4.55.0. (#2984 , #3018)
  • bitsandbytes upgraded to 0.46.1. (#2992)

🚨 Upcoming deprecations

Upgrading from FSDP1 → FSDP2

Axolotl now recommends PyTorch's native FSDP2 instead of FSDP1. This brings performance improvements, better stability, and additional features and compatibility with the latest fine-tuning techniques.

For migration guidance, please refer to our FSDP documentation and the official PyTorch FSDP guide.

Rename of Sequence Parallel config

We have renamed sequence_parallel_degree to context_parallel_size to be more consistent with the ecosystem naming by @salmanmohammadi in #2977.

🔧 Fixes & Improvements

Dataset & Preprocessing

  • Improved Dataset Processing: Significantly improved performance for dataset processing, sharding, and multiprocessing, resulting in faster startup times. (by @VarunGumma in #2918)
  • Smarter Defaults:
    • The warmup_ratio is now used as a better default over warmup_steps, as it adapts to your dataset size. (by @winglian in #2897)
    • pad_to_sequence_len now defaults to True if sample_packing is True for more consistent and intuitive behavior. (by @winglian in #2941)
  • SimPO Fix: Fixed an issue with using customized datasets with the SimPO trainer. (by @ganler in #2894)
  • Tool Usage Fix: Prevented the incorrect merging of tool arguments during data preprocessing. (by @greenhestu in #2909)

Distributed Training & Memory

  • DDP & DeepSpeed Fixes: Addressed an issue causing incorrect step calculation with DDP (#2915) and prevented distributed initialization during preprocessing with DeepSpeed for faster startups (#2920).
  • Checkpoint Memory: Added garbage collection before saving a checkpoint to reduce peak memory usage and prevent OOM errors. (by @winglian in #2971)
  • Plugin Registration: Ensured plugins are correctly registered in Ray workers for more robust distributed setups. (by @drikster80 in #2901)
  • torch.compile: Removed an extra, unnecessary torch.compile call to streamline execution. (by @djsaunde in [#2904](https://github.com/axolotl-ai-cloud/ax...