Qwen3.5 Fine-tuning Guide

Learn how to fine-tune Qwen3.5 LLMs with Unsloth.

You can now fine-tune the Qwen3.5 model family (0.8B, 2B, 4B, 9B, 27B, 35B‑A3B, 122B‑A10B) with Unsloth. Support includes both vision and text fine-tuning. Qwen3.5‑35B‑A3B bf16 LoRA works on 74GB VRAM.

  • Unsloth makes Qwen3.5 train 1.5× faster and uses 50% less VRAM than FA2 setups.

  • Qwen3.5 bf16 LoRA VRAM use: 0.8B: 3GB • 2B: 5GB • 4B: 10GB • 9B: 22GB • 27B: 56GB

  • Fine-tune 0.8B, 2B and 4B bf16 LoRA via our free Google Colab notebooks:

  • If you want to preserve reasoning ability, mix reasoning-style examples with direct answers (keep a minimum of 75% reasoning). Otherwise, you can omit reasoning data entirely.

  • Full fine-tuning (FFT) works as well. Note it will use roughly 4× more VRAM than LoRA.

  • Qwen3.5 is powerful for multilingual fine-tuning as it supports 201 languages.

  • After fine-tuning, you can export to GGUF (for llama.cpp, Ollama, LM Studio, etc.) or to vLLM.

  • Reinforcement Learning (RL) for Qwen3.5 VLMs also works via Unsloth inference.

If you’re on an older version (or fine-tuning locally), update first:

pip install --upgrade --force-reinstall --no-cache-dir unsloth unsloth_zoo

MoE fine-tuning (35B, 122B)

For MoE models like Qwen3.5‑35B‑A3B / 122B‑A10B / 397B‑A17B:

  • You can use our Qwen3.5‑35B‑A3B (A100) fine-tuning notebook

  • Supports our recent ~12× faster MoE training update, with >35% less VRAM and ~6× longer context

  • Best to use bf16 setups (e.g. LoRA or full fine-tuning); MoE QLoRA 4‑bit is not recommended due to BitsAndBytes limitations.

  • Unsloth’s MoE kernels are enabled by default and can use different backends; you can switch with UNSLOTH_MOE_BACKEND.

  • Router-layer fine-tuning is disabled by default for stability.

  • Qwen3.5‑122B‑A10B bf16 LoRA works on 256GB VRAM. If you're using multiple GPUs, add device_map = "balanced" or follow our multi-GPU guide.

Quickstart

Below is a minimal SFT recipe (works for “text-only” fine-tuning). See also our vision fine-tuning section.
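A minimal sketch of such a recipe, assuming Unsloth's FastLanguageModel API together with trl's SFTTrainer. The model name, dataset, and hyperparameters below are illustrative placeholders, not recommendations:

```python
from unsloth import FastLanguageModel
from trl import SFTConfig, SFTTrainer
from datasets import load_dataset

# Model name is a placeholder; swap in the Qwen3.5 checkpoint you want.
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/Qwen3.5-4B",
    max_seq_length = 2048,
    load_in_4bit = False,  # bf16 LoRA
)

# Attach LoRA adapters.
model = FastLanguageModel.get_peft_model(
    model,
    r = 16,
    lora_alpha = 16,
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj"],
    use_gradient_checkpointing = "unsloth",
)

# Any text SFT dataset works; this one is just an example.
dataset = load_dataset("yahma/alpaca-cleaned", split = "train")

trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset,
    args = SFTConfig(
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 4,
        max_steps = 60,
        learning_rate = 2e-4,
        output_dir = "outputs",
    ),
)
trainer.train()
```

You will need to format the dataset with your chat template before training; see the notebooks for a complete, runnable version.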


Qwen3.5 is a causal language model with a vision encoder (a unified VLM), so ensure you have the usual vision dependencies installed (torchvision, pillow) if needed, and use the latest version of Transformers.

If you'd like to do GRPO, it works in Unsloth if you disable fast vLLM inference and use Unsloth inference instead. Follow our Vision RL notebook examples.


If you OOM:

  • Drop per_device_train_batch_size to 1 and/or reduce max_seq_length.

  • Keep use_gradient_checkpointing="unsloth" on (it’s designed to reduce VRAM use and extend context length).
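The OOM tips above can be sketched as a settings fragment. The names follow trl's SFTConfig and Unsloth's get_peft_model (use_gradient_checkpointing = "unsloth" is passed to get_peft_model, the rest to the trainer config); the exact values are placeholders:

```python
# Hypothetical OOM-mitigation settings for an SFT run.
oom_safe = dict(
    per_device_train_batch_size = 1,   # smallest per-step memory footprint
    gradient_accumulation_steps = 8,   # keeps the effective batch size up
    max_seq_length = 1024,             # shorter context -> less activation memory
    use_gradient_checkpointing = "unsloth",  # goes to get_peft_model, not the trainer
)

# Gradient accumulation preserves the effective batch size while lowering VRAM:
effective_batch = (oom_safe["per_device_train_batch_size"]
                   * oom_safe["gradient_accumulation_steps"])
print(effective_batch)  # -> 8
```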

Loader example for MoE (bf16 LoRA):
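A sketch of the loader, assuming Unsloth's FastLanguageModel API; the model name is a placeholder for the MoE checkpoint you want:

```python
from unsloth import FastLanguageModel

# bf16 (not 4-bit), as recommended for MoE fine-tuning.
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/Qwen3.5-35B-A3B",  # placeholder name
    max_seq_length = 2048,
    load_in_4bit = False,
    # device_map = "balanced",  # uncomment when using multiple GPUs
)
```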

Once loaded, you’ll attach LoRA adapters and train similarly to the SFT example above.

Vision fine-tuning

Unsloth supports vision fine-tuning for the multimodal Qwen3.5 models. Use the Qwen3.5 notebooks below and change the model name to your desired Qwen3.5 model.

Disabling Vision / Text-only fine-tuning:

To fine-tune vision models, we now allow you to select which parts of the model to fine-tune: only the vision layers, only the language layers, or the attention / MLP layers. All of them are enabled by default.
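These switches can be set when attaching the adapters, assuming Unsloth's FastVisionModel API (the model name is a placeholder):

```python
from unsloth import FastVisionModel

model, tokenizer = FastVisionModel.from_pretrained(
    "unsloth/Qwen3.5-4B",  # placeholder name
    load_in_4bit = False,
)

# All four switches are on by default; e.g. turn off the vision layers
# for text-only fine-tuning.
model = FastVisionModel.get_peft_model(
    model,
    finetune_vision_layers     = False,  # text-only fine-tuning
    finetune_language_layers   = True,
    finetune_attention_modules = True,
    finetune_mlp_modules       = True,
    r = 16,
    lora_alpha = 16,
)
```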

To fine-tune Qwen3.5 on multiple images per example, see our multi-image vision guide.

Saving / export fine-tuned model

You can view our specific inference / deployment guides for llama.cpp, vLLM, llama-server, Ollama, LM Studio or SGLang.

Save to GGUF

Unsloth supports saving directly to GGUF:
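For example, with a trained model and tokenizer in scope:

```python
# quantization_method accepts e.g. "q4_k_m", "q8_0", "f16".
model.save_pretrained_gguf("model", tokenizer, quantization_method = "q4_k_m")
```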

Or push GGUFs to Hugging Face:
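With a placeholder repo id and your own Hugging Face token:

```python
model.push_to_hub_gguf(
    "your-username/qwen3.5-finetune",  # placeholder repo id
    tokenizer,
    quantization_method = "q4_k_m",
    token = "hf_...",  # your Hugging Face write token
)
```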

If the exported model behaves worse in another runtime, the most common cause is a mismatched chat template or EOS token at inference time: you must use the same chat template you trained with.
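To illustrate what must match: Qwen-family models use the ChatML format. The helper below is a hand-rolled sketch of that format for illustration only, not the real tokenizer.apply_chat_template:

```python
# Minimal ChatML formatter (illustrative; use the tokenizer's own
# apply_chat_template in practice so tokens match training exactly).
def chatml(messages, add_generation_prompt=True):
    out = ""
    for m in messages:
        out += f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n"
    if add_generation_prompt:
        out += "<|im_start|>assistant\n"  # cue the model to reply
    return out

prompt = chatml([{"role": "user", "content": "Hi"}])
print(prompt)
```

If the runtime formats prompts differently (or stops on the wrong EOS token), generations degrade even though the weights are identical.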

Save to vLLM


To save to 16-bit for vLLM, use:
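With a trained model and tokenizer in scope (the repo id is a placeholder):

```python
# Merge the LoRA adapters into 16-bit weights for vLLM.
model.save_pretrained_merged("model", tokenizer, save_method = "merged_16bit")

# Or push the merged model straight to the Hub:
model.push_to_hub_merged("your-username/qwen3.5-finetune", tokenizer,
                         save_method = "merged_16bit", token = "hf_...")
```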

To save just the LoRA adapters, either use:
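The standard Hugging Face-style calls:

```python
# Saves only the adapter weights and tokenizer files, not the base model.
model.save_pretrained("lora_model")
tokenizer.save_pretrained("lora_model")
```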

Or use our builtin function:
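Assuming Unsloth's merged-save helper with the LoRA-only mode:

```python
model.save_pretrained_merged("model", tokenizer, save_method = "lora")
```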

For more details read our inference guides:
