💜Qwen3.5 - How to Run Locally Guide

Run the new Qwen3.5 LLMs on your local device, including the Medium models (Qwen3.5-35B-A3B, 27B and 122B-A10B), the Small series (Qwen3.5-0.8B, 2B, 4B and 9B), and the flagship 397B-A17B!

Qwen3.5 is Alibaba’s new model family, including Qwen3.5-35B-A3B, 27B, 122B-A10B and 397B-A17B, plus the new Small series: Qwen3.5-0.8B, 2B, 4B and 9B. These multimodal hybrid reasoning LLMs deliver the strongest performance for their size. They support 256K context across 201 languages, offer both thinking and non-thinking modes, and excel in agentic coding, vision, chat, and long-context tasks. The 35B and 27B models work on a 22GB Mac / RAM device. See all GGUFs here.


All uploads use Unsloth Dynamic 2.0 for SOTA quantization performance - so 4-bit has important layers upcasted to 8 or 16-bit. Thank you Qwen for providing Unsloth with day zero access. You can also fine-tune Qwen3.5 with Unsloth.


To enable or disable thinking, see How to enable or disable reasoning & thinking. Qwen3.5 Small models have thinking disabled by default. Also see the LM Studio guide to enable the Think toggle.

35B-A3B • 27B • 122B-A10B • 397B-A17B • 0.8B • 2B • 4B • 9B • Fine-tune Qwen3.5

⚙️ Usage Guide

Table: Inference hardware requirements (units = total memory: RAM + VRAM, or unified memory). Rows run from the smallest to the largest Qwen3.5 model.

| Qwen3.5 | 3-bit | 4-bit | 6-bit | 8-bit | BF16 |
| --- | --- | --- | --- | --- | --- |
|  | 3 GB | 3.5 GB | 5 GB | 7.5 GB | 9 GB |
|  | 4.5 GB | 5.5 GB | 7 GB | 10 GB | 14 GB |
|  | 5.5 GB | 6.5 GB | 9 GB | 13 GB | 19 GB |
|  | 14 GB | 17 GB | 24 GB | 30 GB | 54 GB |
|  | 17 GB | 22 GB | 30 GB | 38 GB | 70 GB |
|  | 60 GB | 70 GB | 106 GB | 132 GB | 245 GB |
|  | 180 GB | 214 GB | 340 GB | 512 GB | 810 GB |


Between 27B and 35B-A3B, use 27B if you want slightly more accurate results and it fits on your device. Go for 35B-A3B if you want much faster inference.

  • Maximum context window: 262,144 (can be extended to 1M via YaRN)

  • presence_penalty = 0.0 to 2.0. By default this is off (0.0). You can raise it to reduce repetitions, but a higher value may result in a slight decrease in performance

  • Adequate Output Length: 32,768 tokens for most queries


If you're getting gibberish, your context length might be set too low. Alternatively, try --cache-type-k bf16 --cache-type-v bf16, which might help.

As Qwen3.5 is hybrid reasoning, thinking and non-thinking mode have different settings:

Thinking mode:

| Setting | General tasks | Precise coding tasks (e.g. WebDev) |
| --- | --- | --- |
| temperature | 1.0 | 0.6 |
| top_p | 0.95 | 0.95 |
| top_k | 20 | 20 |
| min_p | 0.0 | 0.0 |
| presence_penalty | 1.5 | 0.0 |
| repeat_penalty | disabled or 1.0 | disabled or 1.0 |

Thinking mode for general tasks:

Thinking mode for precise coding tasks:
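As a sketch, the thinking-mode presets above map onto llama.cpp sampling flags like this (append the relevant set to your llama-cli or llama-server command; flag names follow current llama.cpp conventions):

```shell
# Thinking mode, general tasks:
#   --temp 1.0 --top-p 0.95 --top-k 20 --min-p 0.0 \
#   --presence-penalty 1.5 --repeat-penalty 1.0

# Thinking mode, precise coding tasks (e.g. WebDev):
#   --temp 0.6 --top-p 0.95 --top-k 20 --min-p 0.0 \
#   --presence-penalty 0.0 --repeat-penalty 1.0
```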

Instruct (non-thinking) mode settings:

| Setting | General tasks | Reasoning tasks |
| --- | --- | --- |
| temperature | 0.7 | 1.0 |
| top_p | 0.8 | 0.95 |
| top_k | 20 | 20 |
| min_p | 0.0 | 0.0 |
| presence_penalty | 1.5 | 1.5 |
| repeat_penalty | disabled or 1.0 | disabled or 1.0 |


Instruct (non-thinking) for general tasks:

Instruct (non-thinking) for reasoning tasks:
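Likewise, a sketch of how the instruct-mode presets translate into llama.cpp sampling flags (append to your llama-cli or llama-server command):

```shell
# Instruct (non-thinking) mode, general tasks:
#   --temp 0.7 --top-p 0.8 --top-k 20 --min-p 0.0 \
#   --presence-penalty 1.5 --repeat-penalty 1.0

# Instruct (non-thinking) mode, reasoning tasks:
#   --temp 1.0 --top-p 0.95 --top-k 20 --min-p 0.0 \
#   --presence-penalty 1.5 --repeat-penalty 1.0
```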

Qwen3.5 Inference Tutorials:

Because Qwen3.5 comes in many different sizes, we'll be using Dynamic 4-bit MXFP4_MOE GGUF variants for all inference workloads. Click below to navigate to each model's instructions:

Qwen3.5-35B-A3B • 27B • 122B-A10B • 397B-A17B • Small (0.8B • 2B • 4B • 9B) • LM Studio

Unsloth Dynamic GGUF uploads:


Qwen3.5-35B-A3B

For this guide we will be utilizing the Dynamic 4-bit quant, which works great on a 24GB RAM / Mac device for fast inference. Because the model is only around 72GB at full F16 precision, we won't need to worry much about performance. GGUF: Qwen3.5-35B-A3B-GGUF

🦙 Llama.cpp Guides

For these tutorials, we will be using llama.cpp for fast local inference, especially if you have a CPU.

1

Obtain the latest llama.cpp on GitHub here. You can follow the build instructions below as well. Change -DGGML_CUDA=ON to -DGGML_CUDA=OFF if you don't have a GPU or just want CPU inference.
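For reference, a build recipe along these lines generally works. Package names assume a Debian/Ubuntu system; adjust for your OS:

```shell
# Install build tools (Debian/Ubuntu assumption; adapt for your OS)
apt-get update && apt-get install -y build-essential cmake curl libcurl4-openssl-dev

# Clone and build llama.cpp with CUDA (set -DGGML_CUDA=OFF for CPU-only)
git clone https://github.com/ggml-org/llama.cpp
cmake llama.cpp -B llama.cpp/build \
    -DBUILD_SHARED_LIBS=OFF -DGGML_CUDA=ON -DLLAMA_CURL=ON
cmake --build llama.cpp/build --config Release -j --clean-first \
    --target llama-cli llama-server
cp llama.cpp/build/bin/llama-* llama.cpp/
```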

2

If you want to use llama.cpp directly to load models, you can do the below. (:Q4_K_M) is the quantization type. You can also download via Hugging Face (point 3). This is similar to ollama run. Use export LLAMA_CACHE="folder" to force llama.cpp to save to a specific location. The model has a maximum of 256K context length.

Follow one of the specific commands below, according to your use-case:

Thinking mode:

Precise coding tasks (e.g. WebDev):

General tasks:

Non-thinking mode:

General tasks:

Reasoning tasks:
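Putting the pieces together, a direct-load command might look like the sketch below, shown with thinking-mode general-task settings. The repo name and quant are assumptions, so match them to the actual Unsloth upload, and swap the sampling flags per the tables above for other use-cases:

```shell
export LLAMA_CACHE="unsloth"    # optional: where downloaded GGUFs are cached
./llama.cpp/llama-cli \
    -hf unsloth/Qwen3.5-35B-A3B-GGUF:UD-Q4_K_XL \
    --jinja --threads -1 --ctx-size 16384 --n-gpu-layers 99 \
    --temp 1.0 --top-p 0.95 --top-k 20 --min-p 0.0 \
    --presence-penalty 1.5 --repeat-penalty 1.0
```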

3

Download the model (after installing the prerequisites with pip install huggingface_hub hf_transfer). You can choose Q4_K_M or other quantized versions like UD-Q4_K_XL. We recommend using at least the 2-bit dynamic quant UD-Q2_K_XL to balance size and accuracy. If downloads get stuck, see: Hugging Face Hub, XET debugging
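A download sketch using the huggingface_hub CLI; the repo name is an assumption, so substitute the model and quant you want:

```shell
pip install huggingface_hub hf_transfer
export HF_HUB_ENABLE_HF_TRANSFER=1    # optional: faster downloads
huggingface-cli download unsloth/Qwen3.5-35B-A3B-GGUF \
    --include "*UD-Q4_K_XL*" \
    --local-dir unsloth/Qwen3.5-35B-A3B-GGUF
```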

4

Then run the model in conversation mode:
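For example (the local file name is an assumption; use whatever path the download produced):

```shell
./llama.cpp/llama-cli \
    --model unsloth/Qwen3.5-35B-A3B-GGUF/Qwen3.5-35B-A3B-UD-Q4_K_XL.gguf \
    --jinja --threads -1 --ctx-size 16384 --n-gpu-layers 99 \
    --temp 1.0 --top-p 0.95 --top-k 20 --min-p 0.0 \
    --presence-penalty 1.5 --repeat-penalty 1.0
```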

Qwen3.5 Small (0.8B • 2B • 4B • 9B)


For the Qwen3.5 Small series, because the models are so small, all you need to do is change the model name in the scripts to the desired variant. For this specific guide we'll be using the 9B parameter variant. To run them all in near full precision, you'll just need a 12GB RAM / VRAM / unified memory device. GGUFs:

1

Obtain the latest llama.cpp on GitHub here. You can follow the build instructions below as well. Change -DGGML_CUDA=ON to -DGGML_CUDA=OFF if you don't have a GPU or just want CPU inference.

2

If you want to use llama.cpp directly to load models, you can do the below. (:Q4_K_XL) is the quantization type. You can also download via Hugging Face (point 3). This is similar to ollama run. Use export LLAMA_CACHE="folder" to force llama.cpp to save to a specific location. The model has a maximum of 256K context length.

Follow one of the specific commands below, according to your use-case:


Thinking mode (disabled by default)


General tasks:


Non-thinking mode is already on by default

General tasks:

Reasoning tasks:
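For instance, a sketch of chatting with the 9B in its default non-thinking mode using the general-task settings (the repo name and quant are assumptions; swap sampling flags per the tables above for other use-cases):

```shell
./llama.cpp/llama-cli \
    -hf unsloth/Qwen3.5-9B-GGUF:UD-Q4_K_XL \
    --jinja --ctx-size 16384 \
    --temp 0.7 --top-p 0.8 --top-k 20 --min-p 0.0 \
    --presence-penalty 1.5 --repeat-penalty 1.0
```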

3

Download the model (after installing the prerequisites with pip install huggingface_hub hf_transfer). You can choose Q4_K_M or other quantized versions like UD-Q4_K_XL. We recommend using at least the 2-bit dynamic quant UD-Q2_K_XL to balance size and accuracy. If downloads get stuck, see: Hugging Face Hub, XET debugging

4

Then run the model in conversation mode:

Qwen3.5-27B

For this guide we will be utilizing the Dynamic 4-bit quant, which works great on an 18GB RAM / Mac device for fast inference. GGUF: Qwen3.5-27B-GGUF

1

Obtain the latest llama.cpp on GitHub here. You can follow the build instructions below as well. Change -DGGML_CUDA=ON to -DGGML_CUDA=OFF if you don't have a GPU or just want CPU inference.

2

If you want to use llama.cpp directly to load models, you can do the below. (:Q4_K_M) is the quantization type. You can also download via Hugging Face (point 3). This is similar to ollama run. Use export LLAMA_CACHE="folder" to force llama.cpp to save to a specific location. The model has a maximum of 256K context length.

Follow one of the specific commands below, according to your use-case:

Thinking mode:

Precise coding tasks (e.g. WebDev):

General tasks:

Non-thinking mode:

General tasks:

Reasoning tasks:

3

Download the model (after installing the prerequisites with pip install huggingface_hub hf_transfer). You can choose MXFP4_MOE or other quantized versions like UD-Q4_K_XL. We recommend using at least the 2-bit dynamic quant UD-Q2_K_XL to balance size and accuracy. If downloads get stuck, see: Hugging Face Hub, XET debugging

4

Then run the model in conversation mode:

Qwen3.5-122B-A10B

For this guide we will be utilizing the Dynamic 4-bit quant, which works great on a 70GB RAM / Mac device for fast inference. GGUF: Qwen3.5-122B-A10B-GGUF

1

Obtain the latest llama.cpp on GitHub here. You can follow the build instructions below as well. Change -DGGML_CUDA=ON to -DGGML_CUDA=OFF if you don't have a GPU or just want CPU inference.

2

If you want to use llama.cpp directly to load models, you can do the below. (:Q4_K_M) is the quantization type. You can also download via Hugging Face (point 3). This is similar to ollama run. Use export LLAMA_CACHE="folder" to force llama.cpp to save to a specific location. The model has a maximum of 256K context length.

Follow one of the specific commands below, according to your use-case:

Thinking mode:

Precise coding tasks (e.g. WebDev):

General tasks:

Non-thinking mode:

General tasks:

Reasoning tasks:

3

Download the model (after installing the prerequisites with pip install huggingface_hub hf_transfer). You can choose MXFP4_MOE (dynamic 4-bit) or other quantized versions like UD-Q4_K_XL. We recommend using at least the 2-bit dynamic quant UD-Q2_K_XL to balance size and accuracy. If downloads get stuck, see: Hugging Face Hub, XET debugging

4

Then run the model in conversation mode:

Qwen3.5-397B-A17B

Qwen3.5-397B-A17B is in the same performance tier as Gemini 3 Pro, Claude Opus 4.5, and GPT-5.2. The full 397B checkpoint is ~807GB on disk, but via Unsloth's 397B GGUFs you can run:

  • 3-bit: fits on 192GB RAM systems (e.g., a 192GB Mac)

  • 4-bit (MXFP4): fits on 256GB RAM. Unsloth 4-bit dynamic UD-Q4_K_XL is ~214GB on disk - loads directly on a 256GB M3 Ultra

  • Runs on a single 24GB GPU + 256GB system RAM via MoE offloading, reaching 25+ tokens/s

  • 8-bit needs ~512GB RAM/VRAM
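The MoE-offloading setup above can be sketched as follows: keep the attention and shared layers on the 24GB GPU while pushing the MoE expert tensors to system RAM with an override pattern. The model path is a placeholder, and the tensor regex is an assumption based on common llama.cpp expert-tensor naming; adjust both to your files:

```shell
./llama.cpp/llama-cli \
    --model path/to/Qwen3.5-397B-A17B-UD-Q4_K_XL.gguf \
    --jinja --threads 32 --ctx-size 16384 --n-gpu-layers 99 \
    -ot ".ffn_.*_exps.=CPU" \
    --temp 1.0 --top-p 0.95 --top-k 20 --min-p 0.0 --presence-penalty 1.5
```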


See the 397B quantization benchmarks for how Unsloth GGUFs perform.

1

Obtain the latest llama.cpp on GitHub here. You can follow the build instructions below as well. Change -DGGML_CUDA=ON to -DGGML_CUDA=OFF if you don't have a GPU or just want CPU inference.

2

If you want to use llama.cpp directly to load models, you can do the below. (:Q4_K_M) is the quantization type. You can also download via Hugging Face (point 3). This is similar to ollama run. Use export LLAMA_CACHE="folder" to force llama.cpp to save to a specific location. Remember the model has a maximum of 256K context length.

Follow this for thinking mode:

Follow this for non-thinking mode:

3

Download the model (after installing the prerequisites with pip install huggingface_hub hf_transfer). You can choose MXFP4_MOE (dynamic 4-bit) or other quantized versions like UD-Q4_K_XL. We recommend using at least the 2-bit dynamic quant UD-Q2_K_XL to balance size and accuracy. If downloads get stuck, see: Hugging Face Hub, XET debugging

4

You can edit --threads 32 to set the number of CPU threads, --ctx-size 16384 for context length, and --n-gpu-layers 2 for how many layers to offload to the GPU. Try lowering it if your GPU runs out of memory, and remove it entirely for CPU-only inference.

👾 LM Studio Guide

For this guide, we'll be using LM Studio, a unified UI for running LLMs. The '💡Thinking' and 'Non-thinking' toggle may not appear by default, so we'll need some extra steps to get it working.

1

Download LM Studio for your device. Then open Model Search, search for 'unsloth/qwen3.5', and download the GGUF (quant) that you desire.

2

Thinking Toggle instructions: after downloading, open your Terminal / PowerShell and try lms --help. If the LM Studio CLI responds normally with its list of commands, run:

This fetches a yaml file which makes the '💡Thinking' and 'Non-thinking' toggle appear for your GGUF. You can change 4b to the quant you'd like to have.

Otherwise, you can go to our LM Studio page and download the specific yaml file.

3

Restart LM Studio, then load your downloaded model (the variant matching the yaml you downloaded). You should now see the Thinking toggle enabled. Don't forget to set the correct parameters.

🦙 Llama-server serving & OpenAI's completion library

To deploy Qwen3.5-397B-A17B for production, we use llama-server. In a new terminal (say via tmux), deploy the model via:
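A serving sketch; the model path, port, and offload pattern are assumptions to tune to your hardware:

```shell
./llama.cpp/llama-server \
    --model path/to/Qwen3.5-397B-A17B-UD-Q4_K_XL.gguf \
    --jinja --threads -1 --ctx-size 16384 --n-gpu-layers 99 \
    -ot ".ffn_.*_exps.=CPU" \
    --host 0.0.0.0 --port 8001
```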

Then in a new terminal, after doing pip install openai, do:
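A minimal client sketch, assuming your llama-server is listening on port 8001 (llama-server exposes an OpenAI-compatible /v1 endpoint, and the model name is an arbitrary label for a single-model server):

```python
from openai import OpenAI

# base_url / port are assumptions; match your llama-server invocation.
client = OpenAI(base_url="http://127.0.0.1:8001/v1", api_key="sk-no-key-needed")

response = client.chat.completions.create(
    model="qwen3.5-397b-a17b",  # arbitrary label for a single-model server
    messages=[{"role": "user", "content": "What is 2+2?"}],
    temperature=1.0,
    top_p=0.95,
)
print(response.choices[0].message.content)
```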

🤔 How to enable or disable reasoning & thinking

For the commands below, you can swap 'true' and 'false' as needed. To get the Think toggle in LM Studio, read our guide.


To disable thinking / reasoning, use within llama-server:

If you're on Windows or Powershell, use: --chat-template-kwargs "{\"enable_thinking\":false}"
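On Linux / macOS the same flag can be passed with single quotes; a sketch (the model path is a placeholder):

```shell
./llama.cpp/llama-server \
    --model path/to/your-Qwen3.5-GGUF.gguf \
    --jinja \
    --chat-template-kwargs '{"enable_thinking":false}'
```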


To enable thinking / reasoning, use within llama-server:

If you're on Windows or Powershell, use: --chat-template-kwargs "{\"enable_thinking\":true}"


As an example for Qwen3.5-9B to enable thinking (default is disabled):
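A sketch of that command; the repo name and quant are assumptions, shown with thinking-mode general-task settings:

```shell
./llama.cpp/llama-server \
    -hf unsloth/Qwen3.5-9B-GGUF:UD-Q4_K_XL \
    --jinja --ctx-size 16384 \
    --chat-template-kwargs '{"enable_thinking":true}' \
    --temp 1.0 --top-p 0.95 --top-k 20 --min-p 0.0 --presence-penalty 1.5
```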

And then in Python:

👨‍💻 OpenAI Codex & Claude Code

To run the model for local agentic coding workloads, you can follow our guide. Just change the model name 'GLM-4.7-Flash' to your desired 'Qwen3.5' variant and ensure you follow the correct Qwen3.5 parameters and usage instructions. Use the llama-server we just set up.

After following the instructions for Claude Code, for example, you will see:

We can then ask, say, 'Create a Python game for Chess':

🔨Tool Calling with Qwen3.5

See the Tool Calling Guide for more details on how to do tool calling. In a new terminal (if using tmux, use CTRL+B+D), we create some tools, like adding 2 numbers, executing Python code, executing Linux commands and much more:
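As an illustration (a hypothetical helper, not the exact tools from the guide), a tool plus its OpenAI-style function-calling schema might look like:

```python
# Hypothetical example tool: add two numbers.
def add_two_numbers(a: float, b: float) -> float:
    """Return the sum of a and b."""
    return a + b

# OpenAI-style function-calling schema describing the tool to the model.
TOOLS = [{
    "type": "function",
    "function": {
        "name": "add_two_numbers",
        "description": "Add two numbers and return the sum.",
        "parameters": {
            "type": "object",
            "properties": {
                "a": {"type": "number", "description": "First number"},
                "b": {"type": "number", "description": "Second number"},
            },
            "required": ["a", "b"],
        },
    },
}]

# Registry mapping tool names to callables.
AVAILABLE_TOOLS = {"add_two_numbers": add_two_numbers}
```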

We then use the functions below (copy, paste, and execute), which parse the function calls automatically and call the OpenAI endpoint for any model:
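A parsing sketch: it assumes the model emits Qwen-style <tool_call>{...}</tool_call> blocks containing JSON, a tag format that follows earlier Qwen releases and is an assumption here:

```python
import json
import re

# Matches <tool_call>{ ... }</tool_call> blocks in model output (format assumed).
TOOL_CALL_RE = re.compile(r"<tool_call>\s*(\{.*?\})\s*</tool_call>", re.DOTALL)

def parse_tool_calls(text):
    """Extract (name, arguments) pairs from <tool_call> blocks in model output."""
    calls = []
    for raw in TOOL_CALL_RE.findall(text):
        call = json.loads(raw)
        calls.append((call["name"], call.get("arguments", {})))
    return calls

def run_tool_calls(text, available_tools):
    """Execute each parsed tool call against a name -> callable registry."""
    return [available_tools[name](**args) for name, args in parse_tool_calls(text)]
```

Each result would then be sent back to the model as a role 'tool' message so it can compose its final answer.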

After launching Qwen3.5 via llama-server as above (or see the Tool Calling Guide for more details), we can then make some tool calls.

📊 Benchmarks

Unsloth GGUF Benchmarks

Our updated Qwen3.5-35B Unsloth Dynamic quants are SOTA at nearly all bit-widths. We ran over 150 KL Divergence benchmarks, totaling 9TB of GGUFs, and uploaded all research artifacts. We also fixed a tool-calling chat template bug (which affects all quant uploaders).

  • All GGUFs now updated with an improved quantization algorithm.

  • All use our new imatrix data, with improvements in chat, coding, long-context, and tool-calling use-cases.

  • Qwen3.5-35B-A3B GGUFs are updated to use the new fixes (122B and 27B are still converting; re-download once they are updated)

  • 99.9% KL Divergence results show SOTA placement on the Pareto frontier for UD-Q4_K_XL, IQ3_XXS & more.

  • Retiring MXFP4 from all GGUF quants: Q2_K_XL, Q3_K_XL and Q4_K_XL, except for pure MXFP4_MOE.

35B-A3B - KLD benchmarks (lower is better)
122B-A10B - KLD benchmarks (lower is better)

READ OUR DETAILED QWEN3.5 ANALYSIS + BENCHMARKS HERE:

Qwen3.5 GGUF Benchmarks

Qwen3.5-397B-A17B Benchmarks

Benjamin Marie (third-party) benchmarked Qwen3.5-397B-A17B using Unsloth GGUFs on a 750-prompt mixed suite (LiveCodeBench v6, MMLU Pro, GPQA, Math500), reporting both overall accuracy and relative error increase (how much more often the quantized model makes mistakes vs. the original).

Key results (accuracy; change vs. original; relative error increase):

  • Original weights: 81.3%

  • UD-Q4_K_XL: 80.5% (−0.8 points; +4.3% relative error increase)

  • UD-Q3_K_XL: 80.7% (−0.6 points; +3.5% relative error increase)

UD-Q4_K_XL and UD-Q3_K_XL stay extremely close to the original, well under a 1-point accuracy drop on this suite, which suggests you can sharply reduce the memory footprint (~500 GB less) with little to no practical loss on the tested tasks.

How to choose: Q3 scoring slightly higher than Q4 here is completely plausible as normal run-to-run variance at this scale, so treat Q3 and Q4 as effectively similar quality in this benchmark:

  • Pick Q3 if you want the smallest footprint / best memory savings

  • Pick Q4 if you want a slightly more conservative option with similar results

All listed quants utilize our dynamic methodology. Even UD-IQ2_M uses the same dynamic methodology, but its conversion process differs from UD-Q2_K_XL's. UD-Q2_K_XL is usually faster than UD-IQ2_M even though it's bigger, which is why UD-IQ2_M may perform better than UD-Q2_K_XL.

Official Qwen Benchmarks

Qwen3.5-35B-A3B, 27B and 122B-A10B Benchmarks

Qwen3.5-4B and 9B Benchmarks

Qwen3.5-397B-A17B Benchmarks
