💜Qwen3.5 - How to Run Locally Guide
Run the new Qwen3.5 LLMs on your local device, including the Medium models (Qwen3.5-35B-A3B, 27B, 122B-A10B), the Small series (Qwen3.5-0.8B, 2B, 4B, 9B) and the flagship 397B-A17B!
Qwen3.5 is Alibaba's new model family, including Qwen3.5-35B-A3B, 27B, 122B-A10B and 397B-A17B, plus the new Small series: Qwen3.5-0.8B, 2B, 4B and 9B. These multimodal hybrid reasoning LLMs deliver the strongest performance for their size. They support 256K context across 201 languages, offer both thinking and non-thinking modes, and excel at agentic coding, vision, chat, and long-context tasks. The 35B and 27B models run on a 22GB RAM device or Mac. See all GGUFs here.
Mar 5 Update: Redownload Qwen3.5-35B, 27B, 122B and 397B.
All GGUFs now updated with an improved quantization algorithm.
All use our new imatrix data. See some improvements in chat, coding, long context, and tool-calling use-cases.
Tool-calling improved following our chat template fixes. Fix is universal and applies to any Qwen3.5 format and any uploader.
Check new GGUF benchmarks for Unsloth performance results + our MXFP4 investigation.
We're retiring MXFP4 layers from 3 Qwen3.5 GGUFs: Q2_K_XL, Q3_K_XL and Q4_K_XL.
All uploads use Unsloth Dynamic 2.0 for SOTA quantization performance, so 4-bit quants have important layers upcast to 8 or 16-bit. Thank you to Qwen for providing Unsloth with day-zero access. You can also fine-tune Qwen3.5 with Unsloth.
To enable or disable thinking, see How to enable or disable reasoning & thinking. Qwen3.5 Small models disable thinking by default. Also see the LM Studio guide to enable the Think toggle.
35B-A3B • 27B • 122B-A10B • 397B-A17B • Fine-tune Qwen3.5 • 0.8B • 2B • 4B • 9B
⚙️ Usage Guide
Table: Inference hardware requirements (units = total memory: RAM + VRAM, or unified memory)
For best performance, make sure your total available memory (VRAM + system RAM) exceeds the size of the quantized model file you’re downloading. If it doesn’t, llama.cpp can still run via SSD/HDD offloading, but inference will be slower.
Between 27B and 35B-A3B, use 27B if you want slightly more accurate results or can't fit 35B-A3B on your device. Go for 35B-A3B if you want much faster inference.
Recommended Settings
Maximum context window: 262,144 tokens (can be extended to 1M via YaRN)
presence_penalty = 0.0 to 2.0: off by default; a higher value reduces repetition but may slightly decrease performance
Adequate output length: 32,768 tokens for most queries
If you're getting gibberish, your context length might be set too low. You can also try --cache-type-k bf16 --cache-type-v bf16, which might help.
Since Qwen3.5 is a hybrid reasoning model, thinking and non-thinking modes use different settings:
Thinking mode for general tasks:
temperature = 1.0
top_p = 0.95
top_k = 20
min_p = 0.0
presence_penalty = 1.5
repeat penalty = disabled or 1.0

Thinking mode for precise coding tasks:
temperature = 0.6
top_p = 0.95
top_k = 20
min_p = 0.0
presence_penalty = 0.0
repeat penalty = disabled or 1.0
Instruct (non-thinking) mode settings:

Instruct (non-thinking) for general tasks:
temperature = 0.7
top_p = 0.8
top_k = 20
min_p = 0.0
presence_penalty = 1.5
repeat penalty = disabled or 1.0

Instruct (non-thinking) for reasoning tasks:
temperature = 1.0
top_p = 0.95
top_k = 20
min_p = 0.0
presence_penalty = 1.5
repeat penalty = disabled or 1.0

To disable thinking / reasoning, use --chat-template-kwargs '{"enable_thinking":false}'
If you're on Windows PowerShell, use: --chat-template-kwargs "{\"enable_thinking\":false}"
Swap 'true' and 'false' as needed.
For Qwen3.5 0.8B, 2B, 4B and 9B, reasoning is disabled by default. To enable it, use: --chat-template-kwargs '{"enable_thinking":true}'
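As a concrete sketch, the flag is passed straight to llama-server at launch. The repo and quant tag below are illustrative — substitute the GGUF you actually use:

```shell
# Launch llama-server with thinking disabled (model/quant tag is illustrative).
./llama-server \
    -hf unsloth/Qwen3.5-35B-A3B-GGUF:Q4_K_M \
    --jinja \
    --chat-template-kwargs '{"enable_thinking":false}' \
    --port 8001
```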
Qwen3.5 Inference Tutorials:
Because Qwen3.5 comes in many different sizes, we'll be using Dynamic 4-bit MXFP4_MOE GGUF variants for all inference workloads. Click below to jump to the instructions for your model:
Qwen3.5-35B-A3B • 27B • 122B-A10B • 397B-A17B • Small (0.8B • 2B • 4B • 9B) • LM Studio
Unsloth Dynamic GGUF uploads:
Currently no Qwen3.5 GGUFs work in Ollama due to their separate mmproj vision files. Use llama.cpp-compatible backends instead.
Qwen3.5-35B-A3B
For this guide we will be using the Dynamic 4-bit quant, which works great on a 24GB RAM device or Mac for fast inference. The model is around 72GB at full F16 precision, so the 4-bit quant brings it down to a size that runs comfortably. GGUF: Qwen3.5-35B-A3B-GGUF
🦙 Llama.cpp Guides
For these tutorials, we will be using llama.cpp for fast local inference, which works well even if you only have a CPU.
Obtain the latest llama.cpp on GitHub here. You can follow the build instructions below as well. Change -DGGML_CUDA=ON to -DGGML_CUDA=OFF if you don't have a GPU or just want CPU inference.
If you want llama.cpp to download and load the model directly, you can do the following — the :Q4_K_M suffix selects the quantization type. You can also download the model via Hugging Face (point 3). This works similarly to ollama run. Use export LLAMA_CACHE="folder" to make llama.cpp save models to a specific location. The model has a maximum context length of 256K.
Follow one of the specific commands below, according to your use-case:
Thinking mode:
Precise coding tasks (e.g. WebDev):
General tasks:
Non-thinking mode:
General tasks:
Reasoning tasks:
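Putting the recommended sampling settings together, a thinking-mode invocation for general tasks might look like the sketch below (the quant tag and context size are assumptions — adjust to your hardware):

```shell
# Thinking mode, general tasks: temp 1.0, top_p 0.95, top_k 20, presence_penalty 1.5.
./llama-cli \
    -hf unsloth/Qwen3.5-35B-A3B-GGUF:Q4_K_M \
    --jinja \
    --ctx-size 16384 \
    --temp 1.0 --top-p 0.95 --top-k 20 --min-p 0.0 --presence-penalty 1.5
```

For precise coding tasks, swap in --temp 0.6 --presence-penalty 0.0 per the recommended settings above.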
Download the model (after installing the prerequisites via pip install huggingface_hub hf_transfer). You can choose Q4_K_M or other quantized versions like UD-Q4_K_XL. We recommend at least the 2-bit dynamic quant UD-Q2_K_XL to balance size and accuracy. If downloads get stuck, see: Hugging Face Hub, XET debugging
Then run the model in conversation mode:
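If you downloaded the files locally, a conversation-mode launch might look like this sketch (the path below is illustrative of the usual GGUF folder layout):

```shell
# -cnv starts interactive conversation mode using the model's chat template.
./llama-cli \
    -m Qwen3.5-35B-A3B-GGUF/UD-Q4_K_XL/Qwen3.5-35B-A3B-UD-Q4_K_XL.gguf \
    --jinja -cnv \
    --ctx-size 16384 \
    --temp 1.0 --top-p 0.95 --top-k 20 --presence-penalty 1.5
```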
Qwen3.5 Small (0.8B • 2B • 4B • 9B)
For Qwen3.5 0.8B, 2B, 4B and 9B, reasoning is disabled by default. To enable it, use: --chat-template-kwargs '{"enable_thinking":true}'
On Windows use: --chat-template-kwargs "{\"enable_thinking\":true}"
For the Qwen3.5 Small series, because they're so small, all you need to do is change the model name in the scripts to the desired variant. For this specific guide we'll be using the 9B variant. To run any of them at near full precision, you'll only need a device with 12GB of RAM / VRAM / unified memory. GGUFs:
Obtain the latest llama.cpp on GitHub here. You can follow the build instructions below as well. Change -DGGML_CUDA=ON to -DGGML_CUDA=OFF if you don't have a GPU or just want CPU inference.
If you want llama.cpp to download and load the model directly, you can do the following — the :Q4_K_XL suffix selects the quantization type. You can also download the model via Hugging Face (point 3). This works similarly to ollama run. Use export LLAMA_CACHE="folder" to make llama.cpp save models to a specific location. The model has a maximum context length of 256K.
Follow one of the specific commands below, according to your use-case:
To use a variant other than 9B, change '9B' to 0.8B, 2B or 4B.
Thinking mode (disabled by default)
Qwen3.5 Small models disable thinking by default. Use llama-server with --chat-template-kwargs '{"enable_thinking":true}' to enable it.
General tasks:
To use a variant other than 9B, change '9B' to 0.8B, 2B or 4B.
Non-thinking mode is already on by default
General tasks:
Reasoning tasks:
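Since the Small series ships with thinking off, a 9B launch with thinking explicitly enabled might look like this sketch (repo and quant tag are assumptions):

```shell
# Enable thinking on Qwen3.5-9B via the chat template kwargs.
./llama-server \
    -hf unsloth/Qwen3.5-9B-GGUF:Q4_K_XL \
    --jinja \
    --chat-template-kwargs '{"enable_thinking":true}' \
    --temp 1.0 --top-p 0.95 --top-k 20 --presence-penalty 1.5
```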
Download the model (after installing the prerequisites via pip install huggingface_hub hf_transfer). You can choose Q4_K_M or other quantized versions like UD-Q4_K_XL. We recommend at least the 2-bit dynamic quant UD-Q2_K_XL to balance size and accuracy. If downloads get stuck, see: Hugging Face Hub, XET debugging
Then run the model in conversation mode:
Qwen3.5-27B
For this guide we will be using the Dynamic 4-bit quant, which works great on an 18GB RAM device or Mac for fast inference. GGUF: Qwen3.5-27B-GGUF
Obtain the latest llama.cpp on GitHub here. You can follow the build instructions below as well. Change -DGGML_CUDA=ON to -DGGML_CUDA=OFF if you don't have a GPU or just want CPU inference.
If you want llama.cpp to download and load the model directly, you can do the following — the :Q4_K_M suffix selects the quantization type. You can also download the model via Hugging Face (point 3). This works similarly to ollama run. Use export LLAMA_CACHE="folder" to make llama.cpp save models to a specific location. The model has a maximum context length of 256K.
Follow one of the specific commands below, according to your use-case:
Thinking mode:
Precise coding tasks (e.g. WebDev):
General tasks:
Non-thinking mode:
General tasks:
Reasoning tasks:
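For example, a 27B thinking-mode run tuned for precise coding might look like this sketch (quant tag assumed):

```shell
# Precise coding tasks: lower temperature and no presence penalty.
./llama-cli \
    -hf unsloth/Qwen3.5-27B-GGUF:Q4_K_M \
    --jinja \
    --temp 0.6 --top-p 0.95 --top-k 20 --min-p 0.0 --presence-penalty 0.0
```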
Download the model (after installing the prerequisites via pip install huggingface_hub hf_transfer). You can choose MXFP4_MOE or other quantized versions like UD-Q4_K_XL. We recommend at least the 2-bit dynamic quant UD-Q2_K_XL to balance size and accuracy. If downloads get stuck, see: Hugging Face Hub, XET debugging
Then run the model in conversation mode:
Qwen3.5-122B-A10B
For this guide we will be utilizing Dynamic 4-bit which works great on a 70GB RAM / Mac device for fast inference. GGUF: Qwen3.5-122B-A10B-GGUF
Obtain the latest llama.cpp on GitHub here. You can follow the build instructions below as well. Change -DGGML_CUDA=ON to -DGGML_CUDA=OFF if you don't have a GPU or just want CPU inference.
If you want llama.cpp to download and load the model directly, you can do the following — the :Q4_K_M suffix selects the quantization type. You can also download the model via Hugging Face (point 3). This works similarly to ollama run. Use export LLAMA_CACHE="folder" to make llama.cpp save models to a specific location. The model has a maximum context length of 256K.
Follow one of the specific commands below, according to your use-case:
Thinking mode:
Precise coding tasks (e.g. WebDev):
General tasks:
Non-thinking mode:
General tasks:
Reasoning tasks:
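A sketch using the smallest recommended dynamic quant, UD-Q2_K_XL (repo and quant tag are assumptions):

```shell
# 122B-A10B in thinking mode with the 2-bit dynamic quant.
./llama-cli \
    -hf unsloth/Qwen3.5-122B-A10B-GGUF:UD-Q2_K_XL \
    --jinja --ctx-size 16384 \
    --temp 1.0 --top-p 0.95 --top-k 20 --presence-penalty 1.5
```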
Download the model (after installing the prerequisites via pip install huggingface_hub hf_transfer). You can choose MXFP4_MOE (dynamic 4-bit) or other quantized versions like UD-Q4_K_XL. We recommend at least the 2-bit dynamic quant UD-Q2_K_XL to balance size and accuracy. If downloads get stuck, see: Hugging Face Hub, XET debugging
Then run the model in conversation mode:
Qwen3.5-397B-A17B
Qwen3.5-397B-A17B is in the same performance tier as Gemini 3 Pro, Claude Opus 4.5, and GPT-5.2. The full 397B checkpoint is ~807GB on disk, but via Unsloth's 397B GGUFs you can run:
3-bit: fits on 192GB RAM systems (e.g., a 192GB Mac)
4-bit (MXFP4): fits on 256GB RAM. Unsloth 4-bit dynamic UD-Q4_K_XL is ~214GB on disk - loads directly on a 256GB M3 Ultra
Runs on a single 24GB GPU + 256GB system RAM via MoE offloading, reaching 25+ tokens/s
8-bit needs ~512GB RAM/VRAM
See 397B quantization benchmarks on how Unsloth GGUFs perform.
Obtain the latest llama.cpp on GitHub here. You can follow the build instructions below as well. Change -DGGML_CUDA=ON to -DGGML_CUDA=OFF if you don't have a GPU or just want CPU inference.
If you want llama.cpp to download and load the model directly, you can do the following — the :Q4_K_M suffix selects the quantization type. You can also download the model via Hugging Face (point 3). This works similarly to ollama run. Use export LLAMA_CACHE="folder" to make llama.cpp save models to a specific location. Remember the model has a maximum context length of 256K.
Follow this for thinking mode:
Follow this for non-thinking mode:
Download the model (after installing the prerequisites via pip install huggingface_hub hf_transfer). You can choose MXFP4_MOE (dynamic 4-bit) or other quantized versions like UD-Q4_K_XL. We recommend at least the 2-bit dynamic quant UD-Q2_K_XL to balance size and accuracy. If downloads get stuck, see: Hugging Face Hub, XET debugging
You can edit --threads 32 to set the number of CPU threads, --ctx-size 16384 for context length, and --n-gpu-layers 2 for how many layers to offload to the GPU. Try lowering it if your GPU runs out of memory, and remove it for CPU-only inference.
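Combining those flags with llama.cpp's tensor overrides, the single-GPU + large-RAM setup described above might be sketched as follows (the multi-part file name is illustrative):

```shell
# Offload all layers to the GPU, but pin MoE expert tensors to CPU RAM via -ot.
./llama-cli \
    -m Qwen3.5-397B-A17B-UD-Q4_K_XL-00001-of-00005.gguf \
    --threads 32 --ctx-size 16384 \
    --n-gpu-layers 99 \
    -ot ".ffn_.*_exps.=CPU" \
    --jinja -cnv
```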
👾 LM Studio Guide
For this guide, we'll be using LM Studio, a unified UI for running LLMs. The '💡Thinking' and 'Non-thinking' toggle may not appear by default, so we'll need some extra steps to get it working.
Download LM Studio for your device. Then open Model Search, search for 'unsloth/qwen3.5', and download the GGUF (quant) that you desire.

Thinking Toggle instructions: After downloading, open your Terminal / PowerShell and try: lms --help. If LM Studio responds normally with its list of commands, run:
This fetches a yaml file that enables the '💡Thinking' and 'Non-thinking' toggle for your GGUF. You can change 4b to the quant you'd like to use.

Otherwise, you can go to our LM Studio page and download the specific yaml file.
Restart LM Studio, then load your downloaded model (with the specific thinking toggle you downloaded). You should now see the Thinking toggle enabled. Don't forget to set the correct parameters.

🦙 Llama-server serving & OpenAI's completion library
To deploy Qwen3.5-397B-A17B for production, we use llama-server. In a new terminal (for example inside tmux), deploy the model via:
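A deployment sketch — the path, host and port are assumptions to adapt to your setup:

```shell
# Serve an OpenAI-compatible endpoint on port 8001, with MoE experts kept in CPU RAM.
./llama-server \
    -m Qwen3.5-397B-A17B-UD-Q4_K_XL-00001-of-00005.gguf \
    --threads 32 --ctx-size 16384 --n-gpu-layers 99 \
    -ot ".ffn_.*_exps.=CPU" \
    --host 0.0.0.0 --port 8001
```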
Then in a new terminal, after doing pip install openai, do:
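A minimal client sketch, assuming the server above is listening on port 8001 (the model field is an arbitrary label when talking to llama-server):

```python
from openai import OpenAI

# llama-server ignores the API key, but the client requires one to be set.
client = OpenAI(base_url="http://127.0.0.1:8001/v1", api_key="sk-local")

response = client.chat.completions.create(
    model="qwen3.5-397b-a17b",  # arbitrary label; llama-server serves one model
    messages=[{"role": "user", "content": "Solve x^2 + 2x + 1 = 0."}],
    temperature=1.0,
    top_p=0.95,
)
print(response.choices[0].message.content)
```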
🤔 How to enable or disable reasoning & thinking
For the commands below, you can swap 'true' and 'false' as needed. To get the Think toggle in LM Studio, read our guide.
To disable thinking / reasoning, use within llama-server:
If you're on Windows PowerShell, use: --chat-template-kwargs "{\"enable_thinking\":false}"
To enable thinking / reasoning, use within llama-server:
If you're on Windows PowerShell, use: --chat-template-kwargs "{\"enable_thinking\":true}"
For Qwen3.5 0.8B, 2B, 4B and 9B, reasoning is disabled by default. To enable it, use: --chat-template-kwargs '{"enable_thinking":true}'
And on Windows or Powershell: --chat-template-kwargs "{\"enable_thinking\":true}"
As an example for Qwen3.5-9B to enable thinking (default is disabled):
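For instance (repo and quant tag assumed; adjust the port as needed):

```shell
# Qwen3.5-9B served with thinking explicitly enabled.
./llama-server \
    -hf unsloth/Qwen3.5-9B-GGUF:Q4_K_XL \
    --jinja \
    --chat-template-kwargs '{"enable_thinking":true}' \
    --port 8001
```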
And then in Python:
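A sketch assuming the 9B server above is on port 8001 — streaming lets the thinking tokens appear as they are generated:

```python
from openai import OpenAI

client = OpenAI(base_url="http://127.0.0.1:8001/v1", api_key="sk-local")

# Stream the response so the reasoning shows up incrementally.
stream = client.chat.completions.create(
    model="qwen3.5-9b",  # arbitrary label for llama-server
    messages=[{"role": "user", "content": "What is 17 * 24?"}],
    temperature=1.0,
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta
    if delta.content:
        print(delta.content, end="", flush=True)
```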

👨💻 OpenAI Codex & Claude Code
To run the model in local agentic coding workloads, you can follow our guide. Just change the model name 'GLM-4.7-Flash' to your desired Qwen3.5 variant and make sure you follow the correct Qwen3.5 parameters and usage instructions. Use the llama-server we set up above.
After following the instructions for Claude Code, for example, you will see:

We can then ask, say, Create a Python game for Chess:



🔨Tool Calling with Qwen3.5
See the Tool Calling Guide for more details on how to do tool calling. In a new terminal (if using tmux, detach with CTRL+B then D), we create some tools, like adding 2 numbers, executing Python code, executing Linux functions and much more:
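As an illustration, two toy tools and an OpenAI-style schema for one of them might look like this — the names and schemas below are examples, not the guide's exact tools:

```python
# A registry of local Python functions the model is allowed to call.
def add_two_numbers(a: float, b: float) -> float:
    return a + b

def execute_python(expression: str) -> str:
    # Demo only: evaluates a bare expression with builtins stripped.
    # Never eval untrusted model output in production.
    return str(eval(expression, {"__builtins__": {}}, {}))

TOOLS = {"add_two_numbers": add_two_numbers, "execute_python": execute_python}

# OpenAI-style schema advertised to the endpoint so the model knows the signature.
TOOL_SPECS = [{
    "type": "function",
    "function": {
        "name": "add_two_numbers",
        "description": "Add two numbers and return the sum.",
        "parameters": {
            "type": "object",
            "properties": {"a": {"type": "number"}, "b": {"type": "number"}},
            "required": ["a", "b"],
        },
    },
}]
```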
We then use the functions below (copy, paste and execute), which parse the function calls automatically and call the OpenAI endpoint for any model:
After launching Qwen3.5 via llama-server as above (or see the Tool Calling Guide for more details), we can then make some tool calls.
📊 Benchmarks
Unsloth GGUF Benchmarks
Our updated Qwen3.5-35B Unsloth Dynamic quants are SOTA at nearly every bit-width. We ran over 150 KL Divergence benchmarks totaling 9TB of GGUFs and uploaded all research artifacts. We also fixed a tool-calling chat template bug (which affects all quant uploaders).
All GGUFs now updated with an improved quantization algorithm.
All use our new imatrix data. See some improvements in chat, coding, long context, and tool-calling use-cases.
Qwen3.5-35B-A3B GGUFs are updated to use the new fixes (122B and 27B are still converting; re-download once they are updated)
99.9% KL Divergence shows SOTA results on the Pareto frontier for UD-Q4_K_XL, IQ3_XXS & more.
Retiring MXFP4 from all GGUF quants: Q2_K_XL, Q3_K_XL and Q4_K_XL, except for pure MXFP4_MOE.


READ OUR DETAILED QWEN3.5 ANALYSIS + BENCHMARKS HERE:
Qwen3.5 GGUF Benchmarks • Qwen3.5-397B-A17B Benchmarks

Benjamin Marie (third-party) benchmarked Qwen3.5-397B-A17B using Unsloth GGUFs on a 750-prompt mixed suite (LiveCodeBench v6, MMLU Pro, GPQA, Math500), reporting both overall accuracy and relative error increase (how much more often the quantized model makes mistakes vs. the original).
Key results (accuracy; change vs. original; relative error increase):
Original weights: 81.3%
UD-Q4_K_XL: 80.5% (−0.8 points; +4.3% relative error increase)
UD-Q3_K_XL: 80.7% (−0.6 points; +3.5% relative error increase)
UD-Q4_K_XL and UD-Q3_K_XL stay extremely close to the original, well under a 1-point accuracy drop on this suite, from which Benjamin concludes that you can sharply reduce the memory footprint (~500 GB less) with little to no practical loss on the tested tasks.
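To make the metric concrete, "relative error increase" compares error rates (100 − accuracy) rather than accuracies. A quick sketch reproducing the UD-Q4_K_XL figure from the numbers above (the Q3 figure lands slightly off, likely because it was computed from unrounded accuracies):

```python
def relative_error_increase(quant_acc: float, orig_acc: float) -> float:
    """Percent increase in error rate (100 - accuracy) versus the original model."""
    quant_err = 100.0 - quant_acc
    orig_err = 100.0 - orig_acc
    return (quant_err / orig_err - 1.0) * 100.0

# UD-Q4_K_XL vs original: (19.5 / 18.7 - 1) * 100
print(round(relative_error_increase(80.5, 81.3), 1))  # → 4.3
```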
How to choose: Q3 scoring slightly higher than Q4 here is entirely plausible as normal run-to-run variance at this scale, so treat Q3 and Q4 as effectively similar quality on this benchmark:
Pick Q3 if you want the smallest footprint / best memory savings
Pick Q4 if you want a slightly more conservative option with similar results
All listed quants use our dynamic methodology. Even UD-IQ2_M uses the same dynamic methodology, though its conversion process differs from UD-Q2_K_XL's: UD-Q2_K_XL usually runs faster than UD-IQ2_M even though it's bigger, while UD-IQ2_M may perform better than UD-Q2_K_XL despite being smaller.
Official Qwen Benchmarks
Qwen3.5-35B-A3B, 27B and 122B-A10B Benchmarks

Qwen3.5-4B and 9B Benchmarks

Qwen3.5-397B-A17B Benchmarks
