DiffusionGemma - How to Run Locally

DiffusionGemma 26B-A4B is Google DeepMind’s new open multimodal model, built on the Gemma 4 MoE architecture. With support for 256K context, 140+ languages, DiffusionGemma is designed for high-speed text generation across text, video and image inputs. DiffusionGemma can run locally on 18GB RAM, and fine-tuning is now supported via Unsloth.

Instead of standard token-by-token decoding, DiffusionGemma uses diffusion generation to produce outputs in parallel and gradually refine them into a final answer - similar to diffusion image models, but for text. Run the model via Unsloth Studio or llama.cpp. GGUF: diffusiongemma-26B-A4B-it-GGUF

Run DiffusionGemma Fine-tune DiffusionGemma

Usage Guide

DiffusionGemma is designed for users who want faster generation than standard models. It is suited for fast local inference, long-context document analysis, image/video understanding, OCR and document parsing, code generation, tool calling, agentic workflows, and low-latency inference with small batch sizes.

Unlike standard Gemma 4 models, DiffusionGemma requires a diffusion-aware inference runtime. Standard autoregressive settings such as temperature, top_p, and top_k are not sufficient to reproduce the recommended behavior unless the runtime includes the required diffusion sampler.

Hardware requirements

It's generally best to have at least 18GB RAM to run the model in 4-bit precision. GGUF: diffusiongemma-26B-A4B-it-GGUF

Table: DiffusionGemma Inference GGUF recommended hardware requirements (units = total memory: RAM + VRAM, or unified memory).

4-bit

5-bit

6-bit

8-bit

BF16 / FP16

18 GB

20 GB

24 GB

28 GB

52 GB

As a rule of thumb, your total available memory should at least exceed the size of the quantized model you download. If it does not, llama.cpp can still run using partial RAM / disk offload, but generation will be slower. You will also need more compute, depending on the context window you use.

Recommended Settings

Thinking Mode

DiffusionGemma supports Gemma 4-style thinking mode. To enable thinking, add the thinking token at the start of the system prompt:

<|think|>

When thinking is enabled, the model may emit an internal reasoning channel followed by the final answer:

<|channel>thought
[internal reasoning]
<channel|>
[final answer]

To disable thinking, remove the <|think|> token from the system prompt. When thinking is disabled, the model may still emit an empty thought channel:

<|channel>thought
<channel|>
[final answer]

For multi-turn conversations, do not include previous hidden thoughts in the conversation history. Only include the final assistant response before the next user turn.

Run DiffusionGemma Tutorials

It's best to use at least 4-bit precision so we'll use the Dynamic 4-bit UD-Q4_K_XL quant which needs 18GB RAM. GGUF: diffusiongemma-26B-A4B-it-GGUF

🦥 Unsloth Studio Guide 🦙 Llama.cpp Guide

🦙 Llama.cpp Guide

For this tutorial, we will be utilizing the Dynamic 4-bit UD-Q4_K_XL quant which needs 18GB RAM and llama.cpp for fast local inference, especially if you have a CPU.

Obtain the SPECIFIC llama.cpp PR on GitHub here. You can follow the build instructions below as well. Change -DGGML_CUDA=ON to -DGGML_CUDA=OFF if you don't have a GPU or just want CPU inference. For Apple Mac / Metal devices, set -DGGML_CUDA=OFF then continue as usual - Metal support is on by default.

git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
gh pr checkout 24423
# build with CUDA (drop -DGGML_CUDA=ON for a CPU-only build)
cmake -B build -DGGML_CUDA=ON
cmake --build build -j --config Release --target llama-diffusion-cli
cd ..

Download the model via (after installing pip install huggingface_hub hf_transfer ). You can choose UD-Q4_K_XL or other quantized versions like Q8_0 . If downloads get stuck, see: Hugging Face Hub, XET debugging

pip install -U "huggingface_hub[cli]"
hf download unsloth/diffusiongemma-26B-A4B-it-GGUF \
    --local-dir unsloth/diffusiongemma-26B-A4B-it-GGUF \
    --include "*Q8_0*" # Use "*Q4_K_M*" for a smaller 16 GB download

Chat with DiffusionGemma

Then run the below:

./build/bin/llama-diffusion-cli \
  -m unsloth/diffusiongemma-26B-A4B-it-GGUF/diffusiongemma-26B-A4B-it-Q8_0.gguf \
  -ngl 99 -cnv -n 2048

You will see:

And if you type a question like "Create a Flappy Bird Game", you will see steps:

Then afterwards you'll see the output:

You can continue conversing as well!

Change -n 2048 as the number of tokens you want to predict, so more will produce longer answers.

Live visualization of diffusion

To see diffusion actually live, use the below - specially enable --diffusion-visual:

./build/bin/llama-diffusion-cli \
  -m unsloth/diffusiongemma-26B-A4B-it-GGUF/diffusiongemma-26B-A4B-it-Q8_0.gguf \
  -ngl 99 -cnv -n 2048 --diffusion-visual

You will again see:

And we get:

All parameters for llama.cpp using the branch:

-n, --n-predict N - target tokens; derives --diffusion-blocks and grows -ub / -b / -c.
-ngl 99 - offload all layers to the GPU (-ngl 0 for CPU-only).
-cnv - multi-turn conversation mode.
--diffusion-visual - live canvas denoising view.
The Entropy-Bound sampler is on by default (--diffusion-eb auto). Tune it with --diffusion-eb-max-steps (default 48), --diffusion-eb-t-max / --diffusion-eb-t-min (0.8 -> 0.4), --diffusion-eb-entropy-bound (0.1), and --diffusion-eb-confidence (0.005).
--diffusion-kv-cache {auto,on,off} - prompt prefix KV cache (auto = on for single GPU).

🦥 Unsloth Studio Guide

Work in progress! For now use llama.cpp directly.

DiffusionGemma can now be run and trained in Unsloth Studio, our new open-source web UI for local AI. Unsloth Studio lets you run models locally on MacOS, Windows, Linux and:

Search, download, run GGUFs and safetensor models
Self-healing tool calling + web search
Code execution (Python, Bash)
Automatic inference parameter tuning (temp, top-p, etc.)
Fast CPU + GPU inference via llama.cpp
Train LLMs 2x faster with 70% less VRAM

Install Unsloth

Run in your terminal:

MacOS, Linux, WSL:

curl -fsSL https://unsloth.ai/install.sh | sh

Windows PowerShell:

irm https://unsloth.ai/install.ps1 | iex

Launch Unsloth

MacOS, Linux, WSL and Windows:

unsloth studio -H 0.0.0.0 -p 8888

Then open http://127.0.0.1:8888 (or your specific URL) in your browser.

Search and download DiffusionGemma

On first launch you will need to create a password to secure your account and sign in again.

Then go to the Studio Chat tab and search for DiffusionGemma in the search bar and download your desired model and quant.

Run DiffusionGemma

Inference parameters should be auto-set when using Unsloth Studio, however you can still change it manually. You can also edit the context length, chat template and other settings.

For more information, you can view our Unsloth Studio inference guide.

Fine-tune DiffusionGemma

You can now train and fine-tune DiffusionGemma directly with Unsloth. In our example, we demonstrate the impact of domain-specific training by fine-tuning the model on Sudoku. The base model initially performs poorly on Sudoku tasks, but after training on a targeted dataset, it learns how to actually solve sudoku and solves every example correctly.

You can use our Colab notebook (A100) to fine-tune Diffusion Gemma with:

Google Colabcolab.research.google.com

DiffusionGemma Best Practices

Multimodal Prompting

DiffusionGemma supports interleaved multimodal inputs, including text and images. Video can be processed as sequences of image frames.

For best results with multimodal prompts, place image or frame content before text instructions. Example:

[image]
Describe the chart and summarize the key trend.

For document parsing, OCR, chart understanding, UI understanding, or small text extraction, use a higher visual token budget.

Supported visual token budgets:

Visual Token Budget

Best For

Fast classification, simple captioning

140

Lightweight visual QA

280

General image understanding

560

OCR, charts, UI screenshots

1120

Dense documents, small text, detailed extraction

For video-style inputs, DiffusionGemma can process up to 60 seconds when sampled at 1 frame per second.

Sampling Notes

DiffusionGemma is not a normal next-token-only model. It generates a block of tokens, called a canvas, by repeatedly refining noisy token predictions. The generation process works roughly as follows:

The encoder processes the prompt and builds a context cache.
The decoder receives a 256-token generation canvas.
The diffusion sampler iteratively denoises the canvas.
Confident tokens are selected and preserved.
Uncertain tokens are renoised and refined again.
Once the canvas is complete, it is appended to the context.
The model continues with the next canvas.

This block-autoregressive approach allows DiffusionGemma to generate many tokens in fewer forward passes than a standard autoregressive model.

Benchmarks

DiffusionGemma is optimized for speed and multimodal reasoning, though standard Gemma 4 is stronger on conventional reasoning benchmarks.

Benchmark

DiffusionGemma 26B-A4B

Gemma 4 26B-A4B

MMLU Pro

77.6%

82.6%

AIME 2026 no tools

69.1%

88.3%

LiveCodeBench v6

69.1%

77.1%

Codeforces ELO

1429

1718

GPQA Diamond

73.2%

82.3%

Tau2 Average

56.2%

68.2%

HLE no tools

11.0%

8.7%

HLE with search

11.9%

17.2%

BigBench Extra Hard

47.6%

64.8%

MMMLU

81.5%

86.3%

Long Context Benchmark

DiffusionGemma 26B-A4B

Gemma 4 26B-A4B

MRCR v2 8 needle 128K average

32.0%

44.1%

Vision benchmarks:

Vision Benchmark

DiffusionGemma 26B-A4B

Gemma 4 26B-A4B

MMMU Pro

54.3%

73.8%

OmniDocBench 1.5, lower is better

0.319

0.149

MATH-Vision

70.5%

82.4%

MedXPertQA MM

49.0%

58.1%

PreviousUnsloth Updates NextGemma 4

Last updated 4 hours ago

Was this helpful?