Add diffusion-gemma block-diffusion support by lnigam · Pull Request #24427 · ggml-org/llama.cpp

lnigam · 2026-06-10T16:36:07Z

Overview

This PR adds initial diffusion-gemma support for Gemma 4 based block-diffusion checkpoints. This is just a draft PR to get feedback on multiple design aspects of diffusion model like block diffusion, approximation of soft embeddings, separate vs single encoder-decoder, prefix KV-cache reuse, diffusion server utility etc.

git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
gh pr checkout 24427
cmake -B build -DGGML_CUDA=ON
cmake --build build -j --config Release

then use a GGUF (any can work but for eg)

pip install -U "huggingface_hub[cli]"
hf download unsloth/diffusiongemma-26B-A4B-it-GGUF \
    --local-dir unsloth/diffusiongemma-26B-A4B-it-GGUF \
    --include "*Q8_0*" # Use "*Q4_K_M*" for a smaller 16 GB download

Diffusion Gemma benchmark commands

Benchmark scripts:

https://github.com/lnigam/llama.cpp-scripts/tree/diffusion-gemma

The optimized CUDA path uses the current binary defaults for fused full-softmax,
fused top-k sampling, fused self-conditioning embedding, final softcap fusion,
direct self-conditioning, and final token copy on stop.

The only extra tuning flag passed explicitly is:

--diffusion-cuda-mmq-max-x 64

--run-max-denoising-step is not enabled. Top-k is not passed, so the default
full-softmax path is used (top_k=0).

CLI

.\build\bin\Release\llama-diffusion-gemma-cli.exe `
  -m "<path-to-model.gguf>" `
  -p "Answer in about 1000 words: explain block diffusion generation and CUDA sampling optimizations." `
  -n 1024 `
  -c 8096 `
  -ngl 999 `
  --diffusion-steps 48 `
  --diffusion-cuda-mmq-max-x 64

Server

.\build\bin\Release\llama-diffusion-gemma-server.exe `
  -m "<path-to-model.gguf>" `
  --host 127.0.0.1 `
  --port 18081 `
  -c 8096 `
  -ngl 999 `
  --diffusion-steps 48 `
  --diffusion-cuda-mmq-max-x 64 `
  --metrics `
  --slots

Example request:

curl.exe http://127.0.0.1:18081/v1/chat/completions `
  -H "Content-Type: application/json" `
  -d "{\"model\":\"diffusion-gemma\",\"messages\":[{\"role\":\"user\",\"content\":\"Answer in about 1000 words: explain block diffusion generation.\"}],\"max_tokens\":1024}"

Benchmark scripts

CLI benchmark:

python .\bench-diffusion-gemma-cli.py `
  --binary <path-to-llama-diffusion-gemma-cli> `
  --model <path-to-model.gguf> `
  --prompt-file .\diffusion-gemma-prompts.txt `
  --output-dir .\benchmark-results `
  --repeat 1 `
  --warmup 0 `
  --n-predict 1024

Server benchmark:

python .\bench-diffusion-gemma-server.py `
  --binary <path-to-llama-diffusion-gemma-server> `
  --model <path-to-model.gguf> `
  --prompt-file .\diffusion-gemma-prompts.txt `
  --output-dir .\benchmark-results `
  --repeat 1 `
  --warmup 0 `
  --max-tokens 1024

The benchmark scripts default to:

--ctx-size 8096
--diffusion-steps 48
--diffusion-cuda-mmq-max-x 64
no request-level top-k override, so the binary uses top_k=0
no --ignore-eos, so EOS is respected by default

Additional information

Add GGUF conversion support for diffusion Gemma checkpoints, including self-conditioning tensors and multimodal Gemma 4 vision/mmproj export.
Register the diffusion-gemma architecture and model implementation.
Implement the diffusion Gemma graph using Gemma 4 decoder blocks with bidirectional canvas attention, prompt-prefix conditioning, KV-cache reuse, and self-conditioning.
Add sparse top-k self-conditioning through on-device embedding gather.
Add llama-diffusion-gemma-cli for block-diffusion generation.
Add llama-diffusion-gemma-server, an HTTP server with /v1/completions, /v1/chat/completions, /health, /props, /metrics, and /slots.
Add CUDA backend support for diffusion top-k sampling, entropy/stability decisions, self-conditioning buffers, device-side canvas updates, and device-loop early stopping.
Enable CUDA graph friendly execution by keeping persistent diffusion inputs/output state on device and avoiding inter-step host sampling copies.

Requirements

I have read and agree with the contributing guidelines
AI usage disclosure: yes, for generating the initial architecture support related changes and some bug fixes, merge conflicts and code review

…pport Adds a new DIFFUSION_GEMMA4 architecture and a DiffusionGemma4Model converter that reuses the existing gemma4 conversion path. The block-diffusion checkpoint nests its language model under model.decoder.* and adds a self_conditioning MLP; its text encoder shares all weights with the decoder except a per-layer layer_scalar (dropped here, since the single-stack graph uses the decoder set). - gguf-py: MODEL_ARCH.DIFFUSION_GEMMA4 + SELF_COND_* tensors and HF name maps - conversion: DiffusionGemma4Model strips the model.decoder. prefix, drops encoder-only tensors, and otherwise inherits gemma4 hparam/tensor handling - relax Gemma4Model num_kv_shared_layers lookup (key absent in diffusion config) Dry-run validated: architecture and hparams recognized (5:1 sliding/global KV pattern, dual head dims, softcapping, 128/8 experts), all 691 tensors map.

Adds LLM_ARCH_DIFFUSION_GEMMA4, reusing the gemma4 decoder block (hparams and graph) by subclassing llama_model_gemma4. Adds the top-level self_conditioning MLP tensors (SELF_COND_*); load_arch_tensors loads them alongside the inherited gemma4 tensors. The block-diffusion sampling loop and the self_conditioning / bidirectional / encoder-KV graph wiring are layered on top in a later step. - llama-arch: arch name, SELF_COND_* tensor names and tensor infos - llama-model: create / rope-type / kv-reuse dispatch for the new arch - models: llama_model_diffusion_gemma4 subclass loading self_conditioning weights Verified: the 50.5 GB bf16 GGUF loads, all 691 tensors map, and a forward pass runs end-to-end with correct shapes (dual head dims, 128-expert MoE routing, tied output). Runs as a gemma4-style causal LM for now; diffusion semantics next.

Replaces the inherited gemma4 (causal, KV-cache) graph with a dedicated bidirectional, no-KV-cache denoising graph: - load_arch_hparams: reuse gemma4 hparams, then set causal_attn = false - register the arch in llm_arch_is_diffusion and on the no-memory path (res = nullptr), so it runs like the other diffusion LMs (DREAM/LLADA) - graph: gemma4 per-layer block (QK-norm, scale-less V-norm, dual head dims, proportional rope on full layers, dense+MoE dual FFN, layer_scalar) but with build_attn_inp_no_cache (bidirectional attention) - input: scaled embeddings -> scale-less RMS norm, which is exactly the self-conditioning transform on the first denoising step (soft-cond = 0) Scope: a single bidirectional denoising pass over the canvas with no prompt context. Valid while canvas_length <= sliding_window (sliding == full attn). The soft-conditioning input path (later steps) and encoder-KV cross-attention (prompted generation) come next. Verified: builds clean; loads the 50.5 GB GGUF with no KV cache and runs a full forward (self_cond_input node + bidirectional attention + dual-FFN MoE + layer_scalar + softcapped logits).

Adds the self_conditioning gated MLP to the decoder input: sc_input = post_norm(inputs_embeds + down(gelu(gate(pre_norm(soft))) * up(pre_norm(soft)))) using the self_cond_norm/gate/up/down weights. The soft-embeddings input is a zero placeholder for now (the block-diffusion sampler will feed softmax(prev_logits) @ embed per denoising step); with soft = 0 this is numerically identical to the verified first-step behaviour (rms_norm(0) = 0 -> sc_signal = 0 -> scale-less post-norm of the scaled embeddings), so no regression, but the self_cond weights are now used in-graph. First slice of the self-conditioning + block-diffusion-sampler unit; the runtime soft-embeddings feed (a settable input channel) lands with the sampler.

Adds llama-diffusion-gemma4-cli implementing the reference block-diffusion loop: random canvas init, per-step full-canvas decode, linear temperature schedule (0.4 -> 0.8), entropy-bound token acceptance, renoise of non-accepted positions, and stable-and-confident stopping. Drives the bidirectional no-KV-cache graph. Scope: unconditioned generation with self-conditioning = 0 (the prompt is not used yet). The loop runs end-to-end and denoises as expected -- over the steps the mean token entropy falls (3.12 -> 0.15) and the accepted-token count rises (1 -> 7/16), i.e. the canvas converges. Self-conditioning feedback and prompt/encoder-KV conditioning are layered on next. DG4_CANVAS / DG4_STEPS env vars override canvas length / step count for testing.

Adds a per-decode soft-conditioning input so the block-diffusion sampler can feed the previous denoising step's token probabilities back into the decoder. - core: llama_diffusion_cond + llm_graph_input_diffusion_self_cond + build_inp_*, threaded through llm_graph_params/llm_graph_context exactly like llama_cross; public API llama_set_diffusion_self_cond(ctx, probs, n_vocab, n_tokens). - graph: soft_embeddings = (probs @ token_embd) * sqrt(n_embd) -> self_cond MLP -> added to the input embeddings (replaces the zero placeholder). Empty buffer -> zeros == the verified first-step behaviour, so no regression. - sampler: feeds softmax(processed_logits) each step for the next decode; cleared for step 0. Runs end-to-end; step 0 is unchanged (zero self-cond), later steps use the feedback. Full numerical verification vs PyTorch self_conditioning is pending. Perf TODO: the in-graph embedding transpose for the probs@embed matmul is recomputed per decode.

Adds prompted block-diffusion generation as a single-pass [prompt ; canvas] forward. - core: llm_graph_input_attn_no_cache_prefix + build_attn_inp_no_cache_prefix(n_prompt) build a prefix mask (prompt attends causally within the prompt; canvas attends to everything -> bidirectional + cross to the prompt). n_prompt is carried on llama_diffusion_cond and set via llama_set_diffusion_prompt_len(). - graph: prompt rows use the raw scaled embeddings (encoder; no self-conditioning / post-norm); canvas rows use the self-conditioned input. Uses the prefix attention when n_prompt > 0, else fully-bidirectional no-cache. - sampler: tokenizes the prompt, builds [prompt ; canvas] each step, requests logits for the canvas positions only, and feeds self-cond probs over the full sequence (prompt columns zero). Result: prompted generation runs end-to-end. "The capital of France is" denoises the canvas toward relevant tokens (including "Paris"). Output quality is still rough (repetition/filler); refinement is ongoing (self-conditioning numerical verification, sampler tuning, full canvas/steps, reconvert from the current v5 source). Notes: encoder and decoder layer_scalars are identical in this checkpoint, so no separate encoder scalars are needed. The prefix mask currently assumes n_tokens <= sliding_window (sliding == full); long prompts need a windowed prefix mask.

This is a chat-trained model (turn/channel special tokens); feeding raw text gives poor results. Format the user prompt with the model's chat template (common_chat_templates) before tokenizing, parsing special tokens. HF reference (chat-formatted) answers cleanly ("The capital of France is **Paris**." then <pad> filler). With chat formatting the canvas now denoises toward prompt-relevant tokens, though full convergence still needs work (canvas size / sampler dynamics).

In the no-KV-cache path llama.cpp prunes tokens that are not requested as outputs. The sampler marked only the canvas tokens as outputs (logits=1), so the prompt tokens were pruned from the attention and the canvas never attended to the prompt -- generation was effectively unconditioned (identical output for different prompts). Fix: request logits for all [prompt; canvas] tokens and read only the canvas rows (offset n_prompt). The prompted forward now matches the HF reference (pos-0 top-5 logits agree to bf16 precision; output is prompt-dependent), and generation produces the correct structure and reasoning (thought channel -> "The capital of France is ..."). Residual: filler positions don't fully converge to <pad> (sampler-dynamics polish).

After the denoising loop, decode the converged canvas once more and emit the greedy argmax per canvas position instead of the accumulated `accepted` array. Positions that were never accepted by the entropy-bound sampler carried stale/renoised tokens, which showed up as garbage in the output; reading the model's argmax given the settled answer cleans them up (filler -> <pad>). With this, prompted generation produces correct, coherent answers, e.g.: "What is the capital of France?" -> <|channel>thought The user is asking for the capital of France. * Country: France * Capital: Paris The capital of France is Paris.

The model answers inside a "<|channel>thought ... <channel|>" block followed by the response, and the fixed 256-canvas tail repeats the answer. Print the full canvas for reference, then extract the final response (tokens after the last <channel|>, truncated at the first end-of-generation token) and drop a trailing exact-duplicate. "What is the capital of France?" now yields a clean: === answer === The capital of France is Paris.

The CLI passed common_params.n_gpu_layers (-1) straight to the model loader without resolving it, so it ran on CPU even in a CUDA build. Default to offloading all layers (999) when -ngl is not given; -ngl N limits offload and -ngl 0 forces CPU. No effect in a CPU-only build. With -DGGML_CUDA=ON the model now runs on the GPU by default.

Encode the prompt (and previously-finalized canvases) once into the unified sliding-window KV cache and reuse it, instead of re-encoding [prompt; canvas] on every denoising step. - Encoder phase (causal, no self-conditioning): prefill the prompt / commit a finalized canvas into the cache; its K/V becomes a read-only prefix. - Decoder phase (bidirectional, self-conditioned): each denoising step decodes only the canvas against the cached prefix, then rolls back its own K/V. - Multi-block autoregressive loop: commit each finalized canvas and advance the cache pointer by canvas_length. Two graph variants share the gemma4 transformer body: a single phase-branching graph (default) and a separate encoder/decoder pair (DG4_SEPARATE_ENC_DEC). Also: - enable the iswa KV cache for the arch (was res=nullptr), reusing the gemma4 layer-reuse / has_kv handling; - precompute a transposed F32 token embedding once at load (load_arch_post) for the self-conditioning soft-embedding matmul, avoiding a per-decode dequantize + transpose of the whole embedding; - add a per-decode phase selector API: llama_set_diffusion_decoder_phase. Verified on CPU and CUDA (partial and full GPU offload): correct prompted generation, multi-block coherence, and clean entropy-bound convergence.

Update the block-diffusion Gemma port to the v7 checkpoint, which renames the text architecture diffusion_gemma4 -> diffusion_gemma and adds a gemma4 vision tower (model_type gemma4_vision) + projector for image input. Rename: - DIFFUSION_GEMMA4 -> DIFFUSION_GEMMA; arch string "diffusion-gemma4" -> "diffusion-gemma"; model class llama_model_diffusion_gemma; converter class DiffusionGemmaModel registered for "DiffusionGemmaForBlockDiffusion". - renamed files src/models/diffusion-gemma.cpp, examples/diffusion-gemma/, and the CLI env knobs (DG_*). Multimodal (vision-only; the v7 diffusion checkpoint has no audio tower): - new DiffusionGemmaVisionModel mmproj converter reusing the existing GEMMA4V vision export: strips the v7 model.encoder.* nesting, skips audio, registered in MMPROJ_MODEL_MAP. The clip.cpp GEMMA4V encoder is reused unchanged. - diffusion-gemma CLI gains --mmproj/--image (enabled for LLAMA_EXAMPLE_DIFFUSION): the image marker is tokenized via libmtmd and the 280 GEMMA4V vision embeddings are fed into the diffusion encoder-phase prefill (mtmd_helper_eval_chunks); the canvas is then denoised against the cached prefix. Verified on CUDA (RTX 5090): text Q4_K_M answers correctly, and image+text produces an accurate, OCR-level description via the GEMMA4V vision tower.

…ne-argmax output - -n / --n-predict now sets the number of 256-token canvas blocks (max_canvases = ceil(n_predict / canvas_length)); canvas_length is fixed at the trained block size (256). DG_MAX_CANVASES still overrides; n_ctx auto-sizes. - print a generation timing summary at the end (blocks, denoising steps, canvas tokens, wall-clock, canvas tok/s, s/step, answer tokens). - emit each block via the inline argmax of the last (stable) denoising step, matching the reference DiffusionGemma _denoising_step, instead of a separate read-out over a stale `accepted` scratch buffer. Fixes the unconverged canvas tail: never-accepted high-entropy positions no longer carry stale-random tokens into the output / committed prefix. Commit also uses the inline argmax. Verified on CUDA: multi-block generation (e.g. -n 1536 -> 6 blocks, with entropy-bound early-stopping) produces coherent, accurate prose and reports timing.

…ded backend The self-conditioning soft-embedding does a full-vocabulary matmul (softmax(prev_logits) @ token_embd) with the precomputed F32 tok_embd_t every decode. token_embd is normally host-resident (it is only used for a cheap get_rows lookup in AR models), and tok_embd_t inherited that host buffer, so the scheduler copied the whole ~2.75 GiB tensor across PCIe on every forward, dominating per-step time. Allocate tok_embd_t on a non-host (offloaded) backend taken from a layer weight when one is offloaded, so the matmul runs on-device with no per-decode copy; fall back to token_embd's buffer for CPU-only runs. Measured on RTX 5090 (Q4_K_M, -ngl 99, pp512/tg256 @ d512): pp512 1339 -> 3530 t/s (2.6x) tg256 4.84 -> 148.8 t/s (30.7x) real generation ~0.43 s/step (was ~0.6-0.8); output unchanged.

Time the prompt prefill (encoder phase: causal, no self-conditioning) separately from the denoising loop and print tokens / seconds / tok-per-second. Makes the real prefill cost visible (it skips the self-cond matmul, unlike a decoder-phase forward), so it is not conflated with the per-step denoising cost.

…) with k-annealing The decoder/denoise phase computed softmax + entropy + self-conditioning over the full 262k vocabulary on the host every step (~67M exp/log per step), which dominated the per-step wall-clock. Add an opt-in top-k path: per canvas position, select the top-k logits (size-k min-heap in one scan), then do softmax / entropy / multinomial-sample / sparse self-conditioning over just those k. The self-cond buffer is sparse (top-k filled), so the existing in-graph soft-embedding matmul blends only the top-k embeddings (the dropped tail carries negligible embedding weight; the following RMS norm absorbs the scale). Knobs (CLI flags; default = full softmax, behaviour unchanged): - --top-k N : fixed top-k per position (0 = full softmax) - --top-k-start/--top-k-end : anneal k from high (first/high-entropy step) to low (last step) -- early canvases are flat (need many tokens), late ones are peaked (a few suffice) - --top-k-tail-correction : use the exact full-vocab entropy (logsumexp) for the accept/stop signal instead of the under-estimating top-k entropy (top-k truncation deflates entropy) Also report encoder-phase prefill timing separately, and add the generation timing summary. RTX 5090 (Q4_K_M, -ngl 99): ~2.6x faster per denoising step with top-k (no correction); output preserved on the tested prompts. This is host-side only -- it reuses the existing self-conditioning channel and graph. The fuller graph-side variant (gather top-k embedding rows + a dedicated top-k self-cond input, dropping the full-vocab matmul and the F32 transposed embedding) is left as a follow-up.

Replace the dense full-vocab self-conditioning soft-embedding (probs @ token_embd over all 262144 rows) with a sparse gather of only the previous step's top-k token embeddings, blended by their probabilities. The CLI feeds the top-256 (id, prob) per canvas position each denoising step via a new API; the decoder graph gathers those rows (ggml_get_rows), prob-weights and sums them (x sqrt(n_embd)). The gather embedding (tok_embd_gpu) is stored as F16 on the offloaded backend. F16 (not the native Q4_K) is required because CUDA get_rows has no Q4_K/Q6_K kernel -- a quantized gather falls back to CPU every step and is a large regression. F16 keeps the gather on-device and halves its VRAM vs the F32 dense transpose (1.47 GB vs 2.75 GB). Measured (RTX 5090, Q4_K_M, France prompt): --top-k 64 ~0.13 s/step vs 0.158 dense (~1.2x faster), ~1.3 GB less VRAM; k=0 at parity (host softmax dominated). Output unchanged. The dense matmul path is retained as a fallback when the F16 gather copy cannot be allocated. API: - llama_set_diffusion_self_cond_topk(ctx, ids, probs, k, n_tokens) - llm_graph_input_diffusion_self_cond_topk + build_inp_diffusion_self_cond_topk

Add examples/diffusion-gemma/diffusion-gemma-server.cpp (target llama-diffusion-gemma-server): loads a block-diffusion model once and serves the same denoising generation loop as the CLI over HTTP, with the llama-server observability surface re-mapped to diffusion semantics. Endpoints: GET /health, /v1/health, /props, /v1/models, /models, /slots (--slots), /metrics (--metrics); POST /v1/chat/completions (+ SSE stream) and /v1/completions. Responses carry the llama-server `timings` object (prompt_n/prompt_ms, predicted_n/predicted_ms, ...) extended with a `diffusion` sub-object (n_blocks, n_steps, canvas_tokens, ms_per_step, steps_per_second, canvas_tokens_per_second, n_decode), plus the OpenAI `usage` object. Per-request llama_perf-style timing logs, an access log, and a llama-server-style startup banner are emitted. Prometheus /metrics exposes prompt/predicted token counters and gauges alongside diffusion_blocks/steps/canvas_tokens totals and rate gauges. Generation is serialized behind a mutex (one slot; the context is single- threaded). Server-only flags (--host, --port, --api-key, --metrics, --slots) are stripped before common_params_parse so all diffusion flags still parse. Request fields: max_tokens -> canvas blocks, top_k, seed, ignore_eos (run the full block count for sustained long-generation benchmarking).

Ajay9o9 · 2026-06-10T18:57:17Z

Tried the latest nvidia-diffusion-gemma branch (e1fc5359f) with diffusiongemma-26B-A4B-it-Q4_K_M.gguf and hit a load error:

i'm on
Branch: nvidia-diffusion-gemma
Commit: e1fc535

Model from Unsloth
diffusiongemma-26B-A4B-it-Q4_K_M.gguf

loading fails with

/repos/llama-diffusion-gemma$ ./build/bin/llama-diffusion-gemma-cli \
  -m "/media/gemma4/diffusiongemma-26B-A4B-it-Q4_K_M.gguf" \
  -p "Hello"
  
ggml_cuda_init: found 1 CUDA devices (Total VRAM: 11889 MiB):
  Device 0: NVIDIA GeForce RTX 3060, compute capability 8.6, VMM: yes, VRAM: 11889 MiB
0.00.624.210 W load: control-looking token:    212 '</s>' was not control-type; this is probably a bug in the model. its type will be overridden
0.00.624.556 W load: control-looking token:     50 '<|tool_response>' was not control-type; this is probably a bug in the model. its type will be overridden
0.00.642.077 W load: special_eog_ids contains '<|tool_response>', removing '</s>' token from EOG list
0.00.861.710 E llama_model_load: error loading model: missing tensor 'self_cond_norm.weight'
0.00.861.719 E llama_model_load_from_file_impl: failed to load model
0.00.861.720 E error: failed to load model '/media/gemma4/diffusiongemma-26B-A4B-it-Q4_K_M.gguf'

missing tensor 'self_cond_norm.weight'

GGUF, inspection shows :

self_cond_down.weight
self_cond_gate.weight
self_cond_pre_norm.weight
self_cond_up.weight

but not:

self_cond_norm.weight

The runtime appears to expect:

{ LLM_TENSOR_SELF_COND_NORM, "self_cond_norm" }

from llama-arch.cpp

I also checked the current branch and couldn't find any references to self_cond_pre_norm in the runtime:

grep -R "self_cond_pre_norm" src/ -n

returns nothing.

Is this a naming mismatch?

Jakeshadow · 2026-06-11T04:10:16Z

Nice, been waiting for this. For anyone building from this PR — heads up that Apple Silicon won't see the speedup. It's a compute-bound model, needs a high-CUDA-core GPU. Detailed GPU requirements here: https://diffusiongemma.dev

lnigam · 2026-06-11T13:18:15Z

@Ajay9o9 Now this PR works for unsloth model.

mohamed-em2m · 2026-06-11T16:44:19Z

llama-diffusion-gemma-server don't appear in builds

lnigam · 2026-06-11T17:45:06Z

@mohamed-em2m they are building in my local and I also can see it in the CI logs. Are there any errors while building?

Keep diffusion decoder input handling limited to Diffusion Gemma so other architectures use standard graph inputs. Assisted-by: Codex

theIvanR · 2026-06-11T18:53:40Z

Awesome job! I will have a look later at the code. Proposal: make the code modular with a clear fetcher, and builder(s).

For the fetcher do something like this:

git clone --depth 1 https://github.com/ggml-org/llama.cpp
cd llama.cpp

git fetch --depth 1 origin pull/24427/head:pr24427
git checkout pr24427

While for the rest, something similar to how I did it here:
https://github.com/theIvanR/lmstudio-unlocked-backend/tree/main/Generate%20Backends/Windows

EDIT:
I tried with this cpu only builder and something is broken:

C:\Users\Admin\source>talk_to_gemma.cmd
warning: no usable GPU found, --gpu-layers option will be ignored
warning: one possible reason is that llama.cpp was compiled without GPU support
warning: consult docs/build.md for compilation instructions
0.00.000.889 I diffusion-gemma: GGML_CUDA_MMQ_MAX_X=64
0.01.195.906 W load: control-looking token:     50 '<|tool_response>' was not control-type; this is probably a bug in the model. its type will be overridden
0.01.196.466 W load: control-looking token:    212 '</s>' was not control-type; this is probably a bug in the model. its type will be overridden
0.01.252.519 W load: special_eog_ids contains '<|tool_response>', removing '</s>' token from EOG list
[SUCCESS] Done

@echo off
setlocal

set "EXE=C:\Users\Admin\source\llama.cpp\build_cpu\bin\llama-diffusion-gemma-cli.exe"
set "MODEL=C:\Users\Admin\Documents\LLM Models\unsloth\diffusiongemma-26B-A4B-it-GGUF\diffusiongemma-26B-A4B-it-Q4_K_M.gguf"

if not exist "%EXE%" (
    echo [ERROR] CLI not found: "%EXE%"
    exit /b 1
)

if not exist "%MODEL%" (
    echo [ERROR] Model not found: "%MODEL%"
    exit /b 1
)

"%EXE%" ^
  -m "%MODEL%" ^
  -p "Answer in about 1000 words: explain block diffusion generation and CUDA sampling optimizations." ^
  -n 1024 ^
  -c 8096 ^
  -ngl 999 ^
  --diffusion-steps 48 ^
  --diffusion-cuda-mmq-max-x 64

if errorlevel 1 (
    echo [ERROR] Command failed
    exit /b 1
)

echo [SUCCESS] Done
exit /b 0

EDIT 2: trying on gpu with my builder and this directory:

set "EXE=C:\Users\Admin\source\llama.cpp\build_gpu_cuda\bin\llama-diffusion-gemma-cli.exe"

It did something:

PS C:\Users\Admin\source> ./talk_to_gemma.cmd
ggml_cuda_init: found 1 CUDA devices (Total VRAM: 8191 MiB):
  Device 0: NVIDIA GeForce GTX 1070 Ti, compute capability 6.1, VMM: yes, VRAM: 8191 MiB
0.00.000.655 I diffusion-gemma: GGML_CUDA_MMQ_MAX_X=64
0.01.348.419 W load: control-looking token:     50 '<|tool_response>' was not control-type; this is probably a bug in the model. its type will be overridden
0.01.349.167 W load: control-looking token:    212 '</s>' was not control-type; this is probably a bug in the model. its type will be overridden
0.01.434.171 W load: special_eog_ids contains '<|tool_response>', removing '</s>' token from EOG list
0.17.762.433 I formatted prompt: <|turn>system
<|think|>
<turn|>
<|turn>user
Answer in about 1000 words: explain block diffusion generation and CUDA sampling optimizations.<turn|>
<|turn>model

0.17.762.850 W llama_context: n_ctx_seq (8192) < n_ctx_train (262144) -- the full capacity of the model will not be utilized
0.17.763.676 W llama_kv_cache_iswa: using full-size SWA cache (ref: https://github.com/ggml-org/llama.cpp/pull/13194#issuecomment-2868343055)
0.18.941.717 I diffusion-gemma: prefix=34 canvas=256 max_canvases=4 steps=48 entropy_bound=0.100 temp=[0.40,0.80] n_ctx=8096 mm=0
0.18.941.728 I diffusion-gemma: gpu sampling: on | device self-cond: on | device loop: on
0.18.941.728 I diffusion-gemma: device early-stop interval=1
0.23.787.601 I prefill (encoder, no self-cond): 34 tokens in 4.846 s (7.0 tok/s)

lnigam · 2026-06-11T19:24:41Z

Latest benchmark numbers on RTX 5090 using my benchmark

Server diffusion benchmark

ready ms: 9080.63
runs kept: 10
request wall ms avg/min/max: 5461.17 / 5142.71 / 5563.61
predicted ms avg: 5402.404
canvas tok/s avg: 189.650
steps/s avg: 34.984
ms/step avg: 28.585
mean denoising steps/block: 47.250

Using aiperf (speedbench)
command: aiperf profile -m "diffusiongemma-26B-A4B-it-Q4_K_M" --concurrency 1 --tokenizer "google/gemma-4-26B-A4B-it" --endpoint-type chat -u http://127.0.0.1:18081 --artifact-dir ./gemma-diffsion-5090-topk0 --public-dataset speed_bench_throughput_2k --osl 1024 --extra-inputs "max_tokens:1024,min_tokens:1024,ignore_eos:true,top_k:0" --osl-stddev 0 --ui-type none --streaming --request-count 100 --warmup-request-count 10 --use-server-token-count --request-timeout-seconds 1200

coder543 · 2026-06-11T19:34:01Z

Google reported 700 tokens per second on RTX 5090, so I guess there is a lot of room for optimization here

lnigam · 2026-06-11T21:11:12Z

@theIvanR Can you try the CPU only build with -DGGML_CPU_REPACK=OFF
I dont have 1070 Ti with me. For GPU build can you try following:

%EXE% ^
-m "%MODEL%" ^
-p "Briefly explain block diffusion." ^
-n 64 ^
-c 1024 ^
-ngl 8 ^
--no-kv-offload ^
--diffusion-steps 2 ^
--diffusion-cuda-mmq-max-x 0 ^
--top-k 256 ^
--no-diffusion-device-denoise-loop

lnigam · 2026-06-11T21:56:16Z

@coder543 It depends mostly on the number of denoising steps needed for the canvas to converge. For coding prompts, model converges generally in around 20 steps while the above benchmark took on an average 34 steps to converge.
diffusiongemma-Q8_0 model is performing relatively better than Q4_KM. and with coding speedbench it is reaching around 300 t/s.

For some coding prompts it is reaching upto 534 t/s

mohamed-em2m · 2026-06-11T23:52:28Z

@mohamed-em2m they are building in my local and I also can see it in the CI logs. Are there any errors while building?

it's appear now

lnigam added 27 commits June 9, 2026 23:05

diffusion-gemma: add CUDA graph GPU sampling server path

ff726a2

diffusion-gemma: keep self-cond sampling on device

378fbfd

diffusion-gemma: keep denoising loop on device

f7cea9c

diffusion-gemma: use output flag for persistent inputs

7e65661

diffusion-gemma: checkpoint device-loop early stop

2b7758f

diffusion-gemma: make device early stop every step

3b51143

examples: update diffusion gemma mtmd bitmap loading

e1fc535

lnigam requested review from a team, CISC and ggerganov as code owners June 10, 2026 16:36

github-actions Bot added model Model specific Nvidia GPU Issues specific to Nvidia GPUs examples python python script changes ggml changes relating to the ggml tensor library for machine learning labels Jun 10, 2026

gaugarg-nv mentioned this pull request Jun 10, 2026

DiffusionGemma #24423

Draft

incorporating unsloth model related changes

0491dd2

github-actions Bot added the testing Everything test related label Jun 11, 2026

lnigam added 5 commits June 11, 2026 16:36

diffusion-gemma: add tuning options and server accounting

b638ad7

diffusion: add max denoising loop switch

7590647

diffusion: wire CUDA sampling optimization controls

3901206

ggml-cuda: add fused diffusion sampling kernels

55850ae

diffusion: keep final denoise token copy

8a916b9

lnigam added 3 commits June 11, 2026 18:57

diffusion: default to fused CUDA full softmax

93b5a36

diffusion-gemma: enable fast CUDA defaults

4252f53

diffusion-gemma: add CUDA MMQ tile override flag

266cdf1

llama : scope diffusion graph paths to diffusion gemma

9d2fc7d

Keep diffusion decoder input handling limited to Diffusion Gemma so other architectures use standard graph inputs. Assisted-by: Codex

fix flake8_lint failure

7ba437f

avoid no-repack

201052a

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add diffusion-gemma block-diffusion support#24427

Add diffusion-gemma block-diffusion support#24427
lnigam wants to merge 39 commits into
ggml-org:masterfrom
lnigam:nvidia-diffusion-gemma

lnigam commented Jun 10, 2026 •

edited

Loading

Uh oh!

Ajay9o9 commented Jun 10, 2026

Uh oh!

Jakeshadow commented Jun 11, 2026

Uh oh!

lnigam commented Jun 11, 2026

Uh oh!

mohamed-em2m commented Jun 11, 2026

Uh oh!

lnigam commented Jun 11, 2026

Uh oh!

theIvanR commented Jun 11, 2026 •

edited

Loading

Uh oh!

lnigam commented Jun 11, 2026

Uh oh!

coder543 commented Jun 11, 2026

Uh oh!

lnigam commented Jun 11, 2026

Uh oh!

lnigam commented Jun 11, 2026

Uh oh!

mohamed-em2m commented Jun 11, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

6 participants

Conversation

lnigam commented Jun 10, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Overview

Diffusion Gemma benchmark commands

CLI

Server

Benchmark scripts

Additional information

Requirements

Uh oh!

Ajay9o9 commented Jun 10, 2026

Uh oh!

Jakeshadow commented Jun 11, 2026

Uh oh!

lnigam commented Jun 11, 2026

Uh oh!

mohamed-em2m commented Jun 11, 2026

Uh oh!

lnigam commented Jun 11, 2026

Uh oh!

theIvanR commented Jun 11, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

lnigam commented Jun 11, 2026

Server diffusion benchmark

Uh oh!

coder543 commented Jun 11, 2026

Uh oh!

lnigam commented Jun 11, 2026

Uh oh!

lnigam commented Jun 11, 2026

Uh oh!

mohamed-em2m commented Jun 11, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

6 participants

lnigam commented Jun 10, 2026 •

edited

Loading

theIvanR commented Jun 11, 2026 •

edited

Loading