Skip to content

Add diffusion-gemma block-diffusion support#24427

Draft
lnigam wants to merge 39 commits into
ggml-org:masterfrom
lnigam:nvidia-diffusion-gemma
Draft

Add diffusion-gemma block-diffusion support#24427
lnigam wants to merge 39 commits into
ggml-org:masterfrom
lnigam:nvidia-diffusion-gemma

Conversation

@lnigam

@lnigam lnigam commented Jun 10, 2026

Copy link
Copy Markdown
Contributor

Overview

This PR adds initial diffusion-gemma support for Gemma 4 based block-diffusion checkpoints. This is just a draft PR to get feedback on multiple design aspects of diffusion model like block diffusion, approximation of soft embeddings, separate vs single encoder-decoder, prefix KV-cache reuse, diffusion server utility etc.

git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
gh pr checkout 24427
cmake -B build -DGGML_CUDA=ON
cmake --build build -j --config Release

then use a GGUF (any can work but for eg)

pip install -U "huggingface_hub[cli]"
hf download unsloth/diffusiongemma-26B-A4B-it-GGUF \
    --local-dir unsloth/diffusiongemma-26B-A4B-it-GGUF \
    --include "*Q8_0*" # Use "*Q4_K_M*" for a smaller 16 GB download

Diffusion Gemma benchmark commands

Benchmark scripts:

The optimized CUDA path uses the current binary defaults for fused full-softmax,
fused top-k sampling, fused self-conditioning embedding, final softcap fusion,
direct self-conditioning, and final token copy on stop.

The only extra tuning flag passed explicitly is:

--diffusion-cuda-mmq-max-x 64

--run-max-denoising-step is not enabled. Top-k is not passed, so the default
full-softmax path is used (top_k=0).

CLI

.\build\bin\Release\llama-diffusion-gemma-cli.exe `
  -m "<path-to-model.gguf>" `
  -p "Answer in about 1000 words: explain block diffusion generation and CUDA sampling optimizations." `
  -n 1024 `
  -c 8096 `
  -ngl 999 `
  --diffusion-steps 48 `
  --diffusion-cuda-mmq-max-x 64

Server

.\build\bin\Release\llama-diffusion-gemma-server.exe `
  -m "<path-to-model.gguf>" `
  --host 127.0.0.1 `
  --port 18081 `
  -c 8096 `
  -ngl 999 `
  --diffusion-steps 48 `
  --diffusion-cuda-mmq-max-x 64 `
  --metrics `
  --slots

Example request:

curl.exe http://127.0.0.1:18081/v1/chat/completions `
  -H "Content-Type: application/json" `
  -d "{\"model\":\"diffusion-gemma\",\"messages\":[{\"role\":\"user\",\"content\":\"Answer in about 1000 words: explain block diffusion generation.\"}],\"max_tokens\":1024}"

Benchmark scripts

CLI benchmark:

python .\bench-diffusion-gemma-cli.py `
  --binary <path-to-llama-diffusion-gemma-cli> `
  --model <path-to-model.gguf> `
  --prompt-file .\diffusion-gemma-prompts.txt `
  --output-dir .\benchmark-results `
  --repeat 1 `
  --warmup 0 `
  --n-predict 1024

Server benchmark:

python .\bench-diffusion-gemma-server.py `
  --binary <path-to-llama-diffusion-gemma-server> `
  --model <path-to-model.gguf> `
  --prompt-file .\diffusion-gemma-prompts.txt `
  --output-dir .\benchmark-results `
  --repeat 1 `
  --warmup 0 `
  --max-tokens 1024

The benchmark scripts default to:

  • --ctx-size 8096
  • --diffusion-steps 48
  • --diffusion-cuda-mmq-max-x 64
  • no request-level top-k override, so the binary uses top_k=0
  • no --ignore-eos, so EOS is respected by default

Additional information

  • Add GGUF conversion support for diffusion Gemma checkpoints, including self-conditioning tensors and multimodal Gemma 4 vision/mmproj export.
  • Register the diffusion-gemma architecture and model implementation.
  • Implement the diffusion Gemma graph using Gemma 4 decoder blocks with bidirectional canvas attention, prompt-prefix conditioning, KV-cache reuse, and self-conditioning.
  • Add sparse top-k self-conditioning through on-device embedding gather.
  • Add llama-diffusion-gemma-cli for block-diffusion generation.
  • Add llama-diffusion-gemma-server, an HTTP server with /v1/completions, /v1/chat/completions, /health, /props, /metrics, and /slots.
  • Add CUDA backend support for diffusion top-k sampling, entropy/stability decisions, self-conditioning buffers, device-side canvas updates, and device-loop early stopping.
  • Enable CUDA graph friendly execution by keeping persistent diffusion inputs/output state on device and avoiding inter-step host sampling copies.

Requirements

  • I have read and agree with the contributing guidelines
  • AI usage disclosure: yes, for generating the initial architecture support related changes and some bug fixes, merge conflicts and code review

lnigam added 27 commits June 9, 2026 23:05
…pport

Adds a new DIFFUSION_GEMMA4 architecture and a DiffusionGemma4Model converter
that reuses the existing gemma4 conversion path. The block-diffusion checkpoint
nests its language model under model.decoder.* and adds a self_conditioning MLP;
its text encoder shares all weights with the decoder except a per-layer
layer_scalar (dropped here, since the single-stack graph uses the decoder set).

- gguf-py: MODEL_ARCH.DIFFUSION_GEMMA4 + SELF_COND_* tensors and HF name maps
- conversion: DiffusionGemma4Model strips the model.decoder. prefix, drops
  encoder-only tensors, and otherwise inherits gemma4 hparam/tensor handling
- relax Gemma4Model num_kv_shared_layers lookup (key absent in diffusion config)

Dry-run validated: architecture and hparams recognized (5:1 sliding/global KV
pattern, dual head dims, softcapping, 128/8 experts), all 691 tensors map.
Adds LLM_ARCH_DIFFUSION_GEMMA4, reusing the gemma4 decoder block (hparams and
graph) by subclassing llama_model_gemma4. Adds the top-level self_conditioning
MLP tensors (SELF_COND_*); load_arch_tensors loads them alongside the inherited
gemma4 tensors. The block-diffusion sampling loop and the self_conditioning /
bidirectional / encoder-KV graph wiring are layered on top in a later step.

- llama-arch: arch name, SELF_COND_* tensor names and tensor infos
- llama-model: create / rope-type / kv-reuse dispatch for the new arch
- models: llama_model_diffusion_gemma4 subclass loading self_conditioning weights

Verified: the 50.5 GB bf16 GGUF loads, all 691 tensors map, and a forward pass
runs end-to-end with correct shapes (dual head dims, 128-expert MoE routing,
tied output). Runs as a gemma4-style causal LM for now; diffusion semantics next.
Replaces the inherited gemma4 (causal, KV-cache) graph with a dedicated
bidirectional, no-KV-cache denoising graph:

- load_arch_hparams: reuse gemma4 hparams, then set causal_attn = false
- register the arch in llm_arch_is_diffusion and on the no-memory path
  (res = nullptr), so it runs like the other diffusion LMs (DREAM/LLADA)
- graph: gemma4 per-layer block (QK-norm, scale-less V-norm, dual head dims,
  proportional rope on full layers, dense+MoE dual FFN, layer_scalar) but with
  build_attn_inp_no_cache (bidirectional attention)
- input: scaled embeddings -> scale-less RMS norm, which is exactly the
  self-conditioning transform on the first denoising step (soft-cond = 0)

Scope: a single bidirectional denoising pass over the canvas with no prompt
context. Valid while canvas_length <= sliding_window (sliding == full attn).
The soft-conditioning input path (later steps) and encoder-KV cross-attention
(prompted generation) come next.

Verified: builds clean; loads the 50.5 GB GGUF with no KV cache and runs a full
forward (self_cond_input node + bidirectional attention + dual-FFN MoE +
layer_scalar + softcapped logits).
Adds the self_conditioning gated MLP to the decoder input:
  sc_input = post_norm(inputs_embeds + down(gelu(gate(pre_norm(soft))) * up(pre_norm(soft))))
using the self_cond_norm/gate/up/down weights. The soft-embeddings input is a zero
placeholder for now (the block-diffusion sampler will feed softmax(prev_logits) @ embed
per denoising step); with soft = 0 this is numerically identical to the verified
first-step behaviour (rms_norm(0) = 0 -> sc_signal = 0 -> scale-less post-norm of the
scaled embeddings), so no regression, but the self_cond weights are now used in-graph.

First slice of the self-conditioning + block-diffusion-sampler unit; the runtime
soft-embeddings feed (a settable input channel) lands with the sampler.
Adds llama-diffusion-gemma4-cli implementing the reference block-diffusion loop:
random canvas init, per-step full-canvas decode, linear temperature schedule
(0.4 -> 0.8), entropy-bound token acceptance, renoise of non-accepted positions,
and stable-and-confident stopping. Drives the bidirectional no-KV-cache graph.

Scope: unconditioned generation with self-conditioning = 0 (the prompt is not used
yet). The loop runs end-to-end and denoises as expected -- over the steps the mean
token entropy falls (3.12 -> 0.15) and the accepted-token count rises (1 -> 7/16),
i.e. the canvas converges. Self-conditioning feedback and prompt/encoder-KV
conditioning are layered on next.

DG4_CANVAS / DG4_STEPS env vars override canvas length / step count for testing.
Adds a per-decode soft-conditioning input so the block-diffusion sampler can feed
the previous denoising step's token probabilities back into the decoder.

- core: llama_diffusion_cond + llm_graph_input_diffusion_self_cond + build_inp_*,
  threaded through llm_graph_params/llm_graph_context exactly like llama_cross;
  public API llama_set_diffusion_self_cond(ctx, probs, n_vocab, n_tokens).
- graph: soft_embeddings = (probs @ token_embd) * sqrt(n_embd) -> self_cond MLP ->
  added to the input embeddings (replaces the zero placeholder). Empty buffer ->
  zeros == the verified first-step behaviour, so no regression.
- sampler: feeds softmax(processed_logits) each step for the next decode; cleared
  for step 0.

Runs end-to-end; step 0 is unchanged (zero self-cond), later steps use the feedback.
Full numerical verification vs PyTorch self_conditioning is pending. Perf TODO: the
in-graph embedding transpose for the probs@embed matmul is recomputed per decode.
Adds prompted block-diffusion generation as a single-pass [prompt ; canvas] forward.

- core: llm_graph_input_attn_no_cache_prefix + build_attn_inp_no_cache_prefix(n_prompt)
  build a prefix mask (prompt attends causally within the prompt; canvas attends to
  everything -> bidirectional + cross to the prompt). n_prompt is carried on
  llama_diffusion_cond and set via llama_set_diffusion_prompt_len().
- graph: prompt rows use the raw scaled embeddings (encoder; no self-conditioning /
  post-norm); canvas rows use the self-conditioned input. Uses the prefix attention
  when n_prompt > 0, else fully-bidirectional no-cache.
- sampler: tokenizes the prompt, builds [prompt ; canvas] each step, requests logits
  for the canvas positions only, and feeds self-cond probs over the full sequence
  (prompt columns zero).

Result: prompted generation runs end-to-end. "The capital of France is" denoises the
canvas toward relevant tokens (including "Paris"). Output quality is still rough
(repetition/filler); refinement is ongoing (self-conditioning numerical verification,
sampler tuning, full canvas/steps, reconvert from the current v5 source).

Notes: encoder and decoder layer_scalars are identical in this checkpoint, so no
separate encoder scalars are needed. The prefix mask currently assumes
n_tokens <= sliding_window (sliding == full); long prompts need a windowed prefix mask.
This is a chat-trained model (turn/channel special tokens); feeding raw text gives
poor results. Format the user prompt with the model's chat template
(common_chat_templates) before tokenizing, parsing special tokens.

HF reference (chat-formatted) answers cleanly ("The capital of France is **Paris**."
then <pad> filler). With chat formatting the canvas now denoises toward prompt-relevant
tokens, though full convergence still needs work (canvas size / sampler dynamics).
In the no-KV-cache path llama.cpp prunes tokens that are not requested as outputs.
The sampler marked only the canvas tokens as outputs (logits=1), so the prompt
tokens were pruned from the attention and the canvas never attended to the prompt
-- generation was effectively unconditioned (identical output for different prompts).

Fix: request logits for all [prompt; canvas] tokens and read only the canvas rows
(offset n_prompt). The prompted forward now matches the HF reference (pos-0 top-5
logits agree to bf16 precision; output is prompt-dependent), and generation produces
the correct structure and reasoning (thought channel -> "The capital of France is ...").

Residual: filler positions don't fully converge to <pad> (sampler-dynamics polish).
After the denoising loop, decode the converged canvas once more and emit the greedy
argmax per canvas position instead of the accumulated `accepted` array. Positions that
were never accepted by the entropy-bound sampler carried stale/renoised tokens, which
showed up as garbage in the output; reading the model's argmax given the settled answer
cleans them up (filler -> <pad>).

With this, prompted generation produces correct, coherent answers, e.g.:
  "What is the capital of France?" ->
  <|channel>thought
  The user is asking for the capital of France.
      * Country: France
      * Capital: Paris
  The capital of France is Paris.
The model answers inside a "<|channel>thought ... <channel|>" block followed by the
response, and the fixed 256-canvas tail repeats the answer. Print the full canvas for
reference, then extract the final response (tokens after the last <channel|>, truncated
at the first end-of-generation token) and drop a trailing exact-duplicate.

"What is the capital of France?" now yields a clean:
  === answer ===
  The capital of France is Paris.
The CLI passed common_params.n_gpu_layers (-1) straight to the model loader without
resolving it, so it ran on CPU even in a CUDA build. Default to offloading all layers
(999) when -ngl is not given; -ngl N limits offload and -ngl 0 forces CPU. No effect in
a CPU-only build. With -DGGML_CUDA=ON the model now runs on the GPU by default.
Encode the prompt (and previously-finalized canvases) once into the unified
sliding-window KV cache and reuse it, instead of re-encoding [prompt; canvas]
on every denoising step.

- Encoder phase (causal, no self-conditioning): prefill the prompt / commit a
  finalized canvas into the cache; its K/V becomes a read-only prefix.
- Decoder phase (bidirectional, self-conditioned): each denoising step decodes
  only the canvas against the cached prefix, then rolls back its own K/V.
- Multi-block autoregressive loop: commit each finalized canvas and advance the
  cache pointer by canvas_length.

Two graph variants share the gemma4 transformer body: a single phase-branching
graph (default) and a separate encoder/decoder pair (DG4_SEPARATE_ENC_DEC).

Also:
- enable the iswa KV cache for the arch (was res=nullptr), reusing the gemma4
  layer-reuse / has_kv handling;
- precompute a transposed F32 token embedding once at load (load_arch_post) for
  the self-conditioning soft-embedding matmul, avoiding a per-decode dequantize
  + transpose of the whole embedding;
- add a per-decode phase selector API: llama_set_diffusion_decoder_phase.

Verified on CPU and CUDA (partial and full GPU offload): correct prompted
generation, multi-block coherence, and clean entropy-bound convergence.
Update the block-diffusion Gemma port to the v7 checkpoint, which renames the
text architecture diffusion_gemma4 -> diffusion_gemma and adds a gemma4 vision
tower (model_type gemma4_vision) + projector for image input.

Rename:
- DIFFUSION_GEMMA4 -> DIFFUSION_GEMMA; arch string "diffusion-gemma4" ->
  "diffusion-gemma"; model class llama_model_diffusion_gemma; converter class
  DiffusionGemmaModel registered for "DiffusionGemmaForBlockDiffusion".
- renamed files src/models/diffusion-gemma.cpp, examples/diffusion-gemma/, and
  the CLI env knobs (DG_*).

Multimodal (vision-only; the v7 diffusion checkpoint has no audio tower):
- new DiffusionGemmaVisionModel mmproj converter reusing the existing GEMMA4V
  vision export: strips the v7 model.encoder.* nesting, skips audio, registered
  in MMPROJ_MODEL_MAP. The clip.cpp GEMMA4V encoder is reused unchanged.
- diffusion-gemma CLI gains --mmproj/--image (enabled for LLAMA_EXAMPLE_DIFFUSION):
  the image marker is tokenized via libmtmd and the 280 GEMMA4V vision embeddings
  are fed into the diffusion encoder-phase prefill (mtmd_helper_eval_chunks); the
  canvas is then denoised against the cached prefix.

Verified on CUDA (RTX 5090): text Q4_K_M answers correctly, and image+text
produces an accurate, OCR-level description via the GEMMA4V vision tower.
…ne-argmax output

- -n / --n-predict now sets the number of 256-token canvas blocks
  (max_canvases = ceil(n_predict / canvas_length)); canvas_length is fixed at the
  trained block size (256). DG_MAX_CANVASES still overrides; n_ctx auto-sizes.
- print a generation timing summary at the end (blocks, denoising steps, canvas
  tokens, wall-clock, canvas tok/s, s/step, answer tokens).
- emit each block via the inline argmax of the last (stable) denoising step,
  matching the reference DiffusionGemma _denoising_step, instead of a separate
  read-out over a stale `accepted` scratch buffer. Fixes the unconverged canvas
  tail: never-accepted high-entropy positions no longer carry stale-random tokens
  into the output / committed prefix. Commit also uses the inline argmax.

Verified on CUDA: multi-block generation (e.g. -n 1536 -> 6 blocks, with
entropy-bound early-stopping) produces coherent, accurate prose and reports timing.
…ded backend

The self-conditioning soft-embedding does a full-vocabulary matmul
(softmax(prev_logits) @ token_embd) with the precomputed F32 tok_embd_t every
decode. token_embd is normally host-resident (it is only used for a cheap
get_rows lookup in AR models), and tok_embd_t inherited that host buffer, so the
scheduler copied the whole ~2.75 GiB tensor across PCIe on every forward,
dominating per-step time.

Allocate tok_embd_t on a non-host (offloaded) backend taken from a layer weight
when one is offloaded, so the matmul runs on-device with no per-decode copy; fall
back to token_embd's buffer for CPU-only runs.

Measured on RTX 5090 (Q4_K_M, -ngl 99, pp512/tg256 @ d512):
  pp512  1339 -> 3530 t/s (2.6x)
  tg256  4.84 -> 148.8 t/s (30.7x)
real generation ~0.43 s/step (was ~0.6-0.8); output unchanged.
Time the prompt prefill (encoder phase: causal, no self-conditioning) separately
from the denoising loop and print tokens / seconds / tok-per-second. Makes the
real prefill cost visible (it skips the self-cond matmul, unlike a decoder-phase
forward), so it is not conflated with the per-step denoising cost.
…) with k-annealing

The decoder/denoise phase computed softmax + entropy + self-conditioning over the
full 262k vocabulary on the host every step (~67M exp/log per step), which
dominated the per-step wall-clock. Add an opt-in top-k path: per canvas position,
select the top-k logits (size-k min-heap in one scan), then do softmax / entropy /
multinomial-sample / sparse self-conditioning over just those k. The self-cond
buffer is sparse (top-k filled), so the existing in-graph soft-embedding matmul
blends only the top-k embeddings (the dropped tail carries negligible embedding
weight; the following RMS norm absorbs the scale).

Knobs (CLI flags; default = full softmax, behaviour unchanged):
- --top-k N                 : fixed top-k per position (0 = full softmax)
- --top-k-start/--top-k-end : anneal k from high (first/high-entropy step) to low
                              (last step) -- early canvases are flat (need many
                              tokens), late ones are peaked (a few suffice)
- --top-k-tail-correction   : use the exact full-vocab entropy (logsumexp) for the
                              accept/stop signal instead of the under-estimating
                              top-k entropy (top-k truncation deflates entropy)

Also report encoder-phase prefill timing separately, and add the generation
timing summary. RTX 5090 (Q4_K_M, -ngl 99): ~2.6x faster per denoising step with
top-k (no correction); output preserved on the tested prompts.

This is host-side only -- it reuses the existing self-conditioning channel and
graph. The fuller graph-side variant (gather top-k embedding rows + a dedicated
top-k self-cond input, dropping the full-vocab matmul and the F32 transposed
embedding) is left as a follow-up.
Replace the dense full-vocab self-conditioning soft-embedding
(probs @ token_embd over all 262144 rows) with a sparse gather of only
the previous step's top-k token embeddings, blended by their
probabilities. The CLI feeds the top-256 (id, prob) per canvas position
each denoising step via a new API; the decoder graph gathers those rows
(ggml_get_rows), prob-weights and sums them (x sqrt(n_embd)).

The gather embedding (tok_embd_gpu) is stored as F16 on the offloaded
backend. F16 (not the native Q4_K) is required because CUDA get_rows has
no Q4_K/Q6_K kernel -- a quantized gather falls back to CPU every step
and is a large regression. F16 keeps the gather on-device and halves its
VRAM vs the F32 dense transpose (1.47 GB vs 2.75 GB).

Measured (RTX 5090, Q4_K_M, France prompt): --top-k 64 ~0.13 s/step vs
0.158 dense (~1.2x faster), ~1.3 GB less VRAM; k=0 at parity (host
softmax dominated). Output unchanged.

The dense matmul path is retained as a fallback when the F16 gather copy
cannot be allocated.

API:
  - llama_set_diffusion_self_cond_topk(ctx, ids, probs, k, n_tokens)
  - llm_graph_input_diffusion_self_cond_topk + build_inp_diffusion_self_cond_topk
Add examples/diffusion-gemma/diffusion-gemma-server.cpp (target
llama-diffusion-gemma-server): loads a block-diffusion model once and serves
the same denoising generation loop as the CLI over HTTP, with the llama-server
observability surface re-mapped to diffusion semantics.

Endpoints: GET /health, /v1/health, /props, /v1/models, /models,
/slots (--slots), /metrics (--metrics); POST /v1/chat/completions (+ SSE
stream) and /v1/completions.

Responses carry the llama-server `timings` object (prompt_n/prompt_ms,
predicted_n/predicted_ms, ...) extended with a `diffusion` sub-object
(n_blocks, n_steps, canvas_tokens, ms_per_step, steps_per_second,
canvas_tokens_per_second, n_decode), plus the OpenAI `usage` object. Per-request
llama_perf-style timing logs, an access log, and a llama-server-style startup
banner are emitted. Prometheus /metrics exposes prompt/predicted token counters
and gauges alongside diffusion_blocks/steps/canvas_tokens totals and rate gauges.

Generation is serialized behind a mutex (one slot; the context is single-
threaded). Server-only flags (--host, --port, --api-key, --metrics, --slots)
are stripped before common_params_parse so all diffusion flags still parse.
Request fields: max_tokens -> canvas blocks, top_k, seed, ignore_eos (run the
full block count for sustained long-generation benchmarking).
@lnigam lnigam requested review from a team, CISC and ggerganov as code owners June 10, 2026 16:36
@github-actions github-actions Bot added model Model specific Nvidia GPU Issues specific to Nvidia GPUs examples python python script changes ggml changes relating to the ggml tensor library for machine learning labels Jun 10, 2026
@gaugarg-nv gaugarg-nv mentioned this pull request Jun 10, 2026
@Ajay9o9

Ajay9o9 commented Jun 10, 2026

Copy link
Copy Markdown

Tried the latest nvidia-diffusion-gemma branch (e1fc5359f) with diffusiongemma-26B-A4B-it-Q4_K_M.gguf and hit a load error:

i'm on
Branch: nvidia-diffusion-gemma
Commit: e1fc535

Model from Unsloth
diffusiongemma-26B-A4B-it-Q4_K_M.gguf

loading fails with

/repos/llama-diffusion-gemma$ ./build/bin/llama-diffusion-gemma-cli \
  -m "/media/gemma4/diffusiongemma-26B-A4B-it-Q4_K_M.gguf" \
  -p "Hello"
  
ggml_cuda_init: found 1 CUDA devices (Total VRAM: 11889 MiB):
  Device 0: NVIDIA GeForce RTX 3060, compute capability 8.6, VMM: yes, VRAM: 11889 MiB
0.00.624.210 W load: control-looking token:    212 '</s>' was not control-type; this is probably a bug in the model. its type will be overridden
0.00.624.556 W load: control-looking token:     50 '<|tool_response>' was not control-type; this is probably a bug in the model. its type will be overridden
0.00.642.077 W load: special_eog_ids contains '<|tool_response>', removing '</s>' token from EOG list
0.00.861.710 E llama_model_load: error loading model: missing tensor 'self_cond_norm.weight'
0.00.861.719 E llama_model_load_from_file_impl: failed to load model
0.00.861.720 E error: failed to load model '/media/gemma4/diffusiongemma-26B-A4B-it-Q4_K_M.gguf'
missing tensor 'self_cond_norm.weight'

GGUF, inspection shows :

self_cond_down.weight
self_cond_gate.weight
self_cond_pre_norm.weight
self_cond_up.weight

but not:

self_cond_norm.weight

The runtime appears to expect:

{ LLM_TENSOR_SELF_COND_NORM, "self_cond_norm" }

from llama-arch.cpp

I also checked the current branch and couldn't find any references to self_cond_pre_norm in the runtime:

grep -R "self_cond_pre_norm" src/ -n

returns nothing.

Is this a naming mismatch?

@Jakeshadow

Copy link
Copy Markdown

Nice, been waiting for this. For anyone building from this PR — heads up that Apple Silicon won't see the speedup. It's a compute-bound model, needs a high-CUDA-core GPU. Detailed GPU requirements here: https://diffusiongemma.dev

@github-actions github-actions Bot added the testing Everything test related label Jun 11, 2026
@lnigam

lnigam commented Jun 11, 2026

Copy link
Copy Markdown
Contributor Author

@Ajay9o9 Now this PR works for unsloth model.

@mohamed-em2m

Copy link
Copy Markdown

llama-diffusion-gemma-server don't appear in builds

@lnigam

lnigam commented Jun 11, 2026

Copy link
Copy Markdown
Contributor Author

@mohamed-em2m they are building in my local and I also can see it in the CI logs. Are there any errors while building?

Keep diffusion decoder input handling limited to Diffusion Gemma so other architectures use standard graph inputs.

Assisted-by: Codex
@theIvanR

theIvanR commented Jun 11, 2026

Copy link
Copy Markdown

Awesome job! I will have a look later at the code. Proposal: make the code modular with a clear fetcher, and builder(s).

For the fetcher do something like this:

git clone --depth 1 https://github.com/ggml-org/llama.cpp
cd llama.cpp

git fetch --depth 1 origin pull/24427/head:pr24427
git checkout pr24427

While for the rest, something similar to how I did it here:
https://github.com/theIvanR/lmstudio-unlocked-backend/tree/main/Generate%20Backends/Windows

EDIT:
I tried with this cpu only builder and something is broken:

C:\Users\Admin\source>talk_to_gemma.cmd
warning: no usable GPU found, --gpu-layers option will be ignored
warning: one possible reason is that llama.cpp was compiled without GPU support
warning: consult docs/build.md for compilation instructions
0.00.000.889 I diffusion-gemma: GGML_CUDA_MMQ_MAX_X=64
0.01.195.906 W load: control-looking token:     50 '<|tool_response>' was not control-type; this is probably a bug in the model. its type will be overridden
0.01.196.466 W load: control-looking token:    212 '</s>' was not control-type; this is probably a bug in the model. its type will be overridden
0.01.252.519 W load: special_eog_ids contains '<|tool_response>', removing '</s>' token from EOG list
[SUCCESS] Done
@echo off
setlocal

set "EXE=C:\Users\Admin\source\llama.cpp\build_cpu\bin\llama-diffusion-gemma-cli.exe"
set "MODEL=C:\Users\Admin\Documents\LLM Models\unsloth\diffusiongemma-26B-A4B-it-GGUF\diffusiongemma-26B-A4B-it-Q4_K_M.gguf"

if not exist "%EXE%" (
    echo [ERROR] CLI not found: "%EXE%"
    exit /b 1
)

if not exist "%MODEL%" (
    echo [ERROR] Model not found: "%MODEL%"
    exit /b 1
)

"%EXE%" ^
  -m "%MODEL%" ^
  -p "Answer in about 1000 words: explain block diffusion generation and CUDA sampling optimizations." ^
  -n 1024 ^
  -c 8096 ^
  -ngl 999 ^
  --diffusion-steps 48 ^
  --diffusion-cuda-mmq-max-x 64

if errorlevel 1 (
    echo [ERROR] Command failed
    exit /b 1
)

echo [SUCCESS] Done
exit /b 0

EDIT 2: trying on gpu with my builder and this directory:

set "EXE=C:\Users\Admin\source\llama.cpp\build_gpu_cuda\bin\llama-diffusion-gemma-cli.exe"

It did something:

PS C:\Users\Admin\source> ./talk_to_gemma.cmd
ggml_cuda_init: found 1 CUDA devices (Total VRAM: 8191 MiB):
  Device 0: NVIDIA GeForce GTX 1070 Ti, compute capability 6.1, VMM: yes, VRAM: 8191 MiB
0.00.000.655 I diffusion-gemma: GGML_CUDA_MMQ_MAX_X=64
0.01.348.419 W load: control-looking token:     50 '<|tool_response>' was not control-type; this is probably a bug in the model. its type will be overridden
0.01.349.167 W load: control-looking token:    212 '</s>' was not control-type; this is probably a bug in the model. its type will be overridden
0.01.434.171 W load: special_eog_ids contains '<|tool_response>', removing '</s>' token from EOG list
0.17.762.433 I formatted prompt: <|turn>system
<|think|>
<turn|>
<|turn>user
Answer in about 1000 words: explain block diffusion generation and CUDA sampling optimizations.<turn|>
<|turn>model

0.17.762.850 W llama_context: n_ctx_seq (8192) < n_ctx_train (262144) -- the full capacity of the model will not be utilized
0.17.763.676 W llama_kv_cache_iswa: using full-size SWA cache (ref: https://github.com/ggml-org/llama.cpp/pull/13194#issuecomment-2868343055)
0.18.941.717 I diffusion-gemma: prefix=34 canvas=256 max_canvases=4 steps=48 entropy_bound=0.100 temp=[0.40,0.80] n_ctx=8096 mm=0
0.18.941.728 I diffusion-gemma: gpu sampling: on | device self-cond: on | device loop: on
0.18.941.728 I diffusion-gemma: device early-stop interval=1
0.23.787.601 I prefill (encoder, no self-cond): 34 tokens in 4.846 s (7.0 tok/s)

@lnigam

lnigam commented Jun 11, 2026

Copy link
Copy Markdown
Contributor Author

Latest benchmark numbers on RTX 5090 using my benchmark

Server diffusion benchmark

ready ms: 9080.63
runs kept: 10
request wall ms avg/min/max: 5461.17 / 5142.71 / 5563.61
predicted ms avg: 5402.404
canvas tok/s avg: 189.650
steps/s avg: 34.984
ms/step avg: 28.585
mean denoising steps/block: 47.250

Using aiperf (speedbench)
command: aiperf profile -m "diffusiongemma-26B-A4B-it-Q4_K_M" --concurrency 1 --tokenizer "google/gemma-4-26B-A4B-it" --endpoint-type chat -u http://127.0.0.1:18081 --artifact-dir ./gemma-diffsion-5090-topk0 --public-dataset speed_bench_throughput_2k --osl 1024 --extra-inputs "max_tokens:1024,min_tokens:1024,ignore_eos:true,top_k:0" --osl-stddev 0 --ui-type none --streaming --request-count 100 --warmup-request-count 10 --use-server-token-count --request-timeout-seconds 1200

image

@coder543

Copy link
Copy Markdown

Google reported 700 tokens per second on RTX 5090, so I guess there is a lot of room for optimization here

@lnigam

lnigam commented Jun 11, 2026

Copy link
Copy Markdown
Contributor Author

@theIvanR Can you try the CPU only build with -DGGML_CPU_REPACK=OFF
I dont have 1070 Ti with me. For GPU build can you try following:

%EXE% ^
-m "%MODEL%" ^
-p "Briefly explain block diffusion." ^
-n 64 ^
-c 1024 ^
-ngl 8 ^
--no-kv-offload ^
--diffusion-steps 2 ^
--diffusion-cuda-mmq-max-x 0 ^
--top-k 256 ^
--no-diffusion-device-denoise-loop

@lnigam

lnigam commented Jun 11, 2026

Copy link
Copy Markdown
Contributor Author

@coder543 It depends mostly on the number of denoising steps needed for the canvas to converge. For coding prompts, model converges generally in around 20 steps while the above benchmark took on an average 34 steps to converge.
diffusiongemma-Q8_0 model is performing relatively better than Q4_KM. and with coding speedbench it is reaching around 300 t/s.
image

For some coding prompts it is reaching upto 534 t/s
image

@mohamed-em2m

Copy link
Copy Markdown

@mohamed-em2m they are building in my local and I also can see it in the CI logs. Are there any errors while building?

it's appear now

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

examples ggml changes relating to the ggml tensor library for machine learning model Model specific Nvidia GPU Issues specific to Nvidia GPUs python python script changes testing Everything test related

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants