DiffusionGemma by danielhanchen · Pull Request #24423 · ggml-org/llama.cpp

danielhanchen · 2026-06-10T15:56:37Z

Worked on prelim Diffusion Gemma support!

Has normal chat similar to llama-cli via llama-diffusion-cli -cnv -n 2048
Has a visualization method to show diffusion live via llama-diffusion-cli -cnv -n 2048 --diffusion-visual

To try this PR:

git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
gh pr checkout 24423
cmake -B build -DGGML_CUDA=ON
cmake --build build -j --config Release --target llama-diffusion-cli

then use a GGUF (any can work but for eg)

pip install -U "huggingface_hub[cli]"
hf download unsloth/diffusiongemma-26B-A4B-it-GGUF \
    --local-dir unsloth/diffusiongemma-26B-A4B-it-GGUF \
    --include "*Q8_0*" # Use "*Q4_K_M*" for a smaller 16 GB download

then use chat or visualization:

./build/bin/llama-diffusion-cli \
  -m unsloth/diffusiongemma-26B-A4B-it-GGUF/diffusiongemma-26B-A4B-it-Q8_0.gguf \
  -ngl 99 -cnv -n 2048

or

./build/bin/llama-diffusion-cli \
  -m unsloth/diffusiongemma-26B-A4B-it-GGUF/diffusiongemma-26B-A4B-it-Q8_0.gguf \
  -ngl 99 -cnv -n 2048 --diffusion-visual

Example below (a bit blurry to limit to 10MB on Github :()

Disclaimer Heavy usage of AI, but verified logits matching with transformers, checked FP16 vs FP32 KV cache, long context checks and much more

Some diffusion cli and visual updates

ggml-gh-bot · 2026-06-10T16:01:39Z

Hi @danielhanchen, thanks for your contribution!

Per our contribution guidelines, the automated PR checker found the following issue(s) that need your attention:

Large PR: Large changes require prior discussion (e.g. an issue or RFC) and maintainers may not be able to review this PR as-is. Consider splitting it into smaller, focused PRs.

Please note that maintainers reserve the right to make final decisions on PRs. If you believe there is a mistake, please comment below.

pwilkin · 2026-06-10T16:38:47Z

Oof, that's a big one.

There's a ton of debugging stuff left in there that needs throwing out, for one. I'm also not convinced about the idea to make a server just for one model - I think if we're intending to support diffusion models in a server mechanism, it should be a general diffusion-server (but that's just my opinion, probably have to wait for what @ggerganov thinks about this one).

danielhanchen · 2026-06-10T16:42:31Z

Haha sorry - this PR was more of a direct translation / proof of concept that it works!

danielhanchen · 2026-06-10T16:42:44Z

I'll edit the PR - sorry we're juggling multiple things haha

gaugarg-nv · 2026-06-10T16:43:20Z

Another PR for DiffusionGemma: #24427

CISC · 2026-06-10T16:50:14Z

You have some failing tests to fix. :)

coder543 · 2026-06-10T16:50:37Z

if we're intending to support diffusion models in a server mechanism, it should be a general diffusion-server

With a block diffusion model, couldn't the regular server just return each block when it is finished diffusing? It would be nice to just have one server, and the API could remain fully compatible so clients don't need to be aware that they're dealing with a diffusion model. (We don't show the distribution of sampled logits for AR models, and I don't see why people would need to see the intermediate diffusion steps either, since those won't be useful.)

quasar-of-mikus · 2026-06-10T16:56:48Z

Doesn't build on Windows:

[421/433] Building CXX object examples\diffusion\CMakeFiles\llama-diffusion-cli.dir\diffusion-cli.cpp.obj
FAILED: examples/diffusion/CMakeFiles/llama-diffusion-cli.dir/diffusion-cli.cpp.obj
ccache C:\PROGRA~1\MICROS~3\2022\COMMUN~1\VC\Tools\Llvm\x64\bin\clang-cl.exe  /nologo -TP -DGGML_BACKEND_SHARED -DGGML_SHARED -DGGML_USE_CPU -DGGML_USE_CUDA -DLLAMA_SHARED -D_CRT_SECURE_NO_WARNINGS -IC:\Textgen\llama.cpp\src\..\include -IC:\Textgen\llama.cpp\ggml\src\..\include -IC:\Textgen\llama.cpp\common\. -IC:\Textgen\llama.cpp\common\..\vendor /DWIN32 /D_WINDOWS /EHsc /O2 /Ob2 /DNDEBUG -std:c++17 -MD /utf-8 /bigobj /showIncludes /Foexamples\diffusion\CMakeFiles\llama-diffusion-cli.dir\diffusion-cli.cpp.obj /Fdexamples\diffusion\CMakeFiles\llama-diffusion-cli.dir\ -c -- C:\Textgen\llama.cpp\examples\diffusion\diffusion-cli.cpp
C:\Textgen\llama.cpp\examples\diffusion\diffusion-cli.cpp(10,10): fatal error: 'sys/ioctl.h' file not found
   10 | #include <sys/ioctl.h>
      |          ^~~~~~~~~~~~~
1 error generated.

danielhanchen · 2026-06-10T16:57:01Z

Yep will fix haha - I also added a short GIF of it working edited in description!

…s, drop debug hooks - guard sys/ioctl.h behind _WIN32 and add a GetConsoleScreenBufferInfo fallback for the visual viewport size, so diffusion-cli builds on Windows - skip diffusion-gemma in test-llama-archs like gemma4 (shared ISWA backbone, no synthetic fixture params yet) - remove the DG_DUMP_KV_LAYER / DG_NSWA debug scaffolding and its llama.h API - fix flake8 E306 in conversion/diffusion_gemma.py

stepfunction83 · 2026-06-10T19:53:44Z

I was able to compile it successfully on Linux for my 4090, but when running it, I get the following error after sending a user message:

0.13.483.785 E ggml_cuda_compute_forward: SOFT_MAX failed
0.13.483.796 E CUDA error: invalid argument
0.13.483.798 E   current device: 0, in function ggml_cuda_compute_forward at /home/LLM/DiffusionGemma/llama.cpp/ggml/src/ggml-cuda/ggml-cuda.cu:3163
0.13.483.798 /home/LLM/DiffusionGemma/llama.cpp/ggml/src/ggml-cuda/ggml-cuda.cu:103: CUDA error
E   err
/home/LLM/DiffusionGemma/llama.cpp/build/bin/libggml-base.so.0(+0x1c1ab)[0x7bb8567401ab]
/home/LLM/DiffusionGemma/llama.cpp/build/bin/libggml-base.so.0(ggml_print_backtrace+0x21c)[0x7bb85674062c]
/home/LLM/DiffusionGemma/llama.cpp/build/bin/libggml-base.so.0(ggml_abort+0x15b)[0x7bb85674080b]
/home/LLM/DiffusionGemma/llama.cpp/build/bin/libggml-cuda.so.0(_Z15ggml_cuda_errorPKcS0_S0_iS0_+0xb7)[0x7bb853260997]
/home/LLM/DiffusionGemma/llama.cpp/build/bin/libggml-cuda.so.0(+0x27a810)[0x7bb85327a810]
/home/LLM/DiffusionGemma/llama.cpp/build/bin/libggml-base.so.0(ggml_backend_sched_graph_compute_async+0x817)[0x7bb85675def7]
/home/LLM/DiffusionGemma/llama.cpp/build/bin/libllama.so.0(_ZN13llama_context13graph_computeEP11ggml_cgraphb+0xa1)[0x7bb855ee04c1]
/home/LLM/DiffusionGemma/llama.cpp/build/bin/libllama.so.0(_ZN13llama_context14process_ubatchERK12llama_ubatch14llm_graph_typeP22llama_memory_context_iR11ggml_status+0x114)[0x7bb855ee2c94]
/home/LLM/DiffusionGemma/llama.cpp/build/bin/libllama.so.0(_ZN13llama_context6encodeERK11llama_batch+0x240)[0x7bb855ee6c80]
/home/LLM/DiffusionGemma/llama.cpp/build/bin/libllama.so.0(llama_decode+0xf)[0x7bb855eec1ef]
./build/bin/llama-diffusion-cli(+0x1941c)[0x5a66e697341c]
./build/bin/llama-diffusion-cli(+0x8143)[0x5a66e6962143]
./build/bin/llama-diffusion-cli(+0x60e4)[0x5a66e69600e4]
/lib/x86_64-linux-gnu/libc.so.6(+0x2a1ca)[0x7bb85562a1ca]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x8b)[0x7bb85562a28b]
./build/bin/llama-diffusion-cli(+0x6a45)[0x5a66e6960a45]
Aborted (core dumped)

The command I'm running is:

 CUDA_VISIBLE_DEVICES="" ./build/bin/llama-diffusion-cli \
  -m unsloth/diffusiongemma-26B-A4B-it-GGUF/diffusiongemma-26B-A4B-it-Q4_K_M.gguf \
  -ngl 99 -cnv -n 2048 --system-prompt-file sysprompt.txt \
  --diffusion-eb auto \
  --diffusion-eb-max-steps 48 \
  --diffusion-eb-t-max 1.0 \
  --diffusion-eb-t-min 0.6 \
  --diffusion-visual

To compile it, I used:

rm -rf build
cmake -B build -DGGML_CUDA=ON -DCMAKE_CUDA_COMPILER=$(which nvcc)
cmake --build build -j --config Release --target llama-diffusion-cli

quasar-of-mikus · 2026-06-10T20:06:24Z

Yep will fix haha - I also added a short GIF of it working edited in description!

Builds and runs on Windows now.
1x 3090 Q4KM: time per step: 326.13ms
2x 3090 Q8_0: time per step: 878.83ms

arkham000 · 2026-06-10T20:10:32Z

is it only cli at this point? no llama-server ?

kroaton · 2026-06-10T20:16:25Z

Thank you for putting this together! An Issue I found is that --fit doesn't work with this PR.

icedream · 2026-06-10T21:54:09Z

Test run on my system with AMD hardware (7900 XTX) in it, Q4_K_M - time per step: 364.45ms

Screencast_20260610_234032_c.webm

lucasbinder · 2026-06-10T21:57:15Z

@icedream What (equivalent) tokens per second are you getting? I also tried running it with an AMD GPU (R9700 w/ vulkan) and only got ~27t/s.

icedream · 2026-06-10T22:24:59Z

@lucasbinder Not 100% sure if that's the right way to calculate it but based on two more runs with the same prompt, calculating with 256 tokens per full canvas diffused (I left out the last canvas as tail end of response), taking the start/end timings per canvas from the --verbose output:

Run 1

2.327213 - 9.803998 = 7.476785 = 34.24 t/s
10.119182 - 17.995429 = 7.876247 = 32.50 t/s
18.412472 - 23.738067 = 5.325595 = 48.07 t/s
24.261660 - 32.754977 = 8.493317 = 30.14 t/s
33.345178 - 51.575057 = 18.229879 = 14.04 t/s

Run 2

2.328457 - 9.777839 = 7.449382 = 34.37 t/s
10.092776 - 17.935576 = 7.8428 = 32.64 t/s
18.349154 - 23.654470 = 5.305316 = 48.25 t/s
24.176041 - 32.655410 = 8.479369 = 30.19 t/s
33.245244 - 51.415716 = 18.170472 = 14.09 t/s

(Also I should clarify I used ROCm, not Vulkan in my case so that may be influencing the performance as well.)

gbuznote-beep · 2026-06-11T01:49:10Z

Unofficial prebuilt binaries for anyone who wants to test this PR without setting up a CUDA toolchain:

https://github.com/gbuznote-beep/llama-diffusion-cli-prebuilt

Linux x86_64 / WSL2 — CUDA 12.8, sm_86 (RTX 30-series / A4000–A6000), glibc ≥ 2.39, self-contained (cudart/cublas/cublasLt/nccl bundled)
Windows x64 — CPU-only
Pinned to c84e85a; SHA256SUMS + reproducible build scripts included (other GPU archs rebuild in ~10 min)

Data points from testing (256 tokens, EB sampler): A5000 full-GPU 0.98 s/step; RTX 3070 Ti Laptop 8 GB via WSL2 (-ngl 99 --n-cpu-moe 22) 5.9 s/step; i7-12700H CPU-only 17.1 s/step. Output quality looks coherent (thinking-style drafts + self-critique). Thanks @danielhanchen for the implementation!

csabakecskemeti · 2026-06-11T04:04:54Z

Results on RTX PRO6000 + 5090 (I could have been used only the 6000)
google.diffusiongemma-26B-A4B-it.Q4_K_M.gguf -ngl -1 -cnv -n 2048
total time: 22329.51ms, time per step: 192.50ms (116 steps over 6 blocks, entropy-bound)

Not sure if I calculating the t/s correctly
tokens = blocks_completed * canvas_length = 6 * 256 = 1536 tokens
t/s = 1536 / (22329.51 / 1000) = 1536 / 22.33 ≈ 68.8 t/s

…to model params) The CLI hand-builds llama_model_params and never copied tensor_buft_overrides, so -ot and --n-cpu-moe were parsed but silently dropped - the MoE experts stayed on the GPU and OOMed small-VRAM cards. Mirror common_model_params_to_llama.

… /clear - --diffusion-gpu-sampling {auto,on,off} (default auto = on for single-GPU): keep the prev step's canvas logits in a device buffer (sc_dev) and read self-conditioning from it instead of a 268 MB host upload each step. SC inputs are bit-identical to the host path; auto-disables on multi-GPU like --diffusion-kv-cache. ~1.3x per step. - cli: add effective + in-step-parallel throughput to the timing summary. - cli: add /help and /clear in conversation mode.

danielhanchen · 2026-06-11T07:06:44Z

Throughput increased from 1461 tokens / s to 1831 tok/s! (1.25x faster) with 0 change in accuracy! Tested on B200x1 Q8_0 quant

Also added a new section so it's more clear on numbers:

total time: 19298.76ms, time per step: 139.85ms (138 steps over 11 blocks, entropy-bound)
throughput: 145.9 tok/s (2816 tok in 19298.76ms), in-step parallel 1831 tok/s (256-tok canvas x 12.5 steps/block)

Before:

total time: 30142.16ms, time per step: 175.25ms (172 steps over 12 blocks, entropy-bound)
throughput: 101.9 tok/s (3072 tok in 30142.16ms), in-step parallel 1461 tok/s (256-tok canvas x 14.3 steps/block)

Use --diffusion-gpu-sampling [auto / off / on]
Default is on so --diffusion-gpu-sampling auto is the 1.25x faster version

Iipal · 2026-06-11T07:27:08Z

@danielhanchen hello, currently testing local this PR on 5070Ti, how do I get the throughput info ?

and I got these results so far as best:

total time: 33280.82ms, time per step: 564.08ms (59 steps over 4 blocks, entropy-bound)

run command:

./build/bin/llama-diffusion-cli \
  -m diffusiongemma-26B-A4B-it-Q4_K_M.gguf \
  -ngl 99 -cnv --diffusion-visual --n-cpu-moe 18 --no-mmap \
    -n 2048 \
    --threads 8 \
    --threads-batch 8 \
    --main-gpu 0 -fa on \
    --split-mode none

danielhanchen · 2026-06-11T07:40:00Z

@Iipal You'll need to recompile!

Iipal · 2026-06-11T08:22:25Z

@danielhanchen Yeap, here is the results:

total time: 33923.17ms, time per step: 595.14ms (57 steps over 3 blocks, entropy-bound)
throughput: 22.6 tok/s (768 tok in 33923.17ms), in-step parallel 430 tok/s (256-tok canvas x 19.0 steps/block)

In comparison:
Regular Gemma4 26B A4B with 90k context on my 5070Ti gives me roughly about 45-55 t/s, and Qwen3.6 35B A3b - 55-75t/s, and this is on WSL2, in the native Linux Ubuntu 26.04 OS I'm getting about 20% more of speed with all the same parameters, and even possibly to increase context to 200k, just fyi

Sample argmax/entropy/multinomial per canvas position directly from the device sc_dev buffer instead of copying the [C, n_vocab] canvas logits to host (268 MB/step) and reducing on the CPU. Removes the last per-step bus copy on the entropy-bound path. - new ggml-cuda kernel (dense, top_k==0), reached from llama via the backend-reg proc-address boundary (no new llama<->cuda link); falls back to the host path on non-CUDA / multi-GPU / no sc_dev. - --diffusion-gpu-sample-reduce {auto,on,off}, auto=on for single-GPU, requires --diffusion-gpu-sampling. byte-identical when off. - argmax bit-identical to host every step; Z/entropy differ only by the parallel-reduction order (~1e-4), same FP-equivalence class as --diffusion-kv-cache. greedy decode identical; stochastic output identical on every prompt tested. ~1.42x per step on B200 Q8_0.

danielhanchen · 2026-06-11T09:20:20Z

New update again! Now 2200 tokens / s on B200x1 so 1461 to 2200!

cudaPointerGetAttributes / cudaPointerAttributes / cudaMemoryTypeDevice are not mapped by the hip/musa vendor layer. Drop the pointer-attribute device probe (the sampler is gated to a single CUDA device, so the tensor is already on the current device) and route the runtime calls through CUDA_CHECK.

googlefan256 · 2026-06-11T11:04:45Z

I think this PR currently uses only 1 CPU thread when running with CPU offloading.

(CPU is AMD Ryzen 7 5700X 16C)

Command used:

./build/bin/llama-diffusion-cli \
  --model ./diffusiongemma-26B-A4B-it-Q4_K_M.gguf \
  -cnv -n 2048 \
  -ngl 999 --n-cpu-moe 23 \
  --diffusion-visual

dogarrowtype · 2026-06-11T14:32:35Z

Hello, just reporting that on mac/metal (M3 Ultra and 8bit unsloth quant) I get an error after each diffusion step. The output text seems to generate normally so it's not a fatal error, but this error doesn't seem to appear for other platforms. This seems to happen if the visualizer is on or off (if the visualizer is on, the error just quickly flashes at the bottom of the screen each diffusion step). The cmake build went perfectly fine with -DGGML_CUDA=OFF , no errors while building.

0.19.684.423 E diffusion_generate_entropy_bound: device sample failed at step 0; falling back to host
diffusion step: 0/48 [                                                  ] 0%0.19.946.564 E diffusion_generate_entropy_bound: device sample failed at step 1; falling back to host
diffusion step: 1/48 [=                                                 ] 2%0.20.209.883 E diffusion_generate_entropy_bound: device sample failed at step 2; falling back to host
diffusion step: 2/48 [==                                                ] 4%0.20.467.469 E diffusion_generate_entropy_bound: device sample failed at step 3; falling back to host
continues on like this until generation finishes...

Persistent forward server that runs diffusion_generate_entropy_bound and streams the per-step argmax canvas (plus each committed block) over stdin/stdout, so a host can render the denoise without reloading the model. Reuses the entropy-bound decoder; links llama-diffusion.

mohamed-em2m · 2026-06-11T15:39:01Z

it's give me : error loading model: unknown model architecture: 'diffusion-gemma'

Take chat messages as JSON and apply the GGUF chat template + tokenizer in the server (common_chat_templates + common_tokenize), and stream the per-step canvas and committed blocks back as detokenized text. Drops the need for any client-side tokenizer; the request is now {seed, n_blocks, messages}.

When a backend cannot run the on-device sampler (e.g. Metal), latch the fallback after the first failure: warn once and use the host reduction for the rest of the run instead of retrying and logging an error every step. Output is unchanged (host sampling was already the fallback); only the per-step error spam is removed.

cpietsch · 2026-06-11T17:40:16Z

this is so cool! on my 5090 in WSL I get total time: 1109.67ms, time per step: 55.48ms (20 steps over 1 blocks, entropy-bound) throughput: 230.7 tok/s (256 tok in 1109.67ms), in-step parallel 4614 tok/s (256-tok canvas x 20.0 steps/block)

map9959 · 2026-06-11T17:44:11Z

From an M4 Pro, 14 cores:

total time: 100675.41ms, time per step: 867.89ms (116 steps over 6 blocks, entropy-bound)
throughput: 15.3 tok/s (1536 tok in 100675.41ms), in-step parallel 295 tok/s (256-tok canvas x 19.3 steps/block)

fizzAI · 2026-06-11T18:13:34Z

Disclaimer Heavy usage of AI, but verified logits matching with transformers, checked FP16 vs FP32 KV cache, long context checks and much more

Is this not a violation of the project's contributing guidelines?

This project does not accept pull requests that are fully or predominantly AI-generated. AI tools may be utilized solely in an assistive capacity.
AI assistance is permissible only when the majority of the code is authored by a human contributor, with AI employed exclusively for corrections or to expand on verbose modifications that the contributor has already conceptualized (e.g., generating repeated lines with minor variations).

Iipal · 2026-06-11T18:19:34Z

it's give me : error loading model: unknown model architecture: 'diffusion-gemma'
@mohamed-em2m

Make sure you are checkout correctly to the diffusiongemma branch before compilation, as the instruction follows:

git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp

# this step is important
git fetch origin pull/24423/head:pr-24423 && git switch pr-24423

cmake -B build -DGGML_CUDA=ON
cmake --build build -j --config Release --target llama-diffusion-cli

pl752 · 2026-06-11T18:39:30Z

Encountered one rough place: I am unable to enable gpu sampling: it seems that the program disables it due to multi-gpu even when only one device is selected. In order to use it I have to use env CUDA_VISIBLE_DEVICES=0 Also utilization is reported as 40-50% without and 70-80 with sampling (rtx3090, q4_k_m)

luminary19 · 2026-06-11T19:13:37Z

Same issue as above (Q4_K_M). Running on an RTX 4070, 32GB RAM 8GB VRAM. I tried setting -ngl down to 2 or 3 doesn't work -- model still runs thanks to CPU fallback, but GPU usage records 0 due to failed sampling.

mohamed-em2m · 2026-06-11T19:22:32Z

Bug: diffusion-gemma ignores `-ngl` and server never starts listening

Environment

Commit: PR DiffusionGemma #24423
GPU: NVIDIA L4 (22 GB VRAM)
CUDA: enabled
OS: Linux
Model: unsloth/diffusiongemma-26B-A4B-it-GGUF
Quantization: Q4_K_M

Command

./build/bin/llama-diffusion-gemma-server \
  /root/.cache/huggingface/hub/unsloth/diffusiongemma-26B-A4B-it-GGUF/diffusiongemma-26B-A4B-it-Q4_K_M.gguf \
  --diffusion-eb auto \
  --diffusion-eb-max-steps 48 \
  --diffusion-eb-t-max 1.0 \
  --diffusion-eb-t-min 0.6 \
  --diffusion-visual \
  --host 0.0.0.0 \
  --port 8080 \
  -ngl 99 \
  -cnv \
  -n 2048

GPU Detection

ggml_cuda_init: found 1 CUDA devices (Total VRAM: 22563 MiB):
  Device 0: NVIDIA L4, compute capability 8.9

Observed Behavior

The model loads successfully, but no layers are offloaded to the GPU:

load_tensors: offloading 0 repeating layers to GPU
load_tensors: offloaded 0/31 layers to GPU

CPU_Mapped model buffer size = 16013.13 MiB

The server reports:

diffusion-gemma-server ready (n_vocab=262144, MAXTOK=2304, NGL=0)
READY 262144

despite launching with:

-ngl 99

Additionally, after reaching:

READY 262144

the process appears to stall and never starts serving requests. I do not see any indication that the HTTP server is listening on port 8080.

For example:

curl http://127.0.0.1:8080

fails because nothing is listening on the configured port.

Additional Diagnostics

The log also contains:

done_getting_tensors: tensor 'token_embd.weight' (q6_K) (and 696 others) cannot be used with preferred buffer type CUDA_Host, using CPU instead

and:

CUDA0 compute buffer size = 2955.75 MiB

which suggests CUDA is initialized and compute buffers are created, but all model tensors remain on CPU.

Expected Behavior

-ngl 99 should offload layers to the GPU, or an explicit message should explain why diffusion-gemma currently does not support GPU offloading.
After model initialization completes, the server should start listening on the configured host/port (0.0.0.0:8080) and accept requests.

luminary19 · 2026-06-11T19:27:01Z

Same issue as above (Q4_K_M). Running on an RTX 4070, 32GB RAM 8GB VRAM. I tried setting -ngl down to 2 or 3 doesn't work -- model still runs thanks to CPU fallback, but GPU usage records 0 due to failed sampling.

Setting --diffusion-gpu-sample-reduce off lets it work, but throughput is still low at around ~5.2tok/s (112 in step parallel). Guess this is what I can expect from 8GB VRAM.

DATEx2 · 2026-06-11T19:40:24Z

I don't understand what I am doing wrong -
I am getting 248 tokens/sec on regular llama.cpp / GEMMA 4 MTP2 / 5090 / Q6_K and now I am getting only 167/190 tokens / sec in this diffusion compiled llama.cpp - why?

Iipal · 2026-06-11T20:03:11Z

current(10a2613) progress on the tests:

total time: 31084.10ms, time per step: 535.93ms (58 steps over 4 blocks, entropy-bound)
throughput: 32.9 tok/s (1024 tok in 31084.10ms), in-step parallel 478 tok/s (256-tok canvas x 14.5 steps/block)

previous (commit: 15ad8f4):

total time: 33923.17ms, time per step: 595.14ms (57 steps over 3 blocks, entropy-bound)
throughput: 22.6 tok/s (768 tok in 33923.17ms), in-step parallel 430 tok/s (256-tok canvas x 19.0 steps/block)

run script:

./build/bin/llama-diffusion-cli \
  -m unsloth/diffusiongemma-26B-A4B-it-Q4_K_M.gguf \
  -ngl 99 -cnv --diffusion-visual --n-cpu-moe 18 --no-mmap \
    -n 2048 \
    --threads 8 \
    --threads-batch 8 \
    --main-gpu 0 -fa on \
    --split-mode none

both tested on the single prompt: "create a fibonacci script in python"

system: WSL2, 5070Ti, 32Gb DDR4, R7 5800x

GPU usage: 30-55% (15.2\16 Gb VRAM)
CPU usage: 20-40%
RAM usage: 25\32Gb (with all other apps included)

But with each next following prompt, the t/s speed drops in half

diffusion-visual updates

c5fe75b

Some diffusion cli and visual updates

danielhanchen requested review from a team, CISC and am17an as code owners June 10, 2026 15:56

github-actions Bot added model Model specific examples python python script changes labels Jun 10, 2026

danielhanchen marked this pull request as draft June 10, 2026 15:58

danielhanchen changed the title ~~diffusion-visual updates~~ DiffusionGemma Jun 10, 2026

github-actions Bot added the testing Everything test related label Jun 10, 2026

danielhanchen mentioned this pull request Jun 10, 2026

diffusion_studio: serve DiffusionGemma in Unsloth Studio with the optimized visual decoder unslothai/unsloth-zoo#748

Open

pearsonkyle mentioned this pull request Jun 10, 2026

Add DiffusionGemma (Gemma-4 MoE block-diffusion) support: MoE imatrix variants, AWQ MoE groups, llama.cpp bump, diffusion eval backend pearsonkyle/Quant-Tuner#4

Draft

danielhanchen added 2 commits June 11, 2026 04:05

github-actions Bot added Nvidia GPU Issues specific to Nvidia GPUs ggml changes relating to the ggml tensor library for machine learning labels Jun 11, 2026

ashalliants mentioned this pull request Jun 11, 2026

Add DiffusionGemma-26B-A4B llama.cpp single-card serving (PR #24423) noonghunna/club-3090#373

Open

danielhanchen added a commit to unslothai/llama.cpp that referenced this pull request Jun 11, 2026

Add ggml-org#24423 (DiffusionGemma) to the mix-build set

cc4edad

danielhanchen added 2 commits June 11, 2026 15:50

Conversation

danielhanchen commented Jun 10, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

ggml-gh-bot Bot commented Jun 10, 2026

Uh oh!

pwilkin commented Jun 10, 2026

Uh oh!

danielhanchen commented Jun 10, 2026

Uh oh!

danielhanchen commented Jun 10, 2026

Uh oh!

gaugarg-nv commented Jun 10, 2026

Uh oh!

CISC commented Jun 10, 2026

Uh oh!

coder543 commented Jun 10, 2026

Uh oh!

quasar-of-mikus commented Jun 10, 2026

Uh oh!

danielhanchen commented Jun 10, 2026

Uh oh!

stepfunction83 commented Jun 10, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

quasar-of-mikus commented Jun 10, 2026

Uh oh!

arkham000 commented Jun 10, 2026

Uh oh!

kroaton commented Jun 10, 2026

Uh oh!

icedream commented Jun 10, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

lucasbinder commented Jun 10, 2026

Uh oh!

icedream commented Jun 10, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

gbuznote-beep commented Jun 11, 2026

Uh oh!

csabakecskemeti commented Jun 11, 2026

Uh oh!

danielhanchen commented Jun 11, 2026

Uh oh!

Iipal commented Jun 11, 2026

Uh oh!

danielhanchen commented Jun 11, 2026

Uh oh!

Iipal commented Jun 11, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

danielhanchen commented Jun 11, 2026

Uh oh!

googlefan256 commented Jun 11, 2026

Uh oh!

dogarrowtype commented Jun 11, 2026

Uh oh!

mohamed-em2m commented Jun 11, 2026

Uh oh!

cpietsch commented Jun 11, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

map9959 commented Jun 11, 2026

Uh oh!

fizzAI commented Jun 11, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Iipal commented Jun 11, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

pl752 commented Jun 11, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

luminary19 commented Jun 11, 2026

Uh oh!

mohamed-em2m commented Jun 11, 2026

Bug: diffusion-gemma ignores -ngl and server never starts listening

Environment

Command

danielhanchen commented Jun 10, 2026 •

edited

Loading

stepfunction83 commented Jun 10, 2026 •

edited

Loading

icedream commented Jun 10, 2026 •

edited

Loading

icedream commented Jun 10, 2026 •

edited

Loading

Iipal commented Jun 11, 2026 •

edited

Loading

cpietsch commented Jun 11, 2026 •

edited

Loading

fizzAI commented Jun 11, 2026 •

edited

Loading

Iipal commented Jun 11, 2026 •

edited

Loading

pl752 commented Jun 11, 2026 •

edited

Loading

Bug: diffusion-gemma ignores `-ngl` and server never starts listening