DiffusionGemma #24423

Draft
danielhanchen wants to merge 10 commits into ggml-org:master from danielhanchen:diffusion-visual-updates

@danielhanchen danielhanchen commented Jun 10, 2026

Contributor

Worked on prelim Diffusion Gemma support!

  1. Has normal chat similar to llama-cli via llama-diffusion-cli -cnv -n 2048
  2. Has a visualization method to show diffusion live via llama-diffusion-cli -cnv -n 2048 --diffusion-visual

To try this PR:

git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
gh pr checkout 24423
cmake -B build -DGGML_CUDA=ON
cmake --build build -j --config Release --target llama-diffusion-cli

Then download a GGUF (any should work, but for example):

pip install -U "huggingface_hub[cli]"
hf download unsloth/diffusiongemma-26B-A4B-it-GGUF \
    --local-dir unsloth/diffusiongemma-26B-A4B-it-GGUF \
    --include "*Q8_0*" # Use "*Q4_K_M*" for a smaller 16 GB download

then use chat or visualization:

./build/bin/llama-diffusion-cli \
  -m unsloth/diffusiongemma-26B-A4B-it-GGUF/diffusiongemma-26B-A4B-it-Q8_0.gguf \
  -ngl 99 -cnv -n 2048

or

./build/bin/llama-diffusion-cli \
  -m unsloth/diffusiongemma-26B-A4B-it-GGUF/diffusiongemma-26B-A4B-it-Q8_0.gguf \
  -ngl 99 -cnv -n 2048 --diffusion-visual

Example below (a bit blurry to limit to 10MB on Github :()
[Image: diffusiongem-ezgif com-resize]

Disclaimer: Heavy usage of AI, but I verified logits matching with transformers, checked FP16 vs FP32 KV cache, ran long-context checks, and much more.

Some diffusion cli and visual updates
@danielhanchen danielhanchen requested review from a team, CISC and am17an as code owners June 10, 2026 15:56
@github-actions github-actions Bot added labels: model (Model specific), examples, python (python script changes) Jun 10, 2026
@danielhanchen danielhanchen marked this pull request as draft June 10, 2026 15:58
@ggml-gh-bot

ggml-gh-bot Bot commented Jun 10, 2026

Hi @danielhanchen, thanks for your contribution!

Per our contribution guidelines, the automated PR checker found the following issue(s) that need your attention:

  • Large PR: Large changes require prior discussion (e.g. an issue or RFC) and maintainers may not be able to review this PR as-is. Consider splitting it into smaller, focused PRs.

Please note that maintainers reserve the right to make final decisions on PRs. If you believe there is a mistake, please comment below.

@danielhanchen danielhanchen changed the title from "diffusion-visual updates" to "DiffusionGemma" Jun 10, 2026
@pwilkin

pwilkin commented Jun 10, 2026

Member

Oof, that's a big one.

There's a ton of debugging stuff left in there that needs throwing out, for one. I'm also not convinced about the idea to make a server just for one model - I think if we're intending to support diffusion models in a server mechanism, it should be a general diffusion-server (but that's just my opinion, probably have to wait for what @ggerganov thinks about this one).

@danielhanchen

Contributor Author

Haha sorry - this PR was more of a direct translation / proof of concept that it works!

@danielhanchen

Contributor Author

I'll edit the PR - sorry we're juggling multiple things haha

@gaugarg-nv

Contributor

Another PR for DiffusionGemma: #24427

@CISC

CISC commented Jun 10, 2026

Member

You have some failing tests to fix. :)

@coder543

> if we're intending to support diffusion models in a server mechanism, it should be a general diffusion-server

With a block diffusion model, couldn't the regular server just return each block when it is finished diffusing? It would be nice to just have one server, and the API could remain fully compatible so clients don't need to be aware that they're dealing with a diffusion model. (We don't show the distribution of sampled logits for AR models, and I don't see why people would need to see the intermediate diffusion steps either, since those won't be useful.)

@quasar-of-mikus

Doesn't build on Windows:

[421/433] Building CXX object examples\diffusion\CMakeFiles\llama-diffusion-cli.dir\diffusion-cli.cpp.obj
FAILED: examples/diffusion/CMakeFiles/llama-diffusion-cli.dir/diffusion-cli.cpp.obj
ccache C:\PROGRA~1\MICROS~3\2022\COMMUN~1\VC\Tools\Llvm\x64\bin\clang-cl.exe  /nologo -TP -DGGML_BACKEND_SHARED -DGGML_SHARED -DGGML_USE_CPU -DGGML_USE_CUDA -DLLAMA_SHARED -D_CRT_SECURE_NO_WARNINGS -IC:\Textgen\llama.cpp\src\..\include -IC:\Textgen\llama.cpp\ggml\src\..\include -IC:\Textgen\llama.cpp\common\. -IC:\Textgen\llama.cpp\common\..\vendor /DWIN32 /D_WINDOWS /EHsc /O2 /Ob2 /DNDEBUG -std:c++17 -MD /utf-8 /bigobj /showIncludes /Foexamples\diffusion\CMakeFiles\llama-diffusion-cli.dir\diffusion-cli.cpp.obj /Fdexamples\diffusion\CMakeFiles\llama-diffusion-cli.dir\ -c -- C:\Textgen\llama.cpp\examples\diffusion\diffusion-cli.cpp
C:\Textgen\llama.cpp\examples\diffusion\diffusion-cli.cpp(10,10): fatal error: 'sys/ioctl.h' file not found
   10 | #include <sys/ioctl.h>
      |          ^~~~~~~~~~~~~
1 error generated.

@danielhanchen

Contributor Author

Yep, will fix haha - I also added a short GIF of it working, edited into the description!

…s, drop debug hooks

- guard sys/ioctl.h behind _WIN32 and add a GetConsoleScreenBufferInfo fallback
  for the visual viewport size, so diffusion-cli builds on Windows
- skip diffusion-gemma in test-llama-archs like gemma4 (shared ISWA backbone,
  no synthetic fixture params yet)
- remove the DG_DUMP_KV_LAYER / DG_NSWA debug scaffolding and its llama.h API
- fix flake8 E306 in conversion/diffusion_gemma.py
@stepfunction83

stepfunction83 commented Jun 10, 2026

I was able to compile it successfully on Linux for my 4090, but when running it, I get the following error after sending a user message:

0.13.483.785 E ggml_cuda_compute_forward: SOFT_MAX failed
0.13.483.796 E CUDA error: invalid argument
0.13.483.798 E   current device: 0, in function ggml_cuda_compute_forward at /home/LLM/DiffusionGemma/llama.cpp/ggml/src/ggml-cuda/ggml-cuda.cu:3163
0.13.483.798 /home/LLM/DiffusionGemma/llama.cpp/ggml/src/ggml-cuda/ggml-cuda.cu:103: CUDA error
E   err
/home/LLM/DiffusionGemma/llama.cpp/build/bin/libggml-base.so.0(+0x1c1ab)[0x7bb8567401ab]
/home/LLM/DiffusionGemma/llama.cpp/build/bin/libggml-base.so.0(ggml_print_backtrace+0x21c)[0x7bb85674062c]
/home/LLM/DiffusionGemma/llama.cpp/build/bin/libggml-base.so.0(ggml_abort+0x15b)[0x7bb85674080b]
/home/LLM/DiffusionGemma/llama.cpp/build/bin/libggml-cuda.so.0(_Z15ggml_cuda_errorPKcS0_S0_iS0_+0xb7)[0x7bb853260997]
/home/LLM/DiffusionGemma/llama.cpp/build/bin/libggml-cuda.so.0(+0x27a810)[0x7bb85327a810]
/home/LLM/DiffusionGemma/llama.cpp/build/bin/libggml-base.so.0(ggml_backend_sched_graph_compute_async+0x817)[0x7bb85675def7]
/home/LLM/DiffusionGemma/llama.cpp/build/bin/libllama.so.0(_ZN13llama_context13graph_computeEP11ggml_cgraphb+0xa1)[0x7bb855ee04c1]
/home/LLM/DiffusionGemma/llama.cpp/build/bin/libllama.so.0(_ZN13llama_context14process_ubatchERK12llama_ubatch14llm_graph_typeP22llama_memory_context_iR11ggml_status+0x114)[0x7bb855ee2c94]
/home/LLM/DiffusionGemma/llama.cpp/build/bin/libllama.so.0(_ZN13llama_context6encodeERK11llama_batch+0x240)[0x7bb855ee6c80]
/home/LLM/DiffusionGemma/llama.cpp/build/bin/libllama.so.0(llama_decode+0xf)[0x7bb855eec1ef]
./build/bin/llama-diffusion-cli(+0x1941c)[0x5a66e697341c]
./build/bin/llama-diffusion-cli(+0x8143)[0x5a66e6962143]
./build/bin/llama-diffusion-cli(+0x60e4)[0x5a66e69600e4]
/lib/x86_64-linux-gnu/libc.so.6(+0x2a1ca)[0x7bb85562a1ca]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x8b)[0x7bb85562a28b]
./build/bin/llama-diffusion-cli(+0x6a45)[0x5a66e6960a45]
Aborted (core dumped)

The command I'm running is:

 CUDA_VISIBLE_DEVICES="" ./build/bin/llama-diffusion-cli \
  -m unsloth/diffusiongemma-26B-A4B-it-GGUF/diffusiongemma-26B-A4B-it-Q4_K_M.gguf \
  -ngl 99 -cnv -n 2048 --system-prompt-file sysprompt.txt \
  --diffusion-eb auto \
  --diffusion-eb-max-steps 48 \
  --diffusion-eb-t-max 1.0 \
  --diffusion-eb-t-min 0.6 \
  --diffusion-visual

To compile it, I used:

rm -rf build
cmake -B build -DGGML_CUDA=ON -DCMAKE_CUDA_COMPILER=$(which nvcc)
cmake --build build -j --config Release --target llama-diffusion-cli

@quasar-of-mikus

> Yep will fix haha - I also added a short GIF of it working edited in description!

Builds and runs on Windows now.
1x 3090 Q4KM: time per step: 326.13ms
2x 3090 Q8_0: time per step: 878.83ms

@arkham000

Is it CLI-only at this point? No llama-server?

@kroaton

kroaton commented Jun 10, 2026

Thank you for putting this together! One issue I found is that --fit doesn't work with this PR.

@icedream

icedream commented Jun 10, 2026

Test run on my system with AMD hardware (7900 XTX) in it, Q4_K_M - time per step: 364.45ms

Screencast_20260610_234032_c.webm

@lucasbinder

@icedream What (equivalent) tokens per second are you getting? I also tried running it with an AMD GPU (R9700 w/ vulkan) and only got ~27t/s.

@icedream

icedream commented Jun 10, 2026

@lucasbinder Not 100% sure if that's the right way to calculate it but based on two more runs with the same prompt, calculating with 256 tokens per full canvas diffused (I left out the last canvas as tail end of response), taking the start/end timings per canvas from the --verbose output:

Run 1

2.327213 to 9.803998: 7.476785 s, 34.24 t/s
10.119182 to 17.995429: 7.876247 s, 32.50 t/s
18.412472 to 23.738067: 5.325595 s, 48.07 t/s
24.261660 to 32.754977: 8.493317 s, 30.14 t/s
33.345178 to 51.575057: 18.229879 s, 14.04 t/s

Run 2

2.328457 to 9.777839: 7.449382 s, 34.37 t/s
10.092776 to 17.935576: 7.8428 s, 32.64 t/s
18.349154 to 23.654470: 5.305316 s, 48.25 t/s
24.176041 to 32.655410: 8.479369 s, 30.19 t/s
33.245244 to 51.415716: 18.170472 s, 14.09 t/s

(Also I should clarify I used ROCm, not Vulkan in my case so that may be influencing the performance as well.)
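
The per-canvas calculation described above can be reproduced with a few lines of Python. This is only a sketch of that method: it assumes the timings are in seconds and that every full canvas holds 256 tokens, as stated in the comment. The function and variable names are made up for illustration.

```python
# Tokens per second for one fully diffused canvas: canvas size divided by
# the wall-clock time between the canvas start and end timestamps.
CANVAS_TOKENS = 256  # tokens per full canvas, as stated in the comment


def canvas_tps(start_s: float, end_s: float, tokens: int = CANVAS_TOKENS) -> float:
    duration = end_s - start_s
    if duration <= 0:
        raise ValueError("end must be after start")
    return tokens / duration


# (start, end) pairs copied from Run 1
run1 = [
    (2.327213, 9.803998),
    (10.119182, 17.995429),
    (18.412472, 23.738067),
    (24.261660, 32.754977),
    (33.345178, 51.575057),
]

per_canvas = [round(canvas_tps(s, e), 2) for s, e in run1]
print(per_canvas)  # [34.24, 32.5, 48.07, 30.14, 14.04]
```

Note that this measures each canvas in isolation; any time spent between canvases is not counted, so an end-to-end figure would be somewhat lower.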

@gbuznote-beep

Unofficial prebuilt binaries for anyone who wants to test this PR without setting up a CUDA toolchain:

https://github.com/gbuznote-beep/llama-diffusion-cli-prebuilt

  • Linux x86_64 / WSL2 — CUDA 12.8, sm_86 (RTX 30-series / A4000–A6000), glibc ≥ 2.39, self-contained (cudart/cublas/cublasLt/nccl bundled)
  • Windows x64 — CPU-only
  • Pinned to c84e85a; SHA256SUMS + reproducible build scripts included (other GPU archs rebuild in ~10 min)

Data points from testing (256 tokens, EB sampler): A5000 full-GPU 0.98 s/step; RTX 3070 Ti Laptop 8 GB via WSL2 (-ngl 99 --n-cpu-moe 22) 5.9 s/step; i7-12700H CPU-only 17.1 s/step. Output quality looks coherent (thinking-style drafts + self-critique). Thanks @danielhanchen for the implementation!

@csabakecskemeti

Contributor

Results on RTX PRO6000 + 5090 (I could have used only the 6000)
google.diffusiongemma-26B-A4B-it.Q4_K_M.gguf -ngl -1 -cnv -n 2048
total time: 22329.51ms, time per step: 192.50ms (116 steps over 6 blocks, entropy-bound)

Not sure if I am calculating the t/s correctly:
tokens = blocks_completed * canvas_length = 6 * 256 = 1536 tokens
t/s = 1536 / (22329.51 / 1000) = 1536 / 22.33 ≈ 68.8 t/s
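
The same arithmetic as a small Python helper, in case anyone wants to plug in their own summary line. It is a sketch that assumes every completed block is a full 256-token canvas (so a short final block would be overcounted); `effective_tps` is a hypothetical name, not something from the PR.

```python
def effective_tps(blocks: int, canvas_len: int, total_ms: float) -> float:
    """End-to-end tokens per second: all canvas tokens over total wall time."""
    tokens = blocks * canvas_len
    return tokens / (total_ms / 1000.0)


# Values from the summary line above: 6 blocks, 256-token canvas, 22329.51 ms
tps = effective_tps(blocks=6, canvas_len=256, total_ms=22329.51)
print(f"{tps:.1f} t/s")  # 68.8 t/s
```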

…to model params)

The CLI hand-builds llama_model_params and never copied tensor_buft_overrides, so -ot and
--n-cpu-moe were parsed but silently dropped - the MoE experts stayed on the GPU and OOMed
small-VRAM cards. Mirror common_model_params_to_llama.
… /clear

- --diffusion-gpu-sampling {auto,on,off} (default auto = on for single-GPU):
  keep the prev step's canvas logits in a device buffer (sc_dev) and read
  self-conditioning from it instead of a 268 MB host upload each step. SC
  inputs are bit-identical to the host path; auto-disables on multi-GPU like
  --diffusion-kv-cache. ~1.3x per step.
- cli: add effective + in-step-parallel throughput to the timing summary.
- cli: add /help and /clear in conversation mode.
@danielhanchen

Contributor Author

Throughput increased from 1461 tokens / s to 1831 tok/s! (1.25x faster) with 0 change in accuracy! Tested on B200x1 Q8_0 quant

Also added a new section so it's more clear on numbers:

total time: 19298.76ms, time per step: 139.85ms (138 steps over 11 blocks, entropy-bound)
throughput: 145.9 tok/s (2816 tok in 19298.76ms), in-step parallel 1831 tok/s (256-tok canvas x 12.5 steps/block)

Before:

total time: 30142.16ms, time per step: 175.25ms (172 steps over 12 blocks, entropy-bound)
throughput: 101.9 tok/s (3072 tok in 30142.16ms), in-step parallel 1461 tok/s (256-tok canvas x 14.3 steps/block)

Use --diffusion-gpu-sampling [auto / off / on].
The default is auto, which resolves to on for a single GPU, so the default already gives the 1.25x faster version.
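
For anyone who wants to sanity-check the two throughput figures in the summary lines above, here is a rough Python sketch that parses the "total time" line and recomputes them. The relations used here (effective = blocks x canvas / total time, in-step parallel = canvas / time per step) are inferred from the printed numbers and are an assumption, not taken from the CLI source.

```python
import re

# "After" summary line copied from the comment above
SUMMARY = (
    "total time: 19298.76ms, time per step: 139.85ms "
    "(138 steps over 11 blocks, entropy-bound)"
)
CANVAS = 256  # canvas size in tokens, from the throughput line

m = re.search(
    r"total time: ([\d.]+)ms, time per step: ([\d.]+)ms "
    r"\((\d+) steps over (\d+) blocks",
    SUMMARY,
)
total_ms, step_ms = float(m.group(1)), float(m.group(2))
steps, blocks = int(m.group(3)), int(m.group(4))

tokens = blocks * CANVAS                  # tokens committed over the run
effective = tokens / (total_ms / 1000)    # end-to-end tok/s
in_step = CANVAS / (step_ms / 1000)       # canvas tokens touched per second of stepping
steps_per_block = steps / blocks

print(tokens, round(effective, 1), round(in_step), round(steps_per_block, 1))
# 2816 145.9 1831 12.5
```

The gap between the two figures comes from the step count: each 256-token block needs about 12.5 denoising steps here, so the effective rate is roughly the in-step rate divided by the steps per block.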

@Iipal

Iipal commented Jun 11, 2026

@danielhanchen Hello, I'm currently testing this PR locally on a 5070Ti. How do I get the throughput info?

These are my best results so far:

total time: 33280.82ms, time per step: 564.08ms (59 steps over 4 blocks, entropy-bound)

run command:

./build/bin/llama-diffusion-cli \
  -m diffusiongemma-26B-A4B-it-Q4_K_M.gguf \
  -ngl 99 -cnv --diffusion-visual --n-cpu-moe 18 --no-mmap \
    -n 2048 \
    --threads 8 \
    --threads-batch 8 \
    --main-gpu 0 -fa on \
    --split-mode none

@danielhanchen

Contributor Author

@Iipal You'll need to recompile!

@Iipal

Iipal commented Jun 11, 2026

@danielhanchen Yep, here are the results:

total time: 33923.17ms, time per step: 595.14ms (57 steps over 3 blocks, entropy-bound)
throughput: 22.6 tok/s (768 tok in 33923.17ms), in-step parallel 430 tok/s (256-tok canvas x 19.0 steps/block)

In comparison:
Regular Gemma4 26B A4B with 90k context on my 5070Ti gives me roughly 45-55 t/s, and Qwen3.6 35B A3b gives 55-75 t/s. That is on WSL2; on native Ubuntu 26.04 I get about 20% more speed with all the same parameters, and can possibly even increase the context to 200k. Just FYI.

Sample argmax/entropy/multinomial per canvas position directly from the
device sc_dev buffer instead of copying the [C, n_vocab] canvas logits to
host (268 MB/step) and reducing on the CPU. Removes the last per-step bus
copy on the entropy-bound path.

- new ggml-cuda kernel (dense, top_k==0), reached from llama via the
  backend-reg proc-address boundary (no new llama<->cuda link); falls back
  to the host path on non-CUDA / multi-GPU / no sc_dev.
- --diffusion-gpu-sample-reduce {auto,on,off}, auto=on for single-GPU,
  requires --diffusion-gpu-sampling. byte-identical when off.
- argmax bit-identical to host every step; Z/entropy differ only by the
  parallel-reduction order (~1e-4), same FP-equivalence class as
  --diffusion-kv-cache. greedy decode identical; stochastic output
  identical on every prompt tested. ~1.42x per step on B200 Q8_0.
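
To illustrate why the argmax can match exactly while Z and the entropy differ slightly, here is a small pure-Python sketch, not the actual CUDA kernel. It computes argmax, the softmax normalizer Z and the entropy for one row of logits, once with a sequential sum and once with a pairwise (tree) sum that mimics a parallel reduction. All names are hypothetical, and because Python floats are doubles the differences here are far smaller than the ~1e-4 quoted above for the GPU path.

```python
import math
import random


def sequential_sum(xs):
    total = 0.0
    for x in xs:
        total += x
    return total


def pairwise_sum(xs):
    # Tree-shaped reduction, the order a parallel reduce would typically use.
    if not xs:
        return 0.0
    if len(xs) == 1:
        return xs[0]
    mid = len(xs) // 2
    return pairwise_sum(xs[:mid]) + pairwise_sum(xs[mid:])


def row_stats(logits, reduce_sum):
    m = max(logits)
    argmax = logits.index(m)          # does not depend on summation order
    exps = [math.exp(x - m) for x in logits]
    z = reduce_sum(exps)              # softmax normalizer (shifted by max)
    weighted = reduce_sum([e * (x - m) for e, x in zip(exps, logits)])
    entropy = math.log(z) - weighted / z   # H = log Z - sum(p_i * (x_i - m))
    return argmax, z, entropy


random.seed(0)
row = [random.gauss(0.0, 1.0) for _ in range(1024)]  # stand-in for one canvas position

seq = row_stats(row, sequential_sum)
par = row_stats(row, pairwise_sum)
print(seq[0] == par[0], abs(seq[2] - par[2]))
```

The argmax only involves comparisons, so it is unaffected by reduction order; Z and the entropy are sums of floating-point terms, and those can round differently when added in a different order.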
@github-actions github-actions Bot added labels: Nvidia GPU (Issues specific to Nvidia GPUs), ggml (changes relating to the ggml tensor library for machine learning) Jun 11, 2026
@danielhanchen

Contributor Author

New update again! Now 2200 tokens / s on B200x1 so 1461 to 2200!

cudaPointerGetAttributes / cudaPointerAttributes / cudaMemoryTypeDevice
are not mapped by the hip/musa vendor layer. Drop the pointer-attribute
device probe (the sampler is gated to a single CUDA device, so the tensor
is already on the current device) and route the runtime calls through
CUDA_CHECK.
@googlefan256

I think this PR currently uses only 1 CPU thread when running with CPU offloading.

[Image: Screenshot 2026-06-11 20 00 41]

(CPU is AMD Ryzen 7 5700X 16C)

Command used:

./build/bin/llama-diffusion-cli \
  --model ./diffusiongemma-26B-A4B-it-Q4_K_M.gguf \
  -cnv -n 2048 \
  -ngl 999 --n-cpu-moe 23 \
  --diffusion-visual

danielhanchen added a commit to unslothai/llama.cpp that referenced this pull request Jun 11, 2026
@dogarrowtype

Hello, just reporting that on Mac/Metal (M3 Ultra, 8-bit unsloth quant) I get an error after each diffusion step. The output text seems to generate normally, so it's not a fatal error, but it doesn't seem to appear on other platforms. It happens whether the visualizer is on or off (if it is on, the error just quickly flashes at the bottom of the screen each diffusion step). The cmake build went perfectly fine with -DGGML_CUDA=OFF, with no errors while building.

0.19.684.423 E diffusion_generate_entropy_bound: device sample failed at step 0; falling back to host
diffusion step: 0/48 [                                                  ] 0%0.19.946.564 E diffusion_generate_entropy_bound: device sample failed at step 1; falling back to host
diffusion step: 1/48 [=                                                 ] 2%0.20.209.883 E diffusion_generate_entropy_bound: device sample failed at step 2; falling back to host
diffusion step: 2/48 [==                                                ] 4%0.20.467.469 E diffusion_generate_entropy_bound: device sample failed at step 3; falling back to host
continues on like this until generation finishes...

Persistent forward server that runs diffusion_generate_entropy_bound and
streams the per-step argmax canvas (plus each committed block) over
stdin/stdout, so a host can render the denoise without reloading the model.
Reuses the entropy-bound decoder; links llama-diffusion.
@mohamed-em2m

It gives me: error loading model: unknown model architecture: 'diffusion-gemma'

Take chat messages as JSON and apply the GGUF chat template + tokenizer in
the server (common_chat_templates + common_tokenize), and stream the per-step
canvas and committed blocks back as detokenized text. Drops the need for any
client-side tokenizer; the request is now {seed, n_blocks, messages}.
When a backend cannot run the on-device sampler (e.g. Metal), latch the
fallback after the first failure: warn once and use the host reduction for the
rest of the run instead of retrying and logging an error every step. Output is
unchanged (host sampling was already the fallback); only the per-step error
spam is removed.
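
The "latch the fallback" behaviour described in the commit message above can be sketched in a few lines of Python. This is an illustration of the pattern only, not the C++ code in the PR; `LatchedSampler`, `FailingDevice` and `host_argmax` are hypothetical names, and the failing device sampler is a stand-in for a backend without the on-device sampler.

```python
import logging

log = logging.getLogger("diffusion-sketch")  # hypothetical logger name


class LatchedSampler:
    """Try a device sampler; after its first failure, stay on the host path."""

    def __init__(self, device_sample, host_sample):
        self._device_sample = device_sample
        self._host_sample = host_sample
        self.device_ok = True  # the latch: flips to False once and never back

    def sample(self, logits):
        if self.device_ok:
            try:
                return self._device_sample(logits)
            except RuntimeError as err:
                self.device_ok = False
                # Warn once instead of logging an error on every step.
                log.warning("device sample failed (%s); using host sampling from now on", err)
        return self._host_sample(logits)


class FailingDevice:
    """Stand-in for a backend that cannot run the on-device sampler."""

    def __init__(self):
        self.attempts = 0

    def __call__(self, logits):
        self.attempts += 1
        raise RuntimeError("on-device sampler unavailable")


def host_argmax(logits):
    return max(range(len(logits)), key=logits.__getitem__)


device = FailingDevice()
sampler = LatchedSampler(device, host_argmax)
picks = [sampler.sample([0.1, 2.0, -1.0]) for _ in range(3)]
print(picks, device.attempts)  # [1, 1, 1] 1
```

The device path is attempted exactly once across the three steps; the sampled result is the same as before because the host path was already the fallback.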
@cpietsch

cpietsch commented Jun 11, 2026

This is so cool! On my 5090 in WSL I get:

total time: 1109.67ms, time per step: 55.48ms (20 steps over 1 blocks, entropy-bound)
throughput: 230.7 tok/s (256 tok in 1109.67ms), in-step parallel 4614 tok/s (256-tok canvas x 20.0 steps/block)

@map9959

map9959 commented Jun 11, 2026

From an M4 Pro, 14 cores:

total time: 100675.41ms, time per step: 867.89ms (116 steps over 6 blocks, entropy-bound)
throughput: 15.3 tok/s (1536 tok in 100675.41ms), in-step parallel 295 tok/s (256-tok canvas x 19.3 steps/block)

@fizzAI

fizzAI commented Jun 11, 2026

> Disclaimer Heavy usage of AI, but verified logits matching with transformers, checked FP16 vs FP32 KV cache, long context checks and much more

Is this not a violation of the project's contributing guidelines?

> This project does not accept pull requests that are fully or predominantly AI-generated. AI tools may be utilized solely in an assistive capacity.
> AI assistance is permissible only when the majority of the code is authored by a human contributor, with AI employed exclusively for corrections or to expand on verbose modifications that the contributor has already conceptualized (e.g., generating repeated lines with minor variations).

@Iipal

Iipal commented Jun 11, 2026

> it's give me : error loading model: unknown model architecture: 'diffusion-gemma'
@mohamed-em2m

Make sure you have checked out the DiffusionGemma PR branch before compiling, as follows:

git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp

# this step is important
git fetch origin pull/24423/head:pr-24423 && git switch pr-24423

cmake -B build -DGGML_CUDA=ON
cmake --build build -j --config Release --target llama-diffusion-cli

@pl752

pl752 commented Jun 11, 2026

Contributor

Encountered one rough spot: I am unable to enable GPU sampling. It seems the program disables it due to multi-GPU even when only one device is selected. To use it, I have to set CUDA_VISIBLE_DEVICES=0 in the environment. Also, utilization is reported as 40-50% without and 70-80% with GPU sampling (RTX 3090, Q4_K_M).

@luminary19

Same issue as above (Q4_K_M). Running on an RTX 4070 with 32 GB RAM and 8 GB VRAM. Setting -ngl down to 2 or 3 doesn't help: the model still runs thanks to CPU fallback, but GPU usage reads 0 due to failed sampling.

@mohamed-em2m

Bug: diffusion-gemma ignores -ngl and server never starts listening

Environment

  • Commit: PR DiffusionGemma #24423
  • GPU: NVIDIA L4 (22 GB VRAM)
  • CUDA: enabled
  • OS: Linux
  • Model: unsloth/diffusiongemma-26B-A4B-it-GGUF
  • Quantization: Q4_K_M

Command

./build/bin/llama-diffusion-gemma-server \
  /root/.cache/huggingface/hub/unsloth/diffusiongemma-26B-A4B-it-GGUF/diffusiongemma-26B-A4B-it-Q4_K_M.gguf \
  --diffusion-eb auto \
  --diffusion-eb-max-steps 48 \
  --diffusion-eb-t-max 1.0 \
  --diffusion-eb-t-min 0.6 \
  --diffusion-visual \
  --host 0.0.0.0 \
  --port 8080 \
  -ngl 99 \
  -cnv \
  -n 2048

GPU Detection

ggml_cuda_init: found 1 CUDA devices (Total VRAM: 22563 MiB):
  Device 0: NVIDIA L4, compute capability 8.9

Observed Behavior

The model loads successfully, but no layers are offloaded to the GPU:

load_tensors: offloading 0 repeating layers to GPU
load_tensors: offloaded 0/31 layers to GPU

CPU_Mapped model buffer size = 16013.13 MiB

The server reports:

diffusion-gemma-server ready (n_vocab=262144, MAXTOK=2304, NGL=0)
READY 262144

despite launching with:

-ngl 99

Additionally, after reaching:

READY 262144

the process appears to stall and never starts serving requests. I do not see any indication that the HTTP server is listening on port 8080.

For example:

curl http://127.0.0.1:8080

fails because nothing is listening on the configured port.

Additional Diagnostics

The log also contains:

done_getting_tensors: tensor 'token_embd.weight' (q6_K) (and 696 others) cannot be used with preferred buffer type CUDA_Host, using CPU instead

and:

CUDA0 compute buffer size = 2955.75 MiB

which suggests CUDA is initialized and compute buffers are created, but all model tensors remain on CPU.

Expected Behavior

  1. -ngl 99 should offload layers to the GPU, or an explicit message should explain why diffusion-gemma currently does not support GPU offloading.
  2. After model initialization completes, the server should start listening on the configured host/port (0.0.0.0:8080) and accept requests.

@luminary19

Same issue as above (Q4_K_M). Running on an RTX 4070 with 32 GB RAM and 8 GB VRAM. Setting -ngl down to 2 or 3 doesn't help: the model still runs thanks to CPU fallback, but GPU usage reads 0 due to failed sampling.

Setting --diffusion-gpu-sample-reduce off lets it work, but throughput is still low at around 5.2 tok/s (112 in-step parallel). I guess this is what I can expect from 8 GB VRAM.

@DATEx2

DATEx2 commented Jun 11, 2026

I don't understand what I am doing wrong -
I am getting 248 tokens/sec on regular llama.cpp / GEMMA 4 MTP2 / 5090 / Q6_K and now I am getting only 167/190 tokens / sec in this diffusion compiled llama.cpp - why?

@Iipal

Iipal commented Jun 11, 2026

Current (10a2613) progress on the tests:

total time: 31084.10ms, time per step: 535.93ms (58 steps over 4 blocks, entropy-bound)
throughput: 32.9 tok/s (1024 tok in 31084.10ms), in-step parallel 478 tok/s (256-tok canvas x 14.5 steps/block)

previous (commit: 15ad8f4):

total time: 33923.17ms, time per step: 595.14ms (57 steps over 3 blocks, entropy-bound)
throughput: 22.6 tok/s (768 tok in 33923.17ms), in-step parallel 430 tok/s (256-tok canvas x 19.0 steps/block)

run script:

./build/bin/llama-diffusion-cli \
  -m unsloth/diffusiongemma-26B-A4B-it-Q4_K_M.gguf \
  -ngl 99 -cnv --diffusion-visual --n-cpu-moe 18 --no-mmap \
    -n 2048 \
    --threads 8 \
    --threads-batch 8 \
    --main-gpu 0 -fa on \
    --split-mode none

both tested on the single prompt: "create a fibonacci script in python"

system: WSL2, 5070Ti, 32 GB DDR4, R7 5800X

GPU usage: 30-55% (15.2/16 GB VRAM)
CPU usage: 20-40%
RAM usage: 25/32 GB (with all other apps included)

But with each subsequent prompt, the t/s drops by half.

Labels

examples, ggml (changes relating to the ggml tensor library for machine learning), model (Model specific), Nvidia GPU (Issues specific to Nvidia GPUs), python (python script changes), testing (Everything test related)
