DiffusionGemma#24423
Conversation
Some diffusion cli and visual updates
|
Hi @danielhanchen, thanks for your contribution! Per our contribution guidelines, the automated PR checker found the following issue(s) that need your attention:
Please note that maintainers reserve the right to make final decisions on PRs. If you believe there is a mistake, please comment below. |
|
Oof, that's a big one. There's a ton of debugging stuff left in there that needs throwing out, for one. I'm also not convinced about the idea to make a server just for one model - I think if we're intending to support diffusion models in a server mechanism, it should be a general diffusion-server (but that's just my opinion, probably have to wait for what @ggerganov thinks about this one). |
|
Haha sorry - this PR was more of a direct translation / proof of concept that it works! |
|
I'll edit the PR - sorry we're juggling multiple things haha |
|
Another PR for DiffusionGemma: #24427 |
|
You have some failing tests to fix. :) |
With a block diffusion model, couldn't the regular server just return each block when it is finished diffusing? It would be nice to just have one server, and the API could remain fully compatible so clients don't need to be aware that they're dealing with a diffusion model. (We don't show the distribution of sampled logits for AR models, and I don't see why people would need to see the intermediate diffusion steps either, since those won't be useful.) |
|
Doesn't build on Windows: |
|
Yep will fix haha - I also added a short GIF of it working edited in description! |
…s, drop debug hooks - guard sys/ioctl.h behind _WIN32 and add a GetConsoleScreenBufferInfo fallback for the visual viewport size, so diffusion-cli builds on Windows - skip diffusion-gemma in test-llama-archs like gemma4 (shared ISWA backbone, no synthetic fixture params yet) - remove the DG_DUMP_KV_LAYER / DG_NSWA debug scaffolding and its llama.h API - fix flake8 E306 in conversion/diffusion_gemma.py
|
I was able to compile it successfully on Linux for my 4090, but when running it, I get the following error after sending a user message: The command I'm running is: To compile it, I used: |
Builds and runs on Windows now. |
|
is it only cli at this point? no llama-server ? |
|
Thank you for putting this together! An Issue I found is that --fit doesn't work with this PR. |
|
Test run on my system with AMD hardware (7900 XTX) in it, Q4_K_M - time per step: 364.45ms Screencast_20260610_234032_c.webm |
|
@icedream What (equivalent) tokens per second are you getting? I also tried running it with an AMD GPU (R9700 w/ vulkan) and only got ~27t/s. |
|
@lucasbinder Not 100% sure if that's the right way to calculate it but based on two more runs with the same prompt, calculating with 256 tokens per full canvas diffused (I left out the last canvas as tail end of response), taking the start/end timings per canvas from the Run 1 2.327213 - 9.803998 = 7.476785 = 34.24 t/s Run 2 2.328457 - 9.777839 = 7.449382 = 34.37 t/s (Also I should clarify I used ROCm, not Vulkan in my case so that may be influencing the performance as well.) |
|
Unofficial prebuilt binaries for anyone who wants to test this PR without setting up a CUDA toolchain: https://github.com/gbuznote-beep/llama-diffusion-cli-prebuilt
Data points from testing (256 tokens, EB sampler): A5000 full-GPU 0.98 s/step; RTX 3070 Ti Laptop 8 GB via WSL2 ( |
|
Results on RTX PRO6000 + 5090 (I could have been used only the 6000) Not sure if I calculating the t/s correctly |
…to model params) The CLI hand-builds llama_model_params and never copied tensor_buft_overrides, so -ot and --n-cpu-moe were parsed but silently dropped - the MoE experts stayed on the GPU and OOMed small-VRAM cards. Mirror common_model_params_to_llama.
… /clear
- --diffusion-gpu-sampling {auto,on,off} (default auto = on for single-GPU):
keep the prev step's canvas logits in a device buffer (sc_dev) and read
self-conditioning from it instead of a 268 MB host upload each step. SC
inputs are bit-identical to the host path; auto-disables on multi-GPU like
--diffusion-kv-cache. ~1.3x per step.
- cli: add effective + in-step-parallel throughput to the timing summary.
- cli: add /help and /clear in conversation mode.
|
Throughput increased from 1461 tokens / s to 1831 tok/s! (1.25x faster) with 0 change in accuracy! Tested on B200x1 Q8_0 quant Also added a new section so it's more clear on numbers: Before: Use |
|
@danielhanchen hello, currently testing local this PR on 5070Ti, how do I get the and I got these results so far as best: run command: ./build/bin/llama-diffusion-cli \
-m diffusiongemma-26B-A4B-it-Q4_K_M.gguf \
-ngl 99 -cnv --diffusion-visual --n-cpu-moe 18 --no-mmap \
-n 2048 \
--threads 8 \
--threads-batch 8 \
--main-gpu 0 -fa on \
--split-mode none |
|
@Iipal You'll need to recompile! |
|
@danielhanchen Yeap, here is the results: total time: 33923.17ms, time per step: 595.14ms (57 steps over 3 blocks, entropy-bound) In comparison: |
Sample argmax/entropy/multinomial per canvas position directly from the
device sc_dev buffer instead of copying the [C, n_vocab] canvas logits to
host (268 MB/step) and reducing on the CPU. Removes the last per-step bus
copy on the entropy-bound path.
- new ggml-cuda kernel (dense, top_k==0), reached from llama via the
backend-reg proc-address boundary (no new llama<->cuda link); falls back
to the host path on non-CUDA / multi-GPU / no sc_dev.
- --diffusion-gpu-sample-reduce {auto,on,off}, auto=on for single-GPU,
requires --diffusion-gpu-sampling. byte-identical when off.
- argmax bit-identical to host every step; Z/entropy differ only by the
parallel-reduction order (~1e-4), same FP-equivalence class as
--diffusion-kv-cache. greedy decode identical; stochastic output
identical on every prompt tested. ~1.42x per step on B200 Q8_0.
|
New update again! Now 2200 tokens / s on B200x1 so 1461 to 2200! |
cudaPointerGetAttributes / cudaPointerAttributes / cudaMemoryTypeDevice are not mapped by the hip/musa vendor layer. Drop the pointer-attribute device probe (the sampler is gated to a single CUDA device, so the tensor is already on the current device) and route the runtime calls through CUDA_CHECK.
|
Hello, just reporting that on mac/metal (M3 Ultra and 8bit unsloth quant) I get an error after each diffusion step. The output text seems to generate normally so it's not a fatal error, but this error doesn't seem to appear for other platforms. This seems to happen if the visualizer is on or off (if the visualizer is on, the error just quickly flashes at the bottom of the screen each diffusion step). The cmake build went perfectly fine with -DGGML_CUDA=OFF , no errors while building. |
Persistent forward server that runs diffusion_generate_entropy_bound and streams the per-step argmax canvas (plus each committed block) over stdin/stdout, so a host can render the denoise without reloading the model. Reuses the entropy-bound decoder; links llama-diffusion.
|
it's give me : error loading model: unknown model architecture: 'diffusion-gemma' |
Take chat messages as JSON and apply the GGUF chat template + tokenizer in
the server (common_chat_templates + common_tokenize), and stream the per-step
canvas and committed blocks back as detokenized text. Drops the need for any
client-side tokenizer; the request is now {seed, n_blocks, messages}.
When a backend cannot run the on-device sampler (e.g. Metal), latch the fallback after the first failure: warn once and use the host reduction for the rest of the run instead of retrying and logging an error every step. Output is unchanged (host sampling was already the fallback); only the per-step error spam is removed.
|
this is so cool! on my 5090 in WSL I get |
|
From an M4 Pro, 14 cores: |
Is this not a violation of the project's contributing guidelines?
|
Make sure you are checkout correctly to the diffusiongemma branch before compilation, as the instruction follows: |
|
Encountered one rough place: I am unable to enable gpu sampling: it seems that the program disables it due to multi-gpu even when only one device is selected. In order to use it I have to use env |
|
Same issue as above (Q4_K_M). Running on an RTX 4070, 32GB RAM 8GB VRAM. I tried setting -ngl down to 2 or 3 doesn't work -- model still runs thanks to CPU fallback, but GPU usage records 0 due to failed sampling. |
Bug: diffusion-gemma ignores
|
Setting --diffusion-gpu-sample-reduce off lets it work, but throughput is still low at around ~5.2tok/s (112 in step parallel). Guess this is what I can expect from 8GB VRAM. |
|
I don't understand what I am doing wrong - |
|
current(10a2613) progress on the tests: total time: 31084.10ms, time per step: 535.93ms (58 steps over 4 blocks, entropy-bound) previous (commit: 15ad8f4):
run script: both tested on the single prompt: "create a fibonacci script in python" system: WSL2, 5070Ti, 32Gb DDR4, R7 5800x GPU usage: 30-55% (15.2\16 Gb VRAM) But with each next following prompt, the t/s speed drops in half |
Worked on prelim Diffusion Gemma support!
llama-cliviallama-diffusion-cli -cnv -n 2048llama-diffusion-cli -cnv -n 2048 --diffusion-visualTo try this PR:
git clone https://github.com/ggml-org/llama.cpp cd llama.cpp gh pr checkout 24423 cmake -B build -DGGML_CUDA=ON cmake --build build -j --config Release --target llama-diffusion-clithen use a GGUF (any can work but for eg)
then use chat or visualization:
or
Example below (a bit blurry to limit to 10MB on Github :()

Disclaimer Heavy usage of AI, but verified logits matching with transformers, checked FP16 vs FP32 KV cache, long context checks and much more