Unofficial, experimental prebuilt binaries of llama-diffusion-cli from the (not yet merged)
llama.cpp pull request ggml-org/llama.cpp#24423,
which adds support for the diffusion-gemma architecture (DiffusionGemma 26B-A4B —
a diffusion language model that generates by iteratively denoising 256-token canvases with
bidirectional attention, instead of autoregressive token-by-token decoding).
⚠️ Why this repo exists: while the PR is unmerged, no official llama.cpp release contains the `diffusion-gemma` architecture or ships `llama-diffusion-cli` builds, and a CUDA build requires a full nvcc toolchain. These binaries let you try DiffusionGemma today without building anything. Once the PR merges and official releases ship it, prefer those.
All credit for the implementation goes to the PR author (danielhanchen) and the
ggml-org/llama.cpp project. This repo only packages
binaries (pinned commit c84e85a, 2026-06-10) with reproducible build scripts.
Binaries are on the Releases page (not in the repo file tree):
| Asset | Platform | Backend | Notes |
|---|---|---|---|
| `linux-cuda128-sm86.tar.gz` (~898 MB) | Linux x86_64 / WSL2 | CUDA 12.8, sm_86 only | self-contained: bundles libcudart/libcublas/libcublasLt/libnccl |
| `win-x64-cpu.zip` (~3 MB) | Windows x64 | CPU (AVX2) | no GPU; slow but zero dependencies |
SHA256SUMS.txt in this repo — verify your downloads.
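A minimal sketch of that verification step. It assumes GNU coreutils `sha256sum` and that `SHA256SUMS.txt` uses the standard `<hash>  <filename>` layout, which is not confirmed here. The helper name `verify_download` is hypothetical.

```shell
# verify_download FILE SUMS: succeed only if FILE's SHA-256 matches its entry in SUMS.
# Assumes SUMS uses the "<hash>  <filename>" layout written by sha256sum.
verify_download() {
    file=$1
    sums=$2
    # Look up the expected hash by file name (second column; a leading "*" marks binary mode).
    expected=$(awk -v f="$(basename "$file")" \
        '{ n = $2; sub(/^\*/, "", n); if (n == f) print $1 }' "$sums")
    if [ -z "$expected" ]; then
        echo "no checksum entry for $file" >&2
        return 2
    fi
    actual=$(sha256sum "$file" | awk '{ print $1 }')
    [ "$expected" = "$actual" ]
}
```

Usage would look like `verify_download <downloaded asset> SHA256SUMS.txt && echo OK`. With GNU coreutils, `sha256sum --ignore-missing -c SHA256SUMS.txt` performs the same check for every listed file present in the current directory.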
- CUDA bundle is sm_86 ONLY: RTX 30-series (3060–3090 Ti), RTX A4000/A5000/A6000. Other GPUs (40-series = sm_89, 20-series = sm_75, …) need a rebuild, which takes ~10 min with `scripts/remote-diffusion-build.sh` (just change `-DCMAKE_CUDA_ARCHITECTURES`).
- glibc ≥ 2.39 (Ubuntu 24.04+, Fedora 40+; WSL2 Ubuntu-24.04 ✔). Built on Ubuntu 24.04.
- NVIDIA driver with CUDA ≥ 12.8 support (Linux R570+, or current Windows driver for WSL2).
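The version requirements above can be checked before downloading anything. This is a sketch, not part of the bundle: it assumes GNU `sort -V` and `getconf GNU_LIBC_VERSION` are available, and the helper names `version_ge` and `glibc_ok` are hypothetical.

```shell
# version_ge A B: succeed when dotted version A is greater than or equal to B.
# Relies on GNU sort's -V (version sort).
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# glibc_ok: compare the running glibc against the 2.39 minimum listed above.
# getconf prints something like "glibc 2.39"; the second field is the version.
glibc_ok() {
    have=$(getconf GNU_LIBC_VERSION 2>/dev/null | awk '{ print $2 }')
    [ -n "$have" ] && version_ge "$have" 2.39
}
```

For example, `glibc_ok || echo "glibc is older than 2.39"`. On native Linux the same comparison can be applied to the driver, e.g. `version_ge "$(nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -n 1)" 570`; on WSL2 the reported version is the Windows driver's, so that threshold does not apply there.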
# 1) extract
tar xzf llama-diffusion-cli-pr24423-c84e85a-linux-cuda128-sm86.tar.gz
cd bundle
# 2) get the model (~14.5 GB) — for WSL2, store it on ext4 (NOT /mnt/c — mmap over 9P crawls)
wget -O model.gguf "https://huggingface.co/unsloth/diffusiongemma-26B-A4B-it-GGUF/resolve/main/diffusiongemma-26B-A4B-it-Q4_K_M.gguf"
# 3) run (24 GB VRAM: everything on GPU)
./run.sh -m model.gguf -p "Write a haiku about local AI." -n 256 -ngl 99
# 3b) small VRAM (8–16 GB): keep MoE expert tensors on CPU
./run.sh -m model.gguf -p "Write a haiku about local AI." -n 256 -ngl 99 --n-cpu-moe 22
# fun: watch the canvas denoise live
./run.sh ... --diffusion-visual

`run.sh` just sets `LD_LIBRARY_PATH` to the bundled `lib/` (plus `/usr/lib/wsl/lib` for the WSL2
driver passthrough) and execs the binary. The model uses an entropy-bounded sampler by default
(max 48 steps, adaptive stop) — no sampling flags needed. -n sets the token budget;
generation happens in 256-token canvas blocks.
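The wrapper behavior described above can be sketched as follows. This is not the bundled `run.sh`: the helper name `bundle_lib_path` is hypothetical, and the binary sitting next to the script is an assumption.

```shell
# bundle_lib_path DIR: print an LD_LIBRARY_PATH value that points at DIR/lib
# and adds the WSL2 driver passthrough directory when that directory exists.
bundle_lib_path() {
    path="$1/lib"
    if [ -d /usr/lib/wsl/lib ]; then
        path="$path:/usr/lib/wsl/lib"
    fi
    printf '%s\n' "$path"
}

# A wrapper built on it could end like this (assumed layout, binary beside the script):
#   here=$(cd "$(dirname "$0")" && pwd)
#   LD_LIBRARY_PATH=$(bundle_lib_path "$here") exec "$here/llama-diffusion-cli" "$@"
```

Using `exec` replaces the shell with the binary, so signals and the exit code pass straight through to the caller.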
Unzip, then:
llama-diffusion-cli.exe -m model.gguf -p "Write a haiku about local AI." -n 256 -t <physical cores>

| Hardware | Config | s/step | total |
|---|---|---|---|
| RTX A5000 24 GB (datacenter) | full GPU, `-ngl 99` | 0.98 | 23 s |
| RTX 3070 Ti Laptop 8 GB (WSL2) | `-ngl 99 --n-cpu-moe 22` | 5.9 | ~2.5 min |
| i7-12700H, 14 cores (Windows CPU build) | CPU only | 17.1 | ~7.4 min |
For reference, the same machine runs autoregressive Gemma-4 26B-A4B at ~34 t/s — DiffusionGemma support is young; treat this as a tech preview, not a production inference path.
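To put the table on the same scale as the ~34 t/s figure, the rows can be converted to effective tokens per second. The sketch below assumes each run used `-n 256` as in the commands above and that the reported total covers generation only; neither is stated in the table. The helper names are hypothetical.

```shell
# tokens_per_second TOKENS SECONDS: effective throughput, one decimal place.
tokens_per_second() {
    awk -v t="$1" -v s="$2" 'BEGIN { printf "%.1f\n", t / s }'
}

# implied_steps TOTAL_SECONDS SECONDS_PER_STEP: denoising steps implied by a table row.
implied_steps() {
    awk -v total="$1" -v step="$2" 'BEGIN { printf "%.1f\n", total / step }'
}

tokens_per_second 256 23    # RTX A5000 row, assuming -n 256
implied_steps 23 0.98       # RTX A5000 row
```

Under those assumptions the A5000 row works out to roughly 11 t/s over about 23 steps, which is below the 48-step cap of the default sampler.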
- Unmerged PR: behavior may change upstream; this build is pinned to commit `c84e85a`.
- `llama-server` cannot serve diffusion models. There is no OpenAI-compatible API; CLI only.
- The bundle deliberately does NOT include `libcuda.so.1` (that's the driver's library). Do not add one: on WSL2 the passthrough copy in `/usr/lib/wsl/lib` must be the one that loads.
- `libnccl.so.2` is included (the build links it); sourced from the official PyPI `nvidia-nccl-cu12` wheel.
- Model not included; download from unsloth/diffusiongemma-26B-A4B-it-GGUF.
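If a bundle has been modified or repacked, a quick check can confirm that no driver library slipped into it. This is a sketch: the helper name `bundle_has_libcuda` is hypothetical, and it only looks in the bundle's `lib/` directory.

```shell
# bundle_has_libcuda DIR: succeed if a driver library was copied into DIR/lib.
bundle_has_libcuda() {
    [ -e "$1/lib/libcuda.so.1" ] || [ -e "$1/lib/libcuda.so" ]
}
```

For example, `bundle_has_libcuda . && echo "remove lib/libcuda.so* from the bundle"`. To see which driver library the system itself provides, `ldconfig -p | grep libcuda` lists the copies known to the dynamic linker cache.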
- `scripts/remote-diffusion-build.sh`: full recipe used for the CUDA build (runs on any Ubuntu 24.04 + CUDA 12.8 box; we used a $0.23/h vast.ai RTX A5000, and the total cost of this build was $0.21). Change `-DCMAKE_CUDA_ARCHITECTURES` for your GPU.
- `scripts/wsl-fix-nccl.sh`: fetches `libnccl.so.2` from the PyPI wheel without pip.
- `scripts/build-windows-cpu.bat`: Windows CPU build (VS 2022 Build Tools + CMake + Ninja).
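When rebuilding for another GPU, the value for `-DCMAKE_CUDA_ARCHITECTURES` is the compute capability with the dot removed (8.9 becomes 89, 7.5 becomes 75). A small sketch of that mapping follows; the helper name `cap_to_cmake_arch` is hypothetical, and how the build script exposes the flag is not covered here, so the cmake line may need to be edited by hand.

```shell
# cap_to_cmake_arch CAP: turn a compute capability such as "8.9" into the
# value CMake expects for CMAKE_CUDA_ARCHITECTURES ("89").
cap_to_cmake_arch() {
    printf '%s\n' "$1" | tr -d '. '
}

# On a machine with a recent NVIDIA driver, the capability can be queried with:
#   cap=$(nvidia-smi --query-gpu=compute_cap --format=csv,noheader | head -n 1)
#   echo "-DCMAKE_CUDA_ARCHITECTURES=$(cap_to_cmake_arch "$cap")"
```

The `compute_cap` query field is only available in newer `nvidia-smi` releases; on older drivers, look the GPU up in NVIDIA's compute capability table instead.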
- `llama-diffusion-cli` is built from llama.cpp (MIT, © The ggml authors), with diffusion-gemma support by PR #24423 (danielhanchen).
- NVIDIA CUDA runtime libraries redistributed per the CUDA Toolkit EULA (redistributable components); NCCL under BSD-3-Clause. See THIRD_PARTY_NOTICES.md.
- Repo content (scripts, docs): MIT.
- DiffusionGemma model © Google (Gemma license), GGUF quantization by Unsloth — not included.
Provided as-is, no warranty. Not affiliated with ggml-org, Google, NVIDIA, or Unsloth.