llama-diffusion-cli — prebuilt binaries (llama.cpp PR #24423 / DiffusionGemma)

Unofficial, experimental prebuilt binaries of llama-diffusion-cli from the (not yet merged) llama.cpp pull request ggml-org/llama.cpp#24423, which adds support for the diffusion-gemma architecture (DiffusionGemma 26B-A4B — a diffusion language model that generates by iteratively denoising 256-token canvases with bidirectional attention, instead of autoregressive token-by-token decoding).

⚠️ Why this repo exists: while the PR is unmerged, no official llama.cpp release contains the diffusion-gemma architecture or ships llama-diffusion-cli builds — and a CUDA build requires a full nvcc toolchain. These binaries let you try DiffusionGemma today without building anything. Once the PR merges and official releases ship it, prefer those.

All credit for the implementation goes to the PR author (danielhanchen) and the ggml-org/llama.cpp project. This repo only packages binaries (pinned commit c84e85a, 2026-06-10) with reproducible build scripts.

⬇️ Downloads

Binaries are on the Releases page (not in the repo file tree):

| Asset | Platform | Backend | Notes |
|---|---|---|---|
| `linux-cuda128-sm86.tar.gz` (~898 MB) | Linux x86_64 / WSL2 | CUDA 12.8, sm_86 only | self-contained: bundles `libcudart`/`libcublas`/`libcublasLt`/`libnccl` |
| `win-x64-cpu.zip` (~3 MB) | Windows x64 | CPU (AVX2) | no GPU; slow but zero dependencies |

SHA256SUMS.txt in this repo lists the expected checksums. Verify your downloads against it before running anything.
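One way to do the check on Linux or WSL2, as a minimal sketch. It assumes GNU coreutils `sha256sum` and that SHA256SUMS.txt has been saved next to the downloaded assets; `verify_downloads` is a helper name made up for this example.

```shell
# verify_downloads [DIR]: check every asset present in DIR against SHA256SUMS.txt.
# --ignore-missing skips entries for assets you did not download; sha256sum
# still fails if a present file mismatches or if no file could be verified.
verify_downloads() {
    ( cd "${1:-.}" && sha256sum --check --ignore-missing SHA256SUMS.txt )
}
```

Call it as `verify_downloads ~/Downloads` and only extract the archive if it returns success.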

Compatibility — read before downloading

CUDA bundle is sm_86 ONLY: RTX 30-series (3060–3090 Ti), RTX A4000/A5000/A6000. Other GPUs (40-series = sm_89, 20-series = sm_75, …) need a rebuild — takes ~10 min with scripts/remote-diffusion-build.sh (just change -DCMAKE_CUDA_ARCHITECTURES).
glibc ≥ 2.39 (Ubuntu 24.04+, Fedora 40+; WSL2 Ubuntu-24.04 ✔). Built on Ubuntu 24.04.
NVIDIA driver with CUDA ≥ 12.8 support (Linux R570+, or current Windows driver for WSL2).
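The first two requirements can be checked before downloading ~898 MB. This is a sketch, not part of the bundle: it assumes your `nvidia-smi` supports the `compute_cap` query field and that GNU `sort -V` is available; `cap_to_arch` and `version_ge` are helper names invented here.

```shell
# Convert a compute capability such as "8.6" into the "86" form used by
# sm_86 and -DCMAKE_CUDA_ARCHITECTURES.
cap_to_arch() {
    printf '%s\n' "$1" | tr -d '. '
}

# Return 0 if version $1 >= version $2 (relies on GNU sort -V).
version_ge() {
    [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n1)" = "$2" ]
}

# GPU check: the prebuilt CUDA bundle only contains sm_86 kernels.
if command -v nvidia-smi >/dev/null 2>&1; then
    cap=$(nvidia-smi --query-gpu=compute_cap --format=csv,noheader 2>/dev/null | head -n1)
    arch=$(cap_to_arch "$cap")
    if [ "$arch" = "86" ]; then
        echo "GPU is sm_86: the prebuilt CUDA bundle matches"
    elif [ -n "$arch" ]; then
        echo "GPU is sm_$arch: rebuild with -DCMAKE_CUDA_ARCHITECTURES=$arch"
    fi
fi

# glibc check: the bundle was built on Ubuntu 24.04 (glibc 2.39).
glibc=$(getconf GNU_LIBC_VERSION 2>/dev/null | awk '{print $2}')
if [ -n "$glibc" ]; then
    if version_ge "$glibc" 2.39; then
        echo "glibc $glibc: OK"
    else
        echo "glibc $glibc: too old, need 2.39 or newer"
    fi
fi
```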

Quick start (WSL2 — also applies to native Linux)

# 1) extract
tar xzf llama-diffusion-cli-pr24423-c84e85a-linux-cuda128-sm86.tar.gz
cd bundle

# 2) get the model (~14.5 GB) — for WSL2, store it on ext4 (NOT /mnt/c — mmap over 9P crawls)
wget -O model.gguf "https://huggingface.co/unsloth/diffusiongemma-26B-A4B-it-GGUF/resolve/main/diffusiongemma-26B-A4B-it-Q4_K_M.gguf"

# 3) run (24 GB VRAM: everything on GPU)
./run.sh -m model.gguf -p "Write a haiku about local AI." -n 256 -ngl 99

# 3b) small VRAM (8–16 GB): keep MoE expert tensors on CPU
./run.sh -m model.gguf -p "Write a haiku about local AI." -n 256 -ngl 99 --n-cpu-moe 22

# fun: watch the canvas denoise live
./run.sh ... --diffusion-visual

run.sh just sets LD_LIBRARY_PATH to the bundled lib/ (plus /usr/lib/wsl/lib for the WSL2 driver passthrough) and execs the binary. The model uses an entropy-bounded sampler by default (max 48 steps, adaptive stop) — no sampling flags needed. -n sets the token budget; generation happens in 256-token canvas blocks.
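For readers who want to see what such a wrapper looks like, here is a hypothetical re-creation. It is not the shipped run.sh: the function names and the assumption that the binary sits next to the script are illustrative only.

```shell
#!/bin/sh
# Hypothetical run.sh-style wrapper; check the real script in the bundle.

# Print the library search path: bundled lib/ first, then the WSL2 driver
# passthrough directory if it exists, then whatever the caller already set.
bundle_lib_path() {
    dir=$1
    path="$dir/lib"
    if [ -d /usr/lib/wsl/lib ]; then
        path="$path:/usr/lib/wsl/lib"
    fi
    if [ -n "${LD_LIBRARY_PATH:-}" ]; then
        path="$path:$LD_LIBRARY_PATH"
    fi
    printf '%s\n' "$path"
}

# Replace the shell with the binary, forwarding all arguments unchanged.
# Assumes llama-diffusion-cli lives in the same directory as this script.
run_bundled() {
    here=$(cd "$(dirname "$0")" && pwd)
    LD_LIBRARY_PATH=$(bundle_lib_path "$here")
    export LD_LIBRARY_PATH
    exec "$here/llama-diffusion-cli" "$@"
}

# A real wrapper would end with:
#   run_bundled "$@"
```

Putting the bundled lib/ first means the packaged CUDA runtime wins over any system copy, while libcuda itself still comes from the driver.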

Windows CPU build

Unzip the archive, then run (set -t to your number of physical cores):
llama-diffusion-cli.exe -m model.gguf -p "Write a haiku about local AI." -n 256 -t <physical cores>

Measured performance (256 tokens ≈ 23–26 denoising steps)

| Hardware | Config | s/step | Total |
|---|---|---|---|
| RTX A5000 24 GB (datacenter) | full GPU, `-ngl 99` | 0.98 | 23 s |
| RTX 3070 Ti Laptop 8 GB (WSL2) | `-ngl 99 --n-cpu-moe 22` | 5.9 | ~2.5 min |
| i7-12700H, 14 cores (Windows CPU build) | CPU only | 17.1 | ~7.4 min |

For reference, the same machine runs autoregressive Gemma-4 26B-A4B at ~34 t/s — DiffusionGemma support is young; treat this as a tech preview, not a production inference path.
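To compare against a tokens-per-second figure, the totals above can be converted to effective throughput (256 tokens divided by wall-clock seconds). A small sketch; `tokens_per_sec` is a helper defined here, and the minute totals are rounded to 150 s and 444 s.

```shell
# tokens_per_sec TOKENS SECONDS: effective throughput, one decimal place.
tokens_per_sec() {
    awk -v t="$1" -v s="$2" 'BEGIN { printf "%.1f\n", t / s }'
}

tokens_per_sec 256 23    # RTX A5000, 23 s          -> 11.1
tokens_per_sec 256 150   # RTX 3070 Ti Laptop, ~2.5 min -> 1.7
tokens_per_sec 256 444   # i7-12700H, ~7.4 min      -> 0.6
```

These are derived from the measured totals in the table, not separately benchmarked numbers.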

Known limitations / gotchas

Unmerged PR: behavior may change upstream; this build is pinned to commit c84e85a.
llama-server cannot serve diffusion models — there is no OpenAI-compatible API; CLI only.
The bundle deliberately does NOT include libcuda.so.1 (that's the driver's library). Do not add one — on WSL2 the passthrough copy in /usr/lib/wsl/lib must be the one that loads.
libnccl.so.2 is included (the build links it); sourced from the official PyPI nvidia-nccl-cu12 wheel.
Model not included — download from unsloth/diffusiongemma-26B-A4B-it-GGUF.
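If the CUDA backend fails to initialize, the libcuda point above is worth checking first. A rough sketch, assuming the bundle keeps its libraries in lib/ and that `ldconfig` is reachable; both function names are invented for this example.

```shell
# Return 0 if the bundle's lib/ directory contains no copy of libcuda.so*.
# The pattern does not match libcudart, which is meant to be bundled.
bundle_is_clean() {
    found=$(find "$1/lib" -name 'libcuda.so*' 2>/dev/null | head -n1)
    [ -z "$found" ]
}

# Print where the dynamic linker cache resolves libcuda.so.1 (may be empty).
# On WSL2 this is expected to be the passthrough copy in /usr/lib/wsl/lib.
system_libcuda() {
    ldconfig -p 2>/dev/null | awk '/libcuda\.so\.1/ {print $NF; exit}'
}

if bundle_is_clean "${BUNDLE_DIR:-.}"; then
    echo "bundle has no libcuda copy; driver library: $(system_libcuda)"
else
    echo "remove the libcuda.so* copy from ${BUNDLE_DIR:-.}/lib"
fi
```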

Rebuild it yourself (reproducible)

scripts/remote-diffusion-build.sh — full recipe used for the CUDA build (runs on any Ubuntu 24.04 + CUDA 12.8 box; we used a $0.23/h vast.ai RTX A5000 — total cost of this build: $0.21). Change -DCMAKE_CUDA_ARCHITECTURES for your GPU.
scripts/wsl-fix-nccl.sh — fetches libnccl.so.2 from the PyPI wheel without pip.
scripts/build-windows-cpu.bat — Windows CPU build (VS 2022 Build Tools + CMake + Ninja).
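For orientation, the architecture change boils down to one CMake flag. The sketch below is not an excerpt from scripts/remote-diffusion-build.sh; it assumes the standard llama.cpp CUDA option `GGML_CUDA`, and `cuda_configure_flags` is a hypothetical helper.

```shell
# Print CMake configure flags for a given CUDA architecture number
# (75 = RTX 20-series, 86 = RTX 30-series, 89 = RTX 40-series).
cuda_configure_flags() {
    printf '%s\n' "-DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES=$1 -DCMAKE_BUILD_TYPE=Release"
}

# Example for an RTX 40-series card, run inside a llama.cpp checkout of the PR:
#   git fetch origin pull/24423/head:pr-24423 && git checkout c84e85a
#   cmake -B build $(cuda_configure_flags 89)
#   cmake --build build -j --target llama-diffusion-cli
echo "cmake -B build $(cuda_configure_flags 89)"
```

Prefer the repo's script for a build that matches the published binaries; the commands in the comments only show where the flag goes.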

License & attribution

llama-diffusion-cli is built from llama.cpp (MIT, © The ggml authors) — diffusion-gemma support by PR #24423 (danielhanchen).
NVIDIA CUDA runtime libraries redistributed per the CUDA Toolkit EULA (redistributable components); NCCL under BSD-3-Clause. See THIRD_PARTY_NOTICES.md.
Repo content (scripts, docs): MIT.

Provided as-is, no warranty. Not affiliated with ggml-org, Google, NVIDIA, or Unsloth.
