gbuznote-beep/llama-diffusion-cli-prebuilt

llama-diffusion-cli — prebuilt binaries (llama.cpp PR #24423 / DiffusionGemma)

Unofficial, experimental prebuilt binaries of llama-diffusion-cli from the not-yet-merged llama.cpp pull request ggml-org/llama.cpp#24423, which adds support for the diffusion-gemma architecture. DiffusionGemma 26B-A4B is a diffusion language model: it generates text by iteratively denoising 256-token canvases with bidirectional attention instead of decoding autoregressively, token by token.

⚠️ Why this repo exists: while the PR is unmerged, no official llama.cpp release contains the diffusion-gemma architecture or ships llama-diffusion-cli builds — and a CUDA build requires a full nvcc toolchain. These binaries let you try DiffusionGemma today without building anything. Once the PR merges and official releases ship it, prefer those.

All credit for the implementation goes to the PR author (danielhanchen) and the ggml-org/llama.cpp project. This repo only packages binaries (pinned commit c84e85a, 2026-06-10) with reproducible build scripts.


⬇️ Downloads

Binaries are on the Releases page (not in the repo file tree):

| Asset | Platform | Backend | Notes |
|---|---|---|---|
| linux-cuda128-sm86.tar.gz (~898 MB) | Linux x86_64 / WSL2 | CUDA 12.8, sm_86 only | self-contained: bundles libcudart/libcublas/libcublasLt/libnccl |
| win-x64-cpu.zip (~3 MB) | Windows x64 | CPU (AVX2) | no GPU; slow but zero dependencies |

Verify your downloads against SHA256SUMS.txt in this repo.
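A minimal sketch of that verification step, assuming SHA256SUMS.txt uses the standard `sha256sum` output format (hash, two spaces, filename) and sits in the same directory as the downloaded assets. The function name `verify_downloads` is illustrative and not part of this repo.

```shell
# Check every listed file that is present in the current directory.
# --ignore-missing (GNU coreutils) skips assets you did not download.
verify_downloads() {
    sums_file="${1:-SHA256SUMS.txt}"
    sha256sum --check --ignore-missing "$sums_file"
}
```

Calling `verify_downloads` prints `<file>: OK` for each matching asset and returns non-zero if any checksum differs.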

Compatibility — read before downloading

  • CUDA bundle is sm_86 ONLY: RTX 30-series (3060–3090 Ti), RTX A4000/A5000/A6000. Other GPUs (40-series = sm_89, 20-series = sm_75, …) need a rebuild — takes ~10 min with scripts/remote-diffusion-build.sh (just change -DCMAKE_CUDA_ARCHITECTURES).
  • glibc ≥ 2.39 (Ubuntu 24.04+, Fedora 40+; WSL2 Ubuntu-24.04 ✔). Built on Ubuntu 24.04.
  • NVIDIA driver with CUDA ≥ 12.8 support (Linux R570+, or current Windows driver for WSL2).
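The requirements above can be checked before downloading with a short preflight script. This is a sketch under stated assumptions: `version_ge` is a hypothetical helper, it relies on GNU `sort -V`, and the `compute_cap` query field needs a reasonably recent NVIDIA driver.

```shell
# Succeeds when version $1 >= version $2 (uses GNU sort -V).
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# glibc: getconf prints e.g. "glibc 2.39" on glibc-based systems.
glibc_version="$(getconf GNU_LIBC_VERSION 2>/dev/null | awk '{print $2}')"
if [ -n "$glibc_version" ] && version_ge "$glibc_version" 2.39; then
    echo "glibc $glibc_version: OK"
else
    echo "glibc ${glibc_version:-unknown}: the CUDA bundle needs >= 2.39"
fi

# GPU: compute_cap prints e.g. "8.6", which corresponds to sm_86.
if command -v nvidia-smi >/dev/null 2>&1; then
    nvidia-smi --query-gpu=name,compute_cap,driver_version --format=csv,noheader \
        || echo "nvidia-smi query failed"
else
    echo "nvidia-smi not found: cannot check GPU or driver"
fi
```

The script only reports what it finds. A compute capability other than 8.6 means the prebuilt CUDA bundle will not match your GPU.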

Quick start (WSL2 — also applies to native Linux)

# 1) extract
tar xzf llama-diffusion-cli-pr24423-c84e85a-linux-cuda128-sm86.tar.gz
cd bundle

# 2) get the model (~14.5 GB) — for WSL2, store it on ext4 (NOT /mnt/c — mmap over 9P crawls)
wget -O model.gguf "https://huggingface.co/unsloth/diffusiongemma-26B-A4B-it-GGUF/resolve/main/diffusiongemma-26B-A4B-it-Q4_K_M.gguf"

# 3) run (24 GB VRAM: everything on GPU)
./run.sh -m model.gguf -p "Write a haiku about local AI." -n 256 -ngl 99

# 3b) small VRAM (8–16 GB): keep MoE expert tensors on CPU
./run.sh -m model.gguf -p "Write a haiku about local AI." -n 256 -ngl 99 --n-cpu-moe 22

# fun: watch the canvas denoise live
./run.sh ... --diffusion-visual

run.sh just sets LD_LIBRARY_PATH to the bundled lib/ (plus /usr/lib/wsl/lib for the WSL2 driver passthrough) and execs the binary. The model uses an entropy-bounded sampler by default (max 48 steps, adaptive stop) — no sampling flags needed. -n sets the token budget; generation happens in 256-token canvas blocks.
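For readers who want to see what such a wrapper amounts to, here is a hypothetical re-creation of the logic described above. The shipped run.sh may differ; the function names and the assumption that the binary sits at the top of the bundle directory are illustrative only.

```shell
# Hypothetical re-creation of the wrapper logic; the shipped run.sh may differ.

# $1 = bundle directory. Prints the library search path: the bundled lib/ first,
# then the WSL2 driver passthrough directory, then any pre-existing value.
bundle_lib_path() {
    printf '%s/lib:/usr/lib/wsl/lib%s\n' "$1" "${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"
}

# $1 = bundle directory; remaining arguments are passed through to the binary.
# Assumes the binary is named llama-diffusion-cli and lives in the bundle root.
run_bundled() {
    bundle_dir="$1"
    shift
    LD_LIBRARY_PATH="$(bundle_lib_path "$bundle_dir")" \
        exec "$bundle_dir/llama-diffusion-cli" "$@"
}
```

Putting the bundled lib/ first means the packaged CUDA runtime libraries are preferred over any system copies, while libcuda.so.1 still resolves from the driver's own directory.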

Windows CPU build

Unzip the archive, then run:
llama-diffusion-cli.exe -m model.gguf -p "Write a haiku about local AI." -n 256 -t <physical cores>

Measured performance (256 tokens ≈ 23–26 denoising steps)

| Hardware | Config | s/step | total |
|---|---|---|---|
| RTX A5000 24 GB (datacenter) | full GPU, -ngl 99 | 0.98 | 23 s |
| RTX 3070 Ti Laptop 8 GB (WSL2) | -ngl 99 --n-cpu-moe 22 | 5.9 | ~2.5 min |
| i7-12700H, 14 cores (Windows CPU build) | CPU only | 17.1 | ~7.4 min |

For reference, the same machine runs autoregressive Gemma-4 26B-A4B at ~34 t/s — DiffusionGemma support is young; treat this as a tech preview, not a production inference path.
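To compare the measured totals with a tokens-per-second figure, divide the 256-token budget by the wall-clock time. A small sketch, where `tokens_per_sec` is a hypothetical helper and the approximate totals are converted to seconds (~2.5 min = 150 s, ~7.4 min = 444 s):

```shell
# tokens_per_sec TOKENS SECONDS -> effective throughput, one decimal place
tokens_per_sec() {
    awk -v t="$1" -v s="$2" 'BEGIN { printf "%.1f\n", t / s }'
}

tokens_per_sec 256 23     # RTX A5000, 23 s total
tokens_per_sec 256 150    # RTX 3070 Ti Laptop, ~2.5 min total
tokens_per_sec 256 444    # i7-12700H CPU build, ~7.4 min total
```

This works out to roughly 11, 1.7, and 0.6 tokens/s respectively. These are effective end-to-end rates for one 256-token canvas, not per-token decode speeds.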

Known limitations / gotchas

  • Unmerged PR: behavior may change upstream; this build is pinned to commit c84e85a.
  • llama-server cannot serve diffusion models — there is no OpenAI-compatible API; CLI only.
  • The bundle deliberately does NOT include libcuda.so.1 (that's the driver's library). Do not add one — on WSL2 the passthrough copy in /usr/lib/wsl/lib must be the one that loads.
  • libnccl.so.2 is included (the build links it); sourced from the official PyPI nvidia-nccl-cu12 wheel.
  • Model not included — download from unsloth/diffusiongemma-26B-A4B-it-GGUF.
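If the CUDA build fails to find the GPU after you have modified the bundle, one thing worth ruling out is a stray driver library in lib/. A minimal sketch, with `no_bundled_libcuda` as a hypothetical helper name:

```shell
# Succeeds when no driver library (libcuda.so*) exists under the given directory.
# The pattern deliberately does not match libcudart.so.*, which is meant to be bundled.
no_bundled_libcuda() {
    [ -z "$(find "$1" -name 'libcuda.so*' 2>/dev/null)" ]
}
```

From inside the extracted bundle, `no_bundled_libcuda lib || echo "remove libcuda.so.* from lib/"` flags a copy that would shadow the driver's own library.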

Rebuild it yourself (reproducible)

  • scripts/remote-diffusion-build.sh — full recipe used for the CUDA build (runs on any Ubuntu 24.04 + CUDA 12.8 box; we used a $0.23/h vast.ai RTX A5000 — total cost of this build: $0.21). Change -DCMAKE_CUDA_ARCHITECTURES for your GPU.
  • scripts/wsl-fix-nccl.sh — fetches libnccl.so.2 from the PyPI wheel without pip.
  • scripts/build-windows-cpu.bat — Windows CPU build (VS 2022 Build Tools + CMake + Ninja).
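When rebuilding for a different GPU, the value for -DCMAKE_CUDA_ARCHITECTURES is the compute capability with the dot removed (8.9 becomes 89). A sketch for deriving it from the installed GPU, assuming `nvidia-smi` supports the `compute_cap` query field; `cap_to_arch` is a hypothetical helper, and how the flag is edited into scripts/remote-diffusion-build.sh is left to the script itself:

```shell
# "8.9" -> "89": the value format used by CMAKE_CUDA_ARCHITECTURES
cap_to_arch() {
    printf '%s\n' "$1" | tr -d '.'
}

if command -v nvidia-smi >/dev/null 2>&1; then
    # Use the first GPU reported by the driver.
    cap="$(nvidia-smi --query-gpu=compute_cap --format=csv,noheader | head -n 1)"
    if [ -n "$cap" ]; then
        echo "-DCMAKE_CUDA_ARCHITECTURES=$(cap_to_arch "$cap")"
    fi
fi
```

On a machine without `nvidia-smi` the snippet prints nothing; look up the compute capability for your GPU model instead.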

License & attribution

  • llama-diffusion-cli is built from llama.cpp (MIT, © The ggml authors) — diffusion-gemma support by PR #24423 (danielhanchen).
  • NVIDIA CUDA runtime libraries redistributed per the CUDA Toolkit EULA (redistributable components); NCCL under BSD-3-Clause. See THIRD_PARTY_NOTICES.md.
  • Repo content (scripts, docs): MIT.
  • DiffusionGemma model © Google (Gemma license), GGUF quantization by Unsloth — not included.

Provided as-is, no warranty. Not affiliated with ggml-org, Google, NVIDIA, or Unsloth.
