Diffusion Gemma benchmark scripts

Python scripts for benchmarking llama-diffusion-gemma-cli and llama-diffusion-gemma-server with GGUF Diffusion Gemma models.

The scripts are standalone and only use the Python standard library. They do not assume a fixed llama.cpp checkout path. Provide the executable path and the model path explicitly.

Files

bench-diffusion-gemma-cli.py: runs the CLI executable across one prompt or a prompt file.
bench-diffusion-gemma-server.py: starts the server executable, sends OpenAI compatible requests, collects timings and metrics, then stops the server.
diffusion-gemma-prompts.txt: sample prompt sweep with one prompt per line.

Requirements

Python 3.10 or newer.
A built llama.cpp Diffusion Gemma CLI executable.
A built llama.cpp Diffusion Gemma server executable.
A compatible GGUF model.

Prompt files

Use --prompt-file to pass a text file with one prompt per non-empty line. If --prompt-file is omitted, the scripts use --prompt.

The included diffusion-gemma-prompts.txt asks for progressively longer answers so the benchmark can exercise different generated block counts.

Current defaults

The scripts are aligned with the current Diffusion Gemma CUDA defaults:

Context size defaults to 8096.
Denoising steps default to 48.
Top-k is not passed by default, so the binaries use full softmax (top_k=0).
EOS is respected by default. The server script only sends ignore_eos when --ignore-eos is passed.
--run-max-denoising-step is not passed by default, so the device stop state is read each denoising step and blocks can stop early.
The CUDA fast paths that are binary defaults are not passed explicitly by the scripts. Use the --no-* switches when you want to benchmark an opt-out.
The scripts pass --diffusion-cuda-mmq-max-x 64 by default. Use --diffusion-cuda-mmq-max-x 0 to disable that process override, or pass a different positive value to test another MMQ tile cap.

Output layout

Use --output-dir to choose where results are written. If omitted, the current directory is used.

By default the scripts write:

CLI: <output-dir>/diffusion-cli/
Server: <output-dir>/diffusion-server/

Each output directory contains:

report.md: markdown benchmark report.
outputs.jsonl: generated text and metrics for every run.
Per-run logs or response JSON files.

You can override individual output paths with:

--log-dir
--json-out
--report-out
--outputs-jsonl

Key metrics

The report treats token throughput as canvas token throughput.

The report also includes:

Total denoising steps
Blocks total
Mean denoising steps/block
Per-step time avg, ms/step

Mean denoising steps/block is computed as:

total denoising steps / total generated blocks

CLI benchmark

Example:

python .\bench-diffusion-gemma-cli.py `
  --binary <path-to-llama-diffusion-gemma-cli> `
  --model <path-to-model.gguf> `
  --prompt-file .\diffusion-gemma-prompts.txt `
  --output-dir .\benchmark-results `
  --repeat 1 `
  --warmup 0 `
  --n-predict 3584

Useful options:

--ctx-size: context size passed to the CLI. Default: 8096.
--n-predict: maximum output tokens requested from the CLI.
--diffusion-steps: default denoising steps per block. Default: 48.
--diffusion-cuda-mmq-max-x: MMQ tile cap passed to the CLI. Default: 64.
--top-k: optional top-k override. Omitted by default, which keeps Diffusion Gemma on full softmax (top_k=0).
--no-diffusion-cuda-fused-full-softmax, --no-diffusion-cuda-fast-top-k, and related diffusion CUDA switches can be used to opt out of binary defaults for ablation runs.
--repeat: kept benchmark runs per prompt.
--warmup: warmup runs per prompt, recorded but excluded from summaries.

Server benchmark

Example:

python .\bench-diffusion-gemma-server.py `
  --binary <path-to-llama-diffusion-gemma-server> `
  --model <path-to-model.gguf> `
  --prompt-file .\diffusion-gemma-prompts.txt `
  --output-dir .\benchmark-results `
  --repeat 1 `
  --warmup 0 `
  --max-tokens 25600

Useful options:

--ctx-size: context size passed to the server. Default: 8096.
--max-tokens: maximum output tokens requested per request.
--diffusion-steps: server default denoising steps per block. Default: 48.
--diffusion-cuda-mmq-max-x: MMQ tile cap passed to the server. Default: 64.
--request-diffusion-steps: per-request denoising step override.
--top-k: server-level top-k override. Omitted by default, which keeps the server on full softmax (top_k=0).
--request-top-k: per-request top-k override.
--endpoint: chat or completion. Default: chat.
--host and --port: server bind address. Defaults: 127.0.0.1:18081.
--repeat: kept benchmark runs per prompt.
--warmup: warmup runs per prompt, recorded but excluded from summaries.

By default, the server does not set ignore_eos, so generation stops naturally when the model emits EOS. --max-tokens is an upper cap. Pass --ignore-eos only when you intentionally want to force the full requested block count.

Long-context runs

The default context size is 8096. Increase it when asking for many blocks. For example:

python .\bench-diffusion-gemma-server.py `
  --binary <path-to-llama-diffusion-gemma-server> `
  --model <path-to-model.gguf> `
  --prompt-file .\diffusion-gemma-prompts.txt `
  --ctx-size 26368 `
  --max-tokens 25600 `
  --output-dir .\benchmark-results-long

For 100 canvas blocks of 256 tokens, use an output cap around 25600 and a large enough context size for the prompt plus generated canvases.

Windows note

If process launch fails with a duplicate Path / PATH environment error, sanitize the process environment before running Python:

$pathValue = [Environment]::GetEnvironmentVariable('Path', 'Process')
if (-not $pathValue) { $pathValue = [Environment]::GetEnvironmentVariable('PATH', 'Process') }
[Environment]::SetEnvironmentVariable('PATH', $null, 'Process')
[Environment]::SetEnvironmentVariable('Path', $pathValue, 'Process')

Name		Name	Last commit message	Last commit date
Latest commit History 3 Commits
README.md		README.md
bench-diffusion-gemma-cli.py		bench-diffusion-gemma-cli.py
bench-diffusion-gemma-server.py		bench-diffusion-gemma-server.py
diffusion-gemma-prompts.txt		diffusion-gemma-prompts.txt

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Diffusion Gemma benchmark scripts

Files

Requirements

Prompt files

Current defaults

Output layout

Key metrics

CLI benchmark

Server benchmark

Long-context runs

Windows note

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

Diffusion Gemma benchmark scripts

Files

Requirements

Prompt files

Current defaults

Output layout

Key metrics

CLI benchmark

Server benchmark

Long-context runs

Windows note

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages