Python scripts for benchmarking llama-diffusion-gemma-cli and
llama-diffusion-gemma-server with GGUF Diffusion Gemma models.
The scripts are standalone and only use the Python standard library. They do not assume a fixed llama.cpp checkout path. Provide the executable path and the model path explicitly.
bench-diffusion-gemma-cli.py: runs the CLI executable across one prompt or a prompt file.bench-diffusion-gemma-server.py: starts the server executable, sends OpenAI compatible requests, collects timings and metrics, then stops the server.diffusion-gemma-prompts.txt: sample prompt sweep with one prompt per line.
- Python 3.10 or newer.
- A built llama.cpp Diffusion Gemma CLI executable.
- A built llama.cpp Diffusion Gemma server executable.
- A compatible GGUF model.
Use --prompt-file to pass a text file with one prompt per non-empty line.
If --prompt-file is omitted, the scripts use --prompt.
The included diffusion-gemma-prompts.txt asks for progressively longer
answers so the benchmark can exercise different generated block counts.
The scripts are aligned with the current Diffusion Gemma CUDA defaults:
- Context size defaults to
8096. - Denoising steps default to
48. - Top-k is not passed by default, so the binaries use full softmax (
top_k=0). - EOS is respected by default. The server script only sends
ignore_eoswhen--ignore-eosis passed. --run-max-denoising-stepis not passed by default, so the device stop state is read each denoising step and blocks can stop early.- The CUDA fast paths that are binary defaults are not passed explicitly by the
scripts. Use the
--no-*switches when you want to benchmark an opt-out. - The scripts pass
--diffusion-cuda-mmq-max-x 64by default. Use--diffusion-cuda-mmq-max-x 0to disable that process override, or pass a different positive value to test another MMQ tile cap.
Use --output-dir to choose where results are written. If omitted, the current
directory is used.
By default the scripts write:
- CLI:
<output-dir>/diffusion-cli/ - Server:
<output-dir>/diffusion-server/
Each output directory contains:
report.md: markdown benchmark report.outputs.jsonl: generated text and metrics for every run.- Per-run logs or response JSON files.
You can override individual output paths with:
--log-dir--json-out--report-out--outputs-jsonl
The report treats token throughput as canvas token throughput.
The report also includes:
Total denoising stepsBlocks totalMean denoising steps/blockPer-step time avg, ms/step
Mean denoising steps/block is computed as:
total denoising steps / total generated blocks
Example:
python .\bench-diffusion-gemma-cli.py `
--binary <path-to-llama-diffusion-gemma-cli> `
--model <path-to-model.gguf> `
--prompt-file .\diffusion-gemma-prompts.txt `
--output-dir .\benchmark-results `
--repeat 1 `
--warmup 0 `
--n-predict 3584Useful options:
--ctx-size: context size passed to the CLI. Default:8096.--n-predict: maximum output tokens requested from the CLI.--diffusion-steps: default denoising steps per block. Default:48.--diffusion-cuda-mmq-max-x: MMQ tile cap passed to the CLI. Default:64.--top-k: optional top-k override. Omitted by default, which keeps Diffusion Gemma on full softmax (top_k=0).--no-diffusion-cuda-fused-full-softmax,--no-diffusion-cuda-fast-top-k, and related diffusion CUDA switches can be used to opt out of binary defaults for ablation runs.--repeat: kept benchmark runs per prompt.--warmup: warmup runs per prompt, recorded but excluded from summaries.
Example:
python .\bench-diffusion-gemma-server.py `
--binary <path-to-llama-diffusion-gemma-server> `
--model <path-to-model.gguf> `
--prompt-file .\diffusion-gemma-prompts.txt `
--output-dir .\benchmark-results `
--repeat 1 `
--warmup 0 `
--max-tokens 25600Useful options:
--ctx-size: context size passed to the server. Default:8096.--max-tokens: maximum output tokens requested per request.--diffusion-steps: server default denoising steps per block. Default:48.--diffusion-cuda-mmq-max-x: MMQ tile cap passed to the server. Default:64.--request-diffusion-steps: per-request denoising step override.--top-k: server-level top-k override. Omitted by default, which keeps the server on full softmax (top_k=0).--request-top-k: per-request top-k override.--endpoint:chatorcompletion. Default:chat.--hostand--port: server bind address. Defaults:127.0.0.1:18081.--repeat: kept benchmark runs per prompt.--warmup: warmup runs per prompt, recorded but excluded from summaries.
By default, the server does not set ignore_eos, so generation stops naturally
when the model emits EOS. --max-tokens is an upper cap. Pass --ignore-eos
only when you intentionally want to force the full requested block count.
The default context size is 8096. Increase it when asking for many blocks.
For example:
python .\bench-diffusion-gemma-server.py `
--binary <path-to-llama-diffusion-gemma-server> `
--model <path-to-model.gguf> `
--prompt-file .\diffusion-gemma-prompts.txt `
--ctx-size 26368 `
--max-tokens 25600 `
--output-dir .\benchmark-results-longFor 100 canvas blocks of 256 tokens, use an output cap around 25600 and a
large enough context size for the prompt plus generated canvases.
If process launch fails with a duplicate Path / PATH environment error,
sanitize the process environment before running Python:
$pathValue = [Environment]::GetEnvironmentVariable('Path', 'Process')
if (-not $pathValue) { $pathValue = [Environment]::GetEnvironmentVariable('PATH', 'Process') }
[Environment]::SetEnvironmentVariable('PATH', $null, 'Process')
[Environment]::SetEnvironmentVariable('Path', $pathValue, 'Process')