Skip to content

lnigam/llama.cpp-scripts

Repository files navigation

Diffusion Gemma benchmark scripts

Python scripts for benchmarking llama-diffusion-gemma-cli and llama-diffusion-gemma-server with GGUF Diffusion Gemma models.

The scripts are standalone and only use the Python standard library. They do not assume a fixed llama.cpp checkout path. Provide the executable path and the model path explicitly.

Files

  • bench-diffusion-gemma-cli.py: runs the CLI executable across one prompt or a prompt file.
  • bench-diffusion-gemma-server.py: starts the server executable, sends OpenAI compatible requests, collects timings and metrics, then stops the server.
  • diffusion-gemma-prompts.txt: sample prompt sweep with one prompt per line.

Requirements

  • Python 3.10 or newer.
  • A built llama.cpp Diffusion Gemma CLI executable.
  • A built llama.cpp Diffusion Gemma server executable.
  • A compatible GGUF model.

Prompt files

Use --prompt-file to pass a text file with one prompt per non-empty line. If --prompt-file is omitted, the scripts use --prompt.

The included diffusion-gemma-prompts.txt asks for progressively longer answers so the benchmark can exercise different generated block counts.

Current defaults

The scripts are aligned with the current Diffusion Gemma CUDA defaults:

  • Context size defaults to 8096.
  • Denoising steps default to 48.
  • Top-k is not passed by default, so the binaries use full softmax (top_k=0).
  • EOS is respected by default. The server script only sends ignore_eos when --ignore-eos is passed.
  • --run-max-denoising-step is not passed by default, so the device stop state is read each denoising step and blocks can stop early.
  • The CUDA fast paths that are binary defaults are not passed explicitly by the scripts. Use the --no-* switches when you want to benchmark an opt-out.
  • The scripts pass --diffusion-cuda-mmq-max-x 64 by default. Use --diffusion-cuda-mmq-max-x 0 to disable that process override, or pass a different positive value to test another MMQ tile cap.

Output layout

Use --output-dir to choose where results are written. If omitted, the current directory is used.

By default the scripts write:

  • CLI: <output-dir>/diffusion-cli/
  • Server: <output-dir>/diffusion-server/

Each output directory contains:

  • report.md: markdown benchmark report.
  • outputs.jsonl: generated text and metrics for every run.
  • Per-run logs or response JSON files.

You can override individual output paths with:

  • --log-dir
  • --json-out
  • --report-out
  • --outputs-jsonl

Key metrics

The report treats token throughput as canvas token throughput.

The report also includes:

  • Total denoising steps
  • Blocks total
  • Mean denoising steps/block
  • Per-step time avg, ms/step

Mean denoising steps/block is computed as:

total denoising steps / total generated blocks

CLI benchmark

Example:

python .\bench-diffusion-gemma-cli.py `
  --binary <path-to-llama-diffusion-gemma-cli> `
  --model <path-to-model.gguf> `
  --prompt-file .\diffusion-gemma-prompts.txt `
  --output-dir .\benchmark-results `
  --repeat 1 `
  --warmup 0 `
  --n-predict 3584

Useful options:

  • --ctx-size: context size passed to the CLI. Default: 8096.
  • --n-predict: maximum output tokens requested from the CLI.
  • --diffusion-steps: default denoising steps per block. Default: 48.
  • --diffusion-cuda-mmq-max-x: MMQ tile cap passed to the CLI. Default: 64.
  • --top-k: optional top-k override. Omitted by default, which keeps Diffusion Gemma on full softmax (top_k=0).
  • --no-diffusion-cuda-fused-full-softmax, --no-diffusion-cuda-fast-top-k, and related diffusion CUDA switches can be used to opt out of binary defaults for ablation runs.
  • --repeat: kept benchmark runs per prompt.
  • --warmup: warmup runs per prompt, recorded but excluded from summaries.

Server benchmark

Example:

python .\bench-diffusion-gemma-server.py `
  --binary <path-to-llama-diffusion-gemma-server> `
  --model <path-to-model.gguf> `
  --prompt-file .\diffusion-gemma-prompts.txt `
  --output-dir .\benchmark-results `
  --repeat 1 `
  --warmup 0 `
  --max-tokens 25600

Useful options:

  • --ctx-size: context size passed to the server. Default: 8096.
  • --max-tokens: maximum output tokens requested per request.
  • --diffusion-steps: server default denoising steps per block. Default: 48.
  • --diffusion-cuda-mmq-max-x: MMQ tile cap passed to the server. Default: 64.
  • --request-diffusion-steps: per-request denoising step override.
  • --top-k: server-level top-k override. Omitted by default, which keeps the server on full softmax (top_k=0).
  • --request-top-k: per-request top-k override.
  • --endpoint: chat or completion. Default: chat.
  • --host and --port: server bind address. Defaults: 127.0.0.1:18081.
  • --repeat: kept benchmark runs per prompt.
  • --warmup: warmup runs per prompt, recorded but excluded from summaries.

By default, the server does not set ignore_eos, so generation stops naturally when the model emits EOS. --max-tokens is an upper cap. Pass --ignore-eos only when you intentionally want to force the full requested block count.

Long-context runs

The default context size is 8096. Increase it when asking for many blocks. For example:

python .\bench-diffusion-gemma-server.py `
  --binary <path-to-llama-diffusion-gemma-server> `
  --model <path-to-model.gguf> `
  --prompt-file .\diffusion-gemma-prompts.txt `
  --ctx-size 26368 `
  --max-tokens 25600 `
  --output-dir .\benchmark-results-long

For 100 canvas blocks of 256 tokens, use an output cap around 25600 and a large enough context size for the prompt plus generated canvases.

Windows note

If process launch fails with a duplicate Path / PATH environment error, sanitize the process environment before running Python:

$pathValue = [Environment]::GetEnvironmentVariable('Path', 'Process')
if (-not $pathValue) { $pathValue = [Environment]::GetEnvironmentVariable('PATH', 'Process') }
[Environment]::SetEnvironmentVariable('PATH', $null, 'Process')
[Environment]::SetEnvironmentVariable('Path', $pathValue, 'Process')

About

Scripts needed to run various llama.cpp workflows

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors

Languages