LiveSQLBench-CLI

LiveSQLBench CLI

Harbor-based tooling to turn LiveSQLBench instances into Harbor tasks and run terminal agents (OpenHands, Codex CLI, Oracle, etc.) against PostgreSQL-backed databases.

For benchmark context, datasets, and citations for LiveSQLBench itself, see the repository README.

Prerequisites

Docker (for agent images and per-database PostgreSQL images)
uv (Python env / uv run harbor …)
Dataset JSONL + PostgreSQL dumps laid out locally (see below)

Install dependencies from this directory:

cd LiveSQLBench-CLI
uv sync

Layout

Place datasets and dumps under LiveSQLBench-CLI/ (or paths you pass to run_adapter.py). A typical layout:

LiveSQLBench-CLI/
  data/
    livesqlbench-base-lite/
    livesqlbench-base-full/          # e.g. HF clone of base-full-v1
    livesqlbench-large-v1/
    postgre_table_dumps/             # lite dumps (one subdir per DB)
    postgre_table_dumps_base_full/
    postgre_table_dumps_large/       # large-v1 dumps
  evaluation/
    src/                             # verifier helpers used by generated tasks

The adapter resolves common paths automatically; you can override with CLI flags (see python3 adapters/livesqlbench/run_adapter.py --help), including --data-jsonl / --gt-jsonl when filenames differ from the defaults.

Prepare datasets and dumps

Full step-by-step setup (including HF clones and evaluation Docker flows) is in the repository README — the same Google Drive dump archives are linked in the Evaluation Environment Setup / Download the database dumps section there.

Hugging Face hosts the public task JSONL; PostgreSQL dump archives are distributed via Google Drive (same links as the root README):

Variant	Hugging Face (task data)	PostgreSQL dumps (Google Drive)	Place under this CLI layout
Base-Lite	`birdsql/livesqlbench-base-lite`	Download dumps	`data/postgre_table_dumps/`
Base-Full v1	`birdsql/livesqlbench-base-full-v1`	Download dumps	`data/postgre_table_dumps_base_full/`
Large-v1	`birdsql/livesqlbench-large-v1`	Download dumps	`data/postgre_table_dumps_large/`

The root README uses paths like evaluation/postgre_table_dumps; for this directory, keep the same unzip structure but under data/… as in the layout above.

Unzip dumps so each database appears as a subdirectory under the matching dump folder.

Build Docker images

1. Agent image (OpenHands + Codex in default image)

cd adapters/livesqlbench/template
docker build -t livesqlbench-main-openhands:latest -f environment/Dockerfile.openhands .
cd ../../..

Optional Codex-only image for smaller task regeneration:

cd adapters/livesqlbench/template
docker build -t livesqlbench-main-codex:latest -f environment/Dockerfile.codex .
cd ../../..

2. PostgreSQL images (one image per database)

Base-Lite:

bash adapters/livesqlbench/build_db_images.sh --db-dir data/postgre_table_dumps

Base-Full:

bash adapters/livesqlbench/build_db_images.sh --db-dir data/postgre_table_dumps_base_full

Large-v1 (newer dumps often need a newer server image):

bash adapters/livesqlbench/build_db_images.sh \
  --db-dir data/postgre_table_dumps_large \
  --postgres-base-image postgres:18

Use --force to rebuild existing tags, and --name <db> to build a single database.

Generate Harbor task directories

Base-Lite:

python3 adapters/livesqlbench/run_adapter.py \
  --data-root data/livesqlbench-base-lite \
  --db-dump-root data/postgre_table_dumps \
  --eval-src-dir evaluation/src \
  --output-dir datasets/livesqlbench-lite \
  --agent-image livesqlbench-main-openhands:latest

Base-Full:

python3 adapters/livesqlbench/run_adapter.py \
  --data-root data/livesqlbench-base-full \
  --db-dump-root data/postgre_table_dumps_base_full \
  --eval-src-dir evaluation/src \
  --output-dir datasets/livesqlbench-full \
  --agent-image livesqlbench-main-openhands:latest

Large-v1:

python3 adapters/livesqlbench/run_adapter.py \
  --data-root data/livesqlbench-large-v1 \
  --db-dump-root data/postgre_table_dumps_large \
  --eval-src-dir evaluation/src \
  --output-dir datasets/livesqlbench-large \
  --agent-image livesqlbench-main-openhands:latest

Add --limit 1 for a smoke test; add --force to overwrite an existing output tree.

Verify with Oracle

Lite / Full / Large (adjust -p):

uv run harbor run -p datasets/livesqlbench-lite -a oracle -n 4
uv run harbor run -p datasets/livesqlbench-full -a oracle -n 4
uv run harbor run -p datasets/livesqlbench-large -a oracle -n 4

Use -n 1 if you prefer a single trial at a time (easier debugging or lighter machine load).

Environment variables

export OPENAI_API_KEY=your_key
export LLM_API_KEY="$OPENAI_API_KEY"    # OpenHands often expects this as well

Use other provider keys as required by your agent and model. Behind a proxy, set http_proxy / https_proxy (and unset conflicting all_proxy if needed) before harbor run.

Run agents

Example (OpenHands, replace model id as needed):

uv run harbor run -p datasets/livesqlbench-lite \
  -a openhands \
  -m openai/gpt-5.5 \
  --ak reasoning_effort=low \
  --ae MAX_ITERATIONS=50 \
  --agent-timeout-multiplier 0.333 \
  -n 4

For Large-v1 you can use the helper script:

./scripts/run_openhands_gpt55_low.sh

Codex CLI (-a codex) uses OPENAI_API_KEY; it does not use OpenHands MAX_ITERATIONS, but task timeouts and --agent-timeout-multiplier still apply.

Acknowledgment

This component builds on the Harbor framework (Apache-2.0). Please cite Harbor when using this CLI stack:

@software{Harbor_Framework_Team_Harbor_A_framework_2026,
  author = {{Harbor Framework Team}},
  title = {{Harbor: A framework for evaluating and optimizing agents and models in container environments}},
  year = {2026},
  month = jan,
  url = {https://github.com/harbor-framework/harbor}
}

Name		Name	Last commit message	Last commit date
parent directory ..
adapters		adapters
scripts		scripts
src/harbor		src/harbor
.gitignore		.gitignore
.python-version		.python-version
LICENSE		LICENSE
README.md		README.md
pyproject.toml		pyproject.toml
uv.lock		uv.lock

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

README.md

LiveSQLBench CLI

Prerequisites

Layout

Prepare datasets and dumps

Build Docker images

1. Agent image (OpenHands + Codex in default image)

2. PostgreSQL images (one image per database)

Generate Harbor task directories

Verify with Oracle

Environment variables

Run agents

Acknowledgment

FilesExpand file tree

LiveSQLBench-CLI

Directory actions

More options

Directory actions

More options

Latest commit

History

LiveSQLBench-CLI

Folders and files

parent directory

README.md

LiveSQLBench CLI

Prerequisites

Layout

Prepare datasets and dumps

Build Docker images

1. Agent image (OpenHands + Codex in default image)

2. PostgreSQL images (one image per database)

Generate Harbor task directories

Verify with Oracle

Environment variables

Run agents

Acknowledgment