Harbor-based tooling to turn LiveSQLBench instances into Harbor tasks and run terminal agents (OpenHands, Codex CLI, Oracle, etc.) against PostgreSQL-backed databases.
For benchmark context, datasets, and citations for LiveSQLBench itself, see the repository README.
- Docker (for agent images and per-database PostgreSQL images)
- uv (Python env /
uv run harbor …) - Dataset JSONL + PostgreSQL dumps laid out locally (see below)
Install dependencies from this directory:
cd LiveSQLBench-CLI
uv syncPlace datasets and dumps under LiveSQLBench-CLI/ (or paths you pass to run_adapter.py). A typical layout:
LiveSQLBench-CLI/
data/
livesqlbench-base-lite/
livesqlbench-base-full/ # e.g. HF clone of base-full-v1
livesqlbench-large-v1/
postgre_table_dumps/ # lite dumps (one subdir per DB)
postgre_table_dumps_base_full/
postgre_table_dumps_large/ # large-v1 dumps
evaluation/
src/ # verifier helpers used by generated tasks
The adapter resolves common paths automatically; you can override with CLI flags (see python3 adapters/livesqlbench/run_adapter.py --help), including --data-jsonl / --gt-jsonl when filenames differ from the defaults.
Full step-by-step setup (including HF clones and evaluation Docker flows) is in the repository README — the same Google Drive dump archives are linked in the Evaluation Environment Setup / Download the database dumps section there.
Hugging Face hosts the public task JSONL; PostgreSQL dump archives are distributed via Google Drive (same links as the root README):
| Variant | Hugging Face (task data) | PostgreSQL dumps (Google Drive) | Place under this CLI layout |
|---|---|---|---|
| Base-Lite | birdsql/livesqlbench-base-lite |
Download dumps | data/postgre_table_dumps/ |
| Base-Full v1 | birdsql/livesqlbench-base-full-v1 |
Download dumps | data/postgre_table_dumps_base_full/ |
| Large-v1 | birdsql/livesqlbench-large-v1 |
Download dumps | data/postgre_table_dumps_large/ |
The root README uses paths like evaluation/postgre_table_dumps; for this directory, keep the same unzip structure but under data/… as in the layout above.
Unzip dumps so each database appears as a subdirectory under the matching dump folder.
cd adapters/livesqlbench/template
docker build -t livesqlbench-main-openhands:latest -f environment/Dockerfile.openhands .
cd ../../..Optional Codex-only image for smaller task regeneration:
cd adapters/livesqlbench/template
docker build -t livesqlbench-main-codex:latest -f environment/Dockerfile.codex .
cd ../../..Base-Lite:
bash adapters/livesqlbench/build_db_images.sh --db-dir data/postgre_table_dumpsBase-Full:
bash adapters/livesqlbench/build_db_images.sh --db-dir data/postgre_table_dumps_base_fullLarge-v1 (newer dumps often need a newer server image):
bash adapters/livesqlbench/build_db_images.sh \
--db-dir data/postgre_table_dumps_large \
--postgres-base-image postgres:18Use --force to rebuild existing tags, and --name <db> to build a single database.
Base-Lite:
python3 adapters/livesqlbench/run_adapter.py \
--data-root data/livesqlbench-base-lite \
--db-dump-root data/postgre_table_dumps \
--eval-src-dir evaluation/src \
--output-dir datasets/livesqlbench-lite \
--agent-image livesqlbench-main-openhands:latestBase-Full:
python3 adapters/livesqlbench/run_adapter.py \
--data-root data/livesqlbench-base-full \
--db-dump-root data/postgre_table_dumps_base_full \
--eval-src-dir evaluation/src \
--output-dir datasets/livesqlbench-full \
--agent-image livesqlbench-main-openhands:latestLarge-v1:
python3 adapters/livesqlbench/run_adapter.py \
--data-root data/livesqlbench-large-v1 \
--db-dump-root data/postgre_table_dumps_large \
--eval-src-dir evaluation/src \
--output-dir datasets/livesqlbench-large \
--agent-image livesqlbench-main-openhands:latestAdd --limit 1 for a smoke test; add --force to overwrite an existing output tree.
Lite / Full / Large (adjust -p):
uv run harbor run -p datasets/livesqlbench-lite -a oracle -n 4
uv run harbor run -p datasets/livesqlbench-full -a oracle -n 4
uv run harbor run -p datasets/livesqlbench-large -a oracle -n 4Use -n 1 if you prefer a single trial at a time (easier debugging or lighter machine load).
export OPENAI_API_KEY=your_key
export LLM_API_KEY="$OPENAI_API_KEY" # OpenHands often expects this as wellUse other provider keys as required by your agent and model. Behind a proxy, set http_proxy / https_proxy (and unset conflicting all_proxy if needed) before harbor run.
Example (OpenHands, replace model id as needed):
uv run harbor run -p datasets/livesqlbench-lite \
-a openhands \
-m openai/gpt-5.5 \
--ak reasoning_effort=low \
--ae MAX_ITERATIONS=50 \
--agent-timeout-multiplier 0.333 \
-n 4For Large-v1 you can use the helper script:
./scripts/run_openhands_gpt55_low.shCodex CLI (-a codex) uses OPENAI_API_KEY; it does not use OpenHands MAX_ITERATIONS, but task timeouts and --agent-timeout-multiplier still apply.
This component builds on the Harbor framework (Apache-2.0). Please cite Harbor when using this CLI stack:
@software{Harbor_Framework_Team_Harbor_A_framework_2026,
author = {{Harbor Framework Team}},
title = {{Harbor: A framework for evaluating and optimizing agents and models in container environments}},
year = {2026},
month = jan,
url = {https://github.com/harbor-framework/harbor}
}