Harbor is a framework from the creators of Terminal-Bench for evaluating and optimizing agents and language models. With Harbor you can evaluate arbitrary agents (Claude Code, OpenHands, Codex CLI, and others) against curated datasets like Terminal-Bench, SWE-Bench, and Aider Polyglot, build and share your own benchmarks, run thousands of trials in parallel across cloud providers, and generate rollouts for RL optimization. Harbor abstracts the execution backend behind an --env flag. Tensorlake plugs in as one of those providers — alongside other sandboxes and local Docker — so the same Harbor commands run on Tensorlake sandboxes without changing your tasks, agents, or evaluators.
This guide focuses on running CLI-agent evaluations against benchmarks like Terminal-Bench. Harbor also supports generating rollouts for RL optimization — we’ll cover those workflows in follow-up guides.
New to Tensorlake? Sign up at the dashboard — new accounts include free credits, enough to run a full Terminal-Bench sweep before you pay for anything.

Quick start

1. Get a Tensorlake API key

Grab one from the Tensorlake Dashboard. You’ll also need an API key for whichever agent provider you want to evaluate (e.g., Anthropic).

2. Install Harbor with the Tensorlake provider

The harbor[tensorlake] extra installs the TensorLakeEnvironment provider alongside Harbor.
uv pip install "harbor[tensorlake]"

3. Set your environment variables

export TENSORLAKE_API_KEY="tl_..."
export ANTHROPIC_API_KEY="sk-ant-..."   # or another agent provider

4. Run a Terminal-Bench task

Run a single Terminal-Bench task on Tensorlake with Claude Code as the agent:
harbor run --env tensorlake \
  --include-task-name pytorch-model-cli \
  --dataset terminal-bench@2.0 \
  --agent claude-code \
  --model anthropic/claude-sonnet-4-6 \
  --ae ANTHROPIC_API_KEY=$ANTHROPIC_API_KEY
Drop --include-task-name to run the full Terminal-Bench 2.0 suite. --ae KEY=VALUE forwards an environment variable from your shell into the sandbox where the agent runs — add more --ae flags for any other secrets the agent needs.
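
If you launch runs from a script rather than by hand, the command above can be assembled programmatically. The following is a minimal sketch that uses only the flags shown in this guide; the helper name build_harbor_command and its parameters are hypothetical and not part of Harbor.

```python
import os
import shlex


def build_harbor_command(task_name, forwarded_env, environ=None):
    """Assemble the quick-start `harbor run` invocation as an argv list.

    Hypothetical helper: only flags shown in this guide are used.
    `forwarded_env` lists shell variables to pass into the sandbox
    through repeated --ae KEY=VALUE flags.
    """
    environ = os.environ if environ is None else environ
    argv = [
        "harbor", "run", "--env", "tensorlake",
        "--dataset", "terminal-bench@2.0",
        "--agent", "claude-code",
        "--model", "anthropic/claude-sonnet-4-6",
    ]
    # Omitting the task name corresponds to running the full suite.
    if task_name is not None:
        argv += ["--include-task-name", task_name]
    for key in forwarded_env:
        if key not in environ:
            raise KeyError(f"{key} is not set in the environment")
        argv += ["--ae", f"{key}={environ[key]}"]
    return argv


if __name__ == "__main__":
    demo_env = {"ANTHROPIC_API_KEY": "sk-ant-example"}  # placeholder value
    cmd = build_harbor_command("pytorch-model-cli", ["ANTHROPIC_API_KEY"], demo_env)
    print(shlex.join(cmd))
```

Failing early on a missing variable avoids starting a trial whose agent cannot authenticate.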

Why Tensorlake for Harbor

Harbor’s value comes from running large fleets of environments in parallel and trusting the results. Tensorlake’s runtime is designed for exactly that workload:
  • Per-trial sandboxes: each task starts on a clean machine that is destroyed at the end. No kernel state is shared between trials, which matters for both eval reproducibility and RL reward integrity.
  • Pre-warmed snapshots: environments with heavy apt/pip installs (PyTorch, CUDA toolchains, full Linux desktops) can be built once, snapshotted, and restored in under a second for every subsequent trial or rollout.
  • Independent verification: Harbor’s test script runs inside the sandbox and writes 1.0 or 0.0 to reward.txt. The agent never sees or touches the verifier, so “the agent said it worked” is never confused with “the tests pass.”
  • Parallel scale: Tensorlake schedules thousands of sandboxes concurrently, which is what RL rollout generation and full benchmark sweeps need.

Anatomy of a Harbor task

Harbor expects each task to follow this layout, shown here for the gcode-to-text task:
gcode-to-text
├── environment
│   ├── Dockerfile
│   └── text.gcode.gz
├── instruction.md
├── solution
│   └── solve.sh
├── task.toml
└── tests
    ├── test_outputs.py
    └── test.sh
  • environment/Dockerfile defines the base image and any setup steps.
  • instruction.md is the prompt the agent receives.
  • solution/ is an oracle reference used to validate the environment itself.
  • tests/test.sh runs after the agent finishes and produces reward.txt.
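
To make the verifier step concrete, here is a minimal sketch of the scoring logic a tests/test.sh script might delegate to: run the checks, then record 1.0 or 0.0. The function names are hypothetical, and the location of reward.txt is passed in as a parameter because this guide does not state where Harbor expects the file.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def write_reward(passed: bool, reward_path: Path) -> str:
    """Write "1.0" for a pass or "0.0" for a failure to the reward file."""
    value = "1.0" if passed else "0.0"
    reward_path.write_text(value + "\n")
    return value


def run_and_score(test_cmd: list, reward_path: Path) -> str:
    """Run a test command and derive the reward from its exit code (0 means pass)."""
    result = subprocess.run(test_cmd)
    return write_reward(result.returncode == 0, reward_path)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        # Assumed path for illustration only; use the path Harbor expects.
        reward_file = Path(tmp) / "reward.txt"
        # A stand-in for a real test command such as a pytest run.
        print(run_and_score([sys.executable, "-c", "pass"], reward_file))
```

Deriving the reward only from the test command's exit code keeps the score independent of anything the agent reports about its own work.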

Tune sandbox resources

Each task’s task.toml controls the sandbox Harbor provisions on Tensorlake. Set resources in the [environment] block:
task.toml
[environment]
cpus = 2
memory_mb = 4096
storage_mb = 20480
allow_internet = true
| Field | Default | Forwarded to Tensorlake |
|---|---|---|
| cpus | 1 | cpus |
| memory_mb | 2048 | memory_mb |
| storage_mb | 10240 | ephemeral_disk_mb |
| allow_internet | true | allow_internet_access |
Tensorlake requires memory_mb to be between 1024 and 8192 MB per CPU core.
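
The defaults, field mapping, and memory rule above can be checked before a run. The sketch below is illustrative rather than Harbor's actual implementation: the function name to_tensorlake_resources is hypothetical, and it assumes the 1024 and 8192 MB per-core bounds are inclusive.

```python
# Defaults and field names taken from the table in this section.
DEFAULTS = {"cpus": 1, "memory_mb": 2048, "storage_mb": 10240, "allow_internet": True}

FIELD_MAP = {
    "cpus": "cpus",
    "memory_mb": "memory_mb",
    "storage_mb": "ephemeral_disk_mb",
    "allow_internet": "allow_internet_access",
}

MIN_MB_PER_CPU = 1024  # assumed inclusive
MAX_MB_PER_CPU = 8192  # assumed inclusive


def to_tensorlake_resources(environment: dict) -> dict:
    """Apply defaults, validate memory per core, and rename fields for Tensorlake."""
    merged = {**DEFAULTS, **environment}
    cpus, memory_mb = merged["cpus"], merged["memory_mb"]
    low, high = MIN_MB_PER_CPU * cpus, MAX_MB_PER_CPU * cpus
    if not low <= memory_mb <= high:
        raise ValueError(
            f"memory_mb={memory_mb} is outside {low}..{high} MB for cpus={cpus}"
        )
    # Keys outside the documented set are ignored in this sketch.
    return {FIELD_MAP[name]: merged[name] for name in FIELD_MAP}


if __name__ == "__main__":
    print(to_tensorlake_resources({"cpus": 2, "memory_mb": 4096, "storage_mb": 20480}))
```

Under this reading, a task with cpus = 4 needs memory_mb between 4096 and 32768, so raising cpus without raising memory_mb can push a previously valid config out of range.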
A few rules of thumb:
  • Large or heavy images — if your environment/Dockerfile pulls in big toolchains (PyTorch, CUDA, full Linux desktops, large datasets), bump cpus and memory_mb so the build and runtime have headroom, and raise storage_mb past the image size plus working-set room. Underprovisioned sandboxes show up as build timeouts or OOMs mid-trial.
  • Lock down allow_internet — set allow_internet = false to stop the agent from searching the web for answers. If the verifier needs network access, bake those dependencies into the Dockerfile. Per-host allowlists are coming soon, so you’ll be able to block search engines while leaving package mirrors reachable.

Image build & caching

Each trial needs an image to boot from. Harbor on Tensorlake supports three modes — pick based on how expensive your environment is to build and how often you reuse it.
| Mode | How it boots | When to use |
|---|---|---|
| Legacy replay (default) | Boot a minimal Tensorlake base image, then replay the Dockerfile’s RUN/COPY on every trial. | Light Dockerfiles, quick iteration. |
| OCI image build | Build the Dockerfile once, cache it under a content hash, boot subsequent trials directly from the cached image. | Heavy apt/pip Dockerfiles where per-trial replay dominates wall time. |
| Snapshot restore | Restore a pre-warmed snapshot in under a second. | Stable environments reused across many trials or rollouts. |

Legacy replay

Default. No extra flags. Use while you’re iterating on a Dockerfile or when the build is cheap enough that per-trial replay isn’t a bottleneck.

OCI image build

OCI image build requires Harbor 0.9.0 or later. Upgrade with uv pip install --upgrade "harbor[tensorlake]" (or the pip equivalent) if you’re on an earlier version.
Pass --ek use_oci_image_build=true and Harbor builds the task’s Dockerfile once via Tensorlake’s image builder, registers it under a name keyed on the build context’s content hash, and boots every later trial directly from the cached image — no replay, no apt/pip work.
harbor run --env tensorlake \
  --dataset terminal-bench@2.0 \
  --agent claude-code \
  --model anthropic/claude-sonnet-4-6 \
  --ek use_oci_image_build=true
The cache key hashes the Dockerfile and every file in the build context, so changing a requirements.txt pin or any COPY’d file invalidates the cache automatically.

When to enable it
  • You’re running many trials of the same task. The build is paid once and amortized across every later trial — the more trials share the same content hash, the bigger the win. For one-shot runs the build latency is pure overhead.
  • Your Dockerfile is known to build cleanly on Tensorlake’s builder. OCI build is stricter than Docker (see Known limitations below). For Dockerfiles you haven’t validated yet, stick with legacy replay while you iterate.
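
To see why editing any file in the build context produces a new cache entry, consider a content hash computed over every file's relative path and bytes. This is a sketch of the general technique only; the function build_context_hash is hypothetical, and Harbor's real hashing scheme (algorithm, ordering, ignored files) is not described in this guide.

```python
import hashlib
import tempfile
from pathlib import Path


def build_context_hash(context_dir: Path) -> str:
    """Return a SHA-256 digest over every file path and file body in a directory."""
    digest = hashlib.sha256()
    files = [p for p in context_dir.rglob("*") if p.is_file()]
    # Sort by relative path so the digest does not depend on directory listing order.
    files.sort(key=lambda p: p.relative_to(context_dir).as_posix())
    for path in files:
        rel = path.relative_to(context_dir).as_posix().encode()
        data = path.read_bytes()
        # Length-prefix each field so name and content boundaries stay unambiguous.
        digest.update(len(rel).to_bytes(8, "big"))
        digest.update(rel)
        digest.update(len(data).to_bytes(8, "big"))
        digest.update(data)
    return digest.hexdigest()


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        ctx = Path(tmp)
        (ctx / "Dockerfile").write_text("FROM python:3.10-bookworm\n")
        (ctx / "requirements.txt").write_text("numpy==1.26.0\n")
        before = build_context_hash(ctx)
        (ctx / "requirements.txt").write_text("numpy==1.26.4\n")
        print(before != build_context_hash(ctx))
```

Because the digest depends on file contents rather than timestamps, trials that share an identical build context resolve to the same cached image regardless of when or where they run.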

What the logs show
The path Harbor took is logged at debug level. Run with --log-level debug (or inspect trial.log) and look for:
oci-build harbor-task-<hash>: {…build event…}
OCI image harbor-task-<hash> already registered (local marker); skipping build_sandbox_image call
tensorlake sandbox started: id=… image=harbor-task-<hash>
Skipping baseline setup and Dockerfile replay: booted from OCI-built image harbor-task-<hash>
If the build fails for any reason, Harbor automatically falls back to legacy replay and logs a warning at the default level:
build_sandbox_image(harbor-task-<hash>) failed; falling back to legacy boot-from-minimal + Dockerfile replay: …
So a failed OCI build never blocks a trial; it just costs the build attempt’s latency before replay takes over.

Known limitations
OCI build skips the compatibility shims that legacy replay applies post-boot, so a few Docker conventions don’t carry over:
  • COPY does not auto-create parent directories the way Docker does — COPY x /a/b/c fails if /a/b doesn’t already exist. Add an explicit RUN mkdir -p /a/b before the COPY.
  • OS-pinned apt versions (apt-get install curl=8.5.0-2ubuntu10.6) fail hard inside the builder. Legacy replay strips the pin transparently; OCI build does not. Drop the version pin or pick one that exists in the target distro.
  • Non-native Python versions must be installable via the FROM image’s own apt repos. Legacy replay falls back to deadsnakes/backports/uv to fetch e.g. Python 3.10 on Bookworm; OCI build doesn’t, so use a FROM image whose distro ships the Python version you need (e.g. python:3.10-bookworm).
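
Two of the limitations above, version-pinned apt packages and COPY into a directory that was never created, can be caught with a rough pre-flight check before enabling OCI build. The sketch below is a heuristic and not part of Harbor: lint_dockerfile is a hypothetical helper, it handles only shell-form instructions, and it will flag COPY destinations whose parent already exists in the base image.

```python
import posixpath
import re

# Matches tokens such as curl=8.5.0-2ubuntu10.6 in an apt install command.
APT_PIN = re.compile(r"^[a-z0-9][a-z0-9+.\-]*=\S+$")


def lint_dockerfile(text: str) -> list:
    """Return warnings for apt version pins and COPY targets with no prior mkdir."""
    warnings = []
    created_dirs = set()
    # Join backslash-continued lines so each instruction is examined whole.
    joined = re.sub(r"\\[ \t]*\n", " ", text)
    for line in joined.splitlines():
        tokens = line.split()
        if not tokens or tokens[0].startswith("#"):
            continue
        instruction = tokens[0].upper()
        if instruction == "RUN":
            if "mkdir" in tokens:
                # Remember absolute paths named in a mkdir command.
                created_dirs.update(t.rstrip("/") for t in tokens if t.startswith("/"))
            uses_apt = any(t in ("apt-get", "apt") for t in tokens)
            if uses_apt and "install" in tokens:
                for tok in tokens:
                    if APT_PIN.match(tok):
                        warnings.append(f"pinned apt version: {tok}")
        elif instruction == "COPY":
            dest = tokens[-1]
            parent = posixpath.dirname(dest.rstrip("/"))
            if parent not in ("", "/") and parent not in created_dirs:
                warnings.append(f"COPY destination parent may not exist: {parent}")
    return warnings


if __name__ == "__main__":
    sample = "FROM ubuntu:24.04\nRUN apt-get install -y curl=8.5.0-2ubuntu10.6\nCOPY x /a/b/c\n"
    for warning in lint_dockerfile(sample):
        print(warning)
```

Treat a clean result as a hint rather than a guarantee; validating the build on a single task remains the reliable check.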
We’re working on closing these gaps so OCI build becomes a safe default for arbitrary Dockerfiles. Until then, validate the build on a single task first, then turn it on for the full sweep.

To force a fresh build even when the cached image is current, add --force-build:
harbor run --env tensorlake \
  --ek use_oci_image_build=true \
  --force-build \
  --dataset terminal-bench@2.0 \
  --agent claude-code \
  --model anthropic/claude-sonnet-4-6
--force-build is a one-shot bypass for that run only. It does not refresh the canonical content-hashed cache — subsequent normal runs keep using whatever image they would have used otherwise.

Snapshot restore

Build the environment once, snapshot it, then point every later trial at the snapshot:
harbor run --env tensorlake \
  --ek snapshot_id=snap_abc123 \
  --dataset terminal-bench@2.0 \
  --agent claude-code \
  --model anthropic/claude-sonnet-4-6
Snapshot restore skips both Dockerfile replay and the OCI build path. See Snapshots for how to build and manage them.
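
The three boot modes can be read as a simple decision rule. The sketch below is an interpretation, not Harbor's code: choose_boot_mode and its build_image callback are hypothetical, and it assumes a snapshot takes precedence when both snapshot_id and use_oci_image_build are supplied (this guide states only that snapshot restore skips the other two paths).

```python
from typing import Callable, Optional


def choose_boot_mode(
    snapshot_id: Optional[str],
    use_oci_image_build: bool,
    build_image: Callable[[], str],
) -> str:
    """Pick a boot path: snapshot first, then OCI build, else legacy replay."""
    if snapshot_id:
        # Assumed precedence: a snapshot bypasses both other paths.
        return f"snapshot:{snapshot_id}"
    if use_oci_image_build:
        try:
            return f"oci:{build_image()}"
        except Exception as exc:
            # Mirrors the documented behaviour: a failed build falls back to replay.
            print(f"warning: OCI build failed ({exc}); falling back to legacy replay")
    return "legacy-replay"


if __name__ == "__main__":
    print(choose_boot_mode(None, True, lambda: "harbor-task-<hash>"))
```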

Ad-hoc native dependencies

If a task just needs a couple of extra apt packages and you don’t want to edit the Dockerfile or maintain a snapshot, use preinstall_packages:
harbor run --env tensorlake \
  --ek 'preinstall_packages=["build-essential","rustc","cargo"]' \
  --dataset terminal-bench@2.0 \
  --agent claude-code \
  --model anthropic/claude-sonnet-4-6
The packages are installed at the start of each trial. Prefer snapshots when the package set is large or reused across many runs so you pay the install cost once.
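
The value of preinstall_packages in the command above is a JSON list, so it is easy to get the shell quoting wrong when the package set comes from a script. A small sketch, with preinstall_flag as a hypothetical helper name:

```python
import json
import shlex


def preinstall_flag(packages: list) -> list:
    """Build the --ek argument pair for a list of apt package names."""
    # Compact separators reproduce the no-space JSON form used in this guide.
    value = json.dumps(packages, separators=(",", ":"))
    return ["--ek", f"preinstall_packages={value}"]


if __name__ == "__main__":
    argv = preinstall_flag(["build-essential", "rustc", "cargo"])
    # shlex.join adds the single quotes a shell needs around the JSON value.
    print(shlex.join(argv))
```

Passing the argv list directly to subprocess.run avoids shell quoting altogether; shlex.join is only needed when you want a command line to paste into a terminal.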

Interactive debugging

When a trial fails and you want to poke around the live environment, attach to the session:
harbor env attach <session_id>
This drops you directly into the running sandbox, where you can inspect state, rerun tests by hand, and confirm whether the failure came from the agent or the environment.

Structured logs

Each trial produces structured artifacts, e.g.:
gcode-to-text__UFALMLv
├── agent/
├── verifier/
├── result.json
└── trial.log
These artifacts let you trace:
  • The agent’s actions and outputs
  • What the verifier checked
  • Why the trial passed or failed
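
When triaging many trials, a quick inventory of each trial directory shows which artifacts are present before you open individual logs. The sketch below assumes only the layout shown above; summarize_trial is a hypothetical helper, and because the schema of result.json is not described in this guide, it reports only the top-level keys it finds.

```python
import json
import tempfile
from pathlib import Path


def summarize_trial(trial_dir: Path) -> dict:
    """Report which expected artifacts exist and list result.json's top-level keys."""
    result_path = trial_dir / "result.json"
    result = json.loads(result_path.read_text()) if result_path.is_file() else None
    return {
        "trial": trial_dir.name,
        "has_agent_logs": (trial_dir / "agent").is_dir(),
        "has_verifier_logs": (trial_dir / "verifier").is_dir(),
        "has_trial_log": (trial_dir / "trial.log").is_file(),
        "result_keys": sorted(result) if isinstance(result, dict) else [],
    }


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        trial = Path(tmp) / "gcode-to-text__UFALMLv"
        (trial / "agent").mkdir(parents=True)
        (trial / "verifier").mkdir()
        (trial / "trial.log").write_text("")
        # Placeholder content: the real result.json fields are not documented here.
        (trial / "result.json").write_text(json.dumps({"placeholder": 1}))
        print(summarize_trial(trial))
```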

What to build next

Snapshots

Build an environment once, snapshot it, and restore it in under a second for every trial.

Reproducible RL Environments

Use sandboxes as a deterministic reward oracle for RL training loops.