Add CoreWeave Sandbox and W&B environment support#1698

Merged
alexgshaw merged 9 commits into
harbor-framework:mainfrom
matthoare117-wandb:hoare-cw/wandb
May 27, 2026
Conversation

@matthoare117-wandb matthoare117-wandb commented May 22, 2026

Contributor

Summary

Adds CoreWeave Sandboxes and W&B Sandboxes as Harbor cloud execution environments.
This PR includes:

  • cwsandbox environment support using the CoreWeave Sandbox SDK
  • wandb environment support using wandb.sandbox
  • Environment factory registration and EnvironmentType entries for both
  • Unit coverage for lifecycle, env propagation, file transfer, command execution, cleanup, and W&B secret handling
  • Docs update listing CoreWeave Sandboxes and W&B Sandboxes as cloud sandbox options

Implementation Notes

cwsandbox is the CoreWeave Sandbox-backed Harbor environment. It handles startup/stop, exec, file transfer, resource/env mapping, and cleanup through the CoreWeave SDK.
wandb is the W&B Sandbox-backed Harbor environment. It reuses the cwsandbox implementation, but uses wandb.sandbox auth and W&B sandbox secret handling.

Validation

Ran the full terminal-bench/terminal-bench-2-1 suite with n_attempts: 3 on both cwsandbox and wandb.
Screenshots/results below:

CWSandbox:
[image: CWSandbox run results]

Wandb Sandbox:
[image: screenshot taken 2026-05-22, 2:49:47 PM]

vercel Bot commented May 22, 2026

@matthoare117-wandb is attempting to deploy a commit to the Harbor Framework Team on Vercel.

A member of the Team first needs to authorize it.

@matthoare117-wandb matthoare117-wandb marked this pull request as ready for review May 23, 2026 02:13
@alexgshaw alexgshaw merged commit f99317c into harbor-framework:main May 27, 2026
6 checks passed
AdamGold added a commit to islo-labs/harbor-fork that referenced this pull request Jun 7, 2026
* fix(terminus-2): reset per-run state and attribute step exceptions in multi-step trials (#1566)

Multi-step support added in PR #1234 made the trial layer call agent.run()
once per step but did not update Terminus2, which stores per-trial state
on the instance. Three categories of bugs result:

1. Trajectory step IDs are non-sequential.
   The initial-prompt Step appends with step_id=1 hardcoded, but
   _trajectory_steps persists across run() calls. After step 2 we get
   [1,2,3,1,2,3,...] which fails Pydantic validation in
   _dump_trajectory(): all terminus-2 multi-step trials fail.

2. Per-run state accumulators leak across steps. _api_request_times,
   _trajectory_steps, _subagent_metrics, _subagent_rollout_details,
   _summarization_count, _session_id, _pending_completion,
   _pending_subagent_refs, _pending_handoff_prompt, _timestamped_markers
   are all written but never reset. Concrete consequences:
     - All step_results' metadata.api_request_times_msec reference the
       same growing list (Python aliasing) -> per-step latency
       tracking unusable.
     - Step N's trajectory.json contains all of steps 1..N (quadratic
       disk usage, downstream consumers see duplicated content).
     - All per-step trajectory.json files share one session_id.
     - If summarization fires in step 1, every later step's reported
       n_input_tokens / cost_usd is inflated by step 1's summarization
       cost.

3. Trial._execute_step_agent only catches asyncio.TimeoutError and
   NonZeroAgentExitCodeError. Any other exception (LLM errors, network
   errors, validation errors, anything from a subprocess agent) bubbles
   to trial-level. step_result.exception_info stays None on the failing
   step and remaining steps are silently aborted.

Fix:
  - Add Terminus2._reset_per_run_state(), called at the top of run().
    Clears all per-trial accumulators. A user-provided session_id (kwarg)
    is preserved via a new _user_provided_session_id attribute.
  - Widen Trial._execute_step_agent's except to Exception, matching the
    sibling _verify_step (line 603) and the caller of _run_step_setup
    (line 638). The explicit abort at trial.py:673
    (`if exception_info and not verifier_result: break`) still fires
    when needed; the trial smartly continues if the verifier still
    produced a result.
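
The aliasing problem and the reset can be illustrated with a minimal sketch. `Agent` is a hypothetical stand-in, not Terminus2 itself; only the names `_api_request_times` and `_reset_per_run_state` come from the description above, and the real method clears many more fields.

```python
class Agent:
    """Hypothetical stand-in for an agent that keeps per-run state on the instance."""

    def __init__(self, reset_each_run: bool) -> None:
        self._reset_each_run = reset_each_run
        self._api_request_times: list[float] = []

    def _reset_per_run_state(self) -> None:
        # Rebind to a fresh list so results from earlier runs keep their own object.
        self._api_request_times = []

    def run(self, latency_ms: float) -> list[float]:
        if self._reset_each_run:
            self._reset_per_run_state()
        self._api_request_times.append(latency_ms)
        # The caller stores this reference in its per-step metadata.
        return self._api_request_times


# Without a reset, both steps end up holding the same growing list.
leaky = Agent(reset_each_run=False)
leaky_steps = [leaky.run(10.0), leaky.run(20.0)]

# With a reset at the top of run(), each step gets its own list.
fixed = Agent(reset_each_run=True)
fixed_steps = [fixed.run(10.0), fixed.run(20.0)]
```

In this sketch the reset rebinds the attribute instead of calling `.clear()`; clearing in place would also empty the list that an earlier step result still references.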

Verified against a 2-step task: 1/1 trial, mean reward 1.0, 0 exceptions,
distinct session ids per step, distinct api_request_times_msec per step.
Verified against a step-1-timeout-step-2-recovers task: step 1 records
TimeoutError, step 2 still runs with fully isolated state, trial reward
0.5 (mean of 0 + 1.0).

Co-authored-by: Cursor <cursoragent@cursor.com>

* fix(islo): drop redundant compose overlay (broken by merge skew with #1599) (#1639)

PR #1559 (docker-compose support) introduced `_write_ca_overlay`,
which bind-mounted the VM's CA bundle into the `main` service and
set NODE_EXTRA_CA_CERTS / SSL_CERT_FILE / REQUESTS_CA_BUNDLE.
PR #1599 (merged 2 minutes earlier) had just removed the
`_VM_CA_BUNDLE` constant and the equivalent `docker run -v` mount,
because the redundant CA mount caused `dpkg` to fail installing
`ca-certificates` inside the container — the runner image already
trusts the gateway's MITM certs via its base CA store.

Neither PR rebased on the other. Upstream main currently references
`_VM_CA_BUNDLE` at 4 call sites inside `_write_ca_overlay` with no
matching definition. The module imports (Python late-binds names
in function bodies) but compose-mode tasks crash with
`NameError: name '_VM_CA_BUNDLE' is not defined` the moment a
sandbox starts.
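
The late-binding behaviour is easy to reproduce in isolation. The sketch below uses hypothetical names (`write_overlay`, `MISSING_CONSTANT`) and is not the islo code; it only shows why the module imported cleanly while the call failed.

```python
def write_overlay() -> str:
    # The global name is looked up when this line executes, not when the
    # function is defined, so the definition itself raises nothing.
    return MISSING_CONSTANT  # noqa: F821 - intentionally undefined


# Defining (or importing) the function succeeds; only calling it fails.
try:
    write_overlay()
    error_message = None
except NameError as exc:
    error_message = str(exc)
```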

Fix: drop the provider-side overlay entirely. Removed:

- `_write_ca_overlay` method and its caller in `_start_compose`
- `_COMPOSE_CA_OVERLAY_NAME` constant
- the `-f` flag for the overlay in `_compose_file_flags`
- the two overlay unit tests and the overlay assertion at
  test_islo.py:1280

Daytona's DinD compose path (daytona.py:461) already works
without any provider-side overlay — tasks declare their own
locale + env in their compose/Dockerfile. Matching that contract
on islo as well. Added a regression test
(`TestComposeFileFlagsHasNoProviderOverlay`) that asserts no
`docker-compose-islo-*` path is injected into the `-f` flags.

Verified end-to-end against api.islo.dev with the oracle agent on
examples/tasks/hello-mcp (compose-mode): build + compose-up +
verifier complete cleanly, reward 1.0.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(tensorlake): preserve env state on snapshot restore (#1637)

* snapshot fixes

* fix

* fixed

* [Ready for Review] Update GDB adapter dependency and invocation (#1527)

* Update GDB adapter dependency and invocation

Pin the adapter to lica-gdb 0.2.1 and remove the adapter's conflicting gdb console script so generation uses the explicit module entry point.

Made-with: Cursor

* Update GDB registry dataset docs

Made-with: Cursor

* Update GDB parity review links

Made-with: Cursor

* Add GDB adapter CLI alias

Made-with: Cursor

* Add separate verifier environments (#1655)

* Add separate verifier environments

* Add separate verifier changelog and compose env compatibility

* Handle verifier artifact staging collisions

* minor updates.

* Minor fixes.

* Update skills. Add blog post.

* v0.7.0

* Remove internal trial timeout retries (#1628)

* Fix task.toml writing.

* Fix task.toml writing.

* Add Novita environment support to Harbor (#1025)

* Add Novita environment support to Harbor

- Introduced NovitaEnvironment class for integration with Novita's cloud sandbox service.
- Implemented end-to-end and unit tests for NovitaEnvironment functionality.

* Fix CI failures: type errors, lint, and pytest collection crash

- Add type: ignore comments for novita_sandbox SDK type issues
- Move sys.exit() guard into __main__ block so pytest collection doesn't crash
- Add template reuse test phase to e2e integration test

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Fix COPY instruction parsing and timeout_sec=0 handling

- Skip COPY --from=... instructions (multi-stage builds)
- Filter out COPY flags (--chown, --chmod) before extracting source path
- Use explicit None check for timeout_sec to allow timeout_sec=0
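
A minimal sketch of why the explicit `None` check matters for `timeout_sec=0`. The function names and the default value are hypothetical and chosen only for illustration.

```python
from typing import Optional

DEFAULT_TIMEOUT_SEC = 60  # hypothetical default, for illustration only


def resolve_timeout_truthy(timeout_sec: Optional[int]) -> int:
    # Buggy: 0 is falsy, so an explicit 0 is silently replaced by the default.
    return timeout_sec or DEFAULT_TIMEOUT_SEC


def resolve_timeout_explicit(timeout_sec: Optional[int]) -> int:
    # Only a missing value falls back to the default; 0 is passed through.
    return DEFAULT_TIMEOUT_SEC if timeout_sec is None else timeout_sec
```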

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Address Devin review: internet flag, default timeout, multi-source COPY

- Set can_disable_internet to False (not yet supported by Novita SDK)
- Change default exec timeout from 60s to 0 (no timeout), matching e2b
- Handle multi-source COPY instructions (COPY a.py b.py /dest/)
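
The COPY rules listed in this commit and the previous one (skip `--from=...`, ignore flags such as `--chown`, accept several sources) could be sketched roughly as follows. `copy_sources` is a hypothetical helper, not the function in novita.py, and it handles only the shell form of COPY.

```python
import shlex


def copy_sources(instruction_args: str) -> list[str]:
    """Return the local source paths from the arguments of a COPY instruction.

    Simplified: the JSON-array form of COPY is not handled.
    """
    tokens = shlex.split(instruction_args)
    flags = [t for t in tokens if t.startswith("--")]
    if any(f.startswith("--from=") for f in flags):
        # Copies from another build stage, so there is nothing local to upload.
        return []
    paths = [t for t in tokens if not t.startswith("--")]
    # The last path is the destination; everything before it is a source.
    return paths[:-1]
```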

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Fix Windows path separator in upload_dir remote paths

Use PurePosixPath for remote sandbox paths to ensure forward slashes
on all platforms.
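
As an illustration of the idea, the sketch below joins a Windows-style relative path onto a remote directory using `PurePosixPath`. `remote_path` is a hypothetical helper; the actual upload_dir code may build paths differently.

```python
from pathlib import PurePosixPath, PureWindowsPath


def remote_path(remote_dir: str, relative: PureWindowsPath) -> str:
    # Joining the individual parts avoids carrying over backslash separators.
    return str(PurePosixPath(remote_dir).joinpath(*relative.parts))


result = remote_path("/workspace", PureWindowsPath(r"src\pkg\main.py"))
```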

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Change default exec timeout from 0 to 300s

The novita_sandbox SDK defaults to 60s internally when 0 is passed.
Use 300s (5 minutes) to avoid premature termination of long-running
agent and verifier commands.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Fix build error log index and defer API base URL resolution

- Use logs[-1] instead of logs[-2] for build failure error message
- Move NOVITA_BASE_URL lookup from class definition to __init__,
  consistent with NOVITA_API_KEY handling

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Handle null logs in build failure error reporting

Use `status.get("logs") or []` instead of `status.get("logs", [])`
to handle API returning `"logs": null`.
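
The difference between the two forms shows up only when the key is present with a null value. The payload below is a hypothetical example, not a recorded API response.

```python
status = {"state": "failed", "logs": None}  # hypothetical API payload

# The key exists, so the default argument is ignored and None comes back.
with_default = status.get("logs", [])

# None is falsy, so the fallback list is used instead.
with_or = status.get("logs") or []
```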

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Wrap _http_client.aclose() in try/except in stop()

Prevent transport-level errors during HTTP client cleanup from
propagating out of stop() and masking the trial outcome.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Preserve sandbox when delete=False for debugging

When stop(delete=False) is called, skip killing the sandbox and closing
the HTTP client so the sandbox remains running for debugging purposes.
This aligns with how other environments (e.g. GKE) handle the delete flag.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* novita: use alias endpoint for template lookup and fix stale alias recovery

- Replace _api_list_templates + iteration with direct GET /templates/aliases/{alias}
  endpoint for O(1) template lookup instead of scanning all templates
- Add stale alias recovery in _api_create_template: on 403 "Alias already used",
  look up the stale template via alias endpoint, delete it, then retry creation
- Include API key suffix in template alias to avoid cross-account conflicts
- Increase build timeout from 600s to 1200s for heavy Dockerfiles
- Add _MIN_MEMORY_MB_PER_CPU constant (512 MB/CPU)
- Update tests to cover new alias endpoint behavior (44 tests passing)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* novita: auto-recover from stale cached templates on sandbox creation

When _find_template_by_alias returns a template ID that no longer exists
in the backend (alias registered but build failed/incomplete), AsyncSandbox
would raise a SandboxException("404: template not found"). Now start()
catches this case, deletes the stale template via REST API, and triggers
a fresh build before retrying sandbox creation.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* novita: include last 5 log lines in build failure error message

Previously only the last log line was shown, which was often just
"Postprocessing finished. Cleaning up..." instead of the actual error.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* feat(novita): upload COPY files via S3 pre-signed URL to fix 413 errors

* chore: update parity_summary.csv [skip ci]

* Fix review issues and CI failures in Novita environment

- Add _merge_env(env) call in exec() so persistent env vars (--ae flags,
  task [environment.env] config) are correctly forwarded to sandbox commands
- Add user parameter to exec(), is_dir(), is_file() to match BaseEnvironment
  interface (fixes type-check invalid-method-override errors)
- Close HTTP client in stop(delete=False) to prevent resource leak; update
  test to assert aclose is called
- Fix uv.lock: missing [[package]] header before networkx entry caused TOML
  parse errors that broke all CI checks; regenerate lockfile cleanly

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>

* Fix exec() to respect user parameter via _resolve_user

The user parameter was accepted but never used — all commands ran as
root. Now calls _resolve_user(user) to honour the orchestrator-set
default_user (e.g. task agent.user / verifier.user from task.toml).

Novita SDK's user parameter is Literal["root", "user"], so map any
non-root resolved user to "user"; add Literal import accordingly.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>

* Add preflight() and chmod 777 on log dirs in Novita environment

- Add preflight() classmethod to validate NOVITA_API_KEY before any
  trials are queued, giving immediate feedback instead of failing mid-job
- chmod 777 agent/verifier log directories after creation in start() so
  non-root agent/verifier users can write reward files and logs
- Update start() test mocks to handle both foreground (healthcheck) and
  background (exec) sandbox.commands.run call patterns

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>

* style: ruff format test_novita.py

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>

* Fix template name slash escaping and cwd quoting in exec

- Replace '/' with '__' in template alias construction so org/name task
  names (e.g. harbor/hello-world) don't break REST API URL paths
- Use shlex.quote(effective_cwd) in exec() to handle paths with spaces
  or shell metacharacters safely
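
Both fixes can be sketched in a few lines. The helper names are hypothetical, the real alias also carries other components (such as the API key suffix mentioned earlier), and the `cd ... &&` wrapper is an assumption about how the working directory is applied.

```python
import shlex


def template_alias(task_name: str) -> str:
    # "/" would be read as a URL path separator, so replace it.
    return task_name.replace("/", "__")


def in_directory(command: str, cwd: str) -> str:
    # Quote the directory so spaces and shell metacharacters stay literal.
    return f"cd {shlex.quote(cwd)} && {command}"
```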

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>

* Use timeout=0 (no limit) as default in exec, aligning with E2B

timeout_sec or 0 matches E2B and the Novita SDK docs where 0 means
no connection time limit, avoiding premature 300s cutoffs on long-running
agent setup or verifier scripts.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>

* Update src/harbor/environments/novita.py

Co-authored-by: devin-ai-integration[bot] <158243242+devin-ai-integration[bot]@users.noreply.github.com>

* fix: deal with build conflict error and enhance Dockerfile handling in NovitaEnvironment

* refactor: move novita-sandbox to optional extra, matching other cloud providers

- Move `novita-sandbox` from main deps to `[novita]` optional extra
- Add `dockerfile-parse` to `novita` extra (was only in `e2b`, but novita.py needs it)
- Include `harbor[novita]` in the `cloud` bundle
- Wrap SDK imports in try/except with `_HAS_NOVITA` flag, following the same
  lazy-import pattern introduced for daytona/e2b/modal in the upstream refactor
- Raise `MissingExtraError` in `preflight()` when novita-sandbox is not installed
- Regenerate uv.lock

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>

* fix: add _HAS_NOVITA guard in __init__ for clear MissingExtraError

Without this guard, instantiating NovitaEnvironment when novita-sandbox
is not installed raises a raw NameError (on DockerfileParser) instead of
a helpful MissingExtraError with install instructions. Follows the same
pattern as E2BEnvironment and RunloopEnvironment.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>

* Update src/harbor/environments/novita.py

Co-authored-by: devin-ai-integration[bot] <158243242+devin-ai-integration[bot]@users.noreply.github.com>

* Update src/harbor/environments/novita.py

Co-authored-by: devin-ai-integration[bot] <158243242+devin-ai-integration[bot]@users.noreply.github.com>

* fix: import EnvironmentCapabilities in Novita environment

Add the missing capabilities import after migrating NovitaEnvironment to the new capabilities API so ruff and ty can resolve the type.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* fix: update Novita capability tests

Update Novita environment tests to assert the new capabilities API after migrating away from deprecated properties.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* fix: fix file upload endpoint

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Co-authored-by: devin-ai-integration[bot] <158243242+devin-ai-integration[bot]@users.noreply.github.com>

* Minor fixes for ruff.

* Minor fixes for type check (#1665)

* Simplify trial flow (#1672)

* Refactor trial execution by shape

* Clean up trial helper typing

* Skip Windows container hello world without Docker

* Partial refactor.

* Improve artifact handler.

* Minor multi step fixes.

* Make artifact handler paths operation scoped

* Fix CI after trial flow cleanup

* Keep download dir excludes explicit

* Rename download dir exclusions helper

* Address artifact exclusion review comments

* Avoid duplicate single-step artifact recovery

* Avoid double stop after cancellation

---------

Co-authored-by: gabeorlanski <gabeorlanski@gmail.com>

* fix(terminus-2): make tmux send-keys dash-proof and improve send-keys error messages (#1657)

- _tmux_send_keys: append `--` end-of-options marker to the
  `tmux send-keys -t <session>` prefix so keys beginning with `-`
  (e.g. `-x`, `-Lfoo`) are treated as literal key arguments rather
  than being parsed as tmux options.
- _send_blocking_keys / _send_non_blocking_keys: include `command`
  (truncated to 100 chars), `return_code`, `stderr`, and `stdout` in
  the raised RuntimeError to make intermittent send-keys failures
  easier to diagnose from logs.
- tests: update _extract_send_keys_payload helper for the new `--`
  separator and add coverage for keys starting with `-` and for the
  enriched failure messages.
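
The effect of the `--` marker can be sketched as an argument list. `send_keys_command` is a hypothetical helper; the real `_tmux_send_keys` may assemble a shell string and quote its arguments differently.

```python
def send_keys_command(session: str, keys: list[str]) -> list[str]:
    # "--" ends option parsing, so a key such as "-x" is passed through
    # as a literal key argument rather than being parsed as an option.
    return ["tmux", "send-keys", "-t", session, "--", *keys]
```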

Co-authored-by: Cursor <cursoragent@cursor.com>

* [codex] add repeatable skill inputs (#1674)

* add repeatable skill inputs

* Register injected skills for Cursor CLI

* Use Cursor native skills directory

* Simplify skill resolution

* Make injected skills readable by agents

* Address skill input review comments

* Reject relative task skills dir for injected skills

* Add skills CLI alias

* Rename injected skill config to skills

* Add runtime skills job example

* Trim runtime skills example config

* [codex] add repeatable extra docker compose overlays (#1676)

* add repeatable extra docker compose overlays

* preserve modal compose build markers

* preserve cloud compose file precedence

* Guard extra compose by environment capability

* Rename extra compose config paths

* Revert "Rename extra compose config paths"

This reverts commit 5c531c6d5a7117d6e1fdf9d58e01a8e088dd002e.

* Add extra compose job example

* Address extra compose example comments

* Nest extra compose job example

* Fix skills merge.

* [codex] Add runtime MCP config support (#1675)

* Add runtime MCP config support

* Use extra compose overlay for MCP proof example

* Remove MCP proof example volume

* Use Python base image in MCP proof task

* Document MCP proof compose context

* Trim MCP proof job defaults

* Embed MCP proof runtime config

* [codex] Add extra instruction path support (#1682)

* feat: add support for --extra-instruction-paths

* Add extra instruction path support

* Fix lock equality env serialization

* Fix lock equality for digest-backed paths

---------

Co-authored-by: ZHAO Jin-Xiang <xiaoxiangmoe@gmail.com>

* v0.7.1

* fix(terminus): use UTF-8 byte length for tmux send-keys size checks (#1680)
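
Character count and UTF-8 byte length diverge as soon as the text contains non-ASCII characters, which is why a size check based on `len(text)` can undercount. A minimal sketch, with a hypothetical `fits_in_bytes` helper and an arbitrary limit:

```python
def fits_in_bytes(text: str, max_bytes: int) -> bool:
    # Measure the encoded size, not the number of code points.
    return len(text.encode("utf-8")) <= max_bytes


text = "h\u00e9llo\u2192"  # "héllo→": one 2-byte and one 3-byte character
char_count = len(text)
byte_count = len(text.encode("utf-8"))
```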

* Update reward output documentation (#1684)

Update based on change in #1620

* Add minimal verifier extension hook (#1653)

* Add minimal verifier extension hook

Add a small verifier factory hook that allows jobs to provide an optional custom verifier by import path while keeping the existing task verification flow as the default.

This enables job-specific verification to supplement task-specific checks. For example, a job can attach generic trajectory evaluators, policy checks, or run-level scoring logic across many tasks without rebuilding, copying, or modifying those task definitions.

The hook keeps task authorship and job evaluation concerns separate: tasks continue to define their normal verification, and jobs can opt into additional verifier behavior only when needed.

Default behavior is unchanged when no custom verifier is configured.

Signed-off-by: Anuradha Karuppiah <26330987+AnuradhaKaruppiah@users.noreply.github.com>

* Tighten verifier extension contract

Introduce BaseVerifier and VerifierContext so custom verifiers receive a stable construction context while the built-in verifier keeps legacy kwargs compatibility.

Require verifier outputs to be VerifierResult before assigning them to trial results, preserving Harbor aggregation semantics for built-in and imported verifiers. Keep legacy import-path constructors working through an adapter that enforces the return contract.

Signed-off-by: Anuradha Karuppiah <26330987+AnuradhaKaruppiah@users.noreply.github.com>

* Reject unused verifier kwargs

Fail fast when verifier kwargs are provided without a verifier import path, since the built-in verifier does not consume arbitrary extension kwargs.

This makes CLI/config mistakes visible instead of silently dropping values like --verifier-kwarg foo=bar.
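
The fail-fast check could look roughly like the sketch below. The function name, parameter names, exception type, and message are assumptions for illustration, not Harbor's actual validation code.

```python
from typing import Optional


def validate_verifier_config(import_path: Optional[str], kwargs: dict) -> None:
    # kwargs are only meaningful for a custom verifier loaded by import path.
    if kwargs and import_path is None:
        raise ValueError(
            "verifier kwargs were provided without a verifier import path"
        )
```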

Signed-off-by: Anuradha Karuppiah <26330987+AnuradhaKaruppiah@users.noreply.github.com>

* Fix verifier factory test patch

Update Windows multi-step verifier tests to patch VerifierFactory.create_verifier_from_config after trial verification moved behind the factory hook.

Signed-off-by: Anuradha Karuppiah <26330987+AnuradhaKaruppiah@users.noreply.github.com>

* Simplify verifier extension constructor

* Simplify verifier factory contract

* Fix skills merge example config paths

---------

Signed-off-by: Anuradha Karuppiah <26330987+AnuradhaKaruppiah@users.noreply.github.com>
Co-authored-by: Alex Shaw <alexgshaw64@gmail.com>

* Minor improvements.

* fix: fail opencode runs on error events (#1658)

* Update Novita to latest SDK build flow (#1688)

* Add Novita environment support to Harbor

- Introduced NovitaEnvironment class for integration with Novita's cloud sandbox service.
- Implemented end-to-end and unit tests for NovitaEnvironment functionality.

* Fix CI failures: type errors, lint, and pytest collection crash

- Add type: ignore comments for novita_sandbox SDK type issues
- Move sys.exit() guard into __main__ block so pytest collection doesn't crash
- Add template reuse test phase to e2e integration test

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Fix COPY instruction parsing and timeout_sec=0 handling

- Skip COPY --from=... instructions (multi-stage builds)
- Filter out COPY flags (--chown, --chmod) before extracting source path
- Use explicit None check for timeout_sec to allow timeout_sec=0

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Address Devin review: internet flag, default timeout, multi-source COPY

- Set can_disable_internet to False (not yet supported by Novita SDK)
- Change default exec timeout from 60s to 0 (no timeout), matching e2b
- Handle multi-source COPY instructions (COPY a.py b.py /dest/)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Fix Windows path separator in upload_dir remote paths

Use PurePosixPath for remote sandbox paths to ensure forward slashes
on all platforms.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Change default exec timeout from 0 to 300s

The novita_sandbox SDK defaults to 60s internally when 0 is passed.
Use 300s (5 minutes) to avoid premature termination of long-running
agent and verifier commands.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Fix build error log index and defer API base URL resolution

- Use logs[-1] instead of logs[-2] for build failure error message
- Move NOVITA_BASE_URL lookup from class definition to __init__,
  consistent with NOVITA_API_KEY handling

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Handle null logs in build failure error reporting

Use `status.get("logs") or []` instead of `status.get("logs", [])`
to handle API returning `"logs": null`.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Wrap _http_client.aclose() in try/except in stop()

Prevent transport-level errors during HTTP client cleanup from
propagating out of stop() and masking the trial outcome.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Preserve sandbox when delete=False for debugging

When stop(delete=False) is called, skip killing the sandbox and closing
the HTTP client so the sandbox remains running for debugging purposes.
This aligns with how other environments (e.g. GKE) handle the delete flag.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* novita: use alias endpoint for template lookup and fix stale alias recovery

- Replace _api_list_templates + iteration with direct GET /templates/aliases/{alias}
  endpoint for O(1) template lookup instead of scanning all templates
- Add stale alias recovery in _api_create_template: on 403 "Alias already used",
  look up the stale template via alias endpoint, delete it, then retry creation
- Include API key suffix in template alias to avoid cross-account conflicts
- Increase build timeout from 600s to 1200s for heavy Dockerfiles
- Add _MIN_MEMORY_MB_PER_CPU constant (512 MB/CPU)
- Update tests to cover new alias endpoint behavior (44 tests passing)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* novita: auto-recover from stale cached templates on sandbox creation

When _find_template_by_alias returns a template ID that no longer exists
in the backend (alias registered but build failed/incomplete), AsyncSandbox
would raise a SandboxException("404: template not found"). Now start()
catches this case, deletes the stale template via REST API, and triggers
a fresh build before retrying sandbox creation.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* novita: include last 5 log lines in build failure error message

Previously only the last log line was shown, which was often just
"Postprocessing finished. Cleaning up..." instead of the actual error.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* feat(novita): upload COPY files via S3 pre-signed URL to fix 413 errors

* chore: update parity_summary.csv [skip ci]

* Fix review issues and CI failures in Novita environment

- Add _merge_env(env) call in exec() so persistent env vars (--ae flags,
  task [environment.env] config) are correctly forwarded to sandbox commands
- Add user parameter to exec(), is_dir(), is_file() to match BaseEnvironment
  interface (fixes type-check invalid-method-override errors)
- Close HTTP client in stop(delete=False) to prevent resource leak; update
  test to assert aclose is called
- Fix uv.lock: missing [[package]] header before networkx entry caused TOML
  parse errors that broke all CI checks; regenerate lockfile cleanly

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>

* Fix exec() to respect user parameter via _resolve_user

The user parameter was accepted but never used — all commands ran as
root. Now calls _resolve_user(user) to honour the orchestrator-set
default_user (e.g. task agent.user / verifier.user from task.toml).

Novita SDK's user parameter is Literal["root", "user"], so map any
non-root resolved user to "user"; add Literal import accordingly.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>

* Add preflight() and chmod 777 on log dirs in Novita environment

- Add preflight() classmethod to validate NOVITA_API_KEY before any
  trials are queued, giving immediate feedback instead of failing mid-job
- chmod 777 agent/verifier log directories after creation in start() so
  non-root agent/verifier users can write reward files and logs
- Update start() test mocks to handle both foreground (healthcheck) and
  background (exec) sandbox.commands.run call patterns

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>

* style: ruff format test_novita.py

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>

* Fix template name slash escaping and cwd quoting in exec

- Replace '/' with '__' in template alias construction so org/name task
  names (e.g. harbor/hello-world) don't break REST API URL paths
- Use shlex.quote(effective_cwd) in exec() to handle paths with spaces
  or shell metacharacters safely

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>

* Use timeout=0 (no limit) as default in exec, aligning with E2B

timeout_sec or 0 matches E2B and the Novita SDK docs where 0 means
no connection time limit, avoiding premature 300s cutoffs on long-running
agent setup or verifier scripts.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>

* Update src/harbor/environments/novita.py

Co-authored-by: devin-ai-integration[bot] <158243242+devin-ai-integration[bot]@users.noreply.github.com>

* fix: deal with build conflict error and enhance Dockerfile handling in NovitaEnvironment

* refactor: move novita-sandbox to optional extra, matching other cloud providers

- Move `novita-sandbox` from main deps to `[novita]` optional extra
- Add `dockerfile-parse` to `novita` extra (was only in `e2b`, but novita.py needs it)
- Include `harbor[novita]` in the `cloud` bundle
- Wrap SDK imports in try/except with `_HAS_NOVITA` flag, following the same
  lazy-import pattern introduced for daytona/e2b/modal in the upstream refactor
- Raise `MissingExtraError` in `preflight()` when novita-sandbox is not installed
- Regenerate uv.lock

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>

* fix: add _HAS_NOVITA guard in __init__ for clear MissingExtraError

Without this guard, instantiating NovitaEnvironment when novita-sandbox
is not installed raises a raw NameError (on DockerfileParser) instead of
a helpful MissingExtraError with install instructions. Follows the same
pattern as E2BEnvironment and RunloopEnvironment.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>

* Update src/harbor/environments/novita.py

Co-authored-by: devin-ai-integration[bot] <158243242+devin-ai-integration[bot]@users.noreply.github.com>

* Update src/harbor/environments/novita.py

Co-authored-by: devin-ai-integration[bot] <158243242+devin-ai-integration[bot]@users.noreply.github.com>

* fix: import EnvironmentCapabilities in Novita environment

Add the missing capabilities import after migrating NovitaEnvironment to the new capabilities API so ruff and ty can resolve the type.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* fix: update Novita capability tests

Update Novita environment tests to assert the new capabilities API after migrating away from deprecated properties.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* fix: fix file upload endpoint

* fix: integrate Novita SDK template builds

Use the Novita SDK template builder directly while preserving Harbor's Dockerfile COPY handling, and pin the alpha SDK version without enabling global prerelease resolution.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* fix: pin Novita sandbox domain

Use the regional Novita sandbox endpoint consistently so local domain overrides cannot route template operations to the wrong API host.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* fix: avoid Novita SDK import during test collection

Load Novita SDK modules only when the Novita environment actually needs them so pytest can collect E2B and Novita tests in the same process without duplicate protobuf descriptor registration.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Co-authored-by: devin-ai-integration[bot] <158243242+devin-ai-integration[bot]@users.noreply.github.com>

* Fix EnvironmentConfig deprecation warnings on default construction.

Migrate legacy memory/storage fields in a before validator instead of
Field(deprecated=...) plus an after validator, and reject conflicting
legacy and modern resource values.

Closes #1693

Co-authored-by: Cursor <cursoragent@cursor.com>

* Estimate cursor-cli cost from usage via LiteLLM

Cursor CLI stream-json reports token usage on result events but not
dollar cost. Parse optional totalCost when present and otherwise
estimate from per-category token counts using LiteLLM pricing.

Co-authored-by: Cursor <cursoragent@cursor.com>

* Add built-in pricing for Cursor Composer models in cursor-cli.

LiteLLM does not list cursor/composer models, so estimate cost from token
usage using Cursor's published rates before falling back to LiteLLM.
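
The lookup order described in the two cursor-cli commits above can be sketched roughly as follows. This is only an illustration: `estimate_cost`, `BUILTIN_RATES`, the `example/composer` model id, the usage keys, and the rates are all hypothetical placeholders, and `fallback` stands in for whatever LiteLLM-based estimator the agent actually calls.

```python
from typing import Callable, Mapping, Optional

# Hypothetical USD rates per one million tokens; NOT Cursor's published prices.
BUILTIN_RATES = {
    "example/composer": {"input": 1.0, "output": 2.0, "cache_read": 0.5},
}


def estimate_cost(
    model: str,
    usage: Mapping[str, int],
    total_cost: Optional[float] = None,
    fallback: Optional[Callable[[str, Mapping[str, int]], Optional[float]]] = None,
) -> Optional[float]:
    """Prefer a reported cost, then built-in rates, then a fallback estimator."""
    if total_cost is not None:
        return total_cost  # the stream already reported a dollar figure
    rates = BUILTIN_RATES.get(model)
    if rates is not None:
        # Sum tokens * rate per category, then scale from per-million pricing.
        return sum(usage.get(k, 0) * rate for k, rate in rates.items()) / 1_000_000
    # Assumption: the fallback plays the role of the LiteLLM pricing lookup.
    return fallback(model, usage) if fallback else None
```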

Co-authored-by: Cursor <cursoragent@cursor.com>

* [codex] Add resource enforcement policies (#1697)

* Add resource enforcement policies

* Preflight check.

* Fix CHANGELOG breaking changes for resource enforcement policies.

Document removed task resource defaults and stricter validation instead of incorrectly claiming --cpus/--memory repurposed numeric overrides.

Co-authored-by: Cursor <cursoragent@cursor.com>

---------

Co-authored-by: Cursor <cursoragent@cursor.com>

* v0.8.0

* Fix resource default test after provider-default change (#1701)

* fix tests on main

* chore: rerun CI

* Document job sharing (#1706)

* feat(viewer): add ←/→ trial navigation, ⌥+←/→ tab cycling, persistent tab across trials, and X/N position indicator on the trial page (#1705)

* docs(atif): refresh trajectory format page to v1.7 (#1704)

The trajectory format docs page still advertised ATIF-v1.4 as current and stopped its supported-versions list at v1.4, while the canonical RFC (rfcs/0001-trajectory-format.md) has been at v1.7 for several releases. Bump the example schema_version strings to ATIF-v1.7 and extend the Schema Versions section with v1.5, v1.6, and v1.7 entries summarized from the RFC's Version History.

No code changes; docs only.

* Add PR diff links workflow with manual dispatch. (#1716)

Post devinreview and diffshub links when PRs open, and allow testing on existing PRs via workflow_dispatch.

Co-authored-by: Cursor <cursoragent@cursor.com>

* feat: add Openclaw installed agent (#1661)

* feat: add openclaw installed agent

* Cleanup commit

* save full session turns

* NeMo-Flow Integration

* cleanup

* update defaults

* fix test for updated defaults

* Fix tests for new defaults

* Fix lint error

* Remove nemoflow from PR

Signed-off-by: Sam Oluwalana <soluwalana@nvidia.com>

* refactor(openclaw): generalize provider config normalization

Address review feedback: drop NVIDIA-specific code paths from the
OpenClaw plugin so it works generically across any OpenAI-compatible
provider.

- Replace `_merge_nvidia_base_url_from_env` and
  `_normalize_nvidia_models_provider` with provider-agnostic
  `_merge_provider_base_url_from_env` and
  `_normalize_provider_models_schema` that derive the provider from
  `--model` (e.g. `openai/gpt-4.1` -> `OPENAI_BASE_URL`).
- Remove the hardcoded NVIDIA default base URL; users select a
  custom provider via env or `openclaw_config`.
- Update class docstring to use `openai/*` as the generic example.
- Rewrite the NVIDIA-themed unit tests to cover the generic
  behavior with `openai/*`.
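
A minimal sketch of the provider derivation listed above, assuming the provider is the part of `--model` before the first slash and the env var is the uppercased provider plus `_BASE_URL`. The name `base_url_env_var` is hypothetical and is not the plugin's actual helper.

```python
def base_url_env_var(model: str) -> str:
    """Map a model id such as 'openai/gpt-4.1' to 'OPENAI_BASE_URL'."""
    provider, sep, _ = model.partition("/")
    if not sep or not provider:
        raise ValueError(f"expected '<provider>/<model>', got {model!r}")
    # Assumption: hyphens in a provider name become underscores in the env var.
    return provider.upper().replace("-", "_") + "_BASE_URL"
```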

The `nvidia` entry in the env-var forwarding switch is retained
alongside ~15 other providers (anthropic, openai, google, ...) as a
plain provider registry, since removing it would break existing
`nvidia/*` model selections.

Signed-off-by: Bryan Bednarski <bbednarski@nvidia.com>

* feature(api): multi-provider compatibility for openclaw

Signed-off-by: Bryan Bednarski <bbednarski@nvidia.com>

---------

Signed-off-by: Sam Oluwalana <soluwalana@nvidia.com>
Signed-off-by: Bryan Bednarski <bbednarski@nvidia.com>
Co-authored-by: Bryan Bednarski <bbednarski@nvidia.com>
Co-authored-by: Alex Shaw <alexgshaw64@gmail.com>

* Add GPU support to GKE environment (#1640)

* Add GPU support to GKE environment

* Address PR comments

- Early failure if an unsupported GPU type is provided
- Increase the timeout minutes to 20 when GPUs are selected
- Support direct gke-accelerator values as gpu_types

* Adjust GPU count retrieval to use _effective_gpus for consistency

* Paginate dataset metadata queries past Supabase row cap (#1719)

* Paginate dataset metadata queries past Supabase row cap.

Fixes harbor download and run truncating package datasets at 1,000 tasks.
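
For illustration, paging past a fixed row cap can look roughly like the sketch below. It assumes the cap is 1,000 rows per request and that the client can ask for an inclusive `[start, end]` row range. `fetch_all` and `fetch_page` are hypothetical names, not the registry client's API.

```python
from typing import Callable, Iterator

PAGE_SIZE = 1000  # assumed to match the server-side row cap


def fetch_all(
    fetch_page: Callable[[int, int], list],
    page_size: int = PAGE_SIZE,
) -> Iterator[dict]:
    """Yield every row by requesting inclusive ranges until a short page arrives."""
    start = 0
    while True:
        rows = fetch_page(start, start + page_size - 1)
        yield from rows
        if len(rows) < page_size:
            return  # a short (or empty) page means there is nothing left
        start += page_size
```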

Co-authored-by: Cursor <cursoragent@cursor.com>

* Format test_registry_db_client.py with ruff.

Co-authored-by: Cursor <cursoragent@cursor.com>

---------

Co-authored-by: Cursor <cursoragent@cursor.com>

* Add TPU support to harbor and GKE environment (#1652)

* Address PR comments

- Early failure if an unsupported GPU type is provided
- Increase the timeout minutes to 20 when GPUs are selected
- Support direct gke-accelerator values as gpu_types

* Adjust GPU count retrieval to use _effective_gpus for consistency

* Add TPU support to environment configuration

This change allows environments to properly support and validate TPU requirements, improving task execution flexibility.

* Add TPU support to GKE environment

This update introduces a mapping for TPU types, enhances the GKEEnvironment class to handle TPU configurations, and updates unit tests to validate TPU capabilities and configurations alongside existing GPU support.

* Update environment config model to use a dedicated class for TpuSpec

* Add new TPU config to docs

* Add --tpu_overrides to cli commands

* Validate mutual exclusion of GPU and TPU requests in GKE

* Fix merge conflicts

* Update TPU configuration to use a single TpuSpec

* Add Harbor Hub job result sharing blog post (#1732)

* Add Harbor Hub job result sharing blog post.

Co-authored-by: Cursor <cursoragent@cursor.com>

* Update job sharing blog title and landing page banner.

Co-authored-by: Cursor <cursoragent@cursor.com>

---------

Co-authored-by: Cursor <cursoragent@cursor.com>

* Add CoreWeave Sandbox and W&B environment support (#1698)

* cw sandbox

* doc fix

* Fix (Add resource enforcement policies)

* final fixes

* comment cleanup

* fix(cwsandbox): clean up backend sandbox on any failed start()

* feat (Tensorlake): build sandboxes from OCI images instead of per-trial Dockerfile replay (#1734)

* update tensorlake integration to use oci image build

* Guard fcntl import for Windows test collection in tensorlake env

* Add managing resources docs for task configuration. (#1735)

Centralize enforcement policy and resource field guidance in the tasks docs.

Co-authored-by: Cursor <cursoragent@cursor.com>

* [Ready For Review] Fix artifact transfer archive collisions (#1733)

* Fix artifact transfer archive collisions

* Log transfer cleanup failures as warnings

* Use RPC for task version resolution (#1736)

* Allow tasks with docker_image to omit environment/Dockerfile (#1729)

* Allow tasks with docker_image to omit environment/Dockerfile.

Centralize environment definition validation and workdir helpers across supported providers.

Co-authored-by: Cursor <cursoragent@cursor.com>

* Fix docker_image-only force_build and Runloop workdir default.

Use shared prebuilt-image selection when no Dockerfile exists, and restore /workspace fallback for Dockerfiles without WORKDIR.

Co-authored-by: Cursor <cursoragent@cursor.com>

* Apply prebuilt docker_image policy to all compose providers.

Use should_use_prebuilt_docker_image in Daytona, Modal, and Islo, and unify Docker validation.

Co-authored-by: Cursor <cursoragent@cursor.com>

* Fix lazy dockerfile_parse import and daytona formatting.

Move DockerfileParser import inside parse_dockerfile_workdir so core environments do not require the optional extra.

Co-authored-by: Alex Shaw <alexgshaw64@gmail.com>

* Add dockerfile-parse to runloop optional extra.

Runloop now uses parse_dockerfile_workdir for WORKDIR resolution when a Dockerfile is present.

Co-authored-by: Alex Shaw <alexgshaw64@gmail.com>

---------

Co-authored-by: Cursor <cursoragent@cursor.com>

* feat: Add native agent adapter for Google Antigravity CLI (agy) (#1699)

* feat: Add native agent adapter for Google Antigravity CLI (agy)

* fix: remove unused import

* fix: correctly configure agy settings.json and model

* fix: update test to match new EnvironmentConfig defaults

* fix: remove unused run_model variable

* style: run ruff format on agy.py

* refactor: rename agy agent to antigravity-cli

Use antigravity-cli as the Harbor agent identifier and AntigravityCli
adapter naming instead of agy.

Co-authored-by: Cursor <cursoragent@cursor.com>

* fix(antigravity-cli): use Path.write_text for ATIF export

Address Devin review feedback and align with AGENTS.md file I/O guidance.

Co-authored-by: Cursor <cursoragent@cursor.com>

---------

Co-authored-by: Alex Shaw <alexgshaw64@gmail.com>
Co-authored-by: Cursor <cursoragent@cursor.com>

* feat: Daytona auto-snapshot, transient error handling, and SandboxBuildFailedError (#1457)

* feat: Daytona auto-snapshot, transient error handling, and SandboxBuildFailedError

Adds three major improvements to the Daytona environment backend:

1. **Auto-snapshot with content-based caching**: New `auto_snapshot` parameter
   on DaytonaEnvironment enables automatic snapshot creation keyed by a SHA256
   hash of the full environment directory. Tasks sharing the same Dockerfile
   and fixtures reuse a single snapshot, eliminating redundant builds. Snapshots
   are region-aware (DAYTONA_TARGET) to prevent cross-region collisions. Per-
   snapshot async locks prevent redundant parallel creation.

2. **Transient error differentiation**: New `daytona_utils.py` module provides
   `is_transient_daytona_error()` which distinguishes rate limits and capacity
   errors from non-recoverable failures. Retry callbacks use 10 attempts with
   60s linear backoff for transient errors vs 3 attempts with exponential
   backoff for others — dramatically improving reliability under load.

3. **SandboxBuildFailedError**: New non-retryable exception for failed sandbox
   builds (bad Dockerfile, snapshot in ERROR state). Stops wasting retry budget
   on builds that will never succeed. Detected both in `_create_sandbox()` and
   `_wait_for_snapshot()`.
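
A rough sketch of the two retry budgets from item 2. The attempt counts come from the description above. The exact delays are assumptions: "linear" is read here as growing by 60s per retry, and the exponential base of 1s is a placeholder. `retry_plan` is a hypothetical name.

```python
def retry_plan(transient: bool) -> list:
    """Return the waits in seconds between attempts for one failure class."""
    if transient:
        # 10 attempts means 9 waits. Assumption: 60s, 120s, ... per retry.
        return [60.0 * n for n in range(1, 10)]
    # 3 attempts means 2 waits. Assumption: 1s base, doubling each retry.
    return [float(2 ** n) for n in range(2)]
```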

Supporting additions:
- `container_cache.py`: Hash utilities for environment directories and
  Dockerfiles, plus task analysis helpers for predicting snapshot counts
- DinD auto-snapshot support with image-hash-based naming
- `ephemeral=True` flag on all sandbox creation calls
- `assume_global_snapshot` for optimistic handling of shared snapshots
  invisible to the GET API

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: remove region_id param not in current Daytona SDK

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* refactor: remove DinD auto-snapshot additions, restore main's DinD start()

DinD snapshot management was not in scope for this PR. Restores
_DaytonaDinD.start() to main's original implementation. Removes
_get_dind_snapshot_name, _ensure_dind_auto_snapshot, _create_dind_snapshot
methods and unused hashlib import.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: don't retry SandboxBuildFailedError/TimeoutError, close RL client

- Add _is_non_retryable() guard to all retry callbacks so
  SandboxBuildFailedError and TimeoutError are never retried
- Close temporary AsyncDaytona client after RL-region snapshot builds
  to prevent HTTP session leaks

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* test(daytona): harden PR #1457 with unit tests and small fixes

Add tests for daytona_utils retry classification and container_cache hashing.
Stop treating invalid bearer tokens as transient, trim unused analyze helpers,
evict idle per-snapshot locks, and document auto_snapshot ERROR behavior.

Co-authored-by: Cursor <cursoragent@cursor.com>

* refactor(daytona): extract snapshot service and collapse retry helpers

Move snapshot lifecycle into daytona_snapshots.py with a single state
resolver and SnapshotPolicy. Replace six retry callbacks with
daytona_retry_callbacks(). Simplify _DaytonaDirect.start() via
_resolve_start_sandbox_params() and remove the string-matched fallback catch.

Co-authored-by: Cursor <cursoragent@cursor.com>

* refactor(daytona): dedupe ensure_auto paths and add optional snapshot GET

Collapse fast/slow auto-snapshot resolution into shared helpers and use a
documented non-retrying GET for pre-create ERROR cleanup.

Co-authored-by: Cursor <cursoragent@cursor.com>

* feat: use Task.short_name for environment_name

Add Task.short_name (delegates to package short_name, else task dir name)
and pass it as environment_name so Daytona snapshot templates and container
naming avoid registry org prefixes and slashes in paths.

Co-authored-by: Cursor <cursoragent@cursor.com>

* refactor(daytona): move modules into daytona/ package

Group environment, snapshots, and utils under environments/daytona/
to match docker/ and singularity/. Default assume_global_snapshot to
False so missing template snapshots fall back to Dockerfile builds.

Co-authored-by: Cursor <cursoragent@cursor.com>

* fix(container_cache): length-prefix paths in environment hash

Avoid ambiguous SHA256 updates where a file path could concatenate with
the next file's content. Adds a regression test for the ab/a+b case.

Co-authored-by: Cursor <cursoragent@cursor.com>

* fix(daytona): wait for concurrent snapshot create to become active

Handle PENDING snapshots before create and wait for ACTIVE after
already-exists/conflict errors instead of returning the name immediately.

Co-authored-by: Cursor <cursoragent@cursor.com>

* fix(container_cache): length-prefix file content in environment hash

Extend domain-separated hashing so path and content bytes cannot be
ambiguous across files (Devin review follow-up).
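
The ambiguity these two container_cache fixes address can be shown with a small sketch. It assumes an 8-byte big-endian length prefix on every path and content field. `hash_files` is a hypothetical stand-in, not the helper in `container_cache.py`.

```python
import hashlib


def hash_files(files: dict) -> str:
    """SHA256 over (path, content) pairs, each field prefixed by its byte length."""
    digest = hashlib.sha256()
    for path in sorted(files):
        for field in (path.encode("utf-8"), files[path]):
            # The fixed-width length prefix marks where each field ends, so
            # path "ab" + content b"c" cannot collide with "a" + b"bc".
            digest.update(len(field).to_bytes(8, "big"))
            digest.update(field)
    return digest.hexdigest()
```

Without the prefixes, both layouts would feed the same byte stream `abc` into the hash.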

Co-authored-by: Cursor <cursoragent@cursor.com>

---------

Co-authored-by: Benjamin Feuer <penfever@gmail.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-authored-by: Alex Shaw <alexgshaw64@gmail.com>
Co-authored-by: Cursor <cursoragent@cursor.com>

* Upload environment/ files for prebuilt docker_image tasks (#1737)

* Upload environment/ to workdir for prebuilt docker_image tasks.

When docker_image is set without a Dockerfile or docker-compose.yaml,
environments copy non-empty environment/ into the container workdir at
the end of start().

Co-authored-by: Cursor <cursoragent@cursor.com>

* Fix CI: format tests and isolate cwsandbox environment_dir fixtures.

Use a dedicated empty environment/ subdirectory so post-start uploads do
not run during unit tests that assert exact exec call counts.

Co-authored-by: Cursor <cursoragent@cursor.com>

* Format cwsandbox test_wandb.py

Co-authored-by: Cursor <cursoragent@cursor.com>

* Fix cwsandbox tests to write Dockerfile under environment/.

Aligns with environment_dir fixture so prebuilt-image allowance tests
exercise the intended layout.

Co-authored-by: Cursor <cursoragent@cursor.com>

---------

Co-authored-by: Cursor <cursoragent@cursor.com>

* downgrade logging.

* Stop writing per-episode log folders in Terminus-2 (#1740)

* Stop writing per-episode log folders in Terminus-2.

Episode prompt/response/debug files are redundant now that trajectory.json captures each turn.

Co-authored-by: Cursor <cursoragent@cursor.com>

* Fix Terminus-2 tests after removing episode logging paths.

Co-authored-by: Cursor <cursoragent@cursor.com>

---------

Co-authored-by: Cursor <cursoragent@cursor.com>

* [Ready for Review] Adapter | Review bot prompt update for agent reward hacking checks (#1747)

* Update adapter review prompts

* Update prompt based on some sanity check runs

* Add benchmark identity leakage check

* Add linear.review link to PR diff links workflow (#1749)

* fix link.

* v0.9.0

* claude_code: handle redacted_thinking content blocks (#1752)

Anthropic's `redacted_thinking` is a standard, documented content block
type that can appear in any assistant message when extended thinking is
enabled. Its `data` field is opaque ciphertext that clients cannot
decrypt — the contract is to pass it back unchanged on subsequent API
calls, never to expose it as user-facing text.

Today _extract_text_reasoning_tool_uses doesn't recognise the type, so
the block falls through to the catch-all that `_stringify`s the whole
block dict and appends the resulting JSON envelope to text_parts.
Trajectories then carry an ATIF `message` like
  '{"type":"redacted_thinking","data":"…"}'
in the assistant turn. On may26 there are 2,050 such steps across 127
trials in the bundled corpus, all claude-code paired with vendor-routed
models (e.g. tencent/hy3-preview-20260421 via OpenRouter).

OpenRouter additionally mis-uses the redacted_thinking envelope to pass
through PLAIN reasoning from non-Anthropic models: `data` is
`openrouter.reasoning:<b64>`, where the base64 decodes to plain JSON
`{"text":"…","type":"reasoning.text"}`. That content isn't
actually encrypted — it should land in reasoning_content like every
other thinking block.

Add a redacted_thinking branch before the generic fallback that:
  - if data starts with `openrouter.reasoning:`, b64-decodes the
    payload, parses the inner JSON, and appends the inner `text` to
    reasoning_parts;
  - otherwise drops the block. This preserves the API contract for
    genuine Anthropic ciphertext (it remains opaque) and stops the
    envelope JSON from polluting human-readable trajectory text.
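
A minimal sketch of that branch, assuming the block is a plain dict. `reasoning_from_redacted` is a hypothetical name, and the real logic lives inside `_extract_text_reasoning_tool_uses`.

```python
import base64
import binascii
import json
from typing import Optional

PREFIX = "openrouter.reasoning:"


def reasoning_from_redacted(block: dict) -> Optional[str]:
    """Return decoded reasoning text, or None when the block should be dropped."""
    data = block.get("data")
    if not isinstance(data, str) or not data.startswith(PREFIX):
        return None  # genuine ciphertext stays opaque and is dropped
    try:
        inner = json.loads(base64.b64decode(data[len(PREFIX):]))
    except (binascii.Error, ValueError):
        return None  # malformed payloads are dropped as well
    text = inner.get("text") if isinstance(inner, dict) else None
    return text if isinstance(text, str) else None
```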

Updates the existing test_redacted_thinking_not_in_reasoning to assert
the envelope is now absent from both text and reasoning (it previously
only asserted absence from reasoning, accepting the stringified-into-
text behaviour), and adds two new tests covering the OpenRouter decode
and malformed-payload-dropped paths.

Co-authored-by: Ubuntu <ubuntu@ip-172-31-32-163.ap-northeast-2.compute.internal>

* claude_code: unwrap text content blocks in user-event tool_result loop (#1753)

In _convert_events_to_trajectory, the user-event content loop already
handles tool_result blocks specifically. Anything else falls through to
`self._stringify(block)` — which JSON-encodes the whole block dict and
appends the resulting envelope to text_parts. So a content block like
  {"type": "text", "text": "<10 KB of skill documentation>"}
ends up in the ATIF user step's `message` as
  '{"type":"text","text":"Base directory for this skill: …"}'
verbatim — downstream renderers that expect `message` to be human
text can't read it.

Claude Code injects these text blocks as user content alongside the
tool_result when a Skill is loaded (the block carries the skill's
documentation). Saw 4 such steps in a recent harbor-index corpus scan
on skillsbench × {glm-5.1, MiniMax/MiniMax-M2.7} runs.

Fix: before the generic _stringify fallback, recognise
`{"type":"text","text":<str>}` and surface its inner string. Non-text
blocks and text blocks with non-string `text` still hit the stringify
fallback so behaviour for unknown shapes is unchanged.
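
As a sketch of the fix, with `json.dumps` standing in for the agent's `_stringify` helper (an assumption) and `user_block_to_text` as a hypothetical name:

```python
import json


def user_block_to_text(block: object) -> str:
    """Surface the inner string of a text block; stringify anything else."""
    if (
        isinstance(block, dict)
        and block.get("type") == "text"
        and isinstance(block.get("text"), str)
    ):
        return block["text"]
    # Stand-in for _stringify: unknown shapes keep the old fallback behaviour.
    return block if isinstance(block, str) else json.dumps(block, ensure_ascii=False)
```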

Adds test_user_event_text_content_block_unwrapped covering the end-to-end
path through _convert_events_to_trajectory.

Co-authored-by: Ubuntu <ubuntu@ip-172-31-32-163.ap-northeast-2.compute.internal>

* fix(modal): default _ModalDirect.exec to non-login shell (#1744)

The strategy-refactor PR (#1311) introduced `login=True` on the default
`_ModalDirect.exec` path, which causes the underlying SDK call to use
`bash -lc <cmd>`. A login shell re-sources `/etc/profile` and the
shell's profile files, which **clobbers `PATH`** as set by the image's
`ENV PATH=…` directives.

This breaks any task that pins toolchains via image-level `ENV PATH`:
- Go tasks lose `/usr/local/go/bin` (everything that does
  `go build`/`go test` fails)
- Rust tasks lose `~/.cargo/bin` (cargo not found)
- Anything with custom `pipx`/`uv`/Node prefixes baked into image
  layers gets reset to the inherited login default

Reverting this single line to `login=False` restores the pre-#1311
`bash -c` behavior and preserves the image's PATH.

The lower-level `_sdk_exec` still exposes `login` as a parameter, so
strategies that genuinely want a login shell can opt in explicitly.

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>

* Add viewer sign-in and sync auth with the CLI (#1755)

* Add viewer sign-in and sync auth with the CLI.

Enable OAuth login/logout in the local viewer, pick up CLI credential changes via mtime-based cache invalidation, and align page headers with Harbor Hub.

Co-authored-by: Cursor <cursoragent@cursor.com>

* Fix credential sync detection on Windows.

Use a content hash instead of mtime, which can be unchanged across rapid writes on Windows.
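
A minimal sketch of hash-based change detection, assuming the credentials live in a single file. `CredentialWatcher` is a hypothetical class, not the viewer's actual cache.

```python
import hashlib
from pathlib import Path
from typing import Optional


class CredentialWatcher:
    """Detect credential changes by content hash rather than mtime."""

    def __init__(self, path: Path) -> None:
        self._path = path
        self._last: Optional[str] = None

    def _fingerprint(self) -> Optional[str]:
        try:
            return hashlib.sha256(self._path.read_bytes()).hexdigest()
        except FileNotFoundError:
            return None  # a missing file counts as "no credentials"

    def changed(self) -> bool:
        """Return True when the file content differs from the last check."""
        current = self._fingerprint()
        changed = current != self._last
        self._last = current
        return changed
```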

Co-authored-by: Cursor <cursoragent@cursor.com>

* Fix credential sync baseline after local writes.

Set initialized state in note_credentials_written and isolate credential sync tests so they pass independently.

Co-authored-by: Cursor <cursoragent@cursor.com>

---------

Co-authored-by: Cursor <cursoragent@cursor.com>

* fix(claude-code): preserve user-message bytes in ATIF trajectory (drop .strip()) (#1724)

* [claude-code] preserve user message bytes (no .strip())

Downstream pipelines that hash the user step.message bytes for cross-
harness equivalence checks rely on byte-identical comparisons against
the canonical instruction.md. Stripping trailing/leading whitespace in
the ATIF normalizer breaks those checks silently.

`_convert_events_to_trajectory` accepts user-event content in three
shapes; all three were applying `.strip()` to the persisted bytes:

  * `content: str` (the shape `claude --print -- "..."` emits) — fixed
    by replacing `text = content.strip()` with `text = content` and
    tightening the existing truthy gate to `if text.strip():` so empty
    / whitespace-only entries are still dropped without mutating bytes
    in the non-empty case.

  * `content: list` (programmatic / SDK callers that wrap the
    instruction in `{"type": "text", "text": "..."}` blocks) — fixed by
    extracting `block["text"]` verbatim instead of routing through
    `_stringify`, and by dropping `part.strip()` from the join (the
    `if part.strip()` filter still removes empty / whitespace-only
    parts so we never emit `\n\n` between nothing). Non-text non-
    tool_result blocks (e.g. image blocks) continue to fall through to
    `_stringify`, which json-encodes them; the patch deliberately does
    not try to preserve those byte-for-byte, since they have no
    canonical text bytes to be faithful to.

  * `content` else-branch (defensive fallback for unusual shapes) —
    fixed by the same rule: keep raw `_stringify(content)` bytes and
    use `.strip()` only in the empty-skip filter.
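
The common rule across the three shapes can be sketched as below: `.strip()` decides whether to keep a value but never alters what is stored. The helper names are hypothetical, and the `"\n\n"` separator is an assumption inferred from the description above.

```python
from typing import Optional


def keep_user_text(content: str) -> Optional[str]:
    """Return content unchanged, or None when it is empty or whitespace-only."""
    return content if content.strip() else None


def join_parts(parts: list) -> str:
    """Join non-blank parts verbatim; blank parts are filtered, not trimmed."""
    return "\n\n".join(part for part in parts if part.strip())
```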

Adds regression tests covering string-content trailing newline /
leading whitespace / internal whitespace / empty / whitespace-only,
list-content single-block byte-faithful / multi-block join / empty-
part filter / non-text non-tool_result block json-encoded, and the
fallback else-branch on a non-str non-list content payload.

* fix(tests): run byte-faithful suite in CI (declare hypothesis, drop module skip)

The module-level `pytest.importorskip("hypothesis")` skipped the ENTIRE
test file when hypothesis was absent — not just the property test, but
also the byte-faithful regression suite this PR adds and the pre-existing
reasoning-extraction / session-selection tests. hypothesis was not in the
dev dependency group nor in uv.lock, and CI installs via
`uv sync --all-packages --all-extras --locked`, so it was never present:
the file collected to "0 items / 1 skipped" and CI was green-but-empty.

Declare hypothesis in [dependency-groups].dev (uv.lock updated) and import
it normally at module top so the whole file collects and runs.

Verified locally: pytest now collects 47 tests (was 0 / 1 skipped); all
pass including the 2000-example property test. ruff check + format clean.

* fix(opencode): include the user prompt as a user step in the ATIF trajectory (#1759)

OpenCode trajectories had no source="user" step: _convert_events_to_trajectory
only emitted agent steps, so the prompt was missing (the docstring even claimed
a user step was synthesised, but the code never added one).

OpenCode's `run --format=json` stream omits the prompt entirely
(anomalyco/opencode#29997); it is only recoverable via `opencode export`.
Capture the rendered instruction in run() and prepend a source="user" step,
preferring OpenCode's own `user` event when present (forward-compatible with
anomalyco/opencode#29998) and falling back to the instruction otherwise.

Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>

* Fix Claude Code trajectory conversion for duplicate events (#1741)

Co-authored-by: Alex Shaw <alexgshaw64@gmail.com>

* feat(gemini-cli): support Login with Google (oauth-personal) via credential upload (#1764)

Adds opt-in "Login with Google" auth to the gemini-cli agent, mirroring the
Codex agent's auth.json injection:
  - GEMINI_OAUTH_CREDS_PATH=<path> → upload that oauth_creds.json
  - GEMINI_FORCE_OAUTH=<truthy>    → upload ~/.gemini/oauth_creds.json
Default behavior (GEMINI_API_KEY / Vertex env) is unchanged.
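
A rough sketch of how the two opt-in variables might resolve to a credential file. The precedence (explicit path first) and the accepted truthy spellings are assumptions, and `oauth_creds_path` is a hypothetical name.

```python
from pathlib import Path
from typing import Mapping, Optional

TRUTHY = {"1", "true", "yes", "on"}  # assumed spellings of "truthy"


def oauth_creds_path(env: Mapping[str, str], home: Path) -> Optional[Path]:
    """Return the oauth_creds.json to upload, or None for API-key auth."""
    explicit = env.get("GEMINI_OAUTH_CREDS_PATH")
    if explicit:
        return Path(explicit)  # assumption: an explicit path takes precedence
    if env.get("GEMINI_FORCE_OAUTH", "").strip().lower() in TRUTHY:
        return home / ".gemini" / "oauth_creds.json"
    return None  # default: GEMINI_API_KEY / Vertex env, unchanged
```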

On opt-in, uploads oauth_creds.json to a staging dir, chowns it to the agent
user (upload_file lands as root), copies it into ~/.gemini with 0600, and sets
settings security.auth.selectedType=oauth-personal so headless mode uses the
credential without prompting. The API key is not passed under OAuth;
GOOGLE_CLOUD_PROJECT is still forwarded. Staged secrets are removed afterward.

Verified: gemini unit suite passes (ruff + ty clean) and a real Docker run with
GEMINI_FORCE_OAUTH=true completed hello-world (reward 1.0) authenticating via
OAuth.

Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>

* Network mode and optional allowlist (#1455)

* Refactor: 'allow_internet_access' boolean attribute to 'internet' enum

* Add require_internet_access field instead of replacing allow_internet

Keep allow_internet unchanged to avoid breaking existing configs. Add a
new require_internet_access boolean to annotate tasks that need internet.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Rename require_internet_access to require_internet

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Refactor task internet config to enum

* Add per-role network policies

* Default network policy to public

* Use lowercase network modes

* Add E2B dynamic network policies

* Add E2B network policy example

* Generalize network allowlist example

* Support setup-only network allowlists

* Support lifecycle network allowlists

* Fix trial logger cleanup on init failure

* Restore E2B sandbox timeout

* Handle legacy allow_internet task configs

* Restrict shared verifier network switching

* Close trial log handlers in construction-only tests

* Reject misplaced network policy fields

* Scope network policy to trial phases and migrate E2B to update_network() (#1754)

* Add first-class CLI flags for run-specific network allowlists.

Expose --allow-host and --verifier-allow-host on harbor run/trials while keeping legacy extra_network_allowlists agent kwarg support.

Co-authored-by: Cursor <cursoragent@cursor.com>

* Scope network policy to trial phases and migrate E2B to update_network().

Apply environment baseline at env start, agent policy only during agent.run(), and verifier policy only during verifier.verify(); rename no_network to no-network and limit --allow-host to the agent phase. Use AsyncSandbox.update_network() with e2b>=2.25.0.

Co-authored-by: Cursor <cursoragent@cursor.com>

* Treat agent/verifier network fields as optional phase overrides.

Split baseline vs phase network config, skip dynamic switches when phase matches baseline, add static/dynamic E2B matrix examples, and remove redundant explicit network_mode from tasks that inherit environment defaults.

Co-authored-by: Cursor <cursoragent@cursor.com>

* Split run-time allowlist flags and document network policy hierarchy.

Replace --allow-host with --allow-environment-host (baseline) and --allow-agent-host (agent phase), and tighten task docs around baseline vs override resolution.

Co-authored-by: Cursor <cursoragent@cursor.com>

* Validate separate verifier network policy at init and warn on unused CLI hosts.

Unify phase-switch validation for shared and separate verifier modes, route separate verifier plans through _network_plan, and warn when run-time allowlist flags are ignored on public baselines.

Co-authored-by: Cursor <cursoragent@cursor.com>

* Use None for shared verifier baseline to fix separate-mode validation.

Shared mode no longer duplicates agent_env_baseline in verifier_env_baseline,
so init validation can infer container layout without comparing baselines.

Co-authored-by: Cursor <cursoragent@cursor.com>

* Document phase-scoped network policy in skills and fix example drift.

Restore no-network baselines on verifier examples after the phase-policy
migration, fix matrix README paths, and update create-task/rewardkit skills.

Co-authored-by: Cursor <cursoragent@cursor.com>

* Bump task schema version to 1.3 for phase-scoped network policy.

Update the TaskConfig default, harbor init/register paths, docs, skills,
examples, and tests. Schema 1.2 tasks remain loadable.

Co-authored-by: Cursor <cursoragent@cursor.com>

* Remove unused Any import from trial module.

Fixes ruff F401 ahead of merge into main CI.

Co-authored-by: Cursor <cursoragent@cursor.com>

* Merge allow-environment-host into inherited separate verifier baseline.

When separate verifier mode falls back to [environment] without an explicit
[verifier.environment], apply the same run-time host merge as the agent env.

Co-authored-by: Cursor <cursoragent@cursor.com>

* Fix viewer network policy display for phase overrides.

[agent] and [verifier] no longer default to Public when network_mode is
absent; show the inherited baseline instead. Add Verifier Environment Network
when [verifier.environment] is set.

Co-authored-by: Cursor <cursoragent@cursor.com>

* Fix windows multistep test fixtures for network plan resolution.

Partially constructed MultiStepTrial mocks now include agent and environment
config so _run_shared_verifier can resolve phase network policy.

Co-authored-by: Cursor <cursoragent@cursor.com>

---------

Co-authored-by: Cursor <cursoragent@cursor.com>

* Fix CI lint and type errors after main merge.

Build E2B allowlist options directly, narrow separate verifier baseline
before phase switching, and drop an unused test import.

Co-authored-by: Cursor <cursoragent@cursor.com>

* Apply ruff formatting to network policy files.

Co-authored-by: Cursor <cursoragent@cursor.com>

* Rename trial run-time allowlist fields to extra_allowed_hosts.

Keep --allow-agent-host and --allow-environment-host as CLI flags while
mapping them to agent.extra_allowed_hosts and environment.extra_allowed_hosts.

Co-authored-by: Cursor <cursoragent@cursor.com>

* Add changelog entry for phase-scoped network policy.

Co-authored-by: Cursor <cursoragent@cursor.com>

---------

Co-authored-by: Boxuan Li <boxuanli@microsoft.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-authored-by: Alex Shaw <alexgshaw64@gmail.com>
Co-authored-by: Cursor <cursoragent@cursor.com>

* v0.13.0

* Add job plugin support and refactor Harbor Hub upload (#1762)

* Add job plugin support and refactor Harbor Hub upload as an internal plugin.

Introduce --plugin for optional integrations, shared import-path loading, and implement upload via HarborHubUploadPlugin while keeping --upload as the CLI entry point.

Co-authored-by: Cursor <cursoragent@cursor.com>

* Fix missing TrialPaths import in environment factory.

Restores the import removed during import_path refactor so lint and type checks pass.

Co-authored-by: Cursor <cursoragent@cursor.com>

* Fix CI lint and type errors in plugin upload code.

Restore formatting and type the Harbor Hub visibility helper as PublicJobVisibility.

Co-authored-by: Cursor <cursoragent@cursor.com>

* Print job results before user plugin finalize and isolate plugin failures.

Move finalize_job_plugins after the results table so a plugin error cannot hide completed run output, and log per-plugin finalize failures without blocking others.
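
A minimal sketch of the per-plugin failure isolation described above. The `finalize_all` helper, the `finalize` method name, and the `Good`/`Bad` classes are hypothetical and only illustrate the pattern; Harbor's actual plugin interface may differ.

```python
import logging

logger = logging.getLogger("plugin_sketch")


def finalize_all(plugins) -> list[str]:
    """Finalize every plugin; log failures instead of letting one block the rest."""
    failed = []
    for plugin in plugins:
        try:
            plugin.finalize()  # hypothetical hook name
        except Exception:
            # Record the traceback and keep going with the remaining plugins.
            logger.exception("plugin %s failed during finalize", type(plugin).__name__)
            failed.append(type(plugin).__name__)
    return failed


class Good:
    def __init__(self) -> None:
        self.done = False

    def finalize(self) -> None:
        self.done = True


class Bad:
    def finalize(self) -> None:
        raise RuntimeError("boom")


first, last = Good(), Good()
failed = finalize_all([first, Bad(), last])
```

Here the plugin after the failing one still runs, and the caller receives the names of the plugins that failed.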

Co-authored-by: Cursor <cursoragent@cursor.com>

* Add plugin configuration via --pk and job config plugins list.

Support one CLI plugin with constructor kwargs, multiple plugins via job yaml, and pass kwargs through PluginConfig into plugin constructors.

Co-authored-by: Cursor <cursoragent@cursor.com>

* Rename JobPlugin lifecycle methods to on_job_start and on_job_end.

Align plugin hooks with Harbor job lifecycle naming and update the upload plugin and tests accordingly.

Co-authored-by: Cursor <cursoragent@cursor.com>

* Resolve harbor.plugins entry points for --plugin short names.

Add entry point lookup before plugin import, plus harbor plugins list for discovering installed plugins.

Co-authored-by: Cursor <cursoragent@cursor.com>

* Fix plugins module/package naming conflict.

Rename the CLI typer module to plugins_cmd so harbor.cli.plugins remains
a package for HarborHubUploadPlugin and other built-in plugin implementations.

Co-authored-by: Cursor <cursoragent@cursor.com>

* Apply ruff formatting to plugin-related files.

Co-authored-by: Cursor <cursoragent@cursor.com>

* Require plugins to implement on_job_end.

Make BaseJobPlugin.on_job_end abstract so every plugin explicitly
defines both lifecycle hooks instead of inheriting a silent no-op.
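
The effect of making a hook abstract can be sketched with the standard `abc` module. The method signatures below (synchronous, no arguments) are assumptions for illustration, not Harbor's actual `BaseJobPlugin` definition.

```python
from abc import ABC, abstractmethod


class BaseJobPlugin(ABC):
    """Sketch: both lifecycle hooks must be defined by every subclass."""

    @abstractmethod
    def on_job_start(self) -> None: ...

    @abstractmethod
    def on_job_end(self) -> None: ...


class IncompletePlugin(BaseJobPlugin):
    # Forgets on_job_end, so the class stays abstract.
    def on_job_start(self) -> None:
        pass


class CompletePlugin(BaseJobPlugin):
    def on_job_start(self) -> None:
        pass

    def on_job_end(self) -> None:
        pass


try:
    IncompletePlugin()
    incomplete_instantiated = True
except TypeError:
    # Python refuses to instantiate a class with unimplemented abstract methods.
    incomplete_instantiated = False

complete = CompletePlugin()
```

A plugin that omits a hook now fails at construction time instead of silently inheriting a no-op.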

Co-authored-by: Cursor <cursoragent@cursor.com>

---------

Co-authored-by: Cursor <cursoragent@cursor.com>

* Add harbor-langsmith plugin package for LangSmith integration. (#1702)

Extract LangSmith job tracking into a workspace package that registers
via harbor.plugins entry points and installs with harbor[langsmith].

Co-authored-by: Alex Shaw <alexgshaw64@gmail.com>
Co-authored-by: Cursor <cursoragent@cursor.com>

* Add harbor-langsmith publish script and PyPI package metadata.

Pin harbor>=0.13.0 for the job plugin API and record Harbor authorship
before publishing harbor-langsmith to PyPI.

Co-authored-by: Cursor <cursoragent@cursor.com>

* Fail fast on Harbor Hub auth errors when using --upload (#1781)

* Fail fast on Harbor Hub auth errors when using --upload.

Validate Hub auth before trials start and treat expired or invalid sessions as fatal instead of falling back to end-of-run batch upload.

Co-authored-by: Cursor <cursoragent@cursor.com>

* Handle stale auth gracefully in status and fix formatting.

Catch Supabase auth errors during harbor auth status and invalid session checks so users see a login prompt instead of a traceback.

Co-authored-by: Cursor <cursoragent@cursor.com>

* Centralize Supabase session validation in auth layer.

Add shared session helpers that map auth API failures to consistent errors, clear stale credentials on invalid refresh tokens, and reuse them from status checks, upload auth, and registry DB calls.

Co-authored-by: Curso…
RishiDesai added a commit to RishiDesai/harbor that referenced this pull request Jun 9, 2026
* [kimi-cli] Add OpenRouter as a supported provider (#1568)

Allow `harbor run -a kimi-cli -m openrouter/<provider>/<model>` (e.g.
`openrouter/moonshotai/kimi-k2.6`) by registering an `openrouter` entry in
`_PROVIDER_CONFIG`. OpenRouter is OpenAI-compatible, so it reuses the
`openai_legacy` provider type with `https://openrouter.ai/api/v1` and
`OPENROUTER_API_KEY`.

Without this, the agent raises `Unsupported provider 'openrouter' for
kimi-cli` from `_build_config_json` because the model-name prefix
(`openrouter`) isn't a registered key. Since the model name is split on
the first `/` only, the part forwarded to kimi-cli (and on to OpenRouter)
remains in the `<vendor>/<model>` form OpenRouter expects.
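
The first-slash split can be illustrated as follows. `split_provider` is a hypothetical helper name used only for this sketch; it is not the function in the kimi-cli agent.

```python
def split_provider(model_name: str) -> tuple[str, str]:
    """Split on the first '/' only: provider prefix, then everything after it."""
    provider, _, rest = model_name.partition("/")
    return provider, rest


provider, forwarded = split_provider("openrouter/moonshotai/kimi-k2.6")
print(provider)   # openrouter
print(forwarded)  # moonshotai/kimi-k2.6
```

Because only the first separator is consumed, the forwarded part keeps its own `/`.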

* Fix Harbor upload handling for resumable Supabase storage (#1570)

* Add TUS uploads.

* Resumable publishing.

* Fix ATIF RFC link in trajectory-format documentation (#1583)

Fix ATIF RFC link in trajectory-format documentation. (The one near the end was fixed by a robot but the one near the top was missed.)

* Fix terminus temp & cursor CLI. Closes #1586.

* Add Tensorlake to sandbox providers list (#1585)

* add tensorlake in sandbox provider list

* update the tensorlake link to harbor page in tensorlake docs

* fix(opencode): Allow any model provider to be specified with -m (#1590)

* fix using snapshot (#1587)

* v0.6.5

* Allow configuring Daytona connection_pool_maxsize via env kwargs (#1445)

Forwarded through `DaytonaClientManager` into `DaytonaConfig` when the shared `AsyncDaytona` client is built. Pass via `--ek connection_pool_maxsize=N` (`=null` for unlimited). Bumps `daytona>=0.165.0`.

Signed-off-by: rovle <lovre.pesut@gmail.com>
Co-authored-by: Alex Shaw <alexgshaw64@gmail.com>

* Support Devin CLI agent in Harbor (#1605)

* Support Devin CLI agent in Harbor

* Fix api server url

* Clean up comments and update logging configuration

Removed unnecessary comments and adjusted logging environment variables.

* Update opus version from 4.5 to 4.7

* Minor updates to changelog.

* v0.6.6

* rewardkit: individual judge mode, per-criterion files, document extraction (#1606)

* rewardkit: add individual judge mode, per-criterion files, document extraction

* rewardkit: silence ty unresolved-import for optional markitdown

* rewardkit: 0.1.3

* rewardkit: stable JSON Schema for individual-mode judge calls (#1611)

* rewardkit: stable JSON Schema for individual-mode judge calls

When `mode = "individual"`, rewardkit fires one structured-output LLM call per
criterion. The old `_build_response_schema` used the criterion's name as the
top-level property, so 60 differently-named criteria produced 60 distinct
schema texts → 60 grammar compilations on Anthropic's side → busted the 20/min
grammar-compilation rate limit and crashed the verifier.

Single-criterion calls now return the flat `{"score", "reasoning"}` shape
instead of a name-wrapped object. All individual-mode calls with the same
output format share byte-identical schema text, hit the compilation cache,
and never trip the rate limit. Multi-criterion (batched) mode is unchanged.

`parse_judge_response` accepts both the new flat shape and the existing
by-name shape, so any model that still returns the wrapped form keeps working.
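
A rough sketch of why the flat shape helps the compilation cache. The schema bodies here are assumptions (rewardkit's real schema may carry other fields); the point is only that a name-wrapped schema yields one distinct text per criterion while a flat schema yields one text overall.

```python
import json

# Assumed verdict shape for illustration only.
VERDICT = {
    "type": "object",
    "properties": {"score": {"type": "string"}, "reasoning": {"type": "string"}},
    "required": ["score", "reasoning"],
}


def wrapped_schema(criterion_name: str) -> dict:
    """Old style: the criterion name is the top-level property."""
    return {
        "type": "object",
        "properties": {criterion_name: VERDICT},
        "required": [criterion_name],
    }


def flat_schema() -> dict:
    """New style: the same schema for every single-criterion call."""
    return VERDICT


names = [f"criterion_{i}" for i in range(60)]
wrapped_texts = {json.dumps(wrapped_schema(n), sort_keys=True) for n in names}
flat_texts = {json.dumps(flat_schema(), sort_keys=True) for _ in names}
print(len(wrapped_texts), len(flat_texts))  # 60 1
```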

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* rewardkit: detect flat shape by value type, not name lookup

The unwrap check in parse_judge_response keyed off whether the criterion's
name was absent from data. That broke for criteria auto-named 'score' or
'reasoning' (e.g. description='Score the work' → name='score'): with the
flat-shape response {"score": "yes", "reasoning": "ok"}, "score" IS in data,
so the unwrap was skipped and data.get("score") returned a string instead
of a dict, raising ValueError.

Switch to value-type detection — flat shape has a leaf at data["score"],
by-name shape has a nested dict — so the name collision is harmless. Adds
three regression tests covering the 'score' / 'reasoning' edge cases.
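
The value-type check can be sketched like this. `extract_verdict` is a hypothetical stand-in for the relevant part of `parse_judge_response`, written under the assumption that a by-name response nests a dict under the criterion name.

```python
def extract_verdict(data: dict, name: str) -> dict:
    """Return the {'score', 'reasoning'} verdict for one criterion."""
    # Flat shape: data["score"] is a leaf value, not a nested dict.
    is_flat = "score" in data and not isinstance(data["score"], dict)
    if is_flat:
        return {"score": data["score"], "reasoning": data.get("reasoning", "")}
    # By-name shape: the verdict is nested under the criterion name.
    entry = data.get(name)
    if not isinstance(entry, dict):
        raise ValueError(f"no verdict found for criterion {name!r}")
    return entry


flat = {"score": "yes", "reasoning": "ok"}
by_name_score = {"score": {"score": "yes", "reasoning": "ok"}}
by_name_reasoning = {"reasoning": {"score": "no", "reasoning": "weak"}}

# A criterion auto-named "score" works with both shapes.
print(extract_verdict(flat, "score"))
print(extract_verdict(by_name_score, "score"))
print(extract_verdict(by_name_reasoning, "reasoning"))
```

Checking the type of `data["score"]` rather than whether `name` is a key is what makes the name collision harmless.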

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* rewardkit: add --je / --judge flags + REWARDKIT_JUDGE override (#1609)

* fix: build harbor-rewardkit into local dist for publish (#1608)

* fix: oracle agent run fail in user agent mode (#1615)

* Update Tensorlake integration to use the latest SDK (#1621)

* unpin sdk version and update apis

* fix lifecycle

* api update

* bump up the disk size

* update

* fix

* change back to TaskGroup

* improve test coverage

* fix

* fix: classify Anthropic/Bedrock prompt-too-long errors as context length (#1619)

* fix: classify Anthropic prompt-too-long errors as context length

Co-authored-by: Cursor <cursoragent@cursor.com>

* fix: classify Bedrock input-too-long errors as context length

Co-authored-by: Cursor <cursoragent@cursor.com>

---------

Co-authored-by: Cursor <cursoragent@cursor.com>

* Fix Daytona auth and rich verifier rewards (#1620)

* fix(pi): Allow any model provider to be specified with -m (#1614)

* fix(pi): Allow any model provider to be specified with -m

* Run formatter

* Fix retry exclude CLI override (#1622)

* Speed up test suite (#1625)

* fix: Handle deprecated modal API - remove usage of `Sandbox.mkdir` (#1630)

* Update deprecated modal api

* Remove comment difdf

* islo.dev fix - docker in vm ca (#1599)

* fix: redundant ca management in docker caused dpkg to fail installing the ca-certificates

* test(islo): align unit tests with CA-mount removal and user kwarg refactor

- Replace positive CA bundle bind-mount assertion with a negative one so
  the test guards against the redundant mount being reintroduced.
- Rename the two user-wrapping tests and assert the user is forwarded
  via the SDK's user= kwarg instead of being baked into a su wrapper
  command.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Tomer Ezer <46822143+tomerezer@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs: add Islo as cloud sandbox provider (#1578)

- Add Islo to the providers list in cloud-sandboxes.mdx
- Note Islo support for multi-container deployments
- Bump islo SDK pin to >=0.3.0

* feat(islo): add docker-compose support (#1559)

* chore: update parity_summary.csv [skip ci]

* feat(islo): add docker-compose support

Adds a compose mode to the ISLO environment provider so multi-service
tasks (e.g. examples/tasks/hello-mcp with an mcp-server sidecar) can run
on islo. Mirrors the Daytona DinD pattern and reuses the shared compose
templates from harbor.environments.docker.

- Detects docker-compose.yaml in the task's environment dir; takes
  priority over the prebuilt-image / Dockerfile / runner branches
- Builds & runs a multi-service compose project inside the islo VM with
  a conventional `main` service that the agent execs into
- Two-hop file transfer (SDK -> VM temp -> docker compose cp main:) with
  a volume-mounted fast path for verifier/agent/artifacts log dirs
- Honors allow_internet=False via the shared no-network overlay; declares
  the disable_internet capability when in compose mode
- Writes an islo-specific TLS/CA overlay compose file at startup (kept
  off the shared templates) so the main service trusts the gateway's
  MITM certs and gets NODE_EXTRA_CA_CERTS / SSL_CERT_FILE / etc.
- Compose-aware stop() (docker compose down --remove-orphans) and
  attach() (islo use ... -- bash -lc '<env> docker compose exec main bash')

Adds 30 unit tests covering detection, env vars, file flags (templates,
no-network, prebuilt swap, CA overlay), command builder, volume-mount
mappings, exec/stop/attach routing, and file-transfer fast path + two-hop
behavior.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs(islo): drop cross-provider references from compose comments

Tighten the compose-mode comments to describe what islo does without
naming sibling providers, since those mentions don't help a reader
trying to understand the islo file in isolation.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(islo): address compose review feedback

- Reserve Harbor compose infra env vars: a task or persistent env var
  named CPUS / MEMORY / CONTEXT_DIR / MAIN_IMAGE_NAME / HOST_*_LOGS_PATH
  / ENV_*_LOGS_PATH would previously silently shadow the infra value and
  break compose interpolation. Infra vars now win, with a warning logged
  on collision.
- Sanitize compose project name to docker compose's required regex
  ([a-z0-9][a-z0-9_-]*); session_ids with dots, slashes, colons, or
  leading punctuation no longer surface as a confusing compose error.
- Clarify the disable_internet capability docstring: it advertises
  whether the env CAN honor allow_internet=False, not whether it's
  currently doing so.
- Replace 'replace(prefix, ...)' with explicit slicing in
  _compose_sandbox_log_path to be obviously correct without relying on
  the startswith guard above it.
- Tighten compose-mode comments.

Tests:
- Replace the misnamed test_validate_raises_when_compose_yaml_missing_after_init
  (which never asserted a raise) with a real validator coverage test
  pair.
- Add coverage for project-name sanitization (disallowed chars, leading
  punctuation), env-var precedence (infra wins), collision warning,
  disable_internet capability gating (compose vs non-compose, plus
  validator interaction with allow_internet=False), _write_ca_overlay
  shape and error path, and _wait_for_main_container success/timeout.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(islo): install docker compose plugin in compose mode

E2E run against the real islo backend surfaced that the islo-runner
image's docker doesn't ship the Compose v2 CLI plugin, so
``docker compose -p ...`` fails with ``unknown shorthand flag: 'p'``
because the docker CLI tries to parse ``-p`` as its own flag.

Adds ``_ensure_compose_plugin`` which:
- Probes ``docker compose version`` and skips if the plugin is already
  present.
- Otherwise downloads the latest ``docker-compose-linux-<arch>`` binary
  into ``~/.docker/cli-plugins`` (works on Alpine and Debian-based VMs
  without a package manager) using whichever of curl/wget is available.

Called once in ``_start_compose`` after the daemon is up.

Verified: ``harbor run -p examples/tasks/hello-mcp --env islo
--agent oracle`` now completes end-to-end with reward 1.0 against real
islo (job 2026-04-30__15-55-05).

Tests: 3 new cases — plugin already present (skip install), plugin
missing (install via cli-plugins), install failure surfaces RuntimeError.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Revert "fix(islo): install docker compose plugin in compose mode"

The islo-runner image now ships with the Docker Compose v2 CLI plugin
preinstalled, so the runtime install step is no longer needed.

This reverts the runtime probe + plugin download from cli-plugins, the
three associated unit tests, and saves ~10–15s on compose-mode cold
start.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-authored-by: Adam Goldschmidt <adamgold7@gmail.com>

* fix(terminus-2): reset per-run state and attribute step exceptions in multi-step trials (#1566)

Multi-step support added in PR #1234 made the trial layer call agent.run()
once per step but did not update Terminus2, which stores per-trial state
on the instance. Three categories of bugs result:

1. Trajectory step IDs are non-sequential.
   The initial-prompt Step appends with step_id=1 hardcoded, but
   _trajectory_steps persists across run() calls. After step 2 we get
   [1,2,3,1,2,3,...] which fails Pydantic validation in
   _dump_trajectory(): all terminus-2 multi-step trials fail.

2. Per-run state accumulators leak across steps. _api_request_times,
   _trajectory_steps, _subagent_metrics, _subagent_rollout_details,
   _summarization_count, _session_id, _pending_completion,
   _pending_subagent_refs, _pending_handoff_prompt, _timestamped_markers
   are all written but never reset. Concrete consequences:
     - All step_results' metadata.api_request_times_msec reference the
       same growing list (Python aliasing) -> per-step latency
       tracking unusable.
     - Step N's trajectory.json contains all of steps 1..N (quadratic
       disk usage, downstream consumers see duplicated content).
     - All per-step trajectory.json files share one session_id.
     - If summarization fires in step 1, every later step's reported
       n_input_tokens / cost_usd is inflated by step 1's summarization
       cost.

3. Trial._execute_step_agent only catches asyncio.TimeoutError and
   NonZeroAgentExitCodeError. Any other exception (LLM errors, network
   errors, validation errors, anything from a subprocess agent) bubbles
   to trial-level. step_result.exception_info stays None on the failing
   step and remaining steps are silently aborted.

Fix:
  - Add Terminus2._reset_per_run_state(), called at the top of run().
    Clears all per-trial accumulators. A user-provided session_id (kwarg)
    is preserved via a new _user_provided_session_id attribute.
  - Widen Trial._execute_step_agent's except to Exception, matching the
    sibling _verify_step (line 603) and the caller of _run_step_setup
    (line 638). The explicit abort at trial.py:673
    (`if exception_info and not verifier_result: break`) still fires
    when needed; the trial smartly continues if the verifier still
    produced a result.

Verified against a 2-step task: 1/1 trial, mean reward 1.0, 0 exceptions,
distinct session ids per step, distinct api_request_times_msec per step.
Verified against a step-1-timeout-step-2-recovers task: step 1 records
TimeoutError, step 2 still runs with fully isolated state, trial reward
0.5 (mean of 0 + 1.0).
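
The list-aliasing problem and the reset fix can be reproduced in a few lines. `SketchAgent` and its attribute names are hypothetical; this is not the Terminus2 code, only the same pattern in miniature.

```python
class SketchAgent:
    """Stores per-run state on the instance, like the agent described above."""

    def __init__(self) -> None:
        self.request_times: list[int] = []

    def _reset_per_run_state(self) -> None:
        # Rebind to a fresh list so earlier results keep their own data.
        self.request_times = []

    def run(self, latencies: list[int], reset: bool) -> dict:
        if reset:
            self._reset_per_run_state()
        self.request_times.extend(latencies)
        # The result holds a reference to the list, not a copy.
        return {"api_request_times_msec": self.request_times}


leaky = SketchAgent()
leaky_1 = leaky.run([10, 20], reset=False)
leaky_2 = leaky.run([30], reset=False)

fixed = SketchAgent()
fixed_1 = fixed.run([10, 20], reset=True)
fixed_2 = fixed.run([30], reset=True)

print(leaky_1["api_request_times_msec"])  # [10, 20, 30]
print(fixed_1["api_request_times_msec"])  # [10, 20]
```

Without the reset, both step results point at the same growing list; with it, each step owns a separate list.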

Co-authored-by: Cursor <cursoragent@cursor.com>

* fix(islo): drop redundant compose overlay (broken by merge skew with #1599) (#1639)

PR #1559 (docker-compose support) introduced `_write_ca_overlay`,
which bind-mounted the VM's CA bundle into the `main` service and
set NODE_EXTRA_CA_CERTS / SSL_CERT_FILE / REQUESTS_CA_BUNDLE.
PR #1599 (merged 2 minutes earlier) had just removed the
`_VM_CA_BUNDLE` constant and the equivalent `docker run -v` mount,
because the redundant CA mount caused `dpkg` to fail installing
`ca-certificates` inside the container — the runner image already
trusts the gateway's MITM certs via its base CA store.

Neither PR rebased on the other. Upstream main currently references
`_VM_CA_BUNDLE` at 4 call sites inside `_write_ca_overlay` with no
matching definition. The module imports (Python late-binds names
in function bodies) but compose-mode tasks crash with
`NameError: name '_VM_CA_BUNDLE' is not defined` the moment a
sandbox starts.
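
The late-binding behavior is easy to demonstrate in isolation. The names below are placeholders, not the islo module's identifiers.

```python
def write_overlay() -> str:
    # _UNDEFINED_BUNDLE is deliberately never defined; defining this
    # function still succeeds because the name is looked up at call time.
    return _UNDEFINED_BUNDLE  # noqa: F821


try:
    write_overlay()
    message = ""
except NameError as exc:
    # The failure only appears when the function body actually runs.
    message = str(exc)

print(message)
```

This is why import-time checks passed while compose-mode runs crashed: nothing executed the function body until a sandbox started.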

Fix: drop the provider-side overlay entirely. Removed:

- `_write_ca_overlay` method and its caller in `_start_compose`
- `_COMPOSE_CA_OVERLAY_NAME` constant
- the `-f` flag for the overlay in `_compose_file_flags`
- the two overlay unit tests and the overlay assertion at
  test_islo.py:1280

Daytona's DinD compose path (daytona.py:461) already works
without any provider-side overlay — tasks declare their own
locale + env in their compose/Dockerfile. Matching that contract
on islo as well. Added a regression test
(`TestComposeFileFlagsHasNoProviderOverlay`) that asserts no
`docker-compose-islo-*` path is injected into the `-f` flags.

Verified end-to-end against api.islo.dev with the oracle agent on
examples/tasks/hello-mcp (compose-mode): build + compose-up +
verifier complete cleanly, reward 1.0.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(tensorlake): preserve env state on snapshot restore (#1637)

* snapshot fixes

* fix

* fixed

* [Ready for Review] Update GDB adapter dependency and invocation (#1527)

* Update GDB adapter dependency and invocation

Pin the adapter to lica-gdb 0.2.1 and remove the adapter's conflicting gdb console script so generation uses the explicit module entry point.

Made-with: Cursor

* Update GDB registry dataset docs

Made-with: Cursor

* Update GDB parity review links

Made-with: Cursor

* Add GDB adapter CLI alias

Made-with: Cursor

* Add separate verifier environments (#1655)

* Add separate verifier environments

* Add separate verifier changelog and compose env compatibility

* Handle verifier artifact staging collisions

* minor updates.

* Minor fixes.

* Update skills. Add blog post.

* v0.7.0

* Remove internal trial timeout retries (#1628)

* Fix task.toml writing.

* Fix task.toml writing.

* Add Novita environment support to Harbor (#1025)

* Add Novita environment support to Harbor

- Introduced NovitaEnvironment class for integration with Novita's cloud sandbox service.
- Implemented end-to-end and unit tests for NovitaEnvironment functionality.

* Fix CI failures: type errors, lint, and pytest collection crash

- Add type: ignore comments for novita_sandbox SDK type issues
- Move sys.exit() guard into __main__ block so pytest collection doesn't crash
- Add template reuse test phase to e2e integration test

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Fix COPY instruction parsing and timeout_sec=0 handling

- Skip COPY --from=... instructions (multi-stage builds)
- Filter out COPY flags (--chown, --chmod) before extracting source path
- Use explicit None check for timeout_sec to allow timeout_sec=0

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Address Devin review: internet flag, default timeout, multi-source COPY

- Set can_disable_internet to False (not yet supported by Novita SDK)
- Change default exec timeout from 60s to 0 (no timeout), matching e2b
- Handle multi-source COPY instructions (COPY a.py b.py /dest/)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Fix Windows path separator in upload_dir remote paths

Use PurePosixPath for remote sandbox paths to ensure forward slashes
on all platforms.
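
A small sketch of the path-separator issue, using `PureWindowsPath` so that it behaves the same on any host. `remote_path` and the `/workspace` root are hypothetical.

```python
from pathlib import PurePath, PurePosixPath, PureWindowsPath


def remote_path(remote_root: str, relative: PurePath) -> str:
    """Join a local relative path onto a remote root using forward slashes."""
    return str(PurePosixPath(remote_root).joinpath(*relative.parts))


local_relative = PureWindowsPath("sub\\dir\\file.txt")

# Joining with the host's own path flavour leaks backslashes on Windows.
naive = str(PureWindowsPath("/workspace") / local_relative)
safe = remote_path("/workspace", local_relative)

print(naive)  # \workspace\sub\dir\file.txt
print(safe)   # /workspace/sub/dir/file.txt
```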

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Change default exec timeout from 0 to 300s

The novita_sandbox SDK defaults to 60s internally when 0 is passed.
Use 300s (5 minutes) to avoid premature termination of long-running
agent and verifier commands.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Fix build error log index and defer API base URL resolution

- Use logs[-1] instead of logs[-2] for build failure error message
- Move NOVITA_BASE_URL lookup from class definition to __init__,
  consistent with NOVITA_API_KEY handling

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Handle null logs in build failure error reporting

Use `status.get("logs") or []` instead of `status.get("logs", [])`
to handle API returning `"logs": null`.
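
The difference between the two expressions shows up only when the key is present with a null value. `tail_logs` is a hypothetical helper for illustration.

```python
def tail_logs(status: dict, n: int = 5) -> list[str]:
    """Return the last n log lines, treating a null 'logs' field as empty."""
    logs = status.get("logs") or []
    return logs[-n:]


status_with_null = {"logs": None}

# dict.get only uses its default when the key is missing entirely.
print(status_with_null.get("logs", []))  # None
print(tail_logs(status_with_null))       # []
print(tail_logs({"logs": ["a", "b", "c"]}, n=2))  # ['b', 'c']
```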

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Wrap _http_client.aclose() in try/except in stop()

Prevent transport-level errors during HTTP client cleanup from
propagating out of stop() and masking the trial outcome.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Preserve sandbox when delete=False for debugging

When stop(delete=False) is called, skip killing the sandbox and closing
the HTTP client so the sandbox remains running for debugging purposes.
This aligns with how other environments (e.g. GKE) handle the delete flag.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* novita: use alias endpoint for template lookup and fix stale alias recovery

- Replace _api_list_templates + iteration with direct GET /templates/aliases/{alias}
  endpoint for O(1) template lookup instead of scanning all templates
- Add stale alias recovery in _api_create_template: on 403 "Alias already used",
  look up the stale template via alias endpoint, delete it, then retry creation
- Include API key suffix in template alias to avoid cross-account conflicts
- Increase build timeout from 600s to 1200s for heavy Dockerfiles
- Add _MIN_MEMORY_MB_PER_CPU constant (512 MB/CPU)
- Update tests to cover new alias endpoint behavior (44 tests passing)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* novita: auto-recover from stale cached templates on sandbox creation

When _find_template_by_alias returns a template ID that no longer exists
in the backend (alias registered but build failed/incomplete), AsyncSandbox
would raise a SandboxException("404: template not found"). Now start()
catches this case, deletes the stale template via REST API, and triggers
a fresh build before retrying sandbox creation.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* novita: include last 5 log lines in build failure error message

Previously only the last log line was shown, which was often just
"Postprocessing finished. Cleaning up..." instead of the actual error.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* feat(novita): upload COPY files via S3 pre-signed URL to fix 413 errors

* chore: update parity_summary.csv [skip ci]

* Fix review issues and CI failures in Novita environment

- Add _merge_env(env) call in exec() so persistent env vars (--ae flags,
  task [environment.env] config) are correctly forwarded to sandbox commands
- Add user parameter to exec(), is_dir(), is_file() to match BaseEnvironment
  interface (fixes type-check invalid-method-override errors)
- Close HTTP client in stop(delete=False) to prevent resource leak; update
  test to assert aclose is called
- Fix uv.lock: missing [[package]] header before networkx entry caused TOML
  parse errors that broke all CI checks; regenerate lockfile cleanly

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>

* Fix exec() to respect user parameter via _resolve_user

The user parameter was accepted but never used — all commands ran as
root. Now calls _resolve_user(user) to honour the orchestrator-set
default_user (e.g. task agent.user / verifier.user from task.toml).

Novita SDK's user parameter is Literal["root", "user"], so map any
non-root resolved user to "user"; add Literal import accordingly.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>

* Add preflight() and chmod 777 on log dirs in Novita environment

- Add preflight() classmethod to validate NOVITA_API_KEY before any
  trials are queued, giving immediate feedback instead of failing mid-job
- chmod 777 agent/verifier log directories after creation in start() so
  non-root agent/verifier users can write reward files and logs
- Update start() test mocks to handle both foreground (healthcheck) and
  background (exec) sandbox.commands.run call patterns

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>

* style: ruff format test_novita.py

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>

* Fix template name slash escaping and cwd quoting in exec

- Replace '/' with '__' in template alias construction so org/name task
  names (e.g. harbor/hello-world) don't break REST API URL paths
- Use shlex.quote(effective_cwd) in exec() to handle paths with spaces
  or shell metacharacters safely
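
Both fixes can be sketched with the standard library. `template_alias` and `with_cwd` are hypothetical helper names, and the `cd ... &&` command shape is an assumption about how a working directory might be applied.

```python
import shlex


def template_alias(task_name: str) -> str:
    """Replace '/' so org/name task names are safe inside a URL path segment."""
    return task_name.replace("/", "__")


def with_cwd(command: str, cwd: str) -> str:
    """Prefix a command with a safely quoted change of directory."""
    return f"cd {shlex.quote(cwd)} && {command}"


print(template_alias("harbor/hello-world"))  # harbor__hello-world
print(with_cwd("ls", "/work dir"))           # cd '/work dir' && ls
```

`shlex.quote` leaves simple paths untouched and wraps anything containing spaces or shell metacharacters in single quotes.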

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>

* Use timeout=0 (no limit) as default in exec, aligning with E2B

timeout_sec or 0 matches E2B and the Novita SDK docs where 0 means
no connection time limit, avoiding premature 300s cutoffs on long-running
agent setup or verifier scripts.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>

* Update src/harbor/environments/novita.py

Co-authored-by: devin-ai-integration[bot] <158243242+devin-ai-integration[bot]@users.noreply.github.com>

* fix: deal with build conflict error and enhance Dockerfile handling in NovitaEnvironment

* refactor: move novita-sandbox to optional extra, matching other cloud providers

- Move `novita-sandbox` from main deps to `[novita]` optional extra
- Add `dockerfile-parse` to `novita` extra (was only in `e2b`, but novita.py needs it)
- Include `harbor[novita]` in the `cloud` bundle
- Wrap SDK imports in try/except with `_HAS_NOVITA` flag, following the same
  lazy-import pattern introduced for daytona/e2b/modal in the upstream refactor
- Raise `MissingExtraError` in `preflight()` when novita-sandbox is not installed
- Regenerate uv.lock

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>

* fix: add _HAS_NOVITA guard in __init__ for clear MissingExtraError

Without this guard, instantiating NovitaEnvironment when novita-sandbox
is not installed raises a raw NameError (on DockerfileParser) instead of
a helpful MissingExtraError with install instructions. Follows the same
pattern as E2BEnvironment and RunloopEnvironment.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>

* Update src/harbor/environments/novita.py

Co-authored-by: devin-ai-integration[bot] <158243242+devin-ai-integration[bot]@users.noreply.github.com>

* Update src/harbor/environments/novita.py

Co-authored-by: devin-ai-integration[bot] <158243242+devin-ai-integration[bot]@users.noreply.github.com>

* fix: import EnvironmentCapabilities in Novita environment

Add the missing capabilities import after migrating NovitaEnvironment to the new capabilities API so ruff and ty can resolve the type.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* fix: update Novita capability tests

Update Novita environment tests to assert the new capabilities API after migrating away from deprecated properties.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* fix: fix file upload endpoint

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Co-authored-by: devin-ai-integration[bot] <158243242+devin-ai-integration[bot]@users.noreply.github.com>

* Minor fixes for ruff.

* Minor fixes for type check (#1665)

* Simplify trial flow (#1672)

* Refactor trial execution by shape

* Clean up trial helper typing

* Skip Windows container hello world without Docker

* Partial refactor.

* Improve artifact handler.

* Minor multi step fixes.

* Make artifact handler paths operation scoped

* Fix CI after trial flow cleanup

* Keep download dir excludes explicit

* Rename download dir exclusions helper

* Address artifact exclusion review comments

* Avoid duplicate single-step artifact recovery

* Avoid double stop after cancellation

---------

Co-authored-by: gabeorlanski <gabeorlanski@gmail.com>

* fix(terminus-2): make tmux send-keys dash-proof and improve send-keys error messages (#1657)

- _tmux_send_keys: append `--` end-of-options marker to the
  `tmux send-keys -t <session>` prefix so keys beginning with `-`
  (e.g. `-x`, `-Lfoo`) are treated as literal key arguments rather
  than being parsed as tmux options.
- _send_blocking_keys / _send_non_blocking_keys: include `command`
  (truncated to 100 chars), `return_code`, `stderr`, and `stdout` in
  the raised RuntimeError to make intermittent send-keys failures
  easier to diagnose from logs.
- tests: update _extract_send_keys_payload helper for the new `--`
  separator and add coverage for keys starting with `-` and for the
  enriched failure messages.
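
A sketch of the two changes, assuming the command is built as an argument list. The function names and the exact message format are hypothetical; only the `--` marker and the 100-character truncation come from the commit message.

```python
def send_keys_command(session: str, keys: list[str]) -> list[str]:
    """Build a tmux send-keys invocation; '--' ends option parsing."""
    return ["tmux", "send-keys", "-t", session, "--", *keys]


def failure_message(command: str, return_code: int, stdout: str, stderr: str) -> str:
    """Describe a failed send-keys call, truncating the command to 100 chars."""
    return (
        f"tmux send-keys failed: command={command[:100]!r} "
        f"return_code={return_code} stderr={stderr!r} stdout={stdout!r}"
    )


cmd = send_keys_command("main", ["-x", "Enter"])
msg = failure_message("x" * 500, 1, "", "no server running")

print(cmd)  # ['tmux', 'send-keys', '-t', 'main', '--', '-x', 'Enter']
```

Everything after `--` is passed through as a key argument, so `-x` is no longer read as an option.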

Co-authored-by: Cursor <cursoragent@cursor.com>

* [codex] add repeatable skill inputs (#1674)

* add repeatable skill inputs

* Register injected skills for Cursor CLI

* Use Cursor native skills directory

* Simplify skill resolution

* Make injected skills readable by agents

* Address skill input review comments

* Reject relative task skills dir for injected skills

* Add skills CLI alias

* Rename injected skill config to skills

* Add runtime skills job example

* Trim runtime skills example config

* [codex] add repeatable extra docker compose overlays (#1676)

* add repeatable extra docker compose overlays

* preserve modal compose build markers

* preserve cloud compose file precedence

* Guard extra compose by environment capability

* Rename extra compose config paths

* Revert "Rename extra compose config paths"

This reverts commit 5c531c6d5a7117d6e1fdf9d58e01a8e088dd002e.

* Add extra compose job example

* Address extra compose example comments

* Nest extra compose job example

* Fix skills merge.

* [codex] Add runtime MCP config support (#1675)

* Add runtime MCP config support

* Use extra compose overlay for MCP proof example

* Remove MCP proof example volume

* Use Python base image in MCP proof task

* Document MCP proof compose context

* Trim MCP proof job defaults

* Embed MCP proof runtime config

* [codex] Add extra instruction path support (#1682)

* feat: add support for --extra-instruction-paths

* Add extra instruction path support

* Fix lock equality env serialization

* Fix lock equality for digest-backed paths

---------

Co-authored-by: ZHAO Jin-Xiang <xiaoxiangmoe@gmail.com>

* v0.7.1

* fix(terminus): use UTF-8 byte length for tmux send-keys size checks (#1680)
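
A small sketch of why byte length and character count differ for non-ASCII keys. `utf8_size` is a hypothetical name chosen for this example; the real size check lives in the Terminus code and may look different.

```python
def utf8_size(text: str) -> int:
    """Return the number of bytes `text` occupies once encoded as UTF-8."""
    return len(text.encode("utf-8"))


# "h\u00e9llo \u2713" is "héllo ✓": 7 characters, but é takes 2 bytes
# and ✓ takes 3 bytes in UTF-8.
sample = "h\u00e9llo \u2713"
char_count = len(sample)
byte_count = utf8_size(sample)
```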

* Update reward output documentation (#1684)

Update based on change in #1620

* Add minimal verifier extension hook (#1653)

* Add minimal verifier extension hook

Add a small verifier factory hook that allows jobs to provide an optional custom verifier by import path while keeping the existing task verification flow as the default.

This enables job-specific verification to supplement task-specific checks. For example, a job can attach generic trajectory evaluators, policy checks, or run-level scoring logic across many tasks without rebuilding, copying, or modifying those task definitions.

The hook keeps task authorship and job evaluation concerns separate: tasks continue to define their normal verification, and jobs can opt into additional verifier behavior only when needed.

Default behavior is unchanged when no custom verifier is configured.

Signed-off-by: Anuradha Karuppiah <26330987+AnuradhaKaruppiah@users.noreply.github.com>

* Tighten verifier extension contract

Introduce BaseVerifier and VerifierContext so custom verifiers receive a stable construction context while the built-in verifier keeps legacy kwargs compatibility.

Require verifier outputs to be VerifierResult before assigning them to trial results, preserving Harbor aggregation semantics for built-in and imported verifiers. Keep legacy import-path constructors working through an adapter that enforces the return contract.

Signed-off-by: Anuradha Karuppiah <26330987+AnuradhaKaruppiah@users.noreply.github.com>

* Reject unused verifier kwargs

Fail fast when verifier kwargs are provided without a verifier import path, since the built-in verifier does not consume arbitrary extension kwargs.

This makes CLI/config mistakes visible instead of silently dropping values like --verifier-kwarg foo=bar.

Signed-off-by: Anuradha Karuppiah <26330987+AnuradhaKaruppiah@users.noreply.github.com>

* Fix verifier factory test patch

Update Windows multi-step verifier tests to patch VerifierFactory.create_verifier_from_config after trial verification moved behind the factory hook.

Signed-off-by: Anuradha Karuppiah <26330987+AnuradhaKaruppiah@users.noreply.github.com>

* Simplify verifier extension constructor

* Simplify verifier factory contract

* Fix skills merge example config paths

---------

Signed-off-by: Anuradha Karuppiah <26330987+AnuradhaKaruppiah@users.noreply.github.com>
Co-authored-by: Alex Shaw <alexgshaw64@gmail.com>

* Minor improvements.

* fix: fail opencode runs on error events (#1658)

* Update Novita to latest SDK build flow (#1688)

* Add Novita environment support to Harbor

- Introduced NovitaEnvironment class for integration with Novita's cloud sandbox service.
- Implemented end-to-end and unit tests for NovitaEnvironment functionality.

* Fix CI failures: type errors, lint, and pytest collection crash

- Add type: ignore comments for novita_sandbox SDK type issues
- Move sys.exit() guard into __main__ block so pytest collection doesn't crash
- Add template reuse test phase to e2e integration test

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Fix COPY instruction parsing and timeout_sec=0 handling

- Skip COPY --from=... instructions (multi-stage builds)
- Filter out COPY flags (--chown, --chmod) before extracting source path
- Use explicit None check for timeout_sec to allow timeout_sec=0
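
The COPY handling described above can be sketched roughly as follows. `copy_sources` is a hypothetical function for illustration, assuming shell-form COPY lines only (the JSON-array form is not handled); it is not the code in `novita.py`.

```python
import shlex


def copy_sources(instruction: str) -> list[str]:
    """Return local source paths from a shell-form Dockerfile COPY line."""
    tokens = shlex.split(instruction)
    if not tokens or tokens[0].upper() != "COPY":
        return []
    args = tokens[1:]
    # COPY --from=<stage> reads from another build stage, not the local
    # build context, so there is nothing to upload.
    if any(arg.startswith("--from=") for arg in args):
        return []
    # Drop flags such as --chown=... and --chmod=... before looking at paths.
    positional = [arg for arg in args if not arg.startswith("--")]
    # The last positional argument is the destination; the rest are sources.
    return positional[:-1]
```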

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Address Devin review: internet flag, default timeout, multi-source COPY

- Set can_disable_internet to False (not yet supported by Novita SDK)
- Change default exec timeout from 60s to 0 (no timeout), matching e2b
- Handle multi-source COPY instructions (COPY a.py b.py /dest/)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Fix Windows path separator in upload_dir remote paths

Use PurePosixPath for remote sandbox paths to ensure forward slashes
on all platforms.
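
A minimal sketch of the idea, assuming a hypothetical helper `remote_path_for` that maps a local file to its remote destination:

```python
from pathlib import Path, PurePosixPath


def remote_path_for(local_file: Path, local_root: Path, remote_root: str) -> str:
    """Map a local file under `local_root` to a forward-slash remote path."""
    relative = local_file.relative_to(local_root)
    # `relative.parts` carries no separators, so joining the parts through
    # PurePosixPath yields "/" even when the host is Windows.
    return str(PurePosixPath(remote_root, *relative.parts))


example = remote_path_for(Path("src/pkg/mod.py"), Path("src"), "/workspace")
```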

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Change default exec timeout from 0 to 300s

The novita_sandbox SDK defaults to 60s internally when 0 is passed.
Use 300s (5 minutes) to avoid premature termination of long-running
agent and verifier commands.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Fix build error log index and defer API base URL resolution

- Use logs[-1] instead of logs[-2] for build failure error message
- Move NOVITA_BASE_URL lookup from class definition to __init__,
  consistent with NOVITA_API_KEY handling

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Handle null logs in build failure error reporting

Use `status.get("logs") or []` instead of `status.get("logs", [])`
to handle API returning `"logs": null`.
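
The difference between the two spellings, shown on a hypothetical response dict:

```python
# Hypothetical API response where the key is present but explicitly null.
status = {"logs": None}

# The default is only used when the key is missing, so this stays None.
with_default = status.get("logs", [])

# None is falsy, so `or []` substitutes an empty list.
with_fallback = status.get("logs") or []
```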

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Wrap _http_client.aclose() in try/except in stop()

Prevent transport-level errors during HTTP client cleanup from
propagating out of stop() and masking the trial outcome.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Preserve sandbox when delete=False for debugging

When stop(delete=False) is called, skip killing the sandbox and closing
the HTTP client so the sandbox remains running for debugging purposes.
This aligns with how other environments (e.g. GKE) handle the delete flag.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* novita: use alias endpoint for template lookup and fix stale alias recovery

- Replace _api_list_templates + iteration with direct GET /templates/aliases/{alias}
  endpoint for O(1) template lookup instead of scanning all templates
- Add stale alias recovery in _api_create_template: on 403 "Alias already used",
  look up the stale template via alias endpoint, delete it, then retry creation
- Include API key suffix in template alias to avoid cross-account conflicts
- Increase build timeout from 600s to 1200s for heavy Dockerfiles
- Add _MIN_MEMORY_MB_PER_CPU constant (512 MB/CPU)
- Update tests to cover new alias endpoint behavior (44 tests passing)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* novita: auto-recover from stale cached templates on sandbox creation

When _find_template_by_alias returns a template ID that no longer exists
in the backend (alias registered but build failed/incomplete), AsyncSandbox
would raise a SandboxException("404: template not found"). Now start()
catches this case, deletes the stale template via REST API, and triggers
a fresh build before retrying sandbox creation.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* novita: include last 5 log lines in build failure error message

Previously only the last log line was shown, which was often just
"Postprocessing finished. Cleaning up..." instead of the actual error.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* feat(novita): upload COPY files via S3 pre-signed URL to fix 413 errors

* chore: update parity_summary.csv [skip ci]

* Fix review issues and CI failures in Novita environment

- Add _merge_env(env) call in exec() so persistent env vars (--ae flags,
  task [environment.env] config) are correctly forwarded to sandbox commands
- Add user parameter to exec(), is_dir(), is_file() to match BaseEnvironment
  interface (fixes type-check invalid-method-override errors)
- Close HTTP client in stop(delete=False) to prevent resource leak; update
  test to assert aclose is called
- Fix uv.lock: missing [[package]] header before networkx entry caused TOML
  parse errors that broke all CI checks; regenerate lockfile cleanly

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>

* Fix exec() to respect user parameter via _resolve_user

The user parameter was accepted but never used — all commands ran as
root. Now calls _resolve_user(user) to honour the orchestrator-set
default_user (e.g. task agent.user / verifier.user from task.toml).

Novita SDK's user parameter is Literal["root", "user"], so map any
non-root resolved user to "user"; add Literal import accordingly.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>

* Add preflight() and chmod 777 on log dirs in Novita environment

- Add preflight() classmethod to validate NOVITA_API_KEY before any
  trials are queued, giving immediate feedback instead of failing mid-job
- chmod 777 agent/verifier log directories after creation in start() so
  non-root agent/verifier users can write reward files and logs
- Update start() test mocks to handle both foreground (healthcheck) and
  background (exec) sandbox.commands.run call patterns

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>

* style: ruff format test_novita.py

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>

* Fix template name slash escaping and cwd quoting in exec

- Replace '/' with '__' in template alias construction so org/name task
  names (e.g. harbor/hello-world) don't break REST API URL paths
- Use shlex.quote(effective_cwd) in exec() to handle paths with spaces
  or shell metacharacters safely
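
Both fixes in sketch form. `template_alias` and `command_in_dir` are hypothetical helpers; the real alias also carries other components (such as the API key suffix mentioned earlier) that are omitted here.

```python
import shlex


def template_alias(task_name: str) -> str:
    """Make a task name safe to embed in a REST URL path segment."""
    return task_name.replace("/", "__")


def command_in_dir(command: str, cwd: str) -> str:
    """Prefix `command` with a `cd` into `cwd`, quoting the path for the shell."""
    return f"cd {shlex.quote(cwd)} && {command}"


alias = template_alias("harbor/hello-world")
wrapped = command_in_dir("ls", "/tmp/my dir")
```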

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>

* Use timeout=0 (no limit) as default in exec, aligning with E2B

timeout_sec or 0 matches E2B and the Novita SDK docs where 0 means
no connection time limit, avoiding premature 300s cutoffs on long-running
agent setup or verifier scripts.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>

* Update src/harbor/environments/novita.py

Co-authored-by: devin-ai-integration[bot] <158243242+devin-ai-integration[bot]@users.noreply.github.com>

* fix: deal with build conflict error and enhance Dockerfile handling in NovitaEnvironment

* refactor: move novita-sandbox to optional extra, matching other cloud providers

- Move `novita-sandbox` from main deps to `[novita]` optional extra
- Add `dockerfile-parse` to `novita` extra (was only in `e2b`, but novita.py needs it)
- Include `harbor[novita]` in the `cloud` bundle
- Wrap SDK imports in try/except with `_HAS_NOVITA` flag, following the same
  lazy-import pattern introduced for daytona/e2b/modal in the upstream refactor
- Raise `MissingExtraError` in `preflight()` when novita-sandbox is not installed
- Regenerate uv.lock

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>

* fix: add _HAS_NOVITA guard in __init__ for clear MissingExtraError

Without this guard, instantiating NovitaEnvironment when novita-sandbox
is not installed raises a raw NameError (on DockerfileParser) instead of
a helpful MissingExtraError with install instructions. Follows the same
pattern as E2BEnvironment and RunloopEnvironment.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>

* Update src/harbor/environments/novita.py

Co-authored-by: devin-ai-integration[bot] <158243242+devin-ai-integration[bot]@users.noreply.github.com>

* Update src/harbor/environments/novita.py

Co-authored-by: devin-ai-integration[bot] <158243242+devin-ai-integration[bot]@users.noreply.github.com>

* fix: import EnvironmentCapabilities in Novita environment

Add the missing capabilities import after migrating NovitaEnvironment to the new capabilities API so ruff and ty can resolve the type.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* fix: update Novita capability tests

Update Novita environment tests to assert the new capabilities API after migrating away from deprecated properties.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* fix: fix file upload endpoint

* fix: integrate Novita SDK template builds

Use the Novita SDK template builder directly while preserving Harbor's Dockerfile COPY handling, and pin the alpha SDK version without enabling global prerelease resolution.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* fix: pin Novita sandbox domain

Use the regional Novita sandbox endpoint consistently so local domain overrides cannot route template operations to the wrong API host.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* fix: avoid Novita SDK import during test collection

Load Novita SDK modules only when the Novita environment actually needs them so pytest can collect E2B and Novita tests in the same process without duplicate protobuf descriptor registration.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Co-authored-by: devin-ai-integration[bot] <158243242+devin-ai-integration[bot]@users.noreply.github.com>

* Fix EnvironmentConfig deprecation warnings on default construction.

Migrate legacy memory/storage fields in a before validator instead of
Field(deprecated=...) plus an after validator, and reject conflicting
legacy and modern resource values.

Closes #1693

Co-authored-by: Cursor <cursoragent@cursor.com>

* Estimate cursor-cli cost from usage via LiteLLM

Cursor CLI stream-json reports token usage on result events but not
dollar cost. Parse optional totalCost when present and otherwise
estimate from per-category token counts using LiteLLM pricing.

Co-authored-by: Cursor <cursoragent@cursor.com>

* Add built-in pricing for Cursor Composer models in cursor-cli.

LiteLLM does not list cursor/composer models, so estimate cost from token
usage using Cursor's published rates before falling back to LiteLLM.

Co-authored-by: Cursor <cursoragent@cursor.com>

* [codex] Add resource enforcement policies (#1697)

* Add resource enforcement policies

* Preflight check.

* Fix CHANGELOG breaking changes for resource enforcement policies.

Document removed task resource defaults and stricter validation instead of incorrectly claiming --cpus/--memory repurposed numeric overrides.

Co-authored-by: Cursor <cursoragent@cursor.com>

---------

Co-authored-by: Cursor <cursoragent@cursor.com>

* v0.8.0

* Fix resource default test after provider-default change (#1701)

* fix tests on main

* chore: rerun CI

* Document job sharing (#1706)

* feat(viewer): add ←/→ trial navigation, ⌥+←/→ tab cycling, persistent tab across trials, and X/N position indicator on the trial page (#1705)

* docs(atif): refresh trajectory format page to v1.7 (#1704)

The trajectory format docs page still advertised ATIF-v1.4 as current and stopped its supported-versions list at v1.4, while the canonical RFC (rfcs/0001-trajectory-format.md) has been at v1.7 for several releases. Bump the example schema_version strings to ATIF-v1.7 and extend the Schema Versions section with v1.5, v1.6, and v1.7 entries summarized from the RFC's Version History.

No code changes; docs only.

* Add PR diff links workflow with manual dispatch. (#1716)

Post devinreview and diffshub links when PRs open, and allow testing on existing PRs via workflow_dispatch.

Co-authored-by: Cursor <cursoragent@cursor.com>

* feat: add Openclaw installed agent (#1661)

* feat: add openclaw installed agent

* Cleanup commit

* save full session turns

* NeMo-Flow Integration

* cleanup

* update defaults

* fix test for updated defaults

* Fix tests for new defaults

* Fix lint error

* Remove nemoflow from PR

Signed-off-by: Sam Oluwalana <soluwalana@nvidia.com>

* refactor(openclaw): generalize provider config normalization

Address review feedback: drop NVIDIA-specific code paths from the
OpenClaw plugin so it works generically across any OpenAI-compatible
provider.

- Replace `_merge_nvidia_base_url_from_env` and
  `_normalize_nvidia_models_provider` with provider-agnostic
  `_merge_provider_base_url_from_env` and
  `_normalize_provider_models_schema` that derive the provider from
  `--model` (e.g. `openai/gpt-4.1` -> `OPENAI_BASE_URL`).
- Remove the hardcoded NVIDIA default base URL; users select a
  custom provider via env or `openclaw_config`.
- Update class docstring to use `openai/*` as the generic example.
- Rewrite the NVIDIA-themed unit tests to cover the generic
  behavior with `openai/*`.

The `nvidia` entry in the env-var forwarding switch is retained
alongside ~15 other providers (anthropic, openai, google, ...) as a
plain provider registry, since removing it would break existing
`nvidia/*` model selections.

Signed-off-by: Bryan Bednarski <bbednarski@nvidia.com>

* feature(api): multi-provider compatibility for openclaw

Signed-off-by: Bryan Bednarski <bbednarski@nvidia.com>

---------

Signed-off-by: Sam Oluwalana <soluwalana@nvidia.com>
Signed-off-by: Bryan Bednarski <bbednarski@nvidia.com>
Co-authored-by: Bryan Bednarski <bbednarski@nvidia.com>
Co-authored-by: Alex Shaw <alexgshaw64@gmail.com>

* Add GPU support to GKE environment (#1640)

* Add GPU support to GKE environment

* Address PR comments

- Early failure if an unsupported GPU type is provided
- Increase the timeout minutes to 20 when GPUs are selected
- Support direct gke-accelerator values as gpu_types

* Adjust GPU count retrieval to use _effective_gpus for consistency

* Paginate dataset metadata queries past Supabase row cap (#1719)

* Paginate dataset metadata queries past Supabase row cap.

Fixes harbor download and run truncating package datasets at 1,000 tasks.
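
A generic sketch of paginating past a fixed row cap. `fetch_all` and the `fetch_page(start, end)` callable are hypothetical, and the inclusive range bounds and the 1,000-row page size are assumptions for illustration rather than the actual registry client code.

```python
from typing import Callable, Iterator

PAGE_SIZE = 1000  # assumed server-side row cap


def fetch_all(
    fetch_page: Callable[[int, int], list[dict]],
    page_size: int = PAGE_SIZE,
) -> Iterator[dict]:
    """Yield every row by requesting consecutive inclusive ranges."""
    start = 0
    while True:
        rows = fetch_page(start, start + page_size - 1)
        yield from rows
        # A short (or empty) page means there is nothing left to fetch.
        if len(rows) < page_size:
            return
        start += page_size


# Stand-in data source: 2,500 rows served through inclusive ranges.
dataset = [{"id": i} for i in range(2500)]
all_rows = list(fetch_all(lambda start, end: dataset[start : end + 1]))
```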

Co-authored-by: Cursor <cursoragent@cursor.com>

* Format test_registry_db_client.py with ruff.

Co-authored-by: Cursor <cursoragent@cursor.com>

---------

Co-authored-by: Cursor <cursoragent@cursor.com>

* Add TPU support to harbor and GKE environment (#1652)

* Address PR comments

- Early failure if an unsupported GPU type is provided
- Increase the timeout minutes to 20 when GPUs are selected
- Support direct gke-accelerator values as gpu_types

* Adjust GPU count retrieval to use _effective_gpus for consistency

* Add TPU support to environment configuration

This change allows environments to properly support and validate TPU requirements, improving task execution flexibility.

* Add TPU support to GKE environment

This update introduces a mapping for TPU types, enhances the GKEEnvironment class to handle TPU configurations, and updates unit tests to validate TPU capabilities and configurations alongside existing GPU support.

* Update environment config model to use a dedicated class for TpuSpec

* Add new TPU config to docs

* Add --tpu_overrides to cli commands

* Validate mutual exclusion of GPU and TPU requests in GKE

* Fix merge conflicts

* Update TPU configuration to use a single TpuSpec

* Add Harbor Hub job result sharing blog post (#1732)

* Add Harbor Hub job result sharing blog post.

Co-authored-by: Cursor <cursoragent@cursor.com>

* Update job sharing blog title and landing page banner.

Co-authored-by: Cursor <cursoragent@cursor.com>

---------

Co-authored-by: Cursor <cursoragent@cursor.com>

* Add CoreWeave Sandbox and W&B environment support (#1698)

* cw sandbox

* doc fix

* Fix (Add resource enforcement policies)

* final fixes

* comment cleanup

* fix(cwsandbox): clean up backend sandbox on any failed start()

* feat (Tensorlake): build sandboxes from OCI images instead of per-trial Dockerfile replay (#1734)

* update tensorlake integration to use oci image build

* Guard fcntl import for Windows test collection in tensorlake env

* Add managing resources docs for task configuration. (#1735)

Centralize enforcement policy and resource field guidance in the tasks docs.

Co-authored-by: Cursor <cursoragent@cursor.com>

* [Ready For Review] Fix artifact transfer archive collisions (#1733)

* Fix artifact transfer archive collisions

* Log transfer cleanup failures as warnings

* Use RPC for task version resolution (#1736)

* Allow tasks with docker_image to omit environment/Dockerfile (#1729)

* Allow tasks with docker_image to omit environment/Dockerfile.

Centralize environment definition validation and workdir helpers across supported providers.

Co-authored-by: Cursor <cursoragent@cursor.com>

* Fix docker_image-only force_build and Runloop workdir default.

Use shared prebuilt-image selection when no Dockerfile exists, and restore /workspace fallback for Dockerfiles without WORKDIR.

Co-authored-by: Cursor <cursoragent@cursor.com>

* Apply prebuilt docker_image policy to all compose providers.

Use should_use_prebuilt_docker_image in Daytona, Modal, and Islo, and unify Docker validation.

Co-authored-by: Cursor <cursoragent@cursor.com>

* Fix lazy dockerfile_parse import and daytona formatting.

Move DockerfileParser import inside parse_dockerfile_workdir so core environments do not require the optional extra.

Co-authored-by: Alex Shaw <alexgshaw64@gmail.com>

* Add dockerfile-parse to runloop optional extra.

Runloop now uses parse_dockerfile_workdir for WORKDIR resolution when a Dockerfile is present.

Co-authored-by: Alex Shaw <alexgshaw64@gmail.com>

---------

Co-authored-by: Cursor <cursoragent@cursor.com>

* feat: Add native agent adapter for Google Antigravity CLI (agy) (#1699)

* feat: Add native agent adapter for Google Antigravity CLI (agy)

* fix: remove unused import

* fix: correctly configure agy settings.json and model

* fix: update test to match new EnvironmentConfig defaults

* fix: remove unused run_model variable

* style: run ruff format on agy.py

* refactor: rename agy agent to antigravity-cli

Use antigravity-cli as the Harbor agent identifier and AntigravityCli
adapter naming instead of agy.

Co-authored-by: Cursor <cursoragent@cursor.com>

* fix(antigravity-cli): use Path.write_text for ATIF export

Address Devin review feedback and align with AGENTS.md file I/O guidance.

Co-authored-by: Cursor <cursoragent@cursor.com>

---------

Co-authored-by: Alex Shaw <alexgshaw64@gmail.com>
Co-authored-by: Cursor <cursoragent@cursor.com>

* feat: Daytona auto-snapshot, transient error handling, and SandboxBuildFailedError (#1457)

* feat: Daytona auto-snapshot, transient error handling, and SandboxBuildFailedError

Adds three major improvements to the Daytona environment backend:

1. **Auto-snapshot with content-based caching**: New `auto_snapshot` parameter
   on DaytonaEnvironment enables automatic snapshot creation keyed by a SHA256
   hash of the full environment directory. Tasks sharing the same Dockerfile
   and fixtures reuse a single snapshot, eliminating redundant builds. Snapshots
   are region-aware (DAYTONA_TARGET) to prevent cross-region collisions. Per-
   snapshot async locks prevent redundant parallel creation.

2. **Transient error differentiation**: New `daytona_utils.py` module provides
   `is_transient_daytona_error()` which distinguishes rate limits and capacity
   errors from non-recoverable failures. Retry callbacks use 10 attempts with
   60s linear backoff for transient errors vs 3 attempts with exponential
   backoff for others — dramatically improving reliability under load.

3. **SandboxBuildFailedError**: New non-retryable exception for failed sandbox
   builds (bad Dockerfile, snapshot in ERROR state). Stops wasting retry budget
   on builds that will never succeed. Detected both in `_create_sandbox()` and
   `_wait_for_snapshot()`.

Supporting additions:
- `container_cache.py`: Hash utilities for environment directories and
  Dockerfiles, plus task analysis helpers for predicting snapshot counts
- DinD auto-snapshot support with image-hash-based naming
- `ephemeral=True` flag on all sandbox creation calls
- `assume_global_snapshot` for optimistic handling of shared snapshots
  invisible to the GET API

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: remove region_id param not in current Daytona SDK

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* refactor: remove DinD auto-snapshot additions, restore main's DinD start()

DinD snapshot management was not in scope for this PR. Restores
_DaytonaDinD.start() to main's original implementation. Removes
_get_dind_snapshot_name, _ensure_dind_auto_snapshot, _create_dind_snapshot
methods and unused hashlib import.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: don't retry SandboxBuildFailedError/TimeoutError, close RL client

- Add _is_non_retryable() guard to all retry callbacks so
  SandboxBuildFailedError and TimeoutError are never retried
- Close temporary AsyncDaytona client after RL-region snapshot builds
  to prevent HTTP session leaks

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* test(daytona): harden PR #1457 with unit tests and small fixes

Add tests for daytona_utils retry classification and container_cache hashing.
Stop treating invalid bearer tokens as transient, trim unused analyze helpers,
evict idle per-snapshot locks, and document auto_snapshot ERROR behavior.

Co-authored-by: Cursor <cursoragent@cursor.com>

* refactor(daytona): extract snapshot service and collapse retry helpers

Move snapshot lifecycle into daytona_snapshots.py with a single state
resolver and SnapshotPolicy. Replace six retry callbacks with
daytona_retry_callbacks(). Simplify _DaytonaDirect.start() via
_resolve_start_sandbox_params() and remove the string-matched fallback catch.

Co-authored-by: Cursor <cursoragent@cursor.com>

* refactor(daytona): dedupe ensure_auto paths and add optional snapshot GET

Collapse fast/slow auto-snapshot resolution into shared helpers and use a
documented non-retrying GET for pre-create ERROR cleanup.

Co-authored-by: Cursor <cursoragent@cursor.com>

* feat: use Task.short_name for environment_name

Add Task.short_name (delegates to package short_name, else task dir name)
and pass it as environment_name so Daytona snapshot templates and container
naming avoid registry org prefixes and slashes in paths.

Co-authored-by: Cursor <cursoragent@cursor.com>

* refactor(daytona): move modules into daytona/ package

Group environment, snapshots, and utils under environments/daytona/
to match docker/ and singularity/. Default assume_global_snapshot to
False so missing template snapshots fall back to Dockerfile builds.

Co-authored-by: Cursor <cursoragent@cursor.com>

* fix(container_cache): length-prefix paths in environment hash

Avoid ambiguous SHA256 updates where a file path could concatenate with
the next file's content. Adds a regression test for the ab/a+b case.

Co-authored-by: Cursor <cursoragent@cursor.com>

* fix(daytona): wait for concurrent snapshot create to become active

Handle PENDING snapshots before create and wait for ACTIVE after
already-exists/conflict errors instead of returning the name immediately.

Co-authored-by: Cursor <cursoragent@cursor.com>

* fix(container_cache): length-prefix file content in environment hash

Extend domain-separated hashing so path and content bytes cannot be
ambiguous across files (Devin review follow-up).
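
A sketch of the ambiguity and the length-prefix fix. `naive_hash` and `hash_files` are hypothetical functions for illustration, and the 8-byte big-endian prefix is an assumed encoding, not necessarily what `container_cache.py` uses.

```python
import hashlib
import struct


def naive_hash(files: dict[str, bytes]) -> str:
    """Concatenate path and content with no framing (ambiguous)."""
    digest = hashlib.sha256()
    for path in sorted(files):
        digest.update(path.encode("utf-8"))
        digest.update(files[path])
    return digest.hexdigest()


def hash_files(files: dict[str, bytes]) -> str:
    """Hash path and content, each preceded by its byte length."""
    digest = hashlib.sha256()
    for path in sorted(files):
        for field in (path.encode("utf-8"), files[path]):
            # The length prefix fixes where each field starts and ends.
            digest.update(struct.pack(">Q", len(field)))
            digest.update(field)
    return digest.hexdigest()


# File "ab" containing "c" versus file "a" containing "bc": both feed the
# bytes "abc" to the naive hash, but differ once lengths are included.
collides = naive_hash({"ab": b"c"}) == naive_hash({"a": b"bc"})
distinct = hash_files({"ab": b"c"}) != hash_files({"a": b"bc"})
```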

Co-authored-by: Cursor <cursoragent@cursor.com>

---------

Co-authored-by: Benjamin Feuer <penfever@gmail.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-authored-by: Alex Shaw <alexgshaw64@gmail.com>
Co-authored-by: Cursor <cursoragent@cursor.com>

* Upload environment/ files for prebuilt docker_image tasks (#1737)

* Upload environment/ to workdir for prebuilt docker_image tasks.

When docker_image is set without a Dockerfile or docker-compose.yaml,
environments copy non-empty environment/ into the container workdir at
the end of start().

Co-authored-by: Cursor <cursoragent@cursor.com>

* Fix CI: format tests and isolate cwsandbox environment_dir fixtures.

Use a dedicated empty environment/ subdirectory so post-start uploads do
not run during unit tests that assert exact exec call counts.

Co-authored-by: Cursor <cursoragent@cursor.com>

* Format cwsandbox test_wandb.py

Co-authored-by: Cursor <cursoragent@cursor.com>

* Fix cwsandbox tests to write Dockerfile under environment/.

Aligns with environment_dir fixture so prebuilt-image allowance tests
exercise the intended layout.

Co-authored-by: Cursor <cursoragent@cursor.com>

---------

Co-authored-by: Cursor <cursoragent@cursor.com>

* downgrade logging.

* Stop writing per-episode log folders in Terminus-2 (#1740)

* Stop writing per-episode log folders in Terminus-2.

Episode prompt/response/debug files are redundant now that trajectory.json captures each turn.

Co-authored-by: Cursor <cursoragent@cursor.com>

* Fix Terminus-2 tests after removing episode logging paths.

Co-authored-by: Cursor <cursoragent@cursor.com>

---------

Co-authored-by: Cursor <cursoragent@cursor.com>

* [Ready for Review] Adapter | Review bot prompt update for agent reward hacking checks (#1747)

* Update adapter review prompts

* Update prompt based on some sanity check runs

* Add benchmark identity leakage check

* Add linear.review link to PR diff links workflow (#1749)

* fix link.

* v0.9.0

* claude_code: handle redacted_thinking content blocks (#1752)

Anthropic's `redacted_thinking` is a standard, documented content block
type that can appear in any assistant message when extended thinking is
enabled. Its `data` field is opaque ciphertext that clients cannot
decrypt — the contract is to pass it back unchanged on subsequent API
calls, never to expose it as user-facing text.

Today _extract_text_reasoning_tool_uses doesn't recognise the type, so
the block falls through to the catch-all that `_stringify`s the whole
block dict and appends the resulting JSON envelope to text_parts.
Trajectories then carry an ATIF `message` like
  '{"type":"redacted_thinking","data":"…"}'
in the assistant turn. On may26 there are 2,050 such steps across 127
trials in the bundled corpus, all claude-code paired with vendor-routed
models (e.g. tencent/hy3-preview-20260421 via OpenRouter).

OpenRouter additionally mis-uses the redacted_thinking envelope to pass
through PLAIN reasoning from non-Anthropic models: `data` is
`openrouter.reasoning:<b64>`, where the base64 decodes to plain JSON
`{"text":"…","type":"reasoning.text"}`. That content isn't
actually encrypted — it should land in reasoning_content like every
other thinking block.

Add a redacted_thinking branch before the generic fallback that:
  - if data starts with `openrouter.reasoning:`, b64-decodes the
    payload, parses the inner JSON, and appends the inner `text` to
    reasoning_parts;
  - otherwise drops the block. This preserves the API contract for
    genuine Anthropic ciphertext (it remains opaque) and stops the
    envelope JSON from polluting human-readable trajectory text.
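
The decode path can be sketched as below. `decode_redacted_thinking` is a hypothetical standalone function written for this note; the real branch sits inside `_extract_text_reasoning_tool_uses` and may be structured differently.

```python
import base64
import json
from typing import Optional

PREFIX = "openrouter.reasoning:"


def decode_redacted_thinking(data: str) -> Optional[str]:
    """Return plain reasoning text from an OpenRouter envelope, else None.

    None means the block should be dropped (genuine ciphertext or a
    malformed payload).
    """
    if not data.startswith(PREFIX):
        return None
    try:
        payload = json.loads(base64.b64decode(data[len(PREFIX):]))
    except ValueError:
        # Covers bad base64, undecodable bytes, and invalid JSON.
        return None
    text = payload.get("text") if isinstance(payload, dict) else None
    return text if isinstance(text, str) else None


# Build a sample envelope in the shape described above.
inner = json.dumps({"text": "step one", "type": "reasoning.text"})
envelope = PREFIX + base64.b64encode(inner.encode("utf-8")).decode("ascii")
decoded = decode_redacted_thinking(envelope)
```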

Updates the existing test_redacted_thinking_not_in_reasoning to assert
the envelope is now absent from both text and reasoning (it previously
only asserted absence from reasoning, accepting the stringified-into-
text behaviour), and adds two new tests covering the OpenRouter decode
and malformed-payload-dropped paths.

Co-authored-by: Ubuntu <ubuntu@ip-172-31-32-163.ap-northeast-2.compute.internal>

* claude_code: unwrap text content blocks in user-event tool_result loop (#1753)

In _convert_events_to_trajectory, the user-event content loop already
handles tool_result blocks specifically. Anything else falls through to
`self._stringify(block)` — which JSON-encodes the whole block dict and
appends the resulting envelope to text_parts. So a content block like
  {"type": "text", "text": "<10 KB of skill documentation>"}
ends up in the ATIF user step's `message` as
  '{"type":"text","text":"Base directory for this skill: …"}'
verbatim — downstream renderers that expect `message` to be human
text can't read it.

Claude Code injects these text blocks as user content alongside the
tool_result when a Skill is loaded (the block carries the skill's
documentation). A recent harbor-index corpus scan found 4 such steps
on skillsbench × {glm-5.1, MiniMax/MiniMax-M2.7} runs.

Fix: before the generic _stringify fallback, recognise
`{"type":"text","text":<str>}` and surface its inner string. Non-text
blocks and text blocks with non-string `text` still hit the stringify
fallback so behaviour for unknown shapes is unchanged.
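
As a rough sketch of that rule (not the actual `_convert_events_to_trajectory` code), the function below unwraps a text block and leaves every other shape to a fallback. The name `render_user_block` is hypothetical, and `json.dumps` stands in for `_stringify`, whose exact behaviour is assumed here.

```python
import json


def render_user_block(block):
    """Return human-readable text for one user-event content block."""
    if (
        isinstance(block, dict)
        and block.get("type") == "text"
        and isinstance(block.get("text"), str)
    ):
        # Surface the inner string instead of the JSON envelope.
        return block["text"]
    # Assumed stand-in for the generic _stringify fallback.
    return json.dumps(block, ensure_ascii=False)
```

Text blocks whose `text` is not a string deliberately take the fallback path, which matches the "unknown shapes are unchanged" guarantee stated above.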

Adds test_user_event_text_content_block_unwrapped covering the end-to-end
path through _convert_events_to_trajectory.

Co-authored-by: Ubuntu <ubuntu@ip-172-31-32-163.ap-northeast-2.compute.internal>

* fix(modal): default _ModalDirect.exec to non-login shell (#1744)

The strategy-refactor PR (#1311) introduced `login=True` on the default
`_ModalDirect.exec` path, which causes the underlying SDK call to use
`bash -lc <cmd>`. A login shell re-sources `/etc/profile` and the
shell's profile files, which **clobbers `PATH`** as set by the image's
`ENV PATH=…` directives.

This breaks any task that pins toolchains via image-level `ENV PATH`:
- Go tasks lose `/usr/local/go/bin` (everything that does
  `go build`/`go test` fails)
- Rust tasks lose `~/.cargo/bin` (cargo not found)
- Anything with custom `pipx`/`uv`/Node prefixes baked into image
  layers gets reset to the inherited login default

Reverting this single line to `login=False` restores the pre-#1311
`bash -c` behavior and preserves the image's PATH.

The lower-level `_sdk_exec` still exposes `login` as a parameter, so
strategies that genuinely want a login shell can opt in explicitly.
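
The difference between the two modes comes down to the flag passed to `bash`. The sketch below is illustrative only: `shell_argv` is a hypothetical helper and not the signature of `_sdk_exec`, and it assumes the command is run through `bash`.

```python
def shell_argv(command: str, login: bool = False) -> list[str]:
    """Build the argv for running `command` under bash.

    login=False gives `bash -c`, which keeps the PATH inherited from the
    image's ENV directives; login=True gives `bash -lc`, which re-sources
    the profile files and may overwrite PATH.
    """
    flag = "-lc" if login else "-c"
    return ["bash", flag, command]
```

With this shape, the default call `shell_argv("go test ./...")` uses a non-login shell, and a strategy that needs profile files opts in with `login=True`.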

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>

* Add viewer sign-in and sync auth with the CLI (#1755)

* Add viewer sign-in and sync auth with the CLI.

Enable OAuth login/logout in the local viewer, pick up CLI credential changes via mtime-based cache invalidation, and align page headers with Harbor Hub.

Co-authored-by: Cursor <cursoragent@cursor.com>

* Fix credential sync detection on Windows.

Use a content hash instead of mtime, which can be unchanged across rapid writes on Windows.

Co-authored-by: Cursor <cursoragent@cursor.com>

* Fix credential sync baseline after local writes.

Set initialized state in note_credentials_written and isolate credential sync tests so they pass independently.

Co-authored-by: Cursor <cursoragent@cursor.com>

---------

Co-authored-by: Cursor <cursoragent@cursor.com>

* fix(claude-code): preserve user-message bytes in ATIF trajectory (drop .strip()) (#1724)

* [claude-code] preserve user message bytes (no .strip())

Downstream pipelines that hash the user step.message bytes for cross-
harness equivalence checks rely on byte-identical comparisons against
the canonical instruction.md. Stripping trailing/leading whitespace in
the ATIF normalizer breaks those checks silently.

`_convert_events_to_trajectory` accepts user-event content in three
shapes; all three were applying `.strip()` to the persisted bytes:

  * `content: str` (the shape `claude --print -- "..."` emits) — fixed
    by replacing `text = content.strip()` with `text = content` and
    tightening the existing truthy gate to `if text.strip():` so empty
    / whitespace-only entries are still dropped without mutating bytes
    in the non-empty case.

  * `content: list` (programmatic / SDK callers that wrap the
    instruction in `{"type": "text", "text": "..."}` blocks) — fixed by
    extracting `block["text"]` verbatim instead of routing through
    `_stringify`, and by dropping `part.strip()` from the join (the
    `if part.strip()` filter still removes empty / whitespace-only
    parts so we never emit `\n\n` between nothing). Non-text,
    non-tool_result blocks (e.g. image blocks) continue to fall through
    to `_stringify`, which JSON-encodes them; the patch deliberately
    does not try to preserve those byte-for-byte, since they have no
    canonical text bytes to be faithful to.

  * `content` else-branch (defensive fallback for unusual shapes) —
    fixed by the same rule: keep raw `_stringify(content)` bytes and
    use `.strip()` only in the empty-skip filter.
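
The common rule across the three shapes is "use `.strip()` to decide, never to mutate". A small sketch of that rule follows; the helper names `keep_user_text` and `join_user_parts` are hypothetical, and the `"\n\n"` separator is an assumption based on the description above.

```python
def keep_user_text(content: str):
    """Return content byte-for-byte, or None if it is empty or whitespace-only."""
    # strip() is used only as a filter; the returned value is untouched.
    return content if content.strip() else None


def join_user_parts(parts):
    """Join non-blank parts verbatim, dropping empty or whitespace-only ones."""
    kept = [part for part in parts if part.strip()]
    return "\n\n".join(kept)
```

Because the kept strings are never rewritten, a downstream hash of the message can match a hash of the original instruction text, including any trailing newline.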

Adds regression tests covering string-content trailing newline /
leading whitespace / internal whitespace / empty / whitespace-only,
list-content single-block byte-faithful / multi-block join / empty-
part filter / non-text non-tool_result block json-encoded, and the
fallback else-branch on a non-str non-list content payload.

* fix(tests): run byte-faithful suite in CI (declare hypothesis, drop module skip)

The module-level `pytest.importorskip("hypothesis")` skipped the ENTIRE
test file when hypothesis was absent — not just the property test, but
also the byte-faithful regression suite this PR adds and the pre-existing
reasoning-extraction / session-selection tests. hypothesis was in neither
the dev dependency group nor uv.lock, and CI installs via
`uv sync --all-packages --all-extras --locked`, so it was never present:
the file collected to "0 items / 1 skipped" and CI was green-but-empty.

Declare hy…