Add CoreWeave Sandbox and W&B environment support by matthoare117-wandb · Pull Request #1698 · harbor-framework/harbor

matthoare117-wandb · 2026-05-22T02:17:07Z

Summary

Adds CoreWeave Sandboxes and W&B Sandboxes as Harbor cloud execution environments.
This PR includes:

cwsandbox environment support using the CoreWeave Sandbox SDK
wandb environment support using wandb.sandbox
Environment factory registration and EnvironmentType entries for both
Unit test coverage for lifecycle, env propagation, file transfer, command execution, cleanup, and W&B secret handling
Docs update listing CoreWeave Sandboxes and W&B Sandboxes as cloud sandbox options
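The factory registration and EnvironmentType entries listed above can be pictured as a small registry. This is a minimal sketch, not Harbor's actual code: only the `cwsandbox` and `wandb` values come from this PR, while `BaseEnvironment`, `register`, and `create_environment` are hypothetical stand-ins. The real factory may map types to classes differently.

```python
from enum import Enum
from typing import Any, Callable, Dict, Type


class EnvironmentType(str, Enum):
    # Hypothetical subset: only "cwsandbox" and "wandb" come from this PR.
    DOCKER = "docker"
    CWSANDBOX = "cwsandbox"
    WANDB = "wandb"


class BaseEnvironment:
    """Hypothetical stand-in for Harbor's environment base class."""

    def __init__(self, **kwargs: Any) -> None:
        self.config = kwargs


_REGISTRY: Dict[EnvironmentType, Type[BaseEnvironment]] = {}


def register(env_type: EnvironmentType) -> Callable[[Type[BaseEnvironment]], Type[BaseEnvironment]]:
    """Class decorator that records which class implements an EnvironmentType."""

    def decorator(cls: Type[BaseEnvironment]) -> Type[BaseEnvironment]:
        if env_type in _REGISTRY:
            raise ValueError(f"{env_type.value!r} is already registered")
        _REGISTRY[env_type] = cls
        return cls

    return decorator


@register(EnvironmentType.CWSANDBOX)
class CWSandboxEnvironment(BaseEnvironment):
    """CoreWeave Sandbox-backed environment (body omitted in this sketch)."""


@register(EnvironmentType.WANDB)
class WandbEnvironment(CWSandboxEnvironment):
    """W&B variant that reuses the cwsandbox implementation."""


def create_environment(env_type: str, **kwargs: Any) -> BaseEnvironment:
    """Map a string such as "wandb" to its registered class and instantiate it."""
    try:
        cls = _REGISTRY[EnvironmentType(env_type)]
    except (ValueError, KeyError) as exc:
        # ValueError: unknown enum value; KeyError: known value with no registered class.
        raise ValueError(f"unsupported environment type: {env_type!r}") from exc
    return cls(**kwargs)
```

With this shape, `create_environment("wandb", cpus=2)` returns a `WandbEnvironment`, which is also a `CWSandboxEnvironment`. Keeping the decorator next to each class means a new backend is one enum entry plus one decorated class, and accidental double registration fails loudly.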

Implementation Notes

cwsandbox is the CoreWeave Sandbox-backed Harbor environment. It handles start/stop, command execution, file transfer, resource and environment-variable mapping, and cleanup through the CoreWeave Sandbox SDK.
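The lifecycle described above (start with mapped resources, exec with env propagation, guaranteed cleanup) might look roughly like the following. This is a hypothetical sketch: `FakeSandboxClient` replaces the CoreWeave Sandbox SDK, whose method names are not shown in this PR, and the `cd ... && env ... bash -c ...` command shape is an assumption about how per-call cwd and env could be forwarded.

```python
import asyncio
import shlex
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class ExecResult:
    stdout: str
    return_code: int


class FakeSandboxClient:
    """In-memory stand-in for a sandbox SDK client (hypothetical, not the CoreWeave API)."""

    def __init__(self) -> None:
        self.started = False
        self.stopped = False
        self.resources: Dict[str, int] = {}
        self.commands: List[str] = []

    async def start(self, cpus: int, memory_mb: int) -> None:
        self.started = True
        self.resources = {"cpus": cpus, "memory_mb": memory_mb}

    async def run(self, command: str) -> ExecResult:
        self.commands.append(command)
        return ExecResult(stdout="", return_code=0)

    async def stop(self) -> None:
        self.stopped = True


class SandboxEnvironment:
    """Sketch of the start/exec/stop lifecycle of an SDK-backed environment."""

    def __init__(self, client: FakeSandboxClient, cpus: int, memory_mb: int,
                 persistent_env: Optional[Dict[str, str]] = None) -> None:
        self.client = client
        self.cpus = cpus
        self.memory_mb = memory_mb
        self.persistent_env = dict(persistent_env or {})

    async def start(self) -> None:
        # Resource mapping: task-level limits become sandbox creation arguments.
        await self.client.start(cpus=self.cpus, memory_mb=self.memory_mb)

    async def exec(self, command: str, cwd: Optional[str] = None,
                   env: Optional[Dict[str, str]] = None) -> ExecResult:
        # Per-call env overrides persistent env; cwd and values are shell-quoted.
        merged = {**self.persistent_env, **(env or {})}
        parts: List[str] = []
        if cwd:
            parts.append(f"cd {shlex.quote(cwd)} &&")
        if merged:
            assignments = " ".join(f"{k}={shlex.quote(v)}" for k, v in sorted(merged.items()))
            parts.append(f"env {assignments}")
        parts.append(f"bash -c {shlex.quote(command)}")
        return await self.client.run(" ".join(parts))

    async def stop(self) -> None:
        await self.client.stop()

    async def __aenter__(self) -> "SandboxEnvironment":
        await self.start()
        return self

    async def __aexit__(self, *exc_info: object) -> None:
        # Cleanup runs even if the body raised, so sandboxes are not leaked.
        await self.stop()
```

Wrapping the lifecycle in an async context manager keeps cleanup on the failure path: an exception inside the `async with` body still reaches `stop()` before propagating.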
wandb is the W&B Sandbox-backed Harbor environment. It reuses the cwsandbox implementation but authenticates through wandb.sandbox and uses W&B Sandbox secret handling.
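One way to reuse the cwsandbox implementation while swapping auth and secret handling is to subclass it and override only those hooks. This sketch is illustrative: the class names, the `CWSANDBOX_API_KEY` variable, and the suffix-based rule for deciding which variables count as secrets are assumptions, not details from this PR. `WANDB_API_KEY` is W&B's usual credential variable, but how `wandb.sandbox` actually resolves credentials is not shown here.

```python
import os
from typing import Dict, Mapping, Optional, Tuple


class MissingCredentialsError(RuntimeError):
    pass


class CWSandboxEnvironment:
    # Hypothetical variable name; the real cwsandbox credential lookup may differ.
    API_KEY_ENV_VAR = "CWSANDBOX_API_KEY"

    def __init__(self, env: Optional[Mapping[str, str]] = None,
                 source: Optional[Mapping[str, str]] = None) -> None:
        # `source` defaults to the process environment; tests can inject a dict.
        self._source: Mapping[str, str] = os.environ if source is None else source
        self.api_key = self._resolve_api_key()
        self.plain_env, self.secrets = self._split_env(dict(env or {}))

    def _resolve_api_key(self) -> str:
        key = self._source.get(self.API_KEY_ENV_VAR)
        if not key:
            raise MissingCredentialsError(f"{self.API_KEY_ENV_VAR} is not set")
        return key

    def _split_env(self, env: Dict[str, str]) -> Tuple[Dict[str, str], Dict[str, str]]:
        # Base behaviour: every variable is passed through as plain env.
        return env, {}


class WandbEnvironment(CWSandboxEnvironment):
    """Reuses everything above; only credential lookup and secret routing change."""

    API_KEY_ENV_VAR = "WANDB_API_KEY"
    # Assumed rule for illustration: names with these suffixes are treated as secrets.
    SECRET_SUFFIXES = ("_API_KEY", "_TOKEN", "_SECRET")

    def _split_env(self, env: Dict[str, str]) -> Tuple[Dict[str, str], Dict[str, str]]:
        secrets = {k: v for k, v in env.items() if k.endswith(self.SECRET_SUFFIXES)}
        plain = {k: v for k, v in env.items() if k not in secrets}
        return plain, secrets
```

Because `_resolve_api_key` reads `API_KEY_ENV_VAR` from the class, the subclass changes one constant and one method; in a full implementation, start, exec, file transfer, and stop would be inherited unchanged.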

Validation

Ran a full terminal-bench/terminal-bench-2-1 run with n_attempts: 3 on both cwsandbox and wandb.
Result screenshots: CWSandbox; Wandb Sandbox (images not included).

vercel · 2026-05-22T02:17:11Z

@matthoare117-wandb is attempting to deploy a commit to the Harbor Framework Team on Vercel.

A member of the Team first needs to authorize it.

github-actions · 2026-05-25T18:29:33Z

Enjoy a better diff viewing experience by clicking one of these URLs:

* fix(terminus-2): reset per-run state and attribute step exceptions in multi-step trials (#1566) Multi-step support added in PR #1234 made the trial layer call agent.run() once per step but did not update Terminus2, which stores per-trial state on the instance. Three categories of bugs result: 1. Trajectory step IDs are non-sequential. The initial-prompt Step appends with step_id=1 hardcoded, but _trajectory_steps persists across run() calls. After step 2 we get [1,2,3,1,2,3,...] which fails Pydantic validation in _dump_trajectory(): all terminus-2 multi-step trials fail. 2. Per-run state accumulators leak across steps. _api_request_times, _trajectory_steps, _subagent_metrics, _subagent_rollout_details, _summarization_count, _session_id, _pending_completion, _pending_subagent_refs, _pending_handoff_prompt, _timestamped_markers are all written but never reset. Concrete consequences: - All step_results' metadata.api_request_times_msec reference the same growing list (Python aliasing) -> per-step latency tracking unusable. - Step N's trajectory.json contains all of steps 1..N (quadratic disk usage, downstream consumers see duplicated content). - All per-step trajectory.json files share one session_id. - If summarization fires in step 1, every later step's reported n_input_tokens / cost_usd is inflated by step 1's summarization cost. 3. Trial._execute_step_agent only catches asyncio.TimeoutError and NonZeroAgentExitCodeError. Any other exception (LLM errors, network errors, validation errors, anything from a subprocess agent) bubbles to trial-level. step_result.exception_info stays None on the failing step and remaining steps are silently aborted. Fix: - Add Terminus2._reset_per_run_state(), called at the top of run(). Clears all per-trial accumulators. A user-provided session_id (kwarg) is preserved via a new _user_provided_session_id attribute. 
- Widen Trial._execute_step_agent's except to Exception, matching the sibling _verify_step (line 603) and the caller of _run_step_setup (line 638). The explicit abort at trial.py:673 (`if exception_info and not verifier_result: break`) still fires when needed; the trial smartly continues if the verifier still produced a result. Verified against a 2-step task: 1/1 trial, mean reward 1.0, 0 exceptions, distinct session ids per step, distinct api_request_times_msec per step. Verified against a step-1-timeout-step-2-recovers task: step 1 records TimeoutError, step 2 still runs with fully isolated state, trial reward 0.5 (mean of 0 + 1.0). Co-authored-by: Cursor <cursoragent@cursor.com> * fix(islo): drop redundant compose overlay (broken by merge skew with #1599) (#1639) PR #1559 (docker-compose support) introduced `_write_ca_overlay`, which bind-mounted the VM's CA bundle into the `main` service and set NODE_EXTRA_CA_CERTS / SSL_CERT_FILE / REQUESTS_CA_BUNDLE. PR #1599 (merged 2 minutes earlier) had just removed the `_VM_CA_BUNDLE` constant and the equivalent `docker run -v` mount, because the redundant CA mount caused `dpkg` to fail installing `ca-certificates` inside the container — the runner image already trusts the gateway's MITM certs via its base CA store. Neither PR rebased on the other. Upstream main currently references `_VM_CA_BUNDLE` at 4 call sites inside `_write_ca_overlay` with no matching definition. The module imports (Python late-binds names in function bodies) but compose-mode tasks crash with `NameError: name '_VM_CA_BUNDLE' is not defined` the moment a sandbox starts. Fix: drop the provider-side overlay entirely. 
Removed: - `_write_ca_overlay` method and its caller in `_start_compose` - `_COMPOSE_CA_OVERLAY_NAME` constant - the `-f` flag for the overlay in `_compose_file_flags` - the two overlay unit tests and the overlay assertion at test_islo.py:1280 Daytona's DinD compose path (daytona.py:461) already works without any provider-side overlay — tasks declare their own locale + env in their compose/Dockerfile. Matching that contract on islo as well. Added a regression test (`TestComposeFileFlagsHasNoProviderOverlay`) that asserts no `docker-compose-islo-*` path is injected into the `-f` flags. Verified end-to-end against api.islo.dev with the oracle agent on examples/tasks/hello-mcp (compose-mode): build + compose-up + verifier complete cleanly, reward 1.0. Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(tensorlake): preserve env state on snapshot restore (#1637) * snapshot fixes * fix * fixed * [Ready for Review] Update GDB adapter dependency and invocation (#1527) * Update GDB adapter dependency and invocation Pin the adapter to lica-gdb 0.2.1 and remove the adapter's conflicting gdb console script so generation uses the explicit module entry point. Made-with: Cursor * Update GDB registry dataset docs Made-with: Cursor * Update GDB parity review links Made-with: Cursor * Add GDB adapter CLI alias Made-with: Cursor * Add separate verifier environments (#1655) * Add separate verifier environments * Add separate verifier changelog and compose env compatibility * Handle verifier artifact staging collisions * minor updates. * Minor fixes. * Update skills. Add blog post. * v0.7.0 * Remove internal trial timeout retries (#1628) * Fix task.toml writing. * Fix task.toml writing. * Add Novita environment support to Harbor (#1025) * Add Novita environment support to Harbor - Introduced NovitaEnvironment class for integration with Novita's cloud sandbox service. - Implemented end-to-end and unit tests for NovitaEnvironment functionality. 
* Fix CI failures: type errors, lint, and pytest collection crash - Add type: ignore comments for novita_sandbox SDK type issues - Move sys.exit() guard into __main__ block so pytest collection doesn't crash - Add template reuse test phase to e2e integration test Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * Fix COPY instruction parsing and timeout_sec=0 handling - Skip COPY --from=... instructions (multi-stage builds) - Filter out COPY flags (--chown, --chmod) before extracting source path - Use explicit None check for timeout_sec to allow timeout_sec=0 Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * Address Devin review: internet flag, default timeout, multi-source COPY - Set can_disable_internet to False (not yet supported by Novita SDK) - Change default exec timeout from 60s to 0 (no timeout), matching e2b - Handle multi-source COPY instructions (COPY a.py b.py /dest/) Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * Fix Windows path separator in upload_dir remote paths Use PurePosixPath for remote sandbox paths to ensure forward slashes on all platforms. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * Change default exec timeout from 0 to 300s The novita_sandbox SDK defaults to 60s internally when 0 is passed. Use 300s (5 minutes) to avoid premature termination of long-running agent and verifier commands. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * Fix build error log index and defer API base URL resolution - Use logs[-1] instead of logs[-2] for build failure error message - Move NOVITA_BASE_URL lookup from class definition to __init__, consistent with NOVITA_API_KEY handling Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * Handle null logs in build failure error reporting Use `status.get("logs") or []` instead of `status.get("logs", [])` to handle API returning `"logs": null`. 
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * Wrap _http_client.aclose() in try/except in stop() Prevent transport-level errors during HTTP client cleanup from propagating out of stop() and masking the trial outcome. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * Preserve sandbox when delete=False for debugging When stop(delete=False) is called, skip killing the sandbox and closing the HTTP client so the sandbox remains running for debugging purposes. This aligns with how other environments (e.g. GKE) handle the delete flag. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * novita: use alias endpoint for template lookup and fix stale alias recovery - Replace _api_list_templates + iteration with direct GET /templates/aliases/{alias} endpoint for O(1) template lookup instead of scanning all templates - Add stale alias recovery in _api_create_template: on 403 "Alias already used", look up the stale template via alias endpoint, delete it, then retry creation - Include API key suffix in template alias to avoid cross-account conflicts - Increase build timeout from 600s to 1200s for heavy Dockerfiles - Add _MIN_MEMORY_MB_PER_CPU constant (512 MB/CPU) - Update tests to cover new alias endpoint behavior (44 tests passing) Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * novita: auto-recover from stale cached templates on sandbox creation When _find_template_by_alias returns a template ID that no longer exists in the backend (alias registered but build failed/incomplete), AsyncSandbox would raise a SandboxException("404: template not found"). Now start() catches this case, deletes the stale template via REST API, and triggers a fresh build before retrying sandbox creation. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * novita: include last 5 log lines in build failure error message Previously only the last log line was shown, which was often just "Postprocessing finished. Cleaning up..." instead of the actual error. 
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * feat(novita): upload COPY files via S3 pre-signed URL to fix 413 errors * chore: update parity_summary.csv [skip ci] * Fix review issues and CI failures in Novita environment - Add _merge_env(env) call in exec() so persistent env vars (--ae flags, task [environment.env] config) are correctly forwarded to sandbox commands - Add user parameter to exec(), is_dir(), is_file() to match BaseEnvironment interface (fixes type-check invalid-method-override errors) - Close HTTP client in stop(delete=False) to prevent resource leak; update test to assert aclose is called - Fix uv.lock: missing [[package]] header before networkx entry caused TOML parse errors that broke all CI checks; regenerate lockfile cleanly Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com> * Fix exec() to respect user parameter via _resolve_user The user parameter was accepted but never used — all commands ran as root. Now calls _resolve_user(user) to honour the orchestrator-set default_user (e.g. task agent.user / verifier.user from task.toml). Novita SDK's user parameter is Literal["root", "user"], so map any non-root resolved user to "user"; add Literal import accordingly. 
Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com> * Add preflight() and chmod 777 on log dirs in Novita environment - Add preflight() classmethod to validate NOVITA_API_KEY before any trials are queued, giving immediate feedback instead of failing mid-job - chmod 777 agent/verifier log directories after creation in start() so non-root agent/verifier users can write reward files and logs - Update start() test mocks to handle both foreground (healthcheck) and background (exec) sandbox.commands.run call patterns Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com> * style: ruff format test_novita.py Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com> * Fix template name slash escaping and cwd quoting in exec - Replace '/' with '__' in template alias construction so org/name task names (e.g. harbor/hello-world) don't break REST API URL paths - Use shlex.quote(effective_cwd) in exec() to handle paths with spaces or shell metacharacters safely Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com> * Use timeout=0 (no limit) as default in exec, aligning with E2B timeout_sec or 0 matches E2B and the Novita SDK docs where 0 means no connection time limit, avoiding premature 300s cutoffs on long-running agent setup or verifier scripts. 
Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com> * Update src/harbor/environments/novita.py Co-authored-by: devin-ai-integration[bot] <158243242+devin-ai-integration[bot]@users.noreply.github.com> * fix: deal with build conflict error and enhance Dockerfile handling in NovitaEnvironment * refactor: move novita-sandbox to optional extra, matching other cloud providers - Move `novita-sandbox` from main deps to `[novita]` optional extra - Add `dockerfile-parse` to `novita` extra (was only in `e2b`, but novita.py needs it) - Include `harbor[novita]` in the `cloud` bundle - Wrap SDK imports in try/except with `_HAS_NOVITA` flag, following the same lazy-import pattern introduced for daytona/e2b/modal in the upstream refactor - Raise `MissingExtraError` in `preflight()` when novita-sandbox is not installed - Regenerate uv.lock Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com> * fix: add _HAS_NOVITA guard in __init__ for clear MissingExtraError Without this guard, instantiating NovitaEnvironment when novita-sandbox is not installed raises a raw NameError (on DockerfileParser) instead of a helpful MissingExtraError with install instructions. Follows the same pattern as E2BEnvironment and RunloopEnvironment. Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com> * Update src/harbor/environments/novita.py Co-authored-by: devin-ai-integration[bot] <158243242+devin-ai-integration[bot]@users.noreply.github.com> * Update src/harbor/environments/novita.py Co-authored-by: devin-ai-integration[bot] <158243242+devin-ai-integration[bot]@users.noreply.github.com> * fix: import EnvironmentCapabilities in Novita environment Add the missing capabilities import after migrating NovitaEnvironment to the new capabilities API so ruff and ty can resolve the type. 
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * fix: update Novita capability tests Update Novita environment tests to assert the new capabilities API after migrating away from deprecated properties. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * fix: fix file upload endpoint --------- Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com> Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com> Co-authored-by: devin-ai-integration[bot] <158243242+devin-ai-integration[bot]@users.noreply.github.com> * Minor fixes for ruff. * Minor fixes for type check (#1665) * Simplify trial flow (#1672) * Refactor trial execution by shape * Clean up trial helper typing * Skip Windows container hello world without Docker * Partial refactor. * Improve artifact handler. * Minor multi step fixes. * Make artifact handler paths operation scoped * Fix CI after trial flow cleanup * Keep download dir excludes explicit * Rename download dir exclusions helper * Address artifact exclusion review comments * Avoid duplicate single-step artifact recovery * Avoid double stop after cancellation --------- Co-authored-by: gabeorlanski <gabeorlanski@gmail.com> * fix(terminus-2): make tmux send-keys dash-proof and improve send-keys error messages (#1657) - _tmux_send_keys: append `--` end-of-options marker to the `tmux send-keys -t <session>` prefix so keys beginning with `-` (e.g. `-x`, `-Lfoo`) are treated as literal key arguments rather than being parsed as tmux options. - _send_blocking_keys / _send_non_blocking_keys: include `command` (truncated to 100 chars), `return_code`, `stderr`, and `stdout` in the raised RuntimeError to make intermittent send-keys failures easier to diagnose from logs. - tests: update _extract_send_keys_payload helper for the new `--` separator and add coverage for keys starting with `-` and for the enriched failure messages. 
Co-authored-by: Cursor <cursoragent@cursor.com> * [codex] add repeatable skill inputs (#1674) * add repeatable skill inputs * Register injected skills for Cursor CLI * Use Cursor native skills directory * Simplify skill resolution * Make injected skills readable by agents * Address skill input review comments * Reject relative task skills dir for injected skills * Add skills CLI alias * Rename injected skill config to skills * Add runtime skills job example * Trim runtime skills example config * [codex] add repeatable extra docker compose overlays (#1676) * add repeatable extra docker compose overlays * preserve modal compose build markers * preserve cloud compose file precedence * Guard extra compose by environment capability * Rename extra compose config paths * Revert "Rename extra compose config paths" This reverts commit 5c531c6d5a7117d6e1fdf9d58e01a8e088dd002e. * Add extra compose job example * Address extra compose example comments * Nest extra compose job example * Fix skills merge. 
* [codex] Add runtime MCP config support (#1675) * Add runtime MCP config support * Use extra compose overlay for MCP proof example * Remove MCP proof example volume * Use Python base image in MCP proof task * Document MCP proof compose context * Trim MCP proof job defaults * Embed MCP proof runtime config * [codex] Add extra instruction path support (#1682) * feat: add support for --extra-instruction-paths * Add extra instruction path support * Fix lock equality env serialization * Fix lock equality for digest-backed paths --------- Co-authored-by: ZHAO Jin-Xiang <xiaoxiangmoe@gmail.com> * v0.7.1 * fix(terminus): use UTF-8 byte length for tmux send-keys size checks (#1680) * Update reward output documentation (#1684) Update based on change in #1620 * Add minimal verifier extension hook (#1653) * Add minimal verifier extension hook Add a small verifier factory hook that allows jobs to provide an optional custom verifier by import path while keeping the existing task verification flow as the default. This enables job-specific verification to supplement task-specific checks. For example, a job can attach generic trajectory evaluators, policy checks, or run-level scoring logic across many tasks without rebuilding, copying, or modifying those task definitions. The hook keeps task authorship and job evaluation concerns separate: tasks continue to define their normal verification, and jobs can opt into additional verifier behavior only when needed. Default behavior is unchanged when no custom verifier is configured. Signed-off-by: Anuradha Karuppiah <26330987+AnuradhaKaruppiah@users.noreply.github.com> * Tighten verifier extension contract Introduce BaseVerifier and VerifierContext so custom verifiers receive a stable construction context while the built-in verifier keeps legacy kwargs compatibility. Require verifier outputs to be VerifierResult before assigning them to trial results, preserving Harbor aggregation semantics for built-in and imported verifiers. 
Keep legacy import-path constructors working through an adapter that enforces the return contract. Signed-off-by: Anuradha Karuppiah <26330987+AnuradhaKaruppiah@users.noreply.github.com> * Reject unused verifier kwargs Fail fast when verifier kwargs are provided without a verifier import path, since the built-in verifier does not consume arbitrary extension kwargs. This makes CLI/config mistakes visible instead of silently dropping values like --verifier-kwarg foo=bar. Signed-off-by: Anuradha Karuppiah <26330987+AnuradhaKaruppiah@users.noreply.github.com> * Fix verifier factory test patch Update Windows multi-step verifier tests to patch VerifierFactory.create_verifier_from_config after trial verification moved behind the factory hook. Signed-off-by: Anuradha Karuppiah <26330987+AnuradhaKaruppiah@users.noreply.github.com> * Simplify verifier extension constructor * Simplify verifier factory contract * Fix skills merge example config paths --------- Signed-off-by: Anuradha Karuppiah <26330987+AnuradhaKaruppiah@users.noreply.github.com> Co-authored-by: Alex Shaw <alexgshaw64@gmail.com> * Minor improvements. * fix: fail opencode runs on error events (#1658) * Update Novita to latest SDK build flow (#1688) * Add Novita environment support to Harbor - Introduced NovitaEnvironment class for integration with Novita's cloud sandbox service. - Implemented end-to-end and unit tests for NovitaEnvironment functionality. * Fix CI failures: type errors, lint, and pytest collection crash - Add type: ignore comments for novita_sandbox SDK type issues - Move sys.exit() guard into __main__ block so pytest collection doesn't crash - Add template reuse test phase to e2e integration test Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * Fix COPY instruction parsing and timeout_sec=0 handling - Skip COPY --from=... 
instructions (multi-stage builds) - Filter out COPY flags (--chown, --chmod) before extracting source path - Use explicit None check for timeout_sec to allow timeout_sec=0 Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * Address Devin review: internet flag, default timeout, multi-source COPY - Set can_disable_internet to False (not yet supported by Novita SDK) - Change default exec timeout from 60s to 0 (no timeout), matching e2b - Handle multi-source COPY instructions (COPY a.py b.py /dest/) Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * Fix Windows path separator in upload_dir remote paths Use PurePosixPath for remote sandbox paths to ensure forward slashes on all platforms. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * Change default exec timeout from 0 to 300s The novita_sandbox SDK defaults to 60s internally when 0 is passed. Use 300s (5 minutes) to avoid premature termination of long-running agent and verifier commands. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * Fix build error log index and defer API base URL resolution - Use logs[-1] instead of logs[-2] for build failure error message - Move NOVITA_BASE_URL lookup from class definition to __init__, consistent with NOVITA_API_KEY handling Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * Handle null logs in build failure error reporting Use `status.get("logs") or []` instead of `status.get("logs", [])` to handle API returning `"logs": null`. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * Wrap _http_client.aclose() in try/except in stop() Prevent transport-level errors during HTTP client cleanup from propagating out of stop() and masking the trial outcome. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * Preserve sandbox when delete=False for debugging When stop(delete=False) is called, skip killing the sandbox and closing the HTTP client so the sandbox remains running for debugging purposes. 
This aligns with how other environments (e.g. GKE) handle the delete flag. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * novita: use alias endpoint for template lookup and fix stale alias recovery - Replace _api_list_templates + iteration with direct GET /templates/aliases/{alias} endpoint for O(1) template lookup instead of scanning all templates - Add stale alias recovery in _api_create_template: on 403 "Alias already used", look up the stale template via alias endpoint, delete it, then retry creation - Include API key suffix in template alias to avoid cross-account conflicts - Increase build timeout from 600s to 1200s for heavy Dockerfiles - Add _MIN_MEMORY_MB_PER_CPU constant (512 MB/CPU) - Update tests to cover new alias endpoint behavior (44 tests passing) Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * novita: auto-recover from stale cached templates on sandbox creation When _find_template_by_alias returns a template ID that no longer exists in the backend (alias registered but build failed/incomplete), AsyncSandbox would raise a SandboxException("404: template not found"). Now start() catches this case, deletes the stale template via REST API, and triggers a fresh build before retrying sandbox creation. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * novita: include last 5 log lines in build failure error message Previously only the last log line was shown, which was often just "Postprocessing finished. Cleaning up..." instead of the actual error. 
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * feat(novita): upload COPY files via S3 pre-signed URL to fix 413 errors * chore: update parity_summary.csv [skip ci] * Fix review issues and CI failures in Novita environment - Add _merge_env(env) call in exec() so persistent env vars (--ae flags, task [environment.env] config) are correctly forwarded to sandbox commands - Add user parameter to exec(), is_dir(), is_file() to match BaseEnvironment interface (fixes type-check invalid-method-override errors) - Close HTTP client in stop(delete=False) to prevent resource leak; update test to assert aclose is called - Fix uv.lock: missing [[package]] header before networkx entry caused TOML parse errors that broke all CI checks; regenerate lockfile cleanly Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com> * Fix exec() to respect user parameter via _resolve_user The user parameter was accepted but never used — all commands ran as root. Now calls _resolve_user(user) to honour the orchestrator-set default_user (e.g. task agent.user / verifier.user from task.toml). Novita SDK's user parameter is Literal["root", "user"], so map any non-root resolved user to "user"; add Literal import accordingly. 
Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com> * Add preflight() and chmod 777 on log dirs in Novita environment - Add preflight() classmethod to validate NOVITA_API_KEY before any trials are queued, giving immediate feedback instead of failing mid-job - chmod 777 agent/verifier log directories after creation in start() so non-root agent/verifier users can write reward files and logs - Update start() test mocks to handle both foreground (healthcheck) and background (exec) sandbox.commands.run call patterns Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com> * style: ruff format test_novita.py Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com> * Fix template name slash escaping and cwd quoting in exec - Replace '/' with '__' in template alias construction so org/name task names (e.g. harbor/hello-world) don't break REST API URL paths - Use shlex.quote(effective_cwd) in exec() to handle paths with spaces or shell metacharacters safely Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com> * Use timeout=0 (no limit) as default in exec, aligning with E2B timeout_sec or 0 matches E2B and the Novita SDK docs where 0 means no connection time limit, avoiding premature 300s cutoffs on long-running agent setup or verifier scripts. 
Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com> * Update src/harbor/environments/novita.py Co-authored-by: devin-ai-integration[bot] <158243242+devin-ai-integration[bot]@users.noreply.github.com> * fix: deal with build conflict error and enhance Dockerfile handling in NovitaEnvironment * refactor: move novita-sandbox to optional extra, matching other cloud providers - Move `novita-sandbox` from main deps to `[novita]` optional extra - Add `dockerfile-parse` to `novita` extra (was only in `e2b`, but novita.py needs it) - Include `harbor[novita]` in the `cloud` bundle - Wrap SDK imports in try/except with `_HAS_NOVITA` flag, following the same lazy-import pattern introduced for daytona/e2b/modal in the upstream refactor - Raise `MissingExtraError` in `preflight()` when novita-sandbox is not installed - Regenerate uv.lock Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com> * fix: add _HAS_NOVITA guard in __init__ for clear MissingExtraError Without this guard, instantiating NovitaEnvironment when novita-sandbox is not installed raises a raw NameError (on DockerfileParser) instead of a helpful MissingExtraError with install instructions. Follows the same pattern as E2BEnvironment and RunloopEnvironment. Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com> * Update src/harbor/environments/novita.py Co-authored-by: devin-ai-integration[bot] <158243242+devin-ai-integration[bot]@users.noreply.github.com> * Update src/harbor/environments/novita.py Co-authored-by: devin-ai-integration[bot] <158243242+devin-ai-integration[bot]@users.noreply.github.com> * fix: import EnvironmentCapabilities in Novita environment Add the missing capabilities import after migrating NovitaEnvironment to the new capabilities API so ruff and ty can resolve the type. 
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * fix: update Novita capability tests Update Novita environment tests to assert the new capabilities API after migrating away from deprecated properties. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * fix: fix file upload endpoint * fix: integrate Novita SDK template builds Use the Novita SDK template builder directly while preserving Harbor's Dockerfile COPY handling, and pin the alpha SDK version without enabling global prerelease resolution. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * fix: pin Novita sandbox domain Use the regional Novita sandbox endpoint consistently so local domain overrides cannot route template operations to the wrong API host. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * fix: avoid Novita SDK import during test collection Load Novita SDK modules only when the Novita environment actually needs them so pytest can collect E2B and Novita tests in the same process without duplicate protobuf descriptor registration. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com> Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com> Co-authored-by: devin-ai-integration[bot] <158243242+devin-ai-integration[bot]@users.noreply.github.com> * Fix EnvironmentConfig deprecation warnings on default construction. Migrate legacy memory/storage fields in a before validator instead of Field(deprecated=...) plus an after validator, and reject conflicting legacy and modern resource values. Closes #1693 Co-authored-by: Cursor <cursoragent@cursor.com> * Estimate cursor-cli cost from usage via LiteLLM Cursor CLI stream-json reports token usage on result events but not dollar cost. Parse optional totalCost when present and otherwise estimate from per-category token counts using LiteLLM pricing. 
Co-authored-by: Cursor <cursoragent@cursor.com> * Add built-in pricing for Cursor Composer models in cursor-cli. LiteLLM does not list cursor/composer models, so estimate cost from token usage using Cursor's published rates before falling back to LiteLLM. Co-authored-by: Cursor <cursoragent@cursor.com> * [codex] Add resource enforcement policies (#1697) * Add resource enforcement policies * Pre flight check. * Fix CHANGELOG breaking changes for resource enforcement policies. Document removed task resource defaults and stricter validation instead of incorrectly claiming --cpus/--memory repurposed numeric overrides. Co-authored-by: Cursor <cursoragent@cursor.com> --------- Co-authored-by: Cursor <cursoragent@cursor.com> * v0.8.0 * Fix resource default test after provider-default change (#1701) * fix tests on main * chore: rerun CI * Document job sharing (#1706) * feat(viewer): add ←/→ trial navigation, ⌥+←/→ tab cycling, persistent tab across trials, and X/N position indicator on the trial page (#1705) * docs(atif): refresh trajectory format page to v1.7 (#1704) The trajectory format docs page still advertised ATIF-v1.4 as current and stopped its supported-versions list at v1.4, while the canonical RFC (rfcs/0001-trajectory-format.md) has been at v1.7 for several releases. Bump the example schema_version strings to ATIF-v1.7 and extend the Schema Versions section with v1.5, v1.6, and v1.7 entries summarized from the RFC's Version History. No code changes; docs only. * Add PR diff links workflow with manual dispatch. (#1716) Post devinreview and diffshub links when PRs open, and allow testing on existing PRs via workflow_dispatch. 
Co-authored-by: Cursor <cursoragent@cursor.com> * feat: add Openclaw installed agent (#1661) * feat: add openclaw installed agent * Cleanup commit * save full session turns * NeMo-Flow Integration * cleanup * update defaults * fix test for updated defaults * Fix tests for new defaults * Fix lint error * Remove nemoflow from PR Signed-off-by: Sam Oluwalana <soluwalana@nvidia.com> * refactor(openclaw): generalize provider config normalization Address review feedback: drop NVIDIA-specific code paths from the OpenClaw plugin so it works generically across any OpenAI-compatible provider. - Replace `_merge_nvidia_base_url_from_env` and `_normalize_nvidia_models_provider` with provider-agnostic `_merge_provider_base_url_from_env` and `_normalize_provider_models_schema` that derive the provider from `--model` (e.g. `openai/gpt-4.1` -> `OPENAI_BASE_URL`). - Remove the hardcoded NVIDIA default base URL; users select a custom provider via env or `openclaw_config`. - Update class docstring to use `openai/*` as the generic example. - Rewrite the NVIDIA-themed unit tests to cover the generic behavior with `openai/*`. The `nvidia` entry in the env-var forwarding switch is retained alongside ~15 other providers (anthropic, openai, google, ...) as a plain provider registry, since removing it would break existing `nvidia/*` model selections. 
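A provider-agnostic base-URL lookup can derive the environment variable name from the `--model` prefix, as in the `openai/gpt-4.1` -> `OPENAI_BASE_URL` example above. The sketch below is illustrative only. The function names and the hyphen-to-underscore normalization are assumptions, not the OpenClaw adapter's actual helpers.

```python
from typing import Mapping, Optional


def provider_from_model(model: str) -> str:
    """Return the provider prefix of a 'provider/model' string."""
    provider, sep, _ = model.partition("/")
    if not sep or not provider:
        raise ValueError(f"expected 'provider/model', got {model!r}")
    return provider


def base_url_env_var(model: str) -> str:
    # "openai/gpt-4.1" -> "OPENAI_BASE_URL"; hyphens become underscores (assumed).
    return provider_from_model(model).upper().replace("-", "_") + "_BASE_URL"


def resolve_base_url(model: str, env: Mapping[str, str]) -> Optional[str]:
    """Look up a custom base URL for the model's provider, if one is set."""
    return env.get(base_url_env_var(model))
```

In real use, `env` would be `os.environ`. Accepting a mapping keeps the function easy to test.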
Signed-off-by: Bryan Bednarski <bbednarski@nvidia.com> * feature(api): multi-provider compatibility for openclaw Signed-off-by: Bryan Bednarski <bbednarski@nvidia.com> --------- Signed-off-by: Sam Oluwalana <soluwalana@nvidia.com> Signed-off-by: Bryan Bednarski <bbednarski@nvidia.com> Co-authored-by: Bryan Bednarski <bbednarski@nvidia.com> Co-authored-by: Alex Shaw <alexgshaw64@gmail.com> * Add GPU support to GKE environment (#1640) * Add GPU support to GKE environment * Address PR comments - Early failure if an unsupported GPU type is provided - Increase the timeout to 20 minutes when GPUs are selected - Support direct gke-accelerator values as gpu_types * Adjust GPU count retrieval to use _effective_gpus for consistency * Paginate dataset metadata queries past Supabase row cap (#1719) * Paginate dataset metadata queries past Supabase row cap. Fixes harbor download and run truncating package datasets at 1,000 tasks. Co-authored-by: Cursor <cursoragent@cursor.com> * Format test_registry_db_client.py with ruff. Co-authored-by: Cursor <cursoragent@cursor.com> --------- Co-authored-by: Cursor <cursoragent@cursor.com> * Add TPU support to harbor and GKE environment (#1652) * Address PR comments - Early failure if an unsupported GPU type is provided - Increase the timeout to 20 minutes when GPUs are selected - Support direct gke-accelerator values as gpu_types * Adjust GPU count retrieval to use _effective_gpus for consistency * Add TPU support to environment configuration. This change allows environments to properly support and validate TPU requirements, improving task execution flexibility. * Add TPU support to GKE environment. This update introduces a mapping for TPU types, enhances the GKEEnvironment class to handle TPU configurations, and updates unit tests to validate TPU capabilities and configurations alongside existing GPU support.
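The pagination fix for dataset metadata queries (#1719) follows a simple pattern. The client requests fixed-size ranges until a short page comes back. It no longer trusts a single query, which the server silently caps at 1,000 rows. The sketch below is minimal and assumes a hypothetical `fetch_page(start, end)` callable with inclusive bounds, matching how PostgREST-style range queries are usually expressed.

```python
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")


def fetch_all(
    fetch_page: Callable[[int, int], Sequence[T]], page_size: int = 1000
) -> list[T]:
    """Collect all rows by paging through inclusive [start, end] ranges."""
    rows: list[T] = []
    start = 0
    while True:
        page = fetch_page(start, start + page_size - 1)
        rows.extend(page)
        if len(page) < page_size:  # a short (or empty) page means we are done
            return rows
        start += page_size
```

`page_size` must not exceed the server-side cap. If it does, every page comes back short and the loop stops after the first request.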
* Update environment config model to use a dedicated class for TpuSpec * Add new TPU config to docs * Add --tpu_overrides to cli commands * Validate mutual exclusion of GPU and TPU requests in GKE * Fix merge conflicts * Update TPU configuration to use a single TpuSpec * Add Harbor Hub job result sharing blog post (#1732) * Add Harbor Hub job result sharing blog post. Co-authored-by: Cursor <cursoragent@cursor.com> * Update job sharing blog title and landing page banner. Co-authored-by: Cursor <cursoragent@cursor.com> --------- Co-authored-by: Cursor <cursoragent@cursor.com> * Add CoreWeave Sandbox and W&B environment support (#1698) * cw sandbox * doc fix * Fix (Add resource enforcement policies) * final fixes * comment cleanup * fix(cwsandbox): clean up backend sandbox on any failed start() * feat (Tensorlake): build sandboxes from OCI images instead of per-trial Dockerfile replay (#1734) * update tensorlake integration to use oci image build * Guard fcntl import for Windows test collection in tensorlake env * Add managing resources docs for task configuration. (#1735) Centralize enforcement policy and resource field guidance in the tasks docs. Co-authored-by: Cursor <cursoragent@cursor.com> * [Ready For Review] Fix artifact transfer archive collisions (#1733) * Fix artifact transfer archive collisions * Log transfer cleanup failures as warnings * Use RPC for task version resolution (#1736) * Allow tasks with docker_image to omit environment/Dockerfile (#1729) * Allow tasks with docker_image to omit environment/Dockerfile. Centralize environment definition validation and workdir helpers across supported providers. Co-authored-by: Cursor <cursoragent@cursor.com> * Fix docker_image-only force_build and Runloop workdir default. Use shared prebuilt-image selection when no Dockerfile exists, and restore /workspace fallback for Dockerfiles without WORKDIR. Co-authored-by: Cursor <cursoragent@cursor.com> * Apply prebuilt docker_image policy to all compose providers. 
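The `/workspace` fallback for Dockerfiles without `WORKDIR` works by reading the last `WORKDIR` of the final build stage. Harbor's `parse_dockerfile_workdir` uses the `dockerfile-parse` package. The stdlib-only sketch below is a simplified, hypothetical stand-in. It ignores line continuations and build arguments, and it does not track workdirs inherited from base images.

```python
import posixpath
import re
from typing import Optional

_INSTRUCTION = re.compile(r"^\s*(FROM|WORKDIR)\s+(.+?)\s*$", re.IGNORECASE)


def dockerfile_workdir(dockerfile_text: str, default: str = "/workspace") -> str:
    """Return the final stage's WORKDIR, or `default` if it never sets one."""
    workdir: Optional[str] = None
    for line in dockerfile_text.splitlines():
        match = _INSTRUCTION.match(line)
        if match is None:
            continue  # other instructions and comments are ignored
        instruction, argument = match.group(1).upper(), match.group(2)
        if instruction == "FROM":
            workdir = None  # a new stage; base-image workdirs are not tracked
        else:
            # Relative WORKDIR paths resolve against the previous WORKDIR.
            workdir = posixpath.normpath(posixpath.join(workdir or "/", argument))
    return workdir or default
```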
Use should_use_prebuilt_docker_image in Daytona, Modal, and Islo, and unify Docker validation. Co-authored-by: Cursor <cursoragent@cursor.com> * Fix lazy dockerfile_parse import and daytona formatting. Move DockerfileParser import inside parse_dockerfile_workdir so core environments do not require the optional extra. Co-authored-by: Alex Shaw <alexgshaw64@gmail.com> * Add dockerfile-parse to runloop optional extra. Runloop now uses parse_dockerfile_workdir for WORKDIR resolution when a Dockerfile is present. Co-authored-by: Alex Shaw <alexgshaw64@gmail.com> --------- Co-authored-by: Cursor <cursoragent@cursor.com> * feat: Add native agent adapter for Google Antigravity CLI (agy) (#1699) * feat: Add native agent adapter for Google Antigravity CLI (agy) * fix: remove unused import * fix: correctly configure agy settings.json and model * fix: update test to match new EnvironmentConfig defaults * fix: remove unused run_model variable * style: run ruff format on agy.py * refactor: rename agy agent to antigravity-cli Use antigravity-cli as the Harbor agent identifier and AntigravityCli adapter naming instead of agy. Co-authored-by: Cursor <cursoragent@cursor.com> * fix(antigravity-cli): use Path.write_text for ATIF export Address Devin review feedback and align with AGENTS.md file I/O guidance. Co-authored-by: Cursor <cursoragent@cursor.com> --------- Co-authored-by: Alex Shaw <alexgshaw64@gmail.com> Co-authored-by: Cursor <cursoragent@cursor.com> * feat: Daytona auto-snapshot, transient error handling, and SandboxBuildFailedError (#1457) * feat: Daytona auto-snapshot, transient error handling, and SandboxBuildFailedError Adds three major improvements to the Daytona environment backend: 1. **Auto-snapshot with content-based caching**: New `auto_snapshot` parameter on DaytonaEnvironment enables automatic snapshot creation keyed by a SHA256 hash of the full environment directory. 
Tasks sharing the same Dockerfile and fixtures reuse a single snapshot, eliminating redundant builds. Snapshots are region-aware (DAYTONA_TARGET) to prevent cross-region collisions. Per-snapshot async locks prevent redundant parallel creation. 2. **Transient error differentiation**: New `daytona_utils.py` module provides `is_transient_daytona_error()`, which distinguishes rate limits and capacity errors from non-recoverable failures. Retry callbacks use 10 attempts with 60s linear backoff for transient errors vs 3 attempts with exponential backoff for others, which dramatically improves reliability under load. 3. **SandboxBuildFailedError**: New non-retryable exception for failed sandbox builds (bad Dockerfile, snapshot in ERROR state). Stops wasting retry budget on builds that will never succeed. Detected both in `_create_sandbox()` and `_wait_for_snapshot()`. Supporting additions: - `container_cache.py`: Hash utilities for environment directories and Dockerfiles, plus task analysis helpers for predicting snapshot counts - DinD auto-snapshot support with image-hash-based naming - `ephemeral=True` flag on all sandbox creation calls - `assume_global_snapshot` for optimistic handling of shared snapshots invisible to the GET API Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * fix: remove region_id param not in current Daytona SDK Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * refactor: remove DinD auto-snapshot additions, restore main's DinD start(). DinD snapshot management was not in scope for this PR. Restores _DaytonaDinD.start() to main's original implementation. Removes _get_dind_snapshot_name, _ensure_dind_auto_snapshot, _create_dind_snapshot methods and unused hashlib import.
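The retry split in point 2 can be summarized as a small policy function. This is an illustrative sketch only, and several parts of it are assumptions:

- the marker strings;
- the 2 s exponential base;
- reading "60s linear backoff" as a delay that grows by 60 s per retry.

`retry_plan` is a hypothetical name, not the `daytona_utils.py` API. The sketch also folds in the follow-up rule that `SandboxBuildFailedError` and `TimeoutError` are never retried.

```python
from dataclasses import dataclass

# Hypothetical substrings; the real classifier inspects Daytona SDK errors.
TRANSIENT_MARKERS = ("rate limit", "429", "capacity", "temporarily unavailable")


class SandboxBuildFailedError(Exception):
    """Stand-in for the non-retryable build-failure exception."""


@dataclass(frozen=True)
class RetryPlan:
    attempts: int
    delays: tuple[float, ...]  # seconds to wait before each retry


def is_transient(error: Exception) -> bool:
    message = str(error).lower()
    return any(marker in message for marker in TRANSIENT_MARKERS)


def retry_plan(error: Exception) -> RetryPlan:
    if isinstance(error, (SandboxBuildFailedError, TimeoutError)):
        return RetryPlan(attempts=1, delays=())  # never retried
    if is_transient(error):
        # 10 attempts, delay growing by 60 s per retry: 60, 120, ..., 540.
        return RetryPlan(attempts=10, delays=tuple(60.0 * i for i in range(1, 10)))
    # 3 attempts with exponential backoff (2 s base assumed): 2, 4.
    return RetryPlan(attempts=3, delays=tuple(2.0 * 2**i for i in range(2)))
```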
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * fix: don't retry SandboxBuildFailedError/TimeoutError, close RL client - Add _is_non_retryable() guard to all retry callbacks so SandboxBuildFailedError and TimeoutError are never retried - Close temporary AsyncDaytona client after RL-region snapshot builds to prevent HTTP session leaks Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * test(daytona): harden PR #1457 with unit tests and small fixes Add tests for daytona_utils retry classification and container_cache hashing. Stop treating invalid bearer tokens as transient, trim unused analyze helpers, evict idle per-snapshot locks, and document auto_snapshot ERROR behavior. Co-authored-by: Cursor <cursoragent@cursor.com> * refactor(daytona): extract snapshot service and collapse retry helpers Move snapshot lifecycle into daytona_snapshots.py with a single state resolver and SnapshotPolicy. Replace six retry callbacks with daytona_retry_callbacks(). Simplify _DaytonaDirect.start() via _resolve_start_sandbox_params() and remove the string-matched fallback catch. Co-authored-by: Cursor <cursoragent@cursor.com> * refactor(daytona): dedupe ensure_auto paths and add optional snapshot GET Collapse fast/slow auto-snapshot resolution into shared helpers and use a documented non-retrying GET for pre-create ERROR cleanup. Co-authored-by: Cursor <cursoragent@cursor.com> * feat: use Task.short_name for environment_name Add Task.short_name (delegates to package short_name, else task dir name) and pass it as environment_name so Daytona snapshot templates and container naming avoid registry org prefixes and slashes in paths. Co-authored-by: Cursor <cursoragent@cursor.com> * refactor(daytona): move modules into daytona/ package Group environment, snapshots, and utils under environments/daytona/ to match docker/ and singularity/. Default assume_global_snapshot to False so missing template snapshots fall back to Dockerfile builds. 
Co-authored-by: Cursor <cursoragent@cursor.com> * fix(container_cache): length-prefix paths in environment hash Avoid ambiguous SHA256 updates where a file path could concatenate with the next file's content. Adds a regression test for the ab/a+b case. Co-authored-by: Cursor <cursoragent@cursor.com> * fix(daytona): wait for concurrent snapshot create to become active Handle PENDING snapshots before create and wait for ACTIVE after already-exists/conflict errors instead of returning the name immediately. Co-authored-by: Cursor <cursoragent@cursor.com> * fix(container_cache): length-prefix file content in environment hash Extend domain-separated hashing so path and content bytes cannot be ambiguous across files (Devin review follow-up). Co-authored-by: Cursor <cursoragent@cursor.com> --------- Co-authored-by: Benjamin Feuer <penfever@gmail.com> Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Co-authored-by: Alex Shaw <alexgshaw64@gmail.com> Co-authored-by: Cursor <cursoragent@cursor.com> * Upload environment/ files for prebuilt docker_image tasks (#1737) * Upload environment/ to workdir for prebuilt docker_image tasks. When docker_image is set without a Dockerfile or docker-compose.yaml, environments copy non-empty environment/ into the container workdir at the end of start(). Co-authored-by: Cursor <cursoragent@cursor.com> * Fix CI: format tests and isolate cwsandbox environment_dir fixtures. Use a dedicated empty environment/ subdirectory so post-start uploads do not run during unit tests that assert exact exec call counts. Co-authored-by: Cursor <cursoragent@cursor.com> * Format cwsandbox test_wandb.py Co-authored-by: Cursor <cursoragent@cursor.com> * Fix cwsandbox tests to write Dockerfile under environment/. Aligns with environment_dir fixture so prebuilt-image allowance tests exercise the intended layout. Co-authored-by: Cursor <cursoragent@cursor.com> --------- Co-authored-by: Cursor <cursoragent@cursor.com> * downgrade logging. 
* Stop writing per-episode log folders in Terminus-2 (#1740) * Stop writing per-episode log folders in Terminus-2. Episode prompt/response/debug files are redundant now that trajectory.json captures each turn. Co-authored-by: Cursor <cursoragent@cursor.com> * Fix Terminus-2 tests after removing episode logging paths. Co-authored-by: Cursor <cursoragent@cursor.com> --------- Co-authored-by: Cursor <cursoragent@cursor.com> * [Ready for Review] Adapter | Review bot prompt update for agent reward hacking checks (#1747) * Update adapter review prompts * Update prompt based on some sanity check runs * Add benchmark identity leakage check * Add linear.review link to PR diff links workflow (#1749) * fix link. * v0.9.0 * claude_code: handle redacted_thinking content blocks (#1752) Anthropic's `redacted_thinking` is a standard, documented content block type that can appear in any assistant message when extended thinking is enabled. Its `data` field is opaque ciphertext that clients cannot decrypt — the contract is to pass it back unchanged on subsequent API calls, never to expose it as user-facing text. Today _extract_text_reasoning_tool_uses doesn't recognise the type, so the block falls through to the catch-all that `_stringify`s the whole block dict and appends the resulting JSON envelope to text_parts. Trajectories then carry an ATIF `message` like '{"type":"redacted_thinking","data":"…"}' in the assistant turn. On may26 there are 2,050 such steps across 127 trials in the bundled corpus, all claude-code paired with vendor-routed models (e.g. tencent/hy3-preview-20260421 via OpenRouter). OpenRouter additionally mis-uses the redacted_thinking envelope to pass through PLAIN reasoning from non-Anthropic models: `data` is `openrouter.reasoning:<b64>`, where the base64 decodes to plain JSON `{"text":"…","type":"reasoning.text"}`. That content isn't actually encrypted — it should land in reasoning_content like every other thinking block. 
Add a redacted_thinking branch before the generic fallback that: - if data starts with `openrouter.reasoning:`, b64-decodes the payload, parses the inner JSON, and appends the inner `text` to reasoning_parts; - otherwise drops the block. This preserves the API contract for genuine Anthropic ciphertext (it remains opaque) and stops the envelope JSON from polluting human-readable trajectory text. Updates the existing test_redacted_thinking_not_in_reasoning to assert the envelope is now absent from both text and reasoning (it previously only asserted absence from reasoning, accepting the stringified-into-text behaviour), and adds two new tests covering the OpenRouter decode and malformed-payload-dropped paths. Co-authored-by: Ubuntu <ubuntu@ip-172-31-32-163.ap-northeast-2.compute.internal> * claude_code: unwrap text content blocks in user-event tool_result loop (#1753) In _convert_events_to_trajectory, the user-event content loop already handles tool_result blocks specifically. Anything else falls through to `self._stringify(block)`, which JSON-encodes the whole block dict and appends the resulting envelope to text_parts. So a content block like {"type": "text", "text": "<10 KB of skill documentation>"} ends up in the ATIF user step's `message` as '{"type":"text","text":"Base directory for this skill: …"}' verbatim, and downstream renderers that expect `message` to be human text can't read it. Claude Code injects these text blocks as user content alongside the tool_result when a Skill is loaded (the block carries the skill's documentation). Saw 4 such steps in a recent harbor-index corpus scan on skillsbench × {glm-5.1, MiniMax/MiniMax-M2.7} runs. Fix: before the generic _stringify fallback, recognise `{"type":"text","text":<str>}` and surface its inner string. Non-text blocks and text blocks with non-string `text` still hit the stringify fallback, so behaviour for unknown shapes is unchanged.
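The new `redacted_thinking` branch can be sketched as a small helper. It either recovers plain reasoning from an OpenRouter envelope or signals that the block should be dropped. `redacted_thinking_text` is a hypothetical name. In the adapter, the recovered string would be appended to `reasoning_parts`.

```python
import base64
import json
from typing import Any, Mapping, Optional

OPENROUTER_PREFIX = "openrouter.reasoning:"


def redacted_thinking_text(block: Mapping[str, Any]) -> Optional[str]:
    """Return readable reasoning from a redacted_thinking block, or None to drop it."""
    data = block.get("data")
    if not isinstance(data, str) or not data.startswith(OPENROUTER_PREFIX):
        return None  # genuine Anthropic ciphertext: keep opaque, never render
    try:
        raw = base64.b64decode(data[len(OPENROUTER_PREFIX):], validate=True)
        inner = json.loads(raw)
    except ValueError:  # covers binascii.Error, UnicodeDecodeError, JSONDecodeError
        return None  # malformed payloads are dropped too
    text = inner.get("text") if isinstance(inner, dict) else None
    return text if isinstance(text, str) else None
```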
Adds test_user_event_text_content_block_unwrapped covering the end-to-end path through _convert_events_to_trajectory. Co-authored-by: Ubuntu <ubuntu@ip-172-31-32-163.ap-northeast-2.compute.internal> * fix(modal): default _ModalDirect.exec to non-login shell (#1744) The strategy-refactor PR (#1311) introduced `login=True` on the default `_ModalDirect.exec` path, which causes the underlying SDK call to use `bash -lc <cmd>`. A login shell re-sources `/etc/profile` and the shell's profile files, which **clobbers `PATH`** as set by the image's `ENV PATH=…` directives. This breaks any task that pins toolchains via image-level `ENV PATH`: - Go tasks lose `/usr/local/go/bin` (everything that does `go build`/`go test` fails) - Rust tasks lose `~/.cargo/bin` (cargo not found) - Anything with custom `pipx`/`uv`/Node prefixes baked into image layers gets reset to the inherited login default Reverting this single line to `login=False` restores the pre-#1311 `bash -c` behavior and preserves the image's PATH. The lower-level `_sdk_exec` still exposes `login` as a parameter, so strategies that genuinely want a login shell can opt in explicitly. Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com> * Add viewer sign-in and sync auth with the CLI (#1755) * Add viewer sign-in and sync auth with the CLI. Enable OAuth login/logout in the local viewer, pick up CLI credential changes via mtime-based cache invalidation, and align page headers with Harbor Hub. Co-authored-by: Cursor <cursoragent@cursor.com> * Fix credential sync detection on Windows. Use a content hash instead of mtime, which can be unchanged across rapid writes on Windows. Co-authored-by: Cursor <cursoragent@cursor.com> * Fix credential sync baseline after local writes. Set initialized state in note_credentials_written and isolate credential sync tests so they pass independently. 
Co-authored-by: Cursor <cursoragent@cursor.com> --------- Co-authored-by: Cursor <cursoragent@cursor.com> * fix(claude-code): preserve user-message bytes in ATIF trajectory (drop .strip()) (#1724) * [claude-code] preserve user message bytes (no .strip()) Downstream pipelines that hash the user step.message bytes for cross-harness equivalence checks rely on byte-identical comparisons against the canonical instruction.md. Stripping trailing/leading whitespace in the ATIF normalizer breaks those checks silently. `_convert_events_to_trajectory` accepts user-event content in three shapes; all three were applying `.strip()` to the persisted bytes: * `content: str` (the shape `claude --print -- "..."` emits): fixed by replacing `text = content.strip()` with `text = content` and tightening the existing truthy gate to `if text.strip():` so empty / whitespace-only entries are still dropped without mutating bytes in the non-empty case. * `content: list` (programmatic / SDK callers that wrap the instruction in `{"type": "text", "text": "..."}` blocks): fixed by extracting `block["text"]` verbatim instead of routing through `_stringify`, and by dropping `part.strip()` from the join (the `if part.strip()` filter still removes empty / whitespace-only parts so we never emit `\n\n` between nothing). Non-text non-tool_result blocks (e.g. image blocks) continue to fall through to `_stringify`, which json-encodes them; the patch deliberately does not try to preserve those byte-for-byte, since they have no canonical text bytes to be faithful to. * `content` else-branch (defensive fallback for unusual shapes): fixed by the same rule: keep raw `_stringify(content)` bytes and use `.strip()` only in the empty-skip filter.
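The three-shape rule above keeps the bytes and uses `.strip()` only to decide whether to skip. It can be sketched as one function. This is a simplified illustration. `user_message_text` and the `_stringify` stand-in are hypothetical, and `tool_result` blocks are skipped here because the real converter handles them separately.

```python
import json
from typing import Any, Optional


def _stringify(value: Any) -> str:
    # Stand-in for the converter's JSON-encoding fallback.
    return value if isinstance(value, str) else json.dumps(value)


def user_message_text(content: Any) -> Optional[str]:
    """Return user text with its bytes intact, or None if it is blank."""
    if isinstance(content, str):
        text = content  # no .strip(): trailing newlines are preserved
    elif isinstance(content, list):
        parts: list[str] = []
        for block in content:
            kind = block.get("type") if isinstance(block, dict) else None
            if kind == "tool_result":
                continue  # handled separately by the real converter
            if kind == "text" and isinstance(block.get("text"), str):
                parts.append(block["text"])  # verbatim
            else:
                parts.append(_stringify(block))  # e.g. image blocks
        # .strip() only filters blank parts; it never rewrites kept ones.
        text = "\n\n".join(part for part in parts if part.strip())
    else:
        text = _stringify(content)  # defensive fallback for unusual shapes
    return text if text.strip() else None
```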
Adds regression tests covering string-content trailing newline / leading whitespace / internal whitespace / empty / whitespace-only, list-content single-block byte-faithful / multi-block join / empty-part filter / non-text non-tool_result block json-encoded, and the fallback else-branch on a non-str non-list content payload. * fix(tests): run byte-faithful suite in CI (declare hypothesis, drop module skip) The module-level `pytest.importorskip("hypothesis")` skipped the ENTIRE test file when hypothesis was absent: not just the property test, but also the byte-faithful regression suite this PR adds and the pre-existing reasoning-extraction / session-selection tests. hypothesis was in neither the dev dependency group nor uv.lock, and CI installs via `uv sync --all-packages --all-extras --locked`, so it was never present: the file collected to "0 items / 1 skipped" and CI was green-but-empty. Declare hypothesis in [dependency-groups].dev (uv.lock updated) and import it normally at module top so the whole file collects and runs. Verified locally: pytest now collects 47 tests (was 0 / 1 skipped); all pass including the 2000-example property test. ruff check + format clean. * fix(opencode): include the user prompt as a user step in the ATIF trajectory (#1759) OpenCode trajectories had no source="user" step: _convert_events_to_trajectory only emitted agent steps, so the prompt was missing (the docstring even claimed a user step was synthesised, but the code never added one). OpenCode's `run --format=json` stream omits the prompt entirely (anomalyco/opencode#29997); it is only recoverable via `opencode export`. Capture the rendered instruction in run() and prepend a source="user" step, preferring OpenCode's own `user` event when present (forward-compatible with anomalyco/opencode#29998) and falling back to the instruction otherwise.
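Prepending the user step might look like the sketch below. It prefers an OpenCode `user` event and otherwise falls back to the rendered instruction. It then renumbers the agent steps so step IDs stay sequential. The event's `text` field and the exact step dictionary layout are assumptions for illustration, not OpenCode's or ATIF's exact schema.

```python
from typing import Any, Optional


def with_user_step(
    steps: list[dict[str, Any]],
    events: list[dict[str, Any]],
    instruction: str,
) -> list[dict[str, Any]]:
    """Prepend a source="user" step and renumber the rest sequentially."""
    if any(step.get("source") == "user" for step in steps):
        return steps  # already present
    user_event = next((e for e in events if e.get("type") == "user"), None)
    message: Optional[Any] = user_event.get("text") if user_event else None
    if not isinstance(message, str):
        message = instruction  # the run() instruction is the fallback
    first = {"step_id": 1, "source": "user", "message": message}
    rest = [{**step, "step_id": i} for i, step in enumerate(steps, start=2)]
    return [first, *rest]
```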
Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com> * Fix Claude Code trajectory conversion for duplicate events (#1741) Co-authored-by: Alex Shaw <alexgshaw64@gmail.com> * feat(gemini-cli): support Login with Google (oauth-personal) via credential upload (#1764) Adds opt-in "Login with Google" auth to the gemini-cli agent, mirroring the Codex agent's auth.json injection: - GEMINI_OAUTH_CREDS_PATH=<path> → upload that oauth_creds.json - GEMINI_FORCE_OAUTH=<truthy> → upload ~/.gemini/oauth_creds.json Default behavior (GEMINI_API_KEY / Vertex env) is unchanged. On opt-in, uploads oauth_creds.json to a staging dir, chowns it to the agent user (upload_file lands as root), copies it into ~/.gemini with 0600, and sets settings security.auth.selectedType=oauth-personal so headless mode uses the credential without prompting. The API key is not passed under OAuth; GOOGLE_CLOUD_PROJECT is still forwarded. Staged secrets are removed afterward. Verified: gemini unit suite passes (ruff + ty clean) and a real Docker run with GEMINI_FORCE_OAUTH=true completed hello-world (reward 1.0) authenticating via OAuth. Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com> * Network mode and optional allowlist (#1455) * Refactor: 'allow_internet_access' boolean attribute to 'internet' enum * Add require_internet_access field instead of replacing allow_internet Keep allow_internet unchanged to avoid breaking existing configs. Add a new require_internet_access boolean to annotate tasks that need internet. 
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * Rename require_internet_access to require_internet Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * Refactor task internet config to enum * Add per-role network policies * Default network policy to public * Use lowercase network modes * Add E2B dynamic network policies * Add E2B network policy example * Generalize network allowlist example * Support setup-only network allowlists * Support lifecycle network allowlists * Fix trial logger cleanup on init failure * Restore E2B sandbox timeout * Handle legacy allow_internet task configs * Restrict shared verifier network switching * Close trial log handlers in construction-only tests * Reject misplaced network policy fields * Scope network policy to trial phases and migrate E2B to update_network() (#1754) * Add first-class CLI flags for run-specific network allowlists. Expose --allow-host and --verifier-allow-host on harbor run/trials while keeping legacy extra_network_allowlists agent kwarg support. Co-authored-by: Cursor <cursoragent@cursor.com> * Scope network policy to trial phases and migrate E2B to update_network(). Apply environment baseline at env start, agent policy only during agent.run(), and verifier policy only during verifier.verify(); rename no_network to no-network and limit --allow-host to the agent phase. Use AsyncSandbox.update_network() with e2b>=2.25.0. Co-authored-by: Cursor <cursoragent@cursor.com> * Treat agent/verifier network fields as optional phase overrides. Split baseline vs phase network config, skip dynamic switches when phase matches baseline, add static/dynamic E2B matrix examples, and remove redundant explicit network_mode from tasks that inherit environment defaults. Co-authored-by: Cursor <cursoragent@cursor.com> * Split run-time allowlist flags and document network policy hierarchy. 
Replace --allow-host with --allow-environment-host (baseline) and --allow-agent-host (agent phase), and tighten task docs around baseline vs override resolution. Co-authored-by: Cursor <cursoragent@cursor.com> * Validate separate verifier network policy at init and warn on unused CLI hosts. Unify phase-switch validation for shared and separate verifier modes, route separate verifier plans through _network_plan, and warn when run-time allowlist flags are ignored on public baselines. Co-authored-by: Cursor <cursoragent@cursor.com> * Use None for shared verifier baseline to fix separate-mode validation. Shared mode no longer duplicates agent_env_baseline in verifier_env_baseline, so init validation can infer container layout without comparing baselines. Co-authored-by: Cursor <cursoragent@cursor.com> * Document phase-scoped network policy in skills and fix example drift. Restore no-network baselines on verifier examples after the phase-policy migration, fix matrix README paths, and update create-task/rewardkit skills. Co-authored-by: Cursor <cursoragent@cursor.com> * Bump task schema version to 1.3 for phase-scoped network policy. Update the TaskConfig default, harbor init/register paths, docs, skills, examples, and tests. Schema 1.2 tasks remain loadable. Co-authored-by: Cursor <cursoragent@cursor.com> * Remove unused Any import from trial module. Fixes ruff F401 ahead of merge into main CI. Co-authored-by: Cursor <cursoragent@cursor.com> * Merge allow-environment-host into inherited separate verifier baseline. When separate verifier mode falls back to [environment] without an explicit [verifier.environment], apply the same run-time host merge as the agent env. Co-authored-by: Cursor <cursoragent@cursor.com> * Fix viewer network policy display for phase overrides. [agent] and [verifier] no longer default to Public when network_mode is absent; show the inherited baseline instead. Add Verifier Environment Network when [verifier.environment] is set. 
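The baseline-plus-override resolution described above follows three rules:

- a phase inherits the environment baseline unless `[agent]` or `[verifier]` sets its own policy;
- run-time hosts are merged into the agent phase;
- a dynamic switch is planned only when the effective phase policy differs from the baseline.

The sketch below illustrates these rules. `NetworkPolicy`, `effective_policy`, and `phase_plan` are hypothetical names, and the policy model is deliberately reduced to a mode string plus an allowlist.

```python
import warnings
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class NetworkPolicy:
    mode: str  # e.g. "public" or "no-network"
    allowed_hosts: frozenset[str] = frozenset()


def effective_policy(
    baseline: NetworkPolicy,
    override: Optional[NetworkPolicy] = None,
    extra_hosts: Iterable[str] = (),
) -> NetworkPolicy:
    policy = override or baseline  # phases inherit the baseline by default
    hosts = frozenset(extra_hosts)
    if hosts and policy.mode == "public":
        warnings.warn("extra allowed hosts are ignored under a public policy")
        return policy
    if hosts:
        policy = NetworkPolicy(policy.mode, policy.allowed_hosts | hosts)
    return policy


def phase_plan(
    baseline: NetworkPolicy,
    agent_override: Optional[NetworkPolicy] = None,
    verifier_override: Optional[NetworkPolicy] = None,
    agent_extra_hosts: Iterable[str] = (),
) -> dict[str, Optional[NetworkPolicy]]:
    """Map each phase to the policy to switch to, or None if no switch is needed."""
    plan: dict[str, Optional[NetworkPolicy]] = {}
    for phase, policy in (
        ("agent", effective_policy(baseline, agent_override, agent_extra_hosts)),
        ("verifier", effective_policy(baseline, verifier_override)),
    ):
        plan[phase] = None if policy == baseline else policy
    return plan
```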
Co-authored-by: Cursor <cursoragent@cursor.com> * Fix windows multistep test fixtures for network plan resolution. Partially constructed MultiStepTrial mocks now include agent and environment config so _run_shared_verifier can resolve phase network policy. Co-authored-by: Cursor <cursoragent@cursor.com> --------- Co-authored-by: Cursor <cursoragent@cursor.com> * Fix CI lint and type errors after main merge. Build E2B allowlist options directly, narrow separate verifier baseline before phase switching, and drop an unused test import. Co-authored-by: Cursor <cursoragent@cursor.com> * Apply ruff formatting to network policy files. Co-authored-by: Cursor <cursoragent@cursor.com> * Rename trial run-time allowlist fields to extra_allowed_hosts. Keep --allow-agent-host and --allow-environment-host as CLI flags while mapping them to agent.extra_allowed_hosts and environment.extra_allowed_hosts. Co-authored-by: Cursor <cursoragent@cursor.com> * Add changelog entry for phase-scoped network policy. Co-authored-by: Cursor <cursoragent@cursor.com> --------- Co-authored-by: Boxuan Li <boxuanli@microsoft.com> Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Co-authored-by: Alex Shaw <alexgshaw64@gmail.com> Co-authored-by: Cursor <cursoragent@cursor.com> * v0.13.0 * Add job plugin support and refactor Harbor Hub upload (#1762) * Add job plugin support and refactor Harbor Hub upload as an internal plugin. Introduce --plugin for optional integrations, shared import-path loading, and implement upload via HarborHubUploadPlugin while keeping --upload as the CLI entry point. Co-authored-by: Cursor <cursoragent@cursor.com> * Fix missing TrialPaths import in environment factory. Restores the import removed during import_path refactor so lint and type checks pass. Co-authored-by: Cursor <cursoragent@cursor.com> * Fix CI lint and type errors in plugin upload code. Restore formatting and type the Harbor Hub visibility helper as PublicJobVisibility. 
Co-authored-by: Cursor <cursoragent@cursor.com> * Print job results before user plugin finalize and isolate plugin failures. Move finalize_job_plugins after the results table so a plugin error cannot hide completed run output, and log per-plugin finalize failures without blocking others. Co-authored-by: Cursor <cursoragent@cursor.com> * Add plugin configuration via --pk and job config plugins list. Support one CLI plugin with constructor kwargs, multiple plugins via job yaml, and pass kwargs through PluginConfig into plugin constructors. Co-authored-by: Cursor <cursoragent@cursor.com> * Rename JobPlugin lifecycle methods to on_job_start and on_job_end. Align plugin hooks with Harbor job lifecycle naming and update the upload plugin and tests accordingly. Co-authored-by: Cursor <cursoragent@cursor.com> * Resolve harbor.plugins entry points for --plugin short names. Add entry point lookup before plugin import, plus harbor plugins list for discovering installed plugins. Co-authored-by: Cursor <cursoragent@cursor.com> * Fix plugins module/package naming conflict. Rename the CLI typer module to plugins_cmd so harbor.cli.plugins remains a package for HarborHubUploadPlugin and other built-in plugin implementations. Co-authored-by: Cursor <cursoragent@cursor.com> * Apply ruff formatting to plugin-related files. Co-authored-by: Cursor <cursoragent@cursor.com> * Require plugins to implement on_job_end. Make BaseJobPlugin.on_job_end abstract so every plugin explicitly defines both lifecycle hooks instead of inheriting a silent no-op. Co-authored-by: Cursor <cursoragent@cursor.com> --------- Co-authored-by: Cursor <cursoragent@cursor.com> * Add harbor-langsmith plugin package for LangSmith integration. (#1702) Extract LangSmith job tracking into a workspace package that registers via harbor.plugins entry points and installs with harbor[langsmith]. 
Co-authored-by: Alex Shaw <alexgshaw64@gmail.com> Co-authored-by: Cursor <cursoragent@cursor.com> * Add harbor-langsmith publish script and PyPI package metadata. Pin harbor>=0.13.0 for the job plugin API and record Harbor authorship before publishing harbor-langsmith to PyPI. Co-authored-by: Cursor <cursoragent@cursor.com> * Fail fast on Harbor Hub auth errors when using --upload (#1781) * Fail fast on Harbor Hub auth errors when using --upload. Validate Hub auth before trials start and treat expired or invalid sessions as fatal instead of falling back to end-of-run batch upload. Co-authored-by: Cursor <cursoragent@cursor.com> * Handle stale auth gracefully in status and fix formatting. Catch Supabase auth errors during harbor auth status and invalid session checks so users see a login prompt instead of a traceback. Co-authored-by: Cursor <cursoragent@cursor.com> * Centralize Supabase session validation in auth layer. Add shared session helpers that map auth API failures to consistent errors, clear stale credentials on invalid refresh tokens, and reuse them from status checks, upload auth, and registry DB calls. Co-authored-by: Curso…

* [kimi-cli] Add OpenRouter as a supported provider (#1568) Allow `harbor run -a kimi-cli -m openrouter/<provider>/<model>` (e.g. `openrouter/moonshotai/kimi-k2.6`) by registering an `openrouter` entry in `_PROVIDER_CONFIG`. OpenRouter is OpenAI-compatible, so it reuses the `openai_legacy` provider type with `https://openrouter.ai/api/v1` and `OPENROUTER_API_KEY`. Without this, the agent raises `Unsupported provider 'openrouter' for kimi-cli` from `_build_config_json` because the model-name prefix (`openrouter`) isn't a registered key. Since the model name is split on the first `/` only, the part forwarded to kimi-cli (and on to OpenRouter) remains in the `<vendor>/<model>` form OpenRouter expects. * Fix Harbor upload handling for resumable Supabase storage (#1570) * Add TUS uploads. * Resumable publishing. * Fix ATIF RFC link in trajectory-format documentation (#1583) Fix ATIF RFC link in trajectory-format documentation. (The one near the end was fixed by a robot but the one near the top was missed.) * Fix terminus temp & cursor CLI. Closes #1586. * Add Tensorlake to sandbox providers list (#1585) * add tensorlake in sandbox provider list * update the tensorlake link to harbor page in tensorlake docs * fix(opencode): Allow any model provider to be specified with -m (#1590) * fix using snapshot (#1587) * v0.6.5 * Allow configuring Daytona connection_pool_maxsize via env kwargs (#1445) Forwarded through `DaytonaClientManager` into `DaytonaConfig` when the shared `AsyncDaytona` client is built. Pass via `--ek connection_pool_maxsize=N` (`=null` for unlimited). Bumps `daytona>=0.165.0`. Signed-off-by: rovle <lovre.pesut@gmail.com> Co-authored-by: Alex Shaw <alexgshaw64@gmail.com> * Support Devin CLI agent in Harbor (#1605) * Support Devin CLI agent in Harbor * Fix api server url * Clean up comments and update logging configuration Removed unnecessary comments and adjusted logging environment variables.
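The kimi-cli change depends on splitting the model name on the first `/` only, so `openrouter/moonshotai/kimi-k2.6` resolves to provider `openrouter` and forwards `moonshotai/kimi-k2.6` unchanged. The sketch below shows that lookup. The dict shape of `_PROVIDER_CONFIG` and the `resolve_provider` helper are assumptions; only the `openrouter` values (`openai_legacy`, the base URL, `OPENROUTER_API_KEY`) come from the commit message.

```python
# Hypothetical shape of the provider table; the openrouter values mirror the commit text.
_PROVIDER_CONFIG: dict[str, dict[str, str]] = {
    "openrouter": {
        "type": "openai_legacy",
        "base_url": "https://openrouter.ai/api/v1",
        "api_key_env": "OPENROUTER_API_KEY",
    },
}


def resolve_provider(model_name: str) -> tuple[dict[str, str], str]:
    """Split on the first '/' only, so 'openrouter/vendor/model' keeps 'vendor/model'."""
    provider, sep, model = model_name.partition("/")
    # A missing '/' or an unregistered prefix both surface as an unsupported provider.
    if not sep or provider not in _PROVIDER_CONFIG:
        raise ValueError(f"Unsupported provider '{provider}' for kimi-cli")
    return _PROVIDER_CONFIG[provider], model
```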
* Update opus version from 4.5 to 4.7 * Minor updates to changelog. * v0.6.6 * rewardkit: individual judge mode, per-criterion files, document extraction (#1606) * rewardkit: add individual judge mode, per-criterion files, document extraction * rewardkit: silence ty unresolved-import for optional markitdown * rewardkit: 0.1.3 * rewardkit: stable JSON Schema for individual-mode judge calls (#1611) * rewardkit: stable JSON Schema for individual-mode judge calls When `mode = "individual"`, rewardkit fires one structured-output LLM call per criterion. The old `_build_response_schema` used the criterion's name as the top-level property, so 60 differently-named criteria produced 60 distinct schema texts → 60 grammar compilations on Anthropic's side → busted the 20/min grammar-compilation rate limit and crashed the verifier. Single-criterion calls now return the flat `{"score", "reasoning"}` shape instead of a name-wrapped object. All individual-mode calls with the same output format share byte-identical schema text, hit the compilation cache, and never trip the rate limit. Multi-criterion (batched) mode is unchanged. `parse_judge_response` accepts both the new flat shape and the existing by-name shape, so any model that still returns the wrapped form keeps working. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * rewardkit: detect flat shape by value type, not name lookup The unwrap check in parse_judge_response keyed off whether the criterion's name was absent from data. That broke for criteria auto-named 'score' or 'reasoning' (e.g. description='Score the work' → name='score'): with the flat-shape response {"score": "yes", "reasoning": "ok"}, "score" IS in data, so the unwrap was skipped and data.get("score") returned a string instead of a dict, raising ValueError. Switch to value-type detection — flat shape has a leaf at data["score"], by-name shape has a nested dict — so the name collision is harmless.
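Value-type detection, as described in the rewardkit commit above, looks at what sits under `"score"` instead of checking whether the criterion's name is a key. Below is a sketch of that check. The function signature and the error handling are assumptions, so the real `parse_judge_response` may take different arguments or return a different type.

```python
from typing import Any


def parse_judge_response(data: dict[str, Any], criterion_name: str) -> dict[str, Any]:
    """Accept both the flat {"score", "reasoning"} shape and the by-name wrapped shape.

    A flat response has a leaf (non-dict) at data["score"]; a wrapped response
    stores a nested dict under the criterion's name, even when that name is "score".
    """
    if "score" in data and not isinstance(data["score"], dict):
        result = data  # flat shape
    else:
        result = data.get(criterion_name)  # by-name shape
    if not isinstance(result, dict) or "score" not in result:
        raise ValueError(f"Malformed judge response for criterion {criterion_name!r}")
    return result
```

A criterion auto-named `score` no longer confuses the check: `{"score": "yes", ...}` is treated as flat, while `{"score": {"score": "no", ...}}` is unwrapped.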
Adds three regression tests covering the 'score' / 'reasoning' edge cases. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * rewardkit: add --je / --judge flags + REWARDKIT_JUDGE override (#1609) * fix: build harbor-rewardkit into local dist for publish (#1608) * fix: oracle agent run fail in user agent mode (#1615) * Update Tensorlake integration to use the latest SDK (#1621) * unpin sdk version and update apis * fix lifecycle * api update * bump up the disk size * update * fix * change back to TaskGroup * improve test coverage * fix * fix: classify Anthropic/Bedrock prompt-too-long errors as context length (#1619) * fix: classify Anthropic prompt-too-long errors as context length Co-authored-by: Cursor <cursoragent@cursor.com> * fix: classify Bedrock input-too-long errors as context length Co-authored-by: Cursor <cursoragent@cursor.com> --------- Co-authored-by: Cursor <cursoragent@cursor.com> * Fix Daytona auth and rich verifier rewards (#1620) * fix(pi): Allow any model provider to be specified with -m (#1614) * fix(pi): Allow any model provider to be specified with -m * Run formatter * Fix retry exclude CLI override (#1622) * Speed up test suite (#1625) * fix: Handle deprecated modal API - remove usage of `Sandbox.mkdir` (#1630) * Update deprecated modal api * Remove comment difdf * islo.dev fix - docker in vm ca (#1599) * fix: redundant ca management in docker caused dpkg to fail installing the ca-certificates * test(islo): align unit tests with CA-mount removal and user kwarg refactor - Replace positive CA bundle bind-mount assertion with a negative one so the test guards against the redundant mount being reintroduced. - Rename the two user-wrapping tests and assert the user is forwarded via the SDK's user= kwarg instead of being baked into a su wrapper command.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Tomer Ezer <46822143+tomerezer@users.noreply.github.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * docs: add Islo as cloud sandbox provider (#1578) - Add Islo to the providers list in cloud-sandboxes.mdx - Note Islo support for multi-container deployments - Bump islo SDK pin to >=0.3.0 * feat(islo): add docker-compose support (#1559) * chore: update parity_summary.csv [skip ci] * feat(islo): add docker-compose support Adds a compose mode to the ISLO environment provider so multi-service tasks (e.g. examples/tasks/hello-mcp with an mcp-server sidecar) can run on islo. Mirrors the Daytona DinD pattern and reuses the shared compose templates from harbor.environments.docker. - Detects docker-compose.yaml in the task's environment dir; takes priority over the prebuilt-image / Dockerfile / runner branches - Builds & runs a multi-service compose project inside the islo VM with a conventional `main` service that the agent execs into - Two-hop file transfer (SDK -> VM temp -> docker compose cp main:) with a volume-mounted fast path for verifier/agent/artifacts log dirs - Honors allow_internet=False via the shared no-network overlay; declares the disable_internet capability when in compose mode - Writes an islo-specific TLS/CA overlay compose file at startup (kept off the shared templates) so the main service trusts the gateway's MITM certs and gets NODE_EXTRA_CA_CERTS / SSL_CERT_FILE / etc. - Compose-aware stop() (docker compose down --remove-orphans) and attach() (islo use ... -- bash -lc '<env> docker compose exec main bash') Adds 30 unit tests covering detection, env vars, file flags (templates, no-network, prebuilt swap, CA overlay), command builder, volume-mount mappings, exec/stop/attach routing, and file-transfer fast path + two-hop behavior. 
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * docs(islo): drop cross-provider references from compose comments Tighten the compose-mode comments to describe what islo does without naming sibling providers, since those mentions don't help a reader trying to understand the islo file in isolation. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(islo): address compose review feedback - Reserve Harbor compose infra env vars: a task or persistent env var named CPUS / MEMORY / CONTEXT_DIR / MAIN_IMAGE_NAME / HOST_*_LOGS_PATH / ENV_*_LOGS_PATH would previously silently shadow the infra value and break compose interpolation. Infra vars now win, with a warning logged on collision. - Sanitize compose project name to docker compose's required regex ([a-z0-9][a-z0-9_-]*); session_ids with dots, slashes, colons, or leading punctuation no longer surface as a confusing compose error. - Clarify the disable_internet capability docstring: it advertises whether the env CAN honor allow_internet=False, not whether it's currently doing so. - Replace 'replace(prefix, ...)' with explicit slicing in _compose_sandbox_log_path to be obviously correct without relying on the startswith guard above it. - Tighten compose-mode comments. Tests: - Replace the misnamed test_validate_raises_when_compose_yaml_missing_after_init (which never asserted a raise) with a real validator coverage test pair. - Add coverage for project-name sanitization (disallowed chars, leading punctuation), env-var precedence (infra wins), collision warning, disable_internet capability gating (compose vs non-compose, plus validator interaction with allow_internet=False), _write_ca_overlay shape and error path, and _wait_for_main_container success/timeout. 
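Two of the islo review fixes above are small enough to sketch directly. The first maps arbitrary session ids onto compose's `[a-z0-9][a-z0-9_-]*` project-name rule. The second lets infra variables override task variables, with a warning logged on collision. The helper names, the `-` replacement character, and the `harbor` fallback are hypothetical choices for illustration, not taken from the islo provider.

```python
import logging
import re

logger = logging.getLogger(__name__)

# Anything outside the allowed alphabet collapses to a single '-'.
_INVALID_CHARS = re.compile(r"[^a-z0-9_-]+")


def sanitize_compose_project_name(session_id: str) -> str:
    """Map an arbitrary session id onto docker compose's [a-z0-9][a-z0-9_-]* rule."""
    name = _INVALID_CHARS.sub("-", session_id.lower())
    # The first character must be a lowercase letter or digit.
    name = name.lstrip("_-")
    return name or "harbor"  # hypothetical fallback for ids with no usable characters


def merge_compose_env(user_env: dict[str, str], infra_env: dict[str, str]) -> dict[str, str]:
    """Infra variables win; collisions are logged instead of silently shadowing."""
    for key in sorted(user_env.keys() & infra_env.keys()):
        logger.warning("Ignoring env var %s: reserved for compose infrastructure", key)
    return {**user_env, **infra_env}
```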
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(islo): install docker compose plugin in compose mode E2E run against the real islo backend surfaced that the islo-runner image's docker doesn't ship the Compose v2 CLI plugin, so ``docker compose -p ...`` fails with ``unknown shorthand flag: 'p'`` because the docker CLI tries to parse ``-p`` as its own flag. Adds ``_ensure_compose_plugin`` which: - Probes ``docker compose version`` and skips if the plugin is already present. - Otherwise downloads the latest ``docker-compose-linux-<arch>`` binary into ``~/.docker/cli-plugins`` (works on Alpine and Debian-based VMs without a package manager) using whichever of curl/wget is available. Called once in ``_start_compose`` after the daemon is up. Verified: ``harbor run -p examples/tasks/hello-mcp --env islo --agent oracle`` now completes end-to-end with reward 1.0 against real islo (job 2026-04-30__15-55-05). Tests: 3 new cases — plugin already present (skip install), plugin missing (install via cli-plugins), install failure surfaces RuntimeError. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * Revert "fix(islo): install docker compose plugin in compose mode" The islo-runner image now ships with the Docker Compose v2 CLI plugin preinstalled, so the runtime install step is no longer needed. This reverts the runtime probe + plugin download from cli-plugins, the three associated unit tests, and saves ~10–15s on compose-mode cold start. 
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> --------- Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Co-authored-by: Adam Goldschmidt <adamgold7@gmail.com> * fix(terminus-2): reset per-run state and attribute step exceptions in multi-step trials (#1566) Multi-step support added in PR #1234 made the trial layer call agent.run() once per step but did not update Terminus2, which stores per-trial state on the instance. Three categories of bugs result: 1. Trajectory step IDs are non-sequential. The initial-prompt Step appends with step_id=1 hardcoded, but _trajectory_steps persists across run() calls. After step 2 we get [1,2,3,1,2,3,...] which fails Pydantic validation in _dump_trajectory(): all terminus-2 multi-step trials fail. 2. Per-run state accumulators leak across steps. _api_request_times, _trajectory_steps, _subagent_metrics, _subagent_rollout_details, _summarization_count, _session_id, _pending_completion, _pending_subagent_refs, _pending_handoff_prompt, _timestamped_markers are all written but never reset. Concrete consequences: - All step_results' metadata.api_request_times_msec reference the same growing list (Python aliasing) -> per-step latency tracking unusable. - Step N's trajectory.json contains all of steps 1..N (quadratic disk usage, downstream consumers see duplicated content). - All per-step trajectory.json files share one session_id. - If summarization fires in step 1, every later step's reported n_input_tokens / cost_usd is inflated by step 1's summarization cost. 3. Trial._execute_step_agent only catches asyncio.TimeoutError and NonZeroAgentExitCodeError. Any other exception (LLM errors, network errors, validation errors, anything from a subprocess agent) bubbles to trial-level. step_result.exception_info stays None on the failing step and remaining steps are silently aborted. 
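The aliasing consequence listed above (every step's `api_request_times_msec` pointing at one growing list) can be reproduced with a toy agent. It also shows why a per-run reset should rebind accumulators to fresh objects rather than call `.clear()`: clearing in place would also empty the list already stored in the previous step's metadata. `ToyAgent` and its methods are hypothetical stand-ins, not Terminus2's actual code.

```python
class ToyAgent:
    """Hypothetical agent that keeps per-run accumulators on the instance."""

    def __init__(self) -> None:
        self._api_request_times: list[float] = []

    def _reset_per_run_state(self) -> None:
        # Rebind to a fresh list. Calling .clear() here would also wipe the
        # list already handed out in the previous step's metadata.
        self._api_request_times = []

    def run(self, latencies: list[float], reset: bool = True) -> dict[str, list[float]]:
        if reset:
            self._reset_per_run_state()
        self._api_request_times.extend(latencies)
        # The metadata holds a reference to the list, not a copy.
        return {"api_request_times_msec": self._api_request_times}
```

With `reset=False` both steps return the same list object (`[1.0, 2.0]` after two runs). With the reset, each step keeps only its own timings.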
Fix: - Add Terminus2._reset_per_run_state(), called at the top of run(). Clears all per-trial accumulators. A user-provided session_id (kwarg) is preserved via a new _user_provided_session_id attribute. - Widen Trial._execute_step_agent's except to Exception, matching the sibling _verify_step (line 603) and the caller of _run_step_setup (line 638). The explicit abort at trial.py:673 (`if exception_info and not verifier_result: break`) still fires when needed; the trial smartly continues if the verifier still produced a result. Verified against a 2-step task: 1/1 trial, mean reward 1.0, 0 exceptions, distinct session ids per step, distinct api_request_times_msec per step. Verified against a step-1-timeout-step-2-recovers task: step 1 records TimeoutError, step 2 still runs with fully isolated state, trial reward 0.5 (mean of 0 + 1.0). Co-authored-by: Cursor <cursoragent@cursor.com> * fix(islo): drop redundant compose overlay (broken by merge skew with #1599) (#1639) PR #1559 (docker-compose support) introduced `_write_ca_overlay`, which bind-mounted the VM's CA bundle into the `main` service and set NODE_EXTRA_CA_CERTS / SSL_CERT_FILE / REQUESTS_CA_BUNDLE. PR #1599 (merged 2 minutes earlier) had just removed the `_VM_CA_BUNDLE` constant and the equivalent `docker run -v` mount, because the redundant CA mount caused `dpkg` to fail installing `ca-certificates` inside the container — the runner image already trusts the gateway's MITM certs via its base CA store. Neither PR rebased on the other. Upstream main currently references `_VM_CA_BUNDLE` at 4 call sites inside `_write_ca_overlay` with no matching definition. The module imports (Python late-binds names in function bodies) but compose-mode tasks crash with `NameError: name '_VM_CA_BUNDLE' is not defined` the moment a sandbox starts. Fix: drop the provider-side overlay entirely. 
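The `_VM_CA_BUNDLE` crash in the islo commit above follows from how Python resolves global names. A function body that references an undefined module-level name compiles and imports without complaint, and raises `NameError` only when the function actually runs. The function names below are hypothetical and chosen for illustration.

```python
def write_ca_overlay() -> dict[str, str]:
    # _VM_CA_BUNDLE is never defined in this module, yet this `def` executes fine:
    # the name is looked up only when the function body runs.
    return {"SSL_CERT_FILE": _VM_CA_BUNDLE}  # noqa: F821


def overlay_error() -> str:
    """Call the broken function and report the error it raises at call time."""
    try:
        write_ca_overlay()
    except NameError as exc:
        return str(exc)
    return "no error"
```

This is why such a merge skew slips past an import-only smoke test but fails as soon as a compose-mode sandbox starts. A type checker such as ty, or ruff's undefined-name rule, would flag it statically.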
Removed: - `_write_ca_overlay` method and its caller in `_start_compose` - `_COMPOSE_CA_OVERLAY_NAME` constant - the `-f` flag for the overlay in `_compose_file_flags` - the two overlay unit tests and the overlay assertion at test_islo.py:1280 Daytona's DinD compose path (daytona.py:461) already works without any provider-side overlay — tasks declare their own locale + env in their compose/Dockerfile. Matching that contract on islo as well. Added a regression test (`TestComposeFileFlagsHasNoProviderOverlay`) that asserts no `docker-compose-islo-*` path is injected into the `-f` flags. Verified end-to-end against api.islo.dev with the oracle agent on examples/tasks/hello-mcp (compose-mode): build + compose-up + verifier complete cleanly, reward 1.0. Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(tensorlake): preserve env state on snapshot restore (#1637) * snapshot fixes * fix * fixed * [Ready for Review] Update GDB adapter dependency and invocation (#1527) * Update GDB adapter dependency and invocation Pin the adapter to lica-gdb 0.2.1 and remove the adapter's conflicting gdb console script so generation uses the explicit module entry point. Made-with: Cursor * Update GDB registry dataset docs Made-with: Cursor * Update GDB parity review links Made-with: Cursor * Add GDB adapter CLI alias Made-with: Cursor * Add separate verifier environments (#1655) * Add separate verifier environments * Add separate verifier changelog and compose env compatibility * Handle verifier artifact staging collisions * minor updates. * Minor fixes. * Update skills. Add blog post. * v0.7.0 * Remove internal trial timeout retries (#1628) * Fix task.toml writing. * Fix task.toml writing. * Add Novita environment support to Harbor (#1025) * Add Novita environment support to Harbor - Introduced NovitaEnvironment class for integration with Novita's cloud sandbox service. - Implemented end-to-end and unit tests for NovitaEnvironment functionality. 
* Fix CI failures: type errors, lint, and pytest collection crash - Add type: ignore comments for novita_sandbox SDK type issues - Move sys.exit() guard into __main__ block so pytest collection doesn't crash - Add template reuse test phase to e2e integration test Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * Fix COPY instruction parsing and timeout_sec=0 handling - Skip COPY --from=... instructions (multi-stage builds) - Filter out COPY flags (--chown, --chmod) before extracting source path - Use explicit None check for timeout_sec to allow timeout_sec=0 Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * Address Devin review: internet flag, default timeout, multi-source COPY - Set can_disable_internet to False (not yet supported by Novita SDK) - Change default exec timeout from 60s to 0 (no timeout), matching e2b - Handle multi-source COPY instructions (COPY a.py b.py /dest/) Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * Fix Windows path separator in upload_dir remote paths Use PurePosixPath for remote sandbox paths to ensure forward slashes on all platforms. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * Change default exec timeout from 0 to 300s The novita_sandbox SDK defaults to 60s internally when 0 is passed. Use 300s (5 minutes) to avoid premature termination of long-running agent and verifier commands. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * Fix build error log index and defer API base URL resolution - Use logs[-1] instead of logs[-2] for build failure error message - Move NOVITA_BASE_URL lookup from class definition to __init__, consistent with NOVITA_API_KEY handling Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * Handle null logs in build failure error reporting Use `status.get("logs") or []` instead of `status.get("logs", [])` to handle API returning `"logs": null`. 
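The `or []` fix above matters because `dict.get`'s default applies only when the key is missing, not when the API sends an explicit `null`. The helper below is a hypothetical name used only for this sketch.

```python
from typing import Any


def build_logs(status: dict[str, Any]) -> list[str]:
    # status.get("logs", []) would return None for {"logs": None}, because the
    # default is used only for a *missing* key; `or []` also covers null.
    return status.get("logs") or []
```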
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * Wrap _http_client.aclose() in try/except in stop() Prevent transport-level errors during HTTP client cleanup from propagating out of stop() and masking the trial outcome. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * Preserve sandbox when delete=False for debugging When stop(delete=False) is called, skip killing the sandbox and closing the HTTP client so the sandbox remains running for debugging purposes. This aligns with how other environments (e.g. GKE) handle the delete flag. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * novita: use alias endpoint for template lookup and fix stale alias recovery - Replace _api_list_templates + iteration with direct GET /templates/aliases/{alias} endpoint for O(1) template lookup instead of scanning all templates - Add stale alias recovery in _api_create_template: on 403 "Alias already used", look up the stale template via alias endpoint, delete it, then retry creation - Include API key suffix in template alias to avoid cross-account conflicts - Increase build timeout from 600s to 1200s for heavy Dockerfiles - Add _MIN_MEMORY_MB_PER_CPU constant (512 MB/CPU) - Update tests to cover new alias endpoint behavior (44 tests passing) Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * novita: auto-recover from stale cached templates on sandbox creation When _find_template_by_alias returns a template ID that no longer exists in the backend (alias registered but build failed/incomplete), AsyncSandbox would raise a SandboxException("404: template not found"). Now start() catches this case, deletes the stale template via REST API, and triggers a fresh build before retrying sandbox creation. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * novita: include last 5 log lines in build failure error message Previously only the last log line was shown, which was often just "Postprocessing finished. Cleaning up..." instead of the actual error. 
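Keeping the last five log lines instead of only the last one, as the commit above describes, is a slice on the log list. The sketch assumes the logs arrive as a list of strings, and the message format is hypothetical.

```python
from typing import Optional


def build_failure_message(template: str, logs: Optional[list[str]], tail: int = 5) -> str:
    """Include the last `tail` log lines; the final line alone is often just cleanup noise."""
    lines = (logs or [])[-tail:]
    return f"Template build for {template} failed:\n" + "\n".join(lines)
```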
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * feat(novita): upload COPY files via S3 pre-signed URL to fix 413 errors * chore: update parity_summary.csv [skip ci] * Fix review issues and CI failures in Novita environment - Add _merge_env(env) call in exec() so persistent env vars (--ae flags, task [environment.env] config) are correctly forwarded to sandbox commands - Add user parameter to exec(), is_dir(), is_file() to match BaseEnvironment interface (fixes type-check invalid-method-override errors) - Close HTTP client in stop(delete=False) to prevent resource leak; update test to assert aclose is called - Fix uv.lock: missing [[package]] header before networkx entry caused TOML parse errors that broke all CI checks; regenerate lockfile cleanly Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com> * Fix exec() to respect user parameter via _resolve_user The user parameter was accepted but never used — all commands ran as root. Now calls _resolve_user(user) to honour the orchestrator-set default_user (e.g. task agent.user / verifier.user from task.toml). Novita SDK's user parameter is Literal["root", "user"], so map any non-root resolved user to "user"; add Literal import accordingly. 
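Since the Novita SDK's `user` parameter is `Literal["root", "user"]`, the resolved Harbor user has to be narrowed to one of those two values. The sketch below does that. The helper name is hypothetical, and so is the choice to treat `None` (no user resolved) as `root`, matching the earlier behaviour where every command ran as root.

```python
from typing import Literal, Optional

SandboxUser = Literal["root", "user"]


def to_sandbox_user(resolved_user: Optional[str]) -> SandboxUser:
    """Map any non-root resolved user to the SDK's generic 'user' account."""
    if resolved_user is None or resolved_user == "root":
        return "root"
    return "user"
```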
Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com> * Add preflight() and chmod 777 on log dirs in Novita environment - Add preflight() classmethod to validate NOVITA_API_KEY before any trials are queued, giving immediate feedback instead of failing mid-job - chmod 777 agent/verifier log directories after creation in start() so non-root agent/verifier users can write reward files and logs - Update start() test mocks to handle both foreground (healthcheck) and background (exec) sandbox.commands.run call patterns Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com> * style: ruff format test_novita.py Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com> * Fix template name slash escaping and cwd quoting in exec - Replace '/' with '__' in template alias construction so org/name task names (e.g. harbor/hello-world) don't break REST API URL paths - Use shlex.quote(effective_cwd) in exec() to handle paths with spaces or shell metacharacters safely Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com> * Use timeout=0 (no limit) as default in exec, aligning with E2B timeout_sec or 0 matches E2B and the Novita SDK docs where 0 means no connection time limit, avoiding premature 300s cutoffs on long-running agent setup or verifier scripts. 
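The template-alias and `cwd` quoting fixes above reduce to two string transformations. In the sketch below, `template_alias` and `wrap_with_cwd` are hypothetical helpers, and the real alias also carries other components (such as the API key suffix mentioned in another commit) that are omitted here.

```python
import shlex


def template_alias(task_name: str) -> str:
    """Replace '/' with '__' so org/name task names don't break REST URL paths."""
    return task_name.replace("/", "__")


def wrap_with_cwd(command: str, cwd: str) -> str:
    """Quote the working directory so spaces or shell metacharacters are safe."""
    return f"cd {shlex.quote(cwd)} && {command}"
```

`shlex.quote` leaves already-safe paths such as `/app` untouched and single-quotes everything else, so the common case stays readable in logs.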
Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com> * Update src/harbor/environments/novita.py Co-authored-by: devin-ai-integration[bot] <158243242+devin-ai-integration[bot]@users.noreply.github.com> * fix: deal with build conflict error and enhance Dockerfile handling in NovitaEnvironment * refactor: move novita-sandbox to optional extra, matching other cloud providers - Move `novita-sandbox` from main deps to `[novita]` optional extra - Add `dockerfile-parse` to `novita` extra (was only in `e2b`, but novita.py needs it) - Include `harbor[novita]` in the `cloud` bundle - Wrap SDK imports in try/except with `_HAS_NOVITA` flag, following the same lazy-import pattern introduced for daytona/e2b/modal in the upstream refactor - Raise `MissingExtraError` in `preflight()` when novita-sandbox is not installed - Regenerate uv.lock Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com> * fix: add _HAS_NOVITA guard in __init__ for clear MissingExtraError Without this guard, instantiating NovitaEnvironment when novita-sandbox is not installed raises a raw NameError (on DockerfileParser) instead of a helpful MissingExtraError with install instructions. Follows the same pattern as E2BEnvironment and RunloopEnvironment. Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com> * Update src/harbor/environments/novita.py Co-authored-by: devin-ai-integration[bot] <158243242+devin-ai-integration[bot]@users.noreply.github.com> * Update src/harbor/environments/novita.py Co-authored-by: devin-ai-integration[bot] <158243242+devin-ai-integration[bot]@users.noreply.github.com> * fix: import EnvironmentCapabilities in Novita environment Add the missing capabilities import after migrating NovitaEnvironment to the new capabilities API so ruff and ty can resolve the type. 
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * fix: update Novita capability tests Update Novita environment tests to assert the new capabilities API after migrating away from deprecated properties. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * fix: fix file upload endpoint --------- Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com> Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com> Co-authored-by: devin-ai-integration[bot] <158243242+devin-ai-integration[bot]@users.noreply.github.com> * Minor fixes for ruff. * Minor fixes for type check (#1665) * Simplify trial flow (#1672) * Refactor trial execution by shape * Clean up trial helper typing * Skip Windows container hello world without Docker * Partial refactor. * Improve artifact handler. * Minor multi step fixes. * Make artifact handler paths operation scoped * Fix CI after trial flow cleanup * Keep download dir excludes explicit * Rename download dir exclusions helper * Address artifact exclusion review comments * Avoid duplicate single-step artifact recovery * Avoid double stop after cancellation --------- Co-authored-by: gabeorlanski <gabeorlanski@gmail.com> * fix(terminus-2): make tmux send-keys dash-proof and improve send-keys error messages (#1657) - _tmux_send_keys: append `--` end-of-options marker to the `tmux send-keys -t <session>` prefix so keys beginning with `-` (e.g. `-x`, `-Lfoo`) are treated as literal key arguments rather than being parsed as tmux options. - _send_blocking_keys / _send_non_blocking_keys: include `command` (truncated to 100 chars), `return_code`, `stderr`, and `stdout` in the raised RuntimeError to make intermittent send-keys failures easier to diagnose from logs. - tests: update _extract_send_keys_payload helper for the new `--` separator and add coverage for keys starting with `-` and for the enriched failure messages. 
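The `--` end-of-options marker from the terminus-2 commit above is easiest to see in the argument vector itself. The sketch builds the `tmux send-keys` invocation and an enriched error of the kind the commit describes. Both helper names and the exact error format are hypothetical.

```python
import shlex


def tmux_send_keys_argv(session: str, keys: list[str]) -> list[str]:
    """'--' ends tmux option parsing, so keys such as '-x' are sent literally."""
    return ["tmux", "send-keys", "-t", session, "--", *keys]


def send_keys_error(command: str, return_code: int, stderr: str, stdout: str) -> RuntimeError:
    """Build a diagnosable error; the command is truncated to 100 characters."""
    return RuntimeError(
        f"tmux send-keys failed: command={command[:100]!r}, "
        f"return_code={return_code}, stderr={stderr!r}, stdout={stdout!r}"
    )
```

For example, `shlex.join(tmux_send_keys_argv("main", ["-x", "Enter"]))` renders as `tmux send-keys -t main -- -x Enter`.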
Co-authored-by: Cursor <cursoragent@cursor.com> * [codex] add repeatable skill inputs (#1674) * add repeatable skill inputs * Register injected skills for Cursor CLI * Use Cursor native skills directory * Simplify skill resolution * Make injected skills readable by agents * Address skill input review comments * Reject relative task skills dir for injected skills * Add skills CLI alias * Rename injected skill config to skills * Add runtime skills job example * Trim runtime skills example config * [codex] add repeatable extra docker compose overlays (#1676) * add repeatable extra docker compose overlays * preserve modal compose build markers * preserve cloud compose file precedence * Guard extra compose by environment capability * Rename extra compose config paths * Revert "Rename extra compose config paths" This reverts commit 5c531c6d5a7117d6e1fdf9d58e01a8e088dd002e. * Add extra compose job example * Address extra compose example comments * Nest extra compose job example * Fix skills merge. 
* [codex] Add runtime MCP config support (#1675) * Add runtime MCP config support * Use extra compose overlay for MCP proof example * Remove MCP proof example volume * Use Python base image in MCP proof task * Document MCP proof compose context * Trim MCP proof job defaults * Embed MCP proof runtime config * [codex] Add extra instruction path support (#1682) * feat: add support for --extra-instruction-paths * Add extra instruction path support * Fix lock equality env serialization * Fix lock equality for digest-backed paths --------- Co-authored-by: ZHAO Jin-Xiang <xiaoxiangmoe@gmail.com> * v0.7.1 * fix(terminus): use UTF-8 byte length for tmux send-keys size checks (#1680) * Update reward output documentation (#1684) Update based on change in #1620 * Add minimal verifier extension hook (#1653) * Add minimal verifier extension hook Add a small verifier factory hook that allows jobs to provide an optional custom verifier by import path while keeping the existing task verification flow as the default. This enables job-specific verification to supplement task-specific checks. For example, a job can attach generic trajectory evaluators, policy checks, or run-level scoring logic across many tasks without rebuilding, copying, or modifying those task definitions. The hook keeps task authorship and job evaluation concerns separate: tasks continue to define their normal verification, and jobs can opt into additional verifier behavior only when needed. Default behavior is unchanged when no custom verifier is configured. Signed-off-by: Anuradha Karuppiah <26330987+AnuradhaKaruppiah@users.noreply.github.com> * Tighten verifier extension contract Introduce BaseVerifier and VerifierContext so custom verifiers receive a stable construction context while the built-in verifier keeps legacy kwargs compatibility. Require verifier outputs to be VerifierResult before assigning them to trial results, preserving Harbor aggregation semantics for built-in and imported verifiers. 
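The verifier hook above has three rules: load a custom verifier by import path, reject verifier kwargs when no import path is given, and require the result to be a `VerifierResult`. A minimal sketch of all three follows. It assumes a `module:attribute` import-path format and a synchronous `verify()` method. `VerifierResult` here is a simplified stand-in, and `load_object`, `create_verifier`, and `checked_verify` are hypothetical helpers rather than Harbor's `VerifierFactory` API.

```python
import importlib
from dataclasses import dataclass
from typing import Any, Callable, Optional


@dataclass
class VerifierResult:
    """Simplified stand-in; Harbor's real VerifierResult has its own fields."""

    rewards: dict[str, float]


def load_object(import_path: str) -> Any:
    """Resolve 'package.module:Attribute' (the colon format is an assumption)."""
    module_name, _, attr = import_path.partition(":")
    if not attr:
        raise ValueError(f"Expected 'module:attribute', got {import_path!r}")
    return getattr(importlib.import_module(module_name), attr)


def create_verifier(
    import_path: Optional[str], kwargs: dict[str, Any], default_factory: Callable[[], Any]
) -> Any:
    # Fail fast: kwargs without an import path would otherwise be silently dropped.
    if import_path is None:
        if kwargs:
            raise ValueError("verifier kwargs require a verifier import path")
        return default_factory()
    return load_object(import_path)(**kwargs)


def checked_verify(verifier: Any) -> VerifierResult:
    """Enforce the return contract before the result reaches trial aggregation."""
    result = verifier.verify()
    if not isinstance(result, VerifierResult):
        raise TypeError(f"Verifier returned {type(result).__name__}, expected VerifierResult")
    return result
```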
Keep legacy import-path constructors working through an adapter that enforces the return contract. Signed-off-by: Anuradha Karuppiah <26330987+AnuradhaKaruppiah@users.noreply.github.com> * Reject unused verifier kwargs Fail fast when verifier kwargs are provided without a verifier import path, since the built-in verifier does not consume arbitrary extension kwargs. This makes CLI/config mistakes visible instead of silently dropping values like --verifier-kwarg foo=bar. Signed-off-by: Anuradha Karuppiah <26330987+AnuradhaKaruppiah@users.noreply.github.com> * Fix verifier factory test patch Update Windows multi-step verifier tests to patch VerifierFactory.create_verifier_from_config after trial verification moved behind the factory hook. Signed-off-by: Anuradha Karuppiah <26330987+AnuradhaKaruppiah@users.noreply.github.com> * Simplify verifier extension constructor * Simplify verifier factory contract * Fix skills merge example config paths --------- Signed-off-by: Anuradha Karuppiah <26330987+AnuradhaKaruppiah@users.noreply.github.com> Co-authored-by: Alex Shaw <alexgshaw64@gmail.com> * Minor improvements. * fix: fail opencode runs on error events (#1658) * Update Novita to latest SDK build flow (#1688) * Add Novita environment support to Harbor - Introduced NovitaEnvironment class for integration with Novita's cloud sandbox service. - Implemented end-to-end and unit tests for NovitaEnvironment functionality. * Fix CI failures: type errors, lint, and pytest collection crash - Add type: ignore comments for novita_sandbox SDK type issues - Move sys.exit() guard into __main__ block so pytest collection doesn't crash - Add template reuse test phase to e2e integration test Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * Fix COPY instruction parsing and timeout_sec=0 handling - Skip COPY --from=... 
instructions (multi-stage builds) - Filter out COPY flags (--chown, --chmod) before extracting source path - Use explicit None check for timeout_sec to allow timeout_sec=0 Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * Address Devin review: internet flag, default timeout, multi-source COPY - Set can_disable_internet to False (not yet supported by Novita SDK) - Change default exec timeout from 60s to 0 (no timeout), matching e2b - Handle multi-source COPY instructions (COPY a.py b.py /dest/) Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * Fix Windows path separator in upload_dir remote paths Use PurePosixPath for remote sandbox paths to ensure forward slashes on all platforms. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * Change default exec timeout from 0 to 300s The novita_sandbox SDK defaults to 60s internally when 0 is passed. Use 300s (5 minutes) to avoid premature termination of long-running agent and verifier commands. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * Fix build error log index and defer API base URL resolution - Use logs[-1] instead of logs[-2] for build failure error message - Move NOVITA_BASE_URL lookup from class definition to __init__, consistent with NOVITA_API_KEY handling Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * Handle null logs in build failure error reporting Use `status.get("logs") or []` instead of `status.get("logs", [])` to handle API returning `"logs": null`. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * Wrap _http_client.aclose() in try/except in stop() Prevent transport-level errors during HTTP client cleanup from propagating out of stop() and masking the trial outcome. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * Preserve sandbox when delete=False for debugging When stop(delete=False) is called, skip killing the sandbox and closing the HTTP client so the sandbox remains running for debugging purposes. 
This aligns with how other environments (e.g. GKE) handle the delete flag. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * novita: use alias endpoint for template lookup and fix stale alias recovery - Replace _api_list_templates + iteration with direct GET /templates/aliases/{alias} endpoint for O(1) template lookup instead of scanning all templates - Add stale alias recovery in _api_create_template: on 403 "Alias already used", look up the stale template via alias endpoint, delete it, then retry creation - Include API key suffix in template alias to avoid cross-account conflicts - Increase build timeout from 600s to 1200s for heavy Dockerfiles - Add _MIN_MEMORY_MB_PER_CPU constant (512 MB/CPU) - Update tests to cover new alias endpoint behavior (44 tests passing) Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * novita: auto-recover from stale cached templates on sandbox creation When _find_template_by_alias returns a template ID that no longer exists in the backend (alias registered but build failed/incomplete), AsyncSandbox would raise a SandboxException("404: template not found"). Now start() catches this case, deletes the stale template via REST API, and triggers a fresh build before retrying sandbox creation. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * novita: include last 5 log lines in build failure error message Previously only the last log line was shown, which was often just "Postprocessing finished. Cleaning up..." instead of the actual error. 
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * feat(novita): upload COPY files via S3 pre-signed URL to fix 413 errors * chore: update parity_summary.csv [skip ci] * Fix review issues and CI failures in Novita environment - Add _merge_env(env) call in exec() so persistent env vars (--ae flags, task [environment.env] config) are correctly forwarded to sandbox commands - Add user parameter to exec(), is_dir(), is_file() to match BaseEnvironment interface (fixes type-check invalid-method-override errors) - Close HTTP client in stop(delete=False) to prevent resource leak; update test to assert aclose is called - Fix uv.lock: missing [[package]] header before networkx entry caused TOML parse errors that broke all CI checks; regenerate lockfile cleanly Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com> * Fix exec() to respect user parameter via _resolve_user The user parameter was accepted but never used — all commands ran as root. Now calls _resolve_user(user) to honour the orchestrator-set default_user (e.g. task agent.user / verifier.user from task.toml). Novita SDK's user parameter is Literal["root", "user"], so map any non-root resolved user to "user"; add Literal import accordingly. 
Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com> * Add preflight() and chmod 777 on log dirs in Novita environment - Add preflight() classmethod to validate NOVITA_API_KEY before any trials are queued, giving immediate feedback instead of failing mid-job - chmod 777 agent/verifier log directories after creation in start() so non-root agent/verifier users can write reward files and logs - Update start() test mocks to handle both foreground (healthcheck) and background (exec) sandbox.commands.run call patterns Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com> * style: ruff format test_novita.py Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com> * Fix template name slash escaping and cwd quoting in exec - Replace '/' with '__' in template alias construction so org/name task names (e.g. harbor/hello-world) don't break REST API URL paths - Use shlex.quote(effective_cwd) in exec() to handle paths with spaces or shell metacharacters safely Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com> * Use timeout=0 (no limit) as default in exec, aligning with E2B timeout_sec or 0 matches E2B and the Novita SDK docs where 0 means no connection time limit, avoiding premature 300s cutoffs on long-running agent setup or verifier scripts. 
Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com> * Update src/harbor/environments/novita.py Co-authored-by: devin-ai-integration[bot] <158243242+devin-ai-integration[bot]@users.noreply.github.com> * fix: deal with build conflict error and enhance Dockerfile handling in NovitaEnvironment * refactor: move novita-sandbox to optional extra, matching other cloud providers - Move `novita-sandbox` from main deps to `[novita]` optional extra - Add `dockerfile-parse` to `novita` extra (was only in `e2b`, but novita.py needs it) - Include `harbor[novita]` in the `cloud` bundle - Wrap SDK imports in try/except with `_HAS_NOVITA` flag, following the same lazy-import pattern introduced for daytona/e2b/modal in the upstream refactor - Raise `MissingExtraError` in `preflight()` when novita-sandbox is not installed - Regenerate uv.lock Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com> * fix: add _HAS_NOVITA guard in __init__ for clear MissingExtraError Without this guard, instantiating NovitaEnvironment when novita-sandbox is not installed raises a raw NameError (on DockerfileParser) instead of a helpful MissingExtraError with install instructions. Follows the same pattern as E2BEnvironment and RunloopEnvironment. Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com> * Update src/harbor/environments/novita.py Co-authored-by: devin-ai-integration[bot] <158243242+devin-ai-integration[bot]@users.noreply.github.com> * Update src/harbor/environments/novita.py Co-authored-by: devin-ai-integration[bot] <158243242+devin-ai-integration[bot]@users.noreply.github.com> * fix: import EnvironmentCapabilities in Novita environment Add the missing capabilities import after migrating NovitaEnvironment to the new capabilities API so ruff and ty can resolve the type. 
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * fix: update Novita capability tests Update Novita environment tests to assert the new capabilities API after migrating away from deprecated properties. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * fix: fix file upload endpoint * fix: integrate Novita SDK template builds Use the Novita SDK template builder directly while preserving Harbor's Dockerfile COPY handling, and pin the alpha SDK version without enabling global prerelease resolution. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * fix: pin Novita sandbox domain Use the regional Novita sandbox endpoint consistently so local domain overrides cannot route template operations to the wrong API host. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * fix: avoid Novita SDK import during test collection Load Novita SDK modules only when the Novita environment actually needs them so pytest can collect E2B and Novita tests in the same process without duplicate protobuf descriptor registration. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com> Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com> Co-authored-by: devin-ai-integration[bot] <158243242+devin-ai-integration[bot]@users.noreply.github.com> * Fix EnvironmentConfig deprecation warnings on default construction. Migrate legacy memory/storage fields in a before validator instead of Field(deprecated=...) plus an after validator, and reject conflicting legacy and modern resource values. Closes #1693 Co-authored-by: Cursor <cursoragent@cursor.com> * Estimate cursor-cli cost from usage via LiteLLM Cursor CLI stream-json reports token usage on result events but not dollar cost. Parse optional totalCost when present and otherwise estimate from per-category token counts using LiteLLM pricing. 
Co-authored-by: Cursor <cursoragent@cursor.com> * Add built-in pricing for Cursor Composer models in cursor-cli. LiteLLM does not list cursor/composer models, so estimate cost from token usage using Cursor's published rates before falling back to LiteLLM. Co-authored-by: Cursor <cursoragent@cursor.com> * [codex] Add resource enforcement policies (#1697) * Add resource enforcement policies * Pre flight check. * Fix CHANGELOG breaking changes for resource enforcement policies. Document removed task resource defaults and stricter validation instead of incorrectly claiming --cpus/--memory repurposed numeric overrides. Co-authored-by: Cursor <cursoragent@cursor.com> --------- Co-authored-by: Cursor <cursoragent@cursor.com> * v0.8.0 * Fix resource default test after provider-default change (#1701) * fix tests on main * chore: rerun CI * Document job sharing (#1706) * feat(viewer): add ←/→ trial navigation, ⌥+←/→ tab cycling, persistent tab across trials, and X/N position indicator on the trial page (#1705) * docs(atif): refresh trajectory format page to v1.7 (#1704) The trajectory format docs page still advertised ATIF-v1.4 as current and stopped its supported-versions list at v1.4, while the canonical RFC (rfcs/0001-trajectory-format.md) has been at v1.7 for several releases. Bump the example schema_version strings to ATIF-v1.7 and extend the Schema Versions section with v1.5, v1.6, and v1.7 entries summarized from the RFC's Version History. No code changes; docs only. * Add PR diff links workflow with manual dispatch. (#1716) Post devinreview and diffshub links when PRs open, and allow testing on existing PRs via workflow_dispatch. 
Co-authored-by: Cursor <cursoragent@cursor.com> * feat: add Openclaw installed agent (#1661) * feat: add openclaw installed agent * Cleanup commit * save full session turns * NeMo-Flow Integration * cleanup * update defaults * fix test for updated defaults * Fix tests for new defaults * Fix lint error * Remove nemoflow from PR Signed-off-by: Sam Oluwalana <soluwalana@nvidia.com> * refactor(openclaw): generalize provider config normalization Address review feedback: drop NVIDIA-specific code paths from the OpenClaw plugin so it works generically across any OpenAI-compatible provider. - Replace `_merge_nvidia_base_url_from_env` and `_normalize_nvidia_models_provider` with provider-agnostic `_merge_provider_base_url_from_env` and `_normalize_provider_models_schema` that derive the provider from `--model` (e.g. `openai/gpt-4.1` -> `OPENAI_BASE_URL`). - Remove the hardcoded NVIDIA default base URL; users select a custom provider via env or `openclaw_config`. - Update class docstring to use `openai/*` as the generic example. - Rewrite the NVIDIA-themed unit tests to cover the generic behavior with `openai/*`. The `nvidia` entry in the env-var forwarding switch is retained alongside ~15 other providers (anthropic, openai, google, ...) as a plain provider registry, since removing it would break existing `nvidia/*` model selections. 
Signed-off-by: Bryan Bednarski <bbednarski@nvidia.com> * feature(api): multi-provider compatibility for openclaw Signed-off-by: Bryan Bednarski <bbednarski@nvidia.com> --------- Signed-off-by: Sam Oluwalana <soluwalana@nvidia.com> Signed-off-by: Bryan Bednarski <bbednarski@nvidia.com> Co-authored-by: Bryan Bednarski <bbednarski@nvidia.com> Co-authored-by: Alex Shaw <alexgshaw64@gmail.com> * Add GPU support to GKE environment (#1640) * Add GPU support to GKE environment * Address PR comments - Early failure if an unsupported GPU type is provided - Increase the timeout minutes to 20 when GPUs are selected - Support direct gke-accelerator values as gpu_types * Adjust GPU count retrieval to use _effective_gpus for consistency * Paginate dataset metadata queries past Supabase row cap (#1719) * Paginate dataset metadata queries past Supabase row cap. Fixes harbor download and run truncating package datasets at 1,000 tasks. Co-authored-by: Cursor <cursoragent@cursor.com> * Format test_registry_db_client.py with ruff. Co-authored-by: Cursor <cursoragent@cursor.com> --------- Co-authored-by: Cursor <cursoragent@cursor.com> * Add TPU support to harbor and GKE environment (#1652) * Address PR comments - Early failure if an unsupported GPU type is provided - Increase the timeout minutes to 20 when GPUs are selected - Support direct gke-accelerator values as gpu_types * Adjust GPU count retrieval to use _effective_gpus for consistency * Add TPU support to environment configuration This change allows environments to properly support and validate TPU requirements, improving task execution flexibility. * Add TPU support to GKE environment This update introduces a mapping for TPU types, enhances the GKEEnvironment class to handle TPU configurations, and updates unit tests to validate TPU capabilities and configurations alongside existing GPU support.
* Update environment config model to use a dedicated class for TpuSpec * Add new TPU config to docs * Add --tpu_overrides to cli commands * Validate mutual exclusion of GPU and TPU requests in GKE * Fix merge conflicts * Update TPU configuration to use a single TpuSpec * Add Harbor Hub job result sharing blog post (#1732) * Add Harbor Hub job result sharing blog post. Co-authored-by: Cursor <cursoragent@cursor.com> * Update job sharing blog title and landing page banner. Co-authored-by: Cursor <cursoragent@cursor.com> --------- Co-authored-by: Cursor <cursoragent@cursor.com> * Add CoreWeave Sandbox and W&B environment support (#1698) * cw sandbox * doc fix * Fix (Add resource enforcement policies) * final fixes * comment cleanup * fix(cwsandbox): clean up backend sandbox on any failed start() * feat (Tensorlake): build sandboxes from OCI images instead of per-trial Dockerfile replay (#1734) * update tensorlake integration to use oci image build * Guard fcntl import for Windows test collection in tensorlake env * Add managing resources docs for task configuration. (#1735) Centralize enforcement policy and resource field guidance in the tasks docs. Co-authored-by: Cursor <cursoragent@cursor.com> * [Ready For Review] Fix artifact transfer archive collisions (#1733) * Fix artifact transfer archive collisions * Log transfer cleanup failures as warnings * Use RPC for task version resolution (#1736) * Allow tasks with docker_image to omit environment/Dockerfile (#1729) * Allow tasks with docker_image to omit environment/Dockerfile. Centralize environment definition validation and workdir helpers across supported providers. Co-authored-by: Cursor <cursoragent@cursor.com> * Fix docker_image-only force_build and Runloop workdir default. Use shared prebuilt-image selection when no Dockerfile exists, and restore /workspace fallback for Dockerfiles without WORKDIR. Co-authored-by: Cursor <cursoragent@cursor.com> * Apply prebuilt docker_image policy to all compose providers. 
Use should_use_prebuilt_docker_image in Daytona, Modal, and Islo, and unify Docker validation. Co-authored-by: Cursor <cursoragent@cursor.com> * Fix lazy dockerfile_parse import and daytona formatting. Move DockerfileParser import inside parse_dockerfile_workdir so core environments do not require the optional extra. Co-authored-by: Alex Shaw <alexgshaw64@gmail.com> * Add dockerfile-parse to runloop optional extra. Runloop now uses parse_dockerfile_workdir for WORKDIR resolution when a Dockerfile is present. Co-authored-by: Alex Shaw <alexgshaw64@gmail.com> --------- Co-authored-by: Cursor <cursoragent@cursor.com> * feat: Add native agent adapter for Google Antigravity CLI (agy) (#1699) * feat: Add native agent adapter for Google Antigravity CLI (agy) * fix: remove unused import * fix: correctly configure agy settings.json and model * fix: update test to match new EnvironmentConfig defaults * fix: remove unused run_model variable * style: run ruff format on agy.py * refactor: rename agy agent to antigravity-cli Use antigravity-cli as the Harbor agent identifier and AntigravityCli adapter naming instead of agy. Co-authored-by: Cursor <cursoragent@cursor.com> * fix(antigravity-cli): use Path.write_text for ATIF export Address Devin review feedback and align with AGENTS.md file I/O guidance. Co-authored-by: Cursor <cursoragent@cursor.com> --------- Co-authored-by: Alex Shaw <alexgshaw64@gmail.com> Co-authored-by: Cursor <cursoragent@cursor.com> * feat: Daytona auto-snapshot, transient error handling, and SandboxBuildFailedError (#1457) * feat: Daytona auto-snapshot, transient error handling, and SandboxBuildFailedError Adds three major improvements to the Daytona environment backend: 1. **Auto-snapshot with content-based caching**: New `auto_snapshot` parameter on DaytonaEnvironment enables automatic snapshot creation keyed by a SHA256 hash of the full environment directory. 
Tasks sharing the same Dockerfile and fixtures reuse a single snapshot, eliminating redundant builds. Snapshots are region-aware (DAYTONA_TARGET) to prevent cross-region collisions. Per-snapshot async locks prevent redundant parallel creation. 2. **Transient error differentiation**: New `daytona_utils.py` module provides `is_transient_daytona_error()` which distinguishes rate limits and capacity errors from non-recoverable failures. Retry callbacks use 10 attempts with 60s linear backoff for transient errors vs 3 attempts with exponential backoff for others — dramatically improving reliability under load. 3. **SandboxBuildFailedError**: New non-retryable exception for failed sandbox builds (bad Dockerfile, snapshot in ERROR state). Stops wasting retry budget on builds that will never succeed. Detected both in `_create_sandbox()` and `_wait_for_snapshot()`. Supporting additions: - `container_cache.py`: Hash utilities for environment directories and Dockerfiles, plus task analysis helpers for predicting snapshot counts - DinD auto-snapshot support with image-hash-based naming - `ephemeral=True` flag on all sandbox creation calls - `assume_global_snapshot` for optimistic handling of shared snapshots invisible to the GET API Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * fix: remove region_id param not in current Daytona SDK Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * refactor: remove DinD auto-snapshot additions, restore main's DinD start() DinD snapshot management was not in scope for this PR. Restores _DaytonaDinD.start() to main's original implementation. Removes _get_dind_snapshot_name, _ensure_dind_auto_snapshot, _create_dind_snapshot methods and unused hashlib import.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * fix: don't retry SandboxBuildFailedError/TimeoutError, close RL client - Add _is_non_retryable() guard to all retry callbacks so SandboxBuildFailedError and TimeoutError are never retried - Close temporary AsyncDaytona client after RL-region snapshot builds to prevent HTTP session leaks Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * test(daytona): harden PR #1457 with unit tests and small fixes Add tests for daytona_utils retry classification and container_cache hashing. Stop treating invalid bearer tokens as transient, trim unused analyze helpers, evict idle per-snapshot locks, and document auto_snapshot ERROR behavior. Co-authored-by: Cursor <cursoragent@cursor.com> * refactor(daytona): extract snapshot service and collapse retry helpers Move snapshot lifecycle into daytona_snapshots.py with a single state resolver and SnapshotPolicy. Replace six retry callbacks with daytona_retry_callbacks(). Simplify _DaytonaDirect.start() via _resolve_start_sandbox_params() and remove the string-matched fallback catch. Co-authored-by: Cursor <cursoragent@cursor.com> * refactor(daytona): dedupe ensure_auto paths and add optional snapshot GET Collapse fast/slow auto-snapshot resolution into shared helpers and use a documented non-retrying GET for pre-create ERROR cleanup. Co-authored-by: Cursor <cursoragent@cursor.com> * feat: use Task.short_name for environment_name Add Task.short_name (delegates to package short_name, else task dir name) and pass it as environment_name so Daytona snapshot templates and container naming avoid registry org prefixes and slashes in paths. Co-authored-by: Cursor <cursoragent@cursor.com> * refactor(daytona): move modules into daytona/ package Group environment, snapshots, and utils under environments/daytona/ to match docker/ and singularity/. Default assume_global_snapshot to False so missing template snapshots fall back to Dockerfile builds. 
Co-authored-by: Cursor <cursoragent@cursor.com> * fix(container_cache): length-prefix paths in environment hash Avoid ambiguous SHA256 updates where a file path could concatenate with the next file's content. Adds a regression test for the ab/a+b case. Co-authored-by: Cursor <cursoragent@cursor.com> * fix(daytona): wait for concurrent snapshot create to become active Handle PENDING snapshots before create and wait for ACTIVE after already-exists/conflict errors instead of returning the name immediately. Co-authored-by: Cursor <cursoragent@cursor.com> * fix(container_cache): length-prefix file content in environment hash Extend domain-separated hashing so path and content bytes cannot be ambiguous across files (Devin review follow-up). Co-authored-by: Cursor <cursoragent@cursor.com> --------- Co-authored-by: Benjamin Feuer <penfever@gmail.com> Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Co-authored-by: Alex Shaw <alexgshaw64@gmail.com> Co-authored-by: Cursor <cursoragent@cursor.com> * Upload environment/ files for prebuilt docker_image tasks (#1737) * Upload environment/ to workdir for prebuilt docker_image tasks. When docker_image is set without a Dockerfile or docker-compose.yaml, environments copy non-empty environment/ into the container workdir at the end of start(). Co-authored-by: Cursor <cursoragent@cursor.com> * Fix CI: format tests and isolate cwsandbox environment_dir fixtures. Use a dedicated empty environment/ subdirectory so post-start uploads do not run during unit tests that assert exact exec call counts. Co-authored-by: Cursor <cursoragent@cursor.com> * Format cwsandbox test_wandb.py Co-authored-by: Cursor <cursoragent@cursor.com> * Fix cwsandbox tests to write Dockerfile under environment/. Aligns with environment_dir fixture so prebuilt-image allowance tests exercise the intended layout. Co-authored-by: Cursor <cursoragent@cursor.com> --------- Co-authored-by: Cursor <cursoragent@cursor.com> * downgrade logging. 
* Stop writing per-episode log folders in Terminus-2 (#1740) * Stop writing per-episode log folders in Terminus-2. Episode prompt/response/debug files are redundant now that trajectory.json captures each turn. Co-authored-by: Cursor <cursoragent@cursor.com> * Fix Terminus-2 tests after removing episode logging paths. Co-authored-by: Cursor <cursoragent@cursor.com> --------- Co-authored-by: Cursor <cursoragent@cursor.com> * [Ready for Review] Adapter | Review bot prompt update for agent reward hacking checks (#1747) * Update adapter review prompts * Update prompt based on some sanity check runs * Add benchmark identity leakage check * Add linear.review link to PR diff links workflow (#1749) * fix link. * v0.9.0 * claude_code: handle redacted_thinking content blocks (#1752) Anthropic's `redacted_thinking` is a standard, documented content block type that can appear in any assistant message when extended thinking is enabled. Its `data` field is opaque ciphertext that clients cannot decrypt — the contract is to pass it back unchanged on subsequent API calls, never to expose it as user-facing text. Today _extract_text_reasoning_tool_uses doesn't recognise the type, so the block falls through to the catch-all that `_stringify`s the whole block dict and appends the resulting JSON envelope to text_parts. Trajectories then carry an ATIF `message` like '{"type":"redacted_thinking","data":"…"}' in the assistant turn. On may26 there are 2,050 such steps across 127 trials in the bundled corpus, all claude-code paired with vendor-routed models (e.g. tencent/hy3-preview-20260421 via OpenRouter). OpenRouter additionally mis-uses the redacted_thinking envelope to pass through PLAIN reasoning from non-Anthropic models: `data` is `openrouter.reasoning:<b64>`, where the base64 decodes to plain JSON `{"text":"…","type":"reasoning.text"}`. That content isn't actually encrypted — it should land in reasoning_content like every other thinking block. 
Add a redacted_thinking branch before the generic fallback that: - if data starts with `openrouter.reasoning:`, b64-decodes the payload, parses the inner JSON, and appends the inner `text` to reasoning_parts; - otherwise drops the block. This preserves the API contract for genuine Anthropic ciphertext (it remains opaque) and stops the envelope JSON from polluting human-readable trajectory text. Updates the existing test_redacted_thinking_not_in_reasoning to assert the envelope is now absent from both text and reasoning (it previously only asserted absence from reasoning, accepting the stringified-into-text behaviour), and adds two new tests covering the OpenRouter decode and malformed-payload-dropped paths. Co-authored-by: Ubuntu <ubuntu@ip-172-31-32-163.ap-northeast-2.compute.internal> * claude_code: unwrap text content blocks in user-event tool_result loop (#1753) In _convert_events_to_trajectory, the user-event content loop already handles tool_result blocks specifically. Anything else falls through to `self._stringify(block)` — which JSON-encodes the whole block dict and appends the resulting envelope to text_parts. So a content block like {"type": "text", "text": "<10 KB of skill documentation>"} ends up in the ATIF user step's `message` as '{"type":"text","text":"Base directory for this skill: …"}' verbatim — downstream renderers that expect `message` to be human text can't read it. Claude Code injects these text blocks as user content alongside the tool_result when a Skill is loaded (the block carries the skill's documentation). Saw 4 such steps in a recent harbor-index corpus scan on skillsbench × {glm-5.1, MiniMax/MiniMax-M2.7} runs. Fix: before the generic _stringify fallback, recognise `{"type":"text","text":<str>}` and surface its inner string. Non-text blocks and text blocks with non-string `text` still hit the stringify fallback so behaviour for unknown shapes is unchanged.
Adds test_user_event_text_content_block_unwrapped covering the end-to-end path through _convert_events_to_trajectory. Co-authored-by: Ubuntu <ubuntu@ip-172-31-32-163.ap-northeast-2.compute.internal> * fix(modal): default _ModalDirect.exec to non-login shell (#1744) The strategy-refactor PR (#1311) introduced `login=True` on the default `_ModalDirect.exec` path, which causes the underlying SDK call to use `bash -lc <cmd>`. A login shell re-sources `/etc/profile` and the shell's profile files, which **clobbers `PATH`** as set by the image's `ENV PATH=…` directives. This breaks any task that pins toolchains via image-level `ENV PATH`: - Go tasks lose `/usr/local/go/bin` (everything that does `go build`/`go test` fails) - Rust tasks lose `~/.cargo/bin` (cargo not found) - Anything with custom `pipx`/`uv`/Node prefixes baked into image layers gets reset to the inherited login default Reverting this single line to `login=False` restores the pre-#1311 `bash -c` behavior and preserves the image's PATH. The lower-level `_sdk_exec` still exposes `login` as a parameter, so strategies that genuinely want a login shell can opt in explicitly. Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com> * Add viewer sign-in and sync auth with the CLI (#1755) * Add viewer sign-in and sync auth with the CLI. Enable OAuth login/logout in the local viewer, pick up CLI credential changes via mtime-based cache invalidation, and align page headers with Harbor Hub. Co-authored-by: Cursor <cursoragent@cursor.com> * Fix credential sync detection on Windows. Use a content hash instead of mtime, which can be unchanged across rapid writes on Windows. Co-authored-by: Cursor <cursoragent@cursor.com> * Fix credential sync baseline after local writes. Set initialized state in note_credentials_written and isolate credential sync tests so they pass independently. 
Co-authored-by: Cursor <cursoragent@cursor.com> --------- Co-authored-by: Cursor <cursoragent@cursor.com> * fix(claude-code): preserve user-message bytes in ATIF trajectory (drop .strip()) (#1724) * [claude-code] preserve user message bytes (no .strip()) Downstream pipelines that hash the user step.message bytes for cross-harness equivalence checks rely on byte-identical comparisons against the canonical instruction.md. Stripping trailing/leading whitespace in the ATIF normalizer breaks those checks silently. `_convert_events_to_trajectory` accepts user-event content in three shapes; all three were applying `.strip()` to the persisted bytes: * `content: str` (the shape `claude --print -- "..."` emits) — fixed by replacing `text = content.strip()` with `text = content` and tightening the existing truthy gate to `if text.strip():` so empty / whitespace-only entries are still dropped without mutating bytes in the non-empty case. * `content: list` (programmatic / SDK callers that wrap the instruction in `{"type": "text", "text": "..."}` blocks) — fixed by extracting `block["text"]` verbatim instead of routing through `_stringify`, and by dropping `part.strip()` from the join (the `if part.strip()` filter still removes empty / whitespace-only parts so we never emit `\n\n` between nothing). Non-text non-tool_result blocks (e.g. image blocks) continue to fall through to `_stringify`, which json-encodes them; the patch deliberately does not try to byte-faithful those — they have no canonical text bytes to be faithful to. * `content` else-branch (defensive fallback for unusual shapes) — fixed by the same rule: keep raw `_stringify(content)` bytes and use `.strip()` only in the empty-skip filter.
Adds regression tests covering string-content trailing newline / leading whitespace / internal whitespace / empty / whitespace-only, list-content single-block byte-faithful / multi-block join / empty-part filter / non-text non-tool_result block json-encoded, and the fallback else-branch on a non-str non-list content payload. * fix(tests): run byte-faithful suite in CI (declare hypothesis, drop module skip) The module-level `pytest.importorskip("hypothesis")` skipped the ENTIRE test file when hypothesis was absent — not just the property test, but also the byte-faithful regression suite this PR adds and the pre-existing reasoning-extraction / session-selection tests. hypothesis was not in the dev dependency group nor in uv.lock, and CI installs via `uv sync --all-packages --all-extras --locked`, so it was never present: the file collected to "0 items / 1 skipped" and CI was green-but-empty. Declare hy…
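The container_cache commits above ("length-prefix paths in environment hash" and "length-prefix file content in environment hash") fix a framing ambiguity: if path and content bytes are fed to SHA256 back to back, the file `ab` with empty content and the file `a` containing `b` both contribute the bytes `ab`. What follows is a minimal sketch of the idea, not Harbor's actual `container_cache.py`. The function names and the 8-byte big-endian length prefix are illustrative assumptions.

```python
from __future__ import annotations

import hashlib
from pathlib import Path


def _update_framed(digest, data: bytes) -> None:
    """Feed `data` preceded by its 8-byte big-endian length (assumed framing)."""
    digest.update(len(data).to_bytes(8, "big"))
    digest.update(data)


def hash_files(files: dict[str, bytes]) -> str:
    """Hypothetical helper: hash relative path -> content with length prefixes."""
    digest = hashlib.sha256()
    for rel_path in sorted(files):  # sorted so iteration order does not matter
        _update_framed(digest, rel_path.encode("utf-8"))
        _update_framed(digest, files[rel_path])
    return digest.hexdigest()


def naive_hash_files(files: dict[str, bytes]) -> str:
    """Unframed variant for comparison: path and content simply concatenated."""
    digest = hashlib.sha256()
    for rel_path in sorted(files):
        digest.update(rel_path.encode("utf-8"))
        digest.update(files[rel_path])
    return digest.hexdigest()


def hash_environment_dir(env_dir: Path) -> str:
    """Hypothetical helper: hash every regular file under env_dir (POSIX paths)."""
    files = {
        p.relative_to(env_dir).as_posix(): p.read_bytes()
        for p in env_dir.rglob("*")
        if p.is_file()
    }
    return hash_files(files)


# The "ab / a+b" case: the naive hash collides, the framed hash does not.
assert naive_hash_files({"ab": b""}) == naive_hash_files({"a": b"b"})
assert hash_files({"ab": b""}) != hash_files({"a": b"b"})
```

Any unambiguous encoding, such as a length prefix or an escaped separator, closes the collision. A length prefix is simply the easiest to get right.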

cw sandbox (d6da98f)

github-actions Bot added area:environments area:tests area:core area:package area:docs labels May 22, 2026

matthoare117-wandb added 6 commits May 22, 2026 10:42

doc fix (d43b760)

Fix (Add resource enforcement policies) (ab60e28)

Merge branch 'main' into hoare-cw/wandb (502801a)

Merge branch 'main' into hoare-cw/wandb (525cd10)

final fixes (2da55d7)

comment cleanup (393e681)

matthoare117-wandb marked this pull request as ready for review May 23, 2026 02:13

matthoare117-wandb added 2 commits May 26, 2026 09:53

Merge branch 'main' into hoare-cw/wandb (dceb49c)

fix(cwsandbox): clean up backend sandbox on any failed start() (c899107)
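The commit fix(cwsandbox): clean up backend sandbox on any failed start() describes a common lifecycle rule. Once the remote sandbox exists, any later failure in start() must delete it, so that it does not leak. What follows is a minimal sketch of that pattern under stated assumptions. `FakeSandboxClient` and its `create`, `run_setup`, and `delete` methods are hypothetical stand-ins, not calls from the CoreWeave Sandbox SDK or `wandb.sandbox`. `SketchEnvironment` is likewise illustrative and is not Harbor's cwsandbox environment.

```python
from __future__ import annotations

import asyncio
import logging

logger = logging.getLogger(__name__)


class FakeSandboxClient:
    """Hypothetical backend client (not the CoreWeave SDK) that tracks live sandboxes."""

    def __init__(self, fail_setup: bool = False) -> None:
        self.live: set[str] = set()
        self.fail_setup = fail_setup
        self._counter = 0

    async def create(self) -> str:
        self._counter += 1
        sandbox_id = f"sb-{self._counter}"
        self.live.add(sandbox_id)
        return sandbox_id

    async def run_setup(self, sandbox_id: str) -> None:
        if self.fail_setup:
            raise RuntimeError(f"setup failed in {sandbox_id}")

    async def delete(self, sandbox_id: str) -> None:
        self.live.discard(sandbox_id)


class SketchEnvironment:
    """Hypothetical environment illustrating cleanup on a failed start()."""

    def __init__(self, client: FakeSandboxClient) -> None:
        self.client = client
        self.sandbox_id: str | None = None

    async def start(self) -> None:
        self.sandbox_id = await self.client.create()
        try:
            await self.client.run_setup(self.sandbox_id)
        except BaseException:
            # Any failure after the sandbox exists (including cancellation)
            # triggers cleanup, then the original exception is re-raised.
            await self._cleanup()
            raise

    async def _cleanup(self) -> None:
        if self.sandbox_id is None:
            return
        try:
            await self.client.delete(self.sandbox_id)
        except Exception:
            # Log, but never let a cleanup error mask the original failure.
            logger.warning("failed to delete sandbox %s", self.sandbox_id, exc_info=True)
        finally:
            self.sandbox_id = None
```

This sketch does not cover one case: the backend may allocate a sandbox before `create()` itself raises. Handling that would require backend-specific lookup logic.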

alexgshaw approved these changes May 27, 2026


alexgshaw merged commit f99317c into harbor-framework:main May 27, 2026 (6 checks passed)

caffeinum mentioned this pull request May 28, 2026 in team2027/harbor#21, "sc shielded perovskite 09f7" (Merged)

RishiDesai mentioned this pull request Jun 9, 2026 in RishiDesai/harbor#21, "Merge upstream harbor-framework v0.13.1 (latest PyPI release)" (Merged)

zozo123 mentioned this pull request Jun 10, 2026 in #1745, "feat: add experimental Crabbox environment adapter" (Open)

alexgshaw merged 9 commits into harbor-framework:main from matthoare117-wandb:hoare-cw/wandb
