Skip to content

fix(rest): add GET /v1/:prefix/metrics org-scoped runtime metrics#689

Draft
G4614 wants to merge 2 commits into
boxlite-ai:mainfrom
G4614:fix/rest-runtime-metrics-route
Draft

fix(rest): add GET /v1/:prefix/metrics org-scoped runtime metrics#689
G4614 wants to merge 2 commits into
boxlite-ai:mainfrom
G4614:fix/rest-runtime-metrics-route

Conversation

@G4614

@G4614 G4614 commented Jun 9, 2026

Copy link
Copy Markdown
Contributor

rt.metrics() over REST 404'd because the NestJS router never mounted the route. Adds BoxliteMetricsController returning the same shape the SDK's RuntimeMetricsResponse expects, aggregated from the sandbox table in one grouped query.

Test plan — two-sided verified

Pin: scripts/test/e2e/cases/test_box_metrics.py

Stack: local e2e. API restart only.

Case Pre-fix (main) Post-fix (this PR)
test_runtime_metrics_counts_active_boxes (await rt.metrics() before + after create) RuntimeError: box not found: /api/v1/<org>/metrics statusCode 404 RuntimeMetrics returned; boxes_created_total ticks up by ≥1 after rt.create()
test_box_metrics_increments_command_count passes already (box-level metrics existed) passes
test_box_metrics_shape_is_complete passes already passes

total_commands_executed / total_exec_errors return 0 in the response — that data lives in a runner-side log the API doesn't aggregate today. The SDK is tolerant because each field carries #[serde(default)]; a follow-up can wire the exec counters without an SDK version bump.

`rt.metrics()` over REST 404'd because the NestJS router never mounted
the route. The SDK's MetricsRegistry was getting "no such endpoint" for
every metrics fetch and falling back to zero-everywhere.

Adds `BoxliteMetricsController` mirroring the SDK's
`RuntimeMetricsResponse` shape exactly (boxes_created_total /
boxes_failed_total / num_running_boxes / total_commands_executed /
total_exec_errors). Aggregates from the sandbox table in one grouped
query (state → count). exec counters return 0 because that data lives
in a runner-side log we don't aggregate at the API yet; SDK is tolerant
because every response field carries `#[serde(default)]`.

E2E regression test:
  sdks/python/tests/e2e/test_box_metrics.py::test_runtime_metrics_counts_active_boxes
@coderabbitai

coderabbitai Bot commented Jun 9, 2026

Copy link
Copy Markdown

Important

Review skipped

Draft detected.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro Plus

Run ID: 762bb2a9-dd61-41e0-8eae-4760e21cd31e

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Use the checkbox below for a quick retry:

  • 🔍 Trigger review
✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Comment @coderabbitai help to get the list of available commands and usage tips.

The fix in the preceding commit closes the gap this test exercises.
Mirrors the layout of `scripts/test/e2e/cases/` on main.
G4614 added a commit to G4614/boxlite that referenced this pull request Jun 9, 2026
test_exec_user.py → boxlite-ai#686, test_network_allow_net.py → boxlite-ai#687,
test_files_io.py → boxlite-ai#688, test_box_metrics.py → boxlite-ai#689,
test_snapshot_clone.py → boxlite-ai#694, test_images_pull_list.py → boxlite-ai#696.

Remaining 3 cases (test_exec_attach.py, test_volume_readonly.py,
test_cli_detach_recovery.py) stay here because they pin REST-path gaps
that don't have a matching fix PR in this session — they document the
contract for future work to land against.
DorianZheng pushed a commit that referenced this pull request Jun 10, 2026
#710)

add 5 e2e cases pinning REST contracts not covered today

## Coverage gaps

Three classes of behaviour run through the SDK → API → Runner → VM
chain but had no e2e pin on the chain itself; bugs in any one of them
would silently regress through `make test:integration:*` (which only
exercises local-FFI):

```
1. CLI detach lifecycle — `boxlite run -d` returns, CLI process exits,
   does a fresh CLI process still see / exec the box?
   FFI side  src/boxlite/tests/detach.rs, recovery.rs  ✓
   E2E side  scripts/test/e2e/cases/                    ✗
2. Execution.attach / reattach contract — bogus exec_id should be a
   typed client error; reattach to a completed exec should return a
   usable handle.
   Runner side  apps/runner/.../boxlite_exec_attach_test.go  ✓
   SDK <-> API  end-to-end                                    ✗
3. host bind mount via REST — the cloud runtime intentionally dropped
   host bind mounts (#639); REST callers passing host paths must
   silently no-op, no /mnt/<x> in the guest.
   FFI surface  src/boxlite/tests/mount_security.rs  ✓
   REST contract                                      ✗
```

## Cases shipped

```
scripts/test/e2e/cases/test_cli_detach_recovery.py   (2 cases)
scripts/test/e2e/cases/test_exec_attach.py           (2 cases)
scripts/test/e2e/cases/test_volume_readonly.py       (1 case)
```

5 cases total. Author also dropped 6 cases from this branch's earlier
incarnation that were already committed alongside their respective
fix PRs (#686 / #688 / #689 / #691 / #692 / #696), so the diff is
strictly net-new coverage.

## Test plan — run against current main

Stack: local e2e, runner unchanged, no source edits.

| Case | Result | Notes |
|---|---|---|
| `test_detached_box_exec_propagates_exit_code_on_fresh_cli` | ✅ PASS |
exit-code passthrough across CLI processes |
| `test_detached_box_survives_cli_exit_and_is_reusable` | ⚠️ XFAIL
(strict) | reaches step 3 (`boxlite exec <id> echo still-alive`) then
hits the stdout-drop race that #563 fixes — marker drops when #563 lands
|
| `test_attach_with_bogus_id_is_typed_error` | ✅ PASS | bogus exec_id →
typed `Exception` (not 5xx, not silent) |
| `test_reattach_after_original_completes` | ⚠️ XFAIL (strict) | same
stdout-drop race (#563) on the original exec's `out=='first-output'`
assertion |
| `test_host_bind_mount_via_rest_is_silently_ignored` | ✅ PASS | the box
created with `volumes=[(host_dir, "/mnt/ro", True)]` reports
`MOUNT_LINE=<none>` from `/proc/mounts` and the host marker file is
untouched — REST silently dropped the host path |

The 2 XFAILs are tied to #563 (`fix(go-sdk): fold stream drain into
Execution.Wait`). Once #563 merges, both xfails flip xpass-strict, the
markers come off, and the suite is 5/5 green. No additional fix work is
needed in this PR.

🤖 Generated with [Claude Code](https://claude.com/claude-code)


<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->

## Summary by CodeRabbit

* **Tests**
* Added end-to-end tests verifying CLI detach survival and box
reusability across fresh CLI processes, including exit code propagation
tests.
* Added end-to-end tests for SDK reattach functionality, validating
session state after execution completion.
* Added end-to-end test confirming proper host bind mount handling
during REST execution.

<!-- end of auto-generated comment: release notes by coderabbit.ai -->

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant