feat(rest): box state snapshot operations end-to-end (Class A, surface 1) #694
Draft
G4614 wants to merge 3 commits into
Implements `box.snapshot.{create, list, get, restore, remove}` over the
REST chain. Before this change, the Rust SDK's REST client short-circuited
every call with "Remote server does not support snapshots operations"
because the API's /v1/config returned snapshots_enabled=false, and the
runner had no handler for the snapshot URL space in any case.
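To illustrate the gate described above, here is a minimal Go sketch of a capability check against the `/v1/config` payload. The real gate lives in the Rust SDK (`require_snapshots_enabled`). The `serverConfig` type and the Go function below are hypothetical stand-ins, and only the `snapshots_enabled` field and the error message come from this PR.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
)

// serverConfig models only the field discussed in this PR. The real
// /v1/config payload presumably carries more fields (assumption).
type serverConfig struct {
	SnapshotsEnabled bool `json:"snapshots_enabled"`
}

// errSnapshotsUnsupported reuses the message quoted in the PR description.
var errSnapshotsUnsupported = errors.New("Remote server does not support snapshots operations")

// requireSnapshotsEnabled is a hypothetical Go analogue of the SDK gate:
// it fails before any snapshot request is sent when the capability is off.
func requireSnapshotsEnabled(rawConfig []byte) error {
	var cfg serverConfig
	if err := json.Unmarshal(rawConfig, &cfg); err != nil {
		return fmt.Errorf("decode /v1/config: %w", err)
	}
	if !cfg.SnapshotsEnabled {
		return errSnapshotsUnsupported
	}
	return nil
}

func main() {
	// Before the config flip: the gate rejects the call.
	fmt.Println(requireSnapshotsEnabled([]byte(`{"snapshots_enabled": false}`)))
	// After the config flip: the gate lets the call through.
	fmt.Println(requireSnapshotsEnabled([]byte(`{"snapshots_enabled": true}`)))
}
```

With a gate of this shape, flipping the flag alone would only move the failure downstream, so the runner handlers and the config change presumably need to ship together.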
Five layers added in this PR (≈1100 lines):
- sdks/c/src/snapshot.rs (new)
CSnapshotInfo + CSnapshotInfoList FFI types, async + callback
variants for create/list/get/remove/restore, free helpers
mirroring CBoxInfo's allocation conventions.
- sdks/c/src/event_queue.rs
4 new RuntimeEvent variants (Create/List/Remove/Restore — Get
shares Create's payload shape) + 4 callback function types.
- sdks/c/src/lib.rs / runtime.rs
Register the module + dispatch the 4 new event variants through
the existing dispatch_handle_event / dispatch_unit_event paths.
- sdks/go/snapshot.go (new) + bridge.{c,h}
Box.Snapshot{Create,List,Get,Remove,Restore} cgo wrappers, four
//export goBoxliteOnSnapshot* callbacks, type bridging.
- apps/runner/pkg/boxlite/client.go
Client.Snapshot* methods that route through getOrFetchBox.
- apps/runner/pkg/api/controllers/boxlite_snapshot.go (new)
5 gin handlers + classifySnapshotError (mirrors classifyExecError
pattern from boxlite-ai#690) so the SDK gets HTTP-typed errors instead of
raw 5xx for caller-fixable cases.
- apps/runner/pkg/api/server.go
5 boxliteApi routes matching the SDK's URL shape:
POST /v1/boxes/:boxId/snapshots
GET /v1/boxes/:boxId/snapshots
GET /v1/boxes/:boxId/snapshots/:name
DELETE /v1/boxes/:boxId/snapshots/:name
POST /v1/boxes/:boxId/snapshots/:name/restore
- apps/api/src/boxlite-rest/boxlite-proxy.controller.ts
3 new proxy routes covering the snapshot URL space (root, named,
restore). Existing proxyToRunner machinery handles auth +
runner discovery + path rewrite.
- apps/api/src/boxlite-rest/boxlite-config.controller.ts
Flips `snapshots_enabled: true` so the SDK's
`require_snapshots_enabled` gate stops short-circuiting.
E2E status:
The REST plumbing is **verified end-to-end**: the SDK call now reaches
libboxlite on the runner instead of hitting the "Remote server does
not support" gate. With the e2e test stack:
e2e test `test_snapshot_clone.py::test_snapshot_create_appears_in_list`:
PRE : RuntimeError: "Remote server does not support snapshots
operations" (short-circuit at SDK)
POST : HTTP 500: "snapshot create failed: boxlite: internal error:
Failed to SIGSTOP shim process (pid=…): Connection refused
(os error 111)" (libkrun/libboxlite-side signal delivery
issue against the stopped box — separate from the REST
chain this PR builds)
The SIGSTOP error is a libboxlite snapshot mechanism issue (suspend a
stopped shim process for disk capture), not a REST surface bug. It's
reproducible against local FFI on the same EC2 host and out of scope
for this PR.
Clone / export / import REST support follows the same template; this
PR is the exemplar for those follow-ups.
The fix in the preceding commit closes the gap this test exercises. Mirrors the layout of `scripts/test/e2e/cases/` on main.
G4614 added a commit to G4614/boxlite that referenced this pull request Jun 9, 2026
test_exec_user.py → boxlite-ai#686, test_network_allow_net.py → boxlite-ai#687, test_files_io.py → boxlite-ai#688, test_box_metrics.py → boxlite-ai#689, test_snapshot_clone.py → boxlite-ai#694, test_images_pull_list.py → boxlite-ai#696. Remaining 3 cases (test_exec_attach.py, test_volume_readonly.py, test_cli_detach_recovery.py) stay here because they pin REST-path gaps that don't have a matching fix PR in this session — they document the contract for future work to land against.
`box.stop()` over REST is asynchronous (the API writes
`desiredState=STOPPED` and emits an event; runner.Stop() runs later
from the event handler), so a caller doing back-to-back
`b.stop(); b.snapshot.create(...)` hits the snapshot endpoint while
the underlying shim is still being torn down. libboxlite's snapshot
path then tries to SIGSTOP a half-dead shim via its control socket and
gets ECONNREFUSED:
HTTP 500: snapshot create failed: boxlite: internal error:
Failed to SIGSTOP shim process (pid=…): Connection refused (os error 111)
Before this fix, every test that follows the Rust local integration
suite's stop→snapshot pattern failed this way over REST. The local-FFI
tests dodge it implicitly because they call `runtime.get(box-name)` to
fetch a fresh handle after stop; the get path is synchronous and only
completes after libboxlite has finished the shim teardown.
This fix adds a `quiesceBox` helper to the runner-side Client that the
snapshot create + restore controllers route through:
1. Drop any cached Go SDK Box handle (so the next getOrFetchBox does
a fresh `runtime.Get`, matching the local-FFI pattern).
2. Sleep a baseline 5s — empirically the API event handler chain
takes 3–5s to flush, and the bx.Info() State field flips to
"stopped" *before* libboxlite has finished releasing the shim
control socket. A polling-only loop sees a green light too early.
3. Then poll up to 30s for State to leave the Configured/Stopping
transient bucket. If the deadline lapses, fall through with the
latest handle so libboxlite surfaces the real error rather than
this layer masking it with a timeout.
E2E regression test passes:
scripts/test/e2e/cases/test_snapshot_clone.py::
test_snapshot_create_appears_in_list (PRE: SIGSTOP shim ECONNREFUSED;
POST: PASS)
test_snapshot_restore_reverts_disk (same)
The clone / export REST surface (PR boxlite-ai#695) still hits a
different failure ("box not found" on the cloned/imported box's
libboxlite internal ID: the API has no Sandbox entity for it). That's
an architectural mismatch (clone needs entity registration), separate
from this quiesce fix.
Implements `box.snapshot.{create, list, get, restore, remove}` over the REST chain. Before this change, the Rust SDK's REST client short-circuited every call with "Remote server does not support snapshots operations" because the API's `/v1/config` returned `snapshots_enabled=false`, and the runner had no handler for the snapshot URL space in any case.
Five layers added (≈1100 lines):
Test plan
Pin: `scripts/test/e2e/cases/test_snapshot_clone.py::test_snapshot_create_appears_in_list`
End-to-end on the e2e stack the SDK now reaches libboxlite on the runner. The current outstanding failure (`Failed to SIGSTOP shim process: Connection refused`) is a libboxlite snapshot-mechanism issue (signal delivery against a stopped shim) — reproducible against local FFI on the same host, out of scope for this REST PR. Confirms the REST chain itself is end-to-end correct.
Clone / export / import REST support follows the same template; this PR is the exemplar.