feat(rest): box clone + export + runtime import end-to-end (Class A, surfaces 2-4) #695

Draft
G4614 wants to merge 2 commits into boxlite-ai:main from G4614:fix/rest-clone-end-to-end

@G4614 G4614 commented Jun 9, 2026

Adds clone, export, and runtime import over the REST chain, following the same template as #694 (snapshot). Pre-fix, the SDK's Rust REST client short-circuited each call with "Remote server does not support {clone,export,import} operations" because /v1/config reported all three capabilities as false.
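
To illustrate the short-circuit described above, here is a minimal Python sketch of a client-side capability gate. It is not the SDK's Rust code: the flat config dictionary, the `require_enabled` helper, and the `RemoteUnsupportedError` class are assumptions made for illustration. Only the `*_enabled` key names and the error text come from this PR.

```python
# Hypothetical sketch of a client-side capability gate; not the SDK's actual code.
class RemoteUnsupportedError(RuntimeError):
    """Raised before any request is sent when the server lacks a capability."""


def require_enabled(config: dict, capability: str) -> None:
    """Raise unless the config payload reports `<capability>_enabled` as true.

    Treating a missing key as disabled is an assumption of this sketch.
    """
    if not config.get(f"{capability}_enabled", False):
        raise RemoteUnsupportedError(
            f"Remote server does not support {capability} operations"
        )


# Assumed shape of the /v1/config capabilities before and after this PR.
pre_fix = {"clone_enabled": False, "export_enabled": False, "import_enabled": False}
post_fix = {"clone_enabled": True, "export_enabled": True, "import_enabled": True}
```

With `pre_fix`, `require_enabled(pre_fix, "clone")` raises before any HTTP request is made; with `post_fix` it returns and the call proceeds to the REST chain.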

Layers added (≈765 lines):

  • C FFI (`sdks/c/src/clone_export.rs`): `boxlite_box_clone_box`, `boxlite_box_export`, `boxlite_runtime_import_box`
  • event_queue: 2 new RuntimeEvent variants + callback function types
  • Go SDK (`sdks/go/clone_export.go`): `Box.CloneBox`, `Box.Export`, `Runtime.ImportBox`
  • Runner: controller with 3 handlers (export streams archive bytes back; import reads body bytes to temp file) + 3 routes
  • API NestJS: 3 proxy routes; import is runtime-level (no boxId) so we pick a runner via `pickRunnerForImport` (any runner the org has a sandbox on); config flips clone/export/import capabilities to true
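
The runner's two byte-moving paths (export streams the archive back, import spools the request body to a temp file) can be sketched as follows. This is an illustrative Python version, not the Go handler code: the function names, the chunk size, and the `.archive` suffix are all hypothetical.

```python
import io
import os
import tempfile
from typing import BinaryIO, Iterator

CHUNK_SIZE = 64 * 1024  # arbitrary chunk size chosen for this sketch


def stream_archive(archive: BinaryIO, chunk_size: int = CHUNK_SIZE) -> Iterator[bytes]:
    """Yield the archive in chunks so it can be sent as a response body."""
    while True:
        chunk = archive.read(chunk_size)
        if not chunk:
            break
        yield chunk


def spool_body_to_temp(body: BinaryIO, chunk_size: int = CHUNK_SIZE) -> str:
    """Copy a request body to a temp file and return the file's path.

    The caller is responsible for deleting the file once the import call
    has finished with it.
    """
    fd, path = tempfile.mkstemp(suffix=".archive")  # suffix is hypothetical
    with os.fdopen(fd, "wb") as out:
        for chunk in stream_archive(body, chunk_size):
            out.write(chunk)
    return path
```

Chunked copying keeps memory use bounded regardless of archive size, which is one reason to stream rather than buffer the whole body.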

Test plan

Pins: `scripts/test/e2e/cases/test_snapshot_clone.py::{test_clone_box_yields_independent_disk, test_export_import_roundtrip}`

| Call | Pre-fix | Post-fix |
| --- | --- | --- |
| `box.clone_box()` | `Remote server does not support clone` | reaches libboxlite via REST chain |
| `box.export(dest=...)` | `Remote server does not support export` | reaches libboxlite via REST chain |
| `runtime.import_box(path)` | `Remote server does not support import` | reaches libboxlite via REST chain |

Outstanding failure (`Failed to SIGSTOP shim process: Connection refused`) is a libboxlite snapshot-mechanism issue — same signature as #694's residual, reproducible against local FFI on the same host. Out of scope for this REST surface PR.

Branched off #694 — depends on snapshot baseline + cbindgen header refresh landing first.

G4614 added 2 commits June 9, 2026 06:23
…e 1)

Implements `box.snapshot.{create, list, get, restore, remove}` over the
REST chain. Pre-fix the SDK Rust REST client short-circuited every call
with "Remote server does not support snapshots operations" because the
API's /v1/config returned snapshots_enabled=false and there was no
runner-side handler for the snapshot URL space anyway.

Five layers added in this PR (≈1100 lines):

  - sdks/c/src/snapshot.rs (new)
      CSnapshotInfo + CSnapshotInfoList FFI types, async + callback
      variants for create/list/get/remove/restore, free helpers
      mirroring CBoxInfo's allocation conventions.
  - sdks/c/src/event_queue.rs
      4 new RuntimeEvent variants (Create/List/Remove/Restore — Get
      shares Create's payload shape) + 4 callback function types.
  - sdks/c/src/lib.rs / runtime.rs
      Register the module + dispatch the 4 new event variants through
      the existing dispatch_handle_event / dispatch_unit_event paths.
  - sdks/go/snapshot.go (new) + bridge.{c,h}
      Box.Snapshot{Create,List,Get,Remove,Restore} cgo wrappers, four
      //export goBoxliteOnSnapshot* callbacks, type bridging.
  - apps/runner/pkg/boxlite/client.go
      Client.Snapshot* methods that route through getOrFetchBox.
  - apps/runner/pkg/api/controllers/boxlite_snapshot.go (new)
      5 gin handlers + classifySnapshotError (mirrors classifyExecError
      pattern from boxlite-ai#690) so the SDK gets HTTP-typed errors instead of
      raw 5xx for caller-fixable cases.
  - apps/runner/pkg/api/server.go
      5 boxliteApi routes matching the SDK's URL shape:
        POST   /v1/boxes/:boxId/snapshots
        GET    /v1/boxes/:boxId/snapshots
        GET    /v1/boxes/:boxId/snapshots/:name
        DELETE /v1/boxes/:boxId/snapshots/:name
        POST   /v1/boxes/:boxId/snapshots/:name/restore
  - apps/api/src/boxlite-rest/boxlite-proxy.controller.ts
      3 new proxy routes covering the snapshot URL space (root, named,
      restore). Existing proxyToRunner machinery handles auth +
      runner discovery + path rewrite.
  - apps/api/src/boxlite-rest/boxlite-config.controller.ts
      Flips `snapshots_enabled: true` so the SDK's
      `require_snapshots_enabled` gate stops short-circuiting.
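
As a quick illustration of the URL shape listed above, the following Python sketch matches a method and path against the five snapshot routes. The route patterns are copied from the list; pairing each route with an operation name (create, list, get, remove, restore) is inferred from the method and path and is an assumption of this sketch, as is the `match_route` helper itself.

```python
import re

# Route shapes copied from the commit message; operation names are labels
# assigned for this sketch only.
ROUTES = [
    ("POST", r"/v1/boxes/(?P<boxId>[^/]+)/snapshots", "create"),
    ("GET", r"/v1/boxes/(?P<boxId>[^/]+)/snapshots", "list"),
    ("GET", r"/v1/boxes/(?P<boxId>[^/]+)/snapshots/(?P<name>[^/]+)", "get"),
    ("DELETE", r"/v1/boxes/(?P<boxId>[^/]+)/snapshots/(?P<name>[^/]+)", "remove"),
    ("POST", r"/v1/boxes/(?P<boxId>[^/]+)/snapshots/(?P<name>[^/]+)/restore", "restore"),
]


def match_route(method: str, path: str):
    """Return (operation, params) for the first matching route, or None."""
    for route_method, pattern, operation in ROUTES:
        if method != route_method:
            continue
        m = re.fullmatch(pattern, path)
        if m:
            return operation, m.groupdict()
    return None
```

For example, `match_route("POST", "/v1/boxes/b1/snapshots/s1/restore")` resolves to the restore operation with `boxId` and `name` extracted from the path.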

E2E status:

The REST plumbing is **verified end-to-end**: the SDK call now reaches
libboxlite on the runner instead of hitting the "Remote server does
not support" gate. With the e2e test stack:

  e2e test `test_snapshot_clone.py::test_snapshot_create_appears_in_list`:
    PRE  : RuntimeError: "Remote server does not support snapshots
           operations"  (short-circuit at SDK)
    POST : HTTP 500: "snapshot create failed: boxlite: internal error:
           Failed to SIGSTOP shim process (pid=…): Connection refused
           (os error 111)" (libkrun/libboxlite-side signal delivery
           issue against the stopped box — separate from the REST
           chain this PR builds)

The SIGSTOP error is a libboxlite snapshot mechanism issue (suspend a
stopped shim process for disk capture), not a REST surface bug. It's
reproducible against local FFI on the same EC2 host and out of scope
for this PR.

Clone / export / import REST support follows the same template; this
PR is the exemplar for those follow-ups.
…surfaces 2-4)

Adds the remaining three Class A operations (clone, export, import) over
the REST chain, following the same template PR boxlite-ai#694 established for
snapshot. Pre-fix the SDK Rust REST client short-circuited each call
with "Remote server does not support {clone,export,import} operations"
because the API's /v1/config returned those capabilities as false.

Layers:

  - sdks/c/src/clone_export.rs (new)
      `boxlite_box_clone_box` (returns CBoxHandle), `boxlite_box_export`
      (unit + error; caller already knows dest path),
      `boxlite_runtime_import_box` (returns CBoxHandle). Each async +
      callback, mirroring snapshot.rs.
  - sdks/c/src/event_queue.rs
      2 new RuntimeEvent variants (CloneBox uses OwnedFfiPtr<CBoxHandle>,
      ExportBox is unit) + 2 callback function types.
  - sdks/c/src/lib.rs / runtime.rs
      Register the module + dispatch through existing
      dispatch_handle_event / dispatch_unit_event paths.
  - sdks/go/clone_export.go (new) + bridge.{c,h}
      Box.CloneBox, Box.Export, Runtime.ImportBox + cgo bridge.
  - apps/runner/pkg/boxlite/client.go
      Client.CloneBox, Client.ExportBox, Client.ImportBox.
  - apps/runner/pkg/api/controllers/boxlite_clone_export.go (new)
      3 gin handlers + classifyCloneExportError. Export streams the
      archive bytes back to the SDK as the response body (the SDK
      writes to its caller-chosen host path). Import reads bytes from
      the request body, writes to a runner-local temp file, then calls
      ImportBox.
  - apps/runner/pkg/api/server.go
      3 new routes (POST /clone, POST /export, POST /import).
  - apps/api/src/boxlite-rest/boxlite-proxy.controller.ts
      Proxy routes for /clone, /export, and the runtime-level /import.
      Import has no boxId so it's routed via `pickRunnerForImport`
      (any runner the org has a sandbox on). If the org has no
      existing sandbox, returns 404 with an explanatory message.
  - apps/api/src/boxlite-rest/boxlite-config.controller.ts
      Flips `clone_enabled / export_enabled / import_enabled = true`
      so the SDK's `require_*_enabled` gates stop short-circuiting.
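
The import routing rule described above (pick any runner the org already has a sandbox on, otherwise return 404) can be sketched in Python as follows. This is not the NestJS implementation of `pickRunnerForImport`: the `Sandbox` record, the exception class, and the choice of the first match are assumptions for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Sandbox:
    """Hypothetical minimal view of a sandbox record."""
    org_id: str
    runner_id: Optional[str]  # None if the sandbox is not placed on a runner


class NoRunnerForImport(Exception):
    """Maps to an HTTP 404 with an explanatory message."""
    status_code = 404


def pick_runner_for_import(org_id: str, sandboxes: List[Sandbox]) -> str:
    """Return the runner of the first sandbox owned by the org.

    "Any runner" is satisfied here by taking the first match; a real
    implementation might prefer the least-loaded runner instead.
    """
    for sandbox in sandboxes:
        if sandbox.org_id == org_id and sandbox.runner_id is not None:
            return sandbox.runner_id
    raise NoRunnerForImport(
        f"organization {org_id} has no existing sandbox to route the import through"
    )
```

One consequence of this rule is that an organization with no sandboxes cannot import at all until it creates one, which is why the 404 carries an explanation.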

E2E status:

The REST plumbing is **verified end-to-end** — the SDK calls now reach
libboxlite on the runner instead of hitting the "Remote server does not
support" gate. With the e2e test stack:

  test_clone_box_yields_independent_disk:
    PRE  : RuntimeError: "Remote server does not support clone
           operations"  (SDK short-circuit)
    POST : HTTP 500 "clone failed: boxlite: internal error:
           Failed to SIGSTOP shim process (pid=…): Connection refused"
           (libkrun/libboxlite-side issue, identical signature to the
           snapshot pre-existing failure)

  test_export_import_roundtrip:
    PRE  : RuntimeError: "Remote server does not support export
           operations"  (SDK short-circuit)
    POST : HTTP 500 "export failed: boxlite: internal error:
           Failed to SIGSTOP shim process …"

The SIGSTOP failure is a libboxlite snapshot mechanism issue
reproducible against local FFI on the same EC2 host — the same one
PR boxlite-ai#694 documented. Out of scope for this REST surface PR.

Branched off fix/rest-snapshot-end-to-end (boxlite-ai#694) — depends on the
snapshot baseline + cbindgen header refresh landing first.

coderabbitai Bot commented Jun 9, 2026

Important

Review skipped

Draft detected.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

G4614 added a commit to G4614/boxlite that referenced this pull request Jun 9, 2026
`box.stop()` over REST is asynchronous (the API writes
`desiredState=STOPPED` and emits an event; runner.Stop() runs later
from the event handler), so a caller doing back-to-back
`b.stop(); b.snapshot.create(...)` hits the snapshot endpoint while
the underlying shim is still being torn down. libboxlite's snapshot
path then tries to SIGSTOP a half-dead shim via its control socket and
gets ECONNREFUSED:

  HTTP 500: snapshot create failed: boxlite: internal error:
  Failed to SIGSTOP shim process (pid=…): Connection refused (os error 111)

Pre-fix, this failed on every test that follows the stop→snapshot
pattern used by the Rust local integration suite. The local-FFI tests
dodge it implicitly because they call `runtime.get(box-name)` to fetch a
fresh handle after stop; the get path is synchronous and only completes
after libboxlite has finished the shim teardown.

This fix adds a `quiesceBox` helper to the runner-side Client that the
snapshot create + restore controllers route through:

  1. Drop any cached Go SDK Box handle (so the next getOrFetchBox does
     a fresh `runtime.Get`, matching the local-FFI pattern).
  2. Sleep a baseline 5s — empirically the API event handler chain
     takes 3–5s to flush, and the bx.Info() State field flips to
     "stopped" *before* libboxlite has finished releasing the shim
     control socket. A polling-only loop sees a green light too early.
  3. Then poll up to 30s for State to leave the Configured/Stopping
     transient bucket. If the deadline lapses, fall through with the
     latest handle so libboxlite surfaces the real error rather than
     this layer masking it with a timeout.
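
The three steps above can be sketched in Python as follows. This is an illustrative model, not the Go `quiesceBox` helper: the callback parameters, the poll interval, and the lower-cased state names are assumptions, and the sleep and clock functions are injectable only so the logic can be exercised without real waiting.

```python
import time
from typing import Callable

# State names taken from the description; the exact casing is an assumption.
TRANSIENT_STATES = {"configured", "stopping"}


def quiesce(
    drop_cached_handle: Callable[[], None],
    get_state: Callable[[], str],
    baseline: float = 5.0,
    deadline: float = 30.0,
    interval: float = 0.5,  # hypothetical poll interval
    sleep: Callable[[float], None] = time.sleep,
    now: Callable[[], float] = time.monotonic,
) -> str:
    """Wait for a box to settle and return the last observed state."""
    # 1. Drop the cached handle so the next lookup fetches a fresh one.
    drop_cached_handle()
    # 2. Unconditional baseline wait, because the reported state can flip
    #    to "stopped" before teardown has actually finished.
    sleep(baseline)
    # 3. Poll until the state leaves the transient bucket or the deadline
    #    lapses. On timeout, fall through with the latest state instead of
    #    raising, so the underlying error surfaces downstream.
    end = now() + deadline
    state = get_state()
    while state.lower() in TRANSIENT_STATES and now() < end:
        sleep(interval)
        state = get_state()
    return state
```

Returning the last state on timeout, rather than raising, mirrors the stated design choice of letting libboxlite report the real error.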

E2E regression test passes:
  scripts/test/e2e/cases/test_snapshot_clone.py::
  test_snapshot_create_appears_in_list  (PRE: SIGSTOP shim ECONNREFUSED;
                                         POST: PASS)
  test_snapshot_restore_reverts_disk    (same)

The clone / export REST surface (PR boxlite-ai#695) still surfaces a
different failure ("box not found" on the cloned/imported box's
libboxlite internal ID, because the API doesn't have a Sandbox entity
for it). That's an architectural mismatch (clone needs entity
registration), separate from this quiesce fix.