Skip to content

fix(rest): re-attach to drain backlog before emitting terminal result#627

Open
G4614 wants to merge 2 commits into
boxlite-ai:mainfrom
G4614:fix/rest-attach-drain-backlog
Open

fix(rest): re-attach to drain backlog before emitting terminal result#627
G4614 wants to merge 2 commits into
boxlite-ai:mainfrom
G4614:fix/rest-attach-drain-backlog

Conversation

@G4614

@G4614 G4614 commented May 29, 2026

Copy link
Copy Markdown
Contributor

Re-attach once to drain server-buffered stdout when the attach WebSocket is cut before the exit frame, fixing silent stdout loss.
use local boxlite serve and REST POST request for E2E test

Test plan

Two-sided (re-attach disabled vs applied), toggled on the branch since the tests ship with the fix.

In-process unit reproducer (mock WS)

  • ws_terminal_probe_after_cut_must_not_drop_buffered_stdout — in-process mock-WS in rest/litebox.rs: the first attach is cut before the exit frame, GET /executions/{id} reports completed/exit 0, and the backlog (hello\n) is served only on re-attach.
observed pre-fix (re-attach disabled) post-fix
consumer stdout (fast cmd, late attach) "" — clean exit 0, output dropped "hello\n" — backlog drained
50×4 fast-command repro (Go SDK, cloud) 18–26/50 loss 0/50
test result FAIL (left:"" right:"hello\n") PASS

End-to-end reproducer (real boxlite serve + TCP proxy)

boxlite_rest_attach_drain_after_proxy_cut_serves_stdout (src/cli/tests/rest_e2e.rs) runs the actual production stack: real boxlite serve subprocess on an ephemeral port, the production BoxliteRuntime::rest client (the path through attach_ws_pump), and an inline async TCP proxy in front of the daemon that deterministically cuts the first WS upgrade after the 101 response and lets every subsequent connection (HTTP status probes, the second WS for the re-attach) through transparently. Inside the box: sh -c "echo hello" in alpine.

step code state result
A fix applied ok (6.57 s) — 1st WS cut → status Terminal → re-attach (2nd WS transparent) → runner replays hello\n + exit frame
B terminal_drain_attempted = true at rest/litebox.rs:563 (re-attach short-circuited) FAILEDassertion left == right left: "", right: "hello\n" — the silent stdout loss this PR fixes
C fix restored ok (8.71 s)

What this end-to-end test catches that the mock unit test doesn't:

  • Real WS upgrade handshake between axum (serve) and reqwest-tungstenite (REST client).
  • BoxliteRuntime::rest end-to-end: box create + exec + attach + status probe through one TCP proxy without any in-process scaffolding.
  • The 2nd-WS re-attach actually receiving the runner's backlog over a real WS subprotocol exchange (not a hand-crafted Message::Binary reply in the mock).
  • boxlite serve accepting the proxy-mediated traffic exactly as it would from any external REST consumer.

Orthogonal to #563 (an sdks/c FFI drain fix at the SDK layer): this is the REST client attach path, and the loss reproduced on both main and the #563 branch. Tradeoff: the re-attach replays the runner's full backlog without an offset, so a partial prefix delivered before the cut can be duplicated — this matches existing StillRunning/Unavailable reconnect semantics and is strictly preferable to silent loss (offset-based replay is out of scope).

A fast command over the cloud (Go runner) could return exit 0 with empty
stdout. When the WS attach drops before the runner flushes output (a proxy /
high-latency cut), attach_ws_pump probed GET /executions/{id}, saw
"completed", and emitted the exit code WITHOUT re-attaching — dropping the
stdout still sitting in the runner's replay backlog. Local `boxlite serve`
masked it (low latency, the exit frame always arrives in-band).

On a disconnect-then-Terminal probe, re-attach once (immediately, no backoff)
so the runner replays its backlog and we leave via its authoritative exit
frame. Bounded to a single attempt with fallback to the probed exit code, so
a runner that never sends a closing exit frame can't hang the pump.

Independent of boxlite-ai#563 (an sdks/c FFI drain fix): this is the REST client path,
so it reproduced on both main and boxlite-ai#563.

Reproducer: ws_terminal_probe_after_cut_must_not_drop_buffered_stdout drives
the WS pump against a mock that cuts the first attach, reports completed/0,
and serves the backlog only on re-attach. Without the fix it observes empty
stdout with a clean exit; with the fix it drains "hello\n".

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@G4614 G4614 force-pushed the fix/rest-attach-drain-backlog branch from d992bfa to d02ca31 Compare June 2, 2026 05:34
G4614 added a commit to G4614/boxlite that referenced this pull request Jun 2, 2026
…ite serve

The PR-bundled `ws_terminal_probe_after_cut_must_not_drop_buffered_stdout`
pins the drain bug with an in-process mock WS server — fast and
deterministic, but nothing about it speaks the actual production wiring.
Add a CLI integration test that does:

  1. `boxlite serve --port <ephemeral>` as a real subprocess.
  2. `curl GET /v1/config` and `curl GET /v1/boxes` against the daemon —
     proves the HTTP surface answers to any external client, not just our
     own reqwest.
  3. `boxlite --url http://127.0.0.1:<port> run --rm alpine:latest sh -c
     "echo hello"` to drive the REST attach path through the production CLI.
     Bytes flow: cli → reqwest POST → axum handler → guest exec → axum WS →
     reqwest-tungstenite → `attach_ws_pump` (the function PR boxlite-ai#627 patches)
     → stdout pipe → process exit. Asserts the client prints exactly
     `hello\n`.
  4. A second `curl GET /v1/config` after the exec, proving the daemon is
     still serving (the drain path can't have wedged the WS handler).
  5. `ServeGuard::Drop` sends SIGINT (not SIGTERM) so the daemon hits its
     axum `with_graceful_shutdown` future and calls
     `runtime.shutdown(timeout)` — stopping every running box before exit.
     SIGTERM bypasses that path, which would leak the box's shim and trip
     `PerTestBoxHome::Drop`'s leak check; the SIGINT comment in the test
     captures that surprise so the next person doesn't re-introduce the
     hard-kill.

What this catches that the mock test doesn't:
- On-wire JSON regressions in `POST /v1/boxes` / `POST /v1/boxes/{id}/exec`.
- WS upgrade handshake skew between axum and reqwest-tungstenite (the mock
  bypasses the upgrade entirely).
- The full `streamer.start()` ↔ `attach_ws_pump` cooperation under a real
  network socket, including PR boxlite-ai#627's reattach-on-Terminal path.

What it does NOT do: deterministically reproduce PR boxlite-ai#627's specific cut
race. That's the mock test's job. This test demonstrates the happy path
goes through the patched code without regression — if the REST surface
breaks in any way, this 12 s integration run fails loudly.

Result: 1 passed, 12.83 s on a warm cache. cargo clippy `-D warnings`,
fmt:check clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…t drain bug

The PR-bundled `ws_terminal_probe_after_cut_must_not_drop_buffered_stdout`
mocks the WS server in-process — deterministic but synthetic. Add an
integration test that runs the actual production stack and forces the same
bug deterministically end-to-end:

  1. Spawn real `boxlite serve` on an ephemeral port.
  2. Stand up an inline async TCP proxy that watches the `Upgrade:
     websocket` header pair. The FIRST WS upgrade gets its 101 response
     forwarded back to the client and then both halves of the TCP pair
     get dropped — exactly the "proxy idle-cut on the upgrade with no
     data frames" shape PR boxlite-ai#627's mock test pins. All subsequent
     connections (HTTP status probes, the SECOND WS for the re-attach
     drain) flow through transparently.
  3. Drive the box via the production `BoxliteRuntime::rest` REST client —
     same `attach_ws_pump` codepath PR boxlite-ai#627 patches — pointed at the
     proxy port. Box runs `sh -c "echo hello"` in alpine.
  4. Collect stdout off `Execution.stdout()` and `await Execution.wait()`.

With the fix in: 1st WS cut → status probe → Terminal → re-attach (2nd
WS transparent through proxy) → runner replays `"hello\n"` + exit frame →
test PASS (6.57 s).

Without the fix: 1st WS cut → status probe → Terminal → bails immediately
→ stdout channel empty → test FAILS at the assertion with `left:""
right:"hello\n"`. Verified locally by flipping
`terminal_drain_attempted` to `true` at `rest/litebox.rs:563`.

| step | code state                                  | result                                  |
|------|----------------------------------------------|-----------------------------------------|
| A    | fix applied                                  | `ok` (6.57 s)                          |
| B    | `terminal_drain_attempted = true` at L563    | **FAILED** — `left:""  right:"hello\n"` |
| C    | fix restored                                 | `ok` (8.71 s)                          |

What this catches that the mock test doesn't:
- Real WS upgrade handshake between axum (serve) and reqwest-tungstenite
  (REST client).
- `BoxliteRuntime::rest` end-to-end: box create + exec + attach + status
  probes through one TCP proxy without any in-process scaffolding.
- The 2nd-WS re-attach actually receiving the runner's backlog frames
  via a real WS subprotocol exchange (not a hand-crafted `Message::Binary`
  reply).
- `boxlite serve` accepting the proxy-mediated traffic as if from any
  external REST consumer.

`ServeGuard::Drop` sends SIGINT (not SIGTERM) so the daemon hits its
axum `with_graceful_shutdown` future — without that the box's shim leaks
through `PerTestBoxHome::Drop`'s leak check.

clippy `-D warnings` clean, fmt:check clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@G4614 G4614 force-pushed the fix/rest-attach-drain-backlog branch from d02ca31 to fe68b24 Compare June 2, 2026 06:03
@G4614 G4614 marked this pull request as ready for review June 2, 2026 06:45
G4614 added a commit to G4614/boxlite that referenced this pull request Jun 9, 2026
Existing make test:integration:* uses the local PyO3/FFI path
(`Boxlite.default()`) and bypasses both the NestJS API and
boxlite-runner. Bugs that surface only on the REST → API → runner chain
— e.g. boxlite-ai#563's exec-stdout drop, boxlite-ai#627's attach re-drain — leak into
production with the unit + integration suites all green.

This suite always uses `Boxlite.rest(opts)` against the local API
and proves the path on every run.

Layout (scripts/test/e2e/):
  bootstrap.sh             Install PG, Redis, registry, mountpoint-s3,
                           runner binary, systemd units. Idempotent.
  fixture_setup.py         Register alpine:3.23 + ubuntu:22.04
                           snapshots (waits for active), patch admin
                           org quota, write [profiles.p1] in
                           ~/.boxlite/credentials.toml. Idempotent.
  cases/conftest.py        rt / image / box fixtures — REST-only.
  cases/test_path_verification.py
                           Meta-test: asserts the runtime is REST-mode
                           against :3000, and one round-trip exec
                           leaves the runner journal with the box id.
                           Catches the case where SDK silently
                           degrades to local FFI or talks to the wrong
                           endpoint.
  cases/test_p0_6_exec_stdout_race.py
                           First real regression case — runs N boxes
                           × 2 execs and asserts loss rate ≤ 5%.
                           Against stock 0.9.5 runner: ~80–90% loss
                           (FAIL). Against a runner rebuilt with boxlite-ai#563:
                           0% loss (PASS).
  lib/path_verification.py Helpers used by the meta-test.
  two_sided.sh             Builds runner from MAIN_REF, runs pytest
                           (expect FAIL=1), builds from PR_REF, runs
                           pytest (expect PASS=0). Only verdict 1+0 is
                           "PR removes the bug + test detects it".
                           Restores original binary on any exit.

Make targets (make/test.mk):
  make test:e2e:setup       Bootstrap services + fixture data.
  make test:e2e             Run the suite.
  make test:e2e:two-sided   PR_REF=<branch> required.

CI workflow (.github/workflows/e2e-stack.yml):
  Triggers on push/PR touching sdks/c/src/exec/, src/boxlite/src/rest/,
  apps/runner/, apps/api/src/. Requires a self-hosted runner labeled
  `kvm` because GitHub-hosted runners lack /dev/kvm and cannot run
  libkrun.

Verified locally on alpine:3.23, py3.14, against both 0.9.5 stock
runner (1 FAIL of 3) and a runner rebuilt from
boxlite-ai#563 (3 PASS of 3).
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant