fix(rest): re-attach to drain backlog before emitting terminal result#3
Closed
G4614 wants to merge 1 commit into
Closed
fix(rest): re-attach to drain backlog before emitting terminal result#3G4614 wants to merge 1 commit into
G4614 wants to merge 1 commit into
Conversation
A fast command over the cloud (Go runner) could return exit 0 with empty
stdout. When the WS attach drops before the runner flushes output (a proxy /
high-latency cut), attach_ws_pump probed GET /executions/{id}, saw
"completed", and emitted the exit code WITHOUT re-attaching — dropping the
stdout still sitting in the runner's replay backlog. Local `boxlite serve`
masked it (low latency, the exit frame always arrives in-band).
On a disconnect-then-Terminal probe, re-attach once (immediately, no backoff)
so the runner replays its backlog and we leave via its authoritative exit
frame. Bounded to a single attempt with fallback to the probed exit code, so
a runner that never sends a closing exit frame can't hang the pump.
Independent of boxlite-ai#563 (an sdks/c FFI drain fix): this is the REST client path,
so it reproduced on both main and boxlite-ai#563.
Reproducer: ws_terminal_probe_after_cut_must_not_drop_buffered_stdout drives
the WS pump against a mock that cuts the first attach, reports completed/0,
and serves the backlog only on re-attach. Without the fix it observes empty
stdout with a clean exit; with the fix it drains "hello\n".
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Owner
Author
|
Superseded by cross-repo PR boxlite-ai#627 (fork → upstream). This fork-internal PR was opened by mistake. |
G4614
added a commit
that referenced
this pull request
Jun 1, 2026
The pre-fix GC scope was a single sweep of images/disk-images/*.ext4.
On a long-running `boxlite serve` (or any host that crash-leaked artifacts
without a runtime restart), the two biggest reclaim sources — orphan
boxes/<id>/ dirs (per-box qcow2 overlay + logs + sockets) and orphan
bases/*.qcow2 (snapshot/clone bases left when the `try_gc_base` cascade
crashed) — were unreachable from `boxlite gc`. Cleanup ran only at
startup or as a remove_box() side effect.
Now `collect_garbage` runs three sweeps in a fixed order:
1. boxes/<id>/ — id absent from box DB, past 10-min mtime grace
2. bases/*.qcow2 — base_path absent from base_disk table, past grace
3. images/disk-images/*.ext4 — unchanged behavior, but pinning is built
from live-only boxes (id in DB) so an orphan's stale qcow2 backing
chain doesn't keep image cache pinned
Empty `known` (DB read failure or fresh runtime with no boxes) short-
circuits #1 and treats every on-disk box as live in #3 — safer than
mistakenly deleting every box dir or every image-disk under a transient
DB error.
GcReport now carries box_dirs_removed / bases_removed alongside the
existing disk_images_removed counters; `boxlite gc` prints all three.
Tests: 3 new (sweeps_orphan_box_dirs_only,
orphan_box_dir_does_not_pin_image_disk_image, sweeps_orphan_bases_only)
plus existing 4 still green; CLI gc_cli updated for the new output shape.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
boxlite run <img> <fast-cmd>against the cloud (Go runner) intermittently returned exit 0 with empty stdout for commands that exit almost instantly (e.g.uptime). The command ran inside the box, but its output was silently dropped before reaching the client (~80% over the cloud path; localboxlite servemasked it via low attach latency).src/boxlite/src/rest/litebox.rs): when the attach WebSocket drops before an exit frame (a proxy / high-latency cut),attach_ws_pumpprobesGET /executions/{id}; on aTerminal("completed") result it emitted the exit code and returned without re-attaching — dropping the stdout still buffered in the runner's replay backlog.Terminalprobe, re-attach once, immediately (no backoff) so the runner replays its backlog and we exit via its authoritative exit frame. Bounded by aterminal_drain_attemptedflag, with fallback to the probed exit code so a runner that never sends a closing exit frame can't hang the pump.Independent of boxlite-ai#563 (an
sdks/cFFI drain fix): this is the REST client path, so the bug reproduced on bothmainand the boxlite-ai#563 branch. They are orthogonal fixes for the same user-visible symptom.Tradeoff
If the first connection delivered partial stdout before being cut, the re-attach replays the runner's full backlog, so the already-delivered prefix can be duplicated. This matches the existing
StillRunning/Unavailablereconnect semantics (the runner'sstreamBusreplays without an offset) and is strictly preferable to silent loss. Eliminating it entirely needs offset-based replay in the attach protocol (out of scope here).Test plan
ws_terminal_probe_after_cut_must_not_drop_buffered_stdout(reuses the in-process mock-WS harness): cuts the first attach, reportscompleted/exit 0, serves the backlog only on re-attach.mainbase: fails without the production change (left:"" right:"hello\n", clean exit 0), passes with it.ws_clean_exit_emits_result,ws_close_without_exit_falls_back_to_status,ws_text_error_frame_logs_but_continues.cargo fmt --check+cargo clippy -p boxlite --features rest --lib -D warningsclean on the change.🤖 Generated with Claude Code