[runner-scale 1/3] feat: in-process box backup/restore (runtime + SDKs + runner) #1
Open
lilongen wants to merge 82 commits into
…gn docs
Migrated from session work on fix/sandbox-from-image-alpine to start
dedicated feat/cloud-mvp track. Contents:
apps/infra-local/
- goal.md
- poc/single_service.py Phase 0 — single postgres box PoC (✅)
- poc/multi_service.py Phase 1 — multi-service + host-as-hub (✅)
- poc/diagnose_network.py network diagnostic for box-to-box
- poc/diagnose_network.result captured diagnostic output
- poc/README.md Phase 0/1 docs + pass criteria
docs/apps/
- cloud-mvp-plan.md Foundation-first MVP roadmap (rewritten)
- cloud-mvp-plan.md.bak-mvp-deadline-version prior team/deadline version
- own-dog-food-local-infra-solution.md dogfood orchestrator design + Phase 1 results
- infra-vs-local-infra.md apps/infra vs apps/infra-local comparison
- apps-overview.md apps/ one-pager
- apps-comprehensive.md apps/ full breakdown
- apps-api-overview.md apps/api NestJS/TypeORM walkthrough
- api-client-go.md apps/api-client-go auto-gen overview
- sdk-feedback/ dogfood-surfaced SDK gaps:
- 01-host-boxlite-internal-unwired.md (+ linear-friendly variants)
- 02-postgres-trust-via-host-as-hub.md
BoxLite cloud MVP.md input PRD-style brief
PoC status:
- Phase 0 ✅ single postgres box, all 7 sub-phases pass
- Phase 1 ✅ multi-box, host-as-hub via Mac LAN IP, detach=True works
- 2 SDK bugs surfaced + documented
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Phase 1 multi-service + box-to-box networking via host.boxlite.internal validated end-to-end (12/12 phases pass). Revert PoC to non-default host ports (25432/26379) as the durable hygiene rule for any service that could collide with a local dev install; lift the same rule into the design doc (§3.8) and bake it into the planned doctor preflight (§1.7.F).
Concretizes parent design doc §12.2 into a Phase-2 implementation contract: walking skeleton (postgres-only end-to-end) before scaling to the full 10-service orchestrator. Flat package layout, explicit SERVICES registry, doctor port-preflight, integration tests on real BoxLite. Ready for handoff to writing-plans.
Bite-sized 10-task plan derived from the Phase 2 spec. TDD where the logic is testable in isolation (config, lsof parsing, topo_sort); integration test for the end-to-end orchestrator flow. Ready for subagent-driven or inline execution.
…pr + cover data_dir fallback
…ter test coverage
…althy) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…g_dir + empty-graph coverage Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ipe assertion + skip-doctor in itest Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Bite-sized 5-task plan. TDD for InfraConfig extensions + orchestrator helpers (_http_probe + _is_already_running_error); integration test proves end-to-end 5-service round-trip on real BoxLite.
… exception (debt #1) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
minio/minio:latest (RHEL UBI base) ships layers with directories having no owner-write bit. The SDK's per-start rootfs merge then fails with "storage error: Failed to ... Permission denied (os error 13)" or a "RustPanic" at write time. Apply owner-write idempotently to the extracted image cache before each start. Idempotent + cheap (~10ms). Remove when SDK fixes rootfs-merge to relax dir perms at extract time.
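The workaround described above amounts to one walk over the extracted image cache. A minimal sketch in Python, assuming a hypothetical `ensure_owner_write` helper and that only directories need the bit; the helper name and the cache location are illustrative, not taken from the commit:

```python
import os
import stat


def ensure_owner_write(cache_root: str) -> int:
    """Add the owner-write bit to every directory under cache_root.

    Returns how many directories were changed, so a second run over the
    same tree returns 0 (the operation is idempotent).
    """
    changed = 0
    # Top-down walk: listing a directory needs read/execute, not write,
    # so read-only directories can still be traversed and then fixed.
    for dirpath, _dirnames, _filenames in os.walk(cache_root):
        mode = stat.S_IMODE(os.stat(dirpath).st_mode)
        if not mode & stat.S_IWUSR:
            os.chmod(dirpath, mode | stat.S_IWUSR)
            changed += 1
    return changed
```

Returning the change count makes the idempotency claim easy to check: the first call reports the repaired directories and any repeat call reports none.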
Dashboard's terminal/preview features call
/api/sandbox/:id/ports/:port/signed-preview-url and load the returned URL
in an iframe. The URL shape is `http://<port>-<token>.<proxy-domain>`
(e.g. 22222-abc.localhost:28080). For this to work the full chain needs:
browser → Caddy :28080 → apps/proxy :4000 → runner :3003 → sandbox
Adds a Caddy host-regexp matcher that detects the `<digits>-<token>.`
subdomain pattern and reverse-proxies to the host's port 4000, where
apps/proxy listens. apps/proxy resolves the token via Redis to a sandbox +
runner, then forwards (with auth headers) to the runner's
`/sandboxes/<id>/toolbox/*` endpoint.
Also adds apps/proxy to go.work so it builds against the local common-go /
api-client-go workspace modules. apps/proxy itself is built the same way
as runner (`go build ./cmd/proxy`); env vars match SST's prod config
(PROXY_API_KEY, BOXLITE_API_URL, OIDC_*, REDIS_*).
Verified:
- curl with Host: 22222-<token>.localhost:28080 through Caddy returns the
runner's xterm.js terminal HTML (HTTP 200).
- WebSocket upgrade reaches runner (101 Switching Protocols).
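The `<digits>-<token>.` host match can be illustrated outside Caddy. This is a rough Python sketch of the same pattern, assuming the token is everything between the first hyphen and the first dot; `parse_preview_host` and the exact regular expression are illustrative and are not the matcher committed in the Caddyfile:

```python
import re
from typing import Optional, Tuple

# Assumed shape: "<digits>-<token>." at the start of the Host header.
# The token is taken as "anything up to the first dot", which is a guess.
PREVIEW_HOST = re.compile(r"^(?P<port>\d+)-(?P<token>[^.]+)\.")


def parse_preview_host(host: str) -> Optional[Tuple[int, str]]:
    """Return (port, token) for a preview host, or None for any other host."""
    match = PREVIEW_HOST.match(host)
    if match is None:
        return None
    return int(match.group("port")), match.group("token")
```

With the example host from the commit message, `parse_preview_host("22222-abc.localhost:28080")` yields `(22222, "abc")`, while a plain `localhost:28080` does not match and would fall through to the other routes.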
…md64
apps/runner/pkg/boxlite/registry.go hardcoded:
var linuxAmd64Platform = v1.Platform{OS: "linux", Architecture: "amd64"}
This is correct for prod (EC2 x86_64 runners), but on Apple Silicon
(M5) native runners it pulled amd64 manifests for multi-arch
images. The microVM then booted an arm64 kernel with amd64 contents,
and every shell exec failed with:
failed to execvp err=ENOEXEC filename="/bin/sh"
ENOEXEC: Exec format error executing '/bin/sh'
Symptoms: sandbox created + state=started, but dashboard terminal
opens with "[Connection closed]" because exec/<id>/toolbox/ws can't
spawn a shell inside the microVM.
Fix: derive Architecture from `runtime.GOARCH` so the runner asks
the registry for the right manifest variant. Verified:
- Image pulled to local registry now reports architecture=arm64 in
its config blob.
- New sandbox boots; dashboard Terminal tab → Connect → shows
`root@boxlite:~#` live prompt (Ubuntu 22.04 arm64).
- WebSocket terminal session stays open instead of immediate close.
Together with the previous commit (Caddy → proxy wiring), this
completes the L2 end-to-end terminal feature for the M5 native
local stack.
…board)
Daytona-fork gates several server routes behind PostHog feature flags via
`@RequireFlagsEnabled([{flagKey:X, defaultValue:false}])`. When PostHog
isn't configured (no POSTHOG_API_KEY), every flag falls back to its
call-site `defaultValue: false` → the guard fails → NestJS returns 404
"Cannot POST/GET /api/..." (the "hide route from unauthorized callers"
pattern). In local dev this:
- breaks POST /api/regions (Create Region dialog)
- breaks GET /api/runners (Runners page list)
- breaks several /api/regions/:id sub-routes
- ...any other org_infrastructure / org_experiments / sandbox_spending
feature gated by these flags
Patches `OpenFeaturePostHogProvider` with a `bootstrapFlags` map that's
consulted ONLY when PostHog isn't configured. Wires the same flags as
dashboard's `LOCAL_DEV_FEATURE_FLAG_DEFAULTS` in
PostHogProviderWrapper.tsx so server + client agree on local-dev defaults:
organization_infrastructure: true
organization_experiments: true
dashboard_playground: true
dashboard_webhooks: true
dashboard_create-sandbox: true
sandbox_spending: true
Production with a real POSTHOG_API_KEY ignores `bootstrapFlags` entirely
and uses the PostHog control plane as before.
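The resolution order described above can be summarized in a few lines. The real provider is TypeScript; the following is only a Python rendering of the decision, where `resolve_flag` and the `remote_flags` mapping (standing in for whatever PostHog returns) are hypothetical names:

```python
from typing import Mapping, Optional

# Flag names and values copied from the commit message above.
BOOTSTRAP_FLAGS = {
    "organization_infrastructure": True,
    "organization_experiments": True,
    "dashboard_playground": True,
    "dashboard_webhooks": True,
    "dashboard_create-sandbox": True,
    "sandbox_spending": True,
}


def resolve_flag(
    flag_key: str,
    default_value: bool,
    posthog_api_key: Optional[str],
    remote_flags: Optional[Mapping[str, bool]] = None,
) -> bool:
    """Pick a flag value; bootstrap flags apply only without a PostHog key."""
    if posthog_api_key:
        # Configured deployment: the bootstrap map is ignored entirely.
        return (remote_flags or {}).get(flag_key, default_value)
    # Local dev without PostHog: bootstrap map first, then call-site default.
    return BOOTSTRAP_FLAGS.get(flag_key, default_value)
```

Under these assumptions a guard declared with `defaultValue: false` passes locally for the six listed flags, and behaves exactly as before once a key is present.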
Verified end-to-end via Playwright:
- POST /api/regions returns 201 (was 404)
- Dashboard "+ Create Region" dialog → submit → new custom region row
appears in the list immediately
… docs
Adds 7 wrapper scripts under apps/infra-local/scripts/ that orchestrate
the L2 native processes (API + Runner + Proxy + Dashboard) alongside the
existing L1 (BoxLite boxes via `python -m boxlite_local`):
- stack-build.sh: `go build` runner + proxy + `yarn install`
- stack-up.sh: ensure L1 up + start all/named L2 components
- stack-down.sh: stop named L2 components (preserves L1 by default)
- stack-restart.sh: bounce component(s); runner also rebuilds
- stack-status.sh: one-screen health: L1 boxes + L2 PIDs + ports
- stack-logs.sh: tail any component's log (or all)
- stack-reset.sh: soft/hard/nuke: progressively wipe runtime state
All exposed via Makefile targets (`make stack-*`). Component-level control
via `COMPONENTS=...` variable. Logs/PIDs under `apps/infra-local/.logs/`
(gitignored).
Each starter is idempotent: re-running stack-up only starts components
that are down. Each starter also kills any stale listener on its port
before launching (defends against EADDRINUSE from a crashed prior run).
stack-down's orphan-sweep step honors the component list: partial stops
(e.g. `make stack-restart COMPONENTS=runner`) leave the others untouched
(caught + fixed during the verification run below).
Verified end-to-end:
- Cold start: `make stack-reset && make stack-up` brings all 4 L2 + 10 L1
services healthy in ~60 s.
- Dashboard HMR: edit `ErrorBoundaryFallback.tsx` heading text → browser
reflects change in ~3 s without page reload (Vite HMR).
- API watch: change `health.controller.ts` response shape →
`curl /api/health` returns new shape in ~1 s (nx serve auto-rebuild).
- Runner rebuild: `make stack-restart COMPONENTS=runner` rebuilds binary +
restarts process, `/info` returns new appVersion in ~10 s;
api/proxy/dashboard PIDs unchanged (verified isolation).
Docs:
- `docs/apps/infra-local-status.md`: inventory of what's real vs. mock vs.
missing (L1 / L2 / L3 + 6 mocked + 7 absent)
- `docs/apps/infra-local-usage.md`: daily workflow guide; first section is
now the wrapper TL;DR
Adds regions-*.png, runners-*.png, term-*.png, test-*.png to the ignore list. These are Playwright artifacts from L2 verification runs and local dev/test sessions — not source.
…reate Personal org
JwtStrategy.validate() lazily creates a user row on first OIDC login but
omitted `personalOrganizationDefaultRegionId` from the CreateUserDto. The
OnAsyncEvent listener `OrganizationService.handleUserCreatedEvent` then
attempted to build the Personal organization with
`defaultRegionId=undefined`. The save silently failed (no async-event
result is awaited at the caller), leaving the user with no organization.
User-visible symptom on first OIDC login from a fresh DB (e.g. after
`make stack-reset` or a brand-new local stack):
- `GET /api/organizations` returns `[]`
- Dashboard's SelectedOrganizationProvider reads `organizations[0]` →
undefined → subsequent `.id` access throws
- ErrorBoundary renders "Cannot read properties of undefined (reading
'id')", so the dashboard never loads
Fix: pass `personalOrganizationDefaultRegionId` from
`config.defaultRegion.id` (env-driven, `DEFAULT_REGION_ID` with default
`'us'`). This is the same region the API auto-seeds at boot, so it always
exists by the time any user logs in.
Verified end-to-end:
- Wiped seeded org + user, cleared browser storage, re-login via Dex.
- User auto-created with Personal org `defaultRegionId='us'` and
organization_user row `role='owner'`.
- Dashboard navigates to /dashboard/onboarding successfully (no
ErrorBoundary).
Notes for prod parity:
- Behavior unchanged when running with a real PostHog / Auth0 deploy: the
field was just an undefined optional before; it's now an existing region
id. `configService.getOrThrow` ensures we fail loudly at startup if
`DEFAULT_REGION_ID` resolves to nothing.
- This complements the API `bootstrapFlags` patch (a5dd131): both remove
silent failure modes that only show up in fresh local DBs.
…oard
Two regressions caught while debugging create-sandbox:
1) stack-down's stop_component was killing the whole process group
(`kill -PGID`). stack-up.sh launches all 4 native components from the
same parent shell, so the nohup'd background jobs inherit the same pgid,
and `kill -PGID` on one took out the unrelated siblings (a
`stack-restart COMPONENTS=dashboard` knocked over api + proxy).
Fix: kill just the specific PID. The per-component pkill-by-name sweep in
the orphan-cleanup phase still picks up the actual server children
(nx serve → node, etc.).
2) Dashboard launch lacked VITE_API_URL=/api. The @boxlite-ai/sdk client
falls back to its prod default `https://app.boxlite.io/api` when
VITE_API_URL is unset, so create-sandbox calls escaped the sandbox and
failed with ERR_CONNECTION_CLOSED. Vite's dev-server proxy in
vite.config.mts already forwards /api → localhost:3001; we just needed to
point the SDK at the relative path.
Adds `seed-init-data.sh` + wires it into `stack-up` and surfaces it
as a standalone Makefile target. The script does NOT pre-insert
anything (the API self-seeds at boot via app.service.ts
initializeXxx). Instead it:
1. Restarts the API if running, so the seed cycle re-runs against
the truncated DB.
2. Polls until admin user + admin Personal org + default region 'us'
all land — proof that the API's onApplicationBootstrap completed.
3. Waits up to 7 minutes for the default `ubuntu:22.04` snapshot to
reach 'active' (the long pole: a cold pull from the local registry can
take 2-5 min on M5).
Updates `stack-reset` to truncate the `"user"` table too, so the
API's `if (await findOne(BOXLITE_ADMIN_USER_ID)) return` early-exit
guard doesn't strand it: with no admin user row, the API recreates
admin + personal org + api key + default snapshot fresh.
Updates `stack-up` to invoke `seed-init-data.sh --no-bounce` after
api+runner just started, so the first stack-up after a reset doesn't
need a manual follow-up — the wrapper returns only when the dashboard
can actually create a sandbox.
Verified end-to-end:
- Full TRUNCATE of all user-data tables (incl. `"user"`)
- `make stack-up` — API auto-seeds admin user/org/region (T+~5s),
default snapshot enters PENDING → PULLING (then registry box was
hung; restart unblocked it) → ACTIVE
- `curl POST /api/sandbox` with admin key → HTTP 200, sandbox
starts in ~10s
- Dashboard "+ Create Sandbox" → auto-navigates to new sandbox →
Terminal tab → Connect → `root@boxlite:~#` live prompt
Also exposes `make seed-init-data` for ad-hoc verification any time.
…boxes
Adds `make stack-rebuild-l1-box BOX=<name>` — wraps
`boxlite rm boxlite-local-<name> --force && python -m boxlite_local up <name>`
for one-shot destroy + recreate of a stuck L1 service.
Surfaces two real failure modes seen this week:
1. Dex SQLite session db keeps stale grants across SIGKILL of the
containing box, so subsequent OIDC logins reuse the cached
access_token from a prior session. Browser thinks it's logged in
(oidc-client's `expires_at` is computed from refresh token TTL),
but `accessTokenIat` decodes to days ago and API returns 401.
Fix: `BOX=dex` resets the session db; clear browser storage and
re-login.
2. Registry box (`registry:2`) hangs after SIGKILL of boxlite-shim:
TCP listener stays up but the registry process inside doesn't
answer HTTP, so any snapshot pull hangs in PULLING forever.
`curl http://127.0.0.1:25000/v2/_catalog` 5s-timeout is the
positive identification. Fix: `BOX=registry`.
Both are added to the "Common issues" table in
`docs/apps/infra-local-usage.md` with the symptom → diagnosis → one-line
fix mapping.
The underlying root cause is shared: pkill -9 on the host-side
boxlite-shim can corrupt the persistent state of the in-box process.
SIGTERM via `make stack-down` does not have this issue.
…ree)
Adds section 5.5 to infra-local-usage.md that maps the 5 cleanup levels
(stack-restart → stack-rebuild-l1-box → stack-reset → stack-reset-hard →
stack-nuke) to concrete scenarios with timing.
Replaces the previous ad-hoc "what do I do if X" advice scattered across
sections with one decision table + 3 scenario walkthroughs:
1. Full rebuild (new machine / serious breakage): ~5 min
2. Reset + re-up (most common, dirty DB): ~60 s
3. Partial reset/up (90% daily use): ~3-10 s
Key principle surfaced: start at the lightest tier, escalate only if that
doesn't fix it. Don't blindly stack-nuke.
gitignore syntax does NOT support inline comments — `pattern # comment` is parsed as the literal path `pattern # comment`, not as a pattern with a side comment. Result: 3 known-local files (apps/apps symlink, apps/api/.swcrc, sdks/go/boxlite-c-v*/) were never actually ignored and kept appearing under `git status` for every dev. Moved all comments to their own lines. Verified with `git check-ignore -v` that each pattern now matches its target.
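A quick way to spot this class of mistake is to scan for non-comment lines that contain ` #`. The sketch below is a heuristic only, with a hypothetical `find_inline_comments` helper; it will also flag the rare pattern that legitimately contains a space followed by a hash, so `git check-ignore -v` remains the authoritative check:

```python
from typing import List, Tuple


def find_inline_comments(gitignore_text: str) -> List[Tuple[int, str]]:
    """Return (line_number, line) for lines that look like 'pattern # comment'.

    Only lines that START with '#' are comments in gitignore syntax, so a
    '#' later in the line is part of the pattern itself.
    """
    suspects = []
    for lineno, line in enumerate(gitignore_text.splitlines(), start=1):
        if not line.strip() or line.startswith("#"):
            continue  # blank line or genuine full-line comment
        if " #" in line:
            suspects.append((lineno, line.rstrip()))
    return suspects
```

Running it over a `.gitignore` before committing lists the lines whose trailing text git would treat as part of the path.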
One-page reference of every infra-local service: PostgreSQL, Redis, MinIO, Dex (OIDC), OCI registry, Jaeger, pgAdmin, OpenTelemetry collector, Registry UI, Caddy reverse proxy. For each: host port + in-box address + image + auth + data volume + sample one-liner. Single source of truth is the InfraConfig dataclass in boxlite_local/config.py — links throughout point at exact line ranges.
Adds a 'Documentation Language' section: every committed *.md, README, CONTRIBUTING, design note, ADR, plan file, and inline comment block must be in English. Non-English drafts are fine in scratch/chat but must be translated before `git add`. Trigger: the apps/infra-local/CONNECTIONS.md flow this week — a Chinese version was committed, immediately translated 1 commit later, then required a history rewrite to scrub the Chinese version. This rule prevents the same cycle from recurring. AI assistants should refuse to `git add` non-English markdown directly and ask the user to confirm translation first.
Adds docs/apps/milestones/2026-05-25-infra-local-ready.md, a single
English-only summary of what the ms/infra-local-ready tag delivers:
- Executive summary of the 3 layers (L1 infra-local / L2 native control
plane / L3 user sandboxes)
- Operate-by-make surface (12 stack-* + seed-init-data targets)
- End-to-end verified workflows
- Architecture changes per layer
- Key unblocking fixes (runner GOARCH, jwt.strategy seed, PostHog
bootstrapFlags, Caddy→Proxy wiring, .env symlink, SSH_GATEWAY_API_KEY
guard)
- Known mocked / missing services (PostHog, Billing, Webhooks, Snapshot
Manager, SSH Gateway, ClickHouse, OpenSearch, SMTP)
- Phase chronology (Phase 1 PoC → 3d wrap → L2 boot → L2 hardening)
- Candidate next milestones
The tag itself (0a71bb5) stays where it is; this doc is the release notes
for that point, committed after the tag, which is the conventional
pattern for milestone summaries.
Renames the milestone tag from ms/infra-local-ready to
apps/infra-local/v0.9.0 to match the existing component-version tag scheme
(e.g. sdks/go/v0.9.5).
Tag operation performed:
git tag -a apps/infra-local/v0.9.0 0a71bb5 -F <preserved-message>
git tag -d ms/infra-local-ready
Doc file renamed via git mv (history preserved):
docs/apps/milestones/2026-05-25-infra-local-ready.md →
docs/apps/milestones/2026-05-25-apps-infra-local-v0.9.0.md
In-doc tag references updated. Tag now sorts cleanly alongside sdks/go/v*
in `git tag -l '*/*'`.
Renames again to match the existing repo convention
(milestone/<component>/v<n.n.n>) already used by
milestone/scale-runner/v0.1.0.
Tag operation:
git tag -a milestone/infra-local/v0.1.0 0a71bb5 -F <preserved-msg>
git tag -d apps/infra-local/v0.9.0
Doc renamed via git mv; in-doc tag refs updated. Memory updated too.
Adds a second account to the dex staticPasswords block so the local
stack ships with both an admin path and a regular-user path for
dashboard E2E testing.
- email: test01@boxlite.dev
- password: password
- userID: 5678
- OIDC sub: CgQ1Njc4EgVsb2NhbA (base64 of protobuf
`{userID:'5678', connectorID:'local'}`)
This account behaves like any other OIDC login: on first sign-in the
API's JwtStrategy auto-creates the `user` row + `Personal` organization
+ `organization_user` owner-of-own-org via the
OrganizationService.handleUserCreatedEvent listener. No SQL seed is
needed — the dex config IS the seed, applied automatically on every
`make stack-up` / `make stack-nuke && make stack-up` (because dex
reads staticPasswords from services.py on every box start, and the
stack-reset truncates `"user"` so all API auto-seed paths re-run).
Updates CONNECTIONS.md:
- §1 'Existing DB user rows' table now lists all three users
- §4 'Built-in login accounts' now documents both dex accounts with
username / userID / expected OIDC sub / platform role
- §4 'OAuth clients' references `make stack-rebuild-l1-box BOX=dex`
(the modern wrapper) instead of `make restart-svc dex` (which no
longer exists)
Verified: dex rebuilt → browser login as test01@boxlite.dev/password →
dashboard renders onboarding page with `test01` in the top-right user
chip → DB confirms 3 users (boxlite-admin/admin/test01), each owning
their own Personal org.
Documentation cleanup, English translation per CLAUDE.md, alignment with
what shipped in milestone/infra-local/v0.1.0 (2026-05-25), and one fix
that unblocks dashboard-initiated sandbox creation on the M5 native dev
runner.
Documentation
-------------
- Delete 21 historical / PoC files: apps/infra-local/poc/ (5 PoC
scripts), docs/superpowers/specs|plans (7 phase spec/plan files),
docs/apps/{cloud-mvp-plan,apps-overview,apps-comprehensive,
api-client-go,apps-api-overview}.md, *.bak files, and the
pre-MVP "BoxLite cloud MVP.md" draft.
- Translate 4 user-facing docs from Chinese to English per CLAUDE.md
"Documentation Language" rule:
docs/apps/infra-local-{status,usage}.md
docs/apps/infra-vs-local-infra.md
docs/apps/own-dog-food-local-infra-solution.md
- Update apps/infra-local/{README,CONNECTIONS}.md and the design docs
to describe L1 + L2 orchestration via `make stack-*` (the actual
implementation), not the docker-compose + Lima route in the original
design doc.
- Fix broken cross-references in surviving docs after the cleanup.
Dev runner-score override
-------------------------
apps/infra-local/scripts/stack-up.sh exports the following before
launching the API:
RUNNER_AVAILABILITY_SCORE_THRESHOLD=5 (prod default 10)
RUNNER_MEMORY_PENALTY_THRESHOLD=95 (prod default 75)
RUNNER_DISK_PENALTY_THRESHOLD=95 (prod default 75)
Root cause: the Go runner reports host-wide CPU/RAM/disk usage to the
API, not just what the runner + its boxes consume. On a real EC2 host
that's the right signal. On a dev Mac sharing RAM with VS Code, Chrome,
Docker Desktop, and the L1 dev stack itself, those metrics routinely
exceed the prod 75% penalty threshold and drag the runner's
availabilityScore below the 10-default cutoff. The API then rejects
sandbox-create with "No available runners" even though the runner is
idle.
Documented in apps/infra-local/CONNECTIONS.md (new "Dev-only
runner-score overrides" section) and the v0.1.0 milestone
"Known limitations" table. The structural fix (have the runner report
only its own / boxes-owned resources) is tracked as a follow-up
outside this milestone.
Stack-reset.sh soft-reset behavior
----------------------------------
apps/infra-local/scripts/stack-reset.sh: soft reset now PRESERVES
identity + infra (user / organization / organization_user /
organization_role / region / runner / api_key) and only TRUNCATEs
runtime data (sandbox / snapshot / snapshot_runner / audit_log).
Result: an already-logged-in browser session survives a soft reset —
no forced re-login. --hard still wipes schema entirely; --nuke still
destroys L1 boxes too.
End-to-end verified
-------------------
Two full cold-start cycles (make stack-nuke && make stack-build &&
make stack-up), each followed by:
- Dashboard login via Dex (admin@boxlite.dev / password)
- Snapshots page shows ubuntu:22.04 as Active
- Create Sandbox via dashboard UI -> state Started (first attempt)
- Terminal Connect -> root@boxlite:~# prompt in iframe
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
New principle on this branch: local infra targets the M5 native runner
only. Lima-based multi-host runner support is being explored in a separate
worktree and is deliberately not in scope here.
Doc / comment edits:
- apps/infra-local/goal.md: translated from Chinese to English per
CLAUDE.md "Documentation Language" rule.
- apps/infra-local/tests/integration/test_e2e_full.py: drop "and Lima
runner VM later" from the resource-budget docstring.
- docs/apps/infra-local-status.md: drop "no Lima" qualifiers from the
platform line and the L2 runner row.
- docs/apps/milestones/2026-05-25-milestone-infra-local-v0.1.0.md: drop
"no Lima" from the headline; reword to "everything runs natively on M5".
- docs/apps/infra-vs-local-infra.md: replace the entire §2 "Why Lima
instead of HVF" decision archive (180+ lines) with a short "Runner
placement on this branch: M5 native (HVF)" section. Update §1 topology +
design decisions, §3 comparison-table rows (runner / sandbox isolation /
autoscaler InfraProvider / multi-runner support), §4.1/4.2 asymmetries,
§6 file pointers, and §7 one-sentence summary to match the M5-native
reality. Production-parity tradeoff is acknowledged but flagged as future
work outside this milestone.
- docs/apps/own-dog-food-local-infra-solution.md: rewrite §2.2, §2.4
(runner path), key-design-choices list, repo layout tree, §5.1 resource
budget, §11 decision table, and §12.2 phase plan to describe the M5 native
runner instead of a runner-in-Lima.
Verification:
- grep -iw "lima|limactl|LimaInfraProvider" → 0 hits across all PR-scope
files.
- Python CJK regex check → 0 CJK chars across all PR-scope files.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ke install
The previous §1 jumped from "yarn / go / python already installed"
straight to `make stack-build && make stack-up`, skipping two things
a fresh checkout actually needs:
1. The Python orchestrator package isn't installed yet — `python -m
boxlite_local` doesn't work until `make install` runs `pip install
-e ".[test]"`.
2. The `boxlite` Python SDK and CLI must already be present in the
active environment (it's a transitive dep of `boxlite_local`, not
installed by `make install`).
Restructured §1 into three sub-sections:
- §1.1 Prereqs — table listing the actual required tools + versions,
plus a 3-line sanity check that surfaces missing prereqs before
`make` runs and produces a less-actionable failure.
- §1.2 Three-step bring-up — now correctly shows
make install (pip install the orchestrator package)
make stack-build (yarn + go builds)
make stack-up (L1 + L2 + seed)
with timing expectations (5-7 min cold, ~30 s-1 min warm).
- §1.3 First-time dashboard login — explicit credentials + the
end-to-end smoke (create sandbox → terminal → root@boxlite:~#)
so first-time users know what success looks like.
No behavior change — purely documentation.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… setup Add the first-time bring-up commands (make install + make stack-build + make stack-up) at the top of the TL;DR cheat sheet so the entire day-one workflow is visible in one block, without having to scroll to §1.2. Day-to-day flow keeps its own bring-up line for clarity. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…boot
After a machine reboot the L1 microVM boxes are gone but the postgres
data volume (~/.boxlite-local/data/pg/) persists on disk with the full
schema. `make stack-up` sees no postgres box, runs `make up-with-schema`,
which brings the box back (schema already present) and then runs
`make load-schema` — which previously hard-failed:
ERR: public schema already has 27 table(s).
Schema baseline is not idempotent. Run 'make wipe && make up' first
So every post-reboot `make stack-up` died at the schema step.
Fix: apply-schema.sh now treats an already-loaded schema as a no-op
instead of an error. When the public schema is non-empty it checks the
`migrations` table to distinguish:
- COMPLETE prior load (tables + migrations recorded) → skip, exit 0
- PARTIAL half-applied baseline (tables but no migrations) → still
refuse with exit 3 (genuinely broken state; needs `make wipe`)
This makes load-schema / up-with-schema / stack-up all idempotent across
reboots. The non-idempotent baseline itself is unchanged — we just stop
trying to re-apply it when it's already there.
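The three outcomes described above reduce to a small decision table. The script itself is shell; this Python sketch only restates the branching, with `decide_schema_action` and the two counts as illustrative inputs (how the script actually queries them is not shown in the commit message):

```python
from enum import Enum


class SchemaAction(Enum):
    LOAD = "load"      # empty public schema: apply the baseline
    SKIP = "skip"      # complete prior load: no-op, exit 0
    REFUSE = "refuse"  # partial load: genuinely broken, exit 3


def decide_schema_action(table_count: int, migration_count: int) -> SchemaAction:
    """Map the state of the public schema to what load-schema should do."""
    if table_count == 0:
        return SchemaAction.LOAD
    if migration_count > 0:
        return SchemaAction.SKIP
    return SchemaAction.REFUSE


# Exit codes stated in the commit message for the two non-load outcomes.
EXIT_CODES = {SchemaAction.SKIP: 0, SchemaAction.REFUSE: 3}
```

For the post-reboot case reported in the verification output (27 tables, 88 migrations), the function returns `SchemaAction.SKIP`.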
Verified on a live post-reboot stack:
Schema already loaded (27 tables, 88 migrations recorded) — skipping.
exit=0
and `make stack-up` then proceeds to L2 (api/runner/proxy/dashboard all up).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Goal: a single `make stack-up` should work from a fresh checkout, after a
reboot, or after `make stack-down`, with no need to remember to run
`make install` / `make stack-build` first.
Rather than wiring install/stack-build as hard `make` prerequisites (which
would force a pip-resolve + go-build on *every* stack-up, including the
fast daily restart loop), stack-up.sh now does both checks *conditionally*
so the common restart path pays nothing:
- New: if `python -c "import boxlite_local"` fails, run `make install`
before bringing up L1 (which calls `python -m boxlite_local`).
- Existing (kept): if /tmp/boxlite-runner or /tmp/boxlite-proxy is
missing, run stack-build.sh. Clarified in a comment that it only builds
when missing; use `make stack-restart COMPONENTS=runner` to rebuild after
a source change.
Combined with the load-schema idempotency fix, `make stack-up` is now the
single entry point in all scenarios:
fresh checkout → install + up-with-schema + build + L2
post-reboot → up-with-schema (schema skip) + build (/tmp cleared) + L2
post-down → up-with-schema (boxes back) + L2 (binaries + pkg present)
Docs updated (README §Quick start, infra-local-usage §0 + §1.2) to present
`make stack-up` as the one command, with the explicit targets kept as
optional for forcing a rebuild.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
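The conditional checks can be pictured as a function that returns only the steps still needed. stack-up.sh is a shell script; the Python below is a sketch of the same logic, where `plan_bringup` is a hypothetical name and the module and binary paths are the ones mentioned in the commit message:

```python
import importlib.util
import os
from typing import List, Sequence


def plan_bringup(
    module: str = "boxlite_local",
    binaries: Sequence[str] = ("/tmp/boxlite-runner", "/tmp/boxlite-proxy"),
) -> List[str]:
    """List the preparatory steps that are still required before L1/L2 start."""
    steps = []
    # Equivalent in spirit to: python -c "import boxlite_local"
    if importlib.util.find_spec(module) is None:
        steps.append("make install")
    # Build only when a binary is missing, never on every invocation.
    if any(not os.path.exists(path) for path in binaries):
        steps.append("stack-build.sh")
    return steps
```

On a warm machine with the package installed and both binaries present, the list is empty, which is why the daily restart loop pays no install or build cost.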
…e→stack-up dashboard
Schema loading
- Add scripts/build-all-in-one-sql.py: consolidates every apps/api TypeORM
migration (legacy + pre/post-deploy, 87 total → 539 stmts) into a single
sql/merged-schema.auto-gen.sql. Resolves TS-side ${...} interpolations,
inlines parameterized queries, and mirrors TypeORM's enum/constraint
auto-renames so the output loads cleanly from zero.
- `make load-schema` now regenerates + loads the merged schema; apply-schema.sh
defaults to it and accepts a SCHEMA_SQL_FILE override.
- Drop sql/schema-baseline.sql + sql/REFRESH.md: the prod pg_dump is no longer
the load source; schema is now generated from migrations (kept reachable via
SCHEMA_SQL_FILE if ever needed for an A/B comparison).
Fix: wipe → stack-up left the dashboard non-functional
- Root cause: `make down`/`wipe` tore down only the L1 boxes, leaving the L2
native procs (api/runner/proxy/dashboard) running. After a wipe the stale API
held connections to the destroyed-and-recreated DB and never re-ran
onApplicationBootstrap against the fresh DB → no admin user/org/region →
dashboard loads but is unusable. (Confirmed independent of the schema swap:
reproduced identically with the old baseline dump.)
- `make down`/`wipe` now stop L2 first (stack-down.sh).
- stack-up.sh stops any stale L2 when it (re)creates L1, covering teardown paths
that bypass make (stack-rebuild-l1-box, direct `boxlite rm`,
`python -m boxlite_local down`).
Verified: working-stack → make wipe → make stack-up → 27 tables / 87 migrations,
admin user + org + region seeded, dashboard :3000 + /api proxy + dex all HTTP 200.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Box.Export/Runtime.ImportBox FFI (C/Go/Node/Python SDKs) + runner CreateBackup/restore to S3 + id-preserving import. Foundation for scale-down live migration. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
lilongen pushed a commit that referenced this pull request Jun 9, 2026
… exception (debt #1) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Part 1 of 3 of the runner-scale work (manual add-runner / scale-down-runner).
In-process box `Export`/`Runtime.ImportBox` FFI across C/Go/Node/Python SDKs + the runner's S3 backup/restore (id-preserving, so `sandbox.id == box.id` survives migration).
Base: `feat/cloud-mvp`. Stacked: PR2 (infra) → PR3 (api) build on this; review/merge 1 → 2 → 3 into feat/cloud-mvp.