[runner-scale 1/3] feat: in-process box backup/restore (runtime + SDKs + runner) by lilongen · Pull Request #1 · lilongen/boxlite

lilongen · 2026-05-29T08:48:10Z

Part 1 of 3 of the runner-scale work (manual add-runner / scale-down-runner).

In-process Box.Export/Runtime.ImportBox FFI across the C/Go/Node/Python SDKs, plus the runner's S3 backup/restore (id-preserving, so sandbox.id == box.id survives migration).

Base: feat/cloud-mvp. Stacked: PR2 (infra) → PR3 (api) build on this; review/merge 1 → 2 → 3 into feat/cloud-mvp.

…gn docs
Migrated from session work on fix/sandbox-from-image-alpine to start dedicated feat/cloud-mvp track.
Contents:
apps/infra-local/
- goal.md
- poc/single_service.py: Phase 0, single postgres box PoC (✅)
- poc/multi_service.py: Phase 1, multi-service + host-as-hub (✅)
- poc/diagnose_network.py: network diagnostic for box-to-box
- poc/diagnose_network.result: captured diagnostic output
- poc/README.md: Phase 0/1 docs + pass criteria
docs/apps/
- cloud-mvp-plan.md: Foundation-first MVP roadmap (rewritten)
- cloud-mvp-plan.md.bak-mvp-deadline-version: prior team/deadline version
- own-dog-food-local-infra-solution.md: dogfood orchestrator design + Phase 1 results
- infra-vs-local-infra.md: apps/infra vs apps/infra-local comparison
- apps-overview.md: apps/ one-pager
- apps-comprehensive.md: apps/ full breakdown
- apps-api-overview.md: apps/api NestJS/TypeORM walkthrough
- api-client-go.md: apps/api-client-go auto-gen overview
- sdk-feedback/: dogfood-surfaced SDK gaps:
  - 01-host-boxlite-internal-unwired.md (+ linear-friendly variants)
  - 02-postgres-trust-via-host-as-hub.md
BoxLite cloud MVP.md: input PRD-style brief
PoC status:
- Phase 0 ✅ single postgres box, all 7 sub-phases pass
- Phase 1 ✅ multi-box, host-as-hub via Mac LAN IP, detach=True works
- 2 SDK bugs surfaced + documented
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Phase 1 multi-service + box-to-box networking via host.boxlite.internal validated end-to-end (12/12 phases pass). Revert PoC to non-default host ports (25432/26379) as the durable hygiene rule for any service that could collide with a local dev install; lift the same rule into the design doc (§3.8) and bake it into the planned doctor preflight (§1.7.F).
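A minimal sketch of what a port preflight for those non-default host ports might look like. The function names and the bind-probe technique are assumptions for illustration (the plan elsewhere mentions lsof parsing, which this does not reproduce); only the two port numbers come from the commit message.

```python
import socket

# Host ports named in the commit message; the service keys are illustrative.
HOST_PORTS = {"postgres": 25432, "redis": 26379}


def port_is_free(port: int, host: str = "127.0.0.1") -> bool:
    """Return True if nothing is bound to host:port (bind probe, not lsof)."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as probe:
        try:
            probe.bind((host, port))
        except OSError:
            return False
        return True


def preflight(ports: dict) -> list:
    """Return one human-readable problem per port that is already taken."""
    return [
        f"{name}: host port {port} is already in use"
        for name, port in ports.items()
        if not port_is_free(port)
    ]
```

Calling `preflight(HOST_PORTS)` before bringing services up would return an empty list when both ports are available.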

Concretizes parent design doc §12.2 into a Phase-2 implementation contract: walking skeleton (postgres-only end-to-end) before scaling to the full 10-service orchestrator. Flat package layout, explicit SERVICES registry, doctor port-preflight, integration tests on real BoxLite. Ready for handoff to writing-plans.

Bite-sized 10-task plan derived from the Phase 2 spec. TDD where the logic is testable in isolation (config, lsof parsing, topo_sort); integration test for the end-to-end orchestrator flow. Ready for subagent-driven or inline execution.
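For illustration, a topo_sort of this kind can be sketched with the standard-library graphlib. This is not necessarily how the package implements it, and the dependency edges below are hypothetical; the service names are borrowed from the later 5-service stack.

```python
from graphlib import TopologicalSorter


def topo_sort(deps: dict) -> list:
    """Order services so each one starts after everything it depends on.

    `deps` maps a service name to the set of services it depends on.
    Raises graphlib.CycleError if the graph contains a cycle.
    """
    return list(TopologicalSorter(deps).static_order())


# Hypothetical dependency graph: only minio-init depends on another service.
DEPS = {
    "pg": set(),
    "redis": set(),
    "minio": set(),
    "minio-init": {"minio"},
    "registry": set(),
}
order = topo_sort(DEPS)
```

Because the function is pure, it is easy to test in isolation, which matches the TDD split described above.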

…aclasses

…pr + cover data_dir fallback

…EP 604 unions

…ter test coverage

…althy) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…g_dir + empty-graph coverage Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…ser calls

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…ipe assertion + skip-doctor in itest Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

… already exist

5-service stack: pg (Phase 2) + redis/minio/minio-init/registry (3a). Introduces http_url healthcheck, one_shot lifecycle, repo_root resolution. Closes Phase-2 debt #1 (narrow start_service exception); defers debt #2 and tcp_port (no caller in 3a). Autonomous execution per /goal directive.

Bite-sized 5-task plan. TDD for InfraConfig extensions + orchestrator helpers (_http_probe + _is_already_running_error); integration test proves end-to-end 5-service round-trip on real BoxLite.

…ction

… exception (debt #1) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

minio/minio:latest (RHEL UBI base) ships layers with directories having no owner-write bit. The SDK's per-start rootfs merge then fails with "storage error: Failed to ... Permission denied (os error 13)" or a "RustPanic" at write time. Apply owner-write idempotently to the extracted image cache before each start. Idempotent + cheap (~10ms). Remove when SDK fixes rootfs-merge to relax dir perms at extract time.
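A rough sketch of the idempotent owner-write fix-up, assuming it only needs to touch directories. The function name and the `cache_root` argument are hypothetical; the real cache location is not stated in the commit message.

```python
import os
import stat


def ensure_owner_write(cache_root: str) -> int:
    """Add the owner-write bit to every directory under cache_root.

    Returns how many directories were changed, so a second run returns 0.
    Files are left untouched; symlinked directories are not followed.
    """
    changed = 0
    for dirpath, _dirnames, _filenames in os.walk(cache_root):
        mode = stat.S_IMODE(os.stat(dirpath).st_mode)
        if not mode & stat.S_IWUSR:
            os.chmod(dirpath, mode | stat.S_IWUSR)
            changed += 1
    return changed
```

Returning the change count makes the idempotency easy to assert: the first call on a read-only tree reports the fixed directories, and the next call reports none.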

…erm error

…not files)

… override

Dashboard's terminal/preview features call /api/sandbox/:id/ports/:port/signed-preview-url and load the returned URL in an iframe. The URL shape is `http://<port>-<token>.<proxy-domain>` (e.g. 22222-abc.localhost:28080). For this to work the full chain needs:
  browser → Caddy :28080 → apps/proxy :4000 → runner :3003 → sandbox
Adds a Caddy host-regexp matcher that detects the `<digits>-<token>.` subdomain pattern and reverse-proxies to the host's port 4000, where apps/proxy listens. apps/proxy resolves the token via Redis to a sandbox + runner, then forwards (with auth headers) to the runner's `/sandboxes/<id>/toolbox/*` endpoint.
Also adds apps/proxy to go.work so it builds against the local common-go / api-client-go workspace modules. apps/proxy itself is built the same way as runner (`go build ./cmd/proxy`); env vars match SST's prod config (PROXY_API_KEY, BOXLITE_API_URL, OIDC_*, REDIS_*).
Verified:
- curl with Host: 22222-<token>.localhost:28080 through Caddy returns the runner's xterm.js terminal HTML (HTTP 200).
- WebSocket upgrade reaches runner (101 Switching Protocols).
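The `<digits>-<token>.` host pattern can be illustrated with a small parser. This is a sketch only: the regular expression is not the one in the Caddy config, and the allowed token characters are an assumption.

```python
import re

# <digits>-<token>.<proxy-domain>[:port]; the token character class is assumed.
PREVIEW_HOST = re.compile(
    r"^(?P<port>\d+)-(?P<token>[A-Za-z0-9_-]+)\.(?P<domain>.+)$"
)


def parse_preview_host(host: str):
    """Split a preview Host header into (sandbox_port, token, proxy_domain).

    Returns None when the host does not follow the preview pattern, in which
    case a reverse proxy would fall through to its other routes.
    """
    match = PREVIEW_HOST.match(host)
    if match is None:
        return None
    return int(match["port"]), match["token"], match["domain"]
```

With the example host from the commit message, `parse_preview_host("22222-abc.localhost:28080")` yields the sandbox port 22222, the token `abc`, and the proxy domain `localhost:28080`.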

…md64
apps/runner/pkg/boxlite/registry.go hardcoded:
  var linuxAmd64Platform = v1.Platform{OS: "linux", Architecture: "amd64"}
This is correct for prod (EC2 x86_64 runners), but on native Apple Silicon runners (M-series, here an M5) it pulled amd64 manifests for multi-arch images. The microVM then booted an arm64 kernel with amd64 contents, and every shell exec failed with:
  failed to execvp err=ENOEXEC filename="/bin/sh"
  ENOEXEC: Exec format error executing '/bin/sh'
Symptoms: sandbox created + state=started, but dashboard terminal opens with "[Connection closed]" because exec/<id>/toolbox/ws can't spawn a shell inside the microVM.
Fix: derive Architecture from `runtime.GOARCH` so the runner asks the registry for the right manifest variant.
Verified:
- Image pulled to local registry now reports architecture=arm64 in its config blob.
- New sandbox boots; dashboard Terminal tab → Connect → shows `root@boxlite:~#` live prompt (Ubuntu 22.04 arm64).
- WebSocket terminal session stays open instead of immediate close.
Together with the previous commit (Caddy → proxy wiring), this completes the L2 end-to-end terminal feature for the M5 native local stack.
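The actual fix is in Go and uses `runtime.GOARCH`. Purely to illustrate the idea of deriving the requested image architecture from the host instead of hardcoding it, here is a Python analogue; the mapping table and function name are assumptions, not code from the runner.

```python
import platform
from typing import Optional

# Common platform.machine() values mapped to OCI architecture names.
_OCI_ARCH = {
    "x86_64": "amd64",
    "amd64": "amd64",
    "arm64": "arm64",
    "aarch64": "arm64",
}


def host_platform(machine: Optional[str] = None) -> dict:
    """Return the OCI platform to request for a Linux guest on this host."""
    machine = (machine or platform.machine()).lower()
    if machine not in _OCI_ARCH:
        raise ValueError(f"unsupported host architecture: {machine}")
    return {"os": "linux", "architecture": _OCI_ARCH[machine]}
```

The OS stays fixed at `linux` because the guest is a Linux microVM; only the architecture follows the host.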

…board)
Daytona-fork gates several server routes behind PostHog feature flags via `@RequireFlagsEnabled([{flagKey:X, defaultValue:false}])`. When PostHog isn't configured (no POSTHOG_API_KEY), every flag falls back to its call-site `defaultValue: false` → the guard fails → NestJS returns 404 "Cannot POST/GET /api/..." (the "hide route from unauthorized callers" pattern).
In local dev this:
- breaks POST /api/regions (Create Region dialog)
- breaks GET /api/runners (Runners page list)
- breaks several /api/regions/:id sub-routes
- ...any other org_infrastructure / org_experiments / sandbox_spending feature gated by these flags
Patches `OpenFeaturePostHogProvider` with a `bootstrapFlags` map that's consulted ONLY when PostHog isn't configured. Wires the same flags as dashboard's `LOCAL_DEV_FEATURE_FLAG_DEFAULTS` in PostHogProviderWrapper.tsx so server + client agree on local-dev defaults:
  organization_infrastructure: true
  organization_experiments: true
  dashboard_playground: true
  dashboard_webhooks: true
  dashboard_create-sandbox: true
  sandbox_spending: true
Production with a real POSTHOG_API_KEY ignores `bootstrapFlags` entirely and uses the PostHog control plane as before.
Verified end-to-end via Playwright:
- POST /api/regions returns 201 (was 404)
- Dashboard "+ Create Region" dialog → submit → new custom region row appears in the list immediately
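The resolution order described above can be sketched as follows. The real provider is TypeScript; this Python version only models the decision, and the function signature is hypothetical. Passing `None` for `posthog_flags` stands in for "PostHog is not configured".

```python
from typing import Mapping, Optional

# Local-dev defaults listed in the commit message.
BOOTSTRAP_FLAGS = {
    "organization_infrastructure": True,
    "organization_experiments": True,
    "dashboard_playground": True,
    "dashboard_webhooks": True,
    "dashboard_create-sandbox": True,
    "sandbox_spending": True,
}


def resolve_flag(
    key: str,
    default: bool,
    posthog_flags: Optional[Mapping[str, bool]],
    bootstrap: Mapping[str, bool] = BOOTSTRAP_FLAGS,
) -> bool:
    """Resolve a feature flag, consulting bootstrap values only without PostHog."""
    if posthog_flags is not None:
        # PostHog is configured: bootstrap values are ignored entirely.
        return posthog_flags.get(key, default)
    # No PostHog: bootstrap value if present, else the call-site default.
    return bootstrap.get(key, default)
```

The important property is that a configured PostHog always wins, so production behaviour is unchanged.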

… docs
Adds 7 wrapper scripts under apps/infra-local/scripts/ that orchestrate the L2 native processes (API + Runner + Proxy + Dashboard) alongside the existing L1 (BoxLite boxes via `python -m boxlite_local`):
  stack-build.sh: `go build` runner + proxy + `yarn install`
  stack-up.sh: ensure L1 up + start all/named L2 components
  stack-down.sh: stop named L2 components (preserves L1 by default)
  stack-restart.sh: bounce component(s); runner also rebuilds
  stack-status.sh: one-screen health: L1 boxes + L2 PIDs + ports
  stack-logs.sh: tail any component's log (or all)
  stack-reset.sh: soft/hard/nuke: progressively wipe runtime state
All exposed via Makefile targets (`make stack-*`). Component-level control via `COMPONENTS=...` variable. Logs/PIDs under `apps/infra-local/.logs/` (gitignored).
Each starter is idempotent: re-running stack-up only starts components that are down. Each starter also kills any stale listener on its port before launching (defends against EADDRINUSE from a crashed prior run).
stack-down's orphan-sweep step honors the component list: partial stops (e.g. `make stack-restart COMPONENTS=runner`) leave the others untouched (caught + fixed during the verification run below).
Verified end-to-end:
- Cold start: `make stack-reset && make stack-up` brings all 4 L2 + 10 L1 services healthy in ~60 s.
- Dashboard HMR: edit `ErrorBoundaryFallback.tsx` heading text → browser reflects change in ~3 s without page reload (Vite HMR).
- API watch: change `health.controller.ts` response shape → `curl /api/health` returns new shape in ~1 s (nx serve auto-rebuild).
- Runner rebuild: `make stack-restart COMPONENTS=runner` rebuilds binary + restarts process, `/info` returns new appVersion in ~10 s; api/proxy/dashboard PIDs unchanged (verified isolation).
Docs:
- `docs/apps/infra-local-status.md`: inventory of what's real vs. mock vs. missing (L1 / L2 / L3 + 6 mocked + 7 absent)
- `docs/apps/infra-local-usage.md`: daily workflow guide; first section is now the wrapper TL;DR

Adds regions-*.png, runners-*.png, term-*.png, test-*.png to the ignore list. These are Playwright artifacts from L2 verification runs and local dev/test sessions — not source.

…reate Personal org JwtStrategy.validate() lazily creates a user row on first OIDC login but omitted `personalOrganizationDefaultRegionId` from the CreateUserDto. The OnAsyncEvent listener `OrganizationService.handleUserCreatedEvent` then attempted to build the Personal organization with `defaultRegionId=undefined`. The save silently failed (no async-event result is awaited at the caller), leaving the user with no organization. User-visible symptom on first OIDC login from a fresh DB (e.g. after `make stack-reset` or a brand-new local stack): - `GET /api/organizations` returns `[]` - Dashboard's SelectedOrganizationProvider reads `organizations[0]` → undefined → subsequent `.id` access throws - ErrorBoundary renders "Cannot read properties of undefined (reading 'id')" — the dashboard never loads Fix: pass `personalOrganizationDefaultRegionId` from `config.defaultRegion.id` (env-driven, `DEFAULT_REGION_ID` with default `'us'`). This is the same region the API auto-seeds at boot, so it always exists by the time any user logs in. Verified end-to-end: - Wiped seeded org + user, cleared browser storage, re-login via Dex. - User auto-created with Personal org `defaultRegionId='us'` and organization_user row `role='owner'`. - Dashboard navigates to /dashboard/onboarding successfully (no ErrorBoundary). Notes for prod parity: - Behavior unchanged when running with a real PostHog / Auth0 deploy: the field was just an undefined optional before; it's now an existing region id. `configService.getOrThrow` ensures we fail loudly at startup if `DEFAULT_REGION_ID` resolves to nothing. - This complements the API `bootstrapFlags` patch (a5dd131) — both remove silent failure modes that only show up in fresh local DBs.

…oard
Two regressions caught while debugging create-sandbox:
1) stack-down's stop_component was killing whole process group (`kill -PGID`). stack-up.sh launches all 4 native components from the same parent shell, so the nohup'd background jobs inherit the same pgid, and `kill -PGID` on one took out the unrelated siblings (a `stack-restart COMPONENTS=dashboard` knocked over api + proxy). Fix: kill just the specific PID. The per-component pkill-by-name sweep in the orphan-cleanup phase still picks up the actual server children (nx serve → node, etc.).
2) Dashboard launch lacked VITE_API_URL=/api. The @boxlite-ai/sdk client falls back to its prod default `https://app.boxlite.io/api` when VITE_API_URL is unset, so create-sandbox calls escaped the sandbox and failed with ERR_CONNECTION_CLOSED. Vite's dev-server proxy in vite.config.mts already forwards /api → localhost:3001; we just needed to point the SDK at the relative path.
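The first regression can be reproduced in miniature. The sketch below is POSIX-only and uses hypothetical helper names; it starts two children from one parent, so they share a process group, and stops only one of them by PID.

```python
import os
import signal
import subprocess
import sys


def spawn_sleeper() -> subprocess.Popen:
    """Start a child that idles; it inherits the parent's process group."""
    return subprocess.Popen([sys.executable, "-c", "import time; time.sleep(30)"])


def stop_component(proc: subprocess.Popen) -> None:
    """Signal only this PID.

    os.killpg(os.getpgid(proc.pid), signal.SIGTERM) would instead hit every
    sibling launched from the same shell, which is the bug described above.
    """
    os.kill(proc.pid, signal.SIGTERM)
    proc.wait(timeout=10)
```

Stopping one sleeper this way leaves the other running, even though both report the same process group id.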

Adds `seed-init-data.sh` + wires it into `stack-up` and surfaces it as a standalone Makefile target. The script does NOT pre-insert anything (the API self-seeds at boot via app.service.ts initializeXxx). Instead it:
1. Restarts the API if running, so the seed cycle re-runs against the truncated DB.
2. Polls until admin user + admin Personal org + default region 'us' all land, which is proof that the API's onApplicationBootstrap completed.
3. Waits up to 7 minutes for the default `ubuntu:22.04` snapshot to reach 'active' (the long pole, cold pull from local registry can take 2-5 min on M5).
Updates `stack-reset` to truncate the `"user"` table too, so the API's `if (await findOne(BOXLITE_ADMIN_USER_ID)) return` early-exit guard doesn't strand it: with no admin user row, the API recreates admin + personal org + api key + default snapshot fresh.
Updates `stack-up` to invoke `seed-init-data.sh --no-bounce` after api+runner just started, so the first stack-up after a reset doesn't need a manual follow-up: the wrapper returns only when the dashboard can actually create a sandbox.
Verified end-to-end:
- Full TRUNCATE of all user-data tables (incl. `"user"`)
- `make stack-up`: API auto-seeds admin user/org/region (T+~5s), default snapshot enters PENDING → PULLING (then registry box was hung; restart unblocked it) → ACTIVE
- `curl POST /api/sandbox` with admin key → HTTP 200, sandbox starts in ~10s
- Dashboard "+ Create Sandbox" → auto-navigates to new sandbox → Terminal tab → Connect → `root@boxlite:~#` live prompt
Also exposes `make seed-init-data` for ad-hoc verification any time.
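The poll-then-wait steps follow a common pattern that could be sketched like this. The script itself is shell; the helper below is a hypothetical Python equivalent of the polling loop, with the sleep and clock functions injectable so it can be tested without real delays.

```python
import time
from typing import Callable


def wait_until(
    check: Callable[[], bool],
    timeout_s: float,
    interval_s: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Call check() until it returns True or timeout_s elapses.

    Returns True on success and False on timeout; check() always runs
    at least once, even with a zero timeout.
    """
    deadline = clock() + timeout_s
    while True:
        if check():
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)
```

The 7-minute snapshot wait would then be something like `wait_until(snapshot_is_active, timeout_s=7 * 60)`, where `snapshot_is_active` is a placeholder for whatever query the script performs.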

…boxes
Adds `make stack-rebuild-l1-box BOX=<name>`, which wraps `boxlite rm boxlite-local-<name> --force && python -m boxlite_local up <name>` for one-shot destroy + recreate of a stuck L1 service. Surfaces two real failure modes seen this week:
1. Dex SQLite session db keeps stale grants across SIGKILL of the containing box, so subsequent OIDC logins reuse the cached access_token from a prior session. Browser thinks it's logged in (oidc-client's `expires_at` is computed from refresh token TTL), but `accessTokenIat` decodes to days ago and API returns 401. Fix: `BOX=dex` resets the session db; clear browser storage and re-login.
2. Registry box (`registry:2`) hangs after SIGKILL of boxlite-shim: TCP listener stays up but the registry process inside doesn't answer HTTP, so any snapshot pull hangs in PULLING forever. A 5s timeout on `curl http://127.0.0.1:25000/v2/_catalog` is the positive identification. Fix: `BOX=registry`.
Both are added to the `docs/apps/infra-local-usage.md` "Common issues" table with the symptom → diagnosis → one-line fix mapping.
The underlying root cause both share: pkill -9 on the host-side boxlite-shim can corrupt persistent state of the in-box process. SIGTERM via `make stack-down` does not have this issue.
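The registry diagnosis (listener accepts connections but never answers HTTP) can be expressed as a small probe. This is a sketch with a hypothetical function name; the URL and the 5-second timeout are the ones quoted above.

```python
import http.client
import urllib.error
import urllib.request


def registry_answers(
    url: str = "http://127.0.0.1:25000/v2/_catalog",
    timeout_s: float = 5.0,
) -> bool:
    """Return True if the registry sends back any HTTP response in time.

    A hung registry accepts the TCP connection but never replies, which
    surfaces here as a timeout and therefore False.
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout_s):
            return True
    except urllib.error.HTTPError:
        return True  # it answered, just not with a 2xx status
    except (OSError, http.client.HTTPException):
        return False  # refused, timed out, or malformed response
```

A `False` result while the port is still listening would match the hung-registry symptom and point at `BOX=registry` as the fix.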

…ree)
Adds section 5.5 to infra-local-usage.md that maps the 5 cleanup levels (stack-restart → stack-rebuild-l1-box → stack-reset → stack-reset-hard → stack-nuke) to concrete scenarios with timing. Replaces the previous ad-hoc "what do I do if X" scattered across sections with one decision table + 3 scenario walkthroughs:
1. Full rebuild (new machine / serious breakage): ~5min
2. Reset + re-up (most common, dirty DB): ~60s
3. Partial reset/up (90% daily use): ~3-10s
Key principle surfaced: start at the lightest tier, escalate only if that doesn't fix it. Don't blindly stack-nuke.

gitignore syntax does NOT support inline comments — `pattern # comment` is parsed as the literal path `pattern # comment`, not as a pattern with a side comment. Result: 3 known-local files (apps/apps symlink, apps/api/.swcrc, sdks/go/boxlite-c-v*/) were never actually ignored and kept appearing under `git status` for every dev. Moved all comments to their own lines. Verified with `git check-ignore -v` that each pattern now matches its target.
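A quick lint for this mistake might look like the sketch below. It is a heuristic with a hypothetical function name: it flags any pattern line that contains ` #` after the pattern, which would also flag a path that legitimately contains that sequence.

```python
def inline_comment_lines(gitignore_text: str) -> list:
    """Return (line_number, line) for patterns followed by ' #...' on one line.

    Blank lines and full-line comments (starting with '#') are valid
    gitignore syntax and are skipped.
    """
    hits = []
    for number, raw in enumerate(gitignore_text.splitlines(), start=1):
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        if " #" in line:
            hits.append((number, raw))
    return hits
```

Any reported line should have its comment moved to a separate line, after which `git check-ignore -v` can confirm the pattern matches.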

One-page reference of every infra-local service: PostgreSQL, Redis, MinIO, Dex (OIDC), OCI registry, Jaeger, pgAdmin, OpenTelemetry collector, Registry UI, Caddy reverse proxy. For each: host port + in-box address + image + auth + data volume + sample one-liner. Single source of truth is the InfraConfig dataclass in boxlite_local/config.py — links throughout point at exact line ranges.

Adds a 'Documentation Language' section: every committed *.md, README, CONTRIBUTING, design note, ADR, plan file, and inline comment block must be in English. Non-English drafts are fine in scratch/chat but must be translated before `git add`. Trigger: the apps/infra-local/CONNECTIONS.md flow this week — a Chinese version was committed, immediately translated 1 commit later, then required a history rewrite to scrub the Chinese version. This rule prevents the same cycle from recurring. AI assistants should refuse to `git add` non-English markdown directly and ask the user to confirm translation first.

Adds docs/apps/milestones/2026-05-25-infra-local-ready.md — a single English-only summary of what the ms/infra-local-ready tag delivers: - Executive summary of the 3 layers (L1 infra-local / L2 native control plane / L3 user sandboxes) - Operate-by-make surface (12 stack-* + seed-init-data targets) - End-to-end verified workflows - Architecture changes per layer - Key unblocking fixes (runner GOARCH, jwt.strategy seed, PostHog bootstrapFlags, Caddy→Proxy wiring, .env symlink, SSH_GATEWAY_API_KEY guard) - Known mocked / missing services (PostHog, Billing, Webhooks, Snapshot Manager, SSH Gateway, ClickHouse, OpenSearch, SMTP) - Phase chronology (Phase 1 PoC → 3d wrap → L2 boot → L2 hardening) - Candidate next milestones The tag itself (0a71bb5) stays where it is; this doc is the release notes for that point, committed after the tag — conventional pattern for milestone summaries.

Renames the milestone tag from ms/infra-local-ready to apps/infra-local/v0.9.0 to match the existing component-version tag scheme (e.g. sdks/go/v0.9.5).
Tag operation performed:
  git tag -a apps/infra-local/v0.9.0 0a71bb5 -F <preserved-message>
  git tag -d ms/infra-local-ready
Doc file renamed via git mv (history preserved):
  docs/apps/milestones/2026-05-25-infra-local-ready.md → docs/apps/milestones/2026-05-25-apps-infra-local-v0.9.0.md
In-doc tag references updated. Tag now sorts cleanly alongside sdks/go/v* in `git tag -l '*/*'`.

Renames again to match the existing repo convention (milestone/<component>/v<n.n.n>) already used by milestone/scale-runner/v0.1.0.
Tag operation:
  git tag -a milestone/infra-local/v0.1.0 0a71bb5 -F <preserved-msg>
  git tag -d apps/infra-local/v0.9.0
Doc renamed via git mv; in-doc tag refs updated. Memory updated too.

Adds a second account to the dex staticPasswords block so the local stack ships with both an admin path and a regular-user path for dashboard E2E testing.
- email: test01@boxlite.dev
- password: password
- userID: 5678
- OIDC sub: CgQ1Njc4EgVsb2NhbA (base64 of protobuf `{userID:'5678', connectorID:'local'}`)
This account behaves like any other OIDC login: on first sign-in the API's JwtStrategy auto-creates the `user` row + `Personal` organization + `organization_user` owner-of-own-org via the OrganizationService.handleUserCreatedEvent listener. No SQL seed is needed: the dex config IS the seed, applied automatically on every `make stack-up` / `make stack-nuke && make stack-up` (because dex reads staticPasswords from services.py on every box start, and the stack-reset truncates `"user"` so all API auto-seed paths re-run).
Updates CONNECTIONS.md:
- §1 'Existing DB user rows' table now lists all three users
- §4 'Built-in login accounts' now documents both dex accounts with username / userID / expected OIDC sub / platform role
- §4 'OAuth clients' references `make stack-rebuild-l1-box BOX=dex` (the modern wrapper) instead of `make restart-svc dex` (which no longer exists)
Verified: dex rebuilt → browser login as test01@boxlite.dev/password → dashboard renders onboarding page with `test01` in the top-right user chip → DB confirms 3 users (boxlite-admin/admin/test01), each owning their own Personal org.

Documentation cleanup, English translation per CLAUDE.md, alignment with what shipped in milestone/infra-local/v0.1.0 (2026-05-25), and one fix that unblocks dashboard-initiated sandbox creation on the M5 native dev runner.
Documentation
-------------
- Delete 21 historical / PoC files: apps/infra-local/poc/ (5 PoC scripts), docs/superpowers/specs|plans (7 phase spec/plan files), docs/apps/{cloud-mvp-plan,apps-overview,apps-comprehensive,api-client-go,apps-api-overview}.md, *.bak files, and the pre-MVP "BoxLite cloud MVP.md" draft.
- Translate 4 user-facing docs from Chinese to English per CLAUDE.md "Documentation Language" rule:
    docs/apps/infra-local-{status,usage}.md
    docs/apps/infra-vs-local-infra.md
    docs/apps/own-dog-food-local-infra-solution.md
- Update apps/infra-local/{README,CONNECTIONS}.md and the design docs to describe L1 + L2 orchestration via `make stack-*` (the actual implementation), not the docker-compose + Lima route in the original design doc.
- Fix broken cross-references in surviving docs after the cleanup.
Dev runner-score override
-------------------------
apps/infra-local/scripts/stack-up.sh exports the following before launching the API:
  RUNNER_AVAILABILITY_SCORE_THRESHOLD=5 (prod default 10)
  RUNNER_MEMORY_PENALTY_THRESHOLD=95 (prod default 75)
  RUNNER_DISK_PENALTY_THRESHOLD=95 (prod default 75)
Root cause: the Go runner reports host-wide CPU/RAM/disk usage to the API, not just what the runner + its boxes consume. On a real EC2 host that's the right signal. On a dev Mac sharing RAM with VS Code, Chrome, Docker Desktop, and the L1 dev stack itself, those metrics routinely exceed the prod 75% penalty threshold and drag the runner's availabilityScore below the 10-default cutoff. The API then rejects sandbox-create with "No available runners" even though the runner is idle.
Documented in apps/infra-local/CONNECTIONS.md (new "Dev-only runner-score overrides" section) and the v0.1.0 milestone "Known limitations" table. The structural fix (have the runner report only its own / boxes-owned resources) is tracked as a follow-up outside this milestone.
Stack-reset.sh soft-reset behavior
----------------------------------
apps/infra-local/scripts/stack-reset.sh: soft reset now PRESERVES identity + infra (user / organization / organization_user / organization_role / region / runner / api_key) and only TRUNCATEs runtime data (sandbox / snapshot / snapshot_runner / audit_log). Result: an already-logged-in browser session survives a soft reset, with no forced re-login. --hard still wipes schema entirely; --nuke still destroys L1 boxes too.
End-to-end verified
-------------------
Two full cold-start cycles (make stack-nuke && make stack-build && make stack-up), each followed by:
- Dashboard login via Dex (admin@boxlite.dev / password)
- Snapshots page shows ubuntu:22.04 as Active
- Create Sandbox via dashboard UI -> state Started (first attempt)
- Terminal Connect -> root@boxlite:~# prompt in iframe
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
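To make the effect of the availability threshold concrete, here is a small model of the eligibility cutoff. It is a sketch under assumptions: the API's real scoring formula is not shown in the commit message, so only the final comparison is modelled, and treating a score equal to the threshold as eligible is a guess.

```python
import os
from typing import Mapping, Optional

# Prod defaults and dev overrides quoted in the commit message.
PROD_DEFAULTS = {
    "RUNNER_AVAILABILITY_SCORE_THRESHOLD": 10,
    "RUNNER_MEMORY_PENALTY_THRESHOLD": 75,
    "RUNNER_DISK_PENALTY_THRESHOLD": 75,
}
DEV_OVERRIDES = {
    "RUNNER_AVAILABILITY_SCORE_THRESHOLD": "5",
    "RUNNER_MEMORY_PENALTY_THRESHOLD": "95",
    "RUNNER_DISK_PENALTY_THRESHOLD": "95",
}


def threshold(name: str, env: Optional[Mapping[str, str]] = None) -> int:
    """Read a threshold from the environment, falling back to the prod default."""
    env = os.environ if env is None else env
    return int(env.get(name, PROD_DEFAULTS[name]))


def runner_is_eligible(score: float, env: Optional[Mapping[str, str]] = None) -> bool:
    """Assumed cutoff: a runner qualifies when its score reaches the threshold."""
    return score >= threshold("RUNNER_AVAILABILITY_SCORE_THRESHOLD", env)
```

Under this model an idle dev runner scoring 7 would be rejected with prod defaults and accepted with the dev overrides exported by stack-up.sh.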

New principle on this branch: local infra targets the M5 native runner only. Lima-based multi-host runner support is being explored in a separate worktree and is deliberately not in scope here. Doc / comment edits: - apps/infra-local/goal.md: translated from Chinese to English per CLAUDE.md "Documentation Language" rule. - apps/infra-local/tests/integration/test_e2e_full.py: drop "and Lima runner VM later" from the resource-budget docstring. - docs/apps/infra-local-status.md: drop "no Lima" qualifiers from the platform line and the L2 runner row. - docs/apps/milestones/2026-05-25-milestone-infra-local-v0.1.0.md: drop "no Lima" from the headline; reword to "everything runs natively on M5". - docs/apps/infra-vs-local-infra.md: replace the entire §2 "Why Lima instead of HVF" decision archive (180+ lines) with a short "Runner placement on this branch — M5 native (HVF)" section. Update §1 topology + design decisions, §3 comparison-table rows (runner / sandbox isolation / autoscaler InfraProvider / multi-runner support), §4.1/4.2 asymmetries, §6 file pointers, and §7 one-sentence summary to match the M5-native reality. Production-parity tradeoff is acknowledged but flagged as future work outside this milestone. - docs/apps/own-dog-food-local-infra-solution.md: rewrite §2.2, §2.4 (runner path), key-design-choices list, repo layout tree, §5.1 resource budget, §11 decision table, and §12.2 phase plan to describe the M5 native runner instead of a runner-in-Lima. Verification: - grep -iw "lima|limactl|LimaInfraProvider" → 0 hits across all PR-scope files. - Python CJK regex check → 0 CJK chars across all PR-scope files. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…ke install The previous §1 jumped from "yarn / go / python already installed" straight to `make stack-build && make stack-up`, skipping two things a fresh checkout actually needs: 1. The Python orchestrator package isn't installed yet — `python -m boxlite_local` doesn't work until `make install` runs `pip install -e ".[test]"`. 2. The `boxlite` Python SDK and CLI must already be present in the active environment (it's a transitive dep of `boxlite_local`, not installed by `make install`). Restructured §1 into three sub-sections: - §1.1 Prereqs — table listing the actual required tools + versions, plus a 3-line sanity check that surfaces missing prereqs before `make` runs and produces a less-actionable failure. - §1.2 Three-step bring-up — now correctly shows make install (pip install the orchestrator package) make stack-build (yarn + go builds) make stack-up (L1 + L2 + seed) with timing expectations (5-7 min cold, ~30 s-1 min warm). - §1.3 First-time dashboard login — explicit credentials + the end-to-end smoke (create sandbox → terminal → root@boxlite:~#) so first-time users know what success looks like. No behavior change — purely documentation. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

… setup Add the first-time bring-up commands (make install + make stack-build + make stack-up) at the top of the TL;DR cheat sheet so the entire day-one workflow is visible in one block, without having to scroll to §1.2. Day-to-day flow keeps its own bring-up line for clarity. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…boot
After a machine reboot the L1 microVM boxes are gone but the postgres data volume (~/.boxlite-local/data/pg/) persists on disk with the full schema. `make stack-up` sees no postgres box, runs `make up-with-schema`, which brings the box back (schema already present) and then runs `make load-schema`, which previously hard-failed:
  ERR: public schema already has 27 table(s). Schema baseline is not idempotent. Run 'make wipe && make up' first
So every post-reboot `make stack-up` died at the schema step.
Fix: apply-schema.sh now treats an already-loaded schema as a no-op instead of an error. When the public schema is non-empty it checks the `migrations` table to distinguish:
- COMPLETE prior load (tables + migrations recorded) → skip, exit 0
- PARTIAL half-applied baseline (tables but no migrations) → still refuse with exit 3 (genuinely broken state; needs `make wipe`)
This makes load-schema / up-with-schema / stack-up all idempotent across reboots. The non-idempotent baseline itself is unchanged; we just stop trying to re-apply it when it's already there.
Verified on a live post-reboot stack:
  Schema already loaded (27 tables, 88 migrations recorded) — skipping.
  exit=0
and `make stack-up` then proceeds to L2 (api/runner/proxy/dashboard all up).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
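The three-way decision in apply-schema.sh can be written down as a tiny pure function. The script is shell; the names below are hypothetical and only model the branching, taking the table count and the recorded-migration count as inputs.

```python
from enum import Enum


class SchemaAction(Enum):
    LOAD = "load"      # empty public schema: apply the schema
    SKIP = "skip"      # complete prior load: no-op
    REFUSE = "refuse"  # half-applied baseline: needs `make wipe`


# Exit codes stated in the commit message for the two non-load outcomes.
EXIT_CODES = {SchemaAction.SKIP: 0, SchemaAction.REFUSE: 3}


def decide(table_count: int, migration_count: int) -> SchemaAction:
    """Classify the database state before attempting a schema load."""
    if table_count == 0:
        return SchemaAction.LOAD
    if migration_count > 0:
        return SchemaAction.SKIP
    return SchemaAction.REFUSE
```

The post-reboot case from the verification run, 27 tables with 88 recorded migrations, falls into the skip branch.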

Goal: a single `make stack-up` should work from a fresh checkout, after a reboot, or after `make stack-down`, with no need to remember to run `make install` / `make stack-build` first.
Rather than wiring install/stack-build as hard `make` prerequisites (which would force a pip-resolve + go-build on *every* stack-up, including the fast daily restart loop), stack-up.sh now does both checks *conditionally* so the common restart path pays nothing:
- New: if `python -c "import boxlite_local"` fails, run `make install` before bringing up L1 (which calls `python -m boxlite_local`).
- Existing (kept): if /tmp/boxlite-runner or /tmp/boxlite-proxy is missing, run stack-build.sh. Clarified in a comment that it only builds when missing; use `make stack-restart COMPONENTS=runner` to rebuild after a source change.
Combined with the load-schema idempotency fix, `make stack-up` is now the single entry point in all scenarios:
  fresh checkout → install + up-with-schema + build + L2
  post-reboot → up-with-schema (schema skip) + build (/tmp cleared) + L2
  post-down → up-with-schema (boxes back) + L2 (binaries + pkg present)
Docs updated (README §Quick start, infra-local-usage §0 + §1.2) to present `make stack-up` as the one command, with the explicit targets kept as optional for forcing a rebuild.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
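The conditional checks can be modelled as a planning function that turns three probes into an ordered step list. This is an illustrative sketch: stack-up.sh is shell, the function names and step labels are hypothetical, and only the package name and the two /tmp paths come from the commit message.

```python
import importlib.util
import os


def probe_host() -> tuple:
    """Check whether the package and both binaries are already present."""
    return (
        importlib.util.find_spec("boxlite_local") is not None,
        os.path.exists("/tmp/boxlite-runner"),
        os.path.exists("/tmp/boxlite-proxy"),
    )


def plan_stack_up(
    package_importable: bool, runner_bin_exists: bool, proxy_bin_exists: bool
) -> list:
    """Return the steps stack-up would run, skipping work already done."""
    steps = []
    if not package_importable:
        steps.append("make install")
    steps.append("up-with-schema")
    if not (runner_bin_exists and proxy_bin_exists):
        steps.append("stack-build")
    steps.append("start L2")
    return steps
```

Calling `plan_stack_up(*probe_host())` on a machine where everything is present yields just the two cheap steps, which is why the daily restart loop pays nothing extra.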

…e→stack-up dashboard Schema loading - Add scripts/build-all-in-one-sql.py: consolidates every apps/api TypeORM migration (legacy + pre/post-deploy, 87 total → 539 stmts) into a single sql/merged-schema.auto-gen.sql. Resolves TS-side ${...} interpolations, inlines parameterized queries, and mirrors TypeORM's enum/constraint auto-renames so the output loads cleanly from zero. - `make load-schema` now regenerates + loads the merged schema; apply-schema.sh defaults to it and accepts a SCHEMA_SQL_FILE override. - Drop sql/schema-baseline.sql + sql/REFRESH.md: the prod pg_dump is no longer the load source; schema is now generated from migrations (kept reachable via SCHEMA_SQL_FILE if ever needed for an A/B comparison). Fix: wipe → stack-up left the dashboard non-functional - Root cause: `make down`/`wipe` tore down only the L1 boxes, leaving the L2 native procs (api/runner/proxy/dashboard) running. After a wipe the stale API held connections to the destroyed-and-recreated DB and never re-ran onApplicationBootstrap against the fresh DB → no admin user/org/region → dashboard loads but is unusable. (Confirmed independent of the schema swap: reproduced identically with the old baseline dump.) - `make down`/`wipe` now stop L2 first (stack-down.sh). - stack-up.sh stops any stale L2 when it (re)creates L1, covering teardown paths that bypass make (stack-rebuild-l1-box, direct `boxlite rm`, `python -m boxlite_local down`). Verified: working-stack → make wipe → make stack-up → 27 tables / 87 migrations, admin user + org + region seeded, dashboard :3000 + /api proxy + dex all HTTP 200. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Box.Export/Runtime.ImportBox FFI (C/Go/Node/Python SDKs) + runner CreateBackup/restore to S3 + id-preserving import. Foundation for scale-down live migration. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

lile and others added 30 commits May 20, 2026 17:31

feat(infra-local): scaffold boxlite_local package + pytest wiring

fc5615d

feat(infra-local): add types.py: ServiceSpec/HealthCheck/Doctor* dataclasses

66b3761

feat(infra-local): add InfraConfig with env-var override + tests

cd483f1

refactor(infra-local): clearer int-env error + hide pg_password in repr + cover data_dir fallback

c38e02c
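The two InfraConfig commits above (env-var override, clearer int-env error, password hidden from repr) could be sketched roughly as follows. The env-var names and field set are hypothetical; only the 25432 default port comes from this PR's description.

```python
import os
from dataclasses import dataclass, field


def _int_env(name: str, default: int) -> int:
    """Read an integer env var, naming the variable in the error message."""
    raw = os.environ.get(name)
    if raw is None:
        return default
    try:
        return int(raw)
    except ValueError:
        raise ValueError(f"{name} must be an integer, got {raw!r}") from None


@dataclass
class InfraConfig:
    pg_port: int = 25432
    # repr=False keeps the secret out of logs and tracebacks.
    pg_password: str = field(default="postgres", repr=False)

    @classmethod
    def from_env(cls) -> "InfraConfig":
        # Env-var names below are assumptions, not the package's real ones.
        return cls(
            pg_port=_int_env("BOXLITE_LOCAL_PG_PORT", 25432),
            pg_password=os.environ.get("BOXLITE_LOCAL_PG_PASSWORD", "postgres"),
        )
```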

feat(infra-local): add postgres SPEC + SERVICES registry

4091740

feat(infra-local): add exec_collect helper for box.exec stream draining

962d4d8

refactor(infra-local): drop dead bytes branch in exec_collect + use PEP 604 unions

3d4ccd7

feat(infra-local): add doctor preflight (SDK + runtime + port checks)

4931c52

refactor(infra-local): tighter lsof error path + clearer parser + better test coverage

92eafe2
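As an illustration of the port part of the doctor preflight: the commits above parse `lsof` output, whereas this sketch swaps in a plain socket bind, which is simpler but cannot report which process owns the port. The function names are hypothetical.

```python
import socket


def port_in_use(port: int, host: str = "127.0.0.1") -> bool:
    """Return True if binding host:port fails, i.e. something already holds it."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as probe:
        try:
            probe.bind((host, port))
        except OSError:
            return True
        return False


def conflicting_ports(ports: list[int]) -> list[int]:
    """Return the subset of ports that are already taken on localhost."""
    return [p for p in ports if port_in_use(p)]
```

A preflight could call `conflicting_ports([25432, 26379])` before `up` and refuse to start if the list is non-empty.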

feat(infra-local): add orchestrator (topo_sort + up/down/ps + wait_healthy)

127c118

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
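For context on the `topo_sort` named in the orchestrator commit, a dependency-ordered start can be sketched with Kahn's algorithm. This is not the package's implementation; the input shape (service name mapped to the names it depends on) is an assumption.

```python
from collections import deque


def topo_sort(deps: dict[str, list[str]]) -> list[str]:
    """Order services so each one comes after everything it depends on."""
    indegree = {name: 0 for name in deps}
    dependents: dict[str, list[str]] = {name: [] for name in deps}
    for name, requires in deps.items():
        for dep in requires:
            if dep not in deps:
                raise KeyError(f"{name} depends on unknown service {dep}")
            indegree[name] += 1
            dependents[dep].append(name)

    # Start with services that have no dependencies; sorted for stable output.
    ready = deque(sorted(n for n, d in indegree.items() if d == 0))
    order: list[str] = []
    while ready:
        current = ready.popleft()
        order.append(current)
        for child in dependents[current]:
            indegree[child] -= 1
            if indegree[child] == 0:
                ready.append(child)

    # Any service left unvisited sits on a cycle.
    if len(order) != len(deps):
        raise ValueError("dependency cycle detected")
    return order
```

`down` can then walk the same list in reverse so dependents stop before the services they rely on.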

refactor(infra-local): safer mkdir, bounded healthcheck probe, working_dir + empty-graph coverage

65cb74c

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

feat(infra-local): add CLI dispatch + python -m boxlite_local entry

6fc39e4

refactor(infra-local): drop unused Severity import + idiomatic subparser calls

7bc6a6c

test(infra-local): add integration smoke (doctor → up → ps → down)

8b4e5af

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

refactor(infra-local): promote get_runtime to public + load-bearing wipe assertion + skip-doctor in itest

f2a63b2

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

test(infra-local): skip integration test if any boxlite-local-* boxes already exist

7b50556

docs(infra-local): add Phase 3a implementation plan

da1a0f0

Bite-sized 5-task plan. TDD for InfraConfig extensions + orchestrator helpers (_http_probe + _is_already_running_error); integration test proves end-to-end 5-service round-trip on real BoxLite.
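The two orchestrator helpers named in the plan might look roughly like this. Both bodies are guesses for illustration: the real `_http_probe` may accept other status codes, and how the SDK signals an already-running box is not shown in this PR, so the message match below is an assumption.

```python
import http.client
import urllib.error
import urllib.request


def http_probe(url: str, timeout: float = 2.0) -> bool:
    """Return True only if a GET on url answers with a 2xx status in time."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except (urllib.error.URLError, http.client.HTTPException, OSError, ValueError):
        # Refused, timed out, 4xx/5xx, or malformed URL: treat all as unhealthy.
        return False


def is_already_running_error(exc: BaseException) -> bool:
    """Assumed heuristic: match on the error text rather than a typed error."""
    return "already running" in str(exc).lower()
```

Catching only these exception types, rather than a bare `except`, matches the "narrow exception" intent of debt #1: unexpected errors still surface instead of being read as "not healthy yet".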

feat(infra-local): extend InfraConfig with 3a fields + repo_root detection

d67a0cc

feat(infra-local): http_url healthcheck + one_shot lifecycle + narrow exception (debt #1)

97a5c41

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

feat(infra-local): add redis + minio + minio-init + registry specs

36a22b2

test(infra-local): rename + extend integration to 5-service round-trip

94a4701

fix(infra-local): retry box.start once after re-chmod on SDK rootfs perm error

64b3033
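The retry-once shape described in the fix above can be sketched as below. It assumes the SDK failure surfaces as a Python `PermissionError` and that restoring owner rwx on the rootfs directory is the repair; neither is confirmed by this PR, and `start` stands in for whatever actually calls box.start.

```python
import os
import stat
from typing import Callable, TypeVar

T = TypeVar("T")


def start_with_chmod_retry(start: Callable[[], T], rootfs_dir: str) -> T:
    """Call start(); on a permission error, re-chmod rootfs_dir and retry once."""
    try:
        return start()
    except PermissionError:
        # Restore owner read/write/execute, keeping the other mode bits as-is.
        mode = os.stat(rootfs_dir).st_mode
        os.chmod(rootfs_dir, mode | stat.S_IRWXU)
        # Exactly one retry: a second failure propagates to the caller.
        return start()
```

Limiting this to a single retry keeps a persistent permission problem visible instead of looping on it.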

fix(infra-local): inline minio-init script (SDK requires dir mounts, not files)

82970a1

fix(infra-local): one-shot exit detection via exec probe + entrypoint override

566ec09

lile and others added 25 commits May 22, 2026 13:58

chore: gitignore additional session test screenshot patterns

d563216

Adds regions-*.png, runners-*.png, term-*.png, test-*.png to the ignore list. These are Playwright artifacts from L2 verification runs and local dev/test sessions — not source.

lilongen mentioned this pull request May 29, 2026

[runner-scale 2/3] feat(infra): IInfraProvider abstraction + runner-ops orchestration + CLI #2

Open

lilongen force-pushed the feat/cloud-mvp branch from 57c5604 to 6071a90 Compare May 29, 2026 09:21

lilongen pushed a commit that referenced this pull request Jun 9, 2026

feat(infra-local): http_url healthcheck + one_shot lifecycle + narrow exception (debt #1)

c08f599

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

lilongen force-pushed the feat/cloud-mvp branch from 9358cdb to 503efc3 Compare June 9, 2026 03:28


[runner-scale 1/3] feat: in-process box backup/restore (runtime + SDKs + runner) #1
lilongen wants to merge 82 commits into feat/cloud-mvp from feat/runner-scale-1-backup
