
[runner-scale 1/3] feat: in-process box backup/restore (runtime + SDKs + runner)#1

Open
lilongen wants to merge 82 commits into feat/cloud-mvp from feat/runner-scale-1-backup


@lilongen

Owner

Part 1 of 3 of the runner-scale work (manual add-runner / scale-down-runner).

Adds the in-process Box.Export / Runtime.ImportBox FFI across the C/Go/Node/Python SDKs, plus the runner's S3 backup/restore (id-preserving, so sandbox.id == box.id survives migration).

Base: feat/cloud-mvp. Stacked: PR2 (infra) → PR3 (api) build on this; review/merge 1 → 2 → 3 into feat/cloud-mvp.

lile and others added 30 commits May 20, 2026 17:31
…gn docs

Migrated from session work on fix/sandbox-from-image-alpine to start
dedicated feat/cloud-mvp track. Contents:

apps/infra-local/
  - goal.md
  - poc/single_service.py        Phase 0 — single postgres box PoC (✅)
  - poc/multi_service.py         Phase 1 — multi-service + host-as-hub (✅)
  - poc/diagnose_network.py      network diagnostic for box-to-box
  - poc/diagnose_network.result  captured diagnostic output
  - poc/README.md                Phase 0/1 docs + pass criteria

docs/apps/
  - cloud-mvp-plan.md            Foundation-first MVP roadmap (rewritten)
  - cloud-mvp-plan.md.bak-mvp-deadline-version  prior team/deadline version
  - own-dog-food-local-infra-solution.md  dogfood orchestrator design + Phase 1 results
  - infra-vs-local-infra.md      apps/infra vs apps/infra-local comparison
  - apps-overview.md             apps/ one-pager
  - apps-comprehensive.md        apps/ full breakdown
  - apps-api-overview.md         apps/api NestJS/TypeORM walkthrough
  - api-client-go.md             apps/api-client-go auto-gen overview
  - sdk-feedback/                dogfood-surfaced SDK gaps:
    - 01-host-boxlite-internal-unwired.md  (+ linear-friendly variants)
    - 02-postgres-trust-via-host-as-hub.md

BoxLite cloud MVP.md             input PRD-style brief

PoC status:
  - Phase 0 ✅ single postgres box, all 7 sub-phases pass
  - Phase 1 ✅ multi-box, host-as-hub via Mac LAN IP, detach=True works
  - 2 SDK bugs surfaced + documented

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Phase 1 multi-service + box-to-box networking via host.boxlite.internal
validated end-to-end (12/12 phases pass). Revert PoC to non-default host
ports (25432/26379) as the durable hygiene rule for any service that
could collide with a local dev install; lift the same rule into the
design doc (§3.8) and bake it into the planned doctor preflight (§1.7.F).

Concretizes parent design doc §12.2 into a Phase-2 implementation
contract: walking skeleton (postgres-only end-to-end) before scaling to
the full 10-service orchestrator. Flat package layout, explicit
SERVICES registry, doctor port-preflight, integration tests on real
BoxLite. Ready for handoff to writing-plans.

Bite-sized 10-task plan derived from the Phase 2 spec. TDD where the
logic is testable in isolation (config, lsof parsing, topo_sort);
integration test for the end-to-end orchestrator flow. Ready for
subagent-driven or inline execution.
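
The `topo_sort` step named above is not shown in this PR page. A minimal sketch of one way to order services by dependency follows, assuming a plain dict registry; the dependency edges in the usage note are hypothetical, not taken from the actual SERVICES registry.

```python
from collections import deque


def topo_sort(deps):
    """Return service names so that each service follows its dependencies.

    `deps` maps a service name to the list of names it depends on
    (a hypothetical shape for the registry). Raises ValueError on an
    unknown dependency or on a cycle. An empty graph returns [].
    """
    for name, needs in deps.items():
        for dep in needs:
            if dep not in deps:
                raise ValueError(f"{name} depends on unknown service {dep}")

    # Count unmet dependencies and record who is waiting on whom.
    remaining = {name: len(set(needs)) for name, needs in deps.items()}
    dependents = {name: [] for name in deps}
    for name, needs in deps.items():
        for dep in set(needs):
            dependents[dep].append(name)

    # Kahn's algorithm; sorting keeps the output deterministic.
    ready = deque(sorted(n for n, count in remaining.items() if count == 0))
    order = []
    while ready:
        current = ready.popleft()
        order.append(current)
        for child in sorted(dependents[current]):
            remaining[child] -= 1
            if remaining[child] == 0:
                ready.append(child)

    if len(order) != len(deps):
        raise ValueError("dependency cycle detected")
    return order
```

For example, `topo_sort({"pg": [], "minio": [], "minio-init": ["minio"]})` places `minio` before `minio-init`.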
…althy)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…g_dir + empty-graph coverage

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ipe assertion + skip-doctor in itest

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
5-service stack: pg (Phase 2) + redis/minio/minio-init/registry (3a).
Introduces http_url healthcheck, one_shot lifecycle, repo_root resolution.
Closes Phase-2 debt #1 (narrow start_service exception); defers debt #2
and tcp_port (no caller in 3a). Autonomous execution per /goal directive.

Bite-sized 5-task plan. TDD for InfraConfig extensions + orchestrator
helpers (_http_probe + _is_already_running_error); integration test
proves end-to-end 5-service round-trip on real BoxLite.
… exception (debt #1)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
minio/minio:latest (RHEL UBI base) ships layers with directories having
no owner-write bit. The SDK's per-start rootfs merge then fails with
"storage error: Failed to ... Permission denied (os error 13)" or a
"RustPanic" at write time. Apply owner-write idempotently to the extracted
image cache before each start. Idempotent + cheap (~10ms). Remove when
SDK fixes rootfs-merge to relax dir perms at extract time.
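
The workaround's code is not shown here. As a rough sketch, an idempotent owner-write pass over an extracted cache directory could look like the following; the function name is hypothetical and the cache path is left to the caller.

```python
import os
import stat


def ensure_owner_write(root):
    """Add the owner-write bit to every directory under `root` that lacks it.

    Returns the number of directories changed, so a second run over the
    same tree returns 0 (the pass is idempotent).
    """
    changed = 0
    # os.walk does not follow directory symlinks by default.
    for dirpath, _dirnames, _filenames in os.walk(root):
        mode = stat.S_IMODE(os.lstat(dirpath).st_mode)
        if not mode & stat.S_IWUSR:
            os.chmod(dirpath, mode | stat.S_IWUSR)
            changed += 1
    return changed
```

Only directories are touched, matching the failure described above (directories without the owner-write bit); file modes are left alone.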
lile and others added 25 commits May 22, 2026 13:58
Dashboard's terminal/preview features call /api/sandbox/:id/ports/:port/
signed-preview-url and load the returned URL in an iframe. The URL shape
is `http://<port>-<token>.<proxy-domain>` (e.g. 22222-abc.localhost:28080).

For this to work the full chain needs:
  browser → Caddy :28080 → apps/proxy :4000 → runner :3003 → sandbox

Adds a Caddy host-regexp matcher that detects the `<digits>-<token>.`
subdomain pattern and reverse-proxies to the host's port 4000, where
apps/proxy listens. apps/proxy resolves the token via Redis to a sandbox
+ runner, then forwards (with auth headers) to the runner's
`/sandboxes/<id>/toolbox/*` endpoint.

Also adds apps/proxy to go.work so it builds against the local
common-go / api-client-go workspace modules. apps/proxy itself is built
the same way as runner (`go build ./cmd/proxy`); env vars match SST's
prod config (PROXY_API_KEY, BOXLITE_API_URL, OIDC_*, REDIS_*).

Verified:
- curl with Host: 22222-<token>.localhost:28080 through Caddy returns
  the runner's xterm.js terminal HTML (HTTP 200).
- WebSocket upgrade reaches runner (101 Switching Protocols).
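
The actual matcher lives in the Caddy config, which is not shown. For illustration only, the same `<digits>-<token>.` host pattern can be sketched in Python; the token character class (anything up to the first dot) is an assumption.

```python
import re

# Matches a leading "<digits>-<token>." label, e.g. "22222-abc.localhost:28080".
# The token character class is an assumption for this sketch.
PREVIEW_HOST = re.compile(r"^(?P<port>\d+)-(?P<token>[^.]+)\.")


def parse_preview_host(host):
    """Return (port, token) for a signed-preview host, or None if it does not match."""
    match = PREVIEW_HOST.match(host)
    if match is None:
        return None
    return int(match.group("port")), match.group("token")
```

A host such as `22222-abc.localhost:28080` yields port 22222 and token `abc`, while a plain `localhost:28080` does not match and would fall through to the other routes.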
…md64

apps/runner/pkg/boxlite/registry.go hardcoded:

  var linuxAmd64Platform = v1.Platform{OS: "linux", Architecture: "amd64"}

This is correct for prod (EC2 x86_64 runners), but on native Apple Silicon
runners (here an M5) it pulled amd64 manifests for multi-arch images.
The microVM then booted an arm64 kernel with amd64 contents,
and every shell exec failed with:

  failed to execvp err=ENOEXEC filename="/bin/sh"
  ENOEXEC: Exec format error executing '/bin/sh'

Symptoms: sandbox created + state=started, but dashboard terminal
opens with "[Connection closed]" because exec/<id>/toolbox/ws can't
spawn a shell inside the microVM.

Fix: derive Architecture from `runtime.GOARCH` so the runner asks
the registry for the right manifest variant. Verified:

- Image pulled to local registry now reports architecture=arm64 in
  its config blob.
- New sandbox boots; dashboard Terminal tab → Connect → shows
  `root@boxlite:~#` live prompt (Ubuntu 22.04 arm64).
- WebSocket terminal session stays open instead of immediate close.

Together with the previous commit (Caddy → proxy wiring), this
completes the L2 end-to-end terminal feature for the M5 native
local stack.
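
The fix itself is in Go and uses `runtime.GOARCH`. The same idea, deriving the OCI platform from the host instead of hardcoding it, can be sketched in Python as an analogue; the helper name and the mapping table are assumptions of this sketch, not the runner's code.

```python
import platform

# Host machine names (as reported by platform.machine()) mapped to the
# architecture names used in OCI image manifests.
_OCI_ARCH = {
    "x86_64": "amd64",
    "amd64": "amd64",
    "arm64": "arm64",
    "aarch64": "arm64",
}


def oci_platform(machine=None):
    """Return the linux platform to request, based on the host architecture."""
    machine = (machine or platform.machine()).lower()
    if machine not in _OCI_ARCH:
        raise ValueError(f"unsupported host architecture: {machine}")
    return {"os": "linux", "architecture": _OCI_ARCH[machine]}
```

On an Apple Silicon host this selects the arm64 manifest variant, and on an x86_64 EC2 host it keeps selecting amd64.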
…board)

Daytona-fork gates several server routes behind PostHog feature flags via
`@RequireFlagsEnabled([{flagKey:X, defaultValue:false}])`. When PostHog
isn't configured (no POSTHOG_API_KEY), every flag falls back to its
call-site `defaultValue: false` → the guard fails → NestJS returns 404
"Cannot POST/GET /api/..." (the "hide route from unauthorized callers"
pattern). In local dev this:

- breaks POST /api/regions (Create Region dialog)
- breaks GET /api/runners (Runners page list)
- breaks several /api/regions/:id sub-routes
- ...any other org_infrastructure / org_experiments / sandbox_spending
  feature gated by these flags

Patches `OpenFeaturePostHogProvider` with a `bootstrapFlags` map that's
consulted ONLY when PostHog isn't configured. Wires the same flags as
dashboard's `LOCAL_DEV_FEATURE_FLAG_DEFAULTS` in
PostHogProviderWrapper.tsx so server + client agree on local-dev defaults:

  organization_infrastructure: true
  organization_experiments:    true
  dashboard_playground:        true
  dashboard_webhooks:          true
  dashboard_create-sandbox:    true
  sandbox_spending:            true

Production with a real POSTHOG_API_KEY ignores `bootstrapFlags` entirely
and uses the PostHog control plane as before.

Verified end-to-end via Playwright:
- POST /api/regions returns 201 (was 404)
- Dashboard "+ Create Region" dialog → submit → new custom region row
  appears in the list immediately
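
The provider patch is TypeScript and is not reproduced on this page. The resolution order it describes can be sketched as follows; `resolve_flag` and the `posthog_lookup` callable are hypothetical stand-ins, and only the flag names come from the list above.

```python
# Local-dev defaults, copied from the flag list in this commit message.
BOOTSTRAP_FLAGS = {
    "organization_infrastructure": True,
    "organization_experiments": True,
    "dashboard_playground": True,
    "dashboard_webhooks": True,
    "dashboard_create-sandbox": True,
    "sandbox_spending": True,
}


def resolve_flag(flag_key, default_value, posthog_lookup=None,
                 bootstrap_flags=BOOTSTRAP_FLAGS):
    """Resolve a feature flag.

    `posthog_lookup` is a hypothetical callable (flag_key, default) -> bool
    standing in for a configured PostHog client. When it is present the
    bootstrap map is ignored entirely; when it is absent the bootstrap map
    is consulted before falling back to the call-site default.
    """
    if posthog_lookup is not None:
        return posthog_lookup(flag_key, default_value)
    return bootstrap_flags.get(flag_key, default_value)
```

With no PostHog configured, `resolve_flag("organization_infrastructure", False)` is true, so the guarded routes stop returning 404 in local dev.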
… docs

Adds 7 wrapper scripts under apps/infra-local/scripts/ that orchestrate
the L2 native processes (API + Runner + Proxy + Dashboard) alongside
the existing L1 (BoxLite boxes via `python -m boxlite_local`):

  stack-build.sh    — `go build` runner + proxy + `yarn install`
  stack-up.sh       — ensure L1 up + start all/named L2 components
  stack-down.sh     — stop named L2 components (preserves L1 by default)
  stack-restart.sh  — bounce component(s); runner also rebuilds
  stack-status.sh   — one-screen health: L1 boxes + L2 PIDs + ports
  stack-logs.sh     — tail any component's log (or all)
  stack-reset.sh    — soft/hard/nuke: progressively wipe runtime state

All exposed via Makefile targets (`make stack-*`). Component-level
control via `COMPONENTS=...` variable. Logs/PIDs under
`apps/infra-local/.logs/` (gitignored).

Each starter is idempotent: re-running stack-up only starts components
that are down. Each starter also kills any stale listener on its port
before launching (defends against EADDRINUSE from a crashed prior run).

stack-down's orphan-sweep step honors the component list — partial
stops (e.g. `make stack-restart COMPONENTS=runner`) leave the others
untouched (caught + fixed during the verification run below).
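
The starters are shell scripts and are not shown. The "only start what is down" check can be sketched in Python as below; both function names are hypothetical, and the sketch only detects a listener (it does not kill a stale one).

```python
import socket


def port_in_use(port, host="127.0.0.1", timeout=0.5):
    """Return True if something accepts TCP connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(timeout)
        return sock.connect_ex((host, port)) == 0


def components_to_start(component_ports):
    """Given {component: port}, return the components with no listener yet."""
    return [name for name, port in component_ports.items()
            if not port_in_use(port)]
```

Calling `components_to_start({"api": 3001, "runner": 3003, "proxy": 4000})` on a fully running stack would return an empty list, which is what makes a repeated stack-up a no-op.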

Verified end-to-end:
- Cold start: `make stack-reset && make stack-up` brings all 4 L2 +
  10 L1 services healthy in ~60 s.
- Dashboard HMR: edit `ErrorBoundaryFallback.tsx` heading text →
  browser reflects change in ~3 s without page reload (Vite HMR).
- API watch: change `health.controller.ts` response shape →
  `curl /api/health` returns new shape in ~1 s (nx serve auto-rebuild).
- Runner rebuild: `make stack-restart COMPONENTS=runner` rebuilds
  binary + restarts process, `/info` returns new appVersion in ~10 s;
  api/proxy/dashboard PIDs unchanged (verified isolation).

Docs:
- `docs/apps/infra-local-status.md` — inventory of what's real vs.
  mock vs. missing (L1 / L2 / L3 + 6 mocked + 7 absent)
- `docs/apps/infra-local-usage.md` — daily workflow guide;
  first section is now the wrapper TL;DR

Adds regions-*.png, runners-*.png, term-*.png, test-*.png to the
ignore list. These are Playwright artifacts from L2 verification
runs and local dev/test sessions — not source.
…reate Personal org

JwtStrategy.validate() lazily creates a user row on first OIDC login but
omitted `personalOrganizationDefaultRegionId` from the CreateUserDto.
The OnAsyncEvent listener `OrganizationService.handleUserCreatedEvent`
then attempted to build the Personal organization with
`defaultRegionId=undefined`. The save silently failed (no async-event
result is awaited at the caller), leaving the user with no organization.

User-visible symptom on first OIDC login from a fresh DB (e.g. after
`make stack-reset` or a brand-new local stack):
- `GET /api/organizations` returns `[]`
- Dashboard's SelectedOrganizationProvider reads `organizations[0]` →
  undefined → subsequent `.id` access throws
- ErrorBoundary renders "Cannot read properties of undefined (reading
  'id')" — the dashboard never loads

Fix: pass `personalOrganizationDefaultRegionId` from
`config.defaultRegion.id` (env-driven, `DEFAULT_REGION_ID` with default
`'us'`). This is the same region the API auto-seeds at boot, so it
always exists by the time any user logs in.

Verified end-to-end:
- Wiped seeded org + user, cleared browser storage, re-login via Dex.
- User auto-created with Personal org `defaultRegionId='us'` and
  organization_user row `role='owner'`.
- Dashboard navigates to /dashboard/onboarding successfully (no
  ErrorBoundary).

Notes for prod parity:
- Behavior unchanged when running with a real PostHog / Auth0 deploy:
  the field was just an undefined optional before; it's now an existing
  region id. `configService.getOrThrow` ensures we fail loudly at
  startup if `DEFAULT_REGION_ID` resolves to nothing.
- This complements the API `bootstrapFlags` patch (a5dd131) — both
  remove silent failure modes that only show up in fresh local DBs.
…oard

Two regressions caught while debugging create-sandbox:

1) stack-down's stop_component was killing the whole process group
   (`kill -PGID`). stack-up.sh launches all 4 native components from the
   same parent shell, so the nohup'd background jobs inherit the same
   pgid — `kill -PGID` on one took out the unrelated siblings (a
   `stack-restart COMPONENTS=dashboard` knocked over api + proxy).
   Fix: kill just the specific PID. The per-component pkill-by-name
   sweep in the orphan-cleanup phase still picks up the actual server
   children (nx serve → node, etc.).

2) Dashboard launch lacked VITE_API_URL=/api. The @boxlite-ai/sdk
   client falls back to its prod default `https://app.boxlite.io/api`
   when VITE_API_URL is unset, so create-sandbox calls escaped the
   sandbox and failed with ERR_CONNECTION_CLOSED. Vite's dev-server
   proxy in vite.config.mts already forwards /api → localhost:3001;
   we just needed to point the SDK at the relative path.
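
The first regression is easy to reproduce outside the shell scripts. A small sketch, assuming a POSIX-like host: two children started from the same parent share its process group, so signalling the group hits both, while signalling one PID leaves the sibling running. `stop_component` here is a Python stand-in for the shell function of the same name.

```python
import os
import signal
import subprocess
import sys


def stop_component(pid):
    """Send SIGTERM to exactly one process.

    The group-wide alternative, os.killpg(os.getpgid(pid), signal.SIGTERM),
    would also signal every sibling started from the same parent shell.
    """
    try:
        os.kill(pid, signal.SIGTERM)
    except ProcessLookupError:
        pass  # the process is already gone


def demo():
    """Start two sleepers, stop the first, and report whether the second survived."""
    sleeper = [sys.executable, "-c", "import time; time.sleep(30)"]
    first = subprocess.Popen(sleeper)
    second = subprocess.Popen(sleeper)
    try:
        stop_component(first.pid)
        first.wait(timeout=10)
        return second.poll() is None  # True when the sibling is still running
    finally:
        for proc in (first, second):
            proc.kill()  # no-op for a process that has already been reaped
            proc.wait(timeout=10)
```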
Adds `seed-init-data.sh` + wires it into `stack-up` and surfaces it
as a standalone Makefile target. The script does NOT pre-insert
anything (the API self-seeds at boot via app.service.ts
initializeXxx). Instead it:

  1. Restarts the API if running, so the seed cycle re-runs against
     the truncated DB.
  2. Polls until admin user + admin Personal org + default region 'us'
     all land — proof that the API's onApplicationBootstrap completed.
  3. Waits up to 7 minutes for the default `ubuntu:22.04` snapshot to
     reach 'active' (the long pole: a cold pull from the local registry
     can take 2-5 min on M5).

Updates `stack-reset` to truncate the `"user"` table too, so the
API's `if (await findOne(BOXLITE_ADMIN_USER_ID)) return` early-exit
guard doesn't strand it: with no admin user row, the API recreates
admin + personal org + api key + default snapshot fresh.

Updates `stack-up` to invoke `seed-init-data.sh --no-bounce` after
api+runner just started, so the first stack-up after a reset doesn't
need a manual follow-up — the wrapper returns only when the dashboard
can actually create a sandbox.

Verified end-to-end:
- Full TRUNCATE of all user-data tables (incl. `"user"`)
- `make stack-up` — API auto-seeds admin user/org/region (T+~5s),
  default snapshot enters PENDING → PULLING (then registry box was
  hung; restart unblocked it) → ACTIVE
- `curl POST /api/sandbox` with admin key → HTTP 200, sandbox
  starts in ~10s
- Dashboard "+ Create Sandbox" → auto-navigates to new sandbox →
  Terminal tab → Connect → `root@boxlite:~#` live prompt

Also exposes `make seed-init-data` for ad-hoc verification any time.
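
The polling steps of the seed script follow a common wait-with-deadline pattern. A minimal sketch, with the clock and sleep injectable so it can be tested without waiting; the helper name and the 2-second interval are assumptions.

```python
import time


def wait_until(check, timeout_s, interval_s=2.0,
               clock=time.monotonic, sleep=time.sleep):
    """Call `check()` until it returns True or `timeout_s` elapses.

    Returns True on success and False on timeout. `check` is always
    called at least once, even with a zero timeout.
    """
    deadline = clock() + timeout_s
    while True:
        if check():
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)
```

For the snapshot step this would be called as `wait_until(snapshot_is_active, timeout_s=7 * 60)`, where `snapshot_is_active` is a hypothetical callable that queries the API for the default snapshot's state.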
…boxes

Adds `make stack-rebuild-l1-box BOX=<name>` — wraps
`boxlite rm boxlite-local-<name> --force && python -m boxlite_local up <name>`
for one-shot destroy + recreate of a stuck L1 service.

Surfaces two real failure modes seen this week:

  1. Dex SQLite session db keeps stale grants across SIGKILL of the
     containing box, so subsequent OIDC logins reuse the cached
     access_token from a prior session. Browser thinks it's logged in
     (oidc-client's `expires_at` is computed from refresh token TTL),
     but `accessTokenIat` decodes to days ago and API returns 401.
     Fix: `BOX=dex` resets the session db; clear browser storage and
     re-login.

  2. Registry box (`registry:2`) hangs after SIGKILL of boxlite-shim:
     TCP listener stays up but the registry process inside doesn't
     answer HTTP, so any snapshot pull hangs in PULLING forever.
     `curl http://127.0.0.1:25000/v2/_catalog` 5s-timeout is the
     positive identification. Fix: `BOX=registry`.

Both are added to the `docs/apps/infra-local-usage.md` "Common issues" (常见问题) table
with the symptom → diagnosis → one-line fix mapping.

The underlying root cause both share: `pkill -9` on the host-side
boxlite-shim can corrupt the persistent state of the in-box process.
SIGTERM via `make stack-down` does not have this issue.
…ree)

Adds section 5.5 to infra-local-usage.md that maps the 5 cleanup
levels (stack-restart → stack-rebuild-l1-box → stack-reset →
stack-reset-hard → stack-nuke) to concrete scenarios with timing.

Replaces the previous ad-hoc "what do I do if X" scattered across
sections with one decision table + 3 scenario walkthroughs:

  1. Full rebuild (new machine / serious breakage)     — ~5min
  2. Reset + re-up (common, dirty DB)                  — ~60s
  3. Partial reset/up (90% daily use)                  — ~3-10s

Key principle surfaced: start at the lightest tier, escalate only if
that doesn't fix it. Don't blindly stack-nuke.

gitignore syntax does NOT support inline comments — `pattern  # comment`
is parsed as the literal path `pattern  # comment`, not as a pattern
with a side comment. Result: 3 known-local files (apps/apps symlink,
apps/api/.swcrc, sdks/go/boxlite-c-v*/) were never actually ignored
and kept appearing under `git status` for every dev.

Moved all comments to their own lines. Verified with
`git check-ignore -v` that each pattern now matches its target.
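
A mechanical version of this cleanup can be sketched as a small filter that moves a trailing `# comment` onto its own line above the pattern. This is a heuristic under stated assumptions: it treats whitespace followed by `#` as the start of a comment, so a pattern that legitimately contains ` #` would be split incorrectly. The example comment text in the usage note is hypothetical.

```python
import re

# A pattern, then whitespace, then "#" and the comment text.
# Lines that start with "#" (real comments) or are blank do not match.
INLINE_COMMENT = re.compile(r"^(?P<pattern>[^#\s][^#]*?)\s+#\s*(?P<comment>.*)$")


def split_inline_comments(lines):
    """Rewrite 'pattern  # comment' as a comment line followed by the pattern."""
    out = []
    for line in lines:
        match = INLINE_COMMENT.match(line)
        if match:
            out.append("# " + match.group("comment"))
            out.append(match.group("pattern"))
        else:
            out.append(line)
    return out
```

A line such as `apps/api/.swcrc  # local override` becomes the two lines `# local override` and `apps/api/.swcrc`; the result should still be confirmed with `git check-ignore -v`.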
One-page reference of every infra-local service: PostgreSQL, Redis,
MinIO, Dex (OIDC), OCI registry, Jaeger, pgAdmin, OpenTelemetry
collector, Registry UI, Caddy reverse proxy. For each: host port +
in-box address + image + auth + data volume + sample one-liner.

Single source of truth is the InfraConfig dataclass in
boxlite_local/config.py — links throughout point at exact line ranges.

Adds a 'Documentation Language' section: every committed *.md, README,
CONTRIBUTING, design note, ADR, plan file, and inline comment block
must be in English. Non-English drafts are fine in scratch/chat but
must be translated before `git add`.

Trigger: the apps/infra-local/CONNECTIONS.md flow this week — a
Chinese version was committed, immediately translated 1 commit later,
then required a history rewrite to scrub the Chinese version. This
rule prevents the same cycle from recurring.

AI assistants should refuse to `git add` non-English markdown
directly and ask the user to confirm translation first.

Adds docs/apps/milestones/2026-05-25-infra-local-ready.md — a single
English-only summary of what the ms/infra-local-ready tag delivers:

- Executive summary of the 3 layers (L1 infra-local / L2 native control
  plane / L3 user sandboxes)
- Operate-by-make surface (12 stack-* + seed-init-data targets)
- End-to-end verified workflows
- Architecture changes per layer
- Key unblocking fixes (runner GOARCH, jwt.strategy seed,
  PostHog bootstrapFlags, Caddy→Proxy wiring, .env symlink,
  SSH_GATEWAY_API_KEY guard)
- Known mocked / missing services (PostHog, Billing, Webhooks,
  Snapshot Manager, SSH Gateway, ClickHouse, OpenSearch, SMTP)
- Phase chronology (Phase 1 PoC → 3d wrap → L2 boot → L2 hardening)
- Candidate next milestones

The tag itself (0a71bb5) stays where it is; this doc is the release
notes for that point, committed after the tag — conventional pattern
for milestone summaries.

Renames the milestone tag from ms/infra-local-ready to
apps/infra-local/v0.9.0 to match the existing component-version
tag scheme (e.g. sdks/go/v0.9.5).

Tag operation performed:
  git tag -a apps/infra-local/v0.9.0 0a71bb5 -F <preserved-message>
  git tag -d ms/infra-local-ready

Doc file renamed via git mv (history preserved):
  docs/apps/milestones/2026-05-25-infra-local-ready.md
  → docs/apps/milestones/2026-05-25-apps-infra-local-v0.9.0.md

In-doc tag references updated. Tag now sorts cleanly alongside
sdks/go/v* in `git tag -l '*/*'`.

Renames again to match the existing repo convention
(milestone/<component>/v<n.n.n>) already used by
milestone/scale-runner/v0.1.0.

Tag operation:
  git tag -a milestone/infra-local/v0.1.0 0a71bb5 -F <preserved-msg>
  git tag -d apps/infra-local/v0.9.0

Doc renamed via git mv; in-doc tag refs updated. Memory updated too.

Adds a second account to the dex staticPasswords block so the local
stack ships with both an admin path and a regular-user path for
dashboard E2E testing.

  - email:    test01@boxlite.dev
  - password: password
  - userID:   5678
  - OIDC sub: CgQ1Njc4EgVsb2NhbA  (base64 of protobuf
              `{userID:'5678', connectorID:'local'}`)
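
The `sub` value above can be reproduced by hand. The sketch below encodes the two strings as protobuf length-delimited fields (field 1 for the user ID, field 2 for the connector ID) and base64-encodes the result; the field numbers and the unpadded URL-safe alphabet are assumptions of this sketch, chosen because they reproduce the value listed above.

```python
import base64


def _length_delimited(field_number, value):
    """Encode one protobuf string field (wire type 2) with a 1-byte length."""
    data = value.encode("utf-8")
    if len(data) > 127:
        raise ValueError("this sketch only handles values up to 127 bytes")
    tag = (field_number << 3) | 2
    return bytes([tag, len(data)]) + data


def dex_sub(user_id, connector_id):
    """Build the OIDC `sub` for a Dex user ID and connector ID."""
    raw = _length_delimited(1, user_id) + _length_delimited(2, connector_id)
    return base64.urlsafe_b64encode(raw).rstrip(b"=").decode("ascii")


print(dex_sub("5678", "local"))  # prints CgQ1Njc4EgVsb2NhbA
```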

This account behaves like any other OIDC login: on first sign-in the
API's JwtStrategy auto-creates the `user` row + `Personal` organization
+ `organization_user` owner-of-own-org via the
OrganizationService.handleUserCreatedEvent listener. No SQL seed is
needed — the dex config IS the seed, applied automatically on every
`make stack-up` / `make stack-nuke && make stack-up` (because dex
reads staticPasswords from services.py on every box start, and the
stack-reset truncates `"user"` so all API auto-seed paths re-run).

Updates CONNECTIONS.md:
- §1 'Existing DB user rows' table now lists all three users
- §4 'Built-in login accounts' now documents both dex accounts with
  username / userID / expected OIDC sub / platform role
- §4 'OAuth clients' references `make stack-rebuild-l1-box BOX=dex`
  (the modern wrapper) instead of `make restart-svc dex` (which no
  longer exists)

Verified: dex rebuilt → browser login as test01@boxlite.dev/password →
dashboard renders onboarding page with `test01` in the top-right user
chip → DB confirms 3 users (boxlite-admin/admin/test01), each owning
their own Personal org.

Documentation cleanup, English translation per CLAUDE.md, alignment with
what shipped in milestone/infra-local/v0.1.0 (2026-05-25), and one fix
that unblocks dashboard-initiated sandbox creation on the M5 native dev
runner.

Documentation
-------------
- Delete 21 historical / PoC files: apps/infra-local/poc/ (5 PoC
  scripts), docs/superpowers/specs|plans (7 phase spec/plan files),
  docs/apps/{cloud-mvp-plan,apps-overview,apps-comprehensive,
  api-client-go,apps-api-overview}.md, *.bak files, and the
  pre-MVP "BoxLite cloud MVP.md" draft.
- Translate 4 user-facing docs from Chinese to English per CLAUDE.md
  "Documentation Language" rule:
    docs/apps/infra-local-{status,usage}.md
    docs/apps/infra-vs-local-infra.md
    docs/apps/own-dog-food-local-infra-solution.md
- Update apps/infra-local/{README,CONNECTIONS}.md and the design docs
  to describe L1 + L2 orchestration via `make stack-*` (the actual
  implementation), not the docker-compose + Lima route in the original
  design doc.
- Fix broken cross-references in surviving docs after the cleanup.

Dev runner-score override
-------------------------
apps/infra-local/scripts/stack-up.sh exports the following before
launching the API:

  RUNNER_AVAILABILITY_SCORE_THRESHOLD=5    (prod default 10)
  RUNNER_MEMORY_PENALTY_THRESHOLD=95       (prod default 75)
  RUNNER_DISK_PENALTY_THRESHOLD=95         (prod default 75)

Root cause: the Go runner reports host-wide CPU/RAM/disk usage to the
API, not just what the runner + its boxes consume. On a real EC2 host
that's the right signal. On a dev Mac sharing RAM with VS Code, Chrome,
Docker Desktop, and the L1 dev stack itself, those metrics routinely
exceed the prod 75% penalty threshold and drag the runner's
availabilityScore below the 10-default cutoff. The API then rejects
sandbox-create with "No available runners" even though the runner is
idle.

Documented in apps/infra-local/CONNECTIONS.md (new "Dev-only
runner-score overrides" section) and the v0.1.0 milestone
"Known limitations" table. The structural fix (have the runner report
only its own / boxes-owned resources) is tracked as a follow-up
outside this milestone.

Stack-reset.sh soft-reset behavior
----------------------------------
apps/infra-local/scripts/stack-reset.sh: soft reset now PRESERVES
identity + infra (user / organization / organization_user /
organization_role / region / runner / api_key) and only TRUNCATEs
runtime data (sandbox / snapshot / snapshot_runner / audit_log).
Result: an already-logged-in browser session survives a soft reset —
no forced re-login. --hard still wipes schema entirely; --nuke still
destroys L1 boxes too.
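
The split between preserved and truncated tables can be made explicit in a few lines. This is only a sketch of the idea: the table names come from the description above, while the helper name and the exact statement are assumptions (the script's real SQL is not shown, and whether `CASCADE` is needed depends on foreign keys not described here).

```python
# Identity + infra tables that a soft reset must leave alone.
PRESERVED_TABLES = ("user", "organization", "organization_user",
                    "organization_role", "region", "runner", "api_key")

# Runtime data that a soft reset wipes.
RUNTIME_TABLES = ("sandbox", "snapshot", "snapshot_runner", "audit_log")


def soft_reset_sql(tables=RUNTIME_TABLES):
    """Build a TRUNCATE statement, refusing to touch any preserved table."""
    overlap = sorted(set(tables) & set(PRESERVED_TABLES))
    if overlap:
        raise ValueError(f"soft reset must not truncate: {', '.join(overlap)}")
    quoted = ", ".join(f'"{name}"' for name in tables)
    return f"TRUNCATE TABLE {quoted};"
```

Quoting every name keeps reserved words such as `user` valid if a table list ever includes them.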

End-to-end verified
-------------------
Two full cold-start cycles (make stack-nuke && make stack-build &&
make stack-up), each followed by:
  - Dashboard login via Dex (admin@boxlite.dev / password)
  - Snapshots page shows ubuntu:22.04 as Active
  - Create Sandbox via dashboard UI -> state Started (first attempt)
  - Terminal Connect -> root@boxlite:~# prompt in iframe

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

New principle on this branch: local infra targets the M5 native runner
only. Lima-based multi-host runner support is being explored in a
separate worktree and is deliberately not in scope here.

Doc / comment edits:

- apps/infra-local/goal.md: translated from Chinese to English per
  CLAUDE.md "Documentation Language" rule.
- apps/infra-local/tests/integration/test_e2e_full.py: drop
  "and Lima runner VM later" from the resource-budget docstring.
- docs/apps/infra-local-status.md: drop "no Lima" qualifiers from
  the platform line and the L2 runner row.
- docs/apps/milestones/2026-05-25-milestone-infra-local-v0.1.0.md:
  drop "no Lima" from the headline; reword to "everything runs
  natively on M5".
- docs/apps/infra-vs-local-infra.md: replace the entire §2 "Why
  Lima instead of HVF" decision archive (180+ lines) with a short
  "Runner placement on this branch — M5 native (HVF)" section.
  Update §1 topology + design decisions, §3 comparison-table rows
  (runner / sandbox isolation / autoscaler InfraProvider /
  multi-runner support), §4.1/4.2 asymmetries, §6 file pointers,
  and §7 one-sentence summary to match the M5-native reality.
  Production-parity tradeoff is acknowledged but flagged as future
  work outside this milestone.
- docs/apps/own-dog-food-local-infra-solution.md: rewrite §2.2,
  §2.4 (runner path), key-design-choices list, repo layout tree,
  §5.1 resource budget, §11 decision table, and §12.2 phase plan
  to describe the M5 native runner instead of a runner-in-Lima.

Verification:
- grep -iw "lima|limactl|LimaInfraProvider" → 0 hits across all
  PR-scope files.
- Python CJK regex check → 0 CJK chars across all PR-scope files.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ke install

The previous §1 jumped from "yarn / go / python already installed"
straight to `make stack-build && make stack-up`, skipping two things
a fresh checkout actually needs:

1. The Python orchestrator package isn't installed yet — `python -m
   boxlite_local` doesn't work until `make install` runs `pip install
   -e ".[test]"`.
2. The `boxlite` Python SDK and CLI must already be present in the
   active environment (it's a transitive dep of `boxlite_local`, not
   installed by `make install`).

Restructured §1 into three sub-sections:

- §1.1 Prereqs — table listing the actual required tools + versions,
  plus a 3-line sanity check that surfaces missing prereqs before
  `make` runs and produces a less-actionable failure.
- §1.2 Three-step bring-up — now correctly shows
    make install        (pip install the orchestrator package)
    make stack-build    (yarn + go builds)
    make stack-up       (L1 + L2 + seed)
  with timing expectations (5-7 min cold, ~30 s-1 min warm).
- §1.3 First-time dashboard login — explicit credentials + the
  end-to-end smoke (create sandbox → terminal → root@boxlite:~#)
  so first-time users know what success looks like.

No behavior change — purely documentation.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… setup

Add the first-time bring-up commands (make install + make stack-build
+ make stack-up) at the top of the TL;DR cheat sheet so the entire
day-one workflow is visible in one block, without having to scroll to
§1.2. Day-to-day flow keeps its own bring-up line for clarity.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…boot

After a machine reboot the L1 microVM boxes are gone but the postgres
data volume (~/.boxlite-local/data/pg/) persists on disk with the full
schema. `make stack-up` sees no postgres box, runs `make up-with-schema`,
which brings the box back (schema already present) and then runs
`make load-schema` — which previously hard-failed:

    ERR: public schema already has 27 table(s).
    Schema baseline is not idempotent. Run 'make wipe && make up' first

So every post-reboot `make stack-up` died at the schema step.

Fix: apply-schema.sh now treats an already-loaded schema as a no-op
instead of an error. When the public schema is non-empty it checks the
`migrations` table to distinguish:

  - COMPLETE prior load (tables + migrations recorded) → skip, exit 0
  - PARTIAL half-applied baseline (tables but no migrations) → still
    refuse with exit 3 (genuinely broken state; needs `make wipe`)

This makes load-schema / up-with-schema / stack-up all idempotent across
reboots. The non-idempotent baseline itself is unchanged — we just stop
trying to re-apply it when it's already there.
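
The three-way decision described above can be written down compactly. apply-schema.sh is a shell script and is not shown here, so the following is a sketch of the logic only; the enum and function names are hypothetical.

```python
from enum import Enum


class SchemaAction(Enum):
    LOAD = "load"      # empty public schema: apply the baseline
    SKIP = "skip"      # complete prior load: no-op, exit 0
    REFUSE = "refuse"  # tables but no migrations: exit 3, needs `make wipe`


def decide_schema_action(table_count, migration_count):
    """Classify the database state from two counts.

    `table_count` is the number of tables in the public schema and
    `migration_count` the number of rows recorded in `migrations`.
    """
    if table_count == 0:
        return SchemaAction.LOAD
    if migration_count > 0:
        return SchemaAction.SKIP
    return SchemaAction.REFUSE
```

A post-reboot database with 27 tables and recorded migrations maps to `SKIP`, which is the case that previously hard-failed.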

Verified on a live post-reboot stack:
    Schema already loaded (27 tables, 88 migrations recorded) — skipping.
    exit=0
and `make stack-up` then proceeds to L2 (api/runner/proxy/dashboard all up).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Goal: a single `make stack-up` should work from a fresh checkout, after
a reboot, or after `make stack-down` — no need to remember to run
`make install` / `make stack-build` first.

Rather than wiring install/stack-build as hard `make` prerequisites
(which would force a pip-resolve + go-build on *every* stack-up,
including the fast daily restart loop), stack-up.sh now does both
checks *conditionally* so the common restart path pays nothing:

- New: if `python -c "import boxlite_local"` fails, run `make install`
  before bringing up L1 (which calls `python -m boxlite_local`).
- Existing (kept): if /tmp/boxlite-runner or /tmp/boxlite-proxy is
  missing, run stack-build.sh. Clarified in a comment that it only
  builds when missing — use `make stack-restart COMPONENTS=runner` to
  rebuild after a source change.

Combined with the load-schema idempotency fix, `make stack-up` is now
the single entry point in all scenarios:
  fresh checkout → install + up-with-schema + build + L2
  post-reboot    → up-with-schema (schema skip) + build (/tmp cleared) + L2
  post-down      → up-with-schema (boxes back) + L2  (binaries + pkg present)

Docs updated (README §Quick start, infra-local-usage §0 + §1.2) to
present `make stack-up` as the one command, with the explicit targets
kept as optional for forcing a rebuild.
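
The two conditional checks can be sketched in Python as below. stack-up.sh itself is a shell script; this sketch swaps the `python -c "import boxlite_local"` probe for `importlib.util.find_spec`, which checks that the package can be located without importing it. The function name is hypothetical; the package name and binary paths are the ones named above.

```python
import importlib.util
import os


def bring_up_steps(package="boxlite_local",
                   binaries=("/tmp/boxlite-runner", "/tmp/boxlite-proxy")):
    """Return the preparatory steps stack-up still needs, in order.

    An empty list is the fast path: the package is importable and both
    binaries exist, so the daily restart loop pays nothing extra.
    """
    steps = []
    if importlib.util.find_spec(package) is None:
        steps.append("make install")
    if not all(os.path.exists(path) for path in binaries):
        steps.append("stack-build.sh")
    return steps
```

After a reboot that clears `/tmp`, this returns only the build step; on a fresh checkout it returns both.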

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…e→stack-up dashboard

Schema loading
- Add scripts/build-all-in-one-sql.py: consolidates every apps/api TypeORM
  migration (legacy + pre/post-deploy, 87 total → 539 stmts) into a single
  sql/merged-schema.auto-gen.sql. Resolves TS-side ${...} interpolations,
  inlines parameterized queries, and mirrors TypeORM's enum/constraint
  auto-renames so the output loads cleanly from zero.
- `make load-schema` now regenerates + loads the merged schema; apply-schema.sh
  defaults to it and accepts a SCHEMA_SQL_FILE override.
- Drop sql/schema-baseline.sql + sql/REFRESH.md: the prod pg_dump is no longer
  the load source; schema is now generated from migrations (kept reachable via
  SCHEMA_SQL_FILE if ever needed for an A/B comparison).

Fix: wipe → stack-up left the dashboard non-functional
- Root cause: `make down`/`wipe` tore down only the L1 boxes, leaving the L2
  native procs (api/runner/proxy/dashboard) running. After a wipe the stale API
  held connections to the destroyed-and-recreated DB and never re-ran
  onApplicationBootstrap against the fresh DB → no admin user/org/region →
  dashboard loads but is unusable. (Confirmed independent of the schema swap:
  reproduced identically with the old baseline dump.)
- `make down`/`wipe` now stop L2 first (stack-down.sh).
- stack-up.sh stops any stale L2 when it (re)creates L1, covering teardown paths
  that bypass make (stack-rebuild-l1-box, direct `boxlite rm`,
  `python -m boxlite_local down`).

Verified: working-stack → make wipe → make stack-up → 27 tables / 87 migrations,
admin user + org + region seeded, dashboard :3000 + /api proxy + dex all HTTP 200.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Box.Export/Runtime.ImportBox FFI (C/Go/Node/Python SDKs) + runner CreateBackup/restore to S3 + id-preserving import. Foundation for scale-down live migration.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
lilongen pushed a commit that referenced this pull request Jun 9, 2026
5-service stack: pg (Phase 2) + redis/minio/minio-init/registry (3a).
Introduces http_url healthcheck, one_shot lifecycle, repo_root resolution.
Closes Phase-2 debt #1 (narrow start_service exception); defers debt #2
and tcp_port (no caller in 3a). Autonomous execution per /goal directive.
lilongen pushed a commit that referenced this pull request Jun 9, 2026
… exception (debt #1)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
