Skip to content

feat(infra-local): BoxLite-based local dev stack (L1 + L2 + M5 native runner)#595

Merged
DorianZheng merged 108 commits into
boxlite-ai:mainfrom
lilongen:feat/cloud-mvp
Jun 10, 2026
Merged

feat(infra-local): BoxLite-based local dev stack (L1 + L2 + M5 native runner)#595
DorianZheng merged 108 commits into
boxlite-ai:mainfrom
lilongen:feat/cloud-mvp

Conversation

@lilongen

@lilongen lilongen commented May 26, 2026

Copy link
Copy Markdown
Contributor

Summary

Adds apps/infra-local/ — a fully self-hosted BoxLite cloud-MVP
control plane that runs on a single Apple Silicon Mac. One command
(make stack-up) brings up the entire stack and lets a developer create
real microVM sandboxes through the dashboard — no AWS, no Docker daemon
for the application boxes; everything runs natively on M5.

Ships as milestone/infra-local/v0.1.0. Cold-start → working sandbox +
browser terminal (root@boxlite:~#) in ~80 s.

Dogfood principle: the control-plane services run as BoxLite boxes
(not docker-compose), so any BoxLite weakness is felt by the team
immediately.

Architecture

Layer What Where
L1 — infra-local 10 BoxLite microVM boxes: PostgreSQL, Redis, MinIO, OCI registry, Dex (OIDC), Caddy, Jaeger, pgAdmin, registry-ui, OpenTelemetry collector (+ one-shot minio-init) libkrun microVMs on macOS Hypervisor.framework, orchestrated by the boxlite_local Python package
L2 — control plane 4 native processes: NestJS API (:3001), Go Runner (:3003), Go Proxy (:4000), Vite Dashboard (:3000) Native macOS arm64, driven by make stack-*
L3 — user sandboxes N user-created Ubuntu/Alpine sandboxes libkrun microVMs spawned by the L2 Runner in ~/.boxlite-runner/

The Runner runs natively on M5 (HVF + libkrun) — single runner host.
Multi-host autoscaler testing (the old Lima route) is intentionally out
of scope here and parked in a separate worktree.

What works end-to-end (verified)

  • Cold start: make stack-nuke && make stack-up boots all services + auto-seeds + waits for the default snapshot — ~80 s.
  • OIDC login via Dex (admin@boxlite.dev / password, plus a normal test01@boxlite.dev); API auto-creates user + Personal org + owner row on first login.
  • Sandbox lifecycle: create → start → stop → destroy via dashboard or POST /api/sandbox.
  • Live terminal: dashboard Terminal → Connect → real interactive shell inside the microVM (browser → Caddy → Proxy → Runner → microVM).
  • Snapshot pull: API auto-creates the ubuntu:22.04 default snapshot; runner pulls the arm64 layer from the local registry.
  • Region + API-key management, audit logs, etc.

Verified twice via full cold-start + browser-driven E2E (login → snapshot Active → create sandbox → terminal root@boxlite:~#).

Operate-by-make

make stack-up is the single, self-healing entry point — works from
a fresh checkout, after a reboot, or after make stack-down:

  • auto-runs make install if the orchestrator package isn't importable
  • auto-builds the native runner/proxy binaries if missing (e.g. /tmp cleared on reboot)
  • load-schema is idempotent — skips cleanly when the PG data volume survived a reboot

Other targets: stack-status, stack-logs COMPONENT=…, stack-restart COMPONENTS=…, stack-rebuild-l1-box BOX=…, tiered cleanup
stack-reset / stack-reset-hard / stack-nuke. See
docs/apps/infra-local-usage.md.

Notable implementation notes

  • Dev runner-score override (stack-up.sh): the Go runner reports
    host-wide CPU/RAM/disk to the API. On a dev Mac sharing RAM with
    IDE/Chrome/Docker, that drags availabilityScore below the prod
    threshold and the API rejects sandbox-create with "No available
    runners". stack-up.sh exports RUNNER_AVAILABILITY_SCORE_THRESHOLD=5
    / RUNNER_MEMORY_PENALTY_THRESHOLD=95 / RUNNER_DISK_PENALTY_THRESHOLD=95
    before launching the API. Documented in CONNECTIONS.md; structural
    fix (runner reports only its own usage) tracked as follow-up.
  • Supporting app changes to make the milestone work: PostHog
    feature-flag bootstrap (api + dashboard), runner runtime.GOARCH
    image pull (fixes ENOEXEC on arm64), jwt.strategy user-create
    fix, Caddy host-regexp → proxy for sandbox port-preview URLs.

Known limitations (intentional, not bugs)

PostHog, Billing (Stripe), Svix webhooks, Snapshot Manager (Dockerfile
builds), SSH Gateway, ClickHouse, OpenSearch, SMTP, dns-shim + TLS are
mocked / deferred — see the
milestone doc
"Known limitations" table. Production-parity gap (HVF vs KVM libkrun
backend) is acknowledged and revisited when the autoscaler lands.

Docs

Test plan

  • make stack-up from fresh checkout (auto install + build + up)
  • make stack-nuke && make stack-up cold start → ~80 s to working stack
  • Reboot → make stack-up (load-schema idempotent skip, binaries rebuilt)
  • Dashboard login via Dex → snapshot ubuntu:22.04 Active
  • Create sandbox via UI → state Started → terminal root@boxlite:~#
  • All committed docs English-only (CLAUDE.md); no Lima refs in scripts/code

🤖 Generated with Claude Code

Summary by CodeRabbit

  • New Features

    • Full local dev orchestrator and CLI for running the multi-service local stack, with service registry, healthchecks and startup tooling.
    • Local no-network dashboard mock and local feature-flag bootstrap for easier offline development.
    • New helper scripts and Make targets for building, running, resetting and seeding the stack.
    • Expanded .gitignore for local artifacts.
  • Bug Fixes

    • Fixed sandbox query ordering issues.
    • New users now get the platform default region set at creation.
  • Documentation

    • Comprehensive local-dev usage guide and connections reference added.
  • Tests

    • Large suite of unit and integration tests covering orchestrator, services, and end-to-end local stack.
  • Chores

    • Updated dev tooling configs and package/tooling versions; added local packaging metadata.

Copilot AI review requested due to automatic review settings May 26, 2026 11:29

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

This PR introduces a BoxLite-dogfooded apps/infra-local/ stack (Python orchestrator + shell wrappers) and documentation for a fully self-hosted local control plane, alongside several fixes to unblock local Apple Silicon development (runner image arch selection, local PostHog flag bootstrapping, etc.).

Changes:

  • Add apps/infra-local/ Python orchestrator (boxlite_local) with unit/integration tests, plus L2 wrapper scripts (stack-*) and schema loader tooling.
  • Add extensive infra-local milestone and usage/status docs plus connection references.
  • Adjust local-dev behavior in runner (platform selection) and dashboard/API (PostHog bootstrap flags + MSW local config).

Reviewed changes

Copilot reviewed 50 out of 54 changed files in this pull request and generated 9 comments.

Show a summary per file
File Description
tsconfig.base.json Adds a top-level tsconfig reference (currently appears as a path placeholder).
libs Adds a top-level libs reference (currently appears as a path placeholder).
docs/apps/milestones/2026-05-25-milestone-infra-local-v0.1.0.md Milestone snapshot + architecture + operations overview.
docs/apps/infra-vs-local-infra.md Rationale/comparison doc updated with “what shipped” notes.
docs/apps/infra-local-usage.md Day-to-day usage guide and troubleshooting.
docs/apps/infra-local-status.md Current state inventory of local stack.
apps/runner/pkg/boxlite/registry.go Switch snapshot pull platform arch from hardcoded amd64 to host arch.
apps/package.json Adds forge deps/types, swc-loader, bumps @swc/core, adds resolutions.
apps/infra-local/tests/unit/test_topo.py Unit tests for topo sorting.
apps/infra-local/tests/unit/test_orchestrator.py Unit tests for orchestrator helper behavior.
apps/infra-local/tests/unit/test_doctor_lsof.py Unit tests for lsof parsing and ownership rules.
apps/infra-local/tests/unit/test_config.py Unit tests for InfraConfig defaults and env overrides.
apps/infra-local/tests/integration/test_multi_service.py Integration smoke test for multi-service bring-up.
apps/infra-local/tests/integration/test_e2e_full.py Full E2E protocol-level integration suite.
apps/infra-local/sql/REFRESH.md Procedure to refresh prod schema baseline.
apps/infra-local/scripts/stack-up.sh Starts L1 + L2 components and seeds init data.
apps/infra-local/scripts/stack-status.sh One-screen L1/L2 status output.
apps/infra-local/scripts/stack-restart.sh Restarts L2 components (rebuilds runner).
apps/infra-local/scripts/stack-reset.sh Soft/hard/nuke reset workflows.
apps/infra-local/scripts/stack-logs.sh Tail component logs.
apps/infra-local/scripts/stack-down.sh Stops L2 components and optionally L1.
apps/infra-local/scripts/stack-build.sh Builds native runner/proxy binaries and installs node deps.
apps/infra-local/scripts/seed-init-data.sh Waits for API seeding + default snapshot readiness.
apps/infra-local/scripts/apply-schema.sh Loads schema-baseline.sql into local postgres.
apps/infra-local/scripts/_stack-common.sh Shared helpers/paths/ports for stack scripts.
apps/infra-local/pyproject.toml Defines boxlite_local Python package + test extras.
apps/infra-local/goal.md Original infra-local goal note (currently non-English).
apps/infra-local/configs/minio/init.sh MinIO init helper script.
apps/infra-local/boxlite_local/types.py Dataclasses/types for orchestrator and doctor.
apps/infra-local/boxlite_local/services.py Declarative ServiceSpec registry for L1 services.
apps/infra-local/boxlite_local/orchestrator.py Topo sort, up/down/ps, health checks, SDK workarounds.
apps/infra-local/boxlite_local/execwrap.py Helper to collect exec stdout/stderr and exit code.
apps/infra-local/boxlite_local/doctor.py Preflight checks (SDK, runtime, port conflicts).
apps/infra-local/boxlite_local/config.py InfraConfig + env loading and derived properties.
apps/infra-local/boxlite_local/cli.py argparse CLI for python -m boxlite_local.
apps/infra-local/boxlite_local/main.py Module entrypoint.
apps/infra-local/boxlite_local/init.py Package metadata/version.
apps/infra-local/README.md Primary infra-local documentation and workflows.
apps/infra-local/Makefile Make targets wrapping Python CLI + L2 stack scripts.
apps/infra-local/CONNECTIONS.md Endpoints/credentials/env var reference.
apps/go.work Expands go.work workspace list (adds proxy, reorders entries).
apps/dashboard/src/mocks/handlers.ts Adds local-dev MSW fixtures + local /config response.
apps/dashboard/src/components/PostHogProviderWrapper.tsx Adds local-dev bootstrap feature flags fallback.
apps/api/src/common/providers/openfeature-posthog.provider.ts Adds bootstrapFlags fallback when PostHog not configured.
apps/api/src/auth/jwt.strategy.ts Passes default region id when auto-creating personal org.
apps/api/src/app.module.ts Wires bootstrapFlags into OpenFeature PostHog provider config.
apps/api/project.json Disables generatePackageJson for API build.
CLAUDE.md Adds “Committed documentation MUST be in English” rule.
.gitignore Ignores infra-local logs/env/symlinks and local artifacts.
Comments suppressed due to low confidence (4)

tsconfig.base.json:1

  • This file appears to contain only a path string, which is not valid tsconfig JSON and will break TypeScript tooling that expects a JSON object. If the intent is to reference apps/tsconfig.base.json, commit this as a symlink (so Git records it as a link) or replace the file contents with a real tsconfig that uses extends to point at the apps config.
    libs:1
  • This looks like it was intended to be a directory alias/symlink, but as committed it’s a regular file containing a path string. That will break any build tooling that expects libs/ to be a directory. Commit a real symlink (recorded by Git) or remove this file and update references to point at the correct directory.
    apps/infra-local/scripts/stack-up.sh:1
  • On macOS, BSD xargs does not support -r, so this command won’t reliably kill the prior listener; it can leave the port occupied and cause the subsequent start to fail with EADDRINUSE. Remove -r (since the port is known to be listening here) or replace this with a macOS-safe loop over PIDs.
    apps/runner/pkg/boxlite/registry.go:1
  • The variable name linuxAmd64Platform is now misleading because it no longer forces amd64. Rename it to something like linuxPlatform / defaultPullPlatform to avoid confusion when reading call sites.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread apps/infra-local/goal.md Outdated
Comment thread apps/dashboard/src/mocks/handlers.ts Outdated
Comment on lines +72 to +76
maxAutoArchiveInterval: 43200,
maintananceMode: false,
environment: 'local',
billingApiUrl: BILLING_API_URL,
})
} as Partial<BoxliteConfiguration>)
Comment on lines +66 to +76
// defaults above. All network behavior disabled: no /decide call (we
// pre-load flags via bootstrap), no event capture, no autocapture.
return (
<PostHogProvider
apiKey="phc_local_dev_no_op"
options={{
// api_host is required by posthog-js but never used because
// capturing is opted out and flags are bootstrapped below.
api_host: 'https://localhost.invalid',
bootstrap: { featureFlags: LOCAL_DEV_FEATURE_FLAG_DEFAULTS },
advanced_disable_feature_flags: true, // skip /decide network call
Comment thread apps/infra-local/boxlite_local/services.py
Comment thread apps/infra-local/boxlite_local/services.py
Comment thread apps/infra-local/boxlite_local/services.py
Comment thread apps/infra-local/boxlite_local/services.py
Comment thread apps/infra-local/CONNECTIONS.md Outdated
Comment thread apps/infra-local/boxlite_local/services.py Outdated
Comment thread apps/infra-local/sql/REFRESH.md Outdated
Comment thread docs/apps/infra-vs-local-infra.md Outdated

exporters:
debug:
verbosity: basic

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Otel collector is wired but doesn't feed Jaeger — local Jaeger UI will be empty.

Prod's OtelCollector (apps/infra/sst.config.ts:438) is BoxLite's custom build with the ClickHouse exporter. Here the local collector uses upstream otel/opentelemetry-collector:latest with exporters: debug only — traces land on stdout, never reach SPEC_JAEGER.

Two clean fixes, pick one:

  • (a) Wire it through: add otlp exporter pointing at jaeger:4317 and add jaeger to the pipelines.traces.exporters list. Local Jaeger UI becomes useful.
  • (b) Drop SPEC_OTEL entirely: Jaeger 1.67 accepts OTLP natively (COLLECTOR_OTLP_ENABLED=true is already set on SPEC_JAEGER) — have api/runner OTLP straight at jaeger and remove the collector hop.

Right now we ship both services but they don't talk, which is worse than shipping neither. Worth fixing in this PR or immediate follow-up.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in 01850227 — went with option (a), keeping the collector in the path for prod parity.

  • _otel_config now adds an otlp/jaeger exporter and puts it on the traces pipeline alongside debug (metrics/logs stay debug-only — Jaeger doesn't ingest them).
  • Jaeger's OTLP gRPC receiver is now host-mapped (26687 → box :4317); the collector reaches it via host-as-hub (host.boxlite.internal:26687). COLLECTOR_OTLP_ENABLED=true was already set.
  • SPEC_OTEL now depends_on=["jaeger"].

Verified end-to-end on a live stack: POST an OTLP/HTTP trace to the collector (:24318) → the Jaeger query API returns the service + trace within ~2s, so the Jaeger UI is no longer empty. Docs updated (README §Known-limitations #2, CONNECTIONS §6/§8). The only remaining gap vs prod is the custom boxlite_exporter build (ClickHouse / api push-back), still noted as a limitation.

Copilot AI review requested due to automatic review settings May 27, 2026 04:00

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 50 out of 54 changed files in this pull request and generated 8 comments.

Comments suppressed due to low confidence (4)

apps/runner/pkg/boxlite/registry.go:1

  • linuxAmd64Platform no longer represents amd64 specifically (it’s now host-arch dependent). Rename to something like linuxHostPlatform/defaultLinuxPlatform to avoid misleading future readers and call sites.
    apps/package.json:1
  • Using a caret range in resolutions can reduce determinism (you may get different transitive trees over time). Prefer pinning @types/express to an exact version here (and ensure the lockfile resolves consistently) to avoid unexpected type changes across installs.
    tsconfig.base.json:1
  • This file content is not valid JSON unless it’s being committed as a git symlink target. Ensure tsconfig.base.json is actually a symlink (mode 120000) to apps/tsconfig.base.json; otherwise TypeScript tooling will fail to parse it.
    docs/apps/infra-vs-local-infra.md:1
  • These port numbers conflict with the rest of the infra-local docs/scripts in this PR (API is started on :3001; dashboard on :3000). Update the table to reflect the actual local ports so readers don’t follow incorrect endpoints.

Comment thread apps/infra-local/boxlite_local/services.py
Comment thread apps/infra-local/boxlite_local/services.py
Comment thread apps/infra-local/boxlite_local/services.py
Comment thread apps/infra-local/boxlite_local/services.py
Comment thread apps/dashboard/src/mocks/handlers.ts Outdated
defaultSnapshot: 'ubuntu:22.04',
dashboardUrl: 'http://localhost:3000',
maxAutoArchiveInterval: 43200,
maintananceMode: false,
Comment thread apps/dashboard/src/mocks/handlers.ts Outdated
Comment on lines +388 to +412
http.get(`${API_URL}/*`, async ({ request }) => {
console.log('[MSW catch-all GET]', request.url)
if (request.url.includes('paginated')) {
return HttpResponse.json({ items: [], totalItems: 0, totalPages: 0 })
}
return HttpResponse.json([])
}),

// Catch-all POST/PUT/PATCH/DELETE — return {} so mutations don't error.
http.post(`${API_URL}/*`, async ({ request }) => {
console.log('[MSW catch-all POST]', request.url)
return HttpResponse.json({})
}),
http.put(`${API_URL}/*`, async ({ request }) => {
console.log('[MSW catch-all PUT]', request.url)
return HttpResponse.json({})
}),
http.patch(`${API_URL}/*`, async ({ request }) => {
console.log('[MSW catch-all PATCH]', request.url)
return HttpResponse.json({})
}),
http.delete(`${API_URL}/*`, async ({ request }) => {
console.log('[MSW catch-all DELETE]', request.url)
return HttpResponse.json({})
}),
Comment on lines +69 to +83
<PostHogProvider
apiKey="phc_local_dev_no_op"
options={{
// api_host is required by posthog-js but never used because
// capturing is opted out and flags are bootstrapped below.
api_host: 'https://localhost.invalid',
bootstrap: { featureFlags: LOCAL_DEV_FEATURE_FLAG_DEFAULTS },
advanced_disable_feature_flags: true, // skip /decide network call
autocapture: false,
capture_pageview: false,
capture_pageleave: false,
opt_out_capturing_by_default: true,
disable_session_recording: true,
loaded: (posthog) => posthog.opt_out_capturing(),
}}
Comment thread apps/infra-local/CONNECTIONS.md
Copilot AI review requested due to automatic review settings May 27, 2026 08:22

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 50 out of 54 changed files in this pull request and generated 4 comments.

Comments suppressed due to low confidence (5)

tsconfig.base.json:1

  • This file content is not valid JSON. If the intent is to make the repo-root tsconfig.base.json a symlink to apps/tsconfig.base.json, ensure it is committed as a git symlink (mode 120000). If it lands as a regular file, TypeScript tooling reading tsconfig.base.json will fail to parse it.
    libs:1
  • Like tsconfig.base.json, this looks like it’s meant to be a symlink. If it is committed as a normal file containing the text apps/libs, any tooling expecting libs/ to be a directory (or a workspace path) will break. Ensure this is a git symlink (mode 120000) rather than a regular file.
    apps/infra-local/scripts/stack-up.sh:1
  • On macOS/BSD, xargs does not support -r (GNU-only), so this will error and can break set -e flows. Replace the pipeline with a macOS-compatible pattern (e.g., capture PIDs and conditionally kill, or use an xargs invocation that does not rely on -r). Apply the same fix to the other xargs -r occurrences in this script.
    apps/runner/pkg/boxlite/registry.go:1
  • The variable name linuxAmd64Platform is now inaccurate since it can be arm64 (or others). Rename it to reflect the new behavior (e.g., linuxDefaultPlatform, linuxHostArchPlatform, etc.) to avoid confusing future readers.
    apps/package.json:1
  • As of Aug 2025, @swc/core versions were not in the 1.15.x range; ^1.15.33 may be a non-existent release and would break installs. Please confirm the intended @swc/core version exists in npm (and align it with the repo’s SWC toolchain expectations).

Comment thread apps/dashboard/src/mocks/handlers.ts Outdated
defaultSnapshot: 'ubuntu:22.04',
dashboardUrl: 'http://localhost:3000',
maxAutoArchiveInterval: 43200,
maintananceMode: false,
Comment thread apps/dashboard/src/mocks/handlers.ts Outdated
Comment on lines +338 to +342
const KEY = '__msw_api_keys__'
const load = (): ApiKey[] => {
try { return JSON.parse(sessionStorage.getItem(KEY) ?? '[]') } catch { return [] }
}
const save = (s: ApiKey[]) => sessionStorage.setItem(KEY, JSON.stringify(s))
Comment on lines +69 to +84
<PostHogProvider
apiKey="phc_local_dev_no_op"
options={{
// api_host is required by posthog-js but never used because
// capturing is opted out and flags are bootstrapped below.
api_host: 'https://localhost.invalid',
bootstrap: { featureFlags: LOCAL_DEV_FEATURE_FLAG_DEFAULTS },
advanced_disable_feature_flags: true, // skip /decide network call
autocapture: false,
capture_pageview: false,
capture_pageleave: false,
opt_out_capturing_by_default: true,
disable_session_recording: true,
loaded: (posthog) => posthog.opt_out_capturing(),
}}
>
Comment thread apps/infra-local/CONNECTIONS.md Outdated
Copilot AI review requested due to automatic review settings May 28, 2026 10:01

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 50 out of 54 changed files in this pull request and generated 5 comments.

Comments suppressed due to low confidence (4)

apps/infra-local/scripts/stack-up.sh:1

  • xargs -r is not supported by macOS/BSD xargs, so this command will fail on the target platform. Replace with a portable pattern (e.g., capture PIDs into a variable and only call kill when non-empty, or use xargs kill -9 guarded by if/grep -q .). Apply the same fix to the runner/proxy/dashboard blocks that use xargs -r.
    apps/runner/pkg/boxlite/registry.go:1
  • The variable name linuxAmd64Platform is now misleading because it can resolve to non-amd64 architectures. Rename it to something accurate (e.g., linuxHostPlatform or defaultLinuxPlatform) to avoid confusion and incorrect future usage.
    apps/package.json:1
  • Using a caret range in resolutions can reduce determinism (the resolved version may vary over time and across tooling). Consider pinning @types/express to an exact version to keep installs reproducible and avoid surprise type changes.
    tsconfig.base.json:1
  • This appears to be a Git symlink (stored as a file containing the target path). On platforms/environments that don’t support symlinks (or where Git symlinks are disabled, common on Windows), this will become a plain text file and break TypeScript tooling expecting valid JSON at tsconfig.base.json. If Windows support is needed, consider a real JSON file that extends apps/tsconfig.base.json instead of a symlink.

Comment thread apps/dashboard/src/mocks/handlers.ts Outdated
defaultSnapshot: 'ubuntu:22.04',
dashboardUrl: 'http://localhost:3000',
maxAutoArchiveInterval: 43200,
maintananceMode: false,
Comment thread apps/dashboard/src/mocks/handlers.ts Outdated
Comment on lines +48 to +51
defaultRegionId: 'local',
defaultRegion: 'local',
role: 'OWNER',
}
Comment thread apps/dashboard/src/mocks/handlers.ts Outdated
userId: _USER_ID,
email: 'admin@boxlite.dev',
name: 'Local Admin',
role: 'owner',
Comment thread apps/dashboard/src/mocks/handlers.ts Outdated
Comment on lines +292 to +298
return HttpResponse.json({
id: _USER_ID,
email: 'admin@boxlite.dev',
name: 'Local Admin',
role: 'OWNER',
personalOrganizationId: _ORG_ID,
})
Comment on lines +69 to +84
<PostHogProvider
apiKey="phc_local_dev_no_op"
options={{
// api_host is required by posthog-js but never used because
// capturing is opted out and flags are bootstrapped below.
api_host: 'https://localhost.invalid',
bootstrap: { featureFlags: LOCAL_DEV_FEATURE_FLAG_DEFAULTS },
advanced_disable_feature_flags: true, // skip /decide network call
autocapture: false,
capture_pageview: false,
capture_pageleave: false,
opt_out_capturing_by_default: true,
disable_session_recording: true,
loaded: (posthog) => posthog.opt_out_capturing(),
}}
>
@lilongen lilongen changed the title docs(infra-local): align with v0.1.0 + dev runner-score override feat(infra-local): BoxLite-based local dev stack (L1 + L2 + M5 native runner) May 29, 2026
Copilot AI review requested due to automatic review settings May 29, 2026 09:44

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 49 out of 53 changed files in this pull request and generated 3 comments.

Comments suppressed due to low confidence (1)

apps/runner/pkg/boxlite/registry.go:1

  • The variable name linuxAmd64Platform is now misleading because Architecture is derived from runtime.GOARCH (often arm64 on Apple Silicon). Rename it to something architecture-agnostic (e.g., linuxDefaultPlatform / linuxHostPlatform) to avoid confusion and prevent future misuse.

Comment thread apps/dashboard/src/mocks/handlers.ts Outdated
Comment on lines 70 to 75
defaultSnapshot: 'ubuntu:22.04',
dashboardUrl: 'http://localhost:3000',
maxAutoArchiveInterval: 43200,
maintananceMode: false,
environment: 'local',
billingApiUrl: BILLING_API_URL,
Comment thread apps/infra-local/CONNECTIONS.md
Comment thread apps/infra-local/scripts/build-all-in-one-sql.py Outdated
Copilot Bot review requested due to automatic review settings May 29, 2026 10:05

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 49 out of 53 changed files in this pull request and generated 3 comments.

Comments suppressed due to low confidence (4)

tsconfig.base.json:1

  • tsconfig.base.json must be valid JSON for TypeScript/Nx tooling; the current content is a plain path string and will cause parsing failures. If the intent is to share apps/tsconfig.base.json, either commit a real symlink (so the repo root tsconfig.base.json resolves to the apps file) or replace this with a minimal JSON file that extends the apps base config.
    libs:1
  • A root-level libs entry is expected to be a directory (or a symlink to a directory) in most workspaces; committing it as a plain file containing a path will break module resolution and filesystem expectations. If this is intended to be a symlink to apps/libs, commit it as an actual symlink (or adjust workspace configuration so code references apps/libs directly).
    apps/runner/pkg/boxlite/registry.go:1
  • The identifier linuxAmd64Platform is now misleading because it no longer hardcodes amd64 and can evaluate to arm64, etc. Rename it to something architecture-agnostic (e.g., linuxHostPlatform / defaultLinuxPlatform) to avoid incorrect assumptions elsewhere.
    apps/package.json:1
  • The specified node-forge version ^1.4.0 may not exist on npm (as of common published versions, node-forge is typically 1.3.x). If this version is not actually published, installs will fail; please verify the published version and pin to an existing release (or update to the correct package/version if a different fork is intended).

Comment thread apps/dashboard/src/mocks/handlers.ts Outdated
Comment on lines +72 to +74
maxAutoArchiveInterval: 43200,
maintananceMode: false,
environment: 'local',
Comment on lines +75 to +76
bootstrap: { featureFlags: LOCAL_DEV_FEATURE_FLAG_DEFAULTS },
advanced_disable_feature_flags: true, // skip /decide network call
Comment thread apps/infra-local/boxlite_local/config.py Outdated
Copilot Bot review requested due to automatic review settings May 29, 2026 10:12

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 49 out of 53 changed files in this pull request and generated 3 comments.

Comments suppressed due to low confidence (6)

apps/runner/pkg/boxlite/registry.go:1

  • The variable name linuxAmd64Platform is now misleading since it can be arm64 (or other). Rename to something architecture-neutral (e.g. linuxHostPlatform / defaultLinuxPlatform) so callers don’t infer it is always amd64.
    tsconfig.base.json:1
  • This appears to be a committed symlink (Git stores symlinks as a file whose contents is the target path). Symlinks can be problematic on Windows and in some packaging/CI environments; consider replacing this with a real tsconfig.base.json that uses extends to reference apps/tsconfig.base.json, or document/enforce symlink support in contributing/CI.
    libs:1
  • Similar to tsconfig.base.json, this looks like a committed symlink to apps/libs. If cross-platform support matters, consider using an actual directory or a tooling-level path mapping instead of a symlink, or ensure CI and team dev environments can reliably handle symlinks.
    apps/package.json:1
  • Based on versions available up to my knowledge cutoff (Aug 2025), node-forge@^1.4.0 and @swc/core@^1.15.33 look potentially non-existent or at least unusually high jumps. Please verify these versions exist on npm (and that the lockfile/tooling supports them); if not, pin to a published version or adjust the range to the intended release line.
    apps/package.json:1
  • Based on versions available up to my knowledge cutoff (Aug 2025), node-forge@^1.4.0 and @swc/core@^1.15.33 look potentially non-existent or at least unusually high jumps. Please verify these versions exist on npm (and that the lockfile/tooling supports them); if not, pin to a published version or adjust the range to the intended release line.
    apps/package.json:1
  • Using a caret range in resolutions can undermine determinism and make installs vary across time. Prefer pinning an exact version in resolutions (and relying on the lockfile) unless there’s a specific reason to allow upgrades here.

Comment on lines +30 to +42
ts: TopologicalSorter[str] = TopologicalSorter()
for name, spec in services.items():
ts.add(name, *spec.depends_on)
ts.prepare()
layers: list[list[str]] = []
while ts.is_active():
layer = sorted(ts.get_ready())
if not layer:
break
layers.append(layer)
for name in layer:
ts.done(name)
return layers
Comment thread apps/dashboard/src/mocks/handlers.ts Outdated
Comment on lines 70 to 75
defaultSnapshot: 'ubuntu:22.04',
dashboardUrl: 'http://localhost:3000',
maxAutoArchiveInterval: 43200,
maintananceMode: false,
environment: 'local',
billingApiUrl: BILLING_API_URL,
Comment thread apps/api/project.json Outdated
"main": "apps/api/src/main.ts",
"tsConfig": "apps/api/tsconfig.app.json",
"generatePackageJson": true,
"generatePackageJson": false,

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reverted in 3424fb25 — back to generatePackageJson: true. The local stack runs the API via nx serve api (not nx build), so this prod-build option was never needed locally; disabling it only risked the deploy artifact. Resolving.

Copilot Bot review requested due to automatic review settings May 29, 2026 11:39

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 49 out of 53 changed files in this pull request and generated 3 comments.

Comments suppressed due to low confidence (4)

tsconfig.base.json:1

  • tsconfig.base.json is not valid JSON as committed (it contains only a path). If this is intended to be a symlink, ensure it’s committed as a symlink in git (mode 120000); otherwise TypeScript tooling will fail when reading the root tsconfig. If symlinks are not desired/portable, replace this with an actual JSON tsconfig that extends ./apps/tsconfig.base.json.
    libs:1
  • This file content looks like a symlink target (and as a regular file it’s not a valid directory). If the intent is a repository-root libs symlink to apps/libs, make sure it’s committed as a symlink; otherwise consumers expecting a libs/ directory at repo root will break.
    apps/runner/pkg/boxlite/registry.go:1
  • The variable name linuxAmd64Platform is now misleading because it no longer hardcodes amd64 (it uses runtime.GOARCH). Rename it to reflect the new semantics (e.g., linuxHostPlatform, linuxDefaultPlatform, or linuxPlatformForHostArch) to avoid confusing future readers and callers.
    apps/package.json:1
  • As of my knowledge cutoff (Aug 2025), node-forge’s latest published version is 1.3.1 and ^1.4.0 is likely not resolvable from npm. Please verify the intended version exists; otherwise installs will fail in CI. If the goal is simply to add node-forge, consider pinning to ^1.3.1 (or whatever the current published latest is) and updating @types/node-forge accordingly.

Comment on lines +69 to +83
<PostHogProvider
apiKey="phc_local_dev_no_op"
options={{
// api_host is required by posthog-js but never used because
// capturing is opted out and flags are bootstrapped below.
api_host: 'https://localhost.invalid',
bootstrap: { featureFlags: LOCAL_DEV_FEATURE_FLAG_DEFAULTS },
advanced_disable_feature_flags: true, // skip /decide network call
autocapture: false,
capture_pageview: false,
capture_pageleave: false,
opt_out_capturing_by_default: true,
disable_session_recording: true,
loaded: (posthog) => posthog.opt_out_capturing(),
}}
Comment thread apps/dashboard/src/mocks/handlers.ts Outdated
Comment on lines +69 to +76
proxyToolboxUrl: 'http://localhost:28080',
defaultSnapshot: 'ubuntu:22.04',
dashboardUrl: 'http://localhost:3000',
maxAutoArchiveInterval: 43200,
maintananceMode: false,
environment: 'local',
billingApiUrl: BILLING_API_URL,
})
} as Partial<BoxliteConfiguration>)
Comment on lines 105 to 111
// Bootstrap-aware (mirror evaluateFlag's behavior for object resolution).
if (!this.isConfigured || !this.client) {
// Object flags are rare in our usage; not added to bootstrapFlags map
// (typed as boolean | string | number). Fall through to default.
logger.debug(`PostHog not configured, returning default value for flag ${flagKey}`)
return {
value: defaultValue,
// organization.defaultRegionId is set silently fails — leaving
// the new user with no Personal org and the dashboard with
// "Cannot read properties of undefined (reading 'id')".
personalOrganizationDefaultRegionId: this.configService.getOrThrow('defaultRegion.id'),

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why change this in this PR?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fair — out of scope for infra-local. Reverted in 3424fb25. It's a real fix (OIDC-login auto-created users get a Personal org with defaultRegionId=undefined, mirroring what app.service.ts already does for the seed path), but it belongs in its own PR rather than bundled here. I'll open a standalone PR for it. Resolving.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Standalone PR opened: #668 (with a reproducer test, two-sided verified).

@@ -0,0 +1,217 @@
# Milestone: `milestone/infra-local/v0.1.0`

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do not file development documents not related to the repo

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed. Removed the development-journal docs in 3424fb25 — milestone snapshot, infra-local-status.md, and own-dog-food-local-infra-solution.md are gone (history lives in git), and the README/usage links that pointed at them are fixed. Kept only the operational README.md / CONNECTIONS.md / infra-local-usage.md. Resolving.

Comment thread CLAUDE.md Outdated

- [docs/development/rust-style.md](./docs/development/rust-style.md)

## Documentation Language

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do not add this

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed in 3424fb25 — CLAUDE.md is reverted to main, so this PR no longer touches it. Resolving.

@cla-assistant

cla-assistant Bot commented Jun 6, 2026

Copy link
Copy Markdown

CLA assistant check
All committers have signed the CLA.

@coderabbitai

coderabbitai Bot commented Jun 7, 2026

Copy link
Copy Markdown

Review Change Stack

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review
📝 Walkthrough

Walkthrough

Adds a BoxLite-based local infra stack, local feature-flag bootstrap fallbacks for API/dashboard, Python orchestrator and shell wrappers for L1/L2, comprehensive integration and unit tests, workspace/build adjustments, and operational docs for running the local development environment.

Changes

Feature-flag bootstrap for local development

Layer / File(s) Summary
API feature-flag bootstrap contract and evaluation
apps/api/src/app.module.ts, apps/api/src/common/providers/openfeature-posthog.provider.ts
OpenFeature PostHog provider now supports optional bootstrapFlags that supply default flag values when PostHog is not configured locally, returned with STATIC resolution reason.
Dashboard local bootstrap and mock API
apps/dashboard/src/components/PostHogProviderWrapper.tsx, apps/dashboard/src/mocks/handlers.ts
PostHog provider mounts in stub no-network mode for local unconfigured cases, bootstrapping flags from defaults. MSW handlers now provide local CRUD for org/user/API-key endpoints and catch-all fallbacks for unimplemented mutations.

BoxLite local infrastructure orchestration

Layer / File(s) Summary
Workspace and build plumbing
.gitignore, apps/api/tsconfig.json, apps/api/webpack.config.js, apps/package.json, apps/runner/pkg/boxlite/registry.go, libs, tsconfig.base.json
Updates root wiring (symlinks, build config extends path), dev-only Webpack plugin filtering, vitest version bump, and dynamic host-architecture platform detection for registry pulls.
Python types, configuration, and utilities
apps/infra-local/boxlite_local/types.py, apps/infra-local/boxlite_local/config.py, apps/infra-local/pyproject.toml, apps/infra-local/boxlite_local/__init__.py, apps/infra-local/boxlite_local/execwrap.py
Defines health check and service spec data structures, InfraConfig dataclass loading from BOXLITE_* env vars, module entry point, and async execution wrapper for box command collection.
CLI and preflight validation
apps/infra-local/boxlite_local/cli.py, apps/infra-local/boxlite_local/__main__.py, apps/infra-local/boxlite_local/doctor.py
Implements argparse-based CLI dispatcher (doctor/up/down/ps subcommands), preflight checks for SDK imports, runtime reachability, and port availability.
Service definitions and orchestration
apps/infra-local/boxlite_local/services.py, apps/infra-local/boxlite_local/orchestrator.py
Declares services with health checks, dependencies, and config generation (Postgres, Redis, MinIO, registry, Dex, Jaeger, pgAdmin, OTEL, Caddy). Implements async orchestration: topological startup, health probes with retries, one-shot lifecycle, and status listing.
Makefile and shell orchestration
apps/infra-local/Makefile, apps/infra-local/scripts/_stack-common.sh, apps/infra-local/scripts/stack-build.sh, apps/infra-local/scripts/stack-down.sh, apps/infra-local/scripts/stack-logs.sh, apps/infra-local/scripts/stack-restart.sh, apps/infra-local/scripts/stack-status.sh
Makefile wraps Python CLI and delegates to shell scripts. Shared helper script provides component constants, logging, PID/log file management, health checks. Individual scripts handle build, component lifecycle with port conflict resolution, status reporting, and log tailing.
Stack reset and full startup
apps/infra-local/scripts/stack-reset.sh, apps/infra-local/scripts/stack-up.sh
Three-tier reset modes (soft: truncate tables; hard: drop/recreate schema; nuke: full wipe). Stack-up orchestrates L1 detection/bootstrap, L2 process startup with health checks, symlink setup, and seed-data verification.
Initialization and seeding
apps/infra-local/configs/minio/init.sh, apps/infra-local/scripts/seed-init-data.sh
MinIO init script creates default bucket. Seed script verifies admin user, personal org, default region, and snapshot active state with configurable timeouts.
Comprehensive test coverage
apps/infra-local/tests/integration/test_e2e_full.py, apps/infra-local/tests/integration/test_multi_service.py, apps/infra-local/tests/unit/test_*.py
E2E tests validate multi-service behavior: SQL/KV/S3 roundtrips, registry catalog, Dex JWKS, Jaeger API, OTLP receiver, Caddy routes, stability, and resource budget. Unit tests cover config parsing, doctor checks, orchestrator probes, and topology sorting.
Operational documentation
apps/infra-local/README.md, apps/infra-local/CONNECTIONS.md, docs/apps/infra-local-usage.md, CLAUDE.md
README describes L1/L2 architecture and quick-start. CONNECTIONS.md documents service endpoints and env overrides. Usage guide covers prerequisites, setup, day-to-day loops, reset tiers, testing, and troubleshooting. CLAUDE.md adds communication guidance.
Supporting fixes
apps/api/src/sandbox/managers/sandbox.manager.ts, apps/api/src/auth/jwt.strategy.ts
Normalizes SQL orderBy fields (remove quotes). Sets personalOrganizationDefaultRegionId on new user creation to prevent cascade failures.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

  • boxlite-ai/boxlite#671: Both PRs change apps/runner/pkg/boxlite/registry.go to derive the registry platform from runtime.GOARCH (this PR does the same change).
  • boxlite-ai/boxlite#668: Overlaps the JwtStrategy.validate change that anchors personal organization default region to defaultRegion.id.

Suggested reviewers

  • law-chain-hot

Poem

🐰 I hopped through code to make dev life bright,
Bootstrapped flags when PostHog slipped from sight,
Boxes and scripts hum in a tidy stack,
Run make stack-up and watch the world come back!

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 14

Note

Due to the large number of review comments, Critical, Major severity comments were prioritized as inline comments.

🟡 Minor comments (11)
docs/apps/infra-local-usage.md-230-233 (1)

230-233: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Add a language identifier to the fenced code block in Section 6.1.

Line 230 opens a fenced block without a language, which triggers markdownlint rule MD040.

💡 Proposed fix
-```
+```text
 http://localhost:3000  → pick admin / user to log in (dex static users)
</details>

<details>
<summary>🤖 Prompt for AI Agents</summary>

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In @docs/apps/infra-local-usage.md around lines 230 - 233, The fenced code block
in Section 6.1 is missing a language identifier which triggers markdownlint
MD040; update the backticks that start the block around the line containing
"http://localhost:3000 → pick admin / user to log in (dex static users)" to
include a language hint such as text (e.g., change totext) so the fenced
block is properly annotated.


</details>

<!-- cr-comment:v1:b8aae239d297f09d95853ce4 -->

_Source: Linters/SAST tools_

</blockquote></details>
<details>
<summary>docs/apps/infra-local-usage.md-16-31 (1)</summary><blockquote>

`16-31`: _⚠️ Potential issue_ | _🟡 Minor_ | _⚡ Quick win_

**Use one component-selector variable consistently in examples.**

Line 16 uses `COMPONENT=...`, while Line 31 documents `COMPONENTS=` as the selector and later examples also mix forms. This is copy/paste-prone and can break command usage.

<details>
<summary>💡 Proposed fix</summary>

```diff
-make stack-logs COMPONENT=api
+make stack-logs COMPONENTS=api
@@
-make stack-logs COMPONENT=runner                 # single component
-make stack-logs COMPONENT=all                    # everything
+make stack-logs COMPONENTS=runner                # single component
+make stack-logs COMPONENTS=all                   # everything
```
</details>

  


Also applies to: 215-217

<details>
<summary>🤖 Prompt for AI Agents</summary>

```
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/apps/infra-local-usage.md` around lines 16 - 31, The docs use two
different component selector variables (`COMPONENT=` vs `COMPONENTS=`) which is
inconsistent and error-prone; pick one name and update all examples to use it
consistently (e.g., replace `COMPONENT=api` with `COMPONENTS=api` and ensure
`make stack-restart`/`make stack-logs`/`make stack-down` examples and the
explanatory sentence all reference the same `COMPONENTS=` variable, including
the other occurrences noted around lines 215-217); also ensure the parenthetical
"(empty = all)" matches the chosen variable name.
```

</details>

<!-- cr-comment:v1:cc43f5ae1876e8bcec1e75ad -->

</blockquote></details>
<details>
<summary>docs/apps/infra-local-usage.md-127-132 (1)</summary><blockquote>

`127-132`: _⚠️ Potential issue_ | _🟡 Minor_ | _⚡ Quick win_

**Avoid direct `go build` in this workflow section; use make wrapper commands.**

This section bypasses the documented stack wrappers and can drift from repo-standard flags/paths. Prefer `make stack-restart COMPONENTS=runner` (or documented wrapper equivalents) for consistency with the rest of this guide.

Based on learnings: "Use `make` targets for build, test, lint, format, setup, and distribution. Do not run `cargo`, `npm`, `python`, `go`, or `cbindgen` directly when a make target exists."

<details>
<summary>🤖 Prompt for AI Agents</summary>

```
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/apps/infra-local-usage.md` around lines 127 - 132, Replace the direct
`go build` workflow snippet that kills and rebuilds the runner with the
repo-standard make wrapper; update the block that currently shows `pkill -9 -f
boxlite-runner` and `cd apps/runner && go build -o /tmp/boxlite-runner
./cmd/runner && cd -` to use the documented make target (e.g., `make
stack-restart COMPONENTS=runner`) so builds and restarts go through the
repository's standardized wrappers and flags; ensure the surrounding text
references the status doc cheatsheet as before.
```

</details>

<!-- cr-comment:v1:1688ac1e148725377598f58f -->

_Source: Learnings_

</blockquote></details>
<details>
<summary>apps/infra-local/boxlite_local/config.py-10-15 (1)</summary><blockquote>

`10-15`: _⚠️ Potential issue_ | _🟡 Minor_ | _⚡ Quick win_

**Validate port env vars with bounds (1-65535) at load time.**

Current parsing accepts numeric but invalid ports (`0`, negatives, `>65535`), so failures are deferred to runtime networking paths instead of failing fast at config boundary.

<details>
<summary>Proposed fix</summary>

```diff
-def _parse_int_env(name: str, default: str) -> int:
+def _parse_int_env(
+    name: str,
+    default: str,
+    *,
+    min_value: int | None = None,
+    max_value: int | None = None,
+) -> int:
     raw = os.environ.get(name, default)
     try:
-        return int(raw)
+        value = int(raw)
     except ValueError as e:
         raise ValueError(f"{name} must be an integer, got: {raw!r}") from e
+    if min_value is not None and value < min_value:
+        raise ValueError(f"{name} must be >= {min_value}, got: {value}")
+    if max_value is not None and value > max_value:
+        raise ValueError(f"{name} must be <= {max_value}, got: {value}")
+    return value
```

```diff
-            pg_host_port=_parse_int_env("BOXLITE_PG_HOST_PORT", "25432"),
+            pg_host_port=_parse_int_env("BOXLITE_PG_HOST_PORT", "25432", min_value=1, max_value=65535),
...
-            redis_host_port=_parse_int_env("BOXLITE_REDIS_HOST_PORT", "26379"),
+            redis_host_port=_parse_int_env("BOXLITE_REDIS_HOST_PORT", "26379", min_value=1, max_value=65535),
```

(Apply the same bounds to all `*_PORT` fields.)
</details>

Based on learnings: “Validate inputs at the boundary. Untrusted inputs should be checked where they enter” and “Use explicit error handling: fail fast on missing config/invalid inputs.”  
   


Also applies to: 80-99

<details>
<summary>🤖 Prompt for AI Agents</summary>

```
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@apps/infra-local/boxlite_local/config.py` around lines 10 - 15, The
_parse_int_env function currently converts env values to int but doesn't
validate port ranges, allowing invalid ports (0, negative, >65535); update
_parse_int_env to check that the parsed integer is within 1..65535 and raise a
ValueError with a clear message like "{name} must be a valid port between 1 and
65535, got: {raw!r}" when out of range, and ensure every configuration entry
that uses _parse_int_env for any *_PORT setting (all occurrences of parsing port
envs) uses this tightened validation so invalid ports fail fast at load time.
```

</details>

<!-- cr-comment:v1:1ce598fc6d1ee05a46d81046 -->

_Source: Learnings_

</blockquote></details>
<details>
<summary>apps/infra-local/CONNECTIONS.md-40-42 (1)</summary><blockquote>

`40-42`: _⚠️ Potential issue_ | _🟡 Minor_ | _⚡ Quick win_

**Add language identifiers to unlabeled fenced code blocks (MD040).**

These blocks should use explicit fence languages (e.g., `text`) to satisfy markdownlint.





Also applies to: 45-47, 81-83, 86-88, 138-140, 144-146, 189-191

<details>
<summary>🤖 Prompt for AI Agents</summary>

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In @apps/infra-local/CONNECTIONS.md around lines 40 - 42, Several fenced code
blocks in CONNECTIONS.md are unlabeled which triggers markdownlint MD040; update
each unlabeled triple-backtick fence (including the shown postgres URI block and
the other occurrences at the commented ranges) to include a language identifier
such as "text" (e.g., change totext) so the code blocks are explicitly
labeled; search for similar unlabeled fences in CONNECTIONS.md and add the
language identifier consistently.


</details>

<!-- cr-comment:v1:c44a6a8ecfcd36ecff6e5c03 -->

_Source: Linters/SAST tools_

</blockquote></details>
<details>
<summary>apps/infra-local/README.md-226-230 (1)</summary><blockquote>

`226-230`: _⚠️ Potential issue_ | _🟡 Minor_ | _⚡ Quick win_

**Add fence languages to satisfy MD040 markdownlint warnings.**

The fenced blocks here should specify a language (e.g., `text`) to keep docs lint-clean.




<details>
<summary>Suggested patch</summary>

```diff
-     ```
+     ```text
      127.0.0.1 pgadmin.boxlite.test
      127.0.0.1 jaeger.boxlite.test
      # ... etc
      ```
@@
-```
+```text
 apps/infra-local/
 ├── Makefile                          # convenience wrappers (L1 + stack-* L2)
 ...
 └── tests/
 ...
-```
+```
```
</details>


Also applies to: 276-325

<details>
<summary>🤖 Prompt for AI Agents</summary>

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In @apps/infra-local/README.md around lines 226 - 230, The Markdown fenced code
blocks that show the hosts entries (the block containing "127.0.0.1
pgadmin.boxlite.test" / "127.0.0.1 jaeger.boxlite.test") and the directory tree
block starting with "apps/infra-local/" (and the other similar fenced blocks in
the same README section) must specify a fence language to satisfy MD040; update
each opening fence from totext so the hosts block and the
directory-tree/other blocks include the language identifier while leaving the
block contents unchanged.


</details>

<!-- cr-comment:v1:8d82a48e39c68ffaa7c88307 -->

_Source: Linters/SAST tools_

</blockquote></details>
<details>
<summary>apps/infra-local/CONNECTIONS.md-116-121 (1)</summary><blockquote>

`116-121`: _⚠️ Potential issue_ | _🟡 Minor_ | _⚡ Quick win_

**Fix the `aws` example command syntax.**

`AWS_ACCESS_KEY_ID` / `AWS_SECRET_ACCESS_KEY` must prefix the command, not trail after `s3 ls`; the current snippet is not a valid invocation.




<details>
<summary>Suggested patch</summary>

```diff
-aws --endpoint-url http://127.0.0.1:29000 \
-    --region us-east-1 \
-    s3 ls \
-  AWS_ACCESS_KEY_ID=minioadmin AWS_SECRET_ACCESS_KEY=minioadmin
+AWS_ACCESS_KEY_ID=minioadmin AWS_SECRET_ACCESS_KEY=minioadmin \
+aws --endpoint-url http://127.0.0.1:29000 \
+    --region us-east-1 \
+    s3 ls
```
</details>

<details>
<summary>🤖 Prompt for AI Agents</summary>

```
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@apps/infra-local/CONNECTIONS.md` around lines 116 - 121, The aws CLI example
is invalid because the environment variables are placed after the command; put
AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY before the `aws --endpoint-url ...
s3 ls` invocation so they prefix the command (e.g. export or inline before `aws
--endpoint-url ... s3 ls`), ensuring the credentials are applied to the `aws`
call rather than trailing after `s3 ls`.
```

</details>

<!-- cr-comment:v1:e504f0fb613d6d0ef1972b84 -->

</blockquote></details>
<details>
<summary>apps/infra-local/scripts/build-all-in-one-sql.py-1027-1029 (1)</summary><blockquote>

`1027-1029`: _⚠️ Potential issue_ | _🟡 Minor_

**Narrow `except BaseException` to `except Exception` in `apps/infra-local/scripts/build-all-in-one-sql.py`**

This loop intentionally collects per-migration failures and continues, but the file has no other handling for `KeyboardInterrupt`/`SystemExit`, so catching `BaseException` can swallow normal abort semantics.

<details>
<summary>Suggested patch</summary>

```diff
-        except BaseException as exc:  # noqa: BLE001
+        except Exception as exc:
             failures.append((path, exc))
             continue
```
</details>

<details>
<summary>🤖 Prompt for AI Agents</summary>

```
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@apps/infra-local/scripts/build-all-in-one-sql.py` around lines 1027 - 1029,
Replace the overly broad except BaseException in the migration loop with except
Exception so system-exiting signals aren't swallowed; specifically update the
except clause that currently does "except BaseException as exc:" (the block that
appends to the failures list using failures.append((path, exc)) and continues)
to "except Exception as exc:" to preserve KeyboardInterrupt/SystemExit behavior
while still collecting migration errors.
```

</details>

<!-- cr-comment:v1:8f9a13e2225d3558e1432c3b -->

_Source: Learnings_

</blockquote></details>
<details>
<summary>apps/infra-local/scripts/seed-init-data.sh-125-129 (1)</summary><blockquote>

`125-129`: _⚠️ Potential issue_ | _🟡 Minor_ | _⚡ Quick win_

**Flag parsing is order-dependent and can ignore `--no-bounce`.**

`--no-bounce` is only honored when it is the first argument. Example: `seed-init-data.sh --no-wait --no-bounce` still bounces API.





<details>
<summary>Proposed fix</summary>

```diff
-if [ "${1:-}" = "--no-bounce" ]; then
-  log "skipping API bounce (--no-bounce); assuming API just woke + is running its boot cycle"
-else
-  bounce_api_if_running
-fi
+NO_BOUNCE=false
+NO_WAIT=false
+for arg in "$@"; do
+  case "$arg" in
+    --no-bounce) NO_BOUNCE=true ;;
+    --no-wait) NO_WAIT=true ;;
+    *) die "unknown argument: $arg" ;;
+  esac
+done
+
+if [ "$NO_BOUNCE" = "true" ]; then
+  log "skipping API bounce (--no-bounce); assuming API just woke + is running its boot cycle"
+else
+  bounce_api_if_running
+fi
@@
-if [ "${1:-}" = "--no-wait" ] || [ "${2:-}" = "--no-wait" ]; then
+if [ "$NO_WAIT" = "true" ]; then
```
</details>


Also applies to: 144-144

<details>
<summary>🤖 Prompt for AI Agents</summary>

```
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@apps/infra-local/scripts/seed-init-data.sh` around lines 125 - 129, The
script currently checks only the first positional parameter for "--no-bounce",
so flags like "--no-wait --no-bounce" are ignored; update the check in
seed-init-data.sh to scan all args (e.g., loop over "$@" or use a case pattern)
and set a boolean if any argument equals "--no-bounce", then use that boolean to
decide whether to call bounce_api_if_running; apply the same change to the other
occurrence referenced (around the second check) so both places reliably honor
--no-bounce regardless of argument order.
```

</details>

<!-- cr-comment:v1:f99316165ca0615131728a6f -->

</blockquote></details>
<details>
<summary>apps/infra-local/scripts/_stack-common.sh-55-69 (1)</summary><blockquote>

`55-69`: _⚠️ Potential issue_ | _🟡 Minor_ | _⚡ Quick win_

**Validate PID ownership before declaring a component “running”.**

Line 63 treats any live PID from the pidfile as valid. If the PID was recycled, `stack-up` can incorrectly skip startup for a component that is actually down. Please verify the PID belongs to the expected process (or pair liveness with the expected listening port) before returning it.

<details>
<summary>🤖 Prompt for AI Agents</summary>

```
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@apps/infra-local/scripts/_stack-common.sh` around lines 55 - 69,
component_pid currently trusts any live PID from pid_file as the component
owner; update it to validate ownership before returning the PID by (1) reading
the pid from pid_file (existing pid="$(cat "$pf")"), (2) checking
/proc/$pid/cmdline or /proc/$pid/exe (or lsof/ss on the expected listening port)
to confirm the process command or listening socket matches the expected
component identity, and only then echo the pid; if validation fails, remove the
stale pidfile (rm -f "$pf") and return empty. Use the same pid_file and
component_pid symbols so stack-up will skip only when a confirmed matching
process is running.
```

</details>

<!-- cr-comment:v1:c3a3e693dfa2b71cbcc731d7 -->

</blockquote></details>
<details>
<summary>apps/infra-local/scripts/stack-logs.sh-20-23 (1)</summary><blockquote>

`20-23`: _⚠️ Potential issue_ | _🟡 Minor_ | _⚡ Quick win_

**Handle `all` mode when no log files exist.**

On Line 22, `tail -F "${LOGS_DIR}"/*.log` fails when there are no `*.log` files yet. Add a guard so `stack-logs.sh all` returns a clear message instead of an immediate error.

<details>
<summary>Proposed fix</summary>

```diff
 if [ "$1" = "all" ]; then
   # `tail -f` multiple files at once; each line gets ==> filename <== headers.
-  exec tail -F "${LOGS_DIR}"/*.log
+  compgen -G "${LOGS_DIR}/*.log" > /dev/null || die "no log files yet in ${LOGS_DIR}"
+  exec tail -F "${LOGS_DIR}"/*.log
 fi
```
</details>

<details>
<summary>🤖 Prompt for AI Agents</summary>

```
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@apps/infra-local/scripts/stack-logs.sh` around lines 20 - 23, The "all"
branch in stack-logs.sh currently runs tail -F "${LOGS_DIR}"/*.log which fails
if no log files exist; update the if-block that handles "$1" = "all" to first
check whether any files match the pattern (e.g. test for the presence of
"${LOGS_DIR}"/*.log or use a glob check) and if none are found print a clear
message like "No log files found in $LOGS_DIR" and exit non‑error (or exit 0)
instead of invoking tail; keep using LOGS_DIR and the existing tail -F
"${LOGS_DIR}"/*.log when matches are present.
```

</details>

<!-- cr-comment:v1:d9245aab9f1ed445a331b136 -->

</blockquote></details>

</blockquote></details>

<details>
<summary>🧹 Nitpick comments (4)</summary><blockquote>

<details>
<summary>apps/package.json (1)</summary><blockquote>

`116-116`: _⚡ Quick win_

**Move `@types/node-forge` to `devDependencies`.**

This is a type-only package and should not ship in runtime dependencies.

<details>
<summary>♻️ Proposed fix</summary>

```diff
-    "`@types/node-forge`": "^1.3.14",
     "algoliasearch": "^5.30.0",
```

```diff
   "devDependencies": {
+    "`@types/node-forge`": "^1.3.14",
     "`@babel/preset-env`": "^7.22.0",
```
</details>

<details>
<summary>🤖 Prompt for AI Agents</summary>

```
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@apps/package.json` at line 116, The dependency "`@types/node-forge`" is a
type-only package and should be moved from dependencies to devDependencies in
package.json; open the file, remove the "`@types/node-forge`" entry from the
top-level "dependencies" object and add the same entry (with the same version
"^1.3.14") under "devDependencies", then run npm/yarn install (or update
lockfile) to ensure the lockfile reflects the change and that no runtime code
imports this types-only package.
```

</details>

<!-- cr-comment:v1:7f8b203f55849fba42213e45 -->

</blockquote></details>
<details>
<summary>apps/infra-local/tests/unit/test_doctor_lsof.py (1)</summary><blockquote>

`6-47`: _⚡ Quick win_

**Add focused unit tests for `check_port_free` subprocess branches.**

Please add monkeypatched tests for: `lsof` missing, non-zero + stderr, and timeout/error branch. These are the high-risk paths that gate `stack-up`.

   

Based on learnings: Add or update tests when behavior changes around branching, retries, security checks, or boundaries.

<details>
<summary>🤖 Prompt for AI Agents</summary>

```
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@apps/infra-local/tests/unit/test_doctor_lsof.py` around lines 6 - 47, Add
unit tests that monkeypatch subprocess.run for the check_port_free code paths in
boxlite_local.doctor: simulate (1) FileNotFoundError to test the "lsof missing"
branch, (2) a CompletedProcess with returncode != 0 and non-empty stderr to test
the non-zero+stderr branch, and (3) a subprocess.TimeoutExpired (or other
exception) to exercise the timeout/error branch; place these tests alongside the
existing _parse_lsof_F tests (e.g., in
apps/infra-local/tests/unit/test_doctor_lsof.py), use monkeypatch to replace
subprocess.run or the function that wraps it, and assert the expected return
values or error handling behavior of check_port_free (refer to the
check_port_free function and any helper that invokes subprocess.run).
```

</details>

<!-- cr-comment:v1:23093557339b5987261aed8d -->

_Source: Learnings_

</blockquote></details>
<details>
<summary>apps/infra-local/README.md (1)</summary><blockquote>

`348-349`: _⚡ Quick win_

**Use the Make target for status instead of direct Python invocation.**

Replace `python -m boxlite_local ps` with `make ps` to stay consistent with the project workflow contract.

Based on learnings: "Use make targets for build, test, lint, format, setup, and distribution. Do not run ... python directly when a make target exists."





<details>
<summary>Suggested patch</summary>

```diff
-**Debug a stuck service:** `make stack-status` first → identify the red
-component → use the lightest possible cleanup. `python -m boxlite_local ps`
+**Debug a stuck service:** `make stack-status` first → identify the red
+component → use the lightest possible cleanup. `make ps`
 shows L1 box state; `boxlite logs boxlite-local-<name>` shows guest
 logs; `make stack-logs COMPONENT=<name>` tails L2 logs from
 `apps/infra-local/.logs/`.
```
</details>

<details>
<summary>🤖 Prompt for AI Agents</summary>

```
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@apps/infra-local/README.md` around lines 348 - 349, Replace the direct Python
invocation "python -m boxlite_local ps" with the Make target "make ps" wherever
it appears in documentation or examples (e.g., the README snippet referencing L1
box state); update the command examples and any surrounding text so they
instruct developers to run "make ps" for status, preserving existing notes about
"boxlite logs boxlite-local-<name>" and ensuring formatting/inline code
backticks match the rest of the doc.
```

</details>

<!-- cr-comment:v1:101a04bdc495abb15a195e22 -->

_Source: Learnings_

</blockquote></details>
<details>
<summary>apps/infra-local/Makefile (1)</summary><blockquote>

`89-91`: _⚡ Quick win_

**Use `$(PY)` in `stack-rebuild-l1-box` for interpreter consistency.**

Line 91 hardcodes `python`, bypassing the Makefile’s configured interpreter variable (`PY`).





<details>
<summary>Proposed fix</summary>

```diff
-	python -m boxlite_local up $(BOX)
+	$(PY) -m boxlite_local up $(BOX)
```
</details>

<details>
<summary>🤖 Prompt for AI Agents</summary>

```
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@apps/infra-local/Makefile` around lines 89 - 91, The rule for
stack-rebuild-l1-box hardcodes the interpreter by calling "python -m
boxlite_local up $(BOX)"; change it to use the Makefile's interpreter variable
by replacing that invocation with "$(PY) -m boxlite_local up $(BOX)" so the
target (stack-rebuild-l1-box) respects the configured PY value and interpreter
consistency across the Makefile.
```

</details>

<!-- cr-comment:v1:ab7e5bf33feef3ca68e3ae15 -->

</blockquote></details>

</blockquote></details>

<details>
<summary>🤖 Prompt for all review comments with AI agents</summary>

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In @apps/api/src/sandbox/managers/sandbox.manager.ts:

  • Line 259: The ORDER BY clauses are using an unquoted mixed-case identifier
    which PostgreSQL folds to lowercase; update the .orderBy calls that reference
    activity.lastActivityAt (e.g., the occurrence in sandbox_last_activity and the
    other .orderBy at the later query) to use a quoted identifier matching the WHERE
    clauses — change the argument from 'activity.lastActivityAt' to
    'activity."lastActivityAt"' in both places so it matches
    activity."lastActivityAt" used elsewhere.

In @apps/infra-local/boxlite_local/cli.py:

  • Around line 30-35: The CLI currently lets p_up and p_down accept any service
    names and silently ignore typos; update the argument parsing to validate the
    "services" values against the canonical SERVICES list at parse-time and raise a
    parser error for any unknown names. Change the add_argument call for "services"
    (both p_up and p_down) to use a custom type or argparse Action that checks each
    value is in SERVICES (or use choices=SERVICES) so argparse fails fast; ensure
    the same validation logic is applied to the other parser(s) referenced around
    lines 75-79 so all entry points reject invalid service names immediately.

In @apps/infra-local/boxlite_local/doctor.py:

  • Around line 121-126: The lsof call in check_port_free() can hang; add a
    timeout (e.g. timeout=5) to the subprocess.run(...) invocation and catch
    subprocess.TimeoutExpired around that call (the call in check_port_free used by
    doctor() and orchestrator.up()). On TimeoutExpired, log or surface a clear error
    via the same logger used by doctor() / orchestrator (or return False) so the
    preflight fails fast instead of blocking stack-up; ensure the handler preserves
    existing behavior for other non-zero results.

In @apps/infra-local/boxlite_local/execwrap.py:

  • Around line 34-39: exec_collect currently awaits execution.wait() with no
    bound; add a timeout parameter to exec_collect (e.g., timeout: Optional[float] =
    None), pass it through to the SDK call that creates the execution (box.exec(...,
    timeout=timeout)) and also guard the final wait with
    asyncio.wait_for(execution.wait(), timeout) so the underlying process is killed
    on timeout (catch asyncio.TimeoutError, call execution.kill() or
    execution.terminate() and await execution.wait() to ensure cleanup, then
    re-raise or return a clear timeout result). Update callers such as
    _wait_one_shot_exit to pass an appropriate timeout (or None) when invoking
    exec_collect and keep _wait_healthy_exec behavior unchanged if it already wraps
    exec_collect in asyncio.wait_for.

In @apps/infra-local/boxlite_local/orchestrator.py:

  • Around line 228-231: The try/except around runtime.get currently swallows all
    exceptions; change it to only treat a "box not found" case as idempotent and let
    other errors surface: catch the specific not-found exception type thrown by your
    runtime client (or inspect the exception for a not-found indicator) when calling
    runtime.get and return False only in that branch, but for any other exception
    re-raise (or log and raise) so transport/runtime failures aren’t masked; update
    the try/except that wraps runtime.get (in the down/remove flow) accordingly and
    ensure unexpected exceptions are not silently suppressed.

In @apps/infra-local/boxlite_local/services.py:

  • Line 62: Replace floating Docker tags in services.py so images are pinned to
    immutable versions/digests: update "minio/minio:latest", "minio/mc:latest",
    "joxit/docker-registry-ui:main", and "otel/opentelemetry-collector:latest" to
    specific semver tags or sha256 digests (preferred). Locate these image strings
    in apps/infra-local/boxlite_local/services.py and change them to explicit
    versions (or image@sha256:...) and run your local stack-up to verify startup;
    ensure any README or docs referencing these images are updated accordingly.

In @apps/infra-local/scripts/seed-init-data.sh:

  • Line 147: The script currently treats wait_for_default_snapshot failures as
    non-fatal by calling warn and continuing to print ok "init data ready"; change
    this so snapshot readiness failures cause the script to fail fast: replace the
    warn fallback for wait_for_default_snapshot with a non-zero exit (e.g., call die
    or exit 1) or otherwise return a failing status before the ok "init data ready"
    message so downstream automation receives a non-zero exit; locate the call to
    wait_for_default_snapshot and the later ok "init data ready" message and ensure
    the failure path prevents reaching that ok and returns an error.

In @apps/infra-local/scripts/stack-up.sh:

  • Around line 74-76: The current use of ln -sf to create "${APPS_DIR}/.env" and
    "${APPS_DIR}/apps" will clobber existing non-symlink files; update the script to
    first test each target: if the path does not exist create the symlink with ln
    -s, if it exists and is a symlink leave it alone, and if it exists but is not a
    symlink print a clear error and exit non-zero so we don't overwrite developer
    files; apply this logic for the ".env" and "apps" targets (references: APPS_DIR,
    "${APPS_DIR}/.env", "${APPS_DIR}/apps", and the ln -s invocation).
  • Around line 85-88: The current port cleanup blindly runs lsof/... | xargs kill
    -9 which can terminate unrelated processes; update the port cleanup in the
    blocks that call port_listening and warn to: prefer finding stack-owned PID(s)
    first (check a corresponding pidfile if one exists, or filter lsof/ps output by
    a known command pattern for this stack using pgrep -f or lsof -c), send SIGTERM
    to those PIDs, wait a short period (e.g. poll for port closure up to a few
    seconds), and only use SIGKILL (kill -9) as a fallback if the port is still
    held; replace direct lsof ... | xargs -r kill -9 with a safe function that
    accepts the port (used where port_listening is called) and implements
    pidfile/command validation, TERM then KILL fallback, and preserves the existing
    logging via warn.

In @apps/infra-local/sql/merged-schema.auto-gen.sql:

  • Line 10: The merged schema enables "uuid-ossp" but never enables "pgcrypto",
    so calls to gen_random_uuid() (referenced in the diff as gen_random_uuid()) will
    fail on a fresh DB; update the schema generation step to emit CREATE EXTENSION
    IF NOT EXISTS "pgcrypto"; ensure this CREATE EXTENSION is placed before any use
    of gen_random_uuid() in the generated SQL (keep the existing CREATE EXTENSION IF
    NOT EXISTS "uuid-ossp" as well) and modify the generator/template that produces
    merged-schema.auto-gen.sql so future regenerations include the pgcrypto
    extension.

In @apps/infra-local/tests/integration/test_e2e_full.py:

  • Around line 83-90: The fixture currently calls asyncio.run(setup()) outside
    the try, so if setup raises (including pytest.skip) the finally cleanup
    (asyncio.run(teardown()), mp.undo(), shutil.rmtree(tmp)) never runs; change the
    fixture to call setup() inside a try block and track whether setup completed
    (e.g., setup_done flag set to True after asyncio.run(setup()) returns) then in
    the finally always perform mp.undo() and shutil.rmtree(tmp, ignore_errors=True)
    and only call asyncio.run(teardown()) if setup_done (or ensure teardown is safe
    to call unconditionally); ensure pytest.skip exceptions are allowed to propagate
    but cleanup still runs.

In @apps/package.json:

  • Around line 343-345: The package.json currently forces a mismatched typing by
    keeping "express" at ^5.2.0 while "resolutions" pins "@types/express" to
    ^4.17.25; update the resolutions (or remove it) so "@types/express" aligns with
    Express 5 (e.g., change the resolution to @types/express@^5.x or delete the
    override), then run install and TypeScript build to ensure no type conflicts in
    NestJS 11; locate the mismatch in the "resolutions" block and adjust or remove
    the "@types/express" entry accordingly.

Minor comments:
In @apps/infra-local/boxlite_local/config.py:

  • Around line 10-15: The _parse_int_env function currently converts env values
    to int but doesn't validate port ranges, allowing invalid ports (0, negative,

65535); update _parse_int_env to check that the parsed integer is within
1..65535 and raise a ValueError with a clear message like "{name} must be a
valid port between 1 and 65535, got: {raw!r}" when out of range, and ensure
every configuration entry that uses _parse_int_env for any *_PORT setting (all
occurrences of parsing port envs) uses this tightened validation so invalid
ports fail fast at load time.

In @apps/infra-local/CONNECTIONS.md:

  • Around line 40-42: Several fenced code blocks in CONNECTIONS.md are unlabeled
    which triggers markdownlint MD040; update each unlabeled triple-backtick fence
    (including the shown postgres URI block and the other occurrences at the
    commented ranges) to include a language identifier such as "text" (e.g., change
    totext) so the code blocks are explicitly labeled; search for similar
    unlabeled fences in CONNECTIONS.md and add the language identifier consistently.
  • Around line 116-121: The aws CLI example is invalid because the environment
    variables are placed after the command; put AWS_ACCESS_KEY_ID and
    AWS_SECRET_ACCESS_KEY before the aws --endpoint-url ... s3 ls invocation so
    they prefix the command (e.g. export or inline before aws --endpoint-url ... s3 ls), ensuring the credentials are applied to the aws call rather than
    trailing after s3 ls.

In @apps/infra-local/README.md:

  • Around line 226-230: The Markdown fenced code blocks that show the hosts
    entries (the block containing "127.0.0.1 pgadmin.boxlite.test" / "127.0.0.1
    jaeger.boxlite.test") and the directory tree block starting with
    "apps/infra-local/" (and the other similar fenced blocks in the same README
    section) must specify a fence language to satisfy MD040; update each opening
    fence from totext so the hosts block and the directory-tree/other blocks
    include the language identifier while leaving the block contents unchanged.

In @apps/infra-local/scripts/_stack-common.sh:

  • Around line 55-69: component_pid currently trusts any live PID from pid_file
    as the component owner; update it to validate ownership before returning the PID
    by (1) reading the pid from pid_file (existing pid="$(cat "$pf")"), (2) checking
    /proc/$pid/cmdline or /proc/$pid/exe (or lsof/ss on the expected listening port)
    to confirm the process command or listening socket matches the expected
    component identity, and only then echo the pid; if validation fails, remove the
    stale pidfile (rm -f "$pf") and return empty. Use the same pid_file and
    component_pid symbols so stack-up will skip only when a confirmed matching
    process is running.

In @apps/infra-local/scripts/build-all-in-one-sql.py:

  • Around line 1027-1029: Replace the overly broad except BaseException in the
    migration loop with except Exception so system-exiting signals aren't swallowed;
    specifically update the except clause that currently does "except BaseException
    as exc:" (the block that appends to the failures list using
    failures.append((path, exc)) and continues) to "except Exception as exc:" to
    preserve KeyboardInterrupt/SystemExit behavior while still collecting migration
    errors.

In @apps/infra-local/scripts/seed-init-data.sh:

  • Around line 125-129: The script currently checks only the first positional
    parameter for "--no-bounce", so flags like "--no-wait --no-bounce" are ignored;
    update the check in seed-init-data.sh to scan all args (e.g., loop over "$@" or
    use a case pattern) and set a boolean if any argument equals "--no-bounce", then
    use that boolean to decide whether to call bounce_api_if_running; apply the same
    change to the other occurrence referenced (around the second check) so both
    places reliably honor --no-bounce regardless of argument order.

In @apps/infra-local/scripts/stack-logs.sh:

  • Around line 20-23: The "all" branch in stack-logs.sh currently runs tail -F
    "${LOGS_DIR}"/.log which fails if no log files exist; update the if-block that
    handles "$1" = "all" to first check whether any files match the pattern (e.g.
    test for the presence of "${LOGS_DIR}"/
    .log or use a glob check) and if none
    are found print a clear message like "No log files found in $LOGS_DIR" and exit
    non‑error (or exit 0) instead of invoking tail; keep using LOGS_DIR and the
    existing tail -F "${LOGS_DIR}"/*.log when matches are present.

In @docs/apps/infra-local-usage.md:

  • Around line 230-233: The fenced code block in Section 6.1 is missing a
    language identifier which triggers markdownlint MD040; update the backticks that
    start the block around the line containing "http://localhost:3000 → pick admin
    / user to log in (dex static users)" to include a language hint such as text
    (e.g., change totext) so the fenced block is properly annotated.
  • Around line 16-31: The docs use two different component selector variables
    (COMPONENT= vs COMPONENTS=) which is inconsistent and error-prone; pick one
    name and update all examples to use it consistently (e.g., replace
    COMPONENT=api with COMPONENTS=api and ensure make stack-restart/make stack-logs/make stack-down examples and the explanatory sentence all
    reference the same COMPONENTS= variable, including the other occurrences noted
    around lines 215-217); also ensure the parenthetical "(empty = all)" matches the
    chosen variable name.
  • Around line 127-132: Replace the direct go build workflow snippet that kills
    and rebuilds the runner with the repo-standard make wrapper; update the block
    that currently shows pkill -9 -f boxlite-runner and cd apps/runner && go build -o /tmp/boxlite-runner ./cmd/runner && cd - to use the documented make
    target (e.g., make stack-restart COMPONENTS=runner) so builds and restarts go
    through the repository's standardized wrappers and flags; ensure the surrounding
    text references the status doc cheatsheet as before.

Nitpick comments:
In @apps/infra-local/Makefile:

  • Around line 89-91: The rule for stack-rebuild-l1-box hardcodes the interpreter
    by calling "python -m boxlite_local up $(BOX)"; change it to use the Makefile's
    interpreter variable by replacing that invocation with "$(PY) -m boxlite_local
    up $(BOX)" so the target (stack-rebuild-l1-box) respects the configured PY value
    and interpreter consistency across the Makefile.

In @apps/infra-local/README.md:

  • Around line 348-349: Replace the direct Python invocation "python -m
    boxlite_local ps" with the Make target "make ps" wherever it appears in
    documentation or examples (e.g., the README snippet referencing L1 box state);
    update the command examples and any surrounding text so they instruct developers
    to run "make ps" for status, preserving existing notes about "boxlite logs
    boxlite-local-" and ensuring formatting/inline code backticks match the
    rest of the doc.

In @apps/infra-local/tests/unit/test_doctor_lsof.py:

  • Around line 6-47: Add unit tests that monkeypatch subprocess.run for the
    check_port_free code paths in boxlite_local.doctor: simulate (1)
    FileNotFoundError to test the "lsof missing" branch, (2) a CompletedProcess with
    returncode != 0 and non-empty stderr to test the non-zero+stderr branch, and (3)
    a subprocess.TimeoutExpired (or other exception) to exercise the timeout/error
    branch; place these tests alongside the existing _parse_lsof_F tests (e.g., in
    apps/infra-local/tests/unit/test_doctor_lsof.py), use monkeypatch to replace
    subprocess.run or the function that wraps it, and assert the expected return
    values or error handling behavior of check_port_free (refer to the
    check_port_free function and any helper that invokes subprocess.run).

In @apps/package.json:

  • Line 116: The dependency "@types/node-forge" is a type-only package and should
    be moved from dependencies to devDependencies in package.json; open the file,
    remove the "@types/node-forge" entry from the top-level "dependencies" object
    and add the same entry (with the same version "^1.3.14") under
    "devDependencies", then run npm/yarn install (or update lockfile) to ensure the
    lockfile reflects the change and that no runtime code imports this types-only
    package.

</details>

<details>
<summary>🪄 Autofix (Beta)</summary>

Fix all unresolved CodeRabbit comments on this PR:

- [ ] <!-- {"checkboxId": "4b0d0e0a-96d7-4f10-b296-3a18ea78f0b9"} --> Push a commit to this branch (recommended)
- [ ] <!-- {"checkboxId": "ff5b1114-7d8c-49e6-8ac1-43f82af23a33"} --> Create a new PR with the fixes

</details>

---

<details>
<summary>ℹ️ Review info</summary>

<details>
<summary>⚙️ Run configuration</summary>

**Configuration used**: defaults

**Review profile**: CHILL

**Plan**: Pro Plus

**Run ID**: `92dacb5e-5744-4885-be73-16a14db0f4fc`

</details>

<details>
<summary>📥 Commits</summary>

Reviewing files that changed from the base of the PR and between 6048e91e6c98afca85a5ca176cd2b1ffb174241d and 3424fb25eddedb742c6e9c0f07d50280d286896e.

</details>

<details>
<summary>📒 Files selected for processing (49)</summary>

* `.gitignore`
* `CLAUDE.md`
* `apps/api/src/app.module.ts`
* `apps/api/src/common/providers/openfeature-posthog.provider.ts`
* `apps/api/src/sandbox/managers/sandbox.manager.ts`
* `apps/api/tsconfig.json`
* `apps/dashboard/src/components/PostHogProviderWrapper.tsx`
* `apps/dashboard/src/mocks/handlers.ts`
* `apps/infra-local/CONNECTIONS.md`
* `apps/infra-local/Makefile`
* `apps/infra-local/README.md`
* `apps/infra-local/boxlite_local/__init__.py`
* `apps/infra-local/boxlite_local/__main__.py`
* `apps/infra-local/boxlite_local/cli.py`
* `apps/infra-local/boxlite_local/config.py`
* `apps/infra-local/boxlite_local/doctor.py`
* `apps/infra-local/boxlite_local/execwrap.py`
* `apps/infra-local/boxlite_local/orchestrator.py`
* `apps/infra-local/boxlite_local/services.py`
* `apps/infra-local/boxlite_local/types.py`
* `apps/infra-local/configs/minio/init.sh`
* `apps/infra-local/goal.md`
* `apps/infra-local/pyproject.toml`
* `apps/infra-local/scripts/_stack-common.sh`
* `apps/infra-local/scripts/apply-schema.sh`
* `apps/infra-local/scripts/build-all-in-one-sql.py`
* `apps/infra-local/scripts/seed-init-data.sh`
* `apps/infra-local/scripts/stack-build.sh`
* `apps/infra-local/scripts/stack-down.sh`
* `apps/infra-local/scripts/stack-logs.sh`
* `apps/infra-local/scripts/stack-reset.sh`
* `apps/infra-local/scripts/stack-restart.sh`
* `apps/infra-local/scripts/stack-status.sh`
* `apps/infra-local/scripts/stack-up.sh`
* `apps/infra-local/sql/merged-schema.auto-gen.sql`
* `apps/infra-local/tests/__init__.py`
* `apps/infra-local/tests/integration/__init__.py`
* `apps/infra-local/tests/integration/test_e2e_full.py`
* `apps/infra-local/tests/integration/test_multi_service.py`
* `apps/infra-local/tests/unit/__init__.py`
* `apps/infra-local/tests/unit/test_config.py`
* `apps/infra-local/tests/unit/test_doctor_lsof.py`
* `apps/infra-local/tests/unit/test_orchestrator.py`
* `apps/infra-local/tests/unit/test_topo.py`
* `apps/package.json`
* `apps/runner/pkg/boxlite/registry.go`
* `docs/apps/infra-local-usage.md`
* `libs`
* `tsconfig.base.json`

</details>

</details>

<!-- This is an auto-generated comment by CodeRabbit for review status -->

Comment thread apps/api/src/sandbox/managers/sandbox.manager.ts Outdated
Comment on lines +30 to +35
p_up.add_argument("services", nargs="*", help="Subset of services (default: all)")
p_up.add_argument("--skip-doctor", action="store_true", help="Bypass preflight checks")

p_down = sub.add_parser("down", help="Stop + remove services.")
p_down.add_argument("services", nargs="*", help="Subset of services (default: all)")
p_down.add_argument("--wipe", action="store_true", help="Also remove the data dir")

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Reject unknown service names at parse-time instead of silently succeeding.

up/down currently accept arbitrary names; typos can return success while doing nothing. Validate against SERVICES in argparse so bad input fails immediately.

Proposed fix
-    p_up.add_argument("services", nargs="*", help="Subset of services (default: all)")
+    p_up.add_argument(
+        "services",
+        nargs="*",
+        choices=sorted(SERVICES.keys()),
+        help="Subset of services (default: all)",
+    )
...
-    p_down.add_argument("services", nargs="*", help="Subset of services (default: all)")
+    p_down.add_argument(
+        "services",
+        nargs="*",
+        choices=sorted(SERVICES.keys()),
+        help="Subset of services (default: all)",
+    )

Based on learnings: “Validate inputs at the boundary” and “Use explicit error handling: fail fast on invalid inputs.”

Also applies to: 75-79

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@apps/infra-local/boxlite_local/cli.py` around lines 30 - 35, The CLI
currently lets p_up and p_down accept any service names and silently ignore
typos; update the argument parsing to validate the "services" values against the
canonical SERVICES list at parse-time and raise a parser error for any unknown
names. Change the add_argument call for "services" (both p_up and p_down) to use
a custom type or argparse Action that checks each value is in SERVICES (or use
choices=SERVICES) so argparse fails fast; ensure the same validation logic is
applied to the other parser(s) referenced around lines 75-79 so all entry points
reject invalid service names immediately.

Source: Learnings

Comment on lines +121 to +126
proc = subprocess.run(
["lsof", "-nP", f"-iTCP:{port}", "-sTCP:LISTEN", "-F", "pcLn"],
capture_output=True,
text=True,
check=False,
)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail

# 1) Inspect the target file around the reported lines
FILE="apps/infra-local/boxlite_local/doctor.py"
echo "=== Context: $FILE (lines 90-170) ==="
nl -ba "$FILE" | sed -n '90,170p'

# 2) Find the definition of check_port_free (and surrounding logic)
echo
echo "=== Grep: check_port_free in $FILE ==="
rg -n "def check_port_free|check_port_free\(" "$FILE" || true

# 3) Inspect doctor() and any higher-level timeouts/cancellation/strict handling
echo
echo "=== Grep: doctor( and strict usage in $FILE ==="
rg -n "def doctor\(|strict\s*=\s*True|doctor\(" "$FILE" || true

# 4) Search repository for where doctor(strict=True) is called (to confirm blocking path)
echo
echo "=== Repo search: doctor(strict=True) ==="
rg -n "doctor\(\s*.*strict\s*=\s*True|doctor\(\s*strict\s*=\s*True|strict\s*=\s*True.*doctor\(" . || true

# 5) Search for stack-up / orchestrator call path around doctor invocation
echo
echo "=== Repo search: stack-up and doctor usage ==="
rg -n "stack-up|stack_up|stackup|orchestrator|startup.*doctor|doctor.*stack" apps/infra-local/boxlite_local . || true

Repository: boxlite-ai/boxlite

Length of output: 196


🏁 Script executed:

#!/bin/bash
set -euo pipefail

FILE="apps/infra-local/boxlite_local/doctor.py"

echo "=== Context: $FILE (lines 90-170) ==="
awk 'NR>=90 && NR<=170 {printf "%6d %s\n", NR, $0}' "$FILE" || true

echo
echo "=== Grep: check_port_free in $FILE ==="
rg -n "def check_port_free|check_port_free\(" "$FILE" || true

echo
echo "=== Grep: doctor() and strict usage in $FILE ==="
rg -n "def doctor\(|strict\s*=\s*True|doctor\(" "$FILE" || true

echo
echo "=== Repo search: doctor(strict=True) ==="
rg -n "doctor\(\s*.*strict\s*=\s*True|strict\s*=\s*True.*doctor\(" . || true

echo
echo "=== Repo search: stack-up / orchestrator call path around doctor ==="
rg -n "stack-up|stack_up|stackup|orchestrator|startup.*doctor|doctor.*stack" apps/infra-local/boxlite_local . || true

Repository: boxlite-ai/boxlite

Length of output: 12340


🏁 Script executed:

#!/bin/bash
set -euo pipefail

FILE_DOCTOR="apps/infra-local/boxlite_local/doctor.py"
FILE_ORCH="apps/infra-local/boxlite_local/orchestrator.py"

echo "=== doctor.py: context 160-260 ==="
awk 'NR>=150 && NR<=260 {printf "%6d %s\n", NR, $0}' "$FILE_DOCTOR" || true

echo
echo "=== doctor.py: subprocess.run occurrences ==="
rg -n "subprocess\.run\(" "$FILE_DOCTOR" || true

echo
echo "=== orchestrator.py: context around doctor(strict=True) ==="
awk 'NR>=330 && NR<=430 {printf "%6d %s\n", NR, $0}' "$FILE_ORCH" || true

echo
echo "=== orchestrator.py: subprocess usage (to see if there is generic timeouts) ==="
rg -n "subprocess\.(run|check_output|Popen)\(" "$FILE_ORCH" || true

Repository: boxlite-ai/boxlite

Length of output: 6041


🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "=== Repo search: wait_for / timeout around up()/doctor() ==="
rg -n "asyncio\.wait_for|wait_for\(|timeout[_a-zA-Z]*=.*doctor|doctor\(.*\).*timeout|signal\.alarm|setitimer" apps/infra-local/boxlite_local . || true

echo
echo "=== Repo search: CLI entry points calling orchestrator.up ==="
rg -n "orchestrator\.up|await up\(|up\(config" apps/infra-local/boxlite_local/cli.py apps/infra-local/boxlite_local . || true

echo
echo "=== Repo search: any timeout in scripts/stack-up.sh ==="
rg -n "timeout|kill|alarm" apps/infra-local/scripts/stack-up.sh apps/infra-local/scripts/*.sh apps/infra-local/README.md || true

Repository: boxlite-ai/boxlite

Length of output: 7777


🏁 Script executed:

#!/bin/bash
set -euo pipefail
FILE_DOCTOR="apps/infra-local/boxlite_local/doctor.py"

echo "=== doctor.py: imports/context (1-120) ==="
awk 'NR>=1 && NR<=130 {printf "%6d %s\n", NR, $0}' "$FILE_DOCTOR"

Repository: boxlite-ai/boxlite

Length of output: 5332


🏁 Script executed:

#!/bin/bash
set -euo pipefail
FILE="apps/infra-local/boxlite_local/types.py"
awk 'NR>=1 && NR<=140 {printf "%6d %s\n", NR, $0}' "$FILE"

Repository: boxlite-ai/boxlite

Length of output: 2764


Add a timeout to the lsof preflight port check to prevent hangs.

check_port_free() runs subprocess.run(["lsof", ...]) without a timeout, and doctor(..., strict=True) is awaited from orchestrator.up(), so a stalled lsof blocks the preflight and stops stack-up from progressing.

🛠️ Proposed fix
-    proc = subprocess.run(
-        ["lsof", "-nP", f"-iTCP:{port}", "-sTCP:LISTEN", "-F", "pcLn"],
-        capture_output=True,
-        text=True,
-        check=False,
-    )
+    try:
+        proc = subprocess.run(
+            ["lsof", "-nP", f"-iTCP:{port}", "-sTCP:LISTEN", "-F", "pcLn"],
+            capture_output=True,
+            text=True,
+            check=False,
+            timeout=5,
+        )
+    except subprocess.TimeoutExpired:
+        return DoctorCheck(
+            name=name,
+            severity=Severity.FAIL,
+            msg=f"lsof timed out while checking port {port}",
+            hint="Retry after closing heavy file/network scans; cannot verify port conflict while lsof is hung.",
+        )
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
proc = subprocess.run(
["lsof", "-nP", f"-iTCP:{port}", "-sTCP:LISTEN", "-F", "pcLn"],
capture_output=True,
text=True,
check=False,
)
try:
proc = subprocess.run(
["lsof", "-nP", f"-iTCP:{port}", "-sTCP:LISTEN", "-F", "pcLn"],
capture_output=True,
text=True,
check=False,
timeout=5,
)
except subprocess.TimeoutExpired:
return DoctorCheck(
name=name,
severity=Severity.FAIL,
msg=f"lsof timed out while checking port {port}",
hint="Retry after closing heavy file/network scans; cannot verify port conflict while lsof is hung.",
)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@apps/infra-local/boxlite_local/doctor.py` around lines 121 - 126, The lsof
call in check_port_free() can hang; add a timeout (e.g. timeout=5) to the
subprocess.run(...) invocation and catch subprocess.TimeoutExpired around that
call (the call in check_port_free used by doctor() and orchestrator.up()). On
TimeoutExpired, log or surface a clear error via the same logger used by
doctor() / orchestrator (or return False) so the preflight fails fast instead of
blocking stack-up; ensure the handler preserves existing behavior for other
non-zero results.

Source: Learnings

Comment on lines +34 to +39
await asyncio.gather(
drain(execution.stdout(), out_parts),
drain(execution.stderr(), err_parts),
)
result = await execution.wait()
return result.exit_code, "".join(out_parts), "".join(err_parts)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Verify call sites and whether timeout wrappers already exist.
rg -n "exec_collect\(" apps/infra-local -C2
rg -n "asyncio\.wait_for|asyncio\.timeout" apps/infra-local -C2

Repository: boxlite-ai/boxlite

Length of output: 6330


🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Inspect exec_collect implementation closely
sed -n '1,220p' apps/infra-local/boxlite_local/execwrap.py | nl -ba | sed -n '1,220p'

# Look for timeout/cancellation handling inside execwrap.py
rg -n "timeout_s|wait_for|CancelledError|cancel|terminate|kill|finally" apps/infra-local/boxlite_local/execwrap.py

# Check for any exec_collect signature/kwargs containing timeout at call sites
rg -n "exec_collect\([^)]*timeout|timeout_s\s*=" apps/infra-local -S

Repository: boxlite-ai/boxlite

Length of output: 105


🏁 Script executed:

#!/bin/bash
set -euo pipefail

FILE="apps/infra-local/boxlite_local/execwrap.py"

# Show exec_collect region (avoid huge output)
echo "=== execwrap.py (first 220 lines with numbers) ==="
cat -n "$FILE" | sed -n '1,220p'

echo "=== execwrap.py (lines around exec_collect) ==="
# Find line number of `def exec_collect`
LINE=$(rg -n "async def exec_collect" "$FILE" | head -n 1 | cut -d: -f1)
START=$((LINE-30))
END=$((LINE+140))
cat -n "$FILE" | sed -n "${START},${END}p"

echo "=== Search for timeout/cancellation handling in execwrap.py ==="
rg -n "timeout_s|wait_for|CancelledError|cancel|terminate|kill" "$FILE"

Repository: boxlite-ai/boxlite

Length of output: 3637


🏁 Script executed:

#!/bin/bash
set -euo pipefail

FILE="apps/infra-local/boxlite_local/execwrap.py"

# Show exec_collect region using safe line bounds
LINE=$(rg -n "async def exec_collect" "$FILE" | head -n 1 | cut -d: -f1)
START=$((LINE-10))
if [ "$START" -lt 1 ]; then START=1; fi
END=$((LINE+120))

echo "=== execwrap.py: exec_collect vicinity ($START..$END) ==="
cat -n "$FILE" | sed -n "${START},${END}p"

echo "=== Search for timeout/cancellation handling in execwrap.py ==="
rg -n "timeout_s|wait_for|CancelledError|cancel|terminate|kill" "$FILE" || true

echo "=== Search for any cleanup/finally usage around exec_collect ==="
rg -n "finally|except" "$FILE" || true

Repository: boxlite-ai/boxlite

Length of output: 1788


🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Find where Execution and box.exec are implemented
rg -n "class Execution|async def wait\(|def wait\(|stdout\(\)|stderr\(\)|async def stdout|async def stderr|async def exec\(" apps/infra-local -S

# Narrow on Execution.wait implementation
rg -n "async def wait\(" apps/infra-local -S

# Narrow on box.exec / "def exec" implementation
rg -n "async def exec\(|def exec\(" apps/infra-local -S

Repository: boxlite-ai/boxlite

Length of output: 865


🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Find all exec_collect call sites in the repo
rg -n "exec_collect\(" . -S

# Find where Execution is implemented (outside tests)
rg -n "class Execution\b|async def wait\(|def wait\(|async def stdout\(|async def stderr\(|def stdout\(|def stderr\(" . -S

Repository: boxlite-ai/boxlite

Length of output: 2362


🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Find all exec_collect call sites in the repo
rg -n "exec_collect\(" . -S

# Find where Execution is implemented (outside tests)
rg -n "class Execution\b|async def wait\(|def wait\(|async def stdout\(|async def stderr\(|def stdout\(|def stderr\(" . -S

Repository: boxlite-ai/boxlite

Length of output: 2362


🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Inspect SDK execution/wait implementations
sed -n '150,320p' sdks/python/boxlite/orchestration/box_runtime.py | cat -n
echo "----"
sed -n '150,320p' sdks/python/boxlite/sync_api/_execution.py | cat -n
echo "----"
sed -n '120,240p' sdks/python/boxlite/interactivebox.py | cat -n

echo "---- Searches for cancellation/termination in SDK files ----"
rg -n "CancelledError|cancelled|terminate|kill|shutdown" sdks/python/boxlite -S

Repository: boxlite-ai/boxlite

Length of output: 28638


🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Inspect SDK execution/wait implementations
echo "=== sdks/python/boxlite/orchestration/box_runtime.py (approx. 150-320) ==="
sed -n '150,320p' sdks/python/boxlite/orchestration/box_runtime.py | cat -n

echo "=== sdks/python/boxlite/sync_api/_execution.py (approx. 150-320) ==="
sed -n '150,320p' sdks/python/boxlite/sync_api/_execution.py | cat -n

echo "=== sdks/python/boxlite/interactivebox.py (approx. 120-240) ==="
sed -n '120,240p' sdks/python/boxlite/interactivebox.py | cat -n

echo "=== Searches for cancellation/termination in SDK files ==="
rg -n "CancelledError|cancelled|terminate|kill|shutdown" sdks/python/boxlite -S

Repository: boxlite-ai/boxlite

Length of output: 28834


🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Locate the concrete async Execution implementation (not just wrappers/tests)
rg -n "class Execution\b" sdks/python/boxlite -S
rg -n "async def wait\(" sdks/python/boxlite -S
rg -n "def kill\(" sdks/python/boxlite -S

# Inspect orchestration/box_runtime.py around the CancelledError locations we saw earlier
FILE="sdks/python/boxlite/orchestration/box_runtime.py"
echo "=== orchestration/box_runtime.py: context around CancelledError ==="
rg -n "CancelledError" "$FILE" -n || true
# Print a couple windows around the first few matches (safe bounds)
for ln in $(rg -n "CancelledError" "$FILE" | cut -d: -f1 | head -n 6); do
  START=$((ln-40)); [ "$START" -lt 1 ] && START=1
  END=$((ln+40))
  echo "---- $FILE ($START..$END) around CancelledError at line $ln ----"
  sed -n "${START},${END}p" "$FILE" | cat -n
done

Repository: boxlite-ai/boxlite

Length of output: 44


🏁 Script executed:

#!/bin/bash
set -euo pipefail

FILE="sdks/python/boxlite/orchestration/box_runtime.py"

echo "=== Top of $FILE ==="
sed -n '1,120p' "$FILE" | cat -n

echo "=== Lines containing CancelledError in $FILE ==="
rg -n "CancelledError" "$FILE" || true

echo "=== Lines containing _execution.kill() in $FILE ==="
rg -n "_execution\.kill\(|execution\.kill\(" "$FILE" || true

echo "=== Lines containing def/async def wait in sdks/python/boxlite ==="
rg -n "async def wait\(|def wait\(" sdks/python/boxlite -S

echo "=== Grep for kill/terminate/subprocess cleanup related to Execution.wait ==="
rg -n "kill\(|terminate\(|SIGTERM|SIGKILL|subprocess" sdks/python/boxlite -S

Repository: boxlite-ai/boxlite

Length of output: 6631


🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Inspect orchestrator timeout handling
ORCH="apps/infra-local/boxlite_local/orchestrator.py"
echo "=== orchestrator.py around asyncio.wait_for(exec_collect) ==="
rg -n "wait_for\\(\\s*exec_collect\\(|exec_collect\\(box, cmd" "$ORCH"
# Print relevant ranges manually (safe)
sed -n '230,310p' "$ORCH" | cat -n
echo "----"
sed -n '320,390p' "$ORCH" | cat -n

# Inspect SDK for the concrete Execution implementation (beyond sync wrappers)
echo "=== Search sdks/python/boxlite for Execution implementation details ==="
rg -n "async def wait\\(" sdks/python/boxlite -S
rg -n "def kill\\(" sdks/python/boxlite -S
rg -n "class .*Execution" sdks/python/boxlite -S
rg -n "execution\\.kill|_execution\\.kill|\\.kill\\(\\)" sdks/python/boxlite -S | head -n 50

# Identify likely file containing async wait()/kill: print those files if found
for f in $(rg -l "async def wait\\(|def kill\\(" sdks/python/boxlite -S); do
  echo "---- file: $f ----"
  head -n 5 "$f"
done

Repository: boxlite-ai/boxlite

Length of output: 8193


🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Inspect orchestrator timeout handling
ORCH="apps/infra-local/boxlite_local/orchestrator.py"
echo "=== orchestrator.py around asyncio.wait_for(exec_collect) ==="
rg -n "wait_for\\(\\s*exec_collect\\(|exec_collect\\(box, cmd" "$ORCH"
# Print relevant ranges manually (safe)
sed -n '230,310p' "$ORCH" | cat -n
echo "----"
sed -n '320,390p' "$ORCH" | cat -n

# Inspect SDK for the concrete Execution implementation (beyond sync wrappers)
echo "=== Search sdks/python/boxlite for Execution implementation details ==="
rg -n "async def wait\\(" sdks/python/boxlite -S
rg -n "def kill\\(" sdks/python/boxlite -S
rg -n "class .*Execution" sdks/python/boxlite -S
rg -n "execution\\.kill|_execution\\.kill|\\.kill\\(\\)" sdks/python/boxlite -S | head -n 50

# Identify likely file containing async wait()/kill: print those files if found
for f in $(rg -l "async def wait\\(|def kill\\(" sdks/python/boxlite -S); do
  echo "---- file: $f ----"
  head -n 5 "$f"
done

Repository: boxlite-ai/boxlite

Length of output: 8193


🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Search for usage of kill on execution objects across repo
rg -n "\.kill\(" apps/infra-local/boxlite_local sdks/python/boxlite -S

# Find where exec returns an Execution and see its interface (wait/stdout/stderr/kill)
rg -n "return .*Execution|-> .*Execution|class .*Execution" sdks/python/boxlite -S

# Look for ".stdout(" / ".stderr(" / ".wait()" / ".kill()" patterns in the SDK exec implementation
rg -n "exec\\(|def exec\\(|async def exec\\(" sdks/python/boxlite -S | head -n 50

Repository: boxlite-ai/boxlite

Length of output: 5169


🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Search for usage of kill on execution objects across repo
rg -n "\.kill\(" apps/infra-local/boxlite_local sdks/python/boxlite -S

# Find where exec returns an Execution and see its interface (wait/stdout/stderr/kill)
rg -n "return .*Execution|-> .*Execution|class .*Execution" sdks/python/boxlite -S

# Look for where async/sync exec is defined to locate the native Execution object
rg -n "async def exec\\(|def exec\\(" sdks/python/boxlite -S | head -n 50

Repository: boxlite-ai/boxlite

Length of output: 652


🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Search for usage of kill on execution objects across repo
rg -n "\.kill\(" apps/infra-local/boxlite_local sdks/python/boxlite -S

# Find where exec returns an Execution and see its interface (wait/stdout/stderr/kill)
rg -n "return .*Execution|-> .*Execution|class .*Execution" sdks/python/boxlite -S

# Locate where exec is defined (to find the native Execution object interface)
rg -n "async def exec\\(|def exec\\(" sdks/python/boxlite -S | head -n 50

Repository: boxlite-ai/boxlite

Length of output: 652


🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Inspect async exec implementation in simplebox.py (to see Execution interface/type)
FILE="sdks/python/boxlite/simplebox.py"
echo "=== simplebox.py: async def exec vicinity ==="
rg -n "async def exec\\(" "$FILE"
LINE=$(rg -n "async def exec\\(" "$FILE" | head -n 1 | cut -d: -f1)
START=$((LINE-40)); [ "$START" -lt 1 ] && START=1
END=$((LINE+140))
sed -n "${START},${END}p" "$FILE" | cat -n

# Inspect sync_api/_execution.py for wait/kill interface
FILE="sdks/python/boxlite/sync_api/_execution.py"
echo "=== sync_api/_execution.py: kill + wait interface vicinity ==="
rg -n "def kill\\(|def wait\\(|async def wait\\(" "$FILE"
sed -n '200,320p' "$FILE" | cat -n

# Find non-sync Execution class definitions (broader regex)
rg -n "class .*Execution" sdks/python/boxlite -S

# Search for ".kill()" usage outside orchestration/box_runtime.py to infer native execution handling
rg -n "\.kill\(" sdks/python/boxlite -S

Repository: boxlite-ai/boxlite

Length of output: 10186


🏁 Script executed:

#!/bin/bash
set -euo pipefail

ORCH="apps/infra-local/boxlite_local/orchestrator.py"
echo "=== orchestrator.py: show imports/top around runtime/get_runtime/exec_collect usage ==="
sed -n '1,120p' "$ORCH" | cat -n

echo "=== orchestrator.py: find where runtime.get(name) is used and what box type it is ==="
rg -n "runtime\.get\(|create_box\(|ManagedBox|SimpleBox|box\.exec\(" "$ORCH" -S

echo "=== orchestrator.py: show around one-shot bootstrap usage of runtime.get and exec_collect(box, 'true') ==="
# locate the function containing exec_collect(box, "true"
LINE=$(rg -n "exec_collect\\(box, \"true\"" "$ORCH" | head -n1 | cut -d: -f1)
START=$((LINE-60)); [ "$START" -lt 1 ] && START=1
END=$((LINE+60))
sed -n "${START},${END}p" "$ORCH" | cat -n

echo "=== execwrap.py: search for kill/terminate usage around exec_collect ==="
FILE="apps/infra-local/boxlite_local/execwrap.py"
rg -n "kill|terminate|finally|CancelledError" "$FILE" -S

Repository: boxlite-ai/boxlite

Length of output: 10688


🏁 Script executed:

#!/bin/bash
set -euo pipefail

ORCH="apps/infra-local/boxlite_local/orchestrator.py"
echo "=== orchestrator.py: show imports/top around runtime/get_runtime/exec_collect usage ==="
sed -n '1,120p' "$ORCH" | cat -n

echo "=== orchestrator.py: find where runtime.get(name) is used and what box type it is ==="
rg -n "runtime\.get\(|create_box\(|ManagedBox|SimpleBox|box\.exec\(" "$ORCH" -S

echo "=== orchestrator.py: show around one-shot bootstrap usage of runtime.get and exec_collect(box, 'true') ==="
# locate the function containing exec_collect(box, "true"
LINE=$(rg -n "exec_collect\\(box, \"true\"" "$ORCH" | head -n1 | cut -d: -f1)
START=$((LINE-60)); [ "$START" -lt 1 ] && START=1
END=$((LINE+60))
sed -n "${START},${END}p" "$ORCH" | cat -n

echo "=== execwrap.py: search for kill/terminate usage around exec_collect ==="
FILE="apps/infra-local/boxlite_local/execwrap.py"
rg -n "kill|terminate|finally|CancelledError" "$FILE" -S || true

Repository: boxlite-ai/boxlite

Length of output: 10688


🏁 Script executed:

#!/bin/bash
set -euo pipefail

ORCH="apps/infra-local/boxlite_local/orchestrator.py"
echo "=== orchestrator.py: show imports/top ==="
sed -n '1,120p' "$ORCH" | cat -n

echo "=== orchestrator.py: find where runtime.get(...) is used and what box type it is ==="
rg -n "runtime\.get\(|create_box\(|ManagedBox|SimpleBox|box\.exec\(" "$ORCH" -S || true

echo "=== orchestrator.py: show around one-shot bootstrap probe using exec_collect(box, \"true\", ...) ==="
LINE=$(rg -n "exec_collect\\(box, \"true\"" "$ORCH" | head -n1 | cut -d: -f1)
START=$((LINE-60)); [ "$START" -lt 1 ] && START=1
END=$((LINE+60))
sed -n "${START},${END}p" "$ORCH" | cat -n

echo "=== execwrap.py: search for kill/terminate/cancellation handling ==="
FILE="apps/infra-local/boxlite_local/execwrap.py"
rg -n "kill|terminate|finally|CancelledError" "$FILE" -S || true

Repository: boxlite-ai/boxlite

Length of output: 10629


Add an execution timeout to exec_collect (and ensure it stops the underlying process) to prevent orchestration hangs.

apps/infra-local/boxlite_local/execwrap.py’s exec_collect drains stdout/stderr and then does result = await execution.wait() with no time bound. Even though _wait_healthy_exec(...) wraps exec_collect in asyncio.wait_for(..., timeout=hc.timeout_s), _wait_one_shot_exit(...) awaits exec_collect(box, "true", []) directly (no per-exec timeout), so a stuck command can block the orchestration loop beyond the loop’s timeout_s.

exec_collect should accept a timeout and propagate it to box.exec(..., timeout=...) (the SDK supports a timeout arg) or otherwise guarantee the underlying execution is killed on timeout.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@apps/infra-local/boxlite_local/execwrap.py` around lines 34 - 39,
exec_collect currently awaits execution.wait() with no bound; add a timeout
parameter to exec_collect (e.g., timeout: Optional[float] = None), pass it
through to the SDK call that creates the execution (box.exec(...,
timeout=timeout)) and also guard the final wait with
asyncio.wait_for(execution.wait(), timeout) so the underlying process is killed
on timeout (catch asyncio.TimeoutError, call execution.kill() or
execution.terminate() and await execution.wait() to ensure cleanup, then
re-raise or return a clear timeout result). Update callers such as
_wait_one_shot_exit to pass an appropriate timeout (or None) when invoking
exec_collect and keep _wait_healthy_exec behavior unchanged if it already wraps
exec_collect in asyncio.wait_for.

Source: Learnings

Comment on lines +31 to +33
for name, spec in services.items():
ts.add(name, *spec.depends_on)
ts.prepare()

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Validate depends_on entries before graph construction.

Line 32 accepts dependency names that are not keys in services. Those unknown nodes can appear in topo output, then Line 386/Line 401 will crash on services[n] with KeyError.

Proposed fix
 def topo_sort(services: dict[str, ServiceSpec]) -> list[list[str]]:
@@
-    ts: TopologicalSorter[str] = TopologicalSorter()
+    unknown_deps = sorted(
+        {
+            dep
+            for spec in services.values()
+            for dep in spec.depends_on
+            if dep not in services
+        }
+    )
+    if unknown_deps:
+        raise ValueError(f"Unknown service dependency(ies): {', '.join(unknown_deps)}")
+
+    ts: TopologicalSorter[str] = TopologicalSorter()

Comment on lines +74 to +76
[ -L "${APPS_DIR}/.env" ] || ln -sf api/.env "${APPS_DIR}/.env"
# apps/apps → . (webpack.config.js path resolution quirk)
[ -L "${APPS_DIR}/apps" ] || ln -sf . "${APPS_DIR}/apps"

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Avoid clobbering a real apps/.env with forced symlink creation.

Line 74 uses ln -sf, which will overwrite an existing non-symlink file. That can destroy local developer config unexpectedly. Guard for existing non-symlink paths and fail with a clear message instead of force-replacing.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@apps/infra-local/scripts/stack-up.sh` around lines 74 - 76, The current use
of ln -sf to create "${APPS_DIR}/.env" and "${APPS_DIR}/apps" will clobber
existing non-symlink files; update the script to first test each target: if the
path does not exist create the symlink with ln -s, if it exists and is a symlink
leave it alone, and if it exists but is not a symlink print a clear error and
exit non-zero so we don't overwrite developer files; apply this logic for the
".env" and "apps" targets (references: APPS_DIR, "${APPS_DIR}/.env",
"${APPS_DIR}/apps", and the ln -s invocation).

Comment on lines +85 to +88
if port_listening "${PORT_API}"; then
warn "port ${PORT_API} already in use — killing prior listener"
lsof -ti :${PORT_API} -sTCP:LISTEN | xargs -r kill -9 2>/dev/null || true
sleep 1

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Don’t blindly kill -9 any process listening on these ports.

These blocks terminate whatever is bound to the port, even if it’s unrelated to this stack. This can kill other local services and lose work. Limit termination to stack-owned PIDs (from pidfiles / command validation), and use TERM with a short wait before KILL fallback.

Also applies to: 126-129, 155-158, 185-188

🧰 Tools
🪛 Shellcheck (0.11.0)

[info] 87-87: Double quote to prevent globbing and word splitting.

(SC2086)

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@apps/infra-local/scripts/stack-up.sh` around lines 85 - 88, The current port
cleanup blindly runs lsof/... | xargs kill -9 which can terminate unrelated
processes; update the port cleanup in the blocks that call port_listening and
warn to: prefer finding stack-owned PID(s) first (check a corresponding pidfile
if one exists, or filter lsof/ps output by a known command pattern for this
stack using pgrep -f or lsof -c), send SIGTERM to those PIDs, wait a short
period (e.g. poll for port closure up to a few seconds), and only use SIGKILL
(kill -9) as a fallback if the port is still held; replace direct lsof ... |
xargs -r kill -9 with a safe function that accepts the port (used where
port_listening is called) and implements pidfile/command validation, TERM then
KILL fallback, and preserves the existing logging via warn.

Comment thread apps/infra-local/sql/merged-schema.auto-gen.sql Outdated
Comment on lines +83 to +90
asyncio.run(setup())
try:
yield cfg
finally:
asyncio.run(teardown())
mp.undo()
shutil.rmtree(tmp, ignore_errors=True)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Guarantee fixture cleanup when setup skips or errors.

Line 83 runs setup() before entering the current try/finally. If setup raises pytest.skip (Line 63) or another exception, Lines 87-90 never run, so the env patch/temp dir can leak into subsequent tests.

💡 Proposed fix
-    asyncio.run(setup())
-    try:
-        yield cfg
-    finally:
-        asyncio.run(teardown())
-        mp.undo()
-        shutil.rmtree(tmp, ignore_errors=True)
+    try:
+        asyncio.run(setup())
+        yield cfg
+    finally:
+        try:
+            asyncio.run(teardown())
+        finally:
+            mp.undo()
+            shutil.rmtree(tmp, ignore_errors=True)

Based on learnings: "Close and release files, sockets, clients, browser handles, and subprocesses explicitly."

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@apps/infra-local/tests/integration/test_e2e_full.py` around lines 83 - 90,
The fixture currently calls asyncio.run(setup()) outside the try, so if setup
raises (including pytest.skip) the finally cleanup (asyncio.run(teardown()),
mp.undo(), shutil.rmtree(tmp)) never runs; change the fixture to call setup()
inside a try block and track whether setup completed (e.g., setup_done flag set
to True after asyncio.run(setup()) returns) then in the finally always perform
mp.undo() and shutil.rmtree(tmp, ignore_errors=True) and only call
asyncio.run(teardown()) if setup_done (or ensure teardown is safe to call
unconditionally); ensure pytest.skip exceptions are allowed to propagate but
cleanup still runs.

Source: Learnings

Comment thread apps/package.json Outdated
Copilot Bot review requested due to automatic review settings June 8, 2026 04:56

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 39 out of 43 changed files in this pull request and generated 3 comments.

Comments suppressed due to low confidence (2)

tsconfig.base.json:1

  • This appears to be a committed symlink (or symlink-like pointer) rather than a valid JSON tsconfig. Git symlinks can break on Windows checkouts and some tooling that expects tsconfig.base.json to be parseable JSON. Consider replacing this with a real root tsconfig.base.json (or a small JSON file that extends the apps/tsconfig.base.json) to avoid cross-platform and toolchain issues; apply the same approach to the root-level libs pointer as well.
    apps/infra-local/tests/integration/test_e2e_full.py:1
  • asyncio.run(setup()) happens before the try/finally. If setup() triggers pytest.skip(...) or raises/asserts, the fixture exits before entering the finally, so mp.undo() and shutil.rmtree(tmp) won’t run (leaking env vars and temp dirs). Move the setup() call inside the try so cleanup runs even when setup skips/fails.

Comment on lines +69 to +83
<PostHogProvider
apiKey="phc_local_dev_no_op"
options={{
// api_host is required by posthog-js but never used because
// capturing is opted out and flags are bootstrapped below.
api_host: 'https://localhost.invalid',
bootstrap: { featureFlags: LOCAL_DEV_FEATURE_FLAG_DEFAULTS },
advanced_disable_feature_flags: true, // skip /decide network call
autocapture: false,
capture_pageview: false,
capture_pageleave: false,
opt_out_capturing_by_default: true,
disable_session_recording: true,
loaded: (posthog) => posthog.opt_out_capturing(),
}}
Comment on lines +20 to +23
if [ "$1" = "all" ]; then
# `tail -f` multiple files at once; each line gets ==> filename <== headers.
exec tail -F "${LOGS_DIR}"/*.log
fi
Comment on lines +98 to +108
_MINIO_INIT_SCRIPT = """\
set -eu
for i in 1 2 3 4 5; do
if mc alias set boxlite "$MINIO_URL" "$MINIO_USER" "$MINIO_PASSWORD" 2>/dev/null; then break; fi
echo "init: minio not ready yet (attempt $i)"
sleep 2
done
mc alias set boxlite "$MINIO_URL" "$MINIO_USER" "$MINIO_PASSWORD"
mc mb --ignore-existing boxlite/boxlite
echo "init: ok - boxlite bucket ready"
"""
@lilongen

lilongen commented Jun 9, 2026

Copy link
Copy Markdown
Contributor Author

This push does three things:

  1. Scope split — moved changes not directly related to infra-local out into their own separate PRs (already removed from this PR's diff after the rebase).

  2. DB schema setup — switched from pre-merging all migrations into a single all-in-one SQL baseline to running the migration scripts directly, matching apps/api's real migration path.

  3. Pinned typeorm — pinned typeorm to 0.3.28 to avoid the orderBy + limit regression in 0.3.29.

… closed)

The local-dev PostHog bootstrap (forcing the 6 UI/route-gating flags to true
when PostHog isn't configured) could fail OPEN in any prod/staging deploy that
ships without POSTHOG_API_KEY: sst.config.ts injects the key only conditionally,
and an absent key made the API expose gated routes (POST /regions, GET /runners)
and the dashboard render gated surfaces (create-sandbox, infrastructure, etc.)
that should stay hidden.

Gate the bootstrap behind a hard production check at all three spots:
- app.module.ts: skip passing bootstrapFlags when NODE_ENV==='production'
- openfeature-posthog.provider.ts: defense-in-depth — refuse bootstrapFlags in
  production so the flag-resolution boundary itself can't fail open regardless
  of how it's wired; unconfigured prod falls through to each call-site's
  defaultValue (false)
- PostHogProviderWrapper.tsx: in a prod build (import.meta.env.PROD) with no
  PostHog config, render children without a provider so flags resolve falsy

infra-local is unaffected: NODE_ENV=development / vite dev (PROD=false) still
get the bootstrap. Verified the running stack — dashboard still renders Create
Sandbox after the gate; the three files type-check clean.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Copilot Bot review requested due to automatic review settings June 9, 2026 04:29

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 40 out of 44 changed files in this pull request and generated 4 comments.

Comments suppressed due to low confidence (4)

apps/infra-local/scripts/stack-up.sh:1

  • xargs -r is GNU-specific and isn’t supported by macOS/BSD xargs, which will print an error (and can break the UX of these scripts). Replace the pipeline with a bash-level empty-check (capture PIDs from lsof -ti, then kill only if non-empty), or use a portable xargs pattern that won’t error on empty input.
    apps/infra-local/scripts/stack-up.sh:1
  • xargs -r is GNU-specific and isn’t supported by macOS/BSD xargs, which will print an error (and can break the UX of these scripts). Replace the pipeline with a bash-level empty-check (capture PIDs from lsof -ti, then kill only if non-empty), or use a portable xargs pattern that won’t error on empty input.
    apps/infra-local/scripts/stack-up.sh:1
  • xargs -r is GNU-specific and isn’t supported by macOS/BSD xargs, which will print an error (and can break the UX of these scripts). Replace the pipeline with a bash-level empty-check (capture PIDs from lsof -ti, then kill only if non-empty), or use a portable xargs pattern that won’t error on empty input.
    apps/infra-local/scripts/stack-up.sh:1
  • xargs -r is GNU-specific and isn’t supported by macOS/BSD xargs, which will print an error (and can break the UX of these scripts). Replace the pipeline with a bash-level empty-check (capture PIDs from lsof -ti, then kill only if non-empty), or use a portable xargs pattern that won’t error on empty input.

Comment on lines +20 to +23
if [ "$1" = "all" ]; then
# `tail -f` multiple files at once; each line gets ==> filename <== headers.
exec tail -F "${LOGS_DIR}"/*.log
fi
}}
>
{children}
</PostHogProvider>
Comment on lines 171 to 178
Comment thread apps/infra-local/Makefile
Comment on lines +79 to +80
stack-reset-hard: ## like stack-reset, but also re-applies prod schema baseline
./scripts/stack-reset.sh --hard
lile and others added 3 commits June 9, 2026 16:53
Fresh clones have no apps/api/.env (gitignored — prod holds real secrets
there), so stack-up.sh's unconditional `apps/.env -> api/.env` symlink
dangled and `. ./.env` failed, leaving the L2 API unable to start.

Ship configs/api.env (all deterministic local-stack placeholders, no real
secrets) and seed apps/api/.env from it on first stack-up when missing,
without clobbering a dev's edits.
…y-required

Verified end-to-end that infra-local deploys without it: cold stack-up
→ 10/10 L1 boxes healthy → API healthy → seed completes (snapshot
active; seed-init-data.sh seeds via psql, not the flag-gated routes)
→ POST /api/sandbox → microVM started. The control plane never
consults the feature flags — only flag-gated dashboard UI and admin
routes do.

Dropped:
- openfeature-posthog.provider.ts bootstrap support (reverted to main)
  and its spec, which only tested the bootstrap gate
- the bootstrapFlags injection block in app.module.ts
- the dashboard PostHogProviderWrapper local-dev defaults (reverted to
  main)

Kept (deliberately not reverted):
- the DB pool statement_timeout/query_timeout wiring in app.module.ts +
  configuration.ts — consumed by stack-up.sh's DB_*_TIMEOUT_MS exports
  (microVM↔pg wedge fix)
- apps/package.json typeorm 0.3.28 pin (quoted-orderBy 0.3.29 trigger
  still live in sandbox.manager.ts) and @types/node-forge

Behavior change in local dev (documented in infra-local-usage.md):
without a configured PostHog, flag-gated surfaces fail closed — the
dashboard "Create Sandbox" button and admin routes like POST
/api/regions stay hidden. API sandbox creation is unaffected.
Copilot Bot review requested due to automatic review settings June 10, 2026 05:26
@DorianZheng DorianZheng requested a review from a team as a code owner June 10, 2026 05:26

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copilot was unable to review this pull request because the user who requested the review is ineligible. To be eligible to request a review, you need a paid Copilot license, or your organization must enable Copilot code review.

Reverts configuration.ts + app.module.ts to their main (merge-base
e4bbd0e) versions and removes the now-stranded DB_QUERY_TIMEOUT_MS /
DB_STATEMENT_TIMEOUT_MS exports from stack-up.sh. apps/api now differs
from main only by the webpack/tsconfig build fixes — verified earlier
today by a full cold deploy on exactly this app-file state (10/10 L1
healthy → API healthy → seed → sandbox started).

Accepted trade-off (dev-only, documented in services.py): without the
client-side query_timeout, a query wedged on the microVM↔pg transport
can hold its API pool slot until `make stack-restart COMPONENTS=api`.
The server-side caps stay in the pg box config (statement_timeout,
idle_in_transaction_session_timeout, tcp_keepalives), which reap the
orphaned backend. Root cause is the transport silently wedging a
stream — to be filed against src/; the env knobs can return as a
standalone main PR if prod wants them as general hygiene.
Copilot Bot review requested due to automatic review settings June 10, 2026 05:52

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copilot was unable to review this pull request because the user who requested the review is ineligible. To be eligible to request a review, you need a paid Copilot license, or your organization must enable Copilot code review.

…ags instead

Reverts apps/api/webpack.config.js to its main (merge-base e4bbd0e)
version. The local-dev plugin filtering (GeneratePackageJsonPlugin +
ForkTsCheckerWebpackPlugin by constructor-name string match) is
replaced by the executors' own documented options, passed at the
single local call site in stack-up.sh:

  nx serve api --buildTargetOptions.generatePackageJson=false \
               --buildTargetOptions.skipTypeChecking=true

(@nx/js:node serve forwards buildTargetOptions to the
@nx/webpack:webpack build target; the JSON-string form fails Nx's
schema validation, so the dotted form is used.) CI/prod builds keep
both plugins — this is invocation-scoped, not config-scoped.

CONNECTIONS.md and infra-local-usage.md updated so no doc instructs a
bare `nx serve api`, which would crash in this workspace (yarn4 +
gitignored lockfile leaves npm deps out of the Nx project graph).

Verified: nx build api green with flags on the reverted config; API
relaunched via the exact new invocation → /api/health 200; sandbox
created on it → state `started` in ~5s. apps/api now differs from
main only by the 2-line tsconfig.json extends fix.
Moves docs/apps/infra-local-usage.md into apps/infra-local/README.md as
a "Usage guide" section so the component's docs live with the component
(alongside CONNECTIONS.md). Content moved verbatim; headings demoted one
level (code fences untouched), the one relative link rewritten for the
new location, and the README's three links to the old path turned into
internal anchors. No remaining references to the old path.
Copilot Bot review requested due to automatic review settings June 10, 2026 07:56

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copilot was unable to review this pull request because the user who requested the review is ineligible. To be eligible to request a review, you need a paid Copilot license, or your organization must enable Copilot code review.

Signed-off-by: dorianzheng <8065637+DorianZheng@users.noreply.github.com>
DorianZheng
DorianZheng previously approved these changes Jun 10, 2026
Main's new api-client-drift workflow runs `yarn install --immutable`
against the newly-committed apps/yarn.lock; the PR's merge-ref failed
YN0028 because the branch's typeorm 0.3.28 pin + @types/node-forge
weren't in that lock, and main's tar ^7.5.16 bump conflicted textually
with the adjacent pin line.

Manual content (everything else is main's side verbatim):
- apps/package.json: union resolution — main's tar ^7.5.16 + the
  branch's typeorm 0.3.28 pin (0.3.29 quoted-orderBy regression guard)
- apps/yarn.lock: regenerated from main's lock with the merged manifest
  (+24/-14: typeorm 0.3.28 + @types/node-forge entries, stale
  descriptors pruned)

Verified: `yarn install --immutable` passes on the merged tree (the
exact failed CI check); `nx build api --skip-nx-cache` green.
…l dependsOn

serve's defaultConfiguration has pointed at api:build:development since
the workspace import without that configuration existing. Add it
(generatePackageJson=false — GeneratePackageJsonPlugin crashes in this
workspace; skipTypeChecking=true — pre-existing type-only OTLP lockfile
skew), and remove serve's dependsOn:["build"]: @nx/js:node builds its
buildTarget itself, and the dependsOn forced a second bare-config build
that hit the GeneratePackageJsonPlugin crash whenever the Nx graph was
fresh (the main merge invalidated the stale cache that had been masking
it; configurations don't propagate to plain-string dependsOn tasks, so
the stack-up.sh flags could not reach it).

Production-safe: apps/api/Dockerfile runs `nx build api
--configuration=production` and never invokes serve; no path values
touched. Verified: plain `nx serve api` on the merged tree reaches
/api/health 200 (crashed before); POST /api/box creates a box.
Copilot Bot review requested due to automatic review settings June 10, 2026 08:48

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copilot was unable to review this pull request because the user who requested the review is ineligible. To be eligible to request a review, you need a paid Copilot license, or your organization must enable Copilot code review.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants