feat(infra-local): BoxLite-based local dev stack (L1 + L2 + M5 native runner)#595
Conversation
There was a problem hiding this comment.
Pull request overview
Note
Copilot was unable to run its full agentic suite in this review.
This PR introduces a BoxLite-dogfooded apps/infra-local/ stack (Python orchestrator + shell wrappers) and documentation for a fully self-hosted local control plane, alongside several fixes to unblock local Apple Silicon development (runner image arch selection, local PostHog flag bootstrapping, etc.).
Changes:
- Add
apps/infra-local/Python orchestrator (boxlite_local) with unit/integration tests, plus L2 wrapper scripts (stack-*) and schema loader tooling. - Add extensive infra-local milestone and usage/status docs plus connection references.
- Adjust local-dev behavior in runner (platform selection) and dashboard/API (PostHog bootstrap flags + MSW local config).
Reviewed changes
Copilot reviewed 50 out of 54 changed files in this pull request and generated 9 comments.
Show a summary per file
| File | Description |
|---|---|
| tsconfig.base.json | Adds a top-level tsconfig reference (currently appears as a path placeholder). |
| libs | Adds a top-level libs reference (currently appears as a path placeholder). |
| docs/apps/milestones/2026-05-25-milestone-infra-local-v0.1.0.md | Milestone snapshot + architecture + operations overview. |
| docs/apps/infra-vs-local-infra.md | Rationale/comparison doc updated with “what shipped” notes. |
| docs/apps/infra-local-usage.md | Day-to-day usage guide and troubleshooting. |
| docs/apps/infra-local-status.md | Current state inventory of local stack. |
| apps/runner/pkg/boxlite/registry.go | Switch snapshot pull platform arch from hardcoded amd64 to host arch. |
| apps/package.json | Adds forge deps/types, swc-loader, bumps @swc/core, adds resolutions. |
| apps/infra-local/tests/unit/test_topo.py | Unit tests for topo sorting. |
| apps/infra-local/tests/unit/test_orchestrator.py | Unit tests for orchestrator helper behavior. |
| apps/infra-local/tests/unit/test_doctor_lsof.py | Unit tests for lsof parsing and ownership rules. |
| apps/infra-local/tests/unit/test_config.py | Unit tests for InfraConfig defaults and env overrides. |
| apps/infra-local/tests/integration/test_multi_service.py | Integration smoke test for multi-service bring-up. |
| apps/infra-local/tests/integration/test_e2e_full.py | Full E2E protocol-level integration suite. |
| apps/infra-local/sql/REFRESH.md | Procedure to refresh prod schema baseline. |
| apps/infra-local/scripts/stack-up.sh | Starts L1 + L2 components and seeds init data. |
| apps/infra-local/scripts/stack-status.sh | One-screen L1/L2 status output. |
| apps/infra-local/scripts/stack-restart.sh | Restarts L2 components (rebuilds runner). |
| apps/infra-local/scripts/stack-reset.sh | Soft/hard/nuke reset workflows. |
| apps/infra-local/scripts/stack-logs.sh | Tail component logs. |
| apps/infra-local/scripts/stack-down.sh | Stops L2 components and optionally L1. |
| apps/infra-local/scripts/stack-build.sh | Builds native runner/proxy binaries and installs node deps. |
| apps/infra-local/scripts/seed-init-data.sh | Waits for API seeding + default snapshot readiness. |
| apps/infra-local/scripts/apply-schema.sh | Loads schema-baseline.sql into local postgres. |
| apps/infra-local/scripts/_stack-common.sh | Shared helpers/paths/ports for stack scripts. |
| apps/infra-local/pyproject.toml | Defines boxlite_local Python package + test extras. |
| apps/infra-local/goal.md | Original infra-local goal note (currently non-English). |
| apps/infra-local/configs/minio/init.sh | MinIO init helper script. |
| apps/infra-local/boxlite_local/types.py | Dataclasses/types for orchestrator and doctor. |
| apps/infra-local/boxlite_local/services.py | Declarative ServiceSpec registry for L1 services. |
| apps/infra-local/boxlite_local/orchestrator.py | Topo sort, up/down/ps, health checks, SDK workarounds. |
| apps/infra-local/boxlite_local/execwrap.py | Helper to collect exec stdout/stderr and exit code. |
| apps/infra-local/boxlite_local/doctor.py | Preflight checks (SDK, runtime, port conflicts). |
| apps/infra-local/boxlite_local/config.py | InfraConfig + env loading and derived properties. |
| apps/infra-local/boxlite_local/cli.py | argparse CLI for python -m boxlite_local. |
| apps/infra-local/boxlite_local/main.py | Module entrypoint. |
| apps/infra-local/boxlite_local/init.py | Package metadata/version. |
| apps/infra-local/README.md | Primary infra-local documentation and workflows. |
| apps/infra-local/Makefile | Make targets wrapping Python CLI + L2 stack scripts. |
| apps/infra-local/CONNECTIONS.md | Endpoints/credentials/env var reference. |
| apps/go.work | Expands go.work workspace list (adds proxy, reorders entries). |
| apps/dashboard/src/mocks/handlers.ts | Adds local-dev MSW fixtures + local /config response. |
| apps/dashboard/src/components/PostHogProviderWrapper.tsx | Adds local-dev bootstrap feature flags fallback. |
| apps/api/src/common/providers/openfeature-posthog.provider.ts | Adds bootstrapFlags fallback when PostHog not configured. |
| apps/api/src/auth/jwt.strategy.ts | Passes default region id when auto-creating personal org. |
| apps/api/src/app.module.ts | Wires bootstrapFlags into OpenFeature PostHog provider config. |
| apps/api/project.json | Disables generatePackageJson for API build. |
| CLAUDE.md | Adds “Committed documentation MUST be in English” rule. |
| .gitignore | Ignores infra-local logs/env/symlinks and local artifacts. |
Comments suppressed due to low confidence (4)
tsconfig.base.json:1
- This file appears to contain only a path string, which is not valid
tsconfigJSON and will break TypeScript tooling that expects a JSON object. If the intent is to referenceapps/tsconfig.base.json, commit this as a symlink (so Git records it as a link) or replace the file contents with a realtsconfigthat usesextendsto point at the apps config.
libs:1 - This looks like it was intended to be a directory alias/symlink, but as committed it’s a regular file containing a path string. That will break any build tooling that expects
libs/to be a directory. Commit a real symlink (recorded by Git) or remove this file and update references to point at the correct directory.
apps/infra-local/scripts/stack-up.sh:1 - On macOS, BSD
xargsdoes not support-r, so this command won’t reliably kill the prior listener; it can leave the port occupied and cause the subsequent start to fail with EADDRINUSE. Remove-r(since the port is known to be listening here) or replace this with a macOS-safe loop over PIDs.
apps/runner/pkg/boxlite/registry.go:1 - The variable name
linuxAmd64Platformis now misleading because it no longer forcesamd64. Rename it to something likelinuxPlatform/defaultPullPlatformto avoid confusion when reading call sites.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| maxAutoArchiveInterval: 43200, | ||
| maintananceMode: false, | ||
| environment: 'local', | ||
| billingApiUrl: BILLING_API_URL, | ||
| }) | ||
| } as Partial<BoxliteConfiguration>) |
| // defaults above. All network behavior disabled: no /decide call (we | ||
| // pre-load flags via bootstrap), no event capture, no autocapture. | ||
| return ( | ||
| <PostHogProvider | ||
| apiKey="phc_local_dev_no_op" | ||
| options={{ | ||
| // api_host is required by posthog-js but never used because | ||
| // capturing is opted out and flags are bootstrapped below. | ||
| api_host: 'https://localhost.invalid', | ||
| bootstrap: { featureFlags: LOCAL_DEV_FEATURE_FLAG_DEFAULTS }, | ||
| advanced_disable_feature_flags: true, // skip /decide network call |
|
|
||
| exporters: | ||
| debug: | ||
| verbosity: basic |
There was a problem hiding this comment.
Otel collector is wired but doesn't feed Jaeger — local Jaeger UI will be empty.
Prod's OtelCollector (apps/infra/sst.config.ts:438) is BoxLite's custom build with the ClickHouse exporter. Here the local collector uses upstream otel/opentelemetry-collector:latest with exporters: debug only — traces land on stdout, never reach SPEC_JAEGER.
Two clean fixes, pick one:
- (a) Wire it through: add
otlpexporter pointing atjaeger:4317and add jaeger to thepipelines.traces.exporterslist. Local Jaeger UI becomes useful. - (b) Drop
SPEC_OTELentirely: Jaeger 1.67 accepts OTLP natively (COLLECTOR_OTLP_ENABLED=trueis already set onSPEC_JAEGER) — have api/runner OTLP straight at jaeger and remove the collector hop.
Right now we ship both services but they don't talk, which is worse than shipping neither. Worth fixing in this PR or immediate follow-up.
There was a problem hiding this comment.
Fixed in 01850227 — went with option (a), keeping the collector in the path for prod parity.
_otel_confignow adds anotlp/jaegerexporter and puts it on the traces pipeline alongsidedebug(metrics/logs stay debug-only — Jaeger doesn't ingest them).- Jaeger's OTLP gRPC receiver is now host-mapped (
26687→ box:4317); the collector reaches it via host-as-hub (host.boxlite.internal:26687).COLLECTOR_OTLP_ENABLED=truewas already set. SPEC_OTELnowdepends_on=["jaeger"].
Verified end-to-end on a live stack: POST an OTLP/HTTP trace to the collector (:24318) → the Jaeger query API returns the service + trace within ~2s, so the Jaeger UI is no longer empty. Docs updated (README §Known-limitations #2, CONNECTIONS §6/§8). The only remaining gap vs prod is the custom boxlite_exporter build (ClickHouse / api push-back), still noted as a limitation.
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 50 out of 54 changed files in this pull request and generated 8 comments.
Comments suppressed due to low confidence (4)
apps/runner/pkg/boxlite/registry.go:1
linuxAmd64Platformno longer represents amd64 specifically (it’s now host-arch dependent). Rename to something likelinuxHostPlatform/defaultLinuxPlatformto avoid misleading future readers and call sites.
apps/package.json:1- Using a caret range in
resolutionscan reduce determinism (you may get different transitive trees over time). Prefer pinning@types/expressto an exact version here (and ensure the lockfile resolves consistently) to avoid unexpected type changes across installs.
tsconfig.base.json:1 - This file content is not valid JSON unless it’s being committed as a git symlink target. Ensure
tsconfig.base.jsonis actually a symlink (mode 120000) toapps/tsconfig.base.json; otherwise TypeScript tooling will fail to parse it.
docs/apps/infra-vs-local-infra.md:1 - These port numbers conflict with the rest of the infra-local docs/scripts in this PR (API is started on :3001; dashboard on :3000). Update the table to reflect the actual local ports so readers don’t follow incorrect endpoints.
| defaultSnapshot: 'ubuntu:22.04', | ||
| dashboardUrl: 'http://localhost:3000', | ||
| maxAutoArchiveInterval: 43200, | ||
| maintananceMode: false, |
| http.get(`${API_URL}/*`, async ({ request }) => { | ||
| console.log('[MSW catch-all GET]', request.url) | ||
| if (request.url.includes('paginated')) { | ||
| return HttpResponse.json({ items: [], totalItems: 0, totalPages: 0 }) | ||
| } | ||
| return HttpResponse.json([]) | ||
| }), | ||
|
|
||
| // Catch-all POST/PUT/PATCH/DELETE — return {} so mutations don't error. | ||
| http.post(`${API_URL}/*`, async ({ request }) => { | ||
| console.log('[MSW catch-all POST]', request.url) | ||
| return HttpResponse.json({}) | ||
| }), | ||
| http.put(`${API_URL}/*`, async ({ request }) => { | ||
| console.log('[MSW catch-all PUT]', request.url) | ||
| return HttpResponse.json({}) | ||
| }), | ||
| http.patch(`${API_URL}/*`, async ({ request }) => { | ||
| console.log('[MSW catch-all PATCH]', request.url) | ||
| return HttpResponse.json({}) | ||
| }), | ||
| http.delete(`${API_URL}/*`, async ({ request }) => { | ||
| console.log('[MSW catch-all DELETE]', request.url) | ||
| return HttpResponse.json({}) | ||
| }), |
| <PostHogProvider | ||
| apiKey="phc_local_dev_no_op" | ||
| options={{ | ||
| // api_host is required by posthog-js but never used because | ||
| // capturing is opted out and flags are bootstrapped below. | ||
| api_host: 'https://localhost.invalid', | ||
| bootstrap: { featureFlags: LOCAL_DEV_FEATURE_FLAG_DEFAULTS }, | ||
| advanced_disable_feature_flags: true, // skip /decide network call | ||
| autocapture: false, | ||
| capture_pageview: false, | ||
| capture_pageleave: false, | ||
| opt_out_capturing_by_default: true, | ||
| disable_session_recording: true, | ||
| loaded: (posthog) => posthog.opt_out_capturing(), | ||
| }} |
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 50 out of 54 changed files in this pull request and generated 4 comments.
Comments suppressed due to low confidence (5)
tsconfig.base.json:1
- This file content is not valid JSON. If the intent is to make the repo-root
tsconfig.base.jsona symlink toapps/tsconfig.base.json, ensure it is committed as a git symlink (mode 120000). If it lands as a regular file, TypeScript tooling readingtsconfig.base.jsonwill fail to parse it.
libs:1 - Like
tsconfig.base.json, this looks like it’s meant to be a symlink. If it is committed as a normal file containing the textapps/libs, any tooling expectinglibs/to be a directory (or a workspace path) will break. Ensure this is a git symlink (mode 120000) rather than a regular file.
apps/infra-local/scripts/stack-up.sh:1 - On macOS/BSD,
xargsdoes not support-r(GNU-only), so this will error and can breakset -eflows. Replace the pipeline with a macOS-compatible pattern (e.g., capture PIDs and conditionallykill, or use anxargsinvocation that does not rely on-r). Apply the same fix to the otherxargs -roccurrences in this script.
apps/runner/pkg/boxlite/registry.go:1 - The variable name
linuxAmd64Platformis now inaccurate since it can bearm64(or others). Rename it to reflect the new behavior (e.g.,linuxDefaultPlatform,linuxHostArchPlatform, etc.) to avoid confusing future readers.
apps/package.json:1 - As of Aug 2025,
@swc/coreversions were not in the1.15.xrange;^1.15.33may be a non-existent release and would break installs. Please confirm the intended@swc/coreversion exists in npm (and align it with the repo’s SWC toolchain expectations).
| defaultSnapshot: 'ubuntu:22.04', | ||
| dashboardUrl: 'http://localhost:3000', | ||
| maxAutoArchiveInterval: 43200, | ||
| maintananceMode: false, |
| const KEY = '__msw_api_keys__' | ||
| const load = (): ApiKey[] => { | ||
| try { return JSON.parse(sessionStorage.getItem(KEY) ?? '[]') } catch { return [] } | ||
| } | ||
| const save = (s: ApiKey[]) => sessionStorage.setItem(KEY, JSON.stringify(s)) |
| <PostHogProvider | ||
| apiKey="phc_local_dev_no_op" | ||
| options={{ | ||
| // api_host is required by posthog-js but never used because | ||
| // capturing is opted out and flags are bootstrapped below. | ||
| api_host: 'https://localhost.invalid', | ||
| bootstrap: { featureFlags: LOCAL_DEV_FEATURE_FLAG_DEFAULTS }, | ||
| advanced_disable_feature_flags: true, // skip /decide network call | ||
| autocapture: false, | ||
| capture_pageview: false, | ||
| capture_pageleave: false, | ||
| opt_out_capturing_by_default: true, | ||
| disable_session_recording: true, | ||
| loaded: (posthog) => posthog.opt_out_capturing(), | ||
| }} | ||
| > |
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 50 out of 54 changed files in this pull request and generated 5 comments.
Comments suppressed due to low confidence (4)
apps/infra-local/scripts/stack-up.sh:1
xargs -ris not supported by macOS/BSDxargs, so this command will fail on the target platform. Replace with a portable pattern (e.g., capture PIDs into a variable and only callkillwhen non-empty, or usexargs kill -9guarded byif/grep -q .). Apply the same fix to the runner/proxy/dashboard blocks that usexargs -r.
apps/runner/pkg/boxlite/registry.go:1- The variable name
linuxAmd64Platformis now misleading because it can resolve to non-amd64 architectures. Rename it to something accurate (e.g.,linuxHostPlatformordefaultLinuxPlatform) to avoid confusion and incorrect future usage.
apps/package.json:1 - Using a caret range in
resolutionscan reduce determinism (the resolved version may vary over time and across tooling). Consider pinning@types/expressto an exact version to keep installs reproducible and avoid surprise type changes.
tsconfig.base.json:1 - This appears to be a Git symlink (stored as a file containing the target path). On platforms/environments that don’t support symlinks (or where Git symlinks are disabled, common on Windows), this will become a plain text file and break TypeScript tooling expecting valid JSON at
tsconfig.base.json. If Windows support is needed, consider a real JSON file thatextendsapps/tsconfig.base.jsoninstead of a symlink.
| defaultSnapshot: 'ubuntu:22.04', | ||
| dashboardUrl: 'http://localhost:3000', | ||
| maxAutoArchiveInterval: 43200, | ||
| maintananceMode: false, |
| defaultRegionId: 'local', | ||
| defaultRegion: 'local', | ||
| role: 'OWNER', | ||
| } |
| userId: _USER_ID, | ||
| email: 'admin@boxlite.dev', | ||
| name: 'Local Admin', | ||
| role: 'owner', |
| return HttpResponse.json({ | ||
| id: _USER_ID, | ||
| email: 'admin@boxlite.dev', | ||
| name: 'Local Admin', | ||
| role: 'OWNER', | ||
| personalOrganizationId: _ORG_ID, | ||
| }) |
| <PostHogProvider | ||
| apiKey="phc_local_dev_no_op" | ||
| options={{ | ||
| // api_host is required by posthog-js but never used because | ||
| // capturing is opted out and flags are bootstrapped below. | ||
| api_host: 'https://localhost.invalid', | ||
| bootstrap: { featureFlags: LOCAL_DEV_FEATURE_FLAG_DEFAULTS }, | ||
| advanced_disable_feature_flags: true, // skip /decide network call | ||
| autocapture: false, | ||
| capture_pageview: false, | ||
| capture_pageleave: false, | ||
| opt_out_capturing_by_default: true, | ||
| disable_session_recording: true, | ||
| loaded: (posthog) => posthog.opt_out_capturing(), | ||
| }} | ||
| > |
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 49 out of 53 changed files in this pull request and generated 3 comments.
Comments suppressed due to low confidence (1)
apps/runner/pkg/boxlite/registry.go:1
- The variable name
linuxAmd64Platformis now misleading becauseArchitectureis derived fromruntime.GOARCH(oftenarm64on Apple Silicon). Rename it to something architecture-agnostic (e.g.,linuxDefaultPlatform/linuxHostPlatform) to avoid confusion and prevent future misuse.
| defaultSnapshot: 'ubuntu:22.04', | ||
| dashboardUrl: 'http://localhost:3000', | ||
| maxAutoArchiveInterval: 43200, | ||
| maintananceMode: false, | ||
| environment: 'local', | ||
| billingApiUrl: BILLING_API_URL, |
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 49 out of 53 changed files in this pull request and generated 3 comments.
Comments suppressed due to low confidence (4)
tsconfig.base.json:1
tsconfig.base.jsonmust be valid JSON for TypeScript/Nx tooling; the current content is a plain path string and will cause parsing failures. If the intent is to shareapps/tsconfig.base.json, either commit a real symlink (so the repo roottsconfig.base.jsonresolves to the apps file) or replace this with a minimal JSON file thatextendsthe apps base config.
libs:1- A root-level
libsentry is expected to be a directory (or a symlink to a directory) in most workspaces; committing it as a plain file containing a path will break module resolution and filesystem expectations. If this is intended to be a symlink toapps/libs, commit it as an actual symlink (or adjust workspace configuration so code referencesapps/libsdirectly).
apps/runner/pkg/boxlite/registry.go:1 - The identifier
linuxAmd64Platformis now misleading because it no longer hardcodesamd64and can evaluate toarm64, etc. Rename it to something architecture-agnostic (e.g.,linuxHostPlatform/defaultLinuxPlatform) to avoid incorrect assumptions elsewhere.
apps/package.json:1 - The specified
node-forgeversion^1.4.0may not exist on npm (as of common published versions,node-forgeis typically1.3.x). If this version is not actually published, installs will fail; please verify the published version and pin to an existing release (or update to the correct package/version if a different fork is intended).
| maxAutoArchiveInterval: 43200, | ||
| maintananceMode: false, | ||
| environment: 'local', |
| bootstrap: { featureFlags: LOCAL_DEV_FEATURE_FLAG_DEFAULTS }, | ||
| advanced_disable_feature_flags: true, // skip /decide network call |
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 49 out of 53 changed files in this pull request and generated 3 comments.
Comments suppressed due to low confidence (6)
apps/runner/pkg/boxlite/registry.go:1
- The variable name
linuxAmd64Platformis now misleading since it can bearm64(or other). Rename to something architecture-neutral (e.g.linuxHostPlatform/defaultLinuxPlatform) so callers don’t infer it is always amd64.
tsconfig.base.json:1 - This appears to be a committed symlink (Git stores symlinks as a file whose contents is the target path). Symlinks can be problematic on Windows and in some packaging/CI environments; consider replacing this with a real
tsconfig.base.jsonthat usesextendsto referenceapps/tsconfig.base.json, or document/enforce symlink support in contributing/CI.
libs:1 - Similar to
tsconfig.base.json, this looks like a committed symlink toapps/libs. If cross-platform support matters, consider using an actual directory or a tooling-level path mapping instead of a symlink, or ensure CI and team dev environments can reliably handle symlinks.
apps/package.json:1 - Based on versions available up to my knowledge cutoff (Aug 2025),
node-forge@^1.4.0and@swc/core@^1.15.33look potentially non-existent or at least unusually high jumps. Please verify these versions exist on npm (and that the lockfile/tooling supports them); if not, pin to a published version or adjust the range to the intended release line.
apps/package.json:1 - Based on versions available up to my knowledge cutoff (Aug 2025),
node-forge@^1.4.0and@swc/core@^1.15.33look potentially non-existent or at least unusually high jumps. Please verify these versions exist on npm (and that the lockfile/tooling supports them); if not, pin to a published version or adjust the range to the intended release line.
apps/package.json:1 - Using a caret range in
resolutionscan undermine determinism and make installs vary across time. Prefer pinning an exact version inresolutions(and relying on the lockfile) unless there’s a specific reason to allow upgrades here.
| ts: TopologicalSorter[str] = TopologicalSorter() | ||
| for name, spec in services.items(): | ||
| ts.add(name, *spec.depends_on) | ||
| ts.prepare() | ||
| layers: list[list[str]] = [] | ||
| while ts.is_active(): | ||
| layer = sorted(ts.get_ready()) | ||
| if not layer: | ||
| break | ||
| layers.append(layer) | ||
| for name in layer: | ||
| ts.done(name) | ||
| return layers |
| defaultSnapshot: 'ubuntu:22.04', | ||
| dashboardUrl: 'http://localhost:3000', | ||
| maxAutoArchiveInterval: 43200, | ||
| maintananceMode: false, | ||
| environment: 'local', | ||
| billingApiUrl: BILLING_API_URL, |
| "main": "apps/api/src/main.ts", | ||
| "tsConfig": "apps/api/tsconfig.app.json", | ||
| "generatePackageJson": true, | ||
| "generatePackageJson": false, |
There was a problem hiding this comment.
Reverted in 3424fb25 — back to generatePackageJson: true. The local stack runs the API via nx serve api (not nx build), so this prod-build option was never needed locally; disabling it only risked the deploy artifact. Resolving.
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 49 out of 53 changed files in this pull request and generated 3 comments.
Comments suppressed due to low confidence (4)
tsconfig.base.json:1
tsconfig.base.jsonis not valid JSON as committed (it contains only a path). If this is intended to be a symlink, ensure it’s committed as a symlink in git (mode 120000); otherwise TypeScript tooling will fail when reading the root tsconfig. If symlinks are not desired/portable, replace this with an actual JSON tsconfig that extends./apps/tsconfig.base.json.
libs:1- This file content looks like a symlink target (and as a regular file it’s not a valid directory). If the intent is a repository-root
libssymlink toapps/libs, make sure it’s committed as a symlink; otherwise consumers expecting alibs/directory at repo root will break.
apps/runner/pkg/boxlite/registry.go:1 - The variable name
linuxAmd64Platformis now misleading because it no longer hardcodesamd64(it usesruntime.GOARCH). Rename it to reflect the new semantics (e.g.,linuxHostPlatform,linuxDefaultPlatform, orlinuxPlatformForHostArch) to avoid confusing future readers and callers.
apps/package.json:1 - As of my knowledge cutoff (Aug 2025),
node-forge’s latest published version is1.3.1and^1.4.0is likely not resolvable from npm. Please verify the intended version exists; otherwise installs will fail in CI. If the goal is simply to addnode-forge, consider pinning to^1.3.1(or whatever the current published latest is) and updating@types/node-forgeaccordingly.
| <PostHogProvider | ||
| apiKey="phc_local_dev_no_op" | ||
| options={{ | ||
| // api_host is required by posthog-js but never used because | ||
| // capturing is opted out and flags are bootstrapped below. | ||
| api_host: 'https://localhost.invalid', | ||
| bootstrap: { featureFlags: LOCAL_DEV_FEATURE_FLAG_DEFAULTS }, | ||
| advanced_disable_feature_flags: true, // skip /decide network call | ||
| autocapture: false, | ||
| capture_pageview: false, | ||
| capture_pageleave: false, | ||
| opt_out_capturing_by_default: true, | ||
| disable_session_recording: true, | ||
| loaded: (posthog) => posthog.opt_out_capturing(), | ||
| }} |
| proxyToolboxUrl: 'http://localhost:28080', | ||
| defaultSnapshot: 'ubuntu:22.04', | ||
| dashboardUrl: 'http://localhost:3000', | ||
| maxAutoArchiveInterval: 43200, | ||
| maintananceMode: false, | ||
| environment: 'local', | ||
| billingApiUrl: BILLING_API_URL, | ||
| }) | ||
| } as Partial<BoxliteConfiguration>) |
| // Bootstrap-aware (mirror evaluateFlag's behavior for object resolution). | ||
| if (!this.isConfigured || !this.client) { | ||
| // Object flags are rare in our usage; not added to bootstrapFlags map | ||
| // (typed as boolean | string | number). Fall through to default. | ||
| logger.debug(`PostHog not configured, returning default value for flag ${flagKey}`) | ||
| return { | ||
| value: defaultValue, |
| // organization.defaultRegionId is set silently fails — leaving | ||
| // the new user with no Personal org and the dashboard with | ||
| // "Cannot read properties of undefined (reading 'id')". | ||
| personalOrganizationDefaultRegionId: this.configService.getOrThrow('defaultRegion.id'), |
There was a problem hiding this comment.
Fair — out of scope for infra-local. Reverted in 3424fb25. It's a real fix (OIDC-login auto-created users get a Personal org with defaultRegionId=undefined, mirroring what app.service.ts already does for the seed path), but it belongs in its own PR rather than bundled here. I'll open a standalone PR for it. Resolving.
There was a problem hiding this comment.
Standalone PR opened: #668 (with a reproducer test, two-sided verified).
| @@ -0,0 +1,217 @@ | |||
| # Milestone: `milestone/infra-local/v0.1.0` | |||
There was a problem hiding this comment.
do not file development documents not related to the repo
There was a problem hiding this comment.
Agreed. Removed the development-journal docs in 3424fb25 — milestone snapshot, infra-local-status.md, and own-dog-food-local-infra-solution.md are gone (history lives in git), and the README/usage links that pointed at them are fixed. Kept only the operational README.md / CONNECTIONS.md / infra-local-usage.md. Resolving.
|
|
||
| - [docs/development/rust-style.md](./docs/development/rust-style.md) | ||
|
|
||
| ## Documentation Language |
There was a problem hiding this comment.
Removed in 3424fb25 — CLAUDE.md is reverted to main, so this PR no longer touches it. Resolving.
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
📝 WalkthroughWalkthroughAdds a BoxLite-based local infra stack, local feature-flag bootstrap fallbacks for API/dashboard, Python orchestrator and shell wrappers for L1/L2, comprehensive integration and unit tests, workspace/build adjustments, and operational docs for running the local development environment. ChangesFeature-flag bootstrap for local development
BoxLite local infrastructure orchestration
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes Possibly related PRs
Suggested reviewers
Poem
✨ Finishing Touches🧪 Generate unit tests (beta)
|
There was a problem hiding this comment.
Actionable comments posted: 14
Note
Due to the large number of review comments, Critical, Major severity comments were prioritized as inline comments.
🟡 Minor comments (11)
docs/apps/infra-local-usage.md-230-233 (1)
230-233:⚠️ Potential issue | 🟡 Minor | ⚡ Quick winAdd a language identifier to the fenced code block in Section 6.1.
Line 230 opens a fenced block without a language, which triggers markdownlint rule MD040.
💡 Proposed fix
-``` +```text http://localhost:3000 → pick admin / user to log in (dex static users)</details> <details> <summary>🤖 Prompt for AI Agents</summary>Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.In
@docs/apps/infra-local-usage.mdaround lines 230 - 233, The fenced code block
in Section 6.1 is missing a language identifier which triggers markdownlint
MD040; update the backticks that start the block around the line containing
"http://localhost:3000 → pick admin / user to log in (dex static users)" to
include a language hint such as text (e.g., changetotext) so the fenced
block is properly annotated.</details> <!-- cr-comment:v1:b8aae239d297f09d95853ce4 --> _Source: Linters/SAST tools_ </blockquote></details> <details> <summary>docs/apps/infra-local-usage.md-16-31 (1)</summary><blockquote> `16-31`: _⚠️ Potential issue_ | _🟡 Minor_ | _⚡ Quick win_ **Use one component-selector variable consistently in examples.** Line 16 uses `COMPONENT=...`, while Line 31 documents `COMPONENTS=` as the selector and later examples also mix forms. This is copy/paste-prone and can break command usage. <details> <summary>💡 Proposed fix</summary> ```diff -make stack-logs COMPONENT=api +make stack-logs COMPONENTS=api @@ -make stack-logs COMPONENT=runner # single component -make stack-logs COMPONENT=all # everything +make stack-logs COMPONENTS=runner # single component +make stack-logs COMPONENTS=all # everything ``` </details> Also applies to: 215-217 <details> <summary>🤖 Prompt for AI Agents</summary> ``` Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@docs/apps/infra-local-usage.md` around lines 16 - 31, The docs use two different component selector variables (`COMPONENT=` vs `COMPONENTS=`) which is inconsistent and error-prone; pick one name and update all examples to use it consistently (e.g., replace `COMPONENT=api` with `COMPONENTS=api` and ensure `make stack-restart`/`make stack-logs`/`make stack-down` examples and the explanatory sentence all reference the same `COMPONENTS=` variable, including the other occurrences noted around lines 215-217); also ensure the parenthetical "(empty = all)" matches the chosen variable name. ``` </details> <!-- cr-comment:v1:cc43f5ae1876e8bcec1e75ad --> </blockquote></details> <details> <summary>docs/apps/infra-local-usage.md-127-132 (1)</summary><blockquote> `127-132`: _⚠️ Potential issue_ | _🟡 Minor_ | _⚡ Quick win_ **Avoid direct `go build` in this workflow section; use make wrapper commands.** This section bypasses the documented stack wrappers and can drift from repo-standard flags/paths. Prefer `make stack-restart COMPONENTS=runner` (or documented wrapper equivalents) for consistency with the rest of this guide. Based on learnings: "Use `make` targets for build, test, lint, format, setup, and distribution. Do not run `cargo`, `npm`, `python`, `go`, or `cbindgen` directly when a make target exists." <details> <summary>🤖 Prompt for AI Agents</summary> ``` Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@docs/apps/infra-local-usage.md` around lines 127 - 132, Replace the direct `go build` workflow snippet that kills and rebuilds the runner with the repo-standard make wrapper; update the block that currently shows `pkill -9 -f boxlite-runner` and `cd apps/runner && go build -o /tmp/boxlite-runner ./cmd/runner && cd -` to use the documented make target (e.g., `make stack-restart COMPONENTS=runner`) so builds and restarts go through the repository's standardized wrappers and flags; ensure the surrounding text references the status doc cheatsheet as before. ``` </details> <!-- cr-comment:v1:1688ac1e148725377598f58f --> _Source: Learnings_ </blockquote></details> <details> <summary>apps/infra-local/boxlite_local/config.py-10-15 (1)</summary><blockquote> `10-15`: _⚠️ Potential issue_ | _🟡 Minor_ | _⚡ Quick win_ **Validate port env vars with bounds (1-65535) at load time.** Current parsing accepts numeric but invalid ports (`0`, negatives, `>65535`), so failures are deferred to runtime networking paths instead of failing fast at config boundary. <details> <summary>Proposed fix</summary> ```diff -def _parse_int_env(name: str, default: str) -> int: +def _parse_int_env( + name: str, + default: str, + *, + min_value: int | None = None, + max_value: int | None = None, +) -> int: raw = os.environ.get(name, default) try: - return int(raw) + value = int(raw) except ValueError as e: raise ValueError(f"{name} must be an integer, got: {raw!r}") from e + if min_value is not None and value < min_value: + raise ValueError(f"{name} must be >= {min_value}, got: {value}") + if max_value is not None and value > max_value: + raise ValueError(f"{name} must be <= {max_value}, got: {value}") + return value ``` ```diff - pg_host_port=_parse_int_env("BOXLITE_PG_HOST_PORT", "25432"), + pg_host_port=_parse_int_env("BOXLITE_PG_HOST_PORT", "25432", min_value=1, max_value=65535), ... - redis_host_port=_parse_int_env("BOXLITE_REDIS_HOST_PORT", "26379"), + redis_host_port=_parse_int_env("BOXLITE_REDIS_HOST_PORT", "26379", min_value=1, max_value=65535), ``` (Apply the same bounds to all `*_PORT` fields.) </details> Based on learnings: “Validate inputs at the boundary. Untrusted inputs should be checked where they enter” and “Use explicit error handling: fail fast on missing config/invalid inputs.” Also applies to: 80-99 <details> <summary>🤖 Prompt for AI Agents</summary> ``` Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@apps/infra-local/boxlite_local/config.py` around lines 10 - 15, The _parse_int_env function currently converts env values to int but doesn't validate port ranges, allowing invalid ports (0, negative, >65535); update _parse_int_env to check that the parsed integer is within 1..65535 and raise a ValueError with a clear message like "{name} must be a valid port between 1 and 65535, got: {raw!r}" when out of range, and ensure every configuration entry that uses _parse_int_env for any *_PORT setting (all occurrences of parsing port envs) uses this tightened validation so invalid ports fail fast at load time. ``` </details> <!-- cr-comment:v1:1ce598fc6d1ee05a46d81046 --> _Source: Learnings_ </blockquote></details> <details> <summary>apps/infra-local/CONNECTIONS.md-40-42 (1)</summary><blockquote> `40-42`: _⚠️ Potential issue_ | _🟡 Minor_ | _⚡ Quick win_ **Add language identifiers to unlabeled fenced code blocks (MD040).** These blocks should use explicit fence languages (e.g., `text`) to satisfy markdownlint. Also applies to: 45-47, 81-83, 86-88, 138-140, 144-146, 189-191 <details> <summary>🤖 Prompt for AI Agents</summary>Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.In
@apps/infra-local/CONNECTIONS.mdaround lines 40 - 42, Several fenced code
blocks in CONNECTIONS.md are unlabeled which triggers markdownlint MD040; update
each unlabeled triple-backtick fence (including the shown postgres URI block and
the other occurrences at the commented ranges) to include a language identifier
such as "text" (e.g., changetotext) so the code blocks are explicitly
labeled; search for similar unlabeled fences in CONNECTIONS.md and add the
language identifier consistently.</details> <!-- cr-comment:v1:c44a6a8ecfcd36ecff6e5c03 --> _Source: Linters/SAST tools_ </blockquote></details> <details> <summary>apps/infra-local/README.md-226-230 (1)</summary><blockquote> `226-230`: _⚠️ Potential issue_ | _🟡 Minor_ | _⚡ Quick win_ **Add fence languages to satisfy MD040 markdownlint warnings.** The fenced blocks here should specify a language (e.g., `text`) to keep docs lint-clean. <details> <summary>Suggested patch</summary> ```diff - ``` + ```text 127.0.0.1 pgadmin.boxlite.test 127.0.0.1 jaeger.boxlite.test # ... etc ``` @@ -``` +```text apps/infra-local/ ├── Makefile # convenience wrappers (L1 + stack-* L2) ... └── tests/ ... -``` +``` ``` </details> Also applies to: 276-325 <details> <summary>🤖 Prompt for AI Agents</summary>Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.In
@apps/infra-local/README.mdaround lines 226 - 230, The Markdown fenced code
blocks that show the hosts entries (the block containing "127.0.0.1
pgadmin.boxlite.test" / "127.0.0.1 jaeger.boxlite.test") and the directory tree
block starting with "apps/infra-local/" (and the other similar fenced blocks in
the same README section) must specify a fence language to satisfy MD040; update
each opening fence fromtotext so the hosts block and the
directory-tree/other blocks include the language identifier while leaving the
block contents unchanged.</details> <!-- cr-comment:v1:8d82a48e39c68ffaa7c88307 --> _Source: Linters/SAST tools_ </blockquote></details> <details> <summary>apps/infra-local/CONNECTIONS.md-116-121 (1)</summary><blockquote> `116-121`: _⚠️ Potential issue_ | _🟡 Minor_ | _⚡ Quick win_ **Fix the `aws` example command syntax.** `AWS_ACCESS_KEY_ID` / `AWS_SECRET_ACCESS_KEY` must prefix the command, not trail after `s3 ls`; the current snippet is not a valid invocation. <details> <summary>Suggested patch</summary> ```diff -aws --endpoint-url http://127.0.0.1:29000 \ - --region us-east-1 \ - s3 ls \ - AWS_ACCESS_KEY_ID=minioadmin AWS_SECRET_ACCESS_KEY=minioadmin +AWS_ACCESS_KEY_ID=minioadmin AWS_SECRET_ACCESS_KEY=minioadmin \ +aws --endpoint-url http://127.0.0.1:29000 \ + --region us-east-1 \ + s3 ls ``` </details> <details> <summary>🤖 Prompt for AI Agents</summary> ``` Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@apps/infra-local/CONNECTIONS.md` around lines 116 - 121, The aws CLI example is invalid because the environment variables are placed after the command; put AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY before the `aws --endpoint-url ... s3 ls` invocation so they prefix the command (e.g. export or inline before `aws --endpoint-url ... s3 ls`), ensuring the credentials are applied to the `aws` call rather than trailing after `s3 ls`. ``` </details> <!-- cr-comment:v1:e504f0fb613d6d0ef1972b84 --> </blockquote></details> <details> <summary>apps/infra-local/scripts/build-all-in-one-sql.py-1027-1029 (1)</summary><blockquote> `1027-1029`: _⚠️ Potential issue_ | _🟡 Minor_ **Narrow `except BaseException` to `except Exception` in `apps/infra-local/scripts/build-all-in-one-sql.py`** This loop intentionally collects per-migration failures and continues, but the file has no other handling for `KeyboardInterrupt`/`SystemExit`, so catching `BaseException` can swallow normal abort semantics. <details> <summary>Suggested patch</summary> ```diff - except BaseException as exc: # noqa: BLE001 + except Exception as exc: failures.append((path, exc)) continue ``` </details> <details> <summary>🤖 Prompt for AI Agents</summary> ``` Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@apps/infra-local/scripts/build-all-in-one-sql.py` around lines 1027 - 1029, Replace the overly broad except BaseException in the migration loop with except Exception so system-exiting signals aren't swallowed; specifically update the except clause that currently does "except BaseException as exc:" (the block that appends to the failures list using failures.append((path, exc)) and continues) to "except Exception as exc:" to preserve KeyboardInterrupt/SystemExit behavior while still collecting migration errors. ``` </details> <!-- cr-comment:v1:8f9a13e2225d3558e1432c3b --> _Source: Learnings_ </blockquote></details> <details> <summary>apps/infra-local/scripts/seed-init-data.sh-125-129 (1)</summary><blockquote> `125-129`: _⚠️ Potential issue_ | _🟡 Minor_ | _⚡ Quick win_ **Flag parsing is order-dependent and can ignore `--no-bounce`.** `--no-bounce` is only honored when it is the first argument. Example: `seed-init-data.sh --no-wait --no-bounce` still bounces API. <details> <summary>Proposed fix</summary> ```diff -if [ "${1:-}" = "--no-bounce" ]; then - log "skipping API bounce (--no-bounce); assuming API just woke + is running its boot cycle" -else - bounce_api_if_running -fi +NO_BOUNCE=false +NO_WAIT=false +for arg in "$@"; do + case "$arg" in + --no-bounce) NO_BOUNCE=true ;; + --no-wait) NO_WAIT=true ;; + *) die "unknown argument: $arg" ;; + esac +done + +if [ "$NO_BOUNCE" = "true" ]; then + log "skipping API bounce (--no-bounce); assuming API just woke + is running its boot cycle" +else + bounce_api_if_running +fi @@ -if [ "${1:-}" = "--no-wait" ] || [ "${2:-}" = "--no-wait" ]; then +if [ "$NO_WAIT" = "true" ]; then ``` </details> Also applies to: 144-144 <details> <summary>🤖 Prompt for AI Agents</summary> ``` Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@apps/infra-local/scripts/seed-init-data.sh` around lines 125 - 129, The script currently checks only the first positional parameter for "--no-bounce", so flags like "--no-wait --no-bounce" are ignored; update the check in seed-init-data.sh to scan all args (e.g., loop over "$@" or use a case pattern) and set a boolean if any argument equals "--no-bounce", then use that boolean to decide whether to call bounce_api_if_running; apply the same change to the other occurrence referenced (around the second check) so both places reliably honor --no-bounce regardless of argument order. ``` </details> <!-- cr-comment:v1:f99316165ca0615131728a6f --> </blockquote></details> <details> <summary>apps/infra-local/scripts/_stack-common.sh-55-69 (1)</summary><blockquote> `55-69`: _⚠️ Potential issue_ | _🟡 Minor_ | _⚡ Quick win_ **Validate PID ownership before declaring a component “running”.** Line 63 treats any live PID from the pidfile as valid. If the PID was recycled, `stack-up` can incorrectly skip startup for a component that is actually down. Please verify the PID belongs to the expected process (or pair liveness with the expected listening port) before returning it. <details> <summary>🤖 Prompt for AI Agents</summary> ``` Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@apps/infra-local/scripts/_stack-common.sh` around lines 55 - 69, component_pid currently trusts any live PID from pid_file as the component owner; update it to validate ownership before returning the PID by (1) reading the pid from pid_file (existing pid="$(cat "$pf")"), (2) checking /proc/$pid/cmdline or /proc/$pid/exe (or lsof/ss on the expected listening port) to confirm the process command or listening socket matches the expected component identity, and only then echo the pid; if validation fails, remove the stale pidfile (rm -f "$pf") and return empty. Use the same pid_file and component_pid symbols so stack-up will skip only when a confirmed matching process is running. ``` </details> <!-- cr-comment:v1:c3a3e693dfa2b71cbcc731d7 --> </blockquote></details> <details> <summary>apps/infra-local/scripts/stack-logs.sh-20-23 (1)</summary><blockquote> `20-23`: _⚠️ Potential issue_ | _🟡 Minor_ | _⚡ Quick win_ **Handle `all` mode when no log files exist.** On Line 22, `tail -F "${LOGS_DIR}"/*.log` fails when there are no `*.log` files yet. Add a guard so `stack-logs.sh all` returns a clear message instead of an immediate error. <details> <summary>Proposed fix</summary> ```diff if [ "$1" = "all" ]; then # `tail -f` multiple files at once; each line gets ==> filename <== headers. - exec tail -F "${LOGS_DIR}"/*.log + compgen -G "${LOGS_DIR}/*.log" > /dev/null || die "no log files yet in ${LOGS_DIR}" + exec tail -F "${LOGS_DIR}"/*.log fi ``` </details> <details> <summary>🤖 Prompt for AI Agents</summary> ``` Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@apps/infra-local/scripts/stack-logs.sh` around lines 20 - 23, The "all" branch in stack-logs.sh currently runs tail -F "${LOGS_DIR}"/*.log which fails if no log files exist; update the if-block that handles "$1" = "all" to first check whether any files match the pattern (e.g. test for the presence of "${LOGS_DIR}"/*.log or use a glob check) and if none are found print a clear message like "No log files found in $LOGS_DIR" and exit non‑error (or exit 0) instead of invoking tail; keep using LOGS_DIR and the existing tail -F "${LOGS_DIR}"/*.log when matches are present. ``` </details> <!-- cr-comment:v1:d9245aab9f1ed445a331b136 --> </blockquote></details> </blockquote></details> <details> <summary>🧹 Nitpick comments (4)</summary><blockquote> <details> <summary>apps/package.json (1)</summary><blockquote> `116-116`: _⚡ Quick win_ **Move `@types/node-forge` to `devDependencies`.** This is a type-only package and should not ship in runtime dependencies. <details> <summary>♻️ Proposed fix</summary> ```diff - "`@types/node-forge`": "^1.3.14", "algoliasearch": "^5.30.0", ``` ```diff "devDependencies": { + "`@types/node-forge`": "^1.3.14", "`@babel/preset-env`": "^7.22.0", ``` </details> <details> <summary>🤖 Prompt for AI Agents</summary> ``` Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@apps/package.json` at line 116, The dependency "`@types/node-forge`" is a type-only package and should be moved from dependencies to devDependencies in package.json; open the file, remove the "`@types/node-forge`" entry from the top-level "dependencies" object and add the same entry (with the same version "^1.3.14") under "devDependencies", then run npm/yarn install (or update lockfile) to ensure the lockfile reflects the change and that no runtime code imports this types-only package. ``` </details> <!-- cr-comment:v1:7f8b203f55849fba42213e45 --> </blockquote></details> <details> <summary>apps/infra-local/tests/unit/test_doctor_lsof.py (1)</summary><blockquote> `6-47`: _⚡ Quick win_ **Add focused unit tests for `check_port_free` subprocess branches.** Please add monkeypatched tests for: `lsof` missing, non-zero + stderr, and timeout/error branch. These are the high-risk paths that gate `stack-up`. Based on learnings: Add or update tests when behavior changes around branching, retries, security checks, or boundaries. <details> <summary>🤖 Prompt for AI Agents</summary> ``` Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@apps/infra-local/tests/unit/test_doctor_lsof.py` around lines 6 - 47, Add unit tests that monkeypatch subprocess.run for the check_port_free code paths in boxlite_local.doctor: simulate (1) FileNotFoundError to test the "lsof missing" branch, (2) a CompletedProcess with returncode != 0 and non-empty stderr to test the non-zero+stderr branch, and (3) a subprocess.TimeoutExpired (or other exception) to exercise the timeout/error branch; place these tests alongside the existing _parse_lsof_F tests (e.g., in apps/infra-local/tests/unit/test_doctor_lsof.py), use monkeypatch to replace subprocess.run or the function that wraps it, and assert the expected return values or error handling behavior of check_port_free (refer to the check_port_free function and any helper that invokes subprocess.run). ``` </details> <!-- cr-comment:v1:23093557339b5987261aed8d --> _Source: Learnings_ </blockquote></details> <details> <summary>apps/infra-local/README.md (1)</summary><blockquote> `348-349`: _⚡ Quick win_ **Use the Make target for status instead of direct Python invocation.** Replace `python -m boxlite_local ps` with `make ps` to stay consistent with the project workflow contract. Based on learnings: "Use make targets for build, test, lint, format, setup, and distribution. Do not run ... python directly when a make target exists." <details> <summary>Suggested patch</summary> ```diff -**Debug a stuck service:** `make stack-status` first → identify the red -component → use the lightest possible cleanup. `python -m boxlite_local ps` +**Debug a stuck service:** `make stack-status` first → identify the red +component → use the lightest possible cleanup. `make ps` shows L1 box state; `boxlite logs boxlite-local-<name>` shows guest logs; `make stack-logs COMPONENT=<name>` tails L2 logs from `apps/infra-local/.logs/`. ``` </details> <details> <summary>🤖 Prompt for AI Agents</summary> ``` Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@apps/infra-local/README.md` around lines 348 - 349, Replace the direct Python invocation "python -m boxlite_local ps" with the Make target "make ps" wherever it appears in documentation or examples (e.g., the README snippet referencing L1 box state); update the command examples and any surrounding text so they instruct developers to run "make ps" for status, preserving existing notes about "boxlite logs boxlite-local-<name>" and ensuring formatting/inline code backticks match the rest of the doc. ``` </details> <!-- cr-comment:v1:101a04bdc495abb15a195e22 --> _Source: Learnings_ </blockquote></details> <details> <summary>apps/infra-local/Makefile (1)</summary><blockquote> `89-91`: _⚡ Quick win_ **Use `$(PY)` in `stack-rebuild-l1-box` for interpreter consistency.** Line 91 hardcodes `python`, bypassing the Makefile’s configured interpreter variable (`PY`). <details> <summary>Proposed fix</summary> ```diff - python -m boxlite_local up $(BOX) + $(PY) -m boxlite_local up $(BOX) ``` </details> <details> <summary>🤖 Prompt for AI Agents</summary> ``` Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@apps/infra-local/Makefile` around lines 89 - 91, The rule for stack-rebuild-l1-box hardcodes the interpreter by calling "python -m boxlite_local up $(BOX)"; change it to use the Makefile's interpreter variable by replacing that invocation with "$(PY) -m boxlite_local up $(BOX)" so the target (stack-rebuild-l1-box) respects the configured PY value and interpreter consistency across the Makefile. ``` </details> <!-- cr-comment:v1:ab7e5bf33feef3ca68e3ae15 --> </blockquote></details> </blockquote></details> <details> <summary>🤖 Prompt for all review comments with AI agents</summary>Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.Inline comments:
In@apps/api/src/sandbox/managers/sandbox.manager.ts:
- Line 259: The ORDER BY clauses are using an unquoted mixed-case identifier
which PostgreSQL folds to lowercase; update the .orderBy calls that reference
activity.lastActivityAt (e.g., the occurrence in sandbox_last_activity and the
other .orderBy at the later query) to use a quoted identifier matching the WHERE
clauses — change the argument from 'activity.lastActivityAt' to
'activity."lastActivityAt"' in both places so it matches
activity."lastActivityAt" used elsewhere.In
@apps/infra-local/boxlite_local/cli.py:
- Around line 30-35: The CLI currently lets p_up and p_down accept any service
names and silently ignore typos; update the argument parsing to validate the
"services" values against the canonical SERVICES list at parse-time and raise a
parser error for any unknown names. Change the add_argument call for "services"
(both p_up and p_down) to use a custom type or argparse Action that checks each
value is in SERVICES (or use choices=SERVICES) so argparse fails fast; ensure
the same validation logic is applied to the other parser(s) referenced around
lines 75-79 so all entry points reject invalid service names immediately.In
@apps/infra-local/boxlite_local/doctor.py:
- Around line 121-126: The lsof call in check_port_free() can hang; add a
timeout (e.g. timeout=5) to the subprocess.run(...) invocation and catch
subprocess.TimeoutExpired around that call (the call in check_port_free used by
doctor() and orchestrator.up()). On TimeoutExpired, log or surface a clear error
via the same logger used by doctor() / orchestrator (or return False) so the
preflight fails fast instead of blocking stack-up; ensure the handler preserves
existing behavior for other non-zero results.In
@apps/infra-local/boxlite_local/execwrap.py:
- Around line 34-39: exec_collect currently awaits execution.wait() with no
bound; add a timeout parameter to exec_collect (e.g., timeout: Optional[float] =
None), pass it through to the SDK call that creates the execution (box.exec(...,
timeout=timeout)) and also guard the final wait with
asyncio.wait_for(execution.wait(), timeout) so the underlying process is killed
on timeout (catch asyncio.TimeoutError, call execution.kill() or
execution.terminate() and await execution.wait() to ensure cleanup, then
re-raise or return a clear timeout result). Update callers such as
_wait_one_shot_exit to pass an appropriate timeout (or None) when invoking
exec_collect and keep _wait_healthy_exec behavior unchanged if it already wraps
exec_collect in asyncio.wait_for.In
@apps/infra-local/boxlite_local/orchestrator.py:
- Around line 228-231: The try/except around runtime.get currently swallows all
exceptions; change it to only treat a "box not found" case as idempotent and let
other errors surface: catch the specific not-found exception type thrown by your
runtime client (or inspect the exception for a not-found indicator) when calling
runtime.get and return False only in that branch, but for any other exception
re-raise (or log and raise) so transport/runtime failures aren’t masked; update
the try/except that wraps runtime.get (in the down/remove flow) accordingly and
ensure unexpected exceptions are not silently suppressed.In
@apps/infra-local/boxlite_local/services.py:
- Line 62: Replace floating Docker tags in services.py so images are pinned to
immutable versions/digests: update "minio/minio:latest", "minio/mc:latest",
"joxit/docker-registry-ui:main", and "otel/opentelemetry-collector:latest" to
specific semver tags or sha256 digests (preferred). Locate these image strings
in apps/infra-local/boxlite_local/services.py and change them to explicit
versions (or image@sha256:...) and run your local stack-up to verify startup;
ensure any README or docs referencing these images are updated accordingly.In
@apps/infra-local/scripts/seed-init-data.sh:
- Line 147: The script currently treats wait_for_default_snapshot failures as
non-fatal by calling warn and continuing to print ok "init data ready"; change
this so snapshot readiness failures cause the script to fail fast: replace the
warn fallback for wait_for_default_snapshot with a non-zero exit (e.g., call die
or exit 1) or otherwise return a failing status before the ok "init data ready"
message so downstream automation receives a non-zero exit; locate the call to
wait_for_default_snapshot and the later ok "init data ready" message and ensure
the failure path prevents reaching that ok and returns an error.In
@apps/infra-local/scripts/stack-up.sh:
- Around line 74-76: The current use of ln -sf to create "${APPS_DIR}/.env" and
"${APPS_DIR}/apps" will clobber existing non-symlink files; update the script to
first test each target: if the path does not exist create the symlink with ln
-s, if it exists and is a symlink leave it alone, and if it exists but is not a
symlink print a clear error and exit non-zero so we don't overwrite developer
files; apply this logic for the ".env" and "apps" targets (references: APPS_DIR,
"${APPS_DIR}/.env", "${APPS_DIR}/apps", and the ln -s invocation).- Around line 85-88: The current port cleanup blindly runs lsof/... | xargs kill
-9 which can terminate unrelated processes; update the port cleanup in the
blocks that call port_listening and warn to: prefer finding stack-owned PID(s)
first (check a corresponding pidfile if one exists, or filter lsof/ps output by
a known command pattern for this stack using pgrep -f or lsof -c), send SIGTERM
to those PIDs, wait a short period (e.g. poll for port closure up to a few
seconds), and only use SIGKILL (kill -9) as a fallback if the port is still
held; replace direct lsof ... | xargs -r kill -9 with a safe function that
accepts the port (used where port_listening is called) and implements
pidfile/command validation, TERM then KILL fallback, and preserves the existing
logging via warn.In
@apps/infra-local/sql/merged-schema.auto-gen.sql:
- Line 10: The merged schema enables "uuid-ossp" but never enables "pgcrypto",
so calls to gen_random_uuid() (referenced in the diff as gen_random_uuid()) will
fail on a fresh DB; update the schema generation step to emit CREATE EXTENSION
IF NOT EXISTS "pgcrypto"; ensure this CREATE EXTENSION is placed before any use
of gen_random_uuid() in the generated SQL (keep the existing CREATE EXTENSION IF
NOT EXISTS "uuid-ossp" as well) and modify the generator/template that produces
merged-schema.auto-gen.sql so future regenerations include the pgcrypto
extension.In
@apps/infra-local/tests/integration/test_e2e_full.py:
- Around line 83-90: The fixture currently calls asyncio.run(setup()) outside
the try, so if setup raises (including pytest.skip) the finally cleanup
(asyncio.run(teardown()), mp.undo(), shutil.rmtree(tmp)) never runs; change the
fixture to call setup() inside a try block and track whether setup completed
(e.g., setup_done flag set to True after asyncio.run(setup()) returns) then in
the finally always perform mp.undo() and shutil.rmtree(tmp, ignore_errors=True)
and only call asyncio.run(teardown()) if setup_done (or ensure teardown is safe
to call unconditionally); ensure pytest.skip exceptions are allowed to propagate
but cleanup still runs.In
@apps/package.json:
- Around line 343-345: The package.json currently forces a mismatched typing by
keeping "express" at ^5.2.0 while "resolutions" pins "@types/express" to
^4.17.25; update the resolutions (or remove it) so "@types/express" aligns with
Express 5 (e.g., change the resolution to@types/express@^5.x or delete the
override), then run install and TypeScript build to ensure no type conflicts in
NestJS 11; locate the mismatch in the "resolutions" block and adjust or remove
the "@types/express" entry accordingly.
Minor comments:
In@apps/infra-local/boxlite_local/config.py:
- Around line 10-15: The _parse_int_env function currently converts env values
to int but doesn't validate port ranges, allowing invalid ports (0, negative,65535); update _parse_int_env to check that the parsed integer is within
1..65535 and raise a ValueError with a clear message like "{name} must be a
valid port between 1 and 65535, got: {raw!r}" when out of range, and ensure
every configuration entry that uses _parse_int_env for any *_PORT setting (all
occurrences of parsing port envs) uses this tightened validation so invalid
ports fail fast at load time.In
@apps/infra-local/CONNECTIONS.md:
- Around line 40-42: Several fenced code blocks in CONNECTIONS.md are unlabeled
which triggers markdownlint MD040; update each unlabeled triple-backtick fence
(including the shown postgres URI block and the other occurrences at the
commented ranges) to include a language identifier such as "text" (e.g., change
totext) so the code blocks are explicitly labeled; search for similar
unlabeled fences in CONNECTIONS.md and add the language identifier consistently.- Around line 116-121: The aws CLI example is invalid because the environment
variables are placed after the command; put AWS_ACCESS_KEY_ID and
AWS_SECRET_ACCESS_KEY before theaws --endpoint-url ... s3 lsinvocation so
they prefix the command (e.g. export or inline beforeaws --endpoint-url ... s3 ls), ensuring the credentials are applied to theawscall rather than
trailing afters3 ls.In
@apps/infra-local/README.md:
- Around line 226-230: The Markdown fenced code blocks that show the hosts
entries (the block containing "127.0.0.1 pgadmin.boxlite.test" / "127.0.0.1
jaeger.boxlite.test") and the directory tree block starting with
"apps/infra-local/" (and the other similar fenced blocks in the same README
section) must specify a fence language to satisfy MD040; update each opening
fence fromtotext so the hosts block and the directory-tree/other blocks
include the language identifier while leaving the block contents unchanged.In
@apps/infra-local/scripts/_stack-common.sh:
- Around line 55-69: component_pid currently trusts any live PID from pid_file
as the component owner; update it to validate ownership before returning the PID
by (1) reading the pid from pid_file (existing pid="$(cat "$pf")"), (2) checking
/proc/$pid/cmdline or /proc/$pid/exe (or lsof/ss on the expected listening port)
to confirm the process command or listening socket matches the expected
component identity, and only then echo the pid; if validation fails, remove the
stale pidfile (rm -f "$pf") and return empty. Use the same pid_file and
component_pid symbols so stack-up will skip only when a confirmed matching
process is running.In
@apps/infra-local/scripts/build-all-in-one-sql.py:
- Around line 1027-1029: Replace the overly broad except BaseException in the
migration loop with except Exception so system-exiting signals aren't swallowed;
specifically update the except clause that currently does "except BaseException
as exc:" (the block that appends to the failures list using
failures.append((path, exc)) and continues) to "except Exception as exc:" to
preserve KeyboardInterrupt/SystemExit behavior while still collecting migration
errors.In
@apps/infra-local/scripts/seed-init-data.sh:
- Around line 125-129: The script currently checks only the first positional
parameter for "--no-bounce", so flags like "--no-wait --no-bounce" are ignored;
update the check in seed-init-data.sh to scan all args (e.g., loop over "$@" or
use a case pattern) and set a boolean if any argument equals "--no-bounce", then
use that boolean to decide whether to call bounce_api_if_running; apply the same
change to the other occurrence referenced (around the second check) so both
places reliably honor --no-bounce regardless of argument order.In
@apps/infra-local/scripts/stack-logs.sh:
- Around line 20-23: The "all" branch in stack-logs.sh currently runs tail -F
"${LOGS_DIR}"/.log which fails if no log files exist; update the if-block that
handles "$1" = "all" to first check whether any files match the pattern (e.g.
test for the presence of "${LOGS_DIR}"/.log or use a glob check) and if none
are found print a clear message like "No log files found in $LOGS_DIR" and exit
non‑error (or exit 0) instead of invoking tail; keep using LOGS_DIR and the
existing tail -F "${LOGS_DIR}"/*.log when matches are present.In
@docs/apps/infra-local-usage.md:
- Around line 230-233: The fenced code block in Section 6.1 is missing a
language identifier which triggers markdownlint MD040; update the backticks that
start the block around the line containing "http://localhost:3000 → pick admin
/ user to log in (dex static users)" to include a language hint such as text
(e.g., changetotext) so the fenced block is properly annotated.- Around line 16-31: The docs use two different component selector variables
(COMPONENT=vsCOMPONENTS=) which is inconsistent and error-prone; pick one
name and update all examples to use it consistently (e.g., replace
COMPONENT=apiwithCOMPONENTS=apiand ensuremake stack-restart/make stack-logs/make stack-downexamples and the explanatory sentence all
reference the sameCOMPONENTS=variable, including the other occurrences noted
around lines 215-217); also ensure the parenthetical "(empty = all)" matches the
chosen variable name.- Around line 127-132: Replace the direct
go buildworkflow snippet that kills
and rebuilds the runner with the repo-standard make wrapper; update the block
that currently showspkill -9 -f boxlite-runnerandcd apps/runner && go build -o /tmp/boxlite-runner ./cmd/runner && cd -to use the documented make
target (e.g.,make stack-restart COMPONENTS=runner) so builds and restarts go
through the repository's standardized wrappers and flags; ensure the surrounding
text references the status doc cheatsheet as before.
Nitpick comments:
In@apps/infra-local/Makefile:
- Around line 89-91: The rule for stack-rebuild-l1-box hardcodes the interpreter
by calling "python -m boxlite_local up $(BOX)"; change it to use the Makefile's
interpreter variable by replacing that invocation with "$(PY) -m boxlite_local
up $(BOX)" so the target (stack-rebuild-l1-box) respects the configured PY value
and interpreter consistency across the Makefile.In
@apps/infra-local/README.md:
- Around line 348-349: Replace the direct Python invocation "python -m
boxlite_local ps" with the Make target "make ps" wherever it appears in
documentation or examples (e.g., the README snippet referencing L1 box state);
update the command examples and any surrounding text so they instruct developers
to run "make ps" for status, preserving existing notes about "boxlite logs
boxlite-local-" and ensuring formatting/inline code backticks match the
rest of the doc.In
@apps/infra-local/tests/unit/test_doctor_lsof.py:
- Around line 6-47: Add unit tests that monkeypatch subprocess.run for the
check_port_free code paths in boxlite_local.doctor: simulate (1)
FileNotFoundError to test the "lsof missing" branch, (2) a CompletedProcess with
returncode != 0 and non-empty stderr to test the non-zero+stderr branch, and (3)
a subprocess.TimeoutExpired (or other exception) to exercise the timeout/error
branch; place these tests alongside the existing _parse_lsof_F tests (e.g., in
apps/infra-local/tests/unit/test_doctor_lsof.py), use monkeypatch to replace
subprocess.run or the function that wraps it, and assert the expected return
values or error handling behavior of check_port_free (refer to the
check_port_free function and any helper that invokes subprocess.run).In
@apps/package.json:
- Line 116: The dependency "
@types/node-forge" is a type-only package and should
be moved from dependencies to devDependencies in package.json; open the file,
remove the "@types/node-forge" entry from the top-level "dependencies" object
and add the same entry (with the same version "^1.3.14") under
"devDependencies", then run npm/yarn install (or update lockfile) to ensure the
lockfile reflects the change and that no runtime code imports this types-only
package.</details> <details> <summary>🪄 Autofix (Beta)</summary> Fix all unresolved CodeRabbit comments on this PR: - [ ] <!-- {"checkboxId": "4b0d0e0a-96d7-4f10-b296-3a18ea78f0b9"} --> Push a commit to this branch (recommended) - [ ] <!-- {"checkboxId": "ff5b1114-7d8c-49e6-8ac1-43f82af23a33"} --> Create a new PR with the fixes </details> --- <details> <summary>ℹ️ Review info</summary> <details> <summary>⚙️ Run configuration</summary> **Configuration used**: defaults **Review profile**: CHILL **Plan**: Pro Plus **Run ID**: `92dacb5e-5744-4885-be73-16a14db0f4fc` </details> <details> <summary>📥 Commits</summary> Reviewing files that changed from the base of the PR and between 6048e91e6c98afca85a5ca176cd2b1ffb174241d and 3424fb25eddedb742c6e9c0f07d50280d286896e. </details> <details> <summary>📒 Files selected for processing (49)</summary> * `.gitignore` * `CLAUDE.md` * `apps/api/src/app.module.ts` * `apps/api/src/common/providers/openfeature-posthog.provider.ts` * `apps/api/src/sandbox/managers/sandbox.manager.ts` * `apps/api/tsconfig.json` * `apps/dashboard/src/components/PostHogProviderWrapper.tsx` * `apps/dashboard/src/mocks/handlers.ts` * `apps/infra-local/CONNECTIONS.md` * `apps/infra-local/Makefile` * `apps/infra-local/README.md` * `apps/infra-local/boxlite_local/__init__.py` * `apps/infra-local/boxlite_local/__main__.py` * `apps/infra-local/boxlite_local/cli.py` * `apps/infra-local/boxlite_local/config.py` * `apps/infra-local/boxlite_local/doctor.py` * `apps/infra-local/boxlite_local/execwrap.py` * `apps/infra-local/boxlite_local/orchestrator.py` * `apps/infra-local/boxlite_local/services.py` * `apps/infra-local/boxlite_local/types.py` * `apps/infra-local/configs/minio/init.sh` * `apps/infra-local/goal.md` * `apps/infra-local/pyproject.toml` * `apps/infra-local/scripts/_stack-common.sh` * `apps/infra-local/scripts/apply-schema.sh` * `apps/infra-local/scripts/build-all-in-one-sql.py` * `apps/infra-local/scripts/seed-init-data.sh` * `apps/infra-local/scripts/stack-build.sh` * `apps/infra-local/scripts/stack-down.sh` * `apps/infra-local/scripts/stack-logs.sh` * `apps/infra-local/scripts/stack-reset.sh` * `apps/infra-local/scripts/stack-restart.sh` * `apps/infra-local/scripts/stack-status.sh` * `apps/infra-local/scripts/stack-up.sh` * `apps/infra-local/sql/merged-schema.auto-gen.sql` * `apps/infra-local/tests/__init__.py` * `apps/infra-local/tests/integration/__init__.py` * `apps/infra-local/tests/integration/test_e2e_full.py` * `apps/infra-local/tests/integration/test_multi_service.py` * `apps/infra-local/tests/unit/__init__.py` * `apps/infra-local/tests/unit/test_config.py` * `apps/infra-local/tests/unit/test_doctor_lsof.py` * `apps/infra-local/tests/unit/test_orchestrator.py` * `apps/infra-local/tests/unit/test_topo.py` * `apps/package.json` * `apps/runner/pkg/boxlite/registry.go` * `docs/apps/infra-local-usage.md` * `libs` * `tsconfig.base.json` </details> </details> <!-- This is an auto-generated comment by CodeRabbit for review status -->
| p_up.add_argument("services", nargs="*", help="Subset of services (default: all)") | ||
| p_up.add_argument("--skip-doctor", action="store_true", help="Bypass preflight checks") | ||
|
|
||
| p_down = sub.add_parser("down", help="Stop + remove services.") | ||
| p_down.add_argument("services", nargs="*", help="Subset of services (default: all)") | ||
| p_down.add_argument("--wipe", action="store_true", help="Also remove the data dir") |
There was a problem hiding this comment.
Reject unknown service names at parse-time instead of silently succeeding.
up/down currently accept arbitrary names; typos can return success while doing nothing. Validate against SERVICES in argparse so bad input fails immediately.
Proposed fix
- p_up.add_argument("services", nargs="*", help="Subset of services (default: all)")
+ p_up.add_argument(
+ "services",
+ nargs="*",
+ choices=sorted(SERVICES.keys()),
+ help="Subset of services (default: all)",
+ )
...
- p_down.add_argument("services", nargs="*", help="Subset of services (default: all)")
+ p_down.add_argument(
+ "services",
+ nargs="*",
+ choices=sorted(SERVICES.keys()),
+ help="Subset of services (default: all)",
+ )Based on learnings: “Validate inputs at the boundary” and “Use explicit error handling: fail fast on invalid inputs.”
Also applies to: 75-79
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@apps/infra-local/boxlite_local/cli.py` around lines 30 - 35, The CLI
currently lets p_up and p_down accept any service names and silently ignore
typos; update the argument parsing to validate the "services" values against the
canonical SERVICES list at parse-time and raise a parser error for any unknown
names. Change the add_argument call for "services" (both p_up and p_down) to use
a custom type or argparse Action that checks each value is in SERVICES (or use
choices=SERVICES) so argparse fails fast; ensure the same validation logic is
applied to the other parser(s) referenced around lines 75-79 so all entry points
reject invalid service names immediately.
Source: Learnings
| proc = subprocess.run( | ||
| ["lsof", "-nP", f"-iTCP:{port}", "-sTCP:LISTEN", "-F", "pcLn"], | ||
| capture_output=True, | ||
| text=True, | ||
| check=False, | ||
| ) |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
set -euo pipefail
# 1) Inspect the target file around the reported lines
FILE="apps/infra-local/boxlite_local/doctor.py"
echo "=== Context: $FILE (lines 90-170) ==="
nl -ba "$FILE" | sed -n '90,170p'
# 2) Find the definition of check_port_free (and surrounding logic)
echo
echo "=== Grep: check_port_free in $FILE ==="
rg -n "def check_port_free|check_port_free\(" "$FILE" || true
# 3) Inspect doctor() and any higher-level timeouts/cancellation/strict handling
echo
echo "=== Grep: doctor( and strict usage in $FILE ==="
rg -n "def doctor\(|strict\s*=\s*True|doctor\(" "$FILE" || true
# 4) Search repository for where doctor(strict=True) is called (to confirm blocking path)
echo
echo "=== Repo search: doctor(strict=True) ==="
rg -n "doctor\(\s*.*strict\s*=\s*True|doctor\(\s*strict\s*=\s*True|strict\s*=\s*True.*doctor\(" . || true
# 5) Search for stack-up / orchestrator call path around doctor invocation
echo
echo "=== Repo search: stack-up and doctor usage ==="
rg -n "stack-up|stack_up|stackup|orchestrator|startup.*doctor|doctor.*stack" apps/infra-local/boxlite_local . || trueRepository: boxlite-ai/boxlite
Length of output: 196
🏁 Script executed:
#!/bin/bash
set -euo pipefail
FILE="apps/infra-local/boxlite_local/doctor.py"
echo "=== Context: $FILE (lines 90-170) ==="
awk 'NR>=90 && NR<=170 {printf "%6d %s\n", NR, $0}' "$FILE" || true
echo
echo "=== Grep: check_port_free in $FILE ==="
rg -n "def check_port_free|check_port_free\(" "$FILE" || true
echo
echo "=== Grep: doctor() and strict usage in $FILE ==="
rg -n "def doctor\(|strict\s*=\s*True|doctor\(" "$FILE" || true
echo
echo "=== Repo search: doctor(strict=True) ==="
rg -n "doctor\(\s*.*strict\s*=\s*True|strict\s*=\s*True.*doctor\(" . || true
echo
echo "=== Repo search: stack-up / orchestrator call path around doctor ==="
rg -n "stack-up|stack_up|stackup|orchestrator|startup.*doctor|doctor.*stack" apps/infra-local/boxlite_local . || trueRepository: boxlite-ai/boxlite
Length of output: 12340
🏁 Script executed:
#!/bin/bash
set -euo pipefail
FILE_DOCTOR="apps/infra-local/boxlite_local/doctor.py"
FILE_ORCH="apps/infra-local/boxlite_local/orchestrator.py"
echo "=== doctor.py: context 160-260 ==="
awk 'NR>=150 && NR<=260 {printf "%6d %s\n", NR, $0}' "$FILE_DOCTOR" || true
echo
echo "=== doctor.py: subprocess.run occurrences ==="
rg -n "subprocess\.run\(" "$FILE_DOCTOR" || true
echo
echo "=== orchestrator.py: context around doctor(strict=True) ==="
awk 'NR>=330 && NR<=430 {printf "%6d %s\n", NR, $0}' "$FILE_ORCH" || true
echo
echo "=== orchestrator.py: subprocess usage (to see if there is generic timeouts) ==="
rg -n "subprocess\.(run|check_output|Popen)\(" "$FILE_ORCH" || trueRepository: boxlite-ai/boxlite
Length of output: 6041
🏁 Script executed:
#!/bin/bash
set -euo pipefail
echo "=== Repo search: wait_for / timeout around up()/doctor() ==="
rg -n "asyncio\.wait_for|wait_for\(|timeout[_a-zA-Z]*=.*doctor|doctor\(.*\).*timeout|signal\.alarm|setitimer" apps/infra-local/boxlite_local . || true
echo
echo "=== Repo search: CLI entry points calling orchestrator.up ==="
rg -n "orchestrator\.up|await up\(|up\(config" apps/infra-local/boxlite_local/cli.py apps/infra-local/boxlite_local . || true
echo
echo "=== Repo search: any timeout in scripts/stack-up.sh ==="
rg -n "timeout|kill|alarm" apps/infra-local/scripts/stack-up.sh apps/infra-local/scripts/*.sh apps/infra-local/README.md || trueRepository: boxlite-ai/boxlite
Length of output: 7777
🏁 Script executed:
#!/bin/bash
set -euo pipefail
FILE_DOCTOR="apps/infra-local/boxlite_local/doctor.py"
echo "=== doctor.py: imports/context (1-120) ==="
awk 'NR>=1 && NR<=130 {printf "%6d %s\n", NR, $0}' "$FILE_DOCTOR"Repository: boxlite-ai/boxlite
Length of output: 5332
🏁 Script executed:
#!/bin/bash
set -euo pipefail
FILE="apps/infra-local/boxlite_local/types.py"
awk 'NR>=1 && NR<=140 {printf "%6d %s\n", NR, $0}' "$FILE"Repository: boxlite-ai/boxlite
Length of output: 2764
Add a timeout to the lsof preflight port check to prevent hangs.
check_port_free() runs subprocess.run(["lsof", ...]) without a timeout, and doctor(..., strict=True) is awaited from orchestrator.up(), so a stalled lsof blocks the preflight and stops stack-up from progressing.
🛠️ Proposed fix
- proc = subprocess.run(
- ["lsof", "-nP", f"-iTCP:{port}", "-sTCP:LISTEN", "-F", "pcLn"],
- capture_output=True,
- text=True,
- check=False,
- )
+ try:
+ proc = subprocess.run(
+ ["lsof", "-nP", f"-iTCP:{port}", "-sTCP:LISTEN", "-F", "pcLn"],
+ capture_output=True,
+ text=True,
+ check=False,
+ timeout=5,
+ )
+ except subprocess.TimeoutExpired:
+ return DoctorCheck(
+ name=name,
+ severity=Severity.FAIL,
+ msg=f"lsof timed out while checking port {port}",
+ hint="Retry after closing heavy file/network scans; cannot verify port conflict while lsof is hung.",
+ )📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| proc = subprocess.run( | |
| ["lsof", "-nP", f"-iTCP:{port}", "-sTCP:LISTEN", "-F", "pcLn"], | |
| capture_output=True, | |
| text=True, | |
| check=False, | |
| ) | |
| try: | |
| proc = subprocess.run( | |
| ["lsof", "-nP", f"-iTCP:{port}", "-sTCP:LISTEN", "-F", "pcLn"], | |
| capture_output=True, | |
| text=True, | |
| check=False, | |
| timeout=5, | |
| ) | |
| except subprocess.TimeoutExpired: | |
| return DoctorCheck( | |
| name=name, | |
| severity=Severity.FAIL, | |
| msg=f"lsof timed out while checking port {port}", | |
| hint="Retry after closing heavy file/network scans; cannot verify port conflict while lsof is hung.", | |
| ) |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@apps/infra-local/boxlite_local/doctor.py` around lines 121 - 126, The lsof
call in check_port_free() can hang; add a timeout (e.g. timeout=5) to the
subprocess.run(...) invocation and catch subprocess.TimeoutExpired around that
call (the call in check_port_free used by doctor() and orchestrator.up()). On
TimeoutExpired, log or surface a clear error via the same logger used by
doctor() / orchestrator (or return False) so the preflight fails fast instead of
blocking stack-up; ensure the handler preserves existing behavior for other
non-zero results.
Source: Learnings
| await asyncio.gather( | ||
| drain(execution.stdout(), out_parts), | ||
| drain(execution.stderr(), err_parts), | ||
| ) | ||
| result = await execution.wait() | ||
| return result.exit_code, "".join(out_parts), "".join(err_parts) |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Verify call sites and whether timeout wrappers already exist.
rg -n "exec_collect\(" apps/infra-local -C2
rg -n "asyncio\.wait_for|asyncio\.timeout" apps/infra-local -C2Repository: boxlite-ai/boxlite
Length of output: 6330
🏁 Script executed:
#!/bin/bash
set -euo pipefail
# Inspect exec_collect implementation closely
sed -n '1,220p' apps/infra-local/boxlite_local/execwrap.py | nl -ba | sed -n '1,220p'
# Look for timeout/cancellation handling inside execwrap.py
rg -n "timeout_s|wait_for|CancelledError|cancel|terminate|kill|finally" apps/infra-local/boxlite_local/execwrap.py
# Check for any exec_collect signature/kwargs containing timeout at call sites
rg -n "exec_collect\([^)]*timeout|timeout_s\s*=" apps/infra-local -SRepository: boxlite-ai/boxlite
Length of output: 105
🏁 Script executed:
#!/bin/bash
set -euo pipefail
FILE="apps/infra-local/boxlite_local/execwrap.py"
# Show exec_collect region (avoid huge output)
echo "=== execwrap.py (first 220 lines with numbers) ==="
cat -n "$FILE" | sed -n '1,220p'
echo "=== execwrap.py (lines around exec_collect) ==="
# Find line number of `def exec_collect`
LINE=$(rg -n "async def exec_collect" "$FILE" | head -n 1 | cut -d: -f1)
START=$((LINE-30))
END=$((LINE+140))
cat -n "$FILE" | sed -n "${START},${END}p"
echo "=== Search for timeout/cancellation handling in execwrap.py ==="
rg -n "timeout_s|wait_for|CancelledError|cancel|terminate|kill" "$FILE"Repository: boxlite-ai/boxlite
Length of output: 3637
🏁 Script executed:
#!/bin/bash
set -euo pipefail
FILE="apps/infra-local/boxlite_local/execwrap.py"
# Show exec_collect region using safe line bounds
LINE=$(rg -n "async def exec_collect" "$FILE" | head -n 1 | cut -d: -f1)
START=$((LINE-10))
if [ "$START" -lt 1 ]; then START=1; fi
END=$((LINE+120))
echo "=== execwrap.py: exec_collect vicinity ($START..$END) ==="
cat -n "$FILE" | sed -n "${START},${END}p"
echo "=== Search for timeout/cancellation handling in execwrap.py ==="
rg -n "timeout_s|wait_for|CancelledError|cancel|terminate|kill" "$FILE" || true
echo "=== Search for any cleanup/finally usage around exec_collect ==="
rg -n "finally|except" "$FILE" || trueRepository: boxlite-ai/boxlite
Length of output: 1788
🏁 Script executed:
#!/bin/bash
set -euo pipefail
# Find where Execution and box.exec are implemented
rg -n "class Execution|async def wait\(|def wait\(|stdout\(\)|stderr\(\)|async def stdout|async def stderr|async def exec\(" apps/infra-local -S
# Narrow on Execution.wait implementation
rg -n "async def wait\(" apps/infra-local -S
# Narrow on box.exec / "def exec" implementation
rg -n "async def exec\(|def exec\(" apps/infra-local -SRepository: boxlite-ai/boxlite
Length of output: 865
🏁 Script executed:
#!/bin/bash
set -euo pipefail
# Find all exec_collect call sites in the repo
rg -n "exec_collect\(" . -S
# Find where Execution is implemented (outside tests)
rg -n "class Execution\b|async def wait\(|def wait\(|async def stdout\(|async def stderr\(|def stdout\(|def stderr\(" . -SRepository: boxlite-ai/boxlite
Length of output: 2362
🏁 Script executed:
#!/bin/bash
set -euo pipefail
# Find all exec_collect call sites in the repo
rg -n "exec_collect\(" . -S
# Find where Execution is implemented (outside tests)
rg -n "class Execution\b|async def wait\(|def wait\(|async def stdout\(|async def stderr\(|def stdout\(|def stderr\(" . -SRepository: boxlite-ai/boxlite
Length of output: 2362
🏁 Script executed:
#!/bin/bash
set -euo pipefail
# Inspect SDK execution/wait implementations
sed -n '150,320p' sdks/python/boxlite/orchestration/box_runtime.py | cat -n
echo "----"
sed -n '150,320p' sdks/python/boxlite/sync_api/_execution.py | cat -n
echo "----"
sed -n '120,240p' sdks/python/boxlite/interactivebox.py | cat -n
echo "---- Searches for cancellation/termination in SDK files ----"
rg -n "CancelledError|cancelled|terminate|kill|shutdown" sdks/python/boxlite -SRepository: boxlite-ai/boxlite
Length of output: 28638
🏁 Script executed:
#!/bin/bash
set -euo pipefail
# Inspect SDK execution/wait implementations
echo "=== sdks/python/boxlite/orchestration/box_runtime.py (approx. 150-320) ==="
sed -n '150,320p' sdks/python/boxlite/orchestration/box_runtime.py | cat -n
echo "=== sdks/python/boxlite/sync_api/_execution.py (approx. 150-320) ==="
sed -n '150,320p' sdks/python/boxlite/sync_api/_execution.py | cat -n
echo "=== sdks/python/boxlite/interactivebox.py (approx. 120-240) ==="
sed -n '120,240p' sdks/python/boxlite/interactivebox.py | cat -n
echo "=== Searches for cancellation/termination in SDK files ==="
rg -n "CancelledError|cancelled|terminate|kill|shutdown" sdks/python/boxlite -SRepository: boxlite-ai/boxlite
Length of output: 28834
🏁 Script executed:
#!/bin/bash
set -euo pipefail
# Locate the concrete async Execution implementation (not just wrappers/tests)
rg -n "class Execution\b" sdks/python/boxlite -S
rg -n "async def wait\(" sdks/python/boxlite -S
rg -n "def kill\(" sdks/python/boxlite -S
# Inspect orchestration/box_runtime.py around the CancelledError locations we saw earlier
FILE="sdks/python/boxlite/orchestration/box_runtime.py"
echo "=== orchestration/box_runtime.py: context around CancelledError ==="
rg -n "CancelledError" "$FILE" -n || true
# Print a couple windows around the first few matches (safe bounds)
for ln in $(rg -n "CancelledError" "$FILE" | cut -d: -f1 | head -n 6); do
START=$((ln-40)); [ "$START" -lt 1 ] && START=1
END=$((ln+40))
echo "---- $FILE ($START..$END) around CancelledError at line $ln ----"
sed -n "${START},${END}p" "$FILE" | cat -n
doneRepository: boxlite-ai/boxlite
Length of output: 44
🏁 Script executed:
#!/bin/bash
set -euo pipefail
FILE="sdks/python/boxlite/orchestration/box_runtime.py"
echo "=== Top of $FILE ==="
sed -n '1,120p' "$FILE" | cat -n
echo "=== Lines containing CancelledError in $FILE ==="
rg -n "CancelledError" "$FILE" || true
echo "=== Lines containing _execution.kill() in $FILE ==="
rg -n "_execution\.kill\(|execution\.kill\(" "$FILE" || true
echo "=== Lines containing def/async def wait in sdks/python/boxlite ==="
rg -n "async def wait\(|def wait\(" sdks/python/boxlite -S
echo "=== Grep for kill/terminate/subprocess cleanup related to Execution.wait ==="
rg -n "kill\(|terminate\(|SIGTERM|SIGKILL|subprocess" sdks/python/boxlite -SRepository: boxlite-ai/boxlite
Length of output: 6631
🏁 Script executed:
#!/bin/bash
set -euo pipefail
# Inspect orchestrator timeout handling
ORCH="apps/infra-local/boxlite_local/orchestrator.py"
echo "=== orchestrator.py around asyncio.wait_for(exec_collect) ==="
rg -n "wait_for\\(\\s*exec_collect\\(|exec_collect\\(box, cmd" "$ORCH"
# Print relevant ranges manually (safe)
sed -n '230,310p' "$ORCH" | cat -n
echo "----"
sed -n '320,390p' "$ORCH" | cat -n
# Inspect SDK for the concrete Execution implementation (beyond sync wrappers)
echo "=== Search sdks/python/boxlite for Execution implementation details ==="
rg -n "async def wait\\(" sdks/python/boxlite -S
rg -n "def kill\\(" sdks/python/boxlite -S
rg -n "class .*Execution" sdks/python/boxlite -S
rg -n "execution\\.kill|_execution\\.kill|\\.kill\\(\\)" sdks/python/boxlite -S | head -n 50
# Identify likely file containing async wait()/kill: print those files if found
for f in $(rg -l "async def wait\\(|def kill\\(" sdks/python/boxlite -S); do
echo "---- file: $f ----"
head -n 5 "$f"
doneRepository: boxlite-ai/boxlite
Length of output: 8193
🏁 Script executed:
#!/bin/bash
set -euo pipefail
# Inspect orchestrator timeout handling
ORCH="apps/infra-local/boxlite_local/orchestrator.py"
echo "=== orchestrator.py around asyncio.wait_for(exec_collect) ==="
rg -n "wait_for\\(\\s*exec_collect\\(|exec_collect\\(box, cmd" "$ORCH"
# Print relevant ranges manually (safe)
sed -n '230,310p' "$ORCH" | cat -n
echo "----"
sed -n '320,390p' "$ORCH" | cat -n
# Inspect SDK for the concrete Execution implementation (beyond sync wrappers)
echo "=== Search sdks/python/boxlite for Execution implementation details ==="
rg -n "async def wait\\(" sdks/python/boxlite -S
rg -n "def kill\\(" sdks/python/boxlite -S
rg -n "class .*Execution" sdks/python/boxlite -S
rg -n "execution\\.kill|_execution\\.kill|\\.kill\\(\\)" sdks/python/boxlite -S | head -n 50
# Identify likely file containing async wait()/kill: print those files if found
for f in $(rg -l "async def wait\\(|def kill\\(" sdks/python/boxlite -S); do
echo "---- file: $f ----"
head -n 5 "$f"
doneRepository: boxlite-ai/boxlite
Length of output: 8193
🏁 Script executed:
#!/bin/bash
set -euo pipefail
# Search for usage of kill on execution objects across repo
rg -n "\.kill\(" apps/infra-local/boxlite_local sdks/python/boxlite -S
# Find where exec returns an Execution and see its interface (wait/stdout/stderr/kill)
rg -n "return .*Execution|-> .*Execution|class .*Execution" sdks/python/boxlite -S
# Look for ".stdout(" / ".stderr(" / ".wait()" / ".kill()" patterns in the SDK exec implementation
rg -n "exec\\(|def exec\\(|async def exec\\(" sdks/python/boxlite -S | head -n 50Repository: boxlite-ai/boxlite
Length of output: 5169
🏁 Script executed:
#!/bin/bash
set -euo pipefail
# Search for usage of kill on execution objects across repo
rg -n "\.kill\(" apps/infra-local/boxlite_local sdks/python/boxlite -S
# Find where exec returns an Execution and see its interface (wait/stdout/stderr/kill)
rg -n "return .*Execution|-> .*Execution|class .*Execution" sdks/python/boxlite -S
# Look for where async/sync exec is defined to locate the native Execution object
rg -n "async def exec\\(|def exec\\(" sdks/python/boxlite -S | head -n 50Repository: boxlite-ai/boxlite
Length of output: 652
🏁 Script executed:
#!/bin/bash
set -euo pipefail
# Search for usage of kill on execution objects across repo
rg -n "\.kill\(" apps/infra-local/boxlite_local sdks/python/boxlite -S
# Find where exec returns an Execution and see its interface (wait/stdout/stderr/kill)
rg -n "return .*Execution|-> .*Execution|class .*Execution" sdks/python/boxlite -S
# Locate where exec is defined (to find the native Execution object interface)
rg -n "async def exec\\(|def exec\\(" sdks/python/boxlite -S | head -n 50Repository: boxlite-ai/boxlite
Length of output: 652
🏁 Script executed:
#!/bin/bash
set -euo pipefail
# Inspect async exec implementation in simplebox.py (to see Execution interface/type)
FILE="sdks/python/boxlite/simplebox.py"
echo "=== simplebox.py: async def exec vicinity ==="
rg -n "async def exec\\(" "$FILE"
LINE=$(rg -n "async def exec\\(" "$FILE" | head -n 1 | cut -d: -f1)
START=$((LINE-40)); [ "$START" -lt 1 ] && START=1
END=$((LINE+140))
sed -n "${START},${END}p" "$FILE" | cat -n
# Inspect sync_api/_execution.py for wait/kill interface
FILE="sdks/python/boxlite/sync_api/_execution.py"
echo "=== sync_api/_execution.py: kill + wait interface vicinity ==="
rg -n "def kill\\(|def wait\\(|async def wait\\(" "$FILE"
sed -n '200,320p' "$FILE" | cat -n
# Find non-sync Execution class definitions (broader regex)
rg -n "class .*Execution" sdks/python/boxlite -S
# Search for ".kill()" usage outside orchestration/box_runtime.py to infer native execution handling
rg -n "\.kill\(" sdks/python/boxlite -SRepository: boxlite-ai/boxlite
Length of output: 10186
🏁 Script executed:
#!/bin/bash
set -euo pipefail
ORCH="apps/infra-local/boxlite_local/orchestrator.py"
echo "=== orchestrator.py: show imports/top around runtime/get_runtime/exec_collect usage ==="
sed -n '1,120p' "$ORCH" | cat -n
echo "=== orchestrator.py: find where runtime.get(name) is used and what box type it is ==="
rg -n "runtime\.get\(|create_box\(|ManagedBox|SimpleBox|box\.exec\(" "$ORCH" -S
echo "=== orchestrator.py: show around one-shot bootstrap usage of runtime.get and exec_collect(box, 'true') ==="
# locate the function containing exec_collect(box, "true"
LINE=$(rg -n "exec_collect\\(box, \"true\"" "$ORCH" | head -n1 | cut -d: -f1)
START=$((LINE-60)); [ "$START" -lt 1 ] && START=1
END=$((LINE+60))
sed -n "${START},${END}p" "$ORCH" | cat -n
echo "=== execwrap.py: search for kill/terminate usage around exec_collect ==="
FILE="apps/infra-local/boxlite_local/execwrap.py"
rg -n "kill|terminate|finally|CancelledError" "$FILE" -SRepository: boxlite-ai/boxlite
Length of output: 10688
🏁 Script executed:
#!/bin/bash
set -euo pipefail
ORCH="apps/infra-local/boxlite_local/orchestrator.py"
echo "=== orchestrator.py: show imports/top around runtime/get_runtime/exec_collect usage ==="
sed -n '1,120p' "$ORCH" | cat -n
echo "=== orchestrator.py: find where runtime.get(name) is used and what box type it is ==="
rg -n "runtime\.get\(|create_box\(|ManagedBox|SimpleBox|box\.exec\(" "$ORCH" -S
echo "=== orchestrator.py: show around one-shot bootstrap usage of runtime.get and exec_collect(box, 'true') ==="
# locate the function containing exec_collect(box, "true"
LINE=$(rg -n "exec_collect\\(box, \"true\"" "$ORCH" | head -n1 | cut -d: -f1)
START=$((LINE-60)); [ "$START" -lt 1 ] && START=1
END=$((LINE+60))
sed -n "${START},${END}p" "$ORCH" | cat -n
echo "=== execwrap.py: search for kill/terminate usage around exec_collect ==="
FILE="apps/infra-local/boxlite_local/execwrap.py"
rg -n "kill|terminate|finally|CancelledError" "$FILE" -S || trueRepository: boxlite-ai/boxlite
Length of output: 10688
🏁 Script executed:
#!/bin/bash
set -euo pipefail
ORCH="apps/infra-local/boxlite_local/orchestrator.py"
echo "=== orchestrator.py: show imports/top ==="
sed -n '1,120p' "$ORCH" | cat -n
echo "=== orchestrator.py: find where runtime.get(...) is used and what box type it is ==="
rg -n "runtime\.get\(|create_box\(|ManagedBox|SimpleBox|box\.exec\(" "$ORCH" -S || true
echo "=== orchestrator.py: show around one-shot bootstrap probe using exec_collect(box, \"true\", ...) ==="
LINE=$(rg -n "exec_collect\\(box, \"true\"" "$ORCH" | head -n1 | cut -d: -f1)
START=$((LINE-60)); [ "$START" -lt 1 ] && START=1
END=$((LINE+60))
sed -n "${START},${END}p" "$ORCH" | cat -n
echo "=== execwrap.py: search for kill/terminate/cancellation handling ==="
FILE="apps/infra-local/boxlite_local/execwrap.py"
rg -n "kill|terminate|finally|CancelledError" "$FILE" -S || trueRepository: boxlite-ai/boxlite
Length of output: 10629
Add an execution timeout to exec_collect (and ensure it stops the underlying process) to prevent orchestration hangs.
apps/infra-local/boxlite_local/execwrap.py’s exec_collect drains stdout/stderr and then does result = await execution.wait() with no time bound. Even though _wait_healthy_exec(...) wraps exec_collect in asyncio.wait_for(..., timeout=hc.timeout_s), _wait_one_shot_exit(...) awaits exec_collect(box, "true", []) directly (no per-exec timeout), so a stuck command can block the orchestration loop beyond the loop’s timeout_s.
exec_collect should accept a timeout and propagate it to box.exec(..., timeout=...) (the SDK supports a timeout arg) or otherwise guarantee the underlying execution is killed on timeout.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@apps/infra-local/boxlite_local/execwrap.py` around lines 34 - 39,
exec_collect currently awaits execution.wait() with no bound; add a timeout
parameter to exec_collect (e.g., timeout: Optional[float] = None), pass it
through to the SDK call that creates the execution (box.exec(...,
timeout=timeout)) and also guard the final wait with
asyncio.wait_for(execution.wait(), timeout) so the underlying process is killed
on timeout (catch asyncio.TimeoutError, call execution.kill() or
execution.terminate() and await execution.wait() to ensure cleanup, then
re-raise or return a clear timeout result). Update callers such as
_wait_one_shot_exit to pass an appropriate timeout (or None) when invoking
exec_collect and keep _wait_healthy_exec behavior unchanged if it already wraps
exec_collect in asyncio.wait_for.
Source: Learnings
| for name, spec in services.items(): | ||
| ts.add(name, *spec.depends_on) | ||
| ts.prepare() |
There was a problem hiding this comment.
Validate depends_on entries before graph construction.
Line 32 accepts dependency names that are not keys in services. Those unknown nodes can appear in topo output, then Line 386/Line 401 will crash on services[n] with KeyError.
Proposed fix
def topo_sort(services: dict[str, ServiceSpec]) -> list[list[str]]:
@@
- ts: TopologicalSorter[str] = TopologicalSorter()
+ unknown_deps = sorted(
+ {
+ dep
+ for spec in services.values()
+ for dep in spec.depends_on
+ if dep not in services
+ }
+ )
+ if unknown_deps:
+ raise ValueError(f"Unknown service dependency(ies): {', '.join(unknown_deps)}")
+
+ ts: TopologicalSorter[str] = TopologicalSorter()| [ -L "${APPS_DIR}/.env" ] || ln -sf api/.env "${APPS_DIR}/.env" | ||
| # apps/apps → . (webpack.config.js path resolution quirk) | ||
| [ -L "${APPS_DIR}/apps" ] || ln -sf . "${APPS_DIR}/apps" |
There was a problem hiding this comment.
Avoid clobbering a real apps/.env with forced symlink creation.
Line 74 uses ln -sf, which will overwrite an existing non-symlink file. That can destroy local developer config unexpectedly. Guard for existing non-symlink paths and fail with a clear message instead of force-replacing.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@apps/infra-local/scripts/stack-up.sh` around lines 74 - 76, The current use
of ln -sf to create "${APPS_DIR}/.env" and "${APPS_DIR}/apps" will clobber
existing non-symlink files; update the script to first test each target: if the
path does not exist create the symlink with ln -s, if it exists and is a symlink
leave it alone, and if it exists but is not a symlink print a clear error and
exit non-zero so we don't overwrite developer files; apply this logic for the
".env" and "apps" targets (references: APPS_DIR, "${APPS_DIR}/.env",
"${APPS_DIR}/apps", and the ln -s invocation).
| if port_listening "${PORT_API}"; then | ||
| warn "port ${PORT_API} already in use — killing prior listener" | ||
| lsof -ti :${PORT_API} -sTCP:LISTEN | xargs -r kill -9 2>/dev/null || true | ||
| sleep 1 |
There was a problem hiding this comment.
Don’t blindly kill -9 any process listening on these ports.
These blocks terminate whatever is bound to the port, even if it’s unrelated to this stack. This can kill other local services and lose work. Limit termination to stack-owned PIDs (from pidfiles / command validation), and use TERM with a short wait before KILL fallback.
Also applies to: 126-129, 155-158, 185-188
🧰 Tools
🪛 Shellcheck (0.11.0)
[info] 87-87: Double quote to prevent globbing and word splitting.
(SC2086)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@apps/infra-local/scripts/stack-up.sh` around lines 85 - 88, The current port
cleanup blindly runs lsof/... | xargs kill -9 which can terminate unrelated
processes; update the port cleanup in the blocks that call port_listening and
warn to: prefer finding stack-owned PID(s) first (check a corresponding pidfile
if one exists, or filter lsof/ps output by a known command pattern for this
stack using pgrep -f or lsof -c), send SIGTERM to those PIDs, wait a short
period (e.g. poll for port closure up to a few seconds), and only use SIGKILL
(kill -9) as a fallback if the port is still held; replace direct lsof ... |
xargs -r kill -9 with a safe function that accepts the port (used where
port_listening is called) and implements pidfile/command validation, TERM then
KILL fallback, and preserves the existing logging via warn.
| asyncio.run(setup()) | ||
| try: | ||
| yield cfg | ||
| finally: | ||
| asyncio.run(teardown()) | ||
| mp.undo() | ||
| shutil.rmtree(tmp, ignore_errors=True) | ||
|
|
There was a problem hiding this comment.
Guarantee fixture cleanup when setup skips or errors.
Line 83 runs setup() before entering the current try/finally. If setup raises pytest.skip (Line 63) or another exception, Lines 87-90 never run, so the env patch/temp dir can leak into subsequent tests.
💡 Proposed fix
- asyncio.run(setup())
- try:
- yield cfg
- finally:
- asyncio.run(teardown())
- mp.undo()
- shutil.rmtree(tmp, ignore_errors=True)
+ try:
+ asyncio.run(setup())
+ yield cfg
+ finally:
+ try:
+ asyncio.run(teardown())
+ finally:
+ mp.undo()
+ shutil.rmtree(tmp, ignore_errors=True)Based on learnings: "Close and release files, sockets, clients, browser handles, and subprocesses explicitly."
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@apps/infra-local/tests/integration/test_e2e_full.py` around lines 83 - 90,
The fixture currently calls asyncio.run(setup()) outside the try, so if setup
raises (including pytest.skip) the finally cleanup (asyncio.run(teardown()),
mp.undo(), shutil.rmtree(tmp)) never runs; change the fixture to call setup()
inside a try block and track whether setup completed (e.g., setup_done flag set
to True after asyncio.run(setup()) returns) then in the finally always perform
mp.undo() and shutil.rmtree(tmp, ignore_errors=True) and only call
asyncio.run(teardown()) if setup_done (or ensure teardown is safe to call
unconditionally); ensure pytest.skip exceptions are allowed to propagate but
cleanup still runs.
Source: Learnings
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 39 out of 43 changed files in this pull request and generated 3 comments.
Comments suppressed due to low confidence (2)
tsconfig.base.json:1
- This appears to be a committed symlink (or symlink-like pointer) rather than a valid JSON
tsconfig. Git symlinks can break on Windows checkouts and some tooling that expectstsconfig.base.jsonto be parseable JSON. Consider replacing this with a real roottsconfig.base.json(or a small JSON file thatextendstheapps/tsconfig.base.json) to avoid cross-platform and toolchain issues; apply the same approach to the root-levellibspointer as well.
apps/infra-local/tests/integration/test_e2e_full.py:1 asyncio.run(setup())happens before thetry/finally. Ifsetup()triggerspytest.skip(...)or raises/asserts, the fixture exits before entering thefinally, somp.undo()andshutil.rmtree(tmp)won’t run (leaking env vars and temp dirs). Move thesetup()call inside thetryso cleanup runs even when setup skips/fails.
| <PostHogProvider | ||
| apiKey="phc_local_dev_no_op" | ||
| options={{ | ||
| // api_host is required by posthog-js but never used because | ||
| // capturing is opted out and flags are bootstrapped below. | ||
| api_host: 'https://localhost.invalid', | ||
| bootstrap: { featureFlags: LOCAL_DEV_FEATURE_FLAG_DEFAULTS }, | ||
| advanced_disable_feature_flags: true, // skip /decide network call | ||
| autocapture: false, | ||
| capture_pageview: false, | ||
| capture_pageleave: false, | ||
| opt_out_capturing_by_default: true, | ||
| disable_session_recording: true, | ||
| loaded: (posthog) => posthog.opt_out_capturing(), | ||
| }} |
| if [ "$1" = "all" ]; then | ||
| # `tail -f` multiple files at once; each line gets ==> filename <== headers. | ||
| exec tail -F "${LOGS_DIR}"/*.log | ||
| fi |
| _MINIO_INIT_SCRIPT = """\ | ||
| set -eu | ||
| for i in 1 2 3 4 5; do | ||
| if mc alias set boxlite "$MINIO_URL" "$MINIO_USER" "$MINIO_PASSWORD" 2>/dev/null; then break; fi | ||
| echo "init: minio not ready yet (attempt $i)" | ||
| sleep 2 | ||
| done | ||
| mc alias set boxlite "$MINIO_URL" "$MINIO_USER" "$MINIO_PASSWORD" | ||
| mc mb --ignore-existing boxlite/boxlite | ||
| echo "init: ok - boxlite bucket ready" | ||
| """ |
|
This push does three things:
|
… closed) The local-dev PostHog bootstrap (forcing the 6 UI/route-gating flags to true when PostHog isn't configured) could fail OPEN in any prod/staging deploy that ships without POSTHOG_API_KEY: sst.config.ts injects the key only conditionally, and an absent key made the API expose gated routes (POST /regions, GET /runners) and the dashboard render gated surfaces (create-sandbox, infrastructure, etc.) that should stay hidden. Gate the bootstrap behind a hard production check at all three spots: - app.module.ts: skip passing bootstrapFlags when NODE_ENV==='production' - openfeature-posthog.provider.ts: defense-in-depth — refuse bootstrapFlags in production so the flag-resolution boundary itself can't fail open regardless of how it's wired; unconfigured prod falls through to each call-site's defaultValue (false) - PostHogProviderWrapper.tsx: in a prod build (import.meta.env.PROD) with no PostHog config, render children without a provider so flags resolve falsy infra-local is unaffected: NODE_ENV=development / vite dev (PROD=false) still get the bootstrap. Verified the running stack — dashboard still renders Create Sandbox after the gate; the three files type-check clean. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 40 out of 44 changed files in this pull request and generated 4 comments.
Comments suppressed due to low confidence (4)
apps/infra-local/scripts/stack-up.sh:1
xargs -ris GNU-specific and isn’t supported by macOS/BSDxargs, which will print an error (and can break the UX of these scripts). Replace the pipeline with a bash-level empty-check (capture PIDs fromlsof -ti, thenkillonly if non-empty), or use a portablexargspattern that won’t error on empty input.
apps/infra-local/scripts/stack-up.sh:1xargs -ris GNU-specific and isn’t supported by macOS/BSDxargs, which will print an error (and can break the UX of these scripts). Replace the pipeline with a bash-level empty-check (capture PIDs fromlsof -ti, thenkillonly if non-empty), or use a portablexargspattern that won’t error on empty input.
apps/infra-local/scripts/stack-up.sh:1xargs -ris GNU-specific and isn’t supported by macOS/BSDxargs, which will print an error (and can break the UX of these scripts). Replace the pipeline with a bash-level empty-check (capture PIDs fromlsof -ti, thenkillonly if non-empty), or use a portablexargspattern that won’t error on empty input.
apps/infra-local/scripts/stack-up.sh:1xargs -ris GNU-specific and isn’t supported by macOS/BSDxargs, which will print an error (and can break the UX of these scripts). Replace the pipeline with a bash-level empty-check (capture PIDs fromlsof -ti, thenkillonly if non-empty), or use a portablexargspattern that won’t error on empty input.
| if [ "$1" = "all" ]; then | ||
| # `tail -f` multiple files at once; each line gets ==> filename <== headers. | ||
| exec tail -F "${LOGS_DIR}"/*.log | ||
| fi |
| }} | ||
| > | ||
| {children} | ||
| </PostHogProvider> |
| stack-reset-hard: ## like stack-reset, but also re-applies prod schema baseline | ||
| ./scripts/stack-reset.sh --hard |
Fresh clones have no apps/api/.env (gitignored — prod holds real secrets there), so stack-up.sh's unconditional `apps/.env -> api/.env` symlink dangled and `. ./.env` failed, leaving the L2 API unable to start. Ship configs/api.env (all deterministic local-stack placeholders, no real secrets) and seed apps/api/.env from it on first stack-up when missing, without clobbering a dev's edits.
…y-required Verified end-to-end that infra-local deploys without it: cold stack-up → 10/10 L1 boxes healthy → API healthy → seed completes (snapshot active; seed-init-data.sh seeds via psql, not the flag-gated routes) → POST /api/sandbox → microVM started. The control plane never consults the feature flags — only flag-gated dashboard UI and admin routes do. Dropped: - openfeature-posthog.provider.ts bootstrap support (reverted to main) and its spec, which only tested the bootstrap gate - the bootstrapFlags injection block in app.module.ts - the dashboard PostHogProviderWrapper local-dev defaults (reverted to main) Kept (deliberately not reverted): - the DB pool statement_timeout/query_timeout wiring in app.module.ts + configuration.ts — consumed by stack-up.sh's DB_*_TIMEOUT_MS exports (microVM↔pg wedge fix) - apps/package.json typeorm 0.3.28 pin (quoted-orderBy 0.3.29 trigger still live in sandbox.manager.ts) and @types/node-forge Behavior change in local dev (documented in infra-local-usage.md): without a configured PostHog, flag-gated surfaces fail closed — the dashboard "Create Sandbox" button and admin routes like POST /api/regions stay hidden. API sandbox creation is unaffected.
Reverts configuration.ts + app.module.ts to their main (merge-base e4bbd0e) versions and removes the now-stranded DB_QUERY_TIMEOUT_MS / DB_STATEMENT_TIMEOUT_MS exports from stack-up.sh. apps/api now differs from main only by the webpack/tsconfig build fixes — verified earlier today by a full cold deploy on exactly this app-file state (10/10 L1 healthy → API healthy → seed → sandbox started). Accepted trade-off (dev-only, documented in services.py): without the client-side query_timeout, a query wedged on the microVM↔pg transport can hold its API pool slot until `make stack-restart COMPONENTS=api`. The server-side caps stay in the pg box config (statement_timeout, idle_in_transaction_session_timeout, tcp_keepalives), which reap the orphaned backend. Root cause is the transport silently wedging a stream — to be filed against src/; the env knobs can return as a standalone main PR if prod wants them as general hygiene.
…ags instead Reverts apps/api/webpack.config.js to its main (merge-base e4bbd0e) version. The local-dev plugin filtering (GeneratePackageJsonPlugin + ForkTsCheckerWebpackPlugin by constructor-name string match) is replaced by the executors' own documented options, passed at the single local call site in stack-up.sh: nx serve api --buildTargetOptions.generatePackageJson=false \ --buildTargetOptions.skipTypeChecking=true (@nx/js:node serve forwards buildTargetOptions to the @nx/webpack:webpack build target; the JSON-string form fails Nx's schema validation, so the dotted form is used.) CI/prod builds keep both plugins — this is invocation-scoped, not config-scoped. CONNECTIONS.md and infra-local-usage.md updated so no doc instructs a bare `nx serve api`, which would crash in this workspace (yarn4 + gitignored lockfile leaves npm deps out of the Nx project graph). Verified: nx build api green with flags on the reverted config; API relaunched via the exact new invocation → /api/health 200; sandbox created on it → state `started` in ~5s. apps/api now differs from main only by the 2-line tsconfig.json extends fix.
Moves docs/apps/infra-local-usage.md into apps/infra-local/README.md as a "Usage guide" section so the component's docs live with the component (alongside CONNECTIONS.md). Content moved verbatim; headings demoted one level (code fences untouched), the one relative link rewritten for the new location, and the README's three links to the old path turned into internal anchors. No remaining references to the old path.
Signed-off-by: dorianzheng <8065637+DorianZheng@users.noreply.github.com>
Main's new api-client-drift workflow runs `yarn install --immutable` against the newly-committed apps/yarn.lock; the PR's merge-ref failed YN0028 because the branch's typeorm 0.3.28 pin + @types/node-forge weren't in that lock, and main's tar ^7.5.16 bump conflicted textually with the adjacent pin line. Manual content (everything else is main's side verbatim): - apps/package.json: union resolution — main's tar ^7.5.16 + the branch's typeorm 0.3.28 pin (0.3.29 quoted-orderBy regression guard) - apps/yarn.lock: regenerated from main's lock with the merged manifest (+24/-14: typeorm 0.3.28 + @types/node-forge entries, stale descriptors pruned) Verified: `yarn install --immutable` passes on the merged tree (the exact failed CI check); `nx build api --skip-nx-cache` green.
…l dependsOn serve's defaultConfiguration has pointed at api:build:development since the workspace import without that configuration existing. Add it (generatePackageJson=false — GeneratePackageJsonPlugin crashes in this workspace; skipTypeChecking=true — pre-existing type-only OTLP lockfile skew), and remove serve's dependsOn:["build"]: @nx/js:node builds its buildTarget itself, and the dependsOn forced a second bare-config build that hit the GeneratePackageJsonPlugin crash whenever the Nx graph was fresh (the main merge invalidated the stale cache that had been masking it; configurations don't propagate to plain-string dependsOn tasks, so the stack-up.sh flags could not reach it). Production-safe: apps/api/Dockerfile runs `nx build api --configuration=production` and never invokes serve; no path values touched. Verified: plain `nx serve api` on the merged tree reaches /api/health 200 (crashed before); POST /api/box creates a box.
Summary
Adds
apps/infra-local/— a fully self-hosted BoxLite cloud-MVPcontrol plane that runs on a single Apple Silicon Mac. One command
(
make stack-up) brings up the entire stack and lets a developer createreal microVM sandboxes through the dashboard — no AWS, no Docker daemon
for the application boxes; everything runs natively on M5.
Ships as
milestone/infra-local/v0.1.0. Cold-start → working sandbox +browser terminal (
root@boxlite:~#) in ~80 s.Architecture
boxlite_localPython package:3001), Go Runner (:3003), Go Proxy (:4000), Vite Dashboard (:3000)make stack-*~/.boxlite-runner/The Runner runs natively on M5 (HVF + libkrun) — single runner host.
Multi-host autoscaler testing (the old Lima route) is intentionally out
of scope here and parked in a separate worktree.
What works end-to-end (verified)
make stack-nuke && make stack-upboots all services + auto-seeds + waits for the default snapshot — ~80 s.admin@boxlite.dev/password, plus a normaltest01@boxlite.dev); API auto-creates user + Personal org + owner row on first login.POST /api/sandbox.ubuntu:22.04default snapshot; runner pulls the arm64 layer from the local registry.Verified twice via full cold-start + browser-driven E2E (login → snapshot Active → create sandbox → terminal
root@boxlite:~#).Operate-by-make
make stack-upis the single, self-healing entry point — works froma fresh checkout, after a reboot, or after
make stack-down:make installif the orchestrator package isn't importable/tmpcleared on reboot)load-schemais idempotent — skips cleanly when the PG data volume survived a rebootOther targets:
stack-status,stack-logs COMPONENT=…,stack-restart COMPONENTS=…,stack-rebuild-l1-box BOX=…, tiered cleanupstack-reset/stack-reset-hard/stack-nuke. Seedocs/apps/infra-local-usage.md.Notable implementation notes
stack-up.sh): the Go runner reportshost-wide CPU/RAM/disk to the API. On a dev Mac sharing RAM with
IDE/Chrome/Docker, that drags
availabilityScorebelow the prodthreshold and the API rejects sandbox-create with "No available
runners".
stack-up.shexportsRUNNER_AVAILABILITY_SCORE_THRESHOLD=5/
RUNNER_MEMORY_PENALTY_THRESHOLD=95/RUNNER_DISK_PENALTY_THRESHOLD=95before launching the API. Documented in
CONNECTIONS.md; structuralfix (runner reports only its own usage) tracked as follow-up.
feature-flag bootstrap (api + dashboard), runner
runtime.GOARCHimage pull (fixes
ENOEXECon arm64),jwt.strategyuser-createfix, Caddy host-regexp → proxy for sandbox port-preview URLs.
Known limitations (intentional, not bugs)
PostHog, Billing (Stripe), Svix webhooks, Snapshot Manager (Dockerfile
builds), SSH Gateway, ClickHouse, OpenSearch, SMTP, dns-shim + TLS are
mocked / deferred — see the
milestone doc
"Known limitations" table. Production-parity gap (HVF vs KVM libkrun
backend) is acknowledged and revisited when the autoscaler lands.
Docs
apps/infra-local/README.md— quick start + make targets + architectureapps/infra-local/CONNECTIONS.md— endpoint / credential / env-var referencedocs/apps/infra-local-usage.md— day-to-day workflow + tiered cleanup decision treedocs/apps/infra-local-status.md— real vs mock vs missing inventorydocs/apps/milestones/2026-05-25-milestone-infra-local-v0.1.0.md— milestone summaryTest plan
make stack-upfrom fresh checkout (auto install + build + up)make stack-nuke && make stack-upcold start → ~80 s to working stackmake stack-up(load-schema idempotent skip, binaries rebuilt)ubuntu:22.04Activeroot@boxlite:~#🤖 Generated with Claude Code
Summary by CodeRabbit
New Features
Bug Fixes
Documentation
Tests
Chores