ci(e2e): e2e-cloud workflow + IAM OIDC provisioning (replaces #680)#724
ci(e2e): e2e-cloud workflow + IAM OIDC provisioning (replaces #680)#724G4614 wants to merge 93 commits into
Conversation
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
📝 WalkthroughWalkthroughAdds an always-on Tokyo e2e GitHub Actions workflow, an OIDC IAM provisioning script, API/runner build-and-deploy steps, pytest e2e execution with diagnostics, README and fixture updates, and a Dockerfile build-arg for configurable Nx builds. ChangesTokyo Always-on E2E Stack
Sequence Diagram(s)sequenceDiagram
participant GitHubActions
participant AWS as AWS(ECR/ECS/SSM/S3/ALB)
participant API as API(ECS Service / ALB)
participant Runner as Runner(EC2 boxlite-runner)
participant Pytests as Pytests
GitHubActions->>AWS: obtain OIDC id-token & assume role
GitHubActions->>AWS: resolve ECS cluster, ALB DNS/target-group, runner instance, S3 bucket
GitHubActions->>AWS: build & push API image to ECR (apps/api/Dockerfile.source)
GitHubActions->>AWS: build runner binary and upload to S3 (builds/<sha>/boxlite-runner)
GitHubActions->>AWS: register new ECS task definition with new image -> update service (rolling deploy)
AWS->>API: deploy new task -> ALB health checks
GitHubActions->>Runner: send SSM AWS-RunShellScript to download binary, replace executable, restart systemd
GitHubActions->>AWS: ECS Exec into API task to run psql quota init
GitHubActions->>Pytests: install SDK, write credentials (from SSM), run pytest via ALB DNS
Pytests->>API: e2e requests via ALB DNS
GitHubActions->>AWS: on failure, fetch CloudWatch logs and runner journal via SSM
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
PR #678 wires up an e2e suite but its workflow only triggers on PRs that touch specific paths. GitHub branch protection requires a check to ALWAYS report a result — a path-filtered workflow that doesn't fire leaves the required check pending forever, so the PR can't merge. Reshape so e2e can be a required check: 1. Drop `paths:` from the trigger — workflow always starts on every PR to main. Inside, a cheap `changes` job on a GitHub-hosted runner does the path-filter detection via dorny/paths-filter@v3. 2. The expensive `e2e` job (self-hosted kvm runner, ~10-30 min) stays gated on `changes.outputs.relevant == true`. PRs that don't touch relevant paths skip the kvm runner entirely — no contention cost. 3. New `e2e-gate` job ALWAYS runs (`if: always()`) and collapses the conditional outcome into one stable check name `E2E gate (required)`. Reports success if e2e passed OR no relevant paths changed; fail if e2e ran and failed. Branch protection should be configured (separately, via repo Settings → Branches → main) to require the `E2E gate (required)` status check. That's a repo-settings action a maintainer needs to do once after this merges. Cost on irrelevant PRs: one ~10s GitHub-hosted job for `changes` + ~5s for `e2e-gate`. Negligible. No self-hosted runner consumed. Cost on relevant PRs: same as before (full e2e on kvm runner). Stacks on #678 — this commit only makes sense once that PR's e2e-stack.yml is in main. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The autouse `verify_runner_saw_all_boxes` fixture shells out to `journalctl -u boxlite-runner`, which only resolves on a host where the runner systemd service is installed (the current self-hosted KVM runner). When the e2e suite migrates to a GitHub-hosted runner pointing at a remote stack (Tokyo Api/Runner deployment), journalctl returns zero hits and every box-creating test fails on the path-verify assert even though the box actually reached the remote runner. Setting BOXLITE_E2E_SKIP_PATH_VERIFY=1 short-circuits the fixture (yield-and-return before the journalctl probe). The cloud-CI workflow will set this; the self-hosted KVM job keeps the guard. The FFI-bypass risk this guard defends against doesn't apply on a stock GHA runner — no KVM means libkrun can't start a VM, so a silent fallback would surface as a hard error, not as a passing-but-wrong test. Losing the guard in that environment loses no real safety net. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds scripts/ci/setup-e2e-cloud-oidc.sh — sibling to setup-ci-runner.sh but for the cloud-stack regression suite (`.github/workflows/e2e-cloud.yml`, added in the next commit). A separate IAM role (`boxlite-e2e-cloud-github-actions`) is needed because the existing `boxlite-e2e-github-actions` role is tuned for the us-east-1 KVM-runner provisioning path. Cloud e2e against the ap-northeast-1 Tokyo stack needs an orthogonal permission set — ECR push, ECS update-service + execute-command, SSM send-command to a specific runner EC2, S3 builds/ artifact upload, parameter-store reads. Splitting the roles keeps each one auditable under least-privilege. Scoping: - Trust policy `sub` is an enumerated allowlist (main pushes, pull_request, environment:e2e-cloud) — a stray workflow added on a feature branch can't assume the role. - ssm:SendCommand is gated by `ssm:resourceTag/Name=boxlite-runner` so the role can only shell the runner EC2. - ecs:UpdateService restricted to boxlite-e2e-ci-*/Api service ARN; ecs:ExecuteCommand to boxlite-e2e-ci-*/* task ARNs. - Resource wildcards remaining (ecr:GetAuthorizationToken, ssmmessages:*, ecs:Describe*, elbv2:Describe*) are forced by AWS not supporting resource-level scoping for those actions. Sources scripts/common.sh for the shared log helpers so this script's output matches setup-ci-runner.sh. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The previous e2e-stack.yml bootstrapped a full local stack (postgres, redis, docker registry, boxlite-runner systemd, boxlite-api) inside a self-hosted KVM runner per PR — slow to maintain (bootstrap drift, KVM runner contention) and only covers self-bootstrap state. This workflow keeps the same 3-job structure introduced by PR #680 (changes / e2e / e2e-gate) so the `E2E gate (required)` branch protection check is unchanged, but replaces the `e2e` job: - runs-on: ubuntu-latest (no /dev/kvm needed) - authenticates to AWS via OIDC using the new boxlite-e2e-cloud-github-actions role - builds the Api container image + boxlite-runner musl binary from this checkout - deploys to the always-on Tokyo stack: rolling-updates the Api ECS service to the new image and SSM-replaces the runner EC2's binary - initialises the admin-org sandbox quota row in RDS via ECS Exec into the Api task (DB_PASSWORD stays in container env — never in the SSM command body) - builds & installs the Python SDK from this checkout (path_prefix and other source-tree additions land ahead of the PyPI release) - runs `pytest scripts/test/e2e/cases/` against http://<api-lb>/api with BOXLITE_E2E_SKIP_PATH_VERIFY=1 (journalctl-based path verify doesn't reach the remote runner) - on failure, dumps Api CloudWatch logs + boxlite-runner journalctl Concurrency uses a constant group (`e2e-cloud-shared`) because every run — push + PR — competes for the same singleton Tokyo stack. Per-ref grouping would let two PRs interleave ECS rolling updates. Resource discovery (cluster id, LB DNS, runner instance id, S3 bucket, ECR repo) is by-pattern at workflow start, with a strict count=1 assertion so an orphaned `ApiLoadBalancer-*` from a partial SST teardown fails the job loudly instead of silently binding to the wrong target. SSM agent ping-status is verified before the runner update step to fail fast on a dead agent. Adversarial review (multiple lenses) caught and this commit addresses: - credentials.toml written via printf rather than unquoted heredoc so a `$` inside the admin key can't be shell-expanded; ADMIN_KEY immediately registered via `::add-mask::` - ECS execute-command output is captured and grepped for the expected SELECT row (Session Manager exit-code propagation has historically been unreliable through --interactive without a tty) - `wait services-stable` is followed by ALB target-health polling AND a /api/health curl with retry — services-stable alone races the LB health check - pytest wrapped in `timeout 35m` plus `--timeout=180 --timeout-method=thread` so a hung VM doesn't burn the 45-min job timeout before `Collect logs on failure` runs - Rust build cached via Swatinem/rust-cache@v2 so repeat PR builds are incremental - both actions/checkout occurrences bumped to @v5 to match the newer workflows in the repo Known limits (TODO/follow-up, not blocking): - boxlite-runner has no documented in-flight drain; restart drops any request landing during the swap window. Acceptable for e2e (no concurrent users) but a production-grade rollout would need a runner-side graceful drain. - No post-run restore of `main` HEAD on the Tokyo stack: the stack is left running the last e2e-cloud run's image. Console inspection of `api-<sha>` identifies the running revision. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- Add `e2e-cloud.yml` section to .github/workflows/README.md mirroring the structure of the e2e-test.yml section (Why / Architecture / Triggers / Cost / Authentication / Required variables / Required AWS resources / Concurrency / Stack state). - Side-by-side table contrasts the two e2e workflows' IAM roles, regions, and scopes so future maintainers can tell at a glance why both exist. - Fix line 198 which referenced `scripts/ci/setup-aws-oidc.sh` — that filename does not exist; the actual provisioning script is `scripts/ci/setup-ci-runner.sh`. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…visioned PR #680's original goal was to make the e2e suite a required merge gate. With the workflow now retargeted at the always-on Tokyo stack (previous commit), keeping `e2e-gate` as `(required)` would block every PR until the Tokyo IAM OIDC role + SSM parameter + admin-org quota are provisioned — a chicken-and-egg. Removing the gate job for now. The two-job structure (changes → e2e) still works for `workflow_dispatch` trial runs and for PR CI feedback (non-blocking). Re-add an `e2e-gate` job and the corresponding branch protection rule once a first green run is observed. Header comment updated to flag the deferred-required-check status so future contributors know why the workflow exists without being gated. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
AWS IAM CreateRole rejects U+2014 (em-dash) in the --description argument with a validation error. Replace the em-dash with an ASCII hyphen so the script runs cleanly during bootstrap. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… gap PR #680 is sourced from a fork (G4614/boxlite). GitHub Actions does NOT pass repository variables (the `vars` context) to workflow runs triggered by `pull_request` from external forks — security rule to prevent fork PRs from reading upstream secrets/vars. Result: even though `gh variable set AWS_E2E_CLOUD_REGION` and `AWS_E2E_CLOUD_ROLE_ARN` succeeded on boxlite-ai/boxlite, the workflow run on this PR's head saw both as empty strings, and `aws-actions/configure-aws-credentials@v4` failed at input validation with "Input required and not supplied: aws-region" before any OIDC handshake could occur. Fix: hardcode the values in the workflow's env: block. Both are acceptable to commit: - `ap-northeast-1`: region name, public infrastructure information. - IAM role ARN: contains the AWS account ID (12 digits). AWS docs classify account IDs as "sensitive, not secret"; the role's safety is enforced by its trust policy (which restricts AssumeRole to GitHub OIDC tokens with sub matching `repo:boxlite-ai/boxlite:*` patterns), not by hiding the ARN. Industry-standard pattern in OSS CI infrastructure. After this PR merges, future PRs sourced from same-repo branches WILL see the repo vars again, but the hardcoded values remain the source of truth. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
apps/api/Dockerfile uses `COPY apps/libs/sdk-typescript/`, `COPY apps/dashboard/`, `COPY apps/package.json` and friends — paths that exist at the repo root, not inside apps/api/. Using `apps/api` as the build context made buildx report: ERROR: failed to compute cache key: failed to calculate checksum of ref ...: "/apps/libs/sdk-typescript": not found Switch to `-f apps/api/Dockerfile .` so the context is the repo root. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
0cdf723 to
49a95e2
Compare
There was a problem hiding this comment.
Actionable comments posted: 3
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In @.github/workflows/e2e-cloud.yml:
- Around line 42-56: Add a top-level workflow permissions block to restrict the
default GITHUB_TOKEN to the minimum (e.g., permissions: contents: read) so jobs
do not inherit broad repo access; then explicitly opt individual jobs (like the
e2e job that currently needs id-token) into additional scopes by adding
id-token: write under that job's permissions. Locate the workflow's
concurrency/group declaration (e2e-cloud-shared) and the e2e job definition to
apply the changes so default token scope is locked down and only required
permissions are granted per-job.
- Around line 91-101: The workflow's change filters are too narrow (the filters:
relevant block only matches apps/api/src/** and sdks/python/src/**) while the
job actually builds the full apps/api Docker context and the entire sdks/python
package; update the changes filter to include the full contexts (e.g.,
apps/api/** and sdks/python/**) or explicitly add Dockerfile, package*.json,
pyproject.toml and other top-level files (e.g., apps/api/Dockerfile,
apps/api/package*.json, sdks/python/pyproject.toml) so edits that affect what
gets built will trigger the e2e workflow; apply the same widening to the other
similar filters referenced in the file.
In `@scripts/test/e2e/cases/conftest.py`:
- Around line 116-118: The environment check for BOXLITE_E2E_SKIP_PATH_VERIFY is
currently treating any non-empty value as true; change the conditional to
explicitly test for canonical truthy values (e.g., lowercased value in
("1","true","yes")) so that explicit disables like "0" or "false" do not skip
verification; update the if statement around the yield/return in conftest.py to
read the env var, normalize it (lower()), and compare against the accepted true
tokens (or use a standard parser like distutils.util.strtobool) before deciding
to yield and return.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro Plus
Run ID: a9da2708-7499-4191-886f-7cadda734569
📒 Files selected for processing (5)
.github/workflows/README.md.github/workflows/e2e-cloud.yml.github/workflows/e2e-stack.ymlscripts/ci/setup-e2e-cloud-oidc.shscripts/test/e2e/cases/conftest.py
💤 Files with no reviewable changes (1)
- .github/workflows/e2e-stack.yml
There was a problem hiding this comment.
♻️ Duplicate comments (1)
scripts/test/e2e/cases/conftest.py (1)
116-118:⚠️ Potential issue | 🟡 Minor | ⚡ Quick winUse explicit truthy parsing for
BOXLITE_E2E_SKIP_PATH_VERIFY.Line 116 currently enables skip for any non-empty value, so
0/falsestill bypass verification. That contradicts the docstring’s “=1” contract.Suggested minimal fix
- if os.environ.get("BOXLITE_E2E_SKIP_PATH_VERIFY"): + if os.environ.get("BOXLITE_E2E_SKIP_PATH_VERIFY", "").strip().lower() in {"1", "true", "yes"}: yield return🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@scripts/test/e2e/cases/conftest.py` around lines 116 - 118, The environment-var check currently treats any non-empty BOXLITE_E2E_SKIP_PATH_VERIFY as truthy; change it to explicitly parse the intended truthy value (per docstring) by testing for "1" (e.g., os.environ.get("BOXLITE_E2E_SKIP_PATH_VERIFY") == "1") so only "1" skips path verification; update the conditional around the yield/return in conftest.py that references BOXLITE_E2E_SKIP_PATH_VERIFY accordingly.
🧹 Nitpick comments (1)
.github/workflows/e2e-cloud.yml (1)
87-88: ⚡ Quick winConsider pinning security-critical actions to commit SHAs.
Actions like
aws-actions/configure-aws-credentials,actions/checkout, andaws-actions/amazon-ecr-loginare security-sensitive (they handle credentials or code checkout). Pinning to commit SHAs instead of version tags prevents supply chain compromise if an action's repository is attacked.At minimum, pin the credential-handling actions:
# Example for configure-aws-credentials - uses: aws-actions/configure-aws-credentials@e3dd6a429d7300a6a4c196c26e071d42e0343502 # v4.0.2Also applies to: 115-115, 118-118, 207-207, 229-229, 233-233, 464-464
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In @.github/workflows/e2e-cloud.yml around lines 87 - 88, The workflow uses version tags for security-sensitive actions (e.g., actions/checkout, dorny/paths-filter, aws-actions/configure-aws-credentials, aws-actions/amazon-ecr-login) which is vulnerable to supply-chain attacks; update each uses: entry for these actions to pin to a specific commit SHA instead of a tag (replace e.g. actions/checkout@v5 with the corresponding commit SHA for that release), doing this for all occurrences flagged (including the configure-aws-credentials and amazon-ecr-login usages) so the workflow references immutable commits.Source: Linters/SAST tools
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Duplicate comments:
In `@scripts/test/e2e/cases/conftest.py`:
- Around line 116-118: The environment-var check currently treats any non-empty
BOXLITE_E2E_SKIP_PATH_VERIFY as truthy; change it to explicitly parse the
intended truthy value (per docstring) by testing for "1" (e.g.,
os.environ.get("BOXLITE_E2E_SKIP_PATH_VERIFY") == "1") so only "1" skips path
verification; update the conditional around the yield/return in conftest.py that
references BOXLITE_E2E_SKIP_PATH_VERIFY accordingly.
---
Nitpick comments:
In @.github/workflows/e2e-cloud.yml:
- Around line 87-88: The workflow uses version tags for security-sensitive
actions (e.g., actions/checkout, dorny/paths-filter,
aws-actions/configure-aws-credentials, aws-actions/amazon-ecr-login) which is
vulnerable to supply-chain attacks; update each uses: entry for these actions to
pin to a specific commit SHA instead of a tag (replace e.g. actions/checkout@v5
with the corresponding commit SHA for that release), doing this for all
occurrences flagged (including the configure-aws-credentials and
amazon-ecr-login usages) so the workflow references immutable commits.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro Plus
Run ID: 777a5470-4e26-49d6-a116-7147a3b7fed9
📒 Files selected for processing (5)
.github/workflows/README.md.github/workflows/e2e-cloud.yml.github/workflows/e2e-stack.ymlscripts/ci/setup-e2e-cloud-oidc.shscripts/test/e2e/cases/conftest.py
💤 Files with no reviewable changes (1)
- .github/workflows/e2e-stack.yml
✅ Files skipped from review due to trivial changes (1)
- .github/workflows/README.md
…ype check) `yarn nx build api --configuration=production` currently emits 160 TS errors against apps/api/src on main HEAD, blocking the e2e-cloud image build. The type drift is in pre-existing code (organization / webhook / user controllers + tracing.ts), unrelated to this PR's scope. Fixing those errors is a separate concern. `apps/api/project.json` already has a `development` build configuration with `skipTypeChecking: true` — webpack still produces the runtime bundle, just without the strict type-check gate. Add a `NX_BUILD_CONFIG` build arg to apps/api/Dockerfile: - Default `production` so production deploys are unchanged. - e2e-cloud workflow passes `development` to bypass the type-check gate while still producing the same `dist/apps/api/main.js` bundle the ENTRYPOINT runs. Runtime semantics are unchanged (same bundle, same entrypoint) — only the build-time strictness differs. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
🧹 Nitpick comments (1)
apps/api/Dockerfile (1)
37-37: ⚡ Quick winUse
CACHE_BUSTinstead of a hardcoded epoch token.
ARG CACHE_BUST=1is declared on Line 16, but Line 37 hardcodesecho 1777389170; this makes the arg ineffective and obscures cache invalidation intent.Proposed change
-RUN echo 1777389170 && yarn nx build api --configuration=${NX_BUILD_CONFIG} --nxBail=true +RUN echo ${CACHE_BUST} && yarn nx build api --configuration=${NX_BUILD_CONFIG} --nxBail=true🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@apps/api/Dockerfile` at line 37, The RUN line currently echoes a hardcoded epoch token which ignores the declared ARG CACHE_BUST; update the RUN command in the Dockerfile (the line invoking echo and yarn nx build api --configuration=${NX_BUILD_CONFIG} --nxBail=true) to echo the CACHE_BUST build arg instead (use $CACHE_BUST or ${CACHE_BUST}) so the declared ARG CACHE_BUST takes effect for cache invalidation, and ensure ARG CACHE_BUST is present earlier in the Dockerfile.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Nitpick comments:
In `@apps/api/Dockerfile`:
- Line 37: The RUN line currently echoes a hardcoded epoch token which ignores
the declared ARG CACHE_BUST; update the RUN command in the Dockerfile (the line
invoking echo and yarn nx build api --configuration=${NX_BUILD_CONFIG}
--nxBail=true) to echo the CACHE_BUST build arg instead (use $CACHE_BUST or
${CACHE_BUST}) so the declared ARG CACHE_BUST takes effect for cache
invalidation, and ensure ARG CACHE_BUST is present earlier in the Dockerfile.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro Plus
Run ID: e7f4cbe3-0a49-48c1-9444-cbff14532232
📒 Files selected for processing (2)
.github/workflows/e2e-cloud.ymlapps/api/Dockerfile
🚧 Files skipped from review as they are similar to previous changes (1)
- .github/workflows/e2e-cloud.yml
… build) Previous attempt: --build-arg NX_BUILD_CONFIG=development bypassed the strict TS type check but still ran webpack/vite builds. The API webpack build succeeded under that config (skipTypeChecking: true), but the dashboard vite build still failed on a tsconfig path-resolution issue (`failed to resolve "extends":"../tsconfig.base.json"`). Even fixing the dashboard issue would still incur the full webpack bundle cost, which is wasted work for an e2e smoke test against deployed code. Source-mode approach: new apps/api/Dockerfile.source that - runs apps/api/src/main.ts directly via `yarn tsx ...` at container start (tsx strips types and transpiles on the fly; never type-checks) - skips both `nx build api` and `nx build dashboard` entirely - omits the apps/dashboard/ COPY altogether — e2e cases under scripts/test/e2e/cases/ hit the API directly, no dashboard static-file serving is exercised tsconfig.base.json is symlinked to tsconfig.json so tsx auto-discovers the `@boxlite-ai/*` path aliases that resolve to libs/*/src/index.ts. Production deploys are unaffected: apps/api/Dockerfile (the production Dockerfile) is unchanged; this is a sibling file used only by the e2e-cloud workflow. Trade-offs documented in apps/api/Dockerfile.source's header: - slower cold start (~5-15s extra for tsx initial transpile) - type errors surface at runtime instead of build time (acceptable for an e2e smoke against deployed code) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
.github/workflows/e2e-cloud.yml (2)
154-166:⚠️ Potential issue | 🟠 Major | ⚡ Quick winDon't take
TargetGroups[0]from the ALB.This block asserts there is one matching load balancer, but it never asserts there is one attached target group. If another TG gets attached later, the deploy health gate can watch the wrong backend and either pass too early or fail a healthy rollout.
Suggested hardening
- API_LB_TG_ARN=$(aws elbv2 describe-load-balancers \ - --query "LoadBalancers[?starts_with(LoadBalancerName,'ApiLoadBalancer-')]|[0].LoadBalancerArn" \ - --output text \ - | xargs -I{} aws elbv2 describe-target-groups --load-balancer-arn {} \ - --query "TargetGroups[0].TargetGroupArn" --output text) + API_LB_ARN=$(aws elbv2 describe-load-balancers \ + --query "LoadBalancers[?starts_with(LoadBalancerName,'ApiLoadBalancer-')]|[0].LoadBalancerArn" \ + --output text) + API_LB_TG_COUNT=$(aws elbv2 describe-target-groups \ + --load-balancer-arn "$API_LB_ARN" \ + --query "length(TargetGroups)" --output text) + assert_one "ApiLoadBalancer-* target group" "$API_LB_TG_COUNT" + API_LB_TG_ARN=$(aws elbv2 describe-target-groups \ + --load-balancer-arn "$API_LB_ARN" \ + --query "TargetGroups[0].TargetGroupArn" --output text)🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In @.github/workflows/e2e-cloud.yml around lines 154 - 166, The script currently reads TargetGroups[0] for API_LB_TG_ARN which can pick the wrong TG if multiple exist; update the logic to query and count target groups attached to the ALB (use aws elbv2 describe-target-groups --load-balancer-arn {} with a length(...) query and store/check a TG_COUNT variable), call assert_one (or reuse existing assert_one) to ensure exactly one target group is attached, and only then retrieve the single TargetGroupArn into API_LB_TG_ARN (instead of unguarded TargetGroups[0]); ensure you reference LB_COUNT, API_LB_DNS, API_LB_TG_ARN, and the describe-target-groups call when making these changes.
42-47:⚠️ Potential issue | 🟠 MajorBind
workflow_dispatchruns to the trustede2e-cloudenvironment and avoid ambiguous ALB target-group selection.
jobs.e2ein.github/workflows/e2e-cloud.ymlhas noenvironment:;scripts/ci/setup-e2e-cloud-oidc.shtrust policy only allows...:ref:refs/heads/main,...:pull_request, or...:environment:e2e-cloud, soworkflow_dispatchon non-mainbranches won’t be able to assume the AWS role via OIDC.Suggested fix
e2e: name: E2E suite (Tokyo) needs: changes + environment: e2e-cloud # Skip on PRs that don't touch relevant paths. workflow_dispatch # always runs (operator explicitly asked for it). if: needs.changes.outputs.relevant == 'true' || github.event_name == 'workflow_dispatch'
- In
.github/workflows/e2e-cloud.yml(“Resolve stack resources”),api_lb_tg_arnis set usingTargetGroups[0].TargetGroupArnwithout asserting that exactly one target group is returned for the matched ALB; this can silently bind to the wrong target group. Add aTargetGroupscount/uniqueness assertion (or filter deterministically) before using index0.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In @.github/workflows/e2e-cloud.yml around lines 42 - 47, Add environment: e2e-cloud to the jobs.e2e definition so workflow_dispatch runs are bound to the trusted e2e-cloud environment (matching the OIDC trust conditions in scripts/ci/setup-e2e-cloud-oidc.sh); and in the “Resolve stack resources” step where api_lb_tg_arn is derived from TargetGroups[0].TargetGroupArn, assert/filter that the ALB returns exactly one matching TargetGroup (or deterministically filter by Name/Tags) before using index 0 to avoid silently selecting the wrong target group.
🧹 Nitpick comments (1)
apps/api/Dockerfile.source (1)
26-63: ⚡ Quick winRun container as non-root user.
The container runs as root by default, which violates the principle of least privilege. Even in an e2e test environment, following security best practices reduces attack surface. The official Node.js image includes a pre-configured
nodeuser (UID 1000).🔒 Proposed fix
Add the USER directive after installing dependencies but before copying application code:
COPY apps/nx.json apps/tsconfig.base.json apps/NOTICE ./ # tsx auto-discovers tsconfig.json in cwd — symlink so the path-alias # definitions in apps/tsconfig.base.json (the `@boxlite-ai/*` mappings) # resolve at import time without needing an explicit --tsconfig flag. RUN ln -sf tsconfig.base.json tsconfig.json ENV NX_DAEMON=false +USER node + COPY apps/api/ apps/api/🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@apps/api/Dockerfile.source` around lines 26 - 63, The Dockerfile runs as root; update the boxlite-source stage to switch to the unprivileged node user after installing dependencies but before copying application files: create/ensure the /boxlite workdir is owned by node (chown -R node:node /boxlite) and add a USER node (UID 1000) directive before the COPY apps/api/ ... lines so subsequent COPY, HEALTHCHECK and ENTRYPOINT (yarn tsx apps/api/src/main.ts) run as the non-root node user; keep the existing RUN npm/yarn install steps as root then switch to node.Source: Linters/SAST tools
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@apps/api/Dockerfile.source`:
- Line 26: Set a non-root runtime user and make the application directory
writable to fix Trivy DS-0002: in the Dockerfile.source ensure you create or
verify the /boxlite directory, chown it to the existing node user (e.g., chown
-R node:node /boxlite), and then add USER node before the ENTRYPOINT so the
container runs as the node user rather than root; reference the Dockerfile's
ENTRYPOINT, the /boxlite path, and the USER directive when applying the changes.
---
Outside diff comments:
In @.github/workflows/e2e-cloud.yml:
- Around line 154-166: The script currently reads TargetGroups[0] for
API_LB_TG_ARN which can pick the wrong TG if multiple exist; update the logic to
query and count target groups attached to the ALB (use aws elbv2
describe-target-groups --load-balancer-arn {} with a length(...) query and
store/check a TG_COUNT variable), call assert_one (or reuse existing assert_one)
to ensure exactly one target group is attached, and only then retrieve the
single TargetGroupArn into API_LB_TG_ARN (instead of unguarded TargetGroups[0]);
ensure you reference LB_COUNT, API_LB_DNS, API_LB_TG_ARN, and the
describe-target-groups call when making these changes.
- Around line 42-47: Add environment: e2e-cloud to the jobs.e2e definition so
workflow_dispatch runs are bound to the trusted e2e-cloud environment (matching
the OIDC trust conditions in scripts/ci/setup-e2e-cloud-oidc.sh); and in the
“Resolve stack resources” step where api_lb_tg_arn is derived from
TargetGroups[0].TargetGroupArn, assert/filter that the ALB returns exactly one
matching TargetGroup (or deterministically filter by Name/Tags) before using
index 0 to avoid silently selecting the wrong target group.
---
Nitpick comments:
In `@apps/api/Dockerfile.source`:
- Around line 26-63: The Dockerfile runs as root; update the boxlite-source
stage to switch to the unprivileged node user after installing dependencies but
before copying application files: create/ensure the /boxlite workdir is owned by
node (chown -R node:node /boxlite) and add a USER node (UID 1000) directive
before the COPY apps/api/ ... lines so subsequent COPY, HEALTHCHECK and
ENTRYPOINT (yarn tsx apps/api/src/main.ts) run as the non-root node user; keep
the existing RUN npm/yarn install steps as root then switch to node.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro Plus
Run ID: 094c7c9a-9051-4c60-ac33-d16a8ef0a9a2
📒 Files selected for processing (2)
.github/workflows/e2e-cloud.ymlapps/api/Dockerfile.source
`cargo build -p boxlite-runner` was wrong — runner is a Go binary (apps/runner/cmd/runner/main.go), not a Rust crate. The Rust workspace has no `boxlite-runner` package, hence cargo's `error: package ID specification 'boxlite-runner' did not match any packages`. Mirror .github/workflows/build-runner-binary.yml's approach: - actions/setup-go@v5 with go-version 1.25.4 - Install libx11-dev + libxtst-dev + libxinerama-dev (computer-use links X11) - Download prebuilt libboxlite.a from `v$(Cargo.toml version)` GitHub release — currently v0.9.5 ✓ released 2026-05-16 - Rewrite apps/go.work to include only the modules we need to build - go mod download per module - Build daemon (CGO_ENABLED=0) → runner/pkg/daemon/static/daemon-amd64 - Build computer-use (CGO_ENABLED=1) → runner/pkg/daemon/static/boxlite-computer-use - Build runner (CGO_ENABLED=1, links libboxlite.a, embeds daemon and computer-use) → /tmp/boxlite-runner - aws s3 cp /tmp/boxlite-runner s3://.../builds/<sha>/boxlite-runner Pinning libboxlite.a to a released version means PR changes affecting only apps/runner Go code are picked up. PR changes to sdks/c that would require a fresh libboxlite.a are out of scope here — a follow-up could either trigger a full C SDK build or just download from a more recent release tag. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…y branch Previous trust policy enumerated 3 specific subject patterns: - repo:.../ref:refs/heads/main - repo:.../pull_request - repo:.../environment:e2e-cloud workflow_dispatch on a feature branch (e.g. chore/e2e-required-merge-gate) produces subject `repo:.../ref:refs/heads/chore/e2e-required-merge-gate`, which matched none of the 3. STS rejected with "Not authorized to perform sts:AssumeRoleWithWebIdentity". Switch to `repo:boxlite-ai/boxlite:*` — accepts any subject from this specific repo. Scope is still bounded to a single repo (no other repos can assume this role), and the role's IAM policy is least-privilege (only the boxlite-e2e-ci stack resources). Trade-off: feature branches can trigger e2e-cloud via workflow_dispatch, which is a feature here (needed for iterative debugging) rather than a risk. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…te.a from source)
Runner binary rebuild via Go CGO links libboxlite.a, but the prebuilt
libboxlite.a from latest release (v0.9.5) doesn't have the
boxlite_rest_options_set_path_prefix symbol that apps/runner Go code
references — main has drifted ahead of release.
Linker error:
undefined reference to `boxlite_rest_options_set_path_prefix'
Two paths to fix:
(a) Build libboxlite.a from Rust source in the workflow (+10-15min,
requires cross-build chain for musl + libkrun)
(b) Skip runner rebuild — Tokyo runner EC2 already runs a working
binary ABI-compatible with the deployed API
Going with (b) for now. The current Tokyo runner has been up 21+ hours
and passes the API↔runner handshake checks we observed earlier. PR
changes touching apps/runner Go code are not exercised by this
workflow as a result; that's a follow-up that can wire `make
dist:runner` (which builds libboxlite.a from source) into the build path.
Removed steps:
- actions/setup-go@v5
- Install runner build deps (libx11/libxtst/libxinerama)
- Determine version + download libboxlite.a
- Rewrite apps/go.work
- Download Go modules per module
- Build daemon / computer-use / boxlite-runner
- Upload runner binary to S3
- SSM Deploy runner binary to EC2
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…M needs decorator metadata) Source-mode container with tsx crashed immediately at module load: /boxlite/node_modules/typeorm/decorator/src/decorator/columns/PrimaryColumn.ts:72 ColumnTypeUndefinedError: Column type for User#id is not defined and cannot be guessed. Make sure you have turned on an "emitDecoratorMetadata": true option in tsconfig.json. Cause: tsx uses esbuild under the hood, which has long-standing non-support for `emitDecoratorMetadata: true`. TypeORM's @column / @entity / @PrimaryColumn decorators rely on `reflect-metadata` calls the compiler EMITS when that flag is set; without them, TypeORM can't infer column types and bails before the app even starts. Switch to ts-node, which uses the TypeScript compiler properly and honors emitDecoratorMetadata from the referenced tsconfig: ENTRYPOINT ["yarn", "ts-node", "--project", "apps/api/tsconfig.app.json", "--transpile-only", "-r", "tsconfig-paths/register", "apps/api/src/main.ts"] - --project: point at apps/api/tsconfig.app.json (where emitDecoratorMetadata and experimentalDecorators live). - --transpile-only: skip type-checking at startup (faster cold boot; the prod webpack build is the source-of-truth type gate, and it's already known to fail on main with 160 errors — see Dockerfile.source header). - -r tsconfig-paths/register: resolve `@boxlite-ai/*` path aliases defined in tsconfig.base.json at module-load time. Both ts-node (10.9.1) and tsconfig-paths are already in apps/package.json dependencies (used by the typeorm migration scripts), so no dep change. Cold start is now ~10-30s instead of tsx's ~5-15s — acceptable for the ALB health-check window (3 retries × 30s + delays = ~90s minimum). If it still races, will follow up by extending the health check grace period at the ECS service or task-def level. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The IAM action `ecs:ExecuteCommand` is evaluated against BOTH the task ARN AND the cluster ARN. Granting it on the task alone produced: AccessDeniedException: not authorized to perform: ecs:ExecuteCommand on resource: arn:aws:ecs:.../cluster/boxlite-e2e-ci-ClusterCluster-... Add the cluster pattern to the Resource list so the "Init admin-org sandbox quota" step can exec into the Api task. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The "Init admin-org sandbox quota" workflow step execs into the Api container and runs `psql` against RDS to seed sandbox quota. The source-mode Dockerfile didn't install postgresql-client, so the step failed with `sh: 1: psql: not found`. Add it to the apt install line. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Schema drift: post-deploy migration 1781016743403 renamed
max_{cpu,memory,disk}_per_sandbox → max_{cpu,memory,disk}_per_box on
the organization table. The quota-init UPDATE failed with:
ERROR: column "max_cpu_per_sandbox" of relation "organization"
does not exist
Rename all three column refs in the workflow to match the live schema.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The previous run actually applied the quota UPDATE successfully: UPDATE 1 ... | 4 | 8192 | 20 But psql wrapped the row across multiple lines and triggered a "--More--" pager prompt over the ECS Exec session, so the '4 *\| *8192 *\| *20' grep didn't match. Add `-A -t -P pager=off` and `PAGER=cat` so the SELECT row is a single unaligned line "<uuid>|4|8192|20", then grep on a tighter '\|4\|8192\|20$' anchor. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The quota row printed exactly as expected: 4a1eef70-8734-4814-b7a5-0ca3522a8760|4|8192|20 but `grep -E '\|4\|8192\|20$'` still failed because SSM Session Manager uses CRLF line endings, so the `$` anchor sees `\r` immediately before end-of-line, never the literal `20`. Pipe through `tr -d '\r'` first so the anchor matches reliably. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The bubblewrap-sys / libkrun-sys / e2fsprogs-sys crates have build.rs scripts that vendor their C dependencies via git submodules. Without the submodules initialized the build panics: bubblewrap vendor source not found at .../vendor/bubblewrap Initialize submodule: git submodule update --init --recursive These C deps are runtime requirements of the runner binary (which runs inside the EC2 host), NOT of the Python SDK client. The build.rs files all honor BOXLITE_DEPS_STUB=1 and emit a no-op `/nonexistent` linker hint — exactly the right behavior here. Set the env var on the SDK build step so we don't need to also init submodules or install meson/ninja on the GHA runner. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ass dead per-service role ARNs The e2e-ci stack's per-service Task/Execution roles (ProxyTaskRole, OtelCollectorTaskRole, SshGatewayTaskRole + their Execution pairs) were deleted from IAM out-of-band — same drift as ApiTaskRole- batkzzas before this PR's session recreated it. Live ECS TDs still reference the dead ARNs, so new task launches fail at AssumeRole NoSuchEntity. Workaround: at deploy time, resolve Api's currently-live taskRoleArn + executionRoleArn (those were recreated earlier with sst:app=boxlite and trust ecs-tasks.amazonaws.com), and override the cloned TD's roles to point at those. Both trust principals match any ECS service, so Proxy / OtelCollector / SshGateway can each AssumeRole the Api role. Least-privilege rough edge: borrowers get Api's S3 + log perms they don't need — acceptable for e2e-ci (no prod traffic). Admin's `sst deploy --stage e2e-ci` will recreate per-service roles properly; we can drop this override then.
… to test their deploy paths
c07f7cd to
8b66ae1
Compare
…w_call
Refactor the e2e-cloud workflow so the Api build+deploy chain is no
longer duplicated inline. Adds a deploy_api job that workflow_call's
.github/workflows/deploy-api.yml (mirroring how deploy_runner already
workflow_call's deploy-runner.yml). Both deploys are gated on actual
source changes via the existing `changes` paths-filter:
- runner_chain → deploy_runner
- api → deploy_api
PRs touching only api skip the ~11 min runner build chain; PRs
touching only runner skip the Docker buildx + ECS register-TD +
wait-services-stable round-trip; PRs touching only tests skip both.
E2e job (pytest) now needs all three (changes + deploy_runner +
deploy_api) and reuses the same skip-if-failure-only gate
(!failure() && !cancelled() && any-relevant-change) so it still runs
when an upstream job was correctly skipped.
Dropped from the inline e2e job:
- ECR login step + Build & push Api image step (deploy-api.yml)
- Deploy Api (rolling) step incl. clone-TD + UpdateService + ALB
healthy poll (deploy-api.yml does it + env patches for stale
S3_ACCESS_KEY / storage-bucket name + role override).
- SSM-agent-online preflight check on the runner (obsolete since
deploy-runner switched to SSH+SCP via EC2 Instance Connect).
- ECR repo describe preflight (now exercised inside deploy-api.yml).
Kept inline (still needed by pytest):
- Resolve stack resources (cluster, ALB DNS/TG, runner id, S3 bucket).
- /api/health probe through the LB before tests run.
- Init admin-org sandbox quota via ECS Exec.
- Python SDK build/install + pytest profile/snapshot setup + run.
- On-failure CloudWatch + journalctl dump.
Net: -104 lines, one source of truth per deploy.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… PortSpec impl merging upstream/main into the PR brought in #741's PortSpec-based WithPort but the session's earlier stub (added to unblock the runner build before #741 landed) remained in place, causing duplicate declarations: ports redeclared in struct boxConfig WithPort redeclared in this block cannot use portMapping{} as PortSpec value in append Remove the stub block: stub's ports field, portMapping struct, and stub WithPort func. #741's PortSpec + WithPort + ports []PortSpec are the canonical definitions.
The composed e2e-cloud pipeline previously deployed only Api + Runner. Supporting ECS services (Proxy / OtelCollector / SshGateway) went un-deployed in the CI loop — admin had to run `sst deploy --stage e2e-ci` to update them. This composes deploy-app-services.yml the same way as deploy-runner / deploy-api: - `changes` job adds paths-filter outputs for proxy / otel_collector / ssh_gateway, plus an `appsvc` step that emits a comma-separated list of services whose source actually changed. - New `deploy_app_services` job: workflow_call invokes deploy-app- services.yml, passing the changed-list via `services` input. Inner workflow's per-service gating then deploys only the touched ones. - `e2e` job needs += deploy_app_services so pytest waits for the supporting services to come up. Also: extend deploy-app-services.yml's workflow_call signature to accept the `services` input (it already had the same input on workflow_dispatch). Net: pure runner / api / test PRs skip the 3 supporting deploys entirely; otel-only PRs skip runner + api + the other 2 services.
…uota init The previous step UPDATE'd organization WHERE id='4a1eef70-...' — a UUID that happened to be the admin's default-org id in earlier e2e-ci stack instances. After this PR's DB schema reset (required to recover from #736's pre-launch migration squash being deployed against a stack whose history was already past those migrations), the app re-runs initializeAdminUser at boot and creates a new admin user + auto- generates a new default-org UUID. The hardcoded id no longer matches, UPDATE touches 0 rows, SELECT returns empty, the step fails. Switch the SQL to resolve the admin's default org via a subquery against organization_user (userId='boxlite-admin' AND isDefaultForUser=true) — that membership row is what app code already uses to find the admin's working org, and survives any future schema reset where the org UUID changes.
Merging the substantive bits of #745 and #746 here so PR #724's deploy-app-services CI can actually push an otel-collector image: * otel-collector/builder-config.yaml: fix module paths after #730 moved everything under apps/ (was: ../../../api-client-go etc., now: ../../../apps/api-client-go). The OCB build was failing with `filepath does not exist: /boxlite/otel-collector/exporter`. (#745) * otel-collector/Dockerfile: apk add python3 make g++ so node-gyp can compile any musl-incompatible native add-on during workspace `yarn install`. Kept even after dropping dockerode below — small safety net in case future deps reintroduce a native build. (#745) * apps/package.json: drop unused dockerode + @types/dockerode. They were inherited from the Daytona import in #460 but never imported in source. Drops the dockerode -> docker-modem -> ssh2 -> cpu-features chain; cpu-features was the optional native addon that forced the toolchain workaround in the first place. (#746) * apps/yarn.lock: regenerated (-161 lines). * apps/api-client-go/api/openapi.yaml: regenerate to catch up with main's existing `assignedRoleIds.default: []` drift so the api-client-drift PR check passes.
… gone The previous commit kept #745's apk add as a safety net, but cpu-features was the only musl-incompatible native dep in the workspace and it came in only via the now-removed dockerode chain. Verified locally: yarn install on node:22-alpine completes without any compiler toolchain.
PR-on-open triggering serializes every PR against the shared Tokyo stack (see concurrency: e2e-cloud-shared, cancel-in-progress: false). The suite is slow enough that gating every PR open / push on it just queues runs behind each other for hours. Push-to-main keeps the post-merge regression signal; workflow_dispatch covers ad-hoc PR validation when wanted.
Today an OtelCollector deploy on chore/e2e-required-merge-gate failed with "ECS rolled back. PRIMARY=...:7 expected ...:6" — two parallel Deploy App Services runs (push + dispatch on the same SHA) both registered task definitions, and one's UpdateService rolled the other back. Add per-workflow concurrency groups against the shared Tokyo stack to prevent the race. cancel-in-progress: false because a half-applied ECS rolling update / EC2 userdata swap is worse than waiting. - deploy-api.yml: group=deploy-api-shared - deploy-app-services.yml: group=deploy-app-services-shared - deploy-runner.yml: group=deploy-runner-shared - _deploy-single-service.yml: group=deploy-single-service-<service_name> (parameterized so different services parallelize but same-service callers serialize — belt-and-suspenders against the outer group)
Resolve stack resources was picking the API LB target group via TargetGroups[0] without checking how many target groups were attached. Blue/green deploys or extra listener rules can attach more than one TG to an ALB; in that case the e2e suite would silently bind smoke traffic to the wrong TG. Use the same assert_one guard as the cluster and LB checks above. Hoist the LB ARN into a variable so the count + the final lookup share one resolution.
…002) CodeRabbit / Trivy DS-0002 flagged the source-mode container running as root. Switch to the pre-existing `node` user (UID 1000 in node:24-slim) right before HEALTHCHECK + ENTRYPOINT. All RUN/COPY happen as root above, so the installed tree is read-only at runtime — least-privilege without needing a chown -R.
|
@coderabbitai full review |
✅ Action performedFull review finished. |
Bare `if os.environ.get(...)` is truthy for any non-empty string, so setting the var to "0" or "false" — the conventional "off" — still skipped the runner-journal verification. Accept the standard truthy spellings only.
Previously the e2e-cloud.yml pipeline glued deploy_{runner,api,app_services}
and the pytest suite into one workflow_dispatch entry, so the only
way to "rerun the tests" was to also redeploy everything (workflow_dispatch
forces all three deploy gates true). With the Tokyo stack as a shared
singleton that gets left in whatever state the last run produced, this
made cheap "rerun against current stack" cycles much more expensive
than they need to be.
* New: .github/workflows/e2e-cloud-test.yml. workflow_call + workflow_dispatch.
Contains only the e2e job — resolve resources, probe LB, init
admin-org quota, build SDK, configure pytest, run suite, collect
logs on failure. Shares the e2e-cloud-shared concurrency group so a
standalone test dispatch and an in-flight deploy queue against each
other (can't race the Tokyo stack).
* e2e-cloud.yml: e2e job is now a thin `uses:` wrapper around the new
workflow. Gating logic (skip-if-upstream-failed, paths-changed-only-
on-push) is preserved at the wrapper level; standalone dispatch on
e2e-cloud-test.yml skips those gates and just runs.
Header on e2e-cloud-test.yml warns explicitly that the stack state is
the dispatcher's responsibility — results reflect whatever's deployed,
not the branch the workflow was dispatched from.
The single `runner_source` filter triggered the full ~10 min libboxlite
Rust build on any path in the runner's dep graph, including Go-only
diffs (apps/runner, apps/daemon, sdks/go, ...) that just relink against
the existing libboxlite.a without changing it. Split into two filters:
* libboxlite_chain: paths whose content ends up in libboxlite.a
(src/boxlite, src/shared, src/deps, sdks/c, scripts/build, Cargo.{toml,lock})
* go_runner_chain: Go paths that consume libboxlite.a but don't
contribute to its bytes (apps/{runner,daemon,common-go,api-client-go,
libs/computer-use}, sdks/go)
`build_c_sdk` now runs only when libboxlite_chain changed.
`build_runner` switches libboxlite_source to "release" (pull
v${VERSION} tarball from GitHub Release) when libboxlite_chain didn't
change, "build" (use sibling build_c_sdk artifact) otherwise.
`build_runner`'s `if:` adds the !failure() && !cancelled() guard so
the skipped build_c_sdk on Go-only diffs doesn't cascade-skip it.
Also drop the dead `src/api-client/**` path from both deploy-runner.yml
and e2e-cloud.yml — that directory doesn't exist in the repo, so the
entry never matched anything.
workflow_dispatch / workflow_call behavior is unchanged — both still
force libboxlite_changed=true and rebuild from source. Only push-event
auto-triggering is optimized.
…layers
The bare ${SQL} interpolation went through GHA bash → aws ecs --command
→ sh -c → psql -c — four layers of double-quote parsing. The SQL's own
PG identifier quotes ("organizationId", "organization_user", etc.)
collided with the inner `\"...\"` bounds and got stripped, so psql
received the bare token `organizationId` which Postgres case-folds to
`organizationid` and reports as missing.
Encode the SQL to base64 on the runner (base64 chars never need quoting),
inline that into the --command string, then `base64 -d | psql -f -`
inside the container. psql sees the literal SQL bytes unchanged.
Last dispatched e2e-cloud run failed exactly here; the column is in
the schema (migrations/1741087887225-migration.ts:64), the bug was
purely shell-escape.
Drop this block before merging to main. GH won't expose workflow_dispatch on a brand-new workflow until it has run at least once; the push trigger (scoped to this branch and to self-modifications only) gives us that first run.
POST /api/snapshots and GET /api/snapshots?name=... no longer exist on the API — there is no snapshot controller in apps/api/src/**. PR #735 (feat(box): create boxes from curated images) replaced the pre-register-then-boot model with lazy curated-image pulls at box start, so the e2e test workflow no longer needs to seed snapshot rows. The pre-registration block was failing the "Configure pytest profile p1" step with `Cannot POST /api/snapshots` 404. The previous run got masked by the SQL-escape bug in quota init (now fixed); this surfaced as the next blocker.
waitForToolboxReady was introduced in fc88aa0 (2026-06-05, "feat: add agent-ready runtime catalog", which then landed on main via PR #715 on 2026-06-10). It HTTP-polls http://127.0.0.1:<hostPort>/version expecting a daemon on guest TCP port 2280 — that interface dates from the Daytona-daemon era and was never reimplemented after the Rust guest agent rewrite landed in dbb11ec (2026-04-01). The new agent binds **vsock://2695** for gRPC + notifies the host via vsock://2696; nothing inside the VM listens on TCP:2280, so libkrun's port-forward accepts the SYN and immediately reset-by-peer's, and every CREATE_BOX times out 30 s in. Production data from a Tokyo runner: in 24 h, 490 CREATE_BOX events, 0 toolbox-ready successes, 181 toolbox-ready failures. The exec path that fires immediately afterward (via the same vsock gRPC channel) **always succeeds** — confirming the box VM is healthy, the readiness check itself is the bug. Remove the dead probe: drop waitForToolboxReady from client.go's Create and Start, drop the function + its TCP/HTTP imports, drop the toolboxReadyTimeout field, drop the ToolboxReadyTimeout/ DaemonStartTimeoutSec config plumbing in main.go + config.go, drop the two now-unreachable tests. Box readiness is now signalled by bx.Start(ctx) returning (which itself blocks on the vsock notification from the guest). Branched off chore/e2e-required-merge-gate (PR #724) so the e2e-cloud stack picks this up next dispatch.
This PR is the same-repo version of #680. Same content, but the head
branch is now on
boxlite-ai/boxlitedirectly instead of a fork. Reason:PRs sourced from a fork strip the
vars/secretscontexts and theid-token: writepermission from the workflow run by GitHub Actionssecurity policy, which makes the OIDC handshake in
aws-actions/configure-aws-credentials@v4impossible.Closing #680 (link incoming).
What this PR contains
ci(e2e): make e2e suite a required merge gate— original ci(e2e): make e2e suite a required merge gate (stacks on #678) #680 commit(3-job structure for changes / e2e / e2e-gate; e2e-gate was later
dropped in this stack, see below)
test(e2e): add BOXLITE_E2E_SKIP_PATH_VERIFY toggle for cloud CI—conftest fixture short-circuits when the cloud runner journal isn't
reachable from the test host (i.e. cloud CI where the runner is on
a remote EC2)
ci: add IAM OIDC role provisioning for e2e-cloud workflow— newscripts/ci/setup-e2e-cloud-oidc.shprovisionsboxlite-e2e-cloud-github-actionsIAM role (separate from theus-east-1
boxlite-e2e-github-actionsrole used bye2e-test.yml)ci(e2e): rename e2e-stack → e2e-cloud, target deployed Tokyo stack—the actual workflow rewrite: deploys API container + runner binary
from this checkout, then runs
pytest scripts/test/e2e/cases/against the deployed Tokyo
boxlite-e2e-ci-*stackdocs(workflows): document e2e-cloud and fix stale setup-aws-oidc.sh refci(e2e-cloud): drop gate job — defer 'required check' until Tokyo provisioned—removed the always-on
e2e-gatejob; deferred branch protectionrequired-check registration until the workflow has a green run
ci(e2e-cloud): hardcode AWS_REGION + AWS_ROLE_ARN to fix fork-PR vars gap—AWS account ID is "sensitive, not secret" per AWS docs; role safety is
enforced by trust policy. This commit is from when we were still
trying to make ci(e2e): make e2e suite a required merge gate (stacks on #678) #680 work; now that this is a same-repo PR, the
${{ vars.* }}references would also work, but the hardcoded valuesare kept as the source of truth.
AWS prerequisites already provisioned (one-time, done before this PR opens)
token.actions.githubusercontent.com(alreadyexisted, shared with
e2e-test.yml)arn:aws:iam::064212132677:role/boxlite-e2e-cloud-github-actionswith the scoped inline policy from
scripts/ci/setup-e2e-cloud-oidc.sh/boxlite/e2e-ci/admin-api-keyAWS_E2E_CLOUD_REGION+AWS_E2E_CLOUD_ROLE_ARNon
boxlite-ai/boxlite(now redundant with the hardcoded values,but kept for future-friendly same-repo-PR runs)
Test plan
changesjob paths match this PR's diff → e2e job firesid-token: write)SSM update, admin-quota init, pytest run) iterate from there
Summary by CodeRabbit
Documentation
Infrastructure
Tests
Chores