ci(e2e): e2e-cloud workflow + IAM OIDC provisioning (replaces #680) by G4614 · Pull Request #724 · boxlite-ai/boxlite

G4614 · 2026-06-10T14:09:49Z

This PR is the same-repo version of #680. Same content, but the head
branch is now on boxlite-ai/boxlite directly instead of a fork. Reason:
PRs sourced from a fork strip the vars / secrets contexts and the
id-token: write permission from the workflow run by GitHub Actions
security policy, which makes the OIDC handshake in
aws-actions/configure-aws-credentials@v4 impossible.

Closing #680 (link incoming).

What this PR contains

ci(e2e): make e2e suite a required merge gate — original ci(e2e): make e2e suite a required merge gate (stacks on #678) #680 commit
(3-job structure for changes / e2e / e2e-gate; e2e-gate was later
dropped in this stack, see below)
test(e2e): add BOXLITE_E2E_SKIP_PATH_VERIFY toggle for cloud CI —
conftest fixture short-circuits when the cloud runner journal isn't
reachable from the test host (i.e. cloud CI where the runner is on
a remote EC2)
ci: add IAM OIDC role provisioning for e2e-cloud workflow — new
scripts/ci/setup-e2e-cloud-oidc.sh provisions
boxlite-e2e-cloud-github-actions IAM role (separate from the
us-east-1 boxlite-e2e-github-actions role used by e2e-test.yml)
ci(e2e): rename e2e-stack → e2e-cloud, target deployed Tokyo stack —
the actual workflow rewrite: deploys API container + runner binary
from this checkout, then runs pytest scripts/test/e2e/cases/
against the deployed Tokyo boxlite-e2e-ci-* stack
docs(workflows): document e2e-cloud and fix stale setup-aws-oidc.sh ref
ci(e2e-cloud): drop gate job — defer 'required check' until Tokyo provisioned —
removed the always-on e2e-gate job; deferred branch protection
required-check registration until the workflow has a green run
ci(e2e-cloud): hardcode AWS_REGION + AWS_ROLE_ARN to fix fork-PR vars gap —
AWS account ID is "sensitive, not secret" per AWS docs; role safety is
enforced by trust policy. This commit is from when we were still
trying to make ci(e2e): make e2e suite a required merge gate (stacks on #678) #680 work; now that this is a same-repo PR, the
${{ vars.* }} references would also work, but the hardcoded values
are kept as the source of truth.

AWS prerequisites already provisioned (one-time, done before this PR opens)

IAM OIDC provider on token.actions.githubusercontent.com (already
existed, shared with e2e-test.yml)
IAM role arn:aws:iam::064212132677:role/boxlite-e2e-cloud-github-actions
with the scoped inline policy from
scripts/ci/setup-e2e-cloud-oidc.sh
SSM SecureString parameter /boxlite/e2e-ci/admin-api-key
Repo variables AWS_E2E_CLOUD_REGION + AWS_E2E_CLOUD_ROLE_ARN
on boxlite-ai/boxlite (now redundant with the hardcoded values,
but kept for future-friendly same-repo-PR runs)

Test plan

e2e-cloud workflow's changes job paths match this PR's diff → e2e job fires
e2e job's OIDC auth step should now succeed (same-repo PR has
id-token: write)
Subsequent steps (build api image, push to ECR, ECS deploy, runner
SSM update, admin-quota init, pytest run) iterate from there

Summary by CodeRabbit

Documentation
- Updated CI workflow docs and added full cloud e2e regression suite documentation.
Infrastructure
- Added an always-on Tokyo cloud e2e workflow for full regression runs; serializes concurrent runs and supports manual dispatch.
- Removed the legacy stack-based e2e workflow.
- Added a provisioning helper to set up OIDC-based AWS permissions for the cloud workflow.
Tests
- e2e fixtures can skip runner path verification in constrained CI environments.
Chores
- Made API/dashboard build configuration parameterizable and added a source-mode API image for cloud e2e runs.

coderabbitai · 2026-06-10T14:10:08Z

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

@coderabbitai resume to resume automatic reviews.
@coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

▶️ Resume reviews
🔍 Trigger review

📝 Walkthrough

Walkthrough

Adds an always-on Tokyo e2e GitHub Actions workflow, an OIDC IAM provisioning script, API/runner build-and-deploy steps, pytest e2e execution with diagnostics, README and fixture updates, and a Dockerfile build-arg for configurable Nx builds.

Changes

Tokyo Always-on E2E Stack

Layer / File(s)	Summary
AWS OIDC IAM provisioning `scripts/ci/setup-e2e-cloud-oidc.sh`	Creates/verifies GitHub OIDC provider, builds assume-role trust and scoped permissions policy, creates/updates IAM role, and prints role ARN and GitHub Actions variable guidance.
Workflow metadata, concurrency, change gating, and docs `.github/workflows/e2e-cloud.yml`, `.github/workflows/README.md`	Adds e2e-cloud workflow docs and metadata, sets shared concurrency and Tokyo env vars, adds `dorny/paths-filter` changes job to gate expensive runs, and records the omitted `e2e-gate` note; fixes README provisioning script reference.
E2E prerequisites and resource resolution `.github/workflows/e2e-cloud.yml`	Main e2e job provisions OIDC credentials, resolves ECS cluster, ALB/target group, runner EC2 instance, and S3 bucket with exact-count assertions, and validates ECR existence and runner SSM agent Online.
API container and runner binary builds `.github/workflows/e2e-cloud.yml`, `apps/api/Dockerfile.source`, `apps/api/Dockerfile`	Authenticates to ECR, builds and pushes API image using `Dockerfile.source` (git SHA tag), builds `boxlite-runner` binary (Go/Cargo/libboxlite), uploads binary to S3, and adds `NX_BUILD_CONFIG` build-arg to the API Dockerfile.
API and runner service deployments `.github/workflows/e2e-cloud.yml`	Deploys API by cloning/registering ECS task definition with the new image and waiting for ALB health; updates runner via SSM `AWS-RunShellScript` to atomically replace the binary and restart the systemd service with polling.
Quota initialization, SDK build, pytest execution, and diagnostics `.github/workflows/e2e-cloud.yml`	Performs idempotent admin-org quota init via ECS Exec, builds/installs Python SDK, writes masked pytest credentials from SSM, runs e2e pytest with per-test timeouts and JUnit XML, uploads artifact, and collects CloudWatch logs and runner journal via SSM on failure.

Sequence Diagram(s)

sequenceDiagram
  participant GitHubActions
  participant AWS as AWS(ECR/ECS/SSM/S3/ALB)
  participant API as API(ECS Service / ALB)
  participant Runner as Runner(EC2 boxlite-runner)
  participant Pytests as Pytests

  GitHubActions->>AWS: obtain OIDC id-token & assume role
  GitHubActions->>AWS: resolve ECS cluster, ALB DNS/target-group, runner instance, S3 bucket
  GitHubActions->>AWS: build & push API image to ECR (apps/api/Dockerfile.source)
  GitHubActions->>AWS: build runner binary and upload to S3 (builds/<sha>/boxlite-runner)
  GitHubActions->>AWS: register new ECS task definition with new image -> update service (rolling deploy)
  AWS->>API: deploy new task -> ALB health checks
  GitHubActions->>Runner: send SSM AWS-RunShellScript to download binary, replace executable, restart systemd
  GitHubActions->>AWS: ECS Exec into API task to run psql quota init
  GitHubActions->>Pytests: install SDK, write credentials (from SSM), run pytest via ALB DNS
  Pytests->>API: e2e requests via ALB DNS
  GitHubActions->>AWS: on failure, fetch CloudWatch logs and runner journal via SSM

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Suggested reviewers

DorianZheng

Poem

🐰 I hopped from repo to cloud at night,

Pushed images, keys, and deploy delight,
OIDC gates and runner restarts,
Tests that dance through Tokyo parts,
CI carrots gleam — green checks in sight.

🚥 Pre-merge checks | ✅ 5

✅ Passed checks (5 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title accurately reflects the main change: introducing the e2e-cloud workflow and IAM OIDC provisioning as a replacement for the previous approach.
Docstring Coverage	✅ Passed	Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch chore/e2e-required-merge-gate

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

PR #678 wires up an e2e suite but its workflow only triggers on PRs that touch specific paths. GitHub branch protection requires a check to ALWAYS report a result — a path-filtered workflow that doesn't fire leaves the required check pending forever, so the PR can't merge. Reshape so e2e can be a required check: 1. Drop `paths:` from the trigger — workflow always starts on every PR to main. Inside, a cheap `changes` job on a GitHub-hosted runner does the path-filter detection via dorny/paths-filter@v3. 2. The expensive `e2e` job (self-hosted kvm runner, ~10-30 min) stays gated on `changes.outputs.relevant == true`. PRs that don't touch relevant paths skip the kvm runner entirely — no contention cost. 3. New `e2e-gate` job ALWAYS runs (`if: always()`) and collapses the conditional outcome into one stable check name `E2E gate (required)`. Reports success if e2e passed OR no relevant paths changed; fail if e2e ran and failed. Branch protection should be configured (separately, via repo Settings → Branches → main) to require the `E2E gate (required)` status check. That's a repo-settings action a maintainer needs to do once after this merges. Cost on irrelevant PRs: one ~10s GitHub-hosted job for `changes` + ~5s for `e2e-gate`. Negligible. No self-hosted runner consumed. Cost on relevant PRs: same as before (full e2e on kvm runner). Stacks on #678 — this commit only makes sense once that PR's e2e-stack.yml is in main. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The autouse `verify_runner_saw_all_boxes` fixture shells out to `journalctl -u boxlite-runner`, which only resolves on a host where the runner systemd service is installed (the current self-hosted KVM runner). When the e2e suite migrates to a GitHub-hosted runner pointing at a remote stack (Tokyo Api/Runner deployment), journalctl returns zero hits and every box-creating test fails on the path-verify assert even though the box actually reached the remote runner. Setting BOXLITE_E2E_SKIP_PATH_VERIFY=1 short-circuits the fixture (yield-and-return before the journalctl probe). The cloud-CI workflow will set this; the self-hosted KVM job keeps the guard. The FFI-bypass risk this guard defends against doesn't apply on a stock GHA runner — no KVM means libkrun can't start a VM, so a silent fallback would surface as a hard error, not as a passing-but-wrong test. Losing the guard in that environment loses no real safety net. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Adds scripts/ci/setup-e2e-cloud-oidc.sh — sibling to setup-ci-runner.sh but for the cloud-stack regression suite (`.github/workflows/e2e-cloud.yml`, added in the next commit). A separate IAM role (`boxlite-e2e-cloud-github-actions`) is needed because the existing `boxlite-e2e-github-actions` role is tuned for the us-east-1 KVM-runner provisioning path. Cloud e2e against the ap-northeast-1 Tokyo stack needs an orthogonal permission set — ECR push, ECS update-service + execute-command, SSM send-command to a specific runner EC2, S3 builds/ artifact upload, parameter-store reads. Splitting the roles keeps each one auditable under least-privilege. Scoping: - Trust policy `sub` is an enumerated allowlist (main pushes, pull_request, environment:e2e-cloud) — a stray workflow added on a feature branch can't assume the role. - ssm:SendCommand is gated by `ssm:resourceTag/Name=boxlite-runner` so the role can only shell the runner EC2. - ecs:UpdateService restricted to boxlite-e2e-ci-*/Api service ARN; ecs:ExecuteCommand to boxlite-e2e-ci-*/* task ARNs. - Resource wildcards remaining (ecr:GetAuthorizationToken, ssmmessages:*, ecs:Describe*, elbv2:Describe*) are forced by AWS not supporting resource-level scoping for those actions. Sources scripts/common.sh for the shared log helpers so this script's output matches setup-ci-runner.sh. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

@v5

The previous e2e-stack.yml bootstrapped a full local stack (postgres, redis, docker registry, boxlite-runner systemd, boxlite-api) inside a self-hosted KVM runner per PR — slow to maintain (bootstrap drift, KVM runner contention) and only covers self-bootstrap state. This workflow keeps the same 3-job structure introduced by PR #680 (changes / e2e / e2e-gate) so the `E2E gate (required)` branch protection check is unchanged, but replaces the `e2e` job: - runs-on: ubuntu-latest (no /dev/kvm needed) - authenticates to AWS via OIDC using the new boxlite-e2e-cloud-github-actions role - builds the Api container image + boxlite-runner musl binary from this checkout - deploys to the always-on Tokyo stack: rolling-updates the Api ECS service to the new image and SSM-replaces the runner EC2's binary - initialises the admin-org sandbox quota row in RDS via ECS Exec into the Api task (DB_PASSWORD stays in container env — never in the SSM command body) - builds & installs the Python SDK from this checkout (path_prefix and other source-tree additions land ahead of the PyPI release) - runs `pytest scripts/test/e2e/cases/` against http://<api-lb>/api with BOXLITE_E2E_SKIP_PATH_VERIFY=1 (journalctl-based path verify doesn't reach the remote runner) - on failure, dumps Api CloudWatch logs + boxlite-runner journalctl Concurrency uses a constant group (`e2e-cloud-shared`) because every run — push + PR — competes for the same singleton Tokyo stack. Per-ref grouping would let two PRs interleave ECS rolling updates. Resource discovery (cluster id, LB DNS, runner instance id, S3 bucket, ECR repo) is by-pattern at workflow start, with a strict count=1 assertion so an orphaned `ApiLoadBalancer-*` from a partial SST teardown fails the job loudly instead of silently binding to the wrong target. SSM agent ping-status is verified before the runner update step to fail fast on a dead agent. Adversarial review (multiple lenses) caught and this commit addresses: - credentials.toml written via printf rather than unquoted heredoc so a `$` inside the admin key can't be shell-expanded; ADMIN_KEY immediately registered via `::add-mask::` - ECS execute-command output is captured and grepped for the expected SELECT row (Session Manager exit-code propagation has historically been unreliable through --interactive without a tty) - `wait services-stable` is followed by ALB target-health polling AND a /api/health curl with retry — services-stable alone races the LB health check - pytest wrapped in `timeout 35m` plus `--timeout=180 --timeout-method=thread` so a hung VM doesn't burn the 45-min job timeout before `Collect logs on failure` runs - Rust build cached via Swatinem/rust-cache@v2 so repeat PR builds are incremental - both actions/checkout occurrences bumped to @v5 to match the newer workflows in the repo Known limits (TODO/follow-up, not blocking): - boxlite-runner has no documented in-flight drain; restart drops any request landing during the swap window. Acceptable for e2e (no concurrent users) but a production-grade rollout would need a runner-side graceful drain. - No post-run restore of `main` HEAD on the Tokyo stack: the stack is left running the last e2e-cloud run's image. Console inspection of `api-<sha>` identifies the running revision. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

- Add `e2e-cloud.yml` section to .github/workflows/README.md mirroring the structure of the e2e-test.yml section (Why / Architecture / Triggers / Cost / Authentication / Required variables / Required AWS resources / Concurrency / Stack state). - Side-by-side table contrasts the two e2e workflows' IAM roles, regions, and scopes so future maintainers can tell at a glance why both exist. - Fix line 198 which referenced `scripts/ci/setup-aws-oidc.sh` — that filename does not exist; the actual provisioning script is `scripts/ci/setup-ci-runner.sh`. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…visioned PR #680's original goal was to make the e2e suite a required merge gate. With the workflow now retargeted at the always-on Tokyo stack (previous commit), keeping `e2e-gate` as `(required)` would block every PR until the Tokyo IAM OIDC role + SSM parameter + admin-org quota are provisioned — a chicken-and-egg. Removing the gate job for now. The two-job structure (changes → e2e) still works for `workflow_dispatch` trial runs and for PR CI feedback (non-blocking). Re-add an `e2e-gate` job and the corresponding branch protection rule once a first green run is observed. Header comment updated to flag the deferred-required-check status so future contributors know why the workflow exists without being gated. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

AWS IAM CreateRole rejects U+2014 (em-dash) in the --description argument with a validation error. Replace the em-dash with an ASCII hyphen so the script runs cleanly during bootstrap. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

… gap PR #680 is sourced from a fork (G4614/boxlite). GitHub Actions does NOT pass repository variables (the `vars` context) to workflow runs triggered by `pull_request` from external forks — security rule to prevent fork PRs from reading upstream secrets/vars. Result: even though `gh variable set AWS_E2E_CLOUD_REGION` and `AWS_E2E_CLOUD_ROLE_ARN` succeeded on boxlite-ai/boxlite, the workflow run on this PR's head saw both as empty strings, and `aws-actions/configure-aws-credentials@v4` failed at input validation with "Input required and not supplied: aws-region" before any OIDC handshake could occur. Fix: hardcode the values in the workflow's env: block. Both are acceptable to commit: - `ap-northeast-1`: region name, public infrastructure information. - IAM role ARN: contains the AWS account ID (12 digits). AWS docs classify account IDs as "sensitive, not secret"; the role's safety is enforced by its trust policy (which restricts AssumeRole to GitHub OIDC tokens with sub matching `repo:boxlite-ai/boxlite:*` patterns), not by hiding the ARN. Industry-standard pattern in OSS CI infrastructure. After this PR merges, future PRs sourced from same-repo branches WILL see the repo vars again, but the hardcoded values remain the source of truth. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

apps/api/Dockerfile uses `COPY apps/libs/sdk-typescript/`, `COPY apps/dashboard/`, `COPY apps/package.json` and friends — paths that exist at the repo root, not inside apps/api/. Using `apps/api` as the build context made buildx report: ERROR: failed to compute cache key: failed to calculate checksum of ref ...: "/apps/libs/sdk-typescript": not found Switch to `-f apps/api/Dockerfile .` so the context is the repo root. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

coderabbitai

Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In @.github/workflows/e2e-cloud.yml:
- Around line 42-56: Add a top-level workflow permissions block to restrict the
default GITHUB_TOKEN to the minimum (e.g., permissions: contents: read) so jobs
do not inherit broad repo access; then explicitly opt individual jobs (like the
e2e job that currently needs id-token) into additional scopes by adding
id-token: write under that job's permissions. Locate the workflow's
concurrency/group declaration (e2e-cloud-shared) and the e2e job definition to
apply the changes so default token scope is locked down and only required
permissions are granted per-job.
- Around line 91-101: The workflow's change filters are too narrow (the filters:
relevant block only matches apps/api/src/** and sdks/python/src/**) while the
job actually builds the full apps/api Docker context and the entire sdks/python
package; update the changes filter to include the full contexts (e.g.,
apps/api/** and sdks/python/**) or explicitly add Dockerfile, package*.json,
pyproject.toml and other top-level files (e.g., apps/api/Dockerfile,
apps/api/package*.json, sdks/python/pyproject.toml) so edits that affect what
gets built will trigger the e2e workflow; apply the same widening to the other
similar filters referenced in the file.

In `@scripts/test/e2e/cases/conftest.py`:
- Around line 116-118: The environment check for BOXLITE_E2E_SKIP_PATH_VERIFY is
currently treating any non-empty value as true; change the conditional to
explicitly test for canonical truthy values (e.g., lowercased value in
("1","true","yes")) so that explicit disables like "0" or "false" do not skip
verification; update the if statement around the yield/return in conftest.py to
read the env var, normalize it (lower()), and compare against the accepted true
tokens (or use a standard parser like distutils.util.strtobool) before deciding
to yield and return.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro Plus

Run ID: a9da2708-7499-4191-886f-7cadda734569

📥 Commits

Reviewing files that changed from the base of the PR and between 2d93cc3 and 0cdf723.

📒 Files selected for processing (5)

.github/workflows/README.md
.github/workflows/e2e-cloud.yml
.github/workflows/e2e-stack.yml
scripts/ci/setup-e2e-cloud-oidc.sh
scripts/test/e2e/cases/conftest.py

💤 Files with no reviewable changes (1)

.github/workflows/e2e-stack.yml

coderabbitai

♻️ Duplicate comments (1)

scripts/test/e2e/cases/conftest.py (1)

116-118: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Use explicit truthy parsing for BOXLITE_E2E_SKIP_PATH_VERIFY.

Line 116 currently enables skip for any non-empty value, so 0/false still bypass verification. That contradicts the docstring’s “=1” contract.

Suggested minimal fix

-    if os.environ.get("BOXLITE_E2E_SKIP_PATH_VERIFY"):
+    if os.environ.get("BOXLITE_E2E_SKIP_PATH_VERIFY", "").strip().lower() in {"1", "true", "yes"}:
         yield
         return

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@scripts/test/e2e/cases/conftest.py` around lines 116 - 118, The
environment-var check currently treats any non-empty
BOXLITE_E2E_SKIP_PATH_VERIFY as truthy; change it to explicitly parse the
intended truthy value (per docstring) by testing for "1" (e.g.,
os.environ.get("BOXLITE_E2E_SKIP_PATH_VERIFY") == "1") so only "1" skips path
verification; update the conditional around the yield/return in conftest.py that
references BOXLITE_E2E_SKIP_PATH_VERIFY accordingly.

🧹 Nitpick comments (1)

.github/workflows/e2e-cloud.yml (1)
87-88: ⚡ Quick win

Consider pinning security-critical actions to commit SHAs.

Actions like aws-actions/configure-aws-credentials, actions/checkout, and aws-actions/amazon-ecr-login are security-sensitive (they handle credentials or code checkout). Pinning to commit SHAs instead of version tags prevents supply chain compromise if an action's repository is attacked.

At minimum, pin the credential-handling actions:
# Example for configure-aws-credentials
- uses: aws-actions/configure-aws-credentials@e3dd6a429d7300a6a4c196c26e071d42e0343502 # v4.0.2
Also applies to: 115-115, 118-118, 207-207, 229-229, 233-233, 464-464
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In @.github/workflows/e2e-cloud.yml around lines 87 - 88, The workflow uses
version tags for security-sensitive actions (e.g., actions/checkout,
dorny/paths-filter, aws-actions/configure-aws-credentials,
aws-actions/amazon-ecr-login) which is vulnerable to supply-chain attacks;
update each uses: entry for these actions to pin to a specific commit SHA
instead of a tag (replace e.g. actions/checkout@v5 with the corresponding commit
SHA for that release), doing this for all occurrences flagged (including the
configure-aws-credentials and amazon-ecr-login usages) so the workflow
references immutable commits.
Source: Linters/SAST tools

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Duplicate comments:
In `@scripts/test/e2e/cases/conftest.py`:
- Around line 116-118: The environment-var check currently treats any non-empty
BOXLITE_E2E_SKIP_PATH_VERIFY as truthy; change it to explicitly parse the
intended truthy value (per docstring) by testing for "1" (e.g.,
os.environ.get("BOXLITE_E2E_SKIP_PATH_VERIFY") == "1") so only "1" skips path
verification; update the conditional around the yield/return in conftest.py that
references BOXLITE_E2E_SKIP_PATH_VERIFY accordingly.

---

Nitpick comments:
In @.github/workflows/e2e-cloud.yml:
- Around line 87-88: The workflow uses version tags for security-sensitive
actions (e.g., actions/checkout, dorny/paths-filter,
aws-actions/configure-aws-credentials, aws-actions/amazon-ecr-login) which is
vulnerable to supply-chain attacks; update each uses: entry for these actions to
pin to a specific commit SHA instead of a tag (replace e.g. actions/checkout@v5
with the corresponding commit SHA for that release), doing this for all
occurrences flagged (including the configure-aws-credentials and
amazon-ecr-login usages) so the workflow references immutable commits.

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro Plus

Run ID: 777a5470-4e26-49d6-a116-7147a3b7fed9

📥 Commits

Reviewing files that changed from the base of the PR and between 0cdf723 and 49a95e2.

📒 Files selected for processing (5)

.github/workflows/README.md
.github/workflows/e2e-cloud.yml
.github/workflows/e2e-stack.yml
scripts/ci/setup-e2e-cloud-oidc.sh
scripts/test/e2e/cases/conftest.py

💤 Files with no reviewable changes (1)

.github/workflows/e2e-stack.yml

✅ Files skipped from review due to trivial changes (1)

.github/workflows/README.md

…ype check) `yarn nx build api --configuration=production` currently emits 160 TS errors against apps/api/src on main HEAD, blocking the e2e-cloud image build. The type drift is in pre-existing code (organization / webhook / user controllers + tracing.ts), unrelated to this PR's scope. Fixing those errors is a separate concern. `apps/api/project.json` already has a `development` build configuration with `skipTypeChecking: true` — webpack still produces the runtime bundle, just without the strict type-check gate. Add a `NX_BUILD_CONFIG` build arg to apps/api/Dockerfile: - Default `production` so production deploys are unchanged. - e2e-cloud workflow passes `development` to bypass the type-check gate while still producing the same `dist/apps/api/main.js` bundle the ENTRYPOINT runs. Runtime semantics are unchanged (same bundle, same entrypoint) — only the build-time strictness differs. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

coderabbitai

🧹 Nitpick comments (1)

apps/api/Dockerfile (1)

37-37: ⚡ Quick win

Use CACHE_BUST instead of a hardcoded epoch token.

ARG CACHE_BUST=1 is declared on Line 16, but Line 37 hardcodes echo 1777389170; this makes the arg ineffective and obscures cache invalidation intent.

Proposed change

-RUN echo 1777389170 && yarn nx build api --configuration=${NX_BUILD_CONFIG} --nxBail=true
+RUN echo ${CACHE_BUST} && yarn nx build api --configuration=${NX_BUILD_CONFIG} --nxBail=true

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@apps/api/Dockerfile` at line 37, The RUN line currently echoes a hardcoded
epoch token which ignores the declared ARG CACHE_BUST; update the RUN command in
the Dockerfile (the line invoking echo and yarn nx build api
--configuration=${NX_BUILD_CONFIG} --nxBail=true) to echo the CACHE_BUST build
arg instead (use $CACHE_BUST or ${CACHE_BUST}) so the declared ARG CACHE_BUST
takes effect for cache invalidation, and ensure ARG CACHE_BUST is present
earlier in the Dockerfile.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Nitpick comments:
In `@apps/api/Dockerfile`:
- Line 37: The RUN line currently echoes a hardcoded epoch token which ignores
the declared ARG CACHE_BUST; update the RUN command in the Dockerfile (the line
invoking echo and yarn nx build api --configuration=${NX_BUILD_CONFIG}
--nxBail=true) to echo the CACHE_BUST build arg instead (use $CACHE_BUST or
${CACHE_BUST}) so the declared ARG CACHE_BUST takes effect for cache
invalidation, and ensure ARG CACHE_BUST is present earlier in the Dockerfile.

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro Plus

Run ID: e7f4cbe3-0a49-48c1-9444-cbff14532232

📥 Commits

Reviewing files that changed from the base of the PR and between 49a95e2 and 442950a.

📒 Files selected for processing (2)

.github/workflows/e2e-cloud.yml
apps/api/Dockerfile

🚧 Files skipped from review as they are similar to previous changes (1)

.github/workflows/e2e-cloud.yml

… build) Previous attempt: --build-arg NX_BUILD_CONFIG=development bypassed the strict TS type check but still ran webpack/vite builds. The API webpack build succeeded under that config (skipTypeChecking: true), but the dashboard vite build still failed on a tsconfig path-resolution issue (`failed to resolve "extends":"../tsconfig.base.json"`). Even fixing the dashboard issue would still incur the full webpack bundle cost, which is wasted work for an e2e smoke test against deployed code. Source-mode approach: new apps/api/Dockerfile.source that - runs apps/api/src/main.ts directly via `yarn tsx ...` at container start (tsx strips types and transpiles on the fly; never type-checks) - skips both `nx build api` and `nx build dashboard` entirely - omits the apps/dashboard/ COPY altogether — e2e cases under scripts/test/e2e/cases/ hit the API directly, no dashboard static-file serving is exercised tsconfig.base.json is symlinked to tsconfig.json so tsx auto-discovers the `@boxlite-ai/*` path aliases that resolve to libs/*/src/index.ts. Production deploys are unaffected: apps/api/Dockerfile (the production Dockerfile) is unchanged; this is a sibling file used only by the e2e-cloud workflow. Trade-offs documented in apps/api/Dockerfile.source's header: - slower cold start (~5-15s extra for tsx initial transpile) - type errors surface at runtime instead of build time (acceptable for an e2e smoke against deployed code) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

coderabbitai

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)

.github/workflows/e2e-cloud.yml (2)

154-166: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Don't take TargetGroups[0] from the ALB.

This block asserts there is one matching load balancer, but it never asserts there is one attached target group. If another TG gets attached later, the deploy health gate can watch the wrong backend and either pass too early or fail a healthy rollout.

Suggested hardening

-          API_LB_TG_ARN=$(aws elbv2 describe-load-balancers \
-            --query "LoadBalancers[?starts_with(LoadBalancerName,'ApiLoadBalancer-')]|[0].LoadBalancerArn" \
-            --output text \
-            | xargs -I{} aws elbv2 describe-target-groups --load-balancer-arn {} \
-                --query "TargetGroups[0].TargetGroupArn" --output text)
+          API_LB_ARN=$(aws elbv2 describe-load-balancers \
+            --query "LoadBalancers[?starts_with(LoadBalancerName,'ApiLoadBalancer-')]|[0].LoadBalancerArn" \
+            --output text)
+          API_LB_TG_COUNT=$(aws elbv2 describe-target-groups \
+            --load-balancer-arn "$API_LB_ARN" \
+            --query "length(TargetGroups)" --output text)
+          assert_one "ApiLoadBalancer-* target group" "$API_LB_TG_COUNT"
+          API_LB_TG_ARN=$(aws elbv2 describe-target-groups \
+            --load-balancer-arn "$API_LB_ARN" \
+            --query "TargetGroups[0].TargetGroupArn" --output text)

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In @.github/workflows/e2e-cloud.yml around lines 154 - 166, The script currently
reads TargetGroups[0] for API_LB_TG_ARN which can pick the wrong TG if multiple
exist; update the logic to query and count target groups attached to the ALB
(use aws elbv2 describe-target-groups --load-balancer-arn {} with a length(...)
query and store/check a TG_COUNT variable), call assert_one (or reuse existing
assert_one) to ensure exactly one target group is attached, and only then
retrieve the single TargetGroupArn into API_LB_TG_ARN (instead of unguarded
TargetGroups[0]); ensure you reference LB_COUNT, API_LB_DNS, API_LB_TG_ARN, and
the describe-target-groups call when making these changes.

42-47: ⚠️ Potential issue | 🟠 Major

Bind workflow_dispatch runs to the trusted e2e-cloud environment and avoid ambiguous ALB target-group selection.

jobs.e2e in .github/workflows/e2e-cloud.yml has no environment:; scripts/ci/setup-e2e-cloud-oidc.sh trust policy only allows ...:ref:refs/heads/main, ...:pull_request, or ...:environment:e2e-cloud, so workflow_dispatch on non-main branches won’t be able to assume the AWS role via OIDC.

Suggested fix

   e2e:
     name: E2E suite (Tokyo)
     needs: changes
+    environment: e2e-cloud
     # Skip on PRs that don't touch relevant paths. workflow_dispatch
     # always runs (operator explicitly asked for it).
     if: needs.changes.outputs.relevant == 'true' || github.event_name == 'workflow_dispatch'

In .github/workflows/e2e-cloud.yml (“Resolve stack resources”), api_lb_tg_arn is set using TargetGroups[0].TargetGroupArn without asserting that exactly one target group is returned for the matched ALB; this can silently bind to the wrong target group. Add a TargetGroups count/uniqueness assertion (or filter deterministically) before using index 0.

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In @.github/workflows/e2e-cloud.yml around lines 42 - 47, Add environment:
e2e-cloud to the jobs.e2e definition so workflow_dispatch runs are bound to the
trusted e2e-cloud environment (matching the OIDC trust conditions in
scripts/ci/setup-e2e-cloud-oidc.sh); and in the “Resolve stack resources” step
where api_lb_tg_arn is derived from TargetGroups[0].TargetGroupArn,
assert/filter that the ALB returns exactly one matching TargetGroup (or
deterministically filter by Name/Tags) before using index 0 to avoid silently
selecting the wrong target group.

🧹 Nitpick comments (1)

apps/api/Dockerfile.source (1)
26-63: ⚡ Quick win

Run container as non-root user.

The container runs as root by default, which violates the principle of least privilege. Even in an e2e test environment, following security best practices reduces attack surface. The official Node.js image includes a pre-configured node user (UID 1000).
🔒 Proposed fix

Add the USER directive after installing dependencies but before copying application code:
 COPY apps/nx.json apps/tsconfig.base.json apps/NOTICE ./
 
 # tsx auto-discovers tsconfig.json in cwd — symlink so the path-alias
 # definitions in apps/tsconfig.base.json (the `@boxlite-ai/*` mappings)
 # resolve at import time without needing an explicit --tsconfig flag.
 RUN ln -sf tsconfig.base.json tsconfig.json
 
 ENV NX_DAEMON=false
 
+USER node
+
 COPY apps/api/ apps/api/
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@apps/api/Dockerfile.source` around lines 26 - 63, The Dockerfile runs as
root; update the boxlite-source stage to switch to the unprivileged node user
after installing dependencies but before copying application files:
create/ensure the /boxlite workdir is owned by node (chown -R node:node
/boxlite) and add a USER node (UID 1000) directive before the COPY apps/api/ ...
lines so subsequent COPY, HEALTHCHECK and ENTRYPOINT (yarn tsx
apps/api/src/main.ts) run as the non-root node user; keep the existing RUN
npm/yarn install steps as root then switch to node.
Source: Linters/SAST tools

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@apps/api/Dockerfile.source`:
- Line 26: Set a non-root runtime user and make the application directory
writable to fix Trivy DS-0002: in the Dockerfile.source ensure you create or
verify the /boxlite directory, chown it to the existing node user (e.g., chown
-R node:node /boxlite), and then add USER node before the ENTRYPOINT so the
container runs as the node user rather than root; reference the Dockerfile's
ENTRYPOINT, the /boxlite path, and the USER directive when applying the changes.

---

Outside diff comments:
In @.github/workflows/e2e-cloud.yml:
- Around line 154-166: The script currently reads TargetGroups[0] for
API_LB_TG_ARN which can pick the wrong TG if multiple exist; update the logic to
query and count target groups attached to the ALB (use aws elbv2
describe-target-groups --load-balancer-arn {} with a length(...) query and
store/check a TG_COUNT variable), call assert_one (or reuse existing assert_one)
to ensure exactly one target group is attached, and only then retrieve the
single TargetGroupArn into API_LB_TG_ARN (instead of unguarded TargetGroups[0]);
ensure you reference LB_COUNT, API_LB_DNS, API_LB_TG_ARN, and the
describe-target-groups call when making these changes.
- Around line 42-47: Add environment: e2e-cloud to the jobs.e2e definition so
workflow_dispatch runs are bound to the trusted e2e-cloud environment (matching
the OIDC trust conditions in scripts/ci/setup-e2e-cloud-oidc.sh); and in the
“Resolve stack resources” step where api_lb_tg_arn is derived from
TargetGroups[0].TargetGroupArn, assert/filter that the ALB returns exactly one
matching TargetGroup (or deterministically filter by Name/Tags) before using
index 0 to avoid silently selecting the wrong target group.

---

Nitpick comments:
In `@apps/api/Dockerfile.source`:
- Around line 26-63: The Dockerfile runs as root; update the boxlite-source
stage to switch to the unprivileged node user after installing dependencies but
before copying application files: create/ensure the /boxlite workdir is owned by
node (chown -R node:node /boxlite) and add a USER node (UID 1000) directive
before the COPY apps/api/ ... lines so subsequent COPY, HEALTHCHECK and
ENTRYPOINT (yarn tsx apps/api/src/main.ts) run as the non-root node user; keep
the existing RUN npm/yarn install steps as root then switch to node.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro Plus

Run ID: 094c7c9a-9051-4c60-ac33-d16a8ef0a9a2

📥 Commits

Reviewing files that changed from the base of the PR and between 442950a and 27f1315.

📒 Files selected for processing (2)

.github/workflows/e2e-cloud.yml
apps/api/Dockerfile.source

`cargo build -p boxlite-runner` was wrong — runner is a Go binary (apps/runner/cmd/runner/main.go), not a Rust crate. The Rust workspace has no `boxlite-runner` package, hence cargo's `error: package ID specification 'boxlite-runner' did not match any packages`. Mirror .github/workflows/build-runner-binary.yml's approach: - actions/setup-go@v5 with go-version 1.25.4 - Install libx11-dev + libxtst-dev + libxinerama-dev (computer-use links X11) - Download prebuilt libboxlite.a from `v$(Cargo.toml version)` GitHub release — currently v0.9.5 ✓ released 2026-05-16 - Rewrite apps/go.work to include only the modules we need to build - go mod download per module - Build daemon (CGO_ENABLED=0) → runner/pkg/daemon/static/daemon-amd64 - Build computer-use (CGO_ENABLED=1) → runner/pkg/daemon/static/boxlite-computer-use - Build runner (CGO_ENABLED=1, links libboxlite.a, embeds daemon and computer-use) → /tmp/boxlite-runner - aws s3 cp /tmp/boxlite-runner s3://.../builds/<sha>/boxlite-runner Pinning libboxlite.a to a released version means PR changes affecting only apps/runner Go code are picked up. PR changes to sdks/c that would require a fresh libboxlite.a are out of scope here — a follow-up could either trigger a full C SDK build or just download from a more recent release tag. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…y branch Previous trust policy enumerated 3 specific subject patterns: - repo:.../ref:refs/heads/main - repo:.../pull_request - repo:.../environment:e2e-cloud workflow_dispatch on a feature branch (e.g. chore/e2e-required-merge-gate) produces subject `repo:.../ref:refs/heads/chore/e2e-required-merge-gate`, which matched none of the 3. STS rejected with "Not authorized to perform sts:AssumeRoleWithWebIdentity". Switch to `repo:boxlite-ai/boxlite:*` — accepts any subject from this specific repo. Scope is still bounded to a single repo (no other repos can assume this role), and the role's IAM policy is least-privilege (only the boxlite-e2e-ci stack resources). Trade-off: feature branches can trigger e2e-cloud via workflow_dispatch, which is a feature here (needed for iterative debugging) rather than a risk. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…te.a from source) Runner binary rebuild via Go CGO links libboxlite.a, but the prebuilt libboxlite.a from latest release (v0.9.5) doesn't have the boxlite_rest_options_set_path_prefix symbol that apps/runner Go code references — main has drifted ahead of release. Linker error: undefined reference to `boxlite_rest_options_set_path_prefix' Two paths to fix: (a) Build libboxlite.a from Rust source in the workflow (+10-15min, requires cross-build chain for musl + libkrun) (b) Skip runner rebuild — Tokyo runner EC2 already runs a working binary ABI-compatible with the deployed API Going with (b) for now. The current Tokyo runner has been up 21+ hours and passes the API↔runner handshake checks we observed earlier. PR changes touching apps/runner Go code are not exercised by this workflow as a result; that's a follow-up that can wire `make dist:runner` (which builds libboxlite.a from source) into the build path. Removed steps: - actions/setup-go@v5 - Install runner build deps (libx11/libxtst/libxinerama) - Determine version + download libboxlite.a - Rewrite apps/go.work - Download Go modules per module - Build daemon / computer-use / boxlite-runner - Upload runner binary to S3 - SSM Deploy runner binary to EC2 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

@entity

…M needs decorator metadata) Source-mode container with tsx crashed immediately at module load: /boxlite/node_modules/typeorm/decorator/src/decorator/columns/PrimaryColumn.ts:72 ColumnTypeUndefinedError: Column type for User#id is not defined and cannot be guessed. Make sure you have turned on an "emitDecoratorMetadata": true option in tsconfig.json. Cause: tsx uses esbuild under the hood, which has long-standing non-support for `emitDecoratorMetadata: true`. TypeORM's @column / @entity / @PrimaryColumn decorators rely on `reflect-metadata` calls the compiler EMITS when that flag is set; without them, TypeORM can't infer column types and bails before the app even starts. Switch to ts-node, which uses the TypeScript compiler properly and honors emitDecoratorMetadata from the referenced tsconfig: ENTRYPOINT ["yarn", "ts-node", "--project", "apps/api/tsconfig.app.json", "--transpile-only", "-r", "tsconfig-paths/register", "apps/api/src/main.ts"] - --project: point at apps/api/tsconfig.app.json (where emitDecoratorMetadata and experimentalDecorators live). - --transpile-only: skip type-checking at startup (faster cold boot; the prod webpack build is the source-of-truth type gate, and it's already known to fail on main with 160 errors — see Dockerfile.source header). - -r tsconfig-paths/register: resolve `@boxlite-ai/*` path aliases defined in tsconfig.base.json at module-load time. Both ts-node (10.9.1) and tsconfig-paths are already in apps/package.json dependencies (used by the typeorm migration scripts), so no dep change. Cold start is now ~10-30s instead of tsx's ~5-15s — acceptable for the ALB health-check window (3 retries × 30s + delays = ~90s minimum). If it still races, will follow up by extending the health check grace period at the ECS service or task-def level. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The IAM action `ecs:ExecuteCommand` is evaluated against BOTH the task ARN AND the cluster ARN. Granting it on the task alone produced: AccessDeniedException: not authorized to perform: ecs:ExecuteCommand on resource: arn:aws:ecs:.../cluster/boxlite-e2e-ci-ClusterCluster-... Add the cluster pattern to the Resource list so the "Init admin-org sandbox quota" step can exec into the Api task. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The "Init admin-org sandbox quota" workflow step execs into the Api container and runs `psql` against RDS to seed sandbox quota. The source-mode Dockerfile didn't install postgresql-client, so the step failed with `sh: 1: psql: not found`. Add it to the apt install line. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Schema drift: post-deploy migration 1781016743403 renamed max_{cpu,memory,disk}_per_sandbox → max_{cpu,memory,disk}_per_box on the organization table. The quota-init UPDATE failed with: ERROR: column "max_cpu_per_sandbox" of relation "organization" does not exist Rename all three column refs in the workflow to match the live schema. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The previous run actually applied the quota UPDATE successfully: UPDATE 1 ... | 4 | 8192 | 20 But psql wrapped the row across multiple lines and triggered a "--More--" pager prompt over the ECS Exec session, so the '4 *\| *8192 *\| *20' grep didn't match. Add `-A -t -P pager=off` and `PAGER=cat` so the SELECT row is a single unaligned line "<uuid>|4|8192|20", then grep on a tighter '\|4\|8192\|20$' anchor. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The quota row printed exactly as expected: 4a1eef70-8734-4814-b7a5-0ca3522a8760|4|8192|20 but `grep -E '\|4\|8192\|20$'` still failed because SSM Session Manager uses CRLF line endings, so the `$` anchor sees `\r` immediately before end-of-line, never the literal `20`. Pipe through `tr -d '\r'` first so the anchor matches reliably. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The bubblewrap-sys / libkrun-sys / e2fsprogs-sys crates have build.rs scripts that vendor their C dependencies via git submodules. Without the submodules initialized the build panics: bubblewrap vendor source not found at .../vendor/bubblewrap Initialize submodule: git submodule update --init --recursive These C deps are runtime requirements of the runner binary (which runs inside the EC2 host), NOT of the Python SDK client. The build.rs files all honor BOXLITE_DEPS_STUB=1 and emit a no-op `/nonexistent` linker hint — exactly the right behavior here. Set the env var on the SDK build step so we don't need to also init submodules or install meson/ninja on the GHA runner. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…artup_failure)

…broadened

…ass dead per-service role ARNs The e2e-ci stack's per-service Task/Execution roles (ProxyTaskRole, OtelCollectorTaskRole, SshGatewayTaskRole + their Execution pairs) were deleted from IAM out-of-band — same drift as ApiTaskRole- batkzzas before this PR's session recreated it. Live ECS TDs still reference the dead ARNs, so new task launches fail at AssumeRole NoSuchEntity. Workaround: at deploy time, resolve Api's currently-live taskRoleArn + executionRoleArn (those were recreated earlier with sst:app=boxlite and trust ecs-tasks.amazonaws.com), and override the cloned TD's roles to point at those. Both trust principals match any ECS service, so Proxy / OtelCollector / SshGateway can each AssumeRole the Api role. Least-privilege rough edge: borrowers get Api's S3 + log perms they don't need — acceptable for e2e-ci (no prod traffic). Admin's `sst deploy --stage e2e-ci` will recreate per-service roles properly; we can drop this override then.

… to test their deploy paths

…merge-gate

…w_call Refactor the e2e-cloud workflow so the Api build+deploy chain is no longer duplicated inline. Adds a deploy_api job that workflow_call's .github/workflows/deploy-api.yml (mirroring how deploy_runner already workflow_call's deploy-runner.yml). Both deploys are gated on actual source changes via the existing `changes` paths-filter: - runner_chain → deploy_runner - api → deploy_api PRs touching only api skip the ~11 min runner build chain; PRs touching only runner skip the Docker buildx + ECS register-TD + wait-services-stable round-trip; PRs touching only tests skip both. E2e job (pytest) now needs all three (changes + deploy_runner + deploy_api) and reuses the same skip-if-failure-only gate (!failure() && !cancelled() && any-relevant-change) so it still runs when an upstream job was correctly skipped. Dropped from the inline e2e job: - ECR login step + Build & push Api image step (deploy-api.yml) - Deploy Api (rolling) step incl. clone-TD + UpdateService + ALB healthy poll (deploy-api.yml does it + env patches for stale S3_ACCESS_KEY / storage-bucket name + role override). - SSM-agent-online preflight check on the runner (obsolete since deploy-runner switched to SSH+SCP via EC2 Instance Connect). - ECR repo describe preflight (now exercised inside deploy-api.yml). Kept inline (still needed by pytest): - Resolve stack resources (cluster, ALB DNS/TG, runner id, S3 bucket). - /api/health probe through the LB before tests run. - Init admin-org sandbox quota via ECS Exec. - Python SDK build/install + pytest profile/snapshot setup + run. - On-failure CloudWatch + journalctl dump. Net: -104 lines, one source of truth per deploy. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

… PortSpec impl merging upstream/main into the PR brought in #741's PortSpec-based WithPort but the session's earlier stub (added to unblock the runner build before #741 landed) remained in place, causing duplicate declarations: ports redeclared in struct boxConfig WithPort redeclared in this block cannot use portMapping{} as PortSpec value in append Remove the stub block: stub's ports field, portMapping struct, and stub WithPort func. #741's PortSpec + WithPort + ports []PortSpec are the canonical definitions.

The composed e2e-cloud pipeline previously deployed only Api + Runner. Supporting ECS services (Proxy / OtelCollector / SshGateway) went un-deployed in the CI loop — admin had to run `sst deploy --stage e2e-ci` to update them. This composes deploy-app-services.yml the same way as deploy-runner / deploy-api: - `changes` job adds paths-filter outputs for proxy / otel_collector / ssh_gateway, plus an `appsvc` step that emits a comma-separated list of services whose source actually changed. - New `deploy_app_services` job: workflow_call invokes deploy-app- services.yml, passing the changed-list via `services` input. Inner workflow's per-service gating then deploys only the touched ones. - `e2e` job needs += deploy_app_services so pytest waits for the supporting services to come up. Also: extend deploy-app-services.yml's workflow_call signature to accept the `services` input (it already had the same input on workflow_dispatch). Net: pure runner / api / test PRs skip the 3 supporting deploys entirely; otel-only PRs skip runner + api + the other 2 services.

…uota init The previous step UPDATE'd organization WHERE id='4a1eef70-...' — a UUID that happened to be the admin's default-org id in earlier e2e-ci stack instances. After this PR's DB schema reset (required to recover from #736's pre-launch migration squash being deployed against a stack whose history was already past those migrations), the app re-runs initializeAdminUser at boot and creates a new admin user + auto- generates a new default-org UUID. The hardcoded id no longer matches, UPDATE touches 0 rows, SELECT returns empty, the step fails. Switch the SQL to resolve the admin's default org via a subquery against organization_user (userId='boxlite-admin' AND isDefaultForUser=true) — that membership row is what app code already uses to find the admin's working org, and survives any future schema reset where the org UUID changes.

Merging the substantive bits of #745 and #746 here so PR #724's deploy-app-services CI can actually push an otel-collector image: * otel-collector/builder-config.yaml: fix module paths after #730 moved everything under apps/ (was: ../../../api-client-go etc., now: ../../../apps/api-client-go). The OCB build was failing with `filepath does not exist: /boxlite/otel-collector/exporter`. (#745) * otel-collector/Dockerfile: apk add python3 make g++ so node-gyp can compile any musl-incompatible native add-on during workspace `yarn install`. Kept even after dropping dockerode below — small safety net in case future deps reintroduce a native build. (#745) * apps/package.json: drop unused dockerode + @types/dockerode. They were inherited from the Daytona import in #460 but never imported in source. Drops the dockerode -> docker-modem -> ssh2 -> cpu-features chain; cpu-features was the optional native addon that forced the toolchain workaround in the first place. (#746) * apps/yarn.lock: regenerated (-161 lines). * apps/api-client-go/api/openapi.yaml: regenerate to catch up with main's existing `assignedRoleIds.default: []` drift so the api-client-drift PR check passes.

… gone The previous commit kept #745's apk add as a safety net, but cpu-features was the only musl-incompatible native dep in the workspace and it came in only via the now-removed dockerode chain. Verified locally: yarn install on node:22-alpine completes without any compiler toolchain.

PR-on-open triggering serializes every PR against the shared Tokyo stack (see concurrency: e2e-cloud-shared, cancel-in-progress: false). The suite is slow enough that gating every PR open / push on it just queues runs behind each other for hours. Push-to-main keeps the post-merge regression signal; workflow_dispatch covers ad-hoc PR validation when wanted.

Today an OtelCollector deploy on chore/e2e-required-merge-gate failed with "ECS rolled back. PRIMARY=...:7 expected ...:6" — two parallel Deploy App Services runs (push + dispatch on the same SHA) both registered task definitions, and one's UpdateService rolled the other back. Add per-workflow concurrency groups against the shared Tokyo stack to prevent the race. cancel-in-progress: false because a half-applied ECS rolling update / EC2 userdata swap is worse than waiting. - deploy-api.yml: group=deploy-api-shared - deploy-app-services.yml: group=deploy-app-services-shared - deploy-runner.yml: group=deploy-runner-shared - _deploy-single-service.yml: group=deploy-single-service-<service_name> (parameterized so different services parallelize but same-service callers serialize — belt-and-suspenders against the outer group)

Resolve stack resources was picking the API LB target group via TargetGroups[0] without checking how many target groups were attached. Blue/green deploys or extra listener rules can attach more than one TG to an ALB; in that case the e2e suite would silently bind smoke traffic to the wrong TG. Use the same assert_one guard as the cluster and LB checks above. Hoist the LB ARN into a variable so the count + the final lookup share one resolution.

…002) CodeRabbit / Trivy DS-0002 flagged the source-mode container running as root. Switch to the pre-existing `node` user (UID 1000 in node:24-slim) right before HEALTHCHECK + ENTRYPOINT. All RUN/COPY happen as root above, so the installed tree is read-only at runtime — least-privilege without needing a chown -R.

G4614 · 2026-06-12T12:21:56Z

@coderabbitai full review

coderabbitai · 2026-06-12T12:22:02Z

✅ Action performed

Full review finished.

Bare `if os.environ.get(...)` is truthy for any non-empty string, so setting the var to "0" or "false" — the conventional "off" — still skipped the runner-journal verification. Accept the standard truthy spellings only.

Previously the e2e-cloud.yml pipeline glued deploy_{runner,api,app_services} and the pytest suite into one workflow_dispatch entry, so the only way to "rerun the tests" was to also redeploy everything (workflow_dispatch forces all three deploy gates true). With the Tokyo stack as a shared singleton that gets left in whatever state the last run produced, this made cheap "rerun against current stack" cycles much more expensive than they need to be. * New: .github/workflows/e2e-cloud-test.yml. workflow_call + workflow_dispatch. Contains only the e2e job — resolve resources, probe LB, init admin-org quota, build SDK, configure pytest, run suite, collect logs on failure. Shares the e2e-cloud-shared concurrency group so a standalone test dispatch and an in-flight deploy queue against each other (can't race the Tokyo stack). * e2e-cloud.yml: e2e job is now a thin `uses:` wrapper around the new workflow. Gating logic (skip-if-upstream-failed, paths-changed-only- on-push) is preserved at the wrapper level; standalone dispatch on e2e-cloud-test.yml skips those gates and just runs. Header on e2e-cloud-test.yml warns explicitly that the stack state is the dispatcher's responsibility — results reflect whatever's deployed, not the branch the workflow was dispatched from.

The single `runner_source` filter triggered the full ~10 min libboxlite Rust build on any path in the runner's dep graph, including Go-only diffs (apps/runner, apps/daemon, sdks/go, ...) that just relink against the existing libboxlite.a without changing it. Split into two filters: * libboxlite_chain: paths whose content ends up in libboxlite.a (src/boxlite, src/shared, src/deps, sdks/c, scripts/build, Cargo.{toml,lock}) * go_runner_chain: Go paths that consume libboxlite.a but don't contribute to its bytes (apps/{runner,daemon,common-go,api-client-go, libs/computer-use}, sdks/go) `build_c_sdk` now runs only when libboxlite_chain changed. `build_runner` switches libboxlite_source to "release" (pull v${VERSION} tarball from GitHub Release) when libboxlite_chain didn't change, "build" (use sibling build_c_sdk artifact) otherwise. `build_runner`'s `if:` adds the !failure() && !cancelled() guard so the skipped build_c_sdk on Go-only diffs doesn't cascade-skip it. Also drop the dead `src/api-client/**` path from both deploy-runner.yml and e2e-cloud.yml — that directory doesn't exist in the repo, so the entry never matched anything. workflow_dispatch / workflow_call behavior is unchanged — both still force libboxlite_changed=true and rebuild from source. Only push-event auto-triggering is optimized.

…layers The bare ${SQL} interpolation went through GHA bash → aws ecs --command → sh -c → psql -c — four layers of double-quote parsing. The SQL's own PG identifier quotes ("organizationId", "organization_user", etc.) collided with the inner `\"...\"` bounds and got stripped, so psql received the bare token `organizationId` which Postgres case-folds to `organizationid` and reports as missing. Encode the SQL to base64 on the runner (base64 chars never need quoting), inline that into the --command string, then `base64 -d | psql -f -` inside the container. psql sees the literal SQL bytes unchanged. Last dispatched e2e-cloud run failed exactly here; the column is in the schema (migrations/1741087887225-migration.ts:64), the bug was purely shell-escape.

Drop this block before merging to main. GH won't expose workflow_dispatch on a brand-new workflow until it has run at least once; the push trigger (scoped to this branch and to self-modifications only) gives us that first run.

POST /api/snapshots and GET /api/snapshots?name=... no longer exist on the API — there is no snapshot controller in apps/api/src/**. PR #735 (feat(box): create boxes from curated images) replaced the pre-register-then-boot model with lazy curated-image pulls at box start, so the e2e test workflow no longer needs to seed snapshot rows. The pre-registration block was failing the "Configure pytest profile p1" step with `Cannot POST /api/snapshots` 404. The previous run got masked by the SQL-escape bug in quota init (now fixed); this surfaced as the next blocker.

waitForToolboxReady was introduced in fc88aa0 (2026-06-05, "feat: add agent-ready runtime catalog", which then landed on main via PR #715 on 2026-06-10). It HTTP-polls http://127.0.0.1:<hostPort>/version expecting a daemon on guest TCP port 2280 — that interface dates from the Daytona-daemon era and was never reimplemented after the Rust guest agent rewrite landed in dbb11ec (2026-04-01). The new agent binds **vsock://2695** for gRPC + notifies the host via vsock://2696; nothing inside the VM listens on TCP:2280, so libkrun's port-forward accepts the SYN and immediately reset-by-peer's, and every CREATE_BOX times out 30 s in. Production data from a Tokyo runner: in 24 h, 490 CREATE_BOX events, 0 toolbox-ready successes, 181 toolbox-ready failures. The exec path that fires immediately afterward (via the same vsock gRPC channel) **always succeeds** — confirming the box VM is healthy, the readiness check itself is the bug. Remove the dead probe: drop waitForToolboxReady from client.go's Create and Start, drop the function + its TCP/HTTP imports, drop the toolboxReadyTimeout field, drop the ToolboxReadyTimeout/ DaemonStartTimeoutSec config plumbing in main.go + config.go, drop the two now-unreachable tests. Box readiness is now signalled by bx.Start(ctx) returning (which itself blocks on the vsock notification from the guest). Branched off chore/e2e-required-merge-gate (PR #724) so the e2e-cloud stack picks this up next dispatch.

G4614 requested a review from a team as a code owner June 10, 2026 14:09

G4614 mentioned this pull request Jun 10, 2026

ci(e2e): make e2e suite a required merge gate (stacks on #678) #680

Closed

github-advanced-security AI found potential problems Jun 10, 2026

View reviewed changes

Comment thread .github/workflows/e2e-cloud.yml Fixed

G4614 and others added 9 commits June 10, 2026 22:16

G4614 force-pushed the chore/e2e-required-merge-gate branch from 0cdf723 to 49a95e2 Compare June 10, 2026 14:16

coderabbitai Bot reviewed Jun 10, 2026

View reviewed changes

Comment thread .github/workflows/e2e-cloud.yml

Comment thread .github/workflows/e2e-cloud.yml

Comment thread scripts/test/e2e/cases/conftest.py Outdated

coderabbitai Bot reviewed Jun 10, 2026

View reviewed changes

Comment thread apps/api/Dockerfile.source

G4614 and others added 10 commits June 10, 2026 22:50

G4614 added 4 commits June 12, 2026 12:23

ci(deploy-app-services): grant id-token:write on calling jobs (fix st…

53a65b9

…artup_failure)

ci(deploy-app-services): retest Proxy after OIDC UpdateService scope …

f290ba6

…broadened

ci(deploy-app-services): touch OtelCollector + SshGateway Dockerfiles…

8b66ae1

… to test their deploy paths

G4614 force-pushed the chore/e2e-required-merge-gate branch 2 times, most recently from c07f7cd to 8b66ae1 Compare June 12, 2026 09:19

G4614 and others added 3 commits June 12, 2026 17:20

Merge remote-tracking branch 'upstream/main' into chore/e2e-required-…

457a5cd

…merge-gate

G4614 mentioned this pull request Jun 12, 2026

fix(otel-collector): unblock Docker build after #730 apps/ normalization #745

Closed

2 tasks

G4614 added 9 commits June 12, 2026 18:38

ci(build-c): default GITHUB_TOKEN to contents: read

875b338

G4614 added 7 commits June 12, 2026 12:23

ci(e2e-cloud): default GITHUB_TOKEN to contents: read

75403a5

G4614 mentioned this pull request Jun 13, 2026

fix(runner): drop dead-on-arrival toolbox HTTP readiness check #754

Open

Conversation

G4614 commented Jun 10, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What this PR contains

AWS prerequisites already provisioned (one-time, done before this PR opens)

Test plan

Summary by CodeRabbit

Uh oh!

coderabbitai Bot commented Jun 10, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Reviews paused

Walkthrough

Changes

Sequence Diagram(s)

Estimated code review effort

Suggested reviewers

Poem

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

G4614 commented Jun 12, 2026

Uh oh!

coderabbitai Bot commented Jun 12, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

G4614 commented Jun 10, 2026 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented Jun 10, 2026 •

edited

Loading

coderabbitai Bot commented Jun 12, 2026 •

edited

Loading