Skip to content

ci(e2e): e2e-cloud workflow + IAM OIDC provisioning (replaces #680)#724

Open
G4614 wants to merge 93 commits into
mainfrom
chore/e2e-required-merge-gate
Open

ci(e2e): e2e-cloud workflow + IAM OIDC provisioning (replaces #680)#724
G4614 wants to merge 93 commits into
mainfrom
chore/e2e-required-merge-gate

Conversation

@G4614

@G4614 G4614 commented Jun 10, 2026

Copy link
Copy Markdown
Contributor

This PR is the same-repo version of #680. Same content, but the head
branch is now on boxlite-ai/boxlite directly instead of a fork. Reason:
PRs sourced from a fork strip the vars / secrets contexts and the
id-token: write permission from the workflow run by GitHub Actions
security policy, which makes the OIDC handshake in
aws-actions/configure-aws-credentials@v4 impossible.

Closing #680 (link incoming).

What this PR contains

  • ci(e2e): make e2e suite a required merge gate — original ci(e2e): make e2e suite a required merge gate (stacks on #678) #680 commit
    (3-job structure for changes / e2e / e2e-gate; e2e-gate was later
    dropped in this stack, see below)
  • test(e2e): add BOXLITE_E2E_SKIP_PATH_VERIFY toggle for cloud CI
    conftest fixture short-circuits when the cloud runner journal isn't
    reachable from the test host (i.e. cloud CI where the runner is on
    a remote EC2)
  • ci: add IAM OIDC role provisioning for e2e-cloud workflow — new
    scripts/ci/setup-e2e-cloud-oidc.sh provisions
    boxlite-e2e-cloud-github-actions IAM role (separate from the
    us-east-1 boxlite-e2e-github-actions role used by e2e-test.yml)
  • ci(e2e): rename e2e-stack → e2e-cloud, target deployed Tokyo stack
    the actual workflow rewrite: deploys API container + runner binary
    from this checkout, then runs pytest scripts/test/e2e/cases/
    against the deployed Tokyo boxlite-e2e-ci-* stack
  • docs(workflows): document e2e-cloud and fix stale setup-aws-oidc.sh ref
  • ci(e2e-cloud): drop gate job — defer 'required check' until Tokyo provisioned
    removed the always-on e2e-gate job; deferred branch protection
    required-check registration until the workflow has a green run
  • ci(e2e-cloud): hardcode AWS_REGION + AWS_ROLE_ARN to fix fork-PR vars gap
    AWS account ID is "sensitive, not secret" per AWS docs; role safety is
    enforced by trust policy. This commit is from when we were still
    trying to make ci(e2e): make e2e suite a required merge gate (stacks on #678) #680 work; now that this is a same-repo PR, the
    ${{ vars.* }} references would also work, but the hardcoded values
    are kept as the source of truth.

AWS prerequisites already provisioned (one-time, done before this PR opens)

  • IAM OIDC provider on token.actions.githubusercontent.com (already
    existed, shared with e2e-test.yml)
  • IAM role arn:aws:iam::064212132677:role/boxlite-e2e-cloud-github-actions
    with the scoped inline policy from
    scripts/ci/setup-e2e-cloud-oidc.sh
  • SSM SecureString parameter /boxlite/e2e-ci/admin-api-key
  • Repo variables AWS_E2E_CLOUD_REGION + AWS_E2E_CLOUD_ROLE_ARN
    on boxlite-ai/boxlite (now redundant with the hardcoded values,
    but kept for future-friendly same-repo-PR runs)

Test plan

  • e2e-cloud workflow's changes job paths match this PR's diff → e2e job fires
  • e2e job's OIDC auth step should now succeed (same-repo PR has
    id-token: write)
  • Subsequent steps (build api image, push to ECR, ECS deploy, runner
    SSM update, admin-quota init, pytest run) iterate from there

Summary by CodeRabbit

  • Documentation

    • Updated CI workflow docs and added full cloud e2e regression suite documentation.
  • Infrastructure

    • Added an always-on Tokyo cloud e2e workflow for full regression runs; serializes concurrent runs and supports manual dispatch.
    • Removed the legacy stack-based e2e workflow.
    • Added a provisioning helper to set up OIDC-based AWS permissions for the cloud workflow.
  • Tests

    • e2e fixtures can skip runner path verification in constrained CI environments.
  • Chores

    • Made API/dashboard build configuration parameterizable and added a source-mode API image for cloud e2e runs.

@coderabbitai

coderabbitai Bot commented Jun 10, 2026

Copy link
Copy Markdown

Review Change Stack

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review
📝 Walkthrough

Walkthrough

Adds an always-on Tokyo e2e GitHub Actions workflow, an OIDC IAM provisioning script, API/runner build-and-deploy steps, pytest e2e execution with diagnostics, README and fixture updates, and a Dockerfile build-arg for configurable Nx builds.

Changes

Tokyo Always-on E2E Stack

Layer / File(s) Summary
AWS OIDC IAM provisioning
scripts/ci/setup-e2e-cloud-oidc.sh
Creates/verifies GitHub OIDC provider, builds assume-role trust and scoped permissions policy, creates/updates IAM role, and prints role ARN and GitHub Actions variable guidance.
Workflow metadata, concurrency, change gating, and docs
.github/workflows/e2e-cloud.yml, .github/workflows/README.md
Adds e2e-cloud workflow docs and metadata, sets shared concurrency and Tokyo env vars, adds dorny/paths-filter changes job to gate expensive runs, and records the omitted e2e-gate note; fixes README provisioning script reference.
E2E prerequisites and resource resolution
.github/workflows/e2e-cloud.yml
Main e2e job provisions OIDC credentials, resolves ECS cluster, ALB/target group, runner EC2 instance, and S3 bucket with exact-count assertions, and validates ECR existence and runner SSM agent Online.
API container and runner binary builds
.github/workflows/e2e-cloud.yml, apps/api/Dockerfile.source, apps/api/Dockerfile
Authenticates to ECR, builds and pushes API image using Dockerfile.source (git SHA tag), builds boxlite-runner binary (Go/Cargo/libboxlite), uploads binary to S3, and adds NX_BUILD_CONFIG build-arg to the API Dockerfile.
API and runner service deployments
.github/workflows/e2e-cloud.yml
Deploys API by cloning/registering ECS task definition with the new image and waiting for ALB health; updates runner via SSM AWS-RunShellScript to atomically replace the binary and restart the systemd service with polling.
Quota initialization, SDK build, pytest execution, and diagnostics
.github/workflows/e2e-cloud.yml
Performs idempotent admin-org quota init via ECS Exec, builds/installs Python SDK, writes masked pytest credentials from SSM, runs e2e pytest with per-test timeouts and JUnit XML, uploads artifact, and collects CloudWatch logs and runner journal via SSM on failure.

Sequence Diagram(s)

sequenceDiagram
  participant GitHubActions
  participant AWS as AWS(ECR/ECS/SSM/S3/ALB)
  participant API as API(ECS Service / ALB)
  participant Runner as Runner(EC2 boxlite-runner)
  participant Pytests as Pytests

  GitHubActions->>AWS: obtain OIDC id-token & assume role
  GitHubActions->>AWS: resolve ECS cluster, ALB DNS/target-group, runner instance, S3 bucket
  GitHubActions->>AWS: build & push API image to ECR (apps/api/Dockerfile.source)
  GitHubActions->>AWS: build runner binary and upload to S3 (builds/<sha>/boxlite-runner)
  GitHubActions->>AWS: register new ECS task definition with new image -> update service (rolling deploy)
  AWS->>API: deploy new task -> ALB health checks
  GitHubActions->>Runner: send SSM AWS-RunShellScript to download binary, replace executable, restart systemd
  GitHubActions->>AWS: ECS Exec into API task to run psql quota init
  GitHubActions->>Pytests: install SDK, write credentials (from SSM), run pytest via ALB DNS
  Pytests->>API: e2e requests via ALB DNS
  GitHubActions->>AWS: on failure, fetch CloudWatch logs and runner journal via SSM
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Suggested reviewers

  • DorianZheng

Poem

🐰 I hopped from repo to cloud at night,

Pushed images, keys, and deploy delight,
OIDC gates and runner restarts,
Tests that dance through Tokyo parts,
CI carrots gleam — green checks in sight.

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title accurately reflects the main change: introducing the e2e-cloud workflow and IAM OIDC provisioning as a replacement for the previous approach.
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch chore/e2e-required-merge-gate

Comment @coderabbitai help to get the list of available commands and usage tips.

Comment thread .github/workflows/e2e-cloud.yml Fixed
G4614 and others added 9 commits June 10, 2026 22:16
PR #678 wires up an e2e suite but its workflow only triggers on PRs
that touch specific paths. GitHub branch protection requires a check
to ALWAYS report a result — a path-filtered workflow that doesn't
fire leaves the required check pending forever, so the PR can't merge.

Reshape so e2e can be a required check:

1. Drop `paths:` from the trigger — workflow always starts on every
   PR to main. Inside, a cheap `changes` job on a GitHub-hosted
   runner does the path-filter detection via dorny/paths-filter@v3.

2. The expensive `e2e` job (self-hosted kvm runner, ~10-30 min)
   stays gated on `changes.outputs.relevant == true`. PRs that don't
   touch relevant paths skip the kvm runner entirely — no contention
   cost.

3. New `e2e-gate` job ALWAYS runs (`if: always()`) and collapses
   the conditional outcome into one stable check name
   `E2E gate (required)`. Reports success if e2e passed OR no
   relevant paths changed; fail if e2e ran and failed.

Branch protection should be configured (separately, via repo
Settings → Branches → main) to require the `E2E gate (required)`
status check. That's a repo-settings action a maintainer needs to
do once after this merges.

Cost on irrelevant PRs: one ~10s GitHub-hosted job for `changes`
+ ~5s for `e2e-gate`. Negligible. No self-hosted runner consumed.

Cost on relevant PRs: same as before (full e2e on kvm runner).

Stacks on #678 — this commit only makes sense once that PR's
e2e-stack.yml is in main.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The autouse `verify_runner_saw_all_boxes` fixture shells out to
`journalctl -u boxlite-runner`, which only resolves on a host where the
runner systemd service is installed (the current self-hosted KVM
runner). When the e2e suite migrates to a GitHub-hosted runner pointing
at a remote stack (Tokyo Api/Runner deployment), journalctl returns
zero hits and every box-creating test fails on the path-verify assert
even though the box actually reached the remote runner.

Setting BOXLITE_E2E_SKIP_PATH_VERIFY=1 short-circuits the fixture
(yield-and-return before the journalctl probe). The cloud-CI workflow
will set this; the self-hosted KVM job keeps the guard.

The FFI-bypass risk this guard defends against doesn't apply on a stock
GHA runner — no KVM means libkrun can't start a VM, so a silent
fallback would surface as a hard error, not as a passing-but-wrong
test. Losing the guard in that environment loses no real safety net.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds scripts/ci/setup-e2e-cloud-oidc.sh — sibling to setup-ci-runner.sh
but for the cloud-stack regression suite (`.github/workflows/e2e-cloud.yml`,
added in the next commit).

A separate IAM role (`boxlite-e2e-cloud-github-actions`) is needed
because the existing `boxlite-e2e-github-actions` role is tuned for
the us-east-1 KVM-runner provisioning path. Cloud e2e against the
ap-northeast-1 Tokyo stack needs an orthogonal permission set — ECR
push, ECS update-service + execute-command, SSM send-command to a
specific runner EC2, S3 builds/ artifact upload, parameter-store
reads. Splitting the roles keeps each one auditable under least-privilege.

Scoping:
- Trust policy `sub` is an enumerated allowlist (main pushes,
  pull_request, environment:e2e-cloud) — a stray workflow added on
  a feature branch can't assume the role.
- ssm:SendCommand is gated by `ssm:resourceTag/Name=boxlite-runner`
  so the role can only shell the runner EC2.
- ecs:UpdateService restricted to boxlite-e2e-ci-*/Api service ARN;
  ecs:ExecuteCommand to boxlite-e2e-ci-*/* task ARNs.
- Resource wildcards remaining (ecr:GetAuthorizationToken, ssmmessages:*,
  ecs:Describe*, elbv2:Describe*) are forced by AWS not supporting
  resource-level scoping for those actions.

Sources scripts/common.sh for the shared log helpers so this script's
output matches setup-ci-runner.sh.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The previous e2e-stack.yml bootstrapped a full local stack (postgres,
redis, docker registry, boxlite-runner systemd, boxlite-api) inside a
self-hosted KVM runner per PR — slow to maintain (bootstrap drift,
KVM runner contention) and only covers self-bootstrap state.

This workflow keeps the same 3-job structure introduced by PR #680
(changes / e2e / e2e-gate) so the `E2E gate (required)` branch
protection check is unchanged, but replaces the `e2e` job:

- runs-on: ubuntu-latest (no /dev/kvm needed)
- authenticates to AWS via OIDC using the new
  boxlite-e2e-cloud-github-actions role
- builds the Api container image + boxlite-runner musl binary from
  this checkout
- deploys to the always-on Tokyo stack: rolling-updates the Api ECS
  service to the new image and SSM-replaces the runner EC2's binary
- initialises the admin-org sandbox quota row in RDS via ECS Exec into
  the Api task (DB_PASSWORD stays in container env — never in the
  SSM command body)
- builds & installs the Python SDK from this checkout (path_prefix
  and other source-tree additions land ahead of the PyPI release)
- runs `pytest scripts/test/e2e/cases/` against http://<api-lb>/api
  with BOXLITE_E2E_SKIP_PATH_VERIFY=1 (journalctl-based path verify
  doesn't reach the remote runner)
- on failure, dumps Api CloudWatch logs + boxlite-runner journalctl

Concurrency uses a constant group (`e2e-cloud-shared`) because every
run — push + PR — competes for the same singleton Tokyo stack. Per-ref
grouping would let two PRs interleave ECS rolling updates.

Resource discovery (cluster id, LB DNS, runner instance id, S3
bucket, ECR repo) is by-pattern at workflow start, with a strict
count=1 assertion so an orphaned `ApiLoadBalancer-*` from a partial
SST teardown fails the job loudly instead of silently binding to
the wrong target. SSM agent ping-status is verified before the
runner update step to fail fast on a dead agent.

Adversarial review (multiple lenses) caught and this commit addresses:
- credentials.toml written via printf rather than unquoted heredoc
  so a `$` inside the admin key can't be shell-expanded; ADMIN_KEY
  immediately registered via `::add-mask::`
- ECS execute-command output is captured and grepped for the
  expected SELECT row (Session Manager exit-code propagation has
  historically been unreliable through --interactive without a tty)
- `wait services-stable` is followed by ALB target-health polling
  AND a /api/health curl with retry — services-stable alone races
  the LB health check
- pytest wrapped in `timeout 35m` plus
  `--timeout=180 --timeout-method=thread` so a hung VM doesn't burn
  the 45-min job timeout before `Collect logs on failure` runs
- Rust build cached via Swatinem/rust-cache@v2 so repeat PR builds
  are incremental
- both actions/checkout occurrences bumped to @v5 to match the
  newer workflows in the repo

Known limits (TODO/follow-up, not blocking):
- boxlite-runner has no documented in-flight drain; restart drops
  any request landing during the swap window. Acceptable for e2e
  (no concurrent users) but a production-grade rollout would need
  a runner-side graceful drain.
- No post-run restore of `main` HEAD on the Tokyo stack: the stack
  is left running the last e2e-cloud run's image. Console inspection
  of `api-<sha>` identifies the running revision.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- Add `e2e-cloud.yml` section to .github/workflows/README.md mirroring
  the structure of the e2e-test.yml section (Why / Architecture /
  Triggers / Cost / Authentication / Required variables / Required
  AWS resources / Concurrency / Stack state).
- Side-by-side table contrasts the two e2e workflows' IAM roles,
  regions, and scopes so future maintainers can tell at a glance why
  both exist.
- Fix line 198 which referenced `scripts/ci/setup-aws-oidc.sh` — that
  filename does not exist; the actual provisioning script is
  `scripts/ci/setup-ci-runner.sh`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…visioned

PR #680's original goal was to make the e2e suite a required merge
gate. With the workflow now retargeted at the always-on Tokyo stack
(previous commit), keeping `e2e-gate` as `(required)` would block
every PR until the Tokyo IAM OIDC role + SSM parameter +
admin-org quota are provisioned — a chicken-and-egg.

Removing the gate job for now. The two-job structure (changes → e2e)
still works for `workflow_dispatch` trial runs and for PR CI feedback
(non-blocking). Re-add an `e2e-gate` job and the corresponding branch
protection rule once a first green run is observed.

Header comment updated to flag the deferred-required-check status so
future contributors know why the workflow exists without being gated.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
AWS IAM CreateRole rejects U+2014 (em-dash) in the --description
argument with a validation error. Replace the em-dash with an ASCII
hyphen so the script runs cleanly during bootstrap.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… gap

PR #680 is sourced from a fork (G4614/boxlite). GitHub Actions does
NOT pass repository variables (the `vars` context) to workflow runs
triggered by `pull_request` from external forks — security rule to
prevent fork PRs from reading upstream secrets/vars.

Result: even though `gh variable set AWS_E2E_CLOUD_REGION` and
`AWS_E2E_CLOUD_ROLE_ARN` succeeded on boxlite-ai/boxlite, the workflow
run on this PR's head saw both as empty strings, and
`aws-actions/configure-aws-credentials@v4` failed at input validation
with "Input required and not supplied: aws-region" before any OIDC
handshake could occur.

Fix: hardcode the values in the workflow's env: block. Both are
acceptable to commit:

- `ap-northeast-1`: region name, public infrastructure information.
- IAM role ARN: contains the AWS account ID (12 digits). AWS docs
  classify account IDs as "sensitive, not secret"; the role's safety
  is enforced by its trust policy (which restricts AssumeRole to
  GitHub OIDC tokens with sub matching `repo:boxlite-ai/boxlite:*`
  patterns), not by hiding the ARN. Industry-standard pattern in OSS
  CI infrastructure.

After this PR merges, future PRs sourced from same-repo branches WILL
see the repo vars again, but the hardcoded values remain the source
of truth.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
apps/api/Dockerfile uses `COPY apps/libs/sdk-typescript/`, `COPY apps/dashboard/`,
`COPY apps/package.json` and friends — paths that exist at the repo root,
not inside apps/api/. Using `apps/api` as the build context made
buildx report:

  ERROR: failed to compute cache key: failed to calculate checksum of
  ref ...: "/apps/libs/sdk-typescript": not found

Switch to `-f apps/api/Dockerfile .` so the context is the repo root.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@G4614 G4614 force-pushed the chore/e2e-required-merge-gate branch from 0cdf723 to 49a95e2 Compare June 10, 2026 14:16

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In @.github/workflows/e2e-cloud.yml:
- Around line 42-56: Add a top-level workflow permissions block to restrict the
default GITHUB_TOKEN to the minimum (e.g., permissions: contents: read) so jobs
do not inherit broad repo access; then explicitly opt individual jobs (like the
e2e job that currently needs id-token) into additional scopes by adding
id-token: write under that job's permissions. Locate the workflow's
concurrency/group declaration (e2e-cloud-shared) and the e2e job definition to
apply the changes so default token scope is locked down and only required
permissions are granted per-job.
- Around line 91-101: The workflow's change filters are too narrow (the filters:
relevant block only matches apps/api/src/** and sdks/python/src/**) while the
job actually builds the full apps/api Docker context and the entire sdks/python
package; update the changes filter to include the full contexts (e.g.,
apps/api/** and sdks/python/**) or explicitly add Dockerfile, package*.json,
pyproject.toml and other top-level files (e.g., apps/api/Dockerfile,
apps/api/package*.json, sdks/python/pyproject.toml) so edits that affect what
gets built will trigger the e2e workflow; apply the same widening to the other
similar filters referenced in the file.

In `@scripts/test/e2e/cases/conftest.py`:
- Around line 116-118: The environment check for BOXLITE_E2E_SKIP_PATH_VERIFY is
currently treating any non-empty value as true; change the conditional to
explicitly test for canonical truthy values (e.g., lowercased value in
("1","true","yes")) so that explicit disables like "0" or "false" do not skip
verification; update the if statement around the yield/return in conftest.py to
read the env var, normalize it (lower()), and compare against the accepted true
tokens (or use a standard parser like distutils.util.strtobool) before deciding
to yield and return.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro Plus

Run ID: a9da2708-7499-4191-886f-7cadda734569

📥 Commits

Reviewing files that changed from the base of the PR and between 2d93cc3 and 0cdf723.

📒 Files selected for processing (5)
  • .github/workflows/README.md
  • .github/workflows/e2e-cloud.yml
  • .github/workflows/e2e-stack.yml
  • scripts/ci/setup-e2e-cloud-oidc.sh
  • scripts/test/e2e/cases/conftest.py
💤 Files with no reviewable changes (1)
  • .github/workflows/e2e-stack.yml

Comment thread .github/workflows/e2e-cloud.yml
Comment thread .github/workflows/e2e-cloud.yml
Comment thread scripts/test/e2e/cases/conftest.py Outdated

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

♻️ Duplicate comments (1)
scripts/test/e2e/cases/conftest.py (1)

116-118: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Use explicit truthy parsing for BOXLITE_E2E_SKIP_PATH_VERIFY.

Line 116 currently enables skip for any non-empty value, so 0/false still bypass verification. That contradicts the docstring’s “=1” contract.

Suggested minimal fix
-    if os.environ.get("BOXLITE_E2E_SKIP_PATH_VERIFY"):
+    if os.environ.get("BOXLITE_E2E_SKIP_PATH_VERIFY", "").strip().lower() in {"1", "true", "yes"}:
         yield
         return
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@scripts/test/e2e/cases/conftest.py` around lines 116 - 118, The
environment-var check currently treats any non-empty
BOXLITE_E2E_SKIP_PATH_VERIFY as truthy; change it to explicitly parse the
intended truthy value (per docstring) by testing for "1" (e.g.,
os.environ.get("BOXLITE_E2E_SKIP_PATH_VERIFY") == "1") so only "1" skips path
verification; update the conditional around the yield/return in conftest.py that
references BOXLITE_E2E_SKIP_PATH_VERIFY accordingly.
🧹 Nitpick comments (1)
.github/workflows/e2e-cloud.yml (1)

87-88: ⚡ Quick win

Consider pinning security-critical actions to commit SHAs.

Actions like aws-actions/configure-aws-credentials, actions/checkout, and aws-actions/amazon-ecr-login are security-sensitive (they handle credentials or code checkout). Pinning to commit SHAs instead of version tags prevents supply chain compromise if an action's repository is attacked.

At minimum, pin the credential-handling actions:

# Example for configure-aws-credentials
- uses: aws-actions/configure-aws-credentials@e3dd6a429d7300a6a4c196c26e071d42e0343502 # v4.0.2

Also applies to: 115-115, 118-118, 207-207, 229-229, 233-233, 464-464

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In @.github/workflows/e2e-cloud.yml around lines 87 - 88, The workflow uses
version tags for security-sensitive actions (e.g., actions/checkout,
dorny/paths-filter, aws-actions/configure-aws-credentials,
aws-actions/amazon-ecr-login) which is vulnerable to supply-chain attacks;
update each uses: entry for these actions to pin to a specific commit SHA
instead of a tag (replace e.g. actions/checkout@v5 with the corresponding commit
SHA for that release), doing this for all occurrences flagged (including the
configure-aws-credentials and amazon-ecr-login usages) so the workflow
references immutable commits.

Source: Linters/SAST tools

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Duplicate comments:
In `@scripts/test/e2e/cases/conftest.py`:
- Around line 116-118: The environment-var check currently treats any non-empty
BOXLITE_E2E_SKIP_PATH_VERIFY as truthy; change it to explicitly parse the
intended truthy value (per docstring) by testing for "1" (e.g.,
os.environ.get("BOXLITE_E2E_SKIP_PATH_VERIFY") == "1") so only "1" skips path
verification; update the conditional around the yield/return in conftest.py that
references BOXLITE_E2E_SKIP_PATH_VERIFY accordingly.

---

Nitpick comments:
In @.github/workflows/e2e-cloud.yml:
- Around line 87-88: The workflow uses version tags for security-sensitive
actions (e.g., actions/checkout, dorny/paths-filter,
aws-actions/configure-aws-credentials, aws-actions/amazon-ecr-login) which is
vulnerable to supply-chain attacks; update each uses: entry for these actions to
pin to a specific commit SHA instead of a tag (replace e.g. actions/checkout@v5
with the corresponding commit SHA for that release), doing this for all
occurrences flagged (including the configure-aws-credentials and
amazon-ecr-login usages) so the workflow references immutable commits.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro Plus

Run ID: 777a5470-4e26-49d6-a116-7147a3b7fed9

📥 Commits

Reviewing files that changed from the base of the PR and between 0cdf723 and 49a95e2.

📒 Files selected for processing (5)
  • .github/workflows/README.md
  • .github/workflows/e2e-cloud.yml
  • .github/workflows/e2e-stack.yml
  • scripts/ci/setup-e2e-cloud-oidc.sh
  • scripts/test/e2e/cases/conftest.py
💤 Files with no reviewable changes (1)
  • .github/workflows/e2e-stack.yml
✅ Files skipped from review due to trivial changes (1)
  • .github/workflows/README.md

…ype check)

`yarn nx build api --configuration=production` currently emits 160
TS errors against apps/api/src on main HEAD, blocking the e2e-cloud
image build. The type drift is in pre-existing code (organization /
webhook / user controllers + tracing.ts), unrelated to this PR's
scope. Fixing those errors is a separate concern.

`apps/api/project.json` already has a `development` build configuration
with `skipTypeChecking: true` — webpack still produces the runtime
bundle, just without the strict type-check gate.

Add a `NX_BUILD_CONFIG` build arg to apps/api/Dockerfile:
- Default `production` so production deploys are unchanged.
- e2e-cloud workflow passes `development` to bypass the type-check
  gate while still producing the same `dist/apps/api/main.js` bundle
  the ENTRYPOINT runs.

Runtime semantics are unchanged (same bundle, same entrypoint) — only
the build-time strictness differs.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (1)
apps/api/Dockerfile (1)

37-37: ⚡ Quick win

Use CACHE_BUST instead of a hardcoded epoch token.

ARG CACHE_BUST=1 is declared on Line 16, but Line 37 hardcodes echo 1777389170; this makes the arg ineffective and obscures cache invalidation intent.

Proposed change
-RUN echo 1777389170 && yarn nx build api --configuration=${NX_BUILD_CONFIG} --nxBail=true
+RUN echo ${CACHE_BUST} && yarn nx build api --configuration=${NX_BUILD_CONFIG} --nxBail=true
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@apps/api/Dockerfile` at line 37, The RUN line currently echoes a hardcoded
epoch token which ignores the declared ARG CACHE_BUST; update the RUN command in
the Dockerfile (the line invoking echo and yarn nx build api
--configuration=${NX_BUILD_CONFIG} --nxBail=true) to echo the CACHE_BUST build
arg instead (use $CACHE_BUST or ${CACHE_BUST}) so the declared ARG CACHE_BUST
takes effect for cache invalidation, and ensure ARG CACHE_BUST is present
earlier in the Dockerfile.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Nitpick comments:
In `@apps/api/Dockerfile`:
- Line 37: The RUN line currently echoes a hardcoded epoch token which ignores
the declared ARG CACHE_BUST; update the RUN command in the Dockerfile (the line
invoking echo and yarn nx build api --configuration=${NX_BUILD_CONFIG}
--nxBail=true) to echo the CACHE_BUST build arg instead (use $CACHE_BUST or
${CACHE_BUST}) so the declared ARG CACHE_BUST takes effect for cache
invalidation, and ensure ARG CACHE_BUST is present earlier in the Dockerfile.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro Plus

Run ID: e7f4cbe3-0a49-48c1-9444-cbff14532232

📥 Commits

Reviewing files that changed from the base of the PR and between 49a95e2 and 442950a.

📒 Files selected for processing (2)
  • .github/workflows/e2e-cloud.yml
  • apps/api/Dockerfile
🚧 Files skipped from review as they are similar to previous changes (1)
  • .github/workflows/e2e-cloud.yml

… build)

Previous attempt: --build-arg NX_BUILD_CONFIG=development bypassed the
strict TS type check but still ran webpack/vite builds. The API webpack
build succeeded under that config (skipTypeChecking: true), but the
dashboard vite build still failed on a tsconfig path-resolution issue
(`failed to resolve "extends":"../tsconfig.base.json"`). Even fixing
the dashboard issue would still incur the full webpack bundle cost,
which is wasted work for an e2e smoke test against deployed code.

Source-mode approach: new apps/api/Dockerfile.source that
- runs apps/api/src/main.ts directly via `yarn tsx ...` at container
  start (tsx strips types and transpiles on the fly; never type-checks)
- skips both `nx build api` and `nx build dashboard` entirely
- omits the apps/dashboard/ COPY altogether — e2e cases under
  scripts/test/e2e/cases/ hit the API directly, no dashboard
  static-file serving is exercised

tsconfig.base.json is symlinked to tsconfig.json so tsx auto-discovers
the `@boxlite-ai/*` path aliases that resolve to libs/*/src/index.ts.

Production deploys are unaffected: apps/api/Dockerfile (the production
Dockerfile) is unchanged; this is a sibling file used only by the
e2e-cloud workflow.

Trade-offs documented in apps/api/Dockerfile.source's header:
- slower cold start (~5-15s extra for tsx initial transpile)
- type errors surface at runtime instead of build time (acceptable
  for an e2e smoke against deployed code)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
.github/workflows/e2e-cloud.yml (2)

154-166: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Don't take TargetGroups[0] from the ALB.

This block asserts there is one matching load balancer, but it never asserts there is one attached target group. If another TG gets attached later, the deploy health gate can watch the wrong backend and either pass too early or fail a healthy rollout.

Suggested hardening
-          API_LB_TG_ARN=$(aws elbv2 describe-load-balancers \
-            --query "LoadBalancers[?starts_with(LoadBalancerName,'ApiLoadBalancer-')]|[0].LoadBalancerArn" \
-            --output text \
-            | xargs -I{} aws elbv2 describe-target-groups --load-balancer-arn {} \
-                --query "TargetGroups[0].TargetGroupArn" --output text)
+          API_LB_ARN=$(aws elbv2 describe-load-balancers \
+            --query "LoadBalancers[?starts_with(LoadBalancerName,'ApiLoadBalancer-')]|[0].LoadBalancerArn" \
+            --output text)
+          API_LB_TG_COUNT=$(aws elbv2 describe-target-groups \
+            --load-balancer-arn "$API_LB_ARN" \
+            --query "length(TargetGroups)" --output text)
+          assert_one "ApiLoadBalancer-* target group" "$API_LB_TG_COUNT"
+          API_LB_TG_ARN=$(aws elbv2 describe-target-groups \
+            --load-balancer-arn "$API_LB_ARN" \
+            --query "TargetGroups[0].TargetGroupArn" --output text)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In @.github/workflows/e2e-cloud.yml around lines 154 - 166, The script currently
reads TargetGroups[0] for API_LB_TG_ARN which can pick the wrong TG if multiple
exist; update the logic to query and count target groups attached to the ALB
(use aws elbv2 describe-target-groups --load-balancer-arn {} with a length(...)
query and store/check a TG_COUNT variable), call assert_one (or reuse existing
assert_one) to ensure exactly one target group is attached, and only then
retrieve the single TargetGroupArn into API_LB_TG_ARN (instead of unguarded
TargetGroups[0]); ensure you reference LB_COUNT, API_LB_DNS, API_LB_TG_ARN, and
the describe-target-groups call when making these changes.

42-47: ⚠️ Potential issue | 🟠 Major

Bind workflow_dispatch runs to the trusted e2e-cloud environment and avoid ambiguous ALB target-group selection.

  • jobs.e2e in .github/workflows/e2e-cloud.yml has no environment:; scripts/ci/setup-e2e-cloud-oidc.sh trust policy only allows ...:ref:refs/heads/main, ...:pull_request, or ...:environment:e2e-cloud, so workflow_dispatch on non-main branches won’t be able to assume the AWS role via OIDC.
Suggested fix
   e2e:
     name: E2E suite (Tokyo)
     needs: changes
+    environment: e2e-cloud
     # Skip on PRs that don't touch relevant paths. workflow_dispatch
     # always runs (operator explicitly asked for it).
     if: needs.changes.outputs.relevant == 'true' || github.event_name == 'workflow_dispatch'
  • In .github/workflows/e2e-cloud.yml (“Resolve stack resources”), api_lb_tg_arn is set using TargetGroups[0].TargetGroupArn without asserting that exactly one target group is returned for the matched ALB; this can silently bind to the wrong target group. Add a TargetGroups count/uniqueness assertion (or filter deterministically) before using index 0.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In @.github/workflows/e2e-cloud.yml around lines 42 - 47, Add environment:
e2e-cloud to the jobs.e2e definition so workflow_dispatch runs are bound to the
trusted e2e-cloud environment (matching the OIDC trust conditions in
scripts/ci/setup-e2e-cloud-oidc.sh); and in the “Resolve stack resources” step
where api_lb_tg_arn is derived from TargetGroups[0].TargetGroupArn,
assert/filter that the ALB returns exactly one matching TargetGroup (or
deterministically filter by Name/Tags) before using index 0 to avoid silently
selecting the wrong target group.
🧹 Nitpick comments (1)
apps/api/Dockerfile.source (1)

26-63: ⚡ Quick win

Run container as non-root user.

The container runs as root by default, which violates the principle of least privilege. Even in an e2e test environment, following security best practices reduces attack surface. The official Node.js image includes a pre-configured node user (UID 1000).

🔒 Proposed fix

Add the USER directive after installing dependencies but before copying application code:

 COPY apps/nx.json apps/tsconfig.base.json apps/NOTICE ./
 
 # tsx auto-discovers tsconfig.json in cwd — symlink so the path-alias
 # definitions in apps/tsconfig.base.json (the `@boxlite-ai/*` mappings)
 # resolve at import time without needing an explicit --tsconfig flag.
 RUN ln -sf tsconfig.base.json tsconfig.json
 
 ENV NX_DAEMON=false
 
+USER node
+
 COPY apps/api/ apps/api/
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@apps/api/Dockerfile.source` around lines 26 - 63, The Dockerfile runs as
root; update the boxlite-source stage to switch to the unprivileged node user
after installing dependencies but before copying application files:
create/ensure the /boxlite workdir is owned by node (chown -R node:node
/boxlite) and add a USER node (UID 1000) directive before the COPY apps/api/ ...
lines so subsequent COPY, HEALTHCHECK and ENTRYPOINT (yarn tsx
apps/api/src/main.ts) run as the non-root node user; keep the existing RUN
npm/yarn install steps as root then switch to node.

Source: Linters/SAST tools

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@apps/api/Dockerfile.source`:
- Line 26: Set a non-root runtime user and make the application directory
writable to fix Trivy DS-0002: in the Dockerfile.source ensure you create or
verify the /boxlite directory, chown it to the existing node user (e.g., chown
-R node:node /boxlite), and then add USER node before the ENTRYPOINT so the
container runs as the node user rather than root; reference the Dockerfile's
ENTRYPOINT, the /boxlite path, and the USER directive when applying the changes.

---

Outside diff comments:
In @.github/workflows/e2e-cloud.yml:
- Around line 154-166: The script currently reads TargetGroups[0] for
API_LB_TG_ARN which can pick the wrong TG if multiple exist; update the logic to
query and count target groups attached to the ALB (use aws elbv2
describe-target-groups --load-balancer-arn {} with a length(...) query and
store/check a TG_COUNT variable), call assert_one (or reuse existing assert_one)
to ensure exactly one target group is attached, and only then retrieve the
single TargetGroupArn into API_LB_TG_ARN (instead of unguarded TargetGroups[0]);
ensure you reference LB_COUNT, API_LB_DNS, API_LB_TG_ARN, and the
describe-target-groups call when making these changes.
- Around line 42-47: Add environment: e2e-cloud to the jobs.e2e definition so
workflow_dispatch runs are bound to the trusted e2e-cloud environment (matching
the OIDC trust conditions in scripts/ci/setup-e2e-cloud-oidc.sh); and in the
“Resolve stack resources” step where api_lb_tg_arn is derived from
TargetGroups[0].TargetGroupArn, assert/filter that the ALB returns exactly one
matching TargetGroup (or deterministically filter by Name/Tags) before using
index 0 to avoid silently selecting the wrong target group.

---

Nitpick comments:
In `@apps/api/Dockerfile.source`:
- Around line 26-63: The Dockerfile runs as root; update the boxlite-source
stage to switch to the unprivileged node user after installing dependencies but
before copying application files: create/ensure the /boxlite workdir is owned by
node (chown -R node:node /boxlite) and add a USER node (UID 1000) directive
before the COPY apps/api/ ... lines so subsequent COPY, HEALTHCHECK and
ENTRYPOINT (yarn tsx apps/api/src/main.ts) run as the non-root node user; keep
the existing RUN npm/yarn install steps as root then switch to node.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro Plus

Run ID: 094c7c9a-9051-4c60-ac33-d16a8ef0a9a2

📥 Commits

Reviewing files that changed from the base of the PR and between 442950a and 27f1315.

📒 Files selected for processing (2)
  • .github/workflows/e2e-cloud.yml
  • apps/api/Dockerfile.source

Comment thread apps/api/Dockerfile.source
G4614 and others added 10 commits June 10, 2026 22:50
`cargo build -p boxlite-runner` was wrong — runner is a Go binary
(apps/runner/cmd/runner/main.go), not a Rust crate. The Rust workspace
has no `boxlite-runner` package, hence cargo's
`error: package ID specification 'boxlite-runner' did not match any packages`.

Mirror .github/workflows/build-runner-binary.yml's approach:

- actions/setup-go@v5 with go-version 1.25.4
- Install libx11-dev + libxtst-dev + libxinerama-dev (computer-use links X11)
- Download prebuilt libboxlite.a from `v$(Cargo.toml version)` GitHub
  release — currently v0.9.5 ✓ released 2026-05-16
- Rewrite apps/go.work to include only the modules we need to build
- go mod download per module
- Build daemon (CGO_ENABLED=0) → runner/pkg/daemon/static/daemon-amd64
- Build computer-use (CGO_ENABLED=1) → runner/pkg/daemon/static/boxlite-computer-use
- Build runner (CGO_ENABLED=1, links libboxlite.a, embeds daemon and
  computer-use) → /tmp/boxlite-runner
- aws s3 cp /tmp/boxlite-runner s3://.../builds/<sha>/boxlite-runner

Pinning libboxlite.a to a released version means PR changes affecting
only apps/runner Go code are picked up. PR changes to sdks/c that
would require a fresh libboxlite.a are out of scope here — a follow-up
could either trigger a full C SDK build or just download from a more
recent release tag.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…y branch

Previous trust policy enumerated 3 specific subject patterns:
  - repo:.../ref:refs/heads/main
  - repo:.../pull_request
  - repo:.../environment:e2e-cloud

workflow_dispatch on a feature branch (e.g. chore/e2e-required-merge-gate)
produces subject `repo:.../ref:refs/heads/chore/e2e-required-merge-gate`,
which matched none of the 3. STS rejected with
"Not authorized to perform sts:AssumeRoleWithWebIdentity".

Switch to `repo:boxlite-ai/boxlite:*` — accepts any subject from this
specific repo. Scope is still bounded to a single repo (no other repos
can assume this role), and the role's IAM policy is least-privilege
(only the boxlite-e2e-ci stack resources). Trade-off: feature branches
can trigger e2e-cloud via workflow_dispatch, which is a feature here
(needed for iterative debugging) rather than a risk.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…te.a from source)

Runner binary rebuild via Go CGO links libboxlite.a, but the prebuilt
libboxlite.a from latest release (v0.9.5) doesn't have the
boxlite_rest_options_set_path_prefix symbol that apps/runner Go code
references — main has drifted ahead of release.

Linker error:
  undefined reference to `boxlite_rest_options_set_path_prefix'

Two paths to fix:
  (a) Build libboxlite.a from Rust source in the workflow (+10-15min,
      requires cross-build chain for musl + libkrun)
  (b) Skip runner rebuild — Tokyo runner EC2 already runs a working
      binary ABI-compatible with the deployed API

Going with (b) for now. The current Tokyo runner has been up 21+ hours
and passes the API↔runner handshake checks we observed earlier. PR
changes touching apps/runner Go code are not exercised by this
workflow as a result; that's a follow-up that can wire `make
dist:runner` (which builds libboxlite.a from source) into the build path.

Removed steps:
  - actions/setup-go@v5
  - Install runner build deps (libx11/libxtst/libxinerama)
  - Determine version + download libboxlite.a
  - Rewrite apps/go.work
  - Download Go modules per module
  - Build daemon / computer-use / boxlite-runner
  - Upload runner binary to S3
  - SSM Deploy runner binary to EC2

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…M needs decorator metadata)

Source-mode container with tsx crashed immediately at module load:
  /boxlite/node_modules/typeorm/decorator/src/decorator/columns/PrimaryColumn.ts:72
  ColumnTypeUndefinedError: Column type for User#id is not defined
  and cannot be guessed. Make sure you have turned on an
  "emitDecoratorMetadata": true option in tsconfig.json.

Cause: tsx uses esbuild under the hood, which has long-standing
non-support for `emitDecoratorMetadata: true`. TypeORM's @column /
@entity / @PrimaryColumn decorators rely on `reflect-metadata` calls
the compiler EMITS when that flag is set; without them, TypeORM can't
infer column types and bails before the app even starts.

Switch to ts-node, which uses the TypeScript compiler properly and
honors emitDecoratorMetadata from the referenced tsconfig:

  ENTRYPOINT ["yarn", "ts-node",
              "--project", "apps/api/tsconfig.app.json",
              "--transpile-only",
              "-r", "tsconfig-paths/register",
              "apps/api/src/main.ts"]

- --project: point at apps/api/tsconfig.app.json (where
  emitDecoratorMetadata and experimentalDecorators live).
- --transpile-only: skip type-checking at startup (faster cold boot;
  the prod webpack build is the source-of-truth type gate, and it's
  already known to fail on main with 160 errors — see Dockerfile.source
  header).
- -r tsconfig-paths/register: resolve `@boxlite-ai/*` path aliases
  defined in tsconfig.base.json at module-load time.

Both ts-node (10.9.1) and tsconfig-paths are already in apps/package.json
dependencies (used by the typeorm migration scripts), so no dep change.

Cold start is now ~10-30s instead of tsx's ~5-15s — acceptable for the
ALB health-check window (3 retries × 30s + delays = ~90s minimum). If
it still races, will follow up by extending the health check grace
period at the ECS service or task-def level.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The IAM action `ecs:ExecuteCommand` is evaluated against BOTH the task
ARN AND the cluster ARN. Granting it on the task alone produced:

  AccessDeniedException: not authorized to perform: ecs:ExecuteCommand
  on resource: arn:aws:ecs:.../cluster/boxlite-e2e-ci-ClusterCluster-...

Add the cluster pattern to the Resource list so the "Init admin-org
sandbox quota" step can exec into the Api task.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The "Init admin-org sandbox quota" workflow step execs into the Api
container and runs `psql` against RDS to seed sandbox quota. The
source-mode Dockerfile didn't install postgresql-client, so the step
failed with `sh: 1: psql: not found`. Add it to the apt install line.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Schema drift: post-deploy migration 1781016743403 renamed
max_{cpu,memory,disk}_per_sandbox → max_{cpu,memory,disk}_per_box on
the organization table. The quota-init UPDATE failed with:
  ERROR: column "max_cpu_per_sandbox" of relation "organization"
  does not exist

Rename all three column refs in the workflow to match the live schema.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The previous run actually applied the quota UPDATE successfully:
  UPDATE 1
  ... | 4 | 8192 | 20

But psql wrapped the row across multiple lines and triggered a
"--More--" pager prompt over the ECS Exec session, so the
'4 *\| *8192 *\| *20' grep didn't match.

Add `-A -t -P pager=off` and `PAGER=cat` so the SELECT row is a
single unaligned line "<uuid>|4|8192|20", then grep on a tighter
'\|4\|8192\|20$' anchor.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The quota row printed exactly as expected:
  4a1eef70-8734-4814-b7a5-0ca3522a8760|4|8192|20

but `grep -E '\|4\|8192\|20$'` still failed because SSM Session Manager
uses CRLF line endings, so the `$` anchor sees `\r` immediately before
end-of-line, never the literal `20`.

Pipe through `tr -d '\r'` first so the anchor matches reliably.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The bubblewrap-sys / libkrun-sys / e2fsprogs-sys crates have build.rs
scripts that vendor their C dependencies via git submodules. Without
the submodules initialized the build panics:

  bubblewrap vendor source not found at .../vendor/bubblewrap
  Initialize submodule: git submodule update --init --recursive

These C deps are runtime requirements of the runner binary (which
runs inside the EC2 host), NOT of the Python SDK client. The build.rs
files all honor BOXLITE_DEPS_STUB=1 and emit a no-op
`/nonexistent` linker hint — exactly the right behavior here.

Set the env var on the SDK build step so we don't need to also init
submodules or install meson/ninja on the GHA runner.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
G4614 added 4 commits June 12, 2026 12:23
…ass dead per-service role ARNs

The e2e-ci stack's per-service Task/Execution roles (ProxyTaskRole,
OtelCollectorTaskRole, SshGatewayTaskRole + their Execution pairs)
were deleted from IAM out-of-band — same drift as ApiTaskRole-
batkzzas before this PR's session recreated it. Live ECS TDs still
reference the dead ARNs, so new task launches fail at AssumeRole
NoSuchEntity.

Workaround: at deploy time, resolve Api's currently-live taskRoleArn +
executionRoleArn (those were recreated earlier with sst:app=boxlite
and trust ecs-tasks.amazonaws.com), and override the cloned TD's
roles to point at those. Both trust principals match any ECS service,
so Proxy / OtelCollector / SshGateway can each AssumeRole the Api
role. Least-privilege rough edge: borrowers get Api's S3 + log perms
they don't need — acceptable for e2e-ci (no prod traffic). Admin's
`sst deploy --stage e2e-ci` will recreate per-service roles
properly; we can drop this override then.
@G4614 G4614 force-pushed the chore/e2e-required-merge-gate branch 2 times, most recently from c07f7cd to 8b66ae1 Compare June 12, 2026 09:19
G4614 and others added 3 commits June 12, 2026 17:20
…w_call

Refactor the e2e-cloud workflow so the Api build+deploy chain is no
longer duplicated inline. Adds a deploy_api job that workflow_call's
.github/workflows/deploy-api.yml (mirroring how deploy_runner already
workflow_call's deploy-runner.yml). Both deploys are gated on actual
source changes via the existing `changes` paths-filter:
  - runner_chain → deploy_runner
  - api          → deploy_api
PRs touching only api skip the ~11 min runner build chain; PRs
touching only runner skip the Docker buildx + ECS register-TD +
wait-services-stable round-trip; PRs touching only tests skip both.

E2e job (pytest) now needs all three (changes + deploy_runner +
deploy_api) and reuses the same skip-if-failure-only gate
(!failure() && !cancelled() && any-relevant-change) so it still runs
when an upstream job was correctly skipped.

Dropped from the inline e2e job:
  - ECR login step + Build & push Api image step (deploy-api.yml)
  - Deploy Api (rolling) step incl. clone-TD + UpdateService + ALB
    healthy poll (deploy-api.yml does it + env patches for stale
    S3_ACCESS_KEY / storage-bucket name + role override).
  - SSM-agent-online preflight check on the runner (obsolete since
    deploy-runner switched to SSH+SCP via EC2 Instance Connect).
  - ECR repo describe preflight (now exercised inside deploy-api.yml).
Kept inline (still needed by pytest):
  - Resolve stack resources (cluster, ALB DNS/TG, runner id, S3 bucket).
  - /api/health probe through the LB before tests run.
  - Init admin-org sandbox quota via ECS Exec.
  - Python SDK build/install + pytest profile/snapshot setup + run.
  - On-failure CloudWatch + journalctl dump.

Net: -104 lines, one source of truth per deploy.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… PortSpec impl

merging upstream/main into the PR brought in #741's PortSpec-based
WithPort but the session's earlier stub (added to unblock the runner
build before #741 landed) remained in place, causing duplicate
declarations:
  ports redeclared in struct boxConfig
  WithPort redeclared in this block
  cannot use portMapping{} as PortSpec value in append

Remove the stub block: stub's ports field, portMapping struct, and
stub WithPort func. #741's PortSpec + WithPort + ports []PortSpec are
the canonical definitions.
G4614 added 9 commits June 12, 2026 18:38
The composed e2e-cloud pipeline previously deployed only Api +
Runner. Supporting ECS services (Proxy / OtelCollector / SshGateway)
went un-deployed in the CI loop — admin had to run `sst deploy
--stage e2e-ci` to update them. This composes deploy-app-services.yml
the same way as deploy-runner / deploy-api:

- `changes` job adds paths-filter outputs for proxy / otel_collector /
  ssh_gateway, plus an `appsvc` step that emits a comma-separated list
  of services whose source actually changed.
- New `deploy_app_services` job: workflow_call invokes deploy-app-
  services.yml, passing the changed-list via `services` input. Inner
  workflow's per-service gating then deploys only the touched ones.
- `e2e` job needs += deploy_app_services so pytest waits for the
  supporting services to come up.

Also: extend deploy-app-services.yml's workflow_call signature to
accept the `services` input (it already had the same input on
workflow_dispatch).

Net: pure runner / api / test PRs skip the 3 supporting deploys
entirely; otel-only PRs skip runner + api + the other 2 services.
…uota init

The previous step UPDATE'd organization WHERE id='4a1eef70-...' — a
UUID that happened to be the admin's default-org id in earlier e2e-ci
stack instances. After this PR's DB schema reset (required to recover
from #736's pre-launch migration squash being deployed against a stack
whose history was already past those migrations), the app re-runs
initializeAdminUser at boot and creates a new admin user + auto-
generates a new default-org UUID. The hardcoded id no longer matches,
UPDATE touches 0 rows, SELECT returns empty, the step fails.

Switch the SQL to resolve the admin's default org via a subquery
against organization_user (userId='boxlite-admin' AND
isDefaultForUser=true) — that membership row is what app code already
uses to find the admin's working org, and survives any future schema
reset where the org UUID changes.
Merging the substantive bits of #745 and #746 here so PR #724's
deploy-app-services CI can actually push an otel-collector image:

* otel-collector/builder-config.yaml: fix module paths after #730
  moved everything under apps/ (was: ../../../api-client-go etc.,
  now: ../../../apps/api-client-go). The OCB build was failing with
  `filepath does not exist: /boxlite/otel-collector/exporter`. (#745)
* otel-collector/Dockerfile: apk add python3 make g++ so node-gyp
  can compile any musl-incompatible native add-on during workspace
  `yarn install`. Kept even after dropping dockerode below — small
  safety net in case future deps reintroduce a native build. (#745)
* apps/package.json: drop unused dockerode + @types/dockerode. They
  were inherited from the Daytona import in #460 but never imported
  in source. Drops the dockerode -> docker-modem -> ssh2 ->
  cpu-features chain; cpu-features was the optional native addon
  that forced the toolchain workaround in the first place. (#746)
* apps/yarn.lock: regenerated (-161 lines).
* apps/api-client-go/api/openapi.yaml: regenerate to catch up with
  main's existing `assignedRoleIds.default: []` drift so the
  api-client-drift PR check passes.
… gone

The previous commit kept #745's apk add as a safety net, but cpu-features
was the only musl-incompatible native dep in the workspace and it came
in only via the now-removed dockerode chain. Verified locally: yarn
install on node:22-alpine completes without any compiler toolchain.
PR-on-open triggering serializes every PR against the shared Tokyo
stack (see concurrency: e2e-cloud-shared, cancel-in-progress: false).
The suite is slow enough that gating every PR open / push on it just
queues runs behind each other for hours. Push-to-main keeps the
post-merge regression signal; workflow_dispatch covers ad-hoc PR
validation when wanted.
Today an OtelCollector deploy on chore/e2e-required-merge-gate failed
with "ECS rolled back. PRIMARY=...:7 expected ...:6" — two parallel
Deploy App Services runs (push + dispatch on the same SHA) both
registered task definitions, and one's UpdateService rolled the other
back.

Add per-workflow concurrency groups against the shared Tokyo stack
to prevent the race. cancel-in-progress: false because a half-applied
ECS rolling update / EC2 userdata swap is worse than waiting.

- deploy-api.yml: group=deploy-api-shared
- deploy-app-services.yml: group=deploy-app-services-shared
- deploy-runner.yml: group=deploy-runner-shared
- _deploy-single-service.yml: group=deploy-single-service-<service_name>
  (parameterized so different services parallelize but same-service
  callers serialize — belt-and-suspenders against the outer group)
Resolve stack resources was picking the API LB target group via
TargetGroups[0] without checking how many target groups were attached.
Blue/green deploys or extra listener rules can attach more than one TG
to an ALB; in that case the e2e suite would silently bind smoke traffic
to the wrong TG. Use the same assert_one guard as the cluster and LB
checks above. Hoist the LB ARN into a variable so the count + the
final lookup share one resolution.
…002)

CodeRabbit / Trivy DS-0002 flagged the source-mode container running as
root. Switch to the pre-existing `node` user (UID 1000 in node:24-slim)
right before HEALTHCHECK + ENTRYPOINT. All RUN/COPY happen as root above,
so the installed tree is read-only at runtime — least-privilege without
needing a chown -R.
@G4614

G4614 commented Jun 12, 2026

Copy link
Copy Markdown
Contributor Author

@coderabbitai full review

@coderabbitai

coderabbitai Bot commented Jun 12, 2026

Copy link
Copy Markdown
✅ Action performed

Full review finished.

G4614 added 7 commits June 12, 2026 12:23
Bare `if os.environ.get(...)` is truthy for any non-empty string, so
setting the var to "0" or "false" — the conventional "off" — still
skipped the runner-journal verification. Accept the standard truthy
spellings only.
Previously the e2e-cloud.yml pipeline glued deploy_{runner,api,app_services}
and the pytest suite into one workflow_dispatch entry, so the only
way to "rerun the tests" was to also redeploy everything (workflow_dispatch
forces all three deploy gates true). With the Tokyo stack as a shared
singleton that gets left in whatever state the last run produced, this
made cheap "rerun against current stack" cycles much more expensive
than they need to be.

* New: .github/workflows/e2e-cloud-test.yml. workflow_call + workflow_dispatch.
  Contains only the e2e job — resolve resources, probe LB, init
  admin-org quota, build SDK, configure pytest, run suite, collect
  logs on failure. Shares the e2e-cloud-shared concurrency group so a
  standalone test dispatch and an in-flight deploy queue against each
  other (can't race the Tokyo stack).
* e2e-cloud.yml: e2e job is now a thin `uses:` wrapper around the new
  workflow. Gating logic (skip-if-upstream-failed, paths-changed-only-
  on-push) is preserved at the wrapper level; standalone dispatch on
  e2e-cloud-test.yml skips those gates and just runs.

Header on e2e-cloud-test.yml warns explicitly that the stack state is
the dispatcher's responsibility — results reflect whatever's deployed,
not the branch the workflow was dispatched from.
The single `runner_source` filter triggered the full ~10 min libboxlite
Rust build on any path in the runner's dep graph, including Go-only
diffs (apps/runner, apps/daemon, sdks/go, ...) that just relink against
the existing libboxlite.a without changing it. Split into two filters:

* libboxlite_chain: paths whose content ends up in libboxlite.a
  (src/boxlite, src/shared, src/deps, sdks/c, scripts/build, Cargo.{toml,lock})
* go_runner_chain: Go paths that consume libboxlite.a but don't
  contribute to its bytes (apps/{runner,daemon,common-go,api-client-go,
  libs/computer-use}, sdks/go)

`build_c_sdk` now runs only when libboxlite_chain changed.
`build_runner` switches libboxlite_source to "release" (pull
v${VERSION} tarball from GitHub Release) when libboxlite_chain didn't
change, "build" (use sibling build_c_sdk artifact) otherwise.
`build_runner`'s `if:` adds the !failure() && !cancelled() guard so
the skipped build_c_sdk on Go-only diffs doesn't cascade-skip it.

Also drop the dead `src/api-client/**` path from both deploy-runner.yml
and e2e-cloud.yml — that directory doesn't exist in the repo, so the
entry never matched anything.

workflow_dispatch / workflow_call behavior is unchanged — both still
force libboxlite_changed=true and rebuild from source. Only push-event
auto-triggering is optimized.
…layers

The bare ${SQL} interpolation went through GHA bash → aws ecs --command
→ sh -c → psql -c — four layers of double-quote parsing. The SQL's own
PG identifier quotes ("organizationId", "organization_user", etc.)
collided with the inner `\"...\"` bounds and got stripped, so psql
received the bare token `organizationId` which Postgres case-folds to
`organizationid` and reports as missing.

Encode the SQL to base64 on the runner (base64 chars never need quoting),
inline that into the --command string, then `base64 -d | psql -f -`
inside the container. psql sees the literal SQL bytes unchanged.

Last dispatched e2e-cloud run failed exactly here; the column is in
the schema (migrations/1741087887225-migration.ts:64), the bug was
purely shell-escape.
Drop this block before merging to main. GH won't expose
workflow_dispatch on a brand-new workflow until it has run at
least once; the push trigger (scoped to this branch and to
self-modifications only) gives us that first run.
POST /api/snapshots and GET /api/snapshots?name=... no longer exist
on the API — there is no snapshot controller in apps/api/src/**.
PR #735 (feat(box): create boxes from curated images) replaced the
pre-register-then-boot model with lazy curated-image pulls at box
start, so the e2e test workflow no longer needs to seed snapshot rows.

The pre-registration block was failing the "Configure pytest profile
p1" step with `Cannot POST /api/snapshots` 404. The previous run got
masked by the SQL-escape bug in quota init (now fixed); this surfaced
as the next blocker.
G4614 added a commit that referenced this pull request Jun 13, 2026
waitForToolboxReady was introduced in fc88aa0 (2026-06-05,
"feat: add agent-ready runtime catalog", which then landed on main
via PR #715 on 2026-06-10). It HTTP-polls http://127.0.0.1:<hostPort>/version
expecting a daemon on guest TCP port 2280 — that interface dates from
the Daytona-daemon era and was never reimplemented after the Rust
guest agent rewrite landed in dbb11ec (2026-04-01). The new agent
binds **vsock://2695** for gRPC + notifies the host via vsock://2696;
nothing inside the VM listens on TCP:2280, so libkrun's port-forward
accepts the SYN and immediately reset-by-peer's, and every CREATE_BOX
times out 30 s in.

Production data from a Tokyo runner: in 24 h, 490 CREATE_BOX events,
0 toolbox-ready successes, 181 toolbox-ready failures. The exec path
that fires immediately afterward (via the same vsock gRPC channel)
**always succeeds** — confirming the box VM is healthy, the readiness
check itself is the bug.

Remove the dead probe: drop waitForToolboxReady from client.go's Create
and Start, drop the function + its TCP/HTTP imports, drop the
toolboxReadyTimeout field, drop the ToolboxReadyTimeout/
DaemonStartTimeoutSec config plumbing in main.go + config.go, drop
the two now-unreachable tests. Box readiness is now signalled by
bx.Start(ctx) returning (which itself blocks on the vsock notification
from the guest).

Branched off chore/e2e-required-merge-gate (PR #724) so the e2e-cloud
stack picks this up next dispatch.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants