ci(e2e): make e2e suite a required merge gate (stacks on #678) by G4614 · Pull Request #680 · boxlite-ai/boxlite

G4614 · 2026-06-08T14:26:04Z

e2e force check

Problem

PR #678 ships the e2e suite as a workflow with `paths:` filter on its triggers. PRs that don't touch any of those paths cause the workflow not to fire — and GitHub branch protection's "required status check" expects a result on every PR. A required path-filtered check leaves PRs stuck in "pending checks" forever.

So today #678's e2e suite can be added to CI, but can't be a hard merge gate without one of these:

Add a fall-through workflow that always succeeds on irrelevant paths (separate file, name collision risk)
Drop the path filter and pay full e2e cost on every PR (~30 min on kvm runner)
Restructure into a single always-runs gate job (what this PR does)

Fix

Reshape the workflow into 3 jobs:

```
┌─────────────────┐ ┌─────────────────────────┐ ┌─────────────────────────┐
│ changes │──▶│ e2e (kvm) │──▶│ e2e-gate (required) │
│ ubuntu-latest │ │ self-hosted kvm runner │ │ ubuntu-latest │
│ paths-filter │ │ if: relevant or │ │ if: always() │
│ ~10s │ │ workflow_dispatch │ │ ~5s │
└─────────────────┘ └─────────────────────────┘ └─────────────────────────┘
```

`changes`: cheap GH-hosted path-filter check. Outputs `relevant=true|false`.
`e2e`: same expensive job test(e2e): SDK→API→Runner→VM regression suite #678 already has. Only runs when `changes` says relevant (or operator did `workflow_dispatch`). On irrelevant PRs the kvm runner is not consumed.
`e2e-gate`: always runs, collapses outcome into one stable check name. Pass if (no relevant paths) OR (e2e passed). Fail if e2e ran and failed.

Branch protection can require `E2E gate (required)` and it'll always report a result.

After merge

Repo admin needs to do once:

Settings → Branches → `main` → Edit → Require status checks to pass before merging → add `E2E gate (required)`

The workflow change can't do this itself (GitHub repo settings are out of band).

Cost

PR type	Before this PR	After this PR
Touches relevant paths	full e2e (~10-30 min, kvm)	full e2e + ~15s extra
Doesn't touch	workflow never runs, check missing → can't be required	~10s changes + ~5s gate on GH-hosted, no kvm consumed

Tradeoff vs the simpler "drop paths filter and just always run"

Always-run on every PR = self-hosted kvm runner saturated by doc-only PRs. Split into changes-gate-execute lets path filter live in code while still satisfying branch protection's "always reports" requirement.

🤖 Generated with Claude Code

coderabbitai · 2026-06-08T14:26:12Z

Important

Review skipped

Draft detected.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro Plus

Run ID: 5736e958-692e-4877-8922-a777f3e1a5e2

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Use the checkbox below for a quick retry:

🔍 Trigger review

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

cla-assistant · 2026-06-10T07:43:16Z

All committers have signed the CLA.

@v5

The previous e2e-stack.yml bootstrapped a full local stack (postgres, redis, docker registry, boxlite-runner systemd, boxlite-api) inside a self-hosted KVM runner per PR — slow to maintain (bootstrap drift, KVM runner contention) and only covers self-bootstrap state. This workflow keeps the same 3-job structure introduced by PR boxlite-ai#680 (changes / e2e / e2e-gate) so the `E2E gate (required)` branch protection check is unchanged, but replaces the `e2e` job: - runs-on: ubuntu-latest (no /dev/kvm needed) - authenticates to AWS via OIDC using the new boxlite-e2e-cloud-github-actions role - builds the Api container image + boxlite-runner musl binary from this checkout - deploys to the always-on Tokyo stack: rolling-updates the Api ECS service to the new image and SSM-replaces the runner EC2's binary - initialises the admin-org sandbox quota row in RDS via ECS Exec into the Api task (DB_PASSWORD stays in container env — never in the SSM command body) - builds & installs the Python SDK from this checkout (path_prefix and other source-tree additions land ahead of the PyPI release) - runs `pytest scripts/test/e2e/cases/` against http://<api-lb>/api with BOXLITE_E2E_SKIP_PATH_VERIFY=1 (journalctl-based path verify doesn't reach the remote runner) - on failure, dumps Api CloudWatch logs + boxlite-runner journalctl Concurrency uses a constant group (`e2e-cloud-shared`) because every run — push + PR — competes for the same singleton Tokyo stack. Per-ref grouping would let two PRs interleave ECS rolling updates. Resource discovery (cluster id, LB DNS, runner instance id, S3 bucket, ECR repo) is by-pattern at workflow start, with a strict count=1 assertion so an orphaned `ApiLoadBalancer-*` from a partial SST teardown fails the job loudly instead of silently binding to the wrong target. SSM agent ping-status is verified before the runner update step to fail fast on a dead agent. Adversarial review (multiple lenses) caught and this commit addresses: - credentials.toml written via printf rather than unquoted heredoc so a `$` inside the admin key can't be shell-expanded; ADMIN_KEY immediately registered via `::add-mask::` - ECS execute-command output is captured and grepped for the expected SELECT row (Session Manager exit-code propagation has historically been unreliable through --interactive without a tty) - `wait services-stable` is followed by ALB target-health polling AND a /api/health curl with retry — services-stable alone races the LB health check - pytest wrapped in `timeout 35m` plus `--timeout=180 --timeout-method=thread` so a hung VM doesn't burn the 45-min job timeout before `Collect logs on failure` runs - Rust build cached via Swatinem/rust-cache@v2 so repeat PR builds are incremental - both actions/checkout occurrences bumped to @v5 to match the newer workflows in the repo Known limits (TODO/follow-up, not blocking): - boxlite-runner has no documented in-flight drain; restart drops any request landing during the swap window. Acceptable for e2e (no concurrent users) but a production-grade rollout would need a runner-side graceful drain. - No post-run restore of `main` HEAD on the Tokyo stack: the stack is left running the last e2e-cloud run's image. Console inspection of `api-<sha>` identifies the running revision. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…visioned PR boxlite-ai#680's original goal was to make the e2e suite a required merge gate. With the workflow now retargeted at the always-on Tokyo stack (previous commit), keeping `e2e-gate` as `(required)` would block every PR until the Tokyo IAM OIDC role + SSM parameter + admin-org quota are provisioned — a chicken-and-egg. Removing the gate job for now. The two-job structure (changes → e2e) still works for `workflow_dispatch` trial runs and for PR CI feedback (non-blocking). Re-add an `e2e-gate` job and the corresponding branch protection rule once a first green run is observed. Header comment updated to flag the deferred-required-check status so future contributors know why the workflow exists without being gated. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

@v5

The previous e2e-stack.yml bootstrapped a full local stack (postgres, redis, docker registry, boxlite-runner systemd, boxlite-api) inside a self-hosted KVM runner per PR — slow to maintain (bootstrap drift, KVM runner contention) and only covers self-bootstrap state. This workflow keeps the same 3-job structure introduced by PR boxlite-ai#680 (changes / e2e / e2e-gate) so the `E2E gate (required)` branch protection check is unchanged, but replaces the `e2e` job: - runs-on: ubuntu-latest (no /dev/kvm needed) - authenticates to AWS via OIDC using the new boxlite-e2e-cloud-github-actions role - builds the Api container image + boxlite-runner musl binary from this checkout - deploys to the always-on Tokyo stack: rolling-updates the Api ECS service to the new image and SSM-replaces the runner EC2's binary - initialises the admin-org sandbox quota row in RDS via ECS Exec into the Api task (DB_PASSWORD stays in container env — never in the SSM command body) - builds & installs the Python SDK from this checkout (path_prefix and other source-tree additions land ahead of the PyPI release) - runs `pytest scripts/test/e2e/cases/` against http://<api-lb>/api with BOXLITE_E2E_SKIP_PATH_VERIFY=1 (journalctl-based path verify doesn't reach the remote runner) - on failure, dumps Api CloudWatch logs + boxlite-runner journalctl Concurrency uses a constant group (`e2e-cloud-shared`) because every run — push + PR — competes for the same singleton Tokyo stack. Per-ref grouping would let two PRs interleave ECS rolling updates. Resource discovery (cluster id, LB DNS, runner instance id, S3 bucket, ECR repo) is by-pattern at workflow start, with a strict count=1 assertion so an orphaned `ApiLoadBalancer-*` from a partial SST teardown fails the job loudly instead of silently binding to the wrong target. SSM agent ping-status is verified before the runner update step to fail fast on a dead agent. Adversarial review (multiple lenses) caught and this commit addresses: - credentials.toml written via printf rather than unquoted heredoc so a `$` inside the admin key can't be shell-expanded; ADMIN_KEY immediately registered via `::add-mask::` - ECS execute-command output is captured and grepped for the expected SELECT row (Session Manager exit-code propagation has historically been unreliable through --interactive without a tty) - `wait services-stable` is followed by ALB target-health polling AND a /api/health curl with retry — services-stable alone races the LB health check - pytest wrapped in `timeout 35m` plus `--timeout=180 --timeout-method=thread` so a hung VM doesn't burn the 45-min job timeout before `Collect logs on failure` runs - Rust build cached via Swatinem/rust-cache@v2 so repeat PR builds are incremental - both actions/checkout occurrences bumped to @v5 to match the newer workflows in the repo Known limits (TODO/follow-up, not blocking): - boxlite-runner has no documented in-flight drain; restart drops any request landing during the swap window. Acceptable for e2e (no concurrent users) but a production-grade rollout would need a runner-side graceful drain. - No post-run restore of `main` HEAD on the Tokyo stack: the stack is left running the last e2e-cloud run's image. Console inspection of `api-<sha>` identifies the running revision. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…visioned PR boxlite-ai#680's original goal was to make the e2e suite a required merge gate. With the workflow now retargeted at the always-on Tokyo stack (previous commit), keeping `e2e-gate` as `(required)` would block every PR until the Tokyo IAM OIDC role + SSM parameter + admin-org quota are provisioned — a chicken-and-egg. Removing the gate job for now. The two-job structure (changes → e2e) still works for `workflow_dispatch` trial runs and for PR CI feedback (non-blocking). Re-add an `e2e-gate` job and the corresponding branch protection rule once a first green run is observed. Header comment updated to flag the deferred-required-check status so future contributors know why the workflow exists without being gated. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

PR boxlite-ai#678 wires up an e2e suite but its workflow only triggers on PRs that touch specific paths. GitHub branch protection requires a check to ALWAYS report a result — a path-filtered workflow that doesn't fire leaves the required check pending forever, so the PR can't merge. Reshape so e2e can be a required check: 1. Drop `paths:` from the trigger — workflow always starts on every PR to main. Inside, a cheap `changes` job on a GitHub-hosted runner does the path-filter detection via dorny/paths-filter@v3. 2. The expensive `e2e` job (self-hosted kvm runner, ~10-30 min) stays gated on `changes.outputs.relevant == true`. PRs that don't touch relevant paths skip the kvm runner entirely — no contention cost. 3. New `e2e-gate` job ALWAYS runs (`if: always()`) and collapses the conditional outcome into one stable check name `E2E gate (required)`. Reports success if e2e passed OR no relevant paths changed; fail if e2e ran and failed. Branch protection should be configured (separately, via repo Settings → Branches → main) to require the `E2E gate (required)` status check. That's a repo-settings action a maintainer needs to do once after this merges. Cost on irrelevant PRs: one ~10s GitHub-hosted job for `changes` + ~5s for `e2e-gate`. Negligible. No self-hosted runner consumed. Cost on relevant PRs: same as before (full e2e on kvm runner). Stacks on boxlite-ai#678 — this commit only makes sense once that PR's e2e-stack.yml is in main. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The autouse `verify_runner_saw_all_boxes` fixture shells out to `journalctl -u boxlite-runner`, which only resolves on a host where the runner systemd service is installed (the current self-hosted KVM runner). When the e2e suite migrates to a GitHub-hosted runner pointing at a remote stack (Tokyo Api/Runner deployment), journalctl returns zero hits and every box-creating test fails on the path-verify assert even though the box actually reached the remote runner. Setting BOXLITE_E2E_SKIP_PATH_VERIFY=1 short-circuits the fixture (yield-and-return before the journalctl probe). The cloud-CI workflow will set this; the self-hosted KVM job keeps the guard. The FFI-bypass risk this guard defends against doesn't apply on a stock GHA runner — no KVM means libkrun can't start a VM, so a silent fallback would surface as a hard error, not as a passing-but-wrong test. Losing the guard in that environment loses no real safety net. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Adds scripts/ci/setup-e2e-cloud-oidc.sh — sibling to setup-ci-runner.sh but for the cloud-stack regression suite (`.github/workflows/e2e-cloud.yml`, added in the next commit). A separate IAM role (`boxlite-e2e-cloud-github-actions`) is needed because the existing `boxlite-e2e-github-actions` role is tuned for the us-east-1 KVM-runner provisioning path. Cloud e2e against the ap-northeast-1 Tokyo stack needs an orthogonal permission set — ECR push, ECS update-service + execute-command, SSM send-command to a specific runner EC2, S3 builds/ artifact upload, parameter-store reads. Splitting the roles keeps each one auditable under least-privilege. Scoping: - Trust policy `sub` is an enumerated allowlist (main pushes, pull_request, environment:e2e-cloud) — a stray workflow added on a feature branch can't assume the role. - ssm:SendCommand is gated by `ssm:resourceTag/Name=boxlite-runner` so the role can only shell the runner EC2. - ecs:UpdateService restricted to boxlite-e2e-ci-*/Api service ARN; ecs:ExecuteCommand to boxlite-e2e-ci-*/* task ARNs. - Resource wildcards remaining (ecr:GetAuthorizationToken, ssmmessages:*, ecs:Describe*, elbv2:Describe*) are forced by AWS not supporting resource-level scoping for those actions. Sources scripts/common.sh for the shared log helpers so this script's output matches setup-ci-runner.sh. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

@v5

The previous e2e-stack.yml bootstrapped a full local stack (postgres, redis, docker registry, boxlite-runner systemd, boxlite-api) inside a self-hosted KVM runner per PR — slow to maintain (bootstrap drift, KVM runner contention) and only covers self-bootstrap state. This workflow keeps the same 3-job structure introduced by PR boxlite-ai#680 (changes / e2e / e2e-gate) so the `E2E gate (required)` branch protection check is unchanged, but replaces the `e2e` job: - runs-on: ubuntu-latest (no /dev/kvm needed) - authenticates to AWS via OIDC using the new boxlite-e2e-cloud-github-actions role - builds the Api container image + boxlite-runner musl binary from this checkout - deploys to the always-on Tokyo stack: rolling-updates the Api ECS service to the new image and SSM-replaces the runner EC2's binary - initialises the admin-org sandbox quota row in RDS via ECS Exec into the Api task (DB_PASSWORD stays in container env — never in the SSM command body) - builds & installs the Python SDK from this checkout (path_prefix and other source-tree additions land ahead of the PyPI release) - runs `pytest scripts/test/e2e/cases/` against http://<api-lb>/api with BOXLITE_E2E_SKIP_PATH_VERIFY=1 (journalctl-based path verify doesn't reach the remote runner) - on failure, dumps Api CloudWatch logs + boxlite-runner journalctl Concurrency uses a constant group (`e2e-cloud-shared`) because every run — push + PR — competes for the same singleton Tokyo stack. Per-ref grouping would let two PRs interleave ECS rolling updates. Resource discovery (cluster id, LB DNS, runner instance id, S3 bucket, ECR repo) is by-pattern at workflow start, with a strict count=1 assertion so an orphaned `ApiLoadBalancer-*` from a partial SST teardown fails the job loudly instead of silently binding to the wrong target. SSM agent ping-status is verified before the runner update step to fail fast on a dead agent. Adversarial review (multiple lenses) caught and this commit addresses: - credentials.toml written via printf rather than unquoted heredoc so a `$` inside the admin key can't be shell-expanded; ADMIN_KEY immediately registered via `::add-mask::` - ECS execute-command output is captured and grepped for the expected SELECT row (Session Manager exit-code propagation has historically been unreliable through --interactive without a tty) - `wait services-stable` is followed by ALB target-health polling AND a /api/health curl with retry — services-stable alone races the LB health check - pytest wrapped in `timeout 35m` plus `--timeout=180 --timeout-method=thread` so a hung VM doesn't burn the 45-min job timeout before `Collect logs on failure` runs - Rust build cached via Swatinem/rust-cache@v2 so repeat PR builds are incremental - both actions/checkout occurrences bumped to @v5 to match the newer workflows in the repo Known limits (TODO/follow-up, not blocking): - boxlite-runner has no documented in-flight drain; restart drops any request landing during the swap window. Acceptable for e2e (no concurrent users) but a production-grade rollout would need a runner-side graceful drain. - No post-run restore of `main` HEAD on the Tokyo stack: the stack is left running the last e2e-cloud run's image. Console inspection of `api-<sha>` identifies the running revision. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

- Add `e2e-cloud.yml` section to .github/workflows/README.md mirroring the structure of the e2e-test.yml section (Why / Architecture / Triggers / Cost / Authentication / Required variables / Required AWS resources / Concurrency / Stack state). - Side-by-side table contrasts the two e2e workflows' IAM roles, regions, and scopes so future maintainers can tell at a glance why both exist. - Fix line 198 which referenced `scripts/ci/setup-aws-oidc.sh` — that filename does not exist; the actual provisioning script is `scripts/ci/setup-ci-runner.sh`. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…visioned PR boxlite-ai#680's original goal was to make the e2e suite a required merge gate. With the workflow now retargeted at the always-on Tokyo stack (previous commit), keeping `e2e-gate` as `(required)` would block every PR until the Tokyo IAM OIDC role + SSM parameter + admin-org quota are provisioned — a chicken-and-egg. Removing the gate job for now. The two-job structure (changes → e2e) still works for `workflow_dispatch` trial runs and for PR CI feedback (non-blocking). Re-add an `e2e-gate` job and the corresponding branch protection rule once a first green run is observed. Header comment updated to flag the deferred-required-check status so future contributors know why the workflow exists without being gated. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

AWS IAM CreateRole rejects U+2014 (em-dash) in the --description argument with a validation error. Replace the em-dash with an ASCII hyphen so the script runs cleanly during bootstrap. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

… gap PR boxlite-ai#680 is sourced from a fork (G4614/boxlite). GitHub Actions does NOT pass repository variables (the `vars` context) to workflow runs triggered by `pull_request` from external forks — security rule to prevent fork PRs from reading upstream secrets/vars. Result: even though `gh variable set AWS_E2E_CLOUD_REGION` and `AWS_E2E_CLOUD_ROLE_ARN` succeeded on boxlite-ai/boxlite, the workflow run on this PR's head saw both as empty strings, and `aws-actions/configure-aws-credentials@v4` failed at input validation with "Input required and not supplied: aws-region" before any OIDC handshake could occur. Fix: hardcode the values in the workflow's env: block. Both are acceptable to commit: - `ap-northeast-1`: region name, public infrastructure information. - IAM role ARN: contains the AWS account ID (12 digits). AWS docs classify account IDs as "sensitive, not secret"; the role's safety is enforced by its trust policy (which restricts AssumeRole to GitHub OIDC tokens with sub matching `repo:boxlite-ai/boxlite:*` patterns), not by hiding the ARN. Industry-standard pattern in OSS CI infrastructure. After this PR merges, future PRs sourced from same-repo branches WILL see the repo vars again, but the hardcoded values remain the source of truth. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

G4614 · 2026-06-10T14:09:52Z

Superseded by #724 — same content but as a same-repo PR (not from fork) so the e2e-cloud workflow's OIDC handshake can actually run. GitHub strips id-token: write permission from workflow runs triggered by PRs from forks, which made #680's CI structurally unable to pass.

@v5

The previous e2e-stack.yml bootstrapped a full local stack (postgres, redis, docker registry, boxlite-runner systemd, boxlite-api) inside a self-hosted KVM runner per PR — slow to maintain (bootstrap drift, KVM runner contention) and only covers self-bootstrap state. This workflow keeps the same 3-job structure introduced by PR #680 (changes / e2e / e2e-gate) so the `E2E gate (required)` branch protection check is unchanged, but replaces the `e2e` job: - runs-on: ubuntu-latest (no /dev/kvm needed) - authenticates to AWS via OIDC using the new boxlite-e2e-cloud-github-actions role - builds the Api container image + boxlite-runner musl binary from this checkout - deploys to the always-on Tokyo stack: rolling-updates the Api ECS service to the new image and SSM-replaces the runner EC2's binary - initialises the admin-org sandbox quota row in RDS via ECS Exec into the Api task (DB_PASSWORD stays in container env — never in the SSM command body) - builds & installs the Python SDK from this checkout (path_prefix and other source-tree additions land ahead of the PyPI release) - runs `pytest scripts/test/e2e/cases/` against http://<api-lb>/api with BOXLITE_E2E_SKIP_PATH_VERIFY=1 (journalctl-based path verify doesn't reach the remote runner) - on failure, dumps Api CloudWatch logs + boxlite-runner journalctl Concurrency uses a constant group (`e2e-cloud-shared`) because every run — push + PR — competes for the same singleton Tokyo stack. Per-ref grouping would let two PRs interleave ECS rolling updates. Resource discovery (cluster id, LB DNS, runner instance id, S3 bucket, ECR repo) is by-pattern at workflow start, with a strict count=1 assertion so an orphaned `ApiLoadBalancer-*` from a partial SST teardown fails the job loudly instead of silently binding to the wrong target. SSM agent ping-status is verified before the runner update step to fail fast on a dead agent. Adversarial review (multiple lenses) caught and this commit addresses: - credentials.toml written via printf rather than unquoted heredoc so a `$` inside the admin key can't be shell-expanded; ADMIN_KEY immediately registered via `::add-mask::` - ECS execute-command output is captured and grepped for the expected SELECT row (Session Manager exit-code propagation has historically been unreliable through --interactive without a tty) - `wait services-stable` is followed by ALB target-health polling AND a /api/health curl with retry — services-stable alone races the LB health check - pytest wrapped in `timeout 35m` plus `--timeout=180 --timeout-method=thread` so a hung VM doesn't burn the 45-min job timeout before `Collect logs on failure` runs - Rust build cached via Swatinem/rust-cache@v2 so repeat PR builds are incremental - both actions/checkout occurrences bumped to @v5 to match the newer workflows in the repo Known limits (TODO/follow-up, not blocking): - boxlite-runner has no documented in-flight drain; restart drops any request landing during the swap window. Acceptable for e2e (no concurrent users) but a production-grade rollout would need a runner-side graceful drain. - No post-run restore of `main` HEAD on the Tokyo stack: the stack is left running the last e2e-cloud run's image. Console inspection of `api-<sha>` identifies the running revision. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…visioned PR #680's original goal was to make the e2e suite a required merge gate. With the workflow now retargeted at the always-on Tokyo stack (previous commit), keeping `e2e-gate` as `(required)` would block every PR until the Tokyo IAM OIDC role + SSM parameter + admin-org quota are provisioned — a chicken-and-egg. Removing the gate job for now. The two-job structure (changes → e2e) still works for `workflow_dispatch` trial runs and for PR CI feedback (non-blocking). Re-add an `e2e-gate` job and the corresponding branch protection rule once a first green run is observed. Header comment updated to flag the deferred-required-check status so future contributors know why the workflow exists without being gated. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

… gap PR #680 is sourced from a fork (G4614/boxlite). GitHub Actions does NOT pass repository variables (the `vars` context) to workflow runs triggered by `pull_request` from external forks — security rule to prevent fork PRs from reading upstream secrets/vars. Result: even though `gh variable set AWS_E2E_CLOUD_REGION` and `AWS_E2E_CLOUD_ROLE_ARN` succeeded on boxlite-ai/boxlite, the workflow run on this PR's head saw both as empty strings, and `aws-actions/configure-aws-credentials@v4` failed at input validation with "Input required and not supplied: aws-region" before any OIDC handshake could occur. Fix: hardcode the values in the workflow's env: block. Both are acceptable to commit: - `ap-northeast-1`: region name, public infrastructure information. - IAM role ARN: contains the AWS account ID (12 digits). AWS docs classify account IDs as "sensitive, not secret"; the role's safety is enforced by its trust policy (which restricts AssumeRole to GitHub OIDC tokens with sub matching `repo:boxlite-ai/boxlite:*` patterns), not by hiding the ARN. Industry-standard pattern in OSS CI infrastructure. After this PR merges, future PRs sourced from same-repo branches WILL see the repo vars again, but the hardcoded values remain the source of truth. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Hide the AWS account ID + region + role ARN behind repo Variables — same pattern as .github/workflows/e2e-test.yml uses for AWS_ACCOUNT_ID. Previous version had them hardcoded because the original PR (#680) was a fork-PR and GitHub strips the `vars` context on fork-PR runs. PR #724 is now a same-repo PR (branch lives on boxlite-ai/boxlite), so the fork-PR workaround is no longer needed. Reads from: vars.AWS_E2E_CLOUD_REGION = ap-northeast-1 vars.AWS_E2E_CLOUD_ROLE_ARN = arn:aws:iam::<account>:role/boxlite-e2e-cloud-github-actions vars.AWS_ACCOUNT_ID (consumed by setup-e2e-cloud-oidc.sh) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

@v5

The previous e2e-stack.yml bootstrapped a full local stack (postgres, redis, docker registry, boxlite-runner systemd, boxlite-api) inside a self-hosted KVM runner per PR — slow to maintain (bootstrap drift, KVM runner contention) and only covers self-bootstrap state. This workflow keeps the same 3-job structure introduced by PR #680 (changes / e2e / e2e-gate) so the `E2E gate (required)` branch protection check is unchanged, but replaces the `e2e` job: - runs-on: ubuntu-latest (no /dev/kvm needed) - authenticates to AWS via OIDC using the new boxlite-e2e-cloud-github-actions role - builds the Api container image + boxlite-runner musl binary from this checkout - deploys to the always-on Tokyo stack: rolling-updates the Api ECS service to the new image and SSM-replaces the runner EC2's binary - initialises the admin-org sandbox quota row in RDS via ECS Exec into the Api task (DB_PASSWORD stays in container env — never in the SSM command body) - builds & installs the Python SDK from this checkout (path_prefix and other source-tree additions land ahead of the PyPI release) - runs `pytest scripts/test/e2e/cases/` against http://<api-lb>/api with BOXLITE_E2E_SKIP_PATH_VERIFY=1 (journalctl-based path verify doesn't reach the remote runner) - on failure, dumps Api CloudWatch logs + boxlite-runner journalctl Concurrency uses a constant group (`e2e-cloud-shared`) because every run — push + PR — competes for the same singleton Tokyo stack. Per-ref grouping would let two PRs interleave ECS rolling updates. Resource discovery (cluster id, LB DNS, runner instance id, S3 bucket, ECR repo) is by-pattern at workflow start, with a strict count=1 assertion so an orphaned `ApiLoadBalancer-*` from a partial SST teardown fails the job loudly instead of silently binding to the wrong target. SSM agent ping-status is verified before the runner update step to fail fast on a dead agent. Adversarial review (multiple lenses) caught and this commit addresses: - credentials.toml written via printf rather than unquoted heredoc so a `$` inside the admin key can't be shell-expanded; ADMIN_KEY immediately registered via `::add-mask::` - ECS execute-command output is captured and grepped for the expected SELECT row (Session Manager exit-code propagation has historically been unreliable through --interactive without a tty) - `wait services-stable` is followed by ALB target-health polling AND a /api/health curl with retry — services-stable alone races the LB health check - pytest wrapped in `timeout 35m` plus `--timeout=180 --timeout-method=thread` so a hung VM doesn't burn the 45-min job timeout before `Collect logs on failure` runs - Rust build cached via Swatinem/rust-cache@v2 so repeat PR builds are incremental - both actions/checkout occurrences bumped to @v5 to match the newer workflows in the repo Known limits (TODO/follow-up, not blocking): - boxlite-runner has no documented in-flight drain; restart drops any request landing during the swap window. Acceptable for e2e (no concurrent users) but a production-grade rollout would need a runner-side graceful drain. - No post-run restore of `main` HEAD on the Tokyo stack: the stack is left running the last e2e-cloud run's image. Console inspection of `api-<sha>` identifies the running revision. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…visioned PR #680's original goal was to make the e2e suite a required merge gate. With the workflow now retargeted at the always-on Tokyo stack (previous commit), keeping `e2e-gate` as `(required)` would block every PR until the Tokyo IAM OIDC role + SSM parameter + admin-org quota are provisioned — a chicken-and-egg. Removing the gate job for now. The two-job structure (changes → e2e) still works for `workflow_dispatch` trial runs and for PR CI feedback (non-blocking). Re-add an `e2e-gate` job and the corresponding branch protection rule once a first green run is observed. Header comment updated to flag the deferred-required-check status so future contributors know why the workflow exists without being gated. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

… gap PR #680 is sourced from a fork (G4614/boxlite). GitHub Actions does NOT pass repository variables (the `vars` context) to workflow runs triggered by `pull_request` from external forks — security rule to prevent fork PRs from reading upstream secrets/vars. Result: even though `gh variable set AWS_E2E_CLOUD_REGION` and `AWS_E2E_CLOUD_ROLE_ARN` succeeded on boxlite-ai/boxlite, the workflow run on this PR's head saw both as empty strings, and `aws-actions/configure-aws-credentials@v4` failed at input validation with "Input required and not supplied: aws-region" before any OIDC handshake could occur. Fix: hardcode the values in the workflow's env: block. Both are acceptable to commit: - `ap-northeast-1`: region name, public infrastructure information. - IAM role ARN: contains the AWS account ID (12 digits). AWS docs classify account IDs as "sensitive, not secret"; the role's safety is enforced by its trust policy (which restricts AssumeRole to GitHub OIDC tokens with sub matching `repo:boxlite-ai/boxlite:*` patterns), not by hiding the ARN. Industry-standard pattern in OSS CI infrastructure. After this PR merges, future PRs sourced from same-repo branches WILL see the repo vars again, but the hardcoded values remain the source of truth. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

G4614 force-pushed the chore/e2e-required-merge-gate branch from 8df6ade to 41bbb6d Compare June 9, 2026 03:27

G4614 force-pushed the chore/e2e-required-merge-gate branch from 213359b to 97af472 Compare June 10, 2026 07:44

G4614 force-pushed the chore/e2e-required-merge-gate branch from be6793a to 347ee5b Compare June 10, 2026 08:42

G4614 and others added 8 commits June 10, 2026 17:14

G4614 force-pushed the chore/e2e-required-merge-gate branch from 5bfa094 to 0cdf723 Compare June 10, 2026 09:14

G4614 mentioned this pull request Jun 10, 2026

ci(e2e): e2e-cloud workflow + IAM OIDC provisioning (replaces #680) #724

Open

G4614 closed this Jun 10, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

ci(e2e): make e2e suite a required merge gate (stacks on #678)#680

ci(e2e): make e2e suite a required merge gate (stacks on #678)#680
G4614 wants to merge 8 commits into
boxlite-ai:mainfrom
G4614:chore/e2e-required-merge-gate

G4614 commented Jun 8, 2026 •

edited

Loading

Uh oh!

coderabbitai Bot commented Jun 8, 2026 •

edited

Loading

Review skipped

Uh oh!

cla-assistant Bot commented Jun 10, 2026 •

edited

Loading

Uh oh!

G4614 commented Jun 10, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

G4614 commented Jun 8, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Problem

Fix

After merge

Cost

Tradeoff vs the simpler "drop paths filter and just always run"

Uh oh!

coderabbitai Bot commented Jun 8, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Review skipped

Uh oh!

cla-assistant Bot commented Jun 10, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

G4614 commented Jun 10, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

G4614 commented Jun 8, 2026 •

edited

Loading

coderabbitai Bot commented Jun 8, 2026 •

edited

Loading

cla-assistant Bot commented Jun 10, 2026 •

edited

Loading