ci(e2e): make e2e suite a required merge gate (stacks on #678)#680
Closed
G4614 wants to merge 8 commits into
Closed
ci(e2e): make e2e suite a required merge gate (stacks on #678)#680G4614 wants to merge 8 commits into
G4614 wants to merge 8 commits into
Conversation
|
Important Review skippedDraft detected. Please check the settings in the CodeRabbit UI or the ⚙️ Run configurationConfiguration used: defaults Review profile: CHILL Plan: Pro Plus Run ID: You can disable this status message by setting the Use the checkbox below for a quick retry:
✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
8df6ade to
41bbb6d
Compare
213359b to
97af472
Compare
G4614
pushed a commit
to G4614/boxlite
that referenced
this pull request
Jun 10, 2026
The previous e2e-stack.yml bootstrapped a full local stack (postgres, redis, docker registry, boxlite-runner systemd, boxlite-api) inside a self-hosted KVM runner per PR — slow to maintain (bootstrap drift, KVM runner contention) and only covers self-bootstrap state. This workflow keeps the same 3-job structure introduced by PR boxlite-ai#680 (changes / e2e / e2e-gate) so the `E2E gate (required)` branch protection check is unchanged, but replaces the `e2e` job: - runs-on: ubuntu-latest (no /dev/kvm needed) - authenticates to AWS via OIDC using the new boxlite-e2e-cloud-github-actions role - builds the Api container image + boxlite-runner musl binary from this checkout - deploys to the always-on Tokyo stack: rolling-updates the Api ECS service to the new image and SSM-replaces the runner EC2's binary - initialises the admin-org sandbox quota row in RDS via ECS Exec into the Api task (DB_PASSWORD stays in container env — never in the SSM command body) - builds & installs the Python SDK from this checkout (path_prefix and other source-tree additions land ahead of the PyPI release) - runs `pytest scripts/test/e2e/cases/` against http://<api-lb>/api with BOXLITE_E2E_SKIP_PATH_VERIFY=1 (journalctl-based path verify doesn't reach the remote runner) - on failure, dumps Api CloudWatch logs + boxlite-runner journalctl Concurrency uses a constant group (`e2e-cloud-shared`) because every run — push + PR — competes for the same singleton Tokyo stack. Per-ref grouping would let two PRs interleave ECS rolling updates. Resource discovery (cluster id, LB DNS, runner instance id, S3 bucket, ECR repo) is by-pattern at workflow start, with a strict count=1 assertion so an orphaned `ApiLoadBalancer-*` from a partial SST teardown fails the job loudly instead of silently binding to the wrong target. SSM agent ping-status is verified before the runner update step to fail fast on a dead agent. Adversarial review (multiple lenses) caught and this commit addresses: - credentials.toml written via printf rather than unquoted heredoc so a `$` inside the admin key can't be shell-expanded; ADMIN_KEY immediately registered via `::add-mask::` - ECS execute-command output is captured and grepped for the expected SELECT row (Session Manager exit-code propagation has historically been unreliable through --interactive without a tty) - `wait services-stable` is followed by ALB target-health polling AND a /api/health curl with retry — services-stable alone races the LB health check - pytest wrapped in `timeout 35m` plus `--timeout=180 --timeout-method=thread` so a hung VM doesn't burn the 45-min job timeout before `Collect logs on failure` runs - Rust build cached via Swatinem/rust-cache@v2 so repeat PR builds are incremental - both actions/checkout occurrences bumped to @v5 to match the newer workflows in the repo Known limits (TODO/follow-up, not blocking): - boxlite-runner has no documented in-flight drain; restart drops any request landing during the swap window. Acceptable for e2e (no concurrent users) but a production-grade rollout would need a runner-side graceful drain. - No post-run restore of `main` HEAD on the Tokyo stack: the stack is left running the last e2e-cloud run's image. Console inspection of `api-<sha>` identifies the running revision. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
G4614
pushed a commit
to G4614/boxlite
that referenced
this pull request
Jun 10, 2026
…visioned PR boxlite-ai#680's original goal was to make the e2e suite a required merge gate. With the workflow now retargeted at the always-on Tokyo stack (previous commit), keeping `e2e-gate` as `(required)` would block every PR until the Tokyo IAM OIDC role + SSM parameter + admin-org quota are provisioned — a chicken-and-egg. Removing the gate job for now. The two-job structure (changes → e2e) still works for `workflow_dispatch` trial runs and for PR CI feedback (non-blocking). Re-add an `e2e-gate` job and the corresponding branch protection rule once a first green run is observed. Header comment updated to flag the deferred-required-check status so future contributors know why the workflow exists without being gated. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
G4614
added a commit
to G4614/boxlite
that referenced
this pull request
Jun 10, 2026
The previous e2e-stack.yml bootstrapped a full local stack (postgres, redis, docker registry, boxlite-runner systemd, boxlite-api) inside a self-hosted KVM runner per PR — slow to maintain (bootstrap drift, KVM runner contention) and only covers self-bootstrap state. This workflow keeps the same 3-job structure introduced by PR boxlite-ai#680 (changes / e2e / e2e-gate) so the `E2E gate (required)` branch protection check is unchanged, but replaces the `e2e` job: - runs-on: ubuntu-latest (no /dev/kvm needed) - authenticates to AWS via OIDC using the new boxlite-e2e-cloud-github-actions role - builds the Api container image + boxlite-runner musl binary from this checkout - deploys to the always-on Tokyo stack: rolling-updates the Api ECS service to the new image and SSM-replaces the runner EC2's binary - initialises the admin-org sandbox quota row in RDS via ECS Exec into the Api task (DB_PASSWORD stays in container env — never in the SSM command body) - builds & installs the Python SDK from this checkout (path_prefix and other source-tree additions land ahead of the PyPI release) - runs `pytest scripts/test/e2e/cases/` against http://<api-lb>/api with BOXLITE_E2E_SKIP_PATH_VERIFY=1 (journalctl-based path verify doesn't reach the remote runner) - on failure, dumps Api CloudWatch logs + boxlite-runner journalctl Concurrency uses a constant group (`e2e-cloud-shared`) because every run — push + PR — competes for the same singleton Tokyo stack. Per-ref grouping would let two PRs interleave ECS rolling updates. Resource discovery (cluster id, LB DNS, runner instance id, S3 bucket, ECR repo) is by-pattern at workflow start, with a strict count=1 assertion so an orphaned `ApiLoadBalancer-*` from a partial SST teardown fails the job loudly instead of silently binding to the wrong target. SSM agent ping-status is verified before the runner update step to fail fast on a dead agent. Adversarial review (multiple lenses) caught and this commit addresses: - credentials.toml written via printf rather than unquoted heredoc so a `$` inside the admin key can't be shell-expanded; ADMIN_KEY immediately registered via `::add-mask::` - ECS execute-command output is captured and grepped for the expected SELECT row (Session Manager exit-code propagation has historically been unreliable through --interactive without a tty) - `wait services-stable` is followed by ALB target-health polling AND a /api/health curl with retry — services-stable alone races the LB health check - pytest wrapped in `timeout 35m` plus `--timeout=180 --timeout-method=thread` so a hung VM doesn't burn the 45-min job timeout before `Collect logs on failure` runs - Rust build cached via Swatinem/rust-cache@v2 so repeat PR builds are incremental - both actions/checkout occurrences bumped to @v5 to match the newer workflows in the repo Known limits (TODO/follow-up, not blocking): - boxlite-runner has no documented in-flight drain; restart drops any request landing during the swap window. Acceptable for e2e (no concurrent users) but a production-grade rollout would need a runner-side graceful drain. - No post-run restore of `main` HEAD on the Tokyo stack: the stack is left running the last e2e-cloud run's image. Console inspection of `api-<sha>` identifies the running revision. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
G4614
added a commit
to G4614/boxlite
that referenced
this pull request
Jun 10, 2026
…visioned PR boxlite-ai#680's original goal was to make the e2e suite a required merge gate. With the workflow now retargeted at the always-on Tokyo stack (previous commit), keeping `e2e-gate` as `(required)` would block every PR until the Tokyo IAM OIDC role + SSM parameter + admin-org quota are provisioned — a chicken-and-egg. Removing the gate job for now. The two-job structure (changes → e2e) still works for `workflow_dispatch` trial runs and for PR CI feedback (non-blocking). Re-add an `e2e-gate` job and the corresponding branch protection rule once a first green run is observed. Header comment updated to flag the deferred-required-check status so future contributors know why the workflow exists without being gated. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
be6793a to
347ee5b
Compare
PR boxlite-ai#678 wires up an e2e suite but its workflow only triggers on PRs that touch specific paths. GitHub branch protection requires a check to ALWAYS report a result — a path-filtered workflow that doesn't fire leaves the required check pending forever, so the PR can't merge. Reshape so e2e can be a required check: 1. Drop `paths:` from the trigger — workflow always starts on every PR to main. Inside, a cheap `changes` job on a GitHub-hosted runner does the path-filter detection via dorny/paths-filter@v3. 2. The expensive `e2e` job (self-hosted kvm runner, ~10-30 min) stays gated on `changes.outputs.relevant == true`. PRs that don't touch relevant paths skip the kvm runner entirely — no contention cost. 3. New `e2e-gate` job ALWAYS runs (`if: always()`) and collapses the conditional outcome into one stable check name `E2E gate (required)`. Reports success if e2e passed OR no relevant paths changed; fail if e2e ran and failed. Branch protection should be configured (separately, via repo Settings → Branches → main) to require the `E2E gate (required)` status check. That's a repo-settings action a maintainer needs to do once after this merges. Cost on irrelevant PRs: one ~10s GitHub-hosted job for `changes` + ~5s for `e2e-gate`. Negligible. No self-hosted runner consumed. Cost on relevant PRs: same as before (full e2e on kvm runner). Stacks on boxlite-ai#678 — this commit only makes sense once that PR's e2e-stack.yml is in main. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The autouse `verify_runner_saw_all_boxes` fixture shells out to `journalctl -u boxlite-runner`, which only resolves on a host where the runner systemd service is installed (the current self-hosted KVM runner). When the e2e suite migrates to a GitHub-hosted runner pointing at a remote stack (Tokyo Api/Runner deployment), journalctl returns zero hits and every box-creating test fails on the path-verify assert even though the box actually reached the remote runner. Setting BOXLITE_E2E_SKIP_PATH_VERIFY=1 short-circuits the fixture (yield-and-return before the journalctl probe). The cloud-CI workflow will set this; the self-hosted KVM job keeps the guard. The FFI-bypass risk this guard defends against doesn't apply on a stock GHA runner — no KVM means libkrun can't start a VM, so a silent fallback would surface as a hard error, not as a passing-but-wrong test. Losing the guard in that environment loses no real safety net. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds scripts/ci/setup-e2e-cloud-oidc.sh — sibling to setup-ci-runner.sh but for the cloud-stack regression suite (`.github/workflows/e2e-cloud.yml`, added in the next commit). A separate IAM role (`boxlite-e2e-cloud-github-actions`) is needed because the existing `boxlite-e2e-github-actions` role is tuned for the us-east-1 KVM-runner provisioning path. Cloud e2e against the ap-northeast-1 Tokyo stack needs an orthogonal permission set — ECR push, ECS update-service + execute-command, SSM send-command to a specific runner EC2, S3 builds/ artifact upload, parameter-store reads. Splitting the roles keeps each one auditable under least-privilege. Scoping: - Trust policy `sub` is an enumerated allowlist (main pushes, pull_request, environment:e2e-cloud) — a stray workflow added on a feature branch can't assume the role. - ssm:SendCommand is gated by `ssm:resourceTag/Name=boxlite-runner` so the role can only shell the runner EC2. - ecs:UpdateService restricted to boxlite-e2e-ci-*/Api service ARN; ecs:ExecuteCommand to boxlite-e2e-ci-*/* task ARNs. - Resource wildcards remaining (ecr:GetAuthorizationToken, ssmmessages:*, ecs:Describe*, elbv2:Describe*) are forced by AWS not supporting resource-level scoping for those actions. Sources scripts/common.sh for the shared log helpers so this script's output matches setup-ci-runner.sh. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The previous e2e-stack.yml bootstrapped a full local stack (postgres, redis, docker registry, boxlite-runner systemd, boxlite-api) inside a self-hosted KVM runner per PR — slow to maintain (bootstrap drift, KVM runner contention) and only covers self-bootstrap state. This workflow keeps the same 3-job structure introduced by PR boxlite-ai#680 (changes / e2e / e2e-gate) so the `E2E gate (required)` branch protection check is unchanged, but replaces the `e2e` job: - runs-on: ubuntu-latest (no /dev/kvm needed) - authenticates to AWS via OIDC using the new boxlite-e2e-cloud-github-actions role - builds the Api container image + boxlite-runner musl binary from this checkout - deploys to the always-on Tokyo stack: rolling-updates the Api ECS service to the new image and SSM-replaces the runner EC2's binary - initialises the admin-org sandbox quota row in RDS via ECS Exec into the Api task (DB_PASSWORD stays in container env — never in the SSM command body) - builds & installs the Python SDK from this checkout (path_prefix and other source-tree additions land ahead of the PyPI release) - runs `pytest scripts/test/e2e/cases/` against http://<api-lb>/api with BOXLITE_E2E_SKIP_PATH_VERIFY=1 (journalctl-based path verify doesn't reach the remote runner) - on failure, dumps Api CloudWatch logs + boxlite-runner journalctl Concurrency uses a constant group (`e2e-cloud-shared`) because every run — push + PR — competes for the same singleton Tokyo stack. Per-ref grouping would let two PRs interleave ECS rolling updates. Resource discovery (cluster id, LB DNS, runner instance id, S3 bucket, ECR repo) is by-pattern at workflow start, with a strict count=1 assertion so an orphaned `ApiLoadBalancer-*` from a partial SST teardown fails the job loudly instead of silently binding to the wrong target. SSM agent ping-status is verified before the runner update step to fail fast on a dead agent. Adversarial review (multiple lenses) caught and this commit addresses: - credentials.toml written via printf rather than unquoted heredoc so a `$` inside the admin key can't be shell-expanded; ADMIN_KEY immediately registered via `::add-mask::` - ECS execute-command output is captured and grepped for the expected SELECT row (Session Manager exit-code propagation has historically been unreliable through --interactive without a tty) - `wait services-stable` is followed by ALB target-health polling AND a /api/health curl with retry — services-stable alone races the LB health check - pytest wrapped in `timeout 35m` plus `--timeout=180 --timeout-method=thread` so a hung VM doesn't burn the 45-min job timeout before `Collect logs on failure` runs - Rust build cached via Swatinem/rust-cache@v2 so repeat PR builds are incremental - both actions/checkout occurrences bumped to @v5 to match the newer workflows in the repo Known limits (TODO/follow-up, not blocking): - boxlite-runner has no documented in-flight drain; restart drops any request landing during the swap window. Acceptable for e2e (no concurrent users) but a production-grade rollout would need a runner-side graceful drain. - No post-run restore of `main` HEAD on the Tokyo stack: the stack is left running the last e2e-cloud run's image. Console inspection of `api-<sha>` identifies the running revision. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- Add `e2e-cloud.yml` section to .github/workflows/README.md mirroring the structure of the e2e-test.yml section (Why / Architecture / Triggers / Cost / Authentication / Required variables / Required AWS resources / Concurrency / Stack state). - Side-by-side table contrasts the two e2e workflows' IAM roles, regions, and scopes so future maintainers can tell at a glance why both exist. - Fix line 198 which referenced `scripts/ci/setup-aws-oidc.sh` — that filename does not exist; the actual provisioning script is `scripts/ci/setup-ci-runner.sh`. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…visioned PR boxlite-ai#680's original goal was to make the e2e suite a required merge gate. With the workflow now retargeted at the always-on Tokyo stack (previous commit), keeping `e2e-gate` as `(required)` would block every PR until the Tokyo IAM OIDC role + SSM parameter + admin-org quota are provisioned — a chicken-and-egg. Removing the gate job for now. The two-job structure (changes → e2e) still works for `workflow_dispatch` trial runs and for PR CI feedback (non-blocking). Re-add an `e2e-gate` job and the corresponding branch protection rule once a first green run is observed. Header comment updated to flag the deferred-required-check status so future contributors know why the workflow exists without being gated. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
AWS IAM CreateRole rejects U+2014 (em-dash) in the --description argument with a validation error. Replace the em-dash with an ASCII hyphen so the script runs cleanly during bootstrap. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… gap PR boxlite-ai#680 is sourced from a fork (G4614/boxlite). GitHub Actions does NOT pass repository variables (the `vars` context) to workflow runs triggered by `pull_request` from external forks — security rule to prevent fork PRs from reading upstream secrets/vars. Result: even though `gh variable set AWS_E2E_CLOUD_REGION` and `AWS_E2E_CLOUD_ROLE_ARN` succeeded on boxlite-ai/boxlite, the workflow run on this PR's head saw both as empty strings, and `aws-actions/configure-aws-credentials@v4` failed at input validation with "Input required and not supplied: aws-region" before any OIDC handshake could occur. Fix: hardcode the values in the workflow's env: block. Both are acceptable to commit: - `ap-northeast-1`: region name, public infrastructure information. - IAM role ARN: contains the AWS account ID (12 digits). AWS docs classify account IDs as "sensitive, not secret"; the role's safety is enforced by its trust policy (which restricts AssumeRole to GitHub OIDC tokens with sub matching `repo:boxlite-ai/boxlite:*` patterns), not by hiding the ARN. Industry-standard pattern in OSS CI infrastructure. After this PR merges, future PRs sourced from same-repo branches WILL see the repo vars again, but the hardcoded values remain the source of truth. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
5bfa094 to
0cdf723
Compare
Contributor
Author
G4614
added a commit
that referenced
this pull request
Jun 10, 2026
The previous e2e-stack.yml bootstrapped a full local stack (postgres, redis, docker registry, boxlite-runner systemd, boxlite-api) inside a self-hosted KVM runner per PR — slow to maintain (bootstrap drift, KVM runner contention) and only covers self-bootstrap state. This workflow keeps the same 3-job structure introduced by PR #680 (changes / e2e / e2e-gate) so the `E2E gate (required)` branch protection check is unchanged, but replaces the `e2e` job: - runs-on: ubuntu-latest (no /dev/kvm needed) - authenticates to AWS via OIDC using the new boxlite-e2e-cloud-github-actions role - builds the Api container image + boxlite-runner musl binary from this checkout - deploys to the always-on Tokyo stack: rolling-updates the Api ECS service to the new image and SSM-replaces the runner EC2's binary - initialises the admin-org sandbox quota row in RDS via ECS Exec into the Api task (DB_PASSWORD stays in container env — never in the SSM command body) - builds & installs the Python SDK from this checkout (path_prefix and other source-tree additions land ahead of the PyPI release) - runs `pytest scripts/test/e2e/cases/` against http://<api-lb>/api with BOXLITE_E2E_SKIP_PATH_VERIFY=1 (journalctl-based path verify doesn't reach the remote runner) - on failure, dumps Api CloudWatch logs + boxlite-runner journalctl Concurrency uses a constant group (`e2e-cloud-shared`) because every run — push + PR — competes for the same singleton Tokyo stack. Per-ref grouping would let two PRs interleave ECS rolling updates. Resource discovery (cluster id, LB DNS, runner instance id, S3 bucket, ECR repo) is by-pattern at workflow start, with a strict count=1 assertion so an orphaned `ApiLoadBalancer-*` from a partial SST teardown fails the job loudly instead of silently binding to the wrong target. SSM agent ping-status is verified before the runner update step to fail fast on a dead agent. Adversarial review (multiple lenses) caught and this commit addresses: - credentials.toml written via printf rather than unquoted heredoc so a `$` inside the admin key can't be shell-expanded; ADMIN_KEY immediately registered via `::add-mask::` - ECS execute-command output is captured and grepped for the expected SELECT row (Session Manager exit-code propagation has historically been unreliable through --interactive without a tty) - `wait services-stable` is followed by ALB target-health polling AND a /api/health curl with retry — services-stable alone races the LB health check - pytest wrapped in `timeout 35m` plus `--timeout=180 --timeout-method=thread` so a hung VM doesn't burn the 45-min job timeout before `Collect logs on failure` runs - Rust build cached via Swatinem/rust-cache@v2 so repeat PR builds are incremental - both actions/checkout occurrences bumped to @v5 to match the newer workflows in the repo Known limits (TODO/follow-up, not blocking): - boxlite-runner has no documented in-flight drain; restart drops any request landing during the swap window. Acceptable for e2e (no concurrent users) but a production-grade rollout would need a runner-side graceful drain. - No post-run restore of `main` HEAD on the Tokyo stack: the stack is left running the last e2e-cloud run's image. Console inspection of `api-<sha>` identifies the running revision. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
G4614
added a commit
that referenced
this pull request
Jun 10, 2026
…visioned PR #680's original goal was to make the e2e suite a required merge gate. With the workflow now retargeted at the always-on Tokyo stack (previous commit), keeping `e2e-gate` as `(required)` would block every PR until the Tokyo IAM OIDC role + SSM parameter + admin-org quota are provisioned — a chicken-and-egg. Removing the gate job for now. The two-job structure (changes → e2e) still works for `workflow_dispatch` trial runs and for PR CI feedback (non-blocking). Re-add an `e2e-gate` job and the corresponding branch protection rule once a first green run is observed. Header comment updated to flag the deferred-required-check status so future contributors know why the workflow exists without being gated. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
G4614
added a commit
that referenced
this pull request
Jun 10, 2026
… gap PR #680 is sourced from a fork (G4614/boxlite). GitHub Actions does NOT pass repository variables (the `vars` context) to workflow runs triggered by `pull_request` from external forks — security rule to prevent fork PRs from reading upstream secrets/vars. Result: even though `gh variable set AWS_E2E_CLOUD_REGION` and `AWS_E2E_CLOUD_ROLE_ARN` succeeded on boxlite-ai/boxlite, the workflow run on this PR's head saw both as empty strings, and `aws-actions/configure-aws-credentials@v4` failed at input validation with "Input required and not supplied: aws-region" before any OIDC handshake could occur. Fix: hardcode the values in the workflow's env: block. Both are acceptable to commit: - `ap-northeast-1`: region name, public infrastructure information. - IAM role ARN: contains the AWS account ID (12 digits). AWS docs classify account IDs as "sensitive, not secret"; the role's safety is enforced by its trust policy (which restricts AssumeRole to GitHub OIDC tokens with sub matching `repo:boxlite-ai/boxlite:*` patterns), not by hiding the ARN. Industry-standard pattern in OSS CI infrastructure. After this PR merges, future PRs sourced from same-repo branches WILL see the repo vars again, but the hardcoded values remain the source of truth. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
G4614
added a commit
that referenced
this pull request
Jun 11, 2026
Hide the AWS account ID + region + role ARN behind repo Variables — same pattern as .github/workflows/e2e-test.yml uses for AWS_ACCOUNT_ID. Previous version had them hardcoded because the original PR (#680) was a fork-PR and GitHub strips the `vars` context on fork-PR runs. PR #724 is now a same-repo PR (branch lives on boxlite-ai/boxlite), so the fork-PR workaround is no longer needed. Reads from: vars.AWS_E2E_CLOUD_REGION = ap-northeast-1 vars.AWS_E2E_CLOUD_ROLE_ARN = arn:aws:iam::<account>:role/boxlite-e2e-cloud-github-actions vars.AWS_ACCOUNT_ID (consumed by setup-e2e-cloud-oidc.sh) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
G4614
added a commit
that referenced
this pull request
Jun 12, 2026
The previous e2e-stack.yml bootstrapped a full local stack (postgres, redis, docker registry, boxlite-runner systemd, boxlite-api) inside a self-hosted KVM runner per PR — slow to maintain (bootstrap drift, KVM runner contention) and only covers self-bootstrap state. This workflow keeps the same 3-job structure introduced by PR #680 (changes / e2e / e2e-gate) so the `E2E gate (required)` branch protection check is unchanged, but replaces the `e2e` job: - runs-on: ubuntu-latest (no /dev/kvm needed) - authenticates to AWS via OIDC using the new boxlite-e2e-cloud-github-actions role - builds the Api container image + boxlite-runner musl binary from this checkout - deploys to the always-on Tokyo stack: rolling-updates the Api ECS service to the new image and SSM-replaces the runner EC2's binary - initialises the admin-org sandbox quota row in RDS via ECS Exec into the Api task (DB_PASSWORD stays in container env — never in the SSM command body) - builds & installs the Python SDK from this checkout (path_prefix and other source-tree additions land ahead of the PyPI release) - runs `pytest scripts/test/e2e/cases/` against http://<api-lb>/api with BOXLITE_E2E_SKIP_PATH_VERIFY=1 (journalctl-based path verify doesn't reach the remote runner) - on failure, dumps Api CloudWatch logs + boxlite-runner journalctl Concurrency uses a constant group (`e2e-cloud-shared`) because every run — push + PR — competes for the same singleton Tokyo stack. Per-ref grouping would let two PRs interleave ECS rolling updates. Resource discovery (cluster id, LB DNS, runner instance id, S3 bucket, ECR repo) is by-pattern at workflow start, with a strict count=1 assertion so an orphaned `ApiLoadBalancer-*` from a partial SST teardown fails the job loudly instead of silently binding to the wrong target. SSM agent ping-status is verified before the runner update step to fail fast on a dead agent. Adversarial review (multiple lenses) caught and this commit addresses: - credentials.toml written via printf rather than unquoted heredoc so a `$` inside the admin key can't be shell-expanded; ADMIN_KEY immediately registered via `::add-mask::` - ECS execute-command output is captured and grepped for the expected SELECT row (Session Manager exit-code propagation has historically been unreliable through --interactive without a tty) - `wait services-stable` is followed by ALB target-health polling AND a /api/health curl with retry — services-stable alone races the LB health check - pytest wrapped in `timeout 35m` plus `--timeout=180 --timeout-method=thread` so a hung VM doesn't burn the 45-min job timeout before `Collect logs on failure` runs - Rust build cached via Swatinem/rust-cache@v2 so repeat PR builds are incremental - both actions/checkout occurrences bumped to @v5 to match the newer workflows in the repo Known limits (TODO/follow-up, not blocking): - boxlite-runner has no documented in-flight drain; restart drops any request landing during the swap window. Acceptable for e2e (no concurrent users) but a production-grade rollout would need a runner-side graceful drain. - No post-run restore of `main` HEAD on the Tokyo stack: the stack is left running the last e2e-cloud run's image. Console inspection of `api-<sha>` identifies the running revision. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
G4614
added a commit
that referenced
this pull request
Jun 12, 2026
…visioned PR #680's original goal was to make the e2e suite a required merge gate. With the workflow now retargeted at the always-on Tokyo stack (previous commit), keeping `e2e-gate` as `(required)` would block every PR until the Tokyo IAM OIDC role + SSM parameter + admin-org quota are provisioned — a chicken-and-egg. Removing the gate job for now. The two-job structure (changes → e2e) still works for `workflow_dispatch` trial runs and for PR CI feedback (non-blocking). Re-add an `e2e-gate` job and the corresponding branch protection rule once a first green run is observed. Header comment updated to flag the deferred-required-check status so future contributors know why the workflow exists without being gated. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
G4614
added a commit
that referenced
this pull request
Jun 12, 2026
… gap PR #680 is sourced from a fork (G4614/boxlite). GitHub Actions does NOT pass repository variables (the `vars` context) to workflow runs triggered by `pull_request` from external forks — security rule to prevent fork PRs from reading upstream secrets/vars. Result: even though `gh variable set AWS_E2E_CLOUD_REGION` and `AWS_E2E_CLOUD_ROLE_ARN` succeeded on boxlite-ai/boxlite, the workflow run on this PR's head saw both as empty strings, and `aws-actions/configure-aws-credentials@v4` failed at input validation with "Input required and not supplied: aws-region" before any OIDC handshake could occur. Fix: hardcode the values in the workflow's env: block. Both are acceptable to commit: - `ap-northeast-1`: region name, public infrastructure information. - IAM role ARN: contains the AWS account ID (12 digits). AWS docs classify account IDs as "sensitive, not secret"; the role's safety is enforced by its trust policy (which restricts AssumeRole to GitHub OIDC tokens with sub matching `repo:boxlite-ai/boxlite:*` patterns), not by hiding the ARN. Industry-standard pattern in OSS CI infrastructure. After this PR merges, future PRs sourced from same-repo branches WILL see the repo vars again, but the hardcoded values remain the source of truth. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
e2e force check
Problem
PR #678 ships the e2e suite as a workflow with `paths:` filter on its triggers. PRs that don't touch any of those paths cause the workflow not to fire — and GitHub branch protection's "required status check" expects a result on every PR. A required path-filtered check leaves PRs stuck in "pending checks" forever.
So today #678's e2e suite can be added to CI, but can't be a hard merge gate without one of these:
Fix
Reshape the workflow into 3 jobs:
```
┌─────────────────┐ ┌─────────────────────────┐ ┌─────────────────────────┐
│ changes │──▶│ e2e (kvm) │──▶│ e2e-gate (required) │
│ ubuntu-latest │ │ self-hosted kvm runner │ │ ubuntu-latest │
│ paths-filter │ │ if: relevant or │ │ if: always() │
│ ~10s │ │ workflow_dispatch │ │ ~5s │
└─────────────────┘ └─────────────────────────┘ └─────────────────────────┘
```
Branch protection can require `E2E gate (required)` and it'll always report a result.
After merge
Repo admin needs to do once:
The workflow change can't do this itself (GitHub repo settings are out of band).
Cost
Tradeoff vs the simpler "drop paths filter and just always run"
Always-run on every PR = self-hosted kvm runner saturated by doc-only PRs. Split into changes-gate-execute lets path filter live in code while still satisfying branch protection's "always reports" requirement.
🤖 Generated with Claude Code