Skip to content

ci(e2e): make e2e suite a required merge gate (stacks on #678)#680

Closed
G4614 wants to merge 8 commits into
boxlite-ai:mainfrom
G4614:chore/e2e-required-merge-gate
Closed

ci(e2e): make e2e suite a required merge gate (stacks on #678)#680
G4614 wants to merge 8 commits into
boxlite-ai:mainfrom
G4614:chore/e2e-required-merge-gate

Conversation

@G4614

@G4614 G4614 commented Jun 8, 2026

Copy link
Copy Markdown
Contributor

e2e force check

Problem

PR #678 ships the e2e suite as a workflow with `paths:` filter on its triggers. PRs that don't touch any of those paths cause the workflow not to fire — and GitHub branch protection's "required status check" expects a result on every PR. A required path-filtered check leaves PRs stuck in "pending checks" forever.

So today #678's e2e suite can be added to CI, but can't be a hard merge gate without one of these:

  • Add a fall-through workflow that always succeeds on irrelevant paths (separate file, name collision risk)
  • Drop the path filter and pay full e2e cost on every PR (~30 min on kvm runner)
  • Restructure into a single always-runs gate job (what this PR does)

Fix

Reshape the workflow into 3 jobs:

```
┌─────────────────┐ ┌─────────────────────────┐ ┌─────────────────────────┐
│ changes │──▶│ e2e (kvm) │──▶│ e2e-gate (required) │
│ ubuntu-latest │ │ self-hosted kvm runner │ │ ubuntu-latest │
│ paths-filter │ │ if: relevant or │ │ if: always() │
│ ~10s │ │ workflow_dispatch │ │ ~5s │
└─────────────────┘ └─────────────────────────┘ └─────────────────────────┘
```

  • `changes`: cheap GH-hosted path-filter check. Outputs `relevant=true|false`.
  • `e2e`: same expensive job test(e2e): SDK→API→Runner→VM regression suite #678 already has. Only runs when `changes` says relevant (or operator did `workflow_dispatch`). On irrelevant PRs the kvm runner is not consumed.
  • `e2e-gate`: always runs, collapses outcome into one stable check name. Pass if (no relevant paths) OR (e2e passed). Fail if e2e ran and failed.

Branch protection can require `E2E gate (required)` and it'll always report a result.

After merge

Repo admin needs to do once:

Settings → Branches → `main` → Edit → Require status checks to pass before merging → add `E2E gate (required)`

The workflow change can't do this itself (GitHub repo settings are out of band).

Cost

PR type Before this PR After this PR
Touches relevant paths full e2e (~10-30 min, kvm) full e2e + ~15s extra
Doesn't touch workflow never runs, check missing → can't be required ~10s changes + ~5s gate on GH-hosted, no kvm consumed

Tradeoff vs the simpler "drop paths filter and just always run"

Always-run on every PR = self-hosted kvm runner saturated by doc-only PRs. Split into changes-gate-execute lets path filter live in code while still satisfying branch protection's "always reports" requirement.

🤖 Generated with Claude Code

@coderabbitai

coderabbitai Bot commented Jun 8, 2026

Copy link
Copy Markdown

Important

Review skipped

Draft detected.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro Plus

Run ID: 5736e958-692e-4877-8922-a777f3e1a5e2

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Use the checkbox below for a quick retry:

  • 🔍 Trigger review
✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Comment @coderabbitai help to get the list of available commands and usage tips.

@G4614 G4614 force-pushed the chore/e2e-required-merge-gate branch from 8df6ade to 41bbb6d Compare June 9, 2026 03:27
@cla-assistant

cla-assistant Bot commented Jun 10, 2026

Copy link
Copy Markdown

CLA assistant check
All committers have signed the CLA.

@G4614 G4614 force-pushed the chore/e2e-required-merge-gate branch from 213359b to 97af472 Compare June 10, 2026 07:44
G4614 pushed a commit to G4614/boxlite that referenced this pull request Jun 10, 2026
The previous e2e-stack.yml bootstrapped a full local stack (postgres,
redis, docker registry, boxlite-runner systemd, boxlite-api) inside a
self-hosted KVM runner per PR — slow to maintain (bootstrap drift,
KVM runner contention) and only covers self-bootstrap state.

This workflow keeps the same 3-job structure introduced by PR boxlite-ai#680
(changes / e2e / e2e-gate) so the `E2E gate (required)` branch
protection check is unchanged, but replaces the `e2e` job:

- runs-on: ubuntu-latest (no /dev/kvm needed)
- authenticates to AWS via OIDC using the new
  boxlite-e2e-cloud-github-actions role
- builds the Api container image + boxlite-runner musl binary from
  this checkout
- deploys to the always-on Tokyo stack: rolling-updates the Api ECS
  service to the new image and SSM-replaces the runner EC2's binary
- initialises the admin-org sandbox quota row in RDS via ECS Exec into
  the Api task (DB_PASSWORD stays in container env — never in the
  SSM command body)
- builds & installs the Python SDK from this checkout (path_prefix
  and other source-tree additions land ahead of the PyPI release)
- runs `pytest scripts/test/e2e/cases/` against http://<api-lb>/api
  with BOXLITE_E2E_SKIP_PATH_VERIFY=1 (journalctl-based path verify
  doesn't reach the remote runner)
- on failure, dumps Api CloudWatch logs + boxlite-runner journalctl

Concurrency uses a constant group (`e2e-cloud-shared`) because every
run — push + PR — competes for the same singleton Tokyo stack. Per-ref
grouping would let two PRs interleave ECS rolling updates.

Resource discovery (cluster id, LB DNS, runner instance id, S3
bucket, ECR repo) is by-pattern at workflow start, with a strict
count=1 assertion so an orphaned `ApiLoadBalancer-*` from a partial
SST teardown fails the job loudly instead of silently binding to
the wrong target. SSM agent ping-status is verified before the
runner update step to fail fast on a dead agent.

Adversarial review (multiple lenses) caught and this commit addresses:
- credentials.toml written via printf rather than unquoted heredoc
  so a `$` inside the admin key can't be shell-expanded; ADMIN_KEY
  immediately registered via `::add-mask::`
- ECS execute-command output is captured and grepped for the
  expected SELECT row (Session Manager exit-code propagation has
  historically been unreliable through --interactive without a tty)
- `wait services-stable` is followed by ALB target-health polling
  AND a /api/health curl with retry — services-stable alone races
  the LB health check
- pytest wrapped in `timeout 35m` plus
  `--timeout=180 --timeout-method=thread` so a hung VM doesn't burn
  the 45-min job timeout before `Collect logs on failure` runs
- Rust build cached via Swatinem/rust-cache@v2 so repeat PR builds
  are incremental
- both actions/checkout occurrences bumped to @v5 to match the
  newer workflows in the repo

Known limits (TODO/follow-up, not blocking):
- boxlite-runner has no documented in-flight drain; restart drops
  any request landing during the swap window. Acceptable for e2e
  (no concurrent users) but a production-grade rollout would need
  a runner-side graceful drain.
- No post-run restore of `main` HEAD on the Tokyo stack: the stack
  is left running the last e2e-cloud run's image. Console inspection
  of `api-<sha>` identifies the running revision.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
G4614 pushed a commit to G4614/boxlite that referenced this pull request Jun 10, 2026
…visioned

PR boxlite-ai#680's original goal was to make the e2e suite a required merge
gate. With the workflow now retargeted at the always-on Tokyo stack
(previous commit), keeping `e2e-gate` as `(required)` would block
every PR until the Tokyo IAM OIDC role + SSM parameter +
admin-org quota are provisioned — a chicken-and-egg.

Removing the gate job for now. The two-job structure (changes → e2e)
still works for `workflow_dispatch` trial runs and for PR CI feedback
(non-blocking). Re-add an `e2e-gate` job and the corresponding branch
protection rule once a first green run is observed.

Header comment updated to flag the deferred-required-check status so
future contributors know why the workflow exists without being gated.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
G4614 added a commit to G4614/boxlite that referenced this pull request Jun 10, 2026
The previous e2e-stack.yml bootstrapped a full local stack (postgres,
redis, docker registry, boxlite-runner systemd, boxlite-api) inside a
self-hosted KVM runner per PR — slow to maintain (bootstrap drift,
KVM runner contention) and only covers self-bootstrap state.

This workflow keeps the same 3-job structure introduced by PR boxlite-ai#680
(changes / e2e / e2e-gate) so the `E2E gate (required)` branch
protection check is unchanged, but replaces the `e2e` job:

- runs-on: ubuntu-latest (no /dev/kvm needed)
- authenticates to AWS via OIDC using the new
  boxlite-e2e-cloud-github-actions role
- builds the Api container image + boxlite-runner musl binary from
  this checkout
- deploys to the always-on Tokyo stack: rolling-updates the Api ECS
  service to the new image and SSM-replaces the runner EC2's binary
- initialises the admin-org sandbox quota row in RDS via ECS Exec into
  the Api task (DB_PASSWORD stays in container env — never in the
  SSM command body)
- builds & installs the Python SDK from this checkout (path_prefix
  and other source-tree additions land ahead of the PyPI release)
- runs `pytest scripts/test/e2e/cases/` against http://<api-lb>/api
  with BOXLITE_E2E_SKIP_PATH_VERIFY=1 (journalctl-based path verify
  doesn't reach the remote runner)
- on failure, dumps Api CloudWatch logs + boxlite-runner journalctl

Concurrency uses a constant group (`e2e-cloud-shared`) because every
run — push + PR — competes for the same singleton Tokyo stack. Per-ref
grouping would let two PRs interleave ECS rolling updates.

Resource discovery (cluster id, LB DNS, runner instance id, S3
bucket, ECR repo) is by-pattern at workflow start, with a strict
count=1 assertion so an orphaned `ApiLoadBalancer-*` from a partial
SST teardown fails the job loudly instead of silently binding to
the wrong target. SSM agent ping-status is verified before the
runner update step to fail fast on a dead agent.

Adversarial review (multiple lenses) caught and this commit addresses:
- credentials.toml written via printf rather than unquoted heredoc
  so a `$` inside the admin key can't be shell-expanded; ADMIN_KEY
  immediately registered via `::add-mask::`
- ECS execute-command output is captured and grepped for the
  expected SELECT row (Session Manager exit-code propagation has
  historically been unreliable through --interactive without a tty)
- `wait services-stable` is followed by ALB target-health polling
  AND a /api/health curl with retry — services-stable alone races
  the LB health check
- pytest wrapped in `timeout 35m` plus
  `--timeout=180 --timeout-method=thread` so a hung VM doesn't burn
  the 45-min job timeout before `Collect logs on failure` runs
- Rust build cached via Swatinem/rust-cache@v2 so repeat PR builds
  are incremental
- both actions/checkout occurrences bumped to @v5 to match the
  newer workflows in the repo

Known limits (TODO/follow-up, not blocking):
- boxlite-runner has no documented in-flight drain; restart drops
  any request landing during the swap window. Acceptable for e2e
  (no concurrent users) but a production-grade rollout would need
  a runner-side graceful drain.
- No post-run restore of `main` HEAD on the Tokyo stack: the stack
  is left running the last e2e-cloud run's image. Console inspection
  of `api-<sha>` identifies the running revision.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
G4614 added a commit to G4614/boxlite that referenced this pull request Jun 10, 2026
…visioned

PR boxlite-ai#680's original goal was to make the e2e suite a required merge
gate. With the workflow now retargeted at the always-on Tokyo stack
(previous commit), keeping `e2e-gate` as `(required)` would block
every PR until the Tokyo IAM OIDC role + SSM parameter +
admin-org quota are provisioned — a chicken-and-egg.

Removing the gate job for now. The two-job structure (changes → e2e)
still works for `workflow_dispatch` trial runs and for PR CI feedback
(non-blocking). Re-add an `e2e-gate` job and the corresponding branch
protection rule once a first green run is observed.

Header comment updated to flag the deferred-required-check status so
future contributors know why the workflow exists without being gated.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@G4614 G4614 force-pushed the chore/e2e-required-merge-gate branch from be6793a to 347ee5b Compare June 10, 2026 08:42
G4614 and others added 8 commits June 10, 2026 17:14
PR boxlite-ai#678 wires up an e2e suite but its workflow only triggers on PRs
that touch specific paths. GitHub branch protection requires a check
to ALWAYS report a result — a path-filtered workflow that doesn't
fire leaves the required check pending forever, so the PR can't merge.

Reshape so e2e can be a required check:

1. Drop `paths:` from the trigger — workflow always starts on every
   PR to main. Inside, a cheap `changes` job on a GitHub-hosted
   runner does the path-filter detection via dorny/paths-filter@v3.

2. The expensive `e2e` job (self-hosted kvm runner, ~10-30 min)
   stays gated on `changes.outputs.relevant == true`. PRs that don't
   touch relevant paths skip the kvm runner entirely — no contention
   cost.

3. New `e2e-gate` job ALWAYS runs (`if: always()`) and collapses
   the conditional outcome into one stable check name
   `E2E gate (required)`. Reports success if e2e passed OR no
   relevant paths changed; fail if e2e ran and failed.

Branch protection should be configured (separately, via repo
Settings → Branches → main) to require the `E2E gate (required)`
status check. That's a repo-settings action a maintainer needs to
do once after this merges.

Cost on irrelevant PRs: one ~10s GitHub-hosted job for `changes`
+ ~5s for `e2e-gate`. Negligible. No self-hosted runner consumed.

Cost on relevant PRs: same as before (full e2e on kvm runner).

Stacks on boxlite-ai#678 — this commit only makes sense once that PR's
e2e-stack.yml is in main.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The autouse `verify_runner_saw_all_boxes` fixture shells out to
`journalctl -u boxlite-runner`, which only resolves on a host where the
runner systemd service is installed (the current self-hosted KVM
runner). When the e2e suite migrates to a GitHub-hosted runner pointing
at a remote stack (Tokyo Api/Runner deployment), journalctl returns
zero hits and every box-creating test fails on the path-verify assert
even though the box actually reached the remote runner.

Setting BOXLITE_E2E_SKIP_PATH_VERIFY=1 short-circuits the fixture
(yield-and-return before the journalctl probe). The cloud-CI workflow
will set this; the self-hosted KVM job keeps the guard.

The FFI-bypass risk this guard defends against doesn't apply on a stock
GHA runner — no KVM means libkrun can't start a VM, so a silent
fallback would surface as a hard error, not as a passing-but-wrong
test. Losing the guard in that environment loses no real safety net.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds scripts/ci/setup-e2e-cloud-oidc.sh — sibling to setup-ci-runner.sh
but for the cloud-stack regression suite (`.github/workflows/e2e-cloud.yml`,
added in the next commit).

A separate IAM role (`boxlite-e2e-cloud-github-actions`) is needed
because the existing `boxlite-e2e-github-actions` role is tuned for
the us-east-1 KVM-runner provisioning path. Cloud e2e against the
ap-northeast-1 Tokyo stack needs an orthogonal permission set — ECR
push, ECS update-service + execute-command, SSM send-command to a
specific runner EC2, S3 builds/ artifact upload, parameter-store
reads. Splitting the roles keeps each one auditable under least-privilege.

Scoping:
- Trust policy `sub` is an enumerated allowlist (main pushes,
  pull_request, environment:e2e-cloud) — a stray workflow added on
  a feature branch can't assume the role.
- ssm:SendCommand is gated by `ssm:resourceTag/Name=boxlite-runner`
  so the role can only shell the runner EC2.
- ecs:UpdateService restricted to boxlite-e2e-ci-*/Api service ARN;
  ecs:ExecuteCommand to boxlite-e2e-ci-*/* task ARNs.
- Resource wildcards remaining (ecr:GetAuthorizationToken, ssmmessages:*,
  ecs:Describe*, elbv2:Describe*) are forced by AWS not supporting
  resource-level scoping for those actions.

Sources scripts/common.sh for the shared log helpers so this script's
output matches setup-ci-runner.sh.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The previous e2e-stack.yml bootstrapped a full local stack (postgres,
redis, docker registry, boxlite-runner systemd, boxlite-api) inside a
self-hosted KVM runner per PR — slow to maintain (bootstrap drift,
KVM runner contention) and only covers self-bootstrap state.

This workflow keeps the same 3-job structure introduced by PR boxlite-ai#680
(changes / e2e / e2e-gate) so the `E2E gate (required)` branch
protection check is unchanged, but replaces the `e2e` job:

- runs-on: ubuntu-latest (no /dev/kvm needed)
- authenticates to AWS via OIDC using the new
  boxlite-e2e-cloud-github-actions role
- builds the Api container image + boxlite-runner musl binary from
  this checkout
- deploys to the always-on Tokyo stack: rolling-updates the Api ECS
  service to the new image and SSM-replaces the runner EC2's binary
- initialises the admin-org sandbox quota row in RDS via ECS Exec into
  the Api task (DB_PASSWORD stays in container env — never in the
  SSM command body)
- builds & installs the Python SDK from this checkout (path_prefix
  and other source-tree additions land ahead of the PyPI release)
- runs `pytest scripts/test/e2e/cases/` against http://<api-lb>/api
  with BOXLITE_E2E_SKIP_PATH_VERIFY=1 (journalctl-based path verify
  doesn't reach the remote runner)
- on failure, dumps Api CloudWatch logs + boxlite-runner journalctl

Concurrency uses a constant group (`e2e-cloud-shared`) because every
run — push + PR — competes for the same singleton Tokyo stack. Per-ref
grouping would let two PRs interleave ECS rolling updates.

Resource discovery (cluster id, LB DNS, runner instance id, S3
bucket, ECR repo) is by-pattern at workflow start, with a strict
count=1 assertion so an orphaned `ApiLoadBalancer-*` from a partial
SST teardown fails the job loudly instead of silently binding to
the wrong target. SSM agent ping-status is verified before the
runner update step to fail fast on a dead agent.

Adversarial review (multiple lenses) caught and this commit addresses:
- credentials.toml written via printf rather than unquoted heredoc
  so a `$` inside the admin key can't be shell-expanded; ADMIN_KEY
  immediately registered via `::add-mask::`
- ECS execute-command output is captured and grepped for the
  expected SELECT row (Session Manager exit-code propagation has
  historically been unreliable through --interactive without a tty)
- `wait services-stable` is followed by ALB target-health polling
  AND a /api/health curl with retry — services-stable alone races
  the LB health check
- pytest wrapped in `timeout 35m` plus
  `--timeout=180 --timeout-method=thread` so a hung VM doesn't burn
  the 45-min job timeout before `Collect logs on failure` runs
- Rust build cached via Swatinem/rust-cache@v2 so repeat PR builds
  are incremental
- both actions/checkout occurrences bumped to @v5 to match the
  newer workflows in the repo

Known limits (TODO/follow-up, not blocking):
- boxlite-runner has no documented in-flight drain; restart drops
  any request landing during the swap window. Acceptable for e2e
  (no concurrent users) but a production-grade rollout would need
  a runner-side graceful drain.
- No post-run restore of `main` HEAD on the Tokyo stack: the stack
  is left running the last e2e-cloud run's image. Console inspection
  of `api-<sha>` identifies the running revision.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- Add `e2e-cloud.yml` section to .github/workflows/README.md mirroring
  the structure of the e2e-test.yml section (Why / Architecture /
  Triggers / Cost / Authentication / Required variables / Required
  AWS resources / Concurrency / Stack state).
- Side-by-side table contrasts the two e2e workflows' IAM roles,
  regions, and scopes so future maintainers can tell at a glance why
  both exist.
- Fix line 198 which referenced `scripts/ci/setup-aws-oidc.sh` — that
  filename does not exist; the actual provisioning script is
  `scripts/ci/setup-ci-runner.sh`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…visioned

PR boxlite-ai#680's original goal was to make the e2e suite a required merge
gate. With the workflow now retargeted at the always-on Tokyo stack
(previous commit), keeping `e2e-gate` as `(required)` would block
every PR until the Tokyo IAM OIDC role + SSM parameter +
admin-org quota are provisioned — a chicken-and-egg.

Removing the gate job for now. The two-job structure (changes → e2e)
still works for `workflow_dispatch` trial runs and for PR CI feedback
(non-blocking). Re-add an `e2e-gate` job and the corresponding branch
protection rule once a first green run is observed.

Header comment updated to flag the deferred-required-check status so
future contributors know why the workflow exists without being gated.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
AWS IAM CreateRole rejects U+2014 (em-dash) in the --description
argument with a validation error. Replace the em-dash with an ASCII
hyphen so the script runs cleanly during bootstrap.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… gap

PR boxlite-ai#680 is sourced from a fork (G4614/boxlite). GitHub Actions does
NOT pass repository variables (the `vars` context) to workflow runs
triggered by `pull_request` from external forks — security rule to
prevent fork PRs from reading upstream secrets/vars.

Result: even though `gh variable set AWS_E2E_CLOUD_REGION` and
`AWS_E2E_CLOUD_ROLE_ARN` succeeded on boxlite-ai/boxlite, the workflow
run on this PR's head saw both as empty strings, and
`aws-actions/configure-aws-credentials@v4` failed at input validation
with "Input required and not supplied: aws-region" before any OIDC
handshake could occur.

Fix: hardcode the values in the workflow's env: block. Both are
acceptable to commit:

- `ap-northeast-1`: region name, public infrastructure information.
- IAM role ARN: contains the AWS account ID (12 digits). AWS docs
  classify account IDs as "sensitive, not secret"; the role's safety
  is enforced by its trust policy (which restricts AssumeRole to
  GitHub OIDC tokens with sub matching `repo:boxlite-ai/boxlite:*`
  patterns), not by hiding the ARN. Industry-standard pattern in OSS
  CI infrastructure.

After this PR merges, future PRs sourced from same-repo branches WILL
see the repo vars again, but the hardcoded values remain the source
of truth.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@G4614

G4614 commented Jun 10, 2026

Copy link
Copy Markdown
Contributor Author

Superseded by #724 — same content but as a same-repo PR (not from fork) so the e2e-cloud workflow's OIDC handshake can actually run. GitHub strips id-token: write permission from workflow runs triggered by PRs from forks, which made #680's CI structurally unable to pass.

@G4614 G4614 closed this Jun 10, 2026
G4614 added a commit that referenced this pull request Jun 10, 2026
The previous e2e-stack.yml bootstrapped a full local stack (postgres,
redis, docker registry, boxlite-runner systemd, boxlite-api) inside a
self-hosted KVM runner per PR — slow to maintain (bootstrap drift,
KVM runner contention) and only covers self-bootstrap state.

This workflow keeps the same 3-job structure introduced by PR #680
(changes / e2e / e2e-gate) so the `E2E gate (required)` branch
protection check is unchanged, but replaces the `e2e` job:

- runs-on: ubuntu-latest (no /dev/kvm needed)
- authenticates to AWS via OIDC using the new
  boxlite-e2e-cloud-github-actions role
- builds the Api container image + boxlite-runner musl binary from
  this checkout
- deploys to the always-on Tokyo stack: rolling-updates the Api ECS
  service to the new image and SSM-replaces the runner EC2's binary
- initialises the admin-org sandbox quota row in RDS via ECS Exec into
  the Api task (DB_PASSWORD stays in container env — never in the
  SSM command body)
- builds & installs the Python SDK from this checkout (path_prefix
  and other source-tree additions land ahead of the PyPI release)
- runs `pytest scripts/test/e2e/cases/` against http://<api-lb>/api
  with BOXLITE_E2E_SKIP_PATH_VERIFY=1 (journalctl-based path verify
  doesn't reach the remote runner)
- on failure, dumps Api CloudWatch logs + boxlite-runner journalctl

Concurrency uses a constant group (`e2e-cloud-shared`) because every
run — push + PR — competes for the same singleton Tokyo stack. Per-ref
grouping would let two PRs interleave ECS rolling updates.

Resource discovery (cluster id, LB DNS, runner instance id, S3
bucket, ECR repo) is by-pattern at workflow start, with a strict
count=1 assertion so an orphaned `ApiLoadBalancer-*` from a partial
SST teardown fails the job loudly instead of silently binding to
the wrong target. SSM agent ping-status is verified before the
runner update step to fail fast on a dead agent.

Adversarial review (multiple lenses) caught and this commit addresses:
- credentials.toml written via printf rather than unquoted heredoc
  so a `$` inside the admin key can't be shell-expanded; ADMIN_KEY
  immediately registered via `::add-mask::`
- ECS execute-command output is captured and grepped for the
  expected SELECT row (Session Manager exit-code propagation has
  historically been unreliable through --interactive without a tty)
- `wait services-stable` is followed by ALB target-health polling
  AND a /api/health curl with retry — services-stable alone races
  the LB health check
- pytest wrapped in `timeout 35m` plus
  `--timeout=180 --timeout-method=thread` so a hung VM doesn't burn
  the 45-min job timeout before `Collect logs on failure` runs
- Rust build cached via Swatinem/rust-cache@v2 so repeat PR builds
  are incremental
- both actions/checkout occurrences bumped to @v5 to match the
  newer workflows in the repo

Known limits (TODO/follow-up, not blocking):
- boxlite-runner has no documented in-flight drain; restart drops
  any request landing during the swap window. Acceptable for e2e
  (no concurrent users) but a production-grade rollout would need
  a runner-side graceful drain.
- No post-run restore of `main` HEAD on the Tokyo stack: the stack
  is left running the last e2e-cloud run's image. Console inspection
  of `api-<sha>` identifies the running revision.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
G4614 added a commit that referenced this pull request Jun 10, 2026
…visioned

PR #680's original goal was to make the e2e suite a required merge
gate. With the workflow now retargeted at the always-on Tokyo stack
(previous commit), keeping `e2e-gate` as `(required)` would block
every PR until the Tokyo IAM OIDC role + SSM parameter +
admin-org quota are provisioned — a chicken-and-egg.

Removing the gate job for now. The two-job structure (changes → e2e)
still works for `workflow_dispatch` trial runs and for PR CI feedback
(non-blocking). Re-add an `e2e-gate` job and the corresponding branch
protection rule once a first green run is observed.

Header comment updated to flag the deferred-required-check status so
future contributors know why the workflow exists without being gated.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
G4614 added a commit that referenced this pull request Jun 10, 2026
… gap

PR #680 is sourced from a fork (G4614/boxlite). GitHub Actions does
NOT pass repository variables (the `vars` context) to workflow runs
triggered by `pull_request` from external forks — security rule to
prevent fork PRs from reading upstream secrets/vars.

Result: even though `gh variable set AWS_E2E_CLOUD_REGION` and
`AWS_E2E_CLOUD_ROLE_ARN` succeeded on boxlite-ai/boxlite, the workflow
run on this PR's head saw both as empty strings, and
`aws-actions/configure-aws-credentials@v4` failed at input validation
with "Input required and not supplied: aws-region" before any OIDC
handshake could occur.

Fix: hardcode the values in the workflow's env: block. Both are
acceptable to commit:

- `ap-northeast-1`: region name, public infrastructure information.
- IAM role ARN: contains the AWS account ID (12 digits). AWS docs
  classify account IDs as "sensitive, not secret"; the role's safety
  is enforced by its trust policy (which restricts AssumeRole to
  GitHub OIDC tokens with sub matching `repo:boxlite-ai/boxlite:*`
  patterns), not by hiding the ARN. Industry-standard pattern in OSS
  CI infrastructure.

After this PR merges, future PRs sourced from same-repo branches WILL
see the repo vars again, but the hardcoded values remain the source
of truth.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
G4614 added a commit that referenced this pull request Jun 11, 2026
Hide the AWS account ID + region + role ARN behind repo Variables —
same pattern as .github/workflows/e2e-test.yml uses for AWS_ACCOUNT_ID.

Previous version had them hardcoded because the original PR (#680) was
a fork-PR and GitHub strips the `vars` context on fork-PR runs. PR #724
is now a same-repo PR (branch lives on boxlite-ai/boxlite), so the
fork-PR workaround is no longer needed.

Reads from:
  vars.AWS_E2E_CLOUD_REGION    = ap-northeast-1
  vars.AWS_E2E_CLOUD_ROLE_ARN  = arn:aws:iam::<account>:role/boxlite-e2e-cloud-github-actions
  vars.AWS_ACCOUNT_ID          (consumed by setup-e2e-cloud-oidc.sh)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
G4614 added a commit that referenced this pull request Jun 12, 2026
The previous e2e-stack.yml bootstrapped a full local stack (postgres,
redis, docker registry, boxlite-runner systemd, boxlite-api) inside a
self-hosted KVM runner per PR — slow to maintain (bootstrap drift,
KVM runner contention) and only covers self-bootstrap state.

This workflow keeps the same 3-job structure introduced by PR #680
(changes / e2e / e2e-gate) so the `E2E gate (required)` branch
protection check is unchanged, but replaces the `e2e` job:

- runs-on: ubuntu-latest (no /dev/kvm needed)
- authenticates to AWS via OIDC using the new
  boxlite-e2e-cloud-github-actions role
- builds the Api container image + boxlite-runner musl binary from
  this checkout
- deploys to the always-on Tokyo stack: rolling-updates the Api ECS
  service to the new image and SSM-replaces the runner EC2's binary
- initialises the admin-org sandbox quota row in RDS via ECS Exec into
  the Api task (DB_PASSWORD stays in container env — never in the
  SSM command body)
- builds & installs the Python SDK from this checkout (path_prefix
  and other source-tree additions land ahead of the PyPI release)
- runs `pytest scripts/test/e2e/cases/` against http://<api-lb>/api
  with BOXLITE_E2E_SKIP_PATH_VERIFY=1 (journalctl-based path verify
  doesn't reach the remote runner)
- on failure, dumps Api CloudWatch logs + boxlite-runner journalctl

Concurrency uses a constant group (`e2e-cloud-shared`) because every
run — push + PR — competes for the same singleton Tokyo stack. Per-ref
grouping would let two PRs interleave ECS rolling updates.

Resource discovery (cluster id, LB DNS, runner instance id, S3
bucket, ECR repo) is by-pattern at workflow start, with a strict
count=1 assertion so an orphaned `ApiLoadBalancer-*` from a partial
SST teardown fails the job loudly instead of silently binding to
the wrong target. SSM agent ping-status is verified before the
runner update step to fail fast on a dead agent.

Adversarial review (multiple lenses) caught and this commit addresses:
- credentials.toml written via printf rather than unquoted heredoc
  so a `$` inside the admin key can't be shell-expanded; ADMIN_KEY
  immediately registered via `::add-mask::`
- ECS execute-command output is captured and grepped for the
  expected SELECT row (Session Manager exit-code propagation has
  historically been unreliable through --interactive without a tty)
- `wait services-stable` is followed by ALB target-health polling
  AND a /api/health curl with retry — services-stable alone races
  the LB health check
- pytest wrapped in `timeout 35m` plus
  `--timeout=180 --timeout-method=thread` so a hung VM doesn't burn
  the 45-min job timeout before `Collect logs on failure` runs
- Rust build cached via Swatinem/rust-cache@v2 so repeat PR builds
  are incremental
- both actions/checkout occurrences bumped to @v5 to match the
  newer workflows in the repo

Known limits (TODO/follow-up, not blocking):
- boxlite-runner has no documented in-flight drain; restart drops
  any request landing during the swap window. Acceptable for e2e
  (no concurrent users) but a production-grade rollout would need
  a runner-side graceful drain.
- No post-run restore of `main` HEAD on the Tokyo stack: the stack
  is left running the last e2e-cloud run's image. Console inspection
  of `api-<sha>` identifies the running revision.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
G4614 added a commit that referenced this pull request Jun 12, 2026
…visioned

PR #680's original goal was to make the e2e suite a required merge
gate. With the workflow now retargeted at the always-on Tokyo stack
(previous commit), keeping `e2e-gate` as `(required)` would block
every PR until the Tokyo IAM OIDC role + SSM parameter +
admin-org quota are provisioned — a chicken-and-egg.

Removing the gate job for now. The two-job structure (changes → e2e)
still works for `workflow_dispatch` trial runs and for PR CI feedback
(non-blocking). Re-add an `e2e-gate` job and the corresponding branch
protection rule once a first green run is observed.

Header comment updated to flag the deferred-required-check status so
future contributors know why the workflow exists without being gated.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
G4614 added a commit that referenced this pull request Jun 12, 2026
… gap

PR #680 is sourced from a fork (G4614/boxlite). GitHub Actions does
NOT pass repository variables (the `vars` context) to workflow runs
triggered by `pull_request` from external forks — security rule to
prevent fork PRs from reading upstream secrets/vars.

Result: even though `gh variable set AWS_E2E_CLOUD_REGION` and
`AWS_E2E_CLOUD_ROLE_ARN` succeeded on boxlite-ai/boxlite, the workflow
run on this PR's head saw both as empty strings, and
`aws-actions/configure-aws-credentials@v4` failed at input validation
with "Input required and not supplied: aws-region" before any OIDC
handshake could occur.

Fix: hardcode the values in the workflow's env: block. Both are
acceptable to commit:

- `ap-northeast-1`: region name, public infrastructure information.
- IAM role ARN: contains the AWS account ID (12 digits). AWS docs
  classify account IDs as "sensitive, not secret"; the role's safety
  is enforced by its trust policy (which restricts AssumeRole to
  GitHub OIDC tokens with sub matching `repo:boxlite-ai/boxlite:*`
  patterns), not by hiding the ARN. Industry-standard pattern in OSS
  CI infrastructure.

After this PR merges, future PRs sourced from same-repo branches WILL
see the repo vars again, but the hardcoded values remain the source
of truth.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant