Speed up CI and fix flaky E2E tests #4104

Merged
ChrisJBurns merged 7 commits into main from speeds-e2e-tests
Mar 11, 2026

@ChrisJBurns ChrisJBurns commented Mar 11, 2026

Summary

  • CI jobs were slow and occasionally flaky, for four reasons. E2E tests failed intermittently with HTTP 500 because Docker image pulls exceeded the 60s API middleware timeout on cold CI caches. E2E Lifecycle jobs were hitting the 30-minute timeout on 2-core runners. Operator E2E tests ran sequentially despite being safe to parallelize. Mock OIDC servers used hardcoded NodePorts that collided when tests ran in parallel.
  • Pre-pull Docker images (osv-mcp, gofetch, egress-proxy) in the E2E workflow so workload creation doesn't pay the image-pull cost inside the timeout window. Upgrade CPU/memory-intensive CI jobs from ubuntu-latest to ubuntu-8cores-32gb. Run operator E2E tests with 4-way Ginkgo parallelism. Use dynamic NodePort allocation for mock OIDC servers so parallel test processes never collide on port numbers. Replace a hardcoded port in TestRunConfigBuilder with dynamic allocation.

Type of change

  • Bug fix

Test plan

  • Manual testing (describe below)

Verified all CI workflows pass on the PR branch. Confirmed E2E Lifecycle jobs now complete in ~10 min (previously timing out at 30 min). Confirmed E2E tests no longer fail with HTTP 500 on first workload creation. Confirmed operator E2E tests pass with 4-way parallelism and dynamic NodePort allocation.

Changes

| File | Change |
|------|--------|
| .github/workflows/e2e-tests.yml | Add Docker image pre-pull step before E2E test execution |
| .github/workflows/lint.yml | Upgrade to ubuntu-8cores-32gb runner |
| .github/workflows/operator-ci.yml | Upgrade 4 jobs to ubuntu-8cores-32gb |
| .github/workflows/helm-charts-test.yml | Upgrade to ubuntu-8cores-32gb runner |
| .github/workflows/test-e2e-lifecycle.yml | Upgrade to ubuntu-8cores-32gb runner |
| cmd/thv-operator/Taskfile.yml | Add --procs=4 to ginkgo for parallel operator E2E tests |
| pkg/runner/config_test.go | Replace hardcoded port 60000 with networking.FindAvailable() |
| test/e2e/thv-operator/virtualmcp/helpers.go | DeployParameterizedOIDCServer uses auto-assigned NodePort and returns it |
| test/e2e/thv-operator/virtualmcp/virtualmcp_session_management_v2_test.go | Use dynamic NodePort from helper instead of hardcoded constant |
| test/e2e/thv-operator/virtualmcp/virtualmcp_auth_discovery_test.go | Use dynamic NodePort for inline mock OIDC service |
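
For the pkg/runner/config_test.go row, a common way to avoid a fixed port such as 60000 is to ask the OS for a free one. The sketch below shows that general technique using only the standard library. `findFreePort` is a hypothetical helper written for illustration; it is not the implementation of networking.FindAvailable(), which this PR does not show.

```go
package main

import (
	"fmt"
	"net"
)

// findFreePort (hypothetical helper) binds to port 0 on loopback, which tells
// the OS to choose an unused TCP port, then reports the port it chose.
func findFreePort() (int, error) {
	l, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		return 0, err
	}
	// Release the port so the caller can bind it.
	defer l.Close()

	addr, ok := l.Addr().(*net.TCPAddr)
	if !ok {
		return 0, fmt.Errorf("unexpected address type %T", l.Addr())
	}
	return addr.Port, nil
}

func main() {
	port, err := findFreePort()
	if err != nil {
		fmt.Println("could not find a free port:", err)
		return
	}
	fmt.Println("using port", port)
}
```

One caveat with this approach: the port is released before the caller binds it, so another process could in principle claim it in between. For tests this is usually an acceptable trade-off compared with a fixed port that may already be taken on the runner.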

Special notes for reviewers

  • Quick jobs (generate-crds, generate-crd-docs) are intentionally left on ubuntu-latest since they complete in under a minute.
  • Operator E2E tests use unique resource names and auto-assigned NodePorts per suite, making them safe for parallel execution.
  • The image pre-pull approach was chosen over increasing the middleware timeout because it's a structural fix that benefits all E2E test buckets equally.
  • Mock OIDC servers now use Kubernetes auto-assigned NodePorts (read back after service creation) instead of hardcoded values, eliminating port collisions entirely under parallel execution.
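
The "read back after service creation" step in the last note can be sketched as follows. This is an illustration under stated assumptions, not the code in helpers.go: `Service` and `ServicePort` are simplified stand-ins for the Kubernetes core/v1 types, `allocatedNodePort` is a hypothetical helper, and the Service name and port number in `main` are made up. The one Kubernetes fact it relies on is that a NodePort Service created without an explicit nodePort gets one assigned by the API server.

```go
package main

import (
	"errors"
	"fmt"
)

// ServicePort and Service are simplified stand-ins for the Kubernetes
// core/v1 types; only the fields this sketch needs are modelled.
type ServicePort struct {
	Name     string
	Port     int32
	NodePort int32 // 0 means unset: the API server assigns one on create
}

type Service struct {
	Name  string
	Ports []ServicePort
}

// allocatedNodePort (hypothetical helper) returns the NodePort assigned to
// the named port, or an error if the port is missing or not yet allocated.
func allocatedNodePort(svc Service, portName string) (int32, error) {
	for _, p := range svc.Ports {
		if p.Name != portName {
			continue
		}
		if p.NodePort == 0 {
			return 0, errors.New("node port has not been allocated")
		}
		return p.NodePort, nil
	}
	return 0, fmt.Errorf("service %q has no port named %q", svc.Name, portName)
}

func main() {
	// Pretend this is the Service as returned by the API server after
	// creation; the name and port values are illustrative only.
	created := Service{
		Name:  "mock-oidc",
		Ports: []ServicePort{{Name: "http", Port: 80, NodePort: 31742}},
	}

	port, err := allocatedNodePort(created, "http")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("mock OIDC server reachable on node port", port)
}
```

Because each test process reads its own allocated port instead of sharing a constant, no two processes can request the same NodePort.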

Generated with Claude Code

@ChrisJBurns ChrisJBurns requested a review from JAORMX as a code owner March 11, 2026 14:15
@github-actions github-actions bot added the size/XS Extra small PR: < 100 lines changed label Mar 11, 2026
codecov bot commented Mar 11, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 68.65%. Comparing base (7f1d943) to head (61f9857).
⚠️ Report is 1 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #4104      +/-   ##
==========================================
- Coverage   68.70%   68.65%   -0.05%     
==========================================
  Files         454      454              
  Lines       46051    46051              
==========================================
- Hits        31641    31618      -23     
- Misses      11968    11992      +24     
+ Partials     2442     2441       -1     

ChrisJBurns and others added 5 commits March 11, 2026 17:52
Signed-off-by: Chris Burns <29541485+ChrisJBurns@users.noreply.github.com>
Pre-pull Docker images (osv-mcp, gofetch, egress-proxy) in the E2E CI
workflow so workload creation does not pay the image-pull cost inside
the 60s API middleware timeout. This eliminates the class of flakiness
where the first workload creation in a CI matrix bucket fails with
HTTP 500 because the image pull exceeds the timeout.

Also replace the hardcoded port 60000 in TestRunConfigBuilder with
networking.FindAvailable() to avoid failures when that port is already
in use on the CI runner.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Upgrade CPU/memory-intensive CI jobs from ubuntu-latest (2 cores,
7 GB) to ubuntu-8cores-32gb to speed up PR turnaround:

- Linting: golangci-lint is CPU-bound, scales well with cores
- Operator tests, integration tests, build: Go compilation benefits
  from more cores
- Operator E2E tests: KIND cluster creation and Chainsaw tests are
  CPU/memory hungry (3 parallel jobs at ~8 min each)
- Helm chart tests: KIND cluster + ko builds benefit from more CPU

Quick jobs (generate-crds, generate-crd-docs) are left on
ubuntu-latest since they complete in under a minute.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The v1.34.3 and v1.35.1 jobs were hitting the 30-minute timeout on
ubuntu-latest (2 cores). The job does ko builds (3 images), Docker
pulls (6 images), kind load operations, helm deploy, and E2E tests
which is too much for a 2-core runner within the time limit.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The VirtualMCP lifecycle E2E tests run sequentially by default, taking
~24 minutes as each test suite creates/waits/tears down its own K8s
resources. All test suites use unique resource names and auto-assigned
NodePorts, so they are safe to run concurrently.

Adding --procs=4 to the ginkgo command allows 4 test suites to run
simultaneously, which should cut the test phase to ~6-8 minutes.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@ChrisJBurns ChrisJBurns changed the title DRAFT: Speeds e2e tests Speed up CI and fix flaky E2E tests Mar 11, 2026
The mock OIDC servers in session_management_v2 and auth_discovery tests
used NodePorts 30013 and 30010 which collide with auto-assigned
NodePorts when tests run in parallel. Move them to 30913 and 30910
to avoid the Kubernetes auto-assignment range (which starts low).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Hardcoded NodePorts (30013, 30010) for mock OIDC test servers collided
with auto-assigned NodePorts when operator E2E tests run in parallel.
Instead of picking "safe" high ports, let Kubernetes auto-assign
NodePorts and read them back after service creation.

- DeployParameterizedOIDCServer: remove nodePort parameter, return the
  allocated port instead
- auth_discovery_test: hoist oidcNodePort to outer var block, read back
  from service after creation, use dynamic port in getOIDCToken
- session_management_v2_test: capture returned port from helper

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@ChrisJBurns ChrisJBurns merged commit 2fa4371 into main Mar 11, 2026
73 of 74 checks passed
@ChrisJBurns ChrisJBurns deleted the speeds-e2e-tests branch March 11, 2026 18:43