fix(ci): improve CocoaPods install resilience against CDN rate limiting #29334

Merged: andrepimenta merged 2 commits into main from ale/infra-3580-cocoapods-cdn-resilience on May 5, 2026

@alucardzom commented Apr 24, 2026

Description

Problem

CocoaPods CDN rate limiting accounts for 7% (~4 runs) of Setup Environment CI failures on main over 30 days (Mar 16 – Apr 16, 2026), per INFRA-3580 analysis. The error signatures are:

  • CDN: trunk URL couldn't be downloaded ... Response: 429 Too Many Requests
  • HTTP/2 framing layer errors during pod install
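To count how often these signatures appear, a saved job log can be scanned for them. The following is a minimal sketch, not part of this PR: the function name is hypothetical, and the patterns are loose substrings of the two signatures quoted above, so the exact wording in real logs may differ.

```shell
# Hypothetical helper: exits 0 when a saved pod install log contains one of
# the two CDN failure signatures quoted in this PR, 1 otherwise.
is_cdn_rate_limit_failure() {
  local log_file="$1"
  # Assumption: these substrings are enough to identify the failures;
  # adjust them to the exact wording seen in real job logs.
  grep -Eq '429 Too Many Requests|framing layer' "$log_file"
}
```

Running it over the failed Setup Environment logs for a period would give a count comparable to the figure quoted above.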

Root Cause

The current flow in setup-e2e-env/action.yml defeats its own CocoaPods specs cache:

  1. actions/cache@v4 restores ~/.cocoapods/repos (including trunk specs) — 100% cache hit rate observed
  2. pod repo remove trunk || true immediately deletes the restored trunk specs
  3. pod install --repo-update must re-download the entire trunk from cdn.cocoapods.org — thousands of HTTP requests

This maximises CDN pressure on every single iOS CI run, increasing the surface area for 429 rate-limit errors.

When trunk exists locally, --repo-update performs an incremental delta update — minimal CDN requests, fast.
Without trunk (after removal), --repo-update downloads everything from scratch — heavy CDN load, slow, triggers 429s.
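Whether trunk is present when pod install starts can be checked with a small diagnostic. This is a sketch under one assumption: that the restored cache follows the standard CocoaPods layout, with trunk specs in a trunk directory under ~/.cocoapods/repos. The function name is hypothetical.

```shell
# Hypothetical diagnostic: report whether trunk specs exist locally before
# pod install runs. The default path is an assumption based on the
# ~/.cocoapods/repos cache path mentioned in this PR.
trunk_state() {
  local repos_dir="${1:-$HOME/.cocoapods/repos}"
  if [ -d "$repos_dir/trunk" ]; then
    echo "present"   # per this PR, --repo-update then only fetches deltas
  else
    echo "missing"   # per this PR, --repo-update then re-downloads trunk
  fi
}
```

Calling trunk_state right after the cache restore and again after pod repo remove trunk would show the cache being discarded.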

Why pod repo remove trunk was added

The step was added in PR #28433 (commit bc06cd5123, Apr 8, 2026) as part of the macOS Sequoia → Tahoe migration for Xcode 26.x support. It was added proactively with the comment "prevent stale specs" — no review comments discussed the rationale, and no specific CDN failure motivated it.

Since Cirrus runners are ephemeral (VMs destroyed after each job — confirmed by Cirrus Labs: "Every job is executed in a reproducible isolated environment which is completely destroyed after the job is finished"), the only way trunk specs exist at the start of pod install is via the actions/cache restore. There is no leftover state from previous jobs. The --repo-update flag already handles staleness by fetching deltas when trunk exists locally.

Data

Pod install timing (successful runs, with trunk removal + full CDN re-download):

| Run ID | Duration | Notes |
| --- | --- | --- |
| 24887670082 | 116s (1m 56s) | Full CDN download after trunk removal |
| 24887265337 | ~100s | Cache hit, trunk removed, re-downloaded |
| 24888574759 | ~120s | Cache hit, trunk removed, re-downloaded |

CocoaPods specs cache hit rate: 100% (exact or restore-key match in all sampled runs)

Cache flow (current, counterproductive):

  • Cache restore: ~1 MB restored from GitHub Actions cache → trunk specs present
  • pod repo remove trunk: deletes restored specs → trunk gone
  • pod install --repo-update: full CDN download → thousands of requests → 429 risk

Solution

  1. Remove standalone pod repo remove trunk step — let cached specs be used on first attempt for incremental --repo-update (low CDN load)
  2. Move trunk removal to on_retry_command — only clean on failure (handles corrupt/stale cache edge case)
  3. Increase max_attempts from 2 to 3 — matches other retry steps in the same file; CDN rate limits may need a third attempt
  4. Increase retry_wait_seconds from 30 to 60 — CDN 429 backoff windows need longer wait than apt mirror desyncs
  5. Add ::warning:: annotation on retry — makes CDN failures visible in GitHub Actions UI
  6. Add COCOAPODS_DISABLE_STATS=true — eliminates unnecessary analytics network calls during CI

First attempt (happy path): cached trunk specs + incremental --repo-update → few HTTP requests, low CDN load
On failure: trunk removed for clean slate → full CDN download on retry, warned in Actions UI
Second/third attempt: fresh download from CDN with 60s backoff between attempts
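The retry flow above can be approximated in plain shell. This is a sketch for illustration, not the actual step in setup-e2e-env/action.yml, which configures a retry action through max_attempts, retry_wait_seconds and on_retry_command. All function names are hypothetical, and running the cleanup before the wait is an assumption.

```shell
# Sketch of the retry policy: use cached specs first, clean trunk only on failure.
# Usage: install_with_retry MAX_ATTEMPTS WAIT_SECONDS INSTALL_CMD CLEANUP_CMD
install_with_retry() {
  local max_attempts="$1" wait_seconds="$2" install_cmd="$3" cleanup_cmd="$4"
  local attempt=1
  while true; do
    # The first attempt runs against whatever specs the cache restored.
    if "$install_cmd"; then
      return 0
    fi
    if [ "$attempt" -ge "$max_attempts" ]; then
      return 1
    fi
    # Surface the retry as a warning annotation in the GitHub Actions UI.
    echo "::warning::pod install failed (attempt ${attempt}/${max_attempts}); removing trunk and retrying in ${wait_seconds}s"
    # Clean slate for the next attempt; ignore a failed cleanup.
    "$cleanup_cmd" || true
    sleep "$wait_seconds"
    attempt=$((attempt + 1))
  done
}

# Commands named in this PR, wrapped so they can be passed by name.
run_pod_install() { COCOAPODS_DISABLE_STATS=true pod install --repo-update; }
remove_trunk() { pod repo remove trunk; }

# Invocation with the values from this PR (commented out so that sourcing
# this file does not start an install):
# install_with_retry 3 60 run_pod_install remove_trunk
```

Passing the install and cleanup commands by name keeps the policy separate from CocoaPods, so the loop can be exercised with stub functions and a zero-second wait.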

Changelog

CHANGELOG entry: null

Related issues

Refs: INFRA-3580

Manual testing steps

N/A — CI infrastructure change. Validated by any iOS E2E workflow run (retry logic is transparent in the happy path). The pod install step behavior is identical when the first attempt succeeds.

Screenshots/Recordings

Before

N/A

After

N/A

Pre-merge author checklist

Pre-merge reviewer checklist

  • I've manually tested the PR (e.g. pull and build branch, run the app, test code being changed).
  • I confirm that this PR addresses all acceptance criteria described in the ticket it closes and includes the necessary testing evidence such as recordings and/or screenshots.

Made with Cursor


Note

Medium Risk
Changes the iOS CI dependency install flow and retry behavior, which could affect build stability if cache state differs from expectations, but it is limited to CI setup and guarded by retries.

Overview
Improves iOS E2E CI CocoaPods installation reliability by stopping the unconditional pod repo remove trunk cleanup so the restored CocoaPods specs cache can be used on the first pod install --repo-update attempt.

The CocoaPods install retry policy is strengthened (attempts 2→3, wait 30s→60s), and trunk cleanup is moved into on_retry_command with a GitHub Actions ::warning:: annotation; COCOAPODS_DISABLE_STATS=true is added to reduce extra network calls during CI.

Reviewed by Cursor Bugbot for commit 13ce7a7. Bugbot is set up for automated code reviews on this repo.

Remove the standalone `pod repo remove trunk` step that runs before
every `pod install`. On ephemeral Cirrus runners the only source of
trunk specs is the actions/cache restore — deleting them forces a full
CDN re-download (thousands of HTTP requests) on every run, increasing
the surface area for 429 rate-limit errors.

Move trunk removal into `on_retry_command` so cached specs are used for
an incremental `--repo-update` on the first attempt (low CDN load), and
only cleared on failure for a clean retry.

Additional improvements:
- Increase max_attempts from 2 to 3 (matches other retry steps)
- Increase retry_wait_seconds from 30 to 60 (longer backoff for CDN 429)
- Add ::warning:: annotation on retry for visibility in Actions UI
- Add COCOAPODS_DISABLE_STATS=true to skip analytics during CI

Made-with: Cursor

@alucardzom self-assigned this Apr 24, 2026
@github-actions

CLA Signature Action: All authors have signed the CLA. You may need to manually re-run the blocking PR check if it doesn't pass in a few minutes.

@metamaskbotv2 (bot) added the team-dev-ops label Apr 24, 2026
@alucardzom added the no-changelog and team-mobile-platform labels Apr 24, 2026
@alucardzom marked this pull request as ready for review April 27, 2026 11:49
@alucardzom requested a review from a team as a code owner April 27, 2026 11:49
@alucardzom added the skip-smart-e2e-selection label Apr 27, 2026
@github-actions

🔍 Smart E2E Test Selection

⏭️ Smart E2E selection skipped - skip-smart-e2e-selection label found

All E2E tests pre-selected.


@alucardzom

On hold pending #29247. PR #29247 (ci: reuse native E2E builds across commits and PRs) introduces a fingerprint-driven build reuse system that skips Ruby/Bundler/CocoaPods/Xcode setup on the reuse path. Once it merges, pod install will run far less frequently, significantly reducing the CDN rate limiting surface area. Will re-evaluate whether this fix is still needed after #29247 lands and CI failure rates are observed.

@github-actions

E2E Fixture Validation — Schema is up to date
12 value mismatches detected (expected — fixture represents an existing user).

@andrepimenta added this pull request to the merge queue May 5, 2026
Merged via the queue into main with commit 9348a74 May 5, 2026
61 checks passed
@andrepimenta deleted the ale/infra-3580-cocoapods-cdn-resilience branch May 5, 2026 09:28
@github-actions (bot) locked and limited conversation to collaborators May 5, 2026
@metamaskbotv2 (bot) added the release-7.77.0 label May 5, 2026
Labels

  • no-changelog: Indicates no external facing user changes, therefore no changelog documentation needed
  • release-7.77.0: Issue or pull request that will be included in release 7.77.0
  • size-S
  • skip-smart-e2e-selection: Skip Smart E2E selection, i.e. select all E2E tests to run
  • team-dev-ops: DevOps team
  • team-mobile-platform: Mobile Platform team

3 participants