fix(ci): add restore-keys and continue-on-error to yarn cache #29277

Merged
andrepimenta merged 2 commits into main from ale/infra-3580-yarn-cache-resilience
May 5, 2026

Conversation

@alucardzom alucardzom commented Apr 23, 2026

Contributor

Description

Problem: iOS yarn cache download stalls account for 28% (~18 runs) of the 64 Setup Environment CI failures on main over 30 days (Mar 16 – Apr 16, 2026). The 610MB yarn cache downloads at 0.3-2.8 MB/s on Cirrus macOS self-hosted runners (ghcr.io/cirruslabs/macos-runner:tahoe), with tar extraction sometimes hanging for 9+ minutes until the action times out.

See INFRA-3580 for the full root cause analysis.

Investigation data:

Examined 6 iOS E2E setup failures from Apr 8-10:

| Run | What happened |
|---|---|
| 24268253971 | Cache hit, 582MB downloaded at ~2.6 MB/s (3.8min), tar extraction hung 9.4min → timeout |
| 24236471310 | Action timeout during bundle show cocoapods |
| 24227213798 | Foundry download failure (covered by PR #29255) |
| 24226340316 | Generic action timeout |
| 24220368199 | Corepack download stalled: repo.yarnpkg.com hung 4.8min |
| 24217223040 | Corepack download stalled: same, hung 2.6min |

iOS E2E full setup timing (Cirrus macOS runners):

| Period | Samples | Min | Max | Avg | Median | P95 |
|---|---|---|---|---|---|---|
| Apr 8-10 (failure period) | 15 | 180s | 202s | 189s | 187s | 202s |
| Apr 22-23 (recent) | 15 | 188s | 267s | 208s | 204s | 267s |

iOS setup takes ~2x longer than Android (189-208s vs 103-136s) due to Ruby, Bundler, CocoaPods, and Detox on top of shared steps.

Solution — two changes:

  1. restore-keys fallback — Currently the yarn cache is exact-match only. On cache miss (e.g., yarn.lock changed), it triggers a full 610MB re-download. Adding restore-keys lets yarn reuse a stale cache and only update the diff. Pattern already used by Bundler cache in the same file (line 272).

  2. continue-on-error: true — If the cache download or tar extraction stalls, the job currently fails entirely. With continue-on-error, a stalled cache is skipped and yarn install --immutable runs without cache (slower but succeeds). Pattern already used by CocoaPods specs cache in the same file (line 333).
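
For illustration, the two changes might look roughly like the step below. This is a sketch, not the merged diff: the step name comes from the testing notes in this description, while the step id, the `actions/cache@v4` reference, the cache path, and the key format are assumptions.

```yaml
# Hypothetical sketch of the cache step inside the setup-e2e-env composite action.
- name: Restore Yarn cache
  id: yarn-cache                 # assumed id, not taken from the workflow
  uses: actions/cache@v4         # assumed action and version
  continue-on-error: true        # a stalled download or extraction no longer fails the job
  with:
    path: node_modules           # assumed path (the overview mentions a node_modules cache)
    # Assumed key format: exact match on the lockfile hash.
    key: ${{ runner.os }}-yarn-${{ hashFiles('yarn.lock') }}
    # Fallback: accept a cache whose key shares this prefix when the exact key misses.
    restore-keys: |
      ${{ runner.os }}-yarn-
```

With a key layout like this, a changed yarn.lock misses the exact key but can still match the prefix, so the restore starts from an earlier cache instead of an empty directory.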

Why restore-keys is safe

The concern: could restore-keys restore a stale node_modules from main with different package versions?

No. yarn install --immutable guarantees that node_modules will exactly match yarn.lock when it finishes, regardless of cache state. The cache is just a starting point — yarn adds, removes, or changes whatever is needed to reconcile. --immutable only prevents yarn.lock modifications, not node_modules updates.

Example: main merges a PR updating package X from v1.0→v1.1. Your PR branch (not rebased) still has X@v1.0 in yarn.lock. restore-keys restores main's cache with X@v1.1. yarn install --immutable runs → installs X@v1.0 per your lockfile. Same outcome as a cold install — the lockfile is always the source of truth. Cache only affects speed, not correctness.
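
The argument above depends on the install step always running after the restore. A minimal sketch of that step, assuming it sits in the same composite action (the step name is hypothetical):

```yaml
# Hypothetical install step that follows the cache restore.
# There is no `if:` guard on the cache result, so it runs after an exact hit,
# a restore-keys fallback, or a complete miss.
- name: Install dependencies
  shell: bash                    # `run` steps in a composite action must declare a shell
  run: yarn install --immutable  # reconciles node_modules with yarn.lock; fails if yarn.lock would change
```

If the real workflow guards the install on the cache result, that guard would need to treat a fallback restore as a miss; with `actions/cache`, the `cache-hit` output is `'true'` only for an exact key match.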

continue-on-error — needs team input

This makes the cache step non-blocking. If the 610MB download stalls or tar extraction hangs (actual observed failure: 9.4min hang), the step fails gracefully and yarn install --immutable runs without cache (slower ~60-90s but succeeds).

Pros: stalled cache no longer kills the job; yarn install --immutable still produces correct node_modules; pattern already used by CocoaPods specs cache.

Cons: cache failures become silent (job passes but slower); reduces visibility into cache infra problems.

This change can be removed from the PR if the team prefers cache-or-fail behavior. The restore-keys change alone still adds value.
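
One possible way to offset the visibility concern, offered only as a sketch and not part of this PR: a follow-up step that surfaces a warning annotation when the cache step fails. It assumes the cache step has the id `yarn-cache`, which is a hypothetical name.

```yaml
# Hypothetical follow-up step; `yarn-cache` is an assumed step id.
# `outcome` reports the step result before continue-on-error is applied,
# so it still reads 'failure' even though the job keeps going.
- name: Warn when Yarn cache restore failed
  if: ${{ steps.yarn-cache.outcome == 'failure' }}
  shell: bash
  run: |
    echo "::warning title=Yarn cache::Cache restore failed; continuing without cache"
```

The annotation shows up in the run summary, so a slow but passing job still leaves a trace of the cache failure.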

Changelog

CHANGELOG entry: null

Related issues

Fixes: INFRA-3580 (partial — addresses iOS yarn cache download stall sub-cause)

Manual testing steps

Feature: CI resilience for yarn cache

  Scenario: Yarn cache miss uses partial match fallback
    Given yarn.lock has changed since the last cached run
    When the "Restore Yarn cache" step runs
    Then it falls back to a partial key match via restore-keys
    And yarn install only updates the diff instead of full re-download

  Scenario: Stalled cache download doesn't block the job
    Given the yarn cache download stalls on a Cirrus macOS runner
    When the "Restore Yarn cache" step times out
    Then the step is marked as failed but the job continues (continue-on-error)
    And yarn install --immutable runs without cache

Screenshots/Recordings

N/A — CI workflow changes only, no UI impact.

Before

N/A

After

N/A

Pre-merge author checklist

  • I've followed MetaMask Contributor Docs and MetaMask Mobile Coding Standards. (N/A — CI workflow YAML only, no application code changes)
  • I've completed the PR template to the best of my ability
  • I've included tests if applicable (N/A — CI workflow configuration, validated by CI run on this PR)
  • I've documented my code using JSDoc format if applicable (N/A — CI workflow YAML, no code)
  • I've applied the right labels on the PR (see labeling guidelines). Not required for external contributors.

Pre-merge reviewer checklist

  • I've manually tested the PR (e.g. pull and build branch, run the app, test code being changed).
  • I confirm that this PR addresses all acceptance criteria described in the ticket it closes and includes the necessary testing evidence such as recordings and or screenshots.

Note

Low Risk
CI-only workflow change; it may reduce visibility into cache failures but doesn’t affect runtime code or dependency correctness.

Overview
Improves resilience of the setup-e2e-env composite action’s Yarn node_modules cache restore.

The Yarn cache step now uses restore-keys to allow partial cache matches when yarn.lock changes, and is marked continue-on-error: true so cache download/extraction failures don’t fail the job and yarn install --immutable can proceed without cache.

Reviewed by Cursor Bugbot for commit 2718150.

Add restore-keys fallback to yarn cache so partial key matches can
reuse a stale cache instead of forcing a full 610MB re-download on
cache miss. Add continue-on-error so a stalled cache download or
extraction doesn't fail the entire job — yarn install will run without
cache (slower but succeeds).

Both patterns already exist in the same file: Bundler cache uses
restore-keys (line 272) and CocoaPods cache uses continue-on-error
(line 333). This addresses the iOS yarn cache download stall sub-cause
of INFRA-3580 (28%, ~18 runs).

@alucardzom added the team-mobile-platform and no-changelog labels Apr 23, 2026
@github-actions

Contributor

CLA Signature Action: All authors have signed the CLA. You may need to manually re-run the blocking PR check if it doesn't pass in a few minutes.

@metamaskbotv2 (bot) added the team-dev-ops label Apr 23, 2026
@alucardzom requested a review from jvbriones April 23, 2026 16:12
@alucardzom marked this pull request as ready for review April 24, 2026 11:07
@alucardzom requested a review from a team as a code owner April 24, 2026 11:07
@alucardzom added the skip-smart-e2e-selection label Apr 24, 2026
@github-actions

Contributor

🔍 Smart E2E Test Selection

⏭️ Smart E2E selection skipped - skip-smart-e2e-selection label found

All E2E tests pre-selected.

@github-actions

Contributor

E2E Fixture Validation — Schema is up to date
12 value mismatches detected (expected — fixture represents an existing user).

@alucardzom

Contributor Author

On hold pending #29247. PR #29247 (ci: reuse native E2E builds across commits and PRs) changes how E2E setup works — the lean reuse path still runs yarn install but with a shorter overall budget, reducing timeout pressure. Will re-evaluate after #29247 merges and CI failure rates are observed.

@alucardzom

Contributor Author

Ready to review

@andrepimenta added this pull request to the merge queue May 5, 2026
Merged via the queue into main with commit fa695f6 May 5, 2026
61 checks passed
@andrepimenta deleted the ale/infra-3580-yarn-cache-resilience branch May 5, 2026 09:28
@github-actions (bot) locked and limited conversation to collaborators May 5, 2026
@metamaskbotv2 (bot) added the release-7.77.0 label May 5, 2026
Labels

no-changelog, release-7.77.0, size-XS, skip-smart-e2e-selection, team-dev-ops, team-mobile-platform

3 participants