
fix(ci): add retry logic for Foundry installation #29255

Closed
alucardzom wants to merge 1 commit into main from ale/infra-3580-foundry-retry-resilience


Conversation


@alucardzom alucardzom commented Apr 23, 2026

Contributor

Description

Problem: Foundry download failures account for 16% (~10 runs) of the 64 Setup Environment CI failures on main over 30 days (Mar 16 – Apr 16, 2026). The failure modes include curl: (35) Recv failure: Connection reset by peer, corrupt tar archives, and malformed URLs. The current implementation has zero retry logic — a single curl or foundryup failure kills the entire step.

See INFRA-3580 for the full root cause analysis.

Solution:

Wrap the Foundry installation step in nick-fields/retry (3 attempts, 30s wait, 3min timeout per attempt), matching the existing repo pattern already used for Corepack, Yarn, and CocoaPods in the same file.

Key details:

  • set -euo pipefail at the top of the command block to ensure failures propagate correctly (learned from bugbot review on PR #29236)
  • on_retry_command cleans up partial/corrupt Foundry downloads (rm -rf $FOUNDRY_DIR) before each retry — addresses the "corrupt tar" failure mode
  • timeout_minutes: 3 per attempt — Foundry download + install typically takes seconds
  • GITHUB_PATH modification (echo "$FOUNDRY_BIN" >> "$GITHUB_PATH") works correctly inside nick-fields/retry since it's a file-based mechanism
  • Replaced env: FOUNDRY_VERSION with inline ${{ inputs.foundry-version }} expression since uses: steps in composite actions resolve inputs at parse time
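The key details above can be sketched as a single composite-action step. This is a hypothetical reconstruction, not the PR's actual diff: the `@v3` tag, the `$HOME/.foundry` location for `FOUNDRY_DIR`, and the exact `curl` flags are assumptions. The `with:` keys are the documented inputs of `nick-fields/retry`.

```yaml
# Hypothetical sketch of the "Install Foundry" step; not the PR's actual diff.
- name: Install Foundry
  uses: nick-fields/retry@v3          # version tag is an assumption
  with:
    timeout_minutes: 3                # per-attempt timeout
    max_attempts: 3
    retry_wait_seconds: 30
    shell: bash
    # Remove any partial or corrupt download before the next attempt.
    on_retry_command: rm -rf "$HOME/.foundry"   # assumed FOUNDRY_DIR
    command: |
      # Fail the attempt on any error, unset variable, or failed pipe stage.
      set -euo pipefail
      FOUNDRY_DIR="$HOME/.foundry"    # assumed install location
      FOUNDRY_BIN="$FOUNDRY_DIR/bin"
      mkdir -p "$FOUNDRY_BIN"
      # -f makes curl exit non-zero on HTTP errors so the retry can trigger.
      curl -sSfL https://raw.githubusercontent.com/foundry-rs/foundry/master/foundryup/foundryup -o "$FOUNDRY_BIN/foundryup"
      chmod +x "$FOUNDRY_BIN/foundryup"
      # File-based mechanism: later steps see this directory on PATH.
      echo "$FOUNDRY_BIN" >> "$GITHUB_PATH"
      # The input is substituted before the shell runs, so no env: mapping is needed.
      "$FOUNDRY_BIN/foundryup" -i "${{ inputs.foundry-version }}"
```

Because `on_retry_command` deletes the whole directory, each attempt starts from a clean state, so a corrupt archive left by one attempt cannot affect the next.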

Changelog

CHANGELOG entry: null

Related issues

Fixes: INFRA-3580 (partial — addresses Foundry download failure sub-cause)

Manual testing steps

Feature: CI resilience for Foundry installation

  Scenario: Foundry install completes successfully with retry logic
    Given a PR triggers E2E tests on CI

    When the setup-e2e-env composite action runs
    Then the "Install Foundry" step uses nick-fields/retry
    And transient curl/foundryup failures are retried up to 3 times with 30s wait
    And partial downloads are cleaned up before each retry

  Scenario: Foundry binaries are available to subsequent steps
    Given the "Install Foundry" step succeeds via nick-fields/retry

    When subsequent steps reference forge or cast
    Then the binaries are found on PATH via GITHUB_PATH

Screenshots/Recordings

N/A — CI workflow changes only, no UI impact.

Before

N/A

After

N/A

Pre-merge author checklist

  • I've followed MetaMask Contributor Docs and MetaMask Mobile Coding Standards. (N/A — CI workflow YAML only, no application code changes)
  • I've completed the PR template to the best of my ability
  • I've included tests if applicable (N/A — CI workflow configuration, validated by CI run on this PR)
  • I've documented my code using JSDoc format if applicable (N/A — CI workflow YAML, no code)
  • I've applied the right labels on the PR (see labeling guidelines). Not required for external contributors.

Pre-merge reviewer checklist

  • I've manually tested the PR (e.g. pull and build branch, run the app, test code being changed).
  • I confirm that this PR addresses all acceptance criteria described in the ticket it closes and includes the necessary testing evidence such as recordings and/or screenshots.

Note

Low Risk
Low risk CI-only change that adds retries and cleanup around Foundry installation; main risk is masking persistent install issues or increasing setup time slightly on repeated failures.

Overview
Improves CI resilience by wrapping the Install Foundry step in .github/actions/setup-e2e-env/action.yml with nick-fields/retry (3 attempts, 30s wait between attempts, 3-minute timeout per attempt).

Each retry now cleans up the Foundry directory to avoid partial/corrupt installs, and the install command is hardened with set -euo pipefail while directly using ${{ inputs.foundry-version }} for the requested version.

Reviewed by Cursor Bugbot for commit 959916f. Bugbot is set up for automated code reviews on this repo.

Wrap foundryup install in nick-fields/retry with 3 attempts, 30s wait,
and 3min timeout to handle transient network failures that account for
~16% of setup environment CI failures. Add set -euo pipefail to ensure
failures propagate correctly. Add on_retry_command to clean up partial
or corrupt downloads before each retry attempt.
@alucardzom alucardzom added the team-mobile-platform (Mobile Platform team) and no-changelog (Indicates no external facing user changes, therefore no changelog documentation needed) labels Apr 23, 2026
@github-actions

Contributor

CLA Signature Action: All authors have signed the CLA. You may need to manually re-run the blocking PR check if it doesn't pass in a few minutes.

@metamaskbotv2 metamaskbotv2 Bot added the team-dev-ops DevOps team label Apr 23, 2026
@alucardzom alucardzom marked this pull request as ready for review April 23, 2026 14:56
@alucardzom alucardzom requested a review from a team as a code owner April 23, 2026 14:56
@github-actions github-actions Bot added the risk:high AI analysis: high risk label Apr 23, 2026
@jvbriones jvbriones added the skip-smart-e2e-selection Skip Smart E2E selection, i.e. select all E2E tests to run label Apr 23, 2026
@github-actions

Contributor

🔍 Smart E2E Test Selection

⏭️ Smart E2E selection skipped - skip-smart-e2e-selection label found

All E2E tests pre-selected.

View GitHub Actions results


@github-actions

Contributor

E2E Fixture Validation — Schema is up to date
12 value mismatches detected (expected — fixture represents an existing user).
View details

@github-actions

Contributor

AI PR Analysis

🚫 Merge safe: false | 🟠 Risk: high

Merge decision: AI analysis did not complete — manual review required before merging.

AI analysis did not complete. Manual review recommended.

View run

@Cal-L Cal-L left a comment

Contributor

Lgtm

@alucardzom alucardzom enabled auto-merge April 27, 2026 12:03
@alucardzom

Contributor Author

Closing — superseded by #29191

PR #29191 (test: installs foundry with yarn instead of global install, merged Apr 23 by @christopherferreira9) replaced the Foundry installation mechanism on main. The curl + foundryup approach this PR wrapped with retry logic no longer exists.

Before (what this PR retried):

  • Direct curl to raw.githubusercontent.com/foundry-rs/foundry/master/foundryup/foundryup
  • foundryup -i <version> downloading binaries from GitHub releases
  • Failure modes: connection reset, corrupt tar, malformed URL — zero retry logic

After (#29191, now on main):

  • yarn install:foundryup → uses project's package.json mm-foundryup config
  • Binary resolved via node_modules/.bin/anvil, matching local dev and tests/seeder/anvil-manager.ts
  • Version/checksums managed by package.json, not a workflow input

The original failure modes (curl connection reset to raw.githubusercontent.com, corrupt tar from GitHub releases) no longer apply since the install now goes through yarn's package resolution, which already has retry logic via nick-fields/retry in the same composite action.

INFRA-3580 impact: The Foundry download failure sub-cause (16%, ~10 runs) is now addressed by the architecture change in #29191, not by retry logic. Closing this PR as obsolete.

@alucardzom alucardzom closed this Apr 27, 2026
auto-merge was automatically disabled April 27, 2026 12:25

Pull request was closed

@github-actions github-actions Bot locked and limited conversation to collaborators Apr 27, 2026

Labels

  • no-changelog: Indicates no external facing user changes, therefore no changelog documentation needed
  • risk:high: AI analysis: high risk
  • size-S
  • skip-smart-e2e-selection: Skip Smart E2E selection, i.e. select all E2E tests to run
  • team-dev-ops: DevOps team
  • team-mobile-platform: Mobile Platform team


3 participants