fix(ci): add retry logic for apt-get to prevent mirror desync failures#29236

Merged
alucardzom merged 3 commits into main from ale/infra-3580-apt-retry-resilience on Apr 27, 2026

@alucardzom alucardzom commented Apr 23, 2026

Description

Problem: Ubuntu apt mirror desync failures account for 26% (~17 runs) of the 64 Setup Environment CI failures on main over 30 days (Mar 16 – Apr 16, 2026). The failure signature is apt-get update failing with "File has unexpected size ... Mirror sync in progress?" when Ubuntu mirrors are mid-sync. Additionally, dpkg lock contention on Cirrus self-hosted runners caused ~2% (~1 run) of failures.

All failures are transient network/infrastructure issues; none are caused by missing packages, version conflicts, or configuration errors. See INFRA-3580 for the full root cause analysis.

Solution:

  1. setup-e2e-env/action.yml — Wrap apt-get update + apt-get install in nick-fields/retry (3 attempts, 30s wait, 3min timeout per attempt), matching the existing repo pattern already used 3 times in the same file. Add -o DPkg::Lock::Timeout=120 to handle dpkg lock contention on Cirrus runners. Add on_retry_command: sudo apt-get clean to clear cached/corrupt package lists before each retry.

  2. create-release-draft.yml — Remove unnecessary apt update && apt install gh. GitHub CLI (gh) is pre-installed on all ubuntu-latest runner images (v2.89.0 on both Ubuntu 22.04 and 24.04 per actions/runner-images).
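Because item 2 relies on gh being pre-installed, a small guard could make that assumption explicit and fail with a clear message if a runner image ever lacks it. The following is a minimal sketch only; the helper name require_cmd is hypothetical and is not part of this PR.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of this PR): fail early with a clear
# message when a tool the workflow assumes is pre-installed is missing.
require_cmd() {
  if ! command -v "$1" >/dev/null 2>&1; then
    echo "error: required command '$1' not found on this runner" >&2
    return 1
  fi
}
```

In the release draft workflow this would be called as `require_cmd gh` before `gh auth login`.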

Data-backed design decisions:

  • Full "Set up E2E environment" composite action takes 99-160s (median 139s) across 15 samples from 5 recent runs. The apt step alone is ~5-15s in the happy path.
  • timeout_minutes: 3 per attempt — apt takes 5-15s normally; even with a 120s dpkg lock wait the worst case is ~135s. 3 min is 12-36x the happy path while avoiding a 5-min wait on true hangs.
  • retry_wait_seconds: 30 gives mirrors time to finish syncing (typically resolves in <60s).
  • on_retry_command: sudo apt-get clean clears cached/corrupt package lists before each retry so apt-get update fetches fresh data from mirrors.
  • DPkg::Lock::Timeout=120 has zero cost when no lock contention exists (the normal case). Needed because Android E2E runs on Cirrus self-hosted runners (ghcr.io/cirruslabs/ubuntu-runner-amd64:24.04-lg) where background unattended-upgrades or apt-daily can hold the dpkg lock.
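The retry behavior these parameters describe can be approximated in plain bash. The sketch below is an illustration of the idea, not the nick-fields/retry implementation: the function name retry_with_cleanup and its argument order are hypothetical, the ordering of cleanup and wait between attempts is an assumption, and the per-attempt timeout is not modeled.

```shell
#!/usr/bin/env bash
# Hypothetical helper, for illustration only:
#   retry_with_cleanup MAX_ATTEMPTS WAIT_SECONDS CLEANUP_CMD COMMAND [ARGS...]
# Runs COMMAND until it succeeds or MAX_ATTEMPTS is reached. After each
# failed attempt (except the last) it runs CLEANUP_CMD, then sleeps.
retry_with_cleanup() {
  local max_attempts="$1" wait_seconds="$2" cleanup_cmd="$3"
  shift 3
  local attempt=1
  while true; do
    if "$@"; then
      return 0
    fi
    if [ "$attempt" -ge "$max_attempts" ]; then
      echo "command failed after ${attempt} attempt(s)" >&2
      return 1
    fi
    echo "attempt ${attempt} failed; retrying in ${wait_seconds}s" >&2
    eval "$cleanup_cmd"   # plays the role of on_retry_command
    sleep "$wait_seconds" # plays the role of retry_wait_seconds
    attempt=$((attempt + 1))
  done
}
```

Called as `retry_with_cleanup 3 30 'sudo apt-get clean' sudo apt-get -o DPkg::Lock::Timeout=120 update`, this mirrors the 3-attempt, 30s-wait configuration described above.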

Changelog

CHANGELOG entry: null

Related issues

Fixes: INFRA-3580 (partial — addresses Ubuntu apt mirror desync and dpkg lock contention sub-causes)

Manual testing steps

Feature: CI resilience for apt package installation

  Scenario: Android E2E setup completes successfully with retry logic
    Given a PR triggers Android E2E smoke tests on CI
    When the setup-e2e-env composite action runs on a Cirrus Linux runner
    Then the "Install required emulator dependencies" step uses nick-fields/retry
    And apt-get commands include DPkg::Lock::Timeout=120
    And transient apt mirror failures are retried up to 3 times with 30s wait

  Scenario: Release draft workflow no longer depends on apt
    Given a release tag triggers the create-release-draft workflow
    When the "Setup GitHub CLI" step runs
    Then it only runs gh auth login (no apt install)
    And the pre-installed gh CLI on ubuntu-latest is used directly

Screenshots/Recordings

N/A — CI workflow changes only, no UI impact.

Before

N/A

After

N/A

Pre-merge author checklist

  • I've followed MetaMask Contributor Docs and MetaMask Mobile Coding Standards. (N/A — CI workflow YAML only, no application code changes)
  • I've completed the PR template to the best of my ability
  • I've included tests if applicable (N/A — CI workflow configuration, validated by CI run on this PR)
  • I've documented my code using JSDoc format if applicable (N/A — CI workflow YAML, no code)
  • I've applied the right labels on the PR (see labeling guidelines). Not required for external contributors.

Pre-merge reviewer checklist

  • I've manually tested the PR (e.g. pull and build branch, run the app, test code being changed).
  • I confirm that this PR addresses all acceptance criteria described in the ticket it closes and includes the necessary testing evidence such as recordings and or screenshots.

Note

Low Risk
Low risk CI-only change; main impact is altering how Android E2E Linux dependencies are installed and could affect runner setup if the retry wrapper is misconfigured.

Overview
Improves Android E2E setup reliability by wrapping the Linux apt-get update/install step in nick-fields/retry, adding dpkg lock timeouts and cleanup on retry to better handle transient mirror/lock issues.

Simplifies create-release-draft by removing apt-based gh installation and only performing gh auth login before running the release draft script.

Reviewed by Cursor Bugbot for commit e6ee232.

@alucardzom alucardzom added the team-mobile-platform (Mobile Platform team) and no-changelog (Indicates no external facing user changes, therefore no changelog documentation needed) labels Apr 23, 2026
@github-actions

CLA Signature Action: All authors have signed the CLA. You may need to manually re-run the blocking PR check if it doesn't pass in a few minutes.

@metamaskbotv2 metamaskbotv2 Bot added the team-dev-ops (DevOps team) and INVALID-PR-TEMPLATE (PR's body doesn't match template) labels Apr 23, 2026
@alucardzom alucardzom force-pushed the ale/infra-3580-apt-retry-resilience branch from 8e455ee to f918e0c April 23, 2026 09:29
@metamaskbotv2 metamaskbotv2 Bot removed the INVALID-PR-TEMPLATE (PR's body doesn't match template) label Apr 23, 2026
Wrap apt-get commands in nick-fields/retry with 3 attempts and 30s wait
to handle transient Ubuntu mirror desync errors that account for ~26%
of setup environment CI failures. Add DPkg::Lock::Timeout=120 to handle
dpkg lock contention on Cirrus self-hosted runners.

Remove unnecessary apt install of gh CLI from create-release-draft
workflow since gh is pre-installed on all ubuntu-latest runner images.

Add on_retry_command to run apt-get clean before each retry, clearing
cached/corrupt package lists so apt-get update fetches fresh data from
mirrors. Reduce timeout_minutes from 5 to 3 — apt takes 5-15s normally
and even with a 120s dpkg lock wait the worst case is ~135s.
@alucardzom alucardzom force-pushed the ale/infra-3580-apt-retry-resilience branch from f918e0c to a03f753 April 23, 2026 10:32
@github-actions

E2E Fixture Validation — Schema is up to date
12 value mismatches detected (expected — fixture represents an existing user).

@alucardzom alucardzom marked this pull request as ready for review April 23, 2026 11:36
@alucardzom alucardzom requested a review from a team as a code owner April 23, 2026 11:36
jluque0101 previously approved these changes Apr 23, 2026
@github-actions github-actions Bot added the risk:high (AI analysis: high risk) label Apr 23, 2026
@alucardzom alucardzom requested a review from jvbriones April 23, 2026 11:41

@cursor cursor Bot left a comment

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Reviewed by Cursor Bugbot for commit a03f753.

Comment thread .github/actions/setup-e2e-env/action.yml
@alucardzom alucardzom requested a review from andrepimenta April 23, 2026 11:42
jvbriones previously approved these changes Apr 23, 2026
nick-fields/retry v3.0.2 does not set bash -e for multi-line commands.
Without it, if apt-get update or apt-get install fails, bash continues
to the trailing echo which exits 0, masking the failure and preventing
retries from triggering. Credit: Cursor Bugbot review.
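The masking effect described in this commit message can be reproduced in isolation. In the sketch below, `false` stands in for a failing apt-get call; the script text is hypothetical and is not the action's actual command.

```shell
#!/usr/bin/env bash
# Multi-line script where `false` stands in for a failing apt-get command.
script='false
echo "Dependencies installed"'

# Without -e, bash continues past the failure and the trailing echo
# determines the exit status, so the failure is masked.
status_without_e=0
bash -c "$script" >/dev/null || status_without_e=$?

# With `set -e` prepended, bash stops at the first failing command
# and the non-zero status propagates to the caller.
status_with_e=0
bash -c "set -e
$script" >/dev/null || status_with_e=$?

echo "without set -e: exit ${status_without_e}"
echo "with set -e:    exit ${status_with_e}"
```

Only the second form returns a non-zero status, which is what a retry wrapper needs in order to trigger another attempt.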
@alucardzom alucardzom dismissed stale reviews from jvbriones and jluque0101 via e6ee232 April 23, 2026 11:53
@github-actions

🔍 Smart E2E Test Selection

  • Selected E2E tags: None (no tests recommended)
  • Selected Performance tags: None (no tests recommended)
  • Risk Level: low
  • AI Confidence: 97%

E2E Test Selection:
Both changed files are pure CI/infrastructure changes with no impact on application code or test logic:

  1. .github/actions/setup-e2e-env/action.yml: Adds retry logic (via nick-fields/retry) around the apt-get dependency installation step for Android emulator setup on Linux. This is a reliability improvement to handle transient apt-get lock contention or network failures. It adds DPkg::Lock::Timeout=120 and up to 3 retry attempts. This does not change what is installed, only how reliably it gets installed.

  2. .github/workflows/create-release-draft.yml: Removes the redundant sudo apt update && sudo apt install gh step since GitHub CLI is pre-installed on GitHub-hosted runners. This is a cleanup with no functional impact on the release draft process or any test pipeline.

Neither change touches application source code, test scenarios, controllers, UI components, navigation, or any user-facing functionality. No E2E tests need to run to validate these CI infrastructure improvements.

Performance Test Selection:
No performance-relevant code was changed. Both changes are CI infrastructure improvements (retry logic for apt-get and removal of redundant gh CLI installation). These have no impact on app rendering, data loading, state management, or any other performance-sensitive area.

@github-actions

AI PR Analysis

🚫 Merge safe: false | 🟠 Risk: high

Merge decision: AI analysis did not complete — manual review required before merging.

AI analysis did not complete. Manual review recommended.

@alucardzom

Note on AI PR risk analysis failure: The AI PR risk analysis check shows high risk, but this is a false positive — the AI analysis didn't actually run. LiteLLM failed with a 401 (model access issue with anthropic/claude-sonnet-4-6) and the GitHub Copilot fallback returned a 403. The check defaulted to risk_level: high with "AI analysis did not complete — manual review required before merging." This is an infrastructure issue with the analyzer's provider configuration, not related to the PR changes.

@Cal-L Cal-L left a comment

Lgtm

@alucardzom alucardzom added this pull request to the merge queue Apr 27, 2026
Merged via the queue into main with commit ee88bd3 Apr 27, 2026
63 of 65 checks passed
@alucardzom alucardzom deleted the ale/infra-3580-apt-retry-resilience branch April 27, 2026 12:17
@github-actions github-actions Bot locked and limited conversation to collaborators Apr 27, 2026
@metamaskbotv2 metamaskbotv2 Bot added the release-7.76.0 (Issue or pull request that will be included in release 7.76.0) label Apr 27, 2026
Labels

  • no-changelog (Indicates no external facing user changes, therefore no changelog documentation needed)
  • release-7.76.0 (Issue or pull request that will be included in release 7.76.0)
  • risk:high (AI analysis: high risk)
  • size-S
  • team-dev-ops (DevOps team)
  • team-mobile-platform (Mobile Platform team)

5 participants