ci(INFRA-3597): Phase 5 — Namespace APK fingerprint cache and artifact validation #29886

Merged
alucardzom merged 30 commits into main from phase5/cache-and-artifacts on May 14, 2026

@bsgrigorov bsgrigorov commented May 7, 2026

Contributor

Description

INFRA-3597 Phase 5 — Cache and Artifact Architecture for Namespace runner migration. Replaces fragile caches without changing build-output contracts, covering all cache families across Android and iOS builds and E2E tests.

Changes:

Android Cache Architecture

  • Gradle local cache: cache: gradle + cache: maven via nscloud-cache-action in build-android-e2e.yml and run-e2e-workflow.yml
  • Gradle remote build cache: nsc cache gradle setup with branch-based write policy (--push=false for PR/fork branches)
  • APK fingerprint cache: Marker-based at $GRADLE_USER_HOME/apk-cache/ — cache hit skips full build
  • Yarn + .metamask + node_modules: Cached via nscloud-cache-action in all relevant workflows
  • E2E shards: Share the metamask-android-build cache volume (single tag per Namespace recommendation for best convergence)

iOS Cache Architecture

  • CocoaPods: cache: cocoapods in build-ios-e2e.yml and run-e2e-workflow.yml
  • Xcode/DerivedData: cache: xcode in build-ios-e2e.yml (replaces cirruslabs/cache on Namespace)
  • Detox framework cache: ~/Library/Detox path in iOS E2E nscloud-cache-action
  • macOS symlink limitation: node_modules, ios/vendor/bundle, ~/.cocoapods/repos excluded from explicit cache paths — on macOS nscloud-cache-action uses symlinks which break Xcode ScanDependencies and Ruby/Bundler require chains

Cache Write Policy

  • Gradle remote build cache: only main, release/*, stable/* can push (--push=true); PR/fork branches read-only (--push=false)
  • Cache volumes (nscloud-cache-action): read+write for all branches per Namespace recommendation (convergence model benefits from all jobs contributing cache generations)
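
The Gradle remote cache write policy above can be sketched as a small shell helper. This is only an illustration, assuming the branch name is available as `GITHUB_REF_NAME`; the function name `gradle_push_flag` is hypothetical and is not taken from the workflow files, and how the flag is passed to `nsc cache gradle setup` is not shown here.

```shell
# Hypothetical helper: map a branch name to the Gradle remote cache push flag.
# Only main, release/* and stable/* may write; everything else is read-only.
gradle_push_flag() {
  case "$1" in
    main|release/*|stable/*) echo "--push=true" ;;
    *) echo "--push=false" ;;
  esac
}

# Usage: GITHUB_REF_NAME is set by GitHub Actions; an empty value is read-only.
echo "push flag: $(gradle_push_flag "${GITHUB_REF_NAME:-}")"
```

Defaulting to `--push=false` means any ref that is not explicitly allow-listed, including PR merge refs, can read from the remote cache but never write to it.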

Infrastructure Fixes

  • Skip overlapping actions/cache steps in setup-e2e-env when runner_provider == 'namespace' (Android system image, Yarn, Bundler, CocoaPods specs)
  • Remove /opt/android-sdk/system-images/... from nscloud-cache-action paths (pre-baked in Dockerfile base image, permission denied on bind-mount)
  • Cap Jest --maxWorkers=50% on Namespace unit shards to reduce OOM SIGKILL risk

Rollback Safety

  • All Namespace-specific logic gated on inputs.runner_provider == 'namespace'
  • runner_provider=current path unchanged and validated

Acceptance Criteria Status

| # | Criterion | Status |
|---|-----------|--------|
| 1 | Cache write policy enforced (PR/fork read-only) | DONE |
| 2 | Yarn + .metamask converted | DONE |
| 3 | Gradle local + remote build cache | DONE |
| 4 | Gradle remote cache hits verified (2+ builds) | DONE |
| 5 | APK fingerprint cache | DONE |
| 6 | CocoaPods/Bundler cache | DONE |
| 7 | DerivedData/Xcode cache | DONE |
| 8 | Detox framework cache | DONE |
| 9 | node_modules tarball preserved | DONE |
| 10 | fail-on-cache-miss not removed | DONE |
| 11 | nscloud-checkout-action not adopted (INFRA-3628) | CORRECT |
| 12 | Dashboard metrics after 2-day warm-up | Follow-up ticket (reviewed after warm-up) |

Validation Runs

| Run | Provider | Result |
|-----|----------|--------|
| 25792720168 | namespace | Android 27/27 E2E pass, iOS 23/27 (3 confirmations flakes), builds pass |
| 25795480025 | current | Rollback validation (in progress) |

Changelog

CHANGELOG entry: null

Related issues

Fixes: INFRA-3597 (parent epic INFRA-3511)

Manual testing steps

  1. Dispatch ci.yml with runner_provider=namespace — all builds and E2E tests should pass (except known flakes)
  2. Dispatch ci.yml with runner_provider=current — confirms existing Cirrus/GitHub runner path is unaffected
  3. Check Namespace cache dashboard after 2-day warm-up for steady-state hit rates

Screenshots/Recordings

N/A — CI infrastructure PR.

Pre-merge author checklist

  • I've followed MetaMask Contributor Docs and Coding Standards.
  • I've completed the PR template to the best of my ability
  • I've included tests if applicable
  • I've documented my code using JSDoc format if applicable
  • I've applied the right labels on the PR

Pre-merge reviewer checklist

  • I've manually tested the PR
  • I confirm that this PR addresses all acceptance criteria

Note

Medium Risk
Touches CI build/test gating and caching behavior across Android/iOS and E2E workflows; misconfiguration could cause cache poisoning/misses or skipped builds that break downstream tests.

Overview
Introduces Namespace-runner cache configuration via namespacelabs/nscloud-cache-action for Android (Gradle/Maven + apk-cache) and iOS (CocoaPods/Xcode + Detox cache in E2E runner), and skips redundant actions/cache restores/saves in setup-e2e-env when runner_provider == 'namespace'.

Adds a Namespace-only Android APK fingerprint cache (marker + stored APKs under ${GRADLE_USER_HOME}/apk-cache) that can short-circuit the native build path, plus Namespace Gradle remote build cache setup with branch-based push policy.

Adjusts CI stability on Namespace Linux by appending --maxWorkers=50% to sharded Jest unit runs to reduce OOM kills.
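
As a rough shell equivalent of that conditional, assuming the provider value arrives as a plain string: the function name `unit_test_command` is hypothetical, and the coverage and output flags from the real workflow step are left out for brevity.

```shell
# Hypothetical helper: build the Jest command for one of the 10 unit shards.
# Arguments: shard index, runner provider ("namespace" or "current").
unit_test_command() {
  shard="$1"
  provider="$2"
  extra=""
  # Cap workers only on Namespace, where cgroup OOM kills were observed.
  if [ "$provider" = "namespace" ]; then
    extra=" --maxWorkers=50%"
  fi
  echo "yarn test:unit --shard=${shard}/10${extra} --forceExit --silent"
}

# Usage: print the command for shard 3 on each provider.
unit_test_command 3 namespace
unit_test_command 3 current
```

Keeping the extra flag empty on the `current` path leaves the existing command string byte-for-byte unchanged there.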

Reviewed by Cursor Bugbot for commit 9fc9d14.

@bsgrigorov bsgrigorov self-assigned this May 7, 2026
github-actions Bot commented May 7, 2026

Contributor

CLA Signature Action: All authors have signed the CLA. You may need to manually re-run the blocking PR check if it doesn't pass in a few minutes.

@metamaskbotv2 metamaskbotv2 Bot added the team-dev-ops (DevOps team) label May 7, 2026
@github-actions github-actions Bot added the size-S label May 7, 2026
@alucardzom alucardzom force-pushed the phase5/cache-and-artifacts branch from 9855ab9 to edfcf90 May 11, 2026 07:46
@alucardzom alucardzom changed the title from "ci(infra-3597): cache Android E2E APK outputs on Namespace volume" to "ci(INFRA-3597): Phase 5 — Namespace APK fingerprint cache and artifact validation" May 11, 2026
@alucardzom alucardzom added the skip-smart-e2e-selection (Skip Smart E2E selection, i.e. select all E2E tests to run) label May 11, 2026
@github-actions github-actions Bot added size-M and removed size-S labels May 12, 2026
@alucardzom alucardzom force-pushed the phase5/cache-and-artifacts branch 2 times, most recently from 8cc139a to a669ea3 May 12, 2026 12:22
@bsgrigorov bsgrigorov force-pushed the phase5/cache-and-artifacts branch from 97fa1cf to 136318e May 12, 2026 18:56
alucardzom and others added 14 commits May 12, 2026 14:53
… hit detection

Namespace Cache Volumes are not key-based like cirruslabs/cache, so APK
reuse on the Namespace path requires a different mechanism. This adds:

- APK output dirs (prod/flask) to nscloud-cache-action paths so built
  APKs persist across runs on Namespace volumes
- A marker file (.e2e-apk-cache-marker) that records the key inputs
  (ref, build_type, cache_generation, fingerprint, Gradle hash) after
  a successful build
- A check step before the build gate that compares the marker to
  current inputs -- if they match AND both APKs exist, the gate
  reports needs-native-build=false and the repack path runs instead
- The marker is recorded only after a successful native build

The current (Cirrus/GH) path is unchanged -- find-reusable-build,
cirruslabs/cache, and the existing gate logic all remain gated on
runner_provider != namespace.

Adapted from Borislav Grigorov's initial approach (9855ab9) to work
with main's refactored gate pattern (find-reusable-build + gate step).

Phase 5 of INFRA-3597 / parent epic INFRA-3511.

Co-authored-by: Cursor <cursoragent@cursor.com>
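
The marker mechanism in the commit above can be sketched roughly as follows. The marker file name and the five key inputs come from the commit message; the `key=value` line format and the function names (`render_marker`, `record_marker`, `marker_matches`) are assumptions for illustration, not the workflow's actual implementation.

```shell
# Marker file name is from the commit message; the content format is assumed.
MARKER_NAME=".e2e-apk-cache-marker"

# Render the expected marker content from the five key inputs:
# ref, build_type, cache_generation, fingerprint, Gradle hash.
render_marker() {
  printf 'ref=%s\nbuild_type=%s\ncache_generation=%s\nfingerprint=%s\ngradle_hash=%s\n' \
    "$1" "$2" "$3" "$4" "$5"
}

# Record the marker in a directory; call only after a successful native build.
record_marker() {
  dir="$1"; shift
  mkdir -p "$dir" && render_marker "$@" > "$dir/$MARKER_NAME"
}

# Succeed only if a stored marker exists and equals the current inputs.
marker_matches() {
  dir="$1"; shift
  [ -f "$dir/$MARKER_NAME" ] &&
    [ "$(cat "$dir/$MARKER_NAME")" = "$(render_marker "$@")" ]
}
```

Comparing the whole rendered marker, rather than individual fields, means any change to any input invalidates the cached APKs.
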
…rence

Mounting APK output dirs directly as nscloud cache volumes prevents
Gradle from deleting/recreating them during packageProdRelease:
  Unable to delete directory 'android/app/build/outputs/apk/prod/release'

Move APK cache to ~/.namespace-apk-cache/ (a dedicated staging area
on the cache volume). On cache hit, APKs are copied FROM staging TO
the Gradle output dirs. After a successful build, APKs are copied TO
staging and the marker is recorded. This avoids mount interference
while preserving APK reuse across Namespace runs.

Co-authored-by: Cursor <cursoragent@cursor.com>
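
A minimal sketch of the copy-in/copy-out staging described in the commit above, assuming both directions can share one helper. The function name `copy_apks` is hypothetical, and the directories are passed as arguments rather than hardcoded because the staging location changed in a later commit.

```shell
# Hypothetical helper: copy every *.apk under src into dst, keeping relative
# paths. Fails when src is missing or holds no APKs, so an empty staging
# directory is treated as a cache miss rather than a silent success.
copy_apks() {
  src="$1"
  dst="$2"
  [ -d "$src" ] || return 1
  [ -n "$(find "$src" -type f -name '*.apk' | head -n 1)" ] || return 1
  (cd "$src" && find . -type f -name '*.apk') | while IFS= read -r rel; do
    mkdir -p "$dst/$(dirname "$rel")" && cp "$src/$rel" "$dst/$rel" || exit 1
  done
}

# Usage (paths are illustrative):
#   on cache hit:      copy_apks "$STAGING_DIR" android/app/build/outputs/apk
#   after a build:     copy_apks android/app/build/outputs/apk "$STAGING_DIR"
```

Because Gradle only ever sees ordinary directories under `android/app/build/outputs`, it stays free to delete and recreate them during packaging.
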
…stence

The separate ~/.namespace-apk-cache/ path was not persisting between
Namespace runs despite being listed in nscloud-cache-action. The cache
volume grew (184MB -> 501MB) but the staging dir contents were empty
on the second run.

Move staging to $GRADLE_USER_HOME/apk-cache/ which is a subdirectory
of an already-persisted cache path (GRADLE_USER_HOME/caches and
/wrapper are confirmed to persist). This avoids relying on a
standalone new path that nscloud may not handle correctly.

Co-authored-by: Cursor <cursoragent@cursor.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
Add nsc cache gradle setup step that generates an init script with
workspace-scoped short-term credentials. The script is placed in
$GRADLE_USER_HOME/init.d/ so it auto-loads on every ./gradlew
invocation without modifying scripts/build.sh.

This enables cross-run Gradle task reuse: first run populates the
remote cache (slow), subsequent runs reuse cached build outputs
(fast). Works alongside the existing nscloud-cache-action for
dependency downloads (local cache volumes for JARs/plugins).

Per Namespace docs and updated INFRA-3597 acceptance criteria.

Co-authored-by: Cursor <cursoragent@cursor.com>
…caching

Replace custom GRADLE_USER_HOME/caches and /wrapper paths with the
native cache: gradle support in nscloud-cache-action. The native
framework support handles Gradle dependency paths automatically and
may use a different persistence mechanism that resolves the cache
miss issue we've been seeing with custom path entries.

This works alongside the Gradle remote build cache (nsc cache gradle
setup) — Cache Volumes handle downloaded dependencies (JARs, plugins),
remote cache handles compilation outputs and task results.

Co-authored-by: Cursor <cursoragent@cursor.com>
… gradle

The native cache: gradle targets ~/.gradle/ but our GRADLE_USER_HOME
is /home/runner/_work/.gradle/ -- different paths. Gradle writes deps
to the custom location but the native cache mounts at the default,
so deps are never cached and Maven Central returns 429 on every cold
run.

Add both: cache: gradle for the default ~/.gradle (future-proofing)
plus explicit GRADLE_USER_HOME/caches and /wrapper for our custom
location. Belt and suspenders until GRADLE_USER_HOME is standardized.

Co-authored-by: Cursor <cursoragent@cursor.com>
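
The path mismatch in the commit above could be detected with a check along these lines. This is a sketch only: the function name `explicit_gradle_cache_paths` is hypothetical, and it assumes the native `cache: gradle` mode covers exactly the default `~/.gradle` location, as the commit message states.

```shell
# Hypothetical helper: print the cache paths that must be listed explicitly
# when GRADLE_USER_HOME is not the default ~/.gradle. Prints nothing when the
# default location is in use (or when the argument is empty).
explicit_gradle_cache_paths() {
  gradle_home="${1:-$HOME/.gradle}"
  # Strip one trailing slash so "/x/.gradle/" and "/x/.gradle" compare equal.
  gradle_home="${gradle_home%/}"
  if [ "$gradle_home" != "$HOME/.gradle" ]; then
    printf '%s\n' "$gradle_home/caches" "$gradle_home/wrapper"
  fi
}

# Usage: list any extra paths needed for the current environment.
explicit_gradle_cache_paths "${GRADLE_USER_HOME:-}"
```
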
Per Namespace team guidance: add maven cache mode alongside gradle to
ensure Maven dependency downloads (including plugins from Maven Central)
are retained in ~/.m2/repository across runs. This avoids repeated
bulk downloads that trigger HTTP 429 rate limiting from
repo.maven.apache.org on cold-cache builds.

Even after Namespace ships their in-house Maven mirrors, this local
caching remains beneficial as it skips downloads entirely.

Co-authored-by: Cursor <cursoragent@cursor.com>
…esting

Co-authored-by: Cursor <cursoragent@cursor.com>
…testing

Co-authored-by: Cursor <cursoragent@cursor.com>
…te limits

Co-authored-by: Cursor <cursoragent@cursor.com>
…id E2E on namespace

Mount the same cache volume paths (Gradle, Maven, Yarn, node_modules,
apk-cache) in E2E shards so the post-step commit preserves the build
job's cached data instead of overwriting it with an empty state.

Limit Android E2E on namespace to SmokeAccounts (1 shard) to validate
cache volume persistence end-to-end before enabling the full matrix.
iOS E2E remains skipped on namespace.

Co-authored-by: Cursor <cursoragent@cursor.com>
Only mount absolute Gradle paths in the E2E shard so its post-step
commit does not overwrite the build job's heavier node_modules/yarn
cache with a lighter fresh-install version.

Co-authored-by: Cursor <cursoragent@cursor.com>
- Guard Namespace APK cache check with force-builds override so
  force-builds label/tag triggers a fresh build on namespace too.
- Remove hardcoded phase5/cache-and-artifacts branch from Gradle
  remote cache write policy. Only main, release/*, and stable/*
  branches can push.
@alucardzom alucardzom dismissed stale reviews from andrepimenta and jvbriones via d5e44b1 May 13, 2026 12:20
Comment thread .github/workflows/build-android-e2e.yml Outdated
jvbriones previously approved these changes May 13, 2026
tommasini previously approved these changes May 13, 2026
bsgrigorov and others added 2 commits May 13, 2026 22:20
Match find-reusable-build: empty fingerprint omits source identity from the
marker; do not run Namespace APK cache hit logic in that case.

Co-authored-by: Cursor <cursoragent@cursor.com>
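
Taken together, the guards from the last few commits (force-builds override, empty fingerprint, marker comparison, APK presence) amount to a gate like the sketch below. The output name `needs-native-build` is from the earlier commit message; the function name, argument order, and `true`/`false` string convention are assumptions.

```shell
# Hypothetical gate: print "false" (native build can be skipped) only when
# every guard passes; otherwise print "true".
# Arguments: force_builds (true|false), source fingerprint,
#            marker_match (true|false), then the APK files that must exist.
needs_native_build() {
  force="$1"
  fingerprint="$2"
  marker_match="$3"
  shift 3
  # A forced build, an empty fingerprint, or a marker mismatch all mean rebuild.
  if [ "$force" = "true" ] || [ -z "$fingerprint" ] || [ "$marker_match" != "true" ]; then
    echo "true"
    return 0
  fi
  # Every expected APK must be present for the cache hit to count.
  for apk in "$@"; do
    [ -f "$apk" ] || { echo "true"; return 0; }
  done
  echo "false"
}
```

Every uncertain case falls through to a full build, so a bad or stale cache can cost time but should not hand a wrong APK to the E2E shards.
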
@bsgrigorov bsgrigorov dismissed stale reviews from jvbriones and tommasini via 39c1bed May 14, 2026 05:30
Comment thread .github/workflows/build-android-e2e.yml

@cursor cursor Bot left a comment

Cursor Bugbot has reviewed your changes and found 1 potential issue.

There are 2 total unresolved issues (including 1 from previous review).

Reviewed by Cursor Bugbot for commit 7ad2e5d.

Comment thread .github/workflows/build-ios-e2e.yml
- Skip APK cache marker recording when source-fingerprint is empty
  to avoid writing markers that can never be matched.
- Guard iOS .metamask actions/cache restore with runner_provider !=
  namespace on both native-build and reuse-hit paths to prevent
  conflict with nscloud-cache-action symlinks on macOS.
@github-actions

🔍 Smart E2E Test Selection

⏭️ Smart E2E selection skipped - skip-smart-e2e-selection label found

All E2E tests pre-selected.

@sonarqubecloud

@alucardzom alucardzom moved this from Review finalised - Ready to be merged to Needs dev review in PR review queue May 14, 2026
Comment thread .github/workflows/ci.yml
# in sync with the length of matrix.shard
- run: yarn test:unit --shard=${{ matrix.shard }}/10 --forceExit --silent --coverageReporters=json --json --outputFile=tests/results/unit-test-results-${{ matrix.shard }}.json
# Namespace Linux: cap Jest workers to reduce cgroup OOM SIGKILL without tuning heap.
- run: yarn test:unit --shard=${{ matrix.shard }}/10${{ inputs.runner_provider == 'namespace' && ' --maxWorkers=50%' || '' }} --forceExit --silent --coverageReporters=json --json --outputFile=tests/results/unit-test-results-${{ matrix.shard }}.json

Do we know the runtime impact of --maxWorkers=50% on Namespace CI runs?

@alucardzom alucardzom added this pull request to the merge queue May 14, 2026
@github-project-automation github-project-automation Bot moved this from Needs dev review to Review finalised - Ready to be merged in PR review queue May 14, 2026
Merged via the queue into main with commit 0bfd755 May 14, 2026
116 checks passed
@alucardzom alucardzom deleted the phase5/cache-and-artifacts branch May 14, 2026 14:01
@github-actions github-actions Bot locked and limited conversation to collaborators May 14, 2026

Labels

size-M, skip-smart-e2e-selection (Skip Smart E2E selection, i.e. select all E2E tests to run), team-dev-ops (DevOps team)

6 participants