Re-think our pypy3 CI runs

We are striking a very delicate balance with RAM on the pypy3 CI runner. We currently use `size-xl-x64`, which has 32GB of RAM ([ref](https://user-docs.teleport.ethquokkaops.io/cicd/#runner-archs-and-sizes)). The more we increase the load, the worse this gets (obviously), but it's not just the whole process going OOM. Sometimes single runners run out of memory on particular tests and error out, rather than the whole run exiting with code 143.

- Single runners erring out ([example](https://github.com/ethereum/execution-specs/actions/runs/21020284060/job/60433728290#step:5:483))
- Full run OOM code 143 ([example](https://github.com/ethereum/execution-specs/actions/runs/21046247800/job/60521383433))

I recently experimented and was able to tune this to accommodate the up-to-Amsterdam tests [here](https://github.com/ethereum/execution-specs/pull/2022/changes/8ee14a9a077e7483d89266980fa6fa86b12385f0). But note how delicate this balance is. I was not able to give the runners 24GB of the 32GB (8 runners with 3G each): the entire process seems to OOM with these numbers, so I believe we need more than 8GB for the OS and other processes. I could not find a good balance even with 5 runners and 3G. Adding Amsterdam took us from ~63k to ~109k tests, so I think we need some better options.
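To make the arithmetic above concrete, here is a rough sketch of the memory budget. It assumes the per-runner figure acts as a cap on each runner, which may not match how the limit is actually enforced; the function name is made up for illustration.

```python
def memory_headroom_gb(total_gb: float, runners: int, per_runner_gb: float) -> float:
    """RAM left for the OS and other processes if every runner reaches its cap."""
    return total_gb - runners * per_runner_gb


TOTAL_GB = 32  # size-xl-x64, per the runner docs linked above

# The two configurations mentioned above: (number of runners, GB per runner).
for runners, per_runner_gb in [(8, 3), (5, 3)]:
    headroom = memory_headroom_gb(TOTAL_GB, runners, per_runner_gb)
    print(f"{runners} runners x {per_runner_gb}G leaves {headroom}G of headroom")
```

The 8-runner setup leaves exactly the 8GB that turned out not to be enough. Fewer runners leave more headroom, but each runner then has to work through a larger share of the tests.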

All this to say we need more than 32GB for this task or we need to reconsider how many tests we run for `pypy3`.

### Possible solutions

- 🩹 We could increase the RAM on our runner. This will work for some time, but we'll eventually run into these issues again: adding Amsterdam alone took us from 60k tests to 109k, because the Amsterdam run fills tests for all forks plus its own tests. This is the "bandaid" fix.
- 📊 We could be more mindful about the kinds of runs we spin up. We could create a nightly build that splits the full run into groups of at most 4 forks (or something like this) and triggers a separate `pypy3` run for each group, so that all forks are still covered. Then we can leave only the latest fork, or N and N-1, to run on _every_ PR. We can add a note that reviewers should manually run the tests for earlier forks if a PR affects them.

Or some other option that is more sustainable. Thoughts and ideas welcome.
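As a minimal sketch of the second option, the grouping could look something like this. The fork list is illustrative only and may not match what the test suite actually fills for, and `chunk_forks` / `pr_forks` are hypothetical helper names, not existing code in the repo.

```python
from typing import List, Sequence

# Illustrative fork list (an assumption, oldest to newest).
FORKS = [
    "Frontier", "Homestead", "Byzantium", "Constantinople", "Istanbul",
    "Berlin", "London", "Paris", "Shanghai", "Cancun", "Prague", "Osaka",
    "Amsterdam",
]


def chunk_forks(forks: Sequence[str], size: int = 4) -> List[List[str]]:
    """Split the forks into consecutive groups of at most `size` for nightly runs."""
    if size < 1:
        raise ValueError("size must be at least 1")
    return [list(forks[i:i + size]) for i in range(0, len(forks), size)]


def pr_forks(forks: Sequence[str], n: int = 2) -> List[str]:
    """Pick the latest `n` forks (N-1 and N by default) to run on every PR."""
    if n < 1:
        raise ValueError("n must be at least 1")
    return list(forks[-n:])


for index, group in enumerate(chunk_forks(FORKS), start=1):
    print(f"nightly pypy3 run {index}: {', '.join(group)}")
print(f"every PR: {', '.join(pr_forks(FORKS))}")
```

Each nightly group would then become its own `pypy3` job, so no single job has to hold the full ~109k tests in memory at once.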
