Run arm64 tests in containers on self hosted runners #15927
ahrtr merged 3 commits into etcd-io:main from jmhbnz:run-arm64-in-container
|
Note for reviewers - as a result of this pull request, in future we will have the ability to pivot to running Our machines have capacity to do this, and with the isolation introduced in this pull request we will be free to add more runners and scale up so multiple jobs can be running at the same time. I will monitor how this initial change goes and in a month or two can create the issue to propose how we can scale up the |
|
I see that there are a lot of formatting (e.g. indentation) changes. Could you raise a separate PR for the format-only changes? |
|
Four things. First, as @ahrtr requested, please move the format changes into a separate PR.
Second, please remember that running in containers does not always mean we leave a clean slate. Containers and images will also leak from time to time. I don't think we need to implement a mitigation yet, but I would suggest we run a cron job that cleans up docker images, checks whether any containers have been running for longer than 3 hours (the run timeout), and kills them.
Third, it looks like we are adding more and more changes to the arm64 workers and we don't have a unified way to document them. I think we should create a bootstrap script for runners so we don't lose the knowledge of how to set up docker for running tests in containers.
Fourth, we should discuss the tradeoff between having arm64 run a different setup and moving all workflows to containers. I personally prefer to avoid special cases so the workers are easier to use and reason about. I don't think there are any blockers to running all workflows in containers, and there are benefits, like making the test environment more consistent with the development configuration provided as a devcontainer. |
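The cron job suggested in the second point could look roughly like the sketch below. It is not part of this pull request: the function names and the `--run` flag are hypothetical, it assumes GNU `date` is available to parse the start time reported by `docker inspect`, and it assumes that killing every container older than the 3 hour timeout is acceptable on a dedicated runner machine.

```shell
#!/bin/sh
# Hypothetical cleanup sketch for a self-hosted runner; not taken from the PR.
set -eu

# 3 hours, the run timeout mentioned in the comment above.
MAX_AGE_SECONDS=$((3 * 60 * 60))

# is_stale STARTED_EPOCH NOW_EPOCH
# Succeeds when the container has been running longer than MAX_AGE_SECONDS.
is_stale() {
  [ $(( $2 - $1 )) -gt "$MAX_AGE_SECONDS" ]
}

# Kill containers that outlived the timeout, then remove unused images.
cleanup() {
  now=$(date +%s)
  for id in $(docker ps -q); do
    started_at=$(docker inspect --format '{{.State.StartedAt}}' "$id")
    # Assumption: GNU date, which can parse the RFC 3339 timestamp.
    started=$(date -d "$started_at" +%s)
    if is_stale "$started" "$now"; then
      docker kill "$id"
    fi
  done
  # Remove all images not used by a container, without prompting.
  docker image prune --all --force
}

# Only touch docker when explicitly asked to, so the helper can be reused.
if [ "${1:-}" = "--run" ]; then
  cleanup
fi
```

A crontab entry on the runner could then call the script with `--run` at whatever interval suits the machines, for example hourly.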
Signed-off-by: James Blair <mail@jamesblair.net>
Done - Thanks for the feedback, I will raise a follow-up to do the lint fixing to keep this one focused.
Agree - I will raise a follow-up issue and get this done and documented.
Completely agree - I had to figure some things out the hard way to get this PR done. I think at a bare minimum we should have some procedures documented for restarting or updating the runners, along with access controls. As above, I will raise this as a second follow-up task and sort this out.
Also agree - In my view we should move towards all workflows running in containers once we prove from this pull request that it works nicely. It will allow us to consolidate the robustness template file once more and means we aren't maintaining two different approaches. |
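As a possible starting point for the runner documentation discussed above, the docker prerequisites named in this pull request's description (docker installed and running, and the `runner` user in the `docker` group) could be checked with a small script. This is only a sketch: the function names, the `--check` flag and the `RUNNER_USER` variable are hypothetical, and it verifies the setup rather than installing anything.

```shell
#!/bin/sh
# Hypothetical prerequisite check for a self-hosted runner; not taken from the PR.
set -eu

# Assumption: the runner service account is called "runner", as in the PR description.
RUNNER_USER="${RUNNER_USER:-runner}"

# has_group "SPACE SEPARATED GROUPS" GROUP
# Succeeds when GROUP appears as a whole word in the list.
has_group() {
  case " $1 " in
    *" $2 "*) return 0 ;;
    *) return 1 ;;
  esac
}

check_runner() {
  command -v docker >/dev/null 2>&1 || { echo "docker is not installed"; return 1; }
  docker info >/dev/null 2>&1 || { echo "docker daemon is not running or not reachable"; return 1; }
  has_group "$(id -nG "$RUNNER_USER")" docker || { echo "$RUNNER_USER is not in the docker group"; return 1; }
  echo "runner prerequisites look ok"
}

# Only inspect the machine when explicitly asked to.
if [ "${1:-}" = "--check" ]; then
  check_runner
fi
```

Running it with `--check` on each machine after a rebuild or update would catch the kind of drift described in the PR, where docker was missing on one runner.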
|
Ok, looks like we agree on the direction, however I'm a little worried that we migrate only the arm runners and the follow-up work gets forgotten. I don't want to push all the work onto one person, so I would like to ask @jmhbnz to create an issue with a proper proposal to migrate all workflows to containers, so we can organize the work. This PR is a great PoC, but I want to make sure we properly scope and delegate the work. As discussed above, landing this will require:
Benefit of scoping the work and listing the tasks is that any contributor can pick them and we know that no work will be forgotten. |
I am neutral on this. The GitHub standard runners do not have the cleanup issue that our self-hosted ARM runners have. Technically speaking, it isn't necessary to migrate all workflows into containers. The upside of migrating all workflows into containers is that we have a unified way to run all workflows, and accordingly less maintenance burden. If we migrate all workflows, then eventually Either way, let's first watch how it goes on the self-hosted runners. Great work, thanks @jmhbnz |
|
We already have the overall plan #15951, so let me merge this PR and let's watch how it's going on the self-hosted runner for 1 ~ 2 weeks. |
|
Happy to report that after the first four days things are looking green for the new container-based
Will keep monitoring for another week or so. |
|
@jmhbnz very cool! |
|
In case it's been overlooked: it looks unexpected that the common e2e test results are being cached. I guess the container volume needs to be cleaned up on each workflow run. |
Great spotting @chaochn47, that definitely needs to be looked at! I've raised #15986 to investigate and will raise any tweaks required. |
This pull request updates our suite of nightly `arm64` tests to run within docker containers on our self hosted `arm64` runners in order to fix isolation issues with workflow runs on self hosted runners.
Below is a summary of what was required to get this working:
- `docker-io` was installed and running (was missing on one).
- `runner` user on both machines was added to the `docker` group.
- `container: golang:1.19-bullseye` as the workflow image (this matches our devcontainer.
- `arm64`. `on: pull_request`
I've verified that the updated workflows are running successfully, refer below: