Skip to content

Run arm64 tests in containers on self hosted runners#15927

Merged
ahrtr merged 3 commits intoetcd-io:mainfrom
jmhbnz:run-arm64-in-container
May 26, 2023
Merged

Run arm64 tests in containers on self hosted runners#15927
ahrtr merged 3 commits intoetcd-io:mainfrom
jmhbnz:run-arm64-in-container

Conversation

@jmhbnz
Copy link
Copy Markdown
Member

@jmhbnz jmhbnz commented May 19, 2023

This pull request updates our suite of nightly arm64 tests to run within docker containers on our self hosted arm64 runners in order to fix isolation issues with workflow runs on self hosted runners.

Below is a summary of what was required to get this working:

  • Update both runners to ensure docker-io was installed and running (was missing on one).
  • Ensure the runner user on both machines was added to the docker group.
  • Restart runner to propagate user permissions update.
  • Update workflow files to add new container: golang:1.19-bullseye as the workflow image (this matches our devcontainer.
  • Set default shell for containers to bash to work around /github/home/.gitconfig does not exist for container runs actions/checkout#1169
  • Create separate robustness template for arm64.
  • Verified workflows running successfully within containers by temporarily changing to on: pull_request

I've verified that the updated workflows are running successfully, refer below:

Workflow Link
Unit/Integration https://github.com/etcd-io/etcd/actions/runs/5043837640
End to end https://github.com/etcd-io/etcd/actions/runs/5043837630/jobs/9046140490
Robustness https://github.com/etcd-io/etcd/actions/runs/5043837736/jobs/9046143973

@jmhbnz jmhbnz changed the title Try running arm64 tests in golang 1.19 bullseye container Run arm64 tests in containers on self hosted runners May 23, 2023
Comment thread .github/workflows/robustness-template-arm64.yaml Outdated
@jmhbnz
Copy link
Copy Markdown
Member Author

jmhbnz commented May 23, 2023

Note for reviewers - as a result of this pull request, in future we will have the ability to pivot to running arm64 tests on each pull request.

Our machines have capacity to do this, and with the isolation introduced in this pull request we will be free to add more runners and scale up so multiple jobs can be running at the same time.

I will monitor how this initial change goes and in a month or two can create the issue to propose how we can scale up the arm64 runners and start running at least basic unit and integration on arm64 for each pull request.

@jmhbnz jmhbnz marked this pull request as ready for review May 23, 2023 02:40
@jmhbnz jmhbnz requested review from ahrtr and serathius May 23, 2023 02:40
@ahrtr
Copy link
Copy Markdown
Member

ahrtr commented May 23, 2023

I see that there are lots of format (e.g. indent) change. Could you raise a separate PR for the format only change?

@serathius
Copy link
Copy Markdown
Member

Four things.As @ahrtr requested, please make format change a separate file.

Second, please remember that running in containers not always mean it's leaving clean slate. Containers and images will also leak from time to time. I don't think we need to implement a mitigation yet, but I would suggest we run a cron that cleans docker images and checks if there are containers running for longer than 3 hour (run timeout time) and kills them.

Third, looks like we are adding more and more changes to arm64 workers and we don't have a unified way to document it. I think we should create a bootstrap script for runners so we will not lose the knowledge on setting up docker for running tests in containers.

Fourth, we should discuss the tradeoff of having arm64 running different setup versus moving all workflows to containers. I personally prefer to avoid special cases so it's easier to use and reason about workers. I don't think there are any blockers for running all workflows on container and there are benefits like making the test environment more consistent with development configuration provided as devcontainer.

jmhbnz added 3 commits May 23, 2023 20:18
Signed-off-by: James Blair <mail@jamesblair.net>
Signed-off-by: James Blair <mail@jamesblair.net>
Signed-off-by: James Blair <mail@jamesblair.net>
@jmhbnz
Copy link
Copy Markdown
Member Author

jmhbnz commented May 23, 2023

Four things.As @ahrtr requested, please make format change a separate file.

Done - Thanks for the feedback, I will raise a follow-up to do the lint fixing to keep this one focused.

Second, please remember that running in containers not always mean it's leaving clean slate. Containers and images will also leak from time to time. I don't think we need to implement a mitigation yet, but I would suggest we run a cron that cleans docker images and checks if there are containers running for longer than 3 hour (run timeout time) and kills them.

Agree - I will raise a follow-up issue and get this done and documented.

Third, looks like we are adding more and more changes to arm64 workers and we don't have a unified way to document it. I think we should create a bootstrap script for runners so we will not lose the knowledge on setting up docker for running tests in containers.

Completely agree - I had to figure some things out the hard way to get this pr done. I think at bare minimum we should have some procedures documented for restarting or updating the runners along with access controls. As above I will raise this as a second follow-up task and sort this out.

Fourth, we should discuss the tradeoff of having arm64 running different setup versus moving all workflows to containers. I personally prefer to avoid special cases so it's easier to use and reason about workers. I don't think there are any blockers for running all workflows on container and there are benefits like making the test environment more consistent with development configuration provided as devcontainer.

Also agree - In my view we should move towards all workflows running in container once we prove from this pull request that it works nicely. It will allow us to consolidate the robustness template file once more and means we aren't maintaining two different approaches.

Comment thread .github/workflows/e2e-arm64.yaml
@serathius
Copy link
Copy Markdown
Member

Ok, looks like we agree on the direction, however I'm little scared to just migrate arm runners and follow up work being forgotten. I don't want to push all the work on one person, so I would like to ask @jmhbnz to create a issue with proper proposal to migrate all workflows to containers, so we can organize the work. This PR is a great PoC, but I want to make sure we properly scope and delegate the work.

As discussed above, landing this will require:

  • Documenting how to setup self-hosted runners for running containers
  • Testing that the setup works by creating a new node from scratch
  • Creating a script that cleans up old images/containers
  • Migrating all workflows to running containers
  • Updating devcontainer configuration to be consistent with CI

Benefit of scoping the work and listing the tasks is that any contributor can pick them and we know that no work will be forgotten.

Comment thread .github/workflows/robustness-template-arm64.yaml
@ahrtr
Copy link
Copy Markdown
Member

ahrtr commented May 24, 2023

Migrating all workflows to running containers

I am neutral to this. The github standard runners do not have the cleanup issue as our self hosted ARM runners. Technically speaking it isn't necessary to migrate all workflows into containers. But the good side to migrate all workflows into containers is that we have unified way to run all workflows, and accordingly less maintenance burden.

If we migrate all workflow, then eventually robustness-template-arm64.yaml and robustness-template.yaml should be combined into one YAML file.

Either way, let's watch how it's going on the self-hosted runner firstly.

Great work, thanks @jmhbnz

Copy link
Copy Markdown
Member

@ahrtr ahrtr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Thanks @jmhbnz

@ahrtr
Copy link
Copy Markdown
Member

ahrtr commented May 26, 2023

We already have the overall plan #15951, so let me merge this PR and let's watch how it's going on the self-hosted runner for 1 ~ 2 weeks.

@ahrtr ahrtr merged commit bf903e5 into etcd-io:main May 26, 2023
@jmhbnz
Copy link
Copy Markdown
Member Author

jmhbnz commented May 31, 2023

Happy to report after the first four days things are looking green for the new container based arm64 job runs! 😌

Will keep monitoring for another week or so.

@dims
Copy link
Copy Markdown
Contributor

dims commented May 31, 2023

@jmhbnz very cool!

@chaochn47
Copy link
Copy Markdown
Member

In case it's overlooked, it looks like unexpected that the common e2e test result will be cached. I guess the container volume needs to be cleaned up each workflow run.

% (cd tests && 'env' 'ETCD_VERIFY=all' 'go' 'test' 'go.etcd.io/etcd/tests/v3/common' '--tags=e2e' '-timeout=30m')
ok  	go.etcd.io/etcd/tests/v3/common	(cached)

@jmhbnz
Copy link
Copy Markdown
Member Author

jmhbnz commented May 31, 2023

In case it's overlooked, it looks like unexpected that the common e2e test result will be cached. I guess the container volume needs to be cleaned up each workflow run.

Great spotting @chaochn47 that definitely needs to be looked at! I've raised #15986 to investigate and will raise any tweak required.

@jmhbnz jmhbnz deleted the run-arm64-in-container branch July 27, 2023 02:13
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Development

Successfully merging this pull request may close these issues.

5 participants