Skip to content

eks: changes required to support EKS cluster#1350

Merged
laurentsenta merged 85 commits intomasterfrom
tmp_eks_cluster
Nov 28, 2022
Merged

eks: changes required to support EKS cluster#1350
laurentsenta merged 85 commits intomasterfrom
tmp_eks_cluster

Conversation

@brdji
Copy link
Copy Markdown
Contributor

@brdji brdji commented Jun 7, 2022

This review contains all the required changes for the new EKS cluster. Most of the changes relate to the network annotations, IP ranges, configuration options, etc.

I have left comments listing the reasons for some of the most important changes.

Closes #1499
Related change in the infra: testground/infra#78

@galargh
Copy link
Copy Markdown
Contributor

galargh commented Oct 11, 2022

https://circleci.com/gh/testground/testground/6563 is failing

@brdji
Copy link
Copy Markdown
Contributor Author

brdji commented Oct 11, 2022

https://circleci.com/gh/testground/testground/6563 is failing

It's passing now. I have to point out that I have re-run the pipeline 8 times before it worked - we should open an issue to investigate and fix these flaky tests.

@brdji
Copy link
Copy Markdown
Contributor Author

brdji commented Oct 14, 2022

I have restored the code that applies the desired sysctl changes to run pods. Since we already modify net.core.somaxconn for the plan nodegroups when we deploy the cluster, these changes will be automatically applied to any pods testground creates. We have enabled the unsafe sysctl controls, and therefore, the sysctl changes are tested "out of the box" by running any example test case.
We have ran the storm test benchmark to see if the unsafe sysctls or net.core.somaxconn changes cause any issues. We ran the test 20 times using 20 and 30 instances, and have seen no issues - all test cases passed 100% of the time.

I have also checked the code, and sysctls do not affect the desired CPU / RAM / etc. resources of pods in the composition files; only various kernel settings (you can get the full list of sysctls by typing sudo sysctl -a)

@galargh
Copy link
Copy Markdown
Contributor

galargh commented Oct 17, 2022

I have restored the code that applies the desired sysctl changes to run pods. Since we already modify net.core.somaxconn for the plan nodegroups when we deploy the cluster, these changes will be automatically applied to any pods testground creates. We have enabled the unsafe sysctl controls, and therefore, the sysctl changes are tested "out of the box" by running any example test case. We have ran the storm test benchmark to see if the unsafe sysctls or net.core.somaxconn changes cause any issues. We ran the test 20 times using 20 and 30 instances, and have seen no issues - all test cases passed 100% of the time.

I have also checked the code, and sysctls do not affect the desired CPU / RAM / etc. resources of pods in the composition files; only various kernel settings (you can get the full list of sysctls by typing sudo sysctl -a)

I don't quite understand how we're actually testing wether the sysctls are set as we please or if the defaults are used. Could you reword or elaborate on it a little bit more? Would it makes sense to create a test which adjusts sysctls in a way that makes the test fail?

@brdji
Copy link
Copy Markdown
Contributor Author

brdji commented Oct 17, 2022

I don't quite understand how we're actually testing wether the sysctls are set as we please or if the defaults are used. Could you reword or elaborate on it a little bit more? Would it makes sense to create a test which adjusts sysctls in a way that makes the test fail?

We modify the net.core.somaxconn for the nodes when we deploy the cluster: for the daemon, and the plan nodes. This means that the default value is overridden. However, without this piece of code:

var sysctls []v1.Sysctl
	for _, v := range cfg.Sysctls {
		sysctl := strings.Split(v, "=")

		sysctls = append(sysctls, v1.Sysctl{Name: sysctl[0], Value: sysctl[1]})
	}

The sysctl values will not be applied to plan pods, and they will use the default values. In other words, we have overridden them for the (parent) node(s), but not for the plan pods.
Because I have restored the piece of code above (to the cluster_k8s runner, seen here), all of the sysctl changes (from the env, from parent nodes, etc.) will actually be applied when we create a new run pod.

How do we know that these changes have taken effect? Simple, without the changes where we permit unsafe syctls when creating the EKS cluster, creating a run pod would result in the following error: PodSecurityPolicy: unable to admit pod: [pod.spec.securityContext.sysctls[0]: Forbidden: unsafe sysctl \"net.core.somaxconn\" is not allowed]. In other words, we attempt to change the default value of the net.core.somaxconn (which is inherited in its changed form from the node), and are forbidden from doing so. The changes which allow unsafe syctls allow us to successfully apply the new values of unsafe sysctls (and by extension, we know that safe sysctls will be applied too) to pods.

So when I said that the sysctl changes are tested "out of the box", that's what I meant - in case the sysctl changes are forbidden, the any test run will fail - in effect, the test is already integrated into the cluster deployment and code.

Would it makes sense to create a test which adjusts sysctls in a way that makes the test fail?

I don't think it's possible to write a test that would do something like that, simply because it requires us to create a whole cluster without allowing unsafe sysctls.

@galargh
Copy link
Copy Markdown
Contributor

galargh commented Oct 18, 2022

Because I have restored the piece of code above (to the cluster_k8s runner, seen here), all of the sysctl changes (from the env, from parent nodes, etc.) will actually be applied when we create a new run pod.

I think I'm almost there with you but I feel like I'm still missing one connection.

Isn't this -

for _, v := range cfg.Sysctls {
sysctl := strings.Split(v, "=")
sysctls = append(sysctls, v1.Sysctl{Name: sysctl[0], Value: sysctl[1]})
}
- only executed if sysctls is set in env.toml file? Something like here:
sysctls = [
"net.core.somaxconn=10000",
]

From what I understand, we now successfully set unsafe sysctls on the ng2-plan worker nodes. But I don't see how that relates to setting them on plan pods.

@brdji
Copy link
Copy Markdown
Contributor Author

brdji commented Oct 18, 2022

Isn't this -
.....

  • only executed if sysctls is set in env.toml file? Something like here:

You are correct, I got some information mixed up. Yes, if the env.toml , or the manifest (we do have trickle-down env from there, correct?) specifies the value, then it will be passed to the pods.

sysctls = [
"net.core.somaxconn=10000",
]

From what I understand, we now successfully set unsafe sysctls on the ng2-plan worker nodes. But I don't see how that relates to setting them on plan pods.

I was wrong - no inheritance actually occurs, however we do set the net.core.somaxconn value in the .env.toml file for the daemon here, and in the following lines. And that value is used by the pods (10000).

@galargh
Copy link
Copy Markdown
Contributor

galargh commented Oct 18, 2022

Cheers, makes sense now 👍

@laurentsenta
Copy link
Copy Markdown
Contributor

(re-sharing the list of tasks here: #1499, thanks for the progress being made)

Unused on k8s runner.
Copy link
Copy Markdown
Contributor

@laurentsenta laurentsenta left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for sharing and fixing the PR,
We can merge as is and close #1499, congrats @brdji and team !

The Follow-up task list is in #1500.

@laurentsenta laurentsenta changed the title Changes required for EKS cluster eks: changes required to support EKS cluster Nov 23, 2022
@laurentsenta laurentsenta merged commit 8629aa2 into master Nov 28, 2022
@laurentsenta laurentsenta deleted the tmp_eks_cluster branch November 28, 2022 10:03
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

EPIC: EKS support, v1, A minimal installation scripts & Testground support that can run simple tests (ping & storm)

6 participants