
Simplify test suite startup conditions and provide an opt-out#78500

Closed
johnSchnake wants to merge 1 commit into kubernetes:master from johnSchnake:simplifyTestWaitingForTaints

Conversation

@johnSchnake
Contributor

What type of PR is this?

/kind bug

What this PR does / why we need it:
When starting the test suite there are multiple checks which take
place to ensure the cluster is ready to be tested. Some of these
were introduced due to problems discovered when testing very
large clusters and are not typically relevant for most users.

However, the checks recently were modified to be too strict and
introduced hardcoding to be permissive of a particular master
node label. For numerous users, their master nodes were tainted
in a different way but were still unschedulable, leading to the
test suite timing out during this wait condition.

This change does the following to remedy the situation:

  • provides a flag for opting out of all 'wait' logic at the start
    of the suite
  • removes any special case handling of a particular master label
  • loosens the restriction that all the nodes have to be untainted
    and instead just confirms that none of the known-bad taints are
    present, such as the network being unavailable or the node not
    being ready.
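The loosened check from the last bullet can be sketched as follows. This is a simplified stand-in, not the framework code: the real check operates on `v1.Node` objects, and the type and function names here are illustrative. The taint keys are the standard condition-driven keys Kubernetes applies automatically.

```go
package main

import "fmt"

// Taint mirrors the fields of v1.Taint relevant here (simplified stand-in).
type Taint struct {
	Key    string
	Effect string
}

// knownBadTaints is an illustrative subset of the condition-driven taint
// keys that indicate a node is not in a testable state.
var knownBadTaints = map[string]bool{
	"node.kubernetes.io/not-ready":           true,
	"node.kubernetes.io/network-unavailable": true,
}

// hasConditionDrivenNoScheduleTaints reports whether a node carries one of
// the known-bad taints; any other taint (e.g. a custom master taint) no
// longer blocks the suite from starting.
func hasConditionDrivenNoScheduleTaints(taints []Taint) bool {
	for _, t := range taints {
		if knownBadTaints[t.Key] {
			return true
		}
	}
	return false
}

func main() {
	master := []Taint{{Key: "node-role.kubernetes.io/master", Effect: "NoSchedule"}}
	fmt.Println(hasConditionDrivenNoScheduleTaints(master)) // false: custom taints are tolerated
}
```

The key change is the direction of the test: instead of requiring the taint list to be empty, only membership in a small deny-list blocks startup.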

Which issue(s) this PR fixes:
Fixes #74282

Special notes for your reviewer:

Does this PR introduce a user-facing change?:

Added a new testing flag "--start-now" which will skip certain checks at the start of the test suite that wait for nodes and pods to enter known good states. This provides a way of sidestepping this logic and getting unblocked if it ever causes a problem on a cluster.

/sig testing
/area test
/area test-framework
/priority important-longterm

@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. kind/bug Categorizes issue or PR as related to a bug. sig/testing Categorizes an issue or PR as relevant to SIG Testing. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. area/test priority/important-longterm Important over the long term, but may not be staffed and/or may need multiple releases to complete. area/e2e-test-framework Issues or PRs related to refactoring the kubernetes e2e test framework labels May 29, 2019
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: johnSchnake
To fully approve this pull request, please assign additional approvers.
We suggest the following additional approver: krzyzacy

If they are not already assigned, you can assign the PR to them by writing /assign @krzyzacy in a comment when ready.

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details: Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@johnSchnake johnSchnake force-pushed the simplifyTestWaitingForTaints branch from 445b0dc to 8c8f550 Compare May 29, 2019 21:20
@johnSchnake
Contributor Author

/retest

Also, I'd love to get this into the 1.15 release if possible.

@johnSchnake
Contributor Author

/retest

Also, to clarify for those reviewing: this started after I followed up with @liggitt about #76654. That PR went in a slightly different direction which he wanted to back away from somewhat, and he suggested this approach instead. If this goes in, the intent is to close that other PR as far as I know (though there may have been a few other effects that PR needed to accomplish, which would be split off into a separate PR).

@johnSchnake johnSchnake force-pushed the simplifyTestWaitingForTaints branch from 8c8f550 to 52274f3 Compare May 30, 2019 13:30
@johnSchnake
Contributor Author

For some reason gofmt wasn't running against my file on save, so I actually had some gofmt failures; all tests besides that have passed, though. But with the repush I have to wait for all tests to go green again.

@johnSchnake johnSchnake force-pushed the simplifyTestWaitingForTaints branch from 52274f3 to 934ceb3 Compare May 30, 2019 13:44
@johnSchnake
Contributor Author

/retest

@liggitt
Member

liggitt commented May 30, 2019

flake in #78525

/retest

// In large clusters we may get to this point but still have a bunch
// of nodes without Routes created. Since this would make a node
// unschedulable, we need to wait until all of them are schedulable.
framework.ExpectNoError(framework.WaitForAllNodesSchedulable(c, framework.TestContext.NodeSchedulableTimeout))
Member


doesn't the skip have to go inside WaitForAllNodesSchedulable? there are lots of individual tests that call that

Contributor Author


That is an interesting question. I was initially thinking to place it outside in order to limit the scope of this to just startup behavior. It could become blocking for someone if the tests themselves are doing that.

However, that makes the flag a bit more complicated. Do we really want to skip that logic in a test which is specifically calling the method? Even if the cluster started in a known good state (hence you may provide the flag to skip the startup waiting), if a test tears down a node and wants to wait for it to come back, you really do want to wait then, not skip it like you did at startup.
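The trade-off described above can be sketched as follows (illustrative names only, not the framework code): if the skip check lives inside the wait helper, the flag short-circuits every caller, including a mid-test caller that genuinely needs to wait for a node to recover.

```go
package main

import (
	"errors"
	"fmt"
)

// skipStartupWait stands in for the proposed opt-out flag.
var skipStartupWait = true

// waitForAllNodesSchedulable stands in for the framework helper. With the
// skip check inside it, the flag silences every caller, not just startup.
func waitForAllNodesSchedulable(ready func() bool) error {
	if skipStartupWait {
		return nil // even mid-test callers skip the wait
	}
	if !ready() {
		return errors.New("nodes not schedulable")
	}
	return nil
}

func main() {
	// A test that tore down a node and genuinely wants to wait for it to
	// come back would get an immediate (possibly premature) success:
	err := waitForAllNodesSchedulable(func() bool { return false })
	fmt.Println(err)
}
```

Placing the check at the call site in the startup code instead keeps per-test waits intact, at the cost of the flag only covering startup behavior.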

continue
}
if !isNodeSchedulable(node) || !isNodeUntainted(node) {
if !isNodeSchedulable(node) || isNodeConditionallyTainted(node) {
Member

@liggitt liggitt May 30, 2019


isNodeConditionallyTainted is a confusing name. suggest something like nodeHasConditionDrivenNoScheduleTaints

Contributor Author


Sure, rename is fine to me.

flag.StringVar(&TestContext.EtcdUpgradeVersion, "etcd-upgrade-version", "", "The etcd binary version to upgrade to (e.g., '3.0.14', '2.3.7') if doing an etcd upgrade test.")
flag.StringVar(&TestContext.GCEUpgradeScript, "gce-upgrade-script", "", "Script to use to upgrade a GCE cluster.")
flag.BoolVar(&TestContext.CleanStart, "clean-start", false, "If true, purge all namespaces except default and system before running tests. This serves to Cleanup test namespaces from failed/interrupted e2e runs in a long-lived cluster.")
flag.BoolVar(&TestContext.SkipStartupWait, "start-now", false, "If true, disables multiple checks at the start of the test run ensuring nodes/pods/etc are in the proper state.")
Member


this needs to be explicit about what checks and other flags it makes us ignore (--minStartupPods, --system-pods-startup-timeout, --system-daemonsets-startup-timeout, --node-schedulable-timeout, --allowed-not-ready-nodes?)

Member


and the help for those other flags should be updated to add something like This flag is ignored if --start-now is set.

Member


would a name like --skip-cluster-readiness-checks be clearer?

Member


suggest aligning the testcontext var name with the flag (e.g. SkipClusterReadinessChecks)

Contributor Author


👍 all sound like good ideas. I don’t need to be worried about being too verbose in these flag descriptions? If not, I’ll make those changes. I like the rename too, I struggled to find the right name.

Note though: I think that flag name makes sense if we are doing it at startup; how do you think that plays into the discussion above about whether or not that flag should be respected only at startup or every time the WaitFor... method is called?

@liggitt
Member

liggitt commented May 30, 2019

/test pull-kubernetes-e2e-kops-aws
/test pull-kubernetes-e2e-kind

@liggitt
Member

liggitt commented May 30, 2019

/retest

@johnSchnake
Contributor Author

I need some additional feedback about a few of those points. I can iterate on this tonight at some point, but it's getting down to the wire and I'm not sure if it'll make the deadline.

@neolit123
Member

thanks for working on this @johnSchnake

In general the flag to opt out seems like a good addition, but the fact that pull-kubernetes-e2e-kops-aws is failing may indicate that the change is not backwards compatible with existing jobs.

/test pull-kubernetes-e2e-kind

@johnSchnake
Contributor Author

/test pull-kubernetes-e2e-kops-aws

@johnSchnake
Contributor Author

@neolit123 I didn't get to see the first failure; it may have just been a flake. The 2nd just appeared to be hanging due to the known prow instability https://github.com/kubernetes/test-infra/blob/c15dff889739590affa2197a864caf94f4afda6f/docs/post-mortems/2019-05-30.md and really had nothing to do with my code.

After rerunning it passed.

@karlkfi
Contributor

karlkfi commented Jun 10, 2019

Is this change going to allow me to be able to run conformance tests on a cluster that has multiple node pools, with and without taints? I'd like to be able to test on just some of the nodes.

@johnSchnake
Contributor Author

@karlkfi I think the answer is maybe?

Let me see if I understand: prior to your tests you effectively cordon off a few nodes by tainting them and want to still run tests on the remaining nodes, is that right?

If that is the case, then yes I think this will help you, at least get over the "waiting for nodes to be ready" blockade that currently exists.

Your situation does make me wonder about the comment above (here), maybe you can weigh in there? It seems like if the "skip node startup logic" flag was given, but a test called the WaitForAllNodesToBeSchedulable method, then you'd still be blocked. So it would work in your use case better to have that flag always respected, not just at startup?

@karlkfi
Contributor

karlkfi commented Jun 12, 2019

@johnSchnake we use taints on whole node pools to make them require tolerations for scheduling. Reasons include GPUs, fast-cpu nodes, dedicated tenant resources, etc. Then we also have an untainted generic/default pool.

I want to run the tests on our untainted generic pool. Or maybe alternatively be able to specify tolerations for the tests.

@k8s-ci-robot
Contributor

@johnSchnake: PR needs rebase.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jun 14, 2019
@johnSchnake
Contributor Author

The most this (or similar fixes) will do is prevent this Wait... logic from stalling because of the tainted nodes. However, the individual tests will not implicitly be able to tolerate a given taint; the tests create pods/etc on an ad hoc basis and there isn't a central way to define that all tests should tolerate a given taint.

If you think that is a reasonable ask, then I'd create a separate ticket to discuss it.

This PR/ticket however will make the tests able to run and utilize the untainted nodes.

@karlkfi
Contributor

karlkfi commented Jun 20, 2019

As long as none of the tests require all nodes to be schedulable without taints, that should be fine.

Does this mean it won’t wait for tainted nodes at all?

@timothysc timothysc added this to the v1.16 milestone Jun 28, 2019
@johnSchnake
Contributor Author

@karlkfi If you provide the flag to skip wait logic, no waiting at all will happen at the start of the test suite. TBD if this flag would impact just the wait logic at the start of the suite or each time a test calls for the system to wait.

By default it will still wait for the cluster to be "in a testable state", which means that nodes are schedulable and don't have "condition driven taints" (automatically added by the system when experiencing things like lack of network connectivity or memory issues).

By schedulable, I just mean that its metadata doesn't mark it as unschedulable, that it shows as Ready, and that it has network connectivity.
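The three conditions described in that comment can be sketched as a predicate. This is a simplified stand-in for the framework's node check, not the real implementation; the `Node` type and its field names here are illustrative, condensing what the real code reads from `v1.Node` spec and conditions:

```go
package main

import "fmt"

// Node is a simplified stand-in for v1.Node, carrying just the three
// properties described above.
type Node struct {
	Unschedulable      bool // .spec.unschedulable in the real object
	Ready              bool // the Ready node condition
	NetworkUnavailable bool // the NetworkUnavailable node condition
}

// isNodeSchedulable mirrors the description above: not marked
// unschedulable, reporting Ready, and with network connectivity.
func isNodeSchedulable(n Node) bool {
	return !n.Unschedulable && n.Ready && !n.NetworkUnavailable
}

func main() {
	fmt.Println(isNodeSchedulable(Node{Ready: true}))                      // true
	fmt.Println(isNodeSchedulable(Node{Ready: true, Unschedulable: true})) // false: cordoned
}
```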

@timothysc
Contributor

/assign @timothysc @andrewsykim

@johnSchnake
Contributor Author

Brought this up during a testing-commons meeting; waiting for some feedback/direction since it seems there was some uncertainty as this has mutated slightly from the original conception of the fix.

@timothysc
Contributor

/hold

The original ask somehow morphed by playing the telephone game. The original intent long ago was to have a whitelist of taints, and this change would have weird unintended consequences.

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Aug 5, 2019
@xmudrii
Member

xmudrii commented Aug 25, 2019

@johnSchnake @liggitt @timothysc Hello! I'm the bug triage lead for the 1.16 release cycle and considering this PR is tagged for 1.16, I'd like to check its status. The code freeze is starting on August 29th EOD PST (about 5 days from now) and once the code freeze starts, only release-blocking PRs will be considered to be merged. Is this PR still planned for the 1.16 release?

@xmudrii
Member

xmudrii commented Sep 5, 2019

The underlying issue was moved to 1.17, so I'll move this PR too.

/milestone v1.17

@k8s-ci-robot k8s-ci-robot modified the milestones: v1.16, v1.17 Sep 5, 2019
@johnSchnake
Contributor Author

Closing; usurped by #81043

@johnSchnake johnSchnake closed this Sep 5, 2019

Development

Successfully merging this pull request may close these issues.

Support custom taints during e2e tests

8 participants