
Simplify test suite startup conditions and provide an opt-out#78500

Closed
johnSchnake wants to merge 1 commit into kubernetes:master from johnSchnake:simplifyTestWaitingForTaints

Conversation

@johnSchnake
Contributor

What type of PR is this?

/kind bug

What this PR does / why we need it:
When starting the test suite there are multiple checks which take
place to ensure the cluster is ready to be tested. Some of these
were introduced due to problems discovered when testing very
large clusters and are not typically relevant for most users.

However, the checks recently were modified to be too strict and
introduced hardcoding to be permissive of a particular master
node label. For numerous users, their master nodes were tainted
in a different way but were still unschedulable, leading to the
test suite timing out during this wait condition.

This change does the following to remedy the situation:

  • provides a flag for opting out of all 'wait' logic at the start
    of the suite
  • removes any special case handling of a particular master label
  • loosens the restriction that all the nodes have to be untainted
    and instead just confirms that none of the known-bad taints are
    present, such as the network being unavailable or the node not
    being ready.
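The loosened check from the last bullet can be sketched as follows. This is a simplified stand-in, not the framework code: the real check operates on `v1.Node` objects, and the type and function names here are illustrative. The taint keys are the standard condition-driven keys Kubernetes applies automatically.

```go
package main

import "fmt"

// Taint mirrors the fields of v1.Taint relevant here (simplified stand-in).
type Taint struct {
	Key    string
	Effect string
}

// knownBadTaints is an illustrative subset of the condition-driven taint
// keys that indicate a node is not in a testable state.
var knownBadTaints = map[string]bool{
	"node.kubernetes.io/not-ready":           true,
	"node.kubernetes.io/network-unavailable": true,
}

// hasConditionDrivenNoScheduleTaints reports whether a node carries one of
// the known-bad taints; any other taint (e.g. a custom master taint) no
// longer blocks the suite from starting.
func hasConditionDrivenNoScheduleTaints(taints []Taint) bool {
	for _, t := range taints {
		if knownBadTaints[t.Key] {
			return true
		}
	}
	return false
}

func main() {
	master := []Taint{{Key: "node-role.kubernetes.io/master", Effect: "NoSchedule"}}
	fmt.Println(hasConditionDrivenNoScheduleTaints(master)) // false: custom taints are tolerated
}
```

The key change is the direction of the test: instead of requiring the taint list to be empty, only membership in a small deny-list blocks startup.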

Which issue(s) this PR fixes:
Fixes #74282

Special notes for your reviewer:

Does this PR introduce a user-facing change?:

Added a new testing flag "--start-now" which will skip certain checks at the start of the test suite that wait for nodes and pods to enter known good states. This provides a way of sidestepping this logic and getting unblocked if it ever causes a problem on a cluster.

/sig testing
/area test
/area test-framework
/priority important-longterm

@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. kind/bug Categorizes issue or PR as related to a bug. sig/testing Categorizes an issue or PR as relevant to SIG Testing. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. area/test priority/important-longterm Important over the long term, but may not be staffed and/or may need multiple releases to complete. area/e2e-test-framework Issues or PRs related to refactoring the kubernetes e2e test framework labels May 29, 2019
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: johnSchnake
To fully approve this pull request, please assign additional approvers.
We suggest the following additional approver: krzyzacy

If they are not already assigned, you can assign the PR to them by writing /assign @krzyzacy in a comment when ready.

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details: Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@johnSchnake johnSchnake force-pushed the simplifyTestWaitingForTaints branch from 445b0dc to 8c8f550 Compare May 29, 2019 21:20
@johnSchnake
Contributor Author

/retest

Also, I'd love to get this into the 1.15 release if possible.

@johnSchnake
Contributor Author

/retest

Also, to clarify for those reviewing: this started after I followed up with @liggitt about #76654. That PR went in a slightly different direction which he wanted to back away from somewhat, and he suggested this approach instead. If this goes in, the intent is to close that other PR as far as I know (though there may have been a few other effects that PR needed to accomplish, which would be split off into a separate PR).

@johnSchnake johnSchnake force-pushed the simplifyTestWaitingForTaints branch from 8c8f550 to 52274f3 Compare May 30, 2019 13:30
@johnSchnake
Contributor Author

For some reason gofmt wasn't running against my file on save, so I actually had some gofmt failures; all tests besides that have passed, though. But with the repush I have to wait for all tests to go green again.

@johnSchnake johnSchnake force-pushed the simplifyTestWaitingForTaints branch from 52274f3 to 934ceb3 Compare May 30, 2019 13:44
@johnSchnake
Contributor Author

/retest

@liggitt
Member

liggitt commented May 30, 2019

flake in #78525

/retest

// In large clusters we may get to this point but still have a bunch
// of nodes without Routes created. Since this would make a node
// unschedulable, we need to wait until all of them are schedulable.
framework.ExpectNoError(framework.WaitForAllNodesSchedulable(c, framework.TestContext.NodeSchedulableTimeout))
Member


doesn't the skip have to go inside WaitForAllNodesSchedulable? there are lots of individual tests that call that

Contributor Author


That is an interesting question. I was initially thinking to place it outside in order to limit the scope of this to just startup behavior. It could become blocking for someone if the tests themselves are doing that.

However, that makes the flag a bit more complicated. Do we really want to skip that logic in a test which is specifically calling the method? Even if the cluster started in a known good state (hence you may provide the flag to skip the startup waiting), if a test tears down a node and wants to wait for it to come back, you really do want to wait then, not skip it like you did at startup.
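The trade-off described above can be sketched as follows (illustrative names only, not the framework code): if the skip check lives inside the wait helper, the flag short-circuits every caller, including a mid-test caller that genuinely needs to wait for a node to recover.

```go
package main

import (
	"errors"
	"fmt"
)

// skipStartupWait stands in for the proposed opt-out flag.
var skipStartupWait = true

// waitForAllNodesSchedulable stands in for the framework helper. With the
// skip check inside it, the flag silences every caller, not just startup.
func waitForAllNodesSchedulable(ready func() bool) error {
	if skipStartupWait {
		return nil // even mid-test callers skip the wait
	}
	if !ready() {
		return errors.New("nodes not schedulable")
	}
	return nil
}

func main() {
	// A test that tore down a node and genuinely wants to wait for it to
	// come back would get an immediate (possibly premature) success:
	err := waitForAllNodesSchedulable(func() bool { return false })
	fmt.Println(err)
}
```

Placing the check at the call site in the startup code instead keeps per-test waits intact, at the cost of the flag only covering startup behavior.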

continue
}
if !isNodeSchedulable(node) || !isNodeUntainted(node) {
if !isNodeSchedulable(node) || isNodeConditionallyTainted(node) {
Member

@liggitt liggitt May 30, 2019


isNodeConditionallyTainted is a confusing name. suggest something like nodeHasConditionDrivenNoScheduleTaints

Contributor Author


Sure, rename is fine to me.

flag.StringVar(&TestContext.EtcdUpgradeVersion, "etcd-upgrade-version", "", "The etcd binary version to upgrade to (e.g., '3.0.14', '2.3.7') if doing an etcd upgrade test.")
flag.StringVar(&TestContext.GCEUpgradeScript, "gce-upgrade-script", "", "Script to use to upgrade a GCE cluster.")
flag.BoolVar(&TestContext.CleanStart, "clean-start", false, "If true, purge all namespaces except default and system before running tests. This serves to Cleanup test namespaces from failed/interrupted e2e runs in a long-lived cluster.")
flag.BoolVar(&TestContext.SkipStartupWait, "start-now", false, "If true, disables multiple checks at the start of the test run ensuring nodes/pods/etc are in the proper state.")
Member


this needs to be explicit about what checks and other flags it makes us ignore (--minStartupPods, --system-pods-startup-timeout, --system-daemonsets-startup-timeout, --node-schedulable-timeout, --allowed-not-ready-nodes?)

Member


and the help for those other flags should be updated to add something like This flag is ignored if --start-now is set.

Member


would a name like --skip-cluster-readiness-checks be clearer?

Member


suggest aligning the testcontext var name with the flag (e.g. SkipClusterReadinessChecks)

Contributor Author


👍 all sound like good ideas. I don’t need to be worried about being too verbose in these flag descriptions? If not, I’ll make those changes. I like the rename too, I struggled to find the right name.

Note though: I think that flag name makes sense if we are doing it at startup; how do you think that plays into the discussion above about whether or not that flag should be respected only at startup or every time the WaitFor... method is called?

@liggitt
Member

liggitt commented May 30, 2019

/test pull-kubernetes-e2e-kops-aws
/test pull-kubernetes-e2e-kind

@liggitt
Member

liggitt commented May 30, 2019

/retest

@johnSchnake
Contributor Author

I need some additional feedback about a few of those points. I can iterate on this tonight at some point, but it's getting down to the wire and I'm not sure if it'll make the deadline.

@neolit123
Member

thanks for working on this @johnSchnake

In general the flag to opt out seems like a good addition, but the fact that pull-kubernetes-e2e-kops-aws is failing may indicate that the change is not backwards compatible with existing jobs.

/test pull-kubernetes-e2e-kind

@johnSchnake
Contributor Author

/test pull-kubernetes-e2e-kops-aws

@johnSchnake
Contributor Author

@neolit123 I didn't get to see the first failure; it may have just been a flake. The 2nd just appeared to be hanging due to the known prow instability https://github.com/kubernetes/test-infra/blob/c15dff889739590affa2197a864caf94f4afda6f/docs/post-mortems/2019-05-30.md and really had nothing to do with my code.

After rerunning it passed.

@karlkfi
Contributor

karlkfi commented Jun 10, 2019

Is this change going to allow me to be able to run conformance tests on a cluster that has multiple node pools, with and without taints? I'd like to be able to test on just some of the nodes.

@johnSchnake
Contributor Author

@karlkfi I think the answer is maybe?

Let me see if I understand: prior to your tests you effectively cordon off a few nodes by tainting them and want to still run tests on the remaining nodes, is that right?

If that is the case, then yes I think this will help you, at least get over the "waiting for nodes to be ready" blockade that currently exists.

Your situation does make me wonder about the comment above (here), maybe you can weigh in there? It seems like if the "skip node startup logic" flag was given, but a test called the WaitForAllNodesToBeSchedulable method, then you'd still be blocked. So it would work in your use case better to have that flag always respected, not just at startup?

@karlkfi
Contributor

karlkfi commented Jun 12, 2019

@johnSchnake we use taints on whole node pools to make them require tolerations for scheduling. Reasons include GPUs, fast-cpu nodes, dedicated tenant resources, etc. Then we also have an untainted generic/default pool.

I want to run the tests on our untainted generic pool. Or maybe alternatively be able to specify tolerations for the tests.

@k8s-ci-robot
Contributor

@johnSchnake: PR needs rebase.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jun 14, 2019
@johnSchnake
Contributor Author

The most this (or similar fixes) will do is prevent this Wait... logic from stalling because of the tainted nodes. However, the individual tests will not implicitly be able to tolerate a given taint; the tests create pods/etc on an ad hoc basis and there isn't a central way to define that all tests should tolerate a given taint.

If you think that is a reasonable ask, then I'd create a separate ticket to discuss it.

This PR/ticket however will make the tests able to run and utilize the untainted nodes.

@karlkfi
Contributor

karlkfi commented Jun 20, 2019

As long as none of the tests require all nodes to be schedulable without taints, that should be fine.

Does this mean it won’t wait for tainted nodes at all?

@timothysc timothysc added this to the v1.16 milestone Jun 28, 2019
@johnSchnake
Contributor Author

@karlkfi If you provide the flag to skip wait logic, no waiting at all will happen at the start of the test suite. TBD if this flag would impact just the wait logic at the start of the suite or each time a test calls for the system to wait.

By default it will still wait for the cluster to be "in a testable state", which means that nodes are schedulable and don't have "condition driven taints" (automatically added by the system when experiencing things like lack of network connectivity or memory issues).

By schedulable, I just mean that its metadata doesn't mark it as unschedulable, that it shows as Ready, and that it has network connectivity.
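The three conditions described in that comment can be sketched as a predicate. This is a simplified stand-in for the framework's node check, not the real implementation; the `Node` type and its field names here are illustrative, condensing what the real code reads from `v1.Node` spec and conditions:

```go
package main

import "fmt"

// Node is a simplified stand-in for v1.Node, carrying just the three
// properties described above.
type Node struct {
	Unschedulable      bool // .spec.unschedulable in the real object
	Ready              bool // the Ready node condition
	NetworkUnavailable bool // the NetworkUnavailable node condition
}

// isNodeSchedulable mirrors the description above: not marked
// unschedulable, reporting Ready, and with network connectivity.
func isNodeSchedulable(n Node) bool {
	return !n.Unschedulable && n.Ready && !n.NetworkUnavailable
}

func main() {
	fmt.Println(isNodeSchedulable(Node{Ready: true}))                      // true
	fmt.Println(isNodeSchedulable(Node{Ready: true, Unschedulable: true})) // false: cordoned
}
```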

@timothysc
Contributor

/assign @timothysc @andrewsykim

@johnSchnake
Contributor Author

Brought this up during a testing-commons meeting; waiting for some feedback/direction since it seems there was some uncertainty as this has mutated slightly from the original conception of the fix.

@timothysc
Contributor

/hold

The original ask somehow morphed by playing the telephone game. The original intent long ago was to have a whitelist of taints, and this change would have weird unintended consequences.

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Aug 5, 2019
@xmudrii
Member

xmudrii commented Aug 25, 2019

@johnSchnake @liggitt @timothysc Hello! I'm the bug triage lead for the 1.16 release cycle and considering this PR is tagged for 1.16, I'd like to check its status. The code freeze is starting on August 29th EOD PST (about 5 days from now) and once the code freeze starts, only release-blocking PRs will be considered to be merged. Is this PR still planned for the 1.16 release?

@xmudrii
Member

xmudrii commented Sep 5, 2019

The underlying issue was moved to 1.17, so I'll move this PR too.

/milestone v1.17

@k8s-ci-robot k8s-ci-robot modified the milestones: v1.16, v1.17 Sep 5, 2019
@johnSchnake
Contributor Author

Closing; usurped by #81043

@johnSchnake johnSchnake closed this Sep 5, 2019

Development

Successfully merging this pull request may close these issues.

Support custom taints during e2e tests

8 participants