Simplify test suite startup conditions and provide an opt-out #78500
johnSchnake wants to merge 1 commit into kubernetes:master
Conversation
|
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull-request has been approved by: johnSchnake. If they are not already assigned, you can assign the PR to them. The full list of commands accepted by this bot can be found here; the pull request process is described here. Needs approval from an approver in each of these files. |
Force-pushed 445b0dc to 8c8f550
|
/retest Also, I'd love to get this into the 1.15 release if possible. |
|
/retest Also, to clarify for those reviewing: this started after I followed up with @liggitt about #76654. That PR went in a slightly different direction which he wanted to back away from somewhat, and he suggested this approach instead. If this goes in, the intent is to close that other PR as far as I know (though there may have been a few other effects that PR needed to accomplish, which would be split off into a separate PR). |
Force-pushed 8c8f550 to 52274f3
|
For some reason gofmt wasn't running against my file on save, so I actually had some gofmt failures; all tests besides that have passed though. But with the repush I have to wait for all tests to go green again. |
When starting the test suite there are multiple checks which take place to ensure the cluster is ready to be tested. Some of these were introduced due to problems discovered when testing very large clusters and are not typically relevant for most users. However, the checks recently were modified to be too strict and introduced hardcoding to be permissive of a particular master node label. For numerous users, their master nodes were tainted in a different way but were still unschedulable, leading to the test suite timing out during this wait condition. This change does the following to remedy the situation:
- provides a flag for opting out of all 'wait' logic at the start of the suite
- removes any special-case handling of a particular master label
- loosens the restriction that all the nodes have to be untainted, and instead just confirms that none of the known-bad taints are present, such as when the network is unavailable or the node is not ready.
Force-pushed 52274f3 to 934ceb3
|
/retest |
|
flake in #78525 /retest |
// In large clusters we may get to this point but still have a bunch
// of nodes without Routes created. Since this would make a node
// unschedulable, we need to wait until all of them are schedulable.
framework.ExpectNoError(framework.WaitForAllNodesSchedulable(c, framework.TestContext.NodeSchedulableTimeout))
doesn't the skip have to go inside WaitForAllNodesSchedulable? there are lots of individual tests that call that
That is an interesting question. I was initially thinking to place it outside in order to limit the scope of this to just startup behavior. It could become blocking for someone if the tests themselves are doing that.
However, that makes the flag a bit more complicated. Do we really want to skip that logic in a test which is specifically calling the method? Even if the cluster started in a known good state (hence you may provide the flag to skip the startup waiting), if a test tears down a node and wants to wait for it to come back, you really do want to wait then, not skip it like you did at startup.
continue
}
- if !isNodeSchedulable(node) || !isNodeUntainted(node) {
+ if !isNodeSchedulable(node) || isNodeConditionallyTainted(node) {
isNodeConditionallyTainted is a confusing name. suggest something like nodeHasConditionDrivenNoScheduleTaints
Sure, rename is fine to me.
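Under the suggested rename, the helper might look like the sketch below. The taint keys are the upstream well-known condition taints; the exact set and effect filtering the PR checks are assumptions, and the stand-in `Taint` type replaces `k8s.io/api/core/v1.Taint` to keep the example self-contained.

```go
package main

import "fmt"

// Taint is a minimal stand-in for k8s.io/api/core/v1.Taint.
type Taint struct {
	Key    string
	Effect string // "NoSchedule", "NoExecute", ...
}

// conditionDrivenTaintKeys are taints the node lifecycle controller adds
// automatically when a node condition degrades (assumed subset).
var conditionDrivenTaintKeys = map[string]bool{
	"node.kubernetes.io/not-ready":           true,
	"node.kubernetes.io/unreachable":         true,
	"node.kubernetes.io/network-unavailable": true,
}

// nodeHasConditionDrivenNoScheduleTaints rejects only nodes whose taints were
// added because of a bad condition, so user-applied taints (e.g. a dedicated
// GPU pool) no longer fail the readiness check.
func nodeHasConditionDrivenNoScheduleTaints(taints []Taint) bool {
	for _, t := range taints {
		if conditionDrivenTaintKeys[t.Key] && t.Effect == "NoSchedule" {
			return true
		}
	}
	return false
}

func main() {
	gpuPool := []Taint{{Key: "dedicated", Effect: "NoSchedule"}}
	down := []Taint{{Key: "node.kubernetes.io/not-ready", Effect: "NoSchedule"}}
	fmt.Println(nodeHasConditionDrivenNoScheduleTaints(gpuPool)) // false: user taint is fine
	fmt.Println(nodeHasConditionDrivenNoScheduleTaints(down))    // true: known-bad taint
}
```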
flag.StringVar(&TestContext.EtcdUpgradeVersion, "etcd-upgrade-version", "", "The etcd binary version to upgrade to (e.g., '3.0.14', '2.3.7') if doing an etcd upgrade test.")
flag.StringVar(&TestContext.GCEUpgradeScript, "gce-upgrade-script", "", "Script to use to upgrade a GCE cluster.")
flag.BoolVar(&TestContext.CleanStart, "clean-start", false, "If true, purge all namespaces except default and system before running tests. This serves to Cleanup test namespaces from failed/interrupted e2e runs in a long-lived cluster.")
+ flag.BoolVar(&TestContext.SkipStartupWait, "start-now", false, "If true, disables multiple checks at the start of the test run ensuring nodes/pods/etc are in the proper state.")
this needs to be explicit about what checks and other flags it makes us ignore (--minStartupPods, --system-pods-startup-timeout, --system-daemonsets-startup-timeout, --node-schedulable-timeout, --allowed-not-ready-nodes?)
and the help for those other flags should be updated to add something like This flag is ignored if --start-now is set.
would a name like --skip-cluster-readiness-checks be clearer?
suggest aligning the testcontext var name with the flag (e.g. SkipClusterReadinessChecks)
👍 all sound like good ideas. I don’t need to be worried about being too verbose in these flag descriptions? If not, I’ll make those changes. I like the rename too, I struggled to find the right name.
Note though: I think that flag name makes sense if we are doing it at startup; how do you think that plays into the discussion above about whether or not that flag should be respected only at startup or every time the WaitFor... method is called?
|
/test pull-kubernetes-e2e-kops-aws |
|
/retest |
|
I need some additional feedback about a few of those points. I can iterate on this tonight at some point, but it's getting down to the wire and I'm not sure if it'll make the deadline. |
|
thanks for working on this @johnSchnake. In general the flag to opt out seems like a good addition. /test pull-kubernetes-e2e-kind |
|
/test pull-kubernetes-e2e-kops-aws |
|
@neolit123 I didn't get to see the first failure; it may have just been a flake. The second just appeared to be hanging due to the known prow instability https://github.com/kubernetes/test-infra/blob/c15dff889739590affa2197a864caf94f4afda6f/docs/post-mortems/2019-05-30.md and really had nothing to do with my code. After rerunning, it passed. |
|
Is this change going to allow me to be able to run conformance tests on a cluster that has multiple node pools, with and without taints? I'd like to be able to test on just some of the nodes. |
|
@karlkfi I think the answer is maybe? Let me see if I understand: prior to your tests you effectively cordon off a few nodes by tainting them and want to still run tests on the remaining nodes, is that right? If that is the case, then yes, I think this will help you, at least to get over the "waiting for nodes to be ready" blockade that currently exists. Your situation does make me wonder about the comment above (here); maybe you can weigh in there? It seems like if the "skip node startup logic" flag was given but a test called the `WaitFor...` method directly, the wait should probably still happen. |
|
@johnSchnake we use taints on whole node pools to make them require tolerations for scheduling. Reasons include GPUs, fast-cpu nodes, dedicated tenant resources, etc. Then we also have an untainted generic/default pool. I want to run the tests on our untainted generic pool. Or maybe alternatively be able to specify tolerations for the tests. |
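For readers unfamiliar with the mechanism described here: a pod only lands on a tainted node if it tolerates every taint on it, which is why untolerated test pods end up on the untainted default pool. The sketch below is a deliberately simplified model (equality matching only; the real scheduler also handles operators, wildcard keys, and per-effect rules), with stand-in types replacing the corev1 ones.

```go
package main

import "fmt"

// Minimal stand-ins for corev1.Taint and corev1.Toleration.
type Taint struct{ Key, Value, Effect string }
type Toleration struct{ Key, Value, Effect string }

// toleratesAll reports whether the tolerations cover every taint on a node.
// A pod with no tolerations is therefore repelled by any tainted pool.
func toleratesAll(taints []Taint, tols []Toleration) bool {
	for _, taint := range taints {
		covered := false
		for _, tol := range tols {
			if tol.Key == taint.Key && tol.Value == taint.Value && tol.Effect == taint.Effect {
				covered = true
				break
			}
		}
		if !covered {
			return false
		}
	}
	return true
}

func main() {
	gpu := []Taint{{Key: "pool", Value: "gpu", Effect: "NoSchedule"}}
	fmt.Println(toleratesAll(nil, nil)) // true: untainted node accepts anything
	fmt.Println(toleratesAll(gpu, nil)) // false: no toleration for the GPU pool
	fmt.Println(toleratesAll(gpu, []Toleration{{Key: "pool", Value: "gpu", Effect: "NoSchedule"}})) // true
}
```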
|
@johnSchnake: PR needs rebase. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
|
The most this (or similar fixes) will do is avoid this wait. If you think that (specifying tolerations for the tests) is a reasonable ask, then I'd create a separate ticket to discuss. This PR/ticket, however, will make the tests able to run and utilize the untainted nodes. |
|
As long as none of the tests require all nodes to be schedulable without taint, that should be fine. Does this mean it won't wait for tainted nodes at all? |
|
@karlkfi If you provide the flag to skip wait logic, no waiting at all will happen at the start of the test suite. TBD if this flag would impact just the wait logic at the start of the suite or each time a test calls for the system to wait. By default it will still wait to be "in a testable state", which means that nodes are schedulable and don't have "condition driven taints" (automatically added by the system when experiencing things like lack of network connectivity or memory issues). By schedulable, I just mean that its metadata doesn't have it set as unschedulable and that it shows as Ready. |
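The "schedulable" definition given here can be sketched as follows. The types are stand-ins for the corev1 node fields, and the actual e2e framework helper differs in detail; this only illustrates the two conditions named above (not marked unschedulable, and reporting Ready).

```go
package main

import "fmt"

// Minimal stand-ins for the corev1 node fields used below.
type NodeCondition struct {
	Type   string // e.g. "Ready"
	Status string // "True", "False" or "Unknown"
}
type Node struct {
	Unschedulable bool // Spec.Unschedulable, e.g. set by `kubectl cordon`
	Conditions    []NodeCondition
}

// isNodeSchedulable follows the definition above: the node's metadata does
// not mark it unschedulable, and it reports a Ready condition of "True".
func isNodeSchedulable(n Node) bool {
	if n.Unschedulable {
		return false
	}
	for _, c := range n.Conditions {
		if c.Type == "Ready" {
			return c.Status == "True"
		}
	}
	return false // no Ready condition at all: treat as not schedulable
}

func main() {
	ready := Node{Conditions: []NodeCondition{{Type: "Ready", Status: "True"}}}
	cordoned := Node{Unschedulable: true, Conditions: ready.Conditions}
	fmt.Println(isNodeSchedulable(ready), isNodeSchedulable(cordoned)) // true false
}
```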
|
/assign @timothysc @andrewsykim |
|
Brought this up during a testing-commons meeting; waiting for some feedback/direction since it seems there was some uncertainty as this has mutated slightly from the original conception of the fix. |
|
/hold The original ask somehow morphed by playing the telephone game. The original intent long ago was to have a whitelist of taints, and this change would have weird unintended consequences. |
|
@johnSchnake @liggitt @timothysc Hello! I'm the bug triage lead for the 1.16 release cycle and considering this PR is tagged for 1.16, I'd like to check its status. The code freeze is starting on August 29th EOD PST (about 5 days from now) and once the code freeze starts, only release-blocking PRs will be considered to be merged. Is this PR still planned for the 1.16 release? |
|
The underlying issue is moved to 1.17, so I'll move this PR too /milestone v1.17 |
|
Closing; usurped by #81043 |
What type of PR is this?
/kind bug
What this PR does / why we need it:
When starting the test suite there are multiple checks which take
place to ensure the cluster is ready to be tested. Some of these
were introduced due to problems discovered when testing very
large clusters and are not typically relevant for most users.
However, the checks recently were modified to be too strict and
introduced hardcoding to be permissive of a particular master
node label. For numerous users, their master nodes were tainted
in a different way but were still unschedulable, leading to the
test suite timing out during this wait condition.
This change does the following to remedy the situation:
- provides a flag for opting out of all 'wait' logic at the start of the suite
- removes any special-case handling of a particular master label
- loosens the restriction that all the nodes have to be untainted, and instead just confirms that none of the known-bad taints are present, such as when the network is unavailable or the node is not ready.
Which issue(s) this PR fixes:
Fixes #74282
Special notes for your reviewer:
Does this PR introduce a user-facing change?:
/sig testing
/area test
/area test-framework
/priority important-longterm