Skip to content

WIP - Remove special-case handling of master label nodes in e2e#76654

Closed
liggitt wants to merge 1 commit intokubernetes:masterfrom
liggitt:conformance-label
Closed

WIP - Remove special-case handling of master label nodes in e2e#76654
liggitt wants to merge 1 commit intokubernetes:masterfrom
liggitt:conformance-label

Conversation

@liggitt
Copy link
Copy Markdown
Member

@liggitt liggitt commented Apr 16, 2019

What type of PR is this?
/kind cleanup

What this PR does / why we need it:
Allows test invokers to specify a label selector to indicate which nodes should be waited for, rather than hard-coding a label for one particular deployment (kops). This removes special treatment of the node-role.k8s.io/master label from conformance.

Which issue(s) this PR fixes:
xref #65618

Does this PR introduce a user-facing change?:

e2e/conformance: `--node-schedulable-selector` can be used to specify a label selector for which nodes must be scheduleable. Special handling of the `node-role.kubernetes.io/master` label is removed from the e2e suite.

/area conformance
/sig testing
/cc @smarterclayton

@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. kind/cleanup Categorizes issue or PR as related to cleaning up code, process, or technical debt. area/conformance Issues or PRs related to kubernetes conformance tests size/M Denotes a PR that changes 30-99 lines, ignoring generated files. sig/testing Categorizes an issue or PR as relevant to SIG Testing. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. area/e2e-test-framework Issues or PRs related to refactoring the kubernetes e2e test framework area/test sig/apps Categorizes an issue or PR as relevant to SIG Apps. sig/network Categorizes an issue or PR as relevant to SIG Network. labels Apr 16, 2019
@k8s-ci-robot
Copy link
Copy Markdown
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: liggitt

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Apr 16, 2019
@liggitt
Copy link
Copy Markdown
Member Author

liggitt commented Apr 16, 2019

cc @kubernetes/cncf-conformance-wg

@liggitt liggitt added the priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. label Apr 16, 2019
@k8s-ci-robot k8s-ci-robot removed the needs-priority Indicates a PR lacks a `priority/foo` label and requires one. label Apr 16, 2019
@liggitt liggitt force-pushed the conformance-label branch 4 times, most recently from fd49cd5 to f790c76 Compare April 16, 2019 17:14
@liggitt liggitt force-pushed the conformance-label branch from f790c76 to a7119c2 Compare April 16, 2019 17:22
OutputPrintType string
// NodeSchedulableTimeout is the timeout for waiting for all nodes to be schedulable.
// NodeSchedulableSelector is the label selector of the nodes we should wait to be scheduleable. If empty, wait for all nodes.
NodeSchedulableSelector string
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

@smarterclayton
Copy link
Copy Markdown
Contributor

This is consistent with our statement that

a) role labels are advisory
b) masters aren't special
c) conformance doesn't require masters to be labeled a certain way
d) conformance tests should work no matter what the topology of the cluster is

@liggitt
Copy link
Copy Markdown
Member Author

liggitt commented Apr 16, 2019

/retest

@liggitt
Copy link
Copy Markdown
Member Author

liggitt commented Apr 16, 2019

/test pull-kubernetes-e2e-kops-aws

@liggitt
Copy link
Copy Markdown
Member Author

liggitt commented Apr 16, 2019

/hold
for feedback from sig-testing

is there anything I need to do to plumb this option through kubetest or ginkgo.sh?

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Apr 16, 2019
@neolit123
Copy link
Copy Markdown
Member

/test pull-kubernetes-e2e-kind

@neolit123
Copy link
Copy Markdown
Member

is there anything I need to do to plumb this option through kubetest or ginkgo.sh?

kubetest has this flag:

  --test_args string                                   Space-separated list of arguments to pass to Ginkgo test runner.

so it can already pass the new --node-schedulable-selector.
for ginkgo-e2e.sh i don't think this is needed, but there might be jobs in test-infra that require it now.

cc @justinsb
https://github.com/kubernetes/kubernetes/pull/76654/files#diff-eb7b79470992813ea1905e96c298b47bL2752
https://prow.k8s.io/view/gcs/kubernetes-jenkins/pr-logs/pull/76654/pull-kubernetes-e2e-kops-aws/1118264577910378496/
does this mean that after this PR the kops jobs will always timeout?

@liggitt
Copy link
Copy Markdown
Member Author

liggitt commented Apr 16, 2019

does this mean that after this PR the kops jobs will always timeout?

Only until we adjust the kops job invocation to inform the test framework that it shouldn't wait for all nodes to be schedulable based on the way kops sets up the cluster

@neolit123
Copy link
Copy Markdown
Member

neolit123 commented Apr 17, 2019

the kind job SIGABRT-ed from ginkgo.
https://storage.googleapis.com/kubernetes-jenkins/pr-logs/pull/76654/pull-kubernetes-e2e-kind/1118292602886557696/build-log.txt

the backtrace does not include framework entries, but my guess here is a missing net/http timeout.

the kind(kubeadm) worker nodes are unlabeled, so if NodeSchedulableSelector is now mandatory it means we have to modify them with respect to the e2e testing.
@kubernetes/sig-cluster-lifecycle
cc @BenTheElder

/test pull-kubernetes-e2e-kind

@neolit123
Copy link
Copy Markdown
Member

/assign @timothysc
to have a look too.

@liggitt
Copy link
Copy Markdown
Member Author

liggitt commented Apr 17, 2019

the kind(kubeadm) nodes are unlabeled, so if NodeSchedulableSelector is now mandatory it means we have to modify them with respect to the e2e testing.

It is not mandatory. Environments with no special node labeling don't need to do anything. The existing e2e presubmits passed with no changes, for example. Only clusters that assign special labels to nodes they don't want to wait for need to inform the e2e of that.

@BenTheElder
Copy link
Copy Markdown
Member

they'd also been passing fine.. https://testgrid.k8s.io/conformance-kind#kind,%20master%20(dev)&width=20

that backtrace is not terribly informative so far :|

@k8s-ci-robot
Copy link
Copy Markdown
Contributor

@liggitt: The following tests failed, say /retest to rerun them all:

Test name Commit Details Rerun command
pull-kubernetes-e2e-kops-aws a7119c2 link /test pull-kubernetes-e2e-kops-aws
pull-kubernetes-e2e-kind a7119c2 link /test pull-kubernetes-e2e-kind

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@neolit123
Copy link
Copy Markdown
Member

passing --node-schedulable-selector=\!node-role.kubernetes.io/master works.

but it seems to me the current logic is incorrect:
https://github.com/kubernetes/kubernetes/pull/76654/files#diff-eb7b79470992813ea1905e96c298b47bR2759

waiting for a node that is unschedulable for the duration of the job timeout (e.g. 30+minutes) doesn't feel right.

Apr 17 03:49:40.680: INFO: Waiting up to 30m0s for all nodes to be schedulable
Apr 17 03:49:40.704: INFO: Unschedulable nodes:
Apr 17 03:49:40.704: INFO: -> kind-control-plane Ready=true Network=true Taints=[{node-role.kubernetes.io/master  NoSchedule <nil>}]
Apr 17 03:49:40.704: INFO: ================================
Apr 17 03:50:10.717: INFO: Unschedulable nodes:
Apr 17 03:50:10.717: INFO: -> kind-control-plane Ready=true Network=true Taints=[{node-role.kubernetes.io/master  NoSchedule <nil>}]
Apr 17 03:50:10.717: INFO: ================================
....<loops>

@smarterclayton
Copy link
Copy Markdown
Contributor

the selector for all nodes is pretty easy (1=1

@BenTheElder
Copy link
Copy Markdown
Member

BenTheElder commented Apr 17, 2019

We discussed in sig-testing slack, TLDR tentative plan is to improve taint handling a bit in the wait-for-node-readiness logic.

We could plumb through another flag for kops/kind, but e.g. sonobuoy should work without that for clusters with less than surprising configuration.

Ideally we can ignore nodes with the node-role taint, or as @liggitt suggested only wait for certain known taints to be removed. When waiting for taints was originally added only certain taints were referenced ('not-ready and network-unavailable').

Another option would be removing the waiting for node readiness entirely and leaving that up to the tester / cluster operator.

@BenTheElder
Copy link
Copy Markdown
Member

cc @mborsz -- any other particular taints we should be waiting on?

@liggitt
Copy link
Copy Markdown
Member Author

liggitt commented Apr 17, 2019

Another option would be removing the waiting for node readiness entirely and leaving that up to the tester / cluster operator.

That would be my preference long-term. "Get your cluster ready, then run the test" makes so much more sense to me

@liggitt
Copy link
Copy Markdown
Member Author

liggitt commented Apr 17, 2019

The two mentioned in the PR that added the label checking were NetworkUnavailable and NotReady

@liggitt
Copy link
Copy Markdown
Member Author

liggitt commented Apr 17, 2019

waiting for a node that is unschedulable for the duration of the job timeout (e.g. 30+minutes) doesn't feel right.

that comes from the pre-existing --node-schedulable-timeout option and default of 30*time.Minute

@johnSchnake
Copy link
Copy Markdown
Contributor

Maybe a follow up issue but directly related to your questions about plumbing it through ginkgo.sh and such:

https://github.com/kubernetes/kubernetes/tree/master/cluster/images/conformance should be updated so that the flag can be set when using the conformance image. This way anyone pursuing CNCF certification via Sonobuoy can supply the values.

@timothysc
Copy link
Copy Markdown
Contributor

So TL;DR from my reading of this issue, is plumbing of a whitelist of taints is a reasonable behavior, but other taints should be respected and possibly dropped from the overall node count. If this is the case, then I think that's a good move forwards.

@liggitt liggitt changed the title Remove special-case handling of master label nodes in e2e WIP - Remove special-case handling of master label nodes in e2e May 3, 2019
@k8s-ci-robot k8s-ci-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label May 3, 2019
@k8s-ci-robot
Copy link
Copy Markdown
Contributor

@liggitt: PR needs rebase.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label May 4, 2019
@johnSchnake
Copy link
Copy Markdown
Contributor

Status check on this?

@liggitt
Copy link
Copy Markdown
Member Author

liggitt commented May 31, 2019

Closing in favor of #78500

@liggitt liggitt closed this May 31, 2019
@liggitt liggitt deleted the conformance-label branch November 25, 2019 18:00
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. area/conformance Issues or PRs related to kubernetes conformance tests area/e2e-test-framework Issues or PRs related to refactoring the kubernetes e2e test framework area/test cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. kind/cleanup Categorizes issue or PR as related to cleaning up code, process, or technical debt. needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. release-note Denotes a PR that will be considered when it comes time to generate release notes. sig/apps Categorizes an issue or PR as relevant to SIG Apps. sig/network Categorizes an issue or PR as relevant to SIG Network. sig/testing Categorizes an issue or PR as relevant to SIG Testing. size/M Denotes a PR that changes 30-99 lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

8 participants