kubernetes-csi: full test matrix for hostpath#11981
Conversation
87938f8 to
7bd128b
Compare
7bd128b to
b78b59e
Compare
We recently clarified that newer sidecars must work with older Kubernetes and vice versa. Here we add all currently supported permutations and enable testing on a single configuration for those repos where it makes no sense to test with different Kubernetes versions.
b78b59e to
1f2e565
Compare
| csi-release-tools | ||
| external-attacher | ||
| external-provisioner | ||
| external-snapshotter |
There was a problem hiding this comment.
We should leave snapshotter out of version skew testing. It is still alpha and does not offer any compatibility guarantees.
There was a problem hiding this comment.
Okay. It's now tested only once, against the Kubernetes version specified in the external-snapshotter repo itself.
| for deployment in 1.13 1.14; do # must have a deploy/kubernetes-<version> dir in csi-driver-host-path | ||
| for kubernetes in 1.13.3 1.14.0; do # these versions must have pre-built kind images (see https://hub.docker.com/r/kindest/node/tags) | ||
| cat >>"$base/$repo/$repo-config.yaml" <<EOF | ||
| - name: pull-sig-storage-$repo-$(echo "$deployment" | tr . -)-on-kubernetes-$(echo "$kubernetes" | tr . - | sed 's/\([0-9]*\)-\([0-9]*\)-\([0-9]*\)/\1-\2/') |
There was a problem hiding this comment.
Can the kubernetes version parsing be in a util function so we only need to update it in one place in the future?
|
|
||
| presubmits: | ||
| kubernetes-csi/csi-driver-host-path: | ||
| - name: pull-sig-storage-csi-driver-host-path-1-13-on-kubernetes-1-13 |
There was a problem hiding this comment.
I don't think we need all the skews in presubmit jobs. (however, CI Jobs should have all the skews).
I think for presubmits, it is sufficient to just have:
canary hostpath driver on K8s master
canary hostpath driver on K8s master - 1
There was a problem hiding this comment.
Is that meant to save resources in the Prow test cluster?
We trade less resource usage against the risk that some breaking change slips through.
There was a problem hiding this comment.
When you are submitting a PR, you are only modifying the latest version of a sidecar/driver. Your PR changes should not impact already tagged and released deployments.
There was a problem hiding this comment.
Testing against master is problematic, because then pre-submit jobs become non-deterministic: some change in K8S master can break the component and there's nothing can be done about it in the PR itself.
I've reduced the matrix to just two jobs: the latest stable release and the release before that.
There was a problem hiding this comment.
I'm still confused why we need to run 1.13 and 1.14 versions of the driver in pull jobs? It should just be testing the latest build of the driver and not previous releases.
I think these are the only jobs we need (for pull jobs):
- driver-master-kubernetes-1.14
- driver-master-kubernetes-1.13
There was a problem hiding this comment.
I'm still confused why we need to run 1.13 and 1.14 versions of the driver in pull jobs?
This is still about pull-sig-storage-csi-driver-host-path-1-13-on-kubernetes-1-13, right?
It should just be testing the latest build of the driver and not previous releases.
I think that's were the confusion comes from. That job does use the driver that was built as part of the PR job. If this needs to be more obvious, then I can add CSI_PROW_BUILD_JOB: true explicitly with a comment. It's left out at the moment because that's the default.
"1-13-on-kubernetes-1-13" just means that it then runs tests using the deploy/kubernetes-1.13 deployment on Kubernetes 1.13 with the driver image replaced.
The deployment is relevant because it determines which sidecars to test with. The job then tests that the updated driver is still compatible with that older deployment, which is something that we are promising and thus need to test. I think this is a good candidate for a pre-submit job because it keeps everything else at a stable version.
I've been thinking a bit more about the situation where a new sidecar or driver feature depends on a new API in Kubernetes master. Right now, such changes can be submitted in a PR, but pre-submit tests will not cover them. Once merged, the periodic job with on-kubernetes-master will do that. If we want to enable pre-submit testing of such features, then we could do it like this:
- create a pr job which does not override the Kubernetes version
- bump the CSI_PROW_KUBERNETES_VERSION in the component to some known-good revision of Kubernetes, potentially also switch to a newer version of the E2E tests
- submit the PR
I'm just not sure how important this is. My suggestion is to not add it at this time and wait whether there is a need.
There was a problem hiding this comment.
The job then tests that the updated driver is still compatible with that older deployment, which is something that we are promising and thus need to test.
Hostpath driver compatibility is not something we're promising. We are defining a versioning policy for sidecars, not for the hostpath driver. That's why I think once we release a hostpath driver, we should not need to modify older deployments.
| resources: | ||
| requests: | ||
| cpu: 2000m | ||
| - name: pull-sig-storage-csi-driver-host-path-deployment-1-13-on-kubernetes-1-13 |
There was a problem hiding this comment.
what is the difference between this and pull-sig-storage-csi-driver-host-path-1-13-on-kubernetes-1-13 above?
There was a problem hiding this comment.
The difference is explained in
test-infra/config/jobs/kubernetes-csi/gen-jobs.sh
Lines 143 to 148 in 1f2e565
One tests with updated images, the other the images specified in the deployment.
There was a problem hiding this comment.
Why do the jobs that don't modify the images need to run in the pull job?
There was a problem hiding this comment.
I'm still unclear why do we need to run jobs that don't modify images in pull jobs?
There was a problem hiding this comment.
That is for PRs that modify just the deployment directory. Suppose I submit a PR with image: quay.io/k8scsi/hostpathplugin:v1.1.0 in deploy/kubernetes-1.13 (something we talked about yesterday) and that image doesn't exist yet. If merged, the deployment in master gets broken. This is bad because we point people to the master branch as the source for the latest stable deployments. We have to do that a) because a tagged release of csi-driver-host-path cannot contain a .yaml file with the image for that tagged release (chicken-and-egg problem) b) it's easier to have one link that remains constant (link to master vs. link to latest release).
One could argue that it's okay to break master temporarily, but according to Murphy, this time window will be when someone tries out the deployment and notices the regression. I'd prefer to keep this as pre-submit. We can limit the number of times that it actually runs by adding a path filter. That should also clarify the purpose. How about that?
There was a problem hiding this comment.
I would prefer having the tagged version actually reference the correct deployment images. If someone is automatically deploying the driver in some production environment, they should be depending on a tagged version, not master.
There was a problem hiding this comment.
I would prefer having the tagged version actually reference the correct deployment images.
Then how do you propose to deal with the chicken-and-egg problem? Allow breaking master temporarily?
If someone is automatically deploying the driver in some production environment, they should be depending on a tagged version, not master.
This is the hostpath driver. It's never going to be deployed in production.
There was a problem hiding this comment.
In that case we can remove these jobs again.
There was a problem hiding this comment.
I've removed them the PR.
|
|
||
| presubmits: | ||
| kubernetes-csi/external-attacher: | ||
| - name: pull-sig-storage-external-attacher-1-13-on-kubernetes-1-13 |
There was a problem hiding this comment.
For sidecar presubmits, I think we only need the following jobs:
- canary on K8s master
- canary on K8s master - 1
Review feedback pointed out that pre-submits do not need to cover all cases. External-snapshotter is alpha and should be excluded from skew testing.
|
@msau42 updated, please have another look |
| resources: | ||
| requests: | ||
| cpu: 2000m | ||
| - name: pull-sig-storage-csi-driver-host-path-deployment-1-13-on-kubernetes-1-13 |
There was a problem hiding this comment.
I'm still unclear why do we need to run jobs that don't modify images in pull jobs?
| resources: | ||
| requests: | ||
| cpu: 2000m | ||
| - name: pull-sig-storage-csi-driver-host-path-1-13-on-kubernetes-master |
There was a problem hiding this comment.
You mentioned that we shouldn't be running pull jobs against K8s master?
There was a problem hiding this comment.
The goal was to have the job in Prow, but never enable it by default (in contrast to the other jobs, which will get enabled by default eventually):
# Experimental job, explicitly needs to be started with /test.
# This cannot be enabled by default because there's always the risk
# that something changes in master which breaks the pre-merge check.
Then if we see a PR which is worthwhile testing on master right away, we can explicitly request a job run with /test.
This is not quite the same as testing against a version of Kubernetes revision specified inside a PR (see my other comment), because here we always use the latest Kubernetes without updating the E2E tests.
|
|
||
| presubmits: | ||
| kubernetes-csi/csi-driver-host-path: | ||
| - name: pull-sig-storage-csi-driver-host-path-1-13-on-kubernetes-1-13 |
There was a problem hiding this comment.
I'm still confused why we need to run 1.13 and 1.14 versions of the driver in pull jobs? It should just be testing the latest build of the driver and not previous releases.
I think these are the only jobs we need (for pull jobs):
- driver-master-kubernetes-1.14
- driver-master-kubernetes-1.13
| requests: | ||
| cpu: 2000m | ||
| - interval: 6h | ||
| name: ci-kubernetes-csi-1-13-on-kubernetes-master |
There was a problem hiding this comment.
To reduce our testing matrix, I think we should only test "-1, current, +1" version skew with K8s.
So we would have the following jobs:
-
driver-1.13-on-kubernetes-1.13
-
driver-1.13-on-kubernetes-1.14
-
driver-1.14-on-kubernetes-1.13
-
driver-1.14-on-kubernetes-1.14
-
driver-1.14-on-kubernetes-master
-
driver-master-on-kubernetes-1.14
-
driver-master-on-kubernetes-master
There was a problem hiding this comment.
But we still promise that "1.13 on Kubernetes 1.15" works, right? How can we promise that if we don't test it?
There was a problem hiding this comment.
Ok, I can see the point about 1.13 on k8s 1.15.
But thinking about the newer configs, I think we are misinterpreting the version skew support. Let's say 1.14 adds some new features that were not present in 1.13 and the driver wants to use them. Then the 1.14 deployment enables those new features and cannot be backwards compatible because we only offer backwards compatibility if you don't use the new features. So the deployment should never be used on an older K8s, and the jobs look like:
-
driver-1.13-on-k8s-1.13
-
driver-1.13-on-k8s-1.14
-
driver-1.13-on-k8s-master
-
driver-1.14-on-k8s-1.14
-
driver-1.14-on-k8s-master
-
driver-master-on-k8s-master
There was a problem hiding this comment.
Let's say 1.14 adds some new features that were not present in 1.13 and the driver wants to use them. Then the 1.14 deployment enables those new features and cannot be backwards compatible because we only offer backwards compatibility if you don't use the new features.
Correct, if 1.14 enables and depends on new features, then it cannot run on older Kubernetes. But we still may want to test that the new sidecars with the new features disabled work on Kubernetes 1.13. It'll depend on the feature whether that can work and is supported: if it doesn't, we announce a breaking change and release a new major version.
So for the upcoming sidecars we theoretically have two deployments: with topology enabled and disabled. The one where it's off can run on 1.13, and this is something that we have to support and test - at least that was my understanding of the compatibility document, and the new versions (1.1 instead of 2.0) seems to confirm that.
There was a problem hiding this comment.
I think having different deployments with different feature sets in the same release is going to be too complex to maintain, and have too many permutations to test.
Let's just start with just like-version testing for now and iron out the details for skew testing later.
There was a problem hiding this comment.
I think testing a newer deployment, which might have enabled new features against older Kubernetes is not correct, and is not something we support in our versioning policy.
There was a problem hiding this comment.
(a newer sidecar using an older deployment is something we do support)
There was a problem hiding this comment.
If you want test newer sidecars on older deployments, you could still use the 1.13 deployment on K8s 1.13 job to do that.
Testing "canary" images with older deployments is something that can be done because it is unambiguous what those images are. It gets more tricky when we have multiple release trains: "stable images" becomes ambiguous in that case. Which ones do we test on 1.13, and in which combination? There has to be a definition of that somewhere. Currently the example deployments are our definition of which sidecars are meant to be used together for a certain situation.
So that would leave us with only 6 jobs in CI:
Hostpath with stable images * K8s 1.13, 1.14, master
Hostpath with canary images * K8s 1.13, 1.14, master
Which "stable images"?
What if we don't have a new hostpath deployment for every release, and only add a new deployment when something in the hostpath deployment actually changes?
What is this "something"? Does adding a new release train count? Because that is what changes now: we support external-provisioner 1.0.x for Kubernetes 1.13 and we add 1.1.x for Kubernetes 1.14. Nothing else changes in the deployment.
If a new release fully replaces the previous one, we can just bump the image version in the existing deployment. But this is not the case right now.
I think testing a newer deployment, which might have enabled new features against older Kubernetes is not correct, and is not something we support in our versioning policy.
I agree, and this is not what I was suggesting. I was talking about a new deployment that does not enable new features.
Let's take the tentative "kubernetes-1.14" deployment as example. Assume that we do the 1.1 releases and then update that deployment with those new image versions. We don't enable any new features. Do you want this deployment to be in the csi-driver-host-path repo? What kind of testing do you want to do with that deployment?
cc @jsafrane
There was a problem hiding this comment.
In order to make some progress with this PR and perhaps get it merged, I've taken out the ci-kubernetes-csi-1-14-on-* jobs.
@msau42 is that now something that could be merged?
There was a problem hiding this comment.
In order to test older, but still supported release trains, I think we will need to branch every release and have a job off of the release branch that tests the supported versions at the time of release.
There's a lot of detail and logistics that I think we still need to work out, so I am fine with starting with a smaller test matrix until we can document and agree on a plan.
|
|
||
| presubmits: | ||
| kubernetes-csi/csi-driver-host-path: | ||
| - name: pull-sig-storage-csi-driver-host-path-1-13-on-kubernetes-1-13 |
It turned out that the case that these jobs were meant to catch ("does
the deployment work as specified without changing images") is too
strict. The plan is to allow bumping the hostpath image version before
actually publishing that image, to ensure that the tagged release has
the right image version. This will break the deployment temporarily on
the branch.
There was no consensus on how to test the upcoming deployment for 1.14 (see kubernetes#11981 (comment)). It also has not been finalized yet (uses canary images). Therefore the periodic jobs using it get removed. The periodic jobs also unintentionally override the deployment. They should use the default deployment for the Kubernetes version they test against. The "deployment" branch was merged, so we can use the upstream csi-driver-host-path again.
|
/lgtm |
|
LGTM label has been added. DetailsGit tree hash: d142c36a3f466f909e177306e73bde7d948eb219 |
|
/assign @spiffxp |
b8adeb6 to
3f857e8
Compare
There was no consensus on how to test the upcoming deployment for 1.14 (see kubernetes#11981 (comment)). It also has not been finalized yet (uses canary images). Therefore the periodic jobs using it get removed. The periodic jobs also unintentionally override the deployment. They should use the default deployment for the Kubernetes version they test against. Canary jobs use just the new images, without changing the RBAC rules used by the deployments. The "deployment" branch was merged, so we can use the upstream csi-driver-host-path again.
|
@msau42 I made one last fix for the canary periodic jobs: They weren't actually using canary images and if they had, they would have used updated RBAC rules, which is not what we want to test here. Please add LGTM again. |
|
/lgtm |
|
LGTM label has been added. DetailsGit tree hash: 96a7ce78ba097ad5e97ae4e4ca42a2580ea3fe71 |
|
/assign @michelle192837 |
| # -- Yaml representation for configuring testgrid.k8s.io | ||
| # | ||
| # To add a new test: | ||
| # 1. Append a new testgroup under test_groups, specify the name and where to get the log. |
There was a problem hiding this comment.
Was this supposed to be deleted?
There was a problem hiding this comment.
Oops, no. Fixed in the revised commit.
There was no consensus on how to test the upcoming deployment for 1.14 (see kubernetes#11981 (comment)). It also has not been finalized yet (uses canary images). Therefore the periodic jobs using it get removed. The periodic jobs also unintentionally override the deployment. They should use the default deployment for the Kubernetes version they test against. Canary jobs use just the new images, without changing the RBAC rules used by the deployments. The "deployment" branch was merged, so we can use the upstream csi-driver-host-path again.
3f857e8 to
4001776
Compare
|
/approve |
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: michelle192837, msau42, pohly The full list of commands accepted by this bot can be found here. The pull request process is described here DetailsNeeds approval from an approver in each of these files:
Approvers can indicate their approval by writing |
|
/lgtm |
|
LGTM label has been added. DetailsGit tree hash: f6161201f8527d923f1ad87668bb61052edfc9f1 |
|
@pohly: Updated the
DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
There was no consensus on how to test the upcoming deployment for 1.14 (see kubernetes/test-infra#11981 (comment)). It also has not been finalized yet (uses canary images). Therefore the periodic jobs using it get removed. The periodic jobs also unintentionally override the deployment. They should use the default deployment for the Kubernetes version they test against. Canary jobs use just the new images, without changing the RBAC rules used by the deployments. The "deployment" branch was merged, so we can use the upstream csi-driver-host-path again.
We recently clarified that newer sidecars must work with older
Kubernetes and vice versa. Here we add all currently supported
permutations and enable testing on a single configuration for those
repos where it makes no sense to test with different Kubernetes
versions.
/cc @msau42