Add NodeStage error tests #89041

Merged
k8s-ci-robot merged 3 commits into kubernetes:master from jsafrane:stage-error-tests on Apr 7, 2020

Conversation

@jsafrane

@jsafrane jsafrane commented Mar 11, 2020

Member

What this PR does / why we need it:
Add some tests for NodeStage error handling. The main purpose is to test that:

  • NodeUnstage is called after a transient NodeStage error when the corresponding pod is deleted.
  • NodeUnstage is not called after a final NodeStage error when the corresponding pod is deleted.
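
The two expectations in the list above can be sketched as a small check over the CSI calls observed after the pod is deleted. This is only an illustration, not the code from this PR: `stageErrorKind`, `checkUnstage`, and the way calls are represented as plain method names are hypothetical.

```go
package main

import "fmt"

// stageErrorKind is a hypothetical classification of the error that the
// mock driver was scripted to return from NodeStageVolume.
type stageErrorKind int

const (
	transientError stageErrorKind = iota // the operation may still be in progress
	finalError                           // the operation definitely failed
)

// checkUnstage returns an error if the observed CSI calls contradict the
// expectation for the given kind of NodeStage error. calls holds CSI method
// names in the order in which the driver logged them.
func checkUnstage(kind stageErrorKind, calls []string) error {
	sawUnstage := false
	for _, c := range calls {
		if c == "NodeUnstageVolume" {
			sawUnstage = true
			break
		}
	}
	switch {
	case kind == transientError && !sawUnstage:
		return fmt.Errorf("expected NodeUnstageVolume after a transient NodeStageVolume error, got %v", calls)
	case kind == finalError && sawUnstage:
		return fmt.Errorf("unexpected NodeUnstageVolume after a final NodeStageVolume error, got %v", calls)
	}
	return nil
}

func main() {
	calls := []string{"NodeStageVolume", "NodeUnstageVolume"}
	fmt.Println(checkUnstage(transientError, calls))
	fmt.Println(checkUnstage(finalError, calls))
}
```

The check deliberately ignores any other calls in the list, so it stays valid if kubelet retries NodeStageVolume several times before the pod is deleted.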

The test (and the whole CSI mock output handling) becomes quite complex.

This is just an exercise in how to use the JavaScript hooks. If we decide this is useful, it can be extended to also test NodePublish (and block mode of both).

@gnufied @tsmetana @pohly @msau42 @jingxu97, is it useful? We already have unit tests for this behavior.

Does this PR introduce a user-facing change?:

NONE

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:


/kind cleanup
/sig storage

@k8s-ci-robot k8s-ci-robot added labels Mar 11, 2020:
  • release-note-none: Denotes a PR that doesn't merit a release note.
  • do-not-merge/work-in-progress: Indicates that a PR should not merge because it is a work in progress.
  • kind/cleanup: Categorizes issue or PR as related to cleaning up code, process, or technical debt.
  • size/L: Denotes a PR that changes 100-499 lines, ignoring generated files.
  • sig/storage: Categorizes an issue or PR as relevant to SIG Storage.
  • cncf-cla: yes: Indicates the PR's author has signed the CNCF CLA.
  • needs-priority: Indicates a PR lacks a `priority/foo` label and requires one.
@k8s-ci-robot k8s-ci-robot added labels Mar 11, 2020:
  • area/test
  • sig/testing: Categorizes an issue or PR as relevant to SIG Testing.
@jsafrane jsafrane force-pushed the stage-error-tests branch 3 times, most recently from 869f30a to 6e4686d Compare March 11, 2020 14:39
@jsafrane

Member Author

The newly introduced tests are flaky; see https://prow.k8s.io/view/gcs/kubernetes-jenkins/pr-logs/pull/89041/pull-kubernetes-e2e-gce-storage-slow/1237750118531207170/:

test/e2e/storage/csi_mock_volume.go:664
Mar 11 15:26:33.230: while waiting for initial CSI calls
Unexpected error:
    <*errors.errorString | 0xc001da7a10>: {
        s: "could not load CSI driver logs: the server rejected our request for an unknown reason (get pods csi-mockplugin-0)",
    }
    could not load CSI driver logs: the server rejected our request for an unknown reason (get pods csi-mockplugin-0)
occurred
test/e2e/storage/csi_mock_volume.go:688

/hold
/retest

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Mar 12, 2020
@jsafrane

Member Author

/test pull-kubernetes-e2e-gce-storage-slow

@jsafrane

Member Author

/test pull-kubernetes-e2e-gce-storage-slow

@jsafrane

Member Author

Reworked to use the merged version of the JavaScript hooks.

The only remaining WIP item is the bump of the csi-mock driver version to canary. We should do a proper release before merging this PR.

@k8s-ci-robot

Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: jsafrane

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Mar 20, 2020
@jsafrane

Member Author

/test pull-kubernetes-e2e-gce-storage-slow

@jsafrane

Member Author

/test pull-kubernetes-e2e-gce-storage-slow
/retest

@jsafrane

Member Author

/test pull-kubernetes-e2e-gce-storage-slow

@jsafrane

Member Author

/test pull-kubernetes-e2e-gce-storage-slow

@jsafrane jsafrane force-pushed the stage-error-tests branch from 5b6ea42 to 0cc6363 Compare April 1, 2020 09:37
@jsafrane jsafrane changed the title WIP: Add NodeStage error tests Add NodeStage error tests Apr 1, 2020
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Apr 1, 2020
@jsafrane

jsafrane commented Apr 1, 2020

Member Author

/hold cancel
Rebased onto mock v3.1.0 with the JavaScript hooks.

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Apr 1, 2020
@pohly

pohly commented Apr 1, 2020

Contributor

Is the new test flaky? There's a "could not load CSI driver logs: the server rejected our request for an unknown reason (get pods csi-mockplugin-0)" in https://prow.k8s.io/view/gcs/kubernetes-jenkins/pr-logs/pull/89041/pull-kubernetes-e2e-gce-storage-slow/1245284273221537792/.

/retest

Comment thread test/e2e/storage/csi_mock_volume.go Outdated

Contributor

This failed because "the server rejected our request for an unknown reason (get pods csi-mockplugin-0)" (https://prow.k8s.io/view/gcs/kubernetes-jenkins/pr-logs/pull/89041/pull-kubernetes-e2e-gce-storage-slow/1245284273221537792/).

Perhaps a retry loop would help?

Contributor

There is already a retry loop in the caller (via wait.Poll). Perhaps failures to retrieve log output should simply be treated here as "no output", i.e. return nil, nil?

Member Author

Maybe the pod is not Running yet. The new mock test calls createPod() and reads the pod logs immediately afterwards. I added a wait for the PVC to get bound in between, since the driver must be fully operational to provision a PV.

Member Author

Sorry about the failed build noise. Stupid typo...

Member Author

I don't get it. The mock driver provisioned a PV, yet the test still gets an error from the API server when reading its logs:

could not load CSI driver logs: the server rejected our request for an unknown reason (get pods csi-mockplugin-0)

Contributor

> There is already a retry loop in the caller (via wait.Poll). Perhaps failures to retrieve log output should simply be treated here as "no output", i.e. return nil, nil?

I've implemented that idea in #88114 (3cf9ab1) and got all tests to pass. The log does indeed show that "the server rejected our request for an unknown reason" occurred, but tests succeeded after ignoring that error (https://storage.googleapis.com/kubernetes-jenkins/pr-logs/pull/88114/pull-kubernetes-e2e-kind/1245706940810530817/build-log.txt).

Member Author

I added your last commit from #88114, hoping it helps.

I can't reproduce "the server rejected our request for an unknown reason" on any of my clusters.

Contributor

> I can't reproduce "the server rejected our request for an unknown reason" on any of my clusters.

Me neither. I think I saw it once while working on some other test, but as far as I remember, that turned out to be because I was asking for logs after the pod had just been deleted. Perhaps here we have the inverse: asking for logs from a very recently started pod? Just wondering.

@gnufied

gnufied commented Apr 1, 2020

Member

/assign

@jsafrane jsafrane force-pushed the stage-error-tests branch 3 times, most recently from 366b04d to 12fdc04 Compare April 1, 2020 14:44
@pohly pohly mentioned this pull request Apr 2, 2020
jsafrane and others added 3 commits April 6, 2020 15:03
Especially related to "uncertain" global mounts. A large refactoring of the CSI
mock tests was necessary:
- to be able to script the driver to return errors as required by the test
- to parse the CSI driver logs to check that kubelet made the right CSI calls
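
As a rough illustration of the log-parsing half of that refactoring, the sketch below extracts call records from driver output and checks that the expected methods appear in order. The `gRPCCall: ` prefix, the JSON field names, and the helpers `parseCalls` and `matchesInOrder` are assumptions made for this example, not a description of the mock driver's real log format or of the test code.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// callPrefix and the JSON field names below are assumptions about the mock
// driver's log format, made for illustration only.
const callPrefix = "gRPCCall: "

type csiCall struct {
	Method string `json:"Method"`
	Error  string `json:"Error"`
}

// parseCalls extracts logged CSI calls from raw driver log output, skipping
// lines that are not call records or whose payload is not valid JSON.
func parseCalls(log string) []csiCall {
	var calls []csiCall
	for _, line := range strings.Split(log, "\n") {
		if !strings.HasPrefix(line, callPrefix) {
			continue
		}
		var c csiCall
		if err := json.Unmarshal([]byte(strings.TrimPrefix(line, callPrefix)), &c); err != nil {
			continue
		}
		calls = append(calls, c)
	}
	return calls
}

// methodName strips an optional "/service/" path from a full gRPC method name.
func methodName(full string) string {
	if i := strings.LastIndex(full, "/"); i >= 0 {
		return full[i+1:]
	}
	return full
}

// matchesInOrder reports whether the expected method names occur in calls in
// the given order; unrelated calls in between are ignored.
func matchesInOrder(calls []csiCall, expected []string) bool {
	i := 0
	for _, c := range calls {
		if i < len(expected) && methodName(c.Method) == expected[i] {
			i++
		}
	}
	return i == len(expected)
}

func main() {
	// Made-up sample output, not taken from a real test run.
	sample := "driver started\n" +
		callPrefix + `{"Method":"/csi.v1.Node/NodeStageVolume","Error":"transient"}` + "\n" +
		callPrefix + `{"Method":"/csi.v1.Node/NodeUnstageVolume"}`
	calls := parseCalls(sample)
	fmt.Println(matchesInOrder(calls, []string{"NodeStageVolume", "NodeUnstageVolume"}))
}
```

Ignoring unrelated calls between the expected ones keeps the check robust against retries and other calls kubelet makes along the way.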
As seen in some test
runs (https://prow.k8s.io/view/gcs/kubernetes-jenkins/pr-logs/pull/89041),
retrieving output can fail with "the server rejected our request for
an unknown reason (get pods csi-mockplugin-0)".

If this is truly an intermittent error, then the existing retry logic in
the callers can deal with this.
@jsafrane jsafrane force-pushed the stage-error-tests branch from 12fdc04 to 981aae3 Compare April 6, 2020 13:07
@pohly

pohly commented Apr 6, 2020

Contributor

/retest

@jsafrane

jsafrane commented Apr 7, 2020

Member Author

/retest

@k8s-ci-robot

Contributor

@jsafrane: The following test failed, say /retest to rerun all failed tests:

Test name | Commit | Details | Rerun command
pull-kubernetes-node-e2e-containerd | 981aae3 | link | /test pull-kubernetes-node-e2e-containerd

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@pohly pohly left a comment

Contributor

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Apr 7, 2020
@k8s-ci-robot k8s-ci-robot merged commit 15bb54c into kubernetes:master Apr 7, 2020
@k8s-ci-robot k8s-ci-robot added this to the v1.19 milestone Apr 7, 2020

Labels

  • approved: Indicates a PR has been approved by an approver from all required OWNERS files.
  • area/test
  • cncf-cla: yes: Indicates the PR's author has signed the CNCF CLA.
  • kind/cleanup: Categorizes issue or PR as related to cleaning up code, process, or technical debt.
  • lgtm: "Looks good to me", indicates that a PR is ready to be merged.
  • needs-priority: Indicates a PR lacks a `priority/foo` label and requires one.
  • release-note-none: Denotes a PR that doesn't merit a release note.
  • sig/storage: Categorizes an issue or PR as relevant to SIG Storage.
  • sig/testing: Categorizes an issue or PR as relevant to SIG Testing.
  • size/L: Denotes a PR that changes 100-499 lines, ignoring generated files.

4 participants