Skip to content

Mark volume as uncertain after Unmount* fails#100183

Merged
k8s-ci-robot merged 6 commits into
kubernetes:masterfrom
jsafrane:fix-unstage-retry
Jun 18, 2021
Merged

Mark volume as uncertain after Unmount* fails#100183
k8s-ci-robot merged 6 commits into
kubernetes:masterfrom
jsafrane:fix-unstage-retry

Conversation

@jsafrane

@jsafrane jsafrane commented Mar 12, 2021

Copy link
Copy Markdown
Member

What type of PR is this?

/kind bug

What this PR does / why we need it:

Mark volume device / volume mount as uncertain after UnmountDevice / UnmountVolume operation fails, because kubelet can't be sure what's the actual state of the volume. Kubelet will then call MountDevice / SetUp to make sure the volume is fully mounted when a new pod wants to use the same volume.

Similarly, update Unmap* calls for block volumes.

Which issue(s) this PR fixes:

Fixes: #100182

Special notes for your reviewer:

  • Added e2e test for NodeUnstage errors, which exercises the UnmountDevice error path.
  • Both transient and final errors result in the mount marked as uncertain. IMO it's not safe to assume that a mount is usable even after UnmountVolume / UnmountDevice final error. Better to call SetUp / MountDevice again when the mount is needed.

Does this PR introduce a user-facing change?

Fixed starting new pods after previous pod timed out unmounting its volumes.

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:


@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. kind/bug Categorizes issue or PR as related to a bug. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Mar 12, 2021
@k8s-ci-robot

Copy link
Copy Markdown
Contributor

@jsafrane: This issue is currently awaiting triage.

If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. area/test sig/storage Categorizes an issue or PR as relevant to SIG Storage. sig/testing Categorizes an issue or PR as relevant to SIG Testing. approved Indicates a PR has been approved by an approver from all required OWNERS files. and removed do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Mar 12, 2021
@jsafrane jsafrane force-pushed the fix-unstage-retry branch 2 times, most recently from 3da7652 to fe7d5bd Compare March 12, 2021 13:19
@yangjunmyfm192085

Copy link
Copy Markdown
Contributor

/test pull-kubernetes-unit

@yangjunmyfm192085

Copy link
Copy Markdown
Contributor

/test pull-kubernetes-e2e-kind

@jsafrane

Copy link
Copy Markdown
Member Author

I tried to add e2e tests for the block mode, but the mock driver does not support block volumes.

@jsafrane

Copy link
Copy Markdown
Member Author

Ready for review (I tried to add e2e tests for block mode, but the mock driver does not support block)

cc @gnufied @msau42 @jingxu97 PTAL

Comment thread test/e2e/storage/csi_mock_volume.go Outdated

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For InvalidArgument, do we need to mark as uncertain?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's an example of a "final" response, where we can be reasonably sure that e.g. provisioning is not in progress. Not sure if we can apply the same logic to Unstage.

@jingxu97

Copy link
Copy Markdown
Contributor

/approve

@jingxu97

Copy link
Copy Markdown
Contributor

PR lgtm. just want to make sure other reviewers have a chance to take a look.

@jingxu97

Copy link
Copy Markdown
Contributor

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Mar 25, 2021
@jingxu97 jingxu97 added priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. and removed needs-priority Indicates a PR lacks a `priority/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Mar 25, 2021
jsafrane added 6 commits June 16, 2021 18:39
When UnmountDevice fails, kubelet treat the volume mount as uncertain,
because it does not know at which stage UnmountDevice failed. It may be
already partially unmonted / destroyed.

As result, MountDevice will be performer when a new Pod is started on the
node after UnmountDevice faiure.
…ere unmounted

podVolumesExist() should consider also uncertain volumes (where kubelet
does not know if a volume was fully unmounted) when checking for pod's
volumes. Added GetPossiblyMountedVolumesForPod for that.

Adding uncertain mounts to GetMountedVolumesForPod would potentially break
other callers (e.g. `verifyVolumesMountedFunc`).
To know when a volume has been fully unmounted (incl. uncertain mounts).
Change existing desiredStateOfWorldPopulator.findAndAddNewPods tests to use
a common initialization function.
desiredStateOfWorldPopulator.findAndRemoveDeletedPods() should remove
volumes from DSW when a pod is deleted on the API server and the volume is
uncertain in ASW.
@jsafrane jsafrane force-pushed the fix-unstage-retry branch from fca9dc1 to d5da730 Compare June 16, 2021 16:42
@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jun 16, 2021
@gnufied

gnufied commented Jun 17, 2021

Copy link
Copy Markdown
Member

/retest
/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jun 17, 2021
@k8s-ci-robot

Copy link
Copy Markdown
Contributor

@jsafrane: The following test failed, say /retest to rerun all failed tests:

Test name Commit Details Rerun command
pull-kubernetes-e2e-gce-storage-slow d5da730 link /test pull-kubernetes-e2e-gce-storage-slow

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@mrunalp

mrunalp commented Jun 18, 2021

Copy link
Copy Markdown
Contributor

/approve

@k8s-ci-robot

Copy link
Copy Markdown
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: jingxu97, jsafrane, mrunalp

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jun 18, 2021
@k8s-ci-robot k8s-ci-robot merged commit 4afb72a into kubernetes:master Jun 18, 2021
@k8s-ci-robot k8s-ci-robot added this to the v1.22 milestone Jun 18, 2021
@timebertt

Copy link
Copy Markdown
Contributor

Any chance we can backport this change?
We observed #100182 multiple times on v1.20 clusters.

@ankitjain28may

Copy link
Copy Markdown

Any updates, can we get to know if this is getting backported to 1.20 since we are facing a lot of such issues causing production downtimes?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. area/kubelet area/test cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/bug Categorizes issue or PR as related to a bug. lgtm "Looks good to me", indicates that a PR is ready to be merged. priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. release-note Denotes a PR that will be considered when it comes time to generate release notes. sig/node Categorizes an issue or PR as relevant to SIG Node. sig/storage Categorizes an issue or PR as relevant to SIG Storage. sig/testing Categorizes an issue or PR as relevant to SIG Testing. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files.

Projects

Archived in project

Development

Successfully merging this pull request may close these issues.

NodeStage is not called after NodeUnstage error