Skip to content

util: restart process when gRPC call is stuck for 10+ minutes#6286

Merged
mergify[bot] merged 1 commit into
ceph:develfrom
Rakshith-R:restart-slow-grpc
May 21, 2026
Merged

util: restart process when gRPC call is stuck for 10+ minutes#6286
mergify[bot] merged 1 commit into
ceph:develfrom
Rakshith-R:restart-slow-grpc

Conversation

@Rakshith-R

@Rakshith-R Rakshith-R commented May 19, 2026

Copy link
Copy Markdown
Contributor

Add a --feature-gates CLI flag that accepts comma-separated key=bool
pairs (e.g., SlowGRPCRestart=false). Unknown keys are rejected at
startup.

Add killOnSlowGRPC unary interceptor that calls os.Exit(1) if any
gRPC handler takes longer than 10 minutes; the kubelet restarts the
container in-place. The interceptor is conditionally wired into the
middleware chain based on the SlowGRPCRestart feature gate (enabled
by default).

Methods matching prefixes in slowGRPCSkipPrefixes are excluded. The
initial list contains only ReclaimSpace calls (/reclaimspace.), which
can legitimately take a long time on large volumes.

Assisted-by: Claude noreply@anthropic.com


Show available bot commands

These commands are normally not required, but in case of issues, leave any of
the following bot commands in an otherwise empty comment in this PR:

  • /retest ci/centos/<job-name>: retest the <job-name> after unrelated
    failure (please report the failure too!)

@Rakshith-R Rakshith-R force-pushed the restart-slow-grpc branch from de4b498 to 54b8fc4 Compare May 19, 2026 06:14
Comment thread internal/csi-common/utils.go Outdated
Comment thread internal/csi-common/utils.go Outdated
Comment thread internal/csi-common/utils.go Outdated
Comment thread internal/csi-common/utils.go Outdated
Comment thread internal/csi-common/utils.go Outdated
black-dragon74
black-dragon74 previously approved these changes May 19, 2026
@Rakshith-R Rakshith-R force-pushed the restart-slow-grpc branch from 44a3c29 to e6d6aa6 Compare May 19, 2026 06:33
@mergify mergify Bot dismissed black-dragon74’s stale review May 19, 2026 06:33

Pull request has been modified.

@Rakshith-R Rakshith-R force-pushed the restart-slow-grpc branch from e6d6aa6 to e5de3b5 Compare May 19, 2026 06:39
@Rakshith-R Rakshith-R requested a review from black-dragon74 May 19, 2026 06:43
black-dragon74
black-dragon74 previously approved these changes May 19, 2026

@Madhu-1 Madhu-1 left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

changes looks good, we need to put this functionality behind a flag (enable is by default but provide option to disable it)

Comment thread internal/csi-common/utils.go Outdated
// reclaimSpaceServicePrefix is the gRPC service prefix for ReclaimSpace
// operations, which are excluded from the stuck-call restart because they can
// legitimately run for a long time on large volumes.
const reclaimSpaceServicePrefix = "/reclaimspace."

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we should make it generic so that we can expand it, add a helper function to skip known services

@Rakshith-R

Copy link
Copy Markdown
Contributor Author

/test ci/centos/mini-e2e/k8s-1.34

@mergify mergify Bot dismissed black-dragon74’s stale review May 19, 2026 09:48

Pull request has been modified.

@Rakshith-R Rakshith-R force-pushed the restart-slow-grpc branch 3 times, most recently from f900042 to 2348974 Compare May 19, 2026 10:16
@Rakshith-R

Copy link
Copy Markdown
Contributor Author

changes looks good, we need to put this functionality behind a flag (enable is by default but provide option to disable it)

we should make it generic so that we can expand it, add a helper function to skip known services

Done, added env var DISABLE_CSI_SLOW_GRPC_RESTART flag to disable and a skip services list that we can append at later point of time.

PTAL

@black-dragon74

Copy link
Copy Markdown
Member

We usually use flags/launch args to enable/disable features.

@Rakshith-R

Rakshith-R commented May 19, 2026

Copy link
Copy Markdown
Contributor Author

We usually use flags/launch args to enable/disable features.

We recently decided against it cause difficulty in upgrades/backward compatibility/API problems to pass settings and moving towards env var.

black-dragon74
black-dragon74 previously approved these changes May 19, 2026
@Rakshith-R Rakshith-R requested a review from a team May 19, 2026 12:45
@Madhu-1

Madhu-1 commented May 19, 2026

Copy link
Copy Markdown
Collaborator

We usually use flags/launch args to enable/disable features.

We recently decided against it cause difficulty in upgrades/backward compatibility/API problems to pass settings and moving towards env var.

where this was decided, i would vote for a feature-gate CLI command which accepts the feature1=true,feature2=false like this where we can provide option for user to enable/disable any features instead of the env variable

@Rakshith-R Rakshith-R force-pushed the restart-slow-grpc branch from 2348974 to d17207e Compare May 20, 2026 10:51
@mergify mergify Bot dismissed black-dragon74’s stale review May 20, 2026 10:52

Pull request has been modified.

@Rakshith-R Rakshith-R force-pushed the restart-slow-grpc branch from d17207e to e00035e Compare May 20, 2026 10:56
@Rakshith-R

Copy link
Copy Markdown
Contributor Author

/test ci/centos/mini-e2e/k8s-1.35

@ceph-csi-bot

Copy link
Copy Markdown
Collaborator

/test ci/centos/mini-e2e/k8s-1.35

@ceph-csi-bot

Copy link
Copy Markdown
Collaborator

/test ci/centos/mini-e2e/k8s-1.33

@ceph-csi-bot ceph-csi-bot added ci/in-progress/e2e This label acts like a guard and prevents Mergify from adding the `ok-to-test` label again. and removed ok-to-test Label to trigger E2E tests labels May 21, 2026
@mergify mergify Bot removed the ci/in-progress/e2e This label acts like a guard and prevents Mergify from adding the `ok-to-test` label again. label May 21, 2026
@mergify

mergify Bot commented May 21, 2026

Copy link
Copy Markdown
Contributor

Deprecation notice: This pull request comes from a fork and was queued with update_method=rebase and update_bot_account impersonation. This capability will be removed on July 1, 2026. After this date, the merge queue will no longer be able to rebase fork pull requests with this configuration. To avoid disruption, switch to update_method=merge in your queue rule.

@mergify mergify Bot added the queued label May 21, 2026
@mergify

mergify Bot commented May 21, 2026

Copy link
Copy Markdown
Contributor

Merge Queue Status

  • Entered queue2026-05-21 07:40 UTC · Rule: default
  • Checks passed · in-place
  • Merged2026-05-21 10:45 UTC · at b78ada0aaa5a56122d1db8fa34d80095a17442ae · rebase

This pull request spent 3 hours 5 minutes 13 seconds in the queue, including 3 hours 4 minutes 31 seconds running CI.

Required conditions to merge

Add a --feature-gates CLI flag that accepts comma-separated key=bool
pairs (e.g., SlowGRPCRestart=false). Unknown keys are rejected at
startup.

Add killOnSlowGRPC unary interceptor that calls os.Exit(1) if any
gRPC handler takes longer than 10 minutes; the kubelet restarts the
container in-place. The interceptor is conditionally wired into the
middleware chain based on the SlowGRPCRestart feature gate (enabled
by default).

Methods matching prefixes in slowGRPCSkipPrefixes are excluded. The
initial list contains only ReclaimSpace calls (/reclaimspace.), which
can legitimately take a long time on large volumes.

Assisted-by: Claude <noreply@anthropic.com>
Signed-off-by: Rakshith R <rar@redhat.com>
@mergify mergify Bot added the ok-to-test Label to trigger E2E tests label May 21, 2026
@ceph-csi-bot

Copy link
Copy Markdown
Collaborator

/test ci/centos/upgrade-tests-cephfs

@ceph-csi-bot

Copy link
Copy Markdown
Collaborator

/test ci/centos/k8s-e2e-external-storage/1.33

@ceph-csi-bot

Copy link
Copy Markdown
Collaborator

/test ci/centos/upgrade-tests-rbd

@ceph-csi-bot

Copy link
Copy Markdown
Collaborator

/test ci/centos/mini-e2e-helm/k8s-1.33

@ceph-csi-bot

Copy link
Copy Markdown
Collaborator

/test ci/centos/k8s-e2e-external-storage/1.35

@ceph-csi-bot

Copy link
Copy Markdown
Collaborator

/test ci/centos/k8s-e2e-external-storage/1.34

@ceph-csi-bot

Copy link
Copy Markdown
Collaborator

/test ci/centos/mini-e2e-helm/k8s-1.34

@ceph-csi-bot

Copy link
Copy Markdown
Collaborator

/test ci/centos/mini-e2e-helm/k8s-1.35

@ceph-csi-bot

Copy link
Copy Markdown
Collaborator

/test ci/centos/mini-e2e/k8s-1.33

@ceph-csi-bot

Copy link
Copy Markdown
Collaborator

/test ci/centos/mini-e2e/k8s-1.35

@ceph-csi-bot

Copy link
Copy Markdown
Collaborator

/test ci/centos/mini-e2e/k8s-1.34

@ceph-csi-bot ceph-csi-bot added ci/in-progress/e2e This label acts like a guard and prevents Mergify from adding the `ok-to-test` label again. and removed ok-to-test Label to trigger E2E tests labels May 21, 2026
@mergify mergify Bot removed the ci/in-progress/e2e This label acts like a guard and prevents Mergify from adding the `ok-to-test` label again. label May 21, 2026
@mergify mergify Bot merged commit 02cc78c into ceph:devel May 21, 2026
41 checks passed
@mergify mergify Bot removed the queued label May 21, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants