util: restart process when gRPC call is stuck for 10+ minutes #6286
Madhu-1 left a comment
The changes look good. We need to put this functionality behind a flag (enabled by default, with an option to disable it).
// reclaimSpaceServicePrefix is the gRPC service prefix for ReclaimSpace
// operations, which are excluded from the stuck-call restart because they can
// legitimately run for a long time on large volumes.
const reclaimSpaceServicePrefix = "/reclaimspace."
We should make this generic so that it can be expanded: add a helper function to skip known services.
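For illustration, a generic skip helper of the kind requested here might look like the sketch below. It is only a guess at the shape, not the code from the PR: the list name `slowGRPCSkipPrefixes` and the `/reclaimspace.` entry come from the final commit message, while the helper name `isSlowGRPCExempt` and the full method names used in `main` are hypothetical.

```go
package main

import (
	"fmt"
	"strings"
)

// slowGRPCSkipPrefixes lists gRPC full-method prefixes that are exempt from
// the stuck-call restart. The "/reclaimspace." entry is taken from the PR;
// further prefixes could be appended later.
var slowGRPCSkipPrefixes = []string{
	"/reclaimspace.",
}

// isSlowGRPCExempt (hypothetical name) reports whether fullMethod starts
// with any of the known skip prefixes.
func isSlowGRPCExempt(fullMethod string) bool {
	for _, prefix := range slowGRPCSkipPrefixes {
		if strings.HasPrefix(fullMethod, prefix) {
			return true
		}
	}
	return false
}

func main() {
	// Illustrative method names, assumed for the example.
	fmt.Println(isSlowGRPCExempt("/reclaimspace.ReclaimSpaceController/ControllerReclaimSpace")) // prints "true"
	fmt.Println(isSlowGRPCExempt("/csi.v1.Controller/CreateVolume"))                             // prints "false"
}
```

Keeping the prefixes in a slice means that exempting another long-running service is a one-line change.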
/test ci/centos/mini-e2e/k8s-1.34
Done, added an env var. PTAL.
We usually use flags/launch args to enable/disable features.
We recently decided against that because passing settings that way causes difficulty with upgrades, backward compatibility, and API problems, and we are moving towards env vars.
Where was this decided? I would vote for a feature-gate CLI command which accepts the
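As a rough sketch of the feature-gate approach, the parser below handles the format the final commit message describes: comma-separated key=bool pairs, with unknown keys rejected and `SlowGRPCRestart` enabled by default. The function name `parseFeatureGates` and the `defaultGates` map are assumptions made for illustration, not the PR's actual code.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// defaultGates (hypothetical name) holds every known gate and its default.
// SlowGRPCRestart is enabled by default, as stated in the commit message.
var defaultGates = map[string]bool{
	"SlowGRPCRestart": true,
}

// parseFeatureGates (hypothetical name) parses a value such as
// "SlowGRPCRestart=false" and returns the defaults with overrides applied.
// Unknown keys and non-boolean values produce an error.
func parseFeatureGates(value string) (map[string]bool, error) {
	gates := make(map[string]bool, len(defaultGates))
	for name, enabled := range defaultGates {
		gates[name] = enabled
	}
	if strings.TrimSpace(value) == "" {
		return gates, nil // nothing overridden
	}
	for _, pair := range strings.Split(value, ",") {
		key, raw, found := strings.Cut(strings.TrimSpace(pair), "=")
		if !found {
			return nil, fmt.Errorf("invalid feature gate %q: expected key=bool", pair)
		}
		if _, known := gates[key]; !known {
			return nil, fmt.Errorf("unknown feature gate %q", key)
		}
		enabled, err := strconv.ParseBool(raw)
		if err != nil {
			return nil, fmt.Errorf("invalid value for feature gate %q: %w", key, err)
		}
		gates[key] = enabled
	}
	return gates, nil
}

func main() {
	gates, err := parseFeatureGates("SlowGRPCRestart=false")
	fmt.Println(gates["SlowGRPCRestart"], err) // prints "false <nil>"

	_, err = parseFeatureGates("Bogus=true")
	fmt.Println(err) // prints: unknown feature gate "Bogus"
}
```

Rejecting unknown keys at startup surfaces typos immediately instead of silently leaving a gate at its default.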
/test ci/centos/mini-e2e/k8s-1.35
/test ci/centos/mini-e2e/k8s-1.35
/test ci/centos/mini-e2e/k8s-1.33
Deprecation notice: This pull request comes from a fork and was queued with
Merge Queue Status
This pull request spent 3 hours 5 minutes 13 seconds in the queue, including 3 hours 4 minutes 31 seconds running CI.
Add a --feature-gates CLI flag that accepts comma-separated key=bool
pairs (e.g., SlowGRPCRestart=false). Unknown keys are rejected at
startup.

Add killOnSlowGRPC unary interceptor that calls os.Exit(1) if any
gRPC handler takes longer than 10 minutes; the kubelet restarts the
container in-place. The interceptor is conditionally wired into the
middleware chain based on the SlowGRPCRestart feature gate (enabled
by default).

Methods matching prefixes in slowGRPCSkipPrefixes are excluded. The
initial list contains only ReclaimSpace calls (/reclaimspace.), which
can legitimately take a long time on large volumes.

Assisted-by: Claude <noreply@anthropic.com>
Signed-off-by: Rakshith R <rar@redhat.com>
/test ci/centos/upgrade-tests-cephfs
/test ci/centos/k8s-e2e-external-storage/1.33
/test ci/centos/upgrade-tests-rbd
/test ci/centos/mini-e2e-helm/k8s-1.33
/test ci/centos/k8s-e2e-external-storage/1.35
/test ci/centos/k8s-e2e-external-storage/1.34
/test ci/centos/mini-e2e-helm/k8s-1.34
/test ci/centos/mini-e2e-helm/k8s-1.35
/test ci/centos/mini-e2e/k8s-1.33
/test ci/centos/mini-e2e/k8s-1.35
/test ci/centos/mini-e2e/k8s-1.34
These commands are normally not required, but in case of issues, leave any of
the following bot commands in an otherwise empty comment in this PR:
/retest ci/centos/<job-name>: retest the <job-name> after unrelated failure (please report the failure too!)