fix: propagate context deadline exceeded error properly#12821
Conversation
When a shim becomes unresponsive (e.g., stopped via SIGSTOP), ttrpc communication times out with `context deadline exceeded`. Currently, this error is not properly propagated, causing redundant API calls and slow container listing by client sides. Specifically, when executing the API to check the task state, it appears that the `context deadline exceeded` error via ttrpc is not being handled within `shimTask.State()` and `getProcessState()`. As a result, when this error occurs, clients such as nerdctl cannot recognize this error, and it is thought that the issue described below is occurring: - containerd/nerdctl#4720 Therefore, this commit adds error handling to ensure timeouts are properly handled by client sides. Signed-off-by: Hayato Kiwata <dev@haytok.jp>
|
Hi, @AkihiroSuda Could you please review when you have time ? |
AkihiroSuda
left a comment
There was a problem hiding this comment.
Thanks, would it be possible to have a unit test?
| if errdefs.IsDeadlineExceeded(err) { | ||
| return runtime.State{}, err | ||
| } | ||
| if !errors.Is(err, ttrpc.ErrClosed) { | ||
| return runtime.State{}, errgrpc.ToNative(err) |
There was a problem hiding this comment.
Curious; should this be considered a bug / missing functionality in errgrpc.ToNative ? Looks like it has an early return for non-GRPC errors, in which case it makes it a "Unknown";
containerd/vendor/github.com/containerd/errdefs/pkg/errgrpc/grpc.go
Lines 213 to 226 in 28cad51
There was a problem hiding this comment.
Oh, right, but it does return a deadline-exceeded; or is that wrapped later on?
containerd/vendor/github.com/containerd/errdefs/pkg/errgrpc/grpc.go
Lines 255 to 256 in 28cad51
There was a problem hiding this comment.
HI @thaJeztah Thanks for the comment!
My understanding is that the timeout error for an unresponsive shim is generated by ttrpc here:
func (c *Client) dispatch(ctx context.Context, req *Request, resp *Response) error {
...
case <-ctx.Done():
return ctx.Err()In this case, the error does not contain a gRPC status code.
Because of this, when ToNative processes this error, the following code path is executed, resulting in codes.Unknown on the client side:
containerd/vendor/github.com/containerd/errdefs/pkg/errgrpc/grpc.go
Lines 224 to 225 in 28cad51
So I think this is not exactly a bug in ToNative. The issue is that ttrpc returns a plain context.DeadlineExceeded error (not wrapped with a gRPC status), so we need to catch it before calling ToNative.
|
Hi, @AkihiroSuda Thanks for your review! After checking the current unit tests, I think it's difficult to add unit tests for additional error handling... https://github.com/containerd/containerd/blob/main/core/runtime/v2/shim_test.go |
|
Hi, @dmcgowan @thaJeztah (cc: @AkihiroSuda) AkihiroSuda and fuweid have approved, but could you check it when you have time? |
|
This PR seems to have caused a regression |
When a shim becomes unresponsive (e.g., stopped via SIGSTOP), ttrpc communication times out with
context deadline exceeded.Currently, this error is not properly propagated, causing redundant API calls and slow container listing by client sides.
Specifically, when executing the API to check the task state, it appears that the
context deadline exceedederror via ttrpc is not being handled withinshimTask.State()andgetProcessState().As a result, when this error occurs, clients such as nerdctl cannot recognize this error, and it is thought that the issue described below is occurring:
nerdctl psis slow when there are containers whch haveUnknownstatus nerdctl#4720Therefore, this commit adds error handling to ensure timeouts are properly handled by client sides.