fix: propagate context deadline exceeded error properly #12821

Merged
fuweid merged 1 commit into containerd:main from haytok:propagate-deadline-exceeded-error on Feb 21, 2026
@haytok haytok commented Jan 26, 2026

Member

When a shim becomes unresponsive (e.g., stopped via SIGSTOP), ttrpc
communication times out with `context deadline exceeded`.

Currently, this error is not propagated properly, which causes redundant API
calls and slow container listing on the client side.

Specifically, when the task state is queried, the `context deadline exceeded`
error returned via ttrpc does not appear to be handled within
`shimTask.State()` and `getProcessState()`.

As a result, clients such as nerdctl cannot recognize this error, which is
likely the cause of the following issue:

- containerd/nerdctl#4720

Therefore, this commit adds error handling so that clients can handle
timeouts properly.

Signed-off-by: Hayato Kiwata <dev@haytok.jp>

haytok commented Feb 4, 2026

Member Author

Hi @AkihiroSuda, could you please review this when you have time?

@AkihiroSuda AkihiroSuda left a comment

Member

Thanks, would it be possible to have a unit test?

@AkihiroSuda AkihiroSuda requested a review from dmcgowan February 4, 2026 15:22
Comment thread core/runtime/v2/shim.go
Comment on lines +838 to 842
	if errdefs.IsDeadlineExceeded(err) {
		return runtime.State{}, err
	}
	if !errors.Is(err, ttrpc.ErrClosed) {
		return runtime.State{}, errgrpc.ToNative(err)

Member

Curious: should this be considered a bug or missing functionality in errgrpc.ToNative? It looks like it has an early return for non-gRPC errors, in which case it turns the error into an "Unknown":

	s, isGRPC := status.FromError(err)
	var (
		desc string
		code codes.Code
	)
	if isGRPC {
		desc = s.Message()
		code = s.Code()
	} else {
		desc = err.Error()
		code = codes.Unknown
	}

Member

Oh, right, but it does return a deadline-exceeded; or is that wrapped later on?

	case codes.DeadlineExceeded:
		cls = context.DeadlineExceeded

Member Author

Hi @thaJeztah, thanks for the comment!

My understanding is that the timeout error for an unresponsive shim is generated by ttrpc here:

https://github.com/containerd/ttrpc/blob/9638fba0e51b478cdd4cc916893cffd4312bc0b5/client.go#L547-L548

func (c *Client) dispatch(ctx context.Context, req *Request, resp *Response) error {
...
	case <-ctx.Done():
		return ctx.Err()

In this case, the error does not contain a gRPC status code.

Because of this, when ToNative processes this error, the following code path is executed, resulting in codes.Unknown on the client side:

	desc = err.Error()
	code = codes.Unknown

So I think this is not exactly a bug in ToNative. The issue is that ttrpc returns a plain context.DeadlineExceeded error (not wrapped with a gRPC status), so we need to catch it before calling ToNative.

haytok commented Feb 5, 2026

Member Author

Hi @AkihiroSuda, thanks for your review!

After checking the current unit tests, I think it would be difficult to add unit tests for the additional error handling...

https://github.com/containerd/containerd/blob/main/core/runtime/v2/shim_test.go

@github-project-automation github-project-automation Bot moved this from Needs Triage to Review In Progress in Pull Request Review Feb 7, 2026
@haytok haytok requested a review from thaJeztah February 13, 2026 04:46
haytok commented Feb 19, 2026

Member Author

Hi, @dmcgowan @thaJeztah (cc: @AkihiroSuda)

AkihiroSuda and fuweid have approved, but could you check it when you have time?

@fuweid fuweid added this pull request to the merge queue Feb 21, 2026
Merged via the queue into containerd:main with commit 591de24 Feb 21, 2026
90 of 92 checks passed
@github-project-automation github-project-automation Bot moved this from Review In Progress to Done in Pull Request Review Feb 21, 2026
AkihiroSuda commented May 27, 2026

Member

This PR seems to have caused a regression


+ limactl shell default nerdctl ps -a
time="2026-05-27T02:56:35Z" level=warning msg="treating lima version \"0c93a37\" from \"/Users/runner/.lima/default/lima-version\" as very latest release"
CONTAINER ID    IMAGE                                              COMMAND                   CREATED           STATUS                      PORTS                     NAMES
3f67f52a5931    ghcr.io/stargz-containers/nginx:1.19-alpine-org    "/docker-entrypoint.…"    10 seconds ago    Exited (1) 5 seconds ago    127.0.0.1:8080->80/tcp    nginx
+ limactl shell default nerdctl logs nginx
time="2026-05-27T02:56:36Z" level=warning msg="treating lima version \"0c93a37\" from \"/Users/runner/.lima/default/lima-version\" as very latest release"
+ [[ -z '' ]]
+ limactl shell default journalctl --user -u containerd
[...]
May 27 02:55:49 lima-default containerd-rootless.sh[1173]: time="2026-05-27T02:55:49.956586096Z" level=info msg="containerd successfully booted in 7.782954s"
May 27 02:56:25 lima-default containerd-rootless.sh[1173]: time="2026-05-27T02:56:25.445537685Z" level=info msg="connecting to shim 3f67f52a59314ed6fc4e1048b0e2c826ab26df575cec4cbf38619beb754dbebd" address="unix:///run/containerd/s/56dfcbff8ef161f104b965a6f02b0158fca28ed1beba7df4c255935edf3a9122" namespace=default protocol=ttrpc version=3
May 27 02:56:29 lima-default containerd-rootless.sh[1173]: time="2026-05-27T02:56:29.753250886Z" level=error msg="ttrpc: received message on inactive stream" stream=3
