Skip to content

client.New doesn't respect the timeout (v2) #11507

@vvoland

Description

@vvoland

Description

containerd.WithTimeout doesn't work on containerd v2 client and returns immediately if the connection can't be established.

This is a consequence of moving from grpc.DialContext to grpc.NewClient and dropping the usage of WithBlock: 63b4688.

The former attempts the connection immediately, and the WithBlock blocks until the connection is ready.

As explained in https://github.com/grpc/grpc-go/blob/master/Documentation/anti-patterns.md - this is considered as an anti-pattern because one can't assume that the connection would still be available later.

The shift to grpc.NewClient makes the gRPC connection "lazy" - it isn't performed at the time the client is being created, it's deferred to a later time where an actual RPC is performed.

This makes the containerd.WithTimeout useless, because it only worked for the initial connection, and not all connections dialed throughout the whole lifecycle of the containerd client.

The timeout was first introduced in #2554 - however I'm not able to determine whether it was intentional that it only applies to the initial connection. This is where I need your input, because we have two options depending on this decision :)

  1. Deprecate/remove WithTimeout because there's no initial connection anymore
  2. Pass the timeout to the dialer, so the timeout is applied to every dial attempt client: Respect client.WithTimeout option #11508

On the other hand, there's also a question of whether the new behavior of client.New should produce such "lazily" initialized client, instead of verifying that at least the initial connection is performed successfully.

Looking at the godoc:

// New returns a new containerd client that is connected to the containerd
// instance provided by address
func New(address string, opts ...Opt) (*Client, error) {

The that is connected to the containerd instance sounds like that function should at least make sure that there's a containerd instance listening on the provided address.

Currently, on the v2 client this behavior isn't really consistent, because it depends on whether the default namespace was specified or not (client.WithDefaultNamespace):

containerd/client/client.go

Lines 173 to 180 in c63d132

// check namespace labels for default runtime
if copts.defaultRuntime == "" && c.defaultns != "" {
if label, err := c.GetLabel(context.Background(), defaults.DefaultRuntimeNSLabel); err != nil {
return nil, err
} else if label != "" {
c.runtime = label
}
}

If it was, then the client will attempt to query the daemon for the default runtime label which effectively "unlazies" the client.

We should decide whether the client.New should produce an "unlazied" connection and if so - encode that explicitly.

Steps to reproduce the issue

package main

import (
	"context"
	"testing"
	"time"

	"github.com/containerd/containerd"
	containerdv2 "github.com/containerd/containerd/v2/client"
)

func TestContainerdV1(t *testing.T) {
	start := time.Now()
	defer func() {
		s := time.Since(start)
		t.Logf("took %s", s)
		if s < time.Second*8 {
			t.Fatal("returned earlier than expected")
		}
	}()

	c8d, err := containerd.New("/run/containerd.sock",
		containerd.WithTimeout(time.Second*10),
	)
	if err != nil {
		t.Logf("new error: %s", err)
		return
	}
	defer c8d.Close()

	_, err = c8d.ListImages(context.Background())
	if err != nil {
		t.Logf("list error: %s", err)
		return
	}
}

func TestContainerdV2(t *testing.T) {
	start := time.Now()
	defer func() {
		s := time.Since(start)
		t.Logf("took %s", s)
		if s < time.Second*8 {
			t.Fatal("returned earlier than expected")
		}
	}()

	c8d, err := containerdv2.New("/run/containerd.sock",
		containerdv2.WithTimeout(time.Second*10),
	)
	if err != nil {
		t.Logf("new error: %s", err)
		return
	}
	defer c8d.Close()

	_, err = c8d.ListImages(context.Background())
	if err != nil {
		t.Logf("list error: %s", err)
		return
	}
}

Describe the results you received and expected

Got:

$ go test -v .
=== RUN   TestContainerdV1
    main_test.go:28: new error: failed to dial "/run/containerd.sock": context deadline exceeded: connection error: desc = "transport: error while dialing: dial unix:///run/containerd.sock: timeout"
    main_test.go:18: took 10.002191797s
--- PASS: TestContainerdV1 (10.00s)
=== RUN   TestContainerdV2
    main_test.go:62: list error: connection error: desc = "transport: Error while dialing: dial unix:///run/containerd.sock: timeout"
    main_test.go:44: took 1.003387459s
    main_test.go:46: returned earlier than expected
--- FAIL: TestContainerdV2 (1.00s)
FAIL
FAIL	github.com/vvoland/ctrdnewtimeout	11.018s
FAIL

Expected the second test to pass (retry the connection for at least 10 seconds).

What version of containerd are you using?

v2.0.3 and v1.7.26

Any other relevant information

No response

Show configuration if it is related to CRI plugin.

No response

Metadata

Metadata

Assignees

No one assigned

    Type

    No type

    Projects

    Status

    Done

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions