content/helper: use semaphore for OpenWriter #4985
fuweid wants to merge 1 commit into containerd:main from
Conversation
Build succeeded.
I think this is worth splitting out into a separate commit or PR. It is not really related to the other change, and both are big behavioral changes on their own.
This patch is only for the client side's OpenWriter; I will file another PR for CRI's max concurrent downloader number. @dmcgowan and @AkihiroSuda please take a look. Thanks~
content/helpers.go
Outdated
```go
type lockItem struct {
	waiter *semaphore.Weighted
	cnt    int
}
```
Is `cnt` the number of writers who are waiting for the lock?
Yes. The unlock should check the number of waiters before deleting the key. Otherwise, several goroutines could end up holding the same key.
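For context, here is a minimal sketch of how such a reference-counted semaphore lock might fit together. The names follow the diff (`lockItem`, `waiter`, `cnt`, `locks`); the `refLocks` wrapper and the method signatures are assumptions for illustration, not the actual patch.

```go
package content

import (
	"context"
	"sync"

	"golang.org/x/sync/semaphore"
)

// lockItem pairs the per-ref semaphore with a count of the goroutines
// currently using it, so the map entry can be dropped once unused.
type lockItem struct {
	waiter *semaphore.Weighted
	cnt    int
}

// refLocks is a hypothetical wrapper, used here only for illustration.
type refLocks struct {
	mu    sync.Mutex
	locks map[string]*lockItem
}

func (l *refLocks) lock(ctx context.Context, key string) error {
	l.mu.Lock()
	item, ok := l.locks[key]
	if !ok {
		item = &lockItem{waiter: semaphore.NewWeighted(1)}
		l.locks[key] = item
	}
	item.cnt++ // count this goroutine so unlock knows the key is in use
	l.mu.Unlock()

	if err := item.waiter.Acquire(ctx, 1); err != nil {
		// Acquire failed (e.g. context canceled): undo our count.
		l.mu.Lock()
		item.cnt--
		if item.cnt <= 0 {
			delete(l.locks, key)
		}
		l.mu.Unlock()
		return err
	}
	return nil
}

func (l *refLocks) unlock(key string) {
	l.mu.Lock()
	defer l.mu.Unlock()

	item := l.locks[key]
	item.waiter.Release(1)
	item.cnt--
	if item.cnt <= 0 { // should never go negative; checked defensively
		delete(l.locks, key)
	}
}
```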
content/helpers.go
Outdated
```go
if w.cnt <= 0 {
	delete(l.locks, key)
}
```
Basically, it should be impossible. The `<= 0` check is just there in case. :p
How about checking for zero in this if and returning an error in the negative case?
kzys
left a comment
The change looks good to me. Some comments to understand the code better.
content/helpers.go
Outdated
```go
// TODO: Check status to determine if the writer is active,
// continue waiting while active, otherwise return lock
// error or abort. Requires asserting for an ingest manager
// TODO: Keep the ref with namespace scope.
```
Why not just scope it to the namespace in this PR?
containerd provides the namespace in two ways: one is from the context, and the other is from the gRPC metadata context.
For the CRI-plugin case, we can get the namespace from the context. But for the client-server case, the context doesn't contain the namespace, which is provided by a gRPC interceptor.
I am still looking for a common way to get the namespace :(
Wow. I didn't know that. It would be great if we could have the namespace in the context. Alternatively, we may be able to supply the namespace in WriterOpts.
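As a rough illustration of the namespace-scoped key idea being discussed, here is a sketch using containerd's namespaces package. It assumes the namespace is available on the context (true for the CRI plugin; the client-server path is the open problem above), and `scopedKey` is a hypothetical helper name.

```go
import (
	"context"
	"fmt"

	"github.com/containerd/containerd/namespaces"
)

// scopedKey prefixes the lock key with the namespace from the context,
// falling back to the bare ref when no namespace is set.
func scopedKey(ctx context.Context, ref string) string {
	if ns, ok := namespaces.Namespace(ctx); ok {
		return fmt.Sprintf("%s/%s", ns, ref)
	}
	return ref
}
```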
Looks good. I think the key should include the namespace in this PR though; it is a small change.
Is anyone still looking into this PR? It actually speeds up concurrent image pulling, and we need it. PTAL
This PR has been stalled for half a year; someone should take responsibility for keeping it moving. @dmcgowan @kzys @AkihiroSuda
@yyb196 @xujihui1985 Sorry for the late reply. I am still working on this patch because I need to find a way to carry the namespace information, as in the comment in #6039 (comment). I am trying to finish it in release/1.6.
content/helpers.go
Outdated
```go
	for {
		cw, err = cs.Writer(ctx, opts...)
		// ...
	}
```
Shall we provide a new method like cs.WriterWithLock here? Since the lock-by-ref logic has been moved to OpenWriter, there should be no lock contention inside the Writer method.
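For readers without the diff open: the loop being discussed is OpenWriter's backoff-retry helper, which looks roughly like the sketch below (a paraphrase of the pre-patch helper from memory, not the exact source). On an Unavailable error it sleeps for a randomized, doubling interval and retries, which is the spin that moving the lock into OpenWriter is meant to eliminate.

```go
import (
	"context"
	"math/rand"
	"time"

	"github.com/containerd/containerd/content"
	"github.com/containerd/containerd/errdefs"
)

func openWriter(ctx context.Context, cs content.Ingester, opts ...content.WriterOpt) (content.Writer, error) {
	var (
		cw    content.Writer
		err   error
		retry = 16
	)
	for {
		cw, err = cs.Writer(ctx, opts...)
		if err == nil {
			break
		}
		if !errdefs.IsUnavailable(err) {
			return nil, err
		}
		// Another writer holds the ref: back off and try again.
		select {
		case <-time.After(time.Millisecond * time.Duration(rand.Intn(retry))):
			if retry < 2048 {
				retry = retry << 1
			}
		case <-ctx.Done():
			return nil, err
		}
	}
	return cw, err
}
```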
@fuweid Awesome!!! I'll be sure to review this tomorrow morning, thank you!
@fuweid sgtm
@fuweid Took a look, I'm alright with changing the approach to the one in the commit :) I'll let others chime in
Background:

With the current design, the content backend uses a key-lock for long-lived write transactions. If a content reference has been marked for a write transaction, other requests on the same reference fail fast with an unavailable error. Since the metadata plugin is based on boltdb, which only supports a single writer, the content backend can't block or hold a request for too long. It requires the client to handle retries by itself, as OpenWriter, the backoff-retry helper, does. But the maximum retry interval can be up to 2 seconds. If there are several concurrent requests for the same image, the waiters may wake up at the same time, and only one of them can continue. The rest go back to sleep, so it takes a long time to finish all the pulling jobs, and it gets worse if the image has many more layers, as mentioned in issue containerd#4937.

After fetching, the containerd.Pull API allows several handlers to commit the same ChainID snapshot, but only one can succeed. Since unpacking a tar.gz is a time-consuming job, unpacking the same ChainID snapshot in parallel hurts performance. For instance, Request 2 below doesn't need to prepare and commit; it should just wait for Request 1 to finish, as mentioned in pull request containerd#6318.

```text
Request 1                Request 2

Prepare
    |
    |                    Prepare
Commit                      |
    |                       |
    |                    Commit (failed on exist)
```

Both the content backoff retry and the unnecessary unpack hurt performance.

Solution:

Introduce duplicate suppression in the fetch and unpack contexts. The duplicate suppression uses a key-mutex and single-waiter-notify to support singleflight. Callers can use the duplicate suppression in the different PullImage handlers so that we avoid both the unnecessary unpack and the spin-lock in OpenWriter.

Test Result:

Before the enhancement:

```bash
➜ /tmp sudo bash testing.sh "localhost:5000/redis:latest" 20

crictl pull localhost:5000/redis:latest (x20) takes ...
real    1m6.172s
user    0m0.268s
sys     0m0.193s

docker pull localhost:5000/redis:latest (x20) takes ...
real    0m1.324s
user    0m0.441s
sys     0m0.316s

➜ /tmp sudo bash testing.sh "localhost:5000/golang:latest" 20

crictl pull localhost:5000/golang:latest (x20) takes ...
real    1m47.657s
user    0m0.284s
sys     0m0.224s

docker pull localhost:5000/golang:latest (x20) takes ...
real    0m6.381s
user    0m0.488s
sys     0m0.358s
```

With this enhancement:

```bash
➜ /tmp sudo bash testing.sh "localhost:5000/redis:latest" 20

crictl pull localhost:5000/redis:latest (x20) takes ...
real    0m1.140s
user    0m0.243s
sys     0m0.178s

docker pull localhost:5000/redis:latest (x20) takes ...
real    0m1.239s
user    0m0.463s
sys     0m0.275s

➜ /tmp sudo bash testing.sh "localhost:5000/golang:latest" 20

crictl pull localhost:5000/golang:latest (x20) takes ...
real    0m5.546s
user    0m0.217s
sys     0m0.219s

docker pull localhost:5000/golang:latest (x20) takes ...
real    0m6.090s
user    0m0.501s
sys     0m0.331s
```

Test Script:

localhost:5000/{redis|golang}:latest is equal to docker.io/library/{redis|golang}:latest. The image is served by a local registry started with `docker run -d -p 5000:5000 --name registry registry:2`.
```bash
image_name="${1}"
pull_times="${2:-10}"

cleanup() {
  ctr image rmi "${image_name}"
  ctr -n k8s.io image rmi "${image_name}"
  crictl rmi "${image_name}"
  docker rmi "${image_name}"
  sleep 2
}

crictl_testing() {
  for idx in $(seq 1 ${pull_times}); do
    crictl pull "${image_name}" > /dev/null 2>&1 &
  done
  wait
}

docker_testing() {
  for idx in $(seq 1 ${pull_times}); do
    docker pull "${image_name}" > /dev/null 2>&1 &
  done
  wait
}

cleanup > /dev/null 2>&1

echo 3 > /proc/sys/vm/drop_caches
sleep 3
echo "crictl pull $image_name (x${pull_times}) takes ..."
time crictl_testing

echo

echo 3 > /proc/sys/vm/drop_caches
sleep 3
echo "docker pull $image_name (x${pull_times}) takes ..."
time docker_testing
```

Fixes: containerd#4937
Close: containerd#4985
Close: containerd#6318

Signed-off-by: Wei Fu <fuweid89@gmail.com>
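To make the commit message's "key-mutex and single-waiter-notify to support singleflight" concrete, here is a minimal sketch of the same duplicate-suppression effect using golang.org/x/sync/singleflight instead of the patch's hand-rolled mechanism; `pullLayer` and the `fetch` callback are hypothetical names for illustration.

```go
import (
	"context"

	"golang.org/x/sync/singleflight"
)

var group singleflight.Group

// pullLayer collapses concurrent pulls of the same ref into one in-flight
// fetch; every caller gets the first caller's result instead of racing
// the content store and backing off. Note: in this simplified sketch,
// cancellation of the first caller's ctx affects all waiters.
func pullLayer(ctx context.Context, ref string, fetch func(context.Context) error) error {
	_, err, _ := group.Do(ref, func() (interface{}, error) {
		return nil, fetch(ctx)
	})
	return err
}
```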
Background: With current design, the content backend uses key-lock for long-lived write transaction. If the content reference has been marked for write transaction, the other requestes on the same reference will fail fast with unavailable error. Since the metadata plugin is based on boltbd which only supports single-writer, the content backend can't block or handle the request too long. It requires the client to handle retry by itself, like OpenWriter - backoff retry helper. But the maximum retry interval can be up to 2 seconds. If there are several concurrent requestes fo the same image, the waiters maybe wakeup at the same time and there is only one waiter can continue. A lot of waiters will get into sleep and we will take long time to finish all the pulling jobs and be worse if the image has many more layers, which mentioned in issue containerd#4937. After fetching, containerd.Pull API allows several hanlers to commit same ChainID snapshotter but only one can be done successfully. Since unpack tar.gz is time-consuming job, it can impact the performance on unpacking for same ChainID snapshotter in parallel. For instance, the Request 2 doesn't need to prepare and commit, it should just wait for Request 1 finish, which mentioned in pull request containerd#6318. ```text Request 1 Request 2 Prepare | | | | Prepare Commit | | | | Commit(failed on exist) ``` Both content backoff retry and unnecessary unpack impacts the performance. Solution: Introduced the duplicate suppression in fetch and unpack context. The deplicate suppression uses key-mutex and single-waiter-notify to support singleflight. The caller can use the duplicate suppression in different PullImage handlers so that we can avoid unnecessary unpack and spin-lock in OpenWriter. Test Result: Before enhancement: ```bash ➜ /tmp sudo bash testing.sh "localhost:5000/redis:latest" 20 crictl pull localhost:5000/redis:latest (x20) takes ... real 1m6.172s user 0m0.268s sys 0m0.193s docker pull localhost:5000/redis:latest (x20) takes ... real 0m1.324s user 0m0.441s sys 0m0.316s ➜ /tmp sudo bash testing.sh "localhost:5000/golang:latest" 20 crictl pull localhost:5000/golang:latest (x20) takes ... real 1m47.657s user 0m0.284s sys 0m0.224s docker pull localhost:5000/golang:latest (x20) takes ... real 0m6.381s user 0m0.488s sys 0m0.358s ``` With this enhancement: ```bash ➜ /tmp sudo bash testing.sh "localhost:5000/redis:latest" 20 crictl pull localhost:5000/redis:latest (x20) takes ... real 0m1.140s user 0m0.243s sys 0m0.178s docker pull localhost:5000/redis:latest (x20) takes ... real 0m1.239s user 0m0.463s sys 0m0.275s ➜ /tmp sudo bash testing.sh "localhost:5000/golang:latest" 20 crictl pull localhost:5000/golang:latest (x20) takes ... real 0m5.546s user 0m0.217s sys 0m0.219s docker pull localhost:5000/golang:latest (x20) takes ... real 0m6.090s user 0m0.501s sys 0m0.331s ``` Test Script: localhost:5000/{redis|golang}:latest is equal to docker.io/library/{redis|golang}:latest. The image is hold in local registry service by `docker run -d -p 5000:5000 --name registry registry:2`. 
```bash image_name="${1}" pull_times="${2:-10}" cleanup() { ctr image rmi "${image_name}" ctr -n k8s.io image rmi "${image_name}" crictl rmi "${image_name}" docker rmi "${image_name}" sleep 2 } crictl_testing() { for idx in $(seq 1 ${pull_times}); do crictl pull "${image_name}" > /dev/null 2>&1 & done wait } docker_testing() { for idx in $(seq 1 ${pull_times}); do docker pull "${image_name}" > /dev/null 2>&1 & done wait } cleanup > /dev/null 2>&1 echo 3 > /proc/sys/vm/drop_caches sleep 3 echo "crictl pull $image_name (x${pull_times}) takes ..." time crictl_testing echo echo 3 > /proc/sys/vm/drop_caches sleep 3 echo "docker pull $image_name (x${pull_times}) takes ..." time docker_testing ``` Fixes: containerd#4937 Close: containerd#4985 Close: containerd#6318 Signed-off-by: Wei Fu <fuweid89@gmail.com>
Background: With current design, the content backend uses key-lock for long-lived write transaction. If the content reference has been marked for write transaction, the other requestes on the same reference will fail fast with unavailable error. Since the metadata plugin is based on boltbd which only supports single-writer, the content backend can't block or handle the request too long. It requires the client to handle retry by itself, like OpenWriter - backoff retry helper. But the maximum retry interval can be up to 2 seconds. If there are several concurrent requestes fo the same image, the waiters maybe wakeup at the same time and there is only one waiter can continue. A lot of waiters will get into sleep and we will take long time to finish all the pulling jobs and be worse if the image has many more layers, which mentioned in issue containerd#4937. After fetching, containerd.Pull API allows several hanlers to commit same ChainID snapshotter but only one can be done successfully. Since unpack tar.gz is time-consuming job, it can impact the performance on unpacking for same ChainID snapshotter in parallel. For instance, the Request 2 doesn't need to prepare and commit, it should just wait for Request 1 finish, which mentioned in pull request containerd#6318. ```text Request 1 Request 2 Prepare | | | | Prepare Commit | | | | Commit(failed on exist) ``` Both content backoff retry and unnecessary unpack impacts the performance. Solution: Introduced the duplicate suppression in fetch and unpack context. The deplicate suppression uses key-mutex and single-waiter-notify to support singleflight. The caller can use the duplicate suppression in different PullImage handlers so that we can avoid unnecessary unpack and spin-lock in OpenWriter. Test Result: Before enhancement: ```bash ➜ /tmp sudo bash testing.sh "localhost:5000/redis:latest" 20 crictl pull localhost:5000/redis:latest (x20) takes ... real 1m6.172s user 0m0.268s sys 0m0.193s docker pull localhost:5000/redis:latest (x20) takes ... real 0m1.324s user 0m0.441s sys 0m0.316s ➜ /tmp sudo bash testing.sh "localhost:5000/golang:latest" 20 crictl pull localhost:5000/golang:latest (x20) takes ... real 1m47.657s user 0m0.284s sys 0m0.224s docker pull localhost:5000/golang:latest (x20) takes ... real 0m6.381s user 0m0.488s sys 0m0.358s ``` With this enhancement: ```bash ➜ /tmp sudo bash testing.sh "localhost:5000/redis:latest" 20 crictl pull localhost:5000/redis:latest (x20) takes ... real 0m1.140s user 0m0.243s sys 0m0.178s docker pull localhost:5000/redis:latest (x20) takes ... real 0m1.239s user 0m0.463s sys 0m0.275s ➜ /tmp sudo bash testing.sh "localhost:5000/golang:latest" 20 crictl pull localhost:5000/golang:latest (x20) takes ... real 0m5.546s user 0m0.217s sys 0m0.219s docker pull localhost:5000/golang:latest (x20) takes ... real 0m6.090s user 0m0.501s sys 0m0.331s ``` Test Script: localhost:5000/{redis|golang}:latest is equal to docker.io/library/{redis|golang}:latest. The image is hold in local registry service by `docker run -d -p 5000:5000 --name registry registry:2`. 
```bash image_name="${1}" pull_times="${2:-10}" cleanup() { ctr image rmi "${image_name}" ctr -n k8s.io image rmi "${image_name}" crictl rmi "${image_name}" docker rmi "${image_name}" sleep 2 } crictl_testing() { for idx in $(seq 1 ${pull_times}); do crictl pull "${image_name}" > /dev/null 2>&1 & done wait } docker_testing() { for idx in $(seq 1 ${pull_times}); do docker pull "${image_name}" > /dev/null 2>&1 & done wait } cleanup > /dev/null 2>&1 echo 3 > /proc/sys/vm/drop_caches sleep 3 echo "crictl pull $image_name (x${pull_times}) takes ..." time crictl_testing echo echo 3 > /proc/sys/vm/drop_caches sleep 3 echo "docker pull $image_name (x${pull_times}) takes ..." time docker_testing ``` Fixes: containerd#4937 Close: containerd#4985 Close: containerd#6318 Signed-off-by: Wei Fu <fuweid89@gmail.com>
Background: With current design, the content backend uses key-lock for long-lived write transaction. If the content reference has been marked for write transaction, the other requestes on the same reference will fail fast with unavailable error. Since the metadata plugin is based on boltbd which only supports single-writer, the content backend can't block or handle the request too long. It requires the client to handle retry by itself, like OpenWriter - backoff retry helper. But the maximum retry interval can be up to 2 seconds. If there are several concurrent requestes fo the same image, the waiters maybe wakeup at the same time and there is only one waiter can continue. A lot of waiters will get into sleep and we will take long time to finish all the pulling jobs and be worse if the image has many more layers, which mentioned in issue containerd#4937. After fetching, containerd.Pull API allows several hanlers to commit same ChainID snapshotter but only one can be done successfully. Since unpack tar.gz is time-consuming job, it can impact the performance on unpacking for same ChainID snapshotter in parallel. For instance, the Request 2 doesn't need to prepare and commit, it should just wait for Request 1 finish, which mentioned in pull request containerd#6318. ```text Request 1 Request 2 Prepare | | | | Prepare Commit | | | | Commit(failed on exist) ``` Both content backoff retry and unnecessary unpack impacts the performance. Solution: Introduced the duplicate suppression in fetch and unpack context. The deplicate suppression uses key-mutex and single-waiter-notify to support singleflight. The caller can use the duplicate suppression in different PullImage handlers so that we can avoid unnecessary unpack and spin-lock in OpenWriter. Test Result: Before enhancement: ```bash ➜ /tmp sudo bash testing.sh "localhost:5000/redis:latest" 20 crictl pull localhost:5000/redis:latest (x20) takes ... real 1m6.172s user 0m0.268s sys 0m0.193s docker pull localhost:5000/redis:latest (x20) takes ... real 0m1.324s user 0m0.441s sys 0m0.316s ➜ /tmp sudo bash testing.sh "localhost:5000/golang:latest" 20 crictl pull localhost:5000/golang:latest (x20) takes ... real 1m47.657s user 0m0.284s sys 0m0.224s docker pull localhost:5000/golang:latest (x20) takes ... real 0m6.381s user 0m0.488s sys 0m0.358s ``` With this enhancement: ```bash ➜ /tmp sudo bash testing.sh "localhost:5000/redis:latest" 20 crictl pull localhost:5000/redis:latest (x20) takes ... real 0m1.140s user 0m0.243s sys 0m0.178s docker pull localhost:5000/redis:latest (x20) takes ... real 0m1.239s user 0m0.463s sys 0m0.275s ➜ /tmp sudo bash testing.sh "localhost:5000/golang:latest" 20 crictl pull localhost:5000/golang:latest (x20) takes ... real 0m5.546s user 0m0.217s sys 0m0.219s docker pull localhost:5000/golang:latest (x20) takes ... real 0m6.090s user 0m0.501s sys 0m0.331s ``` Test Script: localhost:5000/{redis|golang}:latest is equal to docker.io/library/{redis|golang}:latest. The image is hold in local registry service by `docker run -d -p 5000:5000 --name registry registry:2`. 
```bash image_name="${1}" pull_times="${2:-10}" cleanup() { ctr image rmi "${image_name}" ctr -n k8s.io image rmi "${image_name}" crictl rmi "${image_name}" docker rmi "${image_name}" sleep 2 } crictl_testing() { for idx in $(seq 1 ${pull_times}); do crictl pull "${image_name}" > /dev/null 2>&1 & done wait } docker_testing() { for idx in $(seq 1 ${pull_times}); do docker pull "${image_name}" > /dev/null 2>&1 & done wait } cleanup > /dev/null 2>&1 echo 3 > /proc/sys/vm/drop_caches sleep 3 echo "crictl pull $image_name (x${pull_times}) takes ..." time crictl_testing echo echo 3 > /proc/sys/vm/drop_caches sleep 3 echo "docker pull $image_name (x${pull_times}) takes ..." time docker_testing ``` Fixes: containerd#4937 Close: containerd#4985 Close: containerd#6318 Signed-off-by: Wei Fu <fuweid89@gmail.com>
Background: With current design, the content backend uses key-lock for long-lived write transaction. If the content reference has been marked for write transaction, the other requestes on the same reference will fail fast with unavailable error. Since the metadata plugin is based on boltbd which only supports single-writer, the content backend can't block or handle the request too long. It requires the client to handle retry by itself, like OpenWriter - backoff retry helper. But the maximum retry interval can be up to 2 seconds. If there are several concurrent requestes fo the same image, the waiters maybe wakeup at the same time and there is only one waiter can continue. A lot of waiters will get into sleep and we will take long time to finish all the pulling jobs and be worse if the image has many more layers, which mentioned in issue containerd#4937. After fetching, containerd.Pull API allows several hanlers to commit same ChainID snapshotter but only one can be done successfully. Since unpack tar.gz is time-consuming job, it can impact the performance on unpacking for same ChainID snapshotter in parallel. For instance, the Request 2 doesn't need to prepare and commit, it should just wait for Request 1 finish, which mentioned in pull request containerd#6318. ```text Request 1 Request 2 Prepare | | | | Prepare Commit | | | | Commit(failed on exist) ``` Both content backoff retry and unnecessary unpack impacts the performance. Solution: Introduced the duplicate suppression in fetch and unpack context. The deplicate suppression uses key-mutex and single-waiter-notify to support singleflight. The caller can use the duplicate suppression in different PullImage handlers so that we can avoid unnecessary unpack and spin-lock in OpenWriter. Test Result: Before enhancement: ```bash ➜ /tmp sudo bash testing.sh "localhost:5000/redis:latest" 20 crictl pull localhost:5000/redis:latest (x20) takes ... real 1m6.172s user 0m0.268s sys 0m0.193s docker pull localhost:5000/redis:latest (x20) takes ... real 0m1.324s user 0m0.441s sys 0m0.316s ➜ /tmp sudo bash testing.sh "localhost:5000/golang:latest" 20 crictl pull localhost:5000/golang:latest (x20) takes ... real 1m47.657s user 0m0.284s sys 0m0.224s docker pull localhost:5000/golang:latest (x20) takes ... real 0m6.381s user 0m0.488s sys 0m0.358s ``` With this enhancement: ```bash ➜ /tmp sudo bash testing.sh "localhost:5000/redis:latest" 20 crictl pull localhost:5000/redis:latest (x20) takes ... real 0m1.140s user 0m0.243s sys 0m0.178s docker pull localhost:5000/redis:latest (x20) takes ... real 0m1.239s user 0m0.463s sys 0m0.275s ➜ /tmp sudo bash testing.sh "localhost:5000/golang:latest" 20 crictl pull localhost:5000/golang:latest (x20) takes ... real 0m5.546s user 0m0.217s sys 0m0.219s docker pull localhost:5000/golang:latest (x20) takes ... real 0m6.090s user 0m0.501s sys 0m0.331s ``` Test Script: localhost:5000/{redis|golang}:latest is equal to docker.io/library/{redis|golang}:latest. The image is hold in local registry service by `docker run -d -p 5000:5000 --name registry registry:2`. 
```bash image_name="${1}" pull_times="${2:-10}" cleanup() { ctr image rmi "${image_name}" ctr -n k8s.io image rmi "${image_name}" crictl rmi "${image_name}" docker rmi "${image_name}" sleep 2 } crictl_testing() { for idx in $(seq 1 ${pull_times}); do crictl pull "${image_name}" > /dev/null 2>&1 & done wait } docker_testing() { for idx in $(seq 1 ${pull_times}); do docker pull "${image_name}" > /dev/null 2>&1 & done wait } cleanup > /dev/null 2>&1 echo 3 > /proc/sys/vm/drop_caches sleep 3 echo "crictl pull $image_name (x${pull_times}) takes ..." time crictl_testing echo echo 3 > /proc/sys/vm/drop_caches sleep 3 echo "docker pull $image_name (x${pull_times}) takes ..." time docker_testing ``` Fixes: containerd#4937 Close: containerd#4985 Close: containerd#6318 Signed-off-by: Wei Fu <fuweid89@gmail.com>
Background: With current design, the content backend uses key-lock for long-lived write transaction. If the content reference has been marked for write transaction, the other requestes on the same reference will fail fast with unavailable error. Since the metadata plugin is based on boltbd which only supports single-writer, the content backend can't block or handle the request too long. It requires the client to handle retry by itself, like OpenWriter - backoff retry helper. But the maximum retry interval can be up to 2 seconds. If there are several concurrent requestes fo the same image, the waiters maybe wakeup at the same time and there is only one waiter can continue. A lot of waiters will get into sleep and we will take long time to finish all the pulling jobs and be worse if the image has many more layers, which mentioned in issue containerd#4937. After fetching, containerd.Pull API allows several hanlers to commit same ChainID snapshotter but only one can be done successfully. Since unpack tar.gz is time-consuming job, it can impact the performance on unpacking for same ChainID snapshotter in parallel. For instance, the Request 2 doesn't need to prepare and commit, it should just wait for Request 1 finish, which mentioned in pull request containerd#6318. ```text Request 1 Request 2 Prepare | | | | Prepare Commit | | | | Commit(failed on exist) ``` Both content backoff retry and unnecessary unpack impacts the performance. Solution: Introduced the duplicate suppression in fetch and unpack context. The deplicate suppression uses key-mutex and single-waiter-notify to support singleflight. The caller can use the duplicate suppression in different PullImage handlers so that we can avoid unnecessary unpack and spin-lock in OpenWriter. Test Result: Before enhancement: ```bash ➜ /tmp sudo bash testing.sh "localhost:5000/redis:latest" 20 crictl pull localhost:5000/redis:latest (x20) takes ... real 1m6.172s user 0m0.268s sys 0m0.193s docker pull localhost:5000/redis:latest (x20) takes ... real 0m1.324s user 0m0.441s sys 0m0.316s ➜ /tmp sudo bash testing.sh "localhost:5000/golang:latest" 20 crictl pull localhost:5000/golang:latest (x20) takes ... real 1m47.657s user 0m0.284s sys 0m0.224s docker pull localhost:5000/golang:latest (x20) takes ... real 0m6.381s user 0m0.488s sys 0m0.358s ``` With this enhancement: ```bash ➜ /tmp sudo bash testing.sh "localhost:5000/redis:latest" 20 crictl pull localhost:5000/redis:latest (x20) takes ... real 0m1.140s user 0m0.243s sys 0m0.178s docker pull localhost:5000/redis:latest (x20) takes ... real 0m1.239s user 0m0.463s sys 0m0.275s ➜ /tmp sudo bash testing.sh "localhost:5000/golang:latest" 20 crictl pull localhost:5000/golang:latest (x20) takes ... real 0m5.546s user 0m0.217s sys 0m0.219s docker pull localhost:5000/golang:latest (x20) takes ... real 0m6.090s user 0m0.501s sys 0m0.331s ``` Test Script: localhost:5000/{redis|golang}:latest is equal to docker.io/library/{redis|golang}:latest. The image is hold in local registry service by `docker run -d -p 5000:5000 --name registry registry:2`. 
```bash image_name="${1}" pull_times="${2:-10}" cleanup() { ctr image rmi "${image_name}" ctr -n k8s.io image rmi "${image_name}" crictl rmi "${image_name}" docker rmi "${image_name}" sleep 2 } crictl_testing() { for idx in $(seq 1 ${pull_times}); do crictl pull "${image_name}" > /dev/null 2>&1 & done wait } docker_testing() { for idx in $(seq 1 ${pull_times}); do docker pull "${image_name}" > /dev/null 2>&1 & done wait } cleanup > /dev/null 2>&1 echo 3 > /proc/sys/vm/drop_caches sleep 3 echo "crictl pull $image_name (x${pull_times}) takes ..." time crictl_testing echo echo 3 > /proc/sys/vm/drop_caches sleep 3 echo "docker pull $image_name (x${pull_times}) takes ..." time docker_testing ``` Fixes: containerd#4937 Close: containerd#4985 Close: containerd#6318 Signed-off-by: Wei Fu <fuweid89@gmail.com>
Background: With current design, the content backend uses key-lock for long-lived write transaction. If the content reference has been marked for write transaction, the other requestes on the same reference will fail fast with unavailable error. Since the metadata plugin is based on boltbd which only supports single-writer, the content backend can't block or handle the request too long. It requires the client to handle retry by itself, like OpenWriter - backoff retry helper. But the maximum retry interval can be up to 2 seconds. If there are several concurrent requestes fo the same image, the waiters maybe wakeup at the same time and there is only one waiter can continue. A lot of waiters will get into sleep and we will take long time to finish all the pulling jobs and be worse if the image has many more layers, which mentioned in issue containerd#4937. After fetching, containerd.Pull API allows several hanlers to commit same ChainID snapshotter but only one can be done successfully. Since unpack tar.gz is time-consuming job, it can impact the performance on unpacking for same ChainID snapshotter in parallel. For instance, the Request 2 doesn't need to prepare and commit, it should just wait for Request 1 finish, which mentioned in pull request containerd#6318. ```text Request 1 Request 2 Prepare | | | | Prepare Commit | | | | Commit(failed on exist) ``` Both content backoff retry and unnecessary unpack impacts the performance. Solution: Introduced the duplicate suppression in fetch and unpack context. The deplicate suppression uses key-mutex and single-waiter-notify to support singleflight. The caller can use the duplicate suppression in different PullImage handlers so that we can avoid unnecessary unpack and spin-lock in OpenWriter. Test Result: Before enhancement: ```bash ➜ /tmp sudo bash testing.sh "localhost:5000/redis:latest" 20 crictl pull localhost:5000/redis:latest (x20) takes ... real 1m6.172s user 0m0.268s sys 0m0.193s docker pull localhost:5000/redis:latest (x20) takes ... real 0m1.324s user 0m0.441s sys 0m0.316s ➜ /tmp sudo bash testing.sh "localhost:5000/golang:latest" 20 crictl pull localhost:5000/golang:latest (x20) takes ... real 1m47.657s user 0m0.284s sys 0m0.224s docker pull localhost:5000/golang:latest (x20) takes ... real 0m6.381s user 0m0.488s sys 0m0.358s ``` With this enhancement: ```bash ➜ /tmp sudo bash testing.sh "localhost:5000/redis:latest" 20 crictl pull localhost:5000/redis:latest (x20) takes ... real 0m1.140s user 0m0.243s sys 0m0.178s docker pull localhost:5000/redis:latest (x20) takes ... real 0m1.239s user 0m0.463s sys 0m0.275s ➜ /tmp sudo bash testing.sh "localhost:5000/golang:latest" 20 crictl pull localhost:5000/golang:latest (x20) takes ... real 0m5.546s user 0m0.217s sys 0m0.219s docker pull localhost:5000/golang:latest (x20) takes ... real 0m6.090s user 0m0.501s sys 0m0.331s ``` Test Script: localhost:5000/{redis|golang}:latest is equal to docker.io/library/{redis|golang}:latest. The image is hold in local registry service by `docker run -d -p 5000:5000 --name registry registry:2`. 
```bash image_name="${1}" pull_times="${2:-10}" cleanup() { ctr image rmi "${image_name}" ctr -n k8s.io image rmi "${image_name}" crictl rmi "${image_name}" docker rmi "${image_name}" sleep 2 } crictl_testing() { for idx in $(seq 1 ${pull_times}); do crictl pull "${image_name}" > /dev/null 2>&1 & done wait } docker_testing() { for idx in $(seq 1 ${pull_times}); do docker pull "${image_name}" > /dev/null 2>&1 & done wait } cleanup > /dev/null 2>&1 echo 3 > /proc/sys/vm/drop_caches sleep 3 echo "crictl pull $image_name (x${pull_times}) takes ..." time crictl_testing echo echo 3 > /proc/sys/vm/drop_caches sleep 3 echo "docker pull $image_name (x${pull_times}) takes ..." time docker_testing ``` Fixes: containerd#4937 Close: containerd#4985 Close: containerd#6318 Signed-off-by: Wei Fu <fuweid89@gmail.com>
Background: With current design, the content backend uses key-lock for long-lived write transaction. If the content reference has been marked for write transaction, the other requestes on the same reference will fail fast with unavailable error. Since the metadata plugin is based on boltbd which only supports single-writer, the content backend can't block or handle the request too long. It requires the client to handle retry by itself, like OpenWriter - backoff retry helper. But the maximum retry interval can be up to 2 seconds. If there are several concurrent requestes fo the same image, the waiters maybe wakeup at the same time and there is only one waiter can continue. A lot of waiters will get into sleep and we will take long time to finish all the pulling jobs and be worse if the image has many more layers, which mentioned in issue containerd#4937. After fetching, containerd.Pull API allows several hanlers to commit same ChainID snapshotter but only one can be done successfully. Since unpack tar.gz is time-consuming job, it can impact the performance on unpacking for same ChainID snapshotter in parallel. For instance, the Request 2 doesn't need to prepare and commit, it should just wait for Request 1 finish, which mentioned in pull request containerd#6318. ```text Request 1 Request 2 Prepare | | | | Prepare Commit | | | | Commit(failed on exist) ``` Both content backoff retry and unnecessary unpack impacts the performance. Solution: Introduced the duplicate suppression in fetch and unpack context. The deplicate suppression uses key-mutex and single-waiter-notify to support singleflight. The caller can use the duplicate suppression in different PullImage handlers so that we can avoid unnecessary unpack and spin-lock in OpenWriter. Test Result: Before enhancement: ```bash ➜ /tmp sudo bash testing.sh "localhost:5000/redis:latest" 20 crictl pull localhost:5000/redis:latest (x20) takes ... real 1m6.172s user 0m0.268s sys 0m0.193s docker pull localhost:5000/redis:latest (x20) takes ... real 0m1.324s user 0m0.441s sys 0m0.316s ➜ /tmp sudo bash testing.sh "localhost:5000/golang:latest" 20 crictl pull localhost:5000/golang:latest (x20) takes ... real 1m47.657s user 0m0.284s sys 0m0.224s docker pull localhost:5000/golang:latest (x20) takes ... real 0m6.381s user 0m0.488s sys 0m0.358s ``` With this enhancement: ```bash ➜ /tmp sudo bash testing.sh "localhost:5000/redis:latest" 20 crictl pull localhost:5000/redis:latest (x20) takes ... real 0m1.140s user 0m0.243s sys 0m0.178s docker pull localhost:5000/redis:latest (x20) takes ... real 0m1.239s user 0m0.463s sys 0m0.275s ➜ /tmp sudo bash testing.sh "localhost:5000/golang:latest" 20 crictl pull localhost:5000/golang:latest (x20) takes ... real 0m5.546s user 0m0.217s sys 0m0.219s docker pull localhost:5000/golang:latest (x20) takes ... real 0m6.090s user 0m0.501s sys 0m0.331s ``` Test Script: localhost:5000/{redis|golang}:latest is equal to docker.io/library/{redis|golang}:latest. The image is hold in local registry service by `docker run -d -p 5000:5000 --name registry registry:2`. 
```bash image_name="${1}" pull_times="${2:-10}" cleanup() { ctr image rmi "${image_name}" ctr -n k8s.io image rmi "${image_name}" crictl rmi "${image_name}" docker rmi "${image_name}" sleep 2 } crictl_testing() { for idx in $(seq 1 ${pull_times}); do crictl pull "${image_name}" > /dev/null 2>&1 & done wait } docker_testing() { for idx in $(seq 1 ${pull_times}); do docker pull "${image_name}" > /dev/null 2>&1 & done wait } cleanup > /dev/null 2>&1 echo 3 > /proc/sys/vm/drop_caches sleep 3 echo "crictl pull $image_name (x${pull_times}) takes ..." time crictl_testing echo echo 3 > /proc/sys/vm/drop_caches sleep 3 echo "docker pull $image_name (x${pull_times}) takes ..." time docker_testing ``` Fixes: containerd#4937 Close: containerd#4985 Close: containerd#6318 Signed-off-by: Wei Fu <fuweid89@gmail.com>
Background: With current design, the content backend uses key-lock for long-lived write transaction. If the content reference has been marked for write transaction, the other requestes on the same reference will fail fast with unavailable error. Since the metadata plugin is based on boltbd which only supports single-writer, the content backend can't block or handle the request too long. It requires the client to handle retry by itself, like OpenWriter - backoff retry helper. But the maximum retry interval can be up to 2 seconds. If there are several concurrent requestes fo the same image, the waiters maybe wakeup at the same time and there is only one waiter can continue. A lot of waiters will get into sleep and we will take long time to finish all the pulling jobs and be worse if the image has many more layers, which mentioned in issue containerd#4937. After fetching, containerd.Pull API allows several hanlers to commit same ChainID snapshotter but only one can be done successfully. Since unpack tar.gz is time-consuming job, it can impact the performance on unpacking for same ChainID snapshotter in parallel. For instance, the Request 2 doesn't need to prepare and commit, it should just wait for Request 1 finish, which mentioned in pull request containerd#6318. ```text Request 1 Request 2 Prepare | | | | Prepare Commit | | | | Commit(failed on exist) ``` Both content backoff retry and unnecessary unpack impacts the performance. Solution: Introduced the duplicate suppression in fetch and unpack context. The deplicate suppression uses key-mutex and single-waiter-notify to support singleflight. The caller can use the duplicate suppression in different PullImage handlers so that we can avoid unnecessary unpack and spin-lock in OpenWriter. Test Result: Before enhancement: ```bash ➜ /tmp sudo bash testing.sh "localhost:5000/redis:latest" 20 crictl pull localhost:5000/redis:latest (x20) takes ... real 1m6.172s user 0m0.268s sys 0m0.193s docker pull localhost:5000/redis:latest (x20) takes ... real 0m1.324s user 0m0.441s sys 0m0.316s ➜ /tmp sudo bash testing.sh "localhost:5000/golang:latest" 20 crictl pull localhost:5000/golang:latest (x20) takes ... real 1m47.657s user 0m0.284s sys 0m0.224s docker pull localhost:5000/golang:latest (x20) takes ... real 0m6.381s user 0m0.488s sys 0m0.358s ``` With this enhancement: ```bash ➜ /tmp sudo bash testing.sh "localhost:5000/redis:latest" 20 crictl pull localhost:5000/redis:latest (x20) takes ... real 0m1.140s user 0m0.243s sys 0m0.178s docker pull localhost:5000/redis:latest (x20) takes ... real 0m1.239s user 0m0.463s sys 0m0.275s ➜ /tmp sudo bash testing.sh "localhost:5000/golang:latest" 20 crictl pull localhost:5000/golang:latest (x20) takes ... real 0m5.546s user 0m0.217s sys 0m0.219s docker pull localhost:5000/golang:latest (x20) takes ... real 0m6.090s user 0m0.501s sys 0m0.331s ``` Test Script: localhost:5000/{redis|golang}:latest is equal to docker.io/library/{redis|golang}:latest. The image is hold in local registry service by `docker run -d -p 5000:5000 --name registry registry:2`. 
```bash image_name="${1}" pull_times="${2:-10}" cleanup() { ctr image rmi "${image_name}" ctr -n k8s.io image rmi "${image_name}" crictl rmi "${image_name}" docker rmi "${image_name}" sleep 2 } crictl_testing() { for idx in $(seq 1 ${pull_times}); do crictl pull "${image_name}" > /dev/null 2>&1 & done wait } docker_testing() { for idx in $(seq 1 ${pull_times}); do docker pull "${image_name}" > /dev/null 2>&1 & done wait } cleanup > /dev/null 2>&1 echo 3 > /proc/sys/vm/drop_caches sleep 3 echo "crictl pull $image_name (x${pull_times}) takes ..." time crictl_testing echo echo 3 > /proc/sys/vm/drop_caches sleep 3 echo "docker pull $image_name (x${pull_times}) takes ..." time docker_testing ``` Fixes: containerd#4937 Close: containerd#4985 Close: containerd#6318 Signed-off-by: Wei Fu <fuweid89@gmail.com>
Background: With current design, the content backend uses key-lock for long-lived write transaction. If the content reference has been marked for write transaction, the other requestes on the same reference will fail fast with unavailable error. Since the metadata plugin is based on boltbd which only supports single-writer, the content backend can't block or handle the request too long. It requires the client to handle retry by itself, like OpenWriter - backoff retry helper. But the maximum retry interval can be up to 2 seconds. If there are several concurrent requestes fo the same image, the waiters maybe wakeup at the same time and there is only one waiter can continue. A lot of waiters will get into sleep and we will take long time to finish all the pulling jobs and be worse if the image has many more layers, which mentioned in issue containerd#4937. After fetching, containerd.Pull API allows several hanlers to commit same ChainID snapshotter but only one can be done successfully. Since unpack tar.gz is time-consuming job, it can impact the performance on unpacking for same ChainID snapshotter in parallel. For instance, the Request 2 doesn't need to prepare and commit, it should just wait for Request 1 finish, which mentioned in pull request containerd#6318. ```text Request 1 Request 2 Prepare | | | | Prepare Commit | | | | Commit(failed on exist) ``` Both content backoff retry and unnecessary unpack impacts the performance. Solution: Introduced the duplicate suppression in fetch and unpack context. The deplicate suppression uses key-mutex and single-waiter-notify to support singleflight. The caller can use the duplicate suppression in different PullImage handlers so that we can avoid unnecessary unpack and spin-lock in OpenWriter. Test Result: Before enhancement: ```bash ➜ /tmp sudo bash testing.sh "localhost:5000/redis:latest" 20 crictl pull localhost:5000/redis:latest (x20) takes ... real 1m6.172s user 0m0.268s sys 0m0.193s docker pull localhost:5000/redis:latest (x20) takes ... real 0m1.324s user 0m0.441s sys 0m0.316s ➜ /tmp sudo bash testing.sh "localhost:5000/golang:latest" 20 crictl pull localhost:5000/golang:latest (x20) takes ... real 1m47.657s user 0m0.284s sys 0m0.224s docker pull localhost:5000/golang:latest (x20) takes ... real 0m6.381s user 0m0.488s sys 0m0.358s ``` With this enhancement: ```bash ➜ /tmp sudo bash testing.sh "localhost:5000/redis:latest" 20 crictl pull localhost:5000/redis:latest (x20) takes ... real 0m1.140s user 0m0.243s sys 0m0.178s docker pull localhost:5000/redis:latest (x20) takes ... real 0m1.239s user 0m0.463s sys 0m0.275s ➜ /tmp sudo bash testing.sh "localhost:5000/golang:latest" 20 crictl pull localhost:5000/golang:latest (x20) takes ... real 0m5.546s user 0m0.217s sys 0m0.219s docker pull localhost:5000/golang:latest (x20) takes ... real 0m6.090s user 0m0.501s sys 0m0.331s ``` Test Script: localhost:5000/{redis|golang}:latest is equal to docker.io/library/{redis|golang}:latest. The image is hold in local registry service by `docker run -d -p 5000:5000 --name registry registry:2`. 
```bash image_name="${1}" pull_times="${2:-10}" cleanup() { ctr image rmi "${image_name}" ctr -n k8s.io image rmi "${image_name}" crictl rmi "${image_name}" docker rmi "${image_name}" sleep 2 } crictl_testing() { for idx in $(seq 1 ${pull_times}); do crictl pull "${image_name}" > /dev/null 2>&1 & done wait } docker_testing() { for idx in $(seq 1 ${pull_times}); do docker pull "${image_name}" > /dev/null 2>&1 & done wait } cleanup > /dev/null 2>&1 echo 3 > /proc/sys/vm/drop_caches sleep 3 echo "crictl pull $image_name (x${pull_times}) takes ..." time crictl_testing echo echo 3 > /proc/sys/vm/drop_caches sleep 3 echo "docker pull $image_name (x${pull_times}) takes ..." time docker_testing ``` Fixes: containerd#4937 Close: containerd#4985 Close: containerd#6318 Signed-off-by: Wei Fu <fuweid89@gmail.com>
Background: With current design, the content backend uses key-lock for long-lived write transaction. If the content reference has been marked for write transaction, the other requestes on the same reference will fail fast with unavailable error. Since the metadata plugin is based on boltbd which only supports single-writer, the content backend can't block or handle the request too long. It requires the client to handle retry by itself, like OpenWriter - backoff retry helper. But the maximum retry interval can be up to 2 seconds. If there are several concurrent requestes fo the same image, the waiters maybe wakeup at the same time and there is only one waiter can continue. A lot of waiters will get into sleep and we will take long time to finish all the pulling jobs and be worse if the image has many more layers, which mentioned in issue containerd#4937. After fetching, containerd.Pull API allows several hanlers to commit same ChainID snapshotter but only one can be done successfully. Since unpack tar.gz is time-consuming job, it can impact the performance on unpacking for same ChainID snapshotter in parallel. For instance, the Request 2 doesn't need to prepare and commit, it should just wait for Request 1 finish, which mentioned in pull request containerd#6318. ```text Request 1 Request 2 Prepare | | | | Prepare Commit | | | | Commit(failed on exist) ``` Both content backoff retry and unnecessary unpack impacts the performance. Solution: Introduced the duplicate suppression in fetch and unpack context. The deplicate suppression uses key-mutex and single-waiter-notify to support singleflight. The caller can use the duplicate suppression in different PullImage handlers so that we can avoid unnecessary unpack and spin-lock in OpenWriter. Test Result: Before enhancement: ```bash ➜ /tmp sudo bash testing.sh "localhost:5000/redis:latest" 20 crictl pull localhost:5000/redis:latest (x20) takes ... real 1m6.172s user 0m0.268s sys 0m0.193s docker pull localhost:5000/redis:latest (x20) takes ... real 0m1.324s user 0m0.441s sys 0m0.316s ➜ /tmp sudo bash testing.sh "localhost:5000/golang:latest" 20 crictl pull localhost:5000/golang:latest (x20) takes ... real 1m47.657s user 0m0.284s sys 0m0.224s docker pull localhost:5000/golang:latest (x20) takes ... real 0m6.381s user 0m0.488s sys 0m0.358s ``` With this enhancement: ```bash ➜ /tmp sudo bash testing.sh "localhost:5000/redis:latest" 20 crictl pull localhost:5000/redis:latest (x20) takes ... real 0m1.140s user 0m0.243s sys 0m0.178s docker pull localhost:5000/redis:latest (x20) takes ... real 0m1.239s user 0m0.463s sys 0m0.275s ➜ /tmp sudo bash testing.sh "localhost:5000/golang:latest" 20 crictl pull localhost:5000/golang:latest (x20) takes ... real 0m5.546s user 0m0.217s sys 0m0.219s docker pull localhost:5000/golang:latest (x20) takes ... real 0m6.090s user 0m0.501s sys 0m0.331s ``` Test Script: localhost:5000/{redis|golang}:latest is equal to docker.io/library/{redis|golang}:latest. The image is hold in local registry service by `docker run -d -p 5000:5000 --name registry registry:2`. 
```bash image_name="${1}" pull_times="${2:-10}" cleanup() { ctr image rmi "${image_name}" ctr -n k8s.io image rmi "${image_name}" crictl rmi "${image_name}" docker rmi "${image_name}" sleep 2 } crictl_testing() { for idx in $(seq 1 ${pull_times}); do crictl pull "${image_name}" > /dev/null 2>&1 & done wait } docker_testing() { for idx in $(seq 1 ${pull_times}); do docker pull "${image_name}" > /dev/null 2>&1 & done wait } cleanup > /dev/null 2>&1 echo 3 > /proc/sys/vm/drop_caches sleep 3 echo "crictl pull $image_name (x${pull_times}) takes ..." time crictl_testing echo echo 3 > /proc/sys/vm/drop_caches sleep 3 echo "docker pull $image_name (x${pull_times}) takes ..." time docker_testing ``` Fixes: containerd#4937 Close: containerd#4985 Close: containerd#6318 Signed-off-by: Wei Fu <fuweid89@gmail.com>
Background: With current design, the content backend uses key-lock for long-lived write transaction. If the content reference has been marked for write transaction, the other requestes on the same reference will fail fast with unavailable error. Since the metadata plugin is based on boltbd which only supports single-writer, the content backend can't block or handle the request too long. It requires the client to handle retry by itself, like OpenWriter - backoff retry helper. But the maximum retry interval can be up to 2 seconds. If there are several concurrent requestes fo the same image, the waiters maybe wakeup at the same time and there is only one waiter can continue. A lot of waiters will get into sleep and we will take long time to finish all the pulling jobs and be worse if the image has many more layers, which mentioned in issue containerd#4937. After fetching, containerd.Pull API allows several hanlers to commit same ChainID snapshotter but only one can be done successfully. Since unpack tar.gz is time-consuming job, it can impact the performance on unpacking for same ChainID snapshotter in parallel. For instance, the Request 2 doesn't need to prepare and commit, it should just wait for Request 1 finish, which mentioned in pull request containerd#6318. ```text Request 1 Request 2 Prepare | | | | Prepare Commit | | | | Commit(failed on exist) ``` Both content backoff retry and unnecessary unpack impacts the performance. Solution: Introduced the duplicate suppression in fetch and unpack context. The deplicate suppression uses key-mutex and single-waiter-notify to support singleflight. The caller can use the duplicate suppression in different PullImage handlers so that we can avoid unnecessary unpack and spin-lock in OpenWriter. Test Result: Before enhancement: ```bash ➜ /tmp sudo bash testing.sh "localhost:5000/redis:latest" 20 crictl pull localhost:5000/redis:latest (x20) takes ... real 1m6.172s user 0m0.268s sys 0m0.193s docker pull localhost:5000/redis:latest (x20) takes ... real 0m1.324s user 0m0.441s sys 0m0.316s ➜ /tmp sudo bash testing.sh "localhost:5000/golang:latest" 20 crictl pull localhost:5000/golang:latest (x20) takes ... real 1m47.657s user 0m0.284s sys 0m0.224s docker pull localhost:5000/golang:latest (x20) takes ... real 0m6.381s user 0m0.488s sys 0m0.358s ``` With this enhancement: ```bash ➜ /tmp sudo bash testing.sh "localhost:5000/redis:latest" 20 crictl pull localhost:5000/redis:latest (x20) takes ... real 0m1.140s user 0m0.243s sys 0m0.178s docker pull localhost:5000/redis:latest (x20) takes ... real 0m1.239s user 0m0.463s sys 0m0.275s ➜ /tmp sudo bash testing.sh "localhost:5000/golang:latest" 20 crictl pull localhost:5000/golang:latest (x20) takes ... real 0m5.546s user 0m0.217s sys 0m0.219s docker pull localhost:5000/golang:latest (x20) takes ... real 0m6.090s user 0m0.501s sys 0m0.331s ``` Test Script: localhost:5000/{redis|golang}:latest is equal to docker.io/library/{redis|golang}:latest. The image is hold in local registry service by `docker run -d -p 5000:5000 --name registry registry:2`. 
```bash image_name="${1}" pull_times="${2:-10}" cleanup() { ctr image rmi "${image_name}" ctr -n k8s.io image rmi "${image_name}" crictl rmi "${image_name}" docker rmi "${image_name}" sleep 2 } crictl_testing() { for idx in $(seq 1 ${pull_times}); do crictl pull "${image_name}" > /dev/null 2>&1 & done wait } docker_testing() { for idx in $(seq 1 ${pull_times}); do docker pull "${image_name}" > /dev/null 2>&1 & done wait } cleanup > /dev/null 2>&1 echo 3 > /proc/sys/vm/drop_caches sleep 3 echo "crictl pull $image_name (x${pull_times}) takes ..." time crictl_testing echo echo 3 > /proc/sys/vm/drop_caches sleep 3 echo "docker pull $image_name (x${pull_times}) takes ..." time docker_testing ``` Fixes: containerd#4937 Close: containerd#4985 Close: containerd#6318 Signed-off-by: Wei Fu <fuweid89@gmail.com>
Background: With current design, the content backend uses key-lock for long-lived write transaction. If the content reference has been marked for write transaction, the other requestes on the same reference will fail fast with unavailable error. Since the metadata plugin is based on boltbd which only supports single-writer, the content backend can't block or handle the request too long. It requires the client to handle retry by itself, like OpenWriter - backoff retry helper. But the maximum retry interval can be up to 2 seconds. If there are several concurrent requestes fo the same image, the waiters maybe wakeup at the same time and there is only one waiter can continue. A lot of waiters will get into sleep and we will take long time to finish all the pulling jobs and be worse if the image has many more layers, which mentioned in issue containerd#4937. After fetching, containerd.Pull API allows several hanlers to commit same ChainID snapshotter but only one can be done successfully. Since unpack tar.gz is time-consuming job, it can impact the performance on unpacking for same ChainID snapshotter in parallel. For instance, the Request 2 doesn't need to prepare and commit, it should just wait for Request 1 finish, which mentioned in pull request containerd#6318. ```text Request 1 Request 2 Prepare | | | | Prepare Commit | | | | Commit(failed on exist) ``` Both content backoff retry and unnecessary unpack impacts the performance. Solution: Introduced the duplicate suppression in fetch and unpack context. The deplicate suppression uses key-mutex and single-waiter-notify to support singleflight. The caller can use the duplicate suppression in different PullImage handlers so that we can avoid unnecessary unpack and spin-lock in OpenWriter. Test Result: Before enhancement: ```bash ➜ /tmp sudo bash testing.sh "localhost:5000/redis:latest" 20 crictl pull localhost:5000/redis:latest (x20) takes ... real 1m6.172s user 0m0.268s sys 0m0.193s docker pull localhost:5000/redis:latest (x20) takes ... real 0m1.324s user 0m0.441s sys 0m0.316s ➜ /tmp sudo bash testing.sh "localhost:5000/golang:latest" 20 crictl pull localhost:5000/golang:latest (x20) takes ... real 1m47.657s user 0m0.284s sys 0m0.224s docker pull localhost:5000/golang:latest (x20) takes ... real 0m6.381s user 0m0.488s sys 0m0.358s ``` With this enhancement: ```bash ➜ /tmp sudo bash testing.sh "localhost:5000/redis:latest" 20 crictl pull localhost:5000/redis:latest (x20) takes ... real 0m1.140s user 0m0.243s sys 0m0.178s docker pull localhost:5000/redis:latest (x20) takes ... real 0m1.239s user 0m0.463s sys 0m0.275s ➜ /tmp sudo bash testing.sh "localhost:5000/golang:latest" 20 crictl pull localhost:5000/golang:latest (x20) takes ... real 0m5.546s user 0m0.217s sys 0m0.219s docker pull localhost:5000/golang:latest (x20) takes ... real 0m6.090s user 0m0.501s sys 0m0.331s ``` Test Script: localhost:5000/{redis|golang}:latest is equal to docker.io/library/{redis|golang}:latest. The image is hold in local registry service by `docker run -d -p 5000:5000 --name registry registry:2`. 
```bash image_name="${1}" pull_times="${2:-10}" cleanup() { ctr image rmi "${image_name}" ctr -n k8s.io image rmi "${image_name}" crictl rmi "${image_name}" docker rmi "${image_name}" sleep 2 } crictl_testing() { for idx in $(seq 1 ${pull_times}); do crictl pull "${image_name}" > /dev/null 2>&1 & done wait } docker_testing() { for idx in $(seq 1 ${pull_times}); do docker pull "${image_name}" > /dev/null 2>&1 & done wait } cleanup > /dev/null 2>&1 echo 3 > /proc/sys/vm/drop_caches sleep 3 echo "crictl pull $image_name (x${pull_times}) takes ..." time crictl_testing echo echo 3 > /proc/sys/vm/drop_caches sleep 3 echo "docker pull $image_name (x${pull_times}) takes ..." time docker_testing ``` Fixes: containerd#4937 Close: containerd#4985 Close: containerd#6318 Signed-off-by: Wei Fu <fuweid89@gmail.com>
The content backend uses a key lock for long-lived write transactions. Once
a content reference has been marked for a write transaction, other requests
on the same reference fail fast with an unavailable error.
Since the metadata plugin is based on boltdb, which only supports a single
writer, the backend can't block or hold the request for too long; it requires
the client to handle the retry by itself.
OpenWriter is one such retry helper for handling the unavailable error on
content write transactions. Its maximum retry interval can be up to 2
seconds. If there are several concurrent requests for the same image, it
takes a long time to finish all the pulling jobs, and it gets worse if the
image has many more layers.
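To make the cost concrete, here is a simplified sketch of such a backoff-and-retry helper. The function shape, the starting interval, and the 2-second cap are assumptions for illustration, not the exact containerd code; only `content.Ingester`, `content.WriterOpt`, and `errdefs.IsUnavailable` are real containerd APIs.

```go
package pullutil // hypothetical package for the sketch

import (
	"context"
	"time"

	"github.com/containerd/containerd/content"
	"github.com/containerd/containerd/errdefs"
)

// openWriter retries cs.Writer with exponential backoff while the
// content backend reports the reference as unavailable.
func openWriter(ctx context.Context, cs content.Ingester, opts ...content.WriterOpt) (content.Writer, error) {
	retry := 16 * time.Millisecond // assumed starting interval
	for {
		cw, err := cs.Writer(ctx, opts...)
		if err == nil {
			return cw, nil
		}
		if !errdefs.IsUnavailable(err) {
			return nil, err
		}
		// Every waiter sleeps for a blind interval; the wake-up is not
		// tied to the moment the write transaction is released, so many
		// goroutines can sit idle for up to the full 2 seconds.
		select {
		case <-time.After(retry):
		case <-ctx.Done():
			return nil, ctx.Err()
		}
		retry *= 2
		if retry > 2*time.Second {
			retry = 2 * time.Second
		}
	}
}
```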
In order to improve performance, use a weighted semaphore before the backoff
and retry in OpenWriter. While the write transaction for a reference is held,
goroutines fetching the same blob are parked in a waiting state; when the
transaction is committed or aborted, one of the waiting goroutines is
notified. This also saves CPU resources.
Since OpenWriter is used on the client side, the retry is still needed to
handle the unavailable error.
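A minimal sketch of that keyed weighted-semaphore lock, assuming `golang.org/x/sync/semaphore`; the `refLock`/`lockItem` names and method shapes are illustrative, not the exact code in the patch:

```go
package contentlock // hypothetical package for the sketch

import (
	"context"
	"sync"

	"golang.org/x/sync/semaphore"
)

// lockItem tracks the write slot for one content reference. cnt counts
// the holder plus any parked waiters, so the map entry is removed only
// once the last goroutine is done with the key.
type lockItem struct {
	waiter *semaphore.Weighted
	cnt    int
}

type refLock struct {
	mu    sync.Mutex
	locks map[string]*lockItem
}

func newRefLock() *refLock {
	return &refLock{locks: make(map[string]*lockItem)}
}

// lock parks the caller until it owns the write slot for ref, instead
// of failing fast and sleeping through a backoff interval.
func (l *refLock) lock(ctx context.Context, ref string) error {
	l.mu.Lock()
	item, ok := l.locks[ref]
	if !ok {
		item = &lockItem{waiter: semaphore.NewWeighted(1)}
		l.locks[ref] = item
	}
	item.cnt++
	l.mu.Unlock()

	if err := item.waiter.Acquire(ctx, 1); err != nil {
		l.release(ref) // context canceled before the slot was acquired
		return err
	}
	return nil
}

// unlock wakes exactly one parked waiter, if any, then drops our count.
func (l *refLock) unlock(ref string) {
	l.mu.Lock()
	if item, ok := l.locks[ref]; ok {
		item.waiter.Release(1)
	}
	l.mu.Unlock()
	l.release(ref)
}

func (l *refLock) release(ref string) {
	l.mu.Lock()
	defer l.mu.Unlock()
	if item, ok := l.locks[ref]; ok {
		item.cnt--
		if item.cnt <= 0 {
			delete(l.locks, ref)
		}
	}
}
```

With something like this in front of the backoff loop, a waiter is woken as soon as the holder commits or aborts, so the remaining retry only has to cover the server-side unavailable window rather than a blind sleep.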
Test script:
localhost:5000/redis:latest is equal to docker.io/library/redis:latest.
The image is held in a local registry service started by
`docker run -d -p 5000:5000 --name registry registry:2`.
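For example, matching the test results above:

```bash
# pull the same image 20 times concurrently via crictl and docker
sudo bash testing.sh "localhost:5000/redis:latest" 20
```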
Test result (local):
Fixes: #4937
Signed-off-by: Wei Fu <fuweid89@gmail.com>