Description
Deadlock when concurrently pulling the same image.

Here is the timeline for a concurrent download of a container image: OpenWriter -> copy content -> content commit.

The deadlock:
Client A holds the layer ref lock and tries to acquire the bolt lock to commit.
Client B holds the bolt lock and tries to acquire the layer ref lock.

Because of this deadlock, the writer's tryLock here always ends up retrying its full 10 attempts and only fails after roughly a 600ms timeout. This makes the image download take much longer:
https://github.com/containerd/containerd/blob/main/content/local/store.go#L463
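To make the lock ordering concrete, below is a minimal, self-contained Go sketch of the stall (requires Go 1.18+ for sync.Mutex.TryLock). The mutexes, goroutine names, and backoff numbers are illustrative stand-ins for the bolt transaction lock and the per-ref lock, not containerd's actual code:

package main

import (
    "fmt"
    "math/rand"
    "sync"
    "time"
)

var (
    boltLock sync.Mutex // stands in for the metadata bolt transaction lock
    refLock  sync.Mutex // stands in for the per-ref lock in the local content store
)

// clientA models the first puller: it already holds the ref lock and now
// needs the bolt lock to commit the downloaded content.
func clientA(done chan<- string) {
    refLock.Lock()
    time.Sleep(10 * time.Millisecond) // give clientB time to grab the bolt lock
    boltLock.Lock()                   // blocks until clientB releases the bolt lock
    boltLock.Unlock()
    refLock.Unlock()
    done <- "clientA committed"
}

// clientB models the second puller: inside its bolt transaction it tries to
// open a writer for the same ref, retrying TryLock with randomized backoff.
func clientB(done chan<- string) {
    time.Sleep(5 * time.Millisecond)
    boltLock.Lock()
    start := time.Now()
    locked := false
    for i := 0; i < 10; i++ {
        if refLock.TryLock() { // never succeeds while clientA holds the ref lock
            locked = true
            break
        }
        time.Sleep(time.Duration(rand.Intn(1<<i)) * time.Millisecond)
    }
    if locked {
        refLock.Unlock()
    }
    boltLock.Unlock() // only now can clientA's commit proceed
    done <- fmt.Sprintf("clientB gave up after %v", time.Since(start))
}

func main() {
    done := make(chan string, 2)
    go clientA(done)
    go clientB(done)
    fmt.Println(<-done)
    fmt.Println(<-done)
}

Running this, clientB typically gives up only after several hundred milliseconds of backoff, and clientA's commit is blocked for that entire time.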
If my analysis is correct, here is my proposed solution: tryLock only once and immediately return an Unavailable error. That releases the bolt DB lock, and the caller then retries in content.OpenWriter:
https://github.com/containerd/containerd/blob/main/content/helpers.go#L115
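For reference, the caller-side retry this relies on can be sketched as follows. This is a paraphrase of the pattern behind the content.OpenWriter link above, not the exact upstream source; the package name, the helper name openWriterWithRetry, and the backoff values are illustrative:

package contentutil

import (
    "context"
    "time"

    "github.com/containerd/containerd/content"
    "github.com/containerd/containerd/errdefs"
)

// openWriterWithRetry sketches the caller-side retry: when Writer returns an
// Unavailable error, the bolt transaction is no longer held, so backing off
// and calling Writer again is safe and cheap.
func openWriterWithRetry(ctx context.Context, ing content.Ingester, opts ...content.WriterOpt) (content.Writer, error) {
    backoff := 16 * time.Millisecond
    for {
        cw, err := ing.Writer(ctx, opts...)
        if err == nil {
            return cw, nil
        }
        if !errdefs.IsUnavailable(err) {
            return nil, err // real failure, do not retry
        }
        select {
        case <-time.After(backoff):
            if backoff < time.Second {
                backoff *= 2 // exponential backoff between attempts
            }
        case <-ctx.Done():
            return nil, err
        }
    }
}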
Steps to reproduce the issue
- Concurrently pull the same new image:

for ((i=1;i<=10;i++));
do
    time nerdctl pull -q centos:7 &
done
wait

Each operation takes about 30s, while a single pull takes about 8s in my test environment:
➜ sh pull-image.sh
real 0m28.861s
user 0m0.059s
sys 0m0.004s
real 0m28.938s
user 0m0.037s
sys 0m0.023s
real 0m28.983s
user 0m0.043s
sys 0m0.020s
real 0m29.047s
user 0m0.054s
sys 0m0.020s
real 0m29.048s
user 0m0.217s
sys 0m0.139s
real 0m29.049s
user 0m0.041s
sys 0m0.020s
real 0m29.068s
user 0m0.058s
sys 0m0.010s
real 0m29.068s
user 0m0.054s
sys 0m0.016s
real 0m29.071s
user 0m0.048s
sys 0m0.019s
real 0m29.070s
user 0m0.043s
sys 0m0.016s

➜ nerdctl pull centos:7
docker.io/library/centos:7: resolved |++++++++++++++++++++++++++++++++++++++|
index-sha256:be65f488b7764ad3638f236b7b515b3678369a5124c47b8d32916d6487418ea4: done |++++++++++++++++++++++++++++++++++++++|
manifest-sha256:dead07b4d8ed7e29e98de0f4504d87e8880d4347859d839686a31da35a3b532f: done |++++++++++++++++++++++++++++++++++++++|
config-sha256:eeb6ee3f44bd0b5103bb561b4c16bcb82328cfe5809ab675bb17ab3a16c517c9: done |++++++++++++++++++++++++++++++++++++++|
layer-sha256:2d473b07cdd5f0912cd6f1a703352c82b512407db6b05b43f2553732b55df3bc: done |++++++++++++++++++++++++++++++++++++++|
elapsed: 8.5 s total: 72.6 M (8.5 MiB/s)

Describe the results you received and expected
Without retrying the ref lock in the local store's Writer:
func (s *store) Writer(ctx context.Context, opts ...content.WriterOpt) (content.Writer, error) {
    var wOpts content.WriterOpts
    for _, opt := range opts {
        if err := opt(&wOpts); err != nil {
            return nil, err
        }
    }
    // TODO(AkihiroSuda): we could create a random string or one calculated based on the context
    // https://github.com/containerd/containerd/issues/2129#issuecomment-380255019
    if wOpts.Ref == "" {
        return nil, fmt.Errorf("ref must not be empty: %w", errdefs.ErrInvalidArgument)
    }
    // tryLock only once: if the ref lock is held by another writer, return the
    // Unavailable error immediately so the caller can release the bolt lock
    // and retry via content.OpenWriter.
    if err := tryLock(wOpts.Ref); err != nil {
        return nil, err
    }
    // ... rest of the function unchanged
With that change it is much faster:
➜ sh pull-image.sh
real 0m8.602s
user 0m0.219s
sys 0m0.148s
real 0m12.127s
user 0m0.077s
sys 0m0.000s
real 0m12.199s
user 0m0.056s
sys 0m0.007s
real 0m12.199s
user 0m0.056s
sys 0m0.006s
real 0m12.224s
user 0m0.072s
sys 0m0.009s
real 0m12.224s
user 0m0.059s
sys 0m0.013s
real 0m12.225s
user 0m0.041s
sys 0m0.026s
real 0m12.305s
user 0m0.036s
sys 0m0.024s
real 0m12.306s
user 0m0.054s
sys 0m0.007s
real 0m12.306s
user 0m0.060s
sys 0m0.015s

What version of containerd are you using?
v1.6.8
Any other relevant information
The version I tested is 1.6.8, but the latest version should also have this problem. Thanks
Show configuration if it is related to CRI plugin.
No response