
When downloading a zstd:chunked image from a docker registry backed by an s3 bucket, downloading the manifest causes downloading the entire layer (effectively the full layer blob is downloaded twice) #188


Description

@kbr-
  • AWS S3, and also minio, don't support multi-part range queries
  • the PutBlobPartial function (in storage/storage_dest.go in containers/image) relies on HTTP multi-part range queries to obtain some metadata from the end of the blob; it performs a multi-part range query with 2 ranges (perhaps to get the tar-split data and the chunked manifest data):
func (s *dockerImageSource) GetBlobAt(ctx context.Context, info types.BlobInfo, chunks []private.ImageSourceChunk) (chan io.ReadCloser, chan error, error) {
	headers := make(map[string][]string)

	rangeVals := make([]string, 0, len(chunks))
	lastFound := false
	for _, c := range chunks {
		if lastFound {
			return nil, nil, fmt.Errorf("internal error: another chunk requested after an until-EOF chunk")
		}
		// If the Length is set to -1, then request anything after the specified offset.
		if c.Length == math.MaxUint64 {
			lastFound = true
			rangeVals = append(rangeVals, fmt.Sprintf("%d-", c.Offset))
		} else {
			rangeVals = append(rangeVals, fmt.Sprintf("%d-%d", c.Offset, c.Offset+c.Length-1))
		}
	}

	headers["Range"] = []string{fmt.Sprintf("bytes=%s", strings.Join(rangeVals, ","))}
  • a docker registry started against s3/minio returns a 200 response instead of a 206 to such requests (see the request/response sketch after this list)
  • this causes the library to download the full layer blob (silently, without any warning 👌), just to discard 99.9% of it in order to obtain the metadata
	switch res.StatusCode {
	case http.StatusOK:
		// if the server replied with a 200 status code, convert the full body response to a series of
		// streams as it would have been done with 206.
		streams := make(chan io.ReadCloser)
		errs := make(chan error)
		go splitHTTP200ResponseToPartial(streams, errs, res.Body, chunks)
  • then it calculates missing chunks and downloads a large part of the blob again
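
To make the failure mode concrete, the exchange looks roughly like this (hypothetical offsets and digest; the Range header is the one built by the GetBlobAt code quoted above):

GET /v2/img/blobs/sha256:... HTTP/1.1
Range: bytes=724000000-724099999,724910000-

A registry that honors multi-part ranges answers with just the requested parts:

HTTP/1.1 206 Partial Content
Content-Type: multipart/byteranges; boundary=...

A registry backed by s3/minio answers with the entire blob instead:

HTTP/1.1 200 OK
Content-Length: 724927999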

Possibly related: #141


To reproduce, you can use the following docker compose setup.
docker-compose.yml:

networks:
  registry_net:
    driver: bridge

services:
  minio:
    image: minio/minio
    container_name: minio
    ports:
      - "9000:9000"
      - "9001:9001"
    volumes:
      - ./minio-data:/data
    environment:
      - MINIO_ROOT_USER=minioadmin
      - MINIO_ROOT_PASSWORD=minioadmin
    command: server /data --console-address ":9001"
    networks:
      registry_net:
        aliases:
          - minio.localhost

  registry:
    image: registry:2
    container_name: minio-registry
    ports:
      - "5000:5000"
    volumes:
      - ./registry-config.yml:/etc/docker/registry/config.yml
    command: serve /etc/docker/registry/config.yml
    depends_on:
      - minio
    networks:
      - registry_net

registry-config.yml:

version: 0.1
storage:
  s3:
    accesskey: minioadmin
    secretkey: minioadmin
    region: us-east-1
    regionendpoint: http://minio.localhost:9000
    bucket: docker-registry
    secure: false
    rootdirectory: /
  delete:
    enabled: true
http:
  addr: :5000

Create the directory ./minio-data and run docker compose up. Then use the minio client mc to create the bucket docker-registry inside minio (after setting up the mc alias reg):

mc mb reg/docker-registry
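
For reference, the alias setup could look like this (endpoint and credentials taken from the compose file above):

mc alias set reg http://localhost:9000 minioadmin minioadmin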

Pick any image and pull it to local storage with buildah. Then push it to the registry, compressing with zstd:chunked:

buildah push --tls-verify=false --compression-format zstd:chunked <image-id> docker://localhost:5000/img

then remove it from local storage, since we're going to pull it next:

buildah rmi <image-id>

Now, I was testing this with limited bandwidth using tc, and then the behavior is visible to the naked eye: when pulling the image, buildah gets "stuck" at 0% for a while at the end, even though a download is happening behind the scenes (e.g. visible in slurm). That's most likely the part where it's downloading the manifests (actually downloading the whole blob 🥲).

here's the script to limit bandwidth; run it with sudo in the same directory as docker-compose.yml:

#!/bin/bash

# Need to create and start virtual interface first:
# sudo ip link add name ifb0 type ifb
# sudo ip link set dev ifb0 up
#
# The virtual interface is used to turn $interface ingress traffic into egress traffic for ifb0

interface="$(docker compose config --format json | jq .networks.registry_net.name | xargs -I {} docker network ls -q --filter name={} --format br-\{\{.ID\}\})"

echo "Interface used by docker compose: $interface"

download_limit=100mbps

tc qdisc del dev "$interface" ingress 2>/dev/null
tc qdisc del dev "ifb0" root 2>/dev/null

# It's not possible to shape ingress, only police it
# So instead we reroute all ingress traffic into virtual interface "ifb0" on which it becomes egress
tc qdisc add dev "$interface" ingress
tc filter add dev "$interface" parent ffff: protocol ip u32 match u32 0 0 action mirred egress redirect dev ifb0

tc qdisc add dev ifb0 root handle 1: htb default 10
tc class add dev ifb0 parent 1: classid 1:1 htb rate "$download_limit"
tc class add dev ifb0 parent 1:1 classid 1:10 htb rate "$download_limit"

echo To remove: tc qdisc del dev "$interface" ingress
echo To remove: tc qdisc del dev "ifb0" root

(note that the script requires the ifb0 virtual interface to exist; see the comment inside the script)
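
For example (assuming the script is saved as limit.sh; the ip link commands are the ones from the script's own comment):

sudo ip link add name ifb0 type ifb
sudo ip link set dev ifb0 up
sudo ./limit.sh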
For my tests I was limiting to 100mbps and using a ~700MB image (compressed size in the registry):

$ skopeo inspect --tls-verify=false docker://localhost:5000/img --raw | jq -r '([.layers.[].size] | add)/1024/1024'
691.4367246627808
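
A rough sanity check on the numbers: in tc, mbps means megabytes per second (mbit would be megabits), so 100mbps lets a ~700MB blob through in roughly 7 seconds. Downloading the blob twice should therefore cost about 7 extra seconds, which matches the vanilla vs. patched timings below.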

Adjust the limits appropriately based on the size of the image you're using for testing.

Now pull (also remember to set enable_partial_images to true)
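
If partial pulls aren't enabled yet, a minimal sketch for containers-storage.conf (commonly /etc/containers/storage.conf; key names as in containers/storage's storage.conf):

[storage.options]
pull_options = {enable_partial_images = "true"}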

time ~/go/bin/buildah --log-level trace --tls-verify=false pull docker://localhost:5000/img

with vanilla buildah this took for me:

real    0m17.138s
user    0m12.764s
sys     0m9.426s

now we can apply a workaround which translates each multi-part range request into a sequence of single-part range requests (courtesy of @psarna):

diff --git a/vendor/github.com/containers/image/v5/storage/storage_dest.go b/vendor/github.com/containers/image/v5/storage/storage_dest.go
index 0af4523ff..360273633 100644
--- a/vendor/github.com/containers/image/v5/storage/storage_dest.go
+++ b/vendor/github.com/containers/image/v5/storage/storage_dest.go
@@ -323,20 +323,35 @@ type zstdFetcher struct {
 
 // GetBlobAt converts from chunked.GetBlobAt to BlobChunkAccessor.GetBlobAt.
 func (f *zstdFetcher) GetBlobAt(chunks []chunked.ImageSourceChunk) (chan io.ReadCloser, chan error, error) {
-       newChunks := make([]private.ImageSourceChunk, 0, len(chunks))
-       for _, v := range chunks {
-               i := private.ImageSourceChunk{
-                       Offset: v.Offset,
-                       Length: v.Length,
+       errs := make(chan error)
+       rc := make(chan io.ReadCloser)
+       logrus.Debug("chunks: ", chunks)
+       go func() {
+               for _, v := range chunks {
+                       i := private.ImageSourceChunk{
+                               Offset: v.Offset,
+                               Length: v.Length,
+                       }
+                       logrus.Debug("special careful single GetBlobAt to make docker registry work: ", i, " ", f.blobInfo)
+                       blobRc, blobErrs, blobErr := f.chunkAccessor.GetBlobAt(f.ctx, f.blobInfo, []private.ImageSourceChunk{i})
+                       if blobErr != nil {
+                               errs <- blobErr
+                               // GetBlobAt returns nil channels on error; break out instead
+                               // of ranging over them (ranging over a nil channel blocks forever)
+                               break
+                       }
+                       // channels transplant
+                       for r := range blobRc {
+                               rc <- r
+                       }
+                       for e := range blobErrs {
+                               errs <- e
+                       }
                }
-               newChunks = append(newChunks, i)
-       }
-       rc, errs, err := f.chunkAccessor.GetBlobAt(f.ctx, f.blobInfo, newChunks)
-       if _, ok := err.(private.BadPartialRequestError); ok {
-               err = chunked.ErrBadRequest{}
-       }
-       return rc, errs, err
-
+               close(rc)
+               close(errs)
+       }()
+       return rc, errs, nil
 }
 
 // PutBlobPartial attempts to create a blob using the data that is already present
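
(A note on the design choice: the workaround turns one multi-range round trip into a sequence of single-range requests, so there are more round trips, but each single-range request is answered with a proper 206 by the s3-backed registry, and only the requested bytes go over the wire.)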

redo the experiment. I saw:

real    0m9.759s
user    0m12.078s
sys     0m8.066s

yay, a ~2x speedup, consistent with the blob no longer being downloaded twice
