Large snapshot causes adding a new manager to fail #3113

@xinfengliu

A large snapshot (e.g. a few hundred MB) causes adding a new manager to fail.

An earlier fix, #2458, is not sufficient. There is also a SendTimeout, which appears to be hardcoded to 2 seconds, used in sendProcessMessage in manager/state/raft/transport/peer.go.

The issue is easy to reproduce with the following steps:

  • Create many large objects in swarm
        for i in $(seq 1 500)
        do
          dd if=/dev/urandom bs=900k count=1 2>/dev/null | docker config create foo${i} -
        done
  • Trigger snapshotting
        docker swarm update --snapshot-interval 1
        docker network create -d overlay dummy
        docker network rm dummy
        docker swarm update --snapshot-interval 10000
  • Verify the snapshot is big enough
        /var/lib/docker/swarm/raft/snap-v3-encrypted:
        -rw-r--r--. 1 root root 461774425 Jan 31 11:54 000000000000000b-000000000000042e.snap
  • Add a new manager node.
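
The snapshot size check in the third step can be scripted. The following is a minimal sketch, not part of the original report: the function name largest_snapshot_bytes is hypothetical, and it assumes the snapshot files end in .snap as in the listing above.

```shell
#!/bin/sh
# Hypothetical helper (not from the issue): print the size in bytes of the
# largest *.snap file in the given directory, or 0 if there is none.
largest_snapshot_bytes() {
    dir="$1"
    largest=0
    for f in "$dir"/*.snap; do
        # Skip the unexpanded glob pattern when no file matches.
        [ -f "$f" ] || continue
        # Arithmetic expansion strips any padding that wc may add.
        size=$(( $(wc -c < "$f") ))
        if [ "$size" -gt "$largest" ]; then
            largest=$size
        fi
    done
    echo "$largest"
}
```

Running largest_snapshot_bytes /var/lib/docker/swarm/raft/snap-v3-encrypted as root on the leader should print a value in the hundreds of millions of bytes before the new manager is added.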

The Docker logs then show an endless retry loop:

On the leader node:

Jan 31 11:57:50 centos7 dockerd[4644]: time="2023-01-31T11:57:50.651215634+08:00" level=error msg="error streaming message to peer" error=EOF
Jan 31 11:57:52 centos7 dockerd[4644]: time="2023-01-31T11:57:52.655983276+08:00" level=error msg="error streaming message to peer" error=EOF
Jan 31 11:57:54 centos7 dockerd[4644]: time="2023-01-31T11:57:54.660918294+08:00" level=error msg="error streaming message to peer" error=EOF

On the newly added manager node:

Jan 31 11:57:51 centos7-1 dockerd[1326]: time="2023-01-31T11:57:51.009851258+08:00" level=error msg="error while reading from stream" error="rpc error: code = DeadlineExceeded desc = context deadline exceeded"
Jan 31 11:57:53 centos7-1 dockerd[1326]: time="2023-01-31T11:57:53.014080429+08:00" level=error msg="error while reading from stream" error="rpc error: code = DeadlineExceeded desc = context deadline exceeded"
Jan 31 11:57:55 centos7-1 dockerd[1326]: time="2023-01-31T11:57:55.019443613+08:00" level=error msg="error while reading from stream" error="rpc error: code = DeadlineExceeded desc = context deadline exceeded"
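
The errors repeat every 2 seconds, which matches the 2-second timeout mentioned earlier. As a rough illustration, and assuming (the report suggests this but does not confirm it) that the whole snapshot has to be streamed within that timeout, the sketch below computes the minimum sustained transfer rate that would be needed. The function name min_rate_bytes_per_sec is hypothetical; the inputs are the snapshot size from the listing in the reproduction steps and the 2-second timeout.

```shell
#!/bin/sh
# Hypothetical helper (not from the issue): minimum sustained rate, in
# bytes per second, needed to send SIZE_BYTES within TIMEOUT_SECONDS.
# Uses integer division, so the result is rounded down.
min_rate_bytes_per_sec() {
    size_bytes="$1"
    timeout_seconds="$2"
    echo $(( size_bytes / timeout_seconds ))
}

# Snapshot size from the reproduction steps, 2-second timeout from the report.
min_rate_bytes_per_sec 461774425 2   # prints 230887212
```

That is roughly 230 MB/s, which is more than a 1 Gbit/s link (about 125 MB/s) can carry, so under this assumption each attempt would time out and be retried indefinitely.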
