Skip to content

[Proposal] Streaming messages between manager nodes #2416

@anshulpundir

Description

@anshulpundir

Background

Swarmkit uses a replicated configuration store that relies on etcd/raft for consistently replicating changes to the members. Etcd/raft is an embeddable library that provides the raft algorithm, networking is implemented on top in swarmkit. Etcd/raft has a write-ahead log (WAL) for storing the most recent log entries, which is configurable in size, beyond which entries are flushed to a snapshot which is also used to backups. Entires flushed to a snapshot may get removed from the WAL. When a new node joins the quorum, WAL log entries are sent to it form the leader. Older entries may only be available in a snapshot, in which case the leader sends an entire snapshot to the new node.

Motivation

Swamkit uses gRPC for the networking between raft nodes. Raft snapshots can grow arbitrarily large in size, which needs to be handled by the networking when they need to be shipped across nodes. Currently, this is handled by having a large gRPC message size limit/timeout limit, but this is not a scalable solution. We would like to be able to stream large snapshots without arbitrarily large gRPC message limits and timeout out.

Proposal

The proposed fix is to provide networking support for shipping arbitrarily large objects between the raft nodes using gRPC streaming. On a high level, the sender will split up a large message into smaller messages, which will be streamed over gRPC and re-assembled on the receiver. The logic to split a large message into smaller messages will be implemented in the transport part of the raft component in swarmkit. Currently, etcd/raft spits out a list of messages to be sent to the raft peers. If there is a snapshot message to be sent, the raft transport will split up the snapshot::data, which is a byte array, into smaller byte arrays, which will be streamed as separate ProcessRaftMessageRequest messages to the peer. The receiving side will append the messages together and pass it along to the raft FSM on that node.

From a gRPC point of view, we will use client-side streaming. In this case, the client is the raft leader and the server is the raft peer. Note that gRPC streaming maintains ordering.

Backwards compatibility and error handling.

If the raft peer is on an older version which does not support streaming, the sender will handle the 'Unimplemented' grpc error code and not send the snapshot to the peer. We will need a way to report this error up, possibly including this in the docker node ls, docker node inspect outputs for the sender node

Metadata

Metadata

Assignees

Type

No type

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions