rpc/lib/client: introduce jitter for ExponentialBackoff to avoid potential self DDOS vector #751

@odeke-em

So in lines

d := time.Duration(math.Exp2(float64(attempt)))
time.Sleep(d * time.Second)

we perform exponential backoff by the standard definition, which is 1<<n seconds, so every client that attempts to reconnect will retry after at least 1<<n seconds.
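To make the determinism concrete, here is a minimal sketch that wraps the two quoted lines in a helper and prints the schedule for two clients. The name `fixedBackoff` is hypothetical and not part of rpc/lib/client; it only returns the delay instead of sleeping.

```go
package main

import (
	"fmt"
	"math"
	"time"
)

// fixedBackoff mirrors the two lines quoted above: 2^attempt seconds
// with no randomness. The name is hypothetical, used for illustration.
func fixedBackoff(attempt int) time.Duration {
	return time.Duration(math.Exp2(float64(attempt))) * time.Second
}

func main() {
	// Two independent clients on the same attempt number always
	// compute the same delay, so their retries line up.
	for attempt := 0; attempt < 5; attempt++ {
		clientA := fixedBackoff(attempt)
		clientB := fixedBackoff(attempt)
		fmt.Printf("attempt %d: client A waits %v, client B waits %v\n",
			attempt, clientA, clientB)
	}
}
```

Any number of clients that were dropped together stay in lockstep across every attempt, because nothing in the computation differs between them.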

tl;dr: Let's add randomness to our system in order to avoid potentially DDOS-ing ourselves. It might sound far-fetched, but I've personally dealt with this. AWS, a prominent web services company, has also recognized such problems and published some advice at https://www.awsarchitectureblog.com/2015/03/backoff.html and, if I recall correctly, the ALOHA networking system of the 1970s also recognized these problems and incorporated randomness into its design.

Long-winded version

The current code is deterministic: if the WSServer were overloaded past capacity and, say, X clients all lost their connections at the same time, all X clients would reattempt at the exact same moments, which amounts to an unintentional DDOS.

I mention this because last year I built a video transcoding system for an internet company, in which one of the main micro-services ran on a very large, beefed-up machine (specialized for transcoding video to 3+ formats in parallel).
The transcoder (horizontally scaled on its own) was connected to by multiple clients and had a timeout as well as the obvious hardware limits. The problem was that when very many clients uploaded 4K videos to be transcoded to 3 formats in parallel, RAM usage would blow up and cause problems for many clients, and requests would be aborted due to timeouts. On retry, all the clients would then re-flood it at the same time, leading to a stalemate each time. The solution involved introducing jitter/randomness in the sleeps:

d := time.Duration(1e9*rand.Float64()) + ((1 << attempts) * time.Second)
time.Sleep(d)

and that eased the load a whole lot.
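As a sketch of how the same idea could be packaged for reuse, the helper below computes 2^attempt seconds plus up to one second of random jitter, following the formula above. The name `jitteredBackoff` is hypothetical, and the use of the global `math/rand` source is an assumption made for brevity.

```go
package main

import (
	"fmt"
	"math/rand"
	"time"
)

// jitteredBackoff returns 2^attempt seconds plus a random extra delay
// in [0, 1s). The name is hypothetical, used for illustration.
func jitteredBackoff(attempt int) time.Duration {
	base := time.Duration(1<<uint(attempt)) * time.Second
	jitter := time.Duration(rand.Float64() * float64(time.Second))
	return base + jitter
}

func main() {
	// Two clients on the same attempt now draw different delays,
	// so their retries no longer land at the same instant.
	for attempt := 0; attempt < 5; attempt++ {
		fmt.Printf("attempt %d: client A waits %v, client B waits %v\n",
			attempt, jitteredBackoff(attempt), jitteredBackoff(attempt))
	}
}
```

One caveat: on Go versions before 1.20 the global `math/rand` source is not seeded automatically, so each client process would need to seed it (for example from the current time), otherwise every client draws the same "random" sequence and the jitter has no effect. The one-second jitter window is simply what the formula above uses; a wider window that grows with the attempt number would spread retries further.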

Please feel free to adapt the jitter/randomness as necessary but the main point is that not every client should attempt to reconnect at the same time.
