Retry dead servers a lot less often by erikjohnston · Pull Request #340 · matrix-org/synapse

erikjohnston · 2015-11-02T16:51:51Z

The individual federation HTTP requests have a retry schedule of: 5s, 25s, 2m, 10m

We then only try the host after: 10m, 50m, 4h, 20h and then every 24h.

(All these values are subject to 80% - 140% fuzzing)
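For readers following along, here is a minimal sketch of how such a schedule could be generated. It is not the code from this PR. The multiplier of 5 is inferred from the listed values (5s, 25s, 125s ≈ 2m, 625s ≈ 10m), and the names `base_intervals` and `fuzz` are hypothetical.

```python
import random

FUZZ_LOW, FUZZ_HIGH = 0.8, 1.4  # the 80% - 140% fuzzing range quoted above


def base_intervals(first, count, factor=5, cap=None):
    """Return `count` unfuzzed intervals in seconds, multiplying by `factor` each step.

    `factor=5` is an assumption inferred from the schedule quoted above.
    """
    intervals = []
    interval = first
    for _ in range(count):
        if cap is not None:
            interval = min(interval, cap)
        intervals.append(interval)
        interval *= factor
    return intervals


def fuzz(interval, rng=random):
    """Scale an interval by a random factor drawn uniformly from the fuzzing range."""
    return interval * rng.uniform(FUZZ_LOW, FUZZ_HIGH)


# Per-request schedule: 5s, 25s, ~2m, ~10m
http_schedule = base_intervals(5, 4)

# Per-host schedule: 10m, 50m, ~4h, ~20h, then capped at 24h
host_schedule = base_intervals(10 * 60, 5, cap=24 * 60 * 60)

if __name__ == "__main__":
    print([round(fuzz(i)) for i in http_schedule])
    print([round(fuzz(i)) for i in host_schedule])
```

Fuzzing is applied to each base interval separately rather than fed back into the next step, so the random factor does not compound across retries. That is a design choice of this sketch only.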

NegativeMjark · 2015-11-02T17:05:04Z

How do these times compare with what synapse did before?

ara4n · 2015-11-02T23:45:04Z

I'm not sure I follow the rationale here... is it just to make the back-off less aggressive? What does the random factor actually achieve (given we don't have thundering herd problems here)?

The actual bug we've been seeing is fast retries when trying to talk to tarpitted servers over federation - this is easily seen on matrix.org. Shouldn't we be fixing whatever that bug is rather than tuning the benign behaviour? I assume I'm missing something...

erikjohnston · 2015-11-03T10:42:17Z

@NegativeMjark The current times were:

1s, 2s, 4s, 8s, 16s for HTTP retries
5s, 10s, 20s, 40s, 1m20s, 2m40s, 5m, 10m, 21m, 42m and then every 1h for retrying the host.
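A rough way to compare the two host schedules is to count how many retries each would make against a host that stays down for a day. The sketch below assumes a retry fires exactly when each interval elapses and ignores fuzzing, which overstates real traffic because hosts are only retried when there is something to send. The helper names are hypothetical, and the factors (2 for the old schedule, 5 for the new one) are inferred from the values listed in this thread.

```python
def schedule(first, factor, cap):
    """Yield successive retry intervals in seconds, growing by `factor` up to `cap`."""
    interval = first
    while True:
        yield interval
        interval = min(interval * factor, cap)


def retries_within(window, intervals):
    """Count the retries that fire within `window` seconds of the first failure."""
    elapsed = 0
    count = 0
    for interval in intervals:
        elapsed += interval
        if elapsed > window:
            return count
        count += 1
    return count


DAY = 24 * 60 * 60

# Old host schedule: 5s doubling up to 1h. New host schedule: 10m times 5 up to 24h.
old_retries = retries_within(DAY, schedule(5, 2, 60 * 60))
new_retries = retries_within(DAY, schedule(10 * 60, 5, DAY))

if __name__ == "__main__":
    print("old schedule:", old_retries, "retries in the first day")
    print("new schedule:", new_retries, "retries in the first day")
```

Under these assumptions the old schedule makes 32 retries in the first day and the new one makes 3.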

@ara4n The current retry times (above) seem overly aggressive. If retrying servers every few seconds causes noticeable performance issues then the current schedule will certainly exacerbate them.

This is most certainly not an attempt to fix the bug, rather something which I noticed while doing so. Given I have not yet been able to track down exactly what is causing said bug, taking a few minutes to fix this now (while I remember it) seemed prudent.

The randomization is there because a) it's trivial to add and b) it does help spread subsequent retries out. Because we only retry hosts when we have something new to send them, retries will naturally batch up each time someone on that server sends a message into a large room (e.g. Matrix HQ).
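To illustrate point b), the sketch below models several dead hosts that all failed at the same moment, as would happen after one message into a large room, and computes when each becomes eligible for its next retry. It is an illustration only: `next_eligible_times` is a hypothetical helper and the uniform 80% - 140% draw is an assumption about how the fuzzing is applied.

```python
import random


def next_eligible_times(n_hosts, interval, fuzzed, seed=0):
    """Return the time in seconds at which each of `n_hosts` may next be retried.

    All hosts are assumed to have failed at t=0 with the same base interval.
    """
    rng = random.Random(seed)  # seeded so the example is repeatable
    times = []
    for _ in range(n_hosts):
        factor = rng.uniform(0.8, 1.4) if fuzzed else 1.0
        times.append(interval * factor)
    return times


plain = next_eligible_times(10, 600, fuzzed=False)
jittered = next_eligible_times(10, 600, fuzzed=True)

if __name__ == "__main__":
    print("without fuzzing, spread:", max(plain) - min(plain), "seconds")
    print("with fuzzing, spread:", round(max(jittered) - min(jittered)), "seconds")
```

Without fuzzing every host becomes eligible at the same instant, so the next outgoing message retries all of them together. With fuzzing the eligibility times fall anywhere between 480s and 840s, so a message sent in that window only retries some of them.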

NegativeMjark · 2015-11-05T16:14:06Z

LGTM

Retry dead servers a lot less often

eacb068

erikjohnston force-pushed the erikj/server_retries branch from 5dcda47 to eacb068 Compare November 2, 2015 16:56

erikjohnston assigned NegativeMjark Nov 5, 2015

erikjohnston added a commit that referenced this pull request Nov 5, 2015

Merge pull request #340 from matrix-org/erikj/server_retries

5bc6904

Retry dead servers a lot less often

erikjohnston merged commit 5bc6904 into develop Nov 5, 2015

erikjohnston deleted the erikj/server_retries branch November 19, 2015 16:33

richvdh mentioned this pull request Sep 12, 2019

Fix bug in calculating the federation retry backoff period #6025

Merged
