Retry dead servers a lot less often #340
|
How do these times compare with what synapse did before? |
|
I'm not sure I follow the rationale here... is it just to make the back-off less aggressive? What does the random factor actually achieve (given we don't have thundering herd problems here)? The actual bug we've been seeing is fast retries when trying to talk to tarpitted servers over federation - this is easily seen on matrix.org. Shouldn't we be fixing whatever that bug is rather than tuning the benign behaviour? I assume I'm missing something... |
|
@NegativeMjark The current times were:
@ara4n The current retry times (above) seem overly aggressive. If retrying servers every few seconds causes noticeable performance issues, then the current schedule will certainly exacerbate them. This is most certainly not an attempt to fix the bug, but rather something I noticed while trying to do so. Given that I have not yet been able to track down exactly what is causing said bug, taking a few minutes to fix this now (while I remember it) seemed prudent. The randomization is there because a) it's trivial to add and b) it does help spread subsequent retries out. Because we only retry hosts when we have something new to send them, retries will naturally batch up each time someone on that server sends a message into a large room (e.g. Matrix HQ). |
|
LGTM |
Retry dead servers a lot less often
The individual federation HTTP requests have a retry schedule of: 5s, 25s, 2m, 10m
We then only try the host after: 10m, 50m, 4h, 20h and then every 24h.
(All these values are subject to 80% - 140% fuzzing)
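
For illustration only, the host-level schedule in the commit message can be sketched as below. This is not the code from the PR: the function names, the constants and the multiplier of 5 are assumptions inferred from the quoted figures (10m, 50m, then roughly 4h and 20h, capped at 24h), and the 80% - 140% fuzz is modelled as a uniform random factor.

```python
import random

# Assumed constants, read off the commit message; not taken from the actual patch.
MIN_RETRY_INTERVAL_S = 10 * 60        # first host-level back-off: 10 minutes
MAX_RETRY_INTERVAL_S = 24 * 60 * 60   # cap: retry at most once every 24 hours
MULTIPLIER = 5                        # inferred: 10m -> 50m -> ~4h -> ~20h
FUZZ_LOW, FUZZ_HIGH = 0.8, 1.4        # the "80% - 140% fuzzing"


def nominal_interval(failures):
    """Un-fuzzed back-off in seconds after `failures` consecutive failures."""
    if failures < 1:
        raise ValueError("failures must be at least 1")
    return min(MIN_RETRY_INTERVAL_S * MULTIPLIER ** (failures - 1),
               MAX_RETRY_INTERVAL_S)


def fuzzed_interval(failures, rng=random):
    """Nominal back-off scaled by a uniform random factor in [0.8, 1.4]."""
    return nominal_interval(failures) * rng.uniform(FUZZ_LOW, FUZZ_HIGH)


if __name__ == "__main__":
    for n in range(1, 7):
        print(n, nominal_interval(n), round(fuzzed_interval(n)))
```

Fuzzing the nominal value on each attempt, rather than feeding the fuzzed value back into the next step, keeps the random factors from compounding across retries.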