De-flake MSC3030 backfill test with Synapse worker replication#492
Conversation
Spawning from matrix-org/synapse#14028 (comment) The reason they failed before was because we're fighting against stale caches across workers racing while waiting for invalidation, see matrix-org/synapse#13185 (comment) > Here is what happens: > > 1. `serverB` has `event1` stored as an `outlier` from previous requests (specifically from MSC3030 jump to date pulling in a missing `prev_event` after backfilling) > 1. Client on `serverB` calls `/messages?dir=b` > 1. `serverB:client_reader1` accepts the request and drives things > 1. `serverB:client_reader1` has some backward extremities in range and requests `/backfill` from `serverA` > 1. `serverB:client_reader1` processes the events from backfill including `event1` and puts them in the `_event_persist_queue` > 1. `serverB:master` picks up the events from the `_event_persist_queue` and persists them to the database, de-outliers `event1` and invalidates its own cache and sends them over replication > 1. `serverB:client_reader1` starts assembling the `/messages` response and gets `event1` out of the stale cache still as an `outlier` > 1. `serverB:client_reader1` responds to the `/messages` request without `event1` because `outliers` are filtered out > 1. `serverB:client_reader1` finally gets the replication data and invalidates its own cache for `event1` (too late, we already got the events from the stale cache and responded)
|
Fixes matrix-org/synapse#13944? |
| @@ -203,6 +203,16 @@ func TestJumpToDateEndpoint(t *testing.T) { | |||
| contextResResBody := client.ParseJSON(t, contextRes) | |||
There was a problem hiding this comment.
Fixes matrix-org/synapse#13944?
@squahtx I don't think this fixes that issue. The error there is:
msc3030_test.go:200: CSAPI.MustDoFunc response return non-2xx code: 403 Forbidden - body: {"errcode":"M_FORBIDDEN","error":"You don't have permission to access that event."}
which indicates we also need to retry the /context request above until it works
There was a problem hiding this comment.
Fixed in a separate PR: matrix-org/synapse#14096
DMRobertson
left a comment
There was a problem hiding this comment.
Is the fact that we have to do this a bug in Synapse? From the description it sounds like this is all happening within a single request to the client_reader.
Are there any spec guidelines on how /messages is supposed to behave WRT backfill?
I think this is more a discussion to be had on matrix-org/synapse#13185. The current Synapse behavior doesn't feel quite right but since there is no spec/guidelines, this could be assumed to be correct as well.
I don't think so. The spec just says this:
|
|
Thanks for the review @squahtx and @DMRobertson 🐦 |
De-flake MSC3030 backfill test with Synapse worker replication
Spawning from matrix-org/synapse#14028 (comment)
CI failure: https://github.com/matrix-org/synapse/actions/runs/3182998161/jobs/5189731097#step:6:15343 (from discussion)
Why did this test fail with workers before?
The reason they failed before was because we're fighting against stale caches across workers racing while waiting for invalidation, see matrix-org/synapse#14211
Dev notes
Complement test run failure with relevant debug logs