Skip to content

Handle missing devices in /keys/claim responses#2805

Merged
richvdh merged 7 commits intomainfrom
rav/claim_keys
Nov 9, 2023
Merged

Handle missing devices in /keys/claim responses#2805
richvdh merged 7 commits intomainfrom
rav/claim_keys

Conversation

@richvdh
Copy link
Member

@richvdh richvdh commented Nov 1, 2023

Keep a record of devices that were included in a /keys/claim request, and then, if they are missing in the response, register them as "failed".

Fixes #281.

This PR is currently based on #2803.

@richvdh richvdh requested a review from a team as a code owner November 1, 2023 16:39
@richvdh richvdh requested review from jplatte and removed request for a team November 1, 2023 16:39
Base automatically changed from rav/claim_keys_prep to main November 3, 2023 09:43
We can no longer randomly inject `/keys/claim` responses when there was no
pending request; instead we have to generate a request using
`get_missing_sessions`. *However*, the whole point is that Alice's device
should be "timed out", so will not appear in the generated request.

So, to test the timeout behaviour, we need to forcibly expire the timeout, and
*then* generate a request; and once the valid response is processed, check that
the timout is properly cleared.
... and make sure that any responses received match it.
Compare the response to `/keys/claim` with the request, so that we can see
which devices are absent in the response, and handle accordingly.
@codecov
Copy link

codecov bot commented Nov 3, 2023

Codecov Report

Attention: 7 lines in your changes are missing coverage. Please review.

Comparison is base (9ef6103) 82.01% compared to head (12f0102) 82.06%.
Report is 42 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2805      +/-   ##
==========================================
+ Coverage   82.01%   82.06%   +0.04%     
==========================================
  Files         204      204              
  Lines       21021    21043      +22     
==========================================
+ Hits        17240    17268      +28     
+ Misses       3781     3775       -6     
Files Coverage Δ
crates/matrix-sdk-crypto/src/machine.rs 84.75% <100.00%> (+0.02%) ⬆️
...x-sdk-crypto/src/session_manager/group_sessions.rs 96.21% <ø> (ø)
crates/matrix-sdk-crypto/src/utilities.rs 97.56% <100.00%> (+0.59%) ⬆️
.../matrix-sdk-crypto/src/session_manager/sessions.rs 91.35% <89.06%> (-0.51%) ⬇️

... and 2 files with indirect coverage changes

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@richvdh richvdh requested a review from poljar November 8, 2023 15:42
Copy link
Contributor

@poljar poljar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, left a couple of suggestions.

// empty map, this will yield `None`.
//
// The `None` entries are then filtered out by `filter_map`.
key_map.values().next().map(|key| ((user_id.clone(), device_id.clone()), key))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
key_map.values().next().map(|key| ((user_id.clone(), device_id.clone()), key))
key_map.values().next().map(|key| ((user_id.to_owned(), device_id.to_owned()), key))

// if we were able to pair this response with a request, look for devices that
// were present in the request but did not elicit a successful response.
if let Some(request) = request {
let missing_devices: Vec<(OwnedUserId, OwnedDeviceId)> = request
Copy link
Contributor

@poljar poljar Nov 9, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that this whole thing can be better expressed via sets, something like:

let sessions_to_create: BTreeSet<_> =
    sessions_to_create.iter().map(|((u, d), _)| (u, d)).collect();

let requested_sessions: BTreeSet<(_, _)> = request
    .one_time_keys
    .iter()
    .filter(|(user_id, _)| !failed_servers.contains(user_id.server_name()))
    .flat_map(|(u, d)| d.keys().map(|d| (u, d)).collect::<BTreeSet<_>>())
    .collect();

let missing_devices: BTreeSet<_> = requested_sessions.difference(&sessions_to_create).collect();

*edit: Perhaps even move the filter() for the failed servers on the let missing_devices line.

You'll have to deal with double borrows a bit further bellow, but I think the clarity is worth it:

for (user_id, device_id) in missing_devices {
    failed_devices_lock
        .entry((*user_id).to_owned())
        .or_default()
        .insert((*device_id).to_owned())
}

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, happy to do this. The reason I avoided it was that it felt a bit redundant to build the two sets just so that we could call .difference on them, particularly given we already have the requested_sessions in a map whose keys we can check in O(n log n) (ie, the same as the BTreeSet).

I guess using sets shows the intention more clearly though.

Copy link
Contributor

@poljar poljar Nov 10, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Luckily we can check if the BTreeSet approach is slower, red is the flat_map(), blue is the BTreeSet approach:

flat_map_vs_sets_page-0001

The benchmark is processing a /keys/claim response with 163 one-time keys. I did run it a couple of times to get somewhat stable results. The BTreeSet approach does not seem to make any measurable difference.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To check I am understanding the report correctly: this is saying the mean time increased by 0.7680%? In which case, very much agreed this is not worth worrying about.

I'd be interested to see how you ran this benchmark, for future reference.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are understanding correctly, I did have a couple of runs which claimed that the time decreased by ~0.5% as well. As the report tells us, all of this is within the noise threshold.

I'd be interested to see how you ran this benchmark, for future reference.

Please take a look at the /benchmarks folder: https://github.com/matrix-org/matrix-rust-sdk/tree/main/benchmarks

richvdh and others added 2 commits November 9, 2023 17:14
@richvdh richvdh enabled auto-merge (squash) November 9, 2023 18:10
@richvdh richvdh merged commit 71a2d23 into main Nov 9, 2023
@richvdh richvdh deleted the rav/claim_keys branch November 9, 2023 18:30
richvdh added a commit that referenced this pull request Nov 9, 2023
 - Add an entry which I forgot for #2805

 - Reorder a few other items that have been added since
 #2591.
richvdh added a commit that referenced this pull request Nov 10, 2023
- Add an entry which I forgot for #2805

 - Reorder a few other items that have been added since
 #2591.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

OTK fetching failure woes (attempting to establish Olm sessions too often)

2 participants