MSC2716: Incrementally importing history into existing rooms#2716
ara4n wants to merge 69 commits into old_master
A proposal for letting ASes specify event parents and timestamps when submitting events, letting them much more effectively insert past conversation history. cc @tulir for feedback, as the main consumer of the ?ts= API today...
If I understand this correctly, this requires the application service to insert all the historical data before the user requests it. Wouldn't it be an idea to create a new querying API and request backlog from the AS, like homeservers currently do when they ask other federated servers for historical events? This would lead to an even cleaner SS-like integration with ASes, without creating a buttload of work for every AS. For example, how much would they need to import, especially if the room has events going back ten years or something like that?
yup, as per the Potential Issues section:
tulir left a comment
I had commented on this in HQ earlier, but forgot to post on github.
Overall it looks good. Inserting state could get complicated, but that's just an initial feeling and I'm not sure if it's actually true. A server implementation prototype would be nice so I could just try implementing it in one of my bridges.
> ## Unstable prefix

Feels unnecessary.
It might be useful to have an unstable feature flag to check if the homeserver supports this
> [Diagram from the proposal: the DAG forks into separate branches labelled "live timeline", "previous 1000 messages", and "another block of ancient history".]
Backfilling via /messages works by walking back up prev_events. If the DAG looks like this, we'll never hit different eras so /messages will return 0 events.
EDIT: Actually it uses depth which will interleave instead. /get_missing_events will however walk up prev events, so all these lovely eras will never make it to other federated servers.
I don't think this works because the /messages endpoint has no idea when to jump to a different era. That endpoint topologically walks the DAG (in Dendrite it does this by depth), meaning if you actually did this you would get interleaved events as each era's events start producing the same depth values. This at least returns the events in the forks, but not where you want them.
In addition, doing this would produce forwards extremities at the end of each era, which servers will attempt to merge.
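To make the interleaving concern concrete, here is a minimal sketch of what ordering purely by depth does when two disconnected branches produce the same depth values. The `Event` class, the `era` label, and `paginate_by_depth` are hypothetical names for illustration only; this is not Dendrite's or Synapse's actual pagination code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    event_id: str
    era: str    # hypothetical label, only used here to show the interleaving
    depth: int


def paginate_by_depth(events):
    """Return events newest-first, ordered only by depth (no notion of eras)."""
    return sorted(events, key=lambda e: e.depth, reverse=True)


# Two disconnected branches whose depth values overlap.
live = [Event(f"$live{d}", "live", d) for d in (1, 2, 3)]
ancient = [Event(f"$old{d}", "ancient", d) for d in (1, 2, 3)]

page = paginate_by_depth(live + ancient)
eras_in_page = [e.era for e in page]
print(eras_in_page)
```

The eras alternate in the output instead of appearing as contiguous blocks: the events are returned, but not where you want them.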
so my expectation is that a homeserver should calculate an appropriate depth when importing history like this, probably by tiebreaking based on origin_server_ts. Where does Dendrite get its depth param from? As it certainly shouldn't be trusting the one it receives over federation, because of https://github.com/matrix-org/matrix-doc/issues/1229.
also, the fact that we create forward extremities at the end of each era which then get merged by the next message sent in the room was intended to be a feature, not a bug.
From our meeting, the depth is assumed from the stream ID and can be spoofed. I may not have the details correct but we did discuss fudging it.
So having rediscussed this IRL: Dendrite (and Synapse) currently get their depth parameters used for ordering from the wire. Ideally, we'd calculate the depth parameter instead - which could be easy, if we mandate that blocks of old history are always loaded contiguously in reverse chronological order. As a quick fudge to test the approach however we could set depth=1 for these events, and hopefully the default ordering will be sufficient (we think it is on synapse, but dendrite might need a tweak).
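A rough sketch of the `depth=1` fudge, assuming purely for illustration that a server orders events by `(depth, origin_server_ts)`. Whether the real default ordering in Synapse or Dendrite behaves this way is exactly what the comment above leaves open; `order_events` and the sample events are hypothetical.

```python
def order_events(events):
    """Sort oldest-first by depth, tiebreaking on origin_server_ts (assumed ordering)."""
    return sorted(events, key=lambda e: (e["depth"], e["origin_server_ts"]))


live = [
    {"event_id": "$live1", "depth": 10, "origin_server_ts": 1_600_000_000_000},
    {"event_id": "$live2", "depth": 11, "origin_server_ts": 1_600_000_060_000},
]
# Imported history all gets depth=1, so only the timestamp separates these events.
imported = [
    {"event_id": "$old2", "depth": 1, "origin_server_ts": 1_300_000_060_000},
    {"event_id": "$old1", "depth": 1, "origin_server_ts": 1_300_000_000_000},
]

ordered_ids = [e["event_id"] for e in order_events(live + imported)]
print(ordered_ids)
```

Under that assumption the imported block sorts before the live timeline and is internally ordered by timestamp, regardless of the order in which it was loaded.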
kegsay left a comment
I don't think this proposal works as stated on its own without additional modifications for the reasons given.
Brainstorming with @neilalexander and @MadLittleMods it sounds like MSC2836: Threading would help here as you can retrospectively update pointers in the thread DAG, and arguably it separates concerns better (the structural auth DAG vs effectively a presentation DAG). This would however require clients to support threading which seems like a high barrier as it doesn't Just Work.
Edits to make TARDIS work with Synapse while writing Complement tests for [MSC 2716](matrix-org/matrix-spec-proposals#2716).
- matrix-org/synapse#9247
- matrix-org/complement#68

@MadLittleMods I believe this MSC has received the sanity review it was after and therefore am removing it from the SCT's backlog board. If this is false, please raise it in the SCT office for reconsideration.
> with `?prev_event_id` pointing at that floating state to auth the event and where we
> want to insert the event.
>
> Another way of doing this might be to store the different eras of the room as
Has there been any discussion around being able to import history from the client?
Assume someone was previously active on IRC, and creates a Matrix account, then joins an IRC channel via matrix-appservice-irc.
That person might already have had some room history or DMs (that the AS is unaware of), and it would be nice if there was a way to "import" these old logs from the IRC client into some room state, and then see it consistently in all their Matrix clients.
This would need to be stored somewhere server-side, so it's available across all devices, but it might not necessarily be shared to other room members (even though two parties might decide to migrate their conversation history that has happened on IRC to a Matrix DM room)
@flokli The new /batch_send endpoint here is in the client API (`POST /_matrix/client/v1/rooms/<roomID>/batch_send?prev_event_id=<message3-eventID>`) but is currently restricted to application service tokens.
If there are good use cases for it, we could easily remove this restriction to allow any room admin (person with proper power levels) to do it.
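For reference, a small sketch of building the request URL for the endpoint quoted above. `build_batch_send_url` and the example homeserver, room ID, and event ID are hypothetical; the request body and the application service token needed to actually send the request are deliberately left out.

```python
from urllib.parse import quote, urlencode


def build_batch_send_url(homeserver_url, room_id, prev_event_id):
    """Build the /batch_send URL, percent-encoding the room ID and event ID."""
    path = f"/_matrix/client/v1/rooms/{quote(room_id, safe='')}/batch_send"
    query = urlencode({"prev_event_id": prev_event_id})
    return f"{homeserver_url.rstrip('/')}{path}?{query}"


url = build_batch_send_url("https://hs.example", "!abc:example.org", "$message3")
print(url)
```

Room IDs and event IDs contain characters such as `!`, `:` and `$`, so encoding them is needed before they go into the path or query string.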
I agree, I would be very much interested in this too, being able to import old chats to an appservice would be quite relevant.
> Because we also can't use the `historical` power level for controlling who can
> send these events in the existing room version, we always persist but instead
> only process and give meaning to the `m.room.insertion`, `m.room.batch`, and
> `m.room.marker` events when the room `creator` sends them. This caveat/rule only
> applies to existing room versions.
Is this using soft rejection on the server-side to determine valid-ness of events? Currently it reads a bit like an event auth change, which can't happen with respect to existing room versions.
It doesn't reject or soft-fail any events in existing room versions.
If the homeserver sees one of those event types from the room creator, the homeserver will process as necessary to make the imported history display when paginating over that area of the room.
Otherwise, it just appears as any other event in the room, nothing happens, and the imported history isn't visible. If the homeserver ever adds MSC2716 support, that history suddenly is unlocked for that server once they re-process the events in the room (probably as a background job).
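The rule described here can be sketched as a simple predicate. This is an illustration of the behaviour as stated in this thread for existing room versions, not Synapse's implementation; `gives_history_meaning` and the event dictionaries are hypothetical.

```python
# Event types named in the proposal text quoted above.
HISTORY_EVENT_TYPES = frozenset({"m.room.insertion", "m.room.batch", "m.room.marker"})


def gives_history_meaning(event, room_creator):
    """True if the server should treat the event as part of a history import.

    Events that fail this check are still persisted and shown like any other
    event; they are neither rejected nor soft-failed.
    """
    return event["type"] in HISTORY_EVENT_TYPES and event["sender"] == room_creator


creator = "@creator:example.org"
from_creator = {"type": "m.room.insertion", "sender": creator}
from_other = {"type": "m.room.insertion", "sender": "@someone:example.org"}
plain_message = {"type": "m.room.message", "sender": creator}
```

Because the check only changes how the server interprets an already-accepted event, it does not alter event auth for existing room versions.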
> Synapse, these are called `outlier`s and won't be visible in the chat history
> which also allows us to insert multiple batches without having a bunch of `@mxid
> joined the room` noise between each batch. **The state will not be resolved into
> the current state of the room.**
Current behavior of synapse is undesirable because it fails backfilled joins for members who have already joined the room.
Does the spec need to clarify that the client implementation must validate the current state of the room before sending a batch event, or should the spec mandate server implementations to ignore batched member state events that are already fulfilled in the current state of the room?
What fails about the member already being joined to the room? What error are you seeing? I would assume this should work because a m.room.member transition from join -> join is allowed, https://spec.matrix.org/v1.4/client-server-api/#mroommember
For clarity, it would be good to see your exact request that is failing.
Back to your topic of how to resolve things; we would want the conflict resolution to be general and the same for all event types. Doing something different for m.room.member would be a recipe for complexity foot-guns.
In most cases, bridges have to include invite events, and join -> invite is not allowed.
Ignoring such state events feels like it'd be the easiest to implement on the server, but it doesn't necessarily need to be built into state_events_at_start at all, it could be a new members field that the server uses to add invite and join events as necessary, or it could even just detect senders of the events in the batch. From the bridge point of view, all of those options are easy to implement.
Beeper (hungryserv) does the last option: it gets the list of users from the event senders, filters away any users who are already in the room and then creates member events for the missing users. (we also don't have any use cases for state events, so the server forbids using state_events_at_start entirely)
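The last option (deriving members from the event senders) could look roughly like the sketch below. It is not hungryserv's code: `missing_member_events` is a hypothetical helper, and generating plain `join` events is an assumption, since the comment does not say which membership events the server creates.

```python
def missing_member_events(batch_events, joined_user_ids):
    """Create one member event per batch sender who is not already in the room."""
    seen = set(joined_user_ids)
    member_events = []
    for event in batch_events:
        sender = event["sender"]
        if sender in seen:
            continue  # already joined, or already handled earlier in this batch
        seen.add(sender)
        member_events.append({
            "type": "m.room.member",
            "state_key": sender,
            "sender": sender,
            "content": {"membership": "join"},  # assumption: join, not invite
        })
    return member_events


batch = [
    {"type": "m.room.message", "sender": "@alice:example.org"},
    {"type": "m.room.message", "sender": "@bob:example.org"},
    {"type": "m.room.message", "sender": "@alice:example.org"},
]
created = missing_member_events(batch, joined_user_ids=["@bob:example.org"])
```

Skipping users who are already joined sidesteps the disallowed join -> invite transition mentioned above, because no member event is generated for them at all.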
> **Power level:**
>
> - `historical` (does not need prefixing because it's already under an
>   experimental room version)
This language is derived/inspired by MSC2285
> ### While the MSC is unstable
>
> During this period, to detect server support clients should check for the
> presence of the `org.matrix.msc2716` flag in `unstable_features` on `/versions`.
> Clients are also required to use the unstable prefixes (see [unstable
> prefix](#unstable-prefix)) during this time.
>
> ### Once the MSC is merged but not in a spec version
>
> Once this MSC is merged, but is not yet part of the spec, clients should rely on
> the presence of the `org.matrix.msc2716.stable` flag in `unstable_features` to
> determine server support. If the flag is present, clients are required to use
> stable prefixes (see [unstable prefix](#unstable-prefix)).
>
> ### Once the MSC is in a spec version
>
> Once this MSC becomes a part of a spec version, clients should rely on the
> presence of the spec version, that supports the MSC, in `versions` on
> `/versions`, to determine support. Servers are encouraged to keep the
> `org.matrix.msc2716.stable` flag around for a reasonable amount of time
> to help smooth over the transition for clients. "Reasonable" is intentionally
> left as an implementation detail, however the MSC process currently recommends
> *at most* 2 months from the date of spec release.
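A client-side check following the three stages above might look like this sketch. `detect_msc2716_support` is a hypothetical helper, and `spec_versions_with_msc` is a placeholder parameter: no spec version includes this MSC, so the caller would have to supply whichever versions eventually do.

```python
def detect_msc2716_support(versions_response, spec_versions_with_msc=()):
    """Classify server support from a parsed /versions response body."""
    advertised = versions_response.get("versions", [])
    if any(version in spec_versions_with_msc for version in advertised):
        return "spec"  # use stable identifiers
    features = versions_response.get("unstable_features", {})
    if features.get("org.matrix.msc2716.stable"):
        return "stable-prefix"  # merged but not yet in a spec release
    if features.get("org.matrix.msc2716"):
        return "unstable-prefix"  # MSC still unstable
    return None  # no support advertised


unstable_server = {"versions": ["v1.4"], "unstable_features": {"org.matrix.msc2716": True}}
merged_server = {"versions": ["v1.4"], "unstable_features": {"org.matrix.msc2716.stable": True}}
plain_server = {"versions": ["v1.4"], "unstable_features": {}}
```

Checking the spec version first means the stable flag can later be dropped by servers without breaking clients.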
What does this comment from the linked PR mean? Could you give more information on what happened?
@moritzdietz Based on your comment in #matrix-spec, I think you have a misunderstanding of how this relates to Gitter. We were able to import all 141M messages from Gitter to Matrix without MSC2716. We used the single

The big drive to put effort into MSC2716 was the Gitter case, but we were able to accomplish the Gitter migration without it in the end and there is no reliance on it now. Historical import within the DAG is still a very useful concept to have in Matrix but there are some roadblocks in the MSC before being viable:

There have been lots of good learnings here but these shortcomings don't instill confidence to keep driving this forward without an underlying reason to do so. Hopefully we can come at this with some fresh ideas to solve these shortcomings when we need this sort of thing again. Instead of letting these experimental implementations languish in Element and Synapse, I aim to remove them. For the Element case, the PR was never merged, so I could easily just close it.
@MadLittleMods Thank you Eric for clarifying. As you said, I did misunderstand that. I guess the missing link was this bit of information you just shared above which I haven't seen elsewhere.
I have to add that a bunch of people would really like this functionality in order to migrate to Matrix from other platforms, without losing years if not decades of message history. Understandable that it's a difficult thing to implement, but it'd be very useful to a lot of users.
It would be really unfortunate to hear that importing history won't be possible in the foreseeable future. This, like Avamander already mentioned, is for me the only thing which keeps me from adopting Matrix fully and convincing others to do so.

@MadLittleMods I got a bit confused by your last comment - did I understand correctly that, since the event ordering only depends on

As a thought, is

In any case, can you point me to some PRs or discussions on how the current implementation of the event ordering came to be? When I first subscribed to this PR, the description had me thinking that event ordering would simply be done by timestamp, which would seem to solve any problems of inserting historical messages. But I'm sure that there are other issues which were circumvented by using the current implementation, and I'd like to understand this process to not make any pointless suggestions.
In the Gitter case, we started with a fresh room for the historical messages and imported one by one so the
See my previous comment: " You can see
Importing messages at the beginning of a room is only one use case. We also want to be able to import between any two events and even between already imported messages. One example is if you're importing a mail or newsgroup archive and you stumble across a lost mbox years later with a few more messages, you want to fill in that history. If your use case is just one import blast at the beginning of a room, the way Gitter accomplished this works now and is a lot simpler (do that instead).
Matrix is a DAG (directed acyclic graph) of events. This design decision is before my time and I don't know of any good references. Maybe someone in #matrix-spec:matrix.org has some context.
Pulling from my summary comments at: - #2716 (comment) - #2716 (comment)
Is there anything that supersedes this proposal right now, and/or should this be kept open?
A proposal for letting ASes specify event parents and timestamps when submitting events, letting them much more effectively incrementally insert past conversation history. This is getting increasingly topical given the need to bridge existing conversation archives from existing chat systems into Matrix. Fixes most of https://github.com/matrix-org/matrix-doc/issues/698 hopefully.
Rendered
Homeserver implementations:
RoomBatchHandler
Client implementations:
Old proposal rendered