MSC2716: Incrementally importing history into existing rooms#2716

Open
ara4n wants to merge 69 commits into old_master from matthew/msc2716

@ara4n ara4n commented Aug 4, 2020

A proposal for letting ASes specify event parents and timestamps when submitting events, letting them much more effectively incrementally insert past conversation history. This is getting increasingly topical given the need to bridge existing conversation archives from existing chat systems into Matrix. Fixes most of https://github.com/matrix-org/matrix-doc/issues/698 hopefully.

Rendered

Homeserver implementations:

Client implementations:


Old proposal rendered

cc @tulir for feedback, as the main consumer of the ?ts= API today...

@ara4n ara4n added the proposal (A matrix spec change proposal) label Aug 4, 2020
@turt2live turt2live added the kind:feature (MSC for not-core and not-maintenance stuff) and proposal-in-review labels Aug 4, 2020
@Half-Shot Half-Shot self-requested a review August 5, 2020 00:00
@turt2live turt2live self-requested a review August 5, 2020 06:02

lieuwex commented Aug 29, 2020

If I understand this correctly, this requires the application service to insert all the historical data before the user requests it.

Isn't it an idea to create a new querying API and request backlog from the AS, like homeservers currently do when they ask other federated servers for historical events?
This would block the homeserver while the AS prepares the events, provides them to the homeserver using the APIs outlined in this MSC, and then the AS could return something like an array containing all the event IDs that were created during the request.

This would lead to an even cleaner SS-like integration with ASes, without creating a buttload of work for every AS. How much would they need to import up front, for example, especially if the room has events going back ten years or so?

Member Author

ara4n commented Nov 10, 2020

> If I understand this correctly, this requires the application service to insert all the historical data before the user requests it.

yup, as per the Potential Issues section:

> This doesn't provide a way for a HS to tell an AS that a client has tried to call /messages beyond the beginning of a room, and that the AS should try to lazy-insert some more messages (as per https://github.com/matrix-org/matrix-doc/issues/698). For this MSC to be properly useful, we might want to flesh that out.

Member

@tulir tulir left a comment

I had commented on this in HQ earlier, but forgot to post on github.

Overall it looks good. Inserting state could get complicated, but that's just an initial feeling and I'm not sure if it's actually true. A server implementation prototype would be nice so I could just try implementing it in one of my bridges.


## Unstable prefix

Feels unnecessary.
Member

It might be useful to have an unstable feature flag to check if the homeserver supports this

|  \___________________________________
|   \                                   \
|    \                                   \
live timeline   previous 1000 messages   another block of ancient history
Member

@kegsay kegsay Dec 17, 2020

Backfilling via /messages works by walking back up prev_events. If the DAG looks like this, we'll never hit different eras so /messages will return 0 events.

EDIT: Actually it uses depth which will interleave instead. /get_missing_events will however walk up prev events, so all these lovely eras will never make it to other federated servers.

Member

I don't think this works because the /messages endpoint has no idea when to jump to a different era. That endpoint topologically walks the DAG (in Dendrite it does this by depth), meaning if you actually did this you would get interleaved events as each era's events start producing the same depth values. This at least returns the events in the forks, but not where you want them.
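To illustrate the interleaving described above, here is a minimal sketch. The event IDs, the `era` label, the depth values, and the three-fork layout are all made up for illustration; it is not how Dendrite or Synapse actually store or paginate events. It only shows that a sort keyed on depth mixes the eras together instead of returning them one after another.

```python
from typing import NamedTuple


class Event(NamedTuple):
    event_id: str
    era: str    # hypothetical label for illustration, not a real event field
    depth: int


def messages_by_depth(events: list[Event]) -> list[Event]:
    """Return events newest-first, ordered purely by depth."""
    # sorted() is stable, so events with equal depth keep their input order.
    return sorted(events, key=lambda e: e.depth, reverse=True)


# Three forks hanging off the same point, so each one restarts at the same depths.
events = [
    Event("$live1", "live", 2), Event("$live2", "live", 3),
    Event("$old1", "previous", 2), Event("$old2", "previous", 3),
    Event("$ancient1", "ancient", 2), Event("$ancient2", "ancient", 3),
]

eras_in_order = [e.era for e in messages_by_depth(events)]
print(eras_in_order)
```

The eras alternate in the result rather than appearing as three contiguous blocks, which is the "returns the events in the forks, but not where you want them" outcome.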

Member

In addition, doing this would produce forwards extremities at the end of each era, which servers will attempt to merge.

Member Author

so my expectation is that a homeserver should calculate an appropriate depth when importing history like this, probably by tiebreaking based on origin_server_ts. Where does Dendrite get its depth param from? As it certainly shouldn't be trusting the one it receives over federation, because of https://github.com/matrix-org/matrix-doc/issues/1229.

Member Author

also, the fact that we create forward extremities at the end of each era which then get merged by the next message sent in the room was intended to be a feature, not a bug.

Contributor

@MadLittleMods MadLittleMods Jan 5, 2021

From our meeting, the depth is assumed from the stream ID and can be spoofed. I may not have the details correct but we did discuss fudging it.

Member Author

So having rediscussed this IRL: Dendrite (and Synapse) currently get their depth parameters used for ordering from the wire. Ideally, we'd calculate the depth parameter instead - which could be easy, if we mandate that blocks of old history are always loaded contiguously in reverse chronological order. As a quick fudge to test the approach however we could set depth=1 for these events, and hopefully the default ordering will be sufficient (we think it is on synapse, but dendrite might need a tweak).

Member

@kegsay kegsay left a comment

I don't think this proposal works as stated on its own without additional modifications, for the reasons given.

Brainstorming with @neilalexander and @MadLittleMods it sounds like MSC2836: Threading would help here as you can retrospectively update pointers in the thread DAG, and arguably it separates concerns better (the structural auth DAG vs effectively a presentation DAG). This would however require clients to support threading which seems like a high barrier as it doesn't Just Work.

MadLittleMods added a commit to MadLittleMods/tardis that referenced this pull request Feb 2, 2021
Edits to make TARDIS work with Synapse while writing Complement tests for [MSC 2716](matrix-org/matrix-spec-proposals#2716).

 - matrix-org/synapse#9247
 - matrix-org/complement#68
Member

@MadLittleMods I believe this MSC has received the sanity review it was after and therefore am removing it from the SCT's backlog board. If this is false, please raise it in the SCT office for reconsideration.

with `?prev_event_id` pointing at that floating state to auth the event and where we
want to insert the event.

Another way of doing this might be to store the different eras of the room as
@flokli flokli Nov 14, 2022

Has there been any discussion around being able to import history from the client?

Assume someone was previously active on IRC, and creates a Matrix account, then joins an IRC channel via matrix-appservice-irc.
That person might already have had some room history or DMs (that the AS is unaware of), and it would be nice if there was a way to "import" these old logs from the IRC client into some room state, and then see it consistently in all their Matrix clients.

This would need to be stored somewhere server-side, so it's available across all devices, but it might not necessarily be shared to other room members (even though two parties might decide to migrate their conversation history that has happened on IRC to a Matrix DM room)

Contributor

@MadLittleMods MadLittleMods Nov 14, 2022

@flokli The new /batch_send endpoint here is in the client API (`POST /_matrix/client/v1/rooms/<roomID>/batch_send?prev_event_id=<message3-eventID>`) but is currently restricted to application service tokens.

If there are good use cases for it, we could easily remove this restriction to allow any room admin (person with proper power levels) to do it.
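As a rough sketch of how a caller might build that request URL: the path and the `prev_event_id` parameter are taken from the comment above, while the homeserver base URL, room ID, and event ID are placeholders. The helper name `batch_send_url` is hypothetical, and the request body and authentication are deliberately left out. While the MSC is unstable, the unstable-prefixed path would presumably be needed instead.

```python
from urllib.parse import quote, urlencode


def batch_send_url(base_url: str, room_id: str, prev_event_id: str) -> str:
    """Build the batch_send URL quoted in the comment above (hypothetical helper)."""
    # Room IDs contain "!" and ":", so percent-encode the whole path segment.
    path = f"/_matrix/client/v1/rooms/{quote(room_id, safe='')}/batch_send"
    # prev_event_id names the existing event the batch is inserted next to.
    query = urlencode({"prev_event_id": prev_event_id})
    return f"{base_url}{path}?{query}"


# Placeholder homeserver, room ID and event ID.
url = batch_send_url("https://hs.example.org", "!room:example.org", "$message3")
print(url)
```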

Copy link

Thanks for the clarification!

Copy link

I agree, I would be very much interested in this too, being able to import old chats to an appservice would be quite relevant.

Comment on lines +362 to +366
Because we also can't use the `historical` power level for controlling who can
send these events in the existing room version, we always persist but instead
only process and give meaning to the `m.room.insertion`, `m.room.batch`, and
`m.room.marker` events when the room `creator` sends them. This caveat/rule only
applies to existing room versions.
Member

Is this using soft rejection on the server-side to determine valid-ness of events? Currently it reads a bit like an event auth change, which can't happen with respect to existing room versions.

Contributor

It doesn't reject or soft-fail any events in existing room versions.

If the homeserver sees one of those event types from the room creator, the homeserver will process as necessary to make the imported history display when paginating over that area of the room.

Otherwise, it just appears as any other event in the room, nothing happens, and the imported history isn't visible. If the homeserver ever adds MSC2716 support, that history suddenly is unlocked for that server once they re-process the events in the room (probably as a background job).

Synapse, these are called `outlier`s and won't be visible in the chat history
which also allows us to insert multiple batches without having a bunch of `@mxid
joined the room` noise between each batch. **The state will not be resolved into
the current state of the room.**
@twouters twouters Jan 10, 2023

Current behavior of synapse is undesirable because it fails backfilled joins for members who have already joined the room.

Does the spec need to clarify that the client implementation must validate the current state of the room before sending a batch event, or should the spec mandate server implementations to ignore batched member state events that are already fulfilled in the current state of the room?

Contributor

What fails about the member already being joined to the room? What error are you seeing? I would assume this should work because an m.room.member transition from join -> join is allowed: https://spec.matrix.org/v1.4/client-server-api/#mroommember

For clarity, it would be good to see your exact request that is failing.


Back to your topic of how to resolve things; we would want the conflict resolution to be general and the same for all event types. Doing something different for m.room.member would be a recipe for complexity foot-guns.

Member

In most cases, bridges have to include invite events, and join -> invite is not allowed.

Ignoring such state events feels like it'd be the easiest to implement on the server, but it doesn't necessarily need to be built into state_events_at_start at all, it could be a new members field that the server uses to add invite and join events as necessary, or it could even just detect senders of the events in the batch. From the bridge point of view, all of those options are easy to implement.

Beeper (hungryserv) does the last option: it gets the list of users from the event senders, filters away any users who are already in the room and then creates member events for the missing users. (we also don't have any use cases for state events, so the server forbids using state_events_at_start entirely)
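The last option can be sketched roughly as follows. This is not hungryserv's code, which is not shown in this thread; the function name, the plain-dict event shape, and the choice to emit only join events (no invites) are assumptions made for illustration.

```python
def missing_member_events(batch_events: list[dict], joined_users: set[str]) -> list[dict]:
    """Build join events for batch senders who are not already in the room."""
    senders: list[str] = []
    for event in batch_events:
        sender = event["sender"]
        if sender not in senders:  # de-duplicate, keeping first-seen order
            senders.append(sender)
    return [
        {
            "type": "m.room.member",
            "sender": user_id,
            "state_key": user_id,
            "content": {"membership": "join"},
        }
        for user_id in senders
        if user_id not in joined_users  # skip users who are already in the room
    ]


# Placeholder batch: two messages from alice (already joined) and one from bob.
batch = [
    {"sender": "@alice:example.org", "type": "m.room.message"},
    {"sender": "@bob:example.org", "type": "m.room.message"},
    {"sender": "@alice:example.org", "type": "m.room.message"},
]
new_members = missing_member_events(batch, joined_users={"@alice:example.org"})
print([e["state_key"] for e in new_members])
```

Filtering on current membership up front sidesteps the disallowed join -> invite transition mentioned above, because no member event is generated for users who are already joined.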

**Power level:**

- `historical` (does not need prefixing because it's already under an
experimental room version)
Contributor

This language is derived/inspired by MSC2285

Suggested change
experimental room version)

### While the MSC is unstable

During this period, to detect server support clients should check for the
presence of the `org.matrix.msc2716` flag in `unstable_features` on `/versions`.
Clients are also required to use the unstable prefixes (see [unstable
prefix](#unstable-prefix)) during this time.

### Once the MSC is merged but not in a spec version

Once this MSC is merged, but is not yet part of the spec, clients should rely on
the presence of the `org.matrix.msc2716.stable` flag in `unstable_features` to
determine server support. If the flag is present, clients are required to use
stable prefixes (see [unstable prefix](#unstable-prefix)).

### Once the MSC is in a spec version

Once this MSC becomes a part of a spec version, clients should rely on the
presence of the spec version, that supports the MSC, in `versions` on
`/versions`, to determine support. Servers are encouraged to keep the
`org.matrix.msc2716.stable` flag around for a reasonable amount of time
to help smooth over the transition for clients. "Reasonable" is intentionally
left as an implementation detail, however the MSC process currently recommends
*at most* 2 months from the date of spec release.
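A client-side check following the three phases in the suggested text might look like the sketch below. The two flag names come from the suggestion; the function name, the return labels, and the `spec_versions_with_msc` parameter are hypothetical, and no spec version is assumed to include this MSC. The input is an already-parsed `/versions` response body.

```python
def msc2716_support(versions_response: dict,
                    spec_versions_with_msc: frozenset = frozenset()) -> str:
    """Classify server support for MSC2716 from a parsed /versions response."""
    unstable = versions_response.get("unstable_features", {})
    advertised = set(versions_response.get("versions", []))
    # Phase 3: a released spec version that includes the MSC (none is known here,
    # so the caller has to supply the set once such a version exists).
    if spec_versions_with_msc & advertised:
        return "spec"
    # Phase 2: merged but not yet released, so use stable prefixes.
    if unstable.get("org.matrix.msc2716.stable"):
        return "stable"
    # Phase 1: unstable, so use the unstable prefixes.
    if unstable.get("org.matrix.msc2716"):
        return "unstable"
    return "unsupported"


example = {"versions": ["v1.4"], "unstable_features": {"org.matrix.msc2716": True}}
print(msc2716_support(example))
```

Checking the most stable signal first means a server that advertises several of these at once during a transition is treated as the most mature case.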

@moritzdietz

What does this comment from the linked PR mean? Could you give more information on what happened?

> Abandoning PR as I don't see MSC2716 going further now that Gitter has fully migrated to Matrix

matrix-org/matrix-react-sdk#8354 (comment)

@MadLittleMods
Contributor

@moritzdietz Based on your comment in #matrix-spec, I think you have a misunderstanding on how this relates to Gitter.

We were able to import all 141M messages from Gitter to Matrix without MSC2716. We used the single /send endpoint with the timestamp massaging ?ts=xxx query parameter split between a "historical" and live room.

The big drive to put effort into MSC2716 was the Gitter case but we were able to accomplish the Gitter migration without it in the end and there is no reliance on it now. Historical import within the DAG is still a very useful concept to have in Matrix but there are some roadblocks in the MSC before being viable:

  • Event ordering over federation
    • Currently, events in Synapse are sorted by (topological_ordering, stream_ordering) where topological_ordering is just depth and is baked into the event when it goes over federation. This means when we try to import between depth 1 and 2, we can only rely on stream_ordering to sort between 1 and 2. Since stream_ordering is just dependent on when the server receives the event, the historical messages can easily get out of order. (some more info in the MSC)
    • To totally fix this problem, it would require a different graph linearization strategy. Perhaps we would do some online topological ordering (Katriel–Bodlaender algorithm) where depth/topological_ordering is dynamically updated whenever new events are inserted into the DAG. This is something extremely sci-fi and a big task though.
  • Self-referential batches: There are some ideas in this open discussion but none stand out as great to use.
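A rough sketch of the ordering problem in the first bullet. The tuple sort mirrors the `(topological_ordering, stream_ordering)` key described above; the event names, the depth assumed for the imported events, and the arrival orders on the two servers are invented for illustration and are not taken from Synapse.

```python
def linearize(rows: list[tuple[str, int, int]]) -> list[str]:
    """Order (event_id, topological_ordering, stream_ordering) rows oldest-first."""
    return [event_id for event_id, _topo, _stream in
            sorted(rows, key=lambda r: (r[1], r[2]))]


# Historical events A, B, C are imported after the event at depth 1, so they all
# share one depth (assumed to be 2 here) and only stream_ordering separates them.
origin_server = [
    ("$first", 1, 1), ("$second", 2, 2),
    ("$A", 2, 3), ("$B", 2, 4), ("$C", 2, 5),
]
# A remote server sees the same depths but assigns stream_ordering by arrival,
# and in this made-up case it received C before A and B.
remote_server = [
    ("$first", 1, 1), ("$second", 2, 2),
    ("$C", 2, 3), ("$A", 2, 4), ("$B", 2, 5),
]

origin_order = linearize(origin_server)
remote_order = linearize(remote_server)
print(origin_order)
print(remote_order)
```

Both servers hold the same events with the same depths, yet they present the imported history in different orders, because the tie-break depends only on when each server happened to receive each event.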

There have been lots of good learnings here but these shortcomings don't instill confidence to keep driving this forward without an underlying reason to do so. Hopefully we can come at this with some fresh ideas to solve these shortcomings when we need this sort of thing again.

Instead of letting these experimental implementations languish in Element and Synapse, I aim to remove them. For the Element case, the PR was never merged, so I could easily just close it.

@moritzdietz

@MadLittleMods Thank you Eric for clarifying. As you said, I did misunderstand that. I guess the missing link was this bit of information you just shared above, which I haven't seen elsewhere.

@Avamander

I have to add that a bunch of people would really like this functionality in order to migrate to Matrix from other platforms, without losing years if not decades of message history.

Understandable that it's a difficult thing to implement, but it'd be very useful to a lot of users.

@James-Mat

It would be really unfortunate to hear that importing history won't be possible in the foreseeable future. This, as Avamander already mentioned, is for me the only thing that keeps me from adopting Matrix fully and convincing others to do so.
Right now, anyone who wants to look up something historical has to keep every other client that holds valuable information installed as well.
Most people I come across don't want to start anew but want to take all their data with them, though this might not be the case in general.

@MadLittleMods I got a bit confused by your last comment - did I understand correctly that, since the event ordering only depends on topological_ordering and stream_ordering, the usage of the ?ts=xxx query parameter was only to achieve correct metadata for the messages, but doesn't influence ordering at all?

As a thought, is topological_ordering unsigned or signed? If it is signed, this might then make the retrospective insertion of non-interweaving history possible, i.e. if a user switches messengers without a period of using both, by simply giving the historical messages negative values to enqueue them before the messages already in the room.
In the case of unsigned, at least an import right at the beginning (before any more recent messages are sent in the room) would be possible without a change of how event ordering is handled, right?

In any case, can you point me to some PRs or discussions on how the current implementation of the event ordering came to be? When I first subscribed to this PR, the description had me thinking that event ordering would simply be done by timestamp, which would seem to solve any problems of inserting historical messages. But I'm sure that there are other issues which were circumvented by using the current implementation, and I'd like to understand this process to not make any pointless suggestions.

Contributor

MadLittleMods commented Apr 11, 2023

> I got a bit confused by your last comment - did I understand correctly that, since the event ordering only depends on topological_ordering and stream_ordering, the usage of the ?ts=xxx query parameter was only to achieve correct metadata for the messages, but doesn't influence ordering at all?

In the Gitter case, we started with a fresh room for the historical messages and imported them one by one so the topological_ordering was correct. We also used /send?ts=xxx to make the timestamps correct. Then we connected the historical and "live" rooms together with an m.room.tombstone and an MSC3946 predecessor event. This functionality is completely separate from MSC2716 and works fine today.

> As a thought, is topological_ordering unsigned or signed?

See my previous comment: "topological_ordering is just depth and is baked into the event when it goes over federation"

You can see depth as part of the PDU (persistent data unit) in the spec: https://spec.matrix.org/v1.5/rooms/v10/#event-format-1

> by simply giving the historical messages negative values to enqueue them before the messages already in the room.

Importing messages at the beginning of a room is only one use case. We also want to be able to import between any two events and even between already imported messages. One example is if you're importing a mail or newsgroup archive and you stumble across a lost mbox years later with a few more messages, you want to fill in that history.

If your use case is just one import blast at the beginning of a room, the way Gitter accomplished this works now and is a lot simpler (do that instead).

> In any case, can you point me to some PRs or discussions on how the current implementation of the event ordering came to be? When I first subscribed to this PR, the description had me thinking that event ordering would simply be done by timestamp, which would seem to solve any problems of inserting historical messages. But I'm sure that there are other issues which were circumvented by using the current implementation, and I'd like to understand this process to not make any pointless suggestions.

Matrix is a DAG (directed acyclic graph) of events. depth being baked in is kinda a "get out of jail free" card on how to linearize the DAG.

This design decision is before my time and I don't know of any good references. Maybe someone in #matrix-spec:matrix.org has some context

Beryesa commented Jan 3, 2025

Is there anything that supersedes this proposal right now, and/or should this be kept open?

Labels

  • kind:feature: MSC for not-core and not-maintenance stuff
  • needs-implementation: This MSC does not have a qualifying implementation for the SCT to review. The MSC cannot enter FCP.
  • proposal: A matrix spec change proposal
  • requires-room-version: An idea which will require a bump in room version
  • unassigned-room-version: Remove this label when things get versioned.
