Begin adding implementing room chunks#3226
Conversation
This commit adds the necessary tables and columns, as well as an implementation of an online topological sorting algorithm to maintain an absolute ordering of the room chunks.
c7be1f5 to
943f102
Compare
| psql_only=True, | ||
| ) | ||
|
|
||
| self.register_background_index_update( |
There was a problem hiding this comment.
this is going to be expensive. Might want to warn users about it somewhere.
synapse/util/katriel_bodlaender.py
Outdated
| """This module contains an implementation of the Katriel-Bodlaender algorithm, | ||
| which is used to do online topological ordering of graphs. | ||
|
|
||
| Note that the ordering derived from the graph has the first node one with no |
synapse/util/katriel_bodlaender.py
Outdated
|
|
||
| class OrderedListStore(object): | ||
| """An abstract base class that is used to store a topological ordering of | ||
| a graph. Suitable for use with the Katriel-Bodlaender algorithm. |
There was a problem hiding this comment.
I'm confused: "Suitable for use with" or "Implements" ?
| class OrderedListStore(object): | ||
| """An abstract base class that is used to store a topological ordering of | ||
| a graph. Suitable for use with the Katriel-Bodlaender algorithm. | ||
| """ |
There was a problem hiding this comment.
I think some more words about the assumptions being made here would be useful. Likewise re the interface being exposed, to save me having to go and grok the paper.
For example:
- Looks like this assumes nodes are identified by unique strings?
- How do edges and ordering interact?
- does adding a node before/after another imply an edge between those nodes?
| in the database. | ||
|
|
||
| Internally the ordering is implemented using floats, and the average is | ||
| taken when a node is inserted inbetween other nodes. To avoid presicion |
| in the database. | ||
|
|
||
| Internally the ordering is implemented using floats, and the average is | ||
| taken when a node is inserted inbetween other nodes. To avoid presicion |
| in a range around the node, where the bounds are rounded to this | ||
| number of digits. | ||
| min_difference (int): A rebalance is triggered when the difference | ||
| between two successive orderings are less than the reverse of |
There was a problem hiding this comment.
s/are/is/. Two orderings, one difference.
| logger = logging.getLogger(__name__) | ||
|
|
||
|
|
||
| class ChunkDBOrderedListStore(OrderedListStore): |
There was a problem hiding this comment.
I think it would be worth noting what a room chunk actually is, as well as noting that we are storing them in an ordered graph
| with Measure(self.clock, "chunk_rebalance"): | ||
| # We pick the interval to try and minimise the number of decimal | ||
| # places, i.e. we round to nearest float with `rebalance_digits` and | ||
| # use that as the middle of the interval |
There was a problem hiding this comment.
looks like we use it as the end of the interval to me
This both simplifies the code, and ensures that the target node is roughly in the center of the range rather than at an end.
This makes it clearer what the public interface is vs what subclasses need to implement.
CHANGES.rst
Outdated
| ======================= | ||
|
|
||
| This release adds an index to the events table. This means that on first | ||
| startup there will be an inceased amount of IO until the index is created. |
There was a problem hiding this comment.
... and it will use up a bunch of disk space
|
|
||
| A room chunk is a connected portion of the room events DAG. As such it | ||
| inherits a DAG, i.e. if an event in one chunk references an event in a | ||
| second chunk, then we say that the first chunk references the second, and |
There was a problem hiding this comment.
should probably explain how we get into this situation, rather than the events just being in the same chunk.
synapse/util/katriel_bodlaender.py
Outdated
| the room DAG, as newer messages would be added to the start rather than the | ||
| end. | ||
|
|
||
| ***We therefore invert the direction of edges when using the algorithm*** |
There was a problem hiding this comment.
s/we/ChunkDBOrderedListStore/ ?
synapse/util/katriel_bodlaender.py
Outdated
| self._insert_before(node_id, None) | ||
|
|
||
| def add_edge(self, source, target): | ||
| """Adds a new edge is added to the graph and updates the ordering. |
|
|
||
|
|
||
| class InMemoryOrderedListStore(OrderedListStore): | ||
| """An in memory OrderedListStore |
There was a problem hiding this comment.
are you planning to use this outside the tests? if not I'd move it to the tests heirarchy
There was a problem hiding this comment.
Yes, there's a few cases where we want to order events by depth where this would be useful. Though admittedly I'm not 100% sure that it'll be needed in the end.
| """Used as the list store for room chunks, efficiently maintaining them in | ||
| topological order on updates. | ||
|
|
||
| A room chunk is a connected portion of the room events DAG. As such it |
There was a problem hiding this comment.
"it" -> the set of chunks in a room?
| A room chunk is a connected portion of the room events DAG. As such it | ||
| inherits a DAG, i.e. if an event in one chunk references an event in a | ||
| second chunk, then we say that the first chunk references the second, and | ||
| thus forming a DAG. |
There was a problem hiding this comment.
right, but potentially a disconnected DAG. How does the ordering happen in that case?
|
|
||
| The server may only have a subset of all events in a room, in which case | ||
| its possible for the server to have chunks that are unconnected from each | ||
| other. The ordering between unconnected chunks is arbitrary. |
There was a problem hiding this comment.
If I have four chunks, with two connected parts (A->B , C->D), so we know that A<B and C<D, is it also the case that either B<C (so A<B<C<D) or D<A (so C<D<A<B) ? or could we end up with B in the middle of C and D or something?
There was a problem hiding this comment.
I don't think there is anything stopping B being in the middle, since theoretically if you inserted the chunks in the order A, C, B, D, then that obeys all the constraints and so the KB algorithm will leave things alone (I think).
829cde0 to
0504d80
Compare
richvdh
left a comment
There was a problem hiding this comment.
generally this lgtm and I am now quibbling about words which can easily be updated in the future
| an event in a second chunk, then we say that the first chunk references the | ||
| second, and thus forming a DAG. | ||
|
|
||
| Chunks are constructed so that they have the aditional property that for all |
| reference an event in the chunk and be referenced by an event in the chunk | ||
| (assuming no cycles). | ||
|
|
||
| Multiple chunks can therefore happen when the server misses some events, |
There was a problem hiding this comment.
"We can therefore end up with multiple chunks in a room when..."
| inherits a DAG, i.e. if an event in one chunk references an event in a | ||
| second chunk, then we say that the first chunk references the second, and | ||
| thus forming a DAG. | ||
| A room chunk is a connected portion of the room events DAG. As such the set |
There was a problem hiding this comment.
suggest moving "As such" to after the explanation about how they are constructed
| an event in a second chunk, then we say that the first chunk references the | ||
| second, and thus forming a DAG. | ||
|
|
||
| Chunks are constructed so that they have the aditional property that for all |
There was a problem hiding this comment.
probably worth noting somewhere here that chunks start out disconnected, and then get connected later when we receive events that, uh, connect them.
This commit adds the necessary tables and columns, as well as an
implementation of an online topological sorting algorithm to maintain an
absolute ordering of the room chunks.
Note: This is the start of a series of PRs. I'm landing them on a separate branch as we should probably land the full set onto develop at once, but want to make separate PRs to make review easier