With the upcoming implementation of an online version for JabRef, we face the issue of synchronizing data across multiple devices (and eventually different users).
Setting
- S1: One user, multiple devices: One user may have different devices with JabRef installed and would like to sync the library across all devices. A device may go offline at any point and may reconnect at a later time. JabRef's central server may been seen as one of these "user devices".
- S2: Multiple users accessing the same database: In the future, it would be nice to share one or multiple entries (or all) with other people, giving them different rights (e.g. teacher > student, students only have read rights; author > coauthor, all have write privileges). Users don't need to be available always.
- Mix of S1 and S2
Assumptions
JabRef's requirements are summarized as follows:
- A1: Conflicts at the level of one entry are relatively rare. In S1, it seems most probable that some metadata changes on multiple devices simultaneous (e.g. read status) or that the users changes an entry on one device while having forgotten that she made a similar change already on another device. In S2, multiple users may completely revise the entry independently. But it is not as common to have conflicts as for example in collaborative text editing.
- A2: Conflicts may be resolved manually since JabRef already has a nice 2-way merge interface and conflicts shouldn't happen that often to annoy users.
- A3: Data correctness is valued very highly.
- A4: Users should always be able to edit their own databases, since otherwise their local workflow with latex compilation is not possible (offline-first).
- A5: Users may try to edit the same data at the same time (though this should happen rarely for our current user numbers).
- A6: Users may go offline for some time, without transmitting their data to a central server before.
- A7: All data that we want to transmit needs be saved in the users library, at least to the extend that the bib file should work as a snapshot from which the user can start working without loosing data.
- A8: Users can change the bib file manually outside of JabRef, and changes should still be successfully transmitted to the server.
- A9: Users may edit the library on any device, i.e. there is no "main" device that can be considered to be the single source of truth.
- A10: The data stored in one entry is relatively small (300-500 bytes), but there may be many entries so that the total library is usually a few mb.
- A11: All syncs are user-triggered. In particular, we can always ask the user to resolve conflicts and don't need automatic conflict handling.
Algorithm:
For each entry, we store the following data on the client side:
- A "shared id" which uniquely identifies the entry for sync purposes.
- A "client version number" that represents the version of the entry the last time the client synced with the server.
- A "local version number" that represents the local version of the entry, and which is increased as soon as the entry is modified locally.
- A "hashed entry" representing the entry at the point when the local version number has been increased the last time.
On server side, the "shared id" and a "server version number" is stored.
Then the sync operations are implemented as follows:
- LOAD (JabRef is started and reads the library with sync information): For each entry, check if the the stored hashed entry coincides with the hashed version of the entry in the bib file. If not, increase the local version number and update the hashed entry. Then trigger PULL (if client is online).
- CHANGE (an entry is modified in JabRef locally): Increase its local version number and update its hash. Then trigger PUSH (if client is online).
- PULL (e.g. after the client has been offline, or in regular intervals): Send a map of entry shared IDs with their client version number to the server; server sends back the list of entries (including the field values) that have a higher server version number than the client version number that has been send; go to MERGE
- PUSH (e.g. after the entry is locally modified): Get a list of all entries whose local version number is larger than the client version number (i.e entries modified locally), and then send these entries to the server. The server checks if the server version number is the same as the client version number (i.e. no update on the server happened in the meantime), and then replaces the server entry with the transmitted entry; otherwise the server returns the server-side entry and which triggers CONFLICT on the client side.
- SUBSCRIBE is an event that is triggered by the server if the entry has been modified on the server side. The event contains the modified entry(ies). Action on client side is to go to MERGE.
- MERGE (the server version number is higher than the client version number): Check if the local version number is the same as client version number (i.e. no update on the client happened), then replace client entry with server entry and update ; otherwise go to CONFLICT
- CONFLICT (if the server entry and the user entry have been modified): If the entries are indeed different, show merge dialog to let user resolve the conflict. In case they are the same, set local and client version number to the sever version number (another option would be to update the server version number to the client version number; but that would lead to an unnecessary query on another device).
As an example:
| Server version |
Client version |
Local version |
State |
Operation |
| 2 |
1 |
1 |
Server entry has been updated; no update of client entry |
Replace client entry with server entry |
| 1 |
1 |
2 |
Client entry has been updated; no update of server entry |
Replace server entry with client entry |
| 2 |
1 |
2 |
Client and server entry have been updated |
Let user resolve conflict |
Details and overview of (different) options
Lock systems
One node acquires a lock to write changes. All other write changes that happened during the same time are rejected or marked as conflicts that need be resolved by the user. Here we should favor Optimistic Offline Lock (always allow users to make changes, but commit to server requires lock) over pessimistic locks (only allow writing when one has a lock) since the chance for conflicts is low (A1) and users should always be able to make edits even when offline (A4). Optimistic offline lock is the approach we use currently for for sync with shared databases. In particular, we have a "version" counter that is incremented upon every write process. If the user's version counter is smaller than the remove version counter, then this is an indication that there was an additional update to the remote database in meantime and the local change is rejected.
The current implementation may run into problems when the server is offline (if I'm not mistaken). For example, suppose that two users have the same entry with version counter zero. If user 1 changes the entry while being offline, and the server version is updated by user 2 in the meantime (thus version 2 counter is incremented), then version 1 < version 2 indicating that the entry of user 1 needs to be replaced by the remote version. But this would overwrite the changes of user 1.
In general, I have the impression that optimistic locks are great for handling concurrent transactions (two users trying to update the same entry at the same time), but has shortcomings when used as a data synchronization protocol.
Optimistic replication
Locking (in the way we use it) ensures that there is essentially only one valid replica (the server version) and all user copies are almost immediately synced, leading to essentially one copy of the data across multiple nodes. In contrast, optimistic replication accepts that for some short time there might be divergent replica that only eventually converge to an end state. Having divergent versions of data seems to be a realist scenario as users may work offline for some time (A6). The goal of optimistic replication is to have eventual consistency, which should be sufficient for our purposes.
There are different options varying along different axes.
-
Operations
- State-transfer: The entire state is submitted (e.g. the complete entry)
- Operation-transfer: Only the change is submitted (e.g. title changed from A to B, or even more fine grained as "character X inserted at position Y")
- Intermediate options such as sending only version-deltas
In principle, we have easily access to the change operations and implemented listeners already. However, sending these change operations to the server would require that the user is online. In the case that the user is offline we would need to write a changelog to the bib file, which would be sent to the server once the user goes online again (and upon successful transfer to the server the changelog is reset again). Using an external store does not work because then the bib file is not enough to replay the changes, violating A7. In the same vain, when the user edits the bib file manually, we don't have any direct operation information (A8). For these reasons, we definitely need to support state-transfer at least as a fallback.
-
Conflict-raising
- Just ignore conflicts, e.g. ignore later updates
- Syntactic policies are only based on the timing of changes, e.g. if user 1 makes changes while user 2 changed the same entry, then this is marked as a conflict.
- Semantic policies use application/operation specific knowledge to reduce conflicts.
Syntactic policies are simpler and generic but cause more conflicts, whereas semantic policies are more flexible but also more complicated and application-specific.
Ignoring conflicts is not an option in my opinion due to A3. I would mainly use with a mostly-synatic policy, enriching it with simple semantic rules. For example, additions of new different fields shouldn't result in conflicts. But I wouldn't try to analyze the user's intent if both edit the same field. This marks some changes as conflicts that could in principle be reconciled, eg starting from "A is B", user 1 changes "A" to "JabRef" and user 2 changes "B" to "great" could be resolved to "JabRef is great". A border case are special fields, e.g. two keywords are added or readstatus is changed to "skimmed" and on another device to "read". Proposal: start with synatic and enrich it by semantic rules based on user feedback.
-
Propagation: Network Topology
The main communication path should be via the central JabRef server, resulting in a star topology. However, we also need to support the case where the user syncs the database manually (e.g. by using git).
-
Propagation: Degree of synchronicity
- Pull (manually or regularly) asks for new updates
- Push broadcasts new updates
Using GraphQL subscriptions, it should be straightforward to implement a broadcast system for updates (while the user is online). Supplemented by a pull after times of being offline (e.g. JabRef start, connection loss).
-
Maintaining a "happens-before" relationship
- Lamport clocks / scalar clocks: Every node has its own clock. When submitting an update, the sender increments the clock and attaches the new value to the object. When an update is received, the local clock is set to max(object value, local value) + 1. When an event A happens before B, then the clock for A is smaller than the clock of B. However, the converse may not true.
- Vector clocks: Every node has an own logical clock (i.e. a counter) and a copy of the other clocks. Open submitting a change, the own clock is increased and then the vector of clocks passed along with the message. When updates are received, the local vector of clocks is updated.
Vector clocks become problematic when the number of nodes becomes large (not an issue for us in the near future) or is dynamic (definitely an issue for us). For the latter see e.g Dynamic Vector Clocks for Consistent Ordering of Events in Dynamic Distributed Applications.
Nice overview of many variations of vector clocks.
Old proposed algorithm
Every node has a vector clock (for each entry) and increments its own time once the entry is modified. Every update message from the server include the vector clock, which is then used to determine the merge strategy by comparing it with the local vector clock:
- If the local and server time for this node is the same, the we know that the user didn't modified the entry locally, so we can simply copy the server entry over the local entry.
- If the times differ, then there were two simultaneous modifications (at the server and locally). Thus, we try to merge the changes automatically and if this doesn't succeed then show the merge dialog (see above for conflict resolution).
When the user reconnects and wants to update (pull sync), then she sends all (shared) ids of entries and the local vector clock to the server. The response will be all entries with more recent times (in any of the clocks).
This is based on SData description.
One question I still have is how to store the vector clock (in the entry?) and name the nodes (user + device name?).
Other approaches:
- Operational transformation: Doesn't work for us as we cannot reliable keep track of all operations.
- Differential Sync
- CRDTs
References:
With the upcoming implementation of an online version for JabRef, we face the issue of synchronizing data across multiple devices (and eventually different users).
Setting
Assumptions
JabRef's requirements are summarized as follows:
Algorithm:
For each entry, we store the following data on the client side:
On server side, the "shared id" and a "server version number" is stored.
Then the sync operations are implemented as follows:
As an example:
Details and overview of (different) options
Lock systems
One node acquires a lock to write changes. All other write changes that happened during the same time are rejected or marked as conflicts that need be resolved by the user. Here we should favor Optimistic Offline Lock (always allow users to make changes, but commit to server requires lock) over pessimistic locks (only allow writing when one has a lock) since the chance for conflicts is low (A1) and users should always be able to make edits even when offline (A4). Optimistic offline lock is the approach we use currently for for sync with shared databases. In particular, we have a "version" counter that is incremented upon every write process. If the user's version counter is smaller than the remove version counter, then this is an indication that there was an additional update to the remote database in meantime and the local change is rejected.
The current implementation may run into problems when the server is offline (if I'm not mistaken). For example, suppose that two users have the same entry with version counter zero. If user 1 changes the entry while being offline, and the server version is updated by user 2 in the meantime (thus version 2 counter is incremented), then version 1 < version 2 indicating that the entry of user 1 needs to be replaced by the remote version. But this would overwrite the changes of user 1.
In general, I have the impression that optimistic locks are great for handling concurrent transactions (two users trying to update the same entry at the same time), but has shortcomings when used as a data synchronization protocol.
Optimistic replication
Locking (in the way we use it) ensures that there is essentially only one valid replica (the server version) and all user copies are almost immediately synced, leading to essentially one copy of the data across multiple nodes. In contrast, optimistic replication accepts that for some short time there might be divergent replica that only eventually converge to an end state. Having divergent versions of data seems to be a realist scenario as users may work offline for some time (A6). The goal of optimistic replication is to have eventual consistency, which should be sufficient for our purposes.
There are different options varying along different axes.
Operations
In principle, we have easily access to the change operations and implemented listeners already. However, sending these change operations to the server would require that the user is online. In the case that the user is offline we would need to write a changelog to the bib file, which would be sent to the server once the user goes online again (and upon successful transfer to the server the changelog is reset again). Using an external store does not work because then the bib file is not enough to replay the changes, violating A7. In the same vain, when the user edits the bib file manually, we don't have any direct operation information (A8). For these reasons, we definitely need to support state-transfer at least as a fallback.
Conflict-raising
Syntactic policies are simpler and generic but cause more conflicts, whereas semantic policies are more flexible but also more complicated and application-specific.
Ignoring conflicts is not an option in my opinion due to A3. I would mainly use with a mostly-synatic policy, enriching it with simple semantic rules. For example, additions of new different fields shouldn't result in conflicts. But I wouldn't try to analyze the user's intent if both edit the same field. This marks some changes as conflicts that could in principle be reconciled, eg starting from "A is B", user 1 changes "A" to "JabRef" and user 2 changes "B" to "great" could be resolved to "JabRef is great". A border case are special fields, e.g. two keywords are added or readstatus is changed to "skimmed" and on another device to "read". Proposal: start with synatic and enrich it by semantic rules based on user feedback.
Propagation: Network Topology
The main communication path should be via the central JabRef server, resulting in a star topology. However, we also need to support the case where the user syncs the database manually (e.g. by using git).
Propagation: Degree of synchronicity
Using GraphQL subscriptions, it should be straightforward to implement a broadcast system for updates (while the user is online). Supplemented by a pull after times of being offline (e.g. JabRef start, connection loss).
Maintaining a "happens-before" relationship
Vector clocks become problematic when the number of nodes becomes large (not an issue for us in the near future) or is dynamic (definitely an issue for us). For the latter see e.g Dynamic Vector Clocks for Consistent Ordering of Events in Dynamic Distributed Applications.
Nice overview of many variations of vector clocks.
Old proposed algorithm
Every node has a vector clock (for each entry) and increments its own time once the entry is modified. Every update message from the server include the vector clock, which is then used to determine the merge strategy by comparing it with the local vector clock:
When the user reconnects and wants to update (pull sync), then she sends all (shared) ids of entries and the local vector clock to the server. The response will be all entries with more recent times (in any of the clocks).
This is based on SData description.
One question I still have is how to store the vector clock (in the entry?) and name the nodes (user + device name?).
Other approaches:
References: