Real-time collaboration: Implement proof-of-concept long-polling (SSE) sync provider#74331

Closed
chriszarate wants to merge 2 commits into trunk from try/rest-api-sse-sync-provider-2

Conversation

@chriszarate

Contributor

What?

A proof-of-concept exploring a default Yjs provider based on long-polling (server-sent events) and state stored in the WordPress database. See #74085

Why?

The current default provider is based on WebRTC and is unreliable in certain network conditions. Making it reliable requires centralized infrastructure that is probably unattainable.

An alternative default transport could use long-polling against an internal endpoint with state stored in the WordPress database. This approach would face performance issues on medium-to-large sites but would allow users to explore collaborative editing under some protective limits (e.g., a maximum of two simultaneous collaborators). Moving beyond these protective limits would require a more robust host-provided transport such as WebSockets.

How?

  1. Implement a new Yjs provider: HttpSseProvider
  2. Register a new REST API endpoint: /sync/v1/messages. The new provider will connect to this endpoint.
    • Messages (Yjs document updates) are sent to this endpoint via POST requests.
    • Messages are consumed via EventSource connections to this endpoint.
      • If a client connects and no other clients have connected recently, the server closes the connection and the client will reconnect after a short time. This is a naïve implementation of "lazy connections" that will reduce the number of consumed connections overall.
      • If other clients have recently connected, then the connection is kept open and new messages are sent as server-sent events.
  3. There is no PHP Yjs library. Without writing one, we have no ability to apply updates from connected clients onto a central document stored on the server. The server can only naïvely replay messages received by the server to other connected clients.
  4. Not all WordPress installations implement a shared object cache. PHP processes are isolated and there is no universal method for cooperative communication. Therefore, we store messages in transients, which are persisted in the WordPress database—or persistent object cache, if configured.
  5. The server cannot indefinitely store messages for replay; long-running sessions would eventually exhaust memory and storage. Nor can the server reliably determine when a user has left the session.
    • As a workaround, we ask clients to periodically send a "snapshot" of their local document and the ID of the last message applied to it. Upon receiving this snapshot, the server can (a) confirm that it is the "latest" available snapshot, (b) store that snapshot as an update, and (c) delete messages older than the indicated last message.
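
The server-side behavior in steps 2 to 5 can be sketched as a small in-memory model. This is an illustration only, not the code in this PR: the proof-of-concept is PHP and stores messages in transients, and every name below (`MessageLog`, `append`, `readAfter`, `shouldKeepOpen`, `applySnapshot`), the numeric message IDs, and the time-window parameter are assumptions.

```typescript
// Hypothetical in-memory model of the message log described above.
// Names, numeric IDs, and the time window are assumptions for illustration.

interface SyncMessage {
  id: number;      // increasing message ID (assumed numeric)
  payload: string; // opaque encoded Yjs update; the server never decodes it
}

class MessageLog {
  private messages: SyncMessage[] = [];
  private nextId = 1;
  private snapshotCoversUpTo = 0;
  private lastSeen = new Map<string, number>(); // clientId -> last connect time (ms)

  // POST: store an update so it can be replayed to other clients.
  append(payload: string): number {
    const id = this.nextId++;
    this.messages.push({ id, payload });
    return id;
  }

  // EventSource: replay everything newer than the client's last applied ID.
  readAfter(lastMessageId: number): SyncMessage[] {
    return this.messages.filter((m) => m.id > lastMessageId);
  }

  // "Lazy connections": record this connection, then keep it open only if
  // some other client connected within the window.
  shouldKeepOpen(clientId: string, now: number, windowMs: number): boolean {
    this.lastSeen.set(clientId, now);
    let othersRecent = false;
    this.lastSeen.forEach((seenAt, other) => {
      if (other !== clientId && now - seenAt <= windowMs) othersRecent = true;
    });
    return othersRecent;
  }

  // Snapshot: reject it if an earlier snapshot already covers more history;
  // otherwise drop the messages it supersedes and store it as an update.
  applySnapshot(payload: string, lastMessageId: number): boolean {
    if (lastMessageId < this.snapshotCoversUpTo) return false; // stale
    this.messages = this.messages.filter((m) => m.id > lastMessageId);
    this.snapshotCoversUpTo = lastMessageId;
    this.append(payload);
    return true;
  }
}
```

Because the server cannot merge Yjs updates, the snapshot is stored as just another opaque message; compaction only works because clients tell the server which message ID the snapshot already includes.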

Limitations and considerations

  1. There are likely some state bugs or race conditions in this new client-server interchange. This is simply a proof-of-concept for discussion.
  2. The sync manager creates a new provider instance for each entity being synced. Currently, the HttpSseProvider opens a new EventSource connection for each instance. As we provide support for additional entity syncing, this will consume more and more HTTP connections, overwhelming lower-resourced hosts.
    • If we move forward with this approach, we will probably want to reuse a single EventSource connection, which will require separately tracking the rooms and last_message_ids for each client.
  3. Yjs providers are meshable. We could ship multiple default providers that provide progressive enhancement depending on the host's configuration and resources.
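
To make the second limitation concrete, here is a minimal sketch of how several provider instances might share one connection while keeping a separate cursor per room. Nothing here is from the PR: `SharedConnection`, the `rooms=` query format, and the numeric message IDs are hypothetical, and the actual EventSource wiring is left out so that only the bookkeeping is shown.

```typescript
// Hypothetical registry that lets many provider instances share one
// connection. The query format and all names are assumptions.

type RoomHandler = (payload: string) => void;

interface RoomState {
  lastMessageId: number; // per-room cursor, sent on every (re)connect
  handler: RoomHandler;
}

class SharedConnection {
  private rooms = new Map<string, RoomState>();

  subscribe(room: string, handler: RoomHandler): void {
    this.rooms.set(room, { lastMessageId: 0, handler });
  }

  unsubscribe(room: string): void {
    this.rooms.delete(room);
  }

  // Query string for the single connection: each room with its own cursor.
  buildQuery(): string {
    const parts: string[] = [];
    this.rooms.forEach((state, room) => {
      parts.push(`${encodeURIComponent(room)}:${state.lastMessageId}`);
    });
    return `rooms=${parts.join(",")}`;
  }

  // Route one incoming message to its room and advance that room's cursor.
  // Returns false for unknown rooms and for already-seen message IDs.
  dispatch(room: string, messageId: number, payload: string): boolean {
    const state = this.rooms.get(room);
    if (!state || messageId <= state.lastMessageId) return false;
    state.lastMessageId = messageId;
    state.handler(payload);
    return true;
  }
}
```

With this shape, adding a synced entity costs one map entry rather than one HTTP connection; the trade-off is that the connection must be re-established (or the server notified) whenever the set of rooms changes.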

Testing Instructions

  1. Check out this PR.
  2. Enable the collaborative editing experiment (Gutenberg > Experiments).
  3. Open a post for editing in two browsers.

Testing Instructions for Keyboard

n/a

Screenshots or screencast

sse-sync.mov

@chriszarate chriszarate added [Feature] Real-time Collaboration Phase 3 of the Gutenberg roadmap around real-time collaboration [Type] Experimental Experimental feature or API. labels Jan 2, 2026
@github-actions

github-actions Bot commented Jan 2, 2026


The following accounts have interacted with this PR and/or linked issues. I will continue to update these lists as activity occurs. You can also manually ask me to refresh this list by adding the props-bot label.

Unlinked Accounts

The following contributors have not linked their GitHub and WordPress.org accounts: @nickchomey.

Contributors, please read how to link your accounts to ensure your work is properly credited in WordPress releases.

If you're merging code through a pull request on GitHub, copy and paste the following into the bottom of the merge commit message.

Unlinked contributors: nickchomey.

Co-authored-by: chriszarate <czarate@git.wordpress.org>
Co-authored-by: maxschmeling <maxschmeling@git.wordpress.org>

To understand the WordPress project's expectations around crediting contributors, please review the Contributor Attribution page in the Core Handbook.

@maxschmeling

Contributor

How will this handle the situation where the transient is deleted early? My initial impression is that it would stop being able to sync until a full snapshot is sent again? What if the transient storage is under pressure and being deleted frequently? Do we need to worry about that scenario?

Everyone seems to misunderstand how transient expiration works, so the long and short of it is: transient expiration times are a maximum time. There is no minimum age. Transients might disappear one second after you set them, or 24 hours, but they will never be around after the expiration time.

https://developer.wordpress.org/apis/transients/
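
The "maximum, not minimum" point can be modeled in a few lines. This is a toy stand-in rather than the WordPress Transients API: `TransientLikeCache`, `evictEarly`, `loadMessages`, and the `sync_` key prefix are all hypothetical names chosen for the illustration.

```typescript
// Hypothetical model of a cache with transient-like semantics: the TTL is an
// upper bound on lifetime, and an entry can disappear at any earlier moment.
class TransientLikeCache {
  private entries = new Map<string, { value: string[]; expiresAt: number }>();

  set(key: string, value: string[], ttlMs: number, now: number): void {
    this.entries.set(key, { value, expiresAt: now + ttlMs });
  }

  // Returns undefined for expired and for evicted entries alike; callers
  // cannot tell the two cases apart.
  get(key: string, now: number): string[] | undefined {
    const entry = this.entries.get(key);
    if (!entry || now >= entry.expiresAt) {
      this.entries.delete(key);
      return undefined;
    }
    return entry.value;
  }

  // Simulates eviction under storage pressure, which the application does
  // not control.
  evictEarly(key: string): void {
    this.entries.delete(key);
  }
}

// Treat a miss as an empty log rather than an error: any read may find that
// the stored messages are gone, however recently they were written.
function loadMessages(cache: TransientLikeCache, room: string, now: number): string[] {
  const stored = cache.get(`sync_${room}`, now);
  return stored === undefined ? [] : stored;
}
```

Any code reading sync data from such a store has to treat "nothing there" as a normal outcome on every read, which is the scenario the question above is about.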

@chriszarate

chriszarate commented Jan 5, 2026

Contributor Author

How will this handle the situation where the transient is deleted early? My initial impression is that it would stop being able to sync until a full snapshot is sent again?

Sync would continue to function, but individual peers may not have an up-to-date representation of the Yjs document state until they receive a more complete snapshot from another peer.

What if the transient storage is under pressure and being deleted frequently?

Great point. Because eviction is not controlled by the application, neither transients nor object cache are ideal persistence layers for sync data. Under severe pressure where sync data cannot survive longer than 30s, I'd guess that syncing may cease to reliably function.

This PR is really just to show that it's possible to implement a Yjs provider backed by the WordPress database (and not some other network service), and therefore provide a sync transport that works (in theory) on every WordPress installation. Instead of transients, maybe we should target a new built-in post type and manage evictions manually? Or perhaps a better idea will emerge.

@nickchomey

nickchomey commented Jan 11, 2026


I like this approach a lot - SSE has been very underused (until it started to have a renaissance with LLM chatbots). (nitpick: technically long-polling is a distinct technique from SSE.)

Because eviction is not controlled by the application, neither transients nor object cache are ideal persistence layers for sync data. Under severe pressure where sync data cannot survive longer than 30s, I'd guess that syncing may cease to reliably function... Instead of transients, maybe we should target a new built-in post type and manage evictions manually? Or perhaps a better idea will emerge.

A CPT would surely be slower than transients/object cache. Transients are definitely not ideal, but as you said they get stored in the persistent object cache if you are using one (which anyone who is having perf problems should be doing). Redis and Memcached are the most popular, and can handle 100k+ operations per second.

SQLite Object Cache uses SQLite, which any WP install should be able to use, and is faster than Redis for this purpose. It also uses PHP's APCu if available, which is even faster than SQLite and is shared across PHP workers etc...

Anyone who is hitting limits for any of these options is either on terrible hardware or should be able to solve these problems with a custom service outside of WP (because they are running a large enterprise)

@chriszarate

Contributor Author

I like this approach a lot - SSE has been very underused (until it started to have a renaissance with LLM chatbots). (nitpick: technically long-polling is a distinct technique from SSE.)

Thanks @nickchomey! Great point. I agree that SSE can lead to surprisingly responsive collaboration.

Anyone who is hitting limits for any of these options is either on terrible hardware or should be able to solve these problems with a custom service outside of WP (because they are running a large enterprise)

Agreed generally, but our initial goal is to ship a provider that can function, under limits, on just about any host. We can simultaneously light the path for others who may be ready to devote additional resources for a better implementation.

Our first step is focusing on short-polling since SSE support is not guaranteed. Please follow along on this follow-up PR:

#74564

@nickchomey


Thanks for the info. Though, when would SSE support not be guaranteed? It's just normal HTTP with a different header...? Though whether doing anything like that in a synchronous PHP environment is a good idea is another matter entirely.

You just do CQRS: a long-lived SSE connection to push things out, and then ad hoc POST requests to make changes. This is the basis of Mercure. It's written in Go, but is integrated with FrankenPHP and meant for the PHP-based (Symfony) API Platform.

Or are you perhaps referring to any network/proxy complexity that might cause issues for SSE connections?

@chriszarate

Contributor Author

Though, when would SSE support not be guaranteed? It's just normal HTTP with a different header...? [...] Or are you perhaps referring to any network/proxy complexity that might cause issues for SSE connections?

Yes, the main obstacle is output buffering, which might be configured (and not overridable) by the server, or by any HTTP layer in between the client and the server (caching proxies, load balancers, etc.).
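
For readers unfamiliar with the wire format under discussion: an SSE response is an ordinary HTTP response with `Content-Type: text/event-stream` whose body is a sequence of small text frames, each terminated by a blank line. The sketch below serializes one frame. The field names (`id`, `event`, `data`) come from the SSE format itself; the function name `formatSseEvent` and its signature are assumptions made for this illustration.

```typescript
// Serialize one server-sent event frame.
// id/event/data are standard SSE fields; the function itself is hypothetical.
function formatSseEvent(data: string, id?: string, event?: string): string {
  const lines: string[] = [];
  if (id !== undefined) lines.push(`id: ${id}`);
  if (event !== undefined) lines.push(`event: ${event}`);
  // Multi-line data is sent as one "data:" field per line.
  data.split("\n").forEach((line) => {
    lines.push(`data: ${line}`);
  });
  // A blank line marks the end of the frame.
  return lines.join("\n") + "\n\n";
}
```

A frame is only useful if it reaches the client as soon as it is written. If any layer holds the bytes until its buffer fills or the response ends, the client sees nothing for long stretches and then receives many events at once, which is why buffering outside the application's control undermines this transport. The `id` field is also what lets a reconnecting browser report where it left off, via the `Last-Event-ID` request header.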


Labels

[Feature] Real-time Collaboration Phase 3 of the Gutenberg roadmap around real-time collaboration [Package] Sync [Type] Experimental Experimental feature or API.


3 participants