RTC: Fix connection lost error modal when /wp-json/wp-sync/v1/updates exceeds 16 MiB limit#77724
Conversation
|
The following accounts have interacted with this PR and/or linked issues. I will continue to update these lists as activity occurs. You can also manually ask me to refresh this list by adding the If you're merging code through a pull request on GitHub, copy and paste the following into the bottom of the merge commit message. To understand the WordPress project's expectations around crediting contributors, please review the Contributor Attribution page in the Core Handbook. |
|
👋 Thanks for your first Pull Request and for helping build the future of Gutenberg and WordPress, @danluu! In case you missed it, we'd love to have you join us in our Slack community. If you want to learn more about WordPress development in general, check out the Core Handbook full of helpful information. |
|
At first glance this PR describes a contrived situation. Because we already cap updates at Currently, hitting the That said, if we adjust a handful of limits (possible in future changes), this situation likely becomes somewhat more reasonable and human-reproducible. For example, adjusting |
|
Successfully reproduced the issue! First, change a few limits:
On
|
alecgeatches
left a comment
There was a problem hiding this comment.
Looks good! I was able to reproduce and pushed a handful of cleanup changes and verified it still correctly chunks updates. Will merge when tests pass. Thank you!
What?
This is part of a series of issues and PRs created by a coding agent looking at the output of an AI generated fuzzer. See #77716 for the tracking issue.
This issue of the 16 MiB /wp-json/wp-sync/v1/updates limit causing the connection lost modal is themeatically similar to #77669 in that this also causes a connection lost modal and it's also due to an update that's too large, but this is a distinct failure mode.
The repro shown in the video below is a bit odd in that it's stuffing updates into the title, but the same failure mode should apply in the more reasonable case of putting updates into the body of posts.
annotated-repro.mp4
I don't know how realistic this is, but I didn't file #77631 when the fuzzer surfaced it before it was found by a human developer who was fixing a bug encountered by a real user because I didn't know if 50 rooms was realistic or not. I've gotten the "connection lost" modal a number of times myself from normal use and, in general, think it's nice to get rid of cases that could cause it so that fuzzing can find other cases that might be more realistic even if this case isn't realistic, but I can also see the argument for not adding this much code to fix a corner case issue that might not happen in practice.
BEGIN AI GENERATED TEXT
Why this is distinct
>50 roomsvalidation limit.1 MBsingle-update limit or oversized compaction case.413 Request body is too large.Repro shape
4rooms.40additional numericpostType/post:*entity records into sync.40extra post records with sub-1 MBtitle changes (450 KiBeach).44rooms total24.6 MB413 Request Entity Too Large.Connection lostmodal.Reproduction levels
The full
Connection lostmodal is only visible at the browser/editor layer,but the bug can be reproduced at lower levels by separating the failure into
the client payload shape and the server validator.
Payload-shape repro
This is the lowest-level client-side repro. It constructs the same
SyncPayloadshape that the polling manager sends, while staying under the
50room cap andunder the
1 MiBper-update cap. The resulting JSON body still exceeds theserver's
16 MiBbody cap.Run from any checkout:
Expected result:
roomsis44, each encoded update is614400bytes, andbodyBytesis about24581413, which is greater than16777216.Server-validator repro
This isolates the server behavior: the route-level validator rejects an
oversized raw request body with
rest_sync_body_too_largeand status413.This does not exercise the editor modal, but it proves that the server rejects
the aggregate request before the sync handler stores updates.
This test is in
phpunit/tests/collaboration/wpHttpPollingSyncServer.php.Browser HTTP repro
This is the focused browser repro without relying on a fuzzer campaign. It
loads the extra synced entity rooms, edits them, observes a
POST /wp-json/wp-sync/v1/updatesrequest above16 MiB, observes repeated413responses, and then asserts that the
Connection lostmodal appears.WP_ENV_PORT=8893 npm run wp-env-test start WP_ENV_PORT=8893 WP_BASE_URL=http://localhost:8893 npm exec \ --workspace @wordpress/e2e-tests-playwright -- wp-scripts test-playwright \ test/e2e/specs/editor/collaboration/collaboration-sync-body-size-connection-lost.spec.ts \ --project=chromium WP_ENV_PORT=8893 npm run wp-env-test stopUse a different
WP_ENV_PORTif8893is already occupied.How this was introduced
This appears to be a composition bug in the original HTTP polling sync design,
not a bug introduced by the later
1 MiBsingle-update accounting fix.Relevant history:
c214929139f50337250efe2bb24ff82c3ff2b6aa, made syncing aside-concern in the core-data resolver. This is part of the path where
resolved entity records can be loaded into the sync manager.
48ce44dac7981eb730079563a3a2975b89840fac, added the default HTTPpolling sync provider. Its polling manager built one payload from all
registered
roomStatesand drained each room's queued updates withstate.updateQueue.get(), without an aggregate request-byte budget.c4e4fac0a26bfb2dc38f17c765f2e84266b68b99, removed theIS_GUTENBERG_PLUGINguards around collaborative editing. In the currentpath, any resolved numeric entity record with
entityConfig.syncConfigandno query is loaded into sync.
1be2ef27e6819597dacc3b395caa05670d494194, backported the servervalidation hardening from
WordPress/wordpress-develop#11296. It addedMAX_BODY_SIZE = 16 * MB_IN_BYTES,MAX_ROOMS_PER_REQUEST = 50,MAX_UPDATE_DATA_SIZE = MB_IN_BYTES, and the route-levelvalidate_request()path that returnsrest_sync_body_too_largewithstatus
413.Two later nearby fixes are related but do not fix this aggregate-body bug:
1642980d599be51c7cce7b2dc3a0c052b69ad367, rotates rooms when theregistered room count exceeds
MAX_ROOMS_PER_REQUEST. It addresses the>50rooms failure, but a request with50or fewer rooms can still exceedthe
16 MiBbody cap.a54911b0c49e3b4abea4d6d7ce85c0e2c2bad11e, fixes the separateper-update base64 accounting mismatch. That prevents a single encoded update
from exceeding the
1 MiBupdate limit, but does not limit the total bodysize of a multi-room poll.
The sync endpoint has three independent caps:
50rooms per request1 MiBof encodeddatafor one update16 MiBfor the whole request bodyThe client batches every registered room into one poll request and drains every
room's queued updates into that request. That makes the first two caps
insufficient to protect the third cap. For example,
40rooms with updates farbelow
1 MiBcan still create a request body larger than16 MiBonce theirbase64 update strings, awareness payloads, room metadata, and JSON overhead are
combined.
The number of rooms can grow beyond what a user would perceive as the one
document they are editing because resolved numeric entity records are loaded
into the sync manager. A page that touches many synced post records can
therefore create many sync rooms, each with ordinary-sized queued updates.
The failure then becomes sticky. A
413from the route-level body validatormeans the server rejected the request before storing the updates, but the client
treats it like a generic retryable poll failure. It retries the same oversized
shape after backoff, eventually surfacing the generic
Connection lostmodal.Issue analysis
This is an availability and scalability issue, not evidence of server-side
partial write corruption. The server body-size validator runs before the sync
handler stores updates, so the failing request is rejected as a unit.
The risky part is the client recovery behavior after rejection. The current
generic failure path cannot tell that the payload is structurally too large to
ever succeed. It may restore the same queued updates, or replace failed outgoing
updates with compaction updates in cases where the room already has a cursor.
That behavior is appropriate for ambiguous network errors where the server
might have committed the write, but it is the wrong shape for a deterministic
413: retrying or compacting does not reduce the aggregate request size and cankeep the session in a permanent retry loop.
The fix should preserve these invariants:
sent
the failure toward memory pressure and slower requests
poll batches can make progress
Relevant code path
getEntityRecord()resolves:Fix plan
The least risky fix is client-side request budgeting and batching, with a
specific
413recovery path.soft budget, for example
15 MiB, to leave room for serialization detailsand future metadata.
room in one request. Enforce both:
MAX_ROOMS_PER_REQUESTto
apiFetch, rather than estimating from raw update sizes. The update datais already base64 at this point, so
JSON.stringify( payload )measured withTextEncodershould be close to the request body that the REST endpointreceives.
without losing updates. A safe design is to add queue operations that can
peek at pending updates and then take only the updates assigned to the
current batch. Rooms and updates not assigned to the batch must remain queued.
that room across multiple polls at update boundaries. Do not split a single
Yjs update byte array. A single encoded update should already be bounded by
the per-update limit.
rooms without outgoing updates so incoming updates and awareness do not
starve for secondary rooms.
413/rest_sync_body_too_largeseparately from ambiguous networkfailures. Because the route validator rejected the request before storing
updates, restore the exact attempted updates and retry with a smaller batch
budget. Do not replace them with compaction updates on this path.
exactly on
4131 MiBupdates are sentacross multiple successful polls, with every serialized payload below the
budget
413test proving the next retry shrinks the batch and does not triggerthe disconnect modal
stable
This plan avoids new protocol semantics. It does not require server-side Yjs
chunk reassembly, does not alter sync storage, and does not raise server limits.
It makes the existing polling protocol respect the server's aggregate request
limit before sending.
Five-pass confirmation
Focused isolated reruns on
8895:24,617,807bytes, modal at22.7s, repeated41324,617,433bytes, modal at22.8s, repeated41324,617,457bytes, modal at22.6s, repeated41324,617,468bytes, modal at22.8s, repeated41324,617,453bytes, modal at23.4s, repeated413All
5/5runs reproduced.Conclusion
This is a real
Connection lostcause: many ordinary-sized synced edits acrossmany synced entities can overflow the server's
16 MiBpoll-body cap, yieldingrepeated
413s and the generic disconnect modal.END AI GENERATED TEXT
Use of AI Tools
Except for the text in this PR outside of the AI generated block, everything here was AI generated including the fuzzer code that surfaced this bug.