Upload Media: Add retry with exponential backoff and network resilience#76765
Adds automatic retry with exponential backoff for transient upload failures (network errors, timeouts, server errors). Also adds a hook to detect and regenerate missing image sub-sizes when the editor loads.

Key features:
- ErrorCode enum for error classification and retry decisions
- Exponential backoff with jitter to prevent thundering herd
- PendingRetry status and retry selectors for state tracking
- useMissingSizesCheck hook for missing image sub-size detection
- queueMissingSizeGeneration action for client-side sub-size generation

Depends on #74917 for error handling infrastructure.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Size Change: +746 B (+0.01%) Total Size: 8.21 MB
- Remove automatic missing sizes generation on editor load
- Add info icon indicator on images with missing sub-sizes in the editor
- Add "Missing image sizes" panel in pre-publish checks with Generate action
- Add network reconnection handling to pause/resume upload queue on offline/online

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@andrewserong I took another pass at how we could handle resiliency by introducing some user-facing UI for the case when uploads have failed for some reason and there are missing image sizes. I also added some resilience for when the browser goes offline in the middle of processing and then comes back online: the processing should continue. This probably needs some design feedback; I wanted to give it a try to see how it feels versus "automatic" missing size generation, which might be unexpected for users, as you pointed out.
Flaky tests detected in a635139. 🔍 Workflow run URL: https://github.com/WordPress/gutenberg/actions/runs/26517048387
@jasmussen Since you are digging into client side media, I would appreciate your feedback on the UI I am proposing here!
The resumeQueue thunk in private-actions.ts dispatches executeRetry to re-arm PendingRetry items, but executeRetry was missing from the local ActionCreators type, causing the TypeScript build to fail with TS2339.
The getPendingRetryItems, getItemNextRetryTimestamp, getItemRetryCount, and hasExceededMaxRetries selectors are not consumed anywhere — the retry scheduler reads item.retryCount and settings.retry directly. Drop them and their tests to keep the public surface narrow.
No upload path in this PR constructs an UploadError with a code that would trip isRetryable, so the enum and getter add surface area without behavior. Rely on the message-pattern matcher in shouldRetryError, which is what already catches every transient failure from fetch and apiFetch. When error-code classification lands in a follow-up, it can reintroduce the typed API alongside the patterns.
Adds a regression test asserting that an UploadError whose message does not match RETRYABLE_MESSAGE_PATTERNS (e.g. IMAGE_TRANSCODING_ERROR) is not retried. Documents the expected behavior after dropping the isRetryable getter.
The retry path added in this PR is gated on settings.retry being set,
but nothing in trunk dispatches updateSettings({ retry: ... }) on the
core/upload-media store. block-editor's useMediaUploadSettings forwards
a fixed allowlist of fields that does not include retry, so the feature
was unreachable in production: failed uploads went straight to
cancelItem's non-retry branch.
Move the retry defaults the unit tests already assume into
DEFAULT_STATE.settings.retry so the feature is active out of the box.
Embedders can still pass settings.retry to MediaUploadProvider to
override or set undefined to opt out (updateSettings is a shallow
merge).
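To illustrate why a shallow merge makes `retry: undefined` an opt-out, here is a minimal sketch. The `Settings` shape, the `exampleOtherSetting` field, and the `mergeSettings` helper are hypothetical stand-ins, not the store's actual reducer; only the default values come from this PR.

```typescript
// Hypothetical, simplified shapes; the real store types are richer.
interface RetrySettings {
	maxRetryAttempts: number;
	initialRetryDelayMs: number;
	maxRetryDelayMs: number;
	backoffMultiplier: number;
	retryJitter: number;
}

interface Settings {
	retry?: RetrySettings | undefined;
	exampleOtherSetting?: string | undefined; // illustrative only
}

const DEFAULT_RETRY_SETTINGS: RetrySettings = {
	maxRetryAttempts: 3,
	initialRetryDelayMs: 1000,
	maxRetryDelayMs: 30000,
	backoffMultiplier: 2,
	retryJitter: 0.1,
};

const DEFAULT_SETTINGS: Settings = {
	retry: DEFAULT_RETRY_SETTINGS,
	exampleOtherSetting: 'kept',
};

// Shallow merge: any key present in `next` wins, even when its value is
// undefined. Keys absent from `next` keep their current value.
function mergeSettings( current: Settings, next: Partial< Settings > ): Settings {
	return { ...current, ...next };
}

// No `retry` key passed: defaults stay active.
const untouched = mergeSettings( DEFAULT_SETTINGS, {} );

// Explicit `retry: undefined`: the key is overwritten, so retry is disabled.
const optedOut = mergeSettings( DEFAULT_SETTINGS, { retry: undefined } );

// Overriding means passing a complete object; nested fields are not merged.
const overridden = mergeSettings( DEFAULT_SETTINGS, {
	retry: { ...DEFAULT_RETRY_SETTINGS, maxRetryAttempts: 1 },
} );
```

Because the merge is shallow, an embedder who wants to change a single retry field would need to spread the defaults into the new `retry` object, as in the last call.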
…ry-logic # Conflicts: # packages/upload-media/src/store/types.ts
Resolve conflicts in upload-media store between this branch's retry machinery and trunk's parent-cancellation cleanup:
- types.ts: keep both Settings additions: this branch's `retry` field and trunk's `mediaDelete` field.
- actions.ts: merge the differing import diffs (keep both ItemStatus + retry-helper imports and trunk's OperationType import). Guard the retry intercept in `cancelItem` on `! item.parentId` and `! item.attachment?.id` so child sideload failures fall through to trunk's parent-cancellation path, and so the recursive cancelItem(parent, wrappedError) call from that path does not get rerouted into a retry for an attachment that already uploaded.
- Update the existing parent-cancellation tests so they pass alongside the retry tests (258 passing).
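The guard described above can be sketched roughly as follows. This is an assumption-laden illustration: `QueueItemLike`, `RetryGuardInput`, and `shouldInterceptForRetry` are hypothetical names, and the exact set of clauses in the real `cancelItem` may differ. Only the `parentId` and `attachment?.id` checks are taken directly from the commit message.

```typescript
// Hypothetical, trimmed-down item shape for illustration.
interface QueueItemLike {
	id: string;
	parentId?: string; // set on child sideload items
	attachment?: { id?: number }; // id is set once the upload has succeeded
}

interface RetryGuardInput {
	item: QueueItemLike;
	retrySettingsPresent: boolean; // assumption: retry is skipped when settings are absent
	silent: boolean; // assumption: silent cancellations are never retried
	errorIsRetryable: boolean; // result of the message-pattern check
}

// Returns true only when the failure should be rerouted into a retry
// instead of falling through to the normal cancellation path.
function shouldInterceptForRetry( input: RetryGuardInput ): boolean {
	const { item, retrySettingsPresent, silent, errorIsRetryable } = input;
	return (
		retrySettingsPresent &&
		! silent &&
		! item.parentId && // child failures go to parent-cancellation
		! item.attachment?.id && // already uploaded: nothing to retry
		errorIsRetryable
	);
}

const base = { retrySettingsPresent: true, silent: false, errorIsRetryable: true };

const topLevel = shouldInterceptForRetry( { ...base, item: { id: 'a' } } );
const child = shouldInterceptForRetry( { ...base, item: { id: 'b', parentId: 'a' } } );
const uploaded = shouldInterceptForRetry( {
	...base,
	item: { id: 'a', attachment: { id: 42 } },
} );
```

Here only the top-level item without an attachment id is intercepted; the child item and the already-uploaded parent both fall through to cancellation.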
Collapse the four-clause retry guard onto one line to satisfy Prettier's print-width rule (was failing CI lint:js).
Testing of automatic retries worked well. In this screencast, I test turning off the network mid-processing twice; each time, once the connection was restored, the sideloading process continued: automatic.retry.mp4

I also tested starting offline: uploading was paused, and after going online, uploading resumed: start.while.offline.mp4
The editor's media-upload wrapper forwards only the error message (a string) to the upload queue, so cancelItem receives a string rather than an Error instance. shouldRetryError read error.message, which is undefined on a string, so transient upload failures never matched the retryable patterns and were never retried in the real editor flow. Accept Error | string in shouldRetryError and match the string directly. Add unit coverage for string-form messages.
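A minimal sketch of a matcher that accepts both forms is shown below. The function signature and the pattern list are assumptions for illustration: the patterns include only the literal strings named in this PR's description, and the real `RETRYABLE_MESSAGE_PATTERNS` in `store/utils/retry.ts` may be broader.

```typescript
// Assumed subset of retryable patterns, taken from the strings named in the PR.
const RETRYABLE_MESSAGE_PATTERNS: RegExp[] = [
	/Failed to fetch/i, // Chrome
	/Load failed/i, // Safari
	/ECONNRESET/,
	/ETIMEDOUT/,
	/ENOTFOUND/,
];

// Hypothetical signature: the real helper may take the item and settings instead.
function shouldRetryError(
	error: Error | string,
	retryCount: number,
	maxRetryAttempts: number
): boolean {
	// Stop once the retry budget is used up.
	if ( retryCount >= maxRetryAttempts ) {
		return false;
	}
	// The editor wrapper forwards a plain string, so read the message from
	// either form rather than assuming an Error instance.
	const message = typeof error === 'string' ? error : error.message;
	return RETRYABLE_MESSAGE_PATTERNS.some( ( pattern ) => pattern.test( message ) );
}

const fromString = shouldRetryError( 'Failed to fetch', 0, 3 );
const fromError = shouldRetryError( new TypeError( 'Load failed' ), 1, 3 );
const nonTransient = shouldRetryError( 'The image could not be transcoded', 0, 3 );
const exhausted = shouldRetryError( 'Failed to fetch', 3, 3 );
```

With these inputs the string and `Error` forms both match, while the non-transient message and the exhausted budget do not.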
Cover the client-side upload retry path end to end: - recovery: abort the first create request, assert the automatic retry recovers the upload and the block resolves to a server URL. - exhaustion: abort every attempt, assert the error snackbar appears and the create endpoint is hit initial-attempt + maxRetryAttempts times. These run through the CSM pipeline (where retry lives) and use the existing skipIfClientSideMediaInactive guard, so they skip gracefully where cross-origin isolation is unavailable.
Add an editor CHANGELOG entry for the offline-pause/online-resume queue behavior added by the useNetworkReconnect hook, and broaden the upload-media retry entry to note the queue can pause and resume.
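The offline-pause/online-resume behavior can be sketched outside React as a plain listener setup. This is not the `useNetworkReconnect` hook itself: `attachNetworkReconnect` and `QueueControls` are hypothetical names, the event target is injected so the sketch runs without a browser `window`, and pausing immediately when starting offline is an assumption based on the manual test above.

```typescript
// Hypothetical interface standing in for the store's queue actions.
interface QueueControls {
	pauseQueue(): void;
	resumeQueue(): void;
}

// Wires offline/online events to the queue and returns a cleanup function.
function attachNetworkReconnect(
	target: EventTarget,
	queue: QueueControls,
	isOnline: boolean
): () => void {
	// Assumption: if the page starts offline, pause right away.
	if ( ! isOnline ) {
		queue.pauseQueue();
	}

	const onOffline = () => queue.pauseQueue();
	const onOnline = () => queue.resumeQueue();

	target.addEventListener( 'offline', onOffline );
	target.addEventListener( 'online', onOnline );

	return () => {
		target.removeEventListener( 'offline', onOffline );
		target.removeEventListener( 'online', onOnline );
	};
}

// Usage with a stand-in target; in the editor this would be `window`.
const calls: string[] = [];
const target = new EventTarget();
const detach = attachNetworkReconnect(
	target,
	{
		pauseQueue: () => calls.push( 'pause' ),
		resumeQueue: () => calls.push( 'resume' ),
	},
	true
);

target.dispatchEvent( new Event( 'offline' ) );
target.dispatchEvent( new Event( 'online' ) );
detach();
target.dispatchEvent( new Event( 'offline' ) ); // ignored after cleanup
```

Returning the cleanup function mirrors what a React effect would return, so listeners do not outlive the provider.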
This is a bit of a drive by test. 😄 I found something I think is related, but I'm not sure. I manually tested so might have missed something. After interrupting an upload, the post remains dirty ( I confirmed the lock is coming from the upload queue by running:
That immediately re-enabled Save. I guess the save lock is derived from |
Yes, it should. The lock is to prevent users from publishing the post with incomplete uploads. |
…ry-logic # Conflicts: # packages/editor/CHANGELOG.md # packages/upload-media/CHANGELOG.md
- Note automatic retry with exponential backoff for resilient uploads (#76765)
- Reframe cross-origin isolation as a capability (SharedArrayBuffer) extenders can use, not an implementation detail
- Clarify that Firefox/Safari are unsupported for the full pipeline, with a pointer to the compatibility section
- Add publication-date browser-version context and a Chrome Platform Status link for tracking Document-Isolation-Policy support
- Fix the Safari compatibility row wording and stray parenthesis
- Merge the duplicated 'Feature detection thresholds' and 'Known limitations' sections into one
- Explain why the AVIF MIME-check exception carries minimal security risk

@andrewserong this is ready for additional review/testing when you have a moment!
andrewserong
left a comment
Thanks for the ping! Just gave this a re-test and it's working very nicely indeed for me. If I switch to offline half-way through an upload and switch back on, it gracefully picks things up and to a user it's seamless. I imagine this covers the majority use case of this — a user on a flaky internet connection that drops out while they're mid-upload. So this feels good 👍
> Yes, it should. The lock is to prevent users from publishing the post with incomplete uploads.

Sounds good. And this has been working fine in my testing so far. One thing to keep in the back of our minds is that eventually someone might want to develop an offline mode to support locally saving overall post content while offline, in which case we might want to revisit the locking behaviour here. It's a bit of a moot point right now, though, as we'd also need to revisit how "uploading" media works in an offline context!
In any case, thanks again for all the back and forth here, this is a lovely bit of polish for the overall experience of uploading media 🚀
 * Three total attempts (initial + 2 retries) with exponential backoff:
 * ~1s, then ~2s, capped at 30s. The jitter factor adds randomness to
 * the delay so simultaneous failures do not retry in lockstep.
 */
export const DEFAULT_RETRY_SETTINGS = {
	maxRetryAttempts: 3,
This is such a tiny nit, but with maxRetryAttempts set to 3 the comment is out of date as it only mentions 2 retries.
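To make the attempt count concrete, here is a rough sketch of how the delay schedule could be computed from the default settings. It is not the package's `calculateRetryDelay`: the parameter order, the injected `random` function, and applying the cap before the jitter are all assumptions made for illustration.

```typescript
interface RetrySettings {
	maxRetryAttempts: number;
	initialRetryDelayMs: number;
	maxRetryDelayMs: number;
	backoffMultiplier: number;
	retryJitter: number;
}

const DEFAULT_RETRY_SETTINGS: RetrySettings = {
	maxRetryAttempts: 3,
	initialRetryDelayMs: 1000,
	maxRetryDelayMs: 30000,
	backoffMultiplier: 2,
	retryJitter: 0.1,
};

// retryCount is the number of retries already made (0 for the first retry).
// `random` is injectable so the schedule can be checked deterministically.
function calculateRetryDelay(
	retryCount: number,
	settings: RetrySettings,
	random: () => number = Math.random
): number {
	const exponential =
		settings.initialRetryDelayMs *
		Math.pow( settings.backoffMultiplier, retryCount );
	// Assumption: the cap is applied before jitter.
	const capped = Math.min( exponential, settings.maxRetryDelayMs );
	// Map random() in [0, 1) onto a factor in [1 - jitter, 1 + jitter).
	const jitterFactor = 1 + ( random() * 2 - 1 ) * settings.retryJitter;
	return Math.round( capped * jitterFactor );
}

// With jitter neutralised (random() === 0.5), one delay per retry:
const schedule = Array.from(
	{ length: DEFAULT_RETRY_SETTINGS.maxRetryAttempts },
	( _unused, retryCount ) =>
		calculateRetryDelay( retryCount, DEFAULT_RETRY_SETTINGS, () => 0.5 )
);
// schedule is [ 1000, 2000, 4000 ]: three retries, so four attempts in total.
```

Under these assumptions, `maxRetryAttempts: 3` yields three delays (about 1s, 2s, and 4s), which is the schedule the code comment would need to describe.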
The auto-retry follow-up (#76765) shipped its own retryability check based on error-message patterns (RETRYABLE_MESSAGE_PATTERNS in store/utils/retry.ts), because uploads reject with plain Error instances from fetch/api-fetch rather than UploadError carrying a code. That leaves this PR's UploadError#isRetryable getter, the RETRYABLE_CODES allowlist, and the aspirational NETWORK_ERROR / TIMEOUT_ERROR / SERVER_ERROR codes (and their messages) unused and unproduced. Remove them so the ErrorCode enum only describes failures the package actually throws, keeping getErrorMessage as the user-facing message layer. Also relocate the unmerged #74917 CHANGELOG entries from the released 0.33.0 section back to Unreleased, where they belong, and drop the isRetryable mention.
PauseQueue and ResumeQueue are already covered by trunk's reducer tests (landed with the retry work in #76765), and CacheBlobUrl/RevokeBlobUrls test reducers unrelated to this PR's error-taxonomy scope. These were leftovers from when retry logic lived on this branch; the reducer test file now matches trunk. Addresses review feedback from andrewserong.
Summary
Fixes #76790
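Enhances client-side media processing with two reliability improvements: automatic retry with exponential backoff for transient upload failures, and network resilience that pauses the upload queue while the browser is offline and resumes it on reconnect.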
These are background features without user-facing controls. Defaults are wired into the `core/upload-media` store's initial state, so the feature is active out of the box wherever the upload pipeline is used; embedders can override or opt out by passing a different `retry` value (or `undefined`) to `MediaUploadProvider`.

Retry Flow
flowchart LR
    A[Upload] --> B[Processing]
    B --> C{Success?}
    C -->|Yes| D[Complete]
    C -->|No| E{Retryable?}
    E -->|No| F[Failed]
    E -->|Yes| G{Retries < 3?}
    G -->|No| F
    G -->|Yes| H[Wait with backoff]
    H -->|1s / 2s / 4s| B

Retryability is determined by matching the error message against a small set of patterns covering Chrome (`Failed to fetch`), Safari (`Load failed`), Node DNS/TCP errors (`ECONNRESET`, `ETIMEDOUT`, `ENOTFOUND`), and the `@wordpress/api-fetch` `fetch_error` message - the rejections actually thrown by the existing upload path. A typed `UploadError`/`ErrorCode` classification can layer on top in a follow-up without changing this PR's behavior.

Network Resilience

flowchart LR
    A[Upload Queue] --> B{Browser Online?}
    B -->|Yes| C[Continue Processing]
    B -->|No| D[Pause Queue]
    D --> E[Wait for reconnect]
    E -->|online event| F[Resume Queue]
    F --> C
    C --> G[Complete]

Screencasts
Test turning off the network mid-processing twice; each time, once the connection was restored, the sideloading process continued:
automatic.retry.mp4
Test starting offline - uploading was paused, then going online - uploading resumed:
start.while.offline.mp4
Defaults

`DEFAULT_STATE.settings.retry` in the store is initialized from `DEFAULT_RETRY_SETTINGS`:
- `maxRetryAttempts`: `3` (4 total POST attempts: initial + 3 retries)
- `initialRetryDelayMs`: `1000`
- `maxRetryDelayMs`: `30000`
- `backoffMultiplier`: `2`
- `retryJitter`: `0.1` (+/-10%)

`updateSettings` is a shallow merge, so `MediaUploadProvider settings={{ retry: undefined }}` cleanly disables the feature.

Test coverage

`@wordpress/upload-media` (20 suites, 238 tests):
- `calculateRetryDelay` - exponential growth, max-delay cap, jitter bounds, multiplier variants
- `shouldRetryError` - every retryable message pattern (Chrome `Failed to fetch`, Safari `Load failed`, Node `ECONNRESET`/`ETIMEDOUT`/`ENOTFOUND`, `apiFetch` `fetch_error`), non-retryable messages, retry-count limits, regression test that an `UploadError` with a non-transient message (e.g. `IMAGE_TRANSCODING_ERROR`) is not retried
- `RetryItem` (status reset, error clear, count increment, fresh `AbortController`), `ScheduleRetry`, `PauseQueue`/`ResumeQueue` (transitions, idempotency, state preservation)
- `cancelItem` retry integration - schedules retry for retryable errors, skips when `silent`, skips for non-retryable, skips when retry settings are absent, clears the pending timer on manual cancel, falls through to cancellation after exhausting `maxRetryAttempts`
- `scheduleRetry`/`executeRetry` - sets `PendingRetry` status, stores error and `nextRetryTimestamp`, fires the timer, replaces the aborted `AbortController`, no-ops when the item is missing or not in `PendingRetry`
- `removeItem` clears the retry timer to prevent timer-map leaks

`@wordpress/editor` (7 tests for `useNetworkReconnect`):
- `__clientSideMediaProcessing` is false / undefined
- `offline`/`online` listeners
- `pauseQueue` on offline, `resumeQueue` on online

Test plan
Automated

- `npx jest packages/upload-media --config test/unit/jest.config.js` - 20 suites pass.
- `npx jest packages/editor/src/components/provider/test/use-network-reconnect.js --config test/unit/jest.config.js` - network reconnect hook tests pass.

Manual: retry with exponential backoff

Prereqs:
- `npm run wp-env start`, open the editor in Chrome.
- `window.__clientSideMediaProcessing === true`. If not, enable "Client-side media processing" under Gutenberg > Experiments and reload.
- `{ maxRetryAttempts: 3, initialRetryDelayMs: 1000, maxRetryDelayMs: 30000, backoffMultiplier: 2, retryJitter: 0.1 }`.

Force a transient failure:
- `*/wp-json/wp/v2/media*`). Alternative: use a Chrome extension like Requestly or ModHeader to force HTTP 500 on that URL pattern. (The "go offline" toggle exercises a different code path - see the network resilience section below.)

Observe the retries:
- `/wp/v2/media` for the same file. Expect 4 total attempts (initial + 3 retries) with gaps of approximately 1s, then 2s, then 4s. Jitter is +/-10%, so accept ~0.9-1.1s, ~1.8-2.2s, ~3.6-4.4s.
- `status: 'PENDING_RETRY'` while waiting, with a `nextRetryTimestamp` slightly in the future. Expect `status: 'PROCESSING'` and an incrementing `retryCount` during each in-flight POST.

Manual: network resilience

- `getAllItems()` shows them in `PROCESSING`/`PENDING_RETRY` rather than being cancelled).
- `PENDING_RETRY` re-execute) without any user action.

Manual: opt-out path (optional, for embedders)

- `wp.data.dispatch('core/upload-media').updateSettings({ retry: undefined })` (this requires `unlock`, so most testers can skip), then trigger a transient failure as above. Expect the item to fail on the first error with no retries.

Use of AI
Most of the code and this PR description were written by Claude Code with careful prompting. I reviewed the code manually.