[core][cgraph] Introduce fault-tolerant PushMutableObject#58866

Open
ruisearch42 wants to merge 12 commits intoray-project:masterfrom
ruisearch42:ft_push

Conversation

@ruisearch42
Contributor

@ruisearch42 ruisearch42 commented Nov 21, 2025

Description

Currently, PushMutableObject() does not retry, and HandlePushMutableObject() does not handle retries or out-of-order messages. This PR adds that support to make the path fault tolerant to network jitter.

Approach -- Chunk-level, version-aware retry

  1. Version tracking: Each write epoch has a version from PlasmaObjectHeader.version. The receiver tracks highest_completed_version_ per object.
  2. Classifying versions in HandlePushMutableObjects:

     | Incoming Version | Classification | Action |
     | --- | --- | --- |
     | ≤ highest_completed | Stale | Immediately reply done=true, discard |
     | = highest_completed + 1 | Active | Write to backing store |
     | > highest_completed + 1 | Future | Buffer the chunk key, reply done=false |

  3. Chunk-level idempotency: Duplicate chunks are detected via a received_chunks set keyed by (offset, version) and are not re-written.
  4. Sender-side retry: Each chunk RPC uses INVOKE_RETRYABLE_RPC_CALL. The callback fires exactly once, either when done=true is received or on the first failure.
  5. Completion: When written_so_far == total_data_size for the active version, metadata is copied, WriteRelease is called, and highest_completed_version_ is updated.

Related issues

Fixes #58426


@ruisearch42 ruisearch42 changed the title [cgraph] Introduce fault-tolerant PushMutableObject [core][cgraph] Introduce fault-tolerant PushMutableObject Nov 21, 2025
@ruisearch42 ruisearch42 added the go add ONLY when ready to merge, run all tests label Nov 21, 2025
Signed-off-by: Rui Qiao <ruisearch42@gmail.com>
ruisearch42 and others added 4 commits December 5, 2025 00:51
@ruisearch42 ruisearch42 marked this pull request as ready for review December 18, 2025 21:58
@ruisearch42 ruisearch42 requested a review from a team as a code owner December 18, 2025 21:58
@ruisearch42
Contributor Author

Hey @dayshah, could you help review this PR? Thanks!

@ray-gardener ray-gardener bot added the core Issues that should be addressed in Ray Core label Dec 19, 2025
received_chunks.insert(chunk_key);
reply->set_done(false);
return;
}

Future version chunks tracked but data never written

When a future version chunk arrives (version > active version), the code inserts the chunk_key into received_chunks_ but returns immediately without writing the actual data to the backing store. Later, when that version becomes active and retries arrive for those chunks, they're found in received_chunks_ and treated as duplicates, returning early without writing data. The comment says "buffer for later processing" but only the chunk key is tracked—the actual data is discarded. This causes data loss for any chunks that arrive before their version becomes active.


Contributor

@jeffreywang-anyscale jeffreywang-anyscale Feb 22, 2026


We can simply remove received_chunks.insert(chunk_key) and let the client retry.


// Step 2: Determine if this is active version or future version
int64_t active_version = highest_completed + 1;
bool is_active_version = (request_version == active_version);

TOCTOU race allows stale retries to corrupt state

A time-of-check-time-of-use race exists between the two lock acquisitions. The code reads highest_completed and performs the stale version check in the first critical section (lines 159-167), then releases the lock and computes is_active_version outside the lock (lines 169-171). When the second lock is acquired (line 177), the stale is_active_version is used without re-validation. If another thread completes the version between the two lock acquisitions, a stale retry can bypass the staleness check, find an empty received_chunks_ (cleaned up by the completing thread), and trigger a new WriteAcquire for an already-completed version. This can leave the object in an inconsistent state since only one chunk would be written.


@daiping8
Contributor

Great fix. May I ask how long it will take for this PR to be merged?

Contributor

@dayshah dayshah left a comment


Can you write a PR description explaining the logic for making it fault-tolerant? It's pretty complicated...

Also I'm not sure if this is really necessary. I doubt messages are dropped regularly at all, especially for compiled graphs where nodes aren't expected to be across AZs, and in the case they are dropped it might be OK to scrap the whole transfer on both sides and start from scratch. This is what the regular object manager push/pull does on failures, and I don't really want to have differing fault tolerance logic for each one.


Status MutableObjectManager::ReadAcquire(const ObjectID &object_id,
std::shared_ptr<RayObject> &result,
int64_t &version_read,
Contributor


can you return this together with the status via StatusOr, instead of an out param?


namespace ray {
namespace rpc {

Contributor


We have RAY_testing_rpc_failure which can inject failures for you and then can test functionality from python with the injected failures.

const ray::rpc::ClientCallback<ray::rpc::PushMutableObjectReply> &callback) {
int64_t version,
const ray::rpc::ClientCallback<ray::rpc::PushMutableObjectReply> &callback,
int64_t timeout_ms) {
Contributor


why add timeout as a param here if it's effectively hardcoded to -1 anyway?

@ruisearch42
Contributor Author

> Great fix. May I ask how long it will take for this PR to be merged?

Hey I was on vacation. I will continue with the change and hopefully we can get this in soon!

@sdtblckgov

Hi Rui, any updates on this? Also encountering the same error. Happy to try and test out your branch if that would be helpful, would I need to build from source?

@daiping8
Contributor

> Hi Rui, any updates on this? Also encountering the same error. Happy to try and test out your branch if that would be helpful, would I need to build from source?

Yes, you need to compile from source, since the changes are in the C++ code.

@EthanAndersonUSA

Any chance this will be fixed? I am encountering the same issue.

@github-actions

This pull request has been automatically marked as stale because it has not had
any activity for 14 days. It will be closed in another 14 days if no further activity occurs.
Thank you for your contributions.

You can always ask for help on our discussion forum or Ray's public slack channel.

If you'd like to keep this open, just leave any comment, and the stale label will be removed.

@github-actions github-actions bot added the stale The issue is stale. It will be closed within 7 days unless there are further conversation label Feb 12, 2026
@EthanAndersonUSA

The fix from the branch seems to resolve this issue. I am running vLLM successfully on 2 nodes with 4x RTX 5090 each.

@github-actions github-actions bot added unstale A PR that has been marked unstale. It will not get marked stale again if this label is on it. and removed stale The issue is stale. It will be closed within 7 days unless there are further conversation labels Feb 12, 2026
@jeffreywang-anyscale jeffreywang-anyscale self-assigned this Feb 20, 2026
@Kelang-Tian

This nicely fixed the problem I was experiencing. When can it be merged into the main branch?


Labels

- core: Issues that should be addressed in Ray Core
- go: add ONLY when ready to merge, run all tests
- unstale: A PR that has been marked unstale. It will not get marked stale again if this label is on it.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Compiled Graph] Unstable Network Conditions Cause Hang with vLLM v1 API in Cross-Node Pipeline Parallelism

8 participants