
Retry when push failed #12872

Closed
fishbone wants to merge 1 commit into ray-project:master from fishbone:fix-push-failure

Conversation

@fishbone
Contributor

Why are these changes needed?

Related issue number

Checks

  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@fishbone
Contributor Author

Run this script on prod pkg:

import numpy as np
import time
import ray
from ray.cluster_utils import Cluster


cluster = Cluster()
# Force task onto the second node.
cluster.add_node(num_cpus=0, object_store_memory=75 * 1024 * 1024)
cluster.add_node(object_store_memory=75 * 1024 * 1024)

ray.init(cluster.address)

@ray.remote
def put():
    return np.random.rand(8 * 1024 * 1024)  # 8M float64 values, ~64 MB

local_ref = ray.put(np.random.rand(8 * 1024 * 1024))
ray.wait([local_ref], num_returns=1)
time.sleep(5)
print("local", local_ref)
remote_ref = put.remote()
time.sleep(5)
print("remote", remote_ref)
print(ray.wait([remote_ref], num_returns=1, timeout=10))
print("del local ref")
del local_ref
time.sleep(5)
print(ray.wait([remote_ref], num_returns=1))

Sometimes, it'll hang there like this:

root@e3331903d656:~/ray/python# python ~/x.py
2020-12-15 05:26:00,857 INFO worker.py:655 -- Connecting to existing Ray cluster at address: 172.17.0.3:6379
[2020-12-15 05:26:00,984 I 7932 7932] logging.cc:272: Set ray log level from environment variable RAY_BACKEND_LOG_LEVEL to -1
local ObjectRef(ffffffffffffffffffffffff0100000001000000)
(pid=raylet) [2020-12-15 05:26:07,244 I 8070 8070] logging.cc:272: Set ray log level from environment variable RAY_BACKEND_LOG_LEVEL to -1
remote ObjectRef(df5a1a828c9685d3ffffffff0100000001000000)
([], [ObjectRef(df5a1a828c9685d3ffffffff0100000001000000)])
del local ref

This happens because when a push from the remote node fails, it is never resent, so the local node can never seal the object, and the wait blocks forever. This PR fixes the issue by retrying on the remote node, but that isn't quite the right approach: we don't want to retry on connection issues such as a disconnect; we probably only want to retry on memory-related or transient errors. However, I noticed that PushReply actually returns nothing, so distinguishing error types isn't possible right now.

Would you suggest adding such info to the gRPC reply?
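The retry policy described above — resend only on memory-related or transient failures, never on connection errors — could be sketched like this in Python. The status strings are hypothetical stand-ins for the `ray::Status` predicates; this is illustrative, not the raylet's actual C++ push path:

```python
import time

# Illustrative sketch only: status strings stand in for ray::Status
# predicates (IsOutOfMemory, IsTimedOut, ...); not Ray's actual API.
RETRYABLE = {"OutOfMemory", "TimedOut", "ObjectStoreFull",
             "TransientObjectStoreFull"}

def push_with_retry(push_once, max_attempts=5, base_delay=0.0):
    """Retry push_once() only when it fails with a transient error.

    push_once() returns "OK" on success, or a status string naming the
    failure. Connection-style failures (e.g. "IOError") are not retried.
    """
    for attempt in range(max_attempts):
        status = push_once()
        if status == "OK":
            return True
        if status not in RETRYABLE:
            return False  # Disconnect or similar: resending won't help.
        time.sleep(base_delay * (2 ** attempt))  # Back off before retrying.
    return False
```

The key point is the early `return False` on non-retryable statuses: a disconnect is not something a resend can fix, so only out-of-memory-style failures re-enter the loop.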

Besides this, I also noticed another issue. The hang doesn't always happen, and when it doesn't, the logs look like this:

2020-12-15 05:25:03,931 INFO worker.py:655 -- Connecting to existing Ray cluster at address: 172.17.0.3:6379
[2020-12-15 05:25:04,429 I 7621 7621] logging.cc:272: Set ray log level from environment variable RAY_BACKEND_LOG_LEVEL to -1
local ObjectRef(ffffffffffffffffffffffff0100000001000000)
(pid=raylet) [2020-12-15 05:25:10,896 I 7759 7759] logging.cc:272: Set ray log level from environment variable RAY_BACKEND_LOG_LEVEL to -1
remote ObjectRef(df5a1a828c9685d3ffffffff0100000001000000)
(pid=raylet) [2020-12-15 05:25:16,532 E 7696 7705] create_request_queue.cc:119: Not enough memory to create object df5a1a828c9685d3ffffffff0100000001000000 after 5 tries, will return OutOfMemory to the client
([ObjectRef(df5a1a828c9685d3ffffffff0100000001000000)], [])
del local ref
([ObjectRef(df5a1a828c9685d3ffffffff0100000001000000)], [])
2020-12-15 05:25:21,642 ERROR worker.py:976 -- Possible unhandled error from worker: ray::put() (pid=7759, ip=172.17.0.3)
  File "python/ray/_raylet.pyx", line 504, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 505, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 1468, in ray._raylet.CoreWorker.store_task_outputs
  File "python/ray/_raylet.pyx", line 140, in ray._raylet.check_status
ray.exceptions.ObjectStoreFullError: Failed to put object df5a1a828c9685d3ffffffff0100000001000000 in object store because it is full. Object size is 67109111 bytes.
The local object store is full of objects that are still in scope and cannot be evicted. Tip: Use the `ray memory` command to list active objects in the cluster.
2020-12-15 05:25:23,117 ERROR worker.py:938 -- print_logs: Connection closed by server.
2020-12-15 05:25:23,143 ERROR worker.py:1032 -- listen_error_messages_raylet: Connection closed by server.
2020-12-15 05:25:23,168 ERROR import_thread.py:89 -- ImportThread: Connection closed by server.
Aborted

I think this is the one I encountered before, when I thought the hang had disappeared. From the log, it looks like the object somehow got sealed/added first, which makes the first wait return the object as ready, but the exception is then thrown later. I haven't investigated this one yet — do you have any ideas about it?

@ericl @stephanie-wang

RAY_LOG(DEBUG) << "Push for " << dest_id << ", " << obj_id
               << " completed, remaining: " << NumPushesInFlight();
  }
} else {
Contributor Author

@fishbone fishbone Dec 15, 2020


I plan to resend only for errors like status.IsOutOfMemory() || status.IsTimedOut() || status.IsObjectStoreFull() || status.IsTransientObjectStoreFull(), but it looks like we convert every grpc::Status to ray::Status::IOError, so extra information needs to be returned.
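One way to carry that extra information could be an explicit error code on the reply, sketched here in Python with hypothetical field and status names (the real change would be a field on the PushReply protobuf and its C++ handler):

```python
from dataclasses import dataclass

# Hypothetical sketch: if PushReply carried an explicit error code instead
# of everything collapsing into IOError, the sender could decide whether a
# resend is worthwhile. Field and status names are illustrative, not Ray's.

@dataclass
class PushReply:
    error_code: str = "OK"

RESENDABLE = {"OutOfMemory", "TimedOut", "ObjectStoreFull",
              "TransientObjectStoreFull"}

def should_resend(reply: PushReply) -> bool:
    # Only memory-related or transient failures justify a resend.
    return reply.error_code in RESENDABLE
```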

@fishbone
Contributor Author

Weird... I started from a fresh environment, installed a nightly build, and both cases just disappeared :(
I'm developing in a Docker container with Ubuntu, which might be the cause. I'll close this PR for now and may reopen it later if I see similar issues.

@fishbone fishbone closed this Dec 15, 2020
@fishbone
Contributor Author

I'm reopening this PR. The nightly build I was using turns out to be from way back on Nov 3 according to ray.__commit__. I set up a new Ubuntu environment (not a container) and rebuilt trunk; the problem still exists.
@ericl mentioned that there should be a retry on the pull manager side; I'll take a look at this tonight.

Btw, I reverted #12335, and at least the hang no longer happens. Sometimes the first wait still returns the item as ready; I suppose that's because the current API reports an item as ready if it appears anywhere in the cluster. After introducing fetch_local, I think this needs to be fixed too.

@ericl
Contributor

ericl commented Dec 15, 2020

@wuisawesome can you look into the hang (see test case above)? It seems we are missing the push retry in #12335.

@wuisawesome
Contributor

I see this message repeated 13 times in the raylet log:

[2020-12-15 22:28:47,712 I 1269999 1269999] node_manager.cc:3149: Broadcasting Python GC request to all raylets since the cluster is low on resources. This removes Ray actor and object refs that are stuck in Python reference cycles.
[2020-12-15 22:28:47,715 I 1269999 1270014] eviction_policy.cc:135: There is not enough space to create this object, so evicting 0 objects to free up 0 bytes. The number of bytes in use (before this eviction) is 67109119.
[2020-12-15 22:28:47,715 W 1269999 1270014] store.cc:314: Not enough memory to create the object df5a1a828c9685d3ffffffff0100000001000000, data_size=67109113, metadata_size=6
[2020-12-15 22:28:47,715 I 1269999 1269999] node_manager.cc:3149: Broadcasting Python GC request to all raylets since the cluster is low on resources. This removes Ray actor and object refs that are stuck in Python reference cycles.
[2020-12-15 22:28:47,715 I 1269999 1270039] object_manager.cc:678: ReceiveObjectChunk index 12 of object df5a1a828c9685d3ffffffff0100000001000000 failed: object does not fit in the plasma store, overall 13/13 failed
[2020-12-15 22:28:47,729 I 1269999 1269999] node_manager.cc:566: Sending Python GC request to 1 local workers to clean up Python cyclic references.

Nothing after it.

@wuisawesome
Contributor

Problem is the retry timer isn't getting reset. I'll submit a PR shortly.
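The bug can be illustrated with a tiny one-shot timer sketch (names are hypothetical, not the pull manager's actual API): if the timer is never re-armed after a failed receive, it never fires again, no new pull request is issued, and the wait hangs.

```python
# Hypothetical illustration of the timer bug; not Ray's pull manager code.
class PullRetryTimer:
    def __init__(self, period_s):
        self.period_s = period_s
        self.deadline = None  # No retry scheduled yet.

    def arm(self, now):
        # Schedule the next pull attempt one period from `now`.
        self.deadline = now + self.period_s

    def poll(self, now):
        """Return True exactly once when the deadline passes."""
        if self.deadline is not None and now >= self.deadline:
            self.deadline = None  # One-shot: must be re-armed to fire again.
            return True
        return False

    def on_receive_failed(self, now):
        # The fix: re-arm on failure so another pull request is issued.
        # Without this call, poll() never returns True again and the
        # object is never re-pulled.
        self.arm(now)
```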

@fishbone
Contributor Author

Thanks for working on the fix. Please let me know when it's fixed and I'll rebase my PR on yours.

@wuisawesome
Contributor

@ahbone here's the PR: #12907

Thanks for the nice report/reproduction!

@fishbone
Contributor Author

np
