
Retry when push failed #12872

Closed
fishbone wants to merge 1 commit into ray-project:master from fishbone:fix-push-failure

Conversation

@fishbone
Contributor

Why are these changes needed?

Related issue number

Checks

  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@fishbone
Contributor Author

Run this script on prod pkg:

import numpy as np
import time
import ray
from ray.cluster_utils import Cluster


cluster = Cluster()
# Force task onto the second node.
cluster.add_node(num_cpus=0, object_store_memory=75 * 1024 * 1024)
cluster.add_node(object_store_memory=75 * 1024 * 1024)

ray.init(cluster.address)

@ray.remote
def put():
    return np.random.rand(8 * 1024 * 1024)  # 8M float64 values, ~64 MB

local_ref = ray.put(np.random.rand(8 * 1024 * 1024))
ray.wait([local_ref], num_returns=1)
time.sleep(5)
print("local", local_ref)
remote_ref = put.remote()
time.sleep(5)
print("remote", remote_ref)
print(ray.wait([remote_ref], num_returns=1, timeout=10))
print("del local ref")
del local_ref
time.sleep(5)
print(ray.wait([remote_ref], num_returns=1))

Sometimes, it'll hang there like this:

root@e3331903d656:~/ray/python# python ~/x.py
2020-12-15 05:26:00,857 INFO worker.py:655 -- Connecting to existing Ray cluster at address: 172.17.0.3:6379
[2020-12-15 05:26:00,984 I 7932 7932] logging.cc:272: Set ray log level from environment variable RAY_BACKEND_LOG_LEVEL to -1
local ObjectRef(ffffffffffffffffffffffff0100000001000000)
(pid=raylet) [2020-12-15 05:26:07,244 I 8070 8070] logging.cc:272: Set ray log level from environment variable RAY_BACKEND_LOG_LEVEL to -1
remote ObjectRef(df5a1a828c9685d3ffffffff0100000001000000)
([], [ObjectRef(df5a1a828c9685d3ffffffff0100000001000000)])
del local ref

This happens because when a push from the remote node fails, it is never resent, so the local node can never seal the object, and the wait blocks forever. This PR fixes the issue by retrying on the remote node, but that isn't quite the right approach: we don't want to retry on connection issues such as a disconnect; we probably only want to retry on memory-related or transient errors. However, I noticed that PushReply actually returns nothing, so distinguishing error types isn't possible right now.

Would you suggest adding such info to the gRPC reply?
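The retry policy described above — resend only on memory-related or transient failures, never on connection errors — could be sketched like this in Python. The status strings are hypothetical stand-ins for the `ray::Status` predicates; this is illustrative, not the raylet's actual C++ push path:

```python
import time

# Illustrative sketch only: status strings stand in for ray::Status
# predicates (IsOutOfMemory, IsTimedOut, ...); not Ray's actual API.
RETRYABLE = {"OutOfMemory", "TimedOut", "ObjectStoreFull",
             "TransientObjectStoreFull"}

def push_with_retry(push_once, max_attempts=5, base_delay=0.0):
    """Retry push_once() only when it fails with a transient error.

    push_once() returns "OK" on success, or a status string naming the
    failure. Connection-style failures (e.g. "IOError") are not retried.
    """
    for attempt in range(max_attempts):
        status = push_once()
        if status == "OK":
            return True
        if status not in RETRYABLE:
            return False  # Disconnect or similar: resending won't help.
        time.sleep(base_delay * (2 ** attempt))  # Back off before retrying.
    return False
```

The key point is the early `return False` on non-retryable statuses: a disconnect is not something a resend can fix, so only out-of-memory-style failures re-enter the loop.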

Besides this, I also noticed another issue. The hang doesn't always happen, and when it doesn't, the logs look like this:

2020-12-15 05:25:03,931 INFO worker.py:655 -- Connecting to existing Ray cluster at address: 172.17.0.3:6379
[2020-12-15 05:25:04,429 I 7621 7621] logging.cc:272: Set ray log level from environment variable RAY_BACKEND_LOG_LEVEL to -1
local ObjectRef(ffffffffffffffffffffffff0100000001000000)
(pid=raylet) [2020-12-15 05:25:10,896 I 7759 7759] logging.cc:272: Set ray log level from environment variable RAY_BACKEND_LOG_LEVEL to -1
remote ObjectRef(df5a1a828c9685d3ffffffff0100000001000000)
(pid=raylet) [2020-12-15 05:25:16,532 E 7696 7705] create_request_queue.cc:119: Not enough memory to create object df5a1a828c9685d3ffffffff0100000001000000 after 5 tries, will return OutOfMemory to the client
([ObjectRef(df5a1a828c9685d3ffffffff0100000001000000)], [])
del local ref
([ObjectRef(df5a1a828c9685d3ffffffff0100000001000000)], [])
2020-12-15 05:25:21,642 ERROR worker.py:976 -- Possible unhandled error from worker: ray::put() (pid=7759, ip=172.17.0.3)
  File "python/ray/_raylet.pyx", line 504, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 505, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 1468, in ray._raylet.CoreWorker.store_task_outputs
  File "python/ray/_raylet.pyx", line 140, in ray._raylet.check_status
ray.exceptions.ObjectStoreFullError: Failed to put object df5a1a828c9685d3ffffffff0100000001000000 in object store because it is full. Object size is 67109111 bytes.
The local object store is full of objects that are still in scope and cannot be evicted. Tip: Use the `ray memory` command to list active objects in the cluster.
2020-12-15 05:25:23,117 ERROR worker.py:938 -- print_logs: Connection closed by server.
2020-12-15 05:25:23,143 ERROR worker.py:1032 -- listen_error_messages_raylet: Connection closed by server.
2020-12-15 05:25:23,168 ERROR import_thread.py:89 -- ImportThread: Connection closed by server.
Aborted

I think this is the one I encountered before, when I thought the hang had disappeared. From the log, it looks like the object somehow got sealed/added first, which makes the first wait return the object as ready, but the exception is then thrown later. I haven't investigated this one yet — do you have any ideas about it?

@ericl @stephanie-wang

RAY_LOG(DEBUG) << "Push for " << dest_id << ", " << obj_id
               << " completed, remaining: " << NumPushesInFlight();
  }
} else {
Contributor Author

@fishbone fishbone Dec 15, 2020


I plan to resend only for errors like status.IsOutOfMemory() || status.IsTimedOut() || status.IsObjectStoreFull() || status.IsTransientObjectStoreFull(), but it looks like we convert every grpc::Status to ray::Status::IOError, so extra information needs to be returned.
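One way to carry that extra information could be an explicit error code on the reply, sketched here in Python with hypothetical field and status names (the real change would be a field on the PushReply protobuf and its C++ handler):

```python
from dataclasses import dataclass

# Hypothetical sketch: if PushReply carried an explicit error code instead
# of everything collapsing into IOError, the sender could decide whether a
# resend is worthwhile. Field and status names are illustrative, not Ray's.

@dataclass
class PushReply:
    error_code: str = "OK"

RESENDABLE = {"OutOfMemory", "TimedOut", "ObjectStoreFull",
              "TransientObjectStoreFull"}

def should_resend(reply: PushReply) -> bool:
    # Only memory-related or transient failures justify a resend.
    return reply.error_code in RESENDABLE
```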

@fishbone
Contributor Author

Weird... I started from a fresh environment, installed a nightly build, and both cases just disappeared :(
I'm developing in a Docker container with Ubuntu, which might be the cause. I'll close this PR for now and may reopen it later if I see similar issues.

@fishbone fishbone closed this Dec 15, 2020
@fishbone
Contributor Author

I'm reopening this PR. The nightly build I was using turns out to be from way back on Nov 3 according to ray.__commit__. I set up a new Ubuntu environment (not a container) and rebuilt trunk; the problem still exists.
@ericl mentioned that there should be a retry on the pull manager side; I'll take a look at this tonight.

Btw, I reverted #12335, and at least the hang no longer happens. Sometimes the first wait still returns the item as ready; I suppose that's because the current API reports an item as ready if it appears anywhere in the cluster. After introducing fetch_local, I think this needs to be fixed too.

@ericl
Contributor

ericl commented Dec 15, 2020

@wuisawesome can you look into the hang (see test case above)? It seems we are missing the push retry in #12335.

@wuisawesome
Contributor

I see this message repeated 13 times in the raylet log:

[2020-12-15 22:28:47,712 I 1269999 1269999] node_manager.cc:3149: Broadcasting Python GC request to all raylets since the cluster is low on resources. This removes Ray actor and object refs that are stuck in Python reference cycles.
[2020-12-15 22:28:47,715 I 1269999 1270014] eviction_policy.cc:135: There is not enough space to create this object, so evicting 0 objects to free up 0 bytes. The number of bytes in use (before this eviction) is 67109119.
[2020-12-15 22:28:47,715 W 1269999 1270014] store.cc:314: Not enough memory to create the object df5a1a828c9685d3ffffffff0100000001000000, data_size=67109113, metadata_size=6
[2020-12-15 22:28:47,715 I 1269999 1269999] node_manager.cc:3149: Broadcasting Python GC request to all raylets since the cluster is low on resources. This removes Ray actor and object refs that are stuck in Python reference cycles.
[2020-12-15 22:28:47,715 I 1269999 1270039] object_manager.cc:678: ReceiveObjectChunk index 12 of object df5a1a828c9685d3ffffffff0100000001000000 failed: object does not fit in the plasma store, overall 13/13 failed
[2020-12-15 22:28:47,729 I 1269999 1269999] node_manager.cc:566: Sending Python GC request to 1 local workers to clean up Python cyclic references.

Nothing after it.

@wuisawesome
Contributor

Problem is the retry timer isn't getting reset. I'll submit a PR shortly.
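The bug can be illustrated with a tiny one-shot timer sketch (names are hypothetical, not the pull manager's actual API): if the timer is never re-armed after a failed receive, it never fires again, no new pull request is issued, and the wait hangs.

```python
# Hypothetical illustration of the timer bug; not Ray's pull manager code.
class PullRetryTimer:
    def __init__(self, period_s):
        self.period_s = period_s
        self.deadline = None  # No retry scheduled yet.

    def arm(self, now):
        # Schedule the next pull attempt one period from `now`.
        self.deadline = now + self.period_s

    def poll(self, now):
        """Return True exactly once when the deadline passes."""
        if self.deadline is not None and now >= self.deadline:
            self.deadline = None  # One-shot: must be re-armed to fire again.
            return True
        return False

    def on_receive_failed(self, now):
        # The fix: re-arm on failure so another pull request is issued.
        # Without this call, poll() never returns True again and the
        # object is never re-pulled.
        self.arm(now)
```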

@fishbone
Contributor Author

Thanks for working on the fix. Please let me know when it's fixed and I'll rebase my PR on yours.

@wuisawesome
Contributor

@ahbone here's the PR: #12907

Thanks for the nice report/reproduction!

@fishbone
Contributor Author

np
