[Core] Improve scheduling observability and fix wrong resource deadlock report message.#19746

Merged
ericl merged 15 commits into ray-project:master from rkooo567:improve-scheduling-observability
Oct 28, 2021
Conversation

rkooo567 (Contributor) commented Oct 26, 2021

Why are these changes needed?

This PR is the replacement of #19720.

It is doing 3 things.

  1. Improve scheduling observability. Currently, it is hard to know why tasks are not scheduled when things are hanging. This PR adds a per-task "state" that indicates why a task is not scheduled. The states are trackable through cluster_task_manager->DebugStr().
  2. Add a threaded actor stress test. Currently it occasionally segfaults when max_concurrency == 10, so we will run it with a single-threaded actor. The issue is tracked here: [Bug] Threaded actor stress test invokes SIGSEGV #19748
  3. Fix [Core][Bug] ray nodes get into a bad state and actor can't be scheduled #19207. I could somehow repro this issue, and it seems the problem was that we print a resource deadlock warning when tasks/actors are not actually waiting for resources to become available. The first item (improved observability) allows us to correctly identify when to raise the error. Please check the code for more details (the approach could be controversial). Note that I am not 100% sure the issue I could repro is exactly the same issue.

This is an example of how the improved observability helps us fix issues.

For the incorrect actor scheduling error message in this particular test, the problem was that worker process startups were rate limited because we start more workers than the threshold at once:

(raylet) Infeasible queue length: 0
(raylet) Schedule queue length: 0
(raylet) Dispatch queue length: 15
(raylet) num_waiting_for_resource: 0
(raylet) num_waiting_for_plasma_memory: 0
(raylet) num_waiting_for_remote_node_resources: 0
(raylet) num_worker_not_started_by_job_config_not_exist: 0
(raylet) num_worker_not_started_by_registration_timeout: 0
(raylet) num_worker_not_started_by_process_rate_limit: 13
(raylet) num_worker_waiting_for_workers: 2
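The counters above can be produced by aggregating a per-task "unscheduled cause" over the pending work items. A minimal Python sketch of that aggregation (hypothetical, simplified names; the real implementation lives in the C++ ClusterTaskManager):

```python
from collections import Counter
from enum import Enum, auto


class UnscheduledCause(Enum):
    """Simplified stand-in for the per-task unscheduled-work cause."""
    WAITING_FOR_RESOURCE = auto()
    WAITING_FOR_PLASMA_MEMORY = auto()
    WAITING_FOR_REMOTE_NODE_RESOURCES = auto()
    WORKER_NOT_STARTED_RATE_LIMIT = auto()
    WAITING_FOR_WORKERS = auto()


def debug_str(pending_causes):
    """Aggregate per-task causes into counters like the raylet debug output."""
    counts = Counter(pending_causes)
    return "\n".join(
        f"num_{cause.name.lower()}: {counts[cause]}"
        for cause in UnscheduledCause
    )


# 13 tasks blocked on the worker-process rate limit, 2 waiting for workers,
# mirroring the log excerpt above.
causes = (
    [UnscheduledCause.WORKER_NOT_STARTED_RATE_LIMIT] * 13
    + [UnscheduledCause.WAITING_FOR_WORKERS] * 2
)
print(debug_str(causes))
```

Because the counters are derived from a single cause tag per task, a hang can be attributed to exactly one bucket (e.g. the rate limit here) instead of being lumped into a generic "pending" count.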

Related issue number

closes #19207 (90% confidence) #19427

Checks

  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

num_parents = 6
num_children = 6
death_probability = 0.95
# TODO(sang): Currently setting this to 10 creates segfault
Contributor Author:

This will be handled in the next PR


/// Override cancellation behaviour.
void OnCancellationInstead(const CancelTaskCallback &callback) {
on_cancellation_ = callback;
Contributor Author:

not used


std::string error_message_str = error_message.str();
RAY_LOG(WARNING) << error_message_str;
RAY_LOG(WARNING) << cluster_task_manager_->DebugStr();
Contributor Author:

This will help us figure out exactly why warnings are printed (while it is not printed to users).

Contributor Author:

we can eventually use the event here

Contributor:

What's the overhead of this? Does it make sense to use RAY_LOG_EVERY_MS...?

Contributor Author:

The function's own overhead is probably not too big, because we normally shouldn't have too many pending tasks at a time.

But RAY_LOG_EVERY_MS sounds reasonable just in case. My question is: what are the semantics? If I use

RAY_LOG_EVERY_MS(log, 1000)

does that mean this prints at most once per second?

Contributor:

that's correct.

Contributor Author:

Limited to once per 10 seconds
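The rate-limited logging discussed above can be sketched as a small time-based suppressor. This is a hypothetical Python analogue of what a macro like RAY_LOG_EVERY_MS does, not the actual Ray implementation:

```python
import time


class EveryMsLogger:
    """Emit a message at most once per `interval_ms`; drop the rest."""

    def __init__(self, interval_ms):
        self.interval_s = interval_ms / 1000.0
        self.last_emit = float("-inf")

    def log(self, message):
        now = time.monotonic()
        if now - self.last_emit >= self.interval_s:
            self.last_emit = now
            print(message)
            return True   # emitted
        return False      # suppressed within the interval


# Once per 10 seconds, matching the limit chosen in this thread.
logger = EveryMsLogger(interval_ms=10_000)
emitted = [logger.log("resource deadlock warning") for _ in range(5)]
```

Five back-to-back calls emit only the first message; the remaining four fall inside the 10-second window and are suppressed, which bounds the log volume regardless of how often the warning path is hit.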


// If the work is not in the waiting state, it will be scheduled soon or won't be
// scheduled. Consider as non-pending.
if (work.GetState() != internal::WorkStatus::WAITING) {
Contributor Author:

This is the RFC, so the approach is as follows:

We only count tasks/actors that are actually waiting for resources. We don't report pending tasks that are pending for other reasons (e.g., workers failed to start).
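The approach can be sketched as a filter over the work queue by state. A hedged Python sketch (hypothetical names, mirroring the internal::WorkStatus check in the diff):

```python
from enum import Enum, auto


class WorkStatus(Enum):
    WAITING = auto()             # queued, waiting for resources
    WAITING_FOR_WORKER = auto()  # resources granted, worker still starting
    CANCELLED = auto()


def count_resource_pending(statuses):
    """Count only tasks genuinely waiting for resources.

    Tasks waiting for a worker to start (or cancelled) are excluded, so a
    worker-startup delay no longer triggers a resource deadlock warning.
    """
    return sum(1 for s in statuses if s == WorkStatus.WAITING)


# e.g. 15 tasks stuck on rate-limited worker startup: not a resource deadlock.
queue = [WorkStatus.WAITING_FOR_WORKER] * 15
```

With this filter, the dispatch queue from the PR description (15 tasks, all blocked on worker startup) contributes nothing to the deadlock check, which is exactly why the spurious warning disappears.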

int, std::deque<std::shared_ptr<internal::Work>>>
&pair) {
const auto &work_queue = pair.second;
for (auto work_it = work_queue.begin(); work_it != work_queue.end();) {
Contributor Author:

This could add additional overhead to DebugStr(). Since we usually don't have long queues in the raylet, I assume it's unlikely to be an issue, but let me know if you think we need to hide it behind a debug flag.

Contributor Author:

In the long term, we can optimize it to update live

Contributor:

Do you think it makes sense to also add STATS here? It's really simple; some sample code here:

https://github.com/ray-project/ray/blob/master/src/ray/common/asio/instrumented_io_context.cc#L23

I think we can just update the stats when worker states change. Then even if we don't print the debug string, we still have some insight into this.

Contributor Author:

DebugString is always called periodically (I don't know if there's an option to disable that).

What if we just call this in the DebugStr() method?

@rkooo567 (Contributor Author):

If the high-level approach makes sense, I will also add a unit test for stats.

scv119 (Contributor) commented Oct 26, 2021

Love the direction this PR is going! @iycheng @wuisawesome also take a look?

// `TaskSpec` is determined at submission time.
message TaskExecutionSpec {
// The last time this task was received for scheduling.
double last_timestamp = 2;
Contributor:

I'd rather not make this backward-incompatible change.

Contributor Author:

I think for this path it is fine because: 1. it is not used anywhere; 2. for the task spec, backward compatibility doesn't matter IIUC (the only place it matters is the RPC used by the autoscaler, because that's the only path where different Ray versions communicate with each other). But if you still don't like it, I can just keep it.

<< "\n"
<< "Available resources on this node: " << available_resources
<< "In total there are " << pending_tasks << " pending tasks and "
<< "Available resources on this node: "
Contributor:

is it possible to print the reason why this worker can't be scheduled here as well?

Contributor Author:

I believe the log above explains it? Or do you want to include "workers not starting" warnings here? (That requires changing the semantics of this method, because it "warns on resource deadlock", which doesn't include worker startup failures, imo.)

Contributor Author:

Hmm, maybe we should warn for both: 1. resource deadlock, and 2. workers not starting properly.


WORKER_NOT_FOUND_RATE_LIMITED,
};

/// Work represents all the information needed to make a scheduling decision.
Contributor:

Now that we are here, does it make sense to add comments describing all the possible states of a "Work"?

Also, we might want to find a better name for "Work" (not in this PR).

Contributor Author (Oct 27, 2021):

Also, we might want to find a better name for "Work" (not in this PR)

+1

Now that we are here, does it make sense to add comments describing all the possible states of a "Work"?

isn't this already kind of described in the enum?

@rkooo567 rkooo567 added the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Oct 27, 2021
@rkooo567 rkooo567 removed the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Oct 27, 2021
@fishbone (Contributor):

My only concern is that without DEFINE_stats, it's not easy to observe for a cluster; basically you need to dump everything and check each raylet. Also, you will lose the time-series data, which might be useful for debugging.
But if you think it fits your use case, which I don't have much context on, I'm OK with this.



} else if (status == PopWorkerStatus::WorkerPendingRegistration) {
cause = internal::UnscheduledWorkCause::WORKER_NOT_FOUND_REGISTRATION_TIMEOUT;
} else {
RAY_LOG(FATAL) << "Unexpected state received for the empty pop worker.";
Contributor:

will this be too aggressive?

Contributor Author:

Imo, this is fine, because receiving other states without updating this code path is a regression (or a bug). (But let me know if you have an alternative!)
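The defensive pattern being discussed here — map every known status to a cause and fail loudly on anything else — can be sketched as follows. This is a hypothetical Python analogue of the C++ code; the status and cause names are taken from the snippets in this thread, while the mapping itself is illustrative:

```python
class PopWorkerStatus:
    """Stand-in for the C++ PopWorkerStatus values seen in the diff."""
    JobConfigMissing = "JobConfigMissing"
    WorkerPendingRegistration = "WorkerPendingRegistration"
    TooManyStartingWorkers = "TooManyStartingWorkers"  # assumed name


_CAUSE_BY_STATUS = {
    PopWorkerStatus.JobConfigMissing: "WORKER_NOT_FOUND_JOB_CONFIG_NOT_EXIST",
    PopWorkerStatus.WorkerPendingRegistration: "WORKER_NOT_FOUND_REGISTRATION_TIMEOUT",
    PopWorkerStatus.TooManyStartingWorkers: "WORKER_NOT_FOUND_RATE_LIMITED",
}


def cause_for(status):
    """Translate a pop-worker failure status into an unscheduled-work cause.

    An unknown status means the caller changed without updating this mapping,
    so crash immediately (the RAY_LOG(FATAL) above) instead of mis-reporting
    the scheduling state.
    """
    try:
        return _CAUSE_BY_STATUS[status]
    except KeyError:
        raise AssertionError(
            f"Unexpected state received for the empty pop worker: {status}")
```

Crashing on the unknown case is the point of the FATAL: a silently wrong cause would corrupt exactly the observability counters this PR introduces.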

@scv119 scv119 added the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Oct 27, 2021
@rkooo567 (Contributor Author):

My only concern is that without DEFINE_stats, it's not easy to observe for a cluster; basically you need to dump everything and check each raylet. Also, you will lose the time-series data, which might be useful for debugging.

@iycheng To be clear, I am totally on the same page. My opinion here is just that since we always dump the debug string, we can call the Record method within that function (not every time the state is changed).

@rkooo567 (Contributor Author):

The unit tests / integration tests will be added in the follow-up PR.

@rkooo567 rkooo567 assigned edoakes and unassigned ericl Oct 28, 2021
@rkooo567 (Contributor Author):

@edoakes this PR will correctly report the "resource deadlock"

@rkooo567 (Contributor Author):

@edoakes can you provide me the test script that I can verify this?

@rkooo567 (Contributor Author):

cc @iycheng

I will do the following things in the follow-up next week.

  • More tests (as discussed with @scv119)
  • DEFINE_STATS
  • A test that proves it doesn't print this error msg upon runtime env failures (if I can get the repro quickly by tomorrow, I will just add it here).

@rkooo567 rkooo567 removed the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Oct 28, 2021
@ericl (Contributor) left a comment:

Proto change LGTM

@ericl ericl merged commit 96fc875 into ray-project:master Oct 28, 2021
edoakes (Collaborator) commented Oct 28, 2021

@rkooo567 sorry a bit late here, but you can try:

ray.init()

@ray.remote(runtime_env={"pip": ["tensorflow", "torch"]})
def f():
    pass

# Check no warning printed.
ray.get(f.remote())

rkooo567 (Contributor Author) commented Oct 28, 2021

@edoakes

This is the state while running that particular test, and it doesn't print any deadlock message. This is the expected behavior, right? (Btw, this task seems to run for a pretty long time... is that normal?)

num_waiting_for_resource: 0
num_waiting_for_plasma_memory: 0
num_waiting_for_remote_node_resources: 0
num_worker_not_started_by_job_config_not_exist: 0
num_worker_not_started_by_registration_timeout: 0
num_worker_not_started_by_process_rate_limit: 0
num_worker_waiting_for_workers: 1
num_cancelled_tasks: 0

EDIT

It finished without printing the spurious resource deadlock error, but I am seeing this.

(raylet) Traceback (most recent call last):
(raylet)   File "/Users/sangbincho/work/ray/python/ray/workers/default_worker.py", line 8, in <module>
(raylet)     import ray
(raylet) ModuleNotFoundError: No module named 'ray'

is it expected? or is it due to my setup?

@architkulkarni (Contributor):

Ah @rkooo567 can you try running with RAY_RUNTIME_ENV_LOCAL_DEV_MODE=1? You need that flag if you're using pip or conda in the runtime env with Ray built from source.

@rkooo567 (Contributor Author):

Turns out the log was buried at the top of very long conda log messages.


Development

Successfully merging this pull request may close these issues.

[Core][Bug] ray nodes get into a bad state and actor can't be scheduled
