[core] Add task and object reconstruction status to ray memory by stephanie-wang · Pull Request #22317 · ray-project/ray

stephanie-wang · 2022-02-11T18:52:02Z

Why are these changes needed?

Improve observability for general objects and lineage reconstruction by adding a "Status" field to ray memory. The value of the field can be:

  // The task is waiting for its dependencies to be created.
  WAITING_FOR_DEPENDENCIES = 1;
  // All dependencies have been created and the task is scheduled to execute.
  SCHEDULED = 2;
  // The task finished successfully.
  FINISHED = 3;
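As a rough illustration of how a consumer of these values might translate them into readable names, here is a minimal Python sketch. The class name `TaskStatus` and the helper `status_name` are hypothetical and are not part of Ray's API; only the three names and integer values come from the description above.

```python
from enum import IntEnum


class TaskStatus(IntEnum):
    """Hypothetical Python mirror of the status values listed above."""

    # The task is waiting for its dependencies to be created.
    WAITING_FOR_DEPENDENCIES = 1
    # All dependencies have been created and the task is scheduled to execute.
    SCHEDULED = 2
    # The task finished successfully.
    FINISHED = 3


def status_name(value: int) -> str:
    """Return the status name for a raw integer, or a placeholder if unknown."""
    try:
        return TaskStatus(value).name
    except ValueError:
        # Values outside the three listed above are reported, not guessed.
        return f"UNKNOWN({value})"
```

Returning a placeholder instead of raising keeps a display path usable if a newer producer emits a value this sketch does not know about.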

In addition, tasks that failed or that needed to be re-executed due to lineage reconstruction will have a field listing the attempt number. Example output:

IP Address   | PID    | Type   | Call Site       | Status               | Size         | Reference Type  | Object Ref
192.168.4.22 | 279475 | Driver | (task call) ... | Attempt #2: FINISHED | 10000254.0 B | LOCAL_REFERENCE | c2668a65bda616c1ffffffffffffffffffffffff0100000001000000
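For readers who want to post-process output shaped like the example above, the following is a small sketch that splits a row on the `|` separator and pulls the attempt number out of the Status column. The helpers `parse_row` and `split_status` are hypothetical, and the parsing assumes the "Attempt #N: STATUS" pattern seen in the example row; the exact format Ray emits may differ.

```python
import re

HEADER = (
    "IP Address | PID | Type | Call Site | Status | Size | "
    "Reference Type | Object Ref"
)
ROW = (
    "192.168.4.22 | 279475 | Driver | (task call) ... | Attempt #2: FINISHED | "
    "10000254.0 B | LOCAL_REFERENCE | "
    "c2668a65bda616c1ffffffffffffffffffffffff0100000001000000"
)

# Assumed pattern, taken from the example row only.
_ATTEMPT_RE = re.compile(r"^Attempt #(\d+): (\w+)$")


def parse_row(header: str, row: str) -> dict:
    """Map column names to values for one pipe-separated row."""
    keys = [k.strip() for k in header.split("|")]
    values = [v.strip() for v in row.split("|")]
    return dict(zip(keys, values))


def split_status(status: str):
    """Return (attempt, state); attempt is None when no prefix is present."""
    match = _ATTEMPT_RE.match(status)
    if match:
        return int(match.group(1)), match.group(2)
    return None, status


parsed = parse_row(HEADER, ROW)
attempt, state = split_status(parsed["Status"])
```

The sketch returns `None` for the attempt when the prefix is absent instead of assuming a default, since the description only says the attempt number is listed for tasks that failed or were re-executed.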

Related issue number

Closes #21427.

Checks

- I've run scripts/format.sh to lint the changes in this PR.
- I've included any doc changes needed for https://docs.ray.io/en/master/.
- I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
  - Unit tests
  - Release tests
  - This PR is not tested :(

stephanie-wang · 2022-02-11T18:53:39Z

I'd also like to add some metrics, such as the total number of reconstructing tasks, to the CoreWorkerStats or debug_state.txt, but I'm not really sure how to read those metrics.

ericl · 2022-02-11T22:52:53Z

src/ray/protobuf/common.proto

+  // The task finished previously but is now being scheduled again because its
+  // outputs need to be reconstructed. It is waiting for its dependencies to be
+  // created again.
+  RECONSTRUCTING_AND_WAITING_FOR_DEPENDENCIES = 4;


Should there be a "reconstructing" flag instead of separate states?

stephanie-wang · 2022-02-12

I wanted to avoid adding an extra field to the protobuf, but I can do this if you think it's cleaner.
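To make the trade-off in this exchange concrete, here is a hedged Python sketch contrasting the two shapes: one enum with a dedicated reconstructing state, versus a base status plus a separate boolean flag. All class and function names are hypothetical illustrations; only the state names and numbers quoted in this PR are taken from the source, and the sketch does not claim to show which design was ultimately merged.

```python
from dataclasses import dataclass
from enum import IntEnum


class CombinedStatus(IntEnum):
    """Option A: reconstruction is encoded as an extra enum value."""

    WAITING_FOR_DEPENDENCIES = 1
    SCHEDULED = 2
    FINISHED = 3
    RECONSTRUCTING_AND_WAITING_FOR_DEPENDENCIES = 4


class BaseStatus(IntEnum):
    """Option B: the status itself stays small."""

    WAITING_FOR_DEPENDENCIES = 1
    SCHEDULED = 2
    FINISHED = 3


@dataclass(frozen=True)
class FlaggedStatus:
    """Option B: reconstruction is carried in a separate field."""

    status: BaseStatus
    reconstructing: bool = False


def to_flagged(combined: CombinedStatus) -> FlaggedStatus:
    """Translate an Option A value into its Option B equivalent."""
    if combined is CombinedStatus.RECONSTRUCTING_AND_WAITING_FOR_DEPENDENCIES:
        return FlaggedStatus(BaseStatus.WAITING_FOR_DEPENDENCIES, reconstructing=True)
    return FlaggedStatus(BaseStatus(combined.value))
```

Option A avoids an extra field but needs a new enum value for every state that can occur during reconstruction; Option B adds one field and leaves the set of states unchanged.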

ericl

Does test_memstat.py also need updating?

stephanie-wang · 2022-02-14T23:43:37Z

cc @rkooo567

By the way, I'm going to hold off on adding this to any metrics for now because I'm not really sure if it'll make sense. These stats are collected at the owners so they don't have anything to do with the tasks that are queued at the local raylet.

ericl · 2022-02-15T07:34:12Z

I like the attempt number change!

rkooo567

Can you update the public documentation (ray memory) to explain the meaning of the states and the attempt number?

rkooo567 · 2022-02-16T03:51:53Z

src/ray/core_worker/task_manager.h

      const std::vector<ObjectID> &inlined_dependency_ids,
      const std::vector<ObjectID> &contained_ids) = 0;

+  virtual void MarkDependenciesResolved(const TaskID &task_id) = 0;


Personal preference, but why don't we use OnDependenciesResolved? "Mark" sounds like we will have some operations for the task ID.
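For context on what such a method might do, here is a minimal Python sketch of a status table in which resolving dependencies moves a task from WAITING_FOR_DEPENDENCIES to SCHEDULED. This is an assumption based on the status descriptions in the PR summary, not the actual C++ TaskManager implementation; the class `TaskManagerSketch` and its methods are hypothetical.

```python
class TaskManagerSketch:
    """Hypothetical stand-in that tracks one status string per task ID."""

    def __init__(self):
        self._status = {}

    def add_pending_task(self, task_id: str) -> None:
        # New tasks start out waiting for their dependencies.
        self._status[task_id] = "WAITING_FOR_DEPENDENCIES"

    def mark_dependencies_resolved(self, task_id: str) -> None:
        # Assumed transition: only a waiting task becomes scheduled.
        if self._status.get(task_id) == "WAITING_FOR_DEPENDENCIES":
            self._status[task_id] = "SCHEDULED"

    def status(self, task_id: str):
        # Returns None for task IDs that were never added.
        return self._status.get(task_id)


manager = TaskManagerSketch()
manager.add_pending_task("task-1")
manager.mark_dependencies_resolved("task-1")
```

Under this reading the method records a state change for the given task, which is one argument for the "Mark" prefix; an "On" prefix would instead frame it as a callback reacting to an event.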

Add task status to ray memory

838f252

stephanie-wang requested review from AmeerHajAli, ericl, pcmoritz, raulchen, robertnishihara and wuisawesome as code owners February 11, 2022 18:52

stephanie-wang assigned ericl Feb 11, 2022

ericl reviewed Feb 11, 2022

ericl added the @author-action-required label ("The PR author is responsible for the next step. Remove tag to send back to the reviewer.") Feb 11, 2022

stephanie-wang added 2 commits February 14, 2022 12:10

test_memstat

7154051

Add attempt # to stats, fix bug in recovering actors

5c19bd3

stephanie-wang removed the @author-action-required label Feb 14, 2022

rkooo567 self-assigned this Feb 15, 2022

ericl approved these changes Feb 15, 2022

ericl added the @author-action-required label Feb 15, 2022

test

b87dc7b

rkooo567 reviewed Feb 16, 2022

fix

ea921a3

stephanie-wang merged commit abf2a70 into ray-project:master Feb 23, 2022

stephanie-wang deleted the lineage-observability branch February 23, 2022 05:26

stephanie-wang merged 5 commits into ray-project:master from stephanie-wang:lineage-observability
