Skip to content

[core] Add task and object reconstruction status to ray memory#22317

Merged
stephanie-wang merged 5 commits intoray-project:masterfrom
stephanie-wang:lineage-observability
Feb 23, 2022
Merged

[core] Add task and object reconstruction status to ray memory#22317
stephanie-wang merged 5 commits intoray-project:masterfrom
stephanie-wang:lineage-observability

Conversation

@stephanie-wang
Copy link
Copy Markdown
Contributor

@stephanie-wang stephanie-wang commented Feb 11, 2022

Why are these changes needed?

Improve observability for general objects and lineage reconstruction by adding a "Status" field to ray memory. The value of the field can be:

  // The task is waiting for its dependencies to be created.
  WAITING_FOR_DEPENDENCIES = 1;
  // All dependencies have been created and the task is scheduled to execute.
  SCHEDULED = 2;
  // The task finished successfully.
  FINISHED = 3;

In addition, tasks that failed or that needed to be re-executed due to lineage reconstruction will have a field listing the attempt number. Example output:

IP Address    | PID      | Type    | Call Site | Status    | Size     | Reference Type | Object Ref
192.168.4.22  | 279475   | Driver  | (task call) ... | Attempt #2: FINISHED | 10000254.0 B | LOCAL_REFERENCE | c2668a65bda616c1ffffffffffffffffffffffff0100000001000000


Related issue number

Closes #21427.

Checks

  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@stephanie-wang
Copy link
Copy Markdown
Contributor Author

I'd also like to add some metrics like total number of reconstructing tasks to the CoreWorkerStats or debug_state.txt, but I'm not really sure how to read those metrics.

// The task finished previously but is now being scheduled again because its
// outputs need to be reconstructed. It is waiting for its dependencies to be
// created again.
RECONSTRUCTING_AND_WAITING_FOR_DEPENDENCIES = 4;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should there be a "reconstructing" flag instead of separate states?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wanted to avoid adding an extra field to the protobuf but I can do this if you think it's cleaner.

Copy link
Copy Markdown
Contributor

@ericl ericl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does test_memstat.py also need updating?

@ericl ericl added the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Feb 11, 2022
@stephanie-wang stephanie-wang removed the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Feb 14, 2022
@stephanie-wang
Copy link
Copy Markdown
Contributor Author

cc @rkooo567

By the way, I'm going to hold off on adding this to any metrics for now because I'm not really sure if it'll make sense. These stats are collected at the owners so they don't have anything to do with the tasks that are queued at the local raylet.

@rkooo567 rkooo567 self-assigned this Feb 15, 2022
@ericl ericl added the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Feb 15, 2022
@ericl
Copy link
Copy Markdown
Contributor

ericl commented Feb 15, 2022

I like the attempt number change!

Copy link
Copy Markdown
Contributor

@rkooo567 rkooo567 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you update the public documentation (ray memory) to explain the meaning states/attempt?

const std::vector<ObjectID> &inlined_dependency_ids,
const std::vector<ObjectID> &contained_ids) = 0;

virtual void MarkDependenciesResolved(const TaskID &task_id) = 0;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Personal preference, but why don't we use OnDependenciesResolved? Mark sounds like we will have some operations for the task id

@stephanie-wang stephanie-wang merged commit abf2a70 into ray-project:master Feb 23, 2022
@stephanie-wang stephanie-wang deleted the lineage-observability branch February 23, 2022 05:26
simonsays1980 pushed a commit to simonsays1980/ray that referenced this pull request Feb 27, 2022
…roject#22317)

Improve observability for general objects and lineage reconstruction by adding a "Status" field to `ray memory`. The value of the field can be:
```
  // The task is waiting for its dependencies to be created.
  WAITING_FOR_DEPENDENCIES = 1;
  // All dependencies have been created and the task is scheduled to execute.
  SCHEDULED = 2;
  // The task finished successfully.
  FINISHED = 3;
```

In addition, tasks that failed or that needed to be re-executed due to lineage reconstruction will have a field listing the attempt number. Example output:
```
IP Address    | PID      | Type    | Call Site | Status    | Size     | Reference Type | Object Ref
192.168.4.22  | 279475   | Driver  | (task call) ... | Attempt #2: FINISHED | 10000254.0 B | LOCAL_REFERENCE | c2668a65bda616c1ffffffffffffffffffffffff0100000001000000


```
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

@author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Core] [Feature] Better observability for lineage reconstruction

3 participants