[core] Add task and object reconstruction status to ray memory #22317
stephanie-wang merged 5 commits into ray-project:master
I'd also like to add some metrics like total number of reconstructing tasks to the CoreWorkerStats or debug_state.txt, but I'm not really sure how to read those metrics.
src/ray/protobuf/common.proto
Outdated
// The task finished previously but is now being scheduled again because its
// outputs need to be reconstructed. It is waiting for its dependencies to be
// created again.
RECONSTRUCTING_AND_WAITING_FOR_DEPENDENCIES = 4;
Should there be a "reconstructing" flag instead of separate states?
I wanted to avoid adding an extra field to the protobuf but I can do this if you think it's cleaner.
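To make the trade-off in this thread concrete, here is a minimal Python sketch of the two options: separate enum values for reconstruction versus a base state plus a boolean flag. The class names `SeparateStates`, `BaseState`, and `FlaggedStatus` and the `describe` helper are hypothetical and exist only for illustration; the actual definition lives in `src/ray/protobuf/common.proto`, and the values 1 to 3 are assumed from the merged commit message rather than from this outdated diff.

```python
from dataclasses import dataclass
from enum import Enum


class SeparateStates(Enum):
    """Option A (hypothetical): reconstruction gets its own enum values."""

    WAITING_FOR_DEPENDENCIES = 1
    SCHEDULED = 2
    FINISHED = 3
    RECONSTRUCTING_AND_WAITING_FOR_DEPENDENCIES = 4


class BaseState(Enum):
    """Option B (hypothetical): only the base states are enumerated."""

    WAITING_FOR_DEPENDENCIES = 1
    SCHEDULED = 2
    FINISHED = 3


@dataclass(frozen=True)
class FlaggedStatus:
    """Option B: a base state plus an extra 'reconstructing' field."""

    state: BaseState
    reconstructing: bool = False


def describe(status: FlaggedStatus) -> str:
    """Render a flagged status using the naming scheme of option A."""
    prefix = "RECONSTRUCTING_AND_" if status.reconstructing else ""
    return prefix + status.state.name


flagged = FlaggedStatus(BaseState.WAITING_FOR_DEPENDENCIES, reconstructing=True)
print(describe(flagged))
```

Option A needs one extra enum value per base state that can be re-entered during reconstruction, while option B adds one field to the message and keeps the enum small.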
ericl left a comment
Does test_memstat.py also need updating?
cc @rkooo567 By the way, I'm going to hold off on adding this to any metrics for now because I'm not really sure if it'll make sense. These stats are collected at the owners so they don't have anything to do with the tasks that are queued at the local raylet.
I like the attempt number change!
rkooo567 left a comment
Can you update the public documentation (`ray memory`) to explain the meaning of the states and the attempt number?
const std::vector<ObjectID> &inlined_dependency_ids,
const std::vector<ObjectID> &contained_ids) = 0;

virtual void MarkDependenciesResolved(const TaskID &task_id) = 0;
Personal preference, but why don't we use `OnDependenciesResolved`? `Mark` sounds like we will perform some operation on the task ID.
…roject#22317)

Improve observability for general objects and lineage reconstruction by adding a "Status" field to `ray memory`. The value of the field can be:

```
// The task is waiting for its dependencies to be created.
WAITING_FOR_DEPENDENCIES = 1;
// All dependencies have been created and the task is scheduled to execute.
SCHEDULED = 2;
// The task finished successfully.
FINISHED = 3;
```

In addition, tasks that failed or that needed to be re-executed due to lineage reconstruction will have a field listing the attempt number. Example output:

```
IP Address | PID | Type | Call Site | Status | Size | Reference Type | Object Ref
192.168.4.22 | 279475 | Driver | (task call) ... | Attempt #2: FINISHED | 10000254.0 B | LOCAL_REFERENCE | c2668a65bda616c1ffffffffffffffffffffffff0100000001000000
```
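As a rough illustration of how the new "Status" column could be consumed, the sketch below splits the example row from the commit message into columns and separates the optional attempt number from the state name. The helpers `parse_row` and `parse_status` and the `TaskStatus` type are hypothetical and not part of Ray. The sketch assumes that no cell contains a `|` character and that rows without an `Attempt #N:` prefix carry only the state name; neither assumption is confirmed by the PR.

```python
import re
from typing import Dict, NamedTuple, Optional

# Column names copied from the example output in the commit message.
COLUMNS = [
    "IP Address", "PID", "Type", "Call Site",
    "Status", "Size", "Reference Type", "Object Ref",
]

# Matches cells such as "Attempt #2: FINISHED" (format taken from the example).
_ATTEMPT_RE = re.compile(r"^Attempt #(\d+): (\w+)$")


class TaskStatus(NamedTuple):
    attempt: Optional[int]  # None when the cell has no "Attempt #N:" prefix
    state: str


def parse_row(line: str) -> Dict[str, str]:
    """Split one pipe-delimited row into a column-name -> cell mapping."""
    cells = [cell.strip() for cell in line.split("|")]
    if len(cells) != len(COLUMNS):
        raise ValueError(f"expected {len(COLUMNS)} cells, got {len(cells)}")
    return dict(zip(COLUMNS, cells))


def parse_status(cell: str) -> TaskStatus:
    """Separate the optional attempt number from the state name."""
    cell = cell.strip()
    match = _ATTEMPT_RE.match(cell)
    if match:
        return TaskStatus(int(match.group(1)), match.group(2))
    return TaskStatus(None, cell)


EXAMPLE_ROW = (
    "192.168.4.22 | 279475 | Driver | (task call) ... | Attempt #2: FINISHED | "
    "10000254.0 B | LOCAL_REFERENCE | "
    "c2668a65bda616c1ffffffffffffffffffffffff0100000001000000"
)

row = parse_row(EXAMPLE_ROW)
status = parse_status(row["Status"])
print(status.attempt, status.state)
```

Returning `None` for a missing attempt number, rather than defaulting to 1, avoids guessing how first attempts are displayed.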
Why are these changes needed?
Improve observability for general objects and lineage reconstruction by adding a "Status" field to `ray memory`. The value of the field can be `WAITING_FOR_DEPENDENCIES`, `SCHEDULED`, or `FINISHED`. In addition, tasks that failed or that needed to be re-executed due to lineage reconstruction will have a field listing the attempt number, for example `Attempt #2: FINISHED`.
Related issue number
Closes #21427.
Checks
I've run `scripts/format.sh` to lint the changes in this PR.