Skip to content

[Data] Ray Data currently lacks support for displaying execution plans and does not provide operator-level metrics within those plans #55052

@caican00

Description

@caican00

What happened + What you expected to happen

Ray Data currently lacks support for displaying execution plans and does not provide operator-level metrics within those plans.

Specifically:

  1. No Standard API for Execution Plans: Ray Data does not offer a standard API (e.g., an explain() method) to visualize the execution plan, encompassing both the logical and physical plans.

  2. Limited Plan Visibility: While print(ds) can display aspects of the logical plan, it does not reveal the physical execution plan.

  3. Insufficient Operator-Level Metrics: The execution plan view lacks detailed metrics for individual operators. This makes it difficult to assess critical performance aspects such as:
    The volume of data processed by each operator.
    Whether predicate pushdown optimizations have been successfully applied.

  4. Coarse-Grained Dataset Statistics: The metrics provided by the Dataset.stats() API are not sufficiently granular. They are reported at the block level and lack comprehensive, global statistics like total input/output row counts.

Operator 1 ReadLance->Filter(NoneType): 200 tasks executed, 200 blocks produced in 29967.49s
* Remote wall time: 92.49ms min, 2.92s max, 1.22s mean, 243.21s total
* Remote cpu time: 7.36ms min, 418.41ms max, 208.63ms mean, 41.73s total
* UDF time: 305.39us min, 1.28s max, 150.0ms mean, 30.0s total
* Peak heap memory usage (MiB): 407.04 min, 683.52 max, 527 mean
* Output num rows per block: 0 min, 1 max, 0 mean, 1 total
* Output size bytes per block: 0 min, 18327 max, 91 mean, 18327 total
* Output rows per task: 0 min, 1 max, 0 mean, 200 tasks used
* Tasks per node: 14 min, 25 max, 20 mean; 10 nodes used
* Operator throughput:
        * Ray Data throughput: 3.336949276338555e-05 rows/s
        * Estimated single node throughput: 0.00411168768564564 rows/s

Operator 2 limit=20: 200 tasks executed, 200 blocks produced in 29967.49s
* Remote wall time: 92.49ms min, 2.92s max, 1.22s mean, 243.21s total
* Remote cpu time: 7.36ms min, 418.41ms max, 208.63ms mean, 41.73s total
* UDF time: 305.39us min, 1.28s max, 150.0ms mean, 30.0s total
* Peak heap memory usage (MiB): 407.04 min, 683.52 max, 527 mean
* Output num rows per block: 0 min, 1 max, 0 mean, 1 total
* Output size bytes per block: 0 min, 18327 max, 91 mean, 18327 total
* Output rows per task: 0 min, 1 max, 0 mean, 200 tasks used
* Tasks per node: 14 min, 25 max, 20 mean; 10 nodes used
* Operator throughput:
        * Ray Data throughput: 3.336949276338555e-05 rows/s
        * Estimated single node throughput: 0.00411168768564564 rows/s

Dataset iterator time breakdown:
* Total time overall: 7.87s
    * Total time in Ray Data iterator initialization code: 713.05ms
    * Total time user thread is blocked by Ray Data iter_batches: 7.15s
    * Total execution time for user thread: 1.65ms
* Batch iteration time breakdown (summed across prefetch threads):
    * In ray.get(): 150.7us min, 6.41ms max, 636.5us avg, 127.3ms total
    * In batch creation: 4.48us min, 4.48us max, 4.48us avg, 4.48us total
    * In batch formatting: 26.3us min, 26.3us max, 26.3us avg, 26.3us total

Dataset throughput:
        * Ray Data throughput: 3.336949276338555e-05 rows/s
        * Estimated single node throughput: 0.00205584384282282 rows/s

Is it because the detailed execution plan and execution metrics were not considered in the design of ray data, or is it just not supported for the time being?

Versions / Dependencies

ray 2.40.0

Reproduction script

None

Issue Severity

None

Metadata

Metadata

Assignees

No one assigned

    Labels

    P1Issue that should be fixed within a few weeksbugSomething that is supposed to be working; but isn'tcommunity-backlogdataRay Data-related issuesgood-first-issueGreat starter issue for someone just starting to contribute to RayobservabilityIssues related to the Ray Dashboard, Logging, Metrics, Tracing, and/or ProfilingquestionJust a question :)triageNeeds triage (eg: priority, bug/not-bug, and owning component)usability

    Type

    No type

    Projects

    Status

    Todo

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions