[Data] Support for setting an initial concurrency #54648

### Description

Ray Data’s read / map / write APIs already accept a `(min, max)` tuple for `concurrency` when the argument `fn` is a Python class.
However, my workload has a long tail: the last few tasks are extremely expensive, so I set `min = 1` to keep at least one actor alive. As a side effect, Ray starts with a single actor and has to autoscale up to `max`, which adds noticeable startup latency.

I believe this request is reasonable, since both Kubernetes and Spark already provide similar knobs: Kubernetes has `replicas`, `maxReplicas`, and `minReplicas`, while Spark offers `spark.dynamicAllocation.initialExecutors`, `spark.dynamicAllocation.maxExecutors`, and `spark.dynamicAllocation.minExecutors`.

### Use case

Perhaps we could extend the `concurrency` parameter to accept a third element, `concurrency = (min, max, init)`, where `init` is the number of actors to start with.
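To make the proposed semantics concrete, here is a minimal sketch of how the three accepted shapes could be normalized. Everything in it is hypothetical: `PoolSize` and `resolve_concurrency` are illustrative names, not Ray APIs, and the rule that `init` defaults to `min` is an assumption based on the behavior described above (the pool starts at `min` today).

```python
from typing import NamedTuple, Tuple, Union


class PoolSize(NamedTuple):
    """Hypothetical normalized form of the `concurrency` argument."""
    min_size: int
    max_size: int
    initial_size: int


def resolve_concurrency(concurrency: Union[int, Tuple[int, ...]]) -> PoolSize:
    """Normalize `n`, `(min, max)`, or `(min, max, init)` into a PoolSize.

    Illustrative only; this is not how Ray Data parses the argument.
    """
    if isinstance(concurrency, int):
        # A single integer is treated as a fixed-size pool.
        lo = hi = init = concurrency
    elif isinstance(concurrency, tuple) and len(concurrency) == 2:
        # Assumed default: start at `min`, matching the behavior described above.
        lo, hi = concurrency
        init = lo
    elif isinstance(concurrency, tuple) and len(concurrency) == 3:
        # Proposed form: the third element is the initial pool size.
        lo, hi, init = concurrency
    else:
        raise ValueError(f"unsupported concurrency value: {concurrency!r}")

    # The initial size must lie inside the autoscaling bounds.
    if not (1 <= lo <= init <= hi):
        raise ValueError(
            f"expected 1 <= min <= init <= max, got min={lo}, max={hi}, init={init}"
        )
    return PoolSize(lo, hi, init)
```

With this shape, `resolve_concurrency((1, 8, 8))` would describe a pool that starts at full size and can still shrink to one actor during the long tail, while the existing `(1, 8)` form keeps its current meaning.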
