Skip to content

[data] Improve documentation around typing and batch formats #58615

@richardliaw

Description

@richardliaw

Description

Let's add some documentation around typing in Ray Data.

A couple questions when working with batch formats:

  1. I though the underlying format in Ray Data was a pyarrow table, but the documentation says it is a dict of numpy arrays. Which is correct?
  2. What are the valid return types for the UDF in map_batches? The docs say either "pandas" or "Dict[str, np.ndarray]", but I thought a pyarrow table was also a valid return type.
  3. For different batch_formats, what is the compatibility matrix and performance matrix when they convert amongst each other?

And what is the recommendation of which batch_format to use and when?

generic recommendation is:
if you want dataframe, then use pyarrow+polars
if you want dataframe + pandas, then use pandas
if you want numerical ops or tensor ops, use numpy

Link

No response

Metadata

Metadata

Labels

dataRay Data-related issuesdocsAn issue or change related to documentationtriageNeeds triage (eg: priority, bug/not-bug, and owning component)

Type

No type

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions