-
Notifications
You must be signed in to change notification settings - Fork 7.4k
[data] Improve documentation around typing and batch formats #58615
Copy link
Copy link
Closed
Labels
dataRay Data-related issuesRay Data-related issuesdocsAn issue or change related to documentationAn issue or change related to documentationtriageNeeds triage (eg: priority, bug/not-bug, and owning component)Needs triage (eg: priority, bug/not-bug, and owning component)
Description
Description
Let's add some documentation around typing in Ray Data.
A couple questions when working with batch formats:
- I though the underlying format in Ray Data was a pyarrow table, but the documentation says it is a dict of numpy arrays. Which is correct?
- What are the valid return types for the UDF in map_batches? The docs say either "pandas" or "Dict[str, np.ndarray]", but I thought a pyarrow table was also a valid return type.
- For different batch_formats, what is the compatibility matrix and performance matrix when they convert amongst each other?
And what is the recommendation of which batch_format to use and when?
generic recommendation is:
if you want dataframe, then use pyarrow+polars
if you want dataframe + pandas, then use pandas
if you want numerical ops or tensor ops, use numpy
Link
No response
Reactions are currently unavailable
Metadata
Metadata
Assignees
Labels
dataRay Data-related issuesRay Data-related issuesdocsAn issue or change related to documentationAn issue or change related to documentationtriageNeeds triage (eg: priority, bug/not-bug, and owning component)Needs triage (eg: priority, bug/not-bug, and owning component)