[Data]: Streaming train/test split #56780
Closed
Labels
P1 (issue that should be fixed within a few weeks), community-backlog, data (Ray Data-related issues), enhancement (request for new feature and/or capability), performance
Description
The current implementation of train_test_split materializes the dataset, which for very large datasets creates a bottleneck or requires an excessively large cluster just to hold the data.
Adding support for a streaming version of the split would reduce resource consumption.
There are a few ways this could be implemented:
Bernoulli split
- Each record goes to A with probability p, else to B.
- One pass, no buffering, truly streaming.
- The size of A will be close to p*N (and B close to (1-p)*N), but not exact.
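
A minimal sketch of the Bernoulli split in plain Python, to make the one-pass behavior concrete. It is not Ray Data code: `bernoulli_split` and its `seed` parameter are hypothetical names chosen for illustration, and the records are assumed to arrive as any Python iterable.

```python
import random
from typing import Iterable, Iterator, Tuple, TypeVar

T = TypeVar("T")


def bernoulli_split(
    records: Iterable[T], p: float, seed: int = 0
) -> Iterator[Tuple[str, T]]:
    """Tag each record "A" with probability p, else "B", in a single pass.

    Hypothetical helper for illustration only. Nothing is buffered: each
    record is tagged and yielded as soon as it is read.
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be in [0, 1]")
    rng = random.Random(seed)  # seeded so the same input order gives the same tags
    for record in records:
        yield ("A" if rng.random() < p else "B", record)


if __name__ == "__main__":
    tagged = list(bernoulli_split(range(10_000), p=0.8, seed=42))
    size_a = sum(1 for tag, _ in tagged if tag == "A")
    print(f"A: {size_a}, B: {len(tagged) - size_a}")
```

Note that seeding only makes the split reproducible when the records arrive in the same order; in a distributed, streaming setting that ordering is not guaranteed, which is one reason the hash-based variant is attractive.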
Hash-based split
- Choose a stable key (e.g., user_id). Hash it; if hash < threshold, send to A; else B.
- Proportions are very close to p for large datasets, and it’s reproducible (no RNG state needed).
- Works great for training/validation splits that must be consistent across runs.
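
The hash-based variant could look roughly like the sketch below. It is plain Python rather than Ray Data code, and `hash_split` and its `salt` parameter are hypothetical names. It uses `hashlib` instead of the built-in `hash()`, because Python randomizes string hashes per process by default, which would break reproducibility across runs.

```python
import hashlib


def hash_split(key: str, p: float, salt: str = "") -> str:
    """Assign a key to "A" or "B" from a stable hash of the key.

    Hypothetical helper for illustration only. The same (salt, key, p)
    always yields the same answer, with no RNG state to carry around.
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be in [0, 1]")
    digest = hashlib.sha256(f"{salt}{key}".encode("utf-8")).digest()
    # Interpret the first 8 bytes as an integer in [0, 2**64).
    bucket = int.from_bytes(digest[:8], "big")
    # Integer comparison against the threshold avoids float rounding at p = 1.0.
    threshold = int(p * 2**64)
    return "A" if bucket < threshold else "B"


if __name__ == "__main__":
    for user_id in ["user_1", "user_2", "user_3"]:
        print(user_id, hash_split(user_id, p=0.8))
```

Because the assignment depends only on the key, all records that share a key (for example the same user_id) land in the same split. Changing the assumed `salt` gives a different but equally reproducible partition.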
Exact-size split in one pass (requires knowing N)
- If you know total N and want exactly k = round(p*N) in A, you can do selection sampling:
- As you read item i (0-indexed), let n_remaining = N - i be the number of items not yet seen and k_remaining the number of slots still open in A; choose the current item for A with probability k_remaining / n_remaining.
- Guarantees exactly k in A, uniformly at random, single pass, O(1) memory.
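
The selection-sampling steps above can be sketched as follows. This is plain Python, not Ray Data code; `exact_size_split`, `n_total`, and `seed` are hypothetical names, and the sketch assumes the stream contains exactly `n_total` records.

```python
import random
from typing import Iterable, Iterator, Tuple, TypeVar

T = TypeVar("T")


def exact_size_split(
    records: Iterable[T], n_total: int, p: float, seed: int = 0
) -> Iterator[Tuple[str, T]]:
    """Tag exactly k = round(p * n_total) records "A" in one pass, O(1) memory.

    Hypothetical helper for illustration only. Note that Python's round()
    rounds exact halves to the nearest even integer.
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be in [0, 1]")
    rng = random.Random(seed)
    k_remaining = round(p * n_total)  # slots still open in A
    n_remaining = n_total             # records not yet seen
    for record in records:
        if n_remaining <= 0:
            raise ValueError("stream has more records than n_total")
        # randrange(n) is uniform on 0..n-1, so this test succeeds with
        # probability exactly k_remaining / n_remaining.
        if rng.randrange(n_remaining) < k_remaining:
            k_remaining -= 1
            yield ("A", record)
        else:
            yield ("B", record)
        n_remaining -= 1


if __name__ == "__main__":
    tagged = list(exact_size_split(range(1_000), n_total=1_000, p=0.8))
    print(sum(1 for tag, _ in tagged if tag == "A"))
```

If the stream turns out to be shorter than `n_total`, A may end up with fewer than k records, so the exact-size guarantee depends on knowing N correctly up front.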
For our use case, a hash-based split would be ideal because it guarantees determinism. A Bernoulli split would still work for a wide range of use cases and does not require a unique key in the dataset.
Use case
Splitting an input dataset into train/test splits for training and validation.