
[Data]: Streaming train/test split #56780

@martinbomio

Description


The current implementation of train_test_split materializes the datasets, which for very large datasets leads to bottlenecks or requires an excessively big cluster to materialize them.
Adding support for a streaming version of the split would allow for less resource consumption.

There are a few ways this could be implemented:

Bernoulli split

  • Each record goes to A with probability p, else to B.
  • One pass, no buffering, truly streaming.
  • Result sizes will be close to p:N but not exact.
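A minimal sketch of the Bernoulli approach in plain Python (the function name and record/label shape are illustrative, not part of any existing API):

```python
import random


def bernoulli_split(records, p, seed=None):
    """Route each record to split A with probability p, else B, in one pass.

    Streams over the input with O(1) memory; result sizes are only
    approximately p*N. A fixed seed makes the assignment reproducible
    for the same input order.
    """
    rng = random.Random(seed)
    for record in records:
        yield ("A", record) if rng.random() < p else ("B", record)
```

Because the decision depends only on the RNG stream, the split is reproducible only if the seed and the record order are both fixed.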

Hash-based split

  • Choose a stable key (e.g., user_id). Hash it; if hash < threshold, send to A; else B.
  • Proportions are very close to p for large datasets, and it’s reproducible (no RNG state needed).
  • Works great for training/validation splits that must be consistent across runs.
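A sketch of the hash-based variant, assuming a stable per-record key is available (function name is illustrative; MD5 is used here only as a stable, non-cryptographic bucketing hash):

```python
import hashlib


def hash_split(record_key, p):
    """Deterministically route a record to A if the hash of its key
    falls below p, else to B. No RNG state; stable across runs and
    across workers."""
    digest = hashlib.md5(str(record_key).encode("utf-8")).digest()
    # Map the first 8 bytes of the digest to a float in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return "A" if bucket < p else "B"
```

Since the assignment is a pure function of the key, all records sharing a key (e.g. all events for one user_id) land in the same split, which also prevents train/test leakage across correlated records.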

Exact-size split in one pass (requires knowing N)

  • If you know total N and want exactly k = round(p*N) in A, you can do selection sampling:
    • As you read item i, with n_remaining = N - i items left and k_remaining slots left in A, choose the current item for A with probability k_remaining / n_remaining.
  • Guarantees exactly k in A, uniformly at random, single pass, O(1) memory.
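A sketch of selection sampling (Knuth's Algorithm S) under the stated assumption that N is known up front (names are illustrative):

```python
import random


def exact_split(records, n_total, k, seed=None):
    """Selection sampling: choose exactly k of n_total items for A,
    uniformly at random, in a single pass with O(1) memory.

    Requires knowing n_total in advance; assumes the iterable yields
    exactly n_total records.
    """
    rng = random.Random(seed)
    k_remaining = k
    for i, record in enumerate(records):
        n_remaining = n_total - i
        # Probability is exactly k_remaining / n_remaining, so the
        # final A count is guaranteed to be k.
        if rng.random() < k_remaining / n_remaining:
            yield ("A", record)
            k_remaining -= 1
        else:
            yield ("B", record)
```

Note that once k_remaining hits 0 the probability is 0, and once k_remaining equals n_remaining it is 1, so the exact count falls out without any buffering.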

For our use case the hash-based split would be ideal, since it guarantees determinism. That said, the Bernoulli split works for a wide range of use cases and does not require a unique key in the dataset.

Use case

Splitting input dataset into train/test splits for training/validation purposes.

Labels

P1 (should be fixed within a few weeks), community-backlog, data (Ray Data-related), enhancement, performance
