
[Data]: Streaming train/test split #56780

@martinbomio

Description


The current implementation of train_test_split materializes the datasets, which for very large datasets leads to bottlenecks or requires an excessively big cluster to materialize them.
Adding support for a streaming version of the split would allow for less resource consumption.

There are a few ways this could be implemented:

Bernoulli split

  • Each record goes to A with probability p, else to B.
  • One pass, no buffering, truly streaming.
  • Result sizes will be close to p:N but not exact.
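A minimal sketch of the Bernoulli approach in plain Python (the function name and record/label shape are illustrative, not part of any existing API):

```python
import random


def bernoulli_split(records, p, seed=None):
    """Route each record to split A with probability p, else B, in one pass.

    Streams over the input with O(1) memory; result sizes are only
    approximately p*N. A fixed seed makes the assignment reproducible
    for the same input order.
    """
    rng = random.Random(seed)
    for record in records:
        yield ("A", record) if rng.random() < p else ("B", record)
```

Because the decision depends only on the RNG stream, the split is reproducible only if the seed and the record order are both fixed.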

Hash-based split

  • Choose a stable key (e.g., user_id). Hash it; if hash < threshold, send to A; else B.
  • Proportions are very close to p for large datasets, and it’s reproducible (no RNG state needed).
  • Works great for training/validation splits that must be consistent across runs.
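A sketch of the hash-based variant, assuming a stable per-record key is available (function name is illustrative; MD5 is used here only as a stable, non-cryptographic bucketing hash):

```python
import hashlib


def hash_split(record_key, p):
    """Deterministically route a record to A if the hash of its key
    falls below p, else to B. No RNG state; stable across runs and
    across workers."""
    digest = hashlib.md5(str(record_key).encode("utf-8")).digest()
    # Map the first 8 bytes of the digest to a float in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return "A" if bucket < p else "B"
```

Since the assignment is a pure function of the key, all records sharing a key (e.g. all events for one user_id) land in the same split, which also prevents train/test leakage across correlated records.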

Exact-size split in one pass (requires knowing N)

  • If you know total N and want exactly k = round(p*N) in A, you can do selection sampling:
    • As you read item i, with n_remaining = N - i items left and k_remaining slots left in A, choose the current item for A with probability k_remaining / n_remaining.
  • Guarantees exactly k in A, uniformly at random, single pass, O(1) memory.
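A sketch of selection sampling (Knuth's Algorithm S) under the stated assumption that N is known up front (names are illustrative):

```python
import random


def exact_split(records, n_total, k, seed=None):
    """Selection sampling: choose exactly k of n_total items for A,
    uniformly at random, in a single pass with O(1) memory.

    Requires knowing n_total in advance; assumes the iterable yields
    exactly n_total records.
    """
    rng = random.Random(seed)
    k_remaining = k
    for i, record in enumerate(records):
        n_remaining = n_total - i
        # Probability is exactly k_remaining / n_remaining, so the
        # final A count is guaranteed to be k.
        if rng.random() < k_remaining / n_remaining:
            yield ("A", record)
            k_remaining -= 1
        else:
            yield ("B", record)
```

Note that once k_remaining hits 0 the probability is 0, and once k_remaining equals n_remaining it is 1, so the exact count falls out without any buffering.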

For our use case the hash-based split would be ideal, since it guarantees determinism. That said, the Bernoulli split works for a wide range of use cases and does not require a unique key in the dataset.

Use case

Splitting input dataset into train/test splits for training/validation purposes.

Labels

P1 (should be fixed within a few weeks), community-backlog, data (Ray Data-related), enhancement, performance
