[Data] seed not respected in Dataset.random_sample() #48497

@scottjlee

Description

What happened + What you expected to happen

When a dataset produced by Dataset.random_sample() with a fixed seed is executed multiple times, the sampled output differs between executions (in the script below, two calls to ds.count() return different values). With a fixed seed, the output dataset should be reproducible and deterministic.

Versions / Dependencies

ray master (ray 2.38)

Reproduction script

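import ray

# Sample roughly 10% of the rows with a fixed seed.
ds = ray.data.range(1219)
ds = ds.random_sample(0.1, seed=0)

# Count the sampled rows twice on the same dataset.
check1 = ds.count()
print(f"=== Check 1: {check1}")

check2 = ds.count()
print(f"=== Check 2: {check2}")

# With a fixed seed, both counts should match.
assert check1 == check2, f"{check1=} vs. {check2=}"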

Without the

Issue Severity

Medium: It is a significant difficulty but I can work around it.
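
One possible workaround, sketched here in plain Python rather than taken from Ray itself: decide whether to keep each row by hashing the seed together with a stable row identifier, so the decision does not depend on any random-number-generator state or on execution order. The helper names `keep_row` and `sample_ids` are hypothetical, and the sketch assumes each row carries a stable integer id (as the `id` column of `ray.data.range()` does).

```python
import hashlib


def keep_row(row_id: int, fraction: float, seed: int) -> bool:
    """Deterministically decide whether a row belongs to the sample.

    Hypothetical helper: the decision depends only on (seed, row_id),
    never on RNG state, so repeated executions agree.
    """
    digest = hashlib.sha256(f"{seed}:{row_id}".encode()).digest()
    # Interpret the first 8 bytes of the hash as an integer in [0, 2**64).
    value = int.from_bytes(digest[:8], "big")
    # Keep the row if its hash falls below the fraction-sized threshold.
    return value < int(fraction * 2**64)


def sample_ids(n: int, fraction: float, seed: int) -> list:
    """Return the ids in range(n) selected for the sample."""
    return [i for i in range(n) if keep_row(i, fraction, seed)]


first = sample_ids(1219, 0.1, seed=0)
second = sample_ids(1219, 0.1, seed=0)
assert first == second  # same seed, same sample, on every run
```

With Ray Data, the same predicate could presumably be applied through a row filter, for example `ds.filter(lambda row: keep_row(row["id"], 0.1, 0))`. The sample size is only approximately `fraction * n`, as with any per-row sampling scheme.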

Labels

P1 (Issue that should be fixed within a few weeks), bug (Something that is supposed to be working; but isn't), data (Ray Data-related issues)
