[data] make random_sample() reproducible #51401

Merged
richardliaw merged 1 commit into ray-project:master from wingkitlee0:klee/random-sample-fixed-seed
Apr 10, 2025

Conversation

@wingkitlee0
Contributor

@wingkitlee0 wingkitlee0 commented Mar 15, 2025

Why are these changes needed?

Problem

The current random_sample() does not produce deterministic results with a fixed seed (#40406).

Previous attempts (changing the global seed or passing the same seed/state to workers) also do not work.

Solution [Updated after PR review]

To use random generators in parallel, we need to be careful about the seed/state passed into map_batches. NumPy describes a few methods for this; one of them is to use a sequence of integer seeds (https://numpy.org/doc/2.2/reference/random/parallel.html#sequence-of-integer-seeds). In Ray Data, we can construct a random_sample() UDF that has access to a "block id" via TaskContext (which is thread-local) and use [block_id, seed] to initialize an RNG. Because a Ray task may be reused for different blocks, the RNG is saved in TaskContext.kwargs.
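The "sequence of integer seeds" approach can be illustrated with plain NumPy (a standalone sketch; the stream ids here stand in for Ray Data block/task indices):

```python
import numpy as np

seed = 1234

# Seeding with a [stream_id, seed] sequence gives each parallel worker
# its own reproducible stream: the same sequence always reproduces the
# same stream, while different stream ids give independent streams.
rng_block0 = np.random.default_rng([0, seed])
rng_block0_again = np.random.default_rng([0, seed])
rng_block1 = np.random.default_rng([1, seed])

a = rng_block0.random(5)
b = rng_block0_again.random(5)
c = rng_block1.random(5)

assert np.array_equal(a, b)      # reproducible per stream id
assert not np.array_equal(a, c)  # independent across stream ids
```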

Proposed fix [Updated after PR review]

We add set_current()/get_current() methods to TaskContext, which allow the UDF to get a local copy. The copy has access to the task_idx and the previously initialized RNG. This removes the need for the extra arguments in the original proposal.
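A minimal sketch of the thread-local pattern described above (names and signatures are illustrative; Ray's actual TaskContext implementation differs):

```python
import threading
from dataclasses import dataclass, field
from typing import Any, Dict, Optional

# Thread-local slot holding the context of the currently running task.
_current = threading.local()


@dataclass
class TaskContext:
    """Per-task context; kwargs holds task-scoped state such as an RNG."""

    task_idx: int
    kwargs: Dict[str, Any] = field(default_factory=dict)

    @classmethod
    def set_current(cls, ctx: "TaskContext") -> None:
        _current.ctx = ctx

    @classmethod
    def get_current(cls) -> Optional["TaskContext"]:
        return getattr(_current, "ctx", None)

    @classmethod
    def reset_current(cls) -> None:
        _current.ctx = None


# The map task would set the context before invoking the UDF, so the
# UDF can call TaskContext.get_current() without extra arguments.
TaskContext.set_current(TaskContext(task_idx=3))
ctx = TaskContext.get_current()
assert ctx is not None and ctx.task_idx == 3
TaskContext.reset_current()
assert TaskContext.get_current() is None
```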

After fix

```python
In [9]: ds = ray.data.range(1000)

In [10]: ds.random_sample(0.05, seed=1234).take_batch()
Out[10]:
{'id': array([ 27,  54,  72, 111, 136, 144, 147, 168, 200, 224, 225, 245, 247,
        248, 307, 312, 313, 340, 347, 375])}

In [11]: ds.random_sample(0.05, seed=1234).take_batch()
Out[11]:
{'id': array([ 27,  54,  72, 111, 136, 144, 147, 168, 200, 224, 225, 245, 247,
        248, 307, 312, 313, 340, 347, 375])}
```

Related issue number

This issue has been raised a few times:
Closes #40406 and #48497.

Other implementations did not solve the root cause:
#46088
#49443

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@wingkitlee0 wingkitlee0 changed the title Make random_sample() reproducible [data] make random_sample() reproducible Mar 15, 2025
@wingkitlee0 wingkitlee0 force-pushed the klee/random-sample-fixed-seed branch 8 times, most recently from 9190c43 to 09d7814 Compare March 16, 2025 21:06
@wingkitlee0 wingkitlee0 marked this pull request as ready for review March 16, 2025 23:35
@wingkitlee0 wingkitlee0 requested a review from a team as a code owner March 16, 2025 23:35
@wingkitlee0
Contributor Author

@alexeykudinkin can you help review? thanks

@alexeykudinkin
Contributor

Will be addressed in #46088

@wingkitlee0
Contributor Author

Will be addressed in #46088

@alexeykudinkin I am aware of that PR and do not think the solution is correct. See my example here: #49443 (comment)

@jcotant1 jcotant1 added the data Ray Data-related issues label Mar 24, 2025
@raulchen
Contributor

hey @wingkitlee0 , @alexeykudinkin was referring to a different PR https://github.com/ray-project/ray/pull/46088/files
That PR looks good to me as well. And it's simpler.
If you still have concerns, please comment on that PR.

@raulchen
Contributor

Actually, after reading more on both PRs, I think this PR is better, because it can avoid the small-batch issue mentioned in #46088 cc @alexeykudinkin

Contributor

I'd like to avoid exposing this flag.
Instead, we can add a get_current() method in TaskContext

Contributor

And we don't need to pass batch_idx either; we can just seed the random generator once per task.
This can be done by either 1) using a class-based UDF or 2) saving some state in TaskContext.kwargs. (1) would be better.

Contributor Author

  1. use a class-based UDF

A class-based UDF requires concurrency to be set. Is there any way to get around that?

Contributor

You can use a class-based UDF with compute=TaskPoolComputeStrategy

Contributor Author

Interesting. ray/data/_internal/util.py's get_compute_strategy seems to discourage the use of CallableClass + TaskPoolStrategy (get_compute_strategy is called by _map_batches_without_batch_size_validation). I can find ways to bypass that, but I am curious whether it would work...

Contributor

By default, we choose ActorPoolStrategy if the UDF is a class.
The motivation is to simplify the usage and avoid users having to specify the strategy.
But in theory, CallableClass + TaskPoolStrategy should also be feasible.
You can probably move get_compute_strategy out of _map_batches_without_batch_size_validation.
Not sure if there are other issues. If you find you have to tweak too many things, TaskContext.kwargs is also fine.

Contributor

We can probably just check against hard-coded expected results; otherwise, we'll also need to test that get_expected_mask_indices is deterministic.
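A test in the spirit of this suggestion might look roughly like the following (illustrative only; the function and test names are hypothetical, and the real test would hard-code the expected index list rather than derive it):

```python
import numpy as np


def sample_indices(task_idx, seed, n=100, fraction=0.05):
    """Sampled row indices for a [task_idx, seed]-seeded stream."""
    rng = np.random.default_rng([task_idx, seed])
    return np.flatnonzero(rng.random(n) < fraction).tolist()


def test_random_sample_is_deterministic():
    # The real test would compare against a hard-coded list; here we
    # assert the reproducibility that a hard-coded expectation relies on.
    first = sample_indices(task_idx=0, seed=1234)
    assert sample_indices(task_idx=0, seed=1234) == first


test_random_sample_is_deterministic()
```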

Contributor

nice test

@alexeykudinkin alexeykudinkin added the go add ONLY when ready to merge, run all tests label Mar 26, 2025
@alexeykudinkin
Copy link
Copy Markdown
Contributor

@wingkitlee0 yes, you're right. I overlooked the fact that we'll be generating identical sequences per block which isn't ideal.

Contributor

Let's avoid this param

Contributor Author

Sure.
However, it will now need to call _map_batches_without_batch_size_validation, which does not have default args, so there will be a bunch of hardcoded default values in random_sample. Any advice?

Contributor

Why do we need batch_idx?

Contributor Author

My original thought was to keep this a pure and stateless function.

@richardliaw richardliaw added the @external-author-action-required Alternate tag for PRs where the author doesn't have labeling permission. label Mar 28, 2025
@wingkitlee0 wingkitlee0 force-pushed the klee/random-sample-fixed-seed branch from 09d7814 to a8b8d33 Compare March 29, 2025 22:19
@richardliaw
Contributor

@wingkitlee0 - could you fix tests? and ping when ready for another review?

@wingkitlee0 wingkitlee0 force-pushed the klee/random-sample-fixed-seed branch 3 times, most recently from 3382a4e to 18736ba Compare April 4, 2025 01:35
@wingkitlee0 wingkitlee0 force-pushed the klee/random-sample-fixed-seed branch from 18736ba to 0c1dd08 Compare April 4, 2025 01:41
@hainesmichaelc hainesmichaelc added the community-contribution Contributed by the community label Apr 4, 2025
@wingkitlee0 wingkitlee0 force-pushed the klee/random-sample-fixed-seed branch from 0c1dd08 to e3d6d91 Compare April 4, 2025 03:49
@wingkitlee0
Contributor Author

It's ready for re-review!

  • No change to the public APIs.
  • Used TaskContext.kwargs in the end. I checked TaskPoolStrategy etc.: it can't use class-based UDFs right now because it skips the step that calls the constructor.
  • Updated AbstractUDFMap to use kwargs (the use of positional args + defaults made the previous pipeline failures hard to track down: the object was instantiated successfully with mismatched args, and the error wasn't raised until many steps later).
  • Simplified the unit tests

```python
else:
    rng = np.random.default_rng(
        [ctx.kwargs.get("batch_idx", 0), ctx.task_idx, seed]
    )
```
Contributor

I think we can get rid of the include_task_ctx flag as well.

  1. We can add a thread-local variable in TaskContext to allow accessing the current TaskContext. Basically, call TaskContext.set_current/reset_current in _map_task.
  2. I don't think we need batch_idx here. We just need to create the rng object once per task and store it in TaskContext.kwargs.
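The once-per-task seeding idea can be sketched as follows (a simplified, self-contained stand-in: this TaskContext is a hypothetical mock, not Ray's actual class, and random_sample_batch is an illustrative UDF):

```python
import numpy as np


class TaskContext:
    """Hypothetical stand-in for Ray Data's TaskContext."""

    def __init__(self, task_idx):
        self.task_idx = task_idx
        self.kwargs = {}


def random_sample_batch(batch, ctx, fraction, seed):
    # Create the RNG once per task, seeded by [task_idx, seed], and
    # cache it in ctx.kwargs so later batches in the same task reuse
    # the same stream instead of restarting it.
    if "rng" not in ctx.kwargs:
        ctx.kwargs["rng"] = np.random.default_rng([ctx.task_idx, seed])
    rng = ctx.kwargs["rng"]
    mask = rng.random(len(batch)) < fraction
    return [row for row, keep in zip(batch, mask) if keep]


batch = list(range(100))
sample1 = random_sample_batch(batch, TaskContext(task_idx=0), 0.1, seed=1234)

# A fresh context with the same task_idx and seed reproduces the sample.
sample2 = random_sample_batch(batch, TaskContext(task_idx=0), 0.1, seed=1234)
assert sample1 == sample2
```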

Contributor Author

Great suggestion. It will reduce code changes. I will probably need to update the tests a little bit.

Contributor

@raulchen raulchen left a comment

LGTM with one last comment

```python
    kwargs: Dict[str, Any] = field(default_factory=dict)

    @classmethod
    def get_current(cls, create_if_not_exists=True, **kwargs) -> "TaskContext":
```
Contributor

nit: I don't think create_if_not_exists and kwargs are needed for this PR. Let's remove them.

@wingkitlee0 wingkitlee0 force-pushed the klee/random-sample-fixed-seed branch from fe368ee to 35d4280 Compare April 9, 2025 00:07
- using a TaskContext to access task_idx

Signed-off-by: Kit Lee <7000003+wingkitlee0@users.noreply.github.com>
@wingkitlee0 wingkitlee0 force-pushed the klee/random-sample-fixed-seed branch from 35d4280 to 8f3b7b4 Compare April 9, 2025 01:27
@richardliaw richardliaw merged commit fa03256 into ray-project:master Apr 10, 2025
5 checks passed
han-steve pushed a commit to han-steve/ray that referenced this pull request Apr 11, 2025


Development

Successfully merging this pull request may close these issues.

[Ray Data] Dataset.random_sample() does not return deterministic results even when seed is set
