[data] fix random_sample return different data in fixed seed #49443

Closed
Jay-ju wants to merge 1 commit into ray-project:master from Jay-ju:fix_random_sample_fixed_seed

Conversation

@Jay-ju
Contributor

@Jay-ju Jay-ju commented Dec 26, 2024

Why are these changes needed?

Related issue number

Closes #48497

Checks

  • I've signed off every commit(by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Signed-off-by: jukejian <jukejian@bytedance.com>
@Jay-ju Jay-ju requested a review from a team as a code owner December 26, 2024 04:56
@jcotant1 jcotant1 added the data Ray Data-related issues label Dec 27, 2024
@Jay-ju
Contributor Author

Jay-ju commented Jan 3, 2025

@scottjlee @richardliaw Please help take a look at this problem when you have time.

@wingkitlee0
Contributor

Technically, when generating random numbers in parallel (e.g., calling random_sample(batch) in map_batches), we need to use a different random generator in each task. It's a big topic. Check out NumPy's SeedSequence.

For example, with rng being a single random generator created outside the lambda,

ds.map_batches(lambda x: {"x": [rng.random() for _ in range(3)]}, batch_size=1)

gives the same 3 random numbers in each batch:

[{'x': 0.9664535356921388},
 {'x': 0.4407325991753527},
 {'x': 0.007491470058587191},
 {'x': 0.9664535356921388},
 {'x': 0.4407325991753527},
 {'x': 0.007491470058587191},
 {'x': 0.9664535356921388},
 {'x': 0.4407325991753527},
 {'x': 0.007491470058587191},
 {'x': 0.9664535356921388},
...
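
The repetition can be reproduced without Ray. The sketch below assumes that each task receives its own serialized copy of one generator; the seed `0` and the helper `draw_in_task` are illustrative and not taken from this thread. It then contrasts that with generators built from children spawned by NumPy's `SeedSequence`:

```python
import pickle

import numpy as np

# One generator created on the driver, as in the lambda above (seed 0 is an assumption).
rng = np.random.default_rng(0)


def draw_in_task(serialized_rng: bytes) -> list:
    """Stand-in for a remote task: it works on its own deserialized copy of the generator."""
    task_rng = pickle.loads(serialized_rng)
    return task_rng.random(3).tolist()


# Every "task" starts from the same state, so every batch repeats the same values.
payload = pickle.dumps(rng)
shared = [draw_in_task(payload) for _ in range(4)]

# SeedSequence.spawn() derives independent child seeds from a single root seed,
# so each task gets its own stream while the whole run stays reproducible.
children = np.random.SeedSequence(0).spawn(4)
independent = [np.random.default_rng(child).random(3).tolist() for child in children]

print(shared)
print(independent)
```

With the shared generator all four lists are identical; with spawned children they differ from each other but are the same on every run.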

@alexeykudinkin
Contributor

Will be addressed in #46088

@richardliaw
Contributor

Hi, we'll be taking #51401 -- feel free to work with @wingkitlee0 on this.

richardliaw pushed a commit that referenced this pull request Apr 10, 2025

## Why are these changes needed?

**Problem**

Currently, `random_sample()` does not return reproducible results when
a fixed seed is given (#40406).

Previous attempts (changing the global seed or passing the same
seed/state to every worker) do not work either.

**Solution [Updated after PR review]**

To use random generators in parallel, we need to be careful about the
seed/state that is passed into `map_batches`. The `numpy` documentation
describes a few methods; one of them is to use a sequence of integer
seeds:
https://numpy.org/doc/2.2/reference/random/parallel.html#sequence-of-integer-seeds.
In Ray Data, we can construct a `random_sample()` UDF that has access to
a "block id" via `TaskContext` (which is thread-local) and uses
`[block_id, seed]` to initialize an RNG. Because a Ray task may be reused
for different blocks, the RNG is saved in `TaskContext.kwargs`.
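
A minimal sketch of this seeding scheme, outside of Ray, might look as
follows. `FakeTaskContext`, `sample_batch`, and `run` are hypothetical
stand-ins written for illustration; they are not the code in this PR,
and the context is passed explicitly here only to keep the example
short.

```python
from dataclasses import dataclass, field
from typing import Any

import numpy as np


@dataclass
class FakeTaskContext:
    """Hypothetical stand-in for Ray Data's TaskContext (not the real class)."""

    task_idx: int
    kwargs: dict = field(default_factory=dict)


def sample_batch(ids: np.ndarray, fraction: float, seed: int, ctx: FakeTaskContext) -> np.ndarray:
    # Reuse the generator if this task already created one for an earlier block.
    rng: Any = ctx.kwargs.get("rng")
    if rng is None:
        # Sequence-of-integer seeds: [task_idx, seed] gives each task its own stream.
        rng = np.random.default_rng([ctx.task_idx, seed])
        ctx.kwargs["rng"] = rng
    # Keep each row independently with probability `fraction`.
    mask = rng.random(len(ids)) < fraction
    return ids[mask]


def run(seed: int) -> list:
    # Split 1000 ids into 4 "blocks", each handled by its own task context.
    blocks = np.array_split(np.arange(1000), 4)
    return [
        sample_batch(block, 0.05, seed, FakeTaskContext(task_idx=i))
        for i, block in enumerate(blocks)
    ]


first, second = run(1234), run(1234)
print(all(np.array_equal(a, b) for a, b in zip(first, second)))
```

Two runs with the same seed select the same rows, while tasks with
different indices draw from different streams.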

**Proposed fix  [Updated after PR review]**

We add `set_current()`/`get_current()` methods to `TaskContext`, which
allow the UDF to get a local copy of the context. Through it, the UDF
has access to the `task_idx` and the previously initialized RNG. This
removes the need for the extra arguments in the original proposal.
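
The idea of a thread-local "current" context can be sketched with the
standard library alone. `SketchTaskContext` and the helper functions
below are simplified, hypothetical stand-ins, not Ray's actual
`TaskContext` implementation.

```python
import threading
from dataclasses import dataclass, field
from typing import Optional

# Each thread sees its own attributes on this object.
_local = threading.local()


@dataclass
class SketchTaskContext:
    """Hypothetical, simplified context holding a task index and per-task state."""

    task_idx: int
    kwargs: dict = field(default_factory=dict)

    @classmethod
    def set_current(cls, ctx: "SketchTaskContext") -> None:
        _local.ctx = ctx

    @classmethod
    def get_current(cls) -> Optional["SketchTaskContext"]:
        return getattr(_local, "ctx", None)


def udf_without_extra_args() -> int:
    # The UDF looks up the context itself instead of receiving it as an argument.
    ctx = SketchTaskContext.get_current()
    if ctx is None:
        raise RuntimeError("no task context set for this thread")
    return ctx.task_idx


seen = {}


def worker(name: str, task_idx: int) -> None:
    # The "executor" sets the context before calling the UDF in the same thread.
    SketchTaskContext.set_current(SketchTaskContext(task_idx=task_idx))
    seen[name] = udf_without_extra_args()


threads = [threading.Thread(target=worker, args=(f"t{i}", i)) for i in range(3)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(seen)
```

Each thread reads back only the context it set, and a thread that never
called `set_current()` gets `None`.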

**After fix**
```python
In [9]: ds = ray.data.range(1000)

In [10]: ds.random_sample(0.05, seed=1234).take_batch()
Out[10]:
{'id': array([ 27,  54,  72, 111, 136, 144, 147, 168, 200, 224, 225, 245, 247,
        248, 307, 312, 313, 340, 347, 375])}

In [11]: ds.random_sample(0.05, seed=1234).take_batch()
Out[11]:
{'id': array([ 27,  54,  72, 111, 136, 144, 147, 168, 200, 224, 225, 245, 247,
        248, 307, 312, 313, 340, 347, 375])}
```

## Related issue number

This issue has been raised a few times:
Closes #40406 #48497 

Other implementations did not solve the root cause:
#46088
#49443

## Checks

- [x] I've signed off every commit(by using the -s flag, i.e., `git
commit -s`) in this PR.
- [x] I've run `scripts/format.sh` to lint the changes in this PR.
- [ ] I've included any doc changes needed for
https://docs.ray.io/en/master/.
- [ ] I've added any new APIs to the API Reference. For example, if I
added a
method in Tune, I've added it in `doc/source/tune/api/` under the
           corresponding `.rst` file.
- [ ] I've made sure the tests are passing. Note that there might be a
few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
   - [x] Unit tests
   - [ ] Release tests
   - [ ] This PR is not tested :(

Signed-off-by: Kit Lee <7000003+wingkitlee0@users.noreply.github.com>
han-steve pushed a commit to han-steve/ray that referenced this pull request Apr 11, 2025
Signed-off-by: Kit Lee <7000003+wingkitlee0@users.noreply.github.com>
Signed-off-by: Steve Han <stevehan2001@gmail.com>
Labels

community-backlog data Ray Data-related issues

Development

Successfully merging this pull request may close these issues.

[Data] seed not respected in Dataset.random_sample()

7 participants