
Generator based random-number generation in dask.array #9038

Merged
rjzamora merged 91 commits into dask:main from erayaslan:da-np-generator-v2 on Feb 17, 2023

Conversation

@erayaslan
Contributor

Obsoletes PR #9005

  • Tried to keep backward compatibility; the cost was some code duplication.
  • More importantly, followed the numpy interface as closely as possible. This should give a better user experience, especially down the line as Generator use becomes more common.
  • RandomState is still the default for now.
  • Documentation is rough; the numpy docs themselves are somewhat irregular for this feature.
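For context, a minimal sketch of the two NumPy interfaces involved — the dask.array API added here is assumed to mirror these names:

```python
import numpy as np

# New-style Generator API that this PR mirrors in dask.array
rng = np.random.default_rng(42)
a = rng.standard_normal(size=(4, 4))

# Legacy RandomState API, effectively frozen upstream but kept as dask's default
state = np.random.RandomState(42)
b = state.standard_normal(size=(4, 4))

assert a.shape == b.shape == (4, 4)
```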

erayaslan added 3 commits May 5, 2022 19:43
np.random.Generator is the replacement for np.random.RandomState, which is effectively frozen. Add a Generator class and methods. Keep RandomState as the default for now for backward compatibility.

Closes dask#8790
@github-actions bot added the `array` and `documentation` labels May 5, 2022
@erayaslan
Contributor Author

Any comments? @jsignell

Member

@jcrist left a comment

Thanks @erayaslan. I gave this a skim and left some high level comments. @jakirkham or @pentschev, could you give this a review?

One concern - as far as I can tell, numpy has not deprecated RandomState. If their intent is that both implementations will be around for a long time, it would be good to reduce the code duplication between the Generator and RandomState implementations. If this is tricky to do without complicating things, don't worry about it.

jcrist and others added 22 commits May 12, 2022 19:32
The macOS builds consistently take the longest to run and have a lower org-level concurrency limit than Windows or Linux. In `dask/dask` we have no macOS-specific logic, so testing every Python version on this platform doesn't seem necessary. To reduce CI delay, we change GitHub Actions to only test macOS on Python 3.9.
* Temporary band-aid to force the deletion of dsk.key_dependencies, which
could result in bad dependency information and thereby bad scheduling
when used in `compute_as_if_collection`

* Typo
* Change test to also look for *slightly* overlapping divisions.

* Also raise if the start of an appended partition is equal to the end of
the last partition, as these should be considered overlapping
partitions.

* Remove commented-out code.

* Remove kwarg which is new in pandas 1.4, the default behavior is fine.
Currently codecov pushes an initial failing status on every PR that is
later updated to passing once more CI builds finish. This is annoying.

As far as I can tell from https://docs.codecov.com/docs/merging-reports
this shouldn't be happening, but it is. Here we attempt to stop this
behavior by:

- Only running code coverage on the main test suite that hits most of
  the codebase.
- Setting a minimum number of builds before codecov should push a status
  update.

Hopefully this fixes the problem.
The (experimental) pandas `string[pyarrow]` dtype has some major
performance benefits that we'd like to experiment with in dask. However,
currently `pyarrow.StringArray` objects have a bug in their pickle
implementation where a small slice of the array still serializes the
full (potentially very large) backing buffers (see
https://issues.apache.org/jira/browse/ARROW-10739). Hopefully this is
fixed upstream in pyarrow at some point, but for now we patch around it
by overriding the pickling implementation for `ArrowStringArray` in
pandas. This implementation is efficient, resulting in zero-copy
serialization in most cases.

There is still more work to do to fully support the `string[pyarrow]`
dtype, but I think this PR can go in as is for now.
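The pickling workaround described above can be sketched in plain Python. The class and names here are illustrative, not pandas' actual patch; the point is overriding `__reduce__` so a small slice pickles only its own data rather than the large backing buffer:

```python
import pickle

# Illustrative sketch of the patch idea: override __reduce__ so a sliced
# view serializes only the bytes it actually needs, not the full buffer.
class SlicedBytes:
    def __init__(self, data: bytes, start: int, stop: int):
        self._data = data
        self._start = start
        self._stop = stop

    def __reduce__(self):
        # Copy out just the slice before pickling, dropping the large buffer
        trimmed = self._data[self._start:self._stop]
        return (SlicedBytes, (trimmed, 0, len(trimmed)))

big = SlicedBytes(b"x" * 1_000_000, 0, 10)
payload = pickle.dumps(big)
assert len(payload) < 1000  # only the 10-byte slice is serialized
```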
`prod`/`nanprod` overflow for data at this size. It's not clear why numpy only raises these warnings _sometimes_, but it makes sense why they're there. We now filter `RuntimeWarning` for `prod`/`nanprod` (and only these operations) to fix this.
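A minimal sketch of the pattern (not dask's actual test code): suppress `RuntimeWarning` only around the operation that may overflow, rather than globally:

```python
import warnings
import numpy as np

# Values near the int64 limit, so the product overflows
data = np.full(64, 2**62, dtype=np.int64)

with warnings.catch_warnings():
    # Scoped filter: only this block ignores RuntimeWarning
    warnings.simplefilter("ignore", RuntimeWarning)
    result = np.prod(data)  # overflow warning (if numpy emits one) is suppressed
```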
CI for these tests has been broken for a while. Since dask no longer has
any HDFS-specific functionality, we're just relying on fsspec here for
hdfs interaction. Since HDFS isn't commonly used, the maintenance burden
here doesn't seem worth it.
* Add codespell pre-commit

* Fix missing newline in setup.cfg

Co-authored-by: Jim Crist-Harif <jcristharif@gmail.com>
[PEP 544](https://www.python.org/dev/peps/pep-0544/) introduces the `Protocol` class to the `typing` module in Python 3.8 (soon to be the minimum supported version, dask/community#213). Writing new Dask collections for [dask-awkward](https://github.com/ContinuumIO/dask-awkward/) has had me thinking about working on a `DaskCollection` protocol. I imagine the benefits to be:

- usage with static type checkers
  - other activity in this area at
    - dask#8295 
    - dask#8706 
    - dask#8854
  - Python supporting IDEs take advantage of typing
- self-documenting; some improvements to [the custom collections page](https://docs.dask.org/en/latest/custom-collections.html) of the docs. The protocol docs can be autogenerated and added to that page.
- purely opt-in feature

The `typing.runtime_checkable` decorator allows use of `isinstance(x, DaskCollection)` in any code base
that uses Dask collections; for example:

```python
>>> from dask.typing import DaskCollection
>>> import dask.array as da
>>> x = da.zeros((10, 3))
>>> isinstance(x, DaskCollection)
True
```
(though this is an order of magnitude slower than `dask.base.is_dask_collection`, which only checks for `x.__dask_graph__() is not None`; static type checking & built-in interface documentation are the core benefits IMO)
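The mechanics of `typing.runtime_checkable` can be shown with a toy protocol. This is an illustrative sketch, not dask's actual `DaskCollection` definition — `isinstance` against a runtime-checkable protocol only verifies that the methods exist, which is why it is slower than a direct attribute check:

```python
from typing import Any, Protocol, runtime_checkable

@runtime_checkable
class DaskCollection(Protocol):
    """Toy protocol capturing part of the collection interface."""

    def __dask_graph__(self) -> Any: ...
    def __dask_keys__(self) -> list: ...

class MyCollection:
    # No inheritance needed: matching the methods is enough (structural typing)
    def __dask_graph__(self):
        return {}

    def __dask_keys__(self):
        return []

assert isinstance(MyCollection(), DaskCollection)
assert not isinstance(object(), DaskCollection)
```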

Something else that came up in the brief discussion on a call last week was having `{Scheduler,Worker,Nanny}Plugin` protocols in `distributed`; and perhaps those are better places to start introducing protocols to Dask since on the user side typically more folks would write plugins than new collections.
Currently both Dask and Distributed implement this function with very
slight variations. To attempt to consolidate these, pull in the
Distributed implementation content into the Dask implementation. Then
both Dask & Distributed can use this one function.
…` and ``aggregate_files`` (dask#9052)

As discussed in dask#9043 (for `chunksize`) and dask#9051 (for `aggregate_files`), I propose that we deprecate two complex and rarely-utilized arguments from `read_parquet`: `chunksize` and `aggregate_files`.

This PR simply adds "pre-deprecation" warnings for the targeted arguments (including links to the relevant Issues discussing their deprecation). My goal is to find (and inform) whatever users may be depending on these obscure options.
Co-authored-by: Ian Rose <ian.r.rose@gmail.com>
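The "pre-deprecation" pattern described above can be sketched as follows. The function body and warning text are hypothetical stand-ins, not dask's actual implementation:

```python
import warnings

# Hypothetical sketch: warn when a soon-to-be-deprecated keyword is passed
def read_parquet(path, chunksize=None, aggregate_files=None):
    if chunksize is not None:
        warnings.warn(
            "The `chunksize` argument is being considered for deprecation; "
            "see dask#9043 for discussion.",
            FutureWarning,
        )
    if aggregate_files is not None:
        warnings.warn(
            "The `aggregate_files` argument is being considered for "
            "deprecation; see dask#9051 for discussion.",
            FutureWarning,
        )
    return path  # placeholder for the actual reading logic

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    read_parquet("data.parquet", chunksize="64MB")
assert any(issubclass(w.category, FutureWarning) for w in caught)
```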
* Use `elif` for `decode` in `ensure_unicode`

* Handle Python Buffer Protocol in `ensure_unicode`

Any other arbitrary object (like `bytearray` or `memoryview` based
objects) can be decoded to `unicode` via `codecs.decode`. This is
analogous to what is done in `ensure_bytes`. So handle this case here.
If this also fails, then raise as usual.

* Include `ensure_unicode` tests for various objects

* Clarify error messages

* Use `uint8` in `array` tests

This is more consistent with the other tests, which also use this type.
Though `int8` also works.

* Pass `bytes` directly to `array`

Appears this already gets interpreted correctly by `array`. Should also
make the code easier to read for other maintainers.

* Use `from array import array`

Avoids the `array.array` bit which is a tad verbose.
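The commit messages above can be condensed into a short sketch. The `ensure_unicode` body here is illustrative (dask's real implementation differs in detail); the `codecs.decode` fallback and the bytes-to-`array` behavior are the points being shown:

```python
import codecs
from array import array

# Sketch: fall back to codecs.decode for arbitrary buffer-protocol objects,
# analogous to what ensure_bytes does
def ensure_unicode(s):
    if isinstance(s, str):
        return s
    if isinstance(s, (bytes, bytearray)):
        return s.decode()
    # memoryview and other buffer-protocol objects decode via codecs
    return codecs.decode(s)

assert ensure_unicode(memoryview(b"abc")) == "abc"

# `array` accepts bytes directly; "B" is unsigned 8-bit, akin to uint8
a = array("B", b"\x01\x02\x03")
assert list(a) == [1, 2, 3]
```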
@erayaslan
Contributor Author

@rjzamora I think we're in good shape. Any comments or suggestions on moving forward?

@rjzamora
Member

Thanks for the nudge @erayaslan - I think I may try to simplify a few minor things related to dispatching and _backend changes here today, and it would be ideal if @jrbourbeau had a chance to take a final look after that. Thank you for your work and patience on this (I really appreciate it)!

@erayaslan
Contributor Author

Ping — how do you feel about merging this? @rjzamora @jrbourbeau

@rjzamora
Member

@jrbourbeau - It would be nice to have someone other than me take a pass at this. Note that I'd be willing to follow up with fixes if this PR ends up breaking anything.

Member

@pentschev left a comment

Apologies for taking so long to review here again @erayaslan, I appreciate your patience!

I do not see any obvious errors; I've left a few comments/suggestions on minor things, but otherwise this looks good.

Comment on lines +915 to +916

```python
if isinstance(ar, np.ndarray):
    return np.ascontiguousarray(np.broadcast_to(ar, shape))
```
Member

Contributor Author

I don't think so, although @rjzamora can give a more definite answer. In the meantime, I added code to raise an error if we get an unexpected type: 8a83a0f

erayaslan and others added 10 commits January 21, 2023 11:40
Co-authored-by: Peter Andreas Entschev <peter@entschev.com>
Co-authored-by: Peter Andreas Entschev <peter@entschev.com>
Co-authored-by: Peter Andreas Entschev <peter@entschev.com>
As noted by Peter Andreas Entschev <peter@entschev.com>:
Why do we need to compute() results here? assert_eq should take care of
that, plus computing here will potentially lose attributes such as meta.
Co-authored-by: Peter Andreas Entschev <peter@entschev.com>
So that we do not break if `p` is ever a CuPy array, for example.
@erayaslan
Contributor Author

Ping for review? @pentschev @jrbourbeau

Member

@pentschev left a comment

Apologies, I thought I had already approved this. Looks good; I'm OK with this being merged as is. Thanks for all the hard work and patience @erayaslan!

@rjzamora changed the title from "Use np.random.Generator to generate random numbers - v2" to "Generator based random-number generation in dask.array" Feb 17, 2023
@rjzamora merged commit e2c7472 into dask:main Feb 17, 2023
@rjzamora
Member

Thanks for the work and patience here @erayaslan !

@erayaslan deleted the da-np-generator-v2 branch February 18, 2023 15:44
Comment on lines +202 to +203

```python
with pytest.raises(DeprecationWarning):
    da.random.random_integers(10, size=5, chunks=3).compute()
```
Contributor

What warning are you expecting here? I don't get any deprecation warning, and I don't know which one it should be since there's no `match`.

Contributor Author

@erayaslan Mar 27, 2023

`random_integers` was deprecated in numpy 1.11.0, so you should get a `DeprecationWarning` with any later version. I am getting one with numpy 1.24.2.

https://numpy.org/doc/stable/reference/random/generated/numpy.random.random_integers.html

Contributor

Ah, I see, thank you. I found that someone had added an overly broad `-Wignore` to the package, which broke this test. I have removed it and the downstream build is working. Sorry for the noise here.
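The failure mode described here can be illustrated without dask. A blanket "ignore" filter (the in-process equivalent of running Python with `-Wignore`) swallows the `DeprecationWarning` a test expects to observe:

```python
import warnings

# Stand-in for the deprecated call (random_integers in the real test)
def legacy():
    warnings.warn("legacy() is deprecated", DeprecationWarning)

with warnings.catch_warnings(record=True) as caught_ignored:
    warnings.simplefilter("ignore")  # blanket ignore, like -Wignore
    legacy()

with warnings.catch_warnings(record=True) as caught_always:
    warnings.simplefilter("always")
    legacy()

# Under the blanket filter the warning never surfaces, so a
# pytest.raises(DeprecationWarning) / pytest.warns check would fail
assert caught_ignored == []
assert any(issubclass(w.category, DeprecationWarning) for w in caught_always)
```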


Labels

array · dispatch (Related to `Dispatch` extension objects) · documentation (Improve or add to documentation)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Add dask.array.default_rng