Generator based random-number generation in dask.array (#9038)
np.random.Generator is the replacement for np.random.RandomState, which is effectively frozen. This PR adds the Generator class and its methods, keeping RandomState as the default for now for backward compatibility. Closes dask#8790.
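For context, a minimal sketch of the NumPy API difference this PR mirrors in dask.array: `RandomState` is the frozen legacy interface, while `default_rng` returns the recommended `Generator`. This uses plain NumPy only; the dask wrapper added by the PR follows the same pattern.

```python
import numpy as np

# Legacy API: frozen, kept only for backward compatibility.
legacy = np.random.RandomState(42)

# New API: a Generator created via default_rng, the recommended replacement.
rng = np.random.default_rng(42)

# Both expose largely the same distribution methods.
a = legacy.standard_normal(3)
b = rng.standard_normal(3)
assert a.shape == b.shape == (3,)
```

The two streams are not bit-compatible (different underlying bit generators), which is why the PR keeps RandomState as the default rather than silently changing results.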
Any comments? @jsignell
jcrist
left a comment
Thanks @erayaslan. I gave this a skim and left some high level comments. @jakirkham or @pentschev, could you give this a review?
One concern - as far as I can tell, numpy has not deprecated RandomState. If their intent is that both implementations will be around for a long time, it would be good to reduce the code duplication between the Generator and RandomState implementations. If this is tricky to do without complicating things, don't worry about it.
The macOS builds consistently take the longest to run, and have a lower org-level concurrency limit than Windows or Linux. In `dask/dask` we have no logic specific to macOS, so testing every Python version on this platform doesn't seem necessary. To reduce CI delay, we change GitHub Actions to only test macOS on Python 3.9.
* Temporary band-aid to force the deletion of dsk.key_dependencies, which could result in bad dependency information and thereby bad scheduling when used in `compute_as_if_collection` * Typo
* Change test to also look for *slightly* overlapping divisions. * Also raise if the start of an appended partition is equal to the end of the last partition, as these should be considered overlapping partitions. * Remove commented-out code. * Remove kwarg which is new in pandas 1.4, the default behavior is fine.
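The stricter overlap rule described above can be sketched as a small predicate. The helper name `divisions_overlap` is hypothetical, for illustration only; the point is that an appended partition starting exactly at the previous partition's end now also counts as overlapping.

```python
def divisions_overlap(last_end, appended_start):
    # Overlap if the appended partition starts at or before the end of the
    # last existing partition. Equal boundaries count as overlapping, per
    # the stricter check described in the commit above.
    return appended_start <= last_end

assert divisions_overlap(10, 9)      # slight overlap
assert divisions_overlap(10, 10)     # equal boundary now also overlaps
assert not divisions_overlap(10, 11) # strictly after: fine
```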
Currently codecov pushes an initial failing status on every PR that is later updated to passing once more CI builds finish. This is annoying. As far as I can tell from https://docs.codecov.com/docs/merging-reports this shouldn't be happening, but it is. Here we attempt to stop this behavior by: - Only running code coverage on the main test suite that hits most of the codebase. - Setting a minimum number of builds before codecov should push a status update. Hopefully this fixes the problem.
The (experimental) pandas `string[pyarrow]` dtype has some major performance benefits that we'd like to experiment with in dask. However, currently `pyarrow.StringArray` objects have a bug in their pickle implementation where a small slice of the array still serializes the full (potentially very large) backing buffers (see https://issues.apache.org/jira/browse/ARROW-10739). Hopefully this is fixed upstream in pyarrow at some point, but for now we patch around it by overriding the pickling implementation for `ArrowStringArray` in pandas. This implementation is efficient, resulting in zero-copy serialization in most cases. There is still more work to do to fully support the `string[pyarrow]` dtype, but I think this PR can go in as is for now.
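The shape of the fix can be illustrated with a toy class (not the actual pandas/pyarrow code): the pickle bug is that a small slice drags its large backing buffer along, and the patch amounts to overriding serialization so only the visible data is written. `BigBacked` below is entirely hypothetical.

```python
import pickle

class BigBacked:
    """Toy stand-in for an array slice that references a large backing buffer."""

    def __init__(self, data, backing=None):
        self.data = data
        # Simulate a large backing buffer (~10 MB) shared by many slices.
        self.backing = backing if backing is not None else bytes(10_000_000)

    def __reduce__(self):
        # Serialize only the visible data, not the large backing buffer,
        # mirroring the idea of the ArrowStringArray pickle override.
        return (type(self), (self.data,))

payload = pickle.dumps(BigBacked(b"abc"))
assert len(payload) < 1000  # tiny, despite the 10 MB backing buffer

restored = pickle.loads(payload)
assert restored.data == b"abc"
```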
CI for these tests has been broken for a while. Since dask no longer has any HDFS-specific functionality, we're just relying on fsspec here for hdfs interaction. Since HDFS isn't commonly used, the maintenance burden here doesn't seem worth it.
[PEP 544](https://www.python.org/dev/peps/pep-0544/) introduces the `Protocol` class to the `typing` module in Python 3.8 (soon to be the minimum supported version, dask/community#213). Writing new Dask collections for [dask-awkward](https://github.com/ContinuumIO/dask-awkward/) has had me thinking about working on a `DaskCollection` protocol. I imagine the benefits to be: - usage with static type checkers - other activity in this area at - dask#8295 - dask#8706 - dask#8854 - Python-supporting IDEs take advantage of typing - self-documenting; some improvements to [the custom collections page](https://docs.dask.org/en/latest/custom-collections.html) of the docs. The protocol docs can be autogenerated and added to that page. - purely opt-in feature The `typing.runtime_checkable` decorator allows use of `isinstance(x, DaskCollection)` in any code base that uses Dask collections; for example: ```python >>> from dask.typing import DaskCollection >>> import dask.array as da >>> x = da.zeros((10, 3)) >>> isinstance(x, DaskCollection) True ``` (though this is an order of magnitude slower than `dask.base.is_dask_collection`, which only checks that `x.__dask_graph__() is not None`; static type checking & built-in interface documentation are the core benefits IMO) Something else that came up in the brief discussion on a call last week was having `{Scheduler,Worker,Nanny}Plugin` protocols in `distributed`; perhaps those are better places to start introducing protocols to Dask, since on the user side typically more folks would write plugins than new collections.
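A self-contained sketch of the mechanism being proposed, using only the standard library (the protocol name `DaskCollectionLike` and the single-method shape are illustrative, not the actual `dask.typing` definition):

```python
from typing import Protocol, runtime_checkable

@runtime_checkable
class DaskCollectionLike(Protocol):  # hypothetical, simplified protocol
    def __dask_graph__(self): ...

class FakeCollection:
    """Any class with the required method satisfies the protocol structurally."""
    def __dask_graph__(self):
        return {}

# runtime_checkable enables isinstance() checks based on method presence.
assert isinstance(FakeCollection(), DaskCollectionLike)
assert not isinstance(object(), DaskCollectionLike)
```

Note that `runtime_checkable` only verifies that the methods exist, not their signatures or return types; the full benefit comes from static type checkers.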
Deprecate `chunksize` and `aggregate_files` (dask#9052) As discussed in dask#9043 (for `chunksize`) and dask#9051 (for `aggregate_files`), I propose that we deprecate two complex and rarely-utilized arguments from `read_parquet`: `chunksize` and `aggregate_files`. This PR simply adds "pre-deprecation" warnings for the targeted arguments (including links to the relevant issues discussing their deprecation). My goal is to find (and inform) whatever users may be depending on these obscure options.
* Use `elif` for `decode` in `ensure_unicode` * Handle Python Buffer Protocol in `ensure_unicode` Any other arbitrary object (like `bytearray` or `memoryview` based objects) can be decoded to `unicode` via `codecs.decode`. This is analogous to what is done in `ensure_bytes`. So handle this case here. If this also fails, then raise as usual. * Include `ensure_unicode` tests for various objects * Clarify error messages * Use `uint8` in `array` tests This is more consistent with the other tests, which also use this type. Though `int8` also works. * Pass `bytes` directly to `array` Appears this already gets interpreted correctly by `array`. Should also make the code easier to read for other maintainers. * Use `from array import array` Avoids the `array.array` bit which is a tad verbose.
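The buffer-protocol handling described in the commit can be sketched as follows. `ensure_unicode_sketch` is a hypothetical simplification, not the actual dask helper; the key idea is that `codecs.decode` accepts any bytes-like object (`bytes`, `bytearray`, `memoryview`), analogous to what `ensure_bytes` does.

```python
import codecs

def ensure_unicode_sketch(obj):
    # Strings pass through unchanged.
    if isinstance(obj, str):
        return obj
    # Any bytes-like object (bytes, bytearray, memoryview, ...) can be
    # decoded via codecs.decode; if obj is not bytes-like, this raises.
    return codecs.decode(obj, "utf-8")

assert ensure_unicode_sketch("abc") == "abc"
assert ensure_unicode_sketch(b"abc") == "abc"
assert ensure_unicode_sketch(bytearray(b"abc")) == "abc"
assert ensure_unicode_sketch(memoryview(b"xyz")) == "xyz"
```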
@rjzamora I think we look good. Any comments or suggestions on moving forward?
Thanks for the nudge @erayaslan - I think I may try to simplify a few minor things related to dispatching and
Ping - how do you feel about merging this? @rjzamora @jrbourbeau
@jrbourbeau - It would be nice to have someone other than me take a pass at this. Note that I'd be willing to follow up with fixes if this PR ends up breaking anything.
pentschev
left a comment
Apologies for taking so long to review here again @erayaslan, I appreciate your patience!
I do not see any obvious errors. I've left a few comments/suggestions on minor things, but otherwise this looks good.
dask/array/random.py
```python
if isinstance(ar, np.ndarray):
    return np.ascontiguousarray(np.broadcast_to(ar, shape))
```
Will there be no CuPy arrays being evaluated here and in conditionals below https://github.com/dask/dask/pull/9038/files#diff-bacbead89e1b64dbdf54cbe8878e9a52445945d2c2710b2a11edc65364c6b16dR925 and https://github.com/dask/dask/pull/9038/files#diff-bacbead89e1b64dbdf54cbe8878e9a52445945d2c2710b2a11edc65364c6b16dR940 ? Do they need to be special-cased?
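For readers following the snippet under discussion: `broadcast_to` returns a read-only, stride-tricked view, and `ascontiguousarray` materializes it into a contiguous (and writable) copy. A small NumPy-only illustration (CuPy exposes the same two functions under `cupy.*`, which is why dispatching matters here):

```python
import numpy as np

ar = np.array([1, 2, 3])

# broadcast_to: zero-copy view, repeated along the new axis via 0 strides.
view = np.broadcast_to(ar, (2, 3))
assert not view.flags.writeable

# ascontiguousarray: materializes a real, C-contiguous (writable) copy.
out = np.ascontiguousarray(view)
assert out.shape == (2, 3)
assert out.flags["C_CONTIGUOUS"]
assert (out == np.array([[1, 2, 3], [1, 2, 3]])).all()
```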
Co-authored-by: Peter Andreas Entschev <peter@entschev.com>
As noted by Peter Andreas Entschev <peter@entschev.com>: Why do we need to compute() results here? assert_eq should take care of that, and computing here will potentially lose attributes such as `meta`.
So that we do not break if `p` is ever a CuPy array, for example.
Ping for review? @pentschev @jrbourbeau
pentschev
left a comment
Apologies, I thought I had already approved this. Looks good, I'm ok with this being merged as is. Thanks for all the hard work and patience @erayaslan !
Thanks for the work and patience here @erayaslan !
```python
with pytest.raises(DeprecationWarning):
    da.random.random_integers(10, size=5, chunks=3).compute()
```
What warning are you expecting here? I don't get any deprecation warning, and I don't know which one it should be since there's no `match`.
`random_integers` was deprecated in numpy 1.11.0, so you should be getting a DeprecationWarning with any later version. I am getting one with numpy 1.24.2.
https://numpy.org/doc/stable/reference/random/generated/numpy.random.random_integers.html
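The behavior under discussion can be reproduced without dask or pytest using a stub that warns the way NumPy's deprecated `random_integers` does (the stub itself is hypothetical; only the warning mechanics matter):

```python
import warnings

def random_integers_stub(*args, **kwargs):
    # Mimics NumPy's deprecated random_integers, which has emitted a
    # DeprecationWarning since numpy 1.11.
    warnings.warn(
        "This function is deprecated. Please call randint instead.",
        DeprecationWarning,
    )
    return 0

with warnings.catch_warnings(record=True) as rec:
    # An overly broad warning filter (e.g. a blanket -W ignore) would
    # suppress this and break the test, as discovered above.
    warnings.simplefilter("always")
    random_integers_stub(10)

assert any(issubclass(w.category, DeprecationWarning) for w in rec)
```

This also shows why a blanket ignore filter in the packaging configuration made the `pytest.raises(DeprecationWarning)` test fail: the warning was filtered out before pytest could observe it.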
Ah, I see, thank you. I've found that someone had added an overly broad -Wignore to the package, which broke this test. I have removed that and the downstream build is working. Sorry for the noise here.
`dask.array.default_rng` (dask#8790)
`pre-commit run --all-files`
Obsoletes PR #9005