
Generator based random-number generation in dask.array #9038

Merged
rjzamora merged 91 commits into dask:main from erayaslan:da-np-generator-v2 on Feb 17, 2023

Conversation

@erayaslan
Contributor

Obsoletes PR #9005

  • Tried to keep backward compatibility; the cost was some code duplication.
  • More importantly, followed the numpy interface as closely as possible. This should give a better user experience, especially down the line as Generator use becomes more common.
  • RandomState is still the default for now.
  • Documentation is rough; the numpy docs themselves are somewhat irregular for this feature.
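For context, a minimal sketch of the two NumPy interfaces involved — the dask.array API added here is assumed to mirror these names:

```python
import numpy as np

# New-style Generator API that this PR mirrors in dask.array
rng = np.random.default_rng(42)
a = rng.standard_normal(size=(4, 4))

# Legacy RandomState API, effectively frozen upstream but kept as dask's default
state = np.random.RandomState(42)
b = state.standard_normal(size=(4, 4))

assert a.shape == b.shape == (4, 4)
```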

erayaslan added 3 commits May 5, 2022 19:43
np.random.Generator is the replacement for np.random.RandomState, which is effectively frozen. Add a Generator class and methods. Keep RandomState as the default for now for backward compatibility.

Closes dask#8790
@github-actions bot added the `array` and `documentation` labels May 5, 2022
@erayaslan
Contributor Author

Any comments? @jsignell

Member

@jcrist left a comment

Thanks @erayaslan. I gave this a skim and left some high level comments. @jakirkham or @pentschev, could you give this a review?

One concern - as far as I can tell, numpy has not deprecated RandomState. If their intent is that both implementations will be around for a long time, it would be good to reduce the code duplication between the Generator and RandomState implementations. If this is tricky to do without complicating things, don't worry about it.

jcrist and others added 22 commits May 12, 2022 19:32
The macOS builds consistently take the longest to run and have a lower org-level concurrency limit than Windows or Linux. In `dask/dask` we have no macOS-specific logic, so testing every Python version on this platform doesn't seem necessary. To reduce CI delay, we change GitHub Actions to only test macOS on Python 3.9.
* Temporary band-aid to force the deletion of dsk.key_dependencies, which
could result in bad dependency information and thereby bad scheduling
when used in `compute_as_if_collection`

* Typo
* Change test to also look for *slightly* overlapping divisions.

* Also raise if the start of an appended partition is equal to the end of
the last partition, as these should be considered overlapping
partitions.

* Remove commented-out code.

* Remove kwarg which is new in pandas 1.4, the default behavior is fine.
Currently codecov pushes an initial failing status on every PR that is
later updated to passing once more CI builds finish. This is annoying.

As far as I can tell from https://docs.codecov.com/docs/merging-reports
this shouldn't be happening, but it is. Here we attempt to stop this
behavior by:

- Only running code coverage on the main test suite that hits most of
  the codebase.
- Setting a minimum number of builds before codecov should push a status
  update.

Hopefully this fixes the problem.
The (experimental) pandas `string[pyarrow]` dtype has some major
performance benefits that we'd like to experiment with in dask. However,
currently `pyarrow.StringArray` objects have a bug in their pickle
implementation where a small slice of the array still serializes the
full (potentially very large) backing buffers (see
https://issues.apache.org/jira/browse/ARROW-10739). Hopefully this is
fixed upstream in pyarrow at some point, but for now we patch around it
by overriding the pickling implementation for `ArrowStringArray` in
pandas. This implementation is efficient, resulting in zero-copy
serialization in most cases.

There is still more work to do to fully support the `string[pyarrow]`
dtype, but I think this PR can go in as is for now.
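The pickling workaround described above can be sketched in plain Python. The class and names here are illustrative, not pandas' actual patch; the point is overriding `__reduce__` so a small slice pickles only its own data rather than the large backing buffer:

```python
import pickle

# Illustrative sketch of the patch idea: override __reduce__ so a sliced
# view serializes only the bytes it actually needs, not the full buffer.
class SlicedBytes:
    def __init__(self, data: bytes, start: int, stop: int):
        self._data = data
        self._start = start
        self._stop = stop

    def __reduce__(self):
        # Copy out just the slice before pickling, dropping the large buffer
        trimmed = self._data[self._start:self._stop]
        return (SlicedBytes, (trimmed, 0, len(trimmed)))

big = SlicedBytes(b"x" * 1_000_000, 0, 10)
payload = pickle.dumps(big)
assert len(payload) < 1000  # only the 10-byte slice is serialized
```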
`prod`/`nanprod` overflow for data at this size. It's not clear why numpy only raises these warnings _sometimes_, but it makes sense why they're there. We now filter `RuntimeWarning` for `prod`/`nanprod` (and only these operations) to fix this.
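A minimal sketch of the pattern (not dask's actual test code): suppress `RuntimeWarning` only around the operation that may overflow, rather than globally:

```python
import warnings
import numpy as np

# Values near the int64 limit, so the product overflows
data = np.full(64, 2**62, dtype=np.int64)

with warnings.catch_warnings():
    # Scoped filter: only this block ignores RuntimeWarning
    warnings.simplefilter("ignore", RuntimeWarning)
    result = np.prod(data)  # overflow warning (if numpy emits one) is suppressed
```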
CI for these tests has been broken for a while. Since dask no longer has
any HDFS-specific functionality, we're just relying on fsspec here for
hdfs interaction. Since HDFS isn't commonly used, the maintenance burden
here doesn't seem worth it.
* Add codespell pre-commit

* Fix missing newline in setup.cfg

Co-authored-by: Jim Crist-Harif <jcristharif@gmail.com>
[PEP 544](https://www.python.org/dev/peps/pep-0544/) introduces the `Protocol` class to the `typing` module in Python 3.8 (soon to be the minimum supported version, dask/community#213). Writing new Dask collections for [dask-awkward](https://github.com/ContinuumIO/dask-awkward/) has had me thinking about working on a `DaskCollection` protocol. I imagine the benefits to be:

- usage with static type checkers
  - other activity in this area at
    - dask#8295 
    - dask#8706 
    - dask#8854
  - Python supporting IDEs take advantage of typing
- self-documenting; some improvements to [the custom collections page](https://docs.dask.org/en/latest/custom-collections.html) of the docs. The protocol docs can be autogenerated and added to that page.
- purely opt-in feature

The `typing.runtime_checkable` decorator allows use of `isinstance(x, DaskCollection)` in any code base
that uses Dask collections; for example:

```python
>>> from dask.typing import DaskCollection
>>> import dask.array as da
>>> x = da.zeros((10, 3))
>>> isinstance(x, DaskCollection)
True
```
(though this is an order of magnitude slower than `dask.base.is_dask_collection`, which only checks for `x.__dask_graph__() is not None`; static type checking & built-in interface documentation are the core benefits IMO)
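The mechanics of `typing.runtime_checkable` can be shown with a toy protocol. This is an illustrative sketch, not dask's actual `DaskCollection` definition — `isinstance` against a runtime-checkable protocol only verifies that the methods exist, which is why it is slower than a direct attribute check:

```python
from typing import Any, Protocol, runtime_checkable

@runtime_checkable
class DaskCollection(Protocol):
    """Toy protocol capturing part of the collection interface."""

    def __dask_graph__(self) -> Any: ...
    def __dask_keys__(self) -> list: ...

class MyCollection:
    # No inheritance needed: matching the methods is enough (structural typing)
    def __dask_graph__(self):
        return {}

    def __dask_keys__(self):
        return []

assert isinstance(MyCollection(), DaskCollection)
assert not isinstance(object(), DaskCollection)
```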

Something else that came up in the brief discussion on a call last week was having `{Scheduler,Worker,Nanny}Plugin` protocols in `distributed`; and perhaps those are better places to start introducing protocols to Dask since on the user side typically more folks would write plugins than new collections.
Currently both Dask and Distributed implement this function with very
slight variations. To attempt to consolidate these, pull in the
Distributed implementation content into the Dask implementation. Then
both Dask & Distributed can use this one function.
…` and ``aggregate_files`` (dask#9052)

As discussed in dask#9043 (for `chunksize`) and dask#9051 (for `aggregate_files`), I propose that we deprecate two complex and rarely-utilized arguments from `read_parquet`: `chunksize` and `aggregate_files`.

This PR simply adds "pre-deprecation" warnings for the targeted arguments (including links to the relevant Issues discussing their deprecation). My goal is to find (and inform) whatever users may be depending on these obscure options.
Co-authored-by: Ian Rose <ian.r.rose@gmail.com>
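The "pre-deprecation" pattern described above can be sketched as follows. The function body and warning text are hypothetical stand-ins, not dask's actual implementation:

```python
import warnings

# Hypothetical sketch: warn when a soon-to-be-deprecated keyword is passed
def read_parquet(path, chunksize=None, aggregate_files=None):
    if chunksize is not None:
        warnings.warn(
            "The `chunksize` argument is being considered for deprecation; "
            "see dask#9043 for discussion.",
            FutureWarning,
        )
    if aggregate_files is not None:
        warnings.warn(
            "The `aggregate_files` argument is being considered for "
            "deprecation; see dask#9051 for discussion.",
            FutureWarning,
        )
    return path  # placeholder for the actual reading logic

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    read_parquet("data.parquet", chunksize="64MB")
assert any(issubclass(w.category, FutureWarning) for w in caught)
```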
* Use `elif` for `decode` in `ensure_unicode`

* Handle Python Buffer Protocol in `ensure_unicode`

Any other arbitrary object (like `bytearray` or `memoryview` based
objects) can be decoded to `unicode` via `codecs.decode`. This is
analogous to what is done in `ensure_bytes`. So handle this case here.
If this also fails, then raise as usual.

* Include `ensure_unicode` tests for various objects

* Clarify error messages

* Use `uint8` in `array` tests

This is more consistent with the other tests, which also use this type.
Though `int8` also works.

* Pass `bytes` directly to `array`

Appears this already gets interpreted correctly by `array`. Should also
make the code easier to read for other maintainers.

* Use `from array import array`

Avoids the `array.array` bit which is a tad verbose.
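The commit messages above can be condensed into a short sketch. The `ensure_unicode` body here is illustrative (dask's real implementation differs in detail); the `codecs.decode` fallback and the bytes-to-`array` behavior are the points being shown:

```python
import codecs
from array import array

# Sketch: fall back to codecs.decode for arbitrary buffer-protocol objects,
# analogous to what ensure_bytes does
def ensure_unicode(s):
    if isinstance(s, str):
        return s
    if isinstance(s, (bytes, bytearray)):
        return s.decode()
    # memoryview and other buffer-protocol objects decode via codecs
    return codecs.decode(s)

assert ensure_unicode(memoryview(b"abc")) == "abc"

# `array` accepts bytes directly; "B" is unsigned 8-bit, akin to uint8
a = array("B", b"\x01\x02\x03")
assert list(a) == [1, 2, 3]
```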
@erayaslan
Contributor Author

@rjzamora I think we're in good shape. Any comments or suggestions on moving forward?

@rjzamora
Member

Thanks for the nudge @erayaslan - I think I may try to simplify a few minor things related to dispatching and _backend changes here today, and it would be ideal if @jrbourbeau had a chance to take a final look after that. Thank you for your work and patience on this (I really appreciate it)!

@erayaslan
Contributor Author

Ping — how do you feel about merging this? @rjzamora @jrbourbeau

@rjzamora
Member

@jrbourbeau - It would be nice to have someone other than me take a pass at this. Note that I'd be willing to follow up with fixes if this PR ends up breaking anything.

Member

@pentschev left a comment

Apologies for taking so long to review here again @erayaslan, I appreciate your patience!

I do not see any obvious errors; I've left a few comments/suggestions on minor things, but otherwise this looks good.

Comment on lines +915 to +916

```python
if isinstance(ar, np.ndarray):
    return np.ascontiguousarray(np.broadcast_to(ar, shape))
```
Member

Contributor Author

I don't think so, although @rjzamora can give a more definite answer. In the meantime, I added code to raise an error if we get an unexpected type: 8a83a0f

erayaslan and others added 10 commits January 21, 2023 11:40
Co-authored-by: Peter Andreas Entschev <peter@entschev.com>
Co-authored-by: Peter Andreas Entschev <peter@entschev.com>
Co-authored-by: Peter Andreas Entschev <peter@entschev.com>
As noted by Peter Andreas Entschev <peter@entschev.com>:
Why do we need to compute() results here? assert_eq should take care of
that, plus computing here will potentially lose attributes such as meta.
Co-authored-by: Peter Andreas Entschev <peter@entschev.com>
So that we do not break if `p` is ever a CuPy array, for example.
@erayaslan
Contributor Author

Ping for review? @pentschev @jrbourbeau

Member

@pentschev left a comment

Apologies, I thought I had already approved this. Looks good; I'm OK with this being merged as is. Thanks for all the hard work and patience @erayaslan!

@rjzamora changed the title from "Use np.random.Generator to generate random numbers - v2" to "Generator based random-number generation in dask.array" Feb 17, 2023
@rjzamora merged commit e2c7472 into dask:main Feb 17, 2023
@rjzamora
Member

Thanks for the work and patience here @erayaslan !

@erayaslan deleted the da-np-generator-v2 branch February 18, 2023 15:44
Comment on lines +202 to +203

```python
with pytest.raises(DeprecationWarning):
    da.random.random_integers(10, size=5, chunks=3).compute()
```
Contributor

What warning are you expecting here? I don't get any deprecation warning, and I don't know which one it should be since there's no `match`.

Contributor Author

@erayaslan Mar 27, 2023

`random_integers` was deprecated in numpy 1.11.0, so you should get a `DeprecationWarning` with any later version. I am getting one with numpy 1.24.2.

https://numpy.org/doc/stable/reference/random/generated/numpy.random.random_integers.html

Contributor

Ah, I see, thank you. I found that someone had added an overly broad `-Wignore` to the package, which broke this test. I have removed it and the downstream build is working. Sorry for the noise here.
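The failure mode described here can be illustrated without dask. A blanket "ignore" filter (the in-process equivalent of running Python with `-Wignore`) swallows the `DeprecationWarning` a test expects to observe:

```python
import warnings

# Stand-in for the deprecated call (random_integers in the real test)
def legacy():
    warnings.warn("legacy() is deprecated", DeprecationWarning)

with warnings.catch_warnings(record=True) as caught_ignored:
    warnings.simplefilter("ignore")  # blanket ignore, like -Wignore
    legacy()

with warnings.catch_warnings(record=True) as caught_always:
    warnings.simplefilter("always")
    legacy()

# Under the blanket filter the warning never surfaces, so a
# pytest.raises(DeprecationWarning) / pytest.warns check would fail
assert caught_ignored == []
assert any(issubclass(w.category, DeprecationWarning) for w in caught_always)
```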


Labels

array · dispatch (Related to `Dispatch` extension objects) · documentation (Improve or add to documentation)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Add dask.array.default_rng