Map partitions info#6755

Closed
kumarprabhu1988 wants to merge 183 commits into dask:master from kumarprabhu1988:map_partitions-info

Conversation

@kumarprabhu1988
Contributor

This PR adds functionality to pass partition_info to the mapped function in a call to map_partitions. It is useful for distributed implementations of many algorithms that need to know which partition they are operating on.

When I ran black, it reformatted some of the comments. Should I be using a specific version of black?

  • Tests added / passed
  • Passes black dask / flake8 dask

Fixes #3707
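As a rough, pure-Python sketch of the intended semantics: functions passed to map_partitions can opt in to receiving a partition_info dict describing the partition they were given. The `map_partitions` helper below, the `number` key, and the exact dict shape are illustrative assumptions, not dask's actual implementation.

```python
# Hypothetical sketch of partition_info semantics, NOT dask itself.
import inspect

def map_partitions(func, partitions, divisions):
    """Apply func to each partition, passing a partition_info dict
    when func accepts that keyword (stand-in for dask's API)."""
    wants_info = "partition_info" in inspect.signature(func).parameters
    results = []
    for number, (part, division) in enumerate(zip(partitions, divisions)):
        if wants_info:
            info = {"number": number, "division": division}
            results.append(func(part, partition_info=info))
        else:
            results.append(func(part))
    return results

# Example: a function that needs to know which partition it received
def tag(part, partition_info=None):
    return [(partition_info["number"], x) for x in part]

out = map_partitions(tag, [[1, 2], [3]], divisions=[0, 2])
# out == [[(0, 1), (0, 2)], [(1, 3)]]
```

Functions that do not declare the keyword are called as before, so the feature is opt-in and backwards compatible.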

@jsignell
Member

When I ran black, it reformatted some of the comments. Should I be using a specific version of black?

It's recommended that you use the pre-commit hook: pip install pre-commit && pre-commit install. That way you are guaranteed to use the same black settings that run on CI.

Member

@jsignell jsignell left a comment


It looks like the style changes are still in and there is one print that seems unintentional.

assert dsk[("x", d.divisions.index(partition_info["division"]))].equals(df)
return df

print("in test")
Copy link
Member


I think this snuck in accidentally :)

Contributor Author


Uh-oh, fixed. The style changes remained even after installing pre-commit. I removed them manually, but the pre-commit hook runs black (the version I was using before) and reverted all the removals. Not sure if anyone else sees this problem. Can you share your black version? I can use the same one and see if it works.

Member


Actually I think the black changes will go away if you rebase off latest master. Are you comfortable trying that?

Contributor Author


Sorry for the delay. Rebasing wasn't enough because pre-commit would mess it up again. Anyway, I uninstalled pre-commit, fixed the changes, and rebased.

Contributor Author

@kumarprabhu1988 kumarprabhu1988 Oct 29, 2020


Oops, my rebase had unintended consequences. I rebased with master but wasn't sure if a force push to my branch was allowed, so I tried to rebase again and messed up. I'll create a new branch with just my changes.

for k, v in dsk.items():
vv = v
v = v[0]
[(key, task)] = v.dsk.items() # unpack subgraph callable
Member


Are key and task ever used? It seems they're reassigned a few lines down.

Contributor Author


Good point, removed them.

jrbourbeau and others added 19 commits October 29, 2020 01:56
While the arrays are _technically_ empty and contain no values,
the rechunking code does update the chunk size metadata.
* Use `concatenate_lookup` in `concatenate`

As we sometimes need to have custom ways to dispatch `concatenate` over
different array types like SciPy's or CuPy's sparse matrices, make sure
we lookup the appropriate `concatenate` implementation and use that.
…dask#6282)

* handle auto-index detection for partitioned datasets in pyarrow
* handle null-named rangeindex in fastparquet
* Dispatch `iloc` calls to `getitem`
* Include `pickle5` for testing on Python 3.7

Make sure we have at least one CI matrix case where `pickle5` is tested.

* Use cloudpickle to load objects it pickled

As `pickle5` support requires using `cloudpickle` to load objects it
pickled, use it instead of regular `pickle`.
…#6382)

* Call custom optimizations once, with kwargs provided.

* Actually retain collection-specific optimizations when custom optimizations are specified.
* Fix docstrings to reflect filename can contain extension
* Remove Sphinx link in favor of plain text
…pment (dask#6399)

* DOC: add env to code install

* remove conda hyperlink

* add build after setup env

* symlink latest

* add -latest.yaml

Co-authored-by: Ray Bell <rayjognbell0@gmail.com>
The See Also delimiter is :, not -, and numpydoc chokes on double backticks
as well.
JimCircadian and others added 27 commits October 29, 2020 01:56
* Sphinx configuration missing doctest extension for Makefile
* Fixed the majority of doctest failures for warnings originating in the dask library documentation itself.

There are quite a few originating from the dask/distributed integration and the NumPy FFT integration.
I'm going to review and debug these with respect to the version of distributed coming from the requirements.

* Fixing up, or ignoring, remaining doctest errors from imported methods or newer changes
* Fix svd_flip type casting that fails with CuPy arrays

* Fix svd meta for single-chunk case

* Add single-chunk SVD test with CuPy
* Update overlap functions to use *_like with meta support

* Update CuPy tests

* Removed no longer used wrap import in overlap
Small fix for a missing line that was causing weird rendering in the docs
The initial chunking may be unbalanced, and specifying the `balance`
argument indicates a desire to always balance.
…ask#6505)

* Adjust parquet ArrowEngine to allow more easy subclass for the writing part

* add keyword names

* blacken
When a partition had unobserved categories, the result MultiIndex would
have the right dtype but not the right shape. This caused the `concat`
to later cast to object dtype, causing the test failure.

The fix is to not drop the all-NA columns.

Closes dask#6729
* Fix meta for min/max reductions

* Add more CuPy reduction tests

* Fix compute_meta ValueError exception handling
…dask#6764)

* Hint how to do boolean indexing

Dask does boolean indexing differently compared to NumPy; add a hint in the error message explaining how it's done.
Make the if-condition a little easier to read.

* Undo condition changes

* Move the error to setitem instead.
…ask#6675)

* Begin experimenting with parallel prefix scan for cumsum and cumprod in dask.array

This is a WIP and needs to be benchmarked.  I think it's interesting, though, and want to share.
It's been a while since I've worked on dask.array, so feedback is most welcome.

This is a work-efficient parallel prefix scan.  It uses a Brent-Kung construction and
is known as the Blelloch algorithm.  We adapt it to work on chunks.

Previously, to do a cumsum across N chunks would require N levels of dependencies.
This PR takes approximately 2 * lg(N) levels of dependencies.  It exposes parallelism.
It is work-efficient and only requires a third more tasks than the previous method.
Scans on floating point values should also be more accurate.

A parallel cumsum works by first taking the sum of each block, then doing a binary tree
merge followed by a fan-out (i.e., the Brent-Kung pattern).  We then take the cumsum
of each block and add the sum of the preceding blocks.

NumPy calculates cumsum and cumprod very fast, but it calculates sum and prod
significantly faster.  This is why I think this approach will be faster.
Exposing parallelism and an efficient communication pattern is another reason I think
this should be faster (especially when communication costs are significant).

I also think this will be an interesting test for `dask.order` and the scheduler.

Q: Should we allow users to choose which method to use (i.e., prev or new in this PR)?
Does the answer to this depend on benchmarks?

Benchmarks and graph diagrams are forthcoming :)

* Choose cumsum/cumprod with `method=` keyword argument.

Current choices are "sequential", "blelloch", and "blelloch-split".
Default is "sequential".  I need to document these.

* black

* Add docstrings for "blelloch" method for cumsum/cumprod
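The block-level data flow described in that commit message can be sketched serially. The `blocked_cumsum` helper below is a hypothetical illustration, not dask's implementation: the real "blelloch" method performs the middle scan over block sums as a parallel tree merge plus fan-out, whereas this sketch computes it sequentially just to show where each value comes from.

```python
# Sketch of a chunked prefix scan: per-block sums, an exclusive scan of
# those sums (done in O(log N) depth by the tree merge + fan-out in the
# real algorithm), then a cumsum of each block shifted by its offset.
from itertools import accumulate

def blocked_cumsum(blocks):
    block_sums = [sum(b) for b in blocks]
    # Exclusive scan of block sums: offsets[i] = sum of all earlier blocks.
    offsets = [0] + list(accumulate(block_sums))[:-1]
    return [[x + off for x in accumulate(b)]
            for b, off in zip(blocks, offsets)]

result = blocked_cumsum([[1, 2], [3, 4], [5]])
# result == [[1, 3], [6, 10], [15]], i.e. the cumsum of [1, 2, 3, 4, 5]
```

Note that each block's final pass depends only on its own data and one scalar offset, which is what lets the per-block work proceed in parallel once the offsets are known.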
* Document the `meta` kwarg in `map_blocks` and `map_overlap`.

* Small fixes to ``meta`` text

* Document using `dtype` with `meta` in `map_blocks`

* Skip CuPy doctests
…rrow) (dask#6741)

* [bugfix/to-parquet-write-empty-metadata] Filter out null entries in pyarrow parquet metadata writes, causes AttributeError/Segfault

* Explicit failure for exception test

* [bugfix/to-parquet-write-empty-metadata] black

* Remove unnecessary imports
UTs homogeneous typing for all builds

* Placate 3.9 pre-commit

* Remove unnecessary scheduler specs

Co-authored-by: Callum Noble <C.Noble@mwam.com>
@kumarprabhu1988
Contributor Author

Created a new branch with just the changes. Here's the PR: #6776
Closing this one.



Development

Successfully merging this pull request may close these issues.

Add magic partition_info keyword to dd.map_partitions function