Map partitions info#6755

Closed
kumarprabhu1988 wants to merge 183 commits into dask:master from kumarprabhu1988:map_partitions-info

Conversation

@kumarprabhu1988
Contributor

This PR adds functionality to pass partition_info to the mapped function in a call to map_partitions. It is useful for distributed implementations of many algorithms that need to know which partition they are operating on.

When I ran black, it reformatted some of the comments. Should I be using a specific version of black?

  • Tests added / passed
  • Passes black dask / flake8 dask

Fixes #3707
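As a rough, pure-Python sketch of the intended semantics: functions passed to map_partitions can opt in to receiving a partition_info dict describing the partition they were given. The `map_partitions` helper below, the `number` key, and the exact dict shape are illustrative assumptions, not dask's actual implementation.

```python
# Hypothetical sketch of partition_info semantics, NOT dask itself.
import inspect

def map_partitions(func, partitions, divisions):
    """Apply func to each partition, passing a partition_info dict
    when func accepts that keyword (stand-in for dask's API)."""
    wants_info = "partition_info" in inspect.signature(func).parameters
    results = []
    for number, (part, division) in enumerate(zip(partitions, divisions)):
        if wants_info:
            info = {"number": number, "division": division}
            results.append(func(part, partition_info=info))
        else:
            results.append(func(part))
    return results

# Example: a function that needs to know which partition it received
def tag(part, partition_info=None):
    return [(partition_info["number"], x) for x in part]

out = map_partitions(tag, [[1, 2], [3]], divisions=[0, 2])
# out == [[(0, 1), (0, 2)], [(1, 3)]]
```

Functions that do not declare the keyword are called as before, so the feature is opt-in and backwards compatible.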

@jsignell
Member

When I ran black, it reformatted some of the comments. Should I be using a specific version of black?

It's recommended that you use the pre-commit hook: pip install pre-commit && pre-commit install. That way you are guaranteed to use the same black settings that run on CI.

Member

@jsignell jsignell left a comment


It looks like the style changes are still in and there is one print that seems unintentional.

assert dsk[("x", d.divisions.index(partition_info["division"]))].equals(df)
return df

print("in test")
Copy link
Member


I think this snuck in accidentally :)

Contributor Author


Uh-oh, fixed. The style changes remained even after installing pre-commit. I removed them manually, but the pre-commit hook runs black (the version I was using before) and reverted all the removals. Not sure if anyone else sees this problem. Can you share your black version? I can use the same one and see if it works.

Member


Actually I think the black changes will go away if you rebase off latest master. Are you comfortable trying that?

Contributor Author


Sorry for the delay. Rebasing wasn't enough because pre-commit would mess it up again. Anyway, I uninstalled pre-commit, fixed the changes, and rebased.

Contributor Author

@kumarprabhu1988 kumarprabhu1988 Oct 29, 2020


Oops, my rebase had unintended consequences. I rebased with master but wasn't sure if a force push to my branch was allowed, so I tried to rebase again and messed up. I'll create a new branch with just my changes.

for k, v in dsk.items():
vv = v
v = v[0]
[(key, task)] = v.dsk.items() # unpack subgraph callable
Member


Are key and task ever used? It seems they're reassigned a few lines down.

Contributor Author


Good point, removed them.

jrbourbeau and others added 19 commits October 29, 2020 01:56
While the arrays are _technically_ empty and contain no values,
the rechunking code does update the chunk size metadata.
* Use `concatenate_lookup` in `concatenate`

As we sometimes need to have custom ways to dispatch `concatenate` over
different array types like SciPy's or CuPy's sparse matrices, make sure
we lookup the appropriate `concatenate` implementation and use that.
…dask#6282)

* handle auto-index detection for partitioned datasets in pyarrow
* handle null-named rangeindex in fastparquet
* Dispatch `iloc` calls to `getitem`
* Include `pickle5` for testing on Python 3.7

Make sure we have at least one CI matrix case where `pickle5` is tested.

* Use cloudpickle to load objects it pickled

As `pickle5` support requires using `cloudpickle` to load objects it
pickled, use it instead of regular `pickle`.
…#6382)

* Call custom optimizations once, with kwargs provided.

* Actually retain collection-specific optimizations when custom optimizations are specified.
* Fix docstrings to reflect filename can contain extension
* Remove Sphinx link in favor of plain text
…pment (dask#6399)

* DOC: add env to code install

* remove conda hyperlink

* add build after setup env

* symlink latest

* add -latest.yaml

Co-authored-by: Ray Bell <rayjognbell0@gmail.com>
The See Also delimiter is :, not -, and numpydoc chokes on double backticks
as well.
JimCircadian and others added 27 commits October 29, 2020 01:56
* Sphinx configuration missing doctest extension for Makefile
* Fixed the majority of doctest failures for warnings originating in the dask library documentation itself.

There are quite a few originating from the dask/distributed integration and the NumPy FFT integration.
I'm going to review and debug these with respect to the version of distributed coming from the requirements.

* Fixing up, or ignoring, remaining doctest errors from imported methods or newer changes
* Fix svd_flip type casting that fails with CuPy arrays

* Fix svd meta for single-chunk case

* Add single-chunk SVD test with CuPy
* Update overlap functions to use *_like with meta support

* Update CuPy tests

* Removed no longer used wrap import in overlap
Small fix for a missing line that was causing weird rendering in the docs
The initial chunking may be unbalanced, and specifying the `balance`
argument indicates a desire to always balance.
…ask#6505)

* Adjust parquet ArrowEngine to allow more easy subclass for the writing part

* add keyword names

* blacken
When a partition had unobserved categories, the result MultiIndex would
have the right dtype but not the right shape. This caused the `concat`
to later cast to object dtype, causing the test failure.

The fix is to not drop the all-NA columns.

Closes dask#6729
* Fix meta for min/max reductions

* Add more CuPy reduction tests

* Fix compute_meta ValueError exception handling
…dask#6764)

* Hint how to do boolean indexing

Dask does boolean indexing differently compared to NumPy; add a hint in the error message explaining how it's done.
Make the if-condition a little easier to read.

* Undo condition changes

* Move the error to setitem instead.
…ask#6675)

* Begin experimenting with parallel prefix scan for cumsum and cumprod in dask.array

This is a WIP and needs to be benchmarked.  I think it's interesting, though, and want to share.
It's been a while since I've worked on dask.array, so feedback is most welcome.

This is a work-efficient parallel prefix scan.  It uses a Brent-Kung construction and
is known as the Blelloch algorithm.  We adapt it to work on chunks.

Previously, to do a cumsum across N chunks would require N levels of dependencies.
This PR takes approximately 2 * lg(N) levels of dependencies.  It exposes parallelism.
It is work-efficient and only requires a third more tasks than the previous method.
Scans on floating point values should also be more accurate.

A parallel cumsum works by first taking the sum of each block, then doing a binary tree
merge followed by a fan-out (i.e., the Brent-Kung pattern).  We then take the cumsum
of each block and add the sum of the preceding blocks.

NumPy calculates cumsum and cumprod very fast, but it calculates sum and prod
significantly faster.  This is why I think this approach will be faster.
Exposing parallelism and an efficient communication pattern is another reason I think
this should be faster (especially when communication costs are significant).

I also think this will be an interesting test for `dask.order` and the scheduler.

Q: Should we allow users to choose which method to use (i.e., prev or new in this PR)?
Does the answer to this depend on benchmarks?

Benchmarks and graph diagrams are forthcoming :)

* Choose cumsum/cumprod with `method=` keyword argument.

Current choices are "sequential", "blelloch", and "blelloch-split".
Default is "sequential".  I need to document these.

* black

* Add docstrings for "blelloch" method for cumsum/cumprod
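The block-level data flow described in that commit message can be sketched serially. The `blocked_cumsum` helper below is a hypothetical illustration, not dask's implementation: the real "blelloch" method performs the middle scan over block sums as a parallel tree merge plus fan-out, whereas this sketch computes it sequentially just to show where each value comes from.

```python
# Sketch of a chunked prefix scan: per-block sums, an exclusive scan of
# those sums (done in O(log N) depth by the tree merge + fan-out in the
# real algorithm), then a cumsum of each block shifted by its offset.
from itertools import accumulate

def blocked_cumsum(blocks):
    block_sums = [sum(b) for b in blocks]
    # Exclusive scan of block sums: offsets[i] = sum of all earlier blocks.
    offsets = [0] + list(accumulate(block_sums))[:-1]
    return [[x + off for x in accumulate(b)]
            for b, off in zip(blocks, offsets)]

result = blocked_cumsum([[1, 2], [3, 4], [5]])
# result == [[1, 3], [6, 10], [15]], i.e. the cumsum of [1, 2, 3, 4, 5]
```

Note that each block's final pass depends only on its own data and one scalar offset, which is what lets the per-block work proceed in parallel once the offsets are known.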
* Document the `meta` kwarg in `map_blocks` and `map_overlap`.

* Small fixes to ``meta`` text

* Document using `dtype` with `meta` in `map_blocks`

* Skip CuPy doctests
…rrow) (dask#6741)

* [bugfix/to-parquet-write-empty-metadata] Filter out null entries in pyarrow parquet metadata writes, causes AttributeError/Segfault

* Explicit failure for exception test

* [bugfix/to-parquet-write-empty-metadata] black

* Remove unnecessary imports
UTs homogeneous typing for all builds

* Placate 3.9 pre-commit

* Remove unnecessary scheduler specs

Co-authored-by: Callum Noble <C.Noble@mwam.com>
@kumarprabhu1988
Contributor Author

Created a new branch with just the changes. Here's the PR: #6776
Closing this one.



Development

Successfully merging this pull request may close these issues.

Add magic partition_info keyword to dd.map_partitions function