Conversation
It's recommended that you use the pre-commit hook
jsignell left a comment
It looks like the style changes are still in and there is one print that seems unintentional.
    assert dsk[("x", d.divisions.index(partition_info["division"]))].equals(df)
    return df

    print("in test")
I think this snuck in accidentally :)
Uh-oh, fixed. The style changes remained even after installing pre-commit. I removed them manually, but the pre-commit hook runs black (the version I was using before) and reverted all the removals. Not sure if anyone else sees this problem. Can you share your black version? I can use the same one and see if it works.
Actually I think the black changes will go away if you rebase off latest master. Are you comfortable trying that?
Sorry for the delay. Rebasing wasn't enough because pre-commit would mess it up again. Anyway, I uninstalled pre-commit, fixed it and rebased it.
Oops, my rebase had unintended consequences. I rebased onto master but wasn't sure if a force push to my branch was allowed, so I tried to rebase again and messed up. I'll create a new branch with just my changes.
dask/dataframe/core.py
Outdated
    for k, v in dsk.items():
        vv = v
        v = v[0]
        [(key, task)] = v.dsk.items()  # unpack subgraph callable
Are `key` and `task` ever used? It seems they're reassigned a few lines down.
Good point, removed them.
While the arrays are _technically_ empty and contain no values, the rechunking code does update the chunk size metadata.
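The point about chunk metadata can be illustrated without dask. Below is a pure-Python sketch; the helper `rechunk_dim` is hypothetical, not dask's API, and only shows how rechunking one dimension recomputes per-chunk sizes, including the empty case:

```python
def rechunk_dim(dim_size, chunk_size):
    # Hypothetical helper (not dask's API): compute per-chunk sizes
    # for one dimension after rechunking to `chunk_size`.
    if dim_size == 0:
        # No values to move, but the chunk metadata still changes:
        # the dimension is represented as a single zero-length chunk.
        return (0,)
    full, rem = divmod(dim_size, chunk_size)
    chunks = (chunk_size,) * full
    return chunks + ((rem,) if rem else ())
```

Even for `dim_size == 0` the function returns a well-defined chunks tuple, which is the metadata-only update the commit describes.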
* ignore_index bugfix
* Use `concatenate_lookup` in `concatenate`. As we sometimes need custom ways to dispatch `concatenate` over different array types, like SciPy's or CuPy's sparse matrices, make sure we look up the appropriate `concatenate` implementation and use it.
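For readers unfamiliar with the dispatch idea, here is a minimal pure-Python sketch of type-based lookup in the spirit of dask's `concatenate_lookup`; the `Dispatch` class below is illustrative, not dask's actual implementation:

```python
class Dispatch:
    # Illustrative type-based dispatch (not dask's implementation):
    # map a type to the concatenate function that handles it.
    def __init__(self):
        self._lookup = {}

    def register(self, typ, func):
        self._lookup[typ] = func

    def __call__(self, seqs, **kwargs):
        # Dispatch on the type of the first sequence, walking the MRO
        # so subclasses fall back to their base class's handler.
        typ = type(seqs[0])
        for cls in typ.__mro__:
            if cls in self._lookup:
                return self._lookup[cls](seqs, **kwargs)
        raise TypeError(f"No concatenate implementation for {typ}")


concatenate_lookup = Dispatch()
concatenate_lookup.register(list, lambda seqs: [x for s in seqs for x in s])
```

Registering one handler per array type keeps `concatenate` itself agnostic about which library produced the inputs.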
…dask#6282) * handle auto-index detection for partitioned datasets in pyarrow
* handle null-named rangeindex in fastparquet
* Include `pickle5` for testing on Python 3.7. Make sure we have at least one CI matrix case where `pickle5` is tested. * Use cloudpickle to load objects it pickled. As `pickle5` support requires using `cloudpickle` to load objects it pickled, use it instead of regular `pickle`.
…#6382) * Call custom optimizations once, with kwargs provided. * Actually retain collection-specific optimizations when custom optimizations are specified.
…pment (dask#6399) * DOC: add env to code install * remove conda hyperlink * add build after setup env * symlink latest * add -latest.yaml Co-authored-by: Ray Bell <rayjognbell0@gmail.com>
The See Also delimiter is `:`, not `-`, and numpydoc chokes on double backticks as well.
The errant backslashes were fixed upstream and released. https://numpydoc.readthedocs.io/en/latest/release_notes.html#fixed-bugs numpy/numpydoc#218
* Sphinx configuration missing doctest extension for Makefile * Fixed the majority of doctest failures for warnings originating in the dask library documentation itself. There are quite a few originating from the dask/distributed integration and the NumPy FFT integration. I'm going to review and debug these with respect to the version of distributed coming from the requirements. * Fixing up, or ignoring, remaining doctest errors from imported methods or newer changes
* Update overlap functions to use *_like with meta support * Update CuPy tests * Removed no longer used wrap import in overlap
Small fix for a missing line that was causing weird rendering in the docs
The initial chunking may be unbalanced, and specifying the `balance` argument indicates a desire to always balance.
…ask#6505) * Adjust parquet ArrowEngine to allow easier subclassing for the writing part * add keyword names * blacken
When a partition had unobserved categories, the result MultiIndex would have the right dtype but not the right shape. This caused the `concat` to later cast to object dtype, causing the test failure. The fix is to not drop the all-NA columns. Closes dask#6729
* Fix meta for min/max reductions * Add more CuPy reduction tests * Fix compute_meta ValueError exception handling
…dask#6764) * Hint how to do boolean indexing. Dask does boolean indexing differently from NumPy; add a hint in the error about how it's done. Make the if-condition a little easier to read. * Undo condition changes * Move error to setitem instead.
…ask#6675)
* Begin experimenting with parallel prefix scan for cumsum and cumprod in dask.array. This is a WIP and needs to be benchmarked. I think it's interesting, though, and want to share. It's been a while since I've worked on dask.array, so feedback is most welcome. This is a work-efficient parallel prefix scan. It uses a Brent-Kung construction and is known as the Blelloch algorithm. We adapt it to work on chunks. Previously, a cumsum across N chunks required N levels of dependencies. This PR takes approximately 2 * lg(N) levels of dependencies and exposes parallelism. It is work-efficient and only requires a third more tasks than the previous method. Scans on floating point values should also be more accurate. A parallel cumsum works by first taking the sum of each block, then doing a binary tree merge followed by a fan-out (i.e., the Brent-Kung pattern). We then take the cumsum of each block and add the sum of the previous blocks. NumPy calculates cumsum and cumprod very fast, but it calculates sum and prod significantly faster. This is why I think this approach will be faster. Exposing parallelism and an efficient communication pattern is another reason I think this should be faster (especially when communication costs are significant). I also think this will be an interesting test for `dask.order` and the scheduler. Q: Should we allow users to choose which method to use (i.e., prev or new in this PR)? Does the answer to this depend on benchmarks? Benchmarks and graph diagrams are forthcoming :)
* Choose cumsum/cumprod with `method=` keyword argument. Current choices are "sequential", "blelloch", and "blelloch-split". Default is "sequential". I need to document these.
* black
* Add docstrings for "blelloch" method for cumsum/cumprod
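The blocked structure described above can be sketched in plain Python. This is an illustrative, sequential simulation of the three phases (per-block sums, an exclusive scan of those sums, then independent per-block cumsums plus offsets), not dask's task-graph implementation:

```python
from itertools import accumulate


def blocked_cumsum(blocks):
    # Phase 1: total of each block (cheap: sum is faster than cumsum).
    totals = [sum(b) for b in blocks]
    # Phase 2: exclusive scan of the block totals; in dask this is the
    # binary-tree merge plus fan-out, here done sequentially.
    offsets = [0] + list(accumulate(totals))[:-1]
    # Phase 3: local cumsum within each block plus that block's offset.
    # These are independent per block, which is where the parallelism is.
    return [[x + off for x in accumulate(b)]
            for b, off in zip(blocks, offsets)]
```

Flattening the result gives the same answer as a single sequential cumsum, but phases 1 and 3 can run on all blocks at once.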
* Document `meta` kwarg in `map_blocks` and `map_overlap`. * Small fixes to `meta` text * Document using `dtype` with `meta` in `map_blocks` * Skip CuPy doctests
…rrow) (dask#6741) * [bugfix/to-parquet-write-empty-metadata] Filter out null entries in pyarrow parquet metadata writes, causes AttributeError/Segfault * Explicit failure for exception test * [bugfix/to-parquet-write-empty-metadata] black * Remove unnecessary imports UTs homogenous typing for all builds * Placate 3.9 pre-commit * Remove unnecessary scheduler specs Co-authored-by: Callum Noble <C.Noble@mwam.com>
Created a new branch with just the changes. Here's the PR: #6776
This PR adds functionality to return `partition_info` in a call to `map_partitions`. It is useful for distributed implementations of many algorithms.

When I ran black, it reformatted some of the comments. Should I be using a specific version of black?
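As a toy illustration of the feature described above, here is a pure-Python sketch of handing per-partition metadata to the mapped function. The function name and the keys of the `partition_info` dict mirror the idea, but this is not dask code:

```python
def map_partitions_toy(func, partitions, divisions):
    # Illustrative sketch (not dask): call `func` on each partition,
    # passing a partition_info dict with the partition's number and
    # the division (start boundary) it covers.
    results = []
    for i, part in enumerate(partitions):
        info = {"number": i, "division": divisions[i]}
        results.append(func(part, partition_info=info))
    return results
```

A mapped function can then use `partition_info["number"]`, for example, to tag each element with the partition it came from.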
black dask
flake8 dask

Fixes #3707