Skip to content

Handle Blockwise HLG pack/unpack for concatenate=True#7455

Merged
jrbourbeau merged 19 commits intodask:mainfrom
rjzamora:blockwise-concatenate
Apr 23, 2021
Merged

Handle Blockwise HLG pack/unpack for concatenate=True#7455
jrbourbeau merged 19 commits intodask:mainfrom
rjzamora:blockwise-concatenate

Conversation

@rjzamora
Copy link
Member

@rjzamora rjzamora commented Mar 23, 2021

One possible solution to the Blockwise serialization challenge discussed in this comment. Note that I would certainly prefer a general solution to nested-task serialization.

@jrbourbeau
Copy link
Member

Just pushed a commit to temporarily point CI to dask/distributed#4641 in order to illustrate that tests pass here when we include the distributed changes -- hope that's okay @rjzamora

@rjzamora
Copy link
Member Author

@jrbourbeau - What is the best way to deal with the distributed min-dep changing to the next release (2021.04.1)? For now, I just changed the min-dep environment to install distributed from source.

@rjzamora rjzamora marked this pull request as ready for review April 19, 2021 17:57
rjzamora added a commit to rjzamora/dask that referenced this pull request Apr 19, 2021
@rjzamora
Copy link
Member Author

@jrbourbeau - I am okay with merging #7415 very soon after the next release, but it would be nice to get this fix in as soon as possible. Do you expect there to be problems with this plan?

@jrbourbeau
Copy link
Member

That sounds good to me. The issue here is we're getting consistent failures in the mindeps-distributed build due to calling pytest.importorskip in gen_cluster (we're running into this in other places like #7524 too).

I tried to determine what the underlying issue is yesterday but wasn't able to figure it out yet. I did observe that the issue is only present on Python 3.7 so I tried to temporarily bump the Python version to used in the mindeps-distributed build to 3.8 over in this branch but ran into some conda solve conflicts. That's where things currently stand. I'll try to pick this up again today (or you should feel free to if you have the bandwidth). @jsignell also suggested we might try moving tests into separate modules and skipping entire modules instead of individual tests. I suspect this would also work as a quickfix, but long-term I'd prefer to keep the current file structure.

@jrbourbeau
Copy link
Member

Alright, I ended up changing the location of the test added here so it's no longer the last test in this module. That should hopefully result in CI passing 🤞It's unfortunate that test ordering currently matters, but the actual change here is small and benign

@rjzamora
Copy link
Member Author

rjzamora commented Apr 22, 2021

Interesting - Would it help for us to add a module-check decorator (so that we skip outside the gen_cluster)?

For example: would something like this commit help?

@jrbourbeau
Copy link
Member

Hmm good idea, something like that would also work. Though from experimenting locally it looks like find_spec returns a non-None output for dask.array even when I don't have numpy installed. I'm fine trying to include a module check decorator or just keeping things as they currently are in this PR

@jrbourbeau
Copy link
Member

Just pushed a commit to resolve a merge conflict. Planning to merge this PR after CI finishes

@jrbourbeau jrbourbeau mentioned this pull request Apr 23, 2021
3 tasks
@jrbourbeau jrbourbeau merged commit 89f7785 into dask:main Apr 23, 2021
@rjzamora
Copy link
Member Author

Thanks @jrbourbeau !

jakirkham pushed a commit that referenced this pull request Apr 30, 2021
* Add test illustrating import issue

* move layer materialization for shuffle

* add missing layers.py file and revise org a bit

* move layers.py

* remove xfail

* use import rather than pickle for functions

* fix importlib error and use dict

* roll back test changes from 7374 (let failures be resolved in that PR)

* remove debug print

* move Shuffle layers to layers.py completely

* moving moving BroadcastJoinLayer to layers.py

* tweak CallableLazyImport

* move BlockwiseCreateArray into layers.py

* import test coverage

* incorperate testing idea from 7374

* comment tweaks

* introduce new test_layers.py module

* update comment in test

* Update dask/layers.py

Co-authored-by: James Bourbeau <jrbourbeau@users.noreply.github.com>

* Only use CallableLazyImport when the graph is materialized within __dask_distributed_unpack__

* remove obsolete annotation handling

* Update dask/layers.py

Co-authored-by: James Bourbeau <jrbourbeau@users.noreply.github.com>

* _construct_graph fix

* migrate csv code

* migrate orc changes

* basic parquet migration - still need to handle serialization

* add require_pickle option to DataFrameIOLayer

* update testing

* change 'column culling' language to 'column projection'

* use ConcatAxesWrapper to avoid inline task

* add test coverage

* use serialize instead of pickle

* use simpler SerializedFunction wrapper

* update to rely on distributed#4575

* serialization experiments

* align with current status of distributed-4641

* update/fix comment

* align with latest distributed#4641 state

* re-align with James' suggestion

* Temporarily point to Distributed PR #4641

* point CI back to distributed main branch

* avoid using distributed 2021.03.0 for mindeps test

* start aligning with #7455

* begin requiring io_deps elements to inherit from base class

* more simplification

* fix indices init

* Add .git suffix

* Add explicit pip

* serialization cleanup

* address dep-name issue

* strip out unnecessary import logic

* use output_blocks in pack

* more code review suggestions

* use clearer tmp variable name

* avoid io_deps copy

* dictionary comp for BlockwiseDepDict pack

* start working on csv problem

* handle csv nested-task problem

* further cleanup

* apply fix

* add produces_tasks

* use place-holder required_indices variable

* CreateArrayDeps fix

* remove extra name def

* Change test ordering

* fix dumps_function mistake and account for required_indices being an empty collection

* roll back (breaking) large_graph_objects change

* minor blockwise changes to address code-review

* more cleanup

* use  for column projection in csv

* updating some comments

* improve documentation

* add project_columns to the functions to make things a bit more explicit for now

* move timeseries to Blockwise

* remove commented code

* add daily-stock

Co-authored-by: James Bourbeau <jrbourbeau@gmail.com>
Co-authored-by: James Bourbeau <jrbourbeau@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Blockwise with concatenate=True broken with LocalCluster + optimize_graph=False

3 participants