Handle Blockwise HLG pack/unpack for concatenate=True#7455
Handle Blockwise HLG pack/unpack for concatenate=True#7455jrbourbeau merged 19 commits intodask:mainfrom
Conversation
|
Just pushed a commit to temporarily point CI to dask/distributed#4641 in order to illustrate that tests pass here when we include the |
|
@jrbourbeau - What is the best way to deal with the distributed min-dep changing to the next release (2021.04.1)? For now, I just changed the min-dep environment to install distributed from source. |
|
@jrbourbeau - I am okay with merging #7415 very soon after the next release, but it would be nice to get this fix in as soon as possible. Do you expect there to be problems with this plan? |
|
That sounds good to me. The issue here is we're getting consistent failures in the I tried to determine what the underlying issue is yesterday but wasn't able to figure it out yet. I did observe that the issue is only present on Python 3.7 so I tried to temporarily bump the Python version to used in the |
|
Alright, I ended up changing the location of the test added here so it's no longer the last test in this module. That should hopefully result in CI passing 🤞It's unfortunate that test ordering currently matters, but the actual change here is small and benign |
|
Interesting - Would it help for us to add a module-check decorator (so that we skip outside the For example: would something like this commit help? |
|
Hmm good idea, something like that would also work. Though from experimenting locally it looks like |
|
Just pushed a commit to resolve a merge conflict. Planning to merge this PR after CI finishes |
|
Thanks @jrbourbeau ! |
* Add test illustrating import issue * move layer materialization for shuffle * add missing layers.py file and revise org a bit * move layers.py * remove xfail * use import rather than pickle for functions * fix importlib error and use dict * roll back test changes from 7374 (let failures be resolved in that PR) * remove debug print * move Shuffle layers to layers.py completely * moving moving BroadcastJoinLayer to layers.py * tweak CallableLazyImport * move BlockwiseCreateArray into layers.py * import test coverage * incorperate testing idea from 7374 * comment tweaks * introduce new test_layers.py module * update comment in test * Update dask/layers.py Co-authored-by: James Bourbeau <jrbourbeau@users.noreply.github.com> * Only use CallableLazyImport when the graph is materialized within __dask_distributed_unpack__ * remove obsolete annotation handling * Update dask/layers.py Co-authored-by: James Bourbeau <jrbourbeau@users.noreply.github.com> * _construct_graph fix * migrate csv code * migrate orc changes * basic parquet migration - still need to handle serialization * add require_pickle option to DataFrameIOLayer * update testing * change 'column culling' language to 'column projection' * use ConcatAxesWrapper to avoid inline task * add test coverage * use serialize instead of pickle * use simpler SerializedFunction wrapper * update to rely on distributed#4575 * serialization experiments * align with current status of distributed-4641 * update/fix comment * align with latest distributed#4641 state * re-align with James' suggestion * Temporarily point to Distributed PR #4641 * point CI back to distributed main branch * avoid using distributed 2021.03.0 for mindeps test * start aligning with #7455 * begin requiring io_deps elements to inherit from base class * more simplification * fix indices init * Add .git suffix * Add explicit pip * serialization cleanup * address dep-name issue * strip out unnecessary import logic * use output_blocks in pack * more code review suggestions * use clearer tmp variable name * avoid io_deps copy * dictionary comp for BlockwiseDepDict pack * start working on csv problem * handle csv nested-task problem * further cleanup * apply fix * add produces_tasks * use place-holder required_indices variable * CreateArrayDeps fix * remove extra name def * Change test ordering * fix dumps_function mistake and account for required_indices being an empty collection * roll back (breaking) large_graph_objects change * minor blockwise changes to address code-review * more cleanup * use for column projection in csv * updating some comments * improve documentation * add project_columns to the functions to make things a bit more explicit for now * move timeseries to Blockwise * remove commented code * add daily-stock Co-authored-by: James Bourbeau <jrbourbeau@gmail.com> Co-authored-by: James Bourbeau <jrbourbeau@users.noreply.github.com>
One possible solution to the
Blockwiseserialization challenge discussed in this comment. Note that I would certainly prefer a general solution to nested-task serialization.concatenate=Truebroken with LocalCluster +optimize_graph=False#7449black dask/flake8 dask