|
Since |
|
After a quick look through the existing problem areas, it seems like we will need to move all

EDIT: It seems that the above directory structure will not work, since importing the |
|
@jrbourbeau - I think you suggested the use of a full HLG/Layer directory structure last week (in an offline discussion). I think that idea may be the best way forward. Perhaps we can create a |
|
Thinking about it a bit more, I wonder if starting out with a single

Additionally, whatever module structure we go with, we should make use of our CI import build

dask/continuous_integration/scripts/test_imports.sh
Lines 17 to 21 in a1187b1

to ensure that we can always import, for example |
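
For a quick local approximation of that kind of import check, something like the sketch below might be useful. It is not what `test_imports.sh` does (that script works with real environments); instead it hides a dependency inside a fresh interpreter by setting its `sys.modules` entry to `None`, which makes any later `import` of that name fail. The helper name `can_import` and its parameters are hypothetical.

```python
import subprocess
import sys


def can_import(module, blocked=()):
    """Return True if `module` imports in a fresh interpreter.

    Hypothetical helper: every name in `blocked` is hidden first by
    setting sys.modules[name] = None, so importing it raises ImportError.
    """
    code = (
        "import sys\n"
        f"for name in {list(blocked)!r}:\n"
        "    sys.modules[name] = None\n"
        f"import {module}\n"
    )
    # A subprocess keeps the blocked entries out of the current interpreter.
    result = subprocess.run([sys.executable, "-c", code], capture_output=True)
    return result.returncode == 0
```

Calling it with a scheduler-side module name and `blocked=["pandas"]` would then report whether that module still imports when `pandas` is unavailable, assuming nothing imported at interpreter startup has already pulled the blocked library in.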
#7381 is going in this direction for shuffle, but that PR is also attempting to minimize the amount of necessary code to include in |
Earlier today @ian-r-rose pointed out that when we materialize a `HighLevelGraph` to the scheduler we end up importing the modules which contain the layers that make up the `HighLevelGraph` (e.g. `Blockwise`)

dask/dask/highlevelgraph.py
Line 993 in a62fced

This is so we can call their `__dask_distributed_unpack__` method during the graph materialization procedure on the scheduler.

This is problematic because importing a module like `dask.dataframe.shuffle`, where the `ShuffleLayer` class lives, will result in us attempting to import other libraries that that module depends on, e.g. `pandas`, which may not be installed in the environment the scheduler is running on.

@ian-r-rose @rjzamora and I tested this out earlier today and indeed ran into `ImportError`s when trying to perform a DataFrame shuffle on a cluster where the scheduler didn't have `pandas` installed. This PR adds a test which illustrates this issue.

cc @madsbk
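
The failure mode can be reproduced without Dask at all. The following is a minimal sketch, not Dask's actual code: it writes two hypothetical modules (`eager_layer` and `lazy_layer`) to a temporary directory, each defining a `Layer` class with an `unpack` classmethod, and uses a hypothetical `load_layer` helper as a stand-in for the scheduler looking up a layer class by module and class name. `missing_dep_xyz` stands in for a library such as `pandas` that is not installed on the scheduler.

```python
import importlib
import sys
import tempfile
import textwrap
from pathlib import Path

# Hypothetical module whose top-level import needs an uninstalled library.
EAGER_SRC = textwrap.dedent(
    """
    import missing_dep_xyz  # stands in for e.g. pandas

    class Layer:
        @classmethod
        def unpack(cls, state):
            return dict(state)
    """
)

# Same class, but the heavy import is deferred until it is actually needed.
LAZY_SRC = textwrap.dedent(
    """
    class Layer:
        @classmethod
        def unpack(cls, state):
            return dict(state)

        def compute(self):
            import missing_dep_xyz  # only needed where the work runs
            return missing_dep_xyz
    """
)


def load_layer(module_name, class_name):
    """Hypothetical stand-in for looking up a layer class by name."""
    module = importlib.import_module(module_name)
    return getattr(module, class_name)


def demo():
    """Try to unpack with each module and record what happens."""
    results = {}
    with tempfile.TemporaryDirectory() as tmp:
        Path(tmp, "eager_layer.py").write_text(EAGER_SRC)
        Path(tmp, "lazy_layer.py").write_text(LAZY_SRC)
        sys.path.insert(0, tmp)
        try:
            importlib.invalidate_caches()
            for name in ("eager_layer", "lazy_layer"):
                try:
                    cls = load_layer(name, "Layer")
                    results[name] = cls.unpack({"a": 1})
                except ImportError as exc:
                    results[name] = type(exc).__name__
        finally:
            # Clean up so the demo leaves no trace in this interpreter.
            sys.path.remove(tmp)
            sys.modules.pop("eager_layer", None)
            sys.modules.pop("lazy_layer", None)
    return results
```

In this sketch the eager module fails as soon as it is imported, before `unpack` is ever called, while the lazy variant can be unpacked because its heavy import only runs inside `compute`. Whether deferring imports this way or moving the layer classes to a dependency-free module suits Dask better is a design question this example does not settle.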