
[WIP] Add HighLevelGraph import test#7374

Closed
jrbourbeau wants to merge 1 commit into dask:main from jrbourbeau:scheduler-import-test

Conversation

@jrbourbeau
Member

Earlier today @ian-r-rose pointed out that when we materialize a HighLevelGraph on the scheduler, we end up importing the modules which contain the layers that make up the HighLevelGraph (e.g. Blockwise):

mod = import_allowed_module(layer["__module__"])

This is so we can call their __dask_distributed_unpack__ method during the graph materialization procedure on the scheduler.

This is problematic because importing a module like dask.dataframe.shuffle, where the ShuffleLayer class lives, will result in us attempting to import other libraries that that module depends on, e.g. pandas, which may not be installed in the environment the scheduler is running on.
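To make the failure mode concrete, here is a rough sketch (not the actual distributed code) of the scheduler-side unpack path: the module named in the layer's packed state is imported, which runs that module's top-level imports (pandas, numpy, ...) as a side effect. The `materialize_layer` helper and the state dict are hypothetical stand-ins.

```python
import importlib

# Sketch of the scheduler-side logic: unpacking a layer imports the module
# named in its packed state. In the real code this is import_allowed_module,
# and the class's __dask_distributed_unpack__ is then called.
def materialize_layer(layer_state):
    mod = importlib.import_module(layer_state["__module__"])
    # Importing dask.dataframe.shuffle here would pull in pandas, whether or
    # not the scheduler environment has it installed.
    return getattr(mod, layer_state["__name__"])

# Stand-in packed state using a stdlib module instead of dask.dataframe.shuffle
state = {"__module__": "collections", "__name__": "OrderedDict"}
cls = materialize_layer(state)
```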

@ian-r-rose @rjzamora and I tested this out earlier today and indeed ran into ImportErrors when trying to perform a DataFrame shuffle on a cluster where the scheduler didn't have pandas installed. This PR adds a test which illustrates this issue.

cc @madsbk

@madsbk
Contributor

madsbk commented Mar 12, 2021

Since __dask_distributed_unpack__() is a classmethod, it can be moved out of the class and isolated in its own module -- it would be ugly, but it would work in this case. However, the implementation of __dask_distributed_unpack__() might itself import external modules, which would trigger the problem again.
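The suggested extraction might look roughly like this (the function name and state shape are hypothetical): a standalone, dependency-free function living in its own module, which the class's `__dask_distributed_unpack__` would simply delegate to.

```python
# Hypothetical: unpack logic extracted from the Layer class into a standalone
# function that the scheduler can import without importing the class's module.
def shuffle_layer_unpack(state, dsk, dependencies):
    # Rebuild the materialized tasks from the packed state. As noted above,
    # any top-level import of pandas/numpy in this module would reintroduce
    # the original problem, so the body must stick to lightweight operations.
    return {"dsk": dict(state["tasks"]), "deps": dict(state["deps"])}
```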

@rjzamora
Member

rjzamora commented Mar 12, 2021

After a quick look through the existing problem areas, it seems like we will need to move all __dask_distributed_unpack__ definitions (and related graph-materialization functions) into a dedicated layers.py file (or something) within each module. We will also need to revise the graph-materialization functions to avoid the use of utilities that rely on "superfluous" dependencies (numpy, pandas, etc.). One way to make this easier would be to move all "light-weight" utilities into a light_utils.py file (or something) for each module. This approach would produce an organization like:

dask/
 |
 |— array/
 |      |— layers.py
 |      |— light_utils.py
 |
 |— dataframe/
 |      |— layers.py
 |      |— light_utils.py
 |
 |— layers.py
 |— light_utils.py

EDIT: It seems that the above directory structure will not work, since importing the dask.dataframe.layers module (for example) also triggers the package-level imports in dask/dataframe/__init__.py.
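This failure mode follows from a general Python import rule: importing `pkg.sub` executes `pkg/__init__.py` first, so any heavy package-level imports run regardless of how light the submodule is. A minimal stdlib illustration, using `email` as a stand-in for `dask.dataframe`:

```python
import importlib
import sys

# Importing a submodule always executes the parent package's __init__.py
# first, so heavy imports there fire even if the submodule itself is light.
importlib.import_module("email.message")
parent_loaded = "email" in sys.modules  # the package __init__ ran as well
```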

@rjzamora
Member

@jrbourbeau - I think you suggested the use of a full HLG/Layer directory structure last week (in an offline discussion). I think that idea may be the best way forward. Perhaps we can create a highlevelgraph/ directory, and use that umbrella to store any function/class that may need to run on the scheduler? For example:

dask/
 |— array/
 |      |— # API and modules requiring typical Array imports
 |— dataframe/
 |      |— # API and modules requiring typical DataFrame imports
 |— highlevelgraph/
 |      |— utils.py # Lightweight utils
 |      |— blockwise.py
 |      |— highlevelgraph.py
 |      |— array/
 |      |      |— utils.py # Lightweight array-focused utils
 |      |      |— shuffle.py
 |      |      |— …
 |      |— dataframe/
 |      |      |— utils.py # Lightweight dataframe-focused utils
 |      |      |— …

@jrbourbeau
Member Author

Thinking about it a bit more I wonder if starting out with a single layers.py would work. If that's too unruly then the structure you outlined seems sensible to me.

Additionally, whatever module structure we go with, we should make use of our CI import build

test_import "" "import dask, dask.base, dask.multiprocessing, dask.threaded, dask.optimization, dask.bag, dask.delayed, dask.graph_manipulation"
test_import "numpy" "import dask.array"
test_import "pandas" "import dask.dataframe"
test_import "bokeh" "import dask.diagnostics"
test_import "distributed" "import dask.distributed"

to ensure that we can always import, for example, dask.layers without any additional dependencies.
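The CI helper could also be exercised locally; a rough Python stand-in for `test_import` (the helper name and behavior here are an assumption based on the snippet above) that runs an import statement in a fresh interpreter:

```python
import subprocess
import sys

# Rough stand-in for the CI test_import helper: run the import statement in a
# fresh interpreter so already-cached modules don't mask missing dependencies.
def check_import(statement):
    result = subprocess.run([sys.executable, "-c", statement])
    return result.returncode == 0

# In CI this would be e.g. "import dask.layers"; json is a stdlib stand-in.
ok = check_import("import json")
```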

@rjzamora
Member

rjzamora commented Mar 15, 2021

Thinking about it a bit more I wonder if starting out with a single layers.py would work. If that's too unruly then the structure you outlined seems sensible to me.

#7381 is going in this direction for shuffle, but that PR is also attempting to minimize the amount of necessary code to include in layers.py (by effectively splitting the Layer definitions into the unpack/materialization code and "everything else"). Do you think that is a reasonable approach, or should we plan to move entire Layer subclass definitions into layers.py? The latter approach will keep the definitions in one place (which is nice), but it will limit the kind of logic we can do (directly) in __init__, cull and __dask_distributed_pack__.
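The split being described might look roughly like this (module names and state shape are hypothetical, not the actual #7381 code):

```python
# layers.py -- lightweight, no pandas/numpy imports; safe on the scheduler
def unpack_shuffle(state):
    # Materialize tasks from the packed state using builtins only
    return dict(state["tasks"])

# shuffle.py -- the full Layer subclass, free to import pandas for __init__,
# cull, and __dask_distributed_pack__; unpacking delegates to layers.py
class ShuffleLayer:
    @classmethod
    def __dask_distributed_unpack__(cls, state):
        return unpack_shuffle(state)
```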

@jrbourbeau
Member Author

Let's include this test over in #7381 -- thanks @rjzamora!

@jrbourbeau jrbourbeau closed this Mar 16, 2021
@jrbourbeau jrbourbeau deleted the scheduler-import-test branch March 16, 2021 18:24
@crusaderky crusaderky mentioned this pull request Dec 9, 2022