
[WIP] Add HighLevelGraph import test#7374

Closed
jrbourbeau wants to merge 1 commit into dask:main from jrbourbeau:scheduler-import-test

Conversation

@jrbourbeau
Member

Earlier today @ian-r-rose pointed out that when we materialize a HighLevelGraph on the scheduler, we end up importing the modules which contain the layers that make up the HighLevelGraph (e.g. Blockwise):

mod = import_allowed_module(layer["__module__"])

This is so we can call their __dask_distributed_unpack__ method during the graph materialization procedure on the scheduler.

This is problematic because importing a module like dask.dataframe.shuffle, where the ShuffleLayer class lives, will result in us attempting to import other libraries that that module depends on, e.g. pandas, which may not be installed in the environment the scheduler is running on.
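To make the failure mode concrete, here is a rough sketch (not the actual distributed code) of the scheduler-side unpack path: the module named in the layer's packed state is imported, which runs that module's top-level imports (pandas, numpy, ...) as a side effect. The `materialize_layer` helper and the state dict are hypothetical stand-ins.

```python
import importlib

# Sketch of the scheduler-side logic: unpacking a layer imports the module
# named in its packed state. In the real code this is import_allowed_module,
# and the class's __dask_distributed_unpack__ is then called.
def materialize_layer(layer_state):
    mod = importlib.import_module(layer_state["__module__"])
    # Importing dask.dataframe.shuffle here would pull in pandas, whether or
    # not the scheduler environment has it installed.
    return getattr(mod, layer_state["__name__"])

# Stand-in packed state using a stdlib module instead of dask.dataframe.shuffle
state = {"__module__": "collections", "__name__": "OrderedDict"}
cls = materialize_layer(state)
```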

@ian-r-rose @rjzamora and I tested this out earlier today and indeed ran into ImportErrors when trying to perform a DataFrame shuffle on a cluster where the scheduler didn't have pandas installed. This PR adds a test which illustrates this issue.

cc @madsbk

@madsbk
Contributor

madsbk commented Mar 12, 2021

Since __dask_distributed_unpack__() is a classmethod, it can be moved out of the class and isolated in its own module -- it would be ugly, but it would work in this case. However, the implementation of __dask_distributed_unpack__() might itself import external modules, which would trigger the problem again.
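The suggested extraction might look roughly like this (the function name and state shape are hypothetical): a standalone, dependency-free function living in its own module, which the class's `__dask_distributed_unpack__` would simply delegate to.

```python
# Hypothetical: unpack logic extracted from the Layer class into a standalone
# function that the scheduler can import without importing the class's module.
def shuffle_layer_unpack(state, dsk, dependencies):
    # Rebuild the materialized tasks from the packed state. As noted above,
    # any top-level import of pandas/numpy in this module would reintroduce
    # the original problem, so the body must stick to lightweight operations.
    return {"dsk": dict(state["tasks"]), "deps": dict(state["deps"])}
```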

@rjzamora
Member

rjzamora commented Mar 12, 2021

After a quick look through the existing problem areas, it seems like we will need to move all __dask_distributed_unpack__ definitions (and related graph-materialization functions) into a dedicated layers.py file (or something) within each module. We will also need to revise the graph-materialization functions to avoid the use of utilities that rely on "superfluous" dependencies (numpy, pandas, etc.). One way to make this easier would be to move all "light-weight" utilities into a light_utils.py file (or something) for each module. This approach would produce an organization like:

dask/
 |
 |— array/
 |      |— layers.py
 |      |— light_utils.py
 |
 |— dataframe/
 |      |— layers.py
 |      |— light_utils.py
 |
 |— layers.py
 |— light_utils.py

EDIT: It seems that the above directory structure will not work, since importing the dask.dataframe.layers module (for example) also triggers the package-level imports in dask/dataframe/__init__.py.
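This failure mode follows from a general Python import rule: importing `pkg.sub` executes `pkg/__init__.py` first, so any heavy package-level imports run regardless of how light the submodule is. A minimal stdlib illustration, using `email` as a stand-in for `dask.dataframe`:

```python
import importlib
import sys

# Importing a submodule always executes the parent package's __init__.py
# first, so heavy imports there fire even if the submodule itself is light.
importlib.import_module("email.message")
parent_loaded = "email" in sys.modules  # the package __init__ ran as well
```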

@rjzamora
Member

@jrbourbeau - I think you suggested the use of a full HLG/Layer directory structure last week (in an offline discussion). I think that idea may be the best way forward. Perhaps we can create a highlevelgraph/ directory, and use that umbrella to store any function/class that may need to run on the scheduler? For example:

dask/
 |— array/
 |      |— # API and modules requiring typical Array imports
 |— dataframe/
 |      |— # API and modules requiring typical DataFrame imports
 |— highlevelgraph/
 |      |— utils.py # Lightweight utils
 |      |— blockwise.py
 |      |— highlevelgraph.py
 |      |— array/
 |      |      |— utils.py # Lightweight array-focused utils
 |      |      |— shuffle.py
 |      |      |— …
 |      |— dataframe/
 |      |      |— utils.py # Lightweight dataframe-focused utils
 |      |      |— …

@jrbourbeau
Member Author

Thinking about it a bit more I wonder if starting out with a single layers.py would work. If that's too unruly then the structure you outlined seems sensible to me.

Additionally, whatever module structure we go with, we should make use of our CI import build

test_import "" "import dask, dask.base, dask.multiprocessing, dask.threaded, dask.optimization, dask.bag, dask.delayed, dask.graph_manipulation"
test_import "numpy" "import dask.array"
test_import "pandas" "import dask.dataframe"
test_import "bokeh" "import dask.diagnostics"
test_import "distributed" "import dask.distributed"

to ensure that we can always import, for example, dask.layers without any additional dependencies.
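The CI helper could also be exercised locally; a rough Python stand-in for `test_import` (the helper name and behavior here are an assumption based on the snippet above) that runs an import statement in a fresh interpreter:

```python
import subprocess
import sys

# Rough stand-in for the CI test_import helper: run the import statement in a
# fresh interpreter so already-cached modules don't mask missing dependencies.
def check_import(statement):
    result = subprocess.run([sys.executable, "-c", statement])
    return result.returncode == 0

# In CI this would be e.g. "import dask.layers"; json is a stdlib stand-in.
ok = check_import("import json")
```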

@rjzamora
Member

rjzamora commented Mar 15, 2021

Thinking about it a bit more I wonder if starting out with a single layers.py would work. If that's too unruly then the structure you outlined seems sensible to me.

#7381 is going in this direction for shuffle, but that PR is also attempting to minimize the amount of necessary code to include in layers.py (by effectively splitting the Layer definitions into the unpack/materialization code and "everything else"). Do you think that is a reasonable approach, or should we plan to move entire Layer subclass definitions into layers.py? The latter approach will keep the definitions in one place (which is nice), but it will limit the kind of logic we can do (directly) in __init__, cull and __dask_distributed_pack__.
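The split being described might look roughly like this (module names and state shape are hypothetical, not the actual #7381 code):

```python
# layers.py -- lightweight, no pandas/numpy imports; safe on the scheduler
def unpack_shuffle(state):
    # Materialize tasks from the packed state using builtins only
    return dict(state["tasks"])

# shuffle.py -- the full Layer subclass, free to import pandas for __init__,
# cull, and __dask_distributed_pack__; unpacking delegates to layers.py
class ShuffleLayer:
    @classmethod
    def __dask_distributed_unpack__(cls, state):
        return unpack_shuffle(state)
```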

@jrbourbeau
Member Author

Let's include this test over in #7381 -- thanks @rjzamora!

@jrbourbeau jrbourbeau closed this Mar 16, 2021
@jrbourbeau jrbourbeau deleted the scheduler-import-test branch March 16, 2021 18:24
@crusaderky crusaderky mentioned this pull request Dec 9, 2022