Add optional IO-subgraph to Blockwise Layers by rjzamora · Pull Request #6715 · dask/dask

rjzamora · 2020-10-08T16:29:44Z

Big Picture

Blockwise HighLevelGraph (HLG) layers currently require a collection dependency. This has prevented IO (data-generation) operations from being represented as Blockwise objects. This, in turn, has prevented optimize_blockwise from fusing IO tasks with follow-up block-wise transformations. This is fine if/when we can use fuse on the full task-graph dictionary, but not if we hope to send HLGs directly to the scheduler.

This PR makes it possible to perform IO from within a Blockwise layer (by passing in a special IO subgraph), and introduces a new BlockwiseParquet Layer. These changes demonstrate that optimize_blockwise behaves correctly for IO-enabled layers, without complicating optimize_read_parquet_getitem.
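To make the fusion idea concrete, here is a minimal sketch that uses plain Python dictionaries rather than Dask's actual Blockwise or optimize_blockwise machinery. All helper names (`read_partition`, `double`, `evaluate`, `fuse_io`) are hypothetical and exist only for illustration. The sketch inlines each IO task into the block-wise task that consumes it, so reading and transforming a partition become a single task.

```python
def read_partition(i):
    """Stand-in for an IO function (e.g. reading one partition of a file)."""
    return list(range(i * 3, i * 3 + 3))


def double(part):
    """Stand-in for a block-wise transformation applied to one partition."""
    return [2 * x for x in part]


def evaluate(graph, key):
    """Tiny recursive evaluator for a dict task graph (not Dask's scheduler)."""
    def run(task):
        if isinstance(task, tuple) and task and callable(task[0]):
            func, *args = task
            return func(*(run(a) for a in args))
        if isinstance(task, tuple) and task in graph:
            return run(graph[task])
        return task
    return run(graph[key])


def fuse_io(graph, io_name):
    """Inline every task named ``io_name`` into the tasks that reference it."""
    def inline(task):
        if isinstance(task, tuple) and task and task[0] == io_name and task in graph:
            return graph[task]
        if isinstance(task, tuple):
            return tuple(inline(t) for t in task)
        return task
    return {k: inline(v) for k, v in graph.items() if k[0] != io_name}


# Unfused graph: one IO task plus one transformation task per partition.
unfused = {}
for i in range(2):
    unfused[("read", i)] = (read_partition, i)
    unfused[("double", i)] = (double, ("read", i))

# Fused graph: the IO call is nested inside the transformation task.
fused = fuse_io(unfused, "read")
```

The fused graph holds half as many tasks as the unfused one while producing the same results, which is the property that matters if the graph is shipped to the scheduler without a later low-level fuse pass.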

TODO:

Add low-level fusion of blockwise io and transition tasks
Fuse IO into subgraph_callable (may make future optimizations cleaner)
~~Try to avoid IO-subgraph dict materialization when culling and/or getting dependencies~~
Try to avoid "no-op" tasks in IO-only Blockwise layers
Update other IO operations to leverage Blockwise (e.g. read_csv)
Remove from_callable API, and remove or update the related test

rjzamora · 2020-10-20T23:14:42Z

Update: While 8d371ce demonstrates low-level task fusion of IO and (traditional) blockwise operations, the most recent commit goes a bit further and pushes IO-function calls into the same subgraph_callable as the other blockwise operations.

I assume it is best (for scheduler logic) to fuse all operations into the same subgraph_callable definition, but this may be a bit more restrictive on the properties of the IO subgraph: we need subgraph.get((<io_name>, <partition_index>)) to return a tuple whose first element is an "IO function".
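The constraint described above can be illustrated with a small sketch. `LazyIOSubgraph` and `read_chunk` are hypothetical names, not part of Dask; the only behaviour taken from the discussion is that looking up `(<io_name>, <partition_index>)` yields a tuple whose first element is the IO function. The mapping builds each task on demand, so nothing is materialized up front.

```python
from collections.abc import Mapping


def read_chunk(path, index):
    """Hypothetical IO function: pretend to read one partition of ``path``."""
    return f"{path}[{index}]"


class LazyIOSubgraph(Mapping):
    """Maps (io_name, partition_index) -> (io_func, *args), built on demand."""

    def __init__(self, io_name, io_func, path, npartitions):
        self.io_name = io_name
        self.io_func = io_func
        self.path = path
        self.npartitions = npartitions

    def __getitem__(self, key):
        if not (isinstance(key, tuple) and len(key) == 2):
            raise KeyError(key)
        name, index = key
        if name != self.io_name or not isinstance(index, int):
            raise KeyError(key)
        if not 0 <= index < self.npartitions:
            raise KeyError(key)
        # The IO function always comes first, followed by its arguments.
        return (self.io_func, self.path, index)

    def __iter__(self):
        return ((self.io_name, i) for i in range(self.npartitions))

    def __len__(self):
        return self.npartitions


subgraph = LazyIOSubgraph("read-data", read_chunk, "data.parquet", npartitions=3)
func, *args = subgraph.get(("read-data", 1))
result = func(*args)
```

Because the first element is always the same callable, a consumer can pull it out once and reuse it for every partition, passing only the per-partition arguments at run time.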

rjzamora · 2020-10-25T17:17:59Z

It seems that (at least some of) the CI failures are related to fsspec#458

EDIT: It seems that the issue was in aiohttp 3.7.0 and has been resolved in 3.7.1

jrbourbeau

Thanks for all your work here @rjzamora! The test you've added does a really nice job of demonstrating blockwise layer optimizations

jrbourbeau · 2020-10-30T21:40:20Z

dask/blockwise.py

+            # Extract actual IO function for SubgraphCallable construction.
+            # Wrap func in `PackedFunctionCall`, since it will receive
+            # all arguments as a single (packed) tuple at run time.
+            io_func = self.io_subgraph.get((self.io_name, 0), (None,))[0]


We should document whatever assumptions we're making about the structure of io_subgraph

I made some minor tweaks here and added notes in the docstring and below to clarify the assumptions. Let me know what you think.
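For readers following the thread, the assumption under discussion can be sketched roughly as follows. `PackedCall` is a hypothetical stand-in for the wrapper named in the diff, not Dask's implementation, and `read_rows` is an invented IO function. Passing the arguments through unchanged when no IO function is found is also an assumption made for this example.

```python
class PackedCall:
    """Illustrative wrapper: call ``func`` with arguments packed in one tuple."""

    def __init__(self, func=None):
        self.func = func

    def __call__(self, packed_args):
        if self.func is None:
            # Assumption for this sketch: with no IO function, pass through.
            return packed_args
        if isinstance(packed_args, (tuple, list)):
            return self.func(*packed_args)
        return self.func(packed_args)


def read_rows(path, start, stop):
    """Hypothetical IO function returning fake row labels."""
    return [f"{path}:{i}" for i in range(start, stop)]


io_name = "read-rows"
# Assumed structure: every value is (io_func, *args) and partition 0 exists.
io_subgraph = {
    (io_name, 0): (read_rows, "table.csv", 0, 2),
    (io_name, 1): (read_rows, "table.csv", 2, 4),
}

# Same lookup pattern as in the diff: take the function from partition 0.
io_func = io_subgraph.get((io_name, 0), (None,))[0]
wrapped = PackedCall(io_func)

# At run time only the per-partition arguments are passed, as one tuple.
part1 = wrapped(io_subgraph[(io_name, 1)][1:])
```

The sketch makes the two implicit requirements visible: partition 0 must be present for the lookup to find a function, and every partition must share that function, since only the arguments differ between tasks.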

jrbourbeau · 2020-11-03T02:09:00Z

dask/dataframe/io/tests/test_parquet.py

+    graph = optimize_blockwise(ddf.__dask_graph__(), keys)
+    layers = graph.layers
+    name = list(layers.keys())[0]
+    assert len(layers) == 1


Nice! This is good to see : )

jrbourbeau · 2020-11-03T02:12:38Z

dask/blockwise.py

+        # TODO: Handle N-D Collections and more-complex
+        # tensor operations.


Could you open up an issue with checkboxes to track follow-up work we'll want to do to ensure we're using IO-subgraphs throughout the codebase?

I added #6791 (to track the use of IO subgraphs) and #6792 (to track improvements in this particular method) - Let me know if you had anything else in mind. Also, feel free to modify those issues in any way you'd like :)
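As a rough illustration of what the N-D case in the TODO would need, the sketch below enumerates output keys for an arbitrary number of block dimensions. The function name `output_keys` and its signature are hypothetical and do not reproduce the get_output_keys method added in this PR. It relies only on the Dask convention that a collection's keys are tuples made of the layer name followed by block indices.

```python
from itertools import product


def output_keys(name, numblocks):
    """Return the set of keys (name, i, j, ...) for the given block counts.

    ``numblocks`` is a tuple with one entry per output dimension, e.g. (4,)
    for a four-partition dataframe or (2, 3) for a 2x3 blocked array.
    """
    ranges = (range(n) for n in numblocks)
    return {(name, *index) for index in product(*ranges)}


keys_1d = output_keys("read-parquet", (3,))
keys_2d = output_keys("add", (2, 3))
```

Deriving the keys from the block counts alone means the key set is available without building the layer's task dictionary.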

jrbourbeau

Thanks @rjzamora! This is in

mrocklin and others added 30 commits August 13, 2020 06:42

Add HighLevelGraph.validate call to assert_eq in tests

a636838

Follows on from dask#6508

Merge branch 'master' into hlg-verify

5636f29

subs(): now compare key hashes

14628a5

subs(): now use hash and equality matching

e0c2756

Merge branch 'master' of github.com:dask/dask into sub_use_key_hash

9039fa8

minor cleanup

27acf33

Implemented culling of high level graphs

5b2ea08

Dataframe to use high level graph culling

cfffa3a

arrays to use high level culling

06bc6d4

For now fixing hlg.dependencies

8bd59fe

When <dask#6509> passes, we can remove this fix, which introduces a significant overhead

Skip layers nobody depend on

b8f2066

Added class Layer(Mapping)

590d4ba

Implemented a default Layer.cull()

b99d94b

clean up of array.optimize()

9bad423

Use set intersection

8eaed75

docs

678795b

moved cull_highlevelgraph() into HighLevelGraph

58ce174

clean up

6ec590c

Added core.keys_in_tasks()

7acf49a

Layer.get_external_dependencies() to use keys_in_tasks()

10d1ffd

Merge branch 'master' of github.com:dask/dask into cull_high_level_graph

ab21639

reformat: black

a750910

BasicLayer(): added an external_dependencies argument

67ab27f

Blockwise to implement the Layer protocol

3c7877e

Added doctest

b8c8731

Added core.keys_in_tasks()

2ce8cee

HLG: added more thorough validation check

d037c66

lingalg: layer dependencies are now sets

85d9c39

linalg.lstsq(): fixed dependencies

f6920bf

linalg.sfqr(): fixed dependencies

e1a621c

rjzamora added 3 commits October 16, 2020 14:15

remove commented code

9ee604d

avoid subgraph_callable with no-op only

8d371ce

try packaging everything into subgraph_callable

548081d

rjzamora marked this pull request as ready for review October 17, 2020 03:35

rjzamora added 4 commits October 16, 2020 21:00

some parquet cases working... but this may be painful

4e9c837

parquet working

1b58dbb

pushing blockwise IO into subgraph_callable

3796e7c

remove no_op

9dea6e5

rjzamora mentioned this pull request Oct 23, 2020

Workflow memory-pressure optimizations NVIDIA-Merlin/NVTabular#334

Merged

rjzamora added 3 commits October 24, 2020 16:23

remove from_callable and move basic io-blockwise test to test_parquet.py

fcdf15e

add BlockwiseReadCSV

2824c23

fix comment typo

a1a3d5d

conver orc reader to Blockwise

6a16741

rjzamora changed the title ~~[WIP] Add optional IO-subgraph to Blockwise Layers~~ Add optional IO-subgraph to Blockwise Layers Oct 27, 2020

rjzamora added 5 commits October 30, 2020 11:13

Merge remote-tracking branch 'upstream/master' into blockwise-experimental

3e3cf66

implement get_output_keys for the simplest case

fb27fd4

implement get_output_keys for the simplest case

963b6c7

implement get_output_keys for the simplest case

154b29b

dissallow empty indices

01187cb

jrbourbeau reviewed Nov 3, 2020

rjzamora added 2 commits November 2, 2020 19:17

remove test_graph_size_pyarrow change

0465a94

address code review and add to docstring

3f65a80

This was referenced Nov 3, 2020

Use IO-Subgraph + Blockwise throughout codebase #6791

Closed

Add comprehensive get_output_keys defintition to Blockwise #6792

Closed

jrbourbeau approved these changes Nov 4, 2020

jrbourbeau merged commit 5589bfd into dask:master Nov 4, 2020

rjzamora mentioned this pull request Nov 7, 2020

Avoid graph materialization during Blockwise culling #6815

Merged

3 tasks

rjzamora deleted the blockwise-experimental branch May 21, 2024 00:04

jrbourbeau merged 155 commits into dask:master from rjzamora:blockwise-experimental

4 participants
