Use Blockwise/map_partitions in various DataFrame join methods #8306

@gjoseph92

I noticed that some join methods contain code like:

        dsk = {
            (name, i): (apply, merge_chunk, [left_key, right_key], kwargs)
            for i, right_key in enumerate(right.__dask_keys__())
        }

where we're generating a low-level graph that could just be built with map_partitions. Using map_partitions in these scenarios would both speed up graph transmission and allow for blockwise fusion across the operations. Refactoring these simple sorts of graphs should be straightforward.

  • single_partition_join
  • hash_join's merge_chunk
  • stack_partitions should use HighLevelGraph.from_collections instead of merging all of the input graphs

cc @rjzamora @ncclementi @jrbourbeau
