Move timeseries and daily-stock to Blockwise #7615
Conversation
…shuffle-avoid-pd-import
Co-authored-by: James Bourbeau <jrbourbeau@users.noreply.github.com>
…ask_distributed_unpack__
Co-authored-by: James Bourbeau <jrbourbeau@users.noreply.github.com>
|
@quasiben @jakirkham - This PR should make it a bit easier to benchmark scheduler performance for large graphs without any significant IO or communication overhead. For example, something like the following is a reasonable way to produce a large graph that will also benefit from Blockwise-IO optimizations: import dask
from dask.distributed import LocalCluster, wait, Client
cluster = LocalCluster()
client = Client(cluster)
ddf = dask.datasets.timeseries(
start='2000-01-01',
end='2003-12-31',
freq='120s',
partition_freq='1h',
)
s = ddf["id"] + 10
mean = s.mean()
with dask.config.set({"optimization.fuse.active": False}):
wait(mean.persist())

This code seems to run ~20% faster for me with this PR (compared to the version without it).
|
It seems that a `fastparquet` test is failing. For reference, here is the latest failure:

=================================== FAILURES ===================================
________________________ test_partition_on[fastparquet] ________________________
[gw0] linux -- Python 3.8.8 /usr/share/miniconda3/envs/test-environment/bin/python
tmpdir = '/tmp/pytest-of-runner/pytest-0/popen-gw0/test_partition_on_fastparquet_0'
engine = 'fastparquet'
def test_partition_on(tmpdir, engine):
tmpdir = str(tmpdir)
df = pd.DataFrame(
{
"a1": np.random.choice(["A", "B", "C"], size=100),
"a2": np.random.choice(["X", "Y", "Z"], size=100),
"b": np.random.random(size=100),
"c": np.random.randint(1, 5, size=100),
"d": np.arange(0, 100),
}
)
d = dd.from_pandas(df, npartitions=2)
d.to_parquet(tmpdir, partition_on=["a1", "a2"], engine=engine)
# Note #1: Cross-engine functionality is missing
# Note #2: The index is not preserved in pyarrow when partition_on is used
out = dd.read_parquet(
tmpdir, engine=engine, index=False, gather_statistics=False
).compute()
for val in df.a1.unique():
assert set(df.b[df.a1 == val]) == set(out.b[out.a1 == val])
# Now specify the columns and allow auto-index detection
> out = dd.read_parquet(tmpdir, engine=engine, columns=["b", "a2"]).compute()
dask/dataframe/io/tests/test_parquet.py:1257:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
dask/base.py:285: in compute
(result,) = compute(self, traverse=False, **kwargs)
dask/base.py:567: in compute
results = schedule(dsk, keys, **kwargs)
dask/threaded.py:79: in get
results = get_async(
dask/local.py:514: in get_async
raise_exception(exc, tb)
dask/local.py:325: in reraise
raise exc
dask/local.py:223: in execute_task
result = _execute_task(task, data)
dask/core.py:121: in _execute_task
return func(*(_execute_task(a, cache) for a in args))
dask/optimization.py:963: in __call__
return core.get(self.dsk, self.outkey, dict(zip(self.inkeys, args)))
dask/core.py:151: in get
result = _execute_task(task, cache)
dask/core.py:121: in _execute_task
return func(*(_execute_task(a, cache) for a in args))
dask/dataframe/io/parquet/core.py:82: in __call__
return read_parquet_part(
dask/dataframe/io/parquet/core.py:345: in read_parquet_part
dfs = [
dask/dataframe/io/parquet/core.py:346: in <listcomp>
func(fs, rg, columns.copy(), index, **toolz.merge(kwargs, kw))
dask/dataframe/io/parquet/fastparquet.py:801: in read_partition
parquet_file.read_row_group_file(
/usr/share/miniconda3/envs/test-environment/lib/python3.8/site-packages/fastparquet/api.py:210: in read_row_group_file
core.read_row_group_file(
/usr/share/miniconda3/envs/test-environment/lib/python3.8/site-packages/fastparquet/core.py:303: in read_row_group_file
return read_row_group(f, rg, columns, categories, schema_helper, cats,
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
file = <_io.BufferedReader name='/tmp/pytest-of-runner/pytest-0/popen-gw0/test_partition_on_fastparquet_0/a1=B/a2=Y/part.0.parquet'>
rg = <class 'fastparquet.parquet_thrift.parquet.ttypes.RowGroup'>
columns: [<class 'fastparquet.parquet_thrift.parquet.ttyp...ompressed_size: 55
total_uncompressed_size: None
type: 2
]
num_rows: 4
sorting_columns: None
total_byte_size: 252
columns = ['b', 'a2', '__null_dask_index__'], categories = {}
schema_helper = <Parquet Schema with 5 entries>
cats = OrderedDict([('a1', ['C']), ('a2', ['X'])]), selfmade = True
index = ['__null_dask_index__']
assign = {'__null_dask_index__': array([18, 29, 34, 49]), 'a1': array([0, 0, 0, 0], dtype=int8), 'a1-catdef': ['B', 'B', 'B', 'B']
Categories (3, object): ['B', 'C', 'A'], 'a2': array([0, 0, 0, 0], dtype=int8), ...}
scheme = 'hive', partition_meta = {}
def read_row_group(file, rg, columns, categories, schema_helper, cats,
selfmade=False, index=None, assign=None,
scheme='hive', partition_meta=None):
"""
Access row-group in a file and read some columns into a data-frame.
"""
partition_meta = partition_meta or {}
if assign is None:
raise RuntimeError('Going with pre-allocation!')
read_row_group_arrays(file, rg, columns, categories, schema_helper,
cats, selfmade, assign=assign)
for cat in cats:
if scheme == 'hive':
s = ex_from_sep('/')
partitions = s.findall(rg.columns[0].file_path)
else:
partitions = [('dir%i' % i, v) for (i, v) in enumerate(
rg.columns[0].file_path.split('/')[:-1])]
key, val = [p for p in partitions if p[0] == cat][0]
val = val_to_num(val, meta=partition_meta.get(key))
> assign[cat][:] = cats[cat].index(val)
E ValueError: 'B' is not in list
/usr/share/miniconda3/envs/test-environment/lib/python3.8/site-packages/fastparquet/core.py:366: ValueError |
|
Re: the |
|
Guessing this needs a rebase now that PR #7415 has gone in.
|
Thank you Rick! 😄 |
Supersedes #7237
Blocked by #7415

- `make_timeseries` to use `Blockwise`.
- `daily_stock` to use `Blockwise`.