
[WIP] Add tests for order to prefer avoiding upwards branching #3017

Merged
mrocklin merged 5 commits into dask:master from mrocklin:order-dependents
Jan 9, 2018

Conversation

@mrocklin
Member

See #3013

  • Tests added / passed
  • Passes flake8 dask
  • Fully documented, including docs/source/changelog.rst for all changes
    and one of the docs/source/*-api.rst files for new API
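For readers unfamiliar with dask's static task ordering, here is a minimal sketch of inspecting it on a hypothetical toy graph (the graph and functions are illustrative, not from this PR). `dask.order.order` assigns each key an integer priority; smaller numbers run earlier, and a task always receives a smaller number than its dependents:

```python
from dask.order import order

def inc(x):
    return x + 1

def dbl(x):
    return x * 2

# Toy graph: a single input 'a' feeds two independent branches 'b' and 'c',
# the kind of upwards branching this PR's ordering heuristic cares about.
dsk = {'a': 1,
       'b': (inc, 'a'),
       'c': (dbl, 'a')}

# order() maps each key to an integer priority; lower runs earlier.
priorities = order(dsk)
```

Since `'a'` is a dependency of both branches, it is always assigned the lowest priority of the three keys.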

@mrocklin
Member Author

@rabernat I believe that you were having memory issues recently with XArray. Changes in this PR may affect that. If you're able to provide a tiny XArray example to test against this would be a very convenient time to dive into that.

@rabernat
Contributor

Sorry for the delayed reply here.

Can you clarify what sort of xarray-related test you are looking for? I'm not sure I understand this PR enough to know how to test against it.

@mrocklin
Member Author

mrocklin commented Dec 22, 2017 via email

@rabernat
Contributor

Here is the sort of test case I have in mind

import xarray as xr
import dask.array as dsa
import pandas as pd

# below I create a random dataset that is typical of high-res climate models
# size of example can be adjusted up and down by changing shape
dims = ('time', 'depth', 'lat', 'lon')
time = pd.date_range('1980-01-01', '2017-01-01', freq='5D')
shape = (len(time), 50, 1800, 3600)
# what I consider to be a reasonable chunk size
chunks = (1, 1, 1800, 3600)
ds = xr.Dataset({k: (dims, dsa.random.random(shape, chunks=chunks))
                 for k in ['u', 'v', 'w']},
                coords={'time': time})

# create seasonal climatology
ds_clim = ds.groupby('time.month').mean(dim='time')

# at this point, we usually recommend loading ds_clim into memory or saving it to disk
# http://xarray.pydata.org/en/latest/dask.html#optimization-tips
ds_clim.to_netcdf('ds_clim.nc')  # note: to_netcdf returns None, so don't reassign ds_clim
# now we reload from disk (could also just have called `ds_clim.load()`)
ds_clim = xr.open_dataset('ds_clim.nc')

# construct seasonal anomaly
ds_anom = ds.groupby('time.month') - ds_clim
# compute variance of seasonal anomaly
ds_anom_var = (ds_anom**2).mean(dim='time').load()

@rabernat
Contributor

rabernat commented Dec 22, 2017

I can confirm that this example causes my workers to die (on the first to_netcdf call).

@mrocklin
Member Author

In an attempt to reproduce, I ran a stripped-down version of this and things ran fine.

The differences here are the following:

  1. Reduced the total calendar time of the dataset
  2. Didn't roundtrip to NetCDF (although I've done this as well and it seems fine)
  3. Use the distributed scheduler for diagnostics
import xarray as xr
import dask.array as da
import pandas as pd

from dask.distributed import Client
client = Client(processes=False)

# below I create a random dataset that is typical of high-res climate models
# size of example can be adjusted up and down by changing shape
dims = ('time', 'depth', 'lat', 'lon')
time = pd.date_range('1980-01-01', '1981-01-01', freq='5d')
shape = (len(time), 50, 1800, 3600)
# what I consider to be a reasonable chunk size
chunks = (1, 1, 1800, 3600)
ds = xr.Dataset({k: (dims, da.random.random(shape, chunks=chunks))
                 for k in ['u', 'v', 'w']},
                coords={'time': time})

# create seasonal climatology
ds_clim = ds.groupby('time.month').mean(dim='time')

# at this point, we usually recommend loading ds_clim into memory or saving it to disk
# http://xarray.pydata.org/en/latest/dask.html#optimization-tips
# ds_clim.to_netcdf('ds_clim.nc')
# now we reload from disk (could also just have called `ds_clim.load()`)
# ds_clim = xr.open_dataset('ds_clim.nc', chunks=ds_clim.)

# construct seasonal anomaly
ds_anom = ds.groupby('time.month') - ds_clim
# compute variance of seasonal anomaly
ds_anom_var = (ds_anom**2).mean(dim='time')

Still trying to track things down. @rabernat does this still fail to behave well if you reduce the size of the job? As a fail case this thing is huge!

@rabernat
Contributor

I just tried running the shorter time range. I am using the following scheduler:

from dask.distributed import Client, LocalCluster

cluster = LocalCluster(n_workers=4, threads_per_worker=4)
client = Client(cluster)

It looks to be heading for a crash, with memory consumption increasing to over 250 GB. I had to kill the scheduler before it crashed the whole machine. (Before that, it was running quite slowly... I projected over an hour just to do a single year.)


How is this different from your successful trial? Did you actually call .load() on ds_anom_var?

@rabernat
Contributor

rabernat commented Dec 27, 2017

Update: memory usage appears to be stable at around 10 GB if I instead use client = Client(processes=False). Calculation is still undesirably slow, however.
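For reference, a minimal sketch contrasting the two local setups being compared in this thread (configuration values copied from the comments above; the `nthreads` queries are only there to show each client came up):

```python
from dask.distributed import Client, LocalCluster

# Threads-only: a single process, so all workers share one memory space and
# no inter-process serialization or TCP transfer happens. This is the
# configuration that kept memory stable at ~10 GB here.
client = Client(processes=False)
threads_only = sum(client.nthreads().values())
client.close()

# Multi-process: 4 worker processes x 4 threads each, as used above.
# Each worker has its own memory pool and communicates over TCP.
cluster = LocalCluster(n_workers=4, threads_per_worker=4)
client = Client(cluster)
client.wait_for_workers(4)
multi_process_workers = len(client.nthreads())
client.close()
cluster.close()
```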

@mrocklin
Member Author

mrocklin commented Dec 27, 2017 via email

@mrocklin
Member Author

mrocklin commented Dec 27, 2017 via email

@mrocklin
Member Author

I can reproduce the failure. Thanks @rabernat !

@rabernat
Contributor

Is my issue actually related to this PR? If not I can open it elsewhere.

This "climatology" use case is really fundamental to climate data analysis. Achieving stability and good performance on this even for very large datasets should be a top priority for xarray and pangeo. So I want to make sure this is appropriately visible.

@mrocklin
Member Author

Is my issue actually related to this PR?

Possibly. This PR is about fixing workloads where dask doesn't navigate the task graph in a way that supports low-memory computation. Looking at a toned-down version of your computation, though, everything seems to be mostly fine.

If not I can open it elsewhere.

That seems reasonable. I'm likely to merge this PR soon.

@mrocklin
Member Author

mrocklin commented Jan 4, 2018

I plan to merge this later today if there are no comments. @jcrist or @eriknw are probably the right people to review. Note that this does change current scheduling behavior in a way that is not strictly an improvement.

@mrocklin mrocklin merged commit e74161c into dask:master Jan 9, 2018
@mrocklin mrocklin deleted the order-dependents branch January 9, 2018 15:48