[WIP] Add tests for order to prefer avoiding upwards branching #3017

mrocklin merged 5 commits into dask:master
Conversation
@rabernat I believe that you were having memory issues recently with xarray. Changes in this PR may affect that. If you're able to provide a tiny xarray example to test against, this would be a very convenient time to dive into that.
Force-pushed from 9970b29 to bef27d5
Sorry for the delayed reply here. Can you clarify what sort of xarray-related test you are looking for? I'm not sure I understand this PR enough to know how to test against it.
You mentioned that you had a problem where Dask seemed to use more memory than you thought was necessary. I think you were talking about subtracting seasonal means. I suspect that there is a way to construct a representative example using a random array, pandas.date_range, and the DataArray or Dataset constructor. If you're able to construct such a failing case I would find that valuable.
Here is the sort of test case I have in mind:

```python
import xarray as xr
import dask.array as dsa
import pandas as pd

# below I create a random dataset that is typical of high-res climate models
# size of example can be adjusted up and down by changing shape
dims = ('time', 'depth', 'lat', 'lon')
time = pd.date_range('1980-01-01', '2017-01-01', freq='5D')
shape = (len(time), 50, 1800, 3600)

# what I consider to be a reasonable chunk size
chunks = (1, 1, 1800, 3600)

ds = xr.Dataset({k: (dims, dsa.random.random(shape, chunks=chunks))
                 for k in ['u', 'v', 'w']},
                coords={'time': time})

# create seasonal climatology
ds_clim = ds.groupby('time.month').mean(dim='time')

# at this point, we usually recommend loading ds_clim or saving it to disk
# http://xarray.pydata.org/en/latest/dask.html#optimization-tips
ds_clim.to_netcdf('ds_clim.nc')

# now we reload from disk (could also just have called `ds_clim.load()`)
ds_clim = xr.open_dataset('ds_clim.nc')

# construct seasonal anomaly
ds_anom = ds.groupby('time.month') - ds_clim

# compute variance of seasonal anomaly
ds_anom_var = (ds_anom**2).mean(dim='time').load()
```
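As an aside, the groupby-month / subtract / take-variance pattern above is easy to state in miniature. Below is a pure-Python sketch (one scalar per timestamp, and `seasonal_anomaly_variance` is a hypothetical helper name, not part of the thread's code) just to pin down the semantics of `ds.groupby('time.month') - ds_clim`:

```python
from collections import defaultdict
from datetime import date, timedelta

def seasonal_anomaly_variance(times, values):
    # monthly climatology: mean of all values sharing a calendar month
    by_month = defaultdict(list)
    for t, v in zip(times, values):
        by_month[t.month].append(v)
    clim = {m: sum(vs) / len(vs) for m, vs in by_month.items()}
    # anomaly: each value minus its month's climatological mean
    anom = [v - clim[t.month] for t, v in zip(times, values)]
    # variance of the anomaly over time
    return sum(a * a for a in anom) / len(anom)

times = [date(1980, 1, 1) + timedelta(days=5 * i) for i in range(40)]
# a purely seasonal signal (value == calendar month) has zero anomaly
print(seasonal_anomaly_variance(times, [float(t.month) for t in times]))
```

A purely seasonal input gives variance 0.0; any residual variance comes from departures from the monthly means.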
I can confirm that this example causes my workers to die (on the first
In an attempt to reproduce this I ran a stripped-down version and things ran fine. The differences here are the following:

```python
import xarray as xr
import dask.array as da
import pandas as pd
from dask.distributed import Client

# run with an in-process (threaded) client rather than separate workers
client = Client(processes=False)

# below I create a random dataset that is typical of high-res climate models
# size of example can be adjusted up and down by changing shape
# (note the much shorter time range than in the original example)
dims = ('time', 'depth', 'lat', 'lon')
time = pd.date_range('1980-01-01', '1981-01-01', freq='5d')
shape = (len(time), 50, 1800, 3600)

# what I consider to be a reasonable chunk size
chunks = (1, 1, 1800, 3600)

ds = xr.Dataset({k: (dims, da.random.random(shape, chunks=chunks))
                 for k in ['u', 'v', 'w']},
                coords={'time': time})

# create seasonal climatology
ds_clim = ds.groupby('time.month').mean(dim='time')

# at this point, we usually recommend loading ds_clim or saving to disk
# http://xarray.pydata.org/en/latest/dask.html#optimization-tips
# ds_clim.to_netcdf('ds_clim.nc')
# now we reload from disk (could also just have called `ds_clim.load()`)
# ds_clim = xr.open_dataset('ds_clim.nc', chunks=ds_clim.)

# construct seasonal anomaly
ds_anom = ds.groupby('time.month') - ds_clim

# compute variance of seasonal anomaly
ds_anom_var = (ds_anom**2).mean(dim='time')
```

Still trying to track things down. @rabernat does this still fail to behave well if you reduce the size of the job? As a fail case this thing is huge!
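For scale, here is my own back-of-envelope arithmetic (stdlib only, not from the thread) on the original full-length 1980-2017 example: at float64 each (1, 1, 1800, 3600) chunk is about 52 MB, and the three variables together come to roughly 21 TB of random data, which supports the "this thing is huge" point:

```python
from datetime import date

# 5-day time axis from 1980-01-01 through 2017-01-01, inclusive of both ends
n_times = (date(2017, 1, 1) - date(1980, 1, 1)).days // 5 + 1

shape = (n_times, 50, 1800, 3600)   # (time, depth, lat, lon)
itemsize = 8                        # float64

chunk_bytes = 1 * 1 * 1800 * 3600 * itemsize  # one (lat, lon) slab
var_bytes = itemsize
for n in shape:
    var_bytes *= n
total_bytes = 3 * var_bytes                   # variables u, v, w

print(n_times)                     # 2704 time steps
print(chunk_bytes / 1e6)           # ~51.8 MB per chunk
print(total_bytes / 1e12)          # ~21 TB for the whole dataset
```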
I just tried running the shorter time range. I am using the following scheduler:

```python
cluster = LocalCluster(n_workers=4, threads_per_worker=4)
client = Client(cluster)
```

It looks to be heading for a crash, with memory consumption increasing to over 250 GB. I had to kill the scheduler before it crashed the whole machine. (Before that, it was running quite slowly; I projected over one hour just to do a single year.) How is this different from your successful trial? Did you actually call
Update: memory usage appears to be stable at around 10 GB if I instead use `client = Client(processes=False)`. Calculation is still undesirably slow, however.
If you still have it, can you include the bottom part of that image with the progress bars? It will help us understand which data is hanging around in memory.
Knowing that things change when processes=False is helpful. Thank you for mentioning that.
I can reproduce the failure. Thanks @rabernat!
Is my issue actually related to this PR? If not, I can open it elsewhere. This "climatology" use case is really fundamental to climate data analysis. Achieving stability and good performance on this, even for very large datasets, should be a top priority for xarray and pangeo, so I want to make sure this is appropriately visible.
Possibly. This issue is about fixing other workloads that don't navigate the task graph in a way that supports low-memory computation. Looking at a toned-down version of your computation, though, everything seems to be mostly fine.
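To make the ordering point concrete, here is a toy illustration (my own sketch, not dask's actual `dask.order` algorithm): the same task graph, executed in two different valid orders, can hold very different numbers of intermediate results in memory. The graph below folds eight loaded chunks into one running sum; loading everything up front peaks at nine live results, while interleaving loads with reductions peaks at three:

```python
def peak_live(order, deps):
    """Peak number of simultaneously live results for a given execution order."""
    consumers = {t: set() for t in order}
    for t, parents in deps.items():
        for p in parents:
            consumers[p].add(t)
    executed, live, peak = set(), set(), 0
    for t in order:
        executed.add(t)
        live.add(t)
        peak = max(peak, len(live))
        # a result may be freed once all of its consumers have run
        live = {u for u in live
                if not consumers[u] or not consumers[u] <= executed}
    return peak

# eight "load chunk" tasks x0..x7 folded into a running sum a1..a7
deps = {"a1": ("x0", "x1")}
deps.update({f"a{i}": (f"a{i-1}", f"x{i}") for i in range(2, 8)})

eager = [f"x{i}" for i in range(8)] + [f"a{i}" for i in range(1, 8)]
interleaved = ["x0", "x1", "a1"]
for i in range(2, 8):
    interleaved += [f"x{i}", f"a{i}"]

print(peak_live(eager, deps))        # 9: all loads held before any reduction
print(peak_live(interleaved, deps))  # 3: loads freed as soon as they are consumed
```

This is the kind of difference a better static ordering is meant to capture; the real scheduler has many more constraints, but the peak-memory intuition is the same.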
That seems reasonable. I'm likely to merge this PR soon.
Force-pushed from bef27d5 to af692a6

See #3013
- flake8 dask
- docs/source/changelog.rst for all changes
- one of the docs/source/*-api.rst files for new API