LAMA to Dask migration: Data.stats
#432
Conversation
cf/data/data.py (Outdated)
```diff
 if all or sample_size:
-    out["sample_size"] = int(self.sample_size())
+    out["sample_size"] = delayed(lambda: int(self.sample_size()))()
```
I know lambdas are frowned upon somewhat, but it seems intuitive to use one here instead of defining a trivial function, I think.
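(A minimal standalone sketch of the delayed-lambda pattern in the diff above; the literal values here are illustrative, not the actual cf-python code:)

```python
from dask import delayed

# A lambda is a valid callable for dask.delayed; wrapping the call
# defers evaluation until compute() is invoked.
lazy = delayed(lambda: int(40.0))()
print(lazy.compute())  # 40
```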
Hi Sadie, this is a really nice idea. Perhaps we could go one step further and potentially avoid having to read the data from disk each time. If we defined simple wrappers for the stats functions, they could share the same underlying data. E.g.

```python
import dask.array as da
from dask import delayed, compute


def max(x):
    return x.max()


def min(x):
    return x.min()


dx = da.arange(100, chunks=50)
stats = (delayed(min, pure=True)(dx), delayed(max, pure=True)(dx))
compute(*stats)
# (0, 99)
```

It would involve writing a little function for each stat (or lambdas, but I don't know how that would play with `pure=True`).

(I'm sure you can visualise the above graph, but I couldn't work out how. The docs imply that it should be possible.)
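(As an aside, one way to render the combined graph of several delayed objects is `dask.visualize`, assuming the optional `graphviz` dependency is installed; a minimal sketch continuing the example above:)

```python
import dask

# Write the merged task graph of all the delayed stats to a file.
# Requires the optional graphviz / python-graphviz packages.
dask.visualize(*stats, filename="stats_graph.svg")
```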
Thanks David, this is a great idea. Once I've tied up some loose ends elsewhere I'll have a crack at implementing it in the way you suggest, which also sounds good to me.
Hi @davidhassell, I got back onto this one this week (intermittently) as per our priority list. I've updated it to incorporate your idea in #432 (comment). Let me know if I've implemented that as you intended, correctly, and if there is any further feedback. That said, there are a few aspects we should agree on regarding the desired output for `stats`, namely the 'new way' with the Dask migration, where we return uncomputed (delayed) results rather than concrete values.
There is also the fact that the data type of the scalar values themselves is not consistent, at least for our testing case.
Updating now in line with an offline discussion with @davidhassell, in which we realised that the approach in #432 (comment) wasn't worthwhile. In the meantime, David, what are your thoughts regarding the output value format and the data-type consistency question above?
Hi Sadie - thanks for the nudge. This is a bit tricky, as the results haven't actually been computed now! Perhaps we could have a `compute` keyword that defaults to True?
Aha, I like that idea! Thanks, let's do that. I'll add it in today. I was also updating the test to cover a representative range of possible input parameters since it was too basic as it was. I'll push the commits up shortly.
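(A minimal sketch of the `compute`-keyword pattern being discussed, with hypothetical names and plain Python stand-ins rather than the actual cf-python implementation:)

```python
from dask import compute, delayed


def stats(values, compute_now=True):
    # One delayed task per statistic; nothing is evaluated yet.
    out = {
        "minimum": delayed(min)(values),
        "maximum": delayed(max)(values),
    }
    if compute_now:
        # A single compute() call evaluates every statistic at once.
        out = dict(zip(out, compute(*out.values())))
    return out


print(stats([3, 1, 2]))                     # {'minimum': 1, 'maximum': 3}
print(stats([3, 1, 2], compute_now=False))  # dict of Delayed objects
```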
All feedback incorporated so this is now ready for re-review @davidhassell. Thanks. (Note that since the last review I've also greatly improved the unit test, which now covers a representative range of input parameters.)
Co-authored-by: David Hassell <davidhassell@users.noreply.github.com>
@davidhassell ready for re-review I think. Have my last two commits amended this to what you were thinking, namely a `compute` keyword, True by default, which returns computed scalar values when True and uncomputed `Data` objects when False?
This looks perfect to me. I tried:

```python
In [1]: import cf

In [2]: f = cf.example_field(0)

In [3]: f.data.stats()
Out[3]:
{'minimum': 0.003,
 'mean': 0.046075,
 'median': 0.036,
 'maximum': 0.146,
 'range': 0.143,
 'mid_range': 0.0745,
 'standard_deviation': 0.03613612285511549,
 'root_mean_square': 0.05855531572795078,
 'sample_size': 40}

In [4]: f.data.stats(compute=False)
Out[4]:
{'minimum': <CF Data(): 0.003 1>,
 'mean': <CF Data(): 0.046075 1>,
 'median': <CF Data(): 0.036 1>,
 'maximum': <CF Data(): 0.146 1>,
 'range': <CF Data(): 0.143 1>,
 'mid_range': <CF Data(): 0.0745 1>,
 'standard_deviation': <CF Data(): 0.03613612285511549 1>,
 'root_mean_square': <CF Data(): 0.05855531572795078 1>,
 'sample_size': <CF Data(1, 1): [[40]]>}
```

The values are appearing as expected.
davidhassell left a comment:
Hi @sadielbartholomew, thanks for working through all of these subtleties. Please merge when you're ready.
Brilliant, thanks for clarifying. Merging now.
Migrates the `Data.stats` method towards #182.

Since `stats` is a compound method, in that it is in essence just reporting the outputs from various other `cf` stats/collapsing methods, each of which has already been 'daskified' and hence utilises Dask under the hood, the simplest way to migrate with full performance (bar working out and making use of any sub-calculations that might be shared between statistics, which would surely be over the top for our purposes) is, I believe, to run each statistic calculation in parallel, which in the Dask world for this context is done with `delayed` functions and a final `compute`.

This PR implements this. As indicated by the resultant Dask task graph shown below, all statistics are calculated separately, not in serial, so `stats` should take only as long as the most intensive calculation rather than the sum of all calculation times.

(I hope I haven't misunderstood the assignment here, so to speak, by taking such an approach centred on Dask's `delayed`.)

Graph
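(A toy sketch of why the delayed-plus-single-compute approach pays off: with the default threaded scheduler, the wall time of `compute` approaches the slowest task rather than the sum. All names here are illustrative, not the actual cf-python code:)

```python
import time
from dask import compute, delayed


@delayed
def slow_stat(name, seconds):
    # Stand-in for an expensive statistic calculation.
    time.sleep(seconds)
    return name


t0 = time.perf_counter()
# The three tasks run concurrently under the threaded scheduler, so
# the wall time is close to the slowest task (~1s), not the sum (~3s).
results = compute(slow_stat("min", 1), slow_stat("max", 1), slow_stat("mean", 1))
print(results, f"{time.perf_counter() - t0:.2f}s")
```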
With the code as-is on the branch/PR here and now, but with a little tweak so we can access and save the Dask task graph, the graph generated interactively is:

[task graph image: one independent branch per statistic, merging only at the final compute]

which looks right to me. (The task graph doesn't indicate it, but all results get pulled together into the `dict` output at compute time at the end, via querying the `compute` output tuple.)