[DataFrame] Implementing API correct groupby with aggregation methods#1914
robertnishihara merged 38 commits into ray-project:master
Test FAILed.
Test PASSed.
Test PASSed.
Test PASSed.
Test PASSed.
There seem to be some test failures, e.g.,
python/ray/dataframe/groupby.py
Outdated
| sort, | ||
| group_keys, | ||
| squeeze, | ||
| *part), |
Looks like a Python 2 issue; you may need to do
args=(by,
axis,
level,
as_index,
sort,
group_keys,
squeeze) + part
or something like that.
I can give this a try.
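For reference, the workaround suggested in this thread can be sketched as follows. Star-unpacking inside a tuple literal, as in `(a, b, *part)`, is only valid on Python 3.5 and later, so it fails to parse on Python 2; tuple concatenation works on both. The helper name `build_task_args` and the sample values are hypothetical, and `part` is assumed to be any iterable of partitions.

```python
def build_task_args(by, axis, level, as_index, sort, group_keys, squeeze, part):
    """Return one flat tuple: the fixed groupby arguments, then the partitions.

    Hypothetical helper for illustration; not taken from the PR.
    """
    fixed = (by, axis, level, as_index, sort, group_keys, squeeze)
    # `(by, ..., squeeze, *part)` is a SyntaxError on Python 2, so
    # concatenate instead. tuple(part) also covers the case where `part`
    # is a list rather than a tuple.
    return fixed + tuple(part)


# Placeholder values standing in for the real arguments and partitions.
args = build_task_args("key", 0, None, True, True, True, False, ["p0", "p1"])
```

The `tuple(part)` conversion matters: `tuple + list` raises a `TypeError`, so a bare `+ part` only works if `part` is already a tuple.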
Test FAILed.
| "To contribute to Pandas on Ray, please visit " | ||
| "github.com/ray-project/ray.") | ||
| elif is_list_like(arg): | ||
| from .concat import concat |
Put this import with the other imports.
Cyclical import won't allow this.
python/ray/dataframe/dataframe.py
Outdated
| else: | ||
| kwargs['temp_index'] = self.index | ||
|
|
||
| def remote_helper(df, arg, *args, **kwargs): |
More descriptive names would be helpful here for readability.
|
|
||
| # This magic unzips the list comprehension returned from remote | ||
| is_series, new_parts, index, columns = \ | ||
| [list(t) for t in zip(*remote_result)] |
Do you need each variable in list form? zip should allow auto-unboxing.
We do need most of them in list form. I'll go ahead and change it.
We do actually need them all in lists because of the ray.get()
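To make the "unzip" idiom discussed in this thread concrete, here is a minimal sketch. It assumes each remote task returns one 4-tuple; the plain strings and lists below are stand-ins for the Ray object IDs the real code would hold, and the sample data is invented for illustration.

```python
# One tuple per partition: (is_series, new_part, index, columns).
# Plain values stand in for Ray object IDs here (an assumption).
remote_result = [
    (False, "part-0", ["r0", "r1"], ["x", "y"]),
    (False, "part-1", ["r2"], ["x", "y"]),
]

# zip(*rows) transposes rows into columns. Unpacking the zip directly
# would bind tuples; wrapping each column in list() gives lists, which
# is what the author says is needed before passing them to ray.get().
is_series, new_parts, index, columns = [list(t) for t in zip(*remote_result)]
```

Plain unpacking, `a, b, c, d = zip(*remote_result)`, also works when tuples are acceptable, which is the simplification the reviewer was asking about.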
| # DataFrame, and we have to determine which here. Shouldn't add | ||
| # too much to latency in either case because the booleans can | ||
| # be returned immediately | ||
| is_series = ray.get(is_series) |
Wouldn't getting the booleans require a block on the rest of the parameters being calculated anyways?
The (de)serialization should be faster, so it should be ready sooner (given a large enough Series).
| # return DataFrames | ||
| elif any(is_series): | ||
| raise ValueError("no results.") | ||
| elif axis == 0: |
What's different between the last two cases in this if statement?
I will add better comments.
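One possible reading of the branch structure questioned here, sketched with comments. Only the `any(is_series)` check and the `"no results."` message come from the diff; the preceding `all(...)` branch, the function name `classify_results`, and the returned labels are assumptions made for illustration.

```python
def classify_results(is_series):
    """Decide what kind of object the combined result should be.

    `is_series` holds one boolean per partition. Hypothetical helper;
    the real code branches inline.
    """
    if all(is_series):
        # Every partition produced a Series (assumed first branch).
        return "series"
    if any(is_series):
        # Mixed case: some partitions returned a Series and others a
        # DataFrame, so they cannot be combined consistently.
        raise ValueError("no results.")
    # No partition produced a Series, so all of them are DataFrames.
    return "dataframe"
```

Note that `all([])` is `True`, so an empty list of partitions would fall into the first branch in this sketch.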
Test PASSed.
* master:
  updates (ray-project#1958)
  Pin Cython in autoscaler development example. (ray-project#1951)
  Incorporate C++ Buffer management and Seal global threadpool fix from arrow (ray-project#1950)
  [XRay] Add consistency check for protocol between node_manager and local_scheduler_client (ray-project#1944)
  Remove smart_open install. (ray-project#1943)
  [DataFrame] Fully implement append, concat and join (ray-project#1932)
  [DataFrame] Fix for __getitem__ string indexing (ray-project#1939)
  [DataFrame] Implementing write methods (ray-project#1918)
  [rllib] arr[end] was excluded when end is not None (ray-project#1931)
  [DataFrame] Implementing API correct groupby with aggregation methods (ray-project#1914)
  Handle interrupts correctly for ASIO synchronous reads and writes. (ray-project#1929)
  [DataFrame] Adding read methods and tests (ray-project#1712)
  Allow task_table_update to fail when tasks are finished. (ray-project#1927)
  [rllib] Contribute DDPG to RLlib (ray-project#1877)
  [xray] Workers blocked in a `ray.get` release their resources (ray-project#1920)
  Raylet task dispatch and throttling worker startup (ray-project#1912)
  [DataFrame] Eval fix (ray-project#1903)
* 'master' of https://github.com/ray-project/ray:
  [rllib] Fix broken link in docs (ray-project#1967)
  [DataFrame] Sample implement (ray-project#1954)
  [DataFrame] Implement Inter-DataFrame operations (ray-project#1937)
  remove UniqueIDHasher (ray-project#1957)
  [rllib] Add DDPG documentation, rename DDPG2 <=> DDPG (ray-project#1946)
  updates (ray-project#1958)
  Pin Cython in autoscaler development example. (ray-project#1951)
  Incorporate C++ Buffer management and Seal global threadpool fix from arrow (ray-project#1950)
  [XRay] Add consistency check for protocol between node_manager and local_scheduler_client (ray-project#1944)
  Remove smart_open install. (ray-project#1943)
  [DataFrame] Fully implement append, concat and join (ray-project#1932)
  [DataFrame] Fix for __getitem__ string indexing (ray-project#1939)
  [DataFrame] Implementing write methods (ray-project#1918)
  [rllib] arr[end] was excluded when end is not None (ray-project#1931)
  [DataFrame] Implementing API correct groupby with aggregation methods (ray-project#1914)
What do these changes do?
Adds groupby and allows users to interact with the GroupBy object the same way they would in Pandas.
- groupby implementation
- DataFrameGroupBy object
- agg/aggregate/apply for non-dictionary arguments
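As a rough illustration of the interaction this PR aims to mirror, the sketch below uses plain pandas rather than `ray.dataframe`, since the description says the GroupBy object should behave the same way. The column names and data are invented for the example, and only non-dictionary arguments to `agg` are shown, matching the scope listed above.

```python
import pandas as pd

# Toy data invented for illustration.
df = pd.DataFrame({"key": ["a", "b", "a"], "value": [1, 2, 3]})

# groupby returns a DataFrameGroupBy object.
grouped = df.groupby("key")

# A string argument names a single aggregation.
sums = grouped.agg("sum")

# A list-like argument applies several aggregations at once; the result
# has MultiIndex columns such as ("value", "min") and ("value", "max").
ranges = grouped.agg(["min", "max"])
```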