[DataFrame] Implementing API correct groupby with aggregation methods #1914

Merged

robertnishihara merged 38 commits into ray-project:master from devin-petersohn:groupby_tdd

Apr 22, 2018

Member

devin-petersohn commented Apr 17, 2018

What do these changes do?

Adds groupby and allows users to interact with the GroupBy object the same way they would in Pandas.

groupby implementation
DataFrameGroupBy object
agg / aggregate / apply for non-dictionary arguments
Some performance improvements overall.
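
For reference, these are the kinds of pandas calls the GroupBy object is meant to mirror. This is a minimal sketch using plain pandas rather than ray.dataframe; the frame contents and column names are invented for illustration.

```python
import pandas as pd

# Hypothetical sample data; "key" is the grouping column.
df = pd.DataFrame({
    "key": ["a", "b", "a", "b"],
    "x": [1, 2, 3, 4],
    "y": [10.0, 20.0, 30.0, 40.0],
})

grouped = df.groupby("key")  # GroupBy object; no aggregation has run yet

# String argument
sums = grouped.agg("sum")

# Callable argument, applied to each column of each group
ranges = grouped.agg(lambda col: col.max() - col.min())

# List of functions: one output column per (input column, function) pair
multi = grouped.agg(["sum", "max"])

# transform returns a result with the same index as the input
centered = grouped.transform(lambda col: col - col.mean())
```

Dictionary arguments to agg are left out of the sketch because the list above scopes this PR to non-dictionary arguments.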

devin-petersohn and others added 30 commits

April 17, 2018 01:06


          Implementing groupby object

117aa1e


          Making sum work

3ac3405


          Making lazy and adding axis=1 support

84de470


          Starting on agg

24ec3c3


          Fixing errors to report same as Pandas

23cdb06


          Minor changes

8f4b585


          Adding aggregate and apply for single strings

3cffd84


          Checkpointing progress

00b743c


          Start toward implementing callables

1e5bd79


          Working on agg

148ef09


          Groupby + agg functional for string or callable


          Begin implementation of groupby methods

b6872cb


          Implement remaining groupby methods

8fd1f56


          Moving to being more lazy

8f88ce0


          updating groupby

d975118


          Updating remote

b90548b


          Update groupby

078911a


          Fixing some performance issues

3fa9bd3


          Adding list support for reduction tasks

759eeec


          Removing print

e67230f


          Working on performance debug

4fd1f14


          Working on tuning

2e0b367


          Making lists of functions work for agg and apply

8871f37


          Cleaning up

0c0ea4f


          Improving serialization of agg

295f453


          implement transform

3d163a2


          resolve merge artifacts

796939a


          groupby transform works now

c6ba8c8


          temp implementation of __array__

e592957


          some error handling and kwargs cleanup

cd5a346

kunalgosar and others added 3 commits

April 17, 2018 01:06


          add a todo

e0d0e9f


          Updating groupby for utility.

a2d8b32


          Cleanup code

534fb94

devin-petersohn changed the title ~~Implementing API correct groupby with aggregation methods~~ [DataFrame] Implementing API correct groupby with aggregation methods

AmplabJenkins commented Apr 17, 2018

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/4968/

AmplabJenkins commented Apr 17, 2018

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/4969/


          Fix lint

0ff8a80

AmplabJenkins commented Apr 17, 2018

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/4970/


          Fixing tests

b12a339

AmplabJenkins commented Apr 18, 2018

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/4985/


          Fix lint

01b29d0

AmplabJenkins commented Apr 18, 2018

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/4989/

Collaborator

robertnishihara commented Apr 20, 2018

There seem to be some test failures, e.g.,

____________ ERROR collecting ray/dataframe/test/test_dataframe.py _____________
../../../.local/lib/python2.7/site-packages/pytest-3.5.0-py2.7.egg/_pytest/python.py:411: in _importtestmodule
    mod = self.fspath.pyimport(ensuresyspath=importmode)
../../../.local/lib/python2.7/site-packages/py-1.5.3-py2.7.egg/py/_path/local.py:668: in pyimport
    __import__(modname)
../../../.local/lib/python2.7/site-packages/pytest-3.5.0-py2.7.egg/_pytest/assertion/rewrite.py:213: in load_module
    py.builtin.exec_(co, mod.__dict__)
python/ray/dataframe/test/test_dataframe.py:9: in <module>
    import ray.dataframe as rdf
../../../.local/lib/python2.7/site-packages/ray-0.4.0-py2.7-linux-x86_64.egg/ray/dataframe/__init__.py:31: in <module>
    from .dataframe import DataFrame  # noqa: 402
../../../.local/lib/python2.7/site-packages/ray-0.4.0-py2.7-linux-x86_64.egg/ray/dataframe/dataframe.py:29: in <module>
    from .groupby import DataFrameGroupBy
E     File "/home/travis/.local/lib/python2.7/site-packages/ray-0.4.0-py2.7-linux-x86_64.egg/ray/dataframe/groupby.py", line 48
E       *part),
E       ^
E   SyntaxError: invalid syntax

robertnishihara reviewed


python/ray/dataframe/groupby.py Outdated

+                                                           sort,
+                                                           group_keys,
+                                                           squeeze,
+                                                           *part),

Collaborator

robertnishihara Apr 20, 2018

Looks like a Python 2 issue; you may need to do

                                       args=(by,
                                             axis,
                                             level,
                                             as_index,
                                             sort,
                                             group_keys,
                                             squeeze) + part

or something like that.
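
As a sketch of why the suggestion works: assuming the original line unpacked `*part` inside a tuple literal (as the suggested fix implies), that form only parses on Python 3.5 and later (PEP 448), whereas tuple concatenation parses on Python 2 as well. The helper name `build_args` and the sample values below are hypothetical, and `part` is assumed to be a list or tuple.

```python
def build_args(by, axis, level, as_index, sort, group_keys, squeeze, part):
    """Return the positional-argument tuple without `*part` in a literal."""
    fixed = (by, axis, level, as_index, sort, group_keys, squeeze)
    # tuple(part) guards against `part` being a list: tuple + list raises TypeError.
    return fixed + tuple(part)


# Hypothetical values standing in for the real groupby parameters and blocks.
args = build_args("key", 0, None, True, True, True, False, ["block0", "block1"])
```

If `part` is already a tuple, the plain `(...) + part` form from the comment is enough.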

Collaborator

robertnishihara Apr 20, 2018

I can give this a try.


          Fix Python 2 syntax issue.

02d7116

AmplabJenkins commented Apr 20, 2018

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/5014/

p-yang suggested changes


python/ray/dataframe/dataframe.py

+                              "To contribute to Pandas on Ray, please visit "
+                              "github.com/ray-project/ray.")
+                      elif is_list_like(arg):
+                          from .concat import concat

Contributor

p-yang Apr 20, 2018

Put this import with the other imports.

Member Author

devin-petersohn Apr 20, 2018

Cyclical import won't allow this.
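
A function-level import is a common way around a circular dependency. The sketch below illustrates the general pattern with two throwaway modules written to a temporary directory; the module names `frame_mod` and `concat_mod` are hypothetical, and the assumption that the concat module imports the frame class at load time is made for the example, not taken from the PR.

```python
import importlib
import os
import sys
import tempfile
import textwrap

FRAME_SRC = textwrap.dedent('''
    class Frame(object):
        def __init__(self, rows):
            self.rows = list(rows)

        def combine(self, other):
            # Deferred import: concat_mod imports this module at load time,
            # so importing it at the top of this file would be circular.
            from concat_mod import concat
            return concat([self, other])
''')

CONCAT_SRC = textwrap.dedent('''
    from frame_mod import Frame

    def concat(frames):
        rows = []
        for frame in frames:
            rows.extend(frame.rows)
        return Frame(rows)
''')

# Write both hypothetical modules to a temporary directory and make it importable.
tmp_dir = tempfile.mkdtemp()
for name, src in (("frame_mod.py", FRAME_SRC), ("concat_mod.py", CONCAT_SRC)):
    with open(os.path.join(tmp_dir, name), "w") as handle:
        handle.write(src)

sys.path.insert(0, tmp_dir)
importlib.invalidate_caches()
frame_mod = importlib.import_module("frame_mod")

# By the time combine() runs, frame_mod is fully loaded, so the import succeeds.
combined = frame_mod.Frame([1, 2]).combine(frame_mod.Frame([3]))
```

The trade-off is that the dependency is hidden inside the function body, which is why a short comment next to the import helps.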

python/ray/dataframe/dataframe.py Outdated

+                      else:
+                          kwargs['temp_index'] = self.index
+                      def remote_helper(df, arg, *args, **kwargs):

Contributor

p-yang Apr 20, 2018

More descriptive names would be helpful here for readability.

python/ray/dataframe/dataframe.py

+                      # This magic unzips the list comprehension returned from remote
+                      is_series, new_parts, index, columns = \
+                          [list(t) for t in zip(*remote_result)]

Contributor

p-yang Apr 20, 2018

Do you need each variable in list form? zip should allow auto-unboxing.

Member Author

devin-petersohn Apr 20, 2018

We do need most of them in list form. I'll go ahead and change it.

Member Author

devin-petersohn Apr 20, 2018

We do actually need them all in lists because of the ray.get() call.
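
The unzip idiom under discussion transposes a list of per-partition tuples into one sequence per field. Here is a minimal sketch with plain tuples standing in for the remote results; the values are invented and no Ray calls are made. `zip` yields tuples, so the comprehension converts each one to a list, which is the form the thread says is needed.

```python
# Hypothetical per-partition results: (is_series, part, index, columns).
remote_result = [
    (False, "part0", [0, 1], ["a", "b"]),
    (False, "part1", [2, 3], ["a", "b"]),
    (True, "part2", [4], ["a", "b"]),
]

# zip(*rows) regroups the i-th element of every tuple together.
is_series, new_parts, index, columns = [list(t) for t in zip(*remote_result)]
```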

python/ray/dataframe/dataframe.py

+                      # DataFrame, and we have to determine which here. Shouldn't add
+                      # too much to latency in either case because the booleans can
+                      # be returned immediately
+                      is_series = ray.get(is_series)

Contributor

p-yang Apr 20, 2018

Wouldn't getting the booleans require blocking on the rest of the parameters being calculated anyway?

Member Author

devin-petersohn Apr 20, 2018

The (de)serialization should be faster, so it should be ready sooner (given a large enough Series).

python/ray/dataframe/dataframe.py

+                      # return DataFrames
+                      elif any(is_series):
+                          raise ValueError("no results.")
+                      elif axis == 0:

Contributor

p-yang Apr 20, 2018

What's different between the last two cases in this if statement?

Member Author

devin-petersohn Apr 20, 2018

I will add better comments.


          Addressing comments

d341cb2

AmplabJenkins commented Apr 20, 2018

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/5021/

robertnishihara approved these changes


robertnishihara merged commit 8f59546 into ray-project:master

alok added a commit to alok/ray that referenced this pull request


          Merge branch 'master' into pytorch-trpo

700a7b0

* master:
  updates (ray-project#1958)
  Pin Cython in autoscaler development example. (ray-project#1951)
  Incorporate C++ Buffer management and Seal global threadpool fix from arrow (ray-project#1950)
  [XRay] Add consistency check for protocol between node_manager and local_scheduler_client (ray-project#1944)
  Remove smart_open install. (ray-project#1943)
  [DataFrame] Fully implement append, concat and join (ray-project#1932)
  [DataFrame] Fix for __getitem__ string indexing (ray-project#1939)
  [DataFrame] Implementing write methods (ray-project#1918)
  [rllib] arr[end] was excluded when end is not None (ray-project#1931)
  [DataFrame] Implementing API correct groupby with aggregation methods (ray-project#1914)
  Handle interrupts correctly for ASIO synchronous reads and writes. (ray-project#1929)
  [DataFrame] Adding read methods and tests (ray-project#1712)
  Allow task_table_update to fail when tasks are finished. (ray-project#1927)
  [rllib] Contribute DDPG to RLlib (ray-project#1877)
  [xray] Workers blocked in a `ray.get` release their resources (ray-project#1920)
  Raylet task dispatch and throttling worker startup (ray-project#1912)
  [DataFrame] Eval fix (ray-project#1903)

royf added a commit to royf/ray that referenced this pull request


          Merge branch 'master' of https://github.com/ray-project/ray

b87c06d

* 'master' of https://github.com/ray-project/ray:
  [rllib] Fix broken link in docs (ray-project#1967)
  [DataFrame] Sample implement (ray-project#1954)
  [DataFrame] Implement Inter-DataFrame operations (ray-project#1937)
  remove UniqueIDHasher (ray-project#1957)
  [rllib] Add DDPG documentation, rename DDPG2 <=> DDPG (ray-project#1946)
  updates (ray-project#1958)
  Pin Cython in autoscaler development example. (ray-project#1951)
  Incorporate C++ Buffer management and Seal global threadpool fix from arrow (ray-project#1950)
  [XRay] Add consistency check for protocol between node_manager and local_scheduler_client (ray-project#1944)
  Remove smart_open install. (ray-project#1943)
  [DataFrame] Fully implement append, concat and join (ray-project#1932)
  [DataFrame] Fix for __getitem__ string indexing (ray-project#1939)
  [DataFrame] Implementing write methods (ray-project#1918)
  [rllib] arr[end] was excluded when end is not None (ray-project#1931)
  [DataFrame] Implementing API correct groupby with aggregation methods (ray-project#1914)
