[DataFrame] Implement Inter-DataFrame operations by devin-petersohn · Pull Request #1937 · ray-project/ray

devin-petersohn · 2018-04-23T04:07:35Z

Implements the inter-DataFrame and scalar operations:

add, __add__
radd, __radd__
__iadd__
sub, __sub__, subtract
rsub, __rsub__
__isub__
mul, __mul__, multiply
rmul, __rmul__
div, __div__, divide
floordiv, __floordiv__
rfloordiv, __rfloordiv__
__ifloordiv__
truediv, __truediv__
rtruediv, __rtruediv__
__itruediv__
mod, __mod__
rmod, __rmod__
__imod__
pow, __pow__
rpow, __rpow__
__ipow__

Depends on #1932; don't merge this until that PR is merged.

Edit: Also add comparison methods:

ge, __ge__
gt, __gt__
le, __le__
lt, __lt__
eq, __eq__
ne, __ne__
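
For reference, the method names listed in this description follow the pandas flex-method naming, where each named method has an operator (dunder) counterpart and the r-prefixed variants swap the operands. A small sketch in plain pandas (not the Ray DataFrame from this PR) of the behavior these methods are expected to match:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})

# Named method and operator give the same result for a scalar operand.
added = df.add(1)
assert added.equals(df + 1)

# Reflected variant: rsub computes other - df rather than df - other.
reflected = df.rsub(10)
assert reflected.equals(10 - df)

# Comparison methods return a DataFrame of booleans.
mask = df.ge(2)
assert mask.equals(df >= 2)
```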

AmplabJenkins · 2018-04-23T05:11:19Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/5035/
Test PASSed.

p-yang · 2018-04-23T06:57:11Z

python/ray/dataframe/dataframe.py

Please note here (and everywhere else), for the eventual refactor, that this can cause implicit serialization issues if other is a Series.

p-yang · 2018-04-23T06:59:58Z

python/ray/dataframe/dataframe.py

"Multilevel" misspelling

p-yang · 2018-04-23T07:02:41Z

python/ray/dataframe/dataframe.py

Might be worth putting a note here (like you did in the join code) for the future to join on metadatas, enabling you to pass metadatas below.

Added below

p-yang · 2018-04-23T07:04:22Z

python/ray/dataframe/dataframe.py

If it's non-list-like (scalar), probably best to perform the action on block partitions (a la applymap)
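
A rough sketch of the block-wise idea for scalars, using plain pandas DataFrames as stand-ins for block partitions. The 2x2 block layout and the helper name apply_scalar_to_blocks are hypothetical and only illustrate the suggestion; they are not code from this PR:

```python
import pandas as pd


def apply_scalar_to_blocks(blocks, op, scalar):
    """Apply op(block, scalar) to every block independently.

    A scalar needs no index or column alignment, so each block can be
    processed on its own without looking at any other block.
    """
    return [[op(block, scalar) for block in row] for row in blocks]


full = pd.DataFrame({"a": [1, 2, 3, 4], "b": [10, 20, 30, 40]})

# Hypothetical layout: rows split in half, then columns split in half.
blocks = [
    [full.iloc[:2, :1], full.iloc[:2, 1:]],
    [full.iloc[2:, :1], full.iloc[2:, 1:]],
]

result_blocks = apply_scalar_to_blocks(blocks, pd.DataFrame.add, 5)

# Reassemble the blocks to compare against the unpartitioned result.
result = pd.concat([pd.concat(row, axis=1) for row in result_blocks])
```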

p-yang · 2018-04-23T07:06:06Z

Didn't get to look too deep overall, but I'll take another look later on. One thing I'll note for all the math functions, though, is that you can reduce the amount of copied code by having one math archetype (similar to _arithmetic_helper) that takes in a class function like pandas.DataFrame.add.
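
One way this could look, as a minimal sketch: a single helper receives the unbound pandas method and every public math method becomes a one-line wrapper. The FrameSketch class and the _operator_helper name are hypothetical stand-ins for illustration, wrapping a single pandas DataFrame rather than the partitioned structure used in this PR:

```python
import pandas as pd


class FrameSketch:
    """Hypothetical wrapper around one pandas DataFrame."""

    def __init__(self, df):
        self._df = df

    def _operator_helper(self, func, other, axis, level, fill_value):
        # Single archetype: func is an unbound pandas method such as
        # pandas.DataFrame.add, so no per-operator logic is repeated.
        if isinstance(other, FrameSketch):
            other = other._df
        return FrameSketch(
            func(self._df, other, axis=axis, level=level,
                 fill_value=fill_value))

    def add(self, other, axis="columns", level=None, fill_value=None):
        return self._operator_helper(
            pd.DataFrame.add, other, axis, level, fill_value)

    def sub(self, other, axis="columns", level=None, fill_value=None):
        return self._operator_helper(
            pd.DataFrame.sub, other, axis, level, fill_value)

    def mul(self, other, axis="columns", level=None, fill_value=None):
        return self._operator_helper(
            pd.DataFrame.mul, other, axis, level, fill_value)

    # Operator forms reuse the named methods with their defaults.
    __add__ = add
    __sub__ = sub
    __mul__ = mul


left = FrameSketch(pd.DataFrame({"a": [1, 2]}))
right = FrameSketch(pd.DataFrame({"a": [10, 20]}))
total = left + right
```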

AmplabJenkins · 2018-04-23T07:36:25Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/5039/
Test FAILed.

AmplabJenkins · 2018-04-24T23:12:04Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/5058/
Test FAILed.

AmplabJenkins · 2018-04-24T23:35:00Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/5057/
Test PASSed.

AmplabJenkins · 2018-04-25T05:34:24Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/5063/
Test PASSed.

AmplabJenkins · 2018-04-25T07:50:14Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/5069/
Test FAILed.

AmplabJenkins · 2018-04-25T07:53:27Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/5065/
Test PASSed.

AmplabJenkins · 2018-04-25T07:56:38Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/5066/
Test PASSed.

AmplabJenkins · 2018-04-25T08:27:42Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/5070/
Test PASSed.

AmplabJenkins · 2018-04-26T19:55:58Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/5077/
Test PASSed.

AmplabJenkins · 2018-04-27T05:05:12Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/5079/
Test PASSed.

AmplabJenkins · 2018-04-27T18:40:51Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/5088/
Test PASSed.

AmplabJenkins · 2018-04-27T20:21:39Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/5090/
Test PASSed.

AmplabJenkins · 2018-04-28T04:05:42Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/5098/
Test PASSed.

AmplabJenkins · 2018-04-28T19:51:54Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/5105/
Test FAILed.

devin-petersohn · 2018-04-28T20:06:39Z

Jenkins, retest this please.

AmplabJenkins · 2018-04-28T21:11:01Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/5106/
Test PASSed.

AmplabJenkins · 2018-04-29T05:24:32Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/5113/
Test PASSed.

robertnishihara · 2018-04-29T16:17:40Z

python/ray/dataframe/utils.py

+
+
+@ray.remote
+def co_op_helper(func, left_columns, right_columns, left_df_len, *zipped):


Is co short for "column"? May be worth clarifying this. This function could probably use a docstring.

Also how do you determine which function names in this file should start with _?

co op is short for copartition operation. I will clarify. The leading underscore is something I have been wanting to clean up; this function should lead with an underscore.

robertnishihara · 2018-04-29T16:22:40Z

python/ray/dataframe/dataframe.py

+        else:
+            return self._single_df_op_helper(
+                lambda df: df.eq(other, axis, level),
+                other, axis, level)


Instead of this if statement, should we just use _iter_and_single_df_op_helper?

robertnishihara · 2018-04-29T16:22:51Z

python/ray/dataframe/dataframe.py

+        else:
+            return self._single_df_op_helper(
+                lambda df: df.ge(other, axis, level),
+                other, axis, level)


Instead of this if statement, should we just use _iter_and_single_df_op_helper?

robertnishihara · 2018-04-29T16:23:00Z

python/ray/dataframe/dataframe.py

+        else:
+            return self._single_df_op_helper(
+                lambda df: df.gt(other, axis, level),
+                other, axis, level)


Instead of this if statement, should we just use _iter_and_single_df_op_helper?

The same question applies in a few more places.

robertnishihara · 2018-04-29T16:24:30Z

python/ray/dataframe/dataframe.py

+
+        Returns:
+            A new DataFrame filled with Booleans.
+        """


General question. I thought we were inheriting docstrings from pandas. Does that mean that these docstrings are redundant?

True, these docs are internal for us. Ideally we wouldn't have to go to the Pandas docs each time we want to look at a method.

robertnishihara · 2018-04-29T16:26:01Z

python/ray/dataframe/dataframe.py

            "To contribute to Pandas on Ray, please visit "
            "github.com/ray-project/ray.")
+
+    def _copartition(self, other, new_index):


This method repartitions the two DFs so that they have the same partitioning?

Yes, based on the index. I will add more detailed notes here.
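
As an illustration of index-based copartitioning, here is a minimal sketch in plain pandas. It assumes the shared index is the sorted union of both indexes and that partitions are contiguous row ranges; the function name copartition and the num_partitions parameter are hypothetical, and the actual _copartition in this PR may work differently:

```python
import pandas as pd


def copartition(left, right, num_partitions=2):
    """Align two frames on a shared index, then cut both at the same rows."""
    # Assumption: the shared index is the union of the two indexes.
    new_index = left.index.union(right.index)
    left_aligned = left.reindex(new_index)
    right_aligned = right.reindex(new_index)

    # Ceiling division so every row lands in some partition.
    chunk = max(1, -(-len(new_index) // num_partitions))
    bounds = range(0, len(new_index), chunk)
    left_parts = [left_aligned.iloc[i:i + chunk] for i in bounds]
    right_parts = [right_aligned.iloc[i:i + chunk] for i in bounds]
    return left_parts, right_parts


left = pd.DataFrame({"x": [1, 2, 3]}, index=[0, 1, 2])
right = pd.DataFrame({"x": [10, 20, 30]}, index=[1, 2, 3])

left_parts, right_parts = copartition(left, right)

# Matching partitions hold the same index labels, so the operation can
# run pairwise without consulting any other partition.
result = pd.concat(
    [l_part.add(r_part) for l_part, r_part in zip(left_parts, right_parts)])
```

Once both frames share the same row boundaries, each pair of partitions can be handed to a worker independently, which is the property an inter-DataFrame operation needs.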

AmplabJenkins · 2018-04-29T17:40:43Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/5115/
Test FAILed.

AmplabJenkins · 2018-04-29T18:20:26Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/5116/
Test PASSed.

robertnishihara · 2018-04-30T01:39:26Z

Could you fix the linting errors?

$ flake8 --exclude=python/ray/core/src/common/flatbuffers_ep-prefix/,python/ray/core/generated/,src/common/format/,doc/source/conf.py,python/ray/cloudpickle/
./python/ray/dataframe/dataframe.py:2193:47: E128 continuation line under-indented for visual indent
./python/ray/dataframe/dataframe.py:2194:47: E128 continuation line under-indented for visual indent
./python/ray/dataframe/dataframe.py:2199:47: E128 continuation line under-indented for visual indent
./python/ray/dataframe/dataframe.py:2200:47: E128 continuation line under-indented for visual indent
./python/ray/dataframe/dataframe.py:2238:47: E128 continuation line under-indented for visual indent
./python/ray/dataframe/dataframe.py:2239:47: E128 continuation line under-indented for visual indent
./python/ray/dataframe/dataframe.py:2245:29: E127 continuation line over-indented for visual indent
./python/ray/dataframe/dataframe.py:4012:30: E127 continuation line over-indented for visual indent
./python/ray/dataframe/dataframe.py:4019:30: E127 continuation line over-indented for visual indent

AmplabJenkins · 2018-04-30T02:51:32Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/5122/
Test FAILed.

devin-petersohn · 2018-04-30T05:09:27Z

Jenkins, retest this please.

AmplabJenkins · 2018-04-30T06:14:14Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/5125/
Test PASSed.

* 'master' of https://github.com/ray-project/ray:
[rllib] Fix broken link in docs (ray-project#1967)
[DataFrame] Sample implement (ray-project#1954)
[DataFrame] Implement Inter-DataFrame operations (ray-project#1937)
remove UniqueIDHasher (ray-project#1957)
[rllib] Add DDPG documentation, rename DDPG2 <=> DDPG (ray-project#1946)
updates (ray-project#1958)
Pin Cython in autoscaler development example. (ray-project#1951)
Incorporate C++ Buffer management and Seal global threadpool fix from arrow (ray-project#1950)
[XRay] Add consistency check for protocol between node_manager and local_scheduler_client (ray-project#1944)
Remove smart_open install. (ray-project#1943)
[DataFrame] Fully implement append, concat and join (ray-project#1932)
[DataFrame] Fix for __getitem__ string indexing (ray-project#1939)
[DataFrame] Implementing write methods (ray-project#1918)
[rllib] arr[end] was excluded when end is not None (ray-project#1931)
[DataFrame] Implementing API correct groupby with aggregation methods (ray-project#1914)

p-yang reviewed Apr 23, 2018


devin-petersohn added 6 commits April 24, 2018 21:30

Implement inter ops simple cases

c1eb8b5

Implementing all the inter-df math ops

3e0a509

Adding comparison methods

ea76921

Starting to convert to block implementations

49f602f

Transition to blocks implementation

92a7c96

Fix column length bug

71ab6ad

devin-petersohn force-pushed the df_inter_ops branch from 2deb1d8 to 71ab6ad Compare April 25, 2018 04:31

Fix lint

811cf2e

devin-petersohn added 4 commits April 24, 2018 23:43

Fixing tests

79a485a

Fix lint

2e2a5b0

Addressing comment

9409d58

Addressing comments

a21a548

Fixing ascii error

ac8aace

Fixing unicode error

fd39b52

Fixing Python2 compat issue

2798d2f

devin-petersohn added 2 commits April 27, 2018 12:16

Fix python2 compat

b28b1c1

Fix duplicate line

9e17cba

Fixing python2 compat

a25c90d

Fixing compat

5a436c8

devin-petersohn mentioned this pull request Apr 28, 2018

[DataFrame] Implement df.merge #1964

Merged

1 task

Fixing test

d946fb5

robertnishihara reviewed Apr 29, 2018


devin-petersohn added 2 commits April 29, 2018 10:12

Addressing comments

f705a8f

Adding more docs

736eb43

robertnishihara approved these changes Apr 30, 2018


Fix lint

776e36d

robertnishihara merged commit 0c477fb into ray-project:master Apr 30, 2018


