[DataFrame] Improve performance of iteration methods by kunalgosar · Pull Request #2026 · ray-project/ray

kunalgosar · 2018-05-10T01:18:47Z

What do these changes do?

Makes the DataFrame iteration methods (iterrows, items, itertuples) much faster. They now use generators to iterate through the row/column partitions and fetch each partition's data only when it is needed.
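As a rough illustration of the idea (not the code in this PR), the sketch below chains per-partition iteration inside a generator so that a partition is fetched only when iteration reaches it. The names `fetch` and `iter_rows` are hypothetical, and `fetch` merely stands in for a call such as `ray.get` on a remote partition.

```python
fetched = []  # records which partitions have been materialized


def fetch(partition_id):
    """Hypothetical stand-in for ray.get(): returns the rows of one partition."""
    fetched.append(partition_id)
    base = partition_id * 2
    return [(base, "row%d" % base), (base + 1, "row%d" % (base + 1))]


def iter_rows(partition_ids):
    """Yield (index, row) pairs, fetching each partition only when reached."""
    for pid in partition_ids:
        for item in fetch(pid):
            yield item


rows = iter_rows([0, 1, 2])
first = next(rows)  # only partition 0 has been fetched at this point
```

Consuming the rest of `rows` fetches the remaining partitions one at a time, which is why a caller that stops early never pays for partitions it did not reach.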

Performance Analysis:

New Performance:

In [6]: df = pd.DataFrame(np.random.randint(0,100,size=(100000, 200)))

In [7]: %time x = list(df.iterrows())
CPU times: user 6.68 s, sys: 106 ms, total: 6.79 s
Wall time: 7.35 s

In [8]: %time x = list(df.items())
CPU times: user 162 ms, sys: 40 ms, total: 202 ms
Wall time: 511 ms

In [9]: %time x = list(df.itertuples())
CPU times: user 1.69 s, sys: 167 ms, total: 1.86 s
Wall time: 2.18 s

Old Performance:

In [5]: df = pd.DataFrame(np.random.randint(0,100,size=(100000, 200)))

In [6]: %time x = list(df.iterrows())
CPU times: user 7min 16s, sys: 5.28 s, total: 7min 22s
Wall time: 7min 57s

In [7]: %time x = list(df.items())
CPU times: user 1.18 s, sys: 412 ms, total: 1.59 s
Wall time: 3.13 s

In [8]: %time x = list(df.itertuples())
CPU times: user 5.45 s, sys: 515 ms, total: 5.97 s
Wall time: 9.81 s

Standard Pandas Performance:

In [4]: df = pd.DataFrame(np.random.randint(0,100,size=(100000, 200)))

In [5]: %time x = list(df.iterrows())
CPU times: user 6.44 s, sys: 107 ms, total: 6.54 s
Wall time: 6.58 s

In [6]: %time x = list(df.items())
CPU times: user 128 ms, sys: 29.9 ms, total: 158 ms
Wall time: 163 ms

In [7]: %time x = list(df.itertuples())
CPU times: user 2.63 s, sys: 176 ms, total: 2.81 s
Wall time: 2.9 s
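For a quick comparison, the wall times above can be turned into ratios. The snippet below only does that arithmetic on the numbers already listed; each figure comes from a single `%time` run, so the ratios are indicative rather than precise.

```python
# Wall-clock times in seconds, copied from the three runs above.
wall = {
    "iterrows": {"old": 7 * 60 + 57, "new": 7.35, "pandas": 6.58},
    "items": {"old": 3.13, "new": 0.511, "pandas": 0.163},
    "itertuples": {"old": 9.81, "new": 2.18, "pandas": 2.9},
}

# How much faster the new code is than the old code.
speedup = {name: t["old"] / t["new"] for name, t in wall.items()}
# New wall time as a multiple of plain pandas wall time.
vs_pandas = {name: t["new"] / t["pandas"] for name, t in wall.items()}

for name in wall:
    print("%-10s %5.1fx faster than before, %.2fx pandas wall time"
          % (name, speedup[name], vs_pandas[name]))
```

On these numbers iterrows improves by roughly 65x, items by roughly 6x, and itertuples by 4.5x.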

Related issue number

#2025

AmplabJenkins · 2018-05-10T02:20:25Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/5302/
Test PASSed.

devin-petersohn

Left a few comments, looks pretty good!

devin-petersohn · 2018-05-10T02:35:59Z

python/ray/dataframe/iterator.py

This should extend iterator.

devin-petersohn · 2018-05-10T02:38:15Z

python/ray/dataframe/iterator.py

Can we do without the index reference?

The index reference here is needed, as it is used to get the outer index or columns for that partition.

Can you handle the increment outside of this class?

The purpose of the Iterator is to iterate through each partition. It would be possible to define a function that increments the partitions outside of this class, but that would make the code much more complex.

Right now, it checks if there are any items remaining in the current partition, and if not, increments curr_partition and gets the next one.
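The behaviour described above could look roughly like the following. This is a minimal sketch, not the contents of python/ray/dataframe/iterator.py: only the name `curr_partition` comes from the discussion, while `partitions`, `func`, and `iter_cache` are assumptions made for illustration.

```python
class PartitionIterator(object):
    """Sketch: walk a list of partitions, yielding every item in order.

    `partitions` and `func` are hypothetical; `func` turns one partition
    into an iterator over its items.
    """

    def __init__(self, partitions, func):
        self.partitions = partitions
        self.func = func
        self.curr_partition = -1
        self.iter_cache = iter(())  # empty until the first partition loads

    def __iter__(self):
        return self

    def __next__(self):
        while True:
            try:
                return next(self.iter_cache)
            except StopIteration:
                # Current partition is exhausted: advance to the next one.
                self.curr_partition += 1
                if self.curr_partition >= len(self.partitions):
                    raise StopIteration
                self.iter_cache = self.func(self.partitions[self.curr_partition])

    next = __next__  # Python 2 spelling


it = PartitionIterator([[1, 2], [], [3]], iter)
result = list(it)
```

The `while` loop lets the iterator skip over empty partitions instead of stopping at the first one it meets.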

devin-petersohn · 2018-05-10T02:44:14Z

python/ray/dataframe/dataframe.py

Can you not just return partition_iterator?

Currently, I do this to ensure that the return type of the function is a generator, which is consistent with pandas. Let me know if you think I should still change it.

That sounds great, thanks for clarifying.
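To make the return-type point concrete: returning an iterator object directly and yielding from it inside a function both produce the same items, but only the latter gives the caller a generator object. The sketch below shows the difference; `Numbers` and `as_generator` are hypothetical names, not code from this PR.

```python
import types


class Numbers(object):
    """Hypothetical iterator class standing in for partition_iterator."""

    def __init__(self, values):
        self._it = iter(values)

    def __iter__(self):
        return self

    def __next__(self):
        return next(self._it)

    next = __next__  # Python 2 spelling


def as_generator(iterator):
    """Wrap any iterator so the caller receives a true generator object."""
    for item in iterator:
        yield item


plain = Numbers([1, 2, 3])
wrapped = as_generator(Numbers([1, 2, 3]))
```

Here `wrapped` is an instance of `types.GeneratorType` while `plain` is not, even though iterating over either yields 1, 2, 3.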

AmplabJenkins · 2018-05-10T20:51:12Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/5318/
Test PASSed.

devin-petersohn

Left a couple of other comments. Thanks!

devin-petersohn · 2018-05-16T19:47:59Z

python/ray/dataframe/iterator.py

Nit: from collections import Iterator

devin-petersohn · 2018-05-16T19:48:20Z

python/ray/dataframe/iterator.py

Nit: class PartitionIterator(Iterator):
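A note for anyone reading this on a newer Python: the `Iterator` abstract base class lives in `collections.abc`, and the plain `collections` alias was removed in Python 3.10. The sketch below shows what extending it buys, using a hypothetical `Countdown` class: only the "next" step has to be written, and `__iter__` is supplied by the base class.

```python
try:
    from collections.abc import Iterator  # Python 3.3+
except ImportError:  # Python 2.7
    from collections import Iterator


class Countdown(Iterator):
    """Hypothetical example class; only the 'next' step is defined."""

    def __init__(self, start):
        self.remaining = start

    def __next__(self):
        if self.remaining <= 0:
            raise StopIteration
        self.remaining -= 1
        return self.remaining + 1

    next = __next__  # the abstract method's name on Python 2


values = list(Countdown(3))
```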

devin-petersohn · 2018-05-16T19:59:55Z

python/ray/dataframe/dataframe.py

I was thinking something along the lines of this to resolve the index comment below:

    index_iter = (obj.index for obj in self._row_metadata.partition_series)

    def itertuples_helper(part):
        df = ray.get(part)
        df.columns = self.columns
        df.index = next(index_iter)
        return df.itertuples(index=index, name=name)

Something like this.

I agree, this is much better. I've updated the PR in this way. Thanks!

devin-petersohn · 2018-05-16T20:00:40Z

python/ray/dataframe/dataframe.py

That sounds great, thanks for clarifying.

devin-petersohn

Looks great, one minor nit.

devin-petersohn · 2018-05-16T23:39:35Z

python/ray/dataframe/dataframe.py

-            series.index = self.columns
-            series.name = list(self.index)[i]
-            return series
+        index_iter = iter([self._row_metadata.partition_series(i).index


Prefer index_iter = (self._row_metadata.partition_series(i).index for i in range(len(self._row_partitions)))
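The practical difference between the two spellings is when the per-partition indexes get computed. Assuming the `iter([...` form in the diff wraps a list comprehension, it builds every index before iteration starts, whereas the generator expression builds each one on demand. The sketch below demonstrates this with a hypothetical `partition_index` function standing in for `self._row_metadata.partition_series(i).index`.

```python
computed = []  # records which partition indexes have been built


def partition_index(i):
    """Hypothetical stand-in for self._row_metadata.partition_series(i).index."""
    computed.append(i)
    return ["row%d" % i]


n_partitions = 4

# List wrapped in iter(): every index is built immediately.
eager = iter([partition_index(i) for i in range(n_partitions)])
eager_calls = list(computed)

del computed[:]

# Generator expression: nothing is built until next() is called.
lazy = (partition_index(i) for i in range(n_partitions))
lazy_calls_before = list(computed)
first = next(lazy)
lazy_calls_after = list(computed)
```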

AmplabJenkins · 2018-05-17T00:13:30Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/5436/
Test PASSed.

AmplabJenkins · 2018-05-17T01:20:42Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/5443/
Test PASSed.

devin-petersohn · 2018-05-17T21:45:29Z

Merged, thanks @kunalgosar!


kunalgosar mentioned this pull request May 10, 2018

[DataFrame] iterrows() is slow for larger dataframes #2025

Closed

devin-petersohn reviewed May 10, 2018

devin-petersohn reviewed May 16, 2018

kunalgosar added 4 commits May 16, 2018 15:32

fix iterrows

afc2945

make iteration methods performant

16f8449

resolving comments

a473ac2

remove indexing from iterator

29d9bb2

kunalgosar force-pushed the iterrows branch from fea9db2 to 29d9bb2 Compare May 16, 2018 22:48

devin-petersohn reviewed May 16, 2018

switch to iterator syntax

6acfe91

devin-petersohn approved these changes May 17, 2018

devin-petersohn merged commit afbb260 into ray-project:master May 17, 2018
