Skip to content

[DataFrame] Adding insert, set_axis, set_index, reset_index and tests#1603

Merged
robertnishihara merged 3 commits intoray-project:masterfrom
devin-petersohn:df_patch04
Feb 26, 2018
Merged

[DataFrame] Adding insert, set_axis, set_index, reset_index and tests#1603
robertnishihara merged 3 commits intoray-project:masterfrom
devin-petersohn:df_patch04

Conversation

@devin-petersohn
Copy link
Copy Markdown
Member

Mostly used Pandas implementations for Index related things. Also was able to pull in some of their error checking.

@AmplabJenkins
Copy link
Copy Markdown

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/3945/
Test PASSed.

from pandas.core.dtypes.cast import maybe_upcast_putmask
from pandas.compat import lzip

import warnings
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we be using the warnings module everywhere else in the codebase? Right now we're just printing warnings to stdout, but that's almost certainly the wrong approach.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably. We are using warnings here because that's what Pandas does. Warnings are typically sent to stderr. They also show up differently in a Jupyter Notebook.

cumulative = np.cumsum(self._lengths)
partitions = [value[cumulative[i-1]:cumulative[i]]
for i in range(len(cumulative))
if i != 0]
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not really a big deal, but you could do range(1, len(cumulative)) and get rid of the if statement

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I agree. @simon-mo can update this in his loc and iloc PR as an optimization.


def insert(self, loc, column, value, allow_duplicates=False):
raise NotImplementedError("Not Yet implemented.")
"""Insert column into DataFrame at specified location.
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With documentation, it just occurred to me that while it is important to document arguments and return values, the documentation here is largely the same as pandas. The most important thing to draw attention to is any deviations from pandas. That doesn't apply to this function specifically, but just in general we should make sure to highlight differences where they occur.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The public API should be 100% the same, excluding possibly the constructor (in the future). Most people create DataFrames from files, so I doubt the constructor will be used outside of examples and small test cases. The return types and errors raised should all be the same for everything implemented so far. I think putting together some documentation for this in the readthedocs is probably going to happen after our blog post this week.

return pd.concat(ray.get(df._df))
pd_df = pd.concat(ray.get(df._df))
pd_df.index = df.index
pd_df.columns = df.columns
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In what situations does to_pandas behave differently now? That is, when are the index and columns fields different from what you would get from pd.concat?

Do the tests compare equality of these fields? And are there tests where the old version of to_pandas would fail?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is an internal change to make it faster to lookup and retrieve things in the index (i.e. don't require a remote task to fetch this information). It doesn't change the tests at all, but the old version of to_pandas will not have the correct index and columns. @simon-mo has an update to this to start making the index on the inner frame more like the design.

@robertnishihara robertnishihara merged commit 1fa59f1 into ray-project:master Feb 26, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants