[DataFrame] Adding insert, set_axis, set_index, reset_index and tests#1603
[DataFrame] Adding insert, set_axis, set_index, reset_index and tests#1603robertnishihara merged 3 commits intoray-project:masterfrom
Conversation
|
Test PASSed. |
| from pandas.core.dtypes.cast import maybe_upcast_putmask | ||
| from pandas.compat import lzip | ||
|
|
||
| import warnings |
There was a problem hiding this comment.
Should we be using the warnings module everywhere else in the codebase? Right now we're just printing warnings to stdout, but that's almost certainly the wrong approach.
There was a problem hiding this comment.
Probably. We are using warnings here because that's what Pandas does. Warnings are typically sent to stderr. They also show up differently in a Jupyter Notebook.
| cumulative = np.cumsum(self._lengths) | ||
| partitions = [value[cumulative[i-1]:cumulative[i]] | ||
| for i in range(len(cumulative)) | ||
| if i != 0] |
There was a problem hiding this comment.
not really a big deal, but you could do range(1, len(cumulative)) and get rid of the if statement
There was a problem hiding this comment.
Yes, I agree. @simon-mo can update this in his loc and iloc PR as an optimization.
|
|
||
| def insert(self, loc, column, value, allow_duplicates=False): | ||
| raise NotImplementedError("Not Yet implemented.") | ||
| """Insert column into DataFrame at specified location. |
There was a problem hiding this comment.
With documentation, it just occurred to me that while it is important to document arguments and return values, the documentation here is largely the same as pandas. The most important thing to draw attention to is any deviations from pandas. That doesn't apply to this function specifically, but just in general we should make sure to highlight differences where they occur.
There was a problem hiding this comment.
The public API should be 100% the same, excluding possibly the constructor (in the future). Most people create DataFrames from files, so I doubt the constructor will be used outside of examples and small test cases. The return types and errors raised should all be the same for everything implemented so far. I think putting together some documentation for this in the readthedocs is probably going to happen after our blog post this week.
| return pd.concat(ray.get(df._df)) | ||
| pd_df = pd.concat(ray.get(df._df)) | ||
| pd_df.index = df.index | ||
| pd_df.columns = df.columns |
There was a problem hiding this comment.
In what situations does to_pandas behave differently now? That is, when are the index and columns fields different from what you would get from pd.concat?
Do the tests compare equality of these fields? And are there tests where the old version of to_pandas would fail?
There was a problem hiding this comment.
This is an internal change to make it faster to lookup and retrieve things in the index (i.e. don't require a remote task to fetch this information). It doesn't change the tests at all, but the old version of to_pandas will not have the correct index and columns. @simon-mo has an update to this to start making the index on the inner frame more like the design.
Mostly used Pandas implementations for Index related things. Also was able to pull in some of their error checking.