[DataFrame] Fully implement append, concat and join #1932
robertnishihara merged 7 commits into ray-project:master
Test PASSed.
Test PASSed.
Test FAILed.
if keys is not None:
    objs = [objs[k] for k in keys]
else:
    objs = list(objs)
`None` objects need to be dropped from `objs`, as specified in the pandas docs.
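For reference, this is the stock pandas behavior the comment refers to. It is a small illustration using plain `pandas.concat`, not the Ray DataFrame code in this PR, and the frame names are made up.

```python
import pandas as pd

df1 = pd.DataFrame({"a": [1, 2]})
df2 = pd.DataFrame({"a": [3]})

# None entries are skipped silently; only df1 and df2 are concatenated.
result = pd.concat([df1, None, df2], ignore_index=True)
print(result["a"].tolist())
```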
else:
    objs = list(objs)

if len(objs) == 0:
Need to handle the case where every object is None: pandas raises `ValueError: All objects passed were None`.
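One way the two checks could fit together is sketched below. `clean_objs` is a hypothetical helper name, not something taken from this PR; the error messages follow the ones pandas uses.

```python
def clean_objs(objs):
    """Hypothetical helper: drop None entries before concatenating."""
    objs = list(objs)
    if len(objs) == 0:
        # Nothing was passed in at all.
        raise ValueError("No objects to concatenate")

    kept = [obj for obj in objs if obj is not None]
    if len(kept) == 0:
        # Something was passed in, but every entry was None.
        raise ValueError("All objects passed were None")
    return kept
```

Keeping the two checks separate preserves the distinction between an empty input and an input made up entirely of `None`.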
    pdf.columns = pd.RangeIndex(len(new_columns))

    return pdf
if isinstance(objs, dict):
This case is not necessary.
Why is this case not necessary?
Actually it is; my comment was wrong. I had understood that you could only pass in a dictionary when `keys` is specified. It turns out you can pass in a dictionary by itself.
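A quick check of the dictionary case against plain pandas (illustrative only; the frame contents are made up, and the keys are already in sorted order so the result does not depend on how pandas orders them):

```python
import pandas as pd

frames = {
    "x": pd.DataFrame({"a": [1, 2]}),
    "y": pd.DataFrame({"a": [3]}),
}

# No `keys` argument: the dict keys become the outer index level.
result = pd.concat(frames)
outer = result.index.get_level_values(0).tolist()
print(outer)
```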
# (TODO) Group all the pandas dataframes
# We need this in a list because we use it later.
all_index, all_columns = list(zip(*[(obj.index, obj.columns)
This will not work for Panel objects, which do not have `index` or `columns` properties.
True, I'm going to just drop Panel support.
# Put all of the DataFrames into Ray format
# TODO just partition the DataFrames instead of building a new Ray DF.
objs = [DataFrame(obj) if isinstance(obj, (pandas.DataFrame,
All pandas.Series objects would already be DataFrames by this point. Does it make sense to combine the steps?
Yes, it's not completely efficient to do it this way.
Resolved in series_to_df.
other = pd.DataFrame(other.values.reshape((1, len(other))),
                     index=index,
                     columns=combined_columns)
other = other._convert(datetime=True, timedelta=True)
Does the current DataFrame here need to reindex its columns to the combined_columns?
This will happen in concat.
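To make the exchange above concrete, here is a sketch with plain pandas of appending a Series as a single row. The variable names follow the snippet, but the data is made up and this is not the PR's code. Only the Series is reshaped up front; the existing frame picks up the missing column when `pd.concat` aligns the two.

```python
import pandas as pd

df = pd.DataFrame({"a": [1], "b": [2]})
row = pd.Series({"a": 10, "b": 20, "c": 30}, name=5)

# Existing columns first, then any labels that only the Series has.
combined_columns = df.columns.append(row.index.difference(df.columns))
other = row.reindex(combined_columns)

# Turn the Series into a one-row frame, as in the snippet above.
other = pd.DataFrame(other.values.reshape((1, len(other))),
                     index=pd.Index([row.name]),
                     columns=combined_columns)

# concat aligns the columns, so df gains column "c" filled with NaN.
result = pd.concat([df, other])
print(result)
```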
if isinstance(other, pd.Series):
    if other.name is None:
        raise ValueError("Other Series must have a name")
    other = DataFrame({other.name: other})
You can pass `other` directly into the DataFrame constructor. It carries the Series name over to the column name.
This is similar to how Pandas does it, so I vote we keep it this way. It's probably for clarity.
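For a named Series the two spellings discussed here produce the same frame in plain pandas. This is a minimal check with made-up data; it uses `pd.DataFrame` and says nothing about the Ray `DataFrame` constructor.

```python
import pandas as pd

other = pd.Series([1, 2, 3], name="score")

via_dict = pd.DataFrame({other.name: other})  # explicit, as in the diff
via_ctor = pd.DataFrame(other)                # name carried over implicitly

print(list(via_dict.columns), list(via_ctor.columns))
```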
python/ray/dataframe/dataframe.py
Outdated
raise ValueError("Joining multiple DataFrames only supported"
                 " for joining on index")

# Joining the empty DataFrames with either index of columns is
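The error in this hunk matches what plain pandas does when `on` is combined with a list of frames. A small illustration against pandas itself, with made-up frames:

```python
import pandas as pd

left = pd.DataFrame({"a": [1, 2]})
others = [pd.DataFrame({"b": [3, 4]}), pd.DataFrame({"c": [5, 6]})]

# Joining a list of frames on the index is allowed.
joined = left.join(others)

# Passing `on` together with a list of frames is rejected.
try:
    left.join(others, on="a")
    error_message = None
except ValueError as exc:
    error_message = str(exc)

print(list(joined.columns), error_message)
```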
Test FAILed.
python/ray/dataframe/concat.py
Outdated
type_check = next(obj for obj in objs
                  if not isinstance(obj, (pandas.Series,
                                          pandas.DataFrame, DataFrame,
                                          pandas.Panel)))
Dropped support for pandas.Panel?
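A side note on the `next(...)` pattern in this hunk: without a default, `next` raises `StopIteration` when every object has an accepted type. The surrounding code is not visible here, so how the PR handles that case is unknown. The sketch below shows one option using a sentinel default; `check_types` and the stand-in types are hypothetical.

```python
_MISSING = object()  # sentinel, so a legitimate None is not mistaken for "not found"


def check_types(objs, allowed_types):
    """Hypothetical helper: raise TypeError on the first unsupported object."""
    bad = next((obj for obj in objs if not isinstance(obj, allowed_types)),
               _MISSING)
    if bad is not _MISSING:
        raise TypeError("cannot concatenate object of type {}".format(
            type(bad).__name__))


# Stand-in types for illustration; the PR checks pandas and Ray types.
check_types([1, 2.0], (int, float))
```

With `pandas.Panel` removed from the accepted tuple, a Panel would fall into the `TypeError` branch like any other unsupported type.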
Test PASSed.
Make some changes to `concat` and `DataFrame.join`. Changes are:
* `concat` for `pandas.Series`
* `concat` for `axis=1` and `keys`
* `DataFrame.join`
* `DataFrame.append`
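As a reference point for the `concat` items, this is how plain pandas treats Series inputs and the `axis=1` plus `keys` combination. The example uses `pandas.concat` with made-up data; the Ray implementation is expected to match it but is not shown here.

```python
import pandas as pd

s1 = pd.Series([1, 2], name="a")
s2 = pd.Series([3, 4], name="b")

# Series along axis=1 become columns named after each Series.
wide = pd.concat([s1, s2], axis=1)

df1 = pd.DataFrame({"v": [1, 2]})
df2 = pd.DataFrame({"v": [3, 4]})

# With keys, axis=1 produces a two-level column index.
keyed = pd.concat([df1, df2], axis=1, keys=["left", "right"])

print(list(wide.columns), list(keyed.columns))
```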