ARROW-1291: [Python] Cast non-string DataFrame columns to strings in RecordBatch/Table.from_pandas #911

wesm · 2017-07-29T15:34:24Z

No description provided.

…m_pandas Change-Id: I7fdd4c32b2f54d3003c6b87b9ae13186c35bcec0

icexelloss · 2017-07-29T16:02:28Z

python/pyarrow/pandas_compat.py



-def construct_metadata(df, index_levels, preserve_index, types):
+def construct_metadata(df, column_names, index_levels, preserve_index, types):


Why pass the column_names instead of the following?

column_names = [str(col) for col in df.columns]

These got sanitized earlier, as part of creating the schema.
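To illustrate the point about sanitizing once and reusing the result, here is a minimal sketch in plain Python. The helpers `sanitize_column_names`, `build_schema_fields`, and `build_metadata` are hypothetical stand-ins, not functions from pyarrow, and the sketch assumes Python 3 `str` where the patch itself checks against `six.string_types`.

```python
def sanitize_column_names(columns):
    """Return the labels as strings, casting any non-string label with str()."""
    return [name if isinstance(name, str) else str(name) for name in columns]


def build_schema_fields(column_names, types):
    # Hypothetical stand-in for schema creation: pair each name with its type.
    return list(zip(column_names, types))


def build_metadata(column_names, types):
    # Hypothetical stand-in for construct_metadata: it receives the already
    # sanitized names instead of recomputing them from df.columns.
    return {
        'columns': [
            {'name': name, 'type': typ}
            for name, typ in zip(column_names, types)
        ]
    }


raw_columns = [0, 1, 'c']  # e.g. the labels of a DataFrame with integer columns
types = ['int64', 'float64', 'string']

# Sanitize once, then hand the same list to both consumers.
column_names = sanitize_column_names(raw_columns)
fields = build_schema_fields(column_names, types)
metadata = build_metadata(column_names, types)
```

Passing the list down keeps the schema and the metadata in agreement by construction, since both are built from the same sanitized names.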

icexelloss · 2017-07-29T16:07:47Z

python/pyarrow/table.pxi

    return 0


-cdef tuple _dataframe_to_arrays(


Out of curiosity, why was this written in Cython originally?

Started small, got bigger =)

icexelloss · 2017-07-29T16:10:40Z

I didn't compare "dataframe_to_arrays" with the original Cython implementation too carefully. I assume they are the same except for the column name casting?

Otherwise LGTM

cpcloud · 2017-07-29T17:25:39Z

python/pyarrow/pandas_compat.py

+
+    for name in df.columns:
+        col = df[name]
+        if not isinstance(name, six.string_types):


This allows anything that isn't a string, including floats, timestamps, and any other wacky thing someone puts in a column index. Should this be more strict about what type(df.columns) is?

In a lot of cases it will just be "Index". I'd rather have someone complaining about this rather than pre-emptively guessing what will be the right thing to do

Fair enough.
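For reference, the two options discussed in this thread can be sketched side by side. This is an illustration only: `cast_column_name` and its `strict` flag are hypothetical and not part of the patch, which takes the permissive route. The sketch assumes Python 3 `str` in place of `six.string_types`.

```python
import numbers


def cast_column_name(name, strict=False):
    """Cast a column label to a string.

    With strict=False, every non-string label is passed through str(),
    mirroring the permissive check in the diff above. With strict=True
    (a hypothetical variant), only strings and integers are accepted.
    """
    if isinstance(name, str):
        return name
    if strict and not isinstance(name, numbers.Integral):
        raise TypeError(
            'Unsupported column label type: {}'.format(type(name).__name__)
        )
    return str(name)


permissive = [cast_column_name(label) for label in [0, 1.5, 'a']]

try:
    cast_column_name(1.5, strict=True)
    strict_rejected_float = False
except TypeError:
    strict_rejected_float = True
```

The permissive form never fails on an unusual label, at the cost of silently stringifying it; the strict form surfaces the problem immediately.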

cpcloud · 2017-07-29T17:26:43Z

python/pyarrow/tests/test_convert_pandas.py

        df['a'] = df['a'].astype('category')
        self._check_pandas_roundtrip(df)

+    def test_non_string_columns(self):


There should be a test for additional column types that either fails or explicitly succeeds based on what we decide about allowing other types in.

I suppose we can leave this as the only test right now and say that anything other than integers or strings is undefined behavior.

cpcloud · 2017-07-29T17:27:35Z

python/pyarrow/pandas_compat.py

-                ),
-                'pandas_version': pd.__version__,
-            }
-        ).encode('utf8')


I'll start making more local variables :)

cpcloud

LGTM

Cast non-string DataFrame columns to strings in RecordBatch/Table.fro…

d442f3b

…m_pandas Change-Id: I7fdd4c32b2f54d3003c6b87b9ae13186c35bcec0

icexelloss reviewed Jul 29, 2017

cpcloud reviewed Jul 29, 2017

cpcloud approved these changes Jul 29, 2017

asfgit closed this in 4108bda Jul 29, 2017

wesm deleted the ARROW-1291 branch July 29, 2017 17:55


