BigQuery: ArrowTypeError when trying to push DataFrame with int columns with NaN values #22

@sebastienharinck

Description

Hello!

I am a new user of the BigQuery Python package, and I have run into a problem.

To simplify, here is a small pandas DataFrame with a None value:

df = pd.DataFrame({'x': [1, 2, None, 4]})

For pandas, the None becomes a NaN, so the column's dtype defaults to float64. But I would like to push this DataFrame to BigQuery with an INTEGER type for the column x (the None becoming a null).
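
A quick check confirms what pandas infers:

import pandas as pd

df = pd.DataFrame({'x': [1, 2, None, 4]})
print(df.dtypes)
# x    float64
# dtype: object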

Thanks to the new nullable integer type in pandas (>= 0.24), we can change the type of the column and keep the missing value:

df['x'] = df['x'].astype('Int64')
print(df.dtypes)
# x    Int64
# dtype: object

Source: https://stackoverflow.com/questions/11548005/numpy-or-pandas-keeping-array-type-as-integer-while-having-a-nan-value

But when I try to push this DataFrame to BigQuery, I get an ArrowTypeError:

ArrowTypeError: ('Did not pass numpy.dtype object', 'Conversion failed for column x with type Int64')
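
If I read the stack trace correctly, pyarrow's conversion expects a plain numpy dtype, but pandas' Int64 is an extension dtype. A minimal check, independent of BigQuery, that shows the difference:

import numpy as np
import pandas as pd

s = pd.Series([1, 2, None, 4], dtype='Int64')

# The nullable Int64 column is backed by a pandas extension dtype,
# not a numpy.dtype, which is what the pyarrow conversion rejects.
print(isinstance(s.dtype, np.dtype))                      # False
print(isinstance(pd.Series([1.0, 2.0]).dtype, np.dtype))  # True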

I have found some workarounds that get the job done, but they always require updating the table after executing load_table_from_dataframe... I think there must be a better solution. Any ideas, please?
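
For illustration, here is a sketch of one such workaround (the object conversion is my own idea, not an official API; it sidesteps the extension dtype so pyarrow can infer a nullable int64 on its own):

import pandas as pd
from google.cloud import bigquery

df = pd.DataFrame({'x': [1, 2, None, 4]})

# Convert to plain Python objects: int where a value exists, None otherwise.
# pyarrow then infers a nullable int64, so the parquet file (and hopefully
# the resulting BigQuery column) comes out as INTEGER instead of FLOAT.
df['x'] = df['x'].apply(lambda v: None if pd.isna(v) else int(v))

client = bigquery.Client()
table_ref = client.dataset('test_dataset').table('test')
client.load_table_from_dataframe(df, table_ref).result()

But this throws away the Int64 dtype on the pandas side, which is exactly what I wanted to keep.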

In the end, I would like to have this table in BigQuery:

Line | x
---- | ----
1    | 1
2    | 2
3    | null
4    | 4

with x as an INTEGER type.

The full code

import pandas as pd
from google.cloud import bigquery

print(pd.__version__)
print(bigquery.__version__)

df = pd.DataFrame({'x': [1, 2, None, 4]})

df['x'] = df['x'].astype('Int64')
print(df.dtypes)

client = bigquery.Client()
dataset_ref = client.dataset('test_dataset')
table_ref = dataset_ref.table('test')

client.load_table_from_dataframe(df, table_ref).result()

Stack trace

ArrowTypeError                            Traceback (most recent call last)
<ipython-input-4-fe43ea977e67> in <module>
     14 table_ref = dataset_ref.table('test')
     15 
---> 16 client.load_table_from_dataframe(df, table_ref).result()

/usr/local/lib/python3.7/site-packages/google/cloud/bigquery/client.py in load_table_from_dataframe(self, dataframe, destination, num_retries, job_id, job_id_prefix, location, project, job_config)
   1045         """
   1046         buffer = six.BytesIO()
-> 1047         dataframe.to_parquet(buffer)
   1048 
   1049         if job_config is None:

/usr/local/lib/python3.7/site-packages/pandas/core/frame.py in to_parquet(self, fname, engine, compression, index, partition_cols, **kwargs)
   2201         to_parquet(self, fname, engine,
   2202                    compression=compression, index=index,
-> 2203                    partition_cols=partition_cols, **kwargs)
   2204 
   2205     @Substitution(header='Whether to print column labels, default True')

/usr/local/lib/python3.7/site-packages/pandas/io/parquet.py in to_parquet(df, path, engine, compression, index, partition_cols, **kwargs)
    250     impl = get_engine(engine)
    251     return impl.write(df, path, compression=compression, index=index,
--> 252                       partition_cols=partition_cols, **kwargs)
    253 
    254 

/usr/local/lib/python3.7/site-packages/pandas/io/parquet.py in write(self, df, path, compression, coerce_timestamps, index, partition_cols, **kwargs)
    111         else:
    112             from_pandas_kwargs = {'preserve_index': index}
--> 113         table = self.api.Table.from_pandas(df, **from_pandas_kwargs)
    114         if partition_cols is not None:
    115             self.api.parquet.write_to_dataset(

/usr/local/lib/python3.7/site-packages/pyarrow/table.pxi in pyarrow.lib.Table.from_pandas()

/usr/local/lib/python3.7/site-packages/pyarrow/pandas_compat.py in dataframe_to_arrays(df, schema, preserve_index, nthreads, columns, safe)
    466         arrays = [convert_column(c, t)
    467                   for c, t in zip(columns_to_convert,
--> 468                                   convert_types)]
    469     else:
    470         from concurrent import futures

/usr/local/lib/python3.7/site-packages/pyarrow/pandas_compat.py in <listcomp>(.0)
    465     if nthreads == 1:
    466         arrays = [convert_column(c, t)
--> 467                   for c, t in zip(columns_to_convert,
    468                                   convert_types)]
    469     else:

/usr/local/lib/python3.7/site-packages/pyarrow/pandas_compat.py in convert_column(col, ty)
    461             e.args += ("Conversion failed for column {0!s} with type {1!s}"
    462                        .format(col.name, col.dtype),)
--> 463             raise e
    464 
    465     if nthreads == 1:

/usr/local/lib/python3.7/site-packages/pyarrow/pandas_compat.py in convert_column(col, ty)
    455     def convert_column(col, ty):
    456         try:
--> 457             return pa.array(col, type=ty, from_pandas=True, safe=safe)
    458         except (pa.ArrowInvalid,
    459                 pa.ArrowNotImplementedError,

/usr/local/lib/python3.7/site-packages/pyarrow/array.pxi in pyarrow.lib.array()

/usr/local/lib/python3.7/site-packages/pyarrow/array.pxi in pyarrow.lib._ndarray_to_array()

/usr/local/lib/python3.7/site-packages/pyarrow/array.pxi in pyarrow.lib._ndarray_to_type()

/usr/local/lib/python3.7/site-packages/pyarrow/error.pxi in pyarrow.lib.check_status()

ArrowTypeError: ('Did not pass numpy.dtype object', 'Conversion failed for column x with type Int64')

Thank you :)

Labels

api: bigquery (issues related to the googleapis/python-bigquery API)
type: feature request ('nice-to-have' improvement, new feature, or different behavior or design)
