Description
Hello!
I am a new user of the BigQuery Python package, and I have run into a problem.
To keep it simple, I have a small pandas DataFrame with a None value:
df = pd.DataFrame({'x': [1, 2, None, 4]})
For pandas, this will become a NaN and the dtype of the column will be float64 by default. But I would like to push this DataFrame to BigQuery with an INTEGER type for column x (the None should become a null).
Thanks to the new version of pandas (>= 0.24), we can change the type of the column to the nullable Int64 dtype and keep the missing value:
df['x'] = df['x'].astype('Int64')
print(df.dtypes)
# Int64
But when I try to push this DataFrame to BigQuery, I encounter an ArrowTypeError:
ArrowTypeError: ('Did not pass numpy.dtype object', 'Conversion failed for column x with type Int64')
I have found some workarounds, but they always require me to update the table after running load_table_from_dataframe... I think there must be a better way to do this. Any ideas, please?
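For reference, another workaround I could imagine (just a sketch, I have not confirmed it is the recommended way) is to cast the column back to object dtype right before loading, so that the missing value is a plain None and pyarrow infers an int64 column instead of failing on the Int64 extension dtype:
import pandas as pd
from google.cloud import bigquery

# Workaround sketch (my assumption: pyarrow can infer an int64 column from an
# object column holding Python ints and None, so BigQuery loads it as INTEGER).
df = pd.DataFrame({'x': [1, 2, None, 4]})
df['x'] = df['x'].astype('Int64')

# Cast back to object dtype and replace the missing value with a plain None
# just before loading.
df['x'] = df['x'].astype(object)
df.loc[df['x'].isna(), 'x'] = None

client = bigquery.Client()
table_ref = client.dataset('test_dataset').table('test')
client.load_table_from_dataframe(df, table_ref).result()
But this still feels like working around the Int64 dtype rather than using it directly.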
In the end, I would like to have this table in BigQuery:
Line | x
1    | 1
2    | 2
3    | null
4    | 4
with x as an INTEGER type.
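For clarity, this is the schema I have in mind if I were to create the destination table myself (a sketch, assuming the table does not already exist):
from google.cloud import bigquery

# Sketch of the target schema: x as a nullable INTEGER column.
client = bigquery.Client()
table_ref = client.dataset('test_dataset').table('test')
table = bigquery.Table(table_ref, schema=[bigquery.SchemaField('x', 'INTEGER', mode='NULLABLE')])
client.create_table(table)  # assumes the table does not exist yet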
The full code
import pandas as pd
from google.cloud import bigquery
print(pd.__version__)
print(bigquery.__version__)
df = pd.DataFrame({'x': [1, 2, None, 4]})
df['x'] = df['x'].astype('Int64')
print(df.dtypes)
client = bigquery.Client()
dataset_ref = client.dataset('test_dataset')
table_ref = dataset_ref.table('test')
client.load_table_from_dataframe(df, table_ref).result()
Stack trace
ArrowTypeError Traceback (most recent call last)
<ipython-input-4-fe43ea977e67> in <module>
14 table_ref = dataset_ref.table('test')
15
---> 16 client.load_table_from_dataframe(df, table_ref).result()
/usr/local/lib/python3.7/site-packages/google/cloud/bigquery/client.py in load_table_from_dataframe(self, dataframe, destination, num_retries, job_id, job_id_prefix, location, project, job_config)
1045 """
1046 buffer = six.BytesIO()
-> 1047 dataframe.to_parquet(buffer)
1048
1049 if job_config is None:
/usr/local/lib/python3.7/site-packages/pandas/core/frame.py in to_parquet(self, fname, engine, compression, index, partition_cols, **kwargs)
2201 to_parquet(self, fname, engine,
2202 compression=compression, index=index,
-> 2203 partition_cols=partition_cols, **kwargs)
2204
2205 @Substitution(header='Whether to print column labels, default True')
/usr/local/lib/python3.7/site-packages/pandas/io/parquet.py in to_parquet(df, path, engine, compression, index, partition_cols, **kwargs)
250 impl = get_engine(engine)
251 return impl.write(df, path, compression=compression, index=index,
--> 252 partition_cols=partition_cols, **kwargs)
253
254
/usr/local/lib/python3.7/site-packages/pandas/io/parquet.py in write(self, df, path, compression, coerce_timestamps, index, partition_cols, **kwargs)
111 else:
112 from_pandas_kwargs = {'preserve_index': index}
--> 113 table = self.api.Table.from_pandas(df, **from_pandas_kwargs)
114 if partition_cols is not None:
115 self.api.parquet.write_to_dataset(
/usr/local/lib/python3.7/site-packages/pyarrow/table.pxi in pyarrow.lib.Table.from_pandas()
/usr/local/lib/python3.7/site-packages/pyarrow/pandas_compat.py in dataframe_to_arrays(df, schema, preserve_index, nthreads, columns, safe)
466 arrays = [convert_column(c, t)
467 for c, t in zip(columns_to_convert,
--> 468 convert_types)]
469 else:
470 from concurrent import futures
/usr/local/lib/python3.7/site-packages/pyarrow/pandas_compat.py in <listcomp>(.0)
465 if nthreads == 1:
466 arrays = [convert_column(c, t)
--> 467 for c, t in zip(columns_to_convert,
468 convert_types)]
469 else:
/usr/local/lib/python3.7/site-packages/pyarrow/pandas_compat.py in convert_column(col, ty)
461 e.args += ("Conversion failed for column {0!s} with type {1!s}"
462 .format(col.name, col.dtype),)
--> 463 raise e
464
465 if nthreads == 1:
/usr/local/lib/python3.7/site-packages/pyarrow/pandas_compat.py in convert_column(col, ty)
455 def convert_column(col, ty):
456 try:
--> 457 return pa.array(col, type=ty, from_pandas=True, safe=safe)
458 except (pa.ArrowInvalid,
459 pa.ArrowNotImplementedError,
/usr/local/lib/python3.7/site-packages/pyarrow/array.pxi in pyarrow.lib.array()
/usr/local/lib/python3.7/site-packages/pyarrow/array.pxi in pyarrow.lib._ndarray_to_array()
/usr/local/lib/python3.7/site-packages/pyarrow/array.pxi in pyarrow.lib._ndarray_to_type()
/usr/local/lib/python3.7/site-packages/pyarrow/error.pxi in pyarrow.lib.check_status()
ArrowTypeError: ('Did not pass numpy.dtype object', 'Conversion failed for column x with type Int64')
Thank you :)