[Python] Allow fast writing of Decimal column to parquet #24713

@asfimport

Description


Currently, when one wants to use a decimal datatype in Pandas, the only possibility is to use the decimal.Decimal standard-library type. Such a column then has "object" dtype in the DataFrame.
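For illustration, a column built from decimal.Decimal values does indeed end up with object dtype:

```python
import decimal

import pandas as pd

# pandas has no native decimal dtype, so Decimal values
# land in a generic "object" column
s = pd.Series([decimal.Decimal("1.50"), decimal.Decimal("2.25")])
print(s.dtype)  # object
```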

Arrow can write a column of decimal type to Parquet, which is quite impressive given that fastparquet does not write decimals at all. However, the writing is very slow: a factor of roughly 4 in the snippet below.

Improvements

Of course, the best outcome would be to make the conversion of a decimal column faster, but I am not familiar enough with the pandas internals to know whether that is possible. (The same slowdown also applies to .to_pickle() etc.)

It would be nice if a warning were shown when object-typed columns are being converted, since that is very slow. It would at least make this behavior more explicit.

Now, if fast conversion of a decimal.Decimal object column is not possible, a workaround would be welcome. For example, pass an int column and then shift the decimal point "x" places to the left. (It is already possible to pass an int column and specify a decimal type in the Arrow schema during pa.Table.from_pandas(), but then the value simply becomes a decimal with no fractional digits.) Alternatively, it might be nice if the values could be encoded as 128-bit byte strings in the pandas column and interpreted directly by Arrow.

Usecase

I need to save large DataFrames (~10 GB) of geospatial data with latitude/longitude columns. I cannot use floats, as comparisons need to be exact, and the BigQuery "clustering" feature requires either an integer or a decimal, but not a float. In the meantime, I use a workaround with plain ints (the original number multiplied by 1000).
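The scaled-integer workaround looks roughly like this (the column name and sample values are made up for illustration):

```python
import numpy as np
import pandas as pd

lat = np.array([52.3702, 4.8952, -33.8688])
# store coordinates as integers scaled by 1000: comparisons stay exact,
# BigQuery clustering accepts the column, and Parquet writes at full speed
df = pd.DataFrame({"lat_e3": np.round(lat * 1000).astype(np.int64)})
print(df["lat_e3"].tolist())  # [52370, 4895, -33869]
```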

Snippet

import decimal
from time import time

import numpy as np
import pandas as pd

d = dict()
for col in "abcdefghijklmnopqrstuvwxyz":
    d[col] = np.random.rand(int(1E7)) * 100
df = pd.DataFrame(d)

t0 = time()
df.to_parquet("/tmp/testabc.pq", engine="pyarrow")
t1 = time()

# convert one column to decimal.Decimal objects (dtype becomes "object")
df["a"] = df["a"].round(decimals=3).astype(str).map(decimal.Decimal)

t2 = time()
df.to_parquet("/tmp/testabc_dec.pq", engine="pyarrow")
t3 = time()

print(f"Saving the normal dataframe took {t1-t0:.3f}s, with one decimal column {t3-t2:.3f}s")
# Saving the normal dataframe took 4.430s, with one decimal column 17.673s

Reporter: Fons de Leeuw

Note: This issue was originally created as ARROW-8545. Please see the migration documentation for further details.
