[Python] pyarrow.Table.from_pandas() causing memory leak #37989
Closed
Description
Describe the bug, including details regarding any error messages, version, and platform.
Issue Description
(continuing from pandas-dev/pandas#55296)
pyarrow.Table.from_pandas() causes a memory leak on DataFrames containing nested structs. A sample schema that triggers the problem and a data generator that conforms to it are included in the Reproducible Example below.
From the Reproducible Example:
- 1st pa.Table.from_pandas() call:
Line # Mem usage Increment Occurrences Line Contents
=============================================================
74 91.9 MiB 91.9 MiB 1 @profile
75 def convert_df_to_table(df: pd.DataFrame):
76 91.9 MiB 0.0 MiB 1 table = pa.Table.from_pandas(df, schema=pa.schema(sample_schema))
- 2000th call:
Line # Mem usage Increment Occurrences Line Contents
=============================================================
74 140.1 MiB 140.1 MiB 1 @profile
75 def convert_df_to_table(df: pd.DataFrame):
76 140.1 MiB 0.0 MiB 1 table = pa.Table.from_pandas(df, schema=pa.schema(sample_schema))
- 10000th call:
Line # Mem usage Increment Occurrences Line Contents
=============================================================
74 329.4 MiB 329.4 MiB 1 @profile
75 def convert_df_to_table(df: pd.DataFrame):
76 329.5 MiB 0.0 MiB 1 table = pa.Table.from_pandas(df, schema=pa.schema(sample_schema))
Reproducible Example
import os
import string
import sys
from random import choice, randint
from uuid import uuid4
import pandas as pd
import pyarrow as pa
from memory_profiler import profile
sample_schema = pa.struct(
    [
        ("a", pa.string()),
        (
            "b",
            pa.struct(
                [
                    ("ba", pa.list_(pa.string())),
                    ("bc", pa.string()),
                    ("bd", pa.string()),
                    ("be", pa.list_(pa.string())),
                    (
                        "bf",
                        pa.list_(
                            pa.struct(
                                [
                                    (
                                        "bfa",
                                        pa.struct(
                                            [
                                                ("bfaa", pa.string()),
                                                ("bfab", pa.string()),
                                                ("bfac", pa.string()),
                                                ("bfad", pa.float64()),
                                                ("bfae", pa.string()),
                                            ]
                                        ),
                                    )
                                ]
                            )
                        ),
                    ),
                ]
            ),
        ),
        ("c", pa.int64()),
        ("d", pa.int64()),
        ("e", pa.string()),
        (
            "f",
            pa.struct(
                [
                    ("fa", pa.string()),
                    ("fb", pa.string()),
                    ("fc", pa.string()),
                    ("fd", pa.string()),
                    ("fe", pa.string()),
                    ("ff", pa.string()),
                    ("fg", pa.string()),
                ]
            ),
        ),
        ("g", pa.int64()),
    ]
)

def generate_random_string(str_length: int) -> str:
    return "".join(
        [choice(string.ascii_lowercase + string.digits) for n in range(str_length)]
    )


@profile
def convert_df_to_table(df: pd.DataFrame) -> None:
    table = pa.Table.from_pandas(df, schema=pa.schema(sample_schema))

def generate_random_data():
    return {
        "a": [generate_random_string(128)],
        "b": [
            {
                "ba": [generate_random_string(128) for i in range(50)],
                "bc": generate_random_string(128),
                "bd": generate_random_string(128),
                "be": [generate_random_string(128) for i in range(50)],
                "bf": [
                    {
                        "bfa": {
                            "bfaa": generate_random_string(128),
                            "bfab": generate_random_string(128),
                            "bfac": generate_random_string(128),
                            "bfad": randint(0, 2**32),
                            "bfae": generate_random_string(128),
                        }
                    }
                ],
            }
        ],
        "c": [randint(0, 2**32)],
        "d": [randint(0, 2**32)],
        "e": [generate_random_string(128)],
        "f": [
            {
                "fa": generate_random_string(128),
                "fb": generate_random_string(128),
                "fc": generate_random_string(128),
                "fd": generate_random_string(128),
                "fe": generate_random_string(128),
                "ff": generate_random_string(128),
                "fg": generate_random_string(128),
            }
        ],
        "g": [randint(0, 2**32)],
    }

def main():
    for i in range(10000):
        df = pd.DataFrame.from_dict(generate_random_data())
        # pa.jemalloc_set_decay_ms(0)
        convert_df_to_table(df)  # memory leak


if __name__ == "__main__":
    main()

Installed Versions
INSTALLED VERSIONS
------------------
python : 3.10.9.final.0
python-bits : 64
OS : Darwin
OS-release : 22.6.0
Version : Darwin Kernel Version 22.6.0: Fri Sep 15 13:39:52 PDT 2023; root:xnu-8796.141.3.700.8~1/RELEASE_X86_64
machine : x86_64
processor : i386
byteorder : little
LC_ALL : None
LANG : it_IT.UTF-8
LOCALE : it_IT.UTF-8
pyarrow : 13.0.0
pandas : 2.1.1
numpy : 1.26.0
Component(s)
Python