[Python] pyarrow.Table.from_pandas() causing memory leak  #37989

@RizzoV

Describe the bug, including details regarding any error messages, version, and platform.

Issue Description

(continuing from pandas-dev/pandas#55296)

pyarrow.Table.from_pandas() causes a memory leak on DataFrames containing nested structs. A sample problematic data schema and a compliant data generator are included in the Reproducible Example below.

From the Reproducible Example:

  • 1st pa.Table.from_pandas() call:
Line #    Mem usage    Increment  Occurrences   Line Contents
=============================================================
    74     91.9 MiB     91.9 MiB           1   @profile
    75                                         def convert_df_to_table(df: pd.DataFrame):
    76     91.9 MiB      0.0 MiB           1       table = pa.Table.from_pandas(df, schema=pa.schema(sample_schema))
  • 2000th call:
Line #    Mem usage    Increment  Occurrences   Line Contents
=============================================================
    74    140.1 MiB    140.1 MiB           1   @profile
    75                                         def convert_df_to_table(df: pd.DataFrame):
    76    140.1 MiB      0.0 MiB           1       table = pa.Table.from_pandas(df, schema=pa.schema(sample_schema))
  • 10000th call:
Line #    Mem usage    Increment  Occurrences   Line Contents
=============================================================
    74    329.4 MiB    329.4 MiB           1   @profile
    75                                         def convert_df_to_table(df: pd.DataFrame):
    76    329.5 MiB      0.0 MiB           1       table = pa.Table.from_pandas(df, schema=pa.schema(sample_schema))
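
To put the three snapshots above in perspective, the growth per call can be estimated from the reported figures. This is only a rough sketch: the `samples` list and the `growth_per_call_kib` helper are illustrative names, not part of the reproducer, and memory_profiler reports process-level memory, so the result is an upper-bound estimate rather than an exact measure of what pyarrow retains.

```python
# (call number, reported "Mem usage" in MiB), copied from the profiler output above
samples = [(1, 91.9), (2000, 140.1), (10000, 329.4)]


def growth_per_call_kib(samples):
    """Return the average growth in KiB per call between consecutive samples."""
    rates = []
    for (n0, m0), (n1, m1) in zip(samples, samples[1:]):
        # Convert the MiB difference to KiB and divide by the number of calls made
        rates.append((m1 - m0) * 1024 / (n1 - n0))
    return rates


rates = growth_per_call_kib(samples)
for (n0, _), (n1, _), rate in zip(samples, samples[1:], rates):
    print(f"calls {n0}-{n1}: {rate:.1f} KiB per call")
```

Both intervals come out at roughly 24 to 25 KiB per call, so memory appears to grow at a steady rate with each conversion instead of levelling off.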

Reproducible Example

import string
from random import choice, randint

import pandas as pd
import pyarrow as pa
from memory_profiler import profile

sample_schema = pa.struct(
    [
        ("a", pa.string()),
        (
            "b",
            pa.struct(
                [
                    ("ba", pa.list_(pa.string())),
                    ("bc", pa.string()),
                    ("bd", pa.string()),
                    ("be", pa.list_(pa.string())),
                    (
                        "bf",
                        pa.list_(
                            pa.struct(
                                [
                                    (
                                        "bfa",
                                        pa.struct(
                                            [
                                                ("bfaa", pa.string()),
                                                ("bfab", pa.string()),
                                                ("bfac", pa.string()),
                                                ("bfad", pa.float64()),
                                                ("bfae", pa.string()),
                                            ]
                                        ),
                                    )
                                ]
                            )
                        ),
                    ),
                ]
            ),
        ),
        ("c", pa.int64()),
        ("d", pa.int64()),
        ("e", pa.string()),
        (
            "f",
            pa.struct(
                [
                    ("fa", pa.string()),
                    ("fb", pa.string()),
                    ("fc", pa.string()),
                    ("fd", pa.string()),
                    ("fe", pa.string()),
                    ("ff", pa.string()),
                    ("fg", pa.string()),
                ]
            ),
        ),
        ("g", pa.int64()),
    ]
)


def generate_random_string(str_length: int) -> str:
    return "".join(
        [choice(string.ascii_lowercase + string.digits) for n in range(str_length)]
    )


@profile
def convert_df_to_table(df: pd.DataFrame) -> None:
    table = pa.Table.from_pandas(df, schema=pa.schema(sample_schema))


def generate_random_data():
    return {
        "a": [generate_random_string(128)],
        "b": [
            {
                "ba": [generate_random_string(128) for i in range(50)],
                "bc": generate_random_string(128),
                "bd": generate_random_string(128),
                "be": [generate_random_string(128) for i in range(50)],
                "bf": [
                    {
                        "bfa": {
                            "bfaa": generate_random_string(128),
                            "bfab": generate_random_string(128),
                            "bfac": generate_random_string(128),
                            "bfad": randint(0, 2**32),
                            "bfae": generate_random_string(128),
                        }
                    }
                ],
            }
        ],
        "c": [randint(0, 2**32)],
        "d": [randint(0, 2**32)],
        "e": [generate_random_string(128)],
        "f": [
            {
                "fa": generate_random_string(128),
                "fb": generate_random_string(128),
                "fc": generate_random_string(128),
                "fd": generate_random_string(128),
                "fe": generate_random_string(128),
                "ff": generate_random_string(128),
                "fg": generate_random_string(128),
            }
        ],
        "g": [randint(0, 2**32)],
    }


def main():
    for i in range(10000):
        df = pd.DataFrame.from_dict(generate_random_data())
        # pa.jemalloc_set_decay_ms(0)
        convert_df_to_table(df)  # memory leak


if __name__ == "__main__":
    main()

Installed Versions

INSTALLED VERSIONS
------------------
python              : 3.10.9.final.0
python-bits         : 64
OS                  : Darwin
OS-release          : 22.6.0
Version             : Darwin Kernel Version 22.6.0: Fri Sep 15 13:39:52 PDT 2023; root:xnu-8796.141.3.700.8~1/RELEASE_X86_64
machine             : x86_64
processor           : i386
byteorder           : little
LC_ALL              : None
LANG                : it_IT.UTF-8
LOCALE              : it_IT.UTF-8

pyarrow             : 13.0.0
pandas              : 2.1.1
numpy               : 1.26.0

Component(s)

Python
