[Python] Quadratic memory usage of Table.to_pandas with nested data #20512

Reading nested Parquet data and then converting it to a Pandas DataFrame shows memory usage that grows quadratically with the row count, so the conversion eventually runs out of memory for reasonably small files. I had initially thought this was a regression since 7.0.0, but it looks like 7.0.0 has similar quadratic memory usage that kicks in at higher row counts.

Example code to generate nested Parquet data:
```python
import random
import string

import numpy as np
import pandas as pd

_characters = string.ascii_uppercase + string.digits + string.punctuation

def make_random_string(N=10):
    return ''.join(random.choice(_characters) for _ in range(N))

nrows = 1_024_000
filename = 'nested.parquet'

# Each row holds an array of arr_len structs with fields
# a (int64 array or None), b (int or None) and c (str or None).
arr_len = 10
nested_col = []
for i in range(nrows):
    # Note: the comprehension's i shadows the outer loop's i, so the
    # None checks below apply to the element index (0..arr_len-1).
    nested_col.append(np.array(
            [{
                'a': None if i % 1000 == 0 else np.random.choice(10000, size=3).astype(np.int64),
                'b': None if i % 100 == 0 else random.choice(range(100)),
                'c': None if i % 10 == 0 else make_random_string(5)
            } for i in range(arr_len)]
        ))
df = pd.DataFrame({'c1': nested_col})
df.to_parquet(filename)
```
And then read into a DataFrame with:
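```python
import pyarrow.parquet as pq

table = pq.read_table(filename)  # reading into an Arrow table is not the problem
df = table.to_pandas()  # this call is where the large memory usage shows up
```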
Reading into an Arrow table alone isn't a problem; it's the `to_pandas` method that exhibits the large memory usage. I haven't tested generating nested Arrow data in memory without writing Parquet from Pandas, but I assume the problem isn't Parquet-specific.
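To see which step is responsible, the peak resident set size can be sampled before and after each call. The following is only a minimal sketch using the standard library on a Unix system; the helper names `peak_rss_mb` and `run_and_report` are hypothetical, and this is not necessarily how the figures in this report were collected.

```python
import resource
import sys


def peak_rss_mb():
    """Return this process's peak resident set size in MiB (Unix only)."""
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss is reported in bytes on macOS and in kilobytes on Linux.
    divisor = 1024 * 1024 if sys.platform == "darwin" else 1024
    return peak / divisor


def run_and_report(label, func):
    """Call func(), print how far the peak RSS rose, and return func's result."""
    before = peak_rss_mb()
    result = func()
    after = peak_rss_mb()
    print(f"{label}: peak RSS rose by {after - before:.0f} MiB (now {after:.0f} MiB)")
    return result


# Hypothetical usage with the snippet above (requires pyarrow):
#   table = run_and_report("read_table", lambda: pq.read_table(filename))
#   df = run_and_report("to_pandas", table.to_pandas)
```

Because the peak is a high-water mark for the whole process, each file size is best measured in a fresh interpreter so that earlier runs do not mask later ones.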

Memory usage I see when reading different sized files on a machine with 64 GB RAM:

|Num rows|Memory used with 10.0.1 (MB)|Memory used with 7.0.0 (MB)|
|-|-|-|
|32,000|362|361|
|64,000|531|531|
|128,000|1,152|1,101|
|256,000|2,888|1,402|
|512,000|10,301|3,508|
|1,024,000|38,697|5,313|
|2,048,000|OOM|20,061|
|4,096,000| |OOM|

With Arrow 10.0.1, memory usage approximately quadruples when row count doubles above 256k rows. With Arrow 7.0.0 memory usage is more linear but then quadruples from 1024k to 2048k rows. PyArrow 8.0.0 shows similar memory usage to 10.0.1 so it looks like something changed between 7.0.0 and 8.0.0.
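The growth rate can be checked directly from the figures in the table above. The sketch below computes the ratio of memory used between successive row counts; since each step doubles the row count, `log2` of that ratio is a rough estimate of the local exponent `k` in memory ∝ rows^k (a ratio of 4 corresponds to quadratic growth). The variable and function names are illustrative only.

```python
import math

# Figures copied from the table above (MB); rows that hit OOM are left out.
rows = [32_000, 64_000, 128_000, 256_000, 512_000, 1_024_000, 2_048_000]
mem_10_0_1 = [362, 531, 1_152, 2_888, 10_301, 38_697]
mem_7_0_0 = [361, 531, 1_101, 1_402, 3_508, 5_313, 20_061]


def doubling_ratios(mem):
    """Ratio of memory used at each row count to the previous (half-sized) one."""
    return [later / earlier for earlier, later in zip(mem, mem[1:])]


for version, mem in [("10.0.1", mem_10_0_1), ("7.0.0", mem_7_0_0)]:
    for n, ratio in zip(rows[1:], doubling_ratios(mem)):
        # log2(ratio) estimates the local exponent k in memory ~ rows**k
        print(f"{version} {n:>9,} rows: x{ratio:.2f} (exponent ~{math.log2(ratio):.2f})")
```

For 10.0.1 the last two steps come out at roughly 3.6x and 3.8x, and for 7.0.0 the final step is roughly 3.8x, which is consistent with memory approximately quadrupling per doubling at those sizes.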


**Environment**: Python 3.10.8 on Fedora Linux 36. AMD Ryzen 9 5900X with 64 GB RAM
**Reporter**: [Adam Reeve](https://issues.apache.org/jira/browse/ARROW-18400) / @adamreeve
**Assignee**: [Will Jones](https://issues.apache.org/jira/browse/ARROW-18400) / @wjones127
#### Original Issue Attachments:
- [test_memory.py](https://issues.apache.org/jira/secure/attachment/13054045/test_memory.py)
#### PRs and other links:
- [GitHub Pull Request #15210](https://github.com/apache/arrow/pull/15210)

**Note**: *This issue was originally created as [ARROW-18400](https://issues.apache.org/jira/browse/ARROW-18400). Please see the [migration documentation](https://github.com/apache/arrow/issues/14542) for further details.*
