Description
My use case for stored datasets is specific: I have large strings (1-100 MB each).
Let's take a single row as an example.
43mb.csv is a 1-row CSV with 10 columns; one column is a 43 MB string.
When I read this CSV with pandas and then dump it to Parquet, my script consumes roughly 10x the 43 MB.
As the number of such rows grows, the relative memory overhead diminishes, but I want to focus on this specific case.
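For reference, a minimal reproduction sketch (the 10-column layout, filler values, and file names are placeholders; only the ~43 MB string column matters):

```python
# repro.py — minimal sketch of the case described above.
import pandas as pd
from memory_profiler import profile

def make_csv(path='43mb.csv'):
    # One row, 10 columns; column "c0" holds a ~43 MB string.
    row = {'c0': 'x' * (43 * 1024 * 1024)}
    row.update({f'c{i}': i for i in range(1, 10)})
    pd.DataFrame([row]).to_csv(path, index=False)

@profile
def test():
    data = pd.read_csv('43mb.csv')
    data.to_parquet('out.parquet')

if __name__ == '__main__':
    make_csv()
    test()
```

Running `python repro.py` prints the per-line memory report below.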
Here's the footprint after running it under memory_profiler:
```
Line #    Mem usage    Increment   Line Contents
================================================
     4     48.9 MiB     48.9 MiB   @profile
     5                             def test():
     6    143.7 MiB     94.7 MiB       data = pd.read_csv('43mb.csv')
     7    498.6 MiB    354.9 MiB       data.to_parquet('out.parquet')
```
Is this typical for Parquet with big strings?
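To tell how much of that footprint is Arrow's own live allocation versus memory the allocator holds on to, a sketch like this could help (assumes pyarrow is the parquet engine; `backend_name` is only available in newer pyarrow versions):

```python
# Check Arrow memory pool counters after the conversion.
import pandas as pd
import pyarrow as pa

data = pd.read_csv('43mb.csv')
data.to_parquet('out.parquet')

pool = pa.default_memory_pool()
print('allocator backend :', pool.backend_name)      # e.g. jemalloc on macOS
print('currently allocated:', pa.total_allocated_bytes())
print('peak allocated     :', pool.max_memory())
```

If the peak is far above what the process eventually releases, the overhead may be on the allocator side rather than held by pandas/pyarrow objects.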
Environment: macOS
Reporter: Bogdan Klichuk
Related issues:
- [C++] Research jemalloc memory page reclamation configuration on macOS when background_thread option is unavailable (relates to)
Note: This issue was originally created as ARROW-7305. Please see the migration documentation for further details.