[C++] Scanning datasets does not enforce back pressure #29252

@asfimport

Description

I have a simple test case where I scan the batches of a 4GB dataset and print out the currently used memory:

import pyarrow as pa
import pyarrow.dataset as ds

# Open the CSV dataset (about 4GB in total)
dataset = ds.dataset('/home/pace/dev/data/dataset/csv/5_big', format='csv')
num_rows = 0
for batch in dataset.to_batches():
    # Bytes currently allocated from Arrow's default memory pool
    print(pa.total_allocated_bytes())
    num_rows += batch.num_rows

print(num_rows)

In pyarrow 3.0.0 this consumes just over 5MB. In pyarrow 4.0.0 and 5.0.0 this consumes multiple GB of RAM.
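
For readers unfamiliar with the term in the title, here is a minimal sketch of what back pressure means in general. It is not Arrow's scanner implementation. The function name `scan_with_back_pressure` and its parameters are hypothetical, and the "batches" are plain integers. The producer blocks once the bounded queue is full, so the number of batches held in memory never exceeds the limit, however slowly the consumer drains them.

```python
import queue
import threading


def scan_with_back_pressure(num_batches, max_queued):
    """Push fake batches through a bounded queue (hypothetical helper).

    Returns (batches_consumed, peak_queue_depth_seen_by_producer).
    """
    # A bounded queue: put() blocks while max_queued items are waiting,
    # which is what throttles the producer.
    q = queue.Queue(maxsize=max_queued)
    peak = 0

    def producer():
        nonlocal peak
        for i in range(num_batches):
            q.put(i)  # blocks when the consumer falls behind
            peak = max(peak, q.qsize())
        q.put(None)  # sentinel: no more batches

    worker = threading.Thread(target=producer)
    worker.start()

    consumed = 0
    while True:
        item = q.get()
        if item is None:
            break
        consumed += 1

    worker.join()
    return consumed, peak


if __name__ == "__main__":
    consumed, peak = scan_with_back_pressure(num_batches=1000, max_queued=4)
    print(consumed, peak)
```

Without the `maxsize` bound the producer could queue every batch before the consumer reads the first one, which would be consistent with the multi-GB readings reported above, although the actual cause in the scanner is not established here.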

Reporter: Weston Pace / @westonpace
Assignee: Weston Pace / @westonpace

Note: This issue was originally created as ARROW-13611. Please see the migration documentation for further details.
