[C++] Scanning datasets does not enforce back pressure

I have a simple test case where I scan the batches of a 4GB dataset and print out the currently used memory:

```python

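import pyarrow as pa
import pyarrow.dataset as ds

# Open the CSV dataset lazily; nothing is read until batches are requested.
dataset = ds.dataset('/home/pace/dev/data/dataset/csv/5_big', format='csv')
num_rows = 0
for batch in dataset.to_batches():
    # Bytes currently allocated from Arrow's default memory pool.
    print(pa.total_allocated_bytes())
    num_rows += batch.num_rows

print(num_rows)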
```
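
To compare versions by a single number rather than a stream of printed values, the loop above could be wrapped in a small helper that records the peak reading. This is only a sketch: `scan_with_peak` and its parameters are hypothetical names, not part of pyarrow, and the helper samples memory once per batch, so it can miss spikes between batches.

```python
def scan_with_peak(batches, allocated_bytes):
    """Consume `batches` and return (total_rows, peak_allocated_bytes).

    `batches` is any iterable of objects with a `num_rows` attribute.
    `allocated_bytes` is a zero-argument callable returning the current
    memory usage in bytes (hypothetical parameter name).
    """
    num_rows = 0
    peak = 0
    for batch in batches:
        # Sample once per batch, at the same point as the print() in the
        # original loop.
        peak = max(peak, allocated_bytes())
        num_rows += batch.num_rows
    return num_rows, peak
```

With the dataset from the reproduction, it would be called as `scan_with_peak(dataset.to_batches(), pa.total_allocated_bytes)`.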

In pyarrow 3.0.0 this consumes just over 5 MB. In pyarrow 4.0.0 and 5.0.0 this consumes multiple GB of RAM.
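
For readers unfamiliar with the term, back pressure means a fast producer is made to wait when the consumer falls behind, so the amount of buffered data stays bounded. The sketch below illustrates the general idea with a bounded queue from the Python standard library. It is not how the Arrow scanner is implemented, and `run_bounded_pipeline` and `max_queued` are names made up for this example.

```python
import queue
import threading


def run_bounded_pipeline(items, max_queued=4):
    """Pass `items` from a producer thread to the caller through a bounded queue.

    Returns (consumed_items, max_observed_queue_depth).
    """
    q = queue.Queue(maxsize=max_queued)
    sentinel = object()  # marks the end of the stream

    def producer():
        for item in items:
            # put() blocks while the queue is full: this is the back pressure.
            q.put(item)
        q.put(sentinel)

    worker = threading.Thread(target=producer)
    worker.start()

    consumed = []
    max_depth = 0
    while True:
        # qsize() is approximate, but it never exceeds maxsize.
        max_depth = max(max_depth, q.qsize())
        item = q.get()
        if item is sentinel:
            break
        consumed.append(item)

    worker.join()
    return consumed, max_depth
```

Without the `maxsize` bound the producer could enqueue the whole input before the consumer reads anything, which is analogous to the multi-GB memory growth reported here.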

**Reporter**: [Weston Pace](https://issues.apache.org/jira/browse/ARROW-13611) / @westonpace
**Assignee**: [Weston Pace](https://issues.apache.org/jira/browse/ARROW-13611) / @westonpace
#### Related issues:
- [[C++] Ensure dataset writing applies back pressure](https://github.com/apache/arrow/issues/29235) (is duplicated by)
- [[C++][Dataset] Dataset writes should respect backpressure](https://github.com/apache/arrow/issues/29776) (is depended upon by)
#### PRs and other links:
- [GitHub Pull Request #11285](https://github.com/apache/arrow/pull/11285)

<sub>**Note**: *This issue was originally created as [ARROW-13611](https://issues.apache.org/jira/browse/ARROW-13611). Please see the [migration documentation](https://github.com/apache/arrow/issues/14542) for further details.*</sub>

[C++] Scanning datasets does not enforce back pressure #29252

Related issues:

PRs and other links:

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

[C++] Scanning datasets does not enforce back pressure #29252

Description

Related issues:

PRs and other links:

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions