Closed
Description
I have a simple test case where I scan the batches of a 4GB dataset and print out the currently used memory:
import pyarrow as pa
import pyarrow.dataset as ds

dataset = ds.dataset('/home/pace/dev/data/dataset/csv/5_big', format='csv')
num_rows = 0
for batch in dataset.to_batches():
    print(pa.total_allocated_bytes())
    num_rows += batch.num_rows
print(num_rows)

In pyarrow 3.0.0 this consumes just over 5 MB. In pyarrow 4.0.0 and 5.0.0 it consumes multiple GB of RAM.
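For comparison, one possible workaround on the affected versions is to drive the scan one file fragment at a time instead of going through Dataset.to_batches(), so only a single file's readahead is in flight at once. This is a sketch, not the backpressure fix tracked by this issue; it assumes a file-based dataset (so that get_fragments() is available) and reuses the path from the report above:

import pyarrow as pa
import pyarrow.dataset as ds

dataset = ds.dataset('/home/pace/dev/data/dataset/csv/5_big', format='csv')

num_rows = 0
for fragment in dataset.get_fragments():
    # Scan each CSV file on its own rather than through the dataset-wide
    # scanner, which limits how much data can be buffered at once.
    for batch in fragment.to_batches():
        num_rows += batch.num_rows
    # Peak bytes allocated so far, as tracked by the default memory pool.
    print(pa.default_memory_pool().max_memory())
print(num_rows)

Whether this keeps the footprint near the 3.0.0 level depends on how much readahead a single fragment's scan performs, so treat it as a diagnostic aid rather than a guaranteed fix.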
Reporter: Weston Pace / @westonpace
Assignee: Weston Pace / @westonpace
Related issues:
- [C++] Ensure dataset writing applies back pressure (is duplicated by)
- [C++][Dataset] Dataset writes should respect backpressure (is depended upon by)
PRs and other links:
Note: This issue was originally created as ARROW-13611. Please see the migration documentation for further details.