
[Python] Too much RAM consumption when using take on a memory-mapped table #37766

@blackblitz

Description

Describe the bug, including details regarding any error messages, version, and platform.

I created a random array and wrote it repeatedly to an Arrow IPC file so that the whole dataset was too large to fit in RAM. Then I read the file back with memory mapping. I could slice the resulting table without any problem, but when I tried to access rows at an arbitrary list of indices using take, RAM usage grew until the computer hung. The code is as follows (the array length and the number of writes may need to be adjusted to your disk space and RAM size):

import numpy as np
import pyarrow as pa
from pyarrow import feather

# Build a one-million-row table of random floats.
rng = np.random.default_rng(1337)
data = rng.normal(size=(1000000,))
table = pa.table({'data': data})

# Write the same table 1000 times into a single Arrow IPC file,
# producing a file far larger than available RAM.
sink = pa.output_stream('data.feather')
schema = pa.schema([('data', pa.float64())])
with pa.ipc.new_file(sink, schema) as writer:
    for _ in range(1000):
        writer.write_table(table)

# Read the file back memory-mapped; calling take() then exhausts RAM.
table = feather.read_table('data.feather', memory_map=True)
print(table.take([0]))
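
For contrast, here is a minimal sketch of the two access patterns side by side (assuming data.feather was written by the snippet above): slicing stays cheap, presumably because Table.slice returns a zero-copy view into the mapped file, while take on the same table is what drives RAM usage up.

from pyarrow import feather

# Re-open the file written above with memory mapping enabled.
table = feather.read_table('data.feather', memory_map=True)

# Slicing works without issue: it is a view, no row data is materialized.
print(table.slice(0, 5))

# take() on an arbitrary index list is what makes RAM usage grow until the machine hangs.
print(table.take([0, 42, 999999]))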

Component(s)

Python
