[Python] Only convert in parallel for the ConsolidatedBlockCreator class for large data #40301
Description
Describe the enhancement requested
The `ConsolidatedBlockCreator` class runs the column conversion in parallel, creating a scalar memo table for each column, using up to `pa.cpu_count()` threads.
For performance reasons, jemalloc and mimalloc maintain allocations in per-thread memory segments to reduce contention between threads.
This means that if a user calls `table.to_pandas(split_blocks=False)` on a small table, a disproportionately large amount of memory gets allocated just to build the scalar memo tables: both jemalloc and mimalloc will essentially allocate a chunk of memory per column.
Here is some code to reproduce the issue (note that `pa.table` expects array-like column values, so the single values are wrapped in lists):

```python
import pyarrow as pa

def test_memory_usage():
    table = pa.table({'A': ['a'], 'B': ['b'], 'C': ['c'], 'D': ['d'],
                      'E': ['e'], 'F': ['f'], 'G': ['g']})
    # For the 26-column case, extend the dict with 'H' through 'Z'
    # in the same fashion.
    df = table.to_pandas()

if __name__ == '__main__':
    test_memory_usage()
```
Here are the resulting memory allocations summarised with memray:
jemalloc with 7 columns:
```
📦 Total memory allocated:
178.127MB
📊 Histogram of allocation size:
min: 1.000B
--------------------------------------------
< 4.000B   : 3548  ▇▇
< 24.000B  : 1253  ▇
< 119.000B : 14686 ▇▇▇▇▇▇▇
< 588.000B : 3746  ▇▇
< 2.827KB  : 58959 ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇
< 13.924KB : 2550  ▇▇
< 68.569KB : 403   ▇
< 337.661KB: 84    ▇
< 1.624MB  : 24    ▇
<=7.996MB  : 20    ▇
--------------------------------------------
max: 7.996MB
```
jemalloc with 26 columns:
```
📦 Total memory allocated:
238.229MB
📊 Histogram of allocation size:
min: 1.000B
--------------------------------------------
< 4.000B   : 3545  ▇▇
< 24.000B  : 1627  ▇
< 119.000B : 15086 ▇▇▇▇▇▇▇
< 588.000B : 3882  ▇▇
< 2.828KB  : 58971 ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇
< 13.929KB : 2552  ▇▇
< 68.593KB : 403   ▇
< 337.794KB: 84    ▇
< 1.625MB  : 24    ▇
<=8.000MB  : 47    ▇
--------------------------------------------
max: 8.000MB
```
mimalloc with 7 columns:
```
📦 Total memory allocated:
380.166MB
📊 Histogram of allocation size:
min: 1.000B
--------------------------------------------
< 6.000B   : 3548  ▇▇
< 36.000B  : 7470  ▇▇▇▇
< 222.000B : 9524  ▇▇▇▇▇
< 1.319KB  : 57271 ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇
< 7.999KB  : 6492  ▇▇▇
< 48.503KB : 775   ▇
< 294.066KB: 150   ▇
< 1.741MB  : 26    ▇
< 10.556MB : 1     ▇
<=64.000MB : 4     ▇
--------------------------------------------
max: 64.000MB
```
mimalloc with 26 columns:
```
📦 Total memory allocated:
1.434GB
📊 Histogram of allocation size:
min: 1.000B
--------------------------------------------
< 6.000B   : 3545  ▇▇
< 36.000B  : 7845  ▇▇▇▇
< 222.000B : 10001 ▇▇▇▇▇
< 1.319KB  : 57332 ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇
< 7.999KB  : 6501  ▇▇▇
< 48.503KB : 794   ▇
< 294.066KB: 150   ▇
< 1.741MB  : 26    ▇
< 10.556MB : 1     ▇
<=64.000MB : 21    ▇
--------------------------------------------
max: 64.000MB
```
You can see how dramatically memory usage increases, even for a very small table.
My proposal is to run the conversion in parallel only when it is likely to make a substantial performance difference, i.e. for tables above a certain size. I'm not sure what that threshold should be, but once the code has been refactored, we can run experiments to reach a data-informed decision.
Component(s)
C++