[Python] Only convert in parallel for the ConsolidatedBlockCreator class for large data #40301
Description
Describe the enhancement requested
The `ConsolidatedBlockCreator` class runs the column conversion in parallel, creating a scalar memo table for each column, using up to `pa.cpu_count()` threads.
For performance reasons, jemalloc and mimalloc maintain allocations in per-thread memory segments to reduce contention between threads.
This means that if a user calls `table.to_pandas(split_blocks=False)` on a small table, a disproportionately large amount of memory gets allocated just to build the scalar memo tables: both jemalloc and mimalloc will essentially allocate a chunk of memory per column.
Here is some code to reproduce the issue (note that `pa.table` expects array-like column values, so the single values are wrapped in lists):

```python
import pyarrow as pa

def test_memory_usage():
    table = pa.table({'A': ['a'], 'B': ['b'], 'C': ['c'], 'D': ['d'],
                      'E': ['e'], 'F': ['f'], 'G': ['g']})
    # For the 26-column case, extend the dict with 'H' through 'Z'
    # in the same fashion.
    df = table.to_pandas()

if __name__ == '__main__':
    test_memory_usage()
```
Here are the resulting memory allocations summarised with memray:
jemalloc with 7 columns:
```
📦 Total memory allocated:
178.127MB
📊 Histogram of allocation size:
min: 1.000B
--------------------------------------------
< 4.000B   : 3548  ▇▇
< 24.000B  : 1253  ▇
< 119.000B : 14686 ▇▇▇▇▇▇▇
< 588.000B : 3746  ▇▇
< 2.827KB  : 58959 ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇
< 13.924KB : 2550  ▇▇
< 68.569KB : 403   ▇
< 337.661KB: 84    ▇
< 1.624MB  : 24    ▇
<=7.996MB  : 20    ▇
--------------------------------------------
max: 7.996MB
```
jemalloc with 26 columns:
```
📦 Total memory allocated:
238.229MB
📊 Histogram of allocation size:
min: 1.000B
--------------------------------------------
< 4.000B   : 3545  ▇▇
< 24.000B  : 1627  ▇
< 119.000B : 15086 ▇▇▇▇▇▇▇
< 588.000B : 3882  ▇▇
< 2.828KB  : 58971 ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇
< 13.929KB : 2552  ▇▇
< 68.593KB : 403   ▇
< 337.794KB: 84    ▇
< 1.625MB  : 24    ▇
<=8.000MB  : 47    ▇
--------------------------------------------
max: 8.000MB
```
mimalloc with 7 columns:
```
📦 Total memory allocated:
380.166MB
📊 Histogram of allocation size:
min: 1.000B
--------------------------------------------
< 6.000B   : 3548  ▇▇
< 36.000B  : 7470  ▇▇▇▇
< 222.000B : 9524  ▇▇▇▇▇
< 1.319KB  : 57271 ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇
< 7.999KB  : 6492  ▇▇▇
< 48.503KB : 775   ▇
< 294.066KB: 150   ▇
< 1.741MB  : 26    ▇
< 10.556MB : 1     ▇
<=64.000MB : 4     ▇
--------------------------------------------
max: 64.000MB
```
mimalloc with 26 columns:
```
📦 Total memory allocated:
1.434GB
📊 Histogram of allocation size:
min: 1.000B
--------------------------------------------
< 6.000B   : 3545  ▇▇
< 36.000B  : 7845  ▇▇▇▇
< 222.000B : 10001 ▇▇▇▇▇
< 1.319KB  : 57332 ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇
< 7.999KB  : 6501  ▇▇▇
< 48.503KB : 794   ▇
< 294.066KB: 150   ▇
< 1.741MB  : 26    ▇
< 10.556MB : 1     ▇
<=64.000MB : 21    ▇
--------------------------------------------
max: 64.000MB
```
You can see how dramatically memory usage increases, even for a very small table.
My proposal is to run the conversion in parallel only when it is likely to make a substantial performance difference, i.e. for tables above a certain size. I'm not sure what that threshold should be, but once the code has been refactored, we can run experiments to reach a data-informed decision.
Component(s)
C++