Performance: joblib.Parallel is significantly slower than ProcessPoolExecutor for tasks with large objects #1733

@liblaf

Description

Problem description

When running parallel tasks on a large, complex object (in this case, a pyvista.PolyData mesh), joblib.Parallel is substantially outperformed by the standard concurrent.futures.ProcessPoolExecutor.

Even when a manual batch_size is specified to reduce dispatching overhead, joblib remains several times slower (and with automatic batching, more than 50x slower). The performance difference suggests that joblib may be incurring significant overhead from repeatedly serializing the large input object, whereas ProcessPoolExecutor.map appears to handle this more efficiently.
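As a rough sanity check on that hypothesis, the serialization cost of a comparably sized object can be measured directly. The sketch below uses a plain numpy array as a stand-in for the mesh's point data (an assumption; the real PolyData pickles through VTK and is likely more expensive):

```python
import pickle
import time

import numpy as np

# Stand-in for the large mesh: ~242k points of float64, roughly the size of
# pv.Box(level=200).points (this array is an assumption, not the actual mesh).
points = np.random.default_rng(0).random((242_408, 3))

start = time.perf_counter()
payload = pickle.dumps(points)
elapsed = time.perf_counter() - start

print(f"pickled size: {len(payload) / 1e6:.1f} MB in {elapsed * 1e3:.1f} ms")
# If a payload of this size is re-serialized for every batch (or worse, every
# task), dispatch cost quickly dwarfs the trivial per-task work in `process`.
```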

Minimal Reproducible Example

This script sets up a simple parallel task accessing data from a large pyvista.PolyData object and benchmarks joblib against concurrent.futures.

# /// script
# requires-python = ">=3.12"
# dependencies = [
#     "joblib",
#     "pyvista",
# ]
# ///
import concurrent.futures
import itertools
import time

import joblib
import pyvista as pv


def process(mesh: pv.PolyData, idx: int) -> float:
    """A simple function that accesses data from the mesh."""
    return mesh.points[idx, 0]


def main() -> None:
    # Create a large mesh object to demonstrate the issue.
    # A smaller `level` will reduce the mesh size and the performance gap.
    mesh: pv.PolyData = pv.Box(level=200)
    N_JOBS: int = 1000  # number of tasks dispatched (distinct from the 8 workers)
    print("Mesh details:")
    print(mesh)
    print("-" * 20)

    # Benchmark joblib with automatic batching
    with joblib.parallel_config(n_jobs=8, verbose=1, prefer="processes"):
        time_start: float = time.perf_counter()
        parallel = joblib.Parallel(batch_size="auto")
        _ = parallel(joblib.delayed(process)(mesh, i) for i in range(N_JOBS))
        time_end: float = time.perf_counter()
        rate: float = N_JOBS / (time_end - time_start)
        print(f"joblib (auto batch size): {rate:.2f} it/sec")

        # Benchmark joblib with manual batching
        time_start: float = time.perf_counter()
        parallel = joblib.Parallel(batch_size=N_JOBS // 16)
        _ = parallel(joblib.delayed(process)(mesh, i) for i in range(N_JOBS))
        time_end: float = time.perf_counter()
        rate: float = N_JOBS / (time_end - time_start)
        print(f"joblib (manual batch size): {rate:.2f} it/sec")

    # Benchmark concurrent.futures.ProcessPoolExecutor
    time_start = time.perf_counter()
    with concurrent.futures.ProcessPoolExecutor(max_workers=8) as executor:
        # executor.map returns a lazy iterator; wrap it in list() so the
        # results are actually retrieved in the parent, matching joblib,
        # which returns a fully materialized list.
        _ = list(
            executor.map(
                process,
                itertools.repeat(mesh),
                range(N_JOBS),
                chunksize=N_JOBS // 16,
            )
        )
    time_end = time.perf_counter()
    rate: float = N_JOBS / (time_end - time_start)
    print(f"concurrent.futures: {rate:.2f} it/sec")


if __name__ == "__main__":
    main()

Observed Behavior

| Implementation             | Throughput (it/sec) |
| -------------------------- | ------------------- |
| joblib (auto batch size)   | 32.26               |
| joblib (manual batch size) | 322.58              |
| concurrent.futures         | 1829.17             |

The output clearly shows that concurrent.futures is significantly faster.

Mesh details:
PolyData (0x7f3e8c1e9f00)
  N Cells:    242406
  N Points:   242408
  X Bounds:   -1.000e+00, 1.000e+00
  Y Bounds:   -1.000e+00, 1.000e+00
  Z Bounds:   -1.000e+00, 1.000e+00
  N Arrays:   0
--------------------
[Parallel(n_jobs=8)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=8)]: Done  34 tasks      | elapsed:    1.4s
[Parallel(n_jobs=8)]: Done 184 tasks      | elapsed:    6.1s
[Parallel(n_jobs=8)]: Done 434 tasks      | elapsed:   13.7s
[Parallel(n_jobs=8)]: Done 784 tasks      | elapsed:   24.4s
[Parallel(n_jobs=8)]: Done 1000 out of 1000 | elapsed:   31.0s finished
joblib (auto batch size): 32.26 it/sec
[Parallel(n_jobs=8)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=8)]: Done 668 tasks      | elapsed:    1.4s
[Parallel(n_jobs=8)]: Done 1000 out of 1000 | elapsed:    3.1s finished
joblib (manual batch size): 322.58 it/sec
concurrent.futures: 1829.17 it/sec
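Converting the measured throughputs to per-task wall time makes the overhead concrete; since process itself is a trivial array lookup, nearly all of this time is dispatch and serialization cost:

```python
# Measured throughputs from the run above (it/sec).
rates = {
    "joblib (auto batch size)": 32.26,
    "joblib (manual batch size)": 322.58,
    "concurrent.futures": 1829.17,
}
per_task_ms = {name: 1000.0 / rate for name, rate in rates.items()}
for name, ms in per_task_ms.items():
    print(f"{name}: {ms:.2f} ms/task")
# → roughly 31.00, 3.10, and 0.55 ms/task respectively
```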

Expected Behavior

Joblib's performance would be expected to be comparable to ProcessPoolExecutor's, especially since joblib is a library specialized for exactly this kind of parallel workload. While some overhead is expected, a >5x gap even with manual batching (and >50x with automatic batching) seems excessive.

Environment

  • Python: 3.12.10
  • joblib: 1.5.1
  • pyvista: 0.45.3
  • OS: Linux
