Performance: joblib.Parallel is significantly slower than ProcessPoolExecutor for tasks with large objects #1733

@liblaf

Description

Problem description

When running parallel tasks on a large, complex object (in this case, a pyvista.PolyData mesh), joblib.Parallel is substantially outperformed by the standard concurrent.futures.ProcessPoolExecutor.

Even when a manual batch_size is specified to reduce dispatching overhead, joblib remains several times slower (and with automatic batching, more than 50x slower). The performance difference suggests that joblib may be incurring significant overhead from repeatedly serializing the large input object, whereas ProcessPoolExecutor.map appears to handle this more efficiently.
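As a rough sanity check on that hypothesis, the serialization cost of a comparably sized object can be measured directly. The sketch below uses a plain numpy array as a stand-in for the mesh's point data (an assumption; the real PolyData pickles through VTK and is likely more expensive):

```python
import pickle
import time

import numpy as np

# Stand-in for the large mesh: ~242k points of float64, roughly the size of
# pv.Box(level=200).points (this array is an assumption, not the actual mesh).
points = np.random.default_rng(0).random((242_408, 3))

start = time.perf_counter()
payload = pickle.dumps(points)
elapsed = time.perf_counter() - start

print(f"pickled size: {len(payload) / 1e6:.1f} MB in {elapsed * 1e3:.1f} ms")
# If a payload of this size is re-serialized for every batch (or worse, every
# task), dispatch cost quickly dwarfs the trivial per-task work in `process`.
```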

Minimal Reproducible Example

This script sets up a simple parallel task accessing data from a large pyvista.PolyData object and benchmarks joblib against concurrent.futures.

# /// script
# requires-python = ">=3.12"
# dependencies = [
#     "joblib",
#     "pyvista",
# ]
# ///
import concurrent.futures
import itertools
import time

import joblib
import pyvista as pv


def process(mesh: pv.PolyData, idx: int) -> float:
    """A simple function that accesses data from the mesh."""
    return mesh.points[idx, 0]


def main() -> None:
    # Create a large mesh object to demonstrate the issue.
    # A smaller `level` will reduce the mesh size and the performance gap.
    mesh: pv.PolyData = pv.Box(level=200)
    N_JOBS: int = 1000  # number of tasks dispatched (distinct from the 8 workers)
    print("Mesh details:")
    print(mesh)
    print("-" * 20)

    # Benchmark joblib with automatic batching
    with joblib.parallel_config(n_jobs=8, verbose=1, prefer="processes"):
        time_start: float = time.perf_counter()
        parallel = joblib.Parallel(batch_size="auto")
        _ = parallel(joblib.delayed(process)(mesh, i) for i in range(N_JOBS))
        time_end: float = time.perf_counter()
        rate: float = N_JOBS / (time_end - time_start)
        print(f"joblib (auto batch size): {rate:.2f} it/sec")

        # Benchmark joblib with manual batching
        time_start: float = time.perf_counter()
        parallel = joblib.Parallel(batch_size=N_JOBS // 16)
        _ = parallel(joblib.delayed(process)(mesh, i) for i in range(N_JOBS))
        time_end: float = time.perf_counter()
        rate: float = N_JOBS / (time_end - time_start)
        print(f"joblib (manual batch size): {rate:.2f} it/sec")

    # Benchmark concurrent.futures.ProcessPoolExecutor
    time_start = time.perf_counter()
    with concurrent.futures.ProcessPoolExecutor(max_workers=8) as executor:
        # executor.map returns a lazy iterator; wrap it in list() so the
        # results are actually retrieved in the parent, matching joblib,
        # which returns a fully materialized list.
        _ = list(
            executor.map(
                process,
                itertools.repeat(mesh),
                range(N_JOBS),
                chunksize=N_JOBS // 16,
            )
        )
    time_end = time.perf_counter()
    rate: float = N_JOBS / (time_end - time_start)
    print(f"concurrent.futures: {rate:.2f} it/sec")


if __name__ == "__main__":
    main()

Observed Behavior

| Implementation             | Throughput (it/sec) |
| -------------------------- | ------------------- |
| joblib (auto batch size)   | 32.26               |
| joblib (manual batch size) | 322.58              |
| concurrent.futures         | 1829.17             |

The output clearly shows that concurrent.futures is significantly faster.

Mesh details:
PolyData (0x7f3e8c1e9f00)
  N Cells:    242406
  N Points:   242408
  X Bounds:   -1.000e+00, 1.000e+00
  Y Bounds:   -1.000e+00, 1.000e+00
  Z Bounds:   -1.000e+00, 1.000e+00
  N Arrays:   0
--------------------
[Parallel(n_jobs=8)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=8)]: Done  34 tasks      | elapsed:    1.4s
[Parallel(n_jobs=8)]: Done 184 tasks      | elapsed:    6.1s
[Parallel(n_jobs=8)]: Done 434 tasks      | elapsed:   13.7s
[Parallel(n_jobs=8)]: Done 784 tasks      | elapsed:   24.4s
[Parallel(n_jobs=8)]: Done 1000 out of 1000 | elapsed:   31.0s finished
joblib (auto batch size): 32.26 it/sec
[Parallel(n_jobs=8)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=8)]: Done 668 tasks      | elapsed:    1.4s
[Parallel(n_jobs=8)]: Done 1000 out of 1000 | elapsed:    3.1s finished
joblib (manual batch size): 322.58 it/sec
concurrent.futures: 1829.17 it/sec
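Converting the measured throughputs to per-task wall time makes the overhead concrete; since process itself is a trivial array lookup, nearly all of this time is dispatch and serialization cost:

```python
# Measured throughputs from the run above (it/sec).
rates = {
    "joblib (auto batch size)": 32.26,
    "joblib (manual batch size)": 322.58,
    "concurrent.futures": 1829.17,
}
per_task_ms = {name: 1000.0 / rate for name, rate in rates.items()}
for name, ms in per_task_ms.items():
    print(f"{name}: {ms:.2f} ms/task")
# → roughly 31.00, 3.10, and 0.55 ms/task respectively
```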

Expected Behavior

Joblib's performance would be expected to be comparable to ProcessPoolExecutor's, especially since joblib is a library specialized for exactly this kind of parallel workload. While some overhead is expected, a >5x gap even with manual batching (and >50x with automatic batching) seems excessive.

Environment

  • Python: 3.12.10
  • joblib: 1.5.1
  • pyvista: 0.45.3
  • OS: Linux
