
Distributing large input data over worker processes? #996

@tueda

Description

Joblib's Parallel is a fascinating tool for easy loop parallelization (thank you for developing a nice package!), but performance issues can arise once the input data becomes large.

Is there some recommended way to distribute large input data over worker processes?

I found a Stack Overflow post, "Copy large object only once per Joblib process", but it has no answer. Perhaps this question is considered related to:

Below is a detailed description with examples:

Suppose I have a function that takes very large (constant) input data as an argument. The data is a complicated Python object (not numpy data), so shared memory does not work well here. I would like to parallelize the computation with respect to an extra argument (say, an index).

from joblib import Parallel, delayed

def func(data, i):
    # computation with data and i
    ...

data = ...  # very large

results = Parallel(n_jobs=16)(delayed(func)(data, i) for i in range(1000))

This becomes unacceptably slow. I could store the data as a global variable:

from joblib import Parallel, delayed

data = ...  # very large

def func(i):
    # computation with the global data and i
    ...

results = Parallel(n_jobs=16)(delayed(func)(i) for i in range(1000))

which is also very slow. I guess that in both cases above the data is pickled by cloudpickle 1000 times, once per task (and dumping the data on the master process is probably the bottleneck).
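One way to check this hypothesis (a stdlib-only sketch; the list of dicts below is just a stand-in for the real object) is to time a single pickle.dumps and extrapolate:

```python
import pickle
import time

# Stand-in for the large, complicated (non-numpy) object.
data = [{"key": i, "value": str(i)} for i in range(100_000)]

start = time.perf_counter()
payload = pickle.dumps(data)
elapsed = time.perf_counter() - start

# If one dump takes t seconds, dispatching 1000 tasks that each
# serialize `data` costs roughly 1000 * t seconds on the master alone.
print(f"{len(payload):,} bytes pickled in {elapsed:.3f} s")
```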

Performance becomes better if I manually pickle the data once and reload it in each task:

import os
import pickle

from joblib import Parallel, delayed

data = ...  # very large

with open("some_unique_name.tmp", mode="wb") as f:
    pickle.dump(data, f)

def func(i):
    # every task reloads the pickle from disk
    with open("some_unique_name.tmp", mode="rb") as f:
        data = pickle.load(f)
    # computation with data and i
    ...

results = Parallel(n_jobs=16)(delayed(func)(i) for i in range(1000))

os.remove("some_unique_name.tmp")

This is much faster than the two examples above, but

  1. the pickle file is loaded 1000 times (once per task), while loading it 16 times (once per worker) should be enough, and
  2. it is not neat; there should be an easier way to write this.

Any better ideas?
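For reference, a stdlib-only sketch of the copy-on-write alternative (my own addition, POSIX-only, not joblib's API): with the fork start method a child process inherits module-level globals without any pickling, so tasks can read `data` for free:

```python
import multiprocessing as mp

data = list(range(1_000_000))  # module-level "large" object

def worker(i, out):
    # With the fork start method the child inherits `data`
    # copy-on-write; nothing is pickled per task.
    out.put(data[i])

def run(indices):
    ctx = mp.get_context("fork")  # POSIX-only
    out = ctx.Queue()
    procs = [ctx.Process(target=worker, args=(i, out)) for i in indices]
    for p in procs:
        p.start()
    results = [out.get() for _ in procs]
    for p in procs:
        p.join()
    return sorted(results)
```

This would not help on Windows (which only has spawn), and unlike Parallel it manages processes by hand, so it is only meant to illustrate where the copying cost comes from.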
