Distributing large input data over worker processes? #996
Description
Joblib's Parallel is a fascinating tool for easy loop parallelization (thank you for developing a nice package!), but when one starts to process large input data, performance can degrade badly.
Is there some recommended way to distribute large input data over worker processes?
I found a post, "Copy large object only once per Joblib process", on Stack Overflow, but it has no answers. This question may be related to:
- No simple way to pass initializer for process #381: because input data must be initialized once per process
- [WIP] introduce object shelving #619: object shelving might be used for storing input data (?)
Below is a detailed description with examples:
Suppose I have a function that takes very large (constant) input data as an argument. The data is a complicated Python object (not numpy data), so shared memory does not work well here. For this case, I would like to parallelize over an extra argument (say, some index).
```python
def func(data, i):
    # computation with data and i
    ...

data = ...  # very large
results = Parallel(n_jobs=16)(delayed(func)(data, i) for i in range(1000))
```

This becomes unacceptably slow. I could store the data as a global variable:
```python
data = ...  # very large

def func(i):
    # computation with data and i
    ...

results = Parallel(n_jobs=16)(delayed(func)(i) for i in range(1000))
```

which is also very slow. I guess that in both cases the data is pickled by cloudpickle 1000 times (dumping the data on the master process is probably the bottleneck).
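A rough back-of-the-envelope sketch of why this hurts: the arguments of every `delayed()` call (or, for a global, the function's closure) are serialized per task, so a large object is dumped 1000 times when 16 copies (one per worker) would suffice. This pure-stdlib snippet only estimates the serialized volume; the sizes are stand-ins:

```python
import pickle

# Stand-in for the large, complicated Python object.
data = {i: "x" * 100 for i in range(1000)}
n_tasks = 1000
n_workers = 16

payload = len(pickle.dumps(data))
per_task_bytes = payload * n_tasks      # cost of passing `data` with every call
per_worker_bytes = payload * n_workers  # the minimum actually needed

# Passing data per task serializes ~62x more bytes than per worker (1000/16).
print(per_task_bytes // per_worker_bytes)
```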
The performance becomes better if I manually pickle the data, like

```python
import os
import pickle

data = ...  # very large
with open("some_unique_name.tmp", mode="wb") as f:
    pickle.dump(data, f)

def func(i):
    with open("some_unique_name.tmp", mode="rb") as f:
        data = pickle.load(f)
    # computation with data and i
    ...

results = Parallel(n_jobs=16)(delayed(func)(i) for i in range(1000))
os.remove("some_unique_name.tmp")
```

This is much faster than the two examples above, but
- the pickle file is loaded 1000 times (once per task); 16 times (once per worker) should be enough,
- it is not neat; it should be possible to write this more simply.
Any better ideas?