Distributing large input data over worker processes? #996
Description
Joblib's Parallel is a fascinating tool for easy loop parallelization (thank you for developing a nice package!), but when one starts to process large input data, performance can degrade badly.
Is there some recommended way to distribute large input data over worker processes?
I found a post, "Copy large object only once per Joblib process", on Stack Overflow, but it has no answers. This question may be related to:
- No simple way to pass initializer for process #381: because input data must be initialized once per process
- [WIP] introduce object shelving #619: object shelving might be used for storing input data (?)
Below is a detailed description with examples:
Suppose I have a function that takes very large (constant) input data as an argument. The data is a complicated Python object (not numpy data), so shared memory does not work well here. For this case, I would like to parallelize over an extra argument (say, some index).
```python
def func(data, i):
    # computation with data and i
    ...

data = ...  # very large
results = Parallel(n_jobs=16)(delayed(func)(data, i) for i in range(1000))
```

This becomes unacceptably slow. I could store the data as a global variable:
```python
data = ...  # very large

def func(i):
    # computation with data and i
    ...

results = Parallel(n_jobs=16)(delayed(func)(i) for i in range(1000))
```

which is also very slow. I guess that in both cases the data is pickled by cloudpickle 1000 times (dumping the data on the master process is probably the bottleneck).
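A rough back-of-the-envelope sketch of why this hurts: the arguments of every `delayed()` call (or, for a global, the function's closure) are serialized per task, so a large object is dumped 1000 times when 16 copies (one per worker) would suffice. This pure-stdlib snippet only estimates the serialized volume; the sizes are stand-ins:

```python
import pickle

# Stand-in for the large, complicated Python object.
data = {i: "x" * 100 for i in range(1000)}
n_tasks = 1000
n_workers = 16

payload = len(pickle.dumps(data))
per_task_bytes = payload * n_tasks      # cost of passing `data` with every call
per_worker_bytes = payload * n_workers  # the minimum actually needed

# Passing data per task serializes ~62x more bytes than per worker (1000/16).
print(per_task_bytes // per_worker_bytes)
```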
The performance becomes better if I manually pickle the data, like

```python
import os
import pickle

data = ...  # very large
with open("some_unique_name.tmp", mode="wb") as f:
    pickle.dump(data, f)

def func(i):
    with open("some_unique_name.tmp", mode="rb") as f:
        data = pickle.load(f)
    # computation with data and i
    ...

results = Parallel(n_jobs=16)(delayed(func)(i) for i in range(1000))
os.remove("some_unique_name.tmp")
```

This is much faster than the two examples above, but
- the pickle file is loaded 1000 times (once per task); 16 times (once per worker) should be enough,
- it is not neat; it should be possible to write this more simply.
Any better ideas?