joblib.Parallel is not efficient at scheduling small tasks because of inter-process communication overhead. Currently this can be worked around by writing a wrapper function that works on a group of tasks and grouping the tasks manually before calling joblib.Parallel.
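For reference, the manual workaround could look roughly like the sketch below. The helper names `chunked`, `run_group` and `grouped_parallel` are made up for illustration and are not part of joblib; only `Parallel` and `delayed` are real joblib calls.

```python
from itertools import islice

from joblib import Parallel, delayed


def chunked(iterable, group_size):
    """Yield successive lists of at most group_size items (hypothetical helper)."""
    iterator = iter(iterable)
    while True:
        group = list(islice(iterator, group_size))
        if not group:
            return
        yield group


def run_group(func, group):
    """Apply func to every input of one group inside a single worker task."""
    return [func(item) for item in group]


def grouped_parallel(func, inputs, n_jobs, group_size):
    """Dispatch inputs in groups and return the results as one flat list."""
    grouped_results = Parallel(n_jobs=n_jobs)(
        delayed(run_group)(func, group) for group in chunked(inputs, group_size)
    )
    # Parallel returns one list per group; flatten them back into a single list.
    return [result for group in grouped_results for result in group]
```

It would then be called as `grouped_parallel(my_function, many_many_inputs, n_jobs=4, group_size=10)`, so each worker task pays the communication cost once per group of 10 inputs instead of once per input.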
It would be more convenient to make joblib.Parallel do grouped dispatch internally by just passing the size of the group:
output = joblib.Parallel(n_jobs=42, group_size=10)(
    delayed(my_function)(one_input) for one_input in many_many_inputs)