store in unreliable targets #3299
Description
We are using dask.array.store (via xarray, zarr, and gcsfs) to store many TB of data in google cloud storage. There are potentially millions of individual dask chunks involved. We are getting lots of errors (see pangeo-data/pangeo#150). There are many layers to this stack, and it is hard to debug. Often, we see errors directly from the google cloud storage api.
However, one feature that I think would help is if dask.array.store were more resilient to errors in the target. When performing millions of store operations against any remote data store, the probability that at least one write fails is high. Currently, dask cannot recover from such errors.
Consider this example storage target:

```python
import random

fail_probability = 10  # i.e. fails about 1 write in 10

class UnreliableStore(object):
    def __setitem__(self, key, item):
        if not random.choice(range(fail_probability)):
            raise RuntimeError('Random failure!')
        # don't actually store anything, it's just an example
```

The UnreliableStore will raise an error one out of every ten times you try to store into it, bringing down the whole store operation.
```python
import dask.array as dsa

da = dsa.zeros(shape=(1000, 1000), chunks=(100, 100))
da.store(UnreliableStore())  # almost certain to raise the RuntimeError
```

Instead, it would be wonderful to be able to say
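To see why failure is nearly certain here: the array above has 100 chunks, and each chunk write fails independently with probability 0.1, so the chance that every write succeeds is 0.9^100. A quick back-of-the-envelope check (plain Python, not dask):

```python
# The (1000, 1000) array with (100, 100) chunks has 10 x 10 = 100 chunks.
n_chunks = (1000 // 100) * (1000 // 100)
p_success_per_write = 0.9  # each __setitem__ fails 1 time in 10

# Probability that all 100 independent chunk writes succeed.
p_all_succeed = p_success_per_write ** n_chunks
print(p_all_succeed)  # roughly 2.7e-05: the full store almost never succeeds
```

With millions of chunks, even a far lower per-write failure rate makes an unrecovered failure all but inevitable.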
```python
da.store(UnreliableStore(), retries=10)
```

I believe this could help overcome many of the problems we face in pushing very large datasets to cloud storage.
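In the meantime, retries can be bolted on at the target level by wrapping the store in an object whose `__setitem__` retries failed writes. This is only a sketch of the idea: `RetryingStore`, `max_retries`, and `delay` are hypothetical names, not part of dask's API (a `retries=` keyword on `store()` is the proposal above):

```python
import time


class RetryingStore(object):
    """Wrap a __setitem__-style target, retrying failed writes.

    ``RetryingStore`` is an illustrative workaround, not a dask feature.
    """

    def __init__(self, target, max_retries=10, delay=0.0):
        self.target = target
        self.max_retries = max_retries
        self.delay = delay  # seconds to wait between attempts

    def __setitem__(self, key, item):
        for attempt in range(self.max_retries + 1):
            try:
                self.target[key] = item
                return  # write succeeded
            except RuntimeError:
                if attempt == self.max_retries:
                    raise  # give up after the final attempt
                time.sleep(self.delay)
```

With 10 retries per chunk and a 10% failure rate, a given chunk only exhausts all 11 attempts with probability 0.1^11, so even millions of chunk writes would almost surely all succeed.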