store in unreliable targets #3299

@rabernat

Description

We are using dask.array.store (via xarray, zarr, and gcsfs) to store many TB of data in google cloud storage. There are potentially millions of individual dask chunks involved. We are getting lots of errors (see pangeo-data/pangeo#150). There are many layers to this stack, and it is hard to debug. Often, we see errors directly from the google cloud storage api.

However, one feature that I think would help is if dask.array.store were more resilient to errors in the target. When performing millions of store operations against any remote data store, the probability that at least one of them fails is very high, and currently dask cannot recover from such errors.

Consider this example storage target

import random

fail_one_in = 10  # writes fail roughly one in ten times

class UnreliableStore(object):
    def __setitem__(self, key, item):
        # random.choice(range(10)) is 0 with probability 1/10
        if not random.choice(range(fail_one_in)):
            raise RuntimeError('Random failure!')
        # don't actually store anything, it's just an example

The UnreliableStore raises an error on roughly one out of every ten writes, which brings down the whole store operation.

import dask.array as dsa
da = dsa.zeros(shape=(1000,1000), chunks=(100,100))
da.store(UnreliableStore()) # almost certain to raise the RuntimeError
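To see why failure is nearly certain: the array above is split into 10 × 10 = 100 chunks, and each chunk write succeeds with probability 0.9, so the probability that every write succeeds is 0.9**100. A quick back-of-the-envelope check:

```python
# 1000x1000 array with 100x100 chunks -> 10 * 10 = 100 chunk writes
n_chunks = (1000 // 100) * (1000 // 100)

# each write succeeds with probability 0.9, independently
p_all_succeed = 0.9 ** n_chunks
print(f"{p_all_succeed:.2e}")  # about 2.66e-05
```

So a single store call succeeds only a few times in every hundred thousand attempts.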

Instead, it would be wonderful to be able to say

da.store(UnreliableStore(), retries=10)

I believe this could mitigate many of the problems we face when pushing very large datasets to cloud storage.
