store in unreliable targets #3299
Description
We are using dask.array.store (via xarray, zarr, and gcsfs) to store many TB of data in google cloud storage. There are potentially millions of individual dask chunks involved. We are getting lots of errors (see pangeo-data/pangeo#150). There are many layers to this stack, and it is hard to debug. Often, we see errors directly from the google cloud storage api.
However, one feature that I think would help is if dask.array.store were more resilient to errors in the target. When performing millions of store operations against any remote data store, the probability that at least one write fails is high. Currently, dask cannot recover from such errors.
Consider this example storage target:

```python
import random

fail_probability = 10  # i.e. fails about 1 write in 10

class UnreliableStore(object):
    def __setitem__(self, key, item):
        if not random.choice(range(fail_probability)):
            raise RuntimeError('Random failure!')
        # don't actually store anything, it's just an example
```

The UnreliableStore will raise an error one out of every ten times you try to store into it, bringing down the whole store operation.
```python
import dask.array as dsa

da = dsa.zeros(shape=(1000, 1000), chunks=(100, 100))
da.store(UnreliableStore())  # almost certain to raise the RuntimeError
```

Instead, it would be wonderful to be able to say
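To see why failure is nearly certain here: the array above has 100 chunks, and each chunk write fails independently with probability 0.1, so the chance that every write succeeds is 0.9^100. A quick back-of-the-envelope check (plain Python, not dask):

```python
# The (1000, 1000) array with (100, 100) chunks has 10 x 10 = 100 chunks.
n_chunks = (1000 // 100) * (1000 // 100)
p_success_per_write = 0.9  # each __setitem__ fails 1 time in 10

# Probability that all 100 independent chunk writes succeed.
p_all_succeed = p_success_per_write ** n_chunks
print(p_all_succeed)  # roughly 2.7e-05: the full store almost never succeeds
```

With millions of chunks, even a far lower per-write failure rate makes an unrecovered failure all but inevitable.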
```python
da.store(UnreliableStore(), retries=10)
```

I believe this could help overcome many of the problems we face in pushing very large datasets to cloud storage.
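In the meantime, retries can be bolted on at the target level by wrapping the store in an object whose `__setitem__` retries failed writes. This is only a sketch of the idea: `RetryingStore`, `max_retries`, and `delay` are hypothetical names, not part of dask's API (a `retries=` keyword on `store()` is the proposal above):

```python
import time


class RetryingStore(object):
    """Wrap a __setitem__-style target, retrying failed writes.

    ``RetryingStore`` is an illustrative workaround, not a dask feature.
    """

    def __init__(self, target, max_retries=10, delay=0.0):
        self.target = target
        self.max_retries = max_retries
        self.delay = delay  # seconds to wait between attempts

    def __setitem__(self, key, item):
        for attempt in range(self.max_retries + 1):
            try:
                self.target[key] = item
                return  # write succeeded
            except RuntimeError:
                if attempt == self.max_retries:
                    raise  # give up after the final attempt
                time.sleep(self.delay)
```

With 10 retries per chunk and a 10% failure rate, a given chunk only exhausts all 11 attempts with probability 0.1^11, so even millions of chunk writes would almost surely all succeed.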