
Exponential backoff on reading and writing#237

Merged
martindurant merged 1 commit into fsspec:master from martindurant:backoff
Sep 9, 2019

Conversation

@martindurant
Member

No description provided.

@martindurant
Member Author

@birdsarah , want to try again?

@birdsarah

@martindurant the reason I have not been using Dask retries previously is that I was losing data. Will this change correctly start writing from where it's appropriate and be resilient to the failure?

@martindurant
Member Author

the reason I have not been using Dask retries previously is that I was losing data.

This is a separate issue which we do not understand. It is expected that a failing task may leave behind partial data (this would happen for a local file too), but we don't know why the retried task does not overwrite it; perhaps a subtle race condition. We have not yet considered how fsspec's transaction concept (all-or-nothing writing of a file) could be used by Dask.
The change here should make it less likely that a task fails in the first place.

@birdsarah

birdsarah commented Sep 9, 2019

I will test this when I can, but I have been forced to give up and find other solutions (I was/am two weeks behind schedule) so I'm not directly being impacted by this issue at the moment.

@birdsarah

One more question which would change my calculus on how quickly I invest time in testing this @martindurant. Can you explain why this code fixes a problem that doesn't exist in earlier versions of s3fs?

@martindurant
Member Author

Yes I can!
On the hypothesis that what you are seeing is rate-limiting behaviour, this PR changes the way retries happen from "10 rapid-fire requests, followed by failure" to "10 requests separated by increasing time gaps".
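The retry pattern described here can be sketched as follows. This is a minimal illustration of the idea, not the actual s3fs code: the function name, base delay, and growth factor are all assumptions.

```python
import time


def retry_with_backoff(func, retries=10, base_delay=0.1, factor=1.7):
    """Retry func() with exponentially growing waits between attempts.

    Illustrative sketch only; the constants are assumptions, not the
    values used by s3fs.
    """
    for attempt in range(retries):
        try:
            return func()
        except Exception:
            if attempt == retries - 1:
                raise  # out of attempts: surface the original error
            # sleep base_delay * factor**attempt, so the gap grows each retry
            time.sleep(base_delay * factor ** attempt)
```

The point is that each failed attempt waits longer than the last, rather than hammering the service with back-to-back requests.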

@birdsarah

But why does s3fs < 0.3 not hit API rate limits when doing the exact same ETL?

@birdsarah

(may not just be s3fs < 0.3 - may be some combination thing with dask 2.2 and higher)

@martindurant
Member Author

Indeed, a lot of things have changed in the meantime, and so it's hard to say. That's what the extra logging was supposed to help with.

@birdsarah

Thanks for this extra info.

@birdsarah

Just tried this, it doesn't help. Got the Permission Error: Access Denied pretty quickly.

@ofirnk

ofirnk commented Sep 18, 2019

Hey, jumping in -
I think such a change should be behind a flag, or go out under a major version bump (e.g. 0.4.x).
We also use s3fs in AWS lambda, and this has caused major performance degradation in one of our environments.
It seems that connection errors happen A LOT in lambda+s3 access, and most of them go unnoticed.
So I suggest to:

  1. Put this behind a flag
  2. Ease the backoff a lot; these are the sleep times for the first 10 attempts:

attempt | sleep (ms)
0 | 100
1 | 170
2 | 289
3 | 491
4 | 835
5 | 1420
6 | 2414
7 | 4103
8 | 6976
9 | 11859

100 ms for the first failure, ~1.4 seconds for the 5th, almost 12 seconds for the 10th attempt... That seems like quite a lot, doesn't it?

  3. How about more delicate backoff times, like:

attempt | sleep (s)
0 | 0.01
1 | 0.03
2 | 0.11
3 | 0.29
4 | 0.58
5 | 0.83
6 | 0.95
7 | 0.99
8 | 1.00
9 | 1.01
...

These could be given by a formula or hard-coded, and be hard-limited at 1 second IMO.
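The sleep times in the first table above are consistent with a simple geometric formula. A sketch, with the base and growth factor inferred from the numbers rather than read from the s3fs source:

```python
def backoff_ms(attempt, base_ms=100, factor=1.7):
    """Sleep time in ms before retry number `attempt` (0-indexed).

    base_ms and factor are inferred from the table in the comment
    above; they are assumptions, not values taken from s3fs.
    """
    return round(base_ms * factor ** attempt)
```

With these constants, `backoff_ms(0)` gives 100 ms and `backoff_ms(9)` gives 11859 ms, matching the table, which is why the total wait grows so quickly across attempts.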

@martindurant
Member Author

@ofirnk you have a point. How about the same pattern, but a factor of 10 smaller? It is supposed to be exponential, after all.

@ofirnk

ofirnk commented Sep 18, 2019

@martindurant yep, 10x smaller makes more sense, maxing out at the 10th attempt.
But I'd still put it behind a flag (default=off) to reduce the surprise factor, as an extra 10-20 ms can look like an unintentional performance regression.

@martindurant
Member Author

The retries are still the exception, rather than the rule, no? I'll reduce the time, but it seems to me that having the backoff is a good default, since that's what AWS recommends. I'll push it to maybe 20x smaller?

@birdsarah

@ofirnk I noticed you said "It seems that connection errors happen A LOT in lambda+s3 access, and most of them go unnoticed." I am not on Lambda, but on EMR + S3 (with ServerSideEncryption). I cannot get the connection errors to go unnoticed; they make my Dask writes fail. Do you have any tips?

@ofirnk

ofirnk commented Sep 18, 2019

@birdsarah no, but maybe the encryption pushes the error rate too high... for us it's mainly a service degradation.

@martindurant I think it's just too common to be called an exception :) but yeah, I think it makes sense; something like Math.min(exp_wait(retry_number), exp_wait(5)) with 20x smaller wait times.
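The pseudocode above could look like this in Python. This is a hedged sketch: `exp_wait`, the 0.005 s base (the merged 0.1 s base divided by 20), and the 1.7 factor are assumptions inferred from the thread, not code from s3fs.

```python
def exp_wait(n, base=0.005, factor=1.7):
    # 20x smaller than the merged PR's 100 ms base (0.1 s / 20 = 0.005 s);
    # base and factor are assumptions inferred from the discussion
    return base * factor ** n


def capped_wait(retry_number, cap=5):
    # grow the wait exponentially, but never beyond the cap-th attempt's
    # wait (~71 ms with these constants), as ofirnk suggests
    return min(exp_wait(retry_number), exp_wait(cap))
```

The cap keeps late retries from stalling for seconds while still spacing out the early ones.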
