
Exponential backoff on reading and writing#237

Merged
martindurant merged 1 commit into fsspec:master from martindurant:backoff
Sep 9, 2019

Conversation

@martindurant
Member

No description provided.

@martindurant
Member Author

@birdsarah , want to try again?

@birdsarah

@martindurant the reason I have not been using Dask retries previously is that I was losing data. Will this change correctly start writing from where it's appropriate and be resilient to the failure?

@martindurant
Member Author

the reason I have not been using Dask retries previously is that I was losing data.

This is a separate issue which we do not understand. It is expected that a failing task may leave behind partial data (this would happen for a local file too), but we don't know why the retried task does not overwrite it; perhaps a subtle race condition. We have not yet considered how fsspec's transaction concept (all-or-nothing writing of a file) could be used by Dask.
The change here should make it less likely that a task fails in the first place.

@birdsarah

birdsarah commented Sep 9, 2019

I will test this when I can, but I have been forced to give up and find other solutions (I was/am two weeks behind schedule) so I'm not directly being impacted by this issue at the moment.

@birdsarah

One more question which would change my calculus on how quickly I invest time in testing this @martindurant. Can you explain why this code fixes a problem that doesn't exist in earlier versions of s3fs?

@martindurant
Member Author

Yes I can!
On the hypothesis that what you are seeing is rate-limiting behaviour, this PR changes the way retries happen from "10 rapid-fire requests, followed by failure" to "10 requests separated by increasing time gaps".
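The retry pattern described here can be sketched as follows. This is a minimal illustration of the idea, not the actual s3fs code: the function name, base delay, and growth factor are all assumptions.

```python
import time


def retry_with_backoff(func, retries=10, base_delay=0.1, factor=1.7):
    """Retry func() with exponentially growing waits between attempts.

    Illustrative sketch only; the constants are assumptions, not the
    values used by s3fs.
    """
    for attempt in range(retries):
        try:
            return func()
        except Exception:
            if attempt == retries - 1:
                raise  # out of attempts: surface the original error
            # sleep base_delay * factor**attempt, so the gap grows each retry
            time.sleep(base_delay * factor ** attempt)
```

The point is that each failed attempt waits longer than the last, rather than hammering the service with back-to-back requests.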

@birdsarah

But why does s3fs < 0.3 not hit API rate limits when doing the exact same ETL?

@birdsarah

(may not just be s3fs < 0.3 - may be some combination thing with dask 2.2 and higher)

@martindurant
Member Author

Indeed, a lot of things have changed in the meantime, and so it's hard to say. That's what the extra logging was supposed to help with.

@birdsarah

Thanks for this extra info.

@birdsarah

Just tried this, it doesn't help. Got the Permission Error: Access Denied pretty quickly.

@ofirnk

ofirnk commented Sep 18, 2019

Hey, jumping in -
I think such a change should be behind a flag, or go out under a major version bump (e.g. 0.4.x).
We also use s3fs in AWS lambda, and this has caused major performance degradation in one of our environments.
It seems that connection errors happen A LOT in lambda+s3 access, and most of them go unnoticed.
So I suggest to:

  1. Put this behind a flag
  2. Ease the backoff a lot; these are the sleep times for the first 10 attempts:

attempt | sleep (ms)
0 | 100
1 | 170
2 | 289
3 | 491
4 | 835
5 | 1420
6 | 2414
7 | 4103
8 | 6976
9 | 11859

100 ms for the first failure, ~1.4 seconds for the 5th, almost 12 seconds for the 10th attempt... That seems like quite a lot, doesn't it?

  3. How about more delicate backoff times, like:

attempt | sleep (s)
0 | 0.01
1 | 0.03
2 | 0.11
3 | 0.29
4 | 0.58
5 | 0.83
6 | 0.95
7 | 0.99
8 | 1.00
9 | 1.01
...

These could be given by a formula or hard-coded, and be hard-limited at 1 second IMO.
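The sleep times in the first table above are consistent with a simple geometric formula. A sketch, with the base and growth factor inferred from the numbers rather than read from the s3fs source:

```python
def backoff_ms(attempt, base_ms=100, factor=1.7):
    """Sleep time in ms before retry number `attempt` (0-indexed).

    base_ms and factor are inferred from the table in the comment
    above; they are assumptions, not values taken from s3fs.
    """
    return round(base_ms * factor ** attempt)
```

With these constants, `backoff_ms(0)` gives 100 ms and `backoff_ms(9)` gives 11859 ms, matching the table, which is why the total wait grows so quickly across attempts.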

@martindurant
Member Author

@ofirnk you have a point. How about the same pattern, but a factor of 10 smaller? It is supposed to be exponential, after all.

@ofirnk

ofirnk commented Sep 18, 2019

@martindurant yep, 10x smaller makes more sense, maxing out at the 10th attempt.
But I'd still put it behind a flag (default=off) to reduce the surprise factor, as an extra 10-20 ms can look like an unintentional performance regression.

@martindurant
Member Author

The retries are still the exception, rather than the rule, no? I'll reduce the time, but it seems to me that having the backoff is a good default, since that's what AWS recommends. I'll push it to maybe 20x smaller?

@birdsarah

@ofirnk I noticed you said "It seems that connection errors happen A LOT in lambda+s3 access, and most of them go unnoticed." I am not on Lambda, but on EMR + S3 (with ServerSideEncryption). I cannot get the connection errors to go unnoticed; they make my Dask writes fail. Do you have any tips?

@ofirnk

ofirnk commented Sep 18, 2019

@birdsarah no, but maybe the encryption pushes the error rate too high... for us it's mainly a service degradation.

@martindurant I think it's just too common to be called an exception :) but yeah, I think it makes sense; something like Math.min(exp_wait(retry_number), exp_wait(5)) with 20x smaller wait times.
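The pseudocode above could look like this in Python. This is a hedged sketch: `exp_wait`, the 0.005 s base (the merged 0.1 s base divided by 20), and the 1.7 factor are assumptions inferred from the thread, not code from s3fs.

```python
def exp_wait(n, base=0.005, factor=1.7):
    # 20x smaller than the merged PR's 100 ms base (0.1 s / 20 = 0.005 s);
    # base and factor are assumptions inferred from the discussion
    return base * factor ** n


def capped_wait(retry_number, cap=5):
    # grow the wait exponentially, but never beyond the cap-th attempt's
    # wait (~71 ms with these constants), as ofirnk suggests
    return min(exp_wait(retry_number), exp_wait(cap))
```

The cap keeps late retries from stalling for seconds while still spacing out the early ones.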
