Exponential backoff on reading and writing #237
Conversation
@birdsarah, want to try again?
@martindurant the reason I have not been using dask retries previously is that I was losing data. Will this change correctly start writing from where it's appropriate, and be resilient to the failure?
This is a separate issue which we do not understand. It is expected that a failing task can leave partial data behind (this would happen for a local file too), but we don't know why the retried task does not overwrite it; perhaps a subtle race condition. We have not yet considered how fsspec's transaction support (writing all of a file or none of it) could be used by Dask.
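The partial-data problem above is worth illustrating. One common way to get all-or-none file semantics (which is roughly what fsspec transactions aim for) is to write to a temporary file and atomically rename it into place. The sketch below is a generic, hypothetical illustration of that idea using only the standard library; it is not s3fs or fsspec code:

```python
import os
import tempfile

def atomic_write(path: str, data: bytes) -> None:
    """Write data so the target file is either fully written or untouched.

    Hypothetical stand-in for transaction semantics: the data lands in a
    temporary file first, and os.replace() swaps it in atomically, so a
    failure mid-write never leaves a partial target file behind.
    """
    dir_name = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=dir_name)
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
        os.replace(tmp_path, path)  # atomic on POSIX and Windows
    except BaseException:
        os.unlink(tmp_path)  # clean up the partial temp file
        raise
```

On S3 itself the picture is somewhat friendlier: a multipart upload only materializes the object when it is explicitly completed, so an aborted upload should not leave a partial object visible, which makes the leftover partial data reported here all the more puzzling.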
I will test this when I can, but I have been forced to give up and find other solutions (I was/am two weeks behind schedule), so I'm not directly impacted by this issue at the moment.
One more question which would change my calculus on how quickly I invest time in testing this, @martindurant: can you explain why this code fixes a problem that doesn't exist in earlier versions of s3fs?
Yes I can! |
But why does s3fs < 0.3 not hit API rate limits when doing the exact same ETL?
(It may not just be s3fs < 0.3; it may be some interaction with dask 2.2 and higher.)
Indeed, a lot of things have changed in the meantime, so it's hard to say. That's what the extra logging was supposed to help with.
Thanks for this extra info.
Just tried this; it doesn't help. Got the
Hey, jumping in - 100 ms for the first failure, 1.5 seconds for the 5th, 11 seconds for the 10th attempt... That seems like quite a lot, doesn't it?
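For concreteness, the delays quoted above (100 ms for the first retry, ~1.5 s for the 5th, ~11 s for the 10th) are consistent with a schedule of the form `base * factor**attempt` with `base = 0.1` and `factor = 1.7`. These constants are inferred from the numbers in this thread, not quoted from the PR's code:

```python
def backoff_delay(attempt: int, base: float = 0.1,
                  factor: float = 1.7, cap: float = 15.0) -> float:
    """Exponential backoff: base * factor**attempt, clipped at cap.

    base/factor/cap are illustrative guesses inferred from the delays
    quoted in this thread, not the actual constants in the PR.
    """
    return min(base * factor ** attempt, cap)

# The full schedule for ten attempts, in seconds:
print([round(backoff_delay(i), 2) for i in range(10)])
# → [0.1, 0.17, 0.29, 0.49, 0.84, 1.42, 2.41, 4.1, 6.98, 11.86]
```

Dividing `base` by 10 (or 20) shrinks every delay proportionally while keeping the exponential shape and the cap.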
@ofirnk you have a point. How about the same pattern, but a factor of 10 smaller? It is supposed to be exponential, after all.
@martindurant yep, 10x smaller makes more sense, maxing out at the 10th attempt.
The retries are still the exception rather than the rule, no? I'll reduce the time, but it seems to me that having the backoff is a good default, since that's what AWS recommends. Shall I push it to maybe 20x smaller?
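Putting the pieces together, a retry wrapper along these lines would apply the 10x-smaller base discussed above plus the randomized "jitter" that AWS recommends alongside exponential backoff. This is a hypothetical sketch under those assumptions, not the PR's implementation:

```python
import random
import time

def retry_with_backoff(func, retries=10, base=0.01, factor=1.7, cap=15.0):
    """Call func(), retrying on exception with jittered exponential backoff.

    Hypothetical helper, not s3fs code; base=0.01 reflects the 10x-smaller
    delays discussed above, and multiplying by random.random() adds
    AWS-style "full jitter" so the sleep is uniform in [0, delay).
    """
    for attempt in range(retries):
        try:
            return func()
        except Exception:
            if attempt == retries - 1:
                raise  # out of attempts: surface the last error
            delay = min(base * factor ** attempt, cap)
            time.sleep(delay * random.random())
```

The point of the jitter is that many concurrent workers retrying the same throttled endpoint do not all wake up in lockstep and hammer the API again at the same instant.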
@ofirnk I noticed you said "It seems that connection errors happen A LOT in lambda+s3 access, and most of them go unnoticed." I am not on lambda, but on EMR + s3 (with ServerSideEncryption). I cannot get the connection errors to go unnoticed: they make my dask writes fail. Do you have any tips?
@birdsarah no, but maybe encryption does push the error rate too high.. for us it's mainly a service degradation. @martindurant I think it's just too common to be called an exception :) but yeah, I think it makes sense, something like