This is an RFC for an improvement to retries in uv.
uv sets a read timeout through reqwest, the maximum time between two reads before we give up, which is exposed as UV_HTTP_TIMEOUT (seanmonstar/reqwest#2950 - I've asked the reqwest maintainer what "read" means). uv does not set a total timeout; that doesn't really make sense on a network level if you download e.g. an 800MB torch wheel on a slow network.
Currently, uv retries a set of network errors, but explicitly not read timeouts. This means if the network ever gets stuck without an explicit error, we abort entirely. This problem was observed by a user.
Retrying on timeouts has the effect that with the default 3 retries, we may not error after 30s, but after 30s * (1 + 3) = 120s (plus backoff/jitter). Note that a completely offline machine or server generally causes an error almost immediately (e.g., an offline machine will cause a DNS error), so this path does not apply there. It also doesn't make sense for us to expose a total network timeout in these circumstances, as we do want to retry if e.g. an 800MB torch wheel download errors halfway through the file on a slow network.
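As a back-of-the-envelope check, the worst-case wall time until the final error can be computed as below. The function name and the explicit backoff list are illustrative, not uv's actual implementation (uv's real backoff schedule includes jitter):

```python
def worst_case_wall_time(read_timeout_s, retries, backoffs_s=()):
    """Worst-case time until the final error, assuming every attempt
    (the initial one plus each retry) runs into the read timeout."""
    attempts = 1 + retries
    return read_timeout_s * attempts + sum(backoffs_s)

# Read timeout of 30s with 3 retries, ignoring backoff/jitter.
print(worst_case_wall_time(30, 3))  # 120

# Proposed 10s read timeout with 3 retries.
print(worst_case_wall_time(10, 3))  # 40
```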
When there is no internet, we currently fail in ~3s on my Ubuntu 22.04 machine, which includes 3 retries.
$ hyperfine -i "uv pip install --no-cache tqdm"
Time (mean ± σ): 2.871 s ± 0.797 s [User: 0.040 s, System: 0.022 s]
Range (min … max): 1.738 s … 4.427 s 10 runs
$ uv pip install --no-cache tqdm
error: Request failed after 3 retries
Caused by: Failed to fetch: `https://pypi.org/simple/tqdm/`
Caused by: error sending request for url (https://pypi.org/simple/tqdm/)
Caused by: client error (Connect)
Caused by: dns error
Caused by: failed to lookup address information: Name or service not known
We fail with the same error in the same time if there is actually no DNS entry.
For a server that is down (here simulated by an address that has a DNS entry but doesn't have a server listening) and a blackhole server, we fail after ~120s with 3 retries, occasionally taking longer.
$ time uv pip install --no-cache tqdm --index-url https://example.com:81
error: Request failed after 3 retries
Caused by: Failed to fetch: `https://example.com:81/tqdm/`
Caused by: error sending request for url (https://example.com:81/tqdm/)
Caused by: operation timed out
real 2m39,094s
$ time uv pip install --no-cache tqdm --index-url https://blackhole.webpagetest.org
error: Request failed after 3 retries
Caused by: Failed to fetch: `https://blackhole.webpagetest.org/tqdm/`
Caused by: error sending request for url (https://blackhole.webpagetest.org/tqdm/)
Caused by: operation timed out
real 2m3,105s
The duration is determined by the read timeout (plus backoff/jitter).
$ time UV_HTTP_TIMEOUT=10 uv pip install --no-cache tqdm --index-url https://example.com:81
error: Request failed after 3 retries
Caused by: Failed to fetch: `https://example.com:81/tqdm/`
Caused by: error sending request for url (https://example.com:81/tqdm/)
Caused by: operation timed out
real 0m44,110s
We can also set a connect timeout, in which case the duration for this case is determined by the minimum of the read timeout and the connect timeout. We currently don't set a connect timeout; I don't know whether it makes sense to set one here.
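As a sketch of that interaction, assuming a stalled attempt against an unreachable server never produces a read (so the two timers race and whichever fires first ends the attempt), the per-attempt failure time would be:

```python
def attempt_failure_time(read_timeout_s, connect_timeout_s=None):
    """Time until a single attempt against an unreachable server fails,
    assuming no bytes are ever read: with no connect timeout set, only
    the read timeout applies; otherwise the earlier of the two fires."""
    if connect_timeout_s is None:
        return read_timeout_s
    return min(read_timeout_s, connect_timeout_s)

print(attempt_failure_time(30))     # 30: only the read timeout applies
print(attempt_failure_time(30, 5))  # 5: the connect timeout fires first
```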
For comparison, I ran curl https://example.com:81 and wget https://example.com:81, and opened the page in Firefox and Chrome. curl times out after 4.5min, though its source says the default connect timeout is 5min: https://github.com/curl/curl/blob/4e5908306ad5febee88f7eae8ea3b0c41a6b7d84/lib/connect.h#L43, defined 25 years ago: curl/curl@09da900#diff-a8a54563608f8155973318f4ddb61d7328dab512b8ff2b5cc48cc76979d4204cR1119. wget tries each of the 4 different IP addresses in the DNS entry (A and AAAA records) until each one times out after 9min, then retries all of them 20 times, for a total of 182min. Its documentation says there's only a 15min read timeout on top of the system connect timeout: https://www.gnu.org/software/wget/manual/wget.html#index-timeout. Firefox and Chrome also time out eventually (I couldn't run those through time; seemingly closer to curl). It seems unlikely that waiting several minutes leads to eventual success rather than just delaying feedback for the user. wget's behavior is much more likely to get the CI job killed without proper error reporting.
My proposal is two changes: The first is to lower the read timeout to 10s. If there has not been a single read for 10s, we assume that the connection is broken and we have to retry. This reduces the wait time for the common failure case that a server is down or unreachable due to network settings (think missing VPN) from ~120s to ~40s. The second is to retry on timeout errors. With the 10s read timeout, even if we run into a read timeout consistently after the connection was successfully established, we fail after 10s * (1 + 3) = 40s (plus backoff/jitter), which is close enough to the current 30s, except with grace for connection hiccups.
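The proposed behavior can be sketched as a generic retry loop that classifies timeouts as retryable. The function names, exception classes, and backoff schedule here are illustrative assumptions, not uv's actual code:

```python
import time

# Under this proposal, timeouts join the set of retryable errors.
RETRYABLE = (TimeoutError, ConnectionError)

def fetch_with_retries(fetch, retries=3, backoff_base_s=0.0):
    """Run `fetch` up to 1 + `retries` times, retrying on errors
    classified as transient, including read timeouts."""
    for attempt in range(1 + retries):
        try:
            return fetch()
        except RETRYABLE:
            if attempt == retries:
                raise  # retries exhausted, surface the error
            # Exponential backoff (jitter omitted for brevity).
            time.sleep(backoff_base_s * 2**attempt)

# A fake fetch that times out twice, then succeeds.
calls = {"n": 0}
def flaky_fetch():
    calls["n"] += 1
    if calls["n"] < 3:
        raise TimeoutError("read timed out")
    return "response"

print(fetch_with_retries(flaky_fetch))  # "response" after 3 attempts
```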
TODO: What about connect_timeout, should we set one? (also asked at seanmonstar/reqwest#2950)