Added timeout for closing devfs sync http connections. #66152
Conversation
|
It looks like this pull request may not have tests. Please make sure to add tests before merging. If you need an exemption to this rule, contact Hixie on the #hackers channel in Chat. Reviewers: Read the Tree Hygiene page and make sure this patch meets those guidelines before LGTMing. |
|
Were you ever able to repro? I'd rather know what changed between 1.17 and 1.20 and track down the bug, rather than just timing out. @zanderso was investigating this area with #54123. There was also a lot of asset work done this release. @jonahwilliams @dnfield |
|
Ah, that PR was something else. As for this PR, 4 minutes sounds like an OS-level connection timeout. We might be able to bring that in by setting HttpClient.connectionTimeout. However, client request output is buffered (buffer size 8KB): https://api.dart.dev/dev/2.10.0-146.0.dev/dart-io/HttpClientRequest/bufferOutput.html, so we might not be trying to connect and send anything until the request is closed. If we think the transfer is taking too long, we should put a timeout on the whole transfer instead. As @jmagman says, however, it would be nice to find the root cause before adding the timeout, but adding the timeout is not terrible here since we've already got a retry loop. |
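A minimal sketch of the connectionTimeout idea mentioned above, assuming dart:io's HttpClient (the 10-second value is illustrative, not from this discussion):

import 'dart:io';

// Sketch only: connectionTimeout bounds how long establishing the connection
// may take; it does not bound a request that hangs after the connection is
// already open, which is why a per-request timeout is still being discussed.
HttpClient makeClient() {
  return HttpClient()
    ..connectionTimeout = const Duration(seconds: 10); // illustrative value
}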
|
Should we be using platform-specific tools to upload binaries/assets to each device instead of HTTP? Once we have the devFS location, presumably we could push the files with a platform-specific file-transfer tool. |
|
Wrote up go/flutter-improving-devfs-reliability to discuss some options |
|
In general I think playing with the timeouts here is a bit scary. We don't really know how long it should take to sync an asset |
|
@jmagman @jonahwilliams @zanderso Yes, I was able to reproduce it. The bug is in closing the http request. It is a bug in the Dart http library; I've filed an issue. We already have timeouts for this call. They are 4 minutes long, which is dictated by Dart. This timeout is one that is more amenable to our users and is 50x the observed duration during normal operation. Closing the request happens after the data has already been sent, so we do know how long it should take. The call before close is the one that scales based on the size of the payload. Regardless, the devfs sync operation retries 10 times, so even if it somehow prematurely times out, it would have to hit the pathological case and prematurely time out 9 more times, which is highly unlikely. (In the bug reports and in my observation, the stalled asset has always succeeded on the second attempt.) |
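As a rough sketch of the retry-plus-timeout behavior described here (illustrative only; the function name, parameters, and values are assumptions, not the flutter_tools code):

import 'dart:async';
import 'dart:io';

// Each attempt gets a short timeout; the sync only fails outright if every
// attempt in the retry budget times out. (The real change also aborts the
// hung request; see the discussion further down.)
Future<HttpClientResponse> closeWithRetries(
  Future<HttpClientRequest> Function() makeRequest, {
  int maxRetries = 10, // matches the retry count mentioned above
  Duration perAttemptTimeout = const Duration(milliseconds: 500),
}) async {
  for (int attempt = 0; attempt < maxRetries; attempt += 1) {
    final HttpClientRequest request = await makeRequest();
    try {
      return await request.close().timeout(perAttemptTimeout);
    } on TimeoutException {
      // Fall through and retry with a fresh request.
    }
  }
  throw HttpException('devfs sync timed out after $maxRetries attempts');
}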
|
One other thing to consider is that even if Dart fixes the http code, we will still want an appropriate timeout. |
|
It's also possible that the bug is in the devfs server. Do we own that code? I'm thinking it might be closing the connection when it receives the payload, before returning a status code; that would match my observations. |
|
Yes, the devFS is part of the dart vm service: https://github.com/dart-lang/sdk/blob/master/sdk/lib/vmservice/devfs.dart |
|
there are a few associated files there, but do note that we use the HTTP API and not the JSON-RPC API |
|
I agree that fixing this may involve changing/adding timeouts, but given that we don't really seem to know the cause, I'm a bit concerned that we'd just be papering over some other issue. If you can reproduce the syncing issues fairly consistently, it seems like the best course of action would be to identify the actual error condition we're hitting on the device/devFS.
I was never able to get it to repro. I can help bisecting if I can figure out how to do so. |
Just to link it here, @jonahwilliams opened #66266 |
|
I was able to capture the error in pcap: the HTTP PUT is number 1547. The server never sends us a status code back for that PUT, despite the data having been sent. The retry is number 2086. Here are the logs (with my patch recovering). I'll add these details to the Dart issue as well.
That's usually how internet protocols work; that's why there is the "Robustness principle" that dictates that we be liberal in what we accept. That means we need to consider the fact that the server, now and in the future, may not return a status code to our PUT request. |
|
We control both the client and the server here. If there is an ambient bug in the devFS logic that is only being triggered on iOS, we need to track that down. |
|
I've been able to verify that request.response.close() isn't getting called in the pathological case, by instrumenting the server with the following patch:

--- a/sdk/lib/_internal/vm/bin/vmservice_server.dart
+++ b/sdk/lib/_internal/vm/bin/vmservice_server.dart
@@ -307,13 +307,16 @@ class Server {
fsPath = fsPathList[0];
}
}
- } catch (e) {/* ignore */}
+ } catch (e) {
+ print('vmservice_server: error $e');
+ }
String result;
try {
result = await _service.devfs.handlePutStream(fsName, fsPath, fsUri,
request.cast<List<int>>().transform(gzip.decoder));
} catch (e) {
+ print('vmservice_server: handlePutStream error $e');
request.response.statusCode = HttpStatus.internalServerError;
request.response.write(e);
request.response.close();
@@ -326,6 +329,7 @@ class Server {
request.response.write(result);
}
request.response.close();
+ print('vmservice_server: request close! ${fsUri ?? ''}');
return;
}
if (request.method != 'GET') {
|
When printing out the PUT messages, as well as when we close the response, we find that the server code never got the PUT request despite what the client thinks. |
|
I investigated using HttpClient.connectionTimeout and it didn't catch the problem case:

--- a/packages/flutter_tools/lib/src/devfs.dart
+++ b/packages/flutter_tools/lib/src/devfs.dart
@@ -220,7 +220,9 @@ class _DevFSHttpWriter {
: httpAddress = serviceProtocol.httpAddress,
_client = httpClient,
_osUtils = osUtils,
- _logger = logger;
+ _logger = logger {
+ _client.connectionTimeout = Duration(milliseconds:500);
+ }
final HttpClient _client;
final OperatingSystemUtils _osUtils;
@@ -273,7 +275,7 @@ class _DevFSHttpWriter {
// The contents has already been streamed, closing the request should
// not take long but we are experiencing hangs with it, see #63869.
final HttpClientResponse response =
- await request.close().timeout(const Duration(milliseconds: 500));
+ await request.close();
response.listen((_) {},
onError: (dynamic error) {
_logger.printTrace('error: $error');
|
I updated the PR so that it aborts the request in the case of a timeout. |
|
I did some further investigation into the root cause of the missing status code and filed a Dart bug here: dart-lang/sdk#43525. It seems like there might be something getting gobbled up in _HttpParser. The root cause wasn't established.
The docs for abort() say that this will cause HttpClientRequest.done to be completed with an error, which I would have expected to be propagated to the containing zone, skipping over the catch block below since it is an asynchronous error.
Even if that isn't what should happen, we can be on the safe side and put an await request.done after this line, which should definitely force the error into the catch block below.
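A sketch of the shape being suggested (illustrative, not the PR's exact diff): abort the hung request on timeout, then await its done future so the abort error is observed locally.

import 'dart:async';
import 'dart:io';

// Sketch only: abort() completes request.done with an error; awaiting done
// here attaches a listener, so the error is handled in this function instead
// of reaching the zone's unhandled-error handler.
Future<HttpClientResponse?> closeOrAbort(HttpClientRequest request) async {
  try {
    return await request.close().timeout(const Duration(milliseconds: 500));
  } on TimeoutException {
    request.abort();
    try {
      await request.done;
    } catch (_) {
      // Expected: the abort error lands here.
    }
    return null; // callers treat null as "retry"
  }
}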
I made the change. I tried to understand what the concern is and had a hard time figuring it out. Is your concern that there were some asynchronous operations in abort() that could be short-circuited by the TimeoutException?
Without awaiting the request's done Future, the TimeoutException would be rethrown and caught below as expected, but in addition, abort() will complete the done Future with an error (probably in a later microtask). When a Future is completed with an error, and it doesn't have an error handling callback attached with e.g. Future.catchError or isn't awaited, then the error skips over synchronous try {} catch {} blocks, and is passed to the unhandled exception callback of the containing Zone. We don't want that to happen here because that callback will report the exception to crash logging.
If you aren't seeing the abort() result in an error like that, then that behavior doesn't match the API docs...
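A small self-contained demo (not from the PR) of the behavior described above: an error on a Future that nothing listens to skips the synchronous catch block and surfaces in the zone's error handler instead.

import 'dart:async';

Future<void> main() async {
  await runZonedGuarded(() async {
    final Completer<void> done = Completer<void>();
    try {
      // Complete the future with an error in a later microtask and never
      // listen to it, mirroring what abort() does to HttpClientRequest.done.
      scheduleMicrotask(() => done.completeError(StateError('aborted')));
    } catch (error) {
      // Never reached: the error is asynchronous, so it skips this
      // synchronous catch block entirely.
      print('caught: $error');
    }
    await Future<void>.delayed(Duration.zero);
  }, (Object error, StackTrace stack) {
    // The unhandled error surfaces here, in the zone's error handler.
    print('zone handler saw: $error');
  });
}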
is passed to the unhandled exception callback of the containing Zone
Ah, ok. That's the piece I was missing, thanks. I was thinking that the abort exception would silently be ignored since there was no one to catch it.
zanderso
left a comment
lgtm
nit: no newline after '=', two space indent on continued line.
final HttpClientResponse response =
    await request.close().timeout(const Duration(milliseconds: 500));

Suggested:

final HttpClientResponse response = await request.close().timeout(
  const Duration(milliseconds: 500),
);
done
|
This pull request is not suitable for automatic merging in its current state.
|
'test',
fileSystem.currentDirectory,
fileSystem: fileSystem,
logger: BufferLogger.test(),
We probably want to take a look at the stdout and stderr from the BufferLogger to make sure their contents are as expected.
Done.
when(httpRequest.close()).thenAnswer((Invocation invocation) async {
  if (!didAbort) {
    while (!didAbort) {
      await new Future.delayed(const Duration(milliseconds: 10));
    }
    throw HttpException('aborted');
  }
  didRetry = true;
  return httpClientResponse;
});
when(httpRequest.abort()).thenAnswer((_) {
  didAbort = true;
});
We can use completers to avoid the polling loop. This also uses a completer for the result of close(), so that the same Future can be returned for httpRequest.done.
final Completer<void> abortCompleter = Completer<void>();
final Completer<HttpClientResponse> responseCompleter = Completer<HttpClientResponse>();
when(httpRequest.close()).thenAnswer((Invocation invocation) async {
  if (!didAbort) {
    await abortCompleter.future;
    unawaited(Future<void>(() async {
      responseCompleter.completeError(HttpException('aborted'));
    }));
    return responseCompleter.future;
  }
  didRetry = true;
  return httpClientResponse;
});
when(httpRequest.abort()).thenAnswer((_) {
  didAbort = true;
  abortCompleter.complete();
});
when(httpRequest.done).thenReturn(responseCompleter.future);
Done
|
Google Tests seem not to have run, and it looks like they failed again in CL 333390424. |
zanderso
left a comment
A negative test that makes sure the tool fails the right way when all 10 retries fail would also be good if we don't already have one of those.
<Completer<MockHttpClientResponse>>[hanger, succeeder];
succeeder.complete(MockHttpClientResponse());

when(httpRequest.close()).thenAnswer((Invocation invocation) async {
This doesn't need to be async, and I don't think you want to await the Future before returning it; the code under test should be doing that.
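For illustration, the non-async shape being suggested might look like this (using the identifiers from the quoted test; not a complete test):

// Return the Future directly; the code under test is what awaits it.
when(httpRequest.close()).thenAnswer((Invocation invocation) {
  return succeeder.future;
});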
Done.
return succeeder.future;
} else {
  // This branch shouldn't happen.
  assert(false);
assert(false) -> fail('This branch should not happen')
Done.
A negative test that makes sure the tool fails the right way when all 10 retries fail would also be good if we don't already have one of those.
If you don't mind, I'd rather prioritize getting this fix landed for now. It's affecting users and has run into plenty of roadblocks. I still have to investigate what's happening in "Google Tests" and I don't want to add the opportunity for more issues.
I added an issue for that test so we don't lose it: #66564
|
Correctness should be the primary concern here. Cherry picking something like this is especially risky, given that we don't really have integration test coverage for these sorts of failures - we'll be relying on crash reports and user issues to know if the fix worked. |
Are you talking about the test that exhausts the retry count? That isn't related to this change. This change has 100% test code coverage. As for integration tests, we actually do have them; they are in Google Test and I'm investigating them now. |
|
Two things:
|
Thanks. I don't want to be misunderstood: I'm not trying to shirk responsibility, I'm being economical with my time and respectful of our users' time. I have responsibilities beyond this PR. If a task is suggested, no matter how good the suggestion is, if it doesn't advance the fixing of this issue it needs to be prioritized against all the other responsibilities I have. In this case I don't see a reasonable theory that the suggested test will help Google Tests pass or cover an issue that could arise from the code change. If I'm missing something please let me know. I think it's a good idea; that's why I made the issue. |
|
The Mac tool_tests failure is an infra failure in LUCI; it passed on Cirrus. |
jmagman
left a comment
LGTM, let's merge this.
Hm I thought FRoB had run again, but it didn't? Do we know if those TAP tests are now passing? |
|
@jmagman I need a few more hours to investigate the Google Test failures that aren't showing up in the CI list. I don't think they are critical failures but I want to make sure. The one I investigated yesterday was failing because of extra stderr output which I can't reproduce locally. |
|
@gaaclarke Sorry to rush you, I had thought the Google tests were passing, but they just didn't run. |
|
I was able to reproduce the Google Test failure. |
|
I printed out the times those close functions take to finish, in milliseconds, along with the size of the payload in bytes. That's very different from the local ~10ms I was seeing. Notice that each one is larger than the last. I think this is an artifact of the fact that we are parallelizing the sending of the files while the server is handling them serially. That means payload size does indirectly affect the time it takes to wait for the status code (not in network cost, but in processing cost). It also means that retry attempts have a lower likelihood of timing out as the server load lessens (assuming some tasks converge). These tests are also running on server farms with emulated devices, so I expect their numbers to be higher than anyone would experience in any sane setup. This speaks to the need to find the root cause in the server even more. A safe number would be 10x the observed max, which would be 50 seconds; that is still better than the current 4-minute timeout, but it is hardly an acceptable pause to hit 20% of the time. |
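A generic sketch of the "limit in-flight uploads" idea (illustrative; the helper name and shape are assumptions, not the flutter_tools implementation):

// Run `action` over `items` with at most `maxInFlight` operations running at
// once. Single-threaded Dart means the shared `next` index needs no locking,
// since there is no await between reading and advancing it.
Future<void> forEachWithLimit<T>(
  List<T> items,
  int maxInFlight,
  Future<void> Function(T item) action,
) async {
  int next = 0;
  Future<void> worker() async {
    while (next < items.length) {
      final T item = items[next];
      next += 1;
      await action(item);
    }
  }

  await Future.wait<void>(
    List<Future<void>>.generate(maxInFlight, (_) => worker()),
  );
}

Capping the number of workers bounds how many PUTs queue up behind the server's serial handling, which keeps each close() wait closer to the single-upload case.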
updated the timeout documentation
reduced the max number of inflight uploads
|
@jonahwilliams @zanderso Please give this another look. I increased the timeout to 10 seconds based on my observations in Google Tests, trying to balance being safe with being considerate of users' time. I also reduced the max in-flight uploads to limit the variance in the time request.close takes (since it appears to be waiting on the server's serial handling of PUT requests). |
zanderso
left a comment
lgtm on green presubmits
|
Unfortunately, this had a side effect for developers deploying large apps over proxied WFH environments. There are legitimate cases where the sync takes over 10 seconds. We should either revert this or increase the timeout. I believe 60 seconds is reasonable. I would consider any dev cycle that takes over 60 seconds as unusable. See b/171005910. |
|
It should be safe to revert. The problem this was trying to solve was syncing assets during the initial connection; after #66742 we no longer need to sync any assets. |
|
Increasing to 60 seconds would be a better change though; then we're not relying on the OS-level timeout. |
|
#68488 increased the timeout to 1 minute. |
Description
HttpClientRequest.close is randomly hanging and then throwing a generic exception after 4 minutes. I've filed a bug against Dart and introduced a timeout in Flutter, since a request that fails to converge on close succeeds on the second attempt in my testing.
I was able to validate with local testing that this addresses the issue, and the 500ms timeout is well above the ~10ms I observed for close to succeed.
cc @jmagman @xster
Related Issues
#63869
Fixes #65402
Tests
I added the following tests:
none
Checklist
Before you create this PR, confirm that it meets all requirements listed below by checking the relevant checkboxes ([x]). This will ensure a smooth and quick review process.
I updated/added relevant documentation (doc comments with ///).
The analyzer (flutter analyze --flutter-repo) does not report any problems on my PR.
Breaking Change
Did any tests fail when you ran them? Please read Handling breaking changes.