Limit concurrent connections via NUGET_CONCURRENCY_LIMIT #5046
Conversation
Thanks for the contribution. LGTM, just a comment about concurrency: if Linux has a default of 1024 open files, which is 4 times more than on macOS, should we have a different […]
I think it would be worthwhile also adding an environment variable to override the limit (and maybe […]). Frankly, I'm worried about performance regressions affecting most customers to fix a bug on a platform that's not officially supported. But I don't have any better ideas on how to detect s390x specifically. Maybe […]
Thanks for the reviews!
Good point. I've tried again with different values, and it seems that the concurrency limit can actually be increased to about 400 without the errors ever appearing. Starting at about 500 I see sporadic failures, and at 1000 or higher I see frequent failures. Of course this is with my specific test cases; I'm not sure to what extent this can be generalized. On the other hand, these numbers seem logical given that each connection requires two file descriptors, and there are 1024 available in total, some of which will already be used by the rest of the application.

I also see that the overall performance is indeed significantly higher when increasing the concurrency limit beyond 16. However, most of that improvement happens early on - increasing the limit beyond about 256 no longer appears to significantly affect performance.
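The arithmetic in that comment (two descriptors per connection, 1024 available, minus whatever the rest of the application already holds open) can be sketched as a small model. This is an illustrative Python sketch, not NuGet code; the parameter names and the 128-descriptor reserve are assumptions:

```python
import resource

def safe_concurrency_limit(soft_limit=None, fds_per_connection=2, reserved_fds=128):
    """Estimate a safe connection-concurrency limit from the open-file limit.

    Assumes each concurrent connection needs `fds_per_connection` descriptors
    and reserves `reserved_fds` for the rest of the application (both numbers
    are illustrative guesses, not values from the NuGet code).
    """
    if soft_limit is None:
        # Current soft RLIMIT_NOFILE of this process (1024 by default on Linux).
        soft_limit, _hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    usable = max(soft_limit - reserved_fds, fds_per_connection)
    return max(usable // fds_per_connection, 1)

# With the default Linux soft limit of 1024 this gives (1024 - 128) / 2 = 448,
# consistent with failures starting to appear around a limit of 500.
```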
Happy to do so, of course. Were you thinking of moving the environment variable check to NuGet.Configuration.SettingsUtility as well, where the other checks appear to be?
If we wanted an explicit s390x check, I think the most straightforward would be to look at […]. However, I'm not sure that this is the appropriate check - the problem is not really specifically related to the architecture as such. In fact, the "too many open files" problem seems to have been reported in other Linux environments as well, e.g. under Docker here: NuGet/Home#8571 (which was worked around by increasing the open file limit in that particular environment). It seems that ideally, we would actually check the […].

Interestingly, while investigating […]. That might in fact explain why this problem doesn't show on the x86 architecture (except for cases like the Docker scenario mentioned above, which might have a lower hard limit). So maybe we should have a check for running under the Mono runtime instead of an architecture check. Or maybe the best solution would be to change the Mono runtime itself to similarly raise the limit, like CoreCLR does. (That is a bit challenging as we then need to also make sure we never use […])
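For reference, "raise the limit, like CoreCLR does" means raising the soft RLIMIT_NOFILE toward the hard limit at process startup. A simplified Python sketch of that idea (CoreCLR's actual logic lives in native code and may apply additional caps; this is only a model of the mechanism):

```python
import resource

def raise_nofile_soft_limit():
    """Raise the soft RLIMIT_NOFILE toward the hard limit, returning
    (old_soft, new_soft). Leaves the limit alone if the hard limit is
    unlimited, already reached, or the platform rejects the change."""
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    if hard == resource.RLIM_INFINITY or hard <= soft:
        return soft, soft
    try:
        resource.setrlimit(resource.RLIMIT_NOFILE, (hard, hard))
        return soft, hard
    except (ValueError, OSError):
        return soft, soft
```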
We've got code checking environment variables all over the place: https://github.com/search?q=repo%3ANuGet%2FNuGet.Client+GetEnvironmentVariable&type=code&p=1 What I suggest is to change the line defining […]
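The override pattern being suggested - read `NUGET_CONCURRENCY_LIMIT` and fall back to the platform default when it is unset or unusable - could look like this. Sketched in Python for brevity (the real change is in C#); the exact fallback behaviour for malformed values is an assumption, not taken from the PR:

```python
import os

def concurrency_limit_from_env(default, var_name="NUGET_CONCURRENCY_LIMIT"):
    """Let an environment variable override the platform's default
    concurrency limit. Non-numeric or non-positive values fall back to
    the default so a bad setting cannot disable throttling entirely."""
    raw = os.environ.get(var_name)
    if raw is None:
        return default
    try:
        value = int(raw)
    except ValueError:
        return default
    return value if value > 0 else default
```

With this in place, something like `NUGET_CONCURRENCY_LIMIT=64 nuget restore` would raise (or lower) the limit without a code change.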
I've now updated the PR according to these suggestions, thanks! However, note that pending the outcome of the discussion on the runtime side here: dotnet/runtime#82428 , this change to NuGet may not actually be necessary any more. (It might still be preferable to add at least some concurrency limit, so I'm still leaving this PR open.)
This PR has been automatically marked as stale because it has no activity for 7 days. It will be closed if no further activity occurs within another 90 days of this comment. If it is closed, you may reopen it anytime when you're ready again, as long as you don't delete the branch.
zivkan left a comment
Sorry for not replying in a few days.
```csharp
if (RuntimeEnvironmentHelper.IsLinux)
{
    // Also limit concurrent requests on Linux, which has a default of
    // 1024 open files at a time.
    concurrencyLimit = 256;
}
```
From: #5046 (comment)
However, note that pending the outcome of the discussion on the runtime side here: dotnet/runtime#82428 , this change to NuGet may not actually be necessary any more. (It might still be preferable to add at least some concurrency limit, so I'm still leaving this PR open.)
How about we remove this, so Linux continues to not have any limit by default (so as to not slow down the product for customers on officially supported platforms). If the runtime issue doesn't pan out with a fix for s390x, then we can revisit what default we should change it to use. But the environment variable is an escape hatch until the runtime change is put in place.
I can't find it now, but I remember someone saying they were having problems getting NuGet to successfully restore their solution on a Raspberry Pi, or some other low resource machine. This PR's environment variable could also help customers in that situation. Although, having too many knobs adds complexity. Usually because of exponential growth of the test matrix, but I don't see any feasible way to test this, so perhaps that's not a concern here.
Do we have any perf tests that we could run? Given we set the limit to 16 for macOS and that's apparently fine perf-wise, it might be better to avoid sending that many concurrent requests anyway (and help the low-resource machines in turn).
NuGet has some perf scripts here: https://github.com/NuGet/NuGet.Client/tree/dev/scripts/perftests
But I'm not sure they're well suited to finding the maximum scalability of NuGet. For that, I think we'd need an unreasonably huge solution that uses many packages, and we'd need to test it on a machine with a high CPU core count (ideally with a fast network, but a slow disk would show scalability issues better than a fast disk). We usually run the perf scripts against the NuGet.Client, Orleans and/or OrchardCore repos. I know a lot of Microsoft internal teams use much larger repos, and I once had an external customer tell me they have a repo with around 18000 projects (I really hope that was a typo for 1800). Small perf regressions in the public <100 project repos aren't going to have nearly the same impact as on the 18,000 project solution.
However, I had forgotten what the throttling was guarding. It's only guarding a call to FindPackageByIdResource.DoesPackageExistAsync. So, it's effectively throttling how many package IDs can be checked for existence in parallel; it's not throttling a "large" chunk of the dependency resolution algorithm. I don't know if this means this PR is "incomplete", since it's not also guarding package download & extraction, which will also need several file handles. On the other hand, apparently throttling this alone avoided the failure. In any case, since it's only throttling DoesPackageExistAsync, I wouldn't expect 256 concurrent to have a big impact on huge solutions with many packages restoring on huge machines with many CPU cores.
@uweigand can you confirm if NuGet still encounters too many file open errors with the mono fix, but without this PR's changes?
I'm still open to the change, but I'm wondering if it's better to "play it safe" and not change the default linux throttling limit, unless we can confirm there are still problems, even after the mono fix.
I'm also curious, was this failure found running source-build? Was it using a filesystem package source, or only HTTP package sources? I forget what NuGet does in series and what it does in parallel, but it's feasible that filesystem feeds, particularly if there are a large number of packages in the feed will do a lot in parallel and therefore use many file handles at the same time.
However, I had forgotten what the throttling was guarding. It's only guarding a call to FindPackageByIdResource.DoesPackageExistAsync. So, it's effectively throttling how many package IDs can be checked for existence, in parallel. It's not throttling a "large" chunk of the dependency resolution algorithm. I don't know if this means this PR is "incomplete" since it's not also guarding package download & extraction, which will also need several file handles. On the other hand, apparently throttling this alone avoided the failure. On the other hand, since it's only throttling DoesPackageExistAsync, I wouldn't expect 256 concurrent to have a big impact on huge solutions with many packages restoring on huge machines with many CPU cores.
Just to clarify here, this patch only adds throttling for FindPackageByIdResource.DoesPackageExistAsync, but that's because all the other operations are already throttled in the current code base. It seems that the missing throttling for DoesPackageExistAsync was just an oversight? [By the way, I suspect the missing DoesPackageExistAsync throttling may already cause problems on macOS today - but I cannot test there to confirm.]
I added the DoesPackageExistAsync throttling because just enabling the existing MacOS throttling logic as-is on Linux did not fix the problem on Mono. With this PR, the problem is fully solved (with the old runtime without any changes).
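The guard being added here is the standard semaphore-throttle pattern. Sketched below with Python's asyncio (the actual code is C# using SemaphoreSlim; the function and variable names are illustrative, not NuGet's):

```python
import asyncio

async def check_packages_exist(package_ids, limit=4):
    """Probe many package IDs in parallel, but hold at most `limit`
    probes in flight at once -- the same shape as guarding
    DoesPackageExistAsync with the shared _throttle semaphore."""
    throttle = asyncio.Semaphore(limit)
    active = peak = 0

    async def does_package_exist(package_id):
        nonlocal active, peak
        async with throttle:              # acquire the throttle, release on exit
            active += 1
            peak = max(peak, active)
            await asyncio.sleep(0)        # stand-in for the actual network I/O
            active -= 1
            return True                   # pretend every package exists

    results = await asyncio.gather(*(does_package_exist(p) for p in package_ids))
    return results, peak
```

Running this over 32 IDs with `limit=4` completes every check while the observed peak concurrency never exceeds 4.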
@uweigand can you confirm if NuGet still encounters too many file open errors with the mono fix, but without this PR's changes?
The problem is completely fixed by either applying the Mono runtime fix (using unchanged NuGet), or by applying this NuGet PR (using unchanged Mono runtime). Applying both patches of course also still works.
I'm still open to the change, but I'm wondering if it's better to "play it safe" and not change the default linux throttling limit, unless we can confirm there are still problems, even after the mono fix.
The Mono runtime change has been applied upstream in the meantime, and it does fix the observed problems. So from that perspective, the NuGet change is no longer required to fix any functional issue for us. It might still be beneficial in general, if we want to reduce resource consumption and/or load on the servers generated by the NuGet client. But that's no longer a platform-specific decision, so I'm happy to leave that up to you.
I'm also curious, was this failure found running source-build? Was it using a filesystem package source, or only HTTP package sources? I forget what NuGet does in series and what it does in parallel, but it's feasible that filesystem feeds, particularly if there are a large number of packages in the feed will do a lot in parallel and therefore use many file handles at the same time.
I described the reproducer in the linked issue NuGet/Home#12410, but basically it's just a nuget restore on a hello world xunit project, using a nuget.config that pulls in a large number of HTTP package sources.
that's because all the other operations are already throttled in the current code base
🤦 yes, you're right. _throttle already existed in the class, and looking at references, it appears to be used everywhere that _findPackageByIdResource is used, except for this one. It does look like an oversight. Therefore, it would be a great idea to merge this PR.
With this PR, the problem is fully solved (with the old runtime without any changes).
The restore problem is. I'm honestly surprised that the compiler, when building large projects, or a large solution that can build in parallel, doesn't also hit "too many open files" errors. And apparently there aren't any customers using Mono with apps that open a significant number of files (or HTTP sockets?) at the same time. The Mono runtime change brings that in line with CoreCLR, which I think is a better outcome.
It might still be beneficial in general, if we want to reduce resource consumption and/or load on the servers generated by the NuGet client. But that's no longer a platform-specific decision, so I'm happy to leave that up to you.
If you're willing to remove the default limit on Linux (so it works the same as CoreCLR does today), but leave in the environment variable override, I'll be happy to merge it as soon as we get a green CI build.
On the topic of green CI build, we have infrastructure issues, so if you can rebase onto the latest dev branch, that'll resolve a known problem. Otherwise, as long as you've left the default permissions for repo maintainers to have force push permission on your PR branch, I can do the rebase myself tomorrow.
If you feel strongly that we should limit concurrency on Linux, on CoreCLR, then I'll want to discuss with my team first.
As mentioned in my previous message (unless I edited it out? I wrote for a long time before deleting half of it and posting), I'm less concerned now that I realise it's just throttling package source work. Plus, it's an instance variable, which means that when restoring a solution with multiple projects, I wouldn't be surprised if it's one throttle instance per project being restored.
So, now that I understand the code better, I'm less concerned about potential perf regressions. But it is a change compared to today's CoreCLR behaviour, so I'm not going to make the decision independently from my team.
I described the reproducer in the linked issue NuGet/Home#12410, but basically it's just a nuget restore on a hello world xunit project, using a nuget.config that pulls in a large number of HTTP package sources.
Wow, that's surprising. While 26 sources is a lot, and I'd expect at least 3 file handles open at once per HTTP request (http request, lock file, http-cache file), that's still less than 100. I wonder what the other 400+ file handles were. This might signal another bug, or potential for a perf improvement. But that's out of scope for this PR.
It might still be beneficial in general, if we want to reduce resource consumption and/or load on the servers generated by the NuGet client. But that's no longer a platform-specific decision, so I'm happy to leave that up to you.
If you're willing to remove the default limit on Linux (so it works the same as CoreCLR does today), but leave in the environment variable override, I'll be happy to merge it as soon as we get a green CI build.
OK, I'm fine with that. I'll make that change shortly.
I described the reproducer in the linked issue NuGet/Home#12410, but basically it's just a nuget restore on a hello world xunit project, using a nuget.config that pulls in a large number of HTTP package sources.
Wow, that's surprising. While 26 sources is a lot, and I'd expect at least 3 file handles open at once per HTTP request (http request, lock file, http-cache file), that's still less than 100. I wonder what the other 400+ file handles were. This might signal another bug, or potential for a perf improvement. But that's out of scope for this PR.
Well, I was seeing more than one HTTP connection to each source. The project has multiple dependencies (that's why it needs to be a xunit hello world - just a console hello world doesn't have enough dependencies), and NuGet is opening a separate HTTP connection checking for each dependency on each source, all in parallel. (And once it finds that there are multiple versions of a dependent package on a source, it'll then go and open multiple more HTTP connections looking at each of the versions -- still all in parallel.)
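A back-of-the-envelope model of that fan-out makes the numbers plausible. The 26 sources and ~2 descriptors per connection come from this discussion; the count of parallel lookups per source is a hypothetical illustration:

```python
def estimated_descriptors(sources, parallel_lookups_per_source, fds_per_connection=2):
    """Rough model of descriptors consumed when every dependency (and
    version) is probed against every configured source in parallel."""
    return sources * parallel_lookups_per_source * fds_per_connection

# 26 sources x ~20 dependency/version lookups in parallel x 2 descriptors
# each is already 1040 -- past the default 1024 limit before counting the
# descriptors the rest of the process holds open.
```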
Add environment variable to override OS-specific limit of concurrent connections in SourceRepositoryDependencyProvider.cs. Also guard _findPackagesByIdResource.DoesPackageExistAsync under the concurrent connection _throttle. Addresses NuGet/Home#12410.
Thanks for contribution!
@uweigand Can you please fill out the checklist and update the issue body? I think it's a bit outdated now.
We should create and link the issue to document the new environment variable from the template.
I've updated the description in this PR and in the linked issue. I've called out the missing documentation update in that issue - not sure if this is what you meant or if there should be a new issue created. If the latter, please let me know how to do that - I wasn't able to find any template for that.
@uweigand I created a docs repo issue, and updated the template: NuGet/Home#13568 Thanks for your help making the change, and your patience during the process.
Bug
Fixes: NuGet/Home#12410
Regression? Last working version: n/a
Description
See the linked issue for a detailed description of the underlying problem. While the original Linux issue has now been addressed elsewhere, by raising the RLIMIT_NOFILE limit on Linux in the Mono runtime, this PR still provides the following two related changes:
- Adds environment variable `NUGET_CONCURRENCY_LIMIT` to override the OS-specific limit of concurrent connections in `SourceRepositoryDependencyProvider.cs`.
- Guards `_findPackagesByIdResource.DoesPackageExistAsync` under `_throttle`. Note that all other `_findPackagesByIdResource` callbacks were already guarded under the same `_throttle` to limit concurrency, but this one call seems to have been missed.

PR Checklist
- PR has a meaningful title
- PR has a linked issue.
- Described changes
- Tests
- Documentation