Limit concurrent connections via NUGET_CONCURRENCY_LIMIT#5046

Merged
zivkan merged 1 commit into NuGet:dev from uweigand:max-open-files
Mar 7, 2023

Conversation

@uweigand (Contributor) commented Feb 10, 2023

Bug

Fixes: NuGet/Home#12410

Regression? Last working version: n/a

Description

See the linked issue for a detailed description of the underlying problem. While the original Linux issue has now been addressed elsewhere, by raising the RLIMIT_NOFILE limit on Linux in the Mono runtime, this PR still provides the following two related changes:

  • Add environment variable NUGET_CONCURRENCY_LIMIT to override OS-specific limit of concurrent connections in SourceRepositoryDependencyProvider.cs.
  • Guard _findPackagesByIdResource.DoesPackageExistAsync under _throttle. Note that all other _findPackagesByIdResource callbacks were already guarded under the same _throttle to limit concurrency, but this one call seems to have been missed.
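A minimal sketch of the two changes, assuming simplified names (the real SourceRepositoryDependencyProvider is considerably more involved; `_throttle` and `DoesPackageExistAsync` are the names from the PR description, the rest is illustrative):

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Hedged sketch of the pattern described above: read an optional override
// from NUGET_CONCURRENCY_LIMIT, and guard the previously unguarded call
// with the same semaphore used for the other resource callbacks.
class SourceRepositoryDependencyProviderSketch
{
    private readonly SemaphoreSlim _throttle;

    public SourceRepositoryDependencyProviderSketch(int osDefaultLimit)
    {
        // Environment variable overrides the OS-specific default when it
        // parses as a positive integer; otherwise the default is used.
        string raw = Environment.GetEnvironmentVariable("NUGET_CONCURRENCY_LIMIT");
        int limit = (int.TryParse(raw, out int parsed) && parsed > 0) ? parsed : osDefaultLimit;
        _throttle = new SemaphoreSlim(limit, limit);
    }

    // The guard around the one call that was previously missed; `inner`
    // stands in for _findPackagesByIdResource.DoesPackageExistAsync.
    public async Task<bool> DoesPackageExistAsync(Func<Task<bool>> inner)
    {
        await _throttle.WaitAsync();   // wait for a free concurrency slot
        try
        {
            return await inner();      // the throttled resource call
        }
        finally
        {
            _throttle.Release();       // always free the slot again
        }
    }
}
```

The try/finally ensures the slot is released even when the resource call throws, matching how the other `_throttle` usages in the class behave.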

PR Checklist

  • PR has a meaningful title

  • PR has a linked issue.

  • Described changes

  • Tests

    • Automated tests added
    • OR
    • Test exception - Default behavior should remain unchanged if the env variable isn't set. No clear way of testing the effect if the variable is set (in particular now that the runtime change was merged).
    • OR
    • N/A
  • Documentation

@uweigand uweigand requested a review from a team as a code owner February 10, 2023 15:10
@ghost ghost added the Community PRs created by someone not in the NuGet team label Feb 10, 2023
@martinrrm (Contributor) commented:

Thanks for the contribution. LGTM, just one comment about concurrency:

If Linux has a default of 1024 open files, which is 4 times the macOS limit, should we have a different ConcurrencyLimit for each OS?

@zivkan (Member) commented Feb 16, 2023

I think it would be worthwhile also adding an environment variable to override the limit (and maybe -1 to disable the limit), so if this causes a performance regression, then customers can opt-out of the change, or tune their builds appropriately. We already have other examples: https://learn.microsoft.com/en-us/nuget/reference/cli-reference/cli-ref-environment-variables

Frankly, I'm worried about performance regressions affecting most customers to fix a bug on a platform that's not officially supported. But I don't have any better ideas on how to detect s390x specifically. Maybe RuntimeInformation.OSDescription has a string we can search for, but I don't know if that's safe to rely on, plus we don't have a s390x machine to test on.

@uweigand (Contributor, Author) commented:

Thanks for the reviews!

If Linux has a default of 1024 open files, which is 4 times the macOS limit, should we have a different ConcurrencyLimit for each OS?

Good point. I've tried again with different values, and it seems that the concurrency limit can actually be increased to about 400 without the errors ever appearing. Starting at about 500 I see sporadic failures, and at 1000 or higher I see frequent failures. Of course this is with my specific test cases; I'm not sure to what extent it generalizes. On the other hand, these numbers seem logical given that each connection requires two file descriptors and there are 1024 available in total, some of which are already used by the rest of the application.

I also see that the overall performance is indeed significantly higher when increasing the concurrency limit beyond 16. However, most of that improvement happens early on; increasing the limit beyond about 256 no longer appears to significantly affect performance.

I think it would be worthwhile also adding an environment variable to override the limit (and maybe -1 to disable the limit), so if this causes a performance regression, then customers can opt-out of the change, or tune their builds appropriately. We already have other examples: https://learn.microsoft.com/en-us/nuget/reference/cli-reference/cli-ref-environment-variables

Happy to do so, of course. Were you thinking of moving the environment variable check to NuGet.Configuration.SettingsUtility as well, where the other checks appear to be?

Frankly, I'm worried about performance regressions affecting most customers to fix a bug on a platform that's not officially supported. But I don't have any better ideas on how to detect s390x specifically. Maybe RuntimeInformation.OSDescription has a string we can search for, but I don't know if that's safe to rely on, plus we don't have a s390x machine to test on.

If we wanted an explicit s390x check, I think the most straightforward would be to look at
RuntimeInformation.OSArchitecture, which will return System.Runtime.InteropServices.Architecture.S390x.

However, I'm not sure that this is the appropriate check - the problem is not really specific to the architecture as such. In fact, the "too many open files" problem has been reported in other Linux environments as well, e.g. under Docker in NuGet/Home#8571 (which was worked around by increasing the open file limit in that particular environment).

It seems that ideally, we would check the RLIMIT_NOFILE value that is currently active. Unfortunately, there does not appear to be an easy way to access that value: it is not exposed via any runtime interface I can find, and we do not want to add native code dependencies to the NuGet libraries.


Interestingly, while investigating RLIMIT_NOFILE in runtime, I just noticed that there is actually code in the start up sequence of the CoreCLR runtime that will raise the limit from its "soft" value (usually 1024 on Linux) to its "hard" value (which is usually much higher, typically on the order of 100K). This is missing from the Mono runtime start up - and we do use Mono on s390x.
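For illustration, reading or raising RLIMIT_NOFILE from managed code requires a libc P/Invoke like the sketch below, which is precisely the kind of native dependency the NuGet libraries avoid. This is Linux-only and assumes the glibc layout, where RLIMIT_NOFILE is resource number 7; it mirrors what the CoreCLR startup sequence does (raise the soft limit to the hard limit).

```csharp
using System;
using System.Runtime.InteropServices;

// Linux-only sketch: raise the soft open-files limit to the hard limit,
// as CoreCLR's startup does. RLIMIT_NOFILE is 7 on Linux; rlim_t is a
// 64-bit unsigned integer on 64-bit Linux.
static class FileLimit
{
    [StructLayout(LayoutKind.Sequential)]
    struct RLimit
    {
        public ulong rlim_cur; // soft limit (typically 1024)
        public ulong rlim_max; // hard limit (often ~100K or unlimited)
    }

    const int RLIMIT_NOFILE = 7;

    [DllImport("libc", SetLastError = true)]
    static extern int getrlimit(int resource, out RLimit rlim);

    [DllImport("libc", SetLastError = true)]
    static extern int setrlimit(int resource, ref RLimit rlim);

    public static (ulong Soft, ulong Hard) Get()
    {
        getrlimit(RLIMIT_NOFILE, out RLimit limit);
        return (limit.rlim_cur, limit.rlim_max);
    }

    public static void RaiseSoftToHard()
    {
        if (getrlimit(RLIMIT_NOFILE, out RLimit limit) != 0)
            return;
        if (limit.rlim_cur < limit.rlim_max)
        {
            // Raising soft up to hard never requires privileges.
            limit.rlim_cur = limit.rlim_max;
            setrlimit(RLIMIT_NOFILE, ref limit);
        }
    }
}
```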

That might in fact explain why this problem doesn't show on the x86 architecture (except for cases like the Docker scenario mentioned above, which might have a lower hard limit).

So maybe we should have a check for running under the Mono runtime instead of an architecture check.

Or maybe the best solution would be to change the Mono runtime itself to similarly raise the limit, like CoreCLR does. (That is a bit challenging as we then need to also make sure we never use select with file descriptors above 1024 - but that should be doable, I think.) I'll look into this on the runtime side as well.

@zivkan (Member) commented Feb 16, 2023

Happy to do so, of course. Were you thinking of moving the environment variable check to NuGet.Configuration.SettingsUtility as well, where the other checks appear to be?

We've got code checking environment variables all over the place: https://github.com/search?q=repo%3ANuGet%2FNuGet.Client+GetEnvironmentVariable&type=code&p=1

What I suggest is changing the line defining _throttle to = GetThrottleSemaphoreSlim(EnvironmentVariableReader.Instance);, and adding a method internal static SemaphoreSlim GetThrottleSemaphoreSlim(IEnvironmentVariableReader env). This way we can write tests validating that when the env var is either not parsable as an int or is <= 0, it returns null, and that when the env var is a positive integer, the return value's concurrency limit is the correct value. I'm not sure I'd bother writing tests for when the env var is not set, because then the test would just be validating a platform-dependent default value. To me, tests are supposed to catch regressions, and changing a default value isn't a regression, so I don't think it's a good candidate for a unit test.
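The suggested factory could be sketched as below. The interface is a hypothetical stand-in for NuGet's IEnvironmentVariableReader, and the semantics follow the comment (null when the variable is missing, unparsable, or non-positive; the real method would presumably fall back to a platform-specific default in the unset case):

```csharp
using System;
using System.Threading;

// Hypothetical stand-in for NuGet's IEnvironmentVariableReader, so the
// factory can be tested without touching real process environment state.
interface IEnvironmentVariableReader
{
    string GetEnvironmentVariable(string name);
}

// A fake reader for tests: answers one known variable, null for the rest.
sealed class FakeEnvironmentVariableReader : IEnvironmentVariableReader
{
    private readonly string _name, _value;
    public FakeEnvironmentVariableReader(string name, string value)
        => (_name, _value) = (name, value);
    public string GetEnvironmentVariable(string name)
        => name == _name ? _value : null;
}

static class Throttle
{
    internal static SemaphoreSlim GetThrottleSemaphoreSlim(IEnvironmentVariableReader env)
    {
        string raw = env.GetEnvironmentVariable("NUGET_CONCURRENCY_LIMIT");
        if (int.TryParse(raw, out int limit) && limit > 0)
        {
            return new SemaphoreSlim(limit, limit);
        }
        // Unparsable, non-positive, or unset: no override.
        // (The real method would fall back to a platform default here.)
        return null;
    }
}
```

Injecting the reader keeps the tests deterministic and parallel-safe, since they never mutate the process-wide environment.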

@uweigand (Contributor, Author) commented:

What I suggest is changing the line defining _throttle to = GetThrottleSemaphoreSlim(EnvironmentVariableReader.Instance);, and adding a method internal static SemaphoreSlim GetThrottleSemaphoreSlim(IEnvironmentVariableReader env). This way we can write tests validating that when the env var is either not parsable as an int or is <= 0, it returns null, and that when the env var is a positive integer, the return value's concurrency limit is the correct value. I'm not sure I'd bother writing tests for when the env var is not set, because then the test would just be validating a platform-dependent default value. To me, tests are supposed to catch regressions, and changing a default value isn't a regression, so I don't think it's a good candidate for a unit test.

I've now updated the PR according to these suggestions, thanks!

However, note that pending the outcome of the discussion on the runtime side here: dotnet/runtime#82428 , this change to NuGet may not actually be necessary any more. (It might still be preferable to add at least some concurrency limit, so I'm still leaving this PR open.)

@ghost ghost added the Status:No recent activity PRs that have not had any recent activity and will be closed if the label is not removed label Feb 28, 2023
@ghost commented Feb 28, 2023

This PR has been automatically marked as stale because it has no activity for 7 days. It will be closed if no further activity occurs within another 90 days of this comment. If it is closed, you may reopen it anytime when you're ready again, as long as you don't delete the branch.

@zivkan (Member) left a review comment:

Sorry for not replying for a few days.

Comment on lines +57 to +62
if (RuntimeEnvironmentHelper.IsLinux)
{
    // Also limit concurrent requests on Linux, which has a default of
    // 1024 open files at a time.
    concurrencyLimit = 256;
}
@zivkan (Member) commented on this diff:

From: #5046 (comment)

However, note that pending the outcome of the discussion on the runtime side here: dotnet/runtime#82428 , this change to NuGet may not actually be necessary any more. (It might still be preferable to add at least some concurrency limit, so I'm still leaving this PR open.)

How about we remove this, so Linux continues to not have any limit by default (so as to not slow down the product for customers on officially supported platforms). If the runtime issue doesn't pan out with a fix for s390x, then we can revisit what default we should change it to use. But the environment variable is an escape hatch until the runtime change is put in place.

I can't find it now, but I remember someone saying they were having problems getting NuGet to successfully restore their solution on a Raspberry Pi, or some other low resource machine. This PR's environment variable could also help customers in that situation. Although, having too many knobs adds complexity. Usually because of exponential growth of the test matrix, but I don't see any feasible way to test this, so perhaps that's not a concern here.

A Contributor commented:

Do we have any perf tests that we could run? Given we set the limit to 16 for macOS, and that's apparently fine perf-wise, it might be better to avoid sending that many concurrent requests anyway (and help the low-resource machines in turn).

@zivkan (Member) replied:

NuGet has some perf scripts here: https://github.com/NuGet/NuGet.Client/tree/dev/scripts/perftests
But I'm not sure they're well suited to finding NuGet's maximum scalability. For that, I think we'd need an unreasonably huge solution that uses many packages, and we'd need to test it on a machine with a high CPU core count (ideally with a fast network; a slow disk would expose scalability issues better than a fast disk). We usually run the perf scripts against the NuGet.Client, Orleans and/or OrchardCore repos. I know a lot of Microsoft internal teams use much larger repos, and I once had an external customer tell me they have a repo with around 18000 projects (I really hope that was a typo for 1800). Small perf regressions in the public <100 project repos aren't going to have nearly the same impact as on an 18,000 project solution.

However, I had forgotten what the throttling was guarding. It's only guarding a call to FindPackageByIdResource.DoesPackageExistAsync. So, it's effectively throttling how many package IDs can be checked for existence in parallel; it's not throttling a "large" chunk of the dependency resolution algorithm. I don't know if this means this PR is "incomplete", since it's not also guarding package download and extraction, which also need several file handles. On the other hand, apparently throttling this alone avoided the failure. At the same time, since it's only throttling DoesPackageExistAsync, I wouldn't expect 256 concurrent calls to have a big impact on huge solutions with many packages restoring on huge machines with many CPU cores.

@uweigand can you confirm if NuGet still encounters too many file open errors with the mono fix, but without this PR's changes?

I'm still open to the change, but I'm wondering if it's better to "play it safe" and not change the default linux throttling limit, unless we can confirm there are still problems, even after the mono fix.

I'm also curious, was this failure found running source-build? Was it using a filesystem package source, or only HTTP package sources? I forget what NuGet does in series and what it does in parallel, but it's feasible that filesystem feeds, particularly if there are a large number of packages in the feed will do a lot in parallel and therefore use many file handles at the same time.

@uweigand (Contributor, Author) replied:

However, I had forgotten what the throttling was guarding. It's only guarding a call to FindPackageByIdResource.DoesPackageExistAsync. So, it's effectively throttling how many package IDs can be checked for existence in parallel; it's not throttling a "large" chunk of the dependency resolution algorithm. I don't know if this means this PR is "incomplete", since it's not also guarding package download and extraction, which also need several file handles. On the other hand, apparently throttling this alone avoided the failure. At the same time, since it's only throttling DoesPackageExistAsync, I wouldn't expect 256 concurrent calls to have a big impact on huge solutions with many packages restoring on huge machines with many CPU cores.

Just to clarify: this patch only adds throttling for FindPackageByIdResource.DoesPackageExistAsync because all the other operations are already throttled in the current code base. It seems the missing throttling for DoesPackageExistAsync was just an oversight? [By the way, I suspect the missing DoesPackageExistAsync throttling may already cause problems on macOS today, but I cannot test there to confirm.]

I added the DoesPackageExistAsync throttling because just enabling the existing macOS throttling logic as-is on Linux did not fix the problem on Mono. With this PR, the problem is fully solved (with the old runtime without any changes).

@uweigand can you confirm if NuGet still encounters too many file open errors with the mono fix, but without this PR's changes?

The problem is completely fixed by either applying the Mono runtime fix (using unchanged NuGet), or by applying this NuGet PR (using unchanged Mono runtime). Applying both patches of course also still works.

I'm still open to the change, but I'm wondering if it's better to "play it safe" and not change the default linux throttling limit, unless we can confirm there are still problems, even after the mono fix.

The Mono runtime change has been applied upstream in the meantime, and it does fix the observed problems. So from that perspective, the NuGet change is no longer required to fix any functional issue for us. It might still be beneficial in general, if we want to reduce resource consumption and/or load on the servers generated by the NuGet client. But that's no longer a platform-specific decision, so I'm happy to leave that up to you.

I'm also curious, was this failure found running source-build? Was it using a filesystem package source, or only HTTP package sources? I forget what NuGet does in series and what it does in parallel, but it's feasible that filesystem feeds, particularly if there are a large number of packages in the feed will do a lot in parallel and therefore use many file handles at the same time.

I described the reproducer in the linked issue NuGet/Home#12410, but basically it's just a nuget restore on a hello world xunit project, using a nuget.config that pulls in a large number of HTTP package sources.

@zivkan (Member) replied:

that's because all the other operations are already throttled in the current code base

🤦 yes, you're right. _throttle already existed in the class, and looking at references, it appears to be used everywhere that _findPackageByIdResource is used, except for this one. It does look like an oversight. Therefore, it would be a great idea to merge this PR.

With this PR, the problem is fully solved (with the old runtime without any changes).

The restore problem is. I'm honestly surprised that the compiler, when building large projects or large solutions that build in parallel, doesn't also hit too many open files errors. And apparently there aren't any customers using Mono with apps that open a significant number of files (or HTTP sockets?) at the same time. The Mono runtime change brings it in line with CoreCLR, which I think is a better outcome.

It might still be beneficial in general, if we want to reduce resource consumption and/or load on the servers generated by the NuGet client. But that's no longer a platform-specific decision, so I'm happy to leave that up to you.

If you're willing to remove the default limit on Linux (so it works the same as CoreCLR does today), but leave in the environment variable override, I'll be happy to merge it as soon as we get a green CI build.

On the topic of green CI build, we have infrastructure issues, so if you can rebase onto the latest dev branch, that'll resolve a known problem. Otherwise, as long as you've left the default permissions for repo maintainers to have force push permission on your PR branch, I can do the rebase myself tomorrow.

If you feel strongly that we should limit concurrency on Linux, on CoreCLR, then I'll want to discuss with my team first.

As mentioned in my previous message (unless I edited it out? I wrote for a long time before deleting half of it and posting), I realise now that it's just throttling package source work. Plus it's an instance variable, which means that when restoring a solution with multiple projects, I wouldn't be surprised if there is one throttle instance per project being restored.

So, now that I understand the code better, I'm less concerned about potential perf regressions. But it is a change compared to today's CoreCLR behaviour, so I'm not going to make the decision independently from my team.

I described the reproducer in the linked issue NuGet/Home#12410, but basically it's just a nuget restore on a hello world xunit project, using a nuget.config that pulls in a large number of HTTP package sources.

Wow, that's surprising. While 26 sources is a lot, and I'd expect at least 3 file handles open at once per HTTP request (http request, lock file, http-cache file), that's still less than 100. I wonder what the other 400+ file handles were. This might signal another bug, or potential for a perf improvement. But that's out of scope for this PR.

@uweigand (Contributor, Author) replied:

It might still be beneficial in general, if we want to reduce resource consumption and/or load on the servers generated by the NuGet client. But that's no longer a platform-specific decision, so I'm happy to leave that up to you.

If you're willing to remove the default limit on Linux (so it works the same as CoreCLR does today), but leave in the environment variable override, I'll be happy to merge it as soon as we get a green CI build.

OK, I'm fine with that. I'll make that change shortly.

I described the reproducer in the linked issue NuGet/Home#12410, but basically it's just a nuget restore on a hello world xunit project, using a nuget.config that pulls in a large number of HTTP package sources.

Wow, that's surprising. While 26 sources is a lot, and I'd expect at least 3 file handles open at once per HTTP request (http request, lock file, http-cache file), that's still less than 100. I wonder what the other 400+ file handles were. This might signal another bug, or potential for a perf improvement. But that's out of scope for this PR.

Well, I was seeing more than one HTTP connection to each source. The project has multiple dependencies (that's why it needs to be a xunit hello world - just a console hello world doesn't have enough dependencies), and NuGet is opening a separate HTTP connection checking for each dependency on each source, all in parallel. (And once it finds that there are multiple versions of a dependent package on a source, it'll then go and open multiple more HTTP connections looking at each of the versions -- still all in parallel.)
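This fan-out can be modelled with a toy sketch (hypothetical names; a delay stands in for an HTTP round trip) showing how peak concurrency approaches sources × dependencies unless a shared throttle caps it:

```csharp
using System;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

// Toy model of the fan-out described above: every (source, dependency)
// pair is probed in parallel, so peak concurrency approaches
// sources * dependencies unless a shared SemaphoreSlim caps it.
static class FanOut
{
    static int _current, _peak;

    static async Task ProbeAsync(SemaphoreSlim throttle)
    {
        if (throttle != null) await throttle.WaitAsync();
        try
        {
            int now = Interlocked.Increment(ref _current);
            InterlockedMax(ref _peak, now);  // record the high-water mark
            await Task.Delay(10);            // stand-in for an HTTP round trip
        }
        finally
        {
            Interlocked.Decrement(ref _current);
            throttle?.Release();
        }
    }

    // Atomically raise `target` to `value` if value is larger.
    static void InterlockedMax(ref int target, int value)
    {
        int snapshot;
        while (value > (snapshot = Volatile.Read(ref target)) &&
               Interlocked.CompareExchange(ref target, value, snapshot) != snapshot)
        {
        }
    }

    public static async Task<int> PeakConcurrencyAsync(int sources, int deps, int? limit)
    {
        _current = 0;
        _peak = 0;
        var throttle = limit.HasValue ? new SemaphoreSlim(limit.Value, limit.Value) : null;
        var tasks = Enumerable.Range(0, sources * deps).Select(_ => ProbeAsync(throttle));
        await Task.WhenAll(tasks);
        return _peak;
    }
}
```

With a throttle of 16 the peak can never exceed 16 no matter how many sources and dependencies multiply together, which is the effect the semaphore has on file descriptor usage in the real restore.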

@ghost ghost removed the Status:No recent activity PRs that have not had any recent activity and will be closed if the label is not removed label Mar 1, 2023
Add environment variable to override OS-specific limit of
concurrent connections in SourceRepositoryDependencyProvider.cs.

Also guard _findPackagesByIdResource.DoesPackageExistAsync under
the concurrent connection _throttle.

Addresses NuGet/Home#12410.
@nkolev92 (Member) left a comment:

Thanks for the contribution!

@uweigand Can you please fill out the checklist and update the issue body? I think it's a bit outdated now.

We should create and link the issue to document the new environment variable from the template.

@uweigand uweigand changed the title Limit concurrent connections on Linux Limit concurrent connections via NUGET_CONCURRENCY_LIMIT Mar 7, 2023
@uweigand (Contributor, Author) commented Mar 7, 2023

Thanks for the contribution!

@uweigand Can you please fill out the checklist and update the issue body? I think it's a bit outdated now.

We should create and link the issue to document the new environment variable from the template.

I've updated the description in this PR and in the linked issue. I've called out the missing documentation update in that issue - not sure if this is what you meant or if there should be a new issue created. If the latter, please let me know how to do that - I wasn't able to find any template for that.

@zivkan (Member) commented Mar 7, 2023

@uweigand I created a docs repo issue, and updated the template: NuGet/Home#13568

Thanks for your help making the change, and your patience during the process.

@zivkan zivkan merged commit 2449796 into NuGet:dev Mar 7, 2023

Labels

Community PRs created by someone not in the NuGet team


Development

Successfully merging this pull request may close these issues.

"Too many open files" error on Linux (on s390x)

5 participants