Temporarily restore concurrency limit to historical fixed default#6258
Conversation
When we first developed the "pure" SSH-only Git LFS object transfer protocol in PR git-lfs#4446, our initial prototype supported only a single SSH connection, and so in commit 594f8e3 of that PR we updated the NewManifest() function in our "tq" package to set the maximum number of concurrent transfers to one when the new SSH adapter was in use. We also added a comment which states that "Multiple concurrent transfers are not yet supported". Later in the same PR we implemented support for multiple concurrent SSH connections using the ControlMaster option of the ssh(1) command, and so in commit 9c46a38 we revised the NewManifest() function to no longer set the maximum number of concurrent transfers to one when the new SSH adapter was in use. However, we accidentally left the comment regarding the lack of support for concurrent transfers in place. Then in commit b44cbe4 of PR git-lfs#5136 we resolved the problem reported in git-lfs#5064 by updating the NewManifest() function a third time. The function once again sets the maximum number of concurrent transfers to one when the SSH adapter is in use, but now only when SSH multiplexing is not enabled, as might occur when using an SSH agent other than the ssh(1) command. When we made this most recent change, though, we neither updated the original comment nor moved it to appear prior to the relevant conditional block in the NewManifest() function, so we make those corrections now.
In PR git-lfs#6241 we recently adopted a change which made the default value of the "lfs.concurrentTransfers" configuration option variable and scaled to three times the number of available CPUs. This change benefits users with multi-core CPUs when they transfer Git LFS objects over HTTP, since we start multiple concurrent threads (i.e., "goroutines") to make Git LFS Batch API requests and object transfer requests in parallel. However, the "lfs.concurrentTransfers" configuration option also controls the number of parallel object transfers when using protocols other than HTTP, and in these cases, we spawn multiple separate child processes to perform the transfer operations. For example, when using the "pure" SSH-only Git LFS object transfer protocol, we may run multiple instances of the ssh(1) command, unless SSH multiplexing is not enabled. (Note that we may also use another SSH agent, depending on what has been configured via the "core.sshCommand" option and the GIT_SSH or GIT_SSH_COMMAND environment variables.) When multiplexing is enabled, we set the ssh(1) command's ControlMaster option so that all the separate processes we start will share a single underlying SSH connection. We also only start SSH agent processes on demand in this case. When transferring many objects, though, we will start multiple processes, up to the limit specified by the "lfs.concurrentTransfers" configuration option. As another example, when using our custom transfer protocol we always start multiple separate instances of the transfer agent program, which may either be our "git lfs standalone-file" command or an agent configured with the "lfs.customTransfer.*" options. In the first case, we only start a single process, but otherwise we always start exactly as many processes as is specified by the "lfs.concurrentTransfers" option, unless the "lfs.concurrentTransfers.<name>.concurrent" option is set to "false". Finally, even when using our standard HTTP-based transfer adapter, we may start multiple threads in situations where we do not need them all. For instance, in cases where the "git lfs smudge" command is used instead of the more-efficient "git lfs filter-process" command, each invocation of the "git lfs smudge" command will transfer only a single object, but will nevertheless start the number of threads specified by the "lfs.concurrentTransfers" option. Typical modern installations of Git LFS do not use the "git lfs smudge" command, but if the "filter.lfs.process" Git configuration option is missing, or if a version of Git older than v2.11.0 is in use, then the "git lfs smudge" command may still be utilized. Even when the "git lfs filter-process" command is used, when just a small number of objects need to be transferred, we will still start the number of threads specified by the "lfs.concurrentTransfers" option. In all these instances, if we set the default value of the "lfs.concurrentTransfers" option based on the number of CPUs then we may start many more processes or threads than we did prior to the changes in PR git-lfs#6241. This may have unexpected and negative consequences for our users, particularly in those cases where we spawn separate processes to transfer objects, such as when a custom transfer agent is configured. Ideally, we would address all of these issues by refactoring our HTTP and custom transfer adapters to start threads or processes only on demand, as the SSH transfer adapter already does, while also introducing new configuration options to better set limits on the number of transfer agent processes we spawn as distinct from the number of threads we start during HTTP object transfers. As we expect to make a v3.8.0 release shortly, though, refactoring work of this nature is out of scope for the near term. We therefore now reverse the key change from PR git-lfs#6241, and set the default value of the "lfs.concurrentTransfers" option to its historical value of 8. Looking forward, we should write a proposal to discuss how to implement any new configuration options such that they respect the historical meaning of the "lfs.concurrentTransfers" option. As one consideration, if a user has explicitly set the "lfs.concurrentTransfers" option but has not set any of the new options, then the "lfs.concurrentTransfers" option's value should be understood as a maximum limit for the number of threads or processes we start during object transfers. On the other hand, if a user sets one of our new options but not the "lfs.concurrentTransfers" option, then we might allow its value to override the default value of the "lfs.concurrentTransfers" option.
a396653 to
d05aba2
Compare
|
Hey @joshfriend, I'm sorry about this PR! It certainly wasn't my intention to reverse course when I approved and merged PR #6241, which seemed like a relatively straightforward enhancement with clear performance benefits. I should have thought a bit more carefully about all the dusty edge cases we have to support. While I don't think it's a disaster if we spin up, say, 10 or 12 custom transfer agents instead of 8, I wouldn't want to surprise users who discover we're now forking 384 processes on their 128-core machine, just to transfer one or two objects via their custom transfer process. |
|
Not a problem, Chris! In my setup with pretty vanilla github hosted LFS that was a pretty great default tuning, but I had no test harness available for some of the other transport methods and clearly missed a few things. #6234 is the main unlock for performance though. We could leave the default parallelism pretty low as it sat before, with affordances for increasing it, and I think that would still be a big win! |
In commit aa08c37 of PR #6241 we changed the default value of the "lfs.concurrentTransfers" configuration option from eight to a dynamic limit based on the number of CPUs in the current system. This change required that we also update a number of the tests in our shell test suite, as they previously were written with the value of eight hard-coded as the expected default value of the "lfs.concurrentTransfers" option. We therefore added a new setup_expected_concurrent_transfers() helper function to our "t/testhelpers.sh" shell library, and revised several test scripts to invoke this function before executing any tests. The function sets a global "expectedConcurrentTransfers" variable with the expected default value of the "lfs.concurrentTransfers" option so that tests can now refer to this variable instead of using a fixed hard-coded value as they did before. However, when making these revisions, we inadvertently overlooked the tests of the "pure" SSH-only Git LFS transfer protocol in our "t/t-batch-transfer.sh" script. At present, these tests also contain a hard-coded value for the expected maximum number of object transfers which may be performed concurrently. The tests pass this value to the assert_ssh_transfer_sessions() helper function, which is defined in the same script as the tests. One reason we did not adjust these tests in PR #6241 is that the assert_ssh_transfer_sessions() function's fourth parameter, which is expected to receive a value representing the maximum number of concurrent transfers, is misnamed as "objs_per_batch". We chose the parameter name "objs_per_batch" when we first introduced the function in commit b20b6e9 of PR #5905. In the same PR, we then added tests of the SSH-based object transfer protocol, including the "batch transfers with ssh endpoint and multiple objects and batches (git-lfs-transfer)" test in which we explicitly set the "lfs.concurrentTransfers" with a value of two. As we explained in a prior commit in this PR, that test was also misnamed, as it effectively forced the Git LFS client to only transfer two objects at a time, but did not alter the batch size or the number of batches the queue processed. Hence we have now renamed the test to "batch transfers with ssh endpoint and multiple objects exceeding workers (git-lfs-transfer)", which more accurately reflects the conditions established by the test. The assert_ssh_transfer_sessions() function's fourth parameter, which is currently named "objs_per_batch", is used to calculate the maximum number of trace log messages the function expects to find which record the start of an SSH process or its termination. Since this number is actually determined by the value of the "lfs.concurrentTransfers" option, we now rename the parameter to "max_concurrency" to better reflect its purpose. As well, in the tests where we call the assert_ssh_transfer_sessions() function and pass a hard-coded value of eight for the fourth parameter, we now instead pass the value of the "expectedConcurrentTransfers" variable. This will allow the tests to continue to pass regardless of how we choose to define the "lfs.concurrentTransfers" option's default value in the future, as the setup_expected_concurrent_transfers() function should always return the appropriate value. Note that for the reasons outlined in PR #6258, we plan to temporarily reverse the principal changes from PR #6241. However, we also intend to eventually re-adopt the use of a default concurrency limit that scales with the number of CPUs, at least for HTTP-based object transfers. Hence we still retain the setup_expected_concurrent_transfers() function in PR #6258 and can rely on it being present in the future.
This PR reverses the key change from PR #6241 and restores the fixed, historical default of 8 for the
lfs.concurrentTransfersconfiguration option, so that we do not spawn large numbers of child processes when transferring objects using our "pure" SSH protocol or when using a custom transfer agent.After the upcoming v3.8.0 Git LFS release we will revisit the issue of scaling default concurrency settings with CPU counts while limiting the number of child processes we might create. See #6259 for that discussion.
This PR will be most easily reviewed on a commit-by-commit basis.
Additional Context
As discussed in PRs #6241 and #6234, we would like to refine the concurrency controls for the Git LFS client so that they can scale with the number of available CPUs, at least when performing HTTP-based object transfers.
We adopted an initial implementation of in PR #6241, which was effective for HTTP object transfers and improved performance significantly on multi-core CPUs.
However, this change also has the unintended effect of scaling the number of child processes we execute when handling object transfers using a custom agent or when using multiplexed SSH connections. While some systems may not be affected, creating large numbers of separate processes is at best not ideal and at worst may run afoul of system or container limits.
We therefore temporarily reverse the key change from PR #6241 and restore the fixed, historical default of 8 for the
lfs.concurrentTransfersconfiguration option.After our upcoming release of version 3.8.0 of the Git LFS client, we intend to revisit this issue and find a more comprehensive solution so that we can safely scale thread counts while keeping process counts within reasonable bounds.
Other Changes
We also take the opportunity to correct the placement and wording of a comment regarding concurrency support for SSH-based object transfers.
/cc @joshfriend as the author of PRs #6241 and #6234.