Reuse SSH sessions when possible and only close at client exit#6266
Merged
chrisd8088 merged 14 commits intoMay 21, 2026
Conversation
In later commits in this PR we expect to revise our SSH connection handling so that when a Git LFS remote supports the "pure" SSH Git LFS object transfer protocol, we reuse SSH connections as much as possible for both locking API requests and object transfers. At present, though, we start separate SSH connections for locking API requests and for object transfers. As well, when verifying locks or making other locking API requests for multiple remote Git references, we start a separate SSH connection for each unique reference. This is particularly noticeable when a "git lfs pre-push" command is passed multiple references by the Git pre-push hook, in which case the command attempts to check for locks held on the remote by other users and makes one lock verification request for each Git reference being pushed. When the "pure" SSH Git LFS protocol is in use, the consequence is that we start one SSH connection for each reference, plus one or more SSH connections for the subsequent object transfers. One challenge we encountered while working toward a resolution of this issue is that our trace logs currently do not clearly indicate when the Git LFS client is making a lock verification request. To help clarify the behaviour of the client in this regard and aid future debugging efforts we therefore now add trace log messages which bracket each lock verification request and which state the remote reference for which the request is being made.
In commit 693c6f8 of PR git-lfs#5905 we updated the trace log messages output by the Git LFS client when it starts or stops SSH connections for the "pure" SSH-based Git LFS object transfer protocol, and in particular we revised several of the messages output by the setConnectionCount() method of the SSHTransfer structure in our "ssh" package to more clearly identify SSH connections by their internal session ID. In the case of the "terminating pure SSH connection" message, we made this change correctly, by adding a session ID to the message which we calculated using the current loop index variable "i" plus the variable "tn", which gives the initial offset into the slice of SSH connections over which the loop iterates. However, in the case of the "skipping uninitialized lazy pure SSH connection" message, we accidentally omitted the "tn" variable's offset from the value we interpolate into the message as the session ID, with the result that this message's session IDs are not aligned with those of the "terminating pure SSH connection" messages. To fix this oversight we simply make sure to include the slice offset from the "tn" variable in the session IDs we calculate when generating "skipping uninitialized lazy pure SSH connection" trace log messages.
First in PR git-lfs#4446 and then later in PR git-lfs#5905 we added tests to our "t/t-batch-transfer.sh" shell script to verify the behaviour of the Git LFS client when it uploads and downloads objects using the "pure" SSH-based Git LFS object transfer protocol. Each of these tests performs a "git push" command followed by a "git clone" command, checks the trace log output of those commands to ensure the expected number of SSH connections were started, and also checks that the commands have transferred all the objects created by the tests. However, at least in the case of the "git clone" commands, the tests do not actually confirm that the exit codes of the commands are zero (which indicates success). We set the "errexit" shell attribute in each test, but this only causes the tests' subshells to exit if the final command in a pipeline fails, and we perform the "git clone" commands in pipelines where the final tee(1) command always succeeds. We therefore now add checks of the first element of the PIPESTATUS shell array variable after each "git clone" command to verify that these commands have succeeded and returned exit statuses of zero. We also add the same checks after each "git push" command in our tests of the SSH-based Git LFS object transfer protocol, and revise the tests to run the "git push" commands in pipelines where a final "tee" command captures the trace messages output by the commands into a log file. The existing design was sufficient to confirm that the "git push" commands succeeded, since the "errexit" shell attribute would cause the tests to fail if the commands returned a non-zero exit code. Should this occur, though, the trace messages generated by the commands would be lost, because we redirected the commands' standard output into a log file. To aid future debugging efforts we therefore now adopt the same idiom we use elsewhere in our shell test suite. In each of these tests we pipe the output of the "git push" command into a "tee" command, and then check the first element of the PIPESTATUS array variable to confirm that the "git push" command succeeded. This change ensures that if a "git push" command does fail then the end_test() function which follows the test will report the full set of trace log messages from the command.
In commit a84cd13 of PR git-lfs#5905 we added a pair of tests to our "t/t-batch-transfer.sh" shell test script, both of which verify that the Git LFS client is able to successfully upload and download multiple objects using the "pure" SSH-based Git LFS object transfer protocol. These tests were intended to demonstrate that we had resolved the bug reported in git-lfs#5880, where the Git LFS client would crash when trying to transfer multiple objects using the SSH-based transfer protocol. The first test, named "batch transfers with ssh endpoint and multiple objects (git-lfs-transfer)", creates three objects and then checks that when pushing and fetching the objects, the Git LFS client uses three separate SSH processes to do so. Note that because we set the "lfs.ssh.autoMultiplex" configuration option to "true", the test expects that these SSH processes will share a single multiplexed connection. The second test, which we named "batch transfers with ssh endpoint and multiple objects and batches (git-lfs-transfer)", also creates three objects and then pushes and fetches them, but only after setting the "lfs.concurrentTransfers" configuration option to a value smaller than the number of objects to be transferred. This forces the Git LFS client to use only two separate SSH processes to transfer the objects, so that the transfer of the third object is delayed until the first two objects have been successfully transferred. Although the second test does verify this behaviour of the Git LFS client, the test's name is not accurate, nor is one of its internal code comments, because these both state that the objects are transferred in multiple batches. While the transfer queue does delay the transfer of one object until the other two have been transferred, this all still occurs within a single batch of objects. We therefore now update the name of the test to "batch transfers with ssh endpoint and multiple objects exceeding workers (git-lfs-transfer)", which more accurately reflects the conditions created by the test. We also rename the test's repository to align with the new test name, and revise the incorrect code comment to better explain the effect of setting the "lfs.concurrentTransfers" configuration option. Note that the remainder of this commit description is provided here solely for future reference, and to clarify and correct a few details from the commit descriptions in PR git-lfs#5905. The reason we originally developed the "batch transfers with ssh endpoint and multiple objects and batches (git-lfs-transfer)" test was to help reproduce the bug reported in git-lfs#5880, in which the Git LFS client would panic and crash when it tried to use an SSH process it had previously closed to transfer an object. Hence the test tries to force the client to reuse an SSH process to transfer at least one object, unlike what occurs with the preceding "batch transfers with ssh endpoint and multiple objects (git-lfs-transfer)" test. In that test, each of the three objects will be transferred using a separate SSH process, since the number of objects is fewer than eight, which is the default value of the "lfs.concurrentTransfers" option. In practice, both of these tests could reproduce the crash, absent some of the other changes we made in PR git-lfs#5905. Most obviously, in commit eca6c8a we updated the client to avoid attempting to use a previously closed SSH process. This stopped the panic and crash from occurring, so the tests would of course succeed once that change was made. However, prior to making that correction, we first resolved a bug in our "lfs-ssh-echo" test helper utility, and this on its own was sufficient to allow our new tests to pass, although we did not intend for that to be the case. Specifically, in commit 9af2883 of the same PR git-lfs#5905, we adjusted the "lfs-ssh-echo" utility so that it set defined read/write permissions on the temporary file it uses to simulate the behaviour of OpenSSH when the ControlMaster argument is supplied. This allowed subsequent invocations of the utility to reopen the file, whereas previously they would fail with a non-zero exit code. As it happens, though, the two tests we introduced in commit a84cd13 depended on the "lfs-ssh-echo" utility failing unexpectedly in order to fully reproduce the crash bug from git-lfs#5880. The crash would occur when the Git LFS client attempted to reuse an SSH process it had previously closed, which was only possible when the client had shut down one transfer queue and then started another. The crash did not occur if just a single queue was used, even if the queue transferred objects in multiple batches or even in sequence within a batch in order to respect the maximum transfer concurrency limit. Thus the values of the "lfs.transfer.batchSize" and "lfs.concurrentTransfers" configuration option had no effect on whether the crash occurred or not. Instead, to reproduce the crash, the client needed to be induced to start a second transfer queue after shutting down the initial one. As they were originally written, both of the tests we added in commit a84cd13 implicitly expected a second queue to be started after the first queue experienced the failure of an SSH process. The tests therefore effectively relied on the bug in the "lfs-ssh-echo" command to cause this to occur, but because we fixed that bug before adding the tests, they could never actually reproduce the issue as we intended. Both tests create and then clone a repository containing several Git LFS objects. When cloning, the "git lfs filter-process" command is invoked by Git and asked to apply its "smudge" filter to each object. Git sends these requests via its long-running filter protocol, and indicates that the filter process may delay its response, so the "git lfs filter-process" command replies with "status=delayed" messages and then enqueues the objects for download. To transfer the first object, the client starts an initial SSH process with the ControlMaster argument set to "yes". For the second object, since the "lfs.ssh.autoMultiplex" configuration option is enabled, the client starts another SSH process but sets its ControlMaster argument to "no", with the expectation that both processes will share the same underlying SSH connection. Our shell test suite uses our "lfs-ssh-echo" utility in place of an actual SSH program like OpenSSH, and the bug in the "lfs-ssh-echo" utility caused it to exit with an error status code when it was run with the ControlMaster argument set to "no". When the bug was present, then, the second object transfer in the client's initial queue would fail, and queue would treat this error as one for which the transfer should not be retried. As a consequence, when Git sent an initial "list_available_blobs" request, per the long-running filter protocol, the Git LFS client would reply with only the file path corresponding to the first object. Git would then retrieve the "smudged" content of that file via another "smudge" filter protocol request (this time with the "can-delay" option not permitted, though), after which it would make another request to list the available blobs. At this point, the first transfer queue would have been shut down, because when the "git lfs filter-process" command receives the first "list_available_blobs" request from Git, it calls the Wait() method of the TransferQueue structure in our "tq" package. This method closes various channels and then calls the Shutdown() method of the SSHTransfer structure in our "ssh" package, which terminates all of the running SSH processes. When our "git lfs filter-process" command responds to the second "list_available_blobs" request from Git, we might expect that it would reply with an error status for the second object, since the queue failed to download it and determined that further transfers would not be attempted. However, since commit e764429 of PR git-lfs#2511, when we implemented support for the delay feature of Git's long-running filter protocol, this is not how the Git LFS client behaves under these conditions. Since the transfer queue has been closed down, when the filterCommand() function in our "commands" package handles the second "list_available_blobs" request from Git, it finds that no file paths can be read from the queue's "available" channel. However, it then iterates through the list of outstanding file paths for which no content data has been returned to Git, and sends each of those file paths back in its response. (Specifically, the function iterates over the "ptrs" map and replies to Git with all of the map's keys. These keys are the file paths Git sent in "smudge" requests with the "can-delay" option enabled, and for which no content data has been sent back to Git.) Because Git receives the file path corresponding to the second object in reply to its second "list_available_blobs" request, Git now requests that object's "smudged" content, and does so with the "can-delay" option disabled. When the "git lfs filter-process" command receives this request, its filterCommand() function invokes the smudge() function for the single file's path, which in turn calls the Smudge() method of the GitFilter structure in our "lfs" package. That method runs the same structure's downloadFile() method, and that calls the "tq" package's NewTransferQueue() function to start a dedicated queue to download just the one object. The bug we fixed in commit eca6c8a would now cause the client to panic and crash. Note, though, that our description in that commit suggests that the crash would occur between batches in a single transfer queue, which is not accurate. As explained above, for the crash to occur, a first transfer queue must have been fully shut down, and then another queue started. The new TransferQueue structure would be initialized with the same concreteManifest structure as was populated for the original queue. The batchClientAdapter field of this concreteManifest structure pointed to an extant SSHBatchClient structure, and the "transfer" field in that structure in turn pointed to the SSHTransfer structure that was used to start and stop the SSH connections and processes for the initial transfer queue. When the SSHBatchClient structure's batchInternal() method would attempt to retrieve a session, it would receive a "nil" value from the SSHTransfer structure's Connection() method and then try to dereference it, causing a panic. All of which explains why, when we fixed the bug in the "lfs-ssh-echo" utility, the tests we introduced in PR git-lfs#5905 were no longer effective at actually simulating the conditions under which the Git LFS client would panic and crash.
In commit aa08c37 of PR git-lfs#6241 we changed the default value of the "lfs.concurrentTransfers" configuration option from eight to a dynamic limit based on the number of CPUs in the current system. This change required that we also update a number of the tests in our shell test suite, as they previously were written with the value of eight hard-coded as the expected default value of the "lfs.concurrentTransfers" option. We therefore added a new setup_expected_concurrent_transfers() helper function to our "t/testhelpers.sh" shell library, and revised several test scripts to invoke this function before executing any tests. The function sets a global "expectedConcurrentTransfers" variable with the expected default value of the "lfs.concurrentTransfers" option so that tests can now refer to this variable instead of using a fixed hard-coded value as they did before. However, when making these revisions, we inadvertently overlooked the tests of the "pure" SSH-only Git LFS transfer protocol in our "t/t-batch-transfer.sh" script. At present, these tests also contain a hard-coded value for the expected maximum number of object transfers which may be performed concurrently. The tests pass this value to the assert_ssh_transfer_sessions() helper function, which is defined in the same script as the tests. One reason we did not adjust these tests in PR git-lfs#6241 is that the assert_ssh_transfer_sessions() function's fourth parameter, which is expected to receive a value representing the maximum number of concurrent transfers, is misnamed as "objs_per_batch". We chose the parameter name "objs_per_batch" when we first introduced the function in commit b20b6e9 of PR git-lfs#5905. In the same PR, we then added tests of the SSH-based object transfer protocol, including the "batch transfers with ssh endpoint and multiple objects and batches (git-lfs-transfer)" test in which we explicitly set the "lfs.concurrentTransfers" with a value of two. As we explained in a prior commit in this PR, that test was also misnamed, as it effectively forced the Git LFS client to only transfer two objects at a time, but did not alter the batch size or the number of batches the queue processed. Hence we have now renamed the test to "batch transfers with ssh endpoint and multiple objects exceeding workers (git-lfs-transfer)", which more accurately reflects the conditions established by the test. The assert_ssh_transfer_sessions() function's fourth parameter, which is currently named "objs_per_batch", is used to calculate the maximum number of trace log messages the function expects to find which record the start of an SSH process or its termination. Since this number is actually determined by the value of the "lfs.concurrentTransfers" option, we now rename the parameter to "max_concurrency" to better reflect its purpose. As well, in the tests where we call the assert_ssh_transfer_sessions() function and pass a hard-coded value of eight for the fourth parameter, we now instead pass the value of the "expectedConcurrentTransfers" variable. This will allow the tests to continue to pass regardless of how we choose to define the "lfs.concurrentTransfers" option's default value in the future, as the setup_expected_concurrent_transfers() function should always return the appropriate value. Note that for the reasons outlined in PR git-lfs#6258, we plan to temporarily reverse the principal changes from PR git-lfs#6241. However, we also intend to eventually re-adopt the use of a default concurrency limit that scales with the number of CPUs, at least for HTTP-based object transfers. Hence we still retain the setup_expected_concurrent_transfers() function in PR git-lfs#6258 and can rely on it being present in the future.
In PR git-lfs#763 we revised the way the Git LFS client created and deleted temporary files, and as part of this change we updated our top-level main() function so that upon exit it invokes a function to delete any stale temporary files and always does so exactly one time only. Specifically, in commit a0704be we updated the main() function to call the ClearTempObjects() function in our "lfs" package using the Do() method of a Once structure from the "sync" package in the Go standard library. The main() function invoked the ClearTempObjects() function via the Do() method in two instances, when the main() function was exiting normally, and when it received a signal from the operating system. Then in commit a3a2369 of the same PR we refined the list of signals for which the client will be notified to exit and restricted them to just the SIGINT and SIGKILL signals. Later, in commit b9c5a10 of PR git-lfs#1390 we created a Cleanup() function in our "commands" package and changed the Do() methods in our main() function to call this new Cleanup() function instead. In turn, the Cleanup() function then invoked the ClearTempObjects() function. Finally, in commit e9121fd of PR git-lfs#2689 we replaced the ClearTempObjects() function with the Cleanup() method of the Filesystem structure in a new "fs" package. In the same commit we also added a Cleanup() method for the Configuration structure in our "config" package, which just invokes the Cleanup() method on the Configuration structure's "fs" field. We then updated the Cleanup() function in the "commands" package to call the Cleanup() method of the global "cfg" variable, whose type is that of a Configuration structure from our "config" package. These refactoring steps did not alter the basic design from PR git-lfs#763, though, in which stale temporary files are removed when the Git LFS client either receives a SIGINT and SIGKILL signal or otherwise exits normally. Note that these conditions include the case where the main() function will return a non-zero exit code of 127 because the Run() function in our "commands" package has returned this value. However, this case occurs only when a user provides an unknown Git LFS command name or invalid set of command-line flag options and arguments. When that occurs, the Execute() method of the Command structure from the "github.com/spf13/cobra" package returns a non-nil "error" value to our Run() function, which then returns the value 127 to the main() function. In other instances when the Git LFS client encounters an unrecoverable error condition, though, our command implementation functions call one of the exit wrapper functions in our "commands" package, such as Exit() or ExitWithCode(). In a recent set of changes in PR git-lfs#6255 we ensured that one of these wrapper functions is always used when the client needs to halt immediately. However, these wrapper functions do not consistently perform the cleanup steps that are otherwise executed by the Run() and main() functions when the client exits normally. In subsequent commits in this PR we expect to revise our exit wrapper functions so that when the client halts it always performs the same set of cleanup actions. As an initial step toward this goal, we begin by moving the Once structure from the "main" package to our "commands" package. We also refactor the Cleanup() function in the "commands" package so that it just invokes a new doCleanup() function using the Do() method of the Once structure. The doCleanup() function then performs the actual removal of any stale temporary files by calling the Cleanup() method of the global "cfg" variable, which in turn executes the Cleanup() method of a Filesystem structure from our "fs" package. Next, we revise the main() function so that it calls the Cleanup() function of the "commands" package directly, both at the end of the main() function and when a SIGINT or SIGKILL signal has been received. As before, though, the temporary file cleanup operation itself should occur at most one time only, as it is exclusively invoked via the Once structure's Do() method. With these changes in place, we will now be able to make further refinements to ensure that the Git LFS client always performs cleanup operations when exiting. In subsequent commits in this PR we will update the exit wrapper functions in our "commands" package so they all call the Cleanup() function, which can now guarantee on its own that it will only execute the cleanup activities a single time.
In a prior commit in this PR we refactored our "main" package's main() function and the Cleanup() function in our "commands" package so that the Cleanup() function can now guarantee on its own that it will only execute the cleanup activities a single time. At present, the only caller of the Cleanup() function is the main() function, which does so before it calls the Exit() function from the Go standard library's "os" package. However, as we described in the prior commit in which we refactored the Cleanup() function, the individual command implementation functions in our "commands" package often call one of the exit wrapper functions in the same package when an unrecoverable error condition is encountered. These exit wrapper functions in turn call the "os" package's Exit() function, so they cause the Git LFS client to halt without performing any cleanup activities. Since the Cleanup() function is now able to guarantee that it will only execute the cleanup operations a single time, we can update all of the exit wrapper functions which call the "os" package's Exit() function so that they first invoke our Cleanup() function. In a recent set of changes in PR git-lfs#6255 we ensured that one of these wrapper functions is always used when the client needs to halt immediately. As a consequence, with the changes from this commit in place the client should now always perform its cleanup activities when exiting, regardless of whether or not an error condition was encountered.
In PR git-lfs#1839 we first developed our "lfsapi" package, and in commit fecdc9e of that PR we introduced the getAPIClient() function to our "commands" package to ensure that we only initialize a single, global Client structure from the "lfsapi" package. The getAPIClient() function acquires a global mutex, appropriately named "global", and then checks if the "commands" package's "apiClient" global variable has been initialized or not. If not, the function calls the NewClient() function of the "lfsapi" package and assigns the newly instantiated Client structure to the "apiClient" variable. Later, in commit 2435a67 of PR git-lfs#2184, we added an HTTPLogger field to the "lfsapi" package's Client structure. This field's type was that of a WriteCloser structure from the Go standard library's "io" package, and since this type has a Close() method, we also defined a Close() method for the Client structure that called the HTTPLogger field's Close() method, unless the field was nil. In the same commit in PR git-lfs#2184 we then updated the Run() function in our "commands" package to call the new Close() method of the Client structure, after first calling the getAPIClient() function to ensure that the "apiClient" variable was not nil. Subsequently, in commit d16a320 of PR git-lfs#2633, we introduced a closeAPIClient() function to the "lfsapi" package, which acquires the "global" mutex and then calls the "apiClient" variable's Close() method, but only if the variable is not nil. By revising the Run() function to then just call the closeAPIClient() function before returning, this change avoided the need to instantiate a Client structure only in order to close it. Note that later, in PR git-lfs#3244, we refactored the "lfsapi" package and moved portions of its functionality into a new "lfshttp" package, including the HTTP logging subsystem. The Close() method of the "lfsapi" package's Client structure now just calls the Close() method of the "lfshttp" package's Client structure, and that structure is the one which contains an httpLogger field. Regardless, it remains the responsibility of the "commands" package to call the Close() method on its "apiClient" variable when necessary, in order to release any resources associated with it or its internal components. However, because the "apiClient" variable's Close() method is only called when the Run() function is about to return, the resources associated with our HTTP logging subsystem are only properly released when the Git LFS client exits normally. If the client encounters an unrecoverable error condition, it invokes one of the exit wrapper functions in our "commands" package, which call the Exit() function of the Go standard library's "os" package and therefore do not return. As a result, the Run() function is not completely executed, and the closeAPIClient() function will never be called. Likewise, if the client receives a SIGINT or SIGKILL signal, the signal handler established by our main() function in the "main" package will halt the program by calling the "os" package's Exit() function, and so the closeAPIClient() function will also not be called in this case. Fortunately, in previous commits in this PR as well as in PR git-lfs#6255 we have made a series of revisions which ensure that when exiting, the Git LFS client will always call the Cleanup() function in our "commands" package, regardless of the circumstances under which the client needs to halt. We therefore now revise our "commands" package so that the closeAPIClient() function is called from the Cleanup() function rather than the Run() function. This change will guarantee that the resources associated with our HTTP logging subsystem are always properly released when the Git LFS client exits. We also take the opportunity to add a check of the "error" return value from the closeAPIClient() function, so that if it is not nil, the Cleanup() function will report the corresponding error message to the standard error file descriptor.
In previous commits in this PR, we revised the Git LFS client such that it now always performs a consistent set of cleanup activities when exiting, regardless of the circumstances. In particular, the Cleanup() function in our "commands" package should always be executed before the client halts, and one of the steps this function now takes is to invoke the closeAPIClient() function in the "commands" package. The closeAPIClient() function checks if the "apiClient" global variable has been initialized. If it has, then it points to a Client structure from our "lfsapi" package, and the closeAPIClient() function calls that structure's Close() method to release any resources associated with the structure. However, we also instantiate Client structures from the "lfsapi" package in a number of other instances, primarily for our Go test suite but not exclusively so. While in many of these cases we do not strictly need to call the Close() method on the Client structures, for consistency we now revise all these instances so that we do not create a Client structure without eventually also calling its Close() method. In the Environ() function of our "lfs" package, this is simply a matter of registering the Close() method to run when the function returns, as the Client structure created by the function is only referenced within the scope of the function. In the newConcreteManifest() function of our "tq" package, however, we remove the conditional block which would instantiate a new Client structure if the function's "apiClient" parameter had a nil value. The only purpose of this conditional block was to allow some of our Go test functions in the "tq" package to avoid creating a Client structure to pass to the package's NewManifest() function, which instantiates a new lazyManifest structure. When one of that structure's methods was first invoked, it would call the newConcreteManifest() function, and a Client structure from the "lfsapi" package would be created at that time. With this conditional block removed, we then update all of the test functions in the "tq" package which call the NewManifest() function to ensure that they create a Client structure to pass to that function as its "apiClient" parameter, and that they register the structure's Close() function to run when the test function returns.
In previous commits in this PR, we revised the Git LFS client such that it now always performs a consistent set of cleanup activities when exiting, regardless of the circumstances. In particular, the Cleanup() function in our "commands" package should always be executed before the client halts, and one of the steps this function now takes is to invoke the closeAPIClient() function in the "commands" package. The closeAPIClient() function checks if the "apiClient" global variable has been instantiated. If it has, then it points to a Client structure from our "lfsapi" package, and the closeAPIClient() function calls that structure's Close() method to release any resources associated with the structure. In addition, we updated several other packages in our main Git LFS client program and its Go test functions so that after a Client structure from the "lfsapi" package is instantiated, its Close() method will always be invoked when the structure is no longer in use. For consistency, we now update the three other programs in our project which instantiate Client structures from the "lfsapi" package so that they also always call the Close() method on these structures at the end of their lifetimes. Note that in the case of the "t/git-lfs-test-server-api" program, we also revise the buildManifest() function as it no longer needs to return an "error" value, but does now need to return the Client structure it creates. In commit d40d1b7 of PR git-lfs#6254 we altered the NewClient() function in the "lfsapi" package so that it does not return an "error" value. At that time we could also have simplified the buildManifest() function in the "t/git-lfs-test-server-api" program to not return an "error" value, as it previously did so only to report any error condition returned by the NewClient() function of the "lfsapi" package. However, we overlooked this potential simplification in PR git-lfs#6254, so we make it now instead.
When we introduced support for the "pure" SSH-based Git LFS object transfer protocol in PR git-lfs#4446, we designed the SSHTransfer structure in our "ssh" package to be the basic abstraction with which we represent and manage SSH processes and connections. Then in commit 31d3fb7 of the same PR we added an SSHTransfer() method to the Client structure of our "lfsapi" package in order to provide a common interface through which we could initiate and share SSH sessions for multiple purposes. As we wrote in the description of that commit: The lfsapi client is used to perform operations for SSH authentication already, so we know we'll have one wherever SSH operations might be done. Let's move the instantiation to the client so that we can reuse it both in the transfer queue code and in the locking code as well. Using this SSHTransfer() method of the Client structure does have the effect of making consistent all the instances in which we instantiate and initialize new SSH connections for the "pure" SSH-based Git LFS object transfer and locking protocols. Note that the SSHTransfer() method may return a nil value rather than a fully initialized SSH session, not only in the case of an error condition, but also when the remote service does not provide support for the "pure" SSH-based Git LFS protocols. In this case, when the request to execute the "git-lfs-transfer" command fails, the Git LFS client will fall back to attempting to execute the "git-lfs-authenticate" command over SSH. That step involves a different abstraction, though, namely the SSHResolver type in our "lfshttp" package. While our implementation of the Client structure's SSHTransfer() method allows us to establish a new SSH session when the "pure" SSH-based Git LFS protocols are supported, we neglected to account for the fact that the method creates a new set of sessions each time it is called. The consequence is that when handling SSH-based object transfers as well as locking requests, we create an additional SSH process and connection just for the locking requests. Further, while we shut down the SSH connections we use for object transfers after the first transfer queue completes, we never do the same for the connections we create for locking requests. In a subsequent commit in this PR we will address the latter issue, as well as the issue reported in git-lfs#6118, which arises because we shut down SSH connections at the end of each transfer queue, but also try to reuse them between queues. Thus when we are transferring objects over SSH for more than one Git reference, the second queue we start always fails. To resolve this problem we will need to shut down SSH connections only when the Git LFS client is exiting, rather than when a single transfer queue completes. In turn, this implies we need to retain a listing of all the unique sets of SSH sessions we create during the lifetime of the Git LFS client. We therefore refactor the SSHTransfer() method in our "lfsapi" package into two methods. The new initSSHTransfer() method performs the same actions as the SSHTransfer() method did before, and as such may return a nil value if the remote service does not support the "git-lfs-transfer" command. The SSHTransfer() method now maintains a map of the SSH sessions for which it has called the initSSHTransfer() method, and only calls that method if it is required. Otherwise, the value returned by the initSSHTransfer() method for the given "operation" and "remote" parameters is returned again, even if it is a nil value. To maintain this map safely, the SSHTransfer() method first acquires a mutex, which we add to the Client structure along with the map. For keys, we use a concatenation of the unique pairs of "operation" and "remote" strings passed to the method as its parameters. This technique follows the existing approach used in our "commands" package, where we construct keys for the global "tqManifest" map by concatenating the "operation" and "remote" string parameters of the getTransferManifestOperationRemote() function. One advantage of this change is that it also resolves the problem whereby we would previously create an extra SSH process and connection for each locking request, even if they were to be made to the same remote service as subsequent object transfer requests, and for the same type of operation (i.e., an upload or download operation). Because we now reuse SSH connections for both object transfer and locking requests, we can simplify the assert_ssh_transfer_sessions() helper function we added to the "t/t-batch-transfer.sh" shell test script in commit b20b6e9 of PR git-lfs#5905. The assert_ssh_transfer_sessions() helper function, which checks the number of trace log messages that record the start of an SSH process or its termination, no longer needs to make special provision for the fact that certain operations would cause an extra SSH process to be started just to make a locking API request, and that this extra SSH process would never be terminated.
When we introduced support for the "pure" SSH-based Git LFS object transfer protocol in PR git-lfs#4446, we designed the SSHTransfer structure in our "ssh" package to be the basic abstraction with which we represent and manage SSH processes and connections. Then in commit 31d3fb7 of the same PR we added an SSHTransfer() method to the Client structure of our "lfsapi" package in order to provide a common interface through which we could initiate and share SSH sessions for multiple purposes. As we described in a previous commit in this PR, our original implementation of the SSHTransfer() method always either returned a new SSHTransfer structure, representing a new SSH session and underlying set of connections, or else returned a nil value to indicate that the "pure" SSH-based Git LFS protocols were not supported by the remote service, or that an error condition had occurred. Because the SSHTransfer() method always returns a new SSH session (when it does not return nil), at present we create separate sessions for both object transfer requests and locking requests. However, our original intention appears to have been that we would reuse SSH sessions as much as possible, per the remarks in commit 31d3fb7: The lfsapi client is used to perform operations for SSH authentication already, so we know we'll have one wherever SSH operations might be done. Let's move the instantiation to the client so that we can reuse it both in the transfer queue code and in the locking code as well. In a prior commit in this PR we have addressed this issue by refactoring the SSHTransfer() method so that for a given set of input parameters, it always returns the same value. In particular this means that if a remote service supports the SSH-based Git LFS protocols, then the method always returns the same SSH session. A further issue in our original design, as reported in git-lfs#6118, is that when a transfer queue completes, we shut down any SSH session it was using, but then we attempt to reuse that session in subsequent transfer queues, which causes an error. Specifically, when uploading or downloading objects for multiple Git references, we establish a single transfer manifest for the entire operation. The first time one of the methods of the "lazyManifest" structure in our "tq" package is invoked, it creates a "concreteManifest" structure by calling the newConcreteManifest() function, which calls the SSHTransfer() function of our "lfsapi" package. Thus if that function returns a non-nil value (i.e., an SSHTransfer structure), then the manifest retains a pointer to that structure for the lifetime of the Git LFS client process. We then pass that manifest to the NewTransferQueue() function each time we start a new transfer queue. After all the objects to be transferred have been added to the queue, we then call the Wait() method of the TransferQueue structure. When all the object transfers are complete, if the queue's manifest contains non-nil "sshTransfer" field, then the Wait() method calls the Shutdown() method of the SSHTransfer structure, which closes the underlying SSH connections. The consequence is that if we then attempt to start another queue for the same operation and remote endpoint, we do attempt to reuse the same SSHTransfer structure and associated SSH session, but all of its SSH processes and connections have been terminated. This leads to the error reported in git-lfs#6118. Conversely, until we refactored the SSHTransfer() method in a prior commit in this PR, we used to create a separate SSH session to make locking requests via the SSH-based Git LFS protocol, and then never attempted to close this session at all. We can now take advantage of a series of other changes we made in previous commits in this PR to ensure that we always attempt to close SSH sessions, but only at the end of the Git LFS client's lifetime and not when a single transfer queue completes. In these prior commits, we revised the main() function of our "main" package, along with the Run() function and exit wrapper functions in our "commands" package, so that when the Git LFS client is about to exit, it will always make exactly one call to the Close() method of the global "apiClient" variable in that package. As well, the SSHTransfer() method of the Client structure in the "lfsapi" package now maintains a map of all the SSH sessions it has initialized. We can therefore add a closeSSHTransfers() method which iterates over all the non-nil values in this map and calls the Shutdown() method of each SSHTransfer structure, and then invoke this new closeSSHTransfers() method from the Client structure's Close() method. We also revise the Wait() method of the TransferQueue structure so that it no longer attempts to call the Shutdown() method of the SSHTransfer structure associated with the queue's manifest. As these changes should resolve the problem reported in git-lfs#6118, we can then add a new test to our "t/t-batch-transfer.sh" shell test script to verify that this is indeed the case. Our new "batch transfers with ssh endpoint and multiple branches (git-lfs-transfer)" test does this by running each of the "git push" and "git lfs fetch" commands for multiple Git references, which causes both commands to start and stop more than a single transfer queue. Note that because our new test uses the "git lfs fetch" command to download Git LFS objects for multiple branches at once, we have to adjust the assert_ssh_transfer_sessions() helper function slightly. In our other tests of the SSH-based Git LFS object transfer protocol, we use the "git clone" command to download objects, which only does so for a single Git reference at most when checking out files into the new working tree. During this checkout phase of the "git clone" command, it invokes either the "git lfs filter-process" or "git lfs smudge" commands, depending on the version of Git available. With versions of Git earlier than 2.11.0, the "git lfs smudge" command will be used, and will be invoked once for each individual object to be downloaded. As a result, our assert_ssh_transfer_sessions() helper function adjusts the number of distinct SSH sessions it expects to find in the trace log it is passed. However, because our new test runs the "git lfs fetch" command directly, only a single SSH session should be started even when the version of Git available is older than v2.11.0. Hence we revise the assert_ssh_transfer_sessions() helper function so that it now checks the trace log for evidence that the "git lfs smudge" command has been run, and otherwise assumes that it does not need to account for each object being transferred by a separate Git LFS command. Note also that without the enhancements we made in a prior commit in this PR, this test would have required that we further adjust the assert_ssh_transfer_sessions() helper function. We used to start a separate SSH session to perform locking requests over the SSH-based Git LFS protocol, and we used to initialize one such SSH session for each Git reference. Thus we would have needed to revise the helper function to account for the extra SSH session started for each branch processed by the "git push" command in our new test. Previously, the assert_ssh_transfer_sessions() function simply incremented the expected counts of SSH connections by one when locking requests were anticipated, but this only worked because none of our tests of the "pure" SSH-based object transfer protocol created more than a single Git branch. Fortunately, when we refactored the SSHTransfer() method in the "lfsapi" package of the client so that it now avoids creating duplicative SSH sessions, we were able to simplify the helper function at the same time and remove the logic which incremented the expected SSH connection counts.
larsxschneider
approved these changes
May 19, 2026
|
|
||
| GIT_TRACE=1 git push origin main >push.log 2>&1 | ||
| GIT_TRACE=1 git push origin main 2>&1 | tee push.log | ||
| [ 0 -eq "${PIPESTATUS[0]}" ] |
Member
There was a problem hiding this comment.
This captures the exit status for all SSH tests as the commit message describes it. Should we do the same for other tests? (e.g. https://github.com/git-lfs/git-lfs/blame/b84a8a68c81e1710ec456922ee9eac615458104a/t/t-batch-transfer.sh#L91 )
Member
Author
There was a problem hiding this comment.
I thought about that, most especially the test of the git-lfs-authenticate flow (i.e., using SSH only to acquire HTTP credentials), but decided to keep the scope of the PR somewhat more focused.
There are certainly lots of instances in our test suite where we could standardize how we perform command checks, but I figure that's best left for a more comprehensive PR that tackles all our tests at once.
As suggested by larsxschneider on PR review, we can slightly improve the closeAPIClient() function in our "commands" package by resetting the "apiClient" global variable to nil after calling its Close() method. The closeAPIClient() function is only called from the doCleanup() function we introduced in a previous commit in this PR, and should only be performed just before the Git LFS client halts. As such, we do not expect subsequent calls to the getAPIClient() function, but if they should occur (perhaps due to a regression we introduce in the future), they will now initialize a new Client structure from the "lfsapi" package rather than return a nil value, which might cause the client to panic and crash. Co-authored-by: Lars Schneider <larsxschneider@github.com>
In a prior commit in this PR we refactored the SSHTransfer() method of the Client structure in our "lfsapi" package so that for a given set of input parameters, it always returns the same value. In particular this means that if a remote service supports the "pure" SSH-based Git LFS object transfer and locking protocols, then the method always returns the same SSH session. The new initSSHTransfer() method of the same Client structure performs the actions that the SSHTransfer() method did before, and as such may return a nil value if the remote service does not provide the "git-lfs-transfer" command for execution via SSH. The SSHTransfer() method now maintains a map of the SSH sessions for which it has called the initSSHTransfer() method, and only calls that method if it is required. Otherwise, the value returned by the initSSHTransfer() method for the given "operation" and "remote" parameters is returned again, even if it is a nil value. At present, the initSSHTransfer() method returns a nil value when a remote service is not configured for the SSH-based Git LFS transfer protocols or does not support them, but also when an error occurs while attempting to initialize an SSH session with a remote service. This design is inherited from the original implementation of the SSHTransfer() method in commit 31d3fb7 of PR git-lfs#4446. Because we now cache the result of the initSSHTransfer() method and return it from the SSHTransfer() method, if an error occurs while trying to establish an SSH session, then this will prevent subsequent calls to the SSHTransfer() method from trying again to establish a session. As noted by larsxschneider on PR review, this behaviour deviates slightly from our legacy behaviour, in that we might previously have successfully initialized one SSH session to make locking requests, while failing to do so for object transfer requests, or vice versa. In practice, this difference is limited in its impact, due to the order in which we call the SSHTransfer() method, and to a lesser extent due to the default setting of the "lfs.locksVerify" configuration option. During an upload operation, unless the "lfs.locksVerify" configuration option is set to "false", we attempt to make lock verification requests to the remote endpoint. However, unless the "lfs.locksVerify" option is set to "true", we allow these to fail without halting the upload operation. Further, we actually try to establish an SSH session for the object upload transfers prior to attempting lock verification, for the reasons detailed below. The consequence is that with our new design of the SSHTransfer() method, if the initial attempt to establish an SSH session is successful, then all the lock verification requests will utilize that session, as will the subsequent object transfer requests. On the other hand, if the initial attempt to establish an SSH session fails, then upload operation cannot succeed because the object transfer requests will lack an active SSH session. Hence the fact that the locking requests will now also always fail in this case is of little significance. Nevertheless, we now adjust the SSHTransfer() and initSSHTransfer() methods so that we avoid caching a nil value when an error occurs while attempting to establish an SSH session. This change will ensure that if in the future we alter our commands' implementation such that the initial attempts to establish an SSH session are made for optional actions, then these may safely fail without necessarily causing the subsequent requisite actions to fail as well. For example, suppose we were to revise the order in which we call the SSHTransfer() method so that we first did so specifically for lock requests and not for object transfer requests. Then unless the "lfs.locksVerify" configuration option was set to "true", the calls to the SSHTransfer() method for the lock requests could all fail to establish an SSH session, but since we now avoid caching a nil value when an error has occurred, the call to the SSHTransfer() method for the object transfer requests might succeed and return a valid SSH session. Note, though, that at present this is not the order in which these calls to the SSHTransfer() method are made, for the reasons we outline below. Both the "git lfs pre-push" or "git lfs push" commands begin by calling the newUploadContext() function, which creates an initial manifest for the transfer queue by calling the getTransferManifestOperationRemote() function and passing the result to the newLockVerifier() function. The getTransferManifestOperationRemote() function then invokes the NewManifest() function from our "tq" package, which returns a simple "lazyManifest" structure, allowing us to defer any complex initialization steps at this point. However, the newLockVerifier() function calls the manifest's IsStandaloneTransfer() method, which forces the manifest to be upgraded to a "concreteManifest" structure using the "tq" package's newConcreteManifest() function. That function then calls the SSHTransfer() method of the "lfsapi" package's Client structure to try to establish an SSH session. The consequence is that we try to establish an SSH session for the object transfer requests first, although the lock verification requests occur before the object transfer requests.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
This PR updates the Git LFS client so that where possible, SSH sessions are reused for multiple related operations.
This change will permit the "pure" SSH-based Git LFS transfer protocol to be used when pushing or fetching multiple branches, resolving the issue reported in #6118.
As well, this PR revises the Git LFS client so that when exiting, it always tries to cleanly terminate any SSH sessions that were started for the "pure" SSH-based transfer protocol.
This PR will be most easily reviewed on a commit-by-commit basis, with whitespace differences ignored.
SSH Session Reuse
When we introduced support for the "pure" SSH-based Git LFS object transfer protocol in PR #4446, we designed the
SSHTransferstructure in oursshpackage to be the basic abstraction with which we represent and manage SSH processes and connections for the protocol.In commit 31d3fb7 of the same PR we added an
SSHTransfer()method to theClientstructure of ourlfsapipackage in order to provide a common interface through which we could initiate and share SSH sessions. This method always returns either a newSSHTransferstructure, representing a new SSH session and underlying set of connections, or else returns anilvalue to indicate that the "pure" SSH-based Git LFS protocols are not supported by the remote service, or that an error condition has occurred.Because the
SSHTransfer()method always returns a new SSH session (when it does not returnnil), at present we create separate sessions for both object transfer requests and locking requests.Further, as reported in #6118, when a transfer queue completes we shut down any SSH session it was using, but then we attempt to reuse that session in subsequent transfer queues, which causes an error.
Specifically, when uploading or downloading objects for multiple Git references, we establish a single transfer manifest for the entire operation. The first time one of the methods of the
lazyManifeststructure in ourtqpackage is invoked, it creates aconcreteManifeststructure by calling thenewConcreteManifest()function, which calls theSSHTransfer()function of ourlfsapipackage. Thus if that function returns a non-nilvalue (i.e., anSSHTransferstructure), then the manifest retains a pointer to that structure for the lifetime of the Git LFS client process.We then pass that manifest to the
NewTransferQueue()function each time we start a new transfer queue. After all the objects to be transferred have been added to the queue, we then call theWait()method of theTransferQueuestructure.When all the object transfers are complete, if the queue's manifest contains non-
nilsshTransferfield, then theWait()method calls theShutdown()method of theSSHTransferstructure, which closes the underlying SSH connections.The consequence is that if we then attempt to start another queue for the same operation and remote endpoint, we do attempt to reuse the same
SSHTransferstructure and associated SSH session, but all of its SSH processes and connections have been terminated. This leads to the error reported in #6118.To avoid this problem, we first refactor the
SSHTransfer()method in thelfsapipackage into two methods. The newinitSSHTransfer()method performs the same actions as theSSHTransfer()method did before, and as such may returna
nilvalue if the remote service does not support thegit-lfs-transfercommand.The
SSHTransfer()method now maintains a map of the SSH sessions for which it has called theinitSSHTransfer()method, and only calls that method if it is required. Otherwise, the value returned by theinitSSHTransfer()method for the givenoperationandremoteparameters is returned again, even if it is anilvalue.To maintain this map safely, the
SSHTransfer()method first acquires a mutex, which we add to theClientstructure along with the map. For keys, we use a concatenation of the unique pairs ofoperationandremotestrings passed to the method as its parameters. This technique follows the existing approach used in ourcommandspackage, where we construct keys for the globaltqManifestmap by concatenating theoperationandremotestring parameters of thegetTransferManifestOperationRemote()function.One advantage of this change is that it also resolves the problem whereby we would previously create an extra SSH process and connection for each locking request, even if they were to be made to the same remote service as subsequent object transfer requests, and for the same type of operation (i.e., an upload or download operation).
Another advantage is that when the Git LFS client is exiting, we can identify the set of SSH sessions that were opened for the SSH-based transfer protocol and attempt to terminate them all properly, instead of closing the individual session used for a transfer queue after that queue has completed.
To do this, we add a
closeSSHTransfers()method to theClientstructure in thelfsapipackage, and then invoke this new method from theClientstructure'sClose()method. TheClose()method is called by thecommandspackage'scloseAPIClient()function when a Git LFS command has completed normally.Our new
closeSSHTransfers()method iterates over all the non-nilvalues in the map maintained by theSSHTransfer()method and calls theShutdown()method of eachSSHTransferstructure, thus attempting to terminate them properly. We can then revise theWait()method of theTransferQueue structureso that it no longer attempts to shut down the SSH session associated with the queue's manifest.Collectively, these revisions and enhancements should resolve the problem reported in #6118. To verify this, we add a new test to our
t/t-batch-transfer.shshell script which runs each of thegit pushandgit lfs fetchcommands for multiple branches, causing both commands to start and stop more than a single transfer queue.Consistent Exit Handling
The changes described above ensure we reuse SSH sessions for the "pure" SSH-based Git LFS transfer protocols, and that we only attempt to terminate them when the client is exiting, at least under normal circumstances.
However, if certain errors occur or if one of the signals we trap is received, then the Git LFS client may halt immediately, and in these cases we would like it to at least attempt to terminate any SSH sessions started for the SSH-based transfer protocol.
In PR #6255 we revised all of the individual command implementation functions in our
commandspackage so that they always invoke one of a small set of exit wrapper functions when they need to force the client to halt. We can now take advantage of this fact to ensure that regardless of circumstances, the Git LFS client will always attempt to terminate any active SSH sessions that were opened for the "pure" SSH-based Git LFS transfer protocols.Most obviously, we add a call to the
Cleanup()function of ourcommandspackage to each of the exit wrapper functions, just before they call theExit()function of the Go standard library'sospackage.Next, we refactor the
Cleanup()function into two functions, with the actual cleanup activities performed by a newdoCleanup()function, which is then invoked by theCleanup()function exactly one time only, via theDo()method of aOncestructure from thesyncpackage of the Go standard library. We assign thisOncestructure to a newcleanupOnceglobal variable in ourcommandspackage, which replaces theoncevariable of the same type in ourmainpackage. In ourmain()function, we then just call thecommandspackage'sCleanup()directly in the two instances where we previously did so via theoncevariable'sDo()method.Finally, we add to the
doCleanup()function a call to thecloseAPIClient()function, which acquires the necessaryglobalmutex before invoking the globalapiClientvariable'sClose()method. This supplants the legacy call to thecloseAPIClient()function from ourRun()function, so we just remove that call.Other Changes
In addition to the principal changes described above, this PR also makes a number of small refinements, including:
setup_expected_concurrent_transfers()test helper function we added in PR #6241.assert_ssh_transfer_sessions()test helper function to correctly reflect the conditions checked by the test, in which the number of SSH connections is limited to fewer than the number of objects to be transferred in a single batch. This causes the transfer of some objects to be delayed, but still performed within the same batch as the others (and to be performed by the same transfer queue).Close()method of thelfsapipackage'sClientstructure type, whenever such a structure has reached the end of its lifetime. For the same reason, we update one instance in the Git LFS client where we create a temporaryClientstructure, and eliminate another similar instance which exists solely to support some of our Go test functions.Resolves #6118.
/cc @galets as reporter.