Fix crash during pure SSH object transfer with multiple objects by chrisd8088 · Pull Request #5905 · git-lfs/git-lfs

chrisd8088 · 2024-11-04T23:50:09Z

This PR fixes a bug where the Git LFS client may crash while cloning a repository with multiple Git LFS objects using the "pure" SSH version of the Git LFS transfer protocol.

Our existing test of the SSH object transfer protocol did not detect the problem because it only pushes and fetches a single object, so we add more tests to confirm that the changes in this PR are effective.

In order for these additional tests to succeed, we also fix a latent issue in our lfs-ssh-echo test utility program whereby it silently fails while trying to emulate the behaviour of the OpenSSH client when it is asked to multiplex a new SSH session over an existing SSH connection's control socket.

As well, we revise the trace log messages generated by our SSH session handling code and make several other minor adjustments, including renaming the receiver variables of our SSHTransfer structure so they do not conflict with the name of our tr text localization package, and expanding our existing single-object test of the SSH object transfer protocol to align with the more comprehensive checks performed by our new tests.

This PR will be most easily reviewed on a commit-by-commit basis.

Resolves #5880.

In subsequent commits in this PR we expect to resolve a pair of issues which prevent us from testing the SSH object transfer protocol with more than a single object. The SSH transfer protocol was introduced in PR git-lfs#4446, and in commit 691de51 of that PR, the "batch transfers with ssh endpoint (git-lfs-transfer)" test in our t/t-batch-transfer.sh test script was added to validate that the new protocol worked as expected. This test pushes and then fetches a single object, so the issues which arise when handling multiple objects in a batch do not cause the test to fail. Nevertheless, before addressing those issues, we first expand the existing test so it checks that the git-lfs-transfer command is seen in the trace log messages during a push operation. The test already checks that this command is used during a clone operation. As well, we enhance the "batch transfers with ssh endpoint (git-lfs-authenticate)" test, which checks that when the SSH transfer protocol is not available but Git is configured to use SSH for a remote, the Git LFS client performs an authorization handshake over SSH prior to using HTTP for its object transfers. In the test we now confirm that the git-lfs-authenticate command is seen in the trace log messages generated during a push operation, and that the object was successfully pushed and has been stored by the remote server. We also make a small revision to the "batch transfers succeed with an empty hash algorithm" test to remove an unnecessary file redirection.

In subsequent commits in this PR we expect to resolve a pair of issues which prevent us from testing the SSH object transfer protocol with more than a single object. The SSH transfer protocol was introduced in PR git-lfs#4446, and in commit 691de51 of that PR, the "batch transfers with ssh endpoint (git-lfs-transfer)" test in our t/t-batch-transfer.sh test script was added to validate that the new protocol worked as expected. This test pushes and then fetches a single object, so the issues which arise when handling multiple objects in a batch do not cause the test to fail. Hence we intend to add two other tests to accompany the existing one so as to validate the SSH transfer protocol with multiple objects. As these tests will all perform object transfers only over SSH, our HTTP-based lfstest-gitserver test helper program will not be used in any of them. That program retains a copy of each object it receives in memory to simulate a remote Git LFS server. As a consequence, many of our tests use the assert_server_object() function defined in our t/testhelpers.sh library to confirm that an object has been received by the remote server; this function makes an HTTP batch request to the lfstest-gitserver program and checks the JSON response. In our tests of the SSH transfer protocol, by contrast, object data will be proxied by the lfs-ssh-echo helper program to the git-lfs-transfer command, which will write the data into a separate bare Git repository in the location provided as a command argument. In fact this is how the existing test already operates; it uses the ssh_remote() function from our t/testhelpers.sh library to establish the SSH URL for its remote repository. The URLs returned by this function always include the directory path from our REMOTEDIR variable. We expect to also use the ssh_remote() function in our new tests to establish their remote repositories. 
In both our existing test and the ones we will add in a subsequent commit in this PR, we would like to confirm that the Git LFS objects we push have been successfully written into the "lfs/objects" cache in the remote repositories. To simplify such checks, we define a new assert_remote_object() assertion function in our shell test library. Unlike the assert_server_object() function, which makes an HTTP request, our new assertion acts in a similar manner to the existing assert_local_object() function by simply validating the size and existence of an object file in the appropriate subdirectory of the "lfs/objects" cache hierarchy of a bare repository. However, our new assert_remote_object() function constructs the path to the repository using the REMOTEDIR variable, rather than checking the object's presence in the current local repository as the assert_local_object() function does. With the assert_remote_object() function defined, we then update our existing "batch transfers with ssh endpoint (git-lfs-transfer)" test to make use of it after pushing an object over the SSH transfer protocol, and we will use the function for the same purpose in the additional tests we expect to introduce in a later commit in this PR.
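
The kind of check that assert_remote_object() performs can be pictured in Go roughly as follows. This is only a minimal sketch: the real assertion is a shell function in t/testhelpers.sh, the names objectPath(), checkRemoteObject(), and demo() are hypothetical, and the OID is an arbitrary placeholder. It assumes the usual Git LFS cache layout, in which an object is stored under two subdirectories named after the first four characters of its OID.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// objectPath returns where an object with the given OID would live inside
// the "lfs/objects" hierarchy of the bare repository at repoDir.
// (Hypothetical helper; it assumes the OID has at least four characters.)
func objectPath(repoDir, oid string) string {
	return filepath.Join(repoDir, "lfs", "objects", oid[0:2], oid[2:4], oid)
}

// checkRemoteObject returns an error unless the object file exists and has
// exactly the expected size.
func checkRemoteObject(repoDir, oid string, size int64) error {
	info, err := os.Stat(objectPath(repoDir, oid))
	if err != nil {
		return fmt.Errorf("object %s not found: %w", oid, err)
	}
	if info.Size() != size {
		return fmt.Errorf("object %s has size %d, expected %d", oid, info.Size(), size)
	}
	return nil
}

// demo writes one object into a throwaway bare-repository layout and then
// runs the check against it.
func demo() error {
	repoDir, err := os.MkdirTemp("", "remote-repo")
	if err != nil {
		return err
	}
	defer os.RemoveAll(repoDir)

	// Arbitrary 64-character hex string standing in for a real SHA-256 OID.
	oid := "0123456789abcdef0123456789abcdef0123456789abcdef0123456789abcdef"
	contents := []byte("a")
	path := objectPath(repoDir, oid)
	if err := os.MkdirAll(filepath.Dir(path), 0755); err != nil {
		return err
	}
	if err := os.WriteFile(path, contents, 0644); err != nil {
		return err
	}
	return checkRemoteObject(repoDir, oid, int64(len(contents)))
}

func main() {
	if err := demo(); err != nil {
		fmt.Println("assertion failed:", err)
		return
	}
	fmt.Println("object present with expected size")
}
```

Because the check only inspects the file system, it needs no running HTTP server, which is the property that distinguishes it from assert_server_object().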

In subsequent commits in this PR we expect to resolve a pair of issues which prevent us from testing the SSH object transfer protocol with more than a single object. The SSH transfer protocol was introduced in PR git-lfs#4446, and in commit 691de51 of that PR, the "batch transfers with ssh endpoint (git-lfs-transfer)" test in our t/t-batch-transfer.sh test script was added to validate that the new protocol worked as expected. This test pushes and then fetches a single object, so the issues which arise when handling multiple objects in a batch do not cause the test to fail. Hence we intend to add two other tests to accompany the existing one so as to validate the SSH transfer protocol with multiple objects. All these tests will perform object transfers over SSH, but with varying numbers and types of SSH sessions. While our existing test only pushes and fetches a single object, the tests we will add in a later commit in this PR will push and fetch multiple objects in each batch. By default, the Begin() method of the adapterBase structure in our "tq" package starts eight goroutines to process the objects in a batch, each running the structure's worker() method. The first worker routine must start an SSH session, and depending on how quickly other workers pull objects from the batch transfer queue, additional SSH sessions may be created as well, one per active worker. On platforms other than Windows, we set the default value of the "lfs.ssh.autoMultiplex" configuration option to "true". When this option is "true" and the available SSH client is considered to be compatible with OpenSSH, we attempt to use OpenSSH's ControlMaster option to create a control socket and multiplex all SSH sessions over a single common connection. We expect the first SSH session to be established with a ControlMaster argument value of "yes", and other sessions with a value of "no". 
At present, our existing "batch transfers with ssh endpoint (git-lfs-transfer)" test does not check the value of the ControlMaster argument. As we expect to add tests which may create more than one SSH session, however, we would like to validate the number of sessions that create new control sockets, and the number of SSH sessions overall. To perform these SSH session count checks we add two dedicated assertion functions, assert_ssh_transfer_sessions() and assert_ssh_transfer_session_counts(), with the former calling the latter several times. These functions are unlikely to be useful outside of the context of our tests of the SSH object transfer protocol, so we define them directly in the t/t-batch-transfer.sh test script rather than in our generic t/testhelpers.sh library. The primary function, assert_ssh_transfer_sessions(), checks the number of times the git-lfs-transfer command is executed over SSH, and confirms that each session has the ControlMaster option set to "yes", which is always the case for our existing test, so long as we update the test to force the use of SSH connection multiplexing on Windows by explicitly setting the "lfs.ssh.autoMultiplex" configuration option to "true". The assert_ssh_transfer_sessions() function then uses the secondary assert_ssh_transfer_session_counts() function to validate that the expected number of startup, success, and termination trace log messages are seen for each SSH session. One specific issue arises in regard to the number of SSH sessions and git-lfs-transfer commands we expect to see started during a Git push operation. At present, the Git LFS client always starts a unique SSH session to run the git-lfs-transfer command when checking for Git LFS locks on the objects being pushed, and this session will create a control socket if the SSH client in use is considered compatible with OpenSSH and the "lfs.ssh.autoMultiplex" configuration option is set to "true". 
The control socket opened for this session is distinct from the one created later by the first SSH session started by the batch transfer worker routines. The unique SSH session used for the lock verification request during push operations is managed with a separate instance of the SSHTransfer structure from our "ssh" package. At present, the Git LFS client never calls the setConnectionCount() method with a zero argument on this instance of the SSHTransfer structure, so the dedicated SSH session used for lock verification is never explicitly terminated. Therefore our new assert_ssh_transfer_sessions() function adjusts the number of git-lfs-transfer command execution messages and the number of start and success trace log messages it expects when a push (i.e., "upload") operation is performed to be one higher than the number of termination trace log messages it expects. This allows the function to account for the additional SSH control socket session started for the lock verification step, and then never explicitly terminated.
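
The session expectations described above can be summarized in a small Go sketch. It is illustrative only: the helper names controlArgs() and expectedCounts() are hypothetical, the exact argument format the Git LFS client passes to the SSH program is an assumption, and the real assertions are shell functions in t/t-batch-transfer.sh.

```go
package main

import "fmt"

// controlArgs returns the OpenSSH options that might be passed for the
// session with the given zero-based ID when multiplexing is enabled: only
// the first session asks to create the control socket.
// (Hypothetical helper; the exact argument format is an assumption.)
func controlArgs(sessionID int, controlPath string) []string {
	master := "no"
	if sessionID == 0 {
		master = "yes"
	}
	return []string{"-oControlMaster=" + master, "-oControlPath=" + controlPath}
}

// expectedCounts returns how many session start and termination trace
// messages a test might expect for the given number of batch transfer
// sessions. On upload, one extra session is started for lock verification
// and is never explicitly terminated.
func expectedCounts(direction string, transferSessions int) (starts, ends int) {
	starts, ends = transferSessions, transferSessions
	if direction == "upload" {
		starts++
	}
	return starts, ends
}

func main() {
	for id := 0; id < 3; id++ {
		fmt.Println(id, controlArgs(id, "/tmp/example-socket"))
	}
	starts, ends := expectedCounts("upload", 3)
	fmt.Println("upload: starts", starts, "terminations", ends)
}
```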

In commit 448b0c4 of PR git-lfs#5537 our lfs-ssh-echo test helper utility was updated to examine the value of any provided ControlMaster command-line argument and simulate the action of the OpenSSH program when it is passed a ControlMaster argument with a value of either "yes" or "no". (The other values OpenSSH supports for this option are not accepted by our lfs-ssh-echo utility at the moment.) When a ControlMaster command-line argument is found, and has the value "yes", the lfs-ssh-echo program attempts to create a temporary file at the location given by the ControlPath command-line argument, which must also be provided. If the file already exists, or some other error occurs while creating it, the program halts with a non-zero exit code. If the ControlMaster argument's value is "no", and the ControlPath argument is also defined, the program attempts to open the file at the given path, and exits with a non-zero code if the file does not already exist or some other error occurs. This logic is designed to emulate the behaviour of OpenSSH, which supports the use of these arguments to multiplex SSH sessions over a common connection using a control socket. When the ControlMaster argument is "yes", OpenSSH creates a socket, which may then be shared with other invocations of OpenSSH by setting the ControlMaster argument to "no" and passing the socket's path in the ControlPath argument. Our lfs-ssh-echo program, however, creates its temporary file with no defined file permissions (i.e., with a file mode argument of zero). This results in a file which can not be opened by any other instances of the lfs-ssh-echo program. At the moment this does not cause any problems with our test suite, because we have no tests which exercise the SSH version of the Git LFS object transfer protocol with more than a single object, so multiple SSH sessions with the same ControlPath argument are never started. 
We would like to create additional tests which push and fetch multiple Git LFS objects over SSH, though, to help diagnose issues such as those described in git-lfs#5880. Therefore we change our lfs-ssh-echo program to use the conventional file mode of 0666 when creating its temporary file, which will allow subsequent invocations of the program with the ControlMaster argument set to "no" to also open the same file. As well as fixing this bug, we also alter the way the lfs-ssh-echo program handles errors when the file specified by the ControlPath argument can not be created or opened. We now output distinct error messages in the cases where the file already exists or does not exist, respectively, as well as in the cases where some other type of error occurs. Previously, we assumed any error must be due to a file either existing or not existing, and did not report the actual error message, so problems such as those caused by missing file permissions were obscured.
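
The revised behaviour can be sketched in Go along the following lines. This is a simplified illustration rather than the actual lfs-ssh-echo source: the function names openControlFile(), classify(), and demo() and the error wording are hypothetical. It shows the two points made above: the file is created with mode 0666 so that a later "no" invocation can open it, and the "already exists" and "does not exist" cases are reported separately from other errors.

```go
package main

import (
	"errors"
	"fmt"
	"os"
	"path/filepath"
)

// openControlFile imitates a ControlMaster=yes invocation (master == true)
// by exclusively creating the file, and a ControlMaster=no invocation
// (master == false) by opening an existing file.
func openControlFile(path string, master bool) error {
	if master {
		// O_EXCL makes creation fail if the file already exists; a mode of
		// 0666 (before the umask is applied) lets later invocations open it.
		f, err := os.OpenFile(path, os.O_CREATE|os.O_EXCL|os.O_RDWR, 0666)
		if err != nil {
			return err
		}
		return f.Close()
	}
	f, err := os.Open(path)
	if err != nil {
		return err
	}
	return f.Close()
}

// classify maps an error onto the distinct cases reported to the user.
func classify(err error) string {
	switch {
	case err == nil:
		return "ok"
	case errors.Is(err, os.ErrExist):
		return "exists"
	case errors.Is(err, os.ErrNotExist):
		return "missing"
	default:
		return "other: " + err.Error()
	}
}

// demo runs a "no", "yes", "yes", "no" sequence against one control path.
func demo() ([]string, error) {
	dir, err := os.MkdirTemp("", "control")
	if err != nil {
		return nil, err
	}
	defer os.RemoveAll(dir)
	path := filepath.Join(dir, "socket")

	var results []string
	for _, master := range []bool{false, true, true, false} {
		results = append(results, classify(openControlFile(path, master)))
	}
	return results, nil
}

func main() {
	results, err := demo()
	if err != nil {
		fmt.Println("setup failed:", err)
		return
	}
	fmt.Println(results)
}
```

With a file mode of zero in place of 0666, the final "no" step would typically fail with a permission error for an unprivileged user, which is the failure that the "other" branch is meant to surface.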

Since PR git-lfs#4446 the SSHTransfer() method of the Client structure in our "lfsapi" package has output several trace log messages when attempting to instantiate a new SSHTransfer structure. This method takes "operation" and "remote" string arguments, and the returned SSHTransfer structure is specific to those values. As this method may be called several times with distinct arguments during the execution of a Git LFS process, we add the "operation" and "remote" variables to the trace log messages, which will help clarify the calling context of the method in our diagnostic logs.

In commit 326b1ee of PR git-lfs#5063 we added trace logging to several functions and methods related to the creation and termination of SSH sessions in order to help analyze the diagnostic logs generated when transferring Git LFS objects over SSH. However, in the startConnection() function in our ssh/connection.go source file, we report the successful creation of a session even if we are returning a non-nil error value. Therefore we revise our trace logging in that function to distinguish between unsuccessful and successful conditions based on whether the PktlineConnection structure's Start() method returned an error or not. In addition, we also update a number of our other trace log messages to include the relevant session ID. (Note that we refer to SSH sessions as "connections" in our code, although in practice they may share a single SSH connection using a control socket.) Because we maintain a set of SSH sessions and do not necessarily start or terminate all of them at the same time, this change will provide more clarity as to the state of each individual session at different points in a trace log. Finally, we rephrase several trace log messages generated by the setConnectionCount() method of the SSHTransfer structure so they more fully explain when the method is terminating specific sessions because it has been asked to reduce the total number of sessions.
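
The logging change amounts to choosing the trace message after checking the error, and tagging it with the session ID. A minimal sketch of that pattern, with a hypothetical traceStart() helper and illustrative message wording (neither is taken from ssh/connection.go):

```go
package main

import (
	"errors"
	"fmt"
)

// errStart is a stand-in failure used by the example.
var errStart = errors.New("start failed")

// traceStart runs start and returns a trace message that reflects the
// actual outcome for the session with the given ID, along with the error.
// (Hypothetical helper; the real code writes to the trace log instead.)
func traceStart(id int, start func() error) (string, error) {
	if err := start(); err != nil {
		return fmt.Sprintf("pure SSH connection %d failed: %v", id, err), err
	}
	return fmt.Sprintf("pure SSH connection %d successful", id), nil
}

func main() {
	msg, _ := traceStart(0, func() error { return nil })
	fmt.Println(msg)
	msg, _ = traceStart(1, func() error { return errStart })
	fmt.Println(msg)
}
```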

In commit b44cbe4 of PR git-lfs#5136 the "multiplex" argument was added to the FormatArgs() function in our ssh/ssh.go source file. Later, the "controlPath" argument was also added, in commit 448b0c4 of PR git-lfs#5537. Neither of these arguments is used in the function's body, though, so we can remove them now.

In PR git-lfs#4446 we defined a new SSHTransfer structure and a set of methods for it in our ssh/connection.go source file, and used the name "tr" for the receiver variables of those methods. Later, in commit 5dbbf13 of PR git-lfs#5674, we imported our "tr" message translation package into the same source file in order to format and localize an error message generated by the startConnection() function. Because this function is not one of the methods of the SSHTransfer structure there was no conflict with the "tr" receiver variables of those methods. However, in subsequent commits in this PR we expect to revise one of the SSHTransfer structure's methods to output an error message, and will want to use the Get() method of the "tr" package's global Tr variable. We are not able to do this with the current name of the "tr" receiver variable as it masks the "tr" package name within the method's scope. Therefore we now rename all our "tr" receiver variables to "st" in the ssh/connection.go source file, so as to avoid any namespace conflicts with our "tr" package.
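
The naming conflict is a general property of Go scoping and can be reproduced with any package. In the sketch below the standard "strings" package stands in for our "tr" package, and the transfer type and label() method are hypothetical:

```go
package main

import (
	"fmt"
	"strings"
)

type transfer struct{ name string }

// If the receiver were named "strings", it would shadow the package name
// inside the method, and the call below would be resolved against the
// *transfer value rather than the package, so it would not compile:
//
//	func (strings *transfer) label() string {
//		return strings.ToUpper(strings.name)
//	}
//
// Giving the receiver a different name keeps the package accessible.
func (st *transfer) label() string {
	return strings.ToUpper(st.name)
}

func main() {
	fmt.Println((&transfer{name: "ssh"}).label())
}
```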

In commit 44b8801 of PR git-lfs#5634 we revised the Connection() method of the SSHTransfer structure in our "ssh" package to create SSH sessions on demand, rather than simply return pre-existing sessions created by the setConnectionCount() method, which allowed us to reduce the total number of SSH sessions in many cases. As part of this change, we altered the Connection() method to return an error if it could not create a session successfully, and we then updated the callers of this method to check for a non-nil error. However, the method still returned a nil error value, along with a nil PktlineConnection value, if the requested session ID exceeded the maximum number of sessions permitted and so no additional session was created or returned. Yet the method's callers all now assume a nil error value implies a non-nil PktlineConnection value. In particular, the batchInternal() method of the SSHBatchClient structure in the "tq" package originally checked for a nil return value from the Connection() method, as this was the only way the Connection() method indicated a failure to find or create an active SSH session. Since the change in commit 44b8801, the batchInternal() method checks only for a non-nil error return value, which means it does not detect the case where a nil PktlineConnection value is returned. As a result, in situations such as those described in git-lfs#5880, the batchInternal() method may cause a Go panic when it attempts to call the Lock() method of a nil PktlineConnection value. One way this can occur is after one batch transfer operation has finished and the Wait() method of the TransferQueue structure has called the Shutdown() method of the sshTransfer field of the TransferQueue's concreteManifest. That closes all the SSH sessions and sets the "conn" slice to nil. 
But a reference to the same SSHTransfer structure is retained by the sshTransfer field of the SSHBatchClient structure to which the concreteManifest's batchClientAdapter field points. When that manifest is reused by a subsequent batch transfer operation, the batchInternal() method of the SSHBatchClient structure attempts to retrieve a session, receives a nil PktlineConnection value, and causes a panic by trying to use it to run the Lock() method. We therefore adjust the Connection() method of the SSHTransfer structure to return a non-nil error when the requested session ID exceeds the maximum number of sessions allowed. This also covers the case where the "conn" slice has been set to nil, as the len() built-in function returns zero for a nil slice.
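
The essence of the fix can be shown with a stripped-down Go sketch. The session and transfer types and their methods below are hypothetical simplifications; in particular, the real Connection() method also creates sessions on demand, which is omitted here. The point is that a single bounds check against len() covers both an out-of-range session ID and a set of sessions that has already been shut down.

```go
package main

import "fmt"

// session is a stand-in for an established SSH session.
type session struct{ id int }

// transfer is a stand-in for the structure that tracks the sessions.
type transfer struct {
	conn []*session
}

// connection returns the session with ID n, or a non-nil error if no such
// session is available. Because len() of a nil slice is zero, the same
// check also rejects every request made after shutdown().
func (st *transfer) connection(n int) (*session, error) {
	if n < 0 || n >= len(st.conn) {
		return nil, fmt.Errorf("session %d unavailable (%d sessions permitted)", n, len(st.conn))
	}
	return st.conn[n], nil
}

// shutdown discards all sessions.
func (st *transfer) shutdown() {
	st.conn = nil
}

func main() {
	st := &transfer{conn: []*session{{id: 0}, {id: 1}}}
	if c, err := st.connection(1); err == nil {
		fmt.Println("got session", c.id)
	}
	st.shutdown()
	if _, err := st.connection(0); err != nil {
		fmt.Println("error:", err)
	}
}
```

A caller that checks only the error value, as batchInternal() does, is then never handed a nil session without also being told that something went wrong.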

In commit 691de51 of PR git-lfs#4446 we added the "batch transfers with ssh endpoint (git-lfs-transfer)" test to our t/t-batch-transfer.sh test script in order to validate that the new SSH object transfer protocol for Git LFS was operating as expected. This test only creates a single test object, and so does not cause the Git LFS client to establish multiple SSH sessions during the batch object transfer phase. While the sole SSH session started during this phase may create a control socket to multiplex its connection, it will not be shared by any other SSH sessions, since only one is started. In practice, this means that even if the "lfs.ssh.autoMultiplex" configuration option is set to "true" and the available SSH client is considered to be compatible with OpenSSH, the "batch transfers with ssh endpoint (git-lfs-transfer)" test will never try to start an SSH session with a value for the ControlMaster argument other than "yes". Our test suite sets the GIT_SSH environment variable to refer to our lfs-ssh-echo test utility program rather than an actual SSH client. As noted in git-lfs#5903, at present the Git LFS client treats this program, which it does not recognize as matching the filename of any known SSH client, as compatible with OpenSSH, and so our tests invoke our lfs-ssh-echo utility with the ControlMaster and ControlPath arguments. As described in a prior commit in this PR, our lfs-ssh-echo test utility program had a bug which prevented it from starting SSH sessions with the ControlMaster argument set to "no", so the "batch transfers with ssh endpoint (git-lfs-transfer)" test would have failed if it tried to push more than a single object. Such a test would also fail if it tried to fetch more than a single object, due to the bug described in git-lfs#5880. 
Specifically, we returned a nil from the Connection() method of the SSHTransfer structure in our "ssh" package when we had terminated all the SSH sessions, but then sometimes tried to reference that nil pointer in the batchInternal() method of the SSHBatchClient structure in the "tq" package, causing a Go panic condition. In prior commits in this PR we resolved these problems, first the bug in the lfs-ssh-echo test utility and then the bad return value from the Connection() method of the SSHTransfer structure in the Git LFS client. In the case of the latter issue, we adjusted the Connection() method to return a non-nil error when the requested session ID exceeds the maximum number of sessions permitted, which also covers the case where all the sessions have already been terminated. Therefore we can now introduce additional tests of the SSH object transfer protocol which push and fetch multiple objects. In these tests we use the new assert_remote_object() function that we defined in a prior commit in this PR to confirm that the pushed objects are all written to the remote repository. As well, we use the new assert_ssh_transfer_sessions() function we defined in another prior commit in this PR to check that the number of SSH sessions that create a control socket matches our expectations, as do the number of startup, success, and termination trace log messages. The first of our new tests leaves the maximum number of concurrent object transfers set to the default value of eight, so that all three of the objects we create in the test are pushed in a single batch, and may also be fetched in a single batch if the available version of Git is 2.11.0 or higher. To push or fetch these objects in a single batch requires that the Git LFS client establish as many as three separate SSH sessions per invocation, of which only the first should create a control socket. 
The second of our new tests sets the maximum number of concurrent object transfers to two, so we expect to see a maximum of only two SSH sessions per invocation of the Git LFS client, and only the first of these sessions should start an SSH connection with a control socket. Prior to version 2.11.0, Git did not support the "process" filter attribute, and so during a clone operation Git LFS would be invoked via the "smudge" filter instead, once for each object. When testing with such a version of Git, our assert_ssh_transfer_sessions() function adjusts its expectations to account for the fact that each invocation of Git LFS during a clone operation will establish its own SSH session with a control socket, and so the total number of trace log messages from these sessions should match the total number of objects being fetched.

larsxschneider

Stellar PR. A joy to read/review 🙇

larsxschneider · 2024-11-05T19:50:34Z

+
+  # On upload we currently spawn one extra control socket SSH connection
+  # to run locking commands and never shut it down cleanly, so our expected
+  # start counts are higher than our expected termination counts.


Would you consider it a bug that the locking command session never shuts down cleanly?

I assume we do not create the extra control socket if locking is disabled, right? (via lfs.locksverify=false)

> I assume we do not create the extra control socket if locking is disabled, right? (via lfs.locksverify=false)

That's correct, I believe.

> Would you consider it a bug that the locking command session never shuts down cleanly?

It's not ideal, but it's not causing any harm, so far as I know. I wrote up this issue in a bit more detail at the end of the description of #5880:

> Ultimately, we may want to avoid trying to shut down our SSH connections at the end of individual batch operations, and do some analysis of the preferred lifetime for them. (Among other issues, when pushing, we create two separate ControlMaster=yes connections, one for locking operations and another for batch operations; it would be nice to avoid this duplication.)
>
> Shutting down SSH connections cleanly at the end of the entire Git LFS process may also entail some refactoring of how we cause the program to exit. Right now, we make a token effort at closing resources associated with our API clients, but really that just amounts to closing our logger, and when we call os.Exit() in several dozen places, that step is skipped and we let Go take care of closing open I/O handles. If we want to always gracefully close multiple open SSH connections by sending a quit message, though, we may have to refactor these os.Exit() calls into a common exit routine.

It's also why I drew myself a class diagram in #5880 (comment).

To manage the lifetime of all the SSH sessions, including the one used for lock verification, we may have to do some more substantial revisions throughout the code in our commands package to replace all the os.Exit() calls with another mechanism. At any rate, it seemed out of scope for this PR.

larsxschneider · 2024-11-05T19:55:08Z


+  # On Windows we do not multiplex SSH connections by default, so we
+  # enforce their use in order to match other platforms' connection counts.
+  git config --global lfs.ssh.autoMultiplex true


Why is automultiplex disabled on Windows?

The GetExeAndArgs() function in our ssh package effectively sets the default value of lfs.ssh.autoMultiplex to false rather than true on Windows, because Windows SSH clients (of which there are a few) don't always support multiplexing SSH sessions over a single connection, and may fail if they see an unrecognized control socket option.

There was a discussion of this concern in #5537 (comment), which led to the changes in PR #5537 that set the default to false on Windows. The conclusion at that time was to just let Windows users who had OpenSSH installed (via Git for Windows, for instance) set the configuration option to true manually if they wanted to use multiplexing.

I did note, while working on this PR, that the lfs.ssh.autoMultiplex section in our git-lfs-config(5) manual page doesn't explain this difference in the default value on Windows. I have some work-in-progress documentation updates and I put an edit to the manual page into that set of changes, but since you've reminded me again about the issue, I'll just put that commit into this PR.

larsxschneider · 2024-11-05T20:14:48Z

+  if [ "download" = "$direction" ]; then
+    gitversion="$(git version | cut -d" " -f3)"
+    set +e
+    compare_version "$gitversion" '2.11.0'


Git 2.11 is a few years old. I wonder if we should require a minimum Git version for Git LFS to get rid of a few of those version checks?

That's a good question, and it's probably worth exploring the idea. I think we'd want to announce any such plans in an issue, proposing a future release as the first one in which we'd set a new minimum Git version, and collect input from users as to what that minimum should be. Some folks are likely to be running Git LFS in systems with fairly old Git clients, especially if they have one of the older OS platforms we still support.

Right now, we claim in our README that Git v1.8.2 or v1.8.5 is the minimum, although Git v2.0.0 is the earliest version we build for use in our CI jobs.

For now and for the near-term release we have planned, though, I don't think we should change anything about the Git versions we try to support. (As one example of how our project is expected to work with quite old software, when we overlooked the fact that upgrading to Go 1.21 in our 3.5.x releases meant that we dropped support for Windows 7 and 8, that caused problems for the Git for Windows project maintainers.)

PR #5921 will at least update the README.md file to match the minimal Git version we use in our CI jobs.

As suggested by larsxschneider on PR review, we should use "clone.log" as the name of the log files we capture from "git clone" commands in our tests of the SSH object transfer protocol, rather than "trace.log", so we adjust those file names now. This change aligns these log files' names with the "push.log" files we create in the same tests, and also with the "clone.log" files created by the clone_repo() function in our t/testhelpers.sh shell library.

Since commit 448b0c4 in PR git-lfs#5537 the GetExeAndArgs() function in our "ssh" package sets the default value of our "lfs.ssh.autoMultiplex" configuration option to "false" when running on Windows, and "true" otherwise. This choice was made because the SSH clients available on Windows may not support multiplexing SSH sessions over a single connection, as OpenSSH does with its ControlMaster and ControlPath options. Since some of these SSH clients may fail if they are passed the ControlMaster and ControlPath options, we require Windows users who want to use SSH multiplexing to explicitly enable it by setting the "lfs.ssh.autoMultiplex" option to "true".

See also the discussion in: git-lfs#5537 (comment)

However, our git-lfs-config(5) manual page was not updated in PR git-lfs#5537 to reflect the change in the default value of the "lfs.ssh.autoMultiplex" option on Windows, so we update it now. Note that users with the Git for Windows project installed will typically have a version of OpenSSH available which supports the ControlMaster option. However, the OpenSSH for Windows client may not support multiplexing, as noted in PowerShell/Win32-OpenSSH#1328.
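
The platform-dependent default can be expressed in a few lines of Go. The sketch below is an assumption about the shape of the logic, not the code from GetExeAndArgs(); the function names defaultAutoMultiplex() and autoMultiplex() are hypothetical, and an explicitly configured value is modelled as an optional boolean.

```go
package main

import (
	"fmt"
	"runtime"
)

// defaultAutoMultiplex returns the default for lfs.ssh.autoMultiplex on
// the given operating system: "false" on Windows, "true" elsewhere.
func defaultAutoMultiplex(goos string) bool {
	return goos != "windows"
}

// autoMultiplex resolves the effective setting. An explicitly configured
// value (non-nil) always overrides the platform default.
func autoMultiplex(goos string, configured *bool) bool {
	if configured != nil {
		return *configured
	}
	return defaultAutoMultiplex(goos)
}

func main() {
	fmt.Println("default on this platform:", autoMultiplex(runtime.GOOS, nil))

	// A Windows user opting in, as with
	// "git config --global lfs.ssh.autoMultiplex true".
	enabled := true
	fmt.Println("Windows with explicit opt-in:", autoMultiplex("windows", &enabled))
}
```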

In commit 5e654f2 in PR git-lfs#565 a pair of test assertion functions were added to the forerunner of our current t/testhelpers.sh shell library. These assert_local_object() and refute_local_object() functions check for the presence or absence of a file in the object cache maintained by the Git LFS client in a local repository.

To perform these checks, the functions capture the output of the "git lfs env" command and parse the contents of the LocalMediaDir line, which reports the full path to the Git LFS object cache location. To retrieve the path, the functions ignore the first 14 characters of the line, as that corresponds to the length of the LocalMediaDir field name (13 characters) plus one character in order to account for the equals sign which follows the field name.

Later PRs have added three other assertion functions that follow the same design. The delete_local_object() function was added in commit 97434fe of PR git-lfs#742 to help test the "git lfs fetch" command's --prune option, the corrupt_local_object() function was added in commit 4b0f50e of PR git-lfs#2082 to help test the detection of corrupted local objects during push operations, and most recently, the assert_remote_object() function was added in commit 9bae8eb of PR git-lfs#5905 to improve our tests of the SSH object transfer protocol for Git LFS.

All of these functions retrieve the object cache location by ignoring the first 14 characters from the LocalMediaDir line in the output of the "git lfs env" command. However, the refute_local_object() function contains a hint of an alternative approach to parsing this line's data. A local "regex" variable is defined in the refute_local_object() function, which matches the LocalMediaDir field name and equals sign and captures the subsequent object cache path value. Although this "regex" variable was included when the function was first introduced, it has never been used, and does not appear in any of the other similar functions.

While reviewing PR git-lfs#5905, larsxschneider suggested an even simpler option than using a regular expression to extract the object cache path from the LocalMediaDir line. Rather than asking the Bash shell to start its parameter expansion at a fixed offset of 14 characters into the string, we can define a pattern which matches the leading LocalMediaDir field name and equals sign and specify that the shell should remove that portion of the string during parameter expansion. See also the discussion in this review comment from PR git-lfs#5905: git-lfs#5905 (comment)

In addition to these changes, we can remove the definition of the "regex" variable from the refute_local_object() function, as it remains unused.

Co-authored-by: Lars Schneider <larsxschneider@github.com>
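The two Bash parameter expansions discussed in this commit description can be compared with a minimal sketch. The sample line and the variable names here are assumptions made for illustration only; the actual helper functions in t/testhelpers.sh read the line from the output of "git lfs env" and may be written differently.

```shell
#!/usr/bin/env bash
# Hypothetical sample of the LocalMediaDir line from "git lfs env" output.
env_line="LocalMediaDir=/tmp/repo/.git/lfs/objects"

# Fixed-offset approach: skip 14 characters
# (13 for the field name plus 1 for the equals sign).
by_offset="${env_line:14}"

# Pattern approach: remove the shortest leading match of "LocalMediaDir=".
by_pattern="${env_line#LocalMediaDir=}"

echo "$by_offset"
echo "$by_pattern"
```

Both expansions yield the same path for this input, but the pattern form states which prefix is being removed rather than relying on a character count.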

When we introduced support for the "pure" SSH-based Git LFS object transfer protocol in PR git-lfs#4446, we designed the SSHTransfer structure in our "ssh" package to be the basic abstraction with which we represent and manage SSH processes and connections. Then in commit 31d3fb7 of the same PR we added an SSHTransfer() method to the Client structure of our "lfsapi" package in order to provide a common interface through which we could initiate and share SSH sessions for multiple purposes. As we wrote in the description of that commit:

  The lfsapi client is used to perform operations for SSH authentication already, so we know we'll have one wherever SSH operations might be done. Let's move the instantiation to the client so that we can reuse it both in the transfer queue code and in the locking code as well.

Using this SSHTransfer() method of the Client structure does make consistent all the instances in which we instantiate and initialize new SSH connections for the "pure" SSH-based Git LFS object transfer and locking protocols.

Note that the SSHTransfer() method may return a nil value rather than a fully initialized SSH session, not only in the case of an error condition, but also when the remote service does not provide support for the "pure" SSH-based Git LFS protocols. In this case, when the request to execute the "git-lfs-transfer" command fails, the Git LFS client will fall back to attempting to execute the "git-lfs-authenticate" command over SSH. That step involves a different abstraction, though, namely the SSHResolver type in our "lfshttp" package.

While our implementation of the Client structure's SSHTransfer() method allows us to establish a new SSH session when the "pure" SSH-based Git LFS protocols are supported, we neglected to account for the fact that the method creates a new set of sessions each time it is called. The consequence is that when handling SSH-based object transfers as well as locking requests, we create an additional SSH process and connection just for the locking requests. Further, while we shut down the SSH connections we use for object transfers after the first transfer queue completes, we never do the same for the connections we create for locking requests.

In a subsequent commit in this PR we will address the latter issue, as well as the issue reported in git-lfs#6118, which arises because we shut down SSH connections at the end of each transfer queue, but also try to reuse them between queues. Thus when we are transferring objects over SSH for more than one Git reference, the second queue we start always fails.

To resolve this problem we will need to shut down SSH connections only when the Git LFS client is exiting, rather than when a single transfer queue completes. In turn, this implies we need to retain a listing of all the unique sets of SSH sessions we create during the lifetime of the Git LFS client.

We therefore refactor the SSHTransfer() method in our "lfsapi" package into two methods. The new initSSHTransfer() method performs the same actions as the SSHTransfer() method did before, and as such may return a nil value if the remote service does not support the "git-lfs-transfer" command. The SSHTransfer() method now maintains a map of the SSH sessions for which it has called the initSSHTransfer() method, and only calls that method if no entry exists yet for the given "operation" and "remote" parameters. Otherwise, it returns the value the initSSHTransfer() method previously returned for those parameters, even if that value is nil.

To maintain this map safely, the SSHTransfer() method first acquires a mutex, which we add to the Client structure along with the map. For keys, we use a concatenation of the unique pairs of "operation" and "remote" strings passed to the method as its parameters. This technique follows the existing approach used in our "commands" package, where we construct keys for the global "tqManifest" map by concatenating the "operation" and "remote" string parameters of the getTransferManifestOperationRemote() function.

One advantage of this change is that it also resolves the problem whereby we would previously create an extra SSH process and connection for each locking request, even if they were to be made to the same remote service as subsequent object transfer requests, and for the same type of operation (i.e., an upload or download operation).

Because we now reuse SSH connections for both object transfer and locking requests, we can simplify the assert_ssh_transfer_sessions() helper function we added to the "t/t-batch-transfer.sh" shell test script in commit b20b6e9 of PR git-lfs#5905. The assert_ssh_transfer_sessions() helper function, which checks the number of trace log messages that record the start of an SSH process or its termination, no longer needs to make special provision for the fact that certain operations would cause an extra SSH process to be started just to make a locking API request, and that this extra SSH process would never be terminated.

In commit 693c6f8 of PR #5905 we updated the trace log messages output by the Git LFS client when it starts or stops SSH connections for the "pure" SSH-based Git LFS object transfer protocol, and in particular we revised several of the messages output by the setConnectionCount() method of the SSHTransfer structure in our "ssh" package to more clearly identify SSH connections by their internal session ID.

In the case of the "terminating pure SSH connection" message, we made this change correctly, by adding a session ID to the message which we calculated using the current loop index variable "i" plus the variable "tn", which gives the initial offset into the slice of SSH connections over which the loop iterates. However, in the case of the "skipping uninitialized lazy pure SSH connection" message, we accidentally omitted the "tn" variable's offset from the value we interpolate into the message as the session ID, with the result that this message's session IDs are not aligned with those of the "terminating pure SSH connection" messages.

To fix this oversight we simply make sure to include the slice offset from the "tn" variable in the session IDs we calculate when generating "skipping uninitialized lazy pure SSH connection" trace log messages.
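The effect of omitting the offset can be illustrated with a small sketch. The real code is Go in our "ssh" package; this Bash version only mirrors the index arithmetic, and the array contents and variable names other than "i" and "tn" are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical list of five connections; the loop visits only those
# starting at offset tn, as when iterating over the tail of a slice.
conns=(c0 c1 c2 c3 c4)
tn=2

tail_conns=("${conns[@]:tn}")
wrong_ids=()
right_ids=()
for i in "${!tail_conns[@]}"; do
  wrong_ids+=("$i")            # loop index alone restarts at zero
  right_ids+=("$((i + tn))")   # loop index plus offset matches the original position
done

echo "without offset: ${wrong_ids[*]}"
echo "with offset:    ${right_ids[*]}"
```

Only the second form produces IDs that line up with the positions of the connections in the full list, which is why both trace messages need to add "tn".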

First in PR #4446 and then later in PR #5905 we added tests to our "t/t-batch-transfer.sh" shell script to verify the behaviour of the Git LFS client when it uploads and downloads objects using the "pure" SSH-based Git LFS object transfer protocol.

Each of these tests performs a "git push" command followed by a "git clone" command, checks the trace log output of those commands to ensure the expected number of SSH connections were started, and also checks that the commands have transferred all the objects created by the tests.

However, at least in the case of the "git clone" commands, the tests do not actually confirm that the exit codes of the commands are zero (which indicates success). We set the "errexit" shell attribute in each test, but this only causes the tests' subshells to exit if the final command in a pipeline fails, and we perform the "git clone" commands in pipelines where the final tee(1) command always succeeds.

We therefore now add checks of the first element of the PIPESTATUS shell array variable after each "git clone" command to verify that these commands have succeeded and returned exit statuses of zero.

We also add the same checks after each "git push" command in our tests of the SSH-based Git LFS object transfer protocol, and revise the tests to run the "git push" commands in pipelines where a final "tee" command captures the trace messages output by the commands into a log file.

The existing design was sufficient to confirm that the "git push" commands succeeded, since the "errexit" shell attribute would cause the tests to fail if the commands returned a non-zero exit code. Should this occur, though, the trace messages generated by the commands would be lost, because we redirected the commands' standard output into a log file.

To aid future debugging efforts we therefore now adopt the same idiom we use elsewhere in our shell test suite. In each of these tests we pipe the output of the "git push" command into a "tee" command, and then check the first element of the PIPESTATUS array variable to confirm that the "git push" command succeeded. This change ensures that if a "git push" command does fail then the end_test() function which follows the test will report the full set of trace log messages from the command.
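The reason a PIPESTATUS check is needed can be shown with a minimal sketch. The fake_push() function is a hypothetical stand-in for a failing "git push" command, and the log file name is an assumption; the tests themselves run real Git commands.

```shell
#!/usr/bin/env bash
# Hypothetical stand-in for a failing "git push": prints a trace line, then fails.
fake_push() {
  echo "trace: pushing objects"
  return 1
}

log="$(mktemp)"

# The exit status of the pipeline as a whole is that of tee, which succeeds,
# so "errexit" alone would not notice the failure of the first command.
fake_push 2>&1 | tee "$log" >/dev/null

# PIPESTATUS must be copied immediately, before another command overwrites it.
statuses=("${PIPESTATUS[@]}")

echo "first command: ${statuses[0]}, tee: ${statuses[1]}"
```

A test would typically compare the first element against zero and exit with an error if they differ, while the log file still holds the command's full output for later inspection.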

In commit a84cd13 of PR #5905 we added a pair of tests to our "t/t-batch-transfer.sh" shell test script, both of which verify that the Git LFS client is able to successfully upload and download multiple objects using the "pure" SSH-based Git LFS object transfer protocol. These tests were intended to demonstrate that we had resolved the bug reported in #5880, where the Git LFS client would crash when trying to transfer multiple objects using the SSH-based transfer protocol.

The first test, named "batch transfers with ssh endpoint and multiple objects (git-lfs-transfer)", creates three objects and then checks that when pushing and fetching the objects, the Git LFS client uses three separate SSH processes to do so. Note that because we set the "lfs.ssh.autoMultiplex" configuration option to "true", the test expects that these SSH processes will share a single multiplexed connection.

The second test, which we named "batch transfers with ssh endpoint and multiple objects and batches (git-lfs-transfer)", also creates three objects and then pushes and fetches them, but only after setting the "lfs.concurrentTransfers" configuration option to a value smaller than the number of objects to be transferred. This forces the Git LFS client to use only two separate SSH processes to transfer the objects, so that the transfer of the third object is delayed until the first two objects have been successfully transferred.

Although the second test does verify this behaviour of the Git LFS client, the test's name is not accurate, nor is one of its internal code comments, because these both state that the objects are transferred in multiple batches. While the transfer queue does delay the transfer of one object until the other two have been transferred, this all still occurs within a single batch of objects.

We therefore now update the name of the test to "batch transfers with ssh endpoint and multiple objects exceeding workers (git-lfs-transfer)", which more accurately reflects the conditions created by the test. We also rename the test's repository to align with the new test name, and revise the incorrect code comment to better explain the effect of setting the "lfs.concurrentTransfers" configuration option.

Note that the remainder of this commit description is provided here solely for future reference, and to clarify and correct a few details from the commit descriptions in PR #5905.

The reason we originally developed the "batch transfers with ssh endpoint and multiple objects and batches (git-lfs-transfer)" test was to help reproduce the bug reported in #5880, in which the Git LFS client would panic and crash when it tried to use an SSH process it had previously closed to transfer an object. Hence the test tries to force the client to reuse an SSH process to transfer at least one object, unlike what occurs with the preceding "batch transfers with ssh endpoint and multiple objects (git-lfs-transfer)" test. In that test, each of the three objects will be transferred using a separate SSH process, since the number of objects is fewer than eight, which is the default value of the "lfs.concurrentTransfers" option.

In practice, both of these tests could reproduce the crash, absent some of the other changes we made in PR #5905. Most obviously, in commit eca6c8a we updated the client to avoid attempting to use a previously closed SSH process. This stopped the panic and crash from occurring, so the tests would of course succeed once that change was made.

However, prior to making that correction, we first resolved a bug in our "lfs-ssh-echo" test helper utility, and this on its own was sufficient to allow our new tests to pass, although we did not intend for that to be the case. Specifically, in commit 9af2883 of the same PR #5905, we adjusted the "lfs-ssh-echo" utility so that it set defined read/write permissions on the temporary file it uses to simulate the behaviour of OpenSSH when the ControlMaster argument is supplied. This allowed subsequent invocations of the utility to reopen the file, whereas previously they would fail with a non-zero exit code.

As it happens, though, the two tests we introduced in commit a84cd13 depended on the "lfs-ssh-echo" utility failing unexpectedly in order to fully reproduce the crash bug from #5880. The crash would occur when the Git LFS client attempted to reuse an SSH process it had previously closed, which was only possible when the client had shut down one transfer queue and then started another. The crash did not occur if just a single queue was used, even if the queue transferred objects in multiple batches or even in sequence within a batch in order to respect the maximum transfer concurrency limit. Thus the values of the "lfs.transfer.batchSize" and "lfs.concurrentTransfers" configuration options had no effect on whether the crash occurred or not. Instead, to reproduce the crash, the client needed to be induced to start a second transfer queue after shutting down the initial one.

As they were originally written, both of the tests we added in commit a84cd13 implicitly expected a second queue to be started after the first queue experienced the failure of an SSH process. The tests therefore effectively relied on the bug in the "lfs-ssh-echo" command to cause this to occur, but because we fixed that bug before adding the tests, they could never actually reproduce the issue as we intended.

Both tests create and then clone a repository containing several Git LFS objects. When cloning, the "git lfs filter-process" command is invoked by Git and asked to apply its "smudge" filter to each object. Git sends these requests via its long-running filter protocol, and indicates that the filter process may delay its response, so the "git lfs filter-process" command replies with "status=delayed" messages and then enqueues the objects for download.

To transfer the first object, the client starts an initial SSH process with the ControlMaster argument set to "yes". For the second object, since the "lfs.ssh.autoMultiplex" configuration option is enabled, the client starts another SSH process but sets its ControlMaster argument to "no", with the expectation that both processes will share the same underlying SSH connection.

Our shell test suite uses our "lfs-ssh-echo" utility in place of an actual SSH program like OpenSSH, and the bug in the "lfs-ssh-echo" utility caused it to exit with an error status code when it was run with the ControlMaster argument set to "no". When the bug was present, then, the second object transfer in the client's initial queue would fail, and the queue would treat this error as one for which the transfer should not be retried.

As a consequence, when Git sent an initial "list_available_blobs" request, per the long-running filter protocol, the Git LFS client would reply with only the file path corresponding to the first object. Git would then retrieve the "smudged" content of that file via another "smudge" filter protocol request (this time with the "can-delay" option not permitted, though), after which it would make another request to list the available blobs.

At this point, the first transfer queue would have been shut down, because when the "git lfs filter-process" command receives the first "list_available_blobs" request from Git, it calls the Wait() method of the TransferQueue structure in our "tq" package. This method closes various channels and then calls the Shutdown() method of the SSHTransfer structure in our "ssh" package, which terminates all of the running SSH processes.

When our "git lfs filter-process" command responds to the second "list_available_blobs" request from Git, we might expect that it would reply with an error status for the second object, since the queue failed to download it and determined that further transfers would not be attempted. However, since commit e764429 of PR #2511, when we implemented support for the delay feature of Git's long-running filter protocol, this is not how the Git LFS client behaves under these conditions.

Since the transfer queue has been closed down, when the filterCommand() function in our "commands" package handles the second "list_available_blobs" request from Git, it finds that no file paths can be read from the queue's "available" channel. However, it then iterates through the list of outstanding file paths for which no content data has been returned to Git, and sends each of those file paths back in its response. (Specifically, the function iterates over the "ptrs" map and replies to Git with all of the map's keys. These keys are the file paths Git sent in "smudge" requests with the "can-delay" option enabled, and for which no content data has been sent back to Git.)

Because Git receives the file path corresponding to the second object in reply to its second "list_available_blobs" request, Git now requests that object's "smudged" content, and does so with the "can-delay" option disabled. When the "git lfs filter-process" command receives this request, its filterCommand() function invokes the smudge() function for the single file's path, which in turn calls the Smudge() method of the GitFilter structure in our "lfs" package. That method runs the same structure's downloadFile() method, and that calls the "tq" package's NewTransferQueue() function to start a dedicated queue to download just the one object.

The bug we fixed in commit eca6c8a would now cause the client to panic and crash. Note, though, that our description in that commit suggests that the crash would occur between batches in a single transfer queue, which is not accurate. As explained above, for the crash to occur, a first transfer queue must have been fully shut down, and then another queue started.

The new TransferQueue structure would be initialized with the same concreteManifest structure as was populated for the original queue. The batchClientAdapter field of this concreteManifest structure pointed to an extant SSHBatchClient structure, and the "transfer" field in that structure in turn pointed to the SSHTransfer structure that was used to start and stop the SSH connections and processes for the initial transfer queue. When the SSHBatchClient structure's batchInternal() method would attempt to retrieve a session, it would receive a "nil" value from the SSHTransfer structure's Connection() method and then try to dereference it, causing a panic.

All of which explains why, when we fixed the bug in the "lfs-ssh-echo" utility, the tests we introduced in PR #5905 were no longer effective at actually simulating the conditions under which the Git LFS client would panic and crash.

In commit aa08c37 of PR #6241 we changed the default value of the "lfs.concurrentTransfers" configuration option from eight to a dynamic limit based on the number of CPUs in the current system.

This change required that we also update a number of the tests in our shell test suite, as they previously were written with the value of eight hard-coded as the expected default value of the "lfs.concurrentTransfers" option. We therefore added a new setup_expected_concurrent_transfers() helper function to our "t/testhelpers.sh" shell library, and revised several test scripts to invoke this function before executing any tests. The function sets a global "expectedConcurrentTransfers" variable with the expected default value of the "lfs.concurrentTransfers" option so that tests can now refer to this variable instead of using a fixed hard-coded value as they did before.

However, when making these revisions, we inadvertently overlooked the tests of the "pure" SSH-only Git LFS transfer protocol in our "t/t-batch-transfer.sh" script. At present, these tests also contain a hard-coded value for the expected maximum number of object transfers which may be performed concurrently. The tests pass this value to the assert_ssh_transfer_sessions() helper function, which is defined in the same script as the tests.

One reason we did not adjust these tests in PR #6241 is that the assert_ssh_transfer_sessions() function's fourth parameter, which is expected to receive a value representing the maximum number of concurrent transfers, is misnamed as "objs_per_batch".

We chose the parameter name "objs_per_batch" when we first introduced the function in commit b20b6e9 of PR #5905. In the same PR, we then added tests of the SSH-based object transfer protocol, including the "batch transfers with ssh endpoint and multiple objects and batches (git-lfs-transfer)" test in which we explicitly set the "lfs.concurrentTransfers" option to a value of two.

As we explained in a prior commit in this PR, that test was also misnamed, as it effectively forced the Git LFS client to only transfer two objects at a time, but did not alter the batch size or the number of batches the queue processed. Hence we have now renamed the test to "batch transfers with ssh endpoint and multiple objects exceeding workers (git-lfs-transfer)", which more accurately reflects the conditions established by the test.

The assert_ssh_transfer_sessions() function's fourth parameter, which is currently named "objs_per_batch", is used to calculate the maximum number of trace log messages the function expects to find which record the start of an SSH process or its termination. Since this number is actually determined by the value of the "lfs.concurrentTransfers" option, we now rename the parameter to "max_concurrency" to better reflect its purpose.

As well, in the tests where we call the assert_ssh_transfer_sessions() function and pass a hard-coded value of eight for the fourth parameter, we now instead pass the value of the "expectedConcurrentTransfers" variable. This will allow the tests to continue to pass regardless of how we choose to define the "lfs.concurrentTransfers" option's default value in the future, as the setup_expected_concurrent_transfers() function should always return the appropriate value.

Note that for the reasons outlined in PR #6258, we plan to temporarily reverse the principal changes from PR #6241. However, we also intend to eventually re-adopt the use of a default concurrency limit that scales with the number of CPUs, at least for HTTP-based object transfers. Hence we still retain the setup_expected_concurrent_transfers() function in PR #6258 and can rely on it being present in the future.

chrisd8088 added 10 commits November 4, 2024 14:15

chrisd8088 requested a review from a team as a code owner November 4, 2024 23:50

larsxschneider approved these changes Nov 5, 2024

chrisd8088 added 2 commits November 5, 2024 20:54

chrisd8088 merged commit 9d69005 into git-lfs:main Nov 6, 2024

chrisd8088 deleted the ssh-batch-multi-fix branch November 6, 2024 06:18

chrisd8088 mentioned this pull request Nov 18, 2024

Fix improper negated test expressions and refine TLS client certificate tests #5914

Merged

chrisd8088 mentioned this pull request Jul 10, 2025

git push for lfs repo fails with nil pointer dereference #6075

Closed

chrisd8088 mentioned this pull request Sep 19, 2025

SIGSEGV: segmentation violation code=0x1 addr=0x20 pc=0x851bdc #6117

Closed

chrisd8088 merged 12 commits into git-lfs:main from chrisd8088:ssh-batch-multi-fix
