Skip to content

Consider options for scaling concurrency of all transfer agents #6259

@chrisd8088

Description

@chrisd8088

As discussed in PRs #6241, #6234, and #6258 we would like to refine the concurrency controls for the Git LFS client so that they can scale with the number of available CPUs, at least when performing HTTP-based object transfers.

We adopted an initial implementation of in PR #6241, which was effective for HTTP object transfers and improved performance significantly on multi-core CPUs.

However, as described in PR #6258, this change also had the unintended effect of scaling the number of child processes we spawn when handling object transfers using a custom agent or when using multiplexed SSH connections. While some systems may not be affected, creating large numbers of separate processes is at best not ideal and at worst may run afoul of system or container limits. We therefore temporarily reversed the key change from PR #6241 in PR #6258.

After our v3.8.0 release, we should consider the best approach for a comprehensive solution that safely scales thread counts while keeping process counts within reasonable bounds.

SSH Transfer Agents

When using the "pure" SSH-only Git LFS object transfer protocol, we may run multiple instances of the ssh(1) command, unless SSH multiplexing is not enabled. (Note that we may also use another SSH agent, depending on what has been configured via the core.sshCommand option and the GIT_SSH or GIT_SSH_COMMAND environment variables.)

When multiplexing is enabled, we set the ssh(1) command's ControlMaster option so that all the separate processes we start will share a single underlying SSH connection. We also only start SSH agent processes on demand in this case. When transferring many objects, though, we will start multiple processes, up to the limit specified by the lfs.concurrentTransfers configuration option.

HTTP Transfer Adapter

When using our standard HTTP-based transfer adapter, we may start multiple threads in situations where we do not need them all.

For instance, in cases where the git lfs smudge command is used instead of the more-efficient git lfs filter-process command, each invocation of the git lfs smudge command will transfer only a single object, but will nevertheless start the number of threads specified by the lfs.concurrentTransfers option.

Typical modern installations of Git LFS do not use the git lfs smudge command, but if the filter.lfs.process Git configuration option is missing, or if a version of Git older than v2.11.0 is in use, then the git lfs smudge command may still be utilized.

Even when the git lfs filter-process command is used, when just a small number of objects need to be transferred, we will still start the number of threads specified by the lfs.concurrentTransfers option.

Custom Transfer Agents

When using our custom transfer protocol we always start multiple separate instances of the transfer agent program, which may either be our git lfs standalone-file command or an agent configured with the lfs.customTransfer.* options. In the first case, we only start a single process, but otherwise we always start exactly as many processes as is specified by the lfs.concurrentTransfers option, unless the lfs.concurrentTransfers.<name>.concurrent option is set to false.

In a related vein, the "Existing Implementation Issues" section from #6119 (review) describes an inconsistency in our handling of concurrency controls for custom transfer agents, and so is worth repeated here again for additional context:

... the lfs.concurrentTransfers option specifies the number of custom adapter processes that we start (unless the lfs.customTransfer.<name>.concurrent flag is set to false, in which case we only start one process). We also start one internal goroutine (i.e., thread) to communicate with each process.

This all occurs within the Begin() method of the adapterbase structure where we loop as many times as is specified by the lfs.concurrentTransfers option, or the value with which we've overridden it. This loop calls the WorkerStarting() method of the customAdapter structure because we've set that structure's transferImpl element to point back to itself.

The WorkerStarting() method, in addition to starting one custom transfer agent process, then sends an initialization message to the process.

That all makes sense, but where I think we have an oversight is that the initialization message includes concurrent and concurrenttransfers elements, and we've documented that we include the latter specifically in case "the transfer process wants to implement its own concurrency and wants to respect this setting".

But there's no way for the process to "respect" the concurrenttransfers setting since we've already started that many separate processes, and each one just gets a copy of the same initialization message.

Arguably, the Git LFS client should either start a single process, pass it the concurrency settings, and let it figure out how many child processes to spawn, or start multiple processes per the concurrency settings but then not pass those settings on because there's no purpose to doing so.

However, we've had this design and implementation for many years, so we can't really reverse course now. However, perhaps we can find a way to improve the situation as part of the "version": 2.0 revisions to the custom transfer protocol.

Future Considerations

Ideally, we would address all of these issues by refactoring our HTTP and custom transfer adapters to start threads or processes only on demand, as the SSH transfer adapter already does, while also introducing new configuration options to better set limits on the number of transfer agent processes we spawn as distinct from the number of threads we start during HTTP object transfers.

Some work in this direction has already been developed in PR #6234, for HTTP transfers specifically.

We should write a proposal to discuss how to implement any new configuration options such that they respect the historical meaning of the lfs.concurrentTransfers option. As one consideration, if a user has explicitly set the lfs.concurrentTransfers option but has not set any of the new options, then the lfs.concurrentTransfers option's value should be understood as a maximum limit for the number of threads or processes we start during object transfers. On the other hand, if a user sets one of our new options but not the lfs.concurrentTransfers option, then we might allow its value to override the default value of the lfs.concurrentTransfers option.

/cc @joshfriend as the author of PRs #6241 and #6234.

Metadata

Metadata

Assignees

No one assigned

    Type

    No type
    No fields configured for issues without a type.

    Projects

    Status
    Ideas

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions