Skip to content

Initialize sessions lazily#5634

Merged
bk2204 merged 2 commits intogit-lfs:mainfrom
bk2204:lazy-session
Jan 30, 2024
Merged

Initialize sessions lazily#5634
bk2204 merged 2 commits intogit-lfs:mainfrom
bk2204:lazy-session

Conversation

@bk2204
Copy link
Member

@bk2204 bk2204 commented Jan 29, 2024

Right now, we spawn potentially several connections without necessarily needing to. For example, if we're transferring two objects, we can practically use at most three connections: one for the batch and one for each object.

Instead of spawning all the workers up front, let's change how we do things and simply pass in the worker number at first. Then, acquire the connection when we're uploading or downloading so that the connection is only spawned on demand. This means that we spawn fewer workers in case there are few objects.

/cc #5622

@bk2204 bk2204 marked this pull request as ready for review January 29, 2024 17:20
@bk2204 bk2204 requested a review from a team as a code owner January 29, 2024 17:20
Copy link
Member

@chrisd8088 chrisd8088 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, thank you! I had one suggestion only for an update to a comment.

bk2204 and others added 2 commits January 29, 2024 20:07
Right now, we spawn potentially several connections without necessarily
needing to.  For example, if we're transferring two objects, we can
practically use at most three connections: one for the batch and one for
each object.

To make this more efficient and avoid needless overhead, let's not
actually create the connection until we attempt to acquire it.  At that
point, if it doesn't exist, we'll spawn a new one (using the control
path socket if possible) and then start sending data.  Otherwise, unless
it's the initial connection, let's just stub it out until we actually
need it.

Note that we still create the connection before it's needed because we
create all workers up front, but that will change in a future commit now
that we have this change in place.

Co-authored-by: Chris Darroch <chris8088@github.com>
Right now, we spawn potentially several connections without necessarily
needing to.  For example, if we're transferring two objects, we can
practically use at most three connections: one for the batch and one for
each object.

Instead of spawning all the workers up front, let's change how we do
things and simply pass in the worker number at first.  Then, acquire the
connection when we're uploading or downloading so that the connection is
only spawned on demand.  This means that we spawn fewer workers in case
there are few objects.
@bk2204 bk2204 merged commit 2bf4567 into git-lfs:main Jan 30, 2024
@bk2204 bk2204 deleted the lazy-session branch January 30, 2024 14:00
chrisd8088 added a commit to chrisd8088/git-lfs that referenced this pull request Oct 22, 2024
In commit 44b8801 of PR git-lfs#5634 we
revised the Connection() method of the SSHTransfer structure in our
"ssh" package to create connections on demand, rather than simply
return pre-existing connections created by the setConnectionCount()
method, which allowed us to reduce the total number of SSH connections
in many cases.

As part of this change, we altered the Connection() method to return
an error if it could not create a connection successfully, and we then
updated the callers of this method to check for a non-nil error.

However, the method still returned nil if the requested connection number
exceeded the maximum number of connections permitted, and none of
the callers check for that condition.

In particular, the batchInternal() method of the SSHBatchClient structure
in the "tq" package originally checked for a nil return value from the
Connection() method, as this was the only way the Connection() method
indicated a failure to find an active SSH connection.  With the change in
commit 44b8801, the batchInternal() method
now checks only for a non-nil error return value, which means it does
not detect the case where a nil PktlineConnection value is returned.

As a result, in situations such as those described in git-lfs#5880, the
batchInternal() method may cause a Go panic when it attempts to call
the Lock() method of a nil PktlineConnection value.  One way this can
occur is after one batch transfer operation has finished and the
Wait() method of the TransferQueue structure has called the Shutdown()
method of the sshTransfer field of the TransferQueue's concreteManifest.
That closes all the SSH connections and sets the "conn" array to nil.
But a reference to the same SSHTransfer structure is retained by the
sshTransfer field of the SSHBatchClient structure to which the
concreteManifest's batchClientAdapter field points.  When that manifest
is reused by a subsequent batch transfer operation, the batchInternal()
method of the SSHBatchClient structure attempts to retrieve a connection,
receives a nil PktlineConnection value, and causes a panic by trying
to use it to run the Lock() method.

We therefore adjust the Connection() method of the SSHTransfer structure
to return a non-nil error when the requested connection number exceeds
the maximum number of permitted SSH connections.  This also covers the
case where the "conn" array has been set to nil, as the len() built-in
function returns zero for the length of a nil array.
chrisd8088 added a commit to chrisd8088/git-lfs that referenced this pull request Oct 23, 2024
In commit 44b8801 of PR git-lfs#5634 we
revised the Connection() method of the SSHTransfer structure in our
"ssh" package to create connections on demand, rather than simply
return pre-existing connections created by the setConnectionCount()
method, which allowed us to reduce the total number of SSH connections
in many cases.

As part of this change, we altered the Connection() method to return
an error if it could not create a connection successfully, and we then
updated the callers of this method to check for a non-nil error.

However, the method still returned nil if the requested connection number
exceeded the maximum number of connections permitted, and none of
the callers check for that condition.

In particular, the batchInternal() method of the SSHBatchClient structure
in the "tq" package originally checked for a nil return value from the
Connection() method, as this was the only way the Connection() method
indicated a failure to find an active SSH connection.  With the change in
commit 44b8801, the batchInternal() method
now checks only for a non-nil error return value, which means it does
not detect the case where a nil PktlineConnection value is returned.

As a result, in situations such as those described in git-lfs#5880, the
batchInternal() method may cause a Go panic when it attempts to call
the Lock() method of a nil PktlineConnection value.  One way this can
occur is after one batch transfer operation has finished and the
Wait() method of the TransferQueue structure has called the Shutdown()
method of the sshTransfer field of the TransferQueue's concreteManifest.
That closes all the SSH connections and sets the "conn" array to nil.
But a reference to the same SSHTransfer structure is retained by the
sshTransfer field of the SSHBatchClient structure to which the
concreteManifest's batchClientAdapter field points.  When that manifest
is reused by a subsequent batch transfer operation, the batchInternal()
method of the SSHBatchClient structure attempts to retrieve a connection,
receives a nil PktlineConnection value, and causes a panic by trying
to use it to run the Lock() method.

We therefore adjust the Connection() method of the SSHTransfer structure
to return a non-nil error when the requested connection number exceeds
the maximum number of permitted SSH connections.  This also covers the
case where the "conn" array has been set to nil, as the len() built-in
function returns zero for the length of a nil array.
chrisd8088 added a commit to chrisd8088/git-lfs that referenced this pull request Oct 29, 2024
In commit 44b8801 of PR git-lfs#5634 we
revised the Connection() method of the SSHTransfer structure in our
"ssh" package to create connections on demand, rather than simply
return pre-existing connections created by the setConnectionCount()
method, which allowed us to reduce the total number of SSH connections
in many cases.

As part of this change, we altered the Connection() method to return
an error if it could not create a connection successfully, and we then
updated the callers of this method to check for a non-nil error.

However, the method still returned nil if the requested connection number
exceeded the maximum number of connections permitted, and none of
the callers check for that condition.

In particular, the batchInternal() method of the SSHBatchClient structure
in the "tq" package originally checked for a nil return value from the
Connection() method, as this was the only way the Connection() method
indicated a failure to find an active SSH connection.  With the change in
commit 44b8801, the batchInternal() method
now checks only for a non-nil error return value, which means it does
not detect the case where a nil PktlineConnection value is returned.

As a result, in situations such as those described in git-lfs#5880, the
batchInternal() method may cause a Go panic when it attempts to call
the Lock() method of a nil PktlineConnection value.  One way this can
occur is after one batch transfer operation has finished and the
Wait() method of the TransferQueue structure has called the Shutdown()
method of the sshTransfer field of the TransferQueue's concreteManifest.
That closes all the SSH connections and sets the "conn" array to nil.
But a reference to the same SSHTransfer structure is retained by the
sshTransfer field of the SSHBatchClient structure to which the
concreteManifest's batchClientAdapter field points.  When that manifest
is reused by a subsequent batch transfer operation, the batchInternal()
method of the SSHBatchClient structure attempts to retrieve a connection,
receives a nil PktlineConnection value, and causes a panic by trying
to use it to run the Lock() method.

We therefore adjust the Connection() method of the SSHTransfer structure
to return a non-nil error when the requested connection number exceeds
the maximum number of permitted SSH connections.  This also covers the
case where the "conn" array has been set to nil, as the len() built-in
function returns zero for the length of a nil array.
chrisd8088 added a commit to chrisd8088/git-lfs that referenced this pull request Nov 1, 2024
In commit 44b8801 of PR git-lfs#5634 we
revised the Connection() method of the SSHTransfer structure in our
"ssh" package to create connections on demand, rather than simply
return pre-existing connections created by the setConnectionCount()
method, which allowed us to reduce the total number of SSH connections
in many cases.

As part of this change, we altered the Connection() method to return
an error if it could not create a connection successfully, and we then
updated the callers of this method to check for a non-nil error.

However, the method still returned nil if the requested connection number
exceeded the maximum number of connections permitted, and none of
the callers check for that condition.

In particular, the batchInternal() method of the SSHBatchClient structure
in the "tq" package originally checked for a nil return value from the
Connection() method, as this was the only way the Connection() method
indicated a failure to find an active SSH connection.  With the change in
commit 44b8801, the batchInternal() method
now checks only for a non-nil error return value, which means it does
not detect the case where a nil PktlineConnection value is returned.

As a result, in situations such as those described in git-lfs#5880, the
batchInternal() method may cause a Go panic when it attempts to call
the Lock() method of a nil PktlineConnection value.  One way this can
occur is after one batch transfer operation has finished and the
Wait() method of the TransferQueue structure has called the Shutdown()
method of the sshTransfer field of the TransferQueue's concreteManifest.
That closes all the SSH connections and sets the "conn" array to nil.
But a reference to the same SSHTransfer structure is retained by the
sshTransfer field of the SSHBatchClient structure to which the
concreteManifest's batchClientAdapter field points.  When that manifest
is reused by a subsequent batch transfer operation, the batchInternal()
method of the SSHBatchClient structure attempts to retrieve a connection,
receives a nil PktlineConnection value, and causes a panic by trying
to use it to run the Lock() method.

We therefore adjust the Connection() method of the SSHTransfer structure
to return a non-nil error when the requested connection number exceeds
the maximum number of permitted SSH connections.  This also covers the
case where the "conn" array has been set to nil, as the len() built-in
function returns zero for the length of a nil array.
chrisd8088 added a commit to chrisd8088/git-lfs that referenced this pull request Nov 3, 2024
In commit 44b8801 of PR git-lfs#5634 we
revised the Connection() method of the SSHTransfer structure in our
"ssh" package to create connections on demand, rather than simply
return pre-existing connections created by the setConnectionCount()
method, which allowed us to reduce the total number of SSH connections
in many cases.

As part of this change, we altered the Connection() method to return
an error if it could not create a connection successfully, and we then
updated the callers of this method to check for a non-nil error.

However, the method still returned nil if the requested connection number
exceeded the maximum number of connections permitted, and none of
the callers check for that condition.

In particular, the batchInternal() method of the SSHBatchClient structure
in the "tq" package originally checked for a nil return value from the
Connection() method, as this was the only way the Connection() method
indicated a failure to find an active SSH connection.  With the change in
commit 44b8801, the batchInternal() method
now checks only for a non-nil error return value, which means it does
not detect the case where a nil PktlineConnection value is returned.

As a result, in situations such as those described in git-lfs#5880, the
batchInternal() method may cause a Go panic when it attempts to call
the Lock() method of a nil PktlineConnection value.  One way this can
occur is after one batch transfer operation has finished and the
Wait() method of the TransferQueue structure has called the Shutdown()
method of the sshTransfer field of the TransferQueue's concreteManifest.
That closes all the SSH connections and sets the "conn" array to nil.
But a reference to the same SSHTransfer structure is retained by the
sshTransfer field of the SSHBatchClient structure to which the
concreteManifest's batchClientAdapter field points.  When that manifest
is reused by a subsequent batch transfer operation, the batchInternal()
method of the SSHBatchClient structure attempts to retrieve a connection,
receives a nil PktlineConnection value, and causes a panic by trying
to use it to run the Lock() method.

We therefore adjust the Connection() method of the SSHTransfer structure
to return a non-nil error when the requested connection number exceeds
the maximum number of permitted SSH connections.  This also covers the
case where the "conn" array has been set to nil, as the len() built-in
function returns zero for the length of a nil array.
chrisd8088 added a commit to chrisd8088/git-lfs that referenced this pull request Nov 4, 2024
In commit 44b8801 of PR git-lfs#5634 we
revised the Connection() method of the SSHTransfer structure in our
"ssh" package to create SSH sessions on demand, rather than simply
return pre-existing sessions created by the setConnectionCount()
method, which allowed us to reduce the total number of SSH sessions
in many cases.

As part of this change, we altered the Connection() method to return
an error if it could not create a session successfully, and we then
updated the callers of this method to check for a non-nil error.

However, the method still returned a nil error value, along with a
nil PktlineConnection value, if the requested session ID exceeded
the maximum number of sessions permitted and so no additional session
was created or returned.  However, the method's callers all now assume
a nil error value implies a non-nil PktlineConnection value.

In particular, the batchInternal() method of the SSHBatchClient structure
in the "tq" package originally checked for a nil return value from the
Connection() method, as this was the only way the Connection() method
indicated a failure to find or create an active SSH session.  Since
the change in commit 44b8801, the
batchInternal() method now checks only for a non-nil error return value,
which means it does not detect the case where a nil PktlineConnection
value is returned.

As a result, in situations such as those described in git-lfs#5880, the
batchInternal() method may cause a Go panic when it attempts to call
the Lock() method of a nil PktlineConnection value.  One way this can
occur is after one batch transfer operation has finished and the
Wait() method of the TransferQueue structure has called the Shutdown()
method of the sshTransfer field of the TransferQueue's concreteManifest.
That closes all the SSH sessions and sets the "conn" array to nil.
But a reference to the same SSHTransfer structure is retained by the
sshTransfer field of the SSHBatchClient structure to which the
concreteManifest's batchClientAdapter field points.  When that manifest
is reused by a subsequent batch transfer operation, the batchInternal()
method of the SSHBatchClient structure attempts to retrieve a session,
receives a nil PktlineConnection value, and causes a panic by trying
to use it to run the Lock() method.

We therefore adjust the Connection() method of the SSHTransfer structure
to return a non-nil error when the requested session ID exceeds the
maximum number of sessions allowed.  This also covers the case where
the "conn" array has been set to nil, as the len() built-in function
returns zero for the length of a nil array.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants