Description
This issue proposes a new design for the transfer package, which will supersede the lfs/transfer_queue.go code and serve as the sole mechanism to transfer objects to and from LFS servers.
Motivation
The current transfer-queue lacks some features that we’d like to implement for the v2.0.0 release of Git LFS:
- Less buffering. LFS should be better at dealing with really large working trees during initial pushes, or `push --all`'s. Currently, when retrying an object, it is re-inserted at the last spot in the queue, meaning that we could be holding a potentially large number of OIDs in memory before retrying them.
- Async. Currently, we have to wait for the entire working tree scan to complete before beginning the transfer. This is fine for small trees, but yields undesirable behavior for larger ones. We should be able to start uploading as soon as we have enough data to do so.
- Backpressure. We should synchronize the work being done by the scanner package with the work being done by the `transfer` package in uploading objects that the scanner finds. In other words: if we're network-bound while uploading, we should delay scanning for more objects until we can resume uploading.
- Resumable transfers, by default. We should be able to resume uploads and downloads as the default if a transfer fails part-way through. This, combined with better retry behavior, will yield much more resilient and efficient transfers.
- Better design. Designing a new package for the `transfer` code, thus moving it out of the `lfs` package, affords us the opportunity to re-imagine it from first principles, and define a solid, public API that can be used from other packages, and other programs.
Proposal
What follows is a high-level overview of the mechanism I'd like to implement in the transfer package, and a description of each of its major components (and how each relates to the motivations above):
Async & backpressure
We should halt the scanning process when there are no workers free to transfer objects. We can do this by using the following flow:
1. Accept items from a `<-chan *lfs.Pointer` to build up a single batch of items.
2. If the currently accumulating batch has a cardinality smaller than twice the desired size of a batch (2*100), and the `<-chan` remains open, go to step 1.
3. Make a batch API call with the accumulated items.
4. Distribute the transfers across `n` workers.
5. While all workers are busy, go to step 4.
6. As soon as a worker is free, go to step 1.
7. Repeat until the `<-chan *lfs.Pointer` signals that there are no more items.
What’s great about this model is that we put back-pressure on the incoming channel of data when all the workers are busy, and we have a batch enqueued. Since the scanner’s implementation uses channels, the back-pressure propagates all the way up, halting the process of scanning revisions until it makes sense to continue that work.
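The accumulate-then-distribute flow above can be sketched in Go as follows. This is a minimal, hedged sketch, not the proposed implementation: `collectBatch`, `batchSize`, and `numWorkers` are hypothetical names, OIDs stand in for `*lfs.Pointer`s, and the batch API call is stubbed out. The key property is that `collectBatch` reads from an unbuffered channel, so the upstream scanner blocks whenever the transfer side is not ready for more items.

```go
package main

import (
	"fmt"
	"sync"
)

const (
	batchSize  = 100 // hypothetical desired batch size
	numWorkers = 3   // hypothetical worker count
)

// collectBatch drains up to 2*batchSize OIDs from in, returning the
// accumulated batch and whether the channel is still open. Reading
// from an unbuffered channel is what propagates back-pressure up to
// the scanner.
func collectBatch(in <-chan string) (batch []string, open bool) {
	for oid := range in {
		batch = append(batch, oid)
		if len(batch) == 2*batchSize {
			return batch, true
		}
	}
	return batch, false
}

func main() {
	in := make(chan string)
	go func() {
		for i := 0; i < 5; i++ {
			in <- fmt.Sprintf("oid-%d", i) // stand-in for scanned pointers
		}
		close(in)
	}()

	for {
		batch, open := collectBatch(in)
		// A batch API call with the accumulated items would go here.
		var wg sync.WaitGroup
		work := make(chan string)
		for i := 0; i < numWorkers; i++ {
			wg.Add(1)
			go func() {
				defer wg.Done()
				for range work {
					// Transfer the object here.
				}
			}()
		}
		for _, oid := range batch {
			work <- oid
		}
		close(work)
		wg.Wait()
		if !open {
			break
		}
	}
	fmt.Println("all batches transferred")
}
```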
Less buffering, retrying
I believe that there are two classifications of errors: resumable, and others. @technoweenie has provided some excellent commentary on the matter:
- Fatal errors: errors that can’t be recovered from.
- Examples: failing to read an object, write to a media file, or similar.
- Local/recoverable errors: if the object is expired, or we experienced a network hiccup in the middle of transferring a file, let’s resume from there.
We can be clever in how we handle these: fatal errors should abort the transfer immediately, and never be retried. To signal them, we can use `errors.NewFatalError` or similar to indicate that the error returned from an adapter implementation cannot and should not be retried in the future.
In the case of recoverable errors, this is where things get interesting: we should make an attempt to retry those errors within the adapter unless it requires that we make another batch API call. For instance, a network hiccup should be retried internally before giving up and re-trying the batch call. On the other hand, an expired object needs another batch API call to get a new storage endpoint, so we must reject that immediately and retry.
Budgeting
Previously, our approach has been to try and keep track of the number of retries an object has attempted per OID. This has worked OK, but does not necessarily align well with our goals of making the transfer queue better at handling large numbers of unique OIDs, since the `map[string]uint64` will grow linearly with the number of OIDs in the queue.
The way I see it, we have two approaches on what we can do:
- Retain the same cost-based approach per OID, purging the key from the cache when we give up on the OID. If we apply the constraint that the adapter does not accept non-unique OIDs, we should be fine to adopt this model.
- Use a cost-based approach per transfer queue, budgeting against all objects in the queue, not per OID. This approach could work, but it may be better to just treat the `map` with care, and adopt the first approach instead.
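The first approach might look like the following sketch (`retryBudget` and its method names are hypothetical). The important property is that an entry is deleted as soon as its OID succeeds or exhausts its budget, so the map stays proportional to the objects currently in flight rather than growing with every OID ever enqueued:

```go
package main

import "fmt"

// retryBudget tracks remaining retries per OID, purging entries as
// soon as an OID succeeds or exhausts its budget. All names here are
// hypothetical sketches, not the real implementation.
type retryBudget struct {
	maxRetries int
	counts     map[string]int
}

func newRetryBudget(max int) *retryBudget {
	return &retryBudget{maxRetries: max, counts: make(map[string]int)}
}

// Allow reports whether oid may be retried again, purging its entry
// once the budget is spent.
func (b *retryBudget) Allow(oid string) bool {
	b.counts[oid]++
	if b.counts[oid] > b.maxRetries {
		delete(b.counts, oid)
		return false
	}
	return true
}

// Forget purges an OID that completed successfully.
func (b *retryBudget) Forget(oid string) { delete(b.counts, oid) }

func main() {
	b := newRetryBudget(2)
	fmt.Println(b.Allow("abc")) // true (1st retry)
	fmt.Println(b.Allow("abc")) // true (2nd retry)
	fmt.Println(b.Allow("abc")) // false; entry purged
	fmt.Println(len(b.counts))  // 0
}
```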
Retrying
The only way we can avoid buffering a potentially huge amount of OIDs is to retry them immediately, rather than later.
To accomplish this, I propose two return avenues for adapter implementations: one for new items, and one for retried ones. The retry channel would be preferred, so that old items are drained before new ones are accepted.
```go
func acceptQueue() {
	for {
		var oid string

		// Try first to grab something from the retries
		// channel. If nothing was available, bail out
		// immediately.
		select {
		case oid = <-retriesCh:
		default:
		}

		if len(oid) == 0 {
			// If we didn't get an item to retry,
			// search for a new one incoming.
			oid = <-newItems
		}

		if err := enqueue(oid); err != nil { … }
	}
}
```

This allows us to continue putting back-pressure all the way up the scanner implementation while we try and handle retries before accepting new items.
Resumable transfers
I like the existing transfer adapter design. What I'm proposing is similar: a different implementation of a transfer adapter interface for each transfer medium, namely Tus.io, Basic, and Custom:
```go
package transfer

type Direction byte

const (
	Upload Direction = iota
	Download
)

type Adapter interface {
	Transfer(p *lfs.Pointer, d Direction, offset uint64) (n int, err error)
}
```

Better design
By implementing this mechanism in a brand-new package, `transfer`, we afford ourselves the opportunity to redesign its public API to be stable and easily consumable, not only from other packages, but from other programs as well.
In addition, and perhaps more importantly, considering new designs makes it easier to add tests at the unit and integration level, better ensuring the long-term stability of this crucial part of LFS.
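As one hedged illustration of how the `offset` parameter in such an adapter interface enables resumption: a basic adapter's download path might turn a non-zero offset into an HTTP Range request, so a failed transfer picks up where it left off. Everything below (function names, the example URL) is hypothetical, not part of this proposal's API:

```go
package main

import (
	"fmt"
	"net/http"
)

// rangeHeader builds the Range header a basic adapter could send to
// resume a download from the given offset.
func rangeHeader(offset uint64) string {
	return fmt.Sprintf("bytes=%d-", offset)
}

// resumeDownload sketches how an adapter's Transfer method might honor
// its offset parameter: request only the bytes we do not already have.
// The URL here is a placeholder, not a real LFS storage endpoint.
func resumeDownload(url string, offset uint64) (*http.Request, error) {
	req, err := http.NewRequest("GET", url, nil)
	if err != nil {
		return nil, err
	}
	if offset > 0 {
		req.Header.Set("Range", rangeHeader(offset))
	}
	return req, nil
}

func main() {
	req, err := resumeDownload("https://example.com/objects/abc", 1024)
	if err != nil {
		panic(err)
	}
	fmt.Println(req.Header.Get("Range")) // bytes=1024-
}
```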
Curious to hear your thoughts!
/cc @technoweenie @rubyist @larsxschneider @sinbad @sschuberth
