Description
This issue proposes a new design for the transfer package, which will supersede the lfs/transfer_queue.go code and serve as the sole mechanism to transfer objects to and from LFS servers.
Motivation
The current transfer-queue lacks some features that we’d like to implement for the v2.0.0 release of Git LFS:
- Less buffering. LFS should be better at dealing with really large working trees during initial pushes, or `push --all`'s. Currently, when retrying an object, it is re-inserted at the last spot in the queue, meaning that we could be holding a potentially large number of OIDs in memory before retrying them.
- Async. Currently, we have to wait for the entire working tree scan to complete before beginning the transfer. This is fine for small trees, but yields undesirable behavior for larger ones. We should be able to start uploading as soon as we have enough data to do so.
- Backpressure. We should synchronize the work being done by the scanner package with the work being done by the `transfer` package in uploading objects that the scanner finds. In other words: if we're network-bound while uploading, we should delay scanning for more objects until we can resume uploading.
- Resumable transfers, by default. We should be able to resume uploads and downloads as the default if a transfer fails part-way through. This, combined with better retry behavior, will yield much more resilient and efficient transfers.
- Better design. Designing a new package for the `transfer` code, thus moving it out of the `lfs` package, affords us the opportunity to re-imagine it from first principles, and define a solid, public API that can be used from other packages, and other programs.
Proposal
What follows is a high-level overview of the mechanism I'd like to implement in the transfer package, and a description of each of its major components (and how each relates to the motivations above):
Async & backpressure
We should halt the scanning process when there are no workers free to transfer objects. We can do this by using the following flow:
1. Accept items from a `<-chan *lfs.Pointer` to build up a single batch of items.
2. If the currently accumulating batch has a cardinality smaller than twice the desired size of a batch (2*100), and the `<-chan` remains open, go to step 1.
3. Make a batch API call with the accumulated items.
4. Distribute the transfers across `n` workers.
5. While all workers are busy, go to step 4.
6. As soon as a worker is free, go to step 1.
7. Repeat until the `<-chan *lfs.Pointer` signals that there are no more items.
What’s great about this model is that we put back-pressure on the incoming channel of data when all the workers are busy, and we have a batch enqueued. Since the scanner’s implementation uses channels, the back-pressure propagates all the way up, halting the process of scanning revisions until it makes sense to continue that work.
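The accumulate-then-distribute flow above can be sketched in Go as follows. This is a minimal, hedged sketch, not the proposed implementation: `collectBatch`, `batchSize`, and `numWorkers` are hypothetical names, OIDs stand in for `*lfs.Pointer`s, and the batch API call is stubbed out. The key property is that `collectBatch` reads from an unbuffered channel, so the upstream scanner blocks whenever the transfer side is not ready for more items.

```go
package main

import (
	"fmt"
	"sync"
)

const (
	batchSize  = 100 // hypothetical desired batch size
	numWorkers = 3   // hypothetical worker count
)

// collectBatch drains up to 2*batchSize OIDs from in, returning the
// accumulated batch and whether the channel is still open. Reading
// from an unbuffered channel is what propagates back-pressure up to
// the scanner.
func collectBatch(in <-chan string) (batch []string, open bool) {
	for oid := range in {
		batch = append(batch, oid)
		if len(batch) == 2*batchSize {
			return batch, true
		}
	}
	return batch, false
}

func main() {
	in := make(chan string)
	go func() {
		for i := 0; i < 5; i++ {
			in <- fmt.Sprintf("oid-%d", i) // stand-in for scanned pointers
		}
		close(in)
	}()

	for {
		batch, open := collectBatch(in)
		// A batch API call with the accumulated items would go here.
		var wg sync.WaitGroup
		work := make(chan string)
		for i := 0; i < numWorkers; i++ {
			wg.Add(1)
			go func() {
				defer wg.Done()
				for range work {
					// Transfer the object here.
				}
			}()
		}
		for _, oid := range batch {
			work <- oid
		}
		close(work)
		wg.Wait()
		if !open {
			break
		}
	}
	fmt.Println("all batches transferred")
}
```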
Less buffering, retrying
I believe that there are two classifications of errors: resumable, and others. @technoweenie has provided some excellent commentary on the matter:
- Fatal errors: errors that can’t be recovered from.
- Examples: failing to read an object, write to a media file, or similar.
- Local/recoverable errors: if the object is expired, or we experienced a network hiccup in the middle of transferring a file, let’s resume from there.
We can be clever in how we handle these: fatal errors should abort the transfer immediately, and never be retried. To signal them, we can use `errors.NewFatalError` or similar to indicate that the error returned from an adapter implementation cannot and should not be retried in the future.
In the case of recoverable errors, this is where things get interesting: we should make an attempt to retry those errors within the adapter unless it requires that we make another batch API call. For instance, a network hiccup should be retried internally before giving up and re-trying the batch call. On the other hand, an expired object needs another batch API call to get a new storage endpoint, so we must reject that immediately and retry.
Budgeting
Previously, our approach has been to try and keep track of the number of retries an object has attempted per OID. This has worked OK, but does not necessarily align well with our goals of making the transfer queue better at handling large numbers of unique OIDs, since the `map[string]uint64` will grow linearly with the number of OIDs in the queue.
The way I see it, we have two approaches on what we can do:
- Retain the same cost-based approach per OID, purging the key from the cache when we give up on the OID. If we apply the constraint that the adapter does not accept non-unique OIDs, we should be fine to adopt this model.
- Use a cost-based approach per transfer queue, budgeting against all objects in the queue, not per OID. This approach could work, but it may be better to just treat the `map` with care, and adopt the first approach instead.
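The first approach might look like the following sketch (`retryBudget` and its method names are hypothetical). The important property is that an entry is deleted as soon as its OID succeeds or exhausts its budget, so the map stays proportional to the objects currently in flight rather than growing with every OID ever enqueued:

```go
package main

import "fmt"

// retryBudget tracks remaining retries per OID, purging entries as
// soon as an OID succeeds or exhausts its budget. All names here are
// hypothetical sketches, not the real implementation.
type retryBudget struct {
	maxRetries int
	counts     map[string]int
}

func newRetryBudget(max int) *retryBudget {
	return &retryBudget{maxRetries: max, counts: make(map[string]int)}
}

// Allow reports whether oid may be retried again, purging its entry
// once the budget is spent.
func (b *retryBudget) Allow(oid string) bool {
	b.counts[oid]++
	if b.counts[oid] > b.maxRetries {
		delete(b.counts, oid)
		return false
	}
	return true
}

// Forget purges an OID that completed successfully.
func (b *retryBudget) Forget(oid string) { delete(b.counts, oid) }

func main() {
	b := newRetryBudget(2)
	fmt.Println(b.Allow("abc")) // true (1st retry)
	fmt.Println(b.Allow("abc")) // true (2nd retry)
	fmt.Println(b.Allow("abc")) // false; entry purged
	fmt.Println(len(b.counts))  // 0
}
```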
Retrying
The only way we can avoid buffering a potentially huge amount of OIDs is to retry them immediately, rather than later.
To accomplish this, I propose two return avenues for adapter implementations: one for new items, and one for retried ones. The retry channel would be preferred, so that old items are drained before new ones are accepted.
```go
func acceptQueue() {
	for {
		var oid string

		// Try first to grab something from the retries
		// channel. If nothing was available, bail out
		// immediately.
		select {
		case oid = <-retriesCh:
		default:
		}

		if len(oid) == 0 {
			// If we didn't get an item to retry,
			// search for a new one incoming.
			oid = <-newItems
		}

		if err := enqueue(oid); err != nil { … }
	}
}
```

This allows us to continue putting back-pressure all the way up the scanner implementation while we try and handle retries before accepting new items.
Resumable transfers
I like the existing transfer adapter design. What I'm proposing is similar: a different implementation of a transfer adapter interface for each transfer medium, namely Tus.io, Basic, and Custom:
```go
package transfer

type Direction byte

const (
	Upload Direction = iota
	Download
)

type Adapter interface {
	Transfer(p *lfs.Pointer, d Direction, offset uint64) (n int, err error)
}
```

Better design
By implementing this mechanism in a brand-new package, `transfer`, we afford ourselves the opportunity to redesign its public API to be stable and easily consumable, not only from other packages, but from other programs as well.
In addition, and perhaps more importantly, considering new designs makes it easier to add tests at the unit and integration level, better ensuring the long-term stability of this crucial part of LFS.
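As one hedged illustration of how the `offset` parameter in such an adapter interface enables resumption: a basic adapter's download path might turn a non-zero offset into an HTTP Range request, so a failed transfer picks up where it left off. Everything below (function names, the example URL) is hypothetical, not part of this proposal's API:

```go
package main

import (
	"fmt"
	"net/http"
)

// rangeHeader builds the Range header a basic adapter could send to
// resume a download from the given offset.
func rangeHeader(offset uint64) string {
	return fmt.Sprintf("bytes=%d-", offset)
}

// resumeDownload sketches how an adapter's Transfer method might honor
// its offset parameter: request only the bytes we do not already have.
// The URL here is a placeholder, not a real LFS storage endpoint.
func resumeDownload(url string, offset uint64) (*http.Request, error) {
	req, err := http.NewRequest("GET", url, nil)
	if err != nil {
		return nil, err
	}
	if offset > 0 {
		req.Header.Set("Range", rangeHeader(offset))
	}
	return req, nil
}

func main() {
	req, err := resumeDownload("https://example.com/objects/abc", 1024)
	if err != nil {
		panic(err)
	}
	fmt.Println(req.Header.Get("Range")) // bytes=1024-
}
```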
Curious to hear your thoughts!
/cc @technoweenie @rubyist @larsxschneider @sinbad @sschuberth
