
tq: prioritize transferring retries before new items#1758

Merged
ttaylorr merged 18 commits into tq-master from tq-priority-retries
Dec 13, 2016
Conversation

@ttaylorr
Contributor

@ttaylorr ttaylorr commented Dec 12, 2016

This pull-request implements priority retries, and moves TransferQueue-related code into a new package, tq, as discussed in #1651.

Why?

In an effort to better support transferring large numbers of objects through the *tq.TransferQueue, we need to buffer less data than we currently do. One way to do this is to process retries up front, instead of keeping data in memory about failed objects around until the end of the transfer. Normally, this isn't a huge problem, but if the transfer queue fails half a million objects, we can't afford to keep that in memory.

What?

This pull request brings a number of benefits to the *TransferQueue type:

  1. Back-pressure. If the *TransferQueue can't accept more items, there's no sense in burning a large amount of disk and CPU scanning the Git data. Previously, instances of git rev-list, git cat-file and git cat-file --batch would run ad nauseam even if the TransferQueue couldn't accept new items. Now, Add() will block after a maximum buffer depth has been reached. In other words, by default, Add() will accept 200 items (twice the default batch size of 100) before applying back-pressure up to the gitscanner, causing it to wait.
  2. Support for many objects: Right now, we keep a large amount of data in memory for each object retry. Before lfs/transfer_queue: support multiple retries per object #1535, that data was kept around until the very end of the transfer and then retried all at once. That got better after that PR was merged, but at the cost of many more API calls. This PR encompasses the best of both worlds: by retrying objects closer to when they were initially Add()-ed, data about each object is kept in memory for less time. By treating retries as part of the next batch and prioritizing them, we make fewer API calls.
  3. Fewer API calls. As of lfs/transfer_queue: support multiple retries per object #1535, we make an API call (about) once per object retry. Normally this isn't such a huge problem, but if many objects fail, we'll be making a proportional number of expensive API calls. Not cool. This pull-request, by prioritizing object retries before accepting new items, cuts the number of batch API calls by the size of a batch (100, by default). Much better.

Inner-workings

This pull-request fundamentally changes the way that we process batches and retries in the transfer queue. Here's a breakdown of what goes on:

  1. Start with an empty batch.
  2. Until the batch is full OR the <-q.incoming channel is closed, append items received from Add() via <-q.incoming into the batch.
  3. Make a batch API request, and send the results to the transfer adapters.
  4. Clear the batch.
  5. Collect the failed items into the next batch.
  6. If the <-q.incoming channel is closed AND the next batch is empty, exit.
  7. Otherwise, go to step 2.

This requires us to change how the *tq.adapterBase (previously *transfer.adapterBase) type is implemented. The changes can be summarized by the signature change. What used to be:

func (a *adapterBase) Add(t *Transfer)

is now:

func (a *adapterBase) Add(ts ...*Transfer) (results <-chan TransferResult)

By synchronizing the results with a batch of *Transfer items, we can deterministically pre-fill the next batch with failed items from the last one.

This change comes at a cost, which is that in order to know when to close() the results channel, we must wait for all of the transfer adapters to finish their in-progress tasks. This effect will be unnoticeable if the items in a transfer set are of uniform size, but is more pronounced when items with wildly different sizes are processed in a random order with a poor network connection.

To combat this, items in a batch are sorted in descending object size to minimize the chance that a worker will get tied up on a large object as the last item in a batch. This change, implemented in 6decfe4, should make this a non-issue.


Since this PR is on the bigger side, here's a convenient break-down of the changes:

  1. 453d02a: Introduce batchSize instance variable (and OptionFn) to allow configuring the batch size.
  2. bdcacb0: Implement priority retries.
  3. 6decfe4: Sort batches by descending object size to minimize chance of idle workers.
  4. d0c4ccd...55275ca: Move lfs/transfer_queue.go and transfer package into new package, tq.
  5. 70f1479...8c94383: Rename various exported methods/types in tq package to avoid stuttering.

/cc @git-lfs/core

@ttaylorr ttaylorr added this to the v2.0.0 milestone Dec 12, 2016
Contributor

Maybe I'm missing something, but if you're adding len(transfers) to jobWait, and Done is only ever called as a response to completion of items in jobChan on a 1:1 basis, jobWait.Wait() should never return before this loop is done, therefore the close in End() is safe?

Contributor Author

This is correct, if the calling code occurs in a single goroutine. Consider this:

func main() {
        var a *adapterBase

        go a.Add(myBatch)

        a.Wait()
}

If we're, say, 50 items into myBatch before the execution scheduler takes us to the a.Wait() call, it'll close a.jobChan, which is what we expect. Once the scheduler takes us back to the goroutine, it'll try to process the 51st item and write into a closed channel, which is a panic. Even if it weren't a panic, we don't want to lose those items.

Contributor Author

Excuse me, I hit enter a little too early :-). What I wanted to add is that even though this is a fairly fringe scenario, I still think it's important to consider.

Contributor

I must be being stupid, because I don't understand how that could be an issue. Since Add() increments the wait group by len(myBatch), it doesn't matter how far through myBatch you get, the wait group won't be able to proceed past Wait() until Done() has been called that many times. So I'm not seeing where the race comes from...

Contributor Author

Doh! Sorry, I thought you were asking about my use of a.jobWait.Wait() in Done(), which I added after I added that // BUG(taylor) comment while debugging. Looks like I just forgot to remove that comment, which I did in ce08f66. I think we just found out in a round-about way that we're on the same page!

Contributor

I'm unreasonably happy with how job.Done() scans 😄

Contributor

Since you have to supply the max as an argument, I'm not sure what value this function adds?

Contributor Author

Yeah, I was a little hesitant about adding this function. I think it provides minimal value for the reason that you stated, and it expands the Batch interface. I've removed it in ced37c6c.

Contributor

Sorry, I don't understand what benefit this sort is giving you. Since you're only sorting within the batch, and the batch is the unit that is blocked on, what does it matter what the ordering is inside the batch?

I could imagine that if you sorted all the transfers as a whole so that bigger files made it into the first batch that this would alter the blocking behaviour, but sorting inside the batch seems incidental. Please explain?

Contributor Author

No worries, I wasn't quite clear with my reasoning here. The idea is to reduce the likelihood of a single worker getting tied up processing one very large item at the end of a batch while others are sitting idle. Since enqueuing the next batch is contingent on processing the current batch completely, idle workers should be treated as something we need to minimize as much as possible.

Specifically, the scenario I'm thinking about is a batch of 100 items where the first 99 are 5MB and the last item is 2GB. Processed in sequence, the workers will quickly finish transferring the small items. Towards the very end of the sequence, one worker will be sitting transferring a very large object, while the remaining workers are sitting doing nothing.

If the 2GB item is processed first, it ties a worker up for the duration of the transfer as it always would, but it allows the other workers to tear through the small items, minimizing the chance that a small number of large transfers will prolong the duration of the batch.

The sort makes this property much less noticeable, as does a larger number of workers, or a more evenly sized workload.

I went ahead and clarified this in b9aa0986.

Contributor

OK I see, so it's saving you from the case where one massive file is the absolute last thing off the queue when everything else is already finished. I'm not sure if in practice it's likely to make much of a difference, but I understand where you're coming from now :)

Contributor

I like how the original & retries are merged into one batch, makes everything simpler.

Contributor

This effectively means outChan is going to be nil for the most common case, which isn't entirely clear & could be confusing. Probably time to retire outChan entirely?

Contributor Author

🔥'd in e1f68be3.

@technoweenie
Contributor

I'd really like to remove the coupling on the config package. I would've preferred to review "prioritize transferring retries" first, and then work out the tq move/extraction stuff separately. What do you think about changing the base of this PR to tq-master?

@ttaylorr ttaylorr force-pushed the tq-priority-retries branch from e1f68be to 886dbcc Compare December 12, 2016 16:24
@ttaylorr ttaylorr changed the base branch from master to tq-master December 12, 2016 16:24
@ttaylorr
Contributor Author

I would've preferred to review "prioritize transferring retries" first, and then work out the tq move/extraction stuff separately.

This makes much more sense than the approach I pursued. I was finished with the in-place refactoring on Saturday night, and wanted to keep moving forward. However, I didn't want to create two PRs that would be in-flight at once, since I figured that'd be tough to review. Instead, I opted to create one large PR, which was a mistake.

What do you think about changing the base of this PR to tq-master?

I think that's a good idea. I reset tq-master to the current HEAD of master, and pointed the base of this PR at the tq-master branch.

@ttaylorr ttaylorr force-pushed the tq-priority-retries branch from f9e4888 to 7745140 Compare December 12, 2016 21:22
@ttaylorr
Contributor Author

ttaylorr commented Dec 12, 2016

@technoweenie I implemented the changes that we discussed this morning in cf409ac...d7bf608. 😄

Contributor

@technoweenie technoweenie left a comment

Looks great, thanks for taking care of my notes from our IRL review. There's a number of things I want addressed before we can merge to master and make the tq package a real thing, but they should happen in separate PRs.

Rough list (that should probably be expanded into a full issue):

  • Try to de-couple config package, using config.Environment interface.
  • Change the Begin(maxConcurrency int, cb ProgressCallback) interface function to Begin(m *Manifest, cb ProgressCallback), so we can future-proof this entry point.
  • Figure out that Transfer/Transferrable/api.ObjectResource stuff. Ideally there's a single tq.Object struct or something.

@ttaylorr
Contributor Author

@technoweenie sounds good, and thanks for the 👍. I'm thinking that the follow-up PRs can happen in the order that you suggested, since the only strict dependency is that the tq.Object stuff happen last.

Additionally before we merge tq-master into master, I'd like to look at the *adapterBase and narrow the interface down to WorkerStarting(), WorkerEnding(), and DoTransfer(). That will have to happen after the tq.Object, since I'm thinking that one is a dependency of the other.

I think after all of that, this should be 🆗 to merge into master.

@ttaylorr ttaylorr merged commit 39badd3 into tq-master Dec 13, 2016
@ttaylorr ttaylorr deleted the tq-priority-retries branch December 13, 2016 01:06
@ttaylorr ttaylorr mentioned this pull request Dec 13, 2016
@ttaylorr
Contributor Author

Opened up a tracking issue and am tracking progress in #1764.
