
tq: prioritize transferring retries before new items#1758

Merged
ttaylorr merged 18 commits into tq-master from tq-priority-retries
Dec 13, 2016
Conversation

@ttaylorr
Contributor

@ttaylorr ttaylorr commented Dec 12, 2016

This pull-request implements priority retries, and moves TransferQueue-related code into a new package, tq, as discussed in #1651.

Why?

In an effort to better support transferring large numbers of objects through the *tq.TransferQueue, we need to buffer less data than we currently do. One way to do this is to process retries up front, instead of keeping data in memory about failed objects around until the end of the transfer. Normally, this isn't a huge problem, but if the transfer queue fails half a million objects, we can't afford to keep that in memory.

What?

This pull request brings a number of benefits to the *TransferQueue type:

  1. Back-pressure. If the *TransferQueue can't accept more items, there's no sense in burning a large amount of disk and CPU scanning the Git data. Previously, instances of git rev-list, git cat-file and git cat-file --batch would run ad nauseam even if the TransferQueue couldn't accept new items. Now, Add() will block after a maximum buffer depth has been reached. In other words, by default, Add() will accept 200 items (twice the default batch size of 100) before applying back-pressure up to the gitscanner, causing it to wait.
  2. Support for many objects: Right now, we keep a large amount of data in memory for each object retry. Before lfs/transfer_queue: support multiple retries per object #1535, that data was kept around until the very end of the transfer and then retried all at once. That got better after that PR was merged, but at the cost of many more API calls. This PR encompasses the best of both worlds: by retrying objects closer to when they were initially Add()-ed, data about each object is kept in memory for less time. By treating retries as part of the next batch and prioritizing them, we make fewer API calls.
  3. Fewer API calls. As of lfs/transfer_queue: support multiple retries per object #1535, we make an API call (about) once per object retry. Normally this isn't such a huge problem, but if many objects fail, we'll be making a proportional number of expensive API calls. Not cool. This pull-request, by prioritizing object retries before accepting new items, cuts the number of batch API calls by the size of a batch (100, by default). Much better.

Inner-workings

This pull-request fundamentally changes the way that we process batches and retries in the transfer queue. Here's a breakdown of what goes on:

  1. Start with an empty batch.
  2. Until the batch is full OR the <-q.incoming channel is closed, append items received from Add() via <-q.incoming into the batch.
  3. Make a batch API request, and send the results to the transfer adapters.
  4. Clear the batch.
  5. Collect the failed items into the next batch.
  6. If the <-q.incoming channel is closed AND the next batch is empty, exit.
  7. Otherwise, go to step 2.

This requires us to change how the *tq.adapterBase (previously *transfer.adapterBase) type is implemented. The changes can be summarized by the signature change. What used to be:

func (a *adapterBase) Add(t *Transfer)

is now:

func (a *adapterBase) Add(ts ...*Transfer) (results <-chan TransferResult)

By synchronizing the results with a batch of *Transfer items, we can deterministically pre-fill the next batch with failed items from the last one.

This change comes at a cost, which is that in order to know when to close() the results channel, we must wait for all of the transfer adapters to finish their in-progress tasks. This effect will be unnoticeable if the items in a transfer set are of uniform size, but is more pronounced when items with wildly different sizes are processed in a random order with a poor network connection.

To combat this, items in a batch are sorted in descending object size to minimize the chance that a worker will get tied up on a large object as the last item in a batch. This change, implemented in 6decfe4, should make this a non-issue.


Since this PR is on the bigger side, here's a convenient break-down of the changes:

  1. 453d02a: Introduce batchSize instance variable (and OptionFn) to allow configuring the batch size.
  2. bdcacb0: Implement priority retries.
  3. 6decfe4: Sort batches by descending object size to minimize chance of idle workers.
  4. d0c4ccd...55275ca: Move lfs/transfer_queue.go and transfer package into new package, tq.
  5. 70f1479...8c94383: Rename various exported methods/types in tq package to avoid stuttering.

/cc @git-lfs/core

@ttaylorr ttaylorr added this to the v2.0.0 milestone Dec 12, 2016
Contributor

Maybe I'm missing something, but if you're adding len(transfers) to jobWait, and Done is only ever called as a response to completion of items in jobChan on a 1:1 basis, jobWait.Wait() should never return before this loop is done, therefore the close in End() is safe?

Contributor Author

This is correct, if the calling code occurs in a single goroutine. Consider this:

func main() {
        var a *adapterBase

        go a.Add(myBatch)

        a.Wait()
}

If we're, say, 50 items into myBatch before the execution scheduler takes us to the a.Wait() call, it'll close a.jobChan, which is what we expect. Once the scheduler takes us back to the goroutine, it'll try to process the 51st item and write into a closed channel, which is a panic. Even if it weren't a panic, we don't want to lose those items.

Contributor Author

Excuse me, I hit enter a little too early :-). What I wanted to add is that even though this is a fairly fringe scenario, I still think it's important to consider.

Contributor

I must be being stupid, because I don't understand how that could be an issue. Since Add() increments the wait group by len(myBatch), it doesn't matter how far through myBatch you get, the wait group won't be able to proceed past Wait() until Done() has been called that many times. So I'm not seeing where the race comes from...

Contributor Author

Doh! Sorry, I thought you were asking about my use of a.jobWait.Wait() in Done(), which I added after I added that // BUG(taylor) comment while debugging. Looks like I just forgot to remove that comment, which I did in ce08f66. I think we just found out in a round-about way that we're on the same page!

Contributor

I'm unreasonably happy with how job.Done() scans 😄

Contributor

Since you have to supply the max as an argument, I'm not sure what value this function adds?

Contributor Author

Yeah, I was a little hesitant about adding this function. I think it provides minimal value for the reason that you stated, and it expands the Batch interface. I've removed it in ced37c6c.

Contributor

Sorry, I don't understand what benefit this sort is giving you. Since you're only sorting within the batch, and the batch is the unit that is blocked on, what does it matter what the ordering is inside the batch?

I could imagine that if you sorted all the transfers as a whole so that bigger files made it into the first batch that this would alter the blocking behaviour, but sorting inside the batch seems incidental. Please explain?

Contributor Author

No worries, I wasn't quite clear with my reasoning here. The idea is to reduce the likelihood of a single worker getting tied up processing one very large item at the end of a batch while others are sitting idle. Since enqueuing the next batch is contingent on processing the current batch completely, idle workers should be treated as something we need to minimize as much as possible.

Specifically, the scenario I'm thinking about is a batch of 100 items where the first 99 are 5MB and the last item is 2GB. Processed in sequence, the workers will quickly finish transferring the small items. Towards the very end of the sequence, one worker will be sitting transferring a very large object, while the remaining workers are sitting doing nothing.

If the 2GB item is processed first, it ties a worker up for the duration of the transfer as it always would, but it allows the other workers to tear through the small items, minimizing the chance that a small number of large transfers will prolong the duration of the batch.

The sort makes this property much less noticeable, as does a larger number of workers, or a more evenly sized workload.

I went ahead and clarified this in b9aa0986.

Contributor

OK I see, so it's saving you from the case where one massive file is the absolute last thing off the queue when everything else is already finished. I'm not sure if in practice it's likely to make much of a difference, but I understand where you're coming from now :)

Contributor

I like how the original & retries are merged into one batch, makes everything simpler.

Contributor

This effectively means outChan is going to be nil for the most common case, which isn't entirely clear & could be confusing. Probably time to retire outChan entirely?

Contributor Author

🔥'd in e1f68be3.

@technoweenie
Contributor

I'd really like to remove the coupling on the config package. I would've preferred to review "prioritize transferring retries" first, and then work out the tq move/extraction stuff separately. What do you think about changing the base of this PR to tq-master?

@ttaylorr ttaylorr force-pushed the tq-priority-retries branch from e1f68be to 886dbcc Compare December 12, 2016 16:24
@ttaylorr ttaylorr changed the base branch from master to tq-master December 12, 2016 16:24
@ttaylorr
Contributor Author

I would've preferred to review "prioritize transferring retries" first, and then work out the tq move/extraction stuff separately.

This makes much more sense than the approach I pursued. I was finished with the in-place refactoring on Saturday night, and wanted to keep moving forward. However, I didn't want to create two PRs that would be in-flight at once, since I figured that'd be tough to review. Instead, I opted to create one large PR, which was a mistake.

What do you think about changing the base of this PR to tq-master?

I think that's a good idea. I reset tq-master to the current HEAD of master, and pointed the base of this PR at the tq-master branch.

@ttaylorr ttaylorr force-pushed the tq-priority-retries branch from f9e4888 to 7745140 Compare December 12, 2016 21:22
@ttaylorr
Contributor Author

ttaylorr commented Dec 12, 2016

@technoweenie I implemented the changes that we discussed this morning in cf409ac...d7bf608. 😄

Contributor

@technoweenie technoweenie left a comment

Looks great, thanks for taking care of my notes from our IRL review. There's a number of things I want addressed before we can merge to master and make the tq package a real thing, but they should happen in separate PRs.

Rough list (that should probably be expanded into a full issue):

  • Try to de-couple config package, using config.Environment interface.
  • Change the Begin(maxConcurrency int, cb ProgressCallback) interface function to Begin(m *Manifest, cb ProgressCallback), so we can future-proof this entry point.
  • Figure out that Transfer/Transferrable/api.ObjectResource stuff. Ideally there's a single tq.Object struct or something.

@ttaylorr
Contributor Author

@technoweenie sounds good, and thanks for the 👍. I'm thinking that the follow-up PRs can happen in the order that you suggested, since the only strict dependency is that the tq.Object stuff happen last.

Additionally before we merge tq-master into master, I'd like to look at the *adapterBase and narrow the interface down to WorkerStarting(), WorkerEnding(), and DoTransfer(). That will have to happen after the tq.Object, since I'm thinking that one is a dependency of the other.

I think after all of that, this should be 🆗 to merge into master.

@ttaylorr ttaylorr merged commit 39badd3 into tq-master Dec 13, 2016
@ttaylorr ttaylorr deleted the tq-priority-retries branch December 13, 2016 01:06
@ttaylorr ttaylorr mentioned this pull request Dec 13, 2016
@ttaylorr
Contributor Author

Opened up a tracking issue and am tracking progress in #1764.
