lfs/transfer_queue: support multiple retries per object #1535

Merged

ttaylorr merged 2 commits into master from multiple-retries on Sep 22, 2016
Conversation

@ttaylorr (Contributor) commented:

This pull request teaches the lfs/transfer_queue how to retry an object more than once per transfer.

Previously

The transfer queue used to follow these steps when executing a transfer:

  1. Receive a bunch of objects from calls to Add()
  2. Start either the batchApiRoutine or individualApiRoutine based on its capabilities.
  3. Attempt to transfer the object(s), and commit to a retry if it failed with a retriable error.
    1. Collect the transfers to be retried in the retryCollector
    2. Close the transfer channel to kick off all of the objects to be retried.
    3. Re-add all of the objects to be retried and re-process them once.
  4. Exit.

This works fine, but limits the transfer queue to:

  • Responding to one set of retries per transfer.
  • Responding to one retry per unique OID in the queue.
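The old single-retry-pass behavior can be sketched roughly as follows. This is an illustrative simplification, not git-lfs's actual API: `transferable`, `transferOnce`, and `runOldQueue` are hypothetical names, and the retry collector is modeled as a plain slice.

```go
package main

import "fmt"

// transferable is a stand-in for the queue's Transferable interface.
type transferable struct{ oid string }

// transferOnce reports whether a single transfer attempt succeeded;
// "failing" simulates objects that always fail retriably.
func transferOnce(t transferable, failing map[string]bool) bool {
	return !failing[t.oid]
}

// runOldQueue does one main pass plus exactly one retry pass,
// mirroring the limitation described above.
func runOldQueue(objects []transferable, failing map[string]bool) (done, abandoned []string) {
	var retries []transferable
	// First pass: collect retriable failures (the "retryCollector" role).
	for _, t := range objects {
		if transferOnce(t, failing) {
			done = append(done, t.oid)
		} else {
			retries = append(retries, t)
		}
	}
	// Single retry pass: objects that fail again are abandoned.
	for _, t := range retries {
		if transferOnce(t, failing) {
			done = append(done, t.oid)
		} else {
			abandoned = append(abandoned, t.oid)
		}
	}
	return done, abandoned
}

func main() {
	done, abandoned := runOldQueue(
		[]transferable{{"a"}, {"b"}},
		map[string]bool{"b": true}, // "b" always fails
	)
	fmt.Println(done, abandoned)
}
```

An object that fails twice is abandoned here, no matter how large a retry budget we might want, which is exactly the limitation the PR removes.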

Now

Now, the transfer queue knows how to immediately retry an object. It works like this:

  1. Steps 1-4 from above.
  2. If an object failed, and can be retried, enqueue a retry.
  3. Collect the retry, and...
    1. Make sure that we have the budget to retry the object.
    2. Add() it to the next batch, or API channel (in legacy mode)
    3. If in batch mode, immediately flush the batch, forcing the batchApiRoutine function to receive a new batch.
  4. Wait until all objects have been either transferred, or abandoned.

(The ability to flush a batch was introduced in #1528, and enables the ability to immediately retry an object in a batch.)
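The new flow can be sketched as a loop in which failed objects are re-enqueued into an immediately flushed retry batch until their budget runs out. All names here (`object`, `process`, `maxRetries`) are illustrative, and channels/goroutines are replaced by plain slices for clarity.

```go
package main

import "fmt"

// maxRetries stands in for the per-object retry budget.
const maxRetries = 1

type object struct {
	oid     string
	retries int
}

// process loops until every object is either transferred or abandoned,
// modeling "flush the batch" as starting the next loop iteration.
func process(batch []object, failing map[string]bool) (done, abandoned []string) {
	for len(batch) > 0 {
		var next []object // the immediately flushed retry batch
		for _, o := range batch {
			switch {
			case !failing[o.oid]:
				done = append(done, o.oid)
			case o.retries < maxRetries:
				o.retries++
				next = append(next, o) // enqueue a retry
			default:
				abandoned = append(abandoned, o.oid) // budget exhausted
			}
		}
		batch = next
	}
	return done, abandoned
}

func main() {
	done, abandoned := process(
		[]object{{oid: "a"}, {oid: "b"}},
		map[string]bool{"b": true}, // "b" always fails
	)
	fmt.Println(done, abandoned)
}
```

Raising `maxRetries` in this sketch allows more passes without any structural change, which is the flexibility the real queue gains.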

In order to make sure that all items are processed before exiting, the way we treat the internal waitgroup q.wait has changed slightly. Previously, q.wait was incremented every time we tried to perform a transfer on an item. To prevent a situation where the WaitGroup could reach zero while waiting between failing an object and retrying it, the WaitGroup is now incremented only the first time an object begins a transfer.
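A minimal sketch of that WaitGroup change, assuming a simplified queue in which first attempts and retries are distinguished by a `started` set (the field names and the `adds` counter are illustrative, not the real q.wait code):

```go
package main

import (
	"fmt"
	"sync"
)

type queue struct {
	wait    sync.WaitGroup
	mu      sync.Mutex
	started map[string]bool
	adds    int // for illustration: how many times Add(1) actually ran
}

// begin marks an attempt for an OID; only the first attempt increments
// the WaitGroup, so a failure-then-retry gap cannot drop it to zero.
func (q *queue) begin(oid string) {
	q.mu.Lock()
	defer q.mu.Unlock()
	if !q.started[oid] {
		q.started[oid] = true
		q.adds++
		q.wait.Add(1)
	}
}

// finish is called exactly once per OID, when it is transferred or abandoned.
func (q *queue) finish() { q.wait.Done() }

func main() {
	q := &queue{started: map[string]bool{}}
	q.begin("a") // first attempt: Add(1)
	q.begin("a") // retry: no Add, so one finish() releases Wait
	go q.finish()
	q.wait.Wait()
	fmt.Println("all objects settled")
}
```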

The transfer queue also now keeps track of the number of retries made per OID, in order to prevent retrying a transfer infinitely. Currently, the maximum number of retries per object is set to 1 to preserve the behavior from before this PR, which is why there are no new tests.
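The per-OID accounting amounts to a counter checked against the budget before each retry. `canRetry`, `retryCounts`, and `maxRetriesPerOID` are hypothetical names for illustration, with the budget set to 1 as the PR describes:

```go
package main

import "fmt"

// maxRetriesPerOID mirrors the PR's current budget of one retry per object.
const maxRetriesPerOID = 1

// retryCounts tracks how many retries each OID has consumed.
var retryCounts = map[string]int{}

// canRetry consumes one unit of the OID's retry budget if any remains;
// once the budget is exhausted, the transfer is abandoned instead of
// retried forever.
func canRetry(oid string) bool {
	if retryCounts[oid] >= maxRetriesPerOID {
		return false
	}
	retryCounts[oid]++
	return true
}

func main() {
	fmt.Println(canRetry("deadbeef")) // first retry: allowed
	fmt.Println(canRetry("deadbeef")) // second retry: denied
}
```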

I've also left some line comments throughout the code to clarify some further things.

/cc @technoweenie @rubyist @sinbad for 👀 and 💭

Review context (the diff excerpts the comment below is attached to):

```go
for _, t := range batch {
	q.retry(t.(Transferable))
```

```go
var errOnce sync.Once
for _, o := range batch {
```
@ttaylorr (author) commented:
If eventually we end up grouping retried objects and "fresh" objects in the same batch, we should only fail objects that have exceeded their retry budget, and not all of the items in the batch at once. To keep the old behavior of reporting to the errorc channel the same, I wrapped it in a sync.Once, so it can only happen once.

@technoweenie (Contributor) left a comment:
This is a great small step forward 👍

@ttaylorr ttaylorr merged commit fae6810 into master Sep 22, 2016
@ttaylorr ttaylorr deleted the multiple-retries branch September 22, 2016 19:22