Endpoint for batch upload/download operations #285

Merged
technoweenie merged 38 commits into master from multitransfer
May 28, 2015
Conversation

@rubyist
Contributor

@rubyist rubyist commented May 5, 2015

This PR allows the LFS client to batch its upload and download operations.

Typically, the client has to ask the server about each object it wants to upload or download, resulting in many http round trips. This causes significant delay when dealing with a large amount of objects.

This PR proposes a change to the API that provides a batch endpoint that gives the client information about a number of objects at once.

The client will POST an array of oid/size objects to the batch endpoint, and the endpoint will return an array of link relations for each object asked about. The presence or absence of the download, upload, and verify relations indicates the state of each object:

  • download present - the server has the object and has verified it
  • upload present - the server does not have the object; the client can upload it and then follow verify (if given)
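The relation-based states above can be sketched as a small client-side dispatch. This is a minimal illustration of the idea, not the actual client code; the `links` key and relation names here are assumptions drawn from this thread, not a final spec.

```python
def classify(obj):
    """Decide what to do with one object based on which relations are present.

    `obj` is one entry from a hypothetical batch response, with a `links`
    dict keyed by relation name (illustrative structure only).
    """
    links = obj.get("links", {})
    if "download" in links:
        # server has the object and has verified it
        return "download"
    if "upload" in links:
        # client should upload, then follow "verify" if that relation is given
        return "verify-after-upload" if "verify" in links else "upload"
    return "unknown"

response = [
    {"oid": "abcdef", "links": {"download": {"href": "..."}}},
    {"oid": "123456", "links": {"upload": {"href": "..."}, "verify": {"href": "..."}}},
]
actions = [classify(o) for o in response]
```

The point of deciding by presence/absence of relations (rather than an explicit status field) is that the server controls the next step purely through hypermedia, which is what the thread below calls out as the appeal of a single endpoint.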

TODO:

  • Spec the batch endpoints
  • Upload queue can work with batch endpoint
  • Add a get command to bulk download

@rubyist rubyist added the wip label May 5, 2015
@technoweenie
Contributor

Interesting. I imagined having endpoints like /download, /upload, and /verify. Having one endpoint that can send the correct hypermedia relations is better 👍

@technoweenie
Contributor

/cc @ddanier @meltingice since you've both worked on Git LFS server implementations.

@technoweenie
Contributor

Any thoughts on always sending and returning objects? So instead of:

[
  { "oid": "abcdef" }, ...
]

Send a hash instead. It gives us a place to add extra metadata to the request. Maybe we want to tell the server which local commit we're downloading the objects for?

{
  "commit": "sha1",
  "branch": "master",
  "objects": [
    { "oid": "abcdef" }, ...
  ]
}

@rubyist
Contributor Author

rubyist commented May 5, 2015

I like that idea 👍

@meltingice

Cool, I like this idea. Should help to speed things up a lot.

Is this going to be a mandatory endpoint, or is there some way the client can discover whether or not the server offers batching? Perhaps try the batch endpoint first, and then fall back to non-batched endpoints if it 404's?

@technoweenie
Contributor

Is this going to be a mandatory endpoint, or is there some way the client can discover whether or not the server offers batching? Perhaps try the batch endpoint first, and then fall back to non-batched endpoints if it 404's?

I'd like to transition the code to using the batch endpoint as the single required endpoint for servers to implement for Git LFS v1.0.

  1. First, ship experimental support. Enable some local lfs.batch config on a repository to use it.
  2. Make the batch API the default, and fall back to the original API if the server returns a 404 or lfs.batch is false.
  3. Remove support for the old /objects API.
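The transition plan in steps 1-3 amounts to a simple try-batch-then-fall-back flow. A minimal sketch, assuming hypothetical stand-ins `batch_request` and `legacy_object_request` for the real HTTP calls (these names are illustrative, not the actual git-lfs client API):

```python
def legacy_object_request(obj):
    # stand-in for the original /objects API: one round trip per object
    return {"oid": obj["oid"], "via": "legacy"}

def batch_request(objects, server_supports_batch=True):
    # stand-in for the batch endpoint: one round trip for all objects,
    # returning (http_status, body)
    if not server_supports_batch:
        return 404, None
    return 200, [{"oid": o["oid"], "via": "batch"} for o in objects]

def fetch_object_info(objects, config, server_supports_batch=True):
    # lfs.batch=false forces the old per-object API (step 2's opt-out)
    if config.get("lfs.batch") == "false":
        return [legacy_object_request(o) for o in objects]
    status, body = batch_request(objects, server_supports_batch)
    if status == 404:
        # server has no batch endpoint: fall back per object
        return [legacy_object_request(o) for o in objects]
    return body
```

The 404 fallback is what lets step 2 ship safely against servers that never implement batching, until step 3 removes the old path entirely.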

@ddanier
Contributor

ddanier commented May 6, 2015

I think a batch job would be really nice.

However I don't like the idea of adding additional data to the requests (commit sha1, branch). The Git LFS server is currently pretty easy to implement, as it's basically a simple key/value store. This is, in my eyes, one of the key features that makes it a feasible solution for many git providers (GitLab, for example, could add this without many problems; I did my own implementation in just under 10 hours). Adding more information might not complicate things now, but it opens the door to doing so, and in the end that could make implementations much harder.

Another thing to keep in mind is that we are talking about big files, so the API request to check a file's existence should have a pretty low impact on overall performance; most of the time should be spent uploading or downloading the file itself. So although adding a batch endpoint is a great thing from a technical perspective, it might not necessarily be that much better from a usability perspective. My personal opinion is to implement it anyway, as you cannot change this so easily once more servers exist.

@technoweenie
Contributor

However I don't like the idea of adding additional data to the requests (commit sha1, branch).

I want to add this metadata in a way that servers aren't required to process. It should be easy to focus on just the proposed "objects" key if that's all your implementation cares about. I'm hoping to keep the API simple for small servers, with optional features that bigger servers can choose to implement.
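To make the "optional metadata" point concrete: a minimal server handler can read only the "objects" key and never look at "commit" or "branch". This is an illustrative sketch (the `status` field is invented for the example), not a real server implementation:

```python
def handle_batch(payload):
    # Optional top-level keys such as "commit" or "branch" are simply
    # never read; a small server only cares about "objects".
    return [{"oid": o["oid"], "status": "ok"} for o in payload["objects"]]

# Request shape follows the example payloads earlier in this thread.
req = {"commit": "sha1", "branch": "master", "objects": [{"oid": "abcdef"}]}
resp = handle_batch(req)
```

Because JSON consumers ignore unknown keys by default, the extra metadata costs small implementations nothing while remaining available to servers that want it.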

So the API request to check a file's existence should have a pretty low impact on overall performance; most of the time should be spent uploading or downloading the file itself.

I hope to ship support for both APIs in the client first, defaulting to the current key/value methods. If we can't prove that reducing the API overhead helps in a meaningful way, we can scrap it. I'm anxious to compare this to the work done in #258.

@rubyist
Contributor Author

rubyist commented May 26, 2015

I should add that since github.com does not yet have a batch api, any testing will need to be done against the lfs test server. A matching PR containing a batch endpoint is here: git-lfs/lfs-test-server#27

rubyist added 2 commits May 26, 2015 15:46
* Ensure concurrent values are at least 1
* Ensure batch boolean follows git config's rules
* Tests for each
Contributor

I've started using this pattern:

git commit -m "add a.dat" 2>&1 | tee commit.log
grep "master (root-commit)" commit.log

The nice thing is that the output is only printed to STDOUT once, via tee. If the last grep fails, you'll still see the commit output once, along with the matches from the successful grep commands.

rubyist and others added 9 commits May 27, 2015 15:03
Reorganize the transfer queue to provide a channel to watch for object
OIDs as they finish. This can be used in the `get` command to feed a
goroutine that will copy the file to the working directory and inform
the update-index process about it as the transfers finish. This leads to
a greatly reduced amount of time spent updating the index after a get.
 is no longer needed, just 'git lfs'
grep and string comparison improvements from test-happy-path.sh
@technoweenie
Contributor

Merging this, as it seems like a great improvement, and will serve as a base for future changes.

technoweenie added a commit that referenced this pull request May 28, 2015
Endpoint for batch upload/download operations
@technoweenie technoweenie merged commit ac67b68 into master May 28, 2015
@technoweenie technoweenie deleted the multitransfer branch May 28, 2015 16:50
@catphish
Contributor

As per #341 I'd really appreciate it if this capability could be made available via the SSH API in addition to HTTP.

