
docs/spec: catalog V2 API specification proposal #583

Closed
stevvooe wants to merge 2 commits into distribution:master from stevvooe:catalog-spec-proposal

Conversation


@stevvooe stevvooe commented Jun 2, 2015

This contains a proposal for a catalog API, providing access to the internal
contents of a registry instance. The API endpoint is prefixed with an
underscore, which is illegal in image names, to prevent collisions with
repository names. To avoid issues with large result sets, a paginated version
of the API is proposed. We make an addition to the tags API to support
pagination to ensure the specification is consistent.
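The pagination scheme described here can be illustrated with a small, self-contained model. This is not the registry implementation: `catalog_page` stands in for the server and `walk_catalog` for a client, assuming lexically sorted results and a `last` pivot as proposed.

```python
# Illustrative model of the proposed pagination (not the registry code).
# The server returns at most n names sorted lexically, starting strictly
# after `last`; an empty page signals the end of the catalog.

def catalog_page(repositories, n, last=""):
    """Server side: up to n repository names that sort after `last`."""
    return sorted(r for r in repositories if r > last)[:n]

def walk_catalog(repositories, n):
    """Client side: request pages until the server returns an empty one."""
    collected, last = [], ""
    while True:
        page = catalog_page(repositories, n, last)
        if not page:
            return collected
        collected.extend(page)
        last = page[-1]  # use the final entry as the next pivot

repos = {"app/api", "app/web", "base/alpine", "base/ubuntu", "tools/ci"}
print(walk_catalog(repos, 2))  # all five names, in lexical order
```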

Closes #441, #382.

Signed-off-by: Stephen J Day stephen.day@docker.com

cc @dmp42 @dmcgowan @ncdc @pdevine


wking commented Jun 2, 2015

On Mon, Jun 01, 2015 at 07:15:23PM -0700, Stephen Day wrote:

To avoid issues with large result sets, a paginated version of the
API is proposed.

To reduce load on the server from unlimited requests on large sets, it
might make sense to allow the server to set a default and maximum
value for n. That means clients will have to support pagination,
but it allows the server to put an upper bound on the amount of work
required to fill a single request.

You likely also want to return an envelope or header parameter with a
count for the full result, so paginators can say “page $i of $count”
in their results (or know what fraction they've already worked
through). That count may change in subsequent calls, but that's ok.

The decision to use ‘last’ instead of an integer offset (like
Elasticsearch and Redis, see [1,2]) means that the task of tracking a
consistent location in the face of mutating backing data is up to the
server (e.g., you need to figure out the appropriate offset for ‘last’
in your results, even if the key referenced by ‘last’ no longer
exists). Personally, I'm in favor of integer offsets and leaving the
robustness in the face of mutating lists up to the client, but without
clear external constraints it's probably just personal preference.

docs/spec/api.md (Outdated)
remove 'with'


stevvooe commented Jun 2, 2015

To reduce load on the server from unlimited requests on large sets, it
might make sense to allow the server to set a default and maximum
value for n. That means clients will have to support pagination,
but it allows the server to put an upper bound on the amount of work
required to fill a single request.

This specification leaves that to the server and client. A server can always refuse a client without support. Please note that catalog listing will always be hidden behind admin privileges (or maybe a listing scope), since it may expose registry-wide data.
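One way a server might enforce such limits without refusing requests outright is to clamp the requested page size. A minimal sketch under that assumption; `DEFAULT_N` and `MAX_N` are illustrative values, not part of the specification:

```python
# Hypothetical server-side limits; the spec does not mandate these
# names or values.
DEFAULT_N = 100
MAX_N = 1000

def clamp_page_size(requested=None):
    """Fall back to a default when n is absent; cap oversized requests
    instead of rejecting them with an error."""
    if requested is None:
        return DEFAULT_N
    return max(1, min(requested, MAX_N))

print(clamp_page_size())      # → 100
print(clamp_page_size(5000))  # → 1000
```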

You likely also want to return an envelope or header parameter with a
count for the full result, so paginators can say “page $i of $count”
in their results (or know what fraction they've already worked
through). That count may change in subsequent calls, but that's ok.

How would one implement this efficiently with the current backend layout and storage driver system? Perhaps, this is something we specify and don't implement.

The decision to use ‘last’ instead of an integer offset (like
Elasticsearch and Redis, see [1,2]) means that the task of tracking a
consistent location in the face of mutating backing data is up to the
server (e.g., you need to figure out the appropriate offset for ‘last’
in your results, even if the key referenced by ‘last’ no longer
exists). Personally, I'm in favor of integer offsets and leaving the
robustness in the face of mutating lists up to the client, but without
clear external constraints it's probably just personal preference.

The issue with an integer index is that it requires the server to assign a rank to the results. In the face of a changing data set, rank is always changing, whereas a "pivot" is a property of the data. Forcing this problem on the client just moves it somewhere it cannot be addressed. The registry may not even have a full view of its dataset to calculate the rank properly.
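The rank-vs-pivot distinction can be shown concretely. In this illustrative sketch (not registry code), deleting an entry between requests shifts every rank, so the offset-based second page silently skips an item, while the pivot-based page stays anchored to the data:

```python
# Illustrative comparison (not registry code): integer-offset vs.
# "last"-pivot pagination over a sorted, mutating collection.

def offset_page(data, offset, n):
    """Page by integer rank: position depends on the whole result set."""
    return sorted(data)[offset:offset + n]

def pivot_page(data, last, n):
    """Page by pivot: position is a property of the data itself."""
    return sorted(d for d in data if d > last)[:n]

repos = ["a", "b", "c", "d", "e", "f"]
first = offset_page(repos, 0, 2)   # ['a', 'b']
repos.remove("a")                  # the data set mutates between requests

# Every rank shifted, so the offset-based second page skips 'c':
print(offset_page(repos, 2, 2))                # → ['d', 'e']
# The pivot-based second page still resumes right after 'b':
print(pivot_page(repos, last=first[-1], n=2))  # → ['c', 'd']
```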

👍 on using a header, the property is similar to Location and Range

Oh, I see.

ekristen commented Jun 3, 2015

Looks good. +1


wking commented Jun 3, 2015

On Tue, Jun 02, 2015 at 04:00:38PM -0700, Stephen Day wrote:

To reduce load on the server from unlimited requests on large
sets, it might make sense to allow the server to set a default and
maximum value for n. That means clients will have to support
pagination, but it allows the server to put an upper bound on the
amount of work required to fill a single request.

This specification leaves that to the server and client. A server
can always refuse a client without support.

If server implementations can optionally not support the unlimited
endpoint (and still claim to satisfy the spec) and can also enforce
limits on n (and still claim to satisfy the spec), it's probably worth
mentioning that explicitly in the spec. From reading it now, I'd
consider a server that 400ed my large-n request to be nonconformant.

Please note that catalog listing will always be hidden by admin
privileges (or maybe a listing scope), since it may expose registry
wide data.

Ah, good point. So I guess relying on well-behaved clients to not
abuse expensive, unlimited calls isn't that unreasonable.

You likely also want to return an envelope or header parameter
with a count for the full result, so paginators can say “page $i
of $count” in their results (or know what fraction they've already
worked through). That count may change in subsequent calls, but
that's ok.

How would one implement this efficiently with the current backend
layout and storage driver system? Perhaps, this is something we
specify and don't implement.

The current filesystem driver already has a len(fileNames) call 1,
but yeah, that would be harder with S3. I feel like this is a common
feature of paginated results though, so we should either figure out
how to support it with drivers like S3 (I've suggested separating
transactional mutable storage from content-addressable storage before,
docker-archive/docker-registry#704), or explain why our clients won't need
this common feature.

The decision to use ‘last’ instead of an integer offset (like
Elasticsearch and Redis, see [1,2]) means that the task of
tracking a consistent location in the face of mutating backing
data is up to the server (e.g., you need to figure out the
appropriate offset for ‘last’ in your results, even if the key
referenced by ‘last’ no longer exists). Personally, I'm in favor
of integer offsets and leaving the robustness in the face of
mutating lists up to the client, but without clear external
constraints it's probably just personal preference.

The issue with an integer index is that it requires the server
assign rank to the results.

Yeah. I guess the issue is how you expect storage backends to scale
internally. If they shard out lexically, then it's easier for them to
look up an offset using a ‘last’ entry. If they shard out into counted
bins, then it's easier for them to look up by an integer offset.
Lexical sharding works pretty well for cryptographic hashes, but some
of our key sets (tags and image names) are user-supplied text, so I
expect storage backends will have to figure out a non-lexical sharding
scheme anyway, and attaching ancestor counts to entries in that tree
doesn't seem too hard. Once they have ancestor-counting sharding for
the unhashed keys, I think they might as well use the same scheme for
all keysets. We're unlikely to extend the storage API to allow
flagging of crypto-hashed keysets, so the storage driver will probably
have to use the same scheme for all keysets.


stevvooe commented Jun 4, 2015

If server implementations can optionally not support the unlimited
endpoint (and still claim to satisfy the spec) and can also enforce
limits on n (and still claim to satisfy the spec), it's probably worth
mentioning that explicitly in the spec. From reading it now, I'd
consider a server that 400ed my large-n request to be nonconformant.

I added this to cover that. This endpoint can be protected by access control or just return an empty result.


wking commented Jun 4, 2015

On Wed, Jun 03, 2015 at 06:27:47PM -0700, Stephen Day wrote:

If server implementations can optionally not support the unlimited
endpoint (and still claim to satisfy the spec) and can also
enforce limits on n (and still claim to satisfy the spec), it's
probably worth mentioning that explicitly in the spec. From
reading it now, I'd consider a server that 400ed my large-n
request to be nonconformant.

I added this to cover that. This endpoint can be protected by access
control or just return an empty result.

That helps, but it only mentions authorization and upstream concerns.
For those cases, there's no reason for a client to perform a follow-up
request. I'd like to see it also mention the case where a client asks
for $n results, but we only return $m < $n for efficiency reasons. In
that case, the client is welcome to perform a follow-up request with
an increased last/offset.


stevvooe commented Jun 4, 2015

@wking I'm not sure that belongs in the specification. Please provide updated language for that part if you believe it should be changed. Perhaps, we can have an option to redirect to the paginated version but that is not really the same resource.

No matter what, this is an inefficient endpoint. Most applications would be better served by accessing the storage backend directly, but no one seems willing to do this. It's unfortunate, since we took measures to make the layout very approachable. Clearly, this was a poor trade off.


wking commented Jun 7, 2015

On Thu, Jun 04, 2015 at 11:30:15AM -0700, Stephen Day wrote:

@wking I'm not sure that belongs in the specification. Please
provide updated language for that part if you believe it should be
changed. Perhaps, we can have an option to redirect to the paginated
version but that is not really the same resource.

Yeah, I guess I'd have only the paginated version in the spec.
Since you pointed out the admin scoping 1, having a potentially
expensive API isn't really a big deal. If both the registry and
admin-client devs think the registry is up to serving unpaginated
results, then I guess they're welcome to do that.

No matter what, this is an inefficient endpoint. Most applications
would be better served by accessing the storage backend directly,
but no one seems willing to do this.

It seems worthwhile to have a way to write backend-agnostic admin
tooling, and let folks who hit performance issues with the
backend-agnostic API go directly to the backend. So I think having
both options available is a good thing.

Of course, you could also get the backend-agnostic API by making the
backends independent processes that listen on their own socket. Then
both the registry and admin tools could talk to the backend directly
via the same storage-driver API, instead of proxying the calls through
the registry here:

(registry-client) →{registry API}→ (registry) →{storage-API}↘
(admin-client) →{storage API}→ (stand-alone-storage-driver-process) → (storage)

But for storage-drivers built on sufficiently atomic drivers/storage,
you can probably just run parallel processes with in-process
drivers:

(registry-client) →{registry API}→ (registry →{storage API}→ storage-driver) ↘
(admin-client) →{storage API}→ (stand-alone-storage-driver-process) → (storage)

with both storage-drivers running on the same backing storage.

@shaded-enmity

Label/Metadata inspection

(formerly #600)

This proposal adds metadata inspection to the catalog API.
While labels provide the most specific information, it might also be viable to inspect other parts of the image metadata and the manifest itself.
The functionality could be modelled after PROPFIND in WebDAV (RFC 4918), with the exception of using JSON documents instead of XML.

The endpoint could be like:
/v2/_catalog/metadata?repo=my/repo

The body of the request would contain a specification of which metadata to return, as well as the pagination information specified above.
The metadata specification could possibly be in any of the following formats:

  1. MongoDB query
  2. jq
  3. JPath

The catalogue would return one record per tag in the repository:

Request: (jq example)
.[] | {version: .labels.acme.version, options: .labels.acme.provides, id: .v1Compatibility[0].Id}
Response:

{
    "tag": "latest",
    "digest": "sha256:...",
    "result": {
       "version": "1.0.0-1",
       "options": "journald,Xsocket",
       "id": "654d2ca2aaa2f1244ea40d8290883c5964a7f971f778e230414b7830e3829867"
    }
}

Since the filtering could possibly end up producing large amounts of data, I think it would be sane to limit the response size to, say, 8 KB per record (or some empirical ratio based on average manifest size).
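The per-tag record shape proposed above can be mimicked with a small sketch. This is purely illustrative; the `extract` helper and the flattened label keys are hypothetical, not part of any registry API:

```python
# Hypothetical sketch of the proposed per-tag record shape; `extract`
# and the flat label keys are illustrative, not a real registry API.

def extract(tag, digest, metadata, fields):
    """Build one result record per tag from selected label fields."""
    labels = metadata.get("labels", {})
    return {
        "tag": tag,
        "digest": digest,
        "result": {name: labels.get(key) for name, key in fields.items()},
    }

meta = {"labels": {"acme.version": "1.0.0-1",
                   "acme.provides": "journald,Xsocket"}}
record = extract("latest", "sha256:...", meta,
                 {"version": "acme.version", "options": "acme.provides"})
print(record["result"])  # → {'version': '1.0.0-1', 'options': 'journald,Xsocket'}
```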

What do you think? @stevvooe @ncdc

@stevvooe

@shaded-enmity This is completely outside of the scope and context of this proposal. As stated in #600, the registry provides access to data. It has no facilities for indexing and collating data except under its primary key (tag, name, digest, etc.). Adding such functionality is a massive departure from those design goals.


@stevvooe stevvooe changed the title [WIP] docs/spec: catalog V2 API specification proposal docs/spec: catalog V2 API specification proposal Jul 13, 2015
stevvooe added 2 commits July 13, 2015 15:33
This contains a proposal for a catalog API, providing access to the internal
contents of a registry instance. The API endpoint is prefixed with an
underscore, which is illegal in image names, to prevent collisions with
repository names. To avoid issues with large result sets, a paginated version
of the API is proposed. We make an addition to the tags API to support
pagination to ensure the specification is consistent.

Signed-off-by: Stephen J Day <stephen.day@docker.com>
Move the specification to use a Link header, rather than a "next" entry in the
json results. This avoids requiring clients to parse the response body to
issue the next request. It also ensures that the returned response body does
not change in between requests.

The ordering of the specification has been slightly tweaked, as well. Listing
image tags has been moved after the catalog specification. Tag pagination now
heavily references catalog pagination.

Signed-off-by: Stephen J Day <stephen.day@docker.com>
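As a client-side illustration of the Link-header approach this commit describes, here is a minimal sketch that extracts the rel="next" URL from an RFC 5988 Link header; the URL shape assumes the catalog endpoint from this proposal:

```python
import re

# Sketch of client-side handling of a pagination Link header; the URL
# shape assumes the catalog endpoint proposed in this PR.
def next_link(link_header):
    """Return the rel="next" target from an RFC 5988 Link header, if any."""
    for match in re.finditer(r'<([^>]*)>\s*;\s*rel="([^"]*)"', link_header):
        url, rel = match.groups()
        if rel == "next":
            return url
    return None

header = '</v2/_catalog?n=100&last=base/ubuntu>; rel="next"'
print(next_link(header))  # → /v2/_catalog?n=100&last=base/ubuntu
```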
@stevvooe stevvooe force-pushed the catalog-spec-proposal branch from 63af0b9 to 777c0fb Compare July 13, 2015 22:34
@stevvooe

Closed in favor of #653.

@stevvooe stevvooe closed this Jul 23, 2015

Development

Successfully merging this pull request may close these issues.

how list all images or search api?

7 participants