
docs/spec: catalog V2 API specification proposal #583

Closed
stevvooe wants to merge 2 commits into distribution:master from stevvooe:catalog-spec-proposal

Conversation


@stevvooe stevvooe commented Jun 2, 2015

This contains a proposal for a catalog API, providing access to the internal
contents of a registry instance. The API endpoint is prefixed with an
underscore, which is illegal in image names, to prevent collisions with
repository names. To avoid issues with large result sets, a paginated version
of the API is proposed. We make an addition to the tags API to support
pagination to ensure the specification is consistent.
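The pagination scheme described here can be illustrated with a small, self-contained model. This is not the registry implementation: `catalog_page` stands in for the server and `walk_catalog` for a client, assuming lexically sorted results and a `last` pivot as proposed.

```python
# Illustrative model of the proposed pagination (not the registry code).
# The server returns at most n names sorted lexically, starting strictly
# after `last`; an empty page signals the end of the catalog.

def catalog_page(repositories, n, last=""):
    """Server side: up to n repository names that sort after `last`."""
    return sorted(r for r in repositories if r > last)[:n]

def walk_catalog(repositories, n):
    """Client side: request pages until the server returns an empty one."""
    collected, last = [], ""
    while True:
        page = catalog_page(repositories, n, last)
        if not page:
            return collected
        collected.extend(page)
        last = page[-1]  # use the final entry as the next pivot

repos = {"app/api", "app/web", "base/alpine", "base/ubuntu", "tools/ci"}
print(walk_catalog(repos, 2))  # all five names, in lexical order
```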

Closes #441, #382.

Signed-off-by: Stephen J Day stephen.day@docker.com

cc @dmp42 @dmcgowan @ncdc @pdevine


wking commented Jun 2, 2015

On Mon, Jun 01, 2015 at 07:15:23PM -0700, Stephen Day wrote:

To avoid issues with large result sets, a paginated version of the
API is proposed.

To reduce load on the server from unlimited requests on large sets, it
might make sense to allow the server to set a default and maximum
value for n. That means clients will have to support pagination,
but it allows the server to put an upper bound on the amount of work
required to fill a single request.

You likely also want to return an envelope or header parameter with a
count for the full result, so paginators can say “page $i of $count”
in their results (or know what fraction they've already worked
through). That count may change in subsequent calls, but that's ok.

The decision to use ‘last’ instead of an integer offset (like
Elasticsearch and Redis, see [1,2]) means that the task of tracking a
consistent location in the face of mutating backing data is up to the
server (e.g., you need to figure out the appropriate offset for ‘last’
in your results, even if the key referenced by ‘last’ no longer
exists). Personally, I'm in favor of integer offsets and leaving the
robustness in the face of mutating lists up to the client, but without
clear external constraints it's probably just personal preference.

docs/spec/api.md (Outdated)
remove 'with'


stevvooe commented Jun 2, 2015

To reduce load on the server from unlimited requests on large sets, it
might make sense to allow the server to set a default and maximum
value for n. That means clients will have to support pagination,
but it allows the server to put an upper bound on the amount of work
required to fill a single request.

This specification leaves that to the server and client. A server can always refuse a client without support. Please note that catalog listing will always be hidden behind admin privileges (or maybe a listing scope), since it may expose registry-wide data.
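One way a server might enforce such limits without refusing requests outright is to clamp the requested page size. A minimal sketch under that assumption; `DEFAULT_N` and `MAX_N` are illustrative values, not part of the specification:

```python
# Hypothetical server-side limits; the spec does not mandate these
# names or values.
DEFAULT_N = 100
MAX_N = 1000

def clamp_page_size(requested=None):
    """Fall back to a default when n is absent; cap oversized requests
    instead of rejecting them with an error."""
    if requested is None:
        return DEFAULT_N
    return max(1, min(requested, MAX_N))

print(clamp_page_size())      # → 100
print(clamp_page_size(5000))  # → 1000
```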

You likely also want to return an envelope or header parameter with a
count for the full result, so paginators can say “page $i of $count”
in their results (or know what fraction they've already worked
through). That count may change in subsequent calls, but that's ok.

How would one implement this efficiently with the current backend layout and storage driver system? Perhaps, this is something we specify and don't implement.

The decision to use ‘last’ instead of an integer offset (like
Elasticsearch and Redis, see [1,2]) means that the task of tracking a
consistent location in the face of mutating backing data is up to the
server (e.g., you need to figure out the appropriate offset for ‘last’
in your results, even if the key referenced by ‘last’ no longer
exists). Personally, I'm in favor of integer offsets and leaving the
robustness in the face of mutating lists up to the client, but without
clear external constraints it's probably just personal preference.

The issue with an integer index is that it requires the server to assign a rank to the results. In the face of a changing data set, rank is always changing, whereas a "pivot" is a property of the data. Forcing this problem on the client just moves it somewhere it cannot be addressed. The registry may not even have a full view of its dataset to calculate the rank properly.
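The rank-vs-pivot distinction can be shown concretely. In this illustrative sketch (not registry code), deleting an entry between requests shifts every rank, so the offset-based second page silently skips an item, while the pivot-based page stays anchored to the data:

```python
# Illustrative comparison (not registry code): integer-offset vs.
# "last"-pivot pagination over a sorted, mutating collection.

def offset_page(data, offset, n):
    """Page by integer rank: position depends on the whole result set."""
    return sorted(data)[offset:offset + n]

def pivot_page(data, last, n):
    """Page by pivot: position is a property of the data itself."""
    return sorted(d for d in data if d > last)[:n]

repos = ["a", "b", "c", "d", "e", "f"]
first = offset_page(repos, 0, 2)   # ['a', 'b']
repos.remove("a")                  # the data set mutates between requests

# Every rank shifted, so the offset-based second page skips 'c':
print(offset_page(repos, 2, 2))                # → ['d', 'e']
# The pivot-based second page still resumes right after 'b':
print(pivot_page(repos, last=first[-1], n=2))  # → ['c', 'd']
```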

👍 on using a header, the property is similar to Location and Range

Oh, I see.

ekristen commented Jun 3, 2015

Looks good. +1


wking commented Jun 3, 2015

On Tue, Jun 02, 2015 at 04:00:38PM -0700, Stephen Day wrote:

To reduce load on the server from unlimited requests on large
sets, it might make sense to allow the server to set a default and
maximum value for n. That means clients will have to support
pagination, but it allows the server to put an upper bound on the
amount of work required to fill a single request.

This specification leaves that to the server and client. A server
can always refuse a client without support.

If server implementations can optionally not support the unlimited
endpoint (and still claim to satisfy the spec) and can also enforce
limits on n (and still claim to satisfy the spec), it's probably worth
mentioning that explicitly in the spec. From reading it now, I'd
consider a server that 400ed my large-n request to be nonconformant.

Please note that catalog listing will always be hidden by admin
privileges (or maybe a listing scope), since it may expose registry
wide data.

Ah, good point. So I guess relying on well-behaved clients to not
abuse expensive, unlimited calls isn't that unreasonable.

You likely also want to return an envelope or header parameter
with a count for the full result, so paginators can say “page $i
of $count” in their results (or know what fraction they've already
worked through). That count may change in subsequent calls, but
that's ok.

How would one implement this efficiently with the current backend
layout and storage driver system? Perhaps, this is something we
specify and don't implement.

The current filesystem driver already has a len(fileNames) call 1,
but yeah, that would be harder with S3. I feel like this is a common
feature of paginated results though, so we should either figure out
how to support it with drivers like S3 (I've suggested separating
transactional mutable storage from content-addressable storage before,
docker-archive/docker-registry#704), or explain why our clients won't need
this common feature.

The decision to use ‘last’ instead of an integer offset (like
Elasticsearch and Redis, see [1,2]) means that the task of
tracking a consistent location in the face of mutating backing
data is up to the server (e.g., you need to figure out the
appropriate offset for ‘last’ in your results, even if the key
referenced by ‘last’ no longer exists). Personally, I'm in favor
of integer offsets and leaving the robustness in the face of
mutating lists up to the client, but without clear external
constraints it's probably just personal preference.

The issue with an integer index is that it requires the server
assign rank to the results.

Yeah. I guess the issue is how you expect storage backends to scale
internally. If they shard out lexically, then it's easier for them to
look up an offset using a ‘last’ entry. If they shard out into counted
bins, then it's easier for them to look up by an integer offset.
Lexical sharding works pretty well for cryptographic hashes, but some
of our key sets (tags and image names) are user-supplied text, so I
expect storage backends will have to figure out a non-lexical sharding
scheme anyway, and attaching ancestor counts to entries in that tree
doesn't seem too hard. Once they have ancestor-counting sharding for
the unhashed keys, I think they might as well use the same scheme for
all keysets. We're unlikely to extend the storage API to allow
flagging of crypto-hashed keysets, so the storage driver will probably
have to use the same scheme for all keysets.


stevvooe commented Jun 4, 2015

If server implementations can optionally not support the unlimited
endpoint (and still claim to satisfy the spec) and can also enforce
limits on n (and still claim to satisfy the spec), it's probably worth
mentioning that explicitly in the spec. From reading it now, I'd
consider a server that 400ed my large-n request to be nonconformant.

I added this to cover that. This endpoint can be protected by access control or just return an empty result.


wking commented Jun 4, 2015

On Wed, Jun 03, 2015 at 06:27:47PM -0700, Stephen Day wrote:

If server implementations can optionally not support the unlimited
endpoint (and still claim to satisfy the spec) and can also
enforce limits on n (and still claim to satisfy the spec), it's
probably worth mentioning that explicitly in the spec. From
reading it now, I'd consider a server that 400ed my large-n
request to be nonconformant.

I added this to cover that. This endpoint can be protected by access
control or just return an empty result.

That helps, but it only mentions authorization and upstream concerns.
For those cases, there's no reason for a client to perform a follow-up
request. I'd like to see it also mention the case where a client asks
for $n results, but we only return $m < $n for efficiency reasons. In
that case, the client is welcome to perform a follow-up request with
an increased last/offset.


stevvooe commented Jun 4, 2015

@wking I'm not sure that belongs in the specification. Please provide updated language for that part if you believe it should be changed. Perhaps, we can have an option to redirect to the paginated version but that is not really the same resource.

No matter what, this is an inefficient endpoint. Most applications would be better served by accessing the storage backend directly, but no one seems willing to do this. It's unfortunate, since we took measures to make the layout very approachable. Clearly, this was a poor trade off.


wking commented Jun 7, 2015

On Thu, Jun 04, 2015 at 11:30:15AM -0700, Stephen Day wrote:

@wking I'm not sure that belongs in the specification. Please
provide updated language for that part if you believe it should be
changed. Perhaps, we can have an option to redirect to the paginated
version but that is not really the same resource.

Yeah, I guess I'd have only the paginated version in the spec.
Since you pointed out the admin scoping 1, having a potentially
expensive API isn't really a big deal. If both the registry and
admin-client devs think the registry is up to serving unpaginated
results, then I guess they're welcome to do that.

No matter what, this is an inefficient endpoint. Most applications
would be better served by accessing the storage backend directly,
but no one seems willing to do this.

It seems worthwhile to have a way to write backend-agnostic admin
tooling, and let folks who hit performance issues with the
backend-agnostic API go directly to the backend. So I think having
both options available is a good thing.

Of course, you could also get the backend-agnostic API by making the
backends independent processes that listen on their own socket. Then
both the registry and admin tools could talk to the backend directly
via the same storage-driver API, instead of proxying the calls through
the registry here:

(registry-client) →{registry API}→ (registry) →{storage-API}↘
(admin-client) →{storage API}→ (stand-alone-storage-driver-process) → (storage)

But for storage-drivers built on sufficiently atomic drivers/storage,
you can probably just run parallel processes with in-process
drivers:

(registry-client) →{registry API}→ (registry →{storage API}→ storage-driver) ↘
(admin-client) →{storage API}→ (stand-alone-storage-driver-process) → (storage)

with both storage-drivers running on the same backing storage.

@shaded-enmity

Label/Metadata inspection

(formerly #600)

This proposal adds metadata inspection to the catalog API.
While labels provide the most specific information, it might also be viable to inspect other parts of the image metadata and the manifest itself.
The functionality could be modelled after PROPFIND in WebDAV (RFC 4918), with the exception of using JSON documents instead of XML.

The endpoint could be like:
/v2/_catalog/metadata?repo=my/repo

The body of the request would contain a specification of which metadata to return, as well as the pagination information specified above.
The metadata specification could possibly be in any of the following formats:

  1. MongoDB query
  2. jq
  3. JPath

The catalogue would return one record per tag in the repository:

Request: (jq example)
.[] | {version: .labels.acme.version, options: .labels.acme.provides, id: .v1Compatibility[0].Id}
Response:

{
    "tag": "latest",
    "digest": "sha256:...",
    "result": {
       "version": "1.0.0-1",
       "options": "journald,Xsocket",
       "id": "654d2ca2aaa2f1244ea40d8290883c5964a7f971f778e230414b7830e3829867"
    }
}

Since the filtering could possibly end up producing large amounts of data, I think it would be sane to limit the response size to, say, 8 KB per record (or some empirical ratio based on average manifest size).
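The per-tag record shape proposed above can be mimicked with a small sketch. This is purely illustrative; the `extract` helper and the flattened label keys are hypothetical, not part of any registry API:

```python
# Hypothetical sketch of the proposed per-tag record shape; `extract`
# and the flat label keys are illustrative, not a real registry API.

def extract(tag, digest, metadata, fields):
    """Build one result record per tag from selected label fields."""
    labels = metadata.get("labels", {})
    return {
        "tag": tag,
        "digest": digest,
        "result": {name: labels.get(key) for name, key in fields.items()},
    }

meta = {"labels": {"acme.version": "1.0.0-1",
                   "acme.provides": "journald,Xsocket"}}
record = extract("latest", "sha256:...", meta,
                 {"version": "acme.version", "options": "acme.provides"})
print(record["result"])  # → {'version': '1.0.0-1', 'options': 'journald,Xsocket'}
```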

What do you think? @stevvooe @ncdc

@stevvooe

@shaded-enmity This is completely outside of the scope and context of this proposal. As stated in #600, the registry provides access to data. It has no facilities for indexing and collating data except under its primary key (tag, name, digest, etc.). Adding such functionality is a massive departure from those design goals.


@stevvooe stevvooe changed the title [WIP] docs/spec: catalog V2 API specification proposal docs/spec: catalog V2 API specification proposal Jul 13, 2015
stevvooe added 2 commits July 13, 2015 15:33
This contains a proposal for a catalog API, providing access to the internal
contents of a registry instance. The API endpoint is prefixed with an
underscore, which is illegal in image names, to prevent collisions with
repository names. To avoid issues with large result sets, a paginated version
of the API is proposed. We make an addition to the tags API to support
pagination to ensure the specification is consistent.

Signed-off-by: Stephen J Day <stephen.day@docker.com>
Move the specification to use a Link header, rather than a "next" entry in the
json results. This avoids requiring clients to parse the response body to
issue the next request. It also ensures that the returned response body does
not change in between requests.

The ordering of the specification has been slightly tweaked, as well. Listing
image tags has been moved after the catalog specification. Tag pagination now
heavily references catalog pagination.

Signed-off-by: Stephen J Day <stephen.day@docker.com>
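As a client-side illustration of the Link-header approach this commit describes, here is a minimal sketch that extracts the rel="next" URL from an RFC 5988 Link header; the URL shape assumes the catalog endpoint from this proposal:

```python
import re

# Sketch of client-side handling of a pagination Link header; the URL
# shape assumes the catalog endpoint proposed in this PR.
def next_link(link_header):
    """Return the rel="next" target from an RFC 5988 Link header, if any."""
    for match in re.finditer(r'<([^>]*)>\s*;\s*rel="([^"]*)"', link_header):
        url, rel = match.groups()
        if rel == "next":
            return url
    return None

header = '</v2/_catalog?n=100&last=base/ubuntu>; rel="next"'
print(next_link(header))  # → /v2/_catalog?n=100&last=base/ubuntu
```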
@stevvooe stevvooe force-pushed the catalog-spec-proposal branch from 63af0b9 to 777c0fb Compare July 13, 2015 22:34
@stevvooe

Closed in favor of #653.

@stevvooe stevvooe closed this Jul 23, 2015

Development

Successfully merging this pull request may close these issues.

how list all images or search api?

7 participants