Refcount images and cleanup orphans by wking · Pull Request #409 · docker-archive/docker-registry

wking · 2014-06-04T05:38:13Z

Also add versioning to the storage backend, and lay some groundwork
for a more opaque storage driver. It shouldn't take too much more
work to factor out the file-based assumptions out of driver.Base, but
I've left that for later work. See the commit messages for details on
specific changes. And let Travis complete before merging, since I
haven't even run this locally yet ;). I'll try and get on tomorrow
and link this to some earlier discussion, but I don't have time to
look up the references now ;).

samalba · 2014-06-04T15:26:59Z

I think we should have that completely optional. The registry should work without it. On a large dataset, if a refcount is not updated and the refcount index gets corrupted, what are the options to rebuild it? I think it's suitable only for small dataset by design.

wking · 2014-06-04T15:46:51Z

On Wed, Jun 04, 2014 at 08:27:01AM -0700, Sam Alba wrote:

I think we should have that completely optional. The registry should
work without it.

Can do. However, the only use cases I can think of where you'd want to
turn it off are:

If you have external tags pointing to images, which would not be
accounted for in the registry's internal refcounting.
If you wanted to archive all the images you'd ever seen, but not the
tags that referenced them.

Neither is particularly convincing to me, but there's no accounting
for taste ;).

On a large dataset, if a refcount is not updated and the refcount
index gets corrupted, what are the options to rebuild it?

As in “docker-registry crashes while/before writing the references
files”? I see two cases there:

Crashed during a write. When restarted, this can be detected by
invalid JSON in the references file, and we'd have to walk the whole
repositories/tags tree to see if any of those files depended on our
corrupted image. Slow, but that's ok for recovering from such an
unlikely situation.
Crashed before descending to a particular image in the ancestry
stack. I think this is also unlikely, and it would be marked by the
complete lack of a references file. Same recovery as for “crashed
during a write”.

Neither case is currently handled in this PR, but I can add the code
to do so. Are there other cases I should consider?

I think it's suitable only for small dataset by design.

Why is that? Is the problem the long list of references for
commonly-used base images. For example, I imagine index.docker.io has
quite a number of references to the various ‘ubuntu’ images.

In any case, how you implement refcounting is a storage-side decision.
So long as the add_references and and remove_references endpoints are
there (even if they're null-ops), the core code doesn't have to care
about how you want to handle this.

samalba · 2014-06-04T17:07:50Z

Crashed during a write. When restarted, this can be detected by
invalid JSON in the references file, and we'd have to walk the whole
repositories/tags tree to see if any of those files depended on our
corrupted image. Slow, but that's ok for recovering from such an
unlikely situation.

To rephrase what you explain, the operation is not atomic. Walking the whole dataset means it's not suitable for big dataset.

wking · 2014-06-04T17:51:24Z

On Wed, Jun 04, 2014 at 10:07:53AM -0700, Sam Alba wrote:

Crashed during a write. When restarted, this can be detected by
invalid JSON in the references file, and we'd have to walk the whole
repositories/tags tree to see if any of those files depended on our
corrupted image. Slow, but that's ok for recovering from such an
unlikely situation.

To rephrase what you explain, the operation is not atomic. Walking
the whole dataset means it's not suitable for big dataset.

To make things atomic, I'd suggest initially staging images (and their
refcounts) in a tmp/ directory and moving them into their true
locations after they were full (assuming your backing store supports
atomic moves). I'd do this for all of the registry's file
image/repository alterations, not just refcounting. What happens
currently if you crash with a half-written image ID in a tag file?

Even with atomic file-writes, you'd need some way to distinguish
between provisional refcounts (i.e. claimed by a tag that we're in the
process of writing) that were still provisional because they were
still in progress, and those that were still provisional because we
crashed while writing them (so they were aborted). In that case, I
think the logic would be:

def put_tag(namespace, repository, tag):
… extract data from request …
… write marker for tag-put-in-progress …
store.add_provisional_references(…)
… put all repository-side tag content …
store.add_references(
image_id=data, namespace=namespace, repository=repository, tag=tag)
… remove tag-put-in-progress marker …
return toolkit.response()

When touching the references files, we'd check for the provisional
references file. If we found the provisional referencing tags, we'd
add them to the permanent list. If we found an active
tag-put-in-progress marker, we'd leave them in the provisional
references list, but treat them as true references (to avoid garbage
collecting images while the put was in progress). If we found an
inactive tag-put-in-progress marker, we'd start a background job
cleaning up the ancestry chain of the failed provisional reference.
For local storage, the tag-put-in-progress marker storage could be
implemented with flock. For other backends, I'm not sure what the
best approach would be.

How does that sound?

Having a transactional storage backend would make this much easier ;).

This way we can access FileNotFoundError too, without the flake8 "H301 one import per line" raised by something like: from .exceptions import FileNotFoundError, NotImplementedError I'm not sure why we implement our own NotImplementedError when Python already has one built in, but that's probably off topic :p.

From Index._walk_storage. It's useful for creating the initial index, and also for the new storage upgrades.

Mirroring Base.get_registries. Both methods are useful for abstracting away the current nested filesystem implementation, which will hopefully not be baked into Base for much longer.

From docker_registry.tags.get_tags. Centralizing another abstraction method helps hide the current nested filesystem implementation from the coming upgrade methods.

Besides factoring out a bunch of repetitive {self.images}/{image_id} construction, this gives us a path to recursively remove when we want to delete an image. Modeled after ab3805a (storage: Add Storage.repository_path, 2014-02-12).

This is more explicit, and avoids getting clobbered by the later: data = create_tag_json(user_agent=ua)

Before this commit there was no way to remove images that were no longer needed by any tags. This commit adds a persistent list of references for each image, with both the depending image id and the namespace, repository, and tag. When we delete a tag, we traverse it's ancestors removing any obsolete references and any images that are no longer referenced. This is more complicated than just having a refcount number, but also more resilient. For example, someone will eventually have a power outage after removing a tag while the refcounts were still being adjusted. By keeping the references themselves in a list, it's easy to check for stale references and remove them. If it turns out that the removal checks are too slow, it would be easy to make them more optimistic (e.g. passing the descendant image and tags that we expect to remove down the _remove_references stack and removing them instead of checking all listed references for all ancestors). The JSON serialization for the [namespace, repository, tag] key allows us to use JSON to serialize the resulting dictionary (JSON only supports string keys). It's a bit awkward, but I wanted to make sure a single tag was only represented once (with whatever its current image was) in the references list.

The upgrade from storage v0.0 to storage v0.1 just adds tag references and cleans up unreferenced images. I think all the storage subclasses are too tightly bound to the file-based structure layed out in driver.Base. It would be nice is we exposed an API closer to the web API (add a tag, get a tag, remove a tag, add a layer, get a layer, remove a layer, ...), and left the internal details completely up to the driver. Of course, the file-like handling the the current Base class could be a FileStorage utility for easy subclassing, but it feels wrong to bake it into Base. If we do pull it out, we'd want to version both the Base API used by storage consumers. We'd leave it up to the implementations (like FileStorage) to version the backing data store they used to provide the Base API: registry core <--(Base API)--> storage driver <--> backing store Base.version and the version checks in Base._check_version allow the storage driver to manage the backing store, and are independent of the Base API used by the core. In any case, the data in the backing store persists between software upgrades, so we need *some* way to manage it as we polish the persistent format.

This is part of the internal backing-store interface, and isn't part of the public Base API.

wking · 2014-06-05T03:40:43Z

After cleaning up some silly typos, the unit tests are passing again
:p.

We still need to reach a consensus on where this should live (I can
pull it out into a refcounting class (CautiousRefcounter?) that
subclasses Base and is in turn subclassed by the in-tree storage
backends. That way other storage backends have a clean slate by
subclassing Base, or can use my implementation by subclassing
CautiousRefcounter, or can skip ref-counting entirely by subclassing a
NoopRefcounter. How does that sound?

sirupsen · 2014-06-25T15:01:48Z

Would love to see this in. What's holding it back?

dmp42 · 2014-06-25T15:57:23Z

Mainly:

It's a complex change not fitting well with the short term milestones
this is not going to be used by the official docker registry itself (see discussion about the existing dataset) unless there is a practical solution to handle the existing (growing) volume of images
or it should be made optional

Either way, if you want to try it, I guess nothing prevents you from giving it a spin from @wking branch.

wking · 2014-06-25T18:26:19Z

On Wed, Jun 25, 2014 at 08:57:25AM -0700, Olivier Gambier wrote:

or it should be made optional

Optional shouldn't be too hard, especially now that option parsing is
a bit better. Any feedback on this structure 1, or should I stay
closer to what I already have in this branch?

wking · 2014-08-12T16:28:29Z

On Wed, Jun 25, 2014 at 11:26:14AM -0700, W. Trevor King wrote:

Wed, Jun 25, 2014 at 08:57:25AM -0700, Olivier Gambier:

or it should be made optional

Optional shouldn't be too hard, especially now that option parsing is
a bit better. Any feedback on this structure 1, or should I stay
closer to what I already have in this branch?

Ping.

wking · 2014-09-06T00:22:33Z

On Tue, Aug 12, 2014 at 09:28:21AM -0700, W. Trevor King wrote:

Wed, Jun 25, 2014 at 11:26:14AM -0700, W. Trevor King:

Wed, Jun 25, 2014 at 08:57:25AM -0700, Olivier Gambier:

or it should be made optional

Optional shouldn't be too hard, especially now that option parsing is
a bit better. Any feedback on this structure 1, or should I stay
closer to what I already have in this branch?

Ping.

I just saw #533, so I thought I'd ping this again. I'm still looking
for feedback on my proposed structure for swapable refcounting
algorithms…

dmp42 · 2014-09-06T00:36:21Z

@wking sorry about the lag on this - will look

wking · 2014-09-06T16:25:51Z

On Tue, Aug 12, 2014 at 09:28:21AM -0700, W. Trevor King wrote:

Wed, Jun 25, 2014 at 11:26:14AM -0700, W. Trevor King:

Wed, Jun 25, 2014 at 08:57:25AM -0700, Olivier Gambier:

or it should be made optional

Optional shouldn't be too hard, especially now that option parsing is
a bit better. Any feedback on this structure 1, or should I stay
closer to what I already have in this branch?

Ping.

I just saw #533, so I thought I'd ping this again. I'm still looking
for feedback on my proposed structure for swapable refcounting
algorithms…

dmp42 · 2014-09-08T21:50:06Z

About the open questions:

I would like this to live in its own subclass, and leave Base as-is.

Also, I would be happy with separate drivers (s3-rf, file-rf - where rf=ref-counting - but any other, better, suffix can do) that would subclass both their parent (s3, file) and the refcounting base if that's possible.

I really want this as standalone and "separate" as possible - and to minimize the effort on the drivers developers - and make this an explicit (simple) choice for the users.

Sorry for the long wait on this, and for the bikeshedding - and thank you for your work on this @wking

Best.

ncdc · 2014-09-16T20:08:00Z

@wking would you like me to try to convert this to an extension?

wking · 2014-09-16T20:37:24Z

On Tue, Sep 16, 2014 at 01:08:00PM -0700, Andy Goldstein wrote:

would you like me to try to convert this to an extension?

Sure. If we're not touching Base, what's your plan for injecting this
into the Storage class structure? Personally, I think the current
storage API is a confusing mix of registry-facing APIs and
backend-facing helpers that leads to wonky behavior like whatever was
fixed by #553 (see also my earlier comments [1,2]. I've been working
towards what I'd like the storage APIs to look like 3, but it's
very WIP (and I'll wait until #570 lands before working on it much
more). If we get there, it the registry-side API would be a nice
place to hang some logging 4 where it's independent of any
particular storage backend.

I think this sort of thing would be much cleaner once we had the
filesystem stuff abstracted out of the registry-side storage API.

ncdc · 2014-09-16T20:47:59Z

@wking I wasn't actually going to try to modify Storage at all. I was going to treat all the reference counting code (and the files it creates) as something solely managed by the extension (even though _references goes in .../images/[id]/).

wking · 2014-09-16T20:57:16Z

On Tue, Sep 16, 2014 at 01:48:00PM -0700, Andy Goldstein wrote:

I wasn't actually going to try to modify Storage at all.

Ah, I suppose you could tie it to the signals instead. Where would
you store tag references (I do it with db05eef8)?

ncdc · 2014-09-16T20:59:19Z

If we use your existing code in this PR, just like you've been doing it via image_references_path.

wking · 2014-09-16T21:03:02Z

On Tue, Sep 16, 2014 at 01:59:20PM -0700, Andy Goldstein wrote:

If we use your existing code in this PR, just like you've been doing
it via image_references_path.

Which I added to Base in 3603066. If you're not touching Base, where
do you put it?

ncdc · 2014-09-16T21:08:47Z

I'd put image_references_path in the extension code, and use signals to react to tags being created and deleted.

wking · 2014-09-16T21:18:30Z

On Tue, Sep 16, 2014 at 02:08:49PM -0700, Andy Goldstein wrote:

I'd put image_references_path in the extension code, and use signals
to react to tags being created and deleted.

I think it's going to be hard to convince me to like anything that's
going to end up binding to the current Base API ;). Since fixing that
API is a long shot at the moment, feel free to do whatever it takes to
get this PR translated into something @dmp42 will accept sooner. If
it ends up keeping a lot of filesystem-backend assumptions, I can
always cut those strings and pretty things up later if/when we get a
cleaner registry-side storage API.

ncdc · 2014-09-17T19:56:00Z

@wking do you think we need to worry about file locking and/or mutexes when updating image references?

wking · 2014-09-17T20:00:22Z

On Wed, Sep 17, 2014 at 12:56:01PM -0700, Andy Goldstein wrote:

@wking do you think we need to worry about file locking and/or
mutexes when updating image references?

Here are my previous thoughts for these issues [1,2].

@wking

Add refcount extension to store image references and cleanup orphans. Based heavily on @wking's work in docker-archive#409 - this is just an adaptation to extension format. Signed-off-by: Andy Goldstein <agoldste@redhat.com>

@wking

Add refcount extension to store image references and cleanup orphans. Based heavily on @wking's work in docker-archive#409 - this is just an adaptation to extension format. Signed-off-by: Andy Goldstein <agoldste@redhat.com>

@wking

Add refcount extension to store image references and cleanup orphans. Based heavily on @wking's work in docker-archive#409 - this is just an adaptation to extension format. Signed-off-by: Andy Goldstein <agoldste@redhat.com>

@wking

Add refcount extension to store image references and cleanup orphans. Based heavily on @wking's work in docker-archive#409 - this is just an adaptation to extension format. Signed-off-by: Andy Goldstein <agoldste@redhat.com>

@wking

Add refcount extension to store image references and cleanup orphans. Based heavily on @wking's work in docker-archive#409 - this is just an adaptation to extension format. Signed-off-by: Andy Goldstein <agoldste@redhat.com>

dbason · 2014-10-09T21:52:35Z

@wking Sorry if this has been covered off (I'm having problems following the various discussions on deletion of images). Will the code as is clean up images on an s3 compatible back end (I ask because most of the scripts/cleanups I have seen in other threads look to be local file system only)?

wking · 2014-10-09T22:10:16Z

On Thu, Oct 09, 2014 at 02:52:35PM -0700, dbason wrote:

Sorry if this has been covered off (I'm having problems following
the various discussions on deletion of images). Will the code as is
clean up images on an s3 compatible back end (I ask because most of
the scripts/cleanups I have seen in other threads look to be local
file system only)?

I tweaked Base a bit, but I don't think I added any new
storage-implementation-facing methods, so I expect it will. This
particular implementation is unlikely to land though, so you might
want to checkout #606 instead.

@wking

Add refcount extension to store image references and cleanup orphans. Based heavily on @wking's work in docker-archive#409 - this is just an adaptation to extension format. Signed-off-by: Andy Goldstein <agoldste@redhat.com>

dmp42 · 2014-10-29T17:53:26Z

@wking @ncdc do you guys intend to work on this for v1?

wking · 2014-10-29T17:55:01Z

On Wed, Oct 29, 2014 at 10:53:28AM -0700, Olivier Gambier wrote:

@wking @ncdc do you guys intend to work on this for v1?

Not me, and @ncdc has #606.

wking added 9 commits June 4, 2014 17:59

driver: Move registry iteration to Base.get_registries

21b33cb

From Index._walk_storage. It's useful for creating the initial index, and also for the new storage upgrades.

driver: Add Base.get_images

c1c0289

Mirroring Base.get_registries. Both methods are useful for abstracting away the current nested filesystem implementation, which will hopefully not be baked into Base for much longer.

driver: Move tag iteration to Base.tags

c8eb75d

From docker_registry.tags.get_tags. Centralizing another abstraction method helps hide the current nested filesystem implementation from the coming upgrade methods.

driver: Add Base.image_path

6ffc4fd

Besides factoring out a bunch of repetitive {self.images}/{image_id} construction, this gives us a path to recursively remove when we want to delete an image. Modeled after ab3805a (storage: Add Storage.repository_path, 2014-02-12).

tags: Rename 'data' to 'image_id' in put_tag

2c20483

This is more explicit, and avoids getting clobbered by the later: data = create_tag_json(user_agent=ua)

driver: Make Base._image_path private

f13512f

This is part of the internal backing-store interface, and isn't part of the public Base API.

wking mentioned this pull request Jul 30, 2014

Deleting an image #497

Open

wking mentioned this pull request Aug 14, 2014

add feature to cleanup docker registry storage:remove history images #523

Open

wking mentioned this pull request Sep 11, 2014

Experimental extensions model #566

Closed

ncdc mentioned this pull request Sep 17, 2014

Add image to tag_deleted signal payload #577

Merged

ncdc mentioned this pull request Oct 3, 2014

Add refcount extension #606

Closed

wking mentioned this pull request Oct 10, 2014

NG: Transactional storage #626

Closed

wking mentioned this pull request Oct 22, 2014

Linear scan of a repository with lots (200+) of tags causes pull IO timeout #614

Open

wking closed this Oct 29, 2014

wking mentioned this pull request Nov 10, 2014

Cleanup of unused images / layers #706

Open

Conversation

wking commented Jun 4, 2014

Uh oh!

samalba commented Jun 4, 2014

Uh oh!

wking commented Jun 4, 2014

Uh oh!

samalba commented Jun 4, 2014

Uh oh!

wking commented Jun 4, 2014

Uh oh!

wking commented Jun 5, 2014

Uh oh!

sirupsen commented Jun 25, 2014

Uh oh!

dmp42 commented Jun 25, 2014

Uh oh!

wking commented Jun 25, 2014

Uh oh!

wking commented Aug 12, 2014

Uh oh!

wking commented Sep 6, 2014

Uh oh!

dmp42 commented Sep 6, 2014

Uh oh!

wking commented Sep 6, 2014

Uh oh!

dmp42 commented Sep 8, 2014

Uh oh!

ncdc commented Sep 16, 2014

Uh oh!

wking commented Sep 16, 2014

Uh oh!

ncdc commented Sep 16, 2014

Uh oh!

wking commented Sep 16, 2014

Uh oh!

ncdc commented Sep 16, 2014

Uh oh!

wking commented Sep 16, 2014

Uh oh!

ncdc commented Sep 16, 2014

Uh oh!

wking commented Sep 16, 2014

Uh oh!

ncdc commented Sep 17, 2014

Uh oh!

wking commented Sep 17, 2014

Uh oh!

dbason commented Oct 9, 2014

Uh oh!

wking commented Oct 9, 2014

Uh oh!

dmp42 commented Oct 29, 2014

Uh oh!

wking commented Oct 29, 2014

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

6 participants