Skip to content

libimage: fix manifest race during listing#2400

Merged
openshift-merge-bot[bot] merged 2 commits into
containers:mainfrom
Luap99:list-manifest
Apr 9, 2025
Merged

libimage: fix manifest race during listing#2400
openshift-merge-bot[bot] merged 2 commits into
containers:mainfrom
Luap99:list-manifest

Conversation

@Luap99

@Luap99 Luap99 commented Apr 1, 2025

Copy link
Copy Markdown
Member

I saw a flake in parallel podman testing, podman images can fail if the manifest was removed at the right time. In general listing should never be able to fail when another image or manifest is removed in parallel.

Change the logic to convert to manifest and only collect the digests in the success case and ignore all other errors to make the listing more robust.

I observed the following error from podman images: Error: locating image "xxx" for loading instance list: locating image with ID "xxx": image not known

@openshift-ci

openshift-ci Bot commented Apr 1, 2025

Copy link
Copy Markdown
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: Luap99

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci Bot added the approved label Apr 1, 2025
@Luap99

Luap99 commented Apr 1, 2025

Copy link
Copy Markdown
Member Author

cc @mtrmac @rhatdan

@mtrmac mtrmac left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ACK to the general idea (we don’t hold the storage lock long enough to get a consistent snapshot, and if a manifest list goes away between MultiList and querying the “dangling” property, it’s reasonable to consider the components dangling).

I’m worried about silently returning invalid data on I/O errors, especially when that can lead to removing images, though. Can this be restricted to only silently ignore the two relevant errors?

Comment thread libimage/layer_tree.go
if err != nil {
return nil, err
}
// ignore errors, common errors are

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ctx is unused as of now, lint test is failing.

@Luap99

Luap99 commented Apr 1, 2025

Copy link
Copy Markdown
Member Author

I’m worried about silently returning invalid data on I/O errors, especially when that can lead to removing images, though. Can this be restricted to only silently ignore the two relevant errors?

I wasn't sure if these would be the only ones it can encounter and I wanted to be on the safer side on podman images should just work.
It is certainly possible to check specific errors and not ignore others if that is preferred.

@mtrmac

mtrmac commented Apr 1, 2025

Copy link
Copy Markdown
Contributor

if a manifest list goes away between MultiList and querying the “dangling” property, it’s reasonable to consider the components dangling.

… ugh, this can theoretically violate causality:

  • A refers to C
  • We run MultiList
  • User adds B referring to C
  • User deletes A
  • We see that A has been removed, and don’t see its reference to C. We have never seen B. We think that C is dangling, but it was not dangling at any point in time.

Is that possible? Worth worrying about? Right now we don’t have the infrastructure to do anything else… I guess we would want something like MultiListOptions.ReturnAllManifestsData?! And then teach the manifest list parsers to work with raw data instead of image references?

@Luap99

Luap99 commented Apr 1, 2025

Copy link
Copy Markdown
Member Author

Is that possible? Worth worrying about? Right now we don’t have the infrastructure to do anything else… I guess we would want something like MultiListOptions.ReturnAllManifestsData?! And then teach the manifest list parsers to work with raw data instead of image references?

if we talk about operations like podman system/image prune than they are extremely racy by design. Like every single one of them. As there are no global locks taken for the main logic of the commands they all do something like list + remove. Even if you make the manifest listing "atomic" the time between list and rm will still be unlocked. A manifest list with a reference to that image could have been added after listing but before removal. To fix that we would need to teach the storage to not remove the image if it is part of an manifest.
And it is not even that would be enough, currently something like podman run IMAGE will pull the image and then there is a window between the image is unused after pull before we actually create the storage container to reference the image.

So overall I don't think we need to worry to about this case to much here. All I care about is to make the commands at least consistent so that they at least don't randomly fail which is far worse IMO.

@mtrmac

mtrmac commented Apr 1, 2025

Copy link
Copy Markdown
Contributor

WFM. I have little sympathy for a user adding a reference to an untagged image, while running prune concurrently; but the fact that podman run is not resistant to concurrent prune is a strong argument that this is not something we really are aiming to support at the moment.

@mheon

mheon commented Apr 2, 2025

Copy link
Copy Markdown
Member

LGTM

@mtrmac mtrmac left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, feel free to merge as is.

Comment thread libimage/layer_tree.go
Luap99 added 2 commits April 9, 2025 13:54
I saw a flake in parallel podman testing, podman images can fail if the
manifest was removed at the right time. In general listing should never
be able to fail when another image or manifest is removed in parallel.

Change the logic to convert to manifest and only collect the digests in
the success case and ignore all other errors to make the listing more
robust.

I observed the following error from podman images:
Error: locating image "xxx" for loading instance list: locating image with ID "xxx": image not known

Signed-off-by: Paul Holzinger <pholzing@redhat.com>
All other errors are returned wrapped with the image ID so do the same
when the manifest blobl decoding fails.

Signed-off-by: Paul Holzinger <pholzing@redhat.com>
@Luap99

Luap99 commented Apr 9, 2025

Copy link
Copy Markdown
Member Author

created a podman test PR just to see if this breaks anything podman-container-tools/podman#25840

@Luap99

Luap99 commented Apr 9, 2025

Copy link
Copy Markdown
Member Author

Podman PR looks good, PTAL again

@mtrmac

mtrmac commented Apr 9, 2025

Copy link
Copy Markdown
Contributor

/lgtm

@mtrmac

mtrmac commented Apr 9, 2025

Copy link
Copy Markdown
Contributor

Thanks!

@openshift-merge-bot openshift-merge-bot Bot merged commit f71a7a6 into containers:main Apr 9, 2025
@Luap99 Luap99 deleted the list-manifest branch April 9, 2025 18:57
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants