Skip to content

Staged layer creation#378

Merged
mtrmac merged 17 commits into
podman-container-tools:mainfrom
Luap99:staged-layer-creation
Nov 26, 2025
Merged

Staged layer creation#378
mtrmac merged 17 commits into
podman-container-tools:mainfrom
Luap99:staged-layer-creation

Conversation

@Luap99

@Luap99 Luap99 commented Oct 8, 2025

Copy link
Copy Markdown
Member

No description provided.

@github-actions github-actions Bot added the storage Related to "storage" package label Oct 8, 2025
podmanbot pushed a commit to podmanbot/buildah that referenced this pull request Oct 8, 2025
@podmanbot

Copy link
Copy Markdown

✅ A new PR has been created in buildah to vendor these changes: podman-container-tools/buildah#6414

@Luap99

Luap99 commented Oct 8, 2025

Copy link
Copy Markdown
Member Author

Podman PR podman-container-tools/podman#27251 and the buildah test PR podman-container-tools/buildah#6414 from the bot both look good so that means we can remove the special case from ApplyDiff() in overlay I think, ref podman-container-tools/podman#25862 (comment)

I still need to work on the actual feature here though to extract while the store in unlocked.

@mtrmac mtrmac left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ACK, simplifying ApplyDiff this way does look correct. (I didn’t carefully look at the tempdir addition yet.)

Comment thread storage/store.go Outdated
@Luap99 Luap99 force-pushed the staged-layer-creation branch from 348a11e to b7780f2 Compare October 13, 2025 13:14
podmanbot pushed a commit to podmanbot/buildah that referenced this pull request Oct 13, 2025

@mtrmac mtrmac left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I’m mostly looking because I was curious — feel free to disregard.

The tar-split comment might explain some of the “unexpected EOF” test failures.

Comment thread storage/internal/tempdir/tempdir.go Outdated
Comment thread storage/layers.go Outdated
Comment thread storage/drivers/overlay/overlay_test.go Outdated
Comment thread storage/drivers/overlay/overlay.go Outdated
Comment thread storage/layers.go Outdated
Comment thread storage/layers.go Outdated
Comment thread storage/layers.go
Comment thread storage/layers.go Outdated
@Luap99 Luap99 force-pushed the staged-layer-creation branch from b7780f2 to bbb2266 Compare October 15, 2025 14:59
podmanbot pushed a commit to podmanbot/buildah that referenced this pull request Oct 15, 2025
@Luap99

Luap99 commented Oct 15, 2025

Copy link
Copy Markdown
Member Author

@mtrmac FYI I have not really addressed most of your comments yet, I am just trying to push things to see how much things break. Still seeing plenty of test failures.

Issue 1 I see is that I just use the 700 permission from the tmpdir due the rename instead of the proper diff dir creation permissions that are in the driver.create() code

	diff := path.Join(dir, "diff")
	if err := idtools.MkdirAs(diff, forcedSt.Mode, forcedSt.IDs.UID, forcedSt.IDs.GID); err != nil {
		return err
	}

	if d.options.forceMask != nil {
		st.Mode |= os.ModeDir
		if err := idtools.SetContainersOverrideXattr(diff, st); err != nil {
			return err
		}
	}

Not sure if I should expose that into the tmpdir creation logic, I guess that makes the most sense since only the dirver should now the exact permission that should be used?


Second problem I see are timeouts (in parallel running tests) which I guess mean I added a deadlock situation?
https://api.cirrus-ci.com/v1/artifact/task/5906611702595584/html/sys-podman-debian-13-rootless-host-sqlite.log.html

I guess looking at the code this unlock/lock again thing I did is indeed completely broken and unsafe due ABBA deadlock, i.e. in putlayer we also hold the containerStore lock so only unlocking the layer store makes it possible that another process can get the layer lock and then blocks on the still gold container store thus both process handing forever.
I haven't checked all the code paths but I guess with the locking order requirements what I did is basically impossible to achieve anyway and I have to indeed move this out to before we get the lock?

@mtrmac

mtrmac commented Oct 15, 2025

Copy link
Copy Markdown
Contributor

Issue 1 I see is that I just use the 700 permission from the tmpdir due the rename instead of the proper diff dir creation permissions that are in the driver.create() code

Not sure if I should expose that into the tmpdir creation logic, I guess that makes the most sense since only the dirver should now the exact permission that should be used?

I think that could work.

I was thinking StageAddition does not actually need to create (os.Create/os.Mkdir) the tmpAddPath at all. All of that happens inside a lock-protected td.tempDirPath, so there is ~nothing special, that I can see, about populating tmpAddPath — the provided callback can create the staged item without any help. (That could also mean StageDirectoryAddition and StageFileAddition could be consolidated into one. And I’m not immediately sure we need a callbackStageAddition could return a newStagingPathToPopulate — but I also didn’t now carefully re-read the tempdir package.)


I haven't checked all the code paths but I guess with the locking order requirements what I did is basically impossible to achieve anyway and I have to indeed move this out to before we get the lock?

Per the locking hierarchy documented at the top of store, I think you’re right here.

@Luap99

Luap99 commented Oct 15, 2025

Copy link
Copy Markdown
Member Author

And I’m not immediately sure we need a callback — StageAddition could return a newStagingPathToPopulate — but I also didn’t now carefully re-read the tempdir package.)

Yeah my thinking was that the callback provides a "lifetime" of when the path is safe to use, if I return a string/struct with the path then the caller can cleanup/commit and then still use the path afterwards. This is really where I start to hate go because in rust this would be trivial to enforce so that there could only ever be one call to commit and then render the object useless afterwards.

But yes usage wise this callback is indeed getting quite ugly to the point where just returning the path is much simpler and well how go works in general. I do like the suggestion of just returning the path to consolidate both tmpdir functions into one so I will go with that.

@Luap99 Luap99 force-pushed the staged-layer-creation branch from bbb2266 to 74d0e97 Compare October 22, 2025 18:42
podmanbot pushed a commit to podmanbot/buildah that referenced this pull request Oct 22, 2025
@Luap99

Luap99 commented Oct 22, 2025

Copy link
Copy Markdown
Member Author

@mtrmac I will push this into podman and run more tests tomorrow but I think like this it should be workable now. I fix the minor lint issues here of course on the next push. Let me know if this approach seem right to you, I guess the code could need some more better comments/function names likely.

@Luap99 Luap99 force-pushed the staged-layer-creation branch from 74d0e97 to 5cf326c Compare October 23, 2025 10:31
podmanbot pushed a commit to podmanbot/buildah that referenced this pull request Oct 23, 2025
@Luap99 Luap99 force-pushed the staged-layer-creation branch from 5cf326c to e60d339 Compare October 23, 2025 12:15
podmanbot pushed a commit to podmanbot/buildah that referenced this pull request Oct 23, 2025
@Luap99

Luap99 commented Oct 23, 2025

Copy link
Copy Markdown
Member Author

Ok last issue noticed in podman the idmapping logic cannot be implemented unlocked I fear.

We have this code in putLayer()

	if options.HostUIDMapping {
		options.UIDMap = nil
	}
	if options.HostGIDMapping {
		options.GIDMap = nil
	}
	uidMap := options.UIDMap
	gidMap := options.GIDMap
	if parent != "" {
		var ilayer *Layer
		for _, l := range append([]roLayerStore{rlstore}, rlstores...) {
			lstore := l
			if lstore != rlstore {
				if err := lstore.startReading(); err != nil {
					return nil, -1, err
				}
				defer lstore.stopReading()
			}
			if l, err := lstore.Get(parent); err == nil && l != nil {
				ilayer = l
				parent = ilayer.ID
				break
			}
		}
		if ilayer == nil {
			return nil, -1, ErrLayerUnknown
		}
		parentLayer = ilayer

		if err := s.containerStore.startWriting(); err != nil {
			return nil, -1, err
		}
		defer s.containerStore.stopWriting()
		containers, err := s.containerStore.Containers()
		if err != nil {
			return nil, -1, err
		}
		for _, container := range containers {
			if container.LayerID == parent {
				return nil, -1, ErrParentIsContainer
			}
		}
		if !options.HostUIDMapping && len(options.UIDMap) == 0 {
			uidMap = ilayer.UIDMap
		}
		if !options.HostGIDMapping && len(options.GIDMap) == 0 {
			gidMap = ilayer.GIDMap
		}
	} else {
		// FIXME? It’s unclear why we are holding containerStore locked here at all
		// (and because we are not modifying it, why it is a write lock, not a read lock).
		if err := s.containerStore.startWriting(); err != nil {
			return nil, -1, err
		}
		defer s.containerStore.stopWriting()

		if !options.HostUIDMapping && len(options.UIDMap) == 0 {
			uidMap = s.uidMap
		}
		if !options.HostGIDMapping && len(options.GIDMap) == 0 {
			gidMap = s.gidMap
		}
	}
	if s.canUseShifting(uidMap, gidMap) {
		options.IDMappingOptions = types.IDMappingOptions{HostUIDMapping: true, HostGIDMapping: true, UIDMap: nil, GIDMap: nil}
	} else {
		options.IDMappingOptions = types.IDMappingOptions{
			HostUIDMapping: options.HostUIDMapping,
			HostGIDMapping: options.HostGIDMapping,
			UIDMap:         copySlicePreferringNil(uidMap),
			GIDMap:         copySlicePreferringNil(gidMap),
		}
	}

However we extract the layer with the caller specified layerOptions.UIDMap, layerOptions.GIDMap in stageWithUnlockedStore() before this code can get run.

I guess the s.uidMap case could be done without a lock but not the parent lookups? So I guess the simple solution would be to only use the unlocked extract path when options.HostGIDMapping is true.

@mtrmac

mtrmac commented Oct 23, 2025

Copy link
Copy Markdown
Contributor

(I didn’t look at the current code in this PR yet.) AFAICS layer.[UG]IDMap never changes once the layer is created, so it should be safe to look up the parent, read these values, and then unlock.

IIRC the plan was to start layer creation with an ID lookup (so that we don’t start expensively staging it if it already exists), so the parent lookup could be done within the same lock scope.

Comment thread storage/layers.go
Mappings: idtools.NewIDMappingsFromMaps(layerOptions.UIDMap, layerOptions.GIDMap),
// FIXME: What to do here? We have no lock and assigned label yet.
// Overlayfs should not need it anyway so this seems fine for now.
MountLabel: "",

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is correct

Comment thread storage/layers.go Outdated
Comment thread storage/drivers/overlay/overlay.go Outdated
Comment thread storage/layers.go
Comment thread storage/layers.go

@mtrmac mtrmac left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reading this commit by commit, this looks really great — the comments about documenting locking semantics etc. are basically the final polish.

(Note to self: Eventually it might be worth re-reading the final state as is, to check whether there is any opportunity to simplify.)

Around #378 (comment) and more recently with the parent’s mapping there was some tentative discussion about checking whether the layer exists before deciding to stage it — that’s still to be decided, I think. (In c/image, commitLayer does do a layer presence check before deciding to create it — but in case it is reusing an existing local layer by extracting it into a temporary tarball to be applied, there is still quite a window in which the layer could be concurrently created. Of course, c/image can add one more check to its caller — but if we happened to take locks to read the parent’s state, a lookup for an ID already existing would be ~free.)

Comment thread storage/internal/tempdir/tempdir.go Outdated
Comment thread storage/internal/tempdir/tempdir.go Outdated
Comment thread storage/internal/tempdir/tempdir.go Outdated
Comment thread storage/internal/tempdir/tempdir.go Outdated
Comment thread storage/drivers/overlay/overlay.go Outdated
Comment thread storage/store.go Outdated
Comment thread storage/store.go Outdated
Comment thread storage/drivers/overlay/overlay.go Outdated
Comment thread storage/drivers/overlay/overlay.go
Comment thread storage/store.go Outdated
@Luap99 Luap99 force-pushed the staged-layer-creation branch from e60d339 to 91fcdbb Compare October 30, 2025 19:06
podmanbot pushed a commit to podmanbot/buildah that referenced this pull request Oct 30, 2025
@Luap99 Luap99 force-pushed the staged-layer-creation branch from 91fcdbb to 651c8f5 Compare November 4, 2025 14:46
podmanbot pushed a commit to podmanbot/buildah that referenced this pull request Nov 4, 2025
@Luap99

Luap99 commented Nov 4, 2025

Copy link
Copy Markdown
Member Author
#!/usr/bin/env bpftrace

kfunc:fcntl_setlk
{
  $lockname = str(args->filp->f_path.dentry->d_name.name);

  if ($lockname == "storage.lock" || $lockname == "layers.lock" ||
      $lockname == "images.lock" || $lockname == "containers.lock" ) {

    @blockedTime[tid, $lockname] = nsecs;
  }
}

kretfunc:fcntl_setlk
{
  $lockname = str(args->filp->f_path.dentry->d_name.name);

  if ($lockname == "storage.lock" || $lockname == "layers.lock" ||
      $lockname == "images.lock" || $lockname == "containers.lock" ) {

    // lock duration in msec
    $lock_duration = (nsecs - @blockedTime[tid, $lockname])/1000000;
    if ($lock_duration) {
      printf("blocked %s time: %lld msec\n", $lockname, $lock_duration);
    }
    delete(@blockedTime[tid, $lockname]);
    @lockholdTime[pid, args->fd] = (nsecs, $lockname);
  }
}

tracepoint:syscalls:sys_enter_close
{
  $a = @lockholdTime[pid, args->fd];
  if ($a.0) {
    // lock duration in msec
    $lock_duration = (nsecs - $a.0)/1000000;
    if ($lock_duration) {
      printf("lock %s time: %lld msec\n", $a.1, $lock_duration);
    }
    delete(@lockholdTime[pid, args->fd]);
  }
}

FYI this is the script I used to measure lock times during my demo last week, not sure if there is a decent place to store this. I guess it could be helpful in the future to find more lock contention, should I add it under hack/ maybe?

Add a new function to stage additions. This should be used to extract
the layer content into a temp directory without holding the storage
lock and then under the lock just rename the directory into the final
location to reduce the lock contention.

Signed-off-by: Paul Holzinger <pholzing@redhat.com>
It is not clear to me when it will hit the code path there, by normal
layer creation we always pass a valid parent so this branch is never
reached AFAICT.

Let's remove it and see if all tests still pass in podman, buildah and
others...

Signed-off-by: Paul Holzinger <pholzing@redhat.com>
Split out the layer permission gathering from the main create() function
so it can be reused elsehwere, see the next commit.

Signed-off-by: Paul Holzinger <pholzing@redhat.com>
Add a function to apply the diff into a tmporary directory so we can do
that unlocked and only rename under the lock.

Signed-off-by: Paul Holzinger <pholzing@redhat.com>
I cannot see any reason why we should buffer the full tar split content
in memory before writing it. That layer is still mark partial at this
point and the store is locked so there is no concurrent access either
thus we do not need the atomic rename here.

Signed-off-by: Paul Holzinger <pholzing@redhat.com>
Split it into multiple function to make it reusable without having a
layer and so that it can be used unlocked see the following commits.

Signed-off-by: Paul Holzinger <pholzing@redhat.com>
The extracting of the tar under the store lock is a bottleneck as many
concurrent processes might hold the locks for a long time on big layers.

To address this move the layer extraction before we take the locks if
possible. Currently this only work when using the overlay driver as the
implementation requires driver specifc details in order for a rename()
to work.

Signed-off-by: Paul Holzinger <pholzing@redhat.com>
It doesn't seem needed here so don't take it.

Signed-off-by: Paul Holzinger <pholzing@redhat.com>
A minor rework to enable more changes in following commits. Note the
caller still must hold the layer store locks so ensure we return the
layer locked and let the caller unlock instead.

Signed-off-by: Paul Holzinger <pholzing@redhat.com>
Make it reusable for other callers, see next commit.

Also while at it remove the dedupeStrings() call, as pointed out by
Miloslav the work it is doing is more expensive than just checking the
same name several times as it does a O(1) map lookup. Also most callers
won't pass duplicated names to begin with.

Signed-off-by: Paul Holzinger <pholzing@redhat.com>
The untar can be quite expensive so check for id, name conflicts right
away. Also we must populate the idmappings so we extract with the right
uids/gids.

Signed-off-by: Paul Holzinger <pholzing@redhat.com>
This function was added in commit c577a81 and used by older drivers we
no longer suppor, such as aufs and windows. As such this is dead code
and can be removed.

Signed-off-by: Paul Holzinger <pholzing@redhat.com>
It is unused in all drivers now, so it can be removed.

Signed-off-by: Paul Holzinger <pholzing@redhat.com>
We use this this typo all the time now so make the naming a bit more
clear.

Signed-off-by: Paul Holzinger <pholzing@redhat.com>
Just be safe based on the review feedback from the PR.
podman-container-tools#378

Signed-off-by: Paul Holzinger <pholzing@redhat.com>
The function is just a redirection to another one so inline it directly
as we do not gain anything from the extra indirection.

Signed-off-by: Paul Holzinger <pholzing@redhat.com>
Also add a missing sync when we stage to ensure the content was flushed
to disk.

Signed-off-by: Paul Holzinger <pholzing@redhat.com>
@Luap99 Luap99 force-pushed the staged-layer-creation branch from 7f04b7f to 83dd0eb Compare November 26, 2025 14:25
podmanbot pushed a commit to podmanbot/buildah that referenced this pull request Nov 26, 2025
@Luap99

Luap99 commented Nov 26, 2025

Copy link
Copy Markdown
Member Author

@mtrmac I addressed all your comments now. Podman is also happy in podman-container-tools/podman#27251 (modulo some unrelated flakes)

@mtrmac mtrmac left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

@mtrmac mtrmac merged commit c780818 into podman-container-tools:main Nov 26, 2025
38 checks passed
@Luap99 Luap99 deleted the staged-layer-creation branch November 26, 2025 15:47
@Luap99

Luap99 commented Nov 28, 2025

Copy link
Copy Markdown
Member Author
#!/usr/bin/env bpftrace

kfunc:fcntl_setlk
{
  $lockname = str(args->filp->f_path.dentry->d_name.name);

  if ($lockname == "storage.lock" || $lockname == "layers.lock" ||
      $lockname == "images.lock" || $lockname == "containers.lock" ) {

    @blockedTime[tid, $lockname] = nsecs;
  }
}

kretfunc:fcntl_setlk
{
  $lockname = str(args->filp->f_path.dentry->d_name.name);

  if ($lockname == "storage.lock" || $lockname == "layers.lock" ||
      $lockname == "images.lock" || $lockname == "containers.lock" ) {

    // lock duration in msec
    $lock_duration = (nsecs - @blockedTime[tid, $lockname])/1000000;
    if ($lock_duration) {
      printf("blocked %s time: %lld msec\n", $lockname, $lock_duration);
    }

    // store max and create histogram that get printed on bpftrace exit
    @block_max[$lockname] = max($lock_duration);
    @block_duration[$lockname] = hist($lock_duration);

    delete(@blockedTime[tid, $lockname]);
    @lockholdTime[pid, args->fd] = (nsecs, $lockname);
  }
}

tracepoint:syscalls:sys_enter_close
{
  $a = @lockholdTime[pid, args->fd];
  if ($a.0) {
    // lock duration in msec
    $lock_duration = (nsecs - $a.0)/1000000;
    if ($lock_duration) {
      printf("lock %s time: %lld msec\n", $a.1, $lock_duration);
    }

    // store max and create histogram that get printed on bpftrace exit
    @lock_max[$a.1] = max($lock_duration);
    @lock_duration[$a.1] = hist($lock_duration);

    delete(@lockholdTime[pid, args->fd]);
  }
}

Updated test script which prints some nice histograms and the max time values on exit for each lock file. It could be useful to measure other code parts as well. Let me know if you like me to commit this.

@mtrmac

mtrmac commented Nov 28, 2025

Copy link
Copy Markdown
Contributor

Maybe into storage/contrib? It’s valuable right now, it could be useful as a starting point for a similar analysis of something else in the future, but I don’t think we’d want to commit to maintaining this script. I don’t have a strong preference.

Luap99 added a commit to Luap99/container-libs that referenced this pull request Jan 20, 2026
Just be safe based on the review feedback from the PR.
podman-container-tools#378

Signed-off-by: Paul Holzinger <pholzing@redhat.com>
(cherry picked from commit 3bfe961)
Signed-off-by: Paul Holzinger <pholzing@redhat.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

storage Related to "storage" package

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants