Staged layer creation by Luap99 · Pull Request #378 · podman-container-tools/container-libs

Luap99 · 2025-10-08T10:49:30Z

No description provided.

podmanbot · 2025-10-08T10:50:41Z

✅ A new PR has been created in buildah to vendor these changes: podman-container-tools/buildah#6414

Luap99 · 2025-10-08T11:41:13Z

Podman PR podman-container-tools/podman#27251 and the buildah test PR podman-container-tools/buildah#6414 from the bot both look good so that means we can remove the special case from ApplyDiff() in overlay I think, ref podman-container-tools/podman#25862 (comment)

I still need to work on the actual feature here though to extract while the store in unlocked.

mtrmac

ACK, simplifying ApplyDiff this way does look correct. (I didn’t carefully look at the tempdir addition yet.)

mtrmac

I’m mostly looking because I was curious — feel free to disregard.

The tar-split comment might explain some of the “unexpected EOF” test failures.

Luap99 · 2025-10-15T16:43:26Z

@mtrmac FYI I have not really addressed most of your comments yet, I am just trying to push things to see how much things break. Still seeing plenty of test failures.

Issue 1 I see is that I just use the 700 permission from the tmpdir due the rename instead of the proper diff dir creation permissions that are in the driver.create() code

	diff := path.Join(dir, "diff")
	if err := idtools.MkdirAs(diff, forcedSt.Mode, forcedSt.IDs.UID, forcedSt.IDs.GID); err != nil {
		return err
	}

	if d.options.forceMask != nil {
		st.Mode |= os.ModeDir
		if err := idtools.SetContainersOverrideXattr(diff, st); err != nil {
			return err
		}
	}

Not sure if I should expose that into the tmpdir creation logic, I guess that makes the most sense since only the dirver should now the exact permission that should be used?

Second problem I see are timeouts (in parallel running tests) which I guess mean I added a deadlock situation?
https://api.cirrus-ci.com/v1/artifact/task/5906611702595584/html/sys-podman-debian-13-rootless-host-sqlite.log.html

I guess looking at the code this unlock/lock again thing I did is indeed completely broken and unsafe due ABBA deadlock, i.e. in putlayer we also hold the containerStore lock so only unlocking the layer store makes it possible that another process can get the layer lock and then blocks on the still gold container store thus both process handing forever.
I haven't checked all the code paths but I guess with the locking order requirements what I did is basically impossible to achieve anyway and I have to indeed move this out to before we get the lock?

mtrmac · 2025-10-15T17:03:30Z

Issue 1 I see is that I just use the 700 permission from the tmpdir due the rename instead of the proper diff dir creation permissions that are in the driver.create() code

Not sure if I should expose that into the tmpdir creation logic, I guess that makes the most sense since only the dirver should now the exact permission that should be used?

I think that could work.

I was thinking StageAddition does not actually need to create (os.Create/os.Mkdir) the tmpAddPath at all. All of that happens inside a lock-protected td.tempDirPath, so there is ~nothing special, that I can see, about populating tmpAddPath — the provided callback can create the staged item without any help. (That could also mean StageDirectoryAddition and StageFileAddition could be consolidated into one. And I’m not immediately sure we need a callback — StageAddition could return a newStagingPathToPopulate — but I also didn’t now carefully re-read the tempdir package.)

I haven't checked all the code paths but I guess with the locking order requirements what I did is basically impossible to achieve anyway and I have to indeed move this out to before we get the lock?

Per the locking hierarchy documented at the top of store, I think you’re right here.

Luap99 · 2025-10-15T17:15:36Z

And I’m not immediately sure we need a callback — StageAddition could return a newStagingPathToPopulate — but I also didn’t now carefully re-read the tempdir package.)

Yeah my thinking was that the callback provides a "lifetime" of when the path is safe to use, if I return a string/struct with the path then the caller can cleanup/commit and then still use the path afterwards. This is really where I start to hate go because in rust this would be trivial to enforce so that there could only ever be one call to commit and then render the object useless afterwards.

But yes usage wise this callback is indeed getting quite ugly to the point where just returning the path is much simpler and well how go works in general. I do like the suggestion of just returning the path to consolidate both tmpdir functions into one so I will go with that.

Luap99 · 2025-10-22T19:09:23Z

@mtrmac I will push this into podman and run more tests tomorrow but I think like this it should be workable now. I fix the minor lint issues here of course on the next push. Let me know if this approach seem right to you, I guess the code could need some more better comments/function names likely.

Luap99 · 2025-10-23T17:46:07Z

Ok last issue noticed in podman the idmapping logic cannot be implemented unlocked I fear.

We have this code in putLayer()

	if options.HostUIDMapping {
		options.UIDMap = nil
	}
	if options.HostGIDMapping {
		options.GIDMap = nil
	}
	uidMap := options.UIDMap
	gidMap := options.GIDMap
	if parent != "" {
		var ilayer *Layer
		for _, l := range append([]roLayerStore{rlstore}, rlstores...) {
			lstore := l
			if lstore != rlstore {
				if err := lstore.startReading(); err != nil {
					return nil, -1, err
				}
				defer lstore.stopReading()
			}
			if l, err := lstore.Get(parent); err == nil && l != nil {
				ilayer = l
				parent = ilayer.ID
				break
			}
		}
		if ilayer == nil {
			return nil, -1, ErrLayerUnknown
		}
		parentLayer = ilayer

		if err := s.containerStore.startWriting(); err != nil {
			return nil, -1, err
		}
		defer s.containerStore.stopWriting()
		containers, err := s.containerStore.Containers()
		if err != nil {
			return nil, -1, err
		}
		for _, container := range containers {
			if container.LayerID == parent {
				return nil, -1, ErrParentIsContainer
			}
		}
		if !options.HostUIDMapping && len(options.UIDMap) == 0 {
			uidMap = ilayer.UIDMap
		}
		if !options.HostGIDMapping && len(options.GIDMap) == 0 {
			gidMap = ilayer.GIDMap
		}
	} else {
		// FIXME? It’s unclear why we are holding containerStore locked here at all
		// (and because we are not modifying it, why it is a write lock, not a read lock).
		if err := s.containerStore.startWriting(); err != nil {
			return nil, -1, err
		}
		defer s.containerStore.stopWriting()

		if !options.HostUIDMapping && len(options.UIDMap) == 0 {
			uidMap = s.uidMap
		}
		if !options.HostGIDMapping && len(options.GIDMap) == 0 {
			gidMap = s.gidMap
		}
	}
	if s.canUseShifting(uidMap, gidMap) {
		options.IDMappingOptions = types.IDMappingOptions{HostUIDMapping: true, HostGIDMapping: true, UIDMap: nil, GIDMap: nil}
	} else {
		options.IDMappingOptions = types.IDMappingOptions{
			HostUIDMapping: options.HostUIDMapping,
			HostGIDMapping: options.HostGIDMapping,
			UIDMap:         copySlicePreferringNil(uidMap),
			GIDMap:         copySlicePreferringNil(gidMap),
		}
	}

However we extract the layer with the caller specified layerOptions.UIDMap, layerOptions.GIDMap in stageWithUnlockedStore() before this code can get run.

I guess the s.uidMap case could be done without a lock but not the parent lookups? So I guess the simple solution would be to only use the unlocked extract path when options.HostGIDMapping is true.

mtrmac · 2025-10-23T18:01:52Z

(I didn’t look at the current code in this PR yet.) AFAICS layer.[UG]IDMap never changes once the layer is created, so it should be safe to look up the parent, read these values, and then unlock.

IIRC the plan was to start layer creation with an ID lookup (so that we don’t start expensively staging it if it already exists), so the parent lookup could be done within the same lock scope.

giuseppe · 2025-10-23T20:08:05Z

+			Mappings: idtools.NewIDMappingsFromMaps(layerOptions.UIDMap, layerOptions.GIDMap),
+			// FIXME: What to do here? We have no lock and assigned label yet.
+			// Overlayfs should not need it anyway so this seems fine for now.
+			MountLabel: "",


I think this is correct

mtrmac

Reading this commit by commit, this looks really great — the comments about documenting locking semantics etc. are basically the final polish.

(Note to self: Eventually it might be worth re-reading the final state as is, to check whether there is any opportunity to simplify.)

Around #378 (comment) and more recently with the parent’s mapping there was some tentative discussion about checking whether the layer exists before deciding to stage it — that’s still to be decided, I think. (In c/image, commitLayer does do a layer presence check before deciding to create it — but in case it is reusing an existing local layer by extracting it into a temporary tarball to be applied, there is still quite a window in which the layer could be concurrently created. Of course, c/image can add one more check to its caller — but if we happened to take locks to read the parent’s state, a lookup for an ID already existing would be ~free.)

Luap99 · 2025-11-04T17:19:17Z

#!/usr/bin/env bpftrace

kfunc:fcntl_setlk
{
  $lockname = str(args->filp->f_path.dentry->d_name.name);

  if ($lockname == "storage.lock" || $lockname == "layers.lock" ||
      $lockname == "images.lock" || $lockname == "containers.lock" ) {

    @blockedTime[tid, $lockname] = nsecs;
  }
}

kretfunc:fcntl_setlk
{
  $lockname = str(args->filp->f_path.dentry->d_name.name);

  if ($lockname == "storage.lock" || $lockname == "layers.lock" ||
      $lockname == "images.lock" || $lockname == "containers.lock" ) {

    // lock duration in msec
    $lock_duration = (nsecs - @blockedTime[tid, $lockname])/1000000;
    if ($lock_duration) {
      printf("blocked %s time: %lld msec\n", $lockname, $lock_duration);
    }
    delete(@blockedTime[tid, $lockname]);
    @lockholdTime[pid, args->fd] = (nsecs, $lockname);
  }
}

tracepoint:syscalls:sys_enter_close
{
  $a = @lockholdTime[pid, args->fd];
  if ($a.0) {
    // lock duration in msec
    $lock_duration = (nsecs - $a.0)/1000000;
    if ($lock_duration) {
      printf("lock %s time: %lld msec\n", $a.1, $lock_duration);
    }
    delete(@lockholdTime[pid, args->fd]);
  }
}

FYI this is the script I used to measure lock times during my demo last week, not sure if there is a decent place to store this. I guess it could be helpful in the future to find more lock contention, should I add it under hack/ maybe?

Add a new function to stage additions. This should be used to extract the layer content into a temp directory without holding the storage lock and then under the lock just rename the directory into the final location to reduce the lock contention. Signed-off-by: Paul Holzinger <pholzing@redhat.com>

It is not clear to me when it will hit the code path there, by normal layer creation we always pass a valid parent so this branch is never reached AFAICT. Let's remove it and see if all tests still pass in podman, buildah and others... Signed-off-by: Paul Holzinger <pholzing@redhat.com>

Split out the layer permission gathering from the main create() function so it can be reused elsehwere, see the next commit. Signed-off-by: Paul Holzinger <pholzing@redhat.com>

Add a function to apply the diff into a tmporary directory so we can do that unlocked and only rename under the lock. Signed-off-by: Paul Holzinger <pholzing@redhat.com>

I cannot see any reason why we should buffer the full tar split content in memory before writing it. That layer is still mark partial at this point and the store is locked so there is no concurrent access either thus we do not need the atomic rename here. Signed-off-by: Paul Holzinger <pholzing@redhat.com>

Split it into multiple function to make it reusable without having a layer and so that it can be used unlocked see the following commits. Signed-off-by: Paul Holzinger <pholzing@redhat.com>

The extracting of the tar under the store lock is a bottleneck as many concurrent processes might hold the locks for a long time on big layers. To address this move the layer extraction before we take the locks if possible. Currently this only work when using the overlay driver as the implementation requires driver specifc details in order for a rename() to work. Signed-off-by: Paul Holzinger <pholzing@redhat.com>

It doesn't seem needed here so don't take it. Signed-off-by: Paul Holzinger <pholzing@redhat.com>

A minor rework to enable more changes in following commits. Note the caller still must hold the layer store locks so ensure we return the layer locked and let the caller unlock instead. Signed-off-by: Paul Holzinger <pholzing@redhat.com>

Make it reusable for other callers, see next commit. Also while at it remove the dedupeStrings() call, as pointed out by Miloslav the work it is doing is more expensive than just checking the same name several times as it does a O(1) map lookup. Also most callers won't pass duplicated names to begin with. Signed-off-by: Paul Holzinger <pholzing@redhat.com>

The untar can be quite expensive so check for id, name conflicts right away. Also we must populate the idmappings so we extract with the right uids/gids. Signed-off-by: Paul Holzinger <pholzing@redhat.com>

This function was added in commit c577a81 and used by older drivers we no longer suppor, such as aufs and windows. As such this is dead code and can be removed. Signed-off-by: Paul Holzinger <pholzing@redhat.com>

It is unused in all drivers now, so it can be removed. Signed-off-by: Paul Holzinger <pholzing@redhat.com>

We use this this typo all the time now so make the naming a bit more clear. Signed-off-by: Paul Holzinger <pholzing@redhat.com>

Just be safe based on the review feedback from the PR. podman-container-tools#378 Signed-off-by: Paul Holzinger <pholzing@redhat.com>

The function is just a redirection to another one so inline it directly as we do not gain anything from the extra indirection. Signed-off-by: Paul Holzinger <pholzing@redhat.com>

Also add a missing sync when we stage to ensure the content was flushed to disk. Signed-off-by: Paul Holzinger <pholzing@redhat.com>

Luap99 · 2025-11-26T15:06:38Z

@mtrmac I addressed all your comments now. Podman is also happy in podman-container-tools/podman#27251 (modulo some unrelated flakes)

mtrmac

Thanks!

Luap99 · 2025-11-28T17:12:23Z

#!/usr/bin/env bpftrace

kfunc:fcntl_setlk
{
  $lockname = str(args->filp->f_path.dentry->d_name.name);

  if ($lockname == "storage.lock" || $lockname == "layers.lock" ||
      $lockname == "images.lock" || $lockname == "containers.lock" ) {

    @blockedTime[tid, $lockname] = nsecs;
  }
}

kretfunc:fcntl_setlk
{
  $lockname = str(args->filp->f_path.dentry->d_name.name);

  if ($lockname == "storage.lock" || $lockname == "layers.lock" ||
      $lockname == "images.lock" || $lockname == "containers.lock" ) {

    // lock duration in msec
    $lock_duration = (nsecs - @blockedTime[tid, $lockname])/1000000;
    if ($lock_duration) {
      printf("blocked %s time: %lld msec\n", $lockname, $lock_duration);
    }

    // store max and create histogram that get printed on bpftrace exit
    @block_max[$lockname] = max($lock_duration);
    @block_duration[$lockname] = hist($lock_duration);

    delete(@blockedTime[tid, $lockname]);
    @lockholdTime[pid, args->fd] = (nsecs, $lockname);
  }
}

tracepoint:syscalls:sys_enter_close
{
  $a = @lockholdTime[pid, args->fd];
  if ($a.0) {
    // lock duration in msec
    $lock_duration = (nsecs - $a.0)/1000000;
    if ($lock_duration) {
      printf("lock %s time: %lld msec\n", $a.1, $lock_duration);
    }

    // store max and create histogram that get printed on bpftrace exit
    @lock_max[$a.1] = max($lock_duration);
    @lock_duration[$a.1] = hist($lock_duration);

    delete(@lockholdTime[pid, args->fd]);
  }
}

Updated test script which prints some nice histograms and the max time values on exit for each lock file. It could be useful to measure other code parts as well. Let me know if you like me to commit this.

mtrmac · 2025-11-28T18:06:09Z

Maybe into storage/contrib? It’s valuable right now, it could be useful as a starting point for a similar analysis of something else in the future, but I don’t think we’d want to commit to maintaining this script. I don’t have a strong preference.

Just be safe based on the review feedback from the PR. podman-container-tools#378 Signed-off-by: Paul Holzinger <pholzing@redhat.com> (cherry picked from commit 3bfe961) Signed-off-by: Paul Holzinger <pholzing@redhat.com>

github-actions Bot added the storage Related to "storage" package label Oct 8, 2025

podmanbot pushed a commit to podmanbot/buildah that referenced this pull request Oct 8, 2025

dnm: Vendor changes from podman-container-tools/container-libs#378

6d83f9f

podmanbot mentioned this pull request Oct 8, 2025

Sync: WIP: Staged layer creation podman-container-tools/buildah#6414

Closed

mtrmac reviewed Oct 8, 2025

View reviewed changes

Comment thread storage/store.go Outdated

Luap99 force-pushed the staged-layer-creation branch from 348a11e to b7780f2 Compare October 13, 2025 13:14

podmanbot pushed a commit to podmanbot/buildah that referenced this pull request Oct 13, 2025

dnm: Vendor changes from podman-container-tools/container-libs#378

ac18582

mtrmac reviewed Oct 13, 2025

View reviewed changes

Luap99 force-pushed the staged-layer-creation branch from b7780f2 to bbb2266 Compare October 15, 2025 14:59

podmanbot pushed a commit to podmanbot/buildah that referenced this pull request Oct 15, 2025

dnm: Vendor changes from podman-container-tools/container-libs#378

45eda1a

Luap99 force-pushed the staged-layer-creation branch from bbb2266 to 74d0e97 Compare October 22, 2025 18:42

podmanbot pushed a commit to podmanbot/buildah that referenced this pull request Oct 22, 2025

dnm: Vendor changes from podman-container-tools/container-libs#378

4ce027b

Luap99 force-pushed the staged-layer-creation branch from 74d0e97 to 5cf326c Compare October 23, 2025 10:31

podmanbot pushed a commit to podmanbot/buildah that referenced this pull request Oct 23, 2025

dnm: Vendor changes from podman-container-tools/container-libs#378

284f29f

Luap99 force-pushed the staged-layer-creation branch from 5cf326c to e60d339 Compare October 23, 2025 12:15

podmanbot pushed a commit to podmanbot/buildah that referenced this pull request Oct 23, 2025

dnm: Vendor changes from podman-container-tools/container-libs#378

e21be83

giuseppe reviewed Oct 23, 2025

View reviewed changes

mtrmac reviewed Oct 23, 2025

View reviewed changes

Luap99 force-pushed the staged-layer-creation branch from e60d339 to 91fcdbb Compare October 30, 2025 19:06

podmanbot pushed a commit to podmanbot/buildah that referenced this pull request Oct 30, 2025

dnm: Vendor changes from podman-container-tools/container-libs#378

e20dc5d

Luap99 force-pushed the staged-layer-creation branch from 91fcdbb to 651c8f5 Compare November 4, 2025 14:46

podmanbot pushed a commit to podmanbot/buildah that referenced this pull request Nov 4, 2025

dnm: Vendor changes from podman-container-tools/container-libs#378

31d42ea

Luap99 added 17 commits November 26, 2025 15:25

overlay: add getLayerPermissions()

3517eba

Split out the layer permission gathering from the main create() function so it can be reused elsehwere, see the next commit. Signed-off-by: Paul Holzinger <pholzing@redhat.com>

overlay: add StartStagingDiffToApply()

d1cc083

Add a function to apply the diff into a tmporary directory so we can do that unlocked and only rename under the lock. Signed-off-by: Paul Holzinger <pholzing@redhat.com>

storage: rework applyDiffWithOptions()

5403dfc

Split it into multiple function to make it reusable without having a layer and so that it can be used unlocked see the following commits. Signed-off-by: Paul Holzinger <pholzing@redhat.com>

storage: don't lock container store in putLayer

1cecfee

It doesn't seem needed here so don't take it. Signed-off-by: Paul Holzinger <pholzing@redhat.com>

storage: do conflict checks before staging untar

f524d14

The untar can be quite expensive so check for id, name conflicts right away. Also we must populate the idmappings so we extract with the right uids/gids. Signed-off-by: Paul Holzinger <pholzing@redhat.com>

storage: remove NaiveCreateFromTemplate()

af19fb6

This function was added in commit c577a81 and used by older drivers we no longer suppor, such as aufs and windows. As such this is dead code and can be removed. Signed-off-by: Paul Holzinger <pholzing@redhat.com>

storage: remove parent param from ApplyDiff()

b37aa04

It is unused in all drivers now, so it can be removed. Signed-off-by: Paul Holzinger <pholzing@redhat.com>

storage: rename stagedLayerOptions to layerCreationContents

b93f966

We use this this typo all the time now so make the naming a bit more clear. Signed-off-by: Paul Holzinger <pholzing@redhat.com>

storage: don't use stagedLayerExtract when mountLabel is set

3bfe961

Just be safe based on the review feedback from the PR. podman-container-tools#378 Signed-off-by: Paul Holzinger <pholzing@redhat.com>

storage: inline putLayer

e58297d

The function is just a redirection to another one so inline it directly as we do not gain anything from the extra indirection. Signed-off-by: Paul Holzinger <pholzing@redhat.com>

storage: check for close error on the tar split file

83dd0eb

Also add a missing sync when we stage to ensure the content was flushed to disk. Signed-off-by: Paul Holzinger <pholzing@redhat.com>

Luap99 force-pushed the staged-layer-creation branch from 7f04b7f to 83dd0eb Compare November 26, 2025 14:25

podmanbot pushed a commit to podmanbot/buildah that referenced this pull request Nov 26, 2025

dnm: Vendor changes from podman-container-tools/container-libs#378

e8fdfdc

mtrmac approved these changes Nov 26, 2025

View reviewed changes

mtrmac merged commit c780818 into podman-container-tools:main Nov 26, 2025
38 checks passed

Luap99 deleted the staged-layer-creation branch November 26, 2025 15:47

Luap99 mentioned this pull request Nov 28, 2025

storage: the userns idmap copy layer code path still runs under the locks #506

Closed

Luap99 mentioned this pull request Dec 3, 2025

Add bcachefs storage driver with subvolume/snapshot support #518

Open

Luap99 mentioned this pull request Jan 20, 2026

[podman-5.8] backport staged layer creation #599

Merged

Luap99 mentioned this pull request Mar 24, 2026

RFC: Partial storage locking to increase 'storing signature' performance #177

Closed

Conversation

Luap99 commented Oct 8, 2025

Uh oh!

podmanbot commented Oct 8, 2025

Uh oh!

Luap99 commented Oct 8, 2025

Uh oh!

mtrmac left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

mtrmac left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Luap99 commented Oct 15, 2025

Uh oh!

mtrmac commented Oct 15, 2025

Uh oh!

Luap99 commented Oct 15, 2025

Uh oh!

Luap99 commented Oct 22, 2025

Uh oh!

Luap99 commented Oct 23, 2025

Uh oh!

mtrmac commented Oct 23, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

giuseppe Oct 23, 2025

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

mtrmac left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Luap99 commented Nov 4, 2025

Uh oh!

Luap99 commented Nov 26, 2025

Uh oh!

mtrmac left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Luap99 commented Nov 28, 2025

Uh oh!

mtrmac commented Nov 28, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

mtrmac commented Oct 23, 2025 •

edited

Loading