Introduce a temporary directory for deleting layer files outside the lock #2325
Force-pushed from ca5a283 to cb73550
I think we should avoid using private/company-specific Google docs to design public FOSS in general, especially as part of the movement of projects in this space to the CNCF.

@cgwalters ok
Force-pushed from a1810a4 to d4352cd
mtrmac
left a comment
A very preliminary first look — feel free to ignore if feedback is not yet wanted.
At this point I’m mostly focused on the TempDir mechanism’s design/API; callers are interesting to the extent that they represent use cases that need to be handled, but
pkg/tempdir/tempdir.go (outdated):

```go
// Required global storage lock must be held before calling this function.
func recover(rootDir string) error {
	tempDirPattern := filepath.Join(rootDir, "temp-dir-*")
	tempDirs, err := filepath.Glob(tempDirPattern)
```
As is, this would add a .Glob to every RW lock acquisition. It would be interesting to try to optimize that.
Let's just use a parent directory for the temp directories so we don't have to look at their names.

Instead of /foo/run/vfs-layers/temp-dir-28400, use something like /foo/run/vfs-layers/temp-dirs/28400, so we avoid the glob.
(I’m wondering about even more optimization — we could have a “might have a staging directory” bool in layers.json, and use the layerStore.lockfile.RecordWrite notification mechanism to make all of this ~zero-cost to users of layerStore.startReading. But, one step at a time.)
Force-pushed from cc142b0 to ce89bc4
mtrmac
left a comment
Thanks! Good progress, TempDir now looks safe, and the documentation is great.
For Podman service, and CRI-O, we do need layer deletion to happen very soon, not just at store shutdown. And, for those, we might also want the recovery to happen much more frequently (e.g. if someone runs podman rmi as a one-time operation on an OpenShift node, that crashes, and the only other c/storage user is a long-running CRI-O process).
That might, OTOH, make the serialization implied by tempDirGlobalLock more frequent = more costly and less desirable, again. I don’t think we absolutely need that lock?
- Split instance state and TempDir state. Then there will be no need to do any locking in TempDir itself; for in-process purposes, it would just contain read-only fields. And we can make the instance single-goroutine-owned = again, no need for in-process locks.
- Move instanceLock out of the instance directory, and change lifetimes so that the lock is created before, and deleted after, the instance directory. Then I think we don’t need tempDirGlobalLock for cross-process purposes, either (? apart from coordinating recovery).
drivers/overlay/overlay.go (outdated):

```diff
 	id := entry.Name()
 	switch id {
-	case linkDir, stagingDir, quota.BackingFsBlockDeviceLink, mountProgramFlagFile:
+	case linkDir, stagingDir, tempDir, quota.BackingFsBlockDeviceLink, mountProgramFlagFile:
```
Uh… do we need to worry about forward compatibility, i.e. an older version of this code running and unaware of tempDir?
I'm not sure, but I moved it to stagingDir. But I'm not sure how it is with recreateSymlinks(). I think it should ignore stagingDir.
> I'm not sure, but I moved it to stagingDir.

That might make things worse: it would interfere with pruneStagingDirectories.

> But I'm not sure how it is with recreateSymlinks(). I think it should ignore stagingDir.
Yes. It probably works well enough now, because $stagingDir/link doesn’t exist, so the directory is ignored - but it seems attractive to clean this up, going forward.
The very last resort which always works would be to leave d.home for layer directories, and have the caller of the driver provide a separate d.metadataHome where we have no prior commitments.
But, maybe, the answer is that we can leave tempDir in d.home, as long as $tempDir/link is never created?? And should we now at least reserve a new path prefix, e.g. defining that ${d.home}/meta* is never a layer directory?
Or, alternatively, we could define a storage version number, and refuse to work with future versions. That would be drastic but clearly correct, OTOH we would be signing up for very precisely managing this to keep CRI-O and Podman in sync (and for not forgetting to update the version number…)
All of this forward-compatibility design seems valuable, but also somewhat out of scope of this “unlocked deletes” effort — or at least clearly out of scope of this prototype.
Cc: @giuseppe
Now that this is no longer just a prototype, we’ll need to find a solution.
@giuseppe With this, old versions’ GarbageCollect == podman system prune --external can delete all of tempDir.
I think that’s good enough, it’s supposed to be run at boot at quiet state, but I’m highlighting this in case I’m missing something.
I only tangentially skimmed this PR but just in case of interest, from the very start ostree has implemented a vaguely similar scheme in https://github.com/ostreedev/ostree/blob/7a32b86ee4c0232bfde97a3cd87c4a82b6e73218/src/libostree/ostree-repo.c#L6292
pkg/tempdir/tempdir.go (outdated):

```go
/*
Example structure:

rootDir/
	.global-tmp-lock (tempDirGlobalLock)
	temp-dir-ABC/
		.lock (instanceLock for temp-dir-ABC)
		file1
		file3
	temp-dir-XYZ/
		.lock (instanceLock for temp-dir-XYZ)
		file2
*/
```
#2325 (comment) still holds.
This structure is subject to a race condition when the .lock file is deleted before the other files in the directory.
So you have process A that holds the lock while it runs rm -rf temp-dir-XYZ; what should a new process B do when it attempts to lock the directory but doesn't find the lock file? 1: assume the directory is being deleted (process A could have crashed), or 2: delete it as well?
I think it is safer to have:

temp-dir-XYZ/data/
temp-dir-XYZ/.lock

First remove temp-dir-XYZ/data/, and if that succeeded, proceed to remove temp-dir-XYZ.

When process B tries to lock the directory, it will find the lock file as long as data/ exists; and if it doesn't find the lock, then it can do rmdir("temp-dir-XYZ"), because that could happen if process A crashed after removing the lock file but before removing the directory (or after creating the directory but before the lock file was created).
I don’t think we need that directory at all — $root/${instanceID}.lock could mark ownership of $root/${instanceID}.data.
On creation, allocate the lock first. On deletion, unlock and delete the lock last.
(This could even, hypothetically, allow ….data to be a regular file, without the overhead of any extra directories.)
what if the content we are trying to move to the temp-dir has a file/directory named .lock?
> This structure is subject to a race condition when the .lock file is deleted before the other files in the directory.
I think that part is good enough — things like os.RemoveAll really shouldn’t crash if a parent is removed in the meantime.
The problem with moving the lock inside the instance directory is at creation time, where we have the instance directory without any lock, so we need to protect against cleanup/recovery immediately deleting it. The current implementation does solve that, by holding the global lock — but that also means that the cleanup/recovery needs to obtain the global lock every time just to check whether anything needs to be cleaned up.
Together with #2325 (review) arguing that we should do the cleanups frequently, not just at process shutdown, that implies obtaining an extra lock file on both layerStore.startReading and layerStore.startWriting. I’d rather prefer to avoid that cost. (I’d even like to avoid the Readdir on every start…ing, but that will require more cooperation with layerStore.)
> what if the content we are trying to move to the temp-dir has a file/directory named .lock?
I think that’s fine?
- File names under $root are owned by the tempdir package. Callers have no claim to decide names, we would simply not give them the API (and the current code in .Add that bases the file name on the base name of the input can be changed).
- So, a regular file would be named $root/${instanceID}.data. No ambiguity.
- If the instance were a larger directory structure (our primary use case now: a layer deletion in progress), $root/${instanceID}.data would be a directory containing arbitrary contents. We would not be looking for locks inside the ….data subdirectory at all.
pkg/tempdir/tempdir.go (outdated):

```go
	td.tempDirGlobalLock.Lock()
	defer td.tempDirGlobalLock.Unlock()
```
we don't need the lock here since it is released when the function exits, so for the caller it doesn't matter whether the code below runs with the lock or not.
As @mtrmac pointed out, we need to use os.ReadDir, not walk it recursively.
Force-pushed from dbd4cad to 76529ec
mtrmac
left a comment
Ratholes inside ratholes …
If I’m reading things right, using pkg/lockfile.LockFile for staging directories would leak memory? Cc: @giuseppe to keep me honest.
If so, we need a new idea, or a new primitive. I was thinking a new internal/staginglock that supports

- CreateAndTryLock
- TryLockExistingPath
- UnlockAndDelete

without an ever-growing lock map. If so, can we do that in yet another separate PR, fix overlay’s stagingDirBase, and then return to this one?
But maybe there’s some much simpler approach I’m now missing.
I've edited the …
giuseppe
left a comment
do you have a PR for Podman to test the new functionality and see how it would work?
In particular, I'd like to see how the deferred removal would work in a cleanup process
Here is a PR with vendoring of c/storage with a new approach for deleting layers with a temporary directory: containers/podman#26546
mtrmac
left a comment
Nice, I think this proves the concept beyond doubt.
Now, for merging / using in production, the one larger correctness issue I see is correctly placing the staging directories in graph drivers.
(I do worry a little about performance — we now trigger recovery on process start (clearly correct and necessary), and after every time we observe some other process making a modification to the store. That might be frequent in high-load multi-Podman systems… OTOH it should not affect the podman-remote server nor CRI-O, who are ~the only writers, so, on balance, I think this is probably fine, even with the ~2 extra ReadDirs.)
drivers/vfs/driver.go (outdated):

```go
	if err != nil {
		return nil, err
	}
	if err := t.Add(d.dir(id)); err != nil {
```
d.dir can be filepath.Join(d.home, "dir", …) or filepath.Join(d.imageStore, d.String(), "dir", …) (or something with d.additionalHomes[], although those are presumably in R/O image stores and should not be possible to delete).
d.home and d.imageStore can be on different filesystems (and, I think, that would frequently be the case: the “image” layers would be in durable storage and the “container” layers in something faster but less durable).
So, I think the GetTempDirRootDir driver API needs to be generalized to return a list of directories. (And that would remove the current “"" means N/A” special case.)
```go
}

func (d *Driver) removeCommon(id string, cleanup func(string) error) error {
	dir := d.dir(id)
```
Similarly to VFS, this can point to d.imageStore, not to d.home.
drivers/overlay/overlay.go (outdated):

```diff
 	id := entry.Name()
 	switch id {
-	case linkDir, stagingDir, quota.BackingFsBlockDeviceLink, mountProgramFlagFile:
+	case linkDir, stagingDir, tempDir, quota.BackingFsBlockDeviceLink, mountProgramFlagFile:
```
Now that this is no longer just a prototype, we’ll need to find a solution.
Fixes: https://issues.redhat.com/browse/RUN-3104

Signed-off-by: Jan Rodák <hony.com@seznam.cz>
mtrmac
left a comment
Just a brief look, I didn’t review the driver code changes.
```go
// deferredDelete deletes a layer with the specified name or ID.
// This removal happens immediately (the layer is no longer usable),
// but physically deleting the files may be deferred.
// Caller MUST call all returned cleanup functions outside of the locks.
```
… EVEN IF the function returns an error.
mtrmac
left a comment
I’m afraid the additional image store implementation is confusing, the way it is mixed over Store and drivers. (Exhibit N+1 why I refuse to work on adding more options to graph drivers.)
@mtrmac PTAL
drivers/overlay/overlay.go (outdated):

```go
}

func (d *Driver) DeferredRemove(id string) (tempdir.CleanupTempDirFunc, error) {
	_, homeDir, _ := d.dir2(id, false)
```
I think this homeDir is [undocumented :( and] “where to create layers”; it does not, AFAICS, consistently match the directory where id is found — so this does not return the right directory.
Really I think having dir2, in both graph drivers, return the parent(?) directory would be cleaner (centralizing knowledge/policy, and allowing to clearly track where data comes from) than hard-coding any knowledge in the DeferredRemove callers.
Admittedly, yes, that would imply documenting what dir2 does, and that’s a bit of a puzzle. But, as this PR shows, that would be valuable.
The rework and documentation of dir2 should be done in another PR. Should I create an issue for this?
I think that’s ultimately up to you.
I think after this merges you are likely to be pinged if this behaves in a surprising way, and right now you understand the code better than most people do most of the time. So, I think it would help your future self to document it very soon. But that’s a guess about the future, I don’t think I’m 100% justified in requiring you to do that investment now.
This enables deferred deletion of physical files outside of global locks, improving performance and reducing lock contention.

Signed-off-by: Jan Rodák <hony.com@seznam.cz>
@mtrmac PTAL
giuseppe
left a comment
thanks for the great work!
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: giuseppe, Honny1

The full list of commands accepted by this bot can be found here. The pull request process is described here.
This PR contains the implementation of the temporary directory and its integration. The temporary directory is used to delete files outside of the global storage lock, reducing the time the global lock is held. Instead of deleting files in place, files are moved to the temporary directory by renaming them; renaming a file is faster than deleting it from disk. After all files have been moved to the temporary directory and the file references have been removed, the global storage lock is unlocked and the temporary directory is removed.
Fixes: https://issues.redhat.com/browse/RHEL-36087
Fixes: https://issues.redhat.com/browse/RUN-3104