Staged layer creation #378
|
✅ A new PR has been created in buildah to vendor these changes: podman-container-tools/buildah#6414 |
|
Podman PR podman-container-tools/podman#27251 and the buildah test PR podman-container-tools/buildah#6414 from the bot both look good, so I think that means we can remove the special case from ApplyDiff() in overlay, ref podman-container-tools/podman#25862 (comment). I still need to work on the actual feature here though, to extract while the store is unlocked. |
mtrmac
left a comment
ACK, simplifying ApplyDiff this way does look correct. (I didn’t carefully look at the tempdir addition yet.)
348a11e to b7780f2
mtrmac
left a comment
I’m mostly looking because I was curious — feel free to disregard.
The tar-split comment might explain some of the “unexpected EOF” test failures.
b7780f2 to bbb2266
|
@mtrmac FYI I have not really addressed most of your comments yet, I am just trying to push things to see how much breaks. I am still seeing plenty of test failures. Issue 1 is that, due to the rename, I just use the 700 permission from the tmpdir instead of the proper diff dir creation permissions from the driver.create() code. I am not sure if I should expose that to the tmpdir creation logic; I guess that makes the most sense, since only the driver should know the exact permission to use? The second problem I see is timeouts (in tests running in parallel), which I guess means I added a deadlock? Looking at the code, this unlock/lock-again thing I did is indeed completely broken and unsafe due to an ABBA deadlock: in putLayer() we also hold the containerStore lock, so unlocking only the layer store makes it possible for another process to get the layer lock and then block on the still-held container store lock, leaving both processes hanging forever. |
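The ABBA problem described above goes away if the slow work happens before any lock is taken and both locks are then acquired in one fixed order. The following is only a minimal sketch of that shape: the `store` type, `putLayer` signature, and `runConcurrent` helper are hypothetical, and in-process `sync.Mutex` values stand in for the real file locks.

```go
package main

import (
	"fmt"
	"sync"
)

// store is a hypothetical stand-in; the two mutexes model the layer-store
// and container-store locks from the discussion.
type store struct {
	layerLock     sync.Mutex
	containerLock sync.Mutex
	layers        []string
}

// putLayer runs the slow extraction with no lock held, then takes both locks
// in one fixed order (layers, then containers). It never drops only one of
// them while still holding the other, which is what made ABBA possible.
func (s *store) putLayer(id string, extract func() string) {
	staged := extract() // slow part, unlocked

	s.layerLock.Lock()
	defer s.layerLock.Unlock()
	s.containerLock.Lock()
	defer s.containerLock.Unlock()

	s.layers = append(s.layers, id+"@"+staged)
}

// runConcurrent creates n layers from n goroutines and returns how many
// were recorded.
func runConcurrent(n int) int {
	s := &store{}
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			s.putLayer(fmt.Sprintf("layer%d", i), func() string { return "staged" })
		}(i)
	}
	wg.Wait()
	return len(s.layers)
}

func main() {
	fmt.Println(runConcurrent(8))
}
```

Because every caller acquires the locks in the same order and releases them together, no goroutine can hold one lock while waiting for a lock that a peer acquired first.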
I think that could work. I was thinking
Per the locking hierarchy documented at the top of |
Yeah, my thinking was that the callback provides a "lifetime" during which the path is safe to use; if I return a string/struct with the path, then the caller can clean up/commit and still use the path afterwards. This is really where I start to hate Go, because in Rust it would be trivial to enforce that there can only ever be one call to commit, which then renders the object useless afterwards. But yes, usage-wise this callback is indeed getting quite ugly, to the point where just returning the path is much simpler and, well, how Go works in general. I do like the suggestion of just returning the path to consolidate both tmpdir functions into one, so I will go with that. |
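Go cannot enforce "commit consumes the object" at compile time the way Rust's move semantics can, but a run-time guard gets part of the way there. This is a sketch only: `stagedDir`, `Path`, `Commit`, and `errAlreadyFinished` are hypothetical names, not the API in this PR, and the type is not safe for concurrent use.

```go
package main

import (
	"errors"
	"fmt"
)

// errAlreadyFinished is returned when a consumed handle is used again.
var errAlreadyFinished = errors.New("staged directory already committed or cleaned up")

// stagedDir is a hypothetical handle for a temporary staging directory.
type stagedDir struct {
	path     string
	finished bool
}

// Path returns the directory only while the handle is still usable.
func (d *stagedDir) Path() (string, error) {
	if d.finished {
		return "", errAlreadyFinished
	}
	return d.path, nil
}

// Commit marks the handle as consumed. Unlike Rust, a second call still
// compiles in Go; it is only rejected at run time.
func (d *stagedDir) Commit() error {
	if d.finished {
		return errAlreadyFinished
	}
	d.finished = true
	return nil
}

func main() {
	d := &stagedDir{path: "/tmp/staging-example"} // illustrative path only
	fmt.Println(d.Commit())
	fmt.Println(d.Commit())
}
```

The trade-off matches the comment above: a caller that copied the string from `Path()` before `Commit()` can still misuse it afterwards, so the guard documents intent rather than proving it.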
bbb2266 to 74d0e97
|
@mtrmac I will push this into podman and run more tests tomorrow, but I think it should be workable like this now. I will of course fix the minor lint issues here on the next push. Let me know if this approach seems right to you; I guess the code could use better comments/function names. |
74d0e97 to 5cf326c
5cf326c to e60d339
|
OK, last issue I noticed in podman: I fear the idmapping logic cannot be implemented unlocked. We have this code in putLayer():

    if options.HostUIDMapping {
        options.UIDMap = nil
    }
    if options.HostGIDMapping {
        options.GIDMap = nil
    }
    uidMap := options.UIDMap
    gidMap := options.GIDMap
    if parent != "" {
        var ilayer *Layer
        for _, l := range append([]roLayerStore{rlstore}, rlstores...) {
            lstore := l
            if lstore != rlstore {
                if err := lstore.startReading(); err != nil {
                    return nil, -1, err
                }
                defer lstore.stopReading()
            }
            if l, err := lstore.Get(parent); err == nil && l != nil {
                ilayer = l
                parent = ilayer.ID
                break
            }
        }
        if ilayer == nil {
            return nil, -1, ErrLayerUnknown
        }
        parentLayer = ilayer
        if err := s.containerStore.startWriting(); err != nil {
            return nil, -1, err
        }
        defer s.containerStore.stopWriting()
        containers, err := s.containerStore.Containers()
        if err != nil {
            return nil, -1, err
        }
        for _, container := range containers {
            if container.LayerID == parent {
                return nil, -1, ErrParentIsContainer
            }
        }
        if !options.HostUIDMapping && len(options.UIDMap) == 0 {
            uidMap = ilayer.UIDMap
        }
        if !options.HostGIDMapping && len(options.GIDMap) == 0 {
            gidMap = ilayer.GIDMap
        }
    } else {
        // FIXME? It’s unclear why we are holding containerStore locked here at all
        // (and because we are not modifying it, why it is a write lock, not a read lock).
        if err := s.containerStore.startWriting(); err != nil {
            return nil, -1, err
        }
        defer s.containerStore.stopWriting()
        if !options.HostUIDMapping && len(options.UIDMap) == 0 {
            uidMap = s.uidMap
        }
        if !options.HostGIDMapping && len(options.GIDMap) == 0 {
            gidMap = s.gidMap
        }
    }
    if s.canUseShifting(uidMap, gidMap) {
        options.IDMappingOptions = types.IDMappingOptions{HostUIDMapping: true, HostGIDMapping: true, UIDMap: nil, GIDMap: nil}
    } else {
        options.IDMappingOptions = types.IDMappingOptions{
            HostUIDMapping: options.HostUIDMapping,
            HostGIDMapping: options.HostGIDMapping,
            UIDMap:         copySlicePreferringNil(uidMap),
            GIDMap:         copySlicePreferringNil(gidMap),
        }
    }

However, we extract the layer with the caller-specified mappings. I guess the s.uidMap case could be done without a lock, but not the parent lookups? So I guess the simple solution would be to only use the unlocked extract path when options.HostGIDMapping is true. |
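To make the "which cases need the parent" question concrete, the UID half of the selection logic quoted above can be written as a pure function. This is a simplified sketch: `idMap`, `mappingOptions`, `needsParentMap`, and `effectiveUIDMap` are hypothetical names, the GID side is assumed to be symmetric, and the `canUseShifting()` step is left out.

```go
package main

import "fmt"

// idMap is a simplified stand-in for one ID mapping entry.
type idMap struct{ ContainerID, HostID, Size int }

// mappingOptions holds only the UID half of the caller's options.
type mappingOptions struct {
	HostUIDMapping bool
	UIDMap         []idMap
}

// needsParentMap reports whether the parent's mapping has to be read,
// which is the case that requires the layer store to be locked.
func needsParentMap(opts mappingOptions, hasParent bool) bool {
	return hasParent && !opts.HostUIDMapping && len(opts.UIDMap) == 0
}

// effectiveUIDMap follows the order in the quoted code: host mapping wins,
// then an explicit map, then the parent's map, then the store default.
func effectiveUIDMap(opts mappingOptions, hasParent bool, parentMap, storeDefault []idMap) []idMap {
	switch {
	case opts.HostUIDMapping:
		return nil
	case len(opts.UIDMap) > 0:
		return opts.UIDMap
	case hasParent:
		return parentMap
	default:
		return storeDefault
	}
}

func main() {
	parent := []idMap{{0, 100000, 65536}}
	def := []idMap{{0, 200000, 65536}}
	fmt.Println(effectiveUIDMap(mappingOptions{}, true, parent, def))
	fmt.Println(needsParentMap(mappingOptions{HostUIDMapping: true}, true))
}
```

Seen this way, only the `hasParent` branch with no host mapping and no explicit map depends on state that lives behind the lock, as far as the mappings are concerned; the parent ID resolution and the ErrParentIsContainer check in the quoted code still need the stores regardless.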
|
(I didn’t look at the current code in this PR yet.) AFAICS IIRC the plan was to start layer creation with an ID lookup (so that we don’t start expensively staging it if it already exists), so the parent lookup could be done within the same lock scope. |
    Mappings: idtools.NewIDMappingsFromMaps(layerOptions.UIDMap, layerOptions.GIDMap),
    // FIXME: What to do here? We have no lock and assigned label yet.
    // Overlayfs should not need it anyway so this seems fine for now.
    MountLabel: "",
mtrmac
left a comment
Reading this commit by commit, this looks really great — the comments about documenting locking semantics etc. are basically the final polish.
(Note to self: Eventually it might be worth re-reading the final state as is, to check whether there is any opportunity to simplify.)
Around #378 (comment) and more recently with the parent’s mapping there was some tentative discussion about checking whether the layer exists before deciding to stage it — that’s still to be decided, I think. (In c/image, commitLayer does do a layer presence check before deciding to create it — but in case it is reusing an existing local layer by extracting it into a temporary tarball to be applied, there is still quite a window in which the layer could be concurrently created. Of course, c/image can add one more check to its caller — but if we happened to take locks to read the parent’s state, a lookup for an ID already existing would be ~free.)
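The presence check discussed above has two parts: a cheap lookup before the expensive staging, and a second check under the lock because another process may have created the layer in the window between the two. A minimal sketch, assuming a hypothetical in-memory `layerStore` with a mutex in place of the real lock files:

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

var errDuplicateID = errors.New("layer ID already in use")

// layerStore is a hypothetical in-memory stand-in for the layer store.
type layerStore struct {
	mu     sync.Mutex
	layers map[string]string // layer ID -> directory
}

// lookup is the cheap presence check done before any staging work.
func (s *layerStore) lookup(id string) bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	_, ok := s.layers[id]
	return ok
}

// create checks for the ID, stages unlocked, and checks again under the
// lock before recording the layer. A real implementation would also have
// to remove the staged directory when the second check fails.
func (s *layerStore) create(id string, stage func() (string, error)) error {
	if s.lookup(id) {
		return errDuplicateID // skip the expensive staging entirely
	}
	staged, err := stage() // slow part, no lock held
	if err != nil {
		return err
	}
	s.mu.Lock()
	defer s.mu.Unlock()
	if _, ok := s.layers[id]; ok {
		return errDuplicateID // lost the race during staging
	}
	s.layers[id] = staged
	return nil
}

func main() {
	s := &layerStore{layers: map[string]string{}}
	stage := func() (string, error) { return "/staging/abc", nil }
	fmt.Println(s.create("abc", stage))
	fmt.Println(s.create("abc", stage))
}
```

The first check is purely an optimization; correctness comes from the second one, which is why the window mentioned for c/image does not disappear even with an extra check in the caller.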
e60d339 to 91fcdbb
91fcdbb to 651c8f5
    #!/usr/bin/env bpftrace

    kfunc:fcntl_setlk
    {
        $lockname = str(args->filp->f_path.dentry->d_name.name);
        if ($lockname == "storage.lock" || $lockname == "layers.lock" ||
            $lockname == "images.lock" || $lockname == "containers.lock" ) {
            @blockedTime[tid, $lockname] = nsecs;
        }
    }

    kretfunc:fcntl_setlk
    {
        $lockname = str(args->filp->f_path.dentry->d_name.name);
        if ($lockname == "storage.lock" || $lockname == "layers.lock" ||
            $lockname == "images.lock" || $lockname == "containers.lock" ) {
            // lock duration in msec
            $lock_duration = (nsecs - @blockedTime[tid, $lockname])/1000000;
            if ($lock_duration) {
                printf("blocked %s time: %lld msec\n", $lockname, $lock_duration);
            }
            delete(@blockedTime[tid, $lockname]);
            @lockholdTime[pid, args->fd] = (nsecs, $lockname);
        }
    }

    tracepoint:syscalls:sys_enter_close
    {
        $a = @lockholdTime[pid, args->fd];
        if ($a.0) {
            // lock duration in msec
            $lock_duration = (nsecs - $a.0)/1000000;
            if ($lock_duration) {
                printf("lock %s time: %lld msec\n", $a.1, $lock_duration);
            }
            delete(@lockholdTime[pid, args->fd]);
        }
    }

FYI this is the script I used to measure lock times during my demo last week, not sure if there is a decent place to store this. I guess it could be helpful in the future to find more lock contention, should I add it under hack/ maybe? |
Add a new function to stage additions. This should be used to extract the layer content into a temp directory without holding the storage lock and then under the lock just rename the directory into the final location to reduce the lock contention. Signed-off-by: Paul Holzinger <pholzing@redhat.com>
It is not clear to me when this code path would be hit; during normal layer creation we always pass a valid parent, so this branch is never reached AFAICT. Let's remove it and see if all tests still pass in podman, buildah and others... Signed-off-by: Paul Holzinger <pholzing@redhat.com>
Split out the layer permission gathering from the main create() function so it can be reused elsewhere, see the next commit. Signed-off-by: Paul Holzinger <pholzing@redhat.com>
Add a function to apply the diff into a temporary directory so we can do that unlocked and only rename under the lock. Signed-off-by: Paul Holzinger <pholzing@redhat.com>
I cannot see any reason why we should buffer the full tar split content in memory before writing it. That layer is still marked partial at this point and the store is locked, so there is no concurrent access either; thus we do not need the atomic rename here. Signed-off-by: Paul Holzinger <pholzing@redhat.com>
Split it into multiple functions to make it reusable without having a layer and so that it can be used unlocked; see the following commits. Signed-off-by: Paul Holzinger <pholzing@redhat.com>
The extraction of the tar under the store lock is a bottleneck, as many concurrent processes might hold the locks for a long time on big layers. To address this, move the layer extraction before we take the locks if possible. Currently this only works when using the overlay driver, as the implementation requires driver-specific details in order for a rename() to work. Signed-off-by: Paul Holzinger <pholzing@redhat.com>
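The staging approach in these commits boils down to "write into a temporary directory unlocked, then rename under the lock". The sketch below shows that pattern with the standard library only; `stageAndCommit` and `demo` are hypothetical names, a `sync.Mutex` stands in for the store lock, and the 0755 mode is an assumption (in the PR the driver decides the permissions).

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"sync"
)

// stageAndCommit writes files into a temporary directory inside root without
// holding lock, then takes lock only for the rename into the final location.
// Staging inside root keeps source and target on one filesystem, which
// rename needs in order to be a cheap, atomic operation.
func stageAndCommit(lock *sync.Mutex, root, id string, files map[string]string) (string, error) {
	tmp, err := os.MkdirTemp(root, "staging-")
	if err != nil {
		return "", err
	}
	for name, data := range files {
		if err := os.WriteFile(filepath.Join(tmp, name), []byte(data), 0o644); err != nil {
			os.RemoveAll(tmp)
			return "", err
		}
	}
	// MkdirTemp creates the directory with mode 0700, so the final mode has
	// to be set explicitly. 0755 is only an assumption for this sketch.
	if err := os.Chmod(tmp, 0o755); err != nil {
		os.RemoveAll(tmp)
		return "", err
	}

	final := filepath.Join(root, id)
	lock.Lock()
	defer lock.Unlock()
	if err := os.Rename(tmp, final); err != nil {
		os.RemoveAll(tmp)
		return "", err
	}
	return final, nil
}

// demo stages one file in a throwaway root and reads it back from the
// committed location.
func demo() (string, error) {
	root, err := os.MkdirTemp("", "store-")
	if err != nil {
		return "", err
	}
	defer os.RemoveAll(root)

	var lock sync.Mutex
	dir, err := stageAndCommit(&lock, root, "layer1", map[string]string{"hello.txt": "hello"})
	if err != nil {
		return "", err
	}
	data, err := os.ReadFile(filepath.Join(dir, "hello.txt"))
	return string(data), err
}

func main() {
	fmt.Println(demo())
}
```

The time spent under the lock is then independent of the layer size, which is the point of the change; the sketch does not sync the staged content to disk, which a later commit in this series adds.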
It doesn't seem needed here so don't take it. Signed-off-by: Paul Holzinger <pholzing@redhat.com>
A minor rework to enable more changes in following commits. Note the caller still must hold the layer store locks so ensure we return the layer locked and let the caller unlock instead. Signed-off-by: Paul Holzinger <pholzing@redhat.com>
Make it reusable for other callers, see next commit. Also, while at it, remove the dedupeStrings() call: as pointed out by Miloslav, the work it is doing is more expensive than just checking the same name several times, as that check is an O(1) map lookup. Also, most callers won't pass duplicated names to begin with. Signed-off-by: Paul Holzinger <pholzing@redhat.com>
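The reasoning about dropping the deduplication step can be illustrated with a plain map-based conflict check. This is a hypothetical sketch (`firstConflict` and the `byName` index are illustrative names, not the store's actual fields): a repeated name only costs one more constant-time lookup, so there is nothing to gain from building a deduplicated slice first.

```go
package main

import "fmt"

// firstConflict returns the first requested name that is already taken.
// Duplicate entries in names simply repeat an O(1) map lookup, so the
// input does not need to be deduplicated beforehand.
func firstConflict(byName map[string]string, names []string) (string, bool) {
	for _, name := range names {
		if _, taken := byName[name]; taken {
			return name, true
		}
	}
	return "", false
}

func main() {
	byName := map[string]string{"base": "layer-1"}
	fmt.Println(firstConflict(byName, []string{"new", "new", "base"}))
}
```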
The untar can be quite expensive so check for id, name conflicts right away. Also we must populate the idmappings so we extract with the right uids/gids. Signed-off-by: Paul Holzinger <pholzing@redhat.com>
This function was added in commit c577a81 and used by older drivers we no longer support, such as aufs and windows. As such this is dead code and can be removed. Signed-off-by: Paul Holzinger <pholzing@redhat.com>
It is unused in all drivers now, so it can be removed. Signed-off-by: Paul Holzinger <pholzing@redhat.com>
We use this type all the time now, so make the naming a bit more clear. Signed-off-by: Paul Holzinger <pholzing@redhat.com>
Just be safe based on the review feedback from the PR. podman-container-tools#378 Signed-off-by: Paul Holzinger <pholzing@redhat.com>
The function is just a redirection to another one so inline it directly as we do not gain anything from the extra indirection. Signed-off-by: Paul Holzinger <pholzing@redhat.com>
Also add a missing sync when we stage to ensure the content was flushed to disk. Signed-off-by: Paul Holzinger <pholzing@redhat.com>
7f04b7f to 83dd0eb
|
@mtrmac I addressed all your comments now. Podman is also happy in podman-container-tools/podman#27251 (modulo some unrelated flakes) |
Updated test script which prints some nice histograms and the max time values on exit for each lock file. It could be useful to measure other code parts as well. Let me know if you would like me to commit this. |
|
Maybe into |
Just be safe based on the review feedback from the PR. podman-container-tools#378 Signed-off-by: Paul Holzinger <pholzing@redhat.com> (cherry picked from commit 3bfe961) Signed-off-by: Paul Holzinger <pholzing@redhat.com>