nvmeof: Add mount cache and locking for safe nvme disconnect#6192
Conversation
There was a problem hiding this comment.
Pull request overview
Adds an in-memory mount cache for NVMe-oF staging mounts so NodeUnstage can avoid repeated findmnt --list calls and decide when it’s safe to disconnect NVMe controllers.
Changes:
- Changed
GetAllNVMeMountedDevices()to returndevice -> stagingTargetmappings (instead of aboolset). - Introduced
MountCache(bidirectional device/path cache) and initialized it at NodeServer startup. - Updated NodeUnstage disconnect decision flow to use the in-memory cache snapshot.
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 6 comments.
| File | Description |
|---|---|
| internal/nvmeof/util/mounter.go | Changes mounted-device discovery to return device→mountpoint mapping. |
| internal/nvmeof/util/mounter_cache.go | Adds new mutex-protected mount cache with bidirectional maps. |
| internal/nvmeof/nvmeof_initiator.go | Updates disconnect interfaces/helpers to accept the new mounted-devices map type. |
| internal/nvmeof/nodeserver/nodeserver.go | Initializes cache on startup; updates stage/unstage flows to maintain and consult the cache for disconnect decisions. |
| transaction.isMounted = true | ||
|
|
||
| // Resolve real device and update cache | ||
| realDevice, err := nvmeutil.GetDeviceFromMountpoint(ctx, stagingTargetPath) | ||
| if err != nil { | ||
| log.WarningLog(ctx, "failed to resolve device: %v", err) | ||
| // Fallback - try to continue | ||
| realDevice = devicePath | ||
| } | ||
|
|
||
| ns.mountCache.Add(realDevice, stagingTargetPath) |
There was a problem hiding this comment.
Between the successful mount (transaction.isMounted = true) and the later ns.mountCache.Add call, there is a window where the namespace is mounted in the system but missing from mountCache. Because NodeStage/NodeUnstage are only locked per-volume (volumeLocks), a concurrent unstage of a different namespace in the same subsystem could take a snapshot that omits this mount and incorrectly disconnect shared controllers. To avoid false negatives, consider updating mountCache immediately after the mount succeeds (using the best-available device identity) or adding a safety validation (e.g., re-check mounts via findmnt) before disconnect decisions.
| transaction.isMounted = true | |
| // Resolve real device and update cache | |
| realDevice, err := nvmeutil.GetDeviceFromMountpoint(ctx, stagingTargetPath) | |
| if err != nil { | |
| log.WarningLog(ctx, "failed to resolve device: %v", err) | |
| // Fallback - try to continue | |
| realDevice = devicePath | |
| } | |
| ns.mountCache.Add(realDevice, stagingTargetPath) | |
| // Immediately record the mount in the cache using the best-available device identity | |
| ns.mountCache.Add(devicePath, stagingTargetPath) | |
| // Resolve real device and refine cache entry if possible | |
| realDevice, err := nvmeutil.GetDeviceFromMountpoint(ctx, stagingTargetPath) | |
| if err != nil { | |
| log.WarningLog(ctx, "failed to resolve device: %v", err) | |
| // Fallback - keep using devicePath for cache | |
| } else if realDevice != "" && realDevice != devicePath { | |
| // Update cache to use the canonical device identifier for this mountpoint | |
| ns.mountCache.RemoveByMountPoint(stagingTargetPath) | |
| ns.mountCache.Add(realDevice, stagingTargetPath) | |
| } | |
| transaction.isMounted = true |
There was a problem hiding this comment.
Because NodeStage/NodeUnstage are only locked per-volume (volumeLocks), a concurrent unstage of a different namespace in the same subsystem could take a snapshot that omits this mount and incorrectly disconnect shared controllers
As I said in the PR description, in the next PR I will add group lock(which locks per group type stage\unstage) . Once this PR will be merged: #6183
8673fba to
077fce8
Compare
077fce8 to
9558efb
Compare
| // MountCache is a thread-safe cache that maintains the mapping between staging paths and devices. | ||
| // Note: This map is 1:1, meaning each staging path corresponds to exactly one device. | ||
| // device can only be mounted to one staging path at a time, and each staging path can only have one device. | ||
| type MountCache struct { |
There was a problem hiding this comment.
Can you provide a MountCache interface and mountCache struct that contains the actual implementation?
| } | ||
|
|
||
| // Initialize mounted devices cache on startup to ensure we have an accurate view | ||
| mountedDevices, err := getNVMeMountedDevices(context.Background()) |
There was a problem hiding this comment.
Make getNVMeMountedDevices() a function of NodeServer (call it initNVMeMountedDevices()?) and add the path/device directly in that function. This removes the for-loop a little below and makes this function a little smaller.
5e46539 to
6d03f95
Compare
53bba11 to
2fe4143
Compare
2fe4143 to
0803810
Compare
0803810 to
353a958
Compare
|
@Mergifyio rebase |
353a958 to
2a614e0
Compare
|
Deprecation notice: This pull request comes from a fork and was rebased using |
✅ Branch has been successfully rebased |
|
/test ci/centos/mini-e2e/k8s-1.35 |
|
@Mergifyio rebase |
|
@Mergifyio queue |
☑️ Nothing to do, the required conditions are not metDetails
|
Merge Queue Status🛑 Queue command has been cancelled |
|
/test ci/centos/k8s-e2e-external-storage/1.35 |
|
/test ci/centos/k8s-e2e-external-storage/1.34 |
|
/test ci/centos/upgrade-tests-cephfs |
|
/test ci/centos/mini-e2e-helm/k8s-1.35 |
|
/test ci/centos/upgrade-tests-rbd |
|
/test ci/centos/mini-e2e-helm/k8s-1.34 |
|
/test ci/centos/k8s-e2e-external-storage/1.33 |
|
/test ci/centos/mini-e2e/k8s-1.35 |
|
/test ci/centos/mini-e2e/k8s-1.34 |
|
/test ci/centos/mini-e2e-helm/k8s-1.33 |
|
/test ci/centos/mini-e2e/k8s-1.33 |
|
Deprecation notice: This pull request comes from a fork and was queued with |
Merge Queue Status
This pull request spent 15 seconds in the queue, including 1 second running CI. Required conditions to merge
|
Describe what this PR does
Adds an in-memory cache to track which nvme devices are currently mounted at staging paths. This eliminates expensive
findmnt --listsystem calls on every unstage operation.Structure:
The cache maintains bidirectional maps for fast lookups:
pathToDevice: staging path -> device pathdeviceToPath: device path -> staging path (reverse)Initialization:
Cache is populated at node server startup by scanning existing mounts, then kept up to date as volumes are staged and unstaged.
Thread Safety:
Protected by
sync.Mutex. We use a simple mutex instead ofsync.RWMutexbecause:Why copying the map is safe:
When
GetAllDevices()returns a copy of the cache, callers get a consistent snapshot at that moment. The copy happens after cache updates (first, the pod unmounts, then deletes itself, and last fetch the copy of cache), so concurrentunstageoperations each see current state. Even if two unstages run simultaneously, each removes its device from the cache first, then gets a fresh snapshot showing remaining devices.So the last of them who call to
GetAllDevices()will get the update list, and will know if disconnect or not. (NodeStage cannot run at this time !!)Related PR##
after this pr: #6183 will be merged, group locking (between NodeStage\NodeUnStage) will added.
Future concerns
next step is to add group lock after #6183 will be merged.
Checklist:
guidelines in the developer
guide.
Request
notes
updated with breaking and/or notable changes for the next major release.
Show available bot commands
These commands are normally not required, but in case of issues, leave any of
the following bot commands in an otherwise empty comment in this PR:
/retest ci/centos/<job-name>: retest the<job-name>after unrelatedfailure (please report the failure too!)