Skip to content

nvmeof: fix bug in cleanup NVMe-oF resources #5936

Merged
mergify[bot] merged 5 commits into
ceph:develfrom
gadididi:nvmeof/fix_listener_bug
Jan 20, 2026
Merged

nvmeof: fix bug in cleanup NVMe-oF resources #5936
mergify[bot] merged 5 commits into
ceph:develfrom
gadididi:nvmeof/fix_listener_bug

Conversation

@gadididi

Copy link
Copy Markdown
Contributor

Describe what this PR does

Problem

  1. When CreateVolume fails during NVMe-oF resource setup, the cleanup defer does not execute properly, leaving orphaned subsystems (instead of removing it) on the gateway that cause subsequent volume creation attempts to success (but should not success).

Root Cause

The issue occurred in the following sequence:

  1. CreateVolume creates an RBD volume successfully
  2. createNVMeoFResources() is called to set up NVMe-oF resources:
    • Creates subsystem successfully
    • Attempts to create listener - FAILS (e.g., "Invalid parameters")
  3. createNVMeoFResources() returns nil, error <-- Bug
  4. Deferred cleanup in CreateVolume was not called at all! <-- Another Bug
  5. Cleanup never executes - subsystem remains on gateway
  6. Retry attempt sees existing subsystem and assumes it's ready, but it's actually incomplete (missing listeners) <-- Another Bug
  7. Volume creation succeeds on retry but with incorrect configuration (missing the listeners..)

Example from logs:

I0114 14:09:57.263554       1 utils.go:350] ID: 15 Req-ID: pvc-ba4f407d-fda1-4a11-afd1-d944eb4971c4 GRPC call: /csi.v1.Controller/CreateVolume
I0114 14:09:57.466641       1 nvmeof.go:248] ID: 15 Req-ID: pvc-ba4f407d-fda1-4a11-afd1-d944eb4971c4 Creating NVMe subsystem: nqn.2016-06.io.spdk:cnode1.rook-ceph on gateway rook-ceph-nvmeof-nvmeof-0.rook-ceph.svc:5500
I0114 14:09:57.490502       1 nvmeof.go:277] ID: 15 Req-ID: pvc-ba4f407d-fda1-4a11-afd1-d944eb4971c4 Subsystem created successfully: nqn.2016-06.io.spdk:cnode1.rook-ceph
I0114 14:09:57.490521       1 controllerserver.go:569] ID: 15 Req-ID: pvc-ba4f407d-fda1-4a11-afd1-d944eb4971c4 Creating listener 0: rook-ceph-nvmeof-nvmeof-0.rook-ceph.svc:4420 (rook-ceph-nvmeof-nvmeof-0)
I0114 14:09:57.490525       1 nvmeof.go:367] ID: 15 Req-ID: pvc-ba4f407d-fda1-4a11-afd1-d944eb4971c4 Adding listener rook-ceph-nvmeof-nvmeof-0.rook-ceph.svc to subsystem nqn.2016-06.io.spdk:cnode1.rook-ceph
E0114 14:09:57.509716       1 controllerserver.go:152] ID: 15 Req-ID: pvc-ba4f407d-fda1-4a11-afd1-d944eb4971c4 NVMe-oF resource setup failed for volumeID 0001-0009-rook-ceph-0000000000000002-6a7447ca-ce4b-4fcf-a123-4ce22d9d5505: subsystem setup failed: failed to create listener 0 (rook-ceph-nvmeof-nvmeof-0.rook-ceph.svc:4420 (rook-ceph-nvmeof-nvmeof-0)): gateway AddListener returned error (status=32602): Failure adding nqn.2016-06.io.spdk:cnode1.rook-ceph listener at rook-ceph-nvmeof-nvmeof-0.rook-ceph.svc:4420: Invalid parameters
I0114 14:09:57.512304       1 omap.go:89] ID: 15 Req-ID: pvc-ba4f407d-fda1-4a11-afd1-d944eb4971c4 got omap values: (pool="nvmeofpool", namespace="", name="csi.volume.6a7447ca-ce4b-4fcf-a123-4ce22d9d5505"): map[csi.imageid:12971c079d96 csi.imagename:csi-vol-6a7447ca-ce4b-4fcf-a123-4ce22d9d5505 csi.volname:pvc-ba4f407d-fda1-4a11-afd1-d944eb4971c4 csi.volume.owner:default]
I0114 14:09:57.533878       1 rbd_util.go:1548] ID: 15 Req-ID: pvc-ba4f407d-fda1-4a11-afd1-d944eb4971c4 rbd: snap rm nvmeofpool/csi-vol-6a7447ca-ce4b-4fcf-a123-4ce22d9d5505-temp@csi-vol-6a7447ca-ce4b-4fcf-a123-4ce22d9d5505 using mon 
I0114 14:09:57.536827       1 rbd_util.go:682] ID: 15 Req-ID: pvc-ba4f407d-fda1-4a11-afd1-d944eb4971c4 rbd: delete csi-vol-6a7447ca-ce4b-4fcf-a123-4ce22d9d5505-temp using mon 10.101.55.5:6789, pool nvmeofpool
I0114 14:09:57.538643       1 controllerserver.go:1106] ID: 15 Req-ID: pvc-ba4f407d-fda1-4a11-afd1-d944eb4971c4 deleting image csi-vol-6a7447ca-ce4b-4fcf-a123-4ce22d9d5505
I0114 14:09:57.538719       1 rbd_util.go:682] ID: 15 Req-ID: pvc-ba4f407d-fda1-4a11-afd1-d944eb4971c4 rbd: delete csi-vol-6a7447ca-ce4b-4fcf-a123-4ce22d9d5505 using mon 10.101.55.5:6789, pool nvmeofpool
I0114 14:09:57.557466       1 rbd_util.go:728] ID: 15 Req-ID: pvc-ba4f407d-fda1-4a11-afd1-d944eb4971c4 rbd: adding task to remove image "nvmeofpool/csi-vol-6a7447ca-ce4b-4fcf-a123-4ce22d9d5505" with id "12971c079d96" from trash
I0114 14:09:57.565129       1 rbd_util.go:756] ID: 15 Req-ID: pvc-ba4f407d-fda1-4a11-afd1-d944eb4971c4 rbd: successfully added task to move image "nvmeofpool/csi-vol-6a7447ca-ce4b-4fcf-a123-4ce22d9d5505" with id "12971c079d96" to trash
I0114 14:09:57.569464       1 omap.go:126] ID: 15 Req-ID: pvc-ba4f407d-fda1-4a11-afd1-d944eb4971c4 removed omap keys (pool="nvmeofpool", namespace="", name="csi.volumes.default"): [csi.volume.pvc-ba4f407d-fda1-4a11-afd1-d944eb4971c4]
E0114 14:09:57.569599       1 utils.go:355] ID: 15 Req-ID: pvc-ba4f407d-fda1-4a11-afd1-d944eb4971c4 GRPC error: rpc error: code = Internal desc = NVMe-oF setup failed: subsystem setup failed: failed to create listener 0 (rook-ceph-nvmeof-nvmeof-0.rook-ceph.svc:4420 (rook-ceph-nvmeof-nvmeof-0)): gateway AddListener returned error (status=32602): Failure adding nqn.2016-06.io.spdk:cnode1.rook-ceph listener at rook-ceph-nvmeof-nvmeof-0.rook-ceph.svc:4420: Invalid parameters
I0114 14:09:58.077290       1 utils.go:350] ID: 16 Req-ID: pvc-ba4f407d-fda1-4a11-afd1-d944eb4971c4 GRPC call: /csi.v1.Controller/CreateVolume
...
...
I0114 14:09:58.126102       1 controllerserver.go:1115] ID: 16 Req-ID: pvc-ba4f407d-fda1-4a11-afd1-d944eb4971c4 Connected to the gateway rook-ceph-nvmeof-nvmeof-0.rook-ceph.svc:5500
I0114 14:09:58.126154       1 nvmeof.go:307] ID: 16 Req-ID: pvc-ba4f407d-fda1-4a11-afd1-d944eb4971c4 Checking if subsystem nqn.2016-06.io.spdk:cnode1.rook-ceph exists on gateway rook-ceph-nvmeof-nvmeof-0.rook-ceph.svc:5500
I0114 14:09:58.133641       1 nvmeof.go:323] ID: 16 Req-ID: pvc-ba4f407d-fda1-4a11-afd1-d944eb4971c4 Subsystem nqn.2016-06.io.spdk:cnode1.rook-ceph exists
I0114 14:09:58.133669       1 controllerserver.go:699] ID: 16 Req-ID: pvc-ba4f407d-fda1-4a11-afd1-d944eb4971c4 subsystem nqn.2016-06.io.spdk:cnode1.rook-ceph and Listener [rook-ceph-nvmeof-nvmeof-0.rook-ceph.svc:4420 (rook-ceph-nvmeof-nvmeof-0)] for the subsystem were created
I0114 14:09:58.133686       1 nvmeof.go:112] ID: 16 Req-ID: pvc-ba4f407d-fda1-4a11-afd1-d944eb4971c4 Creating namespace for RBD nvmeofpool/csi-vol-df56be2d-81ee-45a9-97d1-f66152770812 in subsystem nqn.2016-06.io.spdk:cnode1.rook-ceph
I0114 14:09:58.801472       1 nvmeof.go:143] ID: 16 Req-ID: pvc-ba4f407d-fda1-4a11-afd1-d944eb4971c4 Namespace created with NSID: 1

you can see from the log that in the second try (by CSI mechanism) it passed and subsystem existed already, but it should not!!

Solution

1. Return nvmeofData on Error (Main Fix)

Modified createNVMeoFResources() to return nvmeofData instead of nil when errors occur after the struct is initialized:

// Before
if err := ensureSubsystem(...); err != nil {
    return nil, fmt.Errorf("subsystem setup failed: %w", err)  // ❌ Returns nil
}

// After
if err := ensureSubsystem(...); err != nil {
    return nvmeofData, fmt.Errorf("subsystem setup failed: %w", err)  // ✅ Returns nvmeofData
}

Why this fixes it: The cleanup defer now receives valid nvmeofData containing gateway connection info and subsystem details, allowing it to properly clean up the orphaned subsystem.

2. Add NamespaceID Guard

Added check in cleanupNVMeoFResources() to skip namespace deletion when it was never created:

if nvmeofData.NamespaceID > 0 {
    // Delete namespace
} else {
    log.DebugLog(ctx, "No namespace to delete (NamespaceID=0)")
}

Why this is needed: Since we now return nvmeofData even when namespace creation fails, NamespaceID may still be 0 (the default value). Attempting to delete namespace 0 would fail.

3. Fix ensureSubsystem to Create Listeners Even if Subsystem Exists

Modified ensureSubsystem() to attempt listener creation even when the subsystem already exists:

// Before
if !exists {
    err = gateway.CreateSubsystem(ctx, subsystemNQN, networkMask)
    if err != nil {
        return err
    }
    
    // Only create listeners if we just created the subsystem
    if networkMask == "" {
        for i, listener := range listeners { ... }
    }
}

// After
if !exists {
    err = gateway.CreateSubsystem(ctx, subsystemNQN, networkMask)
    if err != nil {
        return err
    }
}

// Create listeners regardless of whether subsystem was just created
// (handles retry scenarios and multiple PVCs with different listeners on same subsystem)
if networkMask == "" {
    for i, listener := range listeners { ... }
}

4. The defer was declared AFTER the call that could fail

// ❌ WRONG - defer declared after the call
nvmeofData, err = cs.createNVMeoFResources(...)
if err != nil {
    return nil, status.Errorf(...)  // Returns here, defer below never registered!
}
defer func() {  // ← This line never reached on error!
    if err == nil {
        return
    }
    cleanupErr := cs.cleanupNVMeoFResources(ctx, nvmeofData)
}()

Moved the cleanup defer to be declared BEFORE the call to createNVMeoFResources():

// ✅ CORRECT - defer declared before the call
var nvmeofData *nvmeof.NVMeoFVolumeData

defer func() {
    if err != nil && nvmeofData != nil {  // Added nil check
        cleanupErr := cs.cleanupNVMeoFResources(ctx, nvmeofData)
        if cleanupErr != nil {
            log.ErrorLog(ctx, "failed to cleanup NVMe-oF resources: %v", cleanupErr)
        }
    }
}()

nvmeofData, err = cs.createNVMeoFResources(...)
if err != nil {
    return nil, status.Errorf(...)  // Defer IS registered and will execute
}

Note: The RBD cleanup defer remains in its current position (after CreateVolume) because:

  • If RBD creation fails, nothing was created → no cleanup needed
  • If RBD creation succeeds, we need volumeID from the response for cleanup
  • NVMe-oF is different because partial resources (subsystem) can exist even on failure

Checklist:

  • Commit Message Formatting: Commit titles and messages follow
    guidelines in the developer
    guide
    .
  • Reviewed the developer guide on Submitting a Pull
    Request
  • Pending release
    notes

    updated with breaking and/or notable changes for the next major release.
  • Documentation has been updated, if necessary.
  • Unit tests have been added, if necessary.
  • Integration tests have been added, if necessary.

Show available bot commands

These commands are normally not required, but in case of issues, leave any of
the following bot commands in an otherwise empty comment in this PR:

  • /retest ci/centos/<job-name>: retest the <job-name> after unrelated
    failure (please report the failure too!)

@gadididi gadididi requested a review from nixpanic January 15, 2026 13:47
@gadididi gadididi self-assigned this Jan 15, 2026
@gadididi gadididi added the component/nvme-of Issues and PRs related to NVMe-oF. label Jan 15, 2026
@mergify mergify Bot added the bug Something isn't working label Jan 15, 2026
@gadididi gadididi force-pushed the nvmeof/fix_listener_bug branch from 6b69715 to 52cbada Compare January 15, 2026 15:25
@gadididi gadididi marked this pull request as ready for review January 15, 2026 15:25
@gadididi gadididi requested a review from Madhu-1 January 15, 2026 15:26
Comment thread internal/nvmeof/nvmeof.go
Comment thread internal/nvmeof/controller/controllerserver.go
Comment thread internal/nvmeof/controller/controllerserver.go
add another case, if the subsystem of the rquested
ns (that we want to remove) does not exist, dont raise an error.

Signed-off-by: gadi-didi <gadi.didi@ibm.com>
@gadididi gadididi force-pushed the nvmeof/fix_listener_bug branch from 52cbada to 3c27975 Compare January 18, 2026 08:05
When CreateVolume fails during NVMe-oF resource setup
(e.g., listener creation failure), the cleanup defer was
not executing properly because createNVMeoFResources()
returned nil for nvmeofData on error. This left
orphaned subsystems on the gateway that prevented subsequent volume
creation attempts.
Changes:
- Modified createNVMeoFResources() to return nvmeofData even on error
  (instead of nil), allowing cleanup to access gateway connection info
  and subsystem details
- Added NamespaceID > 0 check in cleanupNVMeoFResources() to skip
  namespace deletion when it was never created (NamespaceID defaults
  to 0)

Signed-off-by: gadi-didi <gadi.didi@ibm.com>
when there is some failure to remove some listener,
it is ok to just print warning, because delete the subsystem will
delete the listener.

Signed-off-by: gadi-didi <gadi.didi@ibm.com>
before this change,if the subsystem exists,it would not
create listeners. but maybe another pvc on same subsystem
but with different listener\s was requested, it would not
create the listener\s (in manual mode of adding the listeners)
now, even the subsystem exists, if we are in manual listener mode
try to add them (if they exist, it is ok)

Signed-off-by: gadi-didi <gadi.didi@ibm.com>
the defer in CreateVolume() for cleanup the
NVMe-oF resoureces was in incorrect place,
so in case of createNVMeoFResources() failure
it was not called at all.

Signed-off-by: gadi-didi <gadi.didi@ibm.com>
@gadididi gadididi force-pushed the nvmeof/fix_listener_bug branch from 3c27975 to 8ebc1c2 Compare January 18, 2026 08:22
@gadididi gadididi requested a review from nixpanic January 18, 2026 08:25

@nixpanic nixpanic left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, looks good to me.

// Step 3: Ensure subsystem exists (and listener)
if err := ensureSubsystem(ctx, gateway, nvmeofData.SubsystemNQN, networkMask, nvmeofData.ListenerInfo); err != nil {
return nil, fmt.Errorf("subsystem setup failed: %w", err)
return nvmeofData, fmt.Errorf("subsystem setup failed: %w", err)

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it is nicer to not return nvmeofData in case of a failure. That would mean a cleanup needs to be done in this function too. It should make the code simpler to understand, but it also adds a little more complexity in this function 🤔

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@nixpanic , I wanted to reuse the cleanupNVMeoFResources() from DeleteVolume().

So, if I return nil it means here:

	defer func() {
		// Cleanup NVMe-oF resources on subsequent errors.
		// only if there was an error and nvmeofData is not nil, it means resources were created.
		if err != nil && nvmeofData != nil {
			log.DebugLog(ctx, "Cleaning up NVMe-oF resources for volume %q due to error: %v", volumeID, err)
			cleanupErr := cs.cleanupNVMeoFResources(ctx, nvmeofData)
			if cleanupErr != nil {
				log.ErrorLog(ctx, "failed to cleanup NVMe-oF resources for volume  %q: %v", volumeID, cleanupErr)
			}
		}
	}()

I cannot to send to cs.cleanupNVMeoFResources() the nvmeofData because it will be empty. So, I will need to change the cleanupNVMeoFResources() to get subsystemNqn, NSid , and ListenerInfo..
(and the cleanupNVMeoFResources() that is called in the DeleteVolume() as well)
what do you think is the best way to clean in case of partial initialization?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My preference is to have each function cleanup the resources it allocated itself. That also means there is less re-use possible of functions like cleanupNVMeoFResources(). To structure it the way I would prefer, each function that allocates resources would probably get a cleanup-helper function, which then can get called from cleanupNVMeoFResources().

But I'm not sure that it really results in a simpler/cleaner approach... The current way is good enough for now, we can try to make it better maintainable later on.

@nixpanic nixpanic requested a review from a team January 19, 2026 09:02
@Madhu-1 Madhu-1 added the ci/skip/e2e skip running e2e CI jobs label Jan 20, 2026
@Madhu-1

Madhu-1 commented Jan 20, 2026

Copy link
Copy Markdown
Collaborator

@Mergifyio queue

@mergify

mergify Bot commented Jan 20, 2026

Copy link
Copy Markdown
Contributor

queue

✅ The pull request has been merged automatically

Details

The pull request has been merged automatically at 6b688a9

@mergify

mergify Bot commented Jan 20, 2026

Copy link
Copy Markdown
Contributor

Merge Queue Status

✅ The pull request has been merged at 8ebc1c2

This pull request spent 35 minutes 5 seconds in the queue, including 34 minutes 47 seconds running CI.
The checks were run on draft #5948.

Required conditions to merge
  • #approved-reviews-by >= 2 [🛡 GitHub branch protection]
  • #changes-requested-reviews-by = 0 [🛡 GitHub branch protection]
  • any of:
    • all of:
      • base=devel
      • status-success=codespell
      • status-success=go-test
      • status-success=golangci-lint
      • status-success=lint-extras
      • status-success=mod-check
      • status-success=multi-arch-build
      • status-success=uncommitted-code-check
      • any of:
        • label=ci/skip/e2e
        • all of:
          • status-success=ci/centos/k8s-e2e-external-storage/1.32
          • status-success=ci/centos/k8s-e2e-external-storage/1.33
          • status-success=ci/centos/k8s-e2e-external-storage/1.34
          • status-success=ci/centos/mini-e2e-helm/k8s-1.32
          • status-success=ci/centos/mini-e2e-helm/k8s-1.33
          • status-success=ci/centos/mini-e2e-helm/k8s-1.34
          • status-success=ci/centos/mini-e2e/k8s-1.32
          • status-success=ci/centos/mini-e2e/k8s-1.33
          • status-success=ci/centos/mini-e2e/k8s-1.34
          • status-success=ci/centos/upgrade-tests-cephfs
          • status-success=ci/centos/upgrade-tests-rbd
    • all of:
      • base~=^(release-.+)$
      • status-success=codespell
      • status-success=go-test
      • status-success=golangci-lint
      • status-success=lint-extras
      • status-success=mod-check
      • status-success=multi-arch-build
      • status-success=uncommitted-code-check
      • any of:
        • label=ci/skip/e2e
        • all of:
          • status-success=ci/centos/k8s-e2e-external-storage/1.32
          • status-success=ci/centos/k8s-e2e-external-storage/1.33
          • status-success=ci/centos/k8s-e2e-external-storage/1.34
          • status-success=ci/centos/mini-e2e-helm/k8s-1.32
          • status-success=ci/centos/mini-e2e-helm/k8s-1.33
          • status-success=ci/centos/mini-e2e-helm/k8s-1.34
          • status-success=ci/centos/mini-e2e/k8s-1.32
          • status-success=ci/centos/mini-e2e/k8s-1.33
          • status-success=ci/centos/mini-e2e/k8s-1.34
          • status-success=ci/centos/upgrade-tests-cephfs
          • status-success=ci/centos/upgrade-tests-rbd
    • all of:
      • base=release-v3.15
      • status-success=codespell
      • status-success=go-test
      • status-success=golangci-lint
      • status-success=lint-extras
      • status-success=mod-check
      • status-success=multi-arch-build
      • status-success=uncommitted-code-check
      • any of:
        • label=ci/skip/e2e
        • all of:
          • status-success=ci/centos/k8s-e2e-external-storage/1.31
          • status-success=ci/centos/mini-e2e-helm/k8s-1.31
          • status-success=ci/centos/mini-e2e/k8s-1.31
          • status-success=ci/centos/k8s-e2e-external-storage/1.32
          • status-success=ci/centos/k8s-e2e-external-storage/1.33
          • status-success=ci/centos/mini-e2e-helm/k8s-1.32
          • status-success=ci/centos/mini-e2e-helm/k8s-1.33
          • status-success=ci/centos/mini-e2e/k8s-1.32
          • status-success=ci/centos/mini-e2e/k8s-1.33
          • status-success=ci/centos/upgrade-tests-cephfs
          • status-success=ci/centos/upgrade-tests-rbd
    • all of:
      • base=ci/centos
      • status-success=ci/centos/jjb-validate
      • status-success=ci/centos/job-validation

@mergify mergify Bot added the queued label Jan 20, 2026
mergify Bot added a commit that referenced this pull request Jan 20, 2026
@mergify mergify Bot merged commit 6b688a9 into ceph:devel Jan 20, 2026
21 of 22 checks passed
@mergify mergify Bot removed the queued label Jan 20, 2026
@gadididi gadididi deleted the nvmeof/fix_listener_bug branch January 20, 2026 11:21
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Something isn't working ci/skip/e2e skip running e2e CI jobs component/nvme-of Issues and PRs related to NVMe-oF.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants