CephCSI GetMetadataAllocated returns empty stream (0 blocks) despite rbd diff showing data #2

@kaovilai

Problem

CephCSI's GetMetadataAllocated gRPC call returns an empty stream (0 blocks, volumeCapacityBytes=0, blockMetadataType=UNKNOWN), yet the CSI driver logs "successfully streamed metadata allocated". Running rbd diff directly against the Ceph cluster returns the expected data.

Environment

  • OCP: 4.21.7 (Kubernetes 1.34)
  • ODF: 4.21
  • CephCSI: release-4.21 (commit ad878c6a)
  • Ceph: Tentacle 20.1.0-159.el9cp
  • librbd: 20.1.0-159.el9cp (has rbd_diff_iterate3)
  • go-ceph: red-hat-storage/go-ceph v0.32.1-0.20260109062642-357605f36918

Reproduction

  1. Create PVC with ocs-storagecluster-ceph-rbd StorageClass
  2. Write data to the PVC (e.g., 10MB via dd)
  3. Create VolumeSnapshot
  4. Call GetMetadataAllocated via the external-snapshot-metadata iterator API

Evidence

CephCSI sidecar logs (empty stream received)

get_metadata_allocated.go:69] "calling CSI driver" snapshotId="0001-0011-..."
get_metadata_allocated.go:154] "stream EOF" blockMetadataType="UNKNOWN" lastByteOffset=0 lastSize=0 lastResponseNum=0 volumeCapacityBytes=0

CephCSI driver logs (claims success)

utils.go:329] GRPC call: /csi.v1.SnapshotMetadata/GetMetadataAllocated
omap.go:89] got omap values: map[csi.imageid:... csi.imagename:csi-snap-... csi.source:csi-vol-...]
sms_controllerserver.go:145] successfully streamed metadata allocated

rbd diff works at Ceph level ✅

# Direct rbd diff on clone image
$ rbd diff ocs-storagecluster-cephblockpool/csi-snap-... --format json
[{"offset":0,"length":2691072,"exists":"true"},{"offset":16777216,"length":618496,"exists":"true"},...] # 7 blocks

# Python rbd.diff_iterate (rbd_diff_iterate2) on source image at snap ID 21
$ python3 -c "image.set_snap_by_id(21); image.diff_iterate(0, size, None, cb)"
Got 7 blocks ✅

# C-level rbd_diff_iterate3 via ctypes (exact same call as CephCSI)
$ python3 -c "diff_iterate3(image, 0, 0, size, 0, callback, None)"
Got 7 blocks ✅
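
To compare the Ceph-level result with what the gRPC stream returns, the `rbd diff --format json` output can be summarized programmatically. The following is a minimal sketch in Go using only the standard library. The `extent` type and `summarize` function are names invented here, and the sample input contains only the first two extents shown above, since the remaining five are elided in the report.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// extent mirrors one element of the `rbd diff --format json` output shown above.
// Note that "exists" is emitted as a string, not a JSON boolean.
type extent struct {
	Offset uint64 `json:"offset"`
	Length uint64 `json:"length"`
	Exists string `json:"exists"`
}

// summarize returns the number of allocated extents and their total size in bytes.
func summarize(raw []byte) (count int, total uint64, err error) {
	var extents []extent
	if err = json.Unmarshal(raw, &extents); err != nil {
		return 0, 0, err
	}
	for _, e := range extents {
		if e.Exists == "true" {
			count++
			total += e.Length
		}
	}
	return count, total, nil
}

func main() {
	// Only the two extents that are visible in the report; the rest are elided there.
	sample := []byte(`[{"offset":0,"length":2691072,"exists":"true"},{"offset":16777216,"length":618496,"exists":"true"}]`)
	count, total, err := summarize(sample)
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	fmt.Printf("extents=%d bytes=%d\n", count, total)
}
```

Any non-zero extent count here, set against lastResponseNum=0 in the sidecar log, confirms that the data is lost somewhere between librbd and the gRPC stream.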

Analysis

Code path (CephCSI release-4.21)

  1. sms_controllerserver.go:GetMetadataAllocated → mgr.GetSnapshotByID() → genSnapFromSnapID()
  2. genSnapFromSnapID() reads omap journal to populate:
    • RbdImageName = csi.source (source PVC image)
    • RbdSnapName = csi.imagename (clone image)
  3. updateSnapshotDetails() → toVolume() → getImageInfo() on clone → sets VolSize
  4. ProcessMetadata():
    • Opens source image (rbdSnap.open())
    • Gets parent snap ID from clone (getRBDSnapID())
    • Sets snap context on source image (image.SetSnapByID(snapID))
    • Calls image.DiffIterateByID() with Length=VolSize, FromSnapID=0

Key finding

The C-level rbd_diff_iterate3 function works correctly with identical parameters, so the issue is most likely in the Go binding layer (red-hat-storage/go-ceph) or in how CephCSI invokes it. The DiffIterateByID Go method uses dlsym to load rbd_diff_iterate3 and calls it via a C wrapper function.

Hypotheses (in order of likelihood)

  1. Go callback not invoked: The CGo callback (//export diffIterateByIDCallback) might never be called by rbd_diff_iterate3 because of a function pointer casting issue in the dlsym wrapper
  2. VolSize incorrectly 0: If updateSnapshotDetails() → getImageInfo() fails silently, VolSize would be 0, causing DiffIterateByIDConfig.Length=0 (0 bytes to scan). However, the startingOffset >= volSize check should catch this
  3. Wrong image context: If rbdSnap.open() opens a different image than expected (e.g., due to a RadosNamespace or pool mismatch), the diff would run against the wrong image and could return no extents
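
Hypothesis 2 can be illustrated with a toy model. The sketch below is not go-ceph code: `diffIterate`, `extent`, `countBlocks`, and the in-memory extent list are hypothetical stand-ins. It assumes an iterator that scans `[offset, offset+length)` and returns success even when that range is empty. Under that assumption, a length of 0 yields zero callbacks and a nil error, which would be consistent with a "successfully streamed" log alongside an empty stream.

```go
package main

import "fmt"

// extent is an allocated region of a hypothetical image.
type extent struct{ Offset, Length uint64 }

// diffIterate is a toy stand-in for a diff iterator (NOT the go-ceph API).
// It invokes cb for every allocated extent that overlaps [offset, offset+length)
// and returns nil even if no extent was visited.
func diffIterate(allocated []extent, offset, length uint64, cb func(off, n uint64)) error {
	end := offset + length
	for _, e := range allocated {
		if e.Offset < end && e.Offset+e.Length > offset {
			cb(e.Offset, e.Length)
		}
	}
	return nil
}

// countBlocks reports how many extents the iterator visits when scanning
// from offset 0 for the given length.
func countBlocks(allocated []extent, length uint64) int {
	n := 0
	_ = diffIterate(allocated, 0, length, func(_, _ uint64) { n++ })
	return n
}

func main() {
	// The two extents visible in the rbd diff output of this report.
	allocated := []extent{{0, 2691072}, {16777216, 618496}}
	fmt.Println("length=0:", countBlocks(allocated, 0))
	fmt.Println("length=32MiB:", countBlocks(allocated, 32<<20))
}
```

Whether the real binding behaves this way for a zero length is exactly what the debug logging proposed under Next Steps would need to confirm.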

Next Steps

  • Build a minimal Go test binary that reproduces the exact CephCSI code path with go-ceph DiffIterateByID
  • Add debug logging to snap_diff.go to print VolSize, snapID, and the image name before the DiffIterateByID call
  • Check if upstream ceph-csi has the same issue or if it's specific to the Red Hat fork
  • File upstream bug if confirmed
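
The debug-logging step could be paired with an explicit sanity check so that a zero-sized scan fails loudly instead of producing an empty stream. The following is a sketch with hypothetical names (`diffParams`, `check`); the fields mirror the values listed above (VolSize, snap ID, image name) but do not correspond to actual CephCSI types, and the image name in `main` is a placeholder.

```go
package main

import (
	"errors"
	"fmt"
)

// diffParams groups the values worth logging before the diff call.
// This is a hypothetical type for illustration, not a CephCSI type.
type diffParams struct {
	ImageName string
	SnapID    uint64
	VolSize   uint64
}

// String renders the parameters as a single log-friendly line.
func (p diffParams) String() string {
	return fmt.Sprintf("image=%q snapID=%d volSize=%d", p.ImageName, p.SnapID, p.VolSize)
}

// check rejects parameter sets that would make the diff scan nothing.
func (p diffParams) check() error {
	if p.ImageName == "" {
		return errors.New("empty image name")
	}
	if p.VolSize == 0 {
		return errors.New("volSize is 0: nothing would be scanned")
	}
	return nil
}

func main() {
	// Placeholder values; a zero VolSize models hypothesis 2.
	p := diffParams{ImageName: "example-image", SnapID: 21, VolSize: 0}
	fmt.Println("diff params:", p)
	if err := p.check(); err != nil {
		fmt.Println("refusing to iterate:", err)
	}
}
```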

Related

  • CephCSI source: internal/rbd/sms_controllerserver.go, internal/rbd/snap_diff.go
  • go-ceph DiffIterateByID: rbd/diff_iterate_by_id.go (Red Hat fork only)
  • KEP-3314: CSI Changed Block Tracking
