Added the SnapshotMetadata service. #551
bswartz merged 1 commit into container-storage-interface:master from
Conversation
What about this PR #522?

It has been closed.
```
// SNAPSHOT_METADATA_SERVICE indicates that the Plugin provides
// RPCs to retrieve metadata on the allocated blocks of a single
// snapshot, or the changed blocks between a pair of snapshots of
// the same block volume.
// The presence of this capability determines whether the CO will
// attempt to invoke the OPTIONAL SnapshotMetadata service RPCs.
SNAPSHOT_METADATA_SERVICE = 4 [(alpha_enum_value) = true];
```
I blindly copied this pattern for the SnapshotMetadata service as it is an optional service.
In the Kubernetes CO implementation proposal, the existence of this gRPC service would be advertised by the presence of a CR, so this capability is not strictly required.
It occurs to me that defining this capability implies that if an SP wanted to use a separate container for the SnapshotMetadata service that they'd have to also implement an Identity service in this container.
Feedback I received verbally on this is that every CSI server process must provide an Identity service.
f4a058e to e1bb128
cc @bswartz

cc @jdef
```
// allocated blocks of a snapshot: i.e. this identifies the data ranges
// that have valid data as they were the target of some previous write
// operation on the volume.
message GetAllocatedRequest {
```
Ceph-CSI would like to implement this functionality. In order to communicate with the Ceph cluster, the requests (this one and `GetDeltaRequest`) need to provide credentials like the `secrets` as in `CreateVolumeRequest`:
```
map<string, string> secrets = 5 [(csi_secret) = true];
```
I think that makes sense. Many providers might be using this for volume operations. @carlbraganza wdyt?
We also might want to add a way to pass provider-specific params, just like the way it's done with CreateVolumeRequest?
```
// Plugin specific creation-time parameters passed in as opaque
// key-value pairs. This field is OPTIONAL. The Plugin is responsible
// for parsing and validating these parameters. COs will treat
// these as opaque.
map<string, string> parameters = 4;
```
Parameters are probably not required, as the CSI-driver should be able to fetch those based on the snapshot_id.
@nixpanic In K8s, do the secret data come from either of:
- `csi.storage.k8s.io/snapshotter-secret-[name|namespace]` entries of the `VolumeSnapshotClass.Parameters`
- `csi.storage.k8s.io/provisioner-secret-[name|namespace]` entries of a `StorageClass.Parameters`
Ceph-CSI would like to implement this functionality. In order to communicate with the Ceph cluster, the requests (this one and `GetDeltaRequest`) need to provide credentials like the `secrets` as in `CreateVolumeRequest`: `map<string, string> secrets = 5 [(csi_secret) = true];`
why aren't backend secrets configured at the plugin level? why do they need to be part of the RPC here? is this an exercise in future-proofing?
why aren't backend secrets configured at the plugin level? why do they need to be part of the RPC here? is this an exercise in future-proofing?
Ceph-CSI for example has a single instance of the CSI-drivers, but (in Kubernetes) the StorageClass points to a particular cluster. Different StorageClasses can point to different clusters, each with their own secrets (also linked in the StorageClass). Because of the stateless nature, there is no guarantee that secrets can be (temporarily) stored for the plugin before APIs of this new SnapshotMetadata service are called.
@nixpanic In K8s, do the secret data come from either of:
- `csi.storage.k8s.io/snapshotter-secret-[name|namespace]` entries of the `VolumeSnapshotClass.Parameters`
- `csi.storage.k8s.io/provisioner-secret-[name|namespace]` entries of a `StorageClass.Parameters`
Yes, that is right. The kubernetes-csi/external-snapshotter resolves those secrets and passes the contents in the CSI procedures.
@jdef, I would say yes, this is an exercise in future proofing. Every other CSI RPC that we've added without a secrets field has later needed a secrets field to be added. My stance is that the default should be to require the CO to provide per-RPC secrets and to figure out what those look like, unless we can prove it doesn't make sense to do so.
The underlying problem is the few CSI drivers that use per-volume secrets which the CO is responsible for managing. Those drivers find it impossible to implement new features until a secrets field is added and the CO is updated to provide the appropriate per-volume secrets.
```
// in this list is controlled by the max_results value in the
// GetAllocatedRequest message.
// This field is OPTIONAL.
repeated BlockMetadata block_metadata = 3;
```
```
repeated BlockMetadata block_metadata = 3;

// Indicates there are no more allocated blocks in the list.
// This field is REQUIRED.
bool end_of_list = 4;
```
As I understand it, a gRPC stream returns an EOF on proper termination and an error otherwise. Would that not suffice?
My concern here is when max_results is specified, and you get an EOF, how do you know if you need to call it one more time? Or even if max_results is set to zero, can the SP decide to send less than all the results in one stream?
During my review I was going to add like a whole paragraph explaining the relationship between the max_results field and the number of elements in the block_metadata field, and I decided it was too confusing and it would be better to make it explicit.
The core of the problem is that we've introduced BOTH pagination and streaming and this means the client can't rely on the stream ending to indicate that there are no more pages.
The core of the problem is that we've introduced BOTH pagination and streaming and this means the client can't rely on the stream ending to indicate that there are no more pages.
Hmm ... I'm not sure I understand your confusion! Perhaps we can use this snippet of Go client prototype code (written by @PrasadG193) to dissect the problem?
```go
c.initGRPCClient(snapMetaSvc.Spec.CACert, snapMetaSvc.Spec.Address, saToken, snapNamespace)
stream, err := c.client.GetDelta(ctx, &pgrpc.GetDeltaRequest{
	BaseSnapshotId:   snapNamespace + "/" + baseVolumeSnapshot,
	TargetSnapshotId: snapNamespace + "/" + targetVolumeSnapshot,
	StartingOffset:   0,
	MaxResults:       uint32(256),
})
if err != nil {
	return err
}
done := make(chan bool)
fmt.Println("Resp received:")
go func() {
	for {
		resp, err := stream.Recv()
		if err == io.EOF {
			done <- true // means stream is finished
			return
		}
		if err != nil {
			log.Fatalf("cannot receive %v", err)
		}
		respJson, _ := json.Marshal(resp)
		fmt.Println(string(respJson))
	}
}()
<-done // we will wait until all responses are received
```
The example above doesn't really illustrate processing of the multiple entries in the returned `block_metadata` slice, as the response is simply dumped as a string in JSON format, but it is clear that instead of the `fmt.Println` a real client would process the `block_metadata` slice ("page") in the `resp`. The stream processing loop itself is exited only on a non-nil `err` value from `stream.Recv()`, where `io.EOF` is not really an error but the formal proper end-of-stream indicator.
The corresponding server (sidecar) side logic for this primarily consists of a loop doing this:
```go
func (s *Server) GetDelta(req *pgrpc.GetDeltaRequest, cbtClientStream pgrpc.SnapshotMetadata_GetDeltaServer) error {
	... // call the SP service after assembling the parameters
	done := make(chan bool)
	go func() {
		defer close(done) // signal completion so the receive below can return
		for {
			... // receive resp from the SP service
			log.Print("Received response from csi driver, proxying to client")
			if err := cbtClientStream.Send(resp); err != nil {
				log.Printf(fmt.Sprintf("cannot send %v", err))
				return
			}
		}
	}()
	<-done // we will wait until all responses are relayed
	return nil
}
```
(The prototype sidecar has very little error handling)
Oh I see. You're using an error to signal a normal condition. So you always take an extra trip through the loop, except when there are no records to return. My main concern was how we signal termination to the client, and the error semantics don't show up in the protobuf definitions so I overlooked it.
Yes, it appears to be the idiomatic way in Go.
I don't hate this way of doing it, but even if it's idiomatic in Go, I'm not sure we use errors in this way anywhere else in CSI. I just want to highlight that a boolean flag in the return message would achieve the same effect as using an error to signal the end of iteration.
Why not both? Conventionally, client side should keep the connection alive until io.EOF, to avoid leaks. The end_of_list is a confirmation from the server to the client that everything has been sent. If there is an io.EOF but end_of_list is false, then the client knows it got an incomplete list, and will need to handle it.
IMO, it's a bit paranoid!
I was thinking of a server-side graceful termination scenario, where the sidecar (or gRPC) had a chance to send an io.EOF before an unexpected termination, and not all the block metadata has been retrieved from the driver plugin. There is no way for the backup software to know the list it got is incomplete. Is there not a valid scenario?
```
  }
}

service SnapshotMetadata {
```
i'm wondering about "stream" for the responses. we don't stream any other type of response. i notice that each of the response types already has repeated block metadata. in practice, how large do we expect these metadata chains to be? it also looks like there's a mechanism to resume from some offset (perhaps in case the stream is interrupted?) - which means that clients already have to keep track of what they've consumed.
why not just send a very large batch of metadata for each request? the golang default gRPC max message size is 4MiB. is that not good enough?
is the use of streaming an example of premature optimization?
This is definitely a new gRPC pattern for the CSI API, but I don't think this is a case of premature optimization.
The amount of changed block data would be proportional to the size of the volume, the amount of change, and the mechanism chosen to represent changed areas (FIXED_LENGTH or VARIABLE_LENGTH). For example, assuming a 1TiB volume (2^40 bytes) and 1MiB blocks, there would be 2^20 (about a million) such blocks; with 8KiB blocks there would be 2^27 (about 128 million) such blocks. In the worst case for FIXED_LENGTH all the blocks could have changed. In the worst case for VARIABLE_LENGTH every alternate block could have changed. Each changed block datum takes 16 bytes (+ some overhead?), which should also be factored in; the total can approach or exceed this 4MiB threshold.
Beyond the length of a single stream response, assuming a fixed slice length forces the client to support this size, which could break some memory conservative clients. (This is why there is an option for the client to request a max_results limit).
Recall too, that an earlier iteration of this KEP was criticized for passing all this metadata through the Kubernetes API server. The direct connection from the backup app to this new CSI service bypasses the K8s API server but cannot reduce the quantity of metadata!
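A back-of-the-envelope sketch of the arithmetic above (the 16-byte per-tuple cost is the figure assumed in the comment; `blockCount` and `worstCaseMetadataBytes` are illustrative helpers, not spec names):

```go
package main

import "fmt"

const tupleBytes = 16 // approximate per-tuple cost (byte_offset + size_bytes)

// blockCount returns the number of fixed-size blocks in a volume.
func blockCount(volumeBytes, blockBytes int64) int64 {
	return volumeBytes / blockBytes
}

// worstCaseMetadataBytes assumes the FIXED_LENGTH worst case, in which
// every block of the volume has changed.
func worstCaseMetadataBytes(volumeBytes, blockBytes int64) int64 {
	return blockCount(volumeBytes, blockBytes) * tupleBytes
}

func main() {
	const TiB = int64(1) << 40
	fmt.Println(blockCount(TiB, 1<<20))             // 1048576 (2^20) blocks of 1MiB
	fmt.Println(blockCount(TiB, 8<<10))             // 134217728 (2^27) blocks of 8KiB
	fmt.Println(worstCaseMetadataBytes(TiB, 1<<20)) // 16777216 (16MiB) > 4MiB gRPC default
	fmt.Println(worstCaseMetadataBytes(TiB, 8<<10)) // 2147483648 (2GiB)
}
```

Even the coarse 1MiB-block case exceeds the 4MiB default message limit in the worst case, which supports the argument for streaming with a client-controlled `max_results`.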
@jdef We discussed this in the SIG Data Protection WG meeting. Some relevant observations:
- We're dealing with metadata access here, and not a control path. There is no existing precedent for such operations in the current CSI spec.
- The performance of a gRPC stream would be better than the comparative use of a classic pagination API.
- A classic pagination API would involve multiple round-trip calls, versus a single call with the stream API.
- A classic pagination API requires the SP to fill the next page of metadata to return. Depending on the underlying APIs, this may require the SP to reissue the original allocated/diff call and skip the metadata records already returned in previous calls. In contrast, an SP server processing a stream API can send multiple responses without reissuing the allocated/diff call.
the last bullet point is probably the most convincing here
@carlbraganza can you make sure you've signed the CLA. See https://github.com/container-storage-interface/spec/blob/master/CONTRIBUTING.md#licence-and-cla

Thanks. Looking into getting this signed.
```
// The FIXED_LENGTH value indicates that data ranges are
// returned in fixed size blocks.
FIXED_LENGTH=1;
```
nit: consistent spacing around = sign, here and below
```
#### GetMetadataAllocated RPC
```
nit: other RPC section headers don't end w/ RPC. i haven't scanned the entire doc, but whatever we do here should be consistent w/ the rest (or, at least, the majority of the rest)
Looks like other RPC sections have the name in the form: RPC Name (within backquotes). The errors subsection has no special formatting. Changing...
```
- The **FIXED_LENGTH** style requires a fixed value for the `size_bytes` field in all tuples in a given sequence.
- The **VARIABLE_LENGTH** style does not constrain the value of the `size_bytes` field in a sequence.

The Snapshot Metadata service permits either style at the discretion of the plugin as long as the style does not change mid-stream in any given RPC.
```
this seems to imply that subsequent invocations of the same RPC may specify different length styles - even for the same snapshot. is this intended? i hope so, because otherwise the plugin would need to maintain state to enforce consistent styles across multiple RPCs - and we probably don't want to force plugins to do that
Yes. The spec leaves it to the SP to choose the style on a per-RPC basis; consistency is required only within each RPC.
```
The Snapshot Metadata service is an optional service that is used to retrieve metadata on the allocated blocks of a single snapshot, or the changed blocks between a pair of snapshots of the same block volume.
Retrieval of the data blocks of a snapshot is not addressed by this service and it is assumed that existing mechanisms to read snapshot data will suffice.
```
There is general language around concurrency of RPCs issued by CO to a plugin for some volume - some recommendation meant to ease the burden on plugins. I don't see the same wording applied for snapshots (and we should clarify that, in general). But my hunch is that the spirit of such calls would be the same, limiting concurrency w/ respect to calls that pertain to a particular snapshot (or snapshot group). Has thought been given to the concurrency of these new RPCs? i.e. is the intent to align w/ the current spirit of the spec w/ respect to concurrency, or is a different concurrency model needed/wanted?
I had not looked at that in detail previously.
Looking at the Concurrency and Error Scheme sections it appears that returning an ABORTED error code would suffice if the SP decides to limit concurrent operations. The existing recommendation that the caller retry with an exponential back off is also sufficient.
jdef left a comment
thanks for being patient. completed another round of review
```
// The plugin will determine an appropriate value if 0, and is
// always free to send less than the requested value.
// This field is OPTIONAL.
uint32 max_results = 3;
```
another "max" field uses int32 instead, another int64. maybe be consistent and not specify unsigned?
Okay. I'll use the int32 data type used in ListVolumesRequest and ListSnapshotsRequest. The odd one out is NodeGetInfoResponse which uses int64.
```
map<string, string> secrets = 4 [(csi_secret) = true];
}

// GetMetadataAllocatedResponse messages are returned in gRPC stream.
```
grammar: ".. are returned in streaming fashion." or something
```
// This value must be the same in all such messages returned by
// the stream.
// This field is REQUIRED.
uint64 volume_capacity_bytes = 2;

// This is a list of data range tuples.
// If the value of max_results in the GetMetadataAllocatedRequest
// message is non-zero, then the number of entries in this list
```
is greater than zero (if we drop the unsigned)
```
// MUST be less than or equal to that value.
// The byte_offset field of each message MUST be strictly increasing.
// The SP MUST NOT return a BlockMetadata message with a byte_offset
// less than than the value of starting_offset in the
```
this is inconsistent w/ the downward rounding option granted to plugins as previously described - please reconcile
for example, if the CO requests a starting offset of 600 and the plugin realigns/rounds to boundary starting at offset 512, then the first block metadata response could be reported w/ an initial offset of 512 - which is not strictly increasing from the CO-specified starting offset.
or am I missing something?
No, I think you're correct here - this is not quite right. Rewording to:
```
// The data range of the first BlockMetadata message in the list MUST
// include the starting_offset requested in the GetMetadataAllocatedRequest
// message.
```
```
map<string, string> secrets = 5 [(csi_secret) = true];
}

// GetMetadataDeltaResponse messages are returned in gRPC stream.
```
```
// This value must be the same in all such messages returned by
// the stream.
// This field is REQUIRED.
uint64 volume_capacity_bytes = 2;

// This is a list of data range tuples.
// If the value of max_results in the GetMetadataDeltaRequest message
// is non-zero, then the number of entries in this list MUST be less
// than or equal to that value.
// The byte_offset field of each message MUST be strictly increasing.
// The SP MUST NOT return a BlockMetadata message with a byte_offset
// less than than the value of starting_offset in the
```
same conflict w/ earlier, optional rounding-down by plugin
Changed to:
```
// The data range of the first BlockMetadata message in the list MUST
// include the starting_offset requested in the GetMetadataDeltaRequest
// message.
```
@jdef This was subsequently modified in response to feedback from @bswartz. Please see this comment below: #551 (comment)
```
##### GetMetadataDelta Errors
If the plugin is unable to complete the `GetMetadataDelta` call successfully it must return a non-OK gRPC code in the gRPC status.

The following conditions are well defined:
```
similar concerns w/ snapshots in not-ready states: does NOT_FOUND work or some other code needed?
@jdef This comment was in an outdated part of the document so I'm not sure if it was addressed:
This was brought to our attention by @nixpanic, because the Ceph CSI plugin supports multiple sets of secrets (per Pool, I believe).

Thanks @jdef. I tried to address all the items you surfaced. PTAL.
```
// The plugins may round down this offset to the nearest alignment
// boundary based on the BlockMetadataType used.
```
Sorry to be pedantic about this but the explanation isn't clear enough that someone could write a test to verify this behavior. Is the SP only allowed to round down if the type is FIXED_LENGTH? And is it only allowed to round by the exactly the size of one block? Would an SP that rounds down by more than one block (because internally it operates on groups of blocks) be in violation of the spec? If an SP uses VARIABLE_LENGTH blocks, how much can it round down by?
My worry is that if we leave it up to SPs and COs to figure out who should discard the "repeated" metadata blocks that might result from an interrupted transmission, they might make incompatible assumptions such that in cases of actual interruption, one or more changed blocks would get missed, but only on certain SP+CO combinations, and the bug is unlikely to get caught because it's a rare thing for these connections to get interrupted unless you specifically test for it. This bug would result in very subtle data corruption, which wouldn't even have a chance of breaking anything until a corrupted backup is restored, and even then the corruption might not get noticed, but it would be there.
I want to be crystal clear about what the SP is allowed to do with this field and what assumptions the CO is allowed to make, so we can write automated tests that ensure there is no chance of a misunderstanding by a SP or a CO implementer resulting in data corruption. I realize that the earlier wording of this section (which I had recommended) no longer made sense after the additional changes we agreed to, but we still need more clarity here.
```
// The data range of the first BlockMetadata message in the list MUST
// include the starting_offset requested in the
// GetMetadataAllocatedRequest message.
```
Oh wait, this is the language I was looking for in my earlier comment. I think this may be clear enough because it's precise. Do we need to add any clarification in the case that this message is streamed and there are multiple responses for a single request? In that case the restriction mentioned only applies to the first response. The second and later responses must follow after the last block of the previous response (which is hopefully obvious).
Whew!
Okay. I'll clarify that it is the first tuple in the stream that must have this property. Will now read as:
```
// The data range of the first BlockMetadata message in the stream
// of GetMetadataAllocatedResponse messages returned MUST include the
// starting_offset requested in the GetMetadataAllocatedRequest
// message.
```
The language is likewise modified in the description of the GetMetadataDeltaResponse message.
Oops! @bswartz please see the follow up comment below: #551 (comment)
Thanks for pointing out the case about the block containing starting_offset not containing any changes -- I had missed this. I'm convinced we can be more precise -- I'm going to take a swing at recommending some new language which is impossible to misunderstand.
After responding to @bswartz's previous comment, it occurred to me that requiring the

The language is likewise modified in the description of the GetMetadataDeltaResponse message.

I don't know if this negatively impacts @bswartz's request for precise language that can be tested, but it is more accurate.
@bswartz wrote:

I think this is a great idea. I'm adding the following section between the description of the metadata format and the RPC definitions: (Updated after @jdef's feedback and formatted for display above.)
jdef left a comment
playing catch-up here, trying to review individual commits. additional word-smithing may be needed.
```
it MAY attempt to **continue** the operation by re-sending its request message but with a
starting_offset adjusted to the NEXT byte position beyond that of any metadata already received.
The SP MUST always examine the starting_offset and ensure that metadata sent in
the gRPC stream logically includes the starting_offset but not any byte position prior
```
"but not any byte position prior" seems to contradict the gRPC method docs, which seem to indicate that starting_offset can live in the window defined by (offset, offset+size). or am I reading this wrong?
Yes, you're correct! I think I got stuck in the previous thread on continuation. Changing to:
The SP MUST always ensure that the `starting_offset` requested be considered in the computation of the data range for the first message in the returned gRPC stream, though the data range of the first message is not required to actually include the `starting_offset` if there is no applicable data between the `starting_offset` and the start of the data range returned by the first message.
This took quite a few rewrites to be unambiguous and yet semi-readable; please feel free to offer alternatives.
```
// The SP MUST NOT return any data ranges for which the byte_offset
// plus the size_bytes is less than starting_offset.
// The SP MAY return data ranges with byte_offset less than
// starting_offset if starting_offset is less than byte_offset plus
```
"... byte_offset less than starting_offset if starting_offset is less than byte_offset ..."
sorry, i'm still confused by this phrasing. what am i missing?
This is language I recommended, and what I was trying to achieve is to constrain the SP's behavior in case of restarting an abnormally terminated stream. Basically I want to be very precise about what starting_offset means, so that SPs don't take liberties and start somewhere other than where the CO expects. Unfortunately, the only way to do that is to describe what mathematical relationships must hold between the values in the request and response.
Given a requested starting_offset S, and the first block_metadata B0 of the first SP response in a stream, the SP MUST ensure that B0.byte_offset <= S < B0.byte_offset + B0.size_bytes.
.. is this what you're getting at? if so, can we represent the mathematical relationship using language like this instead of the two sentences that attempt to model this in English? I see where I went wrong reading it the first time, and I probably would have not had the same trouble w/ the same in equation form.
Yeah I'm perfectly fine using language like that. I had worried that introducing new variables like S and B0 might be extra confusing, but this seems like a case where it will be a little confusing either way, and I do like the clarity of mathematical symbols to express what is really a mathematical relationship.
@carlbraganza do you think you can reword using the language that @jdef recommends?
I replaced the 2 sentences above with the following:
```
// The SP MUST ensure that the returned response stream does not
// contain BlockMetadata tuples that start before but not overlap
// the requested starting_offset: i.e. if S is the requested
// starting_offset, and B0 is block_metadata[0] of the first message
// in the response stream, then the following must always be true:
// B0.byte_offset > S || S < B0.byte_offset + B0.size_bytes.
```
PrasadG193 left a comment
Change occurrences of "0 based" -> "zero based" to maintain consistency as suggested by jdef
```
// the requested starting_offset: i.e. if S is the requested
// starting_offset, and B0 is block_metadata[0] of the first message
// in the response stream, then the following must always be true:
// B0.byte_offset > S || S < B0.byte_offset + B0.size_bytes.
```
```
// S < B0.byte_offset or-else S < B0.byte_offset + B0.size_bytes.
```
^ this seems more readable? (for me, anyway); otherwise 👍
No problem. How about if I change it to:
```
...
in the response stream, then either (S < B0.byte_offset) must be
true or else (S < B0.byte_offset + B0.size_bytes) must be true.
```
I think this accurately replaces `||` in my original sentence.
bswartz left a comment
I found some more typos, and rather than just recommend the minimal fix, I am recommending a significant simplification of the language, because using mathematical expressions allows us to be more precise with less verbiage.
```
// The SP MUST ensure that the returned response stream does not
// contain BlockMetadata tuples that start before but not overlap
// the requested starting_offset: i.e. if S is the requested
// starting_offset, and B0 is block_metadata[0] of the first message
// in the response stream, then either (S < B0.byte_offset) must be
// true or else (S < B0.byte_offset + B0.size_bytes) must be true.
```
Now that we have precise math expressions, we can make the English verbiage a bit more terse. Note that I'm not nitpicking here, the previous language had typos that needed fixing anyways, so I'm going to recommend a wholesale simplification of this block. Also, the two inequalities expressed were redundant, because the second implies the first as long as size_bytes is non-negative (and we actually require it to be positive).
Suggested change, from:
| // The SP MUST ensure that the returned response stream does not
| // contain BlockMetadata tuples that start before but not overlap
| // the requested starting_offset: i.e. if S is the requested
| // starting_offset, and B0 is block_metadata[0] of the first message
| // in the response stream, then either (S < B0.byte_offset) must be
| // true or else (S < B0.byte_offset + B0.size_bytes) must be true.
to:
| // The SP MUST ensure that the returned response stream does not
| // contain BlockMetadata tuples that end before the requested
| // starting_offset: i.e. if S is the requested starting_offset, and B0 is
| // block_metadata[0] of the first message in the response stream,
| // then S < B0.byte_offset + B0.size_bytes must be true.
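The redundancy claim is easy to sanity-check mechanically. A small Python sketch (the helper name and the value ranges are illustrative, not from the spec):

```python
def disjunction_reduces(S, b, s):
    """True when "(S < b) or (S < b + s)" has the same truth value as
    "(S < b + s)" alone, for starting_offset S, byte_offset b, and
    size_bytes s."""
    return ((S < b) or (S < b + s)) == (S < b + s)

# With s > 0, (S < b) implies (S < b + s), so the disjunction always
# reduces to its second term; spot-check a small grid of values.
assert all(
    disjunction_reduces(S, b, s)
    for S in range(-4, 8)
    for b in range(0, 6)
    for s in range(1, 4)
)
```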
Thanks, Ben. That is simpler. Will now read as:
// The SP MUST ensure that the returned response stream does not
// contain BlockMetadata tuples that end before the requested
// starting_offset: i.e. if S is the requested starting_offset, and
// B0 is block_metadata[0] of the first message in the response
// stream, then (S < B0.byte_offset + B0.size_bytes) must be true.
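For concreteness, the final rule reduces to a single check on the first returned tuple. A minimal Python sketch (the function name and the (byte_offset, size_bytes) tuple representation are hypothetical, not part of the spec):

```python
def first_block_satisfies_start(starting_offset, block_metadata):
    """Check the spec constraint on the first returned tuple: the first
    block must not end at or before starting_offset, i.e.
    (S < B0.byte_offset + B0.size_bytes) must hold.
    block_metadata is a list of (byte_offset, size_bytes) tuples."""
    if not block_metadata:
        return True  # an empty stream trivially satisfies the constraint
    byte_offset, size_bytes = block_metadata[0]
    return starting_offset < byte_offset + size_bytes

# A block covering [0, 512) ends before starting_offset 1024: invalid.
assert not first_block_satisfies_start(1024, [(0, 512)])
# A block covering [512, 1536) overlaps starting_offset 1024: valid.
assert first_block_satisfies_start(1024, [(512, 1024)])
```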
| int64 byte_offset = 1;
|
| // This is the size of the data range.
| // size_bytes MUST be greater than 0.
Suggested change, from:
| // size_bytes MUST be greater than 0.
to:
| // size_bytes MUST be greater than zero.
| // This is a list of data range tuples.
| // If the value of max_results in the GetMetadataAllocatedRequest
| // message is greater than zero, then the number of entries in this
| // list MUST be less than or equal to that value.
| // The byte_offset field of each message MUST be strictly increasing.
| // The computation of the data range in the first BlockMetadata
| // message in the stream of GetMetadataAllocatedResponse messages
| // returned MUST consider the starting_offset requested in the
| // GetMetadataAllocatedRequest message.
| // The SP MUST NOT return data ranges for which byte_offset plus
| // size_bytes is less than or equal to starting_offset.
| // Allocated data ranges which overlap the starting_offset
| // or for which byte_offset plus size_bytes is greater than
| // starting_offset MUST be returned by the SP.
| // BlockMetadata messages MUST NOT overlap for a given snapshot, i.e.
| // for any two BlockMetadata messages, A and B, if
| // A.byte_offset < B.byte_offset then
| // A.byte_offset + A.size_bytes <= B.byte_offset MUST be true.
Another simplification of the English since we can lean on mathematical expressions for precision. Also I think that we don't need to repeat the language about starting_offset here. A simple reference to the other section should be enough.
Suggested change, from:
| // This is a list of data range tuples.
| // If the value of max_results in the GetMetadataAllocatedRequest
| // message is greater than zero, then the number of entries in this
| // list MUST be less than or equal to that value.
| // The byte_offset field of each message MUST be strictly increasing.
| // The computation of the data range in the first BlockMetadata
| // message in the stream of GetMetadataAllocatedResponse messages
| // returned MUST consider the starting_offset requested in the
| // GetMetadataAllocatedRequest message.
| // The SP MUST NOT return data ranges for which byte_offset plus
| // size_bytes is less than or equal to starting_offset.
| // Allocated data ranges which overlap the starting_offset
| // or for which byte_offset plus size_bytes is greater than
| // starting_offset MUST be returned by the SP.
| // BlockMetadata messages MUST NOT overlap for a given snapshot, i.e.
| // for any two BlockMetadata messages, A and B, if
| // A.byte_offset < B.byte_offset then
| // A.byte_offset + A.size_bytes <= B.byte_offset MUST be true.
to:
| // This is a list of data range tuples.
| // If the value of max_results in the GetMetadataAllocatedRequest
| // message is greater than zero, then the number of entries in this
| // list MUST be less than or equal to that value.
| // The SP MUST respect the value of starting_offset in the request.
| // The byte_offset field of each BlockMetadata message MUST be
| // strictly increasing and BlockMetadata messages MUST NOT overlap:
| // i.e. for any two BlockMetadata messages, A and B, if A is returned
| // before B, then A.byte_offset + A.size_bytes <= B.byte_offset MUST
| // be true.
Changed to read:
// This is a list of data range tuples.
// If the value of max_results in the GetMetadataAllocatedRequest
// message is greater than zero, then the number of entries in this
// list MUST be less than or equal to that value.
// The SP MUST respect the value of starting_offset in the request.
// The byte_offset fields of adjacent BlockMetadata messages
// MUST be strictly increasing and messages MUST NOT overlap:
// i.e. for any two BlockMetadata messages, A and B, if A is returned
// before B, then (A.byte_offset + A.size_bytes <= B.byte_offset)
// MUST be true.
// This MUST also be true if A and B are from block_metadata lists in
// different GetMetadataAllocatedResponse messages in the gRPC stream.
// This field is OPTIONAL.
I added an additional sentence to extend the notion of "adjacent" to include the entire gRPC stream.
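Taken together, these rules form an invariant a client could verify over the whole gRPC stream. A rough Python sketch (the function name and the (byte_offset, size_bytes) tuple representation are illustrative, not from the spec):

```python
def validate_block_stream(responses, max_results):
    """Validate the block_metadata lists across a stream of responses:
    - when max_results > 0, each list respects that cap;
    - size_bytes is positive for every tuple;
    - tuples are strictly increasing and non-overlapping, even across
      message boundaries in the stream.
    Each response is modeled as a list of (byte_offset, size_bytes)."""
    prev_end = None
    for block_metadata in responses:
        if max_results > 0 and len(block_metadata) > max_results:
            return False
        for byte_offset, size_bytes in block_metadata:
            if size_bytes <= 0:
                return False
            # Non-overlap: the previous tuple must end at or before this
            # one starts (A.byte_offset + A.size_bytes <= B.byte_offset);
            # with positive sizes this also forces strictly increasing
            # byte_offset values.
            if prev_end is not None and byte_offset < prev_end:
                return False
            prev_end = byte_offset + size_bytes
    return True

# Two messages whose blocks abut across the stream boundary: valid.
assert validate_block_stream([[(0, 512)], [(512, 512)]], max_results=1)
# Overlapping blocks in different messages of the stream: invalid.
assert not validate_block_stream([[(0, 1024)], [(512, 512)]], max_results=1)
```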
| // The SP MUST ensure that the returned response stream does not
| // contain BlockMetadata tuples that start before but not overlap
| // the requested starting_offset: i.e. if S is the requested
| // starting_offset, and B0 is block_metadata[0] of the first message
| // in the response stream, then either (S < B0.byte_offset) must be
| // true or else (S < B0.byte_offset + B0.size_bytes) must be true.
Suggested change, from:
| // The SP MUST ensure that the returned response stream does not
| // contain BlockMetadata tuples that start before but not overlap
| // the requested starting_offset: i.e. if S is the requested
| // starting_offset, and B0 is block_metadata[0] of the first message
| // in the response stream, then either (S < B0.byte_offset) must be
| // true or else (S < B0.byte_offset + B0.size_bytes) must be true.
to:
| // The SP MUST ensure that the returned response stream does not
| // contain BlockMetadata tuples that end before the requested
| // starting_offset: i.e. if S is the requested starting_offset, and B0 is
| // block_metadata[0] of the first message in the response stream,
| // then S < B0.byte_offset + B0.size_bytes must be true.
Handled like the previous case.
| // This is a list of data range tuples.
| // If the value of max_results in the GetMetadataDeltaRequest message
| // is greater than zero, then the number of entries in this list MUST
| // be less than or equal to that value.
| // The byte_offset field of each message MUST be strictly increasing.
| // The computation of the data range in the first BlockMetadata
| // message in the stream of GetMetadataDeltaResponse messages
| // returned MUST consider the starting_offset requested in the
| // GetMetadataDeltaRequest message.
| // The SP MUST NOT return data ranges for which byte_offset plus
| // size_bytes is less than or equal to starting_offset.
| // Changed data ranges which overlap the starting_offset
| // or for which byte_offset plus size_bytes is greater than
| // starting_offset MUST be returned by the SP.
| // BlockMetadata messages MUST NOT overlap for a given snapshot, i.e.
| // for any two BlockMetadata messages, A and B, if
| // A.byte_offset < B.byte_offset then
| // A.byte_offset + A.size_bytes <= B.byte_offset MUST be true.
Suggested change, from:
| // This is a list of data range tuples.
| // If the value of max_results in the GetMetadataDeltaRequest message
| // is greater than zero, then the number of entries in this list MUST
| // be less than or equal to that value.
| // The byte_offset field of each message MUST be strictly increasing.
| // The computation of the data range in the first BlockMetadata
| // message in the stream of GetMetadataDeltaResponse messages
| // returned MUST consider the starting_offset requested in the
| // GetMetadataDeltaRequest message.
| // The SP MUST NOT return data ranges for which byte_offset plus
| // size_bytes is less than or equal to starting_offset.
| // Changed data ranges which overlap the starting_offset
| // or for which byte_offset plus size_bytes is greater than
| // starting_offset MUST be returned by the SP.
| // BlockMetadata messages MUST NOT overlap for a given snapshot, i.e.
| // for any two BlockMetadata messages, A and B, if
| // A.byte_offset < B.byte_offset then
| // A.byte_offset + A.size_bytes <= B.byte_offset MUST be true.
to:
| // This is a list of data range tuples.
| // If the value of max_results in the GetMetadataDeltaRequest
| // message is greater than zero, then the number of entries in this
| // list MUST be less than or equal to that value.
| // The SP MUST respect the value of starting_offset in the request.
| // The byte_offset field of each BlockMetadata message MUST be
| // strictly increasing and BlockMetadata messages MUST NOT overlap:
| // i.e. for any two BlockMetadata messages, A and B, if A is returned
| // before B, then A.byte_offset + A.size_bytes <= B.byte_offset MUST
| // be true.
Handled like the previous case.
Skimmed through the changes yesterday; LGTM. The only remaining nit is the needless newline whitespace that pads the new protobuf blocks.
@carlbraganza Can you squash the commits?
@carlbraganza I can't find evidence of a signed CLA. Did that end up happening?
BTW, I can squash the commits too, but we should add some more detail to the squashed commit message.
f1eae06 to afc847a (force-pushed)
Rebased on master and squashed the commits.
As soon as we get a signed CLA from the author(s), we can merge this.
I can confirm we've received a CLA from Veeam, including Carl Braganza.
I looked at the build test failure; my guess is that I used the wrong version of protoc.
You probably need to rebase on HEAD and rebuild. We merged a patch that changed the Makefile significantly and also updated the protoc compiler version.
afc847a to aa0e11d (force-pushed)
* Add draft of CSI CBT KEP
  Signed-off-by: Ivan Sim <ivan.sim@dell.com>
* Update KEP status
  Signed-off-by: Ivan Sim <ivan.sim@dell.com>
* Initial structure. Filled in the Proposal, Caveats and Risks. Put in the CSI spec in the Details section.
* Removed distracting links to common K8s definitions. Clarified the proposal.
* More caveats. Better grammar.
* Use "snapshot access session".
* Addressed most of the feedback in the PR.
* Updated role figure.
* More refinements.
* Session figure. Renamed figure files.
* Fix background of session figure.
* Updated figures and roles.
* Propose a new role for session data.
* GRPC spec
* Don't propose roles.
* Add user stories in the proposal (#2)
  * Add user stories in the proposal
    Signed-off-by: Prasad Ghangal <prasad.ghangal@gmail.com>
  * Remove acceptance criteria for the user stories
  * Make changes suggested by Carl
  ---------
  Signed-off-by: Prasad Ghangal <prasad.ghangal@gmail.com>
* Added details to the manager, sidecar and SP service sections. Fixed session figure errors and rewrote the client gRPC description in the risks section.
* Called out UNRESOLVED issues. More on the SP service and sidecar.
* Resolved issues with expiry and advertising.
* Updated TOC
* Fixed typo and svg space rendering.
* Fixed typo in perms figure.
* Typo in session figure. More detail in user stories.
* Add SnapshotSession CRDs (#5)
  * Add SnapshotSession CRDs
  * Add CR descriptions
  * Address review comments
  * Address review comments
  * Remove typo
  * Remove unnecessary new line
* Added image of the flow when the TokenRequest and TokenReview APIs are used.
* Fixed figure spacing
* Updated permissions svg; removed session.
* Updated figures. Removed session figure.
* Added explanation of permissions.
* Updated overview and risks.
* Updated RPC and components.
* Completed remaining rewrite.
* Updated the CSI spec to reflect container-storage-interface/spec#551
* Removed the security_token and namespace from the gRPC spec. Pass the security token via the metadata authorization key. Pass the namespace as part of the K8s snapshot id string.
* Update sections on test plan, PRR and graduation criteria
  Signed-off-by: Ivan Sim <ihcsim@gmail.com>
* More neutral language on passing the auth token.
* Updated to reflect changes in the CSI spec PR.
* Use a separate gRPC API for the sidecar.
* Replaced authorization gRPC metadata with a security_token field in request messages.
* Fixed typo.
* Updated CSI spec; downplayed similarity between the K8s and CSI gRPC services.
* Add beta and GA graduation criteria
  Signed-off-by: Ivan Sim <ihcsim@gmail.com>
* Updated CSI spec again - no unsigned numbers used.
* Update KEP milestone to v1.30
  Signed-off-by: Ivan Sim <ihcsim@gmail.com>
* Update 'Scalability' section
  Signed-off-by: Ivan Sim <ihcsim@gmail.com>
* Add sig-auth as participating sigs
  Signed-off-by: Ivan Sim <ihcsim@gmail.com>
* Require that the CR be named for the driver.
* Removed the label requirement for the CR.
* Replaced johnbelamaric with soltysh for PRR approver.
* Bump up milestone to v1.31
* Change KEP status to implementable

---------
Signed-off-by: Ivan Sim <ivan.sim@dell.com>
Signed-off-by: Prasad Ghangal <prasad.ghangal@gmail.com>
Signed-off-by: Ivan Sim <ihcsim@gmail.com>
Co-authored-by: Carl Braganza <carl@kasten.io>
Co-authored-by: Prasad Ghangal <prasad.ghangal@gmail.com>
What type of PR is this?
This is a proposed addition to the CSI specification.
What this PR does / why we need it:
It adds a SnapshotMetadata service through which one can obtain allocated or changed block metadata for snapshots.
Which issue(s) this PR fixes:
n/a
Special notes for your reviewer:
Derived from kubernetes/enhancements#4082
Does this PR introduce an API-breaking change?: