
implement custom collection metadata#7123

Merged
timvisee merged 3 commits into dev from collection-level-metadata
Aug 26, 2025

Conversation

@generall
Member

PATCH /collections/star_charts
{
  "metadata": {
    "any_values": "go-like-this"
  }
}

#3957

@generall generall requested a review from timvisee August 24, 2025 21:14
Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

🧹 Nitpick comments (31)
lib/collection/src/shards/local_shard/shard_ops.rs (1)

165-168: Minor style consistency: pick a convention for From vs Into at call sites.

This file now uses CollectionInfo::from(...), while other code (e.g., in operations/types.rs, Line 258+) uses .into() for field conversions. Consider standardizing: commonly, use .into() when the target type is evident from context (return type or field type), and T::from(x) when making the target explicit improves readability. Not a blocker.
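To make the convention concrete, a tiny std-only sketch with hypothetical ShardInfo/CollectionInfo stand-ins (not the real Qdrant types):

```rust
// Hypothetical types illustrating the From/Into call-site convention.
struct ShardInfo {
    points: u64,
}

struct CollectionInfo {
    points: u64,
}

impl From<ShardInfo> for CollectionInfo {
    fn from(info: ShardInfo) -> Self {
        CollectionInfo { points: info.points }
    }
}

fn describe(info: CollectionInfo) -> u64 {
    info.points
}

fn main() {
    // T::from(x): the target type is explicit at the call site.
    let explicit = CollectionInfo::from(ShardInfo { points: 1 });
    assert_eq!(explicit.points, 1);

    // .into(): the target type is inferred from `describe`'s parameter type,
    // so the conversion stays unobtrusive where the context already names it.
    let n = describe(ShardInfo { points: 42 }.into());
    assert_eq!(n, 42);
}
```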

lib/collection/tests/integration/snapshot_recovery_test.rs (1)

40-49: Exercise metadata persistence through snapshot/restore (set non-empty metadata and assert).

Right now the test initializes metadata: None. Since this PR introduces collection-level metadata, this test is a perfect place to cover persistence across snapshot and recovery. Recommend setting a small, non-empty metadata object and (if an accessor is available) asserting it round-trips correctly in the recovered collection.

Apply this minimal change to cover the write-path; I can follow up with an accessor-based assertion once you point me to the public way to read collection metadata in tests:

         uuid: None,
-        metadata: None,
+        metadata: Some(segment::types::Payload(
+            serde_json::Map::from_iter([
+                ("any_values".to_string(), serde_json::json!("go-like-this")),
+                ("from".to_string(), serde_json::json!("snapshot_recovery_test")),
+            ])
+        )),

Optionally, add an assertion after loading recovered_collection to confirm metadata equality via the appropriate API (e.g., collection info/config). I can wire this up if you prefer.

lib/collection/src/tests/points_dedup.rs (1)

56-65: LGTM; consider setting non-empty metadata to increase coverage.

Setting metadata: None is correct for compiling against the new field. If you want to cover that metadata does not interfere with dedup flows, you could set a tiny metadata object here as well (no behavior change expected):

         uuid: None,
-        metadata: None,
+        metadata: Some(segment::types::Payload(
+            serde_json::Map::from_iter([("suite".to_string(), serde_json::json!("points_dedup"))])
+        )),
src/consensus.rs (1)

1536-1544: Cover consensus path with non-empty metadata in the CreateCollection test.

Great that the test was updated for the new field. To ensure metadata is actually propagated through the consensus log and applied, set a non-empty value here. This adds end-to-end coverage with nearly zero overhead.

                                 strict_mode_config: None,
                                 uuid: None,
-                                metadata: None,
+                                metadata: Some(segment::types::Payload(
+                                    serde_json::Map::from_iter([
+                                        ("test_case".to_string(), serde_json::json!("consensus")),
+                                        ("rev".to_string(), serde_json::json!(1)),
+                                    ])
+                                )),

Optional: after the op is committed, fetch the collection config (via whatever accessor is canonical in tests) and assert metadata matches what you set. I can add that snippet once you point me to the preferred accessor.

lib/segment/src/types.rs (1)

1672-1675: Deriving Hash for Payload: verify order-insensitive, version-stable hashing; add a regression test.

Deriving Hash on Payload(Map<String, Value>) is convenient, but two concerns to sanity-check:

  • Compatibility: ensure the repo’s serde_json version implements Hash for Value/Map (this has changed over time). If not guaranteed across all supported build targets, compilation may fail or hash semantics may differ.
  • Semantics: JSON object equality is order-insensitive. Hash must be consistent with Eq. Please verify that the derived Hash for the underlying serde_json::Map is order-insensitive in the version you use.

Suggested guardrail (additive test) to ensure invariants hold across upgrades:

// place under the existing tests module in this file
#[test]
fn test_payload_hash_order_invariance() {
    use std::collections::hash_map::DefaultHasher;
    use std::hash::{Hash, Hasher};

    fn h<T: Hash>(t: &T) -> u64 {
        let mut s = DefaultHasher::new();
        t.hash(&mut s);
        s.finish()
    }

    let p1 = payload_json! { "a": 1, "b": 2 };
    let p2 = payload_json! { "b": 2, "a": 1 };

    assert_eq!(p1, p2, "Eq should ignore object key order");
    assert_eq!(h(&p1), h(&p2), "Hash must be consistent with Eq");
}

If the above fails on any CI target, consider an explicit impl Hash for Payload that canonicalizes object keys before hashing (or rely on your existing StableHash machinery for config hashing).
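If an explicit impl becomes necessary, one order-insensitive scheme is to hash each entry independently and fold the entry hashes with a commutative operation. A std-only sketch using HashMap<String, String> as a stand-in for the JSON map (a real impl would walk serde_json::Value recursively, and note that DefaultHasher output is not stable across Rust releases):

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

// Order-insensitive hash: hash each (key, value) entry independently, then
// fold the per-entry hashes with wrapping addition (commutative), so the
// result is identical regardless of the map's iteration order.
fn hash_map_unordered(map: &HashMap<String, String>) -> u64 {
    map.iter()
        .map(|(k, v)| {
            let mut h = DefaultHasher::new();
            k.hash(&mut h);
            v.hash(&mut h);
            h.finish()
        })
        .fold(0u64, |acc, h| acc.wrapping_add(h))
}

fn main() {
    let mut a = HashMap::new();
    a.insert("a".to_string(), "1".to_string());
    a.insert("b".to_string(), "2".to_string());

    let mut b = HashMap::new();
    b.insert("b".to_string(), "2".to_string());
    b.insert("a".to_string(), "1".to_string());

    // Equal maps must hash equally, whatever order entries were inserted in.
    assert_eq!(hash_map_unordered(&a), hash_map_unordered(&b));
}
```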

lib/collection/tests/integration/common/mod.rs (1)

58-58: LGTM: new metadata field initialized to None in fixture

This aligns with the new Option&lt;Payload&gt; metadata field on CollectionConfigInternal and keeps old configs backward-compatible due to serde defaults.

If you want an integration assertion for round-trip persistence, I can add a small test that writes/loads the config and verifies metadata remains unset/omitted.

lib/collection/src/operations/verification/mod.rs (1)

514-514: LGTM: metadata: None added to strict-mode test fixture

Keeps the fixture compiling against the extended config without affecting the strict-mode behavior under test.

If useful, we can also add a variant of this fixture with non-empty metadata to ensure strict-mode checks remain orthogonal to metadata presence.

lib/storage/tests/integration/alias_tests.rs (1)

124-124: Consider exercising metadata end-to-end by setting a non-empty value

Setting metadata to None compiles, but adding a small non-empty map here would give us an end-to-end signal that alias ops and collection creation paths accept and persist metadata correctly. For example:

-                            metadata: None,
+                            metadata: Some(segment::types::Payload(
+                                serde_json::Map::from_iter([
+                                    ("env".to_string(), serde_json::json!("test")),
+                                    ("suite".to_string(), serde_json::json!("alias")),
+                                ])
+                            )),

No extra import is needed; the paths above are fully qualified.

Optionally, fetch the collection config after creation and assert the metadata is present.

lib/collection/src/tests/fixtures.rs (1)

53-53: Expose a metadata-aware fixture to reduce repetition across tests

Right now all fixtures pin metadata: None. Consider adding a helper that accepts an Option&lt;segment::types::Payload&gt; so tests that need to validate metadata behavior don’t have to re-implement config construction.

Example (new function outside this hunk):

pub fn create_collection_config_with_dim_and_metadata(
    dim: usize,
    metadata: Option<segment::types::Payload>,
) -> CollectionConfigInternal {
    let mut cfg = create_collection_config_with_dim(dim);
    cfg.metadata = metadata;
    cfg
}

This keeps current call sites unchanged while enabling targeted tests to pass metadata when needed.

lib/collection/src/config.rs (1)

235-240: Metadata Integration Verified – Consider Size Validation & Anonymization Checks

• State-management now tracks metadata changes via is_metadata_updated in lib/collection/src/collection/state_management.rs, so config-change detection properly accounts for the new field.
• Storage-layer conversions in lib/storage/src/content_manager/conversions.rs and lib/storage/src/content_manager/collection_meta_ops.rs pass through the metadata payload as intended.
• API surface in lib/api/src/grpc/proto/collections.proto documents and exposes merge semantics for metadata (e.g., “will be merged with already stored metadata”).
• Telemetry code in lib/collection/src/telemetry.rs includes metadata with an #[anonymize(value = None)] annotation—please confirm that the anonymization logic covers nested JSON values and prevents any PII leakage.

Optional improvements:

  • Enforce a soft size limit on the metadata payload during create/update (e.g., via request validation) to prevent large blobs from bloating configs.
  • If desired, I can provide a reusable validation helper and integrate it into the relevant handlers.
lib/collection/src/tests/snapshot_test.rs (1)

50-59: Test config updated with metadata: None — consistent with new field.

This preserves existing behavior while keeping tests compiling. Consider adding a follow-up test that sets non-empty metadata, creates a snapshot, restores, and asserts metadata round-trips. This would harden persistence guarantees.

Would you like me to draft such a test?

lib/collection/src/collection/collection_ops.rs (1)

166-174: Prefer concise guard binding; explicit type annotation is noisy.

Use the same style as other methods in this impl for consistency.

-        let mut collection_config_guard: tokio::sync::RwLockWriteGuard<
-            '_,
-            crate::config::CollectionConfigInternal,
-        > = self.collection_config.write().await;
+        let mut config = self.collection_config.write().await;
tests/openapi/test_collection_metadata.py (1)

16-49: Great coverage of merge/override; consider a clearer name and a nested-merge case.

  • Rename to reflect intent, e.g., test_collection_metadata_patch_merge.
  • Add a case with nested metadata to pin down deep-merge behavior (and prevent regressions).

Here’s an additional test you can drop into this module:

def test_collection_metadata_nested_merge(collection_name):
    # set nested dict
    assert request_with_validation(
        api="/collections/{collection_name}",
        method="PATCH",
        path_params={"collection_name": collection_name},
        body={"metadata": {"nested": {"a": 1}}}
    ).ok

    # merge another key into the same nested object
    assert request_with_validation(
        api="/collections/{collection_name}",
        method="PATCH",
        path_params={"collection_name": collection_name},
        body={"metadata": {"nested": {"b": 2}}}
    ).ok

    res = request_with_validation(
        api="/collections/{collection_name}",
        method="GET",
        path_params={"collection_name": collection_name},
    )
    assert res.ok
    meta = res.json()["result"]["config"]["metadata"]
    assert meta.get("nested") == {"a": 1, "b": 2}  # dict equality ignores key order

Also consider a persistence check (patch → restart node → GET) in the E2E suite once update_metadata persists to disk.

lib/storage/src/content_manager/toc/collection_meta_ops.rs (1)

178-180: Clarify and document metadata merge semantics
The call to collection.update_metadata(metadata) invokes Payload::merge, but the behavior around key removal and empty-map handling isn’t specified. Please verify and explicitly document:

  • Removal semantics: does sending a key with value null delete it, or merely set its value to JSON null? Is there support for an explicit “remove” list?
  • Empty map {} handling: does it act as a no-op, clear all existing metadata, or result in an error?
  • Omitted field behavior: what happens if the client omits the metadata field entirely?

Consider adding optional guardrails at the API boundary:

  • Enforce a maximum metadata size or nested depth to prevent oversized blobs from impacting snapshot size and replication latency.
  • Introduce a “strict mode” toggle that validates metadata payloads before applying.

Location for clarification and potential refactor:
• lib/storage/src/content_manager/toc/collection_meta_ops.rs lines 178–180

Let me know if you’d like me to open a follow-up PR to implement explicit delete semantics and size validation.
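To make the questions above concrete, here is a std-only sketch of one possible contract (shallow merge, request keys overwrite, null deletes) using plain maps as stand-ins; this is an assumed semantics for illustration, not necessarily what Payload::merge currently does:

```rust
use std::collections::HashMap;

// JSON-like patch value: Some(v) is a concrete value, None plays the role of JSON null.
type Meta = HashMap<String, Option<String>>;

// Shallow merge: keys in `patch` overwrite `base`; a null value deletes the key;
// keys absent from `patch` are left untouched.
fn merge_metadata(base: &mut HashMap<String, String>, patch: &Meta) {
    for (key, value) in patch {
        match value {
            Some(v) => {
                base.insert(key.clone(), v.clone());
            }
            None => {
                base.remove(key);
            }
        }
    }
}

fn main() {
    let mut base = HashMap::from([
        ("env".to_string(), "prod".to_string()),
        ("owner".to_string(), "ml-team".to_string()),
    ]);
    let patch = Meta::from([
        ("env".to_string(), Some("staging".to_string())), // overwrite existing key
        ("owner".to_string(), None),                      // null deletes the key
    ]);
    merge_metadata(&mut base, &patch);
    assert_eq!(base.get("env").map(String::as_str), Some("staging"));
    assert!(!base.contains_key("owner"));
}
```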

lib/collection/src/operations/types.rs (1)

198-209: Internal→public conversion forwards metadata

The conversion remains explicit and compiler-checked. This ensures GET /collections/{name} reflects metadata. Consider adding an integration test that:

  • creates with metadata,
  • patches (merge),
  • verifies CollectionInfo.config.metadata matches expectations.
lib/collection/src/collection/state_management.rs (1)

138-155: Snapshot apply: treat metadata change as config change without optimizer recreation — correct tradeoff

  • Adding metadata to the destructure and comparing it (is_metadata_updated) is correct.
  • Including it in is_config_updated triggers a persisted config save, while keeping recreate_optimizers tied to “core” config is the right balance.

Suggestion (optional): add a small comment explaining that metadata-only updates intentionally avoid optimizer recreation, to preempt future “why didn’t optimizers restart?” questions.

lib/api/src/grpc/proto/collections.proto (3)

413-414: CreateCollection.metadata: clarify merge and deletion semantics in comments

To avoid client confusion, tighten the field docs to specify merge and delete semantics. Example patch:

-  map<string, Value> metadata = 18; // Arbitrary JSON metadata for the collection
+  // Arbitrary JSON metadata for the collection.
+  // Semantics:
+  // - On create: stored as-is.
+  // - On update (see UpdateCollection.metadata): server MERGES keys into existing metadata.
+  //   To delete a key, set its value to null (if supported by Value) or use an explicit remove API if available.
+  map<string, Value> metadata = 18;

426-427: UpdateCollection.metadata: specify exact merge behavior and key removal approach

The comment mentions “merged,” but clients need exact rules. Suggest expanding:

-  map<string, Value> metadata = 10; // Arbitrary JSON-like metadata for the collection, will be merged with already stored metadata
+  // Arbitrary JSON-like metadata for the collection. Merge behavior:
+  // - Keys present in the request overwrite existing keys.
+  // - Keys absent in the request remain unchanged.
+  // - To remove a key, set its value to null (if Value supports null); otherwise provide a dedicated remove-list in the future.
+  map<string, Value> metadata = 10;

Also consider noting any practical size limits (e.g., max total bytes), if enforced.

If null-based deletion is intended, please confirm that Value supports null. If not, we may need an explicit removal mechanism or a special sentinel.


466-467: CollectionConfig.metadata: expose but keep docs consistent with other surfaces

Make the docs consistent with CreateCollection/UpdateCollection to avoid ambiguity:

-  map<string, Value> metadata = 7; // Arbitrary JSON metadata for the collection
+  // Arbitrary JSON metadata for the collection. Returned as stored after merges from updates.
+  map<string, Value> metadata = 7;
lib/collection/src/operations/conversions.rs (1)

1948-2047: Minor: consider preserving the “empty vs none” distinction (optional).

For TryFrom<api::grpc::qdrant::CollectionConfig>, empty metadata is converted to None. If elsewhere you intend to differentiate “unset” from “explicitly empty”, consider mapping empty to Some(empty) for symmetry. Not critical for info/get paths.

lib/storage/src/content_manager/collection_meta_ops.rs (1)

263-267: UpdateCollection comment promises “empty object clears” — keep code aligned.

This is the desired API contract. After adopting the conversion fix in conversions.rs, confirm the update application logic treats Some(empty) as “clear”. Consider adding a unit/integration test.

src/migrations/single_to_cluster.rs (1)

118-121: Avoid potential panic on missing shard_id in shards map.

unwrap() will panic if shard_ids and shards ever diverge. Safer to use expect with context or skip with a warning to keep migration resilient.

Apply this diff:

-                    for shard_id in shard_ids {
-                        let shard_info = shards.get(shard_id).unwrap();
+                    for shard_id in shard_ids {
+                        let shard_info = shards.get(shard_id).expect(
+                            "inconsistent state: shard_id from key mapping not found in shards",
+                        );

Alternatively, log-and-continue if you prefer best-effort migration.

docs/redoc/master/openapi.json (6)

6725-6735: CollectionConfig.metadata: addition looks good; tighten wording and add an example

Great addition. Minor docs nits and a small example will improve clarity.

Apply this diff:

           "metadata": {
-            "description": "Arbitrary JSON metadata for the collection This can be used to store application-specific information such as creation time, migration data, inference model info, etc.",
+            "description": "Arbitrary JSON metadata for the collection. This can be used to store application-specific information such as creation time, migration data, inference model info, etc.",
+            "example": {
+              "owner": "ml-team",
+              "purpose": "experiments-2025-08",
+              "any_values": "go-like-this"
+            },
             "anyOf": [
               {
                 "$ref": "#/components/schemas/Payload"
               },
               {
                 "nullable": true
               }
             ]
           }

7491-7497: New Payload schema: add a brief description for discoverability

Schema is fine; adding a one-liner helps readers understand it when jumping to the definition.

Apply this diff:

       "Payload": {
-        "type": "object",
+        "type": "object",
+        "description": "Free-form JSON object used for point payloads and collection-level metadata.",
         "additionalProperties": true,
         "example": {
           "city": "London",
           "color": "green"
         }
       },

9735-9745: CreateCollection.metadata: mirror wording and provide example

Match the description style and include an example consistent with the PATCH use-case from the PR description.

Apply this diff:

           "metadata": {
-            "description": "Arbitrary JSON metadata for the collection This can be used to store application-specific information such as creation time, migration data, inference model info, etc.",
+            "description": "Arbitrary JSON metadata for the collection. This can be used to store application-specific information such as creation time, migration data, inference model info, etc.",
+            "example": {
+              "any_values": "go-like-this"
+            },
             "anyOf": [
               {
                 "$ref": "#/components/schemas/Payload"
               },
               {
                 "nullable": true
               }
             ]
           }

Additionally, consider adding a request example to the Create Collection endpoint’s requestBody to showcase metadata usage. I can draft it if you’d like.


10116-10126: UpdateCollection.metadata: clarify merge semantics and unchanged/clear behavior; add example

The description mentions merge semantics but doesn’t specify whether the merge is deep vs shallow, and how to keep unchanged vs clear keys. Also, it says “set it to an empty object” to remove metadata; given the schema allows null, clarify whether null clears or should be ignored.

  • Please confirm:
    • Is the merge deep (recursive) or shallow (one level)?
    • Does "metadata": null clear all metadata, or should clients send {} to clear?
    • How do clients delete a single key (e.g., remove only "foo")—is sending "foo": null supported, or must they send the full object without "foo"?

Apply this wording tweak and add a concrete example:

           "metadata": {
-            "description": "Metadata to update for the collection. If provided, this will merge with existing metadata. To remove metadata, set it to an empty object.",
+            "description": "Metadata to update for the collection. If provided, this will merge with existing metadata. To clear all metadata, set this field to an empty object ({}). Omit this field to leave metadata unchanged.",
+            "example": {
+              "release": "v1.2.3",
+              "notes": "backfill completed"
+            },
             "anyOf": [
               {
                 "$ref": "#/components/schemas/Payload"
               },
               {
                 "nullable": true
               }
             ]
           }

Optionally, add an endpoint-level example to PATCH /collections/{collection_name} requestBody showing:

{
  "metadata": { "any_values": "go-like-this" }
}

11975-11985: CollectionConfigTelemetry.metadata: align wording

Minor punctuation to match the other occurrences.

Apply this diff:

           "metadata": {
-            "description": "Arbitrary JSON metadata for the collection",
+            "description": "Arbitrary JSON metadata for the collection.",
             "anyOf": [
               {
                 "$ref": "#/components/schemas/Payload"
               },
               {
                 "nullable": true
               }
             ]
           }

10116-10126: Add endpoint-level examples to PATCH /collections/{collection_name}

The PR description promises a PATCH example; adding an OpenAPI example improves SDK generation and docs.

Add this block under the PATCH requestBody.content.application/json (outside this exact hunk):

"examples": {
  "set_metadata": {
    "summary": "Set collection-level metadata",
    "value": {
      "metadata": {
        "any_values": "go-like-this"
      }
    }
  },
  "clear_metadata": {
    "summary": "Clear all collection metadata",
    "value": {
      "metadata": {}
    }
  }
}

I can prepare a precise diff against the PATCH operation if preferred.

lib/api/src/grpc/qdrant.rs (3)

811-814: CreateCollection.metadata: add guardrails and document constraints

The field/tag placement looks correct (new tag 18, no collisions). Before this ships broadly:

  • Add server-side limits for metadata (max total bytes, max keys, max nesting) and fail fast with Status::invalid_argument if exceeded. This prevents unbounded memory usage on create.

  • Clarify allowed types in docs (IntegerValue vs DoubleValue semantics, disallow NaN/inf, max int range). If NaN/inf are possible in DoubleValue, explicitly reject them.

  • Consider reserving a namespace (e.g., qdrant.*) for potential future internal keys to avoid user collisions.

If you want, I can sketch a validator hook in the create path that walks Value with a byte/depth budget.
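Such a validator hook could look roughly like this std-only sketch, with a minimal Json enum standing in for the real gRPC Value type (all names and limits here are illustrative):

```rust
use std::collections::HashMap;

// Minimal JSON-like value standing in for the gRPC Value type.
enum Json {
    Str(String),
    Num(f64),
    Object(HashMap<String, Json>),
}

// Walk the value, charging a rough byte cost per node and tracking depth.
// Returns Err on the first violated budget; NaN/inf doubles are rejected too.
fn validate(value: &Json, depth_left: u32, bytes_left: &mut i64) -> Result<(), String> {
    if depth_left == 0 {
        return Err("metadata nested too deeply".to_string());
    }
    match value {
        Json::Str(s) => *bytes_left -= s.len() as i64,
        Json::Num(n) => {
            if !n.is_finite() {
                return Err("metadata contains NaN/inf".to_string());
            }
            *bytes_left -= 8;
        }
        Json::Object(map) => {
            for (key, child) in map {
                *bytes_left -= key.len() as i64;
                validate(child, depth_left - 1, bytes_left)?;
            }
        }
    }
    if *bytes_left < 0 {
        return Err("metadata too large".to_string());
    }
    Ok(())
}

fn main() {
    let ok = Json::Object(HashMap::from([
        ("env".to_string(), Json::Str("test".to_string())),
    ]));
    let mut budget = 1024;
    assert!(validate(&ok, 8, &mut budget).is_ok());

    // Non-finite doubles are rejected up front.
    let bad = Json::Num(f64::NAN);
    let mut budget = 1024;
    assert!(validate(&bad, 8, &mut budget).is_err());
}
```

In the real create path this would run before the operation enters the consensus log, mapping Err to Status::invalid_argument.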


858-861: UpdateCollection.metadata: merge semantics under-specified; consider explicit patch model

The comment promises a merge, but proto3 map presence makes it impossible to distinguish “field absent” from “present but empty” and there’s no clear way to remove keys. NullValue could mean “set to null” or “delete key”.

Suggest introducing an explicit patch model to avoid ambiguity:

  • Add oneof metadata_op { MetadataUpsert upsert = X; StringList remove_keys = Y; bool clear = Z; } (note: proto3 disallows map and repeated fields directly inside a oneof, so the map and the key list need small wrapper messages).
  • Alternatively, a MetadataPatch { map<string, Value> set = 1; repeated string unset = 2; bool clear = 3; } message nested under UpdateCollection.

This gives unambiguous upsert/unset/clear semantics, and “absent” means “no change”.

If you prefer to keep the single map, please document:

  • whether sending {} means “no change” or “clear all,”
  • whether NullValue deletes a key or sets it to null,
  • and how nested StructValue merges vs overwrites.

Happy to draft the proto change if you want to go this route.


963-966: CollectionConfig.metadata in responses/telemetry: avoid accidental PII leakage

Surfacing arbitrary user metadata in CollectionConfig means it may be logged, exported via telemetry, or scraped by tooling. Recommend:

  • Redact or opt-out: ensure telemetry export paths either redact metadata by default or gate it behind a config/feature flag.
  • Size caps in responses: consider truncating oversized metadata in CollectionInfo/Config responses to protect memory usage and client UX.
  • Document the risk in API docs, and discourage storing secrets/tokens under metadata.

Please verify where CollectionConfig is serialized to JSON or exported (e.g., telemetry). If it is, add a redaction path there.
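A redaction pass at the export boundary might look like this std-only sketch (CollectionTelemetry and the opt-in flag are illustrative stand-ins, not the actual telemetry structs):

```rust
use std::collections::HashMap;

// Illustrative stand-in for a telemetry struct carrying user metadata.
struct CollectionTelemetry {
    name: String,
    metadata: Option<HashMap<String, String>>,
}

// Drop user metadata before the telemetry struct is serialized/exported,
// unless an explicit opt-in flag is set.
fn redact_for_export(mut t: CollectionTelemetry, include_metadata: bool) -> CollectionTelemetry {
    if !include_metadata {
        t.metadata = None;
    }
    t
}

fn main() {
    let t = CollectionTelemetry {
        name: "star_charts".to_string(),
        metadata: Some(HashMap::from([("owner".to_string(), "ml-team".to_string())])),
    };
    let exported = redact_for_export(t, false);
    assert!(exported.metadata.is_none());
    assert_eq!(exported.name, "star_charts");
}
```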

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between bf60e96 and 2688e67.

📒 Files selected for processing (33)
  • docs/grpc/docs.md (6 hunks)
  • docs/redoc/master/openapi.json (5 hunks)
  • lib/api/src/grpc/proto/collections.proto (4 hunks)
  • lib/api/src/grpc/qdrant.rs (4 hunks)
  • lib/collection/benches/batch_query_bench.rs (1 hunks)
  • lib/collection/benches/batch_search_bench.rs (1 hunks)
  • lib/collection/src/collection/collection_ops.rs (2 hunks)
  • lib/collection/src/collection/state_management.rs (1 hunks)
  • lib/collection/src/config.rs (2 hunks)
  • lib/collection/src/operations/conversions.rs (4 hunks)
  • lib/collection/src/operations/types.rs (3 hunks)
  • lib/collection/src/operations/verification/mod.rs (1 hunks)
  • lib/collection/src/shards/local_shard/shard_ops.rs (1 hunks)
  • lib/collection/src/shards/replica_set/update.rs (1 hunks)
  • lib/collection/src/shards/shard_holder/shard_mapping.rs (1 hunks)
  • lib/collection/src/telemetry.rs (4 hunks)
  • lib/collection/src/tests/fixtures.rs (1 hunks)
  • lib/collection/src/tests/points_dedup.rs (1 hunks)
  • lib/collection/src/tests/query_prefetch_offset_limit.rs (1 hunks)
  • lib/collection/src/tests/snapshot_test.rs (1 hunks)
  • lib/collection/tests/integration/common/mod.rs (1 hunks)
  • lib/collection/tests/integration/multi_vec_test.rs (1 hunks)
  • lib/collection/tests/integration/snapshot_recovery_test.rs (1 hunks)
  • lib/segment/src/types.rs (1 hunks)
  • lib/storage/src/content_manager/collection_meta_ops.rs (6 hunks)
  • lib/storage/src/content_manager/conversions.rs (5 hunks)
  • lib/storage/src/content_manager/mod.rs (1 hunks)
  • lib/storage/src/content_manager/toc/collection_meta_ops.rs (2 hunks)
  • lib/storage/src/content_manager/toc/create_collection.rs (2 hunks)
  • lib/storage/tests/integration/alias_tests.rs (1 hunks)
  • src/consensus.rs (1 hunks)
  • src/migrations/single_to_cluster.rs (5 hunks)
  • tests/openapi/test_collection_metadata.py (1 hunks)
🧰 Additional context used
🧠 Learnings (4)
📚 Learning: 2025-08-11T07:57:01.399Z
Learnt from: IvanPleshkov
PR: qdrant/qdrant#6986
File: lib/segment/src/vector_storage/quantized/quantized_vectors.rs:81-84
Timestamp: 2025-08-11T07:57:01.399Z
Learning: In `lib/segment/src/vector_storage/quantized/quantized_vectors.rs`, the BitsStoreType parameter difference between single-vector and multi-vector Binary quantization is intentional: single-vector storage uses `EncodedVectorsBin<u128, ...>` to enable 128-bit SIMD/popcount optimizations for longer vectors, while multi-vector storage uses `EncodedVectorsBin<u8, ...>` because multivectors are typically shorter and benefit from byte-granular storage.

Applied to files:

  • lib/collection/src/config.rs
📚 Learning: 2025-08-15T11:41:10.629Z
Learnt from: IvanPleshkov
PR: qdrant/qdrant#7043
File: lib/segment/src/vector_storage/quantized/quantized_vectors.rs:857-857
Timestamp: 2025-08-15T11:41:10.629Z
Learning: In `lib/segment/src/vector_storage/quantized/quantized_vectors.rs`, the multivector offset storage has two different patterns: for RAM storage, offsets are collected into Vec<MultivectorOffset> and used directly; for MMAP storage, offsets are consumed to create a file via create_offsets_file_from_iter, then the file is loaded back as MultivectorOffsetsStorageMmap. The direct consumption of offsets iterator in the MMAP case is intentional.

Applied to files:

  • lib/collection/src/config.rs
📚 Learning: 2025-08-11T00:37:34.100Z
Learnt from: IvanPleshkov
PR: qdrant/qdrant#6986
File: lib/segment/src/vector_storage/quantized/quantized_ram_storage.rs:46-55
Timestamp: 2025-08-11T00:37:34.100Z
Learning: In `lib/segment/src/vector_storage/quantized/quantized_ram_storage.rs`, the `ChunkedVectors` used in `QuantizedRamStorage` is intentionally designed to be non-persistent during updates. The `push_vector` method only updates the in-memory vectors without flushing to disk, and this is expected behavior rather than a bug.

Applied to files:

  • lib/collection/src/config.rs
📚 Learning: 2025-08-10T18:26:33.017Z
Learnt from: generall
PR: qdrant/qdrant#7006
File: docs/redoc/master/openapi.json:10393-10401
Timestamp: 2025-08-10T18:26:33.017Z
Learning: In qdrant, docs/redoc/master/openapi.json is auto-generated. Any description changes must be made in the source (e.g., lib/api/src/rest/schema.rs and/or lib/api/src/grpc/proto/points.proto), not edited directly in the JSON.

Applied to files:

  • lib/api/src/grpc/qdrant.rs
🧬 Code graph analysis (6)
lib/collection/src/shards/local_shard/shard_ops.rs (1)
lib/collection/src/operations/types.rs (15)
  • from (97-104)
  • from (188-210)
  • from (259-280)
  • from (671-673)
  • from (691-693)
  • from (1185-1190)
  • from (1194-1199)
  • from (1203-1253)
  • from (1257-1259)
  • from (1263-1268)
  • from (1272-1277)
  • from (1281-1286)
  • from (1290-1295)
  • from (1299-1304)
  • from (1308-1313)
tests/openapi/test_collection_metadata.py (2)
tests/openapi/helpers/helpers.py (1)
  • request_with_validation (41-116)
tests/openapi/conftest.py (1)
  • on_disk_vectors (11-12)
src/migrations/single_to_cluster.rs (4)
lib/collection/src/shards/shard_holder/mod.rs (2)
  • new (77-110)
  • shards (284-284)
lib/collection/src/collection/mod.rs (1)
  • new (103-200)
lib/collection/src/collection/state_management.rs (1)
  • shards (190-190)
lib/collection/src/shards/shard_holder/shard_mapping.rs (2)
  • shard_key (71-76)
  • shard_ids (50-59)
lib/storage/src/content_manager/conversions.rs (1)
lib/api/src/conversions/json.rs (1)
  • proto_to_payloads (49-55)
lib/collection/src/operations/conversions.rs (1)
lib/api/src/conversions/json.rs (2)
  • payload_to_proto (9-14)
  • proto_to_payloads (49-55)
lib/api/src/grpc/qdrant.rs (1)
lib/collection/src/operations/conversions.rs (1)
  • map (1930-1932)
🪛 markdownlint-cli2 (0.17.2)
docs/grpc/docs.md

17-17: Unordered list indentation
Expected: 2; Actual: 4

(MD007, ul-indent)


29-29: Unordered list indentation
Expected: 2; Actual: 4

(MD007, ul-indent)


88-88: Unordered list indentation
Expected: 2; Actual: 4

(MD007, ul-indent)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (12)
  • GitHub Check: test-shard-snapshot-api-s3-minio
  • GitHub Check: test-consistency
  • GitHub Check: e2e-tests
  • GitHub Check: test-consensus-compose
  • GitHub Check: lint
  • GitHub Check: integration-tests-consensus
  • GitHub Check: integration-tests
  • GitHub Check: rust-tests (macos-latest)
  • GitHub Check: rust-tests (windows-latest)
  • GitHub Check: rust-tests (ubuntu-latest)
  • GitHub Check: storage-compat-test
  • GitHub Check: rust-tests-no-rocksdb (ubuntu-latest)
🔇 Additional comments (35)
lib/collection/src/shards/local_shard/shard_ops.rs (2)

165-168: Switch to explicit From() is correct and side‑effect free.

Functionally equivalent to the prior .into() and makes the target type explicit at the call site. No behavioral change; good fit given impl From<ShardInfoInternal> for CollectionInfo exists.


165-168: Metadata propagation confirmed via CollectionConfig.metadata.

The metadata field lives on CollectionConfigInternal and is correctly carried through in the From<CollectionConfigInternal> for CollectionConfig implementation (mapping metadata → CollectionConfig.metadata), and then included in CollectionInfo via its config field (see config: CollectionConfig::from(config)). There is intentionally no top-level metadata on CollectionInfo; it surfaces under config.metadata in API responses.

No further changes required here.

lib/collection/src/shards/replica_set/update.rs (2)

772-773: LGTM: initialize new metadata field in the test fixture

Explicitly setting metadata: None keeps the fixture aligned with the updated CollectionConfigInternal and avoids unintended behavior changes in this test. No issues found.


772-773: Follow-up: add a targeted test for metadata merge and persistence semantics

Given the PR introduces in-memory merge behavior for collection-level metadata, consider adding a small unit/integration test that:

  • Creates a collection with metadata: None.
  • Applies an update that sets metadata via the update path.
  • Asserts the new metadata is visible from memory.
  • Reloads the collection (or simulates a restart) and asserts whether metadata persists, per the intended design.

This will lock in the contract between API, in-memory config, and on-disk persistence.

lib/collection/src/tests/query_prefetch_offset_limit.rs (1)

62-63: LGTM: metadata default added to config literal

Adding metadata: None keeps the fixture compiling against the new struct without affecting the query/prefetch assertions. Looks good.

lib/collection/tests/integration/multi_vec_test.rs (1)

66-67: LGTM: metadata field initialized

metadata: None is correctly added to the CollectionConfigInternal literal. Multi-vector tests remain unaffected by this addition.

lib/collection/src/shards/shard_holder/shard_mapping.rs (1)

189-189: LGTM: test config updated for new field

Including metadata: None in the test’s config ensures compatibility with the expanded struct. No behavior changes to shard key mapping or migration logic.

lib/collection/benches/batch_query_bench.rs (1)

71-72: LGTM: benchmark fixture updated

metadata: None is added to the bench setup config. Bench behavior and performance measurements remain unchanged.

lib/collection/benches/batch_search_bench.rs (1)

88-90: LGTM; bench updated for the new field.

Setting metadata: None here is appropriate for a perf bench. If you later want to include a small constant metadata object to exercise that path too, it should not materially affect results.

lib/collection/src/config.rs (1)

14-17: Importing Payload into config module looks correct

Bringing segment::types::Payload into scope here is consistent with using it in CollectionConfigInternal and avoids re-qualifying the type throughout the module.

lib/storage/src/content_manager/toc/create_collection.rs (2)

35-52: Metadata field correctly plumbed from CreateCollection.

Clean destructuring; ownership of metadata moves from operation as intended. No issues.


214-223: Metadata Persistence Verified

Collection::new immediately invokes collection_config.save(path), which serializes the entire CollectionConfigInternal (including the metadata field) to config.json via serde_json::to_vec(self) and an atomic write (qdrant.tech). On restart, Collection::load uses CollectionConfigInternal::load(path) to deserialize the same file, restoring the metadata. Furthermore, collection‐level snapshots bundle the config.json file, so the metadata will be present in any snapshot (qdrant.tech).

No changes required.

lib/storage/src/content_manager/mod.rs (1)

124-141: Added metadata: None to UpdateCollection in remove_replica — good catch.

Keeps the struct literal aligned with the expanded API so future serde/default changes don’t bite. LGTM.

lib/collection/src/collection/collection_ops.rs (2)

7-7: Import of Payload is appropriate.

Needed for the new metadata update method; no concerns.


172-176: Merge semantics verified
The Payload::merge implementation delegates to utils::merge_map, which for each entry in the source:

  • Removes the key if the incoming value is null.
  • Otherwise calls dest.insert(key, value), replacing any existing value under that key.

This ensures new values override existing ones and that nested JSON objects are replaced (rather than recursively merged), matching the test’s assumption around overriding keys like new_meta: "value3". There are no further changes needed here.

tests/openapi/test_collection_metadata.py (1)

1-14: Autouse fixture setup/teardown looks good.

Keeps tests isolated and exercises both in-memory and on-disk vectors via parametrization.

lib/storage/src/content_manager/toc/collection_meta_ops.rs (1)

138-139: Plumbs metadata through UpdateCollection destructuring correctly

Including metadata in the destructured UpdateCollection keeps this method future-proof and compiler-enforced for new fields. No issues spotted.

lib/collection/src/telemetry.rs (3)

5-5: Necessary import for telemetry wiring Payload

Importing Payload here is expected for the new telemetry surface. Looks good.


89-93: Telemetry: metadata added and anonymized — good default

  • #[serde(default, skip_serializing_if = "Option::is_none")] prevents noisy output.
  • #[anonymize(value = None)] avoids leaking user-provided metadata in telemetry.

No changes requested.


105-116: Propagates metadata from internal config into telemetry

The From<CollectionConfigInternal> impl now mirrors the struct fields, including metadata. This keeps telemetry consistent with the internal source of truth.

lib/collection/src/operations/types.rs (1)

180-185: Public API: adds collection-level metadata with appropriate serde behavior

  • Field docs help position intended use.
  • Option<Payload> with skip_serializing_if preserves backward compatibility.
    No concerns on the surface API.
lib/api/src/grpc/proto/collections.proto (1)

6-7: Imports Value type for metadata

The json_with_int.proto import enables the Value type; this keeps consistency with other JSON-like fields. All good.

lib/collection/src/operations/conversions.rs (2)

391-399: Plumbs metadata through CollectionConfig conversion (input→output) correctly.

Destructuring metadata here prepares it for inclusion in the gRPC response. No issues spotted.


530-533: gRPC response includes metadata map; good defaulting.

Serializing Option<Payload> to HashMap<String, Value> with unwrap_or_default() is the right call; it keeps wire format stable (empty map instead of omitting the field).

lib/storage/src/content_manager/conversions.rs (1)

113-117: CreateCollection: empty metadata → None is acceptable.

Treating empty metadata as None on create is fine since the resulting state is the same. Keep as-is unless you later choose to preserve an explicit-empty semantic on creation too.

docs/grpc/docs.md (2)

1727-1738: Doc/code contract: “empty object clears metadata” requires preserving emptiness in server-side conversion.

The docs for UpdateCollection.metadata say an empty object clears stored metadata. Ensure the server conversion (and update handler) honor that. See my code comment proposing to pass Some(empty) through. If this behavior isn’t implemented, either adjust the code (preferred) or amend the doc.


17-17: Markdown lint nit: unordered list indentation (MD007).

Autogenerated file; safe to ignore unless your pipeline enforces it.

Also applies to: 29-29, 88-88

lib/storage/src/content_manager/collection_meta_ops.rs (3)

185-190: Nicely scoped metadata field on CreateCollection.

The serde defaults and skip-when-none are correct; type aligns with Payload usage elsewhere.


289-291: Defaulting metadata to None in new_empty is appropriate.

Keeps the “no-op unless provided” semantics.


439-441: Propagates metadata from internal config to CreateCollection (for cloning/migration).

Good pass-through; keeps metadata intact across operations.

Also applies to: 469-470

src/migrations/single_to_cluster.rs (3)

41-49: Good: use collection State snapshot to avoid deep coupling.

Destructuring State here simplifies downstream usage and keeps responsibilities clear.


61-63: Migration builds CreateCollection with full config+metadata; looks correct.

Passing through vectors, sparse_vectors, config, and metadata ensures semantic parity post-migration.

Also applies to: 89-107


140-156: Replica state activation loop is correct.

Only marks local replicas Active; aligns with post-migration expectations.

docs/redoc/master/openapi.json (1)

7491-7497: Payload schema relocation verified—no duplicates or stale refs remain

  • Exactly one Payload definition under components/schemas (line 7490)
  • Nine exact $ref: "#/components/schemas/Payload" occurrences
  • No leftover #/definitions/Payload or stray #/components/Payload references

All existing references point to the single relocated schema. No further action required.

lib/api/src/grpc/qdrant.rs (1)

2-101: Verify migration scope and JSON semantics

Please address the following remaining points:

  • Wire compatibility / migration
    Confirm that the new json_with_int.proto types (i.e. Value/Struct/ListValue) are only used in newly added fields and aren’t replacing existing google.protobuf.Value-based fields in any public RPC. For example, the imports in points.proto (e.g. map<string, Value> payload) and collections.proto (e.g. map<string, Value> metadata) should correspond solely to new fields, so that mixed-version clusters or SDKs won’t experience decode failures.

  • JSON representation vs. serde derives (optional refactor)
    The current #[derive(serde::Serialize)] impl will emit { "kind": { "IntegerValue": 42 } } rather than a bare JSON number or string. If these types are ever serialized to telemetry or surfaced in JSON APIs, consider either:
    • A custom Serialize/Deserialize impl (or newtype wrapper) for native JSON (null/number/string/bool/object/array), with a distinct path for integers.
    • Or adjust the doc comments to reflect the { "kind": … } shape.

  • Naming ergonomics (nitpick)
    Defining Struct/Value/ListValue alongside the well-known types can confuse downstream users. You may want to publicly alias or re-export them (e.g. pub use json_with_int::{Value as JsonValue}) to reduce qualification boilerplate.

  • Bounds & validation (architecture advice)
    Since these dynamic values underpin user-supplied metadata, ensure the service layer enforces sensible upper bounds on key count, total serialized size, and nesting depth to prevent trivial RAM DoS vectors.

  • Proto scan completed
    A search confirmed no lingering imports or references to google.protobuf.Value/Struct/ListValue outside of timestamp/empty in other services. Everything in lib/api/src/grpc/proto/json_with_int.proto is self-contained.

Comment on lines +202 to +206
metadata: if metadata.is_empty() {
None
} else {
Some(json::proto_to_payloads(metadata)?)
},
@coderabbitai bot (Contributor) commented on Aug 24, 2025:

💡 Verification agent

🧩 Analysis chain

Bug: empty metadata on UpdateCollection cannot clear metadata (spec says it should).

Docs say “To remove metadata, set it to an empty object”, but this conversion turns an empty object into None (no-op). That makes it impossible to clear metadata via PATCH.

Fix by preserving emptiness: pass Some(empty) through so the update layer can interpret it as “clear all metadata”.

Apply this diff:

-                metadata: if metadata.is_empty() {
-                    None
-                } else {
-                    Some(json::proto_to_payloads(metadata)?)
-                },
+                // Preserve empty object to signal "clear all metadata"
+                metadata: Some(json::proto_to_payloads(metadata)?),

Follow-up: ensure the update handler treats Some(Payload { .. empty .. }) as “clear” (or implement that behavior if missing). Add an e2e test: PATCH with { "metadata": {} } then GET should return no metadata.


🏁 Script executed:

#!/bin/bash
# Verify how UpdateCollection.metadata is applied and whether empty means "clear".
# 1) Locate where UpdateCollection is handled to update collection state.
rg -nP -C3 --type=rust '\bUpdateCollection\b.*\{[^}]*metadata' 
rg -nP -C3 --type=rust 'metadata\b' lib/collection src lib/storage \
  | rg -nP -C2 '(merge|clear|overwrite).*metadata|metadata\s*=\s*'

# 2) Look for semantics that merge or clear collection-level metadata.
rg -nP -C3 --type=rust 'merge.*metadata|clear.*metadata|set.*metadata|overwrite.*metadata'

Length of output: 9844


Review Conversion Logic: Preserve Empty Metadata in PATCH Requests

The conversion in lib/storage/src/content_manager/conversions.rs currently drops empty metadata (treating {} as “not provided”) instead of passing it through. This prevents a PATCH of { "metadata": {} } from ever clearing existing metadata.

Specifically, the code at lines 202–206 reads:

metadata: if metadata.is_empty() {
    None
} else {
    Some(json::proto_to_payloads(metadata)?)
},

By returning None when the incoming metadata map is empty, we skip the update entirely instead of signaling a clear.

To align with the docs (“To remove metadata, set it to an empty object”) and enable clearing metadata, replace that snippet with:

- metadata: if metadata.is_empty() {
-     None
- } else {
-     Some(json::proto_to_payloads(metadata)?)
- },
+ // Preserve empty object to signal "clear all metadata"
+ metadata: Some(json::proto_to_payloads(metadata)?),

Apply this change in
lib/storage/src/content_manager/conversions.rs
around lines 202–206.
Follow up: Verify in
lib/storage/src/content_manager/collection_meta_ops.rs
and corresponding handlers that
Some(empty_payload) is interpreted as “clear metadata.”
If not, implement the logic to treat an empty Payload as a clear signal.

Test: Add an end-to-end test:

  1. PATCH /collections/{name} with { "metadata": {} }
  2. Then GET /collections/{name} should return no metadata (i.e. {}).
🤖 Prompt for AI Agents
In lib/storage/src/content_manager/conversions.rs around lines 202 to 206, the
conversion currently maps an empty incoming metadata map to None which prevents
PATCH requests from clearing metadata; change the logic to always return
Some(json::proto_to_payloads(metadata)?) even when metadata is empty so that an
explicit empty payload signals “clear metadata.” After that, verify
lib/storage/src/content_manager/collection_meta_ops.rs and related handlers
treat Some(empty_payload) as a clear operation and if they do not, update them
to interpret an empty Payload as a clear request. Finally, add an end-to-end
test: PATCH /collections/{name} with { "metadata": {} } then GET
/collections/{name} and assert metadata is empty.

Member Author (@generall) replied:

This prevents a PATCH of { "metadata": {} } from ever clearing existing metadata.

That is correct, however it is possible to clear a specific metadata key by doing

{ "metadata": {"key": null} }

@coderabbitai bot (Contributor) replied:

@generall Thank you for the clarification! You're absolutely right that individual metadata keys can be cleared using {"key": null} syntax - that's a more granular approach to metadata management.

This raises an important design question: Is the current behavior for empty metadata objects {} intentional?

Currently:

  • {"metadata": {"key": null}} → clears specific key ✅ (as you noted)
  • {"metadata": {}} → no-op (treated as "not provided")

The documentation mentions "To remove metadata, set it to an empty object" - but if key-level clearing via null is the intended mechanism, then perhaps:

  1. The docs should be updated to clarify that clearing is done per-key with null values
  2. Or, both approaches should be supported (empty object = clear all, null values = clear specific keys)

What's the intended design here? Should empty metadata objects clear all metadata, or is the current key-by-key clearing approach the preferred pattern?

🐰💭


✏️ Learnings added
Learnt from: generall
PR: qdrant/qdrant#7123
File: lib/storage/src/content_manager/conversions.rs:202-206
Timestamp: 2025-08-25T09:23:06.556Z
Learning: In Qdrant's collection metadata system, individual metadata keys can be cleared by setting them to null in PATCH requests, e.g., `{"metadata": {"key": null}}` will remove the specific "key" from the collection's metadata.


Member Author (@generall) replied:

we can't have { "metadata": {} } clear the whole object, as there is no difference between an empty map and null in gRPC, so the API would be inconsistent

@coderabbitai bot (Contributor) replied:

@generall Ah, that's a crucial architectural constraint I missed! You're absolutely right - gRPC doesn't distinguish between an empty map and null/unset, so using {"metadata": {}} to clear all metadata would create inconsistency between the gRPC and REST APIs.

The current design using {"metadata": {"key": null}} for individual key clearing is the correct approach that works consistently across both protocols. My suggestion to change the conversion logic was based on a misunderstanding of this fundamental limitation.

Thank you for clarifying this - it makes perfect sense why the API is designed this way to maintain consistency across different protocol endpoints.


✏️ Learnings added
Learnt from: generall
PR: qdrant/qdrant#7123
File: lib/storage/src/content_manager/conversions.rs:202-206
Timestamp: 2025-08-25T09:26:13.854Z
Learning: In Qdrant's API design, gRPC protocol limitations prevent using empty objects `{}` to clear all metadata because gRPC doesn't distinguish between empty maps and null/unset values. This constraint requires API consistency between gRPC and REST endpoints, making per-key clearing with null values (`{"key": null}`) the standard approach for metadata management.




@qdrant qdrant deleted a comment from coderabbitai bot Aug 25, 2025
Comment on lines +33 to +49
response = request_with_validation(
api="/collections/{collection_name}",
method="PATCH",
path_params={"collection_name": collection_name},
body={"metadata": {"new_meta": "value3"}}
)
assert response.ok

response = request_with_validation(
api='/collections/{collection_name}',
method="GET",
path_params={'collection_name': collection_name},
)
assert response.ok
result = response.json()["result"]

assert result["config"]["metadata"] == {"new_meta": "value3", "new_meta_2": "value2"}
@timvisee (Member) commented on Aug 25, 2025:

As far as I can see, there is no way to unset keys from the metadata.

How about making the patch request replace the whole metadata object, not merging keys? That would allow unsetting keys.

I understand that isn't exactly great either, because it still doesn't allow us to unset the whole object. But I'd argue it's a bit better.

Member Author (@generall) replied:

there is {"metadata": {"key": null}}

@timvisee (Member) replied:

That doesn't work through our Python client with None. That's what I tried in the included test 😃

Member Author (@generall) replied:

That doesn't work through our Python client with None. That's what I tried in the included test 😃

why not?

@timvisee (Member) replied:

🤷‍♂️ Looks like it works now. I've added it to the test.

@coderabbitai
@coderabbitai bot (Contributor) commented on Aug 26, 2025:

📝 Walkthrough

Walkthrough

Adds collection-level metadata support across the stack. gRPC/proto gains metadata maps on CreateCollection, UpdateCollection, and CollectionConfig (map<string, Value>) with JSON-like Value types added/consolidated in Rust bindings. OpenAPI introduces a central Payload schema and references it from multiple entities, adding metadata fields and a nullable uuid in telemetry. Internal structs now include an Option<Payload> metadata field; conversions and storage paths are updated to carry and merge metadata. A new Collection::update_metadata method persists changes; state/apply paths detect metadata updates. Migrations and shard ops adapt to the new config shape. Tests and benches update initializers; a new OpenAPI test validates PATCH merge semantics.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Suggested reviewers

  • coszio
  • agourlay


@coderabbitai bot (Contributor) left a review comment:

Actionable comments posted: 3

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
lib/collection/src/operations/verification/mod.rs (1)

506-515: Add positive metadata coverage test

A search of the repository found no existing tests that exercise metadata on CollectionConfigInternal. To avoid regressions, please add at least one new test for a non‐None metadata payload. For example, in lib/collection/src/operations/verification/mod.rs (or in a new integration test under tests/), create a collection config with metadata = Some(your_payload) and assert that:

  • It survives a save/load round‐trip unchanged.
  • It does not change the outcome of strict‐mode verification.
  • When metadata is None, it is omitted from telemetry/info outputs.

Pinpoint test locations to add or files to create:

  • lib/collection/src/operations/verification/mod.rs (adjacent to existing fixture tests)
  • Or a new file: tests/collection_metadata_roundtrip.rs

Test skeleton suggestion:

#[tokio::test]
async fn metadata_roundtrip_preserved() -> CollectionResult<()> {
    // 1. Build config with Some(metadata)
    // 2. Save and load collection
    // 3. Assert loaded_config.metadata == original metadata
    // 4. Run strict‐mode verification and assert same result as metadata=None
    // 5. Fetch telemetry/info and assert metadata field is absent when set to None
    Ok(())
}
♻️ Duplicate comments (1)
lib/collection/src/collection/collection_ops.rs (1)

166-180: update_metadata persists to disk and drops the write guard — looks correct

  • Correctly merges into existing metadata or sets it when absent.
  • Drops the write lock before saving, matching save patterns elsewhere.
  • Addresses the earlier persistence concern raised in review threads.

Optional polish:

  • Add a doc comment clarifying merge semantics (null removes keys).
  • Minor variable naming consistency (use “config” like other methods).

Suggested inline doc + naming tweak:

-    pub async fn update_metadata(&self, metadata: Payload) -> CollectionResult<()> {
-        let mut collection_config_guard: tokio::sync::RwLockWriteGuard<
+    /// Merge collection metadata and persist.
+    ///
+    /// Semantics:
+    /// - Keys present are merged into existing metadata.
+    /// - Values set to JSON null remove the corresponding keys.
+    pub async fn update_metadata(&self, metadata: Payload) -> CollectionResult<()> {
+        let mut config: tokio::sync::RwLockWriteGuard<
             '_,
             crate::config::CollectionConfigInternal,
         > = self.collection_config.write().await;
 
-        if let Some(current_metadata) = collection_config_guard.metadata.as_mut() {
+        if let Some(current_metadata) = config.metadata.as_mut() {
             current_metadata.merge(&metadata);
         } else {
-            collection_config_guard.metadata = Some(metadata);
+            config.metadata = Some(metadata);
         }
-        drop(collection_config_guard);
+        drop(config);
         self.collection_config.read().await.save(&self.path)?;
         Ok(())
     }
🧹 Nitpick comments (35)
lib/collection/src/shards/shard_holder/shard_mapping.rs (1)

180-190: Add a test for non-empty metadata in shard mapping

I wasn’t able to find any existing tests that instantiate CollectionConfigInternal with a non-None metadata field, including in the shard-mapping integration tests. Please add a test case under lib/collection/src/shards/shard_holder (or in your integration suite) that:

  • Constructs CollectionConfigInternal with metadata: Some(json!({"env": "test"}))
  • Calls the shard-mapping logic (via to_bytes/load or through the public API)
  • Asserts that the serialized bytes contain the "env":"test" payload
  • Verifies that shard-key mapping behavior is unchanged

This will ensure that arbitrary JSON metadata is correctly round-tripped without side-effects.

src/migrations/single_to_cluster.rs (2)

61-63: Defaulting sharding_method via unwrap_or_default — confirm default matches desired migration semantics

Minor: ensure ShardingMethod::default aligns with how you expect legacy (single-node) collections to migrate (typically Auto). If not, consider making the fallback explicit for readability.


91-103: Auto sharding: distribution filters to local replicas only — guard against empty distribution

If, for any reason, this peer holds no replicas (edge case), you’d propose an empty distribution. Consider warning or short-circuiting to avoid a silent no-op.

Apply this localized guard:

-                collection_create_operation.set_distribution(ShardDistributionProposal {
-                    distribution: shards
+                let distribution: std::collections::HashMap<_, _> = shards
                         .iter()
                         .filter_map(|(shard_id, shard_info)| {
                             if shard_info.replicas.contains_key(&this_peer_id) {
                                 Some((*shard_id, vec![this_peer_id]))
                             } else {
                                 None
                             }
-                        })
-                        .collect(),
-                });
+                        })
+                        .collect();
+                if distribution.is_empty() {
+                    tracing::warn!("No local replicas found during single→cluster migration for collection {}", collection_name);
+                }
+                collection_create_operation.set_distribution(ShardDistributionProposal { distribution });
lib/api/src/grpc/qdrant.rs (4)

24-61: Clarify serde JSON shape for oneof Value before relying on it externally

These types derive serde::Serialize, including the oneof value::Kind. If any code (telemetry, logs, debug dumps, etc.) relies on the JSON produced by serde here, be aware it won’t match “native” google.protobuf JSON for Value unless you add custom serializers. If this is only used server‑side or for gRPC adapters, you’re fine; otherwise, please document or add tests for the emitted JSON shape.

Would you like me to sketch a small serializer that converts this Value to serde_json::Value with the expected null/number/string/bool/array/object semantics?


811-814: CreateCollection.metadata: add usage notes and bounds; align terminology

Nice addition. Two suggestions:

  • Document intended semantics at create time (e.g., nested “StructValue” allowed, max size, allowed key set). Consider bounding the total serialized size or nesting depth to avoid abuse.
  • Wording nits: UpdateCollection says “JSON-like”; CreateCollection/CollectionConfig say “Arbitrary JSON”. Pick one term across the API to avoid confusion.

Apply this comment-only tweak for consistency:

-    /// Arbitrary JSON metadata for the collection
+    /// Arbitrary JSON-like metadata for the collection.
+    /// Nested objects (StructValue) and arrays (ListValue) are allowed.
+    /// Note: Consider size/nesting limits to protect against overly large metadata.

858-861: UpdateCollection.metadata cannot express “clear all” and makes deletes ambiguous

A proto3 map field is always present as an empty map when omitted, so the server cannot distinguish:

  • no-op (caller didn’t touch metadata) vs
  • “clear all metadata”.

Also, delete semantics for individual keys are not defined. If NullValue is intended to mean “delete key”, please document and test it; otherwise, consider a diff message with explicit set/unset.

Proto-level shape (reflected here after regeneration) that avoids ambiguity:

-    /// Arbitrary JSON-like metadata for the collection, will be merged with already stored metadata
-    #[prost(map = "string, message", tag = "10")]
-    pub metadata: ::std::collections::HashMap<::prost::alloc::string::String, Value>,
+    /// Metadata changes: keys in `set` are upserted/merged; keys in `unset` are removed.
+    #[prost(message, optional, tag = "10")]
+    pub metadata: ::core::option::Option<CollectionMetadataDiff>,

Add this message (shown here in Rust; source change should be made in the .proto, then re-generate):

#[derive(serde::Serialize, Clone, PartialEq, ::prost::Message)]
pub struct CollectionMetadataDiff {
    /// Upsert/merge these keys
    #[prost(map = "string, message", tag = "1")]
    pub set: ::std::collections::HashMap<::prost::alloc::string::String, Value>,
    /// Remove these keys (if present)
    #[prost(string, repeated, tag = "2")]
    pub unset: ::prost::alloc::vec::Vec<::prost::alloc::string::String>,
}

If you prefer to keep the map, explicitly state and test the rules:

  • empty map = no-op,
  • NullValue = delete the key,
  • StructValue = deep-merge, ListValue = replace (or append?) — document which.

858-861: Deep-merge semantics need to be unambiguous and tested

Because Value supports nested structs and lists, merge behavior matters:

  • StructValue: deep-merge (recursive) or replace?
  • ListValue: replace, append, or dedupe?
  • IntegerValue/DoubleValue/StringValue/BoolValue: replace.
  • NullValue: delete vs literal null?

Please encode these rules in doc comments and add unit/integration tests (gRPC and REST) that cover:

  • no-op update,
  • add, replace, delete,
  • deep merge vs replace at nested levels,
  • list behavior.

I can draft test vectors and a merge helper that enforces one canonical policy (e.g., deep-merge for StructValue, replace for ListValue, NullValue deletes).

Also applies to: 811-814, 963-966
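To make the discussion concrete, here is a Python sketch of the canonical policy suggested above (deep-merge for StructValue, replace for ListValue, NullValue deletes). It models the proposal, not what the server currently implements:

```python
def merge_metadata(existing: dict, patch: dict) -> dict:
    """One candidate policy: deep-merge dicts, replace lists/scalars, null deletes."""
    result = dict(existing)
    for key, value in patch.items():
        if value is None:
            result.pop(key, None)  # NullValue deletes the key
        elif isinstance(value, dict) and isinstance(result.get(key), dict):
            result[key] = merge_metadata(result[key], value)  # deep-merge structs
        else:
            result[key] = value  # scalars and lists replace
    return result

existing = {"a": {"x": 1, "y": 2}, "tags": [1, 2], "old": True}
patch = {"a": {"y": 3}, "tags": [9], "old": None}
print(merge_metadata(existing, patch))
# → {'a': {'x': 1, 'y': 3}, 'tags': [9]}
```

The test vectors mentioned above fall out directly: a no-op is an empty patch, add/replace/delete are single-key patches, and nested behavior is exercised by the `"a"` key.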

docs/grpc/docs.md (5)

17-17: Keep ToC indentation consistent or silence MD007 for generated docs

markdownlint flags MD007 (ul-indent) on these lines. The entire ToC uses 4-space indentation for nested items, so fixing only the new entries would make the list inconsistent. Prefer either:

  • adjust the generator to emit 2-space indents everywhere, or
  • add a markdownlint override for this generated file/folder to silence MD007.

Also applies to: 29-29, 88-88


519-528: Optional: clarify that this is a proto map rendered as repeated MapEntry

To reduce confusion for readers unfamiliar with protoc’s doc style, add a short note that MetadataEntry represents a map<string, Value> in the proto.

Apply this diff to append a brief hint to the section heading:

-### CollectionConfig.MetadataEntry
+### CollectionConfig.MetadataEntry
+Note: Rendered as a repeated entry in this documentation, but defined in proto as `map<string, Value>`.

736-745: Optional: add the same “proto map” hint here for consistency

Mirror the explanatory note you add for CollectionConfig.MetadataEntry.

-### CreateCollection.MetadataEntry
+### CreateCollection.MetadataEntry
+Note: Rendered as a repeated entry in this documentation, but defined in proto as `map<string, Value>`.

1737-1738: Document merge semantics and null-based key removal for UpdateCollection.metadata

The current description says metadata “will be merged,” but doesn’t define the merge strategy or how to remove keys. Based on the implementation, setting a metadata value to null removes that key. Also, keep wording consistent (“JSON” vs “JSON-like”).

Apply this diff to clarify semantics and unify wording:

-| metadata | [UpdateCollection.MetadataEntry](#qdrant-UpdateCollection-MetadataEntry) | repeated | Arbitrary JSON-like metadata for the collection, will be merged with already stored metadata |
+| metadata | [UpdateCollection.MetadataEntry](#qdrant-UpdateCollection-MetadataEntry) | repeated | Arbitrary JSON metadata for the collection. Merge is shallow by key with existing metadata; setting a key's value to `null` removes that key. |

Note: The null-removal behavior is confirmed by the team’s learnings for this PR (“setting {"metadata": {"key": null}} removes the key”). If desired, I can add a cross-link or a short example snippet in the HTTP docs for parity.
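A few lines of Python model the semantics described here (shallow merge by top-level key, `null` removes the key). This is an illustration of the documented contract, not the server implementation:

```python
def shallow_patch(existing: dict, patch: dict) -> dict:
    """Shallow merge by top-level key; a None (JSON null) value removes the key."""
    result = dict(existing)
    for key, value in patch.items():
        if value is None:
            result.pop(key, None)   # {"key": null} deletes the key
        else:
            result[key] = value     # nested objects are replaced, not deep-merged
    return result

# Replace "b" outright, delete "c", leave "a" untouched.
print(shallow_patch({"a": 1, "b": {"x": 1}, "c": 3}, {"b": {"y": 2}, "c": None}))
# → {'a': 1, 'b': {'y': 2}}
```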


1746-1753: Optional: cross-reference Value(null) here to make deletion semantics discoverable

A tiny inline hint here will help users discover how to delete keys when reading the entry schema alone.

-### UpdateCollection.MetadataEntry
+### UpdateCollection.MetadataEntry
+Tip: Use `value = NullValue` (JSON `null`) to delete an existing metadata key during an UpdateCollection request.
docs/redoc/master/openapi.json (4)

6725-6734: Add punctuation and tighten description for metadata (and keep semantics consistent with other sections).

Minor wording/punctuation issue and opportunity to align phrasing with other occurrences.

Apply this diff to polish the description:

-          "metadata": {
-            "description": "Arbitrary JSON metadata for the collection This can be used to store application-specific information such as creation time, migration data, inference model info, etc.",
+          "metadata": {
+            "description": "Arbitrary JSON metadata for the collection. Use this to store application-specific information such as creation time, migration data, inference model info, etc.",

Optionally add an example to improve discoverability:

           "metadata": {
             "description": "Arbitrary JSON metadata for the collection. Use this to store application-specific information such as creation time, migration data, inference model info, etc.",
             "anyOf": [
               {
                 "$ref": "#/components/schemas/Payload"
               },
               {
                 "nullable": true
               }
             ]
+            ,
+            "example": { "team": "search", "purpose": "experiments" }
           }

7490-7497: Centralized Payload schema: good consolidation; consider (optional) stronger type shape for client codegen.

The new shared Payload is a solid improvement and will reduce duplication. If you want stricter typing for SDKs, consider constraining values to JSON primitives/arrays/objects recursively instead of bare additionalProperties: true. Not required, but it can improve generated types.

For example:

-      "Payload": {
-        "type": "object",
-        "additionalProperties": true,
+      "Payload": {
+        "type": "object",
+        "additionalProperties": {
+          "oneOf": [
+            { "type": "string" },
+            { "type": "number", "format": "double" },
+            { "type": "boolean" },
+            { "type": "array", "items": { "$ref": "#/components/schemas/Payload" } },
+            { "$ref": "#/components/schemas/Payload" }
+          ]
+        },
         "example": {
           "city": "London",
           "color": "green"
         }
       },

Note: this keeps values fully JSON-like while remaining recursive. Feel free to keep current version if you prefer simplicity.


9735-9744: CreateCollection.metadata: align copy, add example for clarity.

Small text polish and a concrete example helps usage.

-          "metadata": {
-            "description": "Arbitrary JSON metadata for the collection This can be used to store application-specific information such as creation time, migration data, inference model info, etc.",
+          "metadata": {
+            "description": "Arbitrary JSON metadata for the collection. Use this to store application-specific information such as creation time, migration data, inference model info, etc.",
             "anyOf": [
               { "$ref": "#/components/schemas/Payload" },
               { "nullable": true }
             ]
-          }
+            ,
+            "example": { "owner": "data-platform", "created_by": "provisioner", "created_at": "2025-08-24T21:10:00Z" }
+          }

11975-11984: CollectionConfigTelemetry.metadata: reuse the same phrasing as input fields for consistency.

Telemetry exposure looks good. Mirror the phrasing used for CollectionConfig.metadata to make behavior clear to readers scanning schemas.

-          "metadata": {
-            "description": "Arbitrary JSON metadata for the collection",
+          "metadata": {
+            "description": "Arbitrary JSON metadata for the collection.",
             "anyOf": [
               { "$ref": "#/components/schemas/Payload" },
               { "nullable": true }
             ]
           }

Optional: add a short note that this is read-only observational data (since it’s telemetry), e.g., “Reported value of collection metadata (read-only).”

lib/collection/tests/integration/common/mod.rs (1)

58-59: Explicit None is fine; consider a metadata-aware fixture for upcoming tests

The added field correctly satisfies the new struct shape. To make it easy to write metadata-focused tests without duplicating setup, consider introducing a variant of this fixture that accepts an Option<Payload> and forwards it here, keeping this one as a convenience wrapper that passes None. This keeps existing call sites untouched while enabling targeted metadata tests.

I can draft a small helper like:

use segment::types::Payload;

#[cfg(test)]
pub async fn simple_collection_fixture_with_metadata(
    collection_path: &Path,
    shard_number: u32,
    metadata: Option<Payload>,
) -> Collection {
    // ... same setup ...
    let collection_config = CollectionConfigInternal {
        // ...
        uuid: None,
        metadata,
    };
    // ...
}

And keep the current simple_collection_fixture calling the new one with None.

lib/collection/src/tests/query_prefetch_offset_limit.rs (1)

62-63: LGTM – no behavioral change for prefetch tests

Setting metadata: None in the existing tests aligns with the new field and doesn’t affect query/prefetch logic. I verified that there are currently no tests initializing metadata: Some(...) in the suite (the only match is in production code: lib/storage/src/content_manager/consensus_manager.rs:1041), so we lack coverage for non-empty metadata.

• Consider adding a test under lib/collection/src/tests/ (e.g. in query_prefetch_offset_limit.rs or a new file) that initializes your request with metadata: Some(my_meta) and asserts it’s preserved in the result.

lib/collection/src/tests/snapshot_test.rs (1)

58-59: Add snapshot round‑trip coverage for non-empty metadata

Right now we snapshot/restore with metadata: None. It would be valuable to assert that custom metadata survives snapshot and restore. Suggest adding a sibling test (or parametrizing this one) that sets a small payload and then checks it on the recovered collection.

Happy to add a test that sets, snapshots, restores, and verifies something like:

use segment::types::Payload;
use serde_json::{Map, Value};

let meta = Payload(Map::from_iter([("any_values".to_string(), Value::from("go-like-this"))]));
// set metadata during config init or via update method if available

Then assert the recovered collection exposes the same metadata via the appropriate API.

lib/collection/src/tests/points_dedup.rs (1)

64-65: LGTM; optional: one dedup run with non-empty metadata

The explicit None is correct. To harden guarantees that metadata does not affect point deduplication, consider one variant of the fixture initialized with a small non-empty metadata payload and rerun one of the dedup assertions unchanged. This keeps behavior coverage strong without changing production code.

I can wire a tiny metadata payload here with:

use segment::types::Payload;
use serde_json::{Map, Value};

let config = CollectionConfigInternal {
    // ...
    uuid: None,
    metadata: Some(Payload(Map::from_iter([(
        "any_values".to_string(),
        Value::from("go-like-this"),
    )]))),
};
lib/collection/tests/integration/snapshot_recovery_test.rs (1)

48-49: Cover metadata persistence across snapshot recovery (integration path)

I ran a repo-wide search for non-empty metadata; the only occurrence is in production code (lib/storage/src/content_manager/consensus_manager.rs), and no integration test exercises Some(meta). To strengthen this snapshot recovery test, let’s:

• In lib/collection/tests/integration/snapshot_recovery_test.rs, set a small non-empty metadata payload before snapshotting, for example:

let collection_desc = CollectionDescription {
    name: "my_collection".into(),
    metadata: Some(json!({"version": "1"})),
    // …
};

• After performing recovery, assert that the recovered metadata matches the original:

assert_eq!(recovered_collection.metadata, Some(json!({"version": "1"})));

This ensures collection-level metadata is correctly persisted through both snapshot and recovery paths. Please let me know if you’d like me to draft the exact code snippet.

lib/segment/src/types.rs (1)

1672-1675: Deriving Hash for Payload is appropriate but confirm usage expectations

Adding Hash to Payload(Map<String, Value>) is likely required so enums/structs containing Payload (e.g., UpdateCollection -> ConsensusOperations) can keep deriving Hash. Please double-check two things:

  • That downstream hashing is not relied upon for cross-node determinism or persistence semantics (use StableHash if that’s needed).
  • That our current serde_json/indexmap versions guarantee a sensible Hash implementation for Map<String, Value> in our toolchain.

If both are satisfied, this change looks good.

lib/storage/src/content_manager/mod.rs (1)

129-141: Prefer new_empty constructor to reduce boilerplate and future churn

You’re constructing UpdateCollection with all Nones just to attach replica changes. Using the helper avoids repeating fields and protects this site from future field additions.

Apply:

-            let mut operation = UpdateCollectionOperation::new(
-                collection_name,
-                UpdateCollection {
-                    vectors: None,
-                    optimizers_config: None,
-                    params: None,
-                    hnsw_config: None,
-                    quantization_config: None,
-                    sparse_vectors: None,
-                    strict_mode_config: None,
-                    metadata: None,
-                },
-            );
+            let mut operation = UpdateCollectionOperation::new_empty(collection_name);
tests/openapi/test_collection_metadata.py (3)

16-23: Rename test to reflect behavior (metadata patch/merge)

Current name is misleading. Consider renaming to improve discoverability.

-def test_collection_exists(collection_name):
+def test_collection_metadata_patch_merge_and_delete(collection_name):

33-47: Per-key deletion via JSON null is exercised — but note this bypasses the Python client

This validates the contract of setting a key to null to remove it. The earlier reviewer concern was about the Python SDK translating None; this test uses raw HTTP via requests, not the qdrant-client SDK. If the SDK behavior was the original pain point, consider adding (in the client repo) an SDK test ensuring None serializes to JSON null for metadata values.

I can draft a minimal SDK test case or open an issue with repro steps if helpful.
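For reference, Python's standard json module already maps None to JSON null, so any SDK that forwards the metadata dict through json.dumps gets the per-key deletion contract for free. This is a sketch of the serialization behavior only, not qdrant-client code:

```python
import json

# A PATCH body that deletes "stale_key" while upserting "owner".
body = {"metadata": {"stale_key": None, "owner": "search-team"}}
print(json.dumps(body))
# → {"metadata": {"stale_key": null, "owner": "search-team"}}
```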


49-58: Add a test that empty object patch is a no-op (documenting gRPC constraint)

Given gRPC can’t distinguish empty map vs unset, PATCH with {"metadata": {}} should not clear all keys. Add a test to lock this behavior and reference the API constraint.

Example additional test:

def test_metadata_empty_patch_is_noop(collection_name):
    # Seed a key
    r = request_with_validation(
        api="/collections/{collection_name}", method="PATCH",
        path_params={"collection_name": collection_name},
        body={"metadata": {"k": "v"}}
    )
    assert r.ok

    # Empty object patch should do nothing
    r = request_with_validation(
        api="/collections/{collection_name}", method="PATCH",
        path_params={"collection_name": collection_name},
        body={"metadata": {}}
    )
    assert r.ok

    # Verify key remains
    r = request_with_validation(
        api="/collections/{collection_name}", method="GET",
        path_params={"collection_name": collection_name},
    )
    assert r.ok
    assert r.json()["result"]["config"]["metadata"] == {"k": "v"}

This also aligns with the learnings that null clears individual keys, while empty objects do not clear all metadata.

lib/collection/src/config.rs (1)

235-240: Optional metadata on internal config is backward-compatible — add explicit null semantics to docs

Serde defaults and skip when None preserve compatibility with existing config.json files. Consider extending the docstring to state that JSON null deletes a key during PATCH, and that clearing all keys via {} is intentionally unsupported due to gRPC map semantics.

-    /// Arbitrary JSON metadata for the collection
-    /// This can be used to store application-specific information
-    /// such as creation time, migration data, inference model info, etc.
+    /// Arbitrary JSON metadata for the collection.
+    /// This can be used to store application-specific information
+    /// such as creation time, migration data, inference model info, etc.
+    ///
+    /// Update semantics (via PATCH/UpdateCollection):
+    /// - Keys are merged into existing metadata.
+    /// - Setting a key to JSON `null` removes that key.
+    /// - Sending an empty object `{}` does not clear all metadata (gRPC maps
+    ///   cannot express “unset vs empty” distinctly; per-key nulls are the supported pattern).
lib/storage/src/content_manager/toc/collection_meta_ops.rs (1)

178-180: Applying metadata updates in update_collection path

Calling collection.update_metadata(metadata).await? integrates cleanly with the existing update flow. No optimizer recreation needed — correct.

Optional micro-optimization: if the provided payload is empty, you could skip the write/save to avoid an unnecessary disk flush.

lib/collection/src/operations/conversions.rs (1)

1948-2047: Proto→internal conversion: empty-map-as-None preserves gRPC semantics

The logic

  • empty map -> None (no-op / “not provided”)
  • non-empty map -> Some(payload)

is consistent with the gRPC limitation (empty map indistinguishable from unset). This also matches Update semantics elsewhere. Consider adding an inline comment to deter future regressions.

Apply this diff to document intent inline:

-            metadata: if metadata.is_empty() {
+            // gRPC does not distinguish between "unset" and an empty map.
+            // Treat empty map as "not provided" (no-op); only propagate when non-empty.
+            metadata: if metadata.is_empty() {
                 None
             } else {
                 Some(api::conversions::json::proto_to_payloads(metadata)?)
             },
lib/collection/src/operations/types.rs (1)

180-185: Public config gains metadata with correct serde behavior; plumbing is consistent

  • Field is optional, skipped when None; comments explain intended usage well.
  • From forwards metadata as-is, aligning REST/telemetry views.

Looks good. One minor thought: if you foresee redacting metadata in any anonymized public dumps, mirror telemetry’s #[anonymize(value = None)] here; otherwise keeping it visible in user-facing config is reasonable.

Also applies to: 198-209

lib/api/src/grpc/proto/collections.proto (1)

413-414: Clarify merge/clear semantics for metadata in proto docs

Excellent: CreateCollection/UpdateCollection/CollectionConfig expose metadata as map<string, Value>. Given gRPC’s “no null vs empty map” limitation, please make the per-key clearing rule explicit in comments to avoid client confusion.

Apply the doc tweak:

 message CreateCollection {
-  map<string, Value> metadata = 18; // Arbitrary JSON metadata for the collection
+  map<string, Value> metadata = 18; // Arbitrary JSON metadata for the collection.
+  // Note: gRPC cannot distinguish between an empty map and an unset field.
+  // Clients should not rely on "{}" to "clear" metadata here.
 }

 message UpdateCollection {
-  map<string, Value> metadata = 10; // Arbitrary JSON-like metadata for the collection, will be merged with already stored metadata
+  map<string, Value> metadata = 10; // Arbitrary JSON-like metadata for the collection; merged with existing.
+  // To remove a specific key, set it to null, e.g. {"key": null}. Sending "{}" is a no-op in gRPC.
 }

 message CollectionConfig {
-  map<string, Value> metadata = 7; // Arbitrary JSON metadata for the collection
+  map<string, Value> metadata = 7; // Arbitrary JSON metadata for the collection (empty map when absent).
 }

Also applies to: 426-427, 466-467

lib/storage/src/content_manager/conversions.rs (2)

202-206: UpdateCollection: retain empty-map-as-no-op; add guardrail comment

The empty => None mapping matches the intended PATCH semantics (per-key null to delete). Consider an inline comment explaining why "{}" is a no-op in gRPC to prevent future “clear-all” regressions.

Apply this diff:

-                metadata: if metadata.is_empty() {
+                // gRPC maps cannot signal "clear all" with an empty object;
+                // keep empty as None (no update). Use {"key": null} to remove keys.
+                metadata: if metadata.is_empty() {
                     None
                 } else {
                     Some(json::proto_to_payloads(metadata)?)
                 },

4-4: Nit: import alias optional

Minor: if only proto_to_payloads is used, importing it directly (use api::conversions::json::proto_to_payloads) could cut the json:: prefix. Pure style; feel free to ignore.

lib/storage/src/content_manager/collection_meta_ops.rs (2)

185-190: CreateCollection.metadata: Solid addition; add a size guard to prevent unbounded config growth

Arbitrary JSON at collection scope is useful, but without limits it can bloat snapshots, state diffs, and network payloads. Consider a soft cap (e.g., 64–256 KiB) with validation.

Apply this minimal change to add a validator on the field:

     #[serde(default, skip_serializing_if = "Option::is_none")]
+    #[validate(custom(function = "validate_metadata_size"))]
     pub metadata: Option<Payload>,

Add the helper in this module (or a shared validators module):

// Outside the struct, same file or a shared validators module
use validator::ValidationError;

const MAX_METADATA_BYTES: usize = 128 * 1024; // adjust as needed

fn validate_metadata_size(metadata: &Option<Payload>) -> Result<(), ValidationError> {
    if let Some(m) = metadata {
        // serialize to estimate size; fallback to key count if serialization fails
        let size = serde_json::to_vec(m).map(|v| v.len()).unwrap_or_else(|_| m.len());
        if size > MAX_METADATA_BYTES {
            return Err(ValidationError::new("collection_metadata_too_large"));
        }
    }
    Ok(())
}

263-266: Clarify metadata merge semantics and key deletion behavior

Proposed update to the metadata field documentation in lib/storage/src/content_manager/collection_meta_ops.rs:

-    /// Metadata to update for the collection. If provided, this will merge with existing metadata.
-    /// To remove metadata, set it to an empty object.
+    /// Metadata to update for the collection.
+    ///
+    /// Semantics:
+    /// - Omitted (`None`): leave existing metadata unchanged.
+    /// - Provided as an empty object `{}`: treated as "not provided" (no-op). gRPC cannot
+    ///   distinguish an empty map from an unset field, so clearing _all_ metadata this way is
+    ///   intentionally unsupported; remove keys individually by setting them to `null`.
+    /// - Provided as a non-empty map:
+    ///     • For each key with a non-null value: set or overwrite that key.  
+    ///     • For each key with a `null` value: remove that key from existing metadata.  
+    /// - Merge is _shallow_: top-level keys are merged; nested object values are replaced outright, not merged recursively.
     #[serde(default, skip_serializing_if = "Option::is_none")]
     pub metadata: Option<Payload>,

This makes it explicit:

  • how omission is handled
  • how full-clear vs. per-key deletion works
  • that the merge is shallow rather than recursive
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between bf60e96 and 6e74d2f.

📒 Files selected for processing (33)
  • docs/grpc/docs.md (6 hunks)
  • docs/redoc/master/openapi.json (5 hunks)
  • lib/api/src/grpc/proto/collections.proto (4 hunks)
  • lib/api/src/grpc/qdrant.rs (4 hunks)
  • lib/collection/benches/batch_query_bench.rs (1 hunks)
  • lib/collection/benches/batch_search_bench.rs (1 hunks)
  • lib/collection/src/collection/collection_ops.rs (2 hunks)
  • lib/collection/src/collection/state_management.rs (1 hunks)
  • lib/collection/src/config.rs (2 hunks)
  • lib/collection/src/operations/conversions.rs (4 hunks)
  • lib/collection/src/operations/types.rs (3 hunks)
  • lib/collection/src/operations/verification/mod.rs (1 hunks)
  • lib/collection/src/shards/local_shard/shard_ops.rs (1 hunks)
  • lib/collection/src/shards/replica_set/update.rs (1 hunks)
  • lib/collection/src/shards/shard_holder/shard_mapping.rs (1 hunks)
  • lib/collection/src/telemetry.rs (4 hunks)
  • lib/collection/src/tests/fixtures.rs (1 hunks)
  • lib/collection/src/tests/points_dedup.rs (1 hunks)
  • lib/collection/src/tests/query_prefetch_offset_limit.rs (1 hunks)
  • lib/collection/src/tests/snapshot_test.rs (1 hunks)
  • lib/collection/tests/integration/common/mod.rs (1 hunks)
  • lib/collection/tests/integration/multi_vec_test.rs (1 hunks)
  • lib/collection/tests/integration/snapshot_recovery_test.rs (1 hunks)
  • lib/segment/src/types.rs (1 hunks)
  • lib/storage/src/content_manager/collection_meta_ops.rs (6 hunks)
  • lib/storage/src/content_manager/conversions.rs (5 hunks)
  • lib/storage/src/content_manager/mod.rs (1 hunks)
  • lib/storage/src/content_manager/toc/collection_meta_ops.rs (2 hunks)
  • lib/storage/src/content_manager/toc/create_collection.rs (2 hunks)
  • lib/storage/tests/integration/alias_tests.rs (1 hunks)
  • src/consensus.rs (1 hunks)
  • src/migrations/single_to_cluster.rs (5 hunks)
  • tests/openapi/test_collection_metadata.py (1 hunks)
🧰 Additional context used
🧠 Learnings (5)
📚 Learning: 2025-08-25T09:23:06.556Z
Learnt from: generall
PR: qdrant/qdrant#7123
File: lib/storage/src/content_manager/conversions.rs:202-206
Timestamp: 2025-08-25T09:23:06.556Z
Learning: In Qdrant's collection metadata system, individual metadata keys can be cleared by setting them to null in PATCH requests, e.g., `{"metadata": {"key": null}}` will remove the specific "key" from the collection's metadata.

Applied to files:

  • tests/openapi/test_collection_metadata.py
  • lib/storage/src/content_manager/conversions.rs
  • docs/grpc/docs.md
📚 Learning: 2025-08-25T09:26:13.854Z
Learnt from: generall
PR: qdrant/qdrant#7123
File: lib/storage/src/content_manager/conversions.rs:202-206
Timestamp: 2025-08-25T09:26:13.854Z
Learning: In Qdrant's API design, gRPC protocol limitations prevent using empty objects `{}` to clear all metadata because gRPC doesn't distinguish between empty maps and null/unset values. This constraint requires API consistency between gRPC and REST endpoints, making per-key clearing with null values (`{"key": null}`) the standard approach for metadata management.

Applied to files:

  • tests/openapi/test_collection_metadata.py
  • lib/storage/src/content_manager/conversions.rs
📚 Learning: 2025-08-11T07:57:01.399Z
Learnt from: IvanPleshkov
PR: qdrant/qdrant#6986
File: lib/segment/src/vector_storage/quantized/quantized_vectors.rs:81-84
Timestamp: 2025-08-11T07:57:01.399Z
Learning: In `lib/segment/src/vector_storage/quantized/quantized_vectors.rs`, the BitsStoreType parameter difference between single-vector and multi-vector Binary quantization is intentional: single-vector storage uses `EncodedVectorsBin<u128, ...>` to enable 128-bit SIMD/popcount optimizations for longer vectors, while multi-vector storage uses `EncodedVectorsBin<u8, ...>` because multivectors are typically shorter and benefit from byte-granular storage.

Applied to files:

  • lib/collection/src/config.rs
📚 Learning: 2025-08-15T11:41:10.629Z
Learnt from: IvanPleshkov
PR: qdrant/qdrant#7043
File: lib/segment/src/vector_storage/quantized/quantized_vectors.rs:857-857
Timestamp: 2025-08-15T11:41:10.629Z
Learning: In `lib/segment/src/vector_storage/quantized/quantized_vectors.rs`, the multivector offset storage has two different patterns: for RAM storage, offsets are collected into Vec<MultivectorOffset> and used directly; for MMAP storage, offsets are consumed to create a file via create_offsets_file_from_iter, then the file is loaded back as MultivectorOffsetsStorageMmap. The direct consumption of offsets iterator in the MMAP case is intentional.

Applied to files:

  • lib/collection/src/config.rs
📚 Learning: 2025-08-11T00:37:34.100Z
Learnt from: IvanPleshkov
PR: qdrant/qdrant#6986
File: lib/segment/src/vector_storage/quantized/quantized_ram_storage.rs:46-55
Timestamp: 2025-08-11T00:37:34.100Z
Learning: In `lib/segment/src/vector_storage/quantized/quantized_ram_storage.rs`, the `ChunkedVectors` used in `QuantizedRamStorage` is intentionally designed to be non-persistent during updates. The `push_vector` method only updates the in-memory vectors without flushing to disk, and this is expected behavior rather than a bug.

Applied to files:

  • lib/collection/src/config.rs
🧬 Code graph analysis (7)
tests/openapi/test_collection_metadata.py (2)
tests/openapi/helpers/helpers.py (1)
  • request_with_validation (41-116)
tests/openapi/conftest.py (1)
  • on_disk_vectors (11-12)
lib/collection/src/telemetry.rs (1)
lib/segment/src/common/anonymize.rs (13)
  • anonymize (45-45)
  • anonymize (49-51)
  • anonymize (55-57)
  • anonymize (61-63)
  • anonymize (67-71)
  • anonymize (75-79)
  • anonymize (115-119)
  • anonymize (123-127)
  • anonymize (131-135)
  • anonymize (139-148)
  • anonymize (152-154)
  • anonymize (158-162)
  • anonymize (166-181)
lib/collection/src/shards/local_shard/shard_ops.rs (1)
lib/collection/src/operations/types.rs (15)
  • from (97-104)
  • from (188-210)
  • from (259-280)
  • from (671-673)
  • from (691-693)
  • from (1185-1190)
  • from (1194-1199)
  • from (1203-1253)
  • from (1257-1259)
  • from (1263-1268)
  • from (1272-1277)
  • from (1281-1286)
  • from (1290-1295)
  • from (1299-1304)
  • from (1308-1313)
lib/api/src/grpc/qdrant.rs (1)
lib/collection/src/operations/conversions.rs (1)
  • map (1930-1932)
src/migrations/single_to_cluster.rs (3)
lib/collection/src/shards/shard_holder/mod.rs (2)
  • new (77-110)
  • shards (284-284)
lib/collection/src/collection/mod.rs (1)
  • new (103-200)
lib/collection/src/shards/shard_holder/shard_mapping.rs (2)
  • shard_key (71-76)
  • shard_ids (50-59)
lib/collection/src/operations/conversions.rs (1)
lib/api/src/conversions/json.rs (2)
  • payload_to_proto (9-14)
  • proto_to_payloads (49-55)
lib/storage/src/content_manager/conversions.rs (1)
lib/api/src/conversions/json.rs (1)
  • proto_to_payloads (49-55)
🪛 markdownlint-cli2 (0.17.2)
docs/grpc/docs.md

17-17: Unordered list indentation
Expected: 2; Actual: 4

(MD007, ul-indent)


29-29: Unordered list indentation
Expected: 2; Actual: 4

(MD007, ul-indent)


88-88: Unordered list indentation
Expected: 2; Actual: 4

(MD007, ul-indent)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (12)
  • GitHub Check: test-shard-snapshot-api-s3-minio
  • GitHub Check: test-consensus-compose
  • GitHub Check: storage-compat-test
  • GitHub Check: integration-tests
  • GitHub Check: integration-tests-consensus
  • GitHub Check: e2e-tests
  • GitHub Check: test-consistency
  • GitHub Check: rust-tests (macos-latest)
  • GitHub Check: rust-tests-no-rocksdb (ubuntu-latest)
  • GitHub Check: rust-tests (windows-latest)
  • GitHub Check: rust-tests (ubuntu-latest)
  • GitHub Check: lint
🔇 Additional comments (33)
lib/collection/src/shards/local_shard/shard_ops.rs (1)

165-168: Switch to explicit From conversion is clear and type-safe (LGTM)

Using CollectionInfo::from(...) makes the intent explicit and avoids inference pitfalls with Into. No functional change.

src/migrations/single_to_cluster.rs (5)

3-5: Importing State and internal config types is appropriate for the new state-based migration path

This aligns the migration with the new collection state API. Looks correct.


41-49: State destructuring is precise; ignored fields are documented (good)

The explicit comments for resharding/transfers/payload_index_schema make the intent clear during single→cluster migration.


67-83: CreateCollection fields sourced from params/config look consistent; metadata inclusion is correct

Fields mapped from params.* and *_config.into() are coherent. The deprecated init_from is intentionally set to None — OK. No functional issues spotted.


140-156: Ordering of meta ops: ensure SetShardReplicaState runs after Create operations are applied

You await each submit, but depending on dispatcher semantics, the create may be enqueued rather than fully applied when SetShardReplicaState is submitted. If the dispatcher guarantees in-order, durable application before returning, you’re fine; otherwise, consider chaining on applied notifications or batching ops.

If unsure, please point to dispatcher submit semantics or add a comment to document the guarantee.


50-59: Verify metadata propagation end-to-end

I wasn’t able to locate a CreateCollection definition in the lib/collection crate, so please double-check that the consensus op still carries the new metadata field and that it’s applied on followers. I have, however, confirmed that your migration does surface metadata in the public types:

• In lib/collection/src/operations/types.rs,
impl From<CollectionConfigInternal> for CollectionConfig preserves metadata unchanged.
• In lib/collection/src/telemetry.rs,
impl From<CollectionConfigInternal> for CollectionConfigTelemetry also destructures and re-emits metadata.

Please ensure:

  • The consensus CreateCollection op (wherever it is defined) declares a metadata: Option<Payload> field.
  • The follower’s apply/execute logic for CreateCollection persists that field into CollectionConfigInternal.
  • The telemetry or API-facing “collection info” type (e.g. your CollectionInfo or equivalent) includes metadata and returns it after migration.
lib/api/src/grpc/qdrant.rs (2)

2-101: Consolidated JSON-like Value types look good

Moving Struct/Value/ListValue/NullValue to a single, public place and reusing them across messages removes duplication and clarifies the API surface. Tag ranges and oneof layout look consistent and non‑overlapping.


963-966: Metadata round-trip verified

All three sanity checks pass, confirming that CollectionConfig.metadata is fully wired through:

  • The gRPC definition in lib/api/src/grpc/qdrant.rs defines exactly one Struct, one Value, and one ListValue—no duplicates remain in that file.
  • The server-side handler in lib/collection/src/collection/collection_ops.rs implements update_metadata, which either merges into existing metadata or sets it anew (current_metadata.merge(&metadata) or metadata = Some(…)).
  • gRPC JSON values are converted into internal payloads via the api::conversions::json::proto_to_payloads helper (and the accompanying TryFrom/TryInto impls), ensuring any map<string,Value> round-trips correctly into the storage layer.

No further changes required.
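The merge-or-set flow described above can be pinned down with a tiny executable model. This is a sketch of the semantics only: `update_metadata` below is a hypothetical Python stand-in named after the Rust handler in `collection_ops.rs`, not the actual implementation, and it deliberately ignores per-key null deletion (covered separately in this review).

```python
from typing import Optional


def update_metadata(current: Optional[dict], patch: dict) -> dict:
    """Model of the described flow: merge the patch into existing
    metadata, or set it anew when the collection has none yet."""
    if current is None:
        return dict(patch)   # metadata = Some(...)
    merged = dict(current)   # current_metadata.merge(&metadata)
    merged.update(patch)
    return merged


assert update_metadata(None, {"owner": "ml"}) == {"owner": "ml"}
assert update_metadata({"owner": "ml"}, {"env": "prod"}) == {"owner": "ml", "env": "prod"}
```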

docs/grpc/docs.md (2)

512-513: LGTM: CollectionConfig.metadata field is correctly documented

The field type and link to Value are accurate, and the description is clear. No issues.


729-730: LGTM: CreateCollection.metadata field is correctly documented

Matches the proto and mirrors CollectionConfig usage. No issues.

lib/collection/src/tests/fixtures.rs (1)

53-54: LGTM: fixtures now initialize the new metadata field explicitly

Setting metadata: None keeps test fixtures aligned with the new CollectionConfigInternal shape and avoids accidental default differences. No further action needed.

lib/collection/src/collection/state_management.rs (1)

138-139: Correctly treat metadata changes as config updates without touching optimizers

The destructuring pattern forces compile-time coverage for new fields, and adding is_metadata_updated into is_config_updated ensures metadata-only updates are persisted while avoiding unnecessary optimizer recreation via is_core_config_updated. This matches expected semantics for user-provided collection metadata.

Also applies to: 146-147, 151-155

lib/collection/tests/integration/multi_vec_test.rs (1)

66-67: LGTM: test config updated for new field

Explicit metadata: None keeps the integration test’s CollectionConfigInternal literal in sync with the struct definition. Good to go.

lib/collection/src/shards/replica_set/update.rs (1)

772-773: LGTM: config literal includes metadata

The test helper config now includes metadata: None, consistent with the new internal config schema. No issues spotted.

src/consensus.rs (1)

1541-1542: LGTM: consensus test constructs CreateCollection with metadata

Adding metadata: None to CreateCollection keeps the test aligned with the extended API surface. This also documents the field’s optionality in a core-path test.

lib/collection/benches/batch_search_bench.rs (1)

89-90: Metadata field initialization is correct

Setting metadata: None keeps the bench aligned with the new CollectionConfigInternal shape without changing behavior. LGTM.

lib/storage/tests/integration/alias_tests.rs (1)

124-126: Test update matches API surface

Adding metadata: None keeps the integration test in sync with the new CreateCollection shape. No behavior change. LGTM.

lib/collection/benches/batch_query_bench.rs (1)

71-72: Metadata set to None — consistent with config evolution

Bench remains behaviorally identical while tracking the new field. Looks good.

tests/openapi/test_collection_metadata.py (1)

7-14: Good autouse fixture with reliable teardown

Collection setup/teardown via yield looks correct and safe across failures.

lib/collection/src/collection/collection_ops.rs (1)

7-7: Importing Payload here is correct

Required for new metadata update flow. No issues.

lib/collection/src/config.rs (1)

14-17: Payload import is necessary for metadata field

Looks good.

lib/storage/src/content_manager/toc/create_collection.rs (2)

51-52: Plumbing metadata through CreateCollection operation

Destructuring includes metadata — expected.


214-223: Persisting initial metadata on collection creation – mapping verified

Storing the provided metadata into CollectionConfigInternal is correct, and the public API surfaces it back to clients as expected:

  • In lib/collection/src/operations/types.rs (around line 184), the public CollectionConfig struct declares
    pub metadata: Option<Payload> with the appropriate serde annotations.
  • In the same file (around lines 196–199), the From<CollectionConfigInternal> conversion destructures and includes metadata when building CollectionConfig.
  • In lib/collection/src/operations/conversions.rs (sections starting at line 396 and again at 2040), metadata is correctly mapped to and from the gRPC/Proto representations via payload_to_proto and proto_to_payloads.

All mappings ensure that any metadata provided at collection creation flows through the internal config into the public-facing types and across the wire. No further changes are needed.

lib/storage/src/content_manager/toc/collection_meta_ops.rs (1)

131-140: Including metadata in UpdateCollection struct

Destructuring the new optional field is correct.

lib/collection/src/telemetry.rs (2)

89-93: Telemetry: metadata field is correctly optional and anonymized

Using Option&lt;Payload&gt; with serde(default, skip_serializing_if) avoids noisy output, and #[anonymize(value = None)] prevents leaking app metadata in telemetry. Good call.


105-116: Plumbing metadata through telemetry From<...> is complete

Forwarding metadata from CollectionConfigInternal into CollectionConfigTelemetry ensures consistency with public API responses. No gaps spotted.

lib/collection/src/operations/conversions.rs (1)

398-399: Expose metadata in gRPC CollectionInfo.config; defaulting to empty map is correct

  • Mapping Option&lt;Payload&gt; to map&lt;string, Value&gt; via payload_to_proto and unwrap_or_default() is the right shape for gRPC (no null vs empty distinction).
  • This keeps GetCollectionInfo responses backward compatible while surfacing metadata when present.

No action needed.

Also applies to: 530-533
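Because protobuf map fields cannot represent null, the None-vs-empty distinction collapses on the wire; an empty map must be read back as None for the round trip to stay consistent. The Python helpers below model this convention only — the names merely echo `payload_to_proto` / `proto_to_payloads` and are not the real conversion functions.

```python
from typing import Optional


def metadata_to_proto_map(metadata: Optional[dict]) -> dict:
    """Option<Payload> -> map<string, Value>: None becomes an empty map,
    since protobuf maps have no null state."""
    return dict(metadata) if metadata is not None else {}


def proto_map_to_metadata(proto_map: dict) -> Optional[dict]:
    """map<string, Value> -> Option<Payload>: an empty map is read back
    as None, keeping the round trip consistent."""
    return dict(proto_map) if proto_map else None


# None and {} are indistinguishable after a round trip: both come back as None.
assert proto_map_to_metadata(metadata_to_proto_map(None)) is None
assert proto_map_to_metadata(metadata_to_proto_map({})) is None
assert proto_map_to_metadata(metadata_to_proto_map({"owner": "ml"})) == {"owner": "ml"}
```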

lib/api/src/grpc/proto/collections.proto (1)

6-7: Verify Protobuf Include Paths for json_with_int.proto

Confirmed that json_with_int.proto exists and is imported where needed:

  • File: lib/api/src/grpc/proto/json_with_int.proto
  • Referenced in:
    lib/api/src/grpc/proto/points.proto (line 8)
    lib/api/src/grpc/proto/collections.proto (line 6)

Please ensure that every protoc/BUILD invocation or Gradle/Maven source-set for your gRPC targets includes the lib/api/src/grpc/proto directory (or otherwise references json_with_int.proto) so that map<string, Value> fields compile correctly across all languages and targets.

lib/storage/src/content_manager/conversions.rs (1)

113-118: CreateCollection: empty-map-as-None is intentional and consistent

Treating an empty metadata map as None keeps REST/gRPC behavior aligned and avoids accidental full clears. Good.

lib/storage/src/content_manager/collection_meta_ops.rs (4)

288-291: new_empty defaults: LGTM

Initializing metadata: None ensures PATCH requests don’t accidentally wipe metadata. Good.


467-470: Passing metadata through the CreateCollection builder: LGTM

Including metadata in the constructed CreateCollection preserves config fidelity. No issues spotted.


439-441: Metadata propagation verified across conversion implementations

I’ve checked all relevant From/TryFrom implementations that transform between CollectionConfigInternal and the various create/update operation types:

  • lib/storage/src/content_manager/collection_meta_ops.rsimpl From<CollectionConfigInternal> for CreateCollection (lines 429–431) includes metadata.
  • lib/storage/src/content_manager/conversions.rs – both the From<CreateCollectionOperation> and TryFrom<UpdateCollectionOperation> blocks destructure and handle metadata (lines 83–86 and 179–182).
  • lib/storage/src/content_manager/toc/create_collection.rs – operation destructuring (lines 49–52 and 220–223) correctly captures metadata.
  • All other collection and telemetry conversions (e.g., in operations/conversions.rs, collection/src/operations/types.rs, and related modules) also destructure and propagate metadata consistently.

No instances were found where metadata was omitted. Everything is in parity, so no further changes are needed.


18-20: ✅ Confirmed: Payload Implements JsonSchema

I verified that the segment::types::Payload type derives JsonSchema, so it does indeed implement the trait required for OpenAPI schema generation:

  • In segment/src/types.rs:
    #[derive(Clone, Debug, Eq, PartialEq, Deserialize, Serialize, JsonSchema, Hash)]
    #[schemars(example = "payload_example")]
    pub struct Payload(pub Map<String, Value>);
    (confirmed via ripgrep)

Since CreateCollection (in collection_meta_ops.rs) derives JsonSchema, and any reference to Payload in this module will rely on that implementation, no additional schema override is needed.

Comment on lines +10116 to +10125
"metadata": {
"description": "Metadata to update for the collection. If provided, this will merge with existing metadata. To remove metadata, set it to an empty object.",
"anyOf": [
{
"$ref": "#/components/schemas/Payload"
},
{
"nullable": true
}
]

💡 Verification agent

🧩 Analysis chain

Clarify PATCH semantics: merge vs replace vs delete are ambiguous.

The description says “merge with existing metadata” and “to remove metadata, set it to an empty object.” If the server performs a shallow merge (RFC 7386-like), sending {} won’t remove existing keys; it’s a no-op. If the server replaces, then merge wording is misleading.

Please explicitly specify:

  • whether it’s a shallow object merge (unspecified keys left intact),
  • how to replace the entire object,
  • how to delete the metadata completely.

Proposed wording below assumes shallow merge + explicit null to clear:

-          "metadata": {
-            "description": "Metadata to update for the collection. If provided, this will merge with existing metadata. To remove metadata, set it to an empty object.",
+          "metadata": {
+            "description": "Metadata to update for the collection. If provided, keys are shallow-merged into the existing metadata: provided keys overwrite existing ones; unspecified keys remain unchanged. To replace the entire metadata object, send the full object you want stored. To clear all metadata, set this field to null.",
             "anyOf": [
               { "$ref": "#/components/schemas/Payload" },
               { "nullable": true }
             ]
           }

If the backend behavior differs (e.g., replace-on-PATCH), adjust the text accordingly and consider adding an explicit x-notes or example:

           "metadata": {
             "description": "...",
             "anyOf": [ { "$ref": "#/components/schemas/Payload" }, { "nullable": true } ],
+            "examples": [
+              { "summary": "Merge keys", "value": { "ingestion_version": "v2" } },
+              { "summary": "Clear all metadata", "value": null },
+              { "summary": "Replace entire object", "value": { "owner": "ml" } }
+            ]
           }

Would you like me to add an OpenAPI test/example request under the PATCH endpoint showing these three cases?


I’ll verify how the server handles metadata in PATCH by inspecting the relevant Rust code:


🏁 Script executed:

#!/bin/bash
# Locate the PatchCollection handler to see how metadata fields are applied
rg -nP --type=rs "PatchCollection" -g "lib/storage/src" -C3

# Inspect the merging logic in conversions.rs where null vs empty object is handled
sed -n '180,240p' lib/storage/src/content_manager/conversions.rs

Length of output: 2280


To confirm the exact metadata‐handling semantics, let’s inspect the top of conversions.rs (where PatchCollection is converted) and how the raw gRPC metadata field is flattened:


🏁 Script executed:

#!/bin/bash
# 1) Show the file header & beginning of conversions.rs to find how `metadata` is mapped
sed -n '1,200p' lib/storage/src/content_manager/conversions.rs

# 2) Search specifically for any mapping from the gRPC Struct to an internal map
rg -nP '\.metadata' -n lib/storage/src/content_manager/conversions.rs

Length of output: 9160


Clarify collection metadata PATCH semantics

The OpenAPI description for metadata is currently inaccurate:

  • By default, only the keys you include are shallow-merged into existing metadata; unspecified keys are left untouched.
  • Sending {} or null does not clear existing metadata—it’s treated as “no update.”
  • To delete individual metadata keys, set their values to null (e.g. { "table": null }).
  • There is no built-in way to wholesale replace or clear the entire metadata object in one shot.

Please update the schema and description accordingly. For example:

           "metadata": {
-            "description": "Metadata to update for the collection. If provided, this will merge with existing metadata. To remove metadata, set it to an empty object.",
+            "description": "Metadata update semantics: only provided keys are shallow-merged into existing metadata; unspecified keys remain unchanged. To delete a key, set its value to `null`; sending `{}` or `null` performs no update. There is no single-step clear-all operation.",
             "anyOf": [
               { "$ref": "#/components/schemas/Payload" },
               { "nullable": true }
             ],
+            "examples": [
+              {
+                "summary": "Merge keys",
+                "value": { "ingestion_version": "v2" }
+              },
+              {
+                "summary": "Delete a key",
+                "value": { "table": null }
+              },
+              {
+                "summary": "No update",
+                "value": {}
+              }
+            ]
           }

Would you like to add a note or example clarifying that “clear all” must be done by explicitly nulling each existing key?

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
"metadata": {
"description": "Metadata to update for the collection. If provided, this will merge with existing metadata. To remove metadata, set it to an empty object.",
"anyOf": [
{
"$ref": "#/components/schemas/Payload"
},
{
"nullable": true
}
]
"metadata": {
"description": "Metadata update semantics: only provided keys are shallow-merged into existing metadata; unspecified keys remain unchanged. To delete a key, set its value to `null`; sending `{}` or `null` performs no update. There is no single-step clear-all operation.",
"anyOf": [
{
"$ref": "#/components/schemas/Payload"
},
{
"nullable": true
}
],
"examples": [
{
"summary": "Merge keys",
"value": { "ingestion_version": "v2" }
},
{
"summary": "Delete a key",
"value": { "table": null }
},
{
"summary": "No update",
"value": {}
}
]
}
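The verified semantics — shallow merge of provided keys, per-key deletion via null, and `{}`/`null` as a no-op — can be captured in a short executable model. This Python sketch is illustrative only; it does not call Qdrant, and the function name is made up for the example.

```python
from typing import Optional


def patch_collection_metadata(existing: Optional[dict], patch: Optional[dict]) -> Optional[dict]:
    """Model of the PATCH semantics verified above: provided keys are
    shallow-merged, a null value deletes its key, {} or null is a no-op."""
    if not patch:  # None or {} -> no update
        return existing
    result = dict(existing or {})
    for key, value in patch.items():
        if value is None:
            result.pop(key, None)  # per-key deletion
        else:
            result[key] = value    # overwrite or insert
    return result


meta = {"owner": "ml", "table": "star_charts"}
assert patch_collection_metadata(meta, {}) == meta                          # no-op
assert patch_collection_metadata(meta, None) == meta                        # no-op
assert patch_collection_metadata(meta, {"table": None}) == {"owner": "ml"}  # key deleted
assert patch_collection_metadata(meta, {"owner": "qa"}) == {"owner": "qa", "table": "star_charts"}
```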

Comment on lines +113 to 118
            metadata: if metadata.is_empty() {
                None
            } else {
                Some(json::proto_to_payloads(metadata)?)
            },
        },

💡 Verification agent

❓ Verification inconclusive

Add tests to lock semantics: per-key null clears; empty map no-ops

To prevent regressions, add e2e tests for both REST and gRPC:

  • PATCH with {"metadata":{"k":null}} removes key k.
  • PATCH with {"metadata":{}} leaves metadata unchanged.

Also applies to: 202-206


🏁 Script executed:

#!/bin/bash
# Search existing tests for metadata update/clear semantics
rg -nP -C3 --type=rust --type=py '(PATCH|UpdateCollection).*metadata' tests lib | sed -n '1,200p'
rg -nP -C2 --type=py '"metadata"\s*:\s*\{\s*\}' tests
rg -nP -C2 --type=py '"metadata"\s*:\s*\{[^}]*null' tests

Length of output: 213


Lock in metadata patch semantics with end-to-end tests

To prevent regressions in both REST and gRPC APIs, we need to explicitly lock in the intended behavior when clients PATCH collection metadata:

  • PATCH with {"metadata":{"key":null}} must remove the key entry from the existing metadata map.
  • PATCH with {"metadata":{}} must be a no-op and leave all existing metadata unchanged.

Please add end-to-end tests covering both cases, targeting the conversion logic in:

  • lib/storage/src/content_manager/conversions.rs (around lines 113–118 and 202–206)

Include tests for:

  • REST: HTTP PATCH against /collections/{name} to verify per-key null clears and empty-map no-ops.
  • gRPC: UpdateCollection calls with metadata maps containing null values and empty maps, asserting identical behavior.

These tests are mandatory to ensure future changes to metadata handling cannot introduce breaking regressions.

🤖 Prompt for AI Agents
In lib/storage/src/content_manager/conversions.rs around lines 113–118 and
202–206, add end-to-end tests to lock in PATCH semantics for collection metadata:
create REST tests that send HTTP PATCH to /collections/{name} with body
{"metadata":{"key":null}} and assert the specific key is removed from the
existing metadata map, and a second REST test that sends {"metadata":{}} and
asserts existing metadata remains unchanged; likewise add gRPC tests that call
UpdateCollection with metadata maps containing null values (to verify per-key
deletion) and with empty maps (to verify no-op), and assert the resulting stored
metadata matches the REST behavior exactly. Ensure tests exercise the conversion
logic path in the specified file (covering both code locations) and include
setup/teardown so they run end-to-end against the server instance.

Comment on lines +115 to 121
 for (shard_key, shard_ids) in shards_key_mapping.iter() {
     let mut placement = Vec::new();

-    for shard_id in shards {
-        let shard_info = collection_state.shards.get(shard_id).unwrap();
+    for shard_id in shard_ids {
+        let shard_info = shards.get(shard_id).unwrap();
         placement.push(shard_info.replicas.keys().copied().collect());
     }

🛠️ Refactor suggestion

Avoid unwrap on shards.get(shard_id) to prevent migration-time panics

Unwrap will panic if a shard_key_mapping references a shard absent from shards (possible after prior manual edits or partial failures). Handle gracefully and log.

Apply this safe access:

-                    for shard_id in shard_ids {
-                        let shard_info = shards.get(shard_id).unwrap();
-                        placement.push(shard_info.replicas.keys().copied().collect());
-                    }
+                    for shard_id in shard_ids {
+                        if let Some(shard_info) = shards.get(shard_id) {
+                            placement.push(shard_info.replicas.keys().copied().collect());
+                        } else {
+                            tracing::warn!(?shard_id, "Shard id missing in shards map during migration; skipping placement entry");
+                            continue;
+                        }
+                    }
📝 Committable suggestion


Suggested change
        for (shard_key, shard_ids) in shards_key_mapping.iter() {
            let mut placement = Vec::new();

            for shard_id in shard_ids {
                let shard_info = shards.get(shard_id).unwrap();
                placement.push(shard_info.replicas.keys().copied().collect());
            }
        for (shard_key, shard_ids) in shards_key_mapping.iter() {
            let mut placement = Vec::new();

            for shard_id in shard_ids {
                if let Some(shard_info) = shards.get(shard_id) {
                    placement.push(shard_info.replicas.keys().copied().collect());
                } else {
                    tracing::warn!(
                        ?shard_id,
                        "Shard id missing in shards map during migration; skipping placement entry"
                    );
                    continue;
                }
            }
            // ...rest of the migration logic...
        }
🤖 Prompt for AI Agents
In src/migrations/single_to_cluster.rs around lines 115-121, avoid using
unwrap() on shards.get(shard_id) — replace it with a safe lookup (if let
Some(...) / match) that logs a warning including the shard_id and shard_key when
the shard is missing and then continues (or skips adding placement for that
shard_id) instead of panicking; use the existing logging facility (e.g., warn!
or process logger) to record context so the migration can proceed safely without
crashing.

@timvisee timvisee merged commit e744bd6 into dev Aug 26, 2025
16 checks passed
@timvisee timvisee deleted the collection-level-metadata branch August 26, 2025 09:49
timvisee added a commit that referenced this pull request Aug 26, 2025
* implement custom collection metadata

* persist metadata change

* Also unset a key in the test

---------

Co-authored-by: timvisee <tim@visee.me>
@coderabbitai coderabbitai bot mentioned this pull request Nov 11, 2025
@timvisee timvisee mentioned this pull request Nov 14, 2025