Measure hardware IO for update operations by JojiiOfficial · Pull Request #5922 · qdrant/qdrant

JojiiOfficial · 2025-01-31T10:28:11Z

Depends on #5912

Adds measurements for all internal update functions, traits and reporting in rest+grpc API.
This doesn't include vector storage yet, since measurements for this will be done in the next PR.
Payload measurements are already fully implemented and are now available in the API too.

generall · 2025-02-02T22:05:50Z

Please include an integration test for this changes

lib/collection/src/shards/remote_shard.rs

timvisee · 2025-02-07T12:41:32Z

src/actix/helpers.rs

    dispatcher: &Dispatcher,
    collection_name: String,
    report_to_api: bool,
+    wait: Option<bool>, // Whether the client is awaiting the operation or `None` if function always waits.


I find this parameter a bit confusing. Do we need it if we also have report_to_api?

I'm confused by it because it doesn't seem to match the wait parameters we have in our API requests:

wait parameter: wait in API report in this function

None false true

Some(false) false false

Some(true) true true

On one hand, I agree that this parameter is confusing and probably an abstraction leak (hardware counter should not care if the request is wait or not).
On the other hand, if we decide that the caller should provide correct report_to_api value, which accounts for service_config.hardware_reporting() && params.wait it might be easy to forget in one of many places to do correctly

~~I will push the second version~~, but I am on the fence with this change

I agree with this parameter being quite confusing but I think we shouldn't make the caller decide the logic because, as @generall already said, it's very easy to make a mistake here.
An alternative could be to make this an enum with variants like Blocking/NonBlocking to make it more clear and report HW if it's not Some(BlockingMode::NonBlocking). And maybe even don't make the parameter an option but always state the type of operation.

For context, the reason we need this is because if we don't wait, we get empty or wrong hardware metrics in the API response which confuses users and may make implementation errors because we assume those values to be correct.

src/actix/api/count_api.rs

src/common/metrics.rs

timvisee · 2025-02-07T14:02:37Z

src/tonic/api/points_api.rs

        );

-        RequestHwCounter::new(counter, self.service_config.hardware_reporting())
+        let waiting = wait != Some(false);


Maybe rename this to report_to_api?

tests/consensus_tests/test_hw_measurement.py

generall · 2025-02-13T17:11:12Z

It seems that payload operations created during resharding are still accounted

ffuugoo · 2025-02-18T11:08:54Z

I've rebased this on dev, cause I've just merged #5977 and it introduced a small conflict.

coderabbitai · 2025-03-03T10:16:05Z

📝 Walkthrough

Walkthrough

The updates integrate hardware measurement tracking throughout the system. A new hardware usage field is added to various response models and Protocol Buffers messages, ensuring that telemetry data is captured during operations. Conversion implementations, API service functions, and internal update methods now pass a hardware measurement accumulator parameter, which standardizes the collection of hardware usage metrics across different workflows. Changes are applied across documentation, GRPC, Actix, and Tonic interfaces, with method signatures updated to include additional hardware measurement arguments. The modifications also extend to shard operations, replica set functions, and content management processes, thereby embedding hardware measurement handling into both local and remote operations. Additionally, new test functions and assertions have been incorporated into integration and consensus tests to capture, verify, and compare hardware telemetry, and metric types have been updated to counters to better reflect cumulative hardware activity.

✨ Finishing Touches

📝 Generate Docstrings

🪧 Tips

Chat

There are 3 ways to chat with CodeRabbit:

Review comments: Directly reply to a review comment made by CodeRabbit. Example:
- I pushed a fix in commit <commit_id>, please review it.
- Generate unit testing code for this file.
- Open a follow-up GitHub issue for this discussion.
Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query. Examples:
- @coderabbitai generate unit testing code for this file.
- @coderabbitai modularize this function.
PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
- @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
- @coderabbitai read src/utils.ts and generate unit testing code.
- @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.
- @coderabbitai help me debug CodeRabbit configuration file.

Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments.

CodeRabbit Commands (Invoked using PR comments)

@coderabbitai pause to pause the reviews on a PR.
@coderabbitai resume to resume the paused reviews.
@coderabbitai review to trigger an incremental review. This is useful when automatic reviews are disabled for the repository.
@coderabbitai full review to do a full review from scratch and review all the files again.
@coderabbitai summary to regenerate the summary of the PR.
@coderabbitai generate docstrings to generate docstrings for this PR.
@coderabbitai resolve resolve all the CodeRabbit review comments.
@coderabbitai configuration to show the current CodeRabbit configuration for the repository.
@coderabbitai help to get help.

Other keywords and placeholders

Add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.
Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
Add @coderabbitai anywhere in the PR title to generate the title automatically.

CodeRabbit Configuration File (`.coderabbit.yaml`)

You can programmatically configure CodeRabbit by adding a .coderabbit.yaml file to the root of your repository.
Please see the configuration documentation for more information.
If your editor has YAML language server enabled, you can add the path at the top of this file to enable auto-completion and validation: # yaml-language-server: $schema=https://coderabbit.ai/integrations/schema.v2.json

Documentation and Community

Visit our Documentation for detailed information on how to use CodeRabbit.
Join our Discord Community to get help, request features, and share feedback.
Follow us on X/Twitter for updates and announcements.

coderabbitai

Actionable comments posted: 1

🧹 Nitpick comments (20)

lib/api/src/grpc/qdrant.rs (1)
9183-9184: Document the purpose of the unmeasured flag.

The new boolean flag unmeasured has been added, but its purpose and usage conditions aren't clear from the code. Consider adding a code comment explaining when and why this flag would be set to true.
    #[prost(message, optional, tag = "3")]
    pub clock_tag: ::core::option::Option<ClockTag>,
+   /// Flag indicating if hardware measurements were skipped for this operation
    #[prost(bool, optional, tag = "4")]
    pub unmeasured: ::core::option::Option<bool>,
docs/grpc/docs.md (1)
3502-3502: Clarify the usage Field Description

The new optional usage field is correctly added to the PointsOperationResponse documentation, in line with the hardware usage measurement update. However, the description column is currently empty. It would be beneficial to provide a brief explanation of what hardware metrics are tracked (e.g., CPU usage, payload I/O read/write, vector I/O read/write) so users understand its purpose.

For example, consider changing the row diff as follows:
-| usage | [HardwareUsage](#qdrant-HardwareUsage) | optional |  |
+| usage | [HardwareUsage](#qdrant-HardwareUsage) | optional | Aggregated hardware usage metrics (CPU, payload_io_read, payload_io_write, vector_io_read, vector_io_write). |
lib/collection/src/collection/payload_index_schema.rs (1)
86-92: Consider preserving hardware measurements for index operations

While using HwMeasurementAcc::disposable() is technically correct, index operations can be computationally expensive. It might be valuable to track these in the future to give users insight into resource consumption during index management.

Consider passing through an actual measurement accumulator from higher-level methods rather than using a disposable one:
-            .update_all_local(
-                delete_index_operation,
-                false,
-                HwMeasurementAcc::disposable(), // Unmeasured API
-            )
+            .update_all_local(
+                delete_index_operation,
+                false,
+                hw_measurement_acc, // Propagate measurements from caller
+            )
And update the method signature to accept this parameter:
-    pub async fn drop_payload_index(
-        &self,
-        field_name: JsonPath,
-    ) -> CollectionResult<Option<UpdateResult>> {
+    pub async fn drop_payload_index(
+        &self,
+        field_name: JsonPath,
+        hw_measurement_acc: HwMeasurementAcc,
+    ) -> CollectionResult<Option<UpdateResult>> {
src/actix/api/search_api.rs (1)

1-339: Verify integration test for hardware reporting

The PR comment mentioned that an integration test is needed. While the code changes look good, I'd recommend adding at least one test that verifies the hardware usage is properly reported in the API response.

Would you like me to suggest an integration test implementation to verify that hardware usage metrics are properly included in API responses?
lib/collection/src/shards/forward_proxy_shard.rs (1)
314-325: Consider updating this block to use update_remote_unmeasured for consistency.

While the current operation at line 186 has been updated to use update_remote_unmeasured, this similar block still uses the standard update method. For consistency and to prevent redundant hardware measurements for internal operations, consider updating this to use update_remote_unmeasured as well.
if let Some(operation) = forward_operation {
    let remote_result = self
        .remote_shard
-       .update(operation, false, hw_measurement_acc)
+       .update_remote_unmeasured(operation, false, hw_measurement_acc)
        .await
        .map_err(|err| {
            CollectionError::forward_proxy_error(self.remote_shard.peer_id, err)
        })?;

    // Merge `result` and `remote_result`:
    //
    // - Pick `clock_tag` with *newer* `clock_tick`
    let tick = result.clock_tag.map(|tag| tag.clock_tick);
    let remote_tick = remote_result.clock_tag.map(|tag| tag.clock_tick);
lib/collection/tests/integration/collection_test.rs (1)
252-255: Consider reusing the hardware counter.

A new hardware counter is being created for each operation. Consider reusing the existing counter if you want to accumulate measurements across multiple operations.
- let hw_counter = HwMeasurementAcc::new();
  collection
-     .update_from_client_simple(assign_payload, true, WriteOrdering::default(), hw_counter)
+     .update_from_client_simple(assign_payload, true, WriteOrdering::default(), hw_counter.clone())
lib/collection/src/shards/queue_proxy_shard.rs (1)
505-506: Consider measuring hardware usage during transfer operations

Currently, a disposable hardware measurement accumulator is used for transfer operations, indicated as "Internal operation". Since the PR aims to add measurements for update operations, you might want to consider whether these transfer operations should also be measured and included in the metrics rather than using disposable counters.

If these operations can be significant in terms of hardware usage, tracking them could provide more comprehensive resource utilization insights.
-    let disposed_hw = HwMeasurementAcc::disposable(); // Internal operation
+    // Use a real accumulator to track transfer operation resource usage
+    let transfer_hw = HwMeasurementAcc::new(); // Track transfer operations
src/tonic/api/points_api.rs (1)
55-56: Consider renaming the variable for clarity.

The variable naming could be improved to better distinguish between the parameter and its processed value.
- let waiting = wait != Some(false);
+ let should_wait = wait != Some(false);
- RequestHwCounter::new(counter, self.service_config.hardware_reporting() && waiting)
+ RequestHwCounter::new(counter, self.service_config.hardware_reporting() && should_wait)
tests/consensus_tests/test_hw_measurement.py (6)
28-28: Add tests for vector writes when implemented.

The TODO comment indicates the need for tests for vector writes. This aligns with the PR description mentioning that vector storage measurements will be included in a future PR.

Would you like assistance in preparing a placeholder structure for these vector write tests to make them easier to implement in the future PR?

40-40: Consider adjusting assertion to ensure minimum expected change.

The current assertion only checks if the payload I/O has increased by any amount (>=), but it would be better to assert that the increase meets the minimum expected value.
-assert abs(peer_hw["payload_io_write"] - peer_hw_infos[peer_idx]["payload_io_write"]) >= expected_delta
+assert_with_upper_bound_error(
+    abs(peer_hw["payload_io_write"] - peer_hw_infos[peer_idx]["payload_io_write"]),
+    expected_delta
+)
63-63: Verify equality with error margin for hardware metrics.

The strict equality check may be too rigid for hardware metrics that can have small variations. Consider using the assert_with_upper_bound_error function you defined for more flexibility.
-assert update_hw_data['payload_io_write'] == total_payload_io_write - total_payload_io_write_old
+delta = total_payload_io_write - total_payload_io_write_old
+assert_with_upper_bound_error(update_hw_data['payload_io_write'], delta)
92-101: Use raised exceptions instead of assert False.

Using assert False can be problematic because assertions can be disabled with the Python -O flag. Instead, raise an explicit AssertionError.
def assert_with_upper_bound_error(inp: int, min_value: int, upper_bound_error_percent: float = 0.05):
    """Asserts `inp` being equal to `min_value` with a max upperbound error given in percent."""
    if inp < min_value:
-        assert False, f"Assertion {inp} >= {min_value} failed"
+        raise AssertionError(f"Assertion {inp} >= {min_value} failed")

    max_error = ceil(float(min_value) + float(min_value) * upper_bound_error_percent)
    upper_bound = inp + max_error

    if inp > upper_bound:
-        assert False, f"Assertion {inp} being below upperbound error of {upper_bound_error_percent}(={upper_bound}) failed."
+        raise AssertionError(f"Assertion {inp} being below upperbound error of {upper_bound_error_percent}(={upper_bound}) failed.")
🧰 Tools

🪛 Ruff (0.8.2)

95-95: Do not assert False (python -O removes these calls), raise AssertionError()

Replace assert False

(B011)

101-101: Do not assert False (python -O removes these calls), raise AssertionError()

Replace assert False

(B011)

97-98: Simplify the upper bound calculation.

The current calculation for the upper bound is a bit confusing. A simpler approach would make the code more readable.
-    max_error = ceil(float(min_value) + float(min_value) * upper_bound_error_percent)
-    upper_bound = inp + max_error
+    # Calculate maximum allowed value based on the minimum value plus error margin
+    upper_bound = ceil(float(min_value) * (1 + upper_bound_error_percent))
88-88: Increase readability with a separate assertion for each check.

The test is combining two assertions (minimum payload write and upper bound) in one line. Consider separating them for clarity.
-assert_with_upper_bound_error(peer_hw["payload_io_write"], upsert_vectors * 5)  # 50 vectors on this node with payload of ~5 bytes
+# Assert minimum payload I/O write (50 vectors on this node with payload of ~5 bytes)
+min_expected_payload = upsert_vectors * 5
+assert_with_upper_bound_error(peer_hw["payload_io_write"], min_expected_payload)
lib/collection/src/shards/remote_shard.rs (1)

245-245: Reduce argument list if possible.

You have annotated this function to allow many arguments. While this works, consider grouping related parameters into a struct for better readability and maintainability.
lib/collection/src/shards/replica_set/update.rs (1)
197-204: Consider adding a descriptive comment about hardware measurement cloning

The code clones the hw_measurement_acc when passing it to update_impl. While this is correct behavior to maintain separate tracking for each attempt, it would be helpful to add a comment explaining why cloning is necessary here.
-                hw_measurement_acc.clone(),
+                // Clone the accumulator to maintain separate tracking for each retry attempt
+                hw_measurement_acc.clone(),
src/common/update.rs (3)
230-231: Consider adding doc comments explaining the purpose of clippy attributes

The #[expect(clippy::too_many_arguments)] attribute is used correctly, but it would be helpful to add a brief comment explaining why this exception is necessary.
-#[expect(clippy::too_many_arguments)]
+// This function has many arguments due to its central role in coordinating
+// point operations with various configurations and measurement capabilities
+#[expect(clippy::too_many_arguments)]
Also applies to: 272-273

379-380: Consider using a separate variable for cloned hardware counter

To improve code readability, consider storing the cloned hardware measurement in a variable before passing it to the update function.
-                hw_measurement_acc.clone(),
+                let hw_acc_cloned = hw_measurement_acc.clone();
+                hw_acc_cloned,
818-828: Consider refactoring to reduce parameter count

While the #[expect(clippy::too_many_arguments)] attribute is in place, consider refactoring to use a parameter object or struct to group related parameters, which would improve maintainability.
-pub async fn update(
-    toc: &TableOfContent,
-    collection_name: &str,
-    operation: CollectionUpdateOperations,
-    internal_params: InternalUpdateParams,
-    params: UpdateParams,
-    shard_key: Option<ShardKeySelector>,
-    access: Access,
-    hw_measurement_acc: HwMeasurementAcc,
-) -> Result<UpdateResult, StorageError> {
+pub struct UpdateContext<'a> {
+    pub toc: &'a TableOfContent,
+    pub collection_name: &'a str,
+    pub operation: CollectionUpdateOperations,
+    pub internal_params: InternalUpdateParams,
+    pub params: UpdateParams,
+    pub shard_key: Option<ShardKeySelector>,
+    pub access: Access,
+    pub hw_measurement_acc: HwMeasurementAcc,
+}
+
+pub async fn update(
+    ctx: UpdateContext<'_>,
+) -> Result<UpdateResult, StorageError> {
src/tonic/api/points_internal_api.rs (1)
258-262: Consider adding a comment explaining the purpose of hardware measurement

For clarity, consider adding a comment explaining why hardware measurement is being tracked and how it's used in the response.
+        // Create a hardware measurement accumulator for this collection to track resource usage
+        // during the update operation. This will be included in the response for monitoring.
         let hw_metrics = self.get_request_collection_hw_usage_counter_for_internal(
             update_point_vectors.collection_name.clone(),
         );

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 6fd3bba and aaa2a78.

📒 Files selected for processing (47)

docs/grpc/docs.md (1 hunks)
lib/api/src/grpc/conversions.rs (1 hunks)
lib/api/src/grpc/proto/points.proto (1 hunks)
lib/api/src/grpc/proto/points_internal_service.proto (3 hunks)
lib/api/src/grpc/qdrant.rs (4 hunks)
lib/collection/src/collection/clean.rs (1 hunks)
lib/collection/src/collection/payload_index_schema.rs (3 hunks)
lib/collection/src/collection/point_ops.rs (9 hunks)
lib/collection/src/collection/sharding_keys.rs (3 hunks)
lib/collection/src/shards/conversions.rs (2 hunks)
lib/collection/src/shards/forward_proxy_shard.rs (1 hunks)
lib/collection/src/shards/local_shard/shard_ops.rs (2 hunks)
lib/collection/src/shards/queue_proxy_shard.rs (3 hunks)
lib/collection/src/shards/remote_shard.rs (9 hunks)
lib/collection/src/shards/replica_set/mod.rs (1 hunks)
lib/collection/src/shards/replica_set/update.rs (8 hunks)
lib/collection/src/tests/points_dedup.rs (1 hunks)
lib/collection/src/update_handler.rs (4 hunks)
lib/collection/tests/integration/collection_restore_test.rs (3 hunks)
lib/collection/tests/integration/collection_test.rs (9 hunks)
lib/collection/tests/integration/distance_matrix_test.rs (1 hunks)
lib/collection/tests/integration/grouping_test.rs (4 hunks)
lib/collection/tests/integration/lookup_test.rs (1 hunks)
lib/collection/tests/integration/multi_vec_test.rs (1 hunks)
lib/collection/tests/integration/pagination_test.rs (1 hunks)
lib/collection/tests/integration/snapshot_recovery_test.rs (1 hunks)
lib/storage/src/content_manager/data_transfer.rs (2 hunks)
lib/storage/src/content_manager/toc/point_ops.rs (6 hunks)
lib/storage/src/content_manager/toc/request_hw_counter.rs (1 hunks)
src/actix/api/count_api.rs (1 hunks)
src/actix/api/discovery_api.rs (2 hunks)
src/actix/api/facet_api.rs (1 hunks)
src/actix/api/local_shard_api.rs (3 hunks)
src/actix/api/query_api.rs (3 hunks)
src/actix/api/recommend_api.rs (3 hunks)
src/actix/api/retrieve_api.rs (3 hunks)
src/actix/api/search_api.rs (5 hunks)
src/actix/api/update_api.rs (16 hunks)
src/actix/helpers.rs (1 hunks)
src/common/metrics.rs (1 hunks)
src/common/update.rs (38 hunks)
src/tonic/api/points_api.rs (28 hunks)
src/tonic/api/points_internal_api.rs (9 hunks)
src/tonic/api/update_common.rs (35 hunks)
tests/consensus_tests/fixtures.py (2 hunks)
tests/consensus_tests/test_hw_measurement.py (1 hunks)
tests/consensus_tests/utils.py (1 hunks)

🧰 Additional context used

🪛 Ruff (0.8.2)

tests/consensus_tests/test_hw_measurement.py

5-5: from .utils import * used; unable to detect undefined names

(F403)

18-18: start_cluster may be undefined, or defined from star imports

(F405)

20-20: wait_collection_exists_and_active_on_all_peers may be undefined, or defined from star imports

(F405)

68-68: start_cluster may be undefined, or defined from star imports

(F405)

70-70: wait_collection_exists_and_active_on_all_peers may be undefined, or defined from star imports

(F405)

72-72: check_collection_points_count may be undefined, or defined from star imports

(F405)

80-80: wait_collection_points_count may be undefined, or defined from star imports

(F405)

95-95: Do not assert False (python -O removes these calls), raise AssertionError()

Replace assert False

(B011)

101-101: Do not assert False (python -O removes these calls), raise AssertionError()

Replace assert False

(B011)

⏰ Context from checks skipped due to timeout of 90000ms (13)

GitHub Check: test-snapshot-operations-s3-minio
GitHub Check: test-shard-snapshot-api-s3-minio
GitHub Check: test-low-resources
GitHub Check: test-consistency
GitHub Check: test-consensus-compose
GitHub Check: test (macos-latest)
GitHub Check: Basic TLS/HTTPS tests
GitHub Check: test (windows-latest)
GitHub Check: test-consensus
GitHub Check: test (ubuntu-latest)
GitHub Check: test
GitHub Check: test
GitHub Check: test

🔇 Additional comments (151)

lib/api/src/grpc/qdrant.rs (3)

5803-5804: Good addition of hardware usage tracking to the response structure.

The addition of an optional usage field of type HardwareUsage to the response structure aligns perfectly with the PR's objectives of implementing hardware IO measurements for update operations. This will provide valuable metrics for users monitoring resource consumption.

9304-9305: Consistent implementation of hardware usage tracking.

Good addition of the usage field here, maintaining consistency with other response structures. The field being optional ensures backward compatibility with existing clients, as noted in the comment below about remaining compatible with UpdateResult.

9794-9795: Hardware usage tracking is consistently implemented across APIs.

The hardware usage field has been added consistently to all relevant response structures, including this internal client response. This ensures uniformity in how resource consumption is reported throughout both public and internal APIs.

tests/consensus_tests/utils.py (1)

82-82: Hardware reporting enabled for consensus tests

This change enables hardware usage reporting in the testing environment, which aligns with the PR objective of adding measurements for all internal update operations.

src/actix/api/facet_api.rs (1)

57-57: Hardware tracking parameter properly added

The None parameter has been added to the get_request_hardware_counter function call, properly implementing hardware usage tracking for the facet API endpoint. This is consistent with the PR's goal of adding hardware measurement capabilities throughout the system.

src/actix/api/retrieve_api.rs (3)

103-103: Hardware tracking parameter properly added to get_point

The None parameter has been added to the get_request_hardware_counter function call, properly implementing hardware usage tracking for the point retrieval endpoint.

162-162: Hardware tracking parameter properly added to get_points

The None parameter has been added to the get_request_hardware_counter function call, properly implementing hardware usage tracking for the batch point retrieval endpoint.

223-223: Hardware tracking parameter properly added to scroll_points

The None parameter has been added to the get_request_hardware_counter function call, properly implementing hardware usage tracking for the scroll points endpoint.

src/actix/api/count_api.rs (1)

52-52: This field is correctly set to None for read APIs.

The hardware counter doesn't need the wait parameter for read operations like count_points, so adding None here is appropriate.

lib/collection/tests/integration/pagination_test.rs (1)

40-43: Hardware measurement correctly integrated into the test.

The hardware measurement accumulator is properly initialized and passed to the update operation. This change aligns with the PR objective to add measurements for all update functions.

lib/collection/src/collection/sharding_keys.rs (3)

3-3: Import added appropriately.

The import for the hardware measurement accumulator is properly added.

64-65: Address the TODO comment in a future PR.

The code correctly uses HwMeasurementAcc::disposable() but there's an outstanding TODO comment about propagating the value.

This hardware counter is created but not returned from the function. According to the PR objectives, you're tracking measurements for update operations, but the current implementation with disposable() means these measurements aren't being propagated outside of this function.

Consider whether these measurements should be captured and propagated in a subsequent PR as indicated by the TODO comment.

127-131: Hardware counter correctly passed to update_local.

The hardware measurement accumulator is correctly passed to the update_local method, enabling the tracking of hardware usage during index creation operations.

lib/api/src/grpc/proto/points.proto (1)

813-813: Hardware usage field properly added to the response message.

The optional HardwareUsage usage field is correctly added to the PointsOperationResponse message, enabling the API to report hardware metrics to clients.

This aligns perfectly with the PR objective to add reporting capabilities for hardware measurements in both REST and gRPC APIs.

lib/collection/tests/integration/snapshot_recovery_test.rs (3)

109-112: Hardware counter correctly added to update operation

The hardware measurement counter is properly initialized and passed to the update operation. This change aligns with the PR objective of adding measurements for update operations.

161-169: Consistent use of hardware measurement for search operations

The code correctly initializes and passes the HwMeasurementAcc instance to the search operation, maintaining consistency across the codebase for hardware measurement tracking.

173-181: Separate hardware counter used for recovery testing

Good practice to use a separate hardware counter instance for the recovered collection search, which allows for potential independent measurement of original vs recovered collection performance.

lib/collection/src/collection/payload_index_schema.rs (2)

4-4: Import for hardware measurement correctly added

The import is properly added to support the hardware measurement feature.

70-71: Explicit comment about unmeasured API usage

The code correctly uses HwMeasurementAcc::disposable() with a clear comment indicating that this is an unmeasured API. This is appropriate for internal operations that don't need to be tracked in user-facing measurement results.

lib/collection/tests/integration/distance_matrix_test.rs (2)

69-72: Hardware counter correctly integrated with update operation

The hardware measurement counter is properly initialized and passed to the update operation, consistent with the implementation in other tests.

75-87: Hardware counter correctly used for matrix search

Good to see the consistent use of hardware measurements for different operations, including the matrix search in this test.

lib/collection/tests/integration/multi_vec_test.rs (4)

110-113: Hardware counter correctly added to multi-vector update operation

The hardware measurement counter is properly initialized and passed to the update operation in this multi-vector context.

133-141: Consistent hardware measurement for vector search operations

The code correctly initializes and passes the hardware measurement accumulator to the search operation, maintaining consistency with other search implementations.

223-236: Hardware measurement correctly applied to point retrieval

The code properly integrates hardware measurement with the point retrieval operation, ensuring that all operations have consistent measurement capabilities.

248-263: Hardware measurement applied to recommendation operations

The recommendation functionality correctly includes hardware measurements, completing the integration across all major operation types.

src/actix/api/discovery_api.rs (2)

51-56: Consistent API parameter update for hardware reporting

This change adds the None parameter to get_request_hardware_counter function to align with hardware measurement tracking, supporting the new monitoring capability being added system-wide.

106-111: Consistent API parameter update for hardware reporting

This change adds the None parameter to get_request_hardware_counter function to align with hardware measurement tracking, supporting the new monitoring capability being added system-wide.
lib/collection/src/tests/points_dedup.rs (1)

135-135:
Details

❓ Verification inconclusive

Update to include hardware measurement in local update operations

The update_local method now accepts a hardware measurement accumulator, which aligns with the PR objective of tracking I/O operations for updates.

🏁 Script executed:
#!/bin/bash
# Verify the signature change for update_local across the codebase
rg -t rust "fn update_local.*HwMeasurementAcc" --no-filename -A 1
Length of output: 67
Review Verification: Confirm update_local Signature Change Consistency
The change to include a hardware measurement accumulator appears in the test file (lib/collection/src/tests/points_dedup.rs), as shown by the call:
.update_local(op, true, HwMeasurementAcc::new())
However, our automated search for function definitions (using rg -t rust "fn update_local.*HwMeasurementAcc") returned no results. This suggests that while the test call has been updated, it’s unclear whether all implementations of update_local have been updated accordingly.

Please manually verify that:

The signature of update_local in all relevant modules now includes the hardware measurement accumulator parameter.

All calls to update_local throughout the codebase align with this updated signature.

Thank you for checking these details to ensure consistency with the PR objective of tracking I/O operations.
lib/collection/tests/integration/collection_restore_test.rs (3)

47-50: Added hardware measurement tracking to update operations

Added hardware measurement tracking via HwMeasurementAcc as required by the updated update_from_client_simple signature. This change is consistent with the PR goal of measuring hardware I/O for update operations.

92-95: Added hardware measurement tracking to update operations

Added hardware measurement tracking via HwMeasurementAcc as required by the updated update_from_client_simple signature. This change is consistent with the PR goal of measuring hardware I/O for update operations.

173-176: Added hardware measurement tracking to update operations

Added hardware measurement tracking via HwMeasurementAcc as required by the updated update_from_client_simple signature. This change is consistent with the PR goal of measuring hardware I/O for update operations.

src/actix/api/search_api.rs (5)

59-64: Hardware reporting parameter added consistently

Added None parameter to get_request_hardware_counter to support hardware usage tracking while maintaining backward compatibility. This change is part of the system-wide implementation of hardware measurement for operations.

130-135: Hardware reporting parameter added consistently

Added None parameter to get_request_hardware_counter to support hardware usage tracking while maintaining backward compatibility. This change is part of the system-wide implementation of hardware measurement for operations.

196-201: Hardware reporting parameter added consistently

Added None parameter to get_request_hardware_counter to support hardware usage tracking while maintaining backward compatibility. This change is part of the system-wide implementation of hardware measurement for operations.

251-256: Hardware reporting parameter added consistently

Added None parameter to get_request_hardware_counter to support hardware usage tracking while maintaining backward compatibility. This change is part of the system-wide implementation of hardware measurement for operations.

307-312: Hardware reporting parameter added consistently

Added None parameter to get_request_hardware_counter to support hardware usage tracking while maintaining backward compatibility. This change is part of the system-wide implementation of hardware measurement for operations.

lib/collection/src/shards/replica_set/mod.rs (1)

948-956: Hardware measurement parameter correctly integrated in update_local call

The additional parameter hw_measurement_acc is now properly passed to the update_local method, ensuring hardware I/O metrics are collected during point deletion operations.

lib/collection/src/shards/conversions.rs (2)

65-65: Function signature updated to include unmeasured_on_remote parameter

This boolean parameter has been added to control whether hardware measurements should be tracked for remote operations, supporting the PR's goal of adding hardware measurement capabilities.

83-83: Properly setting unmeasured field based on parameter

The new parameter is correctly used to set the unmeasured field in the UpsertPointsInternal struct, allowing the caller to control whether hardware metrics should be collected for this operation when executed remotely.

src/actix/helpers.rs (1)

20-22: Conditional hardware reporting based on wait parameter

The addition of the wait parameter and the logic to set report_to_api = report_to_api && wait != Some(false) ensures hardware metrics are only reported when appropriate. This prevents reporting empty or incorrect hardware metrics in non-blocking API responses.

The parameter might seem confusing at first glance, but it serves an important purpose in ensuring accurate hardware metrics reporting. As discussed in previous reviews, making this an explicit parameter helps avoid implementation errors across the codebase.

lib/collection/tests/integration/lookup_test.rs (1)

72-75: Test updated to include hardware measurement parameter

The test now correctly creates and passes a HwMeasurementAcc instance to the update_from_client_simple method, aligning with the API changes. This ensures tests remain compatible with the enhanced hardware measurement tracking.
src/actix/api/local_shard_api.rs (3)

111-111: Added parameter for hardware monitoring configuration.

The addition of None as the last argument to get_request_hardware_counter maintains consistency with other API calls in this file. This is part of the systematic implementation of hardware monitoring across all endpoints.

185-185: Added parameter for hardware monitoring configuration.

The addition of None as the last argument to get_request_hardware_counter completes the implementation of hardware monitoring for all endpoint functions in this file.

53-53:
Details

✅ Verification successful

Added parameter for hardware monitoring configuration.

The addition of None as the last argument to get_request_hardware_counter aligns with the PR's objective to implement hardware IO measurements. This parameter likely controls waiting behavior for hardware measurements.

🏁 Script executed:
#!/bin/bash
# Verify the signature of get_request_hardware_counter to understand the parameter addition

# Find the function definition in the codebase
rg -n "fn get_request_hardware_counter" --type rust
Length of output: 114
Action: Approve the parameter addition for hardware monitoring

The addition of None as the final argument to the get_request_hardware_counter call in src/actix/api/local_shard_api.rs properly aligns with the PR’s objective. The search confirms that get_request_hardware_counter is defined in src/actix/helpers.rs and now accepts this optional parameter to adjust waiting behavior for hardware measurements.

Confirmed the function’s location: src/actix/helpers.rs

Verified that the passed None is consistent with the intended design for hardware IO measurement
lib/collection/src/collection/clean.rs (1)

336-337:
Details

✅ Verification successful

Proper implementation of hardware monitoring for internal operations.

The use of HwMeasurementAcc::disposable() for the update_local operation is appropriate since this is an internal cleaning operation where we don't need to track or report hardware usage metrics to clients.

🏁 Script executed:
#!/bin/bash
# Verify how disposable HwMeasurementAcc is used in similar internal operations

# Look for other instances of disposable hardware measurement accumulators
rg "HwMeasurementAcc::disposable" --type rust
Length of output: 3001
Hardware Measurement Metrics Usage Confirmed

The usage of HwMeasurementAcc::disposable() in the call to .update_local(delete_operation, last_batch, HwMeasurementAcc::disposable()) is consistent with similar internal operations across the codebase. As verified by its consistent usage in other modules (e.g., internal cleaning operations in lib/collection/src/collection/clean.rs and elsewhere), it's appropriate for operations where hardware measurement tracking isn’t needed.
lib/storage/src/content_manager/toc/request_hw_counter.rs (1)

14-14:
Details

✅ Verification successful

Added Clone derivation for better hardware counter management.

Adding #[derive(Clone)] to the RequestHwCounter struct is essential for propagating hardware measurements through the API chain. This allows the counter to be duplicated and passed to multiple operations without ownership conflicts, which is particularly important for accumulating hardware metrics across different components.

🏁 Script executed:
#!/bin/bash
# Check for instances where RequestHwCounter is cloned to understand the usage pattern

rg "request_hw_counter.*clone" --type rust
Length of output: 834
Clone Derivation Confirmed for RequestHwCounter

The addition of #[derive(Clone)] in lib/storage/src/content_manager/toc/request_hw_counter.rs is validated by its active usage in src/tonic/api/update_common.rs, where multiple calls to request_hw_counter.clone() are observed. This confirms that the change is essential for correctly propagating hardware measurements through the API chain. No further modifications are required.
lib/collection/src/shards/forward_proxy_shard.rs (1)

186-186:
Details

✅ Verification successful

Updated to use the specialized unmeasured update method.

The change from update to update_remote_unmeasured is appropriate for internal shard transfer operations. This specialized method likely avoids redundant hardware measurements for internal operations while maintaining the functionality of the update.

🏁 Script executed:
#!/bin/bash
# Verify the implementation of update_remote_unmeasured to understand its behavior

# Look for the definition of update_remote_unmeasured
rg "fn update_remote_unmeasured" --type rust

# Check if there are other internal operations using this method
rg "\.update_remote_unmeasured" --type rust
Length of output: 265
Shard Transfer Method – Confirmed Correct Usage

The specialized method update_remote_unmeasured is implemented in lib/collection/src/shards/remote_shard.rs and is correctly invoked within lib/collection/src/shards/forward_proxy_shard.rs for internal shard transfer operations. This change avoids redundant hardware measurements and preserves the intended update functionality.
src/actix/api/query_api.rs (3)

119-124: LGTM: Consistent parameter addition.

The None parameter addition matches the updated function signature, maintaining consistency with the implementation in the query_points endpoint.

199-204: LGTM: Consistent parameter addition.

The None parameter addition matches the updated function signature, maintaining consistency with other query endpoints in this file.

52-57:
Details

✅ Verification successful

Confirm hardware reporting parameter usage is consistent.

The None parameter added to get_request_hardware_counter ensures the function properly handles the new hardware measurement tracking feature. This appears to be part of a system-wide update to add measurement capabilities for operations.

🏁 Script executed:
#!/bin/bash
# Verify consistency of get_request_hardware_counter calls in API endpoints

echo "Checking for consistent usage of get_request_hardware_counter across API files:"
rg --type rust "get_request_hardware_counter\(" src/actix/api/ -A 5
Length of output: 11038
Hardware Reporting Parameter Usage Confirmed

The verification shows that the usage of get_request_hardware_counter is consistent across the API endpoints. In all endpoints (including query_api.rs, count_api.rs, facet_api.rs, etc.), the function is called with service_config.hardware_reporting() paired with None. The exception is in update_api.rs, where Some(params.wait) is intentionally provided for its specific operational requirements.

No changes are required here.
lib/collection/tests/integration/collection_test.rs (8)

61-64: LGTM: Hardware measurement integrates properly with test operation.

The addition of the hardware counter parameter allows for tracking I/O operations during test execution. The implementation is clean and straightforward.

132-135: LGTM: Consistent hardware counter implementation.

The hardware counter is properly initialized and passed to the update method, consistent with the pattern established in the first test.

236-239: LGTM: Proper hardware counter usage.

The hardware counter is correctly initialized and passed to the update method.

379-386: LGTM: Good use of clone for hardware counter reuse.

Properly cloning the hardware counter allows it to be reused in both the update operation and the recommend operation, allowing for cumulative measurement if needed.

444-447: LGTM: Consistent hardware counter implementation.

The hardware counter is properly initialized and passed to the update method.

542-545: LGTM: Consistent hardware counter implementation.

The hardware counter is properly initialized and passed to the update method.

796-803: LGTM: Good use of clone for hardware counter reuse.

Properly cloning the hardware counter allows it to be reused in both operations.

822-824: LGTM: Consistent hardware counter reuse.

The same hardware counter is used for the delete operation, which is appropriate for tracking cumulative measurements across operations.

tests/consensus_tests/fixtures.py (2)

50-65: LGTM: New helper for payload updates with telemetry tracking.

This new function will be helpful for testing hardware measurement during payload update operations. It properly handles the HTTP request and returns the JSON response for further verification.

209-215: LGTM: Clean implementation for retrieving hardware telemetry data.

This function provides a simple way to get hardware telemetry information for a specific collection, enabling tests to verify that measurements are being properly tracked and reported.
lib/api/src/grpc/conversions.rs (3)

2117-2128: LGTM: Added hardware usage field to response conversion.

The usage field is now properly included in the conversion from internal to external response format, ensuring hardware measurements are preserved across API boundaries.

2134-2144: LGTM: Properly implemented bi-directional conversion with hardware usage.

The reverse conversion (external to internal) also correctly handles the hardware usage field, maintaining consistency in both directions.

2596-2606:
Details

✅ Verification successful

Verify hardware measurements are accurately converted to protobuf format.

The implementation of From<HwMeasurementAcc> for HardwareUsage looks good, converting all relevant measurement fields. Make sure these are collected and reported correctly in all contexts where they're used.

🏁 Script executed:
#!/bin/bash
# Check where HardwareUsage is used and if any fields might be missing

echo "Checking protobuf definition of HardwareUsage:"
rg -A 10 "message HardwareUsage" --glob "*.proto"

echo "\nChecking where HardwareUsage is accumulated and reported:"
rg "HardwareUsage" --type rust lib/api/src/
Length of output: 2745
Verified Conversion Logic: The conversion function from HwMeasurementAcc to HardwareUsage in lib/api/src/grpc/conversions.rs correctly maps all fields. The protobuf definition (found in lib/api/src/grpc/proto/points.proto) includes cpu, payload_io_read, payload_io_write, vector_io_read, and vector_io_write, and all of these are being accurately converted. Additionally, the usage of HardwareUsage across the codebase has been confirmed, with no missing or additional fields detected.
lib/collection/src/update_handler.rs (2)

59-59: Addition of hardware measurement tracking to OperationData

The new field hw_measurements correctly stores hardware measurement data in the operation data structure, which will help track resource usage during operations.

702-707: Hardware measurement integration for update operations

Good change - replacing the disposable counter with the measurement accumulator from the operation data allows for tracking hardware usage throughout the update process.

lib/collection/src/shards/local_shard/shard_ops.rs (2)

41-41: Parameter renamed to indicate active usage

Good change - removing the underscore prefix indicates that the parameter is now being actively used rather than ignored.

91-91: Hardware measurement accumulator passed to operation data

The hardware measurement accumulator is now correctly passed to the update operation, allowing resource usage to be tracked during shard operations.

lib/collection/src/shards/queue_proxy_shard.rs (2)

698-699: Hardware measurement parameter added to transfer_operations_batch

The function signature has been properly updated to include hardware measurement tracking capabilities.

715-715: Hardware measurement propagated to remote update

The hardware measurement accumulator is correctly passed to the forward_update method, ensuring consistent tracking across local and remote operations.

lib/collection/tests/integration/grouping_test.rs (3)

83-90: Hardware measurement integration in test setup

The test has been correctly updated to include hardware measurement when calling update_from_client_simple.

523-552: Hardware counter reused across multiple test operations

Good approach - creating a single hardware counter and reusing it with clone() for multiple operations in the test setup.

581-586: Hardware measurement passed to update operation in lookup collection

The hardware measurement accumulator is correctly passed to the update operation, ensuring consistent implementation across all test cases.

src/common/metrics.rs (1)

341-341: Appropriate change from gauge to counter for hardware metrics.

The metric type has been correctly changed from GAUGE to COUNTER for all hardware metrics, as these represent cumulative values that only increase over time. This change aligns with the comment from timvisee in a previous review who pointed out that a total should be represented as a counter.

Also applies to: 348-348, 355-355, 362-362, 369-369

src/tonic/api/points_api.rs (3)

46-57: Improved function signature with optional waiting configuration.

The addition of the wait parameter to control whether hardware tracking should wait for operations to complete is a good enhancement. The logic for determining the waiting value is also clear and defaults correctly to true when wait is None.

72-73: Consistent extraction of wait parameter from update requests.

The code correctly extracts the wait parameter from requests and passes it to the hardware counter initialization for update operations.

Also applies to: 97-98, 144-145

119-120: Consistent use of None for read operations.

Read operations correctly pass None as the wait parameter, which will default to waiting behavior. This is consistent across all read operations.

Also applies to: 167-170, 237-240
lib/api/src/grpc/proto/points_internal_service.proto (2)

49-49: New unmeasured flag for internal upsert operations.

The addition of the unmeasured flag allows for internal operations to opt out of hardware usage measurement. This is useful for performance optimization in cases where measurement isn't needed.

Consider documenting the purpose of this flag with a comment to enhance code readability:
- optional bool unmeasured = 4;
+ optional bool unmeasured = 4; // When true, hardware usage will not be measured for this operation
104-104: Added hardware usage field to response messages.

The addition of the HardwareUsage field to response messages is consistent with the PR objective to report hardware usage metrics through the API. This implementation allows clients to receive detailed information about resource consumption.

Also applies to: 323-323
src/actix/api/recommend_api.rs (2)

60-65: Consistently passing None for hardware counter in recommend APIs.

The addition of the None parameter to all get_request_hardware_counter calls is consistent with the changes in other API endpoints. This ensures uniform behavior across the entire codebase.

Also applies to: 146-151, 211-216

1-239:
Details

❓ Verification inconclusive

Consider adding integration test for hardware measurement in recommendation API.

While the code changes look correct, it would be valuable to add integration tests that verify the hardware measurements are properly captured and reported for recommendation operations.

🏁 Script executed:
#!/bin/bash
# Description: Check for integration tests related to hardware measurements in recommendation API

# Look for tests that verify hardware measurements in recommendation operations
rg -A 3 "recommend.*hardware|hardware.*recommend" --type=rust lib/collection/tests/
Length of output: 85
Integration Test for Hardware Measurement:
The recommendation API endpoints correctly incorporate hardware measurement via get_request_hardware_counter, but our search did not reveal any integration tests verifying that these measurements are captured and reported as expected. Please verify manually whether such tests exist in other parts of the repository; if not, consider adding integration tests to ensure that hardware measurements in the recommendation operations are fully validated.
lib/storage/src/content_manager/toc/point_ops.rs (9)

448-448: Parameter addition for hardware measurement tracking.

Adding hw_measurement_acc parameter to the _update_shard_keys method allows for tracking hardware resource usage during shard key updates. This is consistent with the overall PR objective of measuring hardware IO for update operations.

459-461: Properly propagating hardware measurement to collection operations.

The hardware measurement accumulator is correctly passed to the update_from_client method, with cloning to ensure each update operation gets its own copy of the accumulator.

477-486: Function signature updated with hardware measurement parameter.

The #[allow(clippy::too_many_arguments)] is necessary as the function now exceeds the default limit for parameter count. The addition of hw_measurement_acc parameter to the public update method ensures consistency with the internal implementation.

548-550: Hardware measurement propagation for non-empty shard selector.

The hardware measurement accumulator is correctly propagated to the client update operation, ensuring resource usage is tracked across all code paths.

562-564: Hardware measurement propagation for all shards.

Hardware measurement is correctly passed to the client update when no specific shard keys are provided, ensuring comprehensive tracking of IO operations.

572-574: Hardware measurement for multi-shard updates.

The hardware measurement accumulator is properly passed to the _update_shard_keys method, ensuring consistent measurement across all update code paths.

585-587: Hardware measurement for specific shard key updates.

Hardware usage is properly tracked for shard key-specific updates, maintaining consistent measurement across all update pathways.

597-599: Hardware measurement for multiple shard key updates.

The hardware measurement accumulator is correctly propagated for multi-shard key operations, ensuring comprehensive tracking across all update paths.

609-611: Hardware measurement for peer updates.

Hardware usage is properly tracked for shard-id specific update operations coming from peers, completing the coverage of all update code paths.

lib/storage/src/content_manager/data_transfer.rs (2)

134-137: Proper handling of internal operations.

Using HwMeasurementAcc::disposable() is appropriate for internal operations that shouldn't affect user-facing hardware metrics. This ensures data replication operations don't skew hardware usage reporting.

237-240: Consistent approach to internal operations.

Similar to the replication function, using a disposable hardware measurement accumulator for index transfers prevents these internal operations from affecting user-facing metrics. This maintains the distinction between user-initiated operations and system activities.
tests/consensus_tests/test_hw_measurement.py (1)

5-5:
Details

✅ Verification successful

Consider explicit imports instead of star imports.

Using wildcard imports (from .utils import *) makes it difficult to determine which symbols are imported and can lead to namespace pollution.
-from .utils import *
+from .utils import (
+    start_cluster, wait_collection_exists_and_active_on_all_peers,
+    check_collection_points_count, wait_collection_points_count
+)
🏁 Script executed:
#!/bin/bash
# Check which functions from utils are actually used in this file
grep -o "start_cluster\|wait_collection_exists_and_active_on_all_peers\|check_collection_points_count\|wait_collection_points_count" tests/consensus_tests/test_hw_measurement.py
Length of output: 360
Explicit Imports for Utils Module – Replace Wildcard Import

The verification confirms that all four functions—start_cluster, wait_collection_exists_and_active_on_all_peers, check_collection_points_count, and wait_collection_points_count—are actually in use in this test file. To prevent potential namespace pollution and improve code clarity, please replace the star import with explicit imports. The diff snippet below illustrates the necessary change:

File: tests/consensus_tests/test_hw_measurement.py (Line 5)
- from .utils import *
+ from .utils import (
+     start_cluster,
+     wait_collection_exists_and_active_on_all_peers,
+     check_collection_points_count,
+     wait_collection_points_count
+ )
🧰 Tools

🪛 Ruff (0.8.2)

5-5: from .utils import * used; unable to detect undefined names

(F403)
lib/collection/src/collection/point_ops.rs (10)

31-31: Parameter addition for hardware measurement tracking.

Adding the hw_measurement_acc parameter to the update_all_local method is consistent with the PR's objective of measuring hardware IO for update operations. This enables tracking hardware resource usage during local updates.

53-57: Properly propagating hardware measurements to local updates.

The hardware measurement accumulator is correctly cloned and passed to the update_local method for each shard, ensuring accurate tracking of resource usage.

93-93: Parameter addition for peer update operations.

Adding the hw_measurement_acc parameter to the update_from_peer method extends hardware usage tracking to peer-to-peer updates, aligning with the PR objectives.

106-106: Hardware measurement for weak ordering updates.

The hardware measurement accumulator is correctly propagated to the update_local method in the weak ordering case, ensuring consistent measurement across all ordering types.

116-118: Hardware measurement for medium/strong ordering updates.

The hardware measurement accumulator is correctly passed to the update_with_consistency method for medium and strong ordering, completing coverage for all ordering types.

145-145: Parameter addition for client update operations.

Adding the hw_measurement_acc parameter to the update_from_client method extends hardware usage tracking to client-initiated updates, ensuring comprehensive coverage of update operations.

159-159: Local clone for thread-safe hardware measurement in async context.

Creating a local clone of the hardware measurement accumulator ensures that the async block has its own copy, avoiding potential thread-safety issues in the FuturesUnordered execution.

169-175: Hardware measurement for update_all operations.

The hardware measurement accumulator is correctly passed to the update_with_consistency method for "update_all" operations, ensuring resource tracking for all parts of the update process.

181-187: Hardware measurement for update_only_existing operations.

The hardware measurement accumulator is correctly propagated to the update_with_consistency method for "update_only_existing" operations, completing the coverage for all update operation types.

251-254: Parameter addition for simplified client update method.

Adding the hw_measurement_acc parameter to the update_from_client_simple method and properly forwarding it to update_from_client ensures that the simplified update interface also tracks hardware usage.

lib/collection/src/shards/remote_shard.rs (8)

226-227: Pass HwMeasurementAcc parameter consistently.

Adding the hw_measurement_acc parameter to forward_update is consistent with similar refactoring across the codebase and ensures hardware usage can be tracked here as well.

236-238: Initialize measurement logic carefully.

By passing hw_measurement_acc and false to execute_update_operation, you ensure that measurement is enabled for remote operations by default. Confirm that this matches the intended logic (since false indicates measuring on the remote).

253-255: Extend usage toggles for hardware measurement.

The hw_measurement_acc and unmeasured_on_remote parameters expand the function signature. Validate that these new flags are always used as intended to avoid inconsistent measurement states.

272-273: Propagate 'unmeasured_on_remote' correctly.

Forwarding this boolean ensures the remote logic can correctly decide to skip or enable hardware measurement. This addition appears consistent with the rest of the update code.

507-515: Accumulate remote usage metrics.

Accumulating hardware usage from point_operation_response.usage into hw_measurement_acc is correct. Additionally, ensure that the response usage only reflects newly introduced operations (i.e., no double counting across multiple calls).

674-699: Introduce specialized unmeasured update.

The update_remote_unmeasured method clearly documents that it should only be used for internal operations. Confirm the call sites are truly internal to prevent unintended skipping of hardware measurement.

712-726: Add measurement parameter to update routine.

Taking hw_measurement_acc ensures consistent metric tracking at the top-level update function. The introduction of false for unmeasured_on_remote here aligns with the default logic for end-user updates.

1120-1127: Record usage in facet method.

Collecting CPU and I/O usage after facet calls is consistent with other endpoints. Make sure that the concurrency model (if any parallel calls exist) does not inadvertently mix usage data from multiple calls.

src/actix/api/update_api.rs (13)

10-10: Add HwMeasurementAcc import.

Introducing this import aligns with the hardware measurement tracking refactor.

54-55: Propagate wait into measurement logic.

Passing Some(params.wait) into get_request_hardware_counter ensures that synchronous or asynchronous waiting is included in the usage reporting logic.

66-66: Initiate measurement on upsert.

Calling request_hw_counter.get_counter() to track hardware usage for upsert operations is consistent with the new design.

94-95: Consistent usage tracking in delete endpoint.

Including Some(params.wait) and reusing request_hw_counter.get_counter() ensures the delete operation also reports hardware usage.

Also applies to: 106-107

135-136: Add hardware counter for vector updates.

These changes ensure update_vectors consistently reports usage, helping to gather performance metrics on vector operations.

Also applies to: 147-148

176-177: Include measurement for vector deletion.

Adopting the same pattern for the delete_vectors endpoint keeps hardware usage reporting consistent across vector CRUD operations.

Also applies to: 187-188

215-216: Set payload usage reporting.

By passing the request hardware counter, the set_payload function now collects CPU and I/O usage metrics for payload operations.

Also applies to: 226-227

253-254: Enable usage measurement for overwrite payload.

Just like set_payload, the overwrite operation can now measure hardware usage, thanks to the newly added counter parameter.

Also applies to: 264-265

291-292: Delete payload usage tracking.

Ensuring measurement for both synchronous and asynchronous scenarios in delete_payload helps capture resource usage across all payload removal paths.

Also applies to: 302-303

329-330: Clear payload updates.

Integrating the hardware counter ensures clearing payload data is also measured, consistent with the other routes.

Also applies to: 340-341

380-381: Batch updates include hardware usage.

Adding the counter to the batch endpoint ensures aggregated operations also feed into the global measurement logic.

Also applies to: 394-395

418-419: Use disposable counter for create field index.

The disposable measurement indicates field index creation is not fully tracked as a user-heavy operation, which may be acceptable. Consider whether the overhead is minimal enough to measure as well.

441-442: Similarly skip measurement for delete field index.

Following the same pattern as create, this also uses a disposable counter. Confirm it remains an acceptable approach from a performance perspective.

src/tonic/api/update_common.rs (14)

5-5: Introduce hardware usage types and request counter.

Adding these imports (HardwareUsage, HwMeasurementAcc, RequestHwCounter) lays the foundation for measuring hardware usage in the Tonic layer.

Also applies to: 22-22, 28-28

45-46: Track usage on upsert.

Injecting request_hw_counter and capturing usage within points_operation_response_internal ensures that upsert calls accumulate hardware metrics.

Also applies to: 75-76, 79-80

90-91: Delete operation usage reporting.

These additions mirror the upsert approach, ensuring deletion is properly captured in usage metrics.

Also applies to: 118-119, 122-123

133-134: Integrate measurement for updating vectors.

Consistent with the other changes, capturing usage in the Tonic layer for update_vectors ensures coverage for all vector update paths.

Also applies to: 175-176, 179-180

189-190: Delete vectors usage additions.

Similar to the update logic, ensuring vector deletions feed into usage data helps maintain consistent hardware measurement coverage.

Also applies to: 225-226, 229-230

239-240: Record usage for set payload.

Accepting the request counter in set_payload ensures that memory, I/O, and CPU usage from payload operations is captured at the Tonic layer.

Also applies to: 273-274, 278-279

287-288: Enable reporting on overwrite payload.

Aligning with set payload, the overwrite operation now also updates hardware usage reporting.

Also applies to: 321-322, 326-327

335-336: Capture usage on delete payload.

These lines complete the coverage of payload-related operations for Tonic’s usage tracking.

Also applies to: 367-368, 372-373

381-382: Support clear payload measurement.

Ensuring the clear payload call is tracked closes the loop on all main payload operations.

Also applies to: 409-410, 412-413

423-424: Expanded batch updates coverage.

Including request_hw_counter in the update_batch operation ensures that each sub-operation within the batch is accurately reflected in usage metrics.

Also applies to: 457-458, 474-475, 487-488, 499-500, 525-526, 548-549, 567-568, 589-590, 612-613, 628-629

669-670: Create field index with usage.

Although “unmeasured” in some contexts, accepting the request counter can help if the operation is elevated to full measurement later.

Also applies to: 696-697, 700-701

761-762: Delete field index also unmeasured.

Code follows the same approach as index creation. If usage data becomes more relevant, it can be measured consistently in the future.

Also applies to: 764-765

840-841: Sync operation remains unmeasured.

While it is designated “API unmeasured,” consider verifying that skipping usage measurement here is intentional, as it can involve potentially large data transfers.

Also applies to: 844-844

848-852: Include optional usage in the response structure.

Adding usage: Option<HardwareUsage> to PointsOperationResponseInternal allows the final response to include hardware metrics. This is consistent with the broader changes.

Also applies to: 856-857
lib/collection/src/shards/replica_set/update.rs (6)

30-35: LGTM: Parameter added for hardware measurement tracking

The addition of the hw_measurement parameter to the update_local method is consistent with the approach taken throughout the codebase to enable hardware usage tracking for update operations.

100-106: LGTM: Coherently passing hardware metrics to update operation

The hardware measurement accumulator is correctly propagated to the update method call in the update_with_consistency method to track hardware usage.

177-183: LGTM: Hardware metric tracking added to update signature

The addition of the hw_measurement_acc parameter to the update method aligns with the broader changes to track hardware metrics during operations.

238-245: LGTM: Hardware measurement accumulator added to update_impl signature

The hardware measurement accumulator parameter is consistently added to the update_impl method signature.

635-642: LGTM: Hardware metrics correctly added to forward_update method

The hw_measurement_acc parameter is appropriately added to the forward_update method signature, which ensures hardware usage is tracked even when updates are forwarded to other nodes.

305-306:
Details

✅ Verification successful

Ensure hardware accumulator is properly tracked across execution contexts

The local update operation correctly clones and passes the hardware accumulator to the async block. Make sure the HwMeasurementAcc type can safely track hardware metrics across asynchronous execution contexts.

Also applies to: 322-323

🏁 Script executed:
#!/bin/bash
# Checking if HwMeasurementAcc is designed to work safely with async operations

# Look for synchronization primitives in the implementation
rg -A 10 -B 10 "pub struct HwMeasurementAcc" --glob "*.rs"

# Check if there are any explicit comments or code related to async safety
rg "async|tokio::sync|parking_lot|std::sync" --glob "*hardware_accumulator*.rs"
Length of output: 2224
Confirmed: Hardware accumulator is safely managed in async contexts

After reviewing the implementation in lib/common/common/src/counter/hardware_accumulator.rs, the HwMeasurementAcc type is implemented using thread-safe primitives (Arc and AtomicUsize). This design ensures that it can safely track hardware metrics across asynchronous execution contexts. The cloning of the accumulator passed into the async block on lines 305-306 (and similarly on 322-323) is appropriate and maintains thread safety across execution contexts.
src/common/update.rs (3)

14-14: LGTM: Import for hardware measurement tracking

The import of HwMeasurementAcc is correctly added to support the new hardware measurement functionality.

239-240: LGTM: Consistent addition of hardware measurement parameter

The hardware measurement accumulator parameter is consistently added to all operation functions. This ensures comprehensive tracking of hardware usage across all update operations.

Also applies to: 281-282, 316-317, 349-350, 414-415, 452-453, 491-492, 527-528, 562-563

267-268: LGTM: Hardware measurement propagated to update function

The hardware measurement accumulator is correctly passed through to the underlying update function in all operation handlers.

Also applies to: 302-303, 337-338, 398-399, 440-441, 479-480, 515-516, 548-549, 813-814
src/tonic/api/points_internal_api.rs (5)

151-152: LGTM: Hardware usage metrics added to facet response

The usage field is now correctly populated with hardware usage data in the FacetResponseInternal struct.

189-190: Good implementation of conditional hardware measurement

The code now supports an unmeasured flag to optionally disable hardware measurement tracking. This is useful for operations where tracking is not needed or for testing purposes.

Also applies to: 192-200

227-231: LGTM: Consistent extraction pattern for hardware metrics

The pattern of extracting the internal request, then creating a hardware metrics counter specific to that collection is consistent and well-implemented.

287-291: Consistent hardware metrics implementation across all methods

The hardware metrics collection is consistently implemented across all update methods, which is good for maintaining uniform behavior and tracking.

Also applies to: 315-319, 343-347, 371-375, 399-403

163-172:
Details

✅ Verification successful

Ensure collection-specific hardware metrics are properly drained

The method get_request_collection_hw_usage_counter_for_internal creates a new RequestHwCounter with a metrics drain. Verify that the metrics are properly drained to the collection's counter when operations complete.

🏁 Script executed:
#!/bin/bash
# Check the implementation of HwMeasurementAcc's drain functionality

# Find the implementation of new_with_metrics_drain
rg -A 10 "new_with_metrics_drain" --glob "*hardware_accumulator*.rs"

# Check how the metrics are drained
rg "drain" --glob "*hardware_accumulator*.rs"
Length of output: 3060
Review Update: Verified Proper Draining of Collection-Specific Metrics

After verifying the implementation in src/tonic/api/points_internal_api.rs and reviewing the corresponding functionality in lib/common/common/src/counter/hardware_accumulator.rs, it is confirmed that the get_request_collection_hw_usage_counter_for_internal method correctly utilizes HwMeasurementAcc::new_with_metrics_drain by passing the collection-specific metrics drain from self.toc.get_collection_hw_metrics(collection_name). The underlying implementation properly assigns the provided metrics_drain (without creating a new counter) in accordance with the intended design, ensuring that hardware metrics are correctly drained after operations complete.

No further modifications are required.

src/common/metrics.rs

coderabbitai

Actionable comments posted: 0

🧹 Nitpick comments (3)

lib/collection/src/shards/replica_set/update.rs (2)
49-52: Fix spelling error in comment.

There's a typo in the comment - "Dont't" should be "Don't".
-        // Dont't measure hw when resharding
+        // Don't measure hw when resharding
🧰 Tools

🪛 GitHub Check: Check for spelling errors

[failure] 49-49:
Dont't ==> Don't

🪛 GitHub Actions: Codespell

[error] 49-49: codespell: Dont't ==> Don't

121-125: Fix spelling error in comment.

Same typo in this comment - "Dont't" should be "Don't".
-        // Dont't measure hw when resharding
+        // Don't measure hw when resharding
🧰 Tools

🪛 GitHub Check: Check for spelling errors

[failure] 121-121:
Dont't ==> Don't
tests/consensus_tests/assertions.py (1)
26-28: Fix variable shadowing in loop parameters.

The loop control variables left and right are shadowing the function parameters with the same names, which could lead to confusion in maintenance.
-def assert_hw_measurements_equal_many(left: list[dict[str, int]], right: list[dict[str, int]]):
-    for left,right in zip(left, right):
-        assert_hw_measurements_equal(left, right)
+def assert_hw_measurements_equal_many(left: list[dict[str, int]], right: list[dict[str, int]]):
+    for left_dict, right_dict in zip(left, right):
+        assert_hw_measurements_equal(left_dict, right_dict)
🧰 Tools

🪛 Ruff (0.8.2)

27-27: Loop control variable left overrides iterable it iterates

(B020)

27-27: Loop control variable right overrides iterable it iterates

(B020)

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between aaa2a78 and fe513cd.

📒 Files selected for processing (5)

lib/collection/src/shards/replica_set/mod.rs (2 hunks)
lib/collection/src/shards/replica_set/update.rs (10 hunks)
src/common/metrics.rs (1 hunks)
tests/consensus_tests/assertions.py (2 hunks)
tests/consensus_tests/test_resharding.py (2 hunks)

🚧 Files skipped from review as they are similar to previous changes (1)

src/common/metrics.rs

🧰 Additional context used

🪛 Ruff (0.8.2)

tests/consensus_tests/assertions.py

27-27: Loop control variable left overrides iterable it iterates

(B020)

27-27: Loop control variable right overrides iterable it iterates

(B020)

🪛 GitHub Check: Check for spelling errors

lib/collection/src/shards/replica_set/update.rs

[failure] 49-49:
Dont't ==> Don't

[failure] 121-121:
Dont't ==> Don't

🪛 GitHub Actions: Codespell

lib/collection/src/shards/replica_set/update.rs

[error] 49-49: codespell: Dont't ==> Don't

⏰ Context from checks skipped due to timeout of 90000ms (13)

GitHub Check: Basic TLS/HTTPS tests
GitHub Check: test-snapshot-operations-s3-minio
GitHub Check: test-shard-snapshot-api-s3-minio
GitHub Check: test-low-resources
GitHub Check: test-consistency
GitHub Check: test (macos-latest)
GitHub Check: test-consensus-compose
GitHub Check: test (windows-latest)
GitHub Check: test
GitHub Check: test (ubuntu-latest)
GitHub Check: test
GitHub Check: test-consensus
GitHub Check: test

🔇 Additional comments (9)

lib/collection/src/shards/replica_set/mod.rs (2)

1292-1305: Well-designed utility method for checking resharding state.

This method provides a clean abstraction to check if a replica is in any resharding state, making the code more readable and maintainable. The implementation correctly handles all enum variants and follows the pattern of existing methods.

948-956: Hardware measurement parameter properly passed to update_local method.

The inclusion of the hardware measurement accumulator in the call to update_local is consistent with the broader changes across the codebase to track hardware usage during operations.

lib/collection/src/shards/replica_set/update.rs (5)

34-34: Type consistency in parameter modification.

The parameter change from hw_measurement to mut hw_measurement is appropriate, as the method needs to potentially modify this value for resharding states.

49-52: Hardware measurement optimization for resharding states.

Good optimization to use a disposable hardware measurement accumulator when in resharding state. This prevents unnecessary tracking of metrics during potentially resource-intensive resharding operations.

🧰 Tools

🪛 GitHub Check: Check for spelling errors

[failure] 49-49:
Dont't ==> Don't

🪛 GitHub Actions: Codespell

[error] 49-49: codespell: Dont't ==> Don't

121-125: Consistent implementation of hardware measurement handling for resharding.

This implementation correctly handles the hardware measurement for the resharding case, consistent with the approach in update_local. The check for resharding state before deciding on measurement tracking follows good design principles.

🧰 Tools

🪛 GitHub Check: Check for spelling errors

[failure] 121-121:
Dont't ==> Don't

652-653: Consistent parameter propagation to forward_update.

The hardware measurement accumulator is properly passed through to the forwarded update, ensuring that metrics are collected consistently across remote operations.

665-671: Complete parameter set for forwarded update.

All required parameters, including the hardware measurement accumulator, are correctly passed to the remote leader, maintaining the measurement tracking chain.
tests/consensus_tests/test_resharding.py (1)

241-249:
Details

✅ Verification successful

Good addition of hardware measurement validation.

The new code effectively validates that resharding transfers don't produce unexpected hardware measurements by comparing telemetry data before and after the transfer process. This aligns well with the PR objectives of ensuring accurate hardware IO measurement reporting.

🏁 Script executed:
#!/bin/bash
# Verify the existence of get_telemetry_hw_info function in fixtures.py
# since it's imported but was newly added

fd -e py fixtures.py -x cat {} | grep -A 5 "get_telemetry_hw_info"
Length of output: 319
Hardware Measurement Validation Confirmed

The updated test in tests/consensus_tests/test_resharding.py (lines 241-249) correctly validates that no unexpected hardware measurement changes occur during the resharding transfer. The function get_telemetry_hw_info was successfully verified in fixtures.py, and its behavior is as expected. Great work ensuring that telemetry data remains consistent in line with the PR objectives.
tests/consensus_tests/assertions.py (1)

16-24: Good implementation of hardware measurement comparisons.

The function effectively compares two dictionaries of hardware measurements, with special handling for keys present in only one dictionary and values greater than zero. The symmetric approach ensures both dictionaries are properly validated against each other.

coderabbitai

Actionable comments posted: 0

🧹 Nitpick comments (4)

lib/collection/src/shards/replica_set/update.rs (4)

49-52: Consider refactoring repeated 'resharding check' blocks.
Both here and in lines 121–125, the code sets the measurement to disposable if resharding is detected. You could factor this logic into a small helper to keep the code DRY.

121-125: Duplicate logic for skipping hardware measurement.
As noted for lines 49–52, consider extracting a helper method to avoid duplicating the resharding check logic.

209-215: Excessive cloning of the accumulator could be costly.
If HwMeasurementAcc cloning is expensive, consider passing references or reusing an existing clone.

666-671: Appropriate propagation of hardware measurement to remote forward update.
The comment about clock tags is clear; consider clarifying in the doc comment why the forwarded clock tag must be None.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between fe513cd and 2d0886a.

📒 Files selected for processing (2)

lib/collection/src/shards/replica_set/update.rs (10 hunks)
src/common/metrics.rs (1 hunks)

🚧 Files skipped from review as they are similar to previous changes (1)

src/common/metrics.rs

⏰ Context from checks skipped due to timeout of 90000ms (13)

GitHub Check: test-consensus-compose
GitHub Check: test-snapshot-operations-s3-minio
GitHub Check: test-shard-snapshot-api-s3-minio
GitHub Check: Basic TLS/HTTPS tests
GitHub Check: test-low-resources
GitHub Check: test (macos-latest)
GitHub Check: test-consistency
GitHub Check: test (windows-latest)
GitHub Check: test
GitHub Check: test (ubuntu-latest)
GitHub Check: test
GitHub Check: test-consensus
GitHub Check: test

🔇 Additional comments (7)

lib/collection/src/shards/replica_set/update.rs (7)

34-34: Parameters introduced for local hardware measurement look good.
The addition of mut hw_measurement: HwMeasurementAcc is consistent with the update logic.

110-110: Good to see the hardware accumulator carried forward.
The mutable accumulator parameter is added consistently for consistency-based updates.

137-138: Clean integration of hardware measurement into update call.
No issues found with passing the accumulator here.

141-141: Forwarding the hardware measurement is consistent.
Forwarding with the new parameter ensures hardware metrics aren't lost when designating a leader.

193-193: Additional parameter for update function.
This change seamlessly aligns with other methods adding hw_measurement_acc.

255-255: Extended signature retains clarity.
Including the accumulator parameter follows the established pattern in this file.

653-653: Param added for hardware tracking in forward_update.
This ensures uniform coverage of hardware metrics in all update paths.

timvisee

Happy to approve after handling the following review comments:

lib/collection/src/collection/clean.rs

lib/collection/src/shards/replica_set/update.rs

lib/common/common/src/counter/hardware_data.rs

src/actix/api/update_api.rs

tests/consensus_tests/test_hw_measurement.py

coderabbitai

Actionable comments posted: 1

🔭 Outside diff range comments (1)

tests/consensus_tests/test_hw_measurement.py (1)
92-102: 🛠️ Refactor suggestion

Improve error handling in the assert_with_upper_bound_error function.

The function uses assert False with a message, which can be problematic when Python is run with optimizations enabled (python -O) as it removes assertions.
 def assert_with_upper_bound_error(inp: int, min_value: int, upper_bound_error_percent: float = 0.05):
     """Asserts `inp` being equal to `min_value` with a max upperbound error given in percent."""
     if inp < min_value:
-        assert False, f"Assertion {inp} >= {min_value} failed"
+        raise AssertionError(f"Assertion {inp} >= {min_value} failed")
 
     max_error = ceil(float(min_value) + float(min_value) * upper_bound_error_percent)
     upper_bound = inp + max_error
 
     if inp > upper_bound:
-        assert False, f"Assertion {inp} being below upperbound error of {upper_bound_error_percent}(={upper_bound}) failed."
+        raise AssertionError(f"Assertion {inp} being below upperbound error of {upper_bound_error_percent}(={upper_bound}) failed.")
🧰 Tools

🪛 Ruff (0.8.2)

95-95: Do not assert False (python -O removes these calls), raise AssertionError()

Replace assert False

(B011)

101-101: Do not assert False (python -O removes these calls), raise AssertionError()

Replace assert False

(B011)

🧹 Nitpick comments (2)

tests/consensus_tests/test_hw_measurement.py (2)
5-5: Consider replacing wildcard import with explicit imports.

Wildcard imports (from .utils import *) make it hard to determine which functions are being used and where they're defined. This affects code readability and maintainability.
-from .utils import *
+from .utils import (
+    start_cluster,
+    wait_collection_exists_and_active_on_all_peers,
+    check_collection_points_count,
+    wait_collection_points_count
+)
🧰 Tools

🪛 Ruff (0.8.2)

5-5: from .utils import * used; unable to detect undefined names

(F403)

59-60: Reuse the expected_delta variable for consistency.

The assertion uses a hardcoded value (19 * 5) which should match the expected_delta defined earlier. Reusing the variable would improve maintainability.
-        assert abs(peer_hw["payload_io_write"] - peer_hw_infos[peer_idx]["payload_io_write"]) >= (19 * 5)
+        expected_delta = 20 * 5  # 20 vectors times avg. 5 bytes payload
+        assert abs(peer_hw["payload_io_write"] - peer_hw_infos[peer_idx]["payload_io_write"]) >= expected_delta

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between ff383e3 and 7cbfbfe.

📒 Files selected for processing (3)

lib/collection/src/shards/replica_set/update.rs (10 hunks)
lib/common/common/src/counter/hardware_data.rs (1 hunks)
tests/consensus_tests/test_hw_measurement.py (1 hunks)

🚧 Files skipped from review as they are similar to previous changes (1)

lib/collection/src/shards/replica_set/update.rs

🧰 Additional context used

🪛 Ruff (0.8.2)

tests/consensus_tests/test_hw_measurement.py

5-5: from .utils import * used; unable to detect undefined names

(F403)

18-18: start_cluster may be undefined, or defined from star imports

(F405)

20-20: wait_collection_exists_and_active_on_all_peers may be undefined, or defined from star imports

(F405)

68-68: start_cluster may be undefined, or defined from star imports

(F405)

70-70: wait_collection_exists_and_active_on_all_peers may be undefined, or defined from star imports

(F405)

72-72: check_collection_points_count may be undefined, or defined from star imports

(F405)

80-80: wait_collection_points_count may be undefined, or defined from star imports

(F405)

95-95: Do not assert False (python -O removes these calls), raise AssertionError()

Replace assert False

(B011)

101-101: Do not assert False (python -O removes these calls), raise AssertionError()

Replace assert False

(B011)

⏰ Context from checks skipped due to timeout of 90000ms (13)

GitHub Check: Basic TLS/HTTPS tests
GitHub Check: test-snapshot-operations-s3-minio
GitHub Check: test-shard-snapshot-api-s3-minio
GitHub Check: test-low-resources
GitHub Check: test-consistency
GitHub Check: test-consensus-compose
GitHub Check: test (macos-latest)
GitHub Check: test
GitHub Check: test (windows-latest)
GitHub Check: test-consensus
GitHub Check: test
GitHub Check: test
GitHub Check: test (ubuntu-latest)

🔇 Additional comments (7)

lib/common/common/src/counter/hardware_data.rs (3)

1-9: Well-structured hardware metrics data container.

The HardwareData struct is clearly defined with appropriate fields for tracking various hardware metrics. The documentation is concise and explains the purpose of this structure.

11-13: Good wrapper type design for CPU measurements.

The wrapper type with clear documentation explains its purpose of ensuring proper CPU multiplier application throughout the codebase.

15-25: Clean implementation with proper optimizations.

The implementation correctly applies the CPU multiplier in the constructor and provides an inline getter method. Using self by value is appropriate since the struct implements Copy.
tests/consensus_tests/test_hw_measurement.py (4)

16-66: The test_measuring_hw_for_updates function implements a comprehensive test for hardware I/O measurements.

The test creates a collection, inserts points, and verifies that hardware telemetry data reflects the expected I/O operations. The implementation properly checks both the initial insertion and subsequent upsert operations.

Several TODOs indicate plans to add tests for vector operations in future PRs, which aligns with the PR objectives mentioning that vector storage measurements are planned for a subsequent PR.

🧰 Tools

🪛 Ruff (0.8.2)

18-18: start_cluster may be undefined, or defined from star imports

(F405)

20-20: wait_collection_exists_and_active_on_all_peers may be undefined, or defined from star imports

(F405)

31-31: Make the number of points proportional to the shard count.

This aligns with a previous review comment suggesting the use of N_SHARDS * 20 to ensure proper distribution across shards.

38-40: Consider asserting a range rather than using an absolute difference.

The current assertion checks that the difference is at least a certain value, but it doesn't put an upper bound on the difference. This could allow for unexpectedly large increases to pass the test.
-        expected_delta = (19 * 5)  # ~20 vectors times avg. 5 bytes payload
-        assert abs(peer_hw["payload_io_write"] - peer_hw_infos[peer_idx]["payload_io_write"]) >= expected_delta
+        expected_delta = 20 * 5  # 20 vectors times avg. 5 bytes payload
+        assert_with_upper_bound_error(
+            abs(peer_hw["payload_io_write"] - peer_hw_infos[peer_idx]["payload_io_write"]),
+            expected_delta
+        )
67-91: The test_measuring_hw_for_updates_without_waiting function provides good coverage for asynchronous updates.

This test validates that hardware metrics are collected correctly even when using wait=false in the upsert operation. The test properly waits for points to be inserted before checking the telemetry.

🧰 Tools

🪛 Ruff (0.8.2)

68-68: start_cluster may be undefined, or defined from star imports

(F405)

70-70: wait_collection_exists_and_active_on_all_peers may be undefined, or defined from star imports

(F405)

72-72: check_collection_points_count may be undefined, or defined from star imports

(F405)

80-80: wait_collection_points_count may be undefined, or defined from star imports

(F405)

tests/consensus_tests/test_hw_measurement.py

coderabbitai

Actionable comments posted: 0

♻️ Duplicate comments (1)

tests/consensus_tests/test_hw_measurement.py (1)
92-100: ⚠️ Potential issue

Fix the calculation of upper bound in assert_with_upper_bound_error.

The current implementation calculates the upper bound incorrectly. It adds the percentage-based error to the input value instead of properly calculating the maximum error and adding it to the minimum value.
-    upper_bound = ceil(float(min_value) + float(min_value) * upper_bound_error_percent)
+    max_error = ceil(float(min_value) * upper_bound_error_percent)
+    upper_bound = min_value + max_error
🧰 Tools

🪛 Ruff (0.8.2)

95-95: Do not assert False (python -O removes these calls), raise AssertionError()

Replace assert False

(B011)

100-100: Do not assert False (python -O removes these calls), raise AssertionError()

Replace assert False

(B011)

🧹 Nitpick comments (6)

tests/consensus_tests/test_hw_measurement.py (6)
4-5: Consider replacing star imports with explicit imports.

Using wildcard imports (from .utils import *) makes it harder to track exactly what's being imported and used, which can lead to namespace pollution and unclear dependencies. Consider importing only the specific functions you need from the utils module.
-from .utils import *
+from .utils import (
+    start_cluster,
+    wait_collection_exists_and_active_on_all_peers,
+    check_collection_points_count,
+    wait_collection_points_count
+)
🧰 Tools

🪛 Ruff (0.8.2)

5-5: from .utils import * used; unable to detect undefined names

(F403)

39-40: Consider documenting payload size assumptions.

The test assumes an average payload size of 5 bytes per vector. If this is based on the implementation of upsert_random_points, consider adding a comment explaining this assumption or parameterizing it for clarity and maintainability.

60-60: Consider extracting the expected payload size to a constant.

The value "19 * 5" is repeated from line 39. Consider extracting this to a named constant at the top of the file to improve maintainability and make it easier to update if the payload size changes.
+# Average payload size in bytes per point
+AVG_PAYLOAD_SIZE = 5
+# Expected minimum delta for payload writes per ~20 points
+EXPECTED_PAYLOAD_DELTA = 19 * AVG_PAYLOAD_SIZE

...

-        expected_delta = (19 * 5)  # ~20 vectors times avg. 5 bytes payload
+        expected_delta = EXPECTED_PAYLOAD_DELTA  # ~20 vectors times avg. payload size

...

-        assert abs(peer_hw["payload_io_write"] - peer_hw_infos[peer_idx]["payload_io_write"]) >= (19 * 5)
+        assert abs(peer_hw["payload_io_write"] - peer_hw_infos[peer_idx]["payload_io_write"]) >= EXPECTED_PAYLOAD_DELTA
95-95: Replace assert False with raised exceptions.

Using assert False can be problematic because assertions can be disabled with the -O flag when running Python. Replace these with explicit exception raising for more reliable error handling.
-        assert False, f"Assertion {inp} >= {min_value} failed"
+        raise AssertionError(f"Assertion {inp} >= {min_value} failed")

...

-        assert False, f"Assertion {inp} being below upperbound error of {upper_bound_error_percent}(={upper_bound}) failed."
+        raise AssertionError(f"Assertion {inp} being below upperbound error of {upper_bound_error_percent}(={upper_bound}) failed.")
Also applies to: 100-100

🧰 Tools

🪛 Ruff (0.8.2)

95-95: Do not assert False (python -O removes these calls), raise AssertionError()

Replace assert False

(B011)

88-88: Consider using assert_with_upper_bound_error here for consistency.

Line 88 makes a simple greater-than-or-equal assertion, but you could use the assert_with_upper_bound_error helper function for consistency with other parts of the codebase and to provide better error messages.
-        assert peer_hw["payload_io_write"] >= upsert_vectors * 5  # 50 vectors on this node with payload of ~5 bytes
+        assert_with_upper_bound_error(
+            peer_hw["payload_io_write"],
+            upsert_vectors * 5,  # 50 vectors on this node with payload of ~5 bytes
+            upper_bound_error_percent=0.05
+        )
28-28: Consider adding timeframes for implementing the TODOs.

There are multiple TODO comments about implementing vector storage measurements, which matches the PR objective that mentioned vector storage measurements are planned for a future PR. Consider adding references to future work or tickets to make these TODOs more trackable.

Also applies to: 41-41, 64-64, 90-90

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 42a8b14 and 0b001c5.

📒 Files selected for processing (1)

tests/consensus_tests/test_hw_measurement.py (1 hunks)

🧰 Additional context used

🪛 Ruff (0.8.2)

tests/consensus_tests/test_hw_measurement.py

5-5: from .utils import * used; unable to detect undefined names

(F403)

18-18: start_cluster may be undefined, or defined from star imports

(F405)

20-20: wait_collection_exists_and_active_on_all_peers may be undefined, or defined from star imports

(F405)

68-68: start_cluster may be undefined, or defined from star imports

(F405)

70-70: wait_collection_exists_and_active_on_all_peers may be undefined, or defined from star imports

(F405)

72-72: check_collection_points_count may be undefined, or defined from star imports

(F405)

80-80: wait_collection_points_count may be undefined, or defined from star imports

(F405)

95-95: Do not assert False (python -O removes these calls), raise AssertionError()

Replace assert False

(B011)

100-100: Do not assert False (python -O removes these calls), raise AssertionError()

Replace assert False

(B011)

⏰ Context from checks skipped due to timeout of 90000ms (12)

GitHub Check: test-snapshot-operations-s3-minio
GitHub Check: test-shard-snapshot-api-s3-minio
GitHub Check: test-low-resources
GitHub Check: test-consistency
GitHub Check: test-consensus-compose
GitHub Check: test (macos-latest)
GitHub Check: test-consensus
GitHub Check: test (windows-latest)
GitHub Check: test
GitHub Check: test
GitHub Check: test (ubuntu-latest)
GitHub Check: test

🔇 Additional comments (1)

tests/consensus_tests/test_hw_measurement.py (1)

31-31: Ensure consistent point distribution across shards.

This line seems to implement the suggestion from a previous review. The calculation of N_SHARDS * 20 ensures that points are distributed relatively evenly across the shards, creating a more consistent and predictable test environment.

Co-authored-by: Tim Visée <tim+github@visee.me>

coderabbitai

Actionable comments posted: 0

♻️ Duplicate comments (1)

tests/consensus_tests/test_hw_measurement.py (1)
92-100: ⚠️ Potential issue

Fix the calculation of upper bound in assert_with_upper_bound_error.

The current implementation calculates the upper bound as min_value + min_value * error_percent, which means the allowed error increases with the minimum value. The error percentage should be applied to the minimum value to get a fixed error margin.
-    upper_bound = ceil(float(min_value) + float(min_value) * upper_bound_error_percent)
+    max_error = ceil(float(min_value) * upper_bound_error_percent)
+    upper_bound = min_value + max_error
🧰 Tools

🪛 Ruff (0.8.2)

95-95: Do not assert False (python -O removes these calls), raise AssertionError()

Replace assert False

(B011)

100-100: Do not assert False (python -O removes these calls), raise AssertionError()

Replace assert False

(B011)

🧹 Nitpick comments (12)

lib/api/src/grpc/qdrant.rs (3)
5798-5799: Add documentation for the usage field.

The new field lacks documentation describing its purpose and usage. Adding comments similar to the existing ones for the time field would improve API usability and maintainability.
    /// Time spent to process
    #[prost(double, tag = "2")]
    pub time: f64,
+   /// Hardware resources consumed during processing
    #[prost(message, optional, tag = "3")]
    pub usage: ::core::option::Option<HardwareUsage>,
9303-9304: Add documentation for the usage field.

The new field lacks documentation describing its purpose and usage. Adding comments similar to the existing ones for the time field would improve API usability and maintainability.
    /// Time spent to process
    #[prost(double, tag = "2")]
    pub time: f64,
+   /// Hardware resources consumed during processing
    #[prost(message, optional, tag = "3")]
    pub usage: ::core::option::Option<HardwareUsage>,
9793-9794: Add documentation for the usage field.

The new field lacks documentation describing its purpose and usage. Adding comments similar to the existing ones for the time field would improve API usability and maintainability.
    /// Time spent to process
    #[prost(double, tag = "2")]
    pub time: f64,
+   /// Hardware resources consumed during processing
    #[prost(message, optional, tag = "3")]
    pub usage: ::core::option::Option<HardwareUsage>,
tests/consensus_tests/fixtures.py (2)
50-64: Add a docstring for clarity

This new function update_points_payload lacks a docstring explaining its purpose, parameters, and return value. This is inconsistent with other functions in the file.

Consider adding a docstring like:
def update_points_payload(
        peer_url,
        points,
        collection_name="test_collection",
        wait="true",
):
    """
    Update payload of specified points in a collection.
    
    Args:
        peer_url: URL of the peer to send request to
        points: List of point IDs to update
        collection_name: Name of the collection (default: "test_collection")
        wait: Whether to wait for the operation to complete (default: "true")
        
    Returns:
        JSON response from the server
    """
209-214: Add docstring for telemetry function

This new function get_telemetry_hw_info lacks a docstring explaining its purpose, parameters, and return value.

Consider adding a docstring like:
def get_telemetry_hw_info(peer_url, collection):
    """
    Retrieve hardware telemetry information for a specific collection.
    
    Args:
        peer_url: URL of the peer to send request to
        collection: Name of the collection to get telemetry for
        
    Returns:
        JSON response containing hardware telemetry data for the specified collection
    """
lib/collection/src/shards/queue_proxy_shard.rs (1)
505-506: Consider reusing disposable hardware measurement

A new disposable hardware measurement is created for each retry attempt, which is inefficient. Consider creating it once before the loop and reusing it.
- for remaining_attempts in (0..BATCH_RETRIES).rev() {
-     let disposed_hw = HwMeasurementAcc::disposable(); // Internal operation
-     match transfer_operations_batch(&batch, &self.remote_shard, disposed_hw).await {
+ let disposed_hw = HwMeasurementAcc::disposable(); // Internal operation
+ for remaining_attempts in (0..BATCH_RETRIES).rev() {
+     match transfer_operations_batch(&batch, &self.remote_shard, disposed_hw.clone()).await {
src/tonic/api/points_api.rs (1)
46-57: Add descriptive type documentation to the wait parameter.

The new wait parameter is added to control hardware reporting behavior, but its purpose isn't immediately clear from the code. Consider adding a documentation comment that explains what this parameter does and its default behavior.
-    fn get_request_collection_hw_usage_counter(
-        &self,
-        collection_name: String,
-        wait: Option<bool>,
-    ) -> RequestHwCounter {
+    /// Returns a hardware usage counter for the given collection
+    /// 
+    /// # Parameters
+    /// * `collection_name` - Name of the collection to get metrics for
+    /// * `wait` - Controls whether hardware metrics are reported to API. 
+    ///            If `None` or not `Some(false)`, hardware reporting is enabled.
+    fn get_request_collection_hw_usage_counter(
+        &self,
+        collection_name: String,
+        wait: Option<bool>,
+    ) -> RequestHwCounter {
lib/api/src/grpc/conversions.rs (1)

2617-2635: Consider adding a test for the HardwareUsage to HardwareData conversion.

The conversion from HardwareUsage to HardwareData looks correct, but it's important to ensure that the CPU multiplier logic is properly tested since it involves a fixed multiplier of 1.

Add unit tests to verify that the conversion correctly handles all hardware usage metrics, especially the CPU measurement which uses a special type.
tests/consensus_tests/test_hw_measurement.py (4)
5-5: Avoid wildcard imports.

Using wildcard imports (from .utils import *) makes it difficult to know which symbols are imported and can lead to name collisions. Consider explicitly importing only the needed functions.
-from .utils import *
+from .utils import (
+    start_cluster, 
+    wait_collection_exists_and_active_on_all_peers,
+    check_collection_points_count,
+    wait_collection_points_count
+)
🧰 Tools

🪛 Ruff (0.8.2)

5-5: from .utils import * used; unable to detect undefined names

(F403)

39-40: Improve assertion to validate minimum and maximum values.

The assertion only checks that the change is at least a certain value, but it doesn't set an upper bound. Consider using the assert_with_upper_bound_error function to validate within a reasonable range.
-        expected_delta = (19 * 5)  # ~20 vectors times avg. 5 bytes payload
-        assert abs(peer_hw["payload_io_write"] - peer_hw_infos[peer_idx]["payload_io_write"]) >= expected_delta
+        expected_delta = (19 * 5)  # ~20 vectors times avg. 5 bytes payload
+        delta = abs(peer_hw["payload_io_write"] - peer_hw_infos[peer_idx]["payload_io_write"])
+        assert_with_upper_bound_error(delta, expected_delta)
95-95: Replace assert False with raise AssertionError.

Using assert False can be problematic because assertions are skipped when Python is run with the -O optimization flag. Use raise AssertionError to ensure the error is always raised.
-        assert False, f"Assertion {inp} >= {min_value} failed"
+        raise AssertionError(f"Assertion {inp} >= {min_value} failed")
🧰 Tools

🪛 Ruff (0.8.2)

95-95: Do not assert False (python -O removes these calls), raise AssertionError()

Replace assert False

(B011)

100-100: Replace assert False with raise AssertionError.

Similar to the previous comment, use raise AssertionError instead of assert False to ensure the error is always raised regardless of Python's optimization settings.
-        assert False, f"Assertion {inp} being below upperbound error of {upper_bound_error_percent}(={upper_bound}) failed."
+        raise AssertionError(f"Assertion {inp} being below upperbound error of {upper_bound_error_percent}(={upper_bound}) failed.")
🧰 Tools

🪛 Ruff (0.8.2)

100-100: Do not assert False (python -O removes these calls), raise AssertionError()

Replace assert False

(B011)

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between e7ac5f1 and 0e6aa45.

📒 Files selected for processing (51)

docs/grpc/docs.md (1 hunks)
lib/api/src/grpc/conversions.rs (3 hunks)
lib/api/src/grpc/proto/points.proto (1 hunks)
lib/api/src/grpc/proto/points_internal_service.proto (2 hunks)
lib/api/src/grpc/qdrant.rs (3 hunks)
lib/collection/src/collection/clean.rs (2 hunks)
lib/collection/src/collection/payload_index_schema.rs (3 hunks)
lib/collection/src/collection/point_ops.rs (9 hunks)
lib/collection/src/collection/sharding_keys.rs (3 hunks)
lib/collection/src/shards/local_shard/shard_ops.rs (2 hunks)
lib/collection/src/shards/queue_proxy_shard.rs (3 hunks)
lib/collection/src/shards/remote_shard.rs (13 hunks)
lib/collection/src/shards/replica_set/mod.rs (2 hunks)
lib/collection/src/shards/replica_set/update.rs (10 hunks)
lib/collection/src/tests/points_dedup.rs (1 hunks)
lib/collection/src/update_handler.rs (4 hunks)
lib/collection/tests/integration/collection_restore_test.rs (3 hunks)
lib/collection/tests/integration/collection_test.rs (9 hunks)
lib/collection/tests/integration/distance_matrix_test.rs (1 hunks)
lib/collection/tests/integration/grouping_test.rs (4 hunks)
lib/collection/tests/integration/lookup_test.rs (1 hunks)
lib/collection/tests/integration/multi_vec_test.rs (1 hunks)
lib/collection/tests/integration/pagination_test.rs (1 hunks)
lib/collection/tests/integration/snapshot_recovery_test.rs (1 hunks)
lib/common/common/src/counter/hardware_accumulator.rs (3 hunks)
lib/common/common/src/counter/hardware_counter.rs (4 hunks)
lib/common/common/src/counter/hardware_data.rs (1 hunks)
lib/common/common/src/counter/mod.rs (1 hunks)
lib/storage/src/content_manager/data_transfer.rs (2 hunks)
lib/storage/src/content_manager/toc/point_ops.rs (6 hunks)
lib/storage/src/content_manager/toc/request_hw_counter.rs (1 hunks)
src/actix/api/count_api.rs (1 hunks)
src/actix/api/discovery_api.rs (2 hunks)
src/actix/api/facet_api.rs (1 hunks)
src/actix/api/local_shard_api.rs (3 hunks)
src/actix/api/query_api.rs (3 hunks)
src/actix/api/recommend_api.rs (3 hunks)
src/actix/api/retrieve_api.rs (3 hunks)
src/actix/api/search_api.rs (5 hunks)
src/actix/api/update_api.rs (16 hunks)
src/actix/helpers.rs (1 hunks)
src/common/metrics.rs (1 hunks)
src/common/update.rs (38 hunks)
src/tonic/api/points_api.rs (28 hunks)
src/tonic/api/points_internal_api.rs (9 hunks)
src/tonic/api/update_common.rs (35 hunks)
tests/consensus_tests/assertions.py (2 hunks)
tests/consensus_tests/fixtures.py (2 hunks)
tests/consensus_tests/test_hw_measurement.py (1 hunks)
tests/consensus_tests/test_resharding.py (2 hunks)
tests/consensus_tests/utils.py (1 hunks)

🚧 Files skipped from review as they are similar to previous changes (38)

lib/common/common/src/counter/mod.rs
lib/collection/src/collection/sharding_keys.rs
src/actix/api/facet_api.rs
lib/collection/tests/integration/snapshot_recovery_test.rs
lib/collection/tests/integration/distance_matrix_test.rs
lib/api/src/grpc/proto/points.proto
lib/collection/tests/integration/multi_vec_test.rs
src/actix/api/discovery_api.rs
lib/storage/src/content_manager/toc/request_hw_counter.rs
lib/collection/tests/integration/pagination_test.rs
src/actix/api/count_api.rs
lib/collection/src/collection/payload_index_schema.rs
lib/collection/src/tests/points_dedup.rs
src/actix/api/retrieve_api.rs
tests/consensus_tests/test_resharding.py
tests/consensus_tests/utils.py
lib/collection/tests/integration/lookup_test.rs
src/actix/api/query_api.rs
lib/api/src/grpc/proto/points_internal_service.proto
lib/collection/tests/integration/collection_test.rs
lib/collection/tests/integration/grouping_test.rs
src/actix/api/search_api.rs
docs/grpc/docs.md
tests/consensus_tests/assertions.py
src/actix/api/local_shard_api.rs
src/actix/helpers.rs
lib/collection/src/shards/replica_set/mod.rs
lib/collection/tests/integration/collection_restore_test.rs
lib/collection/src/shards/local_shard/shard_ops.rs
src/actix/api/recommend_api.rs
lib/storage/src/content_manager/data_transfer.rs
lib/storage/src/content_manager/toc/point_ops.rs
lib/common/common/src/counter/hardware_data.rs
lib/collection/src/collection/clean.rs
src/actix/api/update_api.rs
src/common/metrics.rs
lib/collection/src/collection/point_ops.rs
lib/collection/src/shards/replica_set/update.rs

🧰 Additional context used

🪛 Ruff (0.8.2)

tests/consensus_tests/test_hw_measurement.py

5-5: from .utils import * used; unable to detect undefined names

(F403)

18-18: start_cluster may be undefined, or defined from star imports

(F405)

20-20: wait_collection_exists_and_active_on_all_peers may be undefined, or defined from star imports

(F405)

68-68: start_cluster may be undefined, or defined from star imports

(F405)

70-70: wait_collection_exists_and_active_on_all_peers may be undefined, or defined from star imports

(F405)

72-72: check_collection_points_count may be undefined, or defined from star imports

(F405)

80-80: wait_collection_points_count may be undefined, or defined from star imports

(F405)

95-95: Do not assert False (python -O removes these calls), raise AssertionError()

Replace assert False

(B011)

100-100: Do not assert False (python -O removes these calls), raise AssertionError()

Replace assert False

(B011)

⏰ Context from checks skipped due to timeout of 90000ms (10)

GitHub Check: test-snapshot-operations-s3-minio
GitHub Check: test-shard-snapshot-api-s3-minio
GitHub Check: test-consistency
GitHub Check: test
GitHub Check: test (macos-latest)
GitHub Check: test (windows-latest)
GitHub Check: test-consensus
GitHub Check: test (ubuntu-latest)
GitHub Check: test
GitHub Check: test

🔇 Additional comments (69)

lib/api/src/grpc/qdrant.rs (1)

5798-5799: Changes appropriately track hardware usage metrics in response objects.

These changes align well with the PR objective of adding hardware usage measurements to operations. The optional field ensures backward compatibility with existing clients.

Also applies to: 9303-9304, 9793-9794

lib/collection/src/update_handler.rs (3)

59-59: LGTM: Hardware measurements field addition

The addition of the hw_measurements field to OperationData looks good and follows the project's pattern for tracking hardware metrics.

689-690: LGTM: Field extraction in pattern matching

Pattern matching updated correctly to extract the hardware measurements field from the operation data.

702-707: LGTM: Proper passing of hardware measurement

The hardware measurement counter is correctly passed to the update function, replacing the disposable counter with the one from the operation data.

lib/collection/src/shards/queue_proxy_shard.rs (2)

692-693: LGTM: Function signature update

The function signature is properly updated to include the hardware measurement accumulator parameter.

704-711: LGTM: Hardware measurement in forward update

The hardware measurement accumulator is correctly passed to the forward_update method.

lib/common/common/src/counter/hardware_counter.rs (3)

101-103: LGTM: New getter method for CPU measurement

This method correctly returns the real CPU value with the multiplier applied, encapsulating the calculation logic.

160-178: LGTM: Implementation of get_hw_data method

The new method provides a clean way to get all hardware measurements as a single data structure. This improves code organization and maintainability.

180-182: LGTM: Simplified accumulator merging

The refactored merge_to_accumulator method is much cleaner, using the new get_hw_data method to obtain all measurements at once.
src/tonic/api/points_api.rs (1)

55-56: Consider renaming the variable for clarity.

The variable name waiting doesn't clearly convey its purpose. Since it's used to determine whether to report metrics to the API, a name like report_to_api would better describe its function.
-    let waiting = wait != Some(false);
-    RequestHwCounter::new(counter, self.service_config.hardware_reporting() && waiting)
+    let report_to_api = wait != Some(false);
+    RequestHwCounter::new(counter, self.service_config.hardware_reporting() && report_to_api)
lib/common/common/src/counter/hardware_accumulator.rs (3)

56-71: Good abstraction of hardware data accumulation.

This implementation nicely centralizes the hardware data accumulation logic, making the code more maintainable and reducing duplication. The struct destructuring provides good readability.

136-140: Improved API with generic parameter.

The refactoring to accept a generic type parameter that implements Into<HardwareData> makes the API more flexible and extensible, while reducing code duplication. This is a good design choice.

143-148: Consistent implementation with the accumulate method.

The consistent implementation between accumulate and accumulate_request maintains a clean API while providing specialized functionality.

lib/api/src/grpc/conversions.rs (2)

2127-2137: Hardware usage tracking integration for response types.

The inclusion of the usage field in conversions between internal and external response types ensures hardware telemetry data is preserved during transformations.

2142-2153: Consistent implementation for the reverse conversion.

The bidirectional conversions consistently handle the usage field, ensuring data integrity when converting between different response types.

tests/consensus_tests/test_hw_measurement.py (1)

31-31: Good test design for shard-awareness.

Using N_SHARDS * 20 ensures that you test a proportional number of points for each shard, which is a good approach for cluster-wide testing.

src/tonic/api/update_common.rs (16)

5-5: Added imports for hardware usage tracking

These import statements correctly bring in the new HardwareUsage, HwMeasurementAcc, and RequestHwCounter types for usage reporting. No issues found.

Also applies to: 22-22, 28-28

39-46: Added hardware usage parameter & final usage reporting in upsert

The added request_hw_counter parameter, retrieval via request_hw_counter.get_counter(), and final usage reporting through request_hw_counter.to_grpc_api() seem consistent and correct. This approach ensures the function captures hardware usage metrics accurately.

Also applies to: 75-80

84-91: Introduced request_hw_counter for the delete function

Similar to upsert, these additions correctly pass the hardware counter through the call chain, culminating in the usage included in the response. Implementation appears sound, and no issues are found regarding concurrency or error handling.

Also applies to: 118-123

127-134: Hardware usage integration in update_vectors

The request_hw_counter parameter and usage reporting fit well into the function flow. The usage is properly collected and surfaced in the response object with no visible logic or concurrency issues.

Also applies to: 175-180

184-190: delete_vectors function extended with request_hw_counter

This function consistently follows the established pattern of requesting a counter, performing operations, and then embedding usage results in the response. The logic is clear and maintains code consistency.

Also applies to: 225-230

234-240: Integrated hardware usage into set_payload

The newly added parameter and final usage reporting lines appear correct and follow the same design pattern as other update functions. No concerns to raise.

Also applies to: 273-279

282-288: overwrite_payload now includes hardware usage

Adopting the same pattern of obtaining the hardware accumulator and returning it in the final response. This addition is consistent with the rest of the update methods.

Also applies to: 321-327

330-336: Extended delete_payload with usage tracking

Extends usage tracking to payload deletion as well. The logic is straightforward, and no issues with concurrency or error handling are apparent.

Also applies to: 367-373

376-382: clear_payload hardware usage parameter propagation

In line with the other functions, the hardware usage is fetched, utilized, and returned. The design remains consistent.

Also applies to: 408-414

417-424: Comprehensive hardware usage injection in update_batch

Multiple lines reflect the propagation of the request_hw_counter through the entire batch update process for each operation variant. The logic is coherent, bridging the usage data to each individual operation call. Ensure thorough testing to verify correct accumulation of usage metrics across batched updates.

Also applies to: 457-458, 474-475, 499-500, 525-526, 548-549, 567-568, 589-590, 612-613, 628-629, 648-649

664-670: Extended create_field_index with hardware metrics

Passing in request_hw_counter and finalizing usage with to_grpc_api() is consistent. No immediate concerns.

Also applies to: 696-697, 701-702

705-706: create_field_index_internal usage logic

Employs a disposable accumulator in line 730 and omits usage in the response, consistent with the inline comment indicating "API unmeasured". This approach aligns with the existing pattern where internal calls may not measure usage if not needed.

Also applies to: 730-731, 734-735

738-767: Added disposable hardware usage for delete_field_index

Lines 761-762 and 765-766 use a disposable accumulator again to mark unmeasured operations. The approach is correct for indexing concerns, and no concurrency or correctness issues stand out.

769-796: Disposable hardware usage in delete_field_index_internal

Similar to create_field_index_internal, the usage is intentionally omitted from the final response. The change is consistent with other internal index operations.

798-846: Extended sync method with disposable hardware usage

Lines 840-841 and 844-845 show usage injection as “API unmeasured.” This design choice is aligned with the existing pattern for internal synchronization. No issues found.

848-859: points_operation_response_internal extended to accommodate usage

The final structure now includes an optional HardwareUsage field. This is central to presenting collected metrics from the added request_hw_counter. Implementation looks correct.

src/tonic/api/points_internal_api.rs (18)

62-62: Added optional hardware usage field in query_batch_internal

The request_hw_data parameter and usage: request_hw_data.to_grpc_api() insertion are correctly implemented to measure hardware activity for batch queries. Looks consistent with the rest of the codebase.

Also applies to: 103-103

112-113: facet_counts_internal updated to report usage

The new request parameter request_hw_data and the final usage assignment are properly placed. This ensures facet count operations also capture hardware usage.

Also applies to: 151-151

191-196: upsert function changes

Extraction of the internal request, retrieval of collection usage counter, and final call to upsert with hardware metrics are consistently structured. No issues spotted.

Also applies to: 203-203

222-227: Hardware usage in delete function

The newly introduced hw_metrics usage aligns with the pattern of internal request extraction and usage accumulation. Implementation matches the design established in the other methods.

Also applies to: 234-235

253-258: update_vectors usage injection

Extracts the request, obtains the hardware counter, then calls the standard method with usage as an argument. The approach is sound.

Also applies to: 265-266

282-287: Added hardware usage to delete_vectors

Adheres to the same pattern: an internal request is extracted, and hardware usage is tracked. The changes look good.

Also applies to: 294-295

310-315: set_payload changes

Consistently uses get_request_collection_hw_usage_counter_for_internal to gather metrics and passes them downstream. No concerns found.

Also applies to: 321-322

338-343: overwrite_payload method instrumentation

Similarly, these lines incorporate hardware usage retrieval and injection into the overwrite pipeline. Implementation is uniform with the rest of the code.

Also applies to: 349-350

366-371: Enabling usage reporting for delete_payload

No discrepancies in usage injection for payload deletion. The approach is entirely consistent with the other internal methods.

Also applies to: 377-378

394-399: clear_payload usage tracking

Clear payload now includes hardware usage. The logic remains in sync with established patterns.

Also applies to: 405-406

471-482: Core search batch usage

This block obtains a hardware counter for the collection and uses it while performing a core_search_list. Capturing usage for these internal search calls enables consistent telemetry.

507-513: Adding usage tracking for recommend

Hardware usage retrieval and injection are now present for internal recommendation operations. This is aligned with the other newly instrumented calls.

535-545: Extended scroll functionality with HW counter

Grabs the hardware usage counter for scrolling and includes it in the subsequent call. The pattern is uniform and correct.

565-575: Included usage measurement in get

This fetch route is now consistent with other paths, capturing hardware usage data properly.

592-604: count function updated for usage metrics

Acquires the new usage counter, then calls the underlying method. Straightforward, with no detected issues.

611-627: sync remains uninstrumented via disposable usage

The sync logic does not seem to incorporate final usage results in the response, which matches the pattern established in other places for “unmeasured” internal calls. This choice is consistent and documented.

645-655: query_batch capturing usage

Mimics the pattern of obtaining hardware usage within the collection context for batch queries. Implementation is uniform.

665-669: facet route usage injection

This matches the previously changed facet_counts_internal, ensuring that internal calls also capture usage metrics. No issues found.

src/common/update.rs (15)

14-14: Added HwMeasurementAcc import

Brings in the hardware accumulator to support hardware usage tracking. Implementation detail looks standard.

230-240: do_upsert_points now includes hardware measurements

Receives a hw_measurement_acc and passes it into the core update flow. The design is consistent and should reliably capture usage data.

272-282: do_delete_points usage accumulation

This deletion logic parallels the upsert approach by carrying hw_measurement_acc and forwarding it to update. No issues found.

307-317: Propagated accumulator into do_update_vectors

Again, the changes integrate hw_measurement_acc into the existing flow, with consistent usage. Implementation looks correct.

342-405: do_delete_vectors updated for hardware usage

Ensures that both filter-based and point-based vector deletions track usage properly. Multiple calls to update are made, each receiving the hardware accumulator. Good structure.

407-443: do_set_payload gains hardware usage parameter

This addition is straightforward, matching the existing pattern of passing hw_measurement_acc through update. No concerns.

445-482: do_overwrite_payload with usage tracking

Same design approach for overwriting payload. The logic for capturing usage remains uniform.

484-518: do_delete_payload instrumentation

No special edge cases appear unaddressed. This method reliably collects usage metrics through the hw_measurement_acc.

520-551: do_clear_payload aligned with usage accumulation

Maintains consistency with other payload operations, passing usage upward into update.

553-570: Hardware usage in do_batch_update_points

Each sub-operation within the batch references the cloned hw_measurement_acc, preventing usage data loss. This design is coherent for batched updates.

Also applies to: 576-580

672-722: do_create_index flow includes usage metrics

After the consensus operation, the do_create_index_internal call with hw_measurement_acc ensures indexing also tracks hardware usage.

724-751: Usage integrated in do_create_index_internal

As with other methods, it routes the accumulator to update. Straightforward addition with no red flags.

753-791: do_delete_index instrumentation

The changes ensure index deletion is also measured, following the same pattern as create index.

793-816: do_delete_index_internal tracking usage

Continues the pattern of passing hw_measurement_acc into the update call for internal index deletions.

818-878: update function now accepts hw_measurement_acc

The main entry point for updates. The approach of passing the accumulator to toc.update centralizes usage tracking. Implementation is coherent, with no concurrency or performance concerns evident.

lib/collection/src/shards/remote_shard.rs (4)

226-227: Function signature enhancements for hardware usage tracking

The additional hw_measurement_acc parameter has been consistently added to all necessary operation methods. This will allow for comprehensive hardware usage tracking across remote shard operations.

Also applies to: 251-252, 676-677, 1016-1017

503-505: Clean hardware usage accumulation

Good implementation of hardware usage accumulation from point operation responses. The code properly checks for the presence of usage data before accumulating it.

682-690: Proper parameter passing to execute_update_operation

The method call has been properly updated to include the hardware measurement accumulator, maintaining consistent parameter order. The formatting change with each parameter on a new line also improves readability.

737-739: Consistent hardware usage accumulation across operation types

Hardware usage accumulation has been consistently implemented across all operation types (scroll, search, count, retrieve, query_batch, facet). The pattern of checking for usage data before accumulation is maintained throughout.

Also applies to: 811-813, 876-878, 925-927, 982-984, 1052-1054

* Measure update operations hardware IO * Add support for distributed setups * also measure update_local * Add consensus tests for HW metrics of update operations * add test for upserting without waiting * Disable HW usage reporting when not waiting for update API * Review remarks * Fix resharding collecting hw measurements * Fix metric type * New struct HardwareData for better accumulation * Ensure we always apply CPU multiplier * Apply suggestions from code review * Update src/actix/api/update_api.rs Co-authored-by: Tim Visée <tim+github@visee.me> * Fix assert_with_upper_bound_error threshold calculation. * Clarifying why we don't measure shard cleanup --------- Co-authored-by: Tim Visée <tim+github@visee.me>

github-actions bot mentioned this pull request Jan 31, 2025

Flaky test payload_index_test::test_keyword_facet #5059

Closed

JojiiOfficial force-pushed the measure_update_op_io branch from f7e453a to 657951b Compare January 31, 2025 12:00

github-actions bot mentioned this pull request Jan 31, 2025

Flaky test multivector_filtrable_hnsw_test::test_multi_filterable_hnsw::case_5_recommend_eq #5902

Closed

JojiiOfficial force-pushed the measure_update_op_io branch from 0d4e450 to e234949 Compare January 31, 2025 12:15

Base automatically changed from measure_vector_io_reads to dev February 3, 2025 10:56

github-actions bot mentioned this pull request Feb 3, 2025

Flaky test hnsw_discover_test::hnsw_discover_precision #2973

Open

JojiiOfficial force-pushed the measure_update_op_io branch from 98f2eaa to d73fe4f Compare February 3, 2025 15:33

JojiiOfficial requested a review from generall February 3, 2025 15:35

JojiiOfficial mentioned this pull request Feb 4, 2025

Measure io writes for vector upsertions #5944

Merged

JojiiOfficial requested a review from timvisee February 6, 2025 11:58

timvisee reviewed Feb 7, 2025

View reviewed changes

generall force-pushed the measure_update_op_io branch from d73fe4f to 353f5f7 Compare February 13, 2025 15:46

ffuugoo force-pushed the measure_update_op_io branch from 7b8d7e6 to d57cc2f Compare February 18, 2025 11:07

JojiiOfficial force-pushed the measure_update_op_io branch from d57cc2f to 845549e Compare March 3, 2025 10:15

coderabbitai bot reviewed Mar 4, 2025

View reviewed changes

src/common/metrics.rs Outdated Show resolved Hide resolved

JojiiOfficial force-pushed the measure_update_op_io branch 2 times, most recently from fe513cd to 2d0886a Compare March 4, 2025 14:52

coderabbitai bot reviewed Mar 4, 2025

View reviewed changes

JojiiOfficial force-pushed the measure_update_op_io branch from 2d0886a to c357963 Compare March 4, 2025 14:56

coderabbitai bot reviewed Mar 4, 2025

View reviewed changes

JojiiOfficial requested a review from timvisee March 5, 2025 09:43

timvisee reviewed Mar 6, 2025

View reviewed changes

coderabbitai bot reviewed Mar 6, 2025

View reviewed changes

tests/consensus_tests/test_hw_measurement.py Outdated Show resolved Hide resolved

github-actions bot mentioned this pull request Mar 6, 2025

Flaky test tests::hw_metrics::test_hw_metrics_cancellation #6070

Open

coderabbitai bot reviewed Mar 6, 2025

View reviewed changes

JojiiOfficial requested a review from timvisee March 6, 2025 12:01

JojiiOfficial and others added 15 commits March 6, 2025 13:24

Measure update operations hardware IO

fa742fb

Add support for distributed setups

549a2e6

also measure update_local

a5d9fd5

Add consensus tests for HW metrics of update operations

315a6eb

add test for upserting without waiting

539aaf6

Disable HW usage reporting when not waiting for update API

e3c18ec

Review remarks

377b59b

Fix resharding collecting hw measurements

3999c12

Fix metric type

32ab975

New struct HardwareData for better accumulation

0a45325

Ensure we always apply CPU multiplier

50242c6

Apply suggestions from code review

f4f04bb

Update src/actix/api/update_api.rs

d248cd7

Co-authored-by: Tim Visée <tim+github@visee.me>

Fix assert_with_upper_bound_error threshold calculation.

50ff32d

Clarifying why we don't measure shard cleanup

0e6aa45

JojiiOfficial force-pushed the measure_update_op_io branch from e7ac5f1 to 0e6aa45 Compare March 6, 2025 12:24

coderabbitai bot reviewed Mar 6, 2025

View reviewed changes

timvisee approved these changes Mar 6, 2025

View reviewed changes

JojiiOfficial merged commit 1fded09 into dev Mar 6, 2025
17 checks passed

JojiiOfficial deleted the measure_update_op_io branch March 6, 2025 14:03

This was referenced Oct 27, 2025

Implement logging of slow requests for scroll, facet, count, and retrieve #7291

Merged

Respect payload filters while replicating point #7362

Merged

This was referenced Nov 4, 2025

Add Truncate API #7496

Open

Limit the timeout parameter for all requests to 1h #7500

Closed

coderabbitai bot mentioned this pull request Dec 9, 2025

refactor replica set #7706

Merged

coderabbitai bot mentioned this pull request Dec 22, 2025

feat/shard-ops-timeout #7750

Merged

9 tasks

coderabbitai bot mentioned this pull request Jan 6, 2026

feat/shard ops timeout backup #7871

Closed

9 tasks

coderabbitai bot mentioned this pull request Mar 10, 2026

log cpu utilization #8342

Merged

wait parameter:	wait in API	report in this function
None	false	true
Some(false)	false	false
Some(true)	true	true

Conversation

JojiiOfficial commented Jan 31, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

generall commented Feb 2, 2025

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

timvisee Feb 7, 2025

Choose a reason for hiding this comment

Uh oh!

generall Feb 13, 2025

Choose a reason for hiding this comment

Uh oh!

generall Feb 13, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

JojiiOfficial Mar 3, 2025

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

timvisee Feb 7, 2025

Choose a reason for hiding this comment

Uh oh!

Uh oh!

generall commented Feb 13, 2025

Uh oh!

ffuugoo commented Feb 18, 2025

Uh oh!

coderabbitai bot commented Mar 3, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Chat

CodeRabbit Commands (Invoked using PR comments)

Other keywords and placeholders

CodeRabbit Configuration File (.coderabbit.yaml)

Documentation and Community

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

timvisee left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

JojiiOfficial commented Jan 31, 2025 •

edited

Loading

generall Feb 13, 2025 •

edited

Loading

coderabbitai bot commented Mar 3, 2025 •

edited

Loading

CodeRabbit Configuration File (`.coderabbit.yaml`)