clp-core: Add custom headers support for NetworkReader by anlowee · Pull Request #564 · y-scope/clp

anlowee · 2024-10-24T21:37:37Z

Description

Modified constructors of NetworkReader, DownloaderThread and CurlDownloadHandler to support custom headers passed by users.

Validation performed

Added the Authorization header with a Bearer token to fetch a file from remote, validating the file's size.

Summary by CodeRabbit

New Features
- Enhanced CurlDownloadHandler and NetworkReader constructors to accept custom HTTP headers, allowing for more flexible HTTP requests.
- Custom headers can now be specified during the instantiation of both CurlDownloadHandler and NetworkReader.
- Introduced a minimum table size option in the command-line interface for compression settings.
- Added functionality for the PackedStreamReader class to manage reading packed streams and their metadata.
Tests
- Added new test cases for NetworkReader to validate the handling of custom HTTP headers, ensuring expected headers are included and illegal ones are excluded.
Documentation
- Updated constructor signatures to reflect the addition of the custom headers parameter.
- Generalized macOS setup instructions in the documentation to reflect broader compatibility.

coderabbitai · 2024-10-24T21:37:44Z

Walkthrough

The changes introduce a new parameter, std::optional<std::unordered_map<std::string, std::string>> const& http_header_kv_pairs, to the constructors of both the CurlDownloadHandler and NetworkReader classes, allowing users to specify custom HTTP headers during instantiation. The DownloaderThread class is updated to accommodate this new parameter, enabling it to pass custom headers to the CurlDownloadHandler. The overall functionality and control flow of the classes remain unchanged, with the primary focus on enhancing header customization for HTTP requests.

Changes

File	Change Summary
`components/core/src/clp/CurlDownloadHandler.cpp`	Added a new parameter `http_header_kv_pairs` to the `CurlDownloadHandler` constructor and implemented logic to append valid headers to `m_http_headers`.
`components/core/src/clp/CurlDownloadHandler.hpp`	Updated constructor signature to include `http_header_kv_pairs`, with a default value of `std::nullopt`. Added `<optional>` and `<unordered_map>` header inclusions.
`components/core/src/clp/NetworkReader.cpp`	Updated `NetworkReader` constructor to include `http_header_kv_pairs` and modified `DownloaderThread` to utilize this parameter.
`components/core/src/clp/NetworkReader.hpp`	Modified constructor signatures for `NetworkReader` and `DownloaderThread` to accept `http_header_kv_pairs`, with a default value of `std::nullopt`. Added `<unordered_map>` header inclusion.
`components/core/tests/test-NetworkReader.cpp`	Added a new test case `network_reader_with_http_header_kv_pairs` to validate the handling of custom HTTP headers in `NetworkReader`.

Sequence Diagram(s)

sequenceDiagram
    participant User
    participant NetworkReader
    participant DownloaderThread
    participant CurlDownloadHandler

    User->>NetworkReader: Create instance with http_header_kv_pairs
    NetworkReader->>DownloaderThread: Pass http_header_kv_pairs
    DownloaderThread->>CurlDownloadHandler: Initialize with http_header_kv_pairs
    CurlDownloadHandler->>CurlDownloadHandler: Append valid headers

Possibly related PRs

clp-s: Implement table packing #466: This PR modifies the ArchiveReader class, including changes to how schema tables are read, which is relevant to the modifications made in the CurlDownloadHandler regarding HTTP headers.
ci: Switch GitHub macOS build workflow to use macos-13 (x86) and macos-14 (ARM) runners. #566: This PR updates the macOS build workflow, which may indirectly affect the compilation and testing of the changes made in the main PR, especially if the changes involve dependencies or configurations sensitive to the build environment.

Thank you for using CodeRabbit. We offer it for free to the OSS community and would appreciate your support in helping us grow. If you find it useful, would you consider giving us a shout-out on your favorite social media?

❤️ Share

🪧 Tips

Chat

There are 3 ways to chat with CodeRabbit:

Review comments: Directly reply to a review comment made by CodeRabbit. Example:
- I pushed a fix in commit <commit_id>, please review it.
- Generate unit testing code for this file.
- Open a follow-up GitHub issue for this discussion.
Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query. Examples:
- @coderabbitai generate unit testing code for this file.
- @coderabbitai modularize this function.
PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
- @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
- @coderabbitai read src/utils.ts and generate unit testing code.
- @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.
- @coderabbitai help me debug CodeRabbit configuration file.

Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments.

CodeRabbit Commands (Invoked using PR comments)

@coderabbitai pause to pause the reviews on a PR.
@coderabbitai resume to resume the paused reviews.
@coderabbitai review to trigger an incremental review. This is useful when automatic reviews are disabled for the repository.
@coderabbitai full review to do a full review from scratch and review all the files again.
@coderabbitai summary to regenerate the summary of the PR.
@coderabbitai resolve resolve all the CodeRabbit review comments.
@coderabbitai configuration to show the current CodeRabbit configuration for the repository.
@coderabbitai help to get help.

Other keywords and placeholders

Add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.
Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
Add @coderabbitai anywhere in the PR title to generate the title automatically.

CodeRabbit Configuration File (`.coderabbit.yaml`)

You can programmatically configure CodeRabbit by adding a .coderabbit.yaml file to the root of your repository.
Please see the configuration documentation for more information.
If your editor has YAML language server enabled, you can add the path at the top of this file to enable auto-completion and validation: # yaml-language-server: $schema=https://coderabbit.ai/integrations/schema.v2.json

Documentation and Community

Visit our Documentation for detailed information on how to use CodeRabbit.
Join our Discord Community to get help, request features, and share feedback.
Follow us on X/Twitter for updates and announcements.

coderabbitai

Actionable comments posted: 2

🧹 Outside diff range and nitpick comments (3)

components/core/src/clp/CurlDownloadHandler.hpp (1)
67-68: Documentation needed for custom_headers parameter

The constructor's documentation block should be updated to include details about the custom_headers parameter, including its purpose and any restrictions on header types.

Add documentation like this:
     * @param overall_timeout Maximum time that the transfer may take. Note that this includes
     * `connection_timeout`. Doc: https://curl.se/libcurl/c/CURLOPT_TIMEOUT.html
+    * @param custom_headers Map of custom HTTP headers to be included in the request
components/core/src/clp/NetworkReader.hpp (2)
107-108: Add parameter documentation for custom_headers

The constructor's documentation block should be updated to include the new custom_headers parameter. This helps maintain API documentation completeness.

Add this to the existing parameter documentation:
 * @param buffer_pool_size The required number of buffers in the buffer pool.
 * @param buffer_size The size of each buffer in the buffer pool.
+* @param custom_headers Map of custom HTTP headers to be included in the request.
248-252: Add parameter documentation for custom_headers

The DownloaderThread constructor's documentation block should be updated to include the new parameter.

Add this to the existing parameter documentation:
         * @param reader
         * @param offset Index of the byte at which to start the download.
         * @param disable_caching Whether to disable caching.
+        * @param custom_headers Map of custom HTTP headers to be included in the request.
Also applies to: 265-265

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Files that changed from the base of the PR and between 7043d04 and dd05e19.

📒 Files selected for processing (4)

components/core/src/clp/CurlDownloadHandler.cpp (3 hunks)
components/core/src/clp/CurlDownloadHandler.hpp (2 hunks)
components/core/src/clp/NetworkReader.cpp (3 hunks)
components/core/src/clp/NetworkReader.hpp (4 hunks)

🔇 Additional comments (7)

components/core/src/clp/CurlDownloadHandler.cpp (2)
5-5: LGTM: Required include for std::map

The addition of the map header is necessary for the custom headers parameter.

23-24: 🛠️ Refactor suggestion

Consider adding parameter validation

While the implementation accepts custom headers, it might be beneficial to validate the map's contents before processing.

Let's check if there are any existing parameter validations in the codebase:

Consider adding validation:
        std::chrono::seconds overall_timeout,
        std::map<std::string, std::string> const& custom_headers
)
        : m_error_msg_buf{std::move(error_msg_buf)} {
+    // Validate custom headers
+    for (auto const& [key, value] : custom_headers) {
+        if (key.empty() || value.empty()) {
+            throw std::invalid_argument("Empty key or value in custom headers");
+        }
+    }
components/core/src/clp/CurlDownloadHandler.hpp (1)

7-7: LGTM: Include statement properly added

The <map> header is correctly included and properly ordered alphabetically with other standard headers.
components/core/src/clp/NetworkReader.cpp (2)
121-122: 🛠️ Refactor suggestion

Consider adding header validation and documentation.

While the implementation correctly handles custom headers, consider:

Adding validation for header names and values to prevent injection of malicious or invalid headers

Documenting the expected format and any restrictions for header names/values in the constructor's documentation

Let's check if there are any existing header validation patterns in the codebase:

Consider adding validation like this:
void validate_headers(std::map<std::string, std::string> const& headers) {
    for (auto const& [name, value] : headers) {
        if (name.empty() || value.empty()) {
            throw std::invalid_argument("Header name and value cannot be empty");
        }
        // Add more validation as needed
    }
}
Also applies to: 134-134

219-220: 🛠️ Refactor suggestion

Consider enhancing error handling for header-related failures.

The current implementation might benefit from specific error handling for header-related failures during CURL operations. Consider:

Adding specific error codes or exceptions for header-related failures

Logging header-related errors separately for better debugging

Let's check existing error handling patterns:

Consider extending error handling like this:
try {
    CurlDownloadHandler curl_handler{/*...*/};
    auto const ret_code{curl_handler.perform()};
    if (CURLE_UNSUPPORTED_PROTOCOL == ret_code) {
        // Log specific header-related error
        logger.error("Failed to set custom headers: {}", m_reader.m_curl_error_msg_buf);
    }
    // ... rest of the code
}
components/core/src/clp/NetworkReader.hpp (2)

9-9: LGTM: Header inclusion is properly placed

The addition of <map> header is necessary for the custom headers functionality and follows the file's include ordering convention.

248-252: LGTM: Constructor parameter formatting improves readability

The multi-line parameter formatting enhances code readability while maintaining consistent style.

components/core/src/clp/CurlDownloadHandler.cpp

components/core/src/clp/CurlDownloadHandler.hpp

Co-authored-by: wraymo <37269683+wraymo@users.noreply.github.com> Co-authored-by: Kirk Rodrigues <2454684+kirkrodrigues@users.noreply.github.com> Co-authored-by: wraymo <raywangv@gmail.com>

coderabbitai

Actionable comments posted: 2

🧹 Outside diff range and nitpick comments (3)

components/core/tests/test-NetworkReader.cpp (2)
197-202: Refactor magic number and string concatenation

Consider these improvements:

Define NR_HEADERS as a named constant

Use string formatting or a more efficient string building method
-    const int NR_HEADERS = 10;
-    for (int i = 0; i < NR_HEADERS; i++) {
-      std::string key = "Unit-Test-Key" + std::to_string(i);
-      std::string value = "Unit-Test-Value" + std::to_string(i);
+    static constexpr int kNumTestHeaders = 10;
+    for (int i = 0; i < kNumTestHeaders; i++) {
+      std::string key = std::format("Unit-Test-Key{}", i);
+      std::string value = std::format("Unit-Test-Value{}", i);
208-214: Consider using raw string literals for special characters

Using raw string literals would improve readability when testing headers with special characters.
-    custom_headers["A\nNewline"] = "xx";
-    custom_headers["Legal-Name2"] = "control-char\x01";
+    custom_headers[R"(A\nNewline)"] = "xx";
+    custom_headers["Legal-Name2"] = R"(control-char\x01)";
components/core/src/clp/NetworkReader.hpp (1)

Line range hint 98-109: Consider documenting security implications of custom headers

While the implementation correctly uses const reference for the custom_headers parameter, consider adding documentation about:

Security implications of storing authentication tokens in headers

Best practices for handling sensitive header data

Lifetime management of sensitive header content

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Files that changed from the base of the PR and between dd05e19 and f134870.

📒 Files selected for processing (4)

components/core/src/clp/CurlDownloadHandler.cpp (3 hunks)
components/core/src/clp/CurlDownloadHandler.hpp (3 hunks)
components/core/src/clp/NetworkReader.hpp (5 hunks)
components/core/tests/test-NetworkReader.cpp (2 hunks)

🚧 Files skipped from review as they are similar to previous changes (1)

components/core/src/clp/CurlDownloadHandler.hpp

🔇 Additional comments (8)

components/core/src/clp/CurlDownloadHandler.cpp (3)

5-8: LGTM: Required headers added appropriately

The new includes are properly ordered and necessary for the custom headers implementation.

25-26: LGTM: Constructor parameter properly added

The custom_headers parameter is appropriately added as a const reference, which is efficient for passing map data.

63-67: LGTM: Case-insensitive header filtering implemented

The implementation properly handles case-insensitive comparison of reserved headers.

components/core/tests/test-NetworkReader.cpp (4)

5-5: LGTM: Appropriate header inclusion

The addition of the map header is necessary and correctly placed with other STL includes.

195-197: Consider mocking external HTTP service

The test depends on httpbin.org, which could cause test failures if the service is unavailable. Consider:

Using a local mock server for testing

Adding retry logic or proper error handling

Documenting the external dependency

203-207: LGTM: Good coverage of protected headers

Excellent test coverage ensuring that critical headers (Range, Cache-Control, Pragma) cannot be overridden by custom headers.

233-241: LGTM: Comprehensive validation of header restrictions

Good coverage ensuring that:

Protected headers are not included

Invalid header names are rejected

Headers with special characters are handled correctly
components/core/src/clp/NetworkReader.hpp (1)
Line range hint 248-267: Verify thread safety and memory handling of custom headers

The implementation looks good, but consider the following aspects:

Thread Safety: The current implementation correctly handles thread safety by storing a copy of the headers map.

Memory Management: Consider clearing sensitive header data (like authentication tokens) after the download is complete.

Let's verify the usage of these headers:

✅ Verification successful

Let me verify how the custom headers are used in the codebase, particularly in the NetworkReader class.

Custom headers implementation is secure and well-managed

The implementation correctly handles custom headers with proper security considerations:

Headers are passed by const reference and stored as a copy in the thread for thread safety

Reserved headers (Range, Cache-Control, Pragma) are protected from being overridden

Headers are validated and sanitized in CurlDownloadHandler before use

Headers are scoped to the download thread's lifetime and automatically cleaned up

No security or memory management issues found.
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:
#!/bin/bash
# Search for potential sensitive header handling in the codebase
rg -i "authorization|bearer|token|secret|password" --type cpp
Length of output: 119815

Script:
#!/bin/bash
# Search for custom_headers usage in NetworkReader and related files
rg -A 5 -B 5 "custom_headers" --type cpp
Length of output: 15079

components/core/src/clp/CurlDownloadHandler.cpp

components/core/tests/test-NetworkReader.cpp

coderabbitai

Actionable comments posted: 2

🧹 Outside diff range and nitpick comments (1)

components/core/src/clp/CurlDownloadHandler.cpp (1)

26-26: Include the new parameter in the function documentation

Ensure that the documentation for CurlDownloadHandler is updated to reflect the addition of the custom_headers parameter, so that users are aware of the new functionality.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Files that changed from the base of the PR and between f134870 and fc7aa7d.

📒 Files selected for processing (2)

components/core/src/clp/CurlDownloadHandler.cpp (3 hunks)
components/core/tests/test-NetworkReader.cpp (2 hunks)

🚧 Files skipped from review as they are similar to previous changes (1)

components/core/tests/test-NetworkReader.cpp

🔇 Additional comments (1)

components/core/src/clp/CurlDownloadHandler.cpp (1)

77-82: ⚠️ Potential issue

Optimize regex usage and update patterns to comply with RFC 7230

The regex patterns for header names and values are being compiled in each iteration of the loop. To improve performance, declare the regex objects as static const outside the loop to avoid recompilation. Additionally, update the header_name_pattern to comply with RFC 7230, which specifies that header field names must start with a letter and can contain letters, digits, and hyphens (-), but not underscores (_).

Apply this diff to address the concerns:

+        // RFC 7230 compliant header name and value patterns
+        static const std::regex header_name_pattern("^[A-Za-z][A-Za-z0-9-]*$");
+        static const std::regex header_value_pattern("^[\\x20-\\x7E]+$");
         for (const auto& [key, value] : custom_headers) {
             // Convert to lowercase for case-insensitive comparison
             std::string lower_key = key;
             std::transform(lower_key.begin(), lower_key.end(), lower_key.begin(), ::tolower);
             
             if (reserved_headers.end() == reserved_headers.find(lower_key)) {
-                // Filter out illegal header names and header values by regex
-                // Can contain alphanumeric characters (A-Z, a-z, 0-9), hyphens (`-`), and underscores
-                // (`_`)
-                std::regex header_name_pattern("^[A-Za-z0-9_-]+$");
-                // Must consist of printable ASCII characters (values between 0x20 and 0x7E)
-                std::regex header_value_pattern("^[\\x20-\\x7E]*$");
                 if (std::regex_match(key, header_name_pattern) && 
                     std::regex_match(value, header_value_pattern)) {
                     m_http_headers.append(key + ": " + value);
                 }
             }
         }

components/core/src/clp/CurlDownloadHandler.cpp

coderabbitai

Actionable comments posted: 2

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Files that changed from the base of the PR and between fc7aa7d and c8acc12.

📒 Files selected for processing (3)

components/core/src/clp/CurlDownloadHandler.cpp (3 hunks)
components/core/src/clp/NetworkReader.cpp (3 hunks)
components/core/tests/test-NetworkReader.cpp (2 hunks)

🚧 Files skipped from review as they are similar to previous changes (2)

components/core/src/clp/NetworkReader.cpp
components/core/tests/test-NetworkReader.cpp

🔇 Additional comments (4)

components/core/src/clp/CurlDownloadHandler.cpp (4)

5-8: Include directives updated appropriately for custom headers support

The addition of <map> and <regex> is necessary for handling custom headers and performing validations.

25-26: Constructor signature correctly extended with custom_headers

Including std::map<std::string, std::string> const& custom_headers enhances the class's ability to accept and process custom headers.

63-67: Reserved headers defined appropriately

Defining kReservedHeaders as a local constant improves readability, and the reserved headers are correctly specified to prevent overriding critical HTTP headers.

71-71: Proper character conversion in std::tolower

Casting characters to unsigned char in the lambda function ensures correct behaviour and avoids undefined behaviour with negative char values.

components/core/src/clp/CurlDownloadHandler.cpp

coderabbitai

Actionable comments posted: 1

🧹 Outside diff range and nitpick comments (1)

components/core/src/clp/CurlDownloadHandler.cpp (1)
76-81: Consider logging invalid headers for debugging.

When headers are rejected due to invalid format, it would be helpful to log this information for debugging purposes.

Consider adding debug logging:
             if (std::regex_match(key, header_name_pattern)
                 && std::regex_match(value, header_value_pattern))
             {
                 m_http_headers.append(key + ": " + value);
+            } else {
+                // Log invalid header format if debug logging is enabled
+                CLOG(DEBUG, "network") << "Rejected invalid header: " << key;
             }

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Files that changed from the base of the PR and between c8acc12 and e1c0605.

📒 Files selected for processing (5)

components/core/src/clp/CurlDownloadHandler.cpp (3 hunks)
components/core/src/clp/CurlDownloadHandler.hpp (3 hunks)
components/core/src/clp/NetworkReader.cpp (3 hunks)
components/core/src/clp/NetworkReader.hpp (5 hunks)
components/core/tests/test-NetworkReader.cpp (3 hunks)

🚧 Files skipped from review as they are similar to previous changes (4)

components/core/src/clp/CurlDownloadHandler.hpp
components/core/src/clp/NetworkReader.cpp
components/core/src/clp/NetworkReader.hpp
components/core/tests/test-NetworkReader.cpp

🔇 Additional comments (2)

components/core/src/clp/CurlDownloadHandler.cpp (2)

5-5: LGTM! New includes are properly organized.

The added includes for map, regex, and set are necessary for the custom headers implementation and follow the project's include ordering convention.

Also applies to: 7-8

25-26: LGTM! Constructor parameter addition is well-designed.

The custom_headers parameter is appropriately added as a const reference, which is efficient for passing map containers.

coderabbitai · 2024-10-25T17:21:22Z

components/core/src/clp/CurlDownloadHandler.cpp

+            if (std::regex_match(key, header_name_pattern)
+                && std::regex_match(value, header_value_pattern))
+            {
+                m_http_headers.append(key + ": " + value);


⚠️ Potential issue

Add separator when appending multiple headers.

When appending multiple headers, ensure they are properly separated. The curl_slist_append function expects each header to be terminated correctly.

Apply this diff:

- m_http_headers.append(key + ": " + value); + m_http_headers.append(key + ": " + value + "\r\n");

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

m_http_headers.append(key + ": " + value);

m_http_headers.append(key + ": " + value + "\r\n");

LinZhihao-723

Commented on things we need to fix, and also with two questions for Kirk.
Let's fix these comments first :)
Also we should try to run clang-tidy before asking for the next round of review. We can talk offline to help you set up clang-tidy

components/core/src/clp/NetworkReader.hpp

components/core/src/clp/CurlDownloadHandler.hpp

components/core/src/clp/CurlDownloadHandler.cpp

components/core/tests/test-NetworkReader.cpp

components/core/src/clp/CurlDownloadHandler.cpp

components/core/tests/test-NetworkReader.cpp

…565)

…l name or value

coderabbitai

Actionable comments posted: 3

🧹 Outside diff range and nitpick comments (3)

components/core/tests/test-NetworkReader.cpp (3)

105-120: Consider enhancing error validation in test_illegal_header

The function effectively tests illegal headers, but could be more robust by:

Explicitly checking for specific error conditions rather than just empty content
Adding validation for the HTTP response code

Consider this enhancement:

 auto test_illegal_header(std::string const& header_name, std::string const& header_value) -> bool {
     std::unordered_map<std::string, std::string> illegal_custom_headers;
     illegal_custom_headers.emplace(header_name, header_value);
     clp::NetworkReader illegal_reader{
             "https://httpbin.org/headers",
             0,
             false,
             clp::CurlDownloadHandler::cDefaultOverallTimeout,
             clp::CurlDownloadHandler::cDefaultConnectionTimeout,
             clp::NetworkReader::cDefaultBufferPoolSize,
             clp::NetworkReader::cDefaultBufferSize,
             illegal_custom_headers
     };
-    auto const content = get_content(illegal_reader);
-    return content.empty();
+    try {
+        auto const content = get_content(illegal_reader);
+        return content.empty();
+    } catch (const clp::NetworkReader::OperationFailed& e) {
+        // Expected failure for illegal headers
+        return true;
+    }
 }

224-241: Fix typo and maintain type consistency

The variable name "reulgar_reader" contains a typo
The loop counter should use size_t for consistency with the setup loop

Apply these changes:

-    clp::NetworkReader reulgar_reader{
+    clp::NetworkReader regular_reader{
             "https://httpbin.org/headers",
             0,
             false,
             clp::CurlDownloadHandler::cDefaultOverallTimeout,
             clp::CurlDownloadHandler::cDefaultConnectionTimeout,
             clp::NetworkReader::cDefaultBufferPoolSize,
             clp::NetworkReader::cDefaultBufferSize,
             regular_custom_headers
     };
-    auto const content{nlohmann::json::parse(get_content(reulgar_reader))};
+    auto const content{nlohmann::json::parse(get_content(regular_reader))};
     auto const& headers{content.at(0).at("headers")};
-    REQUIRE(assert_curl_error_code(CURLE_OK, reulgar_reader));
-    for (int i = 0; i < cNumRegularTestHeaders; i++) {
+    REQUIRE(assert_curl_error_code(CURLE_OK, regular_reader));
+    for (size_t i{0}; i < cNumRegularTestHeaders; i++) {

242-253: LGTM! Comprehensive invalid header testing

The test cases effectively cover:

Reserved headers that shouldn't be overridden
Invalid header names with various special characters
Invalid header values with control characters

Consider grouping similar test cases using Catch2's SECTION for better organization:

SECTION("Reserved headers cannot be overridden") {
    REQUIRE(test_illegal_header("Range", "bytes=100-"));
    REQUIRE(test_illegal_header("Cache-Control", "no-cache"));
    REQUIRE(test_illegal_header("Pragma", "no-cache"));
}

SECTION("Invalid header names are rejected") {
    REQUIRE(test_illegal_header("A Space", "xx"));
    REQUIRE(test_illegal_header("A\nNewline", "xx"));
    REQUIRE(test_illegal_header("An@At", "xx"));
    REQUIRE(test_illegal_header("-Start-with-Non-Alphanumeric", "xx"));
}

SECTION("Invalid header values are rejected") {
    REQUIRE(test_illegal_header("Legal-Name1", "newline\n"));
    REQUIRE(test_illegal_header("Legal-Name2", "control-char\x01"));
}

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Files that changed from the base of the PR and between e1c0605 and cf46ab7.

📒 Files selected for processing (5)

components/core/src/clp/CurlDownloadHandler.cpp (3 hunks)
components/core/src/clp/CurlDownloadHandler.hpp (3 hunks)
components/core/src/clp/NetworkReader.cpp (3 hunks)
components/core/src/clp/NetworkReader.hpp (5 hunks)
components/core/tests/test-NetworkReader.cpp (3 hunks)

🚧 Files skipped from review as they are similar to previous changes (2)

components/core/src/clp/CurlDownloadHandler.hpp
components/core/src/clp/NetworkReader.hpp

🔇 Additional comments (6)

components/core/src/clp/CurlDownloadHandler.cpp (2)
Line range hint 3-27: LGTM: Constructor signature and includes are well-structured

The new parameter follows the established patterns and the includes are properly organized.

99-105: ⚠️ Potential issue

Improve error message for reserved header usage

The error message should indicate which reserved header was attempted to be used.

Apply this diff:
-                        "curl_download_handler_init failed."
+                        fmt::format("Cannot override reserved header: '{}'", key)
Likely invalid or redundant comment.
components/core/tests/test-NetworkReader.cpp (1)

213-223: LGTM! Well-structured test setup

The test case effectively sets up custom headers using proper formatting and type-safe iterations. The use of httpbin.org as a test endpoint is a good choice for header validation.
components/core/src/clp/NetworkReader.cpp (3)
121-122: LGTM: Constructor parameter addition is well-designed

The new custom_headers parameter is appropriately implemented as an optional const reference to an unordered map, which is memory efficient and maintains ABI compatibility.

134-135: LGTM: Thread initialization properly handles custom headers

The initialization correctly forwards the custom headers to the DownloaderThread while maintaining exception safety through std::make_unique.

220-221: Verify thread safety and lifetime of custom headers

While the implementation looks correct, please ensure:

The custom headers map's lifetime extends beyond the download operation

Thread safety is maintained when accessing the headers

Consider storing a copy of the headers in the DownloaderThread to prevent potential lifetime issues. Example implementation:
- m_custom_headers
+ m_custom_headers_copy // Store a copy in DownloaderThread
Run this verification script:

✅ Verification successful

Based on the search results, I can now provide a final response:

Thread safety and lifetime concerns are properly addressed

The implementation is correct as is, because:

m_custom_headers is stored as a member variable in DownloaderThread and initialized through the constructor, making it a thread-local copy

The headers are passed by const reference to CurlDownloadHandler, ensuring thread safety

The lifetime is managed appropriately since each thread maintains its own copy of the headers

The code follows a proper thread-safe design pattern where:

Headers are passed as std::optional<std::unordered_map> through the constructor

Each thread instance maintains its own copy via m_custom_headers member

Access is read-only during download operations
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:
#!/bin/bash
# Check for potential thread safety issues with custom headers access
rg -A 3 "custom_headers" --type cpp

# Verify if there are any other places accessing the headers that might need synchronization
ast-grep --pattern 'm_custom_headers'
Length of output: 6862

coderabbitai · 2024-10-28T17:05:11Z

components/core/src/clp/CurlDownloadHandler.cpp

+                    throw CurlOperationFailed(
+                            ErrorCode_Failure,
+                            __FILE__,
+                            __LINE__,
+                            CURLE_ABORTED_BY_CALLBACK,
+                            "curl_download_handler_init failed."
+                    );
+                }
+            } else {
+                throw CurlOperationFailed(
+                        ErrorCode_Failure,
+                        __FILE__,
+                        __LINE__,
+                        CURLE_ABORTED_BY_CALLBACK,
+                        "curl_download_handler_init failed."
+                );


⚠️ Potential issue

Review CURLE_ABORTED_BY_CALLBACK usage

CURLE_ABORTED_BY_CALLBACK might not be the most appropriate error code for header validation failures. Consider using CURLE_BAD_FUNCTION_ARGUMENT instead.

Apply this diff to both error throws:

- CURLE_ABORTED_BY_CALLBACK, + CURLE_BAD_FUNCTION_ARGUMENT,

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

throw CurlOperationFailed(

ErrorCode_Failure,

__FILE__,

__LINE__,

CURLE_ABORTED_BY_CALLBACK,

"curl_download_handler_init failed."

);

}

} else {

throw CurlOperationFailed(

ErrorCode_Failure,

__FILE__,

__LINE__,

CURLE_ABORTED_BY_CALLBACK,

"curl_download_handler_init failed."

);

throw CurlOperationFailed(

ErrorCode_Failure,

__FILE__,

__LINE__,

CURLE_BAD_FUNCTION_ARGUMENT,

"curl_download_handler_init failed."

);

}

} else {

throw CurlOperationFailed(

ErrorCode_Failure,

__FILE__,

__LINE__,

CURLE_BAD_FUNCTION_ARGUMENT,

"curl_download_handler_init failed."

);

coderabbitai · 2024-10-28T17:05:12Z

components/core/src/clp/CurlDownloadHandler.cpp

+                    throw CurlOperationFailed(
+                            ErrorCode_Failure,
+                            __FILE__,
+                            __LINE__,
+                            CURLE_ABORTED_BY_CALLBACK,
+                            "curl_download_handler_init failed."
+                    );


⚠️ Potential issue

Enhance error message for invalid header format

The current error message "curl_download_handler_init failed" doesn't provide enough context about what specifically failed. Consider including details about the invalid header.

Apply this diff:

- "curl_download_handler_init failed." + fmt::format("Invalid header format - name: '{}', value: '{}'", key, value)

Committable suggestion was skipped due to low confidence.

coderabbitai · 2024-10-28T17:05:12Z

components/core/src/clp/CurlDownloadHandler.cpp

+        std::regex const header_name_pattern("^(?![!#$%&'*+.^_`|~-])[!#$%&'*+.^_`|~0-9a-zA-Z-]+$");
+        // Must consist of printable ASCII characters (values between 0x20 and 0x7E)
+        std::regex const header_value_pattern("^[\\x20-\\x7E]*$");


🛠️ Refactor suggestion

Simplify header name validation pattern

The current pattern ^(?![!#$%&'*+.^_|~-])[!#$%&'*+.^_|~0-9a-zA-Z-]+$ is complex and might be hard to maintain. Consider simplifying it while maintaining RFC 7230 compliance.

Apply this diff:

- std::regex const header_name_pattern("^(?![!#$%&'*+.^_`|~-])[!#$%&'*+.^_`|~0-9a-zA-Z-]+$"); + // RFC 7230: token = 1*tchar + // tchar = "!" / "#" / "$" / "%" / "&" / "'" / "*" / "+" / "-" / "." / + // "^" / "_" / "`" / "|" / "~" / DIGIT / ALPHA + std::regex const header_name_pattern("^[a-zA-Z][!#$%&'*+.^_`|~0-9a-zA-Z-]*$");

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

std::regex const header_name_pattern("^(?![!#$%&'*+.^_`|~-])[!#$%&'*+.^_`|~0-9a-zA-Z-]+$");

// Must consist of printable ASCII characters (values between 0x20 and 0x7E)

std::regex const header_value_pattern("^[\\x20-\\x7E]*$");

// RFC 7230: token = 1*tchar

// tchar = "!" / "#" / "$" / "%" / "&" / "'" / "*" / "+" / "-" / "." /

// "^" / "_" / "`" / "|" / "~" / DIGIT / ALPHA

std::regex const header_name_pattern("^[a-zA-Z][!#$%&'*+.^_`|~0-9a-zA-Z-]*$");

// Must consist of printable ASCII characters (values between 0x20 and 0x7E)

std::regex const header_value_pattern("^[\\x20-\\x7E]*$");

coderabbitai

Actionable comments posted: 4

🧹 Outside diff range and nitpick comments (1)

components/core/tests/test-NetworkReader.cpp (1)
104-119: Add function documentation

Please add a documentation block explaining:

Purpose of the function

Parameters and their constraints

Return value meaning

Example usage

Add this documentation block:
+/**
+ * Tests if a header with potentially illegal name or value is properly rejected.
+ * @param header_name The header name to test
+ * @param header_value The header value to test
+ * @return true if the header was properly rejected (CURLE_BAD_FUNCTION_ARGUMENT and empty content)
+ */
 auto test_illegal_header(std::string const& header_name, std::string const& header_value) -> bool {

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Files that changed from the base of the PR and between cf46ab7 and d995a0f.

📒 Files selected for processing (2)

components/core/src/clp/CurlDownloadHandler.cpp (3 hunks)
components/core/tests/test-NetworkReader.cpp (3 hunks)

🔇 Additional comments (5)

components/core/src/clp/CurlDownloadHandler.cpp (3)
3-3: LGTM: Required includes added correctly

The new includes are properly ordered and necessary for the implementation.

Also applies to: 7-7, 13-13

25-26: LGTM: Parameter signature is well-designed

The parameter type choice (const reference to optional unordered_map) is appropriate for passing custom headers efficiently.

67-69: 🛠️ Refactor suggestion

Simplify header name pattern and add pattern documentation

The current regex pattern is complex and could be simplified while maintaining RFC 7230 compliance.

Apply this diff:
-        std::regex const header_name_pattern("^(?![!#$%&'*+.^_`|~-])[!#$%&'*+.^_`|~0-9a-zA-Z-]+$");
+        // RFC 7230: token must start with a letter followed by allowed characters
+        std::regex const header_name_pattern("^[a-zA-Z][!#$%&'*+.^_`|~0-9a-zA-Z-]*$");
Likely invalid or redundant comment.
components/core/tests/test-NetworkReader.cpp (2)
9-9: LGTM: Required includes added appropriately

The new includes support custom headers (unordered_map), string formatting (fmt), and JSON response parsing (nlohmann/json).

Also applies to: 14-15

212-254: LGTM: Comprehensive test coverage

The test case effectively covers:

Regular custom headers functionality

Protection against overriding reserved headers

Validation of header names and values

Error handling for illegal headers

Let's verify we're testing all reserved headers:
✅ Verification successful

Let me analyze the search results to verify the reserved headers testing coverage.

From the codebase search, I found that the test is already comprehensively covering all the reserved headers that are used in the NetworkReader implementation. Specifically:

The test checks the three reserved headers that are internally set by NetworkReader/CurlDownloadHandler:

"Range" (for offset)

"Cache-Control" (for disable_cache)

"Pragma" (for disable_cache)

Looking at CurlDownloadHandler.cpp implementation, these are indeed the only headers that are reserved and automatically set:
if (0 != offset) {
    std::string const range{"Range: bytes=" + std::to_string(offset) + "-"};
    m_http_headers.append(range);
}
if (disable_caching) {
    m_http_headers.append("Cache-Control: no-cache");
    m_http_headers.append("Pragma: no-cache");
}
The test coverage for reserved headers is complete and accurate

The test correctly verifies that these headers cannot be overridden through custom headers, which is the key security concern being tested here.
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:
#!/bin/bash
# Search for other reserved headers in the codebase that we might need to test
rg -g '!*.{md,txt}' -i 'header|curl.*set.*opt' -A 3 -B 3
Length of output: 91532

coderabbitai · 2024-10-28T17:39:05Z

components/core/src/clp/CurlDownloadHandler.cpp

+                    throw CurlOperationFailed(
+                            ErrorCode_Failure,
+                            __FILE__,
+                            __LINE__,
+                            CURLE_BAD_FUNCTION_ARGUMENT,
+                            "curl_download_handler_init failed."
+                    );


⚠️ Potential issue

Enhance error message for invalid header format

The current error message doesn't provide enough context about what specifically failed.

Apply this diff:

- "curl_download_handler_init failed." + fmt::format( + "Invalid header format - name: '{}', value: '{}'", + key, + value)

Committable suggestion was skipped due to low confidence.

coderabbitai · 2024-10-28T17:39:05Z

components/core/src/clp/CurlDownloadHandler.cpp

+                if (std::regex_match(key, header_name_pattern)
+                    && std::regex_match(value, header_value_pattern))
+                {
+                    m_http_headers.append(fmt::format("{}: {}", key, value));


⚠️ Potential issue

Add CRLF validation for header values

As per libcurl documentation, headers must not be CRLF-terminated as libcurl adds CRLF itself.

Apply this diff:

- m_http_headers.append(fmt::format("{}: {}", key, value)); + // Ensure header value doesn't end with CRLF + if (value.ends_with("\r\n")) { + throw CurlOperationFailed( + ErrorCode_Failure, + __FILE__, + __LINE__, + CURLE_BAD_FUNCTION_ARGUMENT, + fmt::format( + "Header value must not end with CRLF - key: '{}'", + key)); + } + m_http_headers.append(fmt::format("{}: {}", key, value));

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

m_http_headers.append(fmt::format("{}: {}", key, value));

// Ensure header value doesn't end with CRLF

if (value.ends_with("\r\n")) {

throw CurlOperationFailed(

ErrorCode_Failure,

__FILE__,

__LINE__,

CURLE_BAD_FUNCTION_ARGUMENT,

fmt::format(

"Header value must not end with CRLF - key: '{}'",

key));

}

m_http_headers.append(fmt::format("{}: {}", key, value));

coderabbitai · 2024-10-28T17:39:06Z

components/core/src/clp/CurlDownloadHandler.cpp

+                throw CurlOperationFailed(
+                        ErrorCode_Failure,
+                        __FILE__,
+                        __LINE__,
+                        CURLE_BAD_FUNCTION_ARGUMENT,
+                        "curl_download_handler_init failed."
+                );


⚠️ Potential issue

Improve error message for reserved header violation

The error message should indicate which reserved header was attempted to be set.

Apply this diff:

- "curl_download_handler_init failed." + fmt::format( + "Cannot override reserved header: '{}'", + key)

Committable suggestion was skipped due to low confidence.

coderabbitai · 2024-10-28T17:39:06Z

components/core/tests/test-NetworkReader.cpp

+    for (int i = 0; i < cNumRegularTestHeaders; i++) {
+        REQUIRE((
+                fmt::format("Unit-Test-Value{}", i) == headers.at(fmt::format("Unit-Test-Key{}", i))
+        ));
+    }


⚠️ Potential issue

Use consistent integer types

The loop counter uses int while the earlier loop uses size_t. For consistency with the previous loop and to follow best practices for container indices, use size_t.

- for (int i = 0; i < cNumRegularTestHeaders; i++) { + for (size_t i{0}; i < cNumRegularTestHeaders; i++) {

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

for (int i = 0; i < cNumRegularTestHeaders; i++) {

REQUIRE((

fmt::format("Unit-Test-Value{}", i) == headers.at(fmt::format("Unit-Test-Key{}", i))

));

}

for (size_t i{0}; i < cNumRegularTestHeaders; i++) {

REQUIRE((

fmt::format("Unit-Test-Value{}", i) == headers.at(fmt::format("Unit-Test-Key{}", i))

));

}

coderabbitai

Actionable comments posted: 0

🧹 Outside diff range and nitpick comments (1)

components/core/src/clp/CurlDownloadHandler.cpp (1)

94-99: Enhance error messages for better debugging

The current error messages could be more specific about the validation failure reason.

Consider these improvements:

-                                    "curl_download_handler_init failed due to illegal header: {}: "
-                                    "{}.",
+                                    "Invalid header format - name: '{}', value: '{}' must match "
+                                    "RFC 7230 requirements",

-                                "curl_download_handler_init failed due to reserved header is being "
-                                "modified: {}: {}.",
+                                "Cannot override reserved header '{}' with value '{}'",

Also applies to: 108-113

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Files that changed from the base of the PR and between d995a0f and 91209ac.

📒 Files selected for processing (1)

components/core/src/clp/CurlDownloadHandler.cpp (3 hunks)

🔇 Additional comments (5)

components/core/src/clp/CurlDownloadHandler.cpp (5)
3-14: LGTM: Include statements and constructor signature changes are well-implemented

The new includes and constructor signature changes are appropriate for the custom headers feature. The use of std::optional for the custom_headers parameter allows for backward compatibility, and passing by const reference is efficient for the map type.

Also applies to: 24-27

64-65: LGTM: Reserved headers declaration is well-implemented

The use of constexpr array with string_view for reserved headers is efficient and compile-time evaluated.

87-87: ⚠️ Potential issue

Add CRLF validation for header values

As per libcurl documentation, headers must not be CRLF-terminated since libcurl adds CRLF itself.

Add validation before appending the header:
-                    m_http_headers.append(fmt::format("{}: {}", key, value));
+                    if (value.ends_with("\r\n")) {
+                        throw CurlOperationFailed(
+                            ErrorCode_Failure,
+                            __FILE__,
+                            __LINE__,
+                            CURLE_BAD_FUNCTION_ARGUMENT,
+                            fmt::format("Header value must not be CRLF-terminated - key: '{}'", key));
+                    }
+                    m_http_headers.append(fmt::format("{}: {}", key, value));
Likely invalid or redundant comment.

67-69: 🛠️ Refactor suggestion

Simplify the header name validation pattern

The current pattern is overly complex and could be simplified while maintaining RFC 7230 compliance.

Consider this simpler pattern:
-        std::regex const header_name_pattern("^(?![!#$%&'*+.^_`|~-])[!#$%&'*+.^_`|~0-9a-zA-Z-]+$");
+        // RFC 7230: token must start with a letter followed by allowed characters
+        std::regex const header_name_pattern("^[a-zA-Z][!#$%&'*+.^_`|~0-9a-zA-Z-]*$");
Likely invalid or redundant comment.

80-82: 🛠️ Refactor suggestion

Simplify the reserved header check

The current implementation using std::find could be simplified using a range-based for loop.

Consider this more readable approach:
-            if (cReservedHeaders.end()
-                == std::find(cReservedHeaders.begin(), cReservedHeaders.end(), lower_key))
+            bool is_reserved = false;
+            for (auto const reserved : cReservedHeaders) {
+                if (lower_key == reserved) {
+                    is_reserved = true;
+                    break;
+                }
+            }
+            if (!is_reserved)
Likely invalid or redundant comment.

…s-14 (ARM) runners. (#566)

…l name or value

…owee/clp into xiaochong-add-user-customized-header

LinZhihao-723 · 2024-10-29T23:58:06Z

components/core/src/clp/NetworkReader.cpp

        size_t buffer_pool_size,
-        size_t buffer_size
+        size_t buffer_size,
+        std::optional<std::unordered_map<std::string, std::string>> http_header_kv_pairs


Missing includes for std::string and std::unordered_map. Can u check if clang-tidy alerts on these warnings?

LinZhihao-723 · 2024-10-29T23:59:14Z

components/core/src/clp/CurlDownloadHandler.cpp

+        constexpr std::array<std::string_view, 3> cReservedHeaders
+                = {"range", "cache-control", "pragma"};
+        // RFC 7230 token pattern: one or more tchars
+        std::regex const header_name_pattern("^(?![!#$%&'*+.^_`|~-])[!#$%&'*+.^_`|~0-9a-zA-Z-]+$");


LinZhihao-723 · 2024-10-29T23:59:23Z

components/core/src/clp/CurlDownloadHandler.cpp

+        // RFC 7230 token pattern: one or more tchars
+        std::regex const header_name_pattern("^(?![!#$%&'*+.^_`|~-])[!#$%&'*+.^_`|~0-9a-zA-Z-]+$");
+        // Must consist of printable ASCII characters (values between 0x20 and 0x7E)
+        std::regex const header_value_pattern("^[\\x20-\\x7E]*$");


LinZhihao-723 · 2024-10-29T23:59:55Z

components/core/src/clp/CurlDownloadHandler.cpp

    }
+    if (http_header_kv_pairs.has_value()) {
+        constexpr std::array<std::string_view, 3> cReservedHeaders
+                = {"range", "cache-control", "pragma"};


Do we need to store and match the keys in their lower case?

I think so? Because header names are case-insensitive

Sorry my bad. You're right, I didn't notice that the header field (key) is case-insensitive defined by HTTP protocol.
Then how about:

Let's keep the current way of detecting reserved keys, transforming the key into lower-case

Let's define three static constexpr string_view variables for reserved keys in lower-case form. Use these variables to replace these hardcoded lines:

clp/components/core/src/clp/CurlDownloadHandler.cpp

Line 52 in b92a6c6

std::string const range{"Range: bytes=" + std::to_string(offset) + "-"};

,

clp/components/core/src/clp/CurlDownloadHandler.cpp

Line 56 in b92a6c6

m_http_headers.append("Cache-Control: no-cache");

,

clp/components/core/src/clp/CurlDownloadHandler.cpp

Line 57 in b92a6c6

m_http_headers.append("Pragma: no-cache");

Let's switch back to a unordered_set<std::string_view> for lower-case reserved keys, and construct this set in every constructor call. For lower-cased keys transformed in step 1, let's call contains to ensure they don't appear in the unordered set.

Benefits:

Let's be consistent of using lower-cases keys

The unordered_map search will be faster then array search, especially when user don't given any reserved keys

The overhead of constructing unordered_set with string_view in runtime should be acceptable

LinZhihao-723 · 2024-10-30T00:01:42Z

components/core/src/clp/CurlDownloadHandler.cpp

+                        )
+                );
+            }
+        }


Instead of if (valid) {do_sth;} else {error_handling;}, we prefer to use if (not valid) {error_handling;} do_sth;.
This reduce the level of indentations and thus make the code much cleaner.
I think this function can be rewritten in this suggested way

LinZhihao-723 · 2024-10-30T00:28:00Z

components/core/tests/test-NetworkReader.cpp

+    // The following headers are determined by offset and disable_cache, which should not be
+    // overrided by custom headers.
+    REQUIRE(test_illegal_header("Range", "bytes=100-"));
+    REQUIRE(test_illegal_header("Cache-Control", "no-cache"));
+    REQUIRE(test_illegal_header("Pragma", "no-cache"));
+    // The following headers contain illegal header values, the requests should be rejected.
+    REQUIRE(test_illegal_header("Legal-Name1", "CRLF\n"));
+    REQUIRE(test_illegal_header("Legal-Name2", "CRLF\r"));
+    REQUIRE(test_illegal_header("Legal-Name3", "CRLF\n\r"));
+    REQUIRE(test_illegal_header("Legal-Name4", "CRLF\r\n"));


as said above, let's move this into a separate test-case with GENERATE to replace test_illegal_header

LinZhihao-723 · 2024-10-30T00:29:13Z

components/core/tests/test-NetworkReader.cpp

+    auto content{nlohmann::json::parse(get_content(regular_reader))};
+    if (content.is_array() && 1 == content.size()) {
+        content = content.at(0);
+    }


Suggested change

auto content{nlohmann::json::parse(get_content(regular_reader))};

if (content.is_array() && 1 == content.size()) {

content = content.at(0);

}

auto const content =nlohmann::json::parse(get_content(regular_reader));

Have you tried this?

LinZhihao-723 · 2024-10-30T00:31:03Z

components/core/tests/test-NetworkReader.cpp

+    }
+    auto const& headers{content.at("headers")};
+    REQUIRE(assert_curl_error_code(CURLE_OK, regular_reader));
+    for (int i = 0; i < cNumRegularTestHeaders; i++) {


If applying the comment above, this should be replaced.
But as I commented in the last round of review, we shouldn't use int unless the number can be negative; we should use {} for typed variable initialization; and we should use ++i in general loop variable increment

LinZhihao-723 · 2024-10-30T00:41:49Z

components/core/tests/test-NetworkReader.cpp

+    // We use httpbin (https://httpbin.org/) to test the custom headers. This request will return a
+    // JSON object that contains the custom headers. We check if the headers are in the response.


Suggested change

// We use httpbin (https://httpbin.org/) to test the custom headers. This request will return a

// JSON object that contains the custom headers. We check if the headers are in the response.

// We use httpbin (https://httpbin.org/) to test user-specified headers.

// On success, it is supposed to respond all the user-specified headers as key-value pairs in

// JSON form.

LinZhihao-723 · 2024-10-30T00:43:58Z

components/core/tests/test-NetworkReader.cpp

+    constexpr int cNumRegularTestHeaders{10};
+    for (size_t i{0}; i < cNumRegularTestHeaders; i++) {


Suggested change

constexpr int cNumRegularTestHeaders{10};

for (size_t i{0}; i < cNumRegularTestHeaders; i++) {

constexpr size_t cNumRegularTestHeaders{10};

for (size_t i{0}; i < cNumRegularTestHeaders; ++i) {

Please ensure all the comments are applied from previous reviews
#564 (comment)

LinZhihao-723 · 2024-10-30T00:47:38Z

components/core/tests/test-NetworkReader.cpp

    REQUIRE((clp::ErrorCode_Failure == reader.try_get_pos(pos)));
 }
+
+TEST_CASE("network_reader_with_http_header_kv_pairs", "[NetworkReader]") {


Suggested change

TEST_CASE("network_reader_with_http_header_kv_pairs", "[NetworkReader]") {

// NOLINTNEXTLINE(readability-function-cognitive-complexity)

TEST_CASE("network_reader_with_http_header_kv_pairs", "[NetworkReader]") {

We can safely silent this warning since it's mostly caused by catch2 macros

LinZhihao-723 · 2024-10-30T00:49:47Z

components/core/tests/test-NetworkReader.cpp

+    std::unordered_map<std::string, std::string> regular_http_header_kv_pairs;
+    // We use httpbin (https://httpbin.org/) to test the custom headers. This request will return a
+    // JSON object that contains the custom headers. We check if the headers are in the response.
+    constexpr int cNumRegularTestHeaders{10};


How about: cNumHttpHeaderKeyValuePairs

coderabbitai

Actionable comments posted: 15

🧹 Outside diff range and nitpick comments (22)

.github/workflows/clp-core-build-macos.yaml (1)
41-43: Excellent implementation of the matrix strategy, eh!

The matrix configuration properly supports both macOS 13 and 14, which improves the testing coverage. Consider adding a comment explaining the rationale for these specific versions for future maintenance.

Add a comment above the matrix:
    strategy:
      matrix:
+        # Testing on the two latest macOS versions to ensure compatibility
        runner: ["macos-13", "macos-14"]
components/core/src/clp_s/ColumnWriter.cpp (1)
77-77: Remove unnecessary semicolon, eh

There's a stray semicolon that should be removed.
-    ;
components/core/src/clp/CurlDownloadHandler.cpp (1)
3-3: Remove unused includes.

The includes for <array> and <regex> appear to be unused in the current implementation.
-#include <array>
 #include <chrono>
 #include <cstddef>
 #include <memory>
-#include <regex>
 #include <string>
Also applies to: 7-7
components/core/src/clp_s/ArchiveWriter.hpp (3)
24-24: Add documentation for the new option parameter.

Please add documentation to clarify the purpose, units, and any constraints for min_table_size. This will help users understand how to properly configure this option.
 struct ArchiveWriterOption {
     boost::uuids::uuid id;
     std::string archives_dir;
     int compression_level;
     bool print_archive_stats;
+    /// Minimum size (in bytes) for packed tables before compression
+    /// @note Must be greater than 0
     size_t min_table_size;
 };
36-43: Add documentation for the StreamMetadata struct.

The struct would benefit from documentation explaining its purpose and the significance of each member.
+    /**
+     * Metadata for tracking stream location and size within the archive
+     */
     struct StreamMetadata {
         StreamMetadata(uint64_t file_offset, uint64_t uncompressed_size)
                 : file_offset(file_offset),
                   uncompressed_size(uncompressed_size) {}
 
+        /// Offset of the stream within the archive file
         uint64_t file_offset{};
+        /// Size of the stream before compression
         uint64_t uncompressed_size{};
     };
190-190: Add documentation and consider adding a getter method.

The member variable would benefit from documentation. Also, consider adding a getter method for consistency with other size-related members like get_data_size().
+    /// Minimum size threshold for packed tables before compression
     size_t m_min_table_size{};

+    /**
+     * @return Minimum size threshold for packed tables
+     */
+    [[nodiscard]] size_t get_min_table_size() const { return m_min_table_size; }
components/core/src/clp_s/ArchiveReader.hpp (2)
183-193: Consider enhancing the documentation with thread safety notes

The method documentation is comprehensive regarding functionality and buffer reuse. However, since buffer reuse is involved, it would be helpful to document any thread safety considerations.

Consider adding a note about thread safety, for example:
     * save memory allocations, but can only be used when tables are read one at a time.
+    * Note: Buffer reuse is not thread-safe. Ensure proper synchronization when using this feature
+    * in a multi-threaded context.
     * @param stream_id
214-216: Remove redundant initialization

The explicit initialization of m_stream_buffer_size to 0ULL is redundant as this is the default value for size_t class members.
     std::shared_ptr<char[]> m_stream_buffer{};
-    size_t m_stream_buffer_size{0ULL};
+    size_t m_stream_buffer_size{};
     size_t m_cur_stream_id{0ULL};
components/core/src/clp_s/CommandLineArguments.hpp (1)

180-180: LGTM! Consider adding more detailed documentation.

The member variable is well-placed and properly initialized. The comment indicating "1 MB" is helpful.

Consider adding a brief comment explaining why 1 MB was chosen as the default minimum table size, perhaps in a doxygen-style format like other members in this class.
components/core/tests/test-NetworkReader.cpp (2)
196-223: LGTM: Well-structured test for regular headers

The test case effectively validates the custom headers functionality using httpbin.org's echo service. The implementation includes proper error checking and validation of the response.

Consider adding a brief comment explaining the expected format of httpbin.org's response for better maintainability:
     // We use httpbin (https://httpbin.org/) to test the user-specified headers. On success, it is
-    // supposed to respond all the user-specified headers as key-value pairs in JSON form.
+    // supposed to respond with a JSON object containing a "headers" field that echoes back all
+    // request headers as key-value pairs. Example response:
+    // {
+    //   "headers": {
+    //     "Unit-Test-Key0": "Unit-Test-Value0",
+    //     "Unit-Test-Key1": "Unit-Test-Value1"
+    //   }
+    // }
225-254: Clean up commented code and improve test readability

The test case effectively validates illegal headers using parameterized testing. However, there are a few improvements that could be made:

Remove the commented out code on line 236

Consider extracting the lambda to improve readability

Here's a suggested refactor:
 TEST_CASE("network_reader_with_illegal_http_header_kv_pairs", "[NetworkReader]") {
     auto [header_name, header_value] = GENERATE(
             // The following headers are determined by offset and disable_cache, which should not be
             // overriden by custom headers.
             std::make_pair("Range", "bytes=100-"),
             std::make_pair("RAnGe", "bytes=100-"),
             std::make_pair("Cache-Control", "no-cache"),
             std::make_pair("Pragma", "no-cache"),
             // The CRLF-terminated headers should be rejected.
             std::make_pair("Legal-Name", "CRLF\r\n")
     );
-    // REQUIRE(test_illegal_header(header_name, header_value));
+    auto test_illegal_header = [](std::string const& name, std::string const& value) {
+        std::unordered_map<std::string, std::string> illegal_http_header_kv_pairs;
+        illegal_http_header_kv_pairs.emplace(name, value);
+        clp::NetworkReader illegal_reader{
+                "https://httpbin.org/headers",
+                0,
+                false,
+                clp::CurlDownloadHandler::cDefaultOverallTimeout,
+                clp::CurlDownloadHandler::cDefaultConnectionTimeout,
+                clp::NetworkReader::cDefaultBufferPoolSize,
+                clp::NetworkReader::cDefaultBufferSize,
+                illegal_http_header_kv_pairs
+        };
+        auto const content = get_content(illegal_reader);
+        return assert_curl_error_code(CURLE_BAD_FUNCTION_ARGUMENT, illegal_reader)
+               && content.empty();
+    };
-    REQUIRE([header_name, header_value]() {
-        std::unordered_map<std::string, std::string> illegal_http_header_kv_pairs;
-        illegal_http_header_kv_pairs.emplace(header_name, header_value);
-        clp::NetworkReader illegal_reader{
-                "https://httpbin.org/headers",
-                0,
-                false,
-                clp::CurlDownloadHandler::cDefaultOverallTimeout,
-                clp::CurlDownloadHandler::cDefaultConnectionTimeout,
-                clp::NetworkReader::cDefaultBufferPoolSize,
-                clp::NetworkReader::cDefaultBufferSize,
-                illegal_http_header_kv_pairs
-        };
-        auto const content = get_content(illegal_reader);
-        return assert_curl_error_code(CURLE_BAD_FUNCTION_ARGUMENT, illegal_reader)
-               && content.empty();
-    }());
+    REQUIRE(test_illegal_header(header_name, header_value));
 }
components/core/src/clp_s/CommandLineArguments.cpp (1)
163-167: Consider enhancing the help text and adding value validation.

The implementation looks good, but could benefit from these improvements:

The help text could be more specific about what constitutes a "packed table" and how this option relates to other size parameters.

Consider adding validation to ensure the value is within reasonable bounds.

Here's a suggested improvement:
     "min-table-size",
     po::value<size_t>(&m_minimum_table_size)->value_name("MIN_TABLE_SIZE")->
         default_value(m_minimum_table_size),
-    "Minimum size (B) for a packed table before it gets compressed."
+    "Minimum size (B) for a packed table before it gets compressed. This affects the compression efficiency by controlling the minimum amount of data that must be accumulated before applying compression."
Also, consider adding validation after the options are parsed:
if (m_minimum_table_size < 1024) {  // Example threshold
    throw std::invalid_argument("min-table-size must be at least 1KB");
}
components/core/src/clp_s/PackedStreamReader.hpp (2)
75-77: Ensure stream_id is within bounds in get_uncompressed_stream_size

The method get_uncompressed_stream_size uses m_stream_metadata.at(stream_id), which will throw an exception if stream_id is out of bounds. Consider adding a clearer error message or explicit bounds checking to improve usability and error handling.

13-18: Enhance class documentation for clarity

The class description can be rephrased for better readability and to clearly convey the usage constraints.

Suggested rephrased documentation:
- * PackedStreamReader ensures that the tables section of an archive is read safely. Any attempt to
- * read the tables section without loading the tables metadata, and any attempt to read tables
- * section out of order will throw. As well, any incorrect usage of this class (e.g. closing without
- * opening) will throw.
+ * PackedStreamReader ensures the safe reading of the tables section in an archive. Any attempt to
+ * read the tables section without first loading the tables metadata, or attempts to read the tables
+ * section out of order, will result in an exception. Additionally, incorrect usage of this class
+ * (e.g., closing without opening) will throw an exception.
components/core/src/clp_s/PackedStreamReader.cpp (5)

87-89: Question the necessity of enforcing strict stream_id ordering.

The current implementation throws an error if stream_id is not strictly greater than m_prev_stream_id, which enforces sequential reading of streams in increasing order. Unless there is a specific requirement for this constraint, consider allowing random access to streams by removing this check. This would make the read_stream function more flexible for users who may need to read streams in any order.

5-41: Add unit tests for PackedStreamReader::read_metadata.

To ensure the correct functionality and robust error handling of the read_metadata method, please consider adding unit tests. These tests should cover various scenarios, including successful metadata reading, handling of corrupted data, and state transitions.

43-55: Add unit tests for PackedStreamReader::open_packed_streams.

Adding unit tests for the open_packed_streams method will help verify that the class correctly handles file opening, state transitions, and error conditions such as attempting to open non-existent files.

57-70: Add unit tests for PackedStreamReader::close.

Unit tests for the close method will ensure that resources are properly released, the internal state is reset correctly, and that the method behaves as expected when called from different states.

72-116: Add unit tests for PackedStreamReader::read_stream to ensure correctness.

To validate the functionality and error handling of the read_stream method, consider adding unit tests that cover:

Reading streams in order and out of order (if allowed after addressing the sequential reading constraint).

Handling invalid stream_id values (e.g., out of bounds).

Scenarios where the buffer size is smaller than uncompressed_size, testing buffer reallocation.

Error handling when seeking fails or when reading decompressed data fails.

components/core/src/clp_s/ArchiveWriter.cpp (1)

145-179: Standardize terminology and correct typographical errors in comments.

To enhance clarity and maintain consistency in the documentation, consider the following improvements:

Use "64-bit stream ID" instead of "64 bit stream id" for proper hyphenation and capitalization.

Replace "id" with "ID" consistently throughout the comments.

In line 150~, rephrase for clarity: "In the first half of the metadata, we identify the number of streams and the offset into the file where each compression stream can be found."

components/core/src/clp_s/ArchiveReader.cpp (2)

51-53: Provide a more descriptive error message for unsupported schemas.

Throwing ErrorCode::ErrorCodeUnsupported when num_separate_column_schemas is non-zero might not give enough context to the user. Consider including additional information in the error message to clarify why this operation is unsupported.

Line range hint 132-151: Handle potential null pointers from read_stream.

The read_schema_table method retrieves stream_buffer from read_stream but does not check if it's nullptr before use. This could lead to dereferencing a null pointer if read_stream fails. Include a check to ensure stream_buffer is valid.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between f65de30 and 9aaed32.

📒 Files selected for processing (28)

.github/workflows/clp-core-build-macos.yaml (4 hunks)
.github/workflows/clp-core-build.yaml (2 hunks)
components/core/src/clp/CurlDownloadHandler.cpp (3 hunks)
components/core/src/clp/NetworkReader.cpp (3 hunks)
components/core/src/clp/NetworkReader.hpp (5 hunks)
components/core/src/clp_s/ArchiveReader.cpp (7 hunks)
components/core/src/clp_s/ArchiveReader.hpp (4 hunks)
components/core/src/clp_s/ArchiveWriter.cpp (3 hunks)
components/core/src/clp_s/ArchiveWriter.hpp (3 hunks)
components/core/src/clp_s/CMakeLists.txt (1 hunks)
components/core/src/clp_s/ColumnWriter.cpp (2 hunks)
components/core/src/clp_s/ColumnWriter.hpp (7 hunks)
components/core/src/clp_s/CommandLineArguments.cpp (1 hunks)
components/core/src/clp_s/CommandLineArguments.hpp (2 hunks)
components/core/src/clp_s/JsonParser.cpp (1 hunks)
components/core/src/clp_s/JsonParser.hpp (1 hunks)
components/core/src/clp_s/PackedStreamReader.cpp (1 hunks)
components/core/src/clp_s/PackedStreamReader.hpp (1 hunks)
components/core/src/clp_s/SchemaReader.cpp (1 hunks)
components/core/src/clp_s/SchemaReader.hpp (4 hunks)
components/core/src/clp_s/SchemaWriter.cpp (1 hunks)
components/core/src/clp_s/SchemaWriter.hpp (1 hunks)
components/core/src/clp_s/clp-s.cpp (1 hunks)
components/core/src/clp_s/search/Output.cpp (1 hunks)
components/core/tests/test-NetworkReader.cpp (2 hunks)
deps-tasks.yml (1 hunks)
docs/src/dev-guide/components-core/index.md (2 hunks)
docs/src/dev-guide/components-core/macos-deps-install.md (2 hunks)

✅ Files skipped from review due to trivial changes (3)

.github/workflows/clp-core-build.yaml
docs/src/dev-guide/components-core/index.md
docs/src/dev-guide/components-core/macos-deps-install.md

🚧 Files skipped from review as they are similar to previous changes (1)

components/core/src/clp/NetworkReader.hpp

🧰 Additional context used

🪛 cppcheck

components/core/src/clp_s/ArchiveWriter.cpp

[error] 191-191: Exception thrown in function declared not to throw exceptions.

(throwInNoexceptFunction)

🔇 Additional comments (37)

components/core/src/clp_s/SchemaWriter.cpp (4)

7-7: Size tracking addition looks good, eh!

The addition of header size to total uncompressed size is correctly placed before storing the column writer.

12-13: Good use of uniform initialization, eh!

The switch to uniform initialization with curly braces is more consistent with modern C++ practices and prevents narrowing conversions.

15-16: Size tracking logic is now more straightforward, eh!

The simplified approach of directly adding return values from add_value() improves code readability. The total size is properly accumulated for both content types and correctly updates m_total_uncompressed_size.

Also applies to: 20-21, 25-25

29-32: Method signature change aligns with refactoring, eh!

The change from size_t to void return type is consistent with the broader refactoring effort seen in related classes.

Let's verify the consistency of this change across related classes:

✅ Verification successful

The previous ast-grep pattern might have failed due to syntax. Let's try a broader search to find store method implementations and their return types across the codebase.

Return type change is consistent across the codebase, eh!

The search results confirm that all store methods using ZstdCompressor are consistently declared with void return type across the codebase:

Base class ColumnWriter declares virtual void store(ZstdCompressor&)
All derived column writers (Int64ColumnWriter, FloatColumnWriter, etc.) implement void store(ZstdCompressor&)
SchemaWriter also uses void store(ZstdCompressor&)

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Search for store method signatures in related classes
ast-grep --pattern 'void store(ZstdCompressor& $_) {
  $$$
}'

Length of output: 63

Script:

#!/bin/bash
# Search for store method implementations with ZstdCompressor parameter
rg -U "store.*ZstdCompressor" -A 3

# Also search for any return type declarations with store and ZstdCompressor
rg -U "(\w+)\s+store.*ZstdCompressor.*\{" --multiline

Length of output: 5139

components/core/src/clp_s/SchemaWriter.hpp (2)

49-51: LGTM: Well-structured size tracking implementation

The addition of get_total_uncompressed_size() method and its corresponding member variable is well-implemented:

Clear documentation explaining the purpose
Proper initialization of the member variable
Consistent type usage between getter and member

Also applies to: 55-55

44-44: Verify impact of removing size_t return value from store()

The change from size_t to void return type is a breaking change. Please ensure all callers of this method have been updated accordingly.

Let's verify the usage:

✅ Verification successful

Let me search for more context about the usage of this method.

Let me check if there are any direct calls to this method from other files.

Let me fix the search pattern and try again.

The store() method's return type change is safe to proceed

Based on the codebase analysis, the change from size_t to void return type is safe because:

The only direct calls to SchemaWriter::store() are in SchemaWriter.cpp itself, where the return value isn't used
The method is called through ArchiveWriter::store_tables(), but the actual size tracking is now handled by the new m_total_uncompressed_size member variable and its getter
All other store() calls in the codebase are unrelated to this class (they belong to different components like NetworkReader)

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Search for calls to SchemaWriter::store to verify they don't depend on the return value
rg -A 2 "SchemaWriter.*store\(" --type cpp

Length of output: 300

Script:

#!/bin/bash
# Search for any references to SchemaWriter and store
rg -B 2 -A 2 "SchemaWriter.*store" --type cpp

# Look for the implementation of store method
ast-grep --pattern 'void SchemaWriter::store($$$) { $$$ }'

Length of output: 453

Script:

#!/bin/bash
# Search for any direct usage of SchemaWriter's store method
rg -B 3 -A 3 "->store\(|\.store\(" --type cpp

# Also check if SchemaWriter is used in any test files
fd -e cpp -e hpp test | xargs rg "SchemaWriter"

Length of output: 500

Script:

#!/bin/bash
# Search for store method calls with proper escaping
rg -B 3 -A 3 "\->store\(|\\.store\(" --type cpp

# Look for any references to SchemaWriter in the entire codebase
rg -B 2 -A 2 "SchemaWriter" --type cpp

Length of output: 9849

.github/workflows/clp-core-build-macos.yaml (2)

11-11: Verify the directory structure changes for macOS scripts

The path changes from macos-12 to macos suggest a restructuring of the installation scripts directory. Let's verify that the corresponding directory changes have been made.

Also applies to: 24-24

59-59: Verify the installation script location and contents

The installation script path has been updated to use the new directory structure. Let's verify the script's existence and contents.

✅ Verification successful

Let me gather more information about the script's contents to ensure it has the expected functionality.

Installation script verified and properly located

The macOS installation script exists at the specified path and contains the expected package installation commands using Homebrew. The script:

Is executable and properly located in the new directory structure
Contains essential setup commands for installing required dependencies
Follows the same pattern as other platform-specific installation scripts
Includes proper error handling with set -e and set -u

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify the installation script exists and contains expected content

echo "Checking installation script..."
SCRIPT_PATH="components/core/tools/scripts/lib_install/macos/install-all.sh"

# Check if the script exists and is executable
if [ ! -x "$SCRIPT_PATH" ]; then
    echo "Error: Installation script not found or not executable"
    exit 1
fi

# Check if the script contains essential setup commands
rg -l "brew" "$SCRIPT_PATH"

Length of output: 368

Script:

#!/bin/bash
# Check the contents of the installation script
echo "=== Installation Script Contents ==="
cat components/core/tools/scripts/lib_install/macos/install-all.sh

# Check if there are any other related installation scripts
echo -e "\n=== Related Installation Scripts ==="
fd install-all.sh

Length of output: 840

components/core/src/clp_s/JsonParser.hpp (1)

35-35: Verify usage and add documentation for min_table_size.

The new min_table_size member lacks documentation explaining its purpose and constraints. Additionally, let's verify its usage across the codebase.

Consider adding documentation to explain:

The purpose of this configuration option
Any constraints on valid values
The default value if not specified

✅ Verification successful

Let me gather more information about the usage and potential documentation by searching for command line arguments and related configuration.

Add documentation for min_table_size member

Based on the command line argument documentation and usage found in the codebase, this member represents the minimum size in bytes for a packed table before it gets compressed. The default value is 1 MB (1024*1024 bytes).

Add the following documentation above the member:

/// Minimum size in bytes for a packed table before compression.
/// Default: 1 MB (1024*1024 bytes)
size_t min_table_size;

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Search for usage of min_table_size to understand its purpose and verify documentation

# Test 1: Find all references to min_table_size
echo "Searching for min_table_size usage..."
rg "min_table_size" -A 3 -B 3

# Test 2: Look for related documentation or comments
echo "Searching for related documentation..."
rg -i "table.?size" --type cpp --type hpp -A 3 -B 3

Length of output: 4521

Script:

#!/bin/bash
# Look for command line argument handling and documentation
rg "get_minimum_table_size|minimum.?table.?size" -A 5 -B 5

# Look for any related configuration or default values
rg "minimum.?table|min.?table" -A 5 -B 5 --type cpp --type hpp

# Search for any related comments or documentation in header files
fd -e hpp -e h | xargs grep -A 5 -B 5 -i "table.*size"

Length of output: 17203

components/core/src/clp_s/ColumnWriter.cpp (6)

4-6: Changes to Int64ColumnWriter look good, eh!

The size calculation is accurate for int64_t values, and the memory handling in the store method is safe.

Also applies to: 9-12

14-16: FloatColumnWriter changes are consistent with the new pattern

The size calculation and storage implementation align well with the overall design.

Also applies to: 19-22

24-26: BooleanColumnWriter changes maintain type safety and efficiency

The implementation properly handles boolean to uint8_t conversion and storage.

Also applies to: 29-32

Line range hint 34-47: ClpStringColumnWriter changes handle dictionary encoding correctly

The size calculation properly accounts for both the dictionary ID and encoded variable sizes.

Also applies to: 50-57

59-64: VariableStringColumnWriter changes are straightforward and correct

The size calculation accurately reflects the dictionary ID storage requirements.

Also applies to: 67-70

72-76: DateStringColumnWriter changes properly handle timestamp pairs

The implementation correctly manages both timestamp and encoding components, with accurate size calculations.

Also applies to: 80-85

components/core/src/clp/CurlDownloadHandler.cpp (2)

26-27: LGTM!

The new parameter is appropriately placed and typed as an optional const reference.

64-68: LGTM!

The header formatting is clean and consistently uses the defined constants.

components/core/src/clp_s/ColumnWriter.hpp (3)

30-32: Method signature changes improve API clarity

The changes to add_value and store methods enhance the interface by:

Returning the size of encoded data directly from add_value
Simplifying the store method by removing unnecessary size return

Also applies to: 38-39

40-47: Well-documented header size calculation method

The new get_total_header_size method is well-documented and provides a clear contract for derived classes. The documentation effectively explains how this value relates to the total size calculation.

62-62: Verify header size implementation requirements

While the method signature updates are consistent across all derived classes, only ClpStringColumnWriter overrides get_total_header_size. Please verify if other column writers should also override this method based on their header data requirements.

Also applies to: 64-64, 79-79, 81-81, 96-96, 98-98, 120-124, 176-176, 178-178, 194-194, 196-196

✅ Verification successful

Let me gather more information about the column writer implementations to verify if they need header size overrides.

Let me try a different approach to find column writer implementations.

Let me try one more search pattern to find column writer implementations.

Only ClpStringColumnWriter requires header size override

Based on the code analysis, only ClpStringColumnWriter needs to override get_total_header_size() because it maintains additional header data for log dictionary entries. Other column writers (Int64ColumnWriter, FloatColumnWriter, BooleanColumnWriter, VariableStringColumnWriter, and DateStringColumnWriter) store simple arrays of values without any header metadata, correctly inheriting the base implementation that returns 0.

The implementation is consistent with the design where:

ClpStringColumnWriter needs header space for dictionary metadata (sizeof(size_t))
Other column writers don't require header space as they only store raw values

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Search for header-related data members or operations in column writer implementations
rg -A 5 "header|Header" components/core/src/clp_s/

Length of output: 5601

Script:

#!/bin/bash
# Search for all column writer class definitions and their implementations
ast-grep --pattern 'class $_ColumnWriter : public ColumnWriter {
  $$$
}'

Length of output: 75

Script:

#!/bin/bash
# Search for column writer class definitions using ripgrep
rg -A 10 "class \w+ColumnWriter\s*:\s*public\s*ColumnWriter" components/core/src/clp_s/

Length of output: 89

Script:

#!/bin/bash
# Search for column writer class definitions in the ColumnWriter.hpp file
cat components/core/src/clp_s/ColumnWriter.hpp

Length of output: 5705

components/core/src/clp_s/ArchiveWriter.hpp (1)

45-61: Add documentation and consider input validation.

The struct requires documentation to explain the relationships between streams, schemas, and messages. Additionally, consider adding validation for schema_id.

+    /**
+     * Metadata for tracking schema information within a stream
+     */
     struct SchemaMetadata {
         SchemaMetadata(
                 uint64_t stream_id,
                 uint64_t stream_offset,
                 int32_t schema_id,
                 uint64_t num_messages
         )
                 : stream_id(stream_id),
                   stream_offset(stream_offset),
                   schema_id(schema_id),
                   num_messages(num_messages) {
+            // Validate schema_id
+            if (schema_id < 0) {
+                throw std::invalid_argument("schema_id must be non-negative");
+            }
         }
 
+        /// Unique identifier for the stream
         uint64_t stream_id{};
+        /// Offset within the stream where this schema begins
         uint64_t stream_offset{};
+        /// Identifier for the schema definition
         int32_t schema_id{};
+        /// Number of messages using this schema in the stream
         uint64_t num_messages{};
     };

Let's verify the schema_id validation requirement:

components/core/src/clp_s/CMakeLists.txt (1)

68-69: LGTM: PackedStreamReader integration

The addition of PackedStreamReader.cpp and PackedStreamReader.hpp to CLP_S_SOURCES is properly integrated into the build system.

components/core/src/clp_s/ArchiveReader.hpp (3)

13-13: LGTM: Include statement properly placed

The include statement for PackedStreamReader is correctly placed following the include ordering convention.

205-205: LGTM: Consistent naming convention

The rename from m_id_to_table_metadata to m_id_to_schema_metadata maintains consistency with the schema-focused naming convention throughout the class.

96-100: Verify the impact of method rename

The rename from read_table to read_schema_table better reflects the method's purpose, but we should verify all call sites have been updated.

✅ Verification successful

Method rename has been properly implemented

The search results confirm that:

No instances of the old method name read_table remain in the codebase
The new method name read_schema_table is consistently used across all references:
- Method declaration in ArchiveReader.hpp
- Method implementation in ArchiveReader.cpp
- Method calls in Output.cpp and internally in ArchiveReader.cpp

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Search for any remaining calls to the old method name
rg "->read_table\(" --type cpp

# Search for calls to the new method name to confirm updates
rg "->read_schema_table\(" --type cpp

Length of output: 882

Script:

#!/bin/bash
# Search for any remaining calls to the old method name, escaping the arrow operator
rg "read_table\(" --type cpp -A 2

# Search for calls to the new method name to confirm updates
rg "read_schema_table\(" --type cpp -A 2

Length of output: 1179

components/core/src/clp_s/CommandLineArguments.hpp (1)

111-112: LGTM! The getter method follows the class conventions.

The method signature and implementation are consistent with other getter methods in the class.

components/core/tests/test-NetworkReader.cpp (1)

9-16: LGTM: New includes are properly organized

The new includes are correctly placed in the standard library section and are necessary for the added functionality.

components/core/src/clp/NetworkReader.cpp (3)

134-139: LGTM: Correct initialization of DownloaderThread

The initialization sequence and parameter passing are logically correct.

224-225: LGTM: Proper error handling and initialization

The CurlDownloadHandler initialization is correctly implemented within the try-catch block, ensuring proper error handling for CURL operations.

134-139: Verify thread safety with custom headers

The implementation appears thread-safe, but let's verify there are no potential race conditions with the custom headers handling.

Also applies to: 224-225

✅ Verification successful

Let me gather more information about the thread safety implementation.

Let me check one more aspect of the implementation to verify thread safety.

Custom headers handling is thread-safe

The implementation is thread-safe because:

HTTP headers are moved (std::move) into the DownloaderThread constructor, creating a separate copy
The m_http_header_kv_pairs member is accessed only within the thread's context
No concurrent access patterns were found in the codebase

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Search for any concurrent access to http_header_kv_pairs
rg -n "http_header_kv_pairs" --type cpp

# Search for mutex-protected access patterns
ast-grep --pattern 'std::unique_lock<std::mutex> $_ { $_ }'

Length of output: 2948

Script:

#!/bin/bash
# Look for mutex declarations and usage in NetworkReader
rg -n "mutex" --type cpp -A 3 components/core/src/clp/NetworkReader

# Check DownloaderThread implementation for thread safety
ast-grep --pattern 'class DownloaderThread {
  $$$
}'

# Look for any shared state access
rg -n "m_http_header_kv_pairs" --type cpp -A 2

Length of output: 962

Script:

#!/bin/bash
# Check NetworkReader class implementation for thread synchronization
ast-grep --pattern 'class NetworkReader {
  $$$
}'

# Look for any member variable access patterns
rg -n "std::move" --type cpp -A 2 components/core/src/clp/NetworkReader.cpp

# Check for any concurrent access patterns in DownloaderThread
rg -n "DownloaderThread::" --type cpp -A 5

Length of output: 784

components/core/src/clp_s/JsonParser.cpp (1)

34-34: LGTM: Archive options initialization is properly implemented.

The initialization of min_table_size is correctly placed within the archive options setup and follows the established pattern for option initialization.

components/core/src/clp_s/search/Output.cpp (1)

Line range hint 87-91: LGTM! Method rename aligns with schema-centric data handling.

The change from read_table to read_schema_table accurately reflects the method's purpose of reading schema-specific table data.

Let's verify that all callers have been updated consistently:

✅ Verification successful

Method rename has been consistently applied across the codebase

The verification confirms that:

The old method name read_table is not present anywhere in the codebase
The new method read_schema_table is properly defined in ArchiveReader.hpp and implemented in ArchiveReader.cpp
All calls to this method, including the one in Output.cpp and internal usage in ArchiveReader.cpp, use the new name

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Search for any remaining calls to the old method name
# Expect: No occurrences of the old method name

# Search for old method name
rg -l 'read_table\(' --type cpp

# Search for new method name for context
rg -l 'read_schema_table\(' --type cpp

Length of output: 206

Script:

#!/bin/bash
# Let's examine the actual method implementations and calls
rg 'read_table\(' --type cpp -A 2
rg 'read_schema_table\(' --type cpp -A 2

Length of output: 1179

components/core/src/clp_s/SchemaReader.hpp (2)

52-56: Ensuring Consistent Use of Fixed-Width Integer Types

Changing uncompressed_size from size_t to uint64_t ensures consistent use of fixed-width integer types within SchemaMetadata. This enhances portability and predictability across different platforms.

139-144: Ensure All Calls to load Are Updated to the New Signature

The load method signature has been updated to accept std::shared_ptr<char[]> stream_buffer, size_t offset, and size_t uncompressed_size. Please verify that all invocations of load have been updated to match this new signature to prevent compilation errors or unintended behaviour.

To confirm, run the following script to locate all calls to load:

components/core/src/clp_s/ArchiveWriter.cpp (1)

190-194: ⚠️ Potential issue

Potential exception in sorting comparator may violate noexcept.

The lambda comparator used in std::sort may throw an exception if get_total_uncompressed_size() throws, potentially violating the noexcept guarantee of std::sort. Ensure that get_total_uncompressed_size() is noexcept or handle exceptions appropriately to prevent undefined behaviour.

Run the following script to verify whether get_total_uncompressed_size() is declared noexcept:

🧰 Tools

🪛 cppcheck

[error] 191-191: Exception thrown in function declared not to throw exceptions.

(throwInNoexceptFunction)

components/core/src/clp_s/ArchiveReader.cpp (1)

83-85: ⚠️ Potential issue

Validate stream offsets to prevent out-of-bounds access.

The check if (stream_offset > m_stream_reader.get_uncompressed_stream_size(stream_id)) assumes that stream_offset should not exceed the stream size. Ensure that stream_offset is always within valid bounds to prevent potential buffer overruns.

components/core/src/clp_s/SchemaReader.cpp (1)

40-46: Changes to load method appear correct

The modifications to the load method signature and implementation are appropriate. The use of std::shared_ptr<char[]> for stream_buffer, along with offset and uncompressed_size, enhances buffer management. Assigning m_stream_buffer = stream_buffer and initializing BufferViewReader with m_stream_buffer.get() + offset correctly handles the buffer without introducing ownership issues.

coderabbitai · 2024-10-30T18:19:05Z

components/core/src/clp/CurlDownloadHandler.cpp

+    constexpr std::string_view cRangeHeaderName{"range"};
+    constexpr std::string_view cCacheControlHeaderName{"cache-control"};
+    constexpr std::string_view cPragmaHeaderName{"pragma"};
+    std::unordered_set<std::string_view> reserved_headers;
+    reserved_headers.insert(cRangeHeaderName);
+    reserved_headers.insert(cCacheControlHeaderName);
+    reserved_headers.insert(cPragmaHeaderName);


🛠️ Refactor suggestion

Consider using constexpr array for reserved headers.

While using unordered_set provides O(1) lookup, constructing it in the constructor adds runtime overhead. Since we have a small, fixed set of reserved headers, a constexpr array with a simple linear search would be more efficient.

constexpr std::string_view cRangeHeaderName{"range"}; constexpr std::string_view cCacheControlHeaderName{"cache-control"}; constexpr std::string_view cPragmaHeaderName{"pragma"}; -std::unordered_set<std::string_view> reserved_headers; -reserved_headers.insert(cRangeHeaderName); -reserved_headers.insert(cCacheControlHeaderName); -reserved_headers.insert(cPragmaHeaderName); +constexpr std::array<std::string_view, 3> cReservedHeaders{ + cRangeHeaderName, + cCacheControlHeaderName, + cPragmaHeaderName +};

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

constexpr std::string_view cRangeHeaderName{"range"};

constexpr std::string_view cCacheControlHeaderName{"cache-control"};

constexpr std::string_view cPragmaHeaderName{"pragma"};

std::unordered_set<std::string_view> reserved_headers;

reserved_headers.insert(cRangeHeaderName);

reserved_headers.insert(cCacheControlHeaderName);

reserved_headers.insert(cPragmaHeaderName);

constexpr std::string_view cRangeHeaderName{"range"};

constexpr std::string_view cCacheControlHeaderName{"cache-control"};

constexpr std::string_view cPragmaHeaderName{"pragma"};

constexpr std::array<std::string_view, 3> cReservedHeaders{

cRangeHeaderName,

cCacheControlHeaderName,

cPragmaHeaderName

};

coderabbitai · 2024-10-30T18:19:05Z

components/core/src/clp/CurlDownloadHandler.cpp

+    if (http_header_kv_pairs.has_value()) {
+        for (auto const& [key, value] : http_header_kv_pairs.value()) {
+            // Convert to lowercase for case-insensitive comparison
+            std::string lower_key = key;
+            std::transform(
+                    lower_key.begin(),
+                    lower_key.end(),
+                    lower_key.begin(),
+                    [](unsigned char c) { return std::tolower(c); }
+            );
+            if (reserved_headers.contains(lower_key) || value.ends_with("\r\n")) {
+                throw CurlOperationFailed(
+                        ErrorCode_Failure,
+                        __FILE__,
+                        __LINE__,
+                        CURLE_BAD_FUNCTION_ARGUMENT,
+                        fmt::format(
+                                "curl_download_handler_init failed due to "
+                                "header: {}: "
+                                "{}.",
+                                key,
+                                value
+                        )
+                );
+            } else {
+                m_http_headers.append(fmt::format("{}: {}", key, value));
+            }
+        }


🛠️ Refactor suggestion

Improve code structure and error messages.

The current if-else structure increases nesting depth unnecessarily.

The error message could be more specific about whether the failure was due to a reserved header or CRLF termination.

if (http_header_kv_pairs.has_value()) { for (auto const& [key, value] : http_header_kv_pairs.value()) { // Convert to lowercase for case-insensitive comparison std::string lower_key = key; std::transform( lower_key.begin(), lower_key.end(), lower_key.begin(), [](unsigned char c) { return std::tolower(c); } ); - if (reserved_headers.contains(lower_key) || value.ends_with("\r\n")) { + // Check for reserved headers + if (reserved_headers.contains(lower_key)) { throw CurlOperationFailed( ErrorCode_Failure, __FILE__, __LINE__, CURLE_BAD_FUNCTION_ARGUMENT, fmt::format( - "curl_download_handler_init failed due to " - "header: {}: " - "{}.", - key, - value + "Cannot override reserved header: '{}'", + key ) ); - } else { - m_http_headers.append(fmt::format("{}: {}", key, value)); } + + // Check for CRLF termination + if (value.ends_with("\r\n")) { + throw CurlOperationFailed( + ErrorCode_Failure, + __FILE__, + __LINE__, + CURLE_BAD_FUNCTION_ARGUMENT, + fmt::format( + "Header value must not be CRLF-terminated: '{}: {}'", + key, + value + ) + ); + } + + m_http_headers.append(fmt::format("{}: {}", key, value)); } }

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

if (http_header_kv_pairs.has_value()) {

for (auto const& [key, value] : http_header_kv_pairs.value()) {

// Convert to lowercase for case-insensitive comparison

std::string lower_key = key;

std::transform(

lower_key.begin(),

lower_key.end(),

lower_key.begin(),

[](unsigned char c) { return std::tolower(c); }

);

if (reserved_headers.contains(lower_key) || value.ends_with("\r\n")) {

throw CurlOperationFailed(

ErrorCode_Failure,

__FILE__,

__LINE__,

CURLE_BAD_FUNCTION_ARGUMENT,

fmt::format(

"curl_download_handler_init failed due to "

"header: {}: "

"{}.",

key,

value

)

);

} else {

m_http_headers.append(fmt::format("{}: {}", key, value));

}

}

if (http_header_kv_pairs.has_value()) {

for (auto const& [key, value] : http_header_kv_pairs.value()) {

// Convert to lowercase for case-insensitive comparison

std::string lower_key = key;

std::transform(

lower_key.begin(),

lower_key.end(),

lower_key.begin(),

[](unsigned char c) { return std::tolower(c); }

);

// Check for reserved headers

if (reserved_headers.contains(lower_key)) {

throw CurlOperationFailed(

ErrorCode_Failure,

__FILE__,

__LINE__,

CURLE_BAD_FUNCTION_ARGUMENT,

fmt::format(

"Cannot override reserved header: '{}'",

key

)

);

}

// Check for CRLF termination

if (value.ends_with("\r\n")) {

throw CurlOperationFailed(

ErrorCode_Failure,

__FILE__,

__LINE__,

CURLE_BAD_FUNCTION_ARGUMENT,

fmt::format(

"Header value must not be CRLF-terminated: '{}: {}'",

key,

value

)

);

}

m_http_headers.append(fmt::format("{}: {}", key, value));

}

coderabbitai · 2024-10-30T18:19:05Z

components/core/src/clp/NetworkReader.cpp

+        size_t buffer_size,
+        std::optional<std::unordered_map<std::string, std::string>> http_header_kv_pairs


🛠️ Refactor suggestion

Consider passing large parameter by value for move semantics

The http_header_kv_pairs parameter could be more efficiently handled by passing it by value and moving it into the member variable, rather than passing by const reference. This would enable move semantics and avoid potential lifetime issues.

- std::optional<std::unordered_map<std::string, std::string>> http_header_kv_pairs + std::optional<std::unordered_map<std::string, std::string>> http_header_kv_pairs = std::nullopt

Committable suggestion was skipped due to low confidence.

coderabbitai · 2024-10-30T18:19:05Z

deps-tasks.yml

+          SRC_NAME: "yscope-log-viewer-4c69bc11dbe8a5d87b5fbfb0e43a2f2a06f04866"
+          SRC_URL: "https://github.com/y-scope/yscope-log-viewer/archive/4c69bc1.zip"


💡 Codebase verification

Update dependency version to one that supports custom headers

Based on the commit history and PR information, the current version (4c69bc1) is not related to custom headers support. This version only adds a warning about decoder format string syntax changes. There are no recent commits or PRs related to custom headers functionality.

The current commit (4c69bc1) focuses on decoder format string warnings

No evidence of custom headers support in recent commits or PRs

Consider updating to a version that includes the required custom headers functionality

🔗 Analysis chain

Verify the yscope-log-viewer dependency update.

The dependency is being updated to a new commit (4c69bc1). Let's verify if this is the correct version that supports the custom headers functionality.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash # Description: Verify the commit exists and check its relationship to custom headers support # Test 1: Verify the commit exists and get its details echo "Verifying commit details..." gh api repos/y-scope/yscope-log-viewer/commits/4c69bc11dbe8a5d87b5fbfb0e43a2f2a06f04866 --jq '.commit.message' # Test 2: Check if this commit is related to custom headers echo "Checking commit history for custom headers changes..." gh api repos/y-scope/yscope-log-viewer/commits/4c69bc11dbe8a5d87b5fbfb0e43a2f2a06f04866/pulls --jq '.[].title'

Length of output: 570

Script:

#!/bin/bash # Let's check the commit history around this version to see if there are any custom headers related changes echo "Checking recent commits..." gh api repos/y-scope/yscope-log-viewer/commits --per-page 10 --jq '.[].commit.message' # Also check open and recently closed PRs for custom headers echo "Checking recent PRs..." gh api repos/y-scope/yscope-log-viewer/pulls --state all --per-page 10 --jq '.[].title'

Length of output: 3031

Script:

#!/bin/bash # Let's try a different approach using simpler GitHub API queries echo "Checking recent commits..." gh api /repos/y-scope/yscope-log-viewer/commits?per_page=10 --jq '.[].commit.message' echo "Checking recent PRs..." gh api /repos/y-scope/yscope-log-viewer/pulls?state=all --jq '.[].title'

Length of output: 4077

coderabbitai · 2024-10-30T18:19:06Z

components/core/src/clp_s/PackedStreamReader.hpp

+    /**
+     * Decompresses a stream with a given stream_id and returns it. This function must be called
+     * strictly in ascending stream_id order. If this function is called twice for the same stream
+     * or if a stream with lower id is requested after a stream with higher id then an error is
+     * thrown.
+     *
+     * Note: the buffer and buffer size are returned by reference. This is to support the use case
+     * where the caller wants to re-use the same buffer for multiple streams to avoid allocations
+     * when they already have a sufficiently large buffer. If no buffer is provided or the provided
+     * buffer is too small calling read_stream will create a buffer exactly as large as the stream
+     * being decompressed.
+     *
+     * @param stream_id
+     * @param buf a shared ptr to the buffer where the stream will be read. The buffer gets resized
+     * if it is too small to contain the requested stream.
+     * @param buf_size the size of the underlying buffer owned by buf -- passed and updated by
+     * reference
+     */
+    void read_stream(size_t stream_id, std::shared_ptr<char[]>& buf, size_t& buf_size);


⚠️ Potential issue

Initialize m_prev_stream_id appropriately to avoid comparison issues

The member m_prev_stream_id is initialized to 0ULL (line 92), which could cause the first call to read_stream with stream_id = 0 to fail the strict ordering check. Since read_stream requires stream_id to be strictly greater than m_prev_stream_id, initializing m_prev_stream_id to std::numeric_limits<size_t>::max() would prevent this issue.

Apply this diff to fix the initialization and include the necessary header:

+#include <limits> ... - size_t m_prev_stream_id{0ULL}; + size_t m_prev_stream_id{std::numeric_limits<size_t>::max()};

Also applies to: 92-92

coderabbitai · 2024-10-30T18:19:06Z

components/core/src/clp_s/ArchiveReader.cpp


-    m_tables_file_reader.open(archive_path_str + constants::cArchiveTablesFile);
    m_table_metadata_file_reader.open(archive_path_str + constants::cArchiveTableMetadataFile);
+    m_stream_reader.open_packed_streams(archive_path_str + constants::cArchiveTablesFile);


⚠️ Potential issue

Add error handling for open_packed_streams.

The call to m_stream_reader.open_packed_streams(...) lacks error handling for potential failures, such as missing files or read errors. Consider checking the return value or wrapping the call in a try-catch block to handle exceptions appropriately.

coderabbitai · 2024-10-30T18:19:07Z

components/core/src/clp_s/ArchiveReader.cpp

+    m_stream_reader.read_metadata(m_table_metadata_decompressor);
+


⚠️ Potential issue

Check for errors after reading metadata.

After invoking m_stream_reader.read_metadata(m_table_metadata_decompressor);, there is no verification to ensure the metadata was read successfully. It's advisable to handle any potential errors that might occur during this operation.

coderabbitai · 2024-10-30T18:19:07Z

components/core/src/clp_s/ArchiveReader.cpp

+    bool prev_metadata_initialized{false};
+    SchemaReader::SchemaMetadata prev_metadata{};
+    int32_t prev_schema_id{};


⚠️ Potential issue

Ensure variables are properly initialised before use.

The variables prev_metadata_initialized, prev_metadata, and prev_schema_id are used within the loop but may not be properly initialised in all code paths. Verify that they are correctly set before the loop to prevent undefined behaviour.

coderabbitai · 2024-10-30T18:19:07Z

components/core/src/clp_s/ArchiveReader.cpp

+        if (prev_metadata_initialized) {
+            uint64_t uncompressed_size{0};
+            if (stream_id != prev_metadata.stream_id) {
+                uncompressed_size
+                        = m_stream_reader.get_uncompressed_stream_size(prev_metadata.stream_id)
+                          - prev_metadata.stream_offset;
+            } else {
+                uncompressed_size = stream_offset - prev_metadata.stream_offset;
+            }
+            prev_metadata.uncompressed_size = uncompressed_size;
+            m_id_to_schema_metadata[prev_schema_id] = prev_metadata;
+        } else {


⚠️ Potential issue

Guard against underflow in uncompressed size calculation.

In the calculation of uncompressed_size, subtracting prev_metadata.stream_offset from stream_offset could result in a negative value if stream_offset is less than prev_metadata.stream_offset. Since uncompressed_size is an unsigned integer, this would lead to underflow. Add a check to ensure stream_offset >= prev_metadata.stream_offset.

coderabbitai · 2024-10-30T18:19:07Z

components/core/src/clp_s/ArchiveReader.cpp

+std::shared_ptr<char[]> ArchiveReader::read_stream(size_t stream_id, bool reuse_buffer) {
+    if (nullptr != m_stream_buffer && m_cur_stream_id == stream_id) {
+        return m_stream_buffer;
+    }
+
+    if (false == reuse_buffer) {
+        m_stream_buffer.reset();
+        m_stream_buffer_size = 0;
+    }
+
+    m_stream_reader.read_stream(stream_id, m_stream_buffer, m_stream_buffer_size);
+    m_cur_stream_id = stream_id;
+    return m_stream_buffer;
+}


⚠️ Potential issue

Address thread safety for shared resources in read_stream.

The method read_stream modifies shared members m_stream_buffer, m_cur_stream_id, and m_stream_buffer_size. If ArchiveReader is accessed by multiple threads, this could cause race conditions. Ensure that access to shared resources is properly synchronised or document that the class is not thread-safe.

coderabbitai

Actionable comments posted: 2

🧹 Outside diff range and nitpick comments (2)

components/core/tests/test-NetworkReader.cpp (2)

196-223: LGTM! The test case effectively validates custom header functionality.

The test comprehensively verifies that custom headers are correctly transmitted and received.

Consider these improvements:

The constant name cNumHttpHeaderKeyValuePairs could be more descriptive, e.g., cNumTestHeaders

Consider adding a brief comment explaining why 10 headers were chosen as the test size

236-236: Remove commented out code.

The line // REQUIRE(test_illegal_header(header_name, header_value)); should be removed as it appears to be obsolete.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between 9aaed32 and a901a4d.

📒 Files selected for processing (5)

components/core/src/clp/CurlDownloadHandler.cpp (3 hunks)
components/core/src/clp/CurlDownloadHandler.hpp (3 hunks)
components/core/src/clp/NetworkReader.cpp (4 hunks)
components/core/src/clp/NetworkReader.hpp (5 hunks)
components/core/tests/test-NetworkReader.cpp (2 hunks)

🚧 Files skipped from review as they are similar to previous changes (3)

components/core/src/clp/CurlDownloadHandler.hpp
components/core/src/clp/NetworkReader.cpp
components/core/src/clp/NetworkReader.hpp

🔇 Additional comments (4)

components/core/src/clp/CurlDownloadHandler.cpp (3)

3-18: LGTM! All necessary includes are present and properly ordered.

The new includes support the header processing functionality and follow the project's include ordering convention.

31-32: LGTM! Parameter type is well-chosen.

The use of std::optional<std::unordered_map<std::string, std::string>> const& is appropriate as it:

Allows optional header specification
Prevents unnecessary copying through const reference
Uses string type for header names and values

61-67: 🛠️ Refactor suggestion

Consider using a constexpr array for reserved headers.

While the current implementation works, using an unordered_set introduces unnecessary runtime overhead. Since we have a small, fixed set of reserved headers, a constexpr array would be more efficient.

 constexpr std::string_view cRangeHeaderName{"range"};
 constexpr std::string_view cCacheControlHeaderName{"cache-control"};
 constexpr std::string_view cPragmaHeaderName{"pragma"};
-std::unordered_set<std::string_view> reserved_headers;
-reserved_headers.insert(cRangeHeaderName);
-reserved_headers.insert(cCacheControlHeaderName);
-reserved_headers.insert(cPragmaHeaderName);
+constexpr std::array<std::string_view, 3> cReservedHeaders{
+    cRangeHeaderName,
+    cCacheControlHeaderName,
+    cPragmaHeaderName
+};

Likely invalid or redundant comment.

components/core/tests/test-NetworkReader.cpp (1)

9-16: LGTM! The new includes are appropriate for the added functionality.

The added headers support the new test cases by providing necessary data structures and formatting capabilities.

coderabbitai · 2024-10-30T20:50:30Z

components/core/src/clp/CurlDownloadHandler.cpp

+    if (http_header_kv_pairs.has_value()) {
+        for (auto const& [key, value] : http_header_kv_pairs.value()) {
+            // Convert to lowercase for case-insensitive comparison
+            std::string lower_key = key;
+            std::transform(
+                    lower_key.begin(),
+                    lower_key.end(),
+                    lower_key.begin(),
+                    [](unsigned char c) { return std::tolower(c); }
+            );
+            if (reserved_headers.contains(lower_key) || value.ends_with("\r\n")) {
+                throw CurlOperationFailed(
+                        ErrorCode_Failure,
+                        __FILE__,
+                        __LINE__,
+                        CURLE_BAD_FUNCTION_ARGUMENT,
+                        fmt::format(
+                                "curl_download_handler_init failed due to "
+                                "header: {}: "
+                                "{}.",
+                                key,
+                                value
+                        )
+                );
+            }
+            m_http_headers.append(fmt::format("{}: {}", key, value));
+        }


🛠️ Refactor suggestion

Improve code structure and error messages.

The current implementation can be improved by:

Separating validation checks for better clarity

Providing more specific error messages

Reducing nesting depth

if (http_header_kv_pairs.has_value()) { for (auto const& [key, value] : http_header_kv_pairs.value()) { std::string lower_key = key; std::transform( lower_key.begin(), lower_key.end(), lower_key.begin(), [](unsigned char c) { return std::tolower(c); } ); - if (reserved_headers.contains(lower_key) || value.ends_with("\r\n")) { + // Check for reserved headers + if (reserved_headers.contains(lower_key)) { throw CurlOperationFailed( ErrorCode_Failure, __FILE__, __LINE__, CURLE_BAD_FUNCTION_ARGUMENT, fmt::format( - "curl_download_handler_init failed due to " - "header: {}: " - "{}.", - key, - value + "Cannot override reserved header: '{}'", + key ) ); - } - m_http_headers.append(fmt::format("{}: {}", key, value)); + } + + // Check for CRLF termination + if (value.ends_with("\r\n")) { + throw CurlOperationFailed( + ErrorCode_Failure, + __FILE__, + __LINE__, + CURLE_BAD_FUNCTION_ARGUMENT, + fmt::format( + "Header value must not be CRLF-terminated: '{}: {}'", + key, + value + ) + ); + } + + m_http_headers.append(fmt::format("{}: {}", key, value)); } }

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

if (http_header_kv_pairs.has_value()) {

for (auto const& [key, value] : http_header_kv_pairs.value()) {

// Convert to lowercase for case-insensitive comparison

std::string lower_key = key;

std::transform(

lower_key.begin(),

lower_key.end(),

lower_key.begin(),

[](unsigned char c) { return std::tolower(c); }

);

if (reserved_headers.contains(lower_key) || value.ends_with("\r\n")) {

throw CurlOperationFailed(

ErrorCode_Failure,

__FILE__,

__LINE__,

CURLE_BAD_FUNCTION_ARGUMENT,

fmt::format(

"curl_download_handler_init failed due to "

"header: {}: "

"{}.",

key,

value

)

);

}

m_http_headers.append(fmt::format("{}: {}", key, value));

}

if (http_header_kv_pairs.has_value()) {

for (auto const& [key, value] : http_header_kv_pairs.value()) {

std::string lower_key = key;

std::transform(

lower_key.begin(),

lower_key.end(),

lower_key.begin(),

[](unsigned char c) { return std::tolower(c); }

);

// Check for reserved headers

if (reserved_headers.contains(lower_key)) {

throw CurlOperationFailed(

ErrorCode_Failure,

__FILE__,

__LINE__,

CURLE_BAD_FUNCTION_ARGUMENT,

fmt::format(

"Cannot override reserved header: '{}'",

key

)

);

}

// Check for CRLF termination

if (value.ends_with("\r\n")) {

throw CurlOperationFailed(

ErrorCode_Failure,

__FILE__,

__LINE__,

CURLE_BAD_FUNCTION_ARGUMENT,

fmt::format(

"Header value must not be CRLF-terminated: '{}: {}'",

key,

value

)

);

}

m_http_headers.append(fmt::format("{}: {}", key, value));

}

coderabbitai · 2024-10-30T20:50:30Z

components/core/tests/test-NetworkReader.cpp

+    REQUIRE([header_name, header_value]() {
+        std::unordered_map<std::string, std::string> illegal_http_header_kv_pairs;
+        illegal_http_header_kv_pairs.emplace(header_name, header_value);
+        clp::NetworkReader illegal_reader{
+                "https://httpbin.org/headers",
+                0,
+                false,
+                clp::CurlDownloadHandler::cDefaultOverallTimeout,
+                clp::CurlDownloadHandler::cDefaultConnectionTimeout,
+                clp::NetworkReader::cDefaultBufferPoolSize,
+                clp::NetworkReader::cDefaultBufferSize,
+                illegal_http_header_kv_pairs
+        };
+        auto const content = get_content(illegal_reader);
+        return assert_curl_error_code(CURLE_BAD_FUNCTION_ARGUMENT, illegal_reader)
+               && content.empty();
+    }());


🛠️ Refactor suggestion

Consider simplifying the lambda function.

The lambda could be more concise by removing the explicit return statement and letting the expression flow naturally.

- REQUIRE([header_name, header_value]() { - std::unordered_map<std::string, std::string> illegal_http_header_kv_pairs; - illegal_http_header_kv_pairs.emplace(header_name, header_value); - clp::NetworkReader illegal_reader{ - "https://httpbin.org/headers", - 0, - false, - clp::CurlDownloadHandler::cDefaultOverallTimeout, - clp::CurlDownloadHandler::cDefaultConnectionTimeout, - clp::NetworkReader::cDefaultBufferPoolSize, - clp::NetworkReader::cDefaultBufferSize, - illegal_http_header_kv_pairs - }; - auto const content = get_content(illegal_reader); - return assert_curl_error_code(CURLE_BAD_FUNCTION_ARGUMENT, illegal_reader) - && content.empty(); - }()); + std::unordered_map<std::string, std::string> illegal_http_header_kv_pairs{{header_name, header_value}}; + clp::NetworkReader illegal_reader{ + "https://httpbin.org/headers", + 0, + false, + clp::CurlDownloadHandler::cDefaultOverallTimeout, + clp::CurlDownloadHandler::cDefaultConnectionTimeout, + clp::NetworkReader::cDefaultBufferPoolSize, + clp::NetworkReader::cDefaultBufferSize, + illegal_http_header_kv_pairs + }; + auto const content = get_content(illegal_reader); + REQUIRE(assert_curl_error_code(CURLE_BAD_FUNCTION_ARGUMENT, illegal_reader)); + REQUIRE(content.empty());

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

REQUIRE([header_name, header_value]() {

std::unordered_map<std::string, std::string> illegal_http_header_kv_pairs;

illegal_http_header_kv_pairs.emplace(header_name, header_value);

clp::NetworkReader illegal_reader{

"https://httpbin.org/headers",

0,

false,

clp::CurlDownloadHandler::cDefaultOverallTimeout,

clp::CurlDownloadHandler::cDefaultConnectionTimeout,

clp::NetworkReader::cDefaultBufferPoolSize,

clp::NetworkReader::cDefaultBufferSize,

illegal_http_header_kv_pairs

};

auto const content = get_content(illegal_reader);

return assert_curl_error_code(CURLE_BAD_FUNCTION_ARGUMENT, illegal_reader)

&& content.empty();

}());

std::unordered_map<std::string, std::string> illegal_http_header_kv_pairs{{header_name, header_value}};

clp::NetworkReader illegal_reader{

"https://httpbin.org/headers",

0,

false,

clp::CurlDownloadHandler::cDefaultOverallTimeout,

clp::CurlDownloadHandler::cDefaultConnectionTimeout,

clp::NetworkReader::cDefaultBufferPoolSize,

clp::NetworkReader::cDefaultBufferSize,

illegal_http_header_kv_pairs

};

auto const content = get_content(illegal_reader);

REQUIRE(assert_curl_error_code(CURLE_BAD_FUNCTION_ARGUMENT, illegal_reader));

REQUIRE(content.empty());

🎉 added custom headers

dd05e19

anlowee marked this pull request as ready for review October 24, 2024 21:46

coderabbitai bot reviewed Oct 24, 2024

View reviewed changes

components/core/src/clp/CurlDownloadHandler.cpp Outdated Show resolved Hide resolved

components/core/src/clp/CurlDownloadHandler.hpp Outdated Show resolved Hide resolved

gibber9809 and others added 2 commits October 24, 2024 22:52

clp-s: Implement table packing (#466)

4daf8d8

Co-authored-by: wraymo <37269683+wraymo@users.noreply.github.com> Co-authored-by: Kirk Rodrigues <2454684+kirkrodrigues@users.noreply.github.com> Co-authored-by: wraymo <raywangv@gmail.com>

👕 fixed coderabbitai suggestions

f134870

coderabbitai bot reviewed Oct 25, 2024

View reviewed changes

components/core/src/clp/CurlDownloadHandler.cpp Outdated Show resolved Hide resolved

components/core/tests/test-NetworkReader.cpp Outdated Show resolved Hide resolved

👕 fixed coderabbitai suggestions

fc7aa7d

coderabbitai bot reviewed Oct 25, 2024

View reviewed changes

components/core/src/clp/CurlDownloadHandler.cpp Outdated Show resolved Hide resolved

components/core/src/clp/CurlDownloadHandler.cpp Outdated Show resolved Hide resolved

👕 fixed coderabbitai suggestions

c8acc12

anlowee requested a review from LinZhihao-723 October 25, 2024 14:41

coderabbitai bot reviewed Oct 25, 2024

View reviewed changes

components/core/src/clp/CurlDownloadHandler.cpp Outdated Show resolved Hide resolved

components/core/src/clp/CurlDownloadHandler.cpp Outdated Show resolved Hide resolved

anlowee added 2 commits October 25, 2024 10:53

👕 fixed coderabbitai suggestions

3e2b3cb

👕 fixed linting

e1c0605

coderabbitai bot reviewed Oct 25, 2024

View reviewed changes

LinZhihao-723 requested changes Oct 25, 2024

View reviewed changes

anlowee and others added 4 commits October 26, 2024 22:33

💍 addressing comments from a senior

064f8a2

log-viewer-webui: Update yscope-log-viewer to the latest version. (#…

829013b

…565)

🚧 adding exception

4e99893

:stop_sigh: added throwing exception when the headers contains illega…

cf46ab7

…l name or value

coderabbitai bot reviewed Oct 28, 2024

View reviewed changes

👕 fixed coderabbitai issues

d995a0f

coderabbitai bot reviewed Oct 28, 2024

View reviewed changes

anlowee added 2 commits October 28, 2024 16:09

👕 fixed Zhihao two comments

687aecd

🔥 removed clangd cache

91209ac

coderabbitai bot reviewed Oct 28, 2024

View reviewed changes

anlowee added 4 commits October 28, 2024 22:30

🧪 testing why macos build failed

8de14fd

🧪 trying to fix the bug

7fd3811

🧪 trying to fix the bug (forgot to clp-lint)

0daa1a2

🧪 trying to fix the bug and lint-check

e8f7740

ci: Switch GitHub macOS build workflow to use macos-13 (x86) and maco…

b92a6c6

…s-14 (ARM) runners. (#566)

anlowee requested a review from LinZhihao-723 October 29, 2024 23:30

anlowee added 18 commits October 29, 2024 19:42

🎉 added custom headers

9ee7ae2

👕 fixed coderabbitai suggestions

2c8e54c

👕 fixed coderabbitai suggestions

dc30f78

👕 fixed coderabbitai suggestions

deb707f

👕 fixed coderabbitai suggestions

1c96a4b

👕 fixed linting

c8e6d1c

💍 addressing comments from a senior

7293569

🚧 adding exception

96fc749

:stop_sigh: added throwing exception when the headers contains illega…

37b99e9

…l name or value

👕 fixed coderabbitai issues

e979f6b

👕 fixed Zhihao two comments

7c02460

🧪 testing why macos build failed

75aed54

🧪 trying to fix the bug

35adc92

🧪 trying to fix the bug (forgot to clp-lint)

eaecefc

🧪 trying to fix the bug and lint-check

39fa843

👕 addressed some comments from a senior

82a8322

👕 only check CRLF

fb53142

Merge branch 'xiaochong-add-user-customized-header' of github.com:anl…

58ac048

…owee/clp into xiaochong-add-user-customized-header

LinZhihao-723 reviewed Oct 30, 2024

View reviewed changes

LinZhihao-723 requested changes Oct 30, 2024

View reviewed changes

👕 addressed all comments except clang-tidy

9aaed32

coderabbitai bot reviewed Oct 30, 2024

View reviewed changes

root added 2 commits October 30, 2024 20:35

👕 finally applied clang-tidy

647026c

♻️ revert the rebase

a901a4d

coderabbitai bot reviewed Oct 30, 2024

View reviewed changes

anlowee closed this Oct 30, 2024

LinZhihao-723 mentioned this pull request Oct 30, 2024

core: Add support for user-defined HTTP headers in NetworkReader. #568

Merged

anlowee deleted the xiaochong-add-user-customized-header branch October 31, 2024 21:39

	m_http_headers.append(key + ": " + value);
	m_http_headers.append(key + ": " + value + "\r\n");

		// We use httpbin (https://httpbin.org/) to test the custom headers. This request will return a
		// JSON object that contains the custom headers. We check if the headers are in the response.

-    // We use httpbin (https://httpbin.org/) to test the custom headers. This request will return a
-    // JSON object that contains the custom headers. We check if the headers are in the response.
+    // We use httpbin (https://httpbin.org/) to test user-specified headers.
+    // On success, it is supposed to respond all the user-specified headers as key-value pairs in
+    // JSON form.

		constexpr int cNumRegularTestHeaders{10};
		for (size_t i{0}; i < cNumRegularTestHeaders; i++) {

	TEST_CASE("network_reader_with_http_header_kv_pairs", "[NetworkReader]") {
	// NOLINTNEXTLINE(readability-function-cognitive-complexity)
	TEST_CASE("network_reader_with_http_header_kv_pairs", "[NetworkReader]") {

		size_t buffer_size,
		std::optional<std::unordered_map<std::string, std::string>> http_header_kv_pairs

		SRC_NAME: "yscope-log-viewer-4c69bc11dbe8a5d87b5fbfb0e43a2f2a06f04866"
		SRC_URL: "https://github.com/y-scope/yscope-log-viewer/archive/4c69bc1.zip"

		m_stream_reader.read_metadata(m_table_metadata_decompressor);

Conversation

anlowee commented Oct 24, 2024 • edited by coderabbitai bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Description

Validation performed

Summary by CodeRabbit

Uh oh!

coderabbitai bot commented Oct 24, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Sequence Diagram(s)

Possibly related PRs

Chat

CodeRabbit Commands (Invoked using PR comments)

Other keywords and placeholders

CodeRabbit Configuration File (.coderabbit.yaml)

Documentation and Community

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Oct 25, 2024

Choose a reason for hiding this comment

Uh oh!

LinZhihao-723 left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Oct 28, 2024

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Oct 28, 2024

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Oct 28, 2024

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Oct 28, 2024

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Oct 28, 2024

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Oct 28, 2024

anlowee commented Oct 24, 2024 •

edited by coderabbitai bot

Loading

coderabbitai bot commented Oct 24, 2024 •

edited

Loading

CodeRabbit Configuration File (`.coderabbit.yaml`)

LinZhihao-723 Oct 30, 2024 •

edited

Loading