feat(clp-s): Add support for querying kv-pairs in the archive range index.#929
Conversation
…must match at least one record
WalkthroughThis change introduces a new AST transformation pass that evaluates and rewrites filters referencing metadata fields in the archive range index during search. It adds the Changes
Sequence Diagram(s)sequenceDiagram
participant User
participant Search
participant EvaluateRangeIndexFilters
participant RangeIndex
participant AST
User->>Search: Submit search query
Search->>AST: Parse query to AST
Search->>EvaluateRangeIndexFilters: Run transformation on AST
EvaluateRangeIndexFilters->>RangeIndex: Evaluate filters referencing metadata
RangeIndex-->>EvaluateRangeIndexFilters: Return matching ranges
EvaluateRangeIndexFilters->>AST: Rewrite AST with range-based filters
Search->>AST: Continue search processing (timestamp, decompression, etc.)
Possibly related PRs
Suggested reviewers
📜 Recent review detailsConfiguration used: CodeRabbit UI 📒 Files selected for processing (1)
🧰 Additional context used🪛 markdownlint-cli2 (0.17.2)docs/src/user-guide/reference-json-search-syntax.md89-89: Fenced code blocks should have a language specified (MD040, fenced-code-language) 95-95: Fenced code blocks should have a language specified (MD040, fenced-code-language) 101-101: Fenced code blocks should have a language specified (MD040, fenced-code-language) ⏰ Context from checks skipped due to timeout of 90000ms (10)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. 🪧 TipsChatThere are 3 ways to chat with CodeRabbit:
SupportNeed help? Create a ticket on our support page for assistance with any issues or questions. Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments. CodeRabbit Commands (Invoked using PR comments)
Other keywords and placeholders
CodeRabbit Configuration File (
|
There was a problem hiding this comment.
Actionable comments posted: 2
📜 Review details
Configuration used: CodeRabbit UI
Review profile: ASSERTIVE
Plan: Pro
📒 Files selected for processing (7)
components/core/CMakeLists.txt(1 hunks)components/core/src/clp_s/CMakeLists.txt(1 hunks)components/core/src/clp_s/clp-s.cpp(2 hunks)components/core/src/clp_s/search/EvaluateMetadataFilters.cpp(1 hunks)components/core/src/clp_s/search/EvaluateMetadataFilters.hpp(1 hunks)components/core/tests/test-clp_s-search.cpp(4 hunks)docs/src/user-guide/reference-json-search-syntax.md(1 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
`**/*.{cpp,hpp,java,js,jsx,tpp,ts,tsx}`: - Prefer `false == ` rather than `!`.
**/*.{cpp,hpp,java,js,jsx,tpp,ts,tsx}: - Preferfalse == <expression>rather than!<expression>.
components/core/tests/test-clp_s-search.cppcomponents/core/src/clp_s/clp-s.cppcomponents/core/src/clp_s/search/EvaluateMetadataFilters.hppcomponents/core/src/clp_s/search/EvaluateMetadataFilters.cpp
🪛 markdownlint-cli2 (0.17.2)
docs/src/user-guide/reference-json-search-syntax.md
93-93: Fenced code blocks should have a language specified
null
(MD040, fenced-code-language)
⏰ Context from checks skipped due to timeout of 90000ms (6)
- GitHub Check: ubuntu-jammy-dynamic-linked-bins
- GitHub Check: centos-stream-9-dynamic-linked-bins
- GitHub Check: ubuntu-jammy-lint
- GitHub Check: ubuntu-jammy-static-linked-bins
- GitHub Check: centos-stream-9-static-linked-bins
- GitHub Check: lint-check (ubuntu-latest)
🔇 Additional comments (12)
components/core/src/clp_s/CMakeLists.txt (1)
171-172: LGTM! Clean addition of new source files.The new
EvaluateMetadataFilterssource files are properly added to theCLP_S_SEARCH_SOURCESlist, following the established pattern and naming conventions.components/core/CMakeLists.txt (1)
342-343: LGTM! Consistent unit test integration.The new
EvaluateMetadataFilterssource files are correctly added to the unit test build target, ensuring proper test coverage for the new functionality.components/core/src/clp_s/clp-s.cpp (2)
32-32: LGTM! Clean include addition.The include for
EvaluateMetadataFilters.hppis properly placed in alphabetical order.
175-182: Well-integrated metadata filter evaluation.The new transformation pass is properly integrated into the search pipeline:
- Correctly positioned after AST transformations and before timestamp index evaluation
- Follows the established pattern for transformation passes
- Properly handles empty expressions with appropriate logging
- Respects case sensitivity settings from command line arguments
- Uses the coding guideline preference
false == <expression>instead of negationThe early return optimization when no metadata ranges match is a nice performance improvement.
components/core/tests/test-clp_s-search.cpp (4)
29-29: LGTM!The include statement is correctly placed and necessary for the new metadata filter evaluation functionality.
164-171: Good implementation and adherence to coding guidelines!The metadata filter evaluation is correctly integrated into the search pipeline. I appreciate the proper use of
false == ignore_casewhich follows the coding guideline to preferfalse == <expression>over!<expression>.
213-224: Comprehensive test coverage for metadata filters!The new test queries effectively validate the metadata filtering functionality with various scenarios including:
- Individual metadata field queries
- Compound queries with AND/OR operators
- Negation with NOT operators
- Mixed metadata and regular field queries
240-240: Good addition for test debugging!The
CAPTUREmacro will display the current query value when a test fails, making it easier to identify the problematic query.components/core/src/clp_s/search/EvaluateMetadataFilters.hpp (1)
1-58: Well-structured header with clear documentation!The header file follows best practices:
- Proper include guards
- Clear and informative documentation comments
- Appropriate use of const reference for the range index parameter
- Good separation of public interface and private implementation details
components/core/src/clp_s/search/EvaluateMetadataFilters.cpp (3)
58-87: Excellent implementation with proper coding standards!The iterative AST traversal is well-implemented and consistently follows the coding guideline by using
false == work_list.empty(). The renormalization step after rewriting ensures the AST remains in a consistent form.
89-154: Well-optimized filter rewriting with range merging!The implementation efficiently merges contiguous matching ranges to minimize the number of filter conditions. The code consistently follows the coding guideline with proper use of
false ==comparisons throughout.
206-222: Potential precision loss for large unsigned integersThe static cast at line 210 could lose precision for unsigned values larger than
INT64_MAX. While the TODO comment acknowledges this, it's worth tracking this limitation.Consider opening an issue to track the implementation of full support for large unsigned values to avoid potential data loss.
| ### Querying kv-pairs in the archive range index | ||
|
|
||
| The archive range index maps metadata about each file ingested into an archive to a logical range | ||
| of the archive. Filters that reference fields in this metadata transparently get converted to | ||
| filters that match corresponding logical ranges of an archive. | ||
|
|
||
| To search for records corresponding to a kv-pair within the archive range index you can prefix the | ||
| key with `$`. For example, the following query searches for records from a file that was passed to | ||
| compression as `test.jsonl`: | ||
|
|
||
| ``` | ||
| $_filename: "test.jsonl" | ||
| ``` | ||
|
|
There was a problem hiding this comment.
🧹 Nitpick (assertive)
Well-documented new feature with clear examples.
The new section effectively explains how to query metadata fields using the $ prefix, with a practical example that users can easily understand.
Fix missing language specification for code block.
The static analysis tool correctly identified that the fenced code block should specify a language.
Apply this diff to specify the language:
-```
+```kql
$_filename: "test.jsonl"
<details>
<summary>🧰 Tools</summary>
<details>
<summary>🪛 markdownlint-cli2 (0.17.2)</summary>
93-93: Fenced code blocks should have a language specified
null
(MD040, fenced-code-language)
</details>
</details>
<details>
<summary>🤖 Prompt for AI Agents</summary>
In docs/src/user-guide/reference-json-search-syntax.md around lines 83 to 96,
the fenced code block demonstrating querying kv-pairs in the archive range index
is missing a language specification. To fix this, add the language identifier
"kql" immediately after the opening triple backticks of the code block to enable
proper syntax highlighting.
</details>
<!-- This is an auto-generated comment by CodeRabbit -->
| auto ret{evaluate_filter_against_literal_type_value_pair( | ||
| filter_expr, | ||
| ast::LiteralType::BooleanT, | ||
| bool_value, | ||
| m_case_sensitive_match |
There was a problem hiding this comment.
is it possible to deduplicate the code a bit using something like
auto eval = [&](ast::LiteralType type, std::optional<clp::ffi::Value> value) -> bool {
auto result = evaluate_filter_against_literal_type_value_pair(
filter_expr, type, value, m_case_sensitive_match);
return !result.has_error() && (filter_expr->is_inverted() != result.value());
};
There was a problem hiding this comment.
Sure, will do something like this.
There was a problem hiding this comment.
Actionable comments posted: 1
♻️ Duplicate comments (1)
components/core/src/clp_s/search/EvaluateMetadataFilters.cpp (1)
261-264: Array search not yet supported.As documented in the PR objectives and TODO comment, array search filters are not yet implemented. This is a known limitation of the current implementation.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: ASSERTIVE
Plan: Pro
📒 Files selected for processing (1)
components/core/src/clp_s/search/EvaluateMetadataFilters.cpp(1 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
`**/*.{cpp,hpp,java,js,jsx,tpp,ts,tsx}`: - Prefer `false == ` rather than `!`.
**/*.{cpp,hpp,java,js,jsx,tpp,ts,tsx}: - Preferfalse == <expression>rather than!<expression>.
components/core/src/clp_s/search/EvaluateMetadataFilters.cpp
⏰ Context from checks skipped due to timeout of 90000ms (11)
- GitHub Check: ubuntu-jammy-dynamic-linked-bins
- GitHub Check: ubuntu-jammy-static-linked-bins
- GitHub Check: ubuntu-jammy-lint
- GitHub Check: centos-stream-9-dynamic-linked-bins
- GitHub Check: centos-stream-9-static-linked-bins
- GitHub Check: build-macos (macos-14, false)
- GitHub Check: build-macos (macos-15, true)
- GitHub Check: build-macos (macos-15, false)
- GitHub Check: build-macos (macos-14, true)
- GitHub Check: lint-check (ubuntu-latest)
- GitHub Check: lint-check (macos-latest)
🔇 Additional comments (4)
components/core/src/clp_s/search/EvaluateMetadataFilters.cpp (4)
38-55: The text encoding helper function is well-implemented.The function correctly encodes text into the four-byte encoded AST format and properly handles dictionary variable extraction.
58-87: The AST traversal and filter rewriting logic is well-structured.The method correctly identifies metadata filters, rewrites them, and applies necessary normalization passes. Good use of the iterative work list pattern.
89-154: Excellent implementation of range filter rewriting with efficient range merging.The method correctly evaluates metadata filters and efficiently merges adjacent matching ranges to minimize the number of generated range filters.
169-178: Good implementation of the deduplication suggestion.The lambda effectively reduces code duplication as suggested in the previous review.
wraymo
left a comment
There was a problem hiding this comment.
Nice work! For EvaluateMetadataFilters, it seems that it's specifically used for the range index rather than searching against the metadata tree. Should we rename it to EvaluateRangeIndexFilters?
| std::vector<clp::ir::four_byte_encoded_variable_t> encoded_vars; | ||
| std::vector<int32_t> dict_var_bounds; | ||
| clp::ffi::encode_message(text, logtype, encoded_vars, dict_var_bounds); | ||
| std::vector<std::string> dict_vars; |
|
|
||
| auto evaluate_expr | ||
| = [&](ast::LiteralType type, std::optional<clp::ffi::Value> const& value) -> bool { | ||
| auto ret{evaluate_filter_against_literal_type_value_pair( |
There was a problem hiding this comment.
Just curious, why are we using clp::ffi methods to evaluate filter? Is it because we don't expose similar methods in clp_s?
There was a problem hiding this comment.
That is part of it, but also
- The current
evaluate_filterfunctions in clp-s rely on a bunch of internal state, so even if we did expose them they would be hard to use - There are really good tests for the clp::ffi evaluate methods, so it should be pretty reliable
| } else if (cur_field.is_object()) { | ||
| if (cur_field.contains(cur_it->get_token())) { |
There was a problem hiding this comment.
Can we merge these two into one condition
| return false == ret.has_error() && filter_expr->is_inverted() != ret.value(); | ||
| }; | ||
|
|
||
| while (false == work_list.empty()) { |
There was a problem hiding this comment.
I'm a bit confused about the purpose of this while loop. Is it because we want to handle nested column in the future? and are we planning to store nested structures in the RangeIndex fields?
There was a problem hiding this comment.
Yes, it is for that and for handling wildcards tokens in the column descriptor.
| )); | ||
|
|
||
| for (auto const& [query, expected_results] : queries_and_results) { | ||
| CAPTURE(query); |
There was a problem hiding this comment.
It helps with debugging -- now when one of the queries fail this test will show you the specific query as well as the test assertion that failed.
| filters that match corresponding logical ranges of an archive. | ||
|
|
||
| To search for records corresponding to a kv-pair within the archive range index you can prefix the | ||
| key with `$`. For example, the following query searches for records from a file that was passed to |
There was a problem hiding this comment.
should we mention the range index fields we currently support?
There was a problem hiding this comment.
I'll mention them in the SFA format design doc that I'm writing, but I think the current example with $_filename should be enough for most users.
The $_archive_creator_id and $_file_split_number fields will probably only be needed by ourselves and advanced users.
There was a problem hiding this comment.
Actionable comments posted: 2
♻️ Duplicate comments (1)
components/core/tests/test-clp_s-search.cpp (1)
239-239: Good debugging improvement with query capture.The
CAPTURE(query)statement provides valuable diagnostic information when tests fail, as discussed in previous reviews. This helps identify which specific query caused a failure.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: ASSERTIVE
Plan: Pro
📒 Files selected for processing (6)
components/core/CMakeLists.txt(1 hunks)components/core/src/clp_s/clp-s.cpp(2 hunks)components/core/src/clp_s/search/CMakeLists.txt(2 hunks)components/core/src/clp_s/search/EvaluateRangeIndexFilters.cpp(1 hunks)components/core/src/clp_s/search/EvaluateRangeIndexFilters.hpp(1 hunks)components/core/tests/test-clp_s-search.cpp(4 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
`**/*.{cpp,hpp,java,js,jsx,tpp,ts,tsx}`: - Prefer `false == ` rather than `!`.
**/*.{cpp,hpp,java,js,jsx,tpp,ts,tsx}: - Preferfalse == <expression>rather than!<expression>.
components/core/src/clp_s/clp-s.cppcomponents/core/tests/test-clp_s-search.cppcomponents/core/src/clp_s/search/EvaluateRangeIndexFilters.hppcomponents/core/src/clp_s/search/EvaluateRangeIndexFilters.cpp
⏰ Context from checks skipped due to timeout of 90000ms (11)
- GitHub Check: ubuntu-jammy-dynamic-linked-bins
- GitHub Check: ubuntu-jammy-lint
- GitHub Check: ubuntu-jammy-static-linked-bins
- GitHub Check: centos-stream-9-static-linked-bins
- GitHub Check: centos-stream-9-dynamic-linked-bins
- GitHub Check: lint-check (ubuntu-latest)
- GitHub Check: build-macos (macos-15, true)
- GitHub Check: build-macos (macos-14, true)
- GitHub Check: build-macos (macos-15, false)
- GitHub Check: build-macos (macos-13, false)
- GitHub Check: build-macos (macos-14, false)
🔇 Additional comments (11)
components/core/CMakeLists.txt (1)
344-345: LGTM! Build configuration correctly updated.The new source files are properly added to the unit test build configuration and follow the existing alphabetical ordering pattern.
components/core/src/clp_s/search/CMakeLists.txt (2)
18-19: LGTM! Source files correctly added to library build.The new
EvaluateRangeIndexFilterssource files are properly added to the search library configuration following the existing pattern.
48-48: New dependency added for library functionality.The addition of
clp_s::clp_dependenciesas a private dependency suggests the newEvaluateRangeIndexFiltersclass requires additional functionality. This aligns with the integration of range index filtering capabilities.components/core/src/clp_s/clp-s.cpp (2)
33-33: LGTM! Include statement properly added.The include for the new
EvaluateRangeIndexFiltersheader is correctly placed with other search-related includes.
176-183: Excellent integration of metadata filtering with early optimization.The
EvaluateRangeIndexFilterspass is correctly positioned after AST transformations but before timestamp index evaluation, allowing for early termination when no metadata ranges match. The case sensitivity flag derivation from command line arguments is appropriate.The early return with
SPDLOG_INFOprovides good user feedback while avoiding unnecessary archive decompression when metadata filters eliminate all potential matches.components/core/tests/test-clp_s-search.cpp (3)
29-29: LGTM! Test include properly added.The include for
EvaluateRangeIndexFilters.hppis correctly placed with other search-related includes in the test file.
163-169: Excellent test integration mirroring production code.The metadata filtering pass is correctly integrated into the test search function, matching the implementation in the main application. The assertions ensure the transformation doesn't result in null or empty expressions, providing good test coverage.
212-223: Comprehensive test coverage for metadata field querying.The new test queries effectively cover various scenarios:
- Metadata field combinations with logical operators
- Negation of metadata filters
- Disjunctions with filename patterns
- Integration with regular index fields
These tests validate the new archive range index querying functionality thoroughly.
components/core/src/clp_s/search/EvaluateRangeIndexFilters.hpp (1)
15-56: Well-structured transformation class design.The class follows good design principles with clear documentation, appropriate use of const references, and a focused interface that aligns with the single responsibility principle.
components/core/src/clp_s/search/EvaluateRangeIndexFilters.cpp (2)
59-88: Correct AST traversal and transformation logic.The implementation properly traverses the expression tree, identifies filters in the range index namespace, and handles renormalization when changes occur. Good adherence to the coding guideline using
false == work_list.empty().
129-147: Efficient range merging optimization.Good implementation of range merging that combines adjacent ranges to reduce the number of filter conditions. This optimization will improve query performance.
Co-authored-by: kirkrodrigues <2454684+kirkrodrigues@users.noreply.github.com>
kirkrodrigues
left a comment
There was a problem hiding this comment.
Sorry, some minor rephrasing.
Co-authored-by: kirkrodrigues <2454684+kirkrodrigues@users.noreply.github.com>
Co-authored-by: kirkrodrigues <2454684+kirkrodrigues@users.noreply.github.com>
…ndex. (y-scope#929) Co-authored-by: kirkrodrigues <2454684+kirkrodrigues@users.noreply.github.com>
Description
This PR adds support for querying fields in the archive range index. Any column in the
$namespace is treated as a query against the metadata range index. Filters from the$namespace are evaluated against the range index and either converted into filters on a range oflog_event_idxor constant propagated away.This logic has been implemented in the
EvaluateMetadataFilterstransformation pass. The pass has an iterative implementation that walks the AST to find and rewrite filters on columns in the$namespace and has another iterative implementation that resolves columns against metadata fields in each section of the range index. Expression evaluation is performed byclp::ffi::ir_stream::search::evaluate_filter_against_literal_type_value_pair-- since this evaluation function doesn't yet support array search filters against the archive range index also do not support array search.Note that this PR enables search on fields in the archive range index but not projection.
Checklist
breaking change.
Validation performed
Summary by CodeRabbit
$symbol.