refactor(clp-s): Divide clp-s search AST code so that it can easily be turned into a library.#754
Conversation
… independent library builds
…code into their own libraries
WalkthroughThis pull request reorganizes and refactors several utility functions and header dependencies across multiple components. The primary changes include shifting the tokenization functionality from the Changes
Sequence Diagram(s)sequenceDiagram
participant Caller as Caller (JsonParser/TimestampDictionaryReader/clp-s)
participant SU as clp_s::search::SearchUtils
Note over Caller, SU: Tokenization Flow
Caller->>SU: tokenize_column_descriptor(descriptor, tokens, ns)
SU-->>Caller: returns tokens (after unescaping via unescape_kql_internal)
sequenceDiagram
participant CP as Caller (TimestampPattern::init)
participant TP as TimestampPattern
Note over TP: Prevent Reinitialization
CP->>TP: init()
alt m_known_ts_patterns is not nullptr
TP-->>CP: Return early
else
TP->>TP: Initialize m_known_ts_patterns
TP-->>CP: Return after initialization
end
Possibly related PRs
📜 Recent review detailsConfiguration used: CodeRabbit UI 📒 Files selected for processing (1)
🚧 Files skipped from review as they are similar to previous changes (1)
⏰ Context from checks skipped due to timeout of 90000ms (7)
🪧 TipsChatThere are 3 ways to chat with CodeRabbit:
Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments. CodeRabbit Commands (Invoked using PR comments)
Other keywords and placeholders
CodeRabbit Configuration File (
|
There was a problem hiding this comment.
Actionable comments posted: 0
🧹 Nitpick comments (3)
components/core/src/clp_s/TimestampPattern.hpp (1)
73-76: Improved documentation with thread-safety warning.The updated documentation more clearly describes what the
init()function does and importantly adds a thread-safety warning.Consider adding thread-safety to this initialization method in a future update if this is expected to be used in multi-threaded contexts.
components/core/src/clp_s/search/SearchUtils.cpp (2)
63-130: Consider providing clearer failure diagnostics withintokenize_column_descriptor.
While returningfalsefor error conditions is straightforward, you might improve maintainability by logging or returning a specific error code to indicate the exact issue (i.e., empty token, incomplete escape, etc.).
165-266: Refactor large switch statement inunescape_kql_internal.
Though the logic is correct, the multiple case statements can be simplified or stored in a lookup table to improve readability and reduce repetition.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (20)
components/core/CMakeLists.txt(1 hunks)components/core/src/clp_s/CMakeLists.txt(5 hunks)components/core/src/clp_s/JsonParser.cpp(2 hunks)components/core/src/clp_s/SchemaTree.cpp(1 hunks)components/core/src/clp_s/SchemaTree.hpp(2 hunks)components/core/src/clp_s/TimestampDictionaryReader.cpp(2 hunks)components/core/src/clp_s/TimestampPattern.cpp(1 hunks)components/core/src/clp_s/TimestampPattern.hpp(1 hunks)components/core/src/clp_s/Utils.cpp(0 hunks)components/core/src/clp_s/Utils.hpp(0 hunks)components/core/src/clp_s/clp-s.cpp(2 hunks)components/core/src/clp_s/indexer/CMakeLists.txt(1 hunks)components/core/src/clp_s/search/Output.cpp(4 hunks)components/core/src/clp_s/search/Projection.cpp(1 hunks)components/core/src/clp_s/search/SchemaMatch.cpp(1 hunks)components/core/src/clp_s/search/SearchUtils.cpp(2 hunks)components/core/src/clp_s/search/SearchUtils.hpp(2 hunks)components/core/src/clp_s/search/kql/CMakeLists.txt(1 hunks)components/core/src/clp_s/search/kql/kql.cpp(5 hunks)components/core/src/clp_s/search/sql/CMakeLists.txt(1 hunks)
💤 Files with no reviewable changes (2)
- components/core/src/clp_s/Utils.cpp
- components/core/src/clp_s/Utils.hpp
🧰 Additional context used
📓 Path-based instructions (1)
`**/*.{cpp,hpp,java,js,jsx,tpp,ts,tsx}`: - Prefer `false == ...
**/*.{cpp,hpp,java,js,jsx,tpp,ts,tsx}: - Preferfalse == <expression>rather than!<expression>.
components/core/src/clp_s/search/Projection.cppcomponents/core/src/clp_s/SchemaTree.cppcomponents/core/src/clp_s/TimestampDictionaryReader.cppcomponents/core/src/clp_s/TimestampPattern.cppcomponents/core/src/clp_s/search/SchemaMatch.cppcomponents/core/src/clp_s/SchemaTree.hppcomponents/core/src/clp_s/clp-s.cppcomponents/core/src/clp_s/TimestampPattern.hppcomponents/core/src/clp_s/search/Output.cppcomponents/core/src/clp_s/JsonParser.cppcomponents/core/src/clp_s/search/kql/kql.cppcomponents/core/src/clp_s/search/SearchUtils.cppcomponents/core/src/clp_s/search/SearchUtils.hpp
🔇 Additional comments (41)
components/core/src/clp_s/TimestampPattern.cpp (1)
211-214: Properly guards against re-initializationGood addition of a check to prevent unnecessary re-initialization of static data. This avoids redundant work if the function is called multiple times, which is especially important given the amount of pattern data being initialized.
components/core/src/clp_s/indexer/CMakeLists.txt (1)
88-89: Appropriate library dependencies addedThe addition of the new
clp_s::search::astandclp_s::TimestampPatternlibraries properly reflects the structural changes in the PR. This ensures the indexer correctly links against the newly separated components.components/core/src/clp_s/search/Projection.cpp (1)
5-5: Updated include for relocated functionalityThe include path change from "SearchUtils.hpp" to "../SchemaTree.hpp" correctly reflects the relocation of the
node_to_literal_typefunction that's used on line 78. This change maintains the proper dependencies after refactoring.components/core/src/clp_s/search/SchemaMatch.cpp (1)
8-8: Updated include for relocated functionalityThe include path change from "SearchUtils.hpp" to "../SchemaTree.hpp" correctly reflects the relocation of the
node_to_literal_typefunction that's used in multiple places throughout this file (lines 77, 315, 428, and 528). This change aligns with the PR objective of eliminating circular dependencies.components/core/src/clp_s/SchemaTree.cpp (1)
8-33: Good implementation of node type to literal type conversion.The function properly maps each
NodeTypeto the correspondingclp_s::search::LiteralTypewith a comprehensive switch statement, and includes a catchall withUnknownTfor unhandled cases.I noticed there's a TODO comment about supporting
NodeType::Objectin the future. This is good documentation for future work.components/core/src/clp_s/SchemaTree.hpp (2)
15-15: LGTM! Appropriate inclusion of required header.The added include for "search/Literal.hpp" properly brings in the necessary LiteralType definition.
46-51: Well-documented function declaration.The function declaration is clear and the documentation properly explains its purpose and return value.
components/core/src/clp_s/clp-s.cpp (2)
35-35: LGTM! Appropriate inclusion of required header.The added include for "search/SearchUtils.hpp" properly supports the namespace change for tokenization functions.
207-209: Namespace update correctly implemented.The change from
StringUtils::tokenize_column_descriptortoclp_s::search::tokenize_column_descriptoraligns with the PR objective of relocating string processing utilities fromStringUtilsto the specializedSearchUtilsin theclp_s::searchnamespace.components/core/src/clp_s/TimestampDictionaryReader.cpp (2)
5-5: Include change matches the refactoring objective.The change of the include from
Utils.hpptosearch/SearchUtils.hppaligns with the PR's goal of reorganizing utility functions into more specific namespaces.
24-26: Function call updated to use new namespace.The call has been properly updated from
StringUtils::tokenize_column_descriptortoclp_s::search::tokenize_column_descriptor, maintaining the same functionality while using the newly organized code structure.components/core/src/clp_s/JsonParser.cpp (2)
32-32: Added include for reorganized utility functions.This include addition properly supports the refactoring goal of moving search-related utility functions into their own namespace.
94-96: Function call updated to use new namespace.The call to
tokenize_column_descriptorhas been correctly updated to use the function from theclp_s::searchnamespace instead ofStringUtils, consistent with the restructuring objectives.components/core/src/clp_s/search/kql/CMakeLists.txt (1)
17-24: CMake configuration updated to reflect new library dependencies.The target_link_libraries command has been properly updated to include the new dependencies (
clp::string_utils,clp_s::search::ast, andspdlog::spdlog) while removing unnecessary dependencies. This change aligns with the PR's objective to split the codebase into more modular components.components/core/src/clp_s/search/Output.cpp (4)
7-7: Include changed to reflect dependency restructuring.The addition of
SchemaTree.hppand removal ofSearchUtils.hppreflects the refactoring of code to eliminate circular dependencies. This change supports the PR's objective of better code organization.
595-599: Wildcard matching function updated with case sensitivity parameter.The call to
wildcard_matchhas been replaced withStringUtils::wildcard_match_unsafe, which includes an explicit case sensitivity parameter. This preserves the original functionality while using the implementation from the correct namespace.
735-739: Wildcard matching function updated with case sensitivity parameter.Similar to the previous instance, this call has been updated to use
StringUtils::wildcard_match_unsafewith the appropriate case sensitivity parameter, ensuring consistent behavior with the original implementation.
809-814: Wildcard matching function updated with case sensitivity parameter.This is the third instance where
wildcard_matchhas been replaced withStringUtils::wildcard_match_unsafe, properly maintaining case sensitivity behavior. The change is consistent with the other instances and the PR's objectives.components/core/CMakeLists.txt (1)
653-654: Appropriate library additions for the refactored architectureThe addition of
clp_s::search::astandclp_s::TimestampPatternlibraries to unitTest target links correctly implements the architectural restructuring described in the PR objectives. This helps eliminate circular dependencies while enabling the code to be more modular and easier to reuse.components/core/src/clp_s/search/sql/CMakeLists.txt (1)
17-23: Updated dependencies align with the new architectureThe changes to the
sqllibrary dependencies properly reflect the restructured code base, now linking toclp_s::search::astrather than directly to the individual expression classes. This modification correctly implements the modular approach described in the PR objectives.components/core/src/clp_s/search/kql/kql.cpp (6)
8-8: Added appropriate include for string utilitiesIncluding the string_utils header correctly prepares for the usage of
clp::string_utils::clean_up_wildcard_search_stringlater in the file.
21-21: SearchUtils inclusion supports refactored function callsThis include correctly brings in the declarations for
clp_s::search::tokenize_column_descriptorandclp_s::search::unescape_kql_valuethat were moved from StringUtils.
77-77: Updated function call using refactored namespaceCorrectly replaced
StringUtils::unescape_kql_valuewithclp_s::search::unescape_kql_value, aligning with the PR's objective to move these utilities to a more appropriate namespace.
89-89: Updated wildcard search string cleaning function callThe change to use
clp::string_utils::clean_up_wildcard_search_stringinstead ofStringUtilsversion is consistent with the refactoring goals and maintains proper namespace organization.
95-95: Consistent namespace update for unescaping functionThis change to use
clp_s::search::unescape_kql_valuemaintains consistency with the earlier change at line 77 and follows the PR objective to move these utility functions.
114-119: Updated tokenize function call using appropriate namespaceThe replacement of
StringUtils::tokenize_column_descriptorwithclp_s::search::tokenize_column_descriptorproperly implements the function relocations mentioned in the PR objectives.components/core/src/clp_s/search/SearchUtils.hpp (3)
4-6: Added standard library includesThe addition of standard includes for string and vector is appropriate as these types are now used directly in the function signatures rather than being included through SchemaTree.hpp.
35-46: Well-documented tokenize_column_descriptor function declarationThis function declaration properly includes the [[nodiscard]] attribute and thorough documentation explaining the purpose and behavior of the function that was relocated from StringUtils.
48-60: Well-documented unescape_kql_value function declarationThe newly added function declaration includes clear documentation about the KQL value unescaping process and properly uses the [[nodiscard]] attribute to ensure return values are checked.
components/core/src/clp_s/search/SearchUtils.cpp (4)
4-5: Include directives are appropriate.
These new headers are suitable for string and vector operations in this file.
7-7: Confirm simdjson linkage.
The addition of<simdjson.h>is consistent with usage in this file. Ensure that the CMake configuration correctly references this external dependency.
132-134: Simple wrapper function looks good.
unescape_kql_valueis straightforward and provides a helpful abstraction for unescaping KQL values.
137-163: Verify surrogate pair handling inconvert_four_byte_hex_to_utf8.
This approach successfully converts four-byte hex sequences within the Basic Multilingual Plane. However, it may not handle characters beyond0xFFFF(requiring surrogate pairs). If such codepoints must be supported, consider expanding the logic to handle combined surrogate pairs.components/core/src/clp_s/CMakeLists.txt (8)
86-92: New timestamp pattern sources grouping is clear.
DefiningCLP_S_TIMESTAMP_PATTERN_SOURCESclarifies which files belong to the timestamp pattern library and avoids mixing them with other sources.
94-95: Separate definition ofCLP_S_SOURCES.
Pulling out timestamp pattern files fromCLP_S_SOURCEShelps reduce clutter and enables more granular builds.
167-168: IntroducingCLP_S_CORE_AST_SOURCES.
This set appears to isolate core AST functionalities. Confirm that other AST-related files not explicitly listed here do not become orphaned.
199-205: Consolidated search utility files.
IncludingSearchUtils.cppandSearchUtils.hppunder the core AST sources is consistent with the refactoring. Be sure to verify that no circular dependencies remain.
207-217: CreatingCLP_S_SEARCH_SOURCESfosters separation of concerns.
Grouping these search logic files distinctly makes the codebase structure more intuitive and easier to maintain.
252-265: New libraryclp_s_TimestampPatternintroduced.
Splitting outTimestampPatterninto its own target helps ensure modularity. This change is aligned with the goal of avoiding circular dependencies.
266-279:clp_s_search_astlibrary definition.
Creating a dedicated library for AST logic clarifies dependencies and reduces coupling. Ensure any future AST-related additions are also placed here.
296-297: Linking the new libraries toclp-s.
Addingclp_s::search::astandclp_s::TimestampPatternimproves clarity. This sets the stage for simpler reuse in other targets or projects.
|
At a first glance this PR looks a bit big, could at least separate library encapsulation and implementation detail changes into 2 separate PRs. Also I'm wondering if you think removing |
OK -- will revert CMakeLists changes from this PR and leave just the code changes in this PR. And yeah, removing clp_s::StringUtils and switching over to |
There was a problem hiding this comment.
Actionable comments posted: 0
🧹 Nitpick comments (1)
components/core/src/clp_s/indexer/CMakeLists.txt (1)
58-59: Verify New Source File Entries for SearchUtils LibraryThe additions of
../search/SearchUtils.cppand../search/SearchUtils.hppcorrectly incorporate the new SearchUtils module aimed at improving code re-use. However, note that previous discussions indicated that CMakeLists changes were to be reverted in this PR, with only code changes retained. Kindly confirm whether these new entries should remain or be removed for consistency with the plan.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
components/core/src/clp_s/indexer/CMakeLists.txt(1 hunks)
⏰ Context from checks skipped due to timeout of 90000ms (6)
- GitHub Check: ubuntu-jammy-lint
- GitHub Check: ubuntu-jammy-static-linked-bins
- GitHub Check: ubuntu-jammy-dynamic-linked-bins
- GitHub Check: centos-stream-9-static-linked-bins
- GitHub Check: centos-stream-9-dynamic-linked-bins
- GitHub Check: lint-check (ubuntu-latest)
wraymo
left a comment
There was a problem hiding this comment.
Thanks for the refactoring! The code is more modular now.
| return ret; | ||
| } else { | ||
| return StringLiteral::create(StringUtils::clean_up_wildcard_search_string(token)); | ||
| return StringLiteral::create(clp::string_utils::clean_up_wildcard_search_string(token)); |
There was a problem hiding this comment.
Are we going to remove duplicated code in StringUtils and use clp::string_utils instead?
There was a problem hiding this comment.
In general yes, but I think that that should probably happen as its own PR. The only reason I make this change here is because I want kql to be independent of clp_s/Utils.
| * @param s the string to match | ||
| * @param p the pattern to match against | ||
| * @return true if s matches p, false otherwise | ||
| * Converts a KQL string column descriptor delimited by '.' into a list of tokens. The |
There was a problem hiding this comment.
Do you think it's better to put those functions in a class (maybe SearchUtils) like StringUtils?
There was a problem hiding this comment.
We could do it that way if you prefer, I don't have a strong preference.
Could also put it into its own namespace clp_s::search::search_utils or put all of the AST stuff into clp_s::search::ast or something.
There was a problem hiding this comment.
I'm fine with either approach. I personally prefer the namespace option, but it looks like the CLP codebase derives namespaces from directory names. I'm not sure if adding a custom namespace aligns with the existing convention.
Co-authored-by: wraymo <37269683+wraymo@users.noreply.github.com>
…e turned into a library. (y-scope#754) Co-authored-by: wraymo <37269683+wraymo@users.noreply.github.com>
Description
This PR modifies the clp-s codebase so that the build can be split into a "clp_s::TimestampPattern" library, a "clp_s::search::ast" library, and the core clp binary.
To clean up this split we move around some code to eliminate circular dependencies and delete some duplicated/unnecessary code in the process.
These changes include:
Checklist
breaking change.
Validation performed
Summary by CodeRabbit
Summary by CodeRabbit
New Features
Refactor
Chores