feat(core): Optimize subquery generation by narrowing wildcard token type inference (fixes #1864).#1865
Conversation
…ards during subquery generation.
WalkthroughAdds two static validators to EncodedVariableInterpreter to check whether wildcard strings could represent integer or float encoded variables; QueryToken uses them to conditionally add IntVar/FloatVar for non-variable tokens. Tests were added. The implementation file contains duplicated definitions of the new methods. Changes
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes Possibly related issues
🚥 Pre-merge checks | ✅ 2 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing touches
📜 Recent review detailsConfiguration used: Organization UI Review profile: ASSERTIVE Plan: Pro 📒 Files selected for processing (2)
🧰 Additional context used📓 Path-based instructions (1)**/*.{cpp,hpp,java,js,jsx,tpp,ts,tsx}⚙️ CodeRabbit configuration file
Files:
🧠 Learnings (5)📚 Learning: 2024-11-01T03:26:26.386ZApplied to files:
📚 Learning: 2024-10-13T09:27:43.408ZApplied to files:
📚 Learning: 2024-11-29T22:50:17.206ZApplied to files:
📚 Learning: 2025-12-09T01:57:04.839ZApplied to files:
📚 Learning: 2025-05-07T16:56:35.687ZApplied to files:
🧬 Code graph analysis (2)components/core/tests/test-EncodedVariableInterpreter.cpp (2)
components/core/src/clp/EncodedVariableInterpreter.cpp (1)
🪛 Cppcheck (2.19.0)components/core/tests/test-EncodedVariableInterpreter.cpp[information] 8-8: Include file (missingIncludeSystem) [information] 8-8: Include file (missingIncludeSystem) [information] 8-8: Include file (missingIncludeSystem) [information] 8-8: Include file (missingIncludeSystem) [information] 8-8: Include file (missingIncludeSystem) [information] 8-8: Include file (missingIncludeSystem) [information] 8-8: Include file (missingIncludeSystem) [information] 8-8: Include file (missingIncludeSystem) [information] 8-8: Include file (missingIncludeSystem) [information] 8-8: Include file (missingIncludeSystem) [style] 437-437: The function 'decode_message' is never used. (unusedFunction) components/core/src/clp/EncodedVariableInterpreter.cpp[information] 3-3: Include file (missingIncludeSystem) [information] 3-3: Include file (missingIncludeSystem) [style] 200-200: The function 'wildcard_string_could_be_representable_integer_var' is never used. (unusedFunction) [style] 211-211: The function 'wildcard_string_could_be_representable_float_var' is never used. (unusedFunction) [style] 203-203: The function 'search_string_matches_all' is never used. (unusedFunction) [style] 205-205: The function 'get_sub_queries' is never used. (unusedFunction) [style] 207-207: The function 'contains_sub_queries' is never used. (unusedFunction) [style] 209-209: The function 'get_relevant_sub_queries' is never used. (unusedFunction) ⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (15)
🔇 Additional comments (8)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
LinZhihao-723
left a comment
There was a problem hiding this comment.
overall lgtm.
We should also mark #457 as resolved.
Co-authored-by: Lin Zhihao <59785146+LinZhihao-723@users.noreply.github.com>
There was a problem hiding this comment.
Sorry for missing this, we should also include <algorithm>
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Fix all issues with AI agents
In @components/core/src/clp/EncodedVariableInterpreter.cpp:
- Around line 207-214: The function
EncodedVariableInterpreter::wildcard_string_could_be_representable_float_var
currently treats an empty string as matching; add an explicit check at the start
of the function to return false if value.empty(), then keep the existing any_of
predicate unchanged so empty wildcard strings are rejected just like the integer
variant.
- Around line 199-205: The function
wildcard_string_could_be_representable_integer_var currently returns true for
empty strings due to any_of on an empty range; change it to explicitly return
false for empty input by adding an early check (if value.empty() return false)
at the top of
EncodedVariableInterpreter::wildcard_string_could_be_representable_integer_var
so its behavior matches convert_string_to_representable_integer_var's handling
of empty strings and prevents empty wildcard strings from being treated as
representable integer vars.
📜 Review details
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro
📒 Files selected for processing (2)
components/core/src/clp/EncodedVariableInterpreter.cppcomponents/core/src/clp/EncodedVariableInterpreter.hpp
🧰 Additional context used
📓 Path-based instructions (1)
**/*.{cpp,hpp,java,js,jsx,tpp,ts,tsx}
⚙️ CodeRabbit configuration file
- Prefer
false == <expression>rather than!<expression>.
Files:
components/core/src/clp/EncodedVariableInterpreter.cppcomponents/core/src/clp/EncodedVariableInterpreter.hpp
🧠 Learnings (1)
📚 Learning: 2025-12-09T01:57:04.839Z
Learnt from: SharafMohamed
Repo: y-scope/clp PR: 1313
File: components/core/tests/test-SchemaSearcher.cpp:136-137
Timestamp: 2025-12-09T01:57:04.839Z
Learning: In C++ files under the components directory, keep forward declarations even when a definition follows in the same scope (such as in anonymous namespaces) to maintain consistency across multiple methods. This style reduces churn and aligns with the existing convention in this repository for both source and tests.
Applied to files:
components/core/src/clp/EncodedVariableInterpreter.cpp
🧬 Code graph analysis (1)
components/core/src/clp/EncodedVariableInterpreter.cpp (1)
components/core/src/clp/EncodedVariableInterpreter.hpp (4)
value(100-103)value(110-113)value(126-128)value(134-136)
🪛 Cppcheck (2.19.0)
components/core/src/clp/EncodedVariableInterpreter.cpp
[style] 199-199: The function 'wildcard_string_could_be_representable_integer_var' is never used.
(unusedFunction)
[style] 207-207: The function 'wildcard_string_could_be_representable_float_var' is never used.
(unusedFunction)
[style] 203-203: The function 'search_string_matches_all' is never used.
(unusedFunction)
[style] 205-205: The function 'get_sub_queries' is never used.
(unusedFunction)
[style] 207-207: The function 'contains_sub_queries' is never used.
(unusedFunction)
[style] 209-209: The function 'get_relevant_sub_queries' is never used.
(unusedFunction)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (15)
- GitHub Check: ubuntu-jammy-static-linked-bins
- GitHub Check: manylinux_2_28-x86_64-dynamic-linked-bins
- GitHub Check: package-image
- GitHub Check: ubuntu-jammy-dynamic-linked-bins
- GitHub Check: musllinux_1_2-x86_64-static-linked-bins
- GitHub Check: manylinux_2_28-x86_64-static-linked-bins
- GitHub Check: ubuntu-jammy-lint
- GitHub Check: centos-stream-9-dynamic-linked-bins
- GitHub Check: musllinux_1_2-x86_64-dynamic-linked-bins
- GitHub Check: centos-stream-9-static-linked-bins
- GitHub Check: lint-check (ubuntu-24.04)
- GitHub Check: build-macos (macos-15, true)
- GitHub Check: lint-check (macos-15)
- GitHub Check: build-macos (macos-15, false)
- GitHub Check: check-generated
🔇 Additional comments (1)
components/core/src/clp/EncodedVariableInterpreter.hpp (1)
121-137: LGTM!The declarations are well-formed with appropriate
[[nodiscard]]attributes. The documentation clearly indicates the heuristic nature of these methods.
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Fix all issues with AI agents
In @components/core/src/clp/EncodedVariableInterpreter.cpp:
- Around line 200-206: The function
wildcard_string_could_be_representable_integer_var currently returns true for
empty input because std::ranges::any_of on an empty range yields false; change
it to explicitly return false when value.empty() and replace the double-negation
any_of pattern with std::ranges::all_of for clarity (i.e., ensure every
character satisfies ('0' <= c && c <= '9') || c == '-' || c == '?' || c == '*').
Keep the function name intact and ensure behavior matches
convert_string_to_representable_integer_var for the empty-string case.
- Around line 208-215: The function
wildcard_string_could_be_representable_float_var has the same empty-string edge
case and confusing double negations; update it to explicitly return false for an
empty value and simplify the predicate by using positive logic (e.g.,
std::ranges::all_of) so it returns true only when value is non-empty and every
character is one of '0'-'9', '-', '.', '?' or '*'; modify the body of
EncodedVariableInterpreter::wildcard_string_could_be_representable_float_var to
check value.empty() first, then return the all_of result.
📜 Review details
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro
📒 Files selected for processing (1)
components/core/src/clp/EncodedVariableInterpreter.cpp
🧰 Additional context used
📓 Path-based instructions (1)
**/*.{cpp,hpp,java,js,jsx,tpp,ts,tsx}
⚙️ CodeRabbit configuration file
- Prefer
false == <expression>rather than!<expression>.
Files:
components/core/src/clp/EncodedVariableInterpreter.cpp
🧠 Learnings (2)
📚 Learning: 2025-05-07T16:56:35.687Z
Learnt from: LinZhihao-723
Repo: y-scope/clp PR: 882
File: components/core/src/clp/ffi/ir_stream/search/QueryHandlerImpl.cpp:378-402
Timestamp: 2025-05-07T16:56:35.687Z
Learning: In the CLP search component, the `evaluate_wildcard_filter` function should return `AstEvaluationResult::Pruned` when `node_id_value_pairs` is empty, not `AstEvaluationResult::False`. Empty node sets should be treated as "undetermined" rather than definitive non-matches.
Applied to files:
components/core/src/clp/EncodedVariableInterpreter.cpp
📚 Learning: 2025-12-09T01:57:04.839Z
Learnt from: SharafMohamed
Repo: y-scope/clp PR: 1313
File: components/core/tests/test-SchemaSearcher.cpp:136-137
Timestamp: 2025-12-09T01:57:04.839Z
Learning: In C++ files under the components directory, keep forward declarations even when a definition follows in the same scope (such as in anonymous namespaces) to maintain consistency across multiple methods. This style reduces churn and aligns with the existing convention in this repository for both source and tests.
Applied to files:
components/core/src/clp/EncodedVariableInterpreter.cpp
🪛 Cppcheck (2.19.0)
components/core/src/clp/EncodedVariableInterpreter.cpp
[information] 3-3: Include file
(missingIncludeSystem)
[information] 3-3: Include file
(missingIncludeSystem)
[style] 200-200: The function 'wildcard_string_could_be_representable_integer_var' is never used.
(unusedFunction)
[style] 208-208: The function 'wildcard_string_could_be_representable_float_var' is never used.
(unusedFunction)
[style] 203-203: The function 'search_string_matches_all' is never used.
(unusedFunction)
[style] 205-205: The function 'get_sub_queries' is never used.
(unusedFunction)
[style] 207-207: The function 'contains_sub_queries' is never used.
(unusedFunction)
[style] 209-209: The function 'get_relevant_sub_queries' is never used.
(unusedFunction)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (16)
- GitHub Check: build-macos (macos-15, true)
- GitHub Check: build-macos (macos-15, false)
- GitHub Check: musllinux_1_2-x86_64-static-linked-bins
- GitHub Check: ubuntu-jammy-static-linked-bins
- GitHub Check: musllinux_1_2-x86_64-dynamic-linked-bins
- GitHub Check: manylinux_2_28-x86_64-static-linked-bins
- GitHub Check: package-image
- GitHub Check: ubuntu-jammy-dynamic-linked-bins
- GitHub Check: ubuntu-jammy-lint
- GitHub Check: manylinux_2_28-x86_64-dynamic-linked-bins
- GitHub Check: centos-stream-9-static-linked-bins
- GitHub Check: centos-stream-9-dynamic-linked-bins
- GitHub Check: check-generated
- GitHub Check: build (macos-15)
- GitHub Check: lint-check (macos-15)
- GitHub Check: lint-check (ubuntu-24.04)
🔇 Additional comments (1)
components/core/src/clp/EncodedVariableInterpreter.cpp (1)
3-3: LGTM!The
<algorithm>header is correctly added forstd::ranges::any_ofusage in the new helper functions.
| auto EncodedVariableInterpreter::wildcard_string_could_be_representable_float_var( | ||
| std::string_view value | ||
| ) -> bool { | ||
| return false == std::ranges::any_of(value, [](char const c) -> bool { | ||
| return false | ||
| == (('0' <= c && c <= '9') || c == '-' || c == '.' || c == '?' || c == '*'); | ||
| }); | ||
| } |
There was a problem hiding this comment.
Same empty string edge case as the integer variant.
This function has the same empty string edge case and double negation complexity as the integer version. Consider the same refactoring approach for consistency.
Suggested refactor
auto EncodedVariableInterpreter::wildcard_string_could_be_representable_float_var(
std::string_view value
) -> bool {
- return false == std::ranges::any_of(value, [](char const c) -> bool {
- return false
- == (('0' <= c && c <= '9') || c == '-' || c == '.' || c == '?' || c == '*');
- });
+ if (value.empty()) {
+ return false;
+ }
+ return std::ranges::all_of(value, [](char const c) -> bool {
+ return ('0' <= c && c <= '9') || c == '-' || c == '.' || c == '?' || c == '*';
+ });
}🧰 Tools
🪛 Cppcheck (2.19.0)
[style] 208-208: The function 'wildcard_string_could_be_representable_float_var' is never used.
(unusedFunction)
[style] 209-209: The function 'get_relevant_sub_queries' is never used.
(unusedFunction)
🤖 Prompt for AI Agents
In @components/core/src/clp/EncodedVariableInterpreter.cpp around lines 208 -
215, The function wildcard_string_could_be_representable_float_var has the same
empty-string edge case and confusing double negations; update it to explicitly
return false for an empty value and simplify the predicate by using positive
logic (e.g., std::ranges::all_of) so it returns true only when value is
non-empty and every character is one of '0'-'9', '-', '.', '?' or '*'; modify
the body of
EncodedVariableInterpreter::wildcard_string_could_be_representable_float_var to
check value.empty() first, then return the all_of result.
LinZhihao-723
left a comment
There was a problem hiding this comment.
For the PR title, how about:
feat(core): Optimize subquery generation by narrowing wildcard type inference (fixes #1864).
Going to use "wildcard token type inference" as a slight edit. |
…type inference (fixes y-scope#1864). (y-scope#1865) Co-authored-by: Lin Zhihao <59785146+LinZhihao-723@users.noreply.github.com>
…type inference (fixes y-scope#1864). (y-scope#1865) Co-authored-by: Lin Zhihao <59785146+LinZhihao-723@users.noreply.github.com>
…type inference (fixes y-scope#1864). (y-scope#1865) Co-authored-by: Lin Zhihao <59785146+LinZhihao-723@users.noreply.github.com>
Description
This PR fixes a performance bug described in #1864. The issue is that during subquery generation we treat non-definite variable tokens containing wildcards as any possible type, which can lead to us considering some impossible query interpretations that in turn lead to unnecessary decompression and scan.
The fix is to add some heuristic helpers that try to determine if a wildcard string could possibly correspond to a representable
IntVarorFloatVar. We use these helpers to skip these query interpretations when they're definitely impossible.For the query and dataset provided by @rishikeshdevsot, this change improves single-threaded query performance by about ~250x because decompression and scan is reduced from ~300GB to ~500MB as a result of not pursuing the impossible query interpretations.
Checklist
breaking change.
Validation performed
IntVarorFloatVar.Summary by CodeRabbit
New Features
Bug Fixes
Tests
✏️ Tip: You can customize this high-level summary in your review settings.