[feature](inverted index) introduce search function for inverted index #56139
TPC-H: Total hot run time: 35579 ms
TPC-DS: Total hot run time: 188423 ms
ClickBench: Total hot run time: 29.6 s
Pull Request Overview
This PR introduces a new search function for the inverted index, providing a DSL-based interface for full-text search. The implementation covers the full path from FE parsing through BE execution.
Key changes:
- Introduces ANTLR-based DSL parser for structured search queries with support for various clause types (TERM, PHRASE, PREFIX, WILDCARD, etc.)
- Adds new expression types and rewrite rules to handle search function translation from scalar function to slot-based expressions
- Implements BE search evaluation with inverted index integration supporting compound boolean queries
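To illustrate what evaluating a compound boolean query against an inverted index involves, here is a minimal sketch. It is not the code from this PR: `ToyInvertedIndex` and all of its methods are hypothetical names, and posting lists are modeled as `java.util.BitSet` row-id sets purely for illustration.

```java
import java.util.BitSet;
import java.util.HashMap;
import java.util.Map;

// Hypothetical toy index (not the PR's implementation):
// maps "field:term" to the set of row ids containing that term.
class ToyInvertedIndex {
    private final Map<String, BitSet> postings = new HashMap<>();
    private final int rowCount;

    ToyInvertedIndex(int rowCount) {
        this.rowCount = rowCount;
    }

    void add(String field, String term, int rowId) {
        postings.computeIfAbsent(field + ":" + term, k -> new BitSet(rowCount)).set(rowId);
    }

    // TERM clause: look up the posting list; an unknown term matches no rows.
    BitSet term(String field, String term) {
        BitSet hit = postings.get(field + ":" + term);
        return hit == null ? new BitSet(rowCount) : (BitSet) hit.clone();
    }

    // AND of two clauses: intersection of their row sets.
    static BitSet and(BitSet a, BitSet b) {
        BitSet r = (BitSet) a.clone();
        r.and(b);
        return r;
    }

    // OR of two clauses: union of their row sets.
    static BitSet or(BitSet a, BitSet b) {
        BitSet r = (BitSet) a.clone();
        r.or(b);
        return r;
    }

    // NOT of a clause: complement within [0, rowCount).
    BitSet not(BitSet a) {
        BitSet r = (BitSet) a.clone();
        r.flip(0, rowCount);
        return r;
    }
}
```

A compound query such as `title:doris AND body:index` then reduces to `ToyInvertedIndex.and(idx.term("title", "doris"), idx.term("body", "index"))`. This sketch ignores NULL handling, which a real engine has to track separately.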
Reviewed Changes
Copilot reviewed 24 out of 24 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| gensrc/thrift/Exprs.thrift | Adds thrift structures for search DSL communication between FE and BE |
| fe/fe-core/src/main/antlr4/org/apache/doris/nereids/search/ | ANTLR grammar files for parsing search DSL syntax |
| fe/fe-core/src/main/java/.../SearchDslParser.java | DSL parser implementation with AST building and field binding extraction |
| fe/fe-core/src/main/java/.../Search.java | Scalar function implementation for search DSL |
| fe/fe-core/src/main/java/.../SearchExpression.java | Expression type for translated search with slot references |
| fe/fe-core/src/main/java/.../RewriteSearchToSlots.java | Rewrite rule to convert search functions to slot-based expressions |
| fe/fe-core/src/main/java/.../SearchPredicate.java | Analysis layer predicate for FE-to-BE translation |
| be/src/vec/functions/function_search.* | BE function implementation with inverted index evaluation |
| be/src/vec/exprs/vsearch.* | BE expression evaluation for search operations |
...c/main/java/org/apache/doris/nereids/trees/expressions/functions/scalar/SearchDslParser.java
    private String getCurrentFieldName() {
        // This is a simplified approach - in a real implementation,
        // we'd need to track context properly
        return fieldNames.isEmpty() ? "_all" : fieldNames.iterator().next();
    }
Copilot AI, Sep 17, 2025:
The getCurrentFieldName method returns an arbitrary field name from a Set, which has no guaranteed ordering. This could lead to inconsistent field name resolution. Consider maintaining a proper context stack or another field name resolution mechanism.
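One way to read the "context stack" suggestion is sketched below. This is an assumption about what such a mechanism could look like, not code from the PR: `FieldContext`, `enter`, `exit`, and `current` are hypothetical names, and the `"_all"` default is taken from the snippet under review.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical helper (not part of the PR): tracks which field is in scope
// while a visitor walks nested DSL clauses.
class FieldContext {
    private static final String DEFAULT_FIELD = "_all";
    private final Deque<String> stack = new ArrayDeque<>();

    // Call when the visitor enters a clause that is bound to a field.
    void enter(String fieldName) {
        stack.push(fieldName);
    }

    // Call when the visitor leaves that clause.
    void exit() {
        stack.pop();
    }

    // Innermost field in scope, or the default when no field is bound.
    String current() {
        String top = stack.peek();
        return top == null ? DEFAULT_FIELD : top;
    }
}
```

Because the result depends only on the order of `enter` and `exit` calls, resolution is deterministic regardless of how the set of field names is iterated.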
    // Open directory directly
    auto directory = DORIS_TRY(
            index_file_reader->open(&inverted_reader->get_index_meta(), context->io_ctx));
Copilot AI, Sep 17, 2025:
The DORIS_TRY macro will return from the function on error, but the subsequent code assumes directory is always valid. This could lead to accessing a null pointer if the macro doesn't behave as expected.
Suggested change:

    -         index_file_reader->open(&inverted_reader->get_index_meta(), context->io_ctx));
    +         index_file_reader->open(&inverted_reader->get_index_meta(), context->io_ctx));
    + if (!directory) {
    +     LOG(WARNING) << "search: Failed to open directory for field: " << field_name;
    +     continue;
    + }
      DBUG_EXECUTE_IF("inverted_index_reader.select_best_reader", {
          auto type = DebugPoints::instance()->get_debug_param_or_default<int32_t>(
    -             "inverted_index_reader._select_best_reader", "type", -1);
    +             "inverted_index_reader.select_best_reader", "type", -1);
Copilot AI, Sep 17, 2025:
The debug point name has been updated from _select_best_reader to select_best_reader but the string literal still uses the old name with underscore prefix. This inconsistency could cause debug functionality to not work properly.
TPC-H: Total hot run time: 34698 ms
TPC-DS: Total hot run time: 188419 ms
ClickBench: Total hot run time: 29.98 s
3ec1b4a to f5af2dc
cefc7dc to 1124b83
ClickBench: Total hot run time: 30.13 s
1124b83 to 9bf6960
PR approved by at least one committer and no changes requested.
…pache#56699)
Related PR: apache#56139
Problem Summary: This PR fixes a bug in NULL bitmap handling for MATCH OR queries on inverted indexes. The bug caused incorrect boolean evaluation when an OR combined TRUE and NULL values.
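For reference, standard SQL three-valued logic says TRUE OR NULL is TRUE, while FALSE OR NULL and NULL OR NULL are NULL. The sketch below shows how that rule can be expressed over a pair of row bitmaps. It is an illustration of the semantics only, not the fix from apache#56699: `NullableBitmap` and its members are hypothetical names, and it assumes each operand's true rows and null rows are disjoint.

```java
import java.util.BitSet;

// Hypothetical model (not Doris code): a predicate result as two disjoint
// row sets, rows that evaluated to TRUE and rows that evaluated to NULL.
// Every other row is FALSE.
class NullableBitmap {
    final BitSet trueRows;
    final BitSet nullRows;

    NullableBitmap(BitSet trueRows, BitSet nullRows) {
        this.trueRows = (BitSet) trueRows.clone();
        this.nullRows = (BitSet) nullRows.clone();
    }

    // Convenience constructor for a row set from explicit row ids.
    static BitSet rows(int... ids) {
        BitSet b = new BitSet();
        for (int id : ids) {
            b.set(id);
        }
        return b;
    }

    // Three-valued OR: a row is TRUE if either side is TRUE; it is NULL if
    // either side is NULL and neither side is TRUE.
    static NullableBitmap or(NullableBitmap a, NullableBitmap b) {
        BitSet t = (BitSet) a.trueRows.clone();
        t.or(b.trueRows);
        BitSet n = (BitSet) a.nullRows.clone();
        n.or(b.nullRows);
        n.andNot(t); // TRUE wins over NULL
        return new NullableBitmap(t, n);
    }
}
```

The `andNot` step is the part that matters here: without it, a row that is TRUE on one side and NULL on the other would stay in the null bitmap.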
### What problem does this PR solve?
Issue Number: close #xxx
Related PR: #56139
Problem Summary: This PR adds restrictions for the search() function to ensure it can only be used in WHERE clauses on single-table OLAP scans. The implementation includes validation rules that reject search() usage in other contexts like SELECT projections, GROUP BY clauses, HAVING clauses, and multi-table scenarios.
### What problem does this PR solve?
Issue Number: close #xxx
Related PR: #56139
Problem Summary: This PR adds EXACT DSL functionality to the search function, enabling exact string matching without tokenization. This feature complements existing ANY/ALL operators that work with tokenized indexes by providing strict string equality matching.
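The difference between tokenized matching and strict equality can be sketched as follows. This is an illustration under stated assumptions, not the PR's implementation: `MatchModes` and its methods are hypothetical, the tokenizer is a simple lowercase whitespace split, and ANY/ALL are assumed to mean "at least one query token present" and "every query token present".

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical illustration (not Doris code) of tokenized vs. exact matching.
class MatchModes {
    // Assumed tokenizer: lowercase, then split on whitespace.
    static Set<String> tokenize(String text) {
        return new HashSet<>(Arrays.asList(text.toLowerCase(Locale.ROOT).trim().split("\\s+")));
    }

    // ANY (assumed semantics): at least one query token appears in the value.
    static boolean matchAny(String value, String query) {
        Set<String> tokens = tokenize(value);
        for (String q : tokenize(query)) {
            if (tokens.contains(q)) {
                return true;
            }
        }
        return false;
    }

    // ALL (assumed semantics): every query token appears in the value.
    static boolean matchAll(String value, String query) {
        return tokenize(value).containsAll(tokenize(query));
    }

    // EXACT: strict string equality, no tokenization or case folding.
    static boolean matchExact(String value, String query) {
        return value.equals(query);
    }
}
```

For the value `"Apache Doris"`, the query `doris` matches under both ANY and ALL in this sketch but not under EXACT, which only accepts the full original string.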
…56718)
### What problem does this PR solve?
Issue Number: close #xxx
Related PR: #56139
Problem Summary: This PR adds support for variant subcolumn access in search functions, enabling search queries to target specific JSON paths within variant columns using dot notation (e.g., field.subcolumn). The feature extends the search DSL to handle variant data types with subcolumn paths, allowing more granular search capabilities on semi-structured data.
```
SELECT * FROM test_variant_search_subcolumn WHERE search('variantColumn.subcolumn:textMatched');
```
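As a rough picture of how a dot-notation clause such as `variantColumn.subcolumn:textMatched` breaks down, the sketch below splits it into a column, a subcolumn path, and a term. It is a hypothetical helper (`FieldPath` is not a class from the PR) and it handles only the plain `field:term` shape, without quoting, escaping, or the other clause types the ANTLR grammar supports.

```java
// Hypothetical helper (not part of the PR): splits "column.sub.path:term"
// into its three parts for illustration.
class FieldPath {
    final String column;
    final String subPath; // empty when the field has no subcolumn path
    final String term;

    FieldPath(String column, String subPath, String term) {
        this.column = column;
        this.subPath = subPath;
        this.term = term;
    }

    static FieldPath parse(String clause) {
        int colon = clause.indexOf(':');
        if (colon <= 0) {
            throw new IllegalArgumentException("expected field:term, got: " + clause);
        }
        String field = clause.substring(0, colon);
        String term = clause.substring(colon + 1);
        // The first dot separates the column from the path inside the variant.
        int dot = field.indexOf('.');
        String column = dot < 0 ? field : field.substring(0, dot);
        String subPath = dot < 0 ? "" : field.substring(dot + 1);
        return new FieldPath(column, subPath, term);
    }
}
```

With this split, `FieldPath.parse("variantColumn.subcolumn:textMatched")` yields the column `variantColumn`, the path `subcolumn`, and the term `textMatched`.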
What problem does this PR solve?
Issue Number: close #56682
Related PR: #xxx
Problem Summary:
Release note
None