@zzzxl1993
Contributor

@zzzxl1993 zzzxl1993 commented Oct 15, 2025

What problem does this PR solve?

Issue Number: close #xxx

Related PR: #xxx

Problem Summary:
This PR adds new query types to the inverted index query_v2 framework, specifically implementing regexp, wildcard, and phrase queries. The changes also include a rename from BitmapQuery to BitSetQuery for better clarity.

Key changes:

  • Implementation of three new query types (regexp, wildcard, phrase) with their corresponding weight and scorer classes
  • Refactoring to move common reader lookup logic to the base Weight class
  • Renaming of bitmap-related classes to bit_set for more accurate terminology
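
For readers unfamiliar with the wildcard approach: the per-file summary in the review describes the wildcard query as converting wildcards to regex. The following is only a minimal sketch of that idea, not the code in this PR. It assumes `*` matches any run of characters and `?` matches exactly one, uses `std::regex` instead of the hyperscan matcher mentioned for the regexp query, and the names `wildcard_to_regex`, `wildcard_match` and `wildcard_demo` are hypothetical.

```cpp
#include <cassert>
#include <regex>
#include <string>
#include <string_view>

// Hypothetical helper: translate a wildcard pattern into an ECMAScript regex.
// Assumed semantics: '*' = any run of characters, '?' = exactly one character.
std::string wildcard_to_regex(std::string_view pattern) {
    static constexpr std::string_view kMeta = R"re(\.^$|()[]{}+)re";
    std::string out;
    out.reserve(pattern.size() * 2);
    for (char c : pattern) {
        if (c == '*') {
            out += ".*";
        } else if (c == '?') {
            out += '.';
        } else {
            if (kMeta.find(c) != std::string_view::npos) {
                out += '\\'; // escape regex metacharacters so they match literally
            }
            out += c;
        }
    }
    return out;
}

// Whole-term match, which is what a scan over dictionary terms would need.
bool wildcard_match(std::string_view pattern, const std::string& term) {
    const std::regex re(wildcard_to_regex(pattern));
    return std::regex_match(term, re);
}

// Usage example.
void wildcard_demo() {
    assert(wildcard_match("te?t*", "testing"));
    assert(wildcard_match("*", ""));
    assert(!wildcard_match("a.b", "axb")); // '.' in the pattern is a literal dot
}
```

Escaping the remaining metacharacters matters: without it, a term pattern such as `a.b` would silently behave like `a?b`.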

Release note

None

Check List (For Author)

  • Test

    • Regression test
    • Unit Test
    • Manual test (add detailed scripts or steps below)
    • No need to test or manual test. Explain why:
      • This is a refactor/code format and no logic has been changed.
      • Previous test can cover this change.
      • No code files have been changed.
      • Other reason
  • Behavior changed:

    • No.
    • Yes.
  • Does this need documentation?

    • No.
    • Yes.

Check List (For Reviewer who merge this PR)

  • Confirm the release note
  • Confirm test cases
  • Confirm document
  • Add branch pick label

@Thearas
Contributor

Thearas commented Oct 15, 2025

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (it's best to include specific error reporting information). How it was fixed.
  2. Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
  3. What features were added. Why was this function added?
  4. Which code was refactored and why was this part of the code refactored?
  5. Which functions were optimized and what is the difference before and after the optimization?

@zzzxl1993
Contributor Author

run buildall

@zzzxl1993 zzzxl1993 changed the title [feture](inverted index) query_v2 add regexp_query and wildcard_query [feture](inverted index) query_v2 add regexp, wildcard, phrase query Oct 24, 2025
@zzzxl1993
Contributor Author

run buildall

@doris-robot

ClickBench: Total hot run time: 27.7 s
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/clickbench-tools
ClickBench test result on commit 966cdd80083cadabac346fff3fb99db632e290e8, data reload: false

query1	0.06	0.05	0.05
query2	0.10	0.05	0.05
query3	0.26	0.09	0.08
query4	1.61	0.12	0.12
query5	0.28	0.26	0.24
query6	1.17	0.64	0.65
query7	0.03	0.03	0.02
query8	0.06	0.04	0.04
query9	0.64	0.54	0.52
query10	0.59	0.58	0.57
query11	0.17	0.12	0.11
query12	0.15	0.12	0.12
query13	0.62	0.60	0.62
query14	1.01	1.03	1.01
query15	0.84	0.84	0.83
query16	0.40	0.40	0.39
query17	0.99	0.98	1.02
query18	0.22	0.20	0.21
query19	1.95	1.78	1.80
query20	0.02	0.01	0.01
query21	15.46	0.19	0.12
query22	5.06	0.07	0.05
query23	15.68	0.26	0.11
query24	2.89	1.03	0.66
query25	0.08	0.06	0.06
query26	0.14	0.14	0.13
query27	0.06	0.05	0.06
query28	4.20	1.14	0.94
query29	12.61	3.91	3.26
query30	0.30	0.15	0.11
query31	2.82	0.57	0.38
query32	3.23	0.54	0.47
query33	3.07	3.03	3.02
query34	16.00	5.16	4.53
query35	4.57	4.59	4.58
query36	0.68	0.50	0.51
query37	0.10	0.08	0.07
query38	0.07	0.04	0.04
query39	0.04	0.04	0.03
query40	0.18	0.15	0.13
query41	0.09	0.03	0.04
query42	0.05	0.03	0.03
query43	0.05	0.04	0.04
Total cold run time: 98.6 s
Total hot run time: 27.7 s

@zzzxl1993
Contributor Author

run buildall

@zzzxl1993
Contributor Author

run buildall

@zzzxl1993
Contributor Author

run buildall

@zzzxl1993
Contributor Author

run buildall

@airborne12 airborne12 requested a review from Copilot October 27, 2025 02:31

Copilot AI left a comment

Pull Request Overview

This PR adds new query types to the inverted index query_v2 framework, specifically implementing regexp, wildcard, and phrase queries. The changes also include a rename from BitmapQuery to BitSetQuery for better clarity.

Key changes:

  • Implementation of three new query types (regexp, wildcard, phrase) with their corresponding weight and scorer classes
  • Refactoring to move common reader lookup logic to the base Weight class
  • Renaming of bitmap-related classes to bit_set for more accurate terminology
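
To illustrate what position-based phrase matching means in practice, here is a small sketch. It is an assumption-laden illustration rather than the PR's scorer: it works on already materialized position lists for a single document, and `phrase_matches` and `phrase_demo` are hypothetical names.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// positions[i] holds the sorted positions of the i-th phrase term inside one
// document. The phrase matches if some start position p exists such that
// term i occurs at position p + i for every i.
bool phrase_matches(const std::vector<std::vector<uint32_t>>& positions) {
    if (positions.empty()) {
        return false;
    }
    for (uint32_t start : positions[0]) {
        bool all_adjacent = true;
        for (std::size_t i = 1; i < positions.size(); ++i) {
            const uint32_t wanted = start + static_cast<uint32_t>(i);
            if (!std::binary_search(positions[i].begin(), positions[i].end(), wanted)) {
                all_adjacent = false;
                break;
            }
        }
        if (all_adjacent) {
            return true;
        }
    }
    return false;
}

// Usage example for the document "quick brown fox jumps quick fox".
void phrase_demo() {
    const std::vector<uint32_t> quick = {0, 4};
    const std::vector<uint32_t> brown = {1};
    const std::vector<uint32_t> fox = {2, 5};
    assert(phrase_matches({quick, brown, fox})); // "quick brown fox" at 0,1,2
    assert(!phrase_matches({brown, quick}));     // "brown quick" never occurs
}
```

A real scorer would first intersect the postings of all terms and only then compare positions for the candidate documents.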

Reviewed Changes

Copilot reviewed 33 out of 33 changed files in this pull request and generated 2 comments.

Summary per file

| File | Description |
| --- | --- |
| wildcard_query_test.cpp | Comprehensive test suite for wildcard query functionality |
| regexp_query_test.cpp | Test cases for regular expression query operations |
| phrase_query_test.cpp | Tests for phrase query matching with position-based term matching |
| intersection_test.cpp | Tests for intersection operations on doc sets |
| boolean_query_test.cpp | Updated references from BitmapQuery to BitSetQuery |
| query_helper_test.cpp | Added mock method for multi-term similarity scoring |
| function_search.cpp | Updated comments and references to use BitSetQuery |
| vsearch.cpp | Updated comments referencing BitSetQuery |
| similarity.h | Added for_terms method for multi-term similarity calculation |
| bm25_similarity.h/cpp | Implementation of BM25 scoring for multiple terms |
| wildcard_weight.h/query.h | Wildcard query implementation converting wildcards to regex |
| regexp_weight.h/cpp/query.h | Regexp query with hyperscan pattern matching |
| phrase_weight.h/scorer.h/cpp/query.h | Phrase query with position-based matching |
| postings_with_offset.h | Helper class for position-aware postings |
| intersection.h/cpp | Generic intersection implementation for doc sets |
| doc_set.h | Added MockDocSet for testing and freq/norm methods |
| weight.h | Moved common reader lookup logic to base class |
| term_weight.h/term_scorer.h | Refactored to use base class reader lookup |
| segment_postings.h | Added position extraction methods and made freq/norm non-virtual |
| const_score_scorer.h | Const score wrapper for scorers |
| bit_set_query/* | Renamed from bitmap_query for clarity |
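
As background for the intersection entry above, the core idea of intersecting sorted doc sets can be sketched with a two-pointer merge. This is a simplified illustration over plain vectors, not the generic implementation in intersection.h/cpp, which operates on doc set objects; `intersect_sorted` and `intersect_demo` are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Intersect two ascending lists of doc ids. Each step advances whichever side
// is behind, so the cost is linear in the combined length of the inputs.
std::vector<uint32_t> intersect_sorted(const std::vector<uint32_t>& a,
                                       const std::vector<uint32_t>& b) {
    std::vector<uint32_t> out;
    std::size_t i = 0;
    std::size_t j = 0;
    while (i < a.size() && j < b.size()) {
        if (a[i] < b[j]) {
            ++i;
        } else if (b[j] < a[i]) {
            ++j;
        } else {
            out.push_back(a[i]); // doc id present in both lists
            ++i;
            ++j;
        }
    }
    return out;
}

// Usage example.
void intersect_demo() {
    const std::vector<uint32_t> left = {1, 3, 5, 8};
    const std::vector<uint32_t> right = {3, 4, 8, 9};
    const std::vector<uint32_t> expected = {3, 8};
    assert(intersect_sorted(left, right) == expected);
}
```

Both inputs must already be sorted, which is why a doc set built from unordered ids needs a sort step before it can take part in an intersection.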

if (_docs.empty()) {
    _current_doc = TERMINATED;
} else {
    std::ranges::sort(_docs.begin(), _docs.end());
Copilot AI Oct 27, 2025

Using std::ranges::sort with .begin() and .end() iterators is incorrect. Either use std::ranges::sort(_docs) directly or use std::sort(_docs.begin(), _docs.end()).

Suggested change
- std::ranges::sort(_docs.begin(), _docs.end());
+ std::ranges::sort(_docs);

Comment on lines +80 to +84
static_assert(
        requires(TermIterator it) {
            it->freq();
            it->nextPosition();
        }, "TermIterator must expose freq() and nextPosition()");
Copilot AI Oct 27, 2025

The static_assert with a requires clause should use std::is_invocable or a proper C++20 concept. The current syntax mixing static_assert with requires expression may not compile correctly on all compilers.

@doris-robot

TPC-DS: Total hot run time: 190511 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit a327f71c5fdd8c451f259953151a9b8971c4fcec, data reload: false

query1	1077	444	401	401
query2	6589	1658	1736	1658
query3	6752	227	223	223
query4	26652	23877	23200	23200
query5	4936	662	500	500
query6	351	243	236	236
query7	4672	498	310	310
query8	332	309	281	281
query9	8751	2604	2578	2578
query10	516	359	294	294
query11	15577	15051	14843	14843
query12	191	125	115	115
query13	1691	578	445	445
query14	11606	9409	9377	9377
query15	238	204	171	171
query16	7785	714	567	567
query17	1642	788	639	639
query18	2206	438	362	362
query19	440	210	185	185
query20	138	126	137	126
query21	218	132	123	123
query22	4649	4626	4518	4518
query23	34611	34047	33991	33991
query24	8522	2519	2537	2519
query25	623	552	507	507
query26	1705	294	173	173
query27	2779	530	389	389
query28	4536	2242	2220	2220
query29	817	629	525	525
query30	316	254	203	203
query31	945	856	811	811
query32	91	77	69	69
query33	583	392	346	346
query34	851	858	529	529
query35	780	848	803	803
query36	934	989	876	876
query37	129	112	85	85
query38	3520	3535	3428	3428
query39	1472	1444	1425	1425
query40	220	134	121	121
query41	65	62	64	62
query42	121	112	117	112
query43	475	484	499	484
query44	1248	751	757	751
query45	184	182	176	176
query46	909	987	642	642
query47	1753	1784	1722	1722
query48	399	437	319	319
query49	779	522	429	429
query50	672	701	415	415
query51	3861	4001	3961	3961
query52	110	111	105	105
query53	244	267	201	201
query54	598	591	552	552
query55	89	83	88	83
query56	329	337	315	315
query57	1172	1187	1100	1100
query58	281	282	289	282
query59	2617	2653	2560	2560
query60	359	344	348	344
query61	170	155	155	155
query62	822	739	702	702
query63	241	197	218	197
query64	4391	1182	888	888
query65	4021	3942	3954	3942
query66	1093	440	340	340
query67	15353	15031	15011	15011
query68	8368	892	594	594
query69	506	339	291	291
query70	1337	1269	1255	1255
query71	509	352	325	325
query72	5882	4903	4928	4903
query73	693	587	366	366
query74	9081	8957	8868	8868
query75	4064	3324	2870	2870
query76	3816	1138	719	719
query77	831	408	318	318
query78	9518	9792	8886	8886
query79	2028	834	599	599
query80	633	592	501	501
query81	487	269	227	227
query82	419	161	151	151
query83	276	270	255	255
query84	255	103	102	102
query85	889	482	426	426
query86	359	331	297	297
query87	3739	3712	3646	3646
query88	3489	2258	2239	2239
query89	399	325	287	287
query90	2012	217	220	217
query91	166	172	149	149
query92	81	70	71	70
query93	1256	988	641	641
query94	705	411	303	303
query95	397	320	319	319
query96	492	583	287	287
query97	2960	3002	2893	2893
query98	243	215	207	207
query99	1627	1436	1278	1278
Total cold run time: 280316 ms
Total hot run time: 190511 ms

@doris-robot

ClickBench: Total hot run time: 27.48 s
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/clickbench-tools
ClickBench test result on commit a327f71c5fdd8c451f259953151a9b8971c4fcec, data reload: false

query1	0.05	0.05	0.04
query2	0.09	0.05	0.06
query3	0.26	0.08	0.08
query4	1.61	0.12	0.12
query5	0.26	0.26	0.25
query6	1.19	0.66	0.64
query7	0.03	0.02	0.02
query8	0.06	0.04	0.04
query9	0.61	0.52	0.52
query10	0.59	0.58	0.58
query11	0.16	0.12	0.12
query12	0.16	0.12	0.15
query13	0.62	0.61	0.59
query14	1.02	1.03	1.00
query15	0.84	0.83	0.84
query16	0.40	0.40	0.40
query17	1.02	1.05	1.04
query18	0.21	0.19	0.20
query19	1.96	1.80	1.86
query20	0.02	0.01	0.01
query21	15.44	0.18	0.13
query22	5.13	0.07	0.05
query23	15.69	0.27	0.10
query24	2.99	0.50	0.31
query25	0.07	0.07	0.06
query26	0.14	0.13	0.14
query27	0.07	0.05	0.06
query28	3.57	1.14	0.93
query29	12.55	3.98	3.34
query30	0.29	0.16	0.12
query31	2.80	0.59	0.38
query32	3.23	0.54	0.47
query33	3.05	3.03	3.02
query34	15.88	5.14	4.57
query35	4.54	4.53	4.58
query36	0.67	0.52	0.49
query37	0.10	0.07	0.07
query38	0.07	0.05	0.04
query39	0.04	0.02	0.03
query40	0.17	0.16	0.15
query41	0.09	0.03	0.02
query42	0.04	0.04	0.03
query43	0.04	0.04	0.03
Total cold run time: 97.82 s
Total hot run time: 27.48 s

@airborne12
Member

run buildall

@doris-robot

ClickBench: Total hot run time: 29.52 s
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/clickbench-tools
ClickBench test result on commit 96902e15666dd257dace5b9305da321555914fb1, data reload: false

query1	0.06	0.06	0.06
query2	0.11	0.06	0.06
query3	0.26	0.09	0.11
query4	1.61	0.13	0.12
query5	0.29	0.28	0.28
query6	1.21	0.70	0.69
query7	0.03	0.03	0.03
query8	0.07	0.05	0.05
query9	0.66	0.58	0.57
query10	0.63	0.63	0.63
query11	0.19	0.13	0.13
query12	0.21	0.13	0.14
query13	0.64	0.62	0.61
query14	1.05	1.05	1.03
query15	0.91	0.88	0.89
query16	0.43	0.43	0.43
query17	1.17	1.28	1.09
query18	0.23	0.23	0.22
query19	2.03	1.94	2.02
query20	0.02	0.02	0.02
query21	15.36	0.22	0.15
query22	4.95	0.08	0.06
query23	15.60	0.31	0.11
query24	2.42	0.68	0.70
query25	0.09	0.07	0.06
query26	0.16	0.17	0.16
query27	0.07	0.06	0.06
query28	4.68	1.23	0.98
query29	12.62	4.67	3.91
query30	0.30	0.15	0.13
query31	2.84	0.66	0.42
query32	3.25	0.58	0.48
query33	3.17	3.12	3.17
query34	15.77	5.25	4.58
query35	4.63	4.66	4.63
query36	0.72	0.54	0.53
query37	0.12	0.07	0.07
query38	0.07	0.05	0.05
query39	0.04	0.03	0.03
query40	0.18	0.17	0.15
query41	0.10	0.03	0.04
query42	0.05	0.03	0.04
query43	0.04	0.05	0.04
Total cold run time: 99.04 s
Total hot run time: 29.52 s

@airborne12 airborne12 changed the title [feture](inverted index) query_v2 add regexp, wildcard, phrase query [feature](inverted index) query_v2 add regexp, wildcard, phrase query Oct 27, 2025
@hello-stephen
Contributor

BE UT Coverage Report

Increment line coverage 84.73% (527/622) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 52.74% (18040/34207)
Line Coverage 37.98% (163524/430580)
Region Coverage 32.31% (124458/385240)
Branch Coverage 33.72% (54480/161572)

@hello-stephen
Contributor

BE Regression && UT Coverage Report

Increment line coverage 43.02% (37/86) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 71.39% (23912/33495)
Line Coverage 57.83% (248813/430264)
Region Coverage 52.94% (206408/389871)
Branch Coverage 54.61% (88631/162294)

Member

@airborne12 airborne12 left a comment

LGTM

Contributor

@csun5285 csun5285 left a comment

LGTM

@github-actions github-actions bot added the approved Indicates a PR has been approved by one committer. label Oct 27, 2025
@github-actions
Contributor

PR approved by at least one committer and no changes requested.

@github-actions
Contributor

PR approved by anyone and no changes requested.

@airborne12 airborne12 merged commit a42e763 into apache:master Oct 27, 2025
29 of 30 checks passed
github-actions bot pushed a commit that referenced this pull request Oct 27, 2025
…#57007)

yiguolei pushed a commit that referenced this pull request Oct 28, 2025
…phrase query #57007 (#57370)

Cherry-picked from #57007

Co-authored-by: zzzxl <yangsiyu@selectdb.com>
airborne12 added a commit that referenced this pull request Oct 29, 2025
…fact some code (#57372)

Related PR: #57007

Problem Summary:
This PR enhances the search functionality by adding support for phrase
queries, wildcard queries, and regex queries, while refactoring code to
improve maintainability and ensure proper NULL semantics handling across
all query types.
github-actions bot pushed a commit that referenced this pull request Oct 29, 2025
…fact some code (#57372)

dwdwqfwe pushed a commit to dwdwqfwe/doris that referenced this pull request Oct 31, 2025
…apache#57007)

dwdwqfwe pushed a commit to dwdwqfwe/doris that referenced this pull request Oct 31, 2025
…fact some code (apache#57372)

@yiguolei yiguolei mentioned this pull request Nov 5, 2025
airborne12 pushed a commit to airborne12/apache-doris that referenced this pull request Jan 7, 2026
…apache#57007)

airborne12 added a commit to airborne12/apache-doris that referenced this pull request Jan 7, 2026
…fact some code (apache#57372)

Labels

approved (Indicates a PR has been approved by one committer.), dev/4.0.1-merged, reviewed

7 participants