Fix phrase match to not match with unknown tokens #7252

Merged: agourlay merged 3 commits into dev from fix-phrase-match-no-match-unknown-token on Sep 16, 2025


@agourlay (Member) commented Sep 15, 2025

Original report on Discord: https://discord.com/channels/907569970500743200/1415930786171064380

The phrase matcher currently ignores unknown tokens in the input query.

This behavior creates false positives during document matching.

The proposed solution is to not build a Document when the query contains unknown tokens.
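
A minimal sketch of the idea, not the actual Qdrant code: `Vocab`, `TokenId`, `parse_phrase_lenient`, and `parse_phrase` are hypothetical names, and lowercased whitespace splitting is an assumed tokenizer. It only contrasts dropping unknown tokens with rejecting the whole phrase as soon as one token is unknown.

```rust
use std::collections::HashMap;

/// Hypothetical token id type (an assumption for this sketch).
type TokenId = u32;

/// Hypothetical vocabulary mapping each indexed token to its id.
struct Vocab {
    ids: HashMap<String, TokenId>,
}

impl Vocab {
    /// Assigns ids in order of appearance.
    fn from_tokens(tokens: &[&str]) -> Self {
        let ids = tokens
            .iter()
            .enumerate()
            .map(|(i, t)| (t.to_string(), i as TokenId))
            .collect();
        Vocab { ids }
    }

    /// Previous behavior (sketched): unknown tokens are silently skipped,
    /// so the resulting phrase can be shorter than the query.
    fn parse_phrase_lenient(&self, phrase: &str) -> Vec<TokenId> {
        phrase
            .split_whitespace()
            .filter_map(|t| self.ids.get(&t.to_lowercase()).copied())
            .collect()
    }

    /// Fixed behavior (sketched): a single unknown token yields `None`,
    /// so no phrase is built and nothing can match.
    fn parse_phrase(&self, phrase: &str) -> Option<Vec<TokenId>> {
        phrase
            .split_whitespace()
            .map(|t| self.ids.get(&t.to_lowercase()).copied())
            .collect()
    }
}
```

With a vocabulary of `quick`, `brown`, `fox`, the lenient variant turns the query "quick purple fox" into the two-token phrase "quick fox", while the strict variant returns `None`.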

@agourlay agourlay marked this pull request as ready for review September 15, 2025 14:51
@@ -286,7 +286,7 @@ impl FullTextIndex {
phrase: &str,
hw_counter: &HardwareCounterCell,
) -> Option<ParsedQuery> {
@agourlay (Member Author) commented on this hunk:

Interestingly, the signature of the function was already set up to return an `Option`, and the rustdoc even says:

> If there are any unseen tokens, returns None

I guess this was an oversight :)

@qdrant qdrant deleted a comment from coderabbitai bot Sep 16, 2025
@generall (Member)

Does this PR fix the behavior in the provided snapshot?
I saw at least one document which actually contains all tokens. Is it because of multiple segments, so the token is known in one segment but not another?

@agourlay (Member Author)

> Does this PR fix the behavior in the provided snapshot?

Yes, it is a query-time fix.

> I saw at least one document which actually contains all tokens. Is it because of multiple segments, so the token is known in one segment but not another?

Yes, exactly: the second segment did not know the token, so it started offering partial matches.
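
The multi-segment effect described here can be reproduced with a toy model. This is a sketch under assumptions, not Qdrant's implementation: `Segment`, `tokenize`, `match_phrase`, and the `strict` flag are hypothetical, each segment keeps its own vocabulary, and a phrase matches when its tokens appear consecutively in a document.

```rust
use std::collections::HashSet;

/// Assumed tokenizer: lowercase and split on whitespace.
fn tokenize(text: &str) -> Vec<String> {
    text.split_whitespace().map(|t| t.to_lowercase()).collect()
}

/// Hypothetical segment: it only knows the tokens of its own documents.
struct Segment {
    docs: Vec<Vec<String>>,
    vocab: HashSet<String>,
}

impl Segment {
    fn new(texts: &[&str]) -> Self {
        let docs: Vec<Vec<String>> = texts.iter().map(|t| tokenize(t)).collect();
        let vocab = docs.iter().flatten().cloned().collect();
        Segment { docs, vocab }
    }

    /// Returns the indices of documents containing the phrase as consecutive tokens.
    /// `strict == false` drops unknown tokens (the behavior before the fix);
    /// `strict == true` returns no match if any token is unknown to this segment.
    fn match_phrase(&self, phrase: &str, strict: bool) -> Vec<usize> {
        let query = tokenize(phrase);
        let known: Vec<String> = query
            .iter()
            .filter(|t| self.vocab.contains(*t))
            .cloned()
            .collect();
        if known.is_empty() || (strict && known.len() != query.len()) {
            return Vec::new();
        }
        self.docs
            .iter()
            .enumerate()
            .filter(|(_, doc)| doc.windows(known.len()).any(|w| w == known.as_slice()))
            .map(|(i, _)| i)
            .collect()
    }
}
```

If one segment holds "red apple pie" and another holds "red pie chart", the query "red apple pie" matches the first segment in both modes. In the second segment, "apple" is unknown: the lenient mode searches for "red pie" and reports a false positive, while the strict mode returns nothing.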

@agourlay agourlay merged commit e79100a into dev Sep 16, 2025
21 of 22 checks passed
@agourlay agourlay deleted the fix-phrase-match-no-match-unknown-token branch September 16, 2025 09:16
timvisee pushed a commit that referenced this pull request Sep 29, 2025
* Fix phrase match to not match with unknown tokens

* add tests

* spelling
@timvisee timvisee mentioned this pull request Sep 29, 2025