BaseTokenStreamTestCase.assertAnalyzesTo fails when Analyzer contains… by lukas-vlcek · Pull Request #12750 · apache/lucene

lukas-vlcek · 2023-11-02T17:05:18Z

… PathHierarchy tokenizer

Description

This PR is expected to fail. It demonstrates an issue with the BaseTokenStreamTestCase.assertAnalyzesTo() method when it is used with PathHierarchyTokenizer.

Is there any reason why PathHierarchyTokenizer should not be used in a test like this? Other tokenizers are definitely tested this way, i.e. they are wrapped in an Analyzer and then assertAnalyzesTo() is called to check the tokens. What is special about PathHierarchyTokenizer that makes it fail?

I think the problem might not be in the tokenizer but in the test method itself, or in the way I call it (maybe I need to pass more parameters/flags to get rid of the issue?). The testing method is complex, especially once it reaches the checkAnalysisConsistency() part.

I am looking for any useful tips. Thank you!

Signed-off-by: Lukáš Vlček <lukas.vlcek@aiven.io>

mikemccand · 2023-11-21T11:11:49Z

This looks like the root cause?:

      java.lang.AssertionError: inconsistent endOffset 1 pos=0 posLen=1 token=/a/b expected:<2> but was:<4>

Indeed I think the issue is a problem with PathHierarchyTokenizer: it produces tokens all on top of one another (instead of in sequence at incrementing positions) yet the tokens claim different start/end offsets, and BaseTokenStreamTestCase detects that as a corrupt token graph. I think this tokenizer should be setting the PositionLengthAttribute as well, to indicate that each token reaches to a further position ... this should make BaseTokenStreamTestCase happy.

See this blog post for more details about how TokenStreams are actually graphs in Lucene.
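To illustrate the kind of check described above, here is a minimal sketch in plain Java. It is a simplified model and not the actual BaseTokenStreamTestCase code: the class `TokenGraphCheck`, the `Tok` type, and the helper methods are hypothetical names, and the rule it assumes is that every token starting at the same position must share a start offset and every token ending at the same position (position plus position length) must share an end offset.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Simplified model of an offset-consistency check. Hypothetical; not Lucene's actual code. */
class TokenGraphCheck {

  /** One emitted token: term, position increment, position length, start/end offsets. */
  static final class Tok {
    final String term;
    final int posInc, posLen, start, end;

    Tok(String term, int posInc, int posLen, int start, int end) {
      this.term = term;
      this.posInc = posInc;
      this.posLen = posLen;
      this.start = start;
      this.end = end;
    }
  }

  /** Returns null when offsets are consistent, else a message for the first conflict. */
  static String firstInconsistency(List<Tok> tokens) {
    Map<Integer, Integer> startOffsetAtPos = new HashMap<>();
    Map<Integer, Integer> endOffsetAtPos = new HashMap<>();
    int pos = -1;
    for (int i = 0; i < tokens.size(); i++) {
      Tok t = tokens.get(i);
      pos += t.posInc;              // position where this token starts
      int endPos = pos + t.posLen;  // position where this token ends
      Integer seenStart = startOffsetAtPos.putIfAbsent(pos, t.start);
      if (seenStart != null && seenStart != t.start) {
        return "inconsistent startOffset " + i + " pos=" + pos + " posLen=" + t.posLen
            + " token=" + t.term + " expected:<" + seenStart + "> but was:<" + t.start + ">";
      }
      Integer seenEnd = endOffsetAtPos.putIfAbsent(endPos, t.end);
      if (seenEnd != null && seenEnd != t.end) {
        return "inconsistent endOffset " + i + " pos=" + pos + " posLen=" + t.posLen
            + " token=" + t.term + " expected:<" + seenEnd + "> but was:<" + t.end + ">";
      }
    }
    return null;
  }

  /** Tokens for "/a/b/c" stacked on one position, each with position length 1. */
  static List<Tok> stacked() {
    return List.of(
        new Tok("/a", 1, 1, 0, 2),
        new Tok("/a/b", 0, 1, 0, 4),
        new Tok("/a/b/c", 0, 1, 0, 6));
  }

  /** Same tokens, but each one declares a longer position length (assumed variant). */
  static List<Tok> withPositionLengths() {
    return List.of(
        new Tok("/a", 1, 1, 0, 2),
        new Tok("/a/b", 0, 2, 0, 4),
        new Tok("/a/b/c", 0, 3, 0, 6));
  }

  public static void main(String[] args) {
    System.out.println(firstInconsistency(stacked()));
    System.out.println(firstInconsistency(withPositionLengths()));
  }
}
```

In this model the stacked tokens produce the same message as the assertion quoted above, because "/a" and "/a/b" both end at position 1 but claim end offsets 2 and 4. Giving each token a longer position length removes the conflict in the model; whether that alone would satisfy the real test framework is not something this sketch can confirm.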

msfroh · 2023-11-27T19:21:51Z

I was looking into this and the approach used for (Edge)NGramTokenizer back in 2013: a03e38d

The solution there is to always set the position increment and length to 1:

lucene/lucene/analysis/common/src/java/org/apache/lucene/analysis/ngram/NGramTokenizer.java

Lines 186 to 187 in 8ef6a0d

    
    posIncAtt.setPositionIncrement(1);
    posLenAtt.setPositionLength(1);

With that change, your test passes (but I had to change every other test): msfroh@0d05366

Given that it's not backward-compatible, I imagine it would have to be a change for 10.0? Also, whatever we do should probably also be applied to ReversePathHierarchyTokenizer too.
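To make the backward-compatibility concern concrete, the sketch below lists the prefix tokens for a path together with the position each one would land on, once with all tokens stacked on the first position and once with every token advancing the position by 1. It is plain Java and only models the idea: `PathPrefixSketch` and `describe` are hypothetical names, and the real tokenizer has options (skip, replacement, buffer handling) that are ignored here.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical model of path-prefix tokens and their positions; not the Lucene tokenizer. */
class PathPrefixSketch {

  /**
   * Lists each prefix of the path with its position and offsets.
   * When incrementEveryToken is false, only the first token advances the position.
   */
  static List<String> describe(String path, char delimiter, boolean incrementEveryToken) {
    List<String> out = new ArrayList<>();
    int pos = -1;
    boolean first = true;
    for (int i = 1; i <= path.length(); i++) {
      // A prefix ends right before a delimiter or at the end of the input.
      if (i == path.length() || path.charAt(i) == delimiter) {
        int posInc = (first || incrementEveryToken) ? 1 : 0;
        pos += posInc;
        out.add(path.substring(0, i) + " pos=" + pos + " start=0 end=" + i);
        first = false;
      }
    }
    return out;
  }

  public static void main(String[] args) {
    describe("/a/b/c", '/', false).forEach(System.out::println);
    describe("/a/b/c", '/', true).forEach(System.out::println);
  }
}
```

With stacking, "/a", "/a/b" and "/a/b/c" all sit at position 0; with an increment of 1 per token they sit at positions 0, 1 and 2. Anything that depends on token positions would see different values after such a change, which is presumably why every existing test expectation had to be updated.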

lukas-vlcek · 2023-12-04T15:32:41Z

I am going to close this PR.
I opened a new PR that has a fix for both ReversePathHierarchyTokenizer and PathHierarchyTokenizer: #12875

lukas-vlcek mentioned this pull request Nov 2, 2023

Deprecate CamelCase PathHierarchy tokenizer name opensearch-project/OpenSearch#10894

Merged

5 tasks

lukas-vlcek mentioned this pull request Dec 4, 2023

Fix position increment in (Reverse)PathHierarchyTokenizer #12875

Merged

lukas-vlcek closed this Dec 4, 2023

lukas-vlcek deleted the PathHierarchyAnalyzerTest branch December 7, 2023 17:05

hossman mentioned this pull request Feb 25, 2026

PathHierarchyTokenizer "ancestor search" use case broken in lucene >= 10.0 #15769

Open

lukas-vlcek wants to merge 1 commit into apache:main from lukas-vlcek:PathHierarchyAnalyzerTest
