Skip to content

BaseTokenStreamTestCase.assertAnalyzesTo fails when Analyzer contains…#12750

Closed
lukas-vlcek wants to merge 1 commit intoapache:mainfrom
lukas-vlcek:PathHierarchyAnalyzerTest
Closed

BaseTokenStreamTestCase.assertAnalyzesTo fails when Analyzer contains…#12750
lukas-vlcek wants to merge 1 commit intoapache:mainfrom
lukas-vlcek:PathHierarchyAnalyzerTest

Conversation

@lukas-vlcek
Copy link
Copy Markdown
Contributor

… PathHierarchy tokenizer

Description

This PR is expected to fail. It demonstrates issue with BaseTokenStreamTestCase.assertAnalyzesTo() method in connection to PathHierarchyTokenizer.

Is there any reason why PathHierarchyTokenizer shall not be used in the test like this? There are definitely other tokenizers that are being tested like this, ie. they are wrapped in Analyzer and then assertAnalyzesTo() method is called to check the tokens. What is special about PathHierarchy tokenizer that it does not work?

I think the problem might not be in the tokenizer but in the test method itself or in the way I call it (maybe I need to pass in more parameters/flags to get rid of the issue?). The testing method is complex, especially when it gets to checkAnalysisConsistency() part.

I am looking for any useful tips. Thank you!

… PathHierarchy tokenizer

This PR is expected to fail. It demonstrates issue with BaseTokenStreamTestCase.assertAnalyzesTo
method in connection to PathHierarchy tokenizer.

Is there any reason why PathHierarchy tokenizer shall not be used in the test like this? There
are definitely other tokenizers that are being tested like this, ie. they are wrapped in Analyzer
and then assertAnalyzesTo() method is called to check the tokens. What is special about
PathHierarchy tokenizer that it does not work?

I think the problem might not be in the tokenizer but in the test method itself or in the way
I call it (maybe I need to pass in more parameters/flags to get rid of the issue?). The testing
method is complex, especially when it gets to checkAnalysisConsistency() part.

I am looking for any useful tips. Thank you!

Signed-off-by: Lukáš Vlček <lukas.vlcek@aiven.io>
@mikemccand
Copy link
Copy Markdown
Member

This looks like the root cause?:

      java.lang.AssertionError: inconsistent endOffset 1 pos=0 posLen=1 token=/a/b expected:<2> but was:<4>

Indeed I think the issue is a problem with PathHierarchyTokenizer: it produces tokens all on top of one another (instead of in sequence at incrementing positions) yet the tokens claim different start/end offsets, and BaseTokenStreamTestCase detects that as a corrupt token graph. I think this tokenizer should be setting the PositionLengthAttribute as well, to indicate that each token reaches to a further position ... this should make BaseTokenStreamTestCase happy.

See this blog post for more details about how TokenStreams are actually graphs in Lucene.

@msfroh
Copy link
Copy Markdown
Contributor

msfroh commented Nov 27, 2023

I was looking into this and the approach used for (Edge)NGramTokenizer back in 2013: a03e38d

The solution there is to always set the position increment and length to 1:

posIncAtt.setPositionIncrement(1);
posLenAtt.setPositionLength(1);

With that change, your test passes (but I had to change every other test): msfroh@0d05366

Given that it's not backward-compatible, I imagine it would have to be a change for 10.0? Also, whatever we do should probably also be applied to ReversePathHierarchyTokenizer too.

@lukas-vlcek
Copy link
Copy Markdown
Contributor Author

lukas-vlcek commented Dec 4, 2023

I am going to close this PR.
I opened a new PR that has fix for ReversePathHierarchyTokenizer and PathHierarchyTokenizer: #12875

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants