BaseTokenStreamTestCase.assertAnalyzesTo fails when Analyzer contains… #12750
lukas-vlcek wants to merge 1 commit into apache:main
… PathHierarchy tokenizer

This PR is expected to fail. It demonstrates an issue with the BaseTokenStreamTestCase.assertAnalyzesTo() method when it is used with PathHierarchyTokenizer.

Is there any reason why PathHierarchyTokenizer should not be used in a test like this? Other tokenizers are definitely tested this way, i.e. they are wrapped in an Analyzer and then assertAnalyzesTo() is called to check the tokens. What is special about PathHierarchyTokenizer that makes it not work?

I think the problem might not be in the tokenizer but in the test method itself, or in the way I call it (maybe I need to pass more parameters or flags to get rid of the issue?). The test method is complex, especially when it gets to the checkAnalysisConsistency() part.

I am looking for any useful tips. Thank you!

Signed-off-by: Lukáš Vlček <lukas.vlcek@aiven.io>
This looks like the root cause?: … Indeed I think the issue is a problem with … See this blog post for more details about how …
I was looking into this and at the approach used for (Edge)NGramTokenizer back in 2013: a03e38d

The solution there is to always set the position increment and length to 1. With that change, your test passes (but I had to change every other test): msfroh@0d05366

Given that it's not backward-compatible, I imagine it would have to be a change for 10.0? Also, whatever we do should probably be applied to ReversePathHierarchyTokenizer too.
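To make the proposal above concrete, here is a minimal sketch in plain Java with no Lucene dependency. It only models the token attributes; it is not the tokenizer's code or the change in the linked commit. The class `PathTokenSketch`, its `Token` type, and the `alwaysIncrement` flag are hypothetical names introduced for illustration. The "stacked" mode (first token gets increment 1, later tokens get 0) is assumed to match how the tokenizer behaves today.

```java
import java.util.ArrayList;
import java.util.List;

/** Plain-Java model of path-hierarchy token emission; not Lucene code. */
public class PathTokenSketch {

  /** Hypothetical stand-in for the attributes a token carries. */
  static final class Token {
    final String term;
    final int startOffset;
    final int endOffset;
    final int posInc; // position increment relative to the previous token
    final int posLen; // number of positions the token spans

    Token(String term, int startOffset, int endOffset, int posInc, int posLen) {
      this.term = term;
      this.startOffset = startOffset;
      this.endOffset = endOffset;
      this.posInc = posInc;
      this.posLen = posLen;
    }
  }

  /**
   * Emits one token per path prefix ("/a", "/a/b", ...).
   * alwaysIncrement=false stacks every token on the first position (assumed current behavior);
   * alwaysIncrement=true gives every token increment 1, as proposed in the comment above.
   */
  static List<Token> tokenize(String path, char delimiter, boolean alwaysIncrement) {
    List<Token> tokens = new ArrayList<>();
    if (path.isEmpty()) {
      return tokens;
    }
    // Skip a leading delimiter so the first token is "/a" rather than "".
    int from = path.charAt(0) == delimiter ? 1 : 0;
    while (true) {
      int next = path.indexOf(delimiter, from);
      int end = next == -1 ? path.length() : next;
      int posInc = (tokens.isEmpty() || alwaysIncrement) ? 1 : 0;
      tokens.add(new Token(path.substring(0, end), 0, end, posInc, 1));
      if (next == -1) {
        break;
      }
      from = next + 1;
    }
    return tokens;
  }

  /** Turns position increments into absolute positions, starting at 0. */
  static int[] positions(List<Token> tokens) {
    int[] result = new int[tokens.size()];
    int pos = -1;
    for (int i = 0; i < tokens.size(); i++) {
      pos += tokens.get(i).posInc;
      result[i] = pos;
    }
    return result;
  }

  public static void main(String[] args) {
    for (boolean alwaysIncrement : new boolean[] {false, true}) {
      List<Token> tokens = tokenize("/a/b/c", '/', alwaysIncrement);
      int[] pos = positions(tokens);
      System.out.println("alwaysIncrement=" + alwaysIncrement);
      for (int i = 0; i < tokens.size(); i++) {
        Token t = tokens.get(i);
        System.out.println("  " + t.term + " position=" + pos[i]
            + " offsets=[" + t.startOffset + "," + t.endOffset + "] posLen=" + t.posLen);
      }
    }
  }
}
```

In this model, the stacked mode puts "/a", "/a/b" and "/a/b/c" all at position 0 with position length 1 but end offsets 2, 4 and 6, while the always-increment mode places them at positions 0, 1 and 2. Tokens that share a start position and length yet report different end offsets are plausibly the kind of thing a consistency check would flag, though that link is an assumption here, not something the thread confirms.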
I am going to close this PR. |