Fix position increment in (Reverse)PathHierarchyTokenizer#12875
Fix position increment in (Reverse)PathHierarchyTokenizer#12875mikemccand merged 2 commits intoapache:mainfrom
Conversation
| posAtt.setPositionIncrement(0); | ||
| } | ||
| posIncAtt.setPositionIncrement(1); | ||
| posLenAtt.setPositionLength(1); |
There was a problem hiding this comment.
Note that 1 is already the default for PositionLengthAttribute so you could skip pulling/setting it.
| posAtt.setPositionIncrement(0); | ||
| } | ||
| posIncAtt.setPositionIncrement(1); | ||
| posLenAtt.setPositionLength(1); |
|
Thanks for tackling this @lukas-vlcek and @msfroh! I left a couple small comments, but otherwise it looks great. Given that this alters the indexed tokens (makes them non-overlapping), I think this should be a 10.0 only change? It's highly unlikely any users are relying on how Also please add a |
Ture... so it sounds like we shall add a test for this as well if we want to claim it in the migration notes! |
|
@mikemccand Do you think you can give me some hint about?
I am looking at |
Hi @lukas-vlcek -- I don't think we need to add a specific unit test to |
PathHierarchyTokenizer was emitting multiple tokens in the same position with changing offsets. To be consistent with EdgeNGramTokenizer (which is conceptually similar -- it's emitting multiple prefixes/suffixes off the input string), we can output every token with length 1 with positions incrementing by 1.
Making ReversePathHierarchyTokenizer consistent with recent changes in PathHierarchyTokenizer.
cd0b3b7 to
476d524
Compare
|
@mikemccand Thanks for review. Done! |
|
Thanks @lukas-vlcek -- looks great -- no need to squash, I'll take care during merge. Exciting improvement, finally! |
Description
Incrementing position attribute for each token in both
PathHieararchyTokenizerandReversePathHieararchyTokenizer.This change makes it possible to use both tokenizers in
BaseTokenStreamTestCase.assertAnalyzesTo()method (test cases extended to demonstrate this).This PR solves and surpasses #12750.
@msfroh I included your original commit from your Lucene fork to keep proper commit attribution.
Shall we squash all commits?