Add BreakIteratorBoundaryScanner support#23248
Conversation
This commit adds a boundary_scanner property to the search highlight
request (defaults to "simple", per current behavior) so the user can
specify "break_iterator" and its type ("sentence", "word", "line" and
"character"). Also adds "boundary_scanner_locale" to define which one
should be used when highlighting the text.
|
Since this is a community submitted pull request, a Jenkins build has not been kicked off automatically. Can an Elastic organization member please verify the contents of this patch and then kick off a build manually? |
| order(in.readOptionalWriteable(Order::readFromStream)); | ||
| highlightFilter(in.readOptionalBoolean()); | ||
| forceSource(in.readOptionalBoolean()); | ||
| boundaryScannerType(in.readOptionalWriteable(BoundaryScannerType::readFromStream)); |
There was a problem hiding this comment.
Since this PR can go to 5.x can you add a check on the version, like this:
if (in.getVersion().onOrAfter(Version.V_5_4_0_UNRELEASED)) {
boundaryScannerType(in.readOptionalWriteable(BoundaryScannerType::readFromStream));
}
This will ensure that a node in 5.4 can safely send request to a 5.3 node.
There was a problem hiding this comment.
Nice, I didn't see this pattern in the files I changed, so had no idea :). I will do that in all 4 places you commented on.
| if (in.readBoolean()) { | ||
| boundaryChars(in.readString().toCharArray()); | ||
| } | ||
| boundaryScannerLocale(in.readOptionalString()); |
| out.writeOptionalWriteable(order); | ||
| out.writeOptionalBoolean(highlightFilter); | ||
| out.writeOptionalBoolean(forceSource); | ||
| out.writeOptionalWriteable(boundaryScannerType); |
| out.writeBoolean(hasBoundaryScannerLocale); | ||
| if (hasBoundaryScannerLocale) { | ||
| out.writeString(boundaryScannerLocale.toLanguageTag()); | ||
| } |
| } | ||
|
|
||
| public static enum BoundaryScannerType implements Writeable { | ||
| SIMPLE, BREAK_ITERATOR; |
There was a problem hiding this comment.
BREAK_ITERATOR is not reflecting what this type is doing. Maybe just SENTENCE ? Or SENTENCE_BREAK ?
There was a problem hiding this comment.
This is actually exactly what it means. As I wrote in the PR comment, I wanted to get initial feedback. Now that I see you're OK with the general approach, I want to add a "break_iterator_type" property which defaults to "SENTENCE" but also supports "WORD", "CHARACTER" and "LINE".
There was a problem hiding this comment.
Got it thanks. Though we could avoid creating another option and merge the logic in a single one ? The boundary scanner type could be any of:
- simple
- sentence
- word
The fact that we use a break iterator for sentence and word is just an implementation detail ?
Although "CHARACTER" seems weird, should we really expose it ?
There was a problem hiding this comment.
I don't think we have to expose all options, was just thinking "if Lucene supports it ..". But having said that, from an API perspective this may be weird, so I'll take your proposal and add "boundary_scanner" with values "simple", "sentence" and "word". I'll do that.
| if (boundaryScannerType.toUpperCase(Locale.ROOT).equals(BREAK_ITERATOR.name())) { | ||
| return BoundaryScannerType.BREAK_ITERATOR; | ||
| } | ||
| return SIMPLE; |
There was a problem hiding this comment.
I think it's too lenient, we should fail if the type is unknown.
There was a problem hiding this comment.
I followed the same approach as in Order though I agree with you. Just thought this is how ES prefers to do things. I can simplify the method to use valueOf(boundaryScannerType.toUpperCase(Locale.ROOT))
|
@jimczi I believe I've addressed all your comments. Hope I addressed the BW ones as you intended! |
|
@jimczi I only now noticed your comment about adding something to the docs. Can you point me to where that is? |
|
Thanks @shaie ! You can check the current docs for the FVH here: https://github.com/elastic/elasticsearch/blob/master/docs/reference/search/request/highlighting.asciidoc#fast-vector-highlighter |
|
@jimczi I updated the docs and also renamed |
|
@elasticmachine please test it |
|
@jimczi, the build failed but seems weird exception, something about Git plugin .. |
|
Yes it's just a problem with our CI. I'll test locally :( |
|
K thanks @jimczi. Let me know if there are any issues, though I did run some tests locally and they passed (ones I thought are related to the change). |
|
Since this is a community submitted pull request, a Jenkins build has not been kicked off automatically. Can an Elastic organization member please verify the contents of this patch and then kick off a build manually? |
| boolean hasBoundaryScannerLocale = boundaryScannerLocale != null; | ||
| out.writeBoolean(hasBoundaryScannerLocale); | ||
| if (hasBoundaryScannerLocale) { | ||
| out.writeString(boundaryScannerLocale.toLanguageTag()); |
There was a problem hiding this comment.
out.writeOptionalString() since you use in.readOptionalString in the readFrom ?
There was a problem hiding this comment.
No, notice that since boundaryScannerLocale is a Locale, I first write a boolean, and then write the String if it's not null. If I wanted to use writeOptionalString I would need to do something like:
out.writeOptionalString(boundaryScannerLocale == null ? null : boundaryScannerLocale.toLanguageTag())
Do you think that's preferred?
There was a problem hiding this comment.
Just for reference @jimczi, I borrowed this pattern from the code just above those lines which handles boundaryChars:
boolean hasBounaryChars = boundaryChars != null;
out.writeBoolean(hasBounaryChars);
if (hasBounaryChars) {
out.writeString(String.valueOf(boundaryChars));
}
So before I change the code I wrote, does this change your opinion? not that I mind changing the code, just prefer the code to be consistent with the rest.
There was a problem hiding this comment.
I don't mind as long as we use writeString/readString and writeOptionalString/readOptionalString consistently.
So you can maybe just change the readFrom to explicitly use readBoolean.
| } | ||
| } | ||
|
|
||
| public static enum BoundaryScannerType implements Writeable { |
There was a problem hiding this comment.
this makes our code checker unhappy since enum are always static in this context.
jimczi
left a comment
There was a problem hiding this comment.
The build fails for me. I've left some comments regarding the failing codes. You can test your change locally with gradle clean check to make sure that the build is ok.
| if (randomBoolean()) { | ||
| highlightBuilder.boundaryScannerType(randomFrom(BoundaryScannerType.values())); | ||
| } else { | ||
| // also test the string setter |
There was a problem hiding this comment.
this fails the build, the toString is not correctly applied
| } | ||
| if (fieldOptions.boundaryScannerType == null) { | ||
| fieldOptions.boundaryScannerType = globalOptions.boundaryScannerType; | ||
| } |
There was a problem hiding this comment.
The merge of the boundaryScannerLocale is missing
There was a problem hiding this comment.
Sorry about that. Fixed.
|
I'm running |
|
@jimczi I pushed a commit which addresses all your comments and all tests pass. Actually, when running I can rerun, but if you want to start the tests too, wanted to let you know. |
|
Thanks @shaie ! |
This commit adds a boundary_scanner property to the search highlight request so the user can specify different boundary scanners: * `chars` (default, current behavior) * `word` Use a WordBreakIterator * `sentence` Use a SentenceBreakIterator This commit also adds "boundary_scanner_locale" to define which locale should be used when scanning the text.
|
Thanks @jimczi! |
* master: (26 commits) CLI: Fix prompting for yes/no to handle console returning null (elastic#23320) Tests: Fix reproduce line for packagingTest (elastic#23365) Build: Remove extra copies of netty license (elastic#23361) [TEST] Removes timeout based wait_for_active_shards REST test (elastic#23360) [TEST] increase timeout slightly in wait_for_active_shards test to allow for index creation cluster state update to be processed before ensuring the wait times out Handle snapshot repository's missing index.latest Adding equals/hashCode to MainResponse (elastic#23352) Always restore the ThreadContext for operations delayed due to a block (elastic#23349) Add support for named xcontent parsers to high level REST client (elastic#23328) Add unit tests for ParentToChildAggregator (elastic#23305) Fix after last merge with master and apply last comments [INGEST] Lazy load the geoip databases. disable BWC tests for the highlighters, need a new 5.x build to make it work Expose WordDelimiterGraphTokenFilter (elastic#23327) Test that buildCredentials returns correct clazz (elastic#23334) Add BreakIteratorBoundaryScanner support for FVH (elastic#23248) Prioritize listing index-N blobs over index.latest in reading snapshots (elastic#23333) Test: Fix hdfs test fixture setup on windows delete and index tests can share some part of the code Remove createSampleDocument method and use the sync'ed index method ...
This commit adds a boundary_scanner property to the search highlight
request (defaults to "simple", per current behavior) so the user can
specify "break_iterator" and its type ("sentence", "word", "line" and
"character"). Also adds "boundary_scanner_locale" to define which one
should be used when highlighting the text.