Ukrainian language plugin can fill up heap #71998
Merged
romseygeek merged 2 commits into elastic:master on Apr 21, 2021
Collaborator
Pinging @elastic/es-search (Team:Search)
jpountz
approved these changes
Apr 21, 2021
romseygeek
added a commit
that referenced
this pull request
Apr 21, 2021
The Lucene Ukrainian analyzer has a bug where a large in-memory dictionary is loaded and stored on a thread local for every TokenStream generated in a new thread (for more details see https://issues.apache.org/jira/browse/LUCENE-9930). Due to checks added in #50908, we create a TokenStream for every registered analyzer in every shard, which means that any node with the Ukrainian plugin installed will leak one copy of this dictionary per shard, whether or not the Ukrainian analyzer is actually being used. This commit makes the plugin use a fixed version of the UkrainianMorfologikAnalyzer until we merge a version of Lucene that contains the upstream fix.
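To make the leak pattern described above concrete, here is a minimal sketch in plain Java. It is not the Lucene or Elasticsearch code: `Dictionary`, `PerThreadAnalyzer`, and `SharedAnalyzer` are hypothetical stand-ins, and the "dictionary" is just a counter bump rather than a real heap allocation. It only illustrates why a per-thread copy scales with the number of threads while a shared copy does not.

```java
import java.util.concurrent.atomic.AtomicInteger;

public class DictionaryLeakSketch {

    /** Hypothetical stand-in for the large in-memory dictionary; counts how often it is built. */
    static final class Dictionary {
        Dictionary(AtomicInteger loadCounter) {
            // A real load would allocate a large amount of heap here.
            loadCounter.incrementAndGet();
        }
    }

    /** Leaky pattern: every thread that touches the analyzer builds and keeps its own copy. */
    static final class PerThreadAnalyzer {
        private final ThreadLocal<Dictionary> dictionary;

        PerThreadAnalyzer(AtomicInteger loadCounter) {
            this.dictionary = ThreadLocal.withInitial(() -> new Dictionary(loadCounter));
        }

        Dictionary dictionary() {
            return dictionary.get();
        }
    }

    /** Shared pattern: the dictionary is built once and reused by every thread. */
    static final class SharedAnalyzer {
        private final Dictionary dictionary;

        SharedAnalyzer(AtomicInteger loadCounter) {
            this.dictionary = new Dictionary(loadCounter);
        }

        Dictionary dictionary() {
            return dictionary;
        }
    }

    /** Runs the task once on each of {@code threads} fresh threads and waits for all of them. */
    static void runOnNewThreads(int threads, Runnable task) {
        Thread[] workers = new Thread[threads];
        for (int i = 0; i < threads; i++) {
            workers[i] = new Thread(task);
            workers[i].start();
        }
        for (Thread worker : workers) {
            try {
                worker.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        }
    }

    /** Number of dictionary loads when each thread keeps its own copy. */
    public static int perThreadLoads(int threads) {
        AtomicInteger loads = new AtomicInteger();
        PerThreadAnalyzer analyzer = new PerThreadAnalyzer(loads);
        runOnNewThreads(threads, analyzer::dictionary);
        return loads.get();
    }

    /** Number of dictionary loads when one copy is shared by all threads. */
    public static int sharedLoads(int threads) {
        AtomicInteger loads = new AtomicInteger();
        SharedAnalyzer analyzer = new SharedAnalyzer(loads);
        runOnNewThreads(threads, analyzer::dictionary);
        return loads.get();
    }

    public static void main(String[] args) {
        System.out.println("per-thread loads: " + perThreadLoads(8));
        System.out.println("shared loads: " + sharedLoads(8));
    }
}
```

In this sketch the worker threads exit, so their thread-local copies become collectable. On a node with long-lived pooled threads, those copies would stay reachable, which is what lets the heap fill up.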
Contributor
@romseygeek Is the version label correct in this PR? It's not listed in the release notes (https://www.elastic.co/guide/en/elasticsearch/reference/current/release-notes-7.13.0.html). If this didn't make it to 7.13.0, will it be in 7.13.1? Thx!
Contributor
Author
Not sure why it's not in the release notes, but it's in the 7.13 release: d6038a3
ppf2
added a commit
that referenced
this pull request
May 26, 2021
#71998 was fixed in 7.13.0 but it is missing from the release notes.
Contributor
Thx for confirming, @romseygeek! I have filed a doc PR to add it (#73440).
jrodewig
pushed a commit
that referenced
this pull request
May 26, 2021
#71998 was fixed in 7.13.0 but was missed in the release notes.
jrodewig
added a commit
that referenced
this pull request
May 26, 2021
#71998 was fixed in 7.13.0 but was missed in the release notes. Co-authored-by: Pius <pius@elastic.co>