Add a new merge policy that interleaves old and new segments on force merge #48533
Merged
jimczi merged 2 commits into elastic:master on Oct 29, 2019
… merge This change adds a new merge policy that interleaves the oldest and newest segments picked by MergePolicy#findForcedMerges and MergePolicy#findForcedDeletesMerges. This allows time-based indices, which usually have the oldest documents first, to be efficient at finding the most recent documents too. We wrap the merge policy for all indices even though it is mostly useful for time-based indices; there should be no overhead for other types of indices, so this is simpler than adding a setting to enable it. This change is needed to ensure that the optimizations that we are working on in # remain efficient even after running a force merge. Relates elastic#37043
Collaborator
Pinging @elastic/es-distributed (:Distributed/Engine)
jpountz
approved these changes
Oct 25, 2019
// and then interleave them to colocate oldest and most recent segments together.
private List<SegmentCommitInfo> interleaveList(List<SegmentCommitInfo> infos) throws IOException {
    List<SegmentCommitInfo> newInfos = new ArrayList<>(infos.size());
    Collections.sort(infos, Comparator.comparing(a -> a.info.name));
Contributor
I think we should avoid changing infos in place.
Contributor
Making a copy would also help ensure that the list supports random-access.
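A minimal sketch of what the two review comments above ask for: sort a copy rather than the caller's list, then interleave from both ends. This is an illustration under assumptions, not the code that was merged. The class name `InterleaveSketch`, the generic element type, and the caller-supplied comparator are hypothetical stand-ins for `SegmentCommitInfo` and the name-based ordering shown in the diff.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical helper, not part of Lucene or Elasticsearch.
final class InterleaveSketch {

    // Returns a new list that alternates between the smallest and largest
    // remaining elements (by the given order). The input list is not modified.
    static <T> List<T> interleave(List<T> infos, Comparator<? super T> order) {
        // Copy first: the caller's list stays untouched, and an ArrayList
        // guarantees cheap random access for the two-pointer walk below.
        List<T> sorted = new ArrayList<>(infos);
        sorted.sort(order);

        List<T> result = new ArrayList<>(sorted.size());
        int left = 0;
        int right = sorted.size() - 1;
        while (left <= right) {
            result.add(sorted.get(left));
            if (left != right) {
                result.add(sorted.get(right));
            }
            left++;
            right--;
        }
        return result;
    }
}
```

With five names ordered oldest to newest, `_0` to `_4`, this sketch yields `_0, _4, _1, _3, _2`, so the oldest and newest entries end up next to each other.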
// We wrap the merge policy for all indices even though it is mostly useful for time-based indices
// but there should be no overhead for other type of indices so it's simpler than adding a setting
// to enable it.
mergePolicy = new ShuflleForcedMergePolicy(mergePolicy);
Contributor
I agree with doing it all the time for simplicity, but can you add an escape hatch in case it proves problematic for some use-cases?
Contributor
Author
Sure, I added a system property in aae5c30.
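For illustration, an escape hatch of this kind could look roughly like the sketch below. The thread does not name the property, so `example.shuffle_forced_merge` is a made-up placeholder, and `EscapeHatchSketch` and `maybeWrap` are hypothetical helpers rather than Elasticsearch code. The wrapping is on by default and is skipped only when the property is set to `false`.

```java
import java.util.function.UnaryOperator;

// Hypothetical helper showing a default-on system property toggle.
final class EscapeHatchSketch {

    // Placeholder name; the real property name is not given in this thread.
    static final String PROPERTY = "example.shuffle_forced_merge";

    // Enabled unless the property is explicitly set to something other than "true".
    static boolean enabled() {
        return Boolean.parseBoolean(System.getProperty(PROPERTY, "true"));
    }

    // Applies the wrapper only when the escape hatch has not been used.
    static <P> P maybeWrap(P policy, UnaryOperator<P> wrapper) {
        return enabled() ? wrapper.apply(policy) : policy;
    }
}
```

Under these assumptions, starting the JVM with `-Dexample.shuffle_forced_merge=false` would leave the original policy unwrapped.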
jpountz
approved these changes
Oct 28, 2019
jimczi
added a commit
that referenced
this pull request
Oct 29, 2019
… merge (#48533) This change adds a new merge policy that interleaves the oldest and newest segments picked by MergePolicy#findForcedMerges and MergePolicy#findForcedDeletesMerges. This allows time-based indices, which usually have the oldest documents first, to be efficient at finding the most recent documents too. We wrap the merge policy for all indices even though it is mostly useful for time-based indices; there should be no overhead for other types of indices, so this is simpler than adding a setting to enable it. This change is needed to ensure that the optimizations that we are working on in # remain efficient even after running a force merge. Relates #37043
mayya-sharipova
added a commit
to elastic/rally-tracks
that referenced
this pull request
Nov 14, 2019
Measure the performance of sort operations after force merging to 1 segment. PR elastic/elasticsearch#48533 adds a new merge policy that interleaves old and new segments on force merge. This checks the sort performance with this policy after docs are merged to 1 segment.
mayya-sharipova
added a commit
that referenced
this pull request
Nov 26, 2019
This rewrites long sort as a `DistanceFeatureQuery`, which can efficiently skip non-competitive blocks and segments of documents. Depending on the dataset, the speedups can be 2 - 10 times. The optimization can be disabled by setting the system property `es.search.rewrite_sort` to `false`. The optimization is skipped when an index has 50% or more data with the same value.

The optimization is done through:
1. Rewriting sort as `DistanceFeatureQuery`, which can efficiently skip non-competitive blocks and segments of documents.
2. Sorting segments according to the primary numeric sort field (#44021). This allows skipping non-competitive segments.
3. Using a collector manager. When we optimize sort, we sort segments by their min/max value. As a collector expects to have segments in order, we cannot use a single collector for sorted segments. We use a collectorManager, where a dedicated collector is created for every segment.
4. Using Lucene's shared TopFieldCollector manager. This collector manager is able to exchange the minimum competitive score between collectors, which allows us to efficiently skip whole segments that don't contain competitive scores.
5. When an index is force merged to a single segment, #48533 interleaving old and new segments allows for this optimization as well, as blocks with non-competitive docs can be skipped.

Closes #37043
Co-authored-by: Jim Ferenczi <jim.ferenczi@elastic.co>
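The segment-skipping idea in items 2 to 4 of the commit message above can be sketched in plain Java as follows. This is a toy model under stated assumptions, not the Lucene or Elasticsearch implementation: each "segment" is just an array of long values, the sort is descending, and `SegmentSkipSketch` and `collectTopK` are hypothetical names. Segments are visited in order of their maximum value, and collection stops once no remaining segment can beat the current k-th best value.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

// Toy model: a segment is a long[] of sort-field values.
final class SegmentSkipSketch {

    private static long maxOf(long[] segment) {
        return Arrays.stream(segment).max().orElse(Long.MIN_VALUE);
    }

    // Fills topK with the k largest values (descending) and returns
    // how many segments actually had to be visited.
    static int collectTopK(List<long[]> segments, int k, List<Long> topK) {
        if (k <= 0) {
            return 0;
        }
        // Visit segments with the largest maximum first.
        List<long[]> ordered = new ArrayList<>(segments);
        ordered.sort(Comparator.comparingLong((long[] s) -> maxOf(s)).reversed());

        // Min-heap: its head is the current minimum competitive value.
        PriorityQueue<Long> heap = new PriorityQueue<>();
        int visited = 0;
        for (long[] segment : ordered) {
            // No value in this (or any later) segment can enter the top k.
            if (heap.size() == k && maxOf(segment) <= heap.peek()) {
                break;
            }
            visited++;
            for (long v : segment) {
                if (heap.size() < k) {
                    heap.add(v);
                } else if (v > heap.peek()) {
                    heap.poll();
                    heap.add(v);
                }
            }
        }
        topK.addAll(heap);
        topK.sort(Collections.reverseOrder());
        return visited;
    }
}
```

For the segments `{1, 2, 3}`, `{10, 11, 12}`, and `{4, 5, 6}` with `k = 2`, this sketch visits only the segment with maximum 12 and returns `[12, 11]`; the other two segments are skipped because their maxima are below the minimum competitive value.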
mayya-sharipova
added a commit
that referenced
this pull request
Nov 29, 2019
This rewrites long sort as a `DistanceFeatureQuery`, which can efficiently skip non-competitive blocks and segments of documents. Depending on the dataset, the speedups can be 2 - 10 times. The optimization can be disabled by setting the system property `es.search.rewrite_sort` to `false`. The optimization is skipped when an index has 50% or more data with the same value.

The optimization is done through:
1. Rewriting sort as `DistanceFeatureQuery`, which can efficiently skip non-competitive blocks and segments of documents.
2. Sorting segments according to the primary numeric sort field (#44021). This allows skipping non-competitive segments.
3. Using a collector manager. When we optimize sort, we sort segments by their min/max value. As a collector expects to have segments in order, we cannot use a single collector for sorted segments. We use a collectorManager, where a dedicated collector is created for every segment.
4. Using Lucene's shared TopFieldCollector manager. This collector manager is able to exchange the minimum competitive score between collectors, which allows us to efficiently skip whole segments that don't contain competitive scores.
5. When an index is force merged to a single segment, #48533 interleaving old and new segments allows for this optimization as well, as blocks with non-competitive docs can be skipped.

Backport for #48804
Co-authored-by: Jim Ferenczi <jim.ferenczi@elastic.co>
This was referenced Feb 3, 2020
This change adds a new merge policy that interleaves the oldest and newest segments picked by MergePolicy#findForcedMerges and MergePolicy#findForcedDeletesMerges. This allows time-based indices, which usually have the oldest documents first, to be efficient at finding the most recent documents too. We wrap the merge policy for all indices even though it is mostly useful for time-based indices; there should be no overhead for other types of indices, so this is simpler than adding a setting to enable it. This change is needed to ensure that the optimizations that we are working on in #37043 remain efficient even after running a force merge.
Relates #37043