Reindex resiliency #42612
Open
Labels
- :Distributed/Reindex (Issues relating to reindex that are not caused by issues further down)
- Meta
- Team:Distributed (Meta label for distributed team.)
Description
We want to make reindex resilient to node restarts and failures, such that reindex can continue to run across such events.
There are two primary problems to solve:
- Data node resiliency. Reindex relies on scroll queries, which are not resilient: the scroll context lives on the data node and is lost if that node restarts or fails.
- Coordinator node resiliency. Reindex runs on the host receiving the request and cannot survive if that node dies or is restarted.
Search resiliency:
- Order the search by seq_no and handle query failures by retrying from the last seq_no seen (inclusive)
- Support reindex from remote when the source version is 6.6+
- Add support for an alternative numeric ordering attribute, particularly useful for reindex from remote against a pre-6.5 source.
- Back-off strategy on repeated failures
- Verify overhead of seq_no ordering
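The retry-from-last-seq_no idea and the back-off item above can be sketched roughly as follows. This is only an illustration, not the planned implementation: `build_query`, `scan_with_resume`, and the injected `search` callable are hypothetical names, the search body shape is an assumption, and the sketch assumes hits arrive sorted ascending by `_seq_no`. Because the resume point is inclusive, hits already seen at that seq_no are filtered out by `_id`.

```python
import time
from typing import Callable, Dict, Iterator, List, Set

Hit = Dict[str, object]


def build_query(from_seq_no: int, size: int) -> dict:
    """Assumed search body: everything at or after from_seq_no, in seq_no order."""
    return {
        "size": size,
        "query": {"range": {"_seq_no": {"gte": from_seq_no}}},
        "sort": [{"_seq_no": "asc"}],
    }


def scan_with_resume(
    search: Callable[[dict], List[Hit]],
    batch_size: int = 3,
    max_retries: int = 5,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> Iterator[Hit]:
    """Yield hits in seq_no order, resuming (inclusively) after failed searches."""
    resume_from = 0                 # last seq_no handed out; queries are inclusive of it
    seen_at_resume: Set[str] = set()  # ids already yielded at resume_from
    failures = 0
    while True:
        try:
            hits = search(build_query(resume_from, batch_size))
        except ConnectionError:
            failures += 1
            if failures > max_retries:
                raise
            # Exponential back-off on repeated failures: base, 2*base, 4*base, ...
            sleep(base_delay * 2 ** (failures - 1))
            continue
        failures = 0
        # Drop hits that the inclusive resume point returned a second time.
        fresh = [
            h for h in hits
            if not (h["_seq_no"] == resume_from and h["_id"] in seen_at_resume)
        ]
        if not fresh:
            return
        for hit in fresh:
            if hit["_seq_no"] != resume_from:
                resume_from = hit["_seq_no"]
                seen_at_resume = set()
            seen_at_resume.add(hit["_id"])
            yield hit
```

One limitation of this sketch: if `batch_size` or more documents share the resume seq_no (possible when several shards are searched), a batch can consist entirely of already-seen hits and the scan would stop early, so a real implementation would need a per-shard position or a tie-breaker.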
Coordinator node resiliency:
- POC to clarify this subject more (Make reindexing managed by a persistent task #43382)
- Decide on start reindex job action name
  - indices:data/write/start_reindex
  - indices:admin/reindex/start_reindex
  - cluster:admin/reindex/start_reindex
  - indices:data/reindex/start_reindex
- Decide on persistent reindex task name
- Evaluate how we want to do timeouts for waiting on initial task creation or reindex task completion
- Refactor common parts from data frames and roll-up
- Add reindex persistent task and remove it when done (Make reindexing managed by a persistent task #43382)
- Allocation of reindex persistent task (Make reindexing managed by a persistent task #43382)
- Store progress information periodically into .tasks index
- Resume from existing progress information when allocated to new node
- Make updates to persistent tasks resilient against master failovers
- Support async durability on destination, ensuring data in checkpoint is fsync'ed into destination
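To make the progress-storing and resume items above more concrete, here is a minimal sketch under stated assumptions. `Checkpoint`, `InMemoryTaskStore`, and `run_reindex` are hypothetical names; the in-memory store merely stands in for progress documents in the .tasks index (whose actual layout is not decided here), and the sketch assumes a single source ordered by `_seq_no` and destination writes keyed by `_id`, so re-copying a document after a restart is harmless.

```python
from dataclasses import asdict, dataclass
from typing import Dict, Iterable, Optional


@dataclass
class Checkpoint:
    task_id: str
    last_seq_no: int   # highest source seq_no copied when the checkpoint was taken
    docs_copied: int


class InMemoryTaskStore:
    """Stand-in for progress information stored in the .tasks index."""

    def __init__(self) -> None:
        self._docs: Dict[str, dict] = {}

    def save(self, cp: Checkpoint) -> None:
        self._docs[cp.task_id] = asdict(cp)

    def load(self, task_id: str) -> Optional[Checkpoint]:
        raw = self._docs.get(task_id)
        return Checkpoint(**raw) if raw is not None else None


def run_reindex(
    task_id: str,
    source: Iterable[dict],
    dest: Dict[str, dict],
    store: InMemoryTaskStore,
    checkpoint_every: int = 2,
    crash_after: Optional[int] = None,   # test hook: simulate a node failure
) -> Checkpoint:
    """Copy source docs into dest, resuming from the stored checkpoint if any."""
    cp = store.load(task_id) or Checkpoint(task_id, last_seq_no=-1, docs_copied=0)
    processed = 0
    since_checkpoint = 0
    for doc in source:
        if doc["_seq_no"] <= cp.last_seq_no:
            continue                      # already covered by the checkpoint
        if crash_after is not None and processed == crash_after:
            raise RuntimeError("simulated node failure")
        dest[doc["_id"]] = doc            # write keyed by id, safe to repeat
        processed += 1
        since_checkpoint += 1
        cp = Checkpoint(task_id, doc["_seq_no"], cp.docs_copied + 1)
        if since_checkpoint == checkpoint_every:
            # With async durability on the destination, a real implementation
            # would first ensure these writes are fsync'ed before saving.
            store.save(cp)
            since_checkpoint = 0
    store.save(cp)
    return cp
```

Calling `run_reindex` a second time with the same `task_id` and store models the persistent task being allocated to a new node: documents written after the last saved checkpoint are copied again, and everything before it is skipped.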
Slicing:
- Investigate having multiple in-flight search and bulk requests as an alternative to slicing
Benchmarking:
- Compare Rally original indexing to reindex
- Overhead of scripting and ingest pipelines
Misc:
- Handle write failures by retrying when appropriate
- Refine error handling, filter out known/retryable errors
- HLRC support for new persistent task id.
- Examine if transport client in 7.x can call resilient reindex (workaround).
- Add serialization tests for get reindex request
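As one possible reading of the write-failure items above, the sketch below splits bulk item results into succeeded, retryable, and fatal, and resends only the retryable ones. Which statuses count as retryable is an assumption (429 and 503 here), and `split_bulk_items`, `bulk_with_retry`, and the injected `send_bulk` callable are hypothetical names; the item shape loosely follows a bulk response where each item maps one action name to a result carrying `_id` and `status`.

```python
from typing import Callable, Dict, List, Tuple

# Assumption: rejections (429) and unavailable shards (503) are worth retrying.
RETRYABLE_STATUSES = frozenset({429, 503})


def split_bulk_items(items: List[dict]) -> Tuple[List[str], List[str], List[str]]:
    """Return (succeeded_ids, retryable_ids, fatal_ids) for bulk item results."""
    succeeded: List[str] = []
    retryable: List[str] = []
    fatal: List[str] = []
    for item in items:
        result = next(iter(item.values()))   # e.g. {"index": {...}} -> {...}
        status = result["status"]
        if 200 <= status < 300:
            succeeded.append(result["_id"])
        elif status in RETRYABLE_STATUSES:
            retryable.append(result["_id"])
        else:
            fatal.append(result["_id"])
    return succeeded, retryable, fatal


def bulk_with_retry(
    send_bulk: Callable[[List[dict]], List[dict]],
    docs: List[dict],
    max_attempts: int = 3,
) -> Tuple[List[str], List[str]]:
    """Send docs, resending only retryable failures.

    Returns (fatal_ids, exhausted_ids): ids that failed with a non-retryable
    status, and ids still failing with a retryable status after max_attempts.
    """
    pending = list(docs)
    fatal_ids: List[str] = []
    for _ in range(max_attempts):
        if not pending:
            break
        by_id: Dict[str, dict] = {d["_id"]: d for d in pending}
        _, retry_ids, failed = split_bulk_items(send_bulk(pending))
        fatal_ids.extend(failed)
        pending = [by_id[i] for i in retry_ids]
    return fatal_ids, [d["_id"] for d in pending]
```

A back-off between attempts is left out to keep the sketch short; in practice it would sit between resends, as with search retries.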
Docs:
- Clarify how to use resilient reindex in reference docs (conflict handling, parameters)