Currently the downsample api is coordinated from the elected master node.
The downsample operation runs on the nodes that have shards of the source index.
On each shard a process runs that scans the data and rolls it up into another index (by using the bulk api).
If the elected master node fails then there is no more coordination between the processes on the
other nodes and these operations just continue.
If the downsampling is coordinated from ILM then the downsample operation will be retried, which means there will be two downsample operations running (see #93580).
If the operation that coordinates the downsample operation on the elected master node fails, then the operations
on the other nodes that do the actual downsampling/rolling up should be halted too.
This can be achieved by integrating the downsample operation with persistent tasks. This would also allow a cancelled or stopped downsample operation to resume instead of restarting from the beginning. The persistent task can keep track of the last processed tsid as a marker. When retrying an existing downsample operation, it can then skip that tsid and all previous tsids. The tsid that was in progress may get rolled up again, but re-indexing a few rolled-up documents shouldn't be an issue.
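As a rough illustration of the resume-from-marker idea, the sketch below shows how a retried shard-level run could skip everything up to a persisted tsid. The class and field names (`DownsampleTaskState`, `lastProcessedTsid`) are hypothetical, not the actual persistent-task classes; it only assumes the shard iterates tsids in sorted order.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of persistent-task state for a downsample run.
// The real state object would be serialized with the persistent task.
class DownsampleTaskState {
    // Last _tsid that was fully rolled up; null means start from the beginning.
    String lastProcessedTsid;
}

class DownsampleResume {
    // Shards emit documents ordered by _tsid, so on retry we can skip every
    // tsid up to and including the persisted marker. The tsid that was in
    // progress may be rolled up a second time, which only re-indexes a few
    // rolled-up documents.
    static List<String> tsidsToProcess(List<String> sortedTsids, DownsampleTaskState state) {
        List<String> remaining = new ArrayList<>();
        for (String tsid : sortedTsids) {
            if (state.lastProcessedTsid == null || tsid.compareTo(state.lastProcessedTsid) > 0) {
                remaining.add(tsid);
            }
        }
        return remaining;
    }
}
```

A fresh run (marker `null`) processes all tsids; a retried run with marker `"tsid-2"` would only process tsids sorting after it.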
Additionally, the index requests generated by the downsample processes on nodes with shards of the source index should always assume that the target index exists, and should never auto-create the target index.
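A minimal sketch of that guard, assuming the coordinating node creates the target index up front: if the target index is gone, the shard-level process should fail and halt rather than let auto-creation bring it back. The `IndexMetadataView` interface and `DownsampleIndexer` class here are illustrative only, not the actual Elasticsearch APIs.

```java
// Hypothetical view of cluster metadata, standing in for a real existence check.
interface IndexMetadataView {
    boolean indexExists(String name);
}

class DownsampleIndexer {
    private final IndexMetadataView metadata;

    DownsampleIndexer(IndexMetadataView metadata) {
        this.metadata = metadata;
    }

    // A missing target index means the downsample operation was cancelled or
    // cleaned up, so this shard-level process must halt instead of implicitly
    // recreating the index via auto-creation.
    void indexRolledUpDoc(String targetIndex, String doc) {
        if (!metadata.indexExists(targetIndex)) {
            throw new IllegalStateException(
                "target index [" + targetIndex + "] does not exist; halting downsample"
            );
        }
        // ... otherwise add doc to the bulk request for targetIndex ...
    }
}
```

In the real implementation this would more likely be a flag on the generated index requests than an explicit pre-flight check, but the effect is the same: a write to a missing target index fails instead of creating it.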