Unable to restart uncompleted downsampling tasks in ES 8.13 and above #106880

@salvatore-campagna

Elasticsearch Version

8.13 and above

Installed Plugins

No response

Java Version

bundled

OS Version

all

Problem Description

PR #97557 introduced DownsampleShardTaskParams, a data structure used by our persistent task framework to store task-specific data. In this case it holds the data for the tasks started when a downsampling operation is carried out.

PR #98023 added a string array, dimensions, which stores the set of dimensions defined for the original index that the downsampling task operates on. This is required because with TSID hashing we can no longer recover the dimensions just by decoding the _tsid field, so we need to store them unencoded somewhere else to support resuming interrupted persistent tasks.

Adding the new dimensions string array changes the format of the wire protocol that we use when serialising and deserialising instances of objects like DownsampleShardTaskParams. This kind of change requires code that handles backward compatibility with nodes running older versions of Elasticsearch, which "speak" a different version of the wire protocol. That check is missing (this is the bug!). As a result, newer versions of Elasticsearch unconditionally try to read a boolean and then, if the boolean is true, an array of strings (dimensions), ignoring the fact that the boolean and the string array might not be there. Older versions of Elasticsearch do not serialise the boolean or the string array, since neither existed when those versions were released. This is why newer versions of Elasticsearch need to check the wire protocol version and implement backward-compatible behaviour.
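A minimal sketch of the kind of version check described above, using only JDK streams. It is not Elasticsearch code: the class name, the version constants, and the payload layout (an index name followed by the optional fields) are all assumptions made for illustration, and the actual fix would use Elasticsearch's own stream and version types.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.util.Arrays;

public class VersionGatedParams {
    // Hypothetical wire-version numbers, chosen only for this illustration.
    static final int V_BEFORE_DIMENSIONS = 1;
    static final int V_WITH_DIMENSIONS = 2;

    // Serialise the way a node speaking `wireVersion` would: older versions
    // never write the boolean flag or the dimensions array.
    static byte[] write(int wireVersion, String index, String[] dimensions) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeUTF(index);
            if (wireVersion >= V_WITH_DIMENSIONS) {
                out.writeBoolean(dimensions != null);
                if (dimensions != null) {
                    out.writeInt(dimensions.length);
                    for (String dimension : dimensions) {
                        out.writeUTF(dimension);
                    }
                }
            }
        }
        return bytes.toByteArray();
    }

    // Deserialise with the version check in place: the optional fields are
    // only read when the sender's wire version is known to include them.
    static String[] readDimensions(int wireVersion, byte[] payload) throws IOException {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(payload))) {
            in.readUTF(); // index name, present in every version
            if (wireVersion < V_WITH_DIMENSIONS) {
                return new String[0]; // old sender: nothing more to read
            }
            if (!in.readBoolean()) {
                return new String[0];
            }
            String[] dimensions = new String[in.readInt()];
            for (int i = 0; i < dimensions.length; i++) {
                dimensions[i] = in.readUTF();
            }
            return dimensions;
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] oldPayload = write(V_BEFORE_DIMENSIONS, "metrics", new String[] {"host"});
        byte[] newPayload = write(V_WITH_DIMENSIONS, "metrics", new String[] {"host", "pod"});
        System.out.println(Arrays.toString(readDimensions(V_BEFORE_DIMENSIONS, oldPayload)));
        System.out.println(Arrays.toString(readDimensions(V_WITH_DIMENSIONS, newPayload)));
    }
}
```

Returning an empty array for old payloads is only one possible fallback; what a resumed task should do when the dimensions are unavailable is a separate design decision that this sketch does not address.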

Moreover, instances of DownsampleShardTaskParams are serialised as part of the cluster state, which is written and read by the nodes in the cluster and which must remain readable by nodes running a newer version of Elasticsearch after an upgrade. This is why the upgrade process is affected.

The issue happens because a node running a version of Elasticsearch older than 8.13 (8.10.x-8.12.x) writes the cluster state with DownsampleShardTaskParams that do not include the dimensions string array. Then, as nodes start moving to the new version during an upgrade to 8.13, deserialising the cluster state fails on the node running version 8.13 because the dimensions array is missing.
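The failure mode can be illustrated with a toy reader that ignores the sender's version. This is a sketch under assumptions, not the Elasticsearch code path: the class and method names are hypothetical, and the payload is just an index name written with JDK streams.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.EOFException;
import java.io.IOException;

public class UnconditionalReadFailure {
    // Payload as an older node would write it (hypothetical layout):
    // only the index name, no boolean flag and no dimensions array.
    static byte[] oldFormatPayload(String index) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeUTF(index);
        }
        return bytes.toByteArray();
    }

    // Reader without a version check: it always expects the boolean flag.
    static boolean readUnconditionally(byte[] payload) throws IOException {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(payload))) {
            in.readUTF();            // index name
            return in.readBoolean(); // not present in the old format
        }
    }

    public static void main(String[] args) throws IOException {
        try {
            readUnconditionally(oldFormatPayload("metrics"));
            System.out.println("read succeeded");
        } catch (EOFException e) {
            System.out.println("deserialisation failed: flag missing from old-format payload");
        }
    }
}
```

In this toy stream the missing flag shows up as an end-of-stream error because nothing follows the params. In a longer stream, a reader without the check could instead consume bytes that belong to the next field and misinterpret them.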

(NOTE: a failure to deserialise the cluster state hopefully means that the node running version 8.13 will never be able to join the cluster.)

Steps to Reproduce

Ideally, this can be reproduced by having at least one downsampling task start and then upgrading to version 8.13 while the task is still running. Note also that the executor will not restart such tasks, because the failure is unrecoverable.

Logs (if relevant)

No response

Metadata

Labels

:StorageEngine/Downsampling (Downsampling (replacement for rollups) - Turn fine-grained time-based data into coarser-grained data), >bug, Team:StorageEngine
