Today we use `System#currentTimeMillis` and friends to record the timestamp of ILM phase transitions and SLM activities. For instance, in the `wait-for-snapshot` step we wait for a snapshot with a timestamp after the corresponding ILM phase transition. This mostly works, but if the nodes in the cluster do not have synchronised (and monotonic) clocks then there's a risk that we will wrongly conclude that events happened out of order.
I think it would work better to use a logical clock for this purpose, which avoids the problems of clock skew. Since these activities all correspond to cluster state updates, and since the cluster state version is monotonically increasing and updated in a coordinated fashion, I believe it would work well to use the cluster state version as the logical clock.
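
To illustrate the difference, here is a minimal, self-contained sketch rather than actual ILM/SLM code. The `Event` class and both comparison helpers are hypothetical names invented for this example; the sketch only assumes that each recorded activity could carry the version of the cluster state update that recorded it, alongside the wall-clock timestamp.

```java
public class LogicalClockSketch {

    /** Hypothetical record of an ILM phase transition or SLM snapshot completion. */
    static final class Event {
        final long wallClockMillis;     // System.currentTimeMillis() on the recording node
        final long clusterStateVersion; // version of the cluster state update that recorded it

        Event(long wallClockMillis, long clusterStateVersion) {
            this.wallClockMillis = wallClockMillis;
            this.clusterStateVersion = clusterStateVersion;
        }
    }

    /** Ordering by wall clock: vulnerable to skew between nodes. */
    static boolean happenedAfterByWallClock(Event later, Event earlier) {
        return later.wallClockMillis > earlier.wallClockMillis;
    }

    /** Ordering by cluster state version: a logical clock that does not depend on node clocks. */
    static boolean happenedAfterByVersion(Event later, Event earlier) {
        return later.clusterStateVersion > earlier.clusterStateVersion;
    }

    public static void main(String[] args) {
        // Phase transition recorded by a node whose clock runs 5 seconds fast.
        Event phaseTransition = new Event(1_000_005_000L, 41L);
        // Snapshot completes in a later cluster state update, recorded by a node
        // with an accurate clock, 2 seconds after the transition in real time.
        Event snapshot = new Event(1_000_002_000L, 42L);

        // The wall clock claims the snapshot predates the transition.
        System.out.println(happenedAfterByWallClock(snapshot, phaseTransition)); // false
        // The cluster state version orders the two events correctly.
        System.out.println(happenedAfterByVersion(snapshot, phaseTransition));   // true
    }
}
```

In this scenario a timestamp-based `wait-for-snapshot` check would keep waiting even though a suitable snapshot already exists, whereas the version-based comparison is unaffected by the skew.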