Skip to content

Proposer-Based Timesamps leap second handling #7724

@williambanfield

Description

@williambanfield

Overview

Proposer-based timestamps adds a new parameter, Synchrony.Precision, which dictates how skewed all validators clocks are expected to be from each other. We need to pick a default value of Synchrony.Precision that is lenient enough such that chains with reasonable NTP connections will not stall as a result of clock skew. In general, clocks synchronized to reliable NTP servers should be expected to skew by a matter of milliseconds. However, during leap second events, this skew may increase depending on the different strategies used for handling leap seconds.

Leap Second Overview

The leap second is a mechanism for adjusting UTC to more closely align with International Atomic Time (IAT). Leap seconds take the form of an extra second added (or removed) from UTC and occur at most twice a year. During a leap second event, a program may observe different values from the system clock depending on the strategy for handling leap seconds that the operating system and NTP server it is talking to employ.

Step back the clock

During a leap second, the operating system or NTP server may step-back the system clock. A program running in an environment that steps the clock back will observe the second 23:59:59th for two consecutive seconds. This would appear to the program reading from the clock in this window as a one second step back from 23:59:59.99 to 23:59:59.00.

Smear the clock

An NTP server may elect to smear the leap second. A leap second smear takes the form of changing the rate at which wallclock times occur. In the case of adding a second to UTC, the clock is adjusted to move slightly slower than normal for a period of 12 hours before the leap second until 12 hours after the leap second.

Who is affected by this?

In general, block chains with nodes running with different leap-second strategies may be presented with an issue. Specifically, if some nodes jump and others smear there will be a 24 hour period where the expected clock skew between nodes will be much larger than normal. The skew between the nodes of different types will increase approaching the leap second up to a maximum of about 500ms and decreasing after the leap second.

This presents a problem for nodes running in production since different cloud providers handle the leap second differently.

Additionally, nodes on networks that perform the step-back strategy may experience a few instants of ~1 second clock skew as different nodes apply the leap second step-back at different instants in accordance with their local clock. This may result in a missed round while different validators apply the leap second.

AWS

AWS EC2 instances rely on the 'step-back' mechanism for handling leap seconds. AWS products like CloudFront do smear, but Amazon's on-demand VMs, as documented, do not.

GCP

Google's NTP servers use a leap smear. An experiment running a VM on GCE revealed that, by default the instance is pointed at Google's NTP.

Possible Solutions

Document a 'correct' mechanism for handling leap seconds

A possible option for overcoming this issue is to determine that only a single leap-second handling strategy should be employed per chain. This would alleviate the issue since all validators, either stepping or smearing, would experience the behavior at the same time and would remain roughly as skewed as usual.

Select a lenient Precision value

The maximum difference between nodes in a mixed leap second handling set will be 500ms, plus or minus the skew of any two clocks in the set. Therefore, picking a default Precision value above 500ms will ensure that this problem is not encountered by chains running in such a heterogeneous environment.

Selecting a lenient default Precision value is a more attractive option. 500ms is still quite small in comparison to considered MessageDelays. Any slowdown that may be achieved by a Byzantine proposer is still at most 500ms every time the proposer is selected, making this a relatively safe choice. Any chain that wishes to use a less lenient value, possibly to limit to only one type of leap second handling, may do so via a governance proposal after successfully producing blocks with the 500ms value.

Note neither of these solutions solve the case of briefly large clock skews for validators performing the step-back strategy. However, the amount of time that validators experience this level of skew is bound by Precision and is therefore likely to be quite small and at most affect a few votes in a round.

Metadata

Metadata

Labels

No labels
No labels

Type

No type

Projects

Status

Done

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions