-
Notifications
You must be signed in to change notification settings - Fork 4.1k
kvserver: support non-ticking wall clocks #59239
Description
Is your feature request related to a problem? Please describe.
One of (my) major gripes about multiTestContext was that it incentivized the use of manual clocks (clocks for which the physical clock is basically fixed until adjusted by the test), which I considered harmful.
Upon further reflection, this should not be harmful. Instead, what is harmful is using the hlc clock source to impose backoffs, such as is done here:
cockroach/pkg/kv/kvserver/store_pool.go
Lines 493 to 506 in 789ea8c
| // NB: We use clock.Now().GoTime() instead of clock.PhysicalTime() is order to | |
| // take clock signals from remote nodes into consideration. | |
| now := sp.clock.Now().GoTime() | |
| timeUntilStoreDead := TimeUntilStoreDead.Get(&sp.st.SV) | |
| for _, repl := range repls { | |
| detail := sp.getStoreDetailLocked(repl.StoreID) | |
| switch detail.status(now, timeUntilStoreDead, sp.nodeLivenessFn) { | |
| case storeStatusDecommissioning: | |
| decommissioningReplicas = append(decommissioningReplicas, repl) | |
| } | |
| } | |
| return | |
| } |
cockroach/pkg/kv/kvserver/store_pool.go
Lines 751 to 767 in 789ea8c
| case throttleDeclined: | |
| timeout := DeclinedReservationsTimeout.Get(&sp.st.SV) | |
| detail.throttledUntil = sp.clock.PhysicalTime().Add(timeout) | |
| if log.V(2) { | |
| ctx := sp.AnnotateCtx(context.TODO()) | |
| log.Infof(ctx, "snapshot declined (%s), s%d will be throttled for %s until %s", | |
| why, storeID, timeout, detail.throttledUntil) | |
| } | |
| case throttleFailed: | |
| timeout := FailedReservationsTimeout.Get(&sp.st.SV) | |
| detail.throttledUntil = sp.clock.PhysicalTime().Add(timeout) | |
| if log.V(2) { | |
| ctx := sp.AnnotateCtx(context.TODO()) | |
| log.Infof(ctx, "snapshot failed (%s), s%d will be throttled for %s until %s", | |
| why, storeID, timeout, detail.throttledUntil) | |
| } | |
| } |
This kind of throttling will never resolve under an HLC powered by a manual clock.
Most code instead relies on timeutil.{Now,Since} which is basically time.Now(), and is thus unaffected by any particular test setup.
Describe the solution you'd like
Stop using the HLC for backoffs. Treat the HLC as a logical clock that only coincidentally has a resemblance of physical time. CockroachDB should run "just fine" and with the same retry characteristics if all nodes in a cluster use a HLC that moves at 1/100th the speed of a real clock.
Instead, such sources should use a dedicated physical time source and it would be understood that CockroachDB will not function properly unless that time source behaves roughly like a wall time. This source basically corresponds to timeutil.Now(), but that is a singleton which we may want to rectify and instead plumb a wall time source around the codebase; since we already plumb the HLC this may not be too hard, though onerous - timeutil.Now() alone has over 700 callers.
Describe alternatives you've considered
We have considered making it unsupported to run CockroachDB with fully manual hlc clocks, but are looking at tests that are difficult to write without it, such as TestReplicaClockUpdates.
Additional context