Skip to content

kvserver: support non-ticking wall clocks #59239

@tbg

Description

@tbg

Is your feature request related to a problem? Please describe.
One of (my) major gripes about multiTestContext was that it incentivized the use of manual clocks (clocks for which the physical clock is basically fixed until adjusted by the test), which I considered harmful.

Upon further reflection, this should not be harmful. Instead, what is harmful is using the hlc clock source to impose backoffs, such as is done here:

// NB: We use clock.Now().GoTime() instead of clock.PhysicalTime() is order to
// take clock signals from remote nodes into consideration.
now := sp.clock.Now().GoTime()
timeUntilStoreDead := TimeUntilStoreDead.Get(&sp.st.SV)
for _, repl := range repls {
detail := sp.getStoreDetailLocked(repl.StoreID)
switch detail.status(now, timeUntilStoreDead, sp.nodeLivenessFn) {
case storeStatusDecommissioning:
decommissioningReplicas = append(decommissioningReplicas, repl)
}
}
return
}

case throttleDeclined:
timeout := DeclinedReservationsTimeout.Get(&sp.st.SV)
detail.throttledUntil = sp.clock.PhysicalTime().Add(timeout)
if log.V(2) {
ctx := sp.AnnotateCtx(context.TODO())
log.Infof(ctx, "snapshot declined (%s), s%d will be throttled for %s until %s",
why, storeID, timeout, detail.throttledUntil)
}
case throttleFailed:
timeout := FailedReservationsTimeout.Get(&sp.st.SV)
detail.throttledUntil = sp.clock.PhysicalTime().Add(timeout)
if log.V(2) {
ctx := sp.AnnotateCtx(context.TODO())
log.Infof(ctx, "snapshot failed (%s), s%d will be throttled for %s until %s",
why, storeID, timeout, detail.throttledUntil)
}
}

This kind of throttling will never resolve under an HLC powered by a manual clock.

Most code instead relies on timeutil.{Now,Since} which is basically time.Now(), and is thus unaffected by any particular test setup.

Describe the solution you'd like

Stop using the HLC for backoffs. Treat the HLC as a logical clock that only coincidentally has a resemblance of physical time. CockroachDB should run "just fine" and with the same retry characteristics if all nodes in a cluster use a HLC that moves at 1/100th the speed of a real clock.

Instead, such sources should use a dedicated physical time source and it would be understood that CockroachDB will not function properly unless that time source behaves roughly like a wall time. This source basically corresponds to timeutil.Now(), but that is a singleton which we may want to rectify and instead plumb a wall time source around the codebase; since we already plumb the HLC this may not be too hard, though onerous - timeutil.Now() alone has over 700 callers.

Describe alternatives you've considered

We have considered making it unsupported to run CockroachDB with fully manual hlc clocks, but are looking at tests that are difficult to write without it, such as TestReplicaClockUpdates.

Additional context

Metadata

Metadata

Assignees

No one assigned

    Labels

    C-enhancementSolution expected to add code/behavior + preserve backward-compat (pg compat issues are exception)T-kvKV Team

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions