grunning: `grunning.Time()` is non-monotonic

**Describe the problem**

In https://github.com/cockroachdb/cockroach/pull/93952 we observed non-monotonicity in `grunning.Time()`. The problem is due to an accounting bug in the [patched runtime](https://github.com/cockroachdb/cockroach/blob/f27fa9d/docs/RFCS/20220602_fine_grained_cpu_attribution.md#design) we're now running with. To maintain the nanoseconds spent per-g in the `_Grunning` state, we want to intercept all transitions in and out of the `_Grunning` state. 

1. When we transition into the `_Grunning` state, we maintain a timestamp on `type g struct` to record when we were scheduled:

https://github.com/cockroachdb/cockroach/blob/f27fa9d4c56e4654939e772d857027da00feb7d5/build/teamcity/internal/release/build-and-publish-patched-go/diff.patch#L11-L16

2. When we transition out of the `Grunning` state, we accumulate the time we ran in the most recent interval:

https://github.com/cockroachdb/cockroach/blob/f27fa9d4c56e4654939e772d857027da00feb7d5/build/teamcity/internal/release/build-and-publish-patched-go/diff.patch#L18-L21

As for surfacing how much time the calling goroutine has spent running, we use the last scheduled timestamp, accumulated counter when last descheduled, and a current monotonic clock reading, to get to the following:

https://github.com/cockroachdb/cockroach/blob/f27fa9d4c56e4654939e772d857027da00feb7d5/build/teamcity/internal/release/build-and-publish-patched-go/diff.patch#L45-L51

This works, except that the patch does not actually intercept all transitions in and out of the `_Grunning` state. Specifically, we're missing the following points in the Go scheduler code:

```diff
diff --git i/src/runtime/proc.go w/src/runtime/proc.go
diff --git i/src/runtime/proc.go w/src/runtime/proc.go
index 4f1bed23d2..76871ad52e 100644
--- i/src/runtime/proc.go
+++ w/src/runtime/proc.go
@@ -929,6 +929,9 @@ func casfrom_Gscanstatus(gp *g, oldval, newval uint32) {
                dumpgstatus(gp)
                throw("casfrom_Gscanstatus: gp->status is not in scan state")
        }
+       if newval == _Grunning {
+               gp.lastsched = nanotime()
+       }
        releaseLockRank(lockRankGscan)
 }

@@ -944,6 +947,9 @@ func castogscanstatus(gp *g, oldval, newval uint32) bool {
                        r := atomic.Cas(&gp.atomicstatus, oldval, newval)
                        if r {
                                acquireLockRank(lockRankGscan)
+                               if oldval == _Grunning {
+                                       gp.runningnanos += nanotime() - gp.lastsched
+                               }
                        }
                        return r

@@ -1065,6 +1071,7 @@ func casGToPreemptScan(gp *g, old, new uint32) {
        }
        acquireLockRank(lockRankGscan)
        for !atomic.Cas(&gp.atomicstatus, _Grunning, _Gscan|_Gpreempted) {
+               gp.runningnanos += nanotime() - gp.lastsched
        }
 }
```

With it, the reproduction below no longer reproduces the regression.

**To Reproduce**

With that branch, and the following assertion of monotonicity, it reliably fails after a few seconds when using `dev test --stress  pkg/sql/opt/exec/explain -f TestCPUTimeEndToEnd`:

```diff
--- a/pkg/sql/colflow/stats.go
+++ b/pkg/sql/colflow/stats.go
@@ -23,6 +23,7 @@ import (
        "github.com/cockroachdb/cockroach/pkg/sql/execstats"
        "github.com/cockroachdb/cockroach/pkg/util/buildutil"
        "github.com/cockroachdb/cockroach/pkg/util/grunning"
+       "github.com/cockroachdb/cockroach/pkg/util/log"
        "github.com/cockroachdb/cockroach/pkg/util/mon"
        "github.com/cockroachdb/cockroach/pkg/util/syncutil"
        "github.com/cockroachdb/cockroach/pkg/util/timeutil"
@@ -123,7 +124,14 @@ func (bic *batchInfoCollector) Next() coldata.Batch {
        bic.stopwatch.Start()
        // Wrap the call to Next() with a panic catcher in order to get the correct
        // execution time (e.g. in the statement bundle).
+       startCPUTime := grunning.Time()
        err := colexecerror.CatchVectorizedRuntimeError(bic.next)
+       endCPUTime := grunning.Time()
+       if endCPUTime < startCPUTime {
+               log.Fatalf(context.TODO(), "non-monotonicity! endCPUTime (%d) < startCPUTime (%d) ",
+                       endCPUTime, startCPUTime,
+               )
+       }
        bic.stopwatch.Stop()
        if err != nil {
                colexecerror.InternalError(err)
```

**Impact**

Before updating the patch, and all the testing/{micro,macro}benchmarking it entails, lets assess the impact. Since we're missing interception points in and out of `_Grunning`, what could go wrong? Given `grunning.Time()` is powered by a `grunningnanos()`, which is `gp.runningnanos + nanotime() - gp.lastsched`, we can break this down to the following:

- `gp.runningnanos` is maintained when leaving `_Grunning`. So it might be older/smaller than it actually is. So `grunningnanos()` can be an under estimate, and thus non-monotonic.

- `gp.lastsched` is maintained when entering `_Grunning`. So it might be older/smaller than it actually is. So `grunningnanos()` can be an over-estimate. It might also lead to `gp.runningnanos` being higher than it's supposed to be (since we use `gp.runningnanos += nanotime() - gp.lastsched` to keep it up-to-date), so `grunningnanos()` is again an over-estimate.

If the analysis above holds, given our current uses of `grunning.Time()`, we could paper over the non-monotonicity (zero out differences between end/start time if the latter is smaller than the former), and document its inaccuracy.

+cc @DrewKimball, @kvoli, @sumeerbhola.

Jira issue: CRDB-23571

Epic CRDB-34249

	+ now := nanotime()
	+ if newval == _Grunning {
	+ // We're transitioning into the running state, record the timestamp for
	+ // subsequent use.
	+ gp.lastsched = now
	+ }

	if oldval == _Grunning {
	+ // We're transitioning out of running, record how long we were in the
	+ // state.
	+ gp.runningnanos += now - gp.lastsched

	+// grunningnanos returns the wall time spent by current g in the running state.
	+// A goroutine may be running on an OS thread that's descheduled by the OS
	+// scheduler, this time still counts towards the metric.
	+func grunningnanos() int64 {
	+ gp := getg()
	+ return gp.runningnanos + nanotime() - gp.lastsched
	+}

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

grunning: `grunning.Time()` is non-monotonic #95529

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

grunning: grunning.Time() is non-monotonic #95529

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions

grunning: `grunning.Time()` is non-monotonic #95529