Skip to content

grunning: grunning.Time() is non-monotonic #95529

@irfansharif

Description

@irfansharif

Describe the problem

In #93952 we observed non-monotonicity in grunning.Time(). The problem is due to an accounting bug in the patched runtime we're now running with. To maintain the nanoseconds spent per-g in the _Grunning state, we want to intercept all transitions in and out of the _Grunning state.

  1. When we transition into the _Grunning state, we maintain a timestamp on type g struct to record when we were scheduled:

+ now := nanotime()
+ if newval == _Grunning {
+ // We're transitioning into the running state, record the timestamp for
+ // subsequent use.
+ gp.lastsched = now
+ }

  1. When we transition out of the Grunning state, we accumulate the time we ran in the most recent interval:

if oldval == _Grunning {
+ // We're transitioning out of running, record how long we were in the
+ // state.
+ gp.runningnanos += now - gp.lastsched

As for surfacing how much time the calling goroutine has spent running, we use the last scheduled timestamp, accumulated counter when last descheduled, and a current monotonic clock reading, to get to the following:

+// grunningnanos returns the wall time spent by current g in the running state.
+// A goroutine may be running on an OS thread that's descheduled by the OS
+// scheduler, this time still counts towards the metric.
+func grunningnanos() int64 {
+ gp := getg()
+ return gp.runningnanos + nanotime() - gp.lastsched
+}

This works, except that the patch does not actually intercept all transitions in and out of the _Grunning state. Specifically, we're missing the following points in the Go scheduler code:

diff --git i/src/runtime/proc.go w/src/runtime/proc.go
diff --git i/src/runtime/proc.go w/src/runtime/proc.go
index 4f1bed23d2..76871ad52e 100644
--- i/src/runtime/proc.go
+++ w/src/runtime/proc.go
@@ -929,6 +929,9 @@ func casfrom_Gscanstatus(gp *g, oldval, newval uint32) {
                dumpgstatus(gp)
                throw("casfrom_Gscanstatus: gp->status is not in scan state")
        }
+       if newval == _Grunning {
+               gp.lastsched = nanotime()
+       }
        releaseLockRank(lockRankGscan)
 }

@@ -944,6 +947,9 @@ func castogscanstatus(gp *g, oldval, newval uint32) bool {
                        r := atomic.Cas(&gp.atomicstatus, oldval, newval)
                        if r {
                                acquireLockRank(lockRankGscan)
+                               if oldval == _Grunning {
+                                       gp.runningnanos += nanotime() - gp.lastsched
+                               }
                        }
                        return r

@@ -1065,6 +1071,7 @@ func casGToPreemptScan(gp *g, old, new uint32) {
        }
        acquireLockRank(lockRankGscan)
        for !atomic.Cas(&gp.atomicstatus, _Grunning, _Gscan|_Gpreempted) {
+               gp.runningnanos += nanotime() - gp.lastsched
        }
 }

With it, the reproduction below no longer reproduces the regression.

To Reproduce

With that branch, and the following assertion of monotonicity, it reliably fails after a few seconds when using dev test --stress pkg/sql/opt/exec/explain -f TestCPUTimeEndToEnd:

--- a/pkg/sql/colflow/stats.go
+++ b/pkg/sql/colflow/stats.go
@@ -23,6 +23,7 @@ import (
        "github.com/cockroachdb/cockroach/pkg/sql/execstats"
        "github.com/cockroachdb/cockroach/pkg/util/buildutil"
        "github.com/cockroachdb/cockroach/pkg/util/grunning"
+       "github.com/cockroachdb/cockroach/pkg/util/log"
        "github.com/cockroachdb/cockroach/pkg/util/mon"
        "github.com/cockroachdb/cockroach/pkg/util/syncutil"
        "github.com/cockroachdb/cockroach/pkg/util/timeutil"
@@ -123,7 +124,14 @@ func (bic *batchInfoCollector) Next() coldata.Batch {
        bic.stopwatch.Start()
        // Wrap the call to Next() with a panic catcher in order to get the correct
        // execution time (e.g. in the statement bundle).
+       startCPUTime := grunning.Time()
        err := colexecerror.CatchVectorizedRuntimeError(bic.next)
+       endCPUTime := grunning.Time()
+       if endCPUTime < startCPUTime {
+               log.Fatalf(context.TODO(), "non-monotonicity! endCPUTime (%d) < startCPUTime (%d) ",
+                       endCPUTime, startCPUTime,
+               )
+       }
        bic.stopwatch.Stop()
        if err != nil {
                colexecerror.InternalError(err)

Impact

Before updating the patch, and all the testing/{micro,macro}benchmarking it entails, lets assess the impact. Since we're missing interception points in and out of _Grunning, what could go wrong? Given grunning.Time() is powered by a grunningnanos(), which is gp.runningnanos + nanotime() - gp.lastsched, we can break this down to the following:

  • gp.runningnanos is maintained when leaving _Grunning. So it might be older/smaller than it actually is. So grunningnanos() can be an under estimate, and thus non-monotonic.

  • gp.lastsched is maintained when entering _Grunning. So it might be older/smaller than it actually is. So grunningnanos() can be an over-estimate. It might also lead to gp.runningnanos being higher than it's supposed to be (since we use gp.runningnanos += nanotime() - gp.lastsched to keep it up-to-date), so grunningnanos() is again an over-estimate.

If the analysis above holds, given our current uses of grunning.Time(), we could paper over the non-monotonicity (zero out differences between end/start time if the latter is smaller than the former), and document its inaccuracy.

+cc @DrewKimball, @kvoli, @sumeerbhola.

Jira issue: CRDB-23571

Epic CRDB-34249

Metadata

Metadata

Assignees

Labels

A-admission-controlC-bugCode not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior.T-admission-controlAdmission Control

Type

No type

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions