Describe the problem
In #93952 we observed non-monotonicity in `grunning.Time()`. The problem is due to an accounting bug in the patched Go runtime we're now running with. To maintain the nanoseconds spent per-g in the `_Grunning` state, we want to intercept all transitions into and out of the `_Grunning` state.
- When we transition into the `_Grunning` state, we maintain a timestamp on `type g struct` to record when we were scheduled (build/teamcity/internal/release/build-and-publish-patched-go/diff.patch, lines 11 to 16 at f27fa9d):

```diff
+	now := nanotime()
+	if newval == _Grunning {
+		// We're transitioning into the running state, record the timestamp for
+		// subsequent use.
+		gp.lastsched = now
+	}
```
- When we transition out of the `_Grunning` state, we accumulate the time we ran in the most recent interval (diff.patch, lines 18 to 21 at f27fa9d):

```diff
 	if oldval == _Grunning {
+		// We're transitioning out of running, record how long we were in the
+		// state.
+		gp.runningnanos += now - gp.lastsched
```
As for surfacing how much time the calling goroutine has spent running, we combine the last-scheduled timestamp, the counter accumulated when last descheduled, and a current monotonic clock reading (diff.patch, lines 45 to 51 at f27fa9d):

```diff
+// grunningnanos returns the wall time spent by current g in the running state.
+// A goroutine may be running on an OS thread that's descheduled by the OS
+// scheduler; this time still counts towards the metric.
+func grunningnanos() int64 {
+	gp := getg()
+	return gp.runningnanos + nanotime() - gp.lastsched
+}
```
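The intended accounting can be modeled outside the runtime with plain values. A minimal sketch (toy struct and methods; field names mirror the patch, but this is not the runtime's `g`):

```go
package main

import "fmt"

// g is a toy model of the per-goroutine accounting fields from the patch.
type g struct {
	runningnanos int64 // accumulated nanoseconds from completed running intervals
	lastsched    int64 // timestamp of the most recent transition into running
}

// enter models a transition into _Grunning.
func (gp *g) enter(now int64) { gp.lastsched = now }

// exit models a transition out of _Grunning, folding the interval in.
func (gp *g) exit(now int64) { gp.runningnanos += now - gp.lastsched }

// read models grunningnanos() for a currently-running g: completed intervals
// plus the in-progress one.
func (gp *g) read(now int64) int64 { return gp.runningnanos + now - gp.lastsched }

func main() {
	gp := &g{}
	gp.enter(100)
	fmt.Println(gp.read(150)) // 50: running since t=100
	gp.exit(200)              // 100ns accumulated for this interval
	gp.enter(500)
	fmt.Println(gp.read(550)) // 150: 100 accumulated + 50 in-progress
}
```

As long as every enter/exit transition is intercepted, successive reads on the same goroutine are monotonic; the bug below comes from transitions that skip these hooks.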
This works, except that the patch does not actually intercept all transitions into and out of the `_Grunning` state. Specifically, we're missing the following points in the Go scheduler code:
```diff
diff --git i/src/runtime/proc.go w/src/runtime/proc.go
index 4f1bed23d2..76871ad52e 100644
--- i/src/runtime/proc.go
+++ w/src/runtime/proc.go
@@ -929,6 +929,9 @@ func casfrom_Gscanstatus(gp *g, oldval, newval uint32) {
 		dumpgstatus(gp)
 		throw("casfrom_Gscanstatus: gp->status is not in scan state")
 	}
+	if newval == _Grunning {
+		gp.lastsched = nanotime()
+	}
 	releaseLockRank(lockRankGscan)
 }
@@ -944,6 +947,9 @@ func castogscanstatus(gp *g, oldval, newval uint32) bool {
 	r := atomic.Cas(&gp.atomicstatus, oldval, newval)
 	if r {
 		acquireLockRank(lockRankGscan)
+		if oldval == _Grunning {
+			gp.runningnanos += nanotime() - gp.lastsched
+		}
 	}
 	return r
@@ -1065,6 +1071,7 @@ func casGToPreemptScan(gp *g, old, new uint32) {
 	}
 	acquireLockRank(lockRankGscan)
 	for !atomic.Cas(&gp.atomicstatus, _Grunning, _Gscan|_Gpreempted) {
+		gp.runningnanos += nanotime() - gp.lastsched
 	}
 }
```
With it, the reproduction below no longer reproduces the regression.
To Reproduce
With that branch, and the following assertion of monotonicity, it reliably fails after a few seconds when using `dev test --stress pkg/sql/opt/exec/explain -f TestCPUTimeEndToEnd`:
```diff
--- a/pkg/sql/colflow/stats.go
+++ b/pkg/sql/colflow/stats.go
@@ -23,6 +23,7 @@ import (
 	"github.com/cockroachdb/cockroach/pkg/sql/execstats"
 	"github.com/cockroachdb/cockroach/pkg/util/buildutil"
 	"github.com/cockroachdb/cockroach/pkg/util/grunning"
+	"github.com/cockroachdb/cockroach/pkg/util/log"
 	"github.com/cockroachdb/cockroach/pkg/util/mon"
 	"github.com/cockroachdb/cockroach/pkg/util/syncutil"
 	"github.com/cockroachdb/cockroach/pkg/util/timeutil"
@@ -123,7 +124,14 @@ func (bic *batchInfoCollector) Next() coldata.Batch {
 	bic.stopwatch.Start()
 	// Wrap the call to Next() with a panic catcher in order to get the correct
 	// execution time (e.g. in the statement bundle).
+	startCPUTime := grunning.Time()
 	err := colexecerror.CatchVectorizedRuntimeError(bic.next)
+	endCPUTime := grunning.Time()
+	if endCPUTime < startCPUTime {
+		log.Fatalf(context.TODO(), "non-monotonicity! endCPUTime (%d) < startCPUTime (%d)",
+			endCPUTime, startCPUTime,
+		)
+	}
 	bic.stopwatch.Stop()
 	if err != nil {
 		colexecerror.InternalError(err)
```
Impact
Before updating the patch, and all the testing/{micro,macro}benchmarking that entails, let's assess the impact. Since we're missing interception points into and out of `_Grunning`, what could go wrong? Given that `grunning.Time()` is powered by `grunningnanos()`, which is `gp.runningnanos + nanotime() - gp.lastsched`, we can break this down as follows:
- `gp.runningnanos` is maintained when leaving `_Grunning`. With a missed exit it might be older/smaller than it actually is, so `grunningnanos()` can be an underestimate, and thus non-monotonic.
- `gp.lastsched` is maintained when entering `_Grunning`. With a missed entry it might be older/smaller than it actually is, so `grunningnanos()` can be an overestimate. It might also lead to `gp.runningnanos` being higher than it's supposed to be (since we use `gp.runningnanos += nanotime() - gp.lastsched` to keep it up to date), so `grunningnanos()` is again an overestimate.
If the analysis above holds, given our current uses of `grunning.Time()`, we could paper over the non-monotonicity (zero out the difference between the end and start readings whenever the end reading is smaller than the start reading), and document the inaccuracy.
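A minimal sketch of that papering-over (the helper name `monotonicDiff` is hypothetical, not an actual `grunning` API):

```go
package main

import "fmt"

// monotonicDiff returns end-start, clamped to zero when the interval between
// two grunning.Time()-style readings comes out negative due to the
// non-monotonicity described above.
func monotonicDiff(start, end int64) int64 {
	if end < start {
		return 0
	}
	return end - start
}

func main() {
	fmt.Println(monotonicDiff(100, 250)) // 150
	fmt.Println(monotonicDiff(250, 240)) // 0: negative interval clamped
}
```

Callers would lose a small amount of CPU time in the clamped case, but readings could no longer go backwards.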
cc @DrewKimball, @kvoli, @sumeerbhola.
Jira issue: CRDB-23571
Epic CRDB-34249