proposal: runtime/pprof: add PMU-based profiles

### What version of Go are you using (`go version`)?

<pre>
$ go version
go version go1.13.4 linux/amd64
</pre>

### Does this issue reproduce with the latest release?

Yes; tested on `go version devel +74d366f484 Mon Jan 27 20:45:39 2020 +0000 linux/amd64`
 

### What operating system and processor architecture are you using (`go env`)?

<details><summary><code>go env</code> Output</summary><br><pre>
$ go env
GO111MODULE=""
GOARCH="amd64"
GOBIN=""
GOCACHE="/home/xxx/.cache/go-build"
GOENV="/home/xxx/.config/go/env"
GOEXE=""
GOFLAGS=""
GOHOSTARCH="amd64"
GOHOSTOS="linux"
GOINSECURE=""
GONOPROXY=""
GONOSUMDB=""
GOOS="linux"
GOPATH="/home/xxx/go"
GOPRIVATE=""
GOPROXY="https://proxy.golang.org,direct"
GOROOT="/home/xxx/go"
GOSUMDB="sum.golang.org"
GOTMPDIR=""
GOTOOLDIR="/home/xxx/go/pkg/tool/linux_amd64"
GCCGO="gccgo"
AR="ar"
CC="gcc"
CXX="g++"
CGO_ENABLED="1"
GOMOD=""
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
PKG_CONFIG="pkg-config"
GOGCCFLAGS="-fPIC -m64 -pthread -fmessage-length=0 -fdebug-prefix-map=/tmp/go-build696469159=/tmp/go-build -gno-record-gcc-switches"

</pre></details>

### What did you do?

The following experiments demonstrate that pprof CPU profiles lack accuracy (closeness to the ground truth) and precision (repeatability across different runs). The tests are highly predictive in nature and involve little runtime overheads (allocation, GC, system call, etc.). They are carefully designed so that we can compare pprof CPU profiles against our expectations. The evaluation shows a wide difference from expectation. One should not confuse this to be a runtime issue; the issue is with the use of OS timers used for sampling; OS timers are coarse-grained and have a high skid. In a proposal/design document that will follow, I will propose a design to extend CPU profiling by sampling CPU Performance Monitoring Unit (PMU) aka hardware performance counters. PMU event-based sampling is a mature technology available on almost all modern CPUs. PMU as a sampling agent offers many benefits (in no particular priority order):
* Enhance sampling granularity from 100s of milli-second to 100s of microseconds.
* Avoid precision and accuracy problems of OS timers.
* Enable collecting rich hardware profiles such as cache misses, branch misses, inter-socket cacheline transfers, to name a new, which help root cause performance problems more easily.

There are two test cases `goroutine.go` and `serial.go` in this issue.

## test 1 (parallel)
Download the first test case, `goroutine.go` program, from https://github.com/chabbimilind/GoPprofDemo/blob/master/goroutine.go
It has ten exactly the same goroutines: `f1-f10` and I use pprof to collect the CPU profiles. 
Run the following command several times and notice that each time, pprof reports a different amount of time spent in each of the ten routines.

`go run goroutine.go &&  go tool pprof -top goroutine_prof`


## test 2 (serial)

Download the second test case, `serial.go` go program, from https://github.com/chabbimilind/GoPprofDemo/blob/master/serial.go:
It has ten functions (`A_expect_1_82` - `J_expect_18_18`).  The function `A_expect_1_82` is expected to consume 1.82% of the total execution time and `J_expect_18_18` is expected to consume 18.18% of the execution time, and so on. The code is serial and there is complete data dependence between each function and each iteration of the loop in the functions to avoid any hardware-level optimizations. 
Run the following command several times.
`go run serial.go &&  go tool pprof -top serial_prof`







### What did you expect to see?


For `goroutine.go`, each function (`f1-10`) should be attributed with exactly (or almost exactly) 10% of the execution time on each run. 

For `serial.go` the time attribution should roughly follow the distribution shown below in each run.

```
FUNCTION NAME			EXPECTED RELATIVE TIME

A_expect_1_82			1.82%
B_expect_3_64			3.64%
C_expect_5_46			5.46%
D_expect_7_27			7.27%
E_expect_9_09			9.09%
F_expect_10_91			10.91%
G_expect_12_73			12.73%
H_expect_14_546			14.546%
I_expect_16_36			16.36%
J_expect_18_18			18.18%
```


### What did you see instead?

#### test 1 (parallel)
Run 1,  2, and 3, shown below, respectively show the `pprof -top` output of `goroutine.go`.
You will notice in a single run (say Run 1), `f1-f10` have a wide variance in the time attributed to them; the expectation is that each of them gets 10% execution time. There is up to `6x` difference in time attributed to the function with the highest amount if attribution (`main.f7`, `4210ms`, in `Run 1`) vs. the function with the lowest a amount of attribution (`main.f9`, `700ms`, in `Run 1`). This shows a poor accuracy (deviation from the ground truth) of pprof timer-based profiles.

Furthermore, the time attributed to a function widely varies from run to run. Notice how the top-10 ordering changes. In `Run 1`, `main.f7` is shown to run for `4210ms`, whereas in `Run 2` it is shown to run for only `520ms`. The expectation is that the measurements remain the same from run to run. This shows a poor precision (unpredictability of measurement) of pprof timer-based profiles.

#### goroutine.go/Run 1:
```
File: goroutine
Type: cpu
Time: Jan 27, 2020 at 3:45pm (PST)
Duration: 6.70s, Total samples = 18060ms (269.37%)
Showing nodes accounting for 18060ms, 100% of 18060ms total
      flat  flat%   sum%        cum   cum%
    4210ms 23.31% 23.31%     4210ms 23.31%  main.f7
    2610ms 14.45% 37.76%     2610ms 14.45%  main.f2
    2010ms 11.13% 48.89%     2010ms 11.13%  main.f6
    1810ms 10.02% 58.91%     1810ms 10.02%  main.f10
    1780ms  9.86% 68.77%     1780ms  9.86%  main.f3
    1410ms  7.81% 76.58%     1410ms  7.81%  main.f1
    1310ms  7.25% 83.83%     1310ms  7.25%  main.f4
    1110ms  6.15% 89.98%     1110ms  6.15%  main.f5
    1110ms  6.15% 96.12%     1110ms  6.15%  main.f8
     700ms  3.88%   100%      700ms  3.88%  main.f9
```

#### goroutine.go/Run 2:
```
File: goroutine
Type: cpu
Time: Jan 27, 2020 at 3:45pm (PST)
Duration: 6.71s, Total samples = 17400ms (259.39%)
Showing nodes accounting for 17400ms, 100% of 17400ms total
      flat  flat%   sum%        cum   cum%
    3250ms 18.68% 18.68%     3250ms 18.68%  main.f2
    2180ms 12.53% 31.21%     2180ms 12.53%  main.f9
    2100ms 12.07% 43.28%     2100ms 12.07%  main.f1
    1770ms 10.17% 53.45%     1770ms 10.17%  main.f6
    1700ms  9.77% 63.22%     1700ms  9.77%  main.f5
    1550ms  8.91% 72.13%     1550ms  8.91%  main.f4
    1500ms  8.62% 80.75%     1500ms  8.62%  main.f8
    1440ms  8.28% 89.02%     1440ms  8.28%  main.f3
    1390ms  7.99% 97.01%     1390ms  7.99%  main.f10
     520ms  2.99%   100%      520ms  2.99%  main.f7
```


#### goroutine.go/Run 3:
```
File: goroutine
Type: cpu
Time: Jan 27, 2020 at 3:48pm (PST)
Duration: 6.71s, Total samples = 17.73s (264.31%)
Showing nodes accounting for 17.73s, 100% of 17.73s total
      flat  flat%   sum%        cum   cum%
     3.74s 21.09% 21.09%      3.74s 21.09%  main.f7
     2.08s 11.73% 32.83%      2.08s 11.73%  main.f9
     2.05s 11.56% 44.39%      2.05s 11.56%  main.f2
     1.85s 10.43% 54.82%      1.85s 10.43%  main.f10
     1.78s 10.04% 64.86%      1.78s 10.04%  main.f1
     1.43s  8.07% 72.93%      1.43s  8.07%  main.f3
     1.42s  8.01% 80.94%      1.42s  8.01%  main.f8
     1.18s  6.66% 87.59%      1.18s  6.66%  main.f6
     1.17s  6.60% 94.19%      1.17s  6.60%  main.f5
     1.03s  5.81%   100%      1.03s  5.81%  main.f4
```

#### test 2 (serial)
The output for `go run serial.go &&  go tool pprof -top serial_prof` for three runs is shown below.
Comparing the flat% (or cum%) against the expected percentage for each function shows a large difference. For example, `main.H_expect_14_546`, which is expected to have 14.546% execution time, is attributed 25% of the execution time in Run 1. Furthermore, run to run, there is a lack of precision, for example, `main.I_expect_16_36` is attributed 6.25% (20ms) execution time in Run 1, whereas it is attributed 21.88% (70ms)  execution time in Run 2.


#### serial.go/Run 1:
```
File: serial
Type: cpu
Time: Jan 27, 2020 at 1:42pm (PST)
Duration: 501.51ms, Total samples = 320ms (63.81%)
Showing nodes accounting for 320ms, 100% of 320ms total
      flat  flat%   sum%        cum   cum%
      80ms 25.00% 25.00%       80ms 25.00%  main.H_expect_14_546
      80ms 25.00% 50.00%       80ms 25.00%  main.J_expect_18_18
      60ms 18.75% 68.75%       60ms 18.75%  main.G_expect_12_73
      20ms  6.25% 75.00%       20ms  6.25%  main.B_expect_3_64
      20ms  6.25% 81.25%       20ms  6.25%  main.D_expect_7_27
      20ms  6.25% 87.50%       20ms  6.25%  main.F_expect_10_91
      20ms  6.25% 93.75%       20ms  6.25%  main.I_expect_16_36
      10ms  3.12% 96.88%       10ms  3.12%  main.A_expect_1_82
      10ms  3.12%   100%       10ms  3.12%  main.C_expect_5_46
         0     0%   100%      320ms   100%  main.main
         0     0%   100%      320ms   100%  runtime.main
```
#### serial.go/Run 2:
```
File: serial
Type: cpu
Time: Jan 27, 2020 at 1:44pm (PST)
Duration: 501.31ms, Total samples = 320ms (63.83%)
Showing nodes accounting for 320ms, 100% of 320ms total
      flat  flat%   sum%        cum   cum%
      70ms 21.88% 21.88%       70ms 21.88%  main.I_expect_16_36
      50ms 15.62% 37.50%       50ms 15.62%  main.J_expect_18_18
      40ms 12.50% 50.00%       40ms 12.50%  main.E_expect_9_09
      40ms 12.50% 62.50%       40ms 12.50%  main.F_expect_10_91
      40ms 12.50% 75.00%       40ms 12.50%  main.H_expect_14_546
      30ms  9.38% 84.38%       30ms  9.38%  main.D_expect_7_27
      20ms  6.25% 90.62%       20ms  6.25%  main.B_expect_3_64
      20ms  6.25% 96.88%       20ms  6.25%  main.G_expect_12_73
      10ms  3.12%   100%       10ms  3.12%  main.C_expect_5_46
         0     0%   100%      320ms   100%  main.main
         0     0%   100%      320ms   100%  runtime.main
```
#### serial.go/Run 3:
```
File: serial
Type: cpu
Time: Jan 27, 2020 at 1:45pm (PST)
Duration: 501.39ms, Total samples = 310ms (61.83%)
Showing nodes accounting for 310ms, 100% of 310ms total
      flat  flat%   sum%        cum   cum%
     110ms 35.48% 35.48%      110ms 35.48%  main.J_expect_18_18
      70ms 22.58% 58.06%       70ms 22.58%  main.G_expect_12_73
      60ms 19.35% 77.42%       60ms 19.35%  main.F_expect_10_91
      30ms  9.68% 87.10%       30ms  9.68%  main.I_expect_16_36
      20ms  6.45% 93.55%       20ms  6.45%  main.H_expect_14_546
      10ms  3.23% 96.77%       10ms  3.23%  main.B_expect_3_64
      10ms  3.23%   100%       10ms  3.23%  main.C_expect_5_46
         0     0%   100%      310ms   100%  main.main
         0     0%   100%      310ms   100%  runtime.main
```

### Improved results with PMU-based profiling.

In a prototype PMU-based profiling implementation, below are the pprof profiles for CPU cycles hardware performance counter for the same `goroutine.go` program. Notice that each functions gets the same (or almost the same) CPU cycles attribution within a single run and across runs.

#### goroutine.go/Run 1:
```
File: goroutine
Type: cycles
Time: Jan 27, 2020 at 4:49pm (PST)
Showing nodes accounting for 234000000000, 100% of 234000000000 total
      flat  flat%   sum%        cum   cum%
23400000000 10.00% 10.00% 23400000000 10.00%  main.f1
23400000000 10.00% 20.00% 23400000000 10.00%  main.f10
23400000000 10.00% 30.00% 23400000000 10.00%  main.f2
23400000000 10.00% 40.00% 23400000000 10.00%  main.f3
23400000000 10.00% 50.00% 23400000000 10.00%  main.f4
23400000000 10.00% 60.00% 23400000000 10.00%  main.f5
23400000000 10.00% 70.00% 23400000000 10.00%  main.f6
23400000000 10.00% 80.00% 23400000000 10.00%  main.f7
23400000000 10.00% 90.00% 23400000000 10.00%  main.f8
23400000000 10.00%   100% 23400000000 10.00%  main.f9
```

#### goroutine.go/Run 2:
```
File: goroutine
Type: cycles
Time: Jan 27, 2020 at 4:51pm (PST)
Showing nodes accounting for 234000000000, 100% of 234000000000 total
      flat  flat%   sum%        cum   cum%
23800000000 10.17% 10.17% 23800000000 10.17%  main.f1
23500000000 10.04% 20.21% 23500000000 10.04%  main.f7
23500000000 10.04% 30.26% 23500000000 10.04%  main.f9
23400000000 10.00% 40.26% 23400000000 10.00%  main.f10
23400000000 10.00% 50.26% 23400000000 10.00%  main.f2
23400000000 10.00% 60.26% 23400000000 10.00%  main.f4
23400000000 10.00% 70.26% 23400000000 10.00%  main.f6
23400000000 10.00% 80.26% 23400000000 10.00%  main.f8
23300000000  9.96% 90.21% 23300000000  9.96%  main.f3
22900000000  9.79%   100% 22900000000  9.79%  main.f5
```

Below are the pprof profiles for CPU cycles hardware performance counter for the same `serial.go` program. Notice that each function gets close to expected CPU cycles attribution within a single run and across runs.

#### serial.go/Run 1:
```
File: serial
Type: cycles
Time: Jan 27, 2020 at 4:54pm (PST)
Showing nodes accounting for 1105000000, 100% of 1105000000 total
      flat  flat%   sum%        cum   cum%
 200000000 18.10% 18.10%  200000000 18.10%  main.J_expect_18_18
 183000000 16.56% 34.66%  183000000 16.56%  main.I_expect_16_36
 165000000 14.93% 49.59%  165000000 14.93%  main.H_expect_14_546
 137000000 12.40% 61.99%  137000000 12.40%  main.G_expect_12_73
 120000000 10.86% 72.85%  120000000 10.86%  main.F_expect_10_91
 100000000  9.05% 81.90%  100000000  9.05%  main.E_expect_9_09
  82000000  7.42% 89.32%   82000000  7.42%  main.D_expect_7_27
  63000000  5.70% 95.02%   63000000  5.70%  main.C_expect_5_46
  37000000  3.35% 98.37%   37000000  3.35%  main.B_expect_3_64
  18000000  1.63%   100%   18000000  1.63%  main.A_expect_1_82
         0     0%   100% 1105000000   100%  main.main
         0     0%   100% 1105000000   100%  runtime.main
```

#### serial.go/Run 2:
```
File: serial
Type: cycles
Time: Jan 27, 2020 at 4:54pm (PST)
Showing nodes accounting for 1105000000, 100% of 1105000000 total
      flat  flat%   sum%        cum   cum%
 200000000 18.10% 18.10%  200000000 18.10%  main.J_expect_18_18
 183000000 16.56% 34.66%  183000000 16.56%  main.I_expect_16_36
 159000000 14.39% 49.05%  159000000 14.39%  main.H_expect_14_546
 142000000 12.85% 61.90%  142000000 12.85%  main.G_expect_12_73
 119000000 10.77% 72.67%  119000000 10.77%  main.F_expect_10_91
 100000000  9.05% 81.72%  100000000  9.05%  main.E_expect_9_09
  82000000  7.42% 89.14%   82000000  7.42%  main.D_expect_7_27
  61000000  5.52% 94.66%   61000000  5.52%  main.C_expect_5_46
  40000000  3.62% 98.28%   40000000  3.62%  main.B_expect_3_64
  18000000  1.63% 99.91%   18000000  1.63%  main.A_expect_1_82
   1000000  0.09%   100% 1105000000   100%  main.main
         0     0%   100% 1105000000   100%  runtime.main
```

### Dependence on the number of cores and length of test execution:

The results of `goroutine.go` test depend on the number of CPU cores available. On a multi-core CPU, if you set `GOMAXPROCS=1`, `goroutine.go` will not show a huge variation, since each goroutine runs for several seconds. However, if you set  `GOMAXPROCS` to a larger value, say 4, you will notice a significant measurement attribution problem. One reason for this problem is that the itimer samples on Linux are not guaranteed to be delivered to the thread whose timer expired. 

The results of `serial.go` can change based on the time of execution of each function. By passing the `-m=<int>` to the program, you can make the program run for longer or shorter. By making it run longer, the profiles can be made more accurate, but when a function runs for less than 100ms, the accuracy is often low.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

proposal: runtime/pprof: add PMU-based profiles #36821

What version of Go are you using (`go version`)?

Does this issue reproduce with the latest release?

What operating system and processor architecture are you using (`go env`)?

What did you do?

test 1 (parallel)

test 2 (serial)

What did you expect to see?

What did you see instead?

test 1 (parallel)

goroutine.go/Run 1:

goroutine.go/Run 2:

goroutine.go/Run 3:

test 2 (serial)

serial.go/Run 1:

serial.go/Run 2:

serial.go/Run 3:

Improved results with PMU-based profiling.

goroutine.go/Run 1:

goroutine.go/Run 2:

serial.go/Run 1:

serial.go/Run 2:

Dependence on the number of cores and length of test execution:

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

proposal: runtime/pprof: add PMU-based profiles #36821

Description

What version of Go are you using (go version)?

Does this issue reproduce with the latest release?

What operating system and processor architecture are you using (go env)?

What did you do?

test 1 (parallel)

test 2 (serial)

What did you expect to see?

What did you see instead?

test 1 (parallel)

goroutine.go/Run 1:

goroutine.go/Run 2:

goroutine.go/Run 3:

test 2 (serial)

serial.go/Run 1:

serial.go/Run 2:

serial.go/Run 3:

Improved results with PMU-based profiling.

goroutine.go/Run 1:

goroutine.go/Run 2:

serial.go/Run 1:

serial.go/Run 2:

Dependence on the number of cores and length of test execution:

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions

What version of Go are you using (`go version`)?

What operating system and processor architecture are you using (`go env`)?