Overview
Go offers extensive diagnostic capabilities out of the box, but they are of limited usefulness in larger, highly concurrent, real-world applications using 100+ goroutines.
For the purposes of this proposal, I am using a real-world application. This application uses gRPC and has a single RPC stream active. It also supports the FIX financial protocol via standard TCP sockets (no active connections), emits multicast packets, and has a web interface and a WebSocket interface (both with no active connections). So there are a lot of network endpoints in use. Although it is a real-world application, it is very small compared to most enterprise applications.
The following screenshot shows a profile and the OS CPU utilization - notice the very heavy system CPU usage:

A tree sample provides a bit more context:
10ms runtime.libcCall
runtime.pthread_mutex_lock
runtime.semawakeup
runtime.notewakeup
runtime.startm
runtime.wakep
runtime.ready
runtime.goready.func1
runtime.systemstack
runtime.goready
runtime.send
runtime.chansend
runtime.selectnbsend
google.golang.org/grpc/internal/transport.(*controlBuffer).executeAndPut
google.golang.org/grpc/internal/transport.(*controlBuffer).put
google.golang.org/grpc/internal/transport.(*http2Server).handleData
google.golang.org/grpc/internal/transport.(*http2Server).HandleStreams
google.golang.org/grpc.(*Server).serveStreams
google.golang.org/grpc.(*Server).handleRawConn.func1
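To make the call chain above concrete, here is a minimal sketch (not the application's code; names and timings are illustrative) of the kind of pattern it reflects: a hot path doing non-blocking channel sends to a receiver that is usually parked, so each successful send readies the receiver via goready/wakep/startm and ends in the OS-level mutex and semaphore calls seen at the top of the stack - which is why the cost shows up as system CPU inside the runtime rather than in user code.

package main

import "time"

func main() {
	ch := make(chan int, 1)

	// Receiver: drains immediately and parks on the empty channel, so every
	// successful send has to ready it again via the runtime wakeup path.
	go func() {
		for range ch {
			// stand-in for cheap per-message work
		}
	}()

	for i := 0; ; i++ {
		select {
		case ch <- i: // runtime.selectnbsend, as in the gRPC controlBuffer
		default: // drop when the receiver is busy
		}
		time.Sleep(100 * time.Microsecond)
	}
}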
Here is the OS total CPU % breakdown:

Here is the 'web' view for the profile:

Here is a GODEBUG scheduler trace with scheddetail=1 for the same process:
SCHED 87116ms: gomaxprocs=4 idleprocs=1 threads=14 spinningthreads=1 idlethreads=4 runqueue=0 gcwaiting=0 nmidlelocked=0 stopwait=0 sysmonwait=0
P0: status=0 schedtick=580546 syscalltick=798347 m=-1 runqsize=0 gfreecnt=0
P1: status=2 schedtick=570346 syscalltick=805620 m=-1 runqsize=0 gfreecnt=0
P2: status=1 schedtick=571461 syscalltick=801749 m=4 runqsize=0 gfreecnt=0
P3: status=1 schedtick=567930 syscalltick=814616 m=3 runqsize=0 gfreecnt=0
M13: p=-1 curg=-1 mallocing=0 throwing=0 preemptoff= locks=0 dying=0 helpgc=0 spinning=false blocked=true lockedg=-1
M12: p=-1 curg=50 mallocing=0 throwing=0 preemptoff= locks=0 dying=0 helpgc=0 spinning=false blocked=true lockedg=-1
M11: p=-1 curg=19 mallocing=0 throwing=0 preemptoff= locks=0 dying=0 helpgc=0 spinning=false blocked=true lockedg=-1
M10: p=-1 curg=-1 mallocing=0 throwing=0 preemptoff= locks=0 dying=0 helpgc=0 spinning=false blocked=true lockedg=-1
M9: p=1 curg=16 mallocing=0 throwing=0 preemptoff= locks=0 dying=0 helpgc=0 spinning=false blocked=false lockedg=-1
M8: p=-1 curg=-1 mallocing=0 throwing=0 preemptoff= locks=0 dying=0 helpgc=0 spinning=false blocked=true lockedg=-1
M7: p=-1 curg=-1 mallocing=0 throwing=0 preemptoff= locks=0 dying=0 helpgc=0 spinning=false blocked=true lockedg=-1
M6: p=-1 curg=-1 mallocing=0 throwing=0 preemptoff= locks=0 dying=0 helpgc=0 spinning=false blocked=true lockedg=-1
M5: p=-1 curg=34 mallocing=0 throwing=0 preemptoff= locks=0 dying=0 helpgc=0 spinning=false blocked=true lockedg=-1
M4: p=2 curg=-1 mallocing=0 throwing=0 preemptoff= locks=1 dying=0 helpgc=0 spinning=false blocked=false lockedg=-1
M3: p=3 curg=-1 mallocing=0 throwing=0 preemptoff= locks=1 dying=0 helpgc=0 spinning=true blocked=false lockedg=-1
M2: p=-1 curg=-1 mallocing=0 throwing=0 preemptoff= locks=1 dying=0 helpgc=0 spinning=false blocked=false lockedg=-1
M1: p=-1 curg=17 mallocing=0 throwing=0 preemptoff= locks=0 dying=0 helpgc=0 spinning=false blocked=false lockedg=17
M0: p=0 curg=1 mallocing=0 throwing=0 preemptoff= locks=0 dying=0 helpgc=0 spinning=false blocked=false lockedg=-1
G1: status=3(chan receive) m=0 lockedm=-1
G17: status=6() m=1 lockedm=1
G2: status=4(force gc (idle)) m=-1 lockedm=-1
G18: status=4(GC sweep wait) m=-1 lockedm=-1
G3: status=4(finalizer wait) m=-1 lockedm=-1
G5: status=4(chan receive) m=-1 lockedm=-1
G6: status=4(IO wait) m=-1 lockedm=-1
G7: status=4(select) m=-1 lockedm=-1
G8: status=4(select) m=-1 lockedm=-1
G9: status=4(select) m=-1 lockedm=-1
G10: status=4(select) m=-1 lockedm=-1
G11: status=4(select) m=-1 lockedm=-1
G12: status=4(select) m=-1 lockedm=-1
... 20 entries exactly like the above, differing only in goroutine number ...
G28: status=4(GC worker (idle)) m=-1 lockedm=-1
G71: status=4(IO wait) m=-1 lockedm=-1
G72: status=4(IO wait) m=-1 lockedm=-1
G73: status=4(sleep) m=-1 lockedm=-1
G74: status=4(timer goroutine (idle)) m=-1 lockedm=-1
G41: status=4(select) m=-1 lockedm=-1
G42: status=4(IO wait) m=-1 lockedm=-1
G75: status=4(GC worker (idle)) m=-1 lockedm=-1
G29: status=4(GC worker (idle)) m=-1 lockedm=-1
G44: status=4(GC worker (idle)) m=-1 lockedm=-1
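For reference, dumps in this format come from running the process with the scheduler trace enabled; the 1000 ms interval, binary name, and log file below are illustrative:

GODEBUG=schedtrace=1000,scheddetail=1 ./server 2>sched.log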
When using the trace tool, the flow is a bit easier to spot, as the 'goroutine analysis' view shows:

Going into the detail for the startMarketData function (exhibit A), we see:

Examining the sync block time, we see:

But examining the scheduler wait time, we see:

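For context, an execution trace like the one analyzed here can be captured with the standard runtime/trace API and then opened with 'go tool trace'; a minimal sketch (the file name and duration are arbitrary):

package main

import (
	"os"
	"runtime/trace"
	"time"
)

func main() {
	f, err := os.Create("trace.out")
	if err != nil {
		panic(err)
	}
	defer f.Close()

	if err := trace.Start(f); err != nil {
		panic(err)
	}
	defer trace.Stop()

	time.Sleep(5 * time.Second) // run the workload of interest here
}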
Dilemma
Although the profile clearly shows lots of OS activity, and the OS monitor shows heavy CPU utilization, there is no easy way to analyze what is causing the heavy system CPU %. The biggest problem is that the profile is NOT partitioned by goroutine. Using the trace tool, things are better, but there is no aggregate view. Ideally the detail view (exhibit A above) would be shown on the 'goroutine analysis' page for all goroutines - currently the only way to get this information is to inspect each routine individually. Something like this (filed as issue #29103):

But more importantly, there needs to be a summary analysis of the "why", not just a graph to review. For example, if I open the detail for the startMarketData routine above, it should show something like (sample only):
startmarketdata.func1
  network wait time 102 ms:
    5% send on socket
    70% poll
    25% read on socket
  scheduler wait 1 us:
    57% wait for GC cycle
    43% wait for CPU
  sync time 50 ms:
    67% wait on channel
      100% sent by goroutine xxxx
    23% wait on mutex filename:line#
  etc...
The user can also use the 'trace viewer', which is 100% manual - it is very powerful, but it is difficult to see the forest for the trees.
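As a partial workaround today, profiler labels (runtime/pprof) can partition CPU profile samples by a logical name, but they do not carry over to the execution trace or the scheduler output, which is part of the motivation for the proposals below. A minimal sketch (the label and function names are illustrative):

package main

import (
	"context"
	"runtime/pprof"
	"time"
)

// feedLoop stands in for a market-data goroutine; it inherits the label set
// of the goroutine that started it, so its CPU samples are grouped under
// worker=marketdata in pprof output.
func feedLoop(ctx context.Context) {
	for {
		select {
		case <-ctx.Done():
			return
		default:
			// stand-in for per-message processing
		}
	}
}

func main() {
	ctx, cancel := context.WithCancel(context.Background())
	pprof.Do(ctx, pprof.Labels("worker", "marketdata"), func(ctx context.Context) {
		go feedLoop(ctx)
	})
	time.Sleep(time.Second)
	cancel()
}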
Specific Proposals
1. Ability to add a "name" to a goroutine, something like:
go "myname" func(){...}
go stringexp func(){...}
Working with the current naming is more difficult than it needs to be - you need to be an expert in all of the codebases of all of the libraries to even have a basic understanding of what is going on. The name would allow the library creator to provide significant information about the goroutine's purpose. This name should be appended to the existing "name" in the diagnostic tools.
2. In lieu of 1, at a minimum, the goroutine name as provided now should be changed to include the source file and line number. Often a single function spawns multiple goroutines, and the idiomatic way is anonymous functions, so you end up with func1, func2, etc. The source file and line would make tracking these down much easier. This would also apply if the program did not provide a name per proposal 1.
3. The trace output should have an option for per-goroutine stack sampling, which would place a 'stack trace event' in the trace file at each sampling interval. It may be possible to correlate a separate profile capture with a trace capture based on timestamps, but having everything in a single capture file simplifies things considerably.
4. There should be trace reporting by goroutine instance. Currently the trace lumps together all goroutines of a single type, which prevents easy detection of processing bottlenecks. For example, if 4 instances of a routine together consume 15%, is the load spread evenly, or are 3 of them nearly idle while 1 consumes almost all of it?
5. Most importantly, there is no public API for reading the trace file to produce custom analysis reports, even though there is an API for adding 'user events' to the trace (see the sketch after this list). The internal/trace package should be made public. Currently, some of this information can be obtained as JSON from the 'go tool trace' web server, but this is undocumented and probably too inefficient for certain analysis needs. The current workaround involves copying and pasting the package.
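For reference, the 'user events' mentioned in proposal 5 are the runtime/trace annotation API (tasks, regions, and log messages); a minimal, illustrative sketch follows. The gap is that there is no corresponding public API to read these events back out of a trace file.

package main

import (
	"context"
	"runtime/trace"
)

func handleOrder(ctx context.Context) {
	// These annotations are only recorded while tracing is active.
	ctx, task := trace.NewTask(ctx, "handleOrder")
	defer task.End()

	trace.WithRegion(ctx, "decode", func() {
		// decoding work
	})
	trace.Log(ctx, "order", "id=42") // recorded as a user event in the trace
}

func main() {
	handleOrder(context.Background())
}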