Overview
Go offers extensive diagnostic capabilities out of the box, but they are of limited usefulness in larger, highly concurrent, real-world applications using 100+ goroutines.
For the purposes of this proposal, I am using a real-world application. This application uses gRPC and has a single RPC stream active. It also supports the FIX financial protocol via standard TCP sockets (no active connections), emits multicast packets, and has a web interface and a WebSocket interface (both with no active connections). So there are a lot of network endpoints in use. Although it is a real-world application, it is very small compared to most enterprise applications.
The following screenshot shows a profile and the OS CPU utilization - notice the very heavy system CPU usage:

A tree sample provides a bit more context:
10ms runtime.libcCall
runtime.pthread_mutex_lock
runtime.semawakeup
runtime.notewakeup
runtime.startm
runtime.wakep
runtime.ready
runtime.goready.func1
runtime.systemstack
runtime.goready
runtime.send
runtime.chansend
runtime.selectnbsend
google.golang.org/grpc/internal/transport.(*controlBuffer).executeAndPut
google.golang.org/grpc/internal/transport.(*controlBuffer).put
google.golang.org/grpc/internal/transport.(*http2Server).handleData
google.golang.org/grpc/internal/transport.(*http2Server).HandleStreams
google.golang.org/grpc.(*Server).serveStreams
google.golang.org/grpc.(*Server).handleRawConn.func1
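To make the call chain above concrete, here is a minimal sketch (not the application's code; names and timings are illustrative) of the kind of pattern it reflects: a hot path doing non-blocking channel sends to a receiver that is usually parked, so each successful send readies the receiver via goready/wakep/startm and ends in the OS-level mutex and semaphore calls seen at the top of the stack - which is why the cost shows up as system CPU inside the runtime rather than in user code.

package main

import "time"

func main() {
	ch := make(chan int, 1)

	// Receiver: drains immediately and parks on the empty channel, so every
	// successful send has to ready it again via the runtime wakeup path.
	go func() {
		for range ch {
			// stand-in for cheap per-message work
		}
	}()

	for i := 0; ; i++ {
		select {
		case ch <- i: // runtime.selectnbsend, as in the gRPC controlBuffer
		default: // drop when the receiver is busy
		}
		time.Sleep(100 * time.Microsecond)
	}
}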
Here is the OS total CPU % breakdown:

Here is the 'web' view for the profile:

Here is a GODEBUG scheduler trace with scheddetail=1 for the same process:
SCHED 87116ms: gomaxprocs=4 idleprocs=1 threads=14 spinningthreads=1 idlethreads=4 runqueue=0 gcwaiting=0 nmidlelocked=0 stopwait=0 sysmonwait=0
P0: status=0 schedtick=580546 syscalltick=798347 m=-1 runqsize=0 gfreecnt=0
P1: status=2 schedtick=570346 syscalltick=805620 m=-1 runqsize=0 gfreecnt=0
P2: status=1 schedtick=571461 syscalltick=801749 m=4 runqsize=0 gfreecnt=0
P3: status=1 schedtick=567930 syscalltick=814616 m=3 runqsize=0 gfreecnt=0
M13: p=-1 curg=-1 mallocing=0 throwing=0 preemptoff= locks=0 dying=0 helpgc=0 spinning=false blocked=true lockedg=-1
M12: p=-1 curg=50 mallocing=0 throwing=0 preemptoff= locks=0 dying=0 helpgc=0 spinning=false blocked=true lockedg=-1
M11: p=-1 curg=19 mallocing=0 throwing=0 preemptoff= locks=0 dying=0 helpgc=0 spinning=false blocked=true lockedg=-1
M10: p=-1 curg=-1 mallocing=0 throwing=0 preemptoff= locks=0 dying=0 helpgc=0 spinning=false blocked=true lockedg=-1
M9: p=1 curg=16 mallocing=0 throwing=0 preemptoff= locks=0 dying=0 helpgc=0 spinning=false blocked=false lockedg=-1
M8: p=-1 curg=-1 mallocing=0 throwing=0 preemptoff= locks=0 dying=0 helpgc=0 spinning=false blocked=true lockedg=-1
M7: p=-1 curg=-1 mallocing=0 throwing=0 preemptoff= locks=0 dying=0 helpgc=0 spinning=false blocked=true lockedg=-1
M6: p=-1 curg=-1 mallocing=0 throwing=0 preemptoff= locks=0 dying=0 helpgc=0 spinning=false blocked=true lockedg=-1
M5: p=-1 curg=34 mallocing=0 throwing=0 preemptoff= locks=0 dying=0 helpgc=0 spinning=false blocked=true lockedg=-1
M4: p=2 curg=-1 mallocing=0 throwing=0 preemptoff= locks=1 dying=0 helpgc=0 spinning=false blocked=false lockedg=-1
M3: p=3 curg=-1 mallocing=0 throwing=0 preemptoff= locks=1 dying=0 helpgc=0 spinning=true blocked=false lockedg=-1
M2: p=-1 curg=-1 mallocing=0 throwing=0 preemptoff= locks=1 dying=0 helpgc=0 spinning=false blocked=false lockedg=-1
M1: p=-1 curg=17 mallocing=0 throwing=0 preemptoff= locks=0 dying=0 helpgc=0 spinning=false blocked=false lockedg=17
M0: p=0 curg=1 mallocing=0 throwing=0 preemptoff= locks=0 dying=0 helpgc=0 spinning=false blocked=false lockedg=-1
G1: status=3(chan receive) m=0 lockedm=-1
G17: status=6() m=1 lockedm=1
G2: status=4(force gc (idle)) m=-1 lockedm=-1
G18: status=4(GC sweep wait) m=-1 lockedm=-1
G3: status=4(finalizer wait) m=-1 lockedm=-1
G5: status=4(chan receive) m=-1 lockedm=-1
G6: status=4(IO wait) m=-1 lockedm=-1
G7: status=4(select) m=-1 lockedm=-1
G8: status=4(select) m=-1 lockedm=-1
G9: status=4(select) m=-1 lockedm=-1
G10: status=4(select) m=-1 lockedm=-1
G11: status=4(select) m=-1 lockedm=-1
G12: status=4(select) m=-1 lockedm=-1
... 20 entries exactly like the above, differing only in goroutine number ...
G28: status=4(GC worker (idle)) m=-1 lockedm=-1
G71: status=4(IO wait) m=-1 lockedm=-1
G72: status=4(IO wait) m=-1 lockedm=-1
G73: status=4(sleep) m=-1 lockedm=-1
G74: status=4(timer goroutine (idle)) m=-1 lockedm=-1
G41: status=4(select) m=-1 lockedm=-1
G42: status=4(IO wait) m=-1 lockedm=-1
G75: status=4(GC worker (idle)) m=-1 lockedm=-1
G29: status=4(GC worker (idle)) m=-1 lockedm=-1
G44: status=4(GC worker (idle)) m=-1 lockedm=-1
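For reference, dumps in this format come from running the process with the scheduler trace enabled; the 1000 ms interval, binary name, and log file below are illustrative:

GODEBUG=schedtrace=1000,scheddetail=1 ./server 2>sched.log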
When using the trace tool, the flow is a bit easier to spot, as the 'goroutine analysis' view shows:

Going into the detail for the startMarketData function (exhibit A), we see:

Examining the sync block time, we see:

But examining the scheduler wait time, we see:

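For context, an execution trace like the one analyzed here can be captured with the standard runtime/trace API and then opened with 'go tool trace'; a minimal sketch (the file name and duration are arbitrary):

package main

import (
	"os"
	"runtime/trace"
	"time"
)

func main() {
	f, err := os.Create("trace.out")
	if err != nil {
		panic(err)
	}
	defer f.Close()

	if err := trace.Start(f); err != nil {
		panic(err)
	}
	defer trace.Stop()

	time.Sleep(5 * time.Second) // run the workload of interest here
}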
Dilemma
Although the profile clearly shows lots of OS activity, and the OS monitor shows heavy CPU utilization, there is no easy way to analyze what is causing the heavy system CPU %. The biggest problem is that the profile is NOT partitioned by goroutine. Using the trace tool, things are better, but there is no aggregate view. Ideally the detail view (exhibit A above) would be shown on the 'goroutine analysis' page for all goroutines - currently the only way to get this information is to inspect each routine individually. Something like this (filed as issue #29103):

But more importantly, there needs to be a summary analysis of the "why", not just a graph to review. For example, if I open the detail for the startMarketData routine above, it should show something like (sample only):
startmarketdata.func1
  network wait time 102 ms:
    5% send on socket
    70% poll
    25% read on socket
  scheduler wait 1 us:
    57% wait for GC cycle
    43% wait for CPU
  sync time 50 ms:
    67% wait on channel
      100% sent by goroutine xxxx
    23% wait on mutex filename:line#
  etc...
The user can also use the 'trace viewer', which is 100% manual - it is very powerful, but it is difficult to see the forest for the trees.
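As a partial workaround today, profiler labels (runtime/pprof) can partition CPU profile samples by a logical name, but they do not carry over to the execution trace or the scheduler output, which is part of the motivation for the proposals below. A minimal sketch (the label and function names are illustrative):

package main

import (
	"context"
	"runtime/pprof"
	"time"
)

// feedLoop stands in for a market-data goroutine; it inherits the label set
// of the goroutine that started it, so its CPU samples are grouped under
// worker=marketdata in pprof output.
func feedLoop(ctx context.Context) {
	for {
		select {
		case <-ctx.Done():
			return
		default:
			// stand-in for per-message processing
		}
	}
}

func main() {
	ctx, cancel := context.WithCancel(context.Background())
	pprof.Do(ctx, pprof.Labels("worker", "marketdata"), func(ctx context.Context) {
		go feedLoop(ctx)
	})
	time.Sleep(time.Second)
	cancel()
}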
Specific Proposals
1. Ability to add a "name" to a goroutine, something like:
go "myname" func(){...}
go stringexp func(){...}
Working with the current naming is more difficult than it needs to be - you need to be an expert in all of the codebases of all of the libraries to even have a basic understanding of what is going on. The name would allow the library creator to provide significant information about the goroutine's purpose. This name should be appended to the existing "name" in the diagnostic tools.
2. In lieu of 1, at a minimum, the goroutine name as provided now should be changed to include the source file and line number. Often a single function spawns multiple goroutines, and the idiomatic way is anonymous functions, so you end up with func1, func2, etc. The source file and line would make tracking these down much easier. This would also apply if the program did not provide a name per proposal 1.
3. The trace output should have an option for per-goroutine stack sampling, which would place a 'stack trace event' in the trace file at each sampling interval. It may be possible to correlate a separate profile capture with a trace capture based on timestamps, but having everything in a single capture file simplifies things considerably.
4. There should be trace reporting by goroutine instance. Currently the trace lumps together all goroutines of a single type, which prevents easy detection of processing bottlenecks. For example, if 4 instances of a routine together consume 15%, is the load spread evenly, or are 3 of them nearly idle while 1 consumes almost all of it?
5. Most importantly, there is no public API for reading the trace file to produce custom analysis reports, even though there is an API for adding 'user events' to the trace (see the sketch after this list). The internal/trace package should be made public. Currently, some of this information can be obtained as JSON from the 'go tool trace' web server, but this is undocumented and probably too inefficient for certain analysis needs. The current workaround involves copying and pasting the package.
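For reference, the 'user events' mentioned in proposal 5 are the runtime/trace annotation API (tasks, regions, and log messages); a minimal, illustrative sketch follows. The gap is that there is no corresponding public API to read these events back out of a trace file.

package main

import (
	"context"
	"runtime/trace"
)

func handleOrder(ctx context.Context) {
	// These annotations are only recorded while tracing is active.
	ctx, task := trace.NewTask(ctx, "handleOrder")
	defer task.End()

	trace.WithRegion(ctx, "decode", func() {
		// decoding work
	})
	trace.Log(ctx, "order", "id=42") // recorded as a user event in the trace
}

func main() {
	handleOrder(context.Background())
}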