metrics: Add additional metrics to p2p and consensus#2425
metrics: Add additional metrics to p2p and consensus#2425xla merged 13 commits intotendermint:developfrom
Conversation
|
Two things I'd like direct feedback on:
I've also deliberately avoided implementing the configurable namespacing with this PR - will do that separately so as to ease review. |
melekes
left a comment
There was a problem hiding this comment.
Great start! I've left a few comments that needs to be addressed though
.gitignore
Outdated
| *.iml | ||
|
|
||
| libs/pubsub/query/fuzz_test/output | ||
| libs/db/test* |
There was a problem hiding this comment.
sorry about this. it should be fixed on develop as of now. please rebase and remove
config/metrics.go
Outdated
| @@ -0,0 +1,3 @@ | |||
| package config | |||
|
|
|||
| const MetricsNamespace = "tendermint" | |||
There was a problem hiding this comment.
Shouldn't this be a variable to allow modifications?
There was a problem hiding this comment.
Was thinking I'd make it configurable in a separate PR that would add support for passed-in namespaces.
consensus/metrics.go
Outdated
| // Total number of transactions. | ||
| TotalTxs metrics.Gauge | ||
| // The latest block height. | ||
| LatestBlockHeight metrics.Gauge |
There was a problem hiding this comment.
There will be a confusion with Height. Maybe we should rename it to CommittedHeight?
|
|
||
| "github.com/pkg/errors" | ||
|
|
||
| amino "github.com/tendermint/go-amino" |
There was a problem hiding this comment.
please do no remove prefix
|
|
||
| func (conR *ConsensusReactor) setCatchingUp() { | ||
| var catchingUp float64 | ||
| if conR.fastSync { |
There was a problem hiding this comment.
"Whether or not a node is synced. 0 if syncing, 1 if synced." 1 and 0 mixed up
if you're adding this condition, you need to modify metric name and description to be
"Whether or not a node is fast synced. 1 if fast syncing, 0 if synced or syncing without fast sync."
consensus/reactor.go
Outdated
| // NewConsensusReactor returns a new ConsensusReactor with the given | ||
| // consensusState. | ||
| func NewConsensusReactor(consensusState *ConsensusState, fastSync bool) *ConsensusReactor { | ||
| func NewConsensusReactor(consensusState *ConsensusState, fastSync bool, csMetrics *Metrics) *ConsensusReactor { |
There was a problem hiding this comment.
metrics should be an option. see consensus/state.go for example
// WithMetrics sets the metrics.
func WithMetrics(metrics *Metrics) CSOption {
return func(cs *ConsensusState) { cs.metrics = metrics }
}
mempool/metrics.go
Outdated
| prometheus "github.com/go-kit/kit/metrics/prometheus" | ||
| "github.com/go-kit/kit/metrics/prometheus" | ||
| stdprometheus "github.com/prometheus/client_golang/prometheus" | ||
| "github.com/tendermint/tendermint/config" |
There was a problem hiding this comment.
tmcfg to be consistent or better just cfg
p2p/peer.go
Outdated
| } | ||
|
|
||
| p.metrics.PeerPendingSendBytes.With("peer-id", string(p.ID())).Set(sendQueueSize) | ||
| case <-p.quitChan: |
There was a problem hiding this comment.
peer already has a quit channel: p.Quit()
p2p/peer.go
Outdated
| return err | ||
| } | ||
|
|
||
| go func() { |
There was a problem hiding this comment.
please extract to a separate function
p2p/peer.go
Outdated
| reactorsByCh map[byte]Reactor, | ||
| chDescs []*tmconn.ChannelDescriptor, | ||
| onPeerError func(Peer, interface{}), | ||
| metrics *Metrics, |
There was a problem hiding this comment.
again, should be an option, not a requirement
There was a problem hiding this comment.
Thanks for picking this up, definitely a good set of important datapoints for increased introspectability.
While most of the new metrics seem in the right place, the changes to the Peer are awfully intrusive and have time sensistivity and own stateful setup cost. I don't think we are well served adding more stop-gap solutions to that part of the codebase. We would be better served with the conn being an interface and have the actual implementation wrapped with an instrumenting one to keep concerns isolated. Another red flag is the peer-id label, which we have to investigate will not lead to a cardinality explosion as this can be the cause for degraded performance of the monitoring pipeline to the point of it being downright unusable if the values in that label exceed the 10s/100s.
We seem to have introduced a data race as well with this change.
CHANGELOG_PENDING.md
Outdated
|
|
||
| IMPROVEMENTS: | ||
|
|
||
| - Added additional metrics to p2p and consensus |
There was a problem hiding this comment.
Our CHANGELOG format and tense for entries looks like this: `[component] effect in present tense (#issure reference, @contributor)
So I'd propose:
[consensus] Add additional metrics
[p2p] Add addtional metrics
Alternatively expanding what metrics have been added and a reference to all issues concerning those changes.
consensus/metrics.go
Outdated
| prometheus "github.com/go-kit/kit/metrics/prometheus" | ||
| "github.com/go-kit/kit/metrics/prometheus" | ||
| stdprometheus "github.com/prometheus/client_golang/prometheus" | ||
| tmcfg "github.com/tendermint/tendermint/config" |
There was a problem hiding this comment.
Incorrect bundling of imports, we follwo this convention:
import (
// stdlib packages
// external packages
// tendermint packages
)
consensus/metrics.go
Outdated
| return &Metrics{ | ||
| Height: prometheus.NewGaugeFrom(stdprometheus.GaugeOpts{ | ||
| Subsystem: "consensus", | ||
| Namespace: tmcfg.MetricsNamespace, |
There was a problem hiding this comment.
We are trying to move away from our package having a dependency on config, while it's tempting it leaks to much and leads to hard to reason about states. A better approach here would be to have namespace be a param fro the function.
consensus/reactor.go
Outdated
| // NewConsensusReactor returns a new ConsensusReactor with the given | ||
| // consensusState. | ||
| func NewConsensusReactor(consensusState *ConsensusState, fastSync bool) *ConsensusReactor { | ||
| func NewConsensusReactor(consensusState *ConsensusState, fastSync bool, csMetrics *Metrics) *ConsensusReactor { |
There was a problem hiding this comment.
Instead of requiring it to be passed this would be better served as an option. We employ functional options in a couple of places for this use-case. The default would be the NopMetrics() and only overriden if passed in. For reference:
Lines 41 to 48 in d419fff
mempool/metrics.go
Outdated
| prometheus "github.com/go-kit/kit/metrics/prometheus" | ||
| "github.com/go-kit/kit/metrics/prometheus" | ||
| stdprometheus "github.com/prometheus/client_golang/prometheus" | ||
| "github.com/tendermint/tendermint/config" |
There was a problem hiding this comment.
See comment about import order.
mempool/metrics.go
Outdated
| func PrometheusMetrics() *Metrics { | ||
| return &Metrics{ | ||
| Size: prometheus.NewGaugeFrom(stdprometheus.GaugeOpts{ | ||
| Namespace: config.MetricsNamespace, |
There was a problem hiding this comment.
See comment about dependency on config and passing in of the namespace.
node/node.go
Outdated
| bcReactor := bc.NewBlockchainReactor(state.Copy(), blockExec, blockStore, fastSync) | ||
| bcReactor.SetLogger(logger.With("module", "blockchain")) | ||
|
|
||
| csm := cs.WithMetrics(csMetrics) |
There was a problem hiding this comment.
What is the reason to break it out into its own variable?
p2p/metrics.go
Outdated
| prometheus "github.com/go-kit/kit/metrics/prometheus" | ||
| "github.com/go-kit/kit/metrics/prometheus" | ||
| stdprometheus "github.com/prometheus/client_golang/prometheus" | ||
| "github.com/tendermint/tendermint/config" |
There was a problem hiding this comment.
See comment about import order.
p2p/metrics.go
Outdated
| return &Metrics{ | ||
| Peers: prometheus.NewGaugeFrom(stdprometheus.GaugeOpts{ | ||
| Subsystem: "p2p", | ||
| Namespace: config.MetricsNamespace, |
There was a problem hiding this comment.
See comment about depending on config and passing of namespace.
9936e82 to
9fd2344
Compare
|
@xla - the data race is related to the Re: separating the metrics code into an |
consensus/metrics.go
Outdated
| TotalTxs metrics.Gauge | ||
| // The latest block height. | ||
| CommittedHeight metrics.Gauge | ||
| // Whether or not a node is fast synced. 1 if yes, 0 if no. |
consensus/metrics.go
Outdated
| FastSyncing: prometheus.NewGaugeFrom(stdprometheus.GaugeOpts{ | ||
| Namespace: namespace, | ||
| Subsystem: MetricsSubsystem, | ||
| Name: "catching_up", |
consensus/metrics.go
Outdated
| Namespace: namespace, | ||
| Subsystem: MetricsSubsystem, | ||
| Name: "catching_up", | ||
| Help: "Whether or not a node is synced. 0 if syncing, 1 if synced.", |
There was a problem hiding this comment.
Whether or not a node is fast syncing. 1 if yes, 0 if no.
consensus/reactor.go
Outdated
| metrics *Metrics | ||
| } | ||
|
|
||
| type ROption func(*ConsensusReactor) |
consensus/state.go
Outdated
| // WithMetrics sets the metrics. | ||
| func WithMetrics(metrics *Metrics) CSOption { | ||
| // StateWithMetrics sets the metrics. | ||
| func StateWithMetrics(metrics *Metrics) CSOption { |
p2p/peer.go
Outdated
| tmconn "github.com/tendermint/tendermint/p2p/conn" | ||
| ) | ||
|
|
||
| const MetricsTickerDuration = 10 * time.Second |
p2p/peer.go
Outdated
| return err | ||
| } | ||
|
|
||
| go p.metricsCollector() |
There was a problem hiding this comment.
I'd rename to metricsReporter()
p2p/peer.go
Outdated
| return fmt.Sprintf("Peer{%v %v in}", p.mconn, p.ID()) | ||
| } | ||
|
|
||
| func PeerWithMetrics(metrics *Metrics) PeerOption { |
p2p/peer.go
Outdated
|
|
||
| func PeerWithMetrics(metrics *Metrics) PeerOption { | ||
| return func(p *peer) { | ||
| if metrics != nil { |
|
Cool. I have another PR with a bunch more metrics once this guy is merged :). |
xla
left a comment
There was a problem hiding this comment.
Thanks for the follow-up!
@xla - the data race is related to the
Status()method. It already has a TODO indicating thatatomicshould be used: https://github.com/tendermint/tendermint/blob/master/p2p/conn/connection.go#L588. Would you like me to refactor that code to use atomic?
That would be really helpful, it seems that the recent changes surfaced it more prominently.
Re: separating the metrics code into an
instrumentinginterface - I'm happy to do this as part of this PR, but I'm not sure that the overhead of adding metrics directly to the peer is very high. From the discussion that spawned this issue (cosmos/cosmos-sdk#2169), my understanding is that Prometheus metrics are very lightweight and there isn't a common pattern in the codebase yet to expose metrics. The only 'heavy' piece of this implementation is the goroutine that polls a peer's status, but I don't think running a fairly tight loop once every 10 seconds will be a huge performance hit.
You are right, this is outside of the scope of the changes you are working on and better suited for the ongoing p2p refactor, see #2067. My original comment was not in regards to the time complexity of gathering the metrics, that's not a concern. I was referring to the added complexity and interleaving of concerns in our code, instead of aiming for clear separation.
| import ( | ||
| "github.com/go-kit/kit/metrics" | ||
| "github.com/go-kit/kit/metrics/discard" | ||
|
|
There was a problem hiding this comment.
This newline is meaningful, we group by: stdlib, external packages, internal packages.
config/toml.go
Outdated
| max_open_connections = {{ .Instrumentation.MaxOpenConnections }} | ||
|
|
||
| # Instrumentation namespace | ||
| namespace = {{ .Instrumentation.Namespace }} |
There was a problem hiding this comment.
Seems to break the integration tests, check circle.
p2p/peer.go
Outdated
|
|
||
| err := p.mconn.Start() | ||
| return err | ||
| if err != nil { |
There was a problem hiding this comment.
Can be inlined: if err := p.mconn.Start(); err != nil {
p2p/peer.go
Outdated
| ourNodeInfo NodeInfo, | ||
| timeout time.Duration, | ||
| ourNodeInfo NodeInfo, | ||
| timeout time.Duration, |
p2p/peer.go
Outdated
| reactorsByCh map[byte]Reactor, | ||
| chDescs []*tmconn.ChannelDescriptor, | ||
| onPeerError func(Peer, interface{}), | ||
| config tmconn.MConnConfig, |
p2p/peer.go
Outdated
| reactorsByCh map[byte]Reactor, | ||
| chDescs []*tmconn.ChannelDescriptor, | ||
| onPeerError func(Peer, interface{}), | ||
| options ...PeerOption, |
p2p/peer.go
Outdated
| Data *cmn.CMap | ||
|
|
||
| metrics *Metrics | ||
|
|
Codecov Report
@@ Coverage Diff @@
## develop #2425 +/- ##
==========================================
- Coverage 61.56% 61.4% -0.16%
==========================================
Files 197 197
Lines 16291 16388 +97
==========================================
+ Hits 10029 10063 +34
- Misses 5430 5489 +59
- Partials 832 836 +4
|
|
@mslipper Thanks for all the ammendments. Could you resolve the CHANGELOG conflict? After that the PR shouldbe good to go. |
Partially addresses cosmos/cosmos-sdk#2169.