Manager Instrumentation#2356

Merged
dperny merged 1 commit into moby:master from anshulpundir:metrics
Dec 4, 2017
Conversation

Contributor

@anshulpundir anshulpundir commented Aug 24, 2017

Signed-off-by: Anshul Pundir anshul.pundir@docker.com

Added latency metrics for raft proposal latency, snapshots, scheduling delay, and store operations.

@GordonTheTurtle

Please sign your commits following these rules:
https://github.com/moby/moby/blob/master/CONTRIBUTING.md#sign-your-work
The easiest way to do this is to amend the last commit:

$ git clone -b "metrics" git@github.com:anshulpundir/swarmkit.git somewhere
$ cd somewhere
$ git commit --amend -s --no-edit
$ git push -f

Amending updates the existing PR. You DO NOT need to open a new one.

@codecov

codecov bot commented Aug 24, 2017

Codecov Report

Merging #2356 into master will decrease coverage by 0.03%.
The diff coverage is 90.47%.

@@            Coverage Diff             @@
##           master    #2356      +/-   ##
==========================================
- Coverage   61.66%   61.62%   -0.04%     
==========================================
  Files         128      128              
  Lines       21076    21036      -40     
==========================================
- Hits        12997    12964      -33     
+ Misses       6677     6662      -15     
- Partials     1402     1410       +8

// sure it sees the state as of this moment.
<-viewStarted

doSnapshotLatencyMetric.Observe(float64(time.Since(start).Nanoseconds()))
Collaborator

This won't wait for the full snapshot process - that happens inside the goroutine.

updateLatencyTimer metrics.Timer

// view()/read txn latency timer.
viewLatencyTimner metrics.Timer
Collaborator

Typo: timner

viewLatencyTimner metrics.Timer

// lookup() latency timer.
lookupLatencyTimner metrics.Timer
Collaborator

Typo: timner


// View executes a read transaction.
func (s *MemoryStore) View(cb func(ReadTx)) {
defer metrics.StartTimer(viewLatencyTimner)()
Collaborator

I'm not sure this is a useful metric. Txn and Commit are lockless and should be nearly instantaneous. The time taken is going to be dominated by cb, which could be doing many different things depending on the context. Without knowing which callback is invoked, statistics on the duration of View won't really tell us much. It may be more useful to instrument individual call sites.

But I suppose having this metric doesn't hurt, either.

Contributor Author

Agreed.

}

func (n *Node) doSnapshot(ctx context.Context, raftConfig api.RaftConfig) {
func (n *Node) triggerSnapshot(ctx context.Context, raftConfig api.RaftConfig) {
Collaborator

👍


// Update scheduling delay metric for running tasks.
if status.State == api.TaskStateRunning {
schedulingDelayTimer.Update(time.Duration(status.AppliedAt.Nanos - task.Meta.CreatedAt.Nanos))
Collaborator

You are only looking at the Nanos field, and ignoring the seconds field. If the second changes between these two timestamps, the result won't be accurate. You should use gogotypes.DurationFromProto to convert each timestamp to a time.Time, and then do normal time math on them.


// lookup is an internal typed wrapper around memdb.
func (tx readTx) lookup(table, index, id string) api.StoreObject {
defer metrics.StartTimer(lookupLatencyTimner)()
Collaborator

I'm pretty sure the latency of a lookup is going to be less than the timer overhead. It's just an in-memory data structure lookup.

Maybe that's okay, but I feel this would add a lot of overhead, and the result wouldn't be especially interesting.

Contributor Author

Removed.

@@ -76,8 +77,33 @@ var (
// WedgeTimeout is the maximum amount of time the store lock may be
// held before declaring a suspected deadlock.
WedgeTimeout = 30 * time.Second
Collaborator

It would be interesting to add a metric for how long the store lock gets held.

func init() {
ns := metrics.NewNamespace("swarm", "dispatcher", nil)
schedulingDelayTimer = ns.NewTimer("scheduling_delay",
"Scheduling delay, which is basically (task_status_running - task_status_creation).")
Collaborator

I don't understand the last part of this.

Maybe "Scheduling delay, which is the time a task spends between the NEW and RUNNING states".

current := status.State
status.State = state
status.Message = msg
status.AppliedAt = ptypes.MustTimestampProto(time.Now())
Collaborator

This isn't how AppliedAt was intended to be used.

	// AppliedAt gives a timestamp of when this status update was applied to
	// the Task object.

The status update gets applied to the Task object in the dispatcher, but this code is in the agent.

I think you should use time.Now() in the dispatcher instead of relying on a timestamp from the TaskStatus. You can't compare a timestamp that comes from a worker with one that was originally recorded on the manager, because their clocks may be out of sync. Even if there's only a few seconds difference, it would wildly throw off the results.

Contributor Author

This isn't how AppliedAt was intended to be used. The status update gets applied to the Task object in the dispatcher, but this code is in the agent.

You mean it gets applied/persisted to the task object in the store ?

The actual status change happens in different components as the task passes through them to the worker.
I agree with what you're saying, but that will not give us accurate results either since the actual time of status change is when it actually happens.

Please correct me if I'm wrong.

Collaborator

You mean it gets applied/persisted to the task object in the store ?

Yes, that's what it's supposed to timestamp.

I agree with what you're saying, but that will not give us accurate results either since the actual time of status change is when it actually happens.

Please correct me if I'm wrong.

You're correct, neither approach will be completely accurate. However, I think counting the extra network latency is less catastrophic than assuming perfectly sync'd clocks.

Contributor Author

I think counting the extra network latency is less catastrophic than assuming perfectly sync'd clocks.

I see the point. Assuming perfectly synced clocks may give ambiguous or wrong results, but including network latency will include the extra network latency but the results will not be wrong.

Contributor Author

@anshulpundir anshulpundir left a comment

Once again, appreciate the review! @aaronlehmann


"google.golang.org/grpc/transport"

"github.com/docker/go-events"
metrics "github.com/docker/go-metrics"
Contributor

@nishanttotla nishanttotla Oct 5, 2017

nit: providing a separate name isn't required. go-metrics can be used as metrics in this file. Same comment for other instances of this.

Contributor Author

Totally, but I saw it done in another place in swarmkit, and thought 'metrics.something' suited the style better than 'go-metrics.something'.

Collaborator

It will still be metrics.something.

Contributor Author

Sorry, yeah, I see what you mean. What's the way to figure out the package name without looking at the code? Also, any reason why it's not the same as the directory name? @nishanttotla

Collaborator

I don't think there is a way without looking at the code, but by convention, packages named go-foo typically have the package name set to foo. I don't think hyphens are allowed in package names.

@anshulpundir anshulpundir force-pushed the metrics branch 2 times, most recently from 67abc5d to 3b0c109 on October 11, 2017 00:03
@stevvooe
Contributor

@anshulpundir When submitting these PRs, it would help if you could include the output of the /metrics endpoint with the new values so that we can evaluate whether or not the reporting looks correct.


task.Status = *status
task.Status.AppliedBy = d.securityConfig.ClientTLSCreds.NodeID()
task.Status.AppliedAt = ptypes.MustTimestampProto(time.Now())
Contributor

What is "applied at" and when was it added?

Contributor Author

AppliedAt gives a timestamp of when this status update was applied to the Task object.


func init() {
ns := metrics.NewNamespace("swarm", "store", nil)
updateLatencyTimer = ns.NewTimer("write_txn_latency",
Contributor

I'm not sure how I feel about the abbreviation here.

// from the worker node, which may cause unknown incorrect results due to clock skew.
if status.State == api.TaskStateRunning {
start := time.Unix(status.AppliedAt.GetSeconds(), int64(status.AppliedAt.GetNanos()))
schedulingDelayTimer.Update(time.Since(start))
Contributor

Use UpdateSince.

WedgeTimeout = 30 * time.Second

// update()/write txn latency timer.
updateLatencyTimer metrics.Timer
Contributor

The variance on these values might be too high to be well covered by the HistogramVec default buckets. The defaults are in https://github.com/prometheus/client_golang/blob/master/prometheus/histogram.go#L59. The max bucket is 10 seconds, which should cover the large majority of use cases. If we need granularity over 10 seconds, this might be a problem.
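To make the concern concrete, here is a small sketch using the default bucket bounds referenced above: any observation beyond the 10-second bound lands in the implicit +Inf bucket, so all durations above 10s become indistinguishable.

```go
package main

import "fmt"

// defBuckets lists the prometheus/client_golang default histogram bounds
// quoted in the review; +Inf is implicit.
var defBuckets = []float64{.005, .01, .025, .05, .1, .25, .5, 1, 2.5, 5, 10}

// bucketFor returns the index of the first bucket whose upper bound holds v,
// or len(buckets) for the implicit +Inf bucket.
func bucketFor(v float64, buckets []float64) int {
	for i, b := range buckets {
		if v <= b {
			return i
		}
	}
	return len(buckets)
}

func main() {
	fmt.Println(bucketFor(0.03, defBuckets)) // 3 (the 0.05 bucket)
	fmt.Println(bucketFor(42, defBuckets))   // 11: +Inf, same as any value over 10s
}
```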

Contributor Author

@anshulpundir anshulpundir Nov 29, 2017

Thanks for pointing this out! However, this is unrelated to this change and I think can be addressed separately.

@anshulpundir
Contributor Author

anshulpundir commented Nov 29, 2017

Output from the /metrics endpoint:

# TYPE swarm_raft_snapshot_latency_seconds histogram
swarm_raft_snapshot_latency_seconds_bucket{le="0.005"} 0
swarm_raft_snapshot_latency_seconds_bucket{le="0.01"} 0
swarm_raft_snapshot_latency_seconds_bucket{le="0.025"} 0
swarm_raft_snapshot_latency_seconds_bucket{le="0.05"} 0
swarm_raft_snapshot_latency_seconds_bucket{le="0.1"} 0
swarm_raft_snapshot_latency_seconds_bucket{le="0.25"} 0
swarm_raft_snapshot_latency_seconds_bucket{le="0.5"} 1
swarm_raft_snapshot_latency_seconds_bucket{le="1"} 3
swarm_raft_snapshot_latency_seconds_bucket{le="2.5"} 4
swarm_raft_snapshot_latency_seconds_bucket{le="5"} 4
swarm_raft_snapshot_latency_seconds_bucket{le="10"} 4
swarm_raft_snapshot_latency_seconds_bucket{le="+Inf"} 4
swarm_raft_snapshot_latency_seconds_sum 2.777252233
swarm_raft_snapshot_latency_seconds_count 4
# HELP swarm_raft_transaction_latency_seconds Raft transaction latency.
# TYPE swarm_raft_transaction_latency_seconds histogram
swarm_raft_transaction_latency_seconds_bucket{le="0.005"} 0
swarm_raft_transaction_latency_seconds_bucket{le="0.01"} 9
swarm_raft_transaction_latency_seconds_bucket{le="0.025"} 217
swarm_raft_transaction_latency_seconds_bucket{le="0.05"} 403
swarm_raft_transaction_latency_seconds_bucket{le="0.1"} 407
swarm_raft_transaction_latency_seconds_bucket{le="0.25"} 411
swarm_raft_transaction_latency_seconds_bucket{le="0.5"} 412
swarm_raft_transaction_latency_seconds_bucket{le="1"} 412
swarm_raft_transaction_latency_seconds_bucket{le="2.5"} 412
swarm_raft_transaction_latency_seconds_bucket{le="5"} 412
swarm_raft_transaction_latency_seconds_bucket{le="10"} 412
swarm_raft_transaction_latency_seconds_bucket{le="+Inf"} 412
swarm_raft_transaction_latency_seconds_sum 11.902067514999999
swarm_raft_transaction_latency_seconds_count 412
# HELP swarm_store_batch_latency_seconds Raft store batch latency.
# TYPE swarm_store_batch_latency_seconds histogram
swarm_store_batch_latency_seconds_bucket{le="0.005"} 14
swarm_store_batch_latency_seconds_bucket{le="0.01"} 17
swarm_store_batch_latency_seconds_bucket{le="0.025"} 19
swarm_store_batch_latency_seconds_bucket{le="0.05"} 24
swarm_store_batch_latency_seconds_bucket{le="0.1"} 24
swarm_store_batch_latency_seconds_bucket{le="0.25"} 24
swarm_store_batch_latency_seconds_bucket{le="0.5"} 24
swarm_store_batch_latency_seconds_bucket{le="1"} 24
swarm_store_batch_latency_seconds_bucket{le="2.5"} 24
swarm_store_batch_latency_seconds_bucket{le="5"} 24
swarm_store_batch_latency_seconds_bucket{le="10"} 24
swarm_store_batch_latency_seconds_bucket{le="+Inf"} 24
swarm_store_batch_latency_seconds_sum 0.20429262100000004
swarm_store_batch_latency_seconds_count 24
# HELP swarm_store_lookup_latency_seconds Raft store read latency.
# TYPE swarm_store_lookup_latency_seconds histogram
swarm_store_lookup_latency_seconds_bucket{le="0.005"} 845
swarm_store_lookup_latency_seconds_bucket{le="0.01"} 845
swarm_store_lookup_latency_seconds_bucket{le="0.025"} 845
swarm_store_lookup_latency_seconds_bucket{le="0.05"} 845
swarm_store_lookup_latency_seconds_bucket{le="0.1"} 845
swarm_store_lookup_latency_seconds_bucket{le="0.25"} 845
swarm_store_lookup_latency_seconds_bucket{le="0.5"} 845
swarm_store_lookup_latency_seconds_bucket{le="1"} 845
swarm_store_lookup_latency_seconds_bucket{le="2.5"} 845
swarm_store_lookup_latency_seconds_bucket{le="5"} 845
swarm_store_lookup_latency_seconds_bucket{le="10"} 845
swarm_store_lookup_latency_seconds_bucket{le="+Inf"} 845
swarm_store_lookup_latency_seconds_sum 0.0028883279999999986
swarm_store_lookup_latency_seconds_count 845
# HELP swarm_store_read_txn_latency_seconds Raft store read txn latency.
# TYPE swarm_store_read_txn_latency_seconds histogram
swarm_store_read_txn_latency_seconds_bucket{le="0.005"} 2217
swarm_store_read_txn_latency_seconds_bucket{le="0.01"} 2218
swarm_store_read_txn_latency_seconds_bucket{le="0.025"} 2218
swarm_store_read_txn_latency_seconds_bucket{le="0.05"} 2222
swarm_store_read_txn_latency_seconds_bucket{le="0.1"} 2223
swarm_store_read_txn_latency_seconds_bucket{le="0.25"} 2223
swarm_store_read_txn_latency_seconds_bucket{le="0.5"} 2223
swarm_store_read_txn_latency_seconds_bucket{le="1"} 2223
swarm_store_read_txn_latency_seconds_bucket{le="2.5"} 2223
swarm_store_read_txn_latency_seconds_bucket{le="5"} 2223
swarm_store_read_txn_latency_seconds_bucket{le="10"} 2223
swarm_store_read_txn_latency_seconds_bucket{le="+Inf"} 2223
swarm_store_read_txn_latency_seconds_sum 0.23226729999999982
swarm_store_read_txn_latency_seconds_count 2223
# HELP swarm_store_write_txn_latency_seconds Raft store write txn latency.
# TYPE swarm_store_write_txn_latency_seconds histogram
swarm_store_write_txn_latency_seconds_bucket{le="0.005"} 8
swarm_store_write_txn_latency_seconds_bucket{le="0.01"} 13
swarm_store_write_txn_latency_seconds_bucket{le="0.025"} 211
swarm_store_write_txn_latency_seconds_bucket{le="0.05"} 410
swarm_store_write_txn_latency_seconds_bucket{le="0.1"} 414
swarm_store_write_txn_latency_seconds_bucket{le="0.25"} 418
swarm_store_write_txn_latency_seconds_bucket{le="0.5"} 419
swarm_store_write_txn_latency_seconds_bucket{le="1"} 419
swarm_store_write_txn_latency_seconds_bucket{le="2.5"} 419
swarm_store_write_txn_latency_seconds_bucket{le="5"} 419
swarm_store_write_txn_latency_seconds_bucket{le="10"} 419
swarm_store_write_txn_latency_seconds_bucket{le="+Inf"} 419
swarm_store_write_txn_latency_seconds_sum 11.956525793000004
swarm_store_write_txn_latency_seconds_count 419

@anshulpundir
Contributor Author

Addressed comments. @aaronlehmann @stevvooe, please take a look, thanks!

@anshulpundir anshulpundir changed the title Manager Instrumentation. Manager Instrumentation Nov 29, 2017
Signed-off-by: Anshul Pundir <anshul.pundir@docker.com>
@stevvooe
Contributor

stevvooe commented Dec 4, 2017

LGTM

// ProposeValue calls Propose on the underlying raft library(etcd/raft) and waits
// on the commit log action before returning a result
func (n *Node) ProposeValue(ctx context.Context, storeAction []api.StoreAction, cb func()) error {
defer metrics.StartTimer(proposeLatencyTimer)()
Collaborator

how does this defer call work?

Collaborator

This isn't a defect in your code; I just don't know how this works.

Contributor Author

StartTimer() starts the timer and returns a function, which is what gets deferred. The returned function stops the timer and records the elapsed time.

@dperny
Collaborator

dperny commented Dec 4, 2017

LGTM.

@dperny dperny merged commit 4429c76 into moby:master Dec 4, 2017