Manager Instrumentation#2356

Merged
dperny merged 1 commit into moby:master from anshulpundir:metrics
Dec 4, 2017
Conversation

Contributor

@anshulpundir anshulpundir commented Aug 24, 2017

Signed-off-by: Anshul Pundir anshul.pundir@docker.com

Added latency metrics for raft proposal latency, snapshots, scheduling delay, and store operations.

@GordonTheTurtle

Please sign your commits following these rules:
https://github.com/moby/moby/blob/master/CONTRIBUTING.md#sign-your-work
The easiest way to do this is to amend the last commit:

$ git clone -b "metrics" git@github.com:anshulpundir/swarmkit.git somewhere
$ cd somewhere
$ git commit --amend -s --no-edit
$ git push -f

Amending updates the existing PR. You DO NOT need to open a new one.

@codecov

codecov bot commented Aug 24, 2017

Codecov Report

Merging #2356 into master will decrease coverage by 0.03%.
The diff coverage is 90.47%.

@@            Coverage Diff             @@
##           master    #2356      +/-   ##
==========================================
- Coverage   61.66%   61.62%   -0.04%     
==========================================
  Files         128      128              
  Lines       21076    21036      -40     
==========================================
- Hits        12997    12964      -33     
+ Misses       6677     6662      -15     
- Partials     1402     1410       +8

// sure it sees the state as of this moment.
<-viewStarted

doSnapshotLatencyMetric.Observe(float64(time.Since(start).Nanoseconds()))
Collaborator

This won't wait for the full snapshot process - that happens inside the goroutine.

updateLatencyTimer metrics.Timer

// view()/read txn latency timer.
viewLatencyTimner metrics.Timer
Collaborator

Typo: timner

viewLatencyTimner metrics.Timer

// lookup() latency timer.
lookupLatencyTimner metrics.Timer
Collaborator

Typo: timner


// View executes a read transaction.
func (s *MemoryStore) View(cb func(ReadTx)) {
defer metrics.StartTimer(viewLatencyTimner)()
Collaborator

I'm not sure this is a useful metric. Txn and Commit are lockless and should be nearly instantaneous. The time taken is going to be dominated by cb, which could be doing many different things depending on the context. Without knowing which callback is invoked, statistics on the duration of View won't really tell us much. It may be more useful to instrument individual call sites.

But I suppose having this metric doesn't hurt, either.

Contributor Author

Agreed.

}

func (n *Node) doSnapshot(ctx context.Context, raftConfig api.RaftConfig) {
func (n *Node) triggerSnapshot(ctx context.Context, raftConfig api.RaftConfig) {
Collaborator

👍


// Update scheduling delay metric for running tasks.
if status.State == api.TaskStateRunning {
schedulingDelayTimer.Update(time.Duration(status.AppliedAt.Nanos - task.Meta.CreatedAt.Nanos))
Collaborator

You are only looking at the Nanos field, and ignoring the seconds field. If the second changes between these two timestamps, the result won't be accurate. You should use gogotypes.DurationFromProto to convert each timestamp to a time.Time, and then do normal time math on them.


// lookup is an internal typed wrapper around memdb.
func (tx readTx) lookup(table, index, id string) api.StoreObject {
defer metrics.StartTimer(lookupLatencyTimner)()
Collaborator

I'm pretty sure the latency of a lookup is going to be less than the timer overhead. It's just an in-memory data structure lookup.

Maybe that's okay, but I feel this would add a lot of overhead, and the result wouldn't be especially interesting.

Contributor Author

Removed.

@@ -76,8 +77,33 @@ var (
// WedgeTimeout is the maximum amount of time the store lock may be
// held before declaring a suspected deadlock.
WedgeTimeout = 30 * time.Second
Collaborator

It would be interesting to add a metric for how long the store lock gets held.

func init() {
ns := metrics.NewNamespace("swarm", "dispatcher", nil)
schedulingDelayTimer = ns.NewTimer("scheduling_delay",
"Scheduling delay, which is basically (task_status_running - task_status_creation).")
Collaborator

I don't understand the last part of this.

Maybe "Scheduling delay, which is the time a task spends between the NEW and RUNNING states".

current := status.State
status.State = state
status.Message = msg
status.AppliedAt = ptypes.MustTimestampProto(time.Now())
Collaborator

This isn't how AppliedAt was intended to be used.

	// AppliedAt gives a timestamp of when this status update was applied to
	// the Task object.

The status update gets applied to the Task object in the dispatcher, but this code is in the agent.

I think you should use time.Now() in the dispatcher instead of relying on a timestamp from the TaskStatus. You can't compare a timestamp that comes from a worker with one that was originally recorded on the manager, because their clocks may be out of sync. Even if there's only a few seconds difference, it would wildly throw off the results.

Contributor Author

This isn't how AppliedAt was intended to be used. The status update gets applied to the Task object in the dispatcher, but this code is in the agent.

You mean it gets applied/persisted to the task object in the store ?

The actual status change happens in different components as the task passes through them to the worker.
I agree with what you're saying, but that will not give us accurate results either since the actual time of status change is when it actually happens.

Please correct me if I'm wrong.

Collaborator

You mean it gets applied/persisted to the task object in the store ?

Yes, that's what it's supposed to timestamp.

I agree with what you're saying, but that will not give us accurate results either since the actual time of status change is when it actually happens.

Please correct me if I'm wrong.

You're correct, neither approach will be completely accurate. However, I think counting the extra network latency is less catastrophic than assuming perfectly sync'd clocks.

Contributor Author

I think counting the extra network latency is less catastrophic than assuming perfectly sync'd clocks.

I see the point. Assuming perfectly synced clocks may give ambiguous or wrong results, but including network latency will include the extra network latency but the results will not be wrong.

Contributor Author

@anshulpundir anshulpundir left a comment

Once again, appreciate the review! @aaronlehmann


"google.golang.org/grpc/transport"

"github.com/docker/go-events"
metrics "github.com/docker/go-metrics"
Contributor

@nishanttotla nishanttotla Oct 5, 2017

nit: providing a separate name isn't required. go-metrics can be used as metrics in this file. Same comment for other instances of this.

Contributor Author

Totally, but I saw it done in another place in swarmkit, and thought 'metrics.something' suited the style better than 'go-metrics.something'.

Collaborator

It will still be metrics.something.

Contributor Author

Sorry, yeah, I see what you mean. What's the way to figure out the package name without looking at the code? Also, any reason why it's not the same as the directory name? @nishanttotla

Collaborator

I don't think there is a way without looking at the code, but by convention, packages named go-foo typically have the package name set to foo. I don't think hyphens are allowed in package names.

@anshulpundir anshulpundir force-pushed the metrics branch 2 times, most recently from 67abc5d to 3b0c109 on October 11, 2017 00:03
@stevvooe
Contributor

@anshulpundir When submitting these PRs, it would help if you could include the output of the /metrics endpoint with the new values so that we can evaluate whether or not the reporting looks correct.


task.Status = *status
task.Status.AppliedBy = d.securityConfig.ClientTLSCreds.NodeID()
task.Status.AppliedAt = ptypes.MustTimestampProto(time.Now())
Contributor

What is "applied at" and when was it added?

Contributor Author

AppliedAt gives a timestamp of when this status update was applied to the Task object.


func init() {
ns := metrics.NewNamespace("swarm", "store", nil)
updateLatencyTimer = ns.NewTimer("write_txn_latency",
Contributor

I'm not sure how I feel about the abbreviation here.

// from the worker node, which may cause unknown incorrect results due to clock skew.
if status.State == api.TaskStateRunning {
start := time.Unix(status.AppliedAt.GetSeconds(), int64(status.AppliedAt.GetNanos()))
schedulingDelayTimer.Update(time.Since(start))
Contributor

Use UpdateSince.

WedgeTimeout = 30 * time.Second

// update()/write txn latency timer.
updateLatencyTimer metrics.Timer
Contributor

The variance on these values might be too high to be well covered by the HistogramVec default buckets. The defaults are in https://github.com/prometheus/client_golang/blob/master/prometheus/histogram.go#L59. The max bucket is 10 seconds, which should cover the large majority of use cases. If we need granularity over 10 seconds, this might be a problem.
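To make the concern concrete, here is a small sketch using the default bucket bounds referenced above: any observation beyond the 10-second bound lands in the implicit +Inf bucket, so all durations above 10s become indistinguishable.

```go
package main

import "fmt"

// defBuckets lists the prometheus/client_golang default histogram bounds
// quoted in the review; +Inf is implicit.
var defBuckets = []float64{.005, .01, .025, .05, .1, .25, .5, 1, 2.5, 5, 10}

// bucketFor returns the index of the first bucket whose upper bound holds v,
// or len(buckets) for the implicit +Inf bucket.
func bucketFor(v float64, buckets []float64) int {
	for i, b := range buckets {
		if v <= b {
			return i
		}
	}
	return len(buckets)
}

func main() {
	fmt.Println(bucketFor(0.03, defBuckets)) // 3 (the 0.05 bucket)
	fmt.Println(bucketFor(42, defBuckets))   // 11: +Inf, same as any value over 10s
}
```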

Contributor Author

@anshulpundir anshulpundir Nov 29, 2017

Thanks for pointing this out! However, this is unrelated to this change and I think can be addressed separately.

@anshulpundir
Contributor Author

anshulpundir commented Nov 29, 2017

Output from the /metrics endpoint:

# TYPE swarm_raft_snapshot_latency_seconds histogram
swarm_raft_snapshot_latency_seconds_bucket{le="0.005"} 0
swarm_raft_snapshot_latency_seconds_bucket{le="0.01"} 0
swarm_raft_snapshot_latency_seconds_bucket{le="0.025"} 0
swarm_raft_snapshot_latency_seconds_bucket{le="0.05"} 0
swarm_raft_snapshot_latency_seconds_bucket{le="0.1"} 0
swarm_raft_snapshot_latency_seconds_bucket{le="0.25"} 0
swarm_raft_snapshot_latency_seconds_bucket{le="0.5"} 1
swarm_raft_snapshot_latency_seconds_bucket{le="1"} 3
swarm_raft_snapshot_latency_seconds_bucket{le="2.5"} 4
swarm_raft_snapshot_latency_seconds_bucket{le="5"} 4
swarm_raft_snapshot_latency_seconds_bucket{le="10"} 4
swarm_raft_snapshot_latency_seconds_bucket{le="+Inf"} 4
swarm_raft_snapshot_latency_seconds_sum 2.777252233
swarm_raft_snapshot_latency_seconds_count 4
# HELP swarm_raft_transaction_latency_seconds Raft transaction latency.
# TYPE swarm_raft_transaction_latency_seconds histogram
swarm_raft_transaction_latency_seconds_bucket{le="0.005"} 0
swarm_raft_transaction_latency_seconds_bucket{le="0.01"} 9
swarm_raft_transaction_latency_seconds_bucket{le="0.025"} 217
swarm_raft_transaction_latency_seconds_bucket{le="0.05"} 403
swarm_raft_transaction_latency_seconds_bucket{le="0.1"} 407
swarm_raft_transaction_latency_seconds_bucket{le="0.25"} 411
swarm_raft_transaction_latency_seconds_bucket{le="0.5"} 412
swarm_raft_transaction_latency_seconds_bucket{le="1"} 412
swarm_raft_transaction_latency_seconds_bucket{le="2.5"} 412
swarm_raft_transaction_latency_seconds_bucket{le="5"} 412
swarm_raft_transaction_latency_seconds_bucket{le="10"} 412
swarm_raft_transaction_latency_seconds_bucket{le="+Inf"} 412
swarm_raft_transaction_latency_seconds_sum 11.902067514999999
swarm_raft_transaction_latency_seconds_count 412
# HELP swarm_store_batch_latency_seconds Raft store batch latency.
# TYPE swarm_store_batch_latency_seconds histogram
swarm_store_batch_latency_seconds_bucket{le="0.005"} 14
swarm_store_batch_latency_seconds_bucket{le="0.01"} 17
swarm_store_batch_latency_seconds_bucket{le="0.025"} 19
swarm_store_batch_latency_seconds_bucket{le="0.05"} 24
swarm_store_batch_latency_seconds_bucket{le="0.1"} 24
swarm_store_batch_latency_seconds_bucket{le="0.25"} 24
swarm_store_batch_latency_seconds_bucket{le="0.5"} 24
swarm_store_batch_latency_seconds_bucket{le="1"} 24
swarm_store_batch_latency_seconds_bucket{le="2.5"} 24
swarm_store_batch_latency_seconds_bucket{le="5"} 24
swarm_store_batch_latency_seconds_bucket{le="10"} 24
swarm_store_batch_latency_seconds_bucket{le="+Inf"} 24
swarm_store_batch_latency_seconds_sum 0.20429262100000004
swarm_store_batch_latency_seconds_count 24
# HELP swarm_store_lookup_latency_seconds Raft store read latency.
# TYPE swarm_store_lookup_latency_seconds histogram
swarm_store_lookup_latency_seconds_bucket{le="0.005"} 845
swarm_store_lookup_latency_seconds_bucket{le="0.01"} 845
swarm_store_lookup_latency_seconds_bucket{le="0.025"} 845
swarm_store_lookup_latency_seconds_bucket{le="0.05"} 845
swarm_store_lookup_latency_seconds_bucket{le="0.1"} 845
swarm_store_lookup_latency_seconds_bucket{le="0.25"} 845
swarm_store_lookup_latency_seconds_bucket{le="0.5"} 845
swarm_store_lookup_latency_seconds_bucket{le="1"} 845
swarm_store_lookup_latency_seconds_bucket{le="2.5"} 845
swarm_store_lookup_latency_seconds_bucket{le="5"} 845
swarm_store_lookup_latency_seconds_bucket{le="10"} 845
swarm_store_lookup_latency_seconds_bucket{le="+Inf"} 845
swarm_store_lookup_latency_seconds_sum 0.0028883279999999986
swarm_store_lookup_latency_seconds_count 845
# HELP swarm_store_read_txn_latency_seconds Raft store read txn latency.
# TYPE swarm_store_read_txn_latency_seconds histogram
swarm_store_read_txn_latency_seconds_bucket{le="0.005"} 2217
swarm_store_read_txn_latency_seconds_bucket{le="0.01"} 2218
swarm_store_read_txn_latency_seconds_bucket{le="0.025"} 2218
swarm_store_read_txn_latency_seconds_bucket{le="0.05"} 2222
swarm_store_read_txn_latency_seconds_bucket{le="0.1"} 2223
swarm_store_read_txn_latency_seconds_bucket{le="0.25"} 2223
swarm_store_read_txn_latency_seconds_bucket{le="0.5"} 2223
swarm_store_read_txn_latency_seconds_bucket{le="1"} 2223
swarm_store_read_txn_latency_seconds_bucket{le="2.5"} 2223
swarm_store_read_txn_latency_seconds_bucket{le="5"} 2223
swarm_store_read_txn_latency_seconds_bucket{le="10"} 2223
swarm_store_read_txn_latency_seconds_bucket{le="+Inf"} 2223
swarm_store_read_txn_latency_seconds_sum 0.23226729999999982
swarm_store_read_txn_latency_seconds_count 2223
# HELP swarm_store_write_txn_latency_seconds Raft store write txn latency.
# TYPE swarm_store_write_txn_latency_seconds histogram
swarm_store_write_txn_latency_seconds_bucket{le="0.005"} 8
swarm_store_write_txn_latency_seconds_bucket{le="0.01"} 13
swarm_store_write_txn_latency_seconds_bucket{le="0.025"} 211
swarm_store_write_txn_latency_seconds_bucket{le="0.05"} 410
swarm_store_write_txn_latency_seconds_bucket{le="0.1"} 414
swarm_store_write_txn_latency_seconds_bucket{le="0.25"} 418
swarm_store_write_txn_latency_seconds_bucket{le="0.5"} 419
swarm_store_write_txn_latency_seconds_bucket{le="1"} 419
swarm_store_write_txn_latency_seconds_bucket{le="2.5"} 419
swarm_store_write_txn_latency_seconds_bucket{le="5"} 419
swarm_store_write_txn_latency_seconds_bucket{le="10"} 419
swarm_store_write_txn_latency_seconds_bucket{le="+Inf"} 419
swarm_store_write_txn_latency_seconds_sum 11.956525793000004
swarm_store_write_txn_latency_seconds_count 419

@anshulpundir
Contributor Author

Addressed comments. @aaronlehmann @stevvooe, please take a look, thanks!

@anshulpundir anshulpundir changed the title Manager Instrumentation. Manager Instrumentation Nov 29, 2017
Signed-off-by: Anshul Pundir <anshul.pundir@docker.com>
@stevvooe
Contributor

stevvooe commented Dec 4, 2017

LGTM

// ProposeValue calls Propose on the underlying raft library(etcd/raft) and waits
// on the commit log action before returning a result
func (n *Node) ProposeValue(ctx context.Context, storeAction []api.StoreAction, cb func()) error {
defer metrics.StartTimer(proposeLatencyTimer)()
Collaborator

how does this defer call work?

Collaborator

This isn't a defect in your code; I just don't know how this works.

Contributor Author

StartTimer() starts the timer and returns a function, which is what gets deferred. The returned function stops the timer and records the elapsed time.

@dperny
Collaborator

dperny commented Dec 4, 2017

LGTM.

@dperny dperny merged commit 4429c76 into moby:master Dec 4, 2017