Conversation

Please sign your commits following these rules:

```
$ git clone -b "metrics" git@github.com:anshulpundir/swarmkit.git somewhere
$ cd somewhere
$ git commit --amend -s --no-edit
$ git push -f
```

Amending updates the existing PR. You DO NOT need to open a new one.
Codecov Report

```
@@            Coverage Diff             @@
##           master    #2356      +/-   ##
==========================================
- Coverage   61.66%   61.62%   -0.04%
==========================================
  Files         128      128
  Lines       21076    21036      -40
==========================================
- Hits        12997    12964      -33
+ Misses       6677     6662      -15
- Partials     1402     1410       +8
```
manager/state/raft/storage.go (Outdated)

```go
// sure it sees the state as of this moment.
<-viewStarted

doSnapshotLatencyMetric.Observe(float64(time.Since(start).Nanoseconds()))
```

This won't wait for the full snapshot process - that happens inside the goroutine.
Force-pushed from 40b9030 to 9ad063f.
manager/state/store/memory.go (Outdated)

```go
updateLatencyTimer metrics.Timer

// view()/read txn latency timer.
viewLatencyTimner metrics.Timer
```
manager/state/store/memory.go (Outdated)

```go
viewLatencyTimner metrics.Timer

// lookup() latency timer.
lookupLatencyTimner metrics.Timer
```
manager/state/store/memory.go (Outdated)

```go
// View executes a read transaction.
func (s *MemoryStore) View(cb func(ReadTx)) {
	defer metrics.StartTimer(viewLatencyTimner)()
```

I'm not sure this is a useful metric. Txn and Commit are lockless and should be nearly instantaneous. The time taken is going to be dominated by cb, which could be doing many different things depending on the context. Without knowing which callback is invoked, statistics on the duration of View won't really tell us much. It may be more useful to instrument individual call sites.

But I suppose having this metric doesn't hurt, either.
```diff
 }

-func (n *Node) doSnapshot(ctx context.Context, raftConfig api.RaftConfig) {
+func (n *Node) triggerSnapshot(ctx context.Context, raftConfig api.RaftConfig) {
```
manager/dispatcher/dispatcher.go (Outdated)

```go
// Update scheduling delay metric for running tasks.
if status.State == api.TaskStateRunning {
	schedulingDelayTimer.Update(time.Duration(status.AppliedAt.Nanos - task.Meta.CreatedAt.Nanos))
```
You are only looking at the Nanos field, and ignoring the Seconds field. If the second changes between these two timestamps, the result won't be accurate. You should use gogotypes.TimestampFromProto to convert each timestamp to a time.Time, and then do normal time math on them.
manager/state/store/memory.go (Outdated)

```go
// lookup is an internal typed wrapper around memdb.
func (tx readTx) lookup(table, index, id string) api.StoreObject {
	defer metrics.StartTimer(lookupLatencyTimner)()
```
I'm pretty sure the latency of a lookup is going to be less than the timer overhead. It's just an in-memory data structure lookup.

Maybe that's okay, but I feel this would add a lot of overhead, and the result wouldn't be especially interesting.
```go
// WedgeTimeout is the maximum amount of time the store lock may be
// held before declaring a suspected deadlock.
WedgeTimeout = 30 * time.Second
```

It would be interesting to add a metric for how long the store lock gets held.
manager/dispatcher/dispatcher.go (Outdated)

```go
func init() {
	ns := metrics.NewNamespace("swarm", "dispatcher", nil)
	schedulingDelayTimer = ns.NewTimer("scheduling_delay",
		"Scheduling delay, which is basically (task_status_running - task_status_creation).")
```

I don't understand the last part of this.

Maybe "Scheduling delay, which is the time a task spends between the NEW and RUNNING states".
agent/exec/controller.go (Outdated)

```go
current := status.State
status.State = state
status.Message = msg
status.AppliedAt = ptypes.MustTimestampProto(time.Now())
```
This isn't how AppliedAt was intended to be used.

> // AppliedAt gives a timestamp of when this status update was applied to
> // the Task object.

The status update gets applied to the Task object in the dispatcher, but this code is in the agent.

I think you should use time.Now() in the dispatcher instead of relying on a timestamp from the TaskStatus. You can't compare a timestamp that comes from a worker with one that was originally recorded on the manager, because their clocks may be out of sync. Even if there's only a few seconds difference, it would wildly throw off the results.
> This isn't how AppliedAt was intended to be used. The status update gets applied to the Task object in the dispatcher, but this code is in the agent.

You mean it gets applied/persisted to the task object in the store? The actual status change happens in different components as the task passes through them to the worker. I agree with what you're saying, but that will not give us accurate results either, since the actual time of the status change is when it actually happens. Please correct me if I'm wrong.
> You mean it gets applied/persisted to the task object in the store?

Yes, that's what it's supposed to timestamp.

> I agree with what you're saying, but that will not give us accurate results either, since the actual time of the status change is when it actually happens. Please correct me if I'm wrong.

You're correct, neither approach will be completely accurate. However, I think counting the extra network latency is less catastrophic than assuming perfectly sync'd clocks.
> I think counting the extra network latency is less catastrophic than assuming perfectly sync'd clocks.

I see the point. Assuming perfectly synced clocks may give ambiguous or wrong results; including network latency inflates the measurement somewhat, but the results will not be wrong.
anshulpundir left a comment:

Once again, appreciate the review! @aaronlehmann
manager/dispatcher/dispatcher.go (Outdated)

```go
"google.golang.org/grpc/transport"

"github.com/docker/go-events"
metrics "github.com/docker/go-metrics"
```
nit: providing a separate name isn't required. go-metrics can be used as metrics in this file. Same comment for other instances of this.

Totes, but I saw it done in another place in swarmkit, and thought 'metrics.something' suited the style better than 'go-metrics.something'.

It will still be metrics.something.

Sorry, yeah, I see what you mean. What's the way to figure out the package name without looking at the code? Also, any reason why it's not the same as the directory name? @nishanttotla

I don't think there is a way without looking at the code, but by convention, packages named go-foo typically have the package name set to foo. I don't think hyphens are allowed in package names.
Force-pushed from 67abc5d to 3b0c109.
@anshulpundir When submitting these PRs, it would help if you could include the output of the /metrics endpoint.
```go
task.Status = *status
task.Status.AppliedBy = d.securityConfig.ClientTLSCreds.NodeID()
task.Status.AppliedAt = ptypes.MustTimestampProto(time.Now())
```
What is "applied at" and when was it added?

AppliedAt gives a timestamp of when this status update was applied to the Task object.
manager/state/store/memory.go (Outdated)

```go
func init() {
	ns := metrics.NewNamespace("swarm", "store", nil)
	updateLatencyTimer = ns.NewTimer("write_txn_latency",
```

I'm not sure how I feel about the abbreviation here.
manager/dispatcher/dispatcher.go (Outdated)

```go
// from the worker node, which may cause unknown incorrect results due to clock skew.
if status.State == api.TaskStateRunning {
	start := time.Unix(status.AppliedAt.GetSeconds(), int64(status.AppliedAt.GetNanos()))
	schedulingDelayTimer.Update(time.Since(start))
```
```go
WedgeTimeout = 30 * time.Second

// update()/write txn latency timer.
updateLatencyTimer metrics.Timer
```
The variance on these values might be too high to be well covered by the HistogramVec default buckets. The defaults are in https://github.com/prometheus/client_golang/blob/master/prometheus/histogram.go#L59. The max bucket is 10 seconds, which should cover the large majority of use cases. If we need granularity over 10 seconds, this might be a problem.

Thanks for pointing this out! However, this is unrelated to this change and I think it can be addressed separately.
output from /metrics endpoint
Force-pushed from 3b0c109 to 1735f32.
Addressed comments. @aaronlehmann @stevvooe please take a look, thx!
Signed-off-by: Anshul Pundir <anshul.pundir@docker.com>
LGTM
```go
// ProposeValue calls Propose on the underlying raft library (etcd/raft) and waits
// on the commit log action before returning a result.
func (n *Node) ProposeValue(ctx context.Context, storeAction []api.StoreAction, cb func()) error {
	defer metrics.StartTimer(proposeLatencyTimer)()
```
How does this defer call work?

This isn't a defect in your code, I just don't know how this works.

StartTimer() starts the timer and returns a function, which is what gets deferred. The returned function stops the timer when the surrounding function returns.
LGTM.

Signed-off-by: Anshul Pundir <anshul.pundir@docker.com>

Added latency metrics for raft proposals, snapshots, scheduling delay, store transactions, etc.