Proposal: Container and Engine telemetry#9130
Proposal: Container and Engine telemetry#9130discordianfish wants to merge 1 commit intomoby:masterfrom
Conversation
|
Great. We talked a lot with @crosbymichael about something like this. I'll have a deeper look once I get some spare cycles. Thanks! |
There was a problem hiding this comment.
when the code works, it would be good to add an example output.
There was a problem hiding this comment.
oh, can you please also make a man/docker-metrics-1.md?
There was a problem hiding this comment.
@SvenDowideit [x] done!
I kept it simple because I was too lazy to generate output manually. Once implemented, I can extend the man page. I also realize that we probably should add some filtering for the metrics so you get only what you are interested in. But that's something to IMO discuss later.
There was a problem hiding this comment.
@discordianfish We can use go template, just like docker inspect. No need to implement the filter flags otherwise the code may become more and more complicated 😃
There was a problem hiding this comment.
@tobegit3hub Good point. Let's do that. But as I said, I would like to get this first approved and implemented before having to decide on all those minor UI things. We can do that on the PR implementing this I suppose.
|
nice! |
3323fd9 to
f337e32
Compare
|
FYI: I've change the metrics format significantly. It uses now pretty much the same format as prometheus which makes it much more consistent at the price that it's a little bit harder to parse, given that it uses a labels notion. First I proposed different structures for different kind of metrics and implicit encoding of dimensions into keys. But that's ugly and in the end harder to implement and to parse properly. |
f337e32 to
32859a6
Compare
|
I'll let @crosbymichael validate the format. I'm +1'ing the concept though. |
|
I think there's a similar PR in progress here; #8886 should both PRs be put next to each other and decided which one to proceed with? |
|
@thaJeztah Thanks for pointed that out. I think we should first figure out the format/UI/API, then implement it. This PR so far doesn't include any implementation, so happy to use whatever matches the design. |
|
@discordianfish yes, I noticed this was in a "design" stage and the other PR started "spontaneously". The main reason for pointing it out was to prevent people (the other PR?) investing a lot of time in something that might not get approved. But possibly they can join in after approval to help implement. Nice proposal, btw, thanks for that! |
|
/cc @vmarmol I wonder if you considered http://golang.org/pkg/expvar/ as an (additional?) way to expose this? |
|
@proppy Good idea! I would add those metrics as well and use it where possible. |
There was a problem hiding this comment.
Just a note: we ended up wrapping numbers in quotes in Prometheus since JSON doesn't support NaN, +Inf, or -Inf, and we needed to support those special values. Not sure if you'll ever need to support those kinds of values though.
There was a problem hiding this comment.
I can't think of any situation where we would need NaN or Inf. Anyone else who can?
There was a problem hiding this comment.
The kernel, today, typically uses int64max for those types of values.
|
I think we also need to include a couple of other stats:
Both of these can be sort of derived, but it would be good to have them available in the metrics. |
|
Yeah I'm +1 on expvar - we expose the endpoint in debug mode eg |
|
@huslage Good point. I'll look into this, but can't say yet how it will look into details therefor I've added it to this PRs description (instead of docs) |
This is proposing to add per container metrics, based on cgroups, as well as docker engine metrics about Docker's internal health and job management. This will either solve or at least provide a basis to address moby#8875, moby#1091, moby#4530, moby#5473, moby#8842 Signed-off-by: Johannes 'fish' Ziemke <github@freigeist.org>
32859a6 to
b218adc
Compare
|
@shykes @aluzzardi @crosbymichael Can we decide soon what's next with this proposal? Even though there isn't any code here yet, it already starts to get messy when rebasing.. |
|
#9984 was just merged, I don't know if you want to refine this as this also includes the engine stats |
|
@crosbymichael You mean I shall update the proposal based on the implementation? LOL. |
This is proposing to add per container metrics, based on cgroups, as well as docker engine metrics about Docker's internal health and job management.
This will either solve or at least provide a basis to address #8875, #1091, #4530, #5473, #8842
Why?
Tools like cAdvisor talk to the docker daemon as well as access the cgroup kernel features to read metrics from. Even though both use libcontainer, it's error prone and racy. Metrics are coupled to the container and since this container is managed by Docker, it should also provide the metrics for it.
Beside that, metrics are key for every production deployment of Docker. So this is a problem that everyone needs to solve. This is also reflected in the high number of issues asking for native docker metrics.
This also proposes white box telemetry. The idea is to expose various metrics about the Docker internals. For now, some failure/success counters and latencies for the internal job management as well as latencies for the API endpoints.
With the proposed change, it's easy to pull the metrics from Docker into any monitoring system. For smaller setups and on the fly debugging, there will be a metrics endpoint.
How?
The exposed metrics are pretty raw: We won't do rate conversions or advanced features like histograms and leave this to the monitoring tool. This has the limitation that
docker metrics [container]will only show the current raw, values. If-fis used, it will convert the values into a rate where appropriate. To make it explicit whether a metric is a gauge or a counter, the type is included in the API.The API is self-explanatory by including a description for each metric.
/cc @aluzzardi @crosbymichael @johncosta @kencochrane
Additions