Skip to content

Proposal: Container and Engine telemetry#9130

Closed
discordianfish wants to merge 1 commit intomoby:masterfrom
discordianfish:propose-docker-metrics
Closed

Proposal: Container and Engine telemetry#9130
discordianfish wants to merge 1 commit intomoby:masterfrom
discordianfish:propose-docker-metrics

Conversation

@discordianfish
Copy link
Contributor

This is proposing to add per container metrics, based on cgroups, as well as docker engine metrics about Docker's internal health and job management.

This will either solve or at least provide a basis to address #8875, #1091, #4530, #5473, #8842

Why?

Tools like cAdvisor talk to the docker daemon as well as access the cgroup kernel features to read metrics from. Even though both use libcontainer, it's error prone and racy. Metrics are coupled to the container and since this container is managed by Docker, it should also provide the metrics for it.

Beside that, metrics are key for every production deployment of Docker. So this is a problem that everyone needs to solve. This is also reflected in the high number of issues asking for native docker metrics.

This also proposes white box telemetry. The idea is to expose various metrics about the Docker internals. For now, some failure/success counters and latencies for the internal job management as well as latencies for the API endpoints.

With the proposed change, it's easy to pull the metrics from Docker into any monitoring system. For smaller setups and on the fly debugging, there will be a metrics endpoint.

How?

The exposed metrics are pretty raw: We won't do rate conversions or advanced features like histograms and leave this to the monitoring tool. This has the limitation that docker metrics [container] will only show the current raw, values. If -f is used, it will convert the values into a rate where appropriate. To make it explicit whether a metric is a gauge or a counter, the type is included in the API.
The API is self-explanatory by including a description for each metric.

/cc @aluzzardi @crosbymichael @johncosta @kencochrane

Additions

  • expvar/ to get various stats (memstats, http handler) and maybe use it for implementation
  • container uptime and restart counters
  • correlate jobs to containers where possible; have a container dimension for the jobs

@aluzzardi
Copy link
Member

Great. We talked a lot with @crosbymichael about something like this.
Your infra/prometheus background is really appreciated for metrics.

I'll have a deeper look once I get some spare cycles. Thanks!

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

when the code works, it would be good to add an example output.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh, can you please also make a man/docker-metrics-1.md?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@SvenDowideit [x] done!
I kept it simple because I was too lazy to generate output manually. Once implemented, I can extend the man page. I also realize that we probably should add some filtering for the metrics so you get only what you are interested in. But that's something to IMO discuss later.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@discordianfish We can use go template, just like docker inspect. No need to implement the filter flags otherwise the code may become more and more complicated 😃

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tobegit3hub Good point. Let's do that. But as I said, I would like to get this first approved and implemented before having to decide on all those minor UI things. We can do that on the PR implementing this I suppose.

@SvenDowideit
Copy link
Contributor

nice!

@discordianfish discordianfish force-pushed the propose-docker-metrics branch 4 times, most recently from 3323fd9 to f337e32 Compare November 13, 2014 20:06
@discordianfish
Copy link
Contributor Author

FYI: I've change the metrics format significantly. It uses now pretty much the same format as prometheus which makes it much more consistent at the price that it's a little bit harder to parse, given that it uses a labels notion.

First I proposed different structures for different kind of metrics and implicit encoding of dimensions into keys. But that's ugly and in the end harder to implement and to parse properly.

@aluzzardi
Copy link
Member

I'll let @crosbymichael validate the format. I'm +1'ing the concept though.

@thaJeztah
Copy link
Member

I think there's a similar PR in progress here; #8886 should both PRs be put next to each other and decided which one to proceed with?

@discordianfish
Copy link
Contributor Author

@thaJeztah Thanks for pointed that out. I think we should first figure out the format/UI/API, then implement it. This PR so far doesn't include any implementation, so happy to use whatever matches the design.

@thaJeztah
Copy link
Member

@discordianfish yes, I noticed this was in a "design" stage and the other PR started "spontaneously".

The main reason for pointing it out was to prevent people (the other PR?) investing a lot of time in something that might not get approved. But possibly they can join in after approval to help implement.

Nice proposal, btw, thanks for that!

@proppy
Copy link
Contributor

proppy commented Nov 15, 2014

/cc @vmarmol

I wonder if you considered http://golang.org/pkg/expvar/ as an (additional?) way to expose this?

@discordianfish
Copy link
Contributor Author

@proppy Good idea! I would add those metrics as well and use it where possible.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a note: we ended up wrapping numbers in quotes in Prometheus since JSON doesn't support NaN, +Inf, or -Inf, and we needed to support those special values. Not sure if you'll ever need to support those kinds of values though.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can't think of any situation where we would need NaN or Inf. Anyone else who can?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The kernel, today, typically uses int64max for those types of values.

@huslage
Copy link
Contributor

huslage commented Nov 20, 2014

I think we also need to include a couple of other stats:

  • how long it took to pull the image that this container is running from.
  • container uptime.

Both of these can be sort of derived, but it would be good to have them available in the metrics.

@pnasrat
Copy link
Contributor

pnasrat commented Nov 20, 2014

Yeah I'm +1 on expvar - we expose the endpoint in debug mode eg http://localhost:4243/debug/vars

@discordianfish
Copy link
Contributor Author

@huslage Good point. I'll look into this, but can't say yet how it will look into details therefor I've added it to this PRs description (instead of docs)

This is proposing to add per container metrics, based on cgroups, as well as
docker engine metrics about Docker's internal health and job management.

This will either solve or at least provide a basis to address moby#8875, moby#1091, moby#4530, moby#5473, moby#8842

Signed-off-by: Johannes 'fish' Ziemke <github@freigeist.org>
@discordianfish discordianfish force-pushed the propose-docker-metrics branch from 32859a6 to b218adc Compare January 5, 2015 11:54
@discordianfish
Copy link
Contributor Author

@shykes @aluzzardi @crosbymichael Can we decide soon what's next with this proposal? Even though there isn't any code here yet, it already starts to get messy when rebasing..

@crosbymichael
Copy link
Contributor

#9984 was just merged, I don't know if you want to refine this as this also includes the engine stats

@discordianfish
Copy link
Contributor Author

@crosbymichael You mean I shall update the proposal based on the implementation? LOL.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.