Add prometheus metrics output to docker #25820

Merged
thaJeztah merged 1 commit into moby:master from crosbymichael:prom
Oct 27, 2016

Conversation

@crosbymichael
Contributor

This adds metrics support to docker via the prometheus output. There is a new /metrics endpoint where the prometheus handler is installed.

This adds metrics to the daemon package for basic container, image, and daemon operations.

The client package being used is located here:

https://github.com/docker/go-metrics

This PR is the beginning; many more packages need to be instrumented, but help/suggestions from the individual subsystem maintainers are needed to collect relevant and useful information.

Contributor

Thoughts on subscribing to the health events for this container and keeping counters? This way you can track/alert on containers that are flapping.
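A minimal sketch of what such health-event counters could look like; the event statuses, type names, and flapping threshold here are assumptions for illustration, not the daemon's actual events API:

```go
package main

import "fmt"

// healthFlips counts unhealthy transitions per container, the kind of
// counter the comment suggests keeping from the events stream.
type healthFlips map[string]int

// observe records one health_status event for a container.
func (h healthFlips) observe(containerID, status string) {
	if status == "unhealthy" {
		h[containerID]++
	}
}

// flapping reports containers whose unhealthy count reached a
// threshold, which an operator could alert on.
func (h healthFlips) flapping(threshold int) []string {
	var out []string
	for id, n := range h {
		if n >= threshold {
			out = append(out, id)
		}
	}
	return out
}

func main() {
	h := healthFlips{}
	for i := 0; i < 3; i++ {
		h.observe("abc123", "unhealthy")
		h.observe("abc123", "healthy")
	}
	fmt.Println(h.flapping(3)) // [abc123]
}
```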

Contributor Author

Sounds like a good idea. I was trying to keep the scope small at this point to make sure we have the basics monitored before adding more metrics. This is just the first metrics PR; we will expand the scope to other packages afterwards.

@jhorwit2
Contributor

Thoughts on adding an expvar collector if the daemon is in debug mode?

❤️ this PR! 😃

Contributor

StartTimer costs a closure allocation. This interface is the most performant, but simple stuff can use the sugar.

@crosbymichael
Contributor Author

@xbglowx you might be interested in this PR

@cpuguy83
Member

So cool design LGTM

@discordianfish
Contributor

@crosbymichael Ha, nice! You might want to look into #9130 for the naming I proposed. That follows the Prometheus best practices better. If in doubt, see: https://prometheus.io/docs/practices/naming/
Would be really great if those metrics followed the best practices.

@fabxc You might be interested in this as well.

@stevvooe
Contributor

Would be really great if those metrics followed the best practices.

@discordianfish We've read through this in detail, a few times now, and we think we are following them. It would be more constructive if you could identify the areas where we don't follow best practices or suggest how making them different could help out.

Contributor

Not sure if you are aware, but this isn't called anywhere.

Contributor Author

ya, i removed it for now until we have multiple registries

@crosbymichael crosbymichael added this to the 1.13.0 milestone Aug 31, 2016
@discordianfish
Contributor

@stevvooe I referred to my PR for specific naming suggestions. Besides that, from quickly skimming through:

  • Standardize on seconds everywhere
  • Use self-explanatory metric names (not network_tx but network_tx_packets for example, you could also try to use similar/same names as in node_exporter or container_exporter)

Since we're migrating away from Docker, I can't really put much more time in this though. Still, if you have specific questions I'm happy to answer.
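A toy check capturing part of the naming guidance above (names should be self-explanatory and end in a base unit); the accepted suffixes are a partial, illustrative list, not the full Prometheus rules:

```go
package main

import (
	"fmt"
	"strings"
)

// hasUnitSuffix reports whether a metric name ends in one of a few
// conventional base-unit suffixes, so "network_tx" fails while
// "network_tx_packets_total" passes.
func hasUnitSuffix(name string) bool {
	for _, suffix := range []string{"_seconds", "_bytes", "_total"} {
		if strings.HasSuffix(name, suffix) {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(hasUnitSuffix("network_tx"))                // false
	fmt.Println(hasUnitSuffix("network_tx_packets_total"))  // true
	fmt.Println(hasUnitSuffix("container_actions_seconds")) // true
}
```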

@icecrime
Contributor

icecrime commented Sep 9, 2016

Ping @LK4D4: PTAL!

Contributor

@crosbymichael The recommendation from @discordianfish was to standardize on using seconds. Should we update go-metrics to follow that standard?

Contributor Author

for this, everyone expects cputime in nanoseconds; that is what all stat and cgroup calls return. i don't know if it makes sense to do it here or not. what do you think?
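If the metrics were standardized on seconds, converting the nanosecond values that stat and cgroup interfaces return would be a single division when exporting as a float64 counter; a sketch with a made-up helper name:

```go
package main

import "fmt"

// nanosToSeconds converts cputime from the nanoseconds that cgroup and
// stat reads report into the base seconds that the Prometheus naming
// conventions recommend.
func nanosToSeconds(ns uint64) float64 {
	return float64(ns) / 1e9
}

func main() {
	cpuNanos := uint64(1500000000) // 1.5s of cputime from a cgroup read
	fmt.Println(nanosToSeconds(cpuNanos)) // 1.5
}
```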

Contributor

I'm not a huge fan of the units type. I don't think people will take time to make a considered choice in practice, leaving us with inconsistent counters throughout the project.

However, I do see your point.

In this case, I am really not sure, since this is actually a gauge (may need to be a counter). We don't really get a guaranteed conversion here, since it will still be incremented as a float.

Contributor

yah i think this should be a counter.

@LK4D4
Contributor

LK4D4 commented Sep 26, 2016

@crosbymichael I wonder how heavy all the registrations in init are. Reexec is still used in the overlay2 storage driver, chrootarchive, and in multiple places in libnetwork.

@LK4D4
Contributor

LK4D4 commented Sep 27, 2016

Ok, seems like it's quite heavy, but that's the supposed way to use the prometheus client.
@crosbymichael this needs a rebase.

@crosbymichael
Contributor Author

Rebased and placed this feature behind the experimental flag.

@LK4D4 for the initialization, these counters are not heavy at all and have no out-of-process calls or allocations. They are just initializing types, and they are small.

Contributor

d is unused

Contributor

let's split it a little and return an error if MetricsAddress is not empty on non-experimental.
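A minimal sketch of the validation being asked for; the field and flag names are illustrative, not necessarily the daemon's actual configuration names:

```go
package main

import (
	"errors"
	"fmt"
)

// validateMetricsConfig rejects a configured metrics address outright
// when the daemon is not running in experimental mode, rather than
// silently ignoring it.
func validateMetricsConfig(metricsAddress string, experimental bool) error {
	if metricsAddress != "" && !experimental {
		return errors.New("metrics address requires experimental mode to be enabled")
	}
	return nil
}

func main() {
	fmt.Println(validateMetricsConfig("127.0.0.1:9323", false)) // metrics address requires experimental mode to be enabled
	fmt.Println(validateMetricsConfig("127.0.0.1:9323", true))  // <nil>
}
```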

This adds a metrics package that creates additional metrics. Add the
metrics endpoint to the docker api server under `/metrics`.

Signed-off-by: Michael Crosby <crosbymichael@gmail.com>

Add metrics to daemon package

Signed-off-by: Michael Crosby <crosbymichael@gmail.com>

api: use standard way for metrics route

Also add "type" query parameter

Signed-off-by: Alexander Morozov <lk4d4@docker.com>

Convert timers to ms

Signed-off-by: Michael Crosby <crosbymichael@gmail.com>
@LK4D4 LK4D4 removed the status/failing-ci (indicates that the PR in its current state fails the test suite) label Oct 27, 2016
@LK4D4
Contributor

LK4D4 commented Oct 27, 2016

LGTM

@LK4D4
Contributor

LK4D4 commented Oct 27, 2016

ping @thaJeztah

@thaJeztah
Member

Since this is a new API, we need a new section in the docs for it. Let's do that in a follow-up before 1.13 is released

/cc @mstanleyjones

@thaJeztah
Member

opened #27843 for tracking docs

@crosbymichael crosbymichael deleted the prom branch October 28, 2016 16:57
@outcoldman
Contributor

Am I right that in the current implementation only Prometheus can be used for collecting metrics? Why not have extensibility as in log drivers? The current implementation is really tied to the Prometheus client library, without any abstractions.

@brian-brazil

Am I right that in the current implementation only Prometheus can be used for collecting metrics?

For clarity I define "collection" to mean instrumentation, and "ingestion" as the process of getting the collected metric data into a monitoring system.

In that context, the answer is yes as I understand things on the Docker side.

Why not have extensibility as in log drivers? The current implementation is really tied to the Prometheus client library, without any abstractions.

The challenge is that metric instrumentation is nowhere near as standardised as logging. Indeed, Prometheus client libraries can be viewed as one standardisation option to allow for extensibility, as they are designed to be an open ecosystem.

You could add an abstraction at the instrumentation level, as I believe you are suggesting. You'd likely end up with a lowest-common-denominator library, which would lose the benefits of things like labels and floating point numbers, take a performance hit, and not be idiomatic for any monitoring system. In short, you'd gain abstraction at the cost of good instrumentation. I personally do not believe that is a good tradeoff, and have never seen it work out well when attempted.

With Prometheus client libraries there's abstraction at the ingestion level. The Prometheus approach is to instrument full-on with Prometheus client libraries, and then you use a small shim (called a "bridge") to output to other systems like Graphite (currently in code review to be included out of the box for the Go library, already exists for Java/Python), New Relic, InfluxDB, CloudWatch etc. There's no need for a user to run a Prometheus server in such a scenario, see https://www.robustperception.io/exporting-to-graphite-with-the-prometheus-python-client/ for example. There are other ways you can plumb that too.
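A toy illustration of what such a bridge does: flatten a Prometheus sample with labels into Graphite's plaintext "path value timestamp" line. This is a sketch of the idea, not the client_golang bridge's actual code:

```go
package main

import "fmt"

// toGraphite rewrites one Prometheus sample into the Graphite plaintext
// protocol, folding labels into the dotted metric path. With more than
// one label, map iteration order would make the path non-deterministic;
// a real bridge sorts the label names first.
func toGraphite(prefix, name string, labels map[string]string, value float64, ts int64) string {
	path := prefix + "." + name
	for k, v := range labels {
		path += "." + k + "." + v
	}
	return fmt.Sprintf("%s %g %d", path, value, ts)
}

func main() {
	fmt.Println(toGraphite("docker", "engine_daemon_container_actions_total",
		map[string]string{"action": "start"}, 42, 1477580400))
}
```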

Prometheus client libraries put all the smarts in the server rather than the instrumentation and have a data model that's on the powerful end of the scale, so it's almost always possible to automagically convert to something sensible in other monitoring/instrumentation systems. The reverse is unfortunately rarely true.

@peterbourgon
Contributor

You could add an abstraction at the instrumentation level . . .

Self-plug: one candidate is go-kit/kit/metrics, which provides exactly this abstraction and suffers none of the drawbacks you mention. I'm not necessarily advocating for it in this circumstance—it seems exceedingly unlikely that Docker should or would use a different metrics backend than Prometheus—but at least the option is there.
