Skip to content

Support metrics with dots in their names #347

@SylvainJuge

Description

@SylvainJuge

Describe the bug

As of version 1.18.0, the java agent added instrumentation of micrometer framework, which allows to capture metrics managed with this framework. While the reported issue is specific to HikariCP connection pool, it might happen with other frameworks, or could be user-provided.

Initial investigation of the issue shows that some metric names are problematic and conflict with metrics storage.
This issue has been opened in the apm repository as the actual change required might spread over multiple repositories and the solution to fix it properly isn't defined yet.

This was reported in our public forum here : https://discuss.elastic.co/t/failed-to-transform-sample-model-sample/249962

Currently, you can't have two metrics where one is the prefix of another with . as separator, connections=3 and connections.idle=2 will conflict and at least one of them will be ignored as one expects connections to be a metric value, and the other expects it to be an object with an idle property.

To Reproduce

Follow instructions described in the forum to reproduce it with the Java agent.

Any agent that will sent two metrics with names that conflict with each other will have a similar issue.

Expected behavior

Metric names should not have any impact on our ability to capture them.

Debug logs

Server debug logs (provided in the forum)
Sep 25 10:23:09 apm-server[31357]: WARN [transform] model/metricset.go:141 failed to transform sample model.Sample{Name:"hikaricp.connections.idle", Value:2, Values:[]float64(nil), Counts:[]int64(nil)}
Sep 25 10:23:09 apm-server[31357]: WARN [transform] model/metricset.go:141 failed to transform sample model.Sample{Name:"hikaricp.connections.creation.count", Value:600, Values:[]float64(nil), Counts:[]int64(nil)}
Sep 25 10:23:09 apm-server[31357]: WARN [transform] model/metricset.go:141 failed to transform sample model.Sample{Name:"hikaricp.connections.creation.sum.us", Value:7.0216e+07, Values:[]float64(nil), Counts:[]int64(nil)}
Sep 25 10:23:09 apm-server[31357]: WARN [transform] model/metricset.go:141 failed to transform sample model.Sample{Name:"hikaricp.connections.timeout", Value:6, Values:[]float64(nil), Counts:[]int64(nil)}
Sep 25 10:23:09 apm-server[31357]: WARN [transform] model/metricset.go:141 failed to transform sample model.Sample{Name:"hikaricp.connections.pending", Value:0, Values:[]float64(nil), Counts:[]int64(nil)}
Sep 25 10:23:09 apm-server[31357]: WARN [transform] model/metricset.go:141 failed to transform sample model.Sample{Name:"hikaricp.connections.creation.sum.us", Value:5.7659e+07, Values:[]float64(nil), Counts:[]int64(nil)}
Sep 25 10:23:09 apm-server[31357]: WARN [transform] model/metricset.go:141 failed to transform sample model.Sample{Name:"hikaricp.connections.active", Value:0, Values:[]float64(nil), Counts:[]int64(nil)}
Sep 25 10:23:09 apm-server[31357]: WARN [transform] model/metricset.go:141 failed to transform sample model.Sample{Name:"hikaricp.connections.idle", Value:2, Values:[]float64(nil), Counts:[]int64(nil)}
Sep 25 10:23:09 apm-server[31357]: WARN [transform] model/metricset.go:141 failed to transform sample model.Sample{Name:"hikaricp.connections.creation.count", Value:511, Values:[]float64(nil), Counts:[]int64(nil)}
Sep 25 10:23:09 apm-server[31357]: WARN [transform] model/metricset.go:141 failed to transform sample model.Sample{Name:"hikaricp.connections.usage.count", Value:128944, Values:[]float64(nil), Counts:[]int64(nil)}
Sep 25 10:23:09 apm-server[31357]: WARN [transform] model/metricset.go:141 failed to transform sample model.Sample{Name:"hikaricp.connections.usage.sum.us", Value:1.983578e+09, Values:[]float64(nil), Counts:[]int64(nil)}
Sep 25 10:23:09 apm-server[31357]: WARN [transform] model/metricset.go:141 failed to transform sample model.Sample{Name:"hikaricp.connections.max", Value:50, Values:[]float64(nil), Counts:[]int64(nil)}
Agent logs with log_level=trace

Relevant part where the issue is:

"hikaricp.connections":{"value":3.0},"hikaricp.connections.timeout":{"value":0.0}

Full log

Sep 28 16:34:03 env[3405743]: {"metricset":{"timestamp":1601310843073000,"tags":{"pool":"HikariPool-2"},"samples":{"hikaricp.connections.max":{"value":50.0},"hikaricp.connections.usage.count":{"value":46},"hikaricp.connections.usage.sum.us":{"value":409000.0},"hikaricp.connections":{"value":3.0},"hikaricp.connections.active":{"value":0.0},"hikaricp.connections.min":{"value":2.0},"hikaricp.connections.acquire.count":{"value":46},"hikaricp.connections.acquire.sum.us":{"value":159244.109},"hikaricp.connections.timeout":{"value":0.0},"hikaricp.connections.pending":{"value":0.0},"hikaricp.connections.idle":{"value":3.0},"hikaricp.connections.creation.count":{"value":0},"hikaricp.connections.creation.sum.us":{"value":0.0}}}}
Sep 28 16:34:03 env[3405743]: {"metricset":{"timestamp":1601310843073000,"tags":{"pool":"HikariPool-1"},"samples":{"hikaricp.connections.max":{"value":50.0},"hikaricp.connections.usage.count":{"value":50},"hikaricp.connections.usage.sum.us":{"value":889000.0},"hikaricp.connections.acquire.count":{"value":50},"hikaricp.connections.acquire.sum.us":{"value":217912.23},"hikaricp.connections.min":{"value":2.0},"hikaricp.connections":{"value":3.0},"hikaricp.connections.timeout":{"value":0.0},"hikaricp.connections.pending":{"value":0.0},"hikaricp.connections.active":{"value":0.0},"hikaricp.connections.idle":{"value":3.0},"hikaricp.connections.creation.count":{"value":0},"hikaricp.connections.creation.sum.us":{"value":0.0}}}}
Sep 28 16:34:33 env[3405743]: {"metricset":{"timestamp":1601310873073000,"tags":{"pool":"HikariPool-2"},"samples":{"hikaricp.connections.max":{"value":50.0},"hikaricp.connections.usage.count":{"value":60},"hikaricp.connections.usage.sum.us":{"value":416000.0},"hikaricp.connections":{"value":3.0},"hikaricp.connections.active":{"value":0.0},"hikaricp.connections.min":{"value":2.0},"hikaricp.connections.acquire.count":{"value":60},"hikaricp.connections.acquire.sum.us":{"value":172080.99},"hikaricp.connections.timeout":{"value":0.0},"hikaricp.connections.pending":{"value":0.0},"hikaricp.connections.idle":{"value":3.0},"hikaricp.connections.creation.count":{"value":0},"hikaricp.connections.creation.sum.us":{"value":0.0}}}}
Sep 28 16:34:33 env[3405743]: {"metricset":{"timestamp":1601310873073000,"tags":{"pool":"HikariPool-1"},"samples":{"hikaricp.connections.max":{"value":50.0},"hikaricp.connections.usage.count":{"value":64},"hikaricp.connections.usage.sum.us":{"value":896000.0},"hikaricp.connections.acquire.count":{"value":64},"hikaricp.connections.acquire.sum.us":{"value":243931.7},"hikaricp.connections.min":{"value":2.0},"hikaricp.connections":{"value":3.0},"hikaricp.connections.timeout":{"value":0.0},"hikaricp.connections.pending":{"value":0.0},"hikaricp.connections.active":{"value":0.0},"hikaricp.connections.idle":{"value":3.0},"hikaricp.connections.creation.count":{"value":0},"hikaricp.connections.creation.sum.us":{"value":0.0}}}}
Sep 28 16:35:03 env[3405743]: {"metricset":{"timestamp":1601310903073000,"tags":{"pool":"HikariPool-2"},"samples":{"hikaricp.connections.max":{"value":50.0},"hikaricp.connections.usage.count":{"value":74},"hikaricp.connections.usage.sum.us":{"value":423000.0},"hikaricp.connections":{"value":2.0},"hikaricp.connections.active":{"value":0.0},"hikaricp.connections.min":{"value":2.0},"hikaricp.connections.acquire.count":{"value":74},"hikaricp.connections.acquire.sum.us":{"value":184358.326},"hikaricp.connections.timeout":{"value":0.0},"hikaricp.connections.pending":{"value":0.0},"hikaricp.connections.idle":{"value":2.0},"hikaricp.connections.creation.count":{"value":0},"hikaricp.connections.creation.sum.us":{"value":0.0}}}}
Sep 28 16:35:03 env[3405743]: {"metricset":{"timestamp":1601310903073000,"tags":{"pool":"HikariPool-1"},"samples":{"hikaricp.connections.max":{"value":50.0},"hikaricp.connections.usage.count":{"value":78},"hikaricp.connections.usage.sum.us":{"value":903000.0},"hikaricp.connections.acquire.count":{"value":78},"hikaricp.connections.acquire.sum.us":{"value":269137.9},"hikaricp.connections.min":{"value":2.0},"hikaricp.connections":{"value":2.0},"hikaricp.connections.timeout":{"value":0.0},"hikaricp.connections.pending":{"value":0.0},"hikaricp.connections.active":{"value":0.0},"hikaricp.connections.idle":{"value":2.0},"hikaricp.connections.creation.count":{"value":0},"hikaricp.connections.creation.sum.us":{"value":0.0}}}}

Possible approaches to solve this

While we can apply workarounds that are specific to HikariCP or even to our instrumentation of Micrometer, it seems that a proper fix for this needs to be applied at either agent or apm-server level.

Here is a non-definitive list of ideas that we (java agent team) have come so far:

  1. Agent could simply rename metrics on the fly, for example replace . with _. While simple to change, that would make such metrics different from others that use dots and break compatibility with existing metrics and even impact APM UI as current visualizations rely on metric names. Another downside is with custom metrics, those will have different (but close) names which could be confusing.
  2. Agent (or server) could post-process any metricset to ensure that no metrics conflict by adding an extra _value field automatically when required. This would work as long as those metrics are sent together in the same metricset, which is quite likely. In the example above all metrics related to a database connection pool will always reported together.
  3. Add a configuration option at agent-level that will allow to rename those conflicting metrics. This would assume that the frequency of such cases is small enough to not be a constant burden. Given that hikaricp is the default connection pool for spring-boot applications, which itself is a very popular framework, that's not a very appealing option for Java agent.

Currently, we think that the option 2. is the best for the long term, and having a few metrics with an extra _value attribute should be straight-forward for the end-user.

What are the other options that we could implement for this ?


Current workaround

  • with agent version 1.18.0, add micrometer to disable_instrumentations, this will completely disable micrometer integration.
  • use latest snapshot version or 1.19.0 ( not released as time of writing) and set disabled_metrics=hikaricp.connections

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions