Use LongUpDownCounter for Linked Project Error Metrics#139657
Use LongUpDownCounter for Linked Project Error Metrics#139657JeremyDahlgren merged 2 commits intoelastic:mainfrom
Conversation
Replaces LongCounter with LongUpDownCounter for the metric used for linked project connection errors. Since the metric may not be incremented for long periods of time it was not appearing in the mappings of the current index of the APM datastream. This results in 'not found' errors when developing dashboards and alerts. The change required adding an initial zero value so the metric will be collected every polling cycle. Relates: ES-12696
|
Pinging @elastic/es-distributed-coordination (Team:Distributed Coordination) |
ywangd
left a comment
There was a problem hiding this comment.
I need some help to understand how the change fixes the issue.
| .registerLongUpDownCounter(CONNECTION_ATTEMPT_FAILURES_COUNTER_NAME, "linked project connection attempt failure count", "count") | ||
| .add(0); |
There was a problem hiding this comment.
IIUC, this triggers the metric collection once at node start time. How does it help with fixing the problem unless it periodically publishes it? If it does not periodically publish the metric, would it also be viable to keep using the LongCounter and increment it by 0 at registration time?
There was a problem hiding this comment.
LongCounter incremented by 0 at registration time does not work using the same command I was running in the description. I can dig into the implementations to track down the difference in behavior. The messages I received to use LongUpDownCounter as a workaround did not include an explanation as to why it works differently. It would be good to understand this.
There was a problem hiding this comment.
There was a problem hiding this comment.
Honestly this feels like a wasteful workaround. This will continuously publish a value for each unique combination of attributes, until either the max cardinality is reached (1999) or the node is restarted. If I understand it all correctly the limitation is with Kibana requiring the metric to be present in the mappings of the APM datastreams. It seems like you should be able to build a dashboard by defining the metric and attribute labels. You could also define their types. Yes the UI can help by automatically offering existing metrics and labels that it finds, but it feels like a limitation to raise an error if it isn't present and prevent you from moving forward. Why can't it just display empty visualizations? The owner of the dashboard is going to investigate and look at the source data and see misconfigurations and fix them. Same when preparing alerts. Or maybe provide an upload with test data so you can verify with it?
There was a problem hiding this comment.
Thanks for the investigation. I agree this isn't ideal. Short of other actionable recommendations, I am ok with proceeding with it for the time being.
* upstream/main: (253 commits) Adds ST_SIMPLIFY geo spatial function (elastic#136309) Take control of max clause count verification in Lucene searcher (elastic#139752) [ML] Unmute Inference Test (elastic#139765) Parameterize the vector operation benchmark tests (elastic#139735) Fix node reduction pushdown tests for release tests (elastic#139548) Fix flakiness in TSDataGenerationHelper (elastic#139759) CPS: Copy existing resolved index expressions when constructing a new `SearchRequest` from an existing one (elastic#139596) Add release notes for v9.1.9 release (elastic#139674) Add lucene query for wildcards on high cardinality keyword fields. (elastic#139746) Suppress Tika entitlement warnings from AWT (elastic#139711) Check field storage when synthetic source is enabled, in tests (elastic#139715) Refactor VectorSimilarityType to know about its corresponding Function (elastic#139678) Merge fixes from patch branch back into main (elastic#139721) Define native bulk operations for vector square distance (elastic#139198) Use LongUpDownCounter for Linked Project Error Metrics (elastic#139657) ESQL: Add javadoc that explains version-aware planning (elastic#139706) Add helper to pick node for reindex relocation (elastic#139081) Fix auth serialization randomized version test (elastic#139182) ES|QL - Add parsing, preanalysis and analysis timing information to profile (elastic#139540) Mute org.elasticsearch.persistent.ClusterPersistentTasksCustomMetadataTests testMinVersionSerialization elastic#139741 ...
💚 All backports created successfully
Questions ?Please refer to the Backport tool documentation |
Replaces LongCounter with LongUpDownCounter for the metric used for linked project connection errors. Since the metric may not be incremented for long periods of time it was not appearing in the mappings of the current index of the APM datastream. This results in 'not found' errors when developing dashboards and alerts. The change required adding an initial zero value so the metric will be collected every polling cycle. Relates: ES-12696 (cherry picked from commit b1aa31a)


Replaces
LongCounterwithLongUpDownCounterfor the metric used for linked project connection errors. Since the metric may not be incremented for long periods of time it was not appearing in the mappings of the current index of the APM datastream. This results in 'not found' errors when developing dashboards and alerts. The change required adding an initial zero value so the metric will be collected every polling cycle. Per a request from core-infra the string concatenation has been eliminated from the metric attribute strings for easier code search.Using the
LongUpDownCounterwith an initialadd(0)at registration time I was able to observe the metric in the polled values:I also was able to artificially inject some actual increments, and observed the zero value still being collected when no actual increments were coming through:
This refactoring is necessary since the suggested workaround of using an explicit formula instead of a quick function in the dashboards did not prevent the 'not found' issue.
Relates: ES-12696
Relates: ES-12697