Use LongUpDownCounter for Linked Project Error Metrics by JeremyDahlgren · Pull Request #139657 · elastic/elasticsearch

JeremyDahlgren · 2025-12-16T23:11:35Z

Replaces LongCounter with LongUpDownCounter for the metric used for linked project connection errors. Since the metric may not be incremented for long periods of time it was not appearing in the mappings of the current index of the APM datastream. This results in 'not found' errors when developing dashboards and alerts. The change required adding an initial zero value so the metric will be collected every polling cycle. Per a request from core-infra the string concatenation has been eliminated from the metric attribute strings for easier code search.

Using the LongUpDownCounter with an initial add(0) at registration time I was able to observe the metric in the polled values:

$ ./gradlew run -Dtests.es.logger.level=ERROR --with-apm-server --apm-metrics="es.projects.linked.connections.error.total"
...
Metricset [runTask-0]: {"timestamp":1765923703735396,"tags":{"otel_instrumentation_scope_name":"elasticsearch"},"samples":{"es.projects.linked.connections.error.total":{"value":0}}}
Metricset [runTask-0]: {"timestamp":1765923713731978,"tags":{"otel_instrumentation_scope_name":"elasticsearch"},"samples":{"es.projects.linked.connections.error.total":{"value":0}}}
Metricset [runTask-0]: {"timestamp":1765923723734827,"tags":{"otel_instrumentation_scope_name":"elasticsearch"},"samples":{"es.projects.linked.connections.error.total":{"value":0}}}
...

I also was able to artificially inject some actual increments, and observed the zero value still being collected when no actual increments were coming through:

Metricset [runTask-0]: {"timestamp":1765923159434143,"tags":{"otel_instrumentation_scope_name":"elasticsearch"},"samples":{"es.projects.linked.connections.error.total":{"value":0}}}
Metricset [runTask-0]: {"timestamp":1765923169438216,"tags":{"otel_instrumentation_scope_name":"elasticsearch","es_linked_project_alias":"DEFAULT","es_linked_project_attempt":"initial","es_linked_project_id":"DEFAULT"},"samples":{"es.projects.linked.connections.error.total":{"value":1}}}
Metricset [runTask-0]: {"timestamp":1765923169438216,"tags":{"otel_instrumentation_scope_name":"elasticsearch"},"samples":{"es.projects.linked.connections.error.total":{"value":0}}}
Metricset [runTask-0]: {"timestamp":1765923179436766,"tags":{"otel_instrumentation_scope_name":"elasticsearch","es_linked_project_alias":"DEFAULT","es_linked_project_attempt":"initial","es_linked_project_id":"DEFAULT"},"samples":{"es.projects.linked.connections.error.total":{"value":1}}}
Metricset [runTask-0]: {"timestamp":1765923179436766,"tags":{"otel_instrumentation_scope_name":"elasticsearch"},"samples":{"es.projects.linked.connections.error.total":{"value":0}}}

This refactoring is necessary since the suggested workaround of using an explicit formula instead of a quick function in the dashboards did not prevent the 'not found' issue.

Relates: ES-12696
Relates: ES-12697

Replaces LongCounter with LongUpDownCounter for the metric used for linked project connection errors. Since the metric may not be incremented for long periods of time it was not appearing in the mappings of the current index of the APM datastream. This results in 'not found' errors when developing dashboards and alerts. The change required adding an initial zero value so the metric will be collected every polling cycle. Relates: ES-12696

elasticsearchmachine · 2025-12-16T23:14:31Z

Pinging @elastic/es-distributed-coordination (Team:Distributed Coordination)

ywangd

I need some help to understand how the change fixes the issue.

ywangd · 2025-12-17T01:29:44Z

server/src/main/java/org/elasticsearch/transport/RemoteClusterService.java

+            .registerLongUpDownCounter(CONNECTION_ATTEMPT_FAILURES_COUNTER_NAME, "linked project connection attempt failure count", "count")
+            .add(0);


IIUC, this triggers the metric collection once at node start time. How does it help with fixing the problem unless it periodically publishes it? If it does not periodically publish the metric, would it also be viable to keep using the LongCounter and increment it by 0 at registration time?

LongCounter incremented by 0 at registration time does not work using the same command I was running in the description. I can dig into the implementations to track down the difference in behavior. The messages I received to use LongUpDownCounter as a workaround did not include an explanation as to why it works differently. It would be good to understand this.

The LongUpDownCounter uses AggregationTemporality.CUMULATIVE while LongCounter uses AggregationTemporality.DELTA. In DefaultSynchronousMetricStorage::collect() the handle's recorded values get reset for DELTA, while for CUMULATIVE the initial entry isn't cleared.

Honestly this feels like a wasteful workaround. This will continuously publish a value for each unique combination of attributes, until either the max cardinality is reached (1999) or the node is restarted. If I understand it all correctly the limitation is with Kibana requiring the metric to be present in the mappings of the APM datastreams. It seems like you should be able to build a dashboard by defining the metric and attribute labels. You could also define their types. Yes the UI can help by automatically offering existing metrics and labels that it finds, but it feels like a limitation to raise an error if it isn't present and prevent you from moving forward. Why can't it just display empty visualizations? The owner of the dashboard is going to investigate and look at the source data and see misconfigurations and fix them. Same when preparing alerts. Or maybe provide an upload with test data so you can verify with it?

Thanks for the investigation. I agree this isn't ideal. Short of other actionable recommendations, I am ok with proceeding with it for the time being.

ywangd

LGTM

* upstream/main: (253 commits) Adds ST_SIMPLIFY geo spatial function (elastic#136309) Take control of max clause count verification in Lucene searcher (elastic#139752) [ML] Unmute Inference Test (elastic#139765) Parameterize the vector operation benchmark tests (elastic#139735) Fix node reduction pushdown tests for release tests (elastic#139548) Fix flakiness in TSDataGenerationHelper (elastic#139759) CPS: Copy existing resolved index expressions when constructing a new `SearchRequest` from an existing one (elastic#139596) Add release notes for v9.1.9 release (elastic#139674) Add lucene query for wildcards on high cardinality keyword fields. (elastic#139746) Suppress Tika entitlement warnings from AWT (elastic#139711) Check field storage when synthetic source is enabled, in tests (elastic#139715) Refactor VectorSimilarityType to know about its corresponding Function (elastic#139678) Merge fixes from patch branch back into main (elastic#139721) Define native bulk operations for vector square distance (elastic#139198) Use LongUpDownCounter for Linked Project Error Metrics (elastic#139657) ESQL: Add javadoc that explains version-aware planning (elastic#139706) Add helper to pick node for reindex relocation (elastic#139081) Fix auth serialization randomized version test (elastic#139182) ES|QL - Add parsing, preanalysis and analysis timing information to profile (elastic#139540) Mute org.elasticsearch.persistent.ClusterPersistentTasksCustomMetadataTests testMinVersionSerialization elastic#139741 ...

JeremyDahlgren · 2026-01-09T22:41:38Z

💚 All backports created successfully

Status	Branch	Result
✅	9.3

Questions ?

Please refer to the Backport tool documentation

Replaces LongCounter with LongUpDownCounter for the metric used for linked project connection errors. Since the metric may not be incremented for long periods of time it was not appearing in the mappings of the current index of the APM datastream. This results in 'not found' errors when developing dashboards and alerts. The change required adding an initial zero value so the metric will be collected every polling cycle. Relates: ES-12696 (cherry picked from commit b1aa31a)

JeremyDahlgren requested a review from ywangd December 16, 2025 23:11

JeremyDahlgren added >non-issue :Distributed/Network Http and internode communication implementations Team:Distributed Coordination (obsolete) Meta label for Distributed Coordination team. Obsolete. Please do not use. v9.3.0 labels Dec 16, 2025

JeremyDahlgren marked this pull request as ready for review December 16, 2025 23:14

[CI] Auto commit changes from spotless

d835a8c

ywangd reviewed Dec 17, 2025

View reviewed changes

elasticsearchmachine added v9.4.0 and removed v9.3.0 labels Dec 17, 2025

ywangd approved these changes Dec 18, 2025

View reviewed changes

JeremyDahlgren merged commit b1aa31a into elastic:main Dec 18, 2025
35 checks passed

JeremyDahlgren added the v9.3.1 label Jan 9, 2026

JeremyDahlgren mentioned this pull request Jan 9, 2026

[9.3] Use LongUpDownCounter for Linked Project Error Metrics (#139657) #140480

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Use LongUpDownCounter for Linked Project Error Metrics#139657

Use LongUpDownCounter for Linked Project Error Metrics#139657
JeremyDahlgren merged 2 commits intoelastic:mainfrom
JeremyDahlgren:es-12696-refactor-counter-impl-used

JeremyDahlgren commented Dec 16, 2025 •

edited

Loading

Uh oh!

elasticsearchmachine commented Dec 16, 2025

Uh oh!

ywangd left a comment

Uh oh!

ywangd Dec 17, 2025

Uh oh!

JeremyDahlgren Dec 17, 2025

Uh oh!

JeremyDahlgren Dec 17, 2025

Uh oh!

JeremyDahlgren Dec 17, 2025

Uh oh!

ywangd Dec 18, 2025

Uh oh!

ywangd left a comment

Uh oh!

Uh oh!

JeremyDahlgren commented Jan 9, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

		.registerLongUpDownCounter(CONNECTION_ATTEMPT_FAILURES_COUNTER_NAME, "linked project connection attempt failure count", "count")
		.add(0);

Conversation

JeremyDahlgren commented Dec 16, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

elasticsearchmachine commented Dec 16, 2025

Uh oh!

ywangd left a comment

Choose a reason for hiding this comment

Uh oh!

ywangd Dec 17, 2025

Choose a reason for hiding this comment

Uh oh!

JeremyDahlgren Dec 17, 2025

Choose a reason for hiding this comment

Uh oh!

JeremyDahlgren Dec 17, 2025

Choose a reason for hiding this comment

Uh oh!

JeremyDahlgren Dec 17, 2025

Choose a reason for hiding this comment

Uh oh!

ywangd Dec 18, 2025

Choose a reason for hiding this comment

Uh oh!

ywangd left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

JeremyDahlgren commented Jan 9, 2026

💚 All backports created successfully

Questions ?

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

JeremyDahlgren commented Dec 16, 2025 •

edited

Loading