Skip to content

Use LongUpDownCounter for Linked Project Error Metrics#139657

Merged
JeremyDahlgren merged 2 commits intoelastic:mainfrom
JeremyDahlgren:es-12696-refactor-counter-impl-used
Dec 18, 2025
Merged

Use LongUpDownCounter for Linked Project Error Metrics#139657
JeremyDahlgren merged 2 commits intoelastic:mainfrom
JeremyDahlgren:es-12696-refactor-counter-impl-used

Conversation

@JeremyDahlgren
Copy link
Copy Markdown
Contributor

@JeremyDahlgren JeremyDahlgren commented Dec 16, 2025

Replaces LongCounter with LongUpDownCounter for the metric used for linked project connection errors. Since the metric may not be incremented for long periods of time it was not appearing in the mappings of the current index of the APM datastream. This results in 'not found' errors when developing dashboards and alerts. The change required adding an initial zero value so the metric will be collected every polling cycle. Per a request from core-infra the string concatenation has been eliminated from the metric attribute strings for easier code search.

Using the LongUpDownCounter with an initial add(0) at registration time I was able to observe the metric in the polled values:

$ ./gradlew run -Dtests.es.logger.level=ERROR --with-apm-server --apm-metrics="es.projects.linked.connections.error.total"
...
Metricset [runTask-0]: {"timestamp":1765923703735396,"tags":{"otel_instrumentation_scope_name":"elasticsearch"},"samples":{"es.projects.linked.connections.error.total":{"value":0}}}
Metricset [runTask-0]: {"timestamp":1765923713731978,"tags":{"otel_instrumentation_scope_name":"elasticsearch"},"samples":{"es.projects.linked.connections.error.total":{"value":0}}}
Metricset [runTask-0]: {"timestamp":1765923723734827,"tags":{"otel_instrumentation_scope_name":"elasticsearch"},"samples":{"es.projects.linked.connections.error.total":{"value":0}}}
...

I also was able to artificially inject some actual increments, and observed the zero value still being collected when no actual increments were coming through:

Metricset [runTask-0]: {"timestamp":1765923159434143,"tags":{"otel_instrumentation_scope_name":"elasticsearch"},"samples":{"es.projects.linked.connections.error.total":{"value":0}}}
Metricset [runTask-0]: {"timestamp":1765923169438216,"tags":{"otel_instrumentation_scope_name":"elasticsearch","es_linked_project_alias":"DEFAULT","es_linked_project_attempt":"initial","es_linked_project_id":"DEFAULT"},"samples":{"es.projects.linked.connections.error.total":{"value":1}}}
Metricset [runTask-0]: {"timestamp":1765923169438216,"tags":{"otel_instrumentation_scope_name":"elasticsearch"},"samples":{"es.projects.linked.connections.error.total":{"value":0}}}
Metricset [runTask-0]: {"timestamp":1765923179436766,"tags":{"otel_instrumentation_scope_name":"elasticsearch","es_linked_project_alias":"DEFAULT","es_linked_project_attempt":"initial","es_linked_project_id":"DEFAULT"},"samples":{"es.projects.linked.connections.error.total":{"value":1}}}
Metricset [runTask-0]: {"timestamp":1765923179436766,"tags":{"otel_instrumentation_scope_name":"elasticsearch"},"samples":{"es.projects.linked.connections.error.total":{"value":0}}}

This refactoring is necessary since the suggested workaround of using an explicit formula instead of a quick function in the dashboards did not prevent the 'not found' issue.

Relates: ES-12696
Relates: ES-12697

Replaces LongCounter with LongUpDownCounter for the metric
used for linked project connection errors.  Since the metric
may not be incremented for long periods of time it was not
appearing in the mappings of the current index of the APM
datastream.  This results in 'not found' errors when developing
dashboards and alerts.  The change required adding an initial
zero value so the metric will be collected every polling cycle.

Relates: ES-12696
@JeremyDahlgren JeremyDahlgren added >non-issue :Distributed/Network Http and internode communication implementations Team:Distributed Coordination (obsolete) Meta label for Distributed Coordination team. Obsolete. Please do not use. v9.3.0 labels Dec 16, 2025
@JeremyDahlgren JeremyDahlgren marked this pull request as ready for review December 16, 2025 23:14
@elasticsearchmachine
Copy link
Copy Markdown
Collaborator

Pinging @elastic/es-distributed-coordination (Team:Distributed Coordination)

Copy link
Copy Markdown
Member

@ywangd ywangd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I need some help to understand how the change fixes the issue.

Comment on lines +110 to +111
.registerLongUpDownCounter(CONNECTION_ATTEMPT_FAILURES_COUNTER_NAME, "linked project connection attempt failure count", "count")
.add(0);
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IIUC, this triggers the metric collection once at node start time. How does it help with fixing the problem unless it periodically publishes it? If it does not periodically publish the metric, would it also be viable to keep using the LongCounter and increment it by 0 at registration time?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LongCounter incremented by 0 at registration time does not work using the same command I was running in the description. I can dig into the implementations to track down the difference in behavior. The messages I received to use LongUpDownCounter as a workaround did not include an explanation as to why it works differently. It would be good to understand this.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The LongUpDownCounter uses AggregationTemporality.CUMULATIVE while LongCounter uses AggregationTemporality.DELTA. In DefaultSynchronousMetricStorage::collect() the handle's recorded values get reset for DELTA, while for CUMULATIVE the initial entry isn't cleared.

Screenshot 2025-12-17 at 12 27 25 Screenshot 2025-12-17 at 12 48 06

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Honestly this feels like a wasteful workaround. This will continuously publish a value for each unique combination of attributes, until either the max cardinality is reached (1999) or the node is restarted. If I understand it all correctly the limitation is with Kibana requiring the metric to be present in the mappings of the APM datastreams. It seems like you should be able to build a dashboard by defining the metric and attribute labels. You could also define their types. Yes the UI can help by automatically offering existing metrics and labels that it finds, but it feels like a limitation to raise an error if it isn't present and prevent you from moving forward. Why can't it just display empty visualizations? The owner of the dashboard is going to investigate and look at the source data and see misconfigurations and fix them. Same when preparing alerts. Or maybe provide an upload with test data so you can verify with it?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the investigation. I agree this isn't ideal. Short of other actionable recommendations, I am ok with proceeding with it for the time being.

Copy link
Copy Markdown
Member

@ywangd ywangd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@JeremyDahlgren JeremyDahlgren merged commit b1aa31a into elastic:main Dec 18, 2025
35 checks passed
szybia added a commit to szybia/elasticsearch that referenced this pull request Dec 19, 2025
* upstream/main: (253 commits)
  Adds ST_SIMPLIFY geo spatial function (elastic#136309)
  Take control of max clause count verification in Lucene searcher (elastic#139752)
  [ML] Unmute Inference Test (elastic#139765)
  Parameterize the vector operation benchmark tests (elastic#139735)
  Fix node reduction pushdown tests for release tests (elastic#139548)
  Fix flakiness in TSDataGenerationHelper (elastic#139759)
  CPS: Copy existing resolved index expressions when constructing a new `SearchRequest` from an existing one (elastic#139596)
  Add release notes for v9.1.9 release (elastic#139674)
  Add lucene query for wildcards on high cardinality keyword fields. (elastic#139746)
  Suppress Tika entitlement warnings from AWT (elastic#139711)
  Check field storage when synthetic source is enabled, in tests (elastic#139715)
  Refactor VectorSimilarityType to know about its corresponding Function (elastic#139678)
  Merge fixes from patch branch back into main (elastic#139721)
  Define native bulk operations for vector square distance (elastic#139198)
  Use LongUpDownCounter for Linked Project Error Metrics (elastic#139657)
  ESQL: Add javadoc that explains version-aware planning (elastic#139706)
  Add helper to pick node for reindex relocation (elastic#139081)
  Fix auth serialization randomized version test (elastic#139182)
  ES|QL - Add parsing, preanalysis and analysis timing information to profile (elastic#139540)
  Mute org.elasticsearch.persistent.ClusterPersistentTasksCustomMetadataTests testMinVersionSerialization elastic#139741
  ...
@JeremyDahlgren
Copy link
Copy Markdown
Contributor Author

💚 All backports created successfully

Status Branch Result
9.3

Questions ?

Please refer to the Backport tool documentation

JeremyDahlgren added a commit to JeremyDahlgren/elasticsearch that referenced this pull request Jan 9, 2026
Replaces LongCounter with LongUpDownCounter for the metric
used for linked project connection errors.  Since the metric
may not be incremented for long periods of time it was not
appearing in the mappings of the current index of the APM
datastream.  This results in 'not found' errors when developing
dashboards and alerts.  The change required adding an initial
zero value so the metric will be collected every polling cycle.

Relates: ES-12696
(cherry picked from commit b1aa31a)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

:Distributed/Network Http and internode communication implementations >non-issue Team:Distributed Coordination (obsolete) Meta label for Distributed Coordination team. Obsolete. Please do not use. v9.3.1 v9.4.0

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants