Skip to content

fix: Improve telemetry allocations#2185

Merged
kodiakhq[bot] merged 11 commits intomainfrom
fix/improve-telemetry
Jul 17, 2025
Merged

fix: Improve telemetry allocations#2185
kodiakhq[bot] merged 11 commits intomainfrom
fix/improve-telemetry

Conversation

@murarustefaan
Copy link
Copy Markdown
Member

@murarustefaan murarustefaan commented Jun 6, 2025

This PR:

  • moves OTEL counter initialization to a single place
  • removes startTime and endTime non-counter metrics
  • introduces duration metric
  • removes client-level cardinality from the metrics, preventing memory accumulation

@github-actions github-actions bot added the fix label Jun 6, 2025
Copy link
Copy Markdown
Member

@erezrokah erezrokah left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@murarustefaan
Copy link
Copy Markdown
Member Author

murarustefaan commented Jun 9, 2025

Maybe we could do this behind an option somehow?

I'm afraid not, sadly. Having client_id metric granularity means that for plugins like AWS, a single AWS account generates a metric cardinality of around 19-20k due to our multiplexing. This is totally against what opentelemetry recommends, a maximum metric cardinality of 2k.

I also do not see the reason why we need to expose the internal client_id implementations as separate metrics, since we already expose them as traces and spans (this PR only touches the metric implementation); we can just as easily update the DD dashboard & CLI to use the new metrics.

Flagging this behind an option would just postpone the issue, rather than fix it. 😞
I'm open to other suggestions, but realistically from pprof-ing my way trough this, reducing the metric cardinality is the only thing that will fix the memory usage (both in terms of in-memory metric storage as well as aggregation call allocations in the OTEL SDK).

See this related OTEL issue.

@erezrokah
Copy link
Copy Markdown
Member

I also do not see the reason why we need to expose the internal client_id implementations as separate metrics

Yeah agree, I think the reason is that's how we saved pre-OTEL metrics (per client id) so OTEL code is aligned with that.

Can we try setting OTEL_GO_X_CARDINALITY_LIMIT per the suggestion in the issue? At least when running the platform

Comment on lines -113 to -128
startTime, err := otel.Meter(OtelName).Int64Counter("sync.table.start_time",
metric.WithDescription("Start time of syncing a table"),
metric.WithUnit("ns"),
duration, err := otel.Meter(ResourceName).Int64Counter(durationMetricName,
metric.WithDescription("Duration of syncing a table"),
metric.WithUnit("ms"),
)
if err != nil {
return nil
}

endTime, err := otel.Meter(OtelName).Int64Counter("sync.table.end_time",
metric.WithDescription("End time of syncing a table"),
metric.WithUnit("ns"),
)

if err != nil {
return nil
}
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This removes the 2 timestamp metrics, which did not really make sense as metrics since they were not increasing like a counter should (however, aggregations were still ran periodically for them, hence cardinality).

@murarustefaan
Copy link
Copy Markdown
Member Author

Can we try setting OTEL_GO_X_CARDINALITY_LIMIT per the suggestion in the issue? At least when running the platform

Setting OTEL_GO_X_CARDINALITY_LIMIT is a separate thing, targeted to another repo.

Setting that will also have the effect that metrics will simply not be counted, therefore there will be missing data.

Implementing this changeset will solve the platform issues as well while not needed to set the custom OTEL flag since cardinality will be lower than that (number of tables).

@github-actions github-actions bot added fix and removed fix labels Jun 9, 2025
@erezrokah
Copy link
Copy Markdown
Member

Setting that will also have the effect that metrics will simply not be counted, therefore there will be missing data.

Yeah not saying we shouldn't fix the cause, just if the platform doesn't use the metrics, and we can set OTEL_GO_X_CARDINALITY_LIMIT to solve the memory issues that seems like an easy fix

@murarustefaan murarustefaan marked this pull request as ready for review June 9, 2025 12:05
@murarustefaan murarustefaan requested review from a team and jon-s58 June 9, 2025 12:05
@murarustefaan murarustefaan force-pushed the fix/improve-telemetry branch from 129dd57 to 2123d20 Compare July 10, 2025 10:29
@murarustefaan murarustefaan force-pushed the fix/improve-telemetry branch from 2123d20 to 457f17a Compare July 10, 2025 11:31
Copy link
Copy Markdown

@przste-go przste-go left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

m.measurements[table.Name].clients[client.ID()] = &measurement{duration: &durationMeasurement{}}
}
for _, relation := range table.Relations {
m.InitWithClients(invocationID, relation, clients)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Only question here. Didn't you want to get rid of per client otel metrics to reduce cardinality even further ?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That actually makes sense, initially I wanted to build this changeset in a way where we would have been able to detect per-invocation metrics; however, I missed that we have traces that are able to give us that data so I think it's safe to say we don't really need that level of granularity in metrics for now.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about client.ID() ? Isn't table.Name enough if we have per client traces ?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The client_id is not exposed in the OTEL metrics, it's just used for tracking per-client resolver durations.

Set: attribute.NewSet(
attribute.Key("sync.table.name").String(tableName),
),

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

right

Copy link
Copy Markdown
Member

@erezrokah erezrokah left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, did we run a sync with this? Wondering how the logs/traces/metrics look like now.

To confirm, the only user facing change is dropping StartTime/EndTime in favor of Duration?

@murarustefaan
Copy link
Copy Markdown
Member Author

Looks good, did we run a sync with this? Wondering how the logs/traces/metrics look like now.

Yes, see the description of the CLI draft PR for before and after metric examples made in a backwards-com.
cloudquery/cloudquery#21048

To confirm, the only user facing change is dropping StartTime/EndTime in favor of Duration?

ClientID will also miss on plugin versions built on top of this. See the above PR.

@murarustefaan
Copy link
Copy Markdown
Member Author

OK, this is ready to merge.

I added back a static, empty client.id metric point so that previous CLI versions do not break when used with --table-metrics-location (they were trying to cast this to string without error checking and therefore it was panicking).

@kodiakhq kodiakhq bot merged commit b07ce76 into main Jul 17, 2025
10 of 11 checks passed
@kodiakhq kodiakhq bot deleted the fix/improve-telemetry branch July 17, 2025 10:26
kodiakhq bot pushed a commit to cloudquery/cloudquery that referenced this pull request Jul 17, 2025
…c point granularity (#21048)

Should go with cloudquery/plugin-sdk#2185 in order to allow the CLI to read metrics without the client_id metric point.

[old CLI, old SDK](https://github.com/user-attachments/files/21236264/before.txt)
[old CLI, new SDK](https://github.com/user-attachments/files/21289151/after.txt)
[updated CLI, old SDK](https://github.com/user-attachments/files/21289612/after.txt)
[updated CLI, new SDK](https://github.com/user-attachments/files/21236263/after.txt)

The above showcase that Duration metrics are unavailable if using an old CLI version with newer plugin versions, but that's a downside we have to work with.
kodiakhq bot pushed a commit that referenced this pull request Jul 18, 2025
🤖 I have created a release *beep* *boop*
---


## [4.87.3](v4.87.2...v4.87.3) (2025-07-17)


### Bug Fixes

* **deps:** Update module github.com/aws/aws-sdk-go-v2/service/licensemanager to v1.32.0 ([#2230](#2230)) ([28a1479](28a1479))
* Improve telemetry allocations ([#2185](#2185)) ([b07ce76](b07ce76))
* Upgrade golangci-lint to v2 ([#2228](#2228)) ([7fc238c](7fc238c))

---
This PR was generated with [Release Please](https://github.com/googleapis/release-please). See [documentation](https://github.com/googleapis/release-please#release-please).
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants