fix: Improve telemetry allocations#2185
Conversation
erezrokah
left a comment
There was a problem hiding this comment.
Maybe we could do this behind an option somehow? The code in https://github.com/cloudquery/cloudquery/blob/983062b2210f6a269743afda4c56c5df9e0cdc77/cli/internal/otel/receiver.go and https://github.com/DataDog/integrations-extras/blob/ecce46b7b5ec5cac1a509714b01da7fc394be083/cloudquery/assets/dashboards/cloudquery_syncs_insights.json relies on the metrics and probably other OTEL consumers.
I'm afraid not, sadly. Having I also do not see the reason why we need to expose the internal client_id implementations as separate metrics, since we already expose them as traces and spans (this PR only touches the metric implementation); we can just as easily update the DD dashboard & CLI to use the new metrics. Flagging this behind an option would just postpone the issue, rather than fix it. 😞 |
Yeah agree, I think the reason is that's how we saved pre-OTEL metrics (per client id) so OTEL code is aligned with that. Can we try setting |
| startTime, err := otel.Meter(OtelName).Int64Counter("sync.table.start_time", | ||
| metric.WithDescription("Start time of syncing a table"), | ||
| metric.WithUnit("ns"), | ||
| duration, err := otel.Meter(ResourceName).Int64Counter(durationMetricName, | ||
| metric.WithDescription("Duration of syncing a table"), | ||
| metric.WithUnit("ms"), | ||
| ) | ||
| if err != nil { | ||
| return nil | ||
| } | ||
|
|
||
| endTime, err := otel.Meter(OtelName).Int64Counter("sync.table.end_time", | ||
| metric.WithDescription("End time of syncing a table"), | ||
| metric.WithUnit("ns"), | ||
| ) | ||
|
|
||
| if err != nil { | ||
| return nil | ||
| } |
There was a problem hiding this comment.
This removes the 2 timestamp metrics, which did not really make sense as metrics since they were not increasing like a counter should (however, aggregations were still ran periodically for them, hence cardinality).
Setting OTEL_GO_X_CARDINALITY_LIMIT is a separate thing, targeted to another repo. Setting that will also have the effect that metrics will simply not be counted, therefore there will be missing data. Implementing this changeset will solve the platform issues as well while not needed to set the custom OTEL flag since cardinality will be lower than that (number of tables). |
Yeah not saying we shouldn't fix the cause, just if the platform doesn't use the metrics, and we can set |
129dd57 to
2123d20
Compare
2123d20 to
457f17a
Compare
scheduler/metrics/metrics.go
Outdated
| m.measurements[table.Name].clients[client.ID()] = &measurement{duration: &durationMeasurement{}} | ||
| } | ||
| for _, relation := range table.Relations { | ||
| m.InitWithClients(invocationID, relation, clients) |
There was a problem hiding this comment.
Only question here. Didn't you want to get rid of per client otel metrics to reduce cardinality even further ?
There was a problem hiding this comment.
That actually makes sense, initially I wanted to build this changeset in a way where we would have been able to detect per-invocation metrics; however, I missed that we have traces that are able to give us that data so I think it's safe to say we don't really need that level of granularity in metrics for now.
There was a problem hiding this comment.
What about client.ID() ? Isn't table.Name enough if we have per client traces ?
There was a problem hiding this comment.
The client_id is not exposed in the OTEL metrics, it's just used for tracking per-client resolver durations.
plugin-sdk/scheduler/metrics/metrics.go
Lines 88 to 90 in 9e30558
erezrokah
left a comment
There was a problem hiding this comment.
Looks good, did we run a sync with this? Wondering how the logs/traces/metrics look like now.
To confirm, the only user facing change is dropping StartTime/EndTime in favor of Duration?
Yes, see the description of the CLI draft PR for before and after metric examples made in a backwards-com.
ClientID will also miss on plugin versions built on top of this. See the above PR. |
…k in older versions
|
OK, this is ready to merge. I added back a static, empty |
…c point granularity (#21048) Should go with cloudquery/plugin-sdk#2185 in order to allow the CLI to read metrics without the client_id metric point. [old CLI, old SDK](https://github.com/user-attachments/files/21236264/before.txt) [old CLI, new SDK](https://github.com/user-attachments/files/21289151/after.txt) [updated CLI, old SDK](https://github.com/user-attachments/files/21289612/after.txt) [updated CLI, new SDK](https://github.com/user-attachments/files/21236263/after.txt) The above showcase that Duration metrics are unavailable if using an old CLI version with newer plugin versions, but that's a downside we have to work with.
🤖 I have created a release *beep* *boop* --- ## [4.87.3](v4.87.2...v4.87.3) (2025-07-17) ### Bug Fixes * **deps:** Update module github.com/aws/aws-sdk-go-v2/service/licensemanager to v1.32.0 ([#2230](#2230)) ([28a1479](28a1479)) * Improve telemetry allocations ([#2185](#2185)) ([b07ce76](b07ce76)) * Upgrade golangci-lint to v2 ([#2228](#2228)) ([7fc238c](7fc238c)) --- This PR was generated with [Release Please](https://github.com/googleapis/release-please). See [documentation](https://github.com/googleapis/release-please#release-please).
This PR: