Skip to content

A96: OTel Metrics for Retries#488

Merged
yashykt merged 14 commits intogrpc:masterfrom
yashykt:OTelRetryMetrics
Jul 1, 2025
Merged

A96: OTel Metrics for Retries#488
yashykt merged 14 commits intogrpc:masterfrom
yashykt:OTelRetryMetrics

Conversation

@yashykt
Copy link
Member

@yashykt yashykt commented Apr 3, 2025

No description provided.

grpc.client.call.retry_delay | Histogram | s | grpc.method (required), grpc.target (required) | Total time of delay while there is no active attempt during the client call.
Metric Name | Type | Unit | Labels | Description
------------------------------------ | --------- | ------------------- | ---------------------------------------------- | -----------
grpc.client.call.retries | Histogram | {retry} | grpc.method (required), grpc.target (required) | Number of retries during the client call. If there were no retries, 0 is not reported. Recommended histogram bucket boundaries are [1,2,3,4,5].
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We might need more than 5 as the upper bound here. When we implement estubs retries, there are modes where there is no limit on the number of retry attempts.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think if we end up changing the limit on retries, we can update the advice here.

Otherwise, if we've already made enough progress on the new changes, please let me know what the better boundaries would be.

Metric Name | Type | Unit | Labels | Description
------------------------------------ | --------- | ------------------- | ---------------------------------------------- | -----------
grpc.client.call.retries | Histogram | {retry} | grpc.method (required), grpc.target (required) | Number of retries during the client call. If there were no retries, 0 is not reported. Recommended histogram bucket boundaries are [1,2,3,4,5].
grpc.client.call.transparent_retries | Histogram | {transparent_retry} | grpc.method (required), grpc.target (required) | Number of transparent retries during the client call. If there were no transparent retries, 0 is not reported. Recommended histogram bucket boundaries are [1,2,3,4,5,10].
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In principle, there is no upper bound on the number of transparent retry attempts if each attempt never actually goes out on the wire.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added a 10 taking that in consideration. I think this should be enough. I'm not sure it makes sense to distinguish between cases where the number if larger than 10. (I'm willing to change the boundaries here if there are strong opinions. These are just recommendations anyway.)

------------------------------------ | --------- | ------------------- | ---------------------------------------------- | -----------
grpc.client.call.retries | Histogram | {retry} | grpc.method (required), grpc.target (required) | Number of retries during the client call. If there were no retries, 0 is not reported. Recommended histogram bucket boundaries are [1,2,3,4,5].
grpc.client.call.transparent_retries | Histogram | {transparent_retry} | grpc.method (required), grpc.target (required) | Number of transparent retries during the client call. If there were no transparent retries, 0 is not reported. Recommended histogram bucket boundaries are [1,2,3,4,5,10].
grpc.client.call.hedges | Histogram | {hedge} | grpc.method (required), grpc.target (required) | Number of hedges during the client call. If there were no hedges, 0 is not reported. Recommended histogram bucket boundaries are [1,2,3,4,5].
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In estubs, the upper bound for the number of hedged requests can be quite large, so I think we'll want to support larger numbers here too.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think if we end up changing the limit on retries, we can update the advice here.

Otherwise, if we've already made enough progress on the new changes, please let me know what the better boundaries would be.

Copy link
Member Author

@yashykt yashykt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we've talked about this before, but I don't understand how expontential histograms can possibly work for aggregation. And in any case, we don't support them in the non-per-call metrics APIs, so I don't see how we can recommend that here.

Exponential Histograms are just a type of histograms. Our APIs do not need anything more to support exponential histograms. The OpenTelemetry SDK user can simply provide the buckets to use when creating the histogram view.

expontential histograms can possibly work for aggregation

Aggregation for histograms, which in most cases is calculating percentiles (50% / 95% / 99 %) is always approximate since we no longer have the exact values.

grpc.client.call.retry_delay | Histogram | s | grpc.method (required), grpc.target (required) | Total time of delay while there is no active attempt during the client call.
Metric Name | Type | Unit | Labels | Description
------------------------------------ | --------- | ------------------- | ---------------------------------------------- | -----------
grpc.client.call.retries | Histogram | {retry} | grpc.method (required), grpc.target (required) | Number of retries during the client call. If there were no retries, 0 is not reported. Recommended histogram bucket boundaries are [1,2,3,4,5].
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think if we end up changing the limit on retries, we can update the advice here.

Otherwise, if we've already made enough progress on the new changes, please let me know what the better boundaries would be.

Metric Name | Type | Unit | Labels | Description
------------------------------------ | --------- | ------------------- | ---------------------------------------------- | -----------
grpc.client.call.retries | Histogram | {retry} | grpc.method (required), grpc.target (required) | Number of retries during the client call. If there were no retries, 0 is not reported. Recommended histogram bucket boundaries are [1,2,3,4,5].
grpc.client.call.transparent_retries | Histogram | {transparent_retry} | grpc.method (required), grpc.target (required) | Number of transparent retries during the client call. If there were no transparent retries, 0 is not reported. Recommended histogram bucket boundaries are [1,2,3,4,5,10].
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added a 10 taking that in consideration. I think this should be enough. I'm not sure it makes sense to distinguish between cases where the number if larger than 10. (I'm willing to change the boundaries here if there are strong opinions. These are just recommendations anyway.)

------------------------------------ | --------- | ------------------- | ---------------------------------------------- | -----------
grpc.client.call.retries | Histogram | {retry} | grpc.method (required), grpc.target (required) | Number of retries during the client call. If there were no retries, 0 is not reported. Recommended histogram bucket boundaries are [1,2,3,4,5].
grpc.client.call.transparent_retries | Histogram | {transparent_retry} | grpc.method (required), grpc.target (required) | Number of transparent retries during the client call. If there were no transparent retries, 0 is not reported. Recommended histogram bucket boundaries are [1,2,3,4,5,10].
grpc.client.call.hedges | Histogram | {hedge} | grpc.method (required), grpc.target (required) | Number of hedges during the client call. If there were no hedges, 0 is not reported. Recommended histogram bucket boundaries are [1,2,3,4,5].
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think if we end up changing the limit on retries, we can update the advice here.

Otherwise, if we've already made enough progress on the new changes, please let me know what the better boundaries would be.

@yashykt yashykt merged commit 67e6ea1 into grpc:master Jul 1, 2025
1 check passed
copybara-service bot pushed a commit to grpc/grpc that referenced this pull request Jul 25, 2025
Implements OTel retry metrics as per grpc/proposal#488

Closes #39195

COPYBARA_INTEGRATE_REVIEW=#39195 from yashykt:AddOTelRetryMetrics b7e8b77
PiperOrigin-RevId: 787268504
asheshvidyut pushed a commit to asheshvidyut/grpc that referenced this pull request Jul 28, 2025
Implements OTel retry metrics as per grpc/proposal#488

Closes grpc#39195

COPYBARA_INTEGRATE_REVIEW=grpc#39195 from yashykt:AddOTelRetryMetrics b7e8b77
PiperOrigin-RevId: 787268504
eugeneo pushed a commit to eugeneo/grpc that referenced this pull request Aug 11, 2025
Implements OTel retry metrics as per grpc/proposal#488

Closes grpc#39195

COPYBARA_INTEGRATE_REVIEW=grpc#39195 from yashykt:AddOTelRetryMetrics b7e8b77
PiperOrigin-RevId: 787268504
paulosjca pushed a commit to paulosjca/grpc that referenced this pull request Aug 23, 2025
Implements OTel retry metrics as per grpc/proposal#488

Closes grpc#39195

COPYBARA_INTEGRATE_REVIEW=grpc#39195 from yashykt:AddOTelRetryMetrics b7e8b77
PiperOrigin-RevId: 787268504
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants