Skip to content

Engine: Allow error response code to be customized#16257

Merged
bboreham merged 30 commits intoprometheus:mainfrom
justinjung04:error-code
Aug 19, 2025
Merged

Engine: Allow error response code to be customized#16257
bboreham merged 30 commits intoprometheus:mainfrom
justinjung04:error-code

Conversation

@justinjung04
Copy link
Contributor

@justinjung04 justinjung04 commented Mar 21, 2025

Description

Currently the API always returns http code 422 for engine execution error, and

This PR allows the error code to be overriden, based on the ErrorType and the error itself.

Before the PR:

Body: {"status":"error","data":"test","errorType":"execution","error":"rpc error: code = Code(429) desc = too many requests, please retry later"}
Status Code: 422

After the PR:

// this is a bad example of override, but just showing how it works
errorTypeToStatusCode := func(errNum errorNum, err error) (code int, override bool) {
    if errNum == ErrorExec {
        if strings.Contains(err.Error(), "too many requests") {
            return http.StatusTooManyRequests, true
        }
        return http.StatusUnprocessableEntity, true
    }

    return 0, false
}
api := v1.NewAPI(..., errorTypeToStatusCode)

---

Body: {"status":"error","data":"test","errorType":"execution","error":"rpc error: code = Code(429) desc = too many requests, please retry later"}
Status Code: 429

Testing

Added unit tests

Which issue(s) does the PR fix:

N/A

Does this PR introduce a user-facing change?

NONE

Signed-off-by: Justin Jung <jungjust@amazon.com>
Signed-off-by: Justin Jung <jungjust@amazon.com>
Signed-off-by: Justin Jung <jungjust@amazon.com>
Signed-off-by: Justin Jung <jungjust@amazon.com>
@justinjung04 justinjung04 marked this pull request as ready for review March 21, 2025 21:12
@justinjung04 justinjung04 changed the title Error code Error response code improvement from engine failures Mar 24, 2025
@justinjung04 justinjung04 changed the title Error response code improvement from engine failures Improve error response code from engine failures Mar 24, 2025
@machine424
Copy link
Member

Thanks for this.
I don't think this is meant for Prometheus itself, which means this will always result in unnecessary traversals of the errorExec chains.
For library use, we need to consider a different approach that doesn't penalize Prometheus.

@justinjung04
Copy link
Contributor Author

Thank you @machine424 for taking a look.
To avoid the traversals, would you be more willing to have a struct that contains error code and error chain (and make it a standard way of returning error from downstream), such that Prometheus could just read the error code directly?

Signed-off-by: Justin Jung <jungjust@amazon.com>
@justinjung04
Copy link
Contributor Author

justinjung04 commented Apr 1, 2025

Added new struct

type errorWithStatusCode struct {
    statusCode int
    err        error
}

and downstream components (prometheus dependencies) can do the following to specify HTTP error code on engine execution error:

if errWithCode, ok := grpcutil.ErrorWithHTTPStatusCode(code, err); ok {
    return errWithCode
}
return err

justinjung04 and others added 2 commits April 1, 2025 12:05
Signed-off-by: Justin Jung <jungjust@amazon.com>
Signed-off-by: Justin Jung <justinjung04@gmail.com>
@machine424
Copy link
Member

Thanks, but I still believe that util/grpcutil doesn't belong in the Prometheus codebase.
Instead, we could allow customization of the apiError -> status code mapping, assuming other maintainers agree to open that door.

I hadn't shared my opinion earlier, but as an end user I’d be confused e.g. if an API occasionally returned a 429 status, because the API itself is rate limited by an underlying service. Could you share more about your use case?

@justinjung04
Copy link
Contributor Author

justinjung04 commented Apr 2, 2025

I'm a Cortex user, and I'd like to implement different throttling mechanisms on queryables where I want to return error code other than 422.

For example, if the data layer is hammered, I want to throttle user with 429 instead of 422, encouraging user to retry the queries at a slower pace, rather than implying the individual queries are beyond certain limits and can never succeed.

I don't believe full customization of all different apiErr.typ is necessary, but more looking for a way to override error code for errorExec type.

@justinjung04
Copy link
Contributor Author

I believe another option is to add a new field in the API struct that contains an error code override function, but only applicable to errorExec type. Please let me know what you would prefer!

@machine424
Copy link
Member

Thanks for sharing the details.
It's about making tradeoffs with such changes, we want the code to be customizable with less overhead on your end, but we also want to keep the code on our end simple and readable.
we could try replacing the switch with a lookup from map[errorType]func(errorType)int for which we can allow dependents to override entries, but we should check that the overhead is acceptable, if not we'd need to go for a default switch that you can override as a whole...

Signed-off-by: Justin Jung <jungjust@amazon.com>
Signed-off-by: Justin Jung <jungjust@amazon.com>
Signed-off-by: Justin Jung <jungjust@amazon.com>
@justinjung04 justinjung04 changed the title Improve error response code from engine failures Allow error response code to be customized Apr 15, 2025
Signed-off-by: Justin Jung <jungjust@amazon.com>
@justinjung04
Copy link
Contributor Author

justinjung04 commented Apr 15, 2025

  • Created ErrorTypeToStatusCode which can be passed to v1.NewAPI to override the status code based on error type and the error itself. For example,
errorTypeToStatusCode := ErrorTypeToStatusCode{
    ErrorExec: func(err error) int {
        if strings.Contains(err.Error(), "too many requests") {
            return http.StatusTooManyRequests
        }
        return http.StatusUnprocessableEntity
    }
}
api := v1.NewAPI(..., errorTypeToStatusCode)
  • Exposed all errorType
  • Added unit tests
  • Added changelog

Signed-off-by: Justin Jung <justinjung04@gmail.com>
@justinjung04
Copy link
Contributor Author

@machine424 Updates have been made and conflicts have been resolved. Could you take another look?

Copy link
Member

@machine424 machine424 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks, looks better to me, just some suggestions.

@machine424
Copy link
Member

cc @bboreham as a "prometheus as a library" user, FYI

@bboreham
Copy link
Member

If you're going to change this, why go via strings at all? Isn't it logically more like an enum?

@justinjung04
Copy link
Contributor Author

justinjung04 commented May 28, 2025

@bboreham Could you elaborate? The error type enums haven't been changed, and I just exposed them so that code handler could be customized.

Here's an examples uage:

errorTypeToStatusCode := ErrorTypeToStatusCode{
    ErrorExec: func(err error) int {
        if strings.Contains(err.Error(), "too many requests") {
            return http.StatusTooManyRequests
        }
        return http.StatusUnprocessableEntity
    }
}
api := v1.NewAPI(..., errorTypeToStatusCode)

where ErrorExec is an enum exported from Prometheus. New tests have been added inTestResponseError that tests the override.

Signed-off-by: Justin Jung <jungjust@amazon.com>
@bboreham
Copy link
Member

I mean ErrorType would become an int type so we're not hashing string values every time we use it.

@justinjung04
Copy link
Contributor Author

Oh I see, I believe we still need the string consts since they're used for creating apiError:

type apiError struct {
	typ errorType
	err error
}

func (e *apiError) Error() string {
	return fmt.Sprintf("%s: %s", e.typ, e.err)
}

To avoid string hashing, I can create another set of enum (const with int type), and add that to the apiError struct like this?

// existing string consts
type errorTypeStr string
const (
	errorTimeoutStr       errorTypeStr = "timeout"
	errorCanceledStr      errorTypeStr = "canceled"
	...
)

// new enum exported
type errorType int
const (
	ErrorTimeout       errorType = iota
	ErrorCanceled
	...
)

// update apiError struct to include both error string and type
type apiError struct {
	typ errorType
	typStr errorTypeStr
	err error
}

...

if handler, ok := api.errorTypeToStatusCode[apiErr.typ]; ok {
	code = handler(apiErr.err)
} else {
	switch apiErr.typStr {
	case errorTimeoutStr:
        ...
	}
}

Please let me know if this alternative looks good to you

Signed-off-by: Justin Jung <jungjust@amazon.com>
Signed-off-by: Justin Jung <jungjust@amazon.com>
@justinjung04
Copy link
Contributor Author

Updated errorType to a struct that contains both int and string. The int values for all error types are exported, and ErrorTypeToStatusCode uses the exported error type int as the map key.

@justinjung04
Copy link
Contributor Author

@machine424 @bboreham Could you take another look?

Copy link
Member

@bboreham bboreham left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks; I guess we can live with this but it still seems a little over-complicated.

Do we need both tests? They seem to be doing about the same thing for the new functionality.

justinjung04 and others added 6 commits July 2, 2025 09:49
Signed-off-by: Justin Jung <jungjust@amazon.com>
Signed-off-by: Justin Jung <jungjust@amazon.com>
Signed-off-by: Justin Jung <jungjust@amazon.com>
Signed-off-by: Justin Jung <jungjust@amazon.com>
Signed-off-by: Justin Jung <justinjung04@gmail.com>
Signed-off-by: Justin Jung <jungjust@amazon.com>
@justinjung04
Copy link
Contributor Author

@machine424 @bboreham could I get final reviews?

justinjung04 and others added 2 commits August 18, 2025 11:11
Signed-off-by: Justin Jung <justinjung04@gmail.com>
Signed-off-by: Justin Jung <jungjust@amazon.com>
@justinjung04 justinjung04 changed the title Allow error response code to be customized Engine: Allow error response code to be customized Aug 19, 2025
@bboreham bboreham merged commit 0f98dcb into prometheus:main Aug 19, 2025
28 of 30 checks passed
perebaj pushed a commit to perebaj/prometheus that referenced this pull request Aug 22, 2025
Currently the API always returns http code 422 for engine execution error, and

This PR allows the error code to be overriden, based on the ErrorType and the error itself.

Signed-off-by: Justin Jung <jungjust@amazon.com>
Signed-off-by: Justin Jung <justinjung04@gmail.com>
Co-authored-by: Ayoub Mrini <ayoubmrini424@gmail.com>
Signed-off-by: perebaj <perebaj@gmail.com>
dmitryax added a commit to open-telemetry/opentelemetry-collector-contrib that referenced this pull request Oct 31, 2025
Both k8s and prometheus libraries have to be updated because they depend
on each other and both have breaking changes.

The following files changed to adopt to the API breaking changes:
- processor/k8sattributesprocessor/internal/kube/fake_informer.go:
  - kubernetes/kubernetes#126387
- receiver/prometheusreceiver/metrics_receiver.go:
  - prometheus/prometheus#16257
- receiver/prometheusremotewritereceiver/receiver.go:
  - prometheus/prometheus#17160

I haven't touched any usages of deprecated k8s API to keep the changeset
small. Created a separate issue for that instead
#43891

Some prometheus tests are failing due to behavioral change in prometheus
library. I skipped them for now and reported
#43893
jelly-afk pushed a commit to jelly-afk/opentelemetry-collector-contrib that referenced this pull request Nov 6, 2025
Both k8s and prometheus libraries have to be updated because they depend
on each other and both have breaking changes.

The following files changed to adopt to the API breaking changes:
- processor/k8sattributesprocessor/internal/kube/fake_informer.go:
  - kubernetes/kubernetes#126387
- receiver/prometheusreceiver/metrics_receiver.go:
  - prometheus/prometheus#16257
- receiver/prometheusremotewritereceiver/receiver.go:
  - prometheus/prometheus#17160

I haven't touched any usages of deprecated k8s API to keep the changeset
small. Created a separate issue for that instead
open-telemetry#43891

Some prometheus tests are failing due to behavioral change in prometheus
library. I skipped them for now and reported
open-telemetry#43893
dyl10s pushed a commit to dyl10s/opentelemetry-collector-contrib that referenced this pull request Nov 21, 2025
Both k8s and prometheus libraries have to be updated because they depend
on each other and both have breaking changes.

The following files changed to adopt to the API breaking changes:
- processor/k8sattributesprocessor/internal/kube/fake_informer.go:
  - kubernetes/kubernetes#126387
- receiver/prometheusreceiver/metrics_receiver.go:
  - prometheus/prometheus#16257
- receiver/prometheusremotewritereceiver/receiver.go:
  - prometheus/prometheus#17160

I haven't touched any usages of deprecated k8s API to keep the changeset
small. Created a separate issue for that instead
open-telemetry#43891

Some prometheus tests are failing due to behavioral change in prometheus
library. I skipped them for now and reported
open-telemetry#43893
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants