
Rails standardized error reporting interface #43625

Merged
byroot merged 1 commit into rails:main from Shopify:error-reporting-api
Nov 19, 2021

Conversation

@casperisfine
Contributor

@casperisfine casperisfine commented Nov 10, 2021

Fix: #43472

The reporter is held by the executor, but the `Rails` module provides a
nicer `Rails.error` shortcut.

For ease of use, two block-based specialized methods are exposed.

`handle`, which swallows errors and forwards them to the subscribers:

```ruby
Rails.error.handle do
  1 + '1' # raises TypeError
end
1 + 1 # This will be executed
```

`record`, which forwards the errors to the subscribers but lets them
continue rewinding the call stack:

```ruby
Rails.error.record do
  1 + '1' # raises TypeError
end
1 + 1 # This won't be executed.
```

For cases where the block-based API isn't suitable, the lower-level
`report` method can be used:

```ruby
Rails.error.report(error, handled: true / false)
```

Interface

In this PR I only introduce `Rails.error`. However, in follow-ups I'd like to introduce "local" error reporters so that gems like Active Support / Active Record can report errors as well. It would work a bit like `Rails.logger` today.

e.g. ActiveSupport.error would default to an error reporter that just logs the error, but in the Railtie we'd replace it with Rails.error.
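As a sketch of that "local reporter" idea, a library could ship a reporter that only logs, and let a Railtie swap in a richer one later. The names here (`MyLibrary`, `LoggingErrorReporter`) are illustrative, not the actual Rails implementation:

```ruby
require "logger"

# Hypothetical library-local reporter: defaults to plain logging until
# a framework replaces it (e.g. `MyLibrary.error = Rails.error` in a Railtie).
module MyLibrary
  class LoggingErrorReporter
    def initialize(logger = Logger.new($stdout))
      @logger = logger
    end

    def report(error, handled:)
      @logger.error("#{error.class}: #{error.message} (handled: #{handled})")
    end
  end

  class << self
    attr_writer :error

    def error
      @error ||= LoggingErrorReporter.new
    end
  end
end
```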

@fschwahn
Contributor

These error reporting services usually allow for different error levels (similar to log levels). Is there a plan to handle this as part of this API as well?

This is mainly relevant when using Rails.error.report(error) directly.

@casperisfine
Contributor Author

Is there a plan to handle this as part of this API as well?

Yeah, I mentioned it in my original issue, but since then I'm a bit on the fence. Some do have this and it's a useful feature, but they don't all have the same list of levels, etc. So I figure that would be something passed as part of the context, e.g.:

Rails.error.report(error, handled: true, context: { severity: "warning" })

@fschwahn
Contributor

So I figure that would be something that would be passed as part of the context, e.g.:

Ah, got it. But that would require the error reporting services to "agree" on a similar API for severity on their own; otherwise moving between services would still be difficult.

Thinking about it, maybe it would make sense after all to add severity explicitly, because it'd be easy enough for a service to ignore (eg. report everything as error, regardless of severity).

@casperisfine
Contributor Author

would require the error reporting services to "agree" on a similar API for severity on their own,

Yes, but even if we had a severity parameter, they'd still need to agree on what the possible values are. In the end, I think a service supporting a severity field can look at the handled parameter and default to something like error if it's not handled, and warning if it is.

otherwise moving between services would still be difficult.

Note that ease of migration isn't one of my stated goals. Pretty much every service I looked at had one or several unique features. My goal isn't to abstract all providers, but to define an (extensible) common denominator, so that Rails and Rails-aware libraries can report errors too.

Now maybe it is important for Rails and Rails-aware libraries to be able to set a severity level, like when they log, which is why I'm a bit on the fence. I'll think more about it.
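The defaulting described above (severity inferred from the `handled:` flag when the context doesn't specify one) could look like this sketch, where `SeveritySubscriber` and the `client` object are hypothetical, not part of any real SDK:

```ruby
# Illustrative subscriber: derive a severity from `handled:` unless the
# caller passed one explicitly in the context.
class SeveritySubscriber
  def initialize(client)
    @client = client
  end

  def report(error, handled:, context: {})
    severity = context.fetch(:severity) { handled ? :warning : :error }
    @client.notify(error, severity: severity)
  end
end
```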

@fschwahn
Contributor

Note that ease of migration isn't one of my stated goals. Pretty much every service I looked at had one or several unique features. My goal isn't to abstract all providers, but to define an (extensible) common denominator, so that Rails and Rails-aware libraries can report errors too.

Totally, I get this. I just had the feeling that severity is still part of that common denominator, but that's a judgement call for sure.

Now maybe it is important for Rails and Rails-aware libraries to be able to set a severity level, like when they log, which is why I'm a bit on the fence. I'll think more about it.

My use case for this is gems. What we are currently doing in gems where we want to report something is use ActiveSupport::Notifications, subscribe to those notifications in the main_app, and report them to our service. It works, but is a bit clunky.

@casperisfine
Contributor Author

Yeah, makes sense. I think I'll add a severity parameter, but I'll have to restrict it to a set of specific values. Probably `:error`, `:warning`, `:info`.

Contributor

If one subscriber raises an error during reporting, will it interrupt the subscribers after it?

Contributor

I think the subscriber.report call should itself be wrapped in a rescue that falls back to the logs if the subscriber throws an error. The subscriber shouldn't, but we can't rely on them not to.
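A sketch of that guarded dispatch: each subscriber's `report` call gets its own rescue, so one raising subscriber can't interrupt the ones after it. `GuardedReporter` is an illustrative name, not the actual Rails class:

```ruby
require "logger"

# Illustrative reporter: fan out to every subscriber, and fall back to
# the logs (rather than losing the remaining subscribers) if one raises.
class GuardedReporter
  def initialize(logger = Logger.new($stdout))
    @subscribers = []
    @logger = logger
  end

  def subscribe(subscriber)
    @subscribers << subscriber
  end

  def report(error, handled:)
    @subscribers.each do |subscriber|
      subscriber.report(error, handled: handled)
    rescue => e
      @logger.fatal("Error subscriber #{subscriber.class} raised: #{e.class}: #{e.message}")
    end
  end
end
```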

Contributor Author

Yeah, subscribers are expected to never raise, but we definitely shouldn't blindly trust them.

Contributor Author

Hum, so the difficulty here is to have access to a logger. I'll have to dig more into this.

Contributor Author

If we're in dev mode, we might as well let the subscriber error raise.

Contributor Author

I chose the fatal level. Normally it's for errors that crash the program, which is not the case here, but I feel it's critical enough to warrant it anyway.

Fix: rails#43472

The reporter is held by the executor, but the `Rails` module provides a
nicer `Rails.error` shortcut.

For ease of use, two block-based specialized methods are exposed.

`handle`, which swallows errors and forwards them to the subscribers:

```ruby
Rails.error.handle do
  1 + '1' # raises TypeError
end
1 + 1 # This will be executed
```

`record`, which forwards the errors to the subscribers but lets them
continue rewinding the call stack:

```ruby
Rails.error.record do
  1 + '1' # raises TypeError
end
1 + 1 # This won't be executed.
```

For cases where the block-based API isn't suitable, the lower-level
`report` method can be used:

```ruby
Rails.error.report(error, handled: true / false)
```
Member

@tenderlove tenderlove left a comment


This looks good. I guess my only concern is that we have to find every place inside Rails where we want to report exceptions but we swallow them. I think that's probably fine though?

This might be a bad idea, but have we thought about using TracePoint for intercepting exceptions? Then we don't need to wrap anything. I have no idea what the perf implications would be though.

@casperisfine
Contributor Author

I guess my only concern is that we have to find every place inside Rails where we want to report exceptions but we swallow them. I think that's probably fine though?

Yes, it's fine; there aren't that many of them, and even if we miss one or two, it's not the end of the world.

This might be a bad idea, but have we thought about using TracePoint for intercepting exceptions?

That would trigger on a lot of exceptions that are perfectly normal and not worth reporting. Also, TracePoint is global, not per thread / fiber.

@casperisfine
Contributor Author

Something I noticed today is that Active Job doesn't call Executor#wrap but expects the underlying job system to do it. Sidekiq does call app.reloader.wrap which delegates to app.executor.wrap, so for that one at least we're good.

But I doubt all queue systems out there do it. So I'm tempted to add it in Active Job itself just to be safe. Worst case, wrap is a no-op if called recursively.

@ghiculescu
Member

Something I noticed today is that Active Job doesn't call Executor#wrap but expects the underlying job system to do it.

Hmm, isn't that what `app.reloader.wrap do` does?

@casperisfine
Contributor Author

Ah good catch. We're good then.

@byroot
Member

byroot commented Nov 19, 2021

Ok, let's do this.

@byroot byroot merged commit 5a98545 into rails:main Nov 19, 2021
@morgoth
Member

morgoth commented Nov 19, 2021

is @byroot doing reviews of @casperisfine pull requests? ;-)

@casperisfine
Contributor Author

One is my professional account, the other my personal. Only one has commit rights.

@casperisfine
Contributor Author

@st0012, no worries, thanks a lot for this feedback, I'll read through these issues.

@Flixt
Contributor

Flixt commented Mar 21, 2022

Hey, coming from getsentry/sentry-ruby#1680. Not even sure if this is the right place to put this, anyway:

I wonder if it would be considered valuable by the rest of you if there were a thing like `ActionMailer::Base.deliveries` for reported errors. Think `ActiveSupport::ErrorReporter.reported_errors`, just like `ActionMailer::Base.deliveries` for tests. Very interested in your overall opinion here. Is that even something Rails should care about? Do you consider testing reported errors useful?
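A hypothetical sketch of the test helper being proposed: a subscriber that records every reported error, analogous to `ActionMailer::Base.deliveries`. The class name is illustrative, not an actual Rails API:

```ruby
# Illustrative test subscriber: collects every report so a test can
# assert on what was reported (and swallowed) by the code under test.
class TestErrorSubscriber
  attr_reader :reported_errors

  def initialize
    @reported_errors = []
  end

  def report(error, handled:, context: {})
    @reported_errors << { error: error, handled: handled, context: context }
  end
end
```

In a test you would register this subscriber, exercise code that reports errors, and then assert on `reported_errors`.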

@casperisfine
Contributor Author

I think it would make sense, yes; since `handle` does "swallow" errors, it's important to be able to test that they were reported.

@st0012
Contributor

st0012 commented Apr 4, 2022

@casperisfine Just had this idea: can Rails give hints to the subscriber about where the reporting came from?

Like in this part:

```ruby
rescue => error
  @executor.error_reporter.report(error, handled: false)
  raise
ensure
```

It can be:

```ruby
@executor.error_reporter.report(error, handled: false, framework: true)
# or if we want to standardize this source information, we can have
@executor.error_reporter.report(error, handled: false, source: :framework)

# other sources could be
@executor.error_reporter.report(error, handled: false, source: :app) # this will be the default
@executor.error_reporter.report(error, handled: false, source: :library) # for adapters like MemCacheStore
```

Then, subscribers can use that information to determine if that's something they want to report.

In Sentry's case, we'll be able to opt-in the subscriber by default but skip non-app sources.
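The opt-out being described could be sketched as a subscriber that drops reports from sources it doesn't care about. The class name and filtering policy are illustrative, not an actual Sentry SDK API:

```ruby
# Illustrative subscriber: forward only errors whose `source` is not
# on an ignore list (defaulting the source to :app, as suggested above).
class SourceFilteringSubscriber
  def initialize(client, ignored_sources: [])
    @client = client
    @ignored_sources = ignored_sources
  end

  def report(error, handled:, source: :app)
    return if @ignored_sources.include?(source)

    @client.notify(error, handled: handled)
  end
end
```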

@casperisfine
Contributor Author

can Rails give hints to the subscriber about where the reporting came from?

It's a possibility, and could be useful, but I don't think that's the proper solution to your users' problem.

If I understood correctly, what happened to them is that one single exception source made them reach their quota very fast. This could have happened with any other regular error in their app; I don't think that justifies hiding a specific error.

With our custom error reporter, what we do to avoid this is throttle error sources, e.g. if we see the same error more than X times in Y minutes, we stop reporting it for Z minutes. That's a feature many error reporters have, and it would make sense to implement in Rails.
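A rough sketch of that throttling: once the same error class has been reported `limit` times within a sliding `interval`, further reports are dropped until the window moves on. This is a simplified illustration, not Rails or Shopify code:

```ruby
# Illustrative throttling wrapper around another subscriber: drop a
# report when its error class has already been seen `limit` times in
# the last `interval` seconds.
class ThrottledSubscriber
  def initialize(subscriber, limit: 10, interval: 60)
    @subscriber = subscriber
    @limit = limit
    @interval = interval
    @timestamps = Hash.new { |hash, key| hash[key] = [] }
  end

  def report(error, handled:)
    now = Time.now.to_f
    recent = @timestamps[error.class]
    recent.reject! { |t| t < now - @interval } # slide the window
    return if recent.size >= @limit

    recent << now
    @subscriber.report(error, handled: handled)
  end
end
```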

And for users that really don't care about the CacheStore errors, I believe that the better solution was suggested in the upstream thread, we should report it from the default error_handler, so that you can entirely disable it.

@st0012
Contributor

st0012 commented Apr 17, 2022

If I understood correctly, what happened to them is that that one single exception source made them reach their quota very fast. This could have happened with any other regular error in their app, I don't think that justify hiding a specific error.

I think the core problem is that the Redis::BaseError in RedisCacheStore#failsafe is high volume but basically negligible to most users. Can you share your thoughts behind the changes in #43712?

I want to find a way to opt out of this error and similar future cases. Using severity would work, but it feels weird because maybe we don't want to skip all warning errors. Using abstraction layers (source) may also be an incorrect way to categorize this case.

I think this case's characteristics are:

  • The error is deliberately swallowed.
  • For many years, it was only logged and I don't see SDKs trying to report it.
  • I think we can say it's expected to happen from time to time?
  • It's kind of at the bottom layer of the framework.

Maybe a fine-grained throttling strategy can work, but let me explain in the next section:

With our custom error reporter, what we do to avoid this is that we throttle error sources, e.g. if we see the same error more than X time in Y minutes, we stop reporting it for Z minutes. That's a feature many error reporters have, and that would make sense to implement in Rails.

I think this could be a nice feature, but I don't want it to be the default or universal. This is because we already implement a rate-limiting mechanism between the SDK and the server. In addition to that, we also collect the rate-limited stats (basically a category: quantity hash) and send them to the server for statistical purposes. So if there's another level of throttling on the Rails level, the stats wouldn't be accurate.

But if we can throttle Redis::BaseError and only let Sentry's error subscriber receive it once every 30 seconds, it'd be nice. In that case, this error would be more like a health-check message than an error, imo.

And for users that really don't care about the CacheStore errors, I believe that the better solution was suggested in the upstream thread, we should report it from the default error_handler, so that you can entirely disable it.

Sorry, I may have missed it; which comment are you referring to?

To reiterate: as an SDK maintainer, I wish Rails could:

  1. Provide a unified API for error reporting (this PR 🎉 ).
  2. Monitor error-hotspots but give the SDK a way to opt-in/out from them.

BTW, I think it'd be great if the error reporter could also capture errors from the runner command. The purpose would be similar to my proposal for a runner API.

@casperisfine
Contributor Author

Can you share your thoughts behind the changes in #43712?

Well, it was one goal of that standardized API to allow framework and library code to report handled internal errors. Otherwise the app owner might not realize the application is performing in degraded mode.

Now of course, some users may already have dedicated monitoring for Redis etc., so there might be a need to customize what is reported or not.

Sorry, I may have missed it; which comment are you referring to?

This one: getsentry/sentry-ruby#1765

specifically:

It does seem unfortunate that even if you provide your own error_handler callback, ActiveSupport.error_reporter gets called regardless.

Which I agree with.

I think it'd be great if the error reporter can also capture errors from the runner command.

Yes, 💯 , that's something I had in mind.

Also, as mentioned IRL, I think we could pass some kind of source argument, e.g. RedisCacheStore could pass source: "redis_cache_store.active_support", making it easy to filter out the errors you know you don't care about.

@st0012
Contributor

st0012 commented Apr 30, 2022

I think we could pass some kind of source argument, e.g. RedisCacheStore could pass source: "redis_cache_store.active_support"

Yes that'd be great. The SDK will provide some default and allow users to customize them too.

But what about the errors captured by the execution wrapper? Always seeing execution_wrapper.active_support as source won't be too helpful. Would it be possible to peek into the block and use that as the source instead?

casperisfine pushed a commit to Shopify/rails that referenced this pull request May 5, 2022
Ref: rails#43625 (comment)

Some users may not be interested in some internal errors.
By providing a `source` attribute we allow to easily filter
these errors out.
danielnolan added a commit to danielnolan/honeybadger-ruby that referenced this pull request Oct 24, 2022
* Since upgrading to Honeybadger 5.0 we are getting duplicate errors
  reported for Sidekiq::JobRetry::Skip and another reported error for
  the exception that caused the Sidekiq job to be retried.
* There was a PR merged to Sentry for a similar issue and Mike Perham
  said this exception could be ignored.
  getsentry/sentry-ruby#1763
  rails/rails#43625 (comment)
joshuap pushed a commit to honeybadger-io/honeybadger-ruby that referenced this pull request Oct 25, 2022
nikz added a commit to nikz/raygun4ruby that referenced this pull request May 9, 2024
`Sidekiq::JobRetry::Skip` is used for retry control and isn't really a
resolvable error to track

As per [recommendation from the Sidekiq
maintainer](rails/rails#43625 (comment))
