Dealing with thousands of errors (Or, event_from_exception is slow. Or, Rails 7's error_reporter is awkward) #1765

@jdelStrother

Description

We recently had a cache server go down, which caused our servers to become overwhelmed and start dropping requests like crazy. This seemed unusual: previously we've been fine with a cache server disappearing - we might get a handful of errors, but Rails' cache normally degrades fairly gracefully, treating reads as cache misses and silently failing on writes.

I think the new problem is a combination of Rails 7's error-reporting behaviour and Sentry's event_from_exception being a relatively slow method.

In Rails 7, every single cache error gets sent to the error_reporter (https://github.com/rails/rails/blob/0169d15bc7ec4557971d6ac6120e48b2cac9c407/activesupport/lib/active_support/cache/redis_cache_store.rb#L460-L466)
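That behaviour can be illustrated with a small, self-contained simulation. FakeReporter and the simplified failsafe below are illustrative stand-ins paraphrasing the linked Rails source, not the actual internals:

```ruby
# Counts how many times the error reporter is invoked.
class FakeReporter
  attr_reader :count

  def initialize
    @count = 0
  end

  def report(error, **)
    @count += 1
  end
end

REPORTER = FakeReporter.new

# Mirrors the shape of RedisCacheStore#failsafe: the reporter is called for
# every rescued error, even when a custom error_handler is also configured.
def failsafe(method, returning: nil, error_handler: nil)
  yield
rescue RuntimeError => error
  REPORTER.report(error, handled: true, severity: :warning)
  error_handler&.call(method: method, exception: error, returning: returning)
  returning
end

100.times { failsafe(:read) { raise "redis down" } }
puts REPORTER.count  # one report per failed cache operation
```

So with the cache server down, every read and write produces a report - there's no built-in coalescing.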

These then get sent to Sentry. We're using Sentry's BackgroundWorker to process these asynchronously (and drop events that exceed the max_queue size), but before that happens, Sentry calls event_from_exception on the current thread. This seems like quite a heavy method, mostly due to StackTraceBuilder -

[profiling screenshot: StackTraceBuilder dominating event_from_exception's runtime]

and worse, it's called for every single error report, regardless of whether BackgroundWorker is about to discard the event because the queue is full. config.before_send is called after event_from_exception, so there's no obvious way for me to control whether that work gets done.
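Here's a runnable sketch of that ordering - the stubs stand in for sentry-ruby internals, and names like enqueue and QUEUE_MAX are illustrative, not the real API:

```ruby
$heavy_calls = 0

# Stands in for Sentry's event_from_exception: in the real library this is
# where StackTraceBuilder does its expensive work, on the calling thread.
def event_from_exception(exception)
  $heavy_calls += 1
  { exception: exception.message }
end

QUEUE_MAX = 2
$queue = []

# Stands in for handing the event to BackgroundWorker: events beyond the
# queue limit are simply dropped.
def enqueue(event)
  return false if $queue.size >= QUEUE_MAX
  $queue << event
  true
end

# The event is always built before the queue check, so the heavy work
# happens even for events that are about to be discarded.
10.times { enqueue(event_from_exception(RuntimeError.new("boom"))) }
puts $heavy_calls  # 10 heavy builds for only 2 queued events
```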

I'm not sure what a good fix would look like here -

  • Possibly the fault lies with Rails 7. For now, I think I'm going to patch the RedisCacheStore#failsafe method to not call error_reporter&.report. It seems unfortunate that ActiveSupport.error_reporter gets called even if you provide your own error_handler callback.
  • It would be nice if event_from_exception happened asynchronously in BackgroundWorker - that way a) it wouldn't block the current thread, and b) it would skip the unnecessary work when the event is about to be discarded anyway. But that looks like quite a big change from the current flow, especially if it has to stay compatible with the other config.async options.
  • Maybe Sentry.capture_exception could have its own throttle that simply discards exceptions once there have been more than X in the past Y seconds?
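For that last option, a minimal sliding-window throttle might look like the sketch below. ThrottledCapture and its limit/window parameters are illustrative names, not anything Sentry provides:

```ruby
# Drops exceptions once more than `limit` have been captured in the
# trailing `window` seconds.
class ThrottledCapture
  def initialize(limit:, window:)
    @limit = limit
    @window = window
    @timestamps = []
  end

  # Returns true if the exception was forwarded, false if throttled.
  def capture(exception)
    now = Process.clock_gettime(Process::CLOCK_MONOTONIC)
    @timestamps.reject! { |t| now - t > @window }
    return false if @timestamps.size >= @limit

    @timestamps << now
    Sentry.capture_exception(exception) if defined?(Sentry)
    true
  end
end

throttle = ThrottledCapture.new(limit: 5, window: 10)
results = 8.times.map { throttle.capture(RuntimeError.new("cache down")) }
puts results.count(true)  # only the first 5 get through
```

The appeal is that the check runs before any event is built, so the expensive event_from_exception work is skipped entirely for throttled exceptions.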

Any other suggestions?
