ctxgroup,*: catch panics when using workers in jobs #76734

@ajwerner

Description

Is your feature request related to a problem? Please describe.

In light of #58164 / #62243, and the general prevailing wind that crashing the process is bad, we should be more proactive about catching panics when spinning off helper goroutines underneath jobs or query execution. The initial goroutine in those settings is already set up to recover from panics; it feels wrong not to give the same treatment to goroutines launched by those goroutines. The stopper, too, has long been endowed with support for recovering from such panics and sending valuable information in the form of Sentry reports.

The use of ctxgroup is largely confined to jobs at the time of writing. Other parts of the codebase have a stronger tendency towards using the stopper directly (for better or for worse).

Describe the solution you'd like

I'm not sure how heavy-handed to be on this topic with this issue. As a first pass, I'd like some feature that makes it easy to gain peace of mind that panics will be translated in a uniform way. What I propose is a new method, GoCtxSafe (🚲🏠 on the name), which takes an opName as a redact.SafeString and looks like this:

// GoCtxSafe spawns f in the group and translates any panic raised inside f
// into a returned error rather than crashing the process.
func (g Group) GoCtxSafe(op redact.SafeString, f func(context.Context) error) {
	g.wrapped.Go(func() (err error) {
		defer func() {
			switch r := recover().(type) {
			case nil:
				// No panic; leave f's return value untouched.
			case error:
				err = errors.Wrapf(r, "failed to %s", op)
			default:
				err = errors.AssertionFailedf("failed to %s: %v", op, r)
			}
		}()
		return f(g.ctx)
	})
}

Describe alternatives you've considered

I think there's an argument to be made that just adding the above method doesn't go far enough; instead, we should make this the only way to spawn goroutines through the group, and audit all existing calls. That transition could be made over time by first adding the new method, then updating the existing call sites, and finally renaming.

Another question is whether and how this should get hooked directly into Sentry reporting. My 2c is that it shouldn't. We should rely on existing error propagation to carry the error up to the right point in the call stack, where a report will be generated if the error is an assertion failure. That said, should all panicked errors be treated that way? I can certainly imagine use cases where panics are used soundly at API boundaries to propagate errors and unwind the stack cleanly; in those cases, it'd be a bummer to have the error marked as an assertion failure. I'd see some logic in letting the user control that behavior on the group itself, as an option whose default is to mark all panics, even panics carrying errors, as assertion failures.

On the note about reporting, we should confirm that assertion failures returned from jobs get reported.

Additional context

Some motivation follows from #76715 (comment).

Jira issue: CRDB-13253

Labels: C-enhancement, T-disaster-recovery