ctxgroup,*: catch panics when using workers in jobs #76734

@ajwerner

Description

Is your feature request related to a problem? Please describe.

In light of #58164 / #62243, and the general prevailing wind that crashing the process is bad, we should be more proactive about catching panics when spinning off helper goroutines underneath jobs or query execution. The initial goroutine in those settings is already set up to recover from panics; it feels wrong not to give the same treatment to goroutines launched by those goroutines. The stopper, too, has long been endowed with support for recovering from such panics and sending valuable information in the form of Sentry reports.

The use of ctxgroup is largely confined to jobs at the time of writing. Other parts of the codebase have a stronger tendency towards using the stopper directly (for better or for worse).

Describe the solution you'd like

I'm not sure how heavy-handed to be on this topic with this issue. As a first pass, I'd like some feature that makes it easy to gain peace of mind that panics will be translated in a uniform way. What I propose is a new method, GoCtxSafe (🚲🏠 on the name), which takes an opName as a redact.SafeString and looks like this:

// GoCtxSafe spawns f in the group and translates any panic raised inside f
// into a returned error rather than crashing the process.
func (g Group) GoCtxSafe(op redact.SafeString, f func(context.Context) error) {
	g.wrapped.Go(func() (err error) {
		defer func() {
			switch r := recover().(type) {
			case nil:
				// No panic; leave f's return value untouched.
			case error:
				err = errors.Wrapf(r, "failed to %s", op)
			default:
				err = errors.AssertionFailedf("failed to %s: %v", op, r)
			}
		}()
		return f(g.ctx)
	})
}

Describe alternatives you've considered

I think there's an argument to be made that just adding the above method doesn't go far enough; instead, we should make this the only way to spawn goroutines through the group, and audit all existing calls. That transition could be made over time by first adding the new method, then updating the existing call sites, and finally renaming.

Another question is whether and how this should get hooked directly into Sentry reporting. My 2c is that it shouldn't. We should rely on existing error propagation to carry the error up to the right point in the call stack, where a report will be generated if the error is an assertion failure. That said, should all panicked errors be treated that way? I can certainly imagine use cases where panics are used soundly at API boundaries to propagate errors and unwind the stack cleanly; in those cases, it'd be a bummer to have the error marked as an assertion failure. I'd see some logic in letting the user control that behavior on the group itself, as an option whose default is to mark all panics, even panics carrying errors, as assertion failures.

On the note about reporting, we should confirm that assertion failures returned from jobs get reported.

Additional context

Some motivation follows from #76715 (comment).

Jira issue: CRDB-13253

Labels: C-enhancement, T-disaster-recovery