sql: fix a few issues with reporting of errors to sentry by yuzefovich · Pull Request #43301 · cockroachdb/cockroach

yuzefovich · 2019-12-18T19:07:07Z

sql: use the correct context when recording an error for sentry

Previously, context.Background() was used to record an internal error.
That context is missing the registered tags (e.g. 'statement' tag) which
results in an incomplete sentry report. Now this is fixed.

Release note: None

sql: remove CloseWithErr method from CommandResultClose interface

The behavior of CloseWithErr method can be obtained with SetError
followed by Close, so this commit does such refactoring which simplifies
the interface.

Release note: None

sql: fix double reporting of the same error with sentry

Previously, in a certain code path both connExecutor and pgwire would
record telemetry for the same error to be sent to sentry. This resulted
in duplicated events. Now this is fixed.

Release note: None

cockroach-teamcity · 2019-12-18T19:07:14Z

This change is

yuzefovich

Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @andreimatei and @knz)

pkg/sql/conn_executor.go, line 1431 at r1 (raw file):

			res.CloseWithErr(pe.errorCause())
		} else {
			ex.recordError(ctx, resErr)

The duplicate report happens because of these two lines - both ex.recordError and res.Close create a report, I'm not sure which one should not be happening. Should we modify res.Close to take an argument whether the report has already been created?

knz

Fantastic find. This must have been painful to troubleshoot. You're a hero.

knz

Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @andreimatei and @yuzefovich)

pkg/sql/conn_executor.go, line 1431 at r1 (raw file):

Previously, yuzefovich wrote…

The duplicate report happens because of these two lines - both ex.recordError and res.Close create a report, I'm not sure which one should not be happening. Should we modify res.Close to take an argument whether the report has already been created?

Unsure what the best approach is.
My recollection is the following:

we can't make res.Close the only place responsible because if the resultwriter is not pgwire, the error would get lost
we can't make this code here the only place responsible, because we need to catch the error on other termination paths (e.g. a panic)

Maybe what you suggest is not too bad. It's possible that the number of people who can propose an informed opinion on this has been reduced to just 1 (you) over the last year.

yuzefovich

Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @andreimatei and @knz)

pkg/sql/conn_executor.go, line 1431 at r1 (raw file):

Previously, knz (kena) wrote…

Unsure what the best approach is.
My recollection is the following:

we can't make res.Close the only place responsible because if the resultwriter is not pgwire, the error would get lost

we can't make this code here the only place responsible, because we need to catch the error on other termination paths (e.g. a panic)

Maybe what you suggest is not too bad. It's possible that the number of people who can propose an informed opinion on this has been reduced to just 1 (you) over the last year.

I'm under the same impression as your recollection of things, so I pushed a second commit to add that argument. PTAL.

andreimatei

For the first commit, where is the sentry reporting happening?

Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @andreimatei and @knz)

yuzefovich

Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @andreimatei and @knz)

pkg/sql/conn_executor.go, line 1434 at r2 (raw file):

			// We have just reported the error, so we tell res.Close not to report
			// the error to avoid the duplicate error report.
			res.Close(ctx, stateToTxnStatusIndicator(ex.machine.CurState()), false /* shouldReportError */)

@andreimatei this is where the reporting happens and where the context was needed.

andreimatei

Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @knz and @yuzefovich)

pkg/sql/conn_executor.go, line 1434 at r2 (raw file):

Previously, yuzefovich wrote…

@andreimatei this is where the reporting happens and where the context was needed.

pls point me to the code that calls into sentry

knz · 2019-12-18T19:35:38Z

In pgwire/conn.go:


func writeErr(
  ctx context.Context, sv *settings.Values, err error, msgBuilder *writeBuffer, w io.Writer,
) error {
  // Record telemetry for the error.
  sqltelemetry.RecordError(ctx, err, sv)

yuzefovich

Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @andreimatei and @knz)

pkg/sql/conn_executor.go, line 1434 at r2 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

pls point me to the code that calls into sentry

pkg/sql/pgwire/conn.go:1190

andreimatei

Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @andreimatei and @yuzefovich)

pkg/sql/conn_executor.go, line 1431 at r1 (raw file):

Previously, yuzefovich wrote…

I'm under the same impression as your recollection of things, so I pushed a second commit to add that argument. PTAL.

I think we need a clear contract about who reports what. Given that pgwire is in the reporting game, I think the connExecutor should get out of it; I don't see what benefit we get from sql ever doing it. Perhaps for internal executors... but we don't do it consistently. I'd rather do it in the bufferedCommandResult if we care about internal executor. Or, rather, forget about them.
I also think that the way in which pgwire gets that statement from the context is fairly ridiculous. It's obviously extremely fragile, and we care about that statement enough. Getting random dynamic tags from the context to put them in the report is fine, but the statement is not dynamic. Moreover, pgwire actually has the statement. It doesn't need to get it in an insane way from sql. The constructor of the CommandResult has it, for example. So let's just hang it from the CommandResult.

So, I'd keep the first commit but change the way in which the statement is retrieved for the report, and I'd do something else for the 2nd commit.

pkg/sql/conn_io.go, line 671 at r3 (raw file):

	// an error (if present). This should be set to true only if the caller will
	// take care of creating the report on their own.
	Close(ctx context.Context, tsi TransactionStatusIndicator, shouldReportError bool)

No bool arguments to high-profile APIs like this! In fact, no bool args ever (discouraged by the style guide too.
Make it an enum if we don't find a better idea.

Also what's the "should"? Should every function Foo be called shouldFoo? :)
Just reportError.

pkg/sql/conn_io.go, line 671 at r3 (raw file):

	// an error (if present). This should be set to true only if the caller will
	// take care of creating the report on their own.
	Close(ctx context.Context, tsi TransactionStatusIndicator, shouldReportError bool)

The only caller that sets the new argument doesn't need it - the result in question is created just above and has no error. So the arg is not needed.

pkg/sql/conn_io.go, line 681 at r3 (raw file):

	// command in the same batch as the one being closed. The contract is that the
	// next result created corresponds to the first command in the next batch.
	CloseWithErr(context.Context, error)

It looks to me like we can get rid of this CloseWithErr(). It's only called in one place, which can call SetError() + Close(). Getting rid of it would simplify the reasoning about this PR. Perhaps in a preamble commit.

pkg/sql/pgwire/conn.go, line 1174 at r3 (raw file):

}

func (c *conn) bufferErr(ctx context.Context, err error, shouldReportError bool) {

no bools

And I don't think the arg is needed. If the code is simplified, I think this will always be false.

pkg/sql/pgwire/conn.go, line 1188 at r3 (raw file):

}

func writeErr(

split this into writeErr() and writeAndReportErr().

knz · 2019-12-18T21:13:05Z

t. Perhaps for internal executors...

yes!

but we don't do it consistently. I'd rather do it in the bufferedCommandResult if we care about internal executor.

Maybe that's a good idea.

Or, rather, forget about them.

Surely not :)

yuzefovich

Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @andreimatei and @yuzefovich)

pkg/sql/conn_io.go, line 681 at r3 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

It looks to me like we can get rid of this CloseWithErr(). It's only called in one place, which can call SetError() + Close(). Getting rid of it would simplify the reasoning about this PR. Perhaps in a preamble commit.

You're right. Added such commit (I wanna see that nothing breaks, so I've just pushed it.)

andreimatei

Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @andreimatei and @yuzefovich)

pkg/sql/conn_executor.go, line 1431 at r1 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

I think we need a clear contract about who reports what. Given that pgwire is in the reporting game, I think the connExecutor should get out of it; I don't see what benefit we get from sql ever doing it. Perhaps for internal executors... but we don't do it consistently. I'd rather do it in the bufferedCommandResult if we care about internal executor. Or, rather, forget about them.
I also think that the way in which pgwire gets that statement from the context is fairly ridiculous. It's obviously extremely fragile, and we care about that statement enough. Getting random dynamic tags from the context to put them in the report is fine, but the statement is not dynamic. Moreover, pgwire actually has the statement. It doesn't need to get it in an insane way from sql. The constructor of the CommandResult has it, for example. So let's just hang it from the CommandResult.

So, I'd keep the first commit but change the way in which the statement is retrieved for the report, and I'd do something else for the 2nd commit.

I'm leaning more towards keeping the reporting of statement execution errors in sql, and not removing it from pgwire. So as soon as a CommandResult is created, the responsibility of reporting its error falls to the ConnExecutor.

andreimatei

Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @andreimatei and @yuzefovich)

pkg/sql/conn_executor.go, line 1431 at r1 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

I'm leaning more towards keeping the reporting of statement execution errors in sql, and not removing it from pgwire. So as soon as a CommandResult is created, the responsibility of reporting its error falls to the ConnExecutor.

I meant "and removing it from pgwire"

knz

Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @andreimatei and @yuzefovich)

pkg/sql/conn_executor.go, line 1431 at r1 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

I meant "and removing it from pgwire"

I support making SQL generally more responsible for reporting.

Meanwhile we still need a mechanism in pgwire to record errors happening within pgwire.

andreimatei

Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @andreimatei and @yuzefovich)

pkg/sql/conn_executor.go, line 1431 at r1 (raw file):

Previously, knz (kena) wrote…

I support making SQL generally more responsible for reporting.

Meanwhile we still need a mechanism in pgwire to record errors happening within pgwire.

yeah, for everything not using newCommandResult (i.e. for all the non-statement commands), pgwire would remain responsible

yuzefovich

Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @andreimatei and @knz)

pkg/sql/conn_executor.go, line 1431 at r1 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

yeah, for everything not using newCommandResult (i.e. for all the non-statement commands), pgwire would remain responsible

Pardon my ignorance, but I don't fully understand where the line between non-statement and statement commands lies. This seems like a bigger undertaking than I had in mind.

However, I looked into changing the way that statement is propagated, and although it is true that commandResult can store tree.Statement, SendReport is being called by other things (I got to util/log logDepth() function) and for those it'll be a lot of plumbing to pass in the statement outside of the context. I decided to not proceed with that change.

I appreciate the thoughtful comments that you guys provided though.

pkg/sql/conn_io.go, line 671 at r3 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

No bool arguments to high-profile APIs like this! In fact, no bool args ever (discouraged by the style guide too.
Make it an enum if we don't find a better idea.

Also what's the "should"? Should every function Foo be called shouldFoo? :)
Just reportError.

Introduced an enum.

pkg/sql/conn_io.go, line 671 at r3 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

The only caller that sets the new argument doesn't need it - the result in question is created just above and has no error. So the arg is not needed.

Hm, I don't think it's correct (if there was no error, we would not be seeing duplicated reports).

pkg/sql/pgwire/conn.go, line 1174 at r3 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

no bools

And I don't think the arg is needed. If the code is simplified, I think this will always be false.

No bools, but I believe the argument is needed.

pkg/sql/pgwire/conn.go, line 1188 at r3 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

split this into writeErr() and writeAndReportErr().

Done.

andreimatei

Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @andreimatei and @yuzefovich)

pkg/sql/conn_executor.go, line 1431 at r1 (raw file):

SendReport is being called by other things

I think it's fine for randos to take whatever they find in the context. In particular, log.Fatal() calls shouldn't happen, so I don't care too much about them.
But errors reported for the execution of SQL statements are not random, and also they're supposed to happen (or maybe not? We report on code == pgcode.Internal || errors.HasAssertionFailure(err) and I forgot when the Internal code happens).

Anyway, the thing I care more about is the contract about who reports what. Passing an extra arg to res.Close() I really don't like, and I don't think it's necessary. The CommandResult interface I care about.

What was the code doing before this PR? Isn't it nonsense?
It has two errors to deal with: resErr and pe. If there's no resErr, then it was passing pe to res.CloseWithError(), and the result was in charge of reporting the one and only error. Cool.
If there was a resErr, then it was recording the resErr (which was already part of the result and hence reported by the result), and it was ignoring pe. Does that make any sense? Perhaps what it wanted to do is record pe, not resErr. As written (before and after your PR), if you simply remove ex.recordError(ctx, resErr), don't all the problems go away while reporting exactly what we already report?

Unless I'm missing something, I'd just get rid of that call and move on.

pkg/sql/conn_executor.go, line 1379 at r6 (raw file):

				// If we're about to close the connection, close res in order to flush
				// now, as we won't have an opportunity to do it later.
				res.Close(ctx, stateToTxnStatusIndicator(ex.machine.CurState()), sqltelemetry.ErrorNotRecorded)

long line

knz

Reviewed 6 of 6 files at r6.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @andreimatei and @yuzefovich)

pkg/sql/conn_executor.go, line 1436 at r6 (raw file):

		}
		errStatus := sqltelemetry.ErrorNotRecorded
		if resErr == nil && ok {

Can you add a few comments to explain why errStatus is needed and what are the various combination of causes/circumstances covered by the various branches of this conditional.

IMHO This code is very complex and it is important to help the future maintainer.

pkg/sql/conn_io.go, line 669 at r6 (raw file):

	// state that needs to be closed even when the whole connection is about to be
	// terminated.
	Close(context.Context, TransactionStatusIndicator, sqltelemetry.RecordErrorStatus)

ditto my comment above about explaining the status more and the various cases that it covers (with examples).

pkg/sql/pgwire/conn.go, line 1177 at r6 (raw file):

		if err := writeAndReportErr(ctx, c.sv,
			err, &c.msgBuilder, &c.writerState.buf); err != nil {
			panic(fmt.Sprintf("unexpected err from buffer: %s", err))

So I went and investigated this case further. I believe a panic here is not warranted.
tldr:

make bufferErr return error and propagate the error from here
change the remaining call points in command_result.go (in Close) to log then discard the error.

The reason is that it's actually legitimate for an error to occur here. The client could have disconnected and dropped its connection just before the error packet is sent. In fact it's even likely, for example if a user interrupts a long-running query after they quit the client that issued it. We don't want to panic in those cases.

pkg/sql/sqltelemetry/report.go, line 25 at r6 (raw file):

// RecordErrorStatus indicates whether the error has already been recorded.
type RecordErrorStatus int

type foo bool

const a foo = true
const b foo = false

smaller footprint in memory
you don't need a default in the switch.

yuzefovich

Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @andreimatei, @knz, and @yuzefovich)

pkg/sql/conn_executor.go, line 1431 at r1 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

SendReport is being called by other things

I think it's fine for randos to take whatever they find in the context. In particular, log.Fatal() calls shouldn't happen, so I don't care too much about them.
But errors reported for the execution of SQL statements are not random, and also they're supposed to happen (or maybe not? We report on code == pgcode.Internal || errors.HasAssertionFailure(err) and I forgot when the Internal code happens).

Anyway, the thing I care more about is the contract about who reports what. Passing an extra arg to res.Close() I really don't like, and I don't think it's necessary. The CommandResult interface I care about.

What was the code doing before this PR? Isn't it nonsense?
It has two errors to deal with: resErr and pe. If there's no resErr, then it was passing pe to res.CloseWithError(), and the result was in charge of reporting the one and only error. Cool.
If there was a resErr, then it was recording the resErr (which was already part of the result and hence reported by the result), and it was ignoring pe. Does that make any sense? Perhaps what it wanted to do is record pe, not resErr. As written (before and after your PR), if you simply remove ex.recordError(ctx, resErr), don't all the problems go away while reporting exactly what we already report?

Unless I'm missing something, I'd just get rid of that call and move on.

I think you're right that the code was not doing what it intended, so I switched to "recording" pe and reporting resErr through pgwire, and I also agree that simply removing ex.recordError(ctx, resErr) would have solved the double reporting problem. Sorry for not seeing it before and making all this fuzz.

Yet, I'm leaning towards actually recording pe. Do you guys know whether pe.errorCause() and res.Err() are usually different? Do we want to report both?

pkg/sql/conn_executor.go, line 1379 at r6 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

long line

Removed.

pkg/sql/conn_executor.go, line 1436 at r6 (raw file):

Previously, knz (kena) wrote…

Can you add a few comments to explain why errStatus is needed and what are the various combination of causes/circumstances covered by the various branches of this conditional.

IMHO This code is very complex and it is important to help the future maintainer.

Removed.

pkg/sql/conn_io.go, line 669 at r6 (raw file):

Previously, knz (kena) wrote…

ditto my comment above about explaining the status more and the various cases that it covers (with examples).

Removed.

pkg/sql/pgwire/conn.go, line 1177 at r6 (raw file):

Previously, knz (kena) wrote…

So I went and investigated this case further. I believe a panic here is not warranted.
tldr:

make bufferErr return error and propagate the error from here

change the remaining call points in command_result.go (in Close) to log then discard the error.

The reason is that it's actually legitimate for an error to occur here. The client could have disconnected and dropped its connection just before the error packet is sent. In fact it's even likely, for example if a user interrupts a long-running query after they quit the client that issued it. We don't want to panic in those cases.

Makes sense, added another commit.

pkg/sql/sqltelemetry/report.go, line 25 at r6 (raw file):

Previously, knz (kena) wrote…

type foo bool

const a foo = true
const b foo = false
smaller footprint in memory

you don't need a default in the switch.

Code was removed, but thanks for the suggestion.

knz

LGTM with nits

Reviewed 1 of 3 files at r8, 1 of 1 files at r9, 2 of 2 files at r10.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @andreimatei and @yuzefovich)

pkg/sql/conn_executor.go, line 1431 at r1 (raw file):

Previously, yuzefovich wrote…

I think you're right that the code was not doing what it intended, so I switched to "recording" pe and reporting resErr through pgwire, and I also agree that simply removing ex.recordError(ctx, resErr) would have solved the double reporting problem. Sorry for not seeing it before and making all this fuzz.

Yet, I'm leaning towards actually recording pe. Do you guys know whether pe.errorCause() and res.Err() are usually different? Do we want to report both?

I would cut this gordian knot by using errors.CombineErrors to report whichever is set if only one is set, or both if both are set.

pkg/sql/pgwire/command_result.go, line 103 at r10 (raw file):

			ctx, r.conn.sv, r.err, &r.conn.msgBuilder, &r.conn.writerState.buf,
		); err != nil {
			log.VErrEvent(ctx, 2 /* level */, err.Error())

I think you can be a bit clearer about what is going on:

log.VErrEventf(ctx, 2 /* level */, `error in writeErr(): %v`, err)

do you want to hide these errors by default? I was thinking about log.Warningf instead of a trace-only thing.
(optionally) you can place this logging in a helper function, which you can call in other places where writeErr is called. I provide a few (non-exhaustive) examples below. If you prefer to leave this out of scope, you can file an issue with my name on it and I'll pick it up later.

pkg/sql/pgwire/conn.go, line 349 at r10 (raw file):

			// This messages are only acceptable during the auth phase, handled above.
			err = pgwirebase.NewProtocolViolationErrorf("unexpected authentication data")
			_ /* err */ = writeErr(

you can emit the error here with the helper too

pkg/sql/pgwire/conn.go, line 460 at r10 (raw file):

		_ /* err */ = writeErr(ctx, &sqlServer.GetExecutorConfig().Settings.SV,
			newAdminShutdownErr(ErrDrainingExistingConn), &c.msgBuilder, &c.writerState.buf)
		_ /* n */, _ /* err */ = c.writerState.buf.WriteTo(c.conn)

ditto

pkg/sql/pgwire/conn.go, line 542 at r10 (raw file):

					// Add a prefix. This also adds a stack trace.
					retErr = errors.Wrap(retErr, "caught fatal error")
					_ = writeErr(

here it's different, I believe the error should not be ignored (neither the one below).
This would be retErr = errors.CombineErrors(retErr, err)

pkg/sql/pgwire/conn.go, line 609 at r10 (raw file):

		ctx, c.sessionArgs, &c.stmtBuf, c, c.metrics.SQLMemMetrics)
	if err != nil {
		_ /* err */ = writeErr(

ditto errors.CombineErrors

pkg/sql/pgwire/conn.go, line 1491 at r10 (raw file):

) error {
	sendError := func(err error) error {
		_ /* err */ = writeErr(ctx, &execCfg.Settings.SV, err, &c.msgBuilder, c.conn)

Probably errors.CombineErrors here too.

andreimatei

Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @andreimatei and @yuzefovich)

pkg/sql/conn_executor.go, line 1431 at r1 (raw file):

Yet, I'm leaning towards actually recording pe

You mean, you're leaning towards not recording pe?

Do you guys know whether pe.errorCause() and res.Err() are usually different? Do we want to report both?

They're different in some corner cases. For example on query cancelation, resErr is more sanitized than pe. On schema change failure, you have a resErr but not a pe (because schema changes run when the transaction commits, and the events and their payloads are about transaction updates).
So I think we should not record pe.

yuzefovich

Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @andreimatei, @knz, and @yuzefovich)

pkg/sql/conn_executor.go, line 1431 at r1 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

Yet, I'm leaning towards actually recording pe

You mean, you're leaning towards not recording pe?

Do you guys know whether pe.errorCause() and res.Err() are usually different? Do we want to report both?

They're different in some corner cases. For example on query cancelation, resErr is more sanitized than pe. On schema change failure, you have a resErr but not a pe (because schema changes run when the transaction commits, and the events and their payloads are about transaction updates).
So I think we should not record pe.

I did mean "recording pe", but your examples of the differences suggest that recording resErr should be sufficient, so I removed the newly added reporting of pe as you're suggesting.

pkg/sql/pgwire/command_result.go, line 103 at r10 (raw file):

Previously, knz (kena) wrote…

I think you can be a bit clearer about what is going on:
log.VErrEventf(ctx, 2 /* level */, `error in writeErr(): %v`, err)
do you want to hide these errors by default? I was thinking about log.Warningf instead of a trace-only thing.

(optionally) you can place this logging in a helper function, which you can call in other places where writeErr is called. I provide a few (non-exhaustive) examples below. If you prefer to leave this out of scope, you can file an issue with my name on it and I'll pick it up later.

Nice suggestions, thanks, done.

pkg/sql/pgwire/conn.go, line 349 at r10 (raw file):

Previously, knz (kena) wrote…

you can emit the error here with the helper too

Done.

pkg/sql/pgwire/conn.go, line 460 at r10 (raw file):

Previously, knz (kena) wrote…

ditto

Done.

pkg/sql/pgwire/conn.go, line 542 at r10 (raw file):

Previously, knz (kena) wrote…

here it's different, I believe the error should not be ignored (neither the one below).
This would be retErr = errors.CombineErrors(retErr, err)

Done.

pkg/sql/pgwire/conn.go, line 609 at r10 (raw file):

Previously, knz (kena) wrote…

ditto errors.CombineErrors

Done.

pkg/sql/pgwire/conn.go, line 1491 at r10 (raw file):

Previously, knz (kena) wrote…

Probably errors.CombineErrors here too.

Done.

pkg/sql/pgwire/server.go, line 499 at r14 (raw file):

	sendErr := func(err error) error {
		msgBuilder := newWriteBuffer(s.metrics.BytesOutCount)
		writeErrWithLogging(ctx, &s.execCfg.Settings.SV, err, msgBuilder, conn)

I don't have a feeling for whether this should be logged or combined with the error to return. Thoughts?

andreimatei

Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @andreimatei, @knz, and @yuzefovich)

pkg/sql/pgwire/conn.go, line 1187 at r14 (raw file):

// writeErrWithLogging is the same as writeErr with the addition of logging the
// error returned by writeErr.
func writeErrWithLogging(

I don't know about this logging... We don't log for other communication errors (or do all communication errors end up in a call to this? I don't think so). And for good reason - client connections dropping are not a noteworthy event - at least not noteworthy for our log files.

pkg/sql/pgwire/conn.go, line 1193 at r14 (raw file):

		ctx, sv, err, msgBuilder, w,
	); err != nil {
		log.Warningf(ctx, `error in writeErr(): %v`, err)

nit: s/%v/%s since you know err is not nil. We use %v to tolerate nils.

pkg/sql/pgwire/server.go, line 499 at r14 (raw file):

Previously, yuzefovich wrote…

I don't have a feeling for whether this should be logged or combined with the error to return. Thoughts?

I wouldn't combine it. Per the comment of ServeConn(): An error is returned if the initial handshake of the connection fails.
Failing to send the handshake error to the client is not interesting; combining the error would only blur the reason why the handshake failed.

More importantly, ServeConn doesn't generally return communication errors to the caller. Given that, returning a communication error from the handshake part of the protocol in particular doesn't make much sense.

You can argue that ServeConn() shouldn't return any errors. The reason why it does return handshake errors, I guess, is because we want to log those in particular since they might be important to yell about - they might show some client misconfiguration, bad certs, etc. There's probably a better way to achieve said logging.

knz

I'll review more later, just replying on a nit here.

Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @andreimatei, @knz, and @yuzefovich)

pkg/sql/pgwire/conn.go, line 1193 at r14 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

nit: s/%v/%s since you know err is not nil. We use %v to tolerate nils.

this nit is more a FYI. The code is fine as-is. We use %v throughout for errors.

yuzefovich · 2020-01-21T16:52:46Z

Guys, what do you think about not merging the last commit about auditing error handling and leaving a TODO to do that? The original intention of this PR was simply removing duplicated reports of errors to Sentry.

knz · 2020-01-21T16:54:56Z

I'm in favor. You can split the commit in a separate PR.
I'll let Andrei weigh in, his opinion on this is more informed than mine.

Previously, context.Background() was used to record an internal error. That context is missing the registered tags (e.g. 'statement' tag) which results in an incomplete sentry report. Now this is fixed. Release note: None

The behavior of CloseWithErr method can be obtained with SetError followed by Close, so this commit does such refactoring which simplifies the interface. Release note: None

yuzefovich · 2020-01-21T17:11:31Z

Removed the last commit and will open a separate PR to continue discussion in there.

andreimatei

I really wanted to and agonized for an hour, but I just can't sign off on the first commit. I know you haven't written this code, so you can take Rafa's LGTM and run with it, but I can't put mine. Having something as important as including the statement in the Sentry reports depend on what context res.Close() is called with does not make sense. It's magic for no reason and fragile for no reason. The caller to res.Close() might as well put structured information in the error for the purposes of the report - for example by using the errors.With<foo>(err) patterns.
Also notice that currently because of the context stuff the reports don't make a distinction between errors encountered by execution of a statement or portal vs errors encountered by a prepare. If we make this caller construct some structured info, it'd be trivial to make that distinction.

On log.Fatal(), on the other hand, I think the magic is justified to get as much info as possible. On panics, I similarly believe the magic is justified, except that it looks to me like on the one panic handler that matters, the magic will actually not work because the context there is not the one with the statement in it. So again I think some structured solution needs to be find. For example recovering panics earlier, somewhere in conn_executor_exec.go, and panicing again with an error that has the statement in it.

Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @andreimatei, @knz, and @yuzefovich)

pkg/sql/conn_executor.go, line 1438 at r16 (raw file):

			res.SetError(pe.errorCause())
		} else {
			ex.recordError(ctx, resErr)

now that this call goes away, there's no reason for recordError to stay a method on the executor. Please get rid of it.

pkg/sql/conn_executor.go, line 1435 at r17 (raw file):

			ex.sessionEventf(ctx, "execution error: %s", pe.errorCause())
			if resErr == nil {
				// We use the payload error to set on the result only if 'res' doesn't

This comment is confusing. The first sentence is unnecessary since it just states the code without adding any useful commentary. The 2nd sentence talks about reporting, but no reader will know what you're talking about - the reporting is hidden 20 levels deep. I don't think this code needs any comment besides that comment a few lines up that claims that res.Err() might not be set in case of error events; we'd be doing this SetError() regardless of any reporting concerns.

Previously, in a certain code path both connExecutor and pgwire would record telemetry for the same error to be sent to sentry. This resulted in duplicated events. Now this is fixed. Release note: None

yuzefovich

I appreciate the thoughtful comments that you left, and I extracted them into a separate issue (#44308). I believe that this PR strictly improves the situation, so it is beneficial to merge it. At the moment, I don't have the bandwidth to look into this further, so I'll go ahead and merge it as is.

Thanks for the comments!

bors r+

Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @andreimatei, @knz, and @yuzefovich)

pkg/sql/conn_executor.go, line 1438 at r16 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

now that this call goes away, there's no reason for recordError to stay a method on the executor. Please get rid of it.

Done.

pkg/sql/conn_executor.go, line 1435 at r17 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

This comment is confusing. The first sentence is unnecessary since it just states the code without adding any useful commentary. The 2nd sentence talks about reporting, but no reader will know what you're talking about - the reporting is hidden 20 levels deep. I don't think this code needs any comment besides that comment a few lines up that claims that res.Err() might not be set in case of error events; we'd be doing this SetError() regardless of any reporting concerns.

Ok, removed.

43301: sql: fix a few issues with reporting of errors to sentry r=yuzefovich a=yuzefovich **sql: use the correct context when recording an error for sentry** Previously, context.Background() was used to record an internal error. That context is missing the registered tags (e.g. 'statement' tag) which results in an incomplete sentry report. Now this is fixed. Release note: None **sql: remove CloseWithErr method from CommandResultClose interface** The behavior of CloseWithErr method can be obtained with SetError followed by Close, so this commit does such refactoring which simplifies the interface. Release note: None **sql: fix double reporting of the same error with sentry** Previously, in a certain code path both connExecutor and pgwire would record telemetry for the same error to be sent to sentry. This resulted in duplicated events. Now this is fixed. Release note: None 44246: build: fix teamcity-compose script r=mjibson a=mjibson 44250: backupccl,importccl: fix privilege checks for BACKUP/RESTORE and IMPORT r=pbardea a=pbardea This changes the privilege checks in IMPORT, IMPORT INTO and RESTORE to run during the *planning* of the job, in the SQL plan hook execution, rather than during the execution of the job. This is done because privilege checks are implemented on planner, and close over the planner's txn in some branches/cases, so invoking them later, on a txn-less planner in a resumed jobs execution, can cause problems. Before this, the planStateHook's txn was assumed to be set and caused a panic on checking RBAC privileges. Additionally, permission checks in these operations did not properly give access to all admin users. Fixes #44252. Release note (bug fix): Allow all admin users to use BACKUP/RESTORE and IMPORT. 44268: sql/sem/builtins: mark timeofday as impure r=mjibson a=mjibson Co-authored-by: Yahor Yuzefovich <yahor@cockroachlabs.com> Co-authored-by: Matt Jibson <matt.jibson@gmail.com> Co-authored-by: Paul Bardea <pbardea@gmail.com>

craig · 2020-01-23T18:51:45Z

Build succeeded

GitHub CI (Cockroach)

yuzefovich requested review from andreimatei and knz December 18, 2019 19:07

yuzefovich commented Dec 18, 2019

View reviewed changes

knz approved these changes Dec 18, 2019

View reviewed changes

knz reviewed Dec 18, 2019

View reviewed changes

yuzefovich changed the title ~~sql: use the correct context when recording an error for sentry~~ sql: fix a few issues with reporting of errors to sentry Dec 18, 2019

yuzefovich commented Dec 18, 2019

View reviewed changes

andreimatei reviewed Dec 18, 2019

View reviewed changes

yuzefovich commented Dec 18, 2019

View reviewed changes

andreimatei reviewed Dec 18, 2019

View reviewed changes

yuzefovich commented Dec 18, 2019

View reviewed changes

yuzefovich force-pushed the sentry branch from d83c7c2 to 5c33758 Compare December 18, 2019 19:43

andreimatei suggested changes Dec 18, 2019

View reviewed changes

yuzefovich force-pushed the sentry branch from 5c33758 to 3633461 Compare December 18, 2019 21:45

yuzefovich commented Dec 18, 2019

View reviewed changes

andreimatei reviewed Dec 19, 2019

View reviewed changes

knz reviewed Dec 19, 2019

View reviewed changes

andreimatei reviewed Dec 19, 2019

View reviewed changes

yuzefovich force-pushed the sentry branch from 3633461 to 2b26099 Compare December 19, 2019 22:02

yuzefovich commented Dec 19, 2019

View reviewed changes

andreimatei suggested changes Dec 27, 2019

View reviewed changes

knz reviewed Dec 28, 2019

View reviewed changes

yuzefovich force-pushed the sentry branch 2 times, most recently from 4fd8f7e to eaec2d0 Compare December 31, 2019 04:21

yuzefovich commented Dec 31, 2019

View reviewed changes

knz approved these changes Dec 31, 2019

View reviewed changes

andreimatei reviewed Jan 2, 2020

View reviewed changes

yuzefovich force-pushed the sentry branch 2 times, most recently from cf98ff8 to ab47e59 Compare January 2, 2020 18:17

yuzefovich commented Jan 2, 2020

View reviewed changes

andreimatei suggested changes Jan 2, 2020

View reviewed changes

knz reviewed Jan 3, 2020

View reviewed changes

yuzefovich added 2 commits January 21, 2020 09:06

sql: use the correct context when recording an error for sentry

135711e

Previously, context.Background() was used to record an internal error. That context is missing the registered tags (e.g. 'statement' tag) which results in an incomplete sentry report. Now this is fixed. Release note: None

sql: remove CloseWithErr method from CommandResultClose interface

46f3143

The behavior of CloseWithErr method can be obtained with SetError followed by Close, so this commit does such refactoring which simplifies the interface. Release note: None

yuzefovich force-pushed the sentry branch from ab47e59 to f922b73 Compare January 21, 2020 17:10

yuzefovich mentioned this pull request Jan 21, 2020

pgwire: audit handling of errors returned from writeErr #44165

Closed

andreimatei reviewed Jan 23, 2020

View reviewed changes

sql: fix double reporting of the same error with sentry

f5d7f33

Previously, in a certain code path both connExecutor and pgwire would record telemetry for the same error to be sent to sentry. This resulted in duplicated events. Now this is fixed. Release note: None

yuzefovich force-pushed the sentry branch from f922b73 to f5d7f33 Compare January 23, 2020 17:23

yuzefovich mentioned this pull request Jan 23, 2020

sql: clean up propagation of sql statement into a sentry report #44308

Closed

yuzefovich commented Jan 23, 2020

View reviewed changes

craig bot merged commit f5d7f33 into cockroachdb:master Jan 23, 2020

yuzefovich deleted the sentry branch January 23, 2020 19:25

This was referenced Feb 25, 2020

opt: assertion failed: estimated row count must be non-zero #45180

Closed

release-19.2: sql: fix a few issues with reporting of errors to sentry #45400

Merged

yuzefovich mentioned this pull request Apr 8, 2020

sentry: don't pass statement as a context tag #38516

Closed

Conversation

yuzefovich commented Dec 18, 2019 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

cockroach-teamcity commented Dec 18, 2019

Uh oh!

yuzefovich left a comment

Choose a reason for hiding this comment

Uh oh!

knz left a comment

Choose a reason for hiding this comment

Uh oh!

knz left a comment

Choose a reason for hiding this comment

Uh oh!

yuzefovich left a comment

Choose a reason for hiding this comment

Uh oh!

andreimatei left a comment

Choose a reason for hiding this comment

Uh oh!

yuzefovich left a comment

Choose a reason for hiding this comment

Uh oh!

andreimatei left a comment

Choose a reason for hiding this comment

Uh oh!

knz commented Dec 18, 2019

Uh oh!

yuzefovich left a comment

Choose a reason for hiding this comment

Uh oh!

andreimatei left a comment

Choose a reason for hiding this comment

Uh oh!

knz commented Dec 18, 2019

Uh oh!

yuzefovich left a comment

Choose a reason for hiding this comment

Uh oh!

andreimatei left a comment

Choose a reason for hiding this comment

Uh oh!

andreimatei left a comment

Choose a reason for hiding this comment

Uh oh!

knz left a comment

Choose a reason for hiding this comment

Uh oh!

andreimatei left a comment

Choose a reason for hiding this comment

Uh oh!

yuzefovich left a comment

Choose a reason for hiding this comment

Uh oh!

andreimatei left a comment

Choose a reason for hiding this comment

Uh oh!

knz left a comment

Choose a reason for hiding this comment

Uh oh!

yuzefovich left a comment

Choose a reason for hiding this comment

Uh oh!

knz left a comment

Choose a reason for hiding this comment

Uh oh!

andreimatei left a comment

Choose a reason for hiding this comment

Uh oh!

yuzefovich left a comment

Choose a reason for hiding this comment

Uh oh!

andreimatei left a comment

Choose a reason for hiding this comment

Uh oh!

knz left a comment

Choose a reason for hiding this comment

Uh oh!

yuzefovich commented Jan 21, 2020

yuzefovich commented Dec 18, 2019 •

edited

Loading