sql: fix a few issues with reporting of errors to sentry#43301
sql: fix a few issues with reporting of errors to sentry#43301craig[bot] merged 3 commits intocockroachdb:masterfrom
Conversation
yuzefovich
left a comment
There was a problem hiding this comment.
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @andreimatei and @knz)
pkg/sql/conn_executor.go, line 1431 at r1 (raw file):
res.CloseWithErr(pe.errorCause()) } else { ex.recordError(ctx, resErr)
The duplicate report happens because of these two lines - both ex.recordError and res.Close create a report, I'm not sure which one should not be happening. Should we modify res.Close to take an argument whether the report has already been created?
knz
left a comment
There was a problem hiding this comment.
Fantastic find. This must have been painful to troubleshoot. You're a hero.
knz
left a comment
There was a problem hiding this comment.
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @andreimatei and @yuzefovich)
pkg/sql/conn_executor.go, line 1431 at r1 (raw file):
Previously, yuzefovich wrote…
The duplicate report happens because of these two lines - both
ex.recordErrorandres.Closecreate a report, I'm not sure which one should not be happening. Should we modifyres.Closeto take an argument whether the report has already been created?
Unsure what the best approach is.
My recollection is the following:
- we can't make res.Close the only place responsible because if the resultwriter is not pgwire, the error would get lost
- we can't make this code here the only place responsible, because we need to catch the error on other termination paths (e.g. a panic)
Maybe what you suggest is not too bad. It's possible that the number of people who can propose an informed opinion on this has been reduced to just 1 (you) over the last year.
yuzefovich
left a comment
There was a problem hiding this comment.
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @andreimatei and @knz)
pkg/sql/conn_executor.go, line 1431 at r1 (raw file):
Previously, knz (kena) wrote…
Unsure what the best approach is.
My recollection is the following:
- we can't make res.Close the only place responsible because if the resultwriter is not pgwire, the error would get lost
- we can't make this code here the only place responsible, because we need to catch the error on other termination paths (e.g. a panic)
Maybe what you suggest is not too bad. It's possible that the number of people who can propose an informed opinion on this has been reduced to just 1 (you) over the last year.
I'm under the same impression as your recollection of things, so I pushed a second commit to add that argument. PTAL.
andreimatei
left a comment
There was a problem hiding this comment.
For the first commit, where is the sentry reporting happening?
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @andreimatei and @knz)
yuzefovich
left a comment
There was a problem hiding this comment.
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @andreimatei and @knz)
pkg/sql/conn_executor.go, line 1434 at r2 (raw file):
// We have just reported the error, so we tell res.Close not to report // the error to avoid the duplicate error report. res.Close(ctx, stateToTxnStatusIndicator(ex.machine.CurState()), false /* shouldReportError */)
@andreimatei this is where the reporting happens and where the context was needed.
andreimatei
left a comment
There was a problem hiding this comment.
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @knz and @yuzefovich)
pkg/sql/conn_executor.go, line 1434 at r2 (raw file):
Previously, yuzefovich wrote…
@andreimatei this is where the reporting happens and where the context was needed.
pls point me to the code that calls into sentry
|
In |
yuzefovich
left a comment
There was a problem hiding this comment.
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @andreimatei and @knz)
pkg/sql/conn_executor.go, line 1434 at r2 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
pls point me to the code that calls into sentry
pkg/sql/pgwire/conn.go:1190
andreimatei
left a comment
There was a problem hiding this comment.
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @andreimatei and @yuzefovich)
pkg/sql/conn_executor.go, line 1431 at r1 (raw file):
Previously, yuzefovich wrote…
I'm under the same impression as your recollection of things, so I pushed a second commit to add that argument. PTAL.
I think we need a clear contract about who reports what. Given that pgwire is in the reporting game, I think the connExecutor should get out of it; I don't see what benefit we get from sql ever doing it. Perhaps for internal executors... but we don't do it consistently. I'd rather do it in the bufferedCommandResult if we care about internal executor. Or, rather, forget about them.
I also think that the way in which pgwire gets that statement from the context is fairly ridiculous. It's obviously extremely fragile, and we care about that statement enough. Getting random dynamic tags from the context to put them in the report is fine, but the statement is not dynamic. Moreover, pgwire actually has the statement. It doesn't need to get it in an insane way from sql. The constructor of the CommandResult has it, for example. So let's just hang it from the CommandResult.
So, I'd keep the first commit but change the way in which the statement is retrieved for the report, and I'd do something else for the 2nd commit.
pkg/sql/conn_io.go, line 671 at r3 (raw file):
// an error (if present). This should be set to true only if the caller will // take care of creating the report on their own. Close(ctx context.Context, tsi TransactionStatusIndicator, shouldReportError bool)
No bool arguments to high-profile APIs like this! In fact, no bool args ever (discouraged by the style guide too.
Make it an enum if we don't find a better idea.
Also what's the "should"? Should every function Foo be called shouldFoo? :)
Just reportError.
pkg/sql/conn_io.go, line 671 at r3 (raw file):
// an error (if present). This should be set to true only if the caller will // take care of creating the report on their own. Close(ctx context.Context, tsi TransactionStatusIndicator, shouldReportError bool)
The only caller that sets the new argument doesn't need it - the result in question is created just above and has no error. So the arg is not needed.
pkg/sql/conn_io.go, line 681 at r3 (raw file):
// command in the same batch as the one being closed. The contract is that the // next result created corresponds to the first command in the next batch. CloseWithErr(context.Context, error)
It looks to me like we can get rid of this CloseWithErr(). It's only called in one place, which can call SetError() + Close(). Getting rid of it would simplify the reasoning about this PR. Perhaps in a preamble commit.
pkg/sql/pgwire/conn.go, line 1174 at r3 (raw file):
} func (c *conn) bufferErr(ctx context.Context, err error, shouldReportError bool) {
no bools
And I don't think the arg is needed. If the code is simplified, I think this will always be false.
pkg/sql/pgwire/conn.go, line 1188 at r3 (raw file):
} func writeErr(
split this into writeErr() and writeAndReportErr().
yes!
Maybe that's a good idea.
Surely not :) |
yuzefovich
left a comment
There was a problem hiding this comment.
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @andreimatei and @yuzefovich)
pkg/sql/conn_io.go, line 681 at r3 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
It looks to me like we can get rid of this
CloseWithErr(). It's only called in one place, which can callSetError() + Close(). Getting rid of it would simplify the reasoning about this PR. Perhaps in a preamble commit.
You're right. Added such commit (I wanna see that nothing breaks, so I've just pushed it.)
andreimatei
left a comment
There was a problem hiding this comment.
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @andreimatei and @yuzefovich)
pkg/sql/conn_executor.go, line 1431 at r1 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
I think we need a clear contract about who reports what. Given that pgwire is in the reporting game, I think the
connExecutorshould get out of it; I don't see what benefit we get from sql ever doing it. Perhaps for internal executors... but we don't do it consistently. I'd rather do it in thebufferedCommandResultif we care about internal executor. Or, rather, forget about them.
I also think that the way in which pgwire gets that statement from the context is fairly ridiculous. It's obviously extremely fragile, and we care about that statement enough. Getting random dynamic tags from the context to put them in the report is fine, but the statement is not dynamic. Moreover, pgwire actually has the statement. It doesn't need to get it in an insane way from sql. The constructor of theCommandResulthas it, for example. So let's just hang it from theCommandResult.So, I'd keep the first commit but change the way in which the statement is retrieved for the report, and I'd do something else for the 2nd commit.
I'm leaning more towards keeping the reporting of statement execution errors in sql, and not removing it from pgwire. So as soon as a CommandResult is created, the responsibility of reporting its error falls to the ConnExecutor.
andreimatei
left a comment
There was a problem hiding this comment.
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @andreimatei and @yuzefovich)
pkg/sql/conn_executor.go, line 1431 at r1 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
I'm leaning more towards keeping the reporting of statement execution errors in sql, and not removing it from pgwire. So as soon as a CommandResult is created, the responsibility of reporting its error falls to the ConnExecutor.
I meant "and removing it from pgwire"
knz
left a comment
There was a problem hiding this comment.
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @andreimatei and @yuzefovich)
pkg/sql/conn_executor.go, line 1431 at r1 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
I meant "and removing it from pgwire"
I support making SQL generally more responsible for reporting.
Meanwhile we still need a mechanism in pgwire to record errors happening within pgwire.
andreimatei
left a comment
There was a problem hiding this comment.
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @andreimatei and @yuzefovich)
pkg/sql/conn_executor.go, line 1431 at r1 (raw file):
Previously, knz (kena) wrote…
I support making SQL generally more responsible for reporting.
Meanwhile we still need a mechanism in pgwire to record errors happening within pgwire.
yeah, for everything not using newCommandResult (i.e. for all the non-statement commands), pgwire would remain responsible
yuzefovich
left a comment
There was a problem hiding this comment.
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @andreimatei and @knz)
pkg/sql/conn_executor.go, line 1431 at r1 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
yeah, for everything not using
newCommandResult(i.e. for all the non-statement commands), pgwire would remain responsible
Pardon my ignorance, but I don't fully understand where the line between non-statement and statement commands lies. This seems like a bigger undertaking than I had in mind.
However, I looked into changing the way that statement is propagated, and although it is true that commandResult can store tree.Statement, SendReport is being called by other things (I got to util/log logDepth() function) and for those it'll be a lot of plumbing to pass in the statement outside of the context. I decided to not proceed with that change.
I appreciate the thoughtful comments that you guys provided though.
pkg/sql/conn_io.go, line 671 at r3 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
No bool arguments to high-profile APIs like this! In fact, no bool args ever (discouraged by the style guide too.
Make it an enum if we don't find a better idea.Also what's the "should"? Should every function Foo be called shouldFoo? :)
JustreportError.
Introduced an enum.
pkg/sql/conn_io.go, line 671 at r3 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
The only caller that sets the new argument doesn't need it - the result in question is created just above and has no error. So the arg is not needed.
Hm, I don't think it's correct (if there was no error, we would not be seeing duplicated reports).
pkg/sql/pgwire/conn.go, line 1174 at r3 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
no bools
And I don't think the arg is needed. If the code is simplified, I think this will always be false.
No bools, but I believe the argument is needed.
pkg/sql/pgwire/conn.go, line 1188 at r3 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
split this into
writeErr()andwriteAndReportErr().
Done.
andreimatei
left a comment
There was a problem hiding this comment.
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @andreimatei and @yuzefovich)
pkg/sql/conn_executor.go, line 1431 at r1 (raw file):
SendReport is being called by other things
I think it's fine for randos to take whatever they find in the context. In particular, log.Fatal() calls shouldn't happen, so I don't care too much about them.
But errors reported for the execution of SQL statements are not random, and also they're supposed to happen (or maybe not? We report on code == pgcode.Internal || errors.HasAssertionFailure(err) and I forgot when the Internal code happens).
Anyway, the thing I care more about is the contract about who reports what. Passing an extra arg to res.Close() I really don't like, and I don't think it's necessary. The CommandResult interface I care about.
What was the code doing before this PR? Isn't it nonsense?
It has two errors to deal with: resErr and pe. If there's no resErr, then it was passing pe to res.CloseWithError(), and the result was in charge of reporting the one and only error. Cool.
If there was a resErr, then it was recording the resErr (which was already part of the result and hence reported by the result), and it was ignoring pe. Does that make any sense? Perhaps what it wanted to do is record pe, not resErr. As written (before and after your PR), if you simply remove ex.recordError(ctx, resErr), don't all the problems go away while reporting exactly what we already report?
Unless I'm missing something, I'd just get rid of that call and move on.
pkg/sql/conn_executor.go, line 1379 at r6 (raw file):
// If we're about to close the connection, close res in order to flush // now, as we won't have an opportunity to do it later. res.Close(ctx, stateToTxnStatusIndicator(ex.machine.CurState()), sqltelemetry.ErrorNotRecorded)
long line
knz
left a comment
There was a problem hiding this comment.
Reviewed 6 of 6 files at r6.
Reviewable status:complete! 0 of 0 LGTMs obtained (waiting on @andreimatei and @yuzefovich)
pkg/sql/conn_executor.go, line 1436 at r6 (raw file):
} errStatus := sqltelemetry.ErrorNotRecorded if resErr == nil && ok {
Can you add a few comments to explain why errStatus is needed and what are the various combination of causes/circumstances covered by the various branches of this conditional.
IMHO This code is very complex and it is important to help the future maintainer.
pkg/sql/conn_io.go, line 669 at r6 (raw file):
// state that needs to be closed even when the whole connection is about to be // terminated. Close(context.Context, TransactionStatusIndicator, sqltelemetry.RecordErrorStatus)
ditto my comment above about explaining the status more and the various cases that it covers (with examples).
pkg/sql/pgwire/conn.go, line 1177 at r6 (raw file):
if err := writeAndReportErr(ctx, c.sv, err, &c.msgBuilder, &c.writerState.buf); err != nil { panic(fmt.Sprintf("unexpected err from buffer: %s", err))
So I went and investigated this case further. I believe a panic here is not warranted.
tldr:
- make
bufferErrreturnerrorand propagate the error from here - change the remaining call points in
command_result.go(inClose) to log then discard the error.
The reason is that it's actually legitimate for an error to occur here. The client could have disconnected and dropped its connection just before the error packet is sent. In fact it's even likely, for example if a user interrupts a long-running query after they quit the client that issued it. We don't want to panic in those cases.
pkg/sql/sqltelemetry/report.go, line 25 at r6 (raw file):
// RecordErrorStatus indicates whether the error has already been recorded. type RecordErrorStatus int
type foo bool
const a foo = true
const b foo = false- smaller footprint in memory
- you don't need a
defaultin theswitch.
4fd8f7e to
eaec2d0
Compare
yuzefovich
left a comment
There was a problem hiding this comment.
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @andreimatei, @knz, and @yuzefovich)
pkg/sql/conn_executor.go, line 1431 at r1 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
SendReport is being called by other things
I think it's fine for randos to take whatever they find in the context. In particular,
log.Fatal()calls shouldn't happen, so I don't care too much about them.
But errors reported for the execution of SQL statements are not random, and also they're supposed to happen (or maybe not? We report oncode == pgcode.Internal || errors.HasAssertionFailure(err)and I forgot when theInternalcode happens).Anyway, the thing I care more about is the contract about who reports what. Passing an extra arg to
res.Close()I really don't like, and I don't think it's necessary. TheCommandResultinterface I care about.What was the code doing before this PR? Isn't it nonsense?
It has two errors to deal with:resErrandpe. If there's noresErr, then it was passingpetores.CloseWithError(), and the result was in charge of reporting the one and only error. Cool.
If there was aresErr, then it was recording theresErr(which was already part of the result and hence reported by the result), and it was ignoringpe. Does that make any sense? Perhaps what it wanted to do is recordpe, notresErr. As written (before and after your PR), if you simply removeex.recordError(ctx, resErr), don't all the problems go away while reporting exactly what we already report?Unless I'm missing something, I'd just get rid of that call and move on.
I think you're right that the code was not doing what it intended, so I switched to "recording" pe and reporting resErr through pgwire, and I also agree that simply removing ex.recordError(ctx, resErr) would have solved the double reporting problem. Sorry for not seeing it before and making all this fuzz.
Yet, I'm leaning towards actually recording pe. Do you guys know whether pe.errorCause() and res.Err() are usually different? Do we want to report both?
pkg/sql/conn_executor.go, line 1379 at r6 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
long line
Removed.
pkg/sql/conn_executor.go, line 1436 at r6 (raw file):
Previously, knz (kena) wrote…
Can you add a few comments to explain why
errStatusis needed and what are the various combination of causes/circumstances covered by the various branches of this conditional.IMHO This code is very complex and it is important to help the future maintainer.
Removed.
pkg/sql/conn_io.go, line 669 at r6 (raw file):
Previously, knz (kena) wrote…
ditto my comment above about explaining the status more and the various cases that it covers (with examples).
Removed.
pkg/sql/pgwire/conn.go, line 1177 at r6 (raw file):
Previously, knz (kena) wrote…
So I went and investigated this case further. I believe a
panichere is not warranted.
tldr:
- make
bufferErrreturnerrorand propagate the error from here- change the remaining call points in
command_result.go(inClose) to log then discard the error.The reason is that it's actually legitimate for an error to occur here. The client could have disconnected and dropped its connection just before the error packet is sent. In fact it's even likely, for example if a user interrupts a long-running query after they quit the client that issued it. We don't want to panic in those cases.
Makes sense, added another commit.
pkg/sql/sqltelemetry/report.go, line 25 at r6 (raw file):
Previously, knz (kena) wrote…
type foo bool const a foo = true const b foo = false
- smaller footprint in memory
- you don't need a
defaultin theswitch.
Code was removed, but thanks for the suggestion.
knz
left a comment
There was a problem hiding this comment.
LGTM with nits
Reviewed 1 of 3 files at r8, 1 of 1 files at r9, 2 of 2 files at r10.
Reviewable status:complete! 0 of 0 LGTMs obtained (waiting on @andreimatei and @yuzefovich)
pkg/sql/conn_executor.go, line 1431 at r1 (raw file):
Previously, yuzefovich wrote…
I think you're right that the code was not doing what it intended, so I switched to "recording"
peand reportingresErrthroughpgwire, and I also agree that simply removingex.recordError(ctx, resErr)would have solved the double reporting problem. Sorry for not seeing it before and making all this fuzz.Yet, I'm leaning towards actually recording
pe. Do you guys know whetherpe.errorCause()andres.Err()are usually different? Do we want to report both?
I would cut this gordian knot by using errors.CombineErrors to report whichever is set if only one is set, or both if both are set.
pkg/sql/pgwire/command_result.go, line 103 at r10 (raw file):
ctx, r.conn.sv, r.err, &r.conn.msgBuilder, &r.conn.writerState.buf, ); err != nil { log.VErrEvent(ctx, 2 /* level */, err.Error())
- I think you can be a bit clearer about what is going on:
log.VErrEventf(ctx, 2 /* level */, `error in writeErr(): %v`, err)-
do you want to hide these errors by default? I was thinking about
log.Warningfinstead of a trace-only thing. -
(optionally) you can place this logging in a helper function, which you can call in other places where
writeErris called. I provide a few (non-exhaustive) examples below. If you prefer to leave this out of scope, you can file an issue with my name on it and I'll pick it up later.
pkg/sql/pgwire/conn.go, line 349 at r10 (raw file):
// This messages are only acceptable during the auth phase, handled above. err = pgwirebase.NewProtocolViolationErrorf("unexpected authentication data") _ /* err */ = writeErr(
you can emit the error here with the helper too
pkg/sql/pgwire/conn.go, line 460 at r10 (raw file):
_ /* err */ = writeErr(ctx, &sqlServer.GetExecutorConfig().Settings.SV, newAdminShutdownErr(ErrDrainingExistingConn), &c.msgBuilder, &c.writerState.buf) _ /* n */, _ /* err */ = c.writerState.buf.WriteTo(c.conn)
ditto
pkg/sql/pgwire/conn.go, line 542 at r10 (raw file):
// Add a prefix. This also adds a stack trace. retErr = errors.Wrap(retErr, "caught fatal error") _ = writeErr(
here it's different, I believe the error should not be ignored (neither the one below).
This would be retErr = errors.CombineErrors(retErr, err)
pkg/sql/pgwire/conn.go, line 609 at r10 (raw file):
ctx, c.sessionArgs, &c.stmtBuf, c, c.metrics.SQLMemMetrics) if err != nil { _ /* err */ = writeErr(
ditto errors.CombineErrors
pkg/sql/pgwire/conn.go, line 1491 at r10 (raw file):
) error { sendError := func(err error) error { _ /* err */ = writeErr(ctx, &execCfg.Settings.SV, err, &c.msgBuilder, c.conn)
Probably errors.CombineErrors here too.
andreimatei
left a comment
There was a problem hiding this comment.
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @andreimatei and @yuzefovich)
pkg/sql/conn_executor.go, line 1431 at r1 (raw file):
Yet, I'm leaning towards actually recording pe
You mean, you're leaning towards not recording pe?
Do you guys know whether pe.errorCause() and res.Err() are usually different? Do we want to report both?
They're different in some corner cases. For example on query cancelation, resErr is more sanitized than pe. On schema change failure, you have a resErr but not a pe (because schema changes run when the transaction commits, and the events and their payloads are about transaction updates).
So I think we should not record pe.
cf98ff8 to
ab47e59
Compare
yuzefovich
left a comment
There was a problem hiding this comment.
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @andreimatei, @knz, and @yuzefovich)
pkg/sql/conn_executor.go, line 1431 at r1 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
Yet, I'm leaning towards actually recording pe
You mean, you're leaning towards not recording
pe?Do you guys know whether pe.errorCause() and res.Err() are usually different? Do we want to report both?
They're different in some corner cases. For example on query cancelation,
resErris more sanitized thanpe. On schema change failure, you have aresErrbut not ape(because schema changes run when the transaction commits, and the events and their payloads are about transaction updates).
So I think we should not recordpe.
I did mean "recording pe", but your examples of the differences suggest that recording resErr should be sufficient, so I removed the newly added reporting of pe as you're suggesting.
pkg/sql/pgwire/command_result.go, line 103 at r10 (raw file):
Previously, knz (kena) wrote…
- I think you can be a bit clearer about what is going on:
log.VErrEventf(ctx, 2 /* level */, `error in writeErr(): %v`, err)
do you want to hide these errors by default? I was thinking about
log.Warningfinstead of a trace-only thing.(optionally) you can place this logging in a helper function, which you can call in other places where
writeErris called. I provide a few (non-exhaustive) examples below. If you prefer to leave this out of scope, you can file an issue with my name on it and I'll pick it up later.
Nice suggestions, thanks, done.
pkg/sql/pgwire/conn.go, line 349 at r10 (raw file):
Previously, knz (kena) wrote…
you can emit the error here with the helper too
Done.
pkg/sql/pgwire/conn.go, line 460 at r10 (raw file):
Previously, knz (kena) wrote…
ditto
Done.
pkg/sql/pgwire/conn.go, line 542 at r10 (raw file):
Previously, knz (kena) wrote…
here it's different, I believe the error should not be ignored (neither the one below).
This would beretErr = errors.CombineErrors(retErr, err)
Done.
pkg/sql/pgwire/conn.go, line 609 at r10 (raw file):
Previously, knz (kena) wrote…
ditto
errors.CombineErrors
Done.
pkg/sql/pgwire/conn.go, line 1491 at r10 (raw file):
Previously, knz (kena) wrote…
Probably
errors.CombineErrorshere too.
Done.
pkg/sql/pgwire/server.go, line 499 at r14 (raw file):
sendErr := func(err error) error { msgBuilder := newWriteBuffer(s.metrics.BytesOutCount) writeErrWithLogging(ctx, &s.execCfg.Settings.SV, err, msgBuilder, conn)
I don't have a feeling for whether this should be logged or combined with the error to return. Thoughts?
andreimatei
left a comment
There was a problem hiding this comment.
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @andreimatei, @knz, and @yuzefovich)
pkg/sql/pgwire/conn.go, line 1187 at r14 (raw file):
// writeErrWithLogging is the same as writeErr with the addition of logging the // error returned by writeErr. func writeErrWithLogging(
I don't know about this logging... We don't log for other communication errors (or do all communication errors end up in a call to this? I don't think so). And for good reason - client connections dropping are not a noteworthy event - at least not noteworthy for our log files.
pkg/sql/pgwire/conn.go, line 1193 at r14 (raw file):
ctx, sv, err, msgBuilder, w, ); err != nil { log.Warningf(ctx, `error in writeErr(): %v`, err)
nit: s/%v/%s since you know err is not nil. We use %v to tolerate nils.
pkg/sql/pgwire/server.go, line 499 at r14 (raw file):
Previously, yuzefovich wrote…
I don't have a feeling for whether this should be logged or combined with the error to return. Thoughts?
I wouldn't combine it. Per the comment of ServeConn(): An error is returned if the initial handshake of the connection fails.
Failing to send the handshake error to the client is not interesting; combining the error would only blur the reason why the handshake failed.
More importantly, ServeConn doesn't generally return communication errors to the caller. Given that, returning a communication error from the handshake part of the protocol in particular doesn't make much sense.
You can argue that ServeConn() shouldn't return any errors. The reason why it does return handshake errors, I guess, is because we want to log those in particular since they might be important to yell about - they might show some client misconfiguration, bad certs, etc. There's probably a better way to achieve said logging.
knz
left a comment
There was a problem hiding this comment.
I'll review more later, just replying on a nit here.
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @andreimatei, @knz, and @yuzefovich)
pkg/sql/pgwire/conn.go, line 1193 at r14 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
nit: s/%v/%s since you know err is not nil. We use
%vto tolerate nils.
this nit is more a FYI. The code is fine as-is. We use %v throughout for errors.
|
Guys, what do you think about not merging the last commit about auditing error handling and leaving a TODO to do that? The original intention of this PR was simply removing duplicated reports of errors to Sentry. |
|
I'm in favor. You can split the commit in a separate PR. |
Previously, context.Background() was used to record an internal error. That context is missing the registered tags (e.g. 'statement' tag) which results in an incomplete sentry report. Now this is fixed. Release note: None
The behavior of CloseWithErr method can be obtained with SetError followed by Close, so this commit does such refactoring which simplifies the interface. Release note: None
|
Removed the last commit and will open a separate PR to continue discussion in there. |
andreimatei
left a comment
There was a problem hiding this comment.
I really wanted to and agonized for an hour, but I just can't sign off on the first commit. I know you haven't written this code, so you can take Rafa's LGTM and run with it, but I can't put mine. Having something as important as including the statement in the Sentry reports depend on what context res.Close() is called with does not make sense. It's magic for no reason and fragile for no reason. The caller to res.Close() might as well put structured information in the error for the purposes of the report - for example by using the errors.With<foo>(err) patterns.
Also notice that currently because of the context stuff the reports don't make a distinction between errors encountered by execution of a statement or portal vs errors encountered by a prepare. If we make this caller construct some structured info, it'd be trivial to make that distinction.
On log.Fatal(), on the other hand, I think the magic is justified to get as much info as possible. On panics, I similarly believe the magic is justified, except that it looks to me like on the one panic handler that matters, the magic will actually not work because the context there is not the one with the statement in it. So again I think some structured solution needs to be find. For example recovering panics earlier, somewhere in conn_executor_exec.go, and panicing again with an error that has the statement in it.
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @andreimatei, @knz, and @yuzefovich)
pkg/sql/conn_executor.go, line 1438 at r16 (raw file):
res.SetError(pe.errorCause()) } else { ex.recordError(ctx, resErr)
now that this call goes away, there's no reason for recordError to stay a method on the executor. Please get rid of it.
pkg/sql/conn_executor.go, line 1435 at r17 (raw file):
ex.sessionEventf(ctx, "execution error: %s", pe.errorCause()) if resErr == nil { // We use the payload error to set on the result only if 'res' doesn't
This comment is confusing. The first sentence is unnecessary since it just states the code without adding any useful commentary. The 2nd sentence talks about reporting, but no reader will know what you're talking about - the reporting is hidden 20 levels deep. I don't think this code needs any comment besides that comment a few lines up that claims that res.Err() might not be set in case of error events; we'd be doing this SetError() regardless of any reporting concerns.
Previously, in a certain code path both connExecutor and pgwire would record telemetry for the same error to be sent to sentry. This resulted in duplicated events. Now this is fixed. Release note: None
yuzefovich
left a comment
There was a problem hiding this comment.
I appreciate the thoughtful comments that you left, and I extracted them into a separate issue (#44308). I believe that this PR strictly improves the situation, so it is beneficial to merge it. At the moment, I don't have the bandwidth to look into this further, so I'll go ahead and merge it as is.
Thanks for the comments!
bors r+
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @andreimatei, @knz, and @yuzefovich)
pkg/sql/conn_executor.go, line 1438 at r16 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
now that this call goes away, there's no reason for
recordErrorto stay a method on the executor. Please get rid of it.
Done.
pkg/sql/conn_executor.go, line 1435 at r17 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
This comment is confusing. The first sentence is unnecessary since it just states the code without adding any useful commentary. The 2nd sentence talks about reporting, but no reader will know what you're talking about - the reporting is hidden 20 levels deep. I don't think this code needs any comment besides that comment a few lines up that claims that
res.Err()might not be set in case of error events; we'd be doing thisSetError()regardless of any reporting concerns.
Ok, removed.
43301: sql: fix a few issues with reporting of errors to sentry r=yuzefovich a=yuzefovich **sql: use the correct context when recording an error for sentry** Previously, context.Background() was used to record an internal error. That context is missing the registered tags (e.g. 'statement' tag) which results in an incomplete sentry report. Now this is fixed. Release note: None **sql: remove CloseWithErr method from CommandResultClose interface** The behavior of CloseWithErr method can be obtained with SetError followed by Close, so this commit does such refactoring which simplifies the interface. Release note: None **sql: fix double reporting of the same error with sentry** Previously, in a certain code path both connExecutor and pgwire would record telemetry for the same error to be sent to sentry. This resulted in duplicated events. Now this is fixed. Release note: None 44246: build: fix teamcity-compose script r=mjibson a=mjibson 44250: backupccl,importccl: fix privilege checks for BACKUP/RESTORE and IMPORT r=pbardea a=pbardea This changes the privilege checks in IMPORT, IMPORT INTO and RESTORE to run during the *planning* of the job, in the SQL plan hook execution, rather than during the execution of the job. This is done because privilege checks are implemented on planner, and close over the planner's txn in some branches/cases, so invoking them later, on a txn-less planner in a resumed jobs execution, can cause problems. Before this, the planStateHook's txn was assumed to be set and caused a panic on checking RBAC privileges. Additionally, permission checks in these operations did not properly give access to all admin users. Fixes #44252. Release note (bug fix): Allow all admin users to use BACKUP/RESTORE and IMPORT. 44268: sql/sem/builtins: mark timeofday as impure r=mjibson a=mjibson Co-authored-by: Yahor Yuzefovich <yahor@cockroachlabs.com> Co-authored-by: Matt Jibson <matt.jibson@gmail.com> Co-authored-by: Paul Bardea <pbardea@gmail.com>
Build succeeded |
sql: use the correct context when recording an error for sentry
Previously, context.Background() was used to record an internal error.
That context is missing the registered tags (e.g. 'statement' tag) which
results in an incomplete sentry report. Now this is fixed.
Release note: None
sql: remove CloseWithErr method from CommandResultClose interface
The behavior of CloseWithErr method can be obtained with SetError
followed by Close, so this commit does such refactoring which simplifies
the interface.
Release note: None
sql: fix double reporting of the same error with sentry
Previously, in a certain code path both connExecutor and pgwire would
record telemetry for the same error to be sent to sentry. This resulted
in duplicated events. Now this is fixed.
Release note: None