sql: async writes to system eventlog#86174
Conversation
|
@andreimatei could you review this in replacement to #84983? It's simpler. If folk like it, we can stop the deprecation process for system.eventlog. |
|
FWIW, with the async write enabled, |
84ee0ac to
257086a
Compare
andreimatei
left a comment
There was a problem hiding this comment.
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @andreimatei and @knz)
-- commits line 4 at r2:
is it really?
-- commits line 19 at r2:
is it really?
-- commits line 42 at r6:
"cluster setting"
-- commits line 49 at r6:
do we really need this cluster setting? I'd vote against it.
pkg/sql/event_log.go line 128 at r2 (raw file):
// For DDL statements, this is typically the ID of // the affected descriptor. targetID int32
please put a comment somewhere around the schema definition for system.eventlog saying that the respective column is not populated any more (or is populated with 0)
pkg/sql/event_log.go line 286 at r2 (raw file):
prevDescID := m.DescriptorID *m = commonSQLEventDetails if m.DescriptorID == 0 {
I don't really understand what's going on here. Perhaps put a method comment explaining what should and should not be populated in commonSQLEventDetails.
pkg/sql/event_log.go line 534 at r2 (raw file):
reportingID := ex.s.cfg.NodeInfo.NodeID.SQLInstanceID() const colsPerEvent = 5 // Note: we insert the alue zero as targetID because sadly this
alue
pkg/sql/event_log.go line 534 at r2 (raw file):
reportingID := ex.s.cfg.NodeInfo.NodeID.SQLInstanceID() const colsPerEvent = 5 // Note: we insert the alue zero as targetID because sadly this
let's add a migration removing that constraint?
pkg/sql/event_log.go line 453 at r4 (raw file):
func insertEventRecords( ctx context.Context, execCfg *ExecutorConfig,
I'm not sure changing this param is an improvement. The main thing that's needed out of the ExecutorConfig is an InternalExecutor; for the rest of the needed things, the ExecutorConfig is not a very appropriate struct to pass. Between the two, I prefer it how it was.
pkg/sql/event_log.go line 406 at r6 (raw file):
) var eventLogAsyncWritesMaxAttempts = settings.RegisterIntSetting(
I'd vote we hardcode a number instead of this setting
pkg/sql/event_log.go line 414 at r6 (raw file):
) var eventLogAsyncWritesAttemptTimeout = settings.RegisterDurationSetting(
I'd vote we hardcode a timeout instead of this setting
pkg/sql/event_log.go line 461 at r6 (raw file):
// filtering. Easiest is to pretend the event is sourced here. // // NB: insertEventRecords never returns an error when provided a nil txn.
instead of this NB, assert that the error is nil and put the explanation in the panic
pkg/sql/event_log.go line 551 at r6 (raw file):
} // No: do them async. txn.AddCommitTrigger(func(ctx context.Context) {
I'm confused; doesn't this crash if txn is nil?
pkg/sql/event_log.go line 564 at r6 (raw file):
stopper := execCfg.RPCContext.Stopper return stopper.RunAsyncTask( context.Background(), "record-events", func(bgCtx context.Context) {
why not use ctx?
pkg/sql/event_log.go line 598 at r6 (raw file):
} func synchronousWriteToSystemEventsTable(
nit: remove synchronous from the name
e53db83 to
24b91bf
Compare
knz
left a comment
There was a problem hiding this comment.
This is ready for another look.
NB: the very last commit is currently expected to fail CI due to #86132. I'll rebase when the fix in #86153 merges.
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @andreimatei)
Previously, andreimatei (Andrei Matei) wrote…
is it really?
I've fixed the release justifications throughout.
Previously, andreimatei (Andrei Matei) wrote…
is it really?
Done.
Previously, andreimatei (Andrei Matei) wrote…
"cluster setting"
Done.
Previously, andreimatei (Andrei Matei) wrote…
do we really need this cluster setting? I'd vote against it.
It's not public. We need it for certain logic tests (not all) which is why a testing knob won't do.
pkg/sql/event_log.go line 128 at r2 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
please put a comment somewhere around the schema definition for
system.eventlogsaying that the respective column is not populated any more (or is populated with 0)
Done.
pkg/sql/event_log.go line 286 at r2 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
I don't really understand what's going on here. Perhaps put a method comment explaining what should and should not be populated in
commonSQLEventDetails.
Added comments and extended the commit message to explain what's going on.
pkg/sql/event_log.go line 534 at r2 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
alue
Done.
pkg/sql/event_log.go line 534 at r2 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
let's add a migration removing that constraint?
Added a TODO.
pkg/sql/event_log.go line 453 at r4 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
I'm not sure changing this param is an improvement. The main thing that's needed out of the
ExecutorConfigis anInternalExecutor; for the rest of the needed things, theExecutorConfigis not a very appropriate struct to pass. Between the two, I prefer it how it was.
It's mostly for the benefit of the next commit. I've explained in the commit message.
pkg/sql/event_log.go line 406 at r6 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
I'd vote we hardcode a number instead of this setting
OK, Done.
pkg/sql/event_log.go line 414 at r6 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
I'd vote we hardcode a timeout instead of this setting
Done.
pkg/sql/event_log.go line 461 at r6 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
instead of this NB, assert that the error is nil and put the explanation in the panic
Done.
pkg/sql/event_log.go line 551 at r6 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
I'm confused; doesn't this crash if
txnis nil?
It did 😅
Ive fixed it since.
pkg/sql/event_log.go line 564 at r6 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
why not use
ctx?
Because ctx is linked to the surrounding query's execution, and gets canceled when the execution stops. The async task writing the event to the log can run after the surrounding SQL statement has completed execution.
pkg/sql/event_log.go line 598 at r6 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
nit: remove
synchronousfrom the name
Done.
andreimatei
left a comment
There was a problem hiding this comment.
Reviewable status:
complete! 1 of 0 LGTMs obtained (waiting on @andreimatei and @knz)
Previously, knz (kena) wrote…
It's not public. We need it for certain logic tests (not all) which is why a testing knob won't do.
you can use a cluster-opt: ... line in logic tests to set testing knobs (and add support to logic.go for a new knob). If the tests can get away with setting the knob for the whole duration of the test (and I think here they do). I fell like a knob here would be preferable (even to a non-public setting), but up to you.
-- commits line 69 at r13:
I don't think this should be a separate commit.
pkg/sql/event_log.go line 260 at r9 (raw file):
// have access to a (*planner) and the current statement metadata. // // It does so by synthetizing the (shared) CommonSQLEventDetails,
Now what's going on is more clear because of the code comment, but I don't understand what "synthesize" means here. This method takes in commonSQLEventDetails, it doesn't seem to synthesize anything. I'd remove this paragraph.
pkg/sql/event_log.go line 486 at r12 (raw file):
// the run-time type of the event payload. // // If the txn field is non-nil and the cluster setting
This comment is hard to parse, in my opinion. If server.eventlog.sync_writes.enabled didn't exist, there'd be no need for it :P
pkg/sql/event_log.go line 490 at r12 (raw file):
// on the given txn object. Otherwise, an asynchronous task is spawned // to do the write. // If txn is nil, the event is written asynchronously.
even if txn is not nil, the write is still async (as a commit trigger), right? I think commit triggers should be mentioned in this comment.
pkg/sql/event_log.go line 585 at r12 (raw file):
stopper := execCfg.RPCContext.Stopper if err := stopper.RunAsyncTask( context.Background(), "record-events", func(bgCtx context.Context) {
If origCtx stays named origCtx, then you can rename bgCtx to ctx, which would be better in my opinion.
As a nit, I would also rename origCtx to ctx as the parameter name (from the perspective of the function's caller, origCtx doesn't make sense), and I'd do a origCtx := ctx before the RunAsyncTask() call.
pkg/sql/event_log.go line 585 at r12 (raw file):
stopper := execCfg.RPCContext.Stopper if err := stopper.RunAsyncTask( context.Background(), "record-events", func(bgCtx context.Context) {
please put a comment around here saying that we don't want to inherit the caller's cancelation
pkg/sql/event_log.go line 620 at r12 (raw file):
} }); err != nil { log.Ops.Warningf(origCtx, "failed to start task to log %d events: %v", len(entries), err)
I don't think using the Ops channel here (as opposed to simply log.Infof makes sense). In fact, I wouldn't log this error at all, and definitely not at the warning level. If the server is shutting down and we can't write to the system table any more, I think it's just bad luck; I don't think anyone will benefit from these log message.
Same for the Warning above too.
knz
left a comment
There was a problem hiding this comment.
I've merged the last two commits as recommended.
Reviewable status:
complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @andreimatei)
Previously, andreimatei (Andrei Matei) wrote…
you can use a
cluster-opt: ...line in logic tests to set testing knobs (and add support tologic.gofor a new knob). If the tests can get away with setting the knob for the whole duration of the test (and I think here they do). I fell like a knob here would be preferable (even to a non-public setting), but up to you.
Done.
Previously, andreimatei (Andrei Matei) wrote…
I don't think this should be a separate commit.
Done.
pkg/sql/event_log.go line 260 at r9 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
Now what's going on is more clear because of the code comment, but I don't understand what "synthesize" means here. This method takes in
commonSQLEventDetails, it doesn't seem to synthesize anything. I'd remove this paragraph.
Done.
pkg/sql/event_log.go line 486 at r12 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
This comment is hard to parse, in my opinion. If
server.eventlog.sync_writes.enableddidn't exist, there'd be no need for it :P
Done.
pkg/sql/event_log.go line 490 at r12 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
even if
txnis not nil, the write is still async (as a commit trigger), right? I think commit triggers should be mentioned in this comment.
Done.
pkg/sql/event_log.go line 585 at r12 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
If
origCtxstays namedorigCtx, then you can renamebgCtxtoctx, which would be better in my opinion.
As a nit, I would also renameorigCtxtoctxas the parameter name (from the perspective of the function's caller,origCtxdoesn't make sense), and I'd do aorigCtx := ctxbefore theRunAsyncTask()call.
Done.
pkg/sql/event_log.go line 585 at r12 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
please put a comment around here saying that we don't want to inherit the caller's cancelation
Done.
pkg/sql/event_log.go line 620 at r12 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
I don't think using the
Opschannel here (as opposed to simplylog.Infofmakes sense). In fact, I wouldn't log this error at all, and definitely not at the warning level. If the server is shutting down and we can't write to the system table any more, I think it's just bad luck; I don't think anyone will benefit from these log message.
Same for the Warning above too.
The one above is definitely useful because perhaps the eventlog table is unavailable, and we want to see symptoms of that.
For the one here, rephrased as an assertion failure. This ensures we get a sentry report for it if it's ever hit.
|
Build failed: |
bors r- |
|
Race condition. You got it. Sorry :) |
|
@ajwerner what is the PR/branch? |
This commit removes support for the discrete column `targetID`, which was ill-defined (many event types don't have a target). It also simplifies the handling of the descriptor ID, to make it clear where it is assigned. Prior to this commit, the callers of the event log code could pass the CommonSQLEventDetails payload and the descriptor ID separately. Now they can pre-populate the DescriptorID in the CommonSQLEventDetails directly, so just one argument is necessary (the event payload). This in turn simplifies the definition of the eventLogEntry type. Release justification: low risk, high benefit changes to existing functionality Release note: None
Release justification: low risk, high benefit changes to existing functionality Release note: None
We also change the code to pass the ExecutorConfig instead of an InternalExecutor to the various event log functions. This is because in the next commit we will be using the execCfg more directly. It's shorter to write `execCfg.DB.Txn`, `execCfg.NodeInfo.SQLInstanceID`, `execCfg.RPCContext.Stopper`, etc. than indirecting via the InternalExecutor with `ie.s.cfg.DB.Txn`, `ie.s.cfg.NodeInfo.SQLInstanceID`, `ie.s.cfg.RPCContext.Stopper`, etc. Release justification: low risk, high benefit changes to existing functionality Release note: None
Release justification: fixes for high-priority or high-severity bugs in existing functionality Release note (performance improvement): SQL statement that cause events to be logged to `system.eventlog` are now able to complete faster.
|
nvm found it |
|
bors r=andreimatei |
|
Build succeeded: |
In cockroachdb#86174, exporting of events to the Obs Service was doubly-decoupled from the SQL statement causing the event: it ran in an async task inside another async task. The inner task was given a context canceled when the outer task finished - thus introducing a race causing the event to not be passed along. This was caught by TestEventIngestionIntegration flakiness. The second task was unnecessary. Its purpose was to shield the caller from slow communication with the Obs Service, but the client library being used already does that (it's non-blocking). Release note: None
87054: sql: fix exporting of events to Obs Service r=andreimatei a=andreimatei In #86174, exporting of events to the Obs Service was doubly-decoupled from the SQL statement causing the event: it ran in an async task inside another async task. The inner task was given a context canceled when the outer task finished - thus introducing a race causing the event to not be passed along. This was caught by TestEventIngestionIntegration flakiness. The second task was unnecessary. Its purpose was to shield the caller from slow communication with the Obs Service, but the client library being used already does that (it's non-blocking). Release note: None Release justification: bug fix, test flakiness 87130: bazel: set `GORACE` variable appropriately when testing w/ `race` r=rail,srosenberg a=rickystewart By default the `race` data is output to `stderr`, which means it ends up in the `test.log` file but NOT the `test.xml` file. Changing it to `stdout` means that this instead ends up `test.xml` as well as `test.log` and any output JSON files. Closes #74105. Release justification: Non-production code changes Release note: None Co-authored-by: Andrei Matei <andrei@cockroachlabs.com> Co-authored-by: Ricky Stewart <ricky@cockroachlabs.com>
Informs #73308.
Epic: CRDB-18596
This PR aims to remove the synchronous write to system.eventlog for structured events, which is undesirable:
Instead it makes the writes asynchronous.
This PR subsumes #84983.
See individual commits for details.
Release justification: bug fixes