Skip to content

sql: async writes to system eventlog#86174

Merged
craig[bot] merged 5 commits intocockroachdb:masterfrom
knz:event-log-v2
Aug 21, 2022
Merged

sql: async writes to system eventlog#86174
craig[bot] merged 5 commits intocockroachdb:masterfrom
knz:event-log-v2

Conversation

@knz
Copy link
Copy Markdown
Contributor

@knz knz commented Aug 15, 2022

Informs #73308.
Epic: CRDB-18596

This PR aims to remove the synchronous write to system.eventlog for structured events, which is undesirable:

  • it adds a 5-way write to many small "maintainance" transactions.
  • it forces a global roundtrip in multi-region schema changes.

Instead it makes the writes asynchronous.

This PR subsumes #84983.

See individual commits for details.

Release justification: bug fixes

@cockroach-teamcity
Copy link
Copy Markdown
Member

This change is Reviewable

@knz knz changed the title sql: event log fixes sql: async writes to system eventlog Aug 15, 2022
@knz
Copy link
Copy Markdown
Contributor Author

knz commented Aug 15, 2022

@andreimatei could you review this in replacement to #84983?

It's simpler. If folk like it, we can stop the deprecation process for system.eventlog.

@knz knz marked this pull request as ready for review August 15, 2022 21:58
@knz knz requested review from a team as code owners August 15, 2022 21:58
@knz knz requested a review from a team August 15, 2022 21:58
@knz knz requested review from a team as code owners August 15, 2022 21:58
@knz knz requested review from a team and andreimatei August 15, 2022 21:58
@knz
Copy link
Copy Markdown
Contributor Author

knz commented Aug 15, 2022

FWIW, with the async write enabled, cockroach demo is visibly faster to start already.

@knz knz force-pushed the event-log-v2 branch 5 times, most recently from 84ee0ac to 257086a Compare August 15, 2022 22:37
Copy link
Copy Markdown
Contributor

@andreimatei andreimatei left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @andreimatei and @knz)


-- commits line 4 at r2:
is it really?


-- commits line 19 at r2:
is it really?


-- commits line 42 at r6:
"cluster setting"


-- commits line 49 at r6:
do we really need this cluster setting? I'd vote against it.


pkg/sql/event_log.go line 128 at r2 (raw file):

	// For DDL statements, this is typically the ID of
	// the affected descriptor.
	targetID int32

please put a comment somewhere around the schema definition for system.eventlog saying that the respective column is not populated any more (or is populated with 0)


pkg/sql/event_log.go line 286 at r2 (raw file):

		prevDescID := m.DescriptorID
		*m = commonSQLEventDetails
		if m.DescriptorID == 0 {

I don't really understand what's going on here. Perhaps put a method comment explaining what should and should not be populated in commonSQLEventDetails.


pkg/sql/event_log.go line 534 at r2 (raw file):

	reportingID := ex.s.cfg.NodeInfo.NodeID.SQLInstanceID()
	const colsPerEvent = 5
	// Note: we insert the alue zero as targetID because sadly this

alue


pkg/sql/event_log.go line 534 at r2 (raw file):

	reportingID := ex.s.cfg.NodeInfo.NodeID.SQLInstanceID()
	const colsPerEvent = 5
	// Note: we insert the alue zero as targetID because sadly this

let's add a migration removing that constraint?


pkg/sql/event_log.go line 453 at r4 (raw file):

func insertEventRecords(
	ctx context.Context,
	execCfg *ExecutorConfig,

I'm not sure changing this param is an improvement. The main thing that's needed out of the ExecutorConfig is an InternalExecutor; for the rest of the needed things, the ExecutorConfig is not a very appropriate struct to pass. Between the two, I prefer it how it was.


pkg/sql/event_log.go line 406 at r6 (raw file):

)

var eventLogAsyncWritesMaxAttempts = settings.RegisterIntSetting(

I'd vote we hardcode a number instead of this setting


pkg/sql/event_log.go line 414 at r6 (raw file):

)

var eventLogAsyncWritesAttemptTimeout = settings.RegisterDurationSetting(

I'd vote we hardcode a timeout instead of this setting


pkg/sql/event_log.go line 461 at r6 (raw file):

	// filtering. Easiest is to pretend the event is sourced here.
	//
	// NB: insertEventRecords never returns an error when provided a nil txn.

instead of this NB, assert that the error is nil and put the explanation in the panic


pkg/sql/event_log.go line 551 at r6 (raw file):

	}
	// No: do them async.
	txn.AddCommitTrigger(func(ctx context.Context) {

I'm confused; doesn't this crash if txn is nil?


pkg/sql/event_log.go line 564 at r6 (raw file):

	stopper := execCfg.RPCContext.Stopper
	return stopper.RunAsyncTask(
		context.Background(), "record-events", func(bgCtx context.Context) {

why not use ctx?


pkg/sql/event_log.go line 598 at r6 (raw file):

}

func synchronousWriteToSystemEventsTable(

nit: remove synchronous from the name

@knz knz force-pushed the event-log-v2 branch 3 times, most recently from e53db83 to 24b91bf Compare August 16, 2022 08:57
Copy link
Copy Markdown
Contributor Author

@knz knz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is ready for another look.

NB: the very last commit is currently expected to fail CI due to #86132. I'll rebase when the fix in #86153 merges.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @andreimatei)


-- commits line 4 at r2:

Previously, andreimatei (Andrei Matei) wrote…

is it really?

I've fixed the release justifications throughout.


-- commits line 19 at r2:

Previously, andreimatei (Andrei Matei) wrote…

is it really?

Done.


-- commits line 42 at r6:

Previously, andreimatei (Andrei Matei) wrote…

"cluster setting"

Done.


-- commits line 49 at r6:

Previously, andreimatei (Andrei Matei) wrote…

do we really need this cluster setting? I'd vote against it.

It's not public. We need it for certain logic tests (not all) which is why a testing knob won't do.


pkg/sql/event_log.go line 128 at r2 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

please put a comment somewhere around the schema definition for system.eventlog saying that the respective column is not populated any more (or is populated with 0)

Done.


pkg/sql/event_log.go line 286 at r2 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

I don't really understand what's going on here. Perhaps put a method comment explaining what should and should not be populated in commonSQLEventDetails.

Added comments and extended the commit message to explain what's going on.


pkg/sql/event_log.go line 534 at r2 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

alue

Done.


pkg/sql/event_log.go line 534 at r2 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

let's add a migration removing that constraint?

Added a TODO.


pkg/sql/event_log.go line 453 at r4 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

I'm not sure changing this param is an improvement. The main thing that's needed out of the ExecutorConfig is an InternalExecutor; for the rest of the needed things, the ExecutorConfig is not a very appropriate struct to pass. Between the two, I prefer it how it was.

It's mostly for the benefit of the next commit. I've explained in the commit message.


pkg/sql/event_log.go line 406 at r6 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

I'd vote we hardcode a number instead of this setting

OK, Done.


pkg/sql/event_log.go line 414 at r6 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

I'd vote we hardcode a timeout instead of this setting

Done.


pkg/sql/event_log.go line 461 at r6 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

instead of this NB, assert that the error is nil and put the explanation in the panic

Done.


pkg/sql/event_log.go line 551 at r6 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

I'm confused; doesn't this crash if txn is nil?

It did 😅
Ive fixed it since.


pkg/sql/event_log.go line 564 at r6 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

why not use ctx?

Because ctx is linked to the surrounding query's execution, and gets canceled when the execution stops. The async task writing the event to the log can run after the surrounding SQL statement has completed execution.


pkg/sql/event_log.go line 598 at r6 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

nit: remove synchronous from the name

Done.

Copy link
Copy Markdown
Contributor

@andreimatei andreimatei left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm:

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @andreimatei and @knz)


-- commits line 49 at r6:

Previously, knz (kena) wrote…

It's not public. We need it for certain logic tests (not all) which is why a testing knob won't do.

you can use a cluster-opt: ... line in logic tests to set testing knobs (and add support to logic.go for a new knob). If the tests can get away with setting the knob for the whole duration of the test (and I think here they do). I fell like a knob here would be preferable (even to a non-public setting), but up to you.


-- commits line 69 at r13:
I don't think this should be a separate commit.


pkg/sql/event_log.go line 260 at r9 (raw file):

// have access to a (*planner) and the current statement metadata.
//
// It does so by synthetizing the (shared) CommonSQLEventDetails,

Now what's going on is more clear because of the code comment, but I don't understand what "synthesize" means here. This method takes in commonSQLEventDetails, it doesn't seem to synthesize anything. I'd remove this paragraph.


pkg/sql/event_log.go line 486 at r12 (raw file):

// the run-time type of the event payload.
//
// If the txn field is non-nil and the cluster setting

This comment is hard to parse, in my opinion. If server.eventlog.sync_writes.enabled didn't exist, there'd be no need for it :P


pkg/sql/event_log.go line 490 at r12 (raw file):

// on the given txn object. Otherwise, an asynchronous task is spawned
// to do the write.
// If txn is nil, the event is written asynchronously.

even if txn is not nil, the write is still async (as a commit trigger), right? I think commit triggers should be mentioned in this comment.


pkg/sql/event_log.go line 585 at r12 (raw file):

	stopper := execCfg.RPCContext.Stopper
	if err := stopper.RunAsyncTask(
		context.Background(), "record-events", func(bgCtx context.Context) {

If origCtx stays named origCtx, then you can rename bgCtx to ctx, which would be better in my opinion.
As a nit, I would also rename origCtx to ctx as the parameter name (from the perspective of the function's caller, origCtx doesn't make sense), and I'd do a origCtx := ctx before the RunAsyncTask() call.


pkg/sql/event_log.go line 585 at r12 (raw file):

	stopper := execCfg.RPCContext.Stopper
	if err := stopper.RunAsyncTask(
		context.Background(), "record-events", func(bgCtx context.Context) {

please put a comment around here saying that we don't want to inherit the caller's cancelation


pkg/sql/event_log.go line 620 at r12 (raw file):

			}
		}); err != nil {
		log.Ops.Warningf(origCtx, "failed to start task to log %d events: %v", len(entries), err)

I don't think using the Ops channel here (as opposed to simply log.Infof makes sense). In fact, I wouldn't log this error at all, and definitely not at the warning level. If the server is shutting down and we can't write to the system table any more, I think it's just bad luck; I don't think anyone will benefit from these log message.
Same for the Warning above too.

Copy link
Copy Markdown
Contributor Author

@knz knz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've merged the last two commits as recommended.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @andreimatei)


-- commits line 49 at r6:

Previously, andreimatei (Andrei Matei) wrote…

you can use a cluster-opt: ... line in logic tests to set testing knobs (and add support to logic.go for a new knob). If the tests can get away with setting the knob for the whole duration of the test (and I think here they do). I fell like a knob here would be preferable (even to a non-public setting), but up to you.

Done.


-- commits line 69 at r13:

Previously, andreimatei (Andrei Matei) wrote…

I don't think this should be a separate commit.

Done.


pkg/sql/event_log.go line 260 at r9 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

Now what's going on is more clear because of the code comment, but I don't understand what "synthesize" means here. This method takes in commonSQLEventDetails, it doesn't seem to synthesize anything. I'd remove this paragraph.

Done.


pkg/sql/event_log.go line 486 at r12 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

This comment is hard to parse, in my opinion. If server.eventlog.sync_writes.enabled didn't exist, there'd be no need for it :P

Done.


pkg/sql/event_log.go line 490 at r12 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

even if txn is not nil, the write is still async (as a commit trigger), right? I think commit triggers should be mentioned in this comment.

Done.


pkg/sql/event_log.go line 585 at r12 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

If origCtx stays named origCtx, then you can rename bgCtx to ctx, which would be better in my opinion.
As a nit, I would also rename origCtx to ctx as the parameter name (from the perspective of the function's caller, origCtx doesn't make sense), and I'd do a origCtx := ctx before the RunAsyncTask() call.

Done.


pkg/sql/event_log.go line 585 at r12 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

please put a comment around here saying that we don't want to inherit the caller's cancelation

Done.


pkg/sql/event_log.go line 620 at r12 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

I don't think using the Ops channel here (as opposed to simply log.Infof makes sense). In fact, I wouldn't log this error at all, and definitely not at the warning level. If the server is shutting down and we can't write to the system table any more, I think it's just bad luck; I don't think anyone will benefit from these log message.
Same for the Warning above too.

The one above is definitely useful because perhaps the eventlog table is unavailable, and we want to see symptoms of that.

For the one here, rephrased as an assertion failure. This ensures we get a sentry report for it if it's ever hit.

@craig
Copy link
Copy Markdown
Contributor

craig bot commented Aug 20, 2022

Build failed:

@ajwerner
Copy link
Copy Markdown
Contributor

pkg/sql/schemachanger/scdeps/sctestdeps/test_deps.go:1024:9: cannot use s (variable of type *TestState) as type scexec.EventLogger in return statement:
22:42:27 *TestState does not implement scexec.EventLogger (wrong type for LogEvent method)
22:42:27 have LogEvent(_ context.Context, descID catid.DescID, details eventpb.CommonSQLEventDetails, event logpb.EventPayload) error
22:42:27 want LogEvent(ctx context.Context, details eventpb.CommonSQLEventDetails, event logpb.EventPayload) error

bors r-

@ajwerner
Copy link
Copy Markdown
Contributor

ajwerner commented Aug 20, 2022

Race condition. You got it. Sorry :)

@knz
Copy link
Copy Markdown
Contributor Author

knz commented Aug 20, 2022

@ajwerner what is the PR/branch?

knz added 4 commits August 21, 2022 01:30
This commit removes support for the discrete column `targetID`, which
was ill-defined (many event types don't have a target).

It also simplifies the handling of the descriptor ID, to make it clear
where it is assigned. Prior to this commit, the callers of the event
log code could pass the CommonSQLEventDetails payload and the
descriptor ID separately. Now they can pre-populate the DescriptorID
in the CommonSQLEventDetails directly, so just one argument is
necessary (the event payload).

This in turn simplifies the definition of the eventLogEntry type.

Release justification: low risk, high benefit changes to existing functionality

Release note: None
Release justification: low risk, high benefit changes to existing functionality

Release note: None
We also change the code to pass the ExecutorConfig instead of an
InternalExecutor to the various event log functions.

This is because in the next commit we will be using the execCfg more
directly.

It's shorter to write `execCfg.DB.Txn`,
`execCfg.NodeInfo.SQLInstanceID`, `execCfg.RPCContext.Stopper`, etc.
than indirecting via the InternalExecutor with `ie.s.cfg.DB.Txn`,
`ie.s.cfg.NodeInfo.SQLInstanceID`, `ie.s.cfg.RPCContext.Stopper`, etc.

Release justification: low risk, high benefit changes to existing functionality

Release note: None
Release justification: fixes for high-priority or high-severity bugs in existing functionality

Release note (performance improvement): SQL statement that cause
events to be logged to `system.eventlog` are now able to complete faster.
@knz
Copy link
Copy Markdown
Contributor Author

knz commented Aug 20, 2022

nvm found it

@knz
Copy link
Copy Markdown
Contributor Author

knz commented Aug 21, 2022

bors r=andreimatei

@craig
Copy link
Copy Markdown
Contributor

craig bot commented Aug 21, 2022

Build succeeded:

@craig craig bot merged commit e6a7dc2 into cockroachdb:master Aug 21, 2022
@knz knz deleted the event-log-v2 branch August 21, 2022 09:33
andreimatei added a commit to andreimatei/cockroach that referenced this pull request Aug 29, 2022
In cockroachdb#86174, exporting of events to the Obs Service was doubly-decoupled
from the SQL statement causing the event: it ran in an async task inside
another async task. The inner task was given a context canceled when the
outer task finished - thus introducing a race causing the event to not
be passed along. This was caught by TestEventIngestionIntegration
flakiness.
The second task was unnecessary. Its purpose was to shield the caller
from slow communication with the Obs Service, but the client library
being used already does that (it's non-blocking).

Release note: None
craig bot pushed a commit that referenced this pull request Aug 30, 2022
87054: sql: fix exporting of events to Obs Service r=andreimatei a=andreimatei

In #86174, exporting of events to the Obs Service was doubly-decoupled
from the SQL statement causing the event: it ran in an async task inside
another async task. The inner task was given a context canceled when the
outer task finished - thus introducing a race causing the event to not
be passed along. This was caught by TestEventIngestionIntegration
flakiness.
The second task was unnecessary. Its purpose was to shield the caller
from slow communication with the Obs Service, but the client library
being used already does that (it's non-blocking).

Release note: None
Release justification: bug fix, test flakiness

87130: bazel: set `GORACE` variable appropriately when testing w/ `race` r=rail,srosenberg a=rickystewart

By default the `race` data is output to `stderr`, which means it ends up
in the `test.log` file but NOT the `test.xml` file. Changing it to
`stdout` means that this instead ends up `test.xml` as well as
`test.log` and any output JSON files.

Closes #74105.

Release justification: Non-production code changes
Release note: None

Co-authored-by: Andrei Matei <andrei@cockroachlabs.com>
Co-authored-by: Ricky Stewart <ricky@cockroachlabs.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants