server,autoconfig: automatic configuration via config tasks by knz · Pull Request #98459 · cockroachdb/cockroach

knz · 2023-03-12T16:46:13Z

Epic: CRDB-23559
Informs #98431.
All commits but the last are from #98993.

This change introduces "auto config tasks", a mechanism through which
configuration payloads ("tasks") can be injected into a running SQL
service.

This is driven via the "auto config runner" job that was introduced in
the previous commit. The job listens for the arrival of new task
definitions via a Provider interface. When new tasks are known, and
previous tasks have completed, the runner creates a job for the first
next task.

Release note: None

cockroach-teamcity · 2023-03-12T16:46:20Z

This change is

knz · 2023-03-12T19:32:39Z

@ajwerner I plan to add the additional layer for "pingback" of results in a later PR as discussed in #98431.

knz

I feel like the code could use some polish in terms of dependency definition and plumbing, but it's understandable enough. I'd consider making a structure that can be dependency injected to interact with the job data and what not and then take that also as an interface into the job and task.

I will do this in the next iteration.

Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @adityamaru and @ajwerner)

pkg/server/autoconfig/auto_config.go line 164 at r2 (raw file):

Previously, ajwerner wrote…

It sort of feels like it'd be nice to combine the fetching of the currently started task with the fetching of the latest completed task.

I don't see it - one of them it used in two places, the other just one; also they access different segments of the keyspace so will result in two separate storage scans in any case.

pkg/server/autoconfig/doc.go line 70 at r2 (raw file):

Previously, ajwerner wrote…

One thing I'm not seeing in this PR is how we talk about different environments and the completion of their tasks independently. I expected to see some sort of environment ID or long-running job per environment or something like that.

Is it the responsibility of the enviroments to make sure that the IDs are monotonically increasing just within that environment, or across all environments?

Done - clarified that each environment has a separate sequence.

pkg/server/autoconfig/doc.go line 94 at r2 (raw file):

Previously, ajwerner wrote…

It seems to me that job success is at most once. If these tasks fail with a permanent error, they are not retried by anything. Cancellation is a mechanism to cause failure. One thing we might want to expose is a way to mark a task as required. In jobs parlance, this would mean that it's non-cancelable. A non-cancelable job treats errors as retryable by default (as opposed to permanent), and it prevents user cancelation. I suspect for the hard-coded tasks, we'll want to mark them as uncancelable.

I see where this is coming from but I'd like to delay the introduction of this until we get some experience with the mechanism. In a first release, the hard-coded task (config profiles) can accommodate tasks that don't get executed due to errors. We can inspect their status by other means.

pkg/server/autoconfig/task_markers.go line 57 at r2 (raw file):

Previously, ajwerner wrote…

why are you passing an execCfg and then not using it?

I view an execCfg argument to a function as a code smell

Done.

pkg/server/autoconfig/task_markers.go line 63 at r2 (raw file):

Previously, adityamaru (Aditya Maru) wrote…

makes sense, I just realized that you might not need to load it. You can probably:
j := &jobs.Job{id: jobs.AutoConfigRunnerJobID}
infoStorage := j.InfoStorage(txn)
...
edit: noticed the id field is not exported so you'd have to add a method to create a Job struct if you want to follow this approach

ok done #98993

pkg/server/autoconfig/task_markers.go line 76 at r2 (raw file):

Previously, ajwerner wrote…

why are you passing an execCfg and then not using it?

Done.

pkg/server/autoconfig/task_markers.go line 82 at r2 (raw file):

Previously, adityamaru (Aditya Maru) wrote…

Makes sense, thanks

see what i did in #98993.

pkg/server/autoconfig/task_markers.go line 118 at r2 (raw file):

Previously, ajwerner wrote…

why are you passing an execCfg and then not using it?

Done.

pkg/server/autoconfig/task_markers.go line 152 at r2 (raw file):

Previously, ajwerner wrote…

why are you passing an execCfg and then not using it?

Done.

pkg/server/autoconfig/task_markers.go line 162 at r2 (raw file):

Previously, knz (Raphael 'kena' Poss) wrote…

ok, that i will do.

#98993

pkg/server/autoconfig/acprovider/provider.go line 17 at r2 (raw file):

Previously, ajwerner wrote…

Something about this interface feels a little off. I guess it's the fact that it always hands you all the tasks, and that it seems to be something you tell about task completion, but it doesn't have a clear contract around how that affects behavior. The channel-based API also feels a bit fishy.

Is it important for the client to get more than one task at a time?
What about something like:
// Peek will block waiting for the first task the provider believes still needs to be run.
// It will return an error if the context is canceled while waiting.
Peek(ctx context.Context) (autoconfigpb.Task, error)

// Pop will report the completion of all tasks with ID up to completed.
Pop(ctx context.Context, completed TaskID) error

Done.

pkg/server/autoconfig/acprovider/provider.go line 36 at r2 (raw file):

Previously, ajwerner wrote…

This doesn't implement the contract The channel receives an initial event immediately.

Consider revisiting that.

Done.

98993: jobs: enhance the InfoStorage API r=adityamaru a=knz Needed for #98459. Epic: CRDB-23559 98998: kvserver: unskip TestBehaviorDuringLeaseTransfer r=shralex a=shralex Unskip this test, its not failing locally. Previous teamcity logs are no longer available. Epic: https://cockroachlabs.atlassian.net/browse/CRDB-21501 Fixes: #91948 Release note: None Co-authored-by: Raphael 'kena' Poss <knz@thaumogen.net> Co-authored-by: shralex <shralex@gmail.com>

adityamaru

I did a first pass, this is really great stuff. I only have questions so everything LGTM once those are cleared up.

adityamaru · 2023-03-22T00:07:12Z

pkg/server/autoconfig/doc.go

+// transactionally with the removal of its start marker and the
+// creation of its completion marker.
+//
+// # Possible exceptions to the "exactly-once" semantics


+1, maybe we add a TODO for now @knz to capture this idea of non-cancellable, required tasks?

adityamaru · 2023-03-22T00:11:13Z

pkg/server/autoconfig/task_markers.go

+	// We cannot use the InfoStorage directly here because this function
+	// is called from two different jobs (the runner and the task) but
+	// must always write with the job ID of the runner to job_info.


stale comment?

adityamaru · 2023-03-22T11:57:50Z

pkg/server/autoconfig/auto_config.go

-	// own wait.
-	shutdownCh := execCtx.(sql.JobExecContext).ExecCfg().RPCContext.Stopper.ShouldQuiesce()
+	// shutdown, we find some tests fail unless this job also looks
+	// at the quiesce status itself.


q: is this a temporary work around while we figure out the registry-level fix? If it is, should we add a TODO?

adityamaru · 2023-03-22T13:42:08Z

pkg/server/autoconfig/auto_config_test.go

+
+	ctx := context.Background()
+	s, sqlDB, _ := serverutils.StartServer(t, base.TestServerArgs{
+		DisableDefaultTestTenant: true,


nit: maybe a quick comment explaining why it is disabled.

adityamaru · 2023-03-22T13:45:00Z

pkg/server/autoconfig/task_markers.go

This file seems unit testable, we can do it as a follow-up but I think some tests to exercise the encoding/decoding would be nice.

adityamaru · 2023-03-22T14:22:59Z

pkg/server/autoconfig/auto_config_env_runner.go

+	if prevJobID != 0 {
+		// We have a job already. Just wait for it.
+		log.Infof(ctx, "waiting for task %d, job %d to complete", prevTaskID, prevJobID)
+		if err := execCfg.JobRegistry.Run(ctx, []jobspb.JobID{prevJobID}); err != nil {


The comment on Run states:

// The assumption is that these jobs were created by // this registry and thus are pre-claimed by it.

I don't think that there is any guarantee that prevJobID was created by this node's registry. If the assumption doesn't hold, my read of the code is that the waiting set won't get notified on job completion since it isn't being run by the current registry?

adityamaru · 2023-03-22T14:32:20Z

pkg/server/autoconfig/acprovider/provider.go

+
+	// Pop will report the completion of all tasks with ID up to
+	// completed for the given environment.
+	Pop(ctx context.Context, env autoconfigpb.EnvironmentID, completed autoconfigpb.TaskID) error


I currently see only one caller of Pop when the auto config runner sees the last completed task > the task we are processing. Do we expect each task runner to Pop the task on completion?

knz

This file seems unit testable, we can do it as a follow-up but I think some tests to exercise the encoding/decoding would be nice.

Done.

Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @adityamaru and @ajwerner)

pkg/server/autoconfig/auto_config.go line 66 at r4 (raw file):

Previously, adityamaru (Aditya Maru) wrote…

q: is this a temporary work around while we figure out the registry-level fix? If it is, should we add a TODO?

Done.

pkg/server/autoconfig/auto_config_env_runner.go line 236 at r4 (raw file):

Previously, adityamaru (Aditya Maru) wrote…

The comment on Run states:
// The assumption is that these jobs were created by
// this registry and thus are pre-claimed by it.
I don't think that there is any guarantee that prevJobID was created by this node's registry. If the assumption doesn't hold, my read of the code is that the waiting set won't get notified on job completion since it isn't being run by the current registry?

I had the same concern originally (internal convo)

Bottom line is Run falls back to polling if the job is not locally registered already.

pkg/server/autoconfig/doc.go line 94 at r2 (raw file):

Previously, adityamaru (Aditya Maru) wrote…

+1, maybe we add a TODO for now @knz to capture this idea of non-cancellable, required tasks?

Done.

pkg/server/autoconfig/task_markers.go line 198 at r4 (raw file):

Previously, adityamaru (Aditya Maru) wrote…

stale comment?

Done.

pkg/server/autoconfig/acprovider/provider.go line 42 at r4 (raw file):

Previously, adityamaru (Aditya Maru) wrote…

I currently see only one caller of Pop when the auto config runner sees the last completed task > the task we are processing. Do we expect each task runner to Pop the task on completion?

It's not strictly necessary (the env runner will pick up the completion marker at the next round), but I guess it doesn't hurt. Done.

pkg/server/autoconfig/auto_config_test.go line 141 at r4 (raw file):

Previously, adityamaru (Aditya Maru) wrote…

nit: maybe a quick comment explaining why it is disabled.

actually there was no good reason. Removed.

knz · 2023-03-30T14:59:42Z

@adityamaru I've rebased this on top of the other changes to change the info key to string. RFAL

adityamaru · 2023-04-03T04:19:20Z

Reviewing this is on my list of things for tomorrow. Sorry for the delay @knz

adityamaru

Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner and @knz)

pkg/server/autoconfig/auto_config.go line 66 at r4 (raw file):

Previously, knz (Raphael 'kena' Poss) wrote…

Done.

Given that 99242 has been addressed can we remove this?

Code quote:

https://github.com/cockroachdb/cockroach/issues/99242

pkg/server/autoconfig/auto_config_env_runner.go line 236 at r4 (raw file):

Previously, knz (Raphael 'kena' Poss) wrote…

I had the same concern originally (internal convo)

Bottom line is Run falls back to polling if the job is not locally registered already.

perfect, thanks!

pkg/server/autoconfig/auto_config_test.go line 141 at r4 (raw file):

Previously, knz (Raphael 'kena' Poss) wrote…

actually there was no good reason. Removed.

nit: leftover commented code

knz

Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @adityamaru and @ajwerner)

pkg/server/autoconfig/auto_config.go line 66 at r4 (raw file):

Previously, adityamaru (Aditya Maru) wrote…

Given that 99242 has been addressed can we remove this?

Done. good call!

pkg/server/autoconfig/auto_config_test.go line 141 at r4 (raw file):

Previously, adityamaru (Aditya Maru) wrote…

nit: leftover commented code

Fixed.

This change introduces "auto config tasks", a mechanism through which configuration payloads ("tasks") can be injected into a running SQL service. This is driven via the "auto config runner" job that was introduced in the previous commit. The job listens for the arrival of new environment/task definitions via a `Provider` interface. When new environments are known, it spans "env runner" jobs; each waiting for its own tasks. When new tasks are known, and previous tasks have completed, the "env runner" job creates a new separate job for the first next task. Release note: None

knz · 2023-04-03T19:48:38Z

bors r=adityamaru

craig · 2023-04-03T20:56:22Z

Build succeeded:

Bazel Essential CI (Cockroach)

knz · 2023-04-10T13:57:44Z

blathers backport 23.1

98466: cli,server: static configuration profiles r=stevendanna a=knz Epic: CRDB-23559 Informs #98431. Fixes #94856. (Based off #98459) Supersedes #98380. This change introduces a mechanism through which an operator can select a "configuration profile" via the command-line flag `--config-profile` or env var `COCKROACH_CONFIG_PROFILE`. The SQL initialization defined by the profile is applied during server start-up. The profiles are (currently) hardcoded inside CockroachDB. The following profiles are predefined: - `default`: no configuration. - `multitenant+noapp`: no pre-defined `application` tenant, but with a predefined application tenant template that is used whenever a new tenant is defined. This config profile is meant for use for C2C replication target clusters. - `multitenant+app+sharedservice`: shared-process multitenancy with pre-defined `application` tenant, based off the same configuration as `multitenant+noapp`. Release note: None 101907: multitenant: misc fixes related to tenant capabilities r=arulajmani a=knz See individual commits for details. The last commit in particular probably addresses #99087. Epic: CRDB-23559 101935: sql: add issue number to todo r=rharding6373 a=rharding6373 Epic: none Informs: #101934 Release note: none Co-authored-by: Raphael 'kena' Poss <knz@thaumogen.net> Co-authored-by: rharding6373 <rharding6373@users.noreply.github.com>

knz changed the title ~~jobs: avoid race conditions in tests~~ server: auto config initialization Mar 12, 2023

knz force-pushed the 20230311-auto-config-job branch from 92c3e7d to 919f208 Compare March 12, 2023 16:52

knz force-pushed the 20230312-knz-auto-config-run branch from 6445636 to b3b17ba Compare March 12, 2023 19:18

knz requested a review from ajwerner March 12, 2023 19:18

knz marked this pull request as ready for review March 12, 2023 19:18

knz requested review from a team as code owners March 12, 2023 19:18

knz requested a review from michae2 March 12, 2023 19:18

knz force-pushed the 20230312-knz-auto-config-run branch 2 times, most recently from 807981d to 6fd04dc Compare March 12, 2023 19:25

knz force-pushed the 20230312-knz-auto-config-run branch from 6fd04dc to 9bb9e01 Compare March 12, 2023 22:53

knz requested a review from a team March 12, 2023 22:53

knz mentioned this pull request Mar 12, 2023

cli,server: static configuration profiles #98466

Merged

knz force-pushed the 20230312-knz-auto-config-run branch from 9bb9e01 to f119880 Compare March 12, 2023 23:46

knz force-pushed the 20230311-auto-config-job branch from 919f208 to 4d20185 Compare March 13, 2023 00:06

knz requested a review from a team March 13, 2023 00:06

knz force-pushed the 20230312-knz-auto-config-run branch from f119880 to ce2a284 Compare March 13, 2023 00:11

knz changed the title ~~server: auto config initialization~~ server,autoconfig: automatic configuration via config tasks Mar 13, 2023

knz added the A-multitenancy Related to multi-tenancy label Mar 13, 2023

michae2 removed their request for review March 13, 2023 17:22

knz force-pushed the 20230311-auto-config-job branch 2 times, most recently from ad19f6a to d4e795e Compare March 13, 2023 22:09

knz force-pushed the 20230312-knz-auto-config-run branch from ce2a284 to 59bfefe Compare March 13, 2023 22:10

knz requested a review from a team as a March 13, 2023 22:10

knz commented Mar 19, 2023

View reviewed changes

knz force-pushed the 20230312-knz-auto-config-run branch 3 times, most recently from 5586a3c to d9162bc Compare March 20, 2023 14:02

knz force-pushed the 20230312-knz-auto-config-run branch from d9162bc to b36f0f2 Compare March 21, 2023 15:19

adityamaru reviewed Mar 22, 2023

View reviewed changes

adityamaru self-requested a review March 22, 2023 14:33

knz commented Mar 26, 2023

View reviewed changes

knz force-pushed the 20230312-knz-auto-config-run branch from b36f0f2 to 2e5d038 Compare March 26, 2023 21:46

knz force-pushed the 20230312-knz-auto-config-run branch 2 times, most recently from 46be135 to f5588db Compare April 2, 2023 19:34

adityamaru approved these changes Apr 3, 2023

View reviewed changes

knz commented Apr 3, 2023

View reviewed changes

knz force-pushed the 20230312-knz-auto-config-run branch from f5588db to 20883fd Compare April 3, 2023 19:18

knz force-pushed the 20230312-knz-auto-config-run branch from 20883fd to c79194c Compare April 3, 2023 19:25

craig bot merged commit 04a92b6 into master Apr 3, 2023

craig bot deleted the 20230312-knz-auto-config-run branch April 3, 2023 20:56

blathers-crl bot mentioned this pull request Apr 10, 2023

release-23.1: server,autoconfig: automatic configuration via config tasks #101069

Merged

This was referenced Apr 20, 2023

release-23.1: cli,server: static configuration profiles #101957

Merged

release-23.1.0: cli,server: static configuration profiles #101958

Merged

Conversation

knz commented Mar 12, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

cockroach-teamcity commented Mar 12, 2023

Uh oh!

knz commented Mar 12, 2023

Uh oh!

knz left a comment

Choose a reason for hiding this comment

Uh oh!

adityamaru left a comment

Choose a reason for hiding this comment

Uh oh!

adityamaru Mar 22, 2023

Choose a reason for hiding this comment

Uh oh!

adityamaru Mar 22, 2023

Choose a reason for hiding this comment

Uh oh!

adityamaru Mar 22, 2023

Choose a reason for hiding this comment

Uh oh!

adityamaru Mar 22, 2023

Choose a reason for hiding this comment

Uh oh!

adityamaru Mar 22, 2023

Choose a reason for hiding this comment

Uh oh!

adityamaru Mar 22, 2023

Choose a reason for hiding this comment

Uh oh!

adityamaru Mar 22, 2023

Choose a reason for hiding this comment

Uh oh!

knz left a comment

Choose a reason for hiding this comment

Uh oh!

knz commented Mar 30, 2023

Uh oh!

adityamaru commented Apr 3, 2023

Uh oh!

adityamaru left a comment

Choose a reason for hiding this comment

Uh oh!

knz left a comment

Choose a reason for hiding this comment

Uh oh!

knz commented Apr 3, 2023

Uh oh!

craig bot commented Apr 3, 2023

Uh oh!

knz commented Apr 10, 2023

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

knz commented Mar 12, 2023 •

edited

Loading