Skip to content

server,autoconfig: automatic configuration via config tasks#98459

Merged
craig[bot] merged 1 commit intomasterfrom
20230312-knz-auto-config-run
Apr 3, 2023
Merged

server,autoconfig: automatic configuration via config tasks#98459
craig[bot] merged 1 commit intomasterfrom
20230312-knz-auto-config-run

Conversation

@knz
Copy link
Copy Markdown
Contributor

@knz knz commented Mar 12, 2023

Epic: CRDB-23559
Informs #98431.
All commits but the last are from #98993.

This change introduces "auto config tasks", a mechanism through which
configuration payloads ("tasks") can be injected into a running SQL
service.

This is driven via the "auto config runner" job that was introduced in
the previous commit. The job listens for the arrival of new task
definitions via a Provider interface. When new tasks are known, and
previous tasks have completed, the runner creates a job for the first
next task.

Release note: None

@cockroach-teamcity
Copy link
Copy Markdown
Member

This change is Reviewable

@knz knz changed the title jobs: avoid race conditions in tests server: auto config initialization Mar 12, 2023
@knz knz force-pushed the 20230311-auto-config-job branch from 92c3e7d to 919f208 Compare March 12, 2023 16:52
@knz knz force-pushed the 20230312-knz-auto-config-run branch from 6445636 to b3b17ba Compare March 12, 2023 19:18
@knz knz requested a review from ajwerner March 12, 2023 19:18
@knz knz marked this pull request as ready for review March 12, 2023 19:18
@knz knz requested review from a team as code owners March 12, 2023 19:18
@knz knz requested a review from michae2 March 12, 2023 19:18
@knz knz force-pushed the 20230312-knz-auto-config-run branch 2 times, most recently from 807981d to 6fd04dc Compare March 12, 2023 19:25
@knz
Copy link
Copy Markdown
Contributor Author

knz commented Mar 12, 2023

@ajwerner I plan to add the additional layer for "pingback" of results in a later PR as discussed in #98431.

@knz knz force-pushed the 20230312-knz-auto-config-run branch from 6fd04dc to 9bb9e01 Compare March 12, 2023 22:53
@knz knz requested a review from a team March 12, 2023 22:53
@knz knz force-pushed the 20230312-knz-auto-config-run branch from 9bb9e01 to f119880 Compare March 12, 2023 23:46
@knz knz force-pushed the 20230311-auto-config-job branch from 919f208 to 4d20185 Compare March 13, 2023 00:06
@knz knz requested a review from a team March 13, 2023 00:06
@knz knz force-pushed the 20230312-knz-auto-config-run branch from f119880 to ce2a284 Compare March 13, 2023 00:11
@knz knz changed the title server: auto config initialization server,autoconfig: automatic configuration via config tasks Mar 13, 2023
@knz knz added the A-multitenancy Related to multi-tenancy label Mar 13, 2023
@michae2 michae2 removed their request for review March 13, 2023 17:22
@knz knz force-pushed the 20230311-auto-config-job branch 2 times, most recently from ad19f6a to d4e795e Compare March 13, 2023 22:09
@knz knz force-pushed the 20230312-knz-auto-config-run branch from ce2a284 to 59bfefe Compare March 13, 2023 22:10
@knz knz requested a review from a team as a March 13, 2023 22:10
Copy link
Copy Markdown
Contributor Author

@knz knz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel like the code could use some polish in terms of dependency definition and plumbing, but it's understandable enough. I'd consider making a structure that can be dependency injected to interact with the job data and what not and then take that also as an interface into the job and task.

I will do this in the next iteration.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @adityamaru and @ajwerner)


pkg/server/autoconfig/auto_config.go line 164 at r2 (raw file):

Previously, ajwerner wrote…

It sort of feels like it'd be nice to combine the fetching of the currently started task with the fetching of the latest completed task.

I don't see it - one of them it used in two places, the other just one; also they access different segments of the keyspace so will result in two separate storage scans in any case.


pkg/server/autoconfig/doc.go line 70 at r2 (raw file):

Previously, ajwerner wrote…

One thing I'm not seeing in this PR is how we talk about different environments and the completion of their tasks independently. I expected to see some sort of environment ID or long-running job per environment or something like that.

Is it the responsibility of the enviroments to make sure that the IDs are monotonically increasing just within that environment, or across all environments?

Done - clarified that each environment has a separate sequence.


pkg/server/autoconfig/doc.go line 94 at r2 (raw file):

Previously, ajwerner wrote…

It seems to me that job success is at most once. If these tasks fail with a permanent error, they are not retried by anything. Cancellation is a mechanism to cause failure. One thing we might want to expose is a way to mark a task as required. In jobs parlance, this would mean that it's non-cancelable. A non-cancelable job treats errors as retryable by default (as opposed to permanent), and it prevents user cancelation. I suspect for the hard-coded tasks, we'll want to mark them as uncancelable.

I see where this is coming from but I'd like to delay the introduction of this until we get some experience with the mechanism. In a first release, the hard-coded task (config profiles) can accommodate tasks that don't get executed due to errors. We can inspect their status by other means.


pkg/server/autoconfig/task_markers.go line 57 at r2 (raw file):

Previously, ajwerner wrote…

why are you passing an execCfg and then not using it?

I view an execCfg argument to a function as a code smell

Done.


pkg/server/autoconfig/task_markers.go line 63 at r2 (raw file):

Previously, adityamaru (Aditya Maru) wrote…

makes sense, I just realized that you might not need to load it. You can probably:

j := &jobs.Job{id: jobs.AutoConfigRunnerJobID}
infoStorage := j.InfoStorage(txn)
...

edit: noticed the id field is not exported so you'd have to add a method to create a Job struct if you want to follow this approach

ok done #98993


pkg/server/autoconfig/task_markers.go line 76 at r2 (raw file):

Previously, ajwerner wrote…

why are you passing an execCfg and then not using it?

Done.


pkg/server/autoconfig/task_markers.go line 82 at r2 (raw file):

Previously, adityamaru (Aditya Maru) wrote…

Makes sense, thanks

see what i did in #98993.


pkg/server/autoconfig/task_markers.go line 118 at r2 (raw file):

Previously, ajwerner wrote…

why are you passing an execCfg and then not using it?

Done.


pkg/server/autoconfig/task_markers.go line 152 at r2 (raw file):

Previously, ajwerner wrote…

why are you passing an execCfg and then not using it?

Done.


pkg/server/autoconfig/task_markers.go line 162 at r2 (raw file):

Previously, knz (Raphael 'kena' Poss) wrote…

ok, that i will do.

#98993


pkg/server/autoconfig/acprovider/provider.go line 17 at r2 (raw file):

Previously, ajwerner wrote…

Something about this interface feels a little off. I guess it's the fact that it always hands you all the tasks, and that it seems to be something you tell about task completion, but it doesn't have a clear contract around how that affects behavior. The channel-based API also feels a bit fishy.

Is it important for the client to get more than one task at a time?
What about something like:

// Peek will block waiting for the first task the provider believes still needs to be run.
// It will return an error if the context is canceled while waiting.
Peek(ctx context.Context) (autoconfigpb.Task, error)

// Pop will report the completion of all tasks with ID up to completed.
Pop(ctx context.Context, completed TaskID) error

Done.


pkg/server/autoconfig/acprovider/provider.go line 36 at r2 (raw file):

Previously, ajwerner wrote…

This doesn't implement the contract The channel receives an initial event immediately.

Consider revisiting that.

Done.

@knz knz force-pushed the 20230312-knz-auto-config-run branch 3 times, most recently from 5586a3c to d9162bc Compare March 20, 2023 14:02
craig bot pushed a commit that referenced this pull request Mar 21, 2023
98993: jobs: enhance the InfoStorage API r=adityamaru a=knz

Needed for #98459.
Epic: CRDB-23559

98998: kvserver: unskip TestBehaviorDuringLeaseTransfer r=shralex a=shralex

Unskip this test, its not failing locally. Previous teamcity logs are no longer available.

Epic: https://cockroachlabs.atlassian.net/browse/CRDB-21501
Fixes: #91948
Release note: None

Co-authored-by: Raphael 'kena' Poss <knz@thaumogen.net>
Co-authored-by: shralex <shralex@gmail.com>
@knz knz force-pushed the 20230312-knz-auto-config-run branch from d9162bc to b36f0f2 Compare March 21, 2023 15:19
Copy link
Copy Markdown
Contributor

@adityamaru adityamaru left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did a first pass, this is really great stuff. I only have questions so everything LGTM once those are cleared up.

// transactionally with the removal of its start marker and the
// creation of its completion marker.
//
// # Possible exceptions to the "exactly-once" semantics
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1, maybe we add a TODO for now @knz to capture this idea of non-cancellable, required tasks?

Comment on lines +196 to +198
// We cannot use the InfoStorage directly here because this function
// is called from two different jobs (the runner and the task) but
// must always write with the job ID of the runner to job_info.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

stale comment?

// own wait.
shutdownCh := execCtx.(sql.JobExecContext).ExecCfg().RPCContext.Stopper.ShouldQuiesce()
// shutdown, we find some tests fail unless this job also looks
// at the quiesce status itself.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

q: is this a temporary work around while we figure out the registry-level fix? If it is, should we add a TODO?


ctx := context.Background()
s, sqlDB, _ := serverutils.StartServer(t, base.TestServerArgs{
DisableDefaultTestTenant: true,
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: maybe a quick comment explaining why it is disabled.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This file seems unit testable, we can do it as a follow-up but I think some tests to exercise the encoding/decoding would be nice.

if prevJobID != 0 {
// We have a job already. Just wait for it.
log.Infof(ctx, "waiting for task %d, job %d to complete", prevTaskID, prevJobID)
if err := execCfg.JobRegistry.Run(ctx, []jobspb.JobID{prevJobID}); err != nil {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The comment on Run states:

// The assumption is that these jobs were created by
// this registry and thus are pre-claimed by it.

I don't think that there is any guarantee that prevJobID was created by this node's registry. If the assumption doesn't hold, my read of the code is that the waiting set won't get notified on job completion since it isn't being run by the current registry?


// Pop will report the completion of all tasks with ID up to
// completed for the given environment.
Pop(ctx context.Context, env autoconfigpb.EnvironmentID, completed autoconfigpb.TaskID) error
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I currently see only one caller of Pop when the auto config runner sees the last completed task > the task we are processing. Do we expect each task runner to Pop the task on completion?

@adityamaru adityamaru self-requested a review March 22, 2023 14:33
Copy link
Copy Markdown
Contributor Author

@knz knz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This file seems unit testable, we can do it as a follow-up but I think some tests to exercise the encoding/decoding would be nice.

Done.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @adityamaru and @ajwerner)


pkg/server/autoconfig/auto_config.go line 66 at r4 (raw file):

Previously, adityamaru (Aditya Maru) wrote…

q: is this a temporary work around while we figure out the registry-level fix? If it is, should we add a TODO?

Done.


pkg/server/autoconfig/auto_config_env_runner.go line 236 at r4 (raw file):

Previously, adityamaru (Aditya Maru) wrote…

The comment on Run states:

// The assumption is that these jobs were created by
// this registry and thus are pre-claimed by it.

I don't think that there is any guarantee that prevJobID was created by this node's registry. If the assumption doesn't hold, my read of the code is that the waiting set won't get notified on job completion since it isn't being run by the current registry?

I had the same concern originally (internal convo)

Bottom line is Run falls back to polling if the job is not locally registered already.


pkg/server/autoconfig/doc.go line 94 at r2 (raw file):

Previously, adityamaru (Aditya Maru) wrote…

+1, maybe we add a TODO for now @knz to capture this idea of non-cancellable, required tasks?

Done.


pkg/server/autoconfig/task_markers.go line 198 at r4 (raw file):

Previously, adityamaru (Aditya Maru) wrote…

stale comment?

Done.


pkg/server/autoconfig/acprovider/provider.go line 42 at r4 (raw file):

Previously, adityamaru (Aditya Maru) wrote…

I currently see only one caller of Pop when the auto config runner sees the last completed task > the task we are processing. Do we expect each task runner to Pop the task on completion?

It's not strictly necessary (the env runner will pick up the completion marker at the next round), but I guess it doesn't hurt. Done.


pkg/server/autoconfig/auto_config_test.go line 141 at r4 (raw file):

Previously, adityamaru (Aditya Maru) wrote…

nit: maybe a quick comment explaining why it is disabled.

actually there was no good reason. Removed.

@knz knz force-pushed the 20230312-knz-auto-config-run branch from b36f0f2 to 2e5d038 Compare March 26, 2023 21:46
@knz
Copy link
Copy Markdown
Contributor Author

knz commented Mar 30, 2023

@adityamaru I've rebased this on top of the other changes to change the info key to string. RFAL

@knz knz force-pushed the 20230312-knz-auto-config-run branch 2 times, most recently from 46be135 to f5588db Compare April 2, 2023 19:34
@adityamaru
Copy link
Copy Markdown
Contributor

Reviewing this is on my list of things for tomorrow. Sorry for the delay @knz

Copy link
Copy Markdown
Contributor

@adityamaru adityamaru left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner and @knz)


pkg/server/autoconfig/auto_config.go line 66 at r4 (raw file):

Previously, knz (Raphael 'kena' Poss) wrote…

Done.

Given that 99242 has been addressed can we remove this?

Code quote:

https://github.com/cockroachdb/cockroach/issues/99242

pkg/server/autoconfig/auto_config_env_runner.go line 236 at r4 (raw file):

Previously, knz (Raphael 'kena' Poss) wrote…

I had the same concern originally (internal convo)

Bottom line is Run falls back to polling if the job is not locally registered already.

perfect, thanks!


pkg/server/autoconfig/auto_config_test.go line 141 at r4 (raw file):

Previously, knz (Raphael 'kena' Poss) wrote…

actually there was no good reason. Removed.

nit: leftover commented code

Copy link
Copy Markdown
Contributor Author

@knz knz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @adityamaru and @ajwerner)


pkg/server/autoconfig/auto_config.go line 66 at r4 (raw file):

Previously, adityamaru (Aditya Maru) wrote…

Given that 99242 has been addressed can we remove this?

Done. good call!


pkg/server/autoconfig/auto_config_test.go line 141 at r4 (raw file):

Previously, adityamaru (Aditya Maru) wrote…

nit: leftover commented code

Fixed.

@knz knz force-pushed the 20230312-knz-auto-config-run branch from f5588db to 20883fd Compare April 3, 2023 19:18
This change introduces "auto config tasks", a mechanism through which
configuration payloads ("tasks") can be injected into a running SQL
service.

This is driven via the "auto config runner" job that was introduced in
the previous commit. The job listens for the arrival of new
environment/task definitions via a `Provider` interface. When new
environments are known, it spans "env runner" jobs; each waiting for
its own tasks. When new tasks are known, and previous tasks have
completed, the "env runner" job creates a new separate job for the
first next task.

Release note: None
@knz knz force-pushed the 20230312-knz-auto-config-run branch from 20883fd to c79194c Compare April 3, 2023 19:25
@knz
Copy link
Copy Markdown
Contributor Author

knz commented Apr 3, 2023

bors r=adityamaru

@craig
Copy link
Copy Markdown
Contributor

craig bot commented Apr 3, 2023

Build succeeded:

@craig craig bot merged commit 04a92b6 into master Apr 3, 2023
@craig craig bot deleted the 20230312-knz-auto-config-run branch April 3, 2023 20:56
@knz
Copy link
Copy Markdown
Contributor Author

knz commented Apr 10, 2023

blathers backport 23.1

craig bot pushed a commit that referenced this pull request Apr 20, 2023
98466: cli,server: static configuration profiles r=stevendanna a=knz

Epic: CRDB-23559
Informs #98431. 
Fixes #94856.
(Based off #98459)
Supersedes #98380.

This change introduces a mechanism through which an operator can
select a "configuration profile" via the command-line flag
`--config-profile` or env var `COCKROACH_CONFIG_PROFILE`. The SQL
initialization defined by the profile is applied during server
start-up.

The profiles are (currently) hardcoded inside CockroachDB.

The following profiles are predefined:

- `default`: no configuration.

- `multitenant+noapp`: no pre-defined `application` tenant, but with a
  predefined application tenant template that is used whenever a new
  tenant is defined. This config profile is meant for use for C2C
  replication target clusters.

- `multitenant+app+sharedservice`: shared-process multitenancy with
  pre-defined `application` tenant, based off the same configuration as
  `multitenant+noapp`.

Release note: None

101907: multitenant: misc fixes related to tenant capabilities r=arulajmani a=knz

See individual commits for details.

The last commit in particular probably addresses #99087.

Epic: CRDB-23559

101935: sql: add issue number to todo r=rharding6373 a=rharding6373

Epic: none
Informs: #101934

Release note: none

Co-authored-by: Raphael 'kena' Poss <knz@thaumogen.net>
Co-authored-by: rharding6373 <rharding6373@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

A-multitenancy Related to multi-tenancy

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants