server,autoconfig: automatic configuration via config tasks#98459
server,autoconfig: automatic configuration via config tasks#98459craig[bot] merged 1 commit intomasterfrom
Conversation
92c3e7d to
919f208
Compare
6445636 to
b3b17ba
Compare
807981d to
6fd04dc
Compare
6fd04dc to
9bb9e01
Compare
9bb9e01 to
f119880
Compare
919f208 to
4d20185
Compare
f119880 to
ce2a284
Compare
ad19f6a to
d4e795e
Compare
ce2a284 to
59bfefe
Compare
knz
left a comment
There was a problem hiding this comment.
I feel like the code could use some polish in terms of dependency definition and plumbing, but it's understandable enough. I'd consider making a structure that can be dependency injected to interact with the job data and what not and then take that also as an interface into the job and task.
I will do this in the next iteration.
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @adityamaru and @ajwerner)
pkg/server/autoconfig/auto_config.go line 164 at r2 (raw file):
Previously, ajwerner wrote…
It sort of feels like it'd be nice to combine the fetching of the currently started task with the fetching of the latest completed task.
I don't see it - one of them it used in two places, the other just one; also they access different segments of the keyspace so will result in two separate storage scans in any case.
pkg/server/autoconfig/doc.go line 70 at r2 (raw file):
Previously, ajwerner wrote…
One thing I'm not seeing in this PR is how we talk about different environments and the completion of their tasks independently. I expected to see some sort of environment ID or long-running job per environment or something like that.
Is it the responsibility of the enviroments to make sure that the IDs are monotonically increasing just within that environment, or across all environments?
Done - clarified that each environment has a separate sequence.
pkg/server/autoconfig/doc.go line 94 at r2 (raw file):
Previously, ajwerner wrote…
It seems to me that job success is at most once. If these tasks fail with a permanent error, they are not retried by anything. Cancellation is a mechanism to cause failure. One thing we might want to expose is a way to mark a task as required. In jobs parlance, this would mean that it's non-cancelable. A non-cancelable job treats errors as retryable by default (as opposed to permanent), and it prevents user cancelation. I suspect for the hard-coded tasks, we'll want to mark them as uncancelable.
I see where this is coming from but I'd like to delay the introduction of this until we get some experience with the mechanism. In a first release, the hard-coded task (config profiles) can accommodate tasks that don't get executed due to errors. We can inspect their status by other means.
pkg/server/autoconfig/task_markers.go line 57 at r2 (raw file):
Previously, ajwerner wrote…
why are you passing an
execCfgand then not using it?I view an execCfg argument to a function as a code smell
Done.
pkg/server/autoconfig/task_markers.go line 63 at r2 (raw file):
Previously, adityamaru (Aditya Maru) wrote…
makes sense, I just realized that you might not need to load it. You can probably:
j := &jobs.Job{id: jobs.AutoConfigRunnerJobID} infoStorage := j.InfoStorage(txn) ...edit: noticed the id field is not exported so you'd have to add a method to create a Job struct if you want to follow this approach
ok done #98993
pkg/server/autoconfig/task_markers.go line 76 at r2 (raw file):
Previously, ajwerner wrote…
why are you passing an
execCfgand then not using it?
Done.
pkg/server/autoconfig/task_markers.go line 82 at r2 (raw file):
Previously, adityamaru (Aditya Maru) wrote…
Makes sense, thanks
see what i did in #98993.
pkg/server/autoconfig/task_markers.go line 118 at r2 (raw file):
Previously, ajwerner wrote…
why are you passing an
execCfgand then not using it?
Done.
pkg/server/autoconfig/task_markers.go line 152 at r2 (raw file):
Previously, ajwerner wrote…
why are you passing an
execCfgand then not using it?
Done.
pkg/server/autoconfig/task_markers.go line 162 at r2 (raw file):
Previously, knz (Raphael 'kena' Poss) wrote…
ok, that i will do.
pkg/server/autoconfig/acprovider/provider.go line 17 at r2 (raw file):
Previously, ajwerner wrote…
Something about this interface feels a little off. I guess it's the fact that it always hands you all the tasks, and that it seems to be something you tell about task completion, but it doesn't have a clear contract around how that affects behavior. The channel-based API also feels a bit fishy.
Is it important for the client to get more than one task at a time?
What about something like:// Peek will block waiting for the first task the provider believes still needs to be run. // It will return an error if the context is canceled while waiting. Peek(ctx context.Context) (autoconfigpb.Task, error) // Pop will report the completion of all tasks with ID up to completed. Pop(ctx context.Context, completed TaskID) error
Done.
pkg/server/autoconfig/acprovider/provider.go line 36 at r2 (raw file):
Previously, ajwerner wrote…
This doesn't implement the contract
The channel receives an initial event immediately.Consider revisiting that.
Done.
5586a3c to
d9162bc
Compare
98993: jobs: enhance the InfoStorage API r=adityamaru a=knz Needed for #98459. Epic: CRDB-23559 98998: kvserver: unskip TestBehaviorDuringLeaseTransfer r=shralex a=shralex Unskip this test, its not failing locally. Previous teamcity logs are no longer available. Epic: https://cockroachlabs.atlassian.net/browse/CRDB-21501 Fixes: #91948 Release note: None Co-authored-by: Raphael 'kena' Poss <knz@thaumogen.net> Co-authored-by: shralex <shralex@gmail.com>
d9162bc to
b36f0f2
Compare
adityamaru
left a comment
There was a problem hiding this comment.
I did a first pass, this is really great stuff. I only have questions so everything LGTM once those are cleared up.
| // transactionally with the removal of its start marker and the | ||
| // creation of its completion marker. | ||
| // | ||
| // # Possible exceptions to the "exactly-once" semantics |
There was a problem hiding this comment.
+1, maybe we add a TODO for now @knz to capture this idea of non-cancellable, required tasks?
| // We cannot use the InfoStorage directly here because this function | ||
| // is called from two different jobs (the runner and the task) but | ||
| // must always write with the job ID of the runner to job_info. |
pkg/server/autoconfig/auto_config.go
Outdated
| // own wait. | ||
| shutdownCh := execCtx.(sql.JobExecContext).ExecCfg().RPCContext.Stopper.ShouldQuiesce() | ||
| // shutdown, we find some tests fail unless this job also looks | ||
| // at the quiesce status itself. |
There was a problem hiding this comment.
q: is this a temporary work around while we figure out the registry-level fix? If it is, should we add a TODO?
|
|
||
| ctx := context.Background() | ||
| s, sqlDB, _ := serverutils.StartServer(t, base.TestServerArgs{ | ||
| DisableDefaultTestTenant: true, |
There was a problem hiding this comment.
nit: maybe a quick comment explaining why it is disabled.
There was a problem hiding this comment.
This file seems unit testable, we can do it as a follow-up but I think some tests to exercise the encoding/decoding would be nice.
| if prevJobID != 0 { | ||
| // We have a job already. Just wait for it. | ||
| log.Infof(ctx, "waiting for task %d, job %d to complete", prevTaskID, prevJobID) | ||
| if err := execCfg.JobRegistry.Run(ctx, []jobspb.JobID{prevJobID}); err != nil { |
There was a problem hiding this comment.
The comment on Run states:
// The assumption is that these jobs were created by
// this registry and thus are pre-claimed by it.
I don't think that there is any guarantee that prevJobID was created by this node's registry. If the assumption doesn't hold, my read of the code is that the waiting set won't get notified on job completion since it isn't being run by the current registry?
|
|
||
| // Pop will report the completion of all tasks with ID up to | ||
| // completed for the given environment. | ||
| Pop(ctx context.Context, env autoconfigpb.EnvironmentID, completed autoconfigpb.TaskID) error |
There was a problem hiding this comment.
I currently see only one caller of Pop when the auto config runner sees the last completed task > the task we are processing. Do we expect each task runner to Pop the task on completion?
knz
left a comment
There was a problem hiding this comment.
This file seems unit testable, we can do it as a follow-up but I think some tests to exercise the encoding/decoding would be nice.
Done.
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @adityamaru and @ajwerner)
pkg/server/autoconfig/auto_config.go line 66 at r4 (raw file):
Previously, adityamaru (Aditya Maru) wrote…
q: is this a temporary work around while we figure out the registry-level fix? If it is, should we add a TODO?
Done.
pkg/server/autoconfig/auto_config_env_runner.go line 236 at r4 (raw file):
Previously, adityamaru (Aditya Maru) wrote…
The comment on
Runstates:// The assumption is that these jobs were created by // this registry and thus are pre-claimed by it.I don't think that there is any guarantee that
prevJobIDwas created by this node's registry. If the assumption doesn't hold, my read of the code is that the waiting set won't get notified on job completion since it isn't being run by the current registry?
I had the same concern originally (internal convo)
Bottom line is Run falls back to polling if the job is not locally registered already.
pkg/server/autoconfig/doc.go line 94 at r2 (raw file):
Previously, adityamaru (Aditya Maru) wrote…
+1, maybe we add a TODO for now @knz to capture this idea of non-cancellable, required tasks?
Done.
pkg/server/autoconfig/task_markers.go line 198 at r4 (raw file):
Previously, adityamaru (Aditya Maru) wrote…
stale comment?
Done.
pkg/server/autoconfig/acprovider/provider.go line 42 at r4 (raw file):
Previously, adityamaru (Aditya Maru) wrote…
I currently see only one caller of Pop when the auto config runner sees the last completed task > the task we are processing. Do we expect each task runner to Pop the task on completion?
It's not strictly necessary (the env runner will pick up the completion marker at the next round), but I guess it doesn't hurt. Done.
pkg/server/autoconfig/auto_config_test.go line 141 at r4 (raw file):
Previously, adityamaru (Aditya Maru) wrote…
nit: maybe a quick comment explaining why it is disabled.
actually there was no good reason. Removed.
b36f0f2 to
2e5d038
Compare
|
@adityamaru I've rebased this on top of the other changes to change the info key to string. RFAL |
46be135 to
f5588db
Compare
|
Reviewing this is on my list of things for tomorrow. Sorry for the delay @knz |
adityamaru
left a comment
There was a problem hiding this comment.
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @ajwerner and @knz)
pkg/server/autoconfig/auto_config.go line 66 at r4 (raw file):
Previously, knz (Raphael 'kena' Poss) wrote…
Done.
Given that 99242 has been addressed can we remove this?
Code quote:
https://github.com/cockroachdb/cockroach/issues/99242pkg/server/autoconfig/auto_config_env_runner.go line 236 at r4 (raw file):
Previously, knz (Raphael 'kena' Poss) wrote…
I had the same concern originally (internal convo)
Bottom line is
Runfalls back to polling if the job is not locally registered already.
perfect, thanks!
pkg/server/autoconfig/auto_config_test.go line 141 at r4 (raw file):
Previously, knz (Raphael 'kena' Poss) wrote…
actually there was no good reason. Removed.
nit: leftover commented code
knz
left a comment
There was a problem hiding this comment.
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @adityamaru and @ajwerner)
pkg/server/autoconfig/auto_config.go line 66 at r4 (raw file):
Previously, adityamaru (Aditya Maru) wrote…
Given that 99242 has been addressed can we remove this?
Done. good call!
pkg/server/autoconfig/auto_config_test.go line 141 at r4 (raw file):
Previously, adityamaru (Aditya Maru) wrote…
nit: leftover commented code
Fixed.
f5588db to
20883fd
Compare
This change introduces "auto config tasks", a mechanism through which
configuration payloads ("tasks") can be injected into a running SQL
service.
This is driven via the "auto config runner" job that was introduced in
the previous commit. The job listens for the arrival of new
environment/task definitions via a `Provider` interface. When new
environments are known, it spans "env runner" jobs; each waiting for
its own tasks. When new tasks are known, and previous tasks have
completed, the "env runner" job creates a new separate job for the
first next task.
Release note: None
20883fd to
c79194c
Compare
|
bors r=adityamaru |
|
Build succeeded: |
|
blathers backport 23.1 |
98466: cli,server: static configuration profiles r=stevendanna a=knz Epic: CRDB-23559 Informs #98431. Fixes #94856. (Based off #98459) Supersedes #98380. This change introduces a mechanism through which an operator can select a "configuration profile" via the command-line flag `--config-profile` or env var `COCKROACH_CONFIG_PROFILE`. The SQL initialization defined by the profile is applied during server start-up. The profiles are (currently) hardcoded inside CockroachDB. The following profiles are predefined: - `default`: no configuration. - `multitenant+noapp`: no pre-defined `application` tenant, but with a predefined application tenant template that is used whenever a new tenant is defined. This config profile is meant for use for C2C replication target clusters. - `multitenant+app+sharedservice`: shared-process multitenancy with pre-defined `application` tenant, based off the same configuration as `multitenant+noapp`. Release note: None 101907: multitenant: misc fixes related to tenant capabilities r=arulajmani a=knz See individual commits for details. The last commit in particular probably addresses #99087. Epic: CRDB-23559 101935: sql: add issue number to todo r=rharding6373 a=rharding6373 Epic: none Informs: #101934 Release note: none Co-authored-by: Raphael 'kena' Poss <knz@thaumogen.net> Co-authored-by: rharding6373 <rharding6373@users.noreply.github.com>
Epic: CRDB-23559
Informs #98431.
All commits but the last are from #98993.
This change introduces "auto config tasks", a mechanism through which
configuration payloads ("tasks") can be injected into a running SQL
service.
This is driven via the "auto config runner" job that was introduced in
the previous commit. The job listens for the arrival of new task
definitions via a
Providerinterface. When new tasks are known, andprevious tasks have completed, the runner creates a job for the first
next task.
Release note: None