server: do not initialize cluster version to minimum supported#127170
Conversation
|
It looks like your PR touches production code but doesn't add or edit any test code. Did you consider adding tests to your PR? 🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf. |
|
@stevendanna this change has hilariously caused the RPC port that the tenant binds to in |
4d0d894 to
5d287ee
Compare
|
No idea how the cluster version is impacting this port hint code. |
|
Summary of the events: Removing the clusterversion initialization to the minimum supported version in shared-process tenants here led to the RPC port the tenant binds to to change, and as a consequence The reason for this is that we now attempt to read the cluster version from storage earlier (in the
However, the implementation of the Given that servers will scan the port range until they find an open port anyway, there was never a strong guarantee that ports are deterministic (as the test suggests). For this reason, this commit implements an approach where we drop the "index" concept from the orchestrator, and just let servers bind to the first port in the range they find available. If there's a strong reason to keep the "index" idea around, let me know and we can take a different route. |
|
Uff. It's a rather unfortunate side-effect of the current architecture that we can start trying to start up the tenant's sql server before the rest of the system has seen the tenant's state change.
I think my main reason for keeping it as the time of adding the port-finding code was just to reduce the amount of time we spend searching through ports we likely already have used. I know that there is also this oddness over in demo_cluster.go: // The code in NewDemoCluster put the KV ports higher so
// we need to subtract the number of nodes to get back
// to the "good" ports.
//
// We reduce lower bound of the port range by 1 because
// the server controller uses 1-based indexing for
// servers.
args.ApplicationInternalRPCPortMin = rpcPort - (demoCtx.NumNodes + 1)
args.ApplicationInternalRPCPortMax = args.ApplicationInternalRPCPortMin + 1024But I don't think your change here will change that that is trying to do. |
|
I kinda wonder if we should follow up with something that does one or two retries around that initial version look up in the case of particular errors. If we are starting s SQL server for a tenant, then there is a good reason to suspect that this should work if we were to retry quickly. |
|
We can do that, I'll take a stab at it. That said, I'm still not sure whether we should keep the index idea around. I understand it's not ideal to have multiple tenants race to bind to the same ports but also find it quite error-prone/confusing to have a function that assumes one-tenant-per-call ( |
I knew this sounded familiar 😄 cockroach/pkg/server/settingswatcher/settings_watcher.go Lines 263 to 274 in e8357e7 |
5d287ee to
f393011
Compare
|
@stevendanna let me know if something like this is what you had in mind. |
|
@stevendanna friendly ping. |
pkg/server/server_sql.go
Outdated
| // When initializing the cluster version for tenants, we retry on certain | ||
| // errors that could happen if the tenant has not transitioned to | ||
| // its actual service mode yet. |
There was a problem hiding this comment.
[nit] I think we can be a bit more explicit here. We retry if the error indicates that the KV authoriser hasn't seen our service-mode transition. This is possible because both the service controller and authoriser received tenant mode updates asynchronously via rangefeeds.
Previously, shared-process tenants would initialize their view of the cluster version early and set it to the `MinSupportedVersion` as a "safe" option. However, that setting can likely be wrong, especially now that we plan to support skip-version upgrades. In this commit, we remove that code and rely on the initialization that happens from storage in `preStart`. This means that we now expose a non-negligible amount of code to a state where we don't have an initialized cluster version. However, having that error out is better than have code assume a potentially wrong version. A similar discussion and approach happened for separate process tenants a while ago: see cockroachdb#100684 and cockroachdb#104859. Fixes: cockroachdb#126754 Release note: None
f393011 to
e36acf8
Compare
|
TFTR! bors r=stevendanna |
|
Based on the specified backports for this PR, I applied new labels to the following linked issue(s). Please adjust the labels as needed to match the branches actually affected by the issue(s), including adding any known older branches. Issue #126754: branch-release-24.2. 🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf. |
Previously the http routes were set up after sql version initialization. As of cockroachdb#127170, the version initialization occurs in the sql server prestart, which occurs just after router setup. It became possible to hit the http routes before the version was set, leading to panics as the http routes rely on the sql version being set. Epic: none Fixes: cockroachdb#138342
Previously the http routes were set up after sql version initialization. As of cockroachdb#127170, the version initialization occurs in the sql server prestart, which occurs just after router setup. It became possible to hit the http routes before the version was set, leading to panics as the http routes rely on the sql version being set. Epic: none Fixes: cockroachdb#138342
Previously the http routes for secondary tenants were set up after sql version initialization. As of cockroachdb#127170, the version initialization occurs in the sql server prestart, which occurs just after the router setup. It became possible to hit the http routes before the version was set, leading to fatal errors as the http routes rely on the sql version being set. Epic: none Fixes: cockroachdb#138342 Release note (bug fix): Secondary tenants will not fatal when issuing HTTP requests during tenant startup.
138343: sql: setup http router after version initialization r=xinhaoz a=xinhaoz Previously the http routes were set up after sql version initialization. As of #127170, the version initialization occurs in the sql server prestart, which occurs just after router setup. It became possible to hit the http routes before the version was set, leading to panics as the http routes rely on the sql version being set. Epic: none Fixes: #138342 Co-authored-by: Xin Hao Zhang <xzhang@cockroachlabs.com>
Previously the http routes for secondary tenants were set up after sql version initialization. As of #127170, the version initialization occurs in the sql server prestart, which occurs just after the router setup. It became possible to hit the http routes before the version was set, leading to fatal errors as the http routes rely on the sql version being set. Epic: none Fixes: #138342 Release note (bug fix): Secondary tenants will not fatal when issuing HTTP requests during tenant startup.
Previously the http routes for secondary tenants were set up after sql version initialization. As of #127170, the version initialization occurs in the sql server prestart, which occurs just after the router setup. It became possible to hit the http routes before the version was set, leading to fatal errors as the http routes rely on the sql version being set. Epic: none Fixes: #138342 Release note (bug fix): Secondary tenants will not fatal when issuing HTTP requests during tenant startup.
Previously the http routes for secondary tenants were set up after sql version initialization. As of #127170, the version initialization occurs in the sql server prestart, which occurs just after the router setup. It became possible to hit the http routes before the version was set, leading to fatal errors as the http routes rely on the sql version being set. Epic: none Fixes: #138342 Release note (bug fix): Secondary tenants will not fatal when issuing HTTP requests during tenant startup.
Previously the http routes for secondary tenants were set up after sql version initialization. As of #127170, the version initialization occurs in the sql server prestart, which occurs just after the router setup. It became possible to hit the http routes before the version was set, leading to fatal errors as the http routes rely on the sql version being set. Epic: none Fixes: #138342 Release note (bug fix): Secondary tenants will not fatal when issuing HTTP requests during tenant startup.
Previously, shared-process tenants would initialize their view of the cluster version early and set it to the
MinSupportedVersionas a "safe" option. However, that setting can likely be wrong, especially now that we plan to support skip-version upgrades.In this commit, we remove that code and rely on the initialization that happens from storage in
preStart. This means that we now expose a non-negligible amount of code to a state where we don't have an initialized cluster version. However, having that error out is better than have code assume a potentially wrong version.A similar discussion and approach happened for separate process tenants a while ago: see #100684 and #104859.
Fixes: #126754
Release note: None