Skip to content

sql: setup http router after version initialization#138343

Merged
craig[bot] merged 1 commit intocockroachdb:masterfrom
xinhaoz:fix-version-check-race
Jan 9, 2025
Merged

sql: setup http router after version initialization#138343
craig[bot] merged 1 commit intocockroachdb:masterfrom
xinhaoz:fix-version-check-race

Conversation

@xinhaoz
Copy link
Copy Markdown
Contributor

@xinhaoz xinhaoz commented Jan 6, 2025

Previously the http routes were set up after sql version initialization. As of #127170, the version initialization occurs in the sql server prestart, which occurs just after router setup. It became possible to hit the http routes before the version was set, leading to panics as the http routes rely on the sql version being set.

Epic: none
Fixes: #138342

@cockroach-teamcity
Copy link
Copy Markdown
Member

This change is Reviewable

@xinhaoz xinhaoz force-pushed the fix-version-check-race branch from dec70d6 to db46929 Compare January 6, 2025 20:02
@xinhaoz xinhaoz marked this pull request as ready for review January 6, 2025 20:41
@xinhaoz xinhaoz requested review from a team as code owners January 6, 2025 20:41
@xinhaoz xinhaoz requested review from DarrylWong, kyle-a-wong and srosenberg and removed request for a team January 6, 2025 20:41
func registerHTTPRestart(r registry.Registry) {
r.Add(registry.TestSpec{
Name: "http-restart/mixed-version",
Owner: registry.OwnerTestEng,
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: should test eng own this or obs? I'd argue that obs should own it since its testing that the http server is registering routes at the proper time.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Above reasoning makes sense to me; +1 for obs.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ack. Since we want to keep ownership I kept it in its own test instead of adding to version-upgrade.

}
}

for i := 1; i <= 4; i++ {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should use for _, node := range c.CRDBNodes() {... instead of a hardcoded size

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, hardcoding should be eschewed because it makes the test configuration brittle, i.e., difficult to refactor.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

Copy link
Copy Markdown
Contributor

@DarrylWong DarrylWong left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left a few comments, mostly on my own code 😅

The regression test feels pretty lightweight, I wonder if it needs to be it's own test or if we can stick it in version-upgrade. Maybe only in non local runs if we are worried about keeping the acceptance test light.

}
if err := client.PostProtobuf(ctx, url, &request, &response); err != nil {
if logEvery.ShouldLog() {
l.Printf("Error posting protobuf: %s", err)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: maybe prefix message with node ID

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also wonder if this error checking should be smarter i.e. check that it suceeds at least once. Could definitely see a papercut in the url such that the http req always fails and we never catch it.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

return
default:
}
// Copied from elsewhere, actual HTTP req shouldn't matter
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interesting, I tried changing this to a simpler request (/_status/vars) and couldn't get it to fail after stressing 10 times. Is a tsdb request much faster so we hit the race condition (much) more often?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TSDB request doesn't require auth. from what I recall. (Unless that changed recently.)

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@DarrylWong You were just unlucky in selecting /_status/vars as it doesn't use the authenticated handler.
TSDB reqs don't require any sort of sql role but they do go through the auth http handler.

}

for i := 1; i <= 4; i++ {
mvt.BackgroundFunc("HTTP requests", func(ctx context.Context, l *logger.Logger, rng *rand.Rand, h *mixedversion.Helper) error {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not OnStartup? Its execution always precedes any background func. (See testPlanner.testStartSteps.)

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The test seemed to stall when I put it in OnStartup so I kept it in BackgroundFunc.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

BackgroundFunc runs in a goroutine managed by framework. The framework knows to not block on these steps and expects them to run for the duration of the test. OnStartup is a normal step so the test will block until the step is finished. If you wanted to achieve something similar in OnStartup you'd have to add your own concurrency control with h.GoWithCancel().

To answer Stan's question, there are no startup steps here so the ordering of the hooks is the exact same. So I just went with BackgroundFunc since the ergonomics of it are slightly cleaner imo, no strong preference though.

Ideally it should run before any nodes are started, that way we don't miss out on testing the first time it starts up. There's currently no way to do that in framework though 😞

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ideally it should run before any nodes are started, that way we don't miss out on testing the first time it starts up. There's currently no way to do that in framework though.

Right, that was my thinking too...BackgroundFunc may not be able to catch a regression on every run. I suppose that's a reasonable compromise for now. I wonder if we should create a follow-up issue to add the missing feature? To my knowledge, this is the first such use-case where we do want to execute a user-defined func. immediately at startup.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the timing to pull that off makes it complicated. c.Start handles both the selecting the adminUIPorts and starting CRDB, but we'd need to interleave this hook between the two steps. A workaround would be to hardcode the ports used, but that also isn't supported out of the box since mixed version framework controls the creation of clusters/tenants.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep. We'd need to refactor a bit. E.g., c.Start will take an optional callback func; it's called immediately after we know the port(s), i.e., after generateStartArgs. Then, we expose a new user hook inside the mixedversion framework; it gets invoked before the crdb process is started. It seems rather contrived to satisfy this particular use-case, but some other use-case will probably come up in the future. Let's file it away in a new issue so that we don't have to fish for it; worst-case it will linger like many others from ages ago :)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will do 👍

@xinhaoz xinhaoz force-pushed the fix-version-check-race branch from db46929 to b2564a6 Compare January 8, 2025 14:53
@xinhaoz xinhaoz added backport-24.2.x backport-24.3.x Flags PRs that need to be backported to 24.3 labels Jan 8, 2025
@xinhaoz xinhaoz force-pushed the fix-version-check-race branch 2 times, most recently from e75abbb to 030804e Compare January 8, 2025 16:48
Previously the http routes for secondary tenants were set up
after sql version initialization. As of cockroachdb#127170, the version
initialization occurs in the sql server prestart, which occurs
just after the router setup. It became possible to hit the http
routes before the version was set, leading to fatal errors as
the http routes rely on the sql version being set.

Epic: none
Fixes: cockroachdb#138342

Release note (bug fix): Secondary tenants will not fatal when
issuing HTTP requests during tenant startup.
Copy link
Copy Markdown
Contributor

@kyle-a-wong kyle-a-wong left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

Copy link
Copy Markdown
Contributor

@DarrylWong DarrylWong left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

@xinhaoz
Copy link
Copy Markdown
Contributor Author

xinhaoz commented Jan 9, 2025

TFTR!
bors r+

@craig
Copy link
Copy Markdown
Contributor

craig bot commented Jan 9, 2025

@craig craig bot merged commit 4fe1d00 into cockroachdb:master Jan 9, 2025
@blathers-crl
Copy link
Copy Markdown

blathers-crl bot commented Jan 9, 2025

Based on the specified backports for this PR, I applied new labels to the following linked issue(s). Please adjust the labels as needed to match the branches actually affected by the issue(s), including adding any known older branches.


Issue #138342: branch-release-24.2, branch-release-24.3.


🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

@blathers-crl
Copy link
Copy Markdown

blathers-crl bot commented Jan 9, 2025

Encountered an error creating backports. Some common things that can go wrong:

  1. The backport branch might have already existed.
  2. There was a merge conflict.
  3. The backport branch contained merge commits.

You might need to create your backport manually using the backport tool.


error setting reviewers, but backport branch blathers/backport-release-24.2-138343 is ready: POST https://api.github.com/repos/cockroachdb/cockroach/pulls/138754/requested_reviewers: 422 Reviews may only be requested from collaborators. One or more of the teams you specified is not a collaborator of the cockroachdb/cockroach repository. []

Backport to branch 24.2.x failed. See errors above.


error setting reviewers, but backport branch blathers/backport-release-24.3-138343 is ready: POST https://api.github.com/repos/cockroachdb/cockroach/pulls/138755/requested_reviewers: 422 Reviews may only be requested from collaborators. One or more of the teams you specified is not a collaborator of the cockroachdb/cockroach repository. []

Backport to branch 24.3.x failed. See errors above.


🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

backport-24.3.x Flags PRs that need to be backported to 24.3

Projects

None yet

Development

Successfully merging this pull request may close these issues.

sql: register http routes after version initialization

5 participants