Skip to content

roachprod: remove default port assumption in start and start-sql#114097

Merged
craig[bot] merged 1 commit intocockroachdb:masterfrom
DarrylWong:random-port
Dec 13, 2023
Merged

roachprod: remove default port assumption in start and start-sql#114097
craig[bot] merged 1 commit intocockroachdb:masterfrom
DarrylWong:random-port

Conversation

@DarrylWong
Copy link
Copy Markdown
Contributor

@DarrylWong DarrylWong commented Nov 8, 2023

Previously, the default sql port and admin ui port were set to 26257
and 26258 respectively in start and start-sql. This change now removes
that default assignment and now randomly assigns an available open port.

Many roachtests rely on this default port assumption and fail after this
change. To fix this, we now explicity specify the port when applicable.
Some roachtests use a third party library, where the port cannot be easily
passed in and is hardcoded. In these cases, we retain the previous behaviour
by specifying StartOpts.RoachprodOpts.SQLPort.

Epic: None
Fixes: #111052
Release note: none

@DarrylWong DarrylWong added the do-not-merge bors won't merge a PR with this label. label Nov 8, 2023
@cockroach-teamcity
Copy link
Copy Markdown
Member

This change is Reviewable

@DarrylWong DarrylWong force-pushed the random-port branch 6 times, most recently from 2e2af7a to 5392f16 Compare November 16, 2023 19:32
@rafiss
Copy link
Copy Markdown
Collaborator

rafiss commented Nov 16, 2023

drive-by comment: a lot of the ORM tests rely on a known port. i do think there might be a way to refactor a lot of them to be aware of this.

@DarrylWong
Copy link
Copy Markdown
Contributor Author

Thanks for the heads up! And yeah, the ORM tests aren't alone in this. I got over 100 roachtest fails 😢 so refactoring them to be compatible will be the bulk of this PR.

@DarrylWong DarrylWong force-pushed the random-port branch 5 times, most recently from 5b7e957 to df4e698 Compare November 19, 2023 17:01
@renatolabs
Copy link
Copy Markdown

drive-by comment: a lot of the ORM tests rely on a known port

Another drive-by comment: the intention is just to change the default port and remove assumptions on any particular assignment. Individual tests that rely on a specific configuration, such as ORM tests, should be able to choose a ports to bind but that is now made explicit by settingStartOpts.RoachprodOpts.SQLPort to the desired value.

@DarrylWong DarrylWong force-pushed the random-port branch 4 times, most recently from 8923adf to 822a673 Compare November 22, 2023 17:07
@DarrylWong DarrylWong changed the title roachtest: use a randomly available port by default in start and star… roachprod: remove default port assumption in start and start-sql Nov 22, 2023
@DarrylWong DarrylWong removed the do-not-merge bors won't merge a PR with this label. label Nov 22, 2023
@DarrylWong
Copy link
Copy Markdown
Contributor Author

DarrylWong commented Nov 22, 2023

Test runs:

AWS

GCE Lots of new failed tests due to rebasing off master and introducing a new DNS caching bug. Will rerun when fixed.

Debating if I should run the weeklies or not... Maybe I can just run the first hour or so of them since most fail fast?

@DarrylWong DarrylWong force-pushed the random-port branch 3 times, most recently from 2a7184b to a022097 Compare December 1, 2023 15:49
@DarrylWong DarrylWong force-pushed the random-port branch 4 times, most recently from 27617fc to a8f2747 Compare December 5, 2023 16:23
Copy link
Copy Markdown
Contributor Author

@DarrylWong DarrylWong left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @herkolategan and @srosenberg)


pkg/cmd/roachtest/roachtestutil/utils.go line 40 at r1 (raw file):

I suppose most tests don't care about the node id, right?

Correct.

we can randomize the index

Yes, I think that would be easy enough. Unfortunately I think the problem is that this randomization needs to happen outside of this helper.

i.e., if we're calling c.Run(ctx, c.Node(1), "./cockroach workload fixtures import tpcc '{url}'"), the url would have to be for node 1.

The solution would be to refactor all the c.Node(1) calls in the code to instead use RandNode(). but given "In any case, this is probably a moot point since we plan to switch to load-balanced MIGs", I'm not sure if this is worth it, and definitely seems out of the scope of this PR to me.


pkg/cmd/roachtest/tests/admission_control_intent_resolution.go line 74 at r1 (raw file):

Previously, srosenberg (Stan Rosenberg) wrote…

I wonder if we should encode this into TestSpec, since it's a property/requirement of a specific test? By making this requirement implicit, we can't readily extract all tests which support non-default ports, other than through source code analysis.

Hmm, I agree that it would be nice, but it's not obvious to me the best way to do this. These port values are used in c.Start() which has no knowledge of the test spec.

One solution would be to rework DefaultStartOpts() to take in the TestSpec, and return appropriate start opts there based on the TestSpec. The DefaultStartOptsNoBackups() DefaultStartSingleNodeOpts() and DefaultStartOptsInMemory() functions fall under the same line of thinking so we could make those part of TestSpec while we're at it. But that would touch every roachtest and might be out of the scope of this PR?

I'm open to other solutions or thoughts on the above solution! For what it's worth, I do think this would be a far less troublesome refactor than randomizing the node used.


pkg/cmd/roachtest/tests/alterpk.go line 116 at r1 (raw file):

Previously, srosenberg (Stan Rosenberg) wrote…

Remove?

Done.


pkg/cmd/roachtest/tests/tpchvec.go line 580 at r1 (raw file):
I see now that I can combine this cmd and the one below, but I don't think I quite get what you mean by refactoring maybeExpandPgPort. Do you mean making a helper to find the port using c.DiscoverService that can be used both here and in maybeExpandPgPort?

unless this is a one-off.

It's also done in npgsql, so one-off given the above combining commands change.

@DarrylWong
Copy link
Copy Markdown
Contributor Author

Above comments addressed offline with Stan. Reran/fixed the failing roachtests, and confirmed they all pass except multitenant-upgrade which appears to be a unrelated issue.

Will leave this open a bit longer though in case anyone else has comments 😄

Previously, the default sql port and admin ui port were set to 26257
and 26258 respectively in start and start-sql. This change now removes
that default assignment and now randomly assigns an available open port.

Many roachtests rely on this default port assumption and fail after this
change. To fix this, we now explicity specify the port when applicable.
Some roachtests use a third party library, where the port cannot be easily
passed in and is hardcoded. In these cases, we retain the previous behaviour
by specifying StartOpts.RoachprodOpts.SQLPort.
@DarrylWong
Copy link
Copy Markdown
Contributor Author

TFTR!

bors r=srosenberg

@craig
Copy link
Copy Markdown
Contributor

craig bot commented Dec 13, 2023

Build succeeded:

@craig craig bot merged commit c92d216 into cockroachdb:master Dec 13, 2023
@DarrylWong
Copy link
Copy Markdown
Contributor Author

Reminder to self: cherrypick #116452 when this is backported

DarrylWong added a commit to DarrylWong/fork that referenced this pull request Dec 18, 2023
After cockroachdb#114097, we need to specify ports for cockroach node
connections. This one was missed and timeseries were not being
collected in test failures.

Release note: none
Epic: none
craig bot pushed a commit that referenced this pull request Dec 18, 2023
116218: roachtest: run consistency checks after tests r=erikgrinaker a=erikgrinaker

This patch runs replica consistency checks during post-test assertions. Previously, this only checked for stats inconsistencies, it now checks for actual data inconsistencies. This is useful to detect data corruption bugs, typically in e.g. Pebble or Raft snapshots.

The consistency check runs at full speed for up to 20 minutes before timing out, and will report any inconsistencies found during a partial run. This is sufficient to check about 200 GB of data. Tests can opt out by skipping `registry.PostValidationReplicaDivergence`.

Storage checkpoints are not yet taken when inconsistencies are detected, since it requires additional support in SQL builtins to take them at the same Raft log index across nodes. This will be addressed separately.

Touches #115770.
Epic: none
Release note: None

116614: sql: add telemetry for mixed DDL/DML transactions r=rafiss a=rafiss

This patch adds feature counter telemetry for explicit transactions that have schema changes. We track if a transaction has DDL only or a mixture of DDL and DML, and if it succeeded or failed.

fixes #115012
Release note: None

116672: sql: skip TestExperimentalRelocateNonVoters under duress r=rafiss a=rafiss

fixes #115935
Release note: None

116695: roachtest: specify port for post test timeseries collection r=renatolabs a=DarrylWong

After #114097, we need to specify ports for cockroach node connections. This one was missed and timeseries were not being collected in test failures.

Release note: none
Epic: none

Co-authored-by: Erik Grinaker <grinaker@cockroachlabs.com>
Co-authored-by: Rafi Shamim <rafi@cockroachlabs.com>
Co-authored-by: DarrylWong <darryl@cockroachlabs.com>
DarrylWong added a commit to DarrylWong/fork that referenced this pull request Dec 28, 2023
The change cockroachdb#114097 removed default port assumptions for SQLPort and AdminUIPort,
and instead finds open ports to assign. However, Prometheus scraping relies on
this hardcoded AdminUIPort as it can't discover the port itself.

This change temporarily reverts AdminUIPort back to the hardcoded port.

Release note: none
Epic: none
Informs: cockroachdb#117125
craig bot pushed a commit that referenced this pull request Dec 28, 2023
117141: roachprod: revert AdminUIPort back to DefaultAdminUIPort r=srosenberg a=DarrylWong

The change #114097 removed default port assumptions for SQLPort and AdminUIPort, and instead finds open ports to assign. However, Prometheus scraping relies on this hardcoded AdminUIPort as it can't discover the port itself.

This change temporarily reverts AdminUIPort back to the hardcoded port.

Release note: none
Epic: none
Informs: #117125

Co-authored-by: DarrylWong <darryl@cockroachlabs.com>
DarrylWong added a commit to DarrylWong/fork that referenced this pull request Jan 8, 2024
This was originally removed in cockroachdb#115599 due to cockroachdb#114097 merging,
but adminui was reverted in cockroachdb#117141 and mistakenly did not
revert the special case for virtual clusters.

Release note: None
DarrylWong added a commit to DarrylWong/fork that referenced this pull request Jan 8, 2024
This was originally removed in cockroachdb#115599 due to cockroachdb#114097 merging,
but adminui was reverted in cockroachdb#117141 and mistakenly did not
revert the special case for virtual clusters.

Release note: None
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

roachprod: use a randomly available port by default in start and start-sql

5 participants