Skip to content

roachtest/mixedversion: sync workload binary versions with the cluster version#153269

Merged
craig[bot] merged 1 commit intocockroachdb:masterfrom
williamchoe3:wchoe/147374-roachtest-mixedversion-sync-workload-binary-version-with-cluster-version
Oct 1, 2025
Merged

roachtest/mixedversion: sync workload binary versions with the cluster version#153269
craig[bot] merged 1 commit intocockroachdb:masterfrom
williamchoe3:wchoe/147374-roachtest-mixedversion-sync-workload-binary-version-with-cluster-version

Conversation

@williamchoe3
Copy link
Copy Markdown
Contributor

@williamchoe3 williamchoe3 commented Sep 9, 2025

Fixes #147374

Problem

The current workload binary is no longer backwards compatible with certain workloads i.e. bank (see original issue above for more information). Previously, the workload binary would just use whatever default binary was staged on the workload node by default. In a test, the test author could explicitly download an older version of workload to circumvent version compatibility issues or use Workload's override (PR) parameter to download a binary matching the cluster version to the workload node when the workload hook is being executed.

Change

While the later's solution works, it makes more sense to always just use a workload binary that matches the cluster. If a test wants to use a specific workload binary version though, that functionality remains possible as well.
During startStep, while we are staging the cluster at it's initial version, we also stage that version's binary all the versioned binaries in the upgrade plan on the workload node(s). Previously we just left the workload node alone at this step. Then, introduced a new test framework specific step stageWorkloadBinaryStep which gets added in test.Planner.afterUpgradeSteps which get added to the plan after a cluster finalizes it's upgrade. stageWorkloadBinaryStep stages the finalized cluster version's binary on the workload node(s).

Removed the call to UploadCockroach from Workload as when the workload init or run hooks are called, the cluster binary will already exist on the workload node. Kept override parameter, but now if override is set, Workload will simply just respect the binary that get's passed into the command instead of defaulting the current cluster version.

  • If a test wants to use a specific Workload binary, they are able to do that if they want without using the Framework's helper function

Added a new helper method to Helper which can be used in hooks to abstract a lower level roachtest framework call

clusterupgrade.BinaryPathForVersion(t.rt, h.System.FromVersion, "cockroach")
--> 
CockroachBinaryForWorkload(t)

Added new mixedversion test option which will add the framework step to stage all versioned cockroach binaries in the upgrade plan to the WorkloadNode

mixedversion.WithWorkloadNodes(c.WorkloadNode())

Example

Given a plan e.g. during the run startup hooks concurrently set of steps

Plan:
Upgrades:           v25.2.5 → <current>
Deployment mode:    system-only
Mutators:           cluster_setting[kv.transaction.write_buffering.enabled], cluster_setting[kv.rangefeed.buffered_sender.enabled]
Plan:
├── install fixtures for version "v25.2.5" (1)
├── start cluster at version "v25.2.5" (2)
├── wait for all nodes (:1-4) to acknowledge cluster version '25.2' on system tenant (3)
├── stage workload binary on workload node(s) :5 for version(s) v25.2.5, <current> (4) <-- NEW
├── run startup hooks concurrently
│   ├── run "maybe enable tenant features", after 30s delay (5)
│   ├── run "load TPCC dataset", after 500ms delay (6)
│   ├── run "load bank dataset", after 30s delay (7)
│   └── set cluster setting "kv.rangefeed.buffered_sender.enabled" to 'true' on system tenant, after 3m0s delay (8)
└── upgrade cluster from "v25.2.5" to "<current>"

Verification

Adhoc Nightly against all mixed version tests https://teamcity.cockroachdb.com/buildConfiguration/Cockroach_Nightlies_RoachtestNightlyGceBazel/20482572?buildTab=overview&expandBuildDeploymentsSection=false&expandBuildChangesSection=true&hideTestsFromDependencies=false&expandBuildTestsSection=true&hideProblemsFromDependencies=false&expandBuildProblemsSection=true

Added a gomock for Cluster in cluster/mock for unit tests

  • Removed duplicate Cluster mock in clusterstats
  • Wanted to colocate the mocks with the actual interface they are mocking. Then each package that wants to use the mock just needs to import that mock instead of creating a new mock in that package and creating a new generated file for that mock

Added datadriven unit test that captures the new post upgrade internal framework step

While looking at Test I saw these test specific fields, (not sure what this specific methodology is called), but just curious because my initial thought was to add a mock Cluster implementation, because I have workload node related logic that needs to call cluster interface methods. I guess i could've maybe went with something like below, but I personally don't like intertwining prod logic and test logic. I guess is there any reason we wouldn't want the mock cluster? Besides the boiler plate it creates in the unit tests?
https://github.com/cockroachdb/cockroach/blob/b4738823bcc55547ff431b99684576370760e56d/pkg/cmd/roachtest/roachtestutil/mixedversion/planner_test.go

	// Test is the main struct callers of this package interact with.
	Test struct {
		// the following are test-only fields, allowing tests to simulate
		// cluster properties without passing a cluster.Cluster
		// implementation.
		_arch      *vm.CPUArch
		_isLocal   *bool
		_getFailer func(name string) (*failures.Failer, error)

mixedversion test refactor

Key
👌: no workload, no change needed
➕: added mixedversion.WithWorkloadNodes(c.WorkloadNode()) test option to stage versioned binaries on workload node
✅: removed hardcoded "./cockroach workload ..." call to something like "%s workload... ", h.CockroachBinaryForWorkload(t)
⚠️: can't add versioned call because workload logic not defined in a user hook.

  • Most of these tests use Workload() which will override the binary with the versioned one by default
  • Some of these tests have logic around the Run cmd so calling Workload doesn't provide an easy fix, but the test should continue to work with the unversioned (default) binary

❌: not refactoring bc it's not using bank and the logic is tighty coupled with shared non mixedversion roachtests so would need to decouple, which I think would make the test logic harder to understand and not worth having a 1 line helper

Test Changes

👌acceptance/validate-system-schema-after-version-upgrade [sql-foundations]   randomized,timeout: 1h0m0s
👌acceptance/version-upgrade [test-eng]   randomized,timeout: 2h0m0s
➕✅admission-control/elastic-workload/mixed-version [kv]   timeout: 3h0m0s
➕⚠️backup-restore/mixed-version [disaster-recovery]   randomized,timeout: 8h0m0s
➕⚠️c2c/mixed-version [disaster-recovery]
  * 2 clusters, 1 workload node, put the test option to upgrade the workload node in both clusters, techniaclly 1 of them might be redundant, but i just wanna be sure by the time either cluster makes a workload call it's good, not sure if it's the 1st cluster, 2nd cluster, or random which one goes first so just putting in both, also binary existence is checked before staging so no redundent downloads will be done
➕✅cdc/mixed-version/checkpointing [cdc]   randomized,timeout: 3h0m0s
➕✅cdc/mixed-versions [cdc]   randomized,timeout: 3h0m0s
👌change-replicas/mixed-version [kv]   randomized,timeout: 1h0m0s
👌db-console/mixed-version-cypress [obs-prs]   timeout: 2h0m0s
* uses workload node for another task that's not running workload so no change is needed
➕✅db-console/mixed-version-endpoints [obs-prs]   randomized,timeout: 1h0m0s
👌declarative_schema_changer/job-compatibility-mixed-version-V242-V243 [sql-foundations]   randomized
👌~~➕✅~~decommission/mixed-versions [kv]   randomized
  * ~~this test didn't have a workload node previously and I didn't see a reason not to use one (unlike some other tests that want the extra pressure on the node)~~
  * data loading is fine to do on a cluster node (see comments below)
👌follower-reads/mixed-version/single-region [kv]   randomized
👌follower-reads/mixed-version/survival=region/locality=global/reads=strong [kv]   randomized
👌http-register-routes/mixed-version [obs-prs]   randomized,timeout: 1h0m0s
~~➕✅~~import/mixed-versions [sql-queries] (skipped: Issue #143870)   randomized
* don't need to use an explicit workload node here (see comments below)
👌jobs/mixed-versions [disaster-recovery]   randomized
➕✅⚠️ldr/mixed-version [disaster-recovery]
  * the init is in a hook, i can user the helper, the run isn't in a hook it uses Workload, so i didn't replace it
➕❌multi-region/mixed-version [test-eng]   randomized,timeout: 36h0m0s
❌multitenant-upgrade [server]   randomized,timeout: 5h0m0s
❌rebalance/by-load/leases/mixed-version [kv]   randomized
❌rebalance/by-load/replicas/mixed-version [kv]   randomized
❌schemachange/mixed-versions [sql-foundations]   randomized
❌schemachange/mixed-versions-compat [sql-foundations]
❌schemachange/secondary-index-multi-version [sql-foundations]   randomized
➕⚠️sql-stats/mixed-version [obs-prs]   randomized,timeout: 1h0m0s
➕✅tpcc/mixed-headroom/chaos/n6cpu16 [test-eng]   randomized,timeout: 7h0m0s
➕✅tpcc/mixed-headroom/n5cpu16 [test-eng]   randomized,timeout: 7h0m0s
👌validate-system-schema-after-version-upgrade/separate-process [sql-foundations]

Notes

pkg/cmd/roachtest/clusterstats/BUILD.bazel I made some edits initially, removed them, and now ./dev gen is generating go_library / go_test deps in a different order 🤷‍♂️

  • Saw in pkg/cmd/roachtest/roachtestutil/clusterupgrade/clusterupgrade.go in clusterupgrade pkg, UploadWorkload but chose not to use it because it's a currently unused path that also eventually calls a deprecated Cluster API function and UploadCockroach & UploadWorkload are both just wrappers around uploadBinaryVersion

@cockroach-teamcity
Copy link
Copy Markdown
Member

This change is Reviewable

@williamchoe3 williamchoe3 force-pushed the wchoe/147374-roachtest-mixedversion-sync-workload-binary-version-with-cluster-version branch from 69f6bfb to d25794d Compare September 15, 2025 19:11
@williamchoe3 williamchoe3 changed the title init commit just print statements roachtest/mixedversion: sync workload binary versions with the cluster version Sep 15, 2025
@williamchoe3 williamchoe3 force-pushed the wchoe/147374-roachtest-mixedversion-sync-workload-binary-version-with-cluster-version branch from 7314118 to 8baf4cd Compare September 15, 2025 21:06
@blathers-crl
Copy link
Copy Markdown

blathers-crl bot commented Sep 16, 2025

Your pull request contains more than 1000 changes. It is strongly encouraged to split big PRs into smaller chunks.

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

@williamchoe3 williamchoe3 force-pushed the wchoe/147374-roachtest-mixedversion-sync-workload-binary-version-with-cluster-version branch from b95c845 to 055900b Compare September 17, 2025 23:01
@williamchoe3 williamchoe3 marked this pull request as ready for review September 18, 2025 14:28
@williamchoe3 williamchoe3 requested review from a team as code owners September 18, 2025 14:28
@williamchoe3 williamchoe3 requested review from nameisbhaskar and shailendra-patel and removed request for a team September 18, 2025 14:28
@williamchoe3
Copy link
Copy Markdown
Contributor Author

Not all tests passing yet (5/29 failing), debugging multi cluster scenarios but opening up for review for general feedback

// exists, assume the binary has already been uploaded previously. Returns the
// path of the uploaded binaries on the nodes.
//
// If --versions-binary-override option is set and if version v is contained in
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: only if we are uploading the cockroach binary

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so in other words for mixedversion to use --versions-binary-override, you have to use --cockroach ...? I don't know the CI roachtest call / if --cockroach-stage ... is used, just after reading the logic of the override binary version

defaultBinary, isOverridden = t.VersionsBinaryOverride()[v.String()]

it seems I don't see right away why it wouldn't work if --cockroach-stage is set. No worries on finding the that logic though, I believe you

just wondering if those args aren't meant to be used together should i mark it out them out as incompatible in main?

(the reason I wrote that comment in the first place was the behavior wasn't the clearest to me in a first pass so just trying to make things more digestible)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I more so meant that we only support overriding the cockroach binary, i.e. you can't override the DeprecatedWorkload()

// TODO(testeng): Replace with https://github.com/cockroachdb/cockroach/issues/147374
// By default, the binary used to run the command(s) will be the same as the
// the current version of the cluster at the time this hook is executed.
// This is because the binary version is no longer backwards compatible as of
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: this is a bit overgeneralizing. Only some of the workloads, e.g. bank, are no longer backwards compatible. We can just say that we assume the workload is not backwards compatible.

// version as the cockroach cluster.
// TODO(testeng): Replace with https://github.com/cockroachdb/cockroach/issues/147374
binary := uploadCockroach(ctx, t, c, randomNode, h.System.FromVersion)
binary := clusterupgrade.BinaryPathForVersion(t, h.System.FromVersion, "cockroach")
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can hide this behind a h.CockroachBinary method.

That way the test can just call cmd := roachtestutil.NewCommand("%s workload fixtures import bank", h.CockroachBinary) and not think about what version we need.

Copy link
Copy Markdown
Contributor Author

@williamchoe3 williamchoe3 Sep 19, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So this works for workloads defined in user defined hooks like admission-control/elastic-workload/mixed-version that have access to helper

			runForeground := func(ctx context.Context, duration time.Duration, h *mixedversion.Helper) error {
				cmd := roachtestutil.NewCommand("%s workload run kv "+
					"%s --concurrency=500 "+
					"--max-rate=5000 --read-percent=5 "+
					"--min-block-bytes=512 --max-block-bytes=1024 "+
					"--txn-qos='regular' "+
					"--duration=%v {pgurl%s}",
					h.CockroachBinaryForWorkload(t), roachtestutil.GetWorkloadHistogramArgs(t, c, labels), duration,
					c.CRDBNodes())
				return c.RunE(ctx, option.WithNodes(c.WorkloadNode()), cmd.String())
			}

but in workloads that are defined in the outer regiester_test() like in backup-restore/mixed-version, there's no access to Helper (if this is wrong let me know)

func bankWorkloadCmd(
	l *logger.Logger, testRNG *rand.Rand, roachNodes option.NodeListOption, mock bool,
) (init *roachtestutil.Command, run *roachtestutil.Command) {
...
	init = roachtestutil.NewCommand("./cockroach workload init bank").
		Flag("rows", bankRows).
		MaybeFlag(bankPayload != 0, "payload-bytes", bankPayload).
		Flag("ranges", 0).
		Arg("{pgurl%s}", roachNodes)

so because of this i think the override logic is will needed in Workload

for all usages in user defined hooks i'll switch to the helper

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

there's no access to Helper

Right, it's used outside of mixed version tests so we don't want to pass the Helper there. Can we instead just pass a binary string?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure, because for clusterupgrade.BinaryPathForVersion(t, h.System.FromVersion, "cockroach"), that needs a version that i was relying on Helper to get. Calling Workload get's around this because it defines the command in the hook which has access to Helper
Will see if I can think of another way

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the main concern is to get rid of the override boolean in Workload, could also have a Workload function and a WorkloadWithOverrideBinary function

Copy link
Copy Markdown
Contributor

@DarrylWong DarrylWong Sep 21, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ahh I see the issue, that makes sense. In that case, I'd still lean towards removing it since we currently have nothing that sets OverrideBinary to true. Also remember that Workload is a convenience wrapper so tests can always just do it manually via BackgroundFunc if you really need to.

p.setFinalizing(service, false)
p.setStage(service, AfterUpgradeFinalizedStage)

if len(p.cluster.All()) != len(p.cluster.CRDBNodes()) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What do you think about just passing in the workload node(s) as an argument in the test? I'm a bit uneasy about relying on this since:

  1. In the future we might make the workload node(s) its own cluster
  2. If the test doesn't care about syncing the binaries, this adds extra work
  3. WorkloadNode() as of right now is purely an optional convenience thing.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pass it in as an arg in mixedversion.NewTest(...) as a type option.NodeListOption? Sure sounds better than relying on math. Alternatively before i was thinking of using the lower level framework cluster.WorkloadNode, but unfortunately that calls Fatal if there's not a workload node. So adding a new cluster method could also work. Any preference on working in mixedversion / cluster?

  • If we add it to mixedversion's test.Test (and/or helper)?, we can rip out cluster from the main code path
  • If we add it to cluster, it might be simpler bc less abstraction but would add a reliance on cluster

leaning towards your suggestion, but curious if you have any thoughts

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah adding it to mixedversion test.Test was what i was thinking. I'm not sure if I follow your second option though. If you want to see if a workload node was declared, you can check the test spec. However, mixed version doesn't have access to the roachtest test spec because it has it's own test spec (test.Test).

we can rip out cluster from the main code path

I think we still need cluster for things like cluster.Run but I could be wrong.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Something similar to this

func (c *clusterImpl) WorkloadNode() option.NodeListOption {
except it doesn't Fatal if there's no workload nodes
But leaning towards your initial suggestion anyways. Also just meant ripping the cluster stuff that i added which is the reason why i had to add the cluster mock in the unit test so can take that out too

Copy link
Copy Markdown
Contributor Author

@williamchoe3 williamchoe3 Sep 19, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

passing in the workload node(s) as an argument in the test

So 2 approaches I thought of here:

  • Changing function signature
    • Wouldn't make sense for non workload tests to have to pass in a nil
    • could use a new constructor that takes in workload nodes, but having 2 separate constructors might be a little confusing
      • AND would need to add logic to add to testOptions so the planner can see the objs which seems like an antipattern in this framework as testOptions seems to be entirely user supplied
func NewTestWithWorkload(  
  ctx context.Context,  
  t test.Test,  
  l *logger.Logger,  
  c cluster.Cluster,  
  crdbNodes option.NodeListOption,  
  workloadNodes option.NodeListOption,  
  options ...CustomOption,  
) *Test { }
  • Create an option
    • alligned with framework design, could be seen as slightly odd as passing in the crdb nodes as a required param and not workload nodes, but not every mvt use's workload nodes so seems ok
			mvt := mixedversion.NewTest(
				ctx, t, t.L(), c, c.CRDBNodes(),
				// We use a longer upgrade timeout in this test to give the
				// migrations enough time to finish considering all the data
				// that might exist in the cluster by the time the upgrade is
				// attempted.
				mixedversion.WithWorkloadNodes(c.WorkloadNode()),

Leaning towards the option

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

option makes sense and is what I was thinking. WorkloadNode itself is already an option

// This is because the binary version is no longer backwards compatible as of
// v25.3
//
// If overrideBinary is true, the binary used to run the command(s) will be
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This bool sounds like it needs to be flipped to make sense. If it's true and we use the binary passed in the command, then it sounds like we aren't overriding anything at all.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We also might not need it at all anymore, instead we can just make sure to change the tests to use h.Cockroach() instead of ./cockroach which should give the versioned binary. This gives the same flexibility but without this extra bool.

func (s stageWorkloadBinaryStep) Run(
ctx context.Context, l *logger.Logger, _ *rand.Rand, h *Helper,
) error {
numWorkloadNodes := len(h.runner.cluster.All()) - len(h.runner.cluster.CRDBNodes())
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same as above, it'd be nice to not rely on node arithmetic here if possible.

@DarrylWong
Copy link
Copy Markdown
Contributor

   └── run following steps concurrently
      ├── stage workload binary on workload node(s) for version "v25.2.5", after 5s delay (34) <-- NEW
      └── restarting node 1 after panic, after 30s delay (35)

I don't think we wanna run the staging step concurrently with anything.

@DarrylWong
Copy link
Copy Markdown
Contributor

I also wonder if we could just add a step at the very start to stage all the versions needed across the entire test at once. That way we don't add a bunch of new steps that are purely framework details.

@DarrylWong
Copy link
Copy Markdown
Contributor

There's also this CDC failure you should fix that we (I) missed a while ago:

#151015

@williamchoe3
Copy link
Copy Markdown
Contributor Author

williamchoe3 commented Sep 18, 2025

There's also this CDC failure you should fix that we (I) missed a while ago:

#151015

Have a fix that's working locally, need to verify on CI @DarrylWong

@williamchoe3
Copy link
Copy Markdown
Contributor Author

I also wonder if we could just add a step at the very start to stage all the versions needed across the entire test at once. That way we don't add a bunch of new steps that are purely framework details.

Makes sense, I'll give that a try. Is that what's being done for all the binaries for cluster nodes too?

@williamchoe3
Copy link
Copy Markdown
Contributor Author

   └── run following steps concurrently
      ├── stage workload binary on workload node(s) for version "v25.2.5", after 5s delay (34) <-- NEW
      └── restarting node 1 after panic, after 30s delay (35)

I don't think we wanna run the staging step concurrently with anything.

I assumed that user hooks running in afterUpgradeSteps wouldn't affect the workload node so this was fine, but if we're going to front load all the workload node binary downloads in the beginning than this wouldn't matter anymore

@DarrylWong
Copy link
Copy Markdown
Contributor

Is that what's being done for all the binaries for cluster nodes too?

No, but for the cluster nodes there's a logical spot to upload them i.e. when we do the rolling restart for each node so we don't need a separate step. It's not a huge deal though - I'm partially thinking ahead to our new modular framework There's no dependency for uploading the workload binaries, so we could upload them all at the start if we wanted.

Flags map[string]*string
UseEquals bool
// InlineEnvVarAssignment e.g. COCKROACH_RANDOM_SEED=%d ./workload ...
InlineEnvVarAssignments map[string]*string
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: just EnvVars is probably clear enough

// exists, assume the binary has already been uploaded previously. Returns the
// path of the uploaded binaries on the nodes.
//
// If --versions-binary-override option is set and if version v is contained in
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I more so meant that we only support overriding the cockroach binary, i.e. you can't override the DeprecatedWorkload()

// See: https://github.com/cockroachdb/cockroach/issues/121411.
mixedversion.AlwaysUseLatestPredecessors,
)
tester := newSchemaChangeMixedVersionTester(t, c, maxOps, concurrency)
Copy link
Copy Markdown
Contributor

@DarrylWong DarrylWong Sep 19, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The schemachange workload is a bit different since it has backwards compatibility built in. Let me think about it more but I think we actually want to use the tip of master here.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

before I changed that file to used versioned workload binary, this was the error I was seeing, different from the unknown flag / arg error I saw when using cockroach workload .... but I assumed it was the same root cause bc of using the newer binary. And when I ran locally with my change with the same seed it started to pass

Wraps: (5) full command output in run_193441.767109628_n4_COCKROACHRANDOMSEED2.log
Wraps: (6) Node 4. Command with error:
  | ```
  | COCKROACH_RANDOM_SEED=2207545676704137517 ./workload run schemachange {pgurl:1-4} --concurrency 5 --max-ops 1000 --verbose 1
  | ```
  | <truncated> ... 60 1 workload/cli/run.go:590  [-] 3 +  | 	src/runtime/asm_amd64.s:1700
  | E250917 19:35:30.793260 1 workload/cli/run.go:590  [-] 3 +Wraps: (4) ***UNEXPECTED ERROR; Received an unexpected execution error.
  | E250917 19:35:30.793260 1 workload/cli/run.go:590  [-] 3 +Wraps: (5) ERROR: function udf_w3_23("char",bit,"char",bit,bool,box2d,bpchar,bytes,citext,string,date,decimal,float,geography,geometry,inet,int,int2,int2vector,int,interval,jsonb,string,oidvector,pg_lsn,refcursor,string,time) does not exist (SQLSTATE 42883)
  | E250917 19:35:30.793260 1 workload/cli/run.go:590  [-] 3 +Error types: (1) *markers.withMark (2) *schemachange.ErrorState (3) *withstack.withStack (4) *errutil.withPrefix (5) *pgconn.PgError
  | Error: ***UNEXPECTED ERROR; Received an unexpected execution error.: ERROR: function udf_w3_23("char",bit,"char",bit,bool,box2d,bpchar,bytes,citext,string,date,decimal,float,geography,geometry,inet,int,int2,int2vector,int,interval,jsonb,string,oidvector,pg_lsn,refcursor,string,time) does not exist (SQLSTATE 42883)
Wraps: (7) COMMAND_PROBLEM
Wraps: (8) exit status 1

but yea I also don't know about the nuance that using workload vs cockroach workload has besides it's a different binary so if I should do something different let me know

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cockroach workload contains a subset of the workloads found in workload. They aren't 1:1 to ensure the cockroach binary isn't too large. However, to make sure we don't upload an unnecessary binary unless needed (since we already upload cockroach to every node), we opt for cockroach if possible.

// initializeSchemaAndIDs ensures schema objects are created in the cluster, and determines
// the various IDs that will be used as placeholders for the endpoints.
func initializeSchemaAndIDs(ctx context.Context, c cluster.Cluster, l *logger.Logger) error {
func initializeSchemaAndIDs(
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: could just pass in the binaryPath instead of the entire tpcc command

}
}

// WithWorkloadNodes allows callers to specify workload nodes which
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like an unfinished comment :)

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oops 😅

})
}

// DepricatedWorkload is this needed? TODO remove if not needed
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

GItHub will remember for us, so no need to keep commented code around. We'll just pull from history if we need to restore it.

Name: "decommission/mixed-versions",
Owner: registry.OwnerKV,
Cluster: r.MakeClusterSpec(numNodes),
Cluster: r.MakeClusterSpec(numNodes, spec.WorkloadNode()),
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test doesn't look like it was using a workload node. It was importing some data into the cluster using the workload binary, but just doing so on a random CRDB node since it's not heavyweight enough to require it's own VM.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(i don't think my prev comment sent for whatever reason, but this might be a dup resp)
kk i can change it, i was initially thinking from a functional separation of concerns point of view, but if there's no clear benefit to having it / we can save a VM then i'll take it out

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

spec.WorkloadNode just indicates that the last node in the cluster is denoted as a workload node, it doesn't actually add a VM to the cluster (i.e. the cluster is still size(numNodes))

Maybe I'd do it differently now since people have been confused about this, but at the time I did it this way because:

  1. I didn't want to change hundreds of tests that were already adding the workload node in the ClusterSpec or calling clusterSpec.NodeCount. Test writers were already used to it and was my thinking.
  2. It was (optimistically) just a temporary change until we got the ability to spin up multiple clusters per roachtests.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah ok thanks for the info! I feel like i remember reading that or someone telling me a while back, but yeah for some reason when i see spec.WorkloadNode in my head I default to thinking it's adding a node.

I feel like it does make sense to initialize the cluster with all the nodes you need and declare node(s) as workload, maybe the function name SetNodesAsWorkload vs WorkloadNode could help? I don't know if you could tell, but i tend to lean towards more verbose names haha


runWorkloadCmd := tester.runWorkloadCmd(mvt.RNG())
_ = mvt.BackgroundCommand("run workload", tester.workloadNodes, runWorkloadCmd)
_ = mvt.Workload("bank", tester.workloadNodes, nil, runWorkloadCmd)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: Why did this need to change from BackgroundCommand to Workload? CDCMixedVersionCheckpointing still has it as BackgroundCommand so we should make it consistent whatever we choose.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah thanks for the catch, I missed CDCMixedVersionCheckpointing, i'll make the same change

And as for why, I changed to using Workload so workload can handle replacing the binary path with the correct versioned one. Without calling Workload, there's no way to get the versioned binary path because the logic currently exists in the test's main body and not in a user hook so it doesn't have access to Helper. Just thought it would be simpler to change this call vs moving the workload command building logic into a custom user hook. Also Workload calls BackgroundFunc which calls BackgroundCommand so it looks like we get to the same thing in the end, but let me know if there's anything to be aware of with that call chain, it looks like its mainly just abstracting which node the command is ran on, but since we're just passing the 1 workload node i think it should be consistent

func Workload(...) {
...
initCmd.Binary = h.CockroachBinaryForWorkload(t.rt)
}

Name: "import/mixed-versions",
Owner: registry.OwnerSQLQueries,
Cluster: r.MakeClusterSpec(4),
Cluster: r.MakeClusterSpec(4, spec.WorkloadNode()),
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same as above, this test wasn't using a Workload node since it just imports cold data but doesn't actually run the workload.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

kk, same as above can remove

@williamchoe3
Copy link
Copy Markdown
Contributor Author

Copy link
Copy Markdown
Contributor

@DarrylWong DarrylWong left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some more nits and unit test only comments. I think the core functionality looks good though!

return c.Flag(name, falseVal)
}

func (c *Command) InlineEnvVarAssignment(name string, val interface{}) *Command {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: Just EnvVar is probably fine here too. IMO inline is implied, since it would be surprising if it e.g. exported the env var.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oops should've changed that when i changed the field name

Flags map[string]*string
UseEquals bool
// EnvVars e.g. COCKROACH_RANDOM_SEED=%d ./workload ...
EnvVars map[string]*string
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's a commandbuilder_test already, wouldn't hurt to add an example using the env var option.

mvt = createDataDrivenMixedVersionTest(t, d.CmdArgs)
case "mixed-version-test-with-workload":
mc := mockcluster.NewMockCluster(ctrl)
mc.EXPECT().WorkloadNode().AnyTimes().Return(option.NewNodeListOptionRange(1, 1))
Copy link
Copy Markdown
Contributor

@DarrylWong DarrylWong Sep 25, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm all for using gomock more in unit tests (I've wanted to lift the mock out of clusterstats for a while), but it seems overkill if this is all we use it for.

Something like this should work too without the need for a mock:

			case "workload-node":
				mvt = newTest(ctrl, WithWorkloadNodes(option.NodeListOption{5}))

Better yet, you can make it an arg in createDataDrivenMixedVersionTest:


		case "workload_node":
			workloadNodes := option.NodeListOption{}
			for _, nodeStr := range arg.Vals {
				n, err := strconv.Atoi(nodeStr)
				require.NoError(t, err)
				workloadNodes = append(workloadNodes, n)
			}

			opts = append(opts, WithWorkloadNodes(workloadNodes))

This would let the test itself specify node(s) as well as let you keep just one mixed-version-test hook:

mixed-version-test workload_node=(1,3)
----
ok

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah I didn't consider just skipping the cluster part and passing in a list of nodes directly, thanks! And yea i agree, originally with my first approach where I was relying on cluster methods I had to mock out those calls in all the unit tests, but now it's not needed so ripped everything out

return numReleases > 1
}

// CockroachBinaryForWorkload returns the correct binary path to use when
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: what do you think about something like VersionedCockroachPath?

  1. Just a hypothetical, but maybe we want to run some ./cockroach debug commands that also needs to be versioned. This would work for that use case too, so we can make it generic.
  2. Makes it more obvious that this is a versioned binary from just the name.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense! changed


// schemaChangeMixedVersionTester bundles state and configuration to pass to
// test hooks
type schemaChangeMixedVersionTester struct {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like this was just refactoring and not any actual test changes? Just making sure I'm not missing anything.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The change was using a versioned dedicated workload binary instead of the default that looks like just get's built here build/teamcity/cockroach/ci/builds/build_impl.sh for CI. Also we had a thread on this before https://github.com/cockroachdb/cockroach/pull/153269/files#r2363296130
But I was seeing an odd workload run failure that seemed to go away after I started using the versioned binary (logs in the linked thread)

Basically I did something similar to your original workaround. In every hook that is doing a ./workload {run || init} I call a roachtest level function to get the versioned binary

	binaryPath, _, err := clusterupgrade.UploadWorkload(
		ctx, t.t, l, t.c, t.c.WorkloadNode(), h.System.FromVersion)
	if err != nil {
		return err
	}

To do this, I needed to create the struct and have the mixedversion hooks be methods on that struct so I can pass along the test.Test (roachtest) inside the mixedversion hook so I can call UploadWorkload because it needs t.t as a parameter

The mixedversion CDC tests do something similar which is where i got the inspiration from https://github.com/cockroachdb/cockroach/pull/153269/files#diff-187d25c5a34e03c63c9633e5b04ca4c2ee071761b055df84c7d05e95bfcfa5c0L147

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On that note though, I am seeing this test fail in the most recent CI run, taking a look

Copy link
Copy Markdown
Contributor Author

@williamchoe3 williamchoe3 Sep 26, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also I forgot about this case when doing the original change to doing all the binary downloads up front, I can make this another mixedversion test option, but before I do that

The schemachange workload is a bit different since it has backwards compatibility built in. Let me think about it more but I think we actually want to use the tip of master here.

should we even be using the versioned binary here? I don't have the logs from the original run I did with this test so I can do some runs from before this change to first verify the failure was happening. and the reason i didn't properly verify this was failing before making the change originally was because we decided to just go ahead and try to use the versioned binary as a default

but I'm going to look at the other test failures first because this seems like a special case.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay so after some digging in the original PR #113890

It seems like we do want this to be versioned, and we currently set:

		// Disable version skipping and limit the test to only one upgrade as the workload is only
		// compatible with the branch it was built from and the major version before that.
		mixedversion.NumUpgrades(1),
		mixedversion.DisableSkipVersionUpgrades,

However, it seems like since the schemachange workload is compatible with one version previous, I think that actually means we want to stage the h.System.ToVersion? If. we do that, then we can get rid of the above mixedversion.NumUpgrades limitation.

IMO it also wouldn't hurt to do this as a followup, that way we don't hold up the PR while you are testing/figuring this out.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

kk thanks for the follow up info! i'll go ahead and revert the file on this branch, and open up a new followup issue

Copy link
Copy Markdown
Contributor

@DarrylWong DarrylWong left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM modulo the commits being cleaned up before merging.

@williamchoe3
Copy link
Copy Markdown
Contributor Author

@DarrylWong
Copy link
Copy Markdown
Contributor

Looks like cdc/mixed-version is failing on 25.4 so lets try to backport it if it goes through cleanly

@DarrylWong DarrylWong added the backport-25.4.x Flags PRs that need to be backported to 25.4 label Sep 26, 2025
@williamchoe3 williamchoe3 force-pushed the wchoe/147374-roachtest-mixedversion-sync-workload-binary-version-with-cluster-version branch from 9d81c33 to 0f02751 Compare September 29, 2025 15:02
@williamchoe3 williamchoe3 force-pushed the wchoe/147374-roachtest-mixedversion-sync-workload-binary-version-with-cluster-version branch from 0f02751 to e39348f Compare September 29, 2025 21:17
@williamchoe3 williamchoe3 force-pushed the wchoe/147374-roachtest-mixedversion-sync-workload-binary-version-with-cluster-version branch from 96d1bf6 to aee214e Compare September 30, 2025 19:14
…r version

The current workload binary is no longer backwards compatible for certain
workloads. Previously, the workload binary would just use whatever default
binary was staged on the workload node. Currently, there's a quick fix that
installs the current system version's binary on the workload node when a
workload cmd is being executed.

This change bakes that behavior into the test framework itself, so we can remove
the binary staging from the Workload hook. Workload nodes are staged with all
versioned binaries in the upgrade plan so all workload commands can use a
versioned binary that matches the current state of the cluster.

Resolves cockroachdb#147374

Epic: None
Release note: None
@williamchoe3 williamchoe3 force-pushed the wchoe/147374-roachtest-mixedversion-sync-workload-binary-version-with-cluster-version branch from aee214e to 9b2f590 Compare September 30, 2025 20:38
@williamchoe3
Copy link
Copy Markdown
Contributor Author

resolved merged conflicts, just had to update the stageAllWorkloadBinariesStep.Description signature & regenerate a few dd tests because of line number changes in debug mode, and regenerate the workload dd test I added due to the change in plan generation

@williamchoe3
Copy link
Copy Markdown
Contributor Author

blathers auth-check

Copy link
Copy Markdown
Contributor Author

You shall pass

@williamchoe3
Copy link
Copy Markdown
Contributor Author

TFTR!!

bors r=DarrylWong

@craig craig bot merged commit 6bfc8db into cockroachdb:master Oct 1, 2025
24 checks passed
@craig
Copy link
Copy Markdown
Contributor

craig bot commented Oct 1, 2025

@blathers-crl
Copy link
Copy Markdown

blathers-crl bot commented Oct 1, 2025

Based on the specified backports for this PR, I applied new labels to the following linked issue(s). Please adjust the labels as needed to match the branches actually affected by the issue(s), including adding any known older branches.


Issue #147374: branch-release-25.4.


🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

@blathers-crl
Copy link
Copy Markdown

blathers-crl bot commented Oct 1, 2025

Encountered an error creating backports. Some common things that can go wrong:

  1. The backport branch might have already existed.
  2. There was a merge conflict.
  3. The backport branch contained merge commits.

You might need to create your backport manually using the backport tool.


error creating merge commit from 9b2f590 to blathers/backport-release-25.4-153269: POST https://api.github.com/repos/williamchoe3/cockroach/merges: 409 Merge conflict []

you may need to manually resolve merge conflicts with the backport tool.

Backport to branch 25.4.x failed. See errors above.


🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

backport-25.4.x Flags PRs that need to be backported to 25.4 backport-failed v26.1.0-prerelease

Projects

None yet

Development

Successfully merging this pull request may close these issues.

roachtest/mixedversion: sync workload binary versions with the cluster version

4 participants