
roachtest: exit with failure on github post errors#149479

Merged
craig[bot] merged 1 commit into cockroachdb:master from williamchoe3:wchoe/147116-github-err-will-fail-pipeline
Jul 14, 2025

Conversation

@williamchoe3
Contributor

@williamchoe3 williamchoe3 commented Jul 3, 2025

Fixes #147116

Changes

High-level Changes

Added a new failure path by

  • adding a new counter to the testRunner struct which gets incremented when github.MaybePost() (called in testRunner.runWorkers() and testRunner.runTests()) returns an error. When this count > 0, testRunner.Run() returns a new error, errGithubPostFailed, and when main() sees that error, it returns a new exit code 12, which fails the pipeline (unlike exit codes 10 and 11).
  • ^ this is very similar to how provisioning errors are tracked and returned to main()
  • this does not trigger the test short-circuiting mechanism because testRunner.runWorkers() doesn't return an error
type testRunner struct {
	...
	// numGithubPostErrs counts GitHub post errors across all workers.
	numGithubPostErrs int32
	...
}
...
issue, err := github.MaybePost(t, issueInfo, l, output, params) // TODO add cluster specific args here
if err != nil {
	shout(ctx, l, stdout, "failed to post issue: %s", err)
	atomic.AddInt32(&r.numGithubPostErrs, 1)
}

Design

In order to do verification via unit tests, I'm used to using something like Python's MagicMock, but that's not available in Go, so I opted for a dependency injection approach. (This was the best I could come up with; I wanted to avoid "if unit test, do this" logic. If anyone has other approaches / suggestions, let me know!)
I made a new interface, GithubPoster, such that the original githubIssues implements it. I then pass this interface through the function signatures all the way from Run() to runTests(). In the unit tests, I can then pass a different implementation of GithubPoster whose MaybePost() always fails.
github.go

type GithubPoster interface {
	MaybePost(
		t *testImpl, issueInfo *githubIssueInfo, l *logger.Logger, message string,
		params map[string]string) (
		*issues.TestFailureIssue, error)
}

Another issue with this approach is that the original githubIssues holds cluster-specific information, but because of dependency injection it's now a single struct shared among all the workers, so it no longer makes sense for it to store worker-dependent fields.
For the worker-specific fields, I created a new struct, githubIssueInfo, which is created in runWorkers(), similar to how githubIssues used to be created there.
Note: I don't love the name githubIssueInfo, but I wanted to stick with a naming convention similar to githubIssues; open to name suggestions.

// Original githubIssues
type githubIssues struct {
	disable      bool
	cluster      *clusterImpl
	vmCreateOpts *vm.CreateOpts
	issuePoster func(context.Context, issues.Logger, issues.IssueFormatter, issues.PostRequest,
		*issues.Options) (*issues.TestFailureIssue, error)
	teamLoader func() (team.Map, error)
}
// New githubIssues
type githubIssues struct {
	disable      bool
	issuePoster func(context.Context, issues.Logger, issues.IssueFormatter, issues.PostRequest,
		*issues.Options) (*issues.TestFailureIssue, error)
	teamLoader func() (team.Map, error)
}

All this was very verbose, and I didn't love that I had to change all the function signatures to do this; open to other ways to do verification.

Misc

Also, this is my first time writing Go in ~3 years, so I'm very open to general Go feedback on semantics / best practices / design patterns.

Verification

Diff of the binary I used to manually confirm, if you want to see where I hardcoded errors to be returned: 611adcc

Manual Test Logs

➜ cockroach git:(wchoe/147116-github-err-will-fail-pipeline) ✗ tmp/roachtest run acceptance/build-info --cockroach /Users/wchoe/work/cockroachdb/cockroach/bin_linux/cockroach
...
Running tests which match regex "acceptance/build-info" and are compatible with cloud "gce".

fallback runner logs in: artifacts/roachtest.crdb.log
2025/07/09 00:51:48 run.go:386: test runner logs in: artifacts/_runner-logs/test_runner-1752022308.log
test runner logs in: artifacts/_runner-logs/test_runner-1752022308.log
HTTP server listening on port 56238 on localhost: http://localhost:56238/
2025/07/09 00:51:48 run.go:148: global random seed: 1949199437086051249
2025/07/09 00:51:48 test_runner.go:398: test_run_id: will.choe-1752022308
test_run_id: will.choe-1752022308
[w0] 2025/07/09 00:51:48 work_pool.go:198: Acquired quota for 16 CPUs
[w0] 2025/07/09 00:51:48 cluster.go:3204: Using randomly chosen arch="amd64", acceptance/build-info
[w0] 2025/07/09 00:51:48 test_runner.go:798: Unable to create (or reuse) cluster for test acceptance/build-info due to: mocking.
Unable to create (or reuse) cluster for test acceptance/build-info due to: mocking.
2025/07/09 00:51:48 test_impl.go:478: test failure #1: full stack retained in failure_1.log: (test_runner.go:873).func4: mocking [owner=test-eng]
2025/07/09 00:51:48 test_impl.go:200: Runtime assertions disabled
[w0] 2025/07/09 00:51:48 test_runner.go:883: failed to post issue: mocking
failed to post issue: mocking
[w0] 2025/07/09 00:51:48 test_runner.go:1019: test failed: acceptance/build-info (run 1)
[w0] 2025/07/09 00:51:48 test_runner.go:732: Releasing quota for 16 CPUs
[w0] 2025/07/09 00:51:48 test_runner.go:744: No work remaining; runWorker is bailing out...
No work remaining; runWorker is bailing out...
[w0] 2025/07/09 00:51:48 test_runner.go:643: Worker exiting; no cluster to destroy.
2025/07/09 00:51:48 test_runner.go:460: PASS
PASS
2025/07/09 00:51:48 test_runner.go:465: 1 clusters could not be created and 1 errors occurred while posting to github
1 clusters could not be created and 1 errors occurred while posting to github
2025/07/09 00:51:48 run.go:200: runTests destroying all clusters
Error: some clusters could not be created
failed to POST to GitHub
➜ cockroach git:(wchoe/147116-github-err-will-fail-pipeline) ✗ echo $?
12

@cockroach-teamcity
Member

This change is Reviewable

@williamchoe3 williamchoe3 self-assigned this Jul 9, 2025
@williamchoe3 williamchoe3 marked this pull request as ready for review July 9, 2025 14:31
@williamchoe3 williamchoe3 requested a review from a team as a code owner July 9, 2025 14:31
@williamchoe3 williamchoe3 requested review from DarrylWong and herkolategan and removed request for a team July 9, 2025 14:31
shout(ctx, l, lopt.stdout, passFailLine)

if r.numClusterErrs > 0 {
// Prioritize returning GitHub POST errors because we want those to fail the
Contributor

nit: could you explain what you mean by "prioritize returning GitHub POST errors"? It seems like this code is taking both into account equally.

Contributor Author

I was trying to capture the idea that we could now technically have any combination of three errors: errSomeClusterProvisioningFailed, errGithubPostFailed, and errTestsFailed,
and the comment was supposed to help explain why the if/else clauses are set up in the order they are, BUT I just realized that I should probably just return the joined error up to main and let main handle what to do (and it already does)

will make that change rn

Contributor Author

Oh, I just realized this is what you were getting at in your other comment, but let me know if the way it is now makes more sense @DarrylWong

Contributor

Yep, the prioritization comment makes a lot more sense now. Thanks!

if r.numClusterErrs > 0 {
// Prioritize returning GitHub POST errors because we want those to fail the
// run.
if r.numClusterErrs > 0 && r.numGithubPostErrs > 0 {
Contributor

nit: seems fine to just log to teamcity twice. e.g.

	var err error
	if r.numGithubPostErrs > 0 {
		shout(ctx, l, lopt.stdout, "%d errors occurred while posting to github", r.numGithubPostErrs)
		err = errGithubPostFailed
	}
	if r.numClusterErrs > 0 {
		shout(ctx, l, lopt.stdout, "%d clusters could not be created", r.numClusterErrs)
		err = errors.CombineErrors(err, errSomeClusterProvisioningFailed)
	}
	if err != nil {
		return err
	}

Contributor Author

@williamchoe3 williamchoe3 Jul 9, 2025

I can take out the combined log msg and just do the same one twice in the first if clause, i.e. I think that makes it simpler, like

if r.numClusterErrs > 0 && r.numGithubPostErrs > 0 {
  shout(ctx, l, lopt.stdout, "%d errors occurred while posting to github", r.numGithubPostErrs)
  shout(ctx, l, lopt.stdout, "%d clusters could not be created", r.numClusterErrs)
  err = ...
} else if ...

but I don't like potentially rewriting err; I think the verbosity is worth it for readability, but I don't feel strongly about this. Also, if that's how the switch-case logic is done in the rest of the package (like in main) / in Go in general, then I'm down to switch.

Contributor Author

ended up rewriting this logic, see this comment #149479 (comment)

@williamchoe3 williamchoe3 force-pushed the wchoe/147116-github-err-will-fail-pipeline branch 4 times, most recently from 65a4385 to e09507a Compare July 9, 2025 21:34
@herkolategan
Collaborator

herkolategan commented Jul 10, 2025

As tempting as it is to use the issue this addresses in the title, it should be moved to the body of the PR and to the commit message in the following form:

roachtest: exit with failure on github post errors
...
Fixes: #147116

You can update the PR title to be the same as your commit -> <comma separated affected packages>: <short desc>

Collaborator

@herkolategan herkolategan left a comment

Nice work! Left a few minor comments, but overall the high level implementation looks good.

@williamchoe3 williamchoe3 changed the title 147116: roachtest fail when github issuePoster returns an error roachtest: exit with failure on github post errors Jul 10, 2025
@williamchoe3 williamchoe3 force-pushed the wchoe/147116-github-err-will-fail-pipeline branch from e09507a to 3ead89f Compare July 10, 2025 18:32
Suites: registry.Suites(registry.Nightly),
CockroachBinary: registry.StandardCockroach,
Run: func(ctx context.Context, t test.Test, c cluster.Cluster) {
panic("this test should never run")
Contributor

nit: The test runner will recover this panic as a test failure. So you'd have to also add a require.NotErrorIs(t, err, errTestsFailed) check to assert that the test itself didn't fail.

Contributor Author

Oops, got it, will add, ty

…b issue creation fails

Previously, the pipeline would not fail when encountering an issue posting to
github, which would cause these failures to be "silent". Now pipelines will fail
when there's an error posting to github, increasing visibility. The new
mechanism is very similar to how provisioning errors are tracked and returned
to main. It does not trigger the test short-circuiting mechanism because
testRunner.runWorkers() doesn't return an error.

Introduces dependency injection for GitHub in Run so GitHub failures can be
mocked in testing

Resolves cockroachdb#147116

Epic: None
Release note: None
@williamchoe3 williamchoe3 force-pushed the wchoe/147116-github-err-will-fail-pipeline branch from 3ead89f to 3ac7c44 Compare July 11, 2025 16:12
@williamchoe3
Contributor Author

The TeamCity build has unfortunately been very inconsistent for me, and just me. I've been assuming it's intermittent failures, but mine is the only branch hitting them. I've been able to get it to pass by just rerunning, but I'm going to take a look at what's going on. I'm noticing heap memory issues when publishing artifacts, but I'm not adding any new artifacts, so I don't know why I'd be getting OOMs.

Collaborator

@herkolategan herkolategan left a comment

Thanks for making the changes! I'll go look at the TC test runs as well. It's usually easy to find using the PR number, but sometimes we paste a link to the TC build on the PR just to make it easier for reviewers to link directly to it.

@williamchoe3
Contributor Author

Some extra context: https://cockroachlabs.slack.com/archives/CJ0H8Q97C/p1752259095090199
Ran local stress and race testing, didn't see anything out of the ordinary; will go ahead with the merge.

@williamchoe3
Contributor Author

TFTR!

bors r=herkolategan,DarrylWong

@craig
Contributor

craig bot commented Jul 14, 2025

@craig craig bot merged commit cbdd88b into cockroachdb:master Jul 14, 2025
21 of 22 checks passed
@srosenberg
Member

Nice work! As a precaution, we usually fire up a smoke test using SELECT_PROBABILITY [1] prior to borsing. Here is an example [2]. This applies to any non-trivial change which touches the roachtest framework. Historically, the smoke test has been pretty effective at finding issues that were missed during PR review. Worst-case scenario, the nightly would crash, preempting some of the roachtests.

[1] https://cockroachlabs.atlassian.net/wiki/spaces/TE/pages/146538717/Roachtest#Roachtest-OverridingSELECT_PROBABILITY
[2] #148244 (comment)

herkolategan added a commit to herkolategan/cockroach that referenced this pull request Jul 15, 2025
During the migration of the github poster initialization in cockroachdb#149479, the default
team loader configuration was inadvertently omitted. This change restores the
default team loader setting to resolve the issue.

Informs: cockroachdb#149479

Epic: None
Release note: None
craig bot pushed a commit that referenced this pull request Jul 15, 2025
150223: roachtest: fix nil team loader r=williamchoe3,srosenberg a=herkolategan

During the migration of the github poster initialization in #149479, the default team loader configuration was inadvertently omitted. This change restores the default team loader setting to resolve the issue.

Informs: #149479

Epic: None
Release note: None

Co-authored-by: Herko Lategan <herko@cockroachlabs.com>
