
roachtest: exit with failure on github post errors#149479

Merged
craig[bot] merged 1 commit into cockroachdb:master from williamchoe3:wchoe/147116-github-err-will-fail-pipeline
Jul 14, 2025

Conversation

@williamchoe3
Contributor

@williamchoe3 williamchoe3 commented Jul 3, 2025

Fixes #147116

Changes

High-level Changes

Added a new failure path by

  • adding a new counter to the testRunner struct which gets incremented when github.MaybePost() (called in testRunner.runWorkers() and testRunner.runTests()) returns an error. When this count > 0, testRunner.Run() returns a new error, errGithubPostFailed, and when main() sees that error, it returns a new exit code 12, which fails the pipeline (unlike exit codes 10 and 11).
  • ^ this is very similar to how provisioning errors are tracked and returned to main()
  • this does not trigger the test short-circuiting mechanism because testRunner.runWorkers() doesn't return an error
type testRunner struct {
	...
	// numGithubPostErrs counts GitHub post errors across all workers.
	numGithubPostErrs int32
	...
}
...
issue, err := github.MaybePost(t, issueInfo, l, output, params) // TODO add cluster specific args here
if err != nil {
	shout(ctx, l, stdout, "failed to post issue: %s", err)
	atomic.AddInt32(&r.numGithubPostErrs, 1)
}

Design

In order to do verification via unit tests, I'm used to using something like Python's MagicMock, but that's not available in Go, so I opted for a dependency injection approach. (This was the best I could come up with; I wanted to avoid "if unit test, do this" logic. If anyone has other approaches / suggestions, let me know!)
I made a new interface, GithubPoster, such that the original githubIssues implements it. I then pass this interface through the function signatures all the way from Run() to runTests(). In the unit tests, I can then pass a different implementation of GithubPoster whose MaybePost() always fails.
github.go

type GithubPoster interface {
	MaybePost(
		t *testImpl, issueInfo *githubIssueInfo, l *logger.Logger, message string,
		params map[string]string) (
		*issues.TestFailureIssue, error)
}

Another issue with this approach is that the original githubIssues holds cluster-specific information, but because of dependency injection it's now a single struct shared among all the workers, so it no longer makes sense for it to store worker-dependent fields.
For the worker-specific fields, I created a new struct, githubIssueInfo, which is created in runWorkers(), similar to how githubIssues used to be created there.
Note: I don't love the name githubIssueInfo, but I wanted to stick with a naming convention similar to githubIssues; open to name suggestions.

// Original githubIssues
type githubIssues struct {
	disable      bool
	cluster      *clusterImpl
	vmCreateOpts *vm.CreateOpts
	issuePoster func(context.Context, issues.Logger, issues.IssueFormatter, issues.PostRequest,
		*issues.Options) (*issues.TestFailureIssue, error)
	teamLoader func() (team.Map, error)
}
// New githubIssues
type githubIssues struct {
	disable      bool
	issuePoster func(context.Context, issues.Logger, issues.IssueFormatter, issues.PostRequest,
		*issues.Options) (*issues.TestFailureIssue, error)
	teamLoader func() (team.Map, error)
}

All this was very verbose, and I didn't love that I had to change all the function signatures to do this; open to other ways to do verification.

Misc

Also, this is my first time writing Go in ~3 years, so I'm very open to general Go feedback on semantics / best practices / design patterns.

Verification

Diff of the binary I used to manually confirm, if you want to see where I hardcoded errors to be returned: 611adcc

Manual Test Logs

➜ cockroach git:(wchoe/147116-github-err-will-fail-pipeline) ✗ tmp/roachtest run acceptance/build-info --cockroach /Users/wchoe/work/cockroachdb/cockroach/bin_linux/cockroach
...
Running tests which match regex "acceptance/build-info" and are compatible with cloud "gce".

fallback runner logs in: artifacts/roachtest.crdb.log
2025/07/09 00:51:48 run.go:386: test runner logs in: artifacts/_runner-logs/test_runner-1752022308.log
test runner logs in: artifacts/_runner-logs/test_runner-1752022308.log
HTTP server listening on port 56238 on localhost: http://localhost:56238/
2025/07/09 00:51:48 run.go:148: global random seed: 1949199437086051249
2025/07/09 00:51:48 test_runner.go:398: test_run_id: will.choe-1752022308
test_run_id: will.choe-1752022308
[w0] 2025/07/09 00:51:48 work_pool.go:198: Acquired quota for 16 CPUs
[w0] 2025/07/09 00:51:48 cluster.go:3204: Using randomly chosen arch="amd64", acceptance/build-info
[w0] 2025/07/09 00:51:48 test_runner.go:798: Unable to create (or reuse) cluster for test acceptance/build-info due to: mocking.
Unable to create (or reuse) cluster for test acceptance/build-info due to: mocking.
2025/07/09 00:51:48 test_impl.go:478: test failure #1: full stack retained in failure_1.log: (test_runner.go:873).func4: mocking [owner=test-eng]
2025/07/09 00:51:48 test_impl.go:200: Runtime assertions disabled
[w0] 2025/07/09 00:51:48 test_runner.go:883: failed to post issue: mocking
failed to post issue: mocking
[w0] 2025/07/09 00:51:48 test_runner.go:1019: test failed: acceptance/build-info (run 1)
[w0] 2025/07/09 00:51:48 test_runner.go:732: Releasing quota for 16 CPUs
[w0] 2025/07/09 00:51:48 test_runner.go:744: No work remaining; runWorker is bailing out...
No work remaining; runWorker is bailing out...
[w0] 2025/07/09 00:51:48 test_runner.go:643: Worker exiting; no cluster to destroy.
2025/07/09 00:51:48 test_runner.go:460: PASS
PASS
2025/07/09 00:51:48 test_runner.go:465: 1 clusters could not be created and 1 errors occurred while posting to github
1 clusters could not be created and 1 errors occurred while posting to github
2025/07/09 00:51:48 run.go:200: runTests destroying all clusters
Error: some clusters could not be created
failed to POST to GitHub
➜ cockroach git:(wchoe/147116-github-err-will-fail-pipeline) ✗ echo $?
12

@cockroach-teamcity
Member

This change is Reviewable

@williamchoe3 williamchoe3 self-assigned this Jul 9, 2025
@williamchoe3 williamchoe3 marked this pull request as ready for review July 9, 2025 14:31
@williamchoe3 williamchoe3 requested a review from a team as a code owner July 9, 2025 14:31
@williamchoe3 williamchoe3 requested review from DarrylWong and herkolategan and removed request for a team July 9, 2025 14:31
shout(ctx, l, lopt.stdout, passFailLine)

if r.numClusterErrs > 0 {
// Prioritize returning GitHub POST errors because we want those to fail the
Contributor

nit: could you explain what you mean by "prioritize returning GitHub POST errors"? It seems like this code is taking both into account equally.

Contributor Author

I was trying to capture the idea that we could now technically have any combination of three errors: errSomeClusterProvisioningFailed, errGithubPostFailed, and errTestsFailed,
and the comment was supposed to help explain why the if/else clauses are set up in the order they are, BUT I just realized that I should probably just return the joined error up to main and let main handle what to do (and it already does)

will make that change rn

Contributor Author

Oh, I just realized this is what you were getting at in your other comment, but let me know if the way it is now makes more sense @DarrylWong

Contributor

Yep, the prioritization comment makes a lot more sense now. Thanks!

if r.numClusterErrs > 0 {
// Prioritize returning GitHub POST errors because we want those to fail the
// run.
if r.numClusterErrs > 0 && r.numGithubPostErrs > 0 {
Contributor

nit: seems fine to just log to teamcity twice. e.g.

	var err error
	if r.numGithubPostErrs > 0 {
		shout(ctx, l, lopt.stdout, "%d errors occurred while posting to github", r.numGithubPostErrs)
		err = errGithubPostFailed
	}
	if r.numClusterErrs > 0 {
		shout(ctx, l, lopt.stdout, "%d clusters could not be created", r.numClusterErrs)
		err = errors.CombineErrors(err, errSomeClusterProvisioningFailed)
	}
	if err != nil {
		return err
	}

Contributor Author

@williamchoe3 williamchoe3 Jul 9, 2025

I can take out the combined log msg and just do the same one twice in the first if clause, i.e. I think that makes it simpler, like

if r.numClusterErrs > 0 && r.numGithubPostErrs > 0 {
  shout(ctx, l, lopt.stdout, "%d errors occurred while posting to github", r.numGithubPostErrs)
  shout(ctx, l, lopt.stdout, "%d clusters could not be created", r.numClusterErrs)
  err = ...
} else if ...

but I don't like potentially rewriting err; I think the verbosity is worth it for readability, but I don't feel strongly about this. Also, if that's how the switch-case logic is done in the rest of the package (like in main) / in Go in general, then I'm down to switch.

Contributor Author

ended up rewriting this logic, see this comment #149479 (comment)

@williamchoe3 williamchoe3 force-pushed the wchoe/147116-github-err-will-fail-pipeline branch 4 times, most recently from 65a4385 to e09507a Compare July 9, 2025 21:34
@herkolategan
Collaborator

herkolategan commented Jul 10, 2025

As tempting as it is to use the issue this addresses in the title, it should be moved to the body of the PR and to the commit message in the following form:

roachtest: exit with failure on github post errors
...
Fixes: #147116

You can update the PR title to be the same as your commit -> <comma separated affected packages>: <short desc>

Collaborator

@herkolategan herkolategan left a comment

Nice work! Left a few minor comments, but overall the high level implementation looks good.

@williamchoe3 williamchoe3 changed the title 147116: roachtest fail when github issuePoster returns an error roachtest: exit with failure on github post errors Jul 10, 2025
@williamchoe3 williamchoe3 force-pushed the wchoe/147116-github-err-will-fail-pipeline branch from e09507a to 3ead89f Compare July 10, 2025 18:32
Suites: registry.Suites(registry.Nightly),
CockroachBinary: registry.StandardCockroach,
Run: func(ctx context.Context, t test.Test, c cluster.Cluster) {
panic("this test should never run")
Contributor

nit: The test runner will recover this panic as a test failure. So you'd have to also add a require.NotErrorIs(t, err, errTestsFailed) check to assert that the test itself didn't fail.

Contributor Author

Oops, got it, will add, ty

…b issue creation fails

Previously, the pipeline would not fail when encountering an issue posting to
github, which would cause these failures to be "silent". Now pipelines will fail
when there's an error posting to github, increasing visibility. The new
mechanism is very similar to how provisioning errors are tracked and returned
to main. It does not trigger the test short-circuiting mechanism because
testRunner.runWorkers() doesn't return an error.

Introduces dependency injection for GitHub in Run so GitHub failures can be
mocked in testing

Resolves cockroachdb#147116

Epic: None
Release note: None
@williamchoe3 williamchoe3 force-pushed the wchoe/147116-github-err-will-fail-pipeline branch from 3ead89f to 3ac7c44 Compare July 11, 2025 16:12
@williamchoe3
Contributor Author

The TeamCity build has unfortunately been very inconsistent for me, and just me. I've been assuming it's intermittent failures, but mine is the only branch hitting them. I've been able to get it to pass by just rerunning, but I'm going to take a look at what's going on. I'm noticing heap memory issues when publishing artifacts, but I'm not adding any new artifacts, so I don't know why I'd be getting OOMs.

Collaborator

@herkolategan herkolategan left a comment

Thanks for making the changes! I'll go look at the TC test runs as well. It's usually easy to find using the PR number, but sometimes we paste a link to the TC build on the PR just to make it easier for reviewers to link directly to it.

@williamchoe3
Contributor Author

Some extra context: https://cockroachlabs.slack.com/archives/CJ0H8Q97C/p1752259095090199
Ran local stress and race testing, didn't see anything out of the ordinary; will go ahead with the merge.

@williamchoe3
Contributor Author

TFTR!

bors r=herkolategan,DarrylWong

@craig
Contributor

craig bot commented Jul 14, 2025

@craig craig bot merged commit cbdd88b into cockroachdb:master Jul 14, 2025
21 of 22 checks passed
@srosenberg
Member

Nice work! As a precaution, we usually fire up a smoke test using SELECT_PROBABILITY [1] prior to borsing. Here is an example [2]. This applies to any non-trivial change which touches the roachtest framework. Historically, the smoke test has been pretty effective at finding issues that were missed during PR review. Worst-case scenario, the nightly would crash, preempting some of the roachtests.

[1] https://cockroachlabs.atlassian.net/wiki/spaces/TE/pages/146538717/Roachtest#Roachtest-OverridingSELECT_PROBABILITY
[2] #148244 (comment)

herkolategan added a commit to herkolategan/cockroach that referenced this pull request Jul 15, 2025
During the migration of the github poster initialization in cockroachdb#149479, the default
team loader configuration was inadvertently omitted. This change restores the
default team loader setting to resolve the issue.

Informs: cockroachdb#149479

Epic: None
Release note: None
craig bot pushed a commit that referenced this pull request Jul 15, 2025
150223: roachtest: fix nil team loader r=williamchoe3,srosenberg a=herkolategan

During the migration of the github poster initialization in #149479, the default team loader configuration was inadvertently omitted. This change restores the default team loader setting to resolve the issue.

Informs: #149479

Epic: None
Release note: None

Co-authored-by: Herko Lategan <herko@cockroachlabs.com>
