Skip to content

github-post: add timeout handling to the stress issue poster#29987

Merged
craig[bot] merged 1 commit intocockroachdb:masterfrom
andreimatei:slow-tests
Sep 10, 2018
Merged

github-post: add timeout handling to the stress issue poster#29987
craig[bot] merged 1 commit intocockroachdb:masterfrom
andreimatei:slow-tests

Conversation

@andreimatei
Copy link
Copy Markdown
Contributor

Add explicit support to the issue poster for timeouts:

  • on all runs, publish an artifacts file with a list of slow tests
  • when timeouts happen, distinguish between the case where the test
    currently running at the timeout point is the culprit (i.e. if it is the
    longest running test) versus situations where that test is just an
    innocent bystender

This patch also spruces up the github-post script in various ways. Among
them there's now better support for running it on an input that comes
directly from a test run, and not from a stress wrapper.

Release note: None

@andreimatei andreimatei requested a review from a team September 9, 2018 16:58
@cockroach-teamcity
Copy link
Copy Markdown
Member

This change is Reviewable

Copy link
Copy Markdown
Collaborator

@petermattis petermattis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for tackling this! I gave the changes in listFailures a quick review. Tons of logic in that method now. I wonder if it should be broken into smaller pieces. Regardless, the test case additions give me some confidence in this PR.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained


build/teamcity-stress.sh, line 38 at r1 (raw file):

make stress PKG="$PKG" TESTTIMEOUT=40m GOFLAGS="$GOFLAGS" TAGS="$TAGS" STRESSFLAGS="-maxruns 100 -maxfails 1 -stderr $STRESSFLAGS" \
  | tee artifacts/stress.log \
  | go tool test2json -t | github-post

@benesch or someone else who is knowledgable about shell pipelines will have to review this.


pkg/cmd/github-post/main.go, line 97 at r1 (raw file):

	// tests. Those tests will remain "outstanding" and will be ignored for the
	// purpose of issue reporting.
	outstendingOutput := make(map[string][]testEvent)

s/outstending/outstanding/g


pkg/cmd/github-post/main.go, line 100 at r1 (raw file):

	failures := make(map[string][]testEvent)
	slowPassingTests := make([]testEvent, 0, 1000)
	slowFailingTests := make([]testEvent, 0, 1000)

I doubt there is a measurable effect for pre-allocating these slices. In non-performance sensitive code, I'd go with something more like:

var slowPassingTests []testEvent
var slowFailingTests []testEvent

pkg/cmd/github-post/main.go, line 304 at r1 (raw file):

	}

	// Sort slow test descendingly by duration.

s/test/tests/g


pkg/cmd/github-post/main.go, line 393 at r1 (raw file):

	if _, err := f.WriteString(report); err != nil {
		return err
	}

The above can be written as:

if err := ioutil.WriteFile("artifacts/slow-tests-report.txt", []byte(report), 0644); err != nil {
  return err
}

Copy link
Copy Markdown
Contributor

@benesch benesch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained


build/teamcity-stress.sh, line 38 at r1 (raw file):

Previously, petermattis (Peter Mattis) wrote…

@benesch or someone else who is knowledgable about shell pipelines will have to review this.

LGTM but can you stick a 2>&1 onto the make stress invocation? We should have had that there all along, I think.

Add explicit support to the issue poster for timeouts:
- on all runs, publish an artifacts file with a list of slow tests
- when timeouts happen, distinguish between the case where the test
currently running at the timeout point is the culprit (i.e. if it is the
longest running test) versus situations where that test is just an
innocent bystender

This patch also spruces up the github-post script in various ways. Among
them there's now better support for running it on an input that comes
directly from a test run, and not from a stress wrapper.

Release note: None
Copy link
Copy Markdown
Contributor Author

@andreimatei andreimatei left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

well, I have tried to leave the place cleaner than I found it... I'm not keen to attempt more refactoring now cause I'm not entirely sure what I would do :)

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained


build/teamcity-stress.sh, line 38 at r1 (raw file):

Previously, benesch (Nikhil Benesch) wrote…

LGTM but can you stick a 2>&1 onto the make stress invocation? We should have had that there all along, I think.

done


pkg/cmd/github-post/main.go, line 97 at r1 (raw file):

Previously, petermattis (Peter Mattis) wrote…

s/outstending/outstanding/g

Done.


pkg/cmd/github-post/main.go, line 100 at r1 (raw file):

Previously, petermattis (Peter Mattis) wrote…

I doubt there is a measurable effect for pre-allocating these slices. In non-performance sensitive code, I'd go with something more like:

var slowPassingTests []testEvent
var slowFailingTests []testEvent

Done.


pkg/cmd/github-post/main.go, line 304 at r1 (raw file):

Previously, petermattis (Peter Mattis) wrote…

s/test/tests/g

Done.


pkg/cmd/github-post/main.go, line 393 at r1 (raw file):

Previously, petermattis (Peter Mattis) wrote…

The above can be written as:

if err := ioutil.WriteFile("artifacts/slow-tests-report.txt", []byte(report), 0644); err != nil {
  return err
}

Done.

Copy link
Copy Markdown
Collaborator

@petermattis petermattis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

well, I have tried to leave the place cleaner than I found it... I'm not keen to attempt more refactoring now cause I'm not entirely sure what I would do :)

Ack. You should probably prepare yourself for a bit of inevitable fallout. I find that occurs whenever these helper programs are changed.

:lgtm:

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale)

Copy link
Copy Markdown
Contributor

@benesch benesch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The changes to the shell script :lgtm:. Deferring to Peter's LGTM on the rest.

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (and 1 stale)

Copy link
Copy Markdown
Contributor Author

@andreimatei andreimatei left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TFTRs
Btw, the real thing to do, in my opinion, is not so much refactoring the listFailures method more, but considering options for not relying on a pipeline of less-than-ideal tools: stress -> go test runner -> test2son -> github-post. The combination of the first 3 produces input that's really hard to operate on, so whatever you do on top is probably going to ugly.

bors r+

Reviewable status: :shipit: complete! 2 of 0 LGTMs obtained

craig bot pushed a commit that referenced this pull request Sep 10, 2018
16354: storage: Delay initial metrics until system config is ready r=bdarnell a=bdarnell

This was previously done too soon at startup and would clutter the
logs (mainly in tests) with "failed initial metrics computation".

Fixes #13560

29817: client, roachpb: eliminate log spam when loadgens are killed r=andreimatei a=andreimatei

When the TPCC loadgen is CTRL-C'ed, the logs are spammed with:
I180906 22:02:41.239771 27177127 internal/client/txn.go:625  [n1] async rollback failed: TransactionStatusError: already committed (REASON_UNKNOWN): "sql txn" id=57c518aa key=/Table/61/1/1224/0 rw=true pri=0.03781170 iso=SERIALIZABLE stat=COMMITTED epo=0 ts=1536271361.068457287,0 orig=1536271361.068457287,0 max=1536271361.073978228,0 wto=false rop=false seq=11 int=5

The "async rollback" part refers to the rollback being done with a
canceled ctx (presumably a dropped connection's ctx). I believe the
error happens because there's a commit in flight when the ctx is
canceled. This patch lowers the message's level for this case.

Release note: None

29987: github-post: add timeout handling to the stress issue poster r=andreimatei a=andreimatei

Add explicit support to the issue poster for timeouts:
- on all runs, publish an artifacts file with a list of slow tests
- when timeouts happen, distinguish between the case where the test
currently running at the timeout point is the culprit (i.e. if it is the
longest running test) versus situations where that test is just an
innocent bystender

This patch also spruces up the github-post script in various ways. Among
them there's now better support for running it on an input that comes
directly from a test run, and not from a stress wrapper.

Release note: None

30006: opt: fix panic during SELECT MIN(NULL) r=rytaft a=arjunravinarayan

Fixes #29695.

Release note (bug fix): fix a crash when SELECT MIN(NULL) was run with
the optimizer enabled.

30008: gossip: allow receipt of "loopback infos" r=bdarnell a=petermattis

Receipt of loopback infos was disabled in #29398, but doing so had the
unfortunate effect of allowing gossip state to temporarily diverge
between nodes. Rather than disallowing loopback infos, we now ratchet
the gossip monotonic clock in order to avoid the assertion around the
gossip highwater timestamps.

Fixes #29992
Fixes #20986

Release note: None

Co-authored-by: Ben Darnell <ben@cockroachlabs.com>
Co-authored-by: Andrei Matei <andrei@cockroachlabs.com>
Co-authored-by: Arjun Narayan <arjun@cockroachlabs.com>
Co-authored-by: Peter Mattis <petermattis@gmail.com>
@craig
Copy link
Copy Markdown
Contributor

craig bot commented Sep 10, 2018

Build succeeded

@craig craig bot merged commit da4fd85 into cockroachdb:master Sep 10, 2018
@andreimatei andreimatei deleted the slow-tests branch September 10, 2018 19:49
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants