
telemetry: make sure test starts with clean slate#32899

Merged
craig[bot] merged 1 commit into cockroachdb:master from dt:deflake-errorcount
Dec 18, 2018

Conversation

@dt
Contributor

@dt dt commented Dec 6, 2018

The test expects certain activity to be what is measured and reported. Sometimes the test fails when the
reported activity includes extra errors that were not part of the intentionally generated activity.

During the activity generation, we make sure we're getting the errors, and only the errors, we expect.
However, if errors were encountered earlier, during startup, records of them may linger and then be mixed
with the record of the generated activity, throwing off our measurements.

Avoid this by resetting the recorder's counts before starting to generate the activities we actually want
to measure.

Fixes #31778.

Release note: none.
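The reset-before-measure idea described above can be sketched in Go. This is a minimal illustration, not CockroachDB's actual telemetry code; the `recorder` type and error codes here are hypothetical stand-ins:

```go
package main

import "fmt"

// recorder is a hypothetical stand-in for a telemetry counter store,
// keyed by error code. The real CockroachDB type differs.
type recorder struct {
	counts map[string]int
}

func newRecorder() *recorder {
	return &recorder{counts: map[string]int{}}
}

func (r *recorder) inc(code string) {
	r.counts[code]++
}

// reset gives the test a clean slate, discarding anything counted
// before the test's own activity begins (e.g. during server startup).
func (r *recorder) reset() {
	r.counts = map[string]int{}
}

func main() {
	r := newRecorder()

	// Startup noise: an error recorded before the test's activity begins.
	r.inc("XX000")

	// Reset before generating the activity we actually want to measure.
	r.reset()

	// Intentionally generated activity.
	r.inc("22012")

	// Only the intentional activity is counted after the reset.
	fmt.Println(r.counts["22012"], r.counts["XX000"]) // prints "1 0"
}
```

Without the `reset()` call, the startup-time `XX000` count would be mixed into the measured totals, which is exactly the flake the patch avoids.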

@dt dt requested review from a team, andreimatei and knz December 6, 2018 15:19
@cockroach-teamcity
Member

This change is Reviewable

@knz
Contributor

knz commented Dec 6, 2018

this will want a backport

Contributor

@andreimatei andreimatei left a comment

what errors are we wiping? Consider adding a comment about them.
LGTM if those errors are kosher... Although I'm suspicious...

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained

Contributor Author

@dt dt left a comment

All the statements the test runs directly are error-checked, so somehow a bonus error is sneaking into the error telemetry totals, which I have to assume is leftover from something during startup. If the test made it this far, though, whatever it was must have been handled or retried, since the test execution hasn't complained.

Contributor

@andreimatei andreimatei left a comment

Aren't you curious what error that is?

@dt dt force-pushed the deflake-errorcount branch from c225f25 to 075df6d on December 17, 2018 22:44
@dt
Contributor Author

dt commented Dec 18, 2018

Nope, not even a little: this test checks whether we measure and report the activity that we generate in the test. Server startup is complicated -- it includes starting lots of things, running migrations, and other work, all of which has its own error handling, retries, etc. None of that is within the scope of what we want to test here, as long as testserver.Start doesn't return an error.

bors r+

craig bot pushed a commit that referenced this pull request Dec 18, 2018
32899: telemetry: make sure test starts with clean slate r=dt a=dt

Co-authored-by: David Taylor <tinystatemachine@gmail.com>
@craig
Contributor

craig bot commented Dec 18, 2018

Build succeeded

@craig craig bot merged commit 075df6d into cockroachdb:master Dec 18, 2018
@dt dt deleted the deflake-errorcount branch December 18, 2018 18:29
Contributor

@andreimatei andreimatei left a comment

I mostly agree about the test in isolation -- although if we could avoid burdening the test with the reset this patch adds, that would be nice.
But I don't really get it: it sounds like we accumulate some errors during startup that the telemetry will report. Isn't that straight-up bad, a bug? It'll pollute our stats, won't it?
More importantly, here's an example where a bit of curiosity about why a test needs to work around startup phenomena led to very fruitful work: #32495. And I can find a ton of others. So when you notice a test showing that something is funky, I think it's kind of on you to do something about it -- at the very least, open an issue to investigate further. And this instance seems to me to indicate funkiness, doesn't it?
No?

@dt
Contributor Author

dt commented Dec 18, 2018

Our telemetry is interested in how many instances of each error code we see, and if a migration is sometimes retried because it saw an uncertainty error or something, then it is fine, and intended, that that's counted in telemetry. But within the test, we want to measure and verify specific activity, to ensure telemetry collection is doing what is expected and nothing else.

I'm not convinced there's enough to smell funk here -- so, so many things happen during startup, all of them wrapped in retries, that I'm hesitant to start a fire drill when nothing appears wrong other than the number of errors the executor has returned.

@andreimatei
Contributor

If a retriable error is enough here, then indeed I'm no longer particularly concerned.
