
telemetry: make sure test starts with clean slate#32899

Merged
craig[bot] merged 1 commit into cockroachdb:master from dt:deflake-errorcount
Dec 18, 2018

Conversation

@dt
Contributor

@dt dt commented Dec 6, 2018

The test expects certain activity to be what is measured and reported. Sometimes the test fails when the
reported activity includes extra errors that were not part of the intentionally generated activity.

During the activity generation, we make sure we're getting the errors, and only the errors, we expect.
However, if errors were encountered earlier, during startup, records of them may linger and then be mixed
with the record of the generated activity, throwing off our measurements.

Avoid this by resetting the recorder's counts before starting to generate the activities we actually want
to measure.

Fixes #31778.

Release note: none.
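The reset-before-measure idea described above can be sketched in Go. This is a minimal illustration, not CockroachDB's actual telemetry code; the `recorder` type and error codes here are hypothetical stand-ins:

```go
package main

import "fmt"

// recorder is a hypothetical stand-in for a telemetry counter store,
// keyed by error code. The real CockroachDB type differs.
type recorder struct {
	counts map[string]int
}

func newRecorder() *recorder {
	return &recorder{counts: map[string]int{}}
}

func (r *recorder) inc(code string) {
	r.counts[code]++
}

// reset gives the test a clean slate, discarding anything counted
// before the test's own activity begins (e.g. during server startup).
func (r *recorder) reset() {
	r.counts = map[string]int{}
}

func main() {
	r := newRecorder()

	// Startup noise: an error recorded before the test's activity begins.
	r.inc("XX000")

	// Reset before generating the activity we actually want to measure.
	r.reset()

	// Intentionally generated activity.
	r.inc("22012")

	// Only the intentional activity is counted after the reset.
	fmt.Println(r.counts["22012"], r.counts["XX000"]) // prints "1 0"
}
```

Without the `reset()` call, the startup-time `XX000` count would be mixed into the measured totals, which is exactly the flake the patch avoids.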

@dt dt requested review from a team, andreimatei and knz December 6, 2018 15:19
@cockroach-teamcity
Member

This change is Reviewable

@knz
Contributor

knz commented Dec 6, 2018

this will want a backport

Contributor

@andreimatei andreimatei left a comment

what errors are we wiping? Consider adding a comment about them.
LGTM if those errors are kosher... Although I'm suspicious...

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained

Contributor Author

@dt dt left a comment

All the statements the test runs directly are error-checked, so somehow a bonus error is sneaking into the error telemetry totals, which I have to assume is leftover from something during startup. If the test made it this far, though, whatever it was must have been handled or retried, since the test execution hasn't complained.

Contributor

@andreimatei andreimatei left a comment

Aren't you curious what error that is?

@dt dt force-pushed the deflake-errorcount branch from c225f25 to 075df6d on December 17, 2018 22:44
@dt
Contributor Author

dt commented Dec 18, 2018

Nope, not even a little: this test checks whether we measure and report the activity that we generate in the test. Server startup is complicated -- it includes starting lots of things, running migrations, and other work, all of which has its own error handling, retries, etc. None of that is within the scope of what we want to test here, as long as testserver.Start doesn't return an error.

bors r+

craig bot pushed a commit that referenced this pull request Dec 18, 2018
32899: telemetry: make sure test starts with clean slate r=dt a=dt

Co-authored-by: David Taylor <tinystatemachine@gmail.com>
@craig
Contributor

craig bot commented Dec 18, 2018

Build succeeded

@craig craig bot merged commit 075df6d into cockroachdb:master Dec 18, 2018
@dt dt deleted the deflake-errorcount branch December 18, 2018 18:29
Contributor

@andreimatei andreimatei left a comment

I mostly agree about the test in isolation -- although if we could avoid burdening the test with the reset this patch adds, that would be nice.
But I don't really get it: it sounds like we accumulate some errors during startup that the telemetry will report. Isn't that straight-up bad, a bug? It'll pollute our stats, won't it?
More importantly, here's an example where a bit of curiosity about why a test needs to work around startup phenomena led to very fruitful work: #32495. And I can find a ton of others. So when you notice a test showing that something is funky, I think it's kind of on you to do something about it -- at the very least, open an issue to investigate further. And this instance seems to me to indicate funkiness, doesn't it?
No?

@dt
Contributor Author

dt commented Dec 18, 2018

Our telemetry is interested in how many instances of each error code we see, and if a migration is sometimes retried because it saw an uncertainty error or something, then it is fine, and intended, that that's counted in telemetry. But within the test, we want to measure and verify specific activity, to ensure telemetry collection is doing what is expected and nothing else.

I'm not convinced there's enough to smell funk here -- so, so many things happen during startup, all of them wrapped in retries, that I'm hesitant to start a fire drill when nothing appears wrong other than the number of errors the executor has returned.

@andreimatei
Contributor

If a retriable error is enough here, then indeed I'm no longer particularly concerned.
