telemetry: make sure test starts with clean slate #32899
craig[bot] merged 1 commit into cockroachdb:master
Conversation
this will want a backport
andreimatei
left a comment
what errors are we wiping? Consider adding a comment about them.
LGTM if those errors are kosher... Although I'm suspicious...
Reviewable status:
complete! 0 of 0 LGTMs obtained
dt
left a comment
All the statements the test runs directly are error-checked, so somehow a bonus error is sneaking into the error telemetry totals, which I have to assume is left over from something during startup. If the test made it this far, though, whatever it is must have been handled or retried, since the test execution hasn't complained.
andreimatei
left a comment
Aren't you curious what error that is?
The test expects certain activity to be what is measured and reported. Sometimes the test fails when the reported activity includes extra errors that were not part of the intentionally generated activity. During the activity generation, we make sure we're getting the errors, and only the errors, we expect. However, if errors were encountered earlier, during startup, records of them may linger and then be mixed with the record of the generated activity, throwing off our measurements. Avoid this by resetting the recorder's counts before starting to generate the activity we actually want to measure. Fixes cockroachdb#31778. Release note: none.
Force-pushed c225f25 to 075df6d
Nope, not even a little: this test is testing whether we measure and report the activity that we generate in the test. Server startup is complicated -- it includes starting lots of things, running migrations, and other stuff, all of which has its own error handling/retries/etc. None of that is within the scope of what we want to test here, as long as testserver.Start doesn't return an error. bors r+
32899: telemetry: make sure test starts with clean slate r=dt a=dt Fixes #31778. Release note: none. Co-authored-by: David Taylor <tinystatemachine@gmail.com>
Build succeeded
andreimatei
left a comment
I mostly agree about the test in isolation - although if we could avoid burdening the test with the reset this patch is adding, it'd be nice.
But I don't really get it - it sounds like we accumulate some errors during startup that the telemetry will report. Isn't that straight-up bad / a bug? It'll pollute our stats, won't it?
More importantly, here's an example where a bit of curiosity about why a test needs to work around startup phenomena led to very fruitful work: #32495. And I can find a ton of others. So when you notice a test showing that something is funky, I think it's kinda on you to do something about it - at the very least, open an issue to investigate more. And this instance here seems to me to indicate funkiness, doesn't it?
No?
Our telemetry is interested in how many instances of each error code we see, and if running a migration is sometimes retried because it sees an uncertainty error or something, then it is fine, and intended, that that's counted in telemetry. But within the test, we want to measure and verify specific activity, to ensure telemetry collection is doing what is expected and nothing else. I'm not convinced there's enough to smell funk here -- so, so many things happen during startup, all of them wrapped in retries, that I'm hesitant to start a fire drill when nothing appears wrong other than the number of errors the executor has returned.
If a retriable error is enough here, then indeed I'm no longer particularly concerned.