Compact: fix objstore bucket operations check to improve flakiness#6064
Compact: fix objstore bucket operations check to improve flakiness#6064matej-g merged 5 commits intothanos-io:mainfrom
Conversation
Adding some sleeps to the test caused this assertion to consistently fail. If it relies on timing of things to be correct, it's intermittent. Signed-off-by: Douglas Camata <159076+douglascamata@users.noreply.github.com>
test/e2e/compact_test.go
Outdated
| testutil.Ok(t, err) | ||
| testutil.Ok(t, c.WaitSumMetricsWithOptions( | ||
| e2emon.Equals(573), | ||
| e2emon.GreaterOrEqual(573), |
There was a problem hiding this comment.
I think the idea of these assertions is to make sure the compactor is efficient and does not make more loops/API requests than necessary.
There was a problem hiding this comment.
I get the idea. But this is flaky because the number of operations depend heavily on time (could be a bug, I don't know), which becomes a hassle when processes are fighting for CPU time in CI or even if you run all e2e tests locally (because they will run in parallel).
If I put a sleep on line 725, depending on how much time the sleep takes the assertion will fail because thanos_objstore_bucket_operations_total will have a completely different value. This test has to be fixed and made independent of time, otherwise it will stay flaky. But I do not have a clear picture of how to achieve it.
There was a problem hiding this comment.
Got it. In that case I am not sure what's the best option here, maybe we can use LessThan(1000) or some larger value to make sure the compactor does not go into infinite loops. If we have to use GreaterThan, we might as well remove the assertion.
There was a problem hiding this comment.
@fpetkovski I implemented a custom matcher to use in the assertion, to be able to assert the value is between 0 and 1000, instead of just checking one of the two bounds. This is not present yet in efficientgo/e2emon, so it lives in the e2ethanos package for now.
I've ran the test a few times locally with this configuration and it was so good that I am removing the skip on the penalty compactor test.
PTAL 🙇
Signed-off-by: Douglas Camata <159076+douglascamata@users.noreply.github.com>
Signed-off-by: Douglas Camata <159076+douglascamata@users.noreply.github.com>
…ctor-test Signed-off-by: Douglas Camata <159076+douglascamata@users.noreply.github.com>
matej-g
left a comment
There was a problem hiding this comment.
Thanks for this, if this finally fixes #4866 I'm buying you a beer 🍻. From my understanding, we are not able to assert on exact number due to what you described in the description, so having this as the second best seems reasonable.
Apart from failing lint, this looks good!
| @@ -0,0 +1,16 @@ | |||
| package e2ethanos | |||
There was a problem hiding this comment.
Why not add this directly to the upstream? It probably would be useful to other as well 👍
There was a problem hiding this comment.
I can add there in parallel, but I personally do not want to be blocked here until it's merged there. Honestly the flaky tests are being a massive waste of time and resources (paid or not) on my PRs (and everybody else's I bet), plus we have the penalty dedup test completely skipped since several months.
There was a problem hiding this comment.
Sure, we can replace it eventually next time we bump E2E version 👍
Signed-off-by: Douglas Camata <159076+douglascamata@users.noreply.github.com>
saswatamcode
left a comment
There was a problem hiding this comment.
Thanks for fixing! This has been a pain point for a long time!
Just one tiny question for the new assert method.
|
|
||
| // Between is a MetricValueExpectation function for WaitSumMetrics that returns true if given single sum is between | ||
| // the lower and upper bounds (non-inclusive, as in `lower < x < upper`). | ||
| func Between(lower, upper float64) e2emon.MetricValueExpectation { |
There was a problem hiding this comment.
Hmm, so the expectation is that whatever metric is being asserted would freeze at a particular value, between lower and upper bounds?
There was a problem hiding this comment.
The expectation is precisely the meaning of the mathematical expression lower < x < upper: x can be anything between lower and upper. It's effectively the same as running Greater(lower) and Less(upper).
Adding some sleeps to the test caused this assertion to consistently fail. If it relies on timing of things to be correct, it's intermittent if the goal is to be precise. As removing this assertion might not be a good idea, doing a lower boundary check suffices.
This is aimed to fix CI errors like the one below, copied from https://github.com/thanos-io/thanos/actions/runs/3957791998/jobs/6778588621:
If you add a
time.Sleepbefore this assertion and increase the time it will increasethanos_objstore_bucket_operations_totalmore and more. The Compactor isn't halted and for some reasons the bucket operations keep going up. I don't know if they stop at some point.Signed-off-by: Douglas Camata 159076+douglascamata@users.noreply.github.com
Changes
thanos_objstore_bucket_operations_totalusingGreateOrEqual, because if some time passes the actual value will change to be bigger than what is expected.Verification
Bunch of local tests run with some
time.Sleepthroughout the code, which failed before, now passed.