Compact: fix objstore bucket operations check to improve flakiness by douglascamata · Pull Request #6064 · thanos-io/thanos

douglascamata · 2023-01-20T13:08:50Z

Adding some sleeps to the test caused this assertion to fail consistently. An assertion on an exact value that only holds when the timing is right is inherently intermittent. As removing this assertion might not be a good idea, a lower-bound check suffices.

This aims to fix CI errors like the one below, copied from https://github.com/thanos-io/thanos/actions/runs/3957791998/jobs/6778588621:

=== CONT  TestCompactWithStoreGateway
    compact_test.go:726: compact_test.go:726:
        
         unexpected error: unable to find metrics [thanos_objstore_bucket_operations_total] with expected values after 50 retries. Last error: <nil>. Last values: [631]

If you add a time.Sleep before this assertion and increase its duration, thanos_objstore_bucket_operations_total grows more and more. The Compactor isn't halted, and for some reason the bucket operations keep going up. I don't know if they stop at some point.

Signed-off-by: Douglas Camata <159076+douglascamata@users.noreply.github.com>

I added CHANGELOG entry for this change.
Change is not relevant to the end user.

Changes

Assert on thanos_objstore_bucket_operations_total using GreaterOrEqual, because as time passes the actual value grows beyond the expected one.
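To illustrate why an exact-match assertion is timing-sensitive while a lower-bound one is not, here is a minimal, self-contained sketch. It does not use e2emon: `expectation`, `waitFor`, and `growingCounter` are hypothetical stand-ins invented for this example, and the starting value of 600 is made up.

```go
package main

import "fmt"

// expectation is a hypothetical, simplified matcher: it receives the summed
// metric value and reports whether the wait can stop.
type expectation func(sum float64) bool

func equals(want float64) expectation {
	return func(sum float64) bool { return sum == want }
}

func greaterOrEqual(min float64) expectation {
	return func(sum float64) bool { return sum >= min }
}

// waitFor polls read up to retries times and reports whether the
// expectation was met on any poll.
func waitFor(read func() float64, ok expectation, retries int) bool {
	for i := 0; i < retries; i++ {
		if ok(read()) {
			return true
		}
	}
	return false
}

// growingCounter simulates a counter that is already past the expected
// value when polling starts and keeps increasing on every read.
func growingCounter(start, step float64) func() float64 {
	v := start
	return func() float64 {
		v += step
		return v
	}
}

func main() {
	// The exact match never succeeds once the counter has moved past 573.
	fmt.Println(waitFor(growingCounter(600, 1), equals(573), 50))
	// The lower-bound check succeeds no matter how late polling starts.
	fmt.Println(waitFor(growingCounter(600, 1), greaterOrEqual(573), 50))
}
```

In this simulation the first call reports false and the second true, which mirrors the CI failure above where the last observed value was already larger than the expected one.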

Verification

Several local test runs with some time.Sleep calls added throughout the code, which failed before, now pass.


fpetkovski · 2023-01-21T08:59:05Z

test/e2e/compact_test.go

 		testutil.Ok(t, err)
 		testutil.Ok(t, c.WaitSumMetricsWithOptions(
-			e2emon.Equals(573),
+			e2emon.GreaterOrEqual(573),


I think the idea of these assertions is to make sure the compactor is efficient and does not make more loops/API requests than necessary.

I get the idea. But this is flaky because the number of operations depends heavily on time (this could be a bug, I don't know), which becomes a hassle when processes are fighting for CPU time in CI, or even when you run all e2e tests locally (because they run in parallel).

If I put a sleep on line 725, the assertion fails because thanos_objstore_bucket_operations_total ends up with a completely different value, which depends on how long the sleep lasts. This test has to be fixed and made independent of time, otherwise it will stay flaky. But I do not have a clear picture of how to achieve that.

Got it. In that case I am not sure what the best option is here. Maybe we can use LessThan(1000) or some larger value to make sure the compactor does not go into infinite loops. If we have to use GreaterThan, we might as well remove the assertion.
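A small sketch of the reasoning behind preferring an upper bound: a lower bound alone accepts any sufficiently large value, so it cannot catch a compactor stuck in a loop. The helper names `greaterOrEqual` and `less` and the runaway value are hypothetical and used only for illustration; they are not the e2emon API.

```go
package main

import "fmt"

// greaterOrEqual and less are hypothetical single-value matchers.
func greaterOrEqual(min float64) func(float64) bool {
	return func(v float64) bool { return v >= min }
}

func less(max float64) func(float64) bool {
	return func(v float64) bool { return v < max }
}

func main() {
	runaway := 1e6 // made-up operation count from a compactor looping endlessly

	// The lower bound accepts the runaway value, so it detects nothing.
	fmt.Println(greaterOrEqual(573)(runaway))
	// The upper bound rejects the runaway value...
	fmt.Println(less(1000)(runaway))
	// ...while still accepting a value like the one seen in the CI log.
	fmt.Println(less(1000)(631))
}
```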

@fpetkovski I implemented a custom matcher to use in the assertion, to be able to assert the value is between 0 and 1000, instead of just checking one of the two bounds. This is not present yet in efficientgo/e2emon, so it lives in the e2ethanos package for now.

I've run the test a few times locally with this configuration, and it worked so well that I am removing the skip on the penalty compactor test.

PTAL 🙇


fpetkovski

Nice 👍

matej-g

Thanks for this, if this finally fixes #4866 I'm buying you a beer 🍻. From my understanding, we are not able to assert on an exact number because of what you explained in the description, so having this as the second-best option seems reasonable.

Apart from failing lint, this looks good!

matej-g · 2023-03-29T08:44:00Z

test/e2e/e2ethanos/custom_test_matchers.go

@@ -0,0 +1,16 @@
+package e2ethanos


Why not add this directly upstream? It would probably be useful to others as well 👍

I can add it there in parallel, but I personally do not want to be blocked here until it's merged there. Honestly, the flaky tests are a massive waste of time and resources (paid or not) on my PRs (and everybody else's, I bet), plus the penalty dedup test has been completely skipped for several months.

FYI, PR opened upstream: efficientgo/e2e#65

Sure, we can replace it eventually, next time we bump the E2E version 👍


saswatamcode

Thanks for fixing! This has been a pain point for a long time!

Just one tiny question for the new assert method.

saswatamcode · 2023-03-29T10:12:18Z

test/e2e/e2ethanos/custom_test_matchers.go

+
+// Between is a MetricValueExpectation function for WaitSumMetrics that returns true if given single sum is between
+// the lower and upper bounds (non-inclusive, as in `lower < x < upper`).
+func Between(lower, upper float64) e2emon.MetricValueExpectation {


Hmm, so the expectation is that whatever metric is being asserted would freeze at a particular value, between lower and upper bounds?

The expectation is precisely the meaning of the mathematical expression lower < x < upper: x can be anything between lower and upper. It's effectively the same as running Greater(lower) and Less(upper).
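For readers without the diff at hand, a matcher with these semantics could look roughly like the sketch below. It is not the code from this PR: `metricValueExpectation` is a local stand-in that assumes e2emon.MetricValueExpectation has the shape `func(sums ...float64) bool`, and the lowercase `between` is a hypothetical re-creation based only on the doc comment quoted above.

```go
package main

import "fmt"

// metricValueExpectation is a local stand-in for the matcher type.
// Assumption: the real type takes the summed values and returns a bool.
type metricValueExpectation func(sums ...float64) bool

// between returns a matcher that is true when the single given sum lies
// strictly between the bounds, as in lower < x < upper.
func between(lower, upper float64) metricValueExpectation {
	return func(sums ...float64) bool {
		if len(sums) != 1 {
			panic("between: expecting 1 value")
		}
		return sums[0] > lower && sums[0] < upper
	}
}

func main() {
	check := between(0, 1000)
	// Both bounds are excluded; only the middle value matches.
	for _, v := range []float64{0, 631, 1000} {
		fmt.Println(v, check(v))
	}
}
```

Because the bounds are non-inclusive, a metric that never left 0 fails the check just like one that reached 1000, so the assertion covers both "nothing happened" and "too many operations".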

Fix objstore bucket operations check

950a920

Adding some sleeps to the test caused this assertion to consistently fail. If it relies on timing of things to be correct, it's intermittent. Signed-off-by: Douglas Camata <159076+douglascamata@users.noreply.github.com>

pull-request-size bot added the size/XS label Jan 20, 2023

fpetkovski reviewed Jan 21, 2023


douglascamata added 2 commits March 28, 2023 20:45

Improve assertion on Compact e2e test

759c9d3

Signed-off-by: Douglas Camata <159076+douglascamata@users.noreply.github.com>

Reanble skipped Compactor e2e test (penalty dedup)

b41d220

Signed-off-by: Douglas Camata <159076+douglascamata@users.noreply.github.com>

pull-request-size bot added size/S and removed size/XS labels Mar 28, 2023

Merge branch 'main' of github.com:thanos-io/thanos into improve-compa…

ca4934a

…ctor-test Signed-off-by: Douglas Camata <159076+douglascamata@users.noreply.github.com>

fpetkovski previously approved these changes Mar 29, 2023


matej-g previously approved these changes Mar 29, 2023


Add missing copyright notice

ad56598

Signed-off-by: Douglas Camata <159076+douglascamata@users.noreply.github.com>

douglascamata dismissed stale reviews from matej-g and fpetkovski via ad56598 March 29, 2023 10:12

saswatamcode reviewed Mar 29, 2023


matej-g approved these changes Mar 30, 2023


matej-g merged commit c1d2c5f into thanos-io:main Mar 30, 2023

douglascamata mentioned this pull request Apr 21, 2023

tests: Remove custom Between test matcher #6310

Merged

2 tasks

douglascamata mentioned this pull request May 4, 2023

Flaky compact penalty deduplication E2E test #4866

Open
