Skip to content

Refactor metrics controller tests to fix flaky behaviour#4406

Merged
markmandel merged 1 commit intoagones-dev:mainfrom
markmandel:flaky/TestControllerGameServersNodeState
Jan 7, 2026
Merged

Refactor metrics controller tests to fix flaky behaviour#4406
markmandel merged 1 commit intoagones-dev:mainfrom
markmandel:flaky/TestControllerGameServersNodeState

Conversation

@markmandel
Copy link
Copy Markdown
Collaborator

What type of PR is this?

Uncomment only one /kind <> line, press enter to put that in a new line, and remove leading whitespace from that line:

/kind breaking
/kind bug

/kind cleanup

/kind documentation
/kind feature
/kind hotfix
/kind release

What this PR does / Why we need it:

Moved testing that is polls through Eventually() tests into the metric assert function - since all metric data is eventually consistent.

Also centralizes edge cases around Distribution metrics.

Which issue(s) this PR fixes:

Closes #4391

Special notes for your reviewer:

N/A

Moved testing that is polls through Eventually() tests into the metric
assert function - since all metric data is eventually consistent.

Also centralizes edge cases around Distribution metrics.

Closes agones-dev#4391
@github-actions github-actions bot added kind/cleanup Refactoring code, fixing up documentation, etc size/M labels Jan 7, 2026
@markmandel markmandel added the area/tests Unit tests, e2e tests, anything to make sure things don't break label Jan 7, 2026
@agones-bot
Copy link
Copy Markdown
Collaborator

Build Succeeded 🥳

Build Id: b4e464c4-e76b-4aeb-b988-cdbf418cb7be

The following development artifacts have been built, and will exist for the next 30 days:

A preview of the website (the last 30 builds are retained):

To install this version:

git fetch https://github.com/googleforgames/agones.git pull/4406/head:pr_4406 && git checkout pr_4406
helm install agones ./install/helm/agones --namespace agones-system --set agones.image.registry=us-docker.pkg.dev/agones-images/ci --set agones.image.tag=1.55.0-dev-1f199b0

Comment on lines +68 to +91
hasDistribution := false
for _, e := range expected {
expectedValuesAsMap[serialize(e.labels)] = e
if _, ok := e.val.(*metricdata.Distribution); ok {
hasDistribution = true
break
}
}

var wantedMetric *metricdata.Metric
for _, m := range exporter.metrics {
if m.Descriptor.Name == metricName {
wantedMetric = m
}
// For Distribution metrics, handle the synchronization and reset pattern
if hasDistribution {
// Wait until the metric is available
assert.Eventually(t, func() bool {
f()
exporter := &metricExporter{}
c.lock.Lock()
reader.ReadAndExport(exporter)
c.lock.Unlock()
for _, m := range exporter.metrics {
if m.Descriptor.Name == metricName {
return true
}
}
return false
}, 10*time.Second, 100*time.Millisecond)
Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure why this fixes things, but basically just copying the structure of other working tests. OpenCensus can be weird.

Copy link
Copy Markdown
Collaborator

@lacroixthomas lacroixthomas Jan 7, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if it was not because of a race condition, and that the lock / unlock around reader.ReadAndExport you added would be the patch ?🤔 There must be other things as well

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried with just an eventually added to the final assert... and things did not go as planned. In theory it should have worked - but this seemed good enough 🤷🏻 it tested what we wanted to test.

The locking is to avoid data races collecting and then exporting metrics, since opencensus is not thread safe! It happened very rarely when I was running this lots.

Side note - didn't know assert.EventuallyWithT was a thing - I could refactor some tests to use it. Very handy.

@markmandel
Copy link
Copy Markdown
Collaborator Author

Oh yeah - I also ran all the tests in pkg/metrics/controller_test.go 200 times, and it all passed.

Copy link
Copy Markdown
Collaborator

@lacroixthomas lacroixthomas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM !

Comment on lines +68 to +91
hasDistribution := false
for _, e := range expected {
expectedValuesAsMap[serialize(e.labels)] = e
if _, ok := e.val.(*metricdata.Distribution); ok {
hasDistribution = true
break
}
}

var wantedMetric *metricdata.Metric
for _, m := range exporter.metrics {
if m.Descriptor.Name == metricName {
wantedMetric = m
}
// For Distribution metrics, handle the synchronization and reset pattern
if hasDistribution {
// Wait until the metric is available
assert.Eventually(t, func() bool {
f()
exporter := &metricExporter{}
c.lock.Lock()
reader.ReadAndExport(exporter)
c.lock.Unlock()
for _, m := range exporter.metrics {
if m.Descriptor.Name == metricName {
return true
}
}
return false
}, 10*time.Second, 100*time.Millisecond)
Copy link
Copy Markdown
Collaborator

@lacroixthomas lacroixthomas Jan 7, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if it was not because of a race condition, and that the lock / unlock around reader.ReadAndExport you added would be the patch ?🤔 There must be other things as well

@markmandel markmandel merged commit 0d629cb into agones-dev:main Jan 7, 2026
5 checks passed
@markmandel markmandel deleted the flaky/TestControllerGameServersNodeState branch January 7, 2026 23:52
mnthe pushed a commit to mnthe/agones that referenced this pull request Mar 23, 2026
…4406)

Moved testing that is polls through Eventually() tests into the metric
assert function - since all metric data is eventually consistent.

Also centralizes edge cases around Distribution metrics.

Closes agones-dev#4391
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area/tests Unit tests, e2e tests, anything to make sure things don't break kind/cleanup Refactoring code, fixing up documentation, etc size/M

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Flaky Unit Test: TestControllerGameServersNodeState

3 participants