Summary
After upgrading Magick.NET-Q8-AnyCPU from v13.5.0 to v14.10.4 (PR #34455), 1523 UI screenshot tests fail. The root cause is that ErrorMetric.Fuzz changed its calculation in v14 and now reports much higher distortion values for the same image pairs. This reveals that the CI screenshots have been genuinely different from the baselines all along — v13 just couldn't see it.
Root Cause Analysis
ErrorMetric.Fuzz behavioral change
For the same pair of images (e.g., VerifyEntryClearButtonVisibilitySetToWhileEditing.png baseline vs CI screenshot):
| Metric | v13 | v14 |
| --- | --- | --- |
| ErrorMetric.Fuzz | 0.00248 (0.25%) | 0.08129 (8.13%) |
| ErrorMetric.RootMeanSquared | N/A | 0.00248 (0.25%) |
v14's RootMeanSquared produces the exact same values as v13's Fuzz. This means v14's Fuzz is now a fundamentally different (more sensitive) metric.
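As a sanity check on what these numbers mean, an RMS-style distortion can be reproduced outside Magick.NET. The sketch below is plain NumPy, not Magick.NET's actual implementation (which may differ in channel weighting and alpha handling), but it illustrates the scale of the reported values: identical images score 0.0, and a small changed region scores a small fraction.

```python
import numpy as np

def rms_distortion(a: np.ndarray, b: np.ndarray) -> float:
    """Root-mean-squared error over pixel values normalized to [0, 1].

    Illustrative only -- identical images give 0.0, a fully inverted
    image gives 1.0, and small localized changes give small values.
    """
    a = a.astype(np.float64) / 255.0
    b = b.astype(np.float64) / 255.0
    return float(np.sqrt(np.mean((a - b) ** 2)))

# Two 8-bit greyscale "images" differing only in a small region:
baseline = np.full((100, 100), 128, dtype=np.uint8)
actual = baseline.copy()
actual[:5, :5] = 0  # 25 of 10,000 pixels shifted grey -> black
print(rms_distortion(baseline, actual))  # small change -> small RMS (~0.025)
```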
The images ARE different
The baseline and CI screenshots are not identical — they contain real visual differences. For example, in VerifyEntryClearButtonVisibilitySetToWhileEditing:
- Baseline (in repo): Clear button X icon is grey
- CI screenshot: Clear button X icon is black
Similar color shifts (grey→black, subtle color changes) exist across many test screenshots. v13's Fuzz metric was insensitive enough that these differences fell below the 0.5% threshold. v14's Fuzz now correctly detects them.
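Some back-of-envelope arithmetic shows why such shifts slipped under the threshold. Assuming a uniform single-channel change over a fraction f of the pixels, an RMS-style error is roughly sqrt(f) times the per-pixel difference, so even a full grey-to-black flip stays below 0.5% unless it covers about one pixel in ten thousand:

```python
# How large (as a fraction of total pixels) can a grey(128) -> black(0)
# region be before an RMS-style error crosses the 0.5% threshold?
# Simplified model: RMS = sqrt(fraction_changed) * per_pixel_diff,
# for a uniform change on a single channel.
threshold = 0.005
per_pixel_diff = 128 / 255  # grey -> black, normalized to [0, 1]
max_fraction = (threshold / per_pixel_diff) ** 2
print(max_fraction)  # ~0.0001, i.e. about 0.01% of the image
```

A small icon like the clear button's X occupies far less of a full-screen screenshot than that, which is consistent with these color shifts going undetected under v13.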
Current Workaround (PR #34455)
In PR #34455, we switched the default ErrorMetric from Fuzz to RootMeanSquared to preserve v13 behavior:
```csharp
// MagickNetVisualComparer.cs - was ErrorMetric.Fuzz
public MagickNetVisualComparer(ErrorMetric errorMetric = ErrorMetric.RootMeanSquared, ...)

// MagickNetVisualDiffGenerator.cs - was ErrorMetric.Fuzz
public MagickNetVisualDiffGenerator(ErrorMetric error = ErrorMetric.RootMeanSquared)
```
This makes all tests pass again but hides real differences that v14's Fuzz can now detect.
Recommendation: Switch back to ErrorMetric.Fuzz
We should take advantage of v14's improved sensitivity. This would require:
1. Revert the ErrorMetric workaround
```csharp
// MagickNetVisualComparer.cs - switch back to Fuzz
public MagickNetVisualComparer(ErrorMetric errorMetric = ErrorMetric.Fuzz, double differenceThreshold = 0.005)

// MagickNetVisualDiffGenerator.cs - switch back to Fuzz
public MagickNetVisualDiffGenerator(ErrorMetric error = ErrorMetric.Fuzz)
```
2. Regenerate all baseline screenshots
Since the images ARE different, the baselines need updating to match what CI actually produces. This is a large batch operation:
```bash
# Find all snapshot directories
find src/Controls/tests -type d -name "snapshots"
# src/Controls/tests/TestCases.Android.Tests/snapshots/android/
# src/Controls/tests/TestCases.Android.Tests/snapshots/android-notch-36/
# src/Controls/tests/TestCases.iOS.Tests/snapshots/ios/
# src/Controls/tests/TestCases.iOS.Tests/snapshots/ios-26/
# src/Controls/tests/TestCases.Mac.Tests/snapshots/mac/
# src/Controls/tests/TestCases.WinUI.Tests/snapshots/windows/
```
The baseline regeneration needs to happen on CI infrastructure (not locally) since the baselines must match the CI environment exactly.
3. Investigate why baselines differ from CI
The fact that baselines differ from CI screenshots points to at least one of the following:
- Baselines were generated on different OS/device versions than CI currently uses
- Rendering has subtly changed over time (Android API updates, iOS version changes)
- Some baselines were generated locally with different DPI/scaling
This is worth investigating to prevent drift in the future.
4. Consider adjusting the threshold
If regenerating all baselines isn't practical immediately, the threshold could be increased from 0.5% to accommodate the new Fuzz metric. However, this reduces test sensitivity and is not recommended long-term.
To find the right threshold, analyze the v14 Fuzz values across all failing tests and pick a value that passes the "same but slightly different" cases while still catching real regressions.
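A minimal sketch of that selection, assuming the v14 Fuzz values for the visually-acceptable failures have been collected into a list (the numbers below are placeholders, not measured values):

```python
# Sketch: pick an interim Fuzz threshold from observed CI data.
# "observed" holds v14 Fuzz values from failing tests that were judged
# visually acceptable; headroom avoids flakiness right at the boundary.
def pick_threshold(observed: list[float], headroom: float = 1.25) -> float:
    """Smallest threshold that passes all acceptable diffs, plus headroom."""
    return max(observed) * headroom

observed = [0.081, 0.064, 0.073, 0.059]  # placeholder v14 Fuzz values
print(pick_threshold(observed))
```

The trade-off is explicit in the headroom factor: too little and acceptable diffs flake, too much and real regressions of similar magnitude pass unnoticed.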
Verification Script
To compare how v13 and v14 treat any pair of images:
```csharp
using ImageMagick;

var baseline = new MagickImage("baseline.png");
var actual = new MagickImage("actual.png");

// v14 Fuzz (new, more sensitive)
double fuzz = baseline.Compare(actual, ErrorMetric.Fuzz, Channels.Red);

// v14 RootMeanSquared (equivalent to v13 Fuzz)
double rms = baseline.Compare(actual, ErrorMetric.RootMeanSquared, Channels.Red);

Console.WriteLine($"Fuzz (v14 behavior): {fuzz:P4}");
Console.WriteLine($"RootMeanSquared (v13 compat): {rms:P4}");
```
Files involved
src/TestUtils/src/VisualTestUtils.MagickNet/MagickNetVisualComparer.cs — comparison metric
src/TestUtils/src/VisualTestUtils.MagickNet/MagickNetVisualDiffGenerator.cs — diff generation metric
src/Controls/tests/TestCases.Shared.Tests/UITest.cs — test framework that uses these
src/Controls/tests/TestCases.*/snapshots/ — all baseline screenshot directories
Related