Add evaluate-pr-tests skill for test quality evaluation #34329
Conversation
🚀 Dogfood this PR with:

```shell
curl -fsSL https://raw.githubusercontent.com/dotnet/maui/main/eng/scripts/get-maui-pr.sh | bash -s -- 34329
```

Or:

```shell
iex "& { $(irm https://raw.githubusercontent.com/dotnet/maui/main/eng/scripts/get-maui-pr.ps1) } 34329"
```
Pull request overview
Adds a new Copilot skill (evaluate-pr-tests) to help reviewers assess whether PR-added tests appropriately cover fixes and follow testing conventions.
Changes:
- Added `evaluate-pr-tests` skill documentation and a structured evaluation rubric/output template.
- Introduced `Gather-TestContext.ps1` to auto-collect PR test context (file categorization, convention checks, anti-pattern detection, duplicates, platform scope).
- Registered the new skill in `.github/copilot-instructions.md`.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 7 comments.
| File | Description |
|---|---|
| .github/skills/evaluate-pr-tests/scripts/Gather-TestContext.ps1 | New script to gather automated PR test context and generate a markdown report. |
| .github/skills/evaluate-pr-tests/SKILL.md | New skill spec explaining evaluation criteria and expected output format. |
| .github/copilot-instructions.md | Adds the new skill to the documented list of available skills. |
/rebase

Force-pushed from 48f4af1 to dfcd7b6.
Skill Evaluation Report:
| Scenario | Baseline | With Skill | Δ |
|---|---|---|---|
| Happy path (structured report) | 2/5 | 5/5 | +3.0 ✅ |
| Anti-pattern detection | 3/5 | 5/5 | +2.0 ✅ |
| Test type downgrade | 5/5 | 5/5 | 0.0 (strong baseline) |
| Weak assertion detection | 5/5 | 5/5 | 0.0 |
| Edge case gaps | 5/5 | 5/5 | 0.0 |
| Negative trigger | 3/5 | 3/5 | 0.0 |
Key finding: The skill's primary value is driving the structured 4-step workflow (Gather → Understand → Evaluate → Report) and anti-pattern detection.
Claude Code CLI Trigger Tests (Verified)
| Test | Prompt | Result |
|---|---|---|
| ✅ Positive | "Evaluate the tests added in PR #34329 for quality and coverage" | Skill triggered -- referenced Gather-TestContext.ps1 |
| ✅ Negative | "Do a general code review of the changes in the latest commit" | Skill correctly did NOT trigger |
| ⚠️ Near-miss | "Look at the test files in PR #34329 and tell me if they seem reasonable" | Skill did NOT trigger -- recall gap |
Note: Claude CLI tests ran via Python subprocess; tests were short-circuited by tool-permission restrictions in non-interactive mode but skill recognition was verifiable.
Verdict
| Metric | Score |
|---|---|
| Trigger Precision | 9/10 |
| Trigger Recall | 7/10 |
| Dotnet Validator | IMPROVE (6/10) |
| Anthropic Validator | KEEP (8/10) |
| Consensus | KEEP with improvements (7/10) |
Recommended Improvements
P0: Expand trigger description with synonym phrases
The near-miss recall gap is confirmed by both empirical A/B testing and live CLI tests. Add trigger synonyms:
```yaml
description: "Evaluates tests added in a PR for coverage, quality, edge cases, and test type appropriateness. Checks if tests cover the fix, finds gaps, and recommends lighter test types when possible. Prefer unit tests over device tests over UI tests. Triggers on: 'evaluate tests in PR', 'review test quality', 'are these tests good enough', 'check test coverage', 'is this test adequate', 'assess test coverage for PR'."
```

P1: Add "no tests in PR" handling
SKILL.md currently says "✅ PR has tests and you want to evaluate their quality" but gives no instruction for PRs with zero test files. Add:
- ⚠️ PR has no test files -- output a ❌ Fix Coverage verdict noting no tests were added; skip remaining criteria

P1: Strengthen Criterion 9 (Fix-Test Alignment)
Currently under-specified compared to criteria 1-8. Add red flags and examples:
**Red flags:**
- Test class is named `Issue12345` for a fix in `CollectionView` but only exercises `Label` rendering
- Fix changes `Shell.cs` but test only navigates a `ContentPage`

P2: Fix Gather-TestContext.ps1 bugs (5 issues)
- `[Category]` over-counting: counts all attribute instances instead of unique values
- Fluent `WaitForElement().Tap()` not detected: wait-check misses chained calls
- Screenshot-delay regex false positives: 200-char window can span method boundaries
- `.ios.cs` files not counted for MacCatalyst: should count for both iOS and MacCatalyst in platform scope
- Missing `UITestEntry`/`UITestEditor` check: Criterion 5 mentions cursor blink risk but the script does not detect `Entry`/`Editor` without `UITestEntry` in screenshot tests
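The fluent-chain gap in the list above can be illustrated with a small sketch. Python regex stands in for the PowerShell script's logic here; the patterns and the `App.WaitForElement` call shape are assumptions for illustration, not the script's actual regexes:

```python
import re

# A wait-check that only recognizes standalone WaitForElement calls
# (terminated by a semicolon) misses fluent chains like
# App.WaitForElement("Btn").Tap().
naive = re.compile(r'App\.WaitForElement\([^)]*\);')

# Allowing trailing chained calls closes the gap.
chained = re.compile(r'App\.WaitForElement\([^)]*\)(\.\w+\([^)]*\))*')

line = 'App.WaitForElement("SubmitButton").Tap();'
print(bool(naive.search(line)))    # False: the chain breaks the naive pattern
print(bool(chained.search(line)))  # True: chained calls are recognized
```

A standalone call such as `App.WaitForElement("SubmitButton");` matches both patterns, so relaxing the regex only adds the chained form without losing existing detections.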
P2: Replace self-referential happy-path eval scenario
The current eval.yaml happy path evaluates PR #34329 itself (this PR), producing atypical "N/A -- tooling PR" results. Replace with a prompt targeting a real PR with tests.
eval.yaml
An eval.yaml with 8 scenarios was generated during this evaluation and is ready to commit:
- 6 original: happy path, negative trigger, anti-pattern, test-type downgrade, weak assertions, edge case gaps
- 2 added: near-miss recall (empirically confirmed gap), no-tests-in-PR
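For illustration, a scenario entry in such an eval.yaml might look like the following. The field names and overall shape are assumptions, not the generated file's actual schema, and `#NNNNN` is a placeholder:

```yaml
scenarios:
  - name: no-tests-in-pr
    prompt: "Evaluate the tests added in PR #NNNNN"
    expect:
      skill_triggered: true
      report_contains: "❌ Fix Coverage"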
Evaluated using dotnet/skills skill-validator A/B testing + Claude Code CLI live trigger testing
New skill that evaluates tests added in PRs across 9 criteria:

- Fix coverage: does the test exercise the changed code paths?
- Edge cases & gaps: boundary conditions, null handling, async
- Test type appropriateness: unit > XAML > device > UI preference
- Convention compliance: naming, attributes, base classes
- Flakiness risk: delays, missing waits, screenshot hygiene
- Duplicate coverage: existing similar tests
- Platform scope: test vs fix platform alignment
- Assertion quality: specific vs vague assertions
- Fix-test alignment: test targets same code as fix

Includes a PowerShell script (Gather-TestContext.ps1) that automates mechanical checks (convention compliance, anti-pattern detection, AutomationId consistency, file categorization) while the SKILL.md guides the agent through deeper analysis.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Update the convention check to flag both missing and multiple [Category] attributes. The rule is: exactly ONE, either on the class or on a method, but not both.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
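The rule stated in that commit (exactly one `[Category]`, missing and multiple both flagged) can be sketched as a simple count over attribute matches. This is an illustrative Python stand-in, not the actual PowerShell implementation in Gather-TestContext.ps1:

```python
import re

def category_violation(source):
    """Return a violation message, or None if exactly one [Category] attribute is present."""
    values = re.findall(r'\[Category\(([^)]*)\)\]', source)
    if len(values) == 0:
        return "missing [Category] attribute"
    if len(values) > 1:
        return f"multiple [Category] attributes ({len(values)})"
    return None  # exactly one: compliant

ok = '[Category(UITestCategories.Button)]\npublic class Issue1 { }'
print(category_violation(ok))  # None
print(category_violation(''))  # missing [Category] attribute
```

Counting captured values (rather than raw attribute occurrences) is also what makes the later dedup fix possible: distinct values like `UITestCategories.Button` vs `UITestCategories.Shell` stay distinguishable.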
- Fix .xaml.cs matching: strip .xaml extension for HostApp file matching
- Relax file naming: only flag files in Issues/ that start with Issue
- AutomationId: bidirectional check, reverse direction is informational
- Wait check: per-interaction instead of file-level
- Unit test helpers: only flag missing test methods if filename contains Test
- Fix duplicate numbering in copilot-instructions.md (8,8,9,9 → 8,9,10,10)
- Skip [Issue()] check on .xaml files (attribute lives in code-behind)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- AutomationId forward check: skip text selectors (only check PascalCase IDs)
- Unit test helper: tighten to require Test/Tests at end of filename (avoids false positives on TestExtensions, TestHelpers, etc.)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- WaitForElement regex: remove trailing \) to match timeout overloads
- UnitTests path: broaden to match test/UnitTests/ (no dot prefix)
- Numbering: try-fix → 11 (was duplicate 10)
- SKILL.md: remove 'Descriptive method names' from automated checks

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
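The trailing-`\)` fix above can be illustrated with a Python stand-in; the exact patterns are assumptions, but they show why a closing-paren anchor misses multi-argument overloads:

```python
import re

# Requiring a closing paren right after the selector only matches
# the single-argument form of WaitForElement...
strict = re.compile(r'WaitForElement\("[^"]*"\)')

# ...and misses timeout overloads, which the relaxed pattern catches.
relaxed = re.compile(r'WaitForElement\("[^"]*"')

overload = 'App.WaitForElement("Btn", timeout: TimeSpan.FromSeconds(5));'
print(bool(strict.search(overload)))   # False: extra arguments break the match
print(bool(relaxed.search(overload)))  # True: overload is still detected
```

The simple form `App.WaitForElement("Btn");` matches both, so dropping the trailing `\)` widens detection without losing the original case.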
Split git diff output on \r?\n and trim entries to handle Windows CRLF line endings that could break downstream file path matching.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
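The CRLF fix above amounts to splitting on `\r?\n` and trimming each entry. A minimal Python sketch of the idea (the actual fix is in the PowerShell script; the sample paths are made up):

```python
import re

# git diff --name-only output on Windows can arrive with CRLF endings.
diff_output = "src/Core/Shell.cs\r\ntest/UITests/Issue34324.cs\r\n"

# Splitting on \n alone would leave a trailing \r on each path, so later
# comparisons against clean file paths would silently fail to match.
paths = [p.strip() for p in re.split(r'\r?\n', diff_output) if p.strip()]
print(paths)  # ['src/Core/Shell.cs', 'test/UITests/Issue34324.cs']
```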
Generated during skill evaluation session with skill-validator A/B testing and Claude Code CLI trigger testing. Scenarios cover:

- Happy path (structured report generation)
- Negative trigger (general code review should not activate)
- Anti-pattern detection (Thread.Sleep, obsolete APIs)
- Test type downgrade recommendations
- Weak assertion detection
- Edge case gap analysis
- Near-miss recall (informal phrasing)
- No-tests-in-PR handling

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…g, script fixes

- P0: Expand trigger description with synonym phrases for better recall
- P1: Add no-tests-in-PR handling guidance (output ❌ Fix Coverage, skip rest)
- P1: Strengthen Criterion 9 with red flags and concrete examples
- P2: Fix [Category] over-counting by deduplicating attribute matches
- P2: Add fluent WaitForElement().Tap() chain detection
- P2: Tighten screenshot-delay regex to method scope (avoid cross-method FPs)
- P2: Count .ios.cs files for MacCatalyst in platform scope analysis
- P2: Extend UITestEntry/UITestEditor check to cover XAML <Entry>/<Editor>
- P2: Replace self-referential eval.yaml happy-path scenario (was PR #34329)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Force-pushed from c2c5c22 to 3e57047.
…numbering

- Fix [Category] dedup regex to capture full attribute text (not just literal prefix) so distinct values like UITestCategories.Button vs UITestCategories.Shell remain unique after Select-Object -Unique, enabling the Count -gt 1 check to fire
- Remove dead $fluentInteractionCount variable and unused regex (behavior correct by accident; dead code misleads maintainers)
- Remove over-greedy method-boundary regex that caused false negatives in screenshot delay detection; 20-line window is already sufficient scope
- Fix skill numbering: renumber try-fix from 11 to 10

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…10 total)

New scenarios cover skill fixes introduced in latest PR commits:

- Fix-test alignment: detects when test exercises wrong control vs fix files
- Fluent chain: verifies App.WaitForElement().Tap() is not a false-positive violation

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
## Description

Adds a new Copilot skill (`evaluate-pr-tests`) that evaluates the quality, coverage, and appropriateness of tests added in PRs.

## Why This Is Useful

**Problem:** When reviewing PRs, evaluating test quality is one of the most time-consuming and inconsistent parts of the process. Common issues slip through:

- Tests that only verify a page loads but never exercise the actual fix
- UI tests for behavior that could be caught by a unit test (which runs in seconds vs minutes)
- Missing edge cases (null inputs, boundary values, repeated actions) that lead to regressions later
- Convention violations (wrong base class, missing attributes, obsolete APIs) that cause CI failures
- Flaky patterns (arbitrary delays, missing `WaitForElement`) that waste CI time

These issues are tedious to catch manually and easy to miss — especially across 4 platforms, 4 test types (unit, XAML, device, UI), and hundreds of existing test patterns.

**Solution:** This skill automates the mechanical checks and guides the agent through deeper analysis:

1. **Automated (script):** File categorization, convention compliance, anti-pattern detection, AutomationId consistency, existing similar test search, platform scope analysis
2. **Agent-guided (SKILL.md):** Fix coverage analysis, edge case gap identification, test type recommendation (prefer unit > XAML > device > UI), assertion quality review

**Real-world example:** Running against PR dotnet#34324 (Android Shell TabBar fix), the skill:

- ✅ Correctly identified 1 fix file + 1 UI test + 1 HostApp page + 1 snapshot
- ⚠️ Flagged an inline `#if` platform directive (real convention violation)
- ℹ️ Noted 3 HostApp AutomationIds not exercised by the test (informational)
- ✅ Found existing Shell tests for cross-reference
- ✅ Identified this as Android-only based on fix file paths

**Quality bar:** This skill went through 3 rounds of multi-model review (10 total reviews across Opus, GPT-5.1, Gemini, Sonnet 4.5) to catch regex bugs, false positive/negative edge cases, and documentation mismatches.

## Evaluation Criteria (9 checks)

| # | Criterion | What It Checks |
|---|-----------|----------------|
| 1 | **Fix Coverage** | Does the test exercise the actual code paths changed by the fix? |
| 2 | **Edge Cases & Gaps** | Boundary conditions, null handling, async/timing, repeated actions |
| 3 | **Test Type Appropriateness** | Could a lighter test type work? (unit > XAML > device > UI) |
| 4 | **Convention Compliance** | Naming, attributes, base classes, obsolete APIs |
| 5 | **Flakiness Risk** | Delays, missing WaitForElement, screenshot hygiene |
| 6 | **Duplicate Coverage** | Does a similar test already exist? |
| 7 | **Platform Scope** | Does test coverage match platforms affected by the fix? |
| 8 | **Assertion Quality** | Are assertions specific enough to catch regressions? |
| 9 | **Fix-Test Alignment** | Do test and fix target the same code paths? |

## Components

- **`SKILL.md`** — Evaluation criteria with decision trees, examples, and structured report template
- **`Gather-TestContext.ps1`** — Automated convention checks, anti-pattern detection, AutomationId consistency, platform scope analysis
- **`copilot-instructions.md`** — Skill registration with trigger phrases

## Usage

```bash
# Auto-detect PR context
pwsh .github/skills/evaluate-pr-tests/scripts/Gather-TestContext.ps1

# Or ask the agent
# "evaluate the tests in this PR"
# "are these tests good enough?"
# "review test quality for PR #XXXXX"
```

- Fixes dotnet#34139

---------

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: Jakub Florkowski <42434498+kubaflo@users.noreply.github.com>