
Add evaluate-pr-tests skill for test quality evaluation#34329

Merged
PureWeen merged 11 commits into main from feature/evaluate-pr-tests-skill on Mar 18, 2026

Conversation

@jfversluis (Member) commented Mar 4, 2026

Description

Adds a new Copilot skill (evaluate-pr-tests) that evaluates the quality, coverage, and appropriateness of tests added in PRs.

Why This Is Useful

Problem: When reviewing PRs, evaluating test quality is one of the most time-consuming and inconsistent parts of the process. Common issues slip through:

  • Tests that only verify a page loads but never exercise the actual fix
  • UI tests for behavior that could be caught by a unit test (which runs in seconds vs minutes)
  • Missing edge cases (null inputs, boundary values, repeated actions) that lead to regressions later
  • Convention violations (wrong base class, missing attributes, obsolete APIs) that cause CI failures
  • Flaky patterns (arbitrary delays, missing WaitForElement) that waste CI time

These issues are tedious to catch manually and easy to miss — especially across 4 platforms, 4 test types (unit, XAML, device, UI), and hundreds of existing test patterns.

Solution: This skill automates the mechanical checks and guides the agent through deeper analysis:

  1. Automated (script): File categorization, convention compliance, anti-pattern detection, AutomationId consistency, existing similar test search, platform scope analysis
  2. Agent-guided (SKILL.md): Fix coverage analysis, edge case gap identification, test type recommendation (prefer unit > XAML > device > UI), assertion quality review
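The automated layer's anti-pattern detection can be sketched roughly as follows. This is illustrative Python, not the actual PowerShell in Gather-TestContext.ps1, and the pattern list is an assumption for demonstration:

```python
import re

# Hypothetical anti-pattern list for illustration only; the real checks live
# in Gather-TestContext.ps1 (PowerShell) and are more nuanced.
ANTI_PATTERNS = {
    "arbitrary delay": re.compile(r"Thread\.Sleep\s*\(|Task\.Delay\s*\("),
    "obsolete API": re.compile(r"\bDevice\.BeginInvokeOnMainThread\b"),
}

def scan_test_file(text: str) -> list[tuple[int, str]]:
    """Return (line_number, issue) pairs for every anti-pattern hit."""
    findings = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for name, pattern in ANTI_PATTERNS.items():
            if pattern.search(line):
                findings.append((lineno, name))
    return findings

sample = 'App.WaitForElement("Btn");\nApp.Tap("Btn");\nThread.Sleep(2000);\n'
print(scan_test_file(sample))  # [(3, 'arbitrary delay')]
```

The same line-scan shape extends naturally to the other mechanical checks (convention compliance, AutomationId consistency), leaving judgment calls to the agent-guided layer.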

Real-world example: Running against PR #34324 (Android Shell TabBar fix), the skill:

  • ✅ Correctly identified 1 fix file + 1 UI test + 1 HostApp page + 1 snapshot
  • ⚠️ Flagged an inline #if platform directive (real convention violation)
  • ℹ️ Noted 3 HostApp AutomationIds not exercised by the test (informational)
  • ✅ Found existing Shell tests for cross-reference
  • ✅ Identified this as Android-only based on fix file paths
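The Android-only determination above comes from platform-scope inference over the fix's file paths. A minimal sketch, assuming a suffix-to-platform map based on common .NET MAUI file naming conventions (the real logic is PowerShell in Gather-TestContext.ps1):

```python
# Illustrative Python sketch; the real implementation is PowerShell.
# The suffix map is an assumption based on .NET MAUI file naming conventions,
# including the convention that .ios.cs sources also build for MacCatalyst.
PLATFORM_SUFFIXES = {
    ".android.cs": {"Android"},
    ".ios.cs": {"iOS", "MacCatalyst"},
    ".windows.cs": {"Windows"},
    ".maccatalyst.cs": {"MacCatalyst"},
}
ALL_PLATFORMS = {"Android", "iOS", "Windows", "MacCatalyst"}

def platform_scope(fix_files: list[str]) -> set[str]:
    """Platforms implied by the fix's file paths; a file with no
    platform suffix is treated as cross-platform (all platforms)."""
    scoped: set[str] = set()
    for path in fix_files:
        for suffix, platforms in PLATFORM_SUFFIXES.items():
            if path.lower().endswith(suffix):
                scoped |= platforms
                break
        else:
            return ALL_PLATFORMS  # shared file -> fix affects every platform
    return scoped

print(platform_scope(["src/Core/Shell/ShellToolbarTracker.Android.cs"]))
# {'Android'}
```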

Quality bar: This skill went through 3 rounds of multi-model review (10 total reviews across Opus, GPT-5.1, Gemini, Sonnet 4.5) to catch regex bugs, false positive/negative edge cases, and documentation mismatches.

Evaluation Criteria (9 checks)

| # | Criterion | What It Checks |
|---|-----------|----------------|
| 1 | Fix Coverage | Does the test exercise the actual code paths changed by the fix? |
| 2 | Edge Cases & Gaps | Boundary conditions, null handling, async/timing, repeated actions |
| 3 | Test Type Appropriateness | Could a lighter test type work? (unit > XAML > device > UI) |
| 4 | Convention Compliance | Naming, attributes, base classes, obsolete APIs |
| 5 | Flakiness Risk | Delays, missing WaitForElement, screenshot hygiene |
| 6 | Duplicate Coverage | Does a similar test already exist? |
| 7 | Platform Scope | Does test coverage match platforms affected by the fix? |
| 8 | Assertion Quality | Are assertions specific enough to catch regressions? |
| 9 | Fix-Test Alignment | Do test and fix target the same code paths? |

Components

  • SKILL.md — Evaluation criteria with decision trees, examples, and structured report template
  • Gather-TestContext.ps1 — Automated convention checks, anti-pattern detection, AutomationId consistency, platform scope analysis
  • copilot-instructions.md — Skill registration with trigger phrases

Usage

# Auto-detect PR context
pwsh .github/skills/evaluate-pr-tests/scripts/Gather-TestContext.ps1

# Or ask the agent
# "evaluate the tests in this PR"
# "are these tests good enough?"
# "review test quality for PR #XXXXX"

Copilot AI review requested due to automatic review settings March 4, 2026 13:43
github-actions bot (Contributor) commented Mar 4, 2026

🚀 Dogfood this PR with:

⚠️ WARNING: Do not do this without first carefully reviewing the code of this PR to satisfy yourself it is safe.

curl -fsSL https://raw.githubusercontent.com/dotnet/maui/main/eng/scripts/get-maui-pr.sh | bash -s -- 34329

Or

  • Run remotely in PowerShell:
iex "& { $(irm https://raw.githubusercontent.com/dotnet/maui/main/eng/scripts/get-maui-pr.ps1) } 34329"

Copilot AI (Contributor) left a comment

Pull request overview

Adds a new Copilot skill (evaluate-pr-tests) to help reviewers assess whether PR-added tests appropriately cover fixes and follow testing conventions.

Changes:

  • Added evaluate-pr-tests skill documentation and a structured evaluation rubric/output template.
  • Introduced Gather-TestContext.ps1 to auto-collect PR test context (file categorization, convention checks, anti-pattern detection, duplicates, platform scope).
  • Registered the new skill in .github/copilot-instructions.md.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 7 comments.

| File | Description |
|------|-------------|
| .github/skills/evaluate-pr-tests/scripts/Gather-TestContext.ps1 | New script to gather automated PR test context and generate a markdown report. |
| .github/skills/evaluate-pr-tests/SKILL.md | New skill spec explaining evaluation criteria and expected output format. |
| .github/copilot-instructions.md | Adds the new skill to the documented list of available skills. |

@jfversluis jfversluis added the area-ai-agents Copilot CLI agents, agent skills, AI-assisted development label Mar 5, 2026
@PureWeen (Member) commented:

/rebase

@github-actions github-actions bot force-pushed the feature/evaluate-pr-tests-skill branch from 48f4af1 to dfcd7b6 on March 13, 2026 01:20
@PureWeen (Member) commented:

Skill Evaluation Report: evaluate-pr-tests

Methodology

This skill was evaluated using a 3-validator approach:

  1. skill-validator A/B testing (dotnet/skills): Ran 6 eval scenarios comparing baseline agent vs. skill-equipped agent, with pairwise LLM judging
  2. Claude Code CLI live trigger testing: Tested 3 prompts (positive, negative, near-miss) via claude -p to verify trigger precision/recall
  3. Prompt engineering analysis: Manual review of SKILL.md structure, trigger description, instruction clarity, and scope

An eval.yaml with 8 test scenarios was generated and should be committed with this PR.


Empirical Results

skill-validator A/B (6 scenarios, 1 run)

| Scenario | Baseline | With Skill | Δ |
|----------|----------|------------|---|
| Happy path (structured report) | 2/5 | 5/5 | +3.0 |
| Anti-pattern detection | 3/5 | 5/5 | +2.0 |
| Test type downgrade | 5/5 | 5/5 | 0.0 (strong baseline) |
| Weak assertion detection | 5/5 | 5/5 | 0.0 |
| Edge case gaps | 5/5 | 5/5 | 0.0 |
| Negative trigger | 3/5 | 3/5 | 0.0 |

Key finding: Skill's primary value is driving the structured 4-step workflow (Gather → Understand → Evaluate → Report) and anti-pattern detection.

Claude Code CLI Trigger Tests (Verified)

| Test | Prompt | Result |
|------|--------|--------|
| ✅ Positive | "Evaluate the tests added in PR #34329 for quality and coverage" | Skill triggered -- referenced Gather-TestContext.ps1 |
| ✅ Negative | "Do a general code review of the changes in the latest commit" | Skill correctly did NOT trigger |
| ⚠️ Near-miss | "Look at the test files in PR #34329 and tell me if they seem reasonable" | Skill did NOT trigger -- recall gap |

Note: Claude CLI tests ran via a Python subprocess; they were short-circuited by tool-permission restrictions in non-interactive mode, but skill recognition was still verifiable.


Verdict

| Metric | Score |
|--------|-------|
| Trigger Precision | 9/10 |
| Trigger Recall | 7/10 |
| Dotnet Validator | IMPROVE (6/10) |
| Anthropic Validator | KEEP (8/10) |
| Consensus | KEEP with improvements (7/10) |

Recommended Improvements

P0: Expand trigger description with synonym phrases

The near-miss recall gap is confirmed by both empirical A/B testing and live CLI tests. Add trigger synonyms:

description: "Evaluates tests added in a PR for coverage, quality, edge cases, and test type appropriateness. Checks if tests cover the fix, finds gaps, and recommends lighter test types when possible. Prefer unit tests over device tests over UI tests. Triggers on: 'evaluate tests in PR', 'review test quality', 'are these tests good enough', 'check test coverage', 'is this test adequate', 'assess test coverage for PR'."

P1: Add "no tests in PR" handling

SKILL.md currently says "✅ PR has tests and you want to evaluate their quality" but gives no instruction for PRs with zero test files. Add:

- ⚠️ PR has no test files -- output a ❌ Fix Coverage verdict noting no tests were added; skip remaining criteria

P1: Strengthen Criterion 9 (Fix-Test Alignment)

Currently under-specified compared to criteria 1-8. Add red flags and examples:

**Red flags:**
- Test class is named `Issue12345` for a fix in `CollectionView` but only exercises `Label` rendering
- Fix changes `Shell.cs` but test only navigates a `ContentPage`

P2: Fix Gather-TestContext.ps1 bugs (5 issues)
  1. [Category] over-counting: Counts all attribute instances instead of unique values
  2. Fluent WaitForElement().Tap() not detected: Wait-check misses chained calls
  3. Screenshot-delay regex false positives: 200-char window can span method boundaries
  4. .ios.cs files not counted for MacCatalyst: Should count for both iOS and MacCatalyst in platform scope
  5. Missing UITestEntry/UITestEditor check: Criterion 5 mentions cursor blink risk but script does not detect Entry/Editor without UITestEntry in screenshot tests
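Bug 1 in the list above (the [Category] over-counting) comes down to deduplicating by attribute value before counting. A sketch of the fixed behavior, in Python rather than the script's actual PowerShell (which uses Select-Object -Unique):

```python
import re

# Capture the full attribute argument so distinct values such as
# UITestCategories.Button vs UITestCategories.Shell stay distinct.
CATEGORY_RE = re.compile(r"\[Category\(([^)]*)\)\]")

def distinct_categories(source: str) -> list[str]:
    """Unique [Category(...)] values, preserving first-seen order."""
    return list(dict.fromkeys(CATEGORY_RE.findall(source)))

src = """
[Category(UITestCategories.Button)]
[Category(UITestCategories.Button)]
[Category(UITestCategories.Shell)]
"""
# Raw occurrence count (3) over-reports; distinct values (2) is the right signal.
print(distinct_categories(src))
# ['UITestCategories.Button', 'UITestCategories.Shell']
```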

P2: Replace self-referential happy-path eval scenario

The current eval.yaml happy path evaluates PR #34329 itself (this PR), producing atypical "N/A -- tooling PR" results. Replace with a prompt targeting a real PR with tests.


eval.yaml

An eval.yaml with 8 scenarios was generated during this evaluation and is ready to commit:

  • 6 original: happy path, negative trigger, anti-pattern, test-type downgrade, weak assertions, edge case gaps
  • 2 added: near-miss recall (empirically confirmed gap), no-tests-in-PR

Evaluated using dotnet/skills skill-validator A/B testing + Claude Code CLI live trigger testing

jfversluis and others added 8 commits March 17, 2026 12:12
New skill that evaluates tests added in PRs across 9 criteria:
- Fix coverage: does the test exercise the changed code paths?
- Edge cases & gaps: boundary conditions, null handling, async
- Test type appropriateness: unit > XAML > device > UI preference
- Convention compliance: naming, attributes, base classes
- Flakiness risk: delays, missing waits, screenshot hygiene
- Duplicate coverage: existing similar tests
- Platform scope: test vs fix platform alignment
- Assertion quality: specific vs vague assertions
- Fix-test alignment: test targets same code as fix

Includes a PowerShell script (Gather-TestContext.ps1) that automates
mechanical checks (convention compliance, anti-pattern detection,
AutomationId consistency, file categorization) while the SKILL.md
guides the agent through deeper analysis.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Update the convention check to flag both missing and multiple
[Category] attributes. The rule is: exactly ONE, either on the
class or on a method, but not both.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Fix .xaml.cs matching: strip .xaml extension for HostApp file matching
- Relax file naming: only flag files in Issues/ that start with Issue
- AutomationId: bidirectional check, reverse direction is informational
- Wait check: per-interaction instead of file-level
- Unit test helpers: only flag missing test methods if filename contains Test
- Fix duplicate numbering in copilot-instructions.md (8,8,9,9 → 8,9,10,10)
- Skip [Issue()] check on .xaml files (attribute lives in code-behind)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- AutomationId forward check: skip text selectors (only check PascalCase IDs)
- Unit test helper: tighten to require Test/Tests at end of filename
  (avoids false positives on TestExtensions, TestHelpers, etc.)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- WaitForElement regex: remove trailing \) to match timeout overloads
- UnitTests path: broaden to match test/UnitTests/ (no dot prefix)
- Numbering: try-fix → 11 (was duplicate 10)
- SKILL.md: remove 'Descriptive method names' from automated checks

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Split git diff output on \r?\n and trim entries to handle Windows CRLF
line endings that could break downstream file path matching.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Generated during skill evaluation session with skill-validator A/B testing
and Claude Code CLI trigger testing. Scenarios cover:
- Happy path (structured report generation)
- Negative trigger (general code review should not activate)
- Anti-pattern detection (Thread.Sleep, obsolete APIs)
- Test type downgrade recommendations
- Weak assertion detection
- Edge case gap analysis
- Near-miss recall (informal phrasing)
- No-tests-in-PR handling

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…g, script fixes

- P0: Expand trigger description with synonym phrases for better recall
- P1: Add no-tests-in-PR handling guidance (output ❌ Fix Coverage, skip rest)
- P1: Strengthen Criterion 9 with red flags and concrete examples
- P2: Fix [Category] over-counting by deduplicating attribute matches
- P2: Add fluent WaitForElement().Tap() chain detection
- P2: Tighten screenshot-delay regex to method scope (avoid cross-method FPs)
- P2: Count .ios.cs files for MacCatalyst in platform scope analysis
- P2: Extend UITestEntry/UITestEditor check to cover XAML <Entry>/<Editor>
- P2: Replace self-referential eval.yaml happy-path scenario (was PR #34329)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@PureWeen PureWeen force-pushed the feature/evaluate-pr-tests-skill branch from c2c5c22 to 3e57047 on March 17, 2026 17:17
github-actions bot and others added 3 commits March 17, 2026 13:05
…numbering

- Fix [Category] dedup regex to capture full attribute text (not just literal prefix)
  so distinct values like UITestCategories.Button vs UITestCategories.Shell remain
  unique after Select-Object -Unique, enabling the Count -gt 1 check to fire
- Remove dead $fluentInteractionCount variable and unused regex (behavior correct by
  accident; dead code misleads maintainers)
- Remove over-greedy method-boundary regex that caused false negatives in screenshot
  delay detection; 20-line window is already sufficient scope
- Fix skill numbering: renumber try-fix from 11 to 10

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…10 total)

New scenarios cover skill fixes introduced in latest PR commits:
- Fix-test alignment: detects when test exercises wrong control vs fix files
- Fluent chain: verifies App.WaitForElement().Tap() is not a false-positive violation

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@PureWeen PureWeen merged commit c1b4a01 into main Mar 18, 2026
4 of 5 checks passed
@PureWeen PureWeen deleted the feature/evaluate-pr-tests-skill branch March 18, 2026 15:48
KarthikRajaKalaimani pushed a commit to KarthikRajaKalaimani/maui that referenced this pull request Mar 30, 2026
- Fixes dotnet#34139

---------

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: Jakub Florkowski <42434498+kubaflo@users.noreply.github.com>

Labels

area-ai-agents Copilot CLI agents, agent skills, AI-assisted development

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[AGR] Add better test analysis to detect when Tests should be unittests for UITests

4 participants