Add evaluate-pr-tests skill for test quality evaluation #34329
Conversation
🚀 Dogfood this PR with:

```shell
curl -fsSL https://raw.githubusercontent.com/dotnet/maui/main/eng/scripts/get-maui-pr.sh | bash -s -- 34329
```

Or:

```shell
iex "& { $(irm https://raw.githubusercontent.com/dotnet/maui/main/eng/scripts/get-maui-pr.ps1) } 34329"
```
Pull request overview
Adds a new Copilot skill (evaluate-pr-tests) to help reviewers assess whether PR-added tests appropriately cover fixes and follow testing conventions.
Changes:
- Added `evaluate-pr-tests` skill documentation and a structured evaluation rubric/output template.
- Introduced `Gather-TestContext.ps1` to auto-collect PR test context (file categorization, convention checks, anti-pattern detection, duplicates, platform scope).
- Registered the new skill in `.github/copilot-instructions.md`.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 7 comments.
| File | Description |
|---|---|
| .github/skills/evaluate-pr-tests/scripts/Gather-TestContext.ps1 | New script to gather automated PR test context and generate a markdown report. |
| .github/skills/evaluate-pr-tests/SKILL.md | New skill spec explaining evaluation criteria and expected output format. |
| .github/copilot-instructions.md | Adds the new skill to the documented list of available skills. |
/rebase

Force-pushed from 48f4af1 to dfcd7b6.
Skill Evaluation Report:
| Scenario | Baseline | With Skill | Δ |
|---|---|---|---|
| Happy path (structured report) | 2/5 | 5/5 | +3.0 ✅ |
| Anti-pattern detection | 3/5 | 5/5 | +2.0 ✅ |
| Test type downgrade | 5/5 | 5/5 | 0.0 (strong baseline) |
| Weak assertion detection | 5/5 | 5/5 | 0.0 |
| Edge case gaps | 5/5 | 5/5 | 0.0 |
| Negative trigger | 3/5 | 3/5 | 0.0 |
Key finding: The skill's primary value is driving the structured 4-step workflow (Gather → Understand → Evaluate → Report) and anti-pattern detection.
Claude Code CLI Trigger Tests (Verified)
| Test | Prompt | Result |
|---|---|---|
| ✅ Positive | "Evaluate the tests added in PR #34329 for quality and coverage" | Skill triggered -- referenced Gather-TestContext.ps1 |
| ✅ Negative | "Do a general code review of the changes in the latest commit" | Skill correctly did NOT trigger |
| ⚠️ Near-miss | "Look at the test files in PR #34329 and tell me if they seem reasonable" | Skill did NOT trigger -- recall gap |
Note: Claude CLI tests ran via Python subprocess; tests were short-circuited by tool-permission restrictions in non-interactive mode but skill recognition was verifiable.
Verdict
| Metric | Score |
|---|---|
| Trigger Precision | 9/10 |
| Trigger Recall | 7/10 |
| Dotnet Validator | IMPROVE (6/10) |
| Anthropic Validator | KEEP (8/10) |
| Consensus | KEEP with improvements (7/10) |
Recommended Improvements
P0: Expand trigger description with synonym phrases
The near-miss recall gap is confirmed by both empirical A/B testing and live CLI tests. Add trigger synonyms:
```yaml
description: "Evaluates tests added in a PR for coverage, quality, edge cases, and test type appropriateness. Checks if tests cover the fix, finds gaps, and recommends lighter test types when possible. Prefer unit tests over device tests over UI tests. Triggers on: 'evaluate tests in PR', 'review test quality', 'are these tests good enough', 'check test coverage', 'is this test adequate', 'assess test coverage for PR'."
```

P1: Add "no tests in PR" handling
SKILL.md currently says "✅ PR has tests and you want to evaluate their quality" but gives no instruction for PRs with zero test files. Add:
- ⚠️ PR has no test files -- output a ❌ Fix Coverage verdict noting no tests were added; skip remaining criteria

P1: Strengthen Criterion 9 (Fix-Test Alignment)
Currently under-specified compared to criteria 1-8. Add red flags and examples:
**Red flags:**
- Test class is named `Issue12345` for a fix in `CollectionView` but only exercises `Label` rendering
- Fix changes `Shell.cs` but test only navigates a `ContentPage`

P2: Fix Gather-TestContext.ps1 bugs (5 issues)
- `[Category]` over-counting: counts all attribute instances instead of unique values
- Fluent `WaitForElement().Tap()` not detected: wait-check misses chained calls
- Screenshot-delay regex false positives: 200-char window can span method boundaries
- `.ios.cs` files not counted for MacCatalyst: should count for both iOS and MacCatalyst in platform scope
- Missing `UITestEntry`/`UITestEditor` check: Criterion 5 mentions cursor blink risk but the script does not detect `Entry`/`Editor` without `UITestEntry` in screenshot tests
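The fluent-chain gap in the list above can be illustrated with a small sketch. Python regex stands in for the PowerShell script's logic here; the patterns and the `App.WaitForElement` call shape are assumptions for illustration, not the script's actual regexes:

```python
import re

# A wait-check that only recognizes standalone WaitForElement calls
# (terminated by a semicolon) misses fluent chains like
# App.WaitForElement("Btn").Tap().
naive = re.compile(r'App\.WaitForElement\([^)]*\);')

# Allowing trailing chained calls closes the gap.
chained = re.compile(r'App\.WaitForElement\([^)]*\)(\.\w+\([^)]*\))*')

line = 'App.WaitForElement("SubmitButton").Tap();'
print(bool(naive.search(line)))    # False: the chain breaks the naive pattern
print(bool(chained.search(line)))  # True: chained calls are recognized
```

A standalone call such as `App.WaitForElement("SubmitButton");` matches both patterns, so relaxing the regex only adds the chained form without losing existing detections.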
P2: Replace self-referential happy-path eval scenario
The current eval.yaml happy path evaluates PR #34329 itself (this PR), producing atypical "N/A -- tooling PR" results. Replace with a prompt targeting a real PR with tests.
eval.yaml
An eval.yaml with 8 scenarios was generated during this evaluation and is ready to commit:
- 6 original: happy path, negative trigger, anti-pattern, test-type downgrade, weak assertions, edge case gaps
- 2 added: near-miss recall (empirically confirmed gap), no-tests-in-PR
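For illustration, a scenario entry in such an eval.yaml might look like the following. The field names and overall shape are assumptions, not the generated file's actual schema, and `#NNNNN` is a placeholder:

```yaml
scenarios:
  - name: no-tests-in-pr
    prompt: "Evaluate the tests added in PR #NNNNN"
    expect:
      skill_triggered: true
      report_contains: "❌ Fix Coverage"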
Evaluated using dotnet/skills skill-validator A/B testing + Claude Code CLI live trigger testing
New skill that evaluates tests added in PRs across 9 criteria:

- Fix coverage: does the test exercise the changed code paths?
- Edge cases & gaps: boundary conditions, null handling, async
- Test type appropriateness: unit > XAML > device > UI preference
- Convention compliance: naming, attributes, base classes
- Flakiness risk: delays, missing waits, screenshot hygiene
- Duplicate coverage: existing similar tests
- Platform scope: test vs fix platform alignment
- Assertion quality: specific vs vague assertions
- Fix-test alignment: test targets same code as fix

Includes a PowerShell script (Gather-TestContext.ps1) that automates mechanical checks (convention compliance, anti-pattern detection, AutomationId consistency, file categorization) while the SKILL.md guides the agent through deeper analysis.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Update the convention check to flag both missing and multiple [Category] attributes. The rule is: exactly ONE, either on the class or on a method, but not both.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
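The rule stated in that commit (exactly one `[Category]`, missing and multiple both flagged) can be sketched as a simple count over attribute matches. This is an illustrative Python stand-in, not the actual PowerShell implementation in Gather-TestContext.ps1:

```python
import re

def category_violation(source):
    """Return a violation message, or None if exactly one [Category] attribute is present."""
    values = re.findall(r'\[Category\(([^)]*)\)\]', source)
    if len(values) == 0:
        return "missing [Category] attribute"
    if len(values) > 1:
        return f"multiple [Category] attributes ({len(values)})"
    return None  # exactly one: compliant

ok = '[Category(UITestCategories.Button)]\npublic class Issue1 { }'
print(category_violation(ok))  # None
print(category_violation(''))  # missing [Category] attribute
```

Counting captured values (rather than raw attribute occurrences) is also what makes the later dedup fix possible: distinct values like `UITestCategories.Button` vs `UITestCategories.Shell` stay distinguishable.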
- Fix .xaml.cs matching: strip .xaml extension for HostApp file matching
- Relax file naming: only flag files in Issues/ that start with Issue
- AutomationId: bidirectional check, reverse direction is informational
- Wait check: per-interaction instead of file-level
- Unit test helpers: only flag missing test methods if filename contains Test
- Fix duplicate numbering in copilot-instructions.md (8,8,9,9 → 8,9,10,10)
- Skip [Issue()] check on .xaml files (attribute lives in code-behind)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- AutomationId forward check: skip text selectors (only check PascalCase IDs)
- Unit test helper: tighten to require Test/Tests at end of filename (avoids false positives on TestExtensions, TestHelpers, etc.)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- WaitForElement regex: remove trailing \) to match timeout overloads
- UnitTests path: broaden to match test/UnitTests/ (no dot prefix)
- Numbering: try-fix → 11 (was duplicate 10)
- SKILL.md: remove 'Descriptive method names' from automated checks

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
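The trailing-`\)` fix above can be illustrated with a Python stand-in; the exact patterns are assumptions, but they show why a closing-paren anchor misses multi-argument overloads:

```python
import re

# Requiring a closing paren right after the selector only matches
# the single-argument form of WaitForElement...
strict = re.compile(r'WaitForElement\("[^"]*"\)')

# ...and misses timeout overloads, which the relaxed pattern catches.
relaxed = re.compile(r'WaitForElement\("[^"]*"')

overload = 'App.WaitForElement("Btn", timeout: TimeSpan.FromSeconds(5));'
print(bool(strict.search(overload)))   # False: extra arguments break the match
print(bool(relaxed.search(overload)))  # True: overload is still detected
```

The simple form `App.WaitForElement("Btn");` matches both, so dropping the trailing `\)` widens detection without losing the original case.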
Split git diff output on \r?\n and trim entries to handle Windows CRLF line endings that could break downstream file path matching.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
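The CRLF fix above amounts to splitting on `\r?\n` and trimming each entry. A minimal Python sketch of the idea (the actual fix is in the PowerShell script; the sample paths are made up):

```python
import re

# git diff --name-only output on Windows can arrive with CRLF endings.
diff_output = "src/Core/Shell.cs\r\ntest/UITests/Issue34324.cs\r\n"

# Splitting on \n alone would leave a trailing \r on each path, so later
# comparisons against clean file paths would silently fail to match.
paths = [p.strip() for p in re.split(r'\r?\n', diff_output) if p.strip()]
print(paths)  # ['src/Core/Shell.cs', 'test/UITests/Issue34324.cs']
```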
Generated during skill evaluation session with skill-validator A/B testing and Claude Code CLI trigger testing. Scenarios cover:

- Happy path (structured report generation)
- Negative trigger (general code review should not activate)
- Anti-pattern detection (Thread.Sleep, obsolete APIs)
- Test type downgrade recommendations
- Weak assertion detection
- Edge case gap analysis
- Near-miss recall (informal phrasing)
- No-tests-in-PR handling

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…g, script fixes

- P0: Expand trigger description with synonym phrases for better recall
- P1: Add no-tests-in-PR handling guidance (output ❌ Fix Coverage, skip rest)
- P1: Strengthen Criterion 9 with red flags and concrete examples
- P2: Fix [Category] over-counting by deduplicating attribute matches
- P2: Add fluent WaitForElement().Tap() chain detection
- P2: Tighten screenshot-delay regex to method scope (avoid cross-method FPs)
- P2: Count .ios.cs files for MacCatalyst in platform scope analysis
- P2: Extend UITestEntry/UITestEditor check to cover XAML <Entry>/<Editor>
- P2: Replace self-referential eval.yaml happy-path scenario (was PR #34329)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Force-pushed from c2c5c22 to 3e57047.
…numbering

- Fix [Category] dedup regex to capture full attribute text (not just literal prefix) so distinct values like UITestCategories.Button vs UITestCategories.Shell remain unique after Select-Object -Unique, enabling the Count -gt 1 check to fire
- Remove dead $fluentInteractionCount variable and unused regex (behavior correct by accident; dead code misleads maintainers)
- Remove over-greedy method-boundary regex that caused false negatives in screenshot delay detection; 20-line window is already sufficient scope
- Fix skill numbering: renumber try-fix from 11 to 10

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…10 total)

New scenarios cover skill fixes introduced in latest PR commits:

- Fix-test alignment: detects when test exercises wrong control vs fix files
- Fluent chain: verifies App.WaitForElement().Tap() is not a false-positive violation

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
## Description

Adds a new Copilot skill (`evaluate-pr-tests`) that evaluates the quality, coverage, and appropriateness of tests added in PRs.

## Why This Is Useful

**Problem:** When reviewing PRs, evaluating test quality is one of the most time-consuming and inconsistent parts of the process. Common issues slip through:

- Tests that only verify a page loads but never exercise the actual fix
- UI tests for behavior that could be caught by a unit test (which runs in seconds vs minutes)
- Missing edge cases (null inputs, boundary values, repeated actions) that lead to regressions later
- Convention violations (wrong base class, missing attributes, obsolete APIs) that cause CI failures
- Flaky patterns (arbitrary delays, missing `WaitForElement`) that waste CI time

These issues are tedious to catch manually and easy to miss — especially across 4 platforms, 4 test types (unit, XAML, device, UI), and hundreds of existing test patterns.

**Solution:** This skill automates the mechanical checks and guides the agent through deeper analysis:

1. **Automated (script):** File categorization, convention compliance, anti-pattern detection, AutomationId consistency, existing similar test search, platform scope analysis
2. **Agent-guided (SKILL.md):** Fix coverage analysis, edge case gap identification, test type recommendation (prefer unit > XAML > device > UI), assertion quality review

**Real-world example:** Running against PR dotnet#34324 (Android Shell TabBar fix), the skill:

- ✅ Correctly identified 1 fix file + 1 UI test + 1 HostApp page + 1 snapshot
- ⚠️ Flagged an inline `#if` platform directive (real convention violation)
- ℹ️ Noted 3 HostApp AutomationIds not exercised by the test (informational)
- ✅ Found existing Shell tests for cross-reference
- ✅ Identified this as Android-only based on fix file paths

**Quality bar:** This skill went through 3 rounds of multi-model review (10 total reviews across Opus, GPT-5.1, Gemini, Sonnet 4.5) to catch regex bugs, false positive/negative edge cases, and documentation mismatches.

## Evaluation Criteria (9 checks)

| # | Criterion | What It Checks |
|---|-----------|----------------|
| 1 | **Fix Coverage** | Does the test exercise the actual code paths changed by the fix? |
| 2 | **Edge Cases & Gaps** | Boundary conditions, null handling, async/timing, repeated actions |
| 3 | **Test Type Appropriateness** | Could a lighter test type work? (unit > XAML > device > UI) |
| 4 | **Convention Compliance** | Naming, attributes, base classes, obsolete APIs |
| 5 | **Flakiness Risk** | Delays, missing WaitForElement, screenshot hygiene |
| 6 | **Duplicate Coverage** | Does a similar test already exist? |
| 7 | **Platform Scope** | Does test coverage match platforms affected by the fix? |
| 8 | **Assertion Quality** | Are assertions specific enough to catch regressions? |
| 9 | **Fix-Test Alignment** | Do test and fix target the same code paths? |

## Components

- **`SKILL.md`** — Evaluation criteria with decision trees, examples, and structured report template
- **`Gather-TestContext.ps1`** — Automated convention checks, anti-pattern detection, AutomationId consistency, platform scope analysis
- **`copilot-instructions.md`** — Skill registration with trigger phrases

## Usage

```bash
# Auto-detect PR context
pwsh .github/skills/evaluate-pr-tests/scripts/Gather-TestContext.ps1

# Or ask the agent
# "evaluate the tests in this PR"
# "are these tests good enough?"
# "review test quality for PR #XXXXX"
```

- Fixes dotnet#34139

---------

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: Jakub Florkowski <42434498+kubaflo@users.noreply.github.com>