feat: add azd CLI evaluation and testing framework #7202
Pull request overview
Adds a new cli/azd/test/eval/ evaluation framework intended to measure how well GitHub Copilot CLI (and humans) can discover and use azd commands, plus scheduled GitHub Actions workflows to run these evals and publish artifacts/reports.
Changes:
- Introduces a Node/TypeScript Jest test harness for unit-style CLI surface validation (help text, flags, sequencing).
- Adds Waza task YAMLs (deploy/troubleshoot/environment/lifecycle/negative scenarios) and Python grader scripts for infra/app validation.
- Adds GitHub Actions workflows to run unit tests on PRs and scheduled Waza/E2E/report jobs.
Reviewed changes
Copilot reviewed 36 out of 38 changed files in this pull request and generated 11 comments.
| File | Description |
|---|---|
| cli/azd/test/eval/tsconfig.json | TypeScript build configuration for eval tests/tools |
| cli/azd/test/eval/package.json | Node package + scripts for running Jest/Waza/reporting |
| cli/azd/test/eval/package-lock.json | Locked dependency tree for reproducible installs |
| cli/azd/test/eval/jest.config.ts | Jest configuration (ts-jest + junit in CI) |
| cli/azd/test/eval/.gitignore | Ignores build outputs and generated reports |
| cli/azd/test/eval/reports/.gitkeep | Keeps reports/ directory in git |
| cli/azd/test/eval/eval.yaml | Waza eval configuration (executor/model/metrics/task globs) |
| cli/azd/test/eval/README.md | Documentation for running/extending eval framework |
| cli/azd/test/eval/tests/unit/command-registry.test.ts | Verifies core commands exist and respond to --help |
| cli/azd/test/eval/tests/unit/help-text-quality.test.ts | Checks help output contains expected sections/descriptions |
| cli/azd/test/eval/tests/unit/flag-validation.test.ts | Validates key flags appear/behave as expected |
| cli/azd/test/eval/tests/unit/command-sequencing.test.ts | Ensures commands fail with guidance in empty dirs |
| cli/azd/test/eval/tests/human/cli-workflow.test.ts | “Human baseline” tests for responsiveness/basic UX expectations |
| cli/azd/test/eval/tests/human/command-discovery.test.ts | “Human baseline” tests focused on discovering commands/flags |
| cli/azd/test/eval/tests/human/error-recovery.test.ts | “Human baseline” tests for actionable errors and recovery hints |
| cli/azd/test/eval/tasks/deploy/deploy-python-webapp.yaml | Waza task: deploy Python app guidance |
| cli/azd/test/eval/tasks/deploy/deploy-node-api.yaml | Waza task: deploy Node API guidance |
| cli/azd/test/eval/tasks/deploy/deploy-existing-project.yaml | Waza task: deploy existing azd project (avoid init) |
| cli/azd/test/eval/tasks/environment/create-staging.yaml | Waza task: create staging environment workflow |
| cli/azd/test/eval/tasks/environment/switch-env.yaml | Waza task: switch environments |
| cli/azd/test/eval/tasks/environment/delete-env.yaml | Waza task: teardown + delete environment workflow |
| cli/azd/test/eval/tasks/lifecycle/full-lifecycle.yaml | Waza task: init→provision→deploy→down sequence |
| cli/azd/test/eval/tasks/lifecycle/teardown-only.yaml | Waza task: down/cleanup guidance |
| cli/azd/test/eval/tasks/troubleshoot/auth-error.yaml | Waza task: troubleshoot auth error guidance |
| cli/azd/test/eval/tasks/troubleshoot/config-error.yaml | Waza task: troubleshoot malformed azure.yaml |
| cli/azd/test/eval/tasks/troubleshoot/quota-error.yaml | Waza task: troubleshoot quota error |
| cli/azd/test/eval/tasks/troubleshoot/provision-role-conflict.yaml | Waza task: troubleshoot RBAC role assignment conflict |
| cli/azd/test/eval/tasks/negative/raw-azure-cli.yaml | Waza negative task: use az not azd |
| cli/azd/test/eval/tasks/negative/not-azure.yaml | Waza negative task: non-Azure question should avoid azd |
| cli/azd/test/eval/tasks/negative/general-coding.yaml | Waza negative task: general coding response without azd |
| cli/azd/test/eval/graders/infra_validator.py | Python grader stub for ARM resource existence validation |
| cli/azd/test/eval/graders/cleanup_validator.py | Python grader stub for post-azd down cleanup validation |
| cli/azd/test/eval/graders/app_health.py | Python grader stub for HTTP endpoint health validation |
| cli/azd/.vscode/cspell.yaml | Adds spelling dictionary overrides for eval docs |
| .github/workflows/eval-unit.yml | PR workflow to build azd + run Jest unit suite |
| .github/workflows/eval-waza.yml | Scheduled workflow to run Waza evaluations |
| .github/workflows/eval-e2e.yml | Scheduled workflow intended for E2E lifecycle evals with Azure login |
| .github/workflows/eval-report.yml | Scheduled workflow intended to generate weekly comparison/regression issues |
Good initiative - adding eval coverage for Copilot CLI interactions with azd fills a real gap. The Waza task definitions are well-structured, grader weights are mathematically correct across all 14 tasks, and the CI workflow design (unit on PR, Waza 3x/day, E2E weekly) is sensible.
However, there are structural and reliability issues that should be addressed before merge:
- The `azd()` test helper is copy-pasted across 7 files with subtle inconsistencies (`NO_COLOR` vs `AZD_DEBUG_FORCE_NO_TTY`, `e: any` vs `e: unknown`) - this is already causing bugs and will make maintenance painful
- Human test files don't set `NO_COLOR: "1"`, so regex assertions against help text will be flaky when ANSI escape codes are present
- The `eval.yaml` system prompt omits `azd env delete`, but `delete-env.yaml` expects the LLM to suggest it - this task will score poorly by design
- `app_health.py` has inconsistent retry logic: status mismatches retry, but body-content mismatches return failure immediately
- Two npm devDependencies (`@azure/arm-resources`, `@azure/identity`) are never imported anywhere
I've excluded items already covered by the existing review.
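To illustrate the retry inconsistency noted for `app_health.py`, here is a minimal sketch (not the PR's actual code) in which status mismatches and body mismatches share one retry path. The name `check_health`, its parameters, and the injected `fetch` callable are hypothetical, chosen only for illustration.

```python
import time
from typing import Callable, Tuple


def check_health(
    fetch: Callable[[], Tuple[int, str]],  # hypothetical: returns (status, body)
    expected_status: int = 200,
    expected_body: str = "",
    retries: int = 3,
    delay: float = 0.0,
) -> bool:
    """Return True once a response matches both status and body."""
    for attempt in range(retries):
        status, body = fetch()
        if status == expected_status and expected_body in body:
            return True
        # Status and body mismatches take the same path: wait, then retry.
        if attempt < retries - 1:
            time.sleep(delay)
    return False
```

Injecting `fetch` keeps the retry loop testable without a network call; an empty `expected_body` makes the check status-only.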
Resolves all issues raised in PR #7202 review:
- Extract shared azd() test helper to tests/test-utils.ts (eliminates duplication across 7 files, consistent NO_COLOR + AZD_FORCE_TTY)
- Fix AZD_DEBUG_FORCE_NO_TTY → AZD_FORCE_TTY=false in all test files
- Add NO_COLOR=1 to human tests (prevents ANSI flakiness)
- Use catch(e: unknown) with proper type narrowing everywhere
- Add azd env delete to eval.yaml system prompt (fixes delete-env task)
- Fix app_health.py retry logic for body-content mismatches
- Remove unused @azure/arm-resources and @azure/identity deps
- Remove missing scripts/ references from package.json and tsconfig
- Reduce jest timeout from 5min to 30s
- eval-unit.yml: add permissions block and waza:validate step
- eval-waza.yml: fix PATH via GITHUB_PATH instead of env.PATH
- eval-e2e.yml: align waza install, fix cleanup step working directory
- eval-report.yml: use gh CLI for cross-run artifact download
- Remove non-existent eval-human.yml from README CI table
- Add cspell overrides for grader/task/test files

All 7 suites pass (125 tests + 4 skipped E2E).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
59c2cb2 to ef33cf4
@jongio - thanks for the feedback. Everything has been addressed here. Ready for another review. @rajeshkamal5050 / @wbreza - going to need someone here with admin to set up the CI portions, tokens, and subscription. See the PR description for how.
@jongio All review feedback addressed and CI is fully green. Changes: shared test-utils.ts helper, 15 pytest grader tests, app_health.py retry fix, azd env delete in eval.yaml, removed unused deps, jest timeout reduction, workflow fixes (permissions, PATH, artifacts, cleanup), cspell overrides. PR body updated with full setup instructions. Ready for re-review!
jongio left a comment
Solid foundation for measuring Copilot CLI + azd interactions. The task YAML structure is well-designed, grader weight math is correct, and the CI pipeline layout (unit on PR, Waza scheduled, E2E weekly) makes sense. I've skipped items already raised in the existing reviews and focused on issues I haven't seen mentioned.
The graders have a logic gap in how urlopen handles non-2xx responses - it throws before your status comparison runs, so expected_status only works for 2xx codes. The get_access_token() function is copy-pasted across two grader files. The report workflow is entirely non-functional (placeholder echo + missing dependency file). A couple of the task YAML graders are either redundant or too strict in what they require from the LLM response.
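The `urlopen` gap described above can be sketched as follows. This is an illustrative assumption about how a grader might normalize responses, not the code from the PR; `fetch_status_and_body` is a hypothetical helper name. It relies on the standard-library behavior that `urllib.request.urlopen` raises `urllib.error.HTTPError` for 4xx/5xx responses, and that the error object still exposes `.code` and a readable body.

```python
import urllib.error
import urllib.request
from typing import Tuple


def fetch_status_and_body(url: str, timeout: float = 10.0) -> Tuple[int, str]:
    """Return (status, body) for any HTTP response, including non-2xx ones."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status, resp.read().decode("utf-8", errors="replace")
    except urllib.error.HTTPError as err:
        # urlopen raises before any status comparison can run, so recover
        # the status code and body from the exception instead.
        return err.code, err.read().decode("utf-8", errors="replace")
```

With this shape, a caller can compare the returned status against an `expected_status` such as 404 or 503 instead of only 2xx codes.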
ef33cf4 to 24d2af3
@jongio Round 2 feedback addressed, rebased on main, and all CI passing locally. Changes: HTTPError non-2xx handling in app_health.py, shared azure_auth.py module, cleaned up eval-report.yml, relaxed teardown-only.yaml grader, fixed duplicate grader in deploy-existing-project.yaml, Windows .exe support in test-utils.ts. Replied to each comment individually. Ready for re-review!
24d2af3 to ebef1ca
- app_health.py: handle non-2xx expected_status via HTTPError.code comparison
- Extract shared get_access_token() to graders/azure_auth.py (was duplicated in cleanup_validator.py and infra_validator.py)
- eval-report.yml: remove non-functional regression issue step, drop issues:write permission, add TODO for future report generation script
- teardown-only.yaml: relax --purge from must_match to must_match_any (--force without --purge is a valid response)
- deploy-existing-project.yaml: replace duplicate grader with check for --no-prompt, azure.yaml, service, or --all
- test-utils.ts: add .exe extension on Windows for cross-platform support
- Add 2 new pytest tests for HTTPError expected_status matching

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…ling
- Fix mock patch targets: patch cleanup_validator/infra_validator instead of azure_auth to correctly intercept imported names
- Replace azd env delete with azd env remove (correct CLI command)
- Add pytest step to eval-unit.yml for grader tests
- Remove continue-on-error from waza validate step
- Catch HTTPError in grade() functions to return score 0 gracefully
- Fix README grader signature: grade(context) not grade(inputs, params)
- Filter eval-report.yml artifact downloads to main branch only

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The waza CLI is not available in the CI runner environment. Make the validation step conditional to unblock the workflow. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Jon Gallant <2163001+jongio@users.noreply.github.com>
Co-authored-by: Jon Gallant <2163001+jongio@users.noreply.github.com>
- README: update code examples to use shared test-utils.ts import (removes catch(e:any), AZD_DEBUG_FORCE_NO_TTY, inline helpers)
- README: remove nonexistent scripts/ from directory structure
- README: fix CI table eval-report description (removed auto-issue ref)
- README: align Waza install method with CI (npm install -g waza)
- README: clarify code graders are for E2E lifecycle only
- eval-e2e.yml: fix cleanup step to use correct working directory
- deploy-existing-project.yaml: make grader 4 check for deployment explanation instead of duplicating command/flag checks from grader 1

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Fix shell injection in test-utils.ts (use execFileSync)
- Fix app_health.py HTTPError body check
- Tighten command-sequencing test assertions
- Remove continue-on-error masking in eval-waza.yml (staged separately)
- Fix action_sequence grader to accept azd up shorthand

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Eval failures should fail the workflow. Results are still uploaded via if: always() on the artifact upload step. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Narrow azure_auth.py exception handling to expected failures
- Fix cleanup_validator.py to only swallow 404, not all HTTPErrors
- Fix eval-e2e.yml cleanup step to not depend on azure.yaml

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Fix list_resources() to re-raise non-404 HTTPError instead of silently returning empty list. Matches cleanup_validator.py pattern.
- Fix README build path: cd ../../ (not ../../../) from test/eval/.
- Remove unused tsx devDependency from package.json.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
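The "swallow only 404, re-raise everything else" pattern named in these commits can be sketched as below. This is a hedged illustration, not the PR's `list_resources()` implementation: `list_or_empty` and the injected `fetch` callable are hypothetical names, and the real graders presumably call the ARM API directly.

```python
import urllib.error
from typing import Callable, List


def list_or_empty(fetch: Callable[[], List[dict]]) -> List[dict]:
    """Treat 404 as 'nothing exists'; surface every other HTTP failure."""
    try:
        return fetch()
    except urllib.error.HTTPError as err:
        if err.code == 404:
            # A missing resource group legitimately means an empty listing.
            return []
        # 401/403/5xx indicate a real problem and must not look like success.
        raise
```

Callers then need their own `try/except` around the call if they want to turn a re-raised error into a failing score rather than a crash.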
- Wrap list_resources/list_remaining_resources callers in try/except to handle non-404 HTTPError after upstream fix.
- Add timeout-minutes: 30 to eval-waza.yml workflow step.
- Narrow 'contact support' negative match to 'only contact support' so mentioning support alongside actionable steps isn't penalized.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Per danieljurek's feedback, tag resource groups with DeleteAfter before deleting so the cleanup script can detect and remove resources that resist deletion. Switched to PowerShell for consistency with the cleanup script. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Add timeout-minutes: 30 to eval-unit.yml job.
- Fix bare module imports in test_graders.py with sys.path.
- Add 'azd pipeline' to eval system prompt command list.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Remove the conditional waza validation step that would always skip since waza is not installed in CI. Document it as a local-only check. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Separate commit for go fix modernization unrelated to eval framework. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Move eval-waza, eval-e2e, and eval-report workflows to internal ADO
pipelines (eng/pipelines/) following team convention for authenticated
workloads. eval-unit stays in GH Actions (no secrets needed) with an
ADO mirror for consistency.
- Delete .github/workflows/eval-{e2e,waza,report}.yml
- Add eng/pipelines/eval-{unit,waza,e2e,report}.yml
- Update README CI/CD and auth setup sections for ADO
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The main eval step should not mask failures with continueOnError. Only the cleanup step (best-effort) retains continueOnError: true. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Use ADO ##vso[task.prependpath] instead of $GITHUB_PATH in eval pipelines
- Use shared azure_auth module in README example instead of inline token helper
- Add retry delay before continue in app_health.py mismatch paths

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
91fc775 to c817f90
jongio left a comment
Verified all the fixes from 4/9 and 4/10 feedback - $GITHUB_PATH, README example, retry timing, continueOnError all addressed. Four new items below.
jongio left a comment
Previous feedback addressed. The eval framework is solid - 4 new suggestions posted as non-blocking comments.
Scopes the cleanup step to only delete resource groups matching the current build's ID (rg-eval-e2e-$(Build.BuildId)) instead of all rg-eval-* groups, preventing deletion of resources from parallel runs or local dev testing. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
jongio left a comment
Latest commit scopes resource group cleanup to the current build ID - correct fix that prevents parallel run interference.
Two low-priority suggestions for future improvement (non-blocking):
- `graders/azure_auth.py` - `subprocess.run(["az", ...])` has no `timeout` parameter. If `az` hangs locally (e.g., expired creds trigger interactive login), it blocks indefinitely. Consider adding `timeout=30` and catching `subprocess.TimeoutExpired`.
- `graders/azure_auth.py` - This shared auth module has 3 code paths (az CLI success, env var fallback, both fail) but no direct tests. The grader tests mock it out, so the internal logic is untested. Consider adding a `TestAzureAuth` class to `test_graders.py`.
Three suggestions from my 04/13 review are still open as non-blocking comments - no need to re-post those.
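The timeout suggestion for `graders/azure_auth.py` can be sketched as below, covering the three paths the review lists (CLI success, env var fallback, both fail). This is an assumption-laden illustration, not the module's real code: `get_token`, its parameters, and the `AZURE_ACCESS_TOKEN` variable name are hypothetical, and the command is passed in so the logic can be exercised without `az` installed.

```python
import os
import subprocess
from typing import Optional, Sequence


def get_token(
    cmd: Sequence[str],
    env_var: str = "AZURE_ACCESS_TOKEN",  # hypothetical fallback variable name
    timeout: float = 30.0,
) -> Optional[str]:
    """Try a CLI command first, then an env var; return None if both fail."""
    try:
        result = subprocess.run(
            list(cmd), capture_output=True, text=True, timeout=timeout, check=True
        )
        token = result.stdout.strip()
        if token:
            return token
    except (subprocess.TimeoutExpired, subprocess.CalledProcessError, FileNotFoundError):
        # A hung, failing, or missing CLI falls through to the env var.
        pass
    return os.environ.get(env_var) or None
```

In real use the command would presumably be something like `["az", "account", "get-access-token", "--query", "accessToken", "-o", "tsv"]`. Because the command is a parameter, a `TestAzureAuth` class could cover all three paths with a stub executable instead of mocking the whole module.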
Closes #7608
Problem
We have no visibility into how GitHub Copilot CLI interacts with `azd`. Unlike the microsoft/github-copilot-for-azure skills repo, which has a comprehensive test, eval, and CI setup, the azd CLI has zero coverage for measuring LLM interactions, command discoverability, or human usability patterns. We have no idea whether Copilot CLI can successfully discover commands, interpret help text, handle errors gracefully, or guide users through common workflows.

Solution
This PR adds a comprehensive evaluation and testing framework at `cli/azd/test/eval/` inspired by the GHCP4A setup. It covers both LLM eval (how well an AI agent uses azd) and non-LLM unit tests (how well azd surfaces information for human and AI consumption).

What's included
125 passing tests across 7 suites:
- `--help`
- `--output json`, `--no-prompt`, `-e/--environment` flags

15 Python grader unit tests (pytest):
- `app_health.py` - HTTP health checks with retry logic
- `cleanup_validator.py` - ARM API validation for post-`azd down` cleanup
- `infra_validator.py` - ARM API validation for post-`azd provision` resources

14 Waza LLM eval task definitions (YAML):

4 CI workflows:
- `eval-unit.yml` - runs unit tests + waza validate on PR
- `eval-waza.yml` - Waza LLM evals 3x/day (Tue-Sat)
- `eval-e2e.yml` - weekly E2E with Azure resource validation
- `eval-report.yml` - weekly report generation + auto-issue creation

Setup Required Before Going Live
Secrets to configure (Settings → Secrets and variables → Actions)

| Secret | Used by |
|---|---|
| `AZURE_CLIENT_ID` | `eval-e2e.yml` |
| `AZURE_TENANT_ID` | `eval-e2e.yml` |
| `AZURE_SUBSCRIPTION_ID` | `eval-e2e.yml` + graders |
| `COPILOT_CLI_TOKEN` | `eval-waza.yml`, `eval-e2e.yml` |
| `GITHUB_TOKEN` | `eval-report.yml` |

Service principal setup

What works without any setup
- `npm run test:unit`
- `npm run test:human`
- `npm run waza:run:mock`
- `eval-unit.yml` CI, `cli/azd/test/eval/**`

What needs secrets

| Workflow | Secrets |
|---|---|
| `eval-waza.yml` | `COPILOT_CLI_TOKEN` |
| `eval-e2e.yml` | `AZURE_CLIENT_ID`, `AZURE_TENANT_ID`, `AZURE_SUBSCRIPTION_ID`, `COPILOT_CLI_TOKEN` |
| `eval-report.yml` | `GITHUB_TOKEN` (auto) |

Testing
All tests pass locally: