Add live golden-path (Tier 2) pipeline for azd ai agent extension#8758
Conversation
Adds eng/pipelines/ext-azure-ai-agents-live.yml, an on-demand/weekly Azure DevOps pipeline that drives the real 'azd ai agent' CLI through tmux against live Azure (TME), exercising init -> provision -> deploy -> invoke -> down for both code and container deploy modes. This is the live counterpart to the PR-gate checks (Tier 0 offline + Tier 1 recording/playback in #8754). Per Azure SDK EngSys / SFI guidance, live access stays out of the automatic PR pipeline (trigger: none) and runs only via '/azp run ext-azure-ai-agents-live' or the weekly schedule. The Tier 2 tmux driver (test_full_e2e.py, test_tier2.py) is migrated from the #8692 prototype; CI auth detection is extended to recognize Azure DevOps (TF_BUILD) and an explicit E2E_USE_AZ_CLI_AUTH override.
…zSessionActive The azure-sdk-tests service connection uses Workload Identity Federation, whose az session is isolated to the task's private AZURE_CONFIG_DIR and expires after ~10 min. Running the ~50 min golden-path test (and the cleanup) as plain bash steps after a separate login step would fail auth on both counts. Run them inside AzureCLI@2 with keepAzSessionActive:true (matching build-cli.yml) so the session stays refreshed and reaches azd (auth.useAzCliAuth) through tmux, which inherits AZURE_CONFIG_DIR. Subscription/tenant are now read in-script via az account show instead of cross-step pipeline variables.
test_tier2.py always ran sequentially, but kept a tautological if-condition (len==1 or len>1), an unused concurrent.futures import, a no-op --serial flag, and a docstring/print claiming parallel execution. Simplify to an explicit sequential loop and update the docstring to match. Also fix test_full_e2e.py's module docstring to point at README.md (LOCAL-TEST-GUIDE.md does not exist).
There was a problem hiding this comment.
Pull request overview
This PR adds a dedicated Azure DevOps pipeline and supporting Python drivers/docs to run the Tier 2 live golden-path E2E for the azure.ai.agents azd extension (init → provision → deploy → invoke → down) against real Azure resources, outside of the GitHub PR gate.
Changes:
- Added an on-demand + weekly ADO pipeline (
ext-azure-ai-agents-live) to run Tier 2 live E2E for bothcodeandcontainerdeploy modes. - Added a tmux-driven Python Tier 2 runner (
test_tier2.py) and a full golden-path driver (test_full_e2e.py) adapted for ADO CI auth detection. - Added documentation for running the live Tier 2 tests locally and in CI.
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 6 comments.
| File | Description |
|---|---|
| eng/pipelines/ext-azure-ai-agents-live.yml | New ADO pipeline wiring to build azd + extension, run Tier 2 live E2E under AzureCLI@2, and publish logs/cleanup. |
| cli/azd/extensions/azure.ai.agents/tests/e2e-live/test_tier2.py | Tier 2 orchestrator: runs code + container golden paths sequentially with isolation and timeout handling. |
| cli/azd/extensions/azure.ai.agents/tests/e2e-live/test_full_e2e.py | tmux-driven end-to-end golden path driver (init/provision/deploy/invoke/down) with CI/local auth switching. |
| cli/azd/extensions/azure.ai.agents/tests/e2e-live/README.md | Docs for CI setup (ADO registration/service connection/secrets) and local WSL execution. |
- Use the ambient azure-sdk org secret `azuresdk-github-pat` for GH_TOKEN instead of an empty `GitHubPat` placeholder variable (mirrors eval-waza.yml); removes a misleading masked variable and the need for admin PAT setup. - Harden the AzureCLI@2 inline script: `set -euo pipefail` and assign-then-verify subscription/tenant so an `az account show` failure fails fast (a plain `export X=$(...)` would have masked the error from set -e). - Reword the extension-install comment to be self-contained (it no longer inaccurately claims to mirror lint-ext-azure-ai-agents.yml). - Clarify the test_full_e2e.py auth prerequisite: only local WSL runs leave auth.useAzCliAuth unset; CI auto-enables az CLI auth. - Clear tmux scrollback after env setup so the exported GH token cannot leak into capture() output on failures/timeouts. - _cleanup_leaked_resources now checks azd down's return code and reports failures instead of always printing "Cleanup complete".
- Stream child E2E output live with a watchdog-enforced hard timeout instead of buffering everything via capture_output - Shell-escape the GitHub token (shlex.quote) before exporting in tmux - Clean up the per-mode AZD_CONFIG_DIR temp copy unless E2E_KEEP_ARTIFACTS - Use sha256 instead of md5 for the agent-name uniqueness suffix - Derive the agent binary arch from uname -m instead of hard-coding amd64
- Shell-escape HOME/PATH/TENANT, the cd target, and the agent name with shlex.quote() (consistent with the earlier token fix) - On Tier 2 timeout, kill the child's detached tmux server so reused CI agents do not accumulate orphaned tmux sockets
# Conflicts: # cli/azd/extensions/azure.ai.agents/cspell.yaml
jongio
left a comment
There was a problem hiding this comment.
One minor nit on the location dispatch case for consistency with the other tightened prompt matches.
Merging main shifted prompt_service.go, leaving the cited line numbers (143/190/226) pointing at the wrong lines. Drop the brittle file:line reference and rely on the function name plus the quoted picker messages, which the case clauses already match exactly.
jongio
left a comment
There was a problem hiding this comment.
Previous feedback (errors.AsType, t.Context(), yaml.Unmarshal for service name parsing, tightening prompt dispatch matches, logging default-case fall-throughs) all addressed in recent commits. Code quality is solid.
One observation: the PR description mentions a condition: always() AzureCLI@2 step as a "second safety net" for crash-path cleanup, but the pipeline YAML only has PublishPipelineArtifact@1 with condition: always(). The Go .Cleanup(r.teardown) approach is sufficient given the timeout hierarchy (job 150m > step 130m > go test 125m > per-mode 60m), so this isn't a gap in practice, but the description is slightly overstated. Consider updating the PR body to reflect the actual cleanup mechanism (t.Cleanup + LIFO ordering) or adding the explicit cleanup step if you want the belt-and-suspenders guarantee for cases where the Go process is killed before t.Cleanup runs.
No blocking issues. The event-driven prompt loop design, the timeout hierarchy, and the �ssertSafeTestDir guard are all well thought out.
ab890b1 tightened the subscription case to has("select an azure subscription"), but that string is the extension's fmt.Println preamble, not the survey "?" line activePrompt reads. ensureSubscription passes an empty PromptSubscriptionRequest, so the picker renders azd-core's default message "Select subscription" (prompt_service.go:507). The tightened match missed it, sending the prompt to the default Enter and selecting the wrong subscription in a live run. Match "select subscription" instead.
jongio
left a comment
There was a problem hiding this comment.
Incremental commit fixes the subscription dispatch to match azd-core's actual picker message (Select subscription from prompt_service.go:507) rather than the extension's preamble text. Verified: activePrompt lowercases before dispatch, so the case-sensitive has() correctly matches.
Minor note for future awareness: select subscription (singular) is also a prefix/substring of select subscriptions to include in the filter (config_sub_filter.go:188). This won't cause false matches since that multi-subscription filter prompt doesn't appear in the agent init flow, but worth keeping in mind if future init steps add subscription filtering.
RickWinter
left a comment
There was a problem hiding this comment.
This adds a Tier 2 live golden-path test (init/provision/deploy/invoke/down) for the azure.ai.agents extension, driven through a real PTY with go-expect + vt10x, plus an on-demand/weekly Azure DevOps pipeline that runs it against the TME subscription. The shape is right: it is gated behind AZURE_AI_AGENTS_E2E_LIVE, kept out of the PR pipeline, runs the two deploy modes sequentially, and registers an azd down teardown with a condition: always() safety net so a crash does not leak billable resources. go.sum already carries the three new direct dependencies, so the extension module still builds, and the answer matcher is well-factored and covered on its own by assert_test.go.
One thing is worth tightening before a green run is trusted as signal: the invoke assertion matches a standalone "4" against the entire combined CLI output rather than the agent's actual response, so incidental azd chatter (an elapsed-seconds token, a count, a step number) could turn the check green while the agent is broken. That is the difference between a live test that verifies the agent and one that mostly verifies the pipeline ran, so I would scope it to the response payload.
The rest are non-blocking. The numerous file:line source citations will rot silently the first time those files are edited, and the pipeline hand-writes the extension manifest into config.json where it can drift from the real one. Nothing here runs in PR CI, so the blast radius is bounded to the manual or scheduled live run. None of it blocks merge; I would fix the invoke matcher scope before relying on a green run and treat the other two as follow-ups.
jongio
left a comment
There was a problem hiding this comment.
Confirmed RickWinter's invoke assertion concern (replied on his thread with details). responseHasExpectedAnswer runs on the full combined stdout+stderr stream, and patterns like "4 seconds" or "step 4:" pass the standalone-four check since space/colon aren't word runes. Consider adding -o raw to the invoke args so the assertion is scoped to the server response payload only.
No other new findings. The event-driven dispatch loop, layered timeout hierarchy (per-mode 60m < go test 125m < ADO step 130m < job 150m), and LIFO cleanup ordering are correctly structured. Prior review feedback has been addressed in the current HEAD.
Match only the model's reply ([agent] ... up to 'Server responded in') so stray digits elsewhere in CLI output can't false-pass; replace stale source line numbers in dispatch comments with function names.
Derive the extension manifest fields from extension.yaml so they cannot drift from a hand-maintained copy; only path/source/test version injected.
jongio
left a comment
There was a problem hiding this comment.
Incremental review of commit 3c0f717 (scopes the 2+2 invoke assertion to the agent response region).
The agentResponseRegion approach is correct: slicing between [<agent>] and "Server responded in" cleanly isolates the model's answer from surrounding CLI noise (model names, status codes, timing info). Fallback to full text when markers are absent is a reasonable defensive choice since hasStandaloneFour already rejects most false-positive patterns.
The yq-from-extension.yaml change in the pipeline eliminates the manifest drift problem. Comment references now use function names instead of fragile line numbers. Both solid improvements.
One suggestion on test coverage below.
jongio
left a comment
There was a problem hiding this comment.
Incremental review of commit 3c0f717 (scopes the 2+2 invoke assertion to the agent response region).
The agentResponseRegion approach is correct: slicing between [<agent>] and "Server responded in" cleanly isolates the model's answer from surrounding CLI noise (model names, status codes, timing info). Fallback to full text when markers are absent is a reasonable defensive choice since hasStandaloneFour already rejects most false-positive patterns.
The yq-from-extension.yaml change in the pipeline eliminates the manifest drift problem. Comment references now use function names instead of fragile line numbers. Both solid improvements.
One suggestion on test coverage below.
Per review: verify a real standalone digit ahead of the [agent] line is excluded by the response-region slicing.
jongio
left a comment
There was a problem hiding this comment.
I've been reviewing this PR across multiple rounds. Here's my final assessment of the current state:
The Go PTY-based test driver is well-structured. The event-driven prompt dispatch (waitForQuiet + screen-read + dispatch) is a significant improvement over fixed-interval polling, and the loop detection (promptKey + repeat counter) prevents silent hangs when prompts change upstream.
The answer assertion logic (assert.go) correctly handles the tricky problem of detecting a standalone "4" without false-positiving on model names, versions, or status codes. The agentResponseRegion scoping adds a solid second layer of defense.
Timeout hierarchy is correctly ordered (per-mode runTimeout < go test -timeout < ADO step < ADO job) so the in-test teardown always gets a chance to run before anything force-kills the process.
My prior comments were all non-blocking suggestions (broader keyword-match tightening in dispatchPrompt, an additional test case for region-scoping in assert_test.go). None are blockers.
One note for the broader reviewer group: the go.mod adds go-expect, creack/pty, and vt10x at 2022-era commits. These are stable, widely-used packages for PTY testing in Go (used by AlecAivazis/survey itself). They don't have newer tagged releases that would be materially better.
|
Thanks @RickWinter @jongio — all threads addressed and resolved. This round:
CI's green. |
jongio
left a comment
There was a problem hiding this comment.
One gap to close before merge: the EngSys DeleteAfter tag on resource groups (as @danieljurek noted). The in-test azd down and the condition: always() safety-net cover the happy and crash paths, but if both fail (VM hard-kill, network timeout during teardown), resources leak with no EngSys visibility.
Suggested fix: after phaseProvision returns, shell out to tag the resource group:
// inside runner, after phaseProvision succeeds:
rgName := r.resolveResourceGroup() // azd env get-values | grep AZURE_RESOURCE_GROUP
if rgName != "" {
deleteAfter := time.Now().Add(48 * time.Hour).UTC().Format(time.RFC3339)
r.runCmd(ctx, "az", "tag", "update",
"--resource-id", "/subscriptions/"+sub+"/resourceGroups/"+rgName,
"--operation", "merge", "--tags", "DeleteAfter="+deleteAfter)
}This way the EngSys cleanup pipeline can find and purge any survivors. The 48h window gives plenty of room for the normal teardown to succeed first.
The rest of the implementation is solid: event-driven prompt loop, proper timeout hierarchy, LIFO cleanup ordering, and the agentResponseRegion scoping all look correct.
…eanup Resolve cspell.yaml (union word lists) and go.mod (keep main's otel/x-deps bumps and the temporary azd replace; re-add the PTY test deps) conflicts. Stamp a DeleteAfter cleanup tag (now+48h, RFC3339 UTC) on the provisioned resource group so the EngSys garbage collector reclaims it even if the post-run `azd down` teardown never runs (agent crash, post-timeout budget exhaustion, lost network). Tagged in two places: - the Go driver, immediately after a successful `azd provision` - the pipeline cleanup loop, before `azd down` Uses `az group update --set tags.DeleteAfter=` so the single tag is merged without clobbering azd's own tags (e.g. azd-env-name). Addresses review feedback from @danieljurek and @jongio.
jongio
left a comment
There was a problem hiding this comment.
Incremental review against 6c66c25 (new commits since my last review at 2d9dfb7).
The DeleteAfter tagging implementation addresses my prior concern and @danieljurek's feedback. Both the in-test tag (after successful provision) and the pipeline cleanup step correctly use RFC 3339 UTC format per the EngSys spec. One non-blocking edge case noted inline.
A provision that times out before persisting AZURE_RESOURCE_GROUP would leave its resource group untagged, so the EngSys cleanup pipeline could not reclaim it if azd down also failed. Enumerate azd envs on disk (.azure/<env>) and tag the deterministic rg-<env> group, which exists from the start of provisioning, closing the only permanent-leak gap.
jongio
left a comment
There was a problem hiding this comment.
Incremental review of 2 new commits since my last pass (6c66c25).
30dfb29 (SplitSeq): Correct go fix modernization. strings.SplitSeq returns a lazy iterator, semantically identical to the previous strings.Split usage.
e1c7ab8 (cleanup tagging): The pipeline safety-net step now enumerates .azure/*/ directories to construct
g- directly, instead of reading AZURE_RESOURCE_GROUP from �zd env get-values. This closes the edge case I flagged in my last review: if provision times out before persisting env values, the old approach would skip tagging entirely. The hard-coded
g- convention is safe here since the test always uses default azd naming and the || true guard makes it best-effort.
All prior findings (errors.AsType, t.Context(), yaml.Unmarshal for service parsing, stderr visibility, prompt dispatch tightening, DeleteAfter tagging) are addressed.
No new issues.
|
@vhvb1989 @danieljurek @trangevi could you help re-approve the PR and merge? latest commits addressed some nit issues, and they staled the earlier approvals. thanks. |
jongio
left a comment
There was a problem hiding this comment.
All prior review concerns (response region scoping, function-name anchors instead of file:line, yq-generated config.json, DeleteAfter tagging including the provision-timeout case) are addressed in the latest commits.
One low-severity note on the test dependencies (non-blocking).
jongio
left a comment
There was a problem hiding this comment.
Incremental review of 990ffa5 (bumps creack/pty from v1.1.17 to v1.1.24).
This addresses my prior comment about starting on the latest stable version for new test infrastructure. v1.1.24 is a legitimate release (Oct 2024), the pty.Open/Setsize API is unchanged, and the author confirmed clean Linux build/vet/lint.
No new issues found across the full changeset. Prior concerns (response region scoping, function-name anchors instead of file:line, yq-generated config.json, DeleteAfter tagging for the provision-timeout case) are all resolved in earlier commits.
Fixes #8759
Summary
eng/pipelines/ext-azure-ai-agents-live.yml: an on-demand / weekly Azure DevOps pipeline that runs the Tier 2 live golden path (init → provision → deploy → invoke → down) for theazure.ai.agentsextension against a real Azure (TME) subscription, for bothcodeandcontainerdeploy modes.tier2_live_test.go,console_test.go,assert.go) undercli/azd/extensions/azure.ai.agents/tests/e2e-live/, drivingazd ai agentthrough a pseudo-terminal withgo-expect+vt10x+creack/pty(replacing the original tmux/Python prototype from test: add static E2E tests for azure.ai.agents extension #8692), with CI auth detection for Azure DevOps (TF_BUILD/E2E_USE_AZ_CLI_AUTH).README.mddocumenting how to run (CI + local) and the one-time ADO setup.Together with the PR-gate tests in #8754 (Tier 0 offline + Tier 1 recording/playback), this covers Tier 0/1/2 from the original prototype (#8692).
At a glance
What it tests — the full live golden path
init → provision → deploy → invoke → downofazd ai agent, driving a real CLI over a pseudo-terminal (Gogo-expect+vt10x+creack/pty), for bothcode(zip) andcontainer(ACR build) deploy modes, run sequentially. The interactiveinitphase is an event-driven prompt loop (wait for the survey UI to settle, read the rendered screen, dispatch on the verbatim prompt strings) rather than a fixed script, so it tolerates the live model/region/project branches. Theinvokestep asserts a live model answers "what is 2+2?" with a standalone4/four.How to trigger —
trigger: none/pr: none, so it never runs on PR push. Run it via: a PR comment/azp run ext-azure-ai-agents-live(requires repo write), the weekly Mon 07:00 UTC schedule, or a manual ADO queue (pickdeployModes=both/code/container).Identity / subscription / what it does — authenticates as the
azure-sdk-testsADO service connection (Workload Identity Federation / OIDC) into the shared TME test subscription, regioneastus2. It creates a fresh Foundry project +gpt-4o-minideployment (+ an ACR in container mode) and the agent service, exercises them, then tears them down.azlogs in via OIDC and azd reuses that session (auth.useAzCliAuth+keepAzSessionActive: true, so the WIF token survives the multi-minute run).Leftover resources? — No, by design. Each mode registers a
t.Cleanupthat runsazd down --force --purge, and a teardown failure fails the run (never green while leaking). Acondition: always()AzureCLI@2step is a second safety net: it force-purges any project dirs left under/tmp/e2e-tests/tier2-*/after a crash or timeout. (Crash-path cleanup is best-effort, so the TME subscription is worth an occasional spot-check.)Runs on Azure DevOps, not GitHub Actions — the YAML lives in this GitHub repo (
eng/pipelines/…) but is an ADO pipeline (extends: 1es-redirect,task: AzureCLI@2,$(Build.*),##vso[...]), executed by theazure-sdkADO project. GitHub CI for this PR stays green/inert; the live run only happens in ADO.Post-merge admin action — needs an Azure DevOps admin on the
azure-sdkproject (Azure SDK EngSys) — not a GitHub admin — to: (a) register the pipeline asext-azure-ai-agents-live(the exact name/azp runuses) against this repo + YAML path; (b) confirm theazure-sdk-testsservice connection maps to the TME subscription with RBAC to create Foundry projects and deploy models (Contributor + Azure AI Developer + Cognitive Services Contributor, or equivalent); (c) kick the first/azp runto validate thekeepAzSessionActiveauth path. Merging the PR itself only needs a GitHub maintainer approval. TheinitGitHub token is already wired via the ambientazuresdk-github-patorg secret, so no extra secret setup is needed.Testing
/azp run(the test is gated behindAZURE_AI_AGENTS_E2E_LIVE=1and never runs in PR CI).gofmt,go vet,golangci-lint(ubuntu + windows), and ago test -ccompile all clean on thelinuxbuild tag; the no-tagassert.goanswer-matching logic is covered byassert_test.go(runs on the host);README.mdpassescspell.t.Context()as the run-context parent sogo test -timeoutcancellation propagates; LIFOt.Cleanupordering so teardown runs before the copiedAZD_CONFIG_DIRis removed; a destructive-RemoveAllguardrail on the test dir; and a robust standalone-token check for the invoke result (accepts4orfour, ignores incidental4s such asgpt-4o-mini/4.1/404).runTimeout(60 min, acontextdeadline) > the sum of the per-phase budgets (init/provision/deploy/invoke ≈ 32 min) so a slow-but-healthy run is never preempted mid-phase; both modes fit undergo test -timeout 125m, and the ADOAzureCLI@2step (130 min) and job (150 min) caps sit strictly above that, so a run can never be hard-killed mid-azd down(which would leak live resources).