Azure AI Agents: add next-step guidance and doctor diagnostics by antriksh30 · Pull Request #8198 · Azure/azure-dev

antriksh30 · 2026-05-15T04:00:03Z

Summary

add context-aware Next: guidance for the Azure AI Agents extension across init, run, invoke, show, deploy-hook, and doctor flows
add azd ai agent doctor with local and remote checks for project setup, environment variables, authentication, Foundry reachability/RBAC, hosted agent status, identity roles, model deployments, connections, and toolboxes
improve post-deploy guidance with README-first sample payload hints, omit active-agent show next steps, and stream text-mode doctor checks as they complete

Copilot

Copilot wasn't able to review this pull request because it exceeds the maximum number of lines (20,000). Try reducing the number of changed lines and requesting a review from Copilot again.

Copilot

Copilot wasn't able to review this pull request because it exceeds the maximum number of lines (20,000). Try reducing the number of changed lines and requesting a review from Copilot again.

Copilot

Pull request overview

Copilot reviewed 73 out of 73 changed files in this pull request and generated 10 comments.

jongio

Solid foundation for the doctor + nextstep subsystems - clean module boundaries, interface-driven testability, and well-structured check pipeline. A few things to sort out before this leaves draft:

invoke.go:541-543 is the one I'd prioritize - it silently changes CLI semantics for --name. The rest are lower-stakes but worth a look.

@jongio

The remote.auth doctor check surfaced the raw user principal name in both the Message string and the structured Details map regardless of the --unredacted flag, in contrast with the rest of the doctor checks (checks_rbac.go, checks_agent_identity_roles.go) which already gate identity values on opts.Unredacted. This change threads Options into the auth check function and adds two small helpers that reuse the existing redactedPlaceholder constant: - redactUPN(upn, unredacted) returns the value to surface in Message text: raw when --unredacted, the shared <redacted> placeholder when a UPN was discovered but should be scrubbed, and empty when none was found so composeAuthMessage cleanly drops the prefix. - authDetails(upn, minutes, unredacted) builds the Details map and omits the "upn" key entirely unless --unredacted is set, so machine consumers do not see the raw value by default. PASS, WARN, and expired-FAIL branches now compose their messages from the redacted display value. Existing tests that asserted the raw UPN were updated to pass Options{Unredacted: true}; new table tests cover the default-redacted and --unredacted contracts on every branch, the empty-UPN drop, and both helpers in isolation. Resolves PR Azure#8198 review comment from @jongio. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

TestErrorCodeWireValues pins the lowercase JSON wire values of every exported enum the nextstep package consumes from the Agents API, but the AgentVersionStatus map was missing the "idle" entry. The idle status is actively read at show.go:207 and nextstep/resolver.go:248, so silent drift on that one literal would regress the show command's idle branch and the resolver's deployment-pending hint without any test failure. Add the missing "idle": string(AgentVersionIdle) case. Resolves PR Azure#8198 Copilot review comments (ids 3246075889, 3246075800). Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

@jongio

The remote.auth doctor check surfaced the raw user principal name in both the Message string and the structured Details map regardless of the --unredacted flag, in contrast with the rest of the doctor checks (checks_rbac.go, checks_agent_identity_roles.go) which already gate identity values on opts.Unredacted. This change threads Options into the auth check function and adds two small helpers that reuse the existing redactedPlaceholder constant: - redactUPN(upn, unredacted) returns the value to surface in Message text: raw when --unredacted, the shared <redacted> placeholder when a UPN was discovered but should be scrubbed, and empty when none was found so composeAuthMessage cleanly drops the prefix. - authDetails(upn, minutes, unredacted) builds the Details map and omits the "upn" key entirely unless --unredacted is set, so machine consumers do not see the raw value by default. PASS, WARN, and expired-FAIL branches now compose their messages from the redacted display value. Existing tests that asserted the raw UPN were updated to pass Options{Unredacted: true}; new table tests cover the default-redacted and --unredacted contracts on every branch, the empty-UPN drop, and both helpers in isolation. Resolves PR Azure#8198 review comment from @jongio. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Fixes two pre-existing UX bugs in `azd ai agent run`: B5 (Next: race): pre-C8 the Next: block was rendered BEFORE `proc.Start()`, so its "in another terminal, try: <example>" line could land in the user's terminal seconds before the agent's port was bound. Following the suggestion verbatim while the agent was still starting yielded a connection-refused. B6 (stale cached OpenAPI): pre-C8 the OpenAPI probe was strictly cache-only. The first `run` ever issued — and every `run` against an agent whose schema had drifted — surfaced a protocol-generic `<payload>` literal instead of the agent's actual example. Scope — only `run.go` uses the live probe. `show.go` (`WithOpenAPIProbe (name, "remote")`), `deploy`/artifact-note path, `init.go`, and `init_from_code.go` remain cache-only by design. Changes ------- nextstep/state.go - New functional option `WithLiveOpenAPIProbe(fetch func(context. Context) ([]byte, error))`. Stores the fetcher in `cfg.openAPILiveFetch`. - `populateOpenAPIPayload` now takes `(ctx, cfg, projectPath, envName, state)`. Order of resolution: (1) live, on success → use it; (2) cache (existing `WithOpenAPIProbe` path), on success → use it; (3) leave HasOpenAPI=false. Every failure is silent. - `assembleState` threads `ctx` through to the new signature. - Doc comment on `WithOpenAPIProbe` updated to note that `WithLiveOpenAPIProbe` overrides it when both are supplied. run.go - Removed the pre-Start `nextstep.AssembleState` + `PrintNext` block that was emitting Next: before the agent bound its port. - After `proc.Start()`, spawn `emitNextAfterBind` in a goroutine with a `nextDone` channel. After `proc.Wait()` returns, the main flow calls `cancel()` and waits on `<-nextDone` so the goroutine is fully joined before stdout writes from `runRun`'s caller resume — closes a stdout race on shutdown. - `emitNextAfterBind` early-returns when stdout is not a terminal, honoring the documented nextstep call-site TTY-gating contract (matches `invoke.go:217`, `show.go:381`). Redirected stdout (`run > log`) no longer receives the banner or Next: block. - After waitForPortReady succeeds, builds a closure that wraps `fetchLiveOpenAPI(ctx, port)` and passes BOTH `WithOpenAPIProbe(serviceName, "local")` (cache fallback) and `WithLiveOpenAPIProbe(...)` to `AssembleState`. The state assembler picks live first, falls back to cache silently if the live fetch fails. - Re-checks `ctx.Err()` after `AssembleState` returns so a Ctrl+C arriving mid-call doesn't surface "Agent ready" after "Agent stopped." was already printed. - Four new constants: `portReadyBudget` (5 s), `portReadyPollInterval` (100 ms), `portReadyDialTimeout` (50 ms), `liveOpenAPITimeout` (3 s). - `waitForPortReady(ctx, port, budget) bool`: bounded TCP dial-loop that honors ctx. - `fetchLiveOpenAPI(ctx, port) ([]byte, error)`: uses `http.NewRequestWithContext` to GET `http://localhost:<port>/invocations/docs/openapi.json`. The route matches the cache-side fetcher in `helpers.go:368` and the user-facing curl tip in `nextstep/resolver.go:226`. Non-200 responses are returned as errors so the assembler falls back to cache rather than ingesting a 404 body via `ExtractInvokeExample`. Tests ----- state_test.go (+5 new TestAssembleState_WithLiveOpenAPIProbe_* cases + expanded `TestOptionsApplyCleanly`): - PrefersLiveOverCache, FallsBackToCacheOnError, FallsBackToCacheOnEmptyBody, LiveWorksEvenWithoutCacheProbe, LiveFailureWithoutCacheLeavesUnset. run_test.go (+8 new tests + `listenLoopback` helper): - 3× waitForPortReady (bound port, budget elapse, ctx cancel). - 3× fetchLiveOpenAPI (200 body asserts `/invocations/docs/openapi.json` path, non-200 error, ctx deadline). - 2× emitNextAfterBind (never-binds, ctx cancelled — both pass nil azdClient through the safe early-return paths to verify the helper exits silently without panic or goroutine leak). Preflight clean: gofmt -s -w, go vet, go build, go test ./internal/cmd/... ./internal/cmd/nextstep/... -count=1 (cmd 10.5 s, doctor 1.6 s, nextstep 1.9 s), golangci-lint run ./internal/cmd/... (0 issues), cspell on the four modified files (0). Review fixes (3-reviewer pass) ------------------------------ Three independent reviewers (Opus xhigh, Sonnet 4.6, GPT-5.5) reached consensus on three correctness findings before merge: - Live probe URL corrected to /invocations/docs/openapi.json (matches existing cache fetcher and user-facing curl tip). - Banner + PrintNext now gated on isTerminal(os.Stdout.Fd()) to honor the nextstep call-site contract. - emitNextAfterBind goroutine is now joined via nextDone channel after proc.Wait, and re-checks ctx.Err() before printing so the banner cannot land after "Agent stopped." - Replaced misleading "ReturnsSilentlyWhenPortNeverBinds" test that only exercised waitForPortReady with two tests that actually call emitNextAfterBind with nil azdClient on the safe early-return paths. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

…P5.1 C12) Adds the final doctor check (P5.1 C12) that surfaces deployed-agent managed-identity role assignments at three ARM scopes — project, account, and resource group. For each agent classified active by the upstream `remote.agent-status` check, the new `remote.agent-identity-roles` check fetches the agent's `instance_identity.principal_id` from Foundry and lists ARM role assignments at all three scopes via the existing `armauthorization` SDK already pulled in for the developer-RBAC check. ## What lands `internal/project/agent_identity_query.go` (new, ~340 LoC): - Public API `QueryAgentIdentityRoles(ctx, azdClient, projectResourceID, principals) (*AgentIdentityRolesResult, error)` reuses `parseAgentIdentityInfo` from `agent_identity_rbac.go` to derive the three scope ARNs, looks up the user-access tenant via `LookupTenant`, builds an `AzureDeveloperCLICredential` pinned to that tenant, and fans out per-principal role-assignment listings with `wg.Go`. - Public types `AgentPrincipal`, `AgentScopeRoles`, `AgentIdentityRolesEntry`, `AgentIdentityRolesScopes`, `AgentIdentityRolesResult` form the structured listing the doctor renders. - `queryAgentIdentityRolesWithLister` separates the per-scope listing strategy from credential acquisition so unit tests drive the inner classifier without standing up ARM fakes. - `listRoleNamesAtScope` lists role assignments with ARM's server-side `assignedTo('<principal>')` filter, then resolves role-definition IDs to human-readable names via `RoleDefinitions.Get`. Failures on individual role-name resolution downgrade gracefully (omitted from the listing). `internal/cmd/doctor/checks_agent_identity_roles.go` (new, ~640 LoC including doc comments): - `newCheckAgentIdentityRoles(deps)` builds the check Closure. Skip cascade against `local.environment-selected`, `local.agent-service-detected`, `remote.auth`, `remote.foundry-endpoint`, and `remote.agent-status`'s Pass (per the design's "for each active agent found in check 11"). - `readActiveAgents(prior)` enumerates agents reachable to this check by reading the upstream `remote.agent-status` Details' `services` slice and filtering to Classification == "active". - `classifyOneAgent` buckets a single agent into fine / underscoped / empty / unknown per the design's pass condition ("project + (account|RG) covered"). `describeOneAgent` renders the one-line per-agent breakdown (`<agent>: project=N, account=M, resource-group=K`, with `?` on probe-error scopes). - `classifyAgentIdentityRolesAggregate` folds per-agent entries into a single doctor Result: all "fine" → Info; any "empty" → Fail (smoking-gun for "every tool call 403s"); worst "underscoped" → Warn; worst "unknown" → Warn. - `makeRealProbeAgentPrincipal` builds the production probe closure (mirrors `makeRealProbeAgentStatus` byte-for-byte apart from the field consumed — `InstanceIdentity.PrincipalID` vs `Status`). ## Renderer additions `StatusInfo` joins the existing Pass/Warn/Fail/Skip status set so the "all agents are fine" case can surface as an informational role listing without flagging the run yellow. - `types.go`: `StatusInfo Status = "info"` + `Summary.Info int` (JSON tag added). - `runner.go`: canonical validation switch + summarize switch extended for Info; `ExitCode` treats Info as a "useful diagnostic completed" status (matches Pass for exit-code purposes). - `doctor_format.go`: glyph "ⓘ" and label "INFO" added to `statusGlyphAndLabel`; `writeSummaryLine` appends ", N info" when Info > 0 (preserves existing test assertions otherwise). ## Dependencies wiring `internal/cmd/doctor/checks_local.go` adds two test seams on `Dependencies`: - `probeAgentPrincipal` — replaces the production `GetAgentVersion` call with a unit-test fake. Same signature shape as `probeAgentStatus`. - `queryAgentIdentityRoles` — replaces the production `project.QueryAgentIdentityRoles` call. Signature mirrors the public API so wiring is a single substitution. `internal/cmd/doctor/checks_remote.go` appends `newCheckAgentIdentityRoles(deps)` after the existing `remote.agent-status` entry in `NewRemoteChecks`. The append-after ordering is load-bearing — every skip-cascade guard in C12 reads `remote.agent-status`'s Result from `prior []Result`, and the local-then-remote ordering invariant remains intact (verified by the existing `TestNewLocalAndRemoteChecks_ProductionCompositionLocalsFirst`). ## Tests `internal/cmd/doctor/checks_agent_identity_roles_test.go` (new, 16 KB): - Skip-cascade gates: nil AzdClient, `remote.agent-status` not Passed, project endpoint missing, no active agents, project-resource-ID unset, project-resource-ID malformed. - Aggregate classification: Info when all fine; Fail when any agent has zero roles; Warn when worst is underscoped; Warn on transient query error. - Per-agent classifier table (six cases: project+account, project+RG, project-only, account-only, all-empty, all-errored). - Detail formatting: scope counts and `?` for probe-error scopes. - Missing-principal degradation: agent with no `instance_identity` surfaces as a warning rather than a fail. - `readActiveAgents` filtering invariants (active-only, missing-name dropped, nil-return on missing upstream). `internal/cmd/doctor/checks_remote_test.go` updated: the `NewRemoteChecks` contract test now pins five entries (auth → foundry-endpoint → rbac → agent-status → agent-identity-roles) with their ID / Name / Remote / Fn invariants. ## Preflight - gofmt -s -w . clean - go vet ./... clean - go build ./... clean - go test ./... -count=1 all green - internal/cmd 14.6s - internal/cmd/doctor 1.6s - internal/cmd/nextstep 3.8s - internal/pkg/agents/agent_api 10.9s - internal/pkg/agents/agent_yaml 1.0s - internal/pkg/azure 12.7s - internal/project 5.5s - golangci-lint run ./internal/cmd/... ./internal/project/... 0 issues - cspell on new files (after "underscoped" added to cspell.yaml) 0 issues - copyright-check.sh on extension clean ## Design notes - The spec at `.tmp/pr-8057/azd-ai-agent-doctor-remote-checks.md` lines 193–223 specifies a per-agent fan-out at three scopes with the "fine" pass condition (project + (account|RG)). Renders as INFO rather than PASS because the design's intent is a diagnostic listing — operators inspect it on `--output json` and confirm no MI is starved; the check should not flip the doctor green on its own. - C12 uses the `wg.Go` Go 1.26 idiom for per-principal fan-out; per-scope probes within one principal run sequentially (3 scopes × 1 ARM listing each is well under budget and avoids the goroutine-per-scope-explosion). - `probeAgentPrincipal` deliberately does NOT extend C17's `probeAgentStatus` surface — extending it would couple two independent checks. The mirror cost is one ~40-line factory function shared by both. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

…P5.1 C2) Adds a best-effort manifest walker that surfaces model / toolbox / connection resources from each service's `agent.manifest.yaml` into `nextstep.State`. Unblocks the doctor checks C13 (model deployments), C14 (toolboxes), and C15 (connections), all of which need to know whether the relevant resource kinds are declared before they can decide to run or skip. ## State additions - `State.HasModels`, `State.HasToolboxes`, `State.HasConnections` — aggregate boolean flags. True when at least one matching resource is found across all `azure.ai.agent` services. - `State.ModelRefs`, `State.Toolboxes`, `State.Connections` — sorted `[]ResourceRef` per kind. - `ResourceRef{Name, ServiceName, Detail}` — slim doctor-facing shape. Detail carries the kind-specific identifier (model id, connection `<Category> | <Target>`, empty for toolboxes). ## Walker semantics - File names probed (in order): `agent.manifest.yaml`, `agent.manifest.yml`. `agent.yaml` is deliberately excluded — it describes the container, not declared resources. - Uses `agent_yaml.ExtractResourceDefinitions` directly (NOT `LoadAndValidateAgentManifest`) so a manifest with an absent / partial `template` block — common during init — still surfaces its `resources:` declarations. - Best-effort: missing file, unreadable bytes, malformed YAML, zero resources, and unknown resource kinds all silently degrade (Has* flags stay false; lists stay nil). Walker never adds to the `errs` slice so a manifest-in-flight (which init re-writes mid-flow) never blocks the rest of state assembly. - Dedup key is `(ServiceName, Name)`. Same name twice in one service collapses to one entry (first-occurrence wins, matching agent_yaml's manifest semantics). Same name under two services surfaces twice so per-service doctor failures remain individually addressable. - Result slices are sorted by `(Name, ServiceName)` so doctor output snapshots and downstream renderers are deterministic. ## Why this is its own commit The walker is a pure data-collection step with no resolver-side consumers in this commit. Doctor checks C13/C14/C15 (following commits) gate-skip themselves on `state.Has{Models,Toolboxes, Connections}` and iterate the matching ref slice. Landing the walker first keeps each downstream commit focused on its single check. ## Tests 8 new tests in `manifest_test.go`: - All three kinds present → flags + lists populated, sort order + detail formatting locked. - Missing manifest → silent, no errors logged through the walker. - Malformed YAML → silent, no errors. - Manifest with no `resources:` key → silent, flags false. - Multi-service aggregation → entries sorted by Name, ties broken by ServiceName. - Duplicate `(service, name)` within one manifest → first occurrence wins. - `.yaml` wins over `.yml` when both exist. - `agent.yaml` (not a manifest) is ignored even if its content happens to parse as one. - `connectionDetail` table-driven test covers all four category/target combinations. ## Preflight - `gofmt -s -w .` — clean - `go vet ./...` — clean - `go test ./... -count=1` — full extension suite green - `golangci-lint run ./internal/cmd/...` — 0 issues - `cspell` over the touched files — 0 issues Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Adds the seventh doctor remote check (`remote.model-deployments`) which verifies that every model resource declared in any service's `agent.manifest.yaml` (collected by the C2 manifest walker into `State.ModelRefs`) has a corresponding Cognitive Services deployment on the Foundry project's underlying account. # What it does For each project run: 1. Skip-cascade gates (in order): `AzdClient` nil → `local.environment-selected` → `local.azure-yaml` / `local.agent-service-detected` → `remote.auth` → `remote.foundry-endpoint` → `!state.HasModels` → `AZURE_AI_PROJECT_ID` unreadable / cannot be parsed. Every gate produces a single, actionable Skip message that points the user at the upstream check. 2. Parse `AZURE_AI_PROJECT_ID` (Foundry project ARM resource ID) for `(subscription, resourceGroup, accountName)` via the new `parseAccountFromProjectID` helper. Deployments live at the Cognitive Services *account* level, NOT the project level, so the account name is the load-bearing parameter here. 3. Issue exactly one `armcognitiveservices/v2.DeploymentsClient .NewListPager` round trip via the new `realProbeModelDeployments` helper, capped at 10s (matches the design's per-probe budget in `.tmp/pr-8057/azd-ai-agent-doctor-remote-checks.md`). Returns `[]string` of deployment names. Transport errors short-circuit the check to Skip with the error verbatim plus an actionable retry suggestion — we can not distinguish "deployment missing" from "ARM unreachable" without a successful round trip. 4. `classifyModelDeployments` joins `state.ModelRefs` to the deployment set on name. All match → Pass with the matched count. One or more missing → Fail with the missing names listed in the Message and structured under `Details["missingModels"]` (each entry carries both `name` and `service` so the user can locate the offending manifest entry). Suggestion: `azd provision` to create the missing deployments, or update `agent.manifest.yaml` `resources[].name` to match deployments that already exist. # Aggregation The walker may surface `ModelRefs` from multiple services. Every service in an azd project belongs to the same Foundry project (and therefore the same Cognitive Services account), so the check issues exactly one deployments list per run regardless of how many services / model refs exist. The same model referenced by two services collapses to a single match check; a missing model referenced by two services surfaces as two `missingModels` entries (one per service) so the user can pinpoint each affected manifest. # Test seam `Dependencies.probeModelDeployments` (lowercase, package-internal) matches the established pattern from `probeAuth`, `probeFoundryEndpoint`, `probeDeveloperRBAC`, `probeAgentStatus`, `probeAgentPrincipal`. Production wiring leaves it nil; tests inject a closure that returns canned `(names, err)` tuples and (optionally) captures the `(subscription, resourceGroup, accountName)` it was called with. `Dependencies.assembleState` and `Dependencies.readProjectResourceIDFn` are reused from earlier checks; no new top-level seam is added besides the probe. # Files - `internal/cmd/doctor/checks_model_deployments.go` — new check factory `newCheckModelDeployments`, `parseAccountFromProjectID`, `classifyModelDeployments`, `realProbeModelDeployments`, `listDeploymentNames`. 363 lines. - `internal/cmd/doctor/checks_model_deployments_test.go` — 11 tests: skip-cascade (1 + table of 5 upstream-blocked permutations), no manifest models, unset project ID, unparsable project ID, probe transport error, all-match Pass, partial-match Fail, all-missing Fail, parser table (canonical / mixed-case / missing segments / garbage), factory shape pin. - `internal/cmd/doctor/checks_local.go` — adds the `probeModelDeployments` field to `Dependencies` next to its same-shape siblings. - `internal/cmd/doctor/checks_remote.go` — appends `newCheckModelDeployments` after `newCheckAgentIdentityRoles` in `NewRemoteChecks`. - `internal/cmd/doctor/checks_remote_test.go` — extends the composition pin test to assert 6 checks (was 5) including the new `remote.model-deployments` slot. # Preflight - `gofmt -s -w .` clean. - `go vet ./...` clean. - `go build ./...` clean. - Full extension test suite: green (`cmd`, `cmd/doctor`, `cmd/nextstep`, `exterrors`, `agents/agent_api`, `agents/agent_yaml`, `pkg/azure`, `project` — all pass). - `golangci-lint run ./internal/cmd/doctor/...` 0 issues. - `cspell` 0 issues on production file. - No `go.mod` or `go.sum` changes (uses already-imported `armcognitiveservices/v2`). Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Adds the eighth local doctor check, `local.toolboxes`, which verifies that every ToolboxResource declared in any service's agent.manifest.yaml has its canonical MCP endpoint env var (TOOLBOX_<NORMALIZED_NAME>_MCP_ENDPOINT) set in the active azd environment. Why local (Remote: false). The check only reads the active azd environment via the existing gRPC env service — no ARM / Foundry round trips. Tagging it local means `--local-only` still runs it (which is exactly what we want: a missing TOOLBOX_*_MCP_ENDPOINT is diagnosable without network access). Skip cascade. Skips on AzdClient==nil, local.environment-selected, local.azure-yaml, local.agent-service-detected, and when state.HasToolboxes is false. Deliberately NOT gated on remote.auth / remote.foundry-endpoint — a down Foundry must not poison a local env diagnostic. Classification. - All endpoints set → Pass with matchedCount. - One or more missing → Fail with the missing toolbox names + env var keys in the Message, Suggestion points at `azd provision` (the canonical fix) or `azd env set` as the manual override, Details["missingToolboxes"] carries a structured list (name, service, envVar) for JSON consumers. - Env lookup transport error → Fail (NOT Skip). Divergent from C13's model-deployments which Skips on probe error, because env lookup is local; a transport failure here means the user's azd config / extension is broken and a Skip would silently swallow that signal. Suggestion points at `azd env list` / `azd env get-values`. - Empty / whitespace-only value → treated as missing (matches detectMissingVars semantics in nextstep/state.go). Convention. TOOLBOX_<NORMALIZED_NAME>_MCP_ENDPOINT, with name upper-cased and `-` / `.` / ` ` mapped to `_`. Matches the hosted-toolbox Bicep sample output names. The prefix and suffix are pinned in code (not derived from the env) so the Fail message can name the exact env var the user must grep their Bicep template for. Dedup. classifyToolboxEndpoints dedupes on the canonical env key because the C2 manifest walker dedupes on (ServiceName, Name) — the same toolbox referenced by two services would otherwise produce two env lookups and two missing-list entries. Exposed `dedupToolboxKeys` for callers (renderer / future telemetry) that want the expected-key list up front; the classifier does its own inline dedup so it does not depend on this helper. Test seam. `Dependencies.lookupToolboxEnv toolboxEnvLookupFn` matches the established seam pattern (probeAuth, probeFoundryEndpoint, probeModelDeployments). Production wiring leaves it nil; the check binds `makeRealToolboxEnvLookup(deps .AzdClient)` on first call, which calls `client.Environment().GetValue` — the canonical one-key env reader used by service_target_agent.go and checks_rbac.go. Tests (15). Skip cascade (azdClient nil + 3 priors), not-gated-on- remote-priors invariant, state emptiness (no toolboxes / nil state), 3 classifier paths (all-set / partial / all-missing), whitespace-as-missing, transport-error-is-Fail, cross-service dedup, normalizeToolboxName table (8 cases), toolboxEndpointKey roundtrip, dedupToolboxKeys table, factory-shape pin. Wired into NewLocalChecks (now 8 entries); local-checks pin test updated. NewRemoteChecks unchanged (still 6 entries). Preflight clean: gofmt, vet, build, full extension test suite green (cmd 16.7s, doctor 2.9s, nextstep 6.7s, etc.), golangci-lint 0 issues, cspell 0 issues on production files. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Issue: Doctor previously had no way to verify that Foundry connections referenced by agent manifests (e.g. `bing-grounding`, key-vault-backed auth connections) actually exist on the project. Failure surfaced later at invoke time as 401/403 from the upstream tool, with no clear path back to the missing connection. Approach: New `remote.connections` doctor check that enumerates manifest ConnectionResource entries (already discovered by the C2 manifest walker into `state.Connections`), calls `FoundryProjectsClient.GetAllConnections(ctx)` on the active project, and reports any manifest-declared connection that isn't present on the project as a Fail. Missing-entry rendering format: `<name> [<detail>] (service <svc>)` when Detail is non-empty, falling back to `<name> (service <svc>)` to avoid a bare `[]`. Detail typically renders as `<Category> | <Target>` for connections (types.go:183-189; manifest.go:152-162). Classification: REMOTE check (Remote: true). Calls the Foundry API. Skip cascade mirrors C13 (`remote.model-deployments`): AzdClient → environment-selected → azure.yaml / agent-service-detected → remote.auth → remote.foundry-endpoint → !state.HasConnections → unparsable project ID Probe error → Skip (matching C13's pattern — distinguishing transport failure from "missing connection" requires a successful round-trip). 10s probe timeout. Wiring: `newCheckConnections(deps)` added as the 7th and final entry in `NewRemoteChecks` (after C12 agent-identity-roles, C13 model- deployments). Pin test `TestNewRemoteChecks_HasAuthFoundryEndpointRBAC AgentStatusIdentityRolesModelDeployments` renamed to `...ModelDeploymentsConnections`, Len bumped 6 → 7, 4 new index-6 assertions for ID / Title / Description / Remote. Test seam: New `probeFoundryConnections` field appended to `Dependencies` matching the existing seam pattern from `probeModelDeployments` (C13). Production wiring uses `realProbeFoundryConnections` which constructs a credential via `azidentity.NewAzureDeveloperCLICredential` (matching the rest of the extension per `agent_context.go:101-109`). New helper: `parseAccountProjectFromProjectID(projectID) (account, project, err)` — sibling of C13's `parseAccountFromProjectID`. Returns two segments instead of the four C13 needs; kept separate to avoid churning C13's signature for a single new caller. Both case-insensitive on segment markers. Follow-up: consolidate into a single parser when a third caller appears. Tests: `checks_connections_test.go` — 13 tests mirroring C13 patterns: - Skip cascade table (5 rows: AzdClient, environment, azure.yaml / agent-service-detected, auth, foundry-endpoint). - State emptiness (HasConnections false → Skip). - Project ID unset → Skip. - Project ID unparsable → Skip. - Probe error → Skip. - All-match → Pass. - Partial mismatch → Fail with missing names + service tags. - All-missing → Fail. - Empty-Detail rendering omits `[]`. - Parser table (5 cases: canonical, mixed case, missing project, missing account, garbage). - Factory shape pin (Remote: true, ID, Title, Description). Preflight: - gofmt -s -w . (clean) - go vet ./... (clean) - go build ./... (clean) - go test ./... -count=1 (all packages pass; doctor 5.474s) - golangci-lint run ./internal/cmd/doctor/... (0 issues) - cspell lint "internal/cmd/doctor/*.go" (14 files, 0 issues) - copyright header verified on both new files Files: - internal/cmd/doctor/checks_connections.go (NEW, +332) - internal/cmd/doctor/checks_connections_test.go (NEW, +326) - internal/cmd/doctor/checks_local.go (probeFoundryConnections seam +5) - internal/cmd/doctor/checks_remote.go (wire +newCheckConnections +1) - internal/cmd/doctor/checks_remote_test.go (pin test 6 → 7, +14) Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Two reviewer-consensus findings from the batched code review of commits 385b147 (C13 remote.model-deployments), 87e1dcc (C14 local.toolboxes), and 1b3143f (C15 remote.connections): Fix 1 (MEDIUM, Opus + GPT-5.5): toolbox env-key normalizer divergence. C14's `normalizeToolboxName` only mapped `-`, `.`, and ` ` to `_` rune-by-rune, while the production helpers `init.go:toolboxMCPEndpointEnvKey` (manifest injection) and `listen.go:toolboxMCPEndpointEnvKey` (runtime env write) both use the regex `[^A-Z0-9]+` -> `_` (run-collapsing, all non-alphanumerics). The two algorithms agreed only on the subset of inputs the test table exercised (`web-search-tools`, `my.toolbox.v2`, `my toolbox`, ...) and diverged on inputs like `my--tool`, `my+tool`, `my:tool`, `my(tool)`, `my\ttool`. A user with such a toolbox name would see the doctor flag a missing endpoint under a key (`TOOLBOX_MY__TOOL_MCP_ENDPOINT`, `TOOLBOX_MY+TOOL_MCP_ENDPOINT`, ...) that nothing in the system ever writes. Resolution: hoist the canonical helper into a shared `internal/pkg/envkey` package so the producing and diagnostic sides cannot drift again. - new internal/pkg/envkey/envkey.go -- `ToolboxMCPEndpoint` - new internal/pkg/envkey/envkey_test.go -- 13 cases incl. double-hyphen run, `+`, `:`, `/`, tab, parens, empty - internal/cmd/listen.go -- drop local helper, drop `regexp` import, route through envkey - internal/cmd/init.go -- route through envkey - internal/cmd/init_test.go -- delete duplicated table (covered by envkey package test) - internal/cmd/doctor/checks_toolboxes.go -- drop local normalizeToolboxName / toolboxEndpointKey /toolbox{Prefix, Suffix}, route 2 callsites through envkey - internal/cmd/doctor/checks_toolboxes_test.go -- replace the normalize-table test with a thin pin test verifying the doctor's renderer helper routes through envkey - cspell.yaml -- allowlist `envkey` Fix 2 (MEDIUM, Sonnet): assembler errors silently swallowed. C13/C14/C15 all used `state, _ := assembler(...)` and reported a `Skip` with "no X declared in any service's agent.manifest.yaml" whenever `state == nil || !state.HasX`. The existing pattern at `checks_manual_env.go:95-109` instead captures `errs` and Fails with the actual cause when `state == nil` (defensive against a future contract change where AssembleState may return a nil state with a populated errs slice). Resolution: mirror the established pattern in all three new checks. The Skip for `!state.HasX` is preserved; only the `state == nil` branch becomes a Fail surfacing `errs[0].Error()`. - checks_model_deployments.go -- Fail-on-nil with cause - checks_toolboxes.go -- Fail-on-nil with cause - checks_connections.go -- Fail-on-nil with cause - checks_model_deployments_test.go -- new test: nil state surfaces errs[0] - checks_toolboxes_test.go -- update existing `SkipsWhenAssemblerReturnsNil` to `FailsWhenAssembler ReturnsNilState` plus new test asserting errs[0] surfaces in the Fail message - checks_connections_test.go -- new test: nil state surfaces errs[0] Not addressed (deferred): LOW (GPT-5.5): `parseAccountProjectFromProjectID` (C15) accepts partial paths; `parseAccountFromProjectID` (C13) does not. Opus reviewed and called the dual-parser duplication "defensible for two callers"; commit 1b3143f's message already notes the follow-up to consolidate when a third caller appears. Preflight: - gofmt -s -w . (clean) - go build ./... (clean) - go vet ./... (clean) - go test ./... -count=1 (all packages pass; envkey 1.837s, doctor 6.276s, cmd 14.401s) - golangci-lint run ./... (0 issues) - cspell lint <new+touched> (17 files, 0 issues) Files (10): - internal/pkg/envkey/envkey.go (NEW) - internal/pkg/envkey/envkey_test.go (NEW) - internal/cmd/listen.go (MOD) - internal/cmd/init.go (MOD) - internal/cmd/init_test.go (MOD) - internal/cmd/doctor/checks_toolboxes.go (MOD) - internal/cmd/doctor/checks_toolboxes_test.go (MOD) - internal/cmd/doctor/checks_model_deployments.go (MOD) - internal/cmd/doctor/checks_model_deployments_test.go (MOD) - internal/cmd/doctor/checks_connections.go (MOD) - internal/cmd/doctor/checks_connections_test.go (MOD) - cspell.yaml (MOD) Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

When Doctor/post-deploy guidance has no cached OpenAPI-derived sample payload but a service README is present, don't suggest a concrete protocol-generic payload that may fail for that sample's schema. Emit the README pointer first, then an invoke command with an explicit '<payload>' placeholder. Cached OpenAPI payloads still produce runnable invoke commands, and services without a README still get the protocol-generic fallback payload with a generic label. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Active/idle show output should stay an inspection view of the hosted agent resource. Remove the active-state invoke suggestion from ResolveAfterShow and avoid attaching next_step in show JSON when the agent is already healthy. Non-active states keep actionable guidance: - creating -> monitor --type system --follow - failed/empty -> monitor --follow - deleting/deleted -> azd deploy - unknown -> azd ai agent show <service> This also avoids state/OpenAPI assembly work for active show output because no active-state guidance is rendered. Validation: - go test ./internal/cmd ./internal/cmd/nextstep -run 'TestResolveAfterShow|TestResolveNextStepFromStatus|TestShowResultJSON|TestPrintAgentVersionJSON|TestPrintAgentVersionTable|TestResolveAfterInvoke_Success|TestResolveAfterInit_UnresolvedPlaceholders|TestResolveAfterRun' -count=1 - go test ./internal/cmd/... -count=1 - go vet ./internal/cmd/... - golangci-lint run ./internal/cmd/... - cspell lint touched show/nextstep files Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Stream text-mode doctor output by observing each finalized check result, while keeping JSON output buffered and unchanged. Split the text formatter into header/check/footer pieces so the streaming path preserves the existing report shape. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Add project-root validation for service-relative file probes and reads, including symlink-aware containment checks and root-service handling.\n\nCo-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Remove the unused nextstep auth probe option and state fields that were added by the PR but never consumed by the resolver or doctor wiring. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Route PR-added nextstep stdout emission through one cmd helper so TTY gating stays consistent across init, invoke, run, show, and doctor text output. This intentionally applies the nextstep call-site TTY contract to the PR-added init next-step blocks as well, keeping redirected output free of human-only guidance. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

@jongio

The remote.auth doctor check surfaced the raw user principal name in both the Message string and the structured Details map regardless of the --unredacted flag, in contrast with the rest of the doctor checks (checks_rbac.go, checks_agent_identity_roles.go) which already gate identity values on opts.Unredacted. This change threads Options into the auth check function and adds two small helpers that reuse the existing redactedPlaceholder constant: - redactUPN(upn, unredacted) returns the value to surface in Message text: raw when --unredacted, the shared <redacted> placeholder when a UPN was discovered but should be scrubbed, and empty when none was found so composeAuthMessage cleanly drops the prefix. - authDetails(upn, minutes, unredacted) builds the Details map and omits the "upn" key entirely unless --unredacted is set, so machine consumers do not see the raw value by default. PASS, WARN, and expired-FAIL branches now compose their messages from the redacted display value. Existing tests that asserted the raw UPN were updated to pass Options{Unredacted: true}; new table tests cover the default-redacted and --unredacted contracts on every branch, the empty-UPN drop, and both helpers in isolation. Resolves PR Azure#8198 review comment from @jongio. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

TestErrorCodeWireValues pins the lowercase JSON wire values of every exported enum the nextstep package consumes from the Agents API, but the AgentVersionStatus map was missing the "idle" entry. The idle status is actively read at show.go:207 and nextstep/resolver.go:248, so silent drift on that one literal would regress the show command's idle branch and the resolver's deployment-pending hint without any test failure. Add the missing "idle": string(AgentVersionIdle) case. Resolves PR Azure#8198 Copilot review comments (ids 3246075889, 3246075800). Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

…e output Responds to therealjohn's UX review on PR Azure#8198 by rewriting the doctor text renderer around five contracts: 1. Default = concise. PASS shows just the check name; FAIL shows a one-line Message + one-line `fix:` Suggestion; SKIP inlines the skip reason after `-- skipped`. Use `--debug` to surface the verbose path (full Message + Suggestion + Links). 2. Section grouping. Checks render under Local, Authentication, and Remote headers, derived from the check ID prefix. `remote.auth` gets its own Authentication section per the literal mock. 3. New glyph format: `(✓)`, `(x)`, `(-)`, `(!)`, `(ⓘ)`. ASCII `x` for FAIL matches the mock literally. 4. "To fix" footer on failure. When at least one failure maps to a canonical remediation (`remediationForCheckID`), the footer is a numbered, deduplicated command list in execution order (auth → init → provision → deploy). When all failures are unmapped (or any are unmapped alongside mapped ones), the footer defers to the per-check `fix:` notes rendered in the body. The re-run instruction always closes the block so the user is never left without an actionable next step. 5. First-letter capitalization at render time, with a brand-name blocklist so `azd`, `azure.yaml`, `agent.yaml`, `agent.manifest. yaml`, and `skipped:` leads stay lowercase. Source check strings are untouched. Summary line simplified from `Summary: 1 passed, 1 failed, 1 skipped, 0 warned` to `1 passed, 1 failed, 1 skipped`. Warn/info segments are appended only when non-zero. The streaming render path (`runAndRenderDoctorText` → `renderer.write Check` per result) and the buffered path (`printDoctorReportText`, used by tests) share a single `doctorRenderState` so they produce byte-identical output. The parity test exercises both concise and verbose modes with a fixture covering Message detail, multi-line Suggestion (so `writeIndentedBlock` runs), and Links. The trailing `Next:` block (via `nextstep.PrintAllNext`) is suppressed when the To-fix footer fires, because the failure block is the actionable next step. The `--debug` flag is the existing persistent root flag provided by the azdext SDK; we read it via `isDebug(cmd.Flags())` and thread it through `runAndRenderDoctorText` and `newDoctorRenderer`. No new flag is registered. The JSON output path (`--output json`) does not traverse this renderer and is unchanged. Test additions pin every new contract: concise defaults (including zero-suppression of warn/info), verbose `--debug`, section transitions, streaming/buffered parity in both modes, trailing `Next:`, To-fix footer with mapped failures, To-fix footer with all-unmapped failures (deferred to per-check `fix:` notes), To-fix footer with mixed mapped and unmapped failures (numbered list plus per-check pointer), summary line with non-zero warn/info, empty report, status glyphs, category routing, capitalize edges, and the `firstLine` helper. Three-model code review consensus reached (Opus 4.7 xhigh, Sonnet 4.6, GPT-5.5). All flagged issues addressed. Out of scope (deferred to a follow-up after user input): the trangevi + therealjohn proposal to move doctor into a separate `azure.ai.doctor` extension. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

CI surfaced two follow-ups after the rebase that were originally landed on the branch in the comment-trim cluster (skipped during rebase as cosmetic): 1. go fix ./... modernizations - replaced loopvar t := tt captures (Go 1.22+ scoping makes them unnecessary) - strings.SplitSeq over comma-separated tag lists in state.go and pending_provision.go - max(limit-1, 0) builtin in nextstep/format.go - strings.Cut in doctor/checks_auth.go::firstLine 2. cspell entries for words introduced by the doctor UX redesign and next-step package: inlines, Remediations, remediations, uppercases, parseable, azd's No behavior change. Build, tests, and lint pass.

Signed-off-by: trangevi <trangevi@microsoft.com>

trangevi · 2026-05-22T18:04:49Z

/check-enforcer override

Copilot AI review requested due to automatic review settings May 15, 2026 04:00

antriksh30 requested review from JeffreyCA, hemarina, jongio, rajeshkamal5050, tg-msft, therealjohn, trangevi, trrwilson, vhvb1989, wbreza and weikanglim as code owners May 15, 2026 04:00

Copilot AI reviewed May 15, 2026

antriksh30 changed the title ~~Azure AI Agents: add next-step guidance and doctor diagnostics~~ [WIP] Azure AI Agents: add next-step guidance and doctor diagnostics May 15, 2026

microsoft-github-policy-service Bot assigned antriksh30 May 15, 2026

antriksh30 marked this pull request as draft May 15, 2026 04:08

antriksh30 force-pushed the antrikshjain/next-step-implementation branch from 401ca65 to 829063a Compare May 15, 2026 04:17

antriksh30 requested a review from Copilot May 15, 2026 04:55

Copilot AI reviewed May 15, 2026

antriksh30 requested a review from Copilot May 15, 2026 05:19

Copilot started reviewing on behalf of antriksh30 May 15, 2026 05:20 View session

Copilot AI reviewed May 15, 2026

View reviewed changes

jongio reviewed May 17, 2026

View reviewed changes

antriksh30 marked this pull request as ready for review May 18, 2026 04:55

trangevi linked an issue May 19, 2026 that may be closed by this pull request

Guide developers to the right next step after every azd ai agent command and add azd ai agent doctor #7975

Closed

antriksh30 changed the title ~~[WIP] Azure AI Agents: add next-step guidance and doctor diagnostics~~ Azure AI Agents: add next-step guidance and doctor diagnostics May 20, 2026

antriksh30 force-pushed the antrikshjain/next-step-implementation branch from 6312729 to da2efb5 Compare May 20, 2026 06:22

Antriksh Jain and others added 16 commits May 22, 2026 14:38

azd ai agents: validate service-relative paths

de9dc17

Add project-root validation for service-relative file probes and reads, including symlink-aware containment checks and root-service handling.\n\nCo-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

azd ai agents: remove unused nextstep auth state

571740e

Remove the unused nextstep auth probe option and state fields that were added by the PR but never consumed by the resolver or doctor wiring. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

antriksh30 force-pushed the antrikshjain/next-step-implementation branch from 804913a to f29b4e3 Compare May 22, 2026 09:20

antriksh30 requested a review from RickWinter as a code owner May 22, 2026 09:20

therealjohn approved these changes May 22, 2026

View reviewed changes

trangevi approved these changes May 22, 2026

View reviewed changes

Move cspell changes to agent specific file

ce3ca24

Signed-off-by: trangevi <trangevi@microsoft.com>

trangevi approved these changes May 22, 2026

View reviewed changes

therealjohn approved these changes May 22, 2026

View reviewed changes

therealjohn mentioned this pull request May 22, 2026

fix: correct end-of-init message logic for existing project + existing deployment #8318

Closed

trangevi merged commit d478803 into Azure:main May 22, 2026
27 checks passed

Copilot AI mentioned this pull request May 22, 2026

Prepare azure.ai.agents 0.1.34-preview patch release metadata #8344

Merged

This was referenced Jun 19, 2026

Provide better guidance on next steps after init #7896

Closed

gracefully recommend provision before first use of azd ai agent run #7449

Closed

Uh oh!

Conversation

antriksh30 commented May 15, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

jongio left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

trangevi commented May 22, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

7 participants

antriksh30 commented May 15, 2026 •

edited

Loading