Skip to content

feat(agents): resolve $ref file includes with overlay overrides in Foundry config#8627

Merged
therealjohn merged 2 commits into
huimiu/foundry-azure-yamlfrom
huimiu/foundry-config-task
Jun 12, 2026
Merged

feat(agents): resolve $ref file includes with overlay overrides in Foundry config#8627
therealjohn merged 2 commits into
huimiu/foundry-azure-yamlfrom
huimiu/foundry-config-task

Conversation

@huimiu

@huimiu huimiu commented Jun 12, 2026

Copy link
Copy Markdown
Member

What

Adds an extension-owned resolver for the azure.ai.agents extension that resolves $ref file includes within a parsed Foundry service configuration, implementing spec §2.4 of #8590 (docs/specs/unify-azure-yaml/spec.md).

ResolveFileRefs(cfg map[string]any, projectRoot string):

  • Includes — any object containing a $ref string is replaced by the referenced YAML or JSON file (JSON parses as a subset of YAML, so one decoder handles both).
  • Overlay overrides — keys alongside $ref are overlaid on the loaded file as a shallow, top-level merge: sibling scalars, arrays, and objects each replace the loaded value wholesale.
  • Nested includes$ref inside a loaded file resolves relative to that file's own directory.
  • Path rebasing — each loaded file's relative project, instructions, and $ref paths are rebased so the returned config has every path anchored to projectRoot in clean, forward-slash form. Inline values authored directly in azure.yaml are left exactly as written. instructions is rebased only when it looks like a path (single line ending in .md/.txt), so inline prose is preserved.
  • Whitespace trimming$ref values are trimmed of surrounding whitespace before path resolution, so " ./agents/a.yaml " resolves correctly.
  • Safety — cyclic and excessively deep include chains return actionable errors (cycle detection uses slices.Contains/slices.Clone for correctness); URL $ref targets are rejected; unreadable or malformed files produce structured exterrors validation errors.

This is a self-contained helper plus thorough tests, mirroring the merged ${...} templating helper (#8602). There is no product call-site yet — the consumer is the not-yet-built microsoft.foundry service target (§2.2/§2.6).

Testing

  • go build ./..., go vet ./..., gofmt, golangci-lint run (0 issues), go test ./... (full extension suite), and cspell — all clean.
  • 16 test cases cover: nil config, no-op, relative/absolute/JSON includes, $ref whitespace trimming, sibling overlay, nested includes, top-level sequence files, project/instructions rebasing, inline-vs-path instructions, and the missing-file / malformed / cyclic / URL / non-string / empty-$ref error paths.

Fixes #8626

Copilot AI review requested due to automatic review settings June 12, 2026 05:38
@github-actions

github-actions Bot commented Jun 12, 2026

Copy link
Copy Markdown

📋 Prioritization Note

Thanks for the contribution! The linked issue isn't in the current milestone yet.
Review may take a bit longer — reach out to @rajeshkamal5050 or @kristenwomack if you'd like to discuss prioritization.

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds a new helper in the azure.ai.agents extension to resolve Foundry $ref file includes inside already-parsed service configuration data, including shallow sibling overlays, nested include resolution, path rebasing, and structured validation errors.

Changes:

  • Added ResolveFileRefs implementation to recursively resolve $ref YAML/JSON includes, apply shallow overlays, and rebase certain paths to the project root.
  • Added table-driven unit tests covering includes, overlays, rebasing behavior, and common error cases (missing/malformed/cyclic/URL/non-string/empty refs).
  • Introduced a new extension error code for invalid $ref usage (invalid_file_ref).

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.

File Description
cli/azd/extensions/azure.ai.agents/internal/project/includes.go New $ref resolver implementation (load, overlay, recurse, rebase, cycle/depth safety).
cli/azd/extensions/azure.ai.agents/internal/project/includes_test.go Table-driven tests validating include resolution, rebasing rules, and error handling.
cli/azd/extensions/azure.ai.agents/internal/exterrors/codes.go Adds CodeInvalidFileRef error code for structured validation errors.

Comment thread cli/azd/extensions/azure.ai.agents/internal/project/includes.go Outdated
Comment thread cli/azd/extensions/azure.ai.agents/internal/project/includes.go Outdated
@github-actions github-actions Bot added the ext-agents azure.ai.agents extension label Jun 12, 2026
@therealjohn therealjohn merged commit 6cea824 into huimiu/foundry-azure-yaml Jun 12, 2026
25 checks passed
trangevi added a commit that referenced this pull request Jun 29, 2026
…and IaC-less init (#8818)

* feat(agents): add microsoft.foundry azure.yaml schema (#8603)

* feat(agents): add microsoft.foundry azure.yaml schema

Schema-only scaffolding for unifying Microsoft Foundry agent config in
azure.yaml (design spec PR #8590, section 2.3).

- Add the host: microsoft.foundry conditional to schemas/v1.0 and
  schemas/alpha azure.yaml.json, composing the extension schema at the
  service level via allOf and turning off project/runtime/docker/image/config.
- Add microsoft.foundry to the host examples list.
- Publish the Foundry extension schemas under
  cli/azd/extensions/azure.ai.agents/schemas/: microsoft.foundry.json plus
  per-resource files (Agent, Skill, Routine, Connection, Toolbox, Deployment,
  FileRef), ported from the PM preview repo with $id rewritten to the
  azure-dev path and relative $refs preserved.
- microsoft.foundry.json uses additionalProperties: true at the project level
  (deliberate deviation from the preview's false) so future Foundry resource
  types do not break the schema, per the brief and design spec section 2.3.

Authoring-only: no service-target wiring, provider registration, or
alpha-feature gating (those are later PRs).

* fix(agents): relax PromptAgent to accept skill without instructions

PromptAgent now requires name + kind plus at least one of instructions
or skill (anyOf), instead of always requiring instructions. A prompt
agent backed by a skill (which supplies the instructions) no longer
fails schema validation.

This fixes the complex sample's summarizer-agent validation failure.

* fix(agents): enforce Foundry schema dependencies

Require project when hosted agents define docker or runtime settings, and enforce routine trigger-specific required fields for schedule, webhook, and event triggers.

* fix(agents): align Foundry schema constraints

Use conditional schema constraints for hosted-agent project requirements and routine trigger-specific fields.

* feat: target foundry azure.yaml schema at integration branch with validatable samples

* fix: normalize foundry azure.yaml schema URLs to short raw.githubusercontent form

* feat(agents): resolve $ref file includes with overlay overrides in Foundry config (#8627)

* feat: resolve $ref file includes with overlay overrides in Foundry agents config

* fix: modernize $ref cycle check with slices and trim whitespace in include paths

* feat(agents): add --image flag to agent init command (#8689)

* feat(agents): add --image flag to agent init command

Add --image flag to `azd ai agent init` to allow users to specify
a pre-built container image directly during init. When provided:

- Writes the image URL to the `image` field in agent.yaml
- Skips ACR connection prompts (user manages their own registry)
- Validates the image URL contains registry/image format
- Returns error if combined with --deploy-mode code

This enables non-interactive init workflows for VNet scenarios where
users bring their own private ACR with pre-built images.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* feat(agents): short-circuit template/language selection when --image is set

When --image is provided without --manifest, synthesize a minimal hosted
container manifest and route through the manifest flow, skipping the
init-mode/template/language prompts and code scaffolding (a pre-built image
has no source to scaffold). Requires --agent-name; validates the image early
before any project/template init. The image is written to agent.yaml and ACR
is skipped via the existing --image handling.

* fix(agents): skip ACR connection prompt for --image in interactive mode

The interactive ACR connection discovery/prompt (configureAcrConnection via
configureFoundryProjectEnv) was gated on a.isCodeDeploy, which is false for a
pre-built --image (container deploy). Pass a.skipACR() to the four
selectFoundryProject call sites so an image-based hosted agent skips ACR setup
in interactive mode too, matching the AZD_AGENT_SKIP_ACR env behavior. The
hosted-agent region filter now also applies to --image, which is correct since
it runs as a hosted agent.

* fix(agents): skip startup-command prompt for pre-built image agents

A pre-built container image runs its own entrypoint, so no startup command
applies. Skip resolveStartupCommandForInit when the agent template references
a pre-built image (agentUsesPreBuiltImage), covering both --image (synthesized
manifest) and a -m manifest that already specifies an image. Without this, the
container-deploy path prompted "Enter the command to start your agent" because
the image src dir has no main.py to auto-detect.

* fix(agents): persist AZD_AGENT_SKIP_ACR in the no-model-resources path

configureModelChoice only wrote AZD_AGENT_SKIP_ACR in the deferred-headless
and main model-config paths. A completing no-model-resources flow (e.g. a
pre-built --image agent selected interactively) returned without setting it,
so provision would still create an ACR despite --image. Set it via skipACR()
before returning, matching the other two call sites.

* fix(agents): make --image deployments use the pre-built image end-to-end

Real init/provision/deploy validation showed that --image init correctly wrote
image: and AZD_AGENT_SKIP_ACR=true, but deploy still defaulted to the Dockerfile
build path in --no-prompt mode. Treat AZD_AGENT_SKIP_ACR=true as an explicit
BYO-image signal for hosted container agents so Package returns a remote
pre-built image artifact and Publish is skipped.

The same E2E also found the synthesized manifest used responses protocol
version v1, while invoke requires 1.0.0. Generate 1.0.0 for --image manifests.

* fix(agents): tighten --image reference validation

Replace the slash-only validation with a stricter fully-qualified image reference regex that requires an explicit registry host, repository path, and optional tag or sha256 digest. Add coverage for digest, localhost/port registry, and malformed refs such as URL schemes, namespace-only refs, missing repository, short digest, and uppercase repository names.

---------

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* feat(agents): bicep-less init for azd ai agent init (#8643)

* feat(agents): bicep-less init for azd ai agent init

`azd ai agent init` now produces a project without an on-disk `infra/`
directory and stamps `infra.provider: microsoft.foundry` on azure.yaml
so provisioning is routed to this extension's provider.

Init UX changes:
- After the starter scaffold, the `infra/` tree is removed.
- `azure.yaml` gets `infra.provider: microsoft.foundry` written and
  `infra.path: ./infra` removed.
- ACR and Application Insights connection prompts are skipped during
  init; the extension's provisioning provider owns those resources.
- The "no infra/ directory" warning for existing projects is
  suppressed when the provider is already declared.

Core changes:
- Relax the `infra.provider` JSON Schema (v1.0 and alpha) from a fixed
  enum to `pattern: "^[a-z0-9.]+$"` + `examples`, keeping typo
  catching while allowing extension-registered providers.

New code:
- internal/project/provisioning_provider.go: `FoundryProviderName`
  constant, single source of truth.
- internal/cmd/init.go: `writeFoundryProvider`,
  `hasFoundryProviderDeclared` helpers.
- internal/cmd/init_foundry_resources_helpers.go: `bicepless` parameter
  on `configureFoundryProjectEnv` and `selectFoundryProject` that
  short-circuits connection discovery.

Tests:
- TestHasFoundryProviderDeclared covers the warning-suppression
  predicate.
- TestConfigureFoundryProjectEnv_BicepLessShortCircuits verifies the
  short-circuit fires before any Foundry data-plane call (uses a nil
  credential to make a regression crash loudly).

* feat(agents): microsoft.foundry provisioning provider

Adds an in-memory Bicep synthesizer and a provisioning provider that
implements the full azdext provisioning gRPC surface for projects
declaring `infra.provider: microsoft.foundry`. The provider generates,
compiles, and deploys an ARM template on demand without writing Bicep
to disk.

Synthesizer (internal/synthesis):
- Embeds the templates/main.bicep + modules/acr.bicep and the
  precompiled main.arm.json via `go embed`.
- `Synthesize(Input)` returns the parameterized ARM JSON plus a manifest
  describing the synthesized resources.
- `AcceptedHosts` input controls which agent host kinds the synthesizer
  will materialize (currently `azure.ai.agent`).

Provider (internal/project):
- `FoundryProvisioningProvider` implements all 9 azdext provisioning
  methods (Initialize, State, Deploy, Destroy, Preview, Parameters,
  Outputs, Resources, GetDeployment).
- `ensureResourceGroup` creates the RG on demand.
- Synthesizes from agent.yaml + host shapes, threads outputs back
  through the env, surfaces typed errors via exterrors.

Service-target integration:
- `AgentServiceTargetProvider.Initialize` is split into a cheap stub
  and lazy `ensureDeployContext` / `ensureEnv` calls so registration
  no longer requires a deploy context.
- All 5 deploy-time entry points (Package, Publish, Deploy, Endpoints,
  GetTargetResource) gate on the lazy helpers.

Registration:
- `extension.yaml` declares the `provisioning-provider` capability
  with `microsoft.foundry`.
- `internal/cmd/listen.go` registers
  `WithProvisioningProvider(FoundryProviderName, ...)`.

Errors:
- `internal/exterrors/codes.go` adds provisioning op names and
  error codes used by the provider.

* feat(agents): add `--infra` eject flag to `azd ai agent init`

Implements the bicepless-foundry spec's eject command: `azd ai agent
init --infra` writes the synthesized Bicep templates from azure.yaml
to ./infra/ on disk. Behavior follows RFC #8065 and the spec
(spec/bicepless-foundry/spec.md §Eject Command):

  | Trigger                                       | Behavior                                                                                    |
  | --------------------------------------------- | ------------------------------------------------------------------------------------------- |
  | empty dir + --infra                           | Run normal init flow, then write ./infra/                                                   |
  | existing foundry project, no ./infra/, --infra | Standalone eject: write ./infra/; no init prompts, no agent-code change, azure.yaml unmutated |
  | ./infra/ already exists + --infra             | Refuse with CodeInfraEjectExists, suggest delete + retry                                    |
  | not an azd agent project + --infra            | Refuse with CodeInfraEjectNoFoundryService                                                  |

New code:

- internal/cmd/init_infra.go: `ejectInfra` walks
  `synthesis.TemplatesFS()` to ./infra/, writes
  `main.parameters.json` from the synthesizer's parameter map, and
  prints the spec's success block. Atomic cleanup of partial writes
  on error.
- internal/cmd/init.go: `--infra` flag plus top-of-RunE short-circuit
  for the standalone-eject case and a post-init hook for the
  empty-dir case. The standalone branch validates that no init-driving
  inputs (-m, --src, positional arg) were also passed; honoring them
  would silently drop the user's argument. The post-init chain skips
  silently when init returned without producing a foundry service so
  the user never sees "nothing to eject" after init succeeded.

Provider lifecycle hardening (reviewer findings, all critical/major):

- `FoundryProvisioningProvider.Initialize` is now cheap: tenant
  lookup and credential construction moved into a lazy
  `ensureCredential` helper called on-demand by
  `deploymentsClient` / `ensureResourceGroup`. azd-core may call
  Initialize on providers it never deploys with (env refresh,
  multi-provider projects); making it cheap avoids needless RPCs and
  lets metadata-only calls (Parameters, PlannedOutputs) succeed
  without auth.
- `Destroy` no longer silently leaks resources. With `Force=true`
  it deletes the entire resource group (the previous behavior only
  removed the deployment record). Without `--force` it returns a
  structured `CodeDestroyRequiresForce` naming the RG and pointing
  at `azd down --force`.
- `armOutputsToProto` and `armInputsToProto` JSON-encode
  non-string values via a shared `encodeParamValue` helper. Prior
  behavior collapsed arrays/objects via Go's default %v formatter
  (`["a","b"]` -> `[a b]`) which is unparseable downstream.
- `Parameters` and `armParameters` are nil-safe on
  `synthResult` so a programming error elsewhere returns a
  structured Internal error instead of panicking.
- `Preview` returns `Compatibility` (feature not implemented in
  this version) instead of `Validation`; the user supplied no
  invalid input.
- `findFoundryService` and its eject equivalent return
  `Dependency` instead of `Validation` for missing-service errors;
  a missing service is a missing dependency, not malformed input.
  Multiple-services error split into its own
  `CodeInfraEjectMultipleFoundryServices` so telemetry can
  differentiate "none" from "too many".

Tests (all passing under `go test ./... -short`):

- `init_infra_test.go`: 10 test cases covering refusals
  (azure.yaml missing, ./infra/ exists, ./infra/ is a file,
  no foundry service, multiple foundry services, conflicting args
  with -m/--src/positional) and happy paths (file shape, parameters
  shape, no-docker omits ACR param).
- `foundry_provisioning_provider_test.go`: `TestEncodeParamValue`
  (10 sub-cases for the JSON-encode helper),
  `TestArmOutputsToProto_JSONEncodesNonStrings`,
  `TestArmInputsToProto_JSONEncodesNonStrings`,
  `TestParameters_NilSafeOnMissingSynthResult`,
  `TestArmParameters_NilSafeOnMissingSynthResult`,
  `TestDestroy_RefusesWithoutForce`,
  `TestFindFoundryService_DependencyCategory`,
  `TestPreview_NotImplemented` updated to assert
  `Compatibility` category.

End-to-end verified via `azd x build` + manual smoke tests against
the installed extension: standalone eject succeeds, three conflicting-
args combinations refuse cleanly with the structured error, refuse-
when-./infra/-exists and refuse-when-multiple-services produce the
expected output.

Deliberate spec deviations (documented elsewhere, deferred):

- Spec example lists `infra/modules/foundry-project.bicep`; our
  `main.bicep` is monolithic so the eject writes only the modules
  that actually exist (`main.bicep`, `modules/acr.bicep`,
  `abbreviations.json`, `main.parameters.json`).
- Spec uses `--force` to overwrite an existing ./infra/; we follow
  the spec's later guidance to ask the user to delete and re-run.
- Provider's on-disk reuse path (`Deploy` reads ./infra/main.bicep
  when present) is not implemented; the provider always uses the
  embedded ARM JSON. `--infra` is therefore inspection-only today;
  edits to the ejected `main.bicep` are not honored on the next
  `azd provision`. Tracked as separate work.
- Telemetry fields `init.infra_flag` and
  `provision.synthesis_source` not yet emitted.

* feat(agents): on-disk Bicep path + Preview via ARM what-if

Adds support for projects that have an on-disk `./infra/` directory
(typically the output of `azd ai agent init --infra`), and implements
the provisioning Preview operation via ARM what-if for both inline
and on-disk paths.

On-disk path:
- internal/project/ondisk_template.go detects `./infra/main.bicep`,
  compiles it via the bicep CLI, and feeds the resulting ARM JSON
  through the same deploy + outputs pipeline as the inline path.
- `Parameters()` returns host-derived values regardless of source.

Preview:
- internal/project/preview_helpers.go renders an ARM what-if change
  summary (Create / Modify / Delete / NoChange / Ignore).
- `Preview()` calls `ensureResourceGroup` first so what-if has a scope
  to query against.

Robustness:
- preprovision tolerates a missing agent.yaml on the inline-agents
  path (the synthesizer can still run from host shapes alone).
- Deployment output names are canonicalized case-insensitively so
  ARM output lookups survive casing drift in the template.

* chore(agents): polish: lint, drop starter clone, RG fix, comment trim

Cross-cutting cleanup pass after the provider feature stabilized.

- Satisfy golangci-lint and cspell on the bicepless feature.
- Drop the `Azure-Samples/azd-ai-starter-basic` clone from init.
  `azd ai agent init` now invokes `azd init -t <emptyStagedDir>
  <targetDir>` to scaffold the project shape directly. Net -474 LoC.
- Ensure the resource group exists before ARM what-if runs in Preview.
- Trim verbose design-commentary from the bicepless code; comments
  now describe what code does and why, not the historical rationale.

* fix(agents): purge soft-deleted Cognitive Services accounts on `azd down --force --purge`

Foundry's CognitiveServices (AIServices) accounts soft-delete on RG
removal and reserve the name for ~48h. The provider previously
ignored options.GetPurge(), so the next `azd provision` failed with
InvalidTemplateDeployment / FlagMustBeSetForRestore.

Destroy now mirrors azd-core's built-in bicep provider purge flow:
enumerate live accounts in the RG before BeginDelete (capturing
name + location), delete the RG (soft-deleting the accounts), then
purge each via DeletedAccountsClient.BeginPurge.

- Hard-fails on purge / enumeration errors (silent skip would just
  reproduce the same bug on the next provision).
- Skips the purge path when the RG is already gone at Destroy time
  (idempotent re-run). Documented in the Destroy docstring; users
  with a leftover from a prior incomplete cleanup can purge it
  manually via `az cognitiveservices account purge`.
- Without --purge, the account remains soft-deleted (auto-expires
  per Azure retention) - unchanged behavior from prior commits.

New OpCognitiveAccountList / OpCognitiveAccountPurge op-names give
telemetry a way to distinguish the enumeration vs the purge step.

Extracts collectPurgeableAccounts as a pure helper for unit testing;
the SDK pager + poller calls stay inline in Destroy (consistent with
the rest of the provider's untestable Azure SDK call sites).

* fix(agents): address PR review feedback on bicepless feature

- foundry_provisioning_provider.go: thread the caller's context into
  envValues so the Environment().GetValues gRPC call honors
  cancellation/timeouts instead of using context.Background().
- init_infra.go: correct the ejectInfra doc comment; it claimed
  infra.provider "stays azure.ai.agents" but the provider is
  microsoft.foundry. Now states the declared provider is left unchanged.
- init.go: the missing-infra warning pointed users at running
  `azd ai agent init` in an empty directory, which no longer produces an
  infra/ directory now that init is bicep-less. Point to
  `azd ai agent init --infra` instead.

* feat(provisioning): use subscription-scoped template for preview parity

Switch to subscription-scoped ARM deployment so `azd provision --preview` runs what-if without pre-creating the resource group. The RG now appears in the what-if output as a Create operation, matching built-in bicep provider behavior.

* feat(provisioning): render extension preview changes and stream progress

Add a structured changes list to the provisioning preview proto and map it into the core preview renderer so `azd provision --preview` shows a colored resource diff for extension providers (previously only a string summary was sent and dropped). Route Deploy/Preview/Destroy progress to the console spinner via an injected console instead of debug-only logs, finishing the existing TODOs.

* fix(agents): emit preview changes, clean up model deployments on down, fix ACR connection

Preview now returns structured what-if changes so the core renderer shows a colored resource diff. Destroy deletes model deployments before the resource group so the Cognitive Services account delete no longer rolls back with CannotDeleteAccountWithDeployments. The ACR project connection uses the project identity client id plus the 2025-04-01-preview api-version, matching a working Foundry hosted-agent connection (fixes empty-credential and MissingApiVersionParameter failures). Includes a temporary go.mod replace pointing at the in-tree azd core; the core proto change must land and be published before this merges.

* feat(agents): support Terraform as an IaC option for azd ai agent (#8756)

* feat(agents): support Terraform as an IaC option for azd ai agent init

Add `--infra=terraform` to `azd ai agent init` so it ejects a Terraform
module (azurerm + azapi) to ./infra/ instead of Bicep, provisioning the same
resources and emitting the same output contract, then stamps
`infra.provider: terraform` so azd-core's built-in Terraform provider handles
provisioning. Bicep paths are unchanged.

- `--infra` flag changed from bool to string with NoOptDefVal=bicep: bare
  `--infra` ejects Bicep, `--infra=terraform` ejects Terraform,
  `--infra=bicep` is explicit, absent stays bicepless default.
- Embed templates/terraform/*.tf; generate main.tfvars.json at eject time.
- Terraform path stamps infra.provider: terraform in azure.yaml (the one place
  eject mutates it); Bicep path leaves azure.yaml untouched.
- foundry_project_name defaults to env-name derivation when AZURE_AI_PROJECT_NAME
  is unset, mirroring the Bicep provider's sanitizeFoundryName.

Refs #8705

* feat(agents): omit acr.tf and ACR outputs when no agent uses docker

On the Terraform eject path, only emit ACR when at least one agent declares a
docker: block:

- acr.tf is written only in the docker case (it owns container_registry_name);
  main.tf has no ACR references otherwise.
- outputs.tf is rendered from outputs.tf.tmpl (text/template) and includes the
  three AZURE_CONTAINER_REGISTRY_* / ACR connection outputs only when acr.tf is
  present, omitting them entirely otherwise (Terraform resolves resource
  references statically, so they cannot reference absent resources).
- include_acr is no longer a module variable or a main.tfvars.json key -- the
  presence of acr.tf is the include-ACR decision.
- main.tf also derives resource_group_name as rg-{env} and the project name from
  environment_name when those env vars are unset, so the bicepless-default ->
  Terraform flow provisions without extra setup.
- Trimmed the ejected .tf comments to drop azd-internal references.

Accepts asymmetry with the Bicep path (which always emits acr.bicep).

Refs #8705

* docs(agents): fix misleading tfvars comment in writeTfvarsFile

The comment claimed an 'ordered slice of lines' but the code uses a
map[string]any. json.MarshalIndent sorts map keys, so the generated
main.tfvars.json is deterministic; the comment now says so.

Addresses a Copilot review note on PR #8756. The two other Copilot notes
(replace() regex on main.tf:33, coalesce() empty-string fallback on
main.tf:12) are false positives: Terraform's replace() treats a /.../-wrapped
substring as a regex, and coalesce() skips empty strings, both verified against
the HashiCorp docs and a live azd provision --preview.

* feat(agents): secure-by-default private networking via azure.yaml (#8708)

* feat(agents): private networking for host: microsoft.foundry

Add a declarative network: block to the Foundry service in azure.yaml and
teach the bicep-less synthesizer to provision a VNet-bound (network-secured)
account from it. Additive: an absent block yields today's public account.

- schema: network: surface (mode byo|managed, byo vnet/subnets tri-state,
  managed isolationMode, dns create-or-reference) on microsoft.foundry.json
- synthesizer: decode network:, resolve ${VAR}, validate (mode coherence,
  vnet ARM id, subnet tri-state/CIDR, DNS rg/sub), emit network params +
  NetworkMode for telemetry
- templates: new modules/network.bicep, subnet.bicep, private-endpoint-dns.bicep;
  resources.bicep/main.bicep guard the network path on enableNetworkIsolation
  (publicNetworkAccess Disabled, networkAcls Deny, agent networkInjections,
  account private endpoint + 3 AI DNS zones); main.arm.json regenerated
- provider: pass azd env for ${VAR} resolution, emit provision.network_mode,
  warn that network: is ignored when endpoint: (brownfield) is set
- docs/tests: synthesizer network tests, eject module assertions, extension
  README network section, telemetry-data.md provision.network_mode

* fix(agents): preserve network ${VAR} refs through eject

The existing on-disk provision flow resolves ${VAR} in main.parameters.json
from the azd environment at provision time. Eject must therefore keep ${VAR}
references verbatim instead of resolving them eagerly from the process env and
freezing a literal into the ejected file.

- synthesis.Input gains PreserveVarRefs; when set, byo.vnet.id and
  dns.subscription pass through verbatim and the format checks that cannot run
  on an unexpanded placeholder are skipped (concrete-but-malformed still fails)
- eject (init --infra) sets PreserveVarRefs so the ejected main.parameters.json
  stays environment-portable; the provision path still resolves and validates
- tests: synthesizer preserve-mode (pass-through + concrete validation) and an
  eject e2e asserting ${AZURE_VNET_ID}/${AZURE_DNS_SUBSCRIPTION_ID} survive

* test(agents): add Foundry private-networking E2E harness

Bash E2E harness validating host: microsoft.foundry private networking,
designed to minimize Azure resource-operation time:

- ONE real network account is provisioned (create+own matrix cell) with a BYO
  --image agent, then deploy + invoke prove the agent works under the VNet.
- Scenario 1 (bicep-less) and the other 3 matrix cells (subnet create/reference
  x DNS own/reference) are verified with 'azd provision --preview' (ARM what-if),
  which creates nothing.
- Scenario 2 (eject) is verified against the live account: eject -> what-if
  reports no changes (idempotent), proving the on-disk template + provision-time
  ${VAR} resolution reproduces the in-memory topology; a manual infra/ edit then
  surfaces as the only delta. Guards the ${VAR}-preservation fix end-to-end.
- A shared BYO VNet (+ reference subnets / external DNS zones) is created once
  and reused across cells.

Files: run-network-e2e.sh (phases 0-6 orchestrator), assert-resources.sh (live
az topology checks: publicNetworkAccess Disabled, account PE groupIds, 3 AI DNS
zones, agent-subnet delegation), lib.sh (logging/assert/azure.yaml mutation),
README.md (cost rationale, prerequisites, cleanup). Westus account region per
requirement; AcrPull granted to the project MI on the ABAC registry.

* test(agents): run network E2E phases 0-4 without --image

Decouple the private-networking E2E from the BYO-image init UX (PR 8689) so it
runs against the current branch today:

- Replace 'azd ai agent init --image' with a hand-authored azure.yaml fixture
  (foundry service + network: block + agent image:), created via 'azd env new'.
  image: yields includeAcr=false, matching BYO image, so no ACR at provision.
  Verified the fixture synthesizes: mode=byo, enableNetworkIsolation=true,
  includeAcr=false, ${VAR} resolved.
- Gate phase 5 (deploy + invoke) behind RUN_DEPLOY=true: the headless BYO-image
  deploy needs the AZD_AGENT_SKIP_ACR short-circuit from PR 8689, otherwise
  deploy defaults to build and fails. Phases 0-4 (local gates, shared VNet,
  what-if matrix, one real provision, eject idempotency) validate all the
  networking code without it.
- Fix the ABAC registry role: grant 'Container Registry Repository Reader'
  (ABAC-aware) instead of AcrPull; move the grant into the gated deploy phase.
- Drop the --image preflight; README updated (scenario table, prerequisites,
  RUN_DEPLOY usage, role).

* fix(agents): correct network bicep preflight + network-mode output casing

Two product bugs surfaced by live E2E provisioning (ARM what-if does not catch
either; both require a real deployment):

1. networkInjections preflight failure. The account and the network module
   deploy in the same template, so subnetArmId: network!.outputs.agentSubnetId
   compiled to an unresolved reference() at the CognitiveServices RP preflight,
   which then failed to convert networkInjections to its typed contract
   (InvalidResourceProperties). Build the subnet ARM id as a deterministic
   string from the concrete vnetId param instead, and add an explicit dependsOn
   so the subnet still exists before injection. Recompiled main.arm.json.

2. AZURE_FOUNDRY_NETWORK_MODE missing from canonicalOutputNames. ARM mangles
   output-name casing (AZURE_..._MODE -> azurE_..._MODE); without the canonical
   remapping the env key was stored mis-cased and azd env get-value
   AZURE_FOUNDRY_NETWORK_MODE returned empty. Added it to the restore list and a
   regression case to TestArmOutputsToProto_RepairsMangledKeyCase.

Validated end-to-end: real westus network-isolated Foundry account provisions
green with all topology assertions passing (publicNetworkAccess Disabled,
networkAcls Deny, private endpoint, agent-subnet delegation, 3 AI DNS zones,
network mode byo), across the full subnet create/reference x DNS own/reference
matrix, plus eject idempotency (what-if reports no changes).

* test(agents): make network E2E harness run green end-to-end

Fixes found while running the harness against live Azure (phases 0-4):

- Hand-authored project must include an agent.yaml (kind: hosted + image:)
  alongside azure.yaml; the foundry provider requires an agent definition file.
- setup_project now sets AZURE_RESOURCE_GROUP (the subscription-scoped template
  creates the RG but the provider needs the name) and AZD_AGENT_SKIP_ACR=true
  (BYO-image deploy signal).
- Phase 0 refreshes the dev extension from current source
  (build -> pack -> publish -> install) so the run tests local code, registering
  the provisioning-provider capability + microsoft.foundry provider. Gated by
  SKIP_EXT_REFRESH.
- What-if matrix gates on a successful ARM what-if (exit 0) rather than grepping
  a summary-only preview; this still validates reference-mode subnet/zone
  existence and delegation against the real VNet.
- Idempotent private-dns zone creation (reruns no longer fail on existing zones).
- Add MAX_PHASE to stop early while iterating.
- ACR grant uses the ABAC-aware Container Registry Repository Reader role.
- Fix set -u unbound-variable crash in the phase-4 assert message.
- .gitignore the transient per-run log directories.

Phases 0-4 (local gates, shared VNet, what-if matrix, one real provision +
topology assertions, eject idempotency) pass green. Phase 5 (deploy + invoke)
stays gated behind RUN_DEPLOY=true and needs a reachable BYO agent image.

* test(agents): build phase-5 image in ABAC-enabled ACR

Update the Foundry private-network E2E harness so phase 5 can build the
~/agents/echo-dual image itself instead of requiring a prebuilt external image.

- Add BUILD_IMAGE=true, ECHO_DUAL_DIR, ACR_NAME/ACR_RG, IMAGE_REPO/IMAGE_TAG.
- Create the target ACR with --role-assignment-mode rbac-abac and reject reuse
  of non-ABAC registries.
- Grant the caller Container Registry Repository Writer before the ACR Task push.
  Resolve the caller object id from the ARM token oid claim to avoid Microsoft
  Graph / Conditional Access failures.
- Build with the required `az acr build --source-acr-auth-id [caller]` form.
- Keep the project MI grant on the ABAC-aware Container Registry Repository
  Reader role for image pull.
- Add TARGET_RG support so investigation runs can keep VNet, DNS, ACR, and the
  real Foundry env in a single RG.

Live validation: the harness created an ABAC ACR, granted caller writer, built
and pushed ~/agents/echo-dual with --source-acr-auth-id [caller], provisioned a
private-networked Foundry account, and granted the project MI Repository Reader.
The subsequent deploy failed from this public runner with the expected private
endpoint 403, which is documented.

* test(agents): grant private ACR pull to Foundry project identity

Live phase-5 validation showed hosted-agent image pull uses the Foundry project
managed identity, not the parent account identity. Update the network E2E
harness to resolve AZURE_AI_PROJECT_ID via ARM and grant the project MI the
ABAC-aware Container Registry Repository Reader role on the BYO ACR, falling
back to the account MI only for older API shapes.

Also persist AZURE_TENANT_ID in the azd env so postdeploy hooks do not fail on
VM/managed-identity runners after deploy succeeds.

* docs(agents): add BYO image VNet cheatsheet

Add a concise README cheatsheet for initializing, provisioning, deploying, and
invoking a hosted Foundry agent with a BYO container image under VNet private
networking. Include ACR requirements for ABAC and private-only registries.

* docs(agents): move private networking guide to docs

Keep the extension README concise by moving the detailed Foundry private
networking schema, requirements, and BYO-image VNet cheatsheet into
`docs/private-networking.md`, with a short README pointer.

* fix(agents): surface managed-network isolation output

Live managed-network provisioning showed that the resources module emitted
AZURE_FOUNDRY_MANAGED_ISOLATION_MODE but the subscription-scoped main template
never forwarded it, so azd env only received AZURE_FOUNDRY_NETWORK_MODE.

Forward the output from main.bicep, add it to the provider canonical output-name
restore list, and cover ARM casing repair with a regression test. Also document
the managed VNet provisioning scenario in the private-networking guide.

Live validation: provisioned network.mode=managed in westus and verified the
account had publicNetworkAccess Disabled, networkAcls Deny, networkInjections
with useMicrosoftManagedNetwork=true, AZURE_FOUNDRY_NETWORK_MODE=managed, and
AZURE_FOUNDRY_MANAGED_ISOLATION_MODE=AllowInternetOutbound.

* fix(agents): keep managed-network data plane reachable

Live managed-network deploy validation showed that managed mode configures the
hosted-agent runtime to use a Microsoft-managed network but does not create a
customer private endpoint for the Foundry data plane. Disabling public access in
that mode makes azd deploy/invoke fail with `403 Public access is disabled`.

Keep public data-plane access enabled for managed mode while preserving BYO
mode behavior (public access disabled + private endpoint). Update the private
networking guide with managed deploy/invoke guidance.

Live validation: provisioned managed mode, converted the test ACR to ABAC,
built the echo-dual image with `az acr build --source-acr-auth-id [caller]`,
granted the Foundry project MI `Container Registry Repository Reader`, deployed
successfully, and invoked the hosted agent successfully.

* feat(agents): secure-by-default private networking for Foundry services

Realign the azure.yaml `network:` surface to the natural Azure resource
shape and make a network-bound Foundry account private in every mode.

Reverses the prior managed-mode regression that flipped the account's
publicNetworkAccess back to Enabled. Service sample 18 confirms managed
mode supports a private data plane (customer private endpoint + the
Microsoft-managed egress network), so declaring `network:` now always
disables public data-plane access.

Config: flat `network:` block with two orthogonal axes, no `mode` enum.
- peSubnet (required) -> account private endpoint; omitting it while
  `network:` is declared is an error, never a silent public fallback.
- agentSubnet (optional) -> present injects the agent into a customer
  subnet (BYO egress); absent uses the Microsoft-managed network
  (managed egress), where isolationMode becomes valid.

Synthesizer/templates:
- derive egress from agentSubnet presence (useManagedEgress); replace the
  networkMode param with a useManagedEgress bool.
- disablePublicDataPlaneAccess = enableNetworkIsolation (always private).
- add a managedNetworks/default child resource carrying isolationMode for
  managed egress.
- validate peSubnet-required, isolationMode-managed-only, and single-VNet.

Docs/tests/e2e:
- rewrite docs/private-networking.md (host: azure.ai.agent, the value the
  provision provider actually accepts on this branch).
- add synthesizer unit tests + a compiled-ARM regression guard.
- add a live E2E harness (8-cell what-if matrix, BYO + managed-iso real
  provisions, eject idempotency) with an automatic jumpbox SOCKS tunnel so
  deploy/invoke can reach the private data plane; assert real account
  network-injection state rather than azd's echoed output.

* docs(agents): use 'azd ai agent init --image' and assert real state in private-networking cheatsheet

Managed-egress cheatsheet now scaffolds the agent via 'azd ai agent init
--image' (writes agent.yaml) instead of assuming a hand-authored manifest,
matching the BYO cheatsheet.

Replace the env-output 'Expected outputs' block (azd echoing its own
AZURE_FOUNDRY_* classification) with real resource-state validation:
account publicNetworkAccess=Disabled and the managedNetworks/default
isolationMode, with the invoke echo response as the end-to-end proof.

* docs(agents): order init before azure.yaml in managed-egress cheatsheet

'azd ai agent init --image' scaffolds azure.yaml/agent.yaml, so it must
come before the network: block the reader adds to the generated service.
Matches the actual timing and the BYO cheatsheet ordering.

* docs(agents): simplify private-networking cheatsheets

- Drop the redundant 'azd env set AZD_AGENT_SKIP_ACR true': passing --image
  to 'azd ai agent init' already derives skipACR() and writes the env var
  into the environment init creates/selects. Reuse that env (no separate
  'azd env new') so init and provision share one environment.
- BYO cheatsheet: remove the export-variable indirection; inline placeholders
  directly into 'azd env set', matching the managed-egress cheatsheet.
- Managed-egress cheatsheet: remove the weak 'validate via az show' /
  env-output block; a successful invoke over the private endpoint is the
  end-to-end proof.

* test(agents): fix jumpbox fallback for DNS-reference deploy

In peered fallback mode, populate JB_HOST and wait for SSH before writing
/etc/hosts. DNS-reference accounts can also assign different PE IPs per
privatelink zone, so pin each account FQDN from the private DNS A record
instead of mapping every FQDN to the first PE NIC IP.

* test(agents): exclude private-networking e2e harness from PR

Remove the ad-hoc live validation harness from the committed diff while keeping
it available locally for investigation runs.

* test(agents): ignore bicep generator metadata in ARM drift check

Compare normalized ARM JSON instead of raw bytes so CI/dev Bicep version
metadata does not cause false stale-template failures while still catching
semantic template drift.

* feat(agents): validate distinct subnet names; drop debug log

* feat(agents): refuse Terraform eject when network: is declared

The Terraform module has no VNet/private-endpoint/DNS/networkInjections
resources, so ejecting it for a network: service would silently drop the
config and provision a public account. Fail fast with a clear error
instead, preserving the network: secure-by-default contract. Bicep eject
remains the supported path for private networking.

Also fixes a merge artifact (single-arg ejectInfra call in a test).

* docs+test(agents): document Bicep eject customization; verify full network param set

- docs: add 'Advanced: eject the Bicep and customize it' section to the
  private-networking cheatsheet (eject -> manual edit -> provision -> deploy ->
  invoke), with two worked Bicep edits and a Terraform-unsupported note.
- test: assert the complete network parameter set lands in the ejected
  main.parameters.json for BYO egress, and extend the managed-egress eject test
  with the full param contract.

* docs(agents): merge networking Requirements/Known limitations into one Limitations section

* docs(agents): rename networking quick guides to parallel Scenario 1-3 headings

* feat: generalize $ref resolver + shared $ref-aware azure.yaml edit helper (#8777)

* feat(agents): split Foundry init into per-resource azure.yaml services (#8675)

* feat: carry Foundry agent and resource definitions in azure.yaml (#8779)

* feat(agents): move Foundry networking to project service (#8809)

* feat(agents): move Foundry networking to project service

* fix(agents): reject conflicting Foundry network schema

* fix(agents): align split Foundry schemas and ACR scan

* fix(agents): preserve pre-split Foundry provisioning fallback

* docs(agents): split complex Foundry schema example

* ci: add cspell words for ai agents extension

* ci: exclude gosec G204/G304 for agents test lint

* fix: apply go fix and gofmt in agents and skills exts

* test: add kind and name to resolveAgentProtocol fixtures

* fix: log foundry env fallback and document deploy-context guard

* fix: remove foundry agent legacy-config telemetry from azd core

* fix: default code-deploy dependency_resolution to remote_build

* fix: default rg name and emit AZURE_TENANT_ID in foundry provision

* fix: write endpoint on ai-project when reusing existing project

* fix: don't declare existing model deployments in azure.yaml

* feat: derive azure.yaml project service key from project name

* fix: resolve $ref includes when synthesizing foundry infra

* fix: harden foundry provisioning provider edge cases

* fix: make agent infra eject retryable and reject --image

* fix: seed used service names and slices.Sort keys

* fix: zip agent override source dir on code deploy

* fix: skip deploy upsert for non-key connection auth

* feat: keep deprecated agent config shape in schema

* fix: debug logs, preserve uses: on optimize apply, acr network isolation

* feat: provision model deployments on existing Foundry project

* fix(test): update recording tests for inline agent definition

The functional recording tests still asserted that agent.yaml exists on
disk, but the PR moved agent definitions inline into azure.yaml service
entries. Update both Test_AIAgent_Init_NoPrompt_Defer and
Test_AIAgent_Init_NoPrompt_WithProject to:

- Assert .agentignore exists instead of agent.yaml
- Cross-check resolved model name in azure.yaml instead of agent.yaml
- Update layout comments to reflect the new structure

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

---------

Co-authored-by: Wei Meng <wemeng@microsoft.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Zhijie Huang <zhihuan@microsoft.com>
Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: trangevi <trangevi@microsoft.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

ext-agents azure.ai.agents extension

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Resolve $ref file includes with overlay overrides in Foundry config

3 participants