
Failed MCP servers from coding agent settings should not break unrelated agentic workflows #21813

@benissimo

Problem

An agentic workflow (daily-team-status) hard-failed because an MCP server defined in the repo's .mcp.json failed to launch inside the gh-aw sandbox — even though the workflow never requested that server.

Failed run: https://github.com/RealPage/task-management/actions/runs/23258101607

Root Cause

The repo had a .mcp.json at the root configuring a SonarQube MCP server that launches via Docker. The coding agent auto-discovers .mcp.json and attempts to start all servers defined in it. Inside the gh-aw chroot sandbox:

  1. Docker is unavailable — the sandbox is a chroot jail, not a full Docker environment. The Docker socket is not mounted, so the docker run command fails immediately.
  2. The image is not pre-pulled — the Download container images step only pulls gh-aw infrastructure images, not project-specific MCP images.
  3. Docker Hub is not on the firewall allowlist — even if Docker were available, registry-1.docker.io / production.cloudflare.docker.com are not in --allow-domains, so image pulls would be blocked.

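Only the first of these was actually hit in this run; the other two would have blocked the launch even if Docker had been available. The failure happens at the Docker binary/socket level, before any network call. Firewall logs confirm this: they show 0 blocked requests and only 2 unique domains (api.anthropic.com, raw.githubusercontent.com).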
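As an illustration of point 1, a pre-flight check could flag servers whose launch command is not present in the sandbox before the agent tries to start them. This is a minimal sketch, not gh-aw's implementation: it assumes .mcp.json keeps servers under a top-level "mcpServers" key with a "command" field each, and the function name and the image name in the example are hypothetical. It only detects a missing binary; a docker binary that exists but has no socket mounted would still pass this check.

```python
import json
import shutil


def unlaunchable_servers(mcp_config, which=shutil.which):
    """Return {server_name: command} for servers whose command is not on PATH.

    `which` is injectable so the check can be exercised without touching
    the real environment.
    """
    missing = {}
    # Assumption: servers live under "mcpServers", each with a "command" field.
    for name, spec in mcp_config.get("mcpServers", {}).items():
        command = spec.get("command")
        if command and which(command) is None:
            missing[name] = command
    return missing


# Hypothetical config resembling the one described in this issue.
example_config = json.loads("""
{
  "mcpServers": {
    "sonarqube": {
      "command": "docker",
      "args": ["run", "-i", "--rm", "example/sonarqube-mcp"]
    }
  }
}
""")

if __name__ == "__main__":
    # Prints the servers that cannot be started in the current environment.
    print(unlaunchable_servers(example_config))
```

Servers reported here could then be skipped with a warning instead of being launched and failing the job.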

Error

ERR_API: MCP server(s) failed to launch: sonarqube

This error annotation marks the entire job as failed, even though the agent completed successfully and produced all expected outputs.

Workaround

We removed .mcp.json from version control and added it to .gitignore (https://github.com/RealPage/task-management/pull/1129). Developers recreate it locally. This works but sacrifices the "just works on clone" experience for project-specific MCP servers.

Request

MCP server launch failures should not be fatal for servers that are not explicitly declared in the workflow's tools: frontmatter. Two possible approaches:

  1. Preferred: Make auto-discovered MCP servers (from .mcp.json) non-fatal for agentic workflows that don't declare them in tools:, and treat them as best-effort.
  2. Alternative: Support an optional: true flag per MCP server in .mcp.json, so repos can mark servers that should not block workflows on failure.
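The requested behavior can be sketched as a small classification step over the servers that failed to launch. This is only an illustration under stated assumptions: the function and parameter names are hypothetical, and it combines both approaches (a failure is fatal only when the server is declared in the workflow's tools: and is not marked optional), which is one possible reading of the request and not an existing gh-aw feature.

```python
def classify_launch_failures(failed, declared_tools, optional_servers=frozenset()):
    """Split failed MCP servers into (fatal, best_effort) lists.

    failed:           names of servers that failed to launch
    declared_tools:   server names listed in the workflow's tools: frontmatter
    optional_servers: server names marked optional: true (hypothetical flag)
    """
    fatal, best_effort = [], []
    for name in failed:
        if name in declared_tools and name not in optional_servers:
            fatal.append(name)        # the workflow explicitly depends on it
        else:
            best_effort.append(name)  # auto-discovered or optional: warn only
    return fatal, best_effort


if __name__ == "__main__":
    # The scenario from this issue: sonarqube fails, but the workflow
    # (here assumed to declare only a "github" tool) never asked for it.
    fatal, best_effort = classify_launch_failures(["sonarqube"], {"github"})
    print("fatal:", fatal, "best-effort:", best_effort)
```

With this policy, the run from this issue would log a warning for sonarqube and finish successfully, while a workflow that does declare sonarqube would still fail loudly.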

Impact

Any repo whose .mcp.json includes servers that cannot launch in the sandbox (for example, servers requiring Docker, local secrets, or external connectivity) will have all of its scheduled agentic workflows fail, including workflows that have no dependency on those servers.
