Agentic Development Workflows
An AI Product Engineer’s Field Guide, Part 1
This article is the first in a three-part series, all of which will be posted this week.
Agentic Development Workflows - What is happening in enterprise prod right now.
Coding with Agents - How to code with agents personally and professionally and not suck at it.
The Recursive Developer - How the agentic masters are shipping so much code they can justify $2,000+ a month in coding assistants.
Follow along or subscribe to read all three.
Typewise reported in February 2026 that only 1 in 10 agentic AI pilots make it to production. Deloitte found that while 38% of organizations are piloting agentic solutions, just 11% are actually running them live. Gartner predicts over 40% of agentic AI projects will be canceled by the end of 2027 -- not because the technology failed, but because the organizations deploying it did.
I’ve been building software for over twenty years. Payment systems, mobile forensics platforms, agentic research tools, and enough CRUD apps to fill a landfill. Agentic AI is not a fad; most people accept that by now. But about 90% of what people call agentic AI is vaporware cosplaying as production software: thin, barely functional veneers over empty stubs, with none of the robustness production demands.
So when I say “top three workflows,” I’m not ranking by Twitter hype or venture capital inflow. I’m ranking by what’s actually running at companies processing real transactions, serving real customers, and writing real checks to keep the lights on.
The money is real -- worldwide AI spending is projected to hit $2.52 trillion in 2026, up 44% year-over-year, with standalone agentic AI landing between $7 billion and $8.5 billion.
Enterprise adoption grew 340% year-over-year. 73% of Fortune 500 companies are now deploying multi-agent workflows. The question is where the money lands, and whether what it buys works.
1. Multi-agent orchestration
The router-specialist-aggregator pattern
If you’ve been in this industry long enough, you recognize multi-agent orchestration for what it is: microservices for intelligence. The same instincts that drove us from monoliths to distributed services a decade ago -- separation of concerns, bounded contexts, independent scaling -- are now driving how we decompose cognitive work across specialized agents. The difference is that the “service contract” is no longer a REST API schema. It’s a system prompt.
The production pattern that keeps showing up is what Nick Gupta calls the Router-Specialist-Aggregator model. A routing agent decomposes intent and uncertainty, selects specialist agents -- each with tight tool permissions -- and an aggregator merges outputs into a single coherent schema. The engineering insight that separates the production deployments from the demos: use cheap models for routing and summarization, expensive models for high-uncertainty steps. LangGraph’s stateful workflow approach saves 40-50% of LLM calls on repeat requests through context-preserving transitions.
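The shape of the pattern is simple enough to sketch in plain Python. Everything below -- the specialist names, the cost tiers, the `call_model` stub -- is hypothetical scaffolding rather than any framework's real API; the point is the structure: a cheap model routes, tightly-scoped specialists execute, and an aggregator merges into one schema.

```python
# Sketch of the Router-Specialist-Aggregator pattern. All names here
# (call_model, SPECIALISTS, the routing rules) are illustrative stand-ins,
# not a real framework's API.

def call_model(tier: str, prompt: str) -> str:
    """Placeholder for an LLM call: 'cheap' for routing and summaries,
    'expensive' for high-uncertainty reasoning steps."""
    return f"[{tier}] {prompt[:40]}"

# Each specialist gets a narrow remit and tight tool permissions.
SPECIALISTS = {
    "billing": {"tier": "cheap",     "tools": ["lookup_invoice"]},
    "refunds": {"tier": "expensive", "tools": ["issue_refund"]},  # high-risk step
    "general": {"tier": "cheap",     "tools": ["search_docs"]},
}

def route(query: str) -> str:
    """The routing step: a cheap model decomposes intent and picks a
    specialist. Keyword rules stand in for the model call here."""
    if "refund" in query.lower():
        return "refunds"
    if "invoice" in query.lower():
        return "billing"
    return "general"

def run(query: str) -> dict:
    name = route(query)
    spec = SPECIALISTS[name]
    answer = call_model(spec["tier"], query)
    # Aggregator: merge specialist output into a single coherent schema.
    summary = call_model("cheap", f"summarize: {answer}")
    return {"specialist": name, "tools": spec["tools"],
            "answer": answer, "summary": summary}

result = run("I need a refund for order 4412")
print(result["specialist"])  # routed to the refunds specialist
```

Note where the expensive model sits: only on the high-uncertainty refund path. Routing and summarization ride the cheap tier, which is the cost insight the production deployments share.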
Who’s actually running this
The Salesforce number is the one that should make every executive sit up. They didn’t add an AI feature. They restructured an entire operating model around agentic infrastructure. Nine thousand people down to three thousand. That’s not optimization -- that’s a different company.
The framework landscape
LangGraph leads production deployments with ten confirmed enterprise implementations including Klarna, Cisco, and Vizient. CrewAI offers the fastest path to a working multi-agent prototype -- two to four hours from setup to demo -- but production deployments have exposed task scheduling delays around twenty minutes in their enterprise platform, which is a problem if you’re trying to run anything time-sensitive. AutoGen, now absorbed into Microsoft’s Agent Framework, is good at conversation-driven orchestration but fights you on deterministic workflows.
For most production use cases today, Levels 2-3 of agent sophistication -- single agents with tool use up to small specialist teams -- are the sweet spot. Level 4 multi-agent systems are, as one practitioner put it, “fascinating for demos, painful for production.”
My take after two decades of watching architectural patterns come and go: multi-agent orchestration has the highest ceiling but also the widest variance in outcomes. The companies that succeed treat it like distributed systems engineering -- with the same rigor around observability, fault tolerance, and state management. The companies that fail treat it like a science experiment with a chatbot.
2. The tool-use agentic loop
When the model becomes the glue layer
I’ve been saying this since I first started building with function-calling models in 2024: the tool-use loop is the most underestimated pattern in agentic AI. Everyone wants to talk about multi-agent swarms and autonomous reasoning. The unglamorous reality is that the single most impactful production workflow right now is a single model sitting in a tight loop -- selecting tools from a registry, executing them, evaluating results, and deciding whether to loop again or return a final answer.
Tungsten Automation calls this the foundational layer that enables autonomy.
Before tool-use, every application needed custom logic stitched together manually. Now the reasoning model becomes the glue layer linking tools, data, and decisions without rigid workflow definitions.
The AI looks at available options, compares them to the request, selects the right tool, executes it, feeds the result back into its own context, and keeps reasoning. This loop can run once or dozens of times depending on what the task demands.
What makes this pattern powerful is the dynamic nature of tool selection. The agent isn’t following a predetermined script. If the first search returns garbage, it reformulates and tries again. If an API call fails, it reaches for an alternative tool entirely. As ByteByteGo put it, “this adaptability makes tool-enabled agents far more capable than rigid automated workflows.”
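The whole loop fits in a few dozen lines. This is a deliberately minimal sketch -- the tool names are hypothetical and a rule-based stub stands in for the model's reasoning step -- but it shows the shape: a registry, a decision, an execution, a result fed back, repeat.

```python
# Minimal sketch of the tool-use agentic loop. The tools and the
# choose_action "reasoning" are illustrative stubs, not a real model.

TOOL_REGISTRY = {
    "search":       lambda q: [] if "garbage" in q else [f"result for {q}"],
    "fallback_api": lambda q: [f"fallback result for {q}"],
}

def choose_action(task: str, history: list) -> tuple:
    """Stand-in for the model's reasoning step: inspect the task and the
    tool results so far, then pick the next tool call or finish."""
    if history and history[-1][1]:            # last call returned something useful
        return ("finish", history[-1][1])
    if any(tool == "search" for tool, _ in history):
        return ("fallback_api", task)         # first search failed; try another tool
    return ("search", task)

def agent_loop(task: str, max_steps: int = 5):
    history = []
    for _ in range(max_steps):
        action, arg = choose_action(task, history)
        if action == "finish":
            return arg
        result = TOOL_REGISTRY[action](arg)   # execute the selected tool
        history.append((action, result))      # feed the result back into context
    return None                               # step budget exhausted
```

The reformulate-and-retry behavior described above lives entirely in `choose_action`: when the first search comes back empty, the loop reaches for an alternative tool instead of giving up.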
What it looks like in production
This is the workflow powering Claude Code, Cursor’s agentic mode, and the entire new generation of AI coding assistants. Anthropic’s Claude Opus 4.5 sustains 30-minute autonomous coding sessions, improving its own output across four iterations where other models plateau after ten attempts without matching its quality. Cisco’s network operations agents use tool-use loops to monitor health, detect degradation, correlate changes, identify root causes, and remediate -- all before a human finishes reading the alert.
What’s happening at the enterprise level is that companies are building internal “tool registries” -- centralized catalogs where skills, APIs, and automations live, and models select from them dynamically.
HPE built an agent called Alfred with four sub-agents that decompose queries, run SQL analysis, generate charts, and produce structured reports, all coordinated through tool-use patterns pulling from their data warehouse. Toyota’s agent provides real-time vehicle ETA visibility across pre-manufacturing to dealership, replacing what used to require navigating 50 to 100 mainframe screens.
Cost per task by approach
The spread is dramatic: a 10-50x cost difference between a simple tool-use loop and a full multi-agent system.
That’s not a rounding error. That’s the difference between a workflow that scales to millions of daily executions and one that burns through your GPU budget by Thursday. Prompt caching alone delivers a 90% reduction on repeated context, and multi-model routing saves another 30-50%.
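A back-of-envelope calculation shows how those two levers stack. The input fractions below are illustrative midpoints, not measured figures; the arithmetic is the point.

```python
# How prompt caching and multi-model routing compound on per-task cost.
# base_cost and cached_fraction are illustrative assumptions.

base_cost = 1.00                 # hypothetical cost per task, unoptimized
cached_fraction = 0.70           # assume ~70% of tokens are repeated context

# Caching cuts the repeated-context portion by ~90%:
after_caching = base_cost * (1 - 0.90 * cached_fraction)

# Multi-model routing trims roughly 30-50% more; use 40% as a midpoint:
after_routing = after_caching * (1 - 0.40)

print(round(after_caching, 3), round(after_routing, 3))
```

Under those assumptions the optimized task costs roughly a fifth of the naive one -- which is why these techniques read as survival requirements rather than optimizations at scale.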
If you take nothing else from this piece: before you build a multi-agent system, ask yourself whether a single well-instrumented tool-use loop with a good model gets you 80% of the way there at 10% of the cost. In my experience, the answer is yes far more often than anyone selling a framework wants to admit.
3. Human-in-the-loop orchestration
The pattern that separates shipping from stalling
Here’s the uncomfortable part. Every single production agentic deployment I’ve found in my research -- Goldman Sachs, Salesforce, Cisco, Fujitsu, Dell, HPE, Toyota, Mapfre -- includes human-in-the-loop checkpoints. Not as a temporary crutch. As a permanent architectural feature.
The companies that are actually shipping agentic AI understood something the demo builders haven’t: the goal was never to remove humans from the loop. It was to redesign where humans sit in it.
Martin Fowler’s team at Thoughtworks recently published a framework that captures this well. They describe three postures: humans outside the loop (agents handle everything), humans in the loop (humans micromanage every step), and humans on the loop -- what they call “harness engineering.” In that middle-ground posture, humans build and maintain the harness -- the specifications, quality checks, and workflow constraints -- that controls how agents execute. When something breaks, you fix the harness, not the artifact. The agent runs autonomously within those constraints.
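A harness, in this humans-on-the-loop sense, is mechanically just a set of checks the agent's output must pass before it ships -- humans maintain the checks, the agent runs freely inside them. A minimal sketch, with entirely illustrative check names:

```python
# A "harness" sketch: quality gates the agent's output must clear.
# Humans maintain HARNESS; the agent runs autonomously within it.
# Both checks are illustrative examples, not a prescribed set.

HARNESS = {
    "max_length": lambda out: len(out) <= 500,
    "no_secrets": lambda out: "API_KEY" not in out,
}

def run_in_harness(agent_fn, task: str) -> dict:
    """Execute the agent, then gate its output through every harness check.
    On failure, return the failing check names -- the fix is to improve
    the harness or the spec, not to hand-patch the artifact."""
    output = agent_fn(task)
    failures = [name for name, check in HARNESS.items() if not check(output)]
    return {"output": output, "passed": not failures, "failures": failures}
```

When something breaks, the named failure tells you which part of the harness to strengthen -- which is exactly the "fix the harness, not the artifact" discipline described above.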
Patterns in practice
Goldman Sachs’ transaction reconciliation agents auto-resolve standard discrepancies but route ambiguous cases to human operators with full context.
Mapfre’s claims agents handle routine damage assessments autonomously while humans oversee sensitive customer communications.
Microsoft’s Agent Framework explicitly builds long-running workflows and human-in-the-loop state management as first-class primitives -- not an afterthought, not a nice-to-have, but a core design goal.
The agentic flywheel
Fowler’s team describes an evolution they call the agentic flywheel. Humans direct agents to build and improve the harness itself. The agent evaluates its own loop performance using tests, production metrics, user journey logs, and commercial results. It scores potential harness improvements by risk, cost, and benefit. Low-risk improvements get auto-applied; high-risk ones go to a backlog for human prioritization. The system doesn’t just execute -- it systematically improves its own execution framework.
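The triage step of that flywheel is easy to make concrete. The fields and the risk threshold below are illustrative; in a real system the risk, cost, and benefit scores would come from tests, production metrics, user journey logs, and commercial results.

```python
# Sketch of the flywheel's triage step: score candidate harness
# improvements, auto-apply the safe positive-value ones, queue the
# rest for humans. All fields and thresholds are illustrative.

def triage(improvements: list, risk_threshold: float = 0.2):
    auto_applied, backlog = [], []
    for imp in improvements:
        score = imp["benefit"] - imp["cost"]        # net expected value
        if imp["risk"] <= risk_threshold and score > 0:
            auto_applied.append(imp["name"])        # low-risk, positive value
        else:
            backlog.append(imp["name"])             # needs human prioritization
    return auto_applied, backlog
```

The asymmetry is the point: the system only self-applies changes that are both low-risk and clearly positive; everything else waits for the human who understands the business.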
This is the pattern I’m most excited about as a builder. Not because it’s flashy -- it decidedly is not. Because it’s anti-fragile.
Every failure makes the harness stronger. Every edge case the human corrects becomes training data. The system compounds over time in a way that fully autonomous agents can’t, because autonomous agents have no mechanism for incorporating the kind of contextual judgment that only comes from a human who actually understands the business.
So what does this mean for you
If you’re leading an applied AI company or an AI-forward product team, here’s where I’d point your attention:
Start with the tool-use loop, not the multi-agent dream. Every company that made it to production started with high-volume, rule-based workflows where a single agent with good tool access could deliver immediate value. Goldman started with transaction reconciliation. Salesforce started with first-line support triage. Cisco started with network monitoring. None of them started with open-ended multi-agent exploration. That came later, after the plumbing was proven.
Design your human-in-the-loop gates before you write a single agent prompt. The 90% failure rate from pilot to production isn’t a technology problem. It’s a governance problem. Figure out which actions are reversible versus irreversible. Map your confidence thresholds. Build your escalation paths. If you treat human oversight as something you’ll bolt on later, you’ll join the 40% of projects Gartner predicts will get canceled.
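A gate designed up front can be this small. The action names and the confidence threshold are hypothetical; the invariant is what matters: irreversible actions always escalate, and only known-reversible, high-confidence actions run unattended.

```python
# Sketch of a human-in-the-loop gate defined before any agent prompt.
# Action sets and the 0.9 threshold are illustrative assumptions.

REVERSIBLE = {"send_draft_reply", "tag_ticket"}
IRREVERSIBLE = {"issue_refund", "delete_account"}

def gate(action: str, confidence: float, threshold: float = 0.9) -> str:
    if action in IRREVERSIBLE:
        return "escalate"        # irreversible: always routes to a human
    if action in REVERSIBLE and confidence >= threshold:
        return "auto"            # reversible and confident: let it run
    return "escalate"            # unknown action or low confidence: human path
```

Everything the governance work produces -- reversibility maps, confidence thresholds, escalation paths -- compiles down to a function like this sitting in front of every agent action.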
Build platform infrastructure, not point solutions. Goldman built an agentic platform, not a single automation. Salesforce restructured their entire support model. Dell digitized 20 enterprise processes across divisions. The upfront investment is higher, but the platform approach scales across dozens of use cases. A tool registry that one agent can use is useful. A tool registry that any agent across the organization can use changes everything.
Watch your cost curves obsessively. LLM API costs represent 40-60% of operational expenditure for agentic systems. The difference between a $0.10 tool-use task and a $5.00 multi-agent task is the difference between a product that scales and one that bankrupts you. Prompt caching, multi-model routing, and aggressive state management aren’t optimizations. They’re survival requirements. Annual maintenance runs 15-30% of initial development costs, and initial dev is only 25-35% of your three-year spend.
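Plugging midpoints of those ranges into a three-year model makes the structure visible. The build cost below is hypothetical and every percentage is a midpoint of the cited range; treat this as a shape, not a forecast.

```python
# Three-year cost-structure sketch using midpoints of the cited ranges:
# initial dev ~30% of three-year spend, maintenance ~22.5% of initial
# dev per year, LLM APIs ~50% of opex. All inputs are illustrative.

initial_dev = 300_000                       # hypothetical build cost
maintenance = 3 * 0.225 * initial_dev       # three years of maintenance

# If initial dev is ~30% of the three-year total, back out the total:
three_year_total = initial_dev / 0.30
opex = three_year_total - initial_dev - maintenance
llm_api_spend = 0.50 * opex                 # the 40-60% of opex going to LLM APIs

print(round(three_year_total), round(llm_api_spend))
```

Under these assumptions, a $300k build implies a seven-figure three-year commitment with roughly a quarter of it going straight to model APIs -- which is why the cost curves deserve obsessive attention.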
The conversion rate problem
The number that should be pinned to every AI product leader’s wall: MIT research analyzing 300+ AI implementations found that only 5% of enterprise AI solutions make it from pilot to production. Typewise puts the agentic-specific rate at 10%. Deloitte found just 11% of organizations have agents running in production. These aren’t technology failures. They’re organizational failures. The tech works. The org chart doesn’t.
The proof points are no longer theoretical. Goldman, Salesforce, Cisco, Fujitsu, Dell, HPE, Toyota, and the OpenAI Frontier Alliance have all demonstrated that agentic AI works in production across financial services, technology, networking, manufacturing, automotive, and consulting. The question has shifted from “can AI agents handle enterprise workflows” to “when will yours.”
I’ve been building software long enough to know that the pattern that wins is never the one that looks most impressive in the demo. It’s the one that survives contact with production traffic, real users, and a CFO who wants to know what the return looks like in twelve months. Right now, that pattern is a well-instrumented tool-use loop, coordinated by multi-agent orchestration where the complexity genuinely demands it, and governed by human-in-the-loop checkpoints that make the whole thing auditable, recoverable, and trustworthy.
The agentic future isn’t about removing human judgment. It’s about figuring out where that judgment is most valuable -- and getting everything else out of the way.