BELITSOFT > AI Agent Development Company

AI Agent Development Services

Looking for a leading AI agent development company? Hire AI agent developers from Belitsoft to increase revenue and trim costs, resulting in higher profits for your company.
Profile Belitsoft, an experienced developer of autonomous AI agents, among best AI agent development services companies to stakeholders of your organization. Start a pilot to get results today! We work with clients in the USA, United Kingdon, Germany and other countries.

Benefits of AI Agents Implementation

Get Positive ROI

AI agents automate routine tasks to cut labor, training, and operational costs by 15-35%.
Professional AI agent implementation may deliver payback in 6 to 18 months. Your AI agent projects have the best chance to succeed and show ROI with us.

Do More With Less

Benchmarks from different industries show that companies using AI agents see 20-40% productivity improvements and 30-60% fewer errors in repetitive processes. They allow your team to focus on nuanced situations, lowering staffing needs.

Decide Better

Predictive AI agents can process large volumes of structured and unstructured data to generate forecasts and recommendations. They analyze large datasets to find patterns that help businesses decide where to spend their money.

bjarne-elearningforce
Founder, ZensAI (Microsoft) / formerly Elearningforce

Bjarne Mortensen

End‑to‑End AI Agent Consulting, Development and Integration Services

Use top AI agent development services when you need to prepare your data, design a reliable architecture for your agents, implement and train AI agents, validate and test them, integrate agents with your existing software systems, deploy them into production, and support and optimize agent performance. We recommend our end-to-end AI agent development services to create both simple agents for quick wins and agentic platforms when you need full automation.

AI Agent Consulting

Before building an AI agent, our enterprise AI agent deployment consultants help determine whether your AI project pays off and evaluate automations based on value delivered versus time to implement.

We talk to staff and review processes you want to automate to see where time and money can be saved.

We define AI success before investment: time saved, process cost reduction, results improved.

We recommend which AI system and models to use based on business goals, not what is popular.

Data Preparation for AI Agents

Belitsoft helps companies prepare their data and implement retrieval augmented generation that connects AI agents with relevant knowledge.

However, up to 90% of enterprise company data is unstructured, and it is difficult for AI agents to find the right information, so they give unreliable answers.

McKinsey reports that 70% companies experience data challenges in GenAI implementation.

Most business decision makers, 91%, see good company data as the foundation for successful AI implementation.

AI Agent Architecture Building

Hire autonomous agent workflow designers from Belitsoft. Before our AI agent engineers write any code, they decide what underlying software your agent will use: the bigger and smaller language models (LLMs and SLMs), tools to organize how everything works together (like LangChain or CrewAI), where to store the data (vector stores), and how the AI agent stores and retrieves information during conversations. They also figure out how much it will finally cost to run the AI agent (model usage fees, hosting, and storage), how to make sure the AI agent doesn't do anything unsafe, and what tests need to be written to see if the AI agent works.

AI Agent Engineering

Custom AI agent development means that we turn an LLM into the AI agent that does exactly what your company needs. Hire AI agent programmers from Belitsoft to write custom system prompts that tell the LLM how to respond and connect it to your company's business systems and data so agents can answer questions accurately. Our engineers implement controls on what your AI agent can see and edit. They also add memory so it remembers what you talked about the last time you used it. If your business operates in a specialized field like healthcare or finance, hire AI agent developers from Belitsoft to train your AI agent on those types of documents so it performs at the level you need.

AI Agent Integration

Most companies have disconnected systems: knowledge bases, ticketing systems, CRMs, ERPs. AI agent by default can look up information but cannot do anything with it. The AI agent integration solves this.

We split your business processes into small parts that your AI agent can connect to. So when a customer asks about an order on chat, your AI agent can check the order in your CRM, create a support ticket, and add it to your ERP system automatically.

We choose the right AI agent platform for you (Amazon Q, Copilot Studio, CrewAI, etc.), get it working with your existing systems and write custom code if needed.

You get AI agent software that works and not just a demo.

AI Agent Optimization

Belitsoft's continuous improvement processes keep your AI agents accurate, secure, and cost-effective. AI agents need constant monitoring to understand where they make mistakes, where their decision making needs adjustment, and where customers are not satisfied based on user feedback.

Our LLM agent developmet services also include the improving of AI agents by rewriting their instructions, updating its factual knowledge, and adjusting model settings.

Reinforcement learning can be used so the AI agent learns from its mistakes.

We can cache similar prompts so the AI responds faster by skipping repeated computation, which saves time and money.

AI Agent Testing Services

AI agents fail more often than regular software. LLMs behind them don’t always give the same answer to the same question, and the way a question is asked can change the response. You need extra testing to catch these problems.
AI Agent Evaluation Testing
We test every tool your agent calls, every API it connects to, every database it queries, and every decision it makes while working for your business. We make sure it completes business tasks correctly from start to finish. If your agent does multiple steps, we check each of them.
AI Agent Performance Testing
We put your AI agents under heavy load to simulate thousands of users using your system at the same time to see how it performs. We verify that response times stay short and that your AI agent does not crash.
AI Agent Penetration Testing
We simulate attacks (prompt injection, etc.), on your AI agents the way hackers would to test how they respond under those conditions. If reports and documentation show vulnerabilities or potential exploitation, we develop and implement fixes.

AI Agent Deployment

Our enterprise AI agent deployment consultants help you choose secure hosting for your AI agent to run, either on your servers or in the cloud and connects the agent to any databases and sources you want it to use. We also set up oversight so you can monitor what you agent is doing.

MCP AI Agents

We build, deploy and integrate MCP servers that allows your AI to start workflows, get data from within your company, change records in your ERP, CRM, and other systems, and connect several steps to complete complex tasks.

Multi-LLM AI Agents

You decide where to run your AI agent. If you need recommendations, we compare hosting services like OpenAI, Anthropic, AWS Bedrock, Azure, your own servers, or a mix of these options. We review costs, response times, data storage locations, and the effort required for setup and suggest what best fits you.

AI Agent Platforms with Guardrail Features

Before each request reaches your AI agent, it's checked for potential issues. If it violates your policies or looks suspicious, it's blocked. We also validate proposed agent actions by checking them against your business rules in real time. This prevents the risk of unauthorized changes to your files and data leaks from your system.

Security Gateway for AI Agent Calls

We use a gateway architecture where every interaction your agent has with any system passes through a controlled point. This gateway routes requests to the right model, reduces costs based on prompt caching, and logs every request and response.

AI Agent Development Cost

Building an AI agent is similar to building a new software product. The price depends on how ambitious you want to be. A small company can launch a simple AI agent for a few thousand dollars. A large enterprise building autonomous multi-agent systems, where several AI components coordinate tasks together, usually plans for budgets in the tens of thousands of dollars.

From $10,000
Basic & Intermediate Agents
Basic and intermediate custom AI agents for business
Building chat assistants or workflow helpers with moderate complexity that use existing LLMs and a few integrations typically starts around US$10,000, depending on scope. The work may require dedicated backend development, user interfaces, and ongoing maintenance. If you add more connections or a custom interface, the price increases.
From $20,000
Industry & Enterprise Agents
AI agents for specific industries
Agents that rely on trained models to perform defined tasks for your industry may start around US$20,000. They require training, data preparation, and more complex integration work.
Enterprise AI agent systems
Enterprise systems with complex decision flows and multiple integrations are priced higher. Fully autonomous agents that plan, reason, and take actions may start around US$30,000.
From $50,000
Autonomous Multi-agentic AI systems
Autonomous Multi-Agent AI Systems
Solutions where several AI agents work together may start around US$50,000. The added coordination logic and supporting infrastructure increase the budget.

What Affects the AI Agent Development Pricing

Total cost depends on how complex the agentic AI system is, how many other software programs it connects to, how much data preparation is required, which compliance standards apply, how experienced the specialists must be, and whether ongoing support is included.

The more independent components you add, the more work it takes to make them work together correctly. You need specialists who understand mathematics and coding, know how to work with your company data, and can make AI agents function as expected. You also need professional testing using different scenarios.
Data cleaning and formatting often consume additional budget. Building retrieval pipelines with RAG also adds infrastructure costs.
Connecting an agent to CRMs, ERPs, payment systems, or multiple APIs requires custom connectors and can add cost per integration.
Hiring AI engineers, data scientists, prompt engineers, DevOps specialists, and QA specialists adds cost. More specialists and longer timelines increase budgets. Hourly rates vary widely: North American engineers often charge $120–$200 per hour, while rates in Eastern Europe start around $40 per hour.
If you operate in regulated industries, compliance specialists are needed to avoid late-stage redesigns that add time and expense. After launch, expect to spend on LLM API usage and cloud hosting each month.
How To Build AI Agent. AI Agent Software Development Process

Belitsoft usually advises clients to start with a narrowly scoped prototype to prove the idea without a large budget. Instead of building something from scratch, use existing AI models and connect only the key programs required for one use case. Integrate only the systems necessary for the initial return on investment. Open source libraries can reduce upfront costs. If the prototype delivers valuable results, you will have evidence to justify further development.

Assessment Phase What you think you want from an AI agent and what will actually solve your problem may be different. For example, an AI agent development solutions that work somewhere may not fit your context (constraints of your existing systems and processes), or you may overestimate or underestimate what is possible. In the assessment phase, we validate or challenge your hypothesis, uncover hidden blocks you did not consider, find quicker wins you may miss, and prevent building the right solution to the wrong problem.
Scope Phase We spend time with you understanding how the business process you want to automate with an AI agent works now. We need to know every step, including where the data comes from and where it goes. Think of it like drawing a flowchart of the daily tasks of a staff member whose work you want to automate. We write down what your staff will stop doing. Which daily tasks will be handled automatically? Which ones does a person need to review before they are finished? When does a human need to step in to help? We also agree with you on the constraints, the rules your AI agent cannot break. For example, it cannot send emails without approval or it has to work with your current CRM system. We also define the KPIs you expect after AI agent automation, for example cutting response times in half or something similar.
Architecture Phase At this stage, our engineers design how the AI agent will work technically. Will it react immediately to each event, or will it create a detailed plan first and then execute it? Don't worry about the technical details - we'll recommend what fits your situation best. They map out which systems connect to each other, where information is stored, whether on your servers or in the cloud, how it moves between systems, and who controls access to it. They decide which AI tools to use, such as ChatGPT, Claude, or others, and how to connect them to your existing software. They also plan risk mitigation by asking what could go wrong and preparing backup plans (if the AI makes a mistake or goes down, how do we catch errors?) AI agents are not free to run - they use computing power every time they work. At this stage, you calculate what your monthly bill for using an LLM behind AI Agent will look like.
Prototype Phase We create a working demo version of the AI agent. It's functional enough to test because it can work with your company's real documents or orders from your CRM. At this stage, you can see how many out of 100 tasks are completed correctly and without hallucinations like citing a company policy you don't have. You also get information about the operating costs, how much it will cost to run at full scale. This helps you decide if the AI agent is good enough or if you need to improve it before launch.
Development Phase Once the prototype works, developers create the working version of the AI agent - a digital worker you can count on. Engineers write instructions for the AI and see what works. They test the prompts and improve them. It is like training a new employee. You watch them work, tell them what you want, and adjust your training. They figure out how much information the AI needs. Too little information and the AI gives unhelpful answers. Too much and it gets slow, expensive, and confused by irrelevant details. They build all the necessary connections to your business systems, platforms, databases, and so on and make them secure. The testing process starts by giving the AI only 5% of the business tasks to see if it works well. If it does, we increase the percentage: first to 10%, then 25%, 50%, and finally 100%.
Launch Phase We set up the AI to run on multiple servers in different locations. If one server crashes, the others instantly pick up, so your customers never know anything happened. On their monitoring screens, our engineers watch requests per minute, response times, error rates, and costs. They set up automatic alarms, so if the AI starts making mistakes at an unusual rate, response times suddenly triple, or costs spike unexpectedly, the system sends a warning. When the AI agent proves reliable, they give it more tasks. They scale it up until it processes everything it was designed for.

TYPES OF AI AGENTS WE DEVELOP

Where to Start AI Agent Development

We set up a discovery call within a day after you reach out through a contact form, email, or phone. You tell us your goals, what you need to do, how much you're willing to spend, and your timeframe.

We sign an NDA to protect your secret projects. After a few days, you get a detailed proposal with everything: what you'll get, when, and what it costs. If you agree to the plan, we put together a team and begin working on the project. To help you implement AI faster, we provide templates for figuring out where to apply AI, security and compliance checklists, tools to evaluate how ready your data and team are for AI, data assessments, and a step-by-step plan for implementation. These help you plan investments wisely.

How to Choose Cooperation Model with AI Agent Development Company

As an AI agent developer, Belitsoft has flexible cooperation models to fit your needs and budget. You can hire a dedicated team of AI agent developers who work just for you. You can extend your existing team by adding our AI Agent developers temporarily. Or you can hire us for a project with a fixed scope and timeline. We help you scale up or down quickly and hire only the expertise you need. We can also provide ongoing support to improve and update your AI agents after they go live.

Frequently Asked Questions

Agents do things rather than just say things.

Traditional AI tools, such as chatbots, use an LLM only to execute cognitive tasks such as reasoning and generating text. They suggest what you should do yourself.

AI agents execute operational tasks, such as modifying systems, triggering workflows, and taking real-world actions. They can perform tasks for you because they are capable of moving beyond just suggesting what to do.

An AI agent takes a task as a goal, breaks that goal into sequential subgoals, calls the right tools or APIs, executes actions, and adjusts its actions based on the intermediate results it receives during the process of performing the task.

AI agents also use an LLM, which tells them what to do. They execute the task, and then the LLM checks whether it has been completed before presenting you with the final result you expect.

AI agents are software programs made for one specific goal, or just one job. It may be a business process such as fraud detection or invoice processing, whatever.

An AI agent gets the job done in this domain without you telling it every step of how to do it.

For example, a refund AI agent checks orders, sees when the customer bought something, checks whether the item can be returned according to your store policy, calculates how much money to refund, instructs the payment system to issue the refund, and emails the customer with the details.

The architectural design of AI agents is so focused that they are fast and reliable for this particular goal.

The difference between an AI agent and Agentic AI is the difference between a single-agent system, focused on one workflow, and a multi-agent or orchestrated agentic system.

Agentic AI combines specialized agents, planning and memory modules, tool integration, and retrieval-augmented generation.

Agentic AI takes a big objective, such as managing all customer support, and figures out how to get it done. It breaks big jobs into smaller ones and decides which AI agents and other tools process each part. Then it makes sure that each job is done in the right order.

If a customer emails asking for a refund, wants to exchange the item for a different product, and has a question about loyalty points, an agentic AI system interacts with several agents and other programs.

  1. The refund agent processes the refund.
  2. The inventory agent reserves a replacement item.
  3. The shipping agent receives a command to send the new item.
  4. The loyalty agent updates the customer’s points balance.
  5. Finally, the communication agent sends one email to the customer explaining that the item was returned, a new one is on the way, the points have been adjusted, and the new item should arrive soon.

If something breaks, such as a new item being out of stock, an agentic AI system can make corrections to the initial plan and automatically search for similar alternatives.

The agentic AI system keeps track of everything during a transaction. It makes sure each agent is doing its part and that no information gets lost along the way.

The architecture of agentic AI is designed to manage complex situations and adjust when they change. It can work through interconnected problems intelligently, maintaining focus on the overall goal until it is achieved.

An AI agent uses its planning module and LLM brain to break goals into concrete steps and determine the best order to execute them. Using API connectors, it can run code, update databases, and perform other actions. These are capabilities that generative models alone do not have.

By storing findings in short-term and long-term memory, properly engineered AI agents can recall past successes or failures to improve over time. There may be a reflection component in which the agent evaluates whether its actions are moving it closer to the goal, which helps it correct its course when things go off track.

An AI agent repeats a decision-action cycle, planning, executing, reviewing results, and adjusting, until the objective is achieved.

A RAG AI agent makes sure that the answers of your AI agent are accurate, compliant, and up to date.

LLM models can hallucinate or generate answers that look convincing but are false. By default, they use static training data, so they do not know what has happened since they were last trained.

RAG fixes this by allowing the LLM to retrieve information from your company's actual files and databases.

When someone asks a question, the RAG AI agent retrieves the relevant documents and data, then feeds that context to the LLM along with the question. Instead of guessing based on what it learned during training, it answers using your current, specific information.

Proof of concept, or prototype, is typically delivered in 2 to 4 weeks.

Pilot implementations last 2-3 months. If the prototype works as expected, the pilot program will include additional use cases, some API integrations, basic retraining of the model, and more testing before full deployment.

Full deployment of a multi-agent AI system takes between 6 months and a year. You need to set up monitoring that meets enterprise standards and roll out the AI agent software across all your business processes. If multiple AI agents must work together or if the solution must comply with strict regulations, it could take longer.

Portfolio

AI Agent Development: Chrome Extension as In-App Guidance Tool For ERP and CRM
AI Agent Chrome Extension Development (Copilot Alternative) for Microsoft Dynamics 365 Business Central ERP
For an ecommerce client, we developed an AI Chrome extension that provides in-app guidance - a digital adoption tool to help the founder train staff on best practices and reduce high turnover. While one of their systems, Microsoft Dynamics 365 Business Central, has built-in Copilot agents that automate tasks, these agents don't teach employees how to navigate the interface.

Recommended posts

Belitsoft Blog for Entrepreneurs
Claude vs ChatGPT
Claude vs ChatGPT
The Pentagon Situation Anthropic doesn't allow use of its models for fully autonomous weapons or mass domestic surveillance. When the US military used Claude during a January raid to capture Venezuelan President Nicolás Maduro, Anthropic objected. The Trump administration then labeled Anthropic a supply chain risk and told federal agencies to stop using it. However, reports showed the US military continued using Claude for intelligence work, target selection, and battlefield simulations during the US-Israel strikes on Iran, despite the ban. Defense Secretary Pete Hegseth said the military would keep access to Anthropic services for up to six months while switching providers. OpenAI reached a new agreement to run its models on classified Defense Department networks. CEO Sam Altman said the deal still follows OpenAI's own rules, which ban mass domestic surveillance and require human oversight over use of force. Shift in the Market Many users disagreed with the OpenAI agreement. Around 700,000 users reportedly cancelled ChatGPT. Claude's iOS app hit the number one free app spot on the App Store. ChatGPT dropped to second, Gemini to fourth.  Anthropic says free users are up over 60% since January, daily signups tripled since November, and paid subscriptions more than doubled this year. Claude Outage Following the surge in users, today, on March 2, Anthropic's Claude went down. Users faced complete outages. Reports peaked at nearly 2,000 on Downdetector within a short period. In total, around 10,000 reports were submitted. Anthropic said the Claude API was working fine, and that the failures were limited to the claude.ai website. Switching from ChatGPT to Claude Most users avoid switching AI tools because they don't want to lose their conversation history. Anthropic built a few ways to fix this. The fastest option is the Import Memory feature. You paste a prompt from Anthropic into ChatGPT, copy the output, and paste it into Claude's memory settings. Done in under a minute. For a full data transfer, you can export your entire ChatGPT history through settings. It takes 24-48 hours to receive the zip file. You then upload it into a Claude Project, giving Claude access to your full history. You can also download individual ChatGPT conversations as markdown files and upload them directly into Claude. Claude vs ChatGPT Comparision Both ChatGPT Plus and Claude Pro cost $20 per month via web version. In terms of input tokens, GPT-5 is cheaper than Claude Sonnet 4.6. Claude Sonnet 4.6 is priced at $3 per million input tokens and $15 per million output tokens. GPT-5 starts at $1.25 per million input tokens and $10.00 per million output tokens, while GPT-5.2 is priced at $1.75 per million input tokens and $14 per million output tokens. However, Claude is better at coding, generates more cautious and structured answers, and works well with large documents and code. Claude Pro is a better choice for coding quality, and its output is often more natural and context-aware. Claude Pro is also well suited for large documents, long-form summarization, and prototypes, while ChatGPT Plus is better for image generation, video with Sora, and web research. In one comparison, Claude 4.6 Sonnet outperformed ChatGPT 5.2 on most tests. Claude did better at step-by-step reasoning, financial analysis, and tone control. ChatGPT does better at explaining complex topics in simple language and storytelling. For day-to-day professional work, reviewers currently rate Claude as the better tool.
Dmitry Baraishuk • 2 min read
Microsoft Copilot Tasks
Microsoft Copilot Tasks
Unlike traditional LLM-based chatbots, where you ask a question and receive an answer, Copilot Tasks agent completes tasks autonomously: browse the web, coordinate across applications, create documents, and managу schedules.  You describe what you want in plain language. It can be a one-time job, scheduled, or recurring, and the Copilot Tasks agent delivers a completion report when finished. As CEO Mustafa Suleyman puts it: "AI that talks less and does more, no complicated setup or coding skills required". It's designed for everyday users, not just developers or enterprises. There is no need to manually configure AI tools or MCPs (Model Context Protocols) that currently makes agentic AI accessible primarily to technical users. How Copilot Tasks Agent Works Copilot Tasks does not need full access to your device like OpenClaw, for example, because it runs on Microsoft's cloud servers using a remote computer and browser. Since the processing happens on Microsoft's servers, there is no significant battery drain while tasks are running in the background on your laptop, and your hardware is less affected. If an attacker gains access to Copilot Tasks, it can only affect Microsoft 365 and any connected accounts, not everything on your device. Copilot Tasks can work only with what is available in your Microsoft account or other services you have connected to it. What Copilot Tasks Agent Can Do According to Microsoft, Copilot Tasks can: prioritize urgent emails and draft replies each evening compile a weekly summary of how your time was spent recap meetings schedule recurring syncs turn a syllabus into a study plan with practice exams build slide decks with talking points from emails and attachments tailor resumes and cover letters to specific job postings track apartment and car listings, watch hotel prices and rebook when they drop, and gather quotes from local contractors or plumbers book rides timed to your flights and adjust for delays help reserve party venues find local tradespeople unsubscribe from promotional emails you never open and cancel subscriptions you're no longer using How To Control Copilot Tasks Agent We need to know the AI agent will not email the wrong person, charge our card without permission, or schedule meetings in the middle of the night. Microsoft built in a control step before the Copilot Tasks agent takes action.  Before it spends money, books a meeting, or sends an email, it stops and asks for your approval.  You can review what the Copilot Tasks agent plans to do and approve, pause, or cancel it.  That is how Microsoft aims to build a credible solution. Copilot Tasks Agent Alternatives Microsoft Copilot Tasks puts Microsoft into the autonomous agent market alongside OpenAI's ChatGPT Agent, Anthropic's Claude Cowork, Perplexity Computer, launched in February 2026 as a digital worker, and Google's Gemini, which is adding the ability to carry out supported actions in Android applications. AI agents do more than answer questions. They complete tasks, and some use more than one model. Competition in the AI agent market is shifting to the orchestration of large language models. For example, Perplexity Computer coordinates Claude, Gemini, Grok, and ChatGPT, selecting the model based on the task. Microsoft launched Copilot Tasks on February 26 as a research preview to a small group of users, with a waitlist open for others to join ahead of a broader launch. Microsoft has not provided a timeline for general release.
Dmitry Baraishuk • 2 min read
Agentic AI Coding: What Still Remains Expensive Amid a 90% Drop in Costs
Agentic AI Coding: What Still Remains Expensive Amid a 90% Drop in Costs
Benefits of Agentic Coding Engineering Cost Reduction Agentic coding has changed the software development market by cutting the labor cost of implementation for simple and internal tools. The cost of writing has dropped by up to 90% compared with similar work a decade ago.  Agentic coding excels at CRUD applications, simple web forms, standard workflows, small internal tools, simple test suites, and basic API glue code. For these types of projects, development costs have dropped approximately 90%. AI tools let companies delay hiring managers and larger engineering teams, because a small senior team can do more. Tools like Cursor plus Claude allow a single experienced engineer to generate output that used to require a small team. With these tools plus smaller teams, a handful of people can now achieve roughly an order of magnitude more than before. Engineering Time Savings For a typical internal tool where data modeling is already complete, work that previously required a small team can now be finished in a few hours with an agentic coding command line interface. AI coding agents like Claude Code can generate a full unit and integration test suite for a fairly complex internal tool in a few hours. The AI-generated test suite, which contained more than 300 tests, would have taken several software engineers several days to write by hand. A project that previously took roughly a month from start to release can now be completed within about a week when using agentic coding tools. New Market Opportunities For niche platforms, AI makes it economically viable to build a product where the total market size is no more than 10 million USD  in annual revenue and hiring a full team would not have been worthwhile. Faster Product Idea Exploration Agentic coding tools are extremely good at turning business logic specifications into well-written application programming interfaces and services. AI agents allow faster exploration of product ideas, which shortens the loop between product and engineering.  Pairing a business domain expert with a motivated software engineer and these tools produces an extremely powerful combination. Instead of a larger squad that pairs a business specialist with a group of software engineers, we will see much tighter two-person pairings. Such pairings allow extremely rapid iteration on software products. If the chosen direction is poor, the team can discard the current software and quickly start again, using what they have learned. In this new mode, the hard work is the conceptual thinking rather than the typing. Faster B2B SaaS Customization There is a lot of customization per client in B2B SaaS, so the fast conversion of unclear requirements into working prototypes that reveal misunderstandings is the biggest gain.  AI agents help most when software requirements are fuzzy, polish and long-term support are less critical right now, and you only need to iterate quickly by creating simple prototypes. Excel Spreadsheet to Web App Conversion Every organization has hundreds, and possibly thousands, of Excel sheets that track important business processes. Those Excel-based processes would be much better expressed as applications. In some cases, a professional development agency can turn these spreadsheets into an application for around 5000 dollars by combining a competent software engineer with AI tools. Cutting Recurring SaaS Costs with Self-Hosted Internal Solutions, Built Cost-Effectively with AI Agents As AI lowers the cost of custom development, SaaS tools that mostly wrap simple workflows are no longer viable as monthly subscription products. High SaaS subscription prices make it easier for companies to justify replacing them with AI-coded internal solutions. Some multi-billion dollar corporations are already replacing SaaS tools with custom internal solutions built with AI assistance. They:  replicate Salesforce-like features and embed them directly into internal systems to reduce costs. replace tools like Fivetran with internal ETL solutions based on open-source platforms and custom code, saving 40,000 dollars per month, reducing maintenance costs, and making customization inexpensive. rebuild key features of expensive back office SaaS in several weeks. For many internal workloads or custom mobile apps, it now makes sense to build rather than buy. Low-code platforms combined with AI agents enable companies to build business applications quickly, replacing subscription-based alternatives. Optimal Use Cases for Agentic Coding AI is especially powerful in small companies or teams that already have an engineering culture and testing practices.  Production of Small Applications Many developers now produce far more small internal or personal applications, even if those applications never become public products. Even if AI may produce boring, ugly code, it is still good enough for personal tools such as IDE plugins that saves developer time. Component Development Large language models are very good at writing small components from the bottom up and stitching them together.  A software engineer can feed a REST API specification to an AI coding tool and receive a module that largely works. They can also write documentation and explanations for protocols or complex code paths. Composing Small Parts With guidance, AI is very good at composing small parts, such as several API calls plus error processing, into a coherent routine.  Senior developers are comfortable with letting AI generate entire small components such as simple dialogs or small modules. Predictable Tasks, Known Patterns and Libraries AI tools work best when they can compose known patterns and libraries rather than invent new designs.  AI coding tools work excellent when the underlying task is predictable, such as generating wrappers or predictable user interface patterns. They are strong at identifying libraries that solve a problem when given freedom to choose the approach.  The Main Use Case: Understanding Legacy Code The main value of AI agents is not code volume, but having a second brain that thinks faster than a human developer. Senior developers use AI agents to understand large legacy codebases, not necessarily to write big changes autonomously.  Legacy projects are dangerous for agentic coding, because AI agents may generate large diffs that are hard to review, and lack of tests makes correctness very hard to assert. However, even if they are not good at generating new code in that environment, AI tools are excellent at parsing existing legacy code and using it to explore scenarios and hypotheticals.  AI can extract complex business logic even from obscure implementations such as templates and plugins. AI is also good at explaining low-level code, including assembly for retro platforms. What Makes Code Legacy When models are fed well-structured code even from legacy systems, they can help deliver changes faster without a large team. AI agents can easily read a specific function, understand what it does, and propose changes with high accuracy because there are boundaries in the well-structured code. However, most legacy projects are those that are not maintained and have little or no test coverage, and where engineers fear changing anything. Legacy is a business decision: a system becomes legacy when the business declares it obsolete and stops investing in it.  Poor engineering and mismanagement can create legacy code even on a brand new project, so age is not the key factor.  Most vibe-coded apps become legacy almost immediately, because no one wants to invest in cleaning them up. 10x Productivity Gains A 10x personal productivity increase may occur when working on a large legacy codebase if your code agent can: scan and understand big old codebases, answer questions about them, propose targeted changes, and assist with testing and debugging. AI is excellent at understanding existing code, summarising it, and answering questions, and this is the main productivity gain for work on legacy systems. AI is a crew of excavators compared to a single shovel when exploring large garbage heap codebases. Developers have spent a lot of time trying to understand codebases that are several years old. Agents make understanding these older codebases easier: explain what the code is doing, and locate bugs in that code.  New Automation Opportunities Many use cases senior developers previously would not have bothered to script or automate are now easy, and they have cranked out small scripts and small web services in several hours using AI. AI-Assisted Code vs. Contractor Code Developers would rather inherit a repository written with the help of an agent and a good engineer in the loop than one written by a cheap contractor of questionable quality from India who left several years earlier. The kind of repository left by such a questionable contractor typically has no tests and code is often a mess of classes and methods. What Remains Expensive in Custom Software Development Even with AI Agent Coding AI hype around code is comparable to the hype around self-driving cars: solvable in theory, but much harder than many assumed. In fact, coding time is often only a small fraction of the total time spent on a complex enterprise project. Hidden Costs The main costs are not in the initial building but in future maintenance, feature additions, operations, and organizational coordination, which AI reduces far less. Debugging and supporting software in production is still difficult and has not been automated by AI. Production software also has many hidden costs that include: security, upgrades and patches, hosting and uptime at scale, customer interactions, regulatory and compliance aspects, product management and design, and data migration and integration. The cost of having software today also includes the cost of dealing with cloud platform complexity, such as Kubernetes, distributed databases, queues, and multiple user interfaces. Distribution Still Costs More Than Building  AI agents can reduce building cost, but large companies still have advantages in brand, distribution, and customer trust.  Marketplaces and discovery are controlled by algorithms that favour big players and established brands. Many excellent small products will remain invisible until their features are copied by large companies.  You May Still Require a Team Decision makers in larger organisations are unlikely to trust a one-person shop for core systems, regardless of how cheaply that person can code. Coding time is often only a small fraction of the total time spent on a complex enterprise project where you may still require a team of human engineers. Yes, smaller than even a year ago, but still a team. Before building the app from the ground up, a team would set up continuous integration and continuous delivery, define and implement data access patterns and the core services. After that, the team would usually build a backend, dashboards and graphs for users. Near the end, the team would ideally add automated unit tests, automated integration tests and automated end to end tests to make the product fairly solid. The release of such a product, depending on the complexity, may happen even about a month after the work started. This description above only covers direct labour. Every additional person on the project adds coordination overhead that includes daily or regular standups, ticket management, code reviews, handoffs between frontend and backend contributors and waiting on other team members to unblock your work. Senior Engineering Effort is Still Required to Solve Complex Production Issues There is currently enormous value in having a human supervise the agent and check its work. AI makes it very easy to create huge volumes of code, but this can be dangerous because code for critical systems is considered a liability, and less code is usually better. When AI writes code, it is tempting to accept it quickly, but this can damage long-term quality and maintainability. With a senior engineer in the loop, AI agents can create very high-quality software very quickly. AI is best used as a partner, not as an autonomous agent. Senior engineers still need to design, review, and direct the work. Using AI for code generation requires the same effort in design, coding, and review, except the code under review is not their own. Many engineers use AI primarily for exploring codebases and libraries, generating first drafts of code or tests, refactoring proposals, explaining errors, and drafting documentation. They may ask AI to implement something and use the AI output only as motivation. Experienced software engineers never just copy AI-generated code as is, and always review and adjust it. They may ask the model to write some code, then ask it to list the top ten problems, and then ask it to fix the most important ones. AI agents are very good at checking code and then critiquing or fixing it, so engineers actively use this workflow. To get good results from AI, you need to learn how to ask good prompts, plan the work, and supervise what the model does. Experienced software engineers often start with a planning step before any code is written. They ask the agent to propose a high level plan so that the person and the model are aligned. A simple CLAUDE.md or instruction file can teach Claude Code how a specific project works and help it stop repeating the same mistakes over time.  Agents are claimed to be strong at writing large volumes of unit or integration tests quickly. However, there may be challenges with the quality of these tests. They can create a false sense of security. While a human writes a test based on requirements, the AI may write a test based only on the code it sees. If the original code is wrong, the test will also use the wrong logic to verify it - the test passes, but the bug remains. That's why AI agents work best when you already have good automated tests, because those tests keep the model in check and make its output more trustworthy. AI can replicate the behavior of people who put together StackOverflow snippets without fully understanding them. With AI, average software engineers lose the educational value of spending hours reading documentation.   The educational loss is a reason why AI is more useful for senior software engineers than for junior engineers.  Will AI Replace Software Engineers? Current models may be pretty bad at programming anything non-trivial and  "fix" issues by removing functionality. AI coding assistants may hallucinate functions or APIs that do not exist, use deprecated interfaces, reintroduce bugs during refactors, and remove tests silently. They may "fix" failing tests by changing the test instead of the code. However, many human programmers also do not manage complexity well, and for senior engineers there is no fundamental difference between improving large language models and training junior developers. Moreover, AI agents and models are still improving at a rapid pace. Existing benchmarks do not really capture how fast these agents and models are improving. Newer models will soon make the current ones look completely obsolete. Model providers are now actively filtering code training data for quality, so the new generation will use higher quality code than older models.  Even without AI, developers were always at risk of field disruption from new technologies. There will still be strong demand for engineers who can supervise AI, manage complexity, and align software systems with business goals. Knowing the business and industry specifics gives senior programmers an advantage over others in the era of AI-assisted coding. When you add business domain understanding on top of technical and agent skills (which architectural decisions fit a project, which framework to use, and which libraries perform best), it feels as if the 10X engineer has arrived. Such engineers move toward full stack and product engineering or roles with more business responsibility, propose solutions rather than just implement specifications, and engage with customers, sales, and marketing. Modern developers use AI to learn faster about business processes, regulations, and industry history (verifying information). AI is a partner that amplifies us. We plus AI are much more valuable than we or AI alone.
Alexander Kom • 10 min read
Composer LLM from Vibe Coding Platform Cursor 2.0: Cool or Overhyped?
Composer LLM from Vibe Coding Platform Cursor 2.0
What Is Composer LLM There are two common bottlenecks in agent coding: reviewing code and testing changes. Cursor 2.0 targets both. Its new agent-centered interface includes an option to switch back to the classic integrated development environment view. Cursor 2.0 adds a native browser tool. Agents can open a browser, click, analyze the user interface, take snapshots, detect issues, and iterate until the output is correct. The browser is used to fix and validate user interface changes that would otherwise require manual feedback. The developers promise that with Composer, programming tasks in Cursor now finish faster (often in under 30 seconds) and are completed more accurately in production.  The model can run unit tests, fix linter errors, and refactor code.  Cursor now allows up to eight coding agents to run at the same time, working either together or independently to plan, write, test, and review code. Each agent operates in an isolated workspace using git worktrees or remote machines. Developers can compare multiple results from concurrent agent runs and select the best output. The base model used for Composer remains undisclosed. It may be built on Qwen3-Coder or GLM 4.6 according to some observers.  It is known that Cursor did not train Composer from scratch but focused on reinforcement learning (RL) post-training on coding examples  rather than pre-training. How to Use Composer LLM Effectively Here are the timestamps for the practical workflows demonstrated in the video filmed by one of Cursor’s investors. Upgrading a project from SDK v4 to v5 (4:45 - 6:58) The presenter had an old software project. He told Cursor to update the entire project to use the new SDK and gave Composer a link to the migration guide. The AI agent read the instructions, created a to-do list for all the steps, then edited files across the repository, making changes in different places in the project. The Composer LLM in Cursor 2.0 understood how all the files in the repository (the project folder) were connected and made changes wherever needed. Running three models in parallel (progress UI) (6:06 - 7:41) When you work with Cursor, it just prints lines of text to a black screen (the "CLI" or Command Line Interface). This makes it hard to see what is actually happening. The presenter wanted to redesign this text output. Instead of a long scroll of text, he asked the AI to build a table that updates live. The table would show which tests are running, how many are completed, how many failed, and the average time taken. He asked three different AIs (Composer, Haiku, and GPT-5) to do the exact same task at the exact same time. This is a core new feature in Cursor 2.0. Some were faster (like Haiku), some were better at creative UI (like GPT-5). This parallel test lets the developer see which AI gives the best answer. The presenter pointed out that, for this task, Composer LLM was making mistakes. This is why you would want to run three models at once, in case one fails. Building a complete, working web application from an empty folder (14:19 - 15:23) He gave Cursor a one-sentence instruction: "Build me an app that is an image generator". It wrote all the code using a modern website framework Next.js  to build the app. This included all the backend setup files needed to manage the website's content, files, layout code, and settings. It then ran the installation commands, linted the code (ran a spell-checker for code to find any errors), and started the development server (ran the final command to turn on the website). At the end, a fully functional image generator website was shown in the browser. The presenter emphasizes how fast all this happens. A human developer would take significantly longer to set all of this up manually. Iterating UI styling (Composer, Haiku, GPT-5) (16:06 - 19:18) The image generator app that Cursor built in the previous step worked, but the presenter thought it was "so ugly". He gave the AI a command: "Rethink it. It should be minimal, nice-looking, and in dark mode". He gave that exact same prompt at the same time to Composer (Cursor's new, fast model), Haiku (another fast model), and GPT-5 (a model known for being very good at UI design). He got three completely different website designs. He was able to click through each one and compare them. He decided the GPT-5 version looked the best and, with one click, he applied those changes. Using the browser tool (19:20 - 20:48) The final result still was not perfect. The image was pushed to the side, and there was a header bar the presenter did not like. He told the AI to fix the layout: "Make sure the image output is centered on the page. You can remove the top bar if necessary". The presenter used the "@browser" command, which is like saying, "Do not just guess, open a web browser and look at the app yourself". Cursor debugged its own visual work. It opened the app in a new Chrome window and took a screenshot to analyze. After it understood exactly what was wrong, it wrote new CSS and HTML code to fix the layout.  Then Cursor checked the browser again to see if the fix worked. Cursor and Composer LLM Reviews Some users believe Composer 1 is like GPT-5, but completes tasks much quicker. They see the parallel models and multitasking feature as a major improvement. Long-time users say that Cursor keeps them from leaving by releasing an update right after they start thinking about canceling. Some have already canceled their subscriptions but now want to come back because of Cursor 2.0. However, according to some unhappy developers, Composer fails to update configuration files correctly, generates duplicate or broken code objects, includes Chinese characters in outputs, and produces poor pull request code rated 2/10 or 3/10 in security checks. The update broke some extensions, especially for C and C++ coding. They believe Composer is an early alpha: unstable and not worth upgrading yet. They think Cursor is still miles away from how their marketing describes it, because it is "just a VSCode fork with AI" that "does three things at once but none correctly". The integrated browser is appreciated for debugging. However, it does not behave exactly like a full, standalone Google Chrome browser. Users are frustrated about being forced to share data and telemetry by default. For a tool that reads a company's entire codebase, this seems like a privacy violation. Users are also worried that the tool could be using their private, proprietary code to train its models. Tab completion is an example where users agree it is good, but it is also too aggressive and annoying and interferes with typing. Users are looking at other tools (like Claude Code) and wishing Cursor had their features, such as slash commands and hooks. Naming the core feature "Composer", when a major, widely used PHP tool already has that name, is seen as confusing branding. Who Owns Cursor AI The San Francisco-based company Anysphere Inc., maker of the AI coding assistant Cursor, has surpassed $500 million in annualized revenue, mostly from monthly subscription plans. Most revenue comes from subscriptions. Pricing for individual users starts with the free plan (Hobby tier) and maxes out at the $200 per month Ultra plan. Business plans start at $40 per user per month for Teams. Enterprise clients get custom contracts with compliance, so prices vary. The cost of using the API is the same as for GPT-5 and Gemini 2.5 Pro: $1.25 per million tokens when you send the prompt to ChatGPT, and $10 per million tokens when it generates a response. The company’s valuation is $9.9 billion. It is considered the fastest-growing software startup ever, reaching $100 million in annual recurring revenue in 14 months. The previous record was 18 months, held by Wiz. More than half of Fortune 500 companies now use Cursor in some form, and over a million people use it daily, including coders at companies such as OpenAI, Spotify Technology SA, and Nvidia Corp. Cursor AI Alternatives The market is moving toward agentic coding. The AI plans and writes code across multiple files instead of just suggesting one line at a time. Cursor competes with GitHub Copilot, Claude Code, Windsurf, and Cline. Composer is Cursor’s feature for these agentic tasks.
Dmitry Baraishuk • 5 min read
Amazon Cut Tens of Thousands corporate Jobs to Invest more in AI Automation
Amazon Cut Tens of Thousands Jobs to Invest in AI Automation
Amazon staff reduction due to AI automation Andy Jassy, Chief Executive Officer, says the reason is to reduce the "excess of bureaucracy". The vision is to operate like the world’s biggest startup and to make the company leaner, with fewer layers and more ownership, so Amazon can move more quickly. However, the key reason may be the increased use of AI that cut jobs by automating repetitive and routine tasks. It seems that AI-driven productivity gains within corporate teams were sufficient for a substantial reduction in force. Amazon has long-term investments in building out its AI infrastructure and in the short term must offset costs. The company is expected to spend $118 billion in capital expenditures for the year, with much of it going towards building AI and cloud infrastructure. Beth Galetti, Senior Vice President of People Experience and Technology, says the reduction is necessary because this generation of artificial intelligence is the most transformative technology since the Internet, and it accelerates the pace of innovation across existing and new market segments.   Amazon had more than 1,000 generative artificial intelligence services and applications in progress or built, but that figure was a "small fraction" of what it plans to build. Amazon shares rose 1.2 percent to $226.97 on Monday following the report. The company appears to be expecting another big holiday selling season and plans to offer 250,000 seasonal jobs to help staff warehouses, among other needs, which is the same seasonal hiring level as in the prior two years. This contrasts large seasonal warehouse hiring with corporate reductions. Amazon Web Services Amazon Web Services, the company’s cloud computing unit, is affected among others. AWS reported second-quarter sales of $30.9 billion (17.5 percent increase year over year), and that growth was well below gains recorded for Microsoft Azure (39 percent and for Alphabet’s Google Cloud at 32 percent in the same period). This competitive pressure may be driving Amazon to restructure AWS. The division has been making headlines recently for a fifteen-hour internet outage last week that disrupted many widely used online services. Amazon Robots Amazon executives believe the company is on the cusp of a major workplace shift that will replace more than 500,000 jobs with robots. Robotic automation could allow the company to avoid hiring more than 600,000 people by 2033, even while selling twice as many products. Amazon’s automation team expects the company can avoid hiring more than 160,000 people in the United States by 2027. To mitigate fallout in communities that may lose jobs, Amazon policy avoids using words such as "automation" and "artificial intelligence" when discussing robotics and instead substitutes phrases such as "advanced technology" or the word "cobot" to imply collaboration with humans. Daron Acemoglu, a professor at the Massachusetts Institute of Technology who studies automation and who won the Nobel Prize in Economic Sciences last year, said that once companies work out how to automate profitably, the practice will spread to other firms.
Dmitry Baraishuk • 2 min read
ChatGPT Agent Builder: Main Module of the AgentKit Platform
ChatGPT Agent Builder: Main Module of the AgentKit Platform
Why LLM-Based Agents Matter Many try to do different things in ChatGPT, Claude, and Gemini but do not get the results they want. Not because these systems cannot do it, but because they lack correct instructions, do not call the required tools, or do not have access to the right internal files. LLM-based agents are here to solve this. What Can You Use OpenAI AgentKit Agents For? The main use case now is conversational AI chat agents: customer support in e-commerce, internal agents for knowledge base search, or data analysis. E-commerce Applications Why use an agent in e-commerce? How is it different from a regular text chatbot? An LLM agent converts a vague customer query into a precise, machine-readable command (JSON). The LLM agent understands that "I want to return this product" means the "returns" category. This allows you to launch traditional scripts with precision. The LLM agent can extract data from your personal database (if you grant access) and create a personalized response directly in the chat interface, rather than just providing traditional search results. The LLM agent outputs interactive elements (order forms, "buy" buttons) directly into the chat. The user can make a purchase without leaving the chat window. Other Scenarios Turn your ChatGPT projects into AI agents and workflows that your team, friends, family, or customers can use or buy If you have built ChatGPT projects that allow you to do things in seconds that previously took hours, thanks to how you trained them, you can now sell or share these as autonomous agents hosted on your website or behind a paywall. Create your own AI Brain Go to templates, find "Internal Knowledge Assistant". This can classify and answer employee questions. Or maybe you have a general audience asking questions or just want to remember everything from everywhere. Click on the Internal Knowledge Assistant template. It rewrites the user's question to make it better, so users do not have to be experts. It classifies the question: "Q&A," "Fact-finding," or "Other." Then it processes with three types of agents: Internal Q&A, External Fact-finding, or a standard agent. You can train each one. You can upload a knowledge base, or go to "Tools" and connect to MCP Server and add various tools - Gmail, Drive, Dropbox, third-party servers - so your team, clients, or audience can ask questions here instead of asking you directly. Example of Building a Classification Agent with ChatGPT Agent Builder The key advantage of agents is the level of control over responses they give you. Unlike classic GPTs, which rely on a single prompt and file uploads - leaving the model to interpret everything on its own - agents minimize errors. The Agent Builder can help you create a classification agent. When a user sends a request, the agent determines which of your company's information categories it belongs to. Use simple prompts like: "Classify a user's request into: manufacturing, inventory, or logistics". The model will assign it to one of those three categories. Choose your model and set how much reasoning it should do. Usually, leave "Low" reasoning turned on. Classification is not hard for GPT-5. More reasoning uses up more tokens and costs more money. Switch the output from "Text" to JSON. This makes sure the response fits your prompt exactly. The model will only return one of the three categories, nothing more. Next, add If/Else logic. That turns the JSON output from the classifier into a set of instructions. If the response says "manufacturing", your next agent knows to do one thing. If it says "inventory", it will do something else. Each output can trigger other agents or send the final answer to the user. Why Is an Agent Built with OpenAI AgentKit Genuinely New in Terms of Business Value? An agent created with Agent Builder offers something new for business: it combines the power of LLMs with control over logic and security, making it possible to deploy LLMs in corporate processes with a high degree of reliability. Security Guardrails Before Agent Builder, using LLMs in corporate systems always carried risks of unpredictable behavior, confidential data leaks (PII), or toxic content. Built-in Guardrail nodes (PII, Moderation, Jailbreak) are ready-made architectural modules that automatically filter both incoming and outgoing data. This makes LLMs safe to use in sensitive sectors (finance, healthcare, legal). Programmable Logic Agent Builder lets you program multi-step decision logic (via If/Else) based on the LLM's semantic interpretation. Businesses can create complex predictable workflows. The agent does not just "reply" - it acts on a business scenario like "classify the client" (LLM/JSON) -> "check status in CRM" (MCP integration) -> "initiate a return" (MCP integration) -> "send a form to complete" (Widget). The agent can execute transactional actions (book, buy, change status in CRM), becoming an active participant, not just an advisor, and changing states in external databases. Data Control Agent Builder solves the key corporate problem: it lets you use the full power of AI models while minimizing unpredictability. You can deploy an agent via SDK/API, integrating the Agent Builder core into your mobile apps, chat systems, or backend processes. Execution logic and confidential data remain under your control. Deployment via SDK/API means the LLM core processes and classifies data, but confidential data remains inside your secure server or database/CRM. Critical data (names, addresses, phone numbers, credit card info, client activity, order details, CRM interactions) never go to the LLM. They are filtered at your system boundary. The LLM receives only anonymized context to generate a response. How Do You Start Using Agents with OpenAI AgentKit? After creating an agent with Agent Builder, you either get code to embed the chat widget on your site, or use the Agent SDK (or API) to integrate the agent's features into your mobile app or backend system, including vibe-coded ones. What Exactly Does Agent Builder Allow You to Do to Create a Custom Agent on OpenAI LLM? Creating a custom agent in Agent Builder means building a workflow using a drag-and-drop interface. As before with ChatGPT customization, you create a System Prompt (role, persona, response style), and select the LLM. Once you complete basic customization (prompt, LLM, RAG), you begin to configure parameters that control the LLM externally. You connect the database to the agent, but now you do not pass personal data from this database to the LLM - they can be filtered by middleware on your server. This shift from prompt-only management to management via nodes is the key business value of Agent Builder. Workflow Configuration You force the LLM (Agent Node) to interpret a vague customer request ("I want to buy," "how do I return") and output a structured variable (for example, "pathway": "purchasing"). With If/Else nodes, you use this JSON variable to guide the conversation down a custom path: if pathway = "purchasing", activate the "Purchase Agent"; if "returning", activate the "Return Agent". You create or upload unique interactive elements (widget files), such as a checkout widget or purchase-complete widget. This allows the agent to do more than reply with text. The LLM-based agent (Agent Node) is trained to fill these widgets with data from the conversation or your knowledge base. For example, it inserts the product name and price into the order widget, making the widget unique to the current dialogue. Modular Agent Design Instead of a single "all-knowing" GPT, Agent Builder lets you create multiple Agent nodes, each a narrow specialist: Greeting Agent (focuses on tone and classification) Purchase Agent (focuses on product knowledge and sales) Return Agent (focuses on policies and procedures) This modularity allows precise control over instructions and behavior at every stage of the customer journey. Agent Optimization Platform Agent performance optimization is part of a broader set of tools called Agent Kit. Optimization is done on the agent optimization platform, which includes evaluation, trace grading, and datasets. The main issue with complex agents is that they work well 80% of the time but "break" or hallucinate the other 20%. Evaluation tool automatically tests the agent on large datasets to check if it always returns the correct answer and uses correct logic. Trace Grading helps you see step-by-step how the agent arrived at an answer. If it made a mistake, tracing shows at which node (If/Else, Agent, Tool Call) the error happened. Reliability also means cost savings. If the agent gets it right the first time, no extra logic loops or retries - lower token use. By analyzing traces, you may find that the agent takes a long, expensive path for a simple task. You can fix the logic to make the path shorter and cheaper. Agent Builder VS n8n Agent Builder is designed for chat-oriented workflows. n8n is designed for automation. Agent Kit is for consumers and teams in the OpenAI ecosystem needing quick, simple chat workflows. n8n is for developers needing complex, custom, autonomous systems. Background Automation The most powerful automations work in the background, invisibly. In OpenAI's Agent Builder, you are limited in triggers and cannot schedule or run things in the background - n8n clearly wins here. You only have one trigger for Agent Builder - Start. It is for chat agents. The only way to launch the agent is to send a message (via chat or API). No scheduled triggers, no app events, no web hooks for CRM events. If you need an agent to process invoices or track leads automatically, you need real triggers. Agent Builder does not have them (at least, not now). n8n wins for background autonomy. Chat Interface Excellence Chatkit is where Agent Builder truly shines: easy creation of stylish, branded chat interfaces, Widget Studio for interactive visual elements (forms, calendars, product recommendations) that the agent fills with data. Professional, deployable chat interface in minutes without coding. Ease of Use vs Control For a complete beginner making their first agent, Agent Builder is unbeatable in simplicity. It is faster, less intimidating. But this simplicity comes with trade-offs. In Agent Kit, it is difficult to track how variables move between nodes in debug/evaluation mode - easy to see in n8n. n8n gives you full control of infrastructure (self-hosting), model choice, and cost control. Agent Kit is locked to OpenAI cloud only, but allows connection to OpenRouter for hundreds of models via one API. Local models supported in self-hosted n8n. Technical Knowledge Required To work effectively with Agent Builder, you need to know or understand core programming and development concepts Code Logic Basics (System Concepts) How to build conditional logic for workflow routing What "true" and "false" are, how agents use them for If/Else decisions What repetition is and the risk of infinite loops or credit consumption Data Structures and Variables How to save and reference data at workflow steps What JSON is - structured data for formatting agent output for next node use Why data must be manipulated for readability and usability at each step Architecture and Deployment Understanding that LLM is only one part, integrated with external tools and controlled by logic and security nodes Deployment (Chat Kit, SDK) requires coding knowledge (e.g., avoiding UI widget breakage with bad output) To work effectively, you must think in terms of software, data flows, and logic, not just writing a prompt.
Dmitry Baraishuk • 7 min read
OpenAI’s ChatGPT Agent Outperforms the o3 Model Alone: What This Means for Developers
OpenAI’s ChatGPT Agent Outperforms the o3 Model Alone
ChatGPT Agent Meaning ChatGPT Agent operates on its own virtual computer accessible from within the ChatGPT.com interface and alternates between reasoning and direct action to complete complex, multi-step tasks.  Unlike a standard chatbot, it can click and scroll through pages, fill out forms, download and manipulate files – all in response to natural language instructions.  At the core of this capability is a model that combines the web-interaction skills of OpenAI’s earlier "Operator" prototype with the deep analytical and synthesis skills of the "Deep Research" mode.  ChatGPT Agent News and Availability Click the "+" icon to the left of the prompt field, open the Tools menu, and then choose "Agent mode" (it’s the last item in the drop-down list). If you don’t see "Agent mode" there, the feature hasn’t been rolled out to your account yet, and there’s currently no other way to enable it. Not everyone can try it yet. On launch day OpenAI began rolling the feature out to paying Pro, Plus, and Team subscribers. Enterprise and Education customers are queued for "the coming weeks", while free-tier users must wait. European Economic Area and Swiss residents are still in the waiting room while regulatory work proceeds, so if you don’t see "Agent mode" in your Tools menu, the feature simply hasn’t reached your region or plan. OpenAI ChatGPT Agent EU availability Users based in the EU, who have a Pro subscription, on the evening of 18 June still can’t see the new feature. Pro users can try a stop-gap at https://operator.chatgpt.com, but this isn’t the announced ChatGPT Agent — just one component: a simple browser agent that loads pages and executes tasks. It’s quite buggy, for example, CAPTCHAs on websites can’t be solved within the agent’s interface, either automatically or manually. We’ll see how the full ChatGPT Agent handles jobs like these once it finally rolls out. Deep Research mode is available from the Tools menu, but it runs more slowly and cannot click through sites as mode "Operator" does. OpenAI has not committed to a timeline. Tech press summaries simply repeat that official EU access is "coming" but "indefinite for now". However, if you need unofficial access and has PRO account (not PLUS), just use an American VPN. Even if you have a European card, it still works. However, this method is unofficial and access could be blocked at any time. OpenAI ChatGPT Agent availability UK Many UK Pro users received access the day the feature launched. UK users get the same virtual computer with visual/text browsers, terminal, and connectors described in the launch post. OpenAI. If "Agent mode" still doesn’t appear, give it a little time - OpenAI is enabling accounts in batches. Capabilites the new ChatGPT agent brings to web users of ChatGPT.com Real-World Utility Examples ChatGPT Agent’s unified skills significantly broaden its usefulness in both professional and personal contexts.  Professional tasks: It can automate burdensome office chores, such as updating spreadsheets with new financial data while preserving all formulas and formatting. Instead of you manually doing these repetitive tasks, you can delegate them in natural language and the agent will handle the execution end-to-end. Personal tasks: In everyday life, it can serve as a smart concierge. For example, it can find medical specialists and book appointments for you based on your schedule. It can orchestrate complex personal projects or errands that involve multiple steps and information sources, saving you a lot of time and effort. User Control, Oversight & Privacy As ChatGPT Agent works, it provides a live narration of its actions and explicitly asks your permission before taking any step with real-world consequences (such as making a purchase or sending an email).  You can intervene at any time – pausing or stopping the agent, taking over the browser manually, or steering it in a different direction.  Certain critical actions run in a supervised "Watch Mode" that requires you to approve each step. Likewise, the agent is trained to proactively refuse obviously high-risk requests – for example, it will decline to carry out a bank transfer or other sensitive financial operation on its own. If a task is long-running, you’ll get a mobile push notification when it finishes so you can review the results. Strong privacy controls are also built in. With a single click, you can wipe all browsing data the agent has accumulated and log it out of any websites it accessed on your behalf. (Otherwise, site cookies persist according to each site’s policy, which helps speed up repeated visits by avoiding extra logins.)  In secure browser takeover mode – when you temporarily take control to enter credentials or navigate – anything you type (like passwords) is kept private and not stored or sent to the model. This ensures the AI never actually sees your sensitive inputs, since it doesn’t need them, and thereby prevents those details from being retained in the AI’s context.  You can also disable any connected apps or accounts (connectors) at any time to limit what data the agent can access.  Finally, ChatGPT Agent supports scheduling recurring tasks: once the agent completes a task, you can set it to run automatically on a schedule (for example, "generate my weekly metrics report every Monday morning") to further automate your workflow. Built-in Tools, Connectors & Environment To accomplish its tasks, ChatGPT Agent has access to a suite of built-in tools and can choose the best tool for the job.  These include a GUI visual web browser (to interact with websites like a human would), a text-based browser (for quick retrieval and reasoning over large text documents), a full terminal for code execution, and the ability to make direct API calls.  The agent can also use ChatGPT connectors – integrations to third-party services such as Gmail or GitHub – which allow it to securely fetch information from your email, calendars, or other apps and incorporate that data into its responses. If needed, you can hand the agent an authenticated session on a website via a secure takeover mode (logging it into a private portal yourself) -  this lets it access gated or user-specific content, and you can resume manual control at any moment. All these tools run within one coherent virtual computer environment that preserves state across tool switches. This means the agent can move from one modality to another without losing context.  For example, it might retrieve information through an API (say, querying your calendar events), then use the text browser to reason over a lengthy document, and finally switch to the visual browser to interact with a site’s interface – all within one continuous session.  The virtual machine allows it to download files from the web, manipulate them via terminal commands or code, then open the results back in the browser for you to view.  This adaptive approach enables the agent to choose the most efficient path to complete tasks: it might process structured data via API, handle large text analysis in the text browser for speed, and use the visual browser only when necessary for human-oriented pages. By preserving memory and state, the agent can carry out multi-step workflows with speed and accuracy, without redundant back-and-forth or losing track of intermediate results.  You can interact with it in a highly iterative and collaborative way – you can interrupt at any time to give new instructions or clarifications, and the agent will incorporate your feedback and continue the task without starting from scratch. Similarly, the agent may ask you for additional details if needed to ensure it’s on the right track. Risk Mitigation & Safety Framework Because this is the first time ChatGPT can directly take actions on the live web, OpenAI has implemented extensive safety measures.  The controls from the Operator preview have been expanded to address the new risks of a general-purpose agent with web access. This includes safeguards around handling sensitive user data, limiting what it can do via the terminal (network access is restricted), and especially defending against prompt-injection attacks that could be hidden in webpages or metadata.  Prompt injection is when malicious instructions are embedded in content the agent might read (for example, invisible text on a webpage) to trick the AI into doing something harmful or revealing private information.  To counter this, ChatGPT Agent is trained to recognize and resist malicious or hidden prompts, and OpenAI runs a real-time monitor on all tool outputs to catch suspicious patterns.  If the agent encounters anything that looks like a hidden instruction or an attempt to hijack its behavior, it will refuse or seek confirmation.  Moreover, the system requires explicit user reconfirmation before any high-impact action (like those involving sensitive data or transactions), which adds a human check that further reduces the chance of a hidden prompt causing damage. Users are also advised to be mindful of what data they expose the agent to and to disable connectors when not needed, as additional precautions. Given the agent’s expanded capabilities, OpenAI has classified ChatGPT Agent as having "High Biological and Chemical Capability" potential, so they have nonetheless activated the highest tier of safety measures for this category. That safety stack includes threat modeling for misuse scenarios, special training to refuse or flag any dual-use dangerous content, always-on classifier systems and reasoning monitors watching the agent’s behavior, and strict policy enforcement pipelines to intercept potentially harmful actions.  OpenAI has also been coordinating with external experts in biosecurity, government and academia – even hosting a biodefense workshop with national labs and NGOs – to get outside input on shoring up these defenses.  As part of the safety push, OpenAI has published a detailed system card describing the agent’s risk analysis and mitigations, and launched a public bug bounty program to encourage outside researchers to find any vulnerabilities or unintended behaviors so they can be fixed quickly. Availability, Roll-out & Quotas ChatGPT Agent became available starting July 18, 2025 for subscribers on the Pro, Plus, and Team plans. Pro users (who pay for the highest tier) got access immediately on launch day, while Plus and Team users were slated to get it over the following few days as the rollout progressed.  Enterprise and Education tier customers are expected to gain access in the coming weeks after launch.  Usage of the agent is metered: Pro subscribers receive a generous quota of 400 agent messages per month, whereas Plus and Team users get 40 agent messages per month included. If users need more, OpenAI allows purchasing additional agent message capacity via a flexible credit system. (These quotas are separate from regular ChatGPT usage and exist because agent tasks consume more resources.) Currently, the feature is not enabled in certain regions – pending regulatory clearance or additional compliance measures. OpenAI has acknowledged this regional hold and will presumably turn it on in those locations once they address local data and privacy requirements. The introduction of ChatGPT Agent also means some earlier beta features are being retired or folded in. The standalone Operator preview site (operator.chatgpt.com), which previously demonstrated the web-browsing agent capabilities, will remain live only for a few more weeks and then be shut down. Its functionality is effectively merged into the new agent mode. Additionally, the original "deep research" mode (which provided very lengthy, in-depth research answers with extensive citations) has been integrated as part of the agent’s capabilities. For users who prefer the old deep-research style – which can take longer but gives more exhaustive reports by default – that option is still available in the ChatGPT UI by selecting "Deep Research" from the mode dropdown in the message composer.  Benchmark Performance Highlights OpenAI tested ChatGPT Agent on a variety of challenging benchmarks and the results show substantial performance gains over previous models: Humanity’s Last Exam (HLE) On this extremely difficult expert-level QA test across many subjects, the agent achieved 41.6% accuracy (pass@1), which is a new state-of-the-art and a huge jump from prior models. This beats the previous best from the dedicated Deep Research mode (26.6%) on the same benchmark. In other words, ChatGPT Agent answers correctly on roughly 4 out of 10 graduate-level questions, compared to only about 1 out of 5 for its predecessor – a significant improvement on a test explicitly designed to be harder than what any AI had seen. FrontierMath (Tier 1–3) On the hardest open-ended math problems (novel, unpublished challenges that even human math experts struggle with), the agent scored 27.4% accuracy. This vastly outperforms earlier models – for context, the previous o3 model was around 10.3% on these problems. Having an agent that can solve over a quarter of these elite math problems is a major leap forward (prior systems were in the single-digits or low teens percentage range). Complex Knowledge-Work Tasks In an internal benchmark mimicking real-world knowledge work (conducting research and analysis tasks that might take a human hours), ChatGPT Agent’s outputs were rated comparable to or better than expert human results in roughly half the cases. It also significantly outperformed the older models (o3 a) on these tasks across all tested time budgets. In other words, given difficult professional assignments (like writing an analytical report or creating a detailed project plan), the agent’s work was on par with human experts about 50% of the time. This demonstrates that the agent isn’t just good at quizzes – it can handle practical, economically valuable tasks and often do as well as top human professionals in those domains. DSBench (Data Science Benchmark) This benchmark tests realistic data analysis and data modeling workflows. ChatGPT Agent not only surpassed other AI models here, it even exceeded human performance on some metrics. For the data analysis portion, the agent scored 89.9%, versus around 64% for human analysts performing the same tasks. On data modeling tasks, the agent reached 85.5%, well above the human baseline (around 65%). These are remarkable results – the AI is actually outperforming experienced humans in interpreting and modeling data in this test, highlighting how effective the agent can be when it can use tools (like Python) to assist its analysis. SpreadsheetBench This is a suite of 912 spreadsheet editing questions (requiring formula adjustments, data manipulations, etc). ChatGPT Agent achieved 35.27% accuracy overall on this benchmark – which may sound modest, but it’s significantly higher than other AI models: OpenAI’s o3 managed 23.3%, Microsoft’s Copilot in Excel about 20.0% on the same test. Moreover, when the agent was allowed to directly edit Excel files (rather than just instructing changes), its score jumped to 45.54%. While still below the 71.3% human expert accuracy on these tasks, the agent outperformed specialized tools like Copilot. It also maintained formulas and formatting, not just values. Investment Banking Financial Modeling On a set of first- to third-year investment banking analyst tasks ( building a three-statement financial model for a company, complete with correct formulas and citations), ChatGPT Agent averaged 71.3% accuracy. This is a strong result considering humans are far from perfect on these complex multi-step problems. It substantially beat the older "Deep Research" agent mode (which scored ~55.9%) as well as the base o3 model (~48.6%) on the same evaluations. This means the AI can perform many financial analysis tasks at a level that approaches (and in many cases exceeds) a well-trained junior analyst – a notable achievement in a field that requires both domain knowledge and attention to detail. BrowseComp (Agentic Web Browsing) This benchmark measures how well an AI agent can find hard-to-locate information on the web. ChatGPT Agent set a new record with 68.9% success on BrowseComp, which is about 17.4 percentage points higher than the previous best achieved by the Deep Research mode (51.5%). It also well exceeds the older o3-based agent’s performance (49.7%) in this category. The agent is much better at using a browser to answer obscure questions or gather information than past models – an indication of its improved search strategies and tool use on the open web. WebArena (Real-World Web Tasks) In the WebArena benchmark (which tests an agent’s ability to complete practical tasks through a web interface), ChatGPT Agent scored 65.4%. This is a few points higher than the previous agent built on the o3 model (62.9%), showing that it has surpassed prior agents in real web task performance. However, it’s still below the human performance level on WebArena (humans score about 78.2% on these tasks). So while the agent is closing the gap in web task proficiency, there remains room to improve before it can match an expert human operator in all cases. API Equivalents ChatGPT Agent can click around websites, scroll, fill forms, run code, and complete end-to-end tasks on its own. The Responses API already provides a computer_use tool (since March 2025) that lets developers programmatically control a headless browser. In other words, developers have had this capability behind the scenes. The flashy on-stage demo is built on tech that third-party devs already use. However, now that it’s packaged neatly in ChatGPT’s UI, users will expect the same seamless automation from all agents. Developers need to match that user experience in their products (or focus on specialized niches where they can do even more). ChatGPT has one-click connectors to Gmail, GitHub, Google Drive, etc., allowing it to directly interact with those services. There’s no official Connectors API for developers (yet). Developers can integrate services by writing their own OAuth flows or using unofficial means (the Deep Research agent or community plugins), but OpenAI hasn’t released a standard connector framework in the API. This is a missing piece in the developer stack. Until OpenAI offers connectors to everyone, custom-agent developers can add value by building these integrations themselves. If you build a domain-specific agent (say, for finance or healthcare), you can create connectors to the services your users care about. You’ll be stepping in where the official platform currently doesn’t go, which can be a competitive advantage. (Keep in mind OpenAI has hinted connectors will come to the API once stabilized, so plan for that future.) ChatGPT now supports scheduled runs (recurring Tasks) with notifications (like a weekly report every Monday). There is no built-in scheduling or cron feature in the API. A developer using the API must implement their own scheduling system (using cron jobs or backend services) or wait for an official "Tasks SDK" that might be in the works. If your agent’s value proposition involves doing things on a schedule (daily summaries, periodic reports, etc.), you currently have to handle scheduling logic yourself. This is extra engineering work, but it also means you can differentiate on reliability and customization of scheduled tasks. The fact that OpenAI showed scheduling in the product hints that a developer solution might come, but until then, consider it a space to innovate. The ChatGPT Agent emphasizes user control, with live narration of its actions, a ‘watch mode’ for sensitive tasks (like sending emails) that requires user approval, and a privacy toggle to wipe data. The Agents SDK and API provide hooks and events for similar guardrails. Developers get callbacks or flags when the agent is about to do something sensitive, and can build in confirmations. The Moderation API and safety guardrails are also available for content filtering. However, these are not automatically enforced in a custom UI — you as the developer must make use of them. OpenAI has put a lot of work into a friendly interface that keeps users in charge (narrating steps, asking permission, etc.). If you’re building your own agent app, you should mirror these trust features. By using the guardrail hooks in the SDK, you can show users what the agent is doing, ask for confirmations on critical steps, and provide easy ways to pause/stop. This will not only match user expectations set by ChatGPT, but also protect you legally and ethically. Safety and transparency should be part of your UX just as they are in ChatGPT’s product. OpenAI’s ChatGPT Agent launch includes quotas: Pro subscribers get 400 agent actions per month, Plus users get 40 per month (with more purchasable via credits). The API’s usage is metered by tokens and processing time, not by a fixed number of actions. There’s no concept of a fixed monthly action count in the API - it’s pay-as-you-go. Developers pay per token and for tool usage seconds, with their own limits set by organization quotas or rate limits. If you build an agent for end-users, you have flexibility in the business model. You could offer unlimited use (within fair limits) if the cost in tokens is manageable, or set your own pricing tiers. The ChatGPT quotas might make some users feel constrained, which could drive them toward specialized third-party solutions. Keep an eye on whether OpenAI aligns API pricing with the ChatGPT-style quotas in the future. No new model family, but major performance jump The big boost does not come from a brand-new model but from smarter coordination — specifically, how the agent chains together browsing, coding, and file tools during execution.  These tools have been available in the API since March, but combining them in a continuous agent loop enables the system to think ahead and reuse context.  This orchestration now leads expert benchmarks and, in many cases, even surpasses human performance on high-value tasks. The takeaway from July’s launch: With the right setup, you can get much more out of existing models. User expectation has jumped Before, an average user might have been delighted if a third-party GPT-based app could, say, give a good answer with references. Now, after seeing ChatGPT Agent, users will expect AI agents to take action and complete tasks autonomously.  Agents are now expected to not just chat, but act. For developers, this means that simply providing an answer isn’t enough to impress — your agent may need to book the restaurant, format the spreadsheet, or orchestrate the workflow just like ChatGPT can.  OpenAI Agents SDK 0.2.x Release On July 17, 2025, OpenAI shipped version 0.2 of its Agents SDK. The release folds three production-grade capabilities into the library we already use to build AI assistants. Guardrails now run alongside every model or tool step. They apply the same policy filters OpenAI uses in ChatGPT, so unsafe content or prompt injection attempts are blocked before they reach users. A persistent session object is included. By default, it stores chat history in a local SQLite file, but the interface lets us plug in Postgres, Redis, or any other store that meets our data governance requirements. With sessions in place, the agent automatically “remembers” prior turns - developers no longer need to stitch conversation history into each prompt ourselves. Tracing records every loop iteration — each model call, tool invocation, and agent handoff — and exposes the log through a simple UI and JSON feed. Developers can replay a conversation after the fact, pinpoint slow or expensive steps, and provide auditors with a full trail of agent decisions. Together, these features replace much of the custom glue code developers maintain today for safety checks, state management, and debugging.  Migrating an existing prototype agent is expected to take one to two developer-days: change the import paths, enable guardrails, choose a session backend, and turn on tracing. Areas that still need solutions, which developers can address Connectors not open to API This gap means a developer can create a custom agent with connectors to niche services or internal systems today, while OpenAI’s own connectors are limited or beta. If you have industry-specific knowledge (say an agent for medical records or for enterprise CRM systems), you can integrate that now. By the time OpenAI releases an official connectors API, you could already have a reputation or customer base for that integration. No built-in scheduling for API If your use-case benefits from periodic or background tasks, you can build that infrastructure and offer something ChatGPT can’t (yet) do outside its interface. It’s more effort, but for certain enterprise clients or power users, that could be a deciding factor. Session length and persistence ChatGPT’s UI keeps the virtual computer session alive 30 minutes, whereas API sessions time out after 5 minutes idle. If long-lived sessions matter (monitoring something continuously or lengthy workflows), developers might soon get tools for it, but if not, you could engineer a workaround (perhaps by intelligently re-initializing context) and advertise that stability. Geographic and compliance constraints The ChatGPT Agent is geofenced out of the EU (EEA) at launch, likely due to regulatory concerns. This means EU-based enterprises or users can’t use ChatGPT Agent yet. If you are a developer or company operating in Europe, this is a moment to pitch custom solutions that run on the API (which can be configured for data residency and compliance). You can provide AI agent services without the geofence, addressing privacy or legal requirements in ways the main product might not initially. Advice for developers Since the core capabilities (browse the web, execute code, handle files, etc.) are becoming commodities available to anyone using OpenAI’s platform, the strategic advice is to differentiate in other ways. Proprietary data or knowledge Incorporate company-specific data or industry expertise that a general ChatGPT agent won’t have by default. For example, a financial agent that knows your accounting system or a medical agent that has access to up-to-date research in a specialized field. User experience & integration Design a user interface or workflow that fits a particular job better than ChatGPT’s general UI. Maybe a browser extension, or integration directly into a team's Slack/SharePoint, or a voice interface for use on the go. Provide a streamlined experience for specific tasks. Trust and custom guardrails As mentioned, use the guardrail primitives to build trustworthy agents. Enterprises might prefer an agent whose every action is logged, auditable, and with admin controls — something you can build with the API on top of OpenAI’s models. Branding and relationships Companies might prefer a branded solution (their own AI assistant) for psychological or business reasons. As a developer, you can offer white-label agents using OpenAI under the hood. You maintain the client relationship and can tailor the solution exactly to their needs, which the general ChatGPT cannot do for each client. Beyond the Hype: The Hidden Costs of "Mostly Correct" AI Automation Today’s AI "agents" promise to handle whole jobs for us, but leaders should view them less as veteran employees and more as unreliable interns: fast, tireless, and eloquent, but prone to costly blunders if left unsupervised. Even a two-percent error rate sounds tiny until you chain dozens of automated steps together. At that scale, the math virtually guarantees something will break. A mistyped year in a financial report, a stray decimal in pricing, or a logic bug in a data pipeline can ripple outward and cost millions—and chasing those hidden flaws often erases any time the AI saved. Security risks compound the problem. Because an agent blindly interprets any text it is fed, a malicious instruction buried inside an email, PDF, or even hidden HTML can hijack the system—what researchers call "prompt injection". If that agent also holds privileged credentials, one invisible line of text can trigger data exfiltration or an unauthorized purchase, turning the model into an unwitting insider threat. Handing high-powered tools to a model that can be tricked so easily is why regulators and risk teams are starting to worry. Regional regulations complicate deployment: Europe’s stricter privacy rules already delay some advanced features, forcing companies to navigate a patchwork of compliance before rolling agents out at scale. Yet squeezing the next few "nines" of accuracy from these systems is ruinously expensive. Improving a model from 98 percent to 99.99 percent reliability can demand orders of magnitude more data, compute, and money—much like the stalled quest for full self-driving cars. Hardware limits, energy costs, and a shortage of fresh, high-quality training data all hint that brute-force scaling is hitting a wall, and future gains will come from new architectures: smarter algorithms, not simply bigger models. History shows businesses tolerating small failure rates when the efficiency gain is significant, and AI will likely follow that pattern. Routine, rules-based office functions are already being automated, shifting human roles toward exception handling and firefighting. Far from ushering in leisure, faster tools usually raise management expectations. Employees often find their workload grows as they supervise the AI in addition to their original duties. Oversight remains the best defense. The most successful teams break work into micro-steps, subject every result to automated tests or human review, and immediately pull the agent off the task if something looks odd. Treating the model like a junior colleague who drafts at lightning speed but needs line-by-line checks prevents "tech debt" from snowballing. Without that discipline, whatever time you saved on first drafts you will pay back—often with interest—during debugging and clarification. The safest technical blueprint emerging today isolates privileges: one sub-agent can read untrusted text but has no authority to act, another holds the keys to sensitive tools but never sees raw external data. "Sandwiching" these models behind strict guardrails, capping their spending power with prepaid cards, and inserting human approval for anything high-impact are becoming basic.
Dmitry Baraishuk • 18 min read
BloombergGPT is Live. A Custom Large Language Model for Finance
BloombergGPT is Live. A Custom Large Language Model for Finance
BloombergGPT Is in Production or Not? The last official updates about custom financial LLM BloombergGPT came in 2023. All attempts to find out whether BloombergGPT is open source, to locate it on Hugging Face or GitHub, to download and try it, or just to understand how to access the BloombergGPT’s demo and its cost -  end in silence. The internet’s full of rumors. Some say BloombergGPT is already obsolete, frustrated by the lack of updates. Others laugh at the money spent, saying ChatGPT-4 does the same thing better, faster and cheaper for the end-user. They argue Bloomberg should’ve waited for stronger models. The team keeps quiet. Maybe because money doesn’t like noise. What matters: the model is already in production, built into Bloomberg’s stack. Doug Levin, a successfull startup founder, now at Harvard, wrote a review after testing BloombergGPT inside the Terminal. Called it a disruptive layer in Bloomberg’s legacy architecture. Not a research demo. A system already shaping workflows from the inside. Directly mentioned use cases from Doug Levin’s article: Financial report generation Market summaries Trading ideas or analysis Financial data analysis Market trend predictions Sentiment analysis Automated report generation Language translation Financial document text generation Risk assessment Real-time market updates Support in client communication Support in regulatory compliance S-1 analysis and modeling Search functionality via Bloomberg Search (SEAR) BloombergGPT-like LLM: Train from Scratch or Fine-Tune? A lot was written about BloombergGPT in broad terms: about the impressive results achieved by the team behind this custom financial LLM, how it outperformed many other models. But very little was said about what was happening backstage: what it’s actually like to develop a specialized model for the financial industry, how resource-intensive that process is, and what kinds of challenges these teams run into. Time to change that. Given that, we decided it was worth doing some reverse engineering: cutting through the PR noise in the available information to uncover insights that will likely stay relevant for a long time for any company that decides to build their own financial LLM. David Rosenberg, who leads the BloombergGPT development team, is still in his position (LinkedIn says so). And according to his social media, he insists that the information from mid-2023 about the model is still relevant. In this context, the information shared on The TWIML AI Podcast with Sam Charrington genuinely deserves close attention. "Using an API like OpenAI’s is not suitable for us: we have data we don’t want to send out. So for internal and sensitive use, in-house models are preferable." — David Rosenberg, The TWIML AI Podcast Let’s take a closer look at what else they discussed. Financial LLM use cases What the BloombergGPT's development team actually spent the most time on was thinking about financial LLM use cases from a variety of angles. For example, could BloombergGPT LLM help them solve problems they already had solutions for, but in a better way? Or with less investment in training data? They explored use cases like natural language to BQL (Bloomberg Query Language is used inside Bloomberg Terminal to pull structured financial data, the idea was to build a kind of financial code assistant that translates human language into Bloomberg-specific queries). They wanted an internal code assistant that understood their libraries. Or the ability to input a large document and interact with it: ask what information it contained, that sort of thing. In that sense, they were exploring many directions to see where the model could have the most impact. "As for production use, we need to be very cautious. No one has really solved the hallucination problem yet. These language models can say wrong things, do strange things, so there needs to be a process around making them safe to use, either internally or, eventually, with clients." — David Rosenberg, The TWIML AI Podcast They started with internal use, and in that context, it wasn’t so much about safety or reputation, it was about function: was it useful, did it do the job? That was their focus at the time. They were also aware that if people started relying on an LLM for internal tasks, they become less critical of its output.  So teams building custom LLMs needed a special system for checking its work. But then came the obvious question: if someone was always checking it anyway, should they just do the task themselves? In short, they kept things internal and focused on finance, code completion, and basic summarization tasks. Backstage of BloombergGPT Financial LLM Development Decision Behind Creating a Custom Financial LLM BloombergGPT is an example of a project where a team inside an enterprise trained and built a custom large language model specializing in financial language. The enterprise made a strategic decision to invest money, time, and human resources into this machine learning effort when GPT-3 was released. "The question was, is this a direction we pursue, we invest in? Because it was clearly a big investment. We didn’t know how much GPT-3 actually cost to make, but it was clear that it was a huge investment. We decided it was worth making the move. Maybe there’s some risk there, but it seemed like the possibilities were pretty great. That was kind of a decision made back in late 2020 - to start building towards this goal of our own GPT-3-style model. I’m not sure we knew exactly at that time what it would be used for. We're still experimenting to figure out how best to use it." — — David Rosenberg, The TWIML AI Podcast Training Dataset for BloombergGPT In some ways, it was a general-purpose model - but also purpose-built for financial applications. The training dataset was a mix of standard general-purpose data used for GPT-style models and Bloomberg’s proprietary financial data. About half of the dataset came from Bloomberg’s curated collection called FinPile, built over many years starting in 2007. It included financial reports, news articles, filings, press releases, earnings call transcripts, and other structured content. Some documents included tables and charts. They didn’t do any special processing for this training run - but when that information had already been extracted, they used it. Structured data wasn’t treated differently. It was tokenized like any other content. However, one area of concern was numerical data. Finance involves a heavy amount of numbers. They were concerned that the GPT-2 tokenizer didn’t treat numbers in any special way. A number like 5,234 could be split unpredictably - not digit-by-digit, not as a single unit - making it harder for the model to reason about numeric values. So they used character-level tokenization for numbers, allowing the model to learn digit structure and positional order. They followed an approach similar to Google’s PaLM model, where numbers were split into individual digits. This helped the model understand that the first digit carries the highest value, and so on - one way they adapted the model for financial data. BloombergGPT Training Process At the time, Meta’s OPT model had just been released. They used its training logs as a roadmap. Hugging Face’s BLOOM also published detailed logs, which helped guide their process. To reduce risk, they made their architecture as close as possible to something that had already worked. "We copied the BLOOM model architecture fairly closely, with some small tweaks. Tokenization and number handling were two key pieces. We called it v0."— David Rosenberg, The TWIML AI Podcast One thing they tried was sorting FinPile data chronologically - thinking newer data might be more accurate - while the rest of the data was randomly shuffled. For validation, they used the month immediately following the training set. They trained for 4-5 days and saw the loss curve level off. After 8-10 days, they stopped training. They suspected that curriculum learning (via time sorting) wasn’t helping. They restarted with fully shuffled data. That became version one of training. It started off stronger. Around day 8, the gradient norm spiked. Validation performance also dropped. They turned to the OPT paper’s troubleshooting guide, rolled back to a checkpoint, reshuffled data, and lowered the learning rate - but saw no major improvement. They investigated further and noticed something strange: out of 70 layers, the first layer’s layer norm scale weights dropped, then suddenly increased. This pointed to a bug: they were applying weight decay to weights that should’ve remained centered around one. They fixed this in version two, did a full code review, improved mixed-precision handling, and added an extra layer norm at the beginning. Then they restarted training. This time, it worked. They trained for 42 days, with a steady loss decrease. They hit some challenges around 75% of the dataset and eventually stopped training - performance had already exceeded expectations. Resources Used to Train BloombergGPT The core team included about nine people: Four focused on implementation Three on ML and data One on optimization and compute The rest handled evaluation, literature review, and support They trained on Amazon SageMaker using SMP (Sharded Model Parallelism) with 40GB A100 GPUs - 512 in total. They pre-purchased around 1.3 million GPU hours at a negotiated rate. Validation and Performance Evaluation During training, they used the last month of training data for validation. Later, they added a random validation set. They also monitored downstream tasks like MMLU and BBH (Big Bench Hard). After training, they did a full evaluation and compared BloombergGPT to OPT-66B, BLOOM, and GPT-NeoX. On general-purpose benchmarks, it was competitive. On financial tasks, it significantly outperformed open models. For example, on ConfFinQA (a benchmark requiring numerical reasoning in financial docs), it performed extremely well. They also had internal benchmarks: Sentiment analysis on financial news and social media Named entity disambiguation (linking "Apple" to its stock ticker, etc.) Natural language to BQL, which is like SQL for Bloomberg Terminal Even though the model wasn’t trained on BQL directly, it performed well in few-shot scenarios. They also experimented with headline generation and other generative tasks. What’s Next for the BloombergGPT Team? They had skipped a lot of early experimentation due to time constraints. Now that they have a working model, they’re going back to small-scale experiments - testing tokenization strategies, data mixtures, and architecture choices in a more disciplined way. They’re continuing instruction tuning using public data (like FLAN) and internal labeled datasets. They have rich internal data for tasks like entity recognition, which they’re formatting into query–response pairs for tuning. "We’re more interested now in smaller models. They’re easier to use: you can run inference on a single GPU. Our 50B model requires multi-GPU infrastructure. Inspired by the LLaMA paper, we’re exploring what we can achieve with smaller models, longer training, and careful design. We want both small and large models for practical use." — David Rosenberg, The TWIML AI Podcast Financial LLM: A Mirage? Financial LLMs are not a fantasy. There are already startups in this space that have gone to production, raised funding, and landed their first clients. The most well-known are those backed by YCombinator, for example, Truewind. So in fact, the question is simple: does the team have the Product Thinking or not? LLMs have made serious progress over the past two years, and it’s entirely possible that Bloomberg’s team are gradually shifting to a new technological foundation, replacing what they’d built before. Or maybe they’re just continuing to improve their system quietly - and we’ll hear updates soon enough. The key point: you can’t expect a financial LLM to do what it simply wasn’t built to do. LLMs have core limitations by design. This is clearly shown in this YCombinator discussion and explored in detail in this 2025 review of financial LLM capabilities. So, is a financial LLM a mirage? Yes, if used the wrong way. No, if used right. How Belitsoft Can Help Building a financial LLM is not a technical challenge. It’s a product challenge. Belitsoft helps financial firms build custom language models that are production-ready from the start. A good financial LLM needs to fit the firm’s data, language, workflows, and risk boundaries. This means: Training smaller models on tightly scoped tasks Fine-tuning existing open models with in-house financial content Building the validation systems Designing instruction datasets from internal annotations instead of starting from scratch Embedding the model behind interfaces users already trust, like dashboards and query layers The result is not just a model. It’s a usable system. A decision tool. An internal assistant. A document engine. Whatever your workflow needs. Belitsoft turns that strategy into a service: all deployments stay in-house. No data leaves the firm. No prompts go to external APIs. The entire model lifecycle is private, owned, and secure. Frequently Asked Questions
Dzmitry Garbar • 8 min read
Let's Talk Business
Do you have a software development project to implement? We have people to work on it. We will be glad to answer all your questions as well as estimate any project of yours. Use the form below to describe the project and we will get in touch with you within 1 business day.
Contact form
We will process your personal data as described in the privacy notice
This site is protected by reCAPTCHA and the Google Privacy Policy and Terms of Service apply
Contact us

USA +1 (917) 410-57-57
700 N Fairfax St Ste 614, Alexandria, VA, 22314 - 2040, United States

UK +44 (20) 3318-18-53
26/28 Hammersmith Grove, London W6 7HA

Poland +48 222 922 436
Warsaw, Poland, st. Elektoralna 13/103

Email us

[email protected]

to top