Build a Code Generation and Execution Agent with LangGraph
You give your LLM a task: “Calculate the average salary by department from this CSV.” It writes code that looks perfect. But when you run it, the column name is wrong and the script crashes. You fix it manually, re-prompt, and try again.
What if the agent could catch its own errors, rewrite the code, and retry — without you lifting a finger? That’s what we’re building here.
Before we write a single line, here’s how the data flows through this system.
Your user sends a natural language request. The agent node picks it up and calls the LLM, which generates Python code. That code flows to an execution node, where it runs inside an isolated sandbox.
The sandbox returns either a successful output or an error traceback. If the code succeeded, the agent checks whether the output actually answers the original question. If it does, the graph exits with the final answer.
If the code crashed or produced wrong results, the error feeds back to the agent node. The LLM reads the error, figures out what broke, and writes a corrected version. This loop continues until the code works — or we hit a retry limit.
Five pieces make up the system: the state (tracking messages, code, results, and retries), the code generation node, the execution node, an evaluation node, and the conditional routing that connects them.
What Is a Code Generation Agent?
A code generation agent is an LLM-powered system that writes code, runs it, and iterates on the result. It goes beyond simple code completion. The agent owns the full lifecycle: generate, execute, evaluate, and retry.
Standard LLM code generation works like this:
User prompt → LLM → Code (might work, might not)
A code generation agent works like this:
User prompt → Generate → Execute → Check → Fix if broken → Repeat → Answer
The critical difference is the feedback loop. The agent doesn’t just write code — it tests its own work. When the code crashes, the agent reads the traceback and writes a better version. This is the same process you’d follow as a developer, just automated.
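Stripped of any framework machinery, that lifecycle is just a loop. Here is a minimal plain-Python sketch — the `generate`, `execute`, and `evaluate` callables are placeholders for the LLM, sandbox, and checker we build later in this tutorial:

```python
def run_agent(request, generate, execute, evaluate, max_retries=3):
    """Skeleton of the generate-execute-evaluate loop, framework-free."""
    feedback = None
    for attempt in range(max_retries + 1):
        code = generate(request, feedback)    # LLM writes (or fixes) code
        ok, result = execute(code)            # run it in a sandbox
        if ok and evaluate(request, result):  # does output answer the request?
            return result
        feedback = result                     # traceback/critique fuels the retry
    return None                               # retry budget exhausted
```

LangGraph's contribution is making this control flow explicit as a graph, which is what lets you observe, interrupt, and guard every step.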
Prerequisites
- Python version: 3.10+
- Required libraries: langgraph (0.4+), langchain-openai (0.3+), langchain-core (0.3+)
- Install: pip install langgraph langchain-openai langchain-core
- API key: An OpenAI API key set as OPENAI_API_KEY. See OpenAI's docs to create one.
- Time to complete: ~40 minutes
- Prior knowledge: Basic LangGraph concepts (nodes, edges, state). If you’re new, start with our LangGraph installation and setup guide.
Step 1 — Define the Agent State
Every LangGraph agent starts with a state definition. The state is a typed dictionary that flows through every node in the graph. Each node reads from it, does work, and writes results back.
For our code agent, we need more than just messages. We track the generated code, the execution result, a success flag, and a retry counter. This gives every node the context it needs to decide what to do.
import os
from typing import Annotated, TypedDict
from langchain_openai import ChatOpenAI
from langchain_core.messages import (
HumanMessage,
AIMessage,
SystemMessage,
)
from langgraph.graph import StateGraph, START, END
from langgraph.graph.message import add_messages
class AgentState(TypedDict):
messages: Annotated[list, add_messages]
generated_code: str
execution_result: str
execution_succeeded: bool
retry_count: int
max_retries: int
Six fields in total. messages holds the conversation history and uses the add_messages reducer — it appends instead of replacing. generated_code stores the latest Python script the LLM produced. execution_result captures stdout or the error traceback.
The last three fields control the retry loop. execution_succeeded tells the router whether to proceed or retry. retry_count tracks how many times we’ve looped. max_retries sets the ceiling.
Quick check: Why does messages use a reducer while the other fields don’t? Because messages accumulate — each turn adds to the conversation. The other fields represent the current state of the last attempt. You want the latest code, not a list of every code version.
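To make the reducer distinction concrete, here is a dependency-free sketch of the two merge behaviors. These are plain illustrative functions, not langgraph internals — the real `add_messages` also handles message IDs and deduplication:

```python
def add_messages_like(current: list, update: list) -> list:
    """Reducer-style merge: append new messages to the history."""
    return current + update

def overwrite(current, update):
    """Default merge for fields without a reducer: keep only the latest."""
    return update

history = [{"role": "user", "content": "Average salary by department?"}]
history = add_messages_like(
    history, [{"role": "assistant", "content": "Generated code (attempt 1)"}]
)
print(len(history))  # 2 -- the conversation accumulates

code = "df.groupby('dept')['salery'].mean()"  # attempt 1 (typo on purpose)
code = overwrite(code, "df.groupby('dept')['salary'].mean()")
print(code)  # only the corrected attempt survives
```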
Step 2 — Build a Safe Code Executor
Here’s a question you should be asking: how do we safely run code that an LLM wrote? Running arbitrary Python on your machine is risky. One wrong os.remove() call and you’ve lost files.
We need a sandbox. For this tutorial, we’ll use Python’s subprocess module with a timeout. It runs code in a separate process with limited execution time. For production, I’d recommend Docker or langchain-sandbox for proper isolation.
The executor writes the code to a temporary file, runs it, and captures output. Three possible outcomes: success (stdout), failure (stderr), or timeout (killed process).
import subprocess
import sys
import tempfile

def execute_code_safely(code: str, timeout: int = 30) -> dict:
    """Execute Python code in a subprocess with a timeout."""
    with tempfile.NamedTemporaryFile(
        mode="w", suffix=".py", delete=False
    ) as f:
        f.write(code)
        temp_path = f.name
    try:
        # sys.executable guarantees the child runs under the same
        # interpreter as the agent, even when "python" isn't on PATH
        result = subprocess.run(
            [sys.executable, temp_path],
capture_output=True,
text=True,
timeout=timeout,
)
if result.returncode == 0:
return {
"success": True,
"output": result.stdout,
"error": "",
}
else:
return {
"success": False,
"output": result.stdout,
"error": result.stderr,
}
except subprocess.TimeoutExpired:
return {
"success": False,
"output": "",
"error": f"Code timed out after {timeout} seconds.",
}
finally:
os.unlink(temp_path)
The return dictionary has three keys: success (boolean), output (what the code printed), and error (traceback on failure). The finally block deletes the temp file no matter what happens.
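A quick way to see all three outcomes is to drive the subprocess directly with `-c` — an inline variant of the helper above with the same return contract, using `sys.executable` so the child runs under the same interpreter:

```python
import subprocess
import sys

def run_snippet(code: str, timeout: int = 5) -> dict:
    """Inline runner mirroring execute_code_safely's return contract."""
    try:
        r = subprocess.run(
            [sys.executable, "-c", code],
            capture_output=True,
            text=True,
            timeout=timeout,
        )
        return {
            "success": r.returncode == 0,
            "output": r.stdout,
            "error": r.stderr,
        }
    except subprocess.TimeoutExpired:
        return {
            "success": False,
            "output": "",
            "error": f"Code timed out after {timeout} seconds.",
        }

print(run_snippet("print(2 + 2)")["output"].strip())  # 4
print(run_snippet("1 / 0")["success"])                # False
```

The failure dict carries the full ZeroDivisionError traceback in its error key, which is exactly the feedback the retry loop feeds back to the LLM.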
Step 3 — Create the Code Generation Node
This is where the LLM writes code. The generation node builds a prompt, sends it to the model, and extracts the Python script from the response.
I’ve found the system prompt matters more than anything here. You need to tell the model explicitly: output only executable Python, always include print statements, and include all imports. Without these instructions, the model produces code with no visible output or missing dependencies.
On retries, the prompt grows. It includes the previous code and the error message. This gives the model concrete feedback — not “try again,” but “here’s exactly what broke.”
model = ChatOpenAI(model="gpt-4o-mini", temperature=0)
SYSTEM_PROMPT = """You are a Python code generation assistant.
1. Write complete, executable Python that solves the request.
2. Always include print() statements to show results.
3. Include all necessary imports at the top.
4. Handle potential errors with try/except where needed.
5. Output ONLY Python code — no explanations, no markdown.
If you receive an error from a previous attempt:
- Read the error carefully.
- Fix the specific issue.
- Do NOT rewrite everything unless necessary.
- Output the corrected code."""
def generate_code(state: AgentState) -> dict:
"""Generate Python code based on the user request."""
messages = state["messages"]
retry_count = state.get("retry_count", 0)
prompt_messages = [SystemMessage(content=SYSTEM_PROMPT)]
if retry_count > 0 and state.get("generated_code"):
error_context = (
f"\n\nYour previous code:\n"
f"```python\n{state['generated_code']}\n```\n\n"
f"Error encountered:\n{state['execution_result']}\n\n"
f"Fix the code. Attempt {retry_count + 1} of "
f"{state['max_retries']}."
)
prompt_messages.extend(messages)
prompt_messages.append(
HumanMessage(content=error_context)
)
else:
prompt_messages.extend(messages)
response = model.invoke(prompt_messages)
generated_code = response.content.strip()
# Strip markdown code fences if the model adds them
if generated_code.startswith("```python"):
generated_code = generated_code[9:]
if generated_code.startswith("```"):
generated_code = generated_code[3:]
if generated_code.endswith("```"):
generated_code = generated_code[:-3]
generated_code = generated_code.strip()
return {
"messages": [
AIMessage(
content=f"Generated code (attempt "
f"{retry_count + 1}):\n```python\n"
f"{generated_code}\n```"
)
],
"generated_code": generated_code,
}
Notice the fence-stripping at the bottom. Even with “output ONLY code” in the prompt, models sometimes wrap responses in markdown. We strip those so the executor gets clean Python.
On retries, the function appends error context as a follow-up message. The model sees the full conversation plus the specific failure. This beats starting from scratch — the model knows what it tried and what broke.
Step 4 — Build the Execution Node
The execution node is the simplest piece. It takes generated code from the state, passes it to the sandbox, and writes the result back. Think of it as a bridge between the LLM’s output and reality.
def execute_code(state: AgentState) -> dict:
"""Execute the generated code and capture the result."""
code = state["generated_code"]
result = execute_code_safely(code)
if result["success"]:
output_text = (
result["output"] if result["output"] else "(No output)"
)
return {
"messages": [
AIMessage(
content=f"Execution successful.\n"
f"Output:\n{output_text}"
)
],
"execution_result": output_text,
"execution_succeeded": True,
}
else:
error_text = result["error"]
return {
"messages": [
AIMessage(
content=f"Execution failed.\n"
f"Error:\n{error_text}"
)
],
"execution_result": error_text,
"execution_succeeded": False,
"retry_count": state.get("retry_count", 0) + 1,
}
Two paths here. Success stores the output and sets execution_succeeded to True. Failure stores the traceback, flips the flag to False, and bumps the retry counter by one.
Why increment retry_count in the execution node and not the generation node? Because retries are triggered by failed executions. This is where we know something went wrong. The generation node shouldn’t care about counting — it just writes code based on whatever context it receives.
Step 5 — Add the Evaluation Node
Code that runs without errors doesn’t always mean it produced the right answer. Your user asks “show me the top 5 products by revenue” and the code prints an unsorted list of every product. No crash — but completely wrong.
The evaluation node catches this. After successful execution, it asks the LLM to verify whether the output actually answers the original question. This is a second LLM call per cycle, and it’s worth every token.
def evaluate_result(state: AgentState) -> dict:
"""Check if the output answers the user's question."""
user_request = ""
for msg in state["messages"]:
if isinstance(msg, HumanMessage):
user_request = msg.content
break
eval_prompt = (
f"The user asked: '{user_request}'\n\n"
f"The code produced this output:\n"
f"{state['execution_result']}\n\n"
f"Does this output correctly and completely answer "
f"the user's request?\n"
f"Reply with exactly 'YES' or 'NO: <reason>'."
)
response = model.invoke(
[
SystemMessage(
content="You evaluate code execution results. "
"Be strict but fair."
),
HumanMessage(content=eval_prompt),
]
)
evaluation = response.content.strip()
is_correct = evaluation.upper().startswith("YES")
if is_correct:
return {
"messages": [
AIMessage(
content=f"Result verified.\n\n"
f"Final answer:\n"
f"{state['execution_result']}"
)
],
"execution_succeeded": True,
}
else:
return {
"messages": [
AIMessage(
content=f"Output doesn't match the "
f"request. Reason: {evaluation}"
)
],
"execution_result": (
f"Code ran but output was wrong. "
f"Evaluation: {evaluation}"
),
"execution_succeeded": False,
"retry_count": state.get("retry_count", 0) + 1,
}
The node extracts the original user request from the message history. It builds a yes/no question for the LLM and routes based on the answer. If the output passes, we’re done. If not, execution_succeeded flips to False and the retry counter goes up.
Is the extra LLM call worth the cost? I’d say yes. Without it, the agent returns wrong results with confidence. One extra API call prevents the user from trusting garbage output.
Step 6 — Wire the Graph with Conditional Routing
This is where the pieces snap together. We connect every node with edges — including conditional edges that route based on the state. The routing logic is the agent’s brain.
Three decisions drive the flow:
- After execution: Did the code succeed? Go to evaluation. Did it crash? Check retry budget.
- After evaluation: Did the output pass? End. Was it wrong? Retry.
- Retry guard: Have we hit max_retries? If yes, stop. If no, regenerate.
def route_after_execution(state: AgentState) -> str:
"""Decide what happens after code execution."""
if state["execution_succeeded"]:
return "evaluate"
elif state.get("retry_count", 0) >= state.get("max_retries", 3):
return "end"
else:
return "retry"
def route_after_evaluation(state: AgentState) -> str:
"""Decide what happens after result evaluation."""
if state["execution_succeeded"]:
return "end"
elif state.get("retry_count", 0) >= state.get("max_retries", 3):
return "end"
else:
return "retry"
Two small functions. route_after_execution checks the success flag first. If the code ran clean, it sends the result to evaluation. If it crashed and retries remain, it loops back to generation. Otherwise, it exits.
route_after_evaluation handles the subtler case: code that ran fine but produced wrong output. Same logic — pass or retry, with a retry ceiling.
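Because both routers are pure functions of the state, you can sanity-check every branch with plain dicts — no LLM, sandbox, or compiled graph required. The router is repeated here so the snippet runs standalone:

```python
def route_after_execution(state: dict) -> str:
    """Same logic as above: success -> evaluate, else retry or give up."""
    if state["execution_succeeded"]:
        return "evaluate"
    elif state.get("retry_count", 0) >= state.get("max_retries", 3):
        return "end"
    else:
        return "retry"

# Table-driven check covering all three branches.
cases = [
    ({"execution_succeeded": True, "retry_count": 0, "max_retries": 3}, "evaluate"),
    ({"execution_succeeded": False, "retry_count": 1, "max_retries": 3}, "retry"),
    ({"execution_succeeded": False, "retry_count": 3, "max_retries": 3}, "end"),
]
for state, expected in cases:
    assert route_after_execution(state) == expected
print("all routing branches behave as expected")
```

Testing routers this way catches missing "end" paths long before they show up as infinite graph cycles.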
Here’s the full graph assembly. Each add_node registers a function. Each add_edge or add_conditional_edges defines flow between nodes.
def build_code_agent() -> StateGraph:
"""Build and compile the code generation agent graph."""
graph = StateGraph(AgentState)
# Register nodes
graph.add_node("generate", generate_code)
graph.add_node("execute", execute_code)
graph.add_node("evaluate", evaluate_result)
# Start with code generation
graph.add_edge(START, "generate")
# After generation, always execute
graph.add_edge("generate", "execute")
# After execution, branch on success
graph.add_conditional_edges(
"execute",
route_after_execution,
{
"evaluate": "evaluate",
"retry": "generate",
"end": END,
},
)
# After evaluation, branch on correctness
graph.add_conditional_edges(
"evaluate",
route_after_evaluation,
{
"end": END,
"retry": "generate",
},
)
return graph.compile()
The flow: START connects to generate, which always connects to execute. After execute, we branch — success goes to evaluate, failure loops back to generate or exits. After evaluate, correct results exit, wrong results loop back.
That’s the entire architecture. Three nodes, two routing functions, one compiled graph.
Step 7 — Run the Agent
Let’s put the agent to work. The invoke call takes an initial state dictionary. We set the user’s message, initialize tracking fields, and cap retries at 3.
agent = build_code_agent()
result = agent.invoke(
{
"messages": [
HumanMessage(
content="Write a Python script that generates "
"a list of 10 random numbers between 1 and "
"100, sorts them, and prints the sorted list "
"along with the average."
)
],
"generated_code": "",
"execution_result": "",
"execution_succeeded": False,
"retry_count": 0,
"max_retries": 3,
}
)
# Print the final result
for msg in result["messages"]:
print(f"\n{'='*50}")
print(f"[{msg.__class__.__name__}]")
print(msg.content)
The agent typically solves this in one pass. The message history shows the user request, the generated code, the execution output, and the verification result.
But what happens when things go wrong? Let’s try a task designed to trigger a retry.
result_retry = agent.invoke(
{
"messages": [
HumanMessage(
content="Read a CSV file called 'sales.csv' "
"with columns 'product', 'region', 'revenue'. "
"Group by region, calculate total revenue, "
"and print the region with the highest total."
)
],
"generated_code": "",
"execution_result": "",
"execution_succeeded": False,
"retry_count": 0,
"max_retries": 3,
}
)
print(f"\nRetries used: {result_retry['retry_count']}")
print(f"Succeeded: {result_retry['execution_succeeded']}")
This example is revealing. The first attempt crashes with FileNotFoundError — sales.csv doesn’t exist. The agent reads the traceback, realizes the file is missing, and rewrites the code to create sample data inline. That self-correction is the whole point of building an agent instead of a one-shot chain.
How to Watch the Agent Think Step by Step
When you’re debugging, LangGraph’s streaming mode shows each node’s output in real time. Instead of waiting for the final result, you see every decision as it happens.
agent = build_code_agent()
for event in agent.stream(
{
"messages": [
HumanMessage(
content="Calculate the first 20 Fibonacci "
"numbers and print them."
)
],
"generated_code": "",
"execution_result": "",
"execution_succeeded": False,
"retry_count": 0,
"max_retries": 3,
}
):
for node_name, node_output in event.items():
print(f"\n--- Node: {node_name} ---")
if "generated_code" in node_output:
code_preview = node_output["generated_code"][:200]
print(f"Code:\n{code_preview}...")
if "execution_result" in node_output:
print(f"Result: {node_output['execution_result'][:200]}")
if "execution_succeeded" in node_output:
print(f"Success: {node_output['execution_succeeded']}")
You’ll see three nodes fire in sequence: generate outputs the Fibonacci code, execute prints the 20 numbers, evaluate confirms the output matches. For tasks that trigger retries, generate fires multiple times with improved code each round.
Adding Guard Rails for Production
A production code agent needs safety boundaries. Without them, the agent could generate harmful code, run forever, or drain your API budget. Here are the guard rails I’d add before deploying this to real users.
Guard rail 1: Code safety scanning. Before execution, scan for dangerous patterns. This isn’t a full security scanner — just a blocklist for the obvious threats.
FORBIDDEN_PATTERNS = [
"os.remove", "os.rmdir", "shutil.rmtree",
"subprocess.call", "subprocess.run",
"os.system", "__import__",
"eval(", "exec(",
"open(", "pathlib",
]
def check_code_safety(code: str) -> tuple[bool, str]:
"""Check generated code for dangerous patterns."""
for pattern in FORBIDDEN_PATTERNS:
if pattern in code:
return False, f"Blocked: contains '{pattern}'"
return True, "Code passed safety check"
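String matching has blind spots in both directions: a comment that merely mentions os.system gets blocked, while `getattr(__import__('o' + 's'), ...)` slips through. For fewer false positives, one option is to inspect the parsed AST instead. This is a sketch, not a complete scanner — it blocks imports of the listed modules but won't catch importlib tricks or attribute games:

```python
import ast

# Hypothetical blocklist for illustration; tune to your threat model.
BLOCKED_MODULES = {"os", "subprocess", "shutil", "sys"}

def check_code_safety_ast(code: str) -> tuple[bool, str]:
    """Reject code that imports blocked modules (AST-based sketch)."""
    try:
        tree = ast.parse(code)
    except SyntaxError as e:
        return False, f"Blocked: code does not parse ({e.msg})"
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            for alias in node.names:
                if alias.name.split(".")[0] in BLOCKED_MODULES:
                    return False, f"Blocked: imports '{alias.name}'"
        elif isinstance(node, ast.ImportFrom):
            if node.module and node.module.split(".")[0] in BLOCKED_MODULES:
                return False, f"Blocked: imports from '{node.module}'"
    return True, "Code passed AST safety check"

print(check_code_safety_ast("import os\nos.remove('x')")[0])  # False
print(check_code_safety_ast("print(sum([1, 2, 3]))")[0])      # True
```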
Guard rail 2: Cost tracking. Each retry adds a full generation-plus-evaluation cycle to the bill, so cost grows with every loop. Track estimated spend and set a hard ceiling per request.
def create_cost_tracker(max_cost: float = 0.10):
    """Track estimated API costs per request."""
    total = 0.0

    def check_cost() -> bool:
        nonlocal total
        # ~$0.005 per generation + evaluation cycle
        total += 0.005
        return total <= max_cost

    return check_cost
Guard rail 3: Code length limits. If the model generates a 500-line script for a simple task, something’s off. Cap it at a reasonable length.
MAX_CODE_LINES = 100
def validate_code_length(code: str) -> bool:
"""Reject suspiciously long generated code."""
return len(code.strip().split("\n")) <= MAX_CODE_LINES
Guard rail 4: Retry ceiling. Three retries is a sensible default. Beyond that, the model probably misunderstands the task — it’s not a typo it can fix.
When to Use a Code Agent (and When Not To)
Code generation agents are powerful, but they’re the wrong tool for many situations. Here’s my take on where they fit.
Good fit:
- Data analysis — “Calculate X from this dataset.” Short code, verifiable output, retries handle edge cases.
- Format conversion — “Convert this JSON to CSV with these columns.” Clear input, clear success criteria.
- Math computation — “Solve this system of equations.” The LLM writes NumPy code, the answer is checkable.
- One-off automation — “Rename files matching this pattern.” (With proper sandboxing.)
Bad fit:
- Long-running processes — ML training takes hours. You don’t want a retry loop on a 3-hour job.
- No clear output — “Write a web server.” There’s no single result to evaluate.
- Side effects — Database writes, API calls. Each retry could create duplicates or corrupt data.
- Subjective judgment — “Make this chart look professional.” The evaluation node can’t assess aesthetics.
Common Mistakes and Troubleshooting
These are the errors that come up most when building code agents.
1. Infinite retry loops
Symptom: a GraphRecursionError (LangGraph's recursion-limit error) or the graph simply hangs
This happens when max_retries isn’t enforced or the counter doesn’t increment. Check that the execution node bumps retry_count on failure. Also verify your routing function — a missing "end" path creates an infinite cycle.
2. Code fence contamination
Symptom: SyntaxError: invalid syntax (line 1: ```python)
The LLM wraps code in markdown fences despite the prompt saying not to. The generate_code function strips these, but edge cases slip through. Build a more aggressive stripper that handles backticks, language tags, and partial fences.
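One way to write that stripper is a single regex pass over leading and trailing fences. A sketch — extend the pattern if your model emits other fence variants (the fence string is built with `"`" * 3` only so this article's own code fences stay intact):

```python
import re

FENCE = "`" * 3  # literal triple-backtick
FENCE_RE = re.compile(
    rf"^\s*{FENCE}[a-zA-Z0-9_+-]*\s*\n?|\n?\s*{FENCE}\s*$"
)

def strip_code_fences(text: str) -> str:
    """Remove leading/trailing markdown fences, language tags included."""
    return FENCE_RE.sub("", text.strip()).strip()

# Handles full fences, bare fences, no fences, and a dangling opener.
samples = [
    FENCE + "python\nprint('hi')\n" + FENCE,
    FENCE + "\nprint('hi')\n" + FENCE,
    "print('hi')",
    FENCE + "python\nprint('hi')",
]
for s in samples:
    print(strip_code_fences(s))  # prints print('hi') four times
```

Anchoring the pattern to the start and end of the string means backticks inside the code itself (say, in a docstring) are left alone.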
3. Missing imports in generated code
Symptom: ModuleNotFoundError: No module named 'pandas'
Two distinct problems here. Either the model forgot the import (fix the system prompt) or the library isn’t installed in the execution environment. Your sandbox and dev environment may have different packages installed.
4. Stale state between invocations
If you reuse the compiled graph, pass a fresh initial state each time. Leftover retry_count or generated_code from a previous run confuses the agent. Reset every field on each new request.
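A small factory function makes a fresh state the default rather than something you must remember. This is a hypothetical helper — dict-form messages are used to keep the snippet dependency-free, but in the tutorial code you would pass HumanMessage(user_request) instead:

```python
def fresh_state(user_request: str, max_retries: int = 3) -> dict:
    """Build a clean initial state for every invocation."""
    return {
        "messages": [{"role": "user", "content": user_request}],
        "generated_code": "",
        "execution_result": "",
        "execution_succeeded": False,
        "retry_count": 0,
        "max_retries": max_retries,
    }

# Every call starts from zero -- no leftover retry_count or stale code.
state = fresh_state("Sum the numbers 1 to 100.")
print(state["retry_count"])  # 0
```

agent.invoke(fresh_state("...")) then replaces the hand-written state dicts in the examples above.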
Complete Code
Click to expand the full script (copy-paste and run)
# Complete code from: Building a Code Generation and Execution Agent
# Requires: pip install langgraph langchain-openai langchain-core
# Python 3.10+
# Set OPENAI_API_KEY environment variable before running
import os
import subprocess
import sys
import tempfile
from typing import Annotated, TypedDict
from langchain_openai import ChatOpenAI
from langchain_core.messages import (
    HumanMessage,
    AIMessage,
    SystemMessage,
)
from langgraph.graph import StateGraph, START, END
from langgraph.graph.message import add_messages
# --- State Definition ---
class AgentState(TypedDict):
    messages: Annotated[list, add_messages]
    generated_code: str
    execution_result: str
    execution_succeeded: bool
    retry_count: int
    max_retries: int
# --- Safe Code Executor ---
def execute_code_safely(code: str, timeout: int = 30) -> dict:
    """Execute Python code in a subprocess with a timeout."""
    with tempfile.NamedTemporaryFile(
        mode="w", suffix=".py", delete=False
    ) as f:
        f.write(code)
        temp_path = f.name
    try:
        # sys.executable runs the child under the same interpreter
        result = subprocess.run(
            [sys.executable, temp_path],
capture_output=True,
text=True,
timeout=timeout,
)
if result.returncode == 0:
return {
"success": True,
"output": result.stdout,
"error": "",
}
else:
return {
"success": False,
"output": result.stdout,
"error": result.stderr,
}
except subprocess.TimeoutExpired:
return {
"success": False,
"output": "",
"error": f"Code timed out after {timeout} seconds.",
}
finally:
os.unlink(temp_path)
# --- Code Generation Node ---
model = ChatOpenAI(model="gpt-4o-mini", temperature=0)
SYSTEM_PROMPT = """You are a Python code generation assistant.
1. Write complete, executable Python that solves the request.
2. Always include print() statements to show results.
3. Include all necessary imports at the top.
4. Handle potential errors with try/except where needed.
5. Output ONLY Python code — no explanations, no markdown.
If you receive an error from a previous attempt:
- Read the error carefully.
- Fix the specific issue.
- Do NOT rewrite everything unless necessary.
- Output the corrected code."""
def generate_code(state: AgentState) -> dict:
"""Generate Python code based on the user request."""
messages = state["messages"]
retry_count = state.get("retry_count", 0)
prompt_messages = [SystemMessage(content=SYSTEM_PROMPT)]
if retry_count > 0 and state.get("generated_code"):
error_context = (
f"\n\nYour previous code:\n"
f"```python\n{state['generated_code']}\n```\n\n"
f"Error encountered:\n{state['execution_result']}\n\n"
f"Fix the code. Attempt {retry_count + 1} of "
f"{state['max_retries']}."
)
prompt_messages.extend(messages)
prompt_messages.append(
HumanMessage(content=error_context)
)
else:
prompt_messages.extend(messages)
response = model.invoke(prompt_messages)
generated_code = response.content.strip()
if generated_code.startswith("```python"):
generated_code = generated_code[9:]
if generated_code.startswith("```"):
generated_code = generated_code[3:]
if generated_code.endswith("```"):
generated_code = generated_code[:-3]
generated_code = generated_code.strip()
return {
"messages": [
AIMessage(
content=f"Generated code (attempt "
f"{retry_count + 1}):\n```python\n"
f"{generated_code}\n```"
)
],
"generated_code": generated_code,
}
# --- Execution Node ---
def execute_code(state: AgentState) -> dict:
"""Execute the generated code and capture the result."""
code = state["generated_code"]
result = execute_code_safely(code)
if result["success"]:
output_text = (
result["output"] if result["output"] else "(No output)"
)
return {
"messages": [
AIMessage(
content=f"Execution successful.\n"
f"Output:\n{output_text}"
)
],
"execution_result": output_text,
"execution_succeeded": True,
}
else:
error_text = result["error"]
return {
"messages": [
AIMessage(
content=f"Execution failed.\n"
f"Error:\n{error_text}"
)
],
"execution_result": error_text,
"execution_succeeded": False,
"retry_count": state.get("retry_count", 0) + 1,
}
# --- Evaluation Node ---
def evaluate_result(state: AgentState) -> dict:
"""Check if the output answers the user's question."""
user_request = ""
for msg in state["messages"]:
if isinstance(msg, HumanMessage):
user_request = msg.content
break
eval_prompt = (
f"The user asked: '{user_request}'\n\n"
f"The code produced this output:\n"
f"{state['execution_result']}\n\n"
f"Does this output correctly and completely answer "
f"the user's request?\n"
f"Reply with exactly 'YES' or 'NO: <reason>'."
)
response = model.invoke(
[
SystemMessage(
content="You evaluate code execution results. "
"Be strict but fair."
),
HumanMessage(content=eval_prompt),
]
)
evaluation = response.content.strip()
is_correct = evaluation.upper().startswith("YES")
if is_correct:
return {
"messages": [
AIMessage(
content=f"Result verified.\n\n"
f"Final answer:\n"
f"{state['execution_result']}"
)
],
"execution_succeeded": True,
}
else:
return {
"messages": [
AIMessage(
content=f"Output doesn't match the "
f"request. Reason: {evaluation}"
)
],
"execution_result": (
f"Code ran but output was wrong. "
f"Evaluation: {evaluation}"
),
"execution_succeeded": False,
"retry_count": state.get("retry_count", 0) + 1,
}
# --- Routing Functions ---
def route_after_execution(state: AgentState) -> str:
"""Decide what happens after code execution."""
if state["execution_succeeded"]:
return "evaluate"
elif state.get("retry_count", 0) >= state.get("max_retries", 3):
return "end"
else:
return "retry"
def route_after_evaluation(state: AgentState) -> str:
"""Decide what happens after result evaluation."""
if state["execution_succeeded"]:
return "end"
elif state.get("retry_count", 0) >= state.get("max_retries", 3):
return "end"
else:
return "retry"
# --- Graph Assembly ---
def build_code_agent():
"""Build and compile the code generation agent graph."""
graph = StateGraph(AgentState)
graph.add_node("generate", generate_code)
graph.add_node("execute", execute_code)
graph.add_node("evaluate", evaluate_result)
graph.add_edge(START, "generate")
graph.add_edge("generate", "execute")
graph.add_conditional_edges(
"execute",
route_after_execution,
{
"evaluate": "evaluate",
"retry": "generate",
"end": END,
},
)
graph.add_conditional_edges(
"evaluate",
route_after_evaluation,
{
"end": END,
"retry": "generate",
},
)
return graph.compile()
# --- Run the Agent ---
if __name__ == "__main__":
agent = build_code_agent()
result = agent.invoke(
{
"messages": [
HumanMessage(
content="Write a Python script that generates "
"a list of 10 random numbers between 1 and "
"100, sorts them, and prints the sorted list "
"along with the average."
)
],
"generated_code": "",
"execution_result": "",
"execution_succeeded": False,
"retry_count": 0,
"max_retries": 3,
}
)
for msg in result["messages"]:
print(f"\n{'='*50}")
print(f"[{msg.__class__.__name__}]")
print(msg.content)
print(f"\nRetries used: {result['retry_count']}")
print(f"Succeeded: {result['execution_succeeded']}")
print("\nScript completed successfully.")
Exercise: Add a Code Safety Node
Here’s a challenge that tests your understanding of the graph structure. Add a safety checking node between generate and execute — a gatekeeper that blocks dangerous code before it runs.
Your task: Create a check_safety node that:
- Reads state["generated_code"]
- Scans for forbidden patterns (from the guard rails section)
- If dangerous, returns an error and increments retry_count
- If safe, passes the code through to execution
You’ll also need to rewire the graph: generate connects to check_safety, and check_safety uses conditional edges to route to execute or back to generate.
Hints
**Hint 1:** The safety node needs its own routing function. Model it after `route_after_execution` — check a boolean flag.
**Hint 2:** Add a `code_is_safe` boolean field to `AgentState`. The safety node sets it. The conditional edge reads it.
Solution
# Add to AgentState:
# code_is_safe: bool
FORBIDDEN_PATTERNS = [
"os.remove", "os.rmdir", "shutil.rmtree",
"subprocess.call", "subprocess.run",
"os.system", "__import__",
"eval(", "exec(",
]
def check_safety(state: AgentState) -> dict:
"""Scan generated code for dangerous patterns."""
code = state["generated_code"]
for pattern in FORBIDDEN_PATTERNS:
if pattern in code:
return {
"messages": [
AIMessage(
content=f"Safety check failed: "
f"code contains '{pattern}'. "
f"Regenerating..."
)
],
"execution_result": (
f"Code blocked: contains '{pattern}'"
),
"execution_succeeded": False,
"code_is_safe": False,
"retry_count": state.get("retry_count", 0) + 1,
}
return {"code_is_safe": True}
def route_after_safety(state: AgentState) -> str:
if state.get("code_is_safe", False):
return "execute"
elif state.get("retry_count", 0) >= state.get("max_retries", 3):
return "end"
else:
return "retry"
# Updated graph wiring:
# graph.add_edge("generate", "check_safety")
# graph.add_conditional_edges(
# "check_safety", route_after_safety,
# {"execute": "execute", "retry": "generate", "end": END}
# )
The key change: `generate` no longer connects to `execute` directly. The safety node sits between them as a gatekeeper. If code passes, it flows to execution. If it fails, the agent loops back with the violation as context — and the LLM rewrites to avoid the blocked pattern.
Exercise: Support Multi-Step Tasks
This exercise pushes you to extend the agent’s capabilities. Some tasks need multiple code executions in sequence — for example: “Create a CSV with sample data, then read it and calculate statistics.”
Your task: Modify the agent so it handles multi-step requests. The agent should:
- Parse the request into sequential steps
- Generate and execute code for each step
- Verify each step before moving to the next
- Return the final result after all steps complete
Hints
**Hint 1:** Add `task_steps: list[str]` and `current_step: int` to the state. Create a planning node that breaks the request into steps.
**Hint 2:** After each successful evaluation, check if `current_step < len(task_steps) - 1`. If more steps remain, increment the counter and route back to `generate`.
Solution
# Add to AgentState:
# task_steps: list[str]
# current_step: int
def plan_task(state: AgentState) -> dict:
"""Break the user request into sequential steps."""
user_request = state["messages"][-1].content
plan_prompt = (
f"Break this task into sequential Python steps:\n"
f"'{user_request}'\n\n"
f"Return each step on a new line, numbered. "
f"Each step must be independently executable."
)
response = model.invoke(
[SystemMessage(content="You are a task planner."),
HumanMessage(content=plan_prompt)]
)
steps = [
line.strip()
for line in response.content.strip().split("\n")
if line.strip() and line.strip()[0].isdigit()
]
return {
"task_steps": steps,
"current_step": 0,
"messages": [
AIMessage(
content=f"Plan: {len(steps)} steps identified."
)
],
}
# Modified graph: START → plan → generate → execute → evaluate
# After evaluation, check current_step vs len(task_steps).
# If more steps remain, increment and route to generate.
# If all done, route to END.
The planning node uses a separate LLM call to decompose the request. Each step becomes its own generate-execute-evaluate cycle. The `current_step` counter tracks progress, and the routing logic decides whether to continue or finish.
Summary
You’ve built a complete code generation and execution agent in LangGraph. The system writes Python code, runs it in a sandbox, evaluates the output, and retries on failure — all through a state machine with explicit routing.
Here’s what we covered:
- State design — tracking code, results, success flags, and retries in a TypedDict
- Safe execution — running generated code in a subprocess with timeout
- Code generation — prompting the LLM with error context for retries
- Output evaluation — a second LLM call to verify results match the request
- Conditional routing — wiring the retry loop with LangGraph’s conditional edges
- Guard rails — safety scanning, cost tracking, and code length limits
The agent pattern we built is a foundation. Extend it with better sandboxing (Docker, E2B), multi-step planning, persistent memory, or domain-specific prompts for your use case.
FAQ
Can this agent handle tasks that need external libraries?
Yes, as long as those libraries are installed in the execution environment. The agent generates imports — if pandas exists where the subprocess runs, the code works. Otherwise, the agent sees ModuleNotFoundError and may rewrite to avoid that dependency.
# Verify a library is available before running agent tasks
import importlib
try:
importlib.import_module("pandas")
print("pandas is available")
except ImportError:
print("pandas is NOT installed")
How do I switch to a cloud sandbox?
Replace execute_code_safely with a call to your sandbox provider. E2B, Modal, and LangChain Sandbox all expose an execute() method that takes code and returns output. The rest of the graph stays identical — only the execution backend changes.
What does each request cost?
Each generation call costs roughly $0.002–$0.005 with GPT-4o-mini. Evaluation adds $0.001–$0.002. A first-attempt success costs about $0.004 total. Three retries cost about $0.015. The evaluation node is the first thing to optimize at scale — skip it for tasks with trivially verifiable outputs.
Can I use a local model instead of OpenAI?
Swap ChatOpenAI for any LangChain-compatible chat model. Ollama, vLLM, and Anthropic all work. Code quality depends on the model — GPT-4o and Claude produce reliable code, while smaller models may need more retries.
Topic Cluster: LangGraph Agent Patterns
This article is part of the LangGraph series on MachineLearningPlus. Related articles:
- What Is LangGraph and Why Does It Exist?
- LangGraph Installation, Setup, and First Graph
- LangGraph State Management — TypedDict and Reducers
- LangGraph Conditional Edges and Routing
- Build a ReAct Agent from Scratch with LangGraph
- LangGraph Tool Calling and Agent Actions
- LangGraph Error Handling, Retries, and Fallbacks
- LangGraph Multi-Agent Systems — Supervisor, Swarm, Network
- LangGraph RAG Agent — Retrieval-Augmented Generation
- LangGraph Persistence and Checkpointing
References
- LangGraph documentation — Graph API overview. Link
- LangGraph documentation — StateGraph and conditional edges. Link
- LangChain documentation — Sandboxes for Deep Agents. Link
- langchain-sandbox — PyPI package for safe Python execution. Link
- E2B documentation — Code Interpreter with LangGraph. Link
- LangChain blog — Execute Code with Sandboxes. Link
- Modal documentation — Build a coding agent with LangGraph. Link
- Yao, S. et al. — “ReAct: Synergizing Reasoning and Acting in Language Models.” ICLR 2023. Link