Build a Code Generation and Execution Agent with LangGraph
You give your LLM a task: “Calculate the average salary by department from this CSV.” It writes code that looks perfect. But when you run it, the column name is wrong and the script crashes. You fix it manually, re-prompt, and try again.
What if the agent could catch its own errors, rewrite the code, and retry — without you lifting a finger? That’s what we’re building here.
Before we write a single line, here’s how the data flows through this system.
Your user sends a natural language request. The agent node picks it up and calls the LLM, which generates Python code. That code flows to an execution node, where it runs inside an isolated sandbox.
The sandbox returns either a successful output or an error traceback. If the code succeeded, the agent checks whether the output actually answers the original question. If it does, the graph exits with the final answer.
If the code crashed or produced wrong results, the error feeds back to the agent node. The LLM reads the error, figures out what broke, and writes a corrected version. This loop continues until the code works — or we hit a retry limit.
Five pieces make up the system: the state (tracking messages, code, results, and retries), the code generation node, the execution node, an evaluation node, and the conditional routing that connects them.
What Is a Code Generation Agent?
A code generation agent is an LLM-powered system that writes code, runs it, and iterates on the result. It goes beyond simple code completion. The agent owns the full lifecycle: generate, execute, evaluate, and retry.
Standard LLM code generation works like this:
User prompt → LLM → Code (might work, might not)
A code generation agent works like this:
User prompt → Generate → Execute → Check → Fix if broken → Repeat → Answer
The critical difference is the feedback loop. The agent doesn’t just write code — it tests its own work. When the code crashes, the agent reads the traceback and writes a better version. This is the same process you’d follow as a developer, just automated.
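Stripped of any framework machinery, that lifecycle is just a loop. Here is a minimal plain-Python sketch — the `generate`, `execute`, and `evaluate` callables are placeholders for the LLM, sandbox, and checker we build later in this tutorial:

```python
def run_agent(request, generate, execute, evaluate, max_retries=3):
    """Skeleton of the generate-execute-evaluate loop, framework-free."""
    feedback = None
    for attempt in range(max_retries + 1):
        code = generate(request, feedback)    # LLM writes (or fixes) code
        ok, result = execute(code)            # run it in a sandbox
        if ok and evaluate(request, result):  # does output answer the request?
            return result
        feedback = result                     # traceback/critique fuels the retry
    return None                               # retry budget exhausted
```

LangGraph's contribution is making this control flow explicit as a graph, which is what lets you observe, interrupt, and guard every step.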
Prerequisites
- Python version: 3.10+
- Required libraries: langgraph (0.4+), langchain-openai (0.3+), langchain-core (0.3+)
- Install: pip install langgraph langchain-openai langchain-core
- API key: An OpenAI API key set as OPENAI_API_KEY. See OpenAI's docs to create one.
- Time to complete: ~40 minutes
- Prior knowledge: Basic LangGraph concepts (nodes, edges, state). If you’re new, start with our LangGraph installation and setup guide.
Step 1 — Define the Agent State
Every LangGraph agent starts with a state definition. The state is a typed dictionary that flows through every node in the graph. Each node reads from it, does work, and writes results back.
For our code agent, we need more than just messages. We track the generated code, the execution result, a success flag, and a retry counter. This gives every node the context it needs to decide what to do.
import os
from typing import Annotated, TypedDict
from langchain_openai import ChatOpenAI
from langchain_core.messages import (
HumanMessage,
AIMessage,
SystemMessage,
)
from langgraph.graph import StateGraph, START, END
from langgraph.graph.message import add_messages
class AgentState(TypedDict):
messages: Annotated[list, add_messages]
generated_code: str
execution_result: str
execution_succeeded: bool
retry_count: int
max_retries: int
Six fields in total. messages holds the conversation history and uses the add_messages reducer — it appends instead of replacing. generated_code stores the latest Python script the LLM produced. execution_result captures stdout or the error traceback.
The last three fields control the retry loop. execution_succeeded tells the router whether to proceed or retry. retry_count tracks how many times we’ve looped. max_retries sets the ceiling.
Quick check: Why does messages use a reducer while the other fields don’t? Because messages accumulate — each turn adds to the conversation. The other fields represent the current state of the last attempt. You want the latest code, not a list of every code version.
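To make the reducer distinction concrete, here is a dependency-free sketch of the two merge behaviors. These are plain illustrative functions, not langgraph internals — the real `add_messages` also handles message IDs and deduplication:

```python
def add_messages_like(current: list, update: list) -> list:
    """Reducer-style merge: append new messages to the history."""
    return current + update

def overwrite(current, update):
    """Default merge for fields without a reducer: keep only the latest."""
    return update

history = [{"role": "user", "content": "Average salary by department?"}]
history = add_messages_like(
    history, [{"role": "assistant", "content": "Generated code (attempt 1)"}]
)
print(len(history))  # 2 -- the conversation accumulates

code = "df.groupby('dept')['salery'].mean()"  # attempt 1 (typo on purpose)
code = overwrite(code, "df.groupby('dept')['salary'].mean()")
print(code)  # only the corrected attempt survives
```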
Step 2 — Build a Safe Code Executor
Here’s a question you should be asking: how do we safely run code that an LLM wrote? Running arbitrary Python on your machine is risky. One wrong os.remove() call and you’ve lost files.
We need a sandbox. For this tutorial, we’ll use Python’s subprocess module with a timeout. It runs code in a separate process with limited execution time. For production, I’d recommend Docker or langchain-sandbox for proper isolation.
The executor writes the code to a temporary file, runs it, and captures output. Three possible outcomes: success (stdout), failure (stderr), or timeout (killed process).
import subprocess
import sys
import tempfile

def execute_code_safely(code: str, timeout: int = 30) -> dict:
    """Execute Python code in a subprocess with a timeout."""
    with tempfile.NamedTemporaryFile(
        mode="w", suffix=".py", delete=False
    ) as f:
        f.write(code)
        temp_path = f.name
    try:
        # sys.executable guarantees the child runs under the same
        # interpreter as the agent, even when "python" isn't on PATH
        result = subprocess.run(
            [sys.executable, temp_path],
capture_output=True,
text=True,
timeout=timeout,
)
if result.returncode == 0:
return {
"success": True,
"output": result.stdout,
"error": "",
}
else:
return {
"success": False,
"output": result.stdout,
"error": result.stderr,
}
except subprocess.TimeoutExpired:
return {
"success": False,
"output": "",
"error": f"Code timed out after {timeout} seconds.",
}
finally:
os.unlink(temp_path)
The return dictionary has three keys: success (boolean), output (what the code printed), and error (traceback on failure). The finally block deletes the temp file no matter what happens.
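A quick way to see all three outcomes is to drive the subprocess directly with `-c` — an inline variant of the helper above with the same return contract, using `sys.executable` so the child runs under the same interpreter:

```python
import subprocess
import sys

def run_snippet(code: str, timeout: int = 5) -> dict:
    """Inline runner mirroring execute_code_safely's return contract."""
    try:
        r = subprocess.run(
            [sys.executable, "-c", code],
            capture_output=True,
            text=True,
            timeout=timeout,
        )
        return {
            "success": r.returncode == 0,
            "output": r.stdout,
            "error": r.stderr,
        }
    except subprocess.TimeoutExpired:
        return {
            "success": False,
            "output": "",
            "error": f"Code timed out after {timeout} seconds.",
        }

print(run_snippet("print(2 + 2)")["output"].strip())  # 4
print(run_snippet("1 / 0")["success"])                # False
```

The failure dict carries the full ZeroDivisionError traceback in its error key, which is exactly the feedback the retry loop feeds back to the LLM.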
Step 3 — Create the Code Generation Node
This is where the LLM writes code. The generation node builds a prompt, sends it to the model, and extracts the Python script from the response.
I’ve found the system prompt matters more than anything here. You need to tell the model explicitly: output only executable Python, always include print statements, and include all imports. Without these instructions, the model produces code with no visible output or missing dependencies.
On retries, the prompt grows. It includes the previous code and the error message. This gives the model concrete feedback — not “try again,” but “here’s exactly what broke.”
model = ChatOpenAI(model="gpt-4o-mini", temperature=0)
SYSTEM_PROMPT = """You are a Python code generation assistant.
1. Write complete, executable Python that solves the request.
2. Always include print() statements to show results.
3. Include all necessary imports at the top.
4. Handle potential errors with try/except where needed.
5. Output ONLY Python code — no explanations, no markdown.
If you receive an error from a previous attempt:
- Read the error carefully.
- Fix the specific issue.
- Do NOT rewrite everything unless necessary.
- Output the corrected code."""
def generate_code(state: AgentState) -> dict:
"""Generate Python code based on the user request."""
messages = state["messages"]
retry_count = state.get("retry_count", 0)
prompt_messages = [SystemMessage(content=SYSTEM_PROMPT)]
if retry_count > 0 and state.get("generated_code"):
error_context = (
f"\n\nYour previous code:\n"
f"```python\n{state['generated_code']}\n```\n\n"
f"Error encountered:\n{state['execution_result']}\n\n"
f"Fix the code. Attempt {retry_count + 1} of "
f"{state['max_retries']}."
)
prompt_messages.extend(messages)
prompt_messages.append(
HumanMessage(content=error_context)
)
else:
prompt_messages.extend(messages)
response = model.invoke(prompt_messages)
generated_code = response.content.strip()
# Strip markdown code fences if the model adds them
if generated_code.startswith("```python"):
generated_code = generated_code[9:]
if generated_code.startswith("```"):
generated_code = generated_code[3:]
if generated_code.endswith("```"):
generated_code = generated_code[:-3]
generated_code = generated_code.strip()
return {
"messages": [
AIMessage(
content=f"Generated code (attempt "
f"{retry_count + 1}):\n```python\n"
f"{generated_code}\n```"
)
],
"generated_code": generated_code,
}
Notice the fence-stripping at the bottom. Even with “output ONLY code” in the prompt, models sometimes wrap responses in markdown. We strip those so the executor gets clean Python.
On retries, the function appends error context as a follow-up message. The model sees the full conversation plus the specific failure. This beats starting from scratch — the model knows what it tried and what broke.
Step 4 — Build the Execution Node
The execution node is the simplest piece. It takes generated code from the state, passes it to the sandbox, and writes the result back. Think of it as a bridge between the LLM’s output and reality.
def execute_code(state: AgentState) -> dict:
"""Execute the generated code and capture the result."""
code = state["generated_code"]
result = execute_code_safely(code)
if result["success"]:
output_text = (
result["output"] if result["output"] else "(No output)"
)
return {
"messages": [
AIMessage(
content=f"Execution successful.\n"
f"Output:\n{output_text}"
)
],
"execution_result": output_text,
"execution_succeeded": True,
}
else:
error_text = result["error"]
return {
"messages": [
AIMessage(
content=f"Execution failed.\n"
f"Error:\n{error_text}"
)
],
"execution_result": error_text,
"execution_succeeded": False,
"retry_count": state.get("retry_count", 0) + 1,
}
Two paths here. Success stores the output and sets execution_succeeded to True. Failure stores the traceback, flips the flag to False, and bumps the retry counter by one.
Why increment retry_count in the execution node and not the generation node? Because retries are triggered by failed executions. This is where we know something went wrong. The generation node shouldn’t care about counting — it just writes code based on whatever context it receives.
Step 5 — Add the Evaluation Node
Code that runs without errors doesn’t always mean it produced the right answer. Your user asks “show me the top 5 products by revenue” and the code prints an unsorted list of every product. No crash — but completely wrong.
The evaluation node catches this. After successful execution, it asks the LLM to verify whether the output actually answers the original question. This is a second LLM call per cycle, and it’s worth every token.
def evaluate_result(state: AgentState) -> dict:
"""Check if the output answers the user's question."""
user_request = ""
for msg in state["messages"]:
if isinstance(msg, HumanMessage):
user_request = msg.content
break
eval_prompt = (
f"The user asked: '{user_request}'\n\n"
f"The code produced this output:\n"
f"{state['execution_result']}\n\n"
f"Does this output correctly and completely answer "
f"the user's request?\n"
f"Reply with exactly 'YES' or 'NO: <reason>'."
)
response = model.invoke(
[
SystemMessage(
content="You evaluate code execution results. "
"Be strict but fair."
),
HumanMessage(content=eval_prompt),
]
)
evaluation = response.content.strip()
is_correct = evaluation.upper().startswith("YES")
if is_correct:
return {
"messages": [
AIMessage(
content=f"Result verified.\n\n"
f"Final answer:\n"
f"{state['execution_result']}"
)
],
"execution_succeeded": True,
}
else:
return {
"messages": [
AIMessage(
content=f"Output doesn't match the "
f"request. Reason: {evaluation}"
)
],
"execution_result": (
f"Code ran but output was wrong. "
f"Evaluation: {evaluation}"
),
"execution_succeeded": False,
"retry_count": state.get("retry_count", 0) + 1,
}
The node extracts the original user request from the message history. It builds a yes/no question for the LLM and routes based on the answer. If the output passes, we’re done. If not, execution_succeeded flips to False and the retry counter goes up.
Is the extra LLM call worth the cost? I’d say yes. Without it, the agent returns wrong results with confidence. One extra API call prevents the user from trusting garbage output.
Step 6 — Wire the Graph with Conditional Routing
This is where the pieces snap together. We connect every node with edges — including conditional edges that route based on the state. The routing logic is the agent’s brain.
Three decisions drive the flow:
- After execution: Did the code succeed? Go to evaluation. Did it crash? Check retry budget.
- After evaluation: Did the output pass? End. Was it wrong? Retry.
- Retry guard: Have we hit max_retries? If yes, stop. If no, regenerate.
def route_after_execution(state: AgentState) -> str:
"""Decide what happens after code execution."""
if state["execution_succeeded"]:
return "evaluate"
elif state.get("retry_count", 0) >= state.get("max_retries", 3):
return "end"
else:
return "retry"
def route_after_evaluation(state: AgentState) -> str:
"""Decide what happens after result evaluation."""
if state["execution_succeeded"]:
return "end"
elif state.get("retry_count", 0) >= state.get("max_retries", 3):
return "end"
else:
return "retry"
Two small functions. route_after_execution checks the success flag first. If the code ran clean, it sends the result to evaluation. If it crashed and retries remain, it loops back to generation. Otherwise, it exits.
route_after_evaluation handles the subtler case: code that ran fine but produced wrong output. Same logic — pass or retry, with a retry ceiling.
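Because both routers are pure functions of the state, you can sanity-check every branch with plain dicts — no LLM, sandbox, or compiled graph required. The router is repeated here so the snippet runs standalone:

```python
def route_after_execution(state: dict) -> str:
    """Same logic as above: success -> evaluate, else retry or give up."""
    if state["execution_succeeded"]:
        return "evaluate"
    elif state.get("retry_count", 0) >= state.get("max_retries", 3):
        return "end"
    else:
        return "retry"

# Table-driven check covering all three branches.
cases = [
    ({"execution_succeeded": True, "retry_count": 0, "max_retries": 3}, "evaluate"),
    ({"execution_succeeded": False, "retry_count": 1, "max_retries": 3}, "retry"),
    ({"execution_succeeded": False, "retry_count": 3, "max_retries": 3}, "end"),
]
for state, expected in cases:
    assert route_after_execution(state) == expected
print("all routing branches behave as expected")
```

Testing routers this way catches missing "end" paths long before they show up as infinite graph cycles.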
Here’s the full graph assembly. Each add_node registers a function. Each add_edge or add_conditional_edges defines flow between nodes.
def build_code_agent() -> StateGraph:
"""Build and compile the code generation agent graph."""
graph = StateGraph(AgentState)
# Register nodes
graph.add_node("generate", generate_code)
graph.add_node("execute", execute_code)
graph.add_node("evaluate", evaluate_result)
# Start with code generation
graph.add_edge(START, "generate")
# After generation, always execute
graph.add_edge("generate", "execute")
# After execution, branch on success
graph.add_conditional_edges(
"execute",
route_after_execution,
{
"evaluate": "evaluate",
"retry": "generate",
"end": END,
},
)
# After evaluation, branch on correctness
graph.add_conditional_edges(
"evaluate",
route_after_evaluation,
{
"end": END,
"retry": "generate",
},
)
return graph.compile()
The flow: START connects to generate, which always connects to execute. After execute, we branch — success goes to evaluate, failure loops back to generate or exits. After evaluate, correct results exit, wrong results loop back.
That’s the entire architecture. Three nodes, two routing functions, one compiled graph.
Step 7 — Run the Agent
Let’s put the agent to work. The invoke call takes an initial state dictionary. We set the user’s message, initialize tracking fields, and cap retries at 3.
agent = build_code_agent()
result = agent.invoke(
{
"messages": [
HumanMessage(
content="Write a Python script that generates "
"a list of 10 random numbers between 1 and "
"100, sorts them, and prints the sorted list "
"along with the average."
)
],
"generated_code": "",
"execution_result": "",
"execution_succeeded": False,
"retry_count": 0,
"max_retries": 3,
}
)
# Print the final result
for msg in result["messages"]:
print(f"\n{'='*50}")
print(f"[{msg.__class__.__name__}]")
print(msg.content)
The agent typically solves this in one pass. The message history shows the user request, the generated code, the execution output, and the verification result.
But what happens when things go wrong? Let’s try a task designed to trigger a retry.
result_retry = agent.invoke(
{
"messages": [
HumanMessage(
content="Read a CSV file called 'sales.csv' "
"with columns 'product', 'region', 'revenue'. "
"Group by region, calculate total revenue, "
"and print the region with the highest total."
)
],
"generated_code": "",
"execution_result": "",
"execution_succeeded": False,
"retry_count": 0,
"max_retries": 3,
}
)
print(f"\nRetries used: {result_retry['retry_count']}")
print(f"Succeeded: {result_retry['execution_succeeded']}")
This example is revealing. The first attempt crashes with FileNotFoundError — sales.csv doesn’t exist. The agent reads the traceback, realizes the file is missing, and rewrites the code to create sample data inline. That self-correction is the whole point of building an agent instead of a one-shot chain.
How to Watch the Agent Think Step by Step
When you’re debugging, LangGraph’s streaming mode shows each node’s output in real time. Instead of waiting for the final result, you see every decision as it happens.
agent = build_code_agent()
for event in agent.stream(
{
"messages": [
HumanMessage(
content="Calculate the first 20 Fibonacci "
"numbers and print them."
)
],
"generated_code": "",
"execution_result": "",
"execution_succeeded": False,
"retry_count": 0,
"max_retries": 3,
}
):
for node_name, node_output in event.items():
print(f"\n--- Node: {node_name} ---")
if "generated_code" in node_output:
code_preview = node_output["generated_code"][:200]
print(f"Code:\n{code_preview}...")
if "execution_result" in node_output:
print(f"Result: {node_output['execution_result'][:200]}")
if "execution_succeeded" in node_output:
print(f"Success: {node_output['execution_succeeded']}")
You’ll see three nodes fire in sequence: generate outputs the Fibonacci code, execute prints the 20 numbers, evaluate confirms the output matches. For tasks that trigger retries, generate fires multiple times with improved code each round.
Adding Guard Rails for Production
A production code agent needs safety boundaries. Without them, the agent could generate harmful code, run forever, or drain your API budget. Here are the guard rails I’d add before deploying this to real users.
Guard rail 1: Code safety scanning. Before execution, scan for dangerous patterns. This isn’t a full security scanner — just a blocklist for the obvious threats.
FORBIDDEN_PATTERNS = [
"os.remove", "os.rmdir", "shutil.rmtree",
"subprocess.call", "subprocess.run",
"os.system", "__import__",
"eval(", "exec(",
"open(", "pathlib",
]
def check_code_safety(code: str) -> tuple[bool, str]:
"""Check generated code for dangerous patterns."""
for pattern in FORBIDDEN_PATTERNS:
if pattern in code:
return False, f"Blocked: contains '{pattern}'"
return True, "Code passed safety check"
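String matching has blind spots in both directions: a comment that merely mentions os.system gets blocked, while `getattr(__import__('o' + 's'), ...)` slips through. For fewer false positives, one option is to inspect the parsed AST instead. This is a sketch, not a complete scanner — it blocks imports of the listed modules but won't catch importlib tricks or attribute games:

```python
import ast

# Hypothetical blocklist for illustration; tune to your threat model.
BLOCKED_MODULES = {"os", "subprocess", "shutil", "sys"}

def check_code_safety_ast(code: str) -> tuple[bool, str]:
    """Reject code that imports blocked modules (AST-based sketch)."""
    try:
        tree = ast.parse(code)
    except SyntaxError as e:
        return False, f"Blocked: code does not parse ({e.msg})"
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            for alias in node.names:
                if alias.name.split(".")[0] in BLOCKED_MODULES:
                    return False, f"Blocked: imports '{alias.name}'"
        elif isinstance(node, ast.ImportFrom):
            if node.module and node.module.split(".")[0] in BLOCKED_MODULES:
                return False, f"Blocked: imports from '{node.module}'"
    return True, "Code passed AST safety check"

print(check_code_safety_ast("import os\nos.remove('x')")[0])  # False
print(check_code_safety_ast("print(sum([1, 2, 3]))")[0])      # True
```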
Guard rail 2: Cost tracking. Each retry adds a full generation-plus-evaluation cycle to the bill, so cost grows with every loop. Track estimated spend and set a hard ceiling per request.
def create_cost_tracker(max_cost: float = 0.10):
    """Track estimated API costs per request."""
    total = 0.0

    def check_cost() -> bool:
        nonlocal total
        # ~$0.005 per generation + evaluation cycle
        total += 0.005
        return total <= max_cost

    return check_cost
Guard rail 3: Code length limits. If the model generates a 500-line script for a simple task, something’s off. Cap it at a reasonable length.
MAX_CODE_LINES = 100
def validate_code_length(code: str) -> bool:
"""Reject suspiciously long generated code."""
return len(code.strip().split("\n")) <= MAX_CODE_LINES
Guard rail 4: Retry ceiling. Three retries is a sensible default. Beyond that, the model probably misunderstands the task — it’s not a typo it can fix.
When to Use a Code Agent (and When Not To)
Code generation agents are powerful, but they’re the wrong tool for many situations. Here’s my take on where they fit.
Good fit:
- Data analysis — “Calculate X from this dataset.” Short code, verifiable output, retries handle edge cases.
- Format conversion — “Convert this JSON to CSV with these columns.” Clear input, clear success criteria.
- Math computation — “Solve this system of equations.” The LLM writes NumPy code, the answer is checkable.
- One-off automation — “Rename files matching this pattern.” (With proper sandboxing.)
Bad fit:
- Long-running processes — ML training takes hours. You don’t want a retry loop on a 3-hour job.
- No clear output — “Write a web server.” There’s no single result to evaluate.
- Side effects — Database writes, API calls. Each retry could create duplicates or corrupt data.
- Subjective judgment — “Make this chart look professional.” The evaluation node can’t assess aesthetics.
Common Mistakes and Troubleshooting
These are the errors that come up most when building code agents.
1. Infinite retry loops
Symptom: a GraphRecursionError (LangGraph's recursion-limit error) or the graph simply hangs
This happens when max_retries isn’t enforced or the counter doesn’t increment. Check that the execution node bumps retry_count on failure. Also verify your routing function — a missing "end" path creates an infinite cycle.
2. Code fence contamination
Symptom: SyntaxError: invalid syntax (line 1: ```python)
The LLM wraps code in markdown fences despite the prompt saying not to. The generate_code function strips these, but edge cases slip through. Build a more aggressive stripper that handles backticks, language tags, and partial fences.
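One way to write that stripper is a single regex pass over leading and trailing fences. A sketch — extend the pattern if your model emits other fence variants (the fence string is built with `"`" * 3` only so this article's own code fences stay intact):

```python
import re

FENCE = "`" * 3  # literal triple-backtick
FENCE_RE = re.compile(
    rf"^\s*{FENCE}[a-zA-Z0-9_+-]*\s*\n?|\n?\s*{FENCE}\s*$"
)

def strip_code_fences(text: str) -> str:
    """Remove leading/trailing markdown fences, language tags included."""
    return FENCE_RE.sub("", text.strip()).strip()

# Handles full fences, bare fences, no fences, and a dangling opener.
samples = [
    FENCE + "python\nprint('hi')\n" + FENCE,
    FENCE + "\nprint('hi')\n" + FENCE,
    "print('hi')",
    FENCE + "python\nprint('hi')",
]
for s in samples:
    print(strip_code_fences(s))  # prints print('hi') four times
```

Anchoring the pattern to the start and end of the string means backticks inside the code itself (say, in a docstring) are left alone.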
3. Missing imports in generated code
Symptom: ModuleNotFoundError: No module named 'pandas'
Two distinct problems here. Either the model forgot the import (fix the system prompt) or the library isn’t installed in the execution environment. Your sandbox and dev environment may have different packages installed.
4. Stale state between invocations
If you reuse the compiled graph, pass a fresh initial state each time. Leftover retry_count or generated_code from a previous run confuses the agent. Reset every field on each new request.
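A small factory function makes a fresh state the default rather than something you must remember. This is a hypothetical helper — dict-form messages are used to keep the snippet dependency-free, but in the tutorial code you would pass HumanMessage(user_request) instead:

```python
def fresh_state(user_request: str, max_retries: int = 3) -> dict:
    """Build a clean initial state for every invocation."""
    return {
        "messages": [{"role": "user", "content": user_request}],
        "generated_code": "",
        "execution_result": "",
        "execution_succeeded": False,
        "retry_count": 0,
        "max_retries": max_retries,
    }

# Every call starts from zero -- no leftover retry_count or stale code.
state = fresh_state("Sum the numbers 1 to 100.")
print(state["retry_count"])  # 0
```

agent.invoke(fresh_state("...")) then replaces the hand-written state dicts in the examples above.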
Complete Code
Click to expand the full script (copy-paste and run)
# Complete code from: Building a Code Generation and Execution Agent
# Requires: pip install langgraph langchain-openai langchain-core
# Python 3.10+
# Set OPENAI_API_KEY environment variable before running
import os
import subprocess
import sys
import tempfile
from typing import Annotated, TypedDict
from langchain_openai import ChatOpenAI
from langchain_core.messages import (
    HumanMessage,
    AIMessage,
    SystemMessage,
)
from langgraph.graph import StateGraph, START, END
from langgraph.graph.message import add_messages
# --- State Definition ---
class AgentState(TypedDict):
    messages: Annotated[list, add_messages]
    generated_code: str
    execution_result: str
    execution_succeeded: bool
    retry_count: int
    max_retries: int
# --- Safe Code Executor ---
def execute_code_safely(code: str, timeout: int = 30) -> dict:
    """Execute Python code in a subprocess with a timeout."""
    with tempfile.NamedTemporaryFile(
        mode="w", suffix=".py", delete=False
    ) as f:
        f.write(code)
        temp_path = f.name
    try:
        # sys.executable runs the child under the same interpreter
        result = subprocess.run(
            [sys.executable, temp_path],
capture_output=True,
text=True,
timeout=timeout,
)
if result.returncode == 0:
return {
"success": True,
"output": result.stdout,
"error": "",
}
else:
return {
"success": False,
"output": result.stdout,
"error": result.stderr,
}
except subprocess.TimeoutExpired:
return {
"success": False,
"output": "",
"error": f"Code timed out after {timeout} seconds.",
}
finally:
os.unlink(temp_path)
# --- Code Generation Node ---
model = ChatOpenAI(model="gpt-4o-mini", temperature=0)
SYSTEM_PROMPT = """You are a Python code generation assistant.
1. Write complete, executable Python that solves the request.
2. Always include print() statements to show results.
3. Include all necessary imports at the top.
4. Handle potential errors with try/except where needed.
5. Output ONLY Python code — no explanations, no markdown.
If you receive an error from a previous attempt:
- Read the error carefully.
- Fix the specific issue.
- Do NOT rewrite everything unless necessary.
- Output the corrected code."""
def generate_code(state: AgentState) -> dict:
"""Generate Python code based on the user request."""
messages = state["messages"]
retry_count = state.get("retry_count", 0)
prompt_messages = [SystemMessage(content=SYSTEM_PROMPT)]
if retry_count > 0 and state.get("generated_code"):
error_context = (
f"\n\nYour previous code:\n"
f"```python\n{state['generated_code']}\n```\n\n"
f"Error encountered:\n{state['execution_result']}\n\n"
f"Fix the code. Attempt {retry_count + 1} of "
f"{state['max_retries']}."
)
prompt_messages.extend(messages)
prompt_messages.append(
HumanMessage(content=error_context)
)
else:
prompt_messages.extend(messages)
response = model.invoke(prompt_messages)
generated_code = response.content.strip()
if generated_code.startswith("```python"):
generated_code = generated_code[9:]
if generated_code.startswith("```"):
generated_code = generated_code[3:]
if generated_code.endswith("```"):
generated_code = generated_code[:-3]
generated_code = generated_code.strip()
return {
"messages": [
AIMessage(
content=f"Generated code (attempt "
f"{retry_count + 1}):\n```python\n"
f"{generated_code}\n```"
)
],
"generated_code": generated_code,
}
# --- Execution Node ---
def execute_code(state: AgentState) -> dict:
"""Execute the generated code and capture the result."""
code = state["generated_code"]
result = execute_code_safely(code)
if result["success"]:
output_text = (
result["output"] if result["output"] else "(No output)"
)
return {
"messages": [
AIMessage(
content=f"Execution successful.\n"
f"Output:\n{output_text}"
)
],
"execution_result": output_text,
"execution_succeeded": True,
}
else:
error_text = result["error"]
return {
"messages": [
AIMessage(
content=f"Execution failed.\n"
f"Error:\n{error_text}"
)
],
"execution_result": error_text,
"execution_succeeded": False,
"retry_count": state.get("retry_count", 0) + 1,
}
# --- Evaluation Node ---
def evaluate_result(state: AgentState) -> dict:
"""Check if the output answers the user's question."""
user_request = ""
for msg in state["messages"]:
if isinstance(msg, HumanMessage):
user_request = msg.content
break
eval_prompt = (
f"The user asked: '{user_request}'\n\n"
f"The code produced this output:\n"
f"{state['execution_result']}\n\n"
f"Does this output correctly and completely answer "
f"the user's request?\n"
f"Reply with exactly 'YES' or 'NO: <reason>'."
)
response = model.invoke(
[
SystemMessage(
content="You evaluate code execution results. "
"Be strict but fair."
),
HumanMessage(content=eval_prompt),
]
)
evaluation = response.content.strip()
is_correct = evaluation.upper().startswith("YES")
if is_correct:
return {
"messages": [
AIMessage(
content=f"Result verified.\n\n"
f"Final answer:\n"
f"{state['execution_result']}"
)
],
"execution_succeeded": True,
}
else:
return {
"messages": [
AIMessage(
content=f"Output doesn't match the "
f"request. Reason: {evaluation}"
)
],
"execution_result": (
f"Code ran but output was wrong. "
f"Evaluation: {evaluation}"
),
"execution_succeeded": False,
"retry_count": state.get("retry_count", 0) + 1,
}
# --- Routing Functions ---
def route_after_execution(state: AgentState) -> str:
"""Decide what happens after code execution."""
if state["execution_succeeded"]:
return "evaluate"
elif state.get("retry_count", 0) >= state.get("max_retries", 3):
return "end"
else:
return "retry"
def route_after_evaluation(state: AgentState) -> str:
"""Decide what happens after result evaluation."""
if state["execution_succeeded"]:
return "end"
elif state.get("retry_count", 0) >= state.get("max_retries", 3):
return "end"
else:
return "retry"
# --- Graph Assembly ---
def build_code_agent():
"""Build and compile the code generation agent graph."""
graph = StateGraph(AgentState)
graph.add_node("generate", generate_code)
graph.add_node("execute", execute_code)
graph.add_node("evaluate", evaluate_result)
graph.add_edge(START, "generate")
graph.add_edge("generate", "execute")
graph.add_conditional_edges(
"execute",
route_after_execution,
{
"evaluate": "evaluate",
"retry": "generate",
"end": END,
},
)
graph.add_conditional_edges(
"evaluate",
route_after_evaluation,
{
"end": END,
"retry": "generate",
},
)
return graph.compile()
# --- Run the Agent ---
if __name__ == "__main__":
agent = build_code_agent()
result = agent.invoke(
{
"messages": [
HumanMessage(
content="Write a Python script that generates "
"a list of 10 random numbers between 1 and "
"100, sorts them, and prints the sorted list "
"along with the average."
)
],
"generated_code": "",
"execution_result": "",
"execution_succeeded": False,
"retry_count": 0,
"max_retries": 3,
}
)
for msg in result["messages"]:
print(f"\n{'='*50}")
print(f"[{msg.__class__.__name__}]")
print(msg.content)
print(f"\nRetries used: {result['retry_count']}")
print(f"Succeeded: {result['execution_succeeded']}")
print("\nScript completed successfully.")
Exercise: Add a Code Safety Node
Here’s a challenge that tests your understanding of the graph structure. Add a safety checking node between generate and execute — a gatekeeper that blocks dangerous code before it runs.
Your task: Create a check_safety node that:
- Reads state["generated_code"]
- Scans for forbidden patterns (from the guard rails section)
- If dangerous, returns an error and increments retry_count
- If safe, passes the code through to execution
You’ll also need to rewire the graph: generate connects to check_safety, and check_safety uses conditional edges to route to execute or back to generate.
Hints
**Hint 1:** The safety node needs its own routing function. Model it after `route_after_execution` — check a boolean flag.
**Hint 2:** Add a `code_is_safe` boolean field to `AgentState`. The safety node sets it. The conditional edge reads it.
Solution
# Add to AgentState:
# code_is_safe: bool
FORBIDDEN_PATTERNS = [
"os.remove", "os.rmdir", "shutil.rmtree",
"subprocess.call", "subprocess.run",
"os.system", "__import__",
"eval(", "exec(",
]
def check_safety(state: AgentState) -> dict:
"""Scan generated code for dangerous patterns."""
code = state["generated_code"]
for pattern in FORBIDDEN_PATTERNS:
if pattern in code:
return {
"messages": [
AIMessage(
content=f"Safety check failed: "
f"code contains '{pattern}'. "
f"Regenerating..."
)
],
"execution_result": (
f"Code blocked: contains '{pattern}'"
),
"execution_succeeded": False,
"code_is_safe": False,
"retry_count": state.get("retry_count", 0) + 1,
}
return {"code_is_safe": True}
def route_after_safety(state: AgentState) -> str:
if state.get("code_is_safe", False):
return "execute"
elif state.get("retry_count", 0) >= state.get("max_retries", 3):
return "end"
else:
return "retry"
# Updated graph wiring:
# graph.add_edge("generate", "check_safety")
# graph.add_conditional_edges(
# "check_safety", route_after_safety,
# {"execute": "execute", "retry": "generate", "end": END}
# )
The key change: `generate` no longer connects to `execute` directly. The safety node sits between them as a gatekeeper. If code passes, it flows to execution. If it fails, the agent loops back with the violation as context — and the LLM rewrites to avoid the blocked pattern.
Exercise: Support Multi-Step Tasks
This exercise pushes you to extend the agent’s capabilities. Some tasks need multiple code executions in sequence — for example: “Create a CSV with sample data, then read it and calculate statistics.”
Your task: Modify the agent so it handles multi-step requests. The agent should:
- Parse the request into sequential steps
- Generate and execute code for each step
- Verify each step before moving to the next
- Return the final result after all steps complete
Hints
**Hint 1:** Add `task_steps: list[str]` and `current_step: int` to the state. Create a planning node that breaks the request into steps.
**Hint 2:** After each successful evaluation, check if `current_step < len(task_steps) - 1`. If more steps remain, increment the counter and route back to `generate`.
Solution
# Add to AgentState:
# task_steps: list[str]
# current_step: int
def plan_task(state: AgentState) -> dict:
"""Break the user request into sequential steps."""
user_request = state["messages"][-1].content
plan_prompt = (
f"Break this task into sequential Python steps:\n"
f"'{user_request}'\n\n"
f"Return each step on a new line, numbered. "
f"Each step must be independently executable."
)
response = model.invoke(
[SystemMessage(content="You are a task planner."),
HumanMessage(content=plan_prompt)]
)
steps = [
line.strip()
for line in response.content.strip().split("\n")
if line.strip() and line.strip()[0].isdigit()
]
return {
"task_steps": steps,
"current_step": 0,
"messages": [
AIMessage(
content=f"Plan: {len(steps)} steps identified."
)
],
}
# Modified graph: START → plan → generate → execute → evaluate
# After evaluation, check current_step vs len(task_steps).
# If more steps remain, increment and route to generate.
# If all done, route to END.
The planning node uses a separate LLM call to decompose the request. Each step becomes its own generate-execute-evaluate cycle. The `current_step` counter tracks progress, and the routing logic decides whether to continue or finish.
Summary
You’ve built a complete code generation and execution agent in LangGraph. The system writes Python code, runs it in a sandbox, evaluates the output, and retries on failure — all through a state machine with explicit routing.
Here’s what we covered:
- State design — tracking code, results, success flags, and retries in a TypedDict
- Safe execution — running generated code in a subprocess with timeout
- Code generation — prompting the LLM with error context for retries
- Output evaluation — a second LLM call to verify results match the request
- Conditional routing — wiring the retry loop with LangGraph’s conditional edges
- Guard rails — safety scanning, cost tracking, and code length limits
The agent pattern we built is a foundation. Extend it with better sandboxing (Docker, E2B), multi-step planning, persistent memory, or domain-specific prompts for your use case.
FAQ
Can this agent handle tasks that need external libraries?
Yes, as long as those libraries are installed in the execution environment. The agent generates imports — if pandas exists where the subprocess runs, the code works. Otherwise, the agent sees ModuleNotFoundError and may rewrite to avoid that dependency.
# Verify a library is available before running agent tasks
import importlib
try:
importlib.import_module("pandas")
print("pandas is available")
except ImportError:
print("pandas is NOT installed")
How do I switch to a cloud sandbox?
Replace execute_code_safely with a call to your sandbox provider. E2B, Modal, and LangChain Sandbox all expose an execute() method that takes code and returns output. The rest of the graph stays identical — only the execution backend changes.
What does each request cost?
Each generation call costs roughly $0.002–$0.005 with GPT-4o-mini. Evaluation adds $0.001–$0.002. A first-attempt success costs about $0.004 total. Three retries cost about $0.015. The evaluation node is the first thing to optimize at scale — skip it for tasks with trivially verifiable outputs.
Can I use a local model instead of OpenAI?
Swap ChatOpenAI for any LangChain-compatible chat model. Ollama, vLLM, and Anthropic all work. Code quality depends on the model — GPT-4o and Claude produce reliable code, while smaller models may need more retries.
Topic Cluster: LangGraph Agent Patterns
This article is part of the LangGraph series on MachineLearningPlus. Related articles:
- What Is LangGraph and Why Does It Exist?
- LangGraph Installation, Setup, and First Graph
- LangGraph State Management — TypedDict and Reducers
- LangGraph Conditional Edges and Routing
- Build a ReAct Agent from Scratch with LangGraph
- LangGraph Tool Calling and Agent Actions
- LangGraph Error Handling, Retries, and Fallbacks
- LangGraph Multi-Agent Systems — Supervisor, Swarm, Network
- LangGraph RAG Agent — Retrieval-Augmented Generation
- LangGraph Persistence and Checkpointing
References
- LangGraph documentation — Graph API overview. Link
- LangGraph documentation — StateGraph and conditional edges. Link
- LangChain documentation — Sandboxes for Deep Agents. Link
- langchain-sandbox — PyPI package for safe Python execution. Link
- E2B documentation — Code Interpreter with LangGraph. Link
- LangChain blog — Execute Code with Sandboxes. Link
- Modal documentation — Build a coding agent with LangGraph. Link
- Yao, S. et al. — “ReAct: Synergizing Reasoning and Acting in Language Models.” ICLR 2023. Link