From Simple Responses to Multi-Agent Systems: Building it from Scratch
A simple guide to building AI agents using only the OpenAI SDK and pure Python. No frameworks, no magic, just Python and a clear progression from your first API call to a full multi-agent coding system.
From Simple Responses to Multi-Agent Systems
I spent a while trying to understand AI agents. Not the conceptual fluff you find in Medium posts, the actual mechanics. How does an LLM go from spitting out text to autonomously writing code across multiple files?
Turns out, the jump from "hello world" to "multi-agent system" isn't as wild as it sounds. It's really just seven ideas stacked on top of each other, each one building naturally on the last.
This is the guide I wish I'd had when I started. Every snippet here is working code. No LangChain, no CrewAI, no frameworks, just the OpenAI SDK and Python.
Prerequisites: You'll need Python 3.11+, an OpenAI API key, and pip install openai python-dotenv. That's literally it.
The Roadmap
Here's where we're headed:
1. Chat Completion
Your first API call — the 'Hello World' of AI.
2. Chatbot
Multi-turn conversation with memory.
3. Tool Use
The model calls YOUR Python functions.
4. Chatbot + Tools
Conversation meets function calling.
5. Simple Agent
Encapsulating everything into an Agent class.
6. Agentic Loop
The agent drives itself. Think-Act-Observe.
7. Multi-Agent System
Multiple agents, one pipeline, real code output.
The Quick Start
Ready to build something cool? Let me keep this lean. No over-engineered YAML files, no bulky Docker containers, and zero infrastructure headaches. Just you, some Python, and a bit of logic.
Before we roll up our sleeves, you need to get your environment ready. It’s a two-step process that takes about thirty seconds.
1. Install Dependencies
Open your terminal and grab the essentials. We’re using the openai SDK because it’s the gold standard for interacting with most LLMs, and python-dotenv to keep our secrets, well, secret.
pip install openai python-dotenv
2. Configure Your Environment
Create a .env file in your project’s root directory. This is where we tell our code who to talk to and which "brain" to use.
OPENAI_API_KEY="gsk_YOUR_GROQ_API_KEY"
OPENAI_MODEL="openai/gpt-oss-120b"
OPENAI_BASE_URL="https://api.groq.com/openai/v1"
Why Groq? Honestly? Because they offer generous free API credits. And any OpenAI-compatible provider will work here: just swap the URL, the key, and the model name.
What's happening under the hood?
By setting the OPENAI_BASE_URL, we are effectively "redirecting" the standard OpenAI library to talk to Groq’s lightning-fast infrastructure. This keeps your code portable. If you ever decide to switch providers, you change one line in your .env rather than hunting through your source code.
Chapter 1: Your First API Call
Everything starts here. One request, one response: you ask a question and get an answer back.
Your First Chat Completion
Let's cut straight to it. You want to talk to an LLM from your own code. Not through ChatGPT's UI, not through a playground, but from a Python script that you control.
This is the simplest possible version. One message in, one message out.
The Code
Here's the entire thing:
import os
from dotenv import load_dotenv
from openai import OpenAI
# 1. Load your API key from the .env file
load_dotenv()
client = OpenAI(
    api_key=os.getenv("OPENAI_API_KEY"),
    base_url=os.getenv("OPENAI_BASE_URL"),
)
# 2. Make a single chat completion request
response = client.chat.completions.create(
model=os.getenv("OPENAI_MODEL"),
messages=[
{
"role": "system",
"content": "You are a helpful assistant that explains things simply.",
},
{
"role": "user",
"content": "What is an AI agent in 3 sentences?",
},
],
temperature=0.7,
max_tokens=256,
)
# 3. Print the result
message = response.choices[0].message
print(f"Role : {message.role}")
print(f"Content: {message.content}")
# 4. Inspect usage (tokens consumed)
usage = response.usage
print(f"\n--- Token Usage ---")
print(f"Prompt tokens : {usage.prompt_tokens}")
print(f"Completion tokens : {usage.completion_tokens}")
print(f"Total tokens : {usage.total_tokens}")
Run it:
python main.py
You'll get something like:
Role : assistant
Content: An AI agent is a software system that can perceive its environment,
make decisions, and take actions to achieve specific goals. Unlike simple
programs, agents can adapt their behavior based on new information...
--- Token Usage ---
Prompt tokens : 32
Completion tokens : 68
Total tokens : 100
Done. You just made your first API call.
What's Actually Happening
Let's break down the pieces, because every single one of them matters later.
The Messages List
messages=[
{"role": "system", "content": "You are a helpful assistant..."},
{"role": "user", "content": "What is an AI agent?"},
]
This is the most important concept in the entire OpenAI API. It's not a single string, it's a list of messages with roles. Three roles exist:
| Role | Purpose |
|---|---|
| system | Sets the model's behavior and personality. The model treats this as instructions. |
| user | What the human is asking. Your input. |
| assistant | What the model has previously said. Used for multi-turn context. |
Right now we only have system and user. But in Chapter 2, we'll start appending assistant messages to this list, and that's how memory works. That is the literal definition of "AI memory": by feeding the model’s own previous answers back to it, you transform a robotic script into a fluid conversation.
The messages list is the secret sauce: by re-feeding the entire history back to the model with every prompt, you turn a series of static calls into a conversation.
It’s how the AI "remembers" what you said two minutes ago instead of treating every message like a first date.
The Response Object
response.choices[0].message
The API returns a ChatCompletion object. It's not just text — it's structured:
- choices — a list (usually length 1) of possible completions
- choices[0].message.role — always "assistant" for completions
- choices[0].message.content — the actual text response
- choices[0].finish_reason — why the model stopped ("stop", "length", "tool_calls")
- usage — token counts for billing
That finish_reason field becomes critical in Chapter 3. When the model wants to call a tool instead of responding with text, finish_reason changes to "tool_calls". But we'll get there.
Temperature
temperature=0.7This controls randomness:
- 0.0 = deterministic, always picks the most likely token
- 0.7 = balanced creativity (good default)
- 2.0 = chaotic, borderline incoherent
For agents that need to make reliable decisions (like our coder in Chapter 7), you want low temperature. For creative writing, crank it up.
Token Limits
max_tokens=256
This caps the response length. It does not affect the input: you can send as many tokens as the model's context window allows (128K for GPT-4o-mini). But the response will be cut off at 256 tokens.
If your response gets cut off mid-sentence, check finish_reason. If it says "length" instead of "stop", your max_tokens was too low.
The Mental Model
Here's how to think about what just happened:
Your Code OpenAI API
───────── ──────────
messages list ──────────────→ Process messages
Generate response
response object ←─────────── Return completion
There's no session, no connection, no state on their side. Every API call is a clean slate. The model only knows what you tell it in the messages list.
This is a constraint, but it's also a superpower. It means you have complete control over what the model sees. You can inject instructions, edit history, remove messages, rewrite context, whatever you need. And that flexibility is what makes agents possible.
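To make that concrete, here's a small sketch of editing the context between calls; the messages are made up, but the pattern is exactly what the later chapters build on:

```python
# The model only ever sees what you send, so you can freely rewrite
# the messages list between calls. (Illustrative data, not a real session.)
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What's the capital of France?"},
    {"role": "assistant", "content": "Paris."},
    {"role": "user", "content": "How many people live there?"},
]

# Inject an extra instruction for the next call only
patched = [messages[0],
           {"role": "system", "content": "Answer in one short sentence."},
           *messages[1:]]

# Or drop older turns to save tokens, keeping the system prompt
trimmed = [messages[0], *messages[-2:]]

print(len(patched), len(trimmed))  # 5 3
```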
What's Next
Right now we do one request and we're done. In Chapter 2, we'll keep the conversation going, adding each message to the list and sending it back, creating a chatbot with memory.
The jump is small: we just keep appending to messages in a loop. But it changes everything about how the interaction feels.
Run the code, poke at the parameters, try different system prompts. The best way to understand this stuff is to play with it.
Chapter 2: Building a Chatbot with Memory
Here's something that surprised me when I first started: the OpenAI API has no memory. Zero. Every request is completely independent. If you ask "What's the capital of France?" and then follow up with "How many people live there?", the model has no idea what "there" refers to.
So how does ChatGPT manage conversations? The answer is almost embarrassingly simple: they send the entire conversation history with every API call. Remember the messages list, the secret sauce from Chapter 1? They send that list, in full, every time.
That's what we're building now.
The Key Insight
Memory in LLM applications = a list you keep appending to.
messages = [
{"role": "system", "content": "You are a friendly assistant."},
]
# User says something → append it
messages.append({"role": "user", "content": "Hi there"})
# Model responds → append that too
messages.append({"role": "assistant", "content": "Hello! How can I help?"})
# Next turn: send the ENTIRE list again
# The model sees the full conversation and can respond in context
That's the whole trick. There's no session management, no database, no special API parameter. You just replay the conversation each time.
The Full Chatbot
import os
from dotenv import load_dotenv
from openai import OpenAI
load_dotenv()
client = OpenAI(
    api_key=os.getenv("OPENAI_API_KEY"),
    base_url=os.getenv("OPENAI_BASE_URL"),
)
SYSTEM_PROMPT = (
"You are a friendly and knowledgeable coding assistant. "
"You help developers learn about AI agents. "
"Keep answers concise but thorough."
)
def run_chatbot():
"""Main chatbot loop with streaming."""
# The conversation history — this is the KEY concept.
messages = [
{"role": "system", "content": SYSTEM_PROMPT},
]
print("=" * 60)
print(" AI Chatbot (type 'exit' to quit)")
print("=" * 60)
while True:
try:
user_input = input("\nYou: ").strip()
except (EOFError, KeyboardInterrupt):
print("\nGoodbye!")
break
if not user_input:
continue
if user_input.lower() in ("exit", "quit"):
print("Goodbye!")
break
# Append what the user said
messages.append({"role": "user", "content": user_input})
# Stream the response
print("\nAssistant: ", end="", flush=True)
stream = client.chat.completions.create(
model=os.getenv("OPENAI_MODEL"),
messages=messages,
temperature=0.7,
stream=True,
)
# Collect the full response while printing tokens as they arrive
assistant_response = ""
for chunk in stream:
delta = chunk.choices[0].delta
if delta.content:
token = delta.content
print(token, end="", flush=True)
assistant_response += token
print() # Newline after streaming finishes
# Append the assistant response to history
messages.append({"role": "assistant", "content": assistant_response})
if __name__ == "__main__":
run_chatbot()
Streaming: Why It Matters
Notice the stream=True parameter. Without it, you'd wait several seconds for the full response, and then it would all appear at once. With streaming, tokens appear as the model generates them, and the chat feels much more responsive.
The streaming API returns chunks instead of a single response:
stream = client.chat.completions.create(
model=os.getenv("OPENAI_MODEL"),
messages=messages,
stream=True,
)
for chunk in stream:
delta = chunk.choices[0].delta
if delta.content:
print(delta.content, end="", flush=True)
Each chunk contains a delta with a small piece of the response. You print them as they arrive and concatenate them into the full response afterward.
The flush=True in print() is important. Without it, Python buffers the output and you won't see the streaming effect — tokens will appear in batches instead of one at a time.
The Conversation Flow
Here's what happens across multiple turns:
Turn 1:
messages = [
{"role": "system", "content": "You are a coding assistant."},
{"role": "user", "content": "What is Python?"},
]
# → Model responds: "Python is a programming language..."
# → Append assistant response to messages
Turn 2:
messages = [
{"role": "system", "content": "You are a coding assistant."},
{"role": "user", "content": "What is Python?"},
{"role": "assistant", "content": "Python is a programming language..."},
{"role": "user", "content": "What makes it good for AI?"},
]
# → Model sees the FULL history and responds in context
The messages list grows every turn. The model gets the complete context every single time.
Tokens are the "DNA" of your prompt. Instead of reading word-by-word, the model breaks text into chunks, roughly 4 characters per token. The messages list grows in tokens with every turn, and that’s what determines your API cost and context limit.
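To build intuition, here's a back-of-the-envelope estimator based on that ~4-characters-per-token rule of thumb (a sketch only; for exact counts use a real tokenizer such as tiktoken):

```python
def estimate_tokens(messages: list[dict]) -> int:
    """Rough token estimate: ~4 characters per token.
    A heuristic for intuition, not an exact count."""
    chars = sum(len(m.get("content") or "") for m in messages)
    return chars // 4

history = [
    {"role": "system", "content": "You are a coding assistant."},
    {"role": "user", "content": "What is Python?"},
    {"role": "assistant", "content": "Python is a programming language..."},
]
print(f"~{estimate_tokens(history)} tokens in the history so far")
```

Run this after every turn and you can watch your per-request cost grow as the list does.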
The Tradeoff
This approach has an obvious limitation: the list keeps growing. After a long conversation, you're sending thousands of tokens per request. That means:
- Higher costs — you pay per token, input and output
- Context limit — eventually you'll hit the model's context window (128K tokens for GPT-4o-mini)
- Slower responses — more input tokens = more processing time
In production, you'd implement strategies like:
- Truncating older messages
- Summarizing the conversation periodically
- Using a sliding window of recent messages
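To make the last idea concrete, here's a minimal sliding-window sketch (my own helper, not part of the SDK; a production version must also keep tool-result messages paired with the assistant message that requested them):

```python
def sliding_window(messages: list[dict], keep_last: int = 6) -> list[dict]:
    """Keep the system prompt plus the most recent `keep_last` messages.
    A sketch only: real code needs more care around tool-call pairing."""
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    return system + rest[-keep_last:]

# Simulated long conversation: 1 system prompt + 10 user/assistant pairs
history = [{"role": "system", "content": "You are a coding assistant."}]
for i in range(10):
    history.append({"role": "user", "content": f"question {i}"})
    history.append({"role": "assistant", "content": f"answer {i}"})

windowed = sliding_window(history)
print(len(history), "->", len(windowed))  # 21 -> 7
```

Call it right before each API request and the payload stays bounded no matter how long the chat runs.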
But for learning, this straightforward approach works perfectly. And it teaches you the fundamental pattern: you own the conversation state.
What's Different From Chapter 1
| Chapter 1 | Chapter 2 |
|---|---|
| Single request/response | Loop of requests/responses |
| No state between calls | Full conversation history |
| No streaming | Token-by-token streaming |
| Script ends after one response | Interactive session |
The code is barely more complex. We added a while True loop, two messages.append() calls, and streaming. But the experience is completely different, you now have a conversational AI running in your terminal.
What's Next
Right now the chatbot can only talk. In Chapter 3, we teach the model to do things: call Python functions, look up data, perform calculations. That's where the real power starts.
Try different system prompts. Make the chatbot a pirate, a teacher, a sarcastic teenager. The system prompt shapes everything about how it responds.
Chapter 3: Teaching the Model to Use Tools
Everything up to this point has been text in, text out. This chapter changes everything. The model can now call your Python functions, decide the right arguments, and use the results in its response.
How It Works
Define Tool Schemas
Describe your local functions as JSON schemas and pass them to the model via the tools parameter.
Model Intent Detection
The model analyzes the prompt and decides which functions to call and extracts the necessary arguments.
Receive Tool Call
The model returns the function name and arguments; it acts as the "brain," but it does not execute the code.
Local Execution
You execute the function within your own environment (Python, Node, etc.) using the arguments provided by the model.
Submit Tool Outputs
Send the function results back to the model using the tool role, linking it to the original call via a tool_call_id.
Final Synthesis
The model reads the tool output and synthesizes a natural language response for the user.
Define Your Functions
Regular Python functions, nothing special about them:
def get_weather(city: str) -> dict:
"""Simulate a weather lookup."""
fake_weather = {
"new york": {"temp": "22C", "condition": "Sunny"},
"london": {"temp": "15C", "condition": "Cloudy"},
"tokyo": {"temp": "28C", "condition": "Humid"},
}
data = fake_weather.get(city.lower(), {"temp": "20C", "condition": "Unknown"})
return {"city": city, **data}
def calculate(expression: str) -> dict:
"""Evaluate a math expression safely."""
try:
result = eval(expression, {"__builtins__": {}})
return {"expression": expression, "result": str(result)}
except Exception as e:
return {"expression": expression, "error": str(e)}
AVAILABLE_FUNCTIONS = {
"get_weather": get_weather,
"calculate": calculate,
}
Describe Them as JSON Schemas
This is how the model knows your functions exist. The description fields are critical, the model reads them to decide when to call each function:
TOOLS = [
{
"type": "function",
"function": {
"name": "get_weather",
"description": "Get the current weather for a given city.",
"parameters": {
"type": "object",
"properties": {
"city": {"type": "string", "description": "The city name, e.g. 'New York'"}
},
"required": ["city"],
},
},
},
{
"type": "function",
"function": {
"name": "calculate",
"description": "Evaluate a mathematical expression and return the result.",
"parameters": {
"type": "object",
"properties": {
"expression": {"type": "string", "description": "A math expression, e.g. '2 + 2 * 3'"}
},
"required": ["expression"],
},
},
},
]
The Two-Call Flow
Tool calling takes two API calls, one for the model to decide, one to use the results:
import os
import json
from dotenv import load_dotenv
from openai import OpenAI
load_dotenv()
client = OpenAI(
    api_key=os.getenv("OPENAI_API_KEY"),
    base_url=os.getenv("OPENAI_BASE_URL"),
)
# ══════════════════════════════════════════════
# STEP 1: Define your Python functions (the actual tools)
# ══════════════════════════════════════════════
def get_weather(city: str) -> dict:
"""Simulate a weather lookup. In production, call a real API."""
fake_weather = {
"new york": {"temp": "22°C", "condition": "Sunny"},
"london": {"temp": "15°C", "condition": "Cloudy"},
"tokyo": {"temp": "28°C", "condition": "Humid"},
}
data = fake_weather.get(city.lower(), {"temp": "20°C", "condition": "Unknown"})
return {"city": city, **data}
def calculate(expression: str) -> dict:
"""Evaluate a math expression safely."""
try:
# WARNING: In production, use a proper sandbox. This is for learning only.
result = eval(expression, {"__builtins__": {}})
return {"expression": expression, "result": str(result)}
except Exception as e:
return {"expression": expression, "error": str(e)}
# ══════════════════════════════════════════════
# STEP 2: Map function names → actual functions
# ══════════════════════════════════════════════
AVAILABLE_FUNCTIONS = {
"get_weather": get_weather,
"calculate": calculate,
}
# ══════════════════════════════════════════════
# STEP 3: Define the tool schemas (JSON descriptions)
# This tells the model WHAT tools exist, their parameters,
# and their descriptions. The model uses this to decide
# when and how to call them.
# ══════════════════════════════════════════════
TOOLS = [
{
"type": "function",
"function": {
"name": "get_weather",
"description": "Get the current weather for a given city.",
"parameters": {
"type": "object",
"properties": {
"city": {
"type": "string",
"description": "The city name, e.g. 'New York'",
}
},
"required": ["city"],
},
},
},
{
"type": "function",
"function": {
"name": "calculate",
"description": "Evaluate a mathematical expression and return the result.",
"parameters": {
"type": "object",
"properties": {
"expression": {
"type": "string",
"description": "A math expression to evaluate, e.g. '2 + 2 * 3'",
}
},
"required": ["expression"],
},
},
},
]
# ══════════════════════════════════════════════
# STEP 4: The main flow — call API, detect tool calls, execute, return results
# ══════════════════════════════════════════════
def run():
user_question = "What's the weather in Tokyo, and what is 145 * 37?"
print(f"User: {user_question}\n")
messages = [
{"role": "system", "content": "You are a helpful assistant. Use the provided tools when needed."},
{"role": "user", "content": user_question},
]
# ── First API call: the model decides what tools to call ──
response = client.chat.completions.create(
model=os.getenv("OPENAI_MODEL"),
messages=messages,
tools=TOOLS,
tool_choice="auto", # Let the model decide
)
assistant_message = response.choices[0].message
print(f"[Model Decision] finish_reason = {response.choices[0].finish_reason}")
# ── Check if the model wants to call tools ──
if assistant_message.tool_calls:
print(f"[Model wants to call {len(assistant_message.tool_calls)} tool(s)]\n")
# IMPORTANT: append the assistant's message (with tool_calls) to history
messages.append(assistant_message)
# ── Execute each tool call ──
for tool_call in assistant_message.tool_calls:
func_name = tool_call.function.name
func_args = json.loads(tool_call.function.arguments)
print(f" Calling: {func_name}({func_args})")
# Look up and execute the function
func = AVAILABLE_FUNCTIONS[func_name]
result = func(**func_args)
print(f" Result : {result}\n")
# IMPORTANT: send the result back as a "tool" role message
messages.append({
"role": "tool",
"tool_call_id": tool_call.id, # Must match the tool_call's id
"content": json.dumps(result),
})
# ── Second API call: the model uses tool results to form its answer ──
final_response = client.chat.completions.create(
model=os.getenv("OPENAI_MODEL"),
messages=messages,
tools=TOOLS,
)
final_answer = final_response.choices[0].message.content
print(f"Assistant: {final_answer}")
else:
# Model didn't need tools — it answered directly
print(f"Assistant: {assistant_message.content}")
if __name__ == "__main__":
run()
The tool_call_id field is critical. The API uses it to match each result with the function call that produced it. Miss it and you'll get an error.
The Key Insight
The model looked at a natural language question, figured out it needed two different functions, generated correct arguments for each, and then synthesized the results into a coherent answer. And you controlled every step.
The model never executed code. It just decided what to call. You can add any safety checks, logging, or rate limiting you want between the decision and the execution.
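For illustration, here's a sketch of such a guard layer. guarded_execute, the allowlist, and the audit print are my own additions, not part of the OpenAI API:

```python
import json

ALLOWED_TOOLS = {"get_weather", "calculate"}  # explicit allowlist

def guarded_execute(func_name: str, raw_args: str, functions: dict) -> dict:
    """Validate a model-requested call before running it: allowlist check,
    argument-parsing guard, and an audit line. A sketch, not a full sandbox."""
    if func_name not in ALLOWED_TOOLS:
        return {"error": f"Tool '{func_name}' is not permitted"}
    try:
        args = json.loads(raw_args)
    except json.JSONDecodeError as e:
        return {"error": f"Malformed arguments: {e}"}
    print(f"[audit] {func_name}({args})")
    return functions[func_name](**args)

# Reusing the calculate() tool from this chapter
def calculate(expression: str) -> dict:
    try:
        return {"result": str(eval(expression, {"__builtins__": {}}))}
    except Exception as e:
        return {"error": str(e)}

print(guarded_execute("calculate", '{"expression": "2 + 2"}', {"calculate": calculate}))
print(guarded_execute("delete_files", "{}", {}))
```

The model still decides what to call; you decide whether the call is allowed to happen.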
This is the building block. Every agent framework (LangChain, CrewAI, all of them) is built on exactly this mechanism.
What's Next
Right now we handle a single question. In Chapter 4, we combine this with the chatbot loop: the model can have a conversation AND call tools multiple times in a single turn, automatically.
Try adding your own tools. A dictionary lookup, a file reader, a database query, anything with a clear input/output contract can become a tool.
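As a starting point, here's a hypothetical define_term tool with its matching schema (both names are made up for this example); registering it means adding the function to AVAILABLE_FUNCTIONS and the schema to TOOLS:

```python
# A hypothetical extra tool: a small local glossary lookup.
GLOSSARY = {
    "llm": "Large Language Model: a model trained to predict text.",
    "token": "A chunk of text, roughly 4 characters on average.",
}

def define_term(term: str) -> dict:
    """Look up a term in the local glossary."""
    definition = GLOSSARY.get(term.lower())
    return {"term": term, "definition": definition or "Not found"}

# The schema the model reads to decide when to call define_term
DEFINE_TERM_SCHEMA = {
    "type": "function",
    "function": {
        "name": "define_term",
        "description": "Look up the definition of an AI-related term.",
        "parameters": {
            "type": "object",
            "properties": {
                "term": {"type": "string", "description": "The term to define, e.g. 'token'"}
            },
            "required": ["term"],
        },
    },
}

print(define_term("LLM"))
```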
Chapter 4: Chatbot with Tools
In Chapter 2, we built a chatbot. A while True loop that takes user input, calls the API, and prints the response. In Chapter 3, we added tool calling, the model picks a function, we execute it, we send the result back.
Now we combine them, and something architectural emerges that's worth really understanding, because it's the foundation of every agent.
The Problem Chapter 3 Left Unsolved
Chapter 3's tool calling was one-shot. The user asks a question, the model calls tools, you send results back, and you get a final answer. But what happens when:
- The model needs to chain tools. It calls get_weather("London"), looks at the result, then decides it also needs calculate("15 - 22") to compare temperatures. That's two rounds of tool execution within a single user turn.
- The user follows up. They ask "What about Tokyo?" The model needs to call get_weather again, but this time with the full conversation history, including the previous weather lookup.
- The model calls multiple tools at once. "What's the weather in London and Tokyo?" The model might return two tool calls in a single response. You need to execute both and send both results back.
Chapter 3's code can't handle any of these. We need two loops.
The Architecture: Two Loops, One List
┌─── OUTER LOOP (conversation turns) ───────────────────────┐
│ │
│ User says something → append to messages │
│ │
│ ┌─── INNER LOOP (tool execution) ──────────────────┐ │
│ │ │ │
│ │ Call API with messages │ │
│ │ ↓ │ │
│ │ Model returns tool_calls? │ │
│ │ YES → execute tools, append results, LOOP │ │
│ │ NO → return text response │ │
│ │ │ │
│ └──────────────────────────────────────────────────┘ │
│ │
│ Print response → wait for next user input │
│ │
└───────────────────────────────────────────────────────────┘
The outer loop is the chatbot: it handles conversation turns. The inner loop handles tool execution within a single turn. And crucially, they both operate on the same messages list.
The Inner Loop: process_tool_calls()
The Complete Code for this Tool Calling Chatbot
import os
import json
from dotenv import load_dotenv
from openai import OpenAI
load_dotenv()
client = OpenAI(
    api_key=os.getenv("OPENAI_API_KEY"),
    base_url=os.getenv("OPENAI_BASE_URL"),
)
# ──────────────────────────────────────────────
# Tools — same as Chapter 3, but integrated into a chat loop
# ──────────────────────────────────────────────
def get_weather(city: str) -> dict:
"""Simulate a weather lookup."""
fake_weather = {
"new york": {"temp": "22°C", "condition": "Sunny"},
"london": {"temp": "15°C", "condition": "Cloudy"},
"tokyo": {"temp": "28°C", "condition": "Humid"},
"paris": {"temp": "18°C", "condition": "Rainy"},
"mumbai": {"temp": "34°C", "condition": "Hot and humid"},
}
data = fake_weather.get(city.lower(), {"temp": "20°C", "condition": "Unknown"})
return {"city": city, **data}
def calculate(expression: str) -> dict:
"""Evaluate a math expression safely."""
try:
result = eval(expression, {"__builtins__": {}})
return {"expression": expression, "result": str(result)}
except Exception as e:
return {"expression": expression, "error": str(e)}
AVAILABLE_FUNCTIONS = {
"get_weather": get_weather,
"calculate": calculate,
}
TOOLS = [
{
"type": "function",
"function": {
"name": "get_weather",
"description": "Get the current weather for a given city.",
"parameters": {
"type": "object",
"properties": {
"city": {"type": "string", "description": "The city name, e.g. 'New York'"}
},
"required": ["city"],
},
},
},
{
"type": "function",
"function": {
"name": "calculate",
"description": "Evaluate a mathematical expression and return the result.",
"parameters": {
"type": "object",
"properties": {
"expression": {"type": "string", "description": "A math expression, e.g. '2 + 2 * 3'"}
},
"required": ["expression"],
},
},
},
]
# ──────────────────────────────────────────────
# The tool execution loop — this is the core pattern
# ──────────────────────────────────────────────
def process_tool_calls(messages):
"""
Repeatedly calls the model until it stops requesting tools.
Returns the final assistant content.
This is the INNER LOOP that handles:
1. Model requests tool calls → execute them → send results → repeat
2. Model responds with text → return it (loop ends)
"""
while True:
response = client.chat.completions.create(
model=os.getenv("OPENAI_MODEL"),
messages=messages,
tools=TOOLS,
tool_choice="auto",
)
assistant_message = response.choices[0].message
finish_reason = response.choices[0].finish_reason
# If the model wants to call tools
if assistant_message.tool_calls:
# Append the assistant message (contains tool_calls metadata)
messages.append(assistant_message)
for tool_call in assistant_message.tool_calls:
func_name = tool_call.function.name
func_args = json.loads(tool_call.function.arguments)
print(f" 🔧 Calling {func_name}({func_args})")
# Execute the function
func = AVAILABLE_FUNCTIONS.get(func_name)
if func:
result = func(**func_args)
else:
result = {"error": f"Unknown function: {func_name}"}
# Append the tool result to messages
messages.append({
"role": "tool",
"tool_call_id": tool_call.id,
"content": json.dumps(result),
})
# Loop again — model might want to call more tools
continue
else:
# No more tool calls — model is done, return the text
messages.append({
"role": "assistant",
"content": assistant_message.content,
})
return assistant_message.content
# ──────────────────────────────────────────────
# The chatbot loop (outer loop = conversation turns)
# ──────────────────────────────────────────────
def run_chatbot():
messages = [
{
"role": "system",
"content": (
"You are a helpful assistant with access to tools. "
"Use the weather tool when asked about weather, "
"and the calculator when math is involved. "
"Always respond naturally after using tools."
),
},
]
print("=" * 60)
print(" Chatbot with Tools (type 'exit' to quit)")
print("=" * 60)
while True:
try:
user_input = input("\nYou: ").strip()
except (EOFError, KeyboardInterrupt):
print("\nGoodbye!")
break
if not user_input:
continue
if user_input.lower() in ("exit", "quit"):
print("Goodbye!")
break
messages.append({"role": "user", "content": user_input})
print() # Blank line before tool calls / response
response_text = process_tool_calls(messages)
print(f"\nAssistant: {response_text}")
if __name__ == "__main__":
run_chatbot()
This is the core of Chapter 4. Let's walk through every line:
def process_tool_calls(messages):
"""
Repeatedly calls the model until it stops requesting tools.
Returns the final assistant content.
"""
while True:
response = client.chat.completions.create(
model=os.getenv("OPENAI_MODEL"),
messages=messages,
tools=TOOLS,
tool_choice="auto",
)
assistant_message = response.choices[0].message
finish_reason = response.choices[0].finish_reason
if assistant_message.tool_calls:
# Append the assistant message WITH its tool_calls metadata
messages.append(assistant_message)
for tool_call in assistant_message.tool_calls:
func_name = tool_call.function.name
func_args = json.loads(tool_call.function.arguments)
print(f" 🔧Tool Calling {func_name}({func_args})")
func = AVAILABLE_FUNCTIONS.get(func_name)
if func:
result = func(**func_args)
else:
result = {"error": f"Unknown function: {func_name}"}
messages.append({
"role": "tool",
"tool_call_id": tool_call.id,
"content": json.dumps(result),
})
# Loop again — model might want more tools
continue
else:
# No tool calls — we have a text response
messages.append({
"role": "assistant",
"content": assistant_message.content,
})
return assistant_message.content
What's Happening Behind the Scenes
Let me trace through a real example to show you exactly what happens at every step.
User asks: "What's the weather in London and Tokyo, and what's the difference in temperature?"
This requires the model to:
- Call get_weather twice (London and Tokyo)
- Read both results
- Call calculate to find the difference
- Assemble a natural language answer
Here's the step-by-step reality:
User sends the message
The outer loop appends the user's message to messages:
messages = [
{"role": "system", "content": "You are a helpful assistant with tools..."},
{"role": "user", "content": "What's the weather in London and Tokyo..."},
]
Then it calls process_tool_calls(messages).
Inner Loop : Iteration 1: Model decides to call two tools
The API receives the messages list and returns a response where assistant_message.tool_calls contains two tool calls:
[
{"id": "call_abc", "function": {"name": "get_weather", "arguments": "{\"city\": \"London\"}"}},
{"id": "call_def", "function": {"name": "get_weather", "arguments": "{\"city\": \"Tokyo\"}"}}
]
The finish_reason is "tool_calls", not "stop". This tells us the model isn't done talking; it's waiting for tool results.
We execute both functions locally and append three messages to the list:
# 1. The assistant's message (contains the tool_calls metadata)
messages.append(assistant_message)
# 2. Tool result for London
messages.append({
"role": "tool",
"tool_call_id": "call_abc",
"content": '{"city": "London", "temp": "15°C", "condition": "Cloudy"}',
})
# 3. Tool result for Tokyo
messages.append({
"role": "tool",
"tool_call_id": "call_def",
"content": '{"city": "Tokyo", "temp": "28°C", "condition": "Humid"}',
})
Then continue back to the top of the while loop.
Inner Loop : Iteration 2: Model calls calculate
Now the API call sends the entire messages list including the weather results. The model sees both temperatures and decides to call calculate:
[{"id": "call_ghi", "function": {"name": "calculate", "arguments": "{\"expression\": \"28 - 15\"}"}}]
We execute calculate("28 - 15") → {"expression": "28 - 15", "result": "13"} and append the result.
Again, continue back to the top.
Inner Loop : Iteration 3: Model responds with text
The model now has all the data it needs. This time, assistant_message.tool_calls is empty and finish_reason is "stop". The model returns text:
"London is 15°C and cloudy, while Tokyo is 28°C and humid. The temperature difference is 13°C — Tokyo is significantly warmer!"
We append this as an assistant message and return from the inner loop. Back in the outer loop, the chatbot prints the response.
The Messages List: A Complete Trace
After that one user turn, here's what the messages list looks like: the original system prompt plus 7 new messages from a single question:
messages = [
# 1. System prompt (set at start)
{"role": "system", "content": "You are a helpful assistant with tools..."},
# 2. User's question
{"role": "user", "content": "What's the weather in London and Tokyo..."},
# 3. Model's first response (contains tool_calls, no text content)
{"role": "assistant", "tool_calls": [
{"id": "call_abc", "function": {"name": "get_weather", "arguments": '{"city":"London"}'}},
{"id": "call_def", "function": {"name": "get_weather", "arguments": '{"city":"Tokyo"}'}},
]},
# 4. Tool result: London weather
{"role": "tool", "tool_call_id": "call_abc",
"content": '{"city":"London","temp":"15°C","condition":"Cloudy"}'},
# 5. Tool result: Tokyo weather
{"role": "tool", "tool_call_id": "call_def",
"content": '{"city":"Tokyo","temp":"28°C","condition":"Humid"}'},
# 6. Model's second response (another tool_call)
{"role": "assistant", "tool_calls": [
{"id": "call_ghi", "function": {"name": "calculate", "arguments": '{"expression":"28-15"}'}},
]},
# 7. Tool result: calculation
{"role": "tool", "tool_call_id": "call_ghi",
"content": '{"expression":"28-15","result":"13"}'},
# 8. Model's final text response
{"role": "assistant", "content": "London is 15°C and cloudy, while Tokyo is 28°C..."},
]
Every single message stays in the list. The next time the user asks something, the API sees all of this, including the tool calls and results from previous turns. That's how the model knows "oh, we already looked up London's weather earlier" and can reference it without calling the tool again.
The Outer Loop: The Chatbot
With the inner loop handling all the tool complexity, the outer loop stays clean:
def run_chatbot():
messages = [
{
"role": "system",
"content": (
"You are a helpful assistant with access to tools. "
"Use the weather tool when asked about weather, "
"and the calculator when math is involved. "
"Always respond naturally after using tools."
),
},
]
print("=" * 60)
print(" Chatbot with Tools (type 'exit' to quit)")
print("=" * 60)
while True:
try:
user_input = input("\nYou: ").strip()
except (EOFError, KeyboardInterrupt):
print("\nGoodbye!")
break
if not user_input:
continue
if user_input.lower() in ("exit", "quit"):
print("Goodbye!")
break
messages.append({"role": "user", "content": user_input})
print() # Blank line before response
response_text = process_tool_calls(messages)
print(f"\nAssistant: {response_text}")From the user's perspective, they're just chatting. They have no idea that behind the scenes, the model might be making three API calls and executing five functions before it responds.
The Three Signals the API Gives You
Every response from the API contains a finish_reason that tells you why the model stopped:
| finish_reason | Meaning | What You Do |
|---|---|---|
| "stop" | Model is done talking, here's text | Return the text. Conversation turn complete. |
| "tool_calls" | Model wants to call functions | Execute them, send results, call API again. |
| "length" | Ran out of max_tokens | Response was cut off. You might need to increase the limit. |
In process_tool_calls(), we check assistant_message.tool_calls, which is effectively the same as checking for finish_reason == "tool_calls". When it's present, we execute and loop. When it's absent, we return.
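The table maps cleanly onto a tiny dispatch helper. This is an illustrative sketch, not part of the SDK (the helper name and action labels are ours):

```python
def next_action(finish_reason: str) -> str:
    """Map the API's finish_reason to what the loop should do next."""
    actions = {
        "stop": "return_text",          # model is done; hand text to the user
        "tool_calls": "execute_tools",  # run the tools, send results, call API again
        "length": "handle_truncation",  # raise max_tokens or shrink the context
    }
    return actions.get(finish_reason, "unknown")
```

You'd call it as `next_action(response.choices[0].finish_reason)` right after each API response.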
Why the messages.append(assistant_message) Matters
This is a subtle but critical detail. When the model returns tool calls, you must append the assistant's message before appending the tool results:
# ✅ Correct order
messages.append(assistant_message) # The model's decision (contains tool_calls)
messages.append({"role": "tool", ...}) # The result
# ❌ Wrong: missing the assistant message
messages.append({"role": "tool", ...}) # API will error — what tool_call does this respond to?The API matches tool results to tool calls using the tool_call_id. If the assistant's message (which contains the original tool_calls) isn't in the messages list, the API has nothing to match against and throws an error.
Think of it as a conversation:
- Assistant: "I'd like to call get_weather with city='London'" ← must be in the list
- Tool: "Here's the result: 15°C, Cloudy" ← matched via tool_call_id
The Critical Difference: continue vs return
The inner loop's decision point is the most important two lines in this chapter:
if assistant_message.tool_calls:
# ... execute tools ...
continue # ← Go back and call the API again
else:
# ... append text response ...
    return  # ← Done, exit the inner loop
continue means: "The model needs to see these tool results. Call the API again."
return means: "The model has spoken. Send this text to the user."
This is what enables tool chaining: the model calls get_weather, sees the result, decides it needs calculate, sees that result, and finally responds with text. Each continue is another round trip to the API.
How This Connects to Agents
This two-loop pattern is three lines of code away from being an agent:
| What this chapter has | What agents add |
|---|---|
| while True (no limit) | for i in range(max_iterations) |
| Model stops on its own | task_complete tool for explicit stop |
| Functions scattered globally | Bundled into an Agent class |
Chapter 5 wraps this in a class. Chapter 6 adds the safety valve. But the core mechanic (call the API, check for tools, execute, loop) is exactly what we have here.
Everything that happens in LangChain's AgentExecutor, AutoGen's conversation loop, and CrewAI's task runner is a dressed-up version of this while True → check tool_calls → execute → continue pattern. Now you've seen the raw version.
What Happens to Memory
One thing to be aware of: the messages list grows fast when tools are involved. A single user turn that triggers 3 tool calls adds 7 messages (user + 3 assistant + 3 tool). After 10 turns of tool-heavy conversation, you could easily have 70+ messages.
This means:
- Token costs go up — you're sending all previous tool calls and results every time
- Context window pressure — eventually you'll hit the model's limit
- Slower responses — more input tokens = more processing time
In production, you'd want to prune old tool results or summarize previous tool interactions. But for understanding the pattern, the naive approach works perfectly.
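One naive pruning strategy can be sketched like this, under the assumption that old tool outputs are the bulkiest part of the history: keep the conversation's shape intact but blank out tool results older than the last few messages. (The helper name and cutoff are ours.)

```python
def prune_old_tool_results(messages: list[dict], keep_recent: int = 6) -> list[dict]:
    """Replace tool results older than the last `keep_recent` messages
    with a short placeholder, preserving roles and tool_call_ids so the
    transcript still pairs up correctly."""
    cutoff = len(messages) - keep_recent
    pruned = []
    for i, msg in enumerate(messages):
        if msg.get("role") == "tool" and i < cutoff:
            pruned.append({**msg, "content": "[tool result pruned to save tokens]"})
        else:
            pruned.append(msg)
    return pruned
```

You'd run this on the list before each API call once it grows past some threshold; the model still sees that a tool was called, just not its full output.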
There are several other ways to address these memory-related challenges; that will be a separate blog post.
Try It Yourself
python main.py
Some things to try:
- "What's the weather in London?" single tool call
- "What's 2^10 + 3^10?" single calculate call
- "What's the weather in all 5 cities you know about?" model might call
get_weatherfive times - "Is it warmer in Tokyo or London? By how many degrees?" chained tools: weather lookup then calculation
- "You mentioned London earlier what was the temperature?" model uses conversation history, no tool needed
Watch the 🔧Tool Calling logs in the console. They show you exactly when the model decides to use tools vs. when it answers from memory. That decision-making is the seed of agent behavior.
Chapter 5: The Agent Abstraction
Up to now, we've had functions and scripts. Code that works but isn't reusable. You can't easily spin up two agents with different personalities and tools. Every time you want a new capability, you're copy-pasting the inner loop from Chapter 4 and changing variable names.
This chapter changes the language:
A chatbot responds to messages. An agent receives a task, decides what to do, takes actions, and delivers a result.
The distinction matters because it changes how you think about the system. A chatbot is reactive: it waits for your next message and responds. An agent is proactive: you give it a job, and it figures out what tools to call, in what order, and when to stop. The interface goes from chat() to agent.run(task) → result.
Why Wrap It in a Class?
In Chapter 4, everything was top-level. The tools were global variables, the inner loop was a standalone function, the system prompt was hardcoded. That works for a single chatbot. But the moment you want two agents, say, a researcher and a calculator, you're in trouble:
- You'd have to duplicate process_tool_calls() with different tool lists
- System prompts can't be swapped without editing code
- There's no way to name or identify different agents
- You can't pass an agent around as a value
The Agent class solves all of this by encapsulating the inner loop, system prompt, tools, and configuration into a single, reusable object.
The Agent Class
Here's the full class. If you read Chapter 4 carefully, you'll recognise the inner loop: it's the same while True → check tool_calls → execute → continue mechanic, just encapsulated:
import json
from openai import OpenAI
class Agent:
"""
A simple agent: takes a task, uses tools, returns a result.
This is the inner loop from Chapter 4 wrapped in a class,
with a clean interface: agent.run(task) -> result.
"""
def __init__(
self,
client: OpenAI,
name: str,
system_prompt: str,
tools: list[dict] | None = None,
functions: dict | None = None,
model: str = "gpt-4o-mini",
):
self.client = client
self.name = name
self.system_prompt = system_prompt
self.tools = tools or []
self.functions = functions or {}
self.model = model
def run(self, task: str) -> str:
"""Execute a task and return the result."""
print(f"\n[{self.name}] Starting task: {task[:80]}...")
messages = [
{"role": "system", "content": self.system_prompt},
{"role": "user", "content": task},
]
while True:
kwargs = {"model": self.model, "messages": messages}
if self.tools:
kwargs["tools"] = self.tools
kwargs["tool_choice"] = "auto"
response = self.client.chat.completions.create(**kwargs)
assistant_message = response.choices[0].message
if assistant_message.tool_calls:
messages.append(assistant_message)
for tool_call in assistant_message.tool_calls:
func_name = tool_call.function.name
func_args = json.loads(tool_call.function.arguments)
print(f" [{self.name}] Tool: {func_name}({func_args})")
func = self.functions.get(func_name)
if func:
result = func(**func_args)
else:
result = {"error": f"Unknown tool: {func_name}"}
result_str = json.dumps(result) if isinstance(result, dict) else str(result)
messages.append({
"role": "tool",
"tool_call_id": tool_call.id,
"content": result_str,
})
continue
else:
print(f" [{self.name}] Done.")
return assistant_message.content
Let's break down the key design decisions:
The Constructor: Configuration Over Code
def __init__(self, client, name, system_prompt, tools=None, functions=None, model="gpt-4o-mini"):
Each parameter serves a purpose:
- client: The OpenAI client is injected, not created internally. This means you can share one client across many agents (saves resources) or mock it for testing.
- name: A human-readable identifier. This shows up in logs so you can trace which agent did what.
- system_prompt: The agent's personality and instructions. This is what makes a research agent different from a math agent: same code, different prompt.
- tools: The tool schemas (JSON definitions the API needs).
- functions: A dictionary mapping function names to actual Python callables. The agent uses this to dispatch tool calls.
- model: Defaults to gpt-4o-mini, but you can pass a stronger model for harder tasks.
The run() Method: Task In, Result Out
def run(self, task: str) -> str:
This is the key abstraction shift. In Chapter 4, the chatbot had a conversation. Here, the agent receives a task and returns a result. No conversation state leaks out. Every call to run() starts with a fresh messages list: system prompt plus the task. The agent works autonomously until the model responds with text instead of tool calls.
Notice that messages is a local variable inside run(), not an instance variable. Each task gets a clean slate. The agent doesn't carry memory between tasks; that's intentional. Conversation history is a chatbot concern. Agents are stateless workers.
The Example: A Population Research Agent
The main.py in this chapter builds a practical example. A Research Agent that can look up country populations and perform calculations. Let's walk through it.
Defining the Tools
We define two Python functions that simulate real tools:
def look_up_population(country: str) -> dict:
"""Simulated database of country populations."""
data = {
"india": {"country": "India", "population": "1.44 billion"},
"china": {"country": "China", "population": "1.43 billion"},
"usa": {"country": "USA", "population": "334 million"},
"japan": {"country": "Japan", "population": "125 million"},
"germany": {"country": "Germany", "population": "84 million"},
}
return data.get(country.lower(), {"country": country, "population": "Unknown"})
def calculate(expression: str) -> dict:
"""Evaluate a math expression."""
try:
result = eval(expression, {"__builtins__": {}})
return {"expression": expression, "result": str(result)}
except Exception as e:
return {"expression": expression, "error": str(e)}look_up_population simulates a database lookup — in a real agent, this might call an API or query a database. calculate evaluates math expressions safely (with __builtins__ disabled to prevent code injection).
Next, we tell the API about these tools using the standard schema format, and create a function map so the agent can dispatch calls:
TOOLS = [
{
"type": "function",
"function": {
"name": "look_up_population",
"description": "Look up the population of a country.",
"parameters": {
"type": "object",
"properties": {
"country": {"type": "string", "description": "Country name"}
},
"required": ["country"],
},
},
},
{
"type": "function",
"function": {
"name": "calculate",
"description": "Evaluate a mathematical expression.",
"parameters": {
"type": "object",
"properties": {
"expression": {"type": "string", "description": "Math expression to evaluate"}
},
"required": ["expression"],
},
},
},
]
FUNCTIONS = {
"look_up_population": look_up_population,
"calculate": calculate,
}
Creating and Running the Agent
Now comes the clean part, creating the agent and giving it a task:
# ──────────────────────────────────────────────
# Creating a Research Agent
# ──────────────────────────────────────────────
research_agent = Agent(
client=client,
name="Research Agent",
system_prompt=(
"You are a research agent. When given a question, "
"use your tools to look up data and perform calculations. "
"Always show your work and provide a clear final answer."
),
tools=TOOLS,
functions=FUNCTIONS,
)
result = research_agent.run(
"What is the combined population of India and Japan? "
"Show the individual numbers and the total."
)
print("\n" + "=" * 60)
print("FINAL RESULT:")
print("=" * 60)
print(result)
That's it. One object, one method call, one result. Compare this to Chapter 4, where you'd need to wire up the outer loop, the inner loop, the tool dispatch, and the system prompt manually.
Walking Through the Execution
Let's trace exactly what happens when the agent runs the task: "What is the combined population of India and Japan? Show the individual numbers and the total."
Agent starts: run() is called
The run() method creates a fresh messages list:
messages = [
{"role": "system", "content": "You are a research agent..."},
{"role": "user", "content": "What is the combined population of India and Japan?..."},
]
It enters the while True loop and calls the API.
Iteration 1: Model calls two tools
The model reads the task and decides it needs population data for both countries. It returns two tool calls in a single response:
[
{"id": "call_abc", "function": {"name": "look_up_population", "arguments": "{\"country\": \"India\"}"}},
{"id": "call_def", "function": {"name": "look_up_population", "arguments": "{\"country\": \"Japan\"}"}}
]
The agent executes both:
[Research Agent] Tool: look_up_population({'country': 'India'})
[Research Agent] Tool: look_up_population({'country': 'Japan'})
Results are appended to messages: India → 1.44 billion, Japan → 125 million. Then continue back to the top of the loop.
Iteration 2: Model calls calculate
The model sees both population numbers and needs to add them. It calls the calculate tool:
[{"id": "call_ghi", "function": {"name": "calculate", "arguments": "{\"expression\": \"1440000000 + 125000000\"}"}}][Research Agent] Tool: calculate({'expression': '1440000000 + 125000000'})Result: 1565000000. Appended to messages. continue again.
Iteration 3: Model responds with text
The model now has all the data it needs. tool_calls is empty. It responds with a natural language answer:
"The population of India is approximately 1.44 billion and Japan's population is approximately 125 million. Their combined population is approximately 1.565 billion (1,565,000,000)."
The agent prints [Research Agent] Done. and returns this string.
Three iterations, three API calls, three tool executions, all happening inside a single run() call. The caller just sees the final result string.
The Messages Trace
After execution, the internal messages list looks like this (8 messages from a single run() call):
messages = [
# 1. System prompt
{"role": "system", "content": "You are a research agent..."},
# 2. The task
{"role": "user", "content": "What is the combined population of India and Japan?..."},
# 3. Model decides to look up both countries
{"role": "assistant", "tool_calls": [
{"id": "call_abc", "function": {"name": "look_up_population", "arguments": '{"country":"India"}'}},
{"id": "call_def", "function": {"name": "look_up_population", "arguments": '{"country":"Japan"}'}},
]},
# 4. India result
{"role": "tool", "tool_call_id": "call_abc",
"content": '{"country": "India", "population": "1.44 billion"}'},
# 5. Japan result
{"role": "tool", "tool_call_id": "call_def",
"content": '{"country": "Japan", "population": "125 million"}'},
# 6. Model decides to calculate the sum
{"role": "assistant", "tool_calls": [
{"id": "call_ghi", "function": {"name": "calculate", "arguments": '{"expression":"1440000000 + 125000000"}'}},
]},
# 7. Calculation result
{"role": "tool", "tool_call_id": "call_ghi",
"content": '{"expression": "1440000000 + 125000000", "result": "1565000000"}'},
# 8. Final text response
{"role": "assistant", "content": "The population of India is approximately 1.44 billion..."},
]
This is the same pattern from Chapter 4, but now it's all internal. The caller never sees this list. They just get the final string.
The Power of Configuration
The real beauty of the Agent class shows when you create multiple agents from the same class with different configurations:
# A research agent with data tools
researcher = Agent(
client=client,
name="Researcher",
system_prompt="You research topics thoroughly using available tools.",
tools=SEARCH_TOOLS,
functions=SEARCH_FUNCTIONS,
)
# A math agent with calculation tools
calculator = Agent(
client=client,
name="Calculator",
system_prompt="You solve math problems step by step.",
tools=MATH_TOOLS,
functions=MATH_FUNCTIONS,
)
# Same class, different configuration
research_result = researcher.run("Find information about Python's GIL")
math_result = calculator.run("What is the integral of x^2 from 0 to 5?")
Same class. Same run() method. Same inner loop. But different system prompts give them different personalities, and different tools give them different capabilities. This is the foundation of multi-agent systems: you don't build different agent architectures for different tasks; you build one architecture and configure it differently.
This is the same principle behind web frameworks. You don't write a new HTTP server for every endpoint, you configure routes, middleware, and handlers. The Agent class is your HTTP server for AI tasks.
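To preview where this goes in Chapter 7: because every agent shares the run(task) → result interface, chaining agents is plain Python. A sketch (run_pipeline and the context-passing convention are ours, not from the chapter's code):

```python
def run_pipeline(stages: list[tuple]) -> str:
    """Run (agent, task) stages in order, feeding each stage's result
    into the next stage's task as context."""
    context = ""
    for agent, task in stages:
        full_task = task if not context else f"{task}\n\nContext from previous stage:\n{context}"
        context = agent.run(full_task)
    return context
```

Usage would look like `run_pipeline([(researcher, "Find facts about X"), (calculator, "Check the numbers")])`: the researcher's result becomes context for the calculator, and the last stage's result is the pipeline's output.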
What Changed From Chapter 4
| Chapter 4 | Chapter 5 |
|---|---|
| Loose functions | Encapsulated class |
| Single configuration | Multiple agents with different configs |
| Hard to reuse | agent = Agent(...) then agent.run(task) |
| Conversation-oriented | Task-oriented |
| Stateful (grows over turns) | Stateless (fresh messages per task) |
| process_tool_calls(messages) | agent.run(task) → result |
The core logic is identical: it's still the inner loop from Chapter 4. But wrapping it in a class makes it composable. You can now imagine a pipeline of agents, each one handling a different stage of a complex task: Agent A researches, Agent B analyses, Agent C writes the report, all using the same Agent class with different configurations.
Why Stateless Matters
Notice that run() creates a fresh messages list every time it's called. This is a deliberate design choice:
def run(self, task: str) -> str:
messages = [ # ← Fresh every call
{"role": "system", "content": self.system_prompt},
{"role": "user", "content": task},
A chatbot needs memory: the whole point is a continuous conversation. But an agent is a worker. You give it a task, it does the work, it returns the result. The next task might be completely unrelated. Carrying conversation history between tasks would be confusing, not helpful.
This also makes agents safe to reuse. You can call agent.run() ten times with ten different tasks, and each one starts clean. No state leakage, no cross-contamination between tasks.
The Missing Pieces
This agent has two problems that become apparent in production:
1. No safety valve. If the model keeps calling tools forever, the while True loop never ends. Imagine a buggy tool that always returns results the model finds insufficient: it'll keep calling the tool in an infinite loop, burning through your API budget.
2. No explicit completion. The agent finishes when the model happens to respond with text instead of tool calls. There's no way for the agent to say "I'm done, here's my final deliverable" in a structured way. The model might produce a partial answer and stop calling tools prematurely, or it might add unnecessary commentary after the real answer.
These aren't theoretical problems; they're the first things you hit when you deploy agents. Chapter 6 introduces the agentic loop, which adds max_iterations (the safety valve) and a task_complete tool (explicit completion signaling), solving both.
The shift from "chatbot" to "agent" is mostly about framing. The code barely changes. But the mental model, task in, result out, unlocks a completely different way of building systems. Once you think in terms of agents, you start seeing tasks everywhere that can be delegated to an autonomous worker.
Try It Yourself
python main.pyThe agent will autonomously look up populations, perform calculations, and return a clear answer. Watch the console logs, every [Research Agent] Tool: line is a decision the model made on its own.
Some experiments to try:
- Change the task to ask about different countries, the agent will call
look_up_populationfor each one - Ask a question that requires multiple calculation steps and watch the model chain tool calls
- Try a country not in the database (like "Brazil") and see how the agent handles the "Unknown" response
- Create a second agent with a different system prompt (e.g., "You are a geography expert. Be concise.") and give it the same task then compare how the personality changes the output
You now have a reusable building block. Chapter 6 makes it production-ready with safety limits and explicit completion signaling.
Chapter 6: The Agentic Loop
Chapter 5 gave us a clean Agent class. But it has a dangerous problem: the while True loop has no exit condition other than "the model stops calling tools." If the model gets confused and keeps requesting tools forever, your program hangs and your API bill climbs. There's no way to know how many iterations happened, no way to cap the cost, and no structured way for the agent to say "I'm done."
This chapter introduces the agentic loop: the pattern that every serious agent framework implements. Two additions make all the difference:
- max_iterations: a hard ceiling on think-act-observe cycles
- task_complete tool: the agent explicitly signals "I'm done, here's my result"
Together, these turn a hopeful script into a production-ready agent.
The Pattern: Think → Act → Observe → Repeat
┌──────────────────────────────────────────┐
│ while iterations < max_iterations: │
│ 1. THINK → call the model │
│ 2. DECIDE → tool calls or text? │
│ 3. ACT → execute tools │
│ 4. OBSERVE → feed results back │
│ 5. CHECK → did agent call "done"? │
│ YES → return result │
│ NO → loop again │
└──────────────────────────────────────────┘
This is the heartbeat of every agent. LangChain's AgentExecutor, AutoGen's AssistantAgent, CrewAI's task runner: they all implement some version of this loop. The difference between frameworks is mostly what they wrap around this loop. The loop itself is universal.
The task_complete Tool
Instead of hoping the model stops calling tools, we give it an explicit way to say "I'm finished." This is a tool like any other: the model can call it with a structured result.
# Injected automatically the agent always has this tool
{
"type": "function",
"function": {
"name": "task_complete",
"description": "Call this when you have completed the task. Provide the final result.",
"parameters": {
"type": "object",
"properties": {
"result": {
"type": "string",
"description": "The final result/answer for the task.",
}
},
"required": ["result"],
},
},
}
The handler returns a sentinel value that the loop checks:
self.functions["task_complete"] = lambda result: {"__done__": True, "result": result}Why a sentinel instead of a flag? Because task_complete is handled in the exact same code path as other tools, the loop iterates over tool_calls, executes each one, and checks the result. If the result contains __done__: True, the loop exits immediately. No special-casing, no separate code path. The completion signal flows through the same pipeline as every other tool result.
The task_complete tool is injected automatically by the constructor. You never define it in your tool list. The agent always knows how to say "I'm done"; it's a built-in capability, batteries included, not something you configure.
The Full Agent
Here's the complete AgenticAgent class. If you compare it to Chapter 5's Agent, you'll find four differences: max_iterations, the injected task_complete tool, for instead of while True, and broader error handling.
class AgenticAgent:
def __init__(self, client, name, system_prompt,
tools=None, functions=None,
model="openai/gpt-oss-120b", max_iterations=10):
self.client = client
self.name = name
self.system_prompt = system_prompt
self.model = model
self.max_iterations = max_iterations
self.functions = dict(functions or {})
self.tools = list(tools or [])
# Inject the completion tool
self.tools.append({
"type": "function",
"function": {
"name": "task_complete",
"description": "Call this when you have completed the task.",
"parameters": {
"type": "object",
"properties": {
"result": {"type": "string", "description": "The final result."}
},
"required": ["result"],
},
},
})
self.functions["task_complete"] = lambda result: {
"__done__": True, "result": result
}
def run(self, task: str) -> str:
messages = [
{"role": "system", "content": self.system_prompt},
{"role": "user", "content": task},
]
for iteration in range(1, self.max_iterations + 1):
print(f" [{self.name}] Iteration {iteration}/{self.max_iterations}")
response = self.client.chat.completions.create(
model=self.model,
messages=messages,
tools=self.tools,
tool_choice="auto",
)
assistant_message = response.choices[0].message
# No tool calls = implicit completion
if not assistant_message.tool_calls:
return assistant_message.content or "(No response)"
messages.append(assistant_message)
for tool_call in assistant_message.tool_calls:
func_name = tool_call.function.name
func_args = json.loads(tool_call.function.arguments)
try:
result = self.functions[func_name](**func_args)
except Exception as e:
result = {"error": str(e)}
# Check for explicit completion
if isinstance(result, dict) and result.get("__done__"):
print(f" [{self.name}] Task complete!")
return result["result"]
result_str = json.dumps(result) if isinstance(result, dict) else str(result)
messages.append({
"role": "tool",
"tool_call_id": tool_call.id,
"content": result_str,
})
# Safety valve
return f"[Stopped after {self.max_iterations} iterations]"What Changed From Chapter 5
Let's zoom in on the four differences:
1. while True → for iteration in range(max_iterations)
- while True:
+ for iteration in range(1, self.max_iterations + 1):
+     print(f"  [{self.name}] Iteration {iteration}/{self.max_iterations}")
This is the safety valve. The loop will run at most max_iterations times. If the agent hasn't completed by then, it returns a safety message instead of running forever. You get progress visibility too: each iteration is logged.
2. The task_complete tool injection
+ self.tools.append({...task_complete schema...})
+ self.functions["task_complete"] = lambda result: {"__done__": True, "result": result}The constructor adds a task_complete tool to every agent. The model can call it to say "I'm done, here's my answer." The handler returns a sentinel dict that the loop checks.
3. The completion check inside the tool loop
result = self.functions[func_name](**func_args)
+
+ if isinstance(result, dict) and result.get("__done__"):
+ return result["result"]After executing each tool, the loop checks if the result is the sentinel. If so, it returns immediately & no more iterations, no more API calls.
4. Error handling
- func = self.functions.get(func_name)
- if func:
- result = func(**func_args)
- else:
- result = {"error": f"Unknown tool: {func_name}"}
+ try:
+ result = self.functions[func_name](**func_args)
+ except Exception as e:
+ result = {"error": str(e)}Chapter 5 checked for unknown tools. Chapter 6 wraps the entire call in try/except, which catches unknown tools and any runtime errors inside the tool function. The error is sent back to the model so it can decide what to do (retry, use a different tool, or give up).
Three Exit Paths
The agent can complete in three ways:
| Exit | How | When |
|---|---|---|
| Explicit | Agent calls task_complete(result) | Agent knows it's done and provides a structured result |
| Implicit | Agent responds with text, no tool calls | Simple tasks that don't need tools |
| Safety | max_iterations reached | Agent is stuck, confused, or in an infinite loop |
The explicit path is the one you want most of the time. It gives you a clean, intentional result rather than hoping the model's last message happens to be the right answer.
The Example: A Market Research Analyst
The main.py in this chapter builds a market research agent that analyses stocks. The task is deliberately multi-step: it requires looking up company info, fetching stock prices, performing calculations, and synthesizing a final analysis. This is exactly the kind of task where the agentic loop shines.
The Tools
We define three simulated tools:
def get_stock_price(symbol: str) -> dict:
"""Simulated stock price lookup."""
prices = {
"AAPL": 182.52, "GOOGL": 141.80, "MSFT": 378.91,
"AMZN": 178.25, "TSLA": 248.42, "META": 390.10,
}
price = prices.get(symbol.upper())
if price:
return {"symbol": symbol.upper(), "price": price, "currency": "USD"}
return {"symbol": symbol, "error": "Symbol not found"}
def get_company_info(name: str) -> dict:
"""Simulated company info lookup."""
info = {
"apple": {"name": "Apple Inc.", "ticker": "AAPL", "sector": "Technology", "employees": "164,000"},
"google": {"name": "Alphabet Inc.", "ticker": "GOOGL", "sector": "Technology", "employees": "182,000"},
"microsoft":{"name": "Microsoft Corp.", "ticker": "MSFT", "sector": "Technology", "employees": "221,000"},
}
return info.get(name.lower(), {"name": name, "error": "Company not found"})
def calculate(expression: str) -> dict:
"""Evaluate a math expression."""
try:
result = eval(expression, {"__builtins__": {}})
return {"expression": expression, "result": str(result)}
except Exception as e:
return {"expression": expression, "error": str(e)}
Notice the tools are richer than in Chapter 5. get_company_info returns the ticker symbol, sector, and employee count — giving the agent data it can reason about. get_stock_price takes a ticker symbol, so the agent must first look up the company to get the ticker, then use that ticker to look up the price. This forces tool chaining: the output of one tool becomes the input to another.
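The agent in the next section receives tools=TOOLS and functions=FUNCTIONS, which the chapter doesn't show being assembled. A plausible sketch of what they look like — only one schema is written out; the other tools follow the same shape:

```python
def get_stock_price(symbol: str) -> dict:
    """Simulated stock price lookup (condensed from above)."""
    prices = {"AAPL": 182.52, "MSFT": 378.91}
    price = prices.get(symbol.upper())
    if price:
        return {"symbol": symbol.upper(), "price": price, "currency": "USD"}
    return {"symbol": symbol, "error": "Symbol not found"}

# TOOLS: JSON schemas telling the model what each tool accepts.
TOOLS = [
    {
        "type": "function",
        "function": {
            "name": "get_stock_price",
            "description": "Look up the current stock price for a ticker symbol.",
            "parameters": {
                "type": "object",
                "properties": {"symbol": {"type": "string"}},
                "required": ["symbol"],
            },
        },
    },
    # ...schemas for get_company_info and calculate follow the same pattern
]

# FUNCTIONS: maps tool names to the actual Python callables.
FUNCTIONS = {"get_stock_price": get_stock_price}  # plus the other two tools
```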
Creating and Running the Agent
analyst = AgenticAgent(
client=client,
name="Market Analyst",
system_prompt=(
"You are a market research analyst. "
"Use your tools to gather data, perform calculations, and produce insights. "
"When you have a complete analysis, call the task_complete tool with your findings."
),
tools=TOOLS,
functions=FUNCTIONS,
max_iterations=10,
)
result = analyst.run(
"Compare Apple and Microsoft: look up both companies' info and stock prices, "
"then calculate which stock is cheaper per 1,000 employees. "
"Present a brief analysis."
)
print("\n" + "=" * 60)
print("FINAL ANALYSIS:")
print("=" * 60)
print(result)
Notice what the system prompt does: it tells the agent to "call the task_complete tool with your findings." This nudges the model toward the explicit completion path rather than just responding with text. Without this instruction, the model might skip task_complete entirely and respond implicitly.
The task itself is complex: it requires the agent to:
- Look up company info for Apple and Microsoft (to get employee counts and tickers)
- Look up stock prices using those tickers
- Calculate the price per 1,000 employees for each
- Compare the results and present an analysis
- Call task_complete with the final deliverable
That's at least seven tool calls across multiple iterations. Let's trace through it.
Walking Through the Execution
Iteration 1 : Agent gathers company information
The agent reads the task and decides it needs company data first. It calls get_company_info for both companies in a single response:
[Market Analyst] Iteration 1/10
[Market Analyst] 🔧 get_company_info({"name": "Apple"})
[Market Analyst] 📋 Result: {"name": "Apple Inc.", "ticker": "AAPL", "sector": "Technology", "employees": "164,000"}
[Market Analyst] 🔧 get_company_info({"name": "Microsoft"})
[Market Analyst] 📋 Result: {"name": "Microsoft Corp.", "ticker": "MSFT", "sector": "Technology", "employees": "221,000"}
The model now knows both tickers (AAPL, MSFT) and employee counts (164,000 and 221,000). These results are appended to the messages list and the loop continues.
Iteration 2 : Agent fetches stock prices
With the tickers in hand, the agent calls get_stock_price for both:
[Market Analyst] Iteration 2/10
[Market Analyst] 🔧 get_stock_price({"symbol": "AAPL"})
[Market Analyst] 📋 Result: {"symbol": "AAPL", "price": 182.52, "currency": "USD"}
[Market Analyst] 🔧 get_stock_price({"symbol": "MSFT"})
[Market Analyst] 📋 Result: {"symbol": "MSFT", "price": 378.91, "currency": "USD"}
Now the agent has all four data points: employee counts and stock prices for both companies. Next it needs to calculate.
Iteration 3 : Agent performs calculations
The agent calculates the "price per 1,000 employees" for each company:
[Market Analyst] Iteration 3/10
[Market Analyst] 🔧 calculate({"expression": "182.52 / 164"})
[Market Analyst] 📋 Result: {"expression": "182.52 / 164", "result": "1.1129..."}
[Market Analyst] 🔧 calculate({"expression": "378.91 / 221"})
[Market Analyst] 📋 Result: {"expression": "378.91 / 221", "result": "1.7145..."}
The agent divides the stock price by the number of employees (in thousands). Apple: ~$1.11 per 1,000 employees. Microsoft: ~$1.71 per 1,000 employees. Apple is cheaper by this metric.
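You can verify the arithmetic yourself — price divided by employees expressed in thousands:

```python
apple = 182.52 / 164      # $182.52 stock price, 164 thousand employees
microsoft = 378.91 / 221  # $378.91 stock price, 221 thousand employees
print(round(apple, 4), round(microsoft, 4))  # 1.1129 1.7145
```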
Iteration 4 : Agent delivers the final analysis
The agent has all the data and calculations it needs. It calls task_complete with a structured analysis:
[Market Analyst] Iteration 4/10
[Market Analyst] 🔧 task_complete({"result": "..."})
[Market Analyst] ✅ Task complete!
The task_complete handler returns {"__done__": True, "result": "..."}. The loop checks for __done__, finds it, and returns the result string immediately. No more iterations.
Four iterations, seven tool calls, one clean result. The agent used only 4 of its 10 allowed iterations, leaving plenty of headroom.
The Execution Flow in Detail
Here's what the iteration counter and messages list look like at each stage:
Iteration 1: [system, user] → API → [assistant+tools, tool, tool]
Iteration 2: [system, user, asst, tool, tool] → API → [assistant+tools, tool, tool]
Iteration 3: [system, user, asst, tool, tool, asst, tool, tool] → API → [assistant+tools, tool, tool]
Iteration 4: [system, user, asst, tool, tool, asst, tool, tool, asst, tool, tool] → API → task_complete → EXIT
Each iteration sends the entire messages history to the API. The model sees everything it has done so far (every tool call, every result) and decides what to do next. This is why agents can reason about intermediate results and chain tools together.
Notice how the messages list grows with each iteration. After 4 iterations with 2 tool calls each, we have 14 messages from a single run() call. In production, a 10-iteration run with complex tools could produce 30-40 messages. This is why max_iterations is also a cost control mechanism: each iteration costs an API call with an increasingly large input.
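The growth is easy to model, assuming two tool calls per iteration as in this run:

```python
# Each iteration appends one assistant message plus one tool message per tool call.
messages = ["system", "user"]
for iteration in range(4):
    messages.append("assistant+tools")  # the model's response with tool calls
    messages += ["tool", "tool"]        # one result message per tool call
print(len(messages))  # 14
```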
Why task_complete Changes Everything
Without task_complete, the agent in Chapter 5 finishes when the model happens to respond with text. This creates a subtle problem:
# Chapter 5: Implicit completion
# The model might respond with:
# "Based on my calculations, Apple is cheaper at $1.11 per 1000 employees."
# But it might also respond with:
# "Let me also check Google for comparison..." ← then call more tools
# "Interesting, let me recalculate..." ← unnecessary extra work
The model doesn't know when to stop. Sometimes it over-delivers, sometimes it under-delivers. With task_complete, the agent makes a deliberate decision to finish:
# Chapter 6: Explicit completion
# The model calls:
# task_complete(result="Apple (AAPL) at $182.52 is cheaper per 1,000 employees
# ($1.11) compared to Microsoft (MSFT) at $378.91 ($1.71).")
The system prompt tells the agent when to call task_complete ("When you have a complete analysis"), and the model follows that instruction. The result is cleaner, more predictable, and easier to parse downstream if you're feeding it into another system.
Error Recovery
One of the improvements in Chapter 6 is error handling. The try/except block catches any exception from tool execution:
try:
result = self.functions[func_name](**func_args)
except Exception as e:
result = {"error": str(e)}
When a tool fails, the error is sent back to the model as a regular tool result. The model can then:
- Retry with different arguments (e.g., correcting a typo in a ticker symbol)
- Use a different tool to get the same information
- Report the error in its final analysis
This makes agents resilient. A single tool failure doesn't crash the entire run; the agent adapts and continues.
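You can see this in miniature with the calculate tool from earlier: a bad expression comes back as data, not as a crash:

```python
def calculate(expression: str) -> dict:
    """Evaluate a math expression, returning errors as data (same as above)."""
    try:
        result = eval(expression, {"__builtins__": {}})
        return {"expression": expression, "result": str(result)}
    except Exception as e:
        return {"expression": expression, "error": str(e)}

print(calculate("1/0"))  # {'expression': '1/0', 'error': 'division by zero'}
```

The error dict goes back into the messages list like any other tool result, so the model can read it and change course.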
Why This Matters
Compare Chapter 5 and Chapter 6:
# Chapter 5: "I hope it stops eventually"
while True:
response = call_model()
if not response.tool_calls:
return response.content # 🤞
# Chapter 6: "It will stop, and I'll know why"
for i in range(max_iterations):
response = call_model()
for tool in response.tool_calls:
result = execute(tool)
if result.get("__done__"):
return result["result"] # ✅ Explicit completion
return "[Safety limit reached]" # ✅ Bounded
The difference is small in code but huge in practice. You can now:
- Monitor progress : you know which iteration you're on
- Set budgets : 10 iterations max means at most 10 API calls
- Debug reliably : if it hit the safety limit, something went wrong
- Get structured results : task_complete returns a clean deliverable
- Control costs : cap iterations per agent based on task complexity
Tuning max_iterations
The default of 10 is a good starting point, but the right value depends on the task:
| Task Type | Recommended max_iterations | Why |
|---|---|---|
| Simple Q&A | 3-5 | One or two tool calls, then answer |
| Data lookup + calculation | 5-8 | Multiple lookups, some math, then synthesize |
| Multi-step research | 8-15 | Chain of lookups, comparisons, analysis |
| Complex code generation | 15-25 | Write, test, debug, iterate |
A good rule of thumb: set max_iterations to 2x what you expect the agent to need. If a task should take 4 iterations, set the limit to 8. This gives the agent room for retries and unexpected paths while still preventing runaway loops.
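The rule of thumb fits in one line. This is a hypothetical helper, not part of the chapter's code:

```python
def suggest_max_iterations(expected_iterations: int) -> int:
    """Rule of thumb: 2x the expected iterations, with a small floor for retries."""
    return max(3, 2 * expected_iterations)

print(suggest_max_iterations(4))  # 8
```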
The agent pattern is now complete: a reusable, configurable, safe, and predictable autonomous worker that takes a task and delivers a result.
Try It Yourself
python main.py
The agent will autonomously research Apple and Microsoft, fetch stock prices, perform calculations, and deliver a final analysis. Watch the iteration counter and tool calls in the console.
Some experiments to try:
- Set max_iterations=2 : watch the agent hit the safety limit before it can finish the analysis. The output will be [Stopped after 2 iterations]
- Add more companies to the task : "Compare Apple, Microsoft, and Google" and see how many iterations the agent needs
- Remove the task_complete hint from the system prompt : the agent might respond implicitly instead of calling task_complete. Compare the output quality
- Ask for a ticker that doesn't exist : "Compare Apple and Netflix" and see how the agent handles the {"error": "Symbol not found"} result from get_stock_price
- Create a second agent with the same tools but max_iterations=3 and a system prompt that says "Be extremely concise" : see how the personality and budget constraints affect the output
This is the last single-agent chapter. You now have a production-ready agent pattern: bounded, explicit, and resilient. Chapter 7 takes the leap to multiple agents collaborating on a single task.
Chapter 7: The Multi-Agent System
This is the capstone. Everything from the previous six chapters comes together here: messages, tools, the agentic loop, and now multiple agents working as a team.
The idea is simple: instead of one agent doing everything (badly), you have specialized agents that each do one thing (well):
User Input → Clarifier → PRD Agent → Planner → Coder → Working Software
The Clarifier turns a vague idea into a spec. The PRD Agent turns the spec into requirements. The Planner breaks requirements into tasks. The Coder writes actual files. Each agent is focused, predictable, and testable.
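In code, the pipeline is just sequential run() calls, each feeding the next. Stub agents are used here purely to show the shape; the real classes come later in the chapter:

```python
class StubAgent:
    """Stand-in for a real agent: takes a task string, returns a result string."""
    def __init__(self, name: str):
        self.name = name
    def run(self, task: str) -> str:
        return f"[{self.name}] output for: {task[:40]}"

clarifier, prd_agent, planner, coder = (
    StubAgent(n) for n in ("Clarifier", "PRD Agent", "Planner", "Coder")
)

spec = clarifier.run("Build a calculator app")
prd = prd_agent.run(spec)   # spec flows into the PRD agent
plan = planner.run(prd)     # PRD flows into the planner
code = coder.run(plan)      # plan flows into the coder
print(code)
```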
Project Structure
Before diving in, it helps to know the layout: every file in the project is shown with its full code later in this chapter.
The Example: Building a Calculator App
To make this concrete, here's the task we'll trace through the entire chapter:
python main.py "Build a calculator app with HTML, CSS, and JavaScript"
That single sentence goes in one end. Out the other end comes a timestamped sandbox with a specification, PRD, task plan, and working code files, all produced by four agents collaborating autonomously.
Part 1: The Framework
The framework is the engine that powers the multi-agent lifecycle. Similar to production frameworks like CrewAI or LangGraph, it provides the core primitives for state management, tool execution, and agent orchestration. These components are entirely domain-agnostic, serving as the reusable infrastructure for any complex agentic workflow.
We're not just building a multi-agent system — we're building the framework itself from scratch. By the end of this chapter, you'll understand exactly how tools like CrewAI, LangGraph, and AutoGen work under the hood: how they register tools, manage agent lifecycles, orchestrate pipelines, and sandbox execution. No magic, no abstractions you can't explain, just the raw mechanics that every production framework is built on.
framework/__init__.py
This is just a convenience file. Instead of writing long import paths every time, this file lets the rest of the project say from framework import BaseAgent instead of from framework.base_agent import BaseAgent. Think of it as a reception desk: it tells Python "here's everything this package offers":
"""Framework Package Init"""
from .base_agent import BaseAgent
from .tool_registry import ToolRegistry
from .message_bus import MessageBus, Message
from .orchestrator import Orchestrator, Phase
__all__ = ["BaseAgent", "ToolRegistry", "MessageBus", "Message", "Orchestrator", "Phase"]
framework/tool_registry.py
What is it? A central phonebook for tools.
Why do we need it? In Chapters 3–6 we kept two separate lists: one describing what each tool looks like (the "schema"), and another holding the actual function to run. If you added a new tool to one list but forgot the other, everything would crash at runtime. That's fragile and error-prone.
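The pattern in miniature — a condensed, runnable sketch; the full ToolRegistry below adds schema listing, introspection helpers, and docstrings:

```python
class MiniRegistry:
    """Condensed sketch: schema and handler stored together under one name."""
    def __init__(self):
        self._tools = {}

    def register(self, name, description, parameters, func):
        self._tools[name] = {
            "func": func,
            "schema": {"type": "function",
                       "function": {"name": name, "description": description,
                                    "parameters": parameters}},
        }

    def execute(self, name, arguments):
        if name not in self._tools:
            raise ValueError(f"Tool '{name}' is not registered.")
        return self._tools[name]["func"](**arguments)

reg = MiniRegistry()
reg.register("add", "Add two numbers",
             {"type": "object",
              "properties": {"a": {"type": "number"}, "b": {"type": "number"}},
              "required": ["a", "b"]},
             lambda a, b: a + b)
print(reg.execute("add", {"a": 2, "b": 3}))  # 5
```

Because registration is a single call, the schema and the function can never drift apart.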
The ToolRegistry fixes this by bundling the description and the function together in one place. You call register() once, and both the "what it looks like" and "what it does" are stored together. When the AI asks to use a tool, the registry looks it up by name and runs it, no mismatches possible:
class ToolRegistry:
"""
Manages the registration and execution of tools.
Each tool is:
- A Python callable (the function to run)
- A JSON schema (tells the model what parameters the function accepts)
- A name and description (helps the model decide when to use it)
"""
def __init__(self):
self._tools: dict[str, dict] = {}
def register(self, name: str, description: str, parameters: dict, func: callable):
"""Register a tool with its schema and handler."""
self._tools[name] = {
"func": func,
"schema": {
"type": "function",
"function": {
"name": name,
"description": description,
"parameters": parameters,
},
},
}
def get_schemas(self) -> list[dict]:
"""Return all tool schemas in OpenAI API format."""
return [t["schema"] for t in self._tools.values()]
def get_names(self) -> list[str]:
"""Return all registered tool names."""
return list(self._tools.keys())
def has_tool(self, name: str) -> bool:
return name in self._tools
def execute(self, name: str, arguments: dict) -> any:
"""Execute a tool by name with the given arguments."""
if name not in self._tools:
raise ValueError(f"Tool '{name}' is not registered.")
return self._tools[name]["func"](**arguments)
def __len__(self):
return len(self._tools)
def __repr__(self):
return f"ToolRegistry(tools={self.get_names()})"
framework/base_agent.py
What is it? The blueprint that every agent in our system is built from.
Why do we need it? Imagine you're running a company. You don't want every employee to reinvent how meetings work, how to submit reports, or how to clock in. Instead, you create a standard employee handbook that covers all the basics. Every employee follows the same process, but each one specializes in different work.
That's exactly what BaseAgent does. It's the "employee handbook" for AI agents. It handles the boring-but-essential stuff that every agent needs:
- The agentic loop : the think → decide → act → observe cycle from Chapter 6
- Rate limiting : prevents burning through your API budget by spacing out calls
- Colored logging : each agent prints its name in a different color so you can tell them apart in the terminal
- The task_complete tool : a built-in way for agents to say "I'm done, here's my result"
- Pre/post hooks : lets specialized agents customize behavior before or after running
Every specialized agent (Clarifier, Coder, Planner) inherits from this class, so they all get these features for free. They only need to define what makes them unique: their personality (system prompt) and their tools:
import json
import os
import time
from openai import OpenAI
from .tool_registry import ToolRegistry
class BaseAgent:
"""
Base class for all agents in the multi-agent system.
Subclasses should:
1. Set self.name and self.system_prompt in __init__
2. Register tools via self.tool_registry.register(...)
3. Optionally override pre_run() and post_run() hooks
"""
COLORS = {
"blue": "\033[94m", "green": "\033[92m", "yellow": "\033[93m",
"magenta": "\033[95m", "cyan": "\033[96m", "red": "\033[91m",
"reset": "\033[0m", "bold": "\033[1m", "dim": "\033[2m",
}
# Class-level rate limiter — shared across ALL agents
_api_call_timestamps: list = []
_rate_limit: int = 14
_rate_window: int = 60
def __init__(self, client: OpenAI, name: str = "Agent",
system_prompt: str = "You are a helpful assistant.",
model: str = None, max_iterations: int = 15, color: str = "cyan"):
self.client = client
self.name = name
self.system_prompt = system_prompt
self.model = model or os.getenv("OPENAI_MODEL", "gpt-4o-mini")
self.max_iterations = max_iterations
self.color = color
self.tool_registry = ToolRegistry()
# Built-in task_complete tool
self.tool_registry.register(
name="task_complete",
description="Call this tool when you have fully completed the assigned task.",
parameters={
"type": "object",
"properties": {
"result": {
"type": "string",
"description": "The complete final result/deliverable for the task.",
}
},
"required": ["result"],
},
func=self._handle_task_complete,
)
@classmethod
def _wait_for_rate_limit(cls):
"""Sleep if we're about to exceed the API rate limit."""
now = time.time()
cls._api_call_timestamps = [
t for t in cls._api_call_timestamps if now - t < cls._rate_window
]
if len(cls._api_call_timestamps) >= cls._rate_limit:
oldest = cls._api_call_timestamps[0]
wait_time = cls._rate_window - (now - oldest) + 1
if wait_time > 0:
print(f"\033[93m⏳ Rate limit: waiting {wait_time:.0f}s...\033[0m")
time.sleep(wait_time)
cls._api_call_timestamps.append(time.time())
def log(self, message: str, style: str = ""):
c = self.COLORS.get(self.color, "")
r = self.COLORS["reset"]
s = self.COLORS.get(style, "")
print(f"{c}[{self.name}]{r} {s}{message}{r}")
def pre_run(self, task: str) -> str:
return task
def post_run(self, result: str) -> str:
return result
def run(self, task: str) -> str:
task = self.pre_run(task)
self.log(f"📋 Task received", "bold")
self.log(f" {task[:120]}{'...' if len(task) > 120 else ''}", "dim")
tool_names = self.tool_registry.get_names()
tool_reminder = (
f"\n\nIMPORTANT: You have ONLY these tools available: {', '.join(tool_names)}. "
"Do NOT invent tool names or add prefixes. Use the exact names listed above."
)
messages = [
{"role": "system", "content": self.system_prompt + tool_reminder},
{"role": "user", "content": task},
]
tool_schemas = self.tool_registry.get_schemas()
for iteration in range(1, self.max_iterations + 1):
self.log(f"🔄 Iteration {iteration}/{self.max_iterations}", "dim")
self._wait_for_rate_limit()
kwargs = {"model": self.model, "messages": messages}
if tool_schemas:
kwargs["tools"] = tool_schemas
kwargs["tool_choice"] = "auto"
response = self.client.chat.completions.create(**kwargs)
assistant_message = response.choices[0].message
if not assistant_message.tool_calls:
result = assistant_message.content or "(empty response)"
self.log("💬 Completed (text response)")
return self.post_run(result)
messages.append(assistant_message)
for tool_call in assistant_message.tool_calls:
func_name = tool_call.function.name
func_args = json.loads(tool_call.function.arguments)
self.log(f"🔧 {func_name}({self._summarize_args(func_args)})")
try:
result = self.tool_registry.execute(func_name, func_args)
except Exception as e:
result = {"error": str(e)}
self.log(f"❌ Error: {e}", "red")
if isinstance(result, dict) and result.get("__done__"):
self.log("✅ Task complete!", "bold")
return self.post_run(result["result"])
result_str = json.dumps(result) if isinstance(result, (dict, list)) else str(result)
messages.append({
"role": "tool",
"tool_call_id": tool_call.id,
"content": result_str,
})
self.log(f"⚠️ Max iterations ({self.max_iterations}) reached!", "yellow")
return self.post_run(f"[Agent stopped after {self.max_iterations} iterations]")
def _handle_task_complete(self, result: str) -> dict:
return {"__done__": True, "result": result}
def _summarize_args(self, args: dict) -> str:
parts = []
for k, v in args.items():
val_str = str(v)
if len(val_str) > 60:
val_str = val_str[:57] + "..."
parts.append(f"{k}={val_str!r}")
return ", ".join(parts)
Why the tool_reminder? LLMs sometimes make up tool names that don't exist, like calling search_web() when no such tool was given. This is called "hallucination." By appending the exact list of real tool names to every prompt, we give the AI a cheat sheet: "these are your ONLY tools, don't invent new ones." It's a small trick that dramatically reduces errors, especially with smaller, cheaper models.
framework/message_bus.py
What is it? A shared notebook where every message between agents is recorded.
Why do we need it? Imagine four people working on a project, but they all communicate by whispering to each other. If something goes wrong, you have no idea who said what to whom. A message bus solves this: it's like a shared Slack channel where every message is logged with who sent it, who received it, what type of message it was, and when it was sent.
This gives us two superpowers:
- Debugging : if the Coder produces wrong code, you can trace back through the message log to see exactly what instructions the Planner gave it
- Auditability : after the pipeline runs, you get a full message_bus_log.md file showing every single interaction between agents
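In use, the bus behaves like this — a condensed, runnable sketch of the classes defined below (metadata and export omitted):

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import Any

@dataclass
class Message:
    sender: str
    receiver: str
    msg_type: str
    content: Any
    timestamp: str = field(default_factory=lambda: datetime.now().isoformat())

class MessageBus:
    def __init__(self):
        self._messages: list[Message] = []

    def send(self, sender, receiver, msg_type, content):
        msg = Message(sender, receiver, msg_type, content)
        self._messages.append(msg)
        return msg

    def receive(self, receiver, msg_type=None):
        msgs = [m for m in self._messages if m.receiver == receiver]
        return [m for m in msgs if m.msg_type == msg_type] if msg_type else msgs

bus = MessageBus()
bus.send("Planner", "Coder", "task", "Implement index.html")
tasks = bus.receive("Coder", msg_type="task")
print(tasks[0].sender, "->", tasks[0].content)  # Planner -> Implement index.html
```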
The Message class defines the structure of each message (sender, receiver, type, content). The MessageBus class stores them all and can export the full history as a readable Markdown file:
from dataclasses import dataclass, field
from datetime import datetime
from typing import Any
@dataclass
class Message:
"""A structured message between agents."""
sender: str
receiver: str
msg_type: str # "task", "result", "document", "feedback"
content: Any
timestamp: str = field(default_factory=lambda: datetime.now().isoformat())
metadata: dict = field(default_factory=dict)
class MessageBus:
"""
A simple message bus for inter-agent communication.
Agents can:
- send() a message to a specific receiver
- receive() all messages addressed to them
- get_conversation() to see the full history between two agents
- export_to_markdown() to dump the full log to a readable .md file
"""
def __init__(self):
self._messages: list[Message] = []
def send(self, sender: str, receiver: str, msg_type: str,
content: Any, metadata: dict | None = None) -> Message:
msg = Message(
sender=sender, receiver=receiver, msg_type=msg_type,
content=content, metadata=metadata or {},
)
self._messages.append(msg)
return msg
def receive(self, receiver: str, msg_type: str | None = None) -> list[Message]:
msgs = [m for m in self._messages if m.receiver == receiver]
if msg_type:
msgs = [m for m in msgs if m.msg_type == msg_type]
return msgs
def get_latest(self, receiver: str, msg_type: str | None = None) -> Message | None:
msgs = self.receive(receiver, msg_type)
return msgs[-1] if msgs else None
def get_conversation(self, agent_a: str, agent_b: str) -> list[Message]:
return [
m for m in self._messages
if (m.sender == agent_a and m.receiver == agent_b)
or (m.sender == agent_b and m.receiver == agent_a)
]
def get_all(self) -> list[Message]:
return list(self._messages)
def clear(self):
self._messages.clear()
def export_to_markdown(self, file_path: str) -> str:
lines = ["# 📬 Message Bus Log", "",
f"> Total messages: {len(self._messages)}", "", "---", ""]
for i, msg in enumerate(self._messages, 1):
lines.append(f"## Message {i}: {msg.sender} → {msg.receiver}")
lines.append("")
lines.append(f"| Field | Value |")
lines.append(f"|-------|-------|")
lines.append(f"| **Type** | `{msg.msg_type}` |")
lines.append(f"| **Timestamp** | `{msg.timestamp}` |")
if msg.metadata:
meta_str = ", ".join(f"{k}={v}" for k, v in msg.metadata.items())
lines.append(f"| **Metadata** | {meta_str} |")
lines.append("")
lines.append("**Content:**")
lines.append("")
content_str = str(msg.content)
if content_str.startswith("#") or "**" in content_str:
lines.append(content_str)
else:
lines.append("```")
lines.append(content_str[:3000])
lines.append("```")
lines.append("")
lines.append("---")
lines.append("")
import os
os.makedirs(os.path.dirname(file_path) or ".", exist_ok=True)
with open(file_path, "w", encoding="utf-8") as f:
f.write("\n".join(lines))
return file_path
def __len__(self):
return len(self._messages)
framework/orchestrator.py
What is it? The project manager that runs the entire show.
Why do we need it? You have four agents, but someone needs to decide: who goes first? What does each agent receive as input? What happens with their output? Do we need human approval before continuing? Where do we save the results?
The Orchestrator answers all of these questions. It runs a pipeline: a sequence of steps called Phases. Each phase says: "run this agent, give it this prompt, and optionally save the output or ask the user for approval."
The beautiful thing is that the orchestrator is completely generic. It doesn't know anything about calculators, specifications, or code. It just runs phases in order. Want to add a step? Define a new Phase. Want to skip a step? Remove it. No code changes needed.
The Phase dataclass is like a recipe card: it describes one step. The Orchestrator class is the chef that follows the recipe cards in order:
import os
from dataclasses import dataclass, field
from datetime import datetime
from typing import Callable
from openai import OpenAI
from .message_bus import MessageBus
from tools.sandbox import set_sandbox_root
@dataclass
class Phase:
"""A single step in the agent pipeline."""
name: str # "Requirement Clarification"
agent_name: str # Key from register_agent()
prompt_template: str # Uses {user_input} and {previous_output}
report_filename: str | None = None
needs_approval: bool = False # Ask user yes/no before proceeding
setup_sandbox: bool = False # Activate sandbox for file tools
pass_message_bus: bool = False # Give agent access to the message bus
pre_hooks: list[Callable] = field(default_factory=list)
post_hooks: list[Callable] = field(default_factory=list)
class Orchestrator:
"""Generic pipeline runner for multi-agent workflows."""
COLORS = {
"header": "\033[1;97m", "reset": "\033[0m",
"dim": "\033[2m", "green": "\033[92m",
"yellow": "\033[93m", "cyan": "\033[96m",
}
def __init__(self, client: OpenAI, output_base: str = None):
self.client = client
self.message_bus = MessageBus()
self.agents = {}
self.phases: list[Phase] = []
self.output_base = output_base or os.path.join(
os.path.dirname(os.path.abspath(__file__)), "..", "output"
)
def register_agent(self, name: str, agent):
self.agents[name] = agent
def add_phase(self, phase: Phase):
self.phases.append(phase)
def run(self, user_input: str) -> str:
paths = self._create_sandbox()
self.log(f"📂 Sandbox: {paths['sandbox']}")
self.message_bus.send("user", "orchestrator", "input", user_input)
context = {"user_input": user_input, "paths": paths, "message_bus": self.message_bus}
previous_output = user_input
for i, phase in enumerate(self.phases, 1):
self._print_phase(f"Phase {i}/{len(self.phases)}: {phase.name}")
agent = self.agents.get(phase.agent_name)
if not agent:
raise RuntimeError(f"Agent '{phase.agent_name}' not registered.")
if phase.setup_sandbox:
set_sandbox_root(paths["code"])
if phase.pass_message_bus:
agent.message_bus = self.message_bus
for hook in phase.pre_hooks:
hook(agent, phase, context)
prompt = phase.prompt_template.format(
user_input=user_input, previous_output=previous_output,
)
output = agent.run(prompt)
if phase.needs_approval:
output = self._approval_loop(agent, phase, output)
for hook in phase.post_hooks:
hook(agent, phase, output, context)
if phase.report_filename:
self._save_report(paths["reports"], phase.report_filename,
phase.name, output)
next_agent = self.phases[i].agent_name if i < len(self.phases) else "orchestrator"
self.message_bus.send(phase.agent_name, next_agent, "document", output,
metadata={"phase": phase.name})
previous_output = output
# Save final report and message bus log
self._save_report(paths["reports"], "final_report.md", "Final Report", previous_output)
bus_log_path = os.path.join(paths["sandbox"], "message_bus_log.md")
self.message_bus.export_to_markdown(bus_log_path)
return previous_output
def _approval_loop(self, agent, phase, output):
"""Show output to user and optionally re-run with feedback."""
print(f"\n{'='*60}\n {phase.name.upper()}\n{'='*60}")
print(output)
print(f"{'='*60}\n")
while True:
feedback = input("Is this correct? (yes / no / type your edits): ").strip()
if feedback.lower() in ("yes", "y", ""):
print(" ✅ Approved!\n")
return output
edit_request = input("What should be changed? ").strip() if feedback.lower() in ("no", "n") else feedback
output = agent.run(
f"The user wants changes.\n\nCurrent output:\n{output}\n\n"
f"User feedback:\n{edit_request}\n\nUpdate based on this feedback."
)
print(f"\n{'='*60}\n UPDATED {phase.name.upper()}\n{'='*60}")
print(output)
print(f"{'='*60}\n")
def _create_sandbox(self):
timestamp = datetime.now().strftime("%Y-%m-%d_%H-%M-%S")
sandbox = os.path.join(os.path.abspath(self.output_base), f"sandbox_{timestamp}")
reports_dir = os.path.join(sandbox, "reports")
code_dir = os.path.join(sandbox, "code")
os.makedirs(reports_dir, exist_ok=True)
os.makedirs(code_dir, exist_ok=True)
return {"sandbox": sandbox, "reports": reports_dir, "code": code_dir}
def _save_report(self, reports_dir, filename, title, content):
filepath = os.path.join(reports_dir, filename)
with open(filepath, "w", encoding="utf-8") as f:
f.write(f"# {title}\n\n> Generated: {datetime.now().isoformat()}\n\n---\n\n{content}\n")
self.log(f"📄 Saved: {filename}")
def log(self, message):
print(f"{self.COLORS['green']}[Orchestrator]{self.COLORS['reset']} {message}")
def _print_phase(self, text):
c = self.COLORS
print(f"\n{c['cyan']}{'─' * 60}\n {text}\n{'─' * 60}{c['reset']}\n")
The key line is previous_output = output. This single line is what makes the whole system work as a pipeline. After each agent finishes, its output is saved into previous_output. The next agent's prompt template uses {previous_output} as a placeholder, which gets filled with whatever the previous agent produced. So the Clarifier's spec flows into the PRD Agent, the PRD flows into the Planner, and so on, like an assembly line where each worker adds their piece and passes it forward.
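The threading mechanism can be shown with stand-in strings instead of real agent calls — the hypothetical templates here just illustrate how {previous_output} flows from phase to phase:

```python
# How {previous_output} threads one phase's result into the next prompt.
templates = [
    "Write a specification for: {user_input}",
    "Write a PRD based on:\n{previous_output}",
    "Break this PRD into tasks:\n{previous_output}",
]
user_input = "Build a calculator app"
previous_output = user_input
for template in templates:
    prompt = template.format(user_input=user_input, previous_output=previous_output)
    # Stand-in for output = agent.run(prompt):
    previous_output = f"<agent output for: {prompt.splitlines()[0]}>"
print(previous_output)
```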
The approval loop (_approval_loop) is equally important: when a phase has needs_approval=True, the orchestrator pauses, shows you the output, and asks "Is this correct?" If you say yes, it continues. If you type feedback, it re-runs the agent with your comments, showing both what it produced AND what you want changed. This way the agent makes targeted edits rather than starting from scratch.
With the framework in place, let's build the multi-agent workflow itself. We start with the tools the agents will use.
Part 2: The Tools
What are tools? They're the agent's hands. The AI can think and decide, but it can't actually create files or run commands on your computer; it's just a language model generating text. Tools bridge this gap. When the AI says "I want to create index.html", a tool function actually writes that file to disk.
Why sandboxed? Here's the problem: the AI doesn't know (and shouldn't care) where your project folder is. It just says write_file("index.html", ...). The sandbox layer transparently redirects this to a safe output directory like output/sandbox_2026-02-21/code/index.html. This means the AI can't accidentally overwrite your system files, and each pipeline run gets its own clean folder.
tools/__init__.py
Just like the framework's init file, this is a convenience file. It combines all the tool schemas from the three tool modules (file, shell, search) into one big list called ALL_TOOL_SCHEMAS. When the Coder agent wants every tool, it just imports this single list instead of importing from three different files:
"""Tools Package Init"""
from .file_tools import FILE_TOOL_SCHEMAS
from .shell_tools import SHELL_TOOL_SCHEMAS
from .search_tools import SEARCH_TOOL_SCHEMAS
from .sandbox import set_sandbox_root, get_sandbox_root

ALL_TOOL_SCHEMAS = FILE_TOOL_SCHEMAS + SHELL_TOOL_SCHEMAS + SEARCH_TOOL_SCHEMAS

__all__ = ["FILE_TOOL_SCHEMAS", "SHELL_TOOL_SCHEMAS", "SEARCH_TOOL_SCHEMAS",
           "ALL_TOOL_SCHEMAS", "set_sandbox_root", "get_sandbox_root"]

tools/sandbox.py
What is it? The security guard for file paths.
Why do we need it? When the AI says write_file("index.html", ...), we need to decide where that file actually goes. Without a sandbox, it could end up anywhere on your computer. Worse, a creative AI could try write_file("../../../etc/passwd", ...) and mess with system files. This is called a "path traversal attack."
The sandbox module sets a single root directory. Every file path the AI provides is automatically resolved inside that directory. Try to escape with ../? The sandbox catches it and locks you back in. The AI doesn't even know this is happening; it just passes simple filenames and the sandbox handles the rest:
import os

_sandbox_root: str | None = None

def set_sandbox_root(path: str):
    """Set the global sandbox root directory for all tools."""
    global _sandbox_root
    _sandbox_root = os.path.abspath(path)
    os.makedirs(_sandbox_root, exist_ok=True)

def get_sandbox_root() -> str | None:
    return _sandbox_root

def resolve_path(user_path: str) -> str:
    """
    Resolve a user-provided path to an absolute path within the sandbox.
    Prevents path traversal attacks (e.g. ../../../etc/passwd).
    """
    if _sandbox_root is None:
        return os.path.abspath(user_path)
    abs_path = os.path.abspath(user_path)
    # Compare against root + separator so a sibling directory such as
    # "sandbox_evil" cannot slip past a plain prefix check.
    if abs_path == _sandbox_root or abs_path.startswith(_sandbox_root + os.sep):
        return abs_path
    resolved = os.path.normpath(os.path.join(_sandbox_root, user_path))
    if resolved != _sandbox_root and not resolved.startswith(_sandbox_root + os.sep):
        resolved = os.path.join(_sandbox_root, os.path.basename(user_path))
    return resolved

tools/file_tools.py
What is it? The four basic file operations that every coding agent needs: read a file, write a file, list what's in a directory, and create a new directory.
Why do we need it? These are the Coder agent's core "hands." Without these, the AI can think about code all day but can't actually create any files. Every function uses resolve_path() from the sandbox, so file paths are always safe. Each function returns a dictionary with status information (was it created? how big is it?) or an error message if something went wrong:
import os
from .sandbox import resolve_path

def read_file(file_path: str) -> dict:
    try:
        abs_path = resolve_path(file_path)
        if not os.path.isfile(abs_path):
            return {"path": file_path, "error": f"File not found: {abs_path}"}
        with open(abs_path, "r", encoding="utf-8") as f:
            content = f.read()
        return {"path": abs_path, "content": content, "size_bytes": len(content.encode("utf-8"))}
    except Exception as e:
        return {"path": file_path, "error": str(e)}

def write_file(file_path: str, content: str) -> dict:
    try:
        abs_path = resolve_path(file_path)
        parent_dir = os.path.dirname(abs_path)
        if parent_dir:
            os.makedirs(parent_dir, exist_ok=True)
        with open(abs_path, "w", encoding="utf-8") as f:
            f.write(content)
        return {"path": abs_path, "status": "written", "size_bytes": len(content.encode("utf-8"))}
    except Exception as e:
        return {"path": file_path, "error": str(e)}

def list_directory(directory_path: str = ".") -> dict:
    try:
        abs_path = resolve_path(directory_path)
        if not os.path.isdir(abs_path):
            return {"path": directory_path, "error": f"Not a directory: {abs_path}"}
        entries = []
        for item in sorted(os.listdir(abs_path)):
            full_path = os.path.join(abs_path, item)
            entry = {"name": item, "type": "directory" if os.path.isdir(full_path) else "file"}
            if os.path.isfile(full_path):
                entry["size_bytes"] = os.path.getsize(full_path)
            entries.append(entry)
        return {"path": abs_path, "entries": entries, "count": len(entries)}
    except Exception as e:
        return {"path": directory_path, "error": str(e)}

def create_directory(directory_path: str) -> dict:
    try:
        abs_path = resolve_path(directory_path)
        os.makedirs(abs_path, exist_ok=True)
        return {"path": abs_path, "status": "created"}
    except Exception as e:
        return {"path": directory_path, "error": str(e)}

FILE_TOOL_SCHEMAS = [
    {
        "name": "read_file",
        "description": "Read the contents of a file. Just provide the filename.",
        "parameters": {
            "type": "object",
            "properties": {"file_path": {"type": "string", "description": "Filename or relative path."}},
            "required": ["file_path"],
        },
        "func": read_file,
    },
    {
        "name": "write_file",
        "description": "Write content to a file. Parent directories are created automatically.",
        "parameters": {
            "type": "object",
            "properties": {
                "file_path": {"type": "string", "description": "Filename or relative path."},
                "content": {"type": "string", "description": "The content to write."},
            },
            "required": ["file_path", "content"],
        },
        "func": write_file,
    },
    {
        "name": "list_directory",
        "description": "List all files and subdirectories. Use '.' for the project root.",
        "parameters": {
            "type": "object",
            "properties": {"directory_path": {"type": "string", "description": "Directory or '.' for root."}},
            "required": ["directory_path"],
        },
        "func": list_directory,
    },
    {
        "name": "create_directory",
        "description": "Create a directory.",
        "parameters": {
            "type": "object",
            "properties": {"directory_path": {"type": "string", "description": "Directory path."}},
            "required": ["directory_path"],
        },
        "func": create_directory,
    },
]

tools/shell_tools.py
What is it? Lets the AI run terminal commands like python app.py or npm install.
Why do we need it? Sometimes the AI needs to test its code after writing it, install a dependency, or check something on the system. This tool runs a command in the terminal and captures the output.
But giving an AI access to your terminal is risky, so we add safety guards: a 30-second timeout (no infinite loops), output caps (no flooding memory with huge outputs), and a blocklist of obviously dangerous commands like rm -rf / or format. Commands run inside the sandbox directory by default:
import subprocess
import os
from .sandbox import get_sandbox_root

def run_command(command: str, cwd: str | None = None) -> dict:
    """Execute a shell command with safety guards and a 30s timeout."""
    dangerous = ["rm -rf /", "format ", "del /s /q", "shutdown", "mkfs"]
    for d in dangerous:
        if d in command.lower():
            return {"command": command, "error": f"Blocked: dangerous pattern '{d}'"}
    try:
        if cwd:
            abs_cwd = os.path.abspath(cwd)
        else:
            sandbox = get_sandbox_root()
            abs_cwd = sandbox if sandbox else os.path.abspath(".")
        os.makedirs(abs_cwd, exist_ok=True)
        result = subprocess.run(command, shell=True, capture_output=True,
                                text=True, timeout=30, cwd=abs_cwd)
        return {"command": command, "cwd": abs_cwd,
                "stdout": result.stdout[:5000], "stderr": result.stderr[:2000],
                "exit_code": result.returncode}
    except subprocess.TimeoutExpired:
        return {"command": command, "error": "Command timed out after 30 seconds."}
    except Exception as e:
        return {"command": command, "error": str(e)}

SHELL_TOOL_SCHEMAS = [
    {
        "name": "run_command",
        "description": "Execute a shell command. Has a 30-second timeout.",
        "parameters": {
            "type": "object",
            "properties": {
                "command": {"type": "string", "description": "The shell command."},
                "cwd": {"type": "string", "description": "Optional working directory."},
            },
            "required": ["command"],
        },
        "func": run_command,
    },
]

tools/search_tools.py
What is it? Two tools for finding things: search_files finds files by name, and grep_in_file searches inside a file for specific text.
Why do we need it? When the Coder is working on Task 2 and needs to find a function defined in Task 1's code, it can use grep_in_file("script.js", "calculateResult") to find the exact line. search_files is useful when the AI isn't sure what files exist — it can search by pattern (e.g., find all .css files):
import os
import re
from .sandbox import resolve_path, get_sandbox_root

def search_files(directory: str = ".", pattern: str = "") -> dict:
    try:
        abs_dir = resolve_path(directory)
        if not os.path.isdir(abs_dir):
            return {"directory": directory, "error": f"Not a directory: {abs_dir}"}
        matches = []
        pattern_lower = pattern.lower()
        for root, dirs, files in os.walk(abs_dir):
            dirs[:] = [d for d in dirs if not d.startswith('.') and d not in ('node_modules', '__pycache__')]
            for f in files:
                if pattern_lower in f.lower():
                    matches.append(os.path.join(root, f))
                if len(matches) >= 50:
                    break
            if len(matches) >= 50:
                break
        return {"directory": abs_dir, "pattern": pattern, "matches": matches, "count": len(matches)}
    except Exception as e:
        return {"directory": directory, "error": str(e)}

def grep_in_file(file_path: str, search_term: str) -> dict:
    try:
        abs_path = resolve_path(file_path)
        if not os.path.isfile(abs_path):
            return {"path": file_path, "error": f"File not found: {abs_path}"}
        matches = []
        with open(abs_path, "r", encoding="utf-8", errors="replace") as f:
            for line_num, line in enumerate(f, 1):
                if re.search(search_term, line, re.IGNORECASE):
                    matches.append({"line_number": line_num, "content": line.rstrip()})
                    if len(matches) >= 50:
                        break
        return {"path": abs_path, "search_term": search_term, "matches": matches, "count": len(matches)}
    except Exception as e:
        return {"path": file_path, "error": str(e)}

SEARCH_TOOL_SCHEMAS = [
    {
        "name": "search_files",
        "description": "Search for files by name pattern (recursive).",
        "parameters": {
            "type": "object",
            "properties": {
                "directory": {"type": "string", "description": "Directory or '.' for root."},
                "pattern": {"type": "string", "description": "Filename substring (case-insensitive)."},
            },
            "required": ["directory", "pattern"],
        },
        "func": search_files,
    },
    {
        "name": "grep_in_file",
        "description": "Search for a text pattern inside a file.",
        "parameters": {
            "type": "object",
            "properties": {
                "file_path": {"type": "string", "description": "Filename or path."},
                "search_term": {"type": "string", "description": "Text or regex to search for."},
            },
            "required": ["file_path", "search_term"],
        },
        "func": grep_in_file,
    },
]

The sandbox is set per pipeline run, not per agent. All agents share the same sandbox, so when the Planner assigns Task 2, the Coder can read_file files created in Task 1.
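A quick way to convince yourself the tool layer behaves before wiring in any agents: the standalone sketch below inlines simplified copies of the sandbox resolution and two file tools (so it runs on its own, outside the `tools` package) and shows an escape attempt being contained.

```python
import os
import tempfile

# Inline, simplified stand-ins for the sandbox + file tools above.
SANDBOX = tempfile.mkdtemp(prefix="sandbox_demo_")

def resolve(user_path: str) -> str:
    """Pin any user-supplied path inside SANDBOX."""
    resolved = os.path.normpath(os.path.join(SANDBOX, user_path))
    if not resolved.startswith(SANDBOX + os.sep):
        # Escape attempt: keep only the filename, rooted in the sandbox.
        resolved = os.path.join(SANDBOX, os.path.basename(user_path))
    return resolved

def write_file(file_path: str, content: str) -> dict:
    abs_path = resolve(file_path)
    os.makedirs(os.path.dirname(abs_path), exist_ok=True)
    with open(abs_path, "w", encoding="utf-8") as f:
        f.write(content)
    return {"path": abs_path, "status": "written"}

def read_file(file_path: str) -> dict:
    abs_path = resolve(file_path)
    with open(abs_path, "r", encoding="utf-8") as f:
        return {"path": abs_path, "content": f.read()}

write_file("index.html", "<h1>Calc</h1>")
write_file("../../etc/passwd", "oops")  # silently redirected into the sandbox
assert read_file("index.html")["content"] == "<h1>Calc</h1>"
assert os.path.exists(os.path.join(SANDBOX, "passwd"))  # escape was contained
```

The traversal attempt lands harmlessly as `<sandbox>/passwd`, which is exactly the fallback behavior resolve_path implements above.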
Part 3: The Agents
Now the fun part. Each agent is a thin subclass of BaseAgent, meaning it inherits all the infrastructure (agentic loop, rate limiting, logging, task_complete) for free. The only things that make each agent unique are:
- Its personality: the system prompt that tells it who it is and how to behave
- Its tools: what actions it can perform
Think of it like hiring four specialists. They all follow the same company handbook (BaseAgent), but each one is an expert in a different area.
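To make "thin subclass" concrete, here is the entire shape of adding a fifth specialist. The BaseAgent stub below stands in for the real framework class so the snippet runs on its own, and ReviewerAgent is a hypothetical example, not part of the pipeline:

```python
# Stub standing in for framework.base_agent.BaseAgent, just to show the shape;
# the real class also carries the agentic loop, tool registry, and logging.
class BaseAgent:
    def __init__(self, client, name, system_prompt, max_iterations, color):
        self.client = client
        self.name = name
        self.system_prompt = system_prompt
        self.max_iterations = max_iterations
        self.color = color

# A hypothetical fifth specialist: everything unique about it fits in __init__.
class ReviewerAgent(BaseAgent):
    def __init__(self, client):
        super().__init__(
            client=client, name="Reviewer",
            system_prompt="You are a code reviewer. Flag bugs and style issues. "
                          "When done, call task_complete with your review.",
            max_iterations=3, color="red",
        )

agent = ReviewerAgent(client=None)  # no API client needed just to construct it
```

That's the whole pattern you'll see four times below: a one-screen subclass whose only real content is a prompt and a few constructor arguments.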
agents/__init__.py
Another convenience file makes all four agent classes importable from one place:
"""Agents Package Init"""
from .clarifier_agent import ClarifierAgent
from .prd_agent import PRDAgent
from .planner_agent import PlannerAgent
from .coder_agent import CoderAgent

__all__ = ["ClarifierAgent", "PRDAgent", "PlannerAgent", "CoderAgent"]

agents/clarifier_agent.py
Role: The interviewer. Takes your vague, one-line idea and turns it into a structured, clear specification with purpose, features, tech stack, and file structure.
Why do we need it? If you tell a developer "build me a calculator," they'll have 100 questions: What kind? Scientific? Basic? Web or mobile? What features? The Clarifier asks and answers these questions for you by reasoning about your input and producing a standardized spec. It has max_iterations=3 because this is a simple task, no tools needed, just thinking:
from framework.base_agent import BaseAgent
from openai import OpenAI

class ClarifierAgent(BaseAgent):
    """Elaborates vague problem statements into concise specifications."""

    def __init__(self, client: OpenAI):
        super().__init__(
            client=client, name="Clarifier",
            system_prompt=self._build_system_prompt(),
            max_iterations=3, color="blue",
        )

    def _build_system_prompt(self) -> str:
        return """You are a requirements analyst. Take the user's raw idea and produce a SHORT, focused specification.
Your output MUST follow this EXACT format and stay UNDER 150 words total:
## Purpose
One sentence describing what the app does.
## Features
A numbered list of 3-6 core features. Keep each to one line. No sub-features.
## Tech Stack
One line listing the technologies (e.g. "HTML5, CSS3, vanilla JavaScript").
## File Structure
A simple list of files to create (3-5 files max for simple projects).
RULES:
- Do NOT add accessibility modules, pub/sub patterns, or complex architecture.
- Do NOT over-engineer. A calculator needs 3 files (HTML, CSS, JS), not 8.
- Keep it SIMPLE. Think MVP — minimum viable product.
- When done, call task_complete with your specification."""

agents/prd_agent.py
Role: The product manager. Takes the Clarifier's specification and turns it into a Product Requirements Document (PRD): a more structured document with numbered requirements, a file structure, and an implementation order.
Why do we need it? A specification says what the app should do. A PRD says how to organize the work. It transforms loose feature descriptions into concrete, numbered requirements ("FR-1: Create a responsive layout") and decides what order to build things in. This gives the Planner a much clearer target to break into tasks:
from framework.base_agent import BaseAgent
from openai import OpenAI

class PRDAgent(BaseAgent):
    """Generates a compact PRD from a specification."""

    def __init__(self, client: OpenAI):
        super().__init__(
            client=client, name="PRD Agent",
            system_prompt=self._build_system_prompt(),
            max_iterations=3, color="green",
        )

    def _build_system_prompt(self) -> str:
        return """You are a product manager. Given an approved specification, produce a SHORT PRD.
Your output MUST follow this EXACT format:
## Summary
2-3 sentences describing the project.
## Requirements
A numbered list (FR-1, FR-2, ...) of requirements. ONE LINE each. Maximum 5-6.
## File Structure
project_name/
├── file1.ext  # brief description
└── ...
## Implementation Order
A numbered list of 2-3 steps, ordered by dependency. Group related work together.
RULES:
- Do NOT add non-functional requirements, deployment sections, or glossaries.
- A simple calculator = index.html + styles.css + script.js. That's it.
- Think MINIMAL. When done, call task_complete with your PRD."""

agents/planner_agent.py
Role: The team lead who breaks the PRD into tasks and delegates them to the Coder. This is where the magic of "agents calling agents" happens.
Why do we need it? A PRD might say "build a calculator with HTML, CSS, and JS." But a Coder agent works best with focused, one-at-a-time tasks like "Create the HTML structure and CSS styling" or "Implement all JavaScript calculator logic." The Planner splits the big job into 2-3 smaller tasks and uses its special assign_task_to_coder tool to hand each one to the Coder agent.
This is the most interesting agent because it's an agent that uses another agent as a tool. When the Planner calls assign_task_to_coder(...), the function internally runs self.coder_agent.run(task), launching a full sub-agent with its own agentic loop. The Planner waits for the Coder to finish, then assigns the next task. It also keeps track of files created by previous tasks so the Coder can reference them:
import json
import os
from framework.base_agent import BaseAgent
from openai import OpenAI
from tools.sandbox import get_sandbox_root

class PlannerAgent(BaseAgent):
    """Master agent that divides a PRD into tasks and assigns them to the Coder."""

    def __init__(self, client: OpenAI, coder_agent=None):
        super().__init__(
            client=client, name="Planner",
            system_prompt=self._build_system_prompt(),
            max_iterations=10, color="yellow",
        )
        self.coder_agent = coder_agent
        self.message_bus = None  # Set by orchestrator at runtime
        self._task_results = []
        self._created_files = []
        self.tool_registry.register(
            name="assign_task_to_coder",
            description=(
                "Assign a coding task to the Coder agent. "
                "Call this once per task, in order. MAXIMUM 3 tasks total."
            ),
            parameters={
                "type": "object",
                "properties": {
                    "task_number": {"type": "integer", "description": "Task number (1, 2, or 3)."},
                    "task_title": {"type": "string", "description": "Short title."},
                    "task_description": {"type": "string",
                        "description": "Detailed description including ALL file names and code requirements."},
                },
                "required": ["task_number", "task_title", "task_description"],
            },
            func=self._assign_task,
        )

    def _build_system_prompt(self) -> str:
        return """You are a technical project manager.
Given a PRD, break it into AT MOST 2-3 coding tasks and assign each to the Coder.
CRITICAL RULES:
- Create AT MOST 3 tasks total. Fewer is better.
- COMBINE related work into single tasks. For example:
  - Task 1: "Create HTML structure with CSS styling" (creates index.html AND styles.css)
  - Task 2: "Implement all JavaScript logic" (creates script.js with ALL functionality)
- Do NOT create a separate task for each file or each feature.
- Assign tasks ONE AT A TIME using assign_task_to_coder.
- After all tasks complete, call task_complete with a brief summary."""

    def _get_file_context(self) -> str:
        if not self._created_files:
            return ""
        context = "\n\nPREVIOUSLY CREATED FILES (you can read these with read_file):\n"
        for f in self._created_files:
            context += f"  - {f}\n"
        return context

    def _scan_sandbox_files(self) -> list[str]:
        sandbox = get_sandbox_root()
        if not sandbox or not os.path.isdir(sandbox):
            return []
        files = []
        for root, dirs, filenames in os.walk(sandbox):
            dirs[:] = [d for d in dirs if not d.startswith('.')]
            for f in filenames:
                files.append(os.path.relpath(os.path.join(root, f), sandbox))
        return files

    def _assign_task(self, task_number: int, task_title: str, task_description: str) -> dict:
        self.log(f"📌 Task {task_number}: {task_title}")
        if self.message_bus:
            self.message_bus.send("planner", "coder", "task",
                                  f"Task {task_number}: {task_title}\n\n{task_description}",
                                  metadata={"task_number": task_number, "task_title": task_title})
        if not self.coder_agent:
            return {"task_number": task_number, "status": "error", "error": "No coder agent available."}
        file_context = self._get_file_context()
        full_task = f"TASK {task_number}: {task_title}\n\n{task_description}{file_context}"
        coder_result = self.coder_agent.run(full_task)
        self._created_files = self._scan_sandbox_files()
        result = {"task_number": task_number, "task_title": task_title,
                  "status": "completed", "files_created": self._created_files.copy(),
                  "coder_output": coder_result[:500]}
        self._task_results.append(result)
        if self.message_bus:
            self.message_bus.send("coder", "planner", "result", coder_result[:1000],
                                  metadata={"task_number": task_number, "status": "completed",
                                            "files": self._created_files.copy()})
        self.log(f"✅ Task {task_number} completed by Coder")
        self.log(f"   📁 Files: {', '.join(self._created_files) or 'none'}")
        return result

agents/coder_agent.py
Role: The developer who actually writes code files. This is the only agent with access to file, shell, and search tools.
Why do we need it? Everything so far has been planning and writing documents: specs, PRDs, task plans. The Coder is where all that planning turns into actual code. It receives a focused task like "Create index.html with a calculator layout and styles.css with a dark theme," and it uses write_file to create the actual files in the sandbox.
Notice the system prompt includes strict efficiency rules: write files directly (no asking for permission), batch related files into one task, and call task_complete immediately after finishing. Without these rules, the AI tends to over-think and loop unnecessarily:
from framework.base_agent import BaseAgent
from tools import ALL_TOOL_SCHEMAS
from openai import OpenAI

class CoderAgent(BaseAgent):
    """Writes actual code files using sandboxed file, shell, and search tools."""

    def __init__(self, client: OpenAI, project_dir: str = "./output"):
        super().__init__(
            client=client, name="Coder",
            system_prompt=self._build_system_prompt(),
            max_iterations=10, color="magenta",
        )
        for tool_def in ALL_TOOL_SCHEMAS:
            self.tool_registry.register(
                name=tool_def["name"], description=tool_def["description"],
                parameters=tool_def["parameters"], func=tool_def["func"],
            )

    def _build_system_prompt(self) -> str:
        return """You are an expert software developer. Write clean, working code.
All file operations are automatically sandboxed — just use simple filenames.
For example: write_file("index.html", ...) or write_file("src/app.js", ...).
AVAILABLE TOOLS (use ONLY these exact names):
- write_file(file_path, content) — Create/overwrite a file.
- read_file(file_path) — Read a previously created file.
- list_directory(directory_path) — List files. Use '.' for root.
- create_directory(directory_path) — Create a subdirectory.
- run_command(command) — Run a shell command. 30s timeout.
- search_files(directory, pattern) — Search for files by name.
- grep_in_file(file_path, search_term) — Search inside a file.
- task_complete(result) — Call when ALL files are written. REQUIRED.
EFFICIENCY RULES:
1. Write files DIRECTLY. Don't list_directory before creating NEW files.
2. Create MULTIPLE files in ONE response — batch write_file calls.
3. After writing all files, IMMEDIATELY call task_complete.
Write COMPLETE, WORKING code. No placeholder comments like "// TODO"."""

Part 4: The Entry Point
main.py
import os
import sys
from dotenv import load_dotenv
from openai import OpenAI

sys.path.insert(0, os.path.dirname(os.path.abspath(__file__)))

from framework.orchestrator import Orchestrator, Phase
from agents.clarifier_agent import ClarifierAgent
from agents.prd_agent import PRDAgent
from agents.planner_agent import PlannerAgent
from agents.coder_agent import CoderAgent

def build_pipeline(orchestrator: Orchestrator):
    """Define the software development pipeline."""
    # Phase 1: Clarify requirements (with user approval)
    orchestrator.add_phase(Phase(
        name="Requirement Clarification",
        agent_name="clarifier",
        prompt_template=(
            "The user wants to build the following:\n\n{user_input}\n\n"
            "Elaborate this into a concise specification."
        ),
        report_filename="01_specification.md",
        needs_approval=True,
    ))
    # Phase 2: Generate a PRD
    orchestrator.add_phase(Phase(
        name="PRD Generation",
        agent_name="prd_agent",
        prompt_template=(
            "Create a concise Product Requirements Document (PRD) based on "
            "this approved specification:\n\n{previous_output}"
        ),
        report_filename="02_prd.md",
    ))
    # Phase 3: Plan tasks and generate code
    orchestrator.add_phase(Phase(
        name="Task Planning & Code Generation",
        agent_name="planner",
        prompt_template=(
            "You have a PRD to implement. Break it down into coding tasks "
            "and use your tools to assign each task to the coding agent.\n\n"
            "PRD:\n{previous_output}"
        ),
        report_filename="03_task_plan.md",
        setup_sandbox=True,
        pass_message_bus=True,
    ))

def main():
    load_dotenv()
    api_key = os.getenv("OPENAI_API_KEY")
    if not api_key:
        print("❌ Error: OPENAI_API_KEY not found. Create a .env file.")
        sys.exit(1)
    client = OpenAI(api_key=api_key, base_url=os.getenv("OPENAI_BASE_URL"))
    if len(sys.argv) > 1:
        user_input = " ".join(sys.argv[1:])
    else:
        print("=" * 60)
        print(" 🤖 Multi-Agent Software Development System")
        print("=" * 60)
        user_input = input("\nYour idea: ").strip()
    if not user_input:
        sys.exit(0)
    output_base = os.path.join(os.path.dirname(os.path.abspath(__file__)), "output")
    # Create agents
    clarifier = ClarifierAgent(client=client)
    prd_agent = PRDAgent(client=client)
    coder = CoderAgent(client=client)
    planner = PlannerAgent(client=client, coder_agent=coder)
    # Create orchestrator, register agents, define pipeline
    orchestrator = Orchestrator(client=client, output_base=output_base)
    orchestrator.register_agent("clarifier", clarifier)
    orchestrator.register_agent("prd_agent", prd_agent)
    orchestrator.register_agent("planner", planner)
    orchestrator.register_agent("coder", coder)
    build_pipeline(orchestrator)
    try:
        result = orchestrator.run(user_input)
    except KeyboardInterrupt:
        print("\n\nInterrupted by user.")
        sys.exit(0)

if __name__ == "__main__":
    main()

Walking Through the Execution
Let's trace what happens when you run python main.py "Build a calculator app with HTML, CSS, and JavaScript":
Phase 1 (Clarifier Agent): The Clarifier produces a structured specification with Purpose, Features, Tech Stack, and File Structure. The orchestrator shows it to you and asks for approval. You type "yes" (or suggest changes). Saved as 01_specification.md.
Phase 2 (PRD Agent): Receives the approved spec. Produces a Product Requirements Document with numbered requirements (FR-1, FR-2...), file structure, and implementation order. Saved as 02_prd.md.
Phase 3 (Planner → Coder): The Planner reads the PRD and breaks it into 2 tasks. For each task, it calls assign_task_to_coder, which runs the Coder's own agentic loop:
Orchestrator
└── Planner.run(PRD)
    ├── assign_task_to_coder(task_1)
    │   └── Coder.run(task_1)
    │       ├── write_file("index.html")
    │       ├── write_file("styles.css")
    │       └── task_complete()
    ├── assign_task_to_coder(task_2)
    │   └── Coder.run(task_2)
    │       ├── read_file("index.html")   ← reads Task 1's file
    │       ├── write_file("script.js")
    │       └── task_complete()
    └── task_complete()

The Output
output/sandbox_2026-02-21_13-00-00/
    reports/
        01_specification.md   ← Clarifier's approved spec
        02_prd.md             ← PRD Agent's document
        03_task_plan.md       ← Planner's summary
        final_report.md       ← Combined final output
        message_bus_log.md    ← Every message between agents
    code/
        index.html            ← Coder's actual code
        styles.css
        script.js

The Design Principles
Separation of concerns. Each agent does one thing. The clarifier doesn't write code. The coder doesn't write specs.
Pipeline, not graph. Output flows forward. No complex negotiation between agents. Simple and debuggable.
Generic orchestrator. Want to add a code reviewer? Add a phase. Remove clarification? Remove a phase. Zero code changes:
orchestrator.register_agent("reviewer", CodeReviewAgent(client))
orchestrator.add_phase(Phase(
    name="Code Review", agent_name="reviewer",
    prompt_template="Review this code:\n\n{previous_output}",
    report_filename="04_code_review.md",
))

Sandboxed tools. The LLM writes write_file("app.js", code); the tool handles path resolution, directory creation, and security.
Agents calling agents. The Planner's assign_task_to_coder is just a wrapper around coder.run(). From the Planner's perspective, delegating to the Coder is no different from calling any other tool.
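The pattern boils down to a few lines. A self-contained sketch, with stub classes standing in for the real LLM-backed agents, of how a sub-agent's run() becomes just another tool:

```python
class CoderStub:
    """Stands in for the real Coder agent: run(task) -> result string."""
    def run(self, task: str) -> str:
        return f"completed: {task}"

class PlannerStub:
    """Holds a sub-agent and exposes it through a tool function."""
    def __init__(self, coder):
        self.coder = coder
        # In the real system this dict is the ToolRegistry; same idea.
        self.tools = {"assign_task_to_coder": self.assign_task_to_coder}

    def assign_task_to_coder(self, task_description: str) -> dict:
        # From the planner's side this is an ordinary tool call;
        # inside, it launches the sub-agent's entire loop.
        result = self.coder.run(task_description)
        return {"status": "completed", "coder_output": result}

planner = PlannerStub(CoderStub())
out = planner.tools["assign_task_to_coder"]("Create index.html")
assert out["coder_output"] == "completed: Create index.html"
```

Because the delegation lives behind an ordinary function, you could nest this arbitrarily deep; agents all the way down, with the same tool-call mechanics at every level.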
The Full Stack
| Chapter | Concept | Code |
|---|---|---|
| 1 | Chat Completion | Single API call |
| 2 | Chatbot | Messages list + streaming |
| 3 | Tool Use | JSON schemas + two-call flow |
| 4 | Chatbot + Tools | Inner loop + outer loop |
| 5 | Agent Class | Encapsulation + reusability |
| 6 | Agentic Loop | max_iterations + task_complete |
| 7 | Multi-Agent | BaseAgent + ToolRegistry + Orchestrator |
Want to extend this? Add a code reviewer agent that reads the generated code and flags issues. Add RAG so agents can search a knowledge base. Swap in Claude or Gemini; the architecture doesn't change, only the API client.
You've now built an entire multi-agent system from scratch. Every piece, from the first API call to the final orchestrator, was written by hand. You understand what's inside the box.
Thanks for reading.