Frameworks like LangChain and CrewAI are useful, but they hide the one thing every AI engineer should actually understand: how an agent works when you strip away the decorators. When you build AI agents from scratch, the patterns become obvious, debugging becomes possible, and framework choices become informed rather than cargo-culted.
This guide walks through the full anatomy of an agent in plain Python. No @tool decorators, no runnables, no chains. Just an LLM call, a loop, and the patterns you need to know to make it production-worthy.
What you'll learn
- The real difference between an agent and a chatbot
- The four components every agent needs (and what each one buys you)
- How to write the agent loop yourself in under 100 lines
- How to add tool use, state, and planning without a framework
- What changes the moment you put an agent in production
What Makes an Agent Different From a Chatbot
A chatbot is a function: (message, history) -> reply. It returns once per turn.
An agent is a loop. It can decide to call a tool, read the result, decide to call another tool, verify its own output, and keep going until it has finished the task — or given up. The user sends one message; the agent may take ten internal turns before replying.
The difference that matters in code: a chatbot call produces text. An agent call produces either text (done) or a request to use a tool (not done yet). The agent runtime has to recognize which and respond accordingly.
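A sketch of that distinction as types (the names here are illustrative, not a real SDK):

```python
from dataclasses import dataclass

@dataclass
class FinalAnswer:      # the model is done; surface this to the user
    text: str

@dataclass
class ToolRequest:      # the model wants to act before answering
    name: str
    arguments: dict

# A chatbot step always returns text. An agent step returns one of
# these two, and the runtime keeps looping until it sees a FinalAnswer.
AgentStep = FinalAnswer | ToolRequest
```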
Everything else — memory, planning, multi-agent orchestration — is decoration on that core idea. We explored this from a different angle in our post on AI agents vs AI assistants.
Core Components: LLM, Tools, Memory, Planning
Every agent you will ever build is some combination of these four:
1. LLM — the brain. This is where reasoning happens. The LLM receives the current state and decides the next action. Pick a model with good tool-use training (Claude Sonnet or Opus, GPT-4 class, Gemini 2.5).
2. Tools — the hands. Functions the agent is allowed to call. Each one has a name, a description, and an input schema. Good tools are specific (search_customer_by_email) rather than generic (run_sql).
3. Memory — the notebook. Short-term memory is the message list for this run. Long-term memory is whatever persists across runs — a vector store, a database, a file. Most agents only need short-term memory; reach for long-term only when you can point to a concrete reason.
4. Planning — the outline. Sometimes the agent thinks step-by-step inside a single response; sometimes it writes an explicit plan and then executes it. Planning is not always necessary — simple agents skip it — but it becomes critical as tasks get longer.
You can build a useful agent with just the first two. Memory and planning are optimizations you add when the basic loop runs out of capacity.
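When a task does need a plan, the cheapest version is a prompt convention, not new machinery. A minimal plan-then-execute system prompt (the wording is illustrative):

```python
PLANNING_SYSTEM = """You are a careful task executor.
Before acting, write a numbered plan of at most five steps.
Execute the plan one step at a time, using tools when helpful.
After each step, say whether the plan still holds or needs revising."""

# Drop-in replacement for the SYSTEM prompt in the agent loop below;
# the loop itself does not change.
```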
Building the Agent Loop (With Code)
Let's build the skeleton. This is the minimum viable agent.
```python
from anthropic import Anthropic

client = Anthropic()
MODEL = "claude-sonnet-4-5"

SYSTEM = """You are a careful task executor.
Use tools when helpful. Stop when the task is done."""

def agent_loop(task: str, tools: list, tool_impls: dict, max_steps: int = 10):
    messages = [{"role": "user", "content": task}]
    for step in range(max_steps):
        resp = client.messages.create(
            model=MODEL,
            system=SYSTEM,
            max_tokens=2048,
            tools=tools,
            messages=messages,
        )
        # Agent is done: model returned a final answer, no tool call
        if resp.stop_reason == "end_turn":
            return "".join(b.text for b in resp.content if b.type == "text")
        # Otherwise: model asked for tools. Run them, feed results back.
        messages.append({"role": "assistant", "content": resp.content})
        results = []
        for block in resp.content:
            if block.type == "tool_use":
                try:
                    out = tool_impls[block.name](**block.input)
                except Exception as e:
                    out = f"ERROR: {type(e).__name__}: {e}"
                results.append({
                    "type": "tool_result",
                    "tool_use_id": block.id,
                    "content": str(out),
                })
        messages.append({"role": "user", "content": results})
    raise RuntimeError(f"Exceeded {max_steps} steps without completion")
```
This is the whole runtime. Everything else is tools and prompts.
A couple of things worth noticing:
stop_reason == "end_turn"is how you know the model is finished. If it wants another tool, the stop reason will betool_use.- Tool errors get fed back as strings, not raised. The model can often recover if it sees "ERROR: FileNotFoundError: /tmp/missing.txt" — it cannot recover if your runtime crashes.
Adding Tool Use (Claude API Example)
Tools are just functions plus a JSON-schema description. Here are two:
```python
import httpx
from urllib.parse import urlparse

TOOLS = [
    {
        "name": "http_get",
        "description": "Fetch a URL and return the text body (first 4000 chars).",
        "input_schema": {
            "type": "object",
            "properties": {"url": {"type": "string"}},
            "required": ["url"],
        },
    },
    {
        "name": "extract_domain",
        "description": "Return the domain portion of a URL.",
        "input_schema": {
            "type": "object",
            "properties": {"url": {"type": "string"}},
            "required": ["url"],
        },
    },
]

def http_get(url: str) -> str:
    r = httpx.get(url, timeout=10, follow_redirects=True)
    r.raise_for_status()
    return r.text[:4000]

def extract_domain(url: str) -> str:
    return urlparse(url).netloc

TOOL_IMPLS = {"http_get": http_get, "extract_domain": extract_domain}

answer = agent_loop(
    "Fetch https://example.com and tell me which domain served it.",
    TOOLS, TOOL_IMPLS,
)
print(answer)
```
That is a working research mini-agent. It can fetch pages, parse URLs, and compose the two. Want web search? Add a tool. Want file reads? Add a tool. The loop does not change.
The Anthropic docs on tool use cover the request/response schema in detail.
State Management and Memory
The messages list above is state. That is enough until:
- Your conversation history gets too long to fit in the context window
- You want the agent to remember things across sessions
- Multiple agents need to share knowledge
Handling long contexts. When history grows past what you can afford to send every turn, summarize older turns into a running memo and drop the raw turns. The summary sits in the system prompt; the recent turns stay verbatim. This keeps recency sharp and ancient context compressed.
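A minimal compaction sketch, reusing the `client` and `MODEL` from the loop above (the 10-turn threshold is illustrative; tune it to your context budget):

```python
import json

def compact(messages: list, keep_recent: int = 10):
    """Fold all but the last `keep_recent` turns into a short memo."""
    if len(messages) <= keep_recent:
        return None, messages
    old, recent = messages[:-keep_recent], messages[-keep_recent:]
    resp = client.messages.create(
        model=MODEL,
        max_tokens=512,
        messages=[{
            "role": "user",
            "content": "Summarize this transcript as a memo of facts, "
                       "decisions, and open questions:\n\n"
                       + json.dumps(old, default=str),
        }],
    )
    memo = "".join(b.text for b in resp.content if b.type == "text")
    return memo, recent

# Next turn: send system=SYSTEM + "\n\nMemo of earlier turns:\n" + memo
# with messages=recent.
```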
Persistent memory. Expose read/write as tools:
```python
import json, pathlib

STORE = pathlib.Path("memory.json")

def _load():
    return json.loads(STORE.read_text()) if STORE.exists() else {}

def _save(d):
    STORE.write_text(json.dumps(d, indent=2))

def remember(key: str, value: str):
    d = _load()
    d[key] = value
    _save(d)
    return f"stored {key}"

def recall(key: str):
    return _load().get(key, "not found")
```
Now the agent can store and retrieve facts across runs. Keep the schema flat — models are worse at deeply nested JSON than developers expect.
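To let the loop call these, register them like any other tool, following the same schema pattern as `TOOLS` above:

```python
MEMORY_TOOLS = [
    {
        "name": "remember",
        "description": "Store a fact under a short key, persisted across runs.",
        "input_schema": {
            "type": "object",
            "properties": {"key": {"type": "string"}, "value": {"type": "string"}},
            "required": ["key", "value"],
        },
    },
    {
        "name": "recall",
        "description": "Retrieve a previously stored fact by key.",
        "input_schema": {
            "type": "object",
            "properties": {"key": {"type": "string"}},
            "required": ["key"],
        },
    },
]

TOOL_IMPLS.update({"remember": remember, "recall": recall})
```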
Semantic retrieval. When the "fact store" gets large, key-value lookups break down and you need RAG. See our RAG vs fine-tuning guide for when that transition makes sense.
Production Considerations
A working agent in a notebook is one thing. A production agent is a different object. The list of things you have to think about grows fast.
Timeouts and budgets. Every tool call needs a timeout. Every agent run needs a step cap. Every user request needs a token budget. Without these, a single confused run can cost dollars.
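A sketch of budget enforcement (the numbers are illustrative; set them per use case):

```python
import time

MAX_TOKENS_PER_RUN = 50_000
MAX_SECONDS_PER_RUN = 120

def check_budget(tokens_used: int, started_at: float) -> None:
    """Kill the run before a confused agent burns real money."""
    if tokens_used > MAX_TOKENS_PER_RUN:
        raise RuntimeError(f"Token budget exceeded: {tokens_used}")
    if time.monotonic() - started_at > MAX_SECONDS_PER_RUN:
        raise RuntimeError("Wall-clock budget exceeded")

# Inside agent_loop, after each API call:
#   tokens_used += resp.usage.input_tokens + resp.usage.output_tokens
#   check_budget(tokens_used, started_at)
```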
Idempotency. If the agent retries a step, will it double-charge a customer? Double-send an email? Build tools so retries are safe — use deterministic IDs, check before write, or explicitly mark non-retryable actions.
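One way to get deterministic IDs (`charge_customer` is a hypothetical tool, shown only for shape):

```python
import hashlib, json

def idempotency_key(action: str, payload: dict) -> str:
    """Same action + same arguments -> same key, so the downstream
    service can deduplicate a retried step."""
    raw = action + json.dumps(payload, sort_keys=True)
    return hashlib.sha256(raw.encode()).hexdigest()

def charge_customer(customer_id: str, amount_cents: int) -> str:
    key = idempotency_key("charge", {"id": customer_id, "cents": amount_cents})
    # Pass `key` as the payment provider's idempotency token so a retried
    # tool call cannot double-charge. (Provider call omitted.)
    ...
```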
Observability. Log every tool call, every tool result, every model response. You cannot debug an agent without the full trace. At minimum: a unique trace_id per run, and the ability to replay any run from logs. LangSmith, Phoenix, and Langfuse are good off-the-shelf options; simple JSON logs work too.
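If you go the simple-JSON-logs route, one event per line is enough to replay a run:

```python
import json, time, uuid

def log_event(trace_id: str, kind: str, payload: dict) -> None:
    """One JSON line per event; filter on trace_id to replay a run."""
    print(json.dumps({
        "trace_id": trace_id,
        "ts": time.time(),
        "kind": kind,  # e.g. "model_response", "tool_call", "tool_result"
        "payload": payload,
    }, default=str))

# At the top of agent_loop: trace_id = str(uuid.uuid4())
# Before each tool runs:
#   log_event(trace_id, "tool_call", {"name": block.name, "input": block.input})
```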
Evaluation. How do you know an agent is working? Build a fixed suite of tasks with known good outcomes and score runs against them. Check both final answer and tool-call correctness. The moment "it felt right" becomes your quality metric, you have stopped doing engineering. Our how to evaluate AI agents post goes deeper.
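A minimal harness, reusing the mini-agent from earlier (substring matching is a stand-in for whatever scoring your tasks need):

```python
EVAL_SUITE = [
    {"task": "Fetch https://example.com and tell me which domain served it.",
     "expect": "example.com"},
    # ...more fixed tasks with known-good outcomes
]

def run_evals() -> None:
    passed = 0
    for case in EVAL_SUITE:
        try:
            answer = agent_loop(case["task"], TOOLS, TOOL_IMPLS)
            passed += case["expect"] in answer
        except RuntimeError:
            pass  # hitting the step cap counts as a failure
    print(f"{passed}/{len(EVAL_SUITE)} passed")
```

Scoring tool-call correctness as well requires the loop to return its trace alongside the answer, which is a small change to `agent_loop`.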
Safety rails. Tool permissions should be minimum-viable. An agent that can read your database does not need to write to it. An agent that can write to staging does not need production credentials. Treat LLM output like user input: never trust, always validate before passing to a privileged sink.
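For example, wrapping the earlier `http_get` behind an allowlist (the domains are illustrative):

```python
from urllib.parse import urlparse

ALLOWED_DOMAINS = {"example.com", "docs.anthropic.com"}

def safe_http_get(url: str) -> str:
    """The model chose this URL, so treat it as untrusted input."""
    if urlparse(url).netloc not in ALLOWED_DOMAINS:
        return f"ERROR: domain not in allowlist: {url}"
    return http_get(url)
```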
Cost. Agents burn tokens. Cache system prompts (Anthropic's prompt caching gives you up to 90% savings on repeated context), use smaller models for simple decisions, and measure cost per task as religiously as you measure latency.
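With the Anthropic SDK, caching the static prefix is one field on the request; note that caching only activates past a minimum prefix length, so it pays off for long system prompts and large tool lists:

```python
resp = client.messages.create(
    model=MODEL,
    max_tokens=2048,
    system=[{
        "type": "text",
        "text": SYSTEM,
        "cache_control": {"type": "ephemeral"},  # cache everything up to here
    }],
    tools=TOOLS,
    messages=messages,
)
```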
Conclusion
An agent is a loop around an LLM with tools. That's the whole idea. Everything else — memory, planning, evaluation, observability — is engineering on top of that primitive. When you build AI agents from scratch once, every framework makes sense forever after.
Build a minimal agent this week. Pick a task, write two tools, wire up the loop, ship it. Then add one thing at a time: error handling, memory, a second agent. This is the fastest way to go from "I read a tutorial" to "I can build this at work."
When you want the structured path with feedback, the MindloomHQ curriculum walks through the full agentic stack; the Agentic AI Development course is the flagship.
FAQ
Do I need a framework to build a real agent?
No. Whether a framework earns its keep depends on the use case. For a single agent with a handful of tools, the raw SDK is often cleaner. For multi-agent orchestration, streaming to a UI, or heavy observability, a framework pays off. Start without one so you understand what the framework is doing for you.
How many tools can an agent handle?
Modern models stay sharp up to roughly 10–15 tools. Past that, tool selection degrades. The fix is either better tool descriptions or a router pattern — a top-level agent that dispatches to specialist sub-agents with smaller tool sets each.
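The router itself can be one small model call. A sketch, where the specialist tool sets are hypothetical placeholders:

```python
SPECIALISTS = {
    "research": (RESEARCH_TOOLS, RESEARCH_IMPLS),  # placeholders, not defined here
    "billing": (BILLING_TOOLS, BILLING_IMPLS),
}

def route_and_run(task: str) -> str:
    resp = client.messages.create(
        model=MODEL,
        max_tokens=16,
        messages=[{
            "role": "user",
            "content": f"Pick one specialist from {sorted(SPECIALISTS)} "
                       f"for this task. Answer with the name only.\n\n{task}",
        }],
    )
    name = resp.content[0].text.strip().lower()
    tools, impls = SPECIALISTS.get(name, SPECIALISTS["research"])
    return agent_loop(task, tools, impls)
```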
What's the difference between an agent loop and a ReAct loop?
ReAct (Reason-Act-Observe) was an early pattern where the model was prompted to output an explicit reason before each action. Modern tool-use APIs make this implicit — the model reasons in its own thinking, then emits a tool call. Same idea, cleaner interface.
How do I stop an agent from looping forever?
Always set max_steps. Also track tool-call repetition — if the agent calls the same tool with the same arguments twice in a row, that's a good signal it's stuck. Fail loudly; infinite loops silently eat money.
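Repetition detection is a few lines inside the tool-handling loop (`seen_calls` starts as an empty list per run):

```python
import json

def is_stuck(seen_calls: list, block) -> bool:
    """True if the model just repeated its previous tool call verbatim."""
    sig = (block.name, json.dumps(block.input, sort_keys=True))
    stuck = bool(seen_calls) and seen_calls[-1] == sig
    seen_calls.append(sig)
    return stuck

# Inside the tool-handling loop:
#   if is_stuck(seen_calls, block):
#       raise RuntimeError(f"Agent stuck repeating {block.name}")
```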
What should I monitor in production?
At minimum: steps per run, tokens per run, cost per run, tool error rate, and an evaluation score on a fixed test set. Alert on sudden changes in any of these — they usually mean a prompt regression or a tool breakage.