There is a gap between the tools that get talked about in AI tutorials and the tools that show up in actual AI engineering job postings, production systems, and engineering blogs from companies shipping real AI products.
This is the real stack. Not every tool, not every option — the specific set that experienced AI engineers reach for in 2026, with explanations for why each one belongs and what it replaced.
Python
Why it is still the default.
Python did not win because of performance. It won because the entire AI research community writes Python, which means every library, framework, and paper implementation is Python-first. NumPy, PyTorch, Hugging Face Transformers, LangGraph, CrewAI — they all ship Python first, with other languages as secondary considerations if they ship at all.
For AI engineers, the most important Python skills are not language features — they are ecosystem skills: managing virtual environments (use uv, not pip, in 2026), using async/await for concurrent API calls, and serializing and deserializing data efficiently across service boundaries.
What it replaced: Nothing, but JavaScript alternatives (LangChain.js, Vercel AI SDK) are now genuinely viable for web-native AI applications. Python remains the default for anything touching models, data, or agents.
```bash
# Modern Python project setup — uv is the standard now
uv init my-ai-project
uv add anthropic langchain-core pgvector fastapi
```

```python
# Async-first for parallel LLM calls
import asyncio

from anthropic import AsyncAnthropic

client = AsyncAnthropic()

async def process_batch(texts: list[str]) -> list[str]:
    tasks = [
        client.messages.create(
            model="claude-haiku-4-5-20251001",
            max_tokens=256,
            messages=[{"role": "user", "content": text}],
        )
        for text in texts
    ]
    responses = await asyncio.gather(*tasks)
    return [r.content[0].text for r in responses]
```
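The serialization skill is easy to sketch with only the standard library; in production most teams reach for Pydantic (as in the FastAPI section below), but the round-trip idea is the same:

```python
import json
from dataclasses import dataclass, asdict

@dataclass
class ExtractionResult:
    name: str
    email: str
    confidence: float

def to_wire(result: ExtractionResult) -> str:
    # Serialize to JSON for transport between services
    return json.dumps(asdict(result))

def from_wire(payload: str) -> ExtractionResult:
    # Deserialize on the receiving side; raises if fields are missing
    return ExtractionResult(**json.loads(payload))

result = ExtractionResult(name="Ada", email="ada@example.com", confidence=0.97)
assert from_wire(to_wire(result)) == result
```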
The Anthropic and OpenAI APIs
The two SDKs every AI engineer has in their toolbox.
Both APIs expose LLMs as a service. In 2026, most teams pick one primary provider and treat the other as fallback or for specific use cases. Anthropic's Claude models are the default for reasoning-heavy tasks, complex tool use, and anything requiring long context. OpenAI's GPT-4o family remains strong for multimodal tasks and has a larger ecosystem of integrations.
More important than picking a provider is understanding the shared patterns. Tool calling, structured output, streaming, context management — these patterns are nearly identical across providers. Engineers who know one SDK well can pick up the other in an afternoon.
```python
# Anthropic SDK — direct usage
from anthropic import Anthropic

client = Anthropic()

# raw_text holds the unstructured input you want to extract from
response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    system="You are a precise data extraction assistant. Return JSON only.",
    messages=[{
        "role": "user",
        "content": f"Extract name, email, and company from: {raw_text}",
    }],
)
```

```python
# OpenAI SDK — same pattern, different client
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "..."}],
)
```
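The "same pattern, different surface" point is concrete in tool calling: both providers accept a JSON Schema tool definition, just nested differently. A hedged sketch of the translation (the shapes below reflect the public docs at the time of writing; verify against your SDK versions):

```python
def anthropic_tool_to_openai(tool: dict) -> dict:
    """Convert an Anthropic-style tool definition to OpenAI's function format.

    Anthropic: {"name", "description", "input_schema"}
    OpenAI:    {"type": "function", "function": {"name", "description", "parameters"}}
    """
    return {
        "type": "function",
        "function": {
            "name": tool["name"],
            "description": tool.get("description", ""),
            "parameters": tool["input_schema"],
        },
    }

weather_tool = {
    "name": "get_weather",
    "description": "Get current weather for a city",
    "input_schema": {
        "type": "object",
        "properties": {"city": {"type": "string"}},
        "required": ["city"],
    },
}

openai_tool = anthropic_tool_to_openai(weather_tool)
assert openai_tool["function"]["parameters"]["required"] == ["city"]
```

The JSON Schema in the middle is identical; only the envelope changes. That is why switching providers is an afternoon, not a rewrite.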
What it replaced: Custom ML model training for most NLP tasks. Classification, entity extraction, summarization, question answering — tasks that previously required a fine-tuned BERT model now call an API. The API is faster to ship, easier to maintain, and often more accurate.
LangGraph
The framework for stateful agents.
LangGraph is where the field has landed for production agent orchestration. It models agent workflows as directed graphs — nodes are functions, edges are transitions between steps. This makes complex multi-step agent logic composable, debuggable, and testable in a way that "while True: ask the LLM what to do" loops are not.
The key innovation: persistent state. LangGraph graphs carry state across nodes. When a node fails, you can resume from that node without replaying the entire workflow. When you need to add a human-in-the-loop step, you pause the graph and resume it after human approval. This is infrastructure-level thinking applied to agent design.
```python
from typing import TypedDict

from langgraph.graph import StateGraph, END

class AgentState(TypedDict):
    messages: list
    tool_results: list
    final_answer: str

def researcher(state: AgentState) -> AgentState:
    results = ...  # call the LLM to decide what to research, run the tools
    return {**state, "tool_results": state["tool_results"] + results}

def synthesizer(state: AgentState) -> AgentState:
    answer = ...  # combine research into a final answer
    return {**state, "final_answer": answer}

def should_continue(state: AgentState) -> str:
    # Route on state the research node actually produces
    if len(state["tool_results"]) >= 3:
        return "synthesize"
    return "research"

# Build the graph
graph = StateGraph(AgentState)
graph.add_node("research", researcher)
graph.add_node("synthesize", synthesizer)
graph.add_conditional_edges("research", should_continue, {
    "research": "research",
    "synthesize": "synthesize",
})
graph.add_edge("synthesize", END)
graph.set_entry_point("research")

agent = graph.compile()
```
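The node-and-conditional-edge model is worth internalizing independently of the library. A toy runner — not LangGraph, just the control-flow idea in plain Python — makes the loop explicit:

```python
from typing import Callable

def run_graph(nodes: dict[str, Callable], edges: dict[str, Callable],
              entry: str, state: dict, max_steps: int = 20) -> dict:
    """Walk a graph of state-transforming nodes until a router returns 'END'."""
    current = entry
    for _ in range(max_steps):
        state = nodes[current](state)
        # Conditional edge: a router function inspects state and picks the next node
        nxt = edges[current](state)
        if nxt == "END":
            return state
        current = nxt
    raise RuntimeError("graph did not terminate")

# A two-node research/synthesize loop, mirroring the shape above
nodes = {
    "research": lambda s: {**s, "tool_results": s["tool_results"] + ["fact"]},
    "synthesize": lambda s: {**s, "final_answer": " ".join(s["tool_results"])},
}
edges = {
    "research": lambda s: "synthesize" if len(s["tool_results"]) >= 3 else "research",
    "synthesize": lambda s: "END",
}
final = run_graph(nodes, edges, "research", {"tool_results": [], "final_answer": ""})
assert final["final_answer"] == "fact fact fact"
```

What LangGraph adds on top of this skeleton is the part you should not hand-roll: checkpointed state, resumption after failure, and human-in-the-loop pauses.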
What it replaced: Hand-rolled agent loops. Most teams that shipped agents in 2023-2024 using raw while loops and manual state dictionaries are migrating to LangGraph for the debuggability and persistence.
pgvector
Postgres with vector search built in.
pgvector is a Postgres extension that adds a vector data type and approximate nearest-neighbor search to your existing Postgres database. It is the default choice for adding semantic search and RAG capabilities to applications that already run on Postgres.
The alternative is a dedicated vector database — Pinecone, Weaviate, Qdrant. These are faster at large scale and have more sophisticated indexing options. But for most applications, pgvector inside your existing Postgres instance is simpler to operate and "good enough" up to tens of millions of vectors.
```sql
-- Enable the extension
CREATE EXTENSION IF NOT EXISTS vector;

-- Add a vector column to an existing table
ALTER TABLE documents ADD COLUMN embedding vector(1536);

-- Create an index for fast approximate search
CREATE INDEX ON documents USING ivfflat (embedding vector_cosine_ops)
WITH (lists = 100);

-- Semantic search
SELECT content, 1 - (embedding <=> $1) AS similarity
FROM documents
ORDER BY embedding <=> $1
LIMIT 5;
```
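The `<=>` operator is pgvector's cosine-distance operator, which is why the query computes `1 - (embedding <=> $1)` to recover similarity. The math in plain Python:

```python
import math

def cosine_distance(a: list[float], b: list[float]) -> float:
    # What pgvector's <=> operator computes: 1 - cosine similarity
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return 1 - dot / (norm_a * norm_b)

# Same direction: distance 0; orthogonal: distance 1
assert abs(cosine_distance([1.0, 0.0], [2.0, 0.0])) < 1e-9
assert abs(cosine_distance([1.0, 0.0], [0.0, 1.0]) - 1.0) < 1e-9
```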
```python
# Python insertion
import psycopg
from pgvector.psycopg import register_vector

def get_embedding(text: str) -> list[float]:
    # Use a dedicated embedding model — text-embedding-3-small or similar.
    # The vector(1536) column dimension must match the model's output size.
    ...

def index_document(conn, content: str):
    register_vector(conn)  # teach psycopg to adapt the vector type
    embedding = get_embedding(content)
    conn.execute(
        "INSERT INTO documents (content, embedding) VALUES (%s, %s)",
        (content, embedding),
    )
```
What it replaced: Dedicated vector databases for most use cases. Also replaced basic keyword search (LIKE, full-text search) for semantic retrieval scenarios.
FastAPI
The API layer for AI services.
FastAPI is the default web framework for Python AI services. Its async-first design handles concurrent LLM API calls efficiently without blocking. Automatic OpenAPI docs reduce integration friction with frontend teams. Pydantic models enforce clean request/response contracts and make structured output validation straightforward.
```python
from anthropic import AsyncAnthropic
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()
client = AsyncAnthropic()

class ChatRequest(BaseModel):
    message: str
    session_id: str

class ChatResponse(BaseModel):
    reply: str
    tokens_used: int

@app.post("/chat", response_model=ChatResponse)
async def chat(request: ChatRequest):
    response = await client.messages.create(
        model="claude-haiku-4-5-20251001",
        max_tokens=512,
        messages=[{"role": "user", "content": request.message}],
    )
    return ChatResponse(
        reply=response.content[0].text,
        tokens_used=response.usage.output_tokens,
    )
```
```python
# Streaming endpoint
from fastapi.responses import StreamingResponse

@app.post("/chat/stream")
async def chat_stream(request: ChatRequest):
    async def generate():
        async with client.messages.stream(
            model="claude-haiku-4-5-20251001",
            max_tokens=512,
            messages=[{"role": "user", "content": request.message}],
        ) as stream:
            async for text in stream.text_stream:
                # JSON-encode in production: raw newlines in `text` break SSE framing
                yield f"data: {text}\n\n"
    return StreamingResponse(generate(), media_type="text/event-stream")
```
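The `data:` framing above is the Server-Sent Events wire format. A minimal client-side parser, assuming each event is a single `data:` line followed by a blank line:

```python
def parse_sse(raw: str) -> list[str]:
    """Extract the data payloads from an SSE stream already decoded to text."""
    chunks = []
    for event in raw.split("\n\n"):
        for line in event.splitlines():
            if line.startswith("data: "):
                chunks.append(line[len("data: "):])
    return chunks

assert parse_sse("data: Hel\n\ndata: lo\n\n") == ["Hel", "lo"]
```

Real clients use `EventSource` in the browser or an SSE library; the sketch just shows why the blank-line delimiter matters.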
What it replaced: Flask for new AI services. FastAPI's async support is not optional when you are making concurrent LLM calls — blocking one request blocks all requests in a synchronous framework.
Redis (via Upstash)
Session state, rate limiting, and caching.
Redis appears in most AI applications for three distinct purposes.
Rate limiting: LLM API calls are expensive. Redis counters enforce per-user, per-minute, per-day limits using atomic increment operations. Upstash is the serverless Redis option that works well with edge functions and serverless deployments.
Session state: Agent conversations need persistent context across requests. Storing conversation history in Redis (with a TTL matching session length) keeps state out of your application servers and makes horizontal scaling trivial.
Response caching: For deterministic or near-deterministic queries, caching LLM responses reduces cost and latency. A semantic cache (cache by embedding similarity, not exact string match) handles the non-exact-match nature of LLM queries.
```python
import json
from datetime import datetime

from upstash_redis import Redis

# Upstash Redis — serverless, HTTP-based
cache = Redis.from_env()

def rate_limit(user_id: str, limit: int = 50) -> bool:
    # Fixed-window counter keyed by user and hour
    key = f"rl:{user_id}:{datetime.now().strftime('%Y%m%d%H')}"
    count = cache.incr(key)
    if count == 1:
        cache.expire(key, 3600)  # first hit in the window sets the TTL
    return count <= limit

def get_session_history(session_id: str) -> list:
    data = cache.get(f"session:{session_id}")
    return json.loads(data) if data else []

def save_session_history(session_id: str, history: list):
    cache.setex(f"session:{session_id}", 3600, json.dumps(history))
```
What it replaced: Database-backed session storage for real-time applications (too slow), and in-memory state in application servers (does not survive restarts or horizontal scaling).
Docker
The deployment standard.
AI applications have notoriously messy dependency trees — CUDA libraries, Python version constraints, model files, system packages. Docker solves the "works on my machine" problem by packaging your application and all its dependencies into a portable container.
For AI engineers specifically, Docker matters because model inference has strict environment requirements, teammates run different operating systems, and production environments (Kubernetes, Cloud Run, ECS) expect container images.
```dockerfile
FROM python:3.12-slim

WORKDIR /app

# Install uv for faster dependency installation
RUN pip install uv

# Copy dependency files first so this layer caches across code changes
COPY pyproject.toml uv.lock ./
RUN uv sync --frozen --no-dev

# Copy application code
COPY . .

# Run with uvicorn
CMD ["uv", "run", "uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8000"]
```
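One detail that bites AI builds specifically: model weights and local virtualenvs in the build context make images huge and builds slow. A `.dockerignore` to pair with a Dockerfile like the one above (entries are illustrative; adjust to your repository):

```
.venv/
__pycache__/
*.pt
*.safetensors
.env
.git/
```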
What it replaced: Manual server configuration, virtualenv gymnastics, and "but it works on my machine" deployment debugging.
How the Stack Fits Together
A typical AI service in production looks like this:
- FastAPI receives the request from the frontend or another service
- Redis checks the rate limit and loads conversation history
- pgvector retrieves relevant context via semantic search (RAG)
- LangGraph orchestrates the agent loop, calling the Anthropic API at each reasoning step
- FastAPI streams the response back to the client via SSE
- Redis saves the updated conversation history
- Docker packages all of this for deployment
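The steps above can be sketched as one request path with each component stubbed out behind a function boundary — the "data contracts, not framework magic" point made executable (all names here are illustrative):

```python
def handle_chat(user_id: str, session_id: str, message: str,
                rate_limit, load_history, retrieve, run_agent, save_history) -> str:
    """One request through the stack; each dependency stands in for a real service."""
    if not rate_limit(user_id):                        # Redis
        raise PermissionError("rate limited")
    history = load_history(session_id)                 # Redis
    context = retrieve(message)                        # pgvector
    reply = run_agent(history, context, message)       # LangGraph + LLM API
    save_history(session_id, history + [message, reply])  # Redis
    return reply

# Wiring it up with trivial stand-ins
store: dict[str, list] = {}
reply = handle_chat(
    "u1", "s1", "hello",
    rate_limit=lambda u: True,
    load_history=lambda s: store.get(s, []),
    retrieve=lambda m: ["doc"],
    run_agent=lambda h, c, m: f"answered with {c[0]}",
    save_history=lambda s, h: store.__setitem__(s, h),
)
assert reply == "answered with doc"
assert store["s1"] == ["hello", "answered with doc"]
```

Because each dependency is injected, any one of them can be swapped (a different cache, a different retriever) without touching the rest — the same property the real stack has at the service level.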
Each component has a clear responsibility. The boundaries are defined by data contracts, not framework magic. And because each piece is independently deployable, you can swap one out without touching the others.
This is the stack that ships. Learn it in the context of building real things, and the pieces will click into place.
Build It from the Ground Up
Understanding how each tool works individually is the prerequisite to understanding how they work together. Phase 2 of the Agentic AI course at MindloomHQ covers LLMs and the API layer in depth — how models work, how to design prompts for production, how to handle errors and rate limits, and how to build the API layer around it.
Phases 0 and 1 are completely free. No credit card required.