If you have built a RAG system or read about semantic search, you have hit the term "vector database." The explanations are usually either too abstract ("stores high-dimensional vectors") or too shallow to be useful.
This post explains what vector databases actually are, why you need them, how they work under the hood, and when you genuinely need one versus when you are overengineering.
What Are Embeddings?
Before understanding vector databases, you need to understand what goes into them.
An embedding is a list of floating-point numbers that represents the meaning of a piece of text. When you pass a sentence through an embedding model, it outputs something like [0.23, -0.71, 0.08, ..., 0.44] — typically 768 to 3072 numbers, depending on the model.
The key property: semantically similar text produces numerically similar vectors. "The cat sat on the mat" and "A feline rested on the rug" produce vectors that are close to each other in this high-dimensional space. "The stock market crashed" produces a vector that is far away from both.
# Using a dedicated embedding model (e.g., Voyage AI, OpenAI, Cohere, or a local model)
# This example uses the voyage-3 model via the voyageai Python client
import voyageai

vo = voyageai.Client()

texts = [
    "How do I reset my password?",
    "I forgot my login credentials",
    "The quarterly earnings report is ready",
]

result = vo.embed(texts, model="voyage-3", input_type="document")
embeddings = result.embeddings
# embeddings[0] and embeddings[1] are numerically close
# embeddings[2] is far from both
This is what makes semantic search possible. Instead of matching keywords, you match meaning.
Why Traditional Databases Fail for Semantic Search
A relational database stores rows and columns. When you query SELECT * FROM docs WHERE content LIKE '%password reset%', it scans for an exact substring match. B-tree indexes make exact and prefix lookups fast (a leading-wildcard LIKE like this one forces a full scan anyway), but either way the database is comparing characters, not meaning.
Semantic search is fundamentally different. You need to find the 10 documents whose embeddings are numerically closest to the query embedding. That requires comparing the query vector against every stored vector and computing a similarity score.
For 1,000 documents, a brute-force scan in Python works fine:
import numpy as np

def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def search(query_embedding, doc_embeddings, top_k=5):
    scores = [cosine_similarity(query_embedding, doc) for doc in doc_embeddings]
    ranked = sorted(enumerate(scores), key=lambda x: x[1], reverse=True)
    return ranked[:top_k]
For 100,000 documents, brute-force is still manageable. For 10 million documents, comparing a 1,536-dimension vector against 10 million other vectors on every query is too slow for production — even with NumPy.
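Even fully vectorized, brute force stays linear in corpus size: every query is one dense matrix-vector product over all n stored embeddings. A minimal NumPy sketch of the vectorized version (the function name is illustrative):

```python
import numpy as np

def search_vectorized(query_embedding, doc_matrix, top_k=5):
    # Normalize rows once; cosine similarity then reduces to a dot product.
    docs = doc_matrix / np.linalg.norm(doc_matrix, axis=1, keepdims=True)
    q = query_embedding / np.linalg.norm(query_embedding)
    scores = docs @ q  # one O(n * d) matrix-vector product per query
    top = np.argsort(scores)[::-1][:top_k]
    return [(int(i), float(scores[i])) for i in top]
```

In practice you would normalize and cache the document matrix once at startup, but the per-query cost still grows with the number of documents, which is exactly what an index is for.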
A traditional database with a B-tree index cannot help here because there is no natural ordering for high-dimensional vector space that makes nearest-neighbor lookups cheap. You cannot say "vectors close to X start around row 4,280" the way you can say "names starting with 'S' start at a certain offset."
How Vector Databases Work: HNSW
Most production vector databases use an index algorithm called HNSW (Hierarchical Navigable Small World). Understanding it at a high level helps you reason about performance and recall trade-offs.
HNSW builds a multi-layer graph at index time:
- The bottom layer contains every vector, connected to its nearest neighbors
- Upper layers are sparser, with long-range connections that let you navigate the graph quickly
- The top layer has very few nodes but lets you jump to the approximate region of your query very fast
At query time, you enter at the top layer, greedily move toward nodes closest to your query, then descend through each layer, narrowing down. The result is approximate nearest-neighbor search: not guaranteed to find the exact top-K, but finds very good candidates much faster than brute force.
The key trade-off is recall vs. speed. You can tune the ef_search parameter (how many candidates to examine) — higher ef_search means higher recall but slower queries.
For most RAG applications, 95%+ recall is fine. The user asking "what's our return policy?" doesn't need the mathematically perfect nearest neighbors — any high-quality semantically relevant chunk works.
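To make the descent concrete, here is a simplified, single-layer sketch of the greedy candidate search HNSW runs within each layer. Real implementations add the layer hierarchy, edge-selection heuristics, and tuned data structures; greedy_search, the neighbors adjacency list, and plain Euclidean distance are illustrative choices, not any library's API:

```python
import heapq
import numpy as np

def greedy_search(vectors, neighbors, query, entry, ef=10, top_k=5):
    """Best-first search over a prebuilt neighbor graph (one HNSW layer).

    vectors:   (n, d) array of stored embeddings
    neighbors: adjacency list -- neighbors[i] holds the graph edges of vector i
    ef:        candidate-set size; higher ef means better recall, slower query
    """
    def dist(i):
        return float(np.linalg.norm(vectors[i] - query))

    visited = {entry}
    candidates = [(dist(entry), entry)]   # min-heap: closest unexplored first
    best = [(-dist(entry), entry)]        # max-heap of the ef closest seen so far
    while candidates:
        d, node = heapq.heappop(candidates)
        # Stop when the closest unexplored candidate cannot improve the result.
        if len(best) >= ef and d > -best[0][0]:
            break
        for nb in neighbors[node]:
            if nb in visited:
                continue
            visited.add(nb)
            d_nb = dist(nb)
            if len(best) < ef or d_nb < -best[0][0]:
                heapq.heappush(candidates, (d_nb, nb))
                heapq.heappush(best, (-d_nb, nb))
                if len(best) > ef:
                    heapq.heappop(best)  # drop the farthest of the ef kept
    # Return (distance, id) pairs, closest first
    return sorted((-d, i) for d, i in best)[:top_k]
```

The ef argument here plays the same role as the ef_search parameter above: set it to the corpus size and the search degenerates to exhaustive scanning; shrink it and you trade recall for speed.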
pgvector vs. Pinecone vs. Weaviate
Here is a practical comparison for the most common choices in 2026:
| | pgvector | Pinecone | Weaviate |
|---|---|---|---|
| Type | PostgreSQL extension | Managed cloud | Open-source / cloud |
| Setup | Add extension to existing Postgres | SaaS, no infra | Self-host or cloud |
| Scaling | Manual (Postgres limits) | Automatic | Configurable |
| Hybrid search | With full-text search combo | Yes (built-in) | Yes (BM25 + vector) |
| Cost at scale | Cheap (your Postgres instance) | Expensive at high volume | Middle ground |
| Best for | Existing Postgres users, < 1M vectors | Managed, high scale, fast setup | Need hybrid search or self-hosted |
| Metadata filtering | SQL WHERE clauses | Filter syntax | GraphQL |
pgvector is the pragmatic choice for most applications starting out. If you are already on Postgres (Supabase, RDS, Neon), you can add vector search without any new infrastructure. The query latency is slightly higher than dedicated vector DBs, but for typical RAG workloads with under a million vectors, this does not matter in practice.
-- pgvector: create a table with an embedding column
CREATE TABLE documents (
    id SERIAL PRIMARY KEY,
    content TEXT,
    embedding VECTOR(1536)
);

-- Create HNSW index for fast approximate search
CREATE INDEX ON documents USING hnsw (embedding vector_cosine_ops);

-- Query: find 5 most similar documents to a query embedding
SELECT id, content, 1 - (embedding <=> '[0.23, -0.71, ...]'::vector) AS similarity
FROM documents
ORDER BY embedding <=> '[0.23, -0.71, ...]'::vector
LIMIT 5;
Pinecone is the fastest way to production if you do not want to manage infrastructure and expect to scale to tens of millions of vectors. The managed service handles replication, scaling, and backups. Pricing becomes significant at high query volumes.
Weaviate is the best choice if you need hybrid search (combining vector similarity with keyword matching) or want self-hosted control. Hybrid search matters for domains with specialized terminology where pure semantic search under-performs — legal, medical, financial.
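Hybrid search needs a fusion step to merge a keyword ranking with a vector ranking. One common, engine-agnostic technique is reciprocal rank fusion (RRF); the sketch below is illustrative and not a description of Weaviate's internals:

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Merge several ranked result lists into one.

    rankings: list of lists of doc ids, each ordered best-first
              (e.g., one list from BM25, one from vector search)
    k:        damping constant; 60 is the commonly used default
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking):
            # A document scores higher the nearer the top it appears in each list.
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)
```

Rank-based fusion sidesteps the problem that BM25 scores and cosine similarities live on incomparable scales, which is why it is a popular default for hybrid retrieval.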
When You Actually Need a Vector Database
This is the question most tutorials skip.
You probably do NOT need a dedicated vector database if:
- Your corpus is under 100,000 chunks. pgvector handles this trivially, and even SQLite with a vector extension works.
- You are prototyping or in early development. Start with the simplest thing (an in-memory list or a local Chroma instance) and migrate later.
- Your data fits in memory. A numpy array of embeddings loaded at startup is faster than a network call to a vector DB for small corpora.
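For those prototype cases, the "simplest thing" can be a few dozen lines. A hypothetical InMemoryVectorStore, normalizing on insert so each search is a single matrix product:

```python
import numpy as np

class InMemoryVectorStore:
    """Minimal in-memory store: fine for prototypes and small corpora."""

    def __init__(self, dim):
        self.vectors = np.empty((0, dim), dtype=np.float32)
        self.payloads = []

    def add(self, embedding, payload):
        v = np.asarray(embedding, dtype=np.float32)
        # Normalize on insert so cosine similarity is a plain dot product.
        self.vectors = np.vstack([self.vectors, v / np.linalg.norm(v)])
        self.payloads.append(payload)

    def search(self, query, top_k=5):
        q = np.asarray(query, dtype=np.float32)
        scores = self.vectors @ (q / np.linalg.norm(q))
        top = np.argsort(scores)[::-1][:top_k]
        return [(self.payloads[i], float(scores[i])) for i in top]
```

When you eventually outgrow this, the migration is mostly mechanical: the add/search interface maps directly onto any vector database's upsert and query calls.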
You probably DO need a vector database if:
- Your corpus exceeds 1 million documents and query latency matters
- You need real-time index updates (adding new documents without full re-indexing)
- You need metadata filtering at scale (find the 10 nearest vectors where date > 2025-01-01 AND department = 'engineering')
- Your data is too large to load into memory and you need persistent storage with fast lookups
For the majority of RAG applications in production today — a company's internal knowledge base, a product documentation search, a support ticket classifier — pgvector is sufficient and avoids adding another managed service to your stack.
The Bigger Picture: RAG Systems
Vector databases are one component in a RAG pipeline, not the whole thing. The parts that matter as much or more:
Chunking strategy. How you split documents into chunks affects retrieval quality more than your choice of vector database. Too-large chunks dilute relevance; too-small chunks lose context. The right chunk size depends on your content type.
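As a baseline, the simplest chunker is a fixed-size sliding window with overlap (character-based here for illustration; production pipelines usually prefer sentence or section boundaries):

```python
def chunk_text(text, chunk_size=500, overlap=50):
    """Split text into fixed-size character windows with overlap.

    Overlap keeps context that straddles a boundary retrievable
    from at least one of the two adjacent chunks.
    """
    if chunk_size <= overlap:
        raise ValueError("chunk_size must exceed overlap")
    chunks = []
    step = chunk_size - overlap
    for start in range(0, len(text), step):
        chunk = text[start:start + chunk_size]
        if chunk:
            chunks.append(chunk)
    return chunks
```

Treat chunk_size and overlap as tunable hyperparameters: the right values depend on your content type and are worth measuring rather than guessing.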
Embedding model choice. Not all embeddings are equal. An embedding model fine-tuned on code performs better for code search than a general-purpose one. For multilingual corpora, multilingual embedding models are necessary.
Retrieval evaluation. Measure recall@K on a held-out set of (query, relevant-document) pairs before deciding your retrieval is working. Most RAG systems are shipped without any retrieval evaluation, which is why they fail in ways that are hard to debug.
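The metric itself is a few lines once you have labeled (query, relevant-document) pairs; the dict shapes below are illustrative:

```python
def recall_at_k(results, relevant, k):
    """Fraction of queries where a relevant doc appears in the top-k results.

    results:  dict mapping query -> ranked list of retrieved doc ids
    relevant: dict mapping query -> set of relevant doc ids
    """
    hits = sum(
        1 for q, ranked in results.items()
        if set(ranked[:k]) & relevant[q]
    )
    return hits / len(results)
```

Even a few dozen labeled pairs will surface gross retrieval failures before your users do.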
If you want to go deeper on building production RAG systems — chunking strategies, hybrid retrieval, rerankers, evaluation, and real-world failure modes — Phase 4 of the Agentic AI course at MindloomHQ covers this in full.
The 10 lessons go from embedding basics to a complete, evaluated RAG pipeline. Free to start with Phases 0 and 1, no credit card required.