Most prompt engineering advice comes from people optimizing for demos. The tricks that impress in a notebook don't always hold up when you need consistent, reliable outputs across thousands of real requests.
This guide is for developers who've moved past "make ChatGPT do a thing" and need prompts that work repeatedly, fail gracefully, and are maintainable six months from now.
The Mental Model That Changes Everything
Before tactics: understand what you're actually doing when you write a prompt.
You're not giving instructions to a program. You're constructing a document that the model completes. The model's job is to predict what comes next given everything in the context window. Your system prompt, your few-shot examples, your user message — all of it is context that shapes the completion.
This reframe changes how you debug bad outputs. When a prompt fails, the question isn't "why isn't it following my instructions?" — it's "what completion does this document make most likely?"
1. Chain-of-Thought: When and How to Use It
Chain-of-thought (CoT) prompting gets the model to reason step-by-step before producing an answer. It works because it forces the model to allocate "compute" (token predictions) to the reasoning path, not just the answer.
Basic CoT — add "think step by step":
Classify this customer support ticket as: billing, technical, or general.
Ticket: "I was charged twice last month but the second charge disappeared from my statement after three days."
Think step by step, then give your final classification.
Output: "This mentions billing (two charges), but the charge resolved itself — likely a pending authorization. The customer may be confused rather than disputing a charge. Classification: billing."
Without CoT: The model might output "billing" with no justification, and you can't verify it reasoned correctly.
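As a minimal sketch, the prompt above can be templated so the CoT suffix is applied consistently across call sites (the function and parameter names here are illustrative, not from any library):

```python
def cot_classification_prompt(ticket: str, labels: list[str]) -> str:
    """Template a basic chain-of-thought classification prompt.

    The trailing instruction is what triggers step-by-step reasoning
    before the final answer.
    """
    return (
        f"Classify this customer support ticket as: {', '.join(labels)}.\n"
        f'Ticket: "{ticket}"\n'
        "Think step by step, then give your final classification."
    )
```

Keeping the instruction in one place lets you A/B the CoT suffix against a direct-answer version without touching every call site.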
Structured CoT — give the format explicitly:
Analyze this code for bugs. Use this format:
REASONING: [think through what the code does and what could go wrong]
BUGS: [list any bugs found, or "none"]
SEVERITY: [critical / medium / low / none]
This is better than "think step by step" for programmatic use because the output is structured and parseable.
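Parseable is only useful if you actually parse it. A sketch of the extraction side, assuming the model follows the three-header format (the regex and function name are mine, not a library API):

```python
import re

def parse_review(text: str) -> dict:
    """Parse the REASONING / BUGS / SEVERITY block format into a dict.

    Assumes each section header appears exactly once, in order.
    Raises ValueError when the model drifts from the format, so the
    caller can retry rather than silently process garbage.
    """
    pattern = (r"REASONING:\s*(?P<reasoning>.*?)\s*"
               r"BUGS:\s*(?P<bugs>.*?)\s*"
               r"SEVERITY:\s*(?P<severity>critical|medium|low|none)")
    match = re.search(pattern, text, re.DOTALL | re.IGNORECASE)
    if match is None:
        raise ValueError("model output did not follow the requested format")
    return match.groupdict()
```

A failed parse is a signal worth logging: if it happens often, the format instructions (or a few-shot example of the format) need strengthening.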
When CoT makes things worse: Simple classification tasks with obvious answers. Factual lookups. Transformations that are purely mechanical. Adding CoT to these inflates token usage without improving accuracy — sometimes it introduces confusion.
The rule: Use CoT when the task requires multi-step reasoning, judgment calls, or analysis. Skip it when the answer is direct.
2. Few-Shot Examples: The Most Reliable Improvement
If your prompt isn't working, add examples. This is the highest-leverage technique in practical prompt engineering, and it's underused because developers underestimate how much it matters.
Few-shot prompting works by showing the model the pattern you want — not just describing it. "Be concise" means different things to different people. Three examples of the conciseness you want communicate it unambiguously.
Structure your examples as actual message pairs:
messages = [
    {
        "role": "system",
        "content": "Extract the key action item from a meeting note. Be specific and actionable."
    },
    {
        "role": "user",
        "content": "We discussed the Q3 roadmap and everyone agreed that the API rate limiting was getting urgent."
    },
    {
        "role": "assistant",
        "content": "Implement API rate limiting before Q3 roadmap kickoff."
    },
    {
        "role": "user",
        "content": "Sarah mentioned she'd look into why the dashboard is slow when there are more than 500 rows."
    },
    {
        "role": "assistant",
        "content": "Sarah to investigate dashboard performance regression with >500 rows."
    },
    {
        "role": "user",
        "content": input_text  # actual user input goes here
    }
]
The prior assistant turns are the examples. The model sees the pattern, infers the style and format, and continues it.
Choosing good examples:
- Cover edge cases you've seen fail, not just happy-path inputs
- Keep examples proportional to real distribution (if 80% of inputs are short, most examples should be short)
- Negative examples (showing what not to do) are often more powerful than positive ones for correcting specific failure modes
How many examples: 3-5 covers most cases. Beyond 10, you're usually not gaining much and you're burning context tokens. If you need 20+ examples, consider fine-tuning instead.
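The message-pair pattern above is easy to factor into a helper so the examples live in data rather than being hand-assembled each time (a sketch; the function name is illustrative):

```python
def build_fewshot_messages(system: str,
                           examples: list[tuple[str, str]],
                           user_input: str) -> list[dict]:
    """Assemble a few-shot message list: the system prompt, then one
    (user, assistant) turn per example, then the real input last."""
    messages = [{"role": "system", "content": system}]
    for example_input, example_output in examples:
        messages.append({"role": "user", "content": example_input})
        messages.append({"role": "assistant", "content": example_output})
    messages.append({"role": "user", "content": user_input})
    return messages
```

With examples in a plain list of pairs, swapping or reordering them for an experiment is a one-line data change instead of a prompt rewrite.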
3. System Prompts: Set Up Context, Not Commands
The system prompt is the most important part of your prompt — and the most misused.
Common mistake: Loading the system prompt with rules.
// BAD — a list of don'ts
You are a helpful assistant.
- Never say "I don't know"
- Always be polite
- Don't make things up
- If you're unsure, say so
- Be concise
- Don't use bullet points
Rules are antagonistic and often contradictory (never say "I don't know" vs. say so when unsure). The model has to simultaneously satisfy them all, and it fails on edge cases.
Better approach: Give the model a persona with clear purpose and context. The constraints emerge from the character.
// GOOD — a grounded persona
You are the technical support agent for Meridian Cloud, a developer-focused infrastructure platform.
Your users are software engineers troubleshooting deployments, networking, or billing issues. They're technical and busy — they don't want pleasantries, they want answers.
When you don't know something, say: "I don't have that information — check the docs at docs.meridiancloud.com or open a support ticket."
Format: Direct answer first, then context if needed. Use code blocks for commands.
This system prompt produces a consistent voice because it describes a coherent entity. The constraints (terse, technical, honest about limitations) come naturally from the character.
Persona stability: State the persona at the very start and reinforce it implicitly. Instructions at the beginning of the context tend to carry more weight than ones buried in the middle — "You are X" as the opening line is more influential than a "remember you are X" halfway down the prompt.
4. Output Formatting: Ask for What You'll Process
If you're going to parse the output programmatically, tell the model exactly what format to use — and use JSON with a schema.
from pydantic import BaseModel
from typing import Literal
from openai import OpenAI
class TicketClassification(BaseModel):
    category: Literal["billing", "technical", "general"]
    priority: Literal["high", "medium", "low"]
    requires_human: bool
    summary: str  # max 1 sentence

client = OpenAI()

result = client.beta.chat.completions.parse(
    model="gpt-4o-mini",
    messages=[
        {"role": "system", "content": "You classify customer support tickets."},
        {"role": "user", "content": f"Ticket: {ticket_text}"}
    ],
    response_format=TicketClassification,
)

classification = result.choices[0].message.parsed
# classification.category, classification.priority, etc. — fully typed
This is strictly better than asking the model to "respond in JSON" and parsing the string. Structured outputs guarantee valid JSON matching your schema. No json.loads() exceptions, no hallucinated keys.
When structured outputs aren't available: Add a clear format block to your prompt:
Respond with ONLY a JSON object. No explanation before or after. Example:
{"category": "billing", "priority": "high", "requires_human": true, "summary": "..."}
Then validate the response with Pydantic before trusting it.
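When the model wraps the JSON in stray prose despite the instruction, a small stdlib-only salvage step before validation can save a retry (a sketch; the helper name is mine, and the required keys mirror the schema above):

```python
import json

def extract_json(raw: str) -> dict:
    """Pull the first JSON object out of a model response that may
    include stray prose around it, then parse and sanity-check it.

    Raises ValueError on any failure so the caller can retry.
    """
    start = raw.find("{")
    end = raw.rfind("}")
    if start == -1 or end < start:
        raise ValueError("no JSON object found in model output")
    data = json.loads(raw[start:end + 1])
    missing = {"category", "priority", "requires_human", "summary"} - data.keys()
    if missing:
        raise ValueError(f"missing keys: {sorted(missing)}")
    return data
```

Hand the resulting dict to `TicketClassification.model_validate(...)` as the final gate, and reject and retry on any exception rather than trusting partial output.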
5. The Five Mistakes That Break Prompts Silently
These don't cause obvious errors. They cause subtly degraded outputs that are hard to diagnose.
Mistake 1: Vague success criteria
"Write a good summary" is meaningless to a model. Good in what way? For whom? How long? Under what constraints?
Fix: Be specific. "Write a 2-sentence summary for a non-technical manager. Focus on business impact, not implementation details. Under 50 words."
Mistake 2: Contradictory instructions
"Be brief but thorough." "Explain everything but keep it simple." These create unresolvable tension.
Fix: Prioritize. "Be brief. If detail is necessary to avoid misunderstanding, include it — but default to concise."
Mistake 3: Testing on easy inputs
Your prompt works great on clear, well-structured inputs. It fails on the messy, ambiguous, edge-case inputs that dominate real traffic.
Fix: Build a test set from real inputs that previously caused problems. Run every prompt change against it.
Mistake 4: Ignoring temperature
Nonzero temperature (many APIs default to 0.7 or 1.0) introduces randomness. For classification, extraction, or any task where you want near-deterministic output, use 0.0 or 0.1.
Fix: Set temperature=0 for classification/extraction. Use higher values only for creative tasks where variation is acceptable.
Mistake 5: Not versioning prompts
A prompt is code. Changing it is a deployment. If you don't version control your prompts, you can't roll back when a change degrades performance.
Fix: Store prompts in version-controlled config files or a prompt management system. Treat prompt changes with the same rigor as code changes.
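A minimal version of this, assuming prompts live as JSON files in the repo (the filename and keys here are illustrative, not a standard):

```python
import json
from pathlib import Path

def load_prompt(path: str) -> dict:
    """Load a versioned prompt config from the repo.

    Expected shape (illustrative):
    {"version": "2024-06-01", "system": "You classify customer support tickets."}

    Log config["version"] with every request so output quality can be
    correlated with prompt changes and rolled back like any deploy.
    """
    config = json.loads(Path(path).read_text())
    missing = {"version", "system"} - config.keys()
    if missing:
        raise ValueError(f"prompt config missing keys: {sorted(missing)}")
    return config
```

Because the file sits in git, every prompt change gets a diff, a review, and a revert path for free.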
Testing Your Prompts
The difference between a prompt that works in development and one that holds up in production is a real evaluation set.
Minimum viable prompt testing:
- Deterministic tests (temperature=0): Given input X, expect output Y. Failures are immediate regression signals.
- Edge case tests: Empty inputs, very long inputs, inputs in a different language, adversarial inputs that try to override your system prompt.
- Distribution tests: A sample of 50-100 real inputs, graded by a human or a judge model. Track accuracy over prompt versions.
You don't need a sophisticated MLOps platform for this. A CSV with inputs, expected outputs, and actual outputs, compared across prompt versions, is enough to catch most regressions.
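The CSV-level harness really can be this small (a sketch; `generate` stands in for whatever function wraps your model call):

```python
import csv

def run_eval(eval_path: str, generate) -> float:
    """Run every row of an eval CSV (columns: input, expected) through
    `generate` and return accuracy, printing each failure for diagnosis."""
    with open(eval_path, newline="") as f:
        rows = list(csv.DictReader(f))
    passed = 0
    for row in rows:
        actual = generate(row["input"]).strip()
        if actual == row["expected"].strip():
            passed += 1
        else:
            print(f"FAIL {row['input']!r}: got {actual!r}, expected {row['expected']!r}")
    return passed / len(rows) if rows else 0.0
```

Track the accuracy number per prompt version; a drop on a change you thought was safe is exactly the regression this exists to catch.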
Putting It Together
The pattern for a production-quality prompt:
- System prompt: Grounded persona, clear purpose, explicit format instructions
- Few-shot examples (3-5): Real examples covering common cases and known failure modes
- User message: Clean, structured input with no ambiguity
- Structured output: Pydantic model + structured outputs API, or explicit JSON format + validation
- Temperature: Set appropriately for the task (0 for deterministic, higher for creative)
- Test set: 50+ real inputs, versioned alongside the prompt
This isn't clever — it's disciplined. The developers writing the most reliable AI systems aren't doing anything exotic. They're applying basic software engineering practices to a new type of output.
If you want to go deep on prompt engineering — with real exercises, graded quizzes, and projects that simulate production scenarios — the Prompt Engineering course at MindloomHQ covers everything in this post plus advanced techniques like automatic prompt optimization and evaluation at scale.