A Practical Guide to Prompt Engineering for Production Systems
Prompt Engineering in Production Is Different
Most prompt engineering guides focus on getting a good response from a single interaction. That is useful for chatbot conversations, but production systems have completely different requirements. When your prompt runs thousands of times a day, you need consistency, reliability, structured output, and cost efficiency. A prompt that works 90% of the time is not good enough when it fails silently on the other 10%.
I have built multiple production systems that rely on AI models at their core, and I have learned these lessons the hard way. Here is what actually works.
Principle 1: Define Your Output Schema Explicitly
In production, you almost always need structured output. The model needs to return JSON (or another structured format) that your code can parse and act on. Never leave the output format implicit.
# Bad: ambiguous output format
prompt = "Analyse this document and tell me the key findings."
# Good: explicit schema definition
prompt = """Analyse this document and return a JSON object with this exact structure:
{
  "key_findings": ["string - each finding as one sentence"],
  "risk_level": "low | medium | high",
  "confidence": 0.0 to 1.0,
  "summary": "string - 2-3 sentence summary"
}
Return ONLY the JSON object, no additional text."""
I have found that including the word "exact" when describing schemas significantly improves compliance. Models take it as a stronger instruction than simply listing the fields.
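One way to keep the prompt and the downstream parser in sync is to render the schema block from a single definition. A minimal sketch (the `SCHEMA_DOC` dict and `schema_prompt` helper are illustrative, not a library API):

```python
import json

# Single source of truth for the output schema shown above
SCHEMA_DOC = {
    "key_findings": ["string - each finding as one sentence"],
    "risk_level": "low | medium | high",
    "confidence": "0.0 to 1.0",
    "summary": "string - 2-3 sentence summary",
}

def schema_prompt(task: str) -> str:
    """Render the task instruction plus the exact schema block."""
    return (
        f"{task} and return a JSON object with this exact structure:\n"
        f"{json.dumps(SCHEMA_DOC, indent=2)}\n"
        "Return ONLY the JSON object, no additional text."
    )
```

Any code that validates the model's reply can then iterate over the same `SCHEMA_DOC`, so prompt and validator cannot drift apart.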
Principle 2: Use System Prompts for Persistent Instructions
System prompts are your best tool for establishing consistent behaviour across many interactions. I put all of the following in system prompts:
- Output format requirements
- Tone and style guidelines
- Domain-specific terminology definitions
- Error handling instructions ("if you cannot analyse the input, return {"error": true, "reason": "string"}")
- Few-shot examples of ideal responses
Few-Shot Examples Are Powerful
Including 2-3 examples of ideal input/output pairs in your system prompt dramatically improves consistency. The model learns the pattern and follows it. This is especially valuable for classification tasks and structured extraction.
system_prompt = """You are a document classifier. Given a document excerpt,
classify it into one of these categories and extract key metadata.
Example 1:
Input: "The quarterly revenue exceeded projections by 12%..."
Output: {"category": "financial_report", "entities": ["revenue", "quarterly"], "sentiment": "positive"}
Example 2:
Input: "Section 4.2 of the agreement stipulates that..."
Output: {"category": "legal_contract", "entities": ["section 4.2", "agreement"], "sentiment": "neutral"}
Always return valid JSON matching this schema exactly."""
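With the Anthropic Python SDK, these persistent instructions go in the `system` parameter of the Messages API rather than into each user message. A minimal sketch (the `build_request` wrapper is my own illustrative helper, not part of the SDK):

```python
def build_request(system_prompt: str, user_content: str,
                  model: str = "claude-sonnet-4-20250514") -> dict:
    """Assemble Messages API keyword arguments so the system prompt
    (instructions + few-shot examples) travels with every call."""
    return {
        "model": model,
        "max_tokens": 256,
        "system": system_prompt,
        "messages": [{"role": "user", "content": user_content}],
    }

# Usage: response = client.messages.create(**build_request(system_prompt, excerpt))
```

Keeping request assembly in one place also makes it trivial to log or version the exact parameters of every call.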
Principle 3: Build Validation and Retry Logic
Even with perfect prompts, models occasionally produce invalid output. Your production system must handle this gracefully:
import json

# Assumes client is an anthropic.AsyncAnthropic instance and that
# validate_schema, SchemaError, and AIOutputError are defined elsewhere.
async def reliable_ai_call(prompt: str, max_retries: int = 2) -> dict:
    for attempt in range(max_retries + 1):
        response = await client.messages.create(
            model="claude-sonnet-4-20250514",
            max_tokens=1024,
            messages=[{"role": "user", "content": prompt}]
        )
        try:
            result = json.loads(response.content[0].text)
            validate_schema(result)  # raises if schema doesn't match
            return result
        except (json.JSONDecodeError, SchemaError) as e:
            if attempt == max_retries:
                raise AIOutputError(f"Failed after {max_retries + 1} attempts: {e}")
            # Add error context to the retry prompt
            prompt = f"{prompt}\n\nPrevious attempt failed: {e}. Please return valid JSON."
In my experience, retries with error feedback succeed about 95% of the time when the first attempt fails. The model "sees" what went wrong and corrects it.
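The `validate_schema` helper above is deliberately abstract. A hand-rolled sketch for the Principle 1 schema might look like this (a library such as Pydantic or `jsonschema` is the more robust choice in practice):

```python
class SchemaError(Exception):
    """Raised when AI output does not match the expected schema."""

# Field name -> accepted Python type(s) after json.loads
REQUIRED_FIELDS = {
    "key_findings": list,
    "risk_level": str,
    "confidence": (int, float),
    "summary": str,
}

def validate_schema(result: dict) -> None:
    for field, expected in REQUIRED_FIELDS.items():
        if field not in result:
            raise SchemaError(f"missing field: {field}")
        if not isinstance(result[field], expected):
            raise SchemaError(f"{field} has wrong type: {type(result[field]).__name__}")
    if result["risk_level"] not in {"low", "medium", "high"}:
        raise SchemaError("risk_level must be low | medium | high")
    if not 0.0 <= result["confidence"] <= 1.0:
        raise SchemaError("confidence must be between 0.0 and 1.0")
```

Raising a typed exception (rather than returning a boolean) is what lets the retry loop catch schema failures alongside `json.JSONDecodeError` and feed the message back to the model.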
Principle 4: Optimise for Cost
In production, every token costs money. Some cost optimisation strategies I use regularly:
- Use the cheapest model that works: Do not use Sonnet for a task that Haiku handles perfectly. Classification and simple extraction tasks almost never need the most powerful model.
- Minimise input tokens: Send only the relevant portion of a document, not the entire thing. Pre-process and extract the relevant sections before calling the API.
- Cache aggressively: If the same input produces the same output (which it should for deterministic tasks), cache the results.
- Batch similar requests: If you have 10 items to classify, send them in one prompt rather than 10 separate API calls. This reduces overhead and often produces more consistent results.
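The caching strategy can be sketched with a content-addressed key. The in-memory dict below is illustrative; a real deployment would swap in Redis or similar:

```python
import hashlib
import json

_cache: dict = {}

def cache_key(model: str, prompt: str) -> str:
    """Content-addressed key: identical model + prompt pairs always collide."""
    payload = json.dumps({"model": model, "prompt": prompt}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

def cached_call(model: str, prompt: str, call_fn):
    """Only pay for inputs we have not seen before."""
    key = cache_key(model, prompt)
    if key not in _cache:
        _cache[key] = call_fn(model, prompt)
    return _cache[key]
```

Including the model name in the key matters: the same prompt sent to a different model is a different cache entry, which keeps cached results valid across model upgrades.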
Principle 5: Version Your Prompts
Treat prompts like code. Store them in version control, tag them with version numbers, and track which version is in production. When you change a prompt, you need to be able to roll back if the new version performs worse.
# prompts/document_analysis_v3.py
SYSTEM_PROMPT = """..."""
USER_TEMPLATE = """..."""
VERSION = "3.1.0"
MODEL = "claude-sonnet-4-20250514"
LAST_TESTED = "2026-02-10"
Principle 6: Monitor and Measure
In production, you need to track:
- Success rate: What percentage of calls return valid, usable output?
- Latency: How long does each call take? Are there outliers?
- Cost per call: Track token usage and costs over time
- Quality metrics: If you can measure output quality automatically, do so
I log every AI call with its input hash, output, token counts, latency, and whether it passed validation. This data is invaluable for identifying degradation over time or after model updates.
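That logging setup can be sketched as one structured record per call. Field names here are illustrative; `usage` mirrors the per-response token counts the Anthropic API returns:

```python
import hashlib
import json
import time

def log_ai_call(prompt: str, output: str, usage, latency_s: float, valid: bool) -> dict:
    """Build and emit one structured log record per AI call."""
    record = {
        "ts": time.time(),
        "input_hash": hashlib.sha256(prompt.encode("utf-8")).hexdigest()[:16],
        "output": output,
        "input_tokens": usage.input_tokens,
        "output_tokens": usage.output_tokens,
        "latency_s": round(latency_s, 3),
        "passed_validation": valid,
    }
    print(json.dumps(record))  # or ship to your observability pipeline
    return record
```

Hashing the input instead of storing it keeps logs compact and avoids persisting sensitive document text, while still letting you group repeated inputs when investigating regressions.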
Common Mistakes I See
- Vague instructions: "Analyse this" is not a production prompt. Be specific about what analysis means.
- No error handling: If you are not parsing and validating AI output, you are building fragile systems.
- Using one model for everything: Match the model to the task complexity.
- Ignoring model updates: When providers update models, your prompts may behave differently. Test after every model update.
The Bottom Line
Production prompt engineering is about reliability, not creativity. The best production prompts are boring, explicit, heavily validated, and rigorously tested. Save the creative prompting for exploration. In production, predictability is king.