# Using Pydantic for AI Pipeline Data Validation

## Why Validation Matters in AI Pipelines
AI pipelines are inherently non-deterministic. The same input can produce different outputs, and LLM responses do not always match the format you requested. Without validation, these inconsistencies propagate through your pipeline, causing failures in downstream components that are difficult to debug.
Pydantic is the single most valuable library in my AI engineering toolkit after the LLM SDKs themselves. It turns loose, unpredictable AI outputs into structured, validated data that the rest of my system can trust.
## Basic LLM Output Validation
The most common pattern is defining a Pydantic model for the expected output and parsing the LLM response into it:
```python
from typing import Optional

from pydantic import BaseModel, Field, field_validator

class ContentScore(BaseModel):
    relevance: float = Field(ge=0, le=10)
    accuracy: float = Field(ge=0, le=10)
    clarity: float = Field(ge=0, le=10)
    overall: float = Field(ge=0, le=10)
    summary: str = Field(max_length=500)
    pass_threshold: bool

    @field_validator('overall')
    @classmethod
    def overall_should_be_average(cls, v, info):
        # Fields validate in declaration order, so the three dimension
        # scores are already available in info.data by the time we get here
        expected = sum([
            info.data.get('relevance', 0),
            info.data.get('accuracy', 0),
            info.data.get('clarity', 0),
        ]) / 3
        if abs(v - expected) > 1.5:
            raise ValueError(
                f'Overall score {v} too far from dimension average {expected:.1f}'
            )
        return v
```
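To see the validator in action, here is a minimal round trip (the scores are invented for illustration, and the class is repeated so the snippet runs standalone):

```python
from pydantic import BaseModel, Field, ValidationError, field_validator

class ContentScore(BaseModel):  # repeated so this snippet runs standalone
    relevance: float = Field(ge=0, le=10)
    accuracy: float = Field(ge=0, le=10)
    clarity: float = Field(ge=0, le=10)
    overall: float = Field(ge=0, le=10)
    summary: str = Field(max_length=500)
    pass_threshold: bool

    @field_validator('overall')
    @classmethod
    def overall_should_be_average(cls, v, info):
        expected = (info.data.get('relevance', 0) + info.data.get('accuracy', 0)
                    + info.data.get('clarity', 0)) / 3
        if abs(v - expected) > 1.5:
            raise ValueError(
                f'Overall score {v} too far from dimension average {expected:.1f}'
            )
        return v

# A consistent response validates cleanly
good = ContentScore(relevance=8.0, accuracy=7.0, clarity=9.0,
                    overall=8.0, summary="Clear and accurate.", pass_threshold=True)

# An inflated overall score is rejected before it reaches the rest of the pipeline
try:
    ContentScore(relevance=2.0, accuracy=3.0, clarity=2.0,
                 overall=9.5, summary="Suspiciously high.", pass_threshold=True)
except ValidationError as e:
    print(e)
```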
## Parsing LLM Responses
LLMs do not always return clean JSON. I use a robust parsing function that handles common issues:
```python
import json
import re

from pydantic import BaseModel, ValidationError

def parse_llm_json(text: str, model: type[BaseModel]) -> BaseModel:
    # Try a direct JSON parse first
    try:
        return model.model_validate(json.loads(text))
    except (json.JSONDecodeError, ValidationError):
        pass

    # Extract JSON from a markdown code block
    json_match = re.search(r'```(?:json)?\s*(.+?)```', text, re.DOTALL)
    if json_match:
        try:
            return model.model_validate(json.loads(json_match.group(1)))
        except (json.JSONDecodeError, ValidationError):
            pass

    # Last resort: grab anything that looks like a JSON object
    brace_match = re.search(r'\{.+\}', text, re.DOTALL)
    if brace_match:
        try:
            return model.model_validate(json.loads(brace_match.group()))
        except (json.JSONDecodeError, ValidationError):
            pass

    raise ValueError("Could not extract valid JSON from LLM response")
```
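For example, here is a condensed variant of the same parser together with a typical fenced reply (the Reply model and the raw text are invented for illustration):

```python
import json
import re

from pydantic import BaseModel, ValidationError

def parse_llm_json(text: str, model: type[BaseModel]) -> BaseModel:
    """Condensed version of the parser above: try each candidate in order."""
    candidates = [text]
    block = re.search(r'```(?:json)?\s*(.+?)```', text, re.DOTALL)
    if block:
        candidates.append(block.group(1))
    braces = re.search(r'\{.+\}', text, re.DOTALL)
    if braces:
        candidates.append(braces.group())
    for candidate in candidates:
        try:
            return model.model_validate(json.loads(candidate))
        except (json.JSONDecodeError, ValidationError):
            continue
    raise ValueError("Could not extract valid JSON from LLM response")

class Reply(BaseModel):  # stand-in for whatever model you expect
    answer: str
    confidence: float

fence = "`" * 3  # keeps literal backtick fences out of this snippet
raw = f'Sure, here you go!\n{fence}json\n{{"answer": "42", "confidence": 0.9}}\n{fence}'
parsed = parse_llm_json(raw, Reply)
print(parsed.answer, parsed.confidence)
```

The same call also recovers JSON buried in surrounding prose via the brace fallback.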
## Nested Models for Complex Pipelines
Real AI pipelines produce complex, nested data. Pydantic handles this beautifully:
```python
class ExtractedEntity(BaseModel):
    name: str
    entity_type: str
    confidence: float = Field(ge=0, le=1)
    source_text: str

class DocumentAnalysis(BaseModel):
    document_id: str
    title: Optional[str] = None
    summary: str = Field(max_length=1000)
    entities: list[ExtractedEntity]
    key_dates: list[str]
    sentiment: float = Field(ge=-1, le=1)
    language: str
    processing_model: str
    tokens_used: int = Field(ge=0)
```
Each field has constraints that catch invalid data immediately. If the LLM returns a confidence score of 1.5 or a sentiment of 2.0, Pydantic raises a clear error before that bad data causes problems elsewhere.
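For instance, a hallucinated confidence of 1.5 is rejected at the entity level. A small sketch (the entity model is repeated so this runs on its own; the entity values are invented):

```python
from pydantic import BaseModel, Field, ValidationError

class ExtractedEntity(BaseModel):  # repeated so this snippet runs standalone
    name: str
    entity_type: str
    confidence: float = Field(ge=0, le=1)
    source_text: str

try:
    ExtractedEntity(
        name="Acme Corp", entity_type="ORG",
        confidence=1.5,  # out of range: must be <= 1
        source_text="...Acme Corp announced...",
    )
except ValidationError as e:
    # The error names the offending field and the violated constraint
    print(e.errors()[0]["loc"])
```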
## Input Validation for API Endpoints
Pydantic integrates natively with FastAPI, so your API inputs are validated automatically:
```python
class AnalyzeRequest(BaseModel):
    content: str = Field(min_length=10, max_length=100000)
    analysis_type: str = Field(pattern='^(summary|entities|full)$')
    language: str = Field(default='en', pattern='^[a-z]{2}$')
    max_tokens: int = Field(default=2000, ge=100, le=8000)

@app.post("/analyze")
async def analyze(request: AnalyzeRequest):
    # request is already validated
    result = await pipeline.run(request)
    return result
```
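Because AnalyzeRequest is a plain Pydantic model, you can exercise the same validation outside FastAPI, which is handy for unit tests (the payload values here are invented):

```python
from pydantic import BaseModel, Field, ValidationError

class AnalyzeRequest(BaseModel):  # repeated so this snippet runs standalone
    content: str = Field(min_length=10, max_length=100000)
    analysis_type: str = Field(pattern='^(summary|entities|full)$')
    language: str = Field(default='en', pattern='^[a-z]{2}$')
    max_tokens: int = Field(default=2000, ge=100, le=8000)

# Valid payload: defaults fill in language and max_tokens
req = AnalyzeRequest(content="Analyze this article about solar power.",
                     analysis_type="summary")

# An unsupported analysis_type fails the pattern check; FastAPI turns this
# ValidationError into a 422 response carrying the same error detail
try:
    AnalyzeRequest(content="Analyze this article about solar power.",
                   analysis_type="everything")
except ValidationError as e:
    print(e.errors()[0]["loc"])
```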
## Configuration Management
I also use Pydantic for pipeline configuration. This ensures that configuration errors are caught at startup, not at runtime:
```python
from pydantic import Field
from pydantic_settings import BaseSettings, SettingsConfigDict

class PipelineConfig(BaseSettings):
    # protected_namespaces=() allows the model_name field despite
    # Pydantic reserving the "model_" prefix by default
    model_config = SettingsConfigDict(env_file='.env', protected_namespaces=())

    anthropic_api_key: str
    openai_api_key: str
    supabase_url: str
    supabase_key: str
    max_concurrent_requests: int = Field(default=5, ge=1, le=50)
    score_threshold: float = Field(default=6.5, ge=0, le=10)
    model_name: str = Field(default="claude-sonnet-4-20250514")
```
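A quick way to confirm the fail-fast behavior is to instantiate a settings class directly; here is a sketch with a cut-down stand-in for PipelineConfig (the variable names and values are placeholders):

```python
import os

from pydantic import ValidationError
from pydantic_settings import BaseSettings

class MinimalConfig(BaseSettings):  # cut-down stand-in for PipelineConfig
    anthropic_api_key: str
    max_concurrent_requests: int = 5

# With the variable set, the value is read from the environment
# (matching is case-insensitive by default)
os.environ["ANTHROPIC_API_KEY"] = "sk-ant-placeholder"
config = MinimalConfig()
assert config.anthropic_api_key == "sk-ant-placeholder"
assert config.max_concurrent_requests == 5  # default still applies

# With it missing, instantiation fails immediately at startup
del os.environ["ANTHROPIC_API_KEY"]
try:
    MinimalConfig()
except ValidationError as e:
    print("startup error:", e.errors()[0]["loc"])
```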
## Error Handling Patterns
When Pydantic validation fails, you want to handle it gracefully rather than crashing the pipeline:
```python
import logging

from pydantic import ValidationError

logger = logging.getLogger(__name__)

async def process_with_retry(content: str, max_retries: int = 2) -> ContentScore:
    for attempt in range(max_retries + 1):
        raw_response = await call_llm(content)
        try:
            return parse_llm_json(raw_response, ContentScore)
        except (ValidationError, ValueError) as e:
            if attempt < max_retries:
                logger.warning(f"Validation failed (attempt {attempt + 1}): {e}")
                continue
            logger.error(f"All retries failed for content scoring: {e}")
            raise
```
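The same loop can be exercised end to end with a stubbed model call. Here is a runnable sketch (call_llm, the canned responses, and the minimal Score model are all stand-ins for the real pipeline pieces):

```python
import asyncio
import json

from pydantic import BaseModel, Field, ValidationError

class Score(BaseModel):  # minimal stand-in for ContentScore
    value: float = Field(ge=0, le=10)

# First reply is malformed, second is valid -- simulating a flaky model
responses = iter(['not json at all', '{"value": 7.5}'])

async def call_llm(content: str) -> str:
    return next(responses)  # stubbed model call

async def process_with_retry(content: str, max_retries: int = 2) -> Score:
    for attempt in range(max_retries + 1):
        raw = await call_llm(content)
        try:
            return Score.model_validate(json.loads(raw))
        except (ValidationError, ValueError):
            if attempt < max_retries:
                continue  # a real pipeline would log the failure here
            raise

score = asyncio.run(process_with_retry("some content"))
print(score.value)
```

The first attempt fails JSON parsing, the retry succeeds, and the caller receives a validated Score.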
## Performance Considerations
Pydantic v2 is significantly faster than v1, but validation still has a cost. For high-throughput pipelines, consider:
- Using `model_validate` instead of constructing models with kwargs for bulk operations
- Keeping validators simple and moving complex logic to separate functions
- Using `model_construct` to skip validation for trusted internal data
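A small sketch of the trust boundary these two entry points imply (Score is a simplified stand-in for the models above):

```python
from pydantic import BaseModel, Field

class Score(BaseModel):  # simplified stand-in
    value: float = Field(ge=0, le=10)

# model_validate runs full validation -- use it at pipeline boundaries
validated = Score.model_validate({"value": 7.5})

# model_construct skips validation entirely -- only for data you already trust
trusted = Score.model_construct(value=7.5)

# It will happily accept out-of-range data, so never use it on LLM output
unchecked = Score.model_construct(value=99.0)
```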
Pydantic does not make your AI pipeline slower. It makes your AI pipeline correct. The time you save debugging malformed data far exceeds the nanoseconds spent on validation.
## Getting Started
If you are building AI pipelines without Pydantic, start by defining models for your LLM outputs. This single change will catch the majority of data quality issues in your pipeline and make your code significantly easier to understand and maintain.