3 min read

Using Pydantic for AI Pipeline Data Validation

Tags: Pydantic, data validation, Python, AI pipelines, type safety

Why Validation Matters in AI Pipelines

AI pipelines are inherently non-deterministic. The same input can produce different outputs, and LLM responses do not always match the format you requested. Without validation, these inconsistencies propagate through your pipeline, causing failures in downstream components that are difficult to debug.

Pydantic is the single most valuable library in my AI engineering toolkit after the LLM SDKs themselves. It turns loose, unpredictable AI outputs into structured, validated data that the rest of my system can trust.

Basic LLM Output Validation

The most common pattern is defining a Pydantic model for the expected output and parsing the LLM response into it:

from pydantic import BaseModel, Field, ValidationInfo, field_validator
from typing import Optional

class ContentScore(BaseModel):
    relevance: float = Field(ge=0, le=10)
    accuracy: float = Field(ge=0, le=10)
    clarity: float = Field(ge=0, le=10)
    overall: float = Field(ge=0, le=10)
    summary: str = Field(max_length=500)
    pass_threshold: bool

    @field_validator('overall')
    @classmethod
    def overall_should_be_average(cls, v: float, info: ValidationInfo) -> float:
        # Fields validate in declaration order, so the three dimension
        # scores are already available in info.data by the time this runs
        expected = sum([
            info.data.get('relevance', 0),
            info.data.get('accuracy', 0),
            info.data.get('clarity', 0),
        ]) / 3
        if abs(v - expected) > 1.5:
            raise ValueError(
                f'Overall score {v} too far from dimension average {expected:.1f}'
            )
        return v

Parsing LLM Responses

LLMs do not always return clean JSON. I use a robust parsing function that handles common issues:

import json
import re

from pydantic import BaseModel, ValidationError

def parse_llm_json(text: str, model: type[BaseModel]) -> BaseModel:
    # Try a direct JSON parse first
    try:
        return model.model_validate(json.loads(text))
    except (json.JSONDecodeError, ValidationError):
        pass

    # Extract JSON from markdown code blocks
    json_match = re.search(r'```(?:json)?\s*(.+?)```', text, re.DOTALL)
    if json_match:
        try:
            return model.model_validate(json.loads(json_match.group(1)))
        except (json.JSONDecodeError, ValidationError):
            pass

    # Last resort: find anything that looks like JSON
    brace_match = re.search(r'\{.+\}', text, re.DOTALL)
    if brace_match:
        try:
            return model.model_validate(json.loads(brace_match.group()))
        except (json.JSONDecodeError, ValidationError):
            pass

    raise ValueError("Could not extract valid JSON from LLM response")
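In my experience, the markdown code block branch is the one that fires most often. A minimal standalone check of that extraction step, using only the standard library:

```python
import json
import re

# A typical LLM reply that wraps its JSON in a markdown code fence
response = 'Here are the scores:\n```json\n{"relevance": 8, "accuracy": 7}\n```'

# Same pattern as in parse_llm_json: grab the fenced body, JSON label optional
match = re.search(r"```(?:json)?\s*(.+?)```", response, re.DOTALL)
data = json.loads(match.group(1))
print(data)  # {'relevance': 8, 'accuracy': 7}
```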

Nested Models for Complex Pipelines

Real AI pipelines produce complex, nested data. Pydantic handles this beautifully:

class ExtractedEntity(BaseModel):
    name: str
    entity_type: str
    confidence: float = Field(ge=0, le=1)
    source_text: str

class DocumentAnalysis(BaseModel):
    document_id: str
    title: Optional[str] = None
    summary: str = Field(max_length=1000)
    entities: list[ExtractedEntity]
    key_dates: list[str]
    sentiment: float = Field(ge=-1, le=1)
    language: str
    processing_model: str
    tokens_used: int = Field(ge=0)

Each field has constraints that catch invalid data immediately. If the LLM returns a confidence score of 1.5 or a sentiment of 2.0, Pydantic raises a clear error before that bad data causes problems elsewhere.
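To see what that looks like in practice (the entity model is repeated here so the snippet runs on its own):

```python
from pydantic import BaseModel, Field, ValidationError

class ExtractedEntity(BaseModel):
    name: str
    entity_type: str
    confidence: float = Field(ge=0, le=1)
    source_text: str

try:
    ExtractedEntity(
        name="Acme Corp",
        entity_type="ORG",
        confidence=1.5,  # out of range -- the LLM used the wrong scale
        source_text="Acme Corp announced...",
    )
except ValidationError as e:
    # The error names the exact field and the constraint that failed
    print(e.errors()[0]["loc"])  # ('confidence',)
```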

Input Validation for API Endpoints

Pydantic integrates natively with FastAPI, so your API inputs are validated automatically:

class AnalyzeRequest(BaseModel):
    content: str = Field(min_length=10, max_length=100000)
    analysis_type: str = Field(pattern='^(summary|entities|full)$')
    language: str = Field(default='en', pattern='^[a-z]{2}$')
    max_tokens: int = Field(default=2000, ge=100, le=8000)

@app.post("/analyze")
async def analyze(request: AnalyzeRequest):
    # request is already validated
    result = await pipeline.run(request)
    return result
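FastAPI turns these validation failures into 422 responses for you, but the underlying check is plain Pydantic, which you can exercise directly (the request model is repeated so the snippet stands alone):

```python
from pydantic import BaseModel, Field, ValidationError

class AnalyzeRequest(BaseModel):
    content: str = Field(min_length=10, max_length=100000)
    analysis_type: str = Field(pattern='^(summary|entities|full)$')
    language: str = Field(default='en', pattern='^[a-z]{2}$')
    max_tokens: int = Field(default=2000, ge=100, le=8000)

# A well-formed request parses cleanly, with defaults filled in
ok = AnalyzeRequest(
    content="Summarize this quarterly report.", analysis_type="summary"
)

# An unsupported analysis_type is rejected before any pipeline code runs
try:
    AnalyzeRequest(
        content="Summarize this quarterly report.", analysis_type="translate"
    )
except ValidationError as e:
    print(e.errors()[0]["loc"])  # ('analysis_type',)
```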

Configuration Management

I also use Pydantic for pipeline configuration. This ensures that configuration errors are caught at startup, not at runtime:

from pydantic import Field
from pydantic_settings import BaseSettings, SettingsConfigDict

class PipelineConfig(BaseSettings):
    model_config = SettingsConfigDict(env_file='.env')

    anthropic_api_key: str
    openai_api_key: str
    supabase_url: str
    supabase_key: str
    max_concurrent_requests: int = Field(default=5, ge=1, le=50)
    score_threshold: float = Field(default=6.5, ge=0, le=10)
    model_name: str = Field(default="claude-sonnet-4-20250514")

Error Handling Patterns

When Pydantic validation fails, you want to handle it gracefully rather than crashing the pipeline:

from pydantic import ValidationError

async def process_with_retry(content: str, max_retries: int = 2) -> ContentScore:
    for attempt in range(max_retries + 1):
        raw_response = await call_llm(content)
        try:
            return parse_llm_json(raw_response, ContentScore)
        except (ValidationError, ValueError) as e:
            if attempt < max_retries:
                logger.warning(f"Validation failed (attempt {attempt + 1}): {e}")
                continue
            logger.error(f"All retries failed for content scoring: {e}")
            raise
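Here is a trimmed, self-contained version of that loop with a stubbed call_llm that returns garbage once and then valid JSON, which is exactly the failure mode retries are for:

```python
import asyncio
import json

from pydantic import BaseModel, Field, ValidationError

class Score(BaseModel):
    relevance: float = Field(ge=0, le=10)

# Stub: first reply is malformed prose, second is valid JSON
replies = iter(["Sure! The relevance is high.", '{"relevance": 8.0}'])

async def call_llm(content: str) -> str:
    return next(replies)

async def process_with_retry(content: str, max_retries: int = 2) -> Score:
    for attempt in range(max_retries + 1):
        raw = await call_llm(content)
        try:
            # json.JSONDecodeError is a ValueError subclass, so both
            # parse failures and validation failures are caught below
            return Score.model_validate(json.loads(raw))
        except (ValidationError, ValueError):
            if attempt < max_retries:
                continue  # malformed output: ask the model again
            raise

score = asyncio.run(process_with_retry("draft text"))
print(score.relevance)  # 8.0
```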

Performance Considerations

Pydantic v2 is significantly faster than v1, but validation still has a cost. For high-throughput pipelines, consider:

  • Using model_validate instead of constructing models with kwargs for bulk operations
  • Keeping validators simple and moving complex logic to separate functions
  • Using model_construct to skip validation for trusted internal data

Pydantic does not make your AI pipeline slower. It makes your AI pipeline correct. The time you save debugging malformed data far exceeds the microseconds spent on validation.
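The trade-off between the first and third bullets is easy to see side by side. model_construct trusts its inputs completely, so reserve it for data your own code has already validated:

```python
from pydantic import BaseModel, Field

class Row(BaseModel):
    id: int
    score: float = Field(ge=0, le=10)

validated = Row.model_validate({"id": "1", "score": 9.5})  # coerces and checks
trusted = Row.model_construct(id=2, score=99.0)            # no checks at all

print(validated.id)   # 1 (coerced from the string "1")
print(trusted.score)  # 99.0 -- out of range, but model_construct never looked
```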

Getting Started

If you are building AI pipelines without Pydantic, start by defining models for your LLM outputs. This single change will catch the majority of data quality issues in your pipeline and make your code significantly easier to understand and maintain.