How to Structure a Python AI Project for Production
Most AI Projects Start Messy
Nearly every AI project I have seen starts the same way: a Jupyter notebook, a few Python scripts, and a growing sense of chaos. That is fine for prototyping, but when you need to deploy, maintain, and iterate on a production AI system, structure matters enormously.
I have settled on a project structure that works well for production AI applications. It has evolved through building multiple systems and learning what causes pain during deployment, debugging, and handover.
The Project Layout
my-ai-project/
|-- src/
| |-- my_project/
| | |-- __init__.py
| | |-- config.py # Configuration management
| | |-- main.py # Application entry point
| | |-- api/
| | | |-- __init__.py
| | | |-- routes.py # API endpoints
| | | |-- schemas.py # Pydantic models
| | | |-- dependencies.py # FastAPI dependencies
| | |-- agents/
| | | |-- __init__.py
| | | |-- analyst.py # Individual agent definitions
| | | |-- reviewer.py
| | | |-- orchestrator.py # Agent coordination
| | |-- prompts/
| | | |-- __init__.py
| | | |-- templates.py # Prompt templates
| | | |-- versions.py # Prompt version tracking
| | |-- services/
| | | |-- __init__.py
| | | |-- ai_client.py # LLM client wrapper
| | | |-- database.py # Database operations
| | | |-- embeddings.py # Embedding generation
| | |-- models/
| | | |-- __init__.py
| | | |-- domain.py # Domain models
| | | |-- database.py # Database models
| | |-- utils/
| | | |-- __init__.py
| | | |-- text.py # Text processing utilities
| | | |-- logging.py # Logging configuration
|-- tests/
| |-- test_agents/
| |-- test_api/
| |-- test_services/
| |-- conftest.py # Test fixtures
|-- scripts/
| |-- seed_db.py
| |-- run_backfill.py
|-- .env.example
|-- pyproject.toml
|-- Dockerfile
Key Design Decisions
1. Separate Agents from API
The agents/ directory contains all AI agent logic, completely independent of the API layer. This means agents can be tested in isolation, reused across different interfaces (API, CLI, scripts), and modified without touching the API code.
# agents/analyst.py
class DocumentAnalyst:
    def __init__(self, ai_client, config):
        self.ai_client = ai_client
        self.config = config

    async def analyse(self, document: Document) -> Analysis:
        prompt = self.build_prompt(document)
        response = await self.ai_client.complete(prompt)
        return self.parse_response(response)

    def build_prompt(self, document: Document) -> str:
        template = PromptTemplates.get("document_analysis", self.config.prompt_version)
        return template.format(content=document.text)
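Because the analyst receives its AI client through the constructor rather than constructing one itself, any interface can drive it. A compressed, self-contained sketch with a stub client (the Document shape, prompt, and return type here are simplified stand-ins, not the real classes):

```python
import asyncio
from dataclasses import dataclass


@dataclass
class Document:
    text: str


class StubAIClient:
    """Stands in for the real AIClient; returns a canned completion."""

    async def complete(self, prompt: str) -> str:
        return f"summary of: {prompt}"


class DocumentAnalyst:
    def __init__(self, ai_client, config=None):
        self.ai_client = ai_client
        self.config = config

    async def analyse(self, document: Document) -> str:
        prompt = f"Analyse the following document:\n{document.text}"
        return await self.ai_client.complete(prompt)


# The same class can be driven from a CLI script, a test, or the API layer.
result = asyncio.run(DocumentAnalyst(StubAIClient()).analyse(Document("quarterly report")))
```

Swapping the stub for the real client is the only change needed to move from a test harness to production.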
2. Centralised Configuration
All configuration lives in one place and is loaded from environment variables with sensible defaults:
# config.py
from pydantic_settings import BaseSettings


class Settings(BaseSettings):
    # API
    api_host: str = "0.0.0.0"
    api_port: int = 8000

    # AI
    anthropic_api_key: str
    default_model: str = "claude-sonnet-4-20250514"
    max_retries: int = 3

    # Database
    database_url: str

    # Features
    enable_caching: bool = True
    log_level: str = "INFO"

    class Config:
        env_file = ".env"


settings = Settings()
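The matching .env.example (already in the layout above) documents every variable the Settings class reads; the values below are placeholders:

```
# .env.example
# Required -- no defaults in config.py
ANTHROPIC_API_KEY=your-key-here
DATABASE_URL=postgresql://user:password@localhost:5432/my_project

# Optional -- override the defaults in config.py
DEFAULT_MODEL=claude-sonnet-4-20250514
MAX_RETRIES=3
ENABLE_CACHING=true
LOG_LEVEL=INFO
```

Fields without defaults, like anthropic_api_key, cause a validation error at startup if missing, which surfaces misconfiguration immediately rather than at the first API call.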
3. Prompt Version Management
Prompts are treated as versioned artefacts, not inline strings:
# prompts/versions.py
# prompts/versions.py
PROMPT_REGISTRY = {
    "document_analysis": {
        "v1": "Analyse this document...",
        "v2": "You are a document analyst. Analyse...",
        "v3": "You are a precise document analyst...",  # current
    }
}

CURRENT_VERSIONS = {
    "document_analysis": "v3"
}
This lets me roll back to previous prompt versions if a new one performs worse, without redeploying code.
4. AI Client Abstraction
I wrap the LLM client in a service layer that handles retries, logging, and cost tracking:
# services/ai_client.py
import asyncio
import time

import anthropic


class AIClientError(Exception):
    """Raised when the LLM call cannot be completed after retries."""


class AIClient:
    def __init__(self, settings: Settings):
        self.client = anthropic.AsyncAnthropic(api_key=settings.anthropic_api_key)
        self.default_model = settings.default_model
        self.max_retries = settings.max_retries

    async def complete(self, prompt: str, model: str | None = None, **kwargs) -> str:
        # max_tokens and other API parameters are passed through **kwargs
        model = model or self.default_model
        start = time.time()
        for attempt in range(self.max_retries):
            try:
                response = await self.client.messages.create(
                    model=model,
                    messages=[{"role": "user", "content": prompt}],
                    **kwargs,
                )
                duration = time.time() - start
                self._log_usage(model, response.usage, duration)
                return response.content[0].text
            except anthropic.RateLimitError:
                await asyncio.sleep(2 ** attempt)  # exponential backoff
        raise AIClientError("Max retries exceeded")
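The _log_usage hook is where cost tracking happens. A sketch of the shape it might take (the input_tokens and output_tokens fields match the Anthropic SDK's usage object; everything else is illustrative):

```python
import logging

logger = logging.getLogger("ai_client")


def log_usage(model: str, usage, duration: float) -> dict:
    """Build and log one structured record per completion call."""
    record = {
        "model": model,
        "input_tokens": usage.input_tokens,
        "output_tokens": usage.output_tokens,
        "duration_s": round(duration, 3),
    }
    logger.info("llm_call %s", record)
    return record
```

Emitting one structured record per call makes it straightforward to aggregate token spend per model, per agent, or per prompt version later.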
Testing AI Applications
Testing AI systems requires a different approach than testing deterministic code. I use three levels of testing:
Unit Tests (Mocked AI)
Test your logic with mocked AI responses. This verifies that your parsing, validation, and business logic work correctly without making API calls:
async def test_analyst_parses_valid_response(mock_ai_client):
    mock_ai_client.complete.return_value = json.dumps({
        "summary": "Test summary",
        "key_findings": ["Finding 1"],
        "risk_level": "low",
        "confidence": 0.9
    })
    analyst = DocumentAnalyst(mock_ai_client, test_config)
    result = await analyst.analyse(test_document)
    assert result.risk_level == "low"
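The mock_ai_client fixture would live in conftest.py; a sketch using unittest.mock.AsyncMock (the exact wiring is an assumption, but the fixture name matches the test above):

```python
# tests/conftest.py -- sketch of the mocked-client fixture
from unittest.mock import AsyncMock

import pytest


@pytest.fixture
def mock_ai_client():
    """An AIClient stand-in whose complete() can be awaited and configured."""
    client = AsyncMock()
    client.complete = AsyncMock(return_value="{}")
    return client
```

Each test then overrides complete.return_value with whatever response shape it needs, as the parsing test above does.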
Integration Tests (Real AI, Controlled Input)
Test with real API calls using known inputs. These cost money, so run them less frequently, but they catch issues that mocks miss.
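To keep these out of the default test run, a pytest skip guard keyed on an environment variable works well (the variable name and the test below are my convention, not a pytest built-in):

```python
import os

import pytest

# Skip unless the caller explicitly opts in to tests that spend API credits.
requires_real_api = pytest.mark.skipif(
    os.environ.get("RUN_INTEGRATION_TESTS") != "1",
    reason="set RUN_INTEGRATION_TESTS=1 to run tests that call the real API",
)


@requires_real_api
async def test_analyst_on_known_document(real_ai_client, test_config):
    ...
```

Running `RUN_INTEGRATION_TESTS=1 pytest tests/` then opts in deliberately, while CI stays fast and free by default.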
Evaluation Tests
A suite of test cases with expected outputs that measure quality over time. Run after prompt changes or model updates.
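A minimal shape for such a suite, assuming each case pairs an input with a checkable expectation (the scoring here is a deliberately simple pass rate; real suites often use graded rubrics):

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class EvalCase:
    name: str
    input_text: str
    check: Callable[[str], bool]  # returns True when the output is acceptable


def run_eval_suite(cases: list[EvalCase], analyse: Callable[[str], str]) -> float:
    """Run every case through the system and return the pass rate."""
    passed = sum(1 for case in cases if case.check(analyse(case.input_text)))
    return passed / len(cases)
```

Recording the pass rate alongside the prompt version gives a concrete number to compare before promoting a new version over the current one.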
Deployment
A minimal Dockerfile for a FastAPI AI application:
FROM python:3.12-slim
WORKDIR /app
COPY pyproject.toml .
COPY src/ src/
RUN pip install .
CMD ["uvicorn", "my_project.main:app", "--host", "0.0.0.0", "--port", "8000"]
The Payoff
This structure takes about 30 minutes to set up for a new project. That investment pays back immediately the first time you need to debug a production issue, change a prompt version, add a new agent, or hand the project to another developer. Structure is not overhead. It is the foundation that makes everything else faster.