How to Structure a Python AI Project for Production
Most AI Projects Start Messy
Nearly every AI project I have seen starts the same way: a Jupyter notebook, a few Python scripts, and a growing sense of chaos. That is fine for prototyping, but when you need to deploy, maintain, and iterate on a production AI system, structure matters enormously.
I have settled on a project structure that works well for production AI applications. It has evolved through building multiple systems and learning what causes pain during deployment, debugging, and handover.
The Project Layout
my-ai-project/
|-- src/
| |-- my_project/
| | |-- __init__.py
| | |-- config.py # Configuration management
| | |-- main.py # Application entry point
| | |-- api/
| | | |-- __init__.py
| | | |-- routes.py # API endpoints
| | | |-- schemas.py # Pydantic models
| | | |-- dependencies.py # FastAPI dependencies
| | |-- agents/
| | | |-- __init__.py
| | | |-- analyst.py # Individual agent definitions
| | | |-- reviewer.py
| | | |-- orchestrator.py # Agent coordination
| | |-- prompts/
| | | |-- __init__.py
| | | |-- templates.py # Prompt templates
| | | |-- versions.py # Prompt version tracking
| | |-- services/
| | | |-- __init__.py
| | | |-- ai_client.py # LLM client wrapper
| | | |-- database.py # Database operations
| | | |-- embeddings.py # Embedding generation
| | |-- models/
| | | |-- __init__.py
| | | |-- domain.py # Domain models
| | | |-- database.py # Database models
| | |-- utils/
| | | |-- __init__.py
| | | |-- text.py # Text processing utilities
| | | |-- logging.py # Logging configuration
|-- tests/
| |-- test_agents/
| |-- test_api/
| |-- test_services/
| |-- conftest.py # Test fixtures
|-- scripts/
| |-- seed_db.py
| |-- run_backfill.py
|-- .env.example
|-- pyproject.toml
|-- Dockerfile
Key Design Decisions
1. Separate Agents from API
The agents/ directory contains all AI agent logic, completely independent of the API layer. This means agents can be tested in isolation, reused across different interfaces (API, CLI, scripts), and modified without touching the API code.
# agents/analyst.py
class DocumentAnalyst:
    def __init__(self, ai_client, config):
        self.ai_client = ai_client
        self.config = config

    async def analyse(self, document: Document) -> Analysis:
        prompt = self.build_prompt(document)
        response = await self.ai_client.complete(prompt)
        return self.parse_response(response)

    def build_prompt(self, document: Document) -> str:
        template = PromptTemplates.get("document_analysis", self.config.prompt_version)
        return template.format(content=document.text)
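Because the analyst receives its AI client through the constructor rather than constructing one itself, any interface can drive it. A compressed, self-contained sketch with a stub client (the Document shape, prompt, and return type here are simplified stand-ins, not the real classes):

```python
import asyncio
from dataclasses import dataclass


@dataclass
class Document:
    text: str


class StubAIClient:
    """Stands in for the real AIClient; returns a canned completion."""

    async def complete(self, prompt: str) -> str:
        return f"summary of: {prompt}"


class DocumentAnalyst:
    def __init__(self, ai_client, config=None):
        self.ai_client = ai_client
        self.config = config

    async def analyse(self, document: Document) -> str:
        prompt = f"Analyse the following document:\n{document.text}"
        return await self.ai_client.complete(prompt)


# The same class can be driven from a CLI script, a test, or the API layer.
result = asyncio.run(DocumentAnalyst(StubAIClient()).analyse(Document("quarterly report")))
```

Swapping the stub for the real client is the only change needed to move from a test harness to production.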
2. Centralised Configuration
All configuration lives in one place and is loaded from environment variables with sensible defaults:
# config.py
from pydantic_settings import BaseSettings


class Settings(BaseSettings):
    # API
    api_host: str = "0.0.0.0"
    api_port: int = 8000

    # AI
    anthropic_api_key: str
    default_model: str = "claude-sonnet-4-20250514"
    max_retries: int = 3

    # Database
    database_url: str

    # Features
    enable_caching: bool = True
    log_level: str = "INFO"

    class Config:
        env_file = ".env"


settings = Settings()
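The matching .env.example (already in the layout above) documents every variable the Settings class reads; the values below are placeholders:

```
# .env.example
# Required -- no defaults in config.py
ANTHROPIC_API_KEY=your-key-here
DATABASE_URL=postgresql://user:password@localhost:5432/my_project

# Optional -- override the defaults in config.py
DEFAULT_MODEL=claude-sonnet-4-20250514
MAX_RETRIES=3
ENABLE_CACHING=true
LOG_LEVEL=INFO
```

Fields without defaults, like anthropic_api_key, cause a validation error at startup if missing, which surfaces misconfiguration immediately rather than at the first API call.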
3. Prompt Version Management
Prompts are treated as versioned artefacts, not inline strings:
# prompts/versions.py
# prompts/versions.py
PROMPT_REGISTRY = {
    "document_analysis": {
        "v1": "Analyse this document...",
        "v2": "You are a document analyst. Analyse...",
        "v3": "You are a precise document analyst...",  # current
    }
}

CURRENT_VERSIONS = {
    "document_analysis": "v3"
}
This lets me roll back to previous prompt versions if a new one performs worse, without redeploying code.
4. AI Client Abstraction
I wrap the LLM client in a service layer that handles retries, logging, and cost tracking:
# services/ai_client.py
import asyncio
import time

import anthropic


class AIClientError(Exception):
    """Raised when the LLM call cannot be completed after retries."""


class AIClient:
    def __init__(self, settings: Settings):
        self.client = anthropic.AsyncAnthropic(api_key=settings.anthropic_api_key)
        self.default_model = settings.default_model
        self.max_retries = settings.max_retries

    async def complete(self, prompt: str, model: str | None = None, **kwargs) -> str:
        # max_tokens and other API parameters are passed through **kwargs
        model = model or self.default_model
        start = time.time()
        for attempt in range(self.max_retries):
            try:
                response = await self.client.messages.create(
                    model=model,
                    messages=[{"role": "user", "content": prompt}],
                    **kwargs,
                )
                duration = time.time() - start
                self._log_usage(model, response.usage, duration)
                return response.content[0].text
            except anthropic.RateLimitError:
                await asyncio.sleep(2 ** attempt)  # exponential backoff
        raise AIClientError("Max retries exceeded")
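The _log_usage hook is where cost tracking happens. A sketch of the shape it might take (the input_tokens and output_tokens fields match the Anthropic SDK's usage object; everything else is illustrative):

```python
import logging

logger = logging.getLogger("ai_client")


def log_usage(model: str, usage, duration: float) -> dict:
    """Build and log one structured record per completion call."""
    record = {
        "model": model,
        "input_tokens": usage.input_tokens,
        "output_tokens": usage.output_tokens,
        "duration_s": round(duration, 3),
    }
    logger.info("llm_call %s", record)
    return record
```

Emitting one structured record per call makes it straightforward to aggregate token spend per model, per agent, or per prompt version later.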
Testing AI Applications
Testing AI systems requires a different approach than testing deterministic code. I use three levels of testing:
Unit Tests (Mocked AI)
Test your logic with mocked AI responses. This verifies that your parsing, validation, and business logic work correctly without making API calls:
async def test_analyst_parses_valid_response(mock_ai_client):
    mock_ai_client.complete.return_value = json.dumps({
        "summary": "Test summary",
        "key_findings": ["Finding 1"],
        "risk_level": "low",
        "confidence": 0.9
    })
    analyst = DocumentAnalyst(mock_ai_client, test_config)
    result = await analyst.analyse(test_document)
    assert result.risk_level == "low"
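The mock_ai_client fixture would live in conftest.py; a sketch using unittest.mock.AsyncMock (the exact wiring is an assumption, but the fixture name matches the test above):

```python
# tests/conftest.py -- sketch of the mocked-client fixture
from unittest.mock import AsyncMock

import pytest


@pytest.fixture
def mock_ai_client():
    """An AIClient stand-in whose complete() can be awaited and configured."""
    client = AsyncMock()
    client.complete = AsyncMock(return_value="{}")
    return client
```

Each test then overrides complete.return_value with whatever response shape it needs, as the parsing test above does.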
Integration Tests (Real AI, Controlled Input)
Test with real API calls using known inputs. These cost money, so run them less frequently, but they catch issues that mocks miss.
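To keep these out of the default test run, a pytest skip guard keyed on an environment variable works well (the variable name and the test below are my convention, not a pytest built-in):

```python
import os

import pytest

# Skip unless the caller explicitly opts in to tests that spend API credits.
requires_real_api = pytest.mark.skipif(
    os.environ.get("RUN_INTEGRATION_TESTS") != "1",
    reason="set RUN_INTEGRATION_TESTS=1 to run tests that call the real API",
)


@requires_real_api
async def test_analyst_on_known_document(real_ai_client, test_config):
    ...
```

Running `RUN_INTEGRATION_TESTS=1 pytest tests/` then opts in deliberately, while CI stays fast and free by default.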
Evaluation Tests
A suite of test cases with expected outputs that measure quality over time. Run after prompt changes or model updates.
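A minimal shape for such a suite, assuming each case pairs an input with a checkable expectation (the scoring here is a deliberately simple pass rate; real suites often use graded rubrics):

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class EvalCase:
    name: str
    input_text: str
    check: Callable[[str], bool]  # returns True when the output is acceptable


def run_eval_suite(cases: list[EvalCase], analyse: Callable[[str], str]) -> float:
    """Run every case through the system and return the pass rate."""
    passed = sum(1 for case in cases if case.check(analyse(case.input_text)))
    return passed / len(cases)
```

Recording the pass rate alongside the prompt version gives a concrete number to compare before promoting a new version over the current one.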
Deployment
A minimal Dockerfile for a FastAPI AI application:
FROM python:3.12-slim
WORKDIR /app
COPY pyproject.toml .
COPY src/ src/
RUN pip install .
CMD ["uvicorn", "my_project.main:app", "--host", "0.0.0.0", "--port", "8000"]
The Payoff
This structure takes about 30 minutes to set up for a new project. That investment pays back immediately the first time you need to debug a production issue, change a prompt version, add a new agent, or hand the project to another developer. Structure is not overhead. It is the foundation that makes everything else faster.