Using OpenRouter as an LLM Gateway for Multi-Model Pipelines
The Multi-Model Problem
In production AI work, you rarely use just one model. Some tasks need GPT-4o's reasoning. Others benefit from Claude's long context window. Quick classification jobs might use Llama 3 to keep costs down. But managing separate API keys, different request formats, and varied rate limits for each provider is a headache.
That is where OpenRouter comes in. It provides a single API endpoint that routes to over 200 models from OpenAI, Anthropic, Google, Meta, and others. I have been using it as the backbone of my multi-agent pipelines for months, and it has simplified my architecture significantly.
What OpenRouter Actually Does
OpenRouter is an LLM gateway. You send requests to one endpoint using a consistent OpenAI-compatible format, and it routes them to whatever model you specify. The key benefits:
- Single API key for all providers
- OpenAI-compatible format so existing code works with minimal changes
- Automatic fallbacks when a provider has downtime
- Usage tracking across all models in one dashboard
- Cost optimization with model routing strategies
Basic Usage
The API is straightforward. If you already use the OpenAI Python client, you just change the base URL and model name:
```python
from openai import OpenAI

client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key="sk-or-...",
)

response = client.chat.completions.create(
    model="anthropic/claude-sonnet-4-20250514",
    messages=[{"role": "user", "content": "Hello"}],
)
```

Switching models is just a matter of changing the model string. No new SDK, no different authentication flow.
Building a Multi-Model Pipeline
In my content production pipeline, different stages use different models:
```python
MODEL_CONFIG = {
    "research": "anthropic/claude-sonnet-4-20250514",
    "outline": "openai/gpt-4o",
    "draft": "anthropic/claude-sonnet-4-20250514",
    "classify": "meta-llama/llama-3-70b-instruct",
    "summarize": "google/gemini-pro",
}
```

The research and drafting stages need strong reasoning, so I use Claude. Outlining benefits from GPT-4o's structured output. Classification runs on Llama 3 because it is fast and cheap for simple tasks. Summarization uses Gemini Pro for its speed.
A Routing Helper
I wrote a simple wrapper that selects the model based on the task:
```python
from openai import AsyncOpenAI

# The await below requires the async client variant.
client = AsyncOpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key="sk-or-...",
)

async def llm_call(task: str, messages: list, **kwargs):
    # Tasks without an explicit config entry fall back to a cheap default.
    model = MODEL_CONFIG.get(task, "openai/gpt-4o-mini")
    response = await client.chat.completions.create(
        model=model,
        messages=messages,
        **kwargs,
    )
    return response.choices[0].message.content
```

This keeps model selection centralized. If I want to swap Gemini for a new model, I change one line in the config dictionary.
Cost Management
One of the biggest advantages of OpenRouter is visibility into costs across models. The dashboard shows spending per model, per day, and per route. I discovered that my classification step was costing more than expected because I was sending unnecessarily long prompts to GPT-4. Switching to Llama 3 for that step cut costs by 90% with no quality loss.
You can also set spending limits and get alerts when you approach them. This is critical for production systems where a bug in your retry logic could burn through your budget.
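OpenRouter's limits live server-side, but a client-side guard is a cheap second line of defense against exactly that runaway-retry scenario. A minimal sketch (the per-token price and the cap are placeholders, not real rates):

```python
class BudgetGuard:
    """Accumulates estimated spend and refuses further calls past a cap."""

    def __init__(self, cap_usd: float):
        self.cap_usd = cap_usd
        self.spent_usd = 0.0

    def record(self, tokens: int, usd_per_1k_tokens: float) -> None:
        # Track estimated cost from token counts returned in responses.
        self.spent_usd += tokens / 1000 * usd_per_1k_tokens

    def check(self) -> None:
        # Call this before each API request; raises once the cap is hit.
        if self.spent_usd >= self.cap_usd:
            raise RuntimeError(
                f"Budget exhausted: ${self.spent_usd:.2f} of ${self.cap_usd:.2f}"
            )

guard = BudgetGuard(cap_usd=10.0)
guard.record(tokens=500_000, usd_per_1k_tokens=0.01)  # placeholder price
guard.check()  # still under budget
```

Calling `guard.check()` at the top of a retry loop turns a budget-burning bug into an immediate, visible failure.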
Fallback Strategies
OpenRouter supports automatic fallbacks. If your primary model is down or rate-limited, it can route to an alternative:
```python
response = client.chat.completions.create(
    model="anthropic/claude-sonnet-4-20250514",
    messages=messages,
    # The OpenAI client rejects unknown keyword arguments, so
    # OpenRouter's routing parameters go through extra_body.
    extra_body={
        "route": "fallback",
        "models": [
            "anthropic/claude-sonnet-4-20250514",
            "openai/gpt-4o",
            "google/gemini-pro",
        ],
    },
)
```

This has saved me from pipeline failures during provider outages. The fallback is transparent, and you can check which model actually served the request in the response's `model` field.
Practical Tips
- Cache responses: Use a Redis or file-based cache keyed on the prompt hash. Many LLM calls in pipelines are repeated across runs.
- Set timeouts: Some models are slower than others. Set per-model timeouts to avoid blocking your pipeline.
- Log everything: Store the model used, token counts, and latency for every call. This data is invaluable for optimization.
- Use streaming for long outputs: Streaming responses let you show progress and reduce perceived latency.
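The first tip can be sketched in a few lines with a file-based cache. To keep the example offline, `cached` takes the real API call as a zero-argument callable; the cache directory name is an arbitrary choice:

```python
import hashlib
import json
from pathlib import Path

CACHE_DIR = Path(".llm_cache")  # any writable directory works
CACHE_DIR.mkdir(exist_ok=True)

def cache_key(model: str, messages: list) -> str:
    # Hash the model plus the full message list so any change busts the cache.
    blob = json.dumps({"model": model, "messages": messages}, sort_keys=True)
    return hashlib.sha256(blob.encode()).hexdigest()

def cached(model: str, messages: list, call):
    """Return a cached completion if present; otherwise call and store."""
    path = CACHE_DIR / f"{cache_key(model, messages)}.json"
    if path.exists():
        return json.loads(path.read_text())
    result = call()
    path.write_text(json.dumps(result))
    return result
```

In the pipeline this wraps the real request, e.g. `cached(model, msgs, lambda: make_request(model, msgs))` where `make_request` is whatever function performs the actual API call. For concurrent pipelines, swap the file store for Redis with the same key scheme.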
When Not to Use OpenRouter
If you exclusively use one provider, the direct API will have lower latency. If you need enterprise SLAs, you may want direct contracts with providers. And if you are processing sensitive data, review OpenRouter's data handling policies carefully.
For everyone else building multi-model AI systems, OpenRouter is an excellent tool that removes significant infrastructure complexity.