Setting Up Telegram Alerts for AI Pipeline Monitoring
Why Telegram for Pipeline Monitoring
When you are running AI pipelines in production, things break. Models hit rate limits, APIs go down, data quality drifts, and servers run out of memory. You need to know about these problems immediately, not when a customer complains or when you happen to check your logs.
I use Telegram as my primary alerting channel for all my production AI systems. It hits the perfect balance of immediacy, simplicity, and reliability. My phone buzzes within seconds of any pipeline failure, and I can see exactly what went wrong without opening a laptop.
Creating Your Telegram Bot
Setting up a Telegram bot takes about five minutes. Here is the process:
- Open Telegram and search for @BotFather
- Send /newbot and follow the prompts to name your bot
- Save the API token you receive
- Create a private channel or group for your alerts
- Add the bot to your channel and get the chat ID
Getting Your Chat ID
The easiest way to get your chat ID is to send a message to your bot, then hit the Telegram API:
```bash
curl https://api.telegram.org/bot<YOUR_TOKEN>/getUpdates
```
Look for the chat.id field in the response. For channels, the ID will be a negative number.
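If you prefer to script this step, the same getUpdates call works from Python. A minimal sketch using only the standard library (the function names are illustrative, and the token is a placeholder):

```python
import json
from urllib.request import urlopen


def parse_chat_ids(payload: dict) -> list[int]:
    """Extract every chat.id from a getUpdates response payload."""
    return [
        update["message"]["chat"]["id"]
        for update in payload.get("result", [])
        if "message" in update
    ]


def get_chat_ids(token: str) -> list[int]:
    """Call getUpdates and return the chat IDs seen in recent messages."""
    with urlopen(f"https://api.telegram.org/bot{token}/getUpdates") as resp:
        return parse_chat_ids(json.load(resp))
```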
Building the Alert Module
I keep a reusable alert module that all my pipelines import. Here is the core of it:
```python
import httpx
from datetime import datetime


class TelegramAlerts:
    def __init__(self, token: str, chat_id: str):
        self.token = token
        self.chat_id = chat_id
        self.base_url = f"https://api.telegram.org/bot{token}"

    async def send(self, message: str, level: str = "info"):
        # Map severity levels to message prefixes
        icons = {
            "info": "[INFO]",
            "warning": "[WARN]",
            "error": "[ERROR]",
            "critical": "[CRITICAL]",
        }
        prefix = icons.get(level, "[INFO]")
        timestamp = datetime.now().strftime("%H:%M:%S")
        text = f"{prefix} {timestamp}\n{message}"

        async with httpx.AsyncClient() as client:
            await client.post(
                f"{self.base_url}/sendMessage",
                json={
                    "chat_id": self.chat_id,
                    "text": text,
                    "parse_mode": "HTML",
                },
            )
```
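One detail worth guarding against: Telegram rejects messages longer than 4096 characters, so long tracebacks need truncating before sending. A small helper (the name truncate_for_telegram is illustrative):

```python
TELEGRAM_MAX_LEN = 4096  # Telegram's hard per-message limit for sendMessage


def truncate_for_telegram(text: str, limit: int = TELEGRAM_MAX_LEN) -> str:
    """Trim a message to fit Telegram's length cap, marking the cut."""
    if len(text) <= limit:
        return text
    marker = "\n[truncated]"
    return text[: limit - len(marker)] + marker
```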
What I Monitor
After running nine production AI projects, I have settled on a core set of alerts that catch 95% of problems:
- Pipeline start and completion: Know when jobs kick off and finish, with duration
- API rate limits: Get warned before you hit hard limits on Claude, OpenAI, or Gemini
- Error counts: If errors exceed a threshold within a time window, alert immediately
- Cost tracking: Daily spend summaries and alerts when usage spikes unexpectedly
- Data quality: When scoring pipelines detect quality drops, flag for review
- Server health: CPU, memory, and disk usage on my VPS
Rate Limiting Your Alerts
One mistake I made early on was flooding my phone with alerts during cascading failures. When an API goes down, every request fails, and you do not need 500 individual failure messages. I solved this with a simple debounce pattern:
```python
from collections import defaultdict
from time import time


class AlertThrottler:
    def __init__(self, cooldown_seconds: int = 300):
        self.cooldown = cooldown_seconds
        self.last_sent = defaultdict(float)

    def should_send(self, alert_key: str) -> bool:
        now = time()
        if now - self.last_sent[alert_key] > self.cooldown:
            self.last_sent[alert_key] = now
            return True
        return False
```
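What you key on matters as much as the cooldown: key per failure type rather than per message, so repeated instances of the same problem are suppressed while a different problem still gets through. A tiny helper for building such keys (the naming convention is my assumption, not anything Telegram requires):

```python
def make_alert_key(pipeline: str, stage: str, error_type: str) -> str:
    """Build a throttle key per failure type, so distinct problems still alert."""
    return f"{pipeline}:{stage}:{error_type}"
```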
Structured Alert Messages
Good alert messages tell you exactly what happened, where it happened, and what to do about it. I use a consistent format across all my pipelines:
```text
[ERROR] 14:23:07
Pipeline: content-scorer
Stage: gemini-analysis
Error: Rate limit exceeded (429)
Requests today: 1,847 / 2,000
Action: Auto-retry in 60s
Dashboard: https://example.com/logs
```
This format lets me triage problems at a glance without needing to SSH into the server.
Integration with FastAPI Services
Most of my AI applications run as FastAPI services. I add alert hooks to the exception handlers so that unhandled errors automatically trigger Telegram messages:
```python
from fastapi.responses import JSONResponse


@app.exception_handler(Exception)
async def global_exception_handler(request, exc):
    # Forward any unhandled exception to Telegram before returning a 500
    await alerts.send(
        f"Unhandled exception in {request.url.path}\n"
        f"Type: {type(exc).__name__}\n"
        f"Detail: {str(exc)[:200]}",
        level="error",
    )
    return JSONResponse(status_code=500, content={"detail": "Internal error"})
```
Daily Digest Reports
Beyond real-time alerts, I send myself a daily digest at 8am summarizing all pipeline activity from the previous 24 hours. This includes total requests processed, error rates, costs incurred, and any notable events. It takes about 30 seconds to read and gives me confidence that everything is running smoothly.
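The digest itself can be assembled from whatever counters your pipelines already track. A minimal sketch (the stats layout is a hypothetical example, not a fixed schema):

```python
def build_digest(stats: dict) -> str:
    """Summarize the last 24 hours into a short, scannable message."""
    error_rate = stats["errors"] / max(stats["requests"], 1) * 100
    return (
        f"Daily digest\n"
        f"Requests: {stats['requests']:,}\n"
        f"Errors: {stats['errors']} ({error_rate:.2f}%)\n"
        f"Cost: ${stats['cost_usd']:.2f}\n"
        f"Notable: {stats.get('notable', 'none')}"
    )
```

Schedule it with cron or your task runner of choice and send the result through the same alert module at the info level.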
The best monitoring system is the one you actually check. Telegram wins because it lives on the device I already look at dozens of times a day.
Getting Started
Start with basic error alerts on your most critical pipeline. Once you see how much faster you catch and resolve issues, you will want to add alerts to everything. The whole setup takes under an hour and costs nothing. Telegram bots are free, and the API is generous with rate limits for alerting use cases.