
Whisper for Automated Transcription: Setup and Optimisation

Tags: Whisper, transcription, speech-to-text, Python, automation, optimisation

Why Local Whisper?

OpenAI's Whisper is an open-source speech recognition model that runs locally on your own hardware. I use it in my YouTube auto-upload pipeline for transcription, and it has been running reliably for months. The main reason I chose local Whisper over the API version is cost: the API charges $0.006 per minute, but local Whisper costs nothing per transcription once you have the model downloaded.
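The break-even arithmetic is easy to sanity-check. A back-of-envelope comparison, using the published $0.006/minute API price and a hypothetical workload:

```python
# Hypothetical workload: hours of audio transcribed per month
API_PRICE_PER_MINUTE = 0.006  # USD, Whisper API pricing
hours_per_month = 50

# API cost scales linearly with audio duration
api_cost = hours_per_month * 60 * API_PRICE_PER_MINUTE
print(f"API cost for {hours_per_month}h/month: ${api_cost:.2f}")  # → $18.00

# Local Whisper: $0 per transcription after the one-off model download
```

At moderate volumes the API cost is small in absolute terms; the case for local Whisper strengthens as volume grows, or if you already pay for the server anyway.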

This guide covers everything I have learned about setting up, optimising, and running Whisper in production.

Installation

The basic installation is straightforward:

pip install openai-whisper

# You also need ffmpeg
# Ubuntu/Debian
sudo apt install ffmpeg

# macOS
brew install ffmpeg

Whisper will download the model weights automatically on first use. The models range from 39MB (tiny) to 2.87GB (large-v3).
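By default the weights are cached under ~/.cache/whisper (or $XDG_CACHE_HOME/whisper), and load_model accepts a download_root argument if you want them elsewhere. A quick stdlib check of which models are already on disk:

```python
import os
from pathlib import Path

# Default cache directory used by openai-whisper for model weights
cache = Path(os.getenv("XDG_CACHE_HOME", Path.home() / ".cache")) / "whisper"

# Each downloaded model is stored as a single .pt file named after the model
downloaded = sorted(p.stem for p in cache.glob("*.pt")) if cache.exists() else []
print(f"Cached models in {cache}: {downloaded or 'none yet'}")
```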

Model Selection

Choosing the right model is the most important decision. Here is what I have found through testing:

Available Models

  • tiny (39MB): Fast but inaccurate. Word error rate around 15% on clean English audio. Useful for rough transcripts where speed matters more than accuracy.
  • base (74MB): Slightly better accuracy, still very fast. Good for development and testing.
  • small (244MB): The sweet spot for most production use cases. Good accuracy (around 5-7% word error rate on clean audio) with reasonable speed. This is what I use.
  • medium (769MB): Better accuracy, especially for accented speech and noisy audio. 2-3x slower than small.
  • large-v3 (2.87GB): Best accuracy, but requires significant compute. 5-10x slower than small on CPU.
Basic usage:

import whisper

# Load the model (downloads on first use)
model = whisper.load_model("small")

# Basic transcription
result = model.transcribe("audio.mp3")
print(result["text"])

My Recommendation

For most AI engineering use cases where the transcript feeds into another AI model (like generating YouTube metadata), the small model is ideal. The downstream AI model is tolerant of minor transcription errors, so paying the speed and memory cost of medium or large is usually not worth it.

Performance Optimisation

CPU vs GPU

On a standard VPS without a GPU, the small model transcribes at roughly 3x real-time on a modern CPU. That means a 10-minute video takes about 3 minutes to transcribe. If you have a GPU available, the same transcription takes seconds.

# Force CPU (default on most servers)
model = whisper.load_model("small", device="cpu")

# Use GPU if available
import torch
device = "cuda" if torch.cuda.is_available() else "cpu"
model = whisper.load_model("small", device=device)

# On CPU, pass fp16=False to avoid the "FP16 is not supported" warning
result = model.transcribe("audio.mp3", fp16=False)
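Those numbers make capacity planning simple. A tiny helper, where the 3x factor is my rough CPU figure and should be adjusted for your own hardware:

```python
def estimate_transcription_minutes(audio_minutes: float, speed_factor: float = 3.0) -> float:
    """Rough wall-clock estimate: small model at ~3x real-time on CPU."""
    return audio_minutes / speed_factor

# A 10-minute video at 3x real-time
print(round(estimate_transcription_minutes(10), 1))  # → 3.3
```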

Audio Pre-processing

Whisper works with audio, not video. Extracting audio before transcription avoids loading the entire video file into memory:

import subprocess

def extract_audio(video_path: str, audio_path: str):
    """Extract audio from video using ffmpeg."""
    subprocess.run([
        "ffmpeg", "-i", video_path,
        "-vn",  # no video
        "-acodec", "pcm_s16le",  # WAV format
        "-ar", "16000",  # 16kHz sample rate (Whisper's native rate)
        "-ac", "1",  # mono
        audio_path
    ], check=True, capture_output=True)

Converting to 16kHz mono WAV before transcription avoids Whisper doing the conversion internally, which can save memory and slightly improve speed.

Memory Management

On servers with limited RAM, memory management matters. The small model uses about 1GB of RAM during transcription. If you are processing videos sequentially, load the model once and reuse it:

class TranscriptionService:
    def __init__(self, model_name: str = "small"):
        self._model = None
        self._model_name = model_name
    
    @property
    def model(self):
        if self._model is None:
            self._model = whisper.load_model(self._model_name)
        return self._model
    
    def transcribe(self, audio_path: str) -> dict:
        return self.model.transcribe(audio_path)
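The lazy-loading property is easy to verify without touching the real model. Here a stub stands in for whisper.load_model; the stub and its counter are illustrative only, not part of the pipeline:

```python
class FakeWhisper:
    """Stand-in for whisper: counts how often the model is 'loaded'."""
    loads = 0

    @staticmethod
    def load_model(name):
        FakeWhisper.loads += 1
        return object()

class LazyService:
    def __init__(self, model_name: str = "small", loader=FakeWhisper.load_model):
        self._model = None
        self._model_name = model_name
        self._loader = loader

    @property
    def model(self):
        if self._model is None:  # load once, on first access
            self._model = self._loader(self._model_name)
        return self._model

service = LazyService()
service.model
service.model
print(FakeWhisper.loads)  # → 1
```

The same pattern also means a script that fails before reaching transcription never pays the model-load cost at all.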

Getting Timestamps for Chapters

Whisper provides word-level and segment-level timestamps. I use these to generate YouTube chapters automatically:

result = model.transcribe("audio.mp3", word_timestamps=True)

for segment in result["segments"]:
    start = segment["start"]
    text = segment["text"].strip()
    minutes = int(start // 60)
    seconds = int(start % 60)
    print(f"{minutes:02d}:{seconds:02d} - {text}")

To generate meaningful chapters rather than raw segment timestamps, I send the timestamped transcript to an LLM that identifies natural topic boundaries and creates chapter titles.
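One detail worth handling: MM:SS breaks down past the hour mark, and YouTube chapter lists expect H:MM:SS for long videos. A small pure helper that formats segments either way and assembles the timestamped transcript for the LLM (the segment dicts mirror Whisper's output shape):

```python
def format_timestamp(seconds: float) -> str:
    """MM:SS for short videos, H:MM:SS once past the hour."""
    total = int(seconds)
    h, rem = divmod(total, 3600)
    m, s = divmod(rem, 60)
    return f"{h}:{m:02d}:{s:02d}" if h else f"{m:02d}:{s:02d}"

def transcript_for_llm(segments: list) -> str:
    """One timestamped line per segment, ready for chapter generation."""
    return "\n".join(
        f"{format_timestamp(seg['start'])} {seg['text'].strip()}"
        for seg in segments
    )

segments = [{"start": 0.0, "text": " Intro"}, {"start": 3725.8, "text": " Q&A"}]
print(transcript_for_llm(segments))
# 00:00 Intro
# 1:02:05 Q&A
```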

Handling Different Audio Quality

Real-world audio varies dramatically in quality. Some tips from experience:

  • Background music: Whisper handles light background music well but struggles with loud music. If possible, process the audio through a vocal isolation tool first.
  • Multiple speakers: Whisper does not natively identify speakers. For multi-speaker content, consider using pyannote-audio for speaker diarisation alongside Whisper for transcription.
  • Accents: The small model handles most English accents well. For heavy accents or non-English languages, the medium or large model is noticeably better.
  • Noise: For noisy audio, the condition_on_previous_text option (True by default) helps maintain context, but can also propagate errors. Test both settings.

Production Deployment Pattern

In my YouTube pipeline, Whisper runs as part of a processing script triggered by cron. Here is the simplified production pattern:

import logging
from pathlib import Path

logger = logging.getLogger(__name__)

class VideoTranscriber:
    def __init__(self):
        self.service = TranscriptionService("small")
        self.temp_dir = Path("/tmp/transcription")
        self.temp_dir.mkdir(exist_ok=True)
    
    def process(self, video_path: str) -> dict:
        audio_path = self.temp_dir / "temp_audio.wav"
        
        try:
            logger.info(f"Extracting audio from {video_path}")
            extract_audio(video_path, str(audio_path))
            
            logger.info("Starting transcription")
            result = self.service.transcribe(str(audio_path))
            
            logger.info(f"Transcription complete: {len(result['text'])} chars")
            return {
                "text": result["text"],
                "segments": result["segments"],
                "language": result["language"]
            }
        finally:
            if audio_path.exists():
                audio_path.unlink()

Alternatives Worth Knowing

Whisper is not the only option. faster-whisper (built on CTranslate2) provides the same accuracy with a 2-4x speed improvement on CPU. If speed is a bottleneck, it is worth investigating. The interface is similar, so switching is straightforward.
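A sketch of the faster-whisper equivalent, assuming pip install faster-whisper. Note the two main differences from openai-whisper: transcribe() returns a lazy generator of segments plus an info object rather than a dict, and compute_type controls quantisation:

```python
from faster_whisper import WhisperModel

# int8 quantisation is a common choice for CPU-only servers
model = WhisperModel("small", device="cpu", compute_type="int8")

segments, info = model.transcribe("audio.mp3")
print(f"Detected language: {info.language}")

# Unlike openai-whisper, segments is a generator:
# transcription actually runs as you iterate over it
text = "".join(segment.text for segment in segments)
```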

The Bottom Line

Whisper is a remarkably capable tool that runs reliably in production with minimal fuss. The small model is the right choice for most automation workflows, offering a good balance of accuracy and speed without requiring GPU hardware. If you are building any pipeline that needs speech-to-text, local Whisper is hard to beat on cost and simplicity.