4 min read

Building a YouTube Auto-Upload Pipeline with Whisper and Gemini

YouTube API · Whisper · Gemini · automation · Python · content pipeline

The Problem: Manual YouTube Uploads Are Tedious

I create video content as part of several projects, and the upload process was eating my time. For each video I had to: write a title, craft a description, pick tags, generate a thumbnail concept, add chapters, and then manually upload through the YouTube Studio interface. For a single video, that is 20 to 30 minutes of busywork.

I decided to automate the entire pipeline. The goal was simple: drop a video file into a folder and have it appear on YouTube, fully optimised, without any manual intervention.

Pipeline Architecture

The system runs as a Python script triggered by a cron job every 30 minutes. Here is the flow:

  1. Watch a local directory for new video files
  2. Extract audio and transcribe using Whisper
  3. Send the transcript to Gemini for metadata generation
  4. Upload to YouTube via the Data API v3
  5. Archive the original file and log the result
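Step 1 of the flow can be sketched as a small directory-scanning helper. The function name and extension list here are illustrative, not lifted from the actual codebase:

```python
from pathlib import Path

# Extensions the pipeline treats as uploadable video files (illustrative).
VIDEO_EXTS = {".mp4", ".mov", ".mkv"}

def find_new_videos(watch_dir: str) -> list[Path]:
    """Return video files in the watch directory, oldest first."""
    files = [p for p in Path(watch_dir).iterdir()
             if p.suffix.lower() in VIDEO_EXTS]
    # Process oldest files first so a backlog drains in arrival order.
    return sorted(files, key=lambda p: p.stat().st_mtime)
```

Sorting by modification time means that if several videos queue up between cron runs, they are uploaded in the order they arrived.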

Transcription with Whisper

I use OpenAI's Whisper model locally. Running the small model gives a good balance between accuracy and speed on my VPS:

import whisper

# Load the model once at import time; reloading it on every call is slow.
_MODEL = whisper.load_model("small")

def transcribe_video(video_path: str) -> str:
    # Whisper extracts the audio track itself via ffmpeg, so the video
    # file can be passed in directly.
    result = _MODEL.transcribe(video_path)
    return result["text"]

For most content, the small model produces transcripts that are accurate enough for metadata generation. Occasional errors in the transcript do not matter much because the transcript is used as input for the AI, not displayed directly.

Metadata Generation with Gemini

Once I have the transcript, I send it to Gemini with a carefully crafted prompt that generates all the YouTube metadata in one call:

import json
import os

import google.generativeai as genai

# The API key is read from the environment rather than hard-coded.
genai.configure(api_key=os.environ["GEMINI_API_KEY"])

def generate_metadata(transcript: str) -> dict:
    model = genai.GenerativeModel("gemini-2.0-flash")
    prompt = f"""Based on this video transcript, generate YouTube metadata as JSON:
    - title: compelling, under 70 chars, keyword-rich
    - description: 200+ words with timestamps and key points
    - tags: 10-15 relevant tags
    - category: best YouTube category ID

    Transcript: {transcript}"""

    # Requesting a JSON MIME type stops Gemini wrapping the output in
    # markdown fences, so the response parses directly.
    response = model.generate_content(
        prompt,
        generation_config={"response_mime_type": "application/json"},
    )
    return json.loads(response.text)

I chose Gemini Flash for this task because the cost is negligible and it handles structured output well. The quality of the generated titles and descriptions has been consistently good.

YouTube API Integration

The upload itself uses the YouTube Data API v3 with OAuth2 credentials. The trickiest part was handling the resumable upload protocol, which is necessary for larger video files:

from googleapiclient.discovery import build
from googleapiclient.http import MediaFileUpload

def upload_video(filepath, metadata, credentials):
    youtube = build("youtube", "v3", credentials=credentials)
    body = {
        "snippet": {
            "title": metadata["title"],
            "description": metadata["description"],
            "tags": metadata["tags"],
            "categoryId": metadata["category"]
        },
        "status": {"privacyStatus": "public"}
    }
    # resumable=True enables the resumable upload protocol; chunksize=-1
    # streams the file in one request but still lets the upload resume
    # if the connection drops mid-transfer.
    media = MediaFileUpload(filepath, chunksize=-1, resumable=True)
    request = youtube.videos().insert(
        part="snippet,status", body=body, media_body=media
    )
    # next_chunk() returns (status, None) until the upload completes,
    # then (status, response) with the final video resource.
    response = None
    while response is None:
        _, response = request.next_chunk()
    return response["id"]

Error Handling and Reliability

A pipeline that runs unattended needs to be bulletproof. I added several layers of protection:

  • File locking: Prevents the cron job from processing the same file twice if it runs while a previous upload is still in progress
  • Retry logic: YouTube API calls retry up to 3 times with exponential backoff
  • Dead letter queue: Failed uploads move to an error directory with a JSON file describing what went wrong
  • Health monitoring: A simple endpoint that reports pipeline status, checked by an external uptime monitor
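The retry layer can be sketched as a small wrapper; the function name and defaults here are illustrative rather than the production code:

```python
import random
import time

def with_retries(fn, attempts=3, base_delay=2.0):
    """Call fn, retrying with exponential backoff plus jitter on failure."""
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            if attempt == attempts - 1:
                # Retries exhausted: re-raise so the caller can move the
                # file to the dead letter directory.
                raise
            # 2s, 4s, 8s, ... plus jitter so parallel retries spread out.
            time.sleep(base_delay * (2 ** attempt)
                       + random.uniform(0, base_delay))
```

Wrapping the upload call as `with_retries(lambda: upload_video(path, meta, creds))` keeps the retry policy in one place instead of scattered through the pipeline.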

Cost Breakdown

This is the part that surprised me most. The running costs are almost nothing:

  • Whisper: Runs locally, so zero API cost
  • Gemini Flash: Roughly 0.001p per video for metadata generation
  • YouTube API: free within the daily quota, which comfortably covers my upload volume
  • VPS overhead: The pipeline shares a server I already pay for

Total additional cost per month: under 10p. That is not a typo.

Thumbnail Generation

Beyond the core metadata, I also experimented with automated thumbnail concepts. The pipeline generates a text description of an ideal thumbnail using Gemini, based on the video content. While I do not auto-generate the thumbnail image itself yet, having a detailed description ready means I can create one quickly when needed. The descriptions include suggested text overlay, background style, and colour scheme. This alone saves about 10 minutes per video compared to starting from scratch.
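A thumbnail-concept request boils down to a prompt template; this template and helper are a hypothetical sketch of the approach, grounded in the elements listed above:

```python
# Illustrative prompt template; the real wording lives in the pipeline.
THUMBNAIL_PROMPT = """Describe an ideal YouTube thumbnail for this video.
Include: suggested text overlay (max 5 words), background style,
and a colour scheme. Video summary:
{summary}"""

def build_thumbnail_prompt(summary: str) -> str:
    """Fill the template; the resulting string is sent to Gemini."""
    return THUMBNAIL_PROMPT.format(summary=summary)
```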

Monitoring and Logging

Since the pipeline runs unattended, logging is critical. Every step of the process is logged with timestamps, file sizes, and success or failure status. I wrote a simple daily summary script that emails me a digest of what was processed, any errors that occurred, and the total API cost for the day. This gives me confidence that the pipeline is running smoothly without needing to check manually.
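One JSON line per pipeline step makes the daily digest trivial to assemble. A minimal sketch, assuming a `log_event` helper that is not in the original code:

```python
import json
import logging
from datetime import datetime, timezone

logger = logging.getLogger("pipeline")

def log_event(step: str, filename: str, ok: bool, **extra) -> str:
    """Emit one JSON log line per pipeline step; returns the line too."""
    record = {
        "ts": datetime.now(timezone.utc).isoformat(),
        "step": step,
        "file": filename,
        "ok": ok,
        **extra,  # e.g. file size, video ID, error message
    }
    line = json.dumps(record)
    logger.info(line)
    return line
```

Because every line is valid JSON, the daily summary script can parse the log with a few lines of Python instead of regex-matching free-form text.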

Optimisation Lessons

A few things I learned through building this:

  • Whisper's timestamps are good enough for generating YouTube chapters automatically. I extract them from the transcript segments and include them in the description.
  • Gemini sometimes generates titles that are too clickbaity. Adding "professional and informative tone" to the prompt fixed this.
  • The YouTube API has a daily quota of 10,000 units. Each upload costs 1,600 units, so you get about 6 uploads per day. For most use cases that is plenty, but worth knowing upfront.
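The chapter extraction from Whisper segments can be sketched as follows; the spacing heuristic and the use of segment text as chapter titles are simplifying assumptions, not the exact method:

```python
def format_timestamp(seconds: float) -> str:
    """Render seconds as a YouTube-style M:SS or H:MM:SS timestamp."""
    s = int(seconds)
    h, rem = divmod(s, 3600)
    m, sec = divmod(rem, 60)
    return f"{h}:{m:02d}:{sec:02d}" if h else f"{m}:{sec:02d}"

def build_chapters(segments, min_gap=60.0):
    """Turn Whisper segments into chapter lines for the description.

    `segments` is the `result["segments"]` list Whisper returns, each
    with "start" and "text" keys. YouTube requires the first chapter
    to begin at 0:00, which the first segment provides.
    """
    chapters = []
    next_at = 0.0
    for seg in segments:
        # Start a new chapter only after min_gap seconds have passed,
        # so chapters are not created for every short segment.
        if seg["start"] >= next_at:
            chapters.append(
                f"{format_timestamp(seg['start'])} {seg['text'].strip()}"
            )
            next_at = seg["start"] + min_gap
    return chapters
```

In practice the raw segment text makes for rough chapter titles, so a natural refinement is passing the candidate chapters through the same Gemini call that generates the rest of the metadata.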

What This Taught Me

This pipeline is one of my favourite projects because it demonstrates something I believe strongly: AI engineering is not always about building complex systems. Sometimes the highest value comes from connecting simple, reliable components into a pipeline that removes tedious manual work entirely. The total codebase is under 500 lines of Python, but it saves me hours every week.