| 3 min read

Embedding Models Explained: text-embedding-3-small in Practice

embeddings OpenAI text-embedding-3-small vector search RAG NLP

What Are Embedding Models and Why Should You Care?

Embedding models convert text into dense numerical vectors that capture semantic meaning. Two sentences that mean similar things will produce vectors that are close together in high-dimensional space, even if they share no common words. This is the foundation of modern search, recommendation systems, and retrieval-augmented generation.

OpenAI's text-embedding-3-small is my go-to embedding model for production applications. It strikes the ideal balance between quality, speed, and cost. Here is how I use it in practice.

Why text-embedding-3-small Over Alternatives

The embedding model landscape in 2026 is crowded, but text-embedding-3-small remains compelling for several reasons:

  • Cost: At $0.02 per million tokens, it is roughly a sixth the price of text-embedding-3-large ($0.13 per million)
  • Speed: Batch embedding of 1000 documents takes under 10 seconds
  • Dimension flexibility: You can reduce dimensions from 1536 down to 256 with minimal quality loss
  • Quality: It outperforms many open-source alternatives on retrieval benchmarks

Getting Started

from openai import OpenAI
import numpy as np

client = OpenAI()

def get_embedding(text: str, dimensions: int = 1536) -> list[float]:
    """Embed a single piece of text, optionally at reduced dimensions."""
    response = client.embeddings.create(
        model="text-embedding-3-small",
        input=text,
        dimensions=dimensions
    )
    return response.data[0].embedding

# Example
embedding = get_embedding("How to deploy a FastAPI application")
print(f"Vector dimensions: {len(embedding)}")  # 1536

Dimension Reduction: The Secret Weapon

One of the most useful features of the v3 embedding models is native dimension reduction. You can request vectors with fewer dimensions, and the model uses Matryoshka Representation Learning to preserve as much information as possible in fewer dimensions.

# Full dimensions (1536)
full = get_embedding("Python web development", dimensions=1536)

# Reduced dimensions (512) - 66% less storage
reduced = get_embedding("Python web development", dimensions=512)

# Minimal dimensions (256) - 83% less storage  
minimal = get_embedding("Python web development", dimensions=256)

In my testing across RAG applications, 512 dimensions retain about 95% of the retrieval quality compared to the full 1536. For most use cases, that is a massive storage and compute savings with negligible quality impact.
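If you have already stored full 1536-dimension vectors, you can get much the same effect client-side: because of the Matryoshka training objective, you can truncate a vector to its first N components and re-normalize it (this is essentially what the API's `dimensions` parameter does server-side). A minimal sketch, using a synthetic unit vector as a stand-in for a real embedding:

```python
import numpy as np

def truncate_embedding(vec: list[float], dims: int) -> np.ndarray:
    """Keep the first `dims` components and re-normalize to unit length."""
    v = np.array(vec[:dims])
    return v / np.linalg.norm(v)

# Stand-in for a real 1536-dim embedding (assumption: API output is unit-normalized)
rng = np.random.default_rng(0)
full = rng.normal(size=1536)
full /= np.linalg.norm(full)

reduced = truncate_embedding(full.tolist(), 512)
print(reduced.shape)  # (512,)
print(round(float(np.linalg.norm(reduced)), 6))  # 1.0
```

The re-normalization step matters: truncated vectors are no longer unit length, and skipping it will skew cosine similarity scores.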

Batch Embedding for Production

When embedding large document collections, you need to batch your requests efficiently:

def batch_embed(texts: list[str], batch_size: int = 100) -> list[list[float]]:
    """Embed texts in batches to stay within per-request input limits."""
    all_embeddings = []
    
    for i in range(0, len(texts), batch_size):
        batch = texts[i:i + batch_size]
        response = client.embeddings.create(
            model="text-embedding-3-small",
            input=batch,
            dimensions=512
        )
        embeddings = [item.embedding for item in response.data]
        all_embeddings.extend(embeddings)
    
    return all_embeddings

# Embed 10,000 documents (load_documents() is your own loader,
# returning objects with a .text attribute)
docs = load_documents()
embeddings = batch_embed([doc.text for doc in docs])
print(f"Embedded {len(embeddings)} documents")
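In production, the embeddings endpoint will occasionally return transient rate-limit or network errors, so I wrap each batch call in a retry with exponential backoff. A generic sketch (the `flaky` function below is a stand-in for the `client.embeddings.create` call, so the example runs without an API key):

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")

def with_retries(fn: Callable[[], T], max_attempts: int = 5, base_delay: float = 1.0) -> T:
    """Call fn, retrying with exponential backoff; re-raise after max_attempts failures."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise
            time.sleep(base_delay * 2 ** attempt)
    raise RuntimeError("unreachable")

# Demo: a function that fails twice before succeeding
calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("transient")
    return "ok"

print(with_retries(flaky, base_delay=0.01))  # ok
```

In real code you would catch only the retryable exception types (rate limits, timeouts) rather than bare `Exception`.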

Similarity Search in Practice

Once you have embeddings, similarity search is straightforward. Cosine similarity is the standard metric:

def cosine_similarity(a: list[float], b: list[float]) -> float:
    a, b = np.array(a), np.array(b)
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Search
query_embedding = get_embedding("How do I set up CI/CD?")

scores = [
    (doc, cosine_similarity(query_embedding, doc_emb))
    for doc, doc_emb in zip(docs, embeddings)
]

top_results = sorted(scores, key=lambda x: x[1], reverse=True)[:5]
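The list comprehension above is fine for small corpora, but once you have tens of thousands of documents it pays to stack the embeddings into a matrix and compute all scores with a single matrix-vector product. A sketch with a toy 3-dimensional corpus standing in for real embeddings:

```python
import numpy as np

def top_k_search(query: list[float], doc_matrix: np.ndarray, k: int = 5) -> list[tuple[int, float]]:
    """Return (row index, cosine score) pairs for the k most similar rows."""
    q = np.array(query)
    q = q / np.linalg.norm(q)
    docs = doc_matrix / np.linalg.norm(doc_matrix, axis=1, keepdims=True)
    scores = docs @ q                     # one matmul instead of a Python loop
    idx = np.argsort(scores)[::-1][:k]
    return [(int(i), float(scores[i])) for i in idx]

# Toy corpus: four 3-dimensional "embeddings"
docs_matrix = np.array([
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 0.0, 1.0],
])
print(top_k_search([1.0, 0.0, 0.0], docs_matrix, k=2))  # rows 0 and 2 score highest
```

Pre-normalizing and storing the document matrix once means each subsequent query costs only a dot product per document.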

Chunking Strategy Matters More Than Model Choice

Here is something I learned the hard way: your chunking strategy has a bigger impact on retrieval quality than the embedding model you choose. A mediocre model with smart chunking will outperform a state-of-the-art model with naive chunking.

My recommended approach:

  • Chunk at semantic boundaries (paragraphs, sections) rather than fixed token counts
  • Include overlap of 50 to 100 tokens between chunks to preserve context
  • Keep chunks between 200 and 500 tokens for optimal embedding quality
  • Prepend document metadata (title, section header) to each chunk
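The bullets above can be sketched as a simple greedy chunker. This is a minimal illustration, not production code: it splits on blank lines as the semantic boundary, and it uses word counts as a rough stand-in for tokens (a real implementation would count with tiktoken and prepend title metadata):

```python
def chunk_paragraphs(text: str, max_words: int = 300, overlap_words: int = 50) -> list[str]:
    """Greedily pack paragraphs into chunks, carrying an overlap from the previous chunk."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks: list[list[str]] = []
    current: list[str] = []
    for para in paragraphs:
        words = para.split()
        if current and len(current) + len(words) > max_words:
            chunks.append(current)
            current = current[-overlap_words:]  # overlap to preserve context
        current.extend(words)
    if current:
        chunks.append(current)
    return [" ".join(c) for c in chunks]

# Synthetic document: five ~122-word paragraphs
doc = "\n\n".join(f"Paragraph {i} " + "word " * 120 for i in range(5))
pieces = chunk_paragraphs(doc)
print(len(pieces), [len(p.split()) for p in pieces])
```

Note that chunks never split a paragraph in half, which is the whole point of chunking at semantic boundaries.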

Real-World RAG Integration

Here is how I wire text-embedding-3-small into a RAG pipeline:

class RAGPipeline:
    def __init__(self, vector_store):
        self.vector_store = vector_store
        self.client = OpenAI()
    
    def query(self, question: str, top_k: int = 5) -> str:
        # Embed the question
        query_emb = get_embedding(question, dimensions=512)
        
        # Retrieve relevant chunks
        results = self.vector_store.search(query_emb, top_k=top_k)
        
        # Build context
        context = "\n\n".join([r.text for r in results])
        
        # Generate answer
        response = self.client.chat.completions.create(
            model="gpt-4o",
            messages=[
                {"role": "system", "content": f"Answer based on this context:\n{context}"},
                {"role": "user", "content": question}
            ]
        )
        return response.choices[0].message.content
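The vector_store in the pipeline is left abstract. For completeness, here is a toy in-memory version that satisfies the `search(query_emb, top_k)` interface the pipeline assumes; a real deployment would swap in pgvector, Qdrant, or similar. The two-dimensional "embeddings" here are hand-made for illustration:

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class Hit:
    text: str
    score: float

class InMemoryVectorStore:
    """Minimal store matching the search(query_emb, top_k) interface assumed above."""
    def __init__(self) -> None:
        self.texts: list[str] = []
        self.vectors: list[np.ndarray] = []

    def add(self, text: str, embedding: list[float]) -> None:
        v = np.array(embedding)
        self.vectors.append(v / np.linalg.norm(v))  # store unit vectors
        self.texts.append(text)

    def search(self, query_emb: list[float], top_k: int = 5) -> list[Hit]:
        q = np.array(query_emb)
        q = q / np.linalg.norm(q)
        scores = np.stack(self.vectors) @ q
        order = np.argsort(scores)[::-1][:top_k]
        return [Hit(self.texts[i], float(scores[i])) for i in order]

# Toy check with hand-made 2-d "embeddings"
store = InMemoryVectorStore()
store.add("deploy docs", [1.0, 0.0])
store.add("cooking tips", [0.0, 1.0])
print(store.search([0.9, 0.1], top_k=1)[0].text)  # deploy docs
```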

Cost Analysis

For a knowledge base with 50,000 documents averaging 300 tokens each, the total embedding cost is roughly $0.30. That is not a typo. At this price point, there is no reason not to re-embed your entire corpus when you update your chunking strategy or want to experiment with different dimension sizes.
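The arithmetic behind that number, as a quick sanity check:

```python
def embedding_cost(num_docs: int, avg_tokens: int, price_per_million: float = 0.02) -> float:
    """Dollar cost to embed a corpus once at text-embedding-3-small pricing."""
    total_tokens = num_docs * avg_tokens        # 50,000 * 300 = 15M tokens
    return total_tokens / 1_000_000 * price_per_million

print(f"${embedding_cost(50_000, 300):.2f}")  # $0.30
```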

Text-embedding-3-small is the workhorse of modern AI applications. It is cheap, fast, and good enough for the vast majority of production use cases. Start here, and only upgrade to larger models if you have a specific quality gap to close.