Embedding Models Explained: text-embedding-3-small in Practice
What Are Embedding Models and Why Should You Care?
Embedding models convert text into dense numerical vectors that capture semantic meaning. Two sentences that mean similar things will produce vectors that are close together in high-dimensional space, even if they share no common words. This is the foundation of modern search, recommendation systems, and retrieval-augmented generation.
OpenAI's text-embedding-3-small is my go-to embedding model for production applications. It strikes the ideal balance between quality, speed, and cost. Here is how I use it in practice.
Why text-embedding-3-small Over Alternatives
The embedding model landscape in 2026 is crowded, but text-embedding-3-small remains compelling for several reasons:
- Cost: At $0.02 per million tokens, it is about 6.5x cheaper than text-embedding-3-large ($0.13 per million)
- Speed: Batch embedding of 1,000 short documents typically takes under 10 seconds
- Dimension flexibility: You can reduce dimensions from 1536 down to 256 with minimal quality loss
- Quality: It outperforms many open-source alternatives on retrieval benchmarks
Getting Started
from openai import OpenAI
import numpy as np
client = OpenAI()
def get_embedding(text: str, dimensions: int = 1536) -> list[float]:
    response = client.embeddings.create(
        model="text-embedding-3-small",
        input=text,
        dimensions=dimensions,
    )
    return response.data[0].embedding
# Example
embedding = get_embedding("How to deploy a FastAPI application")
print(f"Vector dimensions: {len(embedding)}") # 1536
Dimension Reduction: The Secret Weapon
One of the most useful features of the v3 embedding models is native dimension reduction. You can request vectors with fewer dimensions, and because the model was trained with Matryoshka Representation Learning, the most important information is concentrated in the leading dimensions, so truncation loses surprisingly little.
# Full dimensions (1536)
full = get_embedding("Python web development", dimensions=1536)
# Reduced dimensions (512) - 66% less storage
reduced = get_embedding("Python web development", dimensions=512)
# Minimal dimensions (256) - 83% less storage
minimal = get_embedding("Python web development", dimensions=256)
In my testing across RAG applications, 512 dimensions retain about 95% of the retrieval quality compared to the full 1536. For most use cases, that is a massive storage and compute savings with negligible quality impact.
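A related consequence of the Matryoshka property: if you already stored full 1536-dimension vectors, you do not need to re-embed to experiment with smaller sizes. OpenAI's documentation notes that v3 embeddings can be shortened client-side by slicing and re-normalizing, which is what the `dimensions` parameter does server-side. A minimal sketch, using a synthetic random vector in place of a real embedding:

```python
import numpy as np

def truncate_and_normalize(embedding: list[float], dims: int) -> np.ndarray:
    """Truncate a Matryoshka-style embedding and re-normalize to unit length."""
    v = np.asarray(embedding[:dims], dtype=float)
    return v / np.linalg.norm(v)

# Synthetic 1536-dim vector standing in for a real embedding
rng = np.random.default_rng(0)
full = rng.normal(size=1536)
short = truncate_and_normalize(list(full), 512)
print(len(short))  # 512
```

Re-normalizing matters: cosine similarity assumes comparable vector lengths, and a raw truncated slice no longer has unit norm.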
Batch Embedding for Production
When embedding large document collections, you need to batch your requests efficiently:
def batch_embed(texts: list[str], batch_size: int = 100) -> list[list[float]]:
    all_embeddings = []
    for i in range(0, len(texts), batch_size):
        batch = texts[i:i + batch_size]
        response = client.embeddings.create(
            model="text-embedding-3-small",
            input=batch,
            dimensions=512,
        )
        embeddings = [item.embedding for item in response.data]
        all_embeddings.extend(embeddings)
    return all_embeddings
# Embed 10,000 documents
docs = load_documents()
embeddings = batch_embed([doc.text for doc in docs])
print(f"Embedded {len(embeddings)} documents")
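In production you will also hit rate limits on large corpora, so each batch call should be wrapped in retries with exponential backoff. A generic sketch (it catches bare `Exception` for brevity; in real code you would catch the SDK's rate-limit error, e.g. `openai.RateLimitError`, specifically):

```python
import random
import time

def with_retries(fn, max_attempts: int = 5, base_delay: float = 1.0):
    """Call fn(), retrying on failure with exponential backoff plus jitter."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise
            time.sleep(base_delay * 2 ** attempt + random.uniform(0, 0.1))

# Demo: a flaky callable that fails twice before succeeding
calls = {"n": 0}

def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("rate limited")
    return "ok"

result = with_retries(flaky, base_delay=0.01)
print(result)  # ok
```

Inside `batch_embed`, the `client.embeddings.create(...)` call would become `with_retries(lambda: client.embeddings.create(...))`.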
Similarity Search in Practice
Once you have embeddings, similarity search is straightforward. Cosine similarity is the standard metric (OpenAI embeddings are normalized to unit length, so a plain dot product gives the same ranking, but computing full cosine similarity is safer if you ever mix in vectors from other sources):
def cosine_similarity(a: list[float], b: list[float]) -> float:
    a, b = np.array(a), np.array(b)
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
# Search
query_embedding = get_embedding("How do I set up CI/CD?")
scores = [
    (doc, cosine_similarity(query_embedding, doc_emb))
    for doc, doc_emb in zip(docs, embeddings)
]
top_results = sorted(scores, key=lambda x: x[1], reverse=True)[:5]
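The list comprehension above is fine for a few thousand documents, but at larger scale a vectorized NumPy version is much faster because it replaces the per-document Python loop with one matrix product. A sketch, demonstrated with toy 3-dimension vectors in place of real embeddings:

```python
import numpy as np

def top_k_cosine(query: list[float], doc_matrix, k: int = 5) -> np.ndarray:
    """Return indices of the k most similar rows, sorted by descending similarity."""
    q = np.asarray(query, dtype=float)
    D = np.asarray(doc_matrix, dtype=float)
    sims = (D @ q) / (np.linalg.norm(D, axis=1) * np.linalg.norm(q))
    # argpartition finds the top k in O(n), then we sort only those k
    idx = np.argpartition(-sims, min(k, len(sims)) - 1)[:k]
    return idx[np.argsort(-sims[idx])]

# Toy corpus of three 3-dim "embeddings"
docs3 = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.9, 0.1, 0.0]])
print(top_k_cosine([1.0, 0.0, 0.0], docs3, k=2))  # [0 2]
```

Past roughly a million vectors, you would move to an approximate index (FAISS, pgvector, or a managed service) rather than brute-force scanning.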
Chunking Strategy Matters More Than Model Choice
Here is something I learned the hard way: your chunking strategy has a bigger impact on retrieval quality than the embedding model you choose. A mediocre model with smart chunking will outperform a state-of-the-art model with naive chunking.
My recommended approach:
- Chunk at semantic boundaries (paragraphs, sections) rather than fixed token counts
- Include overlap of 50 to 100 tokens between chunks to preserve context
- Keep chunks between 200 and 500 tokens for optimal embedding quality
- Prepend document metadata (title, section header) to each chunk
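The first three bullets can be sketched as a greedy paragraph-boundary chunker. This is a simplified illustration, not a production splitter: it uses word counts as a stand-in for tokens (swap in a real tokenizer such as tiktoken for actual token budgets) and leaves out metadata prepending:

```python
def chunk_paragraphs(text: str, max_words: int = 300, overlap_words: int = 50) -> list[str]:
    """Greedy chunking at paragraph boundaries with trailing overlap.

    Word counts approximate tokens here; use a real tokenizer in production.
    """
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks: list[str] = []
    current: list[str] = []
    for para in paragraphs:
        words = para.split()
        if current and len(current) + len(words) > max_words:
            chunks.append(" ".join(current))
            current = current[-overlap_words:]  # carry overlap into the next chunk
        current.extend(words)
    if current:
        chunks.append(" ".join(current))
    return chunks

# Demo: three 10-word paragraphs with a tiny budget to force splits
doc = "\n\n".join(" ".join(f"p{p}w{i}" for i in range(10)) for p in range(3))
chunks = chunk_paragraphs(doc, max_words=15, overlap_words=3)
print(len(chunks))  # 3
```

Note the overlap: each new chunk begins with the last few words of the previous one, so a sentence split across a boundary is still retrievable from at least one chunk.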
Real-World RAG Integration
Here is how I wire text-embedding-3-small into a RAG pipeline:
class RAGPipeline:
    def __init__(self, vector_store):
        self.vector_store = vector_store
        self.client = OpenAI()

    def query(self, question: str, top_k: int = 5) -> str:
        # Embed the question
        query_emb = get_embedding(question, dimensions=512)
        # Retrieve relevant chunks
        results = self.vector_store.search(query_emb, top_k=top_k)
        # Build context
        context = "\n\n".join([r.text for r in results])
        # Generate answer
        response = self.client.chat.completions.create(
            model="gpt-4o",
            messages=[
                {"role": "system", "content": f"Answer based on this context:\n{context}"},
                {"role": "user", "content": question},
            ],
        )
        return response.choices[0].message.content
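The vector_store can be anything with a matching search interface: pgvector, FAISS, or a managed service. For unit tests, a tiny in-memory store is enough. A sketch (the `Hit` class and `InMemoryVectorStore` are hypothetical stand-ins I am inventing here, not a real library API), demonstrated with toy 3-dim vectors:

```python
import numpy as np
from dataclasses import dataclass

@dataclass
class Hit:
    text: str
    score: float

class InMemoryVectorStore:
    """Minimal cosine-similarity store exposing a .search() method."""

    def __init__(self) -> None:
        self._texts: list[str] = []
        self._vecs: list[np.ndarray] = []

    def add(self, text: str, embedding: list[float]) -> None:
        v = np.asarray(embedding, dtype=float)
        self._texts.append(text)
        self._vecs.append(v / np.linalg.norm(v))  # store unit vectors

    def search(self, query_emb: list[float], top_k: int = 5) -> list[Hit]:
        q = np.asarray(query_emb, dtype=float)
        q = q / np.linalg.norm(q)
        sims = np.stack(self._vecs) @ q  # dot product == cosine for unit vectors
        order = np.argsort(-sims)[:top_k]
        return [Hit(self._texts[i], float(sims[i])) for i in order]

# Demo with toy vectors in place of real embeddings
store = InMemoryVectorStore()
store.add("deploy docs", [1.0, 0.0, 0.0])
store.add("billing docs", [0.0, 1.0, 0.0])
hits = store.search([0.9, 0.1, 0.0], top_k=1)
print(hits[0].text)  # deploy docs
```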
Cost Analysis
For a knowledge base with 50,000 documents averaging 300 tokens each, the total embedding cost is roughly $0.30. That is not a typo. At this price point, there is no reason not to re-embed your entire corpus when you update your chunking strategy or want to experiment with different dimension sizes.
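The arithmetic, for anyone who wants to sanity-check it:

```python
docs = 50_000
avg_tokens = 300
price_per_million = 0.02  # USD per 1M tokens for text-embedding-3-small

cost = docs * avg_tokens / 1_000_000 * price_per_million
print(f"${cost:.2f}")  # $0.30
```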
Text-embedding-3-small is the workhorse of modern AI applications. It is cheap, fast, and good enough for the vast majority of production use cases. Start here, and only upgrade to larger models if you have a specific quality gap to close.