Embedding Models Compared: OpenAI vs Open-Source
Most production RAG systems quietly depend on a single workhorse: the embedding model. When retrieval feels “magic” or inexplicably bad, it almost always comes down to how we turn text into vectors.
Over the last year, the tradeoff between OpenAI embeddings and open-source models has shifted. You no longer have to choose between “good but closed” and “free but mediocre.” The reality is more nuanced, especially if you care about cost, latency, privacy, and domain specificity.
This post walks through how I think about this decision when building retrieval systems in practice.
What embedding models actually do in your stack
Embeddings sit in the critical path of:
- Document ingestion (chunking + embedding)
- Query processing (query rewriting + embedding)
- Retrieval ranking (vector search + scoring)
Concretely, embeddings power:
- Semantic search: "how do I reset my password" should match "account credential reset steps".
- RAG retrieval: selecting the top-k chunks that will condition the LLM.
- Clustering: grouping similar documents, issues, or customers.
- Classification: mapping text to label vectors or using k-NN for weak supervision.
- Deduplication: catching near-duplicates that differ only slightly.
Bottom line: your choice of embedding model is as important as your choice of LLM for overall system quality.
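To make the semantic-search idea concrete, here is a minimal sketch of the cosine-similarity comparison that underlies most of these use cases. The vectors are toy 4-dimensional values standing in for real embeddings:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy vectors standing in for real embeddings (hypothetical values)
query = np.array([0.2, 0.8, 0.1, 0.0])
doc_close = np.array([0.25, 0.75, 0.05, 0.0])  # paraphrase-like document
doc_far = np.array([0.9, 0.0, 0.0, 0.4])       # unrelated topic

print(cosine_similarity(query, doc_close) > cosine_similarity(query, doc_far))  # True
```

A good embedding model is one where this simple geometric comparison lines up with human judgments of relevance.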
OpenAI embeddings in practice
OpenAI currently offers several embedding models (such as text-embedding-3-large and text-embedding-3-small, which superseded text-embedding-ada-002). They are designed as general-purpose, highly optimized models.
Strengths
1. Strong performance out of the box
OpenAI models usually rank very well on public benchmarks like MTEB, without any tuning. For most English-heavy, generic domains (support docs, product knowledge bases, API docs), they just work.
2. Low operational complexity
No GPU management, no capacity planning, no autoscaling issues. You hit an API and get vectors. This matters for teams that want to move fast or do not have ML infrastructure.
3. Stable and well-supported
The client libraries, observability, and documentation are solid. If your main business is not ML infra, this is valuable.
4. Good multilingual support
OpenAI embeddings handle many languages reasonably well without extra work, which is important for global products.
Limitations
1. Data control and privacy
If you work with sensitive data, privacy-preserving NLP constraints often push you away from external APIs.
OpenAI offers options to avoid training on your data, but regulatory or contractual requirements may still forbid sending data off-prem.
2. Cost at scale
Embedding a few thousand documents is cheap. Embedding tens of millions is a different story.
For a large RAG system or a log-heavy product, you can hit:
- High one-time cost for initial indexing
- Non-trivial ongoing cost for updates and user queries
3. Vendor and feature lock-in
Using OpenAI-specific models can feel convenient at first, but if you design your system around their behavior or vector sizes, switching later may be painful.
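One way to soften this lock-in is to hide the embedding provider behind a thin interface, so that swapping models later only touches one class. A minimal sketch; the names and the dummy implementation are illustrative, not a prescribed API:

```python
from typing import Protocol, Sequence

class Embedder(Protocol):
    """Minimal interface the rest of the stack depends on."""
    dim: int
    def embed(self, texts: Sequence[str]) -> list[list[float]]: ...

class FakeEmbedder:
    """Stand-in implementation; swap in an OpenAI- or OSS-backed class later."""
    def __init__(self, dim: int = 8):
        self.dim = dim

    def embed(self, texts: Sequence[str]) -> list[list[float]]:
        # Deterministic dummy vectors keyed on text length; a real
        # implementation would call a model or an API here.
        return [[float(len(t) % 7)] * self.dim for t in texts]

def index_documents(embedder: Embedder, docs: Sequence[str]) -> list[list[float]]:
    return embedder.embed(docs)

vectors = index_documents(FakeEmbedder(), ["hello", "world!"])
print(len(vectors), len(vectors[0]))  # 2 8
```

The rest of the pipeline depends only on `Embedder`, so migrating from the API to a self-hosted model becomes a re-indexing job rather than a rewrite.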
4. Limited specialization
You cannot fine-tune OpenAI embeddings yourself. If your domain is very specific (legal, medical, highly technical) you might hit a quality ceiling that only domain-adapted or fine-tuned open-source models can overcome.
Open-source embeddings: where they shine
The open-source ecosystem has exploded. You now have:
- Sentence-transformer based models
- E5-style models (for retrieval-oriented embeddings)
- FlagEmbedding, GTE, BGE models
- Domain-specific models (code, legal, biomedical, multilingual, and increasingly multimodal)
Many of these rank extremely well on benchmarks and, more importantly, can be tuned for your use case.
Strengths
1. Full control and privacy
You run the model where your data is. For strict privacy settings, this is a big win.
2. Cost and scalability
Once you invest in hardware or a managed inference platform, marginal cost per embedding can be extremely low.
- Batch processing becomes very cheap
- High-throughput ingestion is easier to control
- Latency is under your control, within your infra limits
3. Specialization and fine-tuning
This is often the killer feature:
- You can pick a model that already targets your domain (e.g., bge-en-icl, e5-mistral, gte-large)
- You can fine-tune embeddings with in-domain labeled pairs or triplets
The same principle applies to generative fine-tuning: you get big gains when your training distribution matches your real data.
4. Flexible vector sizes and architectures
You can choose 384-, 768-, 1024-, or 2048-dimensional models depending on your latency/accuracy tradeoff and how that interacts with your vector database's indexing and storage strategy.
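A quick back-of-envelope calculation shows why dimensionality matters at scale. This sketch assumes raw float32 storage and ignores index overhead and metadata:

```python
def index_size_gb(n_vectors: int, dim: int, bytes_per_value: int = 4) -> float:
    """Raw vector storage in GB (float32 by default)."""
    return n_vectors * dim * bytes_per_value / 1e9

# 50M chunks across three common model dimensionalities
for dim in (384, 768, 1536):
    print(dim, round(index_size_gb(50_000_000, dim), 1), "GB")
```

At 50 million chunks, moving from a 1536-dimensional model to a 384-dimensional one cuts raw vector storage roughly fourfold, which also shrinks RAM footprint and ANN search cost.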
Limitations
1. Infra complexity
You need to:
- Deploy and scale GPU or CPU inference
- Monitor latency and throughput
- Handle model updates and rollbacks
If you are not comfortable with ML infra, this is a real tax.
2. Model selection risk
There are many models. Many look good on paper, fewer are good for your data. It is easy to pick something that looks nice on MTEB but fails on your internal queries.
3. Engineering overhead
You must maintain a serving stack, upgrade models, and ensure compatibility with your vector database and RAG pipeline. For small teams, that overhead may outweigh cost savings.
How I approach the decision in real systems
The particular tradeoff depends on your constraints, but I usually start from a few key questions.
1. What is your privacy and compliance model?
- If you cannot send data outside your VPC or country, open-source is nearly mandatory.
- If you can send anonymized or partially masked data, OpenAI is still viable with appropriate pre-processing (redaction, tokenization, differential privacy).
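If you go the redaction route, a pre-processing pass might look like the following sketch. The patterns are deliberately simplistic placeholders; a real pipeline should use a dedicated PII-detection library rather than two regexes:

```python
import re

# Hypothetical redaction patterns; real pipelines need proper PII detection.
PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "PHONE": re.compile(r"\+?\d[\d\s().-]{7,}\d"),
}

def redact(text: str) -> str:
    """Mask obvious PII before sending text to an external embedding API."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text

print(redact("Contact jane.doe@example.com or +1 555 123 4567"))
# Contact [EMAIL] or [PHONE]
```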
A common compromise:
- Use open-source for highly sensitive indices (HR, legal, PII-heavy data)
- Use OpenAI for public or semi-public content (docs, marketing, FAQs)
2. What volume and growth do you expect?
For small-scale systems (up to a few million documents, modest query volume), OpenAI can be cost-effective and incredibly simple.
For very large RAG systems with constant ingestion (logs, tickets, call transcripts), open-source + dedicated hardware often wins long-term.
Rough heuristic:
- < 10^7 total embeddings + modest QPS: OpenAI is fine.
- > 10^8 embeddings or high QPS: strongly consider open-source.
3. How domain-specific is your content?
- Generic SaaS support docs: OpenAI or strong generalist open-source models like bge-large are usually enough.
- Technical code-heavy docs, law, medicine, specialized research: domain-tuned or custom fine-tuned open-source models often outperform general-purpose ones.
For serious RAG in narrow domains, I rarely stay with a pure off-the-shelf embedding model forever. After some logging, I curate hard negatives and fine-tune.
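Curating hard negatives can start as simply as taking the top-scoring documents that are not labeled relevant for a query: those are exactly the cases the current model confuses. A minimal numpy sketch with toy vectors:

```python
import numpy as np

def mine_hard_negatives(q_emb, doc_embs, doc_ids, relevant, k=3):
    """Return the k top-scoring documents NOT labeled relevant for the
    query: the most confusable candidates, useful as fine-tuning negatives."""
    scores = doc_embs @ q_emb              # dot product = similarity (normalized vecs)
    ranked = np.argsort(-scores)           # best-scoring first
    hard = [doc_ids[i] for i in ranked if doc_ids[i] not in relevant]
    return hard[:k]

# Toy 2-d embeddings (hypothetical values)
doc_ids = ["d1", "d2", "d3", "d4"]
doc_embs = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.8, 0.2]])
q_emb = np.array([1.0, 0.0])
print(mine_hard_negatives(q_emb, doc_embs, doc_ids, relevant={"d1"}, k=2))
# ['d2', 'd4']
```

Pairs of (query, hard negative) mined this way, together with labeled positives, are the raw material for contrastive fine-tuning.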
Concrete model candidates: OpenAI vs OSS
OpenAI side
At time of writing, typical choices:
- text-embedding-3-large: high quality, larger dimensionality, better for complex domains.
- text-embedding-3-small: cheaper and smaller, often enough for many RAG pipelines.
Use 3-small by default, and switch critical indices to 3-large if evaluation shows gains.
Open-source side
Some strong, widely used choices:
- sentence-transformers/all-MiniLM-L6-v2: 384d vectors, very fast, good for lightweight or client-side use.
- intfloat/e5-large-v2 or intfloat/multilingual-e5-large: excellent retrieval performance, widely adopted.
- BAAI/bge-base-en / BAAI/bge-large-en: strong on English retrieval tasks.
- Alibaba-NLP/gte-large: good general-purpose English model.
If you build multilingual or specialized systems, look for models explicitly tuned for that.
Simple Python examples
Using OpenAI embeddings
```python
from openai import OpenAI

client = OpenAI()

texts = [
    "How do I reset my password?",
    "To change your password, go to account settings and click 'Reset'.",
]

response = client.embeddings.create(
    model="text-embedding-3-small",
    input=texts,
)

embeddings = [item.embedding for item in response.data]
print(len(embeddings), len(embeddings[0]))  # n_texts, embedding_dim
```
You can store these embeddings in any vector database that supports the corresponding dimensionality.
Using a sentence-transformer locally
```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

texts = [
    "How do I reset my password?",
    "To change your password, go to account settings and click 'Reset'.",
]

embeddings = model.encode(texts, batch_size=32, show_progress_bar=False)
print(embeddings.shape)  # (n_texts, embedding_dim)
```
For production, you would typically wrap this in a FastAPI or gRPC service and deploy on a GPU or high-CPU instance.
Evaluation: do not trust benchmarks blindly
Benchmarks like MTEB are helpful to narrow the search space, but they are not your ground truth. You should evaluate embeddings on task-specific metrics, the same way you would evaluate any other component of a retrieval pipeline.
A minimal evaluation loop
Suppose you already have a small labeled dataset:
- Queries q_i
- Relevant document IDs R_i for each query
You can do something like this (pseudo-production code):
```python
import numpy as np
from sentence_transformers import SentenceTransformer

# Suppose you already have these
queries = ["how to reset password", "cancel subscription", ...]
relevant_docs = [
    {"doc_1", "doc_17"},
    {"doc_42"},
    # ...
]

# Document corpus
doc_ids = ["doc_1", "doc_2", "doc_3", ...]
doc_texts = ["To reset password, ...", "Billing details", ...]

# Step 1: embed documents with a given model
model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
doc_embs = model.encode(doc_texts, batch_size=64, normalize_embeddings=True)
id_to_idx = {doc_id: i for i, doc_id in enumerate(doc_ids)}

# Step 2: simple brute-force retrieval for evaluation
def retrieve(query, top_k=10):
    q_emb = model.encode([query], normalize_embeddings=True)[0]
    scores = np.dot(doc_embs, q_emb)
    top_idx = np.argsort(-scores)[:top_k]
    return [doc_ids[i] for i in top_idx]

# Step 3: compute recall@k (strictly a hit rate: the fraction of queries
# with at least one relevant document in the top-k)
def recall_at_k(k):
    hits = 0
    total = 0
    for q, rel in zip(queries, relevant_docs):
        retrieved = set(retrieve(q, top_k=k))
        hits += len(retrieved & rel) > 0
        total += 1
    return hits / total

for k in [1, 3, 5, 10]:
    print(f"Recall@{k}: {recall_at_k(k):.3f}")
```
Then you repeat the same with another model (OpenAI or different OSS) and compare. More work than eyeballing embeddings, but it directly measures whether your RAG system will retrieve useful chunks.
Cost and latency considerations
When comparing OpenAI vs open-source, do not only look at per-1K token costs. Ask:
- What is my target latency for query-time retrieval?
- How many documents or chunks do I expect to index over a year?
- How often will I re-embed due to schema changes or chunking strategy changes?
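A rough cost model helps answer these questions before committing to either side. This sketch uses a hypothetical per-million-token price; substitute your provider's current rates:

```python
def embedding_cost_usd(n_chunks: int, avg_tokens_per_chunk: int,
                       price_per_million_tokens: float) -> float:
    """Back-of-envelope API embedding cost for one full pass over a corpus."""
    total_tokens = n_chunks * avg_tokens_per_chunk
    return total_tokens / 1_000_000 * price_per_million_tokens

# Hypothetical: 20M chunks, ~300 tokens each, $0.02 per 1M tokens
print(round(embedding_cost_usd(20_000_000, 300, 0.02), 2))
```

Multiply the result by the number of times you expect to re-embed (chunking changes, model upgrades) to get a more realistic annual figure to compare against self-hosting.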
OpenAI gives you predictable, pay-per-use economics.
Open-source gives you:
- Higher upfront cost (infra, engineering)
- Much lower marginal cost for high volumes
For RAG systems at scale, it often pays to:
- Start with OpenAI for speed to market.
- Once you understand your volumes and failure modes, transition critical indices to open-source embeddings.
- Use the same evaluation harness for both, to ensure quality stays the same or improves.
Hybrid strategies that work well
In practice, the best systems often combine both worlds.
Strategy 1: OSS for retrieval, OpenAI for generation
Different embedding models do not share a vector space, so you generally cannot mix one model for queries with another for documents; evaluate candidates side-by-side instead. The common pattern is:
- Embed documents with a strong open-source model you control.
- Embed queries with the same model in production.
- Use OpenAI generative models for the final RAG response.
This keeps retrieval private and cheap, while still benefiting from OpenAI for generation.
Strategy 2: OSS for sensitive indices, OpenAI for everything else
Split your indices by sensitivity:
- private_index with open-source embeddings and an on-prem vector DB.
- public_index with OpenAI embeddings and a managed vector DB.
At query time, you hit both indices, then merge results before passing them to your RAG pipeline.
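Because the two indices use different embedding models, their similarity scores are not directly comparable, so merging by rank is safer than merging by raw score. A sketch using the standard reciprocal rank fusion formula (the IDs and lists are illustrative):

```python
from collections import defaultdict

def reciprocal_rank_fusion(result_lists, k=60):
    """Merge ranked ID lists from indices whose scores are not comparable
    (e.g., one OSS-embedded, one OpenAI-embedded). Standard RRF:
    score(d) = sum over lists of 1 / (k + rank_of_d)."""
    fused = defaultdict(float)
    for results in result_lists:
        for rank, doc_id in enumerate(results):
            fused[doc_id] += 1.0 / (k + rank + 1)
    return sorted(fused, key=fused.get, reverse=True)

private_hits = ["p3", "p1", "p9"]   # from the on-prem index
public_hits = ["u2", "p1", "u7"]    # from the managed index
print(reciprocal_rank_fusion([private_hits, public_hits])[:3])
```

Documents that appear in both lists (here `p1`) get boosted to the top, which is usually the behavior you want when merging heterogeneous indices.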
Strategy 3: OSS baseline with OpenAI as a fallback
In early phases, you may:
- Use an open-source model by default.
- Log queries with low retrieval confidence (for example low max similarity score).
- For those, optionally try OpenAI embeddings and measure if they retrieve better content.
This gives you a real-world A/B style evaluation without fully committing.
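A minimal routing sketch for this fallback pattern; the retrievers here are stubs standing in for real vector searches, and the confidence threshold is a hypothetical value you would tune on logged queries:

```python
def route_query(query, primary_retrieve, fallback_retrieve, threshold=0.45):
    """Use the default (OSS) retriever; if its best similarity is low,
    also try the fallback and keep whichever scored higher."""
    results, best_score = primary_retrieve(query)
    if best_score >= threshold:
        return results, "primary"
    fallback_results, fb_score = fallback_retrieve(query)
    if fb_score > best_score:
        return fallback_results, "fallback"
    return results, "primary"

# Stub retrievers returning (doc_ids, best_similarity_score)
primary = lambda q: (["doc_a"], 0.30)
fallback = lambda q: (["doc_b"], 0.55)
print(route_query("obscure query", primary, fallback))  # (['doc_b'], 'fallback')
```

Logging which branch served each query gives you the data to decide whether the fallback is worth keeping.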
Key Takeaways
- Embedding models are core infrastructure for RAG, search, and clustering, not just a detail.
- OpenAI embeddings are strong, easy to use, and great for generic English content at modest scale.
- Open-source embeddings shine when you need privacy, scale, or domain-specific performance.
- Benchmarks like MTEB help, but you must evaluate models on your own queries and documents.
- Cost tradeoffs change with volume: APIs are cheap at low scale, infra wins at high scale.
- Hybrid setups are often best: private OSS embeddings for sensitive data, OpenAI for public or less critical indices.
- Reuse your RAG evaluation stack (metrics, labeled queries) to compare embedding models objectively.
- Choose embeddings with the same care as your LLM; they can make or break your system quality.