Multimodal AI: Combining Vision and Language Models
Modern AI systems are starting to see.
Once you plug images into the same pipeline as text, everything from search to agents to RAG changes. Interfaces get simpler, fewer text-only workarounds are needed, and new products appear: visual search, image-grounded copilots, multimodal RAG over PDFs, and agents that can read dashboards and UIs.
From an engineering perspective, multimodal AI is not magic. It is mostly embeddings, alignment losses, shared latent spaces, plus the same production concerns around monitoring and reliability.
In this post I will walk through how to combine vision and language models in practical ways, with code, and how to think about architectures that actually ship.
Why combine vision and language?
Multimodal systems are useful when:
- Text alone is incomplete or ambiguous
- Users naturally provide images (screenshots, photos, plots, scans)
- Knowledge is locked inside visual formats (diagrams, UI states, PDFs with bad OCR)
Some concrete product patterns:
- Visual search: "show me shoes like this photo"
- Image-grounded assistants: "explain this error on my CNC machine display"
- Document copilots: upload a 200-page technical PDF, then ask questions that require reading tables and plots
- UI agents: "click the Export button, then download the CSV" from a screenshot or remote session
If you have built a text-based RAG pipeline before, you can reuse most of that toolbox for multimodal work, with one upgrade: image embeddings.
Core multimodal architectures
1. Dual encoders with a shared embedding space
This is the CLIP-style approach:
- A vision encoder maps images to vectors
- A text encoder maps text to vectors
- A contrastive loss aligns related image-text pairs
You end up with a shared latent space where similarity works across modalities. This means everything you know about text embeddings and vector similarity now also applies to images.
Typical usage patterns:
- Image retrieval from text query
- Text retrieval from image query
- Zero-shot classification by comparing an image embedding to embeddings of textual labels
High-level architecture:
- Pretrained encoders (ViT for images, Transformer for text)
- Project both to the same dimension
- Train with a contrastive loss (InfoNCE-style)
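The zero-shot classification pattern is worth making concrete: compare one image embedding against the embeddings of label prompts and softmax the similarities. The random tensors below stand in for real encoder outputs, and `zero_shot_classify` is an illustrative helper, not a library API.

```python
import torch
import torch.nn.functional as F

def zero_shot_classify(image_emb, label_embs, temperature=0.07):
    """Turn cosine similarities against label prompts into a distribution.

    image_emb: (D,) normalized image embedding
    label_embs: (K, D) normalized embeddings of prompts like "a photo of a cat"
    """
    logits = label_embs @ image_emb / temperature  # (K,) similarities
    return F.softmax(logits, dim=0)                # (K,) label probabilities

# Random stand-ins for real encoder outputs:
torch.manual_seed(0)
image_emb = F.normalize(torch.randn(256), dim=0)
label_embs = F.normalize(torch.randn(3, 256), dim=1)  # e.g. cat / dog / car
probs = zero_shot_classify(image_emb, label_embs)
```

The label whose prompt embedding lands closest to the image embedding gets the highest probability, which is exactly the CLIP-style zero-shot recipe.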
2. Encoder-decoder: images as context for generation
Here you want the model to produce language conditioned on an image:
- Image captioning
- Visual question answering (VQA)
- Multimodal chat ("what is happening in this picture?")
Modern systems usually:
- Encode the image into a grid of visual tokens with a vision transformer
- Optionally project visual tokens to the LLM hidden size
- Feed visual tokens as additional context tokens to the LLM
This is similar to RAG: the image acts like retrieved context chunks, just in a different embedding basis. The mental model is text chunking plus an additional projection layer.
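The projection step above fits in a few lines. The dimensions and the single linear projector here are assumptions for illustration; production systems often use an MLP or a cross-attention resampler instead.

```python
import torch
import torch.nn as nn

class VisualProjector(nn.Module):
    """Map vision-encoder tokens into the LLM hidden size so they can be
    concatenated with text token embeddings."""
    def __init__(self, vision_dim: int, llm_dim: int):
        super().__init__()
        self.proj = nn.Linear(vision_dim, llm_dim)

    def forward(self, visual_tokens):   # (B, N, vision_dim)
        return self.proj(visual_tokens)  # (B, N, llm_dim)

# Prepend projected visual tokens to the text embeddings as extra context:
B, N, T = 2, 49, 16                # batch, visual tokens, text tokens
visual = torch.randn(B, N, 768)    # e.g. ViT patch tokens
text = torch.randn(B, T, 1024)     # e.g. LLM input embeddings
proj = VisualProjector(768, 1024)
llm_input = torch.cat([proj(visual), text], dim=1)  # (B, N + T, 1024)
```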
3. Multimodal RAG
Once you treat image embeddings as another vector type, you can build:
- RAG over screenshots and PDFs
- Hybrid search mixing text, tables, diagrams, and UI states
The pipeline works like this:
- Ingestion extracts text, images, tables from documents
- Text encoder processes text chunks
- Vision encoder processes images or image patches
- You store both in a vector database
- At query time, you embed the text query, and optionally an accompanying query image
- Retrieve multimodal neighbors and feed them as context to an LLM
A practical CLIP-style dual encoder in PyTorch
Let us make the core idea concrete with a toy dual encoder, close to the classic CLIP setup.
We want:
- ImageEncoder: image to embedding
- TextEncoder: text to embedding
- A contrastive training loop
This example is intentionally simplified for clarity.
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision import models
class ImageEncoder(nn.Module):
    def __init__(self, embed_dim: int = 256):
        super().__init__()
        base = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
        self.backbone = nn.Sequential(*list(base.children())[:-1])  # remove fc
        self.proj = nn.Linear(base.fc.in_features, embed_dim)

    def forward(self, x):  # x: (B, 3, H, W)
        feats = self.backbone(x).squeeze(-1).squeeze(-2)  # (B, C)
        emb = self.proj(feats)
        emb = F.normalize(emb, dim=-1)
        return emb
class TextEncoder(nn.Module):
    def __init__(self, vocab_size: int, embed_dim: int = 256):
        super().__init__()
        self.token_emb = nn.Embedding(vocab_size, embed_dim)
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=embed_dim, nhead=8), num_layers=2
        )

    def forward(self, tokens):  # tokens: (T, B)
        x = self.token_emb(tokens)  # (T, B, D)
        x = self.encoder(x)         # (T, B, D)
        x = x.mean(dim=0)           # (B, D) simple pooling
        x = F.normalize(x, dim=-1)
        return x
def clip_loss(image_emb, text_emb, temperature: float = 0.07):
    # image_emb, text_emb: (B, D)
    logits = image_emb @ text_emb.t() / temperature  # (B, B)
    labels = torch.arange(len(image_emb), device=image_emb.device)
    loss_i2t = F.cross_entropy(logits, labels)
    loss_t2i = F.cross_entropy(logits.t(), labels)
    return (loss_i2t + loss_t2i) / 2
You would train this on image-text pairs (img, caption_tokens) and optimize clip_loss. In practice you will use a larger backbone, better text encoder, and careful batching, but the core idea is this symmetric contrastive loss.
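To see the symmetric loss actually train, here is a self-contained toy loop. The linear layers and random features are stand-ins for the real encoders above, so treat it as a demonstration of the objective, not a training recipe.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def clip_loss(image_emb, text_emb, temperature=0.07):
    logits = image_emb @ text_emb.t() / temperature
    labels = torch.arange(len(image_emb))
    return (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels)) / 2

# Stand-in encoders: one linear layer per modality, enough to show the
# contrastive objective pulling paired embeddings together.
torch.manual_seed(0)
img_net = nn.Linear(32, 16)
txt_net = nn.Linear(32, 16)
opt = torch.optim.AdamW(
    list(img_net.parameters()) + list(txt_net.parameters()), lr=1e-2
)

img_feats = torch.randn(64, 32)  # fake image features
txt_feats = torch.randn(64, 32)  # fake features for the paired captions

losses = []
for _ in range(50):
    img_emb = F.normalize(img_net(img_feats), dim=-1)
    txt_emb = F.normalize(txt_net(txt_feats), dim=-1)
    loss = clip_loss(img_emb, txt_emb)
    opt.zero_grad()
    loss.backward()
    opt.step()
    losses.append(loss.item())
```

The loss starts near ln(batch size) for random embeddings and drops as the two projections align the pairs.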
At inference time, you can index a corpus of images and search it with text queries:
@torch.inference_mode()
def build_image_index(image_loader, image_encoder):
    all_embs, all_ids = [], []
    for batch_imgs, batch_ids in image_loader:
        embs = image_encoder(batch_imgs.cuda())
        all_embs.append(embs.cpu())
        all_ids.extend(batch_ids)
    return torch.cat(all_embs, dim=0), all_ids

@torch.inference_mode()
def search_images(text_query_tokens, text_encoder, image_embs, image_ids, k=5):
    q_emb = text_encoder(text_query_tokens.cuda()).cpu()  # (1, D)
    sims = (q_emb @ image_embs.t()).squeeze(0)  # (N,)
    topk = sims.topk(k)
    return [image_ids[i] for i in topk.indices]
For large scale, you will move this into a vector database with approximate nearest neighbor indexing.
Practical multimodal RAG pipeline
Let us sketch a production-style multimodal RAG over PDFs with diagrams and screenshots.
1. Ingestion and chunking
For each document:
- Extract text with OCR where needed
- Detect figures, tables, diagrams, and crop them as separate images
- Apply chunking for text (fixed-size overlapping windows or semantic boundaries)
- For each text chunk, store its page number and bounding boxes
- For each image, store page number and caption or nearby text
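A fixed-size overlapping chunker can be as small as this sketch. The word-based windows are a simplification; production chunkers usually respect sentence or section boundaries.

```python
def chunk_text(text: str, chunk_size: int = 400, overlap: int = 50) -> list[str]:
    """Split text into fixed-size windows of words with some overlap,
    so context at chunk boundaries is not lost."""
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        window = words[start:start + chunk_size]
        if window:
            chunks.append(" ".join(window))
        if start + chunk_size >= len(words):
            break
    return chunks
```

Each chunk then gets its own embedding and metadata entry in the next step.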
2. Encoding and storage
from sentence_transformers import SentenceTransformer
import torch
text_model = SentenceTransformer("all-MiniLM-L6-v2")
vision_model = ... # CLIP image encoder or similar
# text_chunks: list[str]
text_embs = text_model.encode(text_chunks, batch_size=64, show_progress_bar=True)
# images: list[PIL.Image], wrapped in a DataLoader that batches
# preprocessed tensors
vision_model.eval().cuda()

@torch.inference_mode()
def encode_images(image_loader):
    all_embs = []
    for batch in image_loader:
        embs = vision_model(batch.cuda())
        all_embs.append(embs.cpu())
    return torch.cat(all_embs)

image_embs = encode_images(image_loader)
# Persist to vector DB with metadata
Metadata should include:
- Document id
- Page number
- Type ("text" or "image")
- Bounding boxes or coordinates
- Section header if available
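As a sketch, that metadata can live on a small record type; the field names below are illustrative, and the real schema will follow whatever your vector database expects.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class IndexedItem:
    """One retrievable unit in the vector store (field names illustrative)."""
    doc_id: str
    page: int
    modality: str                  # "text" or "image"
    embedding: list[float]
    bbox: Optional[tuple[float, float, float, float]] = None  # page coordinates
    section: Optional[str] = None  # nearest section header, if any

item = IndexedItem(doc_id="spec-001", page=12, modality="image",
                   embedding=[0.1, 0.2], bbox=(40.0, 80.0, 320.0, 240.0))
```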
3. Query handling
A query can be:
- Pure text: "Explain the loss function used in figure 3"
- Text + image: screenshot of a figure plus "what does this chart say about model performance?"
You can route queries:
- Always embed the text with the text encoder
- If an image is present, embed it with the vision encoder
- Run retrieval separately against the text index and the image index
- Merge results with a simple score fusion, for example
from collections import defaultdict

def fuse_scores(text_hits, image_hits, alpha=0.6):
    # Each hit: {"id": ..., "score": ...}
    scores = defaultdict(float)
    for h in text_hits:
        scores[h["id"]] += alpha * h["score"]
    for h in image_hits:
        scores[h["id"]] += (1 - alpha) * h["score"]
    fused = sorted(scores.items(), key=lambda x: x[1], reverse=True)
    return fused
Then feed the top-k context items to an LLM as text. For images you can:
- Caption them with a vision-language model
- Or use an LLM that directly accepts image inputs
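Putting the two options together, a prompt builder might look like the sketch below. `build_prompt` and its hit format (dicts with a "text" field) are assumptions matching the fused results from the retrieval step.

```python
def build_prompt(query: str, text_hits: list[dict], image_captions: list[str]) -> str:
    """Assemble a grounded prompt from retrieved text and image captions.

    Captions come from a captioning model, or are omitted when the LLM
    receives the images directly.
    """
    parts = ["Answer using only the context below.", "", "Text context:"]
    for i, hit in enumerate(text_hits, 1):
        parts.append(f"[{i}] {hit['text']}")
    if image_captions:
        parts += ["", "Image descriptions:"]
        parts += [f"- {c}" for c in image_captions]
    parts += ["", f"Question: {query}"]
    return "\n".join(parts)

prompt = build_prompt("What grew fastest?",
                      [{"text": "Revenue grew 12% in Q3."}],
                      ["Bar chart comparing quarterly revenue"])
```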
Design choices and tradeoffs
Encoder choice and cost
You need to pick encoders that match your budget and latency constraints.
- OpenAI and other hosted multimodal models: good quality, but with privacy and cost tradeoffs
- Open source CLIP or SigLIP models: good for search and retrieval, easy to self host
- Larger multimodal LLMs: best quality for rich reasoning, but expensive
There is usually a sweet spot with mid-size open models fine-tuned on your domain.
Alignment and evaluation
Multimodal systems fail in new ways:
- Hallucinating visual content that is not present
- Misreading charts or UI text
- Ignoring the image and answering from priors only
For evaluating multimodal retrieval you can measure:
- Retrieval metrics per modality: recall@k for image vs text
- Faithfulness metrics conditioned on both text and image context
- Human eval on subsets that heavily rely on visuals, for example reading tables
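Recall@k per modality is cheap to compute once you have labeled query-to-item pairs; a minimal sketch:

```python
def recall_at_k(retrieved_ids: list[list[str]],
                relevant_ids: list[set[str]], k: int = 5) -> float:
    """Fraction of queries where at least one relevant item appears
    in the top-k retrieved results."""
    hits = 0
    for ranked, relevant in zip(retrieved_ids, relevant_ids):
        if any(doc_id in relevant for doc_id in ranked[:k]):
            hits += 1
    return hits / max(len(retrieved_ids), 1)

# Two queries: the first is hit at rank 2, the second misses entirely.
score = recall_at_k([["a", "b", "c"], ["x", "y"]], [{"b"}, {"z"}], k=3)
```

Run it separately over the text index and the image index to see which modality is failing.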
In production, monitor:
- Inputs with image hashes and feature summaries
- Performance per modality mix (text only vs image + text)
Privacy and compliance
Images can contain sensitive data: faces, license plates, screens with PII. Key practices:
- Mask or blur sensitive regions before encoding
- Avoid storing raw images unless strictly necessary
- Consider hashing and access controls for image content
- If you store embeddings, treat them as potentially re-identifiable
If you are building internal tools that process screenshots of internal dashboards, this is not optional. The same concerns apply in finance and healthcare, where visual data carries regulatory obligations.
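For the hashing side of this, a content fingerprint lets you log, deduplicate, and gate access without persisting raw pixels. Note that an exact hash only catches byte-identical images; near-duplicate detection needs a perceptual hash, which this sketch does not cover.

```python
import hashlib

def image_fingerprint(image_bytes: bytes) -> str:
    """SHA-256 over the encoded image bytes: stable for logging and
    deduplication, without storing the image itself."""
    return hashlib.sha256(image_bytes).hexdigest()

fp = image_fingerprint(b"\x89PNG...fake bytes for illustration")
```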
A lightweight multimodal assistant pattern
Combining all of this, a simple but powerful architecture looks like this:
1. User uploads an image and optional text query
2. Backend:
   - Embeds the text
   - Embeds the image
   - Retrieves relevant text chunks and similar images
3. The system builds a prompt for a multimodal LLM:
   - Includes the text query
   - Provides the retrieved text as context
   - Attaches the user image and maybe top-k retrieved images
4. LLM responds with a grounded explanation or action plan
You can implement steps 2-3 with a FastAPI backend serving the encoders and coordinating retrieval, then streaming the LLM response back to the client.
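The whole flow fits in one orchestration function. Every callable below is an assumption you would replace with your own encoders, index lookups, and LLM client; the stub wiring only demonstrates the control flow.

```python
def run_assistant(query_text, query_image, embed_text, embed_image,
                  search_text, search_image, generate):
    """One request through the assistant: embed, retrieve, prompt, generate.
    All six callables are placeholders for real components."""
    text_hits = search_text(embed_text(query_text))
    image_hits = (search_image(embed_image(query_image))
                  if query_image is not None else [])
    context = [h["text"] for h in text_hits]
    prompt = "Context:\n" + "\n".join(context) + f"\n\nQuestion: {query_text}"
    images = [query_image] if query_image is not None else []
    return generate(prompt, images=images)

# Stub wiring to exercise the control flow:
reply = run_assistant(
    "what does this chart show?", None,
    embed_text=lambda t: [1.0],
    embed_image=lambda i: [0.0],
    search_text=lambda q: [{"text": "revenue grew 12% in Q3"}],
    search_image=lambda q: [],
    generate=lambda prompt, images: f"Grounded answer based on: {prompt[:40]}...",
)
```

In a real deployment, `generate` streams tokens back to the client while the retrieval results are logged for evaluation.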
Key Takeaways
- Multimodal AI is mostly about aligning image and text embeddings into a shared space and then reusing familiar retrieval and generation patterns.
- Dual encoders like CLIP are the workhorses for image-text retrieval and zero-shot classification, with simple contrastive losses.
- Multimodal RAG extends classic RAG pipelines to images by treating figures and screenshots as retrievable units alongside text.
- Production systems must address encoder choice, latency, and evaluation, just like text-only RAG, but with new failure modes around visual hallucinations.
- Privacy risks increase with images, so masking, storage policies, and access control are critical for screenshots and sensitive documents.
- A practical starting point is a lightweight multimodal assistant that retrieves over both text and images and uses a multimodal LLM to generate grounded answers.