Deploying ML Models with FastAPI and Docker
Shipping a model is often harder than training one. The Jupyter notebook that demoed beautifully on your laptop suddenly becomes a fragile snowflake when real users and real traffic show up. Logs are missing, dependencies break, and latency jumps.
FastAPI and Docker give us a clean way out of this. FastAPI provides a modern async web framework with type hints and automatic docs. Docker gives us reproducible, portable environments. Put together, they are a solid foundation for deploying anything from a small classifier to a full Retrieval-Augmented Generation (RAG) pipeline.
This post walks through a practical pattern for deploying models, from API design to Docker packaging to production scaling.
Designing the API around the model
Before writing Dockerfiles, I want a clear contract: what requests look like, what responses look like, and how we will monitor and debug them.
Start with a minimal interface
For a classifier, a simple interface might be:
- POST /predict - takes input text, returns predicted label and score
- GET /healthz - returns service health
Define the Pydantic models first. It forces you to think about the API surface instead of jumping straight to model loading.
from typing import List

from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI(title="Text Classifier API")

class PredictRequest(BaseModel):
    texts: List[str]

class Prediction(BaseModel):
    label: str
    score: float

class PredictResponse(BaseModel):
    predictions: List[Prediction]

@app.get("/healthz")
def health_check():
    return {"status": "ok"}
The same schema-first approach applies to any ML endpoint, whether it returns class labels, embeddings, or generated text.
Loading the model efficiently
Naively loading the model inside the endpoint handler is a classic anti-pattern. It reloads on every request and kills performance.
Use FastAPI's startup event to load heavy resources once per process:
from fastapi import FastAPI
import joblib

app = FastAPI(title="Text Classifier API")

model = None

@app.on_event("startup")
async def load_model():
    global model
    model = joblib.load("./model.joblib")

@app.post("/predict", response_model=PredictResponse)
async def predict(req: PredictRequest):
    # simple batch prediction
    labels = model.predict(req.texts)
    # assume the model implements predict_proba
    probs = model.predict_proba(req.texts).max(axis=1)
    predictions = [
        Prediction(label=str(label), score=float(score))
        for label, score in zip(labels, probs)
    ]
    return PredictResponse(predictions=predictions)
Production tip: for large deep learning models, you might want to lazy-load them, or serve them from a separate process or dedicated model server when multiple services need the same weights.
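Also note that on_event is deprecated in recent FastAPI releases in favor of the lifespan context manager. A minimal sketch of the same load-once pattern in that style:

from contextlib import asynccontextmanager

import joblib
from fastapi import FastAPI

model = None

@asynccontextmanager
async def lifespan(app: FastAPI):
    # runs once per process, before the app starts serving traffic
    global model
    model = joblib.load("./model.joblib")
    yield
    # optional cleanup on shutdown goes after the yield

app = FastAPI(title="Text Classifier API", lifespan=lifespan)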
Structuring the project for deployment
A clean folder layout makes Docker and CI/CD easier.
ml-service/
  app/
    __init__.py
    main.py
    models.py        # pydantic schemas
    inference.py     # prediction logic
    config.py
  models/
    model.joblib
  requirements.txt
  Dockerfile
  start.sh
Split responsibilities:
- main.py - FastAPI app and routing
- inference.py - model loading and prediction logic
- config.py - environment variable parsing
config.py keeps configuration out of the code, which matters as soon as you handle secrets or environment-specific values.
# app/config.py
from pydantic import BaseSettings  # with Pydantic v2, import BaseSettings from pydantic_settings

class Settings(BaseSettings):
    model_path: str = "./models/model.joblib"
    log_level: str = "info"
    max_batch_size: int = 32

settings = Settings()
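Because BaseSettings reads environment variables (matching field names case-insensitively), you can override any value at deploy time without touching code. An illustrative invocation:

MODEL_PATH=/opt/models/model-v2.joblib LOG_LEVEL=debug uvicorn app.main:app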
Then use it in your inference layer:
# app/inference.py
from typing import List

import joblib

from .config import settings

class ModelService:
    def __init__(self):
        self.model = joblib.load(settings.model_path)

    def predict(self, texts: List[str]):
        labels = self.model.predict(texts)
        probs = self.model.predict_proba(texts).max(axis=1)
        return list(zip(labels, probs))
And wire it into FastAPI:
# app/main.py
from fastapi import FastAPI

from .models import PredictRequest, PredictResponse, Prediction
from .inference import ModelService

app = FastAPI(title="Text Classifier API")

model_service: ModelService

@app.on_event("startup")
async def startup_event():
    global model_service
    model_service = ModelService()

@app.post("/predict", response_model=PredictResponse)
async def predict(req: PredictRequest):
    results = model_service.predict(req.texts)
    predictions = [
        Prediction(label=str(label), score=float(score))
        for label, score in results
    ]
    return PredictResponse(predictions=predictions)
Writing a production-friendly Dockerfile
A fragile Dockerfile is a silent source of pain in ML deployments. The environment needs to be stable, reproducible, and reasonably small.
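Reproducibility starts with pinned dependencies. An illustrative requirements.txt (versions are examples; pin whatever your service actually uses):

fastapi==0.110.0
uvicorn[standard]==0.29.0
pydantic==1.10.14
scikit-learn==1.4.2
joblib==1.4.0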
Basic Dockerfile for CPU inference
# Use a slim Python image
FROM python:3.11-slim AS base

# Set working directory
WORKDIR /app

# Install system dependencies (adjust as needed)
RUN apt-get update && apt-get install -y --no-install-recommends \
    build-essential \
    && rm -rf /var/lib/apt/lists/*

# Install Python dependencies first, to leverage Docker cache
COPY requirements.txt ./
RUN pip install --no-cache-dir -r requirements.txt

# Copy application code and start script
COPY app ./app
COPY models ./models
COPY start.sh ./
# chmod must run while we are still root, before switching users
RUN chmod +x start.sh

# Expose FastAPI default port
EXPOSE 8000

# Use a non-root user for better security
RUN useradd -m appuser
USER appuser

CMD ["./start.sh"]
And a simple start.sh that uses Uvicorn with multiple workers (each worker is a separate process that loads its own copy of the model, so memory usage scales with WORKERS):
#!/usr/bin/env bash
set -e
HOST=${HOST:-0.0.0.0}
PORT=${PORT:-8000}
WORKERS=${WORKERS:-2}
exec uvicorn app.main:app \
--host "$HOST" \
--port "$PORT" \
--workers "$WORKERS" \
--proxy-headers
During local development, you can build and run it with:
docker build -t text-classifier:latest .
docker run -p 8000:8000 text-classifier:latest
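Once the container is up, a quick smoke test against the endpoints defined earlier:

curl http://localhost:8000/healthz
curl -X POST http://localhost:8000/predict \
  -H "Content-Type: application/json" \
  -d '{"texts": ["great product", "terrible support"]}'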
The same Dockerfile structure works for heavier services too, such as a RAG API backed by a vector database.
GPU-enabled images
For deep learning models, you often need GPUs. With NVIDIA, the pattern is:
- Use an nvidia/cuda base image
- Install the right PyTorch / TensorFlow build
- Run with the --gpus flag
Example snippet:
FROM nvidia/cuda:12.1.0-runtime-ubuntu22.04
# then similar steps as before: install Python, dependencies, copy app
And run:
docker run --gpus all -p 8000:8000 your-image:tag
Align CUDA, PyTorch, and driver versions carefully. Mismatches are one of the most common sources of deployment failures.
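As a concrete sketch of that alignment, assuming CUDA 12.1 and the official PyTorch wheel index:

FROM nvidia/cuda:12.1.0-runtime-ubuntu22.04

# Install Python and pip (Ubuntu 22.04 ships Python 3.10)
RUN apt-get update && apt-get install -y --no-install-recommends \
    python3 python3-pip \
    && rm -rf /var/lib/apt/lists/*

# Install a PyTorch build compiled against the same CUDA version as the base image
RUN pip3 install --no-cache-dir torch --index-url https://download.pytorch.org/whl/cu121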
Performance and scalability considerations
Once the basic service runs, I usually look at three performance layers:
- request-level optimizations
- model-level optimizations
- system-level scaling
Request-level optimizations
- Batching - The API already takes List[str]. Many models are much more efficient on batches.
- Timeouts and limits - Cap the max batch size and request body size to avoid abuse.
In FastAPI you can enforce batch size:
from fastapi import HTTPException

from .config import settings

@app.post("/predict", response_model=PredictResponse)
async def predict(req: PredictRequest):
    if len(req.texts) > settings.max_batch_size:
        raise HTTPException(status_code=400, detail="Batch too large")
    # actual prediction logic...
For latency-sensitive services where multiple microservices are chained, these limits help avoid cascading failures.
Model-level optimizations
Two patterns pay off quickly:
- Use quantized or distilled models when possible
- Preload tokenizers and avoid re-initializing them per request
For transformer-based models:
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

class TransformerService:
    def __init__(self):
        self.device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
        # load the tokenizer from the same checkpoint as the model so vocab and weights match
        model_name = "distilbert-base-uncased-finetuned-sst-2-english"
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        self.model = AutoModelForSequenceClassification.from_pretrained(model_name).to(self.device)
        self.model.eval()

    def predict(self, texts):
        inputs = self.tokenizer(
            texts, padding=True, truncation=True, return_tensors="pt"
        ).to(self.device)
        with torch.no_grad():
            outputs = self.model(**inputs)
        probs = torch.softmax(outputs.logits, dim=-1)
        scores, labels = torch.max(probs, dim=-1)
        # map class ids to human-readable labels via the model config
        return [
            (self.model.config.id2label[int(label.item())], float(score.item()))
            for label, score in zip(labels, scores)
        ]
The structure mirrors what you would build for an embedding endpoint, except here we return class labels and scores instead of vectors.
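On the "quantized or distilled" point above: if you are serving on CPU, PyTorch's dynamic quantization is a low-effort option. A sketch, assuming a model whose Linear layers dominate inference cost:

import torch
from torch import nn

def quantize_for_cpu(model: nn.Module) -> nn.Module:
    # dynamically quantize Linear layers to int8; weights are converted ahead of time,
    # activations are quantized on the fly at inference
    return torch.quantization.quantize_dynamic(
        model.cpu().eval(), {nn.Linear}, dtype=torch.qint8
    )

Measure accuracy after quantizing; the speedup is usually worth a small metric drop, but verify on your own evaluation set.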
System-level scaling
For real traffic you will typically:
- run multiple container replicas
- put them behind a load balancer
- configure auto-scaling based on CPU or latency
You can quickly try horizontal scaling locally with Docker Compose:
version: "3.9"
services:
api:
image: text-classifier:latest
deploy:
replicas: 3
ports:
- "8000:8000"
In Kubernetes, you would add an HPA (Horizontal Pod Autoscaler) on CPU or custom metrics like request latency.
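A minimal sketch of a CPU-based HPA, assuming a Deployment named text-classifier (the name is illustrative):

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: text-classifier
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: text-classifier
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70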
Observability and reliability
Models rarely fail loudly. They drift, develop skew, or show rising error rates without ever throwing an exception. Observability is non-negotiable for any ML service, and doubly so for multimodal pipelines, where failures can hide in either modality.
Structured logging
Use structured logs instead of raw print statements so your logging backend can parse them.
import json
import logging

logger = logging.getLogger("ml-service")
logging.basicConfig(level=logging.INFO)

def log_prediction(input_count: int, latency_ms: float):
    logger.info(json.dumps({
        "event": "prediction",
        "inputs": input_count,
        "latency_ms": round(latency_ms, 2),
    }))
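To feed this helper, time the request in the endpoint itself (a middleware works too). A minimal sketch reusing the predict handler from earlier:

import time

@app.post("/predict", response_model=PredictResponse)
async def predict(req: PredictRequest):
    start = time.perf_counter()
    results = model_service.predict(req.texts)
    log_prediction(
        input_count=len(req.texts),
        latency_ms=(time.perf_counter() - start) * 1000,
    )
    predictions = [
        Prediction(label=str(label), score=float(score))
        for label, score in results
    ]
    return PredictResponse(predictions=predictions)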
Avoid logging raw user text. Redact or hash personally identifiable fields before they hit your logging backend.
Health checks and readiness
For container orchestrators, splitting liveness and readiness probes is crucial.
- Liveness - container is not stuck
- Readiness - model is loaded and ready to serve
Add a simple readiness flag:
from fastapi import FastAPI
from fastapi.responses import JSONResponse

app = FastAPI()
model_ready = False

@app.on_event("startup")
async def startup_event():
    global model_ready
    # load model here
    model_ready = True

@app.get("/readyz")
async def readyz():
    if not model_ready:
        # return a non-200 status so the orchestrator treats the pod as not ready
        return JSONResponse(status_code=503, content={"status": "not-ready"})
    return {"status": "ready"}
In Kubernetes, you would then configure readiness and liveness probes against /readyz and /healthz.
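The matching probe configuration in a Deployment spec might look like this (timings are illustrative):

livenessProbe:
  httpGet:
    path: /healthz
    port: 8000
  initialDelaySeconds: 5
  periodSeconds: 10
readinessProbe:
  httpGet:
    path: /readyz
    port: 8000
  initialDelaySeconds: 5
  periodSeconds: 5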
Extending to RAG and agents
The same patterns scale to more complex systems:
- A RAG service: FastAPI app that orchestrates retrieval, ranking, and generation, with each stage possibly being its own container.
- An AI agent backend: FastAPI app that exposes a single /agent endpoint, internally calling multiple tools or services.
The core ideas stay the same:
- keep the API contract explicit
- isolate model logic
- containerize with clear dependencies
- layer in observability and scalability
Once you have one robust Dockerized FastAPI service for a small model, reusing the same template for larger systems becomes mostly a matter of configuration and infrastructure.
Key Takeaways
- Design a clear API contract first, using Pydantic models to structure inputs and outputs.
- Load models once at startup, not per request, to avoid massive latency penalties.
- Use a clean project structure that separates API, inference logic, and configuration.
- Write a Dockerfile that prioritizes reproducibility, small images, and non-root execution.
- Optimize at three levels: request batching and limits, efficient model implementations, and system-level scaling.
- Add structured logging, health, and readiness endpoints to make the service observable and reliable.
- Reuse the same FastAPI + Docker patterns for RAG services, embedding APIs, and AI agents. If your service includes a retrieval component, evaluating its end-to-end performance becomes part of your deployment checklist.
- Treat privacy and security as first-class concerns, especially when logging or handling user data.