# CI/CD Pipelines for Machine Learning Projects
Software engineers have enjoyed solid CI/CD practices for years, while many ML teams still copy models by hand to servers and tweak things in production. The result is fragile systems, hard-to-reproduce bugs, and inconsistent results between experiments and deployment.
In ML, the challenge is not just deploying code. We also deploy data, models, and infrastructure. That makes CI/CD more complex, but also more valuable when it is done well.
This post walks through how to approach CI/CD for ML projects, from simple research repos to production RAG systems.
## Why CI/CD for ML is different
Traditional CI/CD focuses on building, testing, and deploying code. In ML projects, we need to think in terms of four moving pieces:
- Code - model definitions, data pipelines, API servers, RAG orchestration logic.
- Data - training data, evaluation datasets, prompts, retrieval corpora.
- Models - versioned artifacts, checkpoints, embedding models, LLM adapters.
- Infrastructure - vector databases, GPUs, workers, API gateways.
A robust ML CI/CD pipeline needs to:
- Keep experiments reproducible.
- Prevent regressions in prediction quality.
- Protect user data through anonymization and differential privacy.
- Make deployments boring and reversible.
I like to think of it in three stages:
- CI for research - fast feedback on notebooks and experiments.
- CI for production - serious tests, checks, and quality gates.
- CD for production - safe, automated deployment of models and services.
Let's walk through a concrete architecture.
## Repository structure for ML CI/CD
A clean repo layout makes CI/CD much easier. For a typical ML service or RAG system, something like this works well:
```
.
├── src/
│   ├── app/            # FastAPI / Flask app, RAG orchestration
│   ├── models/         # Model definitions, wrappers, inference logic
│   └── data/           # Data loaders, feature extraction, chunking
├── tests/
│   ├── unit/
│   ├── integration/
│   └── regression/     # Model quality tests
├── pipelines/
│   ├── train.py        # Training / fine-tuning entrypoint
│   ├── evaluate.py     # Evaluation and metrics
│   ├── preprocess.py   # Data clean / chunk / index
│   └── export.py       # Create deployable model artifact
├── configs/
│   ├── model.yaml
│   ├── data.yaml
│   └── deployment.yaml
├── docker/
│   ├── Dockerfile.api
│   └── Dockerfile.worker
├── .github/workflows/  # Or .gitlab-ci.yml, etc.
└── requirements.txt / pyproject.toml
```
## Continuous Integration (CI) for ML projects
### 1. Basic CI: linting, formatting, unit tests
This part is standard [Python engineering](/blog/python-best-practices-for-ml-engineers): formatting, static analysis, and unit tests.
Typical steps:
- Install dependencies.
- Run formatting (black, isort).
- Run static analysis (mypy, ruff).
- Run unit tests (pytest) with coverage.
Example GitHub Actions workflow (simplified):
```yaml
name: CI

on:
  push:
    branches: [ main ]
  pull_request:

jobs:
  tests:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: '3.11'
      - name: Install dependencies
        run: |
          pip install -U pip
          pip install -e .[dev]
      - name: Lint and type-check
        run: |
          ruff check src tests
          mypy src
      - name: Run tests
        run: pytest -q
```
Nothing ML-specific yet, but this is your foundation.
### 2. Data and schema checks
Data is one of the most common failure points. The CI should reject PRs that change data format or violate basic assumptions.
Useful techniques:
- Schema checks with pydantic models.
- Statistical checks (null ratios, category coverage).
- Consistency checks for RAG corpora (no empty chunks, correct metadata).
Example: pydantic-based schema for an input sample.
```python
from typing import Optional

from pydantic import BaseModel, Field

class DocumentChunk(BaseModel):
    id: str
    text: str = Field(min_length=1)
    source: str
    metadata: dict

class QueryRequest(BaseModel):
    query: str = Field(min_length=3)
    top_k: int = Field(gt=0, le=50)
    filters: Optional[dict] = None
```
You can then add tests that verify your data pipeline respects these schemas:
```python
from src.data.loader import load_indexed_documents
from src.schemas import DocumentChunk

def test_loaded_documents_match_schema():
    docs = load_indexed_documents(limit=100)
    for d in docs:
        DocumentChunk(**d)  # raises if invalid
```
For larger projects, libraries like `great_expectations` can formalize these expectations and integrate them into CI.
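You can go surprisingly far before reaching for a framework. As a minimal sketch of the statistical checks mentioned above, here is a plain null-ratio check that runs as an ordinary pytest test (the `load_training_frame` helper referenced in the comment is hypothetical):

```python
import math

def null_ratio(values: list) -> float:
    """Fraction of entries that are None or NaN."""
    if not values:
        return 0.0
    missing = sum(
        1 for v in values
        if v is None or (isinstance(v, float) and math.isnan(v))
    )
    return missing / len(values)

def test_label_column_rarely_missing():
    # In a real suite this list would come from your data loader,
    # e.g. load_training_frame()["label"] (hypothetical helper).
    labels = ["spam", "ham", None, "spam", "ham", "ham", "spam", "ham"]
    assert null_ratio(labels) <= 0.2, "too many missing labels"
```

The same pattern extends to category coverage or value-range checks; once these grow numerous, migrating them to a dedicated framework becomes worthwhile.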
### 3. Model quality regression tests
Here is where ML-specific CI starts.
You do not want a PR to silently degrade model quality. I usually maintain:
- A small golden dataset for fast checks on every PR.
- Larger evaluation datasets for nightly or pre-release runs.
This follows the same principles used when evaluating RAG system performance: define metrics upfront, automate measurement, and fail on regressions.
At minimum, you can:
- Commit a small labeled dataset to the repo (or fetch from a secure artifact store).
- Write an evaluation script that outputs metrics.
- Fail CI if metrics drop beyond a threshold compared to the main branch or baseline.
Example evaluation in Python:
```python
import json
from pathlib import Path

from src.models import load_model
from src.metrics import accuracy_score

def evaluate_on_golden_set(model_path: str) -> float:
    model = load_model(model_path)
    data = json.loads(Path("tests/data/golden_set.json").read_text())
    y_true, y_pred = [], []
    for sample in data:
        y_true.append(sample["label"])
        y_pred.append(model.predict(sample["text"]))
    return accuracy_score(y_true, y_pred)

if __name__ == "__main__":
    score = evaluate_on_golden_set("artifacts/latest/model.pt")
    print(f"accuracy={score:.3f}")
```
Then in CI, run this script and parse the result.
```bash
python pipelines/evaluate.py > metrics.txt
ACC=$(grep "accuracy=" metrics.txt | cut -d'=' -f2)
MIN_ACC=0.85

python - <<EOF
acc = float("$ACC")
min_acc = float("$MIN_ACC")
if acc < min_acc:
    raise SystemExit(f"Accuracy {acc:.3f} below threshold {min_acc:.3f}")
EOF
```
For RAG, quality checks could cover:
- Retrieval hit rate on a QA set.
- Exact match / F1 for answers.
- Latency and cost per request.
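Of these, retrieval hit rate is usually the cheapest to automate. A minimal sketch, assuming a QA set of `{"question", "gold_doc_id"}` records and a retriever callable that returns ranked document ids (both hypothetical interfaces, not from the repo above):

```python
def retrieval_hit_rate(qa_set, retrieve, top_k: int = 5) -> float:
    """Fraction of questions whose gold document appears in the top-k results.

    qa_set:   iterable of {"question": str, "gold_doc_id": str} dicts
    retrieve: callable mapping a question to a ranked list of document ids
    """
    hits = 0
    total = 0
    for sample in qa_set:
        total += 1
        retrieved = retrieve(sample["question"])[:top_k]
        if sample["gold_doc_id"] in retrieved:
            hits += 1
    return hits / total if total else 0.0
```

Wired into CI the same way as the accuracy gate above, this catches regressions from index rebuilds or embedding-model swaps before they reach users.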
### 4. Privacy and security checks
If you work with sensitive data, your CI/CD pipeline must help enforce privacy commitments.
Ideas:
- Static checks to prevent logging of raw PII.
- Unit tests for anonymization / pseudonymization functions.
- Verifying that differential privacy parameters stay within acceptable ranges.
Example unit test to ensure logging does not expose raw user text:
```python
from src.logging_utils import sanitize_log

def test_sanitize_log_removes_emails_and_phones():
    raw = "User jane.doe@example.com called from +1-555-123-4567"
    sanitized = sanitize_log(raw)
    assert "jane.doe@example.com" not in sanitized
    assert "+1-555-123-4567" not in sanitized
```
CI should run these tests on every change to the relevant modules.
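The test above assumes a `sanitize_log` implementation exists in the repo. As a minimal sketch of what it might look like — deliberately simple regexes, not a substitute for a vetted PII-redaction library:

```python
import re

# Simple illustrative patterns; production redaction should rely on a
# well-tested PII library rather than hand-rolled regexes.
_EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
_PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")

def sanitize_log(message: str) -> str:
    """Replace email addresses and phone-like sequences with placeholders."""
    message = _EMAIL_RE.sub("[EMAIL]", message)
    message = _PHONE_RE.sub("[PHONE]", message)
    return message
```

Keeping the function pure and dependency-free makes it cheap to unit-test on every PR, which is exactly what this quality gate needs.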
## Continuous Delivery (CD) of ML models and services
Once CI passes, you want a reliable way to ship models and services. A common baseline is serving behind FastAPI in Docker, then layering on artifact versioning and staged rollouts.
A typical ML CD pipeline involves:
- Building a Docker image with the serving code and environment.
- Downloading or embedding a specific model artifact.
- Pushing the image to a registry.
- Updating the deployment (Kubernetes, ECS, serverless, etc.).
- Running smoke tests and rolling out gradually.
### 1. Reproducible model artifacts
Separate model training from model serving.
- The training pipeline outputs a versioned artifact (model weights, tokenizer, config).
- The serving container pulls an exact artifact version.
You can store artifacts in:
- S3 / GCS
- MLflow Model Registry
- Weights & Biases Artifacts
Example `export.py` to package a model for serving:

```python
import joblib
from pathlib import Path

from src.models import load_trained_model

def export_model(model_dir: str, version: str) -> str:
    model = load_trained_model()
    path = Path(model_dir) / f"model_{version}.joblib"
    joblib.dump(model, path)
    return str(path)

if __name__ == "__main__":
    version = "2024-02-09-01"
    artifact_path = export_model("artifacts/", version)
    print(f"exported_path={artifact_path}")
```
CD will then reference this version when building and deploying.
### 2. Docker-based deployment
A simplified Dockerfile for a FastAPI inference service:
```dockerfile
FROM python:3.11-slim

WORKDIR /app

COPY requirements.txt ./
RUN pip install -U pip && pip install -r requirements.txt

COPY src ./src
COPY configs ./configs

# Environment variable to select the model artifact
ENV MODEL_VERSION="latest"

CMD ["uvicorn", "src.app.main:app", "--host", "0.0.0.0", "--port", "8080"]
```
Your CD pipeline can build the image and tag it with both the git SHA and model version.
```bash
MODEL_VERSION=2024-02-09-01
GIT_SHA=$(git rev-parse --short HEAD)
IMAGE_TAG="registry.example.com/ml-service:${MODEL_VERSION}-${GIT_SHA}"

docker build -t "$IMAGE_TAG" -f docker/Dockerfile.api .
docker push "$IMAGE_TAG"
```
### 3. Environment-specific deployment stages
A classic pattern:
- `staging` - deployed on every merge to `main`, used for QA.
- `production` - deployed on manual approval or tag.
In GitHub Actions, you might define two workflows or two jobs:
```yaml
jobs:
  deploy-staging:
    needs: tests
    if: github.ref == 'refs/heads/main'
    runs-on: ubuntu-latest
    steps:
      # build & push image
      # apply k8s manifests to staging

  deploy-production:
    needs: tests
    if: startsWith(github.ref, 'refs/tags/v')
    runs-on: ubuntu-latest
    steps:
      # build & push image with prod tag
      # apply k8s manifests to production
```
For Kubernetes, use separate namespaces and configs. `deployment.yaml` can reference different resource limits, vector DB endpoints, and feature flags for each environment.
### 4. Smoke tests and canary releases
Blindly deploying a new model to all users is risky. Two improvements:
- Smoke tests - run a set of test requests against the newly deployed service.
- Canary release - route a small percentage of traffic to the new version.
Smoke tests can be a simple Python script:
```python
import requests

def smoke_test(base_url: str) -> None:
    r = requests.post(
        f"{base_url}/predict",
        json={"text": "Test input"},
        timeout=5,
    )
    r.raise_for_status()
    data = r.json()
    assert "prediction" in data

if __name__ == "__main__":
    smoke_test("https://staging.example.com")
```
Run this from CD right after deployment. If it fails, roll back automatically.
For canaries, the implementation depends on your stack (Kubernetes, API gateway, service mesh), but the CI/CD side can:
- Label the new deployment as `canary`.
- Trigger traffic-splitting rules.
- Collect metrics (latency, error rate, key model metrics) for a few minutes.
This pattern is especially useful for RAG systems where new embedding models, prompt templates, or chunking strategies can change behavior in unexpected ways.
## CI/CD for RAG and multi-agent systems
When dealing with RAG or multimodal pipelines, there are extra moving parts:
- Indexing pipelines for the document store.
- Multiple services (retriever API, generator API, orchestrator).
- Vector DB migrations and performance constraints.
In CI/CD, you should:
- Version your index build pipeline. A new index should be built and validated in staging before promoting it to production.
- Add performance tests for retrieval latency and throughput.
- Include integration tests that exercise realistic flows (multi-step tools, agents calling retrievers, etc.).
Example integration test for a RAG pipeline:
```python
from src.app.client import RAGClient

def test_rag_answer_contains_source():
    client = RAGClient(base_url="http://localhost:8080")
    resp = client.ask("What is the return policy?")
    assert "answer" in resp
    assert resp["sources"], "RAG must return at least one source"
    for s in resp["sources"]:
        assert "url" in s and "snippet" in s
```
This type of test can run in CI using `docker-compose` to spin up the API and vector DB.
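A hypothetical compose file for that CI setup might look like the following; the Qdrant image is just one example of a vector DB, and the service names and ports are assumptions, not conventions from the repo above:

```yaml
# docker-compose.ci.yml - hypothetical setup for integration tests
services:
  vector-db:
    image: qdrant/qdrant:latest   # swap in your vector DB of choice
    ports:
      - "6333:6333"
  api:
    build:
      context: .
      dockerfile: docker/Dockerfile.api
    environment:
      VECTOR_DB_URL: http://vector-db:6333
    ports:
      - "8080:8080"
    depends_on:
      - vector-db
```

CI then brings the stack up, waits for health checks, runs the integration tests against `localhost:8080`, and tears everything down.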
## Making CI/CD sustainable for ML teams
A few practical tips so your pipeline does not collapse under its own weight:
- Keep fast checks fast: under 10 minutes for PRs. Push heavy evaluations to nightly jobs.
- Cache aggressively: Python wheels, model downloads, built Docker layers.
- Make failures actionable: if a quality gate fails, show a clear diff vs previous metrics.
- Automate reproducibility: store config files and random seeds with model artifacts.
- Document the flow: developers should know what happens when they open a PR or tag a release.
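To make the "actionable failures" point concrete, a small sketch of a metric-diff report that a quality gate could print when it fails (the metric names and tolerance are illustrative):

```python
def metrics_diff(baseline: dict, current: dict, tolerance: float = 0.01):
    """Compare current metrics to a baseline; return (report_lines, failed).

    A metric fails if it drops more than `tolerance` below its baseline
    value, or disappears entirely.
    """
    lines, failed = [], False
    for name, base in sorted(baseline.items()):
        cur = current.get(name)
        if cur is None:
            lines.append(f"{name}: MISSING (baseline {base:.3f})")
            failed = True
            continue
        delta = cur - base
        status = "FAIL" if delta < -tolerance else "ok"
        if status == "FAIL":
            failed = True
        lines.append(f"{name}: {base:.3f} -> {cur:.3f} ({delta:+.3f}) {status}")
    return lines, failed
```

Printing this diff in the CI log (or as a PR comment) tells the author exactly which metric moved and by how much, instead of a bare red cross.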
Once teams experience the reliability of a well-tuned ML CI/CD pipeline, they rarely want to go back to manual experiments and ad hoc deployments.
## Key Takeaways
- Treat ML CI/CD as managing code, data, models, and infrastructure together.
- Start with standard CI practices, then add data validation and model quality checks.
- Maintain small golden datasets for fast regression tests on every PR.
- Enforce privacy constraints in CI with tests for logging and anonymization.
- Separate training from serving, and version model artifacts explicitly.
- Use Docker-based deployments with environment-specific configs and stages.
- Add smoke tests and, where possible, canary releases to catch bad deployments.
- For RAG and multi-agent systems, test the full pipeline, not just individual components.
- Keep PR pipelines fast, push heavy evaluations and benchmarks to scheduled jobs.
- Document the pipeline so that every engineer understands how models reach production.