Python Best Practices for ML Engineers
Good ML work is not only about the model. For every result, there is a pile of glue code: data loaders, feature pipelines, training scripts, evaluation, inference services. If that code is fragile or messy, experiments are hard to reproduce and deployment becomes painful.
As an ML engineer, Python is often your primary tool. Treating it with the same discipline that backend engineers apply to their services pays off fast, especially when you start shipping models to production or building RAG systems.
This article covers Python best practices I use in day-to-day ML work, from project structure and typing to performance and deployment.
1. Structure projects like software, not notebooks
Most ML projects start in a notebook. That is fine, but notebooks should not be the final form of your code.
From notebook chaos to a simple project layout
A simple, clean structure is already a big win:
my_project/
  pyproject.toml          # or setup.cfg / requirements.txt
  README.md
  src/
    my_project/
      __init__.py
      config.py
      data/
        loaders.py
        preprocess.py
      models/
        base.py
        classifier.py
      training/
        train.py
        eval.py
      inference/
        serve.py
  notebooks/
    01_exploration.ipynb
    02_modeling.ipynb
  tests/
    test_data_loaders.py
    test_models.py
Key ideas:
- Keep production code in src/my_project, not in notebooks
- Use notebooks only for exploration, visualizations, and quick prototypes
- Promote stable notebook logic into modules in src/ as soon as it solidifies
This separation between application code, experimentation, and deployment scales well as your project grows.
2. Use virtual environments and pinned dependencies
Reproducibility is fundamental. Nothing kills productivity faster than "it worked yesterday".
Virtual environments
Use venv or a tool like conda, poetry, or uv.
python -m venv .venv
source .venv/bin/activate  # Windows: .venv\Scripts\activate
pip install -U pip
Pin versions
Use a requirements.txt or pyproject.toml with explicit versions:
numpy==1.26.4
pandas==2.2.0
scikit-learn==1.4.0
pydantic==2.6.1
fastapi==0.110.0
uvicorn[standard]==0.27.0
For research you can be slightly looser, but when something goes to production or is part of a pipeline (RAG retrieval services, vector database clients, etc.), pin versions and document how to recreate the environment.
3. Embrace typing and static analysis
ML code is often full of dictionaries, tensors, and nested configs. Without types, things get confusing quickly.
Type hints for clarity and safety
Use Python type hints consistently:
from typing import Tuple

import numpy as np

Array = np.ndarray

def train_val_split(
    X: Array,
    y: Array,
    val_ratio: float = 0.2,
) -> Tuple[Array, Array, Array, Array]:
    """Split features and labels into train and validation sets."""
    n_samples = X.shape[0]
    n_val = int(n_samples * val_ratio)
    if n_val == 0:
        raise ValueError("val_ratio too small for the number of samples")
    X_train, X_val = X[:-n_val], X[-n_val:]
    y_train, y_val = y[:-n_val], y[-n_val:]
    return X_train, X_val, y_train, y_val
This makes the expected inputs and outputs explicit. Static analyzers like mypy can catch many bugs before you run expensive training.
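A minimal mypy configuration in pyproject.toml gets you most of the value. The options below are a sketch, not a one-size-fits-all setup; tune them per project:

```toml
[tool.mypy]
python_version = "3.11"
disallow_untyped_defs = true
check_untyped_defs = true
warn_unused_ignores = true
ignore_missing_imports = true  # many ML libraries ship without type stubs
```

Then run mypy src/ locally or in CI before kicking off a training run.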
Dataclasses and Pydantic for configs
Model and training configurations should be structured, not random dictionaries.
from dataclasses import dataclass

@dataclass
class TrainingConfig:
    batch_size: int = 32
    learning_rate: float = 3e-4
    num_epochs: int = 10
    seed: int = 42

config = TrainingConfig(batch_size=64)
print(config)
For configs that come from JSON/YAML or env variables, I like Pydantic because it validates and documents at the same time:
from pydantic import BaseModel, Field, ValidationError

class RAGConfig(BaseModel):
    top_k: int = Field(10, ge=1, le=100)
    embedding_model: str = "sentence-transformers/all-MiniLM-L6-v2"
    vector_db_url: str

try:
    cfg = RAGConfig(vector_db_url="http://localhost:6333")
except ValidationError as e:
    print("Invalid config:", e)
Strongly typed configs make configuration choices traceable and safe.
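If Pydantic is not available, the same idea can be sketched with stdlib dataclasses and json. The load_config helper and its validation rules here are hypothetical, not a library API:

```python
import json
from dataclasses import dataclass, fields

@dataclass
class RAGConfig:
    vector_db_url: str
    top_k: int = 10
    embedding_model: str = "sentence-transformers/all-MiniLM-L6-v2"

def load_config(raw: str) -> RAGConfig:
    """Parse a JSON string into a RAGConfig, rejecting unknown keys."""
    data = json.loads(raw)
    known = {f.name for f in fields(RAGConfig)}
    unknown = set(data) - known
    if unknown:
        raise ValueError(f"Unknown config keys: {sorted(unknown)}")
    cfg = RAGConfig(**data)
    # Hand-rolled range check; Pydantic's Field(ge=1, le=100) does this for you
    if not 1 <= cfg.top_k <= 100:
        raise ValueError("top_k must be between 1 and 100")
    return cfg

cfg = load_config('{"vector_db_url": "http://localhost:6333", "top_k": 20}')
print(cfg.top_k)  # 20
```

You lose coercion and nice error messages compared to Pydantic, but you keep the key property: invalid configs fail loudly at load time, not mid-training.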
4. Write small, composable functions and modules
Long scripts mixing data loading, feature engineering, training, and evaluation are painful to change. Aim for small functions that each do one thing.
from pathlib import Path
from typing import Tuple

import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

def load_data(path: Path) -> pd.DataFrame:
    return pd.read_csv(path)

def preprocess(df: pd.DataFrame) -> Tuple[pd.DataFrame, pd.Series]:
    X = df.drop(columns=["label"])
    y = df["label"]
    return X, y

def train_model(X_train: pd.DataFrame, y_train: pd.Series) -> LogisticRegression:
    clf = LogisticRegression(max_iter=1000)
    clf.fit(X_train, y_train)
    return clf

def evaluate_model(clf: LogisticRegression, X_val: pd.DataFrame, y_val: pd.Series) -> float:
    y_pred = clf.predict(X_val)
    return f1_score(y_val, y_pred)

if __name__ == "__main__":
    df = load_data(Path("data/dataset.csv"))
    X, y = preprocess(df)
    X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)
    model = train_model(X_train, y_train)
    score = evaluate_model(model, X_val, y_val)
    print(f"Validation F1: {score:.3f}")
This style makes it easier to reuse load_data and preprocess in later RAG embedding pipelines or during online inference.
5. Logging, not printing
print is fine while exploring, but in training jobs, RAG services, or multi-agent systems, you want structured logs.
import logging
from pathlib import Path

def setup_logger(log_path: Path) -> logging.Logger:
    logger = logging.getLogger("ml_project")
    logger.setLevel(logging.INFO)
    if logger.handlers:
        # Avoid attaching duplicate handlers if called more than once
        return logger
    log_path.parent.mkdir(parents=True, exist_ok=True)
    fmt = logging.Formatter("%(asctime)s - %(levelname)s - %(message)s")
    ch = logging.StreamHandler()
    ch.setFormatter(fmt)
    logger.addHandler(ch)
    fh = logging.FileHandler(log_path)
    fh.setFormatter(fmt)
    logger.addHandler(fh)
    return logger

logger = setup_logger(Path("logs/train.log"))

def train_epoch(epoch: int) -> None:
    logger.info("Starting epoch %d", epoch)
    # training loop
    logger.info("Finished epoch %d", epoch)
Structured logging becomes essential when you deploy with FastAPI or need to evaluate system performance across multiple components.
6. Testing: start small but start early
You do not need a full TDD setup; even a handful of tests prevents painful bugs.
Unit tests for core utilities
Focus tests on:
- Data loading and preprocessing
- Feature transformations
- Evaluation metrics and custom losses
- Critical business logic (e.g., RAG retrieval and ranking)
Using pytest:
# src/my_project/data/loaders.py
from pathlib import Path

import pandas as pd

def load_csv(path: Path) -> pd.DataFrame:
    df = pd.read_csv(path)
    if "label" not in df.columns:
        raise ValueError("Missing 'label' column")
    return df

# tests/test_loaders.py
from pathlib import Path

import pandas as pd
import pytest

from my_project.data.loaders import load_csv

def test_load_csv_requires_label(tmp_path: Path):
    path = tmp_path / "data.csv"
    df = pd.DataFrame({"x": [1, 2, 3]})
    df.to_csv(path, index=False)
    with pytest.raises(ValueError, match="label"):
        load_csv(path)
When you move from toy scripts to production RAG pipelines, tests around chunking and retrieval become critical for catching regressions.
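As an illustration, here is what such a test might look like for a deliberately simple, hypothetical character-level chunker (chunk_text is not a real library function; production chunkers usually split on tokens or sentences, but the testing idea is the same):

```python
from typing import List

def chunk_text(text: str, chunk_size: int, overlap: int = 0) -> List[str]:
    """Split text into fixed-size character chunks with optional overlap."""
    if chunk_size <= 0 or overlap < 0 or overlap >= chunk_size:
        raise ValueError("need 0 <= overlap < chunk_size and chunk_size > 0")
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, max(1, len(text) - overlap), step)]

def test_chunks_cover_full_text() -> None:
    text = "abcdefghij"
    chunks = chunk_text(text, chunk_size=4, overlap=1)
    # Dropping the overlapping prefix of each later chunk must rebuild the input
    rebuilt = chunks[0] + "".join(c[1:] for c in chunks[1:])
    assert rebuilt == text

test_chunks_cover_full_text()
```

A test like this encodes an invariant ("no text is lost by chunking") that silently breaks surprisingly often when chunking parameters change.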
7. Performance: profile before optimizing
Python is rarely the performance bottleneck in ML (model libraries are usually in C++ or CUDA), but preprocessing, feature engineering, and RAG retrieval pipelines can become slow.
Use vectorized operations
Avoid Python loops over large arrays. Use NumPy, pandas, or vectorized libraries.
import numpy as np

def slow_normalize(vectors: np.ndarray) -> np.ndarray:
    # Bad: Python loop over rows
    norms = []
    for v in vectors:
        norms.append(v / np.linalg.norm(v))
    return np.stack(norms)

def fast_normalize(vectors: np.ndarray) -> np.ndarray:
    # Good: one vectorized operation over the whole array
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    return vectors / norms
For embedding pipelines, batching operations is usually better than many small calls.
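A sketch of that batching pattern, with fake_embed standing in for a real embedding model call (both function names are illustrative, not a real API):

```python
from typing import Callable, List

import numpy as np

def embed_in_batches(
    texts: List[str],
    embed_batch: Callable[[List[str]], np.ndarray],
    batch_size: int = 64,
) -> np.ndarray:
    """Embed texts in fixed-size batches instead of one call per text."""
    parts = [
        embed_batch(texts[i:i + batch_size])
        for i in range(0, len(texts), batch_size)
    ]
    return np.concatenate(parts, axis=0)

# Stand-in for a real model call: returns one 4-dim vector per input text
def fake_embed(batch: List[str]) -> np.ndarray:
    return np.array([[len(t), 0.0, 0.0, 1.0] for t in batch])

vectors = embed_in_batches([f"doc {i}" for i in range(130)], fake_embed, batch_size=64)
print(vectors.shape)  # (130, 4)
```

With a real model, each batch amortizes fixed per-call overhead (network round trips, GPU kernel launches), which is where most of the speedup comes from.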
Profile with cProfile
python -m cProfile -o profile.out src/my_project/training/train.py
Then use snakeviz or similar tools to inspect hotspots.
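You can also inspect a profile programmatically with the stdlib pstats module, which is handy in CI or on headless machines; busy is a stand-in for your real entry point:

```python
import cProfile
import io
import pstats

def busy() -> int:
    # Stand-in workload; replace with your real training or retrieval code
    return sum(i * i for i in range(10_000))

profiler = cProfile.Profile()
profiler.enable()
busy()
profiler.disable()

stream = io.StringIO()
stats = pstats.Stats(profiler, stream=stream)
stats.sort_stats("cumulative").print_stats(5)  # top 5 entries by cumulative time
print("busy" in stream.getvalue())
```

Sorting by "cumulative" surfaces the high-level functions whose call trees dominate runtime, which is usually what you want first.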
8. Configuration and reproducibility
ML experiments must be reproducible, otherwise benchmarks and improvements are unreliable.
Fix random seeds
import os
import random

import numpy as np
import torch

def set_seed(seed: int) -> None:
    os.environ["PYTHONHASHSEED"] = str(seed)
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False
Use the same function in every script (training, evaluation, RAG indexing) to minimize differences.
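A quick sanity check that seeding actually makes runs repeatable; the torch lines are omitted here so the sketch runs with NumPy alone:

```python
import random

import numpy as np

def set_seed(seed: int) -> None:
    # Torch calls omitted in this sketch; same idea with stdlib + NumPy only
    random.seed(seed)
    np.random.seed(seed)

set_seed(42)
a = np.random.rand(3)
set_seed(42)
b = np.random.rand(3)
print(np.allclose(a, b))  # True
```

If this check ever fails in your codebase, some component is drawing randomness from a source you have not seeded.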
Centralize config
Do not scatter magic numbers across your code. Even a simple config file helps:
# src/my_project/config.py
from dataclasses import dataclass

@dataclass
class ModelConfig:
    model_name: str = "bert-base-uncased"
    max_length: int = 128

@dataclass
class TrainingConfig:
    batch_size: int = 32
    lr: float = 2e-5
    num_epochs: int = 3

model_config = ModelConfig()
training_config = TrainingConfig()
When working on RAG systems or multimodal pipelines, this becomes even more important because you have many moving parts: embedding models, chunk sizes, retriever settings, LLM parameters. Typed configs make everything explicit.
9. Clean interfaces for models and pipelines
Treat your models and pipelines as components with clear interfaces.
A simple model wrapper
from typing import List

import numpy as np
from sklearn.linear_model import LogisticRegression

class Classifier:
    def __init__(self, model: LogisticRegression) -> None:
        self.model = model

    def predict_proba(self, X: np.ndarray) -> np.ndarray:
        return self.model.predict_proba(X)

    def predict_labels(self, X: np.ndarray, threshold: float = 0.5) -> List[int]:
        probs = self.predict_proba(X)[:, 1]
        return [int(p >= threshold) for p in probs]
This makes it easier to swap models or plug the classifier into a FastAPI service.
RAG style pipeline
If you are working with RAG, define clear interfaces between components: retriever, ranker, generator.
from typing import List, Protocol

class Retriever(Protocol):
    def retrieve(self, query: str, top_k: int = 5) -> List[str]:
        ...

class Generator(Protocol):
    def generate(self, prompt: str) -> str:
        ...

class SimpleRAGPipeline:
    def __init__(self, retriever: Retriever, generator: Generator) -> None:
        self.retriever = retriever
        self.generator = generator

    def answer(self, query: str, top_k: int = 5) -> str:
        docs = self.retriever.retrieve(query, top_k=top_k)
        context = "\n".join(docs)
        prompt = f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
        return self.generator.generate(prompt)
Using Protocol keeps your code flexible. You can plug in different vector databases or LLM providers without changing the pipeline logic.
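For example, a toy in-memory retriever and generator can satisfy these Protocols. KeywordRetriever and EchoGenerator below are illustrative stand-ins, not real components:

```python
from typing import List, Protocol

class Retriever(Protocol):
    def retrieve(self, query: str, top_k: int = 5) -> List[str]:
        ...

class Generator(Protocol):
    def generate(self, prompt: str) -> str:
        ...

class KeywordRetriever:
    """Toy retriever: ranks documents by words shared with the query."""
    def __init__(self, docs: List[str]) -> None:
        self.docs = docs

    def retrieve(self, query: str, top_k: int = 5) -> List[str]:
        terms = set(query.lower().split())
        scored = sorted(
            self.docs,
            key=lambda d: len(terms & set(d.lower().split())),
            reverse=True,
        )
        return scored[:top_k]

class EchoGenerator:
    """Toy generator that just returns the prompt it was given."""
    def generate(self, prompt: str) -> str:
        return prompt

retriever: Retriever = KeywordRetriever(["python typing guide", "cooking pasta"])
generator: Generator = EchoGenerator()
print(retriever.retrieve("python typing", top_k=1))  # ['python typing guide']
```

Because Protocols use structural typing, neither class needs to inherit from anything; a real vector-database client with the same method signatures drops in without changes to the pipeline.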
10. Production mindset: from scripts to services
Once a model or RAG pipeline is useful, someone will eventually want to call it via an API. If you have followed the best practices above, this step is much easier.
A minimal FastAPI inference service
from typing import List

import numpy as np
from fastapi import FastAPI
from pydantic import BaseModel

from my_project.models.classifier import Classifier, load_trained_model

class PredictRequest(BaseModel):
    features: List[List[float]]

class PredictResponse(BaseModel):
    predictions: List[int]

app = FastAPI()
model: Classifier | None = None

@app.on_event("startup")
def load_model() -> None:
    global model
    model = load_trained_model()

@app.post("/predict", response_model=PredictResponse)
async def predict(req: PredictRequest) -> PredictResponse:
    assert model is not None
    X = np.array(req.features, dtype=float)
    preds = model.predict_labels(X)
    return PredictResponse(predictions=preds)
The earlier best practices around typing, small modules, and configuration make such services easier to reason about, test, and monitor in production.
11. Privacy and security awareness
If your ML system touches user data, privacy is not optional. Practical tips:
- Centralize all data access and anonymization logic
- Never log raw user data or secrets
- Use environment variables or secret managers for API keys
- Consider tokenization or pseudonymization before sending data to third-party APIs
A simple pattern for secrets:
import os

from pydantic import BaseModel, SecretStr

class Settings(BaseModel):
    openai_api_key: SecretStr

settings = Settings(openai_api_key=os.environ["OPENAI_API_KEY"])
Then pass settings around instead of sprinkling os.environ[...] everywhere.
Key Takeaways
- Treat ML projects as software projects: structure your code, not just notebooks
- Use virtual environments and pinned dependencies for reproducible experiments
- Add type hints, dataclasses, and Pydantic configs to make your code self-documenting
- Write small, composable functions and modules with clear responsibilities
- Prefer logging over prints, especially for training jobs and services
- Start testing early, focusing on data loading, preprocessing, and core logic
- Optimize only after profiling, and favor vectorized operations and batching
- Centralize configuration and fix random seeds for reproducibility
- Design clean interfaces for models and RAG pipelines so components are swappable
- Prepare for production by exposing models via typed, well structured APIs
- Keep privacy and security in mind whenever you touch user or sensitive data