LLMOps: Managing LLM-Powered Production Systems
A production AI system in 2026 is not a model behind an API. It is an orchestration of foundation models, fine-tuned adapters, retrieval systems, guardrails, routing logic, caching layers, and fallback chains. When something breaks (and it will), you need to know whether the problem is in your prompt, your retrieval pipeline, your guardrails, or the upstream model provider that quietly changed their API response format at 2 AM.
This is what LLMOps is about. Not a rebranding of MLOps, but a genuine extension of it to handle the unique operational challenges of LLM-powered systems. According to a 2025 survey by Weights & Biases, 67% of failed AI deployments cited inadequate infrastructure and operations as a primary factor. Not bad models; bad operations.
How LLMOps Differs from Traditional MLOps
Traditional MLOps was built for a world where you train a model, validate it, deploy it, and monitor its predictions. The model is a relatively self-contained artifact. LLM-powered systems break this paradigm in several ways.
The model is external. When you use GPT-4o, Claude, or Gemini through an API, you do not control the model. It can change without notice. Anthropic or OpenAI may update weights, adjust safety filters, or deprecate versions. Your system's behavior depends on an artifact you cannot pin, version, or reproduce. Even with open-weight models, the combination of base model, adapter, quantization, and runtime settings creates a complex versioning problem.
Prompts are code. In traditional ML, the model weights encode behavior. In LLM systems, prompts are a critical part of the behavior definition. A one-word change to a system prompt can dramatically alter outputs. Yet most teams manage prompts in ad-hoc ways: hardcoded strings, shared documents, or scattered config files. Prompts need the same versioning, review, and deployment rigor as application code.
Retrieval is a dependency. Most production LLM systems include a retrieval component (RAG). The quality of retrieved context directly determines output quality. But retrieval quality degrades silently: indexes go stale, new documents are ingested with formatting issues, embedding drift causes relevance decay. Monitoring ML models in production covers operational health metrics, but LLMOps adds retrieval-specific monitoring on top of those fundamentals.
Cost is variable and significant. A traditional ML model has relatively predictable inference costs. LLM costs depend on input length, output length, which model handles each request, and whether caching is effective. A single user interaction might cost $0.001 or $0.50 depending on how the routing logic and context assembly work. Without cost monitoring, a successful feature launch can also be a financial disaster.
Evaluation is subjective. You cannot compute an F1 score for "was this answer helpful?" LLM evaluation requires human judgment, LLM-as-judge pipelines, or proxy metrics that are themselves imperfect. This makes the feedback loop longer and noisier than in traditional ML.
Prompt Versioning and Management
Every prompt in your system should be version-controlled, independently deployable, and traceable to the outputs it produced. Here is a practical implementation:
```python
import hashlib
import json
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Any


@dataclass
class PromptVersion:
    name: str
    version: str
    template: str
    model_id: str
    temperature: float
    max_tokens: int
    metadata: dict[str, Any] = field(default_factory=dict)
    created_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )

    @property
    def content_hash(self) -> str:
        """Deterministic hash of prompt content for integrity checks."""
        content = json.dumps({
            "template": self.template,
            "model_id": self.model_id,
            "temperature": self.temperature,
            "max_tokens": self.max_tokens,
        }, sort_keys=True)
        return hashlib.sha256(content.encode()).hexdigest()[:12]


class PromptRegistry:
    """Version-controlled prompt management with rollback support."""

    def __init__(self, storage_backend=None):
        self._prompts: dict[str, list[PromptVersion]] = {}
        self._active: dict[str, str] = {}  # name -> active version
        self._storage = storage_backend

    def register(self, prompt: PromptVersion) -> str:
        """Register a new prompt version. Returns the content hash."""
        if prompt.name not in self._prompts:
            self._prompts[prompt.name] = []
        # Check for duplicate content
        for existing in self._prompts[prompt.name]:
            if existing.content_hash == prompt.content_hash:
                return existing.content_hash  # Idempotent
        self._prompts[prompt.name].append(prompt)
        # Auto-activate if this is the first version
        if prompt.name not in self._active:
            self._active[prompt.name] = prompt.version
        return prompt.content_hash

    def get_active(self, name: str) -> PromptVersion | None:
        """Get the currently active version of a named prompt."""
        if name not in self._active:
            return None
        active_version = self._active[name]
        return next(
            (p for p in self._prompts[name] if p.version == active_version),
            None,
        )

    def activate(self, name: str, version: str):
        """Promote a specific version to active (blue-green deploy)."""
        versions = self._prompts.get(name, [])
        if not any(p.version == version for p in versions):
            raise ValueError(f"Version '{version}' not found for prompt '{name}'")
        self._active[name] = version

    def rollback(self, name: str) -> str | None:
        """Roll back to the previous version. Returns the rolled-back version."""
        versions = self._prompts.get(name, [])
        if len(versions) < 2:
            return None
        current = self._active[name]
        current_idx = next(
            i for i, p in enumerate(versions) if p.version == current
        )
        previous_idx = max(0, current_idx - 1)
        self._active[name] = versions[previous_idx].version
        return self._active[name]

    def get_history(self, name: str) -> list[dict]:
        """Get version history for audit trail."""
        return [
            {
                "version": p.version,
                "content_hash": p.content_hash,
                "model_id": p.model_id,
                "created_at": p.created_at,
                "is_active": p.version == self._active.get(name),
            }
            for p in self._prompts.get(name, [])
        ]


# Usage
registry = PromptRegistry()

# Register prompt versions
v1 = PromptVersion(
    name="customer_support",
    version="1.0.0",
    template=(
        "You are a helpful customer support agent for {company}. "
        "Answer the user's question based on the following context:\n\n"
        "{context}\n\nQuestion: {question}"
    ),
    model_id="claude-sonnet-4-20250514",
    temperature=0.3,
    max_tokens=1024,
)
v2 = PromptVersion(
    name="customer_support",
    version="1.1.0",
    template=(
        "You are a customer support agent for {company}. "
        "Be concise and precise. If unsure, say so.\n\n"
        "Relevant context:\n{context}\n\n"
        "Customer question: {question}\n\n"
        "Respond in 3 sentences or fewer unless the question requires detail."
    ),
    model_id="claude-sonnet-4-20250514",
    temperature=0.2,
    max_tokens=512,
)

registry.register(v1)
registry.register(v2)

# Deploy v2
registry.activate("customer_support", "1.1.0")

# Something goes wrong? Roll back in seconds
registry.rollback("customer_support")  # Back to 1.0.0
```
The key design decisions: prompts are identified by name and version, content hashes detect duplicate registrations, and activation is separate from registration (so you can register a new version, test it, and promote it independently). In production, the storage backend would be a database or a Git repository, and the activate call would be gated by A/B test results or a canary deployment.
Guardrail Monitoring
Guardrails (input validation, output filtering, content safety checks) are critical for production LLM systems. But guardrails themselves need monitoring. A guardrail that fires too often may indicate a prompt regression or a shift in user behavior. One that never fires may be misconfigured or redundant.
```python
import time
from dataclasses import dataclass
from enum import Enum


class GuardrailAction(Enum):
    PASS = "pass"
    BLOCK = "block"
    MODIFY = "modify"
    FLAG_FOR_REVIEW = "flag_for_review"


@dataclass
class GuardrailEvent:
    guardrail_id: str
    action: GuardrailAction
    trigger_reason: str | None
    latency_ms: float
    timestamp: float


class GuardrailMonitor:
    """Monitor guardrail behavior for operational health and tuning."""

    def __init__(self):
        self.events: list[GuardrailEvent] = []
        self._alert_thresholds: dict[str, dict] = {}

    def set_thresholds(self, guardrail_id: str, max_block_rate: float = 0.1,
                       max_latency_p99_ms: float = 200.0,
                       min_fire_rate: float = 0.001):
        """Set alerting thresholds for a guardrail."""
        self._alert_thresholds[guardrail_id] = {
            "max_block_rate": max_block_rate,
            "max_latency_p99_ms": max_latency_p99_ms,
            "min_fire_rate": min_fire_rate,
        }

    def record(self, event: GuardrailEvent):
        """Record a guardrail evaluation event."""
        self.events.append(event)

    def get_metrics(self, guardrail_id: str, window_seconds: int = 3600) -> dict:
        """Compute metrics for a guardrail over a time window."""
        now = time.time()
        window_events = [
            e for e in self.events
            if e.guardrail_id == guardrail_id and now - e.timestamp < window_seconds
        ]
        if not window_events:
            return {"total": 0, "block_rate": 0.0, "avg_latency_ms": 0.0, "alerts": []}
        total = len(window_events)
        blocks = sum(1 for e in window_events if e.action == GuardrailAction.BLOCK)
        modifications = sum(
            1 for e in window_events if e.action == GuardrailAction.MODIFY
        )
        flags = sum(
            1 for e in window_events if e.action == GuardrailAction.FLAG_FOR_REVIEW
        )
        latencies = sorted(e.latency_ms for e in window_events)
        block_rate = blocks / total
        fire_rate = (blocks + modifications + flags) / total
        metrics = {
            "total": total,
            "blocks": blocks,
            "modifications": modifications,
            "flags": flags,
            "block_rate": round(block_rate, 4),
            "fire_rate": round(fire_rate, 4),
            "avg_latency_ms": round(sum(latencies) / total, 2),
            "p50_latency_ms": latencies[len(latencies) // 2],
            "p99_latency_ms": latencies[int(len(latencies) * 0.99)],
        }
        # Check thresholds and generate alerts
        metrics["alerts"] = self._check_thresholds(guardrail_id, metrics)
        return metrics

    def _check_thresholds(self, guardrail_id: str, metrics: dict) -> list[str]:
        """Check metrics against configured thresholds."""
        alerts = []
        thresholds = self._alert_thresholds.get(guardrail_id)
        if not thresholds:
            return alerts
        if metrics["block_rate"] > thresholds["max_block_rate"]:
            alerts.append(
                f"Block rate {metrics['block_rate']:.2%} exceeds "
                f"threshold {thresholds['max_block_rate']:.2%}. "
                f"Possible prompt regression or input distribution shift."
            )
        if metrics["p99_latency_ms"] > thresholds["max_latency_p99_ms"]:
            alerts.append(
                f"P99 latency {metrics['p99_latency_ms']}ms exceeds "
                f"threshold {thresholds['max_latency_p99_ms']}ms. "
                f"Guardrail may be degrading request performance."
            )
        if metrics["fire_rate"] < thresholds["min_fire_rate"] and metrics["total"] > 100:
            alerts.append(
                f"Fire rate {metrics['fire_rate']:.4%} below minimum "
                f"{thresholds['min_fire_rate']:.4%}. "
                f"Guardrail may be misconfigured or redundant."
            )
        return alerts


# Usage
monitor = GuardrailMonitor()
monitor.set_thresholds(
    "pii_filter",
    max_block_rate=0.05,      # More than 5% blocks is unusual
    max_latency_p99_ms=50.0,
    min_fire_rate=0.005,      # Should catch at least 0.5% of requests
)
monitor.set_thresholds(
    "toxicity_filter",
    max_block_rate=0.02,
    max_latency_p99_ms=100.0,
    min_fire_rate=0.001,
)
```
The insight that took me a while to internalize: guardrail monitoring is more important than guardrail implementation. A well-coded filter that silently blocks 40% of legitimate requests is worse than no filter at all. You need to know fire rates, false positive rates, and latency impact to tune guardrails effectively.
Retrieval Quality Tracking
For RAG-based systems, retrieval quality is often the bottleneck. If the retriever surfaces irrelevant or stale documents, the LLM will produce confident but wrong answers. Tracking retrieval quality requires metrics at multiple levels.
Retrieval relevance measures whether the retrieved documents actually answer the query. In production, you typically approximate this with an LLM-as-judge that scores context relevance on a 1-5 scale, or with click-through and feedback signals.
Context utilization tracks whether the LLM actually uses the retrieved context. If the model consistently ignores the top retrieved chunks, your retrieval may be returning noise. You can estimate this by checking whether the model's output contains information from the context that it could not have known from training data alone.
Index freshness monitors the age distribution of retrieved documents. If your knowledge base was last updated three weeks ago but users are asking about events from yesterday, your system is silently failing.
These metrics integrate naturally with the production monitoring patterns you already use for inference pipelines. The key addition is tracking them per-query so you can correlate retrieval quality with user satisfaction.
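A minimal sketch of a per-query tracker for these three signals follows, assuming relevance comes from an LLM-as-judge score (1-5 per retrieved chunk) and freshness from document timestamps. The class and field names here are illustrative, not a standard:

```python
import time
from dataclasses import dataclass, field


@dataclass
class RetrievalRecord:
    query_id: str
    relevance_scores: list[float]  # LLM-judge score (1-5) per retrieved chunk
    doc_ages_days: list[float]     # age of each retrieved document in days
    context_used: bool             # did the answer draw on retrieved context?
    timestamp: float = field(default_factory=time.time)


class RetrievalQualityTracker:
    """Aggregate relevance, context utilization, and freshness per query."""

    def __init__(self, stale_after_days: float = 14.0):
        self.records: list[RetrievalRecord] = []
        self.stale_after_days = stale_after_days

    def record(self, rec: RetrievalRecord):
        self.records.append(rec)

    def summary(self) -> dict:
        if not self.records:
            return {"queries": 0}
        all_scores = [s for r in self.records for s in r.relevance_scores]
        all_ages = [a for r in self.records for a in r.doc_ages_days]
        stale = sum(1 for a in all_ages if a > self.stale_after_days)
        return {
            "queries": len(self.records),
            "avg_relevance": round(sum(all_scores) / len(all_scores), 2),
            "context_utilization": round(
                sum(1 for r in self.records if r.context_used) / len(self.records), 2
            ),
            "stale_doc_fraction": round(stale / len(all_ages), 2) if all_ages else 0.0,
        }
```

Because each record carries a query ID, you can join these aggregates against user feedback signals to see whether low-relevance retrievals actually correlate with dissatisfied users.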
Cost Monitoring and Optimization
LLM costs are deceptively hard to predict. A system that costs $500/day during testing might cost $15,000/day in production if prompt lengths grow, caching misses increase, or traffic patterns shift.
```python
import time
from collections import defaultdict
from dataclasses import dataclass

# Approximate pricing per million tokens (as of early 2026)
MODEL_PRICING = {
    "gpt-4o": {"input": 2.50, "output": 10.00},
    "gpt-4o-mini": {"input": 0.15, "output": 0.60},
    "claude-sonnet-4-20250514": {"input": 3.00, "output": 15.00},
    "claude-haiku-35": {"input": 0.80, "output": 4.00},
    "gemini-2.0-flash": {"input": 0.10, "output": 0.40},
}


@dataclass
class LLMCallRecord:
    model_id: str
    input_tokens: int
    output_tokens: int
    cached_tokens: int
    latency_ms: float
    timestamp: float
    route: str  # Which part of the system made this call
    cache_hit: bool


class CostTracker:
    """Track and analyze LLM costs across your system."""

    def __init__(self):
        self.records: list[LLMCallRecord] = []

    def record(self, call: LLMCallRecord):
        self.records.append(call)

    def compute_cost(self, call: LLMCallRecord) -> float:
        """Compute cost for a single LLM call in USD."""
        pricing = MODEL_PRICING.get(call.model_id)
        if not pricing:
            return 0.0
        billable_input = call.input_tokens - call.cached_tokens
        input_cost = (billable_input / 1_000_000) * pricing["input"]
        # Cached tokens typically billed at 50% (provider-dependent)
        cache_cost = (call.cached_tokens / 1_000_000) * pricing["input"] * 0.5
        output_cost = (call.output_tokens / 1_000_000) * pricing["output"]
        return input_cost + cache_cost + output_cost

    def daily_report(self, window_hours: int = 24) -> dict:
        """Generate a cost report broken down by model and route."""
        now = time.time()
        cutoff = now - (window_hours * 3600)
        recent = [r for r in self.records if r.timestamp > cutoff]
        by_model = defaultdict(lambda: {"calls": 0, "cost": 0.0, "tokens": 0})
        by_route = defaultdict(lambda: {"calls": 0, "cost": 0.0})
        total_cost = 0.0
        total_tokens = 0
        cache_hits = 0
        for call in recent:
            cost = self.compute_cost(call)
            total_cost += cost
            total_tokens += call.input_tokens + call.output_tokens
            if call.cache_hit:
                cache_hits += 1
            by_model[call.model_id]["calls"] += 1
            by_model[call.model_id]["cost"] += cost
            by_model[call.model_id]["tokens"] += call.input_tokens + call.output_tokens
            by_route[call.route]["calls"] += 1
            by_route[call.route]["cost"] += cost
        cache_rate = cache_hits / len(recent) if recent else 0
        return {
            "window_hours": window_hours,
            "total_calls": len(recent),
            "total_cost_usd": round(total_cost, 2),
            "total_tokens": total_tokens,
            "cache_hit_rate": round(cache_rate, 4),
            "cost_per_call_avg": round(total_cost / len(recent), 4) if recent else 0,
            "by_model": dict(by_model),
            "by_route": dict(by_route),
            "projected_monthly_usd": round(total_cost * (720 / window_hours), 2),
        }

    def find_optimization_opportunities(self) -> list[str]:
        """Identify cost optimization opportunities."""
        suggestions = []
        report = self.daily_report(24)
        if report["cache_hit_rate"] < 0.2:
            suggestions.append(
                f"Cache hit rate is {report['cache_hit_rate']:.1%}. "
                f"Consider implementing semantic caching for repeated queries. "
                f"Estimated savings: 20-40% of input token costs."
            )
        for model_id, stats in report["by_model"].items():
            pricing = MODEL_PRICING.get(model_id, {})
            if pricing.get("input", 0) > 2.0 and stats["calls"] > 100:
                suggestions.append(
                    f"'{model_id}' accounts for {stats['calls']} calls. "
                    f"Consider routing simpler requests to a smaller model. "
                    f"A 50% routing split could save ~${stats['cost'] * 0.3:.2f}/day."
                )
        for route, stats in report["by_route"].items():
            cost_per_call = stats["cost"] / stats["calls"] if stats["calls"] else 0
            if cost_per_call > 0.05:
                suggestions.append(
                    f"Route '{route}' averages ${cost_per_call:.3f}/call. "
                    f"Review context window size and output length limits."
                )
        return suggestions
```
The most impactful cost optimization I have seen is intelligent model routing. Not every request needs your most capable model. A routing layer that sends simple queries to a smaller model (Gemini Flash, Claude Haiku) and reserves the expensive model for complex reasoning can cut costs by 40-60% with minimal quality impact. This aligns well with the efficiency gains from sub-10B models that are increasingly competitive for straightforward tasks.
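A routing layer of this kind can start out very simple. The sketch below uses illustrative heuristics (prompt length and reasoning-flavored keywords); production routers often replace these with a trained classifier or a cheap LLM call that scores request complexity:

```python
def route_request(prompt: str,
                  cheap_model: str = "gemini-2.0-flash",
                  capable_model: str = "claude-sonnet-4-20250514") -> str:
    """Route a request to a cheap or capable model via simple heuristics.

    Long prompts and reasoning-heavy keywords go to the capable model;
    everything else goes to the cheap one. The markers and the length
    cutoff are illustrative starting points, not tuned values.
    """
    complex_markers = ("analyze", "compare", "step by step", "explain why", "debug")
    if len(prompt.split()) > 300:
        return capable_model
    if any(marker in prompt.lower() for marker in complex_markers):
        return capable_model
    return cheap_model
```

Pairing the router with the per-route breakdown from a cost tracker makes it easy to verify that the split actually saves money rather than just shifting traffic.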
A/B Testing for Prompts
Prompt A/B testing is structurally different from traditional A/B testing. The output is text, not a click or a conversion. The metrics are fuzzier. And the sample sizes needed are larger because of LLM output variance.
My approach: define a primary metric (user satisfaction score, task completion rate, or an LLM-judge quality rating), a guardrail metric (cost, latency, safety filter triggers), and run experiments with at least 500 observations per variant. Use the prompt registry to manage variants, and route traffic using consistent hashing on user ID so the same user always sees the same variant within an experiment.
The critical mistake I see teams make: testing prompt changes without controlling for the retrieval pipeline. If you change your prompt and your RAG index in the same week, you cannot attribute the quality change to either one. Test one variable at a time.
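Consistent user assignment is easy to get right with a hash of user ID and experiment name. A minimal sketch (the experiment and variant names are illustrative):

```python
import hashlib


def assign_variant(user_id: str, experiment: str,
                   variants: list[str], weights: list[float]) -> str:
    """Deterministically assign a user to an experiment variant.

    The same (user_id, experiment) pair always maps to the same variant,
    so a user never flips variants mid-experiment; a different experiment
    name reshuffles assignments independently.
    """
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    bucket = int(digest[:8], 16) / 0xFFFFFFFF  # uniform in [0, 1]
    cumulative = 0.0
    for variant, weight in zip(variants, weights):
        cumulative += weight
        if bucket <= cumulative:
            return variant
    return variants[-1]


# Example: 50/50 split between two prompt registry versions
variant = assign_variant("user-42", "support_prompt_exp", ["1.0.0", "1.1.0"], [0.5, 0.5])
```

The returned version string plugs directly into a prompt registry lookup, so the experiment layer and the versioning layer stay decoupled.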
The Hybrid Pattern: Managed Cloud and Open-Source Tools
Most production LLMOps stacks in 2026 are hybrid. You use managed services (LangSmith, Braintrust, Humanloop) for prompt management and evaluation, open-source tools (MLflow, Evidently, Phoenix) for custom monitoring and self-hosted components, and cloud provider services (AWS Bedrock, Azure AI Studio, GCP Vertex) for model hosting and scaling.
The pattern I recommend:
Prompt management and evaluation: Use a managed platform. The iteration speed from a purpose-built UI (test prompts, compare outputs, run evaluations, deploy) is worth the cost. Building this in-house is a trap that consumes months of engineering time.
Operational monitoring: Extend your existing observability stack. If you already use Prometheus, Grafana, and OpenTelemetry, add LLM-specific metrics (token counts, latencies by model, guardrail fire rates) as custom metrics. Do not replace your monitoring with an LLM-specific tool.
Cost tracking: Build this yourself. The cost calculation logic is simple, but it needs to integrate with your billing, routing, and caching systems. Third-party tools rarely have enough context about your specific architecture to give accurate cost attribution.
Retrieval monitoring: Open-source tools like Evidently or RAGAS work well here. Retrieval quality metrics (relevance, recall, freshness) are relatively standardized, and the tooling is mature. Integrate them into your CI/CD pipeline to catch retrieval regressions before they reach production.
Auto-Retraining and Adapter Updates
For teams using fine-tuned models or LoRA adapters, the retraining pipeline needs automation. The trigger is typically a combination of: performance degradation detected by monitoring, new training data accumulated past a threshold, and a scheduled cadence (weekly or monthly).
The pipeline should: pull the latest base model and training data, run fine-tuning or adapter training, evaluate against a held-out test set and the current production model, auto-deploy if quality improves and passes safety checks, and alert a human if quality degrades. This is standard MLOps pipeline design, extended with prompt evaluation and guardrail testing as additional pipeline stages.
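The auto-deploy decision at the end of that pipeline reduces to a small gate. A sketch with illustrative thresholds and metric names (these are assumptions, not a standard interface):

```python
def should_deploy(candidate: dict, production: dict,
                  min_quality_gain: float = 0.01,
                  max_safety_regression: float = 0.0) -> tuple[bool, str]:
    """Decide whether a retrained model should replace production.

    Deploys only if held-out quality improves by at least min_quality_gain
    and the safety-check pass rate does not regress beyond the allowed
    margin; otherwise the pipeline holds and alerts a human.
    """
    quality_delta = candidate["quality_score"] - production["quality_score"]
    safety_delta = candidate["safety_pass_rate"] - production["safety_pass_rate"]
    if safety_delta < -max_safety_regression:
        return False, f"safety regression ({safety_delta:+.3f}): alert a human"
    if quality_delta < min_quality_gain:
        return False, f"quality gain {quality_delta:+.3f} below threshold: hold"
    return True, f"deploy: quality {quality_delta:+.3f}, safety {safety_delta:+.3f}"
```

Keeping the gate as a pure function of evaluation results makes the decision auditable: every deploy or hold can be logged with the exact deltas that triggered it.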
The one addition specific to LLMOps: when you retrain an adapter, re-run your entire prompt evaluation suite. Fine-tuning changes the model's response patterns, which can interact with prompts in unexpected ways. A prompt that worked perfectly with the old adapter may produce inferior results with the new one.
Key Takeaways
- LLMOps is not rebranded MLOps; it addresses genuinely new challenges including external model dependencies, prompt-as-code management, retrieval pipeline monitoring, and variable cost structures.
- Prompt versioning with content hashing, independent deployment, and instant rollback is table-stakes for production LLM systems; treat prompts with the same rigor as application code.
- Guardrail monitoring (fire rates, false positive rates, latency impact) is more important than guardrail implementation; a well-coded filter that silently blocks legitimate requests is worse than no filter at all.
- Cost monitoring must track usage by model, route, and cache hit rate; intelligent model routing (sending simple queries to cheaper models) typically saves 40-60% with minimal quality loss.
- Retrieval quality tracking (relevance scores, context utilization, index freshness) should be a first-class metric in your monitoring dashboard, not an afterthought.
- A/B testing prompts requires controlling for retrieval pipeline changes, consistent user assignment, and at least 500 observations per variant to account for LLM output variance.
- The hybrid pattern (managed platforms for prompt management, open-source tools for monitoring, custom code for cost tracking) is the most practical architecture for most teams in 2026.
- When retraining adapters, always re-run the full prompt evaluation suite; fine-tuning changes response patterns that can interact with prompts in unexpected ways.