End-to-End Multi-Agent Systems: Design Patterns from IEEE CAI 2026
The conversations at IEEE CAI 2026 and the adjacent ICDCS and HPDC sessions made one thing clear: multi-agent AI systems are production infrastructure now. But the gap between a demo with three agents passing messages and a fault-tolerant distributed system serving real users is large, and most teams underestimate it.
This article builds on foundational patterns (orchestrator plus specialists, blackboard architectures, peer-to-peer agents) and goes deeper into design patterns that emerged from this year's conference discussions and from our own production experience at Ailog. The focus is on separation of concerns, fault tolerance, federated orchestration, and the monitoring infrastructure that makes it all debuggable.
The compound AI system mindset
The term "compound AI systems" keeps surfacing in 2026 literature, and for good reason. A compound system is not a single model. It is a composition of models, retrievers (often backed by vector databases), tools, validators, and orchestration logic that together solve a task. Multi-agent architectures are the natural way to structure these compositions.
The key shift from earlier agent frameworks is treating the system as a distributed application first and an AI application second. That means applying decades of distributed systems knowledge (consensus, fault isolation, observability, graceful degradation) to agent coordination.
Pattern 1: Separation of planning, execution, and validation
The single most impactful pattern is strict separation of three concerns that many teams collapse into one agent.
The planner
The planner receives a user goal and produces a structured plan: a DAG of tasks with dependencies, estimated resource requirements, and success criteria. It does not execute anything.
from dataclasses import dataclass, field
from typing import List, Optional, Dict, Any
from enum import Enum

class TaskStatus(Enum):
    PENDING = "pending"
    RUNNING = "running"
    COMPLETED = "completed"
    FAILED = "failed"
    SKIPPED = "skipped"

@dataclass
class Task:
    id: str
    agent: str
    action: str
    params: Dict[str, Any] = field(default_factory=dict)
    depends_on: List[str] = field(default_factory=list)
    status: TaskStatus = TaskStatus.PENDING
    result: Optional[Any] = None
    retries: int = 0
    max_retries: int = 2

@dataclass
class ExecutionPlan:
    goal: str
    tasks: List[Task]
    metadata: Dict[str, Any] = field(default_factory=dict)

    def ready_tasks(self) -> List[Task]:
        """Return tasks whose dependencies are all completed."""
        completed_ids = {
            t.id for t in self.tasks if t.status == TaskStatus.COMPLETED
        }
        return [
            t for t in self.tasks
            if t.status == TaskStatus.PENDING
            and all(dep in completed_ids for dep in t.depends_on)
        ]

    def is_complete(self) -> bool:
        terminal = {TaskStatus.COMPLETED, TaskStatus.FAILED, TaskStatus.SKIPPED}
        return all(t.status in terminal for t in self.tasks)
The planner generates an ExecutionPlan by calling an LLM with structured output constraints. The plan is data, not code. This makes it inspectable, serializable, and auditable, which are requirements that showed up repeatedly in the IEEE CAI fault tolerance sessions.
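A minimal sketch of that planning call, assuming an `llm_generate` callable that returns a JSON string (the prompt wording and the `llm_generate` interface are illustrative, not a specific API). Each returned task dict maps directly onto the `Task` fields above, and the function validates the DAG structure before anything executes:

```python
import json
from typing import Any, Callable, Dict, List

# Illustrative prompt; a production planner would use structured output constraints.
PLANNER_PROMPT = (
    "You are a planning agent. Decompose the goal into a task DAG. "
    "Respond as JSON: a list of objects with keys id, agent, action, "
    "params (object), and depends_on (list of task ids).\n\n"
    "Goal: {goal}"
)

def plan_goal(llm_generate: Callable[[str], str], goal: str) -> List[Dict[str, Any]]:
    """Ask the LLM for a task DAG and sanity-check the structure before use."""
    raw = llm_generate(PLANNER_PROMPT.format(goal=goal))
    tasks = json.loads(raw)
    ids = {t["id"] for t in tasks}
    for t in tasks:
        # Reject plans that reference dependencies outside the plan itself.
        unknown = [dep for dep in t.get("depends_on", []) if dep not in ids]
        if unknown:
            raise ValueError(f"Task {t['id']} has unknown dependencies: {unknown}")
    return tasks
```

Validating the structure at this boundary means a malformed plan fails loudly at planning time rather than mid-execution.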
The executor
The executor takes a plan and runs it. It manages concurrency, timeouts, and retries. It does not make strategic decisions about what to do next; it only follows the plan.
import asyncio
import logging
from typing import Dict, Callable, Any

logger = logging.getLogger(__name__)

class PlanExecutor:
    def __init__(self, agent_registry: Dict[str, Callable], timeout: float = 30.0):
        self.agents = agent_registry
        self.timeout = timeout

    async def execute_task(self, task: Task) -> Task:
        agent_fn = self.agents.get(task.agent)
        if not agent_fn:
            task.status = TaskStatus.FAILED
            task.result = {"error": f"Unknown agent: {task.agent}"}
            return task
        for attempt in range(task.max_retries + 1):
            try:
                task.status = TaskStatus.RUNNING
                result = await asyncio.wait_for(
                    agent_fn(task.action, task.params),
                    timeout=self.timeout,
                )
                task.status = TaskStatus.COMPLETED
                task.result = result
                return task
            except asyncio.TimeoutError:
                task.retries = attempt + 1
                logger.warning(f"Task {task.id} timed out, attempt {attempt + 1}")
            except Exception as e:
                task.retries = attempt + 1
                logger.error(f"Task {task.id} failed: {e}, attempt {attempt + 1}")
        task.status = TaskStatus.FAILED
        task.result = {"error": "Max retries exceeded"}
        return task

    async def run_plan(self, plan: ExecutionPlan) -> ExecutionPlan:
        while not plan.is_complete():
            ready = plan.ready_tasks()
            if not ready:
                # Deadlock or all remaining tasks depend on failed ones
                for t in plan.tasks:
                    if t.status == TaskStatus.PENDING:
                        t.status = TaskStatus.SKIPPED
                break
            await asyncio.gather(
                *[self.execute_task(t) for t in ready]
            )
            # Results are already written back to task objects
        return plan
This executor runs independent tasks in parallel using asyncio.gather, respects dependency ordering, and handles retries with timeout enforcement. The pattern directly mirrors what was presented in the HPDC 2026 session on resilient task scheduling for AI workloads.
The validator
After execution, a validator agent reviews the results against the original goal. It checks for completeness, consistency, and quality. If validation fails, it can request a replan.
import json

@dataclass
class ValidationResult:
    passed: bool
    issues: List[str] = field(default_factory=list)
    suggested_actions: List[str] = field(default_factory=list)

def validate_plan_results(llm, plan: ExecutionPlan) -> ValidationResult:
    completed = [t for t in plan.tasks if t.status == TaskStatus.COMPLETED]
    failed = [t for t in plan.tasks if t.status == TaskStatus.FAILED]
    summary = {
        "goal": plan.goal,
        "completed_tasks": [
            {"id": t.id, "agent": t.agent, "result_preview": str(t.result)[:200]}
            for t in completed
        ],
        "failed_tasks": [
            {"id": t.id, "agent": t.agent, "error": str(t.result)}
            for t in failed
        ],
    }
    prompt = (
        "You are a validation agent. Check if the completed tasks "
        "achieve the stated goal. Identify gaps, inconsistencies, "
        "or quality issues. Respond as JSON with keys: "
        "passed (bool), issues (list of strings), "
        "suggested_actions (list of strings).\n\n"
        f"Plan summary:\n{json.dumps(summary, indent=2)}"
    )
    response = llm.generate(prompt)
    parsed = json.loads(response)
    return ValidationResult(**parsed)
The three-way separation (planning, execution, validation) means each component can be tested independently, scaled differently, and replaced without affecting the others. It also creates natural checkpoints for human review in high-stakes applications.
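The separation also makes the replanning loop explicit. A minimal sketch of the control flow, with the planner, executor, and validator passed in as callables; the signatures and the dict-shaped verdict here are illustrative, not the exact interfaces above:

```python
async def run_with_replanning(plan_fn, execute_fn, validate_fn, goal, max_cycles=3):
    """Plan, execute, validate; feed validation issues back into replanning."""
    feedback = None
    for _ in range(max_cycles):
        plan = plan_fn(goal, feedback)   # planner sees prior issues, if any
        plan = await execute_fn(plan)    # executor runs the task DAG
        verdict = validate_fn(plan)      # validator checks results against the goal
        if verdict["passed"]:
            break
        feedback = verdict["issues"]     # request an alternative plan next cycle
    return plan, verdict
```

Bounding the loop with `max_cycles` matters: an unbounded plan-validate cycle is a common source of runaway cost in production.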
Pattern 2: Fault-tolerant agent communication
In distributed systems, network partitions and process failures are the norm. Multi-agent AI systems need the same resilience patterns.
Message queues over direct calls
Instead of agents calling each other synchronously, route communication through a durable message queue. This decouples agent lifecycles and provides automatic retry semantics.
import json
from dataclasses import dataclass
from typing import Any, Dict

@dataclass
class AgentMessage:
    sender: str
    recipient: str
    payload: Dict[str, Any]
    message_id: str
    correlation_id: str  # Links related messages across a workflow
    retry_count: int = 0

class MessageBroker:
    """Minimal abstraction over a message queue (Redis Streams, Kafka, etc.)."""

    def __init__(self, backend):
        self.backend = backend

    async def publish(self, message: AgentMessage):
        channel = f"agent.{message.recipient}"
        await self.backend.publish(
            channel,
            json.dumps({
                "sender": message.sender,
                "payload": message.payload,
                "message_id": message.message_id,
                "correlation_id": message.correlation_id,
                "retry_count": message.retry_count,
            }),
        )

    async def subscribe(self, agent_name: str, handler):
        channel = f"agent.{agent_name}"
        async for raw_msg in self.backend.listen(channel):
            parsed = json.loads(raw_msg)
            await handler(parsed)
This pattern came up in every ICDCS 2026 talk on multi-agent reliability. The message broker provides durability (messages survive agent restarts), ordering guarantees, and dead-letter queues for messages that repeatedly fail processing.
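One way the dead-letter path might look, assuming a backend with a `publish(channel, payload)` coroutine like the one the MessageBroker wraps; the `MAX_DELIVERIES` threshold and the `agent.dead_letter` channel name are illustrative choices, not part of any specific queue's API:

```python
import json

MAX_DELIVERIES = 3  # Illustrative threshold before a message is parked

async def handle_with_dlq(backend, message: dict, handler):
    """Run a handler; redeliver on failure, then park the message in a dead-letter queue."""
    try:
        await handler(message)
    except Exception:
        message["retry_count"] = message.get("retry_count", 0) + 1
        if message["retry_count"] >= MAX_DELIVERIES:
            # Park for offline inspection instead of retrying forever.
            await backend.publish("agent.dead_letter", json.dumps(message))
        else:
            # Redeliver to the original recipient with the bumped retry count.
            await backend.publish(f"agent.{message['recipient']}", json.dumps(message))
```

Parked messages retain their `correlation_id`, so a dead-letter consumer can reconstruct the workflow context when triaging failures.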
Circuit breakers for agent calls
When an agent or external service starts failing, you do not want cascading failures across the entire system. Circuit breakers stop calling a failing component after a threshold and periodically test if it has recovered.
import time

class CircuitBreaker:
    def __init__(self, failure_threshold: int = 3, reset_timeout: float = 60.0):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.last_failure_time = 0.0
        self.state = "closed"  # closed = healthy, open = failing, half-open = probing recovery

    def can_execute(self) -> bool:
        if self.state == "closed":
            return True
        if time.time() - self.last_failure_time > self.reset_timeout:
            self.state = "half-open"
            return True
        return False

    def record_success(self):
        self.failures = 0
        self.state = "closed"

    def record_failure(self):
        self.failures += 1
        self.last_failure_time = time.time()
        if self.failures >= self.failure_threshold:
            self.state = "open"
In our production systems at Ailog, every agent-to-agent call and every external tool invocation goes through a circuit breaker. This is standard practice in microservice architectures, and multi-agent AI systems should adopt it.
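Wiring the breaker into call sites can be as simple as a small wrapper. `guarded_call` below is a hypothetical helper, duck-typed against the `can_execute` / `record_success` / `record_failure` interface above:

```python
async def guarded_call(breaker, call, *args):
    """Route an agent or tool call through a circuit breaker."""
    if not breaker.can_execute():
        # Fail fast instead of queueing work behind a known-bad dependency.
        raise RuntimeError("circuit open, failing fast")
    try:
        result = await call(*args)
    except Exception:
        breaker.record_failure()
        raise
    breaker.record_success()
    return result
```

Keeping one breaker instance per downstream agent or tool, rather than one global breaker, isolates failures to the component that is actually unhealthy.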
Pattern 3: Federated orchestration
For organizations with multiple teams or deployments, centralized orchestration becomes a bottleneck. The federated orchestration pattern distributes coordination across multiple orchestrators that share learned policies.
Each team or deployment runs its own local orchestrator with local agents. A global coordination layer handles:
- Cross-team task routing when a task requires capabilities from another team's agents
- Policy synchronization so all orchestrators share updated routing rules and safety constraints
- Aggregate monitoring for system-wide visibility
This mirrors the federated learning pattern but applies it to control policies rather than model weights. Local orchestrators learn from their own traffic patterns, and periodically share aggregated routing statistics with the global coordinator.
The practical benefit is autonomy: teams can add, remove, or update agents without coordinating with every other team, as long as they conform to the shared message protocol.
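The routing side of this can be sketched with a shared capability registry. Everything here is illustrative (the registry shape, team names, and `LocalOrchestrator` interface are assumptions, not a specific framework's API):

```python
from typing import Dict, Set

class LocalOrchestrator:
    """Routes tasks locally when possible, otherwise escalates to the federation."""

    def __init__(self, team: str, local_capabilities: Set[str],
                 federation_registry: Dict[str, str]):
        self.team = team
        self.local_capabilities = local_capabilities
        # Hypothetical synchronized registry: capability -> owning team.
        self.federation_registry = federation_registry

    def route(self, capability: str) -> str:
        """Return the team whose orchestrator should handle this capability."""
        if capability in self.local_capabilities:
            return self.team
        owner = self.federation_registry.get(capability)
        if owner is None:
            raise LookupError(f"No team provides capability: {capability}")
        return owner
```

The registry is what policy synchronization keeps up to date: as long as teams publish their capabilities into it, routing works without any team knowing another team's internal agent layout.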
Monitoring multi-agent systems in production
Observability in multi-agent systems is harder than monitoring a single model endpoint. With agents, you need to track:
- Distributed traces: a single user request may touch five agents across three services
- Agent-level metrics: latency, token usage, tool call frequency, error rates per agent
- Plan-level metrics: how often plans succeed on first attempt, average replanning cycles, task skip rates
- Quality signals: validation pass rates, user feedback correlation with plan structure (see evaluating RAG system performance for related metrics)
A structured logging approach works well:
import time
import json
import logging
from typing import Any, Dict

logger = logging.getLogger("multi_agent_trace")

class AgentTracer:
    def __init__(self, trace_id: str):
        self.trace_id = trace_id
        self.spans = []

    def start_span(self, agent: str, action: str) -> Dict[str, Any]:
        span = {
            "trace_id": self.trace_id,
            "agent": agent,
            "action": action,
            "start_time": time.time(),
            "status": "running",
        }
        self.spans.append(span)
        return span

    def end_span(self, span: Dict[str, Any], status: str, metadata: Dict = None):
        span["end_time"] = time.time()
        span["duration_ms"] = (span["end_time"] - span["start_time"]) * 1000
        span["status"] = status
        span["metadata"] = metadata or {}
        logger.info(json.dumps(span))
Feed these traces into your existing observability stack, whether that is Prometheus plus Grafana, Datadog, or a custom setup. The key is correlating traces across agent boundaries using a shared trace_id, the same pattern used in microservice distributed tracing.
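Once spans land in the log stream, cross-agent correlation reduces to grouping on `trace_id`. A minimal sketch, assuming log lines shaped like the JSON spans above:

```python
import json
from collections import defaultdict
from typing import Dict, List

def group_spans_by_trace(log_lines: List[str]) -> Dict[str, List[dict]]:
    """Rebuild per-request traces from JSON span logs, ordered by start time."""
    traces = defaultdict(list)
    for line in log_lines:
        span = json.loads(line)
        traces[span["trace_id"]].append(span)
    for spans in traces.values():
        # Chronological order reconstructs the agent hand-off sequence.
        spans.sort(key=lambda s: s["start_time"])
    return dict(traces)
```

In practice a trace backend does this grouping for you; the point is that it only works if every agent faithfully propagates the same `trace_id`.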
Scaling challenges and practical limits
The IEEE CAI 2026 panels were candid about scaling limits. Some patterns that emerged:
Agent count scaling: most production systems use 3 to 8 specialized agents. Beyond that, coordination overhead dominates. If you need more specialization, consider hierarchical orchestration where a top-level orchestrator delegates to sub-orchestrators, each managing a small team. Agents that handle images or video alongside text add another dimension to this coordination (see multimodal AI).
Context window pressure: every agent that adds context to the shared state increases the prompt size for downstream agents. Aggressive summarization between agent steps is essential. Do not pass raw outputs forward; pass distilled summaries.
Cost management: multi-agent systems multiply LLM API costs. Instrument token usage per agent, per task type, and set budgets. Some tasks do not need a frontier model, and routing simpler subtasks to smaller models reduces cost significantly.
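A tiered router is one way to implement that routing decision. The tier names, prices, and complexity scores below are placeholders for illustration, not real model pricing:

```python
# Illustrative cost-aware routing table; names and prices are placeholders.
MODEL_TIERS = {
    "small": {"max_complexity": 2, "cost_per_1k_tokens": 0.0002},
    "medium": {"max_complexity": 5, "cost_per_1k_tokens": 0.003},
    "frontier": {"max_complexity": 10, "cost_per_1k_tokens": 0.03},
}

def pick_model(complexity: int) -> str:
    """Route a subtask to the cheapest tier that can handle its complexity score."""
    for name, tier in MODEL_TIERS.items():  # dicts preserve insertion order
        if complexity <= tier["max_complexity"]:
            return name
    return "frontier"
```

How you assign the complexity score (plan metadata, a cheap classifier, or heuristics over task type) is the real design decision; the routing itself is trivial once the score exists.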
Testing: unit testing individual agents is straightforward. Integration testing the full system, with realistic failure modes, is where most teams struggle. Build a test harness that can simulate agent failures, slow responses, and malformed outputs. You want automated regression tests that exercise the full agent graph.
Error recovery strategies
Three recovery patterns proved most reliable in production discussions at the conference:
- Local retry with backoff: the executor retries failed tasks with exponential backoff before escalating. This handles transient failures in LLM APIs and external tools.
- Partial replanning: when a critical task fails after retries, send the partial results back to the planner and ask for an alternative path. The planner may route around the failed capability or decompose the task differently.
- Graceful degradation: if a non-critical agent is unavailable, skip it and mark the output as partial. A portfolio analysis without the sentiment agent is still useful if the quantitative agents succeeded. Always communicate the degradation to the user.
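The first of these, retry with exponential backoff, can be sketched as a small wrapper around any coroutine factory (the attempt count, base delay, and jitter range are illustrative defaults):

```python
import asyncio
import random

async def retry_with_backoff(call, max_attempts: int = 4, base_delay: float = 0.5):
    """Retry a coroutine factory with exponential backoff and jitter."""
    for attempt in range(max_attempts):
        try:
            return await call()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # Escalate (e.g., to the planner) after the final attempt
            # Delay doubles each attempt; jitter avoids synchronized retry storms.
            await asyncio.sleep(base_delay * (2 ** attempt) + random.uniform(0, base_delay))
```

This belongs in the executor layer, beneath the circuit breaker: backoff absorbs transient failures, while the breaker handles sustained ones.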
Key Takeaways
- Separate planning, execution, and validation into distinct components that can be tested, scaled, and replaced independently.
- Use durable message queues instead of synchronous agent calls to gain resilience, retry semantics, and lifecycle decoupling.
- Apply circuit breakers to every agent call and external tool invocation to prevent cascading failures across the system.
- Federated orchestration distributes coordination across teams while sharing routing policies and aggregate monitoring.
- Instrument distributed traces with shared correlation IDs across all agent boundaries for effective debugging.
- Keep agent count between 3 and 8 per orchestrator, and use hierarchical orchestration for larger systems.
- Build integration tests that simulate realistic failure modes, including slow agents, malformed outputs, and partial availability.
- Treat multi-agent AI systems as distributed applications first, applying proven patterns from microservice and distributed systems engineering.