Hélain Zimmermann

Mistral Small 4: One MoE to Replace Three Models

Mistral AI released Mistral Small 4 on March 16, 2026, and the pitch is simple: stop juggling three models for three workloads. This single 119B parameter Mixture-of-Experts model handles general instruction following, extended reasoning, and multimodal (vision) tasks, all under the Apache 2.0 license.

The model activates roughly 22B parameters per forward pass, which keeps inference costs comparable to much smaller dense models while delivering performance that competes with closed models three to five times its total parameter count. That is an interesting value proposition, and it deserves a closer look.

The Architecture

Mistral Small 4 is a Mixture-of-Experts model with a 256K token context window and native image input support. The MoE routing means that for any given token, only a fraction of the total 119B parameters are active. This is the same architectural family that has made models like DeepSeek V3 so cost-effective to run.

The key design choice is unification. Previous Mistral releases split capabilities across separate models or fine-tuned variants: one for general chat, one for code, with reasoning handled through separate prompting strategies. Small 4 collapses these into a single checkpoint with configurable reasoning depth.

from mistralai import Mistral

client = Mistral(api_key="your-key")

# General instruction following
response = client.chat.complete(
    model="mistral-small-4",
    messages=[{"role": "user", "content": "Summarize the key changes in the EU AI Act."}],
)

# Same model, reasoning mode enabled
response = client.chat.complete(
    model="mistral-small-4",
    messages=[{"role": "user", "content": "Prove that the square root of 2 is irrational."}],
    reasoning={"enabled": True},
)

# Same model, vision input
response = client.chat.complete(
    model="mistral-small-4",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "What does this architecture diagram show?"},
            {"type": "image_url", "image_url": {"url": "https://example.com/diagram.png"}},
        ],
    }],
)

The reasoning mode is configurable, not always-on. This matters for latency-sensitive applications where you want fast responses for simple queries but deeper reasoning for complex ones.

Benchmarks: Compact but Competitive

The headline numbers:

Benchmark     | Mistral Small 4               | Notes
GPQA Diamond  | 71.2%                         | Graduate-level science questions
MMLU-Pro      | 78.0%                         | Professional knowledge tasks
LiveCodeBench | Competitive with GPT-OSS 120B | ~20% fewer output tokens
AA LCR        | 0.72 (1.6K chars of output)   | Qwen models need 3.5-4x more output

The efficiency angle is where Small 4 stands out. On long-context reasoning (AA LCR), it scores 0.72 while generating only 1.6K characters of output. Comparable models from Qwen produce 5.8K to 6.1K characters for similar scores. Less output means faster responses and lower cost per query.
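To make the efficiency claim concrete, here is a back-of-envelope comparison. This is a sketch, not real pricing: the 4-characters-per-token ratio and the per-million-token price are illustrative assumptions.

```python
# Rough cost-per-query comparison based on output length alone.
# Assumptions (hypothetical): ~4 characters per token, and an
# illustrative output price of $0.50 per million tokens.
CHARS_PER_TOKEN = 4
PRICE_PER_M_TOKENS = 0.50  # USD, illustrative only

def output_cost(chars: int) -> float:
    tokens = chars / CHARS_PER_TOKEN
    return tokens / 1_000_000 * PRICE_PER_M_TOKENS

small4 = output_cost(1_600)  # ~1.6K chars per AA LCR answer
qwen = output_cost(5_900)    # midpoint of the 5.8K-6.1K range

print(f"output-cost ratio: {qwen / small4:.1f}x")  # ~3.7x at similar scores
```

Whatever the actual price per token, the ratio holds: at similar scores, generating a third of the output means roughly a third of the output cost and latency.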

Compared to Mistral Small 3, the improvements in operational efficiency are significant: 40% reduction in end-to-end completion time in latency-optimized configurations, and 3x more requests per second in throughput-optimized setups. These are not abstract benchmark gains; they translate directly to infrastructure costs.

Why Unification Matters

Running separate models for separate tasks is operationally expensive. Each model needs its own deployment, its own GPU allocation, its own monitoring, and its own version management. For teams building multi-agent architectures, the overhead multiplies.

A unified model simplifies the stack:

Single deployment, multiple capabilities. One model serves all request types. Your routing layer becomes simpler (or disappears entirely for simpler applications). GPU utilization improves because you are not maintaining idle capacity across multiple specialized deployments.

Consistent behavior across tasks. When different models handle different tasks, you get inconsistencies in tone, formatting, and error handling. A unified model produces more predictable outputs across workloads.

Simpler fine-tuning pipeline. If you need to adapt the model to your domain, you fine-tune one checkpoint instead of three. Your training data pipeline consolidates, and your evaluation suite covers all capabilities in one pass.

The trade-off is that a unified model may not match the absolute peak performance of a specialist. If your entire workload is code generation, a model trained exclusively for code might edge out Small 4. But for most production systems that handle a mix of tasks, the operational simplification outweighs marginal benchmark differences.

Apache 2.0: What It Actually Enables

The license matters more than people give it credit for. Apache 2.0 means:

  • Full commercial use with no revenue thresholds
  • Modification and redistribution permitted
  • No requirement to share modifications
  • Patent grant included

This is a genuine open-source release, not the "open weights with restrictive license" pattern we have seen from some labs. For organizations that need to self-host due to data privacy requirements, regulatory constraints, or cost control at scale, this is significant.

The debate around open-washing in AI has made license terms a first-order concern. Mistral has consistently shipped under permissive licenses, and Small 4 continues that pattern.

For self-hosting, the MoE architecture helps on the compute side. At 22B active parameters, the compute per token is comparable to a much smaller dense model, so generation is fast on hardware where a 70B dense model would crawl. You still need enough memory to hold all 119B weights; it is only the compute per token that scales with the active parameters.
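A quick sketch of that memory/compute asymmetry, using the parameter counts above. The bytes-per-parameter figures are standard for FP16 and 4-bit quantization, and the 2N FLOPs-per-token rule of thumb is an approximation, not a measurement:

```python
# Back-of-envelope MoE sizing, using the figures from the article:
# all 119B weights must be resident in memory, but each token only
# touches ~22B of them.
TOTAL_PARAMS = 119e9
ACTIVE_PARAMS = 22e9

def weight_memory_gb(params: float, bytes_per_param: float) -> float:
    return params * bytes_per_param / 1e9

print(f"FP16 weights:  {weight_memory_gb(TOTAL_PARAMS, 2):.1f} GB")    # 238.0 GB
print(f"4-bit weights: {weight_memory_gb(TOTAL_PARAMS, 0.5):.1f} GB")  # 59.5 GB

# Compute per token scales with active params only, using the common
# ~2 * N FLOPs-per-token approximation for a forward pass.
flops_per_token = 2 * ACTIVE_PARAMS
print(f"~{flops_per_token / 1e9:.0f} GFLOPs per token")  # like a 22B dense model
```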

Practical Deployment Considerations

Memory requirements

The full model in FP16 needs roughly 238GB of VRAM. With quantization (AWQ or GPTQ at 4-bit), you can bring this down to around 60GB, which fits on a single A100 80GB or a pair of consumer GPUs with tensor parallelism.

# Example: loading with vLLM for efficient serving
from vllm import LLM, SamplingParams

llm = LLM(
    model="mistralai/mistral-small-4-119b",
    tensor_parallel_size=2,
    quantization="awq",
    max_model_len=65536,  # adjust based on your use case
)

params = SamplingParams(temperature=0.7, max_tokens=2048)
outputs = llm.generate(["Explain MoE routing in three sentences."], params)

When to enable reasoning

The configurable reasoning depth is useful, but do not enable it by default. Reasoning mode increases latency and token generation. Use it selectively:

  • Enable for complex analytical queries, multi-step problems, and tasks requiring chain-of-thought
  • Disable for summarization, translation, simple Q&A, and formatting tasks
  • Build a classifier or heuristic to route between modes if your traffic is mixed
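As a starting point, the routing heuristic can be as simple as a keyword and length check. This is a rough sketch: the keywords and the word-count threshold are illustrative, and a production system would likely replace this with a small trained classifier.

```python
import re

# Crude heuristic router (a sketch, not production logic): long or
# analytically phrased queries get reasoning mode, everything else
# stays on the fast path. Keywords and thresholds are illustrative.
REASONING_HINTS = re.compile(
    r"\b(prove|derive|step[- ]by[- ]step|why does|trade[- ]offs?|compare)\b",
    re.IGNORECASE,
)

def needs_reasoning(query: str) -> bool:
    return bool(REASONING_HINTS.search(query)) or len(query.split()) > 80

def chat(client, query: str):
    kwargs = {"reasoning": {"enabled": True}} if needs_reasoning(query) else {}
    return client.chat.complete(
        model="mistral-small-4",
        messages=[{"role": "user", "content": query}],
        **kwargs,
    )
```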

Context window trade-offs

The 256K context window is large, but loading it fully impacts throughput. For most applications, you will get better performance by keeping inputs under 32K tokens and using retrieval to bring in relevant context rather than stuffing everything into the prompt. This is where a well-designed RAG pipeline pays for itself.
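A minimal sketch of that budget check, assuming a rough 4-characters-per-token estimate and a hypothetical `retrieve` function passed in by the caller (a real pipeline would use the model's tokenizer and an actual retriever):

```python
# Keep prompts under a working budget and fall back to retrieval
# instead of stuffing everything into the 256K window. Token counts
# use a rough 4-chars-per-token estimate; a real pipeline would use
# the model's tokenizer.
BUDGET_TOKENS = 32_000

def approx_tokens(text: str) -> int:
    return len(text) // 4

def build_context(question: str, documents: list[str], retrieve) -> str:
    full = "\n\n".join(documents)
    if approx_tokens(full) <= BUDGET_TOKENS:
        return full  # small enough: send everything
    # Otherwise pull only the passages relevant to the question.
    return "\n\n".join(retrieve(question, documents, top_k=5))
```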

Where Small 4 Fits in the Landscape

Mistral Small 4 occupies an interesting position. It is not competing with GPT-5.4 or Claude Opus on raw capability. Instead, it targets the "good enough for most tasks, cheap enough to self-host" segment that has driven adoption of open-weight models throughout 2025 and 2026.

The closest competitors are Qwen 3.5 (Alibaba) and the various DeepSeek releases. All three labs are converging on MoE architectures with high total parameter counts but low active parameter counts. The differentiation is increasingly about:

  1. License terms (Mistral wins here with Apache 2.0)
  2. Multimodal support (Small 4 includes vision natively)
  3. Reasoning capabilities (configurable depth is a practical feature)
  4. Ecosystem and tooling (Mistral's API and deployment tools are solid)

For European organizations, there is an additional consideration. Mistral is a French company, and some enterprises prefer to work with EU-based AI providers for regulatory and data sovereignty reasons. This is not a technical differentiator, but it influences procurement decisions.

Integration with Agent Frameworks

Small 4's tool use support makes it viable as the backbone of agent systems. The model can generate structured tool calls, handle multi-turn interactions with tool results, and reason about when to use which tool.

For teams building on OpenCLAW or similar agent orchestration frameworks, Small 4 provides a self-hostable alternative to closed API models. The trade-off is that you manage the infrastructure, but you gain full control over the model, its data, and its availability.

# Tool use with Mistral Small 4
tools = [
    {
        "type": "function",
        "function": {
            "name": "search_documents",
            "description": "Search internal documents by semantic query",
            "parameters": {
                "type": "object",
                "properties": {
                    "query": {"type": "string", "description": "Search query"},
                    "limit": {"type": "integer", "default": 5},
                },
                "required": ["query"],
            },
        },
    }
]

messages = [{"role": "user", "content": "Find the latest data retention policy."}]

response = client.chat.complete(
    model="mistral-small-4",
    messages=messages,
    tools=tools,
    tool_choice="auto",
)
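Handling the tool-call round-trip can look like the sketch below. `search_documents` is a hypothetical local implementation standing in for a real index, and the response field names follow the Mistral SDK's documented tool-call shape, which may differ across SDK versions:

```python
import json

def search_documents(query: str, limit: int = 5) -> list[str]:
    # Hypothetical stand-in for a real semantic search index.
    return [f"doc matching '{query}'"][:limit]

def run_tool_turn(client, messages, tools, response):
    """Execute the model's tool call and send the result back."""
    tool_call = response.choices[0].message.tool_calls[0]
    args = json.loads(tool_call.function.arguments)
    result = search_documents(**args)

    messages.append(response.choices[0].message)  # assistant turn with the call
    messages.append({
        "role": "tool",
        "name": tool_call.function.name,
        "content": json.dumps(result),
        "tool_call_id": tool_call.id,
    })
    return client.chat.complete(
        model="mistral-small-4", messages=messages, tools=tools
    )
```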

Key Takeaways

  • Mistral Small 4 is a 119B parameter MoE model that unifies instruct, reasoning, and vision capabilities into a single checkpoint with 22B active parameters per token.
  • It ships under Apache 2.0, making it one of the most permissively licensed frontier-class models available.
  • Configurable reasoning depth lets you trade latency for accuracy on a per-request basis, which is more practical than always-on reasoning.
  • The MoE architecture makes self-hosting feasible on hardware that would struggle with equivalent dense models.
  • Operational simplification from running one model instead of three is often more valuable than marginal benchmark gains from specialists.
  • For mixed workloads (chat, reasoning, vision, tool use), this is one of the most practical open-weight options available in March 2026.
  • The competitive landscape is converging: Mistral, Qwen, and DeepSeek are all shipping MoE models with similar efficiency profiles, differentiated primarily by license, ecosystem, and regional factors.
