Benchmarking French open-weight models against global proprietary standards for production runs.
Measurement protocol
We measured end-to-end latency (TTFB, tokens/s, p95) for Mistral Large 2 self-hosted on OVHcloud H100 instances versus GPT-4.1 via Azure OpenAI in westeurope, across three production-shaped workloads: short Q&A, long-context retrieval, and tool-calling chains.
Results — short Q&A (≤ 1K tokens)
Mistral self-hosted: TTFB 180 ms, 92 tok/s, p95 1.4 s. Azure GPT-4.1: TTFB 410 ms, 78 tok/s, p95 2.3 s. The colocated open-weight wins on cold-path latency by avoiding the multi-tenant routing overhead.
Results — long context (32K tokens)
Throughput inverts. Azure GPT-4.1 sustains 76 tok/s at 32K, Mistral drops to 41 tok/s on a single H100 (KV cache pressure). Two-H100 deployment closes the gap but doubles cost.
Results — tool-calling chains
5-step agentic loops: GPT-4.1 5.8 s total, Mistral Large 2 7.2 s. Difference traced to fewer round-trips needed (GPT-4.1 emits cleaner tool args on first attempt — 0.7 retry/loop avg vs 1.4).
The sovereign verdict
For predictable short-form workloads under EU data residency, Mistral self-hosted is faster and cheaper at scale (€/Mtok). For agentic complexity, GPT-4.1 still wins on retry economics. Most enterprises will route per task class, not per vendor.