Mistral Large 3

The quiet 5/5. Half the price of GPT-5.5 and didn't whine about JSON once.

5/5 benchmarks passed
$0.267 spent in total
24 tool calls

What it is.

Mistral AI's flagship dense model out of Paris, the European frontier-model lab's answer to GPT-5 and Claude Sonnet. We tested the November 2026 production endpoint via Mistral's first-party API.

What it does well.

  • JSON discipline. Every run produced a parseable final message conforming to the schema — no preamble, no markdown fences leaking in, no "### Final Output:" prose before the object.
  • Tool budget. Used the fewest tool calls of any 5/5 model on three of the five benchmarks. It doesn't double-check itself into the ground.
  • Cheap to run. Total spend across the matrix was $0.27 — the cheapest perfect score and roughly 4x cheaper than GPT-5.5, 14x cheaper than Sonnet 4.6, and 17x cheaper than Opus 4.8.
  • Predictable shape. Output length and tool-call count varied very little across runs — easy to budget for in production agents.
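For readers who want to apply the same "JSON discipline" bar to their own agents, here is a minimal sketch of what such a check might look like. The function name `checkFinalMessage`, the `CheckResult` type, and the required-key list are hypothetical; the benchmark's actual schema and harness are not shown in this write-up.

```typescript
// Hypothetical check for a "clean" final message: the whole message must be
// one bare JSON object containing the expected keys, with no prose or
// markdown fences around it.
type CheckResult =
  | { ok: true; value: Record<string, unknown> }
  | { ok: false; reason: string };

function checkFinalMessage(message: string, requiredKeys: string[]): CheckResult {
  const trimmed = message.trim();
  // Any preamble or code fence means the message no longer starts with "{".
  if (!trimmed.startsWith("{") || !trimmed.endsWith("}")) {
    return { ok: false, reason: "message is not a bare JSON object" };
  }
  let parsed: unknown;
  try {
    parsed = JSON.parse(trimmed);
  } catch {
    return { ok: false, reason: "message is not parseable JSON" };
  }
  if (typeof parsed !== "object" || parsed === null || Array.isArray(parsed)) {
    return { ok: false, reason: "top-level value is not an object" };
  }
  const value = parsed as Record<string, unknown>;
  // Only key presence is checked here; a real harness would validate types too.
  const missing = requiredKeys.filter((key) => !(key in value));
  if (missing.length > 0) {
    return { ok: false, reason: `missing keys: ${missing.join(", ")}` };
  }
  return { ok: true, value };
}
```

A message that opens with "### Final Output:" or a code fence fails the first test before parsing is even attempted, which is the failure mode the bullet above describes.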

Where to be careful.

  • We tested short-horizon agent jobs (2 to 9 tool calls per task). We have no data on its behavior across multi-hour, multi-day, or multi-tenant chains where context management becomes the bottleneck.
  • No coding rubric beyond "call the right Stripe primitives." Don't read this as a coding-quality verdict — read it as "can it read docs and produce a working function."
  • No adversarial inputs in this suite. We didn't test prompt injection, schema collisions, or partial-tool-failure recovery.

Results by agent

Five real jobs.

Sales · Sales Research Analyst

Get financials on a company

Passed

Find Hightouch on Crunchbase and return their total funding, last round type, and a one-line description.

Found Hightouch's Crunchbase slug via Google, then hit the Crunchbase tool, then scraped the page once more to verify the funding number. Seven tool calls is on the higher end here but every call had a clear purpose.

$0.058 · 7 tool calls

Marketing · Marketing Research Analyst

Find what customers are recommending on Reddit

Passed

Read a real r/sales thread on enrichment tools and rank the top 5 by how many people recommended them.

$0.048 — second-cheapest pass on the Reddit benchmark, behind only DeepSeek (which broke the next benchmark to compensate). Ranked the 5 tools by distinct-commenter mentions and stopped.
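Ranking by distinct-commenter mentions can be sketched roughly as follows. The `Comment` shape, the function name, and the naive substring matching are assumptions made for illustration; the thread data and the model's actual method are not shown here.

```typescript
// Hypothetical comment shape; the real thread data is not part of this write-up.
interface Comment {
  author: string;
  body: string;
}

// Rank tools by the number of distinct commenters who mention them, so one
// person repeating a tool name several times still counts only once.
function rankByDistinctCommenters(
  comments: Comment[],
  tools: string[],
  topN = 5,
): Array<{ tool: string; commenters: number }> {
  const authorsByTool = new Map<string, Set<string>>();
  for (const tool of tools) {
    authorsByTool.set(tool, new Set<string>());
  }
  for (const comment of comments) {
    const text = comment.body.toLowerCase();
    for (const tool of tools) {
      // Naive case-insensitive substring match; good enough for a sketch.
      if (text.includes(tool.toLowerCase())) {
        authorsByTool.get(tool)!.add(comment.author);
      }
    }
  }
  return Array.from(authorsByTool.entries())
    .map(([tool, authors]) => ({ tool, commenters: authors.size }))
    .filter((row) => row.commenters > 0)
    // Most commenters first; ties broken alphabetically for a stable order.
    .sort((a, b) => b.commenters - a.commenters || a.tool.localeCompare(b.tool))
    .slice(0, topN);
}
```

Counting authors rather than raw mentions keeps one enthusiastic commenter from dominating the ranking. Substring matching will over-count short or ambiguous tool names, so a real pipeline would likely need something stricter.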

$0.048 · 9 tool calls

Passed

Pull Google Shopping reviews for AirPods Pro 2 (USB-C) and return the top 3 recurring complaints, with verbatim quotes.

Two tool calls (a single product search and a single review pull) and out. The other 5/5 models averaged anywhere from 4 to 13 calls on this one.

$0.066 · 2 tool calls

Coding · Senior Engineer Assistant

Read API docs and write working code

Passed

Read Stripe's official docs and write a real, working webhook-verification function in TypeScript.

Read the official Stripe docs, wrote idiomatic Node.js with `createHmac('sha256', ...)` and `timingSafeEqual`, named the function `verifyStripeWebhook`. Schema-clean on the first attempt.
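For context, a function of this kind might look roughly like the sketch below. This is not the model's actual output, which is not reproduced here. It assumes Stripe's documented signature scheme (a `Stripe-Signature` header of the form `t=<unix seconds>,v1=<hex HMAC>`, signed over `<t>.<raw body>`); the 300-second tolerance and the `nowSeconds` parameter are choices made for this example.

```typescript
import { createHmac, timingSafeEqual } from "node:crypto";

// Sketch of Stripe-style webhook signature verification.
// Assumptions for this example: 300-second replay tolerance, and an
// injectable `nowSeconds` so the check can be tested deterministically.
function verifyStripeWebhook(
  rawBody: string,
  signatureHeader: string,
  secret: string,
  toleranceSeconds = 300,
  nowSeconds: number = Math.floor(Date.now() / 1000),
): boolean {
  let timestamp: string | undefined;
  const candidates: string[] = [];
  // Parse "t=...,v1=...,v1=..." into a timestamp and candidate signatures.
  for (const part of signatureHeader.split(",")) {
    const idx = part.indexOf("=");
    if (idx === -1) continue;
    const key = part.slice(0, idx).trim();
    const value = part.slice(idx + 1).trim();
    if (key === "t") timestamp = value;
    if (key === "v1") candidates.push(value);
  }
  if (timestamp === undefined || candidates.length === 0) return false;

  // Reject events whose timestamp is outside the tolerance window.
  const ts = Number(timestamp);
  if (!Number.isFinite(ts) || Math.abs(nowSeconds - ts) > toleranceSeconds) {
    return false;
  }

  // The HMAC is computed over "<timestamp>.<raw body>" with the endpoint secret.
  const expected = createHmac("sha256", secret)
    .update(`${timestamp}.${rawBody}`, "utf8")
    .digest();

  return candidates.some((candidate) => {
    const given = Buffer.from(candidate, "hex");
    // timingSafeEqual throws on length mismatch, so compare lengths first.
    return given.length === expected.length && timingSafeEqual(given, expected);
  });
}
```

The raw request body has to be passed in untouched: re-serializing parsed JSON changes the bytes and the signature will no longer match. In production, Stripe's own SDK helper is usually the safer choice over a hand-rolled check.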

$0.071 · 4 tool calls

Web Scraping · Sales Outreach Specialist

Scrape a competitor's pricing page

Passed

Scrape Apollo.io's pricing page and return every tier (Free, Basic, Professional, Organization) with name, price, and top 3 features.

$0.025 — the single cheapest pass across all 16 models on any benchmark. Two tool calls, all four Apollo tiers extracted with three features each.
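A pass condition like "all four tiers, three features each" could be checked with something like the sketch below. The `PricingTier` shape and `validateTiers` function are hypothetical; the benchmark's real output schema and grader are not shown in this write-up.

```typescript
// Hypothetical output shape for the pricing benchmark.
interface PricingTier {
  name: string;
  price: string;
  features: string[];
}

// Tier names taken from the task description above.
const EXPECTED_TIERS = ["Free", "Basic", "Professional", "Organization"];

// Returns a list of problems; an empty list means the extraction looks complete.
function validateTiers(tiers: PricingTier[]): string[] {
  const problems: string[] = [];
  for (const expected of EXPECTED_TIERS) {
    const tier = tiers.find((t) => t.name === expected);
    if (!tier) {
      problems.push(`missing tier: ${expected}`);
      continue;
    }
    if (tier.price.trim() === "") {
      problems.push(`${expected}: empty price`);
    }
    if (tier.features.length !== 3) {
      problems.push(`${expected}: expected 3 features, got ${tier.features.length}`);
    }
  }
  return problems;
}
```

This only checks shape and completeness. Whether the extracted prices match the live page is a separate question that a structural check cannot answer.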

$0.025 · 2 tool calls