Last run · 2026-06-02 · 80 agent runs

Real benchmarks for real people.

Popular benchmarks don't tell you how AI handles real work, because models get tuned to score well on the benchmarks themselves. So we gave 16 models five real jobs to do, with the same tools and the same prompts, and graded how each one did.

16 models tested
5 real agents
80 agent runs
$0.29 avg cost per run

Why this is different

Most leaderboards measure trivia.
We measured work.

01

Real agents.

Salespeople researching their prospects. Marketers reading what customers are actually saying. A CX team mining product reviews. An engineer writing the Stripe webhook they keep meaning to ship. Real jobs people do every day.

02

Real tools.

We gave the models Google search, the Crunchbase API, Google Shopping reviews, and a real site scraper. These are the same tools our customers use every day across different verticals.

03

Real grading.

Each task is graded against a rubric. The rubric checks the output for accuracy and for the format it is returned in, and every check is an explicit pass or fail. We also record how much each run cost and how long it took.

The five jobs

Meet the agents.

We ran a Cotera agent for each task. Same tools, same prompt. Everything besides the actual LLM was kept consistent across runs.

Sales

Get financials on a company

Find Hightouch on Crunchbase and return their total funding, last round type, and a one-line description.

Agent: Sales Research Analyst
Tools available: Google Search, Site Scraper, Crunchbase
How it's graded
  • Did it find the right funding amount?
  • Did it get the right round (Series E)?
  • Did it return a link to real Crunchbase data, not made up from memory?
Cheapest passing models
1. Gemini 3 Flash: $0.032
2. Grok 4.1 Fast: $0.035
3. GLM 5.1: $0.048

Marketing

Find what customers are recommending on Reddit

Read a real r/sales thread on enrichment tools and rank the top 5 by how many people recommended them.

Agent: Marketing Research Analyst
Tools available: Google Search, Reddit
How it's graded
  • Did it actually open a real Reddit thread?
  • Did it return exactly 5 tools, not 4 or 7?
  • Did each tool come with a real mention count and a real reason?
Cheapest passing models
1. DeepSeek V3.1: $0.045
2. Mistral Large 3: $0.048
3. GPT-5 Mini: $0.056

CX

Find what customers are complaining about

Pull Google Shopping reviews for AirPods Pro 2 (USB-C) and return the top 3 recurring complaints, with verbatim quotes.

Agent: Customer Insights Analyst
Tools available: Google Search, Google Shopping
How it's graded
  • Did it identify the actual AirPods product?
  • Did it return exactly 3 complaints?
  • Are the quotes pulled verbatim from real reviews, not paraphrased from memory?
Cheapest passing models
1. DeepSeek V3.1: $0.053
2. Mistral Large 3: $0.066
3. GPT-5 Mini: $0.108

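
The verbatim-quote check in this rubric can be approximated with a substring test against the scraped review text. This is a minimal sketch, assuming the grader has the raw reviews on hand. The names `normalize`, `isVerbatim`, and `gradeComplaints`, and the output shape with `complaint` and `quote` fields, are illustrative assumptions rather than the rubric code used for these runs.

```typescript
// Hypothetical helper: collapses whitespace and straightens curly quotes so
// that formatting differences alone don't fail an otherwise exact quote.
function normalize(text: string): string {
  return text
    .replace(/[\u2018\u2019]/g, "'")
    .replace(/[\u201C\u201D]/g, '"')
    .replace(/\s+/g, " ")
    .trim();
}

// A quote counts as verbatim only if it appears inside at least one review.
function isVerbatim(quote: string, reviews: string[]): boolean {
  const needle = normalize(quote);
  if (needle.length === 0) return false;
  return reviews.some((review) => normalize(review).includes(needle));
}

// Passes only with exactly 3 complaints, each backed by a verbatim quote.
function gradeComplaints(
  complaints: { complaint: string; quote: string }[],
  reviews: string[],
): boolean {
  return (
    complaints.length === 3 &&
    complaints.every((c) => isVerbatim(c.quote, reviews))
  );
}
```

A paraphrased quote fails `isVerbatim` even when its meaning matches a real review, which is the behavior the third rubric item asks for.
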
Coding

Read API docs and write working code

Read Stripe's official docs and write a real, working webhook-verification function in TypeScript.

Agent: Senior Engineer Assistant
Tools available: Google Search, Site Scraper
How it's graded
  • Did it actually read Stripe's docs?
  • Did it write a real, working function, not just describe one in prose?
  • Does the function use the right security primitives so it would actually run?
Cheapest passing models
1. Llama 4 Maverick: $0.001
2. GPT-5 Mini: $0.037
3. Mistral Large 3: $0.071

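
For context on what "the right security primitives" means here, the sketch below shows one way a passing answer could look. It follows Stripe's documented scheme: an HMAC-SHA256 over `timestamp.rawBody`, compared in constant time against the `v1` values in the `Stripe-Signature` header, with a timestamp tolerance. The function names and the 300-second default are our assumptions for illustration, and this is not the output of any model in the benchmark.

```typescript
import { createHmac, timingSafeEqual } from "node:crypto";

// Parses a header such as "t=1700000000,v1=abc...,v1=def...".
function parseSignatureHeader(header: string): { timestamp: string; signatures: string[] } {
  let timestamp = "";
  const signatures: string[] = [];
  for (const part of header.split(",")) {
    const idx = part.indexOf("=");
    if (idx === -1) continue;
    const key = part.slice(0, idx).trim();
    const value = part.slice(idx + 1).trim();
    if (key === "t") timestamp = value;
    if (key === "v1") signatures.push(value);
  }
  return { timestamp, signatures };
}

// Returns true only if the raw request body was signed with `secret`
// and the timestamp is within the tolerance window.
function verifyStripeSignature(
  rawBody: string,
  header: string,
  secret: string,
  toleranceSeconds = 300, // assumed default for this sketch
  nowSeconds = Math.floor(Date.now() / 1000),
): boolean {
  const { timestamp, signatures } = parseSignatureHeader(header);
  const issuedAt = Number(timestamp);
  if (timestamp === "" || !Number.isFinite(issuedAt) || signatures.length === 0) return false;

  // Reject stale events to limit replay attacks.
  if (Math.abs(nowSeconds - issuedAt) > toleranceSeconds) return false;

  const expected = createHmac("sha256", secret)
    .update(`${timestamp}.${rawBody}`, "utf8")
    .digest();

  // Constant-time comparison; the length check avoids timingSafeEqual throwing.
  return signatures.some((sig) => {
    const candidate = Buffer.from(sig, "hex");
    return candidate.length === expected.length && timingSafeEqual(candidate, expected);
  });
}
```

The body must be the raw, unparsed request payload, since re-serialized JSON will not match the signature. In production code, the official `stripe` library's `stripe.webhooks.constructEvent` performs the same verification.
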
Web Scraping

Scrape a competitor's pricing page

Scrape Apollo.io's pricing page and return every tier (Free, Basic, Professional, Organization) with name, price, and top 3 features.

Agent: Sales Outreach Specialist
Tools available: Google Search, Site Scraper
How it's graded
  • Did it return all of Apollo's pricing tiers, in price order?
  • Does each tier have a real price, or 'Contact sales' for the enterprise tier?
  • Did it return three actual features per tier?
Cheapest passing models
1. Llama 4 Maverick: $0.001
2. DeepSeek V3.1: $0.014
3. Mistral Large 3: $0.025

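
The three checks above translate fairly directly into code. The sketch below is illustrative only: the `Tier` shape, the numeric `price` field, and the `gradePricing` name are assumptions, and the tier names come from the task description above.

```typescript
// Hypothetical output shape for the scraping task.
type Tier = {
  name: string;
  price: number | "Contact sales";
  features: string[];
};

const EXPECTED_TIERS = ["Free", "Basic", "Professional", "Organization"];

function gradePricing(tiers: Tier[]): boolean {
  // Every named tier must be present, with nothing extra.
  const names = tiers.map((t) => t.name);
  if (names.length !== EXPECTED_TIERS.length) return false;
  if (!EXPECTED_TIERS.every((n) => names.includes(n))) return false;

  // Exactly three non-empty features per tier.
  const featuresOk = tiers.every(
    (t) => t.features.length === 3 && t.features.every((f) => f.trim().length > 0),
  );
  if (!featuresOk) return false;

  // Numeric prices must be non-negative and ascending;
  // "Contact sales" is only accepted on the last (enterprise) tier.
  let previous = 0;
  return tiers.every((tier, i) => {
    if (tier.price === "Contact sales") return i === tiers.length - 1;
    if (!Number.isFinite(tier.price) || tier.price < previous) return false;
    previous = tier.price;
    return true;
  });
}
```
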

The scoreboard

16 models, ranked by what they actually delivered.

Sorted by pass count, ties broken by total dollars spent.

| Rank | Model | Pass | Total | Sales | Marketing | CX | Coding | Scrape |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 🥇 1 | Mistral Large 3 | 5/5 | $0.267 | $0.058 | $0.048 | $0.066 | $0.071 | $0.025 |
| 🥈 2 | GPT-5 Mini | 5/5 | $0.376 | $0.146 | $0.056 | $0.108 | $0.037 | $0.029 |
| 🥉 3 | Gemini 3 Flash | 5/5 | $0.758 | $0.032 | $0.217 | $0.339 | $0.092 | $0.077 |
| 4 | GPT-5.5 | 5/5 | $1.21 | $0.104 | $0.435 | $0.348 | $0.266 | $0.053 |
| 5 | Gemini 3.5 Flash | 5/5 | $1.41 | $0.148 | $0.118 | $0.623 | $0.486 | $0.035 |
| 6 | Gemini 3.1 Pro | 5/5 | $2.73 | $0.086 | $0.665 | $1.48 | $0.388 | $0.107 |
| 7 | Claude Sonnet 4.6 | 5/5 | $3.83 | $0.100 | $1.82 | $0.723 | $1.11 | $0.084 |
| 8 | Claude Opus 4.8 | 5/5 | $4.58 | $1.22 | $1.74 | $0.998 | $0.439 | $0.185 |
| 9 | Claude Haiku 4.5 | 4/5 | $0.941 | $0.007 | $0.125 | $0.119 | $0.632 | $0.058 |
| 10 | DeepSeek V3.1 | 3/5 | $0.113 | $0.001 | $0.045 | $0.053 | $0.001 | $0.014 |
| 11 | MiniMax M3 | 3/5 | $1.10 | $0.254 | $0.200 | $0.092 | $0.508 | $0.044 |
| 12 | GLM 5.1 | 3/5 | $1.19 | $0.048 | $0.000 | $0.420 | $0.667 | $0.052 |
| 13 | Kimi K2.6 | 3/5 | $1.27 | $0.104 | $0.423 | $0.333 | $0.268 | $0.144 |
| 14 | Llama 4 Maverick | 2/5 | $0.004 | $0.001 | $0.001 | $0.001 | $0.001 | $0.001 |
| 15 | Grok 4.1 Fast | 1/5 | $0.089 | $0.035 | $0.052 | $0.000 | $0.000 | $0.002 |
| 16 | Qwen 3.6 Plus | 0/5 | $2.98 | $1.91 | $0.399 | $0.229 | $0.251 | $0.188 |

Dollars = what Cotera bills per agent run (~$0.005 per credit). Lower is better. A ✓ at $0.00 means the model emitted an answer from training memory without calling tools — we treat that as a pass on the rubric, but flag it as a meaningful behavior in the failure breakdown.
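
The scoreboard ordering (pass count first, total spend as the tiebreaker) is a two-key sort. Here is a small sketch using four rows from the table. The `Row` type and `rankRows` name are illustrative, not taken from the benchmark runner.

```typescript
type Row = { model: string; passes: number; totalUsd: number };

// Higher pass count first; ties go to the cheaper total.
function rankRows(rows: Row[]): Row[] {
  return [...rows].sort(
    (a, b) => b.passes - a.passes || a.totalUsd - b.totalUsd,
  );
}

const sample: Row[] = [
  { model: "Claude Haiku 4.5", passes: 4, totalUsd: 0.941 },
  { model: "GPT-5 Mini", passes: 5, totalUsd: 0.376 },
  { model: "DeepSeek V3.1", passes: 3, totalUsd: 0.113 },
  { model: "Mistral Large 3", passes: 5, totalUsd: 0.267 },
];

console.log(rankRows(sample).map((r) => r.model).join(", "));
// prints: Mistral Large 3, GPT-5 Mini, Claude Haiku 4.5, DeepSeek V3.1
```

Because passes are compared before cost, DeepSeek V3.1 ranks below Claude Haiku 4.5 even though its total spend is much lower.
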

Where they broke

The failures are the interesting part.

Every benchmark suite advertises the wins. We're showing the failures. Here's where each model usually breaks down. The pattern, not the one-off.

Llama 4 Maverick
Hallucinated
Sales: Get financials on a company

This model skips the tools and answers from its training data, which is badly out of date. The result is very confident and very wrong.

0 tool calls · $0.001 spent

Kimi K2.6
Stuck in a loop
Marketing: Find what customers are recommending on Reddit

This model starts writing the answer, then gets stuck repeating the same fragment until the run has to be killed. It looks busy, but it never finishes.

3 tool calls · $0.423 spent

Kimi K2.6
Confident but empty
Coding: Read API docs and write working code

This model writes a confident explanation of what the function should do, then never writes the function, so the code field comes back empty. It looks correct until you try to run it.

5 tool calls · $0.268 spent

MiniMax M3
Thought, never wrote
Coding: Read API docs and write working code

This model keeps working through the code in a private scratchpad but never responds with a final answer. You can see it working, but you don't get any output.

22 tool calls · $0.508 spent

GLM 5.1
Couldn't format the output
Coding: Read API docs and write working code

This model doesn't produce reliable structured output, so the answer comes back in a shape the agent can't read. It tried to answer, but the format is broken.

11 tool calls · $0.667 spent

Qwen 3.6 Plus
Spent a lot, returned nothing
Sales: Get financials on a company

This model calls tools repeatedly and burns a lot of compute, but never returns a usable answer. That makes it the most expensive way to do nothing in the matrix.

9 tool calls · $1.91 spent

Claude Haiku 4.5
Stopped before starting
Sales: Get financials on a company

This model tells you it's going to do the work, then doesn't. The whole response is a polite sentence of intent, with no real tool work behind it.

1 tool call · $0.007 spent

How we ran it

No vibes. Show your work.

01

One agent, every model.

For each task we built a Cotera agent — a system prompt, a tool list, and a default model. To benchmark, we kept the agent and overrode the model via context.model on the chat endpoint. Tools and prompts stay identical.
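
As a rough illustration of that setup, the sketch below builds one request per agent and model pair while holding everything else fixed. Only `context.model` comes from the description above. The `agentId` and `message` fields, the `ChatRequest` type, and the placeholder model names are hypothetical, and no real endpoint is called.

```typescript
// Hypothetical request shape: only `context.model` is named in the write-up.
type ChatRequest = {
  agentId: string;
  message: string;
  context: { model: string };
};

function buildMatrix(
  agentIds: string[],
  models: string[],
  message: string,
): ChatRequest[] {
  const requests: ChatRequest[] = [];
  for (const agentId of agentIds) {
    for (const model of models) {
      // Same agent (prompt and tools), different model on each run.
      requests.push({ agentId, message, context: { model } });
    }
  }
  return requests;
}

const runs = buildMatrix(
  ["sales", "marketing", "cx", "coding", "scrape"],
  ["model-a", "model-b"], // placeholder identifiers
  "Run the task.",
);
console.log(runs.length); // 10 (5 agents × 2 models)
```

With all 16 models in the list, the same loop yields the 80 runs reported on this page.
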

02

Strict JSON output.

Every task ends with "return ONLY a JSON object conforming to this schema." Rubrics parse the final message, validate field types, ranges, array lengths, and required values. No regex on free-form text.
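
To make that concrete, here is a minimal sketch of a rubric for the Reddit task. It parses the entire final message as JSON, then checks types, ranges, and array length. The schema (`tools`, `name`, `mentions`, `reason`) and the function names are assumptions for illustration. The real schemas live in the open-source runner.

```typescript
type RankedTool = { name: string; mentions: number; reason: string };

function isRankedTool(value: unknown): value is RankedTool {
  if (typeof value !== "object" || value === null) return false;
  const { name, mentions, reason } = value as Record<string, unknown>;
  return (
    typeof name === "string" && name.trim().length > 0 &&
    typeof mentions === "number" && Number.isInteger(mentions) && mentions >= 1 &&
    typeof reason === "string" && reason.trim().length > 0
  );
}

function gradeFinalMessage(finalMessage: string): { pass: boolean; reason: string } {
  let parsed: unknown;
  try {
    // The whole message must be JSON; surrounding prose makes this throw.
    parsed = JSON.parse(finalMessage);
  } catch {
    return { pass: false, reason: "wrong format: not valid JSON" };
  }
  if (typeof parsed !== "object" || parsed === null) {
    return { pass: false, reason: "wrong format: not a JSON object" };
  }
  const tools = (parsed as Record<string, unknown>).tools;
  if (!Array.isArray(tools)) {
    return { pass: false, reason: "wrong format: missing tools array" };
  }
  if (tools.length !== 5) {
    return { pass: false, reason: `wrong value: expected 5 tools, got ${tools.length}` };
  }
  if (!tools.every(isRankedTool)) {
    return { pass: false, reason: "wrong format: invalid tool entry" };
  }
  return { pass: true, reason: "ok" };
}
```

A response that wraps valid JSON in a sentence of explanation fails at the parse step, which is the point of asking for JSON only.
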

03

Costs straight from the bill.

We don't estimate. Cotera tracks per-run cost with an LLM/tool-call breakdown in chatStatus.totals — the dollar figures here are the same per-run charges customers see in their dashboard (1 credit ≈ $0.005).
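
The credit-to-dollar conversion is simple arithmetic. The sketch below assumes a totals object split into LLM and tool-call credits. The field names `llmCredits` and `toolCredits` are hypothetical, since the actual shape of `chatStatus.totals` isn't shown here, and the credit counts are made up for the example.

```typescript
const USD_PER_CREDIT = 0.005; // approximate rate quoted on this page

// Hypothetical shape for per-run totals; field names are assumptions.
type RunTotals = { llmCredits: number; toolCredits: number };

function runCostUsd(totals: RunTotals): number {
  return (totals.llmCredits + totals.toolCredits) * USD_PER_CREDIT;
}

// Illustrative numbers only: 8 LLM credits plus 2 tool-call credits.
console.log(runCostUsd({ llmCredits: 8, toolCredits: 2 }).toFixed(3)); // "0.050"
```
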

04

Failures classified, not buried.

Every ✗ is tagged: wrong value, wrong format, truncated output, never-used-tools, infra error. The full per-cell breakdown — including the actual final messages — lives in REPORT.md alongside the run.
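
One way to model those tags is a closed union type, so every failed cell must carry exactly one of the five labels. The `Cell` shape and the `tallyFailures` helper below are illustrative assumptions, not the structure of REPORT.md.

```typescript
// The five tags listed above, as a closed set.
type FailureTag =
  | "wrong value"
  | "wrong format"
  | "truncated output"
  | "never-used-tools"
  | "infra error";

// Hypothetical per-cell record: one model on one task.
type Cell = { model: string; task: string; pass: boolean; tag?: FailureTag };

// Counts failed cells per tag; passing cells are skipped.
function tallyFailures(cells: Cell[]): Map<FailureTag, number> {
  const counts = new Map<FailureTag, number>();
  for (const cell of cells) {
    if (cell.pass || cell.tag === undefined) continue;
    counts.set(cell.tag, (counts.get(cell.tag) ?? 0) + 1);
  }
  return counts;
}
```

Tallying by tag makes recurring failure patterns visible across the matrix rather than one cell at a time.
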

The runner, the agents, and every result file are open source.

Audit it. Add your own benchmark. Re-run the matrix with models we missed.

Read the source

Run your own agents.
Pick the model that ships.

Cotera agents work with the same tools, the same data, and any of the 16 models on this page. Start with a free org, swap models with a config change, and only pay for what your agents actually do.