Last run · 2026-07-09 · 120 agent runs

Real benchmarks
for real people.

The popular benchmarks don't tell you how AI does real work. Models get tuned to score well on them. So we gave 24 models five real jobs to do. Same tools, same prompts. We graded how each one did.

24

models tested

5

real agents

120

agent runs

$0.25

avg cost per run

Why this is different

Most leaderboards measure trivia.
We measured work.

Real agents.

Salespeople researching their prospects. Marketers reading what customers are actually saying. A CX team mining product reviews. An engineer writing the Stripe webhook they keep meaning to ship. Real jobs people do every day.

Real tools.

We gave the models Google search, the Crunchbase API, Google Shopping reviews, and a real site scraper. These are the same tools our customers use every day across different verticals.

Real grading.

Each task gets graded on a rubric. The rubric checks the output for accuracy and for the format it comes back in. Pass or fail, explicitly. We also look at how much each run cost and how long it took.

The five jobs

Meet the agents.

We ran a Cotera agent for each task. Same tools, same prompt. Everything besides the actual LLM was kept consistent across runs.

Sales

Get financials on a company

Find Hightouch on Crunchbase and return their total funding, last round type, and a one-line description.

Agent

Sales Research Analyst

Tools available

Google Search, Site Scraper, Crunchbase

How it's graded

Did it find the right funding amount?
Did it get the right round (Series E)?
Did it return a link to real Crunchbase data, not made up from memory?

Marketing

Find what customers are recommending on Reddit

Read a real r/sales thread on enrichment tools and rank the top 5 by how many people recommended them.

Agent

Marketing Research Analyst

Tools available

Google Search, Reddit

How it's graded

Did it actually open a real Reddit thread?
Did it return exactly 5 tools, not 4 or 7?
Did each tool come with a real mention count and a real reason?

CX

Find what customers are complaining about

Pull Google Shopping reviews for AirPods Pro 2 (USB-C) and return the top 3 recurring complaints, with verbatim quotes.

Agent

Customer Insights Analyst

Tools available

Google Search, Google Shopping

How it's graded

Did it identify the actual AirPods product?
Did it return exactly 3 complaints?
Are the quotes pulled verbatim from real reviews, not paraphrased from memory?

Coding

Read API docs and write working code

Read Stripe's official docs and write a real, working webhook-verification function in TypeScript.

Agent

Senior Engineer Assistant

Tools available

Google Search, Site Scraper

How it's graded

Did it actually read Stripe's docs?
Did it write a real, working function, not just describe one in prose?
Does the function use the right security primitives so it would actually run?
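
For a sense of what the rubric is looking for, here is a minimal sketch of the verification scheme Stripe documents: the `Stripe-Signature` header carries a timestamp (`t=`) and one or more `v1=` signatures, and each signature is an HMAC-SHA256 of `timestamp.payload` keyed with the endpoint secret. The function names and the `nowSeconds` parameter are our own illustration, not any model's graded output. In production you would more likely call `stripe.webhooks.constructEvent` from Stripe's official library.

```typescript
import { createHmac, timingSafeEqual } from "node:crypto";

// Split "t=1700000000,v1=abc,v1=def" into the raw timestamp and all v1 signatures.
function parseSignatureHeader(header: string): { timestamp: string; signatures: string[] } {
  let timestamp = "";
  const signatures: string[] = [];
  for (const part of header.split(",")) {
    const idx = part.indexOf("=");
    if (idx === -1) continue;
    const key = part.slice(0, idx).trim();
    const value = part.slice(idx + 1).trim();
    if (key === "t") timestamp = value;
    if (key === "v1") signatures.push(value);
  }
  return { timestamp, signatures };
}

// Returns true only if some v1 signature equals HMAC-SHA256(secret, `${t}.${rawBody}`)
// and the timestamp is inside the tolerance window (guards against replays).
// rawBody must be the unparsed request body, exactly as received.
function verifyStripeSignature(
  rawBody: string,
  signatureHeader: string,
  endpointSecret: string,
  toleranceSeconds: number = 300,
  nowSeconds: number = Math.floor(Date.now() / 1000), // injectable for testing (our assumption)
): boolean {
  const { timestamp, signatures } = parseSignatureHeader(signatureHeader);
  const ts = Number(timestamp);
  if (timestamp === "" || !Number.isFinite(ts) || signatures.length === 0) return false;
  if (Math.abs(nowSeconds - ts) > toleranceSeconds) return false;

  const expected = createHmac("sha256", endpointSecret)
    .update(`${timestamp}.${rawBody}`, "utf8")
    .digest();

  return signatures.some((sig) => {
    const candidate = Buffer.from(sig, "hex");
    // timingSafeEqual throws on length mismatch, so check length first.
    return candidate.length === expected.length && timingSafeEqual(candidate, expected);
  });
}
```

The "right security primitives" here are the HMAC over the raw body, the constant-time comparison, and the timestamp tolerance. A function that compares strings with `===` or signs the parsed JSON instead of the raw body would look similar but fail in practice.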

Web Scraping

Scrape a competitor's pricing page

Scrape Apollo.io's pricing page and return every tier (Free, Basic, Professional, Organization) with name, price, and top 3 features.

Agent

Sales Outreach Specialist

Tools available

Google Search, Site Scraper

How it's graded

Did it return all of Apollo's pricing tiers, in price order?
Does each tier have a real price, or 'Contact sales' for the enterprise tier?
Did it return three actual features per tier?

The scoreboard

24 models, ranked by what they
actually delivered.

Sorted by pass count, ties broken by total dollars spent. Hover any ✗ to see why it failed.

Rank	Model	Pass	Total	Sales	Marketing	CX	Coding	Scrape
🥇1	Mistral Large 3	5/5	$0.267	✓ $0.058	✓ $0.048	✓ $0.066	✓ $0.071	✓ $0.025
🥈2	GPT-5.6 Luna	5/5	$0.296	✓ $0.089	✓ $0.080	✓ $0.060	✓ $0.046	✓ $0.020
🥉3	GPT-5 Mini	5/5	$0.376	✓ $0.146	✓ $0.056	✓ $0.108	✓ $0.037	✓ $0.029
4	GPT-5.6 Terra	5/5	$0.392	✓ $0.022	✓ $0.089	✓ $0.204	✓ $0.050	✓ $0.027
5	Nemotron 3 Super 120B	5/5	$0.610	✓ $0.045	✓ $0.050	✓ $0.329	✓ $0.134	✓ $0.052
6	Gemini 3 Flash	5/5	$0.758	✓ $0.032	✓ $0.217	✓ $0.339	✓ $0.092	✓ $0.077
7	GPT-5.5	5/5	$1.21	✓ $0.104	✓ $0.435	✓ $0.348	✓ $0.266	✓ $0.053
8	Gemini 3.5 Flash	5/5	$1.41	✓ $0.148	✓ $0.118	✓ $0.623	✓ $0.486	✓ $0.035
9	Claude Sonnet 5	5/5	$1.44	✓ $0.163	✓ $0.424	✓ $0.373	✓ $0.347	✓ $0.132
10	Gemini 3.1 Pro	5/5	$2.73	✓ $0.086	✓ $0.665	✓ $1.48	✓ $0.388	✓ $0.107
11	Claude Sonnet 4.6	5/5	$3.83	✓ $0.100	✓ $1.82	✓ $0.723	✓ $1.11	✓ $0.084
12	Claude Opus 4.8	5/5	$4.58	✓ $1.22	✓ $1.74	✓ $0.998	✓ $0.439	✓ $0.185
13	GPT-5.6 Sol	4/5	$0.697	✓ $0.036	✓ $0.252	✗ $0.239	✓ $0.123	✓ $0.048
14	Claude Haiku 4.5	4/5	$0.941	✗ $0.007	✓ $0.125	✓ $0.119	✓ $0.632	✓ $0.058
15	DeepSeek V3.1	3/5	$0.113	✗ $0.001	✓ $0.045	✓ $0.053	✗ $0.001	✓ $0.014
16	GLM 5.2	3/5	$0.521	✓ $0.030	✗ $0.119	✓ $0.113	✗ $0.201	✓ $0.058
17	MiniMax M3	3/5	$1.10	✓ $0.254	✓ $0.200	✗ $0.092	✗ $0.508	✓ $0.044
18	GLM 5.1	3/5	$1.19	✓ $0.048	✗ $0.000	✓ $0.420	✗ $0.667	✓ $0.052
19	Kimi K2.6	3/5	$1.27	✓ $0.104	✗ $0.423	✓ $0.333	✗ $0.268	✓ $0.144
20	Llama 4 Maverick	2/5	$0.004	✗ $0.001	✗ $0.001	✗ $0.001	✓ $0.001	✓ $0.001
21	Grok 4.1 Fast	1/5	$0.089	✓ $0.035	✗ $0.052	✗ $0.000	✗ $0.000	✗ $0.002
22	Nemotron 3 Ultra 550B	1/5	$2.46	✗ $0.012	✗ $1.73	✗ $0.148	✓ $0.399	✗ $0.169
23	Nemotron 3 Nano 30B	0/5	$0.411	✗ $0.032	✗ $0.084	✗ $0.202	✗ $0.033	✗ $0.062
24	Qwen 3.6 Plus	0/5	$2.98	✗ $1.91	✗ $0.399	✗ $0.229	✗ $0.251	✗ $0.188

Lower is better. A ✓ at $0.00 means the model emitted an answer from training memory without calling tools — we treat that as a pass on the rubric, but flag it as a meaningful behavior in the failure breakdown.
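
The ordering rule stated above the table (pass count first, total spend as the tie-breaker) can be sketched in a few lines. The `ModelResult` shape and `rankModels` name are our own illustration, not the runner's actual types. The sample rows are copied from the scoreboard.

```typescript
// Hypothetical row shape: one entry per model in the scoreboard.
interface ModelResult {
  model: string;
  passes: number;   // tasks passed, 0 to 5
  totalUsd: number; // total spend across the five runs
}

// More passes ranks higher; among equal pass counts, cheaper ranks higher.
function rankModels(results: ModelResult[]): ModelResult[] {
  return [...results].sort(
    (a, b) => b.passes - a.passes || a.totalUsd - b.totalUsd,
  );
}

const sample: ModelResult[] = [
  { model: "DeepSeek V3.1", passes: 3, totalUsd: 0.113 },
  { model: "GPT-5.6 Luna", passes: 5, totalUsd: 0.296 },
  { model: "GPT-5.6 Sol", passes: 4, totalUsd: 0.697 },
  { model: "Mistral Large 3", passes: 5, totalUsd: 0.267 },
];

const ranked = rankModels(sample).map((r) => r.model);
// ranked: Mistral Large 3, GPT-5.6 Luna, GPT-5.6 Sol, DeepSeek V3.1
```

This is why DeepSeek V3.1, the cheapest of these four, still lands last: cost only matters among models with the same pass count.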

Where they broke

The failures are the
interesting part.

Every benchmark suite advertises the wins. We're showing the failures. Here's where each model usually breaks down. The pattern, not the one-off.

Llama 4 Maverick

✗ Hallucinated

Sales · Get financials on a company

This model skips the tools and answers from its training data instead (which is badly out of date). This makes it very confident, and very wrong.

Kimi K2.6

✗ Stuck in a loop

Marketing · Find what customers are recommending on Reddit

This model starts writing the answer, but gets stuck repeating the same fragment over and over (until the run has to be killed). It looks busy, but it never actually finishes.

Kimi K2.6

✗ Confident but empty

Coding · Read API docs and write working code

This model writes a confident explanation of what the function should do, but then forgets to actually write the function (so the code field comes back empty). It looks correct, until you try to run it.

MiniMax M3

✗ Thought, never wrote

Coding · Read API docs and write working code

This model keeps thinking and figuring out the code in a private scratchpad, but it never actually responds with anything real (you can see it working, you just don't get any output).

GLM 5.1

✗ Couldn't format the output

Coding · Read API docs and write working code

This model doesn't have reliable structured output, so the answer comes back in the wrong shape (which the agent can't read). It tried to answer, but the format is broken.

Qwen 3.6 Plus

✗ Spent a lot, returned nothing

Sales · Get financials on a company

This model uses a ton of tools and burns a ton of compute, but it never actually returns a usable answer (which makes it the most expensive way to do nothing in the matrix).

Claude Haiku 4.5

✗ Stopped before starting

Sales · Get financials on a company

This model tells you it's going to do the work, but then just doesn't (the whole response is a polite sentence of intent, with no tools actually called).

How we ran it

No vibes. Show your work.

One agent, every model.

For each task we built a Cotera agent — a system prompt, a tool list, and a default model. To benchmark, we kept the agent and overrode the model via context.model on the chat endpoint. Tools and prompts stay identical.
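
As a rough sketch of what "same agent, different model" means in practice, the run matrix is every agent paired with every model, with only `context.model` changing between requests. Everything here except `context.model` is an assumption for illustration: the `agentId` field, the `ChatRequest` shape, and the placeholder agent names are hypothetical, and no real endpoint URL is shown.

```typescript
// Hypothetical request body. Only context.model comes from the description above.
interface ChatRequest {
  agentId: string;            // hypothetical field name
  context: { model: string }; // per-run model override
}

// Pair every agent with every model. Prompts and tools live on the agent,
// so they stay identical across models by construction.
function buildRunMatrix(agentIds: string[], models: string[]): ChatRequest[] {
  const runs: ChatRequest[] = [];
  for (const agentId of agentIds) {
    for (const model of models) {
      runs.push({ agentId, context: { model } });
    }
  }
  return runs;
}

// Placeholder identifiers, not the real agent or model IDs.
const agentIds = ["sales", "marketing", "cx", "coding", "scrape"];
const models = Array.from({ length: 24 }, (_, i) => `model-${i + 1}`);
const matrix = buildRunMatrix(agentIds, models);
```

Five agents times 24 models gives the 120 runs reported at the top of the page.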

Strict JSON output.

Every task ends with "return ONLY a JSON object conforming to this schema." Rubrics parse the final message, validate field types, ranges, array lengths, and required values. No regex on free-form text.
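
Here is a minimal sketch of that kind of check, using the Reddit task as the example. The schema (`thread_url`, `tools`, `name`, `mentions`, `reason`) is hypothetical, since the real schemas aren't reproduced on this page, but the shape of the check follows the description above: parse, then validate types, ranges, and array lengths.

```typescript
interface RubricResult {
  pass: boolean;
  failures: string[];
}

// Grade a final message against a hypothetical schema for the Reddit task:
// { thread_url: string, tools: [{ name, mentions, reason }] } with exactly 5 tools.
function gradeRedditAnswer(finalMessage: string): RubricResult {
  let parsed: unknown;
  try {
    parsed = JSON.parse(finalMessage);
  } catch {
    return { pass: false, failures: ["final message is not valid JSON"] };
  }
  if (typeof parsed !== "object" || parsed === null || Array.isArray(parsed)) {
    return { pass: false, failures: ["top-level value is not a JSON object"] };
  }
  const obj = parsed as Record<string, unknown>;
  const failures: string[] = [];

  // Required value: a link to a Reddit thread.
  const threadUrl = obj["thread_url"];
  if (typeof threadUrl !== "string" || !threadUrl.startsWith("https://www.reddit.com/")) {
    failures.push("thread_url must be a reddit.com link");
  }

  // Array length: exactly 5 tools, not 4 or 7.
  const tools = obj["tools"];
  if (!Array.isArray(tools) || tools.length !== 5) {
    failures.push("tools must be an array of exactly 5 entries");
  } else {
    tools.forEach((tool: unknown, i: number) => {
      const t = (typeof tool === "object" && tool !== null ? tool : {}) as Record<string, unknown>;
      const name = t["name"];
      const mentions = t["mentions"];
      const reason = t["reason"];
      if (typeof name !== "string" || name.trim().length === 0) {
        failures.push(`tools[${i}].name must be a non-empty string`);
      }
      // Range check: a mention count is a positive integer.
      if (typeof mentions !== "number" || !Number.isInteger(mentions) || mentions < 1) {
        failures.push(`tools[${i}].mentions must be a positive integer`);
      }
      if (typeof reason !== "string" || reason.trim().length === 0) {
        failures.push(`tools[${i}].reason must be a non-empty string`);
      }
    });
  }

  return { pass: failures.length === 0, failures };
}
```

Returning the list of failures alongside the pass/fail flag is a design choice in this sketch, so that each ✗ can be explained rather than just counted.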

Costs straight from the bill.

We don't estimate. Cotera tracks per-run cost with an LLM/tool-call breakdown in chatStatus.totals — the dollar figures here are the same per-run charges customers see in their dashboard (1 credit ≈ $0.005).
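
The arithmetic behind the dollar figures is simple enough to sketch. The conversion rate is the approximate one quoted above, and the function names are ours, not Cotera's.

```typescript
// Approximate rate quoted above: 1 credit is about $0.005.
const USD_PER_CREDIT = 0.005;

function creditsToUsd(credits: number): number {
  return credits * USD_PER_CREDIT;
}

// Average cost per run, given total spend and the number of runs.
function averageCostPerRun(totalUsd: number, runs: number): number {
  if (runs <= 0) throw new Error("runs must be positive");
  return totalUsd / runs;
}
```

As a sanity check on the headline number: the 24 Total values in the scoreboard sum to about $29.7, and dividing by 120 runs gives roughly $0.25 per run, which at this rate is about 50 credits.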

Failures classified, not buried.

Every ✗ is tagged: wrong value, wrong format, truncated output, never-used-tools, infra error. The full per-cell breakdown — including the actual final messages — lives in REPORT.md alongside the run.
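
The five tags listed above map naturally onto a closed set of values. This is a sketch of how such a taxonomy might be typed and tallied. The type and function names are hypothetical, and the actual REPORT.md format isn't shown here.

```typescript
// The five failure tags named above, as a closed union.
type FailureTag =
  | "wrong value"
  | "wrong format"
  | "truncated output"
  | "never-used-tools"
  | "infra error";

// Count how often each tag occurs across a set of failed runs.
function tallyFailures(tags: FailureTag[]): Record<FailureTag, number> {
  const counts: Record<FailureTag, number> = {
    "wrong value": 0,
    "wrong format": 0,
    "truncated output": 0,
    "never-used-tools": 0,
    "infra error": 0,
  };
  for (const tag of tags) counts[tag] += 1;
  return counts;
}
```

Using a union rather than free-form strings means a mistyped tag is a compile error instead of a silently separate bucket.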

The runner, the agents, and every result file are open source.

Audit it. Add your own benchmark. Re-run the matrix with models we missed.

Read the source

Run your own agents.
Pick the model that ships.

Cotera agents work with the same tools, the same data, and any of the 24 models on this page. Start with a free org, swap models with a config change, and only pay for what your agents actually do.

Try Cotera free Talk to us
