anthropic

Claude Opus 4.8

5/5 at $4.58. The most expensive perfect score, and the only one we actively don't recommend.

5/5 benchmarks passed

$4.58 spent in total

31 tool calls (sum of the five runs below)

What it is.

Anthropic's Opus 4.8 — the top of the Claude 4.X family, frontier reasoning at frontier prices. Marketed for agent work; we ran it on agent work and it got every answer right.

What it does well.

5/5 pass with the deepest verification chains of any model. Opus made 17 tool calls on the Crunchbase brief — the most thorough Sales run in the matrix.
Best multi-source corroboration. Opus consistently cross-checked Crunchbase data against the company's own marketing page before returning a number.
Zero JSON drama. If anything, Opus's explanation fields are the most carefully written prose in the matrix.

Where to be careful.

Crunchbase was $1.22 for one run. Reddit was $1.74. CX was $1.00. Per-task costs like these would be a rounding error on a one-off customer migration, but they are unsustainable for the default model in an agent loop.
Total spend was $4.58 — 17x Mistral, 12x GPT-5 Mini. We saw no quality lift on this rubric that would justify the multiple.
Opus's verification instinct is great for high-stakes singletons (one-off research, regulated work). Don't put it on a job queue.
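The cost arithmetic above can be checked with a short sketch. The per-task figures are the ones reported in this review. The "implied" baseline totals are an assumption: they are back-calculated from the rounded 17x and 12x multiples, not measured spend for Mistral or GPT-5 Mini, so treat them as approximate.

```python
# Per-task costs in USD, as reported in this review.
costs = {
    "Sales (Crunchbase)": 1.22,
    "Marketing (Reddit)": 1.74,
    "CX (Google Shopping)": 1.00,
    "Coding (Stripe docs)": 0.439,
    "Web Scraping (Apollo.io)": 0.185,
}

total = sum(costs.values())

# Share of the bill taken by the three most expensive tasks.
top_three = sorted(costs.values(), reverse=True)[:3]
top_three_share = sum(top_three) / total

# Assumption: baselines are derived from the rounded multiples quoted above,
# so they are estimates, not figures taken from the other models' runs.
implied_mistral_total = total / 17
implied_gpt5_mini_total = total / 12

print(f"Total spend: ${total:.2f}")
print(f"Top three tasks: {top_three_share:.0%} of total")
print(f"Implied Mistral total: ~${implied_mistral_total:.2f}")
print(f"Implied GPT-5 Mini total: ~${implied_gpt5_mini_total:.2f}")
```

The three research-style tasks account for most of the bill, which is why the warning targets job queues rather than occasional one-off runs.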

Results by agent

Five real jobs.

Sales: Sales Research Analyst

Get financials on a company

Passed

Find Hightouch on Crunchbase and return their total funding, last round type, and a one-line description.

17 tool calls, $1.22. Opus verified Hightouch's funding against Crunchbase, the company blog, and a press release before committing. Correct answer at a steep price.

Marketing: Marketing Research Analyst

Find what customers are recommending on Reddit

Passed

Read a real r/sales thread on enrichment tools and rank the top 5 by how many people recommended them.

5 tool calls, $1.74. Long-context reasoning dominates the bill; retrieval is normal.

CX: Customer Insights Analyst

Find what customers are complaining about

Passed

Pull Google Shopping reviews for AirPods Pro 2 (USB-C) and return the top 3 recurring complaints, with verbatim quotes.

4 tool calls, $1.00. The most expensive correct answer on our cheapest benchmark.

Coding: Senior Engineer Assistant

Read API docs and write working code

Passed

Read Stripe's official docs and write a real, working webhook-verification function in TypeScript.

3 tool calls, $0.439. The one benchmark where Opus actually behaved economically.

Web Scraping: Sales Outreach Specialist

Scrape a competitor's pricing page

Passed

Scrape Apollo.io's pricing page and return every tier (Free, Basic, Professional, Organization) with name, price, and top 3 features.

2 tool calls, $0.185. Sensible, but still 7x what Mistral spent for the same output.
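One way to read the five results together is cost per tool call. The sketch below uses only the tool-call counts and costs listed for each job above. Cost per call is a metric introduced here for illustration, not one the benchmark itself reports.

```python
# (job, tool calls, cost in USD) as listed for each run above.
runs = [
    ("Sales", 17, 1.22),
    ("Marketing", 5, 1.74),
    ("CX", 4, 1.00),
    ("Coding", 3, 0.439),
    ("Web Scraping", 2, 0.185),
]

# Illustrative metric (an assumption of this sketch): average spend per tool call.
cost_per_call = {name: cost / calls for name, calls, cost in runs}

total_calls = sum(calls for _, calls, _ in runs)
priciest_per_call = max(cost_per_call, key=cost_per_call.get)
cheapest_per_call = min(cost_per_call, key=cost_per_call.get)

for name, value in sorted(cost_per_call.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: ${value:.3f} per tool call")
print(f"Total tool calls: {total_calls}")
```

By this measure the Marketing run is the priciest per call despite making only five calls, which fits the note that long-context reasoning, not retrieval, dominated that bill. The 17-call Sales run is the cheapest per call.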
