Benchmarks

anthropic

Claude Sonnet 4.6

5/5 every time, billed like a senior engineer. Worth it when correctness is non-negotiable; brutal at scale.

5/5 benchmarks passed
$3.83 spent in total
27 tool calls

What it is.

Anthropic's Sonnet 4.6 is the mid-tier model in the Claude 4.X family, sitting between Haiku and Opus. It is the Anthropic crowd's default daily driver for production agent work.

What it does well.

  • 5/5 with the most thorough per-task reasoning of the mid-tier models. Output JSON is consistently the most explicit (full field names, no abbreviations).
  • Cleanest Coding answer in the matrix. It wrote the Stripe webhook verifier with HMAC and a timing-safe compare; the explanation field came in at 3 sentences and the code field passed every rubric check.
  • Zero schema drift across the run. No prose preambles, no envelope leakage, no half-finished JSON.
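What "zero schema drift" means in practice can be sketched as a strict parse of the raw model output. The checker below is a hypothetical illustration, not the benchmark's actual harness: the function name `checkStrictJson` and the `requiredKeys` parameter are assumptions, and the `code` / `explanation` keys in the usage note are borrowed from the Coding task described above.

```typescript
// Hypothetical drift check (an assumption, not the benchmark's real harness):
// the raw output must be exactly one JSON object carrying the expected keys.
type DriftResult =
  | { ok: true; value: Record<string, unknown> }
  | { ok: false; reason: string };

function checkStrictJson(raw: string, requiredKeys: string[]): DriftResult {
  const trimmed = raw.trim();
  // A prose preamble or a markdown fence means the text does not start with "{".
  if (!trimmed.startsWith("{")) {
    return { ok: false, reason: "output does not start with a JSON object" };
  }
  let parsed: unknown;
  try {
    // JSON.parse rejects half-finished objects and any trailing text.
    parsed = JSON.parse(trimmed);
  } catch {
    return { ok: false, reason: "output is not valid JSON" };
  }
  if (typeof parsed !== "object" || parsed === null || Array.isArray(parsed)) {
    return { ok: false, reason: "top-level value is not an object" };
  }
  const value = parsed as Record<string, unknown>;
  // An extra wrapper object ("envelope") shows up as missing top-level keys.
  const missing = requiredKeys.filter((key) => !(key in value));
  if (missing.length > 0) {
    return { ok: false, reason: `missing keys: ${missing.join(", ")}` };
  }
  return { ok: true, value };
}
```

Under these assumptions, `checkStrictJson(output, ["code", "explanation"])` would reject an answer that begins with "Here is the JSON:" or one that nests the fields under an extra wrapper key.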

Where to be careful.

  • Marketing cost $1.82 on a single benchmark, 38x Mistral's spend for the same output. Long-context Reddit threads are death by token at Sonnet's rates.
  • Total spend was $3.83, the second-most-expensive 5/5 in the matrix. Unless you've measured a real quality lift on your workload, every Sonnet pass here is a more expensive version of a Mistral pass.
  • The reasoning tax shows up on retrieval-heavy tasks. CX cost $0.723 vs Mistral's $0.066 (roughly 11x) for the same correct answer.
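The cost claims above reduce to simple arithmetic on the figures reported on this page. A quick sketch, using only the per-benchmark costs and tool-call counts from the results section; the implied Mistral Marketing spend is back-derived from the 38x ratio and is not a reported number.

```typescript
// Per-benchmark figures as reported in the results on this page.
const runs = [
  { task: "Sales", costUsd: 0.1, toolCalls: 6 },
  { task: "Marketing", costUsd: 1.82, toolCalls: 5 },
  { task: "CX", costUsd: 0.723, toolCalls: 4 },
  { task: "Coding", costUsd: 1.11, toolCalls: 10 },
  { task: "Web Scraping", costUsd: 0.084, toolCalls: 2 },
];

// Totals across the five runs.
const totalCostUsd = runs.reduce((sum, r) => sum + r.costUsd, 0);
const totalToolCalls = runs.reduce((sum, r) => sum + r.toolCalls, 0);

// Sonnet vs Mistral on CX, both figures quoted in the text.
const cxRatio = 0.723 / 0.066;

// Derived, not reported: Mistral's Marketing spend implied by the 38x ratio.
const impliedMistralMarketingUsd = 1.82 / 38;

console.log(totalCostUsd.toFixed(3)); // 3.837
console.log(totalToolCalls); // 27
console.log(cxRatio.toFixed(1)); // 11.0
console.log(impliedMistralMarketingUsd.toFixed(3)); // 0.048
```

The per-task figures sum to about $3.837, which agrees with the $3.83 headline to within the rounding of the individual costs, and the tool calls sum to exactly 27.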

Results by agent

Five real jobs.

Sales: Sales Research Analyst

Get financials on a company

Passed

Find Hightouch on Crunchbase and return their total funding, last round type, and a one-line description.

6 tool calls, $0.100. Standard Sonnet shape: clean retrieval, careful verification, no detours.

$0.100 · 6 tool calls
Marketing: Marketing Research Analyst

Find what customers are recommending on Reddit

Passed

Read a real r/sales thread on enrichment tools and rank the top 5 by how many people recommended them.

5 tool calls, $1.82. The most expensive single passing run in the matrix; the cost comes from reasoning plus long-context input.

$1.82 · 5 tool calls
CX
Passed

Pull Google Shopping reviews for AirPods Pro 2 (USB-C) and return the top 3 recurring complaints, with verbatim quotes.

4 tool calls, $0.723. Pulled reviews, grouped by recurrence, returned three verbatim complaints with quotes ≥10 chars.
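The rubric hinted at here (three complaints, verbatim quotes of at least 10 characters) could be checked mechanically. The sketch below is an assumption about what such a check might look like, not the benchmark's real grader: the `Complaint` shape and `checkComplaints` name are hypothetical.

```typescript
// Hypothetical output shape for the task; field names are assumptions.
interface Complaint {
  complaint: string;
  quotes: string[];
}

// Returns a list of rubric problems; an empty list means the answer passes.
function checkComplaints(complaints: Complaint[], reviews: string[]): string[] {
  const problems: string[] = [];
  if (complaints.length !== 3) {
    problems.push(`expected 3 complaints, got ${complaints.length}`);
  }
  for (const c of complaints) {
    if (c.quotes.length === 0) {
      problems.push(`no quotes for: ${c.complaint}`);
    }
    for (const quote of c.quotes) {
      if (quote.length < 10) {
        problems.push(`quote shorter than 10 chars: ${quote}`);
      } else if (!reviews.some((review) => review.includes(quote))) {
        // "Verbatim" is taken to mean an exact substring of some source review.
        problems.push(`quote not found verbatim: ${quote}`);
      }
    }
  }
  return problems;
}
```

Treating "verbatim" as an exact substring match is the strictest reading; a real grader might normalize whitespace or quotation marks first.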

$0.723 · 4 tool calls
Coding: Senior Engineer Assistant

Read API docs and write working code

Passed

Read Stripe's official docs and write a real, working webhook-verification function in TypeScript.

10 tool calls, $1.11. The most thorough Stripe-docs read in the matrix; resulting code is also the most explicit.
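For readers unfamiliar with the task, here is a minimal sketch of the kind of function being asked for. It is not Sonnet's actual output, which this page does not reproduce. It assumes Node's built-in `crypto` module and Stripe's documented scheme: a `Stripe-Signature` header carrying `t=<timestamp>` and one or more `v1=<hex>` entries, where the signed payload is `<timestamp>.<raw body>` under HMAC-SHA256. The function name, parameter names, and the 300-second tolerance default are illustrative choices.

```typescript
import { createHmac, timingSafeEqual } from "node:crypto";

// Illustrative sketch, not the benchmarked answer. `nowSec` is injectable so
// the timestamp check can be tested deterministically.
function verifyStripeSignature(
  rawBody: string,
  header: string,
  secret: string,
  toleranceSec = 300,
  nowSec = Math.floor(Date.now() / 1000),
): boolean {
  let timestamp: string | undefined;
  const candidates: string[] = [];
  // Header looks like "t=1700000000,v1=abc...,v1=def...".
  for (const part of header.split(",")) {
    const idx = part.indexOf("=");
    if (idx === -1) continue;
    const key = part.slice(0, idx).trim();
    const value = part.slice(idx + 1).trim();
    if (key === "t") timestamp = value;
    else if (key === "v1") candidates.push(value);
  }
  if (timestamp === undefined || candidates.length === 0) return false;

  // Reject stale or malformed timestamps to limit replay attacks.
  const ts = Number(timestamp);
  if (!Number.isInteger(ts) || Math.abs(nowSec - ts) > toleranceSec) return false;

  const expected = createHmac("sha256", secret)
    .update(`${timestamp}.${rawBody}`, "utf8")
    .digest();

  // timingSafeEqual throws on length mismatch, so compare lengths first.
  return candidates.some((sig) => {
    const given = Buffer.from(sig, "hex");
    return given.length === expected.length && timingSafeEqual(given, expected);
  });
}
```

The body must be the raw, unparsed request payload; re-serializing parsed JSON will generally change the bytes and break the signature. In production, Stripe's official library exposes `stripe.webhooks.constructEvent`, which performs this verification and is usually the safer choice.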

$1.11 · 10 tool calls
Web Scraping: Sales Outreach Specialist

Scrape a competitor's pricing page

Passed

Scrape Apollo.io's pricing page and return every tier (Free, Basic, Professional, Organization) with name, price, and top 3 features.

2 tool calls, $0.084. The one task where Sonnet's bill was reasonable.

$0.084 · 2 tool calls