Benchmarks

Meta

Llama 4 Maverick

Made zero tool calls across all five benchmarks. Answered from training memory or didn't answer at all.

2/5
benchmarks passed
$0.004
spent in total
0
tool calls

What it is.

Meta's Llama 4 Maverick: a mixture-of-experts model in the Llama 4 family (17B active parameters, 128 experts), not a dense or dedicated reasoning variant. We tested it via the openweights endpoint.

What it does well.

  • Total spend was $0.004, by far the cheapest run in the matrix. (See "How it broke" for what that actually means.)
  • Coding and Web Scraping "passed" the rubric, but both are lenient passes: the answers came from memorized training data, not retrieval. See the notes on each benchmark.

How it broke.

  • Zero tool calls on every single benchmark. Maverick refused to engage with the agent loop. On Sales, Marketing, and CX, it just wrote prose answers from training memory — wrong shape, wrong format, instant rubric fail.
  • The Coding and Web Scraping "passes" are an artifact of two rubric quirks: the Stripe code lives in training data verbatim, and the Apollo pricing answer happened to match the rubric's structural check from memory. Neither was retrieved.
  • This is the failure mode you most need to catch in evals: a model that always answers cheaply and confidently and never verifies anything. In a real customer-facing agent it would hallucinate funding numbers, product features, and prices — silently.
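
One way to catch this in an eval harness is to treat a rubric pass with zero tool calls as a separate verdict instead of a clean pass. The sketch below is illustrative only: `RunResult`, `classify`, and the verdict labels are hypothetical names, not part of the harness used for these runs. The pass/fail values and tool-call counts are the ones reported on this page.

```python
from dataclasses import dataclass


@dataclass
class RunResult:
    """One benchmark run. Field names are hypothetical, not the harness's own."""
    benchmark: str
    rubric_passed: bool
    tool_calls: int
    requires_retrieval: bool = True


def classify(run: RunResult) -> str:
    """Separate verified passes from passes that never touched a tool."""
    if run.requires_retrieval and run.tool_calls == 0:
        # Nothing was looked up, so the rubric verdict cannot be trusted.
        return "unverified_pass" if run.rubric_passed else "fail_no_tools"
    return "pass" if run.rubric_passed else "fail"


# Results as reported above: two rubric passes, zero tool calls everywhere.
runs = [
    RunResult("Sales", False, 0),
    RunResult("Marketing", False, 0),
    RunResult("CX", False, 0),
    RunResult("Coding", True, 0),
    RunResult("Web Scraping", True, 0),
]
verdicts = {run.benchmark: classify(run) for run in runs}
print(verdicts)
```

Under this stricter reading, none of the five runs counts as a verified pass.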

Results by agent

Five real jobs.

Sales
Sales Research Analyst

Get financials on a company

Failed

Find Hightouch on Crunchbase and return their total funding, last round type, and a one-line description.

✗ 0 tool calls, $0.0005. Wrote a prose answer about Hightouch from training memory. No Crunchbase visit, no Google search, nothing. Schema parser rejected the prose.

$0.001, 0 tool calls
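
To show why a prose answer fails at the parser before the rubric even looks at the facts, here is a minimal sketch of a schema check. It assumes the expected output is a JSON object with three string fields; the field names and `parse_answer` are hypothetical, since the actual schema is not shown here.

```python
import json

# Assumed field names and types; the real schema may differ.
REQUIRED_FIELDS = {
    "total_funding": str,
    "last_round_type": str,
    "description": str,
}


def parse_answer(raw: str) -> dict:
    """Return the parsed answer, or raise ValueError if it is not schema-shaped."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError("answer is not JSON") from exc
    if not isinstance(data, dict):
        raise ValueError("answer is not a JSON object")
    for name, expected_type in REQUIRED_FIELDS.items():
        if not isinstance(data.get(name), expected_type):
            raise ValueError(f"missing or mistyped field: {name}")
    return data
```

A paragraph of free text fails at `json.loads`, so the answer is rejected regardless of whether the facts in it are right.
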
Marketing
Marketing Research Analyst

Find what customers are recommending on Reddit

Failed

Read a real r/sales thread on enrichment tools and rank the top 5 by how many people recommended them.

✗ 0 tool calls, $0.0005. Refused to call Reddit. Listed five enrichment tools from training memory in non-rubric format.

$0.001, 0 tool calls
CX

Failed

Pull Google Shopping reviews for AirPods Pro 2 (USB-C) and return the top 3 recurring complaints, with verbatim quotes.

✗ 0 tool calls, $0.001. No Google Shopping call, no product lookup, no real review data.

$0.001, 0 tool calls
Coding
Senior Engineer Assistant

Read API docs and write working code

Passed

Read Stripe's official docs and write a real, working webhook-verification function in TypeScript.

0 tool calls, $0.001. Wrote a verifyStripeWebhook function from training memory. The Stripe webhook pattern is well-documented; Maverick's memorized version happened to satisfy the rubric. Don't read this as a model strength — it's training-data exposure.

$0.001, 0 tool calls
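
For context on how compact this pattern is, here is a sketch of Stripe's documented signature scheme: an HMAC-SHA256 over `<timestamp>.<payload>`, compared against the `v1` value in the `Stripe-Signature` header. It is written in Python for illustration (the benchmark asked for TypeScript) and is not Maverick's output. The function name and the `now` parameter are hypothetical. In production, Stripe's official library helper (for example `stripe.Webhook.construct_event` in the Python SDK) is the safer choice.

```python
import hashlib
import hmac
import time
from typing import Optional


def verify_stripe_signature(payload: bytes, header: str, secret: str,
                            tolerance: int = 300,
                            now: Optional[float] = None) -> bool:
    """Check a Stripe-Signature header of the form 't=<unix ts>,v1=<hex hmac>'."""
    timestamp = None
    candidates = []
    for part in header.split(","):
        key, _, value = part.strip().partition("=")
        if key == "t":
            timestamp = value
        elif key == "v1":
            candidates.append(value)
    if timestamp is None or not timestamp.isdigit() or not candidates:
        return False
    # Reject stale events to limit replay attacks.
    current = time.time() if now is None else now
    if abs(current - int(timestamp)) > tolerance:
        return False
    # The signed payload is the timestamp, a dot, and the raw request body.
    signed_payload = timestamp.encode() + b"." + payload
    expected = hmac.new(secret.encode(), signed_payload, hashlib.sha256).hexdigest()
    # Constant-time comparison against every v1 signature in the header.
    return any(hmac.compare_digest(expected.encode(), c.encode()) for c in candidates)
```

The whole check fits in a couple of dozen lines and appears in many tutorials, which is consistent with a model reproducing it without opening the docs.
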
Web Scraping
Sales Outreach Specialist

Scrape a competitor's pricing page

Passed

Scrape Apollo.io's pricing page and return every tier (Free, Basic, Professional, Organization) with name, price, and top 3 features.

0 tool calls, $0.001. Same pattern — wrote Apollo's pricing tiers from training memory; the rubric's structural check passed. Apollo's actual pricing may have moved; Maverick didn't check.

$0.001, 0 tool calls
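
A shape-only rubric cannot tell a scraped answer from a remembered one. The sketch below contrasts a structural check with a variant that also requires evidence of a fetch. Everything here is an assumption for illustration: the function names, the answer layout, the list of fetched URLs, and the crude host match are hypothetical, and the placeholder values are not real Apollo prices.

```python
EXPECTED_TIERS = ["Free", "Basic", "Professional", "Organization"]


def structural_check(tiers: list) -> bool:
    """Shape-only rubric: right tier names, a price, and three features each."""
    if [t.get("name") for t in tiers] != EXPECTED_TIERS:
        return False
    return all("price" in t and len(t.get("features", [])) == 3 for t in tiers)


def grounded_check(tiers: list, fetched_urls: list,
                   required_host: str = "apollo.io") -> bool:
    """Same shape check, but it only counts if the pricing site was fetched."""
    # Crude substring match on the host; a real harness would parse the URL.
    visited = any(required_host in url for url in fetched_urls)
    return visited and structural_check(tiers)


# A well-shaped answer with placeholder values, produced with no fetches.
answer = [
    {"name": name, "price": "<price>", "features": ["<f1>", "<f2>", "<f3>"]}
    for name in EXPECTED_TIERS
]
print(structural_check(answer))       # shape alone is satisfied
print(grounded_check(answer, []))     # fails once a fetch is required
```

With zero tool calls the list of fetched URLs is empty, so the grounded variant would have failed this run.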