Llama 4 Maverick

Made zero tool calls across all five benchmarks. Answered from training memory or didn't answer at all.

2/5 benchmarks passed

$0.004 spent in total

0 tool calls

What it is.

Meta's Llama 4 Maverick, a mixture-of-experts model in the Llama 4 family (17B active parameters, 128 experts). We tested via the openweights endpoint.

What it does well.

Total spend was $0.004, by far the cheapest run in the matrix. (See "How it broke" for what that actually means.)
Coding and Web Scraping "passed" the rubric, but these are lenient passes that came from training-data memorization, not retrieval.

How it broke.

Zero tool calls on every single benchmark. Maverick refused to engage with the agent loop. On Sales, Marketing, and CX, it just wrote prose answers from training memory — wrong shape, wrong format, instant rubric fail.
The Coding and Web Scraping "passes" are an artifact of two rubric quirks: the Stripe code lives in training data verbatim, and the Apollo pricing answer happened to match the rubric's structural check from memory. Neither was retrieved.
This is the failure mode you most need to catch in evals: a model that always answers cheaply and confidently and never verifies anything. In a real customer-facing agent it would hallucinate funding numbers, product features, and prices — silently.
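
One way to catch this failure mode is to gate rubric passes on evidence of retrieval. The following is a minimal Python sketch under stated assumptions, not the harness used for these runs: `RunRecord`, `requires_retrieval`, and `grade` are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass


@dataclass
class RunRecord:
    """Hypothetical record of one benchmark run (not the real harness schema)."""
    task: str
    tool_calls: int
    rubric_passed: bool
    requires_retrieval: bool = True


def grade(run: RunRecord) -> str:
    """Return 'pass', 'fail', or 'unverified-pass'.

    A rubric pass with zero tool calls on a retrieval task is flagged
    instead of being counted as a clean pass.
    """
    if not run.rubric_passed:
        return "fail"
    if run.requires_retrieval and run.tool_calls == 0:
        return "unverified-pass"
    return "pass"


# Tool-call counts and rubric outcomes taken from the results on this page.
runs = [
    RunRecord("Sales", tool_calls=0, rubric_passed=False),
    RunRecord("Coding", tool_calls=0, rubric_passed=True),
    RunRecord("Web Scraping", tool_calls=0, rubric_passed=True),
]
grades = {run.task: grade(run) for run in runs}
```

Under this scheme the two "passes" would surface as `unverified-pass`, which keeps a cheap, confident, unverified answer from looking like a real success in the results matrix.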

Results by agent

Five real jobs.

Sales: Sales Research Analyst

Get financials on a company

Failed

Find Hightouch on Crunchbase and return their total funding, last round type, and a one-line description.

✗ 0 tool calls, $0.0005. Wrote a prose answer about Hightouch from training memory. No Crunchbase visit, no Google search, nothing. Schema parser rejected the prose.
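
The benchmark's actual schema is not shown here, so the following is only a sketch of how a schema parser of this kind might reject prose. It assumes the answer must be a JSON object with three fields; the field names `total_funding`, `last_round_type`, and `description` are hypothetical, derived from the task prompt.

```python
import json

# Assumed field names, inferred from the task prompt (not the real schema).
REQUIRED_FIELDS = ("total_funding", "last_round_type", "description")


def parse_answer(raw: str) -> dict:
    """Parse a model answer as JSON and check the assumed required fields."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"answer is not JSON: {exc.msg}") from exc
    if not isinstance(data, dict):
        raise ValueError("answer must be a JSON object")
    missing = [field for field in REQUIRED_FIELDS if field not in data]
    if missing:
        raise ValueError(f"missing fields: {', '.join(missing)}")
    return data


def is_valid(raw: str) -> bool:
    """True if the answer parses and has every assumed required field."""
    try:
        parse_answer(raw)
        return True
    except ValueError:
        return False
```

A prose paragraph fails at the `json.loads` step, so under these assumptions it is rejected before any field is inspected, regardless of whether its facts are right.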

Marketing: Marketing Research Analyst

Find what customers are recommending on Reddit

Failed

Read a real r/sales thread on enrichment tools and rank the top 5 by how many people recommended them.

✗ 0 tool calls, $0.0005. Refused to call Reddit. Listed five enrichment tools from training memory in non-rubric format.

CX: Customer Insights Analyst

Find what customers are complaining about

Failed

Pull Google Shopping reviews for AirPods Pro 2 (USB-C) and return the top 3 recurring complaints, with verbatim quotes.

✗ 0 tool calls, $0.001. No Google Shopping call, no product lookup, no real review data.

Coding: Senior Engineer Assistant

Read API docs and write working code

Passed

Read Stripe's official docs and write a real, working webhook-verification function in TypeScript.

0 tool calls, $0.001. Wrote a verifyStripeWebhook function from training memory. The Stripe webhook pattern is well-documented; Maverick's memorized version happened to satisfy the rubric. Don't read this as a model strength — it's training-data exposure.

Web Scraping: Sales Outreach Specialist

Scrape a competitor's pricing page

Passed

Scrape Apollo.io's pricing page and return every tier (Free, Basic, Professional, Organization) with name, price, and top 3 features.

0 tool calls, $0.001. Same pattern — wrote Apollo's pricing tiers from training memory; the rubric's structural check passed. Apollo's actual pricing may have moved; Maverick didn't check.
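
To illustrate why a shape-only check can pass without retrieval, here is a hedged Python sketch. The real rubric is not shown on this page; `structural_check` and the answer layout (a list of objects with `name`, `price`, and `features`) are assumptions made for illustration. The tier names come from the task prompt.

```python
EXPECTED_TIERS = {"Free", "Basic", "Professional", "Organization"}  # from the task prompt


def structural_check(tiers: list[dict]) -> bool:
    """Shape-only check: expected tier names, a price, and three features each."""
    names = {tier.get("name") for tier in tiers}
    if names != EXPECTED_TIERS:
        return False
    return all(
        "price" in tier
        and isinstance(tier.get("features"), list)
        and len(tier["features"]) == 3
        for tier in tiers
    )


# Placeholder values only: the check never compares them with the live page.
answer = [
    {"name": name, "price": "placeholder", "features": ["a", "b", "c"]}
    for name in sorted(EXPECTED_TIERS)
]
passed = structural_check(answer)
```

Because nothing in such a check touches the live pricing page, any answer with the right shape passes, whether it is current, stale, or invented.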
