0 tool calls, $0.001 — wrote Apollo's pricing tiers from training memory. The rubric's structural check passed because the pattern lives in Llama's training corpus. Apollo may have changed prices since; Llama didn't check.
Scrape a competitor's pricing page
We asked 16 AI models to scrape Apollo's pricing page and return every tier. 14 passed.
Why this benchmark exists.
Outbound sales teams need accurate, current pricing for the tools they sell against (Apollo, Outreach, Salesloft, Clay, and every point solution they compete with). The data lives on the vendor's pricing page, and the page changes more often than training data updates. The job is to scrape the page, extract every tier with its price and features, and return clean structured data the rep can paste into a comparison deck.
What we asked it.
Scrape Apollo.io's pricing page and return every tier (Free, Basic, Professional, Organization) with name, price, and top 3 features.
System prompt
You are a sales outreach specialist. Combine real-time company research with crisp copywriting. ALWAYS verify facts (pricing, features) by visiting the company's website with the site scraper. Never quote a price or feature from training memory.
Tools available
- Google Search
- Site Scraper
How it's graded
- Did it return all of Apollo's pricing tiers, in price order?
- Does each tier have a real price, or 'Contact sales' for the enterprise tier?
- Did it return three actual features per tier?
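The three checks above can be sketched as a small validator. This is a minimal illustration, not the benchmark's actual grader: the `check_tiers` and `parse_price` names, the dictionary shape, the accepted price formats, and the sample values are all assumptions made up for the example.

```python
import re


def parse_price(text):
    """Return a number for a real price, or None for 'Contact sales'.

    Hypothetical helper: the accepted formats are an assumption.
    """
    cleaned = text.strip().lower().replace(",", "")
    if cleaned == "contact sales":
        return None
    if cleaned == "free":
        return 0.0
    match = re.search(r"\d+(?:\.\d+)?", cleaned)
    if match is None:
        raise ValueError(f"not a real price: {text!r}")
    return float(match.group())


def check_tiers(tiers):
    """Return a list of rubric failures; an empty list means a pass."""
    if not tiers:
        return ["no tiers returned"]
    problems = []
    numeric_prices = []
    for tier in tiers:
        try:
            price = parse_price(tier["price"])
        except ValueError as exc:
            problems.append(f"{tier['name']}: {exc}")
        else:
            # 'Contact sales' has no number, so it is left out of the order check.
            if price is not None:
                numeric_prices.append(price)
        if len(tier["features"]) != 3:
            problems.append(f"{tier['name']}: expected 3 features")
    if numeric_prices != sorted(numeric_prices):
        problems.append("tiers are not in price order")
    return problems


# Placeholder values only, not Apollo's real prices or features.
sample = [
    {"name": "Free", "price": "Free", "features": ["a", "b", "c"]},
    {"name": "Basic", "price": "$10", "features": ["a", "b", "c"]},
    {"name": "Professional", "price": "$20", "features": ["a", "b", "c"]},
    {"name": "Organization", "price": "Contact sales", "features": ["a", "b", "c"]},
]
print(check_tiers(sample))  # prints []
```

Returning a list of failures rather than a single boolean keeps the reason for each failed run visible, which matters when the same rubric is applied across 16 models.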
What we saw.
We asked a sales outreach specialist agent to scrape Apollo.io and return all four pricing tiers (Free, Basic, Professional, Organization) with name, price, and top 3 features. The rubric checks that the URL is on apollo.io, the tiers come back in price order, and each one has a real price (or "Contact sales" for the enterprise tier) along with exactly three features.
14 models passed, the highest pass rate in the matrix. Only two failed: Grok with the familiar empty-final pattern, and Qwen burning $0.188 to produce one tool call and no real output.
What worked.
- DeepSeek did it for $0.014, the cheapest single benchmark run anywhere in the matrix. Two tool calls, clean output.
- Every successful model handled the "Contact sales" tier correctly (it has no number, just a label). That is a small detail, but it is exactly the kind of edge case that breaks pricing-scraper agents in production.
How it broke.
- Grok made zero tool calls and wrote nothing. Even on the easiest benchmark in the matrix, Grok stayed in form.
- Qwen made one tool call, then burned $0.188 thinking about it, then closed the run with no answer. Same tool-loop pattern as everywhere else, just at lower cost because it gave up faster.
- Llama 4 technically passed the rubric, but with zero tool calls. It wrote Apollo's pricing tiers from training memory, and the structural check happened to pass. Apollo may have changed its prices since Llama was trained; Llama didn't check.
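The Llama result shows the gap: a purely structural rubric cannot tell a scraped answer from a remembered one. One possible guard, sketched here, is to require at least one scraper call against the vendor's domain before the output is graded at all. The tool-call log shape, the `site_scraper` tool name, and both function names are hypothetical; the benchmark's real trace format is not shown in this write-up.

```python
from urllib.parse import urlparse


def on_domain(url, domain="apollo.io"):
    """True if the URL's host is the domain itself or one of its subdomains."""
    host = (urlparse(url).hostname or "").lower()
    return host == domain or host.endswith("." + domain)


def is_grounded(tool_calls, domain="apollo.io"):
    """True if at least one scraper call hit the vendor's own site.

    Assumes each call is a dict like {"tool": "site_scraper", "url": "..."}.
    """
    return any(
        call.get("tool") == "site_scraper" and on_domain(call.get("url", ""), domain)
        for call in tool_calls
    )


# A zero-tool-call run, like the Llama one described above, is not grounded.
print(is_grounded([]))  # prints False
print(is_grounded([{"tool": "site_scraper", "url": "https://www.apollo.io/pricing"}]))  # prints True
```

Matching on the parsed hostname rather than a substring avoids accepting look-alike hosts such as `apollo.io.example.com`.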
Results by model
16 models, ranked.
Passes first, sorted cheap → expensive. Failures last, sorted by how much budget they burned producing nothing.
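As a rough sketch, that ordering is a two-part sort key. The `Run` type and `rank` function are hypothetical names, and "how much budget they burned" is read here as most expensive failure first, which is an assumption. The five rows reuse costs quoted in this write-up; pairing the $0.002 empty-final run with Grok is inferred from the failure notes.

```python
from dataclasses import dataclass


@dataclass
class Run:
    model: str
    cost_usd: float
    passed: bool


def rank(runs):
    """Passes first (cheapest first), then failures (most expensive first)."""
    return sorted(
        runs,
        key=lambda r: (not r.passed, r.cost_usd if r.passed else -r.cost_usd),
    )


runs = [
    Run("Qwen", 0.188, False),
    Run("Opus", 0.185, True),
    Run("Grok", 0.002, False),
    Run("DeepSeek", 0.014, True),
    Run("Llama 4", 0.001, True),
]
print([r.model for r in rank(runs)])
# prints ['Llama 4', 'DeepSeek', 'Opus', 'Qwen', 'Grok']
```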
2 tool calls, $0.014 — cheapest single benchmark run in the entire matrix.
2 tool calls, $0.025 — second-cheapest run. Same minimal chain as DeepSeek.
3 tool calls, $0.029. Cheap and clean.
2 tool calls, $0.035 — 3.5 Flash's most disciplined moment in the matrix.
2 tool calls, $0.185 — Opus's most economical correct answer. Even Opus knew when to stop verifying.
✗ 1 tool call, $0.188. The cheapest Qwen failure — one tool call, no JSON. The tool-loop pattern even at light load.
✗ 0 tool calls, $0.002. Empty final on the easiest benchmark in the matrix.