Web Scraping · Sales Outreach Specialist

Scrape a competitor's pricing page

We asked an AI to scrape Apollo's pricing page and return every tier. 14 of 16 models could do it.

  • 14/16 models passed
  • $0.001 cheapest pass
  • $0.065 avg cost of passes
  • $0.188 costliest fail

Why this benchmark exists.

Outbound sales teams need accurate, current pricing for the tools they sell against (Apollo, Outreach, Salesloft, Clay, and every point solution they compete with). The data lives on the vendor's pricing page, and the page changes more often than training data updates. The job is to scrape the page, extract every tier with its price and features, and return clean structured data the rep can paste into a comparison deck.

What we asked it.

Scrape Apollo.io's pricing page and return every tier (Free, Basic, Professional, Organization) with name, price, and top 3 features.

System prompt

You are a sales outreach specialist. Combine real-time company research with crisp copywriting. ALWAYS verify facts (pricing, features) by visiting the company's website with the site scraper. Never quote a price or feature from training memory.

Tools available

  • Google Search
  • Site Scraper

How it's graded

  • Did it return all of Apollo's pricing tiers, in price order?
  • Does each tier have a real price, or 'Contact sales' for the enterprise tier?
  • Did it return three actual features per tier?
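The three checks above could be automated roughly as follows. This is a minimal sketch, not the benchmark's actual grader: the answer shape (`source_url`, `tiers`, `name`, `price`, `features`), the helper names, and the treatment of "Free" as a zero price are all assumptions made for illustration. The prices in `sample` are placeholders, not Apollo's real prices.

```python
import math
import re
from urllib.parse import urlparse

# Tier names taken from the task prompt.
EXPECTED_TIERS = ["Free", "Basic", "Professional", "Organization"]


def parse_price(price: str) -> float:
    """Turn a price label into a sortable number (hypothetical helper)."""
    text = price.strip().lower()
    if text == "contact sales":
        return math.inf  # no number on the page, so it sorts last
    if text == "free":
        return 0.0  # assumption: a "Free" label counts as a zero price
    match = re.search(r"\d+(?:\.\d+)?", text.replace(",", ""))
    if match is None:
        raise ValueError(f"not a real price: {price!r}")
    return float(match.group())


def grade(answer: dict) -> list[str]:
    """Return rubric failures; an empty list means the run passed."""
    failures = []

    # The source must be a page on apollo.io (or a subdomain of it).
    host = urlparse(answer.get("source_url", "")).hostname or ""
    if host != "apollo.io" and not host.endswith(".apollo.io"):
        failures.append("source URL is not on apollo.io")

    tiers = answer.get("tiers", [])
    names = [tier.get("name") for tier in tiers]
    if len(names) != len(EXPECTED_TIERS) or set(names) != set(EXPECTED_TIERS):
        failures.append("missing or unexpected tiers")

    # Every tier needs a real price or "Contact sales", in ascending order.
    try:
        prices = [parse_price(str(tier.get("price", ""))) for tier in tiers]
    except ValueError as err:
        failures.append(str(err))
    else:
        if prices != sorted(prices):
            failures.append("tiers are not in price order")

    # Exactly three non-empty features per tier.
    for tier in tiers:
        features = tier.get("features", [])
        ok = len(features) == 3 and all(
            isinstance(f, str) and f.strip() for f in features
        )
        if not ok:
            failures.append(f"{tier.get('name')}: expected exactly three features")

    return failures


# Placeholder data only; these are NOT Apollo's actual prices or features.
sample = {
    "source_url": "https://www.apollo.io/pricing",
    "tiers": [
        {"name": "Free", "price": "$0", "features": ["a", "b", "c"]},
        {"name": "Basic", "price": "$10/mo", "features": ["a", "b", "c"]},
        {"name": "Professional", "price": "$20/mo", "features": ["a", "b", "c"]},
        {"name": "Organization", "price": "Contact sales", "features": ["a", "b", "c"]},
    ],
}

print(grade(sample))  # prints []
```

Note that a check like this is purely structural: it confirms the answer has the right shape, not that the prices match what is on the page today.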

What we saw.

We asked a sales outreach specialist agent to scrape Apollo.io and return all four pricing tiers (Free, Basic, Professional, Organization) with name, price, and top 3 features. The rubric checks that the URL is on apollo.io, the tiers come back in price order, and each one has a real price (or "Contact sales" for the enterprise tier) along with exactly three features.

14 models passed, the highest pass rate in the matrix. Only two failed: Grok, which returned an empty final answer as it has elsewhere, and Qwen, which spent $0.188 on one tool call and no usable output.

What worked.

  • DeepSeek did it for $0.014, the cheapest pass that actually used its tools (Llama 4's $0.001 pass made no tool calls). Two tool calls, clean output.
  • Every successful model handled the "Contact sales" tier correctly (it has no number, just a label). That is a small detail, but it is exactly the kind of edge case that breaks pricing-scraper agents in production.

How it broke.

  • Grok made zero tool calls and wrote nothing. Even on the easiest benchmark in the matrix, Grok stayed in form.
  • Qwen made one tool call, spent the rest of a $0.188 run thinking about it, then closed the run with no answer. Same tool-loop pattern as everywhere else, just cheaper than its other failures because it gave up faster.
  • Llama 4 technically passed the rubric, but with zero tool calls. It wrote Apollo's pricing tiers from training memory, and the structural check happened to pass. Apollo may have changed its prices since Llama was trained; Llama didn't check.
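The Llama 4 case suggests a grounding check alongside the structural rubric: only credit a pass if the run actually fetched the page. The sketch below is one way that could look, assuming a hypothetical trace format in which each tool call is a dict with `tool` and `url` keys. The function names and the "passed (ungrounded)" label are illustrative, not part of the benchmark.

```python
from urllib.parse import urlparse


def is_grounded(tool_calls: list[dict], domain: str = "apollo.io") -> bool:
    """True if at least one Site Scraper call fetched a page on the domain."""
    for call in tool_calls:
        if call.get("tool") != "Site Scraper":
            continue  # a search alone does not verify the pricing page
        host = urlparse(call.get("url", "")).hostname or ""
        if host == domain or host.endswith("." + domain):
            return True
    return False


def final_verdict(rubric_passed: bool, tool_calls: list[dict]) -> str:
    """Combine the structural rubric result with the grounding check."""
    if not rubric_passed:
        return "failed"
    return "passed" if is_grounded(tool_calls) else "passed (ungrounded)"


# A run like Llama 4's: rubric satisfied, but no tool calls at all.
print(final_verdict(True, []))  # prints passed (ungrounded)
```

Under this scheme, a two-call run that scraped the pricing page would be reported as "passed", while an answer written from memory would be flagged rather than counted as equivalent.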

Results by model

16 models, ranked.

Passes first, sorted cheap → expensive. Failures last, sorted by how much budget they burned producing nothing.
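The ordering rule can be expressed as a single sort key. This is a small sketch using five of the results listed on this page; the `Result` class and `rank` function are names chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class Result:
    model: str
    cost: float  # total run cost in USD
    passed: bool


def rank(results: list[Result]) -> list[Result]:
    """Passes first (cheapest first), then failures (most expensive first)."""
    return sorted(
        results,
        key=lambda r: (not r.passed, r.cost if r.passed else -r.cost),
    )


results = [
    Result("Grok 4.1 Fast", 0.002, False),
    Result("Claude Opus 4.8", 0.185, True),
    Result("Qwen 3.6 Plus", 0.188, False),
    Result("Llama 4 Maverick", 0.001, True),
    Result("DeepSeek V3.1", 0.014, True),
]

for r in rank(results):
    print(r.model)
# Llama 4 Maverick
# DeepSeek V3.1
# Claude Opus 4.8
# Qwen 3.6 Plus
# Grok 4.1 Fast
```

Negating the cost for failures is what puts the most expensive failure at the top of the failure group.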

Llama 4 Maverick
$0.001 · 0 tools · Passed

0 tool calls, $0.001 — wrote Apollo's pricing tiers from training memory. The rubric's structural check passed because the pattern lives in Llama's training corpus. Apollo may have changed prices since; Llama didn't check.

DeepSeek V3.1
$0.014 · 2 tools · Passed

2 tool calls, $0.014. The cheapest pass that actually used its tools.

Mistral Large 3
$0.025 · 2 tools · Passed

2 tool calls, $0.025. The second-cheapest pass that used its tools, with the same minimal chain as DeepSeek.

GPT-5 Mini
$0.029 · 3 tools · Passed

3 tool calls, $0.029. Cheap and clean.

Gemini 3.5 Flash
$0.035 · 2 tools · Passed

2 tool calls, $0.035 — 3.5 Flash's most disciplined moment in the matrix.

Claude Opus 4.8
$0.185 · 2 tools · Passed

2 tool calls, $0.185 — Opus's most economical correct answer. Even Opus knew when to stop verifying.

Qwen 3.6 Plus
$0.188 · 1 tool · Failed

✗ 1 tool call, $0.188. The cheapest Qwen failure — one tool call, no JSON. The tool-loop pattern even at light load.

Grok 4.1 Fast
$0.002 · 0 tools · Failed

✗ 0 tool calls, $0.002. Empty final on the easiest benchmark in the matrix.