Web Scraping · Sales Outreach Specialist

Scrape a competitor's pricing page

We asked an AI to scrape Apollo's pricing page and return every tier. 20 of 24 models could do it.

20/24 models passed

$0.001 cheapest pass

$0.062 avg cost of passes

$0.188 costliest fail

Why this benchmark exists.

Outbound sales teams need accurate, current pricing for the tools they sell against (Apollo, Outreach, Salesloft, Clay, and every point solution they compete with). The data lives on the vendor's pricing page, and the page changes more often than training data updates. The job is to scrape the page, extract every tier with its price and features, and return clean structured data the rep can paste into a comparison deck.

What we asked it.

Scrape Apollo.io's pricing page and return every tier (Free, Basic, Professional, Organization) with name, price, and top 3 features.

System prompt

You are a sales outreach specialist. Combine real-time company research with crisp copywriting. ALWAYS verify facts (pricing, features) by visiting the company's website with the site scraper. Never quote a price or feature from training memory.

Tools available

Google Search
Site Scraper

How it's graded

Did it return all of Apollo's pricing tiers, in price order?
Does each tier have a real price, or 'Contact sales' for the enterprise tier?
Did it return three actual features per tier?
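The three checks above can be sketched as a small grader. This is an illustration only, not the benchmark's actual scoring code: the `Tier` shape, the `grade` function, and the rule that an unpriced tier sorts last are all assumptions, and the prices in the example are placeholders rather than Apollo's real pricing.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Tier:
    name: str
    price: Optional[float]  # numeric price; None when the page shows only a label
    price_label: str        # text as shown on the page, e.g. "Contact sales"
    features: list[str]


def grade(tiers: list[Tier], expected_count: int = 4) -> dict[str, bool]:
    """Hypothetical rubric: one boolean per grading question."""
    priced = [t.price for t in tiers if t.price is not None]
    # Assumption: tiers with no numeric price must come after all priced tiers.
    unpriced_last = all(t.price is not None for t in tiers[: len(priced)])
    in_price_order = priced == sorted(priced) and unpriced_last

    has_price = all(
        t.price is not None or t.price_label.strip().lower() == "contact sales"
        for t in tiers
    )
    three_features = all(
        len(t.features) == 3 and all(f.strip() for f in t.features)
        for t in tiers
    )
    return {
        "all_tiers_in_price_order": len(tiers) == expected_count and in_price_order,
        "real_price_or_contact_sales": has_price,
        "three_features_per_tier": three_features,
    }


# Placeholder data: tier names come from the task, prices and features are made up.
example = [
    Tier("Free", 0.0, "$0", ["feature a", "feature b", "feature c"]),
    Tier("Basic", 10.0, "$10", ["feature a", "feature b", "feature c"]),
    Tier("Professional", 20.0, "$20", ["feature a", "feature b", "feature c"]),
    Tier("Organization", None, "Contact sales", ["feature a", "feature b", "feature c"]),
]
result = grade(example)
```

A grader like this only checks structure. It cannot tell whether the prices are current, which is why a run can satisfy it without ever visiting the page.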

What we saw.

We asked a sales outreach specialist agent to scrape Apollo.io and return all four pricing tiers (Free, Basic, Professional, Organization) with name, price, and top 3 features. The rubric checks that the URL is on apollo.io, the tiers come back in price order, and each one has a real price (or "Contact sales" for the enterprise tier) along with exactly three features.

20 of 24 models passed, the highest pass rate in the matrix. Two of the failures are worth describing: Grok, with the familiar empty-final pattern, and Qwen, which burned $0.188 to produce one tool call and no real output.

What worked.

DeepSeek did it for $0.014, well under the $0.062 average for passing runs. Two tool calls, clean output.
Every successful model handled the "Contact sales" tier correctly (it has no number, just a label). That is a small detail, but it is exactly the kind of edge case that breaks pricing-scraper agents in production.

How it broke.

Grok made zero tool calls and wrote nothing. Even on the easiest benchmark in the matrix, Grok stayed in form.
Qwen made one tool call, burned $0.188 thinking about it, then closed the run with no answer. It is the same tool-loop pattern Qwen shows on the other benchmarks, just at lower cost because it gave up faster.
Llama 4 technically passed the rubric, but with zero tool calls. It wrote Apollo's pricing tiers from training memory, and the structural check happened to pass. Apollo may have changed its prices since Llama was trained, and Llama didn't check.
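One way to catch the Llama 4 case is to require evidence of a scraper call against the vendor's domain before accepting a structurally valid answer. The sketch below assumes a hypothetical trace format (a list of dicts with `tool` and `url` keys) and a hypothetical tool name `site_scraper`. The benchmark's real trace schema is not described here.

```python
from urllib.parse import urlparse


def is_grounded(tool_calls, domain="apollo.io", scraper_tool="site_scraper"):
    """Return True if at least one scraper call hit the given domain.

    `tool_calls` is an assumed trace format: [{"tool": str, "url": str}, ...].
    """
    for call in tool_calls:
        if call.get("tool") != scraper_tool:
            continue
        host = urlparse(call.get("url", "")).hostname or ""
        # Accept the bare domain and its subdomains, but not look-alike hosts.
        if host == domain or host.endswith("." + domain):
            return True
    return False


memory_only_run = []  # zero tool calls, as in the Llama 4 run
scraped_run = [{"tool": "site_scraper", "url": "https://www.apollo.io/pricing"}]
```

With a gate like this, `is_grounded(memory_only_run)` is false, so an answer written from training memory would not count as a pass even when its structure is correct.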

Results by model

24 models, ranked.

Passes first, sorted cheap → expensive. Failures last, sorted by how much budget they burned producing nothing.
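The ordering rule can be expressed as a single sort key. This is a sketch under assumptions: the `Run` record is hypothetical, and "sorted by how much budget they burned" is read here as most expensive failure first. The first three example rows use costs quoted on this page, and the last row is a placeholder.

```python
from dataclasses import dataclass


@dataclass
class Run:
    model: str
    passed: bool
    cost_usd: float


def rank(runs):
    """Passes first (cheapest first), then failures (assumed: costliest first)."""
    return sorted(
        runs,
        key=lambda r: (not r.passed, r.cost_usd if r.passed else -r.cost_usd),
    )


runs = [
    Run("Qwen", False, 0.188),
    Run("DeepSeek", True, 0.014),
    Run("Llama 4 Maverick", True, 0.001),
    Run("placeholder-model", False, 0.050),  # made-up row for illustration
]
ranked = [r.model for r in rank(runs)]
```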

Llama 4 Maverick

$0.001 · 0 tools · Passed

0 tool calls, $0.001. Wrote Apollo's pricing tiers from training memory. The rubric's structural check passed because the pattern lives in Llama's training corpus. Apollo may have changed prices since; Llama didn't check.