Benchmarks
Sales·Sales Research Analyst

Get financials on a company

We asked an AI to get funding data for a sales prospect. 12 of 16 models found the right number.

  • 12/16 models passed
  • $0.032 cheapest pass
  • $0.194 avg cost of passes
  • $1.91 costliest fail

Why this benchmark exists.

Before a discovery call, every sales rep needs the same three facts: what the company does, how much money it has raised, and who provided it. The data lives in Crunchbase, but models often fall back on their own training data and hallucinate a wrong answer. Researching prospects before calls is the cheapest and most common agent job our customers run.

What we asked it.

Find Hightouch on Crunchbase and return their total funding, last round type, and a one-line description.

System prompt

You are a sales research analyst. Your job is to prepare research briefs for sales reps before customer calls. ALWAYS use the available tools to verify facts; your training data is stale and may hallucinate funding amounts. Prefer Crunchbase for funding data; use Google search to first locate the right slug. If a tool call returns no data, retry once with a different identifier before giving up.

Tools available

  • Google Search
  • Site Scraper
  • Crunchbase
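
The workflow the system prompt asks for (search for the slug, query Crunchbase, retry once with a different identifier) can be sketched roughly as follows. This is a minimal illustration, not the benchmark harness: `research_company`, the two tool callables, and the lowercased-name fallback identifier are all hypothetical stand-ins for the real tools.

```python
from typing import Callable, List, Optional

# Hypothetical tool signatures; the real tools are not specified in this write-up.
SearchTool = Callable[[str], Optional[str]]        # query -> Crunchbase slug, or None
CrunchbaseTool = Callable[[str], Optional[dict]]   # identifier -> record, or None


def research_company(
    name: str,
    google_search: SearchTool,
    crunchbase: CrunchbaseTool,
) -> Optional[dict]:
    """Find the slug via search, then query Crunchbase with at most one retry."""
    slug = google_search(f"{name} Crunchbase")

    # Assumed fallback identifier: the company name, lowercased and hyphenated.
    fallback = name.strip().lower().replace(" ", "-")

    # At most two distinct identifiers: the first attempt plus one retry.
    identifiers: List[str] = []
    for candidate in (slug, fallback):
        if candidate and candidate not in identifiers:
            identifiers.append(candidate)

    for identifier in identifiers:
        record = crunchbase(identifier)
        if record:
            return record

    # Give up after the single retry instead of answering from memory.
    return None
```

Returning `None` on failure, rather than a guess, mirrors the prompt's instruction to verify facts with tools instead of relying on training data.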

How it's graded

  • Did it find the right funding amount?
  • Did it get the right round (Series E)?
  • Did it return a link to real Crunchbase data, not made up from memory?
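
The three checks above could be expressed as a small grader along these lines. It is a sketch under stated assumptions: the answer field names (`total_funding_usd`, `last_round`, `source_url`), the URL pattern, and the rule that a run with zero tool calls cannot count as sourced are illustrative choices, not the benchmark's actual implementation. The expected funding amount is passed in as a parameter because it is not stated here.

```python
import re
from dataclasses import dataclass


@dataclass
class RubricResult:
    funding_ok: bool
    round_ok: bool
    source_ok: bool

    @property
    def passed(self) -> bool:
        # A run passes only if every rubric item is satisfied.
        return self.funding_ok and self.round_ok and self.source_ok


# Assumed shape of a Crunchbase organization URL.
CRUNCHBASE_URL = re.compile(r"https?://(www\.)?crunchbase\.com/organization/[\w-]+")


def grade(answer: dict, expected_funding_usd: int, tool_calls: int) -> RubricResult:
    """Grade one run against the three rubric questions."""
    funding_ok = answer.get("total_funding_usd") == expected_funding_usd
    round_ok = str(answer.get("last_round", "")).strip().lower() == "series e"

    url = str(answer.get("source_url", ""))
    # Assumption: a link only counts as real data if at least one tool was called.
    source_ok = (
        tool_calls > 0
        and CRUNCHBASE_URL.match(url) is not None
        and "hightouch" in url.lower()
    )
    return RubricResult(funding_ok, round_ok, source_ok)
```

Keeping the three results separate makes it easy to see which rubric item a failing run missed.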

What we saw.

We asked a sales research analyst agent to research Hightouch and return its total funding, last round type, and a one-line description. The rubric checks for the correct funding number, the correct round, a real description, and a Crunchbase URL that mentions the company.

Twelve models passed. The four failures were the interesting part: two bailed after announcing the work (one before calling any tool, one after a single call), one answered from memory without calling a tool, and one spent $1.91 on reasoning and tool calls without returning an answer.

What worked.

  • The cheapest passes used short tool chains; Gemini 3 Flash needed only three calls. Gemini, Grok, GLM, and Mistral all followed the same pattern: find Hightouch's Crunchbase entry with a Google search, hit the Crunchbase tool, and return cleanly. This is the ideal pass.
  • The more expensive models went on to verify the answer against other sources. That is good practice, but every extra call costs money: Claude Opus 4.8 made 17 tool calls and paid $1.22 for the same correct answer.

How it broke.

  • Two models, Claude Haiku 4.5 and DeepSeek V3.1, said they were going to do the work and then never did it. They might not be powerful enough, or they might have been waiting for a follow-up response before continuing.
  • Llama 4 Maverick made no tool calls. It wrote up the funding amount from its training memory, which is exactly the hallucination the rubric is built to catch.
  • Qwen 3.6 Plus made 9 tool calls and spent $1.91, mostly on reasoning, but never returned any output.
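
The three failure modes above can be told apart from a run's tool-call count and final output. The heuristic below is a hypothetical sketch for triaging failed runs, not the benchmark's grader; the label names and the list of announcement phrases are assumptions.

```python
# Assumed phrases that signal "I am about to do the work".
ANNOUNCEMENT_PREFIXES = ("i'll", "i will", "let me")


def classify_failure(tool_calls: int, output: str) -> str:
    """Assign a failed run to one of the failure modes described above."""
    # Normalize curly apostrophes so "I’ll" and "I'll" match the same prefix.
    text = output.strip().lower().replace("\u2019", "'")

    if not text:
        # Ran, possibly with many tool calls, but closed without an answer.
        return "no_output"
    if text.startswith(ANNOUNCEMENT_PREFIXES):
        # Promised to do the work and then stopped.
        return "announced_only"
    if tool_calls == 0:
        # Produced an answer without consulting any tool.
        return "answered_from_memory"
    return "other"
```

Under these assumptions, an empty output after 9 tool calls maps to `no_output`, and a run whose entire output begins with "Let me search for..." maps to `announced_only` regardless of how many tools it called.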

Results by model

16 models, ranked.

Passes first, sorted cheap → expensive. Failures last, sorted by how much budget they burned producing nothing.
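
The ordering rule can be written as two sorts. The sketch below applies it to the eight runs listed on this page, using the cost and pass/fail values from their descriptions; the tuple layout and the `rank` helper are illustrative only.

```python
from typing import List, Tuple

Run = Tuple[str, float, bool]  # (model, cost in USD, passed)

# Deliberately shuffled input; values are taken from the result cards on this page.
runs: List[Run] = [
    ("Qwen 3.6 Plus", 1.91, False),
    ("GPT-5 Mini", 0.146, True),
    ("DeepSeek V3.1", 0.0005, False),
    ("Claude Opus 4.8", 1.22, True),
    ("Llama 4 Maverick", 0.0005, False),
    ("Gemini 3 Flash", 0.032, True),
    ("Claude Haiku 4.5", 0.007, False),
    ("Mistral Large 3", 0.058, True),
]


def rank(all_runs: List[Run]) -> List[Run]:
    """Passes first (cheapest first), then failures (most expensive first)."""
    passes = sorted((r for r in all_runs if r[2]), key=lambda r: r[1])
    fails = sorted((r for r in all_runs if not r[2]), key=lambda r: r[1], reverse=True)
    return passes + fails


ranked_names = [name for name, _, _ in rank(runs)]
print(ranked_names)
```

Python's sort is stable, so runs with identical cost (here the two $0.0005 failures) keep their input order.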

Gemini 3 Flash
$0.032 · 3 tools · Passed

3 tool calls, $0.032. Cheapest pass. Google → Crunchbase → JSON. The minimal-viable agent chain.

Mistral Large 3
$0.058 · 7 tools · Passed

7 tool calls, $0.058. Used Google to find the Crunchbase slug, hit Crunchbase, scraped the page to verify the funding number. The reference shape for what "good" looks like on this task.

GPT-5 Mini
$0.146 · 10 tools · Passed

10 tool calls, $0.146. One of the more thorough successful runs; every call chained productively without thrashing.

Claude Opus 4.8
$1.22 · 17 tools · Passed

17 tool calls, $1.22. Opus cross-checked Crunchbase against the Hightouch blog and a press release before committing. Correct answer at the highest passing price.

Qwen 3.6 Plus
$1.91 · 9 tools · Failed

✗ 9 tool calls, $1.91. The single most expensive failure across the entire matrix. Qwen engaged, retrieved data, thought about it for $1.91 worth of tokens, then closed the run with no JSON.

Claude Haiku 4.5
$0.007 · 1 tool · Failed

✗ Wrote "I'll help you research Hightouch..." as the entire output. 1 tool call, $0.007. The polite-customer-service failure mode in plain view.

DeepSeek V3.1
$0.001 · 0 tools · Failed

✗ 0 tool calls, $0.0005. "Let me search for..." — and nothing else. DeepSeek bails on Sales when the system prompt is persona-heavy.

Llama 4 Maverick
$0.001 · 0 tools · Failed

✗ 0 tool calls, $0.0005. Wrote a prose paragraph about Hightouch's funding from training memory. The hallucination case the rubric is built to catch.