3 tool calls, $0.032. Cheapest pass. Google → Crunchbase → JSON. The minimal-viable agent chain.
Get financials on a company
We asked an AI to get funding data for a sales prospect. 12 of 16 models found the right number.
Why this benchmark exists.
Every sales rep before a discovery call needs the same three facts: what does this company do, how much money do they have, and who gave it to them. The data lives in Crunchbase, but models often default to their own training data and hallucinate the wrong answer. Researching prospects before calls is the cheapest and most common agent job our customers run.
What we asked it.
Find Hightouch on Crunchbase and return their total funding, last round type, and a one-line description.
System prompt
You are a sales research analyst. Your job is to prepare research briefs for sales reps before customer calls. ALWAYS use the available tools to verify facts; your training data is stale and may hallucinate funding amounts. Prefer Crunchbase for funding data; use Google search to first locate the right slug. If a tool call returns no data, retry once with a different identifier before giving up.
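The retry rule in the system prompt ("retry once with a different identifier before giving up") can be sketched as a small helper. This is an illustration of the policy only, not the harness used in the benchmark: `lookup_with_one_retry`, the `tool` callable, and the identifier list are all hypothetical names, and the sketch assumes a tool signals "no data" by returning `None` or an empty dict.

```python
from typing import Callable, Optional, Sequence


def lookup_with_one_retry(
    tool: Callable[[str], Optional[dict]],
    identifiers: Sequence[str],
) -> Optional[dict]:
    """Call `tool` with the first identifier, then at most one alternative.

    Assumption: the tool returns None or {} when it has no data.
    """
    for identifier in identifiers[:2]:  # first attempt plus exactly one retry
        result = tool(identifier)
        if result:
            return result
    return None  # give up after the single retry
```

Capping the loop at two identifiers keeps the cost of a miss bounded, which matters on a task where the cheapest passing runs used only a handful of tool calls.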
Tools available
- Google Search
- Site Scraper
- Crunchbase
How it's graded
- Did it find the right funding amount?
- Did it get the right round (Series E)?
- Did it return a link to real Crunchbase data, not made up from memory?
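The three checks above could be expressed roughly as follows. This is a minimal sketch, not the actual grader: the `Brief` fields, the 1% tolerance, and the `hightouch` slug default are assumptions made for illustration, and the expected funding amount is passed in by the caller because the true figure is not stated here. A URL check like this only confirms that the link points at a Crunchbase page for the company; it cannot prove the data was fetched rather than recalled.

```python
from dataclasses import dataclass
from typing import Dict, Optional
from urllib.parse import urlparse


@dataclass
class Brief:
    # Hypothetical output shape; field names are assumptions.
    total_funding_usd: Optional[float]
    last_round: Optional[str]
    source_url: Optional[str]


def grade(
    brief: Brief,
    expected_funding_usd: float,
    expected_round: str = "Series E",
    company_slug: str = "hightouch",  # assumed slug, for illustration
    tolerance: float = 0.01,          # assumed 1% tolerance on the amount
) -> Dict[str, bool]:
    """Apply the three rubric checks and report each one separately."""
    funding_ok = (
        brief.total_funding_usd is not None
        and abs(brief.total_funding_usd - expected_funding_usd)
        <= tolerance * expected_funding_usd
    )
    round_ok = (brief.last_round or "").strip().lower() == expected_round.lower()

    parsed = urlparse(brief.source_url or "")
    host = (parsed.hostname or "").lower()
    on_crunchbase = host == "crunchbase.com" or host.endswith(".crunchbase.com")
    source_ok = on_crunchbase and company_slug in parsed.path.lower()

    checks = {"funding": funding_ok, "round": round_ok, "source": source_ok}
    checks["passed"] = all(checks.values())
    return checks
```

Reporting each check separately, rather than a single pass/fail flag, makes it easier to tell a memory answer (no source URL) from a wrong-number answer.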
What we saw.
We set up a sales research analyst agent and asked it to research Hightouch and return the data for the company's last funding round. The rubric checks for a real funding number, the actual round raised, a real description, and a Crunchbase URL that mentions the company.
12 models passed. The four failures are the instructive part: two announced the work and then stopped, one answered from memory, and one spent $1.91 thinking without producing anything.
What worked.
- The cheapest passes used three to five tool calls. Gemini, Grok, GLM, and Mistral all did the correct thing: they found Hightouch's Crunchbase entry with one Google search, hit the Crunchbase tool, and returned cleanly. This is the ideal pass.
- The more expensive models went on to verify the answer against other sources. That is good practice, but every extra call costs money, especially when the underlying credits are fairly expensive.
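The minimal chain described above (search for the Crunchbase entry, look it up, return JSON) can be sketched like this. Everything here is an assumption made for illustration: the two tools are passed in as plain callables, the record keys are hypothetical, and the URL pattern assumes Crunchbase company pages live under `/organization/<slug>`. It is not the agent code any of the models ran.

```python
import json
import re
from typing import Callable, List, Optional

# Assumed URL shape for Crunchbase company pages.
SLUG_PATTERN = re.compile(r"crunchbase\.com/organization/([\w-]+)")


def extract_slug(urls: List[str]) -> Optional[str]:
    """Return the first Crunchbase organization slug found in the results."""
    for url in urls:
        match = SLUG_PATTERN.search(url)
        if match:
            return match.group(1)
    return None


def research(
    company: str,
    google_search: Callable[[str], List[str]],            # hypothetical tool
    crunchbase_lookup: Callable[[str], Optional[dict]],   # hypothetical tool
) -> str:
    """Search, look up, and return a JSON brief: two tool calls in total."""
    urls = google_search(f"{company} crunchbase")  # tool call 1
    slug = extract_slug(urls)
    if slug is None:
        raise LookupError(f"no Crunchbase entry found for {company!r}")

    record = crunchbase_lookup(slug)  # tool call 2
    if not record:
        raise LookupError(f"Crunchbase returned no data for slug {slug!r}")

    brief = {
        "total_funding_usd": record["total_funding_usd"],
        "last_round": record["last_round"],
        "description": record["description"],
        "source_url": f"https://www.crunchbase.com/organization/{slug}",
    }
    return json.dumps(brief)
```

The sketch raises instead of falling back to a guess when either tool comes back empty, which mirrors the system prompt's instruction not to answer from memory.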
How it broke.
- Two models said they were going to do the work, then never did it. They might not be powerful enough, or they might have been waiting for a follow-up message before continuing.
- Llama 4 made no tool calls. It hallucinated the funding amount from its own training memory.
- Qwen 3.6 thought at length and made 9 tool calls, for $1.91 in total, but never returned any output.
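The three failure shapes above can be told apart from just the tool-call count and the final output. The following is a rough heuristic sketch, not the benchmark's classifier: the label names, the announcement prefixes, and the 200-character cutoff are all assumptions chosen for illustration.

```python
import json

# Assumed phrases that open an "I will do it" message with no work behind it.
ANNOUNCE_PREFIXES = ("i'll", "i will", "let me")


def classify_run(tool_calls: int, output: str) -> str:
    """Label a run by its failure shape (heuristic, assumptions noted above)."""
    text = output.strip().replace("\u2019", "'")  # normalize curly apostrophes
    if not text:
        return "no_output"  # worked (or thought) but returned nothing

    try:
        parsed = json.loads(text)
    except json.JSONDecodeError:
        parsed = None
    if isinstance(parsed, dict):
        # JSON without any tool call can only have come from memory.
        return "structured_output" if tool_calls > 0 else "memory_answer"

    if text.lower().startswith(ANNOUNCE_PREFIXES) and len(text) < 200:
        return "announced_then_stopped"
    if tool_calls == 0:
        return "memory_answer"  # prose answer with no tool use
    return "unstructured_output"
```

A run labeled `structured_output` still has to go through the rubric; this only separates the ways a run can fail before grading is even possible.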
Results by model
16 models, ranked.
Passes first, sorted cheap → expensive. Failures last, sorted by how much budget they burned producing nothing.
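The ordering rule can be written as a single sort key. This is a small sketch with hypothetical names (`Run`, `rank`); the costs in the usage example are the figures quoted on this page, and the labels are placeholders rather than model names.

```python
from typing import List, NamedTuple


class Run(NamedTuple):
    label: str       # placeholder label, not a model name
    passed: bool
    cost_usd: float


def rank(runs: List[Run]) -> List[Run]:
    """Passes first, cheapest first; then failures, most expensive first."""
    return sorted(
        runs,
        key=lambda r: (not r.passed, r.cost_usd if r.passed else -r.cost_usd),
    )


runs = [
    Run("fail_a", False, 0.0005),
    Run("pass_a", True, 1.22),
    Run("fail_b", False, 1.91),
    Run("pass_b", True, 0.032),
    Run("fail_c", False, 0.007),
    Run("pass_c", True, 0.146),
    Run("pass_d", True, 0.058),
    Run("fail_d", False, 0.0005),
]
ranked = rank(runs)
```

Because Python's `sorted` is stable, the two failures that tie at $0.0005 keep their input order.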
7 tool calls, $0.058. Used Google to find the Crunchbase slug, hit Crunchbase, scraped the page to verify the funding number. The reference shape for what "good" looks like on this task.
10 tool calls, $0.146. The most thorough successful run; every call chained productively without thrashing.
17 tool calls, $1.22. Opus cross-checked Crunchbase against the Hightouch blog and a press release before committing. Correct answer at the highest passing price.
✗ 9 tool calls, $1.91. The single most expensive failure across the entire matrix. Qwen engaged, retrieved data, thought about it for $1.91 worth of tokens, then closed the run with no JSON.
✗ Wrote "I'll help you research Hightouch..." as the entire output. 1 tool call, $0.007. The polite-customer-service failure mode in plain view.
✗ 0 tool calls, $0.0005. "Let me search for..." — and nothing else. DeepSeek bails on Sales when the system prompt is persona-heavy.
✗ 0 tool calls, $0.0005. Wrote a prose paragraph about Hightouch's funding from training memory. The hallucination case the rubric is built to catch.