5 tool calls, $0.046 — the only DeepSeek pass under $0.10 in the matrix and the cheapest Marketing pass overall. When V3.1 actually engages, it's excellent.
Find what customers are recommending on Reddit
We asked AI models to read a Reddit thread and rank the top tools mentioned. 11 of 16 could do it.
Why this benchmark exists.
Marketers need to know what people are actually saying about their products. Reddit is where buyers openly compare tools, and the conversation lives in long open-ended threads. The job is part retrieval (read the actual thread), part counting (which products got mentioned by which people), and part discipline (don't just list products from memory).
What we asked it.
Read a real r/sales thread on enrichment tools and rank the top 5 by how many people recommended them.
System prompt
You are a marketing research analyst. Your job is to mine social-media discussions to surface customer language and product recommendations. ALWAYS read real Reddit threads with the provided tools; don't list products from training memory. Count which products are mentioned by multiple commenters and rank by recurrence.
Tools available
Google Search
Reddit
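The counting step the prompt asks for (rank by how many different commenters mention each product) can be sketched roughly as below. This is an illustration only, not the harness or any model's actual method: `rank_tools`, the `(author, body)` comment shape, and the candidate tool names are all assumptions, and the case-insensitive substring match is deliberately naive.

```python
from collections import defaultdict


def rank_tools(comments: list[tuple[str, str]],
               tool_names: list[str],
               top_n: int = 5) -> list[tuple[str, int]]:
    """Rank tools by the number of distinct commenters who mention them.

    `comments` is a list of (author, body) pairs (an assumed shape).
    A commenter who mentions the same tool twice is counted once.
    """
    mentioners: dict[str, set[str]] = defaultdict(set)
    for author, body in comments:
        text = body.lower()
        for tool in tool_names:
            # Naive substring match; a real pass would need alias handling.
            if tool.lower() in text:
                mentioners[tool].add(author)
    # Most distinct commenters first; ties broken alphabetically.
    ranked = sorted(mentioners.items(), key=lambda kv: (-len(kv[1]), kv[0]))
    return [(tool, len(authors)) for tool, authors in ranked[:top_n]]
```

Counting distinct authors rather than raw mentions is what keeps one enthusiastic commenter from inflating a tool's rank.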
How it's graded
- Did it actually open a real Reddit thread?
- Did it return exactly 5 tools, not 4 or 7?
- Did each tool come with a real mention count and a real reason?
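A minimal sketch of how these three checks might be applied to a single run. The answer shape is an assumption: the `tools`, `mentions`, and `reason` field names, the `opened_urls` list, and the `grade` function itself are hypothetical, not the benchmark's actual grader. Such a check can confirm that a count and a reason are present, but not that they are true to the thread.

```python
from urllib.parse import urlparse


def is_reddit_thread(url: str) -> bool:
    """True for URLs that look like a Reddit comment thread."""
    parsed = urlparse(url)
    host = parsed.netloc.lower()
    on_reddit = host == "reddit.com" or host.endswith(".reddit.com")
    return on_reddit and "/comments/" in parsed.path


def grade(answer: dict, opened_urls: list[str]) -> dict[str, bool]:
    """Apply the three rubric checks to one run (field names are assumed)."""
    tools = answer.get("tools", [])

    def has_count_and_reason(tool: dict) -> bool:
        mentions = tool.get("mentions")
        reason = tool.get("reason")
        # bool is a subclass of int in Python, so exclude it explicitly.
        count_ok = (isinstance(mentions, int)
                    and not isinstance(mentions, bool)
                    and mentions > 0)
        reason_ok = isinstance(reason, str) and bool(reason.strip())
        return count_ok and reason_ok

    return {
        "opened_real_thread": any(is_reddit_thread(u) for u in opened_urls),
        "exactly_five_tools": len(tools) == 5,
        "counts_and_reasons": bool(tools) and all(
            has_count_and_reason(t) for t in tools
        ),
    }
```

A run passes only if all three values are true, so an empty final answer fails the last two checks no matter how many retrievals preceded it.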
What we saw.
We asked a marketing research analyst agent to read a real r/sales thread about sales-enrichment tools and return the top 5 tools, ranked by how many different people mentioned each one. The rubric checks that it actually opened a real Reddit thread, returned exactly 5 tools, and gave each one a real mention count and a reason.
11 models passed. The five failures were instructive: three models retrieved the data but never wrote an answer, one got stuck repeating itself until the run had to be killed, and one never opened Reddit at all (it just listed tools from training memory).
What worked.
- DeepSeek and Mistral both passed for under $0.05 with five tool calls or fewer: one Google search to find the thread, one read of the Reddit thread, then a clean structured answer. This is the ideal pass.
- Every successful model respected the 'exactly 5 tools' constraint. None of them padded with extras when their reading was thin (they either had five real mentions or they didn't pass).
How it broke.
- Three models did the retrievals correctly and then never wrote the final answer. GLM made 7 tool calls, Grok made 7, and Qwen made 22 (the most of any run in the matrix), but none of them returned a final answer that could be graded.
- Kimi 2.6 started writing the answer, but got stuck repeating the same fragment over and over until the run had to be killed.
- Llama 4 didn't open Reddit at all. It just listed five tools from training memory in a format the rubric doesn't accept, with no actual mention counts.
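The repetition failure is the kind of thing a harness could, in principle, catch before a run has to be killed by hand. The sketch below is a hypothetical detector, not something the benchmark is known to use: `has_repetition_loop` and its thresholds are assumptions. It flags output whose tail is one fragment repeated back to back.

```python
def has_repetition_loop(text: str,
                        min_len: int = 8,
                        min_repeats: int = 4) -> bool:
    """True if `text` ends with one fragment repeated back to back.

    The fragment must be at least `min_len` characters long and appear
    at least `min_repeats` times in a row at the end of the text.
    Because the fragment is taken from the very end, output that was
    cut off mid-fragment is still detected (as a rotated fragment).
    """
    n = len(text)
    for size in range(min_len, n // min_repeats + 1):
        fragment = text[n - size:]
        if text.endswith(fragment * min_repeats):
            return True
    return False
```

For example, a partial answer followed by the same JSON key emitted over and over would return `True`, while an ordinary sentence would return `False`.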
Results by model
16 models, ranked.
Passes first, sorted cheap → expensive. Failures last, sorted by how much budget they burned producing nothing.
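That ordering can be expressed as a single sort key. The sketch below is illustrative only: the `rank_runs` name and the row shape (`passed`, `calls`, `cost` keys) are assumptions, while the tool-call counts and costs are the ones reported on this page.

```python
def rank_runs(rows: list[dict]) -> list[dict]:
    """Passes first (cheapest first), then failures (most expensive first)."""
    return sorted(
        rows,
        key=lambda r: (not r["passed"],
                       r["cost"] if r["passed"] else -r["cost"]),
    )


# Tool-call counts and costs as reported on this page, in scrambled order.
runs = [
    {"passed": False, "calls": 7, "cost": 0.053},
    {"passed": True, "calls": 5, "cost": 1.82},
    {"passed": False, "calls": 0, "cost": 0.0005},
    {"passed": True, "calls": 5, "cost": 0.046},
    {"passed": False, "calls": 22, "cost": 0.399},
    {"passed": True, "calls": 5, "cost": 0.666},
    {"passed": False, "calls": 7, "cost": 0.00},
    {"passed": True, "calls": 9, "cost": 0.048},
    {"passed": False, "calls": 3, "cost": 0.423},
    {"passed": True, "calls": 5, "cost": 1.74},
]

ranked = rank_runs(runs)
```

Negating the cost for failed runs flips their order, so the failure that burned the most budget appears first in its group.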
9 tool calls, $0.048. A touch more retrieval than DeepSeek but still cheap. Tightest Marketing run by a frontier-tier model.
5 tool calls, $0.666. Gemini Pro's reasoning-tax shows up clearly here: same tool-call count as the cheaper models, ~3x the cost.
5 tool calls, $1.74 — second-most-expensive pass on this benchmark. Same long-context-tax pattern as Sonnet.
5 tool calls, $1.82 — the single most expensive passing run in the entire matrix. Sonnet's reasoning plus long-context Reddit input compounds fast; the bill is mostly thinking, not retrieval.
✗ 3 tool calls, $0.423. Token-repetition loop mid-JSON: emitted `{"sourceUrl":"..."` then got stuck looping the same key until the run was killed. The decoder pathology in plain view.
✗ 22 tool calls, $0.399: the most tool calls of any single run in the matrix, and still no final answer.
✗ 7 tool calls, $0.053. Read Reddit successfully, then closed the run without writing the JSON. The "stops without writing the answer" Grok pattern.
✗ 0 tool calls, $0.0005. Listed five enrichment tools from training memory in non-rubric format. Didn't open Reddit at all.
✗ 7 tool calls, $0.00. GLM read the thread, then closed the run before writing the JSON. No characters were returned, so there was nothing to grade against the rubric.