Marketing · Marketing Research Analyst

Find what customers are recommending on Reddit

We asked 16 AI models to read a Reddit thread and rank the most-recommended tools. 11 of the 16 could do it.

  • Models passed: 11/16
  • Cheapest pass: $0.045
  • Average cost of passes: $0.496
  • Costliest fail: $0.423

Why this benchmark exists.

Marketers need to know what people are actually saying about their products. Reddit is where buyers openly compare tools, and the conversation lives in long open-ended threads. The job is part retrieval (read the actual thread), part counting (which products got mentioned by which people), and part discipline (don't just list products from memory).

What we asked it.

Read a real r/sales thread on enrichment tools and rank the top 5 by how many people recommended them.
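
The counting step can be sketched in a few lines. This is a minimal illustration, not the benchmark's own code: the comment shape (`author`, `body`), the `rank_tools` helper, and the placeholder tool names are all assumptions. The point it shows is that a tool is counted once per distinct commenter, not once per mention.

```python
import re
from collections import defaultdict

def rank_tools(comments, tool_names, top_n=5):
    """Rank tools by the number of distinct commenters who mention them."""
    authors_by_tool = defaultdict(set)
    for comment in comments:
        text = comment["body"].lower()
        for tool in tool_names:
            # Whole-word match so a short name does not match inside a longer word.
            if re.search(r"\b" + re.escape(tool.lower()) + r"\b", text):
                authors_by_tool[tool].add(comment["author"])
    counts = {tool: len(authors) for tool, authors in authors_by_tool.items()}
    # Highest count first; ties broken alphabetically so the order is stable.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked[:top_n]

# Made-up comments and placeholder tool names, for illustration only.
comments = [
    {"author": "u1", "body": "We switched to ToolA last year."},
    {"author": "u1", "body": "Still happy with ToolA."},
    {"author": "u2", "body": "ToolA or ToolB, both fine."},
    {"author": "u3", "body": "ToolB has better data."},
    {"author": "u4", "body": "ToolB. Also tried ToolC."},
]
ranking = rank_tools(comments, ["ToolA", "ToolB", "ToolC"])
print(ranking)
```

Here `u1` mentions ToolA twice but counts once, so ToolB (three commenters) ranks above ToolA (two).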

System prompt

You are a marketing research analyst. Your job is to mine social-media discussions to surface customer language and product recommendations. ALWAYS read real Reddit threads with the provided tools; don't list products from training memory. Count which products are mentioned by multiple commenters and rank by recurrence.

Tools available

  • Google Search
  • Reddit

How it's graded

  • Did it actually open a real Reddit thread?
  • Did it return exactly 5 tools, not 4 or 7?
  • Did each tool come with a real mention count and a real reason?
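
A rough sketch of how these three checks could be automated is below. The actual grader is not shown in this write-up, so everything here is an assumption: the `grade` function, the trace shape (a list of dicts with a `tool` key), and the answer schema (a JSON object with a `tools` list whose items carry `name`, `mentions`, and `reason`).

```python
import json

def grade(raw_output, tool_calls):
    """Return (passed, failures) for one run under the three rubric checks."""
    failures = []
    # Check 1: the run must have called the Reddit tool at least once.
    if not any(call.get("tool") == "reddit" for call in tool_calls):
        failures.append("never opened a Reddit thread")
    # An empty or truncated answer cannot be graded any further.
    try:
        answer = json.loads(raw_output)
    except (TypeError, ValueError):
        return False, failures + ["no parseable final answer"]
    tools = answer.get("tools") if isinstance(answer, dict) else None
    if not isinstance(tools, list):
        return False, failures + ["answer has no tools list"]
    # Check 2: exactly 5 tools, not 4 or 7.
    if len(tools) != 5:
        failures.append(f"returned {len(tools)} tools instead of 5")
    # Check 3: every tool needs a positive integer count and a non-empty reason.
    for tool in tools:
        if not isinstance(tool, dict):
            failures.append("malformed tool entry")
            continue
        mentions = tool.get("mentions")
        if isinstance(mentions, bool) or not isinstance(mentions, int) or mentions < 1:
            failures.append(f"{tool.get('name')}: missing or invalid mention count")
        if not str(tool.get("reason") or "").strip():
            failures.append(f"{tool.get('name')}: missing reason")
    return not failures, failures

# Hypothetical well-formed answer with placeholder tool names.
good = json.dumps({"tools": [
    {"name": f"Tool{i}", "mentions": 6 - i, "reason": "recommended by several commenters"}
    for i in range(1, 6)
]})
print(grade(good, [{"tool": "google"}, {"tool": "reddit"}]))
print(grade("", [{"tool": "reddit"}] * 7))
```

The second call mirrors a run that read the thread but returned nothing: it fails at the parse step regardless of how many tool calls came before.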

What we saw.

We asked a marketing research analyst agent to read a real r/sales thread about sales-enrichment tools and return the top 5 tools, ranked by how many different people mentioned each one. The rubric checks that it actually opened a real Reddit thread, returned exactly 5 tools, and gave each one a real mention count and a reason.

11 models passed. The ones that failed were interesting: three of them retrieved the data but never actually wrote an answer, one model got stuck repeating itself until the run had to be killed, and one model never opened Reddit at all (it just listed tools from training memory).

What worked.

  • DeepSeek and Mistral both passed for under $0.05, with five and nine tool calls respectively. The shape was the same in both: a Google search to find the thread, a read of the Reddit thread, and then a clean structured answer. This is the ideal pass.
  • Every successful model respected the 'exactly 5 tools' constraint. None of them padded with extras when their reading was thin (they either had five real mentions or they didn't pass).

How it broke.

  • Three models did the retrieval correctly and then never wrote the final answer. GLM made 7 tool calls, Grok made 7, and Qwen made 22 (the most of any run in the matrix), but none of them returned a usable final output.
  • Kimi K2.6 started writing the answer, but got stuck repeating the same fragment over and over until the run had to be killed.
  • Llama 4 didn't open Reddit at all. It just listed five tools from training memory in the wrong shape, with no actual mention counts.
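
The three failure modes above are distinct enough to separate mechanically. The sketch below is one hedged way to do it, assuming a run is summarized as a list of tool names plus the final output string. The function names, labels, and thresholds are all hypothetical, and the repetition check is a simple heuristic that only looks at the tail of the output.

```python
def has_repetition_loop(text, max_period=50, min_repeats=5):
    """Heuristic: does the output end with the same short unit repeated back to back?"""
    for period in range(1, max_period + 1):
        unit = text[-period:]
        if len(text) >= period * min_repeats and text.endswith(unit * min_repeats):
            return True
    return False

def classify_failure(tool_calls, final_output):
    """Map a failed run onto one of the failure modes described above."""
    if "reddit" not in tool_calls:
        return "never_opened_reddit"
    if not final_output.strip():
        return "no_final_answer"
    if has_repetition_loop(final_output):
        return "repetition_loop"
    return "other"

# Made-up traces for illustration; the looping string is not a real model output.
looping = '{"sourceUrl":"..."' + ',"sourceUrl"' * 10
print(classify_failure([], "1. ToolA 2. ToolB"))
print(classify_failure(["google", "reddit"] * 3, ""))
print(classify_failure(["google", "reddit", "reddit"], looping))
```

The order of the checks matters: a run that never opened Reddit is labeled as such even if it also wrote something, which matches how the Llama 4 failure is described.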

Results by model

16 models, ranked.

Passes first, sorted cheap → expensive. Failures last, sorted by how much budget they burned producing nothing.
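
That ordering reduces to a two-part sort key. The sketch below applies it to the ten runs listed in this section, using the cost quoted in each run's commentary; the `Run` type and variable names are illustrative only.

```python
from typing import NamedTuple

class Run(NamedTuple):
    model: str
    cost: float   # USD, as quoted in the per-model commentary
    passed: bool

# Deliberately shuffled so the sort has work to do.
runs = [
    Run("Grok 4.1 Fast", 0.053, False),
    Run("Claude Sonnet 4.6", 1.82, True),
    Run("GLM 5.1", 0.00, False),
    Run("DeepSeek V3.1", 0.046, True),
    Run("Kimi K2.6", 0.423, False),
    Run("Gemini 3.1 Pro", 0.666, True),
    Run("Llama 4 Maverick", 0.0005, False),
    Run("Mistral Large 3", 0.048, True),
    Run("Qwen 3.6 Plus", 0.399, False),
    Run("Claude Opus 4.8", 1.74, True),
]

# Passes first (False sorts before True), cheapest first.
# Failures after, most expensive first, hence the negated cost.
ordered = sorted(runs, key=lambda r: (not r.passed, r.cost if r.passed else -r.cost))
for r in ordered:
    print(f"{r.model:<20} ${r.cost:<7} {'Passed' if r.passed else 'Failed'}")

costliest_fail = max(r.cost for r in runs if not r.passed)
```

With these figures `costliest_fail` comes out at 0.423, the Kimi K2.6 run, which agrees with the headline stat.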

DeepSeek V3.1
$0.045 · 5 tools · Passed

5 tool calls, $0.046 — the only DeepSeek pass under $0.10 in the matrix and the cheapest Marketing pass overall. When V3.1 actually engages, it's excellent.

Mistral Large 3
$0.048 · 9 tools · Passed

9 tool calls, $0.048. A touch more retrieval than DeepSeek but still cheap. Tightest Marketing run by a frontier-tier model.

Gemini 3.1 Pro
$0.665 · 5 tools · Passed

5 tool calls, $0.666. Gemini Pro's reasoning tax shows up clearly here: same tool-call count as DeepSeek, roughly 14x the cost.

Claude Opus 4.8
$1.74 · 5 tools · Passed

5 tool calls, $1.74 — second-most-expensive pass on this benchmark. Same long-context-tax pattern as Sonnet.

Claude Sonnet 4.6
$1.82 · 5 tools · Passed

5 tool calls, $1.82 — the single most expensive passing run in the entire matrix. Sonnet's reasoning plus long-context Reddit input compounds fast; the bill is mostly thinking, not retrieval.

Kimi K2.6
$0.423 · 3 tools · Failed

✗ 3 tool calls, $0.423. Token-repetition loop mid-JSON: emitted `{"sourceUrl":"..."` then got stuck looping the same key until the run was killed. The decoder pathology in plain view.

Qwen 3.6 Plus
$0.399 · 22 tools · Failed

✗ 22 tool calls, $0.399 — the most tool calls of any single run in the matrix. None produced a final answer.

Grok 4.1 Fast
$0.052 · 7 tools · Failed

✗ 7 tool calls, $0.053. Read Reddit successfully, then closed the run without writing the JSON. The "stops without writing the answer" Grok pattern.

Llama 4 Maverick
$0.001 · 0 tools · Failed

✗ 0 tool calls, $0.0005. Listed five enrichment tools from training memory in non-rubric format. Didn't open Reddit at all.

GLM 5.1
$0.000 · 7 tools · Failed

✗ 7 tool calls, $0.00. GLM read the thread, then closed the run before writing the JSON. No characters were returned, so there was nothing for the rubric to grade.