Benchmarks
CX · Customer Insights Analyst

Find what customers are complaining about

We asked an AI to find what people are complaining about on a real product page. 12 of 16 models could do it.

12/16
models passed
$0.053
cheapest pass
$0.467
avg cost of passes
$0.229
costliest fail

Why this benchmark exists.

Customer experience teams need to know what's actually breaking for real customers before support tickets start piling up. The data is sitting right there on the product page (Google Shopping reviews are the public version of NPS comments). The job is to pull the reviews, group the recurring complaints, and quote them verbatim so a CX lead can prioritize against real signal.

What we asked it.

Pull Google Shopping reviews for AirPods Pro 2 (USB-C) and return the top 3 recurring complaints, with verbatim quotes.

System prompt

You are a customer insights analyst for a CX team. Mine product reviews for recurring pain points. ALWAYS pull real reviews with the tools; never summarize from memory. Group reviewer complaints by recurrence and cite verbatim snippets.

Tools available

  • Google Search
  • Google Shopping

How it's graded

  • Did it identify the actual AirPods product?
  • Did it return exactly 3 complaints?
  • Are the quotes pulled verbatim from real reviews, not paraphrased from memory?
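The three rubric checks above can be sketched as a small grader. This is a minimal illustration, not the benchmark's actual grading code: the answer shape (`product`, `complaints`, `quotes`), the function names, and the sample reviews are all assumptions made up for the example. A quote counts as verbatim here if it appears as a substring of some pulled review after whitespace, case, and curly-quote normalization.

```python
import re


def normalize(text: str) -> str:
    """Lowercase, unify curly quotes, and collapse whitespace."""
    text = text.replace("\u2018", "'").replace("\u2019", "'")
    text = text.replace("\u201c", '"').replace("\u201d", '"')
    return re.sub(r"\s+", " ", text).strip().lower()


def grade(answer: dict, reviews: list[str], expected_product: str) -> dict:
    """Apply the three rubric checks to a hypothetical answer dict."""
    corpus = [normalize(r) for r in reviews]
    complaints = answer.get("complaints", [])

    # Every complaint must carry at least one quote.
    has_quotes = bool(complaints) and all(c.get("quotes") for c in complaints)
    # Every quote must appear inside at least one real review.
    verbatim = all(
        any(normalize(q) in review for review in corpus)
        for c in complaints
        for q in c.get("quotes", [])
    )

    checks = {
        "right_product": normalize(expected_product) in normalize(answer.get("product", "")),
        "exactly_three": len(complaints) == 3,
        "quotes_verbatim": has_quotes and verbatim,
    }
    checks["passed"] = all(checks.values())
    return checks


# Made-up review text, for illustration only (not real Google Shopping data).
SAMPLE_REVIEWS = [
    "The case rattles   when I shake it. Otherwise fine.",
    "Battery life dropped after six months.",
    "Keeps dropping the connection to my phone.",
]

SAMPLE_ANSWER = {
    "product": "Apple AirPods Pro 2 (USB-C)",
    "complaints": [
        {"theme": "case rattle", "quotes": ["The case rattles when I shake it."]},
        {"theme": "battery degradation", "quotes": ["Battery life dropped after six months."]},
        {"theme": "connection drops", "quotes": ["Keeps dropping the connection to my phone."]},
    ],
}
```

Substring matching is deliberately strict: a paraphrase such as "the case makes a rattling noise" fails the check even when it describes a real complaint, which is the behavior the third rubric item asks for.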

What we saw.

We asked a customer insights analyst agent to pull reviews for AirPods Pro 2 (USB-C) from Google Shopping and return the three most recurring complaints, with verbatim customer quotes. The rubric checks that it found the right product, returned exactly three complaints, and pulled the quotes verbatim from real reviews.

12 models passed. The four that failed were interesting: MiniMax M3 thought it through in a private scratchpad and then never wrote the answer, Llama 4 didn't attempt to read the reviews at all, and Grok and Qwen retrieved the reviews and then closed the run without writing anything.

What worked.

  • DeepSeek passed for $0.053 with four tool calls: it found the product, pulled the reviews, and returned three complaints with verbatim quotes. It was the cheapest pass on the benchmark.
  • Most of the passes returned the same three top complaints: case rattle, battery degradation, and connection drops. The signal is plainly in the reviews, and most models that read them and wrote an answer surfaced it.

How it broke.

  • MiniMax M3 listed the complaints in a private thinking block and then closed the run without ever writing the answer. The reasoning was there, but the output was not.
  • Llama 4 didn't pull any reviews. It tried to answer from training memory and returned the wrong shape.
  • Grok and Qwen both read the reviews and then never wrote the final answer. Grok did nine tool calls before closing the run with no output, and Qwen did ten.
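The three failure shapes above can be told apart from two facts about a run: how many tool calls it made and what was left in the final message once any `<think>` block is removed. The sketch below assumes a hypothetical trace reduced to those two values; the function names and labels are illustrative and are not part of the benchmark harness.

```python
import re

# Matches a <think> block, including one that is never closed.
THINK_BLOCK = re.compile(r"<think>.*?(?:</think>|\Z)", re.DOTALL)


def visible_output(final_message: str) -> str:
    """Return what a reader would see after private reasoning is stripped."""
    return THINK_BLOCK.sub("", final_message or "").strip()


def failure_mode(tool_calls: int, final_message: str) -> str:
    """Label a run using the failure shapes described in this section."""
    message = final_message or ""
    visible = visible_output(message)
    if tool_calls == 0:
        # No reviews were pulled, so any answer came from memory.
        return "answered_without_tools"
    if not visible and "<think>" in message:
        # Reasoning exists, but nothing was written outside the think block.
        return "reasoning_only"
    if not visible:
        # Tools were used, then the run closed with no output.
        return "empty_final_message"
    return "wrote_answer"
```

A run labeled `wrote_answer` still has to pass the rubric; this check only separates "produced nothing" from "produced something gradable".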

Results by model

16 models, ranked.

Passes first, sorted cheap → expensive. Failures last, sorted by how much budget they burned producing nothing.
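As a rough sketch of that ordering, the snippet below sorts passes by ascending cost and failures by descending cost. The `Run` type and `rank` function are hypothetical names for illustration; the figures are the eight rows listed in this section, with costs and tool counts as read from each row.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Run:
    model: str
    cost_usd: float
    tool_calls: int
    passed: bool


# The eight rows shown in this section (not the full set of 16 models).
RUNS = [
    Run("DeepSeek V3.1", 0.053, 4, True),
    Run("Mistral Large 3", 0.066, 2, True),
    Run("Claude Opus 4.8", 0.998, 4, True),
    Run("Gemini 3.1 Pro", 1.481, 15, True),
    Run("Qwen 3.6 Plus", 0.229, 10, False),
    Run("MiniMax M3", 0.092, 2, False),
    Run("Llama 4 Maverick", 0.001, 0, False),
    Run("Grok 4.1 Fast", 0.000, 9, False),
]


def rank(runs):
    """Passes first (cheap to expensive), then failures (most budget burned first)."""
    runs = list(runs)
    passes = sorted((r for r in runs if r.passed), key=lambda r: r.cost_usd)
    fails = sorted((r for r in runs if not r.passed), key=lambda r: r.cost_usd, reverse=True)
    return passes + fails


ranked = rank(RUNS)
cheapest_pass = min(r.cost_usd for r in RUNS if r.passed)
costliest_fail = max(r.cost_usd for r in RUNS if not r.passed)
```

On these rows, `cheapest_pass` and `costliest_fail` line up with the headline figures of $0.053 and $0.229. The average cost of passes is not recomputed here because only four of the twelve passing runs are listed.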

DeepSeek V3.1
$0.053 · 4 tools · Passed

4 tool calls, $0.053 — cheapest CX pass. DeepSeek's bailing tendencies didn't show up on the easier review-reading task.

Mistral Large 3
$0.066 · 2 tools · Passed

2 tool calls, $0.066 — tightest run on the benchmark. One product search, one review pull, three complaints with verbatim quotes.

Claude Opus 4.8
$0.998 · 4 tools · Passed

4 tool calls, $1.00. Opus's per-task cost is just structurally high; the answer was correct.

Gemini 3.1 Pro
$1.481 · 15 tools · Passed

15 tool calls, $1.48 — the most expensive correct answer in the matrix. Pro decided to read nearly every review on the page; the rubric wasn't asking for that.

Qwen 3.6 Plus
$0.229 · 10 tools · Failed

✗ 10 tool calls, $0.229. Engaged with the data, never wrote the answer.

MiniMax M3
$0.092 · 2 tools · Failed

✗ 2 tool calls, $0.093. M3's signature failure: emitted a `<think>` block listing the complaints, then closed the run without producing the JSON envelope. Analysis was there; output wasn't.

Llama 4 Maverick
$0.001 · 0 tools · Failed

✗ 0 tool calls, $0.001. Didn't look at real Google Shopping reviews; tried to answer from training memory in the wrong shape.

Grok 4.1 Fast
$0.000 · 9 tools · Failed

✗ 9 tool calls, $0.00. A "no tokens written" failure: Grok read 9 review pages and emitted nothing in the final message.