CX · Customer Insights Analyst

Find what customers are complaining about

We asked an AI to find what people are complaining about on a real product page. 17 of 24 models could do it.

17/24 models passed

$0.053 cheapest pass

$0.393 avg cost of passes

$0.239 costliest fail

Why this benchmark exists.

Customer experience teams need to know what's actually breaking for real customers before support tickets start piling up. The data is sitting right there on the product page (Google Shopping reviews are the public version of NPS comments). The job is to pull the reviews, group the recurring complaints, and quote them verbatim so a CX lead can prioritize against real signal.

What we asked it.

Pull Google Shopping reviews for AirPods Pro 2 (USB-C) and return the top 3 recurring complaints, with verbatim quotes.

System prompt

You are a customer insights analyst for a CX team. Mine product reviews for recurring pain points. ALWAYS pull real reviews with the tools; never summarize from memory. Group reviewer complaints by recurrence and cite verbatim snippets.

Tools available

Google Search
Google Shopping

How it's graded

Did it identify the actual AirPods product?
Did it return exactly 3 complaints?
Are the quotes pulled verbatim from real reviews, not paraphrased from memory?
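The three checks above can be sketched as a small grader. This is a minimal illustration, not the benchmark's actual grading code: the `Complaint` type, the `grade` function, the title-matching rule, and the whitespace and curly-quote normalization are all assumptions made for the example, and the sample reviews are invented.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Complaint:
    """Hypothetical shape for one complaint in the model's answer."""
    theme: str
    quotes: list[str] = field(default_factory=list)


def normalize(text: str) -> str:
    """Unify curly quotes and collapse whitespace (an assumed tolerance)."""
    for curly, straight in (("\u2018", "'"), ("\u2019", "'"),
                            ("\u201c", '"'), ("\u201d", '"')):
        text = text.replace(curly, straight)
    return re.sub(r"\s+", " ", text).strip()


def grade(product_title: str, complaints: list[Complaint],
          reviews: list[str]) -> dict[str, bool]:
    """Apply the three rubric questions to one answer."""
    title = product_title.lower()
    corpus = [normalize(r) for r in reviews]
    quotes = [q for c in complaints for q in c.quotes]
    checks = {
        # Assumed matching rule: the title names AirPods Pro and USB-C.
        "right_product": "airpods pro" in title and "usb-c" in title,
        "exactly_three": len(complaints) == 3,
        # Every complaint needs a quote, and every non-empty quote must
        # appear as a substring of some retrieved review.
        "verbatim_quotes": bool(quotes)
        and all(c.quotes for c in complaints)
        and all(normalize(q) != "" and any(normalize(q) in r for r in corpus)
                for q in quotes),
    }
    checks["passed"] = all(checks.values())
    return checks


# Invented sample data, for illustration only.
sample_reviews = [
    "The case rattles when I shake it.",
    "Battery life dropped  after six months.",
    "Keeps disconnecting from my phone.",
]
sample_answer = [
    Complaint("case rattle", ["The case rattles"]),
    Complaint("battery degradation", ["Battery life dropped after six months"]),
    Complaint("connection drops", ["Keeps disconnecting from my phone"]),
]
print(grade("Apple AirPods Pro 2 (USB-C)", sample_answer, sample_reviews))
```

A substring match is the strictest simple reading of "verbatim": a paraphrase such as "the battery got worse" fails even when the underlying complaint is real.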

What we saw.

We asked a customer insights analyst agent to pull reviews for AirPods Pro 2 (USB-C) from Google Shopping and return the three most recurring complaints, with verbatim customer quotes. The rubric checks that it found the right product, returned exactly three complaints, and pulled the quotes verbatim from real reviews.

17 of the 24 models passed. Four of the seven failures were interesting: MiniMax M3 thought it through in a private scratchpad and then never wrote the answer, Llama 4 didn't attempt to read the reviews at all, and Grok and Qwen retrieved the reviews and then closed the run without writing anything.

What worked.

DeepSeek passed for $0.053 with four tool calls, including a product search and a review pull, and returned three complaints with verbatim quotes. It was the cheapest pass on the benchmark.
Most of the passes returned the same three top complaints: case rattle, battery degradation, and connection drops. The signal is sitting right there in the reviews, and the models that bothered to look at them all found it.

How it broke.

MiniMax M3 listed the complaints in a private thinking block and then closed the run without ever writing the answer. The reasoning was there, but the output was not.
Llama 4 didn't pull any reviews. It tried to answer from training memory and returned an answer in the wrong shape.
Grok and Qwen both read the reviews and then never wrote the final answer. Grok made nine tool calls before closing the run with no output, and Qwen made ten.
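The failure modes above can be told apart mechanically from two signals: how many tool calls a run made, and whether anything remains in the final message once private reasoning is stripped. The sketch below assumes reasoning is wrapped in `<think>` tags (as in the MiniMax M3 run); the `classify_run` function and its label names are hypothetical, not part of the benchmark harness.

```python
import re

# Matches a <think> block, including an unclosed one that runs to the end.
THINK_BLOCK = re.compile(r"<think>.*?(?:</think>|\Z)", re.DOTALL)


def classify_run(tool_calls: int, final_message: str) -> str:
    """Label a run by the failure modes described above (labels are assumed)."""
    visible = THINK_BLOCK.sub("", final_message).strip()
    if tool_calls == 0:
        # Nothing was retrieved, so any answer came from training memory.
        return "answered_from_memory"
    if not visible:
        if "<think>" in final_message:
            # Reasoning exists, but no answer was written outside it.
            return "reasoning_only"
        return "no_final_answer"
    return "answered"


print(classify_run(2, "<think>case rattle, battery, drops</think>"))
print(classify_run(0, "The top complaints are usually battery related."))
print(classify_run(9, ""))
```

A run labelled `answered` still has to clear the rubric; this check only separates "wrote nothing" from "wrote something".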

Results by model

24 models, ranked.

Passes first, sorted cheap → expensive. Failures last, sorted by how much budget they burned producing nothing.
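The ordering described above amounts to a two-part sort key. Here is a minimal sketch using a handful of the figures reported on this page; the `Run` type and `rank` function are illustrative names, not the site's implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Run:
    model: str
    cost_usd: float
    passed: bool


def rank(runs: list[Run]) -> list[Run]:
    """Passes first, cheapest first; then failures, costliest first."""
    return sorted(
        runs,
        # `not passed` is False for passes, so they sort ahead of failures.
        # Negating the cost of failures reverses their order.
        key=lambda r: (not r.passed, r.cost_usd if r.passed else -r.cost_usd),
    )


# A subset of the results reported on this page.
runs = [
    Run("Qwen 3.6 Plus", 0.229, False),
    Run("Gemini 3.1 Pro", 1.48, True),
    Run("Llama 4 Maverick", 0.001, False),
    Run("DeepSeek V3.1", 0.053, True),
    Run("GPT-5.6 Sol", 0.239, False),
    Run("Mistral Large 3", 0.066, True),
]

ranked = rank(runs)
cheapest_pass = ranked[0]
costliest_fail = next(r for r in ranked if not r.passed)
for r in ranked:
    print(f"{r.model}: ${r.cost_usd:.3f} {'Passed' if r.passed else 'Failed'}")
```

With this ordering the first row is the cheapest pass and the first failing row is the costliest fail, which is how the two headline figures fall out of the ranking.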

DeepSeek V3.1

$0.053 · 4 tools · Passed

4 tool calls, $0.053 — cheapest CX pass. DeepSeek's bailing tendencies didn't show up on the easier review-reading task.

GPT-5.6 Luna

$0.060 · 2 tools · Passed

Mistral Large 3

$0.066 · 2 tools · Passed

2 tool calls, $0.066 — tightest run on the benchmark. One product search, one review pull, three complaints with verbatim quotes.

$0.108 · 4 tools · Passed

$0.113 · 2 tools · Passed

$0.119 · 4 tools · Passed

$0.204 · 4 tools · Passed

Nemotron 3 Super 120B

$0.329 · 6 tools · Passed

$0.333 · 3 tools · Passed

$0.339 · 13 tools · Passed

$0.348 · 5 tools · Passed

$0.373 · 2 tools · Passed

$0.420 · 4 tools · Passed

$0.623 · 5 tools · Passed

$0.723 · 4 tools · Passed

$0.998 · 4 tools · Passed

4 tool calls, $1.00. Opus's per-task cost is just structurally high; the answer was correct.

Gemini 3.1 Pro

$1.481 · 15 tools · Passed

15 tool calls, $1.48 — the most expensive correct answer in the matrix. Pro decided to read nearly every review on the page; the rubric wasn't asking for that.

GPT-5.6 Sol

$0.239 · 3 tools · Failed

Returned 4 complaints where the schema demands exactly 3.

Qwen 3.6 Plus

$0.229 · 10 tools · Failed

✗ 10 tool calls, $0.229. Engaged with the data, never wrote the answer.

Nemotron 3 Nano 30B

$0.202 · 16 tools · Failed

16 tool calls, then cut off before producing a final answer.

Nemotron 3 Ultra 550B

$0.148 · 3 tools · Failed

Extracted a real complaint but omitted the required productId field.

MiniMax M3

$0.092 · 2 tools · Failed

✗ 2 tool calls, $0.093. M3's signature failure: emitted a `<think>` block listing the complaints, then closed the run without producing the JSON envelope. Analysis was there; output wasn't.

Llama 4 Maverick

$0.001 · 0 tools · Failed

✗ 0 tool calls, $0.001. Didn't look at real Google Shopping reviews; tried to answer from training memory in the wrong shape.

Grok 4.1 Fast

$0.000 · 9 tools · Failed

✗ 9 tool calls, $0.00. Grok read 9 review pages and emitted nothing in the final message.
