MiniMax M3

Thinks in `<think>` blocks. Sometimes forgets to write the answer afterward.

3/5 benchmarks passed
$1.10 spent in total
54 tool calls

What it is.

M3 is the November 2026 reasoning model from the Chinese lab MiniMax. Tested via the first-party API.

What it does well.

  • Sales pass was thorough: 20 tool calls verifying Crunchbase data against multiple sources, for $0.254. The most agent-loop-like behavior of any model that passed here.
  • Web Scraping ($0.044) and Marketing ($0.200) passed cleanly with no envelope issues.

How it broke.

  • M3's signature failure mode: opens a `<think>...</think>` block, plans the answer, then closes the run before writing the answer. Both CX and Coding failed this way. The thinking is visible; the output isn't.
  • This is a known issue with M3's chat template: the model treats `<think>` as a non-streaming scratchpad and sometimes fails to transition out of it before hitting a stop token.
  • Coding cost $0.508 to produce no output. M3 read the Stripe docs, planned the function in a `<think>` block, then stopped. You're paying for thinking the user never sees.
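A harness can catch this failure mode mechanically instead of by reading traces. The following is a minimal sketch, assuming the reasoning arrives inline as literal `<think>...</think>` tags in the raw completion text; the function names `visible_answer` and `is_think_only` are hypothetical and not part of any MiniMax API.

```python
import re

# Assumption: reasoning is wrapped in literal <think>...</think> tags,
# as described in this write-up. Names below are illustrative only.
THINK_BLOCK = re.compile(r"<think>.*?</think>", re.DOTALL)


def visible_answer(raw: str) -> str:
    """Return the text that remains after removing reasoning blocks."""
    without_closed = THINK_BLOCK.sub("", raw)
    # An unclosed <think> means the run ended mid-scratchpad,
    # so everything from that tag onward is reasoning, not output.
    without_open = without_closed.split("<think>", 1)[0]
    return without_open.strip()


def is_think_only(raw: str) -> bool:
    """True when the model reasoned but produced no user-visible output."""
    return "<think>" in raw and visible_answer(raw) == ""


if __name__ == "__main__":
    print(is_think_only("<think>plan the webhook function</think>"))
    print(is_think_only('<think>plan</think>{"complaints": []}'))
```

A check like this lets a benchmark runner mark a run as failed, or retry it, as soon as the completion ends with nothing after the scratchpad.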

Results by agent

Five real jobs.

Sales · Sales Research Analyst

Get financials on a company

Passed

Find Hightouch on Crunchbase and return their total funding, last round type, and a one-line description.

20 tool calls, $0.254. The most verification-heavy Sales run that passed. Right answer, lots of double-checking.

$0.254 · 20 tool calls

Marketing · Marketing Research Analyst

Find what customers are recommending on Reddit

Passed

Read a real r/sales thread on enrichment tools and rank the top 5 by how many people recommended them.

6 tool calls, $0.200. Clean Marketing pass.

$0.200 · 6 tool calls

CX

Failed

Pull Google Shopping reviews for AirPods Pro 2 (USB-C) and return the top 3 recurring complaints, with verbatim quotes.

✗ 2 tool calls, $0.093. M3 emitted a `<think>` block enumerating complaints, then closed the run without producing the JSON. Analysis was visible in the trace; the user got nothing.

$0.092 · 2 tool calls

Coding · Senior Engineer Assistant

Read API docs and write working code

Failed

Read Stripe's official docs and write a real, working webhook-verification function in TypeScript.

✗ 22 tool calls, $0.508. Same thinking-block-without-output pattern as CX, but with 11x the tool calls and roughly 5.5x the cost. The model planned the webhook function in a `<think>` block, then stopped.

$0.508 · 22 tool calls

Web Scraping · Sales Outreach Specialist

Scrape a competitor's pricing page

Passed

Scrape Apollo.io's pricing page and return every tier (Free, Basic, Professional, Organization) with name, price, and top 3 features.

4 tool calls, $0.044. Apollo's pricing tiers extracted, JSON returned.

$0.044 · 4 tool calls
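The headline figures can be cross-checked against the per-run numbers. The sketch below copies each run's cost and tool-call count from the notes above (using the $0.093 quoted in the CX note) and recomputes the totals, plus the share of spend that went to runs with no output. The `RUNS` structure is only an illustration, not the benchmark's actual data format.

```python
from decimal import Decimal

# (agent, passed, cost in USD, tool calls), copied from the per-run notes.
RUNS = [
    ("Sales", True, Decimal("0.254"), 20),
    ("Marketing", True, Decimal("0.200"), 6),
    ("CX", False, Decimal("0.093"), 2),
    ("Coding", False, Decimal("0.508"), 22),
    ("Web Scraping", True, Decimal("0.044"), 4),
]

passed = sum(1 for _, ok, _, _ in RUNS if ok)
total_cost = sum(cost for _, _, cost, _ in RUNS)
total_calls = sum(calls for _, _, _, calls in RUNS)
# Spend on runs that ended without a user-visible answer.
failed_cost = sum(cost for _, ok, cost, _ in RUNS if not ok)

if __name__ == "__main__":
    print(f"{passed}/{len(RUNS)} passed")
    print(f"${total_cost.quantize(Decimal('0.01'))} total, {total_calls} tool calls")
    print(f"${failed_cost} spent on failed runs")
```

Summed this way, the two failed runs account for a little over half of the total spend.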