Get financials on a company
Find Hightouch on Crunchbase and return their total funding, last round type, and a one-line description.
20 tool calls, $0.254. The most verification-heavy Sales run that passed. Right answer, lots of double-checking.
MiniMax
Thinks in `<think>` blocks. Sometimes forgets to write the answer afterward.
MiniMax's M3 — the November 2026 reasoning model out of the Chinese MiniMax lab. Tested via the first-party API.
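The "forgets to write the answer" failure can be caught mechanically before a run is marked complete. What follows is a minimal sketch, assuming the raw model output arrives as one string with reasoning wrapped in literal `<think>...</think>` tags. The names `visible_answer` and `forgot_answer` are hypothetical and are not part of any MiniMax SDK.

```python
import re

# Assumption: reasoning is wrapped in literal <think>...</think> tags.
THINK_BLOCK = re.compile(r"<think>.*?</think>", re.DOTALL)


def visible_answer(raw_output: str) -> str:
    """Return the text a user would see once reasoning is stripped out."""
    # Remove every closed <think>...</think> block.
    without_closed = THINK_BLOCK.sub("", raw_output)
    # An unclosed <think> means the model stopped mid-reasoning; drop it too.
    without_open = without_closed.split("<think>", 1)[0]
    return without_open.strip()


def forgot_answer(raw_output: str) -> bool:
    """True when the output holds reasoning only and no visible answer."""
    return visible_answer(raw_output) == ""
```

A harness could call `forgot_answer` on the final message and retry or flag the run instead of returning an empty result to the user.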
Results by agent
Read a real r/sales thread on enrichment tools and rank the top 5 by how many people recommended them.
6 tool calls, $0.200. Clean Marketing pass.
Pull Google Shopping reviews for AirPods Pro 2 (USB-C) and return the top 3 recurring complaints, with verbatim quotes.
✗ 2 tool calls, $0.093. M3 emitted a `<think>` block enumerating complaints, then closed the run without producing the JSON. Analysis was visible in the trace; the user got nothing.
Read Stripe's official docs and write a real, working webhook-verification function in TypeScript.
✗ 22 tool calls, $0.508. Same thinking-block-without-output pattern as CX, but with 11x the tool calls and about 5.5x the cost. The model planned the webhook function in a `<think>` block, then stopped.
Scrape Apollo.io's pricing page and return every tier (Free, Basic, Professional, Organization) with name, price, and top 3 features.
4 tool calls, $0.044. Apollo's pricing tiers extracted, JSON returned.
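The per-run figures on this page can be rolled up to show how much of the spend went to runs that returned nothing. This is a rough sketch using only the numbers reported here. The `Run` type and `summarize` helper are hypothetical, and the "Engineering" and "Pricing scrape" labels are assumed names for the two runs the page does not label.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Run:
    label: str
    tool_calls: int
    cost_usd: float
    passed: bool


# Figures copied from the five runs reported on this page.
# "Engineering" and "Pricing scrape" are assumed labels, not from the source.
RUNS = [
    Run("Sales", 20, 0.254, True),
    Run("Marketing", 6, 0.200, True),
    Run("CX", 2, 0.093, False),
    Run("Engineering", 22, 0.508, False),
    Run("Pricing scrape", 4, 0.044, True),
]


def summarize(runs: list[Run]) -> dict:
    """Aggregate pass count, tool calls, and cost across runs."""
    total_cost = sum(r.cost_usd for r in runs)
    failed_cost = sum(r.cost_usd for r in runs if not r.passed)
    return {
        "passed": sum(1 for r in runs if r.passed),
        "total": len(runs),
        "tool_calls": sum(r.tool_calls for r in runs),
        "total_cost_usd": round(total_cost, 3),
        "failed_cost_usd": round(failed_cost, 3),
    }
```

On these figures, `summarize(RUNS)` reports 3 of 5 runs passed, 54 tool calls, and $1.099 total, of which $0.601 went to the two runs that ended without output.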