Benchmarks
moonshot

Kimi K2.6

Confident prose, missing code. Token-repetition loop on Reddit. Read the actual outputs before you trust this one.

3/5 benchmarks passed
$1.27 spent in total
19 tool calls

What it is.

Moonshot AI's Kimi K2.6 is a Chinese long-context model with reasoning capabilities. We tested the production endpoint as of November 2026.

What it does well.

  • Sales, CX, and Web Scraping all passed cleanly. When Kimi K2.6 stays in its lane, it's a perfectly competent agent.
  • CX pass was the cleanest Kimi run: 3 tool calls, $0.333, clean grouping of complaints.

How it broke.

  • Reddit failure was a decoder pathology: the model emitted `{"sourceUrl":"..."` and then got stuck looping the same key (`{"sourceUrl":{"sourceUrl":...`) until it was killed. Token-repetition mid-JSON. Cost $0.423 of nonsense.
  • Coding failure was a confidence/competence mismatch: valid JSON, confident explanation describing HMAC-SHA256, but the code field omitted the actual `createHmac` call. Described the algorithm without implementing it.
  • Both failure modes are dangerous because they look correct at a glance. The Coding output passes a syntax check; only the rubric's substring check (`createHmac("sha256", ...)`) caught it.
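
A check of this kind is cheap to write. The sketch below is a guess at its shape, not the benchmark's actual grader: the `explanation` and `code` field names, the `checkCodingOutput` helper, and the regex are all assumptions. It shows why a syntax check alone would pass an output like the failed Coding run while a substring check would not.

```typescript
// Hypothetical shape of the agent's structured output (field names are assumed).
interface CodingOutput {
  explanation: string;
  code: string;
}

interface RubricResult {
  validJson: boolean;   // the output parses as JSON
  hasHmacCall: boolean; // the code field contains createHmac("sha256", ...)
  passed: boolean;      // both conditions hold
}

// Matches createHmac("sha256", ...) with either quote style and optional whitespace.
const HMAC_CALL = /createHmac\(\s*["']sha256["']\s*,/;

function checkCodingOutput(raw: string): RubricResult {
  let parsed: unknown;
  try {
    parsed = JSON.parse(raw);
  } catch {
    return { validJson: false, hasHmacCall: false, passed: false };
  }
  const candidate = parsed as Partial<CodingOutput> | null;
  const code = candidate && typeof candidate.code === "string" ? candidate.code : "";
  // Only the code field is inspected: an explanation that merely
  // talks about HMAC-SHA256 does not count as an implementation.
  const hasHmacCall = HMAC_CALL.test(code);
  return { validJson: true, hasHmacCall, passed: hasHmacCall };
}
```

An output whose explanation mentions HMAC-SHA256 but whose code never calls `createHmac` comes back with `validJson: true` and `passed: false`, which is the confident-but-wrong case described above.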

Results by agent

Five real jobs.

Sales: Sales Research Analyst

Get financials on a company

Passed

Find Hightouch on Crunchbase and return their total funding, last round type, and a one-line description.

5 tool calls, $0.104. Standard Sales pass with appropriate verification.

$0.104, 5 tool calls

Marketing: Marketing Research Analyst

Find what customers are recommending on Reddit

Failed

Read a real r/sales thread on enrichment tools and rank the top 5 by how many people recommended them.

✗ 3 tool calls, $0.423. Token-repetition loop mid-JSON. The model never recovered; the run was killed after the repetition pattern was detected.
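
How the repetition pattern was detected is not spelled out here. One plausible approach, assuming the output is available as a growing string, is to test whether its tail is the same short unit repeated several times. `findRepeatedSuffix` and its thresholds are hypothetical, not the harness's real detector.

```typescript
// Returns the repeating unit if the end of `text` is the same substring
// repeated at least `minRepeats` times, otherwise null.
// All thresholds are illustrative assumptions, not values from the benchmark.
function findRepeatedSuffix(
  text: string,
  minRepeats = 4,
  minPeriod = 4,
  maxPeriod = 64,
): string | null {
  for (let period = minPeriod; period <= maxPeriod; period++) {
    const span = period * minRepeats;
    if (text.length < span) break; // longer periods need even more text
    const tail = text.slice(text.length - span);
    let periodic = true;
    // The tail is periodic if every character equals the one `period` back.
    for (let i = period; i < span; i++) {
      if (tail[i] !== tail[i - period]) {
        periodic = false;
        break;
      }
    }
    if (periodic) return tail.slice(0, period);
  }
  return null;
}

// Usage: call after each streamed chunk and abort the run on a hit.
const streamed = '{"items":[' + '{"sourceUrl":'.repeat(6);
const unit = findRepeatedSuffix(streamed);
if (unit !== null) {
  console.log(`repetition loop detected, repeating unit: ${unit}`);
}
```

A suffix check like this only catches exact repeats. Loops that vary slightly on each pass would need a fuzzier test, such as n-gram frequency over a sliding window.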

$0.423, 3 tool calls

CX

Passed

Pull Google Shopping reviews for AirPods Pro 2 (USB-C) and return the top 3 recurring complaints, with verbatim quotes.

3 tool calls, $0.333. Cleanest Kimi run — read reviews, grouped complaints, returned three verbatim quotes.

$0.333, 3 tool calls

Coding: Senior Engineer Assistant

Read API docs and write working code

Failed

Read Stripe's official docs and write a real, working webhook-verification function in TypeScript.

✗ 5 tool calls, $0.268. The dangerous one. Confident, schema-valid output. Read the Stripe docs, wrote a 3-sentence explanation of HMAC-SHA256, then in the code field omitted the `createHmac` call entirely. Described the algorithm without implementing it.
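
For contrast, here is a minimal sketch of what a passing answer needs to contain. It is neither the model's output nor the benchmark's reference solution. It follows the manual verification scheme Stripe documents: parse `t` and `v1` from the `Stripe-Signature` header, compute HMAC-SHA256 over `timestamp.payload` with the endpoint secret, and compare in constant time. The function name, the injectable `nowSec` parameter, and the 300-second tolerance are assumptions made for illustration. In production, Stripe's own library helper is the usual choice.

```typescript
import { createHmac, timingSafeEqual } from "node:crypto";

// Sketch of manual Stripe webhook signature verification.
// `payload` must be the raw request body, not a re-serialized object.
function verifyStripeSignature(
  payload: string,
  header: string,
  secret: string,
  toleranceSec = 300, // assumed replay window
  nowSec = Math.floor(Date.now() / 1000), // injectable for testing
): boolean {
  let timestamp = "";
  const candidates: string[] = [];
  // Header looks like: t=1700000000,v1=<hex>,v1=<hex>
  for (const part of header.split(",")) {
    const idx = part.indexOf("=");
    if (idx === -1) continue;
    const key = part.slice(0, idx).trim();
    const value = part.slice(idx + 1).trim();
    if (key === "t") timestamp = value;
    else if (key === "v1") candidates.push(value);
  }
  if (!/^\d+$/.test(timestamp) || candidates.length === 0) return false;
  // Reject events outside the tolerance window to limit replay attacks.
  if (Math.abs(nowSec - Number(timestamp)) > toleranceSec) return false;

  // The step the failed run left out: actually compute the HMAC.
  const expected = createHmac("sha256", secret)
    .update(`${timestamp}.${payload}`, "utf8")
    .digest();

  return candidates.some((sig) => {
    if (!/^[0-9a-f]{64}$/i.test(sig)) return false;
    const given = Buffer.from(sig, "hex");
    return given.length === expected.length && timingSafeEqual(given, expected);
  });
}
```

The constant-time comparison and the timestamp check are the parts most often skipped in hand-rolled versions. Both are cheap to keep.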

$0.268, 5 tool calls

Web Scraping: Sales Outreach Specialist

Scrape a competitor's pricing page

Passed

Scrape Apollo.io's pricing page and return every tier (Free, Basic, Professional, Organization) with name, price, and top 3 features.

3 tool calls, $0.145. Apollo tiers extracted; passable but the most expensive Apollo pass among the open-weights models in the matrix.

$0.144, 3 tool calls