moonshot

Kimi K2.6

Confident prose, missing code. Token-repetition loop on Reddit. Read the actual outputs before you trust this one.

3/5

benchmarks passed

$1.27

spent in total

19

tool calls

What it is.

Moonshot AI's Kimi K2.6 — a Chinese long-context model with reasoning capabilities. We tested the production endpoint as of November 2026.

What it does well.

Sales, CX, and Web Scraping all passed cleanly. When Kimi K2.6 stays in its lane, it's a perfectly competent agent.
CX pass was the cleanest Kimi run: $0.333, 3 tool calls, clean grouping of complaints.

How it broke.

Reddit failure was a decoder pathology: the model emitted `{"sourceUrl":"..."` and then got stuck looping the same key (`{"sourceUrl":{"sourceUrl":...`) until it was killed. Token-repetition mid-JSON. The run spent $0.423 on nonsense.
Coding failure was a confidence/competence mismatch: valid JSON, confident explanation describing HMAC-SHA256, but the code field omitted the actual `createHmac` call. Described the algorithm without implementing it.
Both failure modes are dangerous because they look correct at a glance. The Coding output passes a syntax check; only the rubric's substring check (`createHmac("sha256", ...)`) caught it.
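A substring check of this kind can be sketched in a few lines. This is an illustration only: the function name `implementsHmac` and the exact pattern are assumptions, not the rubric actually used in the benchmark.

```typescript
// Hypothetical rubric check: does the submitted code field actually call
// createHmac with "sha256"? The pattern tolerates either quote style and
// optional whitespace. It is an assumption, not the benchmark's real rubric.
const HMAC_CALL = /createHmac\(\s*["'`]sha256["'`]\s*,/;

function implementsHmac(codeField: string): boolean {
  return HMAC_CALL.test(codeField);
}

// A code field that only talks about the algorithm fails the check.
const describedOnly = "// compute an HMAC-SHA256 over the payload\nreturn true;";
// A code field that performs the call passes it.
const implemented =
  'const sig = createHmac("sha256", secret).update(body).digest("hex");';

console.log(implementsHmac(describedOnly)); // false
console.log(implementsHmac(implemented)); // true
```

A check like this is deliberately crude. It cannot tell whether the HMAC is used correctly, only that the call is present, which is enough to separate "described" from "implemented".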

Results by agent

Five real jobs.

Sales

Sales Research Analyst

Get financials on a company

Passed

Find Hightouch on Crunchbase and return their total funding, last round type, and a one-line description.

5 tool calls, $0.104. Standard Sales pass with appropriate verification.

Marketing

Marketing Research Analyst

Find what customers are recommending on Reddit

Failed

Read a real r/sales thread on enrichment tools and rank the top 5 by how many people recommended them.

✗ 3 tool calls, $0.423. Token-repetition loop mid-JSON. The model never recovered; the run was killed after the repetition pattern was detected.
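The write-up does not say how the repetition pattern was detected. One plausible approach, sketched here under that assumption, is to check whether the tail of the streamed output is the same short unit repeated several times in a row. The function name `findRepeatedTail` and both thresholds are hypothetical.

```typescript
// Hypothetical guard for streamed output: returns the repeating unit if the
// text ends with the same substring repeated at least `minRepeats` times,
// otherwise null. Unit lengths from 4 to `maxUnit` characters are tried.
function findRepeatedTail(
  text: string,
  maxUnit = 64,
  minRepeats = 4,
): string | null {
  for (let unit = 4; unit <= maxUnit; unit++) {
    // Not enough text left to hold `minRepeats` copies of a unit this long.
    if (text.length < unit * minRepeats) break;
    const pattern = text.slice(text.length - unit);
    let repeats = 1;
    let pos = text.length - 2 * unit;
    // Walk backwards one unit at a time while the text keeps matching.
    while (pos >= 0 && text.slice(pos, pos + unit) === pattern) {
      repeats++;
      pos -= unit;
    }
    if (repeats >= minRepeats) return pattern;
  }
  return null;
}

// The looping shape described above: the same key nested over and over.
const looping = '{"sourceUrl":'.repeat(5);
console.log(findRepeatedTail(looping)); // {"sourceUrl":

// Well-formed output has no repeated tail.
console.log(findRepeatedTail('{"sourceUrl":"https://example.com/a","title":"b"}')); // null
```

Run against each chunk as it arrives, a guard like this lets a harness abort early instead of paying for tokens until a length limit is hit. The thresholds trade false positives (legitimately repetitive output) against wasted spend.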

CX

Customer Insights Analyst

Find what customers are complaining about

Passed

Pull Google Shopping reviews for AirPods Pro 2 (USB-C) and return the top 3 recurring complaints, with verbatim quotes.

3 tool calls, $0.333. Cleanest Kimi run — read reviews, grouped complaints, returned three verbatim quotes.

Coding

Senior Engineer Assistant

Read API docs and write working code

Failed

Read Stripe's official docs and write a real, working webhook-verification function in TypeScript.

✗ 5 tool calls, $0.268. The dangerous one. Confident, schema-valid output. Read the Stripe docs, wrote a 3-sentence explanation of HMAC-SHA256, then in the code field omitted the `createHmac` call entirely. Described the algorithm without implementing it.
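For reference, this is roughly what an implementation that contains the missing call looks like. It is a minimal sketch, not the benchmark's reference answer: it assumes Stripe's documented scheme (a `Stripe-Signature` header of the form `t=<timestamp>,v1=<hex signature>`, signed payload `<timestamp>.<raw body>`, HMAC-SHA256 keyed with the endpoint secret) and uses only Node's built-in `crypto` module. The function name and the injectable `nowSeconds` parameter are illustrative choices.

```typescript
import { createHmac, timingSafeEqual } from "node:crypto";

// Sketch of Stripe-style webhook verification. `payload` must be the raw
// request body, byte for byte, not a re-serialized JSON object.
function verifyStripeSignature(
  payload: string,
  header: string,
  secret: string,
  toleranceSeconds = 300,
  nowSeconds = Math.floor(Date.now() / 1000), // injectable for tests
): boolean {
  let timestamp = "";
  const candidates: string[] = [];
  // Parse "t=...,v1=...,v1=..." into a timestamp and candidate signatures.
  for (const part of header.split(",")) {
    const idx = part.indexOf("=");
    if (idx === -1) continue;
    const key = part.slice(0, idx).trim();
    const value = part.slice(idx + 1).trim();
    if (key === "t") timestamp = value;
    else if (key === "v1") candidates.push(value);
  }
  if (!/^\d+$/.test(timestamp) || candidates.length === 0) return false;

  // Reject events outside the tolerance window to limit replay attacks.
  if (Math.abs(nowSeconds - Number(timestamp)) > toleranceSeconds) return false;

  // The step the failed run described but never wrote.
  const expected = createHmac("sha256", secret)
    .update(`${timestamp}.${payload}`, "utf8")
    .digest();

  // Constant-time comparison; timingSafeEqual throws on length mismatch,
  // so lengths are checked first.
  return candidates.some((hex) => {
    const given = Buffer.from(hex, "hex");
    return given.length === expected.length && timingSafeEqual(given, expected);
  });
}
```

In production code the official `stripe` library's `stripe.webhooks.constructEvent` is the usual choice. The hand-rolled version is shown only because the task asked for the verification logic itself.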

Web Scraping

Sales Outreach Specialist

Scrape a competitor's pricing page

Passed

Scrape Apollo.io's pricing page and return every tier (Free, Basic, Professional, Organization) with name, price, and top 3 features.

3 tool calls, $0.145. Apollo tiers extracted; passable but the most expensive Apollo pass among the open-weights models in the matrix.
