Benchmarks
G

zhipu

GLM 5.1

Smart quotes in Stripe code. Empty final on Reddit. JSON encoding is GLM 5.1's nemesis.

3/5
benchmarks passed
$1.19
spent in total
27
tool calls

What it is.

Zhipu AI's GLM 5.1 — a Chinese reasoning model with open weights. We tested the production endpoint as of November 2026.

What it does well.

  • Sales pass was $0.048 — the second-cheapest Sales pass in the matrix after Gemini 3 Flash.
  • CX and Web Scraping passed without drama. When GLM 5.1 commits to writing a final answer, it writes one.

How it broke.

  • Reddit failure was the worst kind: empty final message after 7 tool calls. The model retrieved everything correctly and then closed the run without writing anything. $0 spent on inference is misleading — you still paid for the tools.
  • Stripe failure was a tokenizer/encoding bug: GLM 5.1 emitted the JSON with smart quotes inside the code field and unbalanced backtick template-literal fences. Output was clearly an attempt at the right answer; encoding broke it.
  • Both failure modes are decoder-level, not reasoning-level. This is the kind of model where you wrap the output in a strict-JSON repair step before the schema parser sees it.

Results by agent

Five real jobs.

SalesSales Research Analyst

Get financials on a company

Passed

Find Hightouch on Crunchbase and return their total funding, last round type, and a one-line description.

3 tool calls, $0.048. One of the cheapest Sales passes in the matrix.

$0.0483 tool calls
MarketingMarketing Research Analyst

Find what customers are recommending on Reddit

Failed

Read a real r/sales thread on enrichment tools and rank the top 5 by how many people recommended them.

✗ 7 tool calls, $0.000. Empty final message — the model read Reddit, then ran out of agent-loop budget without ever writing the JSON. No characters returned; no rubric to grade.

$0.0007 tool calls
Passed

Pull Google Shopping reviews for AirPods Pro 2 (USB-C) and return the top 3 recurring complaints, with verbatim quotes.

4 tool calls, $0.420. Solid CX pass with three verbatim complaints.

$0.4204 tool calls
CodingSenior Engineer Assistant

Read API docs and write working code

Failed

Read Stripe's official docs and write a real, working webhook-verification function in TypeScript.

✗ 11 tool calls, $0.667. Visible attempt at the right answer, but the JSON had broken escaping — smart quotes inside the code string, unbalanced backtick template-literal fences. Schema parser rejected the envelope before grading the code.

$0.66711 tool calls
Web ScrapingSales Outreach Specialist

Scrape a competitor's pricing page

Passed

Scrape Apollo.io's pricing page and return every tier (Free, Basic, Professional, Organization) with name, price, and top 3 features.

2 tool calls, $0.053. Clean Apollo pass.

$0.0522 tool calls