alibaba

Qwen 3.6 Plus

0/5. Burned $1.91 on Crunchbase alone to produce zero parseable output. The worst run in the matrix.

0/5 benchmarks passed
$2.98 spent in total
55 tool calls

What it is.

Alibaba's Qwen 3.6 Plus — the largest dense Qwen variant in the November 2026 lineup, pitched as a frontier reasoning option from the Chinese open-weights ecosystem.

What it does well.

  • Nothing to report on this rubric. Every benchmark ended with no parseable JSON.
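The harness behind this rubric is not described here, so the following is only a minimal sketch of what a "parseable JSON" check could look like. The function name `is_parseable_json` is hypothetical, and the real pass criterion may be stricter (for example, validating specific fields).

```python
import json


def is_parseable_json(final_message: str) -> bool:
    """Return True if the final message is non-empty and parses as JSON.

    Hypothetical helper: an assumption about how a harness might
    define "parseable", not the benchmark's actual check.
    """
    # An empty or whitespace-only final message fails outright.
    if not final_message or not final_message.strip():
        return False
    try:
        json.loads(final_message)
    except json.JSONDecodeError:
        return False
    return True
```

Under this sketch, an empty final message fails the check no matter how many tool calls preceded it.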

How it broke.

  • The Crunchbase run cost $1.91 across 9 tool calls and produced no output. This is what "burned for nothing" looks like: almost all of the cost went to LLM thinking tokens, and none of it produced an answer.
  • Reddit, CX, Coding, and Web Scraping all ended with empty final messages after anywhere from 1 to 22 tool calls. The model engaged with the agent loop but never wrote the answer.
  • This is the tool-loop pathology at its most expensive. Total spend was $2.98, the second-highest of any model in the matrix, for the worst pass rate.
  • We tested the production endpoint as of November 2026. If Qwen updates the weights or the serving stack, this profile may stop being accurate; rerun the benchmarks before basing any decision on it.
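As a quick arithmetic check on the totals quoted above, the per-run figures from this profile can be summed directly. This is a small sketch; the variable names are illustrative, and the numbers are copied from the five results reported on this page.

```python
# Per-run figures copied from the results in this profile:
# (benchmark, cost in USD, tool calls)
runs = [
    ("Crunchbase", 1.91, 9),
    ("Reddit", 0.399, 22),
    ("CX", 0.229, 10),
    ("Coding", 0.251, 13),
    ("Web Scraping", 0.188, 1),
]

total_cost = sum(cost for _, cost, _ in runs)
total_calls = sum(calls for _, _, calls in runs)

# Fraction of total spend consumed by the Crunchbase run alone.
crunchbase_share = runs[0][1] / total_cost

print(f"total spend: ${total_cost:.2f}")
print(f"total tool calls: {total_calls}")
print(f"Crunchbase share of spend: {crunchbase_share:.0%}")
```

The sums come to $2.98 and 55 tool calls, matching the headline figures, and the Crunchbase run accounts for roughly 64% of the total spend.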

Results by agent

Five real jobs.

Sales: Sales Research Analyst

Get financials on a company

Failed

Find Hightouch on Crunchbase and return their total funding, last round type, and a one-line description.

✗ 9 tool calls, $1.91. The single most expensive failure in the matrix. Almost all of the budget was thinking tokens; the model never produced parseable JSON.

$1.91, 9 tool calls

Marketing: Marketing Research Analyst

Find what customers are recommending on Reddit

Failed

Read a real r/sales thread on enrichment tools and rank the top 5 by how many people recommended them.

✗ 22 tool calls, $0.399. Most tool calls of any benchmark run in the matrix. None of them led to an answer.

$0.399, 22 tool calls

Failed

Pull Google Shopping reviews for AirPods Pro 2 (USB-C) and return the top 3 recurring complaints, with verbatim quotes.

✗ 10 tool calls, $0.229. Tools called, responses read, no final message.

$0.229, 10 tool calls

Coding: Senior Engineer Assistant

Read API docs and write working code

Failed

Read Stripe's official docs and write a real, working webhook-verification function in TypeScript.

✗ 13 tool calls, $0.251. Same pattern — engaged, retrieved, stopped without writing code.

$0.251, 13 tool calls

Web Scraping: Sales Outreach Specialist

Scrape a competitor's pricing page

Failed

Scrape Apollo.io's pricing page and return every tier (Free, Basic, Professional, Organization) with name, price, and top 3 features.

✗ 1 tool call, $0.188. The cheapest Qwen failure: one tool call, no JSON.

$0.188, 1 tool call