Alibaba

Qwen 3.6 Plus

0/5. Burned $1.91 on Crunchbase alone to produce zero parseable output. The worst run in the matrix.

0/5

benchmarks passed

$2.98

spent in total

55

tool calls

What it is.

Alibaba's Qwen 3.6 Plus — the largest dense Qwen variant in the November 2026 lineup, pitched as a frontier reasoning option from the Chinese open-weights ecosystem.

What it does well.

Nothing to report on this rubric. Every benchmark ended with no parseable JSON.

How it broke.

The Crunchbase run cost $1.91 across 9 tool calls and produced no output. This is what "burned for nothing" looks like: almost all of the cost went to thinking tokens, and none of it produced an answer.
The Reddit, CX, Coding, and Web Scraping runs all ended with empty final messages after anywhere from 1 to 22 tool calls. The model engaged with the agent loop, then never wrote the answer.
This is the tool-loop pathology at its most expensive. Total spend was $2.98, the second-highest of any model in the matrix, for the worst pass rate.
We tested the production endpoint as of November 2026. If Qwen updates the weights or the serving stack, this profile may stop being accurate; rerun before you make any decisions on it.
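To make "no parseable JSON" and "empty final message" concrete, here is a minimal sketch of the kind of check a grader might apply to a run's final message. The function names and the rule that the answer must be a JSON object or array are assumptions for illustration; the actual harness behind these results is not described here and may differ.

```python
import json
from typing import Any, Optional


def parse_final_message(final_message: Optional[str]) -> Optional[Any]:
    """Return the decoded JSON value, or None if the message is empty or invalid."""
    if final_message is None or not final_message.strip():
        # Empty final message: the model ran tools but never wrote an answer.
        return None
    try:
        return json.loads(final_message)
    except json.JSONDecodeError:
        # The model wrote something, but it is not valid JSON.
        return None


def passes_parse_gate(final_message: Optional[str]) -> bool:
    """Hypothetical gate: the answer must decode to a JSON object or array."""
    return isinstance(parse_final_message(final_message), (dict, list))


if __name__ == "__main__":
    print(passes_parse_gate(""))               # empty final message
    print(passes_parse_gate("Let me think."))  # prose instead of JSON
    print(passes_parse_gate('{"tiers": []}'))  # placeholder JSON object
```

Under this sketch, a run fails the gate regardless of how many tool calls preceded it, which matches the pattern above: tool activity without a final, decodable answer counts as no output.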

Results by agent

Five real jobs.

Sales

Sales Research Analyst

Get financials on a company

Failed

Find Hightouch on Crunchbase and return their total funding, last round type, and a one-line description.

✗ 9 tool calls, $1.91. The single most expensive failure in the matrix. Almost all of the budget was thinking tokens; the model never produced parseable JSON.

Marketing

Marketing Research Analyst

Find what customers are recommending on Reddit

Failed

Read a real r/sales thread on enrichment tools and rank the top 5 by how many people recommended them.

✗ 22 tool calls, $0.399. Most tool calls of any benchmark run in the matrix. None of them led to an answer.

CX

Customer Insights Analyst

Find what customers are complaining about

Failed

Pull Google Shopping reviews for AirPods Pro 2 (USB-C) and return the top 3 recurring complaints, with verbatim quotes.

✗ 10 tool calls, $0.229. Tools called, responses read, no final message.

Coding

Senior Engineer Assistant

Read API docs and write working code

Failed

Read Stripe's official docs and write a real, working webhook-verification function in TypeScript.

✗ 13 tool calls, $0.251. Same pattern — engaged, retrieved, stopped without writing code.

Web Scraping

Sales Outreach Specialist

Scrape a competitor's pricing page

Failed

Scrape Apollo.io's pricing page and return every tier (Free, Basic, Professional, Organization) with name, price, and top 3 features.

✗ 1 tool call, $0.188. The cheapest Qwen failure: one tool call, no JSON.
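The per-run figures above can be cross-checked against the headline total with a few lines of arithmetic. This is only a sketch: the tuple layout and variable names are illustrative, and the numbers are copied from the five results in this section.

```python
from decimal import Decimal

# (benchmark, tool calls, cost in USD), as reported for each run above.
RUNS = [
    ("Crunchbase", 9, Decimal("1.91")),
    ("Reddit", 22, Decimal("0.399")),
    ("CX", 10, Decimal("0.229")),
    ("Coding", 13, Decimal("0.251")),
    ("Web Scraping", 1, Decimal("0.188")),
]

# Decimal avoids floating-point drift when summing currency amounts.
total_cost = sum((cost for _, _, cost in RUNS), Decimal("0"))
total_calls = sum(calls for _, calls, _ in RUNS)
crunchbase_share = RUNS[0][2] / total_cost

print(f"total spend: ${total_cost.quantize(Decimal('0.01'))}")
print(f"total tool calls: {total_calls}")
print(f"Crunchbase share of spend: {crunchbase_share:.0%}")
```

The per-run costs sum to $2.977, which rounds to the reported $2.98, and the tool calls sum to 55. The Crunchbase run alone accounts for roughly 64% of the spend.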
