Benchmarks

deepseek

DeepSeek V3.1

Cheapest model in the matrix. Also the most likely to bail at "let me search for…" without ever calling a tool.

3/5 benchmarks passed
$0.113 spent in total
12 tool calls

What it is.

DeepSeek's V3.1 is the Chinese open-weights frontier model that defines the cost floor for serious reasoning. We tested it via DeepSeek's first-party API.

What it does well.

  • Total spend was $0.113 — cheaper than every other model in the matrix by a wide margin.
  • The three tasks it actually engaged with all passed cleanly under $0.06 each.
  • Reddit was $0.046 — second-cheapest Marketing pass after DeepSeek's own willingness to skip verification on Sales.

How it broke.

  • Bailed twice. On both Sales and Coding, DeepSeek wrote a one-line intent statement ("I'll help you research..." / "Let me create a TypeScript function...") then stopped — no tool calls on Sales, exactly one tool call on Coding. Total damage: $0.001 to produce nothing.
  • This is a known V3.1 failure mode when the system prompt has a verbose persona block. The model treats the agent loop like a chat opener and waits for the user to confirm.
  • When it does engage, it's excellent. When it doesn't, you get a polite preamble and a closed run. You need supervisor logic to retry these in production.
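What that supervisor logic might look like, as a minimal sketch rather than the harness we ran: treat any run that closes without schema-valid output as a bail, then retry with an explicit instruction not to wait for confirmation. `RunResult`, `run_agent`, and the nudge wording are all hypothetical names made up for this illustration. Note that the check keys on the missing parsed output, not on a zero tool-call count, because the Coding bail did make one tool call.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class RunResult:
    """Hypothetical summary of one agent run."""
    final_text: str
    tool_calls: int
    parsed_output: Optional[dict]  # None when the schema parser had nothing to parse


# Assumed wording; tune for your own system prompt.
NUDGE = (
    "\n\nDo not ask for confirmation. "
    "Call the tools you need now and return the final structured answer."
)


def looks_like_bail(result: RunResult) -> bool:
    """Heuristic: the run ended without any schema-valid output."""
    return result.parsed_output is None


def run_with_retry(
    run_agent: Callable[[str], RunResult],
    task: str,
    max_attempts: int = 3,
) -> RunResult:
    """Run the task once, then retry with a nudge while the run looks like a bail."""
    result = run_agent(task)
    for _ in range(max_attempts - 1):
        if not looks_like_bail(result):
            break
        result = run_agent(task + NUDGE)
    return result
```

At the bail costs above (about $0.0005 per attempt), a couple of retries add very little to the bill. If every attempt bails, the function returns the last bailed result, so the caller still has to check it.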

Results by agent

Five real jobs.

Sales
Sales Research Analyst

Get financials on a company

Failed

Find Hightouch on Crunchbase and return their total funding, last round type, and a one-line description.

✗ Zero tool calls, $0.0005. "Let me search for Hightouch..." — then nothing. Schema parser had nothing to parse.

$0.001, 0 tool calls

Marketing
Marketing Research Analyst

Find what customers are recommending on Reddit

Passed

Read a real r/sales thread on enrichment tools and rank the top 5 by how many people recommended them.

5 tool calls, $0.046 — the model's best run. Read Reddit, counted mentions, returned the 5 tools.

$0.045, 5 tool calls

Passed

Pull Google Shopping reviews for AirPods Pro 2 (USB-C) and return the top 3 recurring complaints, with verbatim quotes.

4 tool calls, $0.053. Clean pass with verbatim quotes.

$0.053, 4 tool calls

Coding
Senior Engineer Assistant

Read API docs and write working code

Failed

Read Stripe's official docs and write a real, working webhook-verification function in TypeScript.

✗ 1 tool call, $0.0005. One Google search, then bailed. "I'll help you create a TypeScript function..." — and stopped.

$0.001, 1 tool call

Web Scraping
Sales Outreach Specialist

Scrape a competitor's pricing page

Passed

Scrape Apollo.io's pricing page and return every tier (Free, Basic, Professional, Organization) with name, price, and top 3 features.

2 tool calls, $0.014. Cheapest Apollo pass in the matrix.

$0.014, 2 tool calls