Benchmarks
Coding·Senior Engineer Assistant

Read API docs and write working code

We asked 16 models to read Stripe's docs and write a real webhook-verifier function. 10 of them could do it.

10/16
models passed
$0.001
cheapest pass
$0.352
avg cost of passes
$0.667
costliest fail

Why this benchmark exists.

Writing a function from API docs is one of the most common engineering jobs in B2B integrations. Stripe's docs are excellent, the function fits in about 30 lines, and the right implementation is one of the most copy-pasted patterns in the JavaScript ecosystem. The wrong implementation silently lets unsigned requests through, so we graded for the actual security primitives, not just code that compiles.
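For reference, here is a minimal sketch of the pattern being graded, assuming Stripe's documented scheme: the `Stripe-Signature` header carries `t=<timestamp>` plus one or more `v1=<signature>` values, and each signature is an HMAC-SHA256 of `<timestamp>.<raw body>` keyed with the endpoint secret. The parameter names, the 300-second tolerance, and the injectable `nowSeconds` clock are illustrative assumptions, not any model's graded output.

```typescript
import { createHmac, timingSafeEqual } from "node:crypto";

// Hypothetical signature for illustration; the benchmark does not publish a reference one.
function verifyStripeWebhook(
  rawBody: string,
  signatureHeader: string,
  secret: string,
  toleranceSeconds = 300, // assumed tolerance window
  nowSeconds: number = Math.floor(Date.now() / 1000),
): boolean {
  // Header format: "t=<unix timestamp>,v1=<hex signature>[,v1=...]"
  let timestamp = "";
  const candidates: string[] = [];
  for (const part of signatureHeader.split(",")) {
    const idx = part.indexOf("=");
    if (idx === -1) continue;
    const key = part.slice(0, idx).trim();
    const value = part.slice(idx + 1).trim();
    if (key === "t") timestamp = value;
    if (key === "v1") candidates.push(value);
  }
  if (!/^\d+$/.test(timestamp) || candidates.length === 0) return false;

  // Reject events outside the tolerance window to limit replay attacks.
  if (Math.abs(nowSeconds - Number(timestamp)) > toleranceSeconds) return false;

  // The signed payload is "<timestamp>.<raw body>", HMAC-SHA256 with the endpoint secret.
  const expected = createHmac("sha256", secret)
    .update(`${timestamp}.${rawBody}`, "utf8")
    .digest();

  // Constant-time comparison. timingSafeEqual throws on a length mismatch, so check length first.
  return candidates.some((hex) => {
    const given = Buffer.from(hex, "hex");
    return given.length === expected.length && timingSafeEqual(given, expected);
  });
}
```

The body must be the raw request bytes, not re-serialized JSON, or the signature will not match. In production, most teams would call the official library's `stripe.webhooks.constructEvent` instead of hand-rolling this; the sketch only shows the kind of primitives a correct answer needs.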

What we asked it.

Read Stripe's official docs and write a real, working webhook-verification function in TypeScript.

System prompt

You are a senior engineer assistant. Read official API documentation and write production-grade integration code. ALWAYS verify behavior against the OFFICIAL docs; APIs change. Write idiomatic Node.js and include error handling and edge cases the docs mention.

Tools available

  • Google Search
  • Site Scraper

How it's graded

  • Did it actually read Stripe's docs?
  • Did it write a real, working function, not just describe one in prose?
  • Does the function use the right security primitives so it would actually run?
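The third check can be approximated with plain substring matching. The sketch below is hypothetical: the list of required names is an assumption, and the benchmark's real rubric strings are not published here.

```typescript
// Assumed primitive list for illustration only; not the benchmark's actual rubric.
const REQUIRED_PRIMITIVES = ["createHmac", "sha256", "timingSafeEqual"] as const;

// Returns the required names that never appear in the submitted code.
function missingPrimitives(code: string): string[] {
  return REQUIRED_PRIMITIVES.filter((name) => !code.includes(name));
}

// Prose that describes the algorithm without implementing it.
const proseOnly = "Compute an HMAC-SHA256 of the payload and compare in constant time.";
// Code that actually calls the primitives.
const realCode = 'createHmac("sha256", secret); timingSafeEqual(a, b);';

console.log(missingPrimitives(proseOnly)); // all three names are missing
console.log(missingPrimitives(realCode)); // nothing is missing
```

A check like this cannot tell code derived from the docs apart from code recalled from memory, and it would also accept names that appear only in comments, so it works best alongside the other two rubric items.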

What we saw.

We asked a senior engineer assistant agent to read Stripe's webhook docs and write a `verifyStripeWebhook` function in TypeScript. The rubric checks that it actually read the docs, wrote a real working function (not just described one in prose), and used the right security primitives so the function would actually run.

10 models passed. Six failed, and four of those were the most interesting failures in the whole matrix: one bailed before reading the docs, one planned the code in a private scratchpad and never wrote it, one broke the output format, and one wrote a confident explanation of the function but never implemented it. Of the other two, one retrieved the docs and stopped without writing code, and one never engaged with the task.

What worked.

  • GPT-5 Mini wrote the function in two tool calls for $0.037, the cheapest pass that actually read the docs. (Llama 4 Maverick's $0.001 pass used no tools and came from memorization.)
  • Every other model that passed read the docs first and then wrote the function. Sonnet's run was the most thorough (ten tool calls, full docs read), and the resulting function was also the most explicit. Among the passes, reading the docs and writing real code went together.

How it broke.

  • DeepSeek V3.1 bailed at 'I'll help you create a TypeScript function...' and then closed the run. One tool call, no code produced.
  • MiniMax M3 went deep into a private thinking block planning the implementation, but then closed the run without writing any code. 22 tool calls, no output.
  • GLM 5.1 wrote real-looking code, but the output format was broken (smart quotes inside the code field, unbalanced backtick fences). The function might have worked, but the schema parser could not read what came back.
  • Kimi K2.6 wrote a confident explanation describing the right security primitives, but the code field omitted the `createHmac` call. It described the algorithm correctly but didn't implement it.
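The GLM 5.1 failure is the kind a cheap pre-parse lint can surface. The sketch below is a hypothetical check, not the benchmark's schema parser; the function name and the two rules are assumptions chosen to mirror the symptoms described above.

```typescript
// Hypothetical lint for a returned code string; not the benchmark's actual parser.
function findFormatProblems(code: string): string[] {
  const problems: string[] = [];

  // Typographic quotes are not valid string delimiters in TypeScript.
  if (/[\u2018\u2019\u201C\u201D]/.test(code)) {
    problems.push("smart quotes");
  }

  // Triple-backtick fences must come in open/close pairs.
  const fence = "`".repeat(3);
  const fenceCount = code.split(fence).length - 1;
  if (fenceCount % 2 !== 0) {
    problems.push("unbalanced code fences");
  }

  return problems;
}

console.log(findFormatProblems("const sig = \u201Cv1\u201D;")); // reports smart quotes
```

Running a lint like this before schema validation would at least turn an unreadable envelope into a specific, reportable error.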

Results by model

16 models, ranked.

Passes first, sorted cheap → expensive. Failures last, sorted by how much budget they burned producing nothing.
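The ordering rule can be written as a single comparator. This is a sketch of the rule as stated, using a hypothetical `Result` shape and a few rows from this page as sample data.

```typescript
// Hypothetical record shape; the field names are assumptions.
interface Result {
  model: string;
  costUsd: number;
  passed: boolean;
}

// Passes first (cheapest first), then failures (most expensive first).
function rankResults(results: Result[]): Result[] {
  return [...results].sort((a, b) => {
    if (a.passed !== b.passed) return a.passed ? -1 : 1;
    return a.passed ? a.costUsd - b.costUsd : b.costUsd - a.costUsd;
  });
}

const sample: Result[] = [
  { model: "GLM 5.1", costUsd: 0.667, passed: false },
  { model: "Claude Sonnet 4.6", costUsd: 1.11, passed: true },
  { model: "Grok 4.1 Fast", costUsd: 0.0, passed: false },
  { model: "Llama 4 Maverick", costUsd: 0.001, passed: true },
  { model: "GPT-5 Mini", costUsd: 0.037, passed: true },
];

console.log(rankResults(sample).map((r) => r.model));
// Llama 4 Maverick, GPT-5 Mini, Claude Sonnet 4.6, GLM 5.1, Grok 4.1 Fast
```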

Llama 4 Maverick
$0.001 · 0 tools · Passed

0 tool calls, $0.001 — technically passed the rubric from training-data memorization. The Stripe webhook pattern is in Llama's training corpus verbatim, so the memorized version happens to satisfy the substring checks. Don't read this as a model strength; read it as the false-positive the matrix exists to catch.

GPT-5 Mini
$0.037 · 2 tools · Passed

2 tool calls, $0.037 — cheapest correct Stripe pass in the matrix.

Mistral Large 3
$0.071 · 4 tools · Passed

4 tool calls, $0.071. Read the docs, wrote the function, schema-clean. The right-sized Stripe answer.

Gemini 3.5 Flash
$0.486 · 36 tools · Passed

36 tool calls, $0.487. 3.5 Flash decided to read every Stripe docs page it could find. The answer was correct; you paid for thoroughness the task didn't ask for.

Claude Sonnet 4.6
$1.11 · 10 tools · Passed

10 tool calls, $1.11. Most thorough Stripe docs read in the matrix; resulting code is also the most explicit.

GLM 5.1
$0.667 · 11 tools · Failed

✗ 11 tool calls, $0.667. Visible attempt at the right answer, but JSON had broken escaping — smart quotes and unbalanced backticks inside the code string. Schema parser couldn't read the envelope.

MiniMax M3
$0.508 · 22 tools · Failed

✗ 22 tool calls, $0.508 — most expensive M3 failure. Same thinking-block-without-output pattern as CX: read docs, planned in a `<think>` block, closed the run before writing code.

Kimi K2.6
$0.268 · 5 tools · Failed

✗ 5 tool calls, $0.268. The dangerous failure: confident, schema-valid output describing HMAC-SHA256 in the explanation field — and the code field omitted `createHmac`. Described the algorithm; didn't implement it.

Qwen 3.6 Plus
$0.251 · 13 tools · Failed

✗ 13 tool calls, $0.251. Same as Qwen's other failures: engaged, retrieved, stopped without writing code.

DeepSeek V3.1
$0.001 · 1 tool · Failed

✗ 1 tool call, $0.0005. "I'll help you create a TypeScript function..." — and stopped. DeepSeek's polite-helper failure mode.

Grok 4.1 Fast
$0.000 · 0 tools · Failed

✗ 0 tool calls, $0.00. Grok didn't engage with the task at all.