0 tool calls, $0.001 — technically passed the rubric from training-data memorization. The Stripe webhook pattern is in Llama's training corpus verbatim, so the memorized version happens to satisfy the substring checks. Don't read this as a model strength; read it as the false-positive the matrix exists to catch.
Read API docs and write working code
We asked an AI to read Stripe's docs and write a real verifier function. 10 of 16 models could do it.
Why this benchmark exists.
Writing a function from API docs is the most common engineering job in B2B integrations. Stripe's docs are excellent, the function fits in about 30 lines, and the right implementation is one of the most copy-pasted patterns in the JavaScript ecosystem. The wrong implementation silently lets unsigned requests through, so we graded for the actual security primitives, not just code that compiles.
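For orientation, here is a minimal sketch of the kind of function the task is after, written against Node's built-in `crypto` module rather than the official Stripe SDK. It assumes the scheme Stripe documents for webhook signatures: a `Stripe-Signature` header carrying `t=` and `v1=` entries, and an HMAC-SHA256 over `<timestamp>.<raw body>`. The parameter names, the 300-second tolerance, and the injectable `nowSeconds` clock are illustrative choices, not part of the benchmark.

```typescript
import { createHmac, timingSafeEqual } from "node:crypto";

// Assumption: reject events whose timestamp is more than 5 minutes from now.
const DEFAULT_TOLERANCE_SECONDS = 300;

function verifyStripeWebhook(
  rawBody: string, // the unparsed request body, exactly as received
  signatureHeader: string, // value of the Stripe-Signature header
  secret: string, // the endpoint's signing secret
  toleranceSeconds: number = DEFAULT_TOLERANCE_SECONDS,
  nowSeconds: number = Math.floor(Date.now() / 1000), // injectable for tests
): boolean {
  // The header looks like "t=1700000000,v1=<hex>,v1=<hex>".
  let timestamp = "";
  const candidates: string[] = [];
  for (const part of signatureHeader.split(",")) {
    const idx = part.indexOf("=");
    if (idx === -1) continue;
    const key = part.slice(0, idx).trim();
    const value = part.slice(idx + 1).trim();
    if (key === "t") timestamp = value;
    if (key === "v1") candidates.push(value);
  }
  if (!/^\d+$/.test(timestamp) || candidates.length === 0) return false;

  // Replay protection: refuse timestamps outside the tolerance window.
  if (Math.abs(nowSeconds - Number(timestamp)) > toleranceSeconds) return false;

  // HMAC-SHA256 over "<timestamp>.<raw body>", keyed with the signing secret.
  const expected = createHmac("sha256", secret)
    .update(`${timestamp}.${rawBody}`, "utf8")
    .digest();

  // Constant-time comparison; timingSafeEqual throws on unequal lengths,
  // so the length check has to come first.
  return candidates.some((hex) => {
    const given = Buffer.from(hex, "hex");
    return given.length === expected.length && timingSafeEqual(given, expected);
  });
}
```

The two easy mistakes are verifying a re-serialized JSON body instead of the raw bytes, and comparing signatures with `===` instead of a constant-time comparison. Both produce code that compiles and looks plausible.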
What we asked it.
Read Stripe's official docs and write a real, working webhook-verification function in TypeScript.
System prompt
You are a senior engineer assistant. Read official API documentation and write production-grade integration code. ALWAYS verify behavior against the OFFICIAL docs; APIs change. Write idiomatic Node.js and include error handling and edge cases the docs mention.
Tools available
- Google Search
- Site Scraper
How it's graded
- Did it actually read Stripe's docs?
- Did it write a real, working function, not just describe one in prose?
- Does the function use the right security primitives so it would actually run?
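The actual grader is not shown here, so the following is only a hypothetical sketch of how the three questions above could be turned into checks. The function name, the `RubricReport` shape, the URL test, and the list of required substrings are all assumptions made for illustration.

```typescript
// Hypothetical report shape: one boolean per rubric question.
interface RubricReport {
  readDocs: boolean;
  wroteCode: boolean;
  usesPrimitives: boolean;
}

// Assumed substrings standing in for "the right security primitives".
const REQUIRED_SUBSTRINGS = ["createHmac", "sha256", "timingSafeEqual"];

function gradeRun(visitedUrls: string[], code: string): RubricReport {
  return {
    // Did any tool call land on a Stripe page? (crude substring check)
    readDocs: visitedUrls.some((url) => url.includes("stripe.com")),
    // Is there something function-shaped in the code field, not just prose?
    wroteCode: /\bfunction\b|=>/.test(code),
    // Are all required primitives mentioned in the code?
    usesPrimitives: REQUIRED_SUBSTRINGS.every((s) => code.includes(s)),
  };
}

// A memorized answer produced with zero tool calls still satisfies
// the two code-level checks.
const memorized =
  'function verify(b, h, s) { return timingSafeEqual(createHmac("sha256", s).update(b).digest(), h); }';
console.log(gradeRun([], memorized));
```

Checks like these look only at text, which is why a run that never opened the docs can still satisfy the code-level criteria. Reporting each criterion separately keeps that case visible.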
What we saw.
We asked a senior engineer assistant agent to read Stripe's webhook docs and write a `verifyStripeWebhook` function in TypeScript. The rubric checks that it actually read the docs, wrote a real working function (not just described one in prose), and used the right security primitives so the function would actually run.
10 models passed. The six failures include some of the most interesting in the whole matrix: one bailed before reading the docs, one thought about the code in a private scratchpad and never wrote it, one broke the output format, and one wrote a confident explanation of the function but forgot to actually write the function itself.
What worked.
- GPT-5 Mini wrote the function in two tool calls for $0.037, the cheapest correct answer in the matrix.
- Setting aside the one memorized pass (zero tool calls), every model that passed read the docs first and then wrote the function. Sonnet's run was the most thorough (ten tool calls, full docs read), and the resulting function was also the most explicit. In this matrix, actually reading the docs and writing real code went together.
How it broke.
- DeepSeek bailed at 'I'll help you create a TypeScript function...' and then closed the run. One tool call, no code produced.
- MiniMax M3 went deep into a private thinking block planning the implementation, but then closed the run without writing any code. 22 tool calls, no output.
- GLM 5.1 wrote real-looking code, but the output format was broken (smart quotes inside the code field, unbalanced backtick fences). The function might have worked, but the schema parser could not read what came back.
- Kimi 2.6 wrote a confident explanation describing the right security primitives, but the code field omitted the actual `createHmac` call. It described the algorithm correctly, but didn't implement it.
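Several of the failure modes above can be detected mechanically. The sketch below is a hypothetical diagnostic, not the benchmark's own parser. It assumes the model returns a JSON envelope with `explanation` and `code` fields, and the failure labels are invented for illustration.

```typescript
// Hypothetical failure labels, one per mechanical check.
type Failure =
  | "unparseable"
  | "no_code"
  | "smart_quotes"
  | "unbalanced_fences"
  | "missing_hmac";

const FENCE = "`".repeat(3); // a literal triple backtick

function diagnoseEnvelope(raw: string): Failure[] {
  let parsed: unknown;
  try {
    parsed = JSON.parse(raw);
  } catch {
    return ["unparseable"]; // broken escaping lands here
  }
  if (typeof parsed !== "object" || parsed === null) return ["unparseable"];

  const code = (parsed as { code?: unknown }).code;
  if (typeof code !== "string" || code.trim() === "") return ["no_code"];

  const failures: Failure[] = [];
  // Curly quotes are not valid string delimiters in TypeScript source.
  if (/[\u2018\u2019\u201C\u201D]/.test(code)) failures.push("smart_quotes");
  // An odd number of fences means one was opened and never closed.
  if ((code.split(FENCE).length - 1) % 2 !== 0) failures.push("unbalanced_fences");
  // The explanation can mention HMAC-SHA256 while the code never calls it.
  if (!code.includes("createHmac")) failures.push("missing_hmac");
  return failures;
}
```

The last check looks only at the `code` field on purpose. A schema-valid answer whose explanation is correct and whose code lacks the HMAC call is the case a prose-level review is most likely to miss.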
Results by model
16 models, ranked.
Passes first, sorted cheap → expensive. Failures last, sorted by how much budget they burned producing nothing.
2 tool calls, $0.037 — cheapest correct Stripe pass in the matrix.
4 tool calls, $0.071. Read the docs, wrote the function, schema-clean. The right-sized Stripe answer.
36 tool calls, $0.487. 3.5 Flash decided to read every Stripe docs page it could find. The answer was correct; you paid for thoroughness the task didn't ask for.
10 tool calls, $1.11. Most thorough Stripe docs read in the matrix; resulting code is also the most explicit.
✗ 11 tool calls, $0.667. Visible attempt at the right answer, but JSON had broken escaping — smart quotes and unbalanced backticks inside the code string. Schema parser couldn't read the envelope.
✗ 22 tool calls, $0.508 — most expensive M3 failure. Same thinking-block-without-output pattern as CX: read docs, planned in a `<think>` block, closed the run before writing code.
✗ 5 tool calls, $0.268. The dangerous failure: confident, schema-valid output describing HMAC-SHA256 in the explanation field — and the code field omitted `createHmac`. Described the algorithm; didn't implement it.
✗ 13 tool calls, $0.251. Same as Qwen's other failures: engaged, retrieved, stopped without writing code.
✗ 1 tool call, $0.0005. "I'll help you create a TypeScript function..." — and stopped. DeepSeek's polite-helper failure mode.
✗ 0 tool calls, $0.00. Grok didn't engage with the task at all.
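The ordering rule stated before this list (passes first from cheap to expensive, then failures by how much budget they burned) can be sketched as a single comparator. The `RunResult` shape and the function name are assumptions. The tool-call counts and costs are copied from the entries above, deliberately shuffled.

```typescript
// Hypothetical row shape for one model's run.
interface RunResult {
  passed: boolean;
  toolCalls: number;
  costUsd: number;
}

function compareRuns(a: RunResult, b: RunResult): number {
  // Passes sort ahead of failures.
  if (a.passed !== b.passed) return a.passed ? -1 : 1;
  // Passes: cheapest first. Failures: most expensive first.
  return a.passed ? a.costUsd - b.costUsd : b.costUsd - a.costUsd;
}

const runs: RunResult[] = [
  { passed: false, toolCalls: 13, costUsd: 0.251 },
  { passed: true, toolCalls: 10, costUsd: 1.11 },
  { passed: false, toolCalls: 0, costUsd: 0 },
  { passed: true, toolCalls: 2, costUsd: 0.037 },
  { passed: false, toolCalls: 22, costUsd: 0.508 },
  { passed: true, toolCalls: 36, costUsd: 0.487 },
  { passed: false, toolCalls: 1, costUsd: 0.0005 },
  { passed: false, toolCalls: 11, costUsd: 0.667 },
  { passed: true, toolCalls: 4, costUsd: 0.071 },
  { passed: false, toolCalls: 5, costUsd: 0.268 },
];

// Copy before sorting so the input array is left untouched.
const ranked = [...runs].sort(compareRuns);
console.log(ranked.map((r) => `${r.passed ? "pass" : "fail"} $${r.costUsd}`));
```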