For API, platform, and developer-experience teams.

Stress-test your developer experience with AI agents

Find where your docs, onboarding, APIs, and workflows break down in practice, and what to fix first.

I run AI agents through the real work your developers do — finding docs, getting credentials, making API calls, recovering from errors, completing workflows end-to-end. The transcript shows exactly where things break.

Matt Steen

Run by Matt Steen — product leader, twenty years shipping APIs, integrations, and self-serve experiences. Previously Head of Product in ecommerce; before that, e-learning, insurance, and telecoms.

Published Audits

Stripe

Universal Baseline

Tested March 2026

Opus 4.6 · Sonnet 4.6

3 Positive findings · 2 Observer notes · 2 Minor problems · 2 Major problems
Read report →

Notion

Universal Baseline

Tested March 2026

Opus 4.6 · Sonnet 4.6

4 Positive findings · 3 Minor problems · 1 Major problem · 1 Critical problem
Read report →

GitHub

Universal Baseline

Tested March 2026

Opus 4.6 · Sonnet 4.6

4 Positive findings · 7 Minor problems
Read report →

How I test

Every new service starts with a baseline audit: a standardised six-task suite (Discover, Onboard, Core Task, Error Handling, Cleanup, Reflection) run with two Claude models at different capability tiers (Opus and Sonnet). The contrast between tiers separates obvious breakdowns, where even the top tier fails, from subtle friction that only the lower tier hits.
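As a rough illustration of the tier contrast, here is a minimal Python sketch. It is not the tooling behind these audits: the function names, the labels, and the reduction of each task to a simple pass/fail are all assumptions made for the example. Only the six task names and the two-tier idea come from the description above.

```python
from enum import Enum

# Task names are taken from the baseline suite described above.
# Everything else in this sketch is hypothetical.
TASKS = ["Discover", "Onboard", "Core Task", "Error Handling", "Cleanup", "Reflection"]


class Signal(Enum):
    NO_ISSUE = "no issue observed"
    SUBTLE_FRICTION = "subtle friction: only the lower tier failed"
    OBVIOUS_BREAKDOWN = "obvious breakdown: even the top tier failed"


def classify(top_tier_passed: bool, lower_tier_passed: bool) -> Signal:
    """Label one task from the pass/fail outcome of each model tier."""
    if not top_tier_passed:
        # Simplifying assumption: a top-tier failure counts as an obvious
        # breakdown regardless of how the lower tier fared.
        return Signal.OBVIOUS_BREAKDOWN
    if not lower_tier_passed:
        return Signal.SUBTLE_FRICTION
    return Signal.NO_ISSUE


def contrast(top: dict[str, bool], lower: dict[str, bool]) -> dict[str, Signal]:
    """Compare per-task outcomes for the two tiers across the whole suite."""
    return {task: classify(top[task], lower[task]) for task in TASKS}
```

In practice a transcript carries far more detail than a pass/fail flag, so a classification like this would only be a first sort of where to look.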

In client engagements I go further: aligning on the outcomes that matter most to your team and designing task suites around your key workflows and integration paths.

Every report documents the starting state, fixture policy, and credential timing so the conditions are reproducible and the results are auditable. These are case studies, not a benchmark leaderboard.

Read the full methodology →

About

See the work. Then talk.

Read a published audit, or get in touch about auditing your own product.