Published Audits

Full records of where AI agents got stuck testing real developer products. Each report includes prioritised findings, transcript evidence, and the exact conditions under which the audit was run.

What gets tested.

Each report is a Universal Baseline — the same standardised six-task suite applied to any developer-facing product, regardless of domain (payments, productivity, developer platform, etc.). Domain-specific baselines may be added in future audit cycles.

How findings are categorised.

Critical problem · Major problem · Minor problem · Observer note · Positive finding

Which models.

Each service is audited with two Claude models at different capability tiers (Opus and Sonnet). Differences between the two runs are preserved in the report, not averaged away. See methodology for why.

Stripe

Universal Baseline

Tested March 2026

Opus 4.6 · Sonnet 4.6

3 Positive findings · 2 Observer notes · 2 Minor problems · 2 Major problems
Read report →

Notion

Universal Baseline

Tested March 2026

Opus 4.6 · Sonnet 4.6

4 Positive findings · 3 Minor problems · 1 Major problem · 1 Critical problem
Read report →

GitHub

Universal Baseline

Tested March 2026

Opus 4.6 · Sonnet 4.6

4 Positive findings · 7 Minor problems
Read report →