Stripe
Universal BaselineOpus 4.6 · Sonnet 4.6
Full records of where AI agents got stuck testing real developer products. Each report includes prioritised findings, transcript evidence, and the exact conditions under which the audit was run.
Each report is a Universal Baseline — the same standardised six-task suite applied to any developer-facing product, regardless of domain (payments, productivity, developer platform, etc.). Domain-specific baselines may be added in future audit cycles.
Each service is audited with two Claude models at different capability tiers (Opus and Sonnet). Differences between the two runs are preserved in the report, not averaged away. See methodology for why.
Opus 4.6 · Sonnet 4.6
Opus 4.6 · Sonnet 4.6
Opus 4.6 · Sonnet 4.6