Methodology

How DX Audit runs agent experience audits of developer-facing products, and how to read the resulting reports.

This is the framework behind the published audits and client baseline audits. I use observed agent behavior as evidence: when an agent struggles to discover documentation, obtain credentials, or recover from an error, that friction is real and can affect human developers too. The published audits are an early, evidence-backed set of case studies, not a benchmark leaderboard. They all use the Universal Baseline suite, a standardized task set applied regardless of service domain. Domain-specific baselines (e.g. for payments APIs or transactional email providers) may be added in future audit cycles as the evidence base grows.

Standard Task Suite

What gets tested. The baseline audit for a new service follows a six-stage workflow. Each stage tests a distinct aspect of the developer experience. The suite is standardized enough to support pattern-finding across services, while leaving room for service-specific task design within each stage. In client engagements, I align on key outcomes and design task suites around the workflows that matter most.

01 Discover
Can the agent find the relevant docs, references, and entry points?

02 Onboard
Can it obtain access, credentials, and the minimum setup needed to begin?

03 Core task
Can it complete the representative workflow the service is supposed to support?

04 Error handling
When the first attempt fails, can it recover using the available signals?

05 Cleanup / offboard
Can it identify and clean up the resources it created?

06 Reflection
What friction did the agent identify after completing the workflow?

Note

The task suite provides a common structure across audits, not benchmark-grade comparability. Each service audit adapts the core task to the service's primary workflow. Differences in task scope are disclosed in the report's Run Conditions.
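
To make the "common structure, adapted core task" idea concrete, here is a minimal sketch of how a per-service task plan could be checked against the six stages. The `Stage` enum, the `missing_stages` helper, and the example plan for an imaginary transactional email API are all hypothetical illustrations, not part of the DX Audit tooling.

```python
from enum import Enum


class Stage(Enum):
    """The six baseline stages, in the order they are run."""
    DISCOVER = "Discover"
    ONBOARD = "Onboard"
    CORE_TASK = "Core task"
    ERROR_HANDLING = "Error handling"
    CLEANUP = "Cleanup / offboard"
    REFLECTION = "Reflection"


def missing_stages(plan: dict[Stage, str]) -> list[Stage]:
    """Return the stages that have no task text, in suite order."""
    return [stage for stage in Stage if not plan.get(stage, "").strip()]


# Hypothetical plan for an imaginary email API; the task wording is invented.
example_plan = {
    Stage.DISCOVER: "Find the API reference and quickstart.",
    Stage.ONBOARD: "Obtain an API key for a test account.",
    Stage.CORE_TASK: "Send one transactional email.",
    Stage.ERROR_HANDLING: "Recover from a deliberately invalid request.",
    Stage.REFLECTION: "Summarize the friction encountered.",
}

# The plan above deliberately omits the cleanup stage.
print([stage.value for stage in missing_stages(example_plan)])
```

Only the task text varies per service in this sketch; the stage list stays fixed, which is what allows pattern-finding across audits without claiming benchmark-grade comparability.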

Model Runs and Evidence Policy

How evidence is gathered. Audits are run with real AI agents. Each launch report includes runs from two Claude models at different capability tiers (Opus and Sonnet) to capture behavioral variation. Running two tiers of the same family separates obvious breakdowns, where even the top tier fails, from subtler friction that only the lower tier hits. That contrast is the core triage signal in every report. Divergences between runs are recorded rather than flattened away, because the divergence itself is often the finding.
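
The two-tier triage signal can be sketched as a small decision function. This is an illustration under stated assumptions, not the audit's actual tooling: the `Triage` names and the `triage` function are hypothetical, and the case where only the top tier struggles is not defined in the text above, so it is simply labeled as a divergence to record.

```python
from enum import Enum


class Triage(Enum):
    OBVIOUS_BREAKDOWN = "obvious breakdown"          # even the top tier fails
    SUBTLE_FRICTION = "subtle friction"              # only the lower tier struggles
    UNEXPECTED_DIVERGENCE = "unexpected divergence"  # assumption: not defined in the source
    NO_FRICTION = "no friction observed"


def triage(top_tier_struggled: bool, lower_tier_struggled: bool) -> Triage:
    """Classify one step of an audit from the outcomes of both model runs."""
    if top_tier_struggled and lower_tier_struggled:
        return Triage.OBVIOUS_BREAKDOWN
    if lower_tier_struggled:
        return Triage.SUBTLE_FRICTION
    if top_tier_struggled:
        # Assumed handling: keep the divergence visible rather than flatten it.
        return Triage.UNEXPECTED_DIVERGENCE
    return Triage.NO_FRICTION


print(triage(top_tier_struggled=True, lower_tier_struggled=True).value)
print(triage(top_tier_struggled=False, lower_tier_struggled=True).value)
```

In practice "struggled" is a judgment made from transcripts rather than a boolean, so a function like this would at most label findings after a human has read the evidence.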

Agent + Models

Harness: Claude Code
Launch models: Opus 4.6 and Sonnet 4.6
Run strategy: Each service is audited with both models. One published report per service combines both runs, with divergences noted inline.

Evidence Model

Reader-first layer: Report findings and recommendations are the primary reader-facing artifact.
Supporting evidence: Evidence remains visible through excerpts, session timelines, and transcript links.
Model divergences: Differences between model runs are preserved when they materially affect interpretation.

Launch status

Launch reports use Claude models only. Codex and potentially other agent harnesses will be added in future audit cycles. Running the same service audit across multiple agent families will help distinguish service-attributable friction from model-specific behavior.

Run Conditions

What gets disclosed. Every published launch report includes the same six-field disclosure block. These fields document the exact conditions under which the audit was run, so readers can assess what is service-attributable and what is an artifact of the test environment.

Present in every published report. Labels match exactly.

Starting state: What existed before the run that materially affects interpretation.
Fixture policy: What was prepared in advance, if anything, and why.
Credential timing: When credentials became available and how they were provided.
Allowed surfaces: Which interfaces the agent was permitted to use, and what the harness constrained.
Operator intervention policy: What the operator could do and how interventions are recorded.
Declared deviations from baseline: Any service-, harness-, environment-, or operator-specific departures from the default procedure.
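
Because the six labels are meant to match exactly in every report, a disclosure block lends itself to a simple mechanical check. The sketch below assumes a report's Run Conditions are available as a plain dictionary of label to text; that representation, the `check_disclosure` helper, and the example values are hypothetical.

```python
# The six field labels, spelled exactly as they appear in published reports.
LABELS = (
    "Starting state",
    "Fixture policy",
    "Credential timing",
    "Allowed surfaces",
    "Operator intervention policy",
    "Declared deviations from baseline",
)


def check_disclosure(block: dict[str, str]) -> list[str]:
    """Return a list of problems; an empty list means the block is complete."""
    problems = []
    for label in LABELS:
        if label not in block:
            problems.append(f"missing field: {label}")
        elif not block[label].strip():
            problems.append(f"empty field: {label}")
    for label in block:
        if label not in LABELS:
            problems.append(f"unexpected label: {label}")
    return problems


# Invented example values; note the mis-capitalized "Credential Timing".
example_block = {
    "Starting state": "Fresh account, no existing resources.",
    "Fixture policy": "No fixtures prepared.",
    "Credential Timing": "API key provided after the Discover stage.",
    "Allowed surfaces": "Public docs, REST API, CLI.",
    "Operator intervention policy": "Interventions logged in the timeline.",
    "Declared deviations from baseline": "None.",
}

for problem in check_disclosure(example_block):
    print(problem)
```

The exact-match comparison is deliberate: a label that differs only in capitalization is reported both as a missing field and as an unexpected label, which keeps the blocks comparable across reports.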

Finding attribution

Findings distinguish between root causes so that not every observed problem is attributed to the service under test. Each finding is tagged with both a severity (Critical / Major / Minor / Observer note / Positive) and an attribution (one of the five below). Only findings attributed to Service are counted as service problems; findings attributed to Harness, Environment, or Operator are flagged as Observer notes so readers can distinguish real product issues from test-environment artifacts.

Service / Documentation / Harness / Environment / Operator

Finding Severity

Each finding is tagged with one of five severity levels. Severity reflects the observed impact on task completion during the audit, not a general importance ranking.

  • Critical problem: blocks completion of a core task
  • Major problem: forces material rework or workarounds
  • Minor problem: friction that slows agents but doesn't block them
  • Observer note: something worth flagging that isn't attributable to the service (harness, environment, or operator behavior)
  • Positive finding: something the service does well; documented to make patterns visible across audits
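
The severity and attribution tags, and the rule tying them together, can be sketched as a small data model. This is a hypothetical illustration, not the audit's internal schema. Two points are assumptions of the sketch: Documentation-attributed findings are left unconstrained because the rule above does not mention them, and "service problem" is taken to mean a Service-attributed finding with Critical, Major, or Minor severity.

```python
from dataclasses import dataclass
from enum import Enum


class Severity(Enum):
    CRITICAL = "Critical"
    MAJOR = "Major"
    MINOR = "Minor"
    OBSERVER_NOTE = "Observer note"
    POSITIVE = "Positive"


class Attribution(Enum):
    SERVICE = "Service"
    DOCUMENTATION = "Documentation"
    HARNESS = "Harness"
    ENVIRONMENT = "Environment"
    OPERATOR = "Operator"


# Attributions that must be flagged as Observer notes, per the rule above.
OBSERVER_ONLY = {Attribution.HARNESS, Attribution.ENVIRONMENT, Attribution.OPERATOR}
# Assumption: only these severities count as "problems".
PROBLEM_SEVERITIES = {Severity.CRITICAL, Severity.MAJOR, Severity.MINOR}


@dataclass(frozen=True)
class Finding:
    title: str
    severity: Severity
    attribution: Attribution

    def __post_init__(self) -> None:
        # Reject test-environment artifacts that are tagged as product issues.
        if self.attribution in OBSERVER_ONLY and self.severity is not Severity.OBSERVER_NOTE:
            raise ValueError(
                f"{self.attribution.value} findings must be tagged as Observer notes"
            )


def is_service_problem(finding: Finding) -> bool:
    """True only for problem-severity findings attributed to the service."""
    return (
        finding.attribution is Attribution.SERVICE
        and finding.severity in PROBLEM_SEVERITIES
    )


# Invented example finding.
blocked = Finding("Signup requires a phone call", Severity.CRITICAL, Attribution.SERVICE)
print(is_service_problem(blocked))
```

Enforcing the rule at construction time means a harness timeout can never be recorded as a Critical problem by accident; it has to be entered as an Observer note.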

Analysis Dimensions

How findings are organized. Seven dimensions structure how I analyze developer-facing products. Each finding in a published audit maps to one or more of these dimensions, making it easier to see where friction concentrates and where a service performs well.

01 Discoverability
How easily an agent can find the right starting points, references, and workflow hints.

02 Documentation quality
Whether the docs are accurate, current, machine-accessible, and sufficient to proceed.

03 Onboarding friction
How hard it is to obtain credentials, configure access, and reach a usable starting state.

04 API coherence
Whether the service's mental model, naming, and workflow structure are internally consistent.

05 Failure transparency
Whether errors are clear enough to support recovery.

06 Output reliability
Whether the service returns stable, interpretable outputs the agent can act on confidently.

07 Agent-native interface availability
Whether the service offers machine-oriented surfaces such as llms.txt, MCP, or other agent-specific affordances.
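
Since each finding maps to one or more dimensions, "where friction concentrates" amounts to a per-dimension count. The sketch below shows one way to compute it. The `tally_by_dimension` helper and the example findings are hypothetical, and a real report would weigh severity and evidence rather than rely on raw counts.

```python
from collections import Counter

DIMENSIONS = (
    "Discoverability",
    "Documentation quality",
    "Onboarding friction",
    "API coherence",
    "Failure transparency",
    "Output reliability",
    "Agent-native interface availability",
)


def tally_by_dimension(findings: list[tuple[str, list[str]]]) -> Counter:
    """Count findings per dimension; a finding may count toward several."""
    counts = Counter({dimension: 0 for dimension in DIMENSIONS})
    for title, dimensions in findings:
        if not dimensions:
            raise ValueError(f"finding has no dimension: {title}")
        for dimension in set(dimensions):  # ignore accidental repeats
            if dimension not in DIMENSIONS:
                raise ValueError(f"unknown dimension: {dimension}")
            counts[dimension] += 1
    return counts


# Invented example findings, each mapped to one or more dimensions.
example_findings = [
    ("Quickstart links to a removed endpoint", ["Documentation quality", "Discoverability"]),
    ("401 response has no error body", ["Failure transparency"]),
    ("API key page requires browser login", ["Onboarding friction", "Discoverability"]),
]

for dimension, count in tally_by_dimension(example_findings).most_common(3):
    print(f"{dimension}: {count}")
```

Starting every dimension at zero keeps dimensions with no findings visible in the result, which matters when the point is to show where a service performs well as much as where it struggles.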

Interpretation and Validity

How to read the results. Agent behavior changes as models and services evolve. These reports are useful as evidence-backed case studies within their stated scope, not as permanent verdicts.

Temporal validity

Every report is published with the test date and model version. Results become less representative as services update their APIs, docs, and onboarding, and as the models themselves evolve.

Re-audits are expected over time. Updated reports will be published alongside the originals, not as silent replacements.

What launch results do and do not claim

They are: evidence-backed case studies of real agent interactions with real developer-facing products, structured enough to reveal patterns across services.

They are not: a benchmark leaderboard, a permanent verdict, or a claim of perfectly identical test conditions across all services.