Methodology

How DX Audit runs agent experience audits of developer-facing products, and how to read the resulting reports.

This is the framework behind the published audits and client baseline audits. I use observed agent behavior as evidence: when an agent struggles to discover documentation, obtain credentials, or recover from an error, that friction is real and can affect human developers too. The published audits are an early, evidence-backed set of case studies, not a benchmark leaderboard. They all use the Universal Baseline suite, a standardized task set applied regardless of service domain. Domain-specific baselines (e.g. for payments APIs or transactional email providers) may be added in future audit cycles as the evidence base grows.

Audit process

Standard Task Suite

What gets tested. The baseline audit for a new service follows a six-stage workflow. Each stage tests a distinct aspect of the developer experience. The suite is standardized enough to support pattern-finding across services, while leaving room for service-specific task design within each stage. In client engagements, I align on key outcomes and design task suites around the workflows that matter most.

Discover

Can the agent find the relevant docs, references, and entry points?

Onboard

Can it obtain access, credentials, and the minimum setup needed to begin?

Core task

Can it complete the representative workflow the service is supposed to support?

Error handling

When the first attempt fails, can it recover using the available signals?

Cleanup / offboard

Can it identify and clean up the resources it created?

Reflection

What friction did the agent identify after completing the workflow?

Note

The task suite provides a common structure across audits, not benchmark-grade comparability. Each service audit adapts the core task to the service's primary workflow. Differences in task scope are disclosed in the report's Run Conditions.

Agent and evidence

Model Runs and Evidence Policy

How evidence is gathered. Audits are run with real AI agents. Each launch report includes runs from two Claude models at different capability tiers (Opus and Sonnet) to capture behavioral variation. Running two tiers of the same family separates obvious breakdowns, where even the top tier fails, from subtle friction that only the lower tier hits. That contrast is the core triage signal in every report. Divergences between runs are recorded rather than flattened away, because the divergence itself is often the finding.
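The two-tier contrast can be sketched as a small classification rule. This is a minimal illustration, not the tooling used in the audits: the `Triage` names and the `triage` function are hypothetical, and the case where only the top tier struggles is not described above, so the sketch simply labels it as a divergence to record.

```python
from enum import Enum


class Triage(Enum):
    OBVIOUS_BREAKDOWN = "obvious breakdown"          # even the top tier fails
    SUBTLE_FRICTION = "subtle friction"              # only the lower tier hits it
    UNEXPECTED_DIVERGENCE = "unexpected divergence"  # assumption: not defined in the text
    NO_FRICTION = "no friction observed"


def triage(top_tier_struggled: bool, lower_tier_struggled: bool) -> Triage:
    """Classify one task step from two runs of the same model family."""
    if top_tier_struggled and lower_tier_struggled:
        return Triage.OBVIOUS_BREAKDOWN
    if lower_tier_struggled:
        return Triage.SUBTLE_FRICTION
    if top_tier_struggled:
        # Recorded rather than flattened away; interpretation is left to the report.
        return Triage.UNEXPECTED_DIVERGENCE
    return Triage.NO_FRICTION


# Example: the lower tier struggled at a step the top tier completed cleanly.
print(triage(top_tier_struggled=False, lower_tier_struggled=True).value)
```

In practice "struggled" is a judgment drawn from transcripts, so a boolean is a simplification made only to keep the rule readable.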

Agent + Models

Harness: Claude Code
Launch models: Opus 4.6 and Sonnet 4.6
Run strategy: Each service is audited with both models. One published report per service combines both runs, with divergences noted inline.

Evidence Model

Reader-first layer: Report findings and recommendations are the primary reader-facing artifact.
Supporting evidence: Evidence remains visible through excerpts, session timelines, and transcript links.
Model divergences: Differences between model runs are preserved when they materially affect interpretation.

Launch status

Launch reports use Claude models only. Codex and potentially other agent harnesses will be added in future audit cycles. Running the same service audit across multiple agent families will help distinguish service-attributable friction from model-specific behavior.

Disclosure contract

Run Conditions

What gets disclosed. Every published launch report includes the same six-field disclosure block. These fields document the exact conditions under which the audit was run, so readers can assess what is service-attributable and what is an artifact of the test environment.

Run Conditions

Present in every published report. Labels match exactly.

Starting state: What existed before the run that materially affects interpretation.
Fixture policy: What was prepared in advance, if anything, and why.
Credential timing: When credentials became available and how they were provided.
Allowed surfaces: Which interfaces the agent was permitted to use, and what the harness constrained.
Operator intervention policy: What the operator could do and how interventions are recorded.
Declared deviations from baseline: Any service-, harness-, environment-, or operator-specific departures from the default procedure.
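For readers who want to handle the disclosure block programmatically, the six fields could be modeled roughly as follows. This is a sketch under assumptions: the class name, the snake_case field names, and the `missing_fields` helper are hypothetical, and the example values are placeholders rather than conditions from any published report.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class RunConditions:
    """The six disclosure fields, in the order they appear in a report."""
    starting_state: str
    fixture_policy: str
    credential_timing: str
    allowed_surfaces: str
    operator_intervention_policy: str
    declared_deviations_from_baseline: str


def missing_fields(block: RunConditions) -> list[str]:
    """Return the names of any fields that were left blank."""
    return [f.name for f in fields(block) if not getattr(block, f.name).strip()]


# Placeholder values for illustration only.
example = RunConditions(
    starting_state="(placeholder) what existed before the run",
    fixture_policy="(placeholder) what was prepared in advance",
    credential_timing="",  # deliberately left blank to show the check
    allowed_surfaces="(placeholder) interfaces the agent could use",
    operator_intervention_policy="(placeholder) what the operator could do",
    declared_deviations_from_baseline="(placeholder) none declared",
)

print(missing_fields(example))
```

A check like this only confirms that every field is filled in; whether the disclosure is adequate remains an editorial judgment.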

Finding attribution

Findings distinguish between root causes so that not every observed problem is attributed to the service under test. Each finding is tagged with both a severity (Critical / Major / Minor / Observer note / Positive) and an attribution (one of the five below). Only findings attributed to Service are counted as service problems; findings attributed to Harness, Environment, or Operator are flagged as Observer notes so readers can distinguish real product issues from test-environment artifacts.

Service, Documentation, Harness, Environment, Operator
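The tagging rule above can be sketched as a small data model. This is an illustration under assumptions, not the audit tooling: all class and function names are hypothetical, and limiting "service problems" to the Critical, Major, and Minor severities is my reading of the text. The text does not say how Documentation-attributed findings are counted, so the sketch leaves them out of both groups.

```python
from dataclasses import dataclass
from enum import Enum


class Severity(Enum):
    CRITICAL = "Critical"
    MAJOR = "Major"
    MINOR = "Minor"
    OBSERVER_NOTE = "Observer note"
    POSITIVE = "Positive"


class Attribution(Enum):
    SERVICE = "Service"
    DOCUMENTATION = "Documentation"
    HARNESS = "Harness"
    ENVIRONMENT = "Environment"
    OPERATOR = "Operator"


# Assumption: only these severities describe a "problem".
PROBLEM_SEVERITIES = {Severity.CRITICAL, Severity.MAJOR, Severity.MINOR}
# Attributions that the text says are flagged as Observer notes.
OBSERVER_ATTRIBUTIONS = {
    Attribution.HARNESS,
    Attribution.ENVIRONMENT,
    Attribution.OPERATOR,
}


@dataclass(frozen=True)
class Finding:
    title: str
    severity: Severity
    attribution: Attribution


def counts_as_service_problem(finding: Finding) -> bool:
    """True only for problem-level findings attributed to the service."""
    return (
        finding.attribution is Attribution.SERVICE
        and finding.severity in PROBLEM_SEVERITIES
    )


def is_observer_note(finding: Finding) -> bool:
    """True when the attribution points at the test setup, not the service."""
    return finding.attribution in OBSERVER_ATTRIBUTIONS
```

For example, a hypothetical `Finding("sandbox blocked outbound request", Severity.OBSERVER_NOTE, Attribution.ENVIRONMENT)` would be an observer note and would not count against the service.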

Severity scale

Finding Severity

Each finding is tagged with one of five severity levels. Severity reflects the observed impact on task completion during the audit, not a general importance ranking.

Critical problem: blocks completion of a core task
Major problem: forces material rework or workarounds
Minor problem: friction that slows agents but doesn't block them
Observer note: something worth flagging that isn't attributable to the service (harness, environment, or operator behavior)
Positive finding: something the service does well; documented to make patterns visible across audits

Analytic framework

Analysis Dimensions

How findings are organized. Seven dimensions structure how I analyze developer-facing products. Each finding in a published audit maps to one or more of these dimensions, making it easier to see where friction concentrates and where a service performs well.

Discoverability

How easily an agent can find the right starting points, references, and workflow hints.

Documentation quality

Whether the docs are accurate, current, machine-accessible, and sufficient to proceed.

Onboarding friction

How hard it is to obtain credentials, configure access, and reach a usable starting state.

API coherence

Whether the service's mental model, naming, and workflow structure are internally consistent.

Failure transparency

Whether errors are clear enough to support recovery.

Output reliability

Whether the service returns stable, interpretable outputs the agent can act on confidently.

Agent-native interface availability

Whether the service offers machine-oriented surfaces such as llms.txt, MCP, or other agent-specific affordances.
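To see where friction concentrates, findings can be tallied per dimension. The following is a minimal sketch, assuming each finding is represented as a title plus a list of dimension names; the `tally_by_dimension` function and the sample findings are hypothetical and are not data from any audit.

```python
from collections import Counter

DIMENSIONS = (
    "Discoverability",
    "Documentation quality",
    "Onboarding friction",
    "API coherence",
    "Failure transparency",
    "Output reliability",
    "Agent-native interface availability",
)


def tally_by_dimension(findings: list[tuple[str, list[str]]]) -> Counter:
    """Count findings per dimension; one finding may map to several."""
    counts = Counter({dim: 0 for dim in DIMENSIONS})
    for title, dims in findings:
        for dim in dims:
            if dim not in DIMENSIONS:
                raise ValueError(f"unknown dimension {dim!r} on finding {title!r}")
            counts[dim] += 1
    return counts


# Hypothetical findings, for illustration only.
sample = [
    ("quickstart link missing from landing page", ["Discoverability"]),
    ("error body omits the failing field",
     ["Failure transparency", "Documentation quality"]),
    ("API key page requires manual approval", ["Onboarding friction"]),
    ("error code undocumented", ["Failure transparency"]),
]

counts = tally_by_dimension(sample)
print(counts["Failure transparency"])
```

Because a finding can map to more than one dimension, the totals can exceed the number of findings; the tally shows concentration, not a score.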

Reading the results

Interpretation and Validity

How to read the results. Agent behavior changes as models and services evolve. These reports are useful as evidence-backed case studies within their stated scope, not as permanent verdicts.

Temporal validity

Every report is published with the test date and model version. Results become less representative as services update their APIs, docs, and onboarding — and as models themselves evolve.

Re-audits are expected over time. Updated reports will be published alongside the originals, not as silent replacements.

What launch results do and do not claim

They are: evidence-backed case studies of real agent interactions with real developer-facing products, structured enough to reveal patterns across services.

They are not: a benchmark leaderboard, a permanent verdict, or a claim of perfectly identical test conditions across all services.
