Frequently Asked Questions

How our audits work, what the evidence means, and what to expect.

These questions come from prospective clients, report readers, and people evaluating the methodology. If your question isn't here, get in touch.

Trust & Reliability

How do you prevent hallucinated or fabricated findings?

Every published finding must trace back to what actually happened during the audit session. Findings cite observations, observations reference evidence from the transcript, and the full session transcript is published alongside the report — so every claim is checkable against the record.

Findings also go through independent review: a separate process audits every claim against the transcript before publication. If a claim can't be traced to the source, it doesn't ship.

The chain from raw record to published claim:

  1. Full session transcript (source of truth): the complete, unedited record, published with every report. Excerpts are extracted into evidence summaries.
  2. Evidence summary (evidence): what the agent did and what the API returned, authored from the transcript. Summaries are grouped into observations.
  3. Observation (analysis): a grouped pattern across evidence from the session. Observations support findings.
  4. Finding (published): a specific, severity-rated claim about your API.

Every finding is verified two ways:

  • We verify. Independent review: every claim is audited against the transcript by a separate review process before publication.
  • You verify. Reproduction steps: run the steps yourself against your own account to see the same behaviour.
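
If it helps to picture the chain as data, here is a minimal sketch in Python; the class and field names are illustrative assumptions, not our internal schema:

    # Minimal sketch of the evidence chain as linked records. All names
    # here are illustrative assumptions, not our internal schema.
    from dataclasses import dataclass

    @dataclass
    class Evidence:
        transcript_ref: str  # pointer into the published session transcript
        summary: str         # what the agent did and what the API returned

    @dataclass
    class Observation:
        pattern: str              # a grouped pattern across the session
        evidence: list[Evidence]  # the excerpts that show it

    @dataclass
    class Finding:
        claim: str                # a specific, severity-rated claim
        severity: str             # rated on the published scale
        observation: Observation  # what the claim rests on

    def trace(finding: Finding) -> list[str]:
        # Walk any finding back to the transcript refs that support it.
        return [e.transcript_ref for e in finding.observation.evidence]
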
Is a single audit run enough to be reliable?

Each model runs once. We prioritise cross-model validation over repeated runs of the same model, because two independent models hitting the same issue is a stronger signal than one model hitting it twice — it separates API-surface problems from model-specific behaviour.

When both models hit the same problem, confidence is high that the issue is in your API surface. When they diverge, we report that too. The findings we surface are about your documentation, error messages, SDK behaviour, and API design — things that are consistent regardless of which model encounters them.

Input: the same tasks, the same API, comparable conditions.

  • Model A runs the full task suite independently, with its own transcript recorded.
  • Model B runs the full task suite independently, with its own transcript recorded.

Then the results are compared:

  • Both models agree: a high-confidence finding. The issue is in the API surface, not the model.
  • Models diverge: noted in the report, published with model attribution so you can see which model saw what.

Output: one report per service, with cross-validated findings and per-model evidence.
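
In spirit, the comparison step reduces to set logic. A rough sketch, where the issue identifiers are hypothetical examples:

    # Rough sketch of cross-model comparison as set logic; the issue
    # identifiers are hypothetical examples, not real findings.
    def cross_validate(issues_a: set[str], issues_b: set[str]):
        agreed = issues_a & issues_b     # likely API-surface problems
        divergent = issues_a ^ issues_b  # reported with model attribution
        return agreed, divergent

    agreed, divergent = cross_validate(
        {"misleading-validation-error", "missing-auth-doc"},
        {"misleading-validation-error", "sdk-retry-loop"},
    )
    # agreed    -> {"misleading-validation-error"}          high confidence
    # divergent -> {"missing-auth-doc", "sdk-retry-loop"}   per-model notes
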
How do you tell the difference between "the agent messed up" and "the API has a problem"?

Three signals:

  1. Cross-model agreement. If both models independently hit the same wall, the common factor is likely your API surface rather than a model-specific behaviour.
  2. Task design. Our tasks follow the same steps a real agent integration would: read docs, install SDK, write code, handle errors. If the docs say one thing and the API does another, that's the API.
  3. Escalation tracking. When the agent gets stuck and a human has to intervene, we record exactly what the human did (a sketch of such a record follows this list). If the fix was "read a doc page the agent couldn't find" or "work around an undocumented behaviour," that's an API surface issue.
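
For illustration, an escalation record needs to capture little more than this; the shape below is an assumption, not our exact format:

    # Hypothetical shape of an escalation record; field names are
    # assumptions for illustration only.
    escalation = {
        "task": "core-task",
        "stuck_on": "agent could not locate the pagination docs",
        "human_action": "pointed the agent at a doc page search couldn't find",
        "classification": "api-surface",  # the fix lay in the docs, not the model
        "transcript_ref": "session-2/event-0471",
    }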

Methodology

How is this different from someone trying the API and writing a blog post?

Three key differences:

  1. The agent interacts the way real agent integrations do — reading public docs, generating code, hitting the API, interpreting error messages in real time. We're testing the agent experience, not the human experience.
  2. Everything is recorded. Every API call, every doc page read, every error encountered is captured in a full session transcript (one recorded event is sketched after this list). There's no "I think I remember seeing an error" — there's a complete record.
  3. Structured methodology. A standardised task framework covers the full integration lifecycle. Severity is defined on a published scale. Reports must include positive findings — we report what works, not just what's broken.
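
For flavour, a single recorded event might look like the snippet below, appended to a log that is never edited after the fact; the field names are illustrative, not our published schema:

    # One illustrative transcript event; field names are assumptions,
    # not our published schema.
    import json

    event = {
        "seq": 142,          # position in the session
        "kind": "api_call",  # also: doc_read, code_gen, error
        "request": {"method": "POST", "path": "/v1/resources"},
        "response": {"status": 400, "body": "ValidationError: ..."},
    }

    # Append-only: events are written as they happen and never edited.
    with open("session.jsonl", "a") as log:
        log.write(json.dumps(event) + "\n")
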
How do you choose what to test?

Our universal baseline audit follows a six-task framework that mirrors the lifecycle of a real agent integration:

  1. Discover — Find and understand the API documentation
  2. Onboard — Create an account, get credentials, set up the SDK
  3. Core Task — Complete a representative workflow
  4. Error Handling — Encounter and recover from errors
  5. Cleanup — Remove test data and resources
  6. Reflection — The agent self-assesses what went well and what didn't

Tasks are defined before the run starts, not chosen after the fact. Deep-dive audits can target specific workflows, edge cases, or API surfaces beyond this baseline.
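
Written down as data, the baseline might look like this; the structure and identifiers are illustrative, not the literal definition we run:

    # The six-task baseline expressed as data, fixed before the run starts.
    # Structure and identifiers are illustrative assumptions.
    BASELINE_TASKS = [
        {"id": "discover",       "goal": "find and understand the API documentation"},
        {"id": "onboard",        "goal": "create an account, get credentials, set up the SDK"},
        {"id": "core-task",      "goal": "complete a representative workflow"},
        {"id": "error-handling", "goal": "encounter and recover from errors"},
        {"id": "cleanup",        "goal": "remove test data and resources"},
        {"id": "reflection",     "goal": "self-assess what went well and what didn't"},
    ]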

How do you decide severity?

Severity is based on impact to task completion, not opinion:

  • Critical — Blocks completion of a core task
  • Major — The agent completes the task but with meaningful friction or workarounds
  • Minor — A rough edge that doesn't prevent completion
  • Observer note — A pattern worth noting that doesn't directly affect the tested tasks
  • Positive — Something that works particularly well for agents

The scale is published in our methodology.
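
As a type, the scale is small enough to write out in full; a sketch, not our internal code:

    # The published severity scale as an enum; a sketch, not our internal code.
    from enum import Enum

    class Severity(Enum):
        CRITICAL = "blocks completion of a core task"
        MAJOR = "task completes, but with meaningful friction or workarounds"
        MINOR = "a rough edge that doesn't prevent completion"
        OBSERVER_NOTE = "a pattern worth noting outside the tested tasks"
        POSITIVE = "something that works particularly well for agents"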

What stops a provider from gaming the audit?

The agent uses only public documentation and APIs — the same surface your customers' agents see. Tasks are defined by our framework, not negotiated with the provider. We don't pre-announce which endpoints or workflows we'll test.

For published audits, providers don't know when an audit is happening. For commissioned audits, the task framework is standardised — optimising for it would mean improving your actual developer experience, which is the point.

Verification

Can I verify findings myself?

Yes. Actionable findings include reproduction steps — concrete, numbered instructions you can run against your own account to see the same behaviour we saw.

We also periodically re-verify findings and flag those that have been addressed since the audit. If a provider fixes an issue, the finding gets a status update noting the fix.

How transparent is the evidence behind each finding?

Every report is published with the full session transcript — a comprehensive record of what the agent did during the audit. Findings reference specific moments in that transcript, so you can see exactly what happened.

Many findings also include inline evidence summaries that link directly to the relevant transcript section. We're actively expanding this coverage so that every finding has a direct deep-link to its source evidence.

When the agent delegates work to a background task (e.g., reading multiple files to investigate a type definition), that task's detailed activity isn't yet surfaced in the published transcript — only the summary of what it reported back. The raw data is captured; making it visible is on our roadmap.

How do I know findings are still current?

We re-verify findings periodically and flag those that have been addressed by the provider. When something gets fixed, we update the finding with a status notice rather than silently removing it.

Scope & Applicability

You tested with Claude. Does this apply to Codex, Gemini, or other agents?

The issues we find are about your API surface — documentation gaps, misleading error messages, SDK inconsistencies, undocumented behaviour. These affect any agent that reads your docs and calls your API, regardless of which model powers it.

We currently cross-validate with multiple Claude models. Adding models from other families (such as OpenAI's Codex) is on our roadmap, which will further strengthen cross-validation across model architectures.

You tested one workflow. How representative is that of our whole API?

The universal baseline covers the full lifecycle — from discovering your docs to cleaning up resources. It exercises the paths that every agent integration goes through, not every endpoint.

Deep-dive audits can target specific workflows, edge cases, or API surfaces beyond the baseline.

Our API is designed for humans. Why should we care about agent usability?

Agent-driven API consumption is growing fast. Your API's developer experience for agents — documentation clarity, error message quality, SDK consistency — increasingly determines whether integrations succeed or fail without human intervention.

The issues we find also tend to improve the human developer experience. Clearer error messages, better docs, and more consistent SDK behaviour help everyone.

Do you access our internal systems?

No. We test against your public documentation and APIs only — the same surface your customers' agents use. We need the same access any developer would need to use your service: typically an API key and a test account.

Fairness & Process

Are published audits the same as commissioned audits?

Our published audits (Stripe, GitHub, Notion) exist to showcase the methodology. They are conducted independently and published without prior coordination with the provider.

Commissioned audits are private. We will not publish a commissioned report unless both parties agree to make it public.

Do you give the service provider a chance to respond?

For commissioned audits, you receive the full report before any publication decision. For published audits, we welcome provider responses after publication and will update findings with the provider's perspective.

What if we disagree with a finding?

Every finding is evidence-bound. The transcript is published — it settles disputes. If we got something wrong, we'll update the report.

For findings that reflect design choices rather than bugs (e.g., "error messages are terse by design"), we note the provider's perspective when shared with us.

Is there a conflict of interest using Claude to audit?

We use Claude as a testing tool, the same way you'd use Chrome to test a website. The multi-model approach exists partly to address this — findings where models diverge are reported transparently, not buried. Adding models from other families is on our roadmap to further broaden cross-validation.

The methodology, evidence, and full transcripts are all published. You can inspect every step of the process. The audit evaluates your API surface, not Claude's capabilities.

Working With Us

How long does an audit take and what do you need from us?

A universal baseline audit typically takes 1–2 weeks. Deep-dive audits depend on the scope of the commission.

We need the same access any developer would need to start using your service — typically an API key and a test account. Treat us like any other developer onboarding to your platform. For deep-dive audits, we'll clarify any additional requirements during scoping.

What does an audit report include?

Each report contains:

  • Executive summary — Key findings at a glance
  • Task overview — Outcome of each task per model
  • Findings — Grouped by topic, each with severity, description, evidence citations, and reproduction steps
  • Recommendations — Specific, actionable suggestions tied to findings
  • Cross-service patterns — Where your findings connect to patterns we've seen across other APIs
  • Session timelines — Task-by-task overview of each model's run
  • Full transcripts — The complete session record, published separately
Here's how a single finding looks in a report:

F-005 · Major
SDK error message misidentifies the cause of invalid property errors

When passing an unrecognised property, the SDK raises a validation error citing the wrong field name. Both models encountered this independently during the core task. The misleading message caused both to attempt fixes to the wrong property before discovering the actual issue through trial and error.

Evidence summary
"ValidationError: Invalid value for 'source' — expected string, got object." The actual problem was the 'metadata' field, not 'source'. The agent spent 3 additional attempts fixing the wrong field.

How can I reproduce this? (requires a test-mode API key)

  1. Install the SDK and initialise with your test key
  2. Create a resource with an invalid nested property
  3. Observe the error message — it will cite the wrong field name

Every finding carries the same four parts:

  • Severity: rated on a published scale based on impact to task completion
  • Evidence: an excerpt from the session transcript showing what actually happened
  • Repro steps: verify the finding yourself against your own account
  • Recommendation: a specific, actionable fix tied to the finding
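
In code, steps 2 and 3 of that repro might look like the sketch below; example_sdk, its Client, and resources.create are hypothetical stand-ins for whichever SDK is under test:

    # Hypothetical repro of F-005. 'example_sdk', Client, and
    # resources.create are stand-ins, not a real library; the point is
    # the shape of the misleading error.
    from example_sdk import Client, ValidationError

    client = Client(api_key="sk_test_...")  # test-mode key, per the repro notes

    try:
        client.resources.create(
            source="tok_abc",
            metadata={"nested": {"too": "deep"}},  # the actually-invalid property
        )
    except ValidationError as err:
        print(err)  # observed: cites 'source', not the offending 'metadata' field
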
How do I prioritise which findings to fix?

Findings are severity-ranked. Start with Criticals — these block completion of core tasks. Majors cause significant friction or require workarounds. Minors are rough edges.

Each finding includes a specific recommendation. Many fixes are straightforward: clarify an error message, add a missing doc page, fix an SDK inconsistency. The reproduction steps let you verify the fix.