
Why your AI-readiness score is a lie

We ran our own auditor against itself and found five measurement sins that make every published score a coin flip dressed as data. Here's what we fixed — and what every GEO tool should disclose.

Your GEO audit tool gave you a B+. It felt authoritative — letter grade, numerical score, detailed breakdown. It wasn't. We ran Hidden Layer's own auditor against itself and found five measurement failures that make every score we've published unreliable. We're writing this publicly because we're fixing it, and because every GEO tool should be held to the same standard.

These aren't corner cases. They're the core of how GEO scores are computed across every tool in this space.

Sin 1: N=1 — one model, one call, 15 points

The geo_cold_recall check is the highest-weight single signal in the audit at 15 points. It asks a language model: 'What is this domain?' Pass = model knows you. Fail = model doesn't. The check runs once, against one model (Llama 3.1 8B via Cloudflare Workers AI), with no repetition.

// auditor.ts — before the fix
const aiResult = await ai.run('@cf/meta/llama-3.1-8b-instruct', {
  messages: [{ role: 'user', content: `What is ${domain}?...` }],
  max_tokens: 150,
});
// n = 1. One call. 15 points. No variance calculation.

LLMs are stochastic. The same prompt can return different output on repeated calls. A single-call 15-point score has unknown variance: we observed the same domain flip between pass and warn on consecutive calls. Every published score that includes an LLM check, including every score on our leaderboard, has this problem.

The fix is N≥3 per check with majority-vote scoring and a variance band. We haven't shipped it yet because it forces a budget decision: three model calls per check cost three times as much as one. The interim fix, shipping now, is to flag every LLM-derived check as 'indicative' rather than 'measured', so the UI can render it with an appropriate uncertainty caveat.
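Here's a minimal sketch of what the N≥3 fix could look like. runGeoRecallOnce is a hypothetical wrapper around the single ai.run() call shown above, not a function in our codebase:

// Sketch only, not shipped. Majority vote across N samples, plus an
// agreement fraction that feeds the variance band.
type Verdict = 'pass' | 'warn' | 'fail';
declare function runGeoRecallOnce(domain: string): Promise<Verdict>; // hypothetical

async function geoRecallWithVariance(domain: string, n = 3) {
  const samples: Verdict[] = [];
  for (let i = 0; i < n; i++) {
    samples.push(await runGeoRecallOnce(domain));
  }
  // Majority vote: the most frequent verdict wins.
  const counts = new Map<Verdict, number>();
  for (const v of samples) counts.set(v, (counts.get(v) ?? 0) + 1);
  const [verdict, wins] = [...counts.entries()].sort((a, b) => b[1] - a[1])[0];
  return {
    verdict,
    agreement: wins / n, // 1.0 = unanimous; 2/3 at n=3 means the check is unstable
    samples,             // kept in evidence so the variance is auditable
  };
}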

Sin 2: unpinned models — your score drifts on every Cloudflare deploy

The model slug '@cf/meta/llama-3.1-8b-instruct' is a floating pointer. Cloudflare updates the underlying checkpoint without changing the slug name. When they ship a new model version, every audit using that slug quietly changes behaviour. Your score from six months ago used a different model than your score today — the same number, different computation.

// Before: floating slug, no record of what actually ran
evidence: {
  model: 'llama-3.1-8b',  // which checkpoint? unknown.
  response: response.slice(0, 300),
}

// After: pinned date + version — at least we know when it was validated
const GEO_RECALL_MODEL = '@cf/meta/llama-3.1-8b-instruct' as const;
const GEO_RECALL_MODEL_PIN = '2026-05-14' as const;
const GEO_RECALL_PROMPT_VERSION = 'v1.0' as const;

evidence: {
  model: GEO_RECALL_MODEL,
  model_pin: GEO_RECALL_MODEL_PIN,
  prompt_version: GEO_RECALL_PROMPT_VERSION,
  response,  // full transcript — no truncation
  n: 1,
}

We now record the date on which the model was validated against our pass/fail criteria. When Cloudflare ships a new checkpoint, we validate again, bump the pin date, and release a new prompt version. Audits store the model + pin date in evidence, so historical scores are at least traceable. True reproducibility requires a model SHA — CF Workers AI doesn't expose one yet. The pin date is the best available proxy.
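What does 'validate again' mean in practice? A rough sketch, assuming checkGeoRecall returns its verdict in evidence as the snippets in this post suggest; the validation set below is an illustrative placeholder:

// Sketch: re-validating after Cloudflare ships a new checkpoint under the same slug.
// Domains and expected verdicts are illustrative placeholders.
const PIN_VALIDATION_SET = [
  { domain: 'stripe.com', brand: 'Stripe', industry: 'payments', expected: 'recognized' },
  { domain: 'example.invalid', brand: 'No Such Co', industry: 'widgets', expected: 'unrecognized' },
];

async function validateNewCheckpoint(ai: Ai): Promise<boolean> {
  for (const c of PIN_VALIDATION_SET) {
    const result = await checkGeoRecall(c.domain, c.brand, c.industry, ai);
    if (result.evidence.verdict !== c.expected) return false; // fail: keep the old pin
  }
  return true; // pass: bump GEO_RECALL_MODEL_PIN and release a new prompt version
}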

Sin 3: truncated transcripts — 300 characters of evidence

The LLM response was truncated to 300 characters before being stored as evidence. A hallucination that's 400 characters long — the model confidently describing the wrong company — gets promoted to a pass on the keyword match and stored as a 300-char snippet that looks plausible.

// Before: you couldn't read the full model output
evidence: {
  response: response.slice(0, 300),
  verdict: 'recognized',
}
// The keyword match said pass. The full response said something else. You'd never know.

// After: full transcript in evidence
evidence: {
  response,  // full — truncation removed
  verdict: 'recognized',
}

This matters more than it sounds. The geo_cold_recall check computes verdict via keyword matching: does the response include the domain stem and an industry term? A model that confuses your brand with a competitor, gives a plausible-but-wrong description, and mentions your domain once in passing — that passes the keyword check. You'd only know by reading the full transcript. We now store it.
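The matcher itself isn't shown above; a simplified sketch of the kind of keyword check described, which may differ from the shipped implementation, makes the failure mode concrete:

// Simplified sketch of the keyword verdict described above. The shipped
// matcher may differ; this is the failure mode, not the exact code.
function keywordVerdict(
  response: string,
  domain: string,
  industry: string,
): 'recognized' | 'unrecognized' {
  const stem = domain.split('.')[0].toLowerCase(); // 'acme' from 'acme.io'
  const text = response.toLowerCase();
  // A confident description of the wrong company still passes, as long as it
  // mentions your domain stem once and uses an industry term.
  return text.includes(stem) && text.includes(industry.toLowerCase())
    ? 'recognized'
    : 'unrecognized';
}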

Sin 4: no golden-corpus validation — an A is easy when you write the rubric

Every GEO audit tool, including Hidden Layer, defines its own pass/fail criteria and its own weights. There's no ground truth. We don't have a set of domains with known LLM citation rates that we've validated our scoring against. An A on our leaderboard means 'scored well on our rubric', not 'AI systems actually cite this domain more.' These may correlate — but we haven't proven it.

// The GEO Presence checks that run in production:
const checks = [
  await checkGeoRecall(domain, brandName, industry, ai),
  await checkCategoryShareOfVoice(domain, brandName, industry, ai),
  await checkHnMentions(domain, cache),
  // Missing:
  // await validateAgainstGoldenCorpus(domain, auditResult)
  // No calibration. An A is whatever the algorithm says.
];

The fix is a 50-site golden corpus with hand-labeled ground truth: for each domain, we have actual LLM citation data (does ChatGPT recommend this brand for its category?) that we can correlate against audit scores. We're building this now. Until it exists, every score on every GEO tool is an opinion with a number attached.
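Once the corpus exists, the validation itself is straightforward. A sketch with illustrative field names: rank-correlate audit scores against the labeled citation rates, and publish the coefficient.

// Sketch: golden-corpus validation. Field names are illustrative.
interface GoldenEntry {
  domain: string;
  auditScore: number;   // our 0-100 audit score
  citationRate: number; // hand-labeled fraction of category prompts citing the brand
}

// Spearman's rank correlation (simplified: assumes no tied values).
// rho near 1: high scores track real citations. rho near 0: the rubric is noise.
function spearmanRho(corpus: GoldenEntry[]): number {
  const rank = (xs: number[]) => xs.map((x) => xs.filter((y) => y < x).length);
  const rs = rank(corpus.map((e) => e.auditScore));
  const rc = rank(corpus.map((e) => e.citationRate));
  const n = corpus.length;
  const d2 = rs.reduce((sum, r, i) => sum + (r - rc[i]) ** 2, 0);
  return 1 - (6 * d2) / (n * (n * n - 1));
}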

Sin 5: no confidence intervals — 83/100 implies precision the signal doesn't support

Hidden Layer reports scores as integers. 83. 76. 55. The integer format implies precision. There's no error bar, no confidence interval, no indication that the LLM-derived checks (25+ points combined) are stochastic while the HTTP-fetch checks (the rest) are deterministic.

// Before: a single stochastic + deterministic score presented identically
const pct = Math.round((totalScore / totalPossible) * 100);
// pct = 83. Implied precision: ±0. Actual: unknown variance from N=1 LLM calls.

// After: LLM-derived checks explicitly flagged
return {
  id: 'geo_cold_recall',
  confidence: 'indicative',  // stochastic — treat differently than measured checks
  // vs. robots.txt check:
  // confidence: undefined  // defaults to 'measured' — deterministic HTTP fetch
};

The 'indicative' flag is now on every LLM-derived check: geo_cold_recall and category_share_of_voice. The remaining 46+ checks are HTTP fetches, parse operations, and API calls — deterministic for the same domain state. The distinction belongs in the UI: we'll render indicative checks with a footnote instead of the same pass/fail chip as a deterministic check.

Why we published this

We could have fixed these quietly and moved on. We're not, for two reasons.

First, these problems exist in every GEO audit tool. Single-model LLM calls with no model pinning and no confidence intervals are the industry standard. Publishing our audit publicly creates pressure for transparency across the category. If you're using a GEO tool that doesn't disclose which model it uses, how many calls it makes, and what its variance looks like, that's a signal worth asking about.

Second, our leaderboard data is cited externally now. When someone shares 'Hidden Layer gave us a B/84', they should know that the 84 contains approximately 25 points of stochastic signal with unknown variance. The transparency is the credibility.

What we've shipped (as of today)

  • Model pin + prompt version on all LLM checks (commit 4489cb3). Every audit stores the model slug, pin date, and prompt version used. Historical audits will show which model era they were scored against.
  • Full transcripts. Truncation removed. You can now read the exact model output that produced your geo_cold_recall verdict.
  • 'indicative' confidence flag on all LLM-derived checks. UI rendering (greying out indicative checks) ships in the next release.
  • Full prompt stored in evidence. You can verify what question produced what answer.

What we're building next

  • 50-site golden corpus with hand-labeled citation data. The ground truth that makes every GEO audit score meaningful instead of self-referential. Dataset will be public.
  • N≥3 per LLM check with majority vote and variance band. Requires budget decision — likely default N=3 with opt-in N=5 for premium audits.
  • Multi-model cold recall (ChatGPT + Claude + Perplexity). A brand that one model knows but three models don't is different from a brand all three know. Weighted N-model score.
  • CI bands on the score display. Instead of '83', display '83 ±4' where the variance is computable from the stochastic component weights and observed variance. A sketch of that computation follows this list.
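The band computation could look like this, assuming each indicative check stores the agreement fraction from its N samples (see Sin 1); the shape is illustrative:

// Sketch: deriving the ±band. Deterministic checks contribute zero variance;
// indicative checks contribute their weight scaled by observed disagreement.
interface CheckScore {
  points: number;            // points earned
  maxPoints: number;         // check weight
  confidence?: 'indicative'; // undefined = measured / deterministic
  agreement?: number;        // majority fraction from N samples (1.0 = unanimous)
}

function scoreWithBand(checks: CheckScore[], totalPossible: number) {
  const total = checks.reduce((s, c) => s + c.points, 0);
  const uncertainty = checks
    .filter((c) => c.confidence === 'indicative')
    .reduce((s, c) => s + c.maxPoints * (1 - (c.agreement ?? 0)), 0);
  return {
    pct: Math.round((total / totalPossible) * 100),
    band: Math.round((uncertainty / totalPossible) * 100), // render as '83 ±4'
  };
}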

The golden corpus dataset drops when Phase 3 ships. Sign up at hidden-layer-blogs.pages.dev to get it first — and see the methodology that replaces what we just described.

Methodology · GEO · Transparency · Signal Authenticity
Hidden Layer Research
Independent GEO audit research. Data-first. Not affiliated with any LLM vendor.

See how your domain scores against these checks →

Run a free audit