The 50-site golden corpus: how we verify our auditor actually works
GEO auditors are easy to build and hard to trust. We labeled 50 real sites by hand — 4 tiers, expected grades, live HTTP verification — then ran them through the auditor. 91% grade match. Here's what we found.
An auditor that produces numbers without ground truth is guessing with extra steps. After shipping the first version of Hidden Layer's GEO auditor, the question we asked ourselves was: does it actually produce correct grades? Not roughly correct. Not directionally reasonable. Correct enough that a site owner can act on the result.
To answer that, we needed a labeled dataset: a set of real sites with hand-verified expected grades that we could run through the auditor and compare. That's the golden corpus — 50 sites, 4 tiers, every grade verified against live HTTP results.
Why we built it
The 'your GEO score is a lie' problem is real. A single audit run uses an unpinned model, a single LLM call, a truncated transcript, and no baseline to compare against. Any of those factors can shift a grade by a full letter. The golden corpus is the mechanism that catches regressions before they reach users: if a check change makes vercel.com drop from A to C, the corpus test fails and the change doesn't ship.
It also answers the hardest question in benchmark design: are we measuring what we think we're measuring? LLM recall is one of our checks. Brand recall correlates with how often a domain appears in training data — which is a real GEO signal, but it's not the same as 'this site is well-configured for agents.' The corpus helps us distinguish the two.
How we picked the 50 sites
Stratified sampling across 4 tiers: 15 sites each for the three main tiers, 5 for the known-cloaking category:
| Tier | Count | Criteria | Examples |
|---|---|---|---|
| well-cited | 15 | LLMs cite these sites regularly in responses. Authoritative, training-data-heavy. | stripe.com, vercel.com, developer.mozilla.org, anthropic.com |
| mid | 15 | Some citation, mid-range visibility. Mainstream brands and tools. | nike.com, figma.com, spotify.com, nextjs.org |
| invisible | 15 | Low citation rate. Blocked, paywalled, or just not LLM-useful. | reddit.com, nytimes.com, facebook.com, adidas.com |
| known-cloaking | 5 | Serve different content to crawlers vs humans. Baseline for cloak detection. | linkedin.com, yelp.com, genius.com |
For each site we recorded: expected grade (hand-labeled), expected citation behavior (high/medium/low/cloaking), robots policy (permissive/restrictive/blocks-ai), llms.txt presence, and Schema.org quality. Then we ran the auditor on each reachable site and compared.
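Concretely, each record looks something like this. This is a sketch only: the field names and value scales are our shorthand here, not the dataset's published schema.

```typescript
// Illustrative shape of one corpus record. Field names and enum values
// are assumptions based on the description above, not a confirmed schema.
type Tier = "well-cited" | "mid" | "invisible" | "known-cloaking";
type Grade = "A" | "B" | "C" | "D" | "E" | "F";

interface CorpusRecord {
  domain: string;                                   // e.g. "stripe.com"
  tier: Tier;
  expectedGrade: Grade;                             // hand-labeled
  expectedCitation: "high" | "medium" | "low" | "cloaking";
  robotsPolicy: "permissive" | "restrictive" | "blocks-ai";
  hasLlmsTxt: boolean;
  schemaOrgQuality: "rich" | "minimal" | "none";    // assumed scale
  cfReachable: boolean;                             // can the live test fetch it?
  notes: string;                                    // rationale for the label
}
```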
The data: what we labeled
The headline numbers before running a single audit:
| Signal | Count | Pct |
|---|---|---|
| Has llms.txt | 2 / 50 | 4% |
| Rich Schema.org markup | 12 / 50 | 24% |
| Permissive robots policy | 37 / 50 | 74% |
| Expected grade D or F | 24 / 50 | 48% |
| CF-reachable for live test | 37 / 50 | 74% |
Only 2 of 50 sites have llms.txt — stripe.com and shopify.com. Both are developer-tool companies that track the GEO space. The other 48 don't have it, including every AI company in the corpus (anthropic.com, openai.com, perplexity.ai). The signal is that early adoption is narrowly concentrated in companies whose engineers read Hacker News threads about llms.txt.
The citation paradox
The most revealing pattern: well-cited sites that score badly. LLMs answer questions about these domains constantly. Their GEO scores suggest agents can barely read them.
| Domain | Tier | Live Grade | Live Score | Why it matters |
|---|---|---|---|---|
| developer.mozilla.org | well-cited | D | 56 | The definitive web API reference. No llms.txt, minimal schema, no structured data. LLMs cite it anyway because it's in training data. |
| docs.python.org | well-cited | D | 52 | Official Python docs. Raw HTML, no schema.org, no llms.txt. Canonical reference, poor GEO. |
| nextjs.org | mid | F | 36 | robots.txt returns 404. JS-rendered content (text ratio 0.02). Zero JSON-LD. Same team as vercel.com (A/90). |
| anthropic.com | well-cited | C | 68 | An AI company scores C on an AI discoverability audit. No llms.txt, minimal schema. |
| openai.com | well-cited | C | 74 | Same story — AI API docs but grade C. Outperforms Anthropic by 6 points. Both need work. |
The MDN result is the starkest: every web developer uses MDN, LLMs cite it constantly, and it scores D/56. That score is technically correct — MDN has almost none of the GEO signals the auditor measures. But it also reveals the limits of GEO scoring: citation authority is a real signal that our current check suite doesn't capture well. Future work.
The nextjs.org vs vercel.com gap is a different kind of finding. Same team, radically different scores. Vercel's main site (A/90) is a developer showcase with rich schema and proper crawlability. The Next.js docs site (F/36) is an SPA with broken robots.txt and nearly no text content at crawl time. Two properties, one team, 54 points apart.
The floor: who actively blocks agents
15 sites in the corpus are 'invisible', and for most of them it's a choice rather than poor configuration. Three patterns show up: explicit AI-crawler blocks, auth and paywall barriers, and, in one case, plain neglect:
| Domain | Grade | Blocking mechanism |
|---|---|---|
| reddit.com | F/18 | Explicit AI crawler block in robots.txt. LLMs trained on Reddit data but current crawling blocked. |
| facebook.com | F/21 | Login wall. Restrictive crawler policy. Nothing to index without auth. |
| nytimes.com | F/35 | AI crawler block. Paywall. High brand recall in training data, zero current access. |
| adidas.com | F/15 | Permissive robots policy but minimal schema, no llms.txt — lowest score in corpus despite being a global brand. |
| linkedin.com | F/30 | AI crawlers blocked. Truncated profile previews to crawlers. Classic known-cloaking pattern. |
Adidas is worth calling out specifically. Permissive robots policy — AI crawlers are allowed. But no llms.txt, minimal schema, and a homepage that's mostly images. F/15 is the correct grade. You can allow access and still be invisible.
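The 'explicit AI crawler block' rows are the easiest pattern to detect mechanically. A minimal sketch, assuming a short list of known AI agent tokens; the simplified parsing is illustrative, since a real check needs full robots.txt group semantics (Allow rules, path specificity, wildcards):

```typescript
// Sketch of an explicit AI-crawler block check, the mechanism behind the
// reddit.com and nytimes.com rows above. Agent list and parsing are
// illustrative, not the auditor's actual implementation.
const AI_AGENTS = ["gptbot", "claudebot", "google-extended", "perplexitybot", "ccbot"];

function blocksAiCrawlers(robotsTxt: string): boolean {
  let agents: string[] = [];
  let inRules = false;
  for (const raw of robotsTxt.split("\n")) {
    const line = raw.split("#")[0].trim().toLowerCase();
    if (line.startsWith("user-agent:")) {
      if (inRules) { agents = []; inRules = false; } // a new group begins
      agents.push(line.slice("user-agent:".length).trim());
    } else if (line.startsWith("disallow:")) {
      inRules = true;
      const path = line.slice("disallow:".length).trim();
      // "Disallow: /" inside a group naming an AI agent = full block.
      if (path === "/" && agents.some((a) => AI_AGENTS.includes(a))) {
        return true;
      }
    } else if (line.startsWith("allow:")) {
      inRules = true;
    }
  }
  return false;
}
```

A fuller version would also treat a blanket `User-agent: *` disallow as blocking AI crawlers, since it blocks everything, agents included.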
Auditor accuracy: 91% grade match
We ran the corpus regression test on 12 of the 37 CF-reachable sites — a stratified sample of 3 per tier. Passing criterion: expected grade matches live grade exactly, or differs by at most one letter (e.g., expected B, got C = adjacent = pass).
| Metric | Result |
|---|---|
| Sites tested | 12 (stratified 3/tier) |
| Exact grade match | 8 / 12 (67%) |
| Adjacent grade match (within 1) | 11 / 12 (91.7%) |
| Overall pass rate (exact + adjacent) | 91.7% |
| Regression gate threshold | ≥ 85% |
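The criterion is small enough to write down. One detail we have to infer: the grade scale must include E, because the expected-D/live-F result described next counts as the one miss rather than an adjacent pass. A sketch:

```typescript
// Sketch of the adjacent-grade pass criterion. The scale is assumed to
// include E, which makes D -> F a two-step miss; that is the only reading
// consistent with nextjs.org (expected D, live F) failing below.
const SCALE: string[] = ["A", "B", "C", "D", "E", "F"];

function passes(expected: string, live: string): boolean {
  const d = Math.abs(SCALE.indexOf(expected) - SCALE.indexOf(live));
  return d <= 1; // 0 = exact, 1 = adjacent; both count as a pass
}

// From the table: 8 exact + 3 adjacent = 11 of 12 ≈ 91.7%,
// which clears the >= 85% regression gate.
```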
The one miss: nextjs.org scored F where we expected D (JS-rendered content pushed our text-ratio heuristic to a harsher verdict than our label). We're treating that as a data point, not a defect to fix: the F grade is arguably correct given how agents actually encounter the site.
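That heuristic is simple enough to sketch, which is exactly why it needs corpus calibration. We're assuming the ratio is visible text length over raw HTML length, our reading of the 0.02 figure for nextjs.org above:

```typescript
// Minimal text-ratio sketch: visible text length over raw HTML length.
// This is an assumed formulation, not the auditor's exact heuristic.
function textRatio(html: string): number {
  const text = html
    .replace(/<script[\s\S]*?<\/script>/gi, "") // drop script bodies
    .replace(/<style[\s\S]*?<\/style>/gi, "")   // drop style bodies
    .replace(/<[^>]+>/g, " ")                   // strip remaining tags
    .replace(/\s+/g, " ")
    .trim();
  return html.length === 0 ? 0 : text.length / html.length;
}
// An SPA shell yields something like 0.02: almost all bytes are markup
// and bundle references, almost none are prose an agent can read.
```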
The public dataset
The full 50-site labeled corpus is publicly available via Hidden Layer's dataset API. Each record includes domain, tier, expected grade, robots policy, llms.txt presence, Schema.org quality, expected citation behavior, CF reachability status, and the notes used to assign the label.
Formats: JSON (default) and CSV (add ?format=csv). API endpoint: hidden-layer-blogs.pages.dev/api/research/dataset. The corpus is static: we version it rather than updating it in place. When we extend it to 100 sites (Phase 4), the expanded corpus will ship as a new endpoint.
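Pulling it down is one request. A sketch: the endpoint and formats are as published above, but the field we filter on (`hasLlmsTxt`) follows our record sketch earlier, not a confirmed schema.

```typescript
// Fetch the corpus as JSON; add ?format=csv for CSV instead.
const DATASET_URL = "https://hidden-layer-blogs.pages.dev/api/research/dataset";

async function loadCorpus(): Promise<unknown[]> {
  const res = await fetch(DATASET_URL);
  if (!res.ok) throw new Error(`dataset fetch failed: ${res.status}`);
  return res.json(); // assumed: a JSON array of 50 labeled records
}

// Example: count llms.txt adopters. `hasLlmsTxt` is our assumed field
// name; check the published schema before relying on it.
loadCorpus().then((records: any[]) => {
  const adopters = records.filter((r) => r.hasLlmsTxt);
  console.log(`${adopters.length} of ${records.length} sites ship llms.txt`);
});
```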
What comes next
Three things the corpus reveals we need to improve: (1) citation authority as a signal — sites that LLMs cite despite poor GEO scores deserve partial credit somewhere; (2) JS-rendered content detection — our text-ratio heuristic needs calibration against the corpus; (3) expanding to 100 sites, with more coverage of e-commerce and international domains.
Run your own domain through the auditor at hidden-layer-blogs.pages.dev. Your result is produced by the same check logic that graded the corpus. If your Schema.org is as sparse as docs.python.org's or your robots.txt is as broken as nextjs.org's, you'll see it.