
The 50-site golden corpus: how we verify our auditor actually works

GEO auditors are easy to build and hard to trust. We labeled 50 real sites by hand — 4 tiers, expected grades, live HTTP verification — then ran them through the auditor. 91% grade match. Here's what we found.

An auditor that produces numbers without ground truth is guessing with extra steps. After shipping the first version of Hidden Layer's GEO auditor, the first question we asked ourselves was: does it actually produce correct grades? Not roughly correct. Not directionally reasonable. Correct enough that a site owner can act on the result.

To answer that, we needed a labeled dataset: a set of real sites with hand-verified expected grades that we could run through the auditor and compare. That's the golden corpus — 50 sites, 4 tiers, every grade verified against live HTTP results.

Why we built it

The 'your GEO score is a lie' problem is real. A single audit run uses an unpinned model, a single LLM call, a truncated transcript, and no baseline to compare against. Any of those factors can shift a grade by a full letter. The golden corpus is the mechanism that catches regressions before they reach users: if a check change makes vercel.com drop from A to C, the corpus test fails and the change doesn't ship.
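The gate itself is simple. A minimal sketch (function and field names here are illustrative, not our production API) of the exact-or-adjacent check against the labeled corpus:

```python
# Minimal sketch of the corpus regression gate. The corpus maps each
# domain to a hand-labeled expected grade; a check change only ships if
# enough live grades still match within one letter.
GRADES = "ABCDF"  # ordered best to worst

def grade_distance(a: str, b: str) -> int:
    """Letter distance between two grades (A vs. C -> 2)."""
    return abs(GRADES.index(a) - GRADES.index(b))

def gate(expected: dict[str, str], live: dict[str, str],
         threshold: float = 0.85) -> bool:
    """Pass if the share of exact-or-adjacent grade matches meets the threshold."""
    passes = sum(1 for domain, grade in expected.items()
                 if grade_distance(grade, live[domain]) <= 1)
    return passes / len(expected) >= threshold

# Hypothetical run: vercel.com dropping from A to C is a 2-letter miss,
# so only 2 of 3 sites pass and the gate fails.
expected = {"vercel.com": "A", "stripe.com": "A", "reddit.com": "F"}
live     = {"vercel.com": "C", "stripe.com": "A", "reddit.com": "F"}
print(gate(expected, live))  # False
```

The 85% threshold matches the regression gate described below; everything else in the sketch is an assumption about shape, not a copy of our code.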

It also answers the hardest question in benchmark design: are we measuring what we think we're measuring? LLM recall is one of our checks. Brand recall correlates with how often a domain appears in training data — which is a real GEO signal, but it's not the same as 'this site is well-configured for agents.' The corpus helps us distinguish the two.

How we picked the 50 sites

Stratified sampling across 4 tiers, 15 sites each for the main tiers, 5 for the known-cloaking category:

| Tier | Count | Criteria | Examples |
|---|---|---|---|
| well-cited | 15 | LLMs cite these sites regularly in responses. Authoritative, training-data-heavy. | stripe.com, vercel.com, developer.mozilla.org, anthropic.com |
| mid | 15 | Some citation, mid-range visibility. Mainstream brands and tools. | nike.com, figma.com, spotify.com, nextjs.org |
| invisible | 15 | Low citation rate. Blocked, paywalled, or just not LLM-useful. | reddit.com, nytimes.com, facebook.com, adidas.com |
| known-cloaking | 5 | Serve different content to crawlers vs. humans. Baseline for cloak detection. | linkedin.com, yelp.com, genius.com |

For each site we recorded: expected grade (hand-labeled), expected citation behavior (high/medium/low/cloaking), robots policy (permissive/restrictive/blocks-ai), llms.txt presence, and Schema.org quality. Then we ran the auditor on each reachable site and compared.
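One record per site, with exactly those fields. A sketch of the label schema (field names and the stripe.com values shown are illustrative; the published dataset may use different keys):

```python
# Sketch of one corpus record as we label it. Field names are
# illustrative, not the exact keys of the published dataset.
from dataclasses import dataclass

@dataclass
class CorpusRecord:
    domain: str
    tier: str                # well-cited | mid | invisible | known-cloaking
    expected_grade: str      # hand-labeled A-F
    citation_behavior: str   # high | medium | low | cloaking
    robots_policy: str       # permissive | restrictive | blocks-ai
    has_llms_txt: bool
    schema_quality: str      # rich | minimal | none

# Hypothetical label for stripe.com (one of the two llms.txt adopters).
stripe = CorpusRecord(
    domain="stripe.com", tier="well-cited", expected_grade="A",
    citation_behavior="high", robots_policy="permissive",
    has_llms_txt=True, schema_quality="rich",
)
```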

The data: what we labeled

The headline numbers before running a single audit:

| Signal | Count | Pct |
|---|---|---|
| Has llms.txt | 2 / 50 | 4% |
| Rich Schema.org markup | 12 / 50 | 24% |
| Permissive robots policy | 37 / 50 | 74% |
| Expected grade D or F | 24 / 50 | 48% |
| CF-reachable for live test | 37 / 50 | 74% |
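These are plain prevalence counts over the 50 labels. A sketch of how any row in the table above is computed (field names illustrative):

```python
# Prevalence of one label value across the corpus, formatted like the
# table rows above ("2 / 50 (4%)").
def prevalence(records: list[dict], field: str, value) -> str:
    hits = sum(1 for r in records if r.get(field) == value)
    return f"{hits} / {len(records)} ({hits / len(records):.0%})"

# Toy corpus matching the llms.txt headline number: 2 of 50 sites.
labels = [{"has_llms_txt": True}] * 2 + [{"has_llms_txt": False}] * 48
print(prevalence(labels, "has_llms_txt", True))  # 2 / 50 (4%)
```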

Only 2 of 50 sites have llms.txt — stripe.com and shopify.com. Both are developer-tool companies that track the GEO space. The other 48 don't have it, including every AI company in the corpus (anthropic.com, openai.com, perplexity.ai). The signal is that early adoption is narrowly concentrated in companies whose engineers read Hacker News threads about llms.txt.

The citation paradox

The most revealing pattern: well-cited sites that score badly. LLMs answer questions about these domains constantly. Their GEO scores suggest agents can barely read them.

| Domain | Tier | Live Grade | Live Score | Why it matters |
|---|---|---|---|---|
| developer.mozilla.org | well-cited | D | 56 | The definitive web API reference. No llms.txt, minimal schema, no structured data. LLMs cite it anyway because it's in training data. |
| docs.python.org | well-cited | D | 52 | Official Python docs. Raw HTML, no Schema.org, no llms.txt. Canonical reference, poor GEO. |
| nextjs.org | mid | F | 36 | robots.txt returns 404. JS-rendered content (text ratio 0.02). Zero JSON-LD. Same team as vercel.com (A/90). |
| anthropic.com | well-cited | C | 68 | An AI company scores C on an AI discoverability audit. No llms.txt, minimal schema. |
| openai.com | well-cited | C | 74 | Same story: AI API docs but grade C. Outperforms anthropic.com by 6 points. Both need work. |

The MDN result is the starkest: every web developer uses MDN, LLMs cite it constantly, and it scores D/56. That score is technically correct — MDN has almost none of the GEO signals the auditor measures. But it also reveals the limits of GEO scoring: citation authority is a real signal that our current check suite doesn't capture well. Future work.

The nextjs.org vs vercel.com gap is a different kind of finding. Same team, radically different scores. Vercel's main site (A/90) is a developer showcase with rich schema and proper crawlability. The Next.js docs site (F/36) is an SPA with broken robots.txt and nearly no text content at crawl time. Two properties, one team, 54 points apart.

The floor: who actively blocks agents

15 sites in the corpus are 'invisible' — in most cases by choice rather than misconfiguration. The failures fall into three patterns: explicit AI-crawler blocks, auth walls and paywalls, and sites that permit crawling but expose almost nothing readable:

| Domain | Grade / Score | Blocking mechanism |
|---|---|---|
| reddit.com | F/18 | Explicit AI-crawler block in robots.txt. LLMs trained on Reddit data, but current crawling is blocked. |
| facebook.com | F/21 | Login wall. Restrictive crawler policy. Nothing to index without auth. |
| nytimes.com | F/35 | AI-crawler block. Paywall. High brand recall in training data, zero current access. |
| adidas.com | F/15 | Permissive robots policy but minimal schema and no llms.txt — lowest score in the corpus despite being a global brand. |
| linkedin.com | F/30 | AI crawlers blocked. Profile previews truncated for crawlers. Classic known-cloaking pattern. |

Adidas is worth calling out specifically. Permissive robots policy — AI crawlers are allowed. But no llms.txt, minimal schema, and a homepage that's mostly images. F/15 is the correct grade. You can allow access and still be invisible.

Auditor accuracy: 91% grade match

We ran the corpus regression test on 12 of the 37 CF-reachable sites — a stratified sample of 3 per tier. Passing criterion: expected grade matches live grade exactly, or differs by at most one letter (e.g., expected B, got C = adjacent = pass).

| Metric | Result |
|---|---|
| Sites tested | 12 (stratified 3/tier) |
| Exact grade match | 8 / 12 (67%) |
| Adjacent grade match (within 1 letter) | 11 / 12 (91%) |
| Overall pass rate (exact + adjacent) | 11 / 12 (91%) |
| Regression gate threshold | ≥ 85% |

The one miss: nextjs.org scored F where we expected D (JS-rendered content fooled our text-ratio heuristic into a worse verdict than warranted). We're treating that as a data point, not a fix — the F grade is arguably correct given how agents actually encounter the site.
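The text-ratio heuristic behind that miss is roughly visible-text length divided by raw-HTML length. A minimal sketch using Python's stdlib `HTMLParser` (our production crawler is not this, and the SPA snippet below is a toy, not nextjs.org's actual markup):

```python
# Rough sketch of a text-ratio heuristic: visible text length divided by
# raw HTML length. A JS-rendered SPA ships mostly script, so the ratio
# collapses (nextjs.org measured ~0.02 at crawl time).
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.skip_depth = 0   # nonzero while inside <script>/<style>
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        if not self.skip_depth:
            self.chunks.append(data.strip())

def text_ratio(html: str) -> float:
    parser = TextExtractor()
    parser.feed(html)
    text = " ".join(c for c in parser.chunks if c)
    return len(text) / max(len(html), 1)

# Toy SPA shell: a big script bundle and an empty root div.
spa = ("<html><head><script>" + "x" * 500 + "</script></head>"
       "<body><div id=root></div></body></html>")
print(round(text_ratio(spa), 3))  # near zero for a script-heavy shell
```

The calibration problem mentioned above is exactly where this kind of heuristic gets it wrong: a low ratio can mean "nothing to read" or "everything arrives via JS," and only the first deserves an F on its own.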

The public dataset

The full 50-site labeled corpus is publicly available via Hidden Layer's dataset API. Each record includes domain, tier, expected grade, robots policy, llms.txt presence, Schema.org quality, expected citation behavior, CF reachability status, and the notes used to assign the label.

Formats: JSON (default) and CSV (add ?format=csv). API endpoint: hidden-layer-blogs.pages.dev/api/research/dataset. The corpus is static — we version it rather than updating it in place. When we extend it to 100 sites (Phase 4), it will ship as a new endpoint.
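Consuming it is a one-liner plus whatever filtering you need. A sketch (the endpoint is from this post; the exact record keys in the live payload are an assumption, and the same filter works on a local copy):

```python
# Sketch of pulling the corpus and filtering by tier. The record keys
# ("domain", "tier") are assumed, not confirmed against the live payload.
import json
from urllib.request import urlopen

URL = "https://hidden-layer-blogs.pages.dev/api/research/dataset"

def load_corpus(url: str = URL) -> list[dict]:
    with urlopen(url) as resp:     # JSON is the default format
        return json.load(resp)

def by_tier(records: list[dict], tier: str) -> list[dict]:
    return [r for r in records if r.get("tier") == tier]

# The filter works identically on a local copy of the dataset:
sample = [
    {"domain": "stripe.com", "tier": "well-cited"},
    {"domain": "reddit.com", "tier": "invisible"},
]
print([r["domain"] for r in by_tier(sample, "invisible")])  # ['reddit.com']
```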

What comes next

Three things the corpus reveals we need to improve: (1) citation authority as a signal — sites that LLMs cite despite poor GEO scores deserve partial credit somewhere; (2) JS-rendered content detection — our text-ratio heuristic needs calibration against the corpus; (3) expanding to 100 sites, with more coverage of e-commerce and international domains.

Run your own domain through the auditor at hidden-layer-blogs.pages.dev. Your result will be compared against the same check logic that produced the corpus grades. If your Schema.org is as sparse as docs.python.org or your robots.txt is as broken as nextjs.org, you'll see it.

Hidden Layer Research
Independent GEO audit research. Data-first. Not affiliated with any LLM vendor.
