The 50-site golden corpus: how we verify our auditor actually works
GEO auditors are easy to build and hard to trust. We labeled 50 real sites by hand — 4 tiers, expected grades, live HTTP verification — then ran them through the auditor. 91% grade match. Here's what we found.
An auditor that produces numbers without ground truth is guessing with extra steps. After shipping the first version of Hidden Layer's GEO auditor, the question we asked ourselves was: does it actually produce correct grades? Not roughly correct. Not directionally reasonable. Correct enough that a site owner can act on the result.
To answer that, we needed a labeled dataset: a set of real sites with hand-verified expected grades that we could run through the auditor and compare. That's the golden corpus — 50 sites, 4 tiers, every grade verified against live HTTP results.
Why we built it
The 'your GEO score is a lie' problem is real. A single audit run uses an unpinned model, a single LLM call, a truncated transcript, and no baseline to compare against. Any of those factors can shift a grade by a full letter. The golden corpus is the mechanism that catches regressions before they reach users: if a check change makes vercel.com drop from A to C, the corpus test fails and the change doesn't ship.
It also answers the hardest question in benchmark design: are we measuring what we think we're measuring? LLM recall is one of our checks. Brand recall correlates with how often a domain appears in training data — which is a real GEO signal, but it's not the same as 'this site is well-configured for agents.' The corpus helps us distinguish the two.
How we picked the 50 sites
Stratified sampling across 4 tiers: 15 sites each for the three main tiers, 5 for the known-cloaking category:
| Tier | Count | Criteria | Examples |
|---|---|---|---|
| well-cited | 15 | LLMs cite these sites regularly in responses. Authoritative, training-data-heavy. | stripe.com, vercel.com, developer.mozilla.org, anthropic.com |
| mid | 15 | Some citation, mid-range visibility. Mainstream brands and tools. | nike.com, figma.com, spotify.com, nextjs.org |
| invisible | 15 | Low citation rate. Blocked, paywalled, or just not LLM-useful. | reddit.com, nytimes.com, facebook.com, adidas.com |
| known-cloaking | 5 | Serve different content to crawlers vs humans. Baseline for cloak detection. | linkedin.com, yelp.com, genius.com |
For each site we recorded: expected grade (hand-labeled), expected citation behavior (high/medium/low/cloaking), robots policy (permissive/restrictive/blocks-ai), llms.txt presence, and Schema.org quality. Then we ran the auditor on each reachable site and compared.
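Concretely, each record looks something like this. This is a sketch only: the field names and value scales are our shorthand here, not the dataset's published schema.

```typescript
// Illustrative shape of one corpus record. Field names and enum values
// are assumptions based on the description above, not a confirmed schema.
type Tier = "well-cited" | "mid" | "invisible" | "known-cloaking";
type Grade = "A" | "B" | "C" | "D" | "E" | "F";

interface CorpusRecord {
  domain: string;                                   // e.g. "stripe.com"
  tier: Tier;
  expectedGrade: Grade;                             // hand-labeled
  expectedCitation: "high" | "medium" | "low" | "cloaking";
  robotsPolicy: "permissive" | "restrictive" | "blocks-ai";
  hasLlmsTxt: boolean;
  schemaOrgQuality: "rich" | "minimal" | "none";    // assumed scale
  cfReachable: boolean;                             // can the live test fetch it?
  notes: string;                                    // rationale for the label
}
```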
The data: what we labeled
The headline numbers before running a single audit:
| Signal | Count | Pct |
|---|---|---|
| Has llms.txt | 2 / 50 | 4% |
| Rich Schema.org markup | 12 / 50 | 24% |
| Permissive robots policy | 37 / 50 | 74% |
| Expected grade D or F | 24 / 50 | 48% |
| CF-reachable for live test | 37 / 50 | 74% |
Only 2 of 50 sites have llms.txt — stripe.com and shopify.com. Both are developer-tool companies that track the GEO space. The other 48 don't have it, including every AI company in the corpus (anthropic.com, openai.com, perplexity.ai). The signal is that early adoption is narrowly concentrated in companies whose engineers read Hacker News threads about llms.txt.
The citation paradox
The most revealing pattern: well-cited sites that score badly. LLMs answer questions about these domains constantly. Their GEO scores suggest agents can barely read them.
| Domain | Tier | Live Grade | Live Score | Why it matters |
|---|---|---|---|---|
| developer.mozilla.org | well-cited | D | 56 | The definitive web API reference. No llms.txt, minimal schema, no structured data. LLMs cite it anyway because it's in training data. |
| docs.python.org | well-cited | D | 52 | Official Python docs. Raw HTML, no schema.org, no llms.txt. Canonical reference, poor GEO. |
| nextjs.org | mid | F | 36 | robots.txt returns 404. JS-rendered content (text ratio 0.02). Zero JSON-LD. Same team as vercel.com (A/90). |
| anthropic.com | well-cited | C | 68 | An AI company scores C on an AI discoverability audit. No llms.txt, minimal schema. |
| openai.com | well-cited | C | 74 | Same story — AI API docs but grade C. Outperforms Anthropic by 6 points. Both need work. |
The MDN result is the starkest: every web developer uses MDN, LLMs cite it constantly, and it scores D/56. That score is technically correct — MDN has almost none of the GEO signals the auditor measures. But it also reveals the limits of GEO scoring: citation authority is a real signal that our current check suite doesn't capture well. Future work.
The nextjs.org vs vercel.com gap is a different kind of finding. Same team, radically different scores. Vercel's main site (A/90) is a developer showcase with rich schema and proper crawlability. The Next.js docs site (F/36) is an SPA with broken robots.txt and nearly no text content at crawl time. Two properties, one team, 54 points apart.
The floor: who actively blocks agents
15 sites in the corpus are 'invisible', and for most of them it's a choice rather than poor configuration. Three patterns show up: explicit AI-crawler blocks, auth and paywall barriers, and, in one case, plain neglect:
| Domain | Grade | Blocking mechanism |
|---|---|---|
| reddit.com | F/18 | Explicit AI crawler block in robots.txt. LLMs trained on Reddit data but current crawling blocked. |
| facebook.com | F/21 | Login wall. Restrictive crawler policy. Nothing to index without auth. |
| nytimes.com | F/35 | AI crawler block. Paywall. High brand recall in training data, zero current access. |
| adidas.com | F/15 | Permissive robots policy but minimal schema, no llms.txt — lowest score in corpus despite being a global brand. |
| linkedin.com | F/30 | AI crawlers blocked. Truncated profile previews to crawlers. Classic known-cloaking pattern. |
Adidas is worth calling out specifically. Permissive robots policy — AI crawlers are allowed. But no llms.txt, minimal schema, and a homepage that's mostly images. F/15 is the correct grade. You can allow access and still be invisible.
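The 'explicit AI crawler block' rows are the easiest pattern to detect mechanically. A minimal sketch, assuming a short list of known AI agent tokens; the simplified parsing is illustrative, since a real check needs full robots.txt group semantics (Allow rules, path specificity, wildcards):

```typescript
// Sketch of an explicit AI-crawler block check, the mechanism behind the
// reddit.com and nytimes.com rows above. Agent list and parsing are
// illustrative, not the auditor's actual implementation.
const AI_AGENTS = ["gptbot", "claudebot", "google-extended", "perplexitybot", "ccbot"];

function blocksAiCrawlers(robotsTxt: string): boolean {
  let agents: string[] = [];
  let inRules = false;
  for (const raw of robotsTxt.split("\n")) {
    const line = raw.split("#")[0].trim().toLowerCase();
    if (line.startsWith("user-agent:")) {
      if (inRules) { agents = []; inRules = false; } // a new group begins
      agents.push(line.slice("user-agent:".length).trim());
    } else if (line.startsWith("disallow:")) {
      inRules = true;
      const path = line.slice("disallow:".length).trim();
      // "Disallow: /" inside a group naming an AI agent = full block.
      if (path === "/" && agents.some((a) => AI_AGENTS.includes(a))) {
        return true;
      }
    } else if (line.startsWith("allow:")) {
      inRules = true;
    }
  }
  return false;
}
```

A fuller version would also treat a blanket `User-agent: *` disallow as blocking AI crawlers, since it blocks everything, agents included.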
Auditor accuracy: 91% grade match
We ran the corpus regression test on 12 of the 37 CF-reachable sites — a stratified sample of 3 per tier. Passing criterion: expected grade matches live grade exactly, or differs by at most one letter (e.g., expected B, got C = adjacent = pass).
| Metric | Result |
|---|---|
| Sites tested | 12 (stratified 3/tier) |
| Exact grade match | 8 / 12 (67%) |
| Adjacent grade match (within 1) | 11 / 12 (91.7%) |
| Overall pass rate (exact + adjacent) | 91.7% |
| Regression gate threshold | ≥ 85% |
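The criterion is small enough to write down. One detail we have to infer: the grade scale must include E, because the expected-D/live-F result described next counts as the one miss rather than an adjacent pass. A sketch:

```typescript
// Sketch of the adjacent-grade pass criterion. The scale is assumed to
// include E, which makes D -> F a two-step miss; that is the only reading
// consistent with nextjs.org (expected D, live F) failing below.
const SCALE: string[] = ["A", "B", "C", "D", "E", "F"];

function passes(expected: string, live: string): boolean {
  const d = Math.abs(SCALE.indexOf(expected) - SCALE.indexOf(live));
  return d <= 1; // 0 = exact, 1 = adjacent; both count as a pass
}

// From the table: 8 exact + 3 adjacent = 11 of 12 ≈ 91.7%,
// which clears the >= 85% regression gate.
```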
The one miss: nextjs.org scored F where we expected D (JS-rendered content pushed our text-ratio heuristic to a harsher verdict than our label). We're treating that as a data point, not a defect to fix: the F grade is arguably correct given how agents actually encounter the site.
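That heuristic is simple enough to sketch, which is exactly why it needs corpus calibration. We're assuming the ratio is visible text length over raw HTML length, our reading of the 0.02 figure for nextjs.org above:

```typescript
// Minimal text-ratio sketch: visible text length over raw HTML length.
// This is an assumed formulation, not the auditor's exact heuristic.
function textRatio(html: string): number {
  const text = html
    .replace(/<script[\s\S]*?<\/script>/gi, "") // drop script bodies
    .replace(/<style[\s\S]*?<\/style>/gi, "")   // drop style bodies
    .replace(/<[^>]+>/g, " ")                   // strip remaining tags
    .replace(/\s+/g, " ")
    .trim();
  return html.length === 0 ? 0 : text.length / html.length;
}
// An SPA shell yields something like 0.02: almost all bytes are markup
// and bundle references, almost none are prose an agent can read.
```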
The public dataset
The full 50-site labeled corpus is publicly available via Hidden Layer's dataset API. Each record includes domain, tier, expected grade, robots policy, llms.txt presence, Schema.org quality, expected citation behavior, CF reachability status, and the notes used to assign the label.
Formats: JSON (default) and CSV (add ?format=csv). API endpoint: hidden-layer-blogs.pages.dev/api/research/dataset. The corpus is static: we version it rather than updating it in place. When we extend it to 100 sites (Phase 4), the expanded corpus will ship as a new endpoint.
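Pulling it down is one request. A sketch: the endpoint and formats are as published above, but the field we filter on (`hasLlmsTxt`) follows our record sketch earlier, not a confirmed schema.

```typescript
// Fetch the corpus as JSON; add ?format=csv for CSV instead.
const DATASET_URL = "https://hidden-layer-blogs.pages.dev/api/research/dataset";

async function loadCorpus(): Promise<unknown[]> {
  const res = await fetch(DATASET_URL);
  if (!res.ok) throw new Error(`dataset fetch failed: ${res.status}`);
  return res.json(); // assumed: a JSON array of 50 labeled records
}

// Example: count llms.txt adopters. `hasLlmsTxt` is our assumed field
// name; check the published schema before relying on it.
loadCorpus().then((records: any[]) => {
  const adopters = records.filter((r) => r.hasLlmsTxt);
  console.log(`${adopters.length} of ${records.length} sites ship llms.txt`);
});
```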
What comes next
Three things the corpus reveals we need to improve: (1) citation authority as a signal — sites that LLMs cite despite poor GEO scores deserve partial credit somewhere; (2) JS-rendered content detection — our text-ratio heuristic needs calibration against the corpus; (3) expanding to 100 sites, with more coverage of e-commerce and international domains.
Run your own domain through the auditor at hidden-layer-blogs.pages.dev. Your result is produced by the same check logic that graded the corpus. If your Schema.org is as sparse as docs.python.org's or your robots.txt is as broken as nextjs.org's, you'll see it.