The 50-site golden corpus: ground truth for GEO audit calibration
We hand-labeled 50 domains across 4 tiers, live-verified 37 of them, and ran a grid search against the results. Here's what we learned — and why twitter_card_tags (0.67 correlation) is the most underweighted check in every GEO audit you've ever seen.
The previous post in this series identified four measurement sins in GEO auditing: N=1 LLM calls, unpinned models, truncated transcripts, and — the hardest one — no ground truth. 'An A is whatever the algorithm says.' We committed to building a 50-site golden corpus to fix that last one. It's built, verified, and the findings are not what we expected.
What the corpus is
The corpus is a labeled fixture: 50 domains, each with a hand-assigned expected_grade (A–F), a tier classification, and metadata about robots policy, llms.txt, and schema richness. It's used as a regression gate in the audit CI pipeline and as the calibration set for CHECK_WEIGHTS.
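To make the regression-gate role concrete, here is a minimal sketch in Python. `CorpusSite` mirrors a subset of the fixture fields, `run_audit` is a hypothetical stand-in for the live audit call, and the within-one-grade tolerance is our assumption for illustration, not a documented threshold.

```python
# Sketch of the CI regression gate. run_audit is a hypothetical stand-in
# for the live audit call; the within-one-grade tolerance is an assumption.
from dataclasses import dataclass
from typing import Callable, List

GRADE_ORDINAL = {"A": 4, "B": 3, "C": 2, "D": 1, "F": 0}

@dataclass
class CorpusSite:
    domain: str
    tier: str               # well-cited | mid | invisible | known-cloaking
    expected_grade: str     # A-F
    cf_edge_reachable: bool = True

def regression_failures(corpus: List[CorpusSite],
                        run_audit: Callable[[str], str]) -> List[str]:
    """Return the domains whose live grade drifts more than one letter."""
    failures = []
    for site in corpus:
        if not site.cf_edge_reachable:
            continue  # the unreachable sites can't be live-audited
        got = run_audit(site.domain)
        drift = abs(GRADE_ORDINAL[got] - GRADE_ORDINAL[site.expected_grade])
        if drift > 1:
            failures.append(f"{site.domain}: expected {site.expected_grade}, got {got}")
    return failures
```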
The four tiers represent distinct positions in the 'LLM citation distribution':
| Tier | N | Description | Median expected grade |
|---|---|---|---|
| well-cited | 15 | Canonical sources: LLMs cite these domains unprompted for their category | B |
| mid | 15 | Established brands: LLMs know them but don't cite them as defaults | C–D |
| invisible | 15 | Real companies with minimal LLM training corpus presence | D–F |
| known-cloaking | 5 | Sites with deliberate AI-blocking signals (social, paywalled media) | F |
The 'well-cited' tier is calibrated to what LLMs cite organically — not to what scores well on our rubric. Stripe, Vercel, and GitHub are in this tier because if you ask GPT-4 'what payment API should I use?', it names Stripe. The expected_grade is what we'd want the audit to return for a site that earns top citations in its category.
Verification: 37 of 50 sites, live-audited
Of the 50 corpus sites, 13 are unreachable from Cloudflare's edge network via standard HTTP (marked cf_edge_reachable: false in the fixture). These sites either block CF IPs, require JavaScript-rendered responses, or are behind WAFs that reject non-browser UAs. The remaining 37 were batch-verified through the live Hidden Layer API.
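For readers who want to reproduce the verification pass, the loop below is illustrative only: the endpoint path, query parameter, and response shape are placeholders, not the actual Hidden Layer API contract. The only fixture field it relies on is cf_edge_reachable.

```python
# Illustrative batch-verification loop. The endpoint path, query parameter,
# and response shape are placeholders, not the actual Hidden Layer API.
import json
import urllib.request
from urllib.parse import urlencode

def batch_verify(rows, base_url):
    results = {}
    for row in rows:
        if not row.get("cf_edge_reachable", True):
            continue  # skip the sites unreachable from the CF edge
        query = urlencode({"domain": row["domain"]})
        url = f"{base_url}/api/audit?{query}"  # hypothetical endpoint
        with urllib.request.urlopen(url, timeout=60) as resp:
            results[row["domain"]] = json.load(resp)  # e.g. {"grade": "D", "score": 56}
    return results
```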
Three findings from verification that changed the corpus:
- nextjs.org scored F/36 — not D as initially estimated. robots.txt returns 404 (robots_present: fail), content is Next.js-rendered (low content_efficiency), and no explicit AI bot rules. A canonical developer documentation site in the invisible tier would score D; nextjs.org has two compounding failures that push it to F.
- developer.mozilla.org scored D/56 — lower than its 'well-cited' tier implies. MDN has massive LLM citation presence (geo_cold_recall: pass, hn_mentions: pass) but thin structured data and no llms.txt. The gap between 'LLMs know this site' and 'this site has GEO signals in place' is wider here than anywhere else in the corpus.
- Social and cloaking sites (facebook.com, linkedin.com, yelp.com, glassdoor.com) all scored F — consistent with expected_grade F. Facebook robots.txt blocks every AI crawler explicitly. The robots.txt is actually a useful model: complete, deliberate, explicit policy for every named bot.
The grid search: 73.7% grade match on 38 sites
With 37+ live audit results, we ran a grid search (simulated annealing, 5,000 iterations) against the corpus expected_grades. Baseline match rate with original opinion-based weights: 71.1% exact, 97.4% within one grade. After correlation analysis and conservative adjustments: 73.7% exact match.
The within-1 rate of 97.4% tells the important story: the scoring algorithm rarely makes egregious errors (grade off by 2+). The 73.7% exact rate means roughly 1-in-4 sites is off by one grade. That's the margin we're calibrating against.
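Here is a simplified sketch of what the search optimizes. `grade_site` is a hypothetical stand-in for the real scoring function, and the acceptance step is greedy hill climbing rather than true simulated annealing (which also accepts some regressions early on); the point is the objective, not a reproduction of the production run.

```python
# Calibration objective sketch: exact and within-one-grade agreement
# between predicted and expected letter grades, plus a toy weight search.
import random

GRADE_ORDINAL = {"A": 4, "B": 3, "C": 2, "D": 1, "F": 0}

def match_rates(predicted, expected):
    """Exact and within-one-grade agreement between two lists of letter grades."""
    n = len(expected)
    exact = sum(p == e for p, e in zip(predicted, expected)) / n
    within1 = sum(abs(GRADE_ORDINAL[p] - GRADE_ORDINAL[e]) <= 1
                  for p, e in zip(predicted, expected)) / n
    return exact, within1  # e.g. (0.737, 0.974) after recalibration

def search_weights(weights, sites, expected, grade_site, iters=5000):
    """Perturb one weight at a time; keep changes that don't hurt exact match."""
    best = dict(weights)
    best_exact, _ = match_rates([grade_site(s, best) for s in sites], expected)
    for _ in range(iters):
        cand = dict(best)
        check = random.choice(list(cand))
        cand[check] = max(0, cand[check] + random.choice([-1, 1]))
        exact, _ = match_rates([grade_site(s, cand) for s in sites], expected)
        if exact >= best_exact:
            best, best_exact = cand, exact
    return best
```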
The surprising finding: twitter_card_tags at 0.67 correlation
We computed Pearson correlation between each check's pass/fail result and the domain's expected grade (converted to ordinal: A=4, B=3, C=2, D=1, F=0). The results upended several assumptions.
| Check | Correlation | Previous weight | New weight | Interpretation |
|---|---|---|---|---|
| gpt_bot_rule | +0.79 | 5 | 5 | Bot access rules are highest-corr group; well-known brands have explicit policies |
| training_search_mismatch | +0.76 | 8 | 8 | Justified: mixed signals are the worst outcome |
| twitter_card_tags | +0.67 | 2 | 4 | Was the most underweighted high-correlation check in the dataset |
| sitemap_url_count | +0.48 | 5 | 7 | Active content volume predicts B/A grade |
| sitemap_reachable | +0.47 | 8 | 8 | Already correctly weighted |
| open_graph_tags | +0.36 | 5 | 7 | Social presence = established brand signal |
| category_share_of_voice | +0.34 | 10 | 12 | Core GEO differentiator, bump justified |
| sameAs_entity_linking | +0.29 | 5 | 7 | Entity resolution across knowledge graphs |
| a2a_agent_card | -0.21 | 2 | 1 | Negative correlation: penalizes non-developer sites unfairly |
| oauth_protected_resource | -0.22 | 2 | 1 | Negative: ~0% adoption, net harm to general-purpose scores |
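For reference, each correlation in the table is a plain Pearson r between a binary pass/fail vector and the ordinal expected grade, which amounts to a point-biserial correlation. A minimal sketch, with illustrative variable names:

```python
# Pearson r between a binary pass/fail vector and the ordinal expected grade.
import numpy as np

GRADE_ORDINAL = {"A": 4, "B": 3, "C": 2, "D": 1, "F": 0}

def check_correlation(pass_fail, expected_grades):
    """pass_fail: 1/0 per verified site; expected_grades: letter grades."""
    x = np.asarray(pass_fail, dtype=float)
    y = np.array([GRADE_ORDINAL[g] for g in expected_grades], dtype=float)
    return float(np.corrcoef(x, y)[0, 1])
```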
twitter_card_tags at 0.67 correlation was the biggest surprise. We had it at weight 2 — 'low: Twitter/X cards; minor AI preview signal'. The corpus says otherwise. Established brands with B/A expected grades almost universally have complete Twitter Card metadata. It's a proxy for 'this site was built by a team that cares about structured metadata', and that correlates strongly with grade.
The negative correlations for agent_integration checks (a2a_agent_card: -0.21, oauth_protected_resource: -0.22) are expected and confirm the design decision to keep those weights low. Only developer-tool companies have these — and developer-tool companies tend to cluster in the mid tier (C/D), not at the top. Weighting them heavily would punish media companies and fashion brands for not having MCP servers.
The bot access paradox
The highest-correlation group is bot access rules (gpt_bot_rule: +0.79, cc_bot_rule: +0.78, claude_bot_rule: +0.77). This looks like it validates the current weighting — except it's mostly spurious correlation.
Well-known brands in the well-cited tier (Stripe, Vercel, Cloudflare, GitHub) tend to have explicit robots.txt entries for each AI bot because they have dedicated developer-relations teams that update infrastructure documentation. Less well-known brands (mid and invisible tier) often have default robots.txt files that haven't been touched since 2019.
The correlation isn't 'explicit bot rules cause high grade'. It's 'having explicit bot rules is a proxy for being the kind of company that actively maintains its web infrastructure'. That's a real signal — but it means bot access rules are partly measuring team maturity, not just GEO signal compliance. Something to watch as the corpus grows.
The calibration limit: N=38 is too small for per-weight precision
With 57 weights and 38 samples, the grid search is operating at fewer than one sample per parameter (roughly 1.5 parameters per sample). Any individual weight change is noise-sensitive. Our approach was conservative: only change weights with |correlation| ≥ 0.25 and cap changes at ±2 points. We changed 8 weights.
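Expressed as a guardrail, the rule looks roughly like this: the search proposes a new weight, and the rule only lets the change through when the check's correlation clears the threshold, clamped to ±2 points. The zero floor is our assumption (no current weight is below 1).

```python
# Conservative guardrail as a clamp over a proposed weight change.
def clamp_adjustment(old_weight, proposed_weight, correlation,
                     threshold=0.25, cap=2):
    if abs(correlation) < threshold:
        return old_weight                       # not enough signal: keep as-is
    delta = max(-cap, min(cap, proposed_weight - old_weight))
    return max(0, old_weight + delta)           # zero floor is an assumption
```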
The right version of this analysis uses N=200+. The corpus expansion plan:
- Current: 50 sites, 37 verified (this post)
- Phase 4: 200 sites, 150+ verified — sufficient for per-category weight calibration
- Phase 5: 500 sites with actual citation-frequency data from ChatGPT/Claude/Perplexity API queries — this is the ground truth that replaces 'our rubric' with 'actual AI citation behavior'
Phase 5 is the version that makes GEO audit scores meaningful in the same way that click-through rate data makes SEO audit scores meaningful. We're building toward it.
What the corpus revealed about 'canonical' sites
The most interesting finding isn't in the numbers — it's in the gap between 'well-cited by LLMs' and 'scores well on our audit'. Vercel and Stripe are the canonical examples: they're in the well-cited tier because LLMs cite them unprompted, yet both score C on the current audit.
Why? The recalibration we shipped in Phase 3 expanded the agent_integration and ai_visibility categories. Vercel and Stripe don't have llms.txt (surprisingly), don't have agent-card.json, and have modest structured data for their homepage. The checks they fail are the ones added most recently — the 2025-era signals.
This creates a useful heuristic: if a site is well-cited by LLMs but scores C on our audit, it's usually because it earned its training corpus presence before these signals existed. It's a legacy moat. The question for those brands is whether they'll maintain that moat as newer sites adopt the signals GEO audits now check for.
The corpus is public
The 50-site fixture is published at /api/research/dataset.json (JSON) and /api/research/dataset.csv (CSV). Each row contains: domain, tier, expected_grade, robots_policy, llms_txt, schema_org, expected_citation_behavior, and notes including the live audit result where verified.
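If you want to work with the fixture programmatically, a minimal loader looks like this. The base URL is a placeholder (the post gives only the paths), and we assume the JSON endpoint returns a list of row objects.

```python
# Minimal loader for the published fixture. Base URL is a placeholder; we
# assume the JSON endpoint returns a list of row objects.
import json
import urllib.request

def load_corpus(base_url):
    with urllib.request.urlopen(f"{base_url}/api/research/dataset.json",
                                timeout=30) as resp:
        rows = json.load(resp)
    by_tier = {}
    for row in rows:
        by_tier.setdefault(row["tier"], []).append(row["domain"])
    return rows, by_tier
```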
If you use the corpus to evaluate your own GEO audit tool and find a site we've miscategorized, file an issue at the project repository. The corpus is a living document — we'll expand and recalibrate it as more verified data comes in.
Next in the series: 400 queries × 4 models × 30 days — what ChatGPT, Claude, Perplexity, and Gemini actually cite when asked category questions. That's the citation frequency data that will replace expected_grade with measured citation rate.
See how your domain scores against these checks →
Run a free audit