The 50-site golden corpus: ground truth for GEO audit calibration
We hand-labeled 50 domains across 4 tiers, live-verified 37 of them, and ran a grid search against the results. Here's what we learned — and why twitter_card_tags (0.67 correlation) is the most underweighted check in every GEO audit you've ever seen.
The previous post in this series identified four measurement sins in GEO auditing: N=1 LLM calls, unpinned models, truncated transcripts, and — the hardest one — no ground truth. 'An A is whatever the algorithm says.' We committed to building a 50-site golden corpus to fix that last one. It's built, verified, and the findings are not what we expected.
What the corpus is
The corpus is a labeled fixture: 50 domains, each with a hand-assigned expected_grade (A–F), a tier classification, and metadata about robots policy, llms.txt, and schema richness. It's used as a regression gate in the audit CI pipeline and as the calibration set for CHECK_WEIGHTS.
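To make the regression-gate role concrete, here is a minimal sketch in Python. `CorpusSite` mirrors a subset of the fixture fields, `run_audit` is a hypothetical stand-in for the live audit call, and the within-one-grade tolerance is our assumption for illustration, not a documented threshold.

```python
# Sketch of the CI regression gate. run_audit is a hypothetical stand-in
# for the live audit call; the within-one-grade tolerance is an assumption.
from dataclasses import dataclass
from typing import Callable, List

GRADE_ORDINAL = {"A": 4, "B": 3, "C": 2, "D": 1, "F": 0}

@dataclass
class CorpusSite:
    domain: str
    tier: str               # well-cited | mid | invisible | known-cloaking
    expected_grade: str     # A-F
    cf_edge_reachable: bool = True

def regression_failures(corpus: List[CorpusSite],
                        run_audit: Callable[[str], str]) -> List[str]:
    """Return the domains whose live grade drifts more than one letter."""
    failures = []
    for site in corpus:
        if not site.cf_edge_reachable:
            continue  # the unreachable sites can't be live-audited
        got = run_audit(site.domain)
        drift = abs(GRADE_ORDINAL[got] - GRADE_ORDINAL[site.expected_grade])
        if drift > 1:
            failures.append(f"{site.domain}: expected {site.expected_grade}, got {got}")
    return failures
```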
The four tiers represent distinct positions in the 'LLM citation distribution':
| Tier | N | Description | Median expected grade |
|---|---|---|---|
| well-cited | 15 | Canonical sources: LLMs cite these domains unprompted for their category | B |
| mid | 15 | Established brands: LLMs know them but don't cite them as defaults | C–D |
| invisible | 15 | Real companies with minimal LLM training corpus presence | D–F |
| known-cloaking | 5 | Sites with deliberate AI-blocking signals (social, paywalled media) | F |
The 'well-cited' tier is calibrated to what LLMs cite organically — not to what scores well on our rubric. Stripe, Vercel, and GitHub are in this tier because if you ask GPT-4 'what payment API should I use?', it names Stripe. The expected_grade is what we'd want the audit to return for a site that earns top citations in its category.
Verification: 37 of 50 sites, live-audited
Of the 50 corpus sites, 13 are unreachable from Cloudflare's edge network via standard HTTP (marked cf_edge_reachable: false in the fixture). These sites either block CF IPs, require JavaScript-rendered responses, or are behind WAFs that reject non-browser UAs. The remaining 37 were batch-verified through the live Hidden Layer API.
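For readers who want to reproduce the verification pass, the loop below is illustrative only: the endpoint path, query parameter, and response shape are placeholders, not the actual Hidden Layer API contract. The only fixture field it relies on is cf_edge_reachable.

```python
# Illustrative batch-verification loop. The endpoint path, query parameter,
# and response shape are placeholders, not the actual Hidden Layer API.
import json
import urllib.request
from urllib.parse import urlencode

def batch_verify(rows, base_url):
    results = {}
    for row in rows:
        if not row.get("cf_edge_reachable", True):
            continue  # skip the sites unreachable from the CF edge
        query = urlencode({"domain": row["domain"]})
        url = f"{base_url}/api/audit?{query}"  # hypothetical endpoint
        with urllib.request.urlopen(url, timeout=60) as resp:
            results[row["domain"]] = json.load(resp)  # e.g. {"grade": "D", "score": 56}
    return results
```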
Three findings from verification that changed the corpus:
- nextjs.org scored F/36 — not D as initially estimated. robots.txt returns 404 (robots_present: fail), content is Next.js-rendered (low content_efficiency), and no explicit AI bot rules. A canonical developer documentation site in the invisible tier would score D; nextjs.org has two compounding failures that push it to F.
- developer.mozilla.org scored D/56 — lower than its 'well-cited' tier implies. MDN has massive LLM citation presence (geo_cold_recall: pass, hn_mentions: pass) but thin structured data and no llms.txt. The gap between 'LLMs know this site' and 'this site has GEO signals in place' is wider here than anywhere else in the corpus.
- Social and cloaking sites (facebook.com, linkedin.com, yelp.com, glassdoor.com) all scored F — consistent with expected_grade F. Facebook robots.txt blocks every AI crawler explicitly. The robots.txt is actually a useful model: complete, deliberate, explicit policy for every named bot.
The grid search: 73.7% grade match on 38 sites
With 37+ live audit results, we ran a grid search (simulated annealing, 5,000 iterations) against the corpus expected_grades. Baseline match rate with original opinion-based weights: 71.1% exact, 97.4% within one grade. After correlation analysis and conservative adjustments: 73.7% exact match.
The within-1 rate of 97.4% tells the important story: the scoring algorithm rarely makes egregious errors (grade off by 2+). The 73.7% exact rate means roughly 1-in-4 sites is off by one grade. That's the margin we're calibrating against.
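Here is a simplified sketch of what the search optimizes. `grade_site` is a hypothetical stand-in for the real scoring function, and the acceptance step is greedy hill climbing rather than true simulated annealing (which also accepts some regressions early on); the point is the objective, not a reproduction of the production run.

```python
# Calibration objective sketch: exact and within-one-grade agreement
# between predicted and expected letter grades, plus a toy weight search.
import random

GRADE_ORDINAL = {"A": 4, "B": 3, "C": 2, "D": 1, "F": 0}

def match_rates(predicted, expected):
    """Exact and within-one-grade agreement between two lists of letter grades."""
    n = len(expected)
    exact = sum(p == e for p, e in zip(predicted, expected)) / n
    within1 = sum(abs(GRADE_ORDINAL[p] - GRADE_ORDINAL[e]) <= 1
                  for p, e in zip(predicted, expected)) / n
    return exact, within1  # e.g. (0.737, 0.974) after recalibration

def search_weights(weights, sites, expected, grade_site, iters=5000):
    """Perturb one weight at a time; keep changes that don't hurt exact match."""
    best = dict(weights)
    best_exact, _ = match_rates([grade_site(s, best) for s in sites], expected)
    for _ in range(iters):
        cand = dict(best)
        check = random.choice(list(cand))
        cand[check] = max(0, cand[check] + random.choice([-1, 1]))
        exact, _ = match_rates([grade_site(s, cand) for s in sites], expected)
        if exact >= best_exact:
            best, best_exact = cand, exact
    return best
```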
The surprising finding: twitter_card_tags at 0.67 correlation
We computed Pearson correlation between each check's pass/fail result and the domain's expected grade (converted to ordinal: A=4, B=3, C=2, D=1, F=0). The results upended several assumptions.
| Check | Correlation | Previous weight | New weight | Interpretation |
|---|---|---|---|---|
| gpt_bot_rule | +0.79 | 5 | 5 | Bot access rules are highest-corr group; well-known brands have explicit policies |
| training_search_mismatch | +0.76 | 8 | 8 | Justified: mixed signals are the worst outcome |
| twitter_card_tags | +0.67 | 2 | 4 | Was the most underweighted high-correlation check in the dataset |
| sitemap_url_count | +0.48 | 5 | 7 | Active content volume predicts B/A grade |
| sitemap_reachable | +0.47 | 8 | 8 | Already correctly weighted |
| open_graph_tags | +0.36 | 5 | 7 | Social presence = established brand signal |
| category_share_of_voice | +0.34 | 10 | 12 | Core GEO differentiator, bump justified |
| sameAs_entity_linking | +0.29 | 5 | 7 | Entity resolution across knowledge graphs |
| a2a_agent_card | -0.21 | 2 | 1 | Negative correlation: penalizes non-developer sites unfairly |
| oauth_protected_resource | -0.22 | 2 | 1 | Negative: ~0% adoption, net harm to general-purpose scores |
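For reference, each correlation in the table is a plain Pearson r between a binary pass/fail vector and the ordinal expected grade, which amounts to a point-biserial correlation. A minimal sketch, with illustrative variable names:

```python
# Pearson r between a binary pass/fail vector and the ordinal expected grade.
import numpy as np

GRADE_ORDINAL = {"A": 4, "B": 3, "C": 2, "D": 1, "F": 0}

def check_correlation(pass_fail, expected_grades):
    """pass_fail: 1/0 per verified site; expected_grades: letter grades."""
    x = np.asarray(pass_fail, dtype=float)
    y = np.array([GRADE_ORDINAL[g] for g in expected_grades], dtype=float)
    return float(np.corrcoef(x, y)[0, 1])
```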
twitter_card_tags at 0.67 correlation was the biggest surprise. We had it at weight 2 — 'low: Twitter/X cards; minor AI preview signal'. The corpus says otherwise. Established brands with B/A expected grades almost universally have complete Twitter Card metadata. It's a proxy for 'this site was built by a team that cares about structured metadata', and that correlates strongly with grade.
The negative correlations for agent_integration checks (a2a_agent_card: -0.21, oauth_protected_resource: -0.22) are expected and confirm the design decision to keep those weights low. Only developer-tool companies have these — and developer-tool companies tend to cluster in the mid tier (C/D), not at the top. Weighting them heavily would punish media companies and fashion brands for not having MCP servers.
The bot access paradox
The highest-correlation group is bot access rules (gpt_bot_rule: +0.79, cc_bot_rule: +0.78, claude_bot_rule: +0.77). This looks like it validates the current weighting — except it's mostly spurious correlation.
Well-known brands in the well-cited tier (Stripe, Vercel, Cloudflare, GitHub) tend to have explicit robots.txt entries for each AI bot because they have dedicated developer-relations teams that update infrastructure documentation. Less well-known brands (mid and invisible tier) often have default robots.txt files that haven't been touched since 2019.
The correlation isn't 'explicit bot rules cause high grade'. It's 'having explicit bot rules is a proxy for being the kind of company that actively maintains its web infrastructure'. That's a real signal — but it means bot access rules are partly measuring team maturity, not just GEO signal compliance. Something to watch as the corpus grows.
The calibration limit: N=38 is too small for per-weight precision
With 57 weights and 38 samples, the grid search is operating at fewer than one sample per parameter (roughly 1.5 parameters per sample). Any individual weight change is noise-sensitive. Our approach was conservative: only change weights with |correlation| ≥ 0.25 and cap changes at ±2 points. We changed 8 weights.
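Expressed as a guardrail, the rule looks roughly like this: the search proposes a new weight, and the rule only lets the change through when the check's correlation clears the threshold, clamped to ±2 points. The zero floor is our assumption (no current weight is below 1).

```python
# Conservative guardrail as a clamp over a proposed weight change.
def clamp_adjustment(old_weight, proposed_weight, correlation,
                     threshold=0.25, cap=2):
    if abs(correlation) < threshold:
        return old_weight                       # not enough signal: keep as-is
    delta = max(-cap, min(cap, proposed_weight - old_weight))
    return max(0, old_weight + delta)           # zero floor is an assumption
```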
The right version of this analysis uses N=200+. The corpus expansion plan:
- Current: 50 sites, 37 verified (this post)
- Phase 4: 200 sites, 150+ verified — sufficient for per-category weight calibration
- Phase 5: 500 sites with actual citation-frequency data from ChatGPT/Claude/Perplexity API queries — this is the ground truth that replaces 'our rubric' with 'actual AI citation behavior'
Phase 5 is the version that makes GEO audit scores meaningful in the same way that click-through rate data makes SEO audit scores meaningful. We're building toward it.
What the corpus revealed about 'canonical' sites
The most interesting finding isn't in the numbers — it's in the gap between 'well-cited by LLMs' and 'scores well on our audit'. Vercel and Stripe are the canonical examples: they're in the well-cited tier because LLMs cite them unprompted, yet both score C on the current audit.
Why? The recalibration we shipped in Phase 3 expanded the agent_integration and ai_visibility categories. Vercel and Stripe don't have llms.txt (surprisingly), don't have agent-card.json, and have modest structured data for their homepage. The checks they fail are the ones added most recently — the 2025-era signals.
This creates a useful heuristic: if a site is well-cited by LLMs but scores C on our audit, it's usually because it earned its training corpus presence before these signals existed. It's a legacy moat. The question for those brands is whether they'll maintain that moat as newer sites adopt the signals GEO audits now check for.
The corpus is public
The 50-site fixture is published at /api/research/dataset.json (JSON) and /api/research/dataset.csv (CSV). Each row contains: domain, tier, expected_grade, robots_policy, llms_txt, schema_org, expected_citation_behavior, and notes including the live audit result where verified.
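If you want to work with the fixture programmatically, a minimal loader looks like this. The base URL is a placeholder (the post gives only the paths), and we assume the JSON endpoint returns a list of row objects.

```python
# Minimal loader for the published fixture. Base URL is a placeholder; we
# assume the JSON endpoint returns a list of row objects.
import json
import urllib.request

def load_corpus(base_url):
    with urllib.request.urlopen(f"{base_url}/api/research/dataset.json",
                                timeout=30) as resp:
        rows = json.load(resp)
    by_tier = {}
    for row in rows:
        by_tier.setdefault(row["tier"], []).append(row["domain"])
    return rows, by_tier
```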
If you use the corpus to evaluate your own GEO audit tool and find a site we've miscategorized, file an issue at the project repository. The corpus is a living document — we'll expand and recalibrate it as more verified data comes in.
Next in the series: 400 queries × 4 models × 30 days — what ChatGPT, Claude, Perplexity, and Gemini actually cite when asked category questions. That's the citation frequency data that will replace expected_grade with measured citation rate.
See how your domain scores against these checks →
Run a free audit