State of GEO 2026: the data from 128 live audits
Discoverability 78%, agent integration 13%. We audited 128 domains across 13 industries. The foundation is mostly solid. The agentic web is not.
Not a survey, not self-reported — 128 live HTTP audits across 13 industries, running 46+ checks each. Here is what the data says.
The short version: the web is mostly visible to AI crawlers at a basic level. It is almost entirely unprepared for AI agents that need to take actions.
The six-category breakdown
Every audit scores six categories. The pass rates across 128 domains paint a clear picture of where the industry stands:
| Category | Pass rate | What it measures |
|---|---|---|
| Discoverability | 78% | Sitemap, robots.txt, canonical URLs, HTTPS — the basics |
| Bot Access | 75% | robots.txt rules for specific AI crawlers |
| GEO Presence | 60% | llms.txt, Open Graph, schema.org entity data, cold recall |
| AI Discovery | 42% | Structured data depth, content signals, retrieval anchors |
| AI Visibility | 21% | LLM-cited presence, category share-of-voice, knowledge graph links |
| Agent Integration | 13% | MCP, A2A, OpenAPI, OAuth — can an AI agent act on this site? |
The gradient is stark. Discoverability (78%) and bot access (75%) are the web doing what the web does — HTTP infrastructure built over decades. GEO presence (60%) shows the early adopters are here. AI discovery (42%) and AI visibility (21%) reveal most sites haven't thought about how LLMs find and reference them. Agent integration (13%) is barely started.
Agent integration: the 13% number
Agent integration measures whether a site can be operated by an AI agent — not just read. The checks include: MCP server card (13% pass), A2A agent card (10% pass), OpenAPI spec (2% pass), OAuth resource server metadata (13% pass), API catalog (11% pass).
The 13% category pass rate means roughly 1 in 8 checks in the agent integration category passes across all 128 audited domains. Not 1 in 8 domains — 1 in 8 checks. Most sites pass nothing in this category.
This is the agentic web gap. ChatGPT Operator, Claude's computer-use mode, and emerging AI shopping agents are all trying to take actions on behalf of users. They need machine-readable APIs, structured authentication, and agent discovery files. Today, almost no commercial website has published these.
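To make that concrete, here is a minimal sketch of how agent discovery files can be probed: a handful of well-known URLs checked over HTTP. The paths shown are the commonly referenced locations for each protocol; some, like the MCP card and OpenAPI spec locations, are not yet uniformly standardized, so treat this as an illustration rather than the audit's exact check logic.

```python
import requests

# Commonly referenced discovery locations; treat the exact paths as
# assumptions, since several are still converging across spec versions.
AGENT_ENDPOINTS = {
    "a2a_agent_card": "/.well-known/agent.json",
    "oauth_resource_metadata": "/.well-known/oauth-protected-resource",
    "api_catalog": "/.well-known/api-catalog",
    "mcp_server_card": "/.well-known/mcp.json",  # assumed location
    "openapi_spec": "/openapi.json",             # assumed location
}

def probe_agent_integration(origin: str) -> dict[str, bool]:
    """Return which agent-discovery endpoints respond with HTTP 200."""
    results = {}
    for name, path in AGENT_ENDPOINTS.items():
        try:
            r = requests.get(origin.rstrip("/") + path, timeout=10)
            results[name] = r.status_code == 200
        except requests.RequestException:
            results[name] = False
    return results

print(probe_agent_integration("https://example.com"))
```

On most commercial domains today, every one of these probes comes back negative, which is what the 13% figure is measuring.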
The protocol cliff
Looking at the six agentic protocols individually:
| Protocol | Adoption | Notes |
|---|---|---|
| llms.txt | 49% | Close to half — the highest-adoption emerging standard by far |
| MCP server card | 13% | Developer-tool companies leading; consumer brands absent |
| OAuth metadata | 13% | Present mainly on API-first businesses (payment, developer tools) |
| API catalog | 11% | Only sites with documented public APIs |
| A2A agent card | 10% | Very early; mostly AI/developer ecosystem |
| OpenAPI spec | 2% | Lowest adoption — requires formal API documentation infrastructure |
llms.txt at 49% is the outlier. It is a static text file you can create in 20 minutes. The other five protocols require actual engineering work — API infrastructure, authentication implementation, agent discovery files that reference real endpoints. The cliff between llms.txt (49%) and everything else (2–13%) is the 'static content vs. live systems' divide.
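For scale, here is roughly what a minimal llms.txt looks like under the llmstxt.org proposal: an H1 name, a blockquote summary, and H2 sections of links. The domain and pages below are placeholders.

```markdown
# Example Co

> Example Co builds widgets. This file points AI systems at the pages worth reading.

## Docs

- [Product overview](https://example.com/docs/overview.md): what the product does and for whom
- [API reference](https://example.com/docs/api.md): endpoints, auth, rate limits

## Optional

- [Company history](https://example.com/about.md)
```

That is the whole file, which is why its adoption curve looks nothing like the protocols that require live infrastructure.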
If your interpretation of GEO is 'add an llms.txt file', the data says you're half-adopted. If your interpretation is 'be operable by AI agents', the data says you're at roughly 10% of the way there.
Bot access: who is allowing which AI crawlers
Across 128 domains, the major AI crawlers see broadly similar allow rates — all in the 75–78% range:
| Bot | Allow | Block | Silent |
|---|---|---|---|
| OAI-SearchBot (ChatGPT browse) | 78% | 13% | 9% |
| Claude-User (on-demand fetch) | 77% | 14% | 9% |
| Meta-ExternalAgent (Llama training) | 77% | 14% | 9% |
| Claude-SearchBot (Claude browse) | 76% | 15% | 9% |
| ChatGPT-User (on-demand fetch) | 75% | 16% | 9% |
The similarity across bots reflects how most robots.txt files are written: either allow all crawlers (no rules), or block all AI crawlers together. Nuanced per-bot policy — allowing ChatGPT browse but blocking training, for example — exists in a small minority of sites.
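For sites that do want per-bot policy, it is a few lines of robots.txt: allow the fetchers that answer live user queries, disallow the training crawlers. A sketch follows; the user agents match the table above, plus GPTBot (OpenAI's training crawler) for contrast, and the list should be adjusted to your own policy.

```text
# Allow live browse / on-demand fetch bots
User-agent: OAI-SearchBot
Allow: /

User-agent: ChatGPT-User
Allow: /

User-agent: Claude-SearchBot
Allow: /

User-agent: Claude-User
Allow: /

# Block training crawlers
User-agent: GPTBot
Disallow: /

User-agent: Meta-ExternalAgent
Disallow: /

# Default for everything else
User-agent: *
Allow: /
```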
The 13–16% of domains blocking these bots form the 'deliberate AI exclusion' cohort: paywalled media, social networks, and brands that have explicitly added AI disallow rules to robots.txt. Reddit, NYT, LinkedIn, Facebook — these are the known blockers. The 75–78% allow rate means most of the web is accessible to AI crawlers by default.
The zero-adoption signals
Some signals have essentially no adoption across 128 domains:
- schema.org speakable markup: 0% — zero domains in the dataset have this
- Agent-mode content view: 3% — a ?mode=agent URL parameter for AI-specific responses
- Markdown content negotiation: 5% — Accept: text/markdown support for AI processing
- Content-Signal HTTP directive: 7% — response headers for AI crawler guidance
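The last three of these are plain HTTP checks. Below is a minimal probe sketch; the detection heuristics (body comparison for ?mode=agent, Content-Type sniffing for markdown negotiation, a literal Content-Signal response header) are illustrative assumptions, not the audit's exact logic.

```python
import requests

def probe_http_signals(base_url: str) -> dict[str, bool]:
    """Heuristic checks for the three HTTP-level signals listed above."""
    plain = requests.get(base_url, timeout=10)

    # 1. Agent-mode content view: does ?mode=agent return a distinct body?
    agent = requests.get(base_url, params={"mode": "agent"}, timeout=10)
    agent_mode = agent.ok and agent.text != plain.text

    # 2. Markdown content negotiation: is Accept: text/markdown honoured?
    md = requests.get(base_url, headers={"Accept": "text/markdown"}, timeout=10)
    markdown_negotiation = "text/markdown" in md.headers.get("Content-Type", "")

    # 3. Content-Signal directive: is an AI-guidance response header present?
    content_signal = "Content-Signal" in plain.headers

    return {
        "agent_mode_view": agent_mode,
        "markdown_negotiation": markdown_negotiation,
        "content_signal_header": content_signal,
    }

print(probe_http_signals("https://example.com"))
```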
These are the signals the audit checks for because they represent where the specification community is heading — not where the deployed web is today. speakable markup was defined in schema.org to help voice assistants identify quotable content. It has been in the spec for years. Adoption is zero.
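For reference, this is the shape of the markup nobody is shipping: a JSON-LD snippet whose speakable property points at the CSS selectors that hold quotable content. The selectors and URL below are placeholders.

```json
{
  "@context": "https://schema.org",
  "@type": "Article",
  "headline": "State of GEO 2026: the data from 128 live audits",
  "speakable": {
    "@type": "SpeakableSpecification",
    "cssSelector": [".article-summary", ".key-stats"]
  },
  "url": "https://example.com/state-of-geo-2026"
}
```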
The gap between 'spec defined' and 'deployed' is typically 5–7 years for web standards. These signals are in that gap. We check for them anyway because first-movers matter — the sites that implement speakable markup today will have a search/citation advantage as AI interfaces for content grow.
Cold recall: 21% invisible to LLMs
21% of the 115 domains we tested for cold recall are not in LLM training data — the model has no knowledge of them when asked directly. This is not a GEO optimization problem. It is a content presence problem: these domains haven't been written about enough (HN, Reddit, Wikipedia, trade press) to appear in training data.
For these sites, optimizing llms.txt and structured data is premature. The prerequisite is generating the kind of content — technical blog posts, case studies, public documentation — that gets cited, linked, and eventually ingested into training sets. GEO starts with being present, not just being readable.
What the data implies for 2026
The pattern across six categories tells a consistent story: the web adapted to crawlers (discoverability 78%), is adapting to AI-specific crawlers (bot access 75%, llms.txt 49%), is beginning to add AI discovery signals (GEO presence 60%), and has barely started the work of becoming agent-operable (agent integration 13%).
Each transition has taken roughly 2–3 years to shift from early-adopter to majority. llms.txt launched in late 2024 and is already at 49%. MCP launched in late 2024 and is at 13%. A2A launched in early 2025 and is at 10%.
If the adoption curve follows the same pattern as llms.txt, MCP and A2A will cross 40%+ by late 2026. That's the window in which implementing agent integration is a competitive differentiator rather than table stakes.
The live dataset
Everything above comes from live audits, not a snapshot. The research dashboard at /research auto-derives new hero stats and panel data every time a new audit runs. The aggregate dataset is downloadable at /api/research/audit-dataset (JSON) and /api/research/audit-dataset?format=csv (CSV, CC-BY-4.0 license).
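If you want to run your own analysis, pulling the aggregate is a few lines. A sketch in Python; the base URL is a placeholder for wherever the dashboard is hosted, and the response shape is not documented here.

```python
import requests

BASE = "https://example.com"  # placeholder: the site hosting /research

# JSON aggregate
data = requests.get(f"{BASE}/api/research/audit-dataset", timeout=30).json()

# CSV download (CC-BY-4.0)
csv_text = requests.get(
    f"{BASE}/api/research/audit-dataset",
    params={"format": "csv"},
    timeout=30,
).text
with open("audit-dataset.csv", "w", encoding="utf-8") as f:
    f.write(csv_text)
```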
If you audit your own domain and notice your data in the aggregate, that's expected — every completed audit feeds into this dataset. If you want to be removed, contact us and we'll add your domain to the exclusion list.
See how your domain scores against these checks →
Run a free audit