State of GEO 2026: the data from 128 live audits
Discoverability 78%, agent integration 13%. We audited 128 domains across 13 industries. The foundation is mostly solid. The agentic web is not.
Not a survey, not self-reported — 128 live HTTP audits across 13 industries, running 46+ checks each. Here is what the data says.
The short version: the web is mostly visible to AI crawlers at a basic level. It is almost entirely unprepared for AI agents that need to take actions.
The six-category breakdown
Every audit scores six categories. The pass rates across 128 domains paint a clear picture of where the industry stands:
| Category | Pass rate | What it measures |
|---|---|---|
| Discoverability | 78% | Sitemap, robots.txt, canonical URLs, HTTPS — the basics |
| Bot Access | 75% | robots.txt rules for specific AI crawlers |
| GEO Presence | 60% | llms.txt, Open Graph, schema.org entity data, cold recall |
| AI Discovery | 42% | Structured data depth, content signals, retrieval anchors |
| AI Visibility | 21% | LLM-cited presence, category share-of-voice, knowledge graph links |
| Agent Integration | 13% | MCP, A2A, OpenAPI, OAuth — can an AI agent act on this site? |
The gradient is stark. Discoverability (78%) and bot access (75%) are the web doing what the web does — HTTP infrastructure built over decades. GEO presence (60%) shows the early adopters are here. AI discovery (42%) and AI visibility (21%) reveal most sites haven't thought about how LLMs find and reference them. Agent integration (13%) is barely started.
Agent integration: the 13% number
Agent integration measures whether a site can be operated by an AI agent — not just read. The checks include: MCP server card (13% pass), A2A agent card (10% pass), OpenAPI spec (2% pass), OAuth resource server metadata (13% pass), API catalog (11% pass).
The 13% category pass rate means roughly 1 in 8 checks in the agent integration category passes across all 128 audited domains. Not 1 in 8 domains — 1 in 8 checks. Most sites pass nothing in this category.
This is the agentic web gap. ChatGPT Operator, Claude's computer-use mode, and emerging AI shopping agents are all trying to take actions on behalf of users. They need machine-readable APIs, structured authentication, and agent discovery files. Today, almost no commercial website has published these.
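To make that concrete, here is a minimal sketch of how agent discovery files can be probed: a handful of well-known URLs checked over HTTP. The paths shown are the commonly referenced locations for each protocol; some, like the MCP card and OpenAPI spec locations, are not yet uniformly standardized, so treat this as an illustration rather than the audit's exact check logic.

```python
import requests

# Commonly referenced discovery locations; treat the exact paths as
# assumptions, since several are still converging across spec versions.
AGENT_ENDPOINTS = {
    "a2a_agent_card": "/.well-known/agent.json",
    "oauth_resource_metadata": "/.well-known/oauth-protected-resource",
    "api_catalog": "/.well-known/api-catalog",
    "mcp_server_card": "/.well-known/mcp.json",  # assumed location
    "openapi_spec": "/openapi.json",             # assumed location
}

def probe_agent_integration(origin: str) -> dict[str, bool]:
    """Return which agent-discovery endpoints respond with HTTP 200."""
    results = {}
    for name, path in AGENT_ENDPOINTS.items():
        try:
            r = requests.get(origin.rstrip("/") + path, timeout=10)
            results[name] = r.status_code == 200
        except requests.RequestException:
            results[name] = False
    return results

print(probe_agent_integration("https://example.com"))
```

On most commercial domains today, every one of these probes comes back negative, which is what the 13% figure is measuring.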
The protocol cliff
Looking at the six agentic protocols individually:
| Protocol | Adoption | Notes |
|---|---|---|
| llms.txt | 49% | Close to half — the highest-adoption emerging standard by far |
| MCP server card | 13% | Developer-tool companies leading; consumer brands absent |
| OAuth metadata | 13% | Present mainly on API-first businesses (payment, developer tools) |
| API catalog | 11% | Only sites with documented public APIs |
| A2A agent card | 10% | Very early; mostly AI/developer ecosystem |
| OpenAPI spec | 2% | Lowest adoption — requires formal API documentation infrastructure |
llms.txt at 49% is the outlier. It is a static text file you can create in 20 minutes. The other five protocols require actual engineering work — API infrastructure, authentication implementation, agent discovery files that reference real endpoints. The cliff between llms.txt (49%) and everything else (2–13%) is the 'static content vs. live systems' divide.
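For scale, here is roughly what a minimal llms.txt looks like under the llmstxt.org proposal: an H1 name, a blockquote summary, and H2 sections of links. The domain and pages below are placeholders.

```markdown
# Example Co

> Example Co builds widgets. This file points AI systems at the pages worth reading.

## Docs

- [Product overview](https://example.com/docs/overview.md): what the product does and for whom
- [API reference](https://example.com/docs/api.md): endpoints, auth, rate limits

## Optional

- [Company history](https://example.com/about.md)
```

That is the whole file, which is why its adoption curve looks nothing like the protocols that require live infrastructure.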
If your interpretation of GEO is 'add an llms.txt file', the data says you're half-adopted. If your interpretation is 'be operable by AI agents', the data says you're at roughly 10% of the way there.
Bot access: who is allowing which AI crawlers
Across 128 domains, the major AI crawlers see broadly similar allow rates — all in the 75–78% range:
| Bot | Allow | Block | Silent |
|---|---|---|---|
| OAI-SearchBot (ChatGPT browse) | 78% | 13% | 9% |
| Claude-User (on-demand fetch) | 77% | 14% | 9% |
| Meta-ExternalAgent (Llama training) | 77% | 14% | 9% |
| Claude-SearchBot (Claude browse) | 76% | 15% | 9% |
| ChatGPT-User (on-demand fetch) | 75% | 16% | 9% |
The similarity across bots reflects how most robots.txt files are written: either allow all crawlers (no rules), or block all AI crawlers together. Nuanced per-bot policy — allowing ChatGPT browse but blocking training, for example — exists in a small minority of sites.
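For sites that do want per-bot policy, it is a few lines of robots.txt: allow the fetchers that answer live user queries, disallow the training crawlers. A sketch follows; the user agents match the table above, plus GPTBot (OpenAI's training crawler) for contrast, and the list should be adjusted to your own policy.

```text
# Allow live browse / on-demand fetch bots
User-agent: OAI-SearchBot
Allow: /

User-agent: ChatGPT-User
Allow: /

User-agent: Claude-SearchBot
Allow: /

User-agent: Claude-User
Allow: /

# Block training crawlers
User-agent: GPTBot
Disallow: /

User-agent: Meta-ExternalAgent
Disallow: /

# Default for everything else
User-agent: *
Allow: /
```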
The 13–16% of domains blocking these bots form the 'deliberate AI exclusion' cohort: paywalled media, social networks, and brands that have explicitly added AI disallow rules to robots.txt. Reddit, NYT, LinkedIn, Facebook — these are the known blockers. The 75–78% allow rate means most of the web is accessible to AI crawlers by default.
The zero-adoption signals
Some signals have essentially no adoption across 128 domains:
- schema.org speakable markup: 0% — zero domains in the dataset have this
- Agent-mode content view: 3% — a ?mode=agent URL parameter for AI-specific responses
- Markdown content negotiation: 5% — Accept: text/markdown support for AI processing
- Content-Signal HTTP directive: 7% — response headers for AI crawler guidance
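The last three of these are plain HTTP checks. Below is a minimal probe sketch; the detection heuristics (body comparison for ?mode=agent, Content-Type sniffing for markdown negotiation, a literal Content-Signal response header) are illustrative assumptions, not the audit's exact logic.

```python
import requests

def probe_http_signals(base_url: str) -> dict[str, bool]:
    """Heuristic checks for the three HTTP-level signals listed above."""
    plain = requests.get(base_url, timeout=10)

    # 1. Agent-mode content view: does ?mode=agent return a distinct body?
    agent = requests.get(base_url, params={"mode": "agent"}, timeout=10)
    agent_mode = agent.ok and agent.text != plain.text

    # 2. Markdown content negotiation: is Accept: text/markdown honoured?
    md = requests.get(base_url, headers={"Accept": "text/markdown"}, timeout=10)
    markdown_negotiation = "text/markdown" in md.headers.get("Content-Type", "")

    # 3. Content-Signal directive: is an AI-guidance response header present?
    content_signal = "Content-Signal" in plain.headers

    return {
        "agent_mode_view": agent_mode,
        "markdown_negotiation": markdown_negotiation,
        "content_signal_header": content_signal,
    }

print(probe_http_signals("https://example.com"))
```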
These are the signals the audit checks for because they represent where the specification community is heading — not where the deployed web is today. speakable markup was defined in schema.org to help voice assistants identify quotable content. It has been in the spec for years. Adoption is zero.
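For reference, this is the shape of the markup nobody is shipping: a JSON-LD snippet whose speakable property points at the CSS selectors that hold quotable content. The selectors and URL below are placeholders.

```json
{
  "@context": "https://schema.org",
  "@type": "Article",
  "headline": "State of GEO 2026: the data from 128 live audits",
  "speakable": {
    "@type": "SpeakableSpecification",
    "cssSelector": [".article-summary", ".key-stats"]
  },
  "url": "https://example.com/state-of-geo-2026"
}
```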
The gap between 'spec defined' and 'deployed' is typically 5–7 years for web standards. These signals are in that gap. We check for them anyway because first-movers matter — the sites that implement speakable markup today will have a search/citation advantage as AI interfaces for content grow.
Cold recall: 21% invisible to LLMs
21% of the 115 domains we tested for cold recall are not in LLM training data — the model has no knowledge of them when asked directly. This is not a GEO optimization problem. It is a content presence problem: these domains haven't been written about enough (HN, Reddit, Wikipedia, trade press) to appear in training data.
For these sites, optimizing llms.txt and structured data is premature. The prerequisite is generating the kind of content — technical blog posts, case studies, public documentation — that gets cited, linked, and eventually ingested into training sets. GEO starts with being present, not just being readable.
What the data implies for 2026
The pattern across six categories tells a consistent story: the web adapted to crawlers (discoverability 78%), is adapting to AI-specific crawlers (bot access 75%, llms.txt 49%), is beginning to add AI discovery signals (GEO presence 60%), and has barely started the work of becoming agent-operable (agent integration 13%).
Each transition has taken roughly 2–3 years to shift from early-adopter to majority. llms.txt launched in late 2024 and is already at 49%. MCP launched in late 2024 and is at 13%. A2A launched in early 2025 and is at 10%.
If the adoption curve follows the same pattern as llms.txt, MCP and A2A will cross 40%+ by late 2026. That's the window in which implementing agent integration is a competitive differentiator rather than table stakes.
The live dataset
Everything above comes from live audits, not a snapshot. The research dashboard at /research auto-derives new hero stats and panel data every time a new audit runs. The aggregate dataset is downloadable at /api/research/audit-dataset (JSON) and /api/research/audit-dataset?format=csv (CSV, CC-BY-4.0 license).
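If you want to run your own analysis, pulling the aggregate is a few lines. A sketch in Python; the base URL is a placeholder for wherever the dashboard is hosted, and the response shape is not documented here.

```python
import requests

BASE = "https://example.com"  # placeholder: the site hosting /research

# JSON aggregate
data = requests.get(f"{BASE}/api/research/audit-dataset", timeout=30).json()

# CSV download (CC-BY-4.0)
csv_text = requests.get(
    f"{BASE}/api/research/audit-dataset",
    params={"format": "csv"},
    timeout=30,
).text
with open("audit-dataset.csv", "w", encoding="utf-8") as f:
    f.write(csv_text)
```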
If you audit your own domain and notice your data in the aggregate, that's expected — every completed audit feeds into this dataset. If you want to be removed, contact us and we'll add your domain to the exclusion list.
See how your domain scores against these checks →
Run a free audit