
robots.txt in the AI era: what 10 major brands got wrong

robots.txt was designed for search engine crawlers in 1994. Now 12+ AI bots use it as policy. Most brands haven't updated their files in years — and it shows.

The Robots Exclusion Protocol was proposed by Martijn Koster in 1994 to let webmasters tell search engine crawlers which pages to skip. For thirty years, it worked fine. Google, Bing, Yahoo — a handful of well-behaved crawlers that most teams understood.

Then 2023 happened. OpenAI launched GPTBot. Anthropic published ClaudeBot documentation. Perplexity, Cohere, Meta, Apple, ByteDance — each with their own crawler UA string. Suddenly robots.txt became AI policy, and most brands had no policy at all.

What the major bots actually check

Each AI crawler looks for its own user-agent string in robots.txt. GPTBot looks for `User-agent: GPTBot`. ClaudeBot looks for `User-agent: ClaudeBot`. If neither is present, most crawlers fall back to the `User-agent: *` group, which on most sites amounts to `Allow: /`.

This means an absence of AI bot rules isn't neutral. On most sites, it means all AI crawlers are implicitly allowed. That may be your intent — but it should be a conscious decision, not a default.
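To see that fallback concretely, here is a minimal sketch using Python's standard-library robots.txt parser. The file contents and URL are hypothetical; the point is that with no GPTBot or ClaudeBot group, both crawlers inherit whatever the wildcard group says.

from urllib.robotparser import RobotFileParser

# A hypothetical robots.txt with no AI-specific groups, only the wildcard rule.
robots_txt = """\
User-agent: *
Allow: /
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# Neither GPTBot nor ClaudeBot has its own group, so both fall back to *.
print(parser.can_fetch("GPTBot", "https://example.com/pricing"))     # True
print(parser.can_fetch("ClaudeBot", "https://example.com/pricing"))  # True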

The 403 trap

Some brands use Cloudflare's Bot Fight Mode or similar WAF rules to block non-browser traffic at the edge. When an AI crawler requests robots.txt and receives a 403, it can't tell whether that refusal is deliberate policy or just an over-broad CDN rule, and since the same rule blocks every other URL on the domain, the practical result is equivalent to a blanket Disallow: /. The crawler treats the domain as inaccessible.

If you're using edge-level bot blocking, verify that your CDN rules pass through well-known AI crawler UAs. Cloudflare's verified bot list now includes GPTBot, ClaudeBot, and PerplexityBot — enabling "Allow verified bots" in the dashboard restores access.
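One way to verify is to request robots.txt while presenting an AI crawler's user-agent token and confirm the response is a 200, not a 403. A rough sketch in Python follows; the UA strings are abbreviated for illustration (real crawlers send longer ones), and some WAFs also key on IP ranges, so treat this as a first-pass check only.

import urllib.error
import urllib.request

# Abbreviated UA tokens for illustration; real crawler UAs are longer strings.
AI_AGENTS = ["GPTBot", "ClaudeBot", "PerplexityBot"]

for agent in AI_AGENTS:
    req = urllib.request.Request(
        "https://example.com/robots.txt",      # swap in your own domain
        headers={"User-Agent": agent},
    )
    try:
        with urllib.request.urlopen(req) as resp:
            print(agent, resp.status)          # expect 200
    except urllib.error.HTTPError as err:
        print(agent, err.code)                 # a 403 here is the trap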

Training vs. browsing bots: different rules, same file

Not all AI bots are the same. GPTBot and ClaudeBot are training crawlers: they download content for model training. OAI-SearchBot, Claude-SearchBot, and PerplexityBot are retrieval crawlers: they fetch and index content so the AI tool can cite and link to it when answering search-style queries. ChatGPT-User and Claude-User are on-demand fetchers, triggered when a user asks the AI to read a specific URL.

You may want different policies for each type. A brand that doesn't want AI training on its content might still want retrieval bots sending the AI tool's users to its pages. robots.txt supports this with separate User-agent blocks:

User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: OAI-SearchBot
Allow: /

User-agent: Claude-SearchBot
Allow: /
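If you also want user-triggered fetches to keep working (a person pasting one of your URLs into ChatGPT or Claude), the on-demand fetchers can be given their own groups in the same way; whether to allow them is a separate decision from training and search:

User-agent: ChatGPT-User
Allow: /

User-agent: Claude-User
Allow: /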

Content-Signals: the emerging extension

The Content-Signals draft adds structured directives to robots.txt as comments, expressing more granular intent than a binary Allow/Disallow. It's not a formal standard yet, but adding the comment costs nothing and signals your intent to crawlers that check for it.

# Content-Signal: allow-training allow-search disallow-training-commercial

Where most brands fail

Running Hidden Layer audits on 50+ domains reveals a consistent pattern: most brands have robots.txt files that predate AI crawlers entirely. They block Googlebot-Image and Bing crawlers with precision — and have no mention of GPTBot, ClaudeBot, or any other AI UA. Policy by omission.

The fix takes 10 minutes: add explicit User-agent blocks for each AI crawler with the Allow or Disallow that reflects your actual intent. Write it like policy, because that's what it is.
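A quick way to check whether your file is policy by omission is to list which AI user-agents it explicitly names. A rough sketch, assuming you fetch your own robots.txt over HTTPS; the UA list is illustrative rather than exhaustive, and the string match is deliberately naive:

import urllib.request

AI_AGENTS = [
    "GPTBot", "OAI-SearchBot", "ChatGPT-User",
    "ClaudeBot", "Claude-SearchBot", "Claude-User",
    "PerplexityBot",
]

# Fetch the live file (swap in your own domain).
with urllib.request.urlopen("https://example.com/robots.txt") as resp:
    body = resp.read().decode("utf-8", errors="replace").lower()

for agent in AI_AGENTS:
    explicit = f"user-agent: {agent.lower()}" in body
    print(f"{agent}: {'explicit rule' if explicit else 'falls back to *'}")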


See how your domain scores against these checks →
