Bot Registry · April 8, 2026 · 8 min read

The 12 AI bots crawling your site right now

A field guide to every major AI crawler: what company runs it, what it does with your content, whether it respects robots.txt, and which user-agent token to put in your rules.

There are twelve AI crawlers that matter right now. Some train models. Some power live search. Some fetch content on demand when a user asks an AI tool to read a URL. Nearly all of them say they treat your robots.txt as policy, if you have one.

Training crawlers

Training crawlers download content at scale for use in model training datasets. They're not tied to live user queries — they run on schedules, build large corpora, and the content they collect shows up in model behaviour months later.

GPTBot — OpenAI

UA: Mozilla/5.0 AppleWebKit/537.36 (compatible; GPTBot/1.2; +https://openai.com/gptbot)

OpenAI's primary training crawler. Respects robots.txt. Blocking GPTBot means your content doesn't appear in future GPT training runs — it doesn't affect existing model knowledge or live ChatGPT browsing.

ClaudeBot — Anthropic

UA: Mozilla/5.0 AppleWebKit/537.36 (compatible; ClaudeBot/1.0; +claudebot@anthropic.com)

Anthropic's training crawler. Respects robots.txt. Like GPTBot, blocking it affects future training, not current model knowledge.

Google-Extended — Google

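A robots.txt control token rather than a crawler with its own user-agent string: Google's existing crawlers do the fetching, and the token controls whether your content is used to train Gemini (formerly Bard). Separate from Googlebot (search indexing), so it can be blocked independently without affecting Google Search.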

Applebot-Extended — Apple

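Apple's opt-out token for AI training, covering Apple Intelligence models. It does not crawl on its own: pages are fetched by Applebot (used for Spotlight and Siri search results), and Applebot-Extended controls whether that content may be used for training.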

CCBot — Common Crawl

Non-profit crawler that publishes a public web archive. Many open-source models train on Common Crawl data. Frequently blocked by brands concerned about AI training.

Meta-ExternalAgent — Meta

Used for training Meta's Llama and other AI models. Respects robots.txt.

Bytespider — ByteDance

ByteDance's crawler, used for TikTok, Doubao, and other AI products. Historically controversial because of doubts about whether it honours robots.txt. Hidden Layer recommends blocking it by default unless you have a specific reason to allow ByteDance AI training.

Retrieval/search crawlers

These bots fetch content in response to live user queries — when you ask Perplexity a question, it sends PerplexityBot to retrieve current information.

OAI-SearchBot — OpenAI

Powers ChatGPT's web browsing and SearchGPT. This is the bot you want to allow even if you block GPTBot — it sends users to your site and appears as a referral source in analytics.
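The "block GPTBot, allow OAI-SearchBot" split can be sanity-checked locally. The sketch below is a minimal illustration using Python's standard-library urllib.robotparser; the rules text and the example.com URL are assumptions made up for the demo, and the stdlib parser only approximates how any real crawler interprets the file.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules: opt out of training, stay visible in ChatGPT search.
RULES = """\
User-agent: GPTBot
Disallow: /

User-agent: OAI-SearchBot
Allow: /
"""


def can_fetch(rules, token, url="https://example.com/pricing"):
    """Parse robots.txt text and report whether `token` may fetch `url`."""
    parser = RobotFileParser()
    parser.parse(rules.splitlines())
    return parser.can_fetch(token, url)


if __name__ == "__main__":
    for token in ("GPTBot", "OAI-SearchBot"):
        status = "allowed" if can_fetch(RULES, token) else "blocked"
        print(f"{token}: {status}")
```

Each group is matched by its own token, so the Disallow under GPTBot has no effect on OAI-SearchBot.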

Claude-SearchBot — Anthropic

Powers Claude's web search feature. Same commercial reasoning as OAI-SearchBot — this is traffic and discovery, not training.

PerplexityBot — Perplexity

Powers Perplexity's real-time answers. Blocking PerplexityBot means your content doesn't appear in Perplexity answers, which for some industries is now a meaningful traffic source.

On-demand fetch bots

ChatGPT-User — OpenAI

Used when a ChatGPT user shares a URL in conversation. Appears in your logs as a one-off fetch of that URL, not a continuous crawl.

Claude-User — Anthropic

When a user pastes a URL and asks Claude to read it, this is the UA string used.
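To see which of these three groups is actually hitting a site, one option is to bucket the user-agent strings from access logs by the tokens listed above. The following is a rough sketch, assuming you already have the raw UA strings as a list; the function names are hypothetical, and matching is a plain case-insensitive substring test, which a spoofed UA would also pass.

```python
from collections import Counter

# Tokens from this article, grouped by its three categories.
# Some tokens (e.g. Google-Extended) are robots.txt controls and may
# never show up in request logs at all.
BOT_CATEGORIES = {
    "training": [
        "GPTBot", "ClaudeBot", "Google-Extended", "Applebot-Extended",
        "CCBot", "Meta-ExternalAgent", "Bytespider",
    ],
    "retrieval": ["OAI-SearchBot", "Claude-SearchBot", "PerplexityBot"],
    "on-demand": ["ChatGPT-User", "Claude-User"],
}


def classify_user_agent(ua):
    """Return (category, token) for a known AI bot, or None otherwise."""
    ua_lower = ua.lower()
    for category, tokens in BOT_CATEGORIES.items():
        for token in tokens:
            if token.lower() in ua_lower:
                return category, token
    return None


def tally_by_category(user_agents):
    """Count hits per category, ignoring user agents that match no token."""
    counts = Counter()
    for ua in user_agents:
        match = classify_user_agent(ua)
        if match is not None:
            counts[match[0]] += 1
    return counts


if __name__ == "__main__":
    sample = [
        "Mozilla/5.0 AppleWebKit/537.36 (compatible; GPTBot/1.2; +https://openai.com/gptbot)",
        "Mozilla/5.0 AppleWebKit/537.36 (compatible; ClaudeBot/1.0; +claudebot@anthropic.com)",
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) Chrome/120.0 Safari/537.36",
    ]
    print(dict(tally_by_category(sample)))
```

Because the check relies only on the self-reported UA string, treat the counts as an estimate rather than verified bot traffic.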

The robots.txt template

For most brands, the recommended starting point:

# Training crawlers — allow by default
User-agent: GPTBot
Allow: /

User-agent: ClaudeBot
Allow: /

User-agent: Google-Extended
Allow: /

User-agent: Applebot-Extended
Allow: /

# Training crawlers — common to block
User-agent: CCBot
Disallow: /

User-agent: Bytespider
Disallow: /

# Retrieval bots — almost always allow
User-agent: OAI-SearchBot
Allow: /

User-agent: Claude-SearchBot
Allow: /

User-agent: PerplexityBot
Allow: /

# On-demand fetch — almost always allow
User-agent: ChatGPT-User
Allow: /

User-agent: Claude-User
Allow: /

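This makes your intent explicit rather than relying on the User-agent: * fallback. The template has no group for Meta-ExternalAgent; add one with whichever rule matches your policy. Update the file when new AI crawlers launch, because this list will grow.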
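Before deploying a file like this, it can help to confirm that it parses the way you expect. Below is a minimal sketch using Python's standard-library urllib.robotparser: it embeds the template above (comments omitted), lists the expected outcome per token, and reports any mismatch. The check_policy helper and the example.com URL are assumptions for illustration, and real crawlers may interpret edge cases differently from the stdlib parser.

```python
from urllib.robotparser import RobotFileParser

# The template from this article, with comment lines omitted.
TEMPLATE = """\
User-agent: GPTBot
Allow: /

User-agent: ClaudeBot
Allow: /

User-agent: Google-Extended
Allow: /

User-agent: Applebot-Extended
Allow: /

User-agent: CCBot
Disallow: /

User-agent: Bytespider
Disallow: /

User-agent: OAI-SearchBot
Allow: /

User-agent: Claude-SearchBot
Allow: /

User-agent: PerplexityBot
Allow: /

User-agent: ChatGPT-User
Allow: /

User-agent: Claude-User
Allow: /
"""

# Expected result per token: True = may fetch, False = blocked.
EXPECTED = {
    "GPTBot": True, "ClaudeBot": True, "Google-Extended": True,
    "Applebot-Extended": True, "CCBot": False, "Bytespider": False,
    "OAI-SearchBot": True, "Claude-SearchBot": True, "PerplexityBot": True,
    "ChatGPT-User": True, "Claude-User": True,
}


def check_policy(robots_txt, expected, url="https://example.com/some/page"):
    """Return {token: actual_result} for tokens that differ from `expected`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    mismatches = {}
    for token, want in expected.items():
        got = parser.can_fetch(token, url)
        if got != want:
            mismatches[token] = got
    return mismatches


if __name__ == "__main__":
    problems = check_policy(TEMPLATE, EXPECTED)
    print("OK" if not problems else f"Mismatches: {problems}")
```

Running the same check in CI whenever robots.txt changes is one way to catch an accidental Disallow before it ships.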
