Should I allow AI crawlers?

It depends on your goals. Allowing AI crawlers lets your content appear in ChatGPT, Claude, and Perplexity answers, which drives AI visibility and citations. Blocking them keeps your content out of training data but also removes you from AI answer engines. Most public-facing sites benefit from allowing them; sites with proprietary content may prefer to block.

Which user-agent strings should I use?

OpenAI: GPTBot. Anthropic: ClaudeBot (training), Claude-Web (live retrieval). Google: Google-Extended (training, separate from Googlebot). Perplexity: PerplexityBot. Meta: FacebookBot, Meta-ExternalAgent. ByteDance: Bytespider. Common Crawl: CCBot. Use the exact token for each crawler: a misspelled or partial name is not guaranteed to match.

Does Googlebot include AI training?

No. Googlebot crawls for the search index. Google-Extended is a separate robots.txt token that controls whether your content is used for Gemini (formerly Bard) training. You can allow Googlebot (keeping search visibility) while disallowing Google-Extended (opting out of AI training). The two are independent toggles.

Do AI crawlers respect robots.txt?

The major providers (OpenAI, Anthropic, Google, Perplexity, Common Crawl) state that they honour robots.txt. Some scrapers ignore it, but those are not the AI engines whose visibility matters. For scrapers that ignore robots.txt, WAF rate limiting and bot-management rules work better than robots.txt.

How to Fix AI Crawler Access

AI engines (ChatGPT, Claude, Perplexity, Gemini, Brave Leo) crawl the web with named bots. Block them and you don't appear in their answers. The Agent Readiness audit lists which AI crawlers your site allows or blocks at robots.txt and WAF layers. This guide covers the policy decision, the per-crawler robots.txt patterns, and the WAF rules that matter.

1. The major AI crawlers

Provider	User-Agent	Purpose
OpenAI	`GPTBot`	Training data
OpenAI	`OAI-SearchBot`	Live search (ChatGPT browsing)
OpenAI	`ChatGPT-User`	User-triggered fetches
Anthropic	`ClaudeBot`	Training data
Anthropic	`Claude-Web`	Live retrieval
Anthropic	`anthropic-ai`	Legacy training bot
Google	`Google-Extended`	Gemini (formerly Bard) training control; a robots.txt token only, separate from Googlebot
Perplexity	`PerplexityBot`	Search + training
Common Crawl	`CCBot`	Public dataset (used by many AI labs)
Meta	`FacebookBot`, `Meta-ExternalAgent`	Llama training
ByteDance	`Bytespider`	Doubao / TikTok AI
Apple	`Applebot-Extended`	Apple Intelligence training
DuckDuckGo	`DuckAssistBot`	Duck.ai answers
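
For scripting against this table (log analysis, dashboards), the rows can be held as a small lookup. The sketch below is illustrative only: `CRAWLERS` and `identify_crawler` are hypothetical names, the purposes are copied from the table, and matching is a plain case-insensitive substring test on the User-Agent header. Google-Extended and Applebot-Extended are left out on the assumption, based on Google's and Apple's published guidance, that they are robots.txt control tokens and are not sent as a request User-Agent.

```python
from typing import Optional, Tuple

# Token -> (provider, purpose), copied from the table above.
# Assumption: only tokens that appear in real request headers are listed.
CRAWLERS = {
    "GPTBot": ("OpenAI", "training"),
    "OAI-SearchBot": ("OpenAI", "live search"),
    "ChatGPT-User": ("OpenAI", "user-triggered fetch"),
    "ClaudeBot": ("Anthropic", "training"),
    "Claude-Web": ("Anthropic", "live retrieval"),
    "anthropic-ai": ("Anthropic", "legacy training"),
    "PerplexityBot": ("Perplexity", "search + training"),
    "CCBot": ("Common Crawl", "public dataset"),
    "FacebookBot": ("Meta", "training"),
    "Meta-ExternalAgent": ("Meta", "training"),
    "Bytespider": ("ByteDance", "training"),
    "DuckAssistBot": ("DuckDuckGo", "answers"),
}


def identify_crawler(user_agent: str) -> Optional[Tuple[str, str, str]]:
    """Return (token, provider, purpose) for a known AI crawler, else None."""
    ua = user_agent.lower()
    for token, (provider, purpose) in CRAWLERS.items():
        if token.lower() in ua:
            return token, provider, purpose
    return None


if __name__ == "__main__":
    sample = ("Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; "
              "compatible; GPTBot/1.0; +https://openai.com/gptbot)")
    print(identify_crawler(sample))
```

A substring test is easy to spoof, so treat the result as a label for reporting rather than as proof of who sent the request.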

2. Decide policy

Allow all (recommended for most public sites)

Maximises AI visibility — your content appears in ChatGPT, Claude, Perplexity, Gemini answers. Best for businesses that want to be cited and recommended.

Allow live retrieval, block training

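OAI-SearchBot, Claude-Web, and PerplexityBot can fetch pages for real-time answers, but GPTBot, ClaudeBot, and Google-Extended cannot train on your content. This is the middle ground for sites with proprietary content. Note that PerplexityBot is listed above as search plus training, so allowing it is a partial exception to the no-training goal.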

Block all

Maximum content protection — but invisible to AI answer engines. Trade-off: defending content vs being recommended.

3. Robots.txt patterns

Allow all AI crawlers (default state)

# No specific rules needed — default allow
User-agent: *
Allow: /

Allow search bots, block training bots

User-agent: OAI-SearchBot
Allow: /

User-agent: ChatGPT-User
Allow: /

User-agent: Claude-Web
Allow: /

User-agent: PerplexityBot
Allow: /

# Block training-only bots
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: anthropic-ai
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: CCBot
Disallow: /
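
Before deploying a pattern like this, it can help to check that it parses the way you expect. A minimal sketch using Python's standard `urllib.robotparser`; `ROBOTS_TXT` is a shortened copy of the rules above and `allowed` is a hypothetical helper name. Python's parser is only one implementation, so a real crawler may resolve edge cases differently.

```python
from urllib.robotparser import RobotFileParser

# Shortened copy of the selective pattern above.
ROBOTS_TXT = """\
User-agent: OAI-SearchBot
Allow: /

User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /
"""


def allowed(robots_txt: str, agent: str, url: str = "https://example.com/") -> bool:
    """Return True if `agent` may fetch `url` under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


if __name__ == "__main__":
    for agent in ("OAI-SearchBot", "GPTBot", "ClaudeBot", "Googlebot"):
        print(agent, allowed(ROBOTS_TXT, agent))
```

Agents with no matching group, such as Googlebot here, fall through to the default of being allowed, which is why this pattern needs no `User-agent: *` group.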

Block all AI

User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Claude-Web
Disallow: /

User-agent: anthropic-ai
Disallow: /

User-agent: OAI-SearchBot
Disallow: /

User-agent: ChatGPT-User
Disallow: /

User-agent: PerplexityBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: FacebookBot
Disallow: /

User-agent: Meta-ExternalAgent
Disallow: /

User-agent: Bytespider
Disallow: /

User-agent: Applebot-Extended
Disallow: /

User-agent: DuckAssistBot
Disallow: /
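
The three patterns above differ only in which tokens get Allow or Disallow, so they can be generated from one policy switch. A minimal sketch, assuming hypothetical names (`robots_txt` and the policy strings `allow_all`, `block_training`, `block_all`) and the bot groupings used in this guide:

```python
# Groupings follow this guide; adjust them to your own policy.
TRAINING = ["GPTBot", "ClaudeBot", "anthropic-ai", "Google-Extended", "CCBot"]
RETRIEVAL = ["OAI-SearchBot", "ChatGPT-User", "Claude-Web", "PerplexityBot"]
OTHER = ["FacebookBot", "Meta-ExternalAgent", "Bytespider",
         "Applebot-Extended", "DuckAssistBot"]


def robots_txt(policy: str) -> str:
    """Return robots.txt text for one of the three policies in this guide."""
    if policy == "allow_all":
        groups = [("*", "Allow")]
    elif policy == "block_training":
        groups = [(bot, "Allow") for bot in RETRIEVAL]
        groups += [(bot, "Disallow") for bot in TRAINING]
    elif policy == "block_all":
        groups = [(bot, "Disallow") for bot in TRAINING + RETRIEVAL + OTHER]
    else:
        raise ValueError(f"unknown policy: {policy!r}")
    # One group per token, separated by blank lines.
    return "\n\n".join(f"User-agent: {bot}\n{rule}: /" for bot, rule in groups) + "\n"


if __name__ == "__main__":
    print(robots_txt("block_training"))
```

Keeping the token lists in one place makes it harder for the selective and block-all files to drift apart when a provider adds a new crawler.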

4. WAF / Bot management layer

robots.txt only works if bots can reach it. If your WAF blocks AI crawlers before they fetch robots.txt, the rules in it never take effect, and an Allow line there does not override a WAF block.

Cloudflare

Cloudflare added AI Audit and AI Crawl Control in 2024. Go to Settings → Bots → AI Crawl Control to allow or block each bot family individually. These settings override "Bot Fight Mode" for those specific agents.

AWS WAF

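Rule JSON that allows requests whose User-Agent header contains an AI crawler token. The match is case-sensitive because the text transformation is NONE.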
{
  "Name": "AllowAICrawlers",
  "Priority": 10,
  "Action": { "Allow": {} },
  "Statement": {
    "OrStatement": {
      "Statements": [
        { "ByteMatchStatement": {
          "SearchString": "GPTBot",
          "FieldToMatch": { "SingleHeader": { "Name": "user-agent" }},
          "TextTransformations": [{ "Priority": 0, "Type": "NONE" }],
          "PositionalConstraint": "CONTAINS"
        }},
        { "ByteMatchStatement": {
          "SearchString": "ClaudeBot",
          "FieldToMatch": { "SingleHeader": { "Name": "user-agent" }},
          "TextTransformations": [{ "Priority": 0, "Type": "NONE" }],
          "PositionalConstraint": "CONTAINS"
        }}
      ]
    }
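  },
  "VisibilityConfig": {
    "SampledRequestsEnabled": true,
    "CloudWatchMetricsEnabled": true,
    "MetricName": "AllowAICrawlers"
  }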
}
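
The rule above lists only two crawlers, and adding more by hand means repeating the same ByteMatchStatement block. A minimal sketch that builds the same rule shape from a list of tokens; `byte_match` and `allow_rule` are hypothetical helper names, and the `VisibilityConfig` values are placeholder assumptions (AWS WAFv2 expects that block on every rule).

```python
import json
from typing import Dict, List


def byte_match(token: str) -> Dict:
    """One ByteMatchStatement: User-Agent header contains `token` (case-sensitive)."""
    return {
        "ByteMatchStatement": {
            "SearchString": token,
            "FieldToMatch": {"SingleHeader": {"Name": "user-agent"}},
            "TextTransformations": [{"Priority": 0, "Type": "NONE"}],
            "PositionalConstraint": "CONTAINS",
        }
    }


def allow_rule(tokens: List[str], name: str = "AllowAICrawlers",
               priority: int = 10) -> Dict:
    """Build an Allow rule that matches any of the given user-agent tokens."""
    if not tokens:
        raise ValueError("at least one token is required")
    statements = [byte_match(token) for token in tokens]
    # A single token needs no OR wrapper; several tokens are OR-ed together.
    if len(statements) == 1:
        statement = statements[0]
    else:
        statement = {"OrStatement": {"Statements": statements}}
    return {
        "Name": name,
        "Priority": priority,
        "Action": {"Allow": {}},
        "Statement": statement,
        "VisibilityConfig": {  # placeholder values, adjust to your account
            "SampledRequestsEnabled": True,
            "CloudWatchMetricsEnabled": True,
            "MetricName": name,
        },
    }


if __name__ == "__main__":
    rule = allow_rule(["GPTBot", "ClaudeBot", "OAI-SearchBot", "PerplexityBot"])
    print(json.dumps(rule, indent=2))
```

The printed JSON can be pasted into the rule editor or merged into a web ACL definition; check the rule's capacity cost against your ACL limits before adding many tokens.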

nginx rate-limit safety net

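# If AI crawlers are too aggressive, rate-limit them per IP.
# limit_req is not valid inside "if", so a map supplies the zone key instead:
# requests with an empty key are not rate-limited.
# Google-Extended is omitted: it is a robots.txt token, not a request User-Agent.
# Both map and limit_req_zone belong in the http context.
map $http_user_agent $aibot_key {
  default "";
  "~*(GPTBot|ClaudeBot|PerplexityBot|CCBot)" $binary_remote_addr;
}
limit_req_zone $aibot_key zone=aibots:10m rate=10r/s;

location / {
  limit_req zone=aibots burst=20 nodelay;
}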

5. Verify with logs

Step 1: Check the access log for each crawler

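# AI crawler hits in the current access log (rotated logs are not included)
# Google-Extended is a robots.txt token, not a request User-Agent, so it never shows up here
for bot in GPTBot ClaudeBot PerplexityBot CCBot OAI-SearchBot; do
  count=$(grep -c "$bot" /var/log/nginx/access.log)
  echo "$bot: $count requests"
done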

Step 2: Confirm they get 200, not 403

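# Status codes for one crawler (assumes the default combined log format, where field 9 is the status)
grep "GPTBot" /var/log/nginx/access.log | awk '{print $9}' | sort | uniq -c
# Should be mostly 200s. If you see 403s, something in the request path (a WAF rule, an nginx deny, or the application) is refusing the bot regardless of robots.txt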
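
The two steps above can be combined into one pass that tallies status codes per crawler. A minimal sketch, assuming the default nginx combined log format; `BOTS`, `LINE_RE`, and `tally` are hypothetical names, and the sample lines are made up for illustration.

```python
import re
from collections import Counter, defaultdict
from typing import Dict, Iterable

BOTS = ["GPTBot", "ClaudeBot", "PerplexityBot", "CCBot", "OAI-SearchBot"]

# Combined log format: ... "request" status bytes "referer" "user-agent"
LINE_RE = re.compile(r'"[^"]*" (?P<status>\d{3}) \S+ "[^"]*" "(?P<ua>[^"]*)"')


def tally(lines: Iterable[str]) -> Dict[str, Counter]:
    """Count HTTP status codes per AI crawler token found in the User-Agent."""
    result: Dict[str, Counter] = defaultdict(Counter)
    for line in lines:
        match = LINE_RE.search(line)
        if not match:
            continue  # skip lines that do not look like combined-format entries
        ua = match.group("ua").lower()
        for bot in BOTS:
            if bot.lower() in ua:
                result[bot][match.group("status")] += 1
    return result


if __name__ == "__main__":
    # Made-up sample lines; replace with open("/var/log/nginx/access.log").
    sample = [
        '203.0.113.7 - - [10/Oct/2025:13:55:36 +0000] "GET / HTTP/1.1" 200 612 "-" "Mozilla/5.0 (compatible; GPTBot/1.0; +https://openai.com/gptbot)"',
        '203.0.113.7 - - [10/Oct/2025:13:55:37 +0000] "GET /docs HTTP/1.1" 403 153 "-" "Mozilla/5.0 (compatible; GPTBot/1.0; +https://openai.com/gptbot)"',
    ]
    for bot, counts in tally(sample).items():
        print(bot, dict(counts))
```

Requests blocked at an edge WAF never reach the origin, so a crawler that is missing from this tally may be blocked upstream rather than absent.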

6. Test with curl

# Pretend to be GPTBot and check if you get through
curl -A "Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; GPTBot/1.0; +https://openai.com/gptbot)" \
  -I https://example.com/

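# Should return 200 OK, not 403 or 429
# If 403: WAF blocks the user agent regardless of robots.txt
# If 429: rate limit too aggressive
# Caveat: a spoofed user agent only exercises user-agent rules; bot management that checks source IPs may treat real crawler traffic differently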
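
To run the same check against several URLs or user agents, the curl test can be scripted. A minimal sketch using only the Python standard library; `probe` and `diagnose` are hypothetical names, and the interpretation of each status code simply restates the comments above. Unlike `curl -I`, `urlopen` follows redirects, so the status reported is that of the final response.

```python
import urllib.error
import urllib.request

GPTBOT_UA = ("Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; "
             "compatible; GPTBot/1.0; +https://openai.com/gptbot)")


def probe(url: str, user_agent: str = GPTBOT_UA, timeout: float = 10.0) -> int:
    """Send a HEAD request with the given User-Agent and return the HTTP status."""
    request = urllib.request.Request(
        url, method="HEAD", headers={"User-Agent": user_agent}
    )
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # 4xx and 5xx responses are raised as exceptions; the code is what we want.
        return err.code


def diagnose(status: int) -> str:
    """Map a status code to the interpretation used in this guide."""
    if status == 200:
        return "ok: the crawler user agent gets through"
    if status == 403:
        return "blocked: a WAF or server rule rejects this user agent"
    if status == 429:
        return "throttled: the rate limit is too aggressive"
    return f"unexpected status {status}: inspect manually"


if __name__ == "__main__":
    code = probe("https://example.com/")
    print(code, diagnose(code))
```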

7. Verify with checker

Step 1: Re-run Agent Readiness

The crawler-access findings should clear, and the per-bot status should match your intended policy.

💡 The default "allow all AI" position serves most businesses. Block-everything is rarely the right call unless you have strict proprietary-content requirements. Selective block (train: no, search: yes) is the middle path for content sites worried about training but wanting AI search visibility.

