How to Fix AI Crawler Access

AI engines (ChatGPT, Claude, Perplexity, Gemini, Brave Leo) crawl the web with named bots. Block them and you don't appear in their answers. The Agent Readiness audit lists which AI crawlers your site allows or blocks at robots.txt and WAF layers. This guide covers the policy decision, the per-crawler robots.txt patterns, and the WAF rules that matter.

1. The major AI crawlers

Provider      | User-Agent                       | Purpose
OpenAI        | GPTBot                           | Training data
OpenAI        | OAI-SearchBot                    | Live search (ChatGPT browsing)
OpenAI        | ChatGPT-User                     | User-triggered fetches
Anthropic     | ClaudeBot                        | Training data
Anthropic     | Claude-Web                       | Live retrieval
Anthropic     | anthropic-ai                     | Legacy training bot
Google        | Google-Extended                  | Bard/Gemini training (separate from Googlebot; a robots.txt token only, never sent as a User-Agent header)
Perplexity    | PerplexityBot                    | Search + training
Common Crawl  | CCBot                            | Public dataset (used by many AI labs)
Meta          | FacebookBot, Meta-ExternalAgent  | Llama training
ByteDance     | Bytespider                       | Doubao / TikTok AI
Apple         | Applebot-Extended                | Apple Intelligence training (a robots.txt token only; fetching is done by Applebot)
DuckDuckGo    | DuckAssistBot                    | Duck.ai answers

2. Decide policy

Allow all (recommended for most public sites)

Maximises AI visibility — your content appears in ChatGPT, Claude, Perplexity, Gemini answers. Best for businesses that want to be cited and recommended.

Allow live retrieval, block training

OAI-SearchBot, Claude-Web, PerplexityBot can fetch for real-time answers — but GPTBot, ClaudeBot, Google-Extended can't train on your content. Middle ground for sites with proprietary content.

Block all

Maximum content protection — but invisible to AI answer engines. Trade-off: defending content vs being recommended.

3. Robots.txt patterns

Allow all AI crawlers (default state)

# No specific rules needed — default allow
User-agent: *
Allow: /

Allow search bots, block training bots

User-agent: OAI-SearchBot
Allow: /

User-agent: ChatGPT-User
Allow: /

User-agent: Claude-Web
Allow: /

User-agent: PerplexityBot
Allow: /

# Block training-only bots
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: anthropic-ai
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: CCBot
Disallow: /
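
One way to sanity-check the selective policy before deploying it is to parse it locally and ask, per bot token, whether the site root may be fetched. The sketch below uses Python's standard `urllib.robotparser`; the function name `crawler_access` and the `example.com` URL are illustrative assumptions, and Python's matching rules are only one interpretation of robots.txt, so real crawlers may differ in edge cases.

```python
from urllib.robotparser import RobotFileParser

# The "allow search bots, block training bots" policy from this section.
SELECTIVE_POLICY = """\
User-agent: OAI-SearchBot
Allow: /

User-agent: ChatGPT-User
Allow: /

User-agent: Claude-Web
Allow: /

User-agent: PerplexityBot
Allow: /

User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: anthropic-ai
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: CCBot
Disallow: /
"""


def crawler_access(robots_txt, bots, url="https://example.com/"):
    """Map each bot token to True (may fetch url) or False (disallowed)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {bot: parser.can_fetch(bot, url) for bot in bots}
```

Calling `crawler_access(SELECTIVE_POLICY, ["OAI-SearchBot", "GPTBot"])` returns a dict you can compare against the policy you intended. A bot with no group of its own (for example Bytespider here) falls through to the default, which is allow.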

Block all AI

User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Claude-Web
Disallow: /

User-agent: anthropic-ai
Disallow: /

User-agent: OAI-SearchBot
Disallow: /

User-agent: ChatGPT-User
Disallow: /

User-agent: PerplexityBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: FacebookBot
Disallow: /

User-agent: Meta-ExternalAgent
Disallow: /

User-agent: Bytespider
Disallow: /

User-agent: Applebot-Extended
Disallow: /

User-agent: DuckAssistBot
Disallow: /
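
The three patterns above differ only in which tokens get `Allow: /` and which get `Disallow: /`, so the file can be generated from two lists instead of being maintained by hand. This is a minimal sketch; `AI_CRAWLERS` and `build_robots_txt` are hypothetical names, and the token list is copied from the table in section 1.

```python
# Tokens from the crawler table in section 1.
AI_CRAWLERS = [
    "GPTBot", "ClaudeBot", "Claude-Web", "anthropic-ai", "OAI-SearchBot",
    "ChatGPT-User", "PerplexityBot", "Google-Extended", "CCBot",
    "FacebookBot", "Meta-ExternalAgent", "Bytespider", "Applebot-Extended",
    "DuckAssistBot",
]


def build_robots_txt(blocked, allowed=()):
    """Return robots.txt text with one group per token.

    Tokens in `allowed` get "Allow: /", tokens in `blocked` get "Disallow: /".
    A token listed in both is rejected, since the result would be ambiguous.
    """
    overlap = set(blocked) & set(allowed)
    if overlap:
        raise ValueError(f"tokens both allowed and blocked: {sorted(overlap)}")
    groups = [f"User-agent: {agent}\nAllow: /" for agent in allowed]
    groups += [f"User-agent: {agent}\nDisallow: /" for agent in blocked]
    return "\n\n".join(groups) + "\n"


# "Block all AI" policy: every known token is disallowed.
BLOCK_ALL = build_robots_txt(blocked=AI_CRAWLERS)
```

For the selective policy, pass the live-retrieval tokens as `allowed` and the training tokens as `blocked`. Append the output to any existing robots.txt rules rather than replacing them.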

4. WAF / Bot management layer

robots.txt only works if bots reach it. If your WAF blocks AI crawlers before they read robots.txt, robots.txt has no effect.

Cloudflare

Cloudflare added AI Audit + AI Crawl Control in 2024. Go to Settings → Bots → AI Crawl Control to allow or block each bot family individually. These per-bot settings override "Bot Fight Mode" for the agents they cover.

AWS WAF

# Allow AI crawler user agents. Add this rule to the web ACL's Rules array
# and drop these two comment lines first, since JSON has no comment syntax.
{
  "Name": "AllowAICrawlers",
  "Priority": 10,
  "Action": { "Allow": {} },
  "Statement": {
    "OrStatement": {
      "Statements": [
        { "ByteMatchStatement": {
          "SearchString": "GPTBot",
          "FieldToMatch": { "SingleHeader": { "Name": "user-agent" }},
          "TextTransformations": [{ "Priority": 0, "Type": "NONE" }],
          "PositionalConstraint": "CONTAINS"
        }},
        { "ByteMatchStatement": {
          "SearchString": "ClaudeBot",
          "FieldToMatch": { "SingleHeader": { "Name": "user-agent" }},
          "TextTransformations": [{ "Priority": 0, "Type": "NONE" }],
          "PositionalConstraint": "CONTAINS"
        }}
      ]
    }
  },
  "VisibilityConfig": {
    "SampledRequestsEnabled": true,
    "CloudWatchMetricsEnabled": true,
    "MetricName": "AllowAICrawlers"
  }
}
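
The rule above lists only two crawlers, and each extra one repeats the same `ByteMatchStatement` block. A short generator keeps the JSON consistent as the list grows. This is a sketch under assumptions: `byte_match` and `build_allow_rule` are hypothetical helper names, and the `VisibilityConfig` values are placeholders to adjust for your web ACL. Keep in mind that a user-agent match also admits any client that merely claims to be one of these bots.

```python
import json


def byte_match(token):
    """One case-sensitive "User-Agent contains token" statement."""
    return {
        "ByteMatchStatement": {
            "SearchString": token,
            "FieldToMatch": {"SingleHeader": {"Name": "user-agent"}},
            "TextTransformations": [{"Priority": 0, "Type": "NONE"}],
            "PositionalConstraint": "CONTAINS",
        }
    }


def build_allow_rule(tokens, name="AllowAICrawlers", priority=10):
    """Build an Allow rule that matches any of the given User-Agent tokens."""
    if len(tokens) < 2:
        # Assumption: an OrStatement is only meaningful with two or more
        # nested statements, so this sketch refuses shorter lists.
        raise ValueError("need at least two tokens for an OrStatement")
    return {
        "Name": name,
        "Priority": priority,
        "Action": {"Allow": {}},
        "Statement": {"OrStatement": {"Statements": [byte_match(t) for t in tokens]}},
        "VisibilityConfig": {
            "SampledRequestsEnabled": True,
            "CloudWatchMetricsEnabled": True,
            "MetricName": name,
        },
    }


RULE_JSON = json.dumps(
    build_allow_rule(["GPTBot", "ClaudeBot", "PerplexityBot", "OAI-SearchBot"]),
    indent=2,
)
```

`RULE_JSON` can be pasted into the web ACL's rule list. Google-Extended and Applebot-Extended are left out on purpose because they are robots.txt tokens and do not appear in a User-Agent header.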

nginx rate-limit safety net

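# If AI crawlers are too aggressive, rate-limit them per IP.
# limit_req is not allowed inside "if", so a map builds the zone key instead:
# AI crawlers are keyed by client IP, and every other request gets an empty
# key, which nginx does not rate-limit.
# Google-Extended is a robots.txt token, not a User-Agent string, so it is not matched here.
# The map and limit_req_zone directives belong in the http context.
map $http_user_agent $aibot_key {
  default "";
  "~*(GPTBot|ClaudeBot|PerplexityBot|CCBot)" $binary_remote_addr;
}

limit_req_zone $aibot_key zone=aibots:10m rate=10r/s;

location / {
  limit_req zone=aibots burst=20 nodelay;
}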
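
To see what `rate=10r/s` with `burst=20` means in practice, the following is a simplified model of the leaky-bucket accounting behind `limit_req`, written from my understanding of nginx's behaviour rather than taken from its source. The function name `simulate_limit_req` is hypothetical. Each accepted request adds one unit of "excess", the excess drains at `rate` units per second, and a request is rejected when the excess would go above `burst`.

```python
def simulate_limit_req(arrival_times, rate=10.0, burst=20):
    """Return one boolean per request: True if it passes, False if rejected.

    arrival_times: request timestamps in seconds, in ascending order,
    all from the same client key.
    """
    excess = 0.0
    last = None
    results = []
    for t in arrival_times:
        if last is None:
            candidate = 0.0  # first request from this key starts the bucket
        else:
            # Drain for the elapsed time, then count this request.
            candidate = max(excess - rate * (t - last) + 1.0, 0.0)
        if candidate > burst:
            results.append(False)  # rejected; bucket state is left unchanged
        else:
            excess, last = candidate, t
            results.append(True)
    return results


# 30 requests arriving at the same instant from one crawler IP.
SPIKE = simulate_limit_req([0.0] * 30)
```

In this model a crawler that sends 30 requests at once gets 21 through (one plus the burst of 20) and the rest rejected, while a steady 10 requests per second never trips the limit. With `nodelay` the accepted burst is served immediately instead of being queued.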

5. Verify with logs

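Step 1
Check access log for each crawler
# AI crawler hits in the current access log (include rotated logs to cover a full week)
# Google-Extended is omitted: it is a robots.txt token and never appears in a User-Agent header
for bot in GPTBot ClaudeBot PerplexityBot CCBot OAI-SearchBot; do
  count=$(grep -c "$bot" /var/log/nginx/access.log)
  echo "$bot: $count requests"
done
Step 2
Confirm they get 200, not 403
# Status codes for each crawler (field 9 is the status in nginx's default combined log format)
grep "GPTBot" /var/log/nginx/access.log | awk '{print $9}' | sort | uniq -c
# Should be mostly 200s. If you see 403s, a server-side rule is blocking despite robots.txt.
# Blocks at a CDN or cloud WAF never reach this log, so zero hits can also mean an upstream block.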
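
The two shell checks can be combined into a single pass that reports status codes per crawler. The sketch below assumes nginx's default combined log format; `status_by_bot` and the bot list are illustrative names, and the regular expression would need adjusting for a custom `log_format`.

```python
import re
from collections import Counter, defaultdict

# Tokens that actually appear in User-Agent headers.
BOTS = ["GPTBot", "ClaudeBot", "PerplexityBot", "CCBot", "OAI-SearchBot"]

# Combined format: ... "request" status bytes "referer" "user-agent"
LINE_RE = re.compile(r'"[^"]*" (?P<status>\d{3}) \S+ "[^"]*" "(?P<ua>[^"]*)"')


def status_by_bot(lines, bots=BOTS):
    """Return {bot: {status_code: count}} for log lines whose UA names a bot."""
    counts = defaultdict(Counter)
    for line in lines:
        match = LINE_RE.search(line)
        if match is None:
            continue  # line is not in combined format; skip it
        for bot in bots:
            if bot in match.group("ua"):
                counts[bot][match.group("status")] += 1
    return {bot: dict(codes) for bot, codes in counts.items()}
```

A typical call is `status_by_bot(open("/var/log/nginx/access.log", encoding="utf-8", errors="replace"))`. A crawler that is missing from the result had no hits at all, which can mean either no crawl attempts or a block upstream of the server.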

6. Test with curl

# Pretend to be GPTBot and check if you get through
curl -A "Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; GPTBot/1.0; +https://openai.com/gptbot)" \
  -I https://example.com/

# Should return 200 OK, not 403 or 429
# If 403: WAF blocks the user agent regardless of robots.txt
# If 429: rate limit too aggressive
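
The same probe can be scripted so it runs against several URLs or user agents. This sketch uses only the standard library; `probe` and `diagnose` are hypothetical names, and the user-agent string is the one from the curl command above. A spoofed request only tests user-agent rules: a WAF that verifies crawler IP ranges may reject it even when the real bot is allowed.

```python
import urllib.error
import urllib.request

GPTBOT_UA = (
    "Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; "
    "GPTBot/1.0; +https://openai.com/gptbot)"
)


def probe(url, user_agent=GPTBOT_UA, timeout=10):
    """Send a HEAD request with the given User-Agent and return the status code."""
    request = urllib.request.Request(
        url, method="HEAD", headers={"User-Agent": user_agent}
    )
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as error:
        return error.code  # 4xx and 5xx responses are raised as HTTPError


def diagnose(status):
    """Translate a status code into the reading given in this section."""
    if 200 <= status < 300:
        return "ok: the crawler user agent gets through"
    if status == 403:
        return "blocked: WAF rejects the user agent regardless of robots.txt"
    if status == 429:
        return "throttled: rate limit too aggressive"
    return f"unexpected status {status}: check server and WAF logs"
```

For example, `diagnose(probe("https://example.com/"))` prints a one-line verdict for the site root; replace `example.com` with your own host.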

7. Verify with checker

Step 1
Re-run the Agent Readiness audit. The crawler-access findings should clear, and the per-bot status should match your intended policy.
💡 The default "allow all AI" position serves most businesses. Block-everything is rarely the right call unless you have strict proprietary-content requirements. Selective block (train: no, search: yes) is the middle path for content sites worried about training but wanting AI search visibility.
