AI engines (ChatGPT, Claude, Perplexity, Gemini, Brave Leo) crawl the web with named bots. Block them and you don't appear in their answers. The Agent Readiness audit lists which AI crawlers your site allows or blocks at robots.txt and WAF layers. This guide covers the policy decision, the per-crawler robots.txt patterns, and the WAF rules that matter.
| Provider | User-Agent | Purpose |
|---|---|---|
| OpenAI | GPTBot | Training data |
| OpenAI | OAI-SearchBot | Live search (ChatGPT browsing) |
| OpenAI | ChatGPT-User | User-triggered fetches |
| Anthropic | ClaudeBot | Training data |
| Anthropic | Claude-Web | Live retrieval |
| Anthropic | anthropic-ai | Legacy training bot |
| Google | Google-Extended | Bard/Gemini training (separate from Googlebot) |
| Perplexity | PerplexityBot | Search + training |
| Common Crawl | CCBot | Public dataset (used by many AI labs) |
| Meta | FacebookBot, Meta-ExternalAgent | Llama training |
| ByteDance | Bytespider | Doubao / TikTok AI |
| Apple | Applebot-Extended | Apple Intelligence training |
| DuckDuckGo | DuckAssistBot | Duck.ai answers |
Allow all: maximises AI visibility. Your content can appear in ChatGPT, Claude, Perplexity, and Gemini answers. Best for businesses that want to be cited and recommended.
Allow search, block training: OAI-SearchBot, Claude-Web, and PerplexityBot can fetch pages for real-time answers, but GPTBot, ClaudeBot, and Google-Extended are told not to train on your content. A middle ground for sites with proprietary content.
Block all: maximum content protection, but your site is invisible to AI answer engines. The trade-off is defending content versus being recommended.
# Allow all. No specific rules needed: default allow
User-agent: *
Allow: /

# Allow search, block training. Live-search and user-triggered bots first
User-agent: OAI-SearchBot
Allow: /
User-agent: ChatGPT-User
Allow: /
User-agent: Claude-Web
Allow: /
User-agent: PerplexityBot
Allow: /
# Block training-only bots
User-agent: GPTBot
Disallow: /
User-agent: ClaudeBot
Disallow: /
User-agent: anthropic-ai
Disallow: /
User-agent: Google-Extended
Disallow: /
User-agent: CCBot
Disallow: /

# Block all AI crawlers
User-agent: GPTBot
Disallow: /
User-agent: ClaudeBot
Disallow: /
User-agent: Claude-Web
Disallow: /
User-agent: anthropic-ai
Disallow: /
User-agent: OAI-SearchBot
Disallow: /
User-agent: ChatGPT-User
Disallow: /
User-agent: PerplexityBot
Disallow: /
User-agent: Google-Extended
Disallow: /
User-agent: CCBot
Disallow: /
User-agent: FacebookBot
Disallow: /
User-agent: Meta-ExternalAgent
Disallow: /
User-agent: Bytespider
Disallow: /
User-agent: Applebot-Extended
Disallow: /
User-agent: DuckAssistBot
Disallow: /
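Before deploying one of these policies, it can help to check how a parser actually reads it. The following is a minimal sketch using Python's standard `urllib.robotparser`. The helper names `build_policy` and `crawler_access`, the two bot lists, and the `https://example.com/` URL are illustrative assumptions, and real crawlers may handle edge cases differently from this parser.

```python
from urllib.robotparser import RobotFileParser

# Bot groupings taken from the "allow search, block training" policy above.
SEARCH_BOTS = ["OAI-SearchBot", "ChatGPT-User", "Claude-Web", "PerplexityBot"]
TRAINING_BOTS = ["GPTBot", "ClaudeBot", "anthropic-ai", "Google-Extended", "CCBot"]


def build_policy(allowed, blocked):
    """Render a robots.txt body with one group per crawler (hypothetical helper)."""
    lines = []
    for bot in allowed:
        lines += [f"User-agent: {bot}", "Allow: /", ""]
    for bot in blocked:
        lines += [f"User-agent: {bot}", "Disallow: /", ""]
    return "\n".join(lines)


def crawler_access(robots_txt, crawlers, url="https://example.com/"):
    """Return {crawler: True/False} for whether each crawler may fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {bot: parser.can_fetch(bot, url) for bot in crawlers}


if __name__ == "__main__":
    policy = build_policy(SEARCH_BOTS, TRAINING_BOTS)
    for bot, allowed in crawler_access(policy, SEARCH_BOTS + TRAINING_BOTS).items():
        print(f"{bot}: {'allowed' if allowed else 'blocked'}")
```

A crawler that is not named in the file and has no `User-agent: *` group to fall back on is treated as allowed, which is why the block-all policy has to list every bot explicitly.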
robots.txt only works if bots reach it. If your WAF blocks AI crawlers before they read robots.txt, robots.txt has no effect.
Cloudflare added AI Audit + AI Crawl Control in 2024. Find it under Settings → Bots → AI Crawl Control, where you can allow or block each bot family individually. These settings override "Bot Fight Mode" for those specific agents.
# Allow AI crawler user agents (the match is case-sensitive because no text transformation is applied)
{
  "Name": "AllowAICrawlers",
  "Priority": 10,
  "Action": { "Allow": {} },
  "Statement": {
    "OrStatement": {
      "Statements": [
        { "ByteMatchStatement": {
          "SearchString": "GPTBot",
          "FieldToMatch": { "SingleHeader": { "Name": "user-agent" }},
          "TextTransformations": [{ "Priority": 0, "Type": "NONE" }],
          "PositionalConstraint": "CONTAINS"
        }},
        { "ByteMatchStatement": {
          "SearchString": "ClaudeBot",
          "FieldToMatch": { "SingleHeader": { "Name": "user-agent" }},
          "TextTransformations": [{ "Priority": 0, "Type": "NONE" }],
          "PositionalConstraint": "CONTAINS"
        }}
      ]
    }
  },
  "VisibilityConfig": {
    "SampledRequestsEnabled": true,
    "CloudWatchMetricsEnabled": true,
    "MetricName": "AllowAICrawlers"
  }
}
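The rule above covers only two crawlers, and writing one `ByteMatchStatement` per bot by hand gets repetitive. As a rough sketch, the same structure can be generated from a list of user-agent tokens. The function names `byte_match` and `build_allow_rule` are hypothetical, the `VisibilityConfig` values are placeholder assumptions, and how `SearchString` must be encoded depends on how you submit the rule, so check that for your tooling.

```python
import json


def byte_match(token):
    """One case-sensitive 'User-Agent contains token' statement."""
    return {
        "ByteMatchStatement": {
            "SearchString": token,
            "FieldToMatch": {"SingleHeader": {"Name": "user-agent"}},
            "TextTransformations": [{"Priority": 0, "Type": "NONE"}],
            "PositionalConstraint": "CONTAINS",
        }
    }


def build_allow_rule(tokens, name="AllowAICrawlers", priority=10):
    """Build an allow rule shaped like the JSON above (hypothetical helper)."""
    if len(tokens) < 2:
        # An OrStatement is meant to combine more than one nested statement.
        raise ValueError("need at least two user-agent tokens")
    return {
        "Name": name,
        "Priority": priority,
        "Action": {"Allow": {}},
        "Statement": {"OrStatement": {"Statements": [byte_match(t) for t in tokens]}},
        # Placeholder metric settings; adjust to your own monitoring setup.
        "VisibilityConfig": {
            "SampledRequestsEnabled": True,
            "CloudWatchMetricsEnabled": True,
            "MetricName": name,
        },
    }


if __name__ == "__main__":
    bots = ["GPTBot", "OAI-SearchBot", "ChatGPT-User", "ClaudeBot", "PerplexityBot"]
    print(json.dumps(build_allow_rule(bots), indent=2))
```

Because user-agent strings are trivially spoofed, an allow rule keyed only on this header also lets through anyone who copies the string.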
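# If AI crawlers are too aggressive, rate-limit per IP
# limit_req is not allowed inside "if", so build the zone key with map (http context).
# Requests with an empty key are not rate-limited.
# Google-Extended is omitted: it is a robots.txt token and sends no User-Agent of its own.
map $http_user_agent $aibot_limit_key {
    default "";
    ~*(GPTBot|ClaudeBot|PerplexityBot|CCBot) $binary_remote_addr;
}
limit_req_zone $aibot_limit_key zone=aibots:10m rate=10r/s;
location / {
    limit_req zone=aibots burst=20 nodelay;
}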
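# AI crawler hits in the current access log (the past week, if logs rotate weekly)
# Google-Extended is a robots.txt token with no User-Agent of its own, so it never shows up here
for bot in GPTBot ClaudeBot PerplexityBot CCBot OAI-SearchBot; do
  count=$(grep -c "$bot" /var/log/nginx/access.log)
  echo "$bot: $count requests"
done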
# Status codes for each crawler
grep "GPTBot" /var/log/nginx/access.log | awk '{print $9}' | sort | uniq -c
# Should be mostly 200s. If you see 403s, the WAF is blocking despite robots.txt
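To get the status breakdown for every crawler in one pass rather than one grep per bot, a short script can tally them together. This is a sketch that assumes nginx's default "combined" log format; the `status_by_crawler` name, the `AI_CRAWLERS` tuple, and the sample lines are made up for illustration.

```python
import re
from collections import Counter, defaultdict

# User-agent tokens to look for (a subset of the table above).
AI_CRAWLERS = ("GPTBot", "OAI-SearchBot", "ChatGPT-User", "ClaudeBot", "PerplexityBot", "CCBot")

# Combined log format: "request" status bytes "referer" "user-agent"
LOG_LINE = re.compile(r'"[^"]*" (?P<status>\d{3}) \S+ "[^"]*" "(?P<agent>[^"]*)"')


def status_by_crawler(lines):
    """Return {crawler: Counter({status_code: hits})} for matching log lines."""
    tally = defaultdict(Counter)
    for line in lines:
        match = LOG_LINE.search(line)
        if not match:
            continue  # skip lines that are not in the assumed format
        for bot in AI_CRAWLERS:
            if bot in match.group("agent"):
                tally[bot][match.group("status")] += 1
                break
    return tally


# Fabricated sample lines, for illustration only.
SAMPLE = [
    '203.0.113.7 - - [01/Jan/2025:00:00:01 +0000] "GET / HTTP/1.1" 200 512 "-" "Mozilla/5.0 (compatible; GPTBot/1.0; +https://openai.com/gptbot)"',
    '203.0.113.7 - - [01/Jan/2025:00:00:02 +0000] "GET /docs HTTP/1.1" 403 162 "-" "Mozilla/5.0 (compatible; GPTBot/1.0; +https://openai.com/gptbot)"',
    '203.0.113.9 - - [01/Jan/2025:00:00:03 +0000] "GET / HTTP/1.1" 200 512 "-" "Mozilla/5.0 (compatible; ClaudeBot/1.0)"',
    '198.51.100.4 - - [01/Jan/2025:00:00:04 +0000] "GET / HTTP/1.1" 200 512 "-" "Mozilla/5.0 (X11; Linux x86_64)"',
]

if __name__ == "__main__":
    for bot, codes in status_by_crawler(SAMPLE).items():
        print(bot, dict(codes))
```

To run it against a real log, pass an open file handle (for example `open("/var/log/nginx/access.log")`) in place of `SAMPLE`.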
# Pretend to be GPTBot and check if you get through
curl -A "Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; GPTBot/1.0; +https://openai.com/gptbot)" \
  -I https://example.com/
# Should return 200 OK, not 403 or 429
# If 403: WAF blocks the user agent regardless of robots.txt
# If 429: rate limit too aggressive