WAFs and bot-protection products often block AI agents along with scrapers because both look like automated traffic. Default configurations of Cloudflare Bot Fight Mode, AWS WAF managed rules, and Imperva routinely block GPTBot, ClaudeBot, and PerplexityBot. The result is zero AI visibility: answer engines cannot fetch your pages, so they never cite you. This guide covers identifying legitimate agent traffic, verifying that it isn't spoofed, and allowlisting it at each major WAF.

| Agent | UA string contains | Purpose |
|---|---|---|
| GPTBot | GPTBot | OpenAI training data |
| ChatGPT-User | ChatGPT-User | ChatGPT user browsing |
| OAI-SearchBot | OAI-SearchBot | ChatGPT search results |
| ClaudeBot | ClaudeBot | Anthropic crawler |
| Claude-Web | Claude-Web | Claude live web access |
| Anthropic-AI | anthropic-ai | Anthropic training |
| PerplexityBot | PerplexityBot | Perplexity index |
| Perplexity-User | Perplexity-User | Perplexity user queries |
| Google-Extended | (none; robots.txt token only, crawling uses the regular Google UAs) | Gemini training opt-out |
| CCBot | CCBot/2.0 | Common Crawl (many AI providers) |
| Meta-ExternalAgent | meta-externalagent | Meta AI |
| Bytespider | Bytespider | ByteDance AI |
# -i matters: some agents send a lowercase token such as "anthropic-ai"
grep -iE "GPTBot|ClaudeBot|PerplexityBot|anthropic|CCBot" \
  /var/log/nginx/access.log | tail -50
# Check status codes — 403/429 means blocked
grep "GPTBot" /var/log/nginx/access.log | awk '{print $9}' | sort | uniq -c
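
The per-agent check above can be extended to all agents at once. The following is a minimal sketch, assuming nginx's default "combined" log format (status code in field 9); the function name `agent_status_summary` and the agent list are illustrative, not part of any tool.

```shell
# Summarize response status codes per AI agent from an access log.
# Assumes the "combined" log format, where the status code is field 9.
agent_status_summary() {
    awk '
        BEGIN {
            # Illustrative agent list; extend it with the tokens you care about.
            n = split("GPTBot ClaudeBot PerplexityBot CCBot", agents, " ")
        }
        {
            for (i = 1; i <= n; i++) {
                # Plain substring match on the whole line; first match wins.
                if (index($0, agents[i]) > 0) {
                    count[agents[i] " " $9]++
                    break
                }
            }
        }
        END {
            for (key in count) print key, count[key]
        }
    ' "$1" | LC_ALL=C sort
}
```

Calling `agent_status_summary /var/log/nginx/access.log` prints one line per agent and status code followed by the hit count, so a row dominated by 403 or 429 points at the agent being blocked.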
Cloudflare auto-detects and validates known bots via IP and reverse DNS:
- Security → Bots → Bot Fight Mode
- "Allow verified bots" → enabled
- No manual UA rules needed for GPTBot, ClaudeBot, PerplexityBot, Googlebot, Bingbot
# Cloudflare → Security → WAF → Custom rules → Create
# Field: User Agent
# Operator: contains
# Value: GPTBot
# Then: Skip → all remaining custom rules + managed challenge

# Or in expression syntax.
# "contains" is case-sensitive, so lower() is used to catch the lowercase "anthropic-ai" token:
(http.user_agent contains "GPTBot") or
(http.user_agent contains "ClaudeBot") or
(http.user_agent contains "PerplexityBot") or
(lower(http.user_agent) contains "anthropic") or
(http.user_agent contains "OAI-SearchBot")
# Action: Skip → managed challenge
# Web ACL → Rules → Add rule → Custom rule
# Statement: Inspect → Single header → User-Agent
# Match type: Contains string
# String: GPTBot
# Action: Allow
# Add additional rules for ClaudeBot, PerplexityBot, etc.
# AWS WAF evaluates rules in ascending order of their priority number, so give
# these rules a lower number than the blocking rules: allow first, block what's left
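# /etc/nginx/conf.d/ai-agents.conf
# Flag requests whose User-Agent matches a known AI agent (case-insensitive regex).
# Google-Extended is omitted: it is a robots.txt token and never appears as a request UA.
map $http_user_agent $is_ai_agent {
    default              0;
    "~*GPTBot"           1;
    "~*ChatGPT-User"     1;
    "~*ClaudeBot"        1;
    "~*Claude-Web"       1;
    "~*anthropic-ai"     1;
    "~*PerplexityBot"    1;
    "~*Perplexity-User"  1;
    "~*OAI-SearchBot"    1;
}

server {
    # Default for all other traffic, so $bypass_check is always initialized
    set $bypass_check 0;

    # Skip rate-limit and challenge for AI agents.
    # $bypass_check must be read by your own rate-limit or challenge logic.
    if ($is_ai_agent = 1) {
        set $bypass_check 1;
    }
}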
# OpenAI publishes GPTBot ranges at openai.com/gptbot.json
curl -s https://openai.com/gptbot.json | jq .

# Anthropic publishes ClaudeBot ranges
curl -s https://anthropic.com/claudebot-ranges.json | jq .

# Confirm both URLs against each vendor's current crawler documentation before automating.
# Auto-update WAF allowlists from these every 24h via cron
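
Since a User-Agent header is trivial to spoof, a match on the UA is only trustworthy when the source IP also falls inside a published range. The following is a minimal sketch of that check for IPv4 in POSIX shell. The function names (`ip_to_int`, `ip_in_cidr`, `ip_in_ranges`) are hypothetical helpers, and the sketch assumes well-formed dotted-quad input without leading zeros.

```shell
# Convert a dotted-quad IPv4 address to an integer.
ip_to_int() {
    old_ifs=$IFS
    IFS=.
    set -- $1            # split the address into four octets
    IFS=$old_ifs
    echo $(( ($1 << 24) + ($2 << 16) + ($3 << 8) + $4 ))
}

# Return 0 if IPv4 address $1 falls inside CIDR block $2 (e.g. 192.0.2.0/24).
ip_in_cidr() {
    ip_int=$(ip_to_int "$1")
    net_int=$(ip_to_int "${2%/*}")
    prefix=${2#*/}
    mask=$(( (0xFFFFFFFF << (32 - prefix)) & 0xFFFFFFFF ))
    [ $(( ip_int & mask )) -eq $(( net_int & mask )) ]
}

# Return 0 if IPv4 address $1 matches any CIDR read from stdin, one per line.
ip_in_ranges() {
    while IFS= read -r cidr; do
        case $cidr in
            ''|*:*) continue ;;   # skip blank lines and IPv6 entries
        esac
        if ip_in_cidr "$1" "$cidr"; then
            return 0
        fi
    done
    return 1
}
```

A possible use, assuming the vendor's JSON lists its ranges under `.prefixes[].ipv4Prefix` (an assumption to check against the actual file) and that `$RANGES_URL` holds the URL you have confirmed: `curl -s "$RANGES_URL" | jq -r '.prefixes[].ipv4Prefix // empty' | ip_in_ranges 203.0.113.9 && echo verified`.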
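for ua in "GPTBot/1.0" "ClaudeBot/1.0" "PerplexityBot/1.0" "anthropic-ai/1.0"; do
  status=$(curl -s -o /dev/null -w "%{http_code}" -A "$ua" https://example.com/)
  echo "$status $ua"
done
# All should return 200, not 403/429/503.
# This only exercises UA-based rules: the request comes from your own IP, so a WAF
# that also verifies source IPs may still challenge it while admitting the real agents.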
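
For a scheduled or CI check, the same loop can be wrapped so that it exits non-zero when any agent is blocked. This is a minimal sketch: `fetch_status`, `is_blocked_status`, and `check_agents` are hypothetical helper names, and only 403, 429, and 503 are treated as blocks, so a connection failure (status `000`) is not flagged.

```shell
# Fetch the HTTP status code for URL $1 using User-Agent $2.
fetch_status() {
    curl -s -o /dev/null -w "%{http_code}" -A "$2" "$1"
}

# Return 0 if status code $1 indicates a block (403, 429, or 503).
is_blocked_status() {
    case "$1" in
        403|429|503) return 0 ;;
        *) return 1 ;;
    esac
}

# Probe URL $1 with each remaining argument as a User-Agent.
# Prints one line per agent; returns 1 if any probe was blocked.
check_agents() {
    url=$1
    shift
    blocked=0
    for ua in "$@"; do
        status=$(fetch_status "$url" "$ua")
        if is_blocked_status "$status"; then
            echo "BLOCKED $status $ua"
            blocked=1
        else
            echo "OK $status $ua"
        fi
    done
    return "$blocked"
}
```

For example, `check_agents https://example.com/ "GPTBot/1.0" "ClaudeBot/1.0" || echo "at least one agent is blocked"` could run from cron after each WAF rule change.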