Which AI agents should I allow?

It depends on your strategy. The main agents are GPTBot (ChatGPT training), ChatGPT-User (live browsing), ClaudeBot, Claude-Web, PerplexityBot, Perplexity-User, Google-Extended (the robots.txt token that controls Gemini use), Bingbot (powers Copilot), and CCBot (Common Crawl). Allowing them increases your visibility in AI answers; blocking them keeps your content out of training data and AI answers.

How do I distinguish real agents from impersonators?

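User-agent strings are trivially spoofable, so verify the source IP instead. OpenAI publishes IP ranges for its crawlers; confirm in Anthropic's crawler documentation whether ranges are currently published for ClaudeBot. Cloudflare's Verified Bots feature validates known bots automatically. For Google and Bing, do a reverse DNS lookup on the IP, confirm the hostname ends in googlebot.com (Google) or search.msn.com (Bing), then resolve that hostname forward and check that it returns the same IP.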
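
The reverse DNS check can be sketched in Python as follows. This is a minimal illustration, not a hardened implementation: the suffix list is an assumption based on the domains named above, and the function names are hypothetical. The resolver functions are passed in as parameters so the logic can be exercised without network access.

```python
import ipaddress
import socket
from typing import Callable, Iterable, Set

# Assumed suffix list, based on the domains named in the text; extend as needed.
TRUSTED_SUFFIXES = (".googlebot.com", ".google.com", ".search.msn.com")


def reverse_lookup(ip: str) -> str:
    """Return the PTR hostname for ip (raises OSError if there is none)."""
    return socket.gethostbyaddr(ip)[0]


def forward_lookup(hostname: str) -> Set[str]:
    """Return every address the hostname resolves to."""
    return {info[4][0] for info in socket.getaddrinfo(hostname, None)}


def is_verified_crawler(
    ip: str,
    suffixes: Iterable[str] = TRUSTED_SUFFIXES,
    reverse: Callable[[str], str] = reverse_lookup,
    forward: Callable[[str], Set[str]] = forward_lookup,
) -> bool:
    """Forward-confirmed reverse DNS.

    The PTR hostname must end in a trusted suffix AND resolve back to the
    same IP. Either check alone can be faked by whoever controls the
    reverse zone for the address.
    """
    try:
        target = ipaddress.ip_address(ip)
        host = reverse(ip).rstrip(".").lower()
        if not host.endswith(tuple(suffixes)):
            return False
        return any(ipaddress.ip_address(a) == target for a in forward(host))
    except (OSError, ValueError):
        # Lookup failed or an address was malformed: treat as unverified.
        return False
```

Calling `is_verified_crawler(client_ip)` with the default resolvers performs live DNS lookups, so in practice the result is worth caching per IP.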

Will allowing agents hurt my server?

Legitimate agents respect robots.txt and crawl politely, and their request volume is usually small compared with human traffic. The real risk is scrapers spoofing agent UAs, so verify with IP ranges, not strings alone.

Block or allow GPTBot?

Strategic decision. Allow if you want content cited in ChatGPT. Block if protecting IP or negotiating licensing. For most SMBs and SaaS, allowing wins more visibility than it costs.
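
For well-behaved agents, the allow/block decision is expressed in robots.txt. The sketch below uses Python's standard `urllib.robotparser` to check what a given policy means for each agent. The policy text is a hypothetical example (block training, allow live browsing), not a recommendation, and `allowed_agents` is an illustrative helper name.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical policy: keep GPTBot (training) out, let ChatGPT-User
# (live browsing) and everyone else in.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: ChatGPT-User
Allow: /

User-agent: *
Allow: /
"""


def allowed_agents(robots_txt: str, agents: list, url: str) -> dict:
    """Map each agent token to whether robots_txt lets it fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {agent: parser.can_fetch(agent, url) for agent in agents}


if __name__ == "__main__":
    result = allowed_agents(
        ROBOTS_TXT,
        ["GPTBot", "ChatGPT-User", "ClaudeBot"],
        "https://example.com/pricing",
    )
    for agent, ok in result.items():
        print(f"{agent:14} {'allowed' if ok else 'blocked'}")
```

Running the same check against your live robots.txt before and after a change is a cheap way to catch a rule that blocks more agents than intended.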

How to Fix AI Agent User-Agent Blocks

WAFs and bot-protection products often block AI agents alongside scrapers because both look programmatic. Cloudflare Bot Fight Mode, AWS WAF managed rules, and Imperva can all block GPTBot, ClaudeBot, and PerplexityBot under default settings. The result is zero AI visibility: AI answer engines never cite you. This guide covers identifying legitimate agent traffic, verifying that it isn't spoofed, and allowlisting it at each major WAF.

1. Known AI agent user agents (2026)

Agent	UA string contains	Purpose
GPTBot	`GPTBot`	OpenAI training data
ChatGPT-User	`ChatGPT-User`	ChatGPT user browsing
OAI-SearchBot	`OAI-SearchBot`	ChatGPT search results
ClaudeBot	`ClaudeBot`	Anthropic crawler
Claude-Web	`Claude-Web`	Claude live web access
Anthropic-AI	`anthropic-ai`	Anthropic training
PerplexityBot	`PerplexityBot`	Perplexity index
Perplexity-User	`Perplexity-User`	Perplexity user queries
Google-Extended	(none; robots.txt token only, crawling uses Google's existing UAs)	Gemini training opt-out
CCBot	`CCBot/2.0`	Common Crawl (many AI providers)
Meta-ExternalAgent	`meta-externalagent`	Meta AI
Bytespider	`Bytespider`	ByteDance AI
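
As a rough sketch, the table can be turned into a small lookup that labels a raw User-Agent header. The function name is illustrative, matching is a plain case-insensitive substring test, and the result only says what the request claims to be, not that the claim is genuine.

```python
from typing import Optional

# Tokens from the table above, matched case-insensitively.
# Google-Extended is omitted: it is a robots.txt token, not a UA string.
AI_AGENT_TOKENS = {
    "gptbot": "GPTBot",
    "chatgpt-user": "ChatGPT-User",
    "oai-searchbot": "OAI-SearchBot",
    "claudebot": "ClaudeBot",
    "claude-web": "Claude-Web",
    "anthropic-ai": "Anthropic-AI",
    "perplexitybot": "PerplexityBot",
    "perplexity-user": "Perplexity-User",
    "ccbot": "CCBot",
    "meta-externalagent": "Meta-ExternalAgent",
    "bytespider": "Bytespider",
}


def identify_agent(user_agent: str) -> Optional[str]:
    """Return the agent whose token appears in user_agent, else None."""
    ua = user_agent.lower()
    for token, name in AI_AGENT_TOKENS.items():
        if token in ua:
            return name
    return None
```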

2. Audit current traffic

Step 1: Grep access logs

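# -i: token case varies (the Anthropic token is lowercase "anthropic-ai").
# Google-Extended is not listed: it is a robots.txt token and never appears in logs.
grep -iE "GPTBot|ChatGPT-User|OAI-SearchBot|ClaudeBot|PerplexityBot|anthropic-ai|CCBot" \
  /var/log/nginx/access.log | tail -50

# Check status codes: 403/429 means blocked
# ($9 is the status field in the default combined log format)
grep -i "GPTBot" /var/log/nginx/access.log | awk '{print $9}' | sort | uniq -c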
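
The same audit can be sketched in Python to get a status-code breakdown for every agent in one pass. This assumes the default combined log format; the regular expressions and the function name are illustrative and will need adjusting for a custom `log_format`.

```python
import re
from collections import Counter, defaultdict

# Combined log format:
#   ... "METHOD path HTTP/x" status bytes "referer" "user-agent"
LOG_LINE = re.compile(
    r'"\S+ \S+ \S+" (?P<status>\d{3}) \S+ "[^"]*" "(?P<ua>[^"]*)"'
)
AGENT = re.compile(
    r"GPTBot|ChatGPT-User|OAI-SearchBot|ClaudeBot|Claude-Web|"
    r"anthropic-ai|PerplexityBot|Perplexity-User|CCBot",
    re.IGNORECASE,
)


def status_by_agent(lines):
    """Count HTTP status codes per AI agent token found in the UA field."""
    counts = defaultdict(Counter)
    for line in lines:
        match = LOG_LINE.search(line)
        if not match:
            continue  # line is not in combined format
        agent = AGENT.search(match.group("ua"))
        if agent:
            counts[agent.group(0).lower()][match.group("status")] += 1
    return counts
```

Pass it an open file handle for the access log and print the result; an agent whose counts are dominated by 403 or 429 is being blocked or throttled somewhere in the stack.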

Step 2: Cloudflare bot analytics

Cloudflare → Security → Bots. Lists detected bots with action taken. Agents shown as "blocked" or "challenged" need allowlisting.

3. Cloudflare allowlist

Method 1: Verified Bots (easiest)

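Cloudflare validates known bots against their IP ranges and reverse DNS:

Security → Bots → Super Bot Fight Mode (paid plans; the free Bot Fight Mode is a single on/off toggle)
- Verified bots → Allow
- No manual UA rules needed for agents on Cloudflare's verified bots list, such as Googlebot and Bingbot. Confirm that GPTBot, ClaudeBot, and PerplexityBot are currently listed, since membership changes over time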

Method 2: Custom WAF rule (specific control)

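# Cloudflare → Security → WAF → Custom rules → Create
# Field: User Agent
# Operator: contains
# Value: GPTBot
# Then: Skip → All remaining custom rules (also select managed rules and
#   Super Bot Fight Mode rules if those are what block the agent)

# Or in expression syntax. "contains" is case-sensitive, so lowercase
# the header before matching:
(lower(http.user_agent) contains "gptbot") or
(lower(http.user_agent) contains "claudebot") or
(lower(http.user_agent) contains "perplexitybot") or
(lower(http.user_agent) contains "anthropic-ai") or
(lower(http.user_agent) contains "oai-searchbot")
# Action: Skip
# To resist spoofed UAs, wrap the expression as: cf.client.bot and ( ... )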
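
If the token list changes often, the expression can be generated rather than hand-edited. The helper below is a hypothetical sketch that only builds the expression string; it does not call the Cloudflare API. It lowercases the header with `lower()` so matching is case-insensitive and, by default, also requires `cf.client.bot`, the field Cloudflare sets for requests from bots it has verified.

```python
def build_skip_expression(tokens, require_verified=True):
    """Return a Cloudflare rule expression matching any token in the UA.

    lower() makes the match case-insensitive. cf.client.bot restricts the
    rule to requests Cloudflare has verified as coming from a known bot,
    so a spoofed User-Agent alone does not match.
    """
    if not tokens:
        raise ValueError("at least one token is required")
    clauses = [
        f'(lower(http.user_agent) contains "{token.lower()}")'
        for token in tokens
    ]
    expression = " or ".join(clauses)
    if require_verified:
        expression = f"cf.client.bot and ({expression})"
    return expression


if __name__ == "__main__":
    print(build_skip_expression(["GPTBot", "ClaudeBot", "OAI-SearchBot"]))
```

An agent that is not on Cloudflare's verified list will not match while `require_verified` is on; for those, pair the UA clause with an IP list instead of dropping verification entirely.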

4. AWS WAF allowlist

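# Web ACL → Rules → Add rule → Custom rule
# Statement: Inspect → Single header → User-Agent
# Match type: Contains string
# String: gptbot
# Text transformation: Lowercase (makes the match case-insensitive)
# Action: Allow

# Add additional rules for ClaudeBot, PerplexityBot, etc.
# Give these rules a lower priority number than the blocking rules:
# AWS WAF evaluates rules in ascending priority order and Allow is a
# terminating action, so allow first, block what's left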
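
The console steps above correspond roughly to the following WAFv2 rule JSON. This sketch only builds and prints the rule definitions so they can be pasted into the console's JSON editor; it makes no AWS calls. The rule names, metric names, and starting priority are placeholder choices. If you pass these dicts to an SDK instead, note that `SearchString` may need to be bytes rather than a string.

```python
import json


def allow_rule(token: str, priority: int) -> dict:
    """Build one WAFv2 rule allowing requests whose User-Agent contains token."""
    name = f"allow-{token.lower()}"
    return {
        "Name": name,
        "Priority": priority,  # lower number = evaluated earlier
        "Statement": {
            "ByteMatchStatement": {
                "SearchString": token.lower(),
                "FieldToMatch": {"SingleHeader": {"Name": "user-agent"}},
                # Lowercase the header so the match is case-insensitive.
                "TextTransformations": [{"Priority": 0, "Type": "LOWERCASE"}],
                "PositionalConstraint": "CONTAINS",
            }
        },
        "Action": {"Allow": {}},
        "VisibilityConfig": {
            "SampledRequestsEnabled": True,
            "CloudWatchMetricsEnabled": True,
            "MetricName": name,
        },
    }


def allow_rules(tokens, first_priority=0):
    """Build consecutive-priority allow rules, one per token."""
    return [allow_rule(t, first_priority + i) for i, t in enumerate(tokens)]


if __name__ == "__main__":
    rules = allow_rules(["GPTBot", "ClaudeBot", "PerplexityBot"])
    print(json.dumps(rules, indent=2))
```

These rules match on the User-Agent alone, so they carry the spoofing risk described elsewhere in this guide; combining each with an IP set condition is the safer production form.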

5. Nginx allowlist (no WAF)

# /etc/nginx/conf.d/ai-agents.conf
map $http_user_agent $is_ai_agent {
  default 0;
  "~*GPTBot" 1;
  "~*ChatGPT-User" 1;
  "~*ClaudeBot" 1;
  "~*Claude-Web" 1;
  "~*anthropic-ai" 1;
  "~*PerplexityBot" 1;
  "~*Perplexity-User" 1;
  "~*OAI-SearchBot" 1;
  # Google-Extended is a robots.txt token, not a UA string, so it is not matched here
}

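# Rate-limit key: empty for AI agents, client IP for everyone else.
# limit_req does not count requests whose key is empty.
map $is_ai_agent $limit_key {
  1 "";
  default $binary_remote_addr;
}

# Zone name, size, and rate are placeholder values
limit_req_zone $limit_key zone=per_client:10m rate=10r/s;

server {
  location / {
    # AI agents skip this limit because their key is empty
    limit_req zone=per_client burst=20;
  }
}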

6. Verify legitimacy via IP

# OpenAI publishes GPTBot ranges at openai.com/gptbot.json
curl -s https://openai.com/gptbot.json | jq

# Anthropic: URL as given here; confirm the current endpoint in Anthropic's
# crawler documentation before relying on it
curl -s https://anthropic.com/claudebot-ranges.json | jq

# Refresh WAF allowlists from these every 24h via cron
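
Once a ranges file is downloaded, checking a client IP against it takes a few lines with Python's standard `ipaddress` module. The sketch below assumes the file is JSON with a `prefixes` list whose entries carry `ipv4Prefix` or `ipv6Prefix` keys; verify that against the actual file before relying on it. The sample ranges are documentation addresses, not real crawler ranges, and the function names are illustrative.

```python
import ipaddress
import json

# Sample in the assumed layout. These are RFC 5737 / RFC 3849
# documentation ranges, NOT real crawler ranges.
SAMPLE = (
    '{"prefixes": ['
    '{"ipv4Prefix": "192.0.2.0/24"}, '
    '{"ipv6Prefix": "2001:db8::/32"}]}'
)


def load_networks(ranges_json: str) -> list:
    """Parse a ranges document into ip_network objects."""
    networks = []
    for entry in json.loads(ranges_json).get("prefixes", []):
        cidr = entry.get("ipv4Prefix") or entry.get("ipv6Prefix")
        if cidr:
            networks.append(ipaddress.ip_network(cidr))
    return networks


def ip_in_ranges(ip: str, networks: list) -> bool:
    """True if ip falls inside any of the networks (IPv4 or IPv6)."""
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in networks)
```

A request that claims to be GPTBot but whose source IP fails `ip_in_ranges` against the current file should be treated as an impersonator.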

7. Test the allowlist

Step 1: curl with each agent UA

for ua in "GPTBot/1.0" "ClaudeBot/1.0" "PerplexityBot/1.0" "anthropic-ai/1.0"; do
  status=$(curl -s -o /dev/null -w "%{http_code}" -A "$ua" https://example.com/)
  echo "$status $ua"
done

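# With UA-based rules, all should return 200, not 403/429/503.
# With IP-verified rules, these spoofed requests from your own IP should
# still be blocked; confirm real agent traffic in the access logs instead.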

💡 Don't trust user-agent strings alone for allowlisting in production — they're trivially spoofed and let scrapers in. Use Cloudflare Verified Bots or pull IP ranges from official endpoints. UA-only allowlisting is fine for diagnosis but not for security policy.

