robots.txt parsing is forgiving — too forgiving. Malformed directives don't throw errors; they're silently ignored. A typo in Disalow: means that rule is skipped and the URL stays crawlable when you thought it was blocked. Wildcards in the wrong position match too much or too little. This guide covers the directive grammar, the common typos, the position-sensitive wildcards, and the validators that catch what visual inspection misses.
Every directive is one line of Field: value:
User-agent: *
Disallow: /admin/
Allow: /admin/help/
Sitemap: https://example.com/sitemap.xml

# Comment line
User-agent: Googlebot
Disallow: /private/
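Field names are case-insensitive (User-agent, user-agent, USER-AGENT all work).
The colon is required (User-agent:, not User-agent).
A User-agent: line opens a group. Consecutive User-agent: lines share one group, and the Allow/Disallow rules that follow apply to every user-agent named in it until the next User-agent: line starts a new group.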
# Group 1: all crawlers
User-agent: *
Disallow: /admin/
Disallow: /private/

# Group 2: Googlebot specifically
User-agent: Googlebot
Disallow: /tmp/

# Group 3: Bingbot specifically
User-agent: Bingbot
Disallow: /experimental/

# Sitemap is global, applies to all groups
Sitemap: https://example.com/sitemap.xml
# Bad assumption
User-agent: *
Disallow: /admin/
Disallow: /private/

User-agent: Googlebot
Disallow: /tmp/

# Googlebot only obeys its own group: Disallow: /tmp/
# /admin/ and /private/ are NOT blocked for Googlebot

# To block these for Googlebot too, repeat them in its group:
User-agent: *
Disallow: /admin/
Disallow: /private/

User-agent: Googlebot
Disallow: /admin/
Disallow: /private/
Disallow: /tmp/
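The group-selection behavior above can be sketched in a few lines of Python. This is a simplified illustration, not Googlebot's actual parser: the names `parse_groups`, `rules_for`, and `SAMPLE` are hypothetical, and the sketch matches user-agent names exactly instead of applying real product-token matching.

```python
SAMPLE = """\
User-agent: *
Disallow: /admin/
Disallow: /private/

User-agent: Googlebot
Disallow: /tmp/
"""


def parse_groups(text):
    """Return a list of (agents, rules) tuples from robots.txt text."""
    groups = []
    agents, rules = [], []
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments
        if ":" not in line:
            continue  # blank or malformed lines are skipped, as a crawler would
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "user-agent":
            if rules:  # a rule was seen, so this User-agent starts a new group
                groups.append((agents, rules))
                agents, rules = [], []
            agents.append(value.lower())
        elif field in ("allow", "disallow") and agents:
            rules.append((field, value))
    if agents:
        groups.append((agents, rules))
    return groups


def rules_for(groups, crawler):
    """Pick the rules a crawler obeys: its own group if named, else the * group."""
    crawler = crawler.lower()
    named = [rules for agents, rules in groups if crawler in agents]
    chosen = named or [rules for agents, rules in groups if "*" in agents]
    return [rule for rules in chosen for rule in rules]


groups = parse_groups(SAMPLE)
print(rules_for(groups, "Googlebot"))  # only the Googlebot group applies
print(rules_for(groups, "Bingbot"))    # falls back to the * group
```

The fallback only happens when no group names the crawler, which is why the rules from the `*` group have to be repeated inside the Googlebot group.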
Two wildcard characters with different meanings:
| Wildcard | Meaning | Example |
|---|---|---|
| * | Matches any sequence of characters | /*.pdf matches /file.pdf, /docs/file.pdf |
| $ | Anchors to end of URL | /*.pdf$ matches /file.pdf but NOT /file.pdf?id=1 |
# Block all URLs ending in .pdf
Disallow: /*.pdf$

# Block all URLs with query strings
Disallow: /*?

# Block a specific query parameter (matches only when sort is the first parameter)
Disallow: /*?sort=

# Block paths containing admin/ anywhere (also matches /superadmin/)
Disallow: /*admin/

# Block CSV downloads in a specific directory
Disallow: /reports/*.csv$
# These two are equivalent
Disallow: /admin/
Disallow: /admin/*

# Without trailing slash, prefix match
Disallow: /admin
# Matches /admin, /admin/, /admin/page, /administrator
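One way to reason about where * and $ take effect is to translate a pattern into a regular expression and test paths against it. The sketch below assumes the matching rules described above (prefix match, * as any sequence, $ as an end anchor). `pattern_to_regex` and `matches` are illustrative names, and percent-encoding normalization is ignored.

```python
import re


def pattern_to_regex(pattern):
    """Translate a robots.txt path pattern into a compiled regex."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape literal text, then join the pieces with '.*' where '*' appeared
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.compile(body + (r"\Z" if anchored else ""))


def matches(pattern, url_path):
    """True if the pattern matches from the start of the path (prefix match)."""
    return pattern_to_regex(pattern).match(url_path) is not None


examples = [
    ("/*.pdf", "/docs/file.pdf"),
    ("/*.pdf$", "/file.pdf"),
    ("/*.pdf$", "/file.pdf?id=1"),
    ("/admin", "/administrator"),
    ("/admin/", "/admin"),
]
for pattern, path in examples:
    print(pattern, path, matches(pattern, path))
```

Running candidate rules through a helper like this before deploying makes over-matching (for example `/admin` catching `/administrator`) visible early.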
# BAD: silently ignored
Disalow: /admin/

# RIGHT
Disallow: /admin/
# BAD: silently ignored
User-agent *
Disallow /admin/

# RIGHT
User-agent: *
Disallow: /admin/
# BAD: tries to disallow "/admin/ temporary block"
Disallow: /admin/ temporary block

# RIGHT
Disallow: /admin/ # temporary block

# or
# temporary block
Disallow: /admin/
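Misspelled fields, missing colons, and uncommented inline notes are all mechanical mistakes, so a small lint pass can flag them. The following is a minimal sketch, not a full validator: `lint_robots` and `KNOWN_FIELDS` are hypothetical names, and the field list is an assumption based on the directives mentioned in this guide.

```python
import difflib

# Assumed list of recognized fields; extend it for other crawlers as needed
KNOWN_FIELDS = ["user-agent", "allow", "disallow", "sitemap",
                "crawl-delay", "host", "clean-param"]


def lint_robots(text):
    """Return (line_number, message) tuples for lines a parser would likely ignore."""
    problems = []
    for number, raw in enumerate(text.splitlines(), start=1):
        line = raw.split("#", 1)[0].strip()  # comments are always fine
        if not line:
            continue
        if ":" not in line:
            problems.append((number, "missing colon"))
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        if field.lower() not in KNOWN_FIELDS:
            guess = difflib.get_close_matches(field.lower(), KNOWN_FIELDS, n=1)
            hint = f"; did you mean '{guess[0]}'?" if guess else ""
            problems.append((number, f"unknown field '{field}'{hint}"))
        elif field.lower() in ("allow", "disallow") and " " in value:
            problems.append((number, "space in path; is an inline note missing its '#'?"))
    return problems


sample = """\
User-agent *
Disalow: /admin/
Disallow: /admin/ temporary block
Disallow: /private/ # ok
"""
for number, message in lint_robots(sample):
    print(f"line {number}: {message}")
```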
# File saved with UTF-8 BOM (byte-order mark) confuses some parsers
# Save as UTF-8 without BOM, Unix line endings (LF, not CRLF)
# Check with hexdump:
hexdump -C robots.txt | head -1
# Should NOT start with ef bb bf
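The same byte-level checks can be scripted. This is a small sketch under the assumptions stated above (no BOM, LF line endings, valid UTF-8). `encoding_problems` is a hypothetical helper name.

```python
import codecs


def encoding_problems(data: bytes):
    """Return a list of byte-level issues found in raw robots.txt content."""
    problems = []
    if data.startswith(codecs.BOM_UTF8):  # the bytes ef bb bf
        problems.append("starts with a UTF-8 BOM")
    if b"\r\n" in data:
        problems.append("contains CRLF line endings")
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        problems.append("not valid UTF-8")
    return problems


# In-memory examples; for a real file, pass pathlib.Path("robots.txt").read_bytes()
print(encoding_problems(b"\xef\xbb\xbfUser-agent: *\r\nDisallow: /admin/\r\n"))
print(encoding_problems(b"User-agent: *\nDisallow: /admin/\n"))
```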
# BAD: wildcards do NOT work inside User-agent values (a lone * is the only exception)
User-agent: Googlebot*

# RIGHT: list each user-agent explicitly
User-agent: Googlebot
User-agent: Googlebot-Image
User-agent: Googlebot-News
Disallow: /
Google supports: User-agent, Allow, Disallow, Sitemap. Common non-standard directives:
# Crawl-delay: ignored by Google, used by Bing/Yandex
User-agent: Bingbot
Crawl-delay: 10

# Host: Yandex-specific
Host: example.com

# Clean-param: Yandex-specific
Clean-param: ref /forum/showthread.php

# Comments: always supported
# This is a comment
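Google offers no robots.txt directive for crawl rate, and the legacy Search Console crawl rate limiter was retired in early 2024. To slow Googlebot temporarily, serve 500, 503, or 429 responses; Googlebot reduces its crawl rate when it sees them.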
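Search Console → Settings → robots.txt report
# Replaces the legacy robots.txt Tester, which Google retired in late 2023
# Lists the robots.txt files Google found for the property, with fetch status
# Identifies syntax warnings and errors with line numbers
# To confirm allow/disallow status for a specific URL, use URL Inspection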
pip install google-robotxt
# Parser identical to Googlebot's
python3 -c "
from google_robotxt import RobotsTxt
r = RobotsTxt.from_file('robots.txt')
print(r.is_allowed('Googlebot', '/some/path/'))
"
npm install robots-parser
const robotsParser = require('robots-parser');
const fs = require('fs');
const content = fs.readFileSync('robots.txt', 'utf-8');
const robots = robotsParser('https://example.com/robots.txt', content);
console.log(robots.isAllowed('https://example.com/admin/', 'Googlebot'));
// Prints false when robots.txt disallows /admin/ for Googlebot
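Where installing a package is not an option, Python's standard-library `urllib.robotparser` can serve as a rough check. Treat the following as a sketch with known limits: the stdlib parser does not interpret * and $ inside paths and evaluates rules in file order, so its answers can differ from Googlebot's on files that rely on wildcards or on Allow overriding Disallow. The `load_rules` helper and the sample content are illustrative.

```python
from urllib.robotparser import RobotFileParser


def load_rules(text):
    """Parse robots.txt content from a string (no network access needed)."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


rules = load_rules("""\
User-agent: *
Disallow: /admin/

User-agent: Googlebot
Disallow: /tmp/
""")

# can_fetch takes the user-agent first, then the URL or path
print(rules.can_fetch("Googlebot", "/admin/"))    # Googlebot follows only its own group
print(rules.can_fetch("Googlebot", "/tmp/file"))
print(rules.can_fetch("Bingbot", "/admin/"))      # Bingbot falls back to the * group
```

For plain prefix rules like these, the stdlib result is a quick sanity check; for wildcard-heavy files, a parser that implements Google's matching rules is the safer reference.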
# Fetch and look for byte-level issues
# -i on grep: header names are lowercase over HTTP/2
curl -sv -o /dev/null https://example.com/robots.txt 2>&1 | grep -i "content-type"
# Should be: text/plain; charset=utf-8

# Check first bytes: no BOM
curl -s https://example.com/robots.txt | head -1 | hexdump -C | head -1
# GitHub Actions example
- name: Validate robots.txt
  run: |
    pip install google-robotxt
    python3 -c "
    from google_robotxt import RobotsTxt
    r = RobotsTxt.from_file('public/robots.txt')
    # Critical paths must be allowed
    critical = ['/', '/products/', '/about/']
    for path in critical:
        assert r.is_allowed('Googlebot', path), f'CRITICAL: {path} blocked'
    # Sensitive paths must be blocked
    blocked = ['/admin/', '/api/internal/']
    for path in blocked:
        assert not r.is_allowed('Googlebot', path), f'SENSITIVE: {path} allowed'
    print('robots.txt validation passed')
    "