A crawler trap is any URL pattern that generates an effectively infinite number of distinct URLs without unique content. Common examples are faceted search where every filter combination creates a new URL, calendar pagination going back to the year 1900, and session IDs in URLs. Each variation looks unique to Googlebot, so it crawls them all, wasting crawl budget on duplicate content while your important pages wait. This guide covers identifying common traps, capping them with robots.txt patterns, and removing already-indexed trap URLs with noindex.
/products?category=shoes&size=10&color=blue&brand=nike
/products?category=shoes&size=10&color=blue&brand=adidas
/products?category=shoes&size=10&color=red&brand=nike
/products?category=shoes&size=11&color=blue&brand=nike
... (millions of combinations)
/events/2025/01
/events/2025/02
/events/1850/01 ← who's looking for events from 1850?
/events/3000/06 ← future-dated empty pages
/products/widget?sessionid=abc123
/products/widget?sessionid=def456
/products/widget?sessionid=ghi789
(every visitor gets a unique URL for the same product)
/products
/products?sort=price-asc
/products?sort=price-desc
/products?sort=name-asc
/products?page=1&sort=price-asc
/products?page=2&sort=price-asc&perpage=24
/products?page=2&sort=price-asc&perpage=48
/products?perpage=12&page=2&sort=price-asc ← same content, different param order
https://example.com/article/123
https://example.com/article/123/print
https://example.com/article/123/email
https://example.com/article/123/share?via=twitter
https://example.com/article/123?print=1
https://example.com/article/123?utm_source=...
/blog/post?replytocom=12345
/blog/post?replytocom=12346
/blog/post?cpage=2#comments
(WordPress-specific, generates URLs for every comment reply form)
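The patterns listed above can be recognised programmatically, which helps when auditing a URL export or a log file. The following is a minimal sketch, not a complete detector: the function name `classifyTrap`, the parameter lists, and the `/events/` path shape are assumptions drawn from the example URLs, and the plausible year window (2010 to current year plus two) is only illustrative.

```javascript
// Hypothetical parameter lists; adjust to the parameters your site actually uses.
const SESSION_PARAMS = ['sessionid', 'phpsessid', 'sid'];
const FACET_PARAMS = ['size', 'color', 'brand'];
const VIEW_PARAMS = ['sort', 'perpage', 'print'];

// Returns a label for the first trap pattern the URL matches, or null.
function classifyTrap(href) {
  const url = new URL(href, 'https://example.com');
  const names = [...url.searchParams.keys()].map((k) => k.toLowerCase());

  if (names.some((n) => SESSION_PARAMS.includes(n))) return 'session-id';
  if (names.some((n) => n.startsWith('utm_'))) return 'tracking';
  if (names.includes('replytocom')) return 'comment-reply';
  if (names.some((n) => VIEW_PARAMS.includes(n))) return 'sort-or-view';
  // Two or more facet parameters combined is the faceted-search explosion.
  if (names.filter((n) => FACET_PARAMS.includes(n)).length >= 2) return 'faceted';
  if (/\/(print|email|share)$/.test(url.pathname)) return 'print-or-share';

  // Calendar URLs such as /events/1850/01: flag years outside a plausible window.
  const cal = url.pathname.match(/^\/events\/(\d{4})\//);
  if (cal) {
    const year = Number(cal[1]);
    if (year < 2010 || year > new Date().getFullYear() + 2) return 'calendar';
  }
  return null;
}

console.log(classifyTrap('/products/widget?sessionid=abc123')); // session-id
console.log(classifyTrap('/products'));                         // null
```

Checks run in a fixed order and the first match wins, so a URL carrying both a session ID and a sort parameter is reported as `session-id`.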
# Top crawled URLs by Googlebot (covers whatever period access.log spans, e.g. the past week with weekly rotation)
grep "Googlebot" /var/log/nginx/access.log | \
awk '{print $7}' | \
sort | uniq -c | sort -rn | \
head -30
# If trap patterns dominate top results, you have a budget problem
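Raw URL counts can hide a trap, because each variant may be crawled only once or twice. Grouping by path plus parameter names (ignoring values) makes the pattern visible. This is a rough sketch under the assumption that the input is the list of request paths produced by the `awk` step above; `patternOf` and `summarisePatterns` are hypothetical helper names.

```javascript
// Collapse each crawled URL to "path?param1&param2" (names only, sorted),
// so that thousands of value variants count as one pattern.
function patternOf(requestPath) {
  const url = new URL(requestPath, 'https://example.com');
  const names = [...new Set(url.searchParams.keys())].sort();
  return names.length ? `${url.pathname}?${names.join('&')}` : url.pathname;
}

// Count hits per pattern and return [pattern, count] pairs, busiest first.
function summarisePatterns(requestPaths) {
  const counts = new Map();
  for (const p of requestPaths) {
    const key = patternOf(p);
    counts.set(key, (counts.get(key) || 0) + 1);
  }
  return [...counts.entries()].sort((a, b) => b[1] - a[1]);
}

const sample = [
  '/products?sort=price-asc',
  '/products?sort=price-desc',
  '/products?page=2&sort=price-asc',
  '/products?sort=name-asc&page=2',
  '/products',
];
console.log(summarisePatterns(sample));
```

In the sample, four of the five requests collapse into two parameterised patterns on `/products`, which is the kind of skew that suggests crawl budget is going to variants rather than to distinct pages.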
site:example.com inurl:?sort=
site:example.com inurl:?page=
site:example.com inurl:?utm_
# If results count is in the thousands and they're trap variants, Google has indexed your traps
Best for traps not yet indexed. Prevents crawl entirely.
User-agent: *
# Block all URLs with multiple filter parameters
Disallow: /products?*&*
# Or block specific parameter combinations
Disallow: /products?*size=*
Disallow: /products?*color=*
Disallow: /products?*brand=*
# Block the 1900s and 2040-2059 (add further prefixes, e.g. /events/18* or /events/3*, for other ranges)
Disallow: /events/19*
Disallow: /events/204*
Disallow: /events/205*
# (better: fix the calendar to not generate impossible years)
Disallow: /*?sort=
Disallow: /*&sort=
Disallow: /*?sessionid=
Disallow: /*&sessionid=
Disallow: /*?PHPSESSID=
Disallow: /*&PHPSESSID=
Disallow: /*?sid=
Disallow: /*&sid=
Disallow: /*?utm_
Disallow: /*&utm_
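Before deploying Disallow patterns like the ones above, it helps to test them against real URLs, since a wildcard that is too broad blocks important pages and one that is too narrow lets trap URLs through. The sketch below approximates the wildcard matching Google documents for robots.txt (`*` matches any run of characters, a trailing `$` anchors the end, everything else is a prefix match). It is a simplification: it ignores Allow rules, rule precedence, and percent-encoding, and the function names are hypothetical.

```javascript
// Convert a robots.txt path pattern to a RegExp:
// "*" matches any run of characters, a trailing "$" anchors the end,
// and everything else is a literal prefix match.
function disallowToRegExp(pattern) {
  const anchored = pattern.endsWith('$');
  const body = anchored ? pattern.slice(0, -1) : pattern;
  const escaped = body
    .split('*')
    .map((part) => part.replace(/[.+?^${}()|[\]\\]/g, '\\$&'))
    .join('.*');
  return new RegExp('^' + escaped + (anchored ? '$' : ''));
}

// True if any Disallow pattern matches the path (including its query string).
function isDisallowed(pathWithQuery, patterns) {
  return patterns.some((p) => disallowToRegExp(p).test(pathWithQuery));
}

const sortRules = ['/*?sort=', '/*&sort='];
console.log(isDisallowed('/products?sort=price-asc', sortRules));        // true
console.log(isDisallowed('/products?page=2&sort=price-asc', sortRules)); // true
console.log(isDisallowed('/products?page=2', sortRules));                // false
```

Running the calendar patterns through the same check shows that `/events/1850/01` is not matched by `/events/19*`, `/events/204*`, or `/events/205*`, so prefix rules need one entry per decade or century you want excluded.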
// Apply at the application level
function maybeNoindex(req, res) {
  const url = new URL(req.url, 'https://example.com');
  // URLs with more than two parameters are treated as trap variants
  if ([...url.searchParams.keys()].length > 2) {
    res.setHeader('X-Robots-Tag', 'noindex, nofollow');
  }
  // Sort and tracking parameters
  if (url.searchParams.has('sort') || url.searchParams.has('utm_source')) {
    res.setHeader('X-Robots-Tag', 'noindex, nofollow');
  }
}
<!-- On filter pages, conditionally render -->
<?php if (count($_GET) > 2): ?>
  <meta name="robots" content="noindex, follow">
<?php endif; ?>
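Expect deindexing to take 2-8 weeks. Monitor "Excluded by noindex tag" in Search Console.
Keep these URLs crawlable until then: Googlebot has to fetch a page to see its noindex, so a Disallow added too early leaves the URL indexed. Once URLs are deindexed, add robots.txt Disallow patterns to prevent re-crawl waste.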
For sort/pagination/filter variants that should consolidate to a canonical:
<!-- On /products?sort=price&page=2 -->
<link rel="canonical" href="https://example.com/products" />
<!-- Or to the paginated version -->
<link rel="canonical" href="https://example.com/products?page=2" />
<!-- (drops sort but keeps pagination) -->
The canonical tag tells Google that these variants represent the same content. Google treats it as a strong hint rather than a directive and, when it accepts the hint, consolidates ranking signals to the canonical URL.
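Canonical URLs are easier to keep consistent when they are computed rather than hand-written per template. The sketch below derives the "drops sort but keeps pagination" variant: it removes parameters assumed not to change the content and keeps the rest. `canonicalFor` and the `DROP_PARAMS` list are illustrative assumptions, not a fixed rule set.

```javascript
// Parameters assumed never to change page content (adjust per site).
const DROP_PARAMS = new Set(['sort', 'sessionid', 'replytocom']);

// Build the canonical URL for a page: drop presentation, session and
// tracking parameters, keep everything else (e.g. page), and sort the rest.
function canonicalFor(href) {
  const url = new URL(href, 'https://example.com');
  // Copy the keys first so deleting does not disturb the iteration.
  for (const name of [...url.searchParams.keys()]) {
    if (DROP_PARAMS.has(name) || name.startsWith('utm_')) {
      url.searchParams.delete(name);
    }
  }
  url.searchParams.sort();
  url.hash = '';
  return url.toString();
}

console.log(canonicalFor('/products?sort=price&page=2'));
// https://example.com/products?page=2
```

The returned string can be written into the `href` of the page's canonical link element, so every sort or tracking variant points at the same URL.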
<!-- Filter form submit creates trap URLs: nofollow the form (rel is not a valid attribute on button) -->
<form action="/products" method="get" rel="nofollow">
  <select name="sort">...</select>
  <button type="submit">Apply</button>
</form>
<!-- Print/email variant links -->
<a href="/article/123/print" rel="nofollow">Print</a>
<a href="/article/123/share" rel="nofollow">Share</a>
Nofollow on these links reduces discovery of trap URLs in the first place. Combined with Disallow, it is very effective at keeping traps out of the crawl graph.
// Don't render links to years with no events
const eventsByYear = await getEventsByYear();
const yearsWithEvents = Object.keys(eventsByYear);
// Only render pagination links for actual data
yearsWithEvents.forEach(year => {
  // ...render link
});

// Don't generate impossible years
const minYear = 2010;
const maxYear = new Date().getFullYear() + 2;
// Route shape assumed from the /events/2025/01 URLs above
app.get('/events/:year/:month', (req, res, next) => {
  const year = Number(req.params.year);
  if (!Number.isInteger(year) || year < minYear || year > maxYear) {
    return res.status(404).end();
  }
  next();
});
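// Use cookies or headers, not URL params
// BAD:  ?sessionid=abc123
// GOOD: Cookie: session=abc123 in HTTP header
// Express example (express-session)
const session = require('express-session');
app.use(session({
  secret: '...',
  cookie: { httpOnly: true, secure: true },
  resave: false,
  // Only set the cookie when the session is actually used.
  // saveUninitialized defaults to true, so it must be disabled explicitly.
  saveUninitialized: false,
}));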
// Middleware to canonicalise query strings
function normaliseQueryString(req, res, next) {
  const url = new URL(req.url, 'https://example.com');
  const original = url.searchParams.toString();
  // Sort parameters by name so every ordering maps to one URL
  url.searchParams.sort();
  const canonical = url.searchParams.toString();
  // Compare serialised forms so encoding differences cannot cause a redirect loop
  if (canonical !== original) {
    return res.redirect(301, `${url.pathname}?${canonical}`);
  }
  next();
}
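To see what parameter-order normalisation achieves, it can help to run the sorting step on its own, outside Express. This small sketch uses the standard `URLSearchParams.sort()` method; `sortedQuery` is a hypothetical helper name, and the URLs are reorderings of the same three parameters from the pagination example.

```javascript
// Return the path with its query parameters sorted by name.
function sortedQuery(pathWithQuery) {
  const url = new URL(pathWithQuery, 'https://example.com');
  url.searchParams.sort();
  return url.pathname + url.search;
}

const a = sortedQuery('/products?page=2&sort=price-asc&perpage=48');
const b = sortedQuery('/products?perpage=48&page=2&sort=price-asc');

console.log(a);       // /products?page=2&perpage=48&sort=price-asc
console.log(a === b); // true
```

Both orderings resolve to one URL, so a crawler following either link ends up on the same page after a single 301.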
The result count for site:example.com inurl:?sort= should fall toward zero over 2-3 months as Google deindexes the blocked variants.