How to Fix Crawler-Trap Patterns

A crawler trap is any URL pattern that generates an effectively infinite number of distinct URLs without unique content. Common examples are faceted search where every filter combination creates a new URL, calendar pagination that goes back to the year 1900, and session IDs in URLs. Each variation looks unique to Googlebot, so it crawls them all and wastes crawl budget on duplicate content while your important pages wait. This guide covers identifying common traps, capping them with robots.txt patterns, and removing already-indexed trap URLs with noindex.

1. Identify common trap patterns

Faceted search / filter combinations

/products?category=shoes&size=10&color=blue&brand=nike
/products?category=shoes&size=10&color=blue&brand=adidas
/products?category=shoes&size=10&color=red&brand=nike
/products?category=shoes&size=11&color=blue&brand=nike
... (millions of combinations)
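
To see how quickly filter combinations multiply, here is a rough count. It is a sketch under stated assumptions: the facet sizes are made-up numbers for a hypothetical catalogue, and each facet is either unset or set to exactly one value.

```javascript
// Count distinct filter URLs for a set of facets.
// Each facet is either unset or set to one of its values,
// so it contributes (values + 1) choices to the product.
function countFacetUrls(facetSizes) {
  return Object.values(facetSizes).reduce((total, n) => total * (n + 1), 1);
}

// Hypothetical catalogue: these facet sizes are illustrative assumptions.
const facets = { category: 20, size: 15, color: 12, brand: 200 };

console.log(countFacetUrls(facets)); // 877968
```

That figure assumes a fixed parameter order. If the site also accepts the same parameters in any order, each four-parameter combination can appear under up to 24 different URLs.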

Calendar pagination

/events/2025/01
/events/2025/02
/events/1850/01    ← who's looking for events from 1850?
/events/3000/06    ← future-dated empty pages

Session IDs in URLs

/products/widget?sessionid=abc123
/products/widget?sessionid=def456
/products/widget?sessionid=ghi789
(every visitor gets a unique URL for the same product)

Sort/pagination combinations

/products
/products?sort=price-asc
/products?sort=price-desc
/products?sort=name-asc
/products?page=1&sort=price-asc
/products?page=2&sort=price-asc&perpage=24
/products?page=2&sort=price-asc&perpage=48
/products?perpage=12&page=2&sort=price-asc   ← same content, different param order

Print/email/share variants

https://example.com/article/123
https://example.com/article/123/print
https://example.com/article/123/email
https://example.com/article/123/share?via=twitter
https://example.com/article/123?print=1
https://example.com/article/123?utm_source=...

Comment pagination loops

/blog/post?replytocom=12345
/blog/post?replytocom=12346
/blog/post?cpage=2#comments
(WordPress-specific, generates URLs for every comment reply form)
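
The patterns above can be turned into a small classifier for labelling URLs from logs or a crawl export. This is a minimal sketch, not a complete detector: the rule names, the facet parameter list, and the calendar year range (2010 to two years ahead) are assumptions chosen to match the examples in this guide.

```javascript
// Hypothetical facet parameters, taken from the /products examples above.
const FACET_PARAMS = ['category', 'size', 'color', 'brand'];

// Each rule pairs a label with a test on the parsed URL. First match wins.
const TRAP_RULES = [
  ['session-id', (u) => ['sessionid', 'PHPSESSID', 'sid'].some((p) => u.searchParams.has(p))],
  ['tracking', (u) => [...u.searchParams.keys()].some((k) => k.startsWith('utm_'))],
  ['comment-reply', (u) => u.searchParams.has('replytocom') || u.searchParams.has('cpage')],
  ['print-email-share', (u) => /\/(print|email|share)$/.test(u.pathname) || u.searchParams.has('print')],
  ['sort-pagination', (u) => u.searchParams.has('sort') || u.searchParams.has('perpage')],
  ['calendar', (u) => {
    const m = u.pathname.match(/^\/events\/(\d{4})(?:\/|$)/);
    if (!m) return false;
    const year = Number(m[1]);
    // Assumed plausible range: 2010 up to two years from now.
    return year < 2010 || year > new Date().getFullYear() + 2;
  }],
  ['faceted', (u) => FACET_PARAMS.filter((p) => u.searchParams.has(p)).length >= 2],
];

// Returns the label of the first matching rule, or null for a non-trap URL.
function classifyTrap(rawUrl) {
  const u = new URL(rawUrl, 'https://example.com');
  const hit = TRAP_RULES.find(([, test]) => test(u));
  return hit ? hit[0] : null;
}

console.log(classifyTrap('/products/widget?sessionid=abc123')); // session-id
console.log(classifyTrap('/events/1850/01'));                   // calendar
console.log(classifyTrap('/products'));                         // null
```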

2. Measure the impact

Step 1: Search Console crawl stats
Open Search Console → Settings → Crawl stats. The "Crawl requests" chart shows the total. "By response" shows the status code breakdown, "By file type" shows HTML vs CSS vs JS, and "By Googlebot type" shows mobile vs desktop vs image.
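Step 2: Server log analysis
# Top 30 request paths fetched by Googlebot in this log file.
# Assumes the default combined log format, where field 7 is the request path;
# include rotated logs to cover a full week.
grep "Googlebot" /var/log/nginx/access.log | \
  awk '{print $7}' | \
  sort | uniq -c | sort -rn | \
  head -30

# If trap patterns dominate the top results, you have a crawl-budget problem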
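
The shell pipeline counts exact URLs, so a trap with thousands of distinct parameter values can hide in the long tail. One way around this is to group requests by URL "shape" (the path plus sorted parameter names, with values dropped). The sketch below assumes combined-format log lines. The sample lines and the helper names `urlShape` and `countGooglebotShapes` are illustrative, not part of any tool.

```javascript
// Reduce a request target to its shape: path plus sorted parameter names.
function urlShape(target) {
  const u = new URL(target, 'https://example.com');
  const names = [...new Set(u.searchParams.keys())].sort();
  return names.length ? `${u.pathname}?${names.join('&')}` : u.pathname;
}

// Count Googlebot requests per URL shape, most frequent first.
function countGooglebotShapes(logLines) {
  const counts = new Map();
  for (const line of logLines) {
    if (!line.includes('Googlebot')) continue;
    // Combined log format: the request line is quoted, e.g. "GET /path HTTP/1.1"
    const m = line.match(/"(?:GET|HEAD|POST) (\S+) HTTP/);
    if (!m) continue;
    const shape = urlShape(m[1]);
    counts.set(shape, (counts.get(shape) || 0) + 1);
  }
  return [...counts.entries()].sort((a, b) => b[1] - a[1]);
}

// Hypothetical sample lines for illustration only.
const googlebotUa = '"Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"';
const sampleLines = [
  `66.249.66.1 - - [10/Jan/2025:10:00:00 +0000] "GET /products?sort=price-asc&page=2 HTTP/1.1" 200 512 "-" ${googlebotUa}`,
  `66.249.66.1 - - [10/Jan/2025:10:00:05 +0000] "GET /products?page=3&sort=name-asc HTTP/1.1" 200 512 "-" ${googlebotUa}`,
  `66.249.66.1 - - [10/Jan/2025:10:00:09 +0000] "GET /products/widget HTTP/1.1" 200 900 "-" ${googlebotUa}`,
  '203.0.113.7 - - [10/Jan/2025:10:00:11 +0000] "GET /products?sort=price-asc HTTP/1.1" 200 512 "-" "Mozilla/5.0"',
];

console.log(countGooglebotShapes(sampleLines));
// [ [ '/products?page&sort', 2 ], [ '/products/widget', 1 ] ]
```

Matching on the user-agent string alone also counts crawlers that only claim to be Googlebot, so treat the numbers as an upper bound.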
Step 3: Audit indexed URLs
site:example.com inurl:?sort=
site:example.com inurl:?page=
site:example.com inurl:?utm_

# If the result count is in the thousands and the results are trap variants, Google has indexed your traps

3. Cap with robots.txt

Best for traps that are not yet indexed: a Disallow rule stops Googlebot from crawling matching URLs at all.

Faceted search

User-agent: *
# Block all URLs with multiple filter parameters
Disallow: /products?*&*

# Or block specific parameter combinations
Disallow: /products?*size=*
Disallow: /products?*color=*
Disallow: /products?*brand=*

Calendar limits

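# Block the 1900s and 2040-2059 by prefix; add more prefixes
# (e.g. /events/18* or /events/3*) for any other ranges your calendar generates
Disallow: /events/19*
Disallow: /events/204*
Disallow: /events/205*
# (better: fix the calendar so it never generates impossible years)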

Sort parameter

Disallow: /*?sort=
Disallow: /*&sort=

Session IDs

Disallow: /*?sessionid=
Disallow: /*&sessionid=
Disallow: /*?PHPSESSID=
Disallow: /*&PHPSESSID=
Disallow: /*?sid=
Disallow: /*&sid=

UTM and tracking parameters

Disallow: /*?utm_
Disallow: /*&utm_
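
Before deploying rules like these, it helps to check them against sample URLs. The sketch below is a simplified matcher, not a full robots.txt parser. It treats `*` as "any run of characters" and a trailing `$` as an end anchor, and it ignores Allow rules and longest-match precedence. The function names are hypothetical.

```javascript
// Convert a robots.txt path pattern into a RegExp anchored at the start of the path.
function patternToRegExp(pattern) {
  const anchored = pattern.endsWith('$');
  const body = anchored ? pattern.slice(0, -1) : pattern;
  const escaped = body
    .split('*')
    // Escape regex metacharacters in the literal parts between wildcards.
    .map((part) => part.replace(/[.*+?^${}()|[\]\\]/g, '\\$&'))
    .join('.*');
  return new RegExp('^' + escaped + (anchored ? '$' : ''));
}

// True if any Disallow pattern matches the path-plus-query string.
function isDisallowed(pathAndQuery, disallowPatterns) {
  return disallowPatterns.some((p) => patternToRegExp(p).test(pathAndQuery));
}

// Rules copied from the sections above.
const rules = ['/products?*&*', '/*?sort=', '/*&sort=', '/*?utm_', '/*&utm_'];

console.log(isDisallowed('/products?sort=price-asc', rules)); // true
console.log(isDisallowed('/products?page=2', rules));         // false
console.log(isDisallowed('/products', rules));                // false
```

Running real, valuable URLs through the same check is as important as running trap URLs through it, because an over-broad pattern such as `/products?*&*` can also block pages you want crawled.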

4. Remove already-indexed traps

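⚠️ If Google has already indexed trap URLs, robots.txt Disallow alone won't remove them. Disallow stops new crawls, but URLs that are already indexed persist, and Googlebot cannot see a noindex directive on a page it is not allowed to fetch. Add noindex first, then add Disallow only after deindexing completes.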

Phase 1: Add noindex to trap pages

// Apply at the application level, before the response is sent
function maybeNoindex(req, res) {
  const url = new URL(req.url, 'https://example.com');
  const paramCount = [...url.searchParams.keys()].length;

  // More than two query parameters: treat the URL as a trap variant.
  // Sort and tracking parameters: always a variant of another page.
  if (
    paramCount > 2 ||
    url.searchParams.has('sort') ||
    url.searchParams.has('utm_source')
  ) {
    res.setHeader('X-Robots-Tag', 'noindex, nofollow');
  }
}

Or use meta tag in HTML

<!-- On filter pages, conditionally render -->
<?php if (count($_GET) > 2): ?>
<meta name="robots" content="noindex, follow">
<?php endif; ?>

Phase 2: Wait for Google to drop them

Expect deindexing to take 2-8 weeks. Monitor the "Excluded by noindex tag" count in Search Console.

Phase 3: Add Disallow rules

Once URLs are deindexed, add robots.txt Disallow patterns to prevent re-crawl waste.
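
The ordering of the three phases can be written down as a small decision helper. This is only a sketch of the reasoning above. The function name, the input fields, and the returned strings are all hypothetical, and the indexed count would come from your own Search Console or site: query checks.

```javascript
// Decide the next step for one trap pattern.
// indexedCount: how many URLs of this pattern are currently indexed
// noindexLive:  whether the noindex header or meta tag is deployed
// disallowed:   whether robots.txt currently blocks the pattern
function nextStep({ indexedCount, noindexLive, disallowed }) {
  if (indexedCount > 0) {
    // Googlebot must be able to fetch the page to see the noindex directive.
    if (disallowed) return 'remove Disallow so Googlebot can see noindex';
    return noindexLive ? 'wait for deindexing' : 'add noindex';
  }
  return disallowed ? 'done' : 'add Disallow';
}

console.log(nextStep({ indexedCount: 4200, noindexLive: false, disallowed: false })); // add noindex
console.log(nextStep({ indexedCount: 0, noindexLive: true, disallowed: false }));     // add Disallow
```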

5. Canonical tag for parameter variants

For sort/pagination/filter variants that should consolidate to a canonical:

<!-- On /products?sort=price&page=2 -->
<link rel="canonical" href="https://example.com/products" />

<!-- Or to the paginated version -->
<link rel="canonical" href="https://example.com/products?page=2" />
<!-- (drops sort but keeps pagination) -->

The canonical tag tells Google that these variants represent the same content. Google treats it as a strong hint rather than a directive and, when it agrees, consolidates ranking signals to the canonical URL.
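
Rather than hand-writing the canonical URL on every template, it can be derived from the request URL with an allowlist of parameters that change the content. The sketch below follows the "drops sort but keeps pagination" variant. `KEEP_PARAMS` and `canonicalUrlFor` are hypothetical names, and treating `page=1` as equivalent to the bare URL is an assumption about the site.

```javascript
// Parameters that select different content and should survive in the canonical URL.
// Everything else (sort, perpage, utm_*, sessionid, ...) is dropped.
const KEEP_PARAMS = ['page'];

function canonicalUrlFor(rawUrl) {
  const u = new URL(rawUrl, 'https://example.com');
  const kept = new URLSearchParams();
  for (const name of KEEP_PARAMS) {
    const value = u.searchParams.get(name);
    // Assumption: page=1 shows the same listing as the bare URL, so drop it too.
    if (value !== null && !(name === 'page' && value === '1')) {
      kept.set(name, value);
    }
  }
  const query = kept.toString();
  return u.origin + u.pathname + (query ? '?' + query : '');
}

console.log(canonicalUrlFor('/products?sort=price&page=2'));
// https://example.com/products?page=2
console.log(canonicalUrlFor('/products?page=1&sort=price-asc'));
// https://example.com/products
```

The result would go into the `href` of the `<link rel="canonical">` tag when the page is rendered.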

6. rel=nofollow on internal trap links

<!-- Filter form submit creates trap URLs: nofollow the form itself -->
<!-- (rel is valid on <form>, <a> and <area>; it has no effect on <button>) -->
<form action="/products" method="get" rel="nofollow">
  <select name="sort">...</select>
  <button type="submit">Apply</button>
</form>

<!-- Print/share variant links -->
<a href="/article/123/print" rel="nofollow">Print</a>
<a href="/article/123/share?via=twitter" rel="nofollow">Share</a>

This reduces discovery of trap URLs in the first place. Combined with Disallow, it is very effective at keeping traps out of the crawl graph.

7. Application-level fixes (best)

Fix calendar generation

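// Don't render links to years with no events
const eventsByYear = await getEventsByYear();
const yearsWithEvents = Object.keys(eventsByYear);

// Only render pagination links for actual data
yearsWithEvents.forEach(year => {
  // ...render link
});

// Don't serve impossible years: return 404 from the route handler
// (assumes an Express app and the /events/:year/:month URL scheme shown earlier)
app.get('/events/:year/:month', (req, res, next) => {
  const year = Number(req.params.year);
  const minYear = 2010;
  const maxYear = new Date().getFullYear() + 2;
  if (!Number.isInteger(year) || year < minYear || year > maxYear) {
    return res.status(404).end();
  }
  next();
});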

Remove session IDs from URLs

// Use cookies or headers, not URL params
// BAD:  ?sessionid=abc123
// GOOD: Cookie: session=abc123 in the HTTP request header

// Express example (express-session package)
const session = require('express-session');

app.use(session({
  secret: '...',
  resave: false,
  // Only set the cookie once something is stored in the session
  saveUninitialized: false,
  cookie: { httpOnly: true, secure: true },
}));

Normalise parameter ordering

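// Middleware to canonicalise query strings (sorts parameters by name)
function normaliseQueryString(req, res, next) {
  const url = new URL(req.url, 'https://example.com');
  const current = url.searchParams.toString();

  // URLSearchParams handles encoding, so values with spaces or '&' stay intact
  const sorted = new URLSearchParams(url.searchParams);
  sorted.sort();
  const canonical = sorted.toString();

  // Compare like with like (both serialised the same way) to avoid redirect loops
  if (canonical !== current) {
    return res.redirect(301, `${url.pathname}?${canonical}`);
  }
  next();
}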

8. Verify the fix

Step 1: Re-run the Robots Tester
Crawler-trap findings should be cleared, and sample URLs from each trap pattern should be confirmed as blocked by robots.txt.
Step 2: Search Console crawl stats
4-8 weeks after deployment, crawl requests per day should drop for trap patterns and concentrate on real content. Total HTML crawl requests may decrease while the indexed URL count holds steady, which is a good sign.
Step 3: site: query reduction
The result count for site:example.com inurl:?sort= should drop to zero over 2-3 months as Google deindexes the trap variants.
💡 The biggest crawl-budget wins come from one or two top trap patterns, not from blocking every edge case. Check your server logs, find which trap URLs Googlebot hits most, and block those first. The long tail of minor traps matters less.
