Ecommerce Crawl: Deep PDP, PLP & Cart Crawls at Scale
A generic site crawler misses what matters most on an ecommerce site — variant URLs, out-of-stock signals, faceted navigation duplication, missing or stale Product schema, abandoned product pages and broken cart flows. The Ecommerce Crawl is purpose-built for product catalogues: it crawls PDPs and PLPs at scale, surfaces structural issues that hurt ranking and conversion, and validates the schema-and-availability signals that Google Shopping needs. This guide covers the crawl strategy, the issue taxonomy and the fix workflow.
What Ecommerce Crawl checks
PDP Coverage: Every product is crawlable from the navigation in fewer than 3 clicks.
Product Schema: Valid JSON-LD with name, image, description, sku, brand, offers, availability, aggregateRating.
Variant Handling: Variants have unique URLs and canonical to the parent product, or use offer-level pricing on one URL.
Out-of-Stock Signals: OOS products signal correctly via availability schema; persistent OOS may need 301 to category.
Faceted Navigation: Facets that don't add ranking value are noindexed or canonicalised.
Internal Linking: Category-to-product linking; related-products linking; bestseller surfacing.
Image Optimisation: Product images: alt text, dimensions, srcset, format (WebP/AVIF), file size.
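Pagination: Each page in a paginated listing is reachable through crawlable links and self-canonicalised; rel=next/prev is optional, since Google no longer uses it as an indexing signal.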
How the crawl works
The crawler starts at your homepage, follows category links, then enumerates products on each category page, then crawls each PDP. It respects robots.txt, throttles per your site's response time, and supports JavaScript-rendered pages where products are loaded client-side. Crawl depth and concurrency are configurable.
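The homepage-to-category-to-product order described above is essentially a breadth-first traversal, which also yields each page's click depth. A minimal sketch, assuming the link graph has already been collected into a dictionary (the `links` data and the `click_depths` helper are hypothetical, and a real crawler would fetch pages, honour robots.txt and throttle requests):

```python
from collections import deque

def click_depths(links: dict[str, list[str]], start: str) -> dict[str, int]:
    """Breadth-first walk from `start`; returns the minimum click depth of each reachable URL."""
    depths = {start: 0}
    queue = deque([start])
    while queue:
        url = queue.popleft()
        for target in links.get(url, []):
            if target not in depths:  # in BFS the first visit is the shortest path
                depths[target] = depths[url] + 1
                queue.append(target)
    return depths

# Hypothetical link graph: homepage -> category -> product pages.
links = {
    "/": ["/shoes/"],
    "/shoes/": ["/shoes/runner-1", "/shoes/runner-2"],
    "/shoes/runner-1": ["/shoes/runner-2"],
}
depths = click_depths(links, "/")
```

Any catalogue URL that never appears in `depths` is unreachable from the navigation, and any URL with a depth above your chosen limit is a candidate for better internal linking.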
Common ecommerce issues
Faceted navigation explosion
Faceted nav multiplies URL counts quickly: 3 brands × 4 colours × 3 sizes × 2 sort orders = 72 URL variants per category, or 360 across just 5 categories, and every additional facet multiplies the total again. Most of these URLs are indexable. Google crawls them, finds near-duplicate content, and ranking signals are diluted across the duplicates. Fix: canonicalise non-value-adding facets to the base category, noindex sort/order parameters, allow only high-value facets (typically brand and category) to index.
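The canonicalisation rule can be sketched as a small URL-normalising function. This is an illustration only: the parameter names (`brand`, `colour`, `sort`) and the choice of `brand` as the single indexable facet are assumptions, not a statement of how any particular platform names its facets.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Assumption: only these facet parameters are considered worth indexing.
INDEXABLE_FACETS = {"brand"}

def canonical_facet_url(url: str) -> str:
    """Drop every query parameter that is not an indexable facet and sort the rest."""
    parts = urlsplit(url)
    kept = sorted((k, v) for k, v in parse_qsl(parts.query) if k in INDEXABLE_FACETS)
    # An empty query string produces a clean URL with no trailing "?".
    return urlunsplit((parts.scheme, parts.netloc, parts.path, urlencode(kept), ""))
```

The returned URL is what the faceted page's rel=canonical would point at under this policy; sorting the kept parameters means two URLs that differ only in parameter order resolve to the same canonical.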
Variant URL chaos
A product in 10 colours and 5 sizes can produce 50 URLs. If each is indexed independently, Google sees 50 near-duplicate pages competing for the same query. Pick one pattern: (a) one canonical URL per product with offer-level pricing for variants, OR (b) variant URLs that canonical to the parent. Pick once and apply site-wide.
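For pattern (b), a consistency check might look like the sketch below. It assumes the crawl has produced a mapping from each variant URL to the canonical URL declared on that page; the `variant_canonical_issues` helper and the example URLs are hypothetical.

```python
def variant_canonical_issues(pages: dict[str, str], parent_url: str) -> list[str]:
    """Return variant URLs whose declared canonical is not the parent product URL."""
    return sorted(
        url
        for url, canonical in pages.items()
        if url != parent_url and canonical != parent_url
    )

# Hypothetical crawl result: URL -> canonical declared on that page.
pages = {
    "/tee": "/tee",
    "/tee?colour=red": "/tee",
    "/tee?colour=blue": "/tee?colour=blue",  # self-canonical variant: breaks pattern (b)
}
issues = variant_canonical_issues(pages, "/tee")
```

An empty list means every variant defers to the parent; anything else is a page applying a different pattern from the rest of the site.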
Out-of-stock handling
Out-of-stock products lose rankings if they signal availability incorrectly. Use Product schema with availability:
{
  "@context": "https://schema.org",
  "@type": "Product",
  "offers": {
    "@type": "Offer",
    "availability": "https://schema.org/OutOfStock",
    "price": "29.99",
    "priceCurrency": "GBP"
  }
}
For seasonal OOS (back in stock soon), keep the page live. For permanent OOS, 301 redirect to a similar product or the parent category — never 404. Don't keep zombie OOS pages indefinitely.
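The keep-or-redirect rule above reduces to a two-input decision. A minimal sketch, assuming your inventory system can tell you whether a product is expected to return (the `will_restock` flag and the `OosAction` names are hypothetical):

```python
from enum import Enum
from typing import Optional

class OosAction(Enum):
    KEEP_LIVE = "keep the page live with OutOfStock availability"
    REDIRECT_301 = "301 redirect to a similar product or the parent category"

def oos_action(in_stock: bool, will_restock: bool) -> Optional[OosAction]:
    """Return the recommended handling for a product page, or None if it is in stock."""
    if in_stock:
        return None
    return OosAction.KEEP_LIVE if will_restock else OosAction.REDIRECT_301
```

In practice the hard part is the input rather than the rule: a product marked as returning that never does ends up as one of the zombie pages described above, so the flag is worth re-checking on each crawl.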
Missing or stale schema
Every PDP needs a complete Product schema block. Common omissions: missing sku, missing aggregateRating (Google uses this for rich results), missing brand, stale price, stale availability. Validate with Google's Rich Results Test on a representative PDP every quarter.
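A completeness check for the fields listed in this guide can be sketched in a few lines. It assumes the PDP exposes a single Product object as JSON-LD (not an @graph array), and it only tests for presence, not for stale prices or availability; the `missing_product_fields` helper is hypothetical.

```python
import json

# Field list taken from the checklist in this guide; "availability" lives inside "offers".
TOP_LEVEL_FIELDS = ("name", "image", "description", "sku", "brand", "offers", "aggregateRating")

def missing_product_fields(jsonld_text: str) -> list[str]:
    """Return the checklist fields that are absent or empty in a Product JSON-LD block."""
    data = json.loads(jsonld_text)
    missing = [field for field in TOP_LEVEL_FIELDS if not data.get(field)]
    offers = data.get("offers") or {}
    # "offers" may be a single Offer object or a list of them.
    offer_list = offers if isinstance(offers, list) else [offers]
    if not all(isinstance(o, dict) and o.get("availability") for o in offer_list):
        missing.append("offers.availability")
    return missing
```

Run against the out-of-stock snippet shown earlier, this reports name, image, description, sku, brand and aggregateRating as missing, which is a reminder that the snippet illustrates availability only and is not a complete block.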
Pagination handling
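Listing pages are typically paginated. Give each pagination page a self-referencing canonical and make it reachable through plain crawlable links; a "View all" page is an optional extra. rel=next / rel=prev in <head> is harmless, and other search engines may still read it, but Google has said it no longer uses it as an indexing signal, so don't rely on it alone. Don't canonicalise page 2+ to page 1: Google may de-index the deep products.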
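The "don't canonicalise page 2+ to page 1" rule is easy to check mechanically. The sketch below assumes pagination uses a `?page=N` query parameter and that the crawl has recorded each URL's declared canonical; both the parameter name and the helper names are assumptions.

```python
from urllib.parse import urlsplit, parse_qs

def page_number(url: str, param: str = "page") -> int:
    """Read the page number from the query string; a missing parameter means page 1."""
    values = parse_qs(urlsplit(url).query).get(param)
    return int(values[0]) if values else 1

def non_self_canonical_pages(canonicals: dict[str, str]) -> list[str]:
    """Return deep pagination URLs (page 2+) whose canonical points somewhere else."""
    return sorted(
        url
        for url, canonical in canonicals.items()
        if page_number(url) > 1 and canonical != url
    )

# Hypothetical crawl result: URL -> canonical declared on that page.
canonicals = {
    "/shoes": "/shoes",
    "/shoes?page=2": "/shoes?page=2",
    "/shoes?page=3": "/shoes",  # collapsed onto page 1
}
flagged = non_self_canonical_pages(canonicals)
```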
JavaScript rendering
Many modern ecommerce sites (headless Shopify or BigCommerce storefronts, custom React builds) render product details client-side. The crawler must support JavaScript rendering or it sees empty product pages. Confirm rendering by viewing a PDP with JavaScript disabled: if content is missing, your site relies on JS rendering and crawlers need to wait for the DOM to populate.
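A scripted version of the JavaScript-disabled check is to look for Product JSON-LD in the raw, unrendered HTML. The sketch below uses only the standard library and assumes you already have the response body as a string; the `has_product_schema` helper is hypothetical, and it only recognises a top-level object whose @type is exactly "Product".

```python
import json
from html.parser import HTMLParser

class JsonLdExtractor(HTMLParser):
    """Collects the text of every <script type="application/ld+json"> element."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_jsonld = True
            self.blocks.append("")

    def handle_data(self, data):
        if self._in_jsonld:
            self.blocks[-1] += data

    def handle_endtag(self, tag):
        if tag == "script":
            self._in_jsonld = False

def has_product_schema(raw_html: str) -> bool:
    """True if the unrendered HTML already contains a Product JSON-LD object."""
    parser = JsonLdExtractor()
    parser.feed(raw_html)
    parser.close()
    for block in parser.blocks:
        try:
            data = json.loads(block)
        except json.JSONDecodeError:
            continue  # malformed JSON-LD is a separate issue to report
        if isinstance(data, dict) and data.get("@type") == "Product":
            return True
    return False
```

If this returns False for the raw HTML but the schema appears after rendering in a browser, the page depends on client-side JavaScript for its structured data.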
⚠️ Server-side rendering (SSR) or static generation produces dramatically better crawl outcomes than client-side rendering. If you're on a stack that supports SSR/SSG (Next.js, Nuxt, Astro, Hydrogen), use it.
Performance budgets
Ecommerce sites with 100,000+ products need careful crawl budget management. Signals to Google:
- Fast TTFB and clean 200 responses (no 5xx during crawl waves)
- Clear sitemap with accurate lastmod values, kept current for top-revenue products (Google ignores the sitemap priority and changefreq fields)
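- Robots.txt blocking known-low-value URL patterns (e.g. faceted combinations); a blocked URL is never fetched, so Google cannot see a noindex or canonical on it, so use one mechanism per URL pattern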
- Consistent URL structure — no session IDs, no random query params
- Reasonable internal linking depth — every product within 4 clicks of the homepage
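Robots.txt rules from the list above can be tested before deployment with Python's standard-library parser. This is a sketch under assumptions: the `/filter/` and `/search/` paths are hypothetical, and `urllib.robotparser` does plain path-prefix matching without the `*` and `$` wildcard extensions that Google supports, so it suits path-based rules only.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules blocking two low-value URL prefixes.
ROBOTS_TXT = """\
User-agent: *
Disallow: /filter/
Disallow: /search/
"""

def blocked_urls(robots_txt: str, urls: list[str], agent: str = "Googlebot") -> list[str]:
    """Return the URLs that the given robots.txt text disallows for `agent`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [url for url in urls if not parser.can_fetch(agent, url)]
```

Feeding it a sample of category and product URLs is a cheap guard against a rule that accidentally blocks revenue pages.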
Output format
Per crawl run you get:
- Coverage report — % of products crawlable from navigation
- Schema validation report — % of PDPs with valid Product schema
- Variant analysis — duplicates, canonicals, variant URL strategy
- Out-of-stock inventory — products signalling OOS, recommend redirect or keep
- Faceted nav report — facets indexed vs noindexed; recommendations
- Pagination integrity — rel=next/prev coverage
- Image audit — alt text coverage, format, dimensions
- Internal linking health — orphan products, weakly-linked categories
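To make the first and last items concrete, coverage and orphan products both fall out of a set comparison between the catalogue and the URLs reached from navigation. This is an illustrative calculation, not the tool's actual report schema; the `coverage_report` helper and its keys are hypothetical.

```python
def coverage_report(catalogue: set[str], reachable: set[str]) -> dict:
    """Compare catalogue URLs with URLs reached by crawling from the navigation."""
    orphans = sorted(catalogue - reachable)
    covered = len(catalogue & reachable)
    pct = 100.0 * covered / len(catalogue) if catalogue else 0.0
    return {"coverage_pct": round(pct, 1), "orphans": orphans}

# Hypothetical data: four catalogue products, three reached by the crawl.
report = coverage_report(
    catalogue={"/p/1", "/p/2", "/p/3", "/p/4"},
    reachable={"/", "/shoes/", "/p/1", "/p/2", "/p/3"},
)
```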
Frequently Asked Questions
How is Ecommerce Crawl different from a generic site crawler?
A generic crawler treats every URL the same. Ecommerce Crawl knows about product entities — it recognises PDPs vs PLPs vs cart vs checkout, validates Product schema, detects variant URL patterns, identifies faceted-nav explosions and surfaces out-of-stock signal issues. Generic crawlers miss most of the structural problems that hurt ecommerce rankings.
Should I canonicalise variants to the parent product?
Depends on whether variants compete for the same query. If colour/size variants would never realistically rank for separate queries, canonicalise them to the parent and use offer-level pricing on one URL. If variants DO target distinct queries (e.g. "red running shoes" vs "blue running shoes"), give them separate URLs with their own schema. Pick once, apply site-wide. Mixed patterns confuse Google.
What's the right way to handle permanently out-of-stock products?
301 redirect to the closest available product or the parent category. Don't 404 (loses backlink value), don't keep the page with availability=OutOfStock indefinitely (becomes a quality drag site-wide). For seasonal OOS that will return, keep the page live with OutOfStock schema and an email-when-available form.
How often should I crawl a large ecommerce site?
Weekly for high-velocity sites (catalogue changing daily). Monthly for stable sites. After every major release. The crawl should be a routine pipeline step, not an annual project. Surface issues to the merchandising and dev teams within hours of crawl completion so problems are fixed before they hurt rankings.