Technical SEO for the Code-Curious: Crawl & Index Control in Code

This guide covers the files, tags and rules that control how Google crawls and indexes a site, written for the code-curious. We will write a working robots.txt, get canonical signals to agree, build an XML sitemap, use noindex correctly (and know why it differs from Disallow), ship clean redirects, and grep server logs to see what Googlebot actually does. For crawl efficiency at scale and architecture, see our technical crawl and indexing and technical SEO architecture guides.

robots.txt: control crawling, point to the sitemap

The robots.txt file lives at your domain root and controls what crawlers fetch. Block the low-value URL space (parameters, internal search, admin) and reference your sitemap:

# /robots.txt
User-agent: *
Disallow: /search
Disallow: /*?sort=
Disallow: /*?filter=
Disallow: /cart
Disallow: /admin/
Allow: /

Sitemap: https://example.com/sitemap.xml

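Patterns use * as a wildcard and match from the start of the path. Disallow: /*?sort= blocks any URL whose query string begins with sort=; it does not match /shoes?page=2&sort=price, where sort comes later, so add Disallow: /*&sort= if parameters can appear in any order. Do not block CSS or JS that Google needs to render the page. And never list a URL in your sitemap that you also Disallow: that is a contradictory signal.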
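To see how the wildcard patterns above behave, here is a minimal Python sketch of Google-style pattern matching. It is an illustration, not a full robots.txt parser: the function names are made up for this example, and it ignores Allow rules, rule precedence and per-crawler groups.

```python
import re

def robots_pattern_to_regex(pattern: str) -> re.Pattern:
    """Translate a robots.txt path pattern into an anchored regex.

    '*' matches any run of characters; a trailing '$' pins the end of the URL.
    Everything else is literal, and matching starts at the beginning of the path.
    """
    anchored_end = pattern.endswith("$")
    if anchored_end:
        pattern = pattern[:-1]
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.compile("^" + body + ("$" if anchored_end else ""))

def is_disallowed(path_and_query: str, disallow_patterns: list[str]) -> bool:
    """True if any Disallow pattern matches the path plus query string."""
    return any(
        robots_pattern_to_regex(p).match(path_and_query) is not None
        for p in disallow_patterns
    )

# The Disallow lines from the robots.txt example in this section.
rules = ["/search", "/*?sort=", "/*?filter=", "/cart", "/admin/"]

print(is_disallowed("/shoes?sort=price", rules))         # True
print(is_disallowed("/shoes?page=2&sort=price", rules))  # False: sort is not the first parameter
print(is_disallowed("/services", rules))                 # False
```

Running your own URL samples through a check like this before deploying a rule is a cheap way to catch a pattern that blocks too much or too little.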

robots Disallow vs noindex — they are not the same

This trips people up constantly. Disallow in robots.txt stops Google crawling a URL; noindex stops it being indexed. The catch: if you Disallow a page, Google cannot crawl it to see a noindex, so a blocked URL can still appear in results when external links point to it. To remove a page from the index, allow crawling and add noindex:

<!-- in <head> -->
<meta name="robots" content="noindex, follow">

Or, for non-HTML files (PDFs) or at scale, send it as a header:

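# nginx: scope the header to PDFs; set globally it would noindex every response
location ~* \.pdf$ {
    add_header X-Robots-Tag "noindex, follow";
}

# Apache (mod_headers): same scoping
<FilesMatch "\.pdf$">
    Header set X-Robots-Tag "noindex, follow"
</FilesMatch>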

Rule of thumb: Disallow to save crawl budget on junk you never want fetched; noindex (crawl allowed) to pull a real page out of the index.
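The interaction between the two directives can be expressed as a small decision function. The Python sketch below is illustrative only: `index_status` and its return strings are invented for this example, and it ignores crawler-specific forms such as `X-Robots-Tag: googlebot: noindex`.

```python
from html.parser import HTMLParser

class _RobotsMetaParser(HTMLParser):
    """Collects the content of every <meta name="robots"> tag."""

    def __init__(self):
        super().__init__()
        self.directives = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        if (attr.get("name") or "").lower() == "robots":
            self.directives.append(attr.get("content") or "")

def _tokens(value: str) -> set[str]:
    """Split 'noindex, follow' into {'noindex', 'follow'}."""
    return {t.strip().lower() for t in value.split(",") if t.strip()}

def index_status(blocked_by_robots: bool, headers: dict[str, str], html: str) -> str:
    """Classify a URL from the crawl and index signals it sends."""
    parser = _RobotsMetaParser()
    parser.feed(html)
    header_value = next(
        (v for k, v in headers.items() if k.lower() == "x-robots-tag"), ""
    )
    noindex = "noindex" in _tokens(header_value) or any(
        "noindex" in _tokens(d) for d in parser.directives
    )
    if blocked_by_robots and noindex:
        return "conflict: noindex present but unreadable because crawling is blocked"
    if blocked_by_robots:
        return "not crawled: may still be indexed from external links"
    if noindex:
        return "crawled and kept out of the index"
    return "crawled and indexable"

page = '<html><head><meta name="robots" content="noindex, follow"></head></html>'
print(index_status(False, {}, page))  # crawled and kept out of the index
print(index_status(True, {}, page))   # conflict: noindex present but unreadable ...
```

The "conflict" branch is the case worth alerting on: the page carries a noindex that Google will never get to read.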

Canonical: one URL, every signal agreeing

The canonical tag tells Google which URL is the master version. Set it on every page, self-referencing where the page is canonical:

<link rel="canonical" href="https://example.com/page">

What matters most is consistency: the canonical tag, the URL in your sitemap, and your internal links must all name the same version (trailing slash or not, www or not, parameters stripped). When they disagree, Google guesses — and at scale that produces duplicate and wrong-page-indexed problems. Decide the canonical form, enforce it in routing, and make every signal match.
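One way to enforce that consistency is to normalise every URL through a single function and flag any signal that differs from the result. The Python sketch below assumes one particular house style (https, no www, no trailing slash, no query string); that choice and the function names are assumptions for illustration, so substitute whichever canonical form you have decided on.

```python
from urllib.parse import urlsplit, urlunsplit

def canonical_form(url: str) -> str:
    """Normalise a URL to the assumed house style: https, no www,
    no trailing slash (except the root), no query string or fragment."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    if host.startswith("www."):
        host = host[4:]
    path = parts.path.rstrip("/") or "/"
    return urlunsplit(("https", host, path, "", ""))

def mismatched_signals(canonical_tag: str, sitemap_url: str,
                       internal_links: list[str]) -> list[str]:
    """Return every signal that is not written exactly in canonical form."""
    expected = canonical_form(canonical_tag)
    candidates = {"canonical tag": canonical_tag, "sitemap": sitemap_url}
    candidates.update(
        {f"internal link {i}": u for i, u in enumerate(internal_links)}
    )
    return [f"{name}: {url}" for name, url in candidates.items() if url != expected]

print(canonical_form("http://www.Example.com/page/?sort=price"))
# https://example.com/page

print(mismatched_signals(
    "https://example.com/page",
    "https://www.example.com/page/",
    ["https://example.com/page?ref=nav"],
))
# ['sitemap: https://www.example.com/page/', 'internal link 0: https://example.com/page?ref=nav']
```

An empty list means the tag, the sitemap entry and the links all name the same URL.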

XML sitemap: only canonical, indexable URLs

The sitemap is a discovery signal listing the pages you want indexed — nothing else. No redirects, no noindex pages, no blocked URLs, no parameters:

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://example.com/</loc>
    <lastmod>2026-06-01</lastmod>
  </url>
  <url>
    <loc>https://example.com/services</loc>
    <lastmod>2026-05-20</lastmod>
  </url>
</urlset>

Generate it from your routes so it stays accurate, keep lastmod honest (it is a recrawl hint), and split into multiple sitemaps with an index file before any single file reaches the protocol limit of 50,000 URLs or 50 MB uncompressed.
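Generating the file from route data might look like the following Python sketch. The input shape, a list of (loc, lastmod) pairs that have already been filtered down to canonical, indexable URLs, is an assumption for illustration, as are the function names. The 50,000 figure is the per-file URL limit in the sitemap protocol.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
XML_DECLARATION = '<?xml version="1.0" encoding="UTF-8"?>\n'
MAX_URLS_PER_FILE = 50_000  # per-file URL limit in the sitemap protocol

def build_sitemap(entries: list[tuple[str, str]]) -> str:
    """Render (loc, lastmod) pairs as one <urlset> document."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for loc, lastmod in entries:
        url = ET.SubElement(urlset, "url")
        ET.SubElement(url, "loc").text = loc
        ET.SubElement(url, "lastmod").text = lastmod
    return XML_DECLARATION + ET.tostring(urlset, encoding="unicode")

def chunk(entries: list, size: int = MAX_URLS_PER_FILE) -> list[list]:
    """Split entries into groups small enough for one sitemap file each."""
    return [entries[i:i + size] for i in range(0, len(entries), size)]

def build_index(sitemap_urls: list[str]) -> str:
    """Render a <sitemapindex> that points at each child sitemap."""
    index = ET.Element("sitemapindex", xmlns=SITEMAP_NS)
    for loc in sitemap_urls:
        ET.SubElement(ET.SubElement(index, "sitemap"), "loc").text = loc
    return XML_DECLARATION + ET.tostring(index, encoding="unicode")

routes = [
    ("https://example.com/", "2026-06-01"),
    ("https://example.com/services", "2026-05-20"),
]
print(build_sitemap(routes))
```

Take lastmod from the content's real modification date rather than the build time, otherwise every deploy marks every page as changed.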

Redirects: permanent, and no chains

Use a 301 for anything permanently moved so equity passes, and redirect straight to the final URL — never A→B→C, which wastes crawl and leaks signal:

# nginx — single hop, permanent
location = /old-page { return 301 https://example.com/new-page; }

# Apache .htaccess
Redirect 301 /old-page https://example.com/new-page

On a migration, map every old URL directly to its destination, leaving no chains and no orphaned equity. Audit for chains and loops with a site crawl or audit tool before and after launch.
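If the migration map lives in a spreadsheet or dict, chains can be collapsed before any rule is written. The Python sketch below is one way to do that, assuming the map is a simple source-to-target dictionary of paths; `flatten_redirects` is a hypothetical helper, not part of any tool.

```python
def flatten_redirects(redirects: dict[str, str]) -> dict[str, str]:
    """Map every source straight to its final destination.

    Raises ValueError if following the map ever revisits a URL (a loop).
    """
    flattened = {}
    for source in redirects:
        seen = {source}
        target = redirects[source]
        while target in redirects:  # the target is itself redirected: a chain
            if target in seen:
                raise ValueError(f"redirect loop involving {target}")
            seen.add(target)
            target = redirects[target]
        flattened[source] = target
    return flattened

migration_map = {
    "/old-page": "/interim-page",   # A -> B
    "/interim-page": "/new-page",   # B -> C
}
print(flatten_redirects(migration_map))
# {'/old-page': '/new-page', '/interim-page': '/new-page'}
```

Each entry in the result can then be emitted as a single-hop 301 rule in whichever server syntax you use.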

Read the logs: what Googlebot actually crawls

Server logs are the only honest record of crawl behaviour. A grep shows which URLs Googlebot requests and how often — often revealing budget burned on junk:

# top URLs Googlebot hit, most-crawled first
# ($7 is the request path in the common/combined access log format)
grep -i "googlebot" access.log \
  | awk '{print $7}' \
  | sort | uniq -c | sort -rn \
  | head -20

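If parameter or filter URLs dominate that list while your money pages are rarely hit, that is your crawl-budget problem in black and white, and it tells you exactly what to Disallow or canonicalise. Where it matters, verify that a hit is really Googlebot: a reverse DNS lookup on the IP should return a hostname under googlebot.com or google.com, and a forward lookup of that hostname should return the same IP. User-agent spoofing is common.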
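The same aggregation can be done in Python when you also want a single number for how much of the crawl goes to parameter URLs. This sketch assumes the combined access log format; the sample lines, IP addresses and function names are invented for illustration, and it matches on user-agent only, so it does not verify that the requests really come from Google.

```python
import re
from collections import Counter

# Combined log format:
# ip - - [time] "METHOD path PROTO" status bytes "referer" "user-agent"
LOG_LINE = re.compile(
    r'^\S+ \S+ \S+ \[[^\]]+\] "\S+ (?P<path>\S+) [^"]*" \d{3} \S+ '
    r'"[^"]*" "(?P<agent>[^"]*)"'
)

def googlebot_hits(lines) -> Counter:
    """Count requests per URL for lines whose user-agent mentions Googlebot."""
    counts = Counter()
    for line in lines:
        m = LOG_LINE.match(line)
        if m and "googlebot" in m.group("agent").lower():
            counts[m.group("path")] += 1
    return counts

def parameter_share(counts: Counter) -> float:
    """Fraction of counted requests that went to URLs with a query string."""
    total = sum(counts.values())
    with_query = sum(n for path, n in counts.items() if "?" in path)
    return with_query / total if total else 0.0

GOOGLEBOT_UA = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
SAMPLE = [
    f'192.0.2.10 - - [01/Jun/2026:10:00:00 +0000] "GET /shoes?sort=price HTTP/1.1" 200 5120 "-" "{GOOGLEBOT_UA}"',
    f'192.0.2.10 - - [01/Jun/2026:10:00:05 +0000] "GET /shoes?sort=price HTTP/1.1" 200 5120 "-" "{GOOGLEBOT_UA}"',
    f'192.0.2.10 - - [01/Jun/2026:10:00:09 +0000] "GET /services HTTP/1.1" 200 8300 "-" "{GOOGLEBOT_UA}"',
    '192.0.2.55 - - [01/Jun/2026:10:00:12 +0000] "GET /services HTTP/1.1" 200 8300 "-" "Mozilla/5.0 (X11; Linux x86_64)"',
]

hits = googlebot_hits(SAMPLE)
print(hits.most_common(2))             # [('/shoes?sort=price', 2), ('/services', 1)]
print(f"{parameter_share(hits):.0%}")  # 67%
```

In practice, pass an open file handle instead of SAMPLE, for example `googlebot_hits(open("access.log"))`, and track the parameter share over time to confirm that a robots.txt or canonical change had the intended effect.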

A worked example

A site has thousands of pages indexed that should not be, and new content crawled slowly. The log grep shows Googlebot spending most requests on ?sort= and ?filter= URLs. The coder adds Disallow patterns for those parameters, sets self-referencing canonicals on the real pages, strips the parameter URLs from the sitemap (leaving only canonical, indexable URLs), and noindexes the thin pages that had leaked in (crawl allowed so Google can see the directive). Re-running the grep a couple of weeks later, crawl has shifted onto real pages and the bloat is falling. Every change was a config edit, verified against the logs.

Common mistakes to avoid

- Confusing Disallow with noindex: blocking a page so Google never sees its noindex.
- Blocking CSS or JS that Google needs to render the page.
- Conflicting canonical, sitemap and internal-link signals.
- Listing redirected, blocked or noindexed URLs in the sitemap.
- Redirect chains instead of single permanent hops.
- Managing crawl budget by guesswork instead of reading the logs.

Frequently asked questions

What is the difference between Disallow and noindex?

Disallow in robots.txt stops Google crawling a URL; noindex stops it being indexed. If you Disallow a page, Google cannot crawl it to see a noindex, so it can still appear from external links. To deindex a real page, allow crawling and add noindex.

How do I set canonical URLs correctly?

Add a self-referencing rel="canonical" on each canonical page, and make the canonical tag, sitemap URL and internal links all name the same version (slash, www, parameters). Consistency is what prevents duplicate and wrong-page indexing.

What should go in my XML sitemap?

Only canonical, indexable URLs you want crawled, never redirected, blocked, noindexed or parameter URLs. Generate it from your routes, keep lastmod honest, and use a sitemap index once a single file would exceed 50,000 URLs or 50 MB uncompressed.

How do I redirect URLs without losing SEO value?

Use a 301 (permanent) so equity passes, and redirect directly to the final destination with no chains. On migrations, map every old URL straight to its new one.

How do I see what Googlebot crawls?

Grep your server access logs for the Googlebot user-agent and aggregate by URL to see which pages it requests and how often. It reveals crawl budget wasted on parameters or junk that reports do not show.

Can I block a page and noindex it at the same time?

No — that is self-defeating. If robots.txt blocks the URL, Google cannot crawl it to read the noindex. Choose one: Disallow to prevent crawling junk, or allow crawling plus noindex to deindex a real page.