This guide covers the files, tags and rules that control how Google crawls and indexes a site, written for the code-curious. We will write a working robots.txt, get canonical signals to agree, build an XML sitemap, use noindex correctly (and see why it differs from Disallow), ship clean redirects, and grep server logs to see what Googlebot actually does. For crawl efficiency at scale and architecture, see our technical crawl and indexing and technical SEO architecture guides.
The robots.txt file lives at your domain root and controls what crawlers fetch. Block the low-value URL space (parameters, internal search, admin) and reference your sitemap:
# /robots.txt
User-agent: *
Disallow: /search
Disallow: /*?sort=
Disallow: /*?filter=
Disallow: /cart
Disallow: /admin/
Allow: /
Sitemap: https://example.com/sitemap.xml
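Patterns use * as a wildcard, and every rule matches from the start of the path. Disallow: /*?sort= blocks any URL whose query string begins with sort=; if sort can appear after another parameter, add Disallow: /*&sort= as well. Do not block CSS or JS that Google needs to render the page. And never list a URL in your sitemap that you also Disallow, because that is a contradictory signal.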
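To see how the wildcard patterns behave before deploying them, a small matcher helps. The sketch below approximates the documented matching rules (* matches any run of characters, a trailing $ anchors the end, the longest matching rule wins, and Allow wins a tie). It is an illustration, not Google's parser, and the function names are made up for this example.

```python
import re

def pattern_to_regex(pattern: str) -> re.Pattern:
    """Translate a robots.txt path pattern into a regex.
    '*' matches any run of characters; a trailing '$' anchors the end."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.compile("^" + body + ("$" if anchored else ""))

def is_blocked(path: str, disallow: list, allow: list) -> bool:
    """Longest matching rule wins; Allow is checked first so it wins ties."""
    best_len, blocked = -1, False
    for rules, verdict in ((allow, False), (disallow, True)):
        for rule in rules:
            if rule and pattern_to_regex(rule).match(path):
                if len(rule) > best_len:
                    best_len, blocked = len(rule), verdict
    return blocked

# The rules from the robots.txt above
DISALLOW = ["/search", "/*?sort=", "/*?filter=", "/cart", "/admin/"]
ALLOW = ["/"]

for path in ["/shoes?sort=price", "/shoes?page=2&sort=price", "/services"]:
    print(path, "blocked" if is_blocked(path, DISALLOW, ALLOW) else "allowed")
```

With these rules, /shoes?page=2&sort=price is still allowed, because the pattern requires ?sort= to appear literally. Running candidate URLs through a check like this catches that kind of gap before it reaches production.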
The difference between Disallow and noindex trips people up constantly. Disallow in robots.txt stops Google crawling a URL; noindex stops it being indexed. The catch: if you Disallow a page, Google cannot crawl it to see a noindex, so a blocked URL can still be indexed and shown in results on the strength of external links alone. To remove a page from the index, allow crawling and add noindex:
<!-- in <head> -->
<meta name="robots" content="noindex, follow">
Or, for non-HTML files (PDFs) or at scale, send it as a header:
# nginx
add_header X-Robots-Tag "noindex, follow";
# Apache
Header set X-Robots-Tag "noindex, follow"
Rule of thumb: Disallow to save crawl budget on junk you never want fetched; noindex (crawl allowed) to pull a real page out of the index.
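The interaction between the two directives can be checked mechanically. The sketch below reads a noindex from either the X-Robots-Tag header or the robots meta tag, then states what the combination with robots.txt means. The helper names are hypothetical, and the header parsing is simplified: it ignores user-agent-scoped values such as "googlebot: noindex".

```python
from html.parser import HTMLParser

class RobotsMetaParser(HTMLParser):
    """Collects the content of every <meta name="robots"> tag."""
    def __init__(self):
        super().__init__()
        self.directives = []

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        if tag == "meta" and (attr.get("name") or "").lower() == "robots":
            self.directives.append((attr.get("content") or "").lower())

def has_noindex(headers: dict, html: str) -> bool:
    """True if the X-Robots-Tag header or a robots meta tag contains noindex."""
    values = [v.lower() for k, v in headers.items() if k.lower() == "x-robots-tag"]
    parser = RobotsMetaParser()
    parser.feed(html)
    values += parser.directives
    tokens = [token.strip() for value in values for token in value.split(",")]
    return "noindex" in tokens

def indexing_outcome(disallowed: bool, noindex: bool) -> str:
    """Describe what a crawler that obeys robots.txt can do with the page."""
    if disallowed and noindex:
        return "noindex is never read; the URL can still be indexed from links"
    if disallowed:
        return "not crawled; the URL can still be indexed from links"
    if noindex:
        return "crawled, then dropped from the index"
    return "crawled and eligible for indexing"

page = '<html><head><meta name="robots" content="noindex, follow"></head></html>'
print(indexing_outcome(disallowed=True, noindex=has_noindex({}, page)))
```

The first branch is the trap described above: the page carries a noindex, but a crawler that respects the Disallow never fetches the page to read it.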
The canonical tag tells Google which URL you consider the master version; Google treats it as a strong hint rather than a command. Set it on every page, self-referencing where the page is canonical:
<link rel="canonical" href="https://example.com/page">
What matters most is consistency: the canonical tag, the URL in your sitemap, and your internal links must all name the same version (trailing slash or not, www or not, parameters stripped). When they disagree, Google guesses — and at scale that produces duplicate and wrong-page-indexed problems. Decide the canonical form, enforce it in routing, and make every signal match.
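A consistency check is straightforward to script. The sketch below assumes one particular house rule (https, no www, no trailing slash, no query string), which is only an example of a canonical form, and flags any signal that deviates from it. The function names are hypothetical.

```python
from urllib.parse import urlsplit, urlunsplit

def canonical_form(url: str) -> str:
    """Apply the assumed house rule: https, no www, no query or fragment,
    no trailing slash (the root keeps its single '/')."""
    parts = urlsplit(url)
    host = parts.netloc.lower()
    if host.startswith("www."):
        host = host[4:]
    path = parts.path.rstrip("/") or "/"
    return urlunsplit(("https", host, path, "", ""))

def disagreeing_signals(canonical_tag: str, sitemap_url: str, internal_links: list) -> list:
    """Return (source, url) pairs that do not exactly match the canonical form."""
    expected = canonical_form(canonical_tag)
    signals = [("canonical", canonical_tag), ("sitemap", sitemap_url)]
    signals += [("link", url) for url in internal_links]
    return [(source, url) for source, url in signals if url != expected]

problems = disagreeing_signals(
    "https://example.com/page",
    "https://example.com/page/",
    ["https://example.com/page", "https://www.example.com/page"],
)
for source, url in problems:
    print(source, "disagrees:", url)
```

Here the sitemap entry (trailing slash) and one internal link (www) are reported. In practice the inputs would come from a crawl of your own templates, and the normalisation rule should match whatever your routing enforces.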
The sitemap is a discovery signal listing the pages you want indexed — nothing else. No redirects, no noindex pages, no blocked URLs, no parameters:
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
<url>
<loc>https://example.com/</loc>
<lastmod>2026-06-01</lastmod>
</url>
<url>
<loc>https://example.com/services</loc>
<lastmod>2026-05-20</lastmod>
</url>
</urlset>
Generate it from your routes so it stays accurate, keep lastmod honest (it is a recrawl hint), and split into multiple sitemaps with an index file before any single file exceeds the protocol limit of 50,000 URLs or 50 MB uncompressed.
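Generating the file from route data might look like the sketch below. It assumes the pages arrive as (url, lastmod) pairs that are already canonical and indexable, and it skips anything with a query string as a safety net. The function names and the 50,000 default chunk size are choices made for this example.

```python
import xml.etree.ElementTree as ET

NS = "http://www.sitemaps.org/schemas/sitemap/0.9"

def build_sitemap(pages) -> bytes:
    """pages: iterable of (url, lastmod) pairs, lastmod as 'YYYY-MM-DD'."""
    urlset = ET.Element("urlset", xmlns=NS)
    for loc, lastmod in pages:
        if "?" in loc:  # parameter URLs do not belong in a sitemap
            continue
        url = ET.SubElement(urlset, "url")
        ET.SubElement(url, "loc").text = loc
        ET.SubElement(url, "lastmod").text = lastmod
    return ET.tostring(urlset, encoding="utf-8", xml_declaration=True)

def split_into_files(pages, size: int = 50000) -> list:
    """Chunk the page list so no single sitemap file exceeds `size` URLs."""
    pages = list(pages)
    return [pages[i:i + size] for i in range(0, len(pages), size)]

def build_index(sitemap_urls) -> bytes:
    """Sitemap index file that points at each chunked sitemap."""
    index = ET.Element("sitemapindex", xmlns=NS)
    for loc in sitemap_urls:
        ET.SubElement(ET.SubElement(index, "sitemap"), "loc").text = loc
    return ET.tostring(index, encoding="utf-8", xml_declaration=True)

pages = [
    ("https://example.com/", "2026-06-01"),
    ("https://example.com/services", "2026-05-20"),
]
print(build_sitemap(pages).decode("utf-8"))
```

The lastmod values should come from the real modification date of each page, not from the build time, otherwise the recrawl hint stops being trustworthy.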
Use a 301 for anything permanently moved so equity passes, and redirect straight to the final URL — never A→B→C, which wastes crawl and leaks signal:
# nginx — single hop, permanent
location = /old-page { return 301 https://example.com/new-page; }
# Apache .htaccess
Redirect 301 /old-page https://example.com/new-page
On a migration, map every old URL directly to its destination, leaving no chains and no orphaned equity. Audit for chains and loops with a Site Audit.
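For a migration map, chains can be removed before the rules are ever deployed. The sketch below takes a redirect map as a plain dict (old path to new path, an assumed input format), points every source at its final destination, and raises on loops. The nginx output mirrors the single-hop rule shown earlier; example.com stands in for the real origin.

```python
def flatten_redirects(redirects: dict) -> dict:
    """Map every source straight to its final destination; raise on loops."""
    flat = {}
    for source in redirects:
        seen = {source}
        target = redirects[source]
        while target in redirects:  # the target itself redirects: follow it
            if target in seen:
                raise ValueError(f"redirect loop involving {target}")
            seen.add(target)
            target = redirects[target]
        flat[source] = target
    return flat

def to_nginx(flat: dict, origin: str = "https://example.com") -> list:
    """Render one exact-match, single-hop 301 rule per source path."""
    return [
        f"location = {src} {{ return 301 {origin}{dst}; }}"
        for src, dst in flat.items()
    ]

chained = {"/a": "/b", "/b": "/c"}
for rule in to_nginx(flatten_redirects(chained)):
    print(rule)
```

After flattening, both /a and /b redirect directly to /c, so no request takes more than one hop.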
Server logs are the only honest record of crawl behaviour. A grep shows which URLs Googlebot requests and how often — often revealing budget burned on junk:
# top URLs Googlebot hit, most-crawled first (assumes the combined log format, where field 7 is the request path)
grep -i "googlebot" access.log \
| awk '{print $7}' \
| sort | uniq -c | sort -rn \
| head -20
If parameter or filter URLs dominate that list while your money pages are rarely hit, that is your crawl-budget problem in black and white — and it tells you exactly what to Disallow or canonicalise. (Verify it is really Googlebot by reverse-DNS if it matters; spoofing is common.)
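The same aggregation can be done in a script when you also want a number for how much crawl goes to parameter URLs. The sketch below assumes the combined log format and matches Googlebot by user-agent string only, so it inherits the spoofing caveat above. The regex and function names are this example's own, and the sample log lines are invented for illustration.

```python
import re
from collections import Counter

# Combined log format: ip - user [time] "METHOD path HTTP/x" status bytes "referer" "user-agent"
LINE = re.compile(
    r'"[A-Z]+ (?P<path>\S+) [^"]*" \d{3} \S+ "[^"]*" "(?P<ua>[^"]*)"'
)

def googlebot_hits(lines) -> Counter:
    """Count requests per path for lines whose user-agent mentions Googlebot."""
    hits = Counter()
    for line in lines:
        match = LINE.search(line)
        if match and "googlebot" in match.group("ua").lower():
            hits[match.group("path")] += 1
    return hits

def parameter_share(hits: Counter) -> float:
    """Fraction of Googlebot requests that went to URLs with a query string."""
    total = sum(hits.values())
    with_params = sum(n for path, n in hits.items() if "?" in path)
    return with_params / total if total else 0.0

# Invented sample lines, for illustration only
GOOGLEBOT_UA = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
sample = [
    f'66.249.66.1 - - [01/Jun/2026:10:00:00 +0000] "GET /shoes?sort=price HTTP/1.1" 200 512 "-" "{GOOGLEBOT_UA}"',
    f'66.249.66.1 - - [01/Jun/2026:10:00:05 +0000] "GET /services HTTP/1.1" 200 2048 "-" "{GOOGLEBOT_UA}"',
    '203.0.113.9 - - [01/Jun/2026:10:00:09 +0000] "GET /services HTTP/1.1" 200 2048 "-" "Mozilla/5.0"',
]
hits = googlebot_hits(sample)
print(hits.most_common(20), parameter_share(hits))
```

On a real log, pass the open file object instead of the sample list. Tracking the parameter share before and after a robots.txt change gives a single figure to compare.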
A site has thousands of pages indexed that should not be, and new content is crawled slowly. The log grep shows Googlebot spending most of its requests on ?sort= and ?filter= URLs. The coder adds Disallow patterns for those parameters, sets self-referencing canonicals on the real pages, strips the parameter URLs from the sitemap (leaving only canonical, indexable URLs), and noindexes the thin pages that had leaked in (with crawling allowed so Google can see the directive). When the grep is re-run a couple of weeks later, crawl has shifted onto real pages and the bloat is falling. Every change was a config edit, verified against the logs.
The common mistakes: confusing Disallow with noindex, so that a blocked page never gets its noindex read; blocking CSS or JS that Google needs to render; conflicting canonical, sitemap and internal-link signals; listing redirected, blocked or noindexed URLs in the sitemap; using redirect chains instead of single permanent hops; and managing crawl budget by guesswork instead of reading the logs.
Disallow in robots.txt stops Google crawling a URL; noindex stops it being indexed. If you Disallow a page, Google cannot crawl it to see a noindex, so it can still appear from external links. To deindex a real page, allow crawling and add noindex.
Add a self-referencing rel="canonical" on each canonical page, and make the canonical tag, sitemap URL and internal links all name the same version (slash, www, parameters). Consistency is what prevents duplicate and wrong-page indexing.
A sitemap should contain only the canonical, indexable URLs you want crawled, never redirected, blocked, noindexed or parameter URLs. Generate it from your routes, keep lastmod honest, and use a sitemap index before any single file exceeds 50,000 URLs.
Use a 301 (permanent) so equity passes, and redirect directly to the final destination with no chains. On migrations, map every old URL straight to its new one.
Grep your server access logs for the Googlebot user-agent and aggregate by URL to see which pages it requests and how often. It reveals crawl budget wasted on parameters or junk that reports do not show.
Do not combine Disallow and noindex on the same URL; that is self-defeating. If robots.txt blocks the URL, Google cannot crawl it to read the noindex. Choose one: Disallow to prevent crawling junk, or allow crawling plus noindex to deindex a real page.