Learning Hub — Beginner’s Guide
⭐ Beginner — No coding experience needed

1 What is robots.txt?

robots.txt is a plain text file at the root of your domain (yoursite.com/robots.txt) that tells crawlers which URLs they may and may not fetch. It is the first file most search engines request when they visit your site.

    User-agent: *
    Disallow: /admin/
    Disallow: /cart
    Allow: /
    Sitemap: https://yoursite.com/sitemap.xml

Important: robots.txt prevents crawling, not indexing. If another site links to a blocked URL, Google can still index that URL without seeing its content. To keep a page out of search results, leave it crawlable and add a noindex robots meta tag or an X-Robots-Tag: noindex HTTP header.
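
The example rules above can be checked locally before they go live. The following is a minimal sketch using Python's standard urllib.robotparser; the helper name is_allowed and the test paths are illustrative assumptions, not part of any tool mentioned in this guide. Note that Python's parser applies the first matching rule in file order, while Google applies the most specific (longest) match; for this particular file both approaches give the same answers.

```python
from urllib.robotparser import RobotFileParser

# The sample rules from this guide, one directive per line.
ROBOTS_TXT = """\
User-agent: *
Disallow: /admin/
Disallow: /cart
Allow: /
Sitemap: https://yoursite.com/sitemap.xml
"""


def is_allowed(path, user_agent="*"):
    """Return True if the rules above let user_agent fetch the given path."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, "https://yoursite.com" + path)


if __name__ == "__main__":
    # Hypothetical paths chosen to exercise each rule.
    for path in ("/", "/blog/hello", "/admin/settings", "/cart", "/cartoons"):
        print(path, "allowed" if is_allowed(path) else "blocked")
```

Rules are prefix matches, so Disallow: /cart also blocks /cartoons. Writing the rule as Disallow: /cart/ limits it to the cart directory, though /cart itself (no trailing slash) would then be crawlable.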

2 What is sitemap.xml?

A sitemap.xml is a list of every URL on your site you want indexed, in a format Google understands. It helps Google find pages that internal links might miss.

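| Element | Required? | Purpose |
|---|---|---|
| <loc> | Yes | The full URL |
| <lastmod> | No | When the page was last meaningfully changed |
| <changefreq> | No | How often the page updates (Google ignores this now) |
| <priority> | No | Relative importance 0.0-1.0 (Google ignores this) |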

Most modern sites generate sitemaps automatically. WordPress with Yoast or Rank Math creates them at /sitemap_index.xml. Custom sites can use packages like sitemap-generator or build them in the CMS.
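
For a custom site without a plugin, a sitemap can also be produced with a few lines of script. The sketch below uses only Python's standard library and writes the two elements Google actually reads, <loc> and <lastmod>. The function name build_sitemap and the page list are assumptions for illustration; a real site would pull its URLs from the CMS or database.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(pages):
    """Build sitemap XML from (url, lastmod) pairs; lastmod may be None."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for loc, lastmod in pages:
        url = ET.SubElement(urlset, "url")
        ET.SubElement(url, "loc").text = loc
        if lastmod:
            # W3C date format, e.g. 2024-05-01
            ET.SubElement(url, "lastmod").text = lastmod
    body = ET.tostring(urlset, encoding="unicode")
    return '<?xml version="1.0" encoding="UTF-8"?>\n' + body


if __name__ == "__main__":
    # Hypothetical pages; replace with your own URLs.
    xml_text = build_sitemap([
        ("https://yoursite.com/", "2024-05-01"),
        ("https://yoursite.com/about", None),
    ])
    with open("sitemap.xml", "w", encoding="utf-8") as handle:
        handle.write(xml_text)
```

ElementTree escapes special characters such as & in URLs, which hand-written sitemaps often get wrong.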

3 How to set both up

  1. Generate a sitemap. Most CMSs do this automatically. If yours does not, use a sitemap generator and upload sitemap.xml to your site root.
  2. Reference it in robots.txt. Add a Sitemap: line at the bottom of robots.txt with the full URL. This is how search engines you have not submitted the sitemap to can discover it.
  3. Submit to Google Search Console. In Search Console → Sitemaps, paste the sitemap URL. Google usually fetches it soon after submission, often within a day, but fetching a sitemap does not guarantee that every URL in it gets indexed.
  4. Audit weekly. Use an audit tool to confirm every URL in the sitemap returns 200, is indexable, and has no noindex tag. Mismatches confuse Google.
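
The weekly audit in step 4 can be approximated with a short script. This is a rough sketch, not a replacement for a full audit tool: it assumes a plain sitemap (not a sitemap index), uses only Python's standard library, and every function name here is made up for illustration. Because urlopen follows redirects silently, a URL that redirects to a working page will still report 200.

```python
import urllib.error
import urllib.request
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def sitemap_urls(xml_data):
    """Extract every <loc> value from a sitemap (str or bytes)."""
    root = ET.fromstring(xml_data)
    locs = root.findall("sm:url/sm:loc", SITEMAP_NS)
    return [loc.text.strip() for loc in locs if loc.text]


class RobotsMetaFinder(HTMLParser):
    """Sets self.noindex if a robots/googlebot meta tag contains noindex."""

    def __init__(self):
        super().__init__()
        self.noindex = False

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        name = (attributes.get("name") or "").lower()
        content = (attributes.get("content") or "").lower()
        if name in ("robots", "googlebot") and "noindex" in content:
            self.noindex = True


def page_problems(status, x_robots_tag, html):
    """Return a list of reasons this page should not be in the sitemap."""
    problems = []
    if status != 200:
        problems.append(f"status {status}")
    if "noindex" in (x_robots_tag or "").lower():
        problems.append("noindex header")
    finder = RobotsMetaFinder()
    finder.feed(html)
    if finder.noindex:
        problems.append("noindex meta tag")
    return problems


def audit(sitemap_url):
    """Fetch the sitemap and each listed page; return {url: [problems]}."""
    with urllib.request.urlopen(sitemap_url, timeout=10) as response:
        urls = sitemap_urls(response.read())
    report = {}
    for url in urls:
        try:
            with urllib.request.urlopen(url, timeout=10) as response:
                status = response.status
                header = response.headers.get("X-Robots-Tag")
                html = response.read().decode("utf-8", errors="replace")
        except urllib.error.HTTPError as error:
            # 4xx/5xx responses arrive as exceptions; record the status code.
            status, header, html = error.code, error.headers.get("X-Robots-Tag"), ""
        problems = page_problems(status, header, html)
        if problems:
            report[url] = problems
    return report
```

Calling audit("https://yoursite.com/sitemap.xml") returns an empty dictionary when every listed page is healthy; any entry in the result is a URL to fix or remove from the sitemap.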

4 The 5 most dangerous robots.txt mistakes

| Mistake | What happens | Fix |
|---|---|---|
| Disallow: / | Blocks entire site from crawling | Remove the slash; use specific paths |
| Blocking CSS or JS | Google can't render the page properly; rankings drop | Allow /wp-includes/, /assets/, etc. |
| Disallow on noindex pages | Google can't see the noindex; URL stays indexed | Allow crawling, use noindex meta tag instead |
| No sitemap reference | Crawlers may miss new pages for weeks | Add Sitemap: line |
| Old test directives left in | Production blocking dev paths or vice versa | Audit on deploy; never copy staging robots.txt to prod |

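Test before saving: Even a single typo in robots.txt can block crawling of your entire site, and pages Google cannot crawl tend to drop out of search results. Google Search Console has a robots.txt report (it replaced the older robots.txt Tester) that shows the file Google last fetched and any parsing errors. Check it after every change.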
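
Some of the mistakes in the table can be caught automatically before a deploy. The sketch below is a simple text check, assuming you can read the robots.txt file as a string; the function name lint_robots_txt and the list of asset hints are illustrative assumptions, and the asset check is a rough heuristic rather than a complete rule.

```python
# Substrings that suggest a rule blocks rendering resources (heuristic only).
ASSET_HINTS = ("/wp-includes", "/assets", ".css", ".js")


def lint_robots_txt(text):
    """Return warnings for a few common robots.txt mistakes."""
    warnings = []
    has_sitemap = False
    for number, raw in enumerate(text.splitlines(), start=1):
        line = raw.split("#", 1)[0].strip()  # drop comments
        if ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "sitemap":
            has_sitemap = True
        elif field == "disallow":
            if value == "/":
                # Flagged for every user-agent group; blocking one specific
                # bot this way can be intentional, so review rather than
                # delete blindly.
                warnings.append(f"line {number}: 'Disallow: /' blocks the whole site")
            elif any(hint in value.lower() for hint in ASSET_HINTS):
                warnings.append(f"line {number}: '{value}' may block CSS or JS")
    if not has_sitemap:
        warnings.append("no Sitemap: line found")
    return warnings


if __name__ == "__main__":
    with open("robots.txt", encoding="utf-8") as handle:
        for warning in lint_robots_txt(handle.read()):
            print(warning)
```

Running a check like this in the deploy pipeline gives a second line of defense against a staging robots.txt reaching production.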
Written by
John
Founder, AIWebPageSEO

A single line in robots.txt can wipe your site from Google. Treat it like a production config file: review every change, test before deploying, and never copy from staging to production without checking.