What does robots.txt do?

Robots.txt is a plain text file at the root of your website that tells search engine crawlers which pages or sections they are not allowed to crawl. It only prevents crawling: a blocked page can still be indexed if external links point to it. Use noindex meta tags or X-Robots-Tag headers to prevent indexing.

What should be in my XML sitemap?

Your XML sitemap should list every page you want Google to index. Include the URL and an accurate last modified date; change frequency and priority are optional, and Google ignores both. Do not include pages that return 404, pages with noindex tags, pages blocked by robots.txt, or duplicate/redirect pages. A clean sitemap only contains indexable, canonical URLs.

How do I submit my sitemap to Google?

Go to Google Search Console, select your property, navigate to Sitemaps in the left menu, and enter your sitemap URL — typically https://yoursite.com/sitemap.xml. Google will fetch and process it. Check back in 24-48 hours to see how many URLs were discovered and indexed.

Robots.txt and XML Sitemap: Setup and Validation Guide

Robots.txt tells Google which pages not to crawl. Your XML sitemap tells Google which pages to crawl. Getting either wrong can silently remove your pages from Google's index — here is how to check and fix both.

The two most common robots.txt mistakes

1. Accidentally blocking your entire site

The directive Disallow: / under User-agent: * blocks every compliant crawler from fetching any page on your site. This is a catastrophic error: Google can no longer read your content, and your pages fall out of search results. Always test robots.txt changes using the Robots & Sitemap Tester before deploying.

Never do this: Disallow: / under User-agent: * — this tells every crawler to crawl nothing on your site.
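A pre-deploy check for this mistake can be sketched in a few lines of Python using the standard library's urllib.robotparser. The function name and the two sample files are illustrative assumptions. Note that this parser applies rules in file order rather than Google's longest-match rule, so treat it as a smoke test, not a replica of Googlebot.

```python
from urllib.robotparser import RobotFileParser


def blocks_whole_site(robots_txt: str, user_agent: str = "Googlebot") -> bool:
    """Return True if the robots.txt text forbids fetching the homepage."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    # If "/" itself is disallowed, the site root is off limits to this crawler.
    return not parser.can_fetch(user_agent, "/")


# Hypothetical file contents, for illustration only.
staging_file = "User-agent: *\nDisallow: /\n"
production_file = "User-agent: *\nDisallow: /cart/\n"

print(blocks_whole_site(staging_file))     # True
print(blocks_whole_site(production_file))  # False
```

Run against the file you are about to ship, a True result is a reason to stop the deploy.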

2. Blocking CSS and JavaScript

Google renders pages using a headless browser. If your robots.txt blocks the CSS and JavaScript files that build your page layout, Google sees a broken, unstyled page and may classify it as low quality. Allow all CSS, JS and font files.

What a correct XML sitemap looks like

A valid sitemap is an XML file listing your canonical, indexable URLs. Every URL in the sitemap should return HTTP 200, have a canonical tag pointing to itself, and not have a noindex robots directive. Include lastmod dates so Google knows when pages were last updated.

Quick check: Visit https://yoursite.com/sitemap.xml in a browser. If it displays XML content, it is accessible. If it shows a 404 or blank page, you need to create or configure your sitemap.
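To look past "it displays XML" and see what the file actually declares, a minimal sketch with Python's standard xml.etree.ElementTree can list each URL and its lastmod. The sample sitemap, its date, and the function name are assumptions for illustration; the namespace is the one defined by the sitemap protocol.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def read_sitemap(xml_text: str) -> list[dict]:
    """Return one {'loc': ..., 'lastmod': ...} dict per <url> entry."""
    root = ET.fromstring(xml_text)
    entries = []
    for url in root.findall("sm:url", SITEMAP_NS):
        entries.append({
            "loc": url.findtext("sm:loc", default="", namespaces=SITEMAP_NS).strip(),
            # lastmod is optional, so a missing element comes back as None.
            "lastmod": url.findtext("sm:lastmod", default=None, namespaces=SITEMAP_NS),
        })
    return entries


# Illustrative sitemap with made-up URLs and date.
sample = (
    '<?xml version="1.0" encoding="UTF-8"?>'
    '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
    "<url><loc>https://yoursite.com/</loc><lastmod>2024-01-15</lastmod></url>"
    "<url><loc>https://yoursite.com/about</loc></url>"
    "</urlset>"
)

for entry in read_sitemap(sample):
    print(entry["loc"], entry["lastmod"])
```

Entries with no lastmod show up as None, which makes gaps easy to spot.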

The distinction that causes most of the damage

Robots.txt controls crawling. It does not control indexing. These are different things, and conflating them is the single most expensive misunderstanding in technical SEO.

Disallowing a URL in robots.txt tells Google not to fetch it. It does not tell Google not to list it. If other pages link to that URL, Google may index it anyway — showing it in results with no title and a description reading that no information is available, because it was forbidden from looking. The page is in the index and Google has no idea what is on it.

Worse, and more common: a page that is disallowed in robots.txt and carries a noindex tag will never be removed from the index, because Google cannot fetch the page to read the tag telling it to go away. The two directives cancel each other out, and the page stays.

The rule: to keep a page out of the index, allow it to be crawled and serve noindex. To save crawl budget on pages you do not care about, disallow them. Never both: the combination is self-defeating, and it is one of the most common configuration errors on the web.
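The rule reduces to a four-row decision table. The sketch below is only a restatement of the paragraphs above in Python; the function name and the wording of the verdicts are assumptions, not output from any Google tool.

```python
def index_control_verdict(disallowed: bool, noindex: bool) -> str:
    """Describe what a given robots.txt / noindex combination achieves."""
    if disallowed and noindex:
        return "conflict: the noindex cannot be read while the URL is disallowed"
    if noindex:
        return "kept out of the index: crawlable, so the noindex can be read"
    if disallowed:
        return "not crawled: may still be indexed from links alone"
    return "crawlable and indexable"


for disallowed in (False, True):
    for noindex in (False, True):
        print(disallowed, noindex, "->", index_control_verdict(disallowed, noindex))
```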

The robots.txt mistakes that take sites off the internet

Disallow: /

One short line that can remove an entire site from search. It typically happens when a staging configuration is deployed to production, and it is the fastest way to lose all organic traffic. Check it after every deploy, not because it is likely, but because the cost of missing it is total.

Blocking CSS and JavaScript

Disallowing /assets/, /js/ or /css/ is a habit inherited from an era when it seemed tidy. Today it prevents Google from rendering your page: it can fetch the HTML but not the code and styles that fill it, so it renders a broken shell and judges the page on that. Google needs your resources. Let it have them.

Assuming it is a security control

Robots.txt is a public file listing the paths you would prefer people did not visit. It is read by anyone curious, and it is obeyed only by crawlers that choose to. Putting /admin/ or /private-backups/ in it is not protection — it is a signpost. Anything genuinely sensitive requires authentication, not a polite request.

Order and specificity confusion

Google applies the most specific matching rule, not the first one. A broad Disallow followed by a narrow Allow does permit the narrower path — but the behaviour differs between crawlers, and complicated rule sets are where mistakes hide. Keep the file short and obvious.
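The precedence Google documents (longest matching path wins, and Allow wins a tie) can be sketched as follows. This is a simplified illustration that handles plain path prefixes only, with no support for the * and $ wildcards; the function name and rule format are assumptions.

```python
def is_allowed(path: str, rules: list[tuple[str, str]]) -> bool:
    """rules are (directive, path_prefix) pairs, e.g. ("disallow", "/shop/")."""
    best_len = -1
    allowed = True  # no matching rule means the path may be crawled
    for directive, prefix in rules:
        if not prefix or not path.startswith(prefix):
            continue
        is_allow = directive.lower() == "allow"
        # The longer (more specific) prefix wins; on equal length, Allow wins.
        if len(prefix) > best_len or (len(prefix) == best_len and is_allow):
            best_len = len(prefix)
            allowed = is_allow
    return allowed


rules = [("disallow", "/shop/"), ("allow", "/shop/sale/")]
print(is_allowed("/shop/sale/item", rules))  # True
print(is_allowed("/shop/cart", rules))       # False
print(is_allowed("/blog/", rules))           # True
```

The result does not depend on the order of the rules, which is exactly where first-match crawlers can disagree with Google.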

User-agent blocks that do not apply

A crawler reads only the block that matches it most specifically. If you have a block for Googlebot and a block for *, Googlebot reads only its own block and ignores the wildcard entirely — including any rules you assumed were universal.
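A rough sketch of that selection, assuming the file has already been parsed into a dictionary of user-agent token to rule lines. The prefix comparison is a simplification of how crawlers match their product token, and the names and sample rules are hypothetical.

```python
def select_group(groups: dict[str, list[str]], crawler: str) -> list[str]:
    """Return the rules of the single group that best matches the crawler."""
    crawler = crawler.lower()
    best_token = None
    for token in groups:
        lowered = token.lower()
        if lowered == "*" or not crawler.startswith(lowered):
            continue
        # Prefer the longest, i.e. most specific, matching token.
        if best_token is None or len(lowered) > len(best_token):
            best_token = token
    if best_token is not None:
        return groups[best_token]
    # Only crawlers with no named group fall back to the wildcard.
    return groups.get("*", [])


groups = {
    "*": ["Disallow: /private/"],
    "Googlebot": ["Disallow: /tmp/"],
}
print(select_group(groups, "Googlebot"))  # ['Disallow: /tmp/']
print(select_group(groups, "Bingbot"))    # ['Disallow: /private/']
```

In this example Googlebot never sees the /private/ rule. If a rule must apply to every crawler, it has to be repeated in each named group.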

What a sitemap is for, and what it is not

A sitemap is a discovery aid. It tells Google which URLs you consider worth crawling, and it is particularly useful for pages that are poorly linked internally, newly published, or otherwise hard to find.

It is not a guarantee of indexing. Submitting a URL in a sitemap does not oblige Google to crawl it, and certainly does not oblige it to index it. A site with a thousand sitemap entries and two hundred indexed pages does not have a sitemap problem — it has a quality problem, and the sitemap is faithfully reporting it.

Nor is a sitemap a substitute for internal linking. A page that appears in the sitemap and is linked from nowhere is an orphan, and orphans are routinely crawled once and never returned to. Google reads internal links as a statement about what you consider important; a page nothing links to is a page you have implicitly declared unimportant, whatever the sitemap says.

Sitemap errors that quietly cost you

Including non-canonical URLs. Every URL in the sitemap should be the canonical version, indexable, and return 200. A sitemap listing URLs that redirect, 404, or canonicalise elsewhere is sending Google to fetch pages you have already told it not to index — which wastes crawl budget and undermines the file's credibility.
Including noindexed pages. A direct contradiction: the sitemap says "crawl this, it matters" and the page says "do not index me".
Stale lastmod values. A lastmod that updates on every build, for every page, whether or not anything changed, is noise. Google learns to ignore it. An accurate lastmod is genuinely useful; a fabricated one is worse than none.
Priority and changefreq. Google has said plainly that it ignores both. They are harmless and they are not doing anything.
Exceeding the limits. 50,000 URLs or 50MB uncompressed per file. Beyond that, split into multiple sitemaps behind a sitemap index.
Not referencing it from robots.txt. A Sitemap: line with the full URL is how crawlers other than Google find it without being told.
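The first two errors in this list are mechanical to check once you have crawl data for each sitemap URL. The sketch below assumes you already hold that data in a hypothetical PageCheck record (status code, declared canonical, noindex flag); how you collect it is up to your crawler.

```python
from dataclasses import dataclass


@dataclass
class PageCheck:
    """Hypothetical crawl result for one URL listed in the sitemap."""
    url: str
    status: int
    canonical: str
    noindex: bool


def sitemap_problems(pages: list[PageCheck]) -> list[tuple[str, str]]:
    """Return (url, reason) pairs for entries that do not belong in a sitemap."""
    problems = []
    for page in pages:
        if page.status != 200:
            # Redirects and errors have no canonical or noindex worth checking.
            problems.append((page.url, f"returns {page.status}"))
            continue
        if page.canonical != page.url:
            problems.append((page.url, "canonicalises elsewhere"))
        if page.noindex:
            problems.append((page.url, "carries noindex"))
    return problems


pages = [
    PageCheck("https://example.com/a", 200, "https://example.com/a", False),
    PageCheck("https://example.com/b", 301, "", False),
    PageCheck("https://example.com/c", 200, "https://example.com/a", True),
]
for url, reason in sitemap_problems(pages):
    print(url, "->", reason)
```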

The sitemap is not the list of your important pages

This point deserves its own section because it causes real damage during migrations and audits.

A sitemap contains the URLs you knew about and chose to declare. It systematically omits the ones that matter most in a crisis: the old post from years ago that quietly accumulated links, the discontinued product page that still ranks for a term nobody remembers targeting, the orphaned landing page from a campaign that ended. None of these are in your sitemap. Several of them are earning.

Any process that starts from the sitemap — building a redirect map, auditing content, deciding what to keep — will therefore miss exactly the pages whose loss hurts most. The complete picture requires a crawl, a Search Console export of pages that received impressions, and ideally the server logs. The gap between that union and your sitemap is precisely the traffic that gets lost in a migration.
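In code, that gap is a set difference. A minimal sketch, assuming each source has already been reduced to a list of normalised URLs; the function name and sample URLs are made up for illustration.

```python
def urls_missing_from_sitemap(sitemap, crawl, search_console, server_logs):
    """Return URLs seen by any other source that the sitemap does not declare."""
    known = set(crawl) | set(search_console) | set(server_logs)
    return known - set(sitemap)


sitemap = ["https://example.com/", "https://example.com/pricing"]
crawl = ["https://example.com/", "https://example.com/pricing"]
search_console = ["https://example.com/", "https://example.com/old-post"]
server_logs = ["https://example.com/campaign-2019"]

for url in sorted(urls_missing_from_sitemap(sitemap, crawl, search_console, server_logs)):
    print(url)
```

Every URL this prints is a candidate for the redirect map that a sitemap-only process would have missed.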

Crawl budget: when it matters and when it does not

Crawl budget is discussed far more often than it applies. For a site of a few hundred pages, it is essentially irrelevant — Google will crawl everything it wants to, and the constraint is never the budget.

It becomes real on large sites, and its symptom is specific: pages that Google has discovered and not crawled. If Search Console reports a large number of URLs as "Discovered — currently not indexed", Google has looked at your site and decided the remainder is not worth fetching. That is a judgement about quality and authority, and it is the honest signal underneath most crawl budget conversations.

What genuinely wastes it:

Faceted navigation generating thousands of parameter combinations, each a distinct URL.
Long redirect chains, where every hop is a fetch.
Large numbers of 404s from stale internal links.
Duplicate pages reachable at multiple URLs with no canonical.
Thin pages published at volume — each one divides the same budget further.

The last of these is the one nobody wants to hear. Publishing more pages does not earn more crawling. It spreads the same crawling more thinly, and if the pages are thin, it lowers Google's estimate of what the rest of the site is worth.
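Faceted navigation, the first item in the list above, is usually the easiest of these to spot in a crawl export or log file: one path appears again and again with different query strings. A small sketch using only the standard library, with an assumed function name and sample URLs:

```python
from collections import Counter
from urllib.parse import urlsplit


def parameter_variants(urls):
    """Count distinct query-string variants seen for each path."""
    counts = Counter()
    for url in set(urls):  # ignore exact duplicates
        parts = urlsplit(url)
        if parts.query:
            counts[parts.path] += 1
    return counts


urls = [
    "https://example.com/shoes?color=red",
    "https://example.com/shoes?color=red&size=9",
    "https://example.com/shoes?size=9",
    "https://example.com/shoes",
    "https://example.com/about",
]
print(parameter_variants(urls).most_common(1))  # [('/shoes', 3)]
```

On a real site, paths with hundreds or thousands of variants are the ones to look at first.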

Structuring sitemaps so they tell you something

A single sitemap containing every URL is a missed opportunity. Split by type, and the sitemap stops being a submission and becomes a diagnostic.

Search Console reports indexing coverage per sitemap. If products, articles and category pages sit in one file, you learn that 60% of the site is indexed and nothing about which 40% is not. Split them — products, articles, categories, each in its own file behind a sitemap index — and the same report tells you that products are indexed at 95% and articles at 20%, which is an entirely different and immediately actionable fact.

This costs nothing to implement and it is the difference between knowing you have a problem and knowing where it is.
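One way to sketch the split in Python: group URLs into sections, write one sitemap per section, and list those files in a sitemap index. Grouping by first path segment is an assumption that only works if your URL structure mirrors your content types, and the file names are hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict
from urllib.parse import urlsplit

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def group_by_section(urls):
    """Group URLs by their first path segment; the root goes to 'static'."""
    groups = defaultdict(list)
    for url in urls:
        first_segment = urlsplit(url).path.strip("/").split("/")[0]
        groups[first_segment or "static"].append(url)
    return dict(groups)


def build_sitemap_index(sitemap_urls):
    """Return a sitemap index document listing the given sitemap files."""
    root = ET.Element("sitemapindex", xmlns=SITEMAP_NS)
    for sitemap_url in sitemap_urls:
        entry = ET.SubElement(root, "sitemap")
        ET.SubElement(entry, "loc").text = sitemap_url
    return ET.tostring(root, encoding="unicode")


groups = group_by_section([
    "https://example.com/products/widget",
    "https://example.com/articles/how-to",
    "https://example.com/",
])
index = build_sitemap_index(
    [f"https://example.com/sitemap-{name}.xml" for name in sorted(groups)]
)
print(sorted(groups))
print(index)
```

Each per-section file would then be submitted through the index, so Search Console reports coverage for each one separately.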

Splits worth making

By content type — products, articles, categories, static pages. The most useful split, because indexing rates differ sharply between them.
By recency — a separate sitemap for pages published in the last month makes it obvious whether new content is being picked up.
By section — for large sites where different parts of the tree behave differently.

Do not split arbitrarily by number. A sitemap containing URLs 1–50,000 and another with 50,001–100,000 tells you nothing at all, and that is how most large sitemaps are split.

What to do when pages are not being indexed

The instinct on discovering unindexed pages is to submit the sitemap again. This almost never helps, because resubmission is not the constraint — Google already knows the URLs exist. Work through the actual causes in order.

Is it blocked?

Check three things: robots.txt, the meta robots noindex tag, and the X-Robots-Tag HTTP header. The header is invisible in the page source and is a common cause of pages disappearing after a server configuration change. All three must be clear.
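The two noindex checks can be sketched with the standard library alone, given a response's headers and HTML that you have already fetched. The function and class names are assumptions, and the sketch only looks for the literal noindex directive in the X-Robots-Tag header and in meta tags named robots or googlebot.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect the content of <meta name="robots"> style tags."""

    def __init__(self):
        super().__init__()
        self.directives = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = {name: (value or "") for name, value in attrs}
        if attr.get("name", "").lower() in ("robots", "googlebot"):
            self.directives.append(attr.get("content", "").lower())


def noindex_sources(headers: dict[str, str], html: str) -> list[str]:
    """Return the places where this response asks not to be indexed."""
    found = []
    for name, value in headers.items():
        if name.lower() == "x-robots-tag" and "noindex" in value.lower():
            found.append("X-Robots-Tag header")
    parser = RobotsMetaParser()
    parser.feed(html)
    if any("noindex" in directive for directive in parser.directives):
        found.append("meta robots tag")
    return found


headers = {"Content-Type": "text/html", "X-Robots-Tag": "noindex, nofollow"}
html = '<html><head><meta name="robots" content="index, follow"></head></html>'
print(noindex_sources(headers, html))  # ['X-Robots-Tag header']
```

The example shows the case the page source hides: the HTML says "index, follow" while the header says noindex.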

Is it reachable?

A page linked from nowhere is an orphan. The sitemap will get it discovered; it will not get it valued. Internal links are how you tell Google a page matters, and a page with no inbound links has been implicitly declared unimportant by its own site.
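Given a crawl's link graph, orphans in the sitemap fall out of a simple comparison. The sketch assumes a hypothetical dictionary mapping each crawled page to the internal URLs it links to; a page linking to itself does not count as an inbound link.

```python
def find_orphans(sitemap_urls, links):
    """Return sitemap URLs that no other crawled page links to."""
    linked = set()
    for source, targets in links.items():
        linked.update(target for target in targets if target != source)
    return sorted(set(sitemap_urls) - linked)


sitemap_urls = [
    "https://example.com/",
    "https://example.com/guide",
    "https://example.com/landing",
]
links = {
    "https://example.com/": ["https://example.com/guide"],
    "https://example.com/guide": ["https://example.com/"],
    "https://example.com/landing": ["https://example.com/landing"],
}
print(find_orphans(sitemap_urls, links))  # ['https://example.com/landing']
```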

Is it a duplicate?

If Google has decided the page is a near-duplicate of another, it will index one and drop the other, and Search Console will report the dropped page as "Alternate page with proper canonical tag" or "Duplicate without user-selected canonical". That is not a bug to be argued with; it is a judgement about the content.

Is it worth indexing?

The uncomfortable answer, and the correct one more often than any of the above. "Discovered — currently not indexed" means Google found the URL and chose not to spend a crawl on it. That is a verdict on the site's quality and authority, and no amount of sitemap work overturns it. The remedy is fewer, better pages — not more submissions.

The honest test: if a page were removed entirely, would anyone notice, and would any query go unanswered? If not, the page is not being unfairly ignored. It is being correctly assessed.

Frequently asked questions

Will robots.txt keep a page out of Google?

No. It stops Google crawling the page, not indexing it. A disallowed page that other sites link to can still appear in results, with no title and no description, because Google was forbidden from looking at it. To keep a page out of the index, allow the crawl and serve noindex.

Can I use noindex and robots.txt disallow together?

You can, and it is self-defeating. Google cannot fetch the page to read the noindex, so the page is never removed. Use one or the other, never both.

Should I block my JavaScript and CSS?

No. Google needs them to render the page. Blocking them means it renders a broken shell and judges the page on that.

Does including a page in my sitemap get it indexed?

No. A sitemap aids discovery; it obliges nothing. If a large share of your sitemap is not indexed, the sitemap is not the problem — it is faithfully reporting a quality or authority problem.

Do priority and changefreq do anything?

No. Google has stated it ignores both. They are harmless and inert.

Is robots.txt a security measure?

Never. It is a public file that lists the paths you would rather people did not visit, and it is obeyed only by crawlers that choose to obey it. Anything sensitive requires authentication.
