Why does Google care about sitemap 404s?

Crawl budget. Every sitemap URL Google fetches costs crawl time. If 30% of the listed URLs are 404s, roughly 30% of the fetches the sitemap triggers go to dead URLs instead of real content. Worse, Google may infer the sitemap is unreliable and trust it less for prioritisation.

Should I include redirected URLs?

No. Sitemaps should contain only the final canonical URL. If /old-page redirects to /new-page, only /new-page belongs in the sitemap. Listing the redirected URL wastes crawl on the 301 hop and conflicts with the canonical signal.

What about noindex pages?

Pages with noindex meta or X-Robots-Tag should NOT be in the sitemap. The sitemap signals 'index this' and noindex signals 'don't index'. Contradictory signals confuse Google and waste crawl budget on pages you've explicitly excluded.

How often should I validate?

Automate. Run validation in CI on every deploy, and run a scheduled job (daily or weekly) against the production sitemap to catch URL drift over time. URLs change for many reasons (page deletions, slug changes, newly added redirects), and the sitemap can fall out of sync silently.

How to Fix 404 URLs in Sitemap

A sitemap with 404s wastes Google's crawl budget — every dead URL Google fetches is one real URL not fetched. Worse, a sitemap full of stale URLs signals that the file isn't maintained, making Google trust its priority hints less. The fix is filtering at generation time so only indexable, 200-status URLs make it in, plus CI validation that prevents regressions.

1. Audit current state

Step 1

Run the Robots Tester

Findings list each sitemap URL with status code, grouped by issue: 404s, 301s, noindex pages, blocked by robots.

Step 2

Bulk-check yourself

# Extract URLs from sitemap
curl -s https://example.com/sitemap.xml | \
  grep -oE '<loc>[^<]+</loc>' | \
  sed 's/<[^>]*>//g' > sitemap-urls.txt

# Check the status of each URL. curl does not follow redirects here,
# so 301/302 responses are reported as-is.
while IFS= read -r url; do
  status=$(curl -s -o /dev/null --max-time 10 -w "%{http_code}" "$url")
  echo "$status $url"
done < sitemap-urls.txt | grep -v "^200 " > sitemap-issues.txt

# Show the first 20 problem URLs
head -20 sitemap-issues.txt
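The shell loop above reports status codes only, so it cannot show the "blocked by robots" group. A minimal sketch of that check using Python's standard `urllib.robotparser` follows. The function name `blocked_urls` and the sample robots.txt content are assumptions for illustration. Python's parser does prefix matching and does not implement Google's wildcard extensions, so treat its answers as a first pass rather than a definitive verdict.

```python
from urllib.robotparser import RobotFileParser


def blocked_urls(robots_txt: str, urls: list[str], user_agent: str = "Googlebot") -> list[str]:
    """Return the URLs that robots_txt disallows for user_agent."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [u for u in urls if not parser.can_fetch(user_agent, u)]


if __name__ == "__main__":
    # Hypothetical robots.txt content; in practice, pass in the text of
    # the file served at https://example.com/robots.txt.
    sample = "User-agent: *\nDisallow: /private/\n"
    urls = ["https://example.com/", "https://example.com/private/report"]
    for url in blocked_urls(sample, urls):
        print("BLOCKED", url)
```

Feeding it the lines of sitemap-urls.txt gives the list of sitemap entries that need either removing or unblocking.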

2. Categorise findings

Status	Issue	Action
404	URL doesn't exist	Remove from sitemap
301/302	URL redirected	Replace with final destination
410	Gone permanently	Remove from sitemap
500	Server error	Fix server, retry validation
200 + noindex	Page returns OK but has noindex meta or X-Robots-Tag header	Remove from sitemap
200 + blocked	Page exists but robots.txt blocks crawl	Remove from sitemap OR remove block
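The table above can be expressed as a small lookup, which is handy when post-processing audit output. This is a sketch: the function name `sitemap_action` and the returned strings are illustrative, treating every 5xx like the table's 500 row is an assumption, and statuses the table does not list (for example 307 or 403) fall through to manual review.

```python
def sitemap_action(status: int, noindex: bool = False, robots_blocked: bool = False) -> str:
    """Map one audit finding to the action from the categorisation table."""
    if status in (404, 410):
        return "remove"
    if status in (301, 302):
        return "replace with final destination"
    if status >= 500:
        # Assumption: any 5xx is handled like the table's 500 row
        return "fix server, retry validation"
    if status == 200 and noindex:
        return "remove"
    if status == 200 and robots_blocked:
        return "remove or unblock"
    if status == 200:
        return "keep"
    # Anything else is not covered by the table
    return "review manually"
```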

3. Fix at generation time, not after

Patching the file isn't enough — it'll regenerate dirty next time. Fix the generator.

WordPress (Yoast)

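// Filter sitemap entries to exclude posts marked noindex
add_filter('wpseo_sitemap_entry', function($entry, $type, $object) {
    // Only post entries carry a post ID; terms and users pass through unchanged
    if ($type !== 'post' || !$object || !function_exists('YoastSEO')) {
        return $entry;
    }
    // YoastSEO() is a global helper function, so it is tested with function_exists()
    $meta = YoastSEO()->meta->for_post($object->ID);
    $robots = $meta ? $meta->robots : [];
    if (isset($robots['index']) && $robots['index'] === 'noindex') {
        return false; // returning false drops the entry from the sitemap
    }
    return $entry;
}, 10, 3);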

Custom generator (Python)

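import re
import requests
from xml.etree import ElementTree as ET

# Robots meta tag containing "noindex" (assumes name= precedes content=)
META_NOINDEX = re.compile(
    r'<meta[^>]+name=["\'](?:robots|googlebot)["\'][^>]*content=["\'][^"\']*noindex',
    re.IGNORECASE,
)

def is_indexable(url):
    try:
        # One GET without following redirects: the URL must return 200 directly
        r = requests.get(url, allow_redirects=False, timeout=5)
    except requests.RequestException:
        return False
    if r.status_code != 200:
        return False
    # noindex can arrive as an HTTP header...
    if 'noindex' in r.headers.get('X-Robots-Tag', '').lower():
        return False
    # ...or as a robots meta tag in the HTML. Matching the tag avoids false
    # positives on pages that merely mention the word "noindex".
    return META_NOINDEX.search(r.text) is None

def generate_sitemap(candidate_urls):
    urlset = ET.Element('urlset', xmlns='http://www.sitemaps.org/schemas/sitemap/0.9')
    for url in candidate_urls:
        if is_indexable(url):
            u = ET.SubElement(urlset, 'url')
            ET.SubElement(u, 'loc').text = url
    return ET.tostring(urlset, encoding='unicode')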

Next.js (App Router)

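// app/sitemap.ts
import type { MetadataRoute } from 'next';

// getPosts is assumed to be your own data-access helper (not shown here)
export default async function sitemap(): Promise<MetadataRoute.Sitemap> {
  // Filter at query time: only published posts with indexable=true
  const posts = await getPosts({ status: 'published', indexable: true });

  return posts.map(post => ({
    url: `https://example.com/blog/${post.slug}`,
    lastModified: post.updatedAt,
  }));
}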

4. CI/CD validation

GitHub Actions example

name: Validate Sitemap
on: [push, deployment_status]

jobs:
  validate:
    runs-on: ubuntu-latest
    steps:
      - name: Extract URLs from sitemap
        run: |
          curl -sf https://example.com/sitemap.xml | \
            grep -oE '<loc>[^<]+</loc>' | \
            sed 's/<[^>]*>//g' > urls.txt
          echo "URLs to validate: $(wc -l < urls.txt)"
          # Fail if the sitemap could not be fetched or lists no URLs
          [ -s urls.txt ] || { echo "No URLs extracted"; exit 1; }

      - name: Check each URL returns 200
        run: |
          failed=0
          while IFS= read -r url; do
            # "|| true" keeps the step alive when curl itself fails (status 000)
            status=$(curl -s -o /dev/null --max-time 10 -w "%{http_code}" "$url" || true)
            if [ "$status" != "200" ]; then
              echo "FAIL: $status $url"
              failed=$((failed+1))
            fi
          done < urls.txt
          if [ "$failed" -gt 0 ]; then
            echo "$failed URLs failed validation"
            exit 1
          fi

Pre-commit hook

#!/bin/bash
# .git/hooks/pre-commit

if git diff --cached --name-only | grep -q "sitemap"; then
    echo "Sitemap changed — running validation"
    ./scripts/validate-sitemap.sh || exit 1
fi
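The hook calls a validation script that is not shown. As a rough sketch of what such a script might do, here is a Python stand-in using only the standard library. The file name `validate_sitemap.py` and the function names are hypothetical. It parses a local sitemap file, requests each URL without following redirects, and exits non-zero if anything other than 200 comes back.

```python
import sys
import urllib.error
import urllib.request
from xml.etree import ElementTree as ET

SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


class _NoRedirect(urllib.request.HTTPRedirectHandler):
    def redirect_request(self, req, fp, code, msg, headers, newurl):
        return None  # do not follow; urllib then raises HTTPError for the 3xx


def extract_locs(xml_text: str) -> list[str]:
    """Return the text of every <loc> element in a sitemap document."""
    root = ET.fromstring(xml_text)
    return [el.text.strip() for el in root.iter(f"{SITEMAP_NS}loc") if el.text]


def fetch_status(url: str) -> int:
    """Return the HTTP status for url, or 0 if the request itself failed."""
    opener = urllib.request.build_opener(_NoRedirect)
    try:
        with opener.open(url, timeout=10) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        return err.code
    except (urllib.error.URLError, OSError):
        return 0


def find_failures(urls, get_status=fetch_status):
    """Return (status, url) pairs for every URL that did not return 200."""
    results = ((get_status(u), u) for u in urls)
    return [(status, u) for status, u in results if status != 200]


# Usage: python validate_sitemap.py path/to/sitemap.xml
# (does nothing when no path is given)
if __name__ == "__main__" and len(sys.argv) > 1:
    with open(sys.argv[1], encoding="utf-8") as fh:
        failures = find_failures(extract_locs(fh.read()))
    for status, url in failures:
        print(f"FAIL: {status} {url}")
    sys.exit(1 if failures else 0)
```

Passing `get_status` as a parameter keeps the checking logic testable without network access.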

5. Common sources of stale URLs

Source 1: Cached generation

CMS regenerates sitemap once a day. Pages deleted in the morning still appear until tomorrow. Solution: invalidate cache on page delete/unpublish.
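One way to picture the cache invalidation described above, as a sketch rather than any particular CMS's API: a small cache object that rebuilds the sitemap when its daily TTL expires or when a delete/unpublish hook calls `invalidate()`. The class name `SitemapCache` and the `build` callback are assumptions.

```python
import time
from typing import Callable, Optional


class SitemapCache:
    """Caches generated sitemap XML until it expires or is invalidated."""

    def __init__(self, build: Callable[[], str], ttl_seconds: float = 86400):
        self._build = build          # function that generates the sitemap
        self._ttl = ttl_seconds      # default: regenerate once a day
        self._xml: Optional[str] = None
        self._built_at = 0.0

    def get(self) -> str:
        expired = time.monotonic() - self._built_at > self._ttl
        if self._xml is None or expired:
            self._xml = self._build()
            self._built_at = time.monotonic()
        return self._xml

    def invalidate(self) -> None:
        """Call this from the CMS's page delete/unpublish hook."""
        self._xml = None
```

With this shape, a page deleted in the morning disappears from the next sitemap request instead of lingering until the daily rebuild.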

Source 2: Soft delete vs hard delete

Page marked "deleted" in database but still in sitemap query. Filter WHERE deleted_at IS NULL in the query.

Source 3: URL changes without redirect

Page slug changes from /old-slug to /new-slug. Old URL in sitemap. Fix: add 301 from old to new AND regenerate sitemap with new URL.
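If the site keeps a redirect map (old URL to new URL), the "regenerate with the new URL" half of this fix can be sketched as below. The function names and the dictionary-based redirect map are assumptions for illustration. Chains are followed to the last hop, and loops raise an error instead of spinning forever.

```python
def final_destination(url: str, redirects: dict[str, str], max_hops: int = 10) -> str:
    """Follow redirects (old URL -> new URL) to the last URL in the chain."""
    seen = {url}
    for _ in range(max_hops):
        if url not in redirects:
            return url
        url = redirects[url]
        if url in seen:
            raise ValueError(f"redirect loop at {url}")
        seen.add(url)
    raise ValueError("too many redirect hops")


def rewrite_sitemap_urls(urls: list[str], redirects: dict[str, str]) -> list[str]:
    """Replace redirected URLs with their destinations, dropping duplicates."""
    # dict.fromkeys removes duplicates while preserving order
    return list(dict.fromkeys(final_destination(u, redirects) for u in urls))
```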

Source 4: Draft vs published

Sitemap includes drafts that aren't accessible. Filter WHERE status = 'published'.
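The two filters from Source 2 and Source 4 combine into a single query. The sketch below uses an in-memory SQLite database. The `pages` table and its `slug`, `status`, and `deleted_at` columns are assumed names, so adapt them to your schema.

```python
import sqlite3

# Only rows that are published AND not soft-deleted belong in the sitemap
SITEMAP_QUERY = """
    SELECT slug FROM pages
    WHERE status = 'published' AND deleted_at IS NULL
    ORDER BY slug
"""


def sitemap_slugs(conn: sqlite3.Connection) -> list[str]:
    return [row[0] for row in conn.execute(SITEMAP_QUERY)]


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE pages (slug TEXT, status TEXT, deleted_at TEXT)")
    conn.executemany(
        "INSERT INTO pages VALUES (?, ?, ?)",
        [
            ("live-post", "published", None),
            ("work-in-progress", "draft", None),
            ("removed-post", "published", "2024-01-01"),
        ],
    )
    print(sitemap_slugs(conn))  # only live-post passes both filters
```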

Source 5: Stale subdomain references

Sitemap on www.example.com includes URLs from blog.example.com that have moved. Each subdomain should have its own sitemap.
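To split a mixed URL list into one sitemap per subdomain, the URLs can be grouped by hostname first. This is a minimal sketch using the standard library. `group_by_host` is an illustrative name, and writing each group out as its own sitemap file is left to the generator.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def group_by_host(urls: list[str]) -> dict[str, list[str]]:
    """Group URLs by hostname so each host gets its own sitemap."""
    groups: dict[str, list[str]] = defaultdict(list)
    for url in urls:
        # hostname is None for relative URLs; those land under ""
        groups[urlsplit(url).hostname or ""].append(url)
    return dict(groups)
```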

6. Re-submit after fixes

After regenerating, ask Google to re-fetch the sitemap:

Search Console → Sitemaps
→ Click your sitemap entry
→ "Submit" again to request a re-fetch
→ Status usually updates once Google has re-read the file, often within hours

7. Verify resolution

Step 1

Re-run Robots Tester

All URLs report 200. Zero 404s, zero redirects, zero noindex pages.

Step 2

Search Console processing

Sitemap status: "Success". URL count matches expected. "Discovered URLs" vs "Submitted URLs" delta should narrow over time.

Step 3

Crawl budget metrics

Search Console → Settings → Crawl stats. Pages-per-day metric stable, crawl errors dropping after sitemap cleanup.

💡 The single rule: sitemap query should be the same query that determines whether a page should be indexed. If your CMS publishes a page → its URL appears in sitemap. If your CMS unpublishes/deletes → URL drops from sitemap immediately. Source-of-truth filtering at query time is more robust than post-hoc cleanup.
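One way to apply this rule in code is to route both the robots meta tag and the sitemap through a single predicate, so the two signals cannot disagree. The sketch below assumes a simple `Page` record. The field names and the `should_index` function are illustrative, not taken from any specific CMS.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Page:
    path: str
    status: str                       # e.g. "published" or "draft"
    deleted_at: Optional[str] = None  # set when the page is soft-deleted
    noindex: bool = False             # editor opted this page out of indexing


def should_index(page: Page) -> bool:
    """Single source of truth for 'this page belongs in the index'."""
    return page.status == "published" and page.deleted_at is None and not page.noindex


def robots_meta(page: Page) -> str:
    """Content of the robots meta tag rendered on the page."""
    return "index, follow" if should_index(page) else "noindex"


def sitemap_paths(pages: list[Page]) -> list[str]:
    """Paths that go into the sitemap; uses the same predicate as robots_meta."""
    return [p.path for p in pages if should_index(p)]
```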

🤖 Re-run the Robots & Sitemap Tester

Verify sitemap contains only live, indexable URLs.


