How to Fix 404 URLs in Sitemap

A sitemap with 404s wastes Google's crawl budget: every dead URL Googlebot fetches is a request it could have spent on a live page. Worse, a sitemap full of stale URLs signals that the file isn't maintained, which makes Google less likely to trust hints such as lastmod. The fix is to filter at generation time so that only indexable URLs returning 200 make it in, and to add CI validation that prevents regressions.

1. Audit current state

Step 1: Run the Robots Tester

Findings list each sitemap URL with its status code, grouped by issue: 404s, 301s, noindex pages, and URLs blocked by robots.txt.

Step 2: Bulk-check yourself

# Extract URLs from sitemap
curl -s https://example.com/sitemap.xml | \
  grep -oE '<loc>[^<]+</loc>' | \
  sed 's/<[^>]*>//g' > sitemap-urls.txt

# Check status of each (no -L, so redirects show up as 301/302)
while IFS= read -r url; do
  status=$(curl -s -o /dev/null --max-time 10 -w "%{http_code}" "$url")
  echo "$status $url"
done < sitemap-urls.txt | grep -v "^200 " > sitemap-issues.txt

head -20 sitemap-issues.txt
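
The grep pipeline above assumes each <loc> sits on a single line with no CDATA wrapper. Where that does not hold, an XML parser is more forgiving. A minimal sketch, assuming the sitemap uses the sitemaps.org 0.9 namespace; `extract_locs` and the sample document are illustrative names, not part of any tool mentioned here:

```python
from xml.etree import ElementTree as ET

# Namespace used by the sitemaps.org 0.9 schema
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"

def extract_locs(xml_text):
    """Return every <loc> value in a sitemap or sitemap index, whitespace-trimmed."""
    root = ET.fromstring(xml_text)
    return [
        el.text.strip()
        for el in root.iter(SITEMAP_NS + "loc")
        if el.text and el.text.strip()
    ]

# Hypothetical sample: one multi-line <loc>, one wrapped in CDATA
sample = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>
      https://example.com/a
    </loc>
  </url>
  <url><loc><![CDATA[https://example.com/b?x=1&y=2]]></loc></url>
</urlset>"""

print(extract_locs(sample))
```

Because it iterates over every <loc> in the sitemap namespace, the same function also returns the child sitemap URLs when pointed at a sitemap index file.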

2. Categorise findings

| Status | Issue | Action |
| --- | --- | --- |
| 404 | URL doesn't exist | Remove from sitemap |
| 301/302 | URL redirected | Replace with final destination |
| 410 | Gone permanently | Remove from sitemap |
| 500 | Server error | Fix server, retry validation |
| 200 + noindex | Page returns OK but has noindex meta | Remove from sitemap |
| 200 + blocked | Page exists but robots.txt blocks crawl | Remove from sitemap OR remove block |
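
If the audit output is processed by a script, the table above can be expressed as a small lookup. This is only a sketch: the function name and return strings are hypothetical, and treating every 5xx like a 500 is an assumption that goes beyond the table.

```python
def sitemap_action(status, noindex=False, robots_blocked=False):
    """Map one audit result to the action listed in the table."""
    if status in (404, 410):
        return "remove"
    if status in (301, 302):
        return "replace with final destination"
    if 500 <= status < 600:
        # Assumption: any 5xx is handled the same way as a 500
        return "fix server, then re-validate"
    if status == 200:
        if noindex:
            return "remove"
        if robots_blocked:
            return "remove or unblock"
        return "keep"
    # Anything the table does not cover needs a human decision
    return "review manually"

for row in [(404, False, False), (301, False, False), (200, True, False), (200, False, False)]:
    print(row, "->", sitemap_action(*row))
```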

3. Fix at generation time, not after

Patching the file isn't enough — it'll regenerate dirty next time. Fix the generator.

WordPress (Yoast)

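// Filter sitemap entries to exclude noindex, password-protected, and unpublished posts
add_filter('wpseo_sitemap_entry', function ($entry, $type, $object) {
    // Only post entries carry a WP_Post; term and user entries pass through unchanged
    if ($type !== 'post' || !($object instanceof WP_Post)) {
        return $entry;
    }
    // Exclude drafts, pending, scheduled, private, and password-protected posts
    if (in_array($object->post_status, ['draft', 'pending', 'future', 'private'], true)
        || !empty($object->post_password)) {
        return false;
    }
    // Exclude noindex pages (YoastSEO() is a function, so test it with function_exists)
    if (function_exists('YoastSEO')) {
        $meta = YoastSEO()->meta->for_post($object->ID);
        $robots = $meta ? $meta->robots : [];
        if (isset($robots['index']) && $robots['index'] === 'noindex') {
            return false;
        }
    }
    return $entry;
}, 10, 3);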

Custom generator (Python)

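import re
import requests
from xml.etree import ElementTree as ET

# Matches a robots/googlebot meta tag whose content includes noindex
# (assumes the name attribute comes before content)
META_NOINDEX = re.compile(
    r'<meta[^>]+name=["\'](?:robots|googlebot)["\'][^>]*content=["\'][^"\']*noindex',
    re.IGNORECASE,
)

def is_indexable(url):
    try:
        # One GET with redirects disabled: the URL must return 200 directly
        r = requests.get(url, allow_redirects=False, timeout=5)
    except requests.RequestException:
        return False
    if r.status_code != 200:
        return False
    # noindex can arrive as an HTTP header as well as a meta tag
    if 'noindex' in r.headers.get('X-Robots-Tag', '').lower():
        return False
    # Match the meta tag itself, not any page that merely mentions "noindex"
    return META_NOINDEX.search(r.text) is None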

def generate_sitemap(candidate_urls):
    urlset = ET.Element('urlset', xmlns='http://www.sitemaps.org/schemas/sitemap/0.9')
    for url in candidate_urls:
        if is_indexable(url):
            u = ET.SubElement(urlset, 'url')
            ET.SubElement(u, 'loc').text = url
    return ET.tostring(urlset, encoding='unicode')

Next.js (App Router)

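// app/sitemap.ts
import type { MetadataRoute } from 'next';

// getPosts is your own data-access helper (import it from your data layer).
// Filter at query time: only published posts flagged indexable.
export default async function sitemap(): Promise<MetadataRoute.Sitemap> {
  const posts = await getPosts({ status: 'published', indexable: true });

  return posts.map(post => ({
    url: `https://example.com/blog/${post.slug}`,
    lastModified: post.updatedAt,
  }));
}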

4. CI/CD validation

GitHub Actions example

name: Validate Sitemap
on: [push, deployment_status]

jobs:
  validate:
    runs-on: ubuntu-latest
    steps:
      - name: Extract URLs from sitemap
        run: |
          curl -s https://example.com/sitemap.xml | \
            grep -oE '<loc>[^<]+</loc>' | \
            sed 's/<[^>]*>//g' > urls.txt
          echo "URLs to validate: $(wc -l < urls.txt)"
      
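      - name: Check each URL returns 200
        run: |
          # An empty list means the sitemap fetch or parse failed; do not pass silently
          if [ ! -s urls.txt ]; then
            echo "No URLs extracted from sitemap"
            exit 1
          fi
          failed=0
          while IFS= read -r url; do
            status=$(curl -s -o /dev/null --max-time 10 -w "%{http_code}" "$url")
            if [ "$status" != "200" ]; then
              echo "FAIL: $status $url"
              failed=$((failed+1))
            fi
          done < urls.txt
          if [ "$failed" -gt 0 ]; then
            echo "$failed URLs failed validation"
            exit 1
          fi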
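
The sequential curl loop above can be slow on sitemaps with thousands of URLs. One option is to run the same check concurrently. The following is a sketch using only the Python standard library; `fetch_status` and `find_failures` are hypothetical names, and the worker count is an arbitrary starting point rather than a recommendation.

```python
from concurrent.futures import ThreadPoolExecutor
from urllib import error, request

class _NoRedirect(request.HTTPRedirectHandler):
    """Stops urllib from following redirects so 301/302 surface as status codes."""
    def redirect_request(self, req, fp, code, msg, headers, newurl):
        return None  # urllib then raises HTTPError carrying the 3xx code

_opener = request.build_opener(_NoRedirect)

def fetch_status(url, timeout=10):
    """Return the HTTP status for url, or 0 if the request could not complete."""
    try:
        with _opener.open(url, timeout=timeout) as resp:
            return resp.status
    except error.HTTPError as exc:
        return exc.code          # 3xx, 4xx and 5xx all arrive here
    except (OSError, ValueError):
        return 0                 # DNS failure, timeout, malformed URL

def find_failures(urls, fetch=fetch_status, workers=8):
    """Return (status, url) for every URL that does not answer 200, in input order."""
    urls = list(urls)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        statuses = list(pool.map(fetch, urls))
    return [(status, url) for status, url in zip(statuses, urls) if status != 200]
```

In a CI step, the script would read urls.txt, call `find_failures`, print each pair, and exit non-zero if the list is not empty. The `fetch` parameter exists so the logic can be exercised with a stub instead of real network calls.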

Pre-commit hook

#!/bin/bash
# .git/hooks/pre-commit

if git diff --cached --name-only | grep -q "sitemap"; then
    echo "Sitemap changed — running validation"
    ./scripts/validate-sitemap.sh || exit 1
fi

5. Common sources of stale URLs

Source 1: Cached generation

CMS regenerates sitemap once a day. Pages deleted in the morning still appear until tomorrow. Solution: invalidate cache on page delete/unpublish.
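
As an illustration of invalidating on delete, here is a minimal in-memory sketch. `SitemapCache` and the in-memory page set are hypothetical; a real CMS would call `invalidate()` from its own delete and unpublish hooks.

```python
class SitemapCache:
    """Caches the rendered sitemap and drops it when a page's visibility changes."""

    def __init__(self, build):
        self._build = build   # callable that returns the sitemap as a string
        self._xml = None

    def get(self):
        # Rebuild lazily, only when nothing is cached
        if self._xml is None:
            self._xml = self._build()
        return self._xml

    def invalidate(self):
        self._xml = None


# Hypothetical stand-in for the CMS page table
pages = {"/about", "/pricing"}
cache = SitemapCache(lambda: "\n".join(sorted(pages)))

def delete_page(path):
    """Delete hook: remove the page, then invalidate so the next request rebuilds."""
    pages.discard(path)
    cache.invalidate()

cache.get()               # cached with both pages
delete_page("/pricing")
print(cache.get())        # rebuilt without the deleted page
```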

Source 2: Soft delete vs hard delete

Page marked "deleted" in database but still in sitemap query. Filter WHERE deleted_at IS NULL in the query.

Source 3: URL changes without redirect

Page slug changes from /old-slug to /new-slug. Old URL in sitemap. Fix: add 301 from old to new AND regenerate sitemap with new URL.
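
When the CMS keeps a record of slug changes, the sitemap list can be rewritten from that record before it is published. A sketch, assuming a plain dict of old URL to new URL; the function name and the hop limit are illustrative choices:

```python
def resolve_redirects(urls, redirects, max_hops=10):
    """Replace each URL with the end of its redirect chain, dropping duplicates."""
    resolved, seen = [], set()
    for url in urls:
        hops = 0
        while url in redirects and hops < max_hops:
            url = redirects[url]
            hops += 1
        if url in redirects:
            # Still redirecting after max_hops (probably a loop): leave it out
            continue
        if url not in seen:
            seen.add(url)
            resolved.append(url)
    return resolved

redirects = {"/old-slug": "/new-slug"}
print(resolve_redirects(["/old-slug", "/about", "/new-slug"], redirects))
```

The old URL is replaced by its destination, and the duplicate that results is emitted only once.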

Source 4: Draft vs published

Sitemap includes drafts that aren't accessible. Filter WHERE status = 'published'.
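
The two query filters from Source 2 and Source 4 can be combined in one statement. The sketch below uses an in-memory SQLite database; the `pages` table and its column names are assumptions chosen to match the filters named above, not a real CMS schema.

```python
import sqlite3

# Only published, non-deleted rows are eligible for the sitemap
SITEMAP_QUERY = """
    SELECT slug FROM pages
    WHERE status = 'published'
      AND deleted_at IS NULL
    ORDER BY slug
"""

def sitemap_slugs(conn):
    return [row[0] for row in conn.execute(SITEMAP_QUERY)]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pages (slug TEXT, status TEXT, deleted_at TEXT)")
conn.executemany(
    "INSERT INTO pages VALUES (?, ?, ?)",
    [
        ("live-post", "published", None),
        ("work-in-progress", "draft", None),            # excluded: draft
        ("removed-post", "published", "2024-01-01"),    # excluded: soft-deleted
    ],
)
print(sitemap_slugs(conn))
```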

Source 5: Stale subdomain references

Sitemap on www.example.com includes URLs from blog.example.com that have moved. Each subdomain should have its own sitemap.
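
A generator can enforce the one-host-per-sitemap rule with a hostname check before writing the file. A sketch using the standard library; `split_by_host` is a hypothetical helper name:

```python
from urllib.parse import urlsplit

def split_by_host(sitemap_url, urls):
    """Separate URLs on the sitemap's own host from those on any other host."""
    own_host = urlsplit(sitemap_url).hostname
    same, foreign = [], []
    for url in urls:
        (same if urlsplit(url).hostname == own_host else foreign).append(url)
    return same, foreign

same, foreign = split_by_host(
    "https://www.example.com/sitemap.xml",
    ["https://www.example.com/pricing", "https://blog.example.com/old-post"],
)
print("keep:", same)
print("move to the other host's sitemap:", foreign)
```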

6. Re-submit after fixes

After regenerating, ask Google to re-fetch the file:

Search Console → Sitemaps
→ Click your sitemap entry
→ "Submit" again to request a re-fetch
→ Status typically updates within hours

7. Verify resolution

Step 1: Re-run Robots Tester

All URLs report 200. Zero 404s, zero redirects, zero noindex pages.

Step 2: Search Console processing

Sitemap status: "Success". URL count matches expected. "Discovered URLs" vs "Submitted URLs" delta should narrow over time.

Step 3: Crawl budget metrics

Search Console → Settings → Crawl stats. Pages-per-day metric stable, crawl errors dropping after sitemap cleanup.

💡 The single rule: sitemap query should be the same query that determines whether a page should be indexed. If your CMS publishes a page → its URL appears in sitemap. If your CMS unpublishes/deletes → URL drops from sitemap immediately. Source-of-truth filtering at query time is more robust than post-hoc cleanup.
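
One way to hold to that rule is to route both decisions through a single predicate, so the robots meta tag and the sitemap cannot disagree. A sketch with a hypothetical `Page` model; the field names are assumptions:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Page:
    url: str
    status: str                       # e.g. "published" or "draft"
    deleted_at: Optional[str] = None
    noindex: bool = False

def should_index(page):
    """Single source of truth for whether a page is meant to be indexed."""
    return page.status == "published" and page.deleted_at is None and not page.noindex

def robots_meta(page):
    # The page template asks the same predicate the sitemap uses
    return "index, follow" if should_index(page) else "noindex"

def sitemap_urls(pages):
    return [p.url for p in pages if should_index(p)]

pages = [
    Page("https://example.com/a", "published"),
    Page("https://example.com/b", "draft"),
    Page("https://example.com/c", "published", noindex=True),
]
print(sitemap_urls(pages))
```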