A sitemap with 404s wastes Google's crawl budget: every fetch Google spends on a dead URL is a fetch it could have spent on a live one. Worse, a sitemap full of stale URLs signals that the file isn't maintained, which gives Google less reason to trust the hints it carries, such as lastmod (Google already ignores priority and changefreq). The fix is to filter at generation time so that only indexable, 200-status URLs make it in, and to add CI validation that prevents regressions.
# Extract URLs from sitemap
curl -s https://example.com/sitemap.xml | \
  grep -oE '<loc>[^<]+</loc>' | \
  sed 's/<[^>]*>//g' > sitemap-urls.txt
# Check the status of each URL (redirects are not followed, so 301/302 are reported as issues)
while IFS= read -r url; do
  status=$(curl -s -o /dev/null -w "%{http_code}" "$url")
  echo "$status $url"
done < sitemap-urls.txt | grep -v "^200 " > sitemap-issues.txt
# Show the first 20 problem URLs
head -20 sitemap-issues.txt
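One caveat with the grep/sed extraction: it leaves XML entities such as `&amp;` escaped, so URLs with query strings may be requested in their escaped form. A minimal sketch of an alternative extraction step with Python's standard-library XML parser is shown here; the function name `extract_locs` and the sample sitemap are illustrative assumptions, not part of any tool mentioned in this article.

```python
from xml.etree import ElementTree as ET

# Namespace defined by the sitemaps.org protocol
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def extract_locs(xml_text):
    """Return every <loc> value in a sitemap or sitemap index, with entities decoded."""
    root = ET.fromstring(xml_text)
    return [el.text.strip() for el in root.iter(SITEMAP_NS + "loc") if el.text]


# Hypothetical sample; a real run would pass the downloaded sitemap body instead
SAMPLE = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc></url>
  <url><loc>https://example.com/search?q=a&amp;page=2</loc></url>
</urlset>"""

if __name__ == "__main__":
    for loc in extract_locs(SAMPLE):
        print(loc)
```

Because the parser decodes entities, the second URL comes back with a plain `&`, which is the form a crawler would actually request.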
| Status | Issue | Action |
|---|---|---|
| 404 | URL doesn't exist | Remove from sitemap |
| 301/302 | URL redirected | Replace with final destination |
| 410 | Gone permanently | Remove from sitemap |
| 500 | Server error | Fix server, retry validation |
| 200 + noindex | Page returns OK but has noindex meta | Remove from sitemap |
| 200 + blocked | Page exists but robots.txt blocks crawl | Remove from sitemap OR remove block |
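The table can also be expressed as a small lookup so that an audit script reports an action rather than a bare status code. This is only a sketch: the function name `sitemap_action`, its flags, and the returned strings are assumptions made for illustration, and any status the table doesn't list falls through to "investigate".

```python
def sitemap_action(status, noindex=False, robots_blocked=False):
    """Map a URL's observed state to the action listed in the table above."""
    if status in (404, 410):
        return "remove from sitemap"
    if status in (301, 302):
        return "replace with final destination"
    if 500 <= status < 600:
        return "fix server, retry validation"
    if status == 200:
        if noindex:
            return "remove from sitemap"
        if robots_blocked:
            return "remove from sitemap or remove block"
        return "keep"
    # Statuses not covered by the table need a manual look
    return "investigate"


if __name__ == "__main__":
    for state in [(404,), (301,), (200, True), (200, False, True), (200,)]:
        print(state, "->", sitemap_action(*state))
```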
Patching the file isn't enough — it'll regenerate dirty next time. Fix the generator.
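// Filter Yoast sitemap entries: exclude noindex, password-protected, and unpublished posts
add_filter('wpseo_sitemap_entry', function ($entry, $type, $object) {
    // Terms and users pass through this filter too; only inspect posts
    if ($type !== 'post' || empty($object->ID)) {
        return $entry;
    }
    $post_id = (int) $object->ID;
    // Exclude drafts and other unpublished posts, and password-protected posts
    if (get_post_status($post_id) !== 'publish' || get_post_field('post_password', $post_id) !== '') {
        return false;
    }
    // Exclude noindex pages. YoastSEO() is a global function, so test it with function_exists()
    if (function_exists('YoastSEO')) {
        $meta = YoastSEO()->meta->for_post($post_id);
        $robots = $meta ? $meta->robots : [];
        if (isset($robots['index']) && $robots['index'] === 'noindex') {
            return false;
        }
    }
    return $entry;
}, 10, 3);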
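import re
import requests
from xml.etree import ElementTree as ET

# Matches a robots/googlebot meta tag containing noindex (assumes name= precedes content=)
META_NOINDEX = re.compile(
    r'<meta[^>]+name=["\'](?:robots|googlebot)["\'][^>]*content=["\'][^"\']*noindex',
    re.IGNORECASE,
)

def is_indexable(url):
    try:
        # One GET, redirects not followed: the URL must return 200 directly
        r = requests.get(url, allow_redirects=False, timeout=5)
    except requests.RequestException:
        return False
    if r.status_code != 200:
        return False
    # noindex can arrive as an HTTP header...
    if 'noindex' in r.headers.get('X-Robots-Tag', '').lower():
        return False
    # ...or as a meta robots tag; a plain substring match on the body would
    # also reject pages that merely mention the word "noindex"
    return META_NOINDEX.search(r.text) is None

def generate_sitemap(candidate_urls):
    urlset = ET.Element('urlset', xmlns='http://www.sitemaps.org/schemas/sitemap/0.9')
    for url in candidate_urls:
        if is_indexable(url):
            u = ET.SubElement(urlset, 'url')
            ET.SubElement(u, 'loc').text = url
    return ET.tostring(urlset, encoding='unicode')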
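The `is_indexable` check above does not cover the "200 + blocked" row of the table, where robots.txt disallows the URL. A rough sketch of that extra check using the standard-library `urllib.robotparser` follows. The helper names `build_parser` and `crawlable` and the sample rules are hypothetical, and the standard-library parser may not match Google's handling of wildcard patterns, so treat the result as an approximation.

```python
from urllib.robotparser import RobotFileParser


def build_parser(robots_txt):
    """Parse the text of a robots.txt file that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def crawlable(parser, url, user_agent="Googlebot"):
    """Return True if the parsed rules allow user_agent to fetch url."""
    return parser.can_fetch(user_agent, url)


# Hypothetical rules for illustration only
ROBOTS = """User-agent: *
Disallow: /private/
"""

if __name__ == "__main__":
    rules = build_parser(ROBOTS)
    for candidate in ("https://example.com/blog/post", "https://example.com/private/page"):
        print(candidate, crawlable(rules, candidate))
```

Parsing robots.txt once and reusing the parser for every candidate URL avoids refetching the file for each check.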
// app/sitemap.ts
import type { MetadataRoute } from 'next';

// getPosts is the project's own data-access helper (not shown here)
export default async function sitemap(): Promise<MetadataRoute.Sitemap> {
  // Filter at query time: only published posts with indexable=true
  const posts = await getPosts({ status: 'published', indexable: true });
  return posts.map(post => ({
    url: `https://example.com/blog/${post.slug}`,
    lastModified: post.updatedAt,
  }));
}
name: Validate Sitemap
on: [push, deployment_status]
jobs:
  validate:
    runs-on: ubuntu-latest
    steps:
      - name: Extract URLs from sitemap
        run: |
          curl -s https://example.com/sitemap.xml | \
            grep -oE '<loc>[^<]+</loc>' | \
            sed 's/<[^>]*>//g' > urls.txt
          echo "URLs to validate: $(wc -l < urls.txt)"
          # An empty list means the fetch or parse failed; don't let it pass silently
          [ -s urls.txt ] || { echo "No URLs extracted"; exit 1; }
      - name: Check each URL returns 200
        run: |
          failed=0
          while IFS= read -r url; do
            status=$(curl -s -o /dev/null -w "%{http_code}" "$url")
            if [ "$status" != "200" ]; then
              echo "FAIL: $status $url"
              failed=$((failed+1))
            fi
          done < urls.txt
          if [ "$failed" -gt 0 ]; then
            echo "$failed URLs failed validation"
            exit 1
          fi
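A single transient 5xx can fail the whole workflow, and the table's advice for a 500 is to retry validation. One possible way to build that in is a retry wrapper that retries only on server errors. In this sketch the name `status_with_retry` and its parameters are assumptions, and `fetch_status` stands for whatever callable performs the real request, for example a thin wrapper around `requests.get`.

```python
import time


def status_with_retry(url, fetch_status, retries=2, delay=1.0, sleep=time.sleep):
    """Return the final status code for url, retrying only on 5xx responses.

    fetch_status is any callable that maps a URL to an integer status code.
    """
    status = fetch_status(url)
    attempts = 0
    while 500 <= status < 600 and attempts < retries:
        # Wait a little longer before each successive retry
        sleep(delay * (attempts + 1))
        status = fetch_status(url)
        attempts += 1
    return status


if __name__ == "__main__":
    # Fake fetcher that fails once and then succeeds, to show the behavior
    responses = iter([503, 200])
    print(status_with_retry("https://example.com/", lambda _url: next(responses), delay=0))
```

A 404 or a redirect is returned immediately without a retry, since repeating the request will not change a deliberate response.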
#!/bin/bash
# .git/hooks/pre-commit
# Run sitemap validation only when a staged file has "sitemap" in its path
if git diff --cached --name-only | grep -q "sitemap"; then
  echo "Sitemap changed: running validation"
  ./scripts/validate-sitemap.sh || exit 1
fi
The CMS regenerates the sitemap once a day, so pages deleted in the morning still appear until the next run. Solution: invalidate the sitemap cache whenever a page is deleted or unpublished.
A page is marked as deleted in the database (a soft delete) but still matches the sitemap query. Fix: add WHERE deleted_at IS NULL to the query.
A page's slug changes from /old-slug to /new-slug, but the old URL stays in the sitemap. Fix: add a 301 from the old URL to the new one AND regenerate the sitemap with the new URL.
The sitemap includes drafts that aren't publicly accessible. Fix: filter with WHERE status = 'published'.
The sitemap on www.example.com includes URLs from blog.example.com that have since moved. Give each subdomain its own sitemap.
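The two query-level fixes (soft deletes and drafts) can be combined in a single WHERE clause. The sketch below runs that query against an in-memory SQLite database; the `pages` table, its columns, and the sample rows are hypothetical stand-ins for whatever schema the CMS actually uses.

```python
import sqlite3

# Only published rows that have not been soft-deleted belong in the sitemap
SITEMAP_QUERY = """
SELECT slug FROM pages
WHERE status = 'published'
  AND deleted_at IS NULL
ORDER BY slug
"""


def sitemap_slugs(conn):
    """Return the slugs that should appear in the sitemap."""
    return [row[0] for row in conn.execute(SITEMAP_QUERY)]


def demo():
    """Build a throwaway table with one live, one draft, and one soft-deleted page."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE pages (slug TEXT, status TEXT, deleted_at TEXT)")
    conn.executemany(
        "INSERT INTO pages VALUES (?, ?, ?)",
        [
            ("live-post", "published", None),
            ("draft-post", "draft", None),
            ("removed-post", "published", "2024-01-01"),
        ],
    )
    return sitemap_slugs(conn)


if __name__ == "__main__":
    print(demo())
```

Only `live-post` survives both conditions, so the draft and the soft-deleted page never reach the generator.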
After regenerating, prompt Google to re-fetch the sitemap:
Search Console → Sitemaps → Click your sitemap entry → "Submit" again to request a re-fetch → Status typically updates within hours
Verify sitemap contains only live, indexable URLs.