Sitemap files have hard limits: 50,000 URLs OR 50MB uncompressed, whichever comes first. Exceed either and Google rejects the file entirely or processes only the first valid portion. Most sites never hit these limits, but ecommerce stores, large publishers, and forums regularly do. The fix is the sitemap index pattern — one master file references multiple smaller child sitemaps, each within limits. This guide covers the split strategies, generation patterns, and gzip compression.
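# Count URLs (grep -o counts every <loc>, even when the XML is minified onto one line)
curl -s https://example.com/sitemap.xml | grep -o "<loc>" | wc -l
# If > 50,000 you need to split
# If close to 50k, split sooner rather than later

# Check file size
curl -sI https://example.com/sitemap.xml | grep -i content-length
# Compare to 52428800 bytes (50MB)
# If close to 50MB, split now; gzip does not help here because the limit applies to the uncompressed size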
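The same two checks can be scripted. The following is a minimal sketch, assuming the sitemap has already been downloaded into a bytes object; the function name `check_sitemap_limits` and the 90% early-warning threshold are illustrative choices, not part of the sitemap protocol.

```python
import xml.etree.ElementTree as ET

MAX_URLS = 50_000
MAX_BYTES = 50 * 1024 * 1024  # 52,428,800 bytes, measured uncompressed
NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def check_sitemap_limits(xml_bytes, warn_ratio=0.9):
    """Return URL count, byte size, and a status for one uncompressed sitemap.

    warn_ratio is a hypothetical early-warning threshold (90% of either limit).
    """
    root = ET.fromstring(xml_bytes)
    url_count = len(root.findall(f"{NS}url"))  # counts <url> elements, not lines
    size = len(xml_bytes)
    if url_count > MAX_URLS or size > MAX_BYTES:
        status = "split"
    elif url_count > MAX_URLS * warn_ratio or size > MAX_BYTES * warn_ratio:
        status = "split soon"
    else:
        status = "ok"
    return {"urls": url_count, "bytes": size, "status": status}
```

Because it parses the XML rather than matching text, the count is unaffected by how the file is formatted.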
Single index file references all child sitemaps:
<?xml version="1.0" encoding="UTF-8"?>
<!-- sitemap.xml (the index) -->
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap>
    <loc>https://example.com/sitemap-pages.xml</loc>
    <lastmod>2024-01-15</lastmod>
  </sitemap>
  <sitemap>
    <loc>https://example.com/sitemap-posts.xml</loc>
    <lastmod>2024-01-20</lastmod>
  </sitemap>
  <sitemap>
    <loc>https://example.com/sitemap-products-1.xml</loc>
    <lastmod>2024-01-20</lastmod>
  </sitemap>
  <sitemap>
    <loc>https://example.com/sitemap-products-2.xml</loc>
    <lastmod>2024-01-20</lastmod>
  </sitemap>
</sitemapindex>
Child sitemap files have the same format as a standalone sitemap:
<?xml version="1.0" encoding="UTF-8"?>
<!-- sitemap-pages.xml -->
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://example.com/about/</loc>
    <lastmod>2024-01-15</lastmod>
  </url>
  <!-- ... up to 50,000 URLs ... -->
</urlset>
Most natural for most sites:
sitemap-pages.xml: static pages
sitemap-posts.xml: blog posts
sitemap-products.xml: products (split further if needed)
sitemap-categories.xml: category pages
sitemap-tags.xml: tag pages (optional, often noindex)
Useful for news sites and high-volume blogs:
sitemap-posts-2024.xml
sitemap-posts-2023.xml
sitemap-posts-2022.xml
sitemap-archive.xml
Older content rarely updates, so older sitemaps don't need frequent regeneration.
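A date-based split like this could be sketched as follows. It assumes each post is a dict with a `loc` and a `published` date in YYYY-MM-DD form; the function name, the `published` key, and the `archive_before` cutoff are all hypothetical. Grouping by publication date rather than last-modified date keeps a post in the same file when it is edited.

```python
from collections import defaultdict


def group_posts_by_year(posts, archive_before=2022):
    """Map sitemap filenames to URL lists, one file per publication year.

    archive_before is an illustrative cutoff: posts published in earlier
    years all go into sitemap-archive.xml.
    """
    groups = defaultdict(list)
    for post in posts:
        year = int(post["published"][:4])
        if year < archive_before:
            name = "sitemap-archive.xml"
        else:
            name = f"sitemap-posts-{year}.xml"
        groups[name].append(post["loc"])
    return dict(groups)
```

With this layout, only the current year's file needs regenerating on each publish.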
sitemap-products-electronics.xml
sitemap-products-clothing.xml
sitemap-products-home.xml
sitemap-stores.xml
sitemap-brands.xml
When one logical group exceeds 50k:
sitemap-products-1.xml (URLs 1-50,000)
sitemap-products-2.xml (URLs 50,001-100,000)
sitemap-products-3.xml (URLs 100,001-150,000)
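One way to combine logical groups with numeric overflow is to number only the groups that exceed the limit. This is a rough sketch under that assumption; `plan_sitemap_files` and its input shape (a dict of group name to URL list) are illustrative, not taken from any particular tool.

```python
def plan_sitemap_files(groups, limit=50_000):
    """Map filenames to URL lists, numbering only groups that exceed the limit.

    groups: dict of group name -> list of URLs, e.g. {"products": [...]}.
    """
    plan = {}
    for name, urls in groups.items():
        if len(urls) <= limit:
            # Small group: a single unnumbered file.
            plan[f"sitemap-{name}.xml"] = urls
            continue
        # Large group: consecutive slices of at most `limit` URLs each.
        for n, start in enumerate(range(0, len(urls), limit), start=1):
            plan[f"sitemap-{name}-{n}.xml"] = urls[start:start + limit]
    return plan
```

Passing a lower `limit` (for example 45,000) leaves headroom so a growing group does not cross the hard limit between regenerations.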
Yoast SEO auto-generates a sitemap index at /sitemap_index.xml. It splits by content type by default, then splits each type further at 1,000 URLs.
// Customise URLs per sitemap
add_filter('wpseo_sitemap_entries_per_page', function() {
    return 2000; // Default 1000, can go up to 50000
});

// Exclude content types
add_filter('wpseo_sitemap_exclude_post_type', function($excluded, $type) {
    if ($type === 'attachment') return true;
    return $excluded;
}, 10, 2);
Similar pattern at /sitemap_index.xml. Configurable per content type in plugin settings.
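// app/sitemap.xml/route.ts: sitemap index
// app/sitemap.ts (MetadataRoute.Sitemap) renders a <urlset>, not a
// <sitemapindex>, so the index is served from a route handler instead.
const children = ['sitemap-pages.xml', 'sitemap-posts.xml'];

export async function GET() {
  const lastmod = new Date().toISOString();
  const entries = children
    .map(
      (file) =>
        `  <sitemap><loc>https://example.com/${file}</loc><lastmod>${lastmod}</lastmod></sitemap>`
    )
    .join('\n');
  const xml =
    '<?xml version="1.0" encoding="UTF-8"?>\n' +
    '<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n' +
    `${entries}\n` +
    '</sitemapindex>';
  return new Response(xml, {
    headers: { 'Content-Type': 'application/xml' },
  });
}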
// app/sitemap-pages.xml/route.ts: generate child sitemap
// generateSitemap and getPagesForSitemap are project helpers; the shared import path is an assumption
import { generateSitemap, getPagesForSitemap } from '@/lib/sitemap';

export async function GET() {
  const pages = await getPagesForSitemap();
  return new Response(generateSitemap(pages), {
    headers: { 'Content-Type': 'application/xml' },
  });
}
// @astrojs/sitemap auto-splits at 45,000 URLs per file
import { defineConfig } from 'astro/config';
import sitemap from '@astrojs/sitemap';

export default defineConfig({
  site: 'https://example.com',
  integrations: [
    sitemap({
      entryLimit: 45000, // optional, default 45000
      filter: (page) => !page.includes('/admin/'),
    }),
  ],
});
from datetime import datetime, timezone
from xml.etree.ElementTree import Element, SubElement, tostring

SITEMAP_NS = 'http://www.sitemaps.org/schemas/sitemap/0.9'
XML_DECLARATION = b'<?xml version="1.0" encoding="UTF-8"?>\n'


def generate_sitemap_chunk(urls, output_path):
    """Write one child sitemap containing the given URL dicts."""
    urlset = Element('urlset', xmlns=SITEMAP_NS)
    for url in urls:
        u = SubElement(urlset, 'url')
        SubElement(u, 'loc').text = url['loc']
        if 'lastmod' in url:
            SubElement(u, 'lastmod').text = url['lastmod']
    with open(output_path, 'wb') as f:
        f.write(XML_DECLARATION)
        f.write(tostring(urlset))


def generate_sitemaps(all_urls, base_url, chunk_size=45000):
    """Split all_urls into child sitemaps under public/ and write the index."""
    chunks = [all_urls[i:i + chunk_size] for i in range(0, len(all_urls), chunk_size)]
    sitemap_urls = []
    for i, chunk in enumerate(chunks):
        filename = f'sitemap-{i + 1}.xml'
        generate_sitemap_chunk(chunk, f'public/{filename}')
        sitemap_urls.append(f'{base_url}/{filename}')

    # Generate index
    # Use a plain UTC date: a naive datetime.now().isoformat() has no timezone,
    # which the W3C Datetime format requires whenever a time is included.
    today = datetime.now(timezone.utc).strftime('%Y-%m-%d')
    sitemapindex = Element('sitemapindex', xmlns=SITEMAP_NS)
    for url in sitemap_urls:
        s = SubElement(sitemapindex, 'sitemap')
        SubElement(s, 'loc').text = url
        SubElement(s, 'lastmod').text = today
    with open('public/sitemap.xml', 'wb') as f:
        f.write(XML_DECLARATION)
        f.write(tostring(sitemapindex))
XML compresses extremely well: sitemap files often shrink 10x with gzip. Google and Bing both support .xml.gz. Compression saves bandwidth only; the 50MB limit still applies to the uncompressed file.
# Generate compressed version
gzip -k sitemap.xml
# Creates sitemap.xml.gz, keeps original
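When sitemaps are produced by a script, compression can happen in the same step. The sketch below mirrors `gzip -k` using only the standard library; the function name `gzip_sitemap` and the compression level are illustrative assumptions.

```python
import gzip
import shutil
from pathlib import Path


def gzip_sitemap(path):
    """Write <path>.gz next to the original and return (raw, compressed) sizes.

    The original file is left in place, like `gzip -k`.
    """
    src = Path(path)
    dst = src.with_name(src.name + ".gz")
    with open(src, "rb") as f_in, gzip.open(dst, "wb", compresslevel=6) as f_out:
        shutil.copyfileobj(f_in, f_out)
    return src.stat().st_size, dst.stat().st_size
```

Comparing the two returned sizes gives the actual compression ratio for a given file, which is worth checking rather than assuming.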
Reference the gzipped version in robots.txt:
Sitemap: https://example.com/sitemap.xml.gz
server {
    gzip on;
    gzip_types application/xml text/xml;
    gzip_min_length 1000;
    gzip_comp_level 6;
}
The client (browser or crawler) sends Accept-Encoding: gzip, and nginx compresses the response on the fly. The original file stays uncompressed on disk.
Submit the index URL (https://example.com/sitemap.xml or /sitemap_index.xml). Google discovers child sitemaps automatically.
Submit only the index. Google reads the index and fetches children automatically. Submitting children separately creates redundant management work.
Child sitemap has lastmod 2024-01-20. Index references that child with lastmod 2023-12-01. Crawlers use the older date and may not re-fetch the child. Always update both.
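One way to keep the two dates in sync is to derive the index entry's lastmod from the child's own URLs instead of setting it by hand. This is a minimal sketch, assuming each URL is a dict with an optional `lastmod` in YYYY-MM-DD form (so string comparison matches date order); `index_entry` is a hypothetical helper.

```python
def index_entry(child_loc, child_urls):
    """Build one index entry whose lastmod is the newest lastmod in the child.

    If no URL in the child carries a lastmod, the entry omits it rather
    than inventing a date.
    """
    dates = [u["lastmod"] for u in child_urls if u.get("lastmod")]
    entry = {"loc": child_loc}
    if dates:
        entry["lastmod"] = max(dates)
    return entry
```

Generating the index from the same data as the children means the index date cannot lag behind a child's newest URL.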
New child sitemap deployed but not added to index. Google never finds it. Always regenerate the index after child changes.
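A deploy-time check can catch orphaned children before crawlers miss them. The sketch below assumes the index and its children sit in one directory and that children match a `sitemap-*.xml` filename pattern; both the function name and the pattern are illustrative.

```python
import xml.etree.ElementTree as ET
from pathlib import Path

NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def find_orphan_sitemaps(index_path, sitemap_dir, pattern="sitemap-*.xml"):
    """Return child sitemap filenames on disk that the index does not reference.

    Files are matched to index entries by the last path segment of each <loc>.
    """
    root = ET.parse(index_path).getroot()
    referenced = {
        loc.text.strip().rsplit("/", 1)[-1]
        for loc in root.iter(f"{NS}loc")
        if loc.text
    }
    on_disk = {p.name for p in Path(sitemap_dir).glob(pattern)}
    return sorted(on_disk - referenced)
```

A non-empty result means at least one child exists that the index never mentions.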
One file is either a sitemap (urlset root) or an index (sitemapindex root). Can't have urlset elements in a sitemapindex root or vice versa.
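This rule is easy to check mechanically. The following sketch parses a file, looks at its root element, and rejects any direct child that belongs to the other kind of file; `classify_sitemap` is a hypothetical name, and the check covers only the top level of the document.

```python
import xml.etree.ElementTree as ET

NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"
# Each valid root element and the only child element it may contain.
ALLOWED_CHILD = {f"{NS}urlset": f"{NS}url", f"{NS}sitemapindex": f"{NS}sitemap"}


def _local(tag):
    """Strip the sitemap namespace prefix for readable messages."""
    return tag.replace(NS, "")


def classify_sitemap(xml_bytes):
    """Return 'urlset' or 'sitemapindex'; raise ValueError on a mixed or unknown file."""
    root = ET.fromstring(xml_bytes)
    if root.tag not in ALLOWED_CHILD:
        raise ValueError(f"unexpected root element: {_local(root.tag)}")
    expected = ALLOWED_CHILD[root.tag]
    for child in root:
        if child.tag != expected:
            raise ValueError(
                f"unexpected <{_local(child.tag)}> inside <{_local(root.tag)}>"
            )
    return _local(root.tag)
```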
# stat -c%s is GNU stat; on macOS/BSD use stat -f%z
for f in /var/www/html/sitemap-*.xml; do
  size=$(stat -c%s "$f")
  count=$(grep -o "<loc>" "$f" | wc -l)
  echo "$count URLs, $((size/1024)) KB - $f"
done
# Each line should show count <= 50000, size <= 51200 KB
Verify size warnings cleared and index loads cleanly.