Scaling with Sitemap Indexes

The sitemap framework handles large datasets within the limits of the Sitemaps protocol, which caps a single sitemap file at 50,000 URLs and 50 MB uncompressed. To scale beyond one file, the framework paginates each sitemap automatically and lists the resulting pages in a sitemap index.

The Pagination Mechanism

At the core of the scaling strategy is the Sitemap class, found in django/contrib/sitemaps/__init__.py. It defines a class-level limit attribute that defaults to 50,000, the maximum number of URLs per file allowed by the Sitemaps protocol. Subclasses can lower this value by overriding the attribute.

class Sitemap:
    # This limit is defined by Google. See the index documentation at
    # https://www.sitemaps.org/protocol.html#index.
    limit = 50000

The class uses django.core.paginator.Paginator to split the results of the items() method, wrapped by the internal _items() helper, into chunks of at most limit entries. The paginator property builds a new Paginator on each access:

@property
def paginator(self):
    return paginator.Paginator(self._items(), self.limit)

When a specific page is requested through the sitemap view (e.g., /sitemap.xml?p=2), the get_urls method builds URLs only for the items on that page. For lazily evaluated sources such as QuerySets, the paginator slices the source rather than loading it whole, which keeps memory use bounded even when the underlying dataset contains millions of records.
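The slicing idea can be modelled without Django. The following is a minimal sketch, not Django's Paginator: the helper name page_slice is hypothetical, and it only mirrors the rule that page p covers the items from index (p - 1) * limit up to, but not including, p * limit.

```python
import math


def page_slice(items, page, limit=50000):
    """Return the items that page `page` (1-based) would cover.

    Hypothetical helper for illustration only; Django's Paginator
    performs the equivalent slicing internally.
    """
    num_pages = max(1, math.ceil(len(items) / limit))
    if not 1 <= page <= num_pages:
        raise ValueError(f"page {page} is out of range 1..{num_pages}")
    start = (page - 1) * limit
    return items[start:start + limit]


# Slicing a range yields another range, so no 120,000-element list is built.
items = range(120_000)
second_page = page_slice(items, 2)
print(second_page[0], second_page[-1], len(second_page))  # 50000 99999 50000
```

Because the function only slices, its cost depends on the page size rather than on the size of the whole dataset, provided the source supports lazy slicing.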

Sitemap Indexing with SitemapIndexItem

When a sitemap exceeds the 50,000-item limit, it is split across several pages, and those pages must be listed in a sitemap index file. The index view in django/contrib/sitemaps/views.py generates this master list. It uses the SitemapIndexItem dataclass to represent each individual sitemap file or page.

@dataclass
class SitemapIndexItem:
    location: str
    last_mod: bool = None

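The index view iterates through all registered sitemap sections. It always emits one SitemapIndexItem for a section's first page. If the section contains more items than the limit, the view emits one more entry per additional page, appending a page parameter (?p=2, ?p=3, etc.) to the URL. Each entry's last_mod receives site_lastmod, the section's latest modification time or None, despite the bool annotation shown above: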

# From django/contrib/sitemaps/views.py
for section, site in sitemaps.items():
    # ... logic to determine absolute_url ...
    sites.append(SitemapIndexItem(absolute_url, site_lastmod))

    # Add links to all additional pages of the sitemap.
    for page in range(2, site.paginator.num_pages + 1):
        sites.append(
            SitemapIndexItem("%s?p=%s" % (absolute_url, page), site_lastmod)
        )
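To see which locations this loop produces for a given item count, the logic can be reproduced in isolation. This is a sketch under stated assumptions: IndexEntry and index_entries are hypothetical stand-ins rather than Django's classes or functions, the example.com URL is made up, and the page count is computed with a plain ceiling division instead of a real Paginator.

```python
import math
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class IndexEntry:
    """Stand-in for SitemapIndexItem (hypothetical, not Django's class)."""

    location: str
    last_mod: Optional[datetime] = None


def index_entries(absolute_url, item_count, limit=50000, last_mod=None):
    """Build one entry per sitemap page, mirroring the loop in the index view."""
    num_pages = max(1, math.ceil(item_count / limit))
    # The first page is always listed without a page parameter.
    entries = [IndexEntry(absolute_url, last_mod)]
    # Pages 2..num_pages get an explicit ?p=N suffix.
    for page in range(2, num_pages + 1):
        entries.append(IndexEntry(f"{absolute_url}?p={page}", last_mod))
    return entries


entries = index_entries("https://example.com/sitemap-large.xml", 120_000)
for entry in entries:
    print(entry.location)
```

With 120,000 items and the default limit, the sketch lists three locations: the bare URL, then the same URL with ?p=2 and ?p=3.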

Scaling with Internationalization (i18n)

The framework's approach to internationalization significantly impacts scaling. When i18n is set to True on a Sitemap class, the framework generates a URL for every enabled language for every item. This is handled in the internal _items() method:

def _items(self):
    if self.i18n:
        items = [
            (item, lang_code)
            for item in self.items()
            for lang_code in self.get_languages_for_item(item)
        ]
        return items
    return self.items()

Because the total count is the product of items and languages, a dataset of 10,000 items translated into 6 languages will result in 60,000 entries, automatically triggering the creation of a second sitemap page in the index.
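The arithmetic is easy to check directly. The helper below is a hypothetical illustration, not part of Django, and it assumes every item is available in every language (i.e. no per-item filtering of languages).

```python
import math


def sitemap_page_count(item_count, language_count, limit=50000):
    """Return (entries, pages) when each item is emitted once per language."""
    entries = item_count * language_count
    pages = max(1, math.ceil(entries / limit))
    return entries, pages


entries, pages = sitemap_page_count(10_000, 6)
print(entries, pages)  # 60000 2
```

Dropping to 5 languages gives exactly 50,000 entries, which still fits on a single page.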

Implementation in URL Configurations

To support this scaling, developers must wire both the index and sitemap views. The index view builds the URLs of the individual pages by reversing the sitemap view, which by default it looks up under the URL name django.contrib.sitemaps.views.sitemap.

A typical implementation for a large dataset, modeled on the URL configurations in the project's test suite (e.g., tests/sitemaps_tests/urls/http.py), looks like this:

from django.contrib.sitemaps import Sitemap, views
from django.urls import path


# A sitemap that forces pagination
class LargeSitemap(Sitemap):
    limit = 10  # Lowered for demonstration

    def items(self):
        return range(100)

    def location(self, item):
        # Integers have no get_absolute_url(), so build the path explicitly.
        # The /items/<n>/ pattern is illustrative only.
        return f'/items/{item}/'


sitemaps = {
    'large': LargeSitemap,
}

urlpatterns = [
    # The master index file
    path('sitemap.xml', views.index, {'sitemaps': sitemaps}),

    # The individual sitemap pages; the index view reverses this name
    path('sitemap-<section>.xml', views.sitemap, {'sitemaps': sitemaps},
         name='django.contrib.sitemaps.views.sitemap'),
]

In this configuration, requesting sitemap.xml returns an index containing 10 links: 100 items at 10 per page yield 10 pages. The links are absolute URLs ending in sitemap-large.xml, sitemap-large.xml?p=2, and so on up to ?p=10. Each is handled by the sitemap view, which retrieves the matching slice of data through the LargeSitemap paginator.