
Blocked for indexing but in sitemap.xml

A sitemap.xml file is essentially a map of your website, designed to make it easy for search engines to navigate and index your site. It sits in your public_html folder (or site root) and gives search engine crawlers hints about which pages should be visited, their relative priority, and how often they should be revisited.

This drastically speeds up the indexing of important pages and lets search crawlers spend their crawl time on the pages that matter most to you and your users.

Creating a sitemap.xml is not always necessary, but it is always recommended, especially for large sites with thousands of pages. The bigger the site, the more important it is to make sure search engine crawlers spend their time on high-value pages with deep content and commercial intent, not on side pages that offer thin value.

As a rule of thumb, when software and CMSs automatically generate a sitemap.xml file, they include every available page. A typical website owner is unlikely to be aware of this: even if they have set noindex on certain pages, the automatically generated sitemap probably still lists those pages and wastes valuable crawl budget.

It is highly recommended to use plugins, custom software, or sitemap generators to control which URLs appear in your sitemap, which URLs are excluded, what priority each one carries, and how often it should be revisited.
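
For illustration, here is a minimal sketch, in Python, of how a controlled sitemap of this kind could be generated. The element names (loc, lastmod, changefreq, priority) come from the standard sitemap protocol; the example URLs, crawl hints, and output path are hypothetical placeholders rather than a recommendation for any particular CMS or plugin.

    # Minimal sketch: build a sitemap.xml that lists only the URLs you want
    # crawled, together with optional crawl hints. The URLs and hints below
    # are hypothetical examples.
    from datetime import date
    from xml.etree.ElementTree import Element, SubElement, ElementTree

    pages = [
        {"loc": "https://www.example.com/",          "changefreq": "daily",   "priority": "1.0"},
        {"loc": "https://www.example.com/products/", "changefreq": "weekly",  "priority": "0.8"},
        {"loc": "https://www.example.com/blog/",     "changefreq": "monthly", "priority": "0.5"},
        # Thin or noindexed pages (tag archives, internal search results,
        # and so on) are simply left out of the list.
    ]

    urlset = Element("urlset", xmlns="http://www.sitemaps.org/schemas/sitemap/0.9")
    for page in pages:
        url = SubElement(urlset, "url")
        SubElement(url, "loc").text = page["loc"]
        SubElement(url, "lastmod").text = date.today().isoformat()
        SubElement(url, "changefreq").text = page["changefreq"]
        SubElement(url, "priority").text = page["priority"]

    ElementTree(urlset).write("sitemap.xml", encoding="utf-8", xml_declaration=True)

Whatever tool you use, the key point is the same: the sitemap should list only the indexable pages you actually want crawled, with realistic priority and revisit hints.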

Sitemap errors found by Labrika

Attention! The sitemap error report is only available if permissions to scan the whole website are configured correctly. Otherwise, Labrika can only view the pages specifically listed in sitemap.xml, rather than crawling every page on the website and cross-comparing them with the pages listed in the sitemap.

Labrika's sitemap analysis helps find the following types of errors (a simple cross-check along these lines is sketched after the list):

  • Pages that exist in the sitemap but are not accessible for indexing.

  • Pages that exist in the sitemap but have a noindex tag.

  • Pages that don’t exist in the sitemap but are indexable.
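
For reference, here is a minimal sketch, in Python, of the kind of cross-check this report automates. It is not Labrika's implementation: it only fetches the sitemap, requests each listed URL, and looks for obvious noindex signals; the sitemap URL is a hypothetical placeholder, and a real audit would also evaluate robots.txt rules, canonical tags, and redirects.

    # Minimal sketch: flag sitemap URLs that are unreachable or marked noindex.
    import urllib.request
    import xml.etree.ElementTree as ET

    SITEMAP_URL = "https://www.example.com/sitemap.xml"  # hypothetical placeholder
    NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

    with urllib.request.urlopen(SITEMAP_URL) as response:
        tree = ET.parse(response)

    for loc in tree.findall(".//sm:loc", NS):
        url = loc.text.strip()
        try:
            with urllib.request.urlopen(url) as page:
                robots_header = page.headers.get("X-Robots-Tag", "")
                body = page.read(200_000).decode("utf-8", errors="ignore")
        except Exception as error:  # 4xx/5xx responses, timeouts, broken links
            print(f"{url}: listed in the sitemap but not accessible ({error})")
            continue
        # Very rough noindex detection; a real crawler would parse the HTML
        # and honour robots.txt properly.
        if "noindex" in robots_header.lower() or 'content="noindex' in body.lower():
            print(f"{url}: listed in the sitemap but marked noindex")

Finding the opposite case, pages that are indexable but missing from the sitemap, requires a full crawl of the site's internal links compared against the sitemap, which is why Labrika needs permission to scan the whole website (see the note above).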

Please note: different search engines process sitemap rules in different ways. In most cases, Google will only index pages that can be reached through normal crawling without a sitemap, that is, pages reachable via internal links within the crawl time and crawl depth allotted to your site that day. Google will not use your sitemap.xml file to decide which links to crawl; instead, it treats the sitemap as a guide for how often to revisit the pages listed in it.