Index bloat occurs when a website is 'bloated' with a large number of low-value pages. When search engines index these pages, they can negatively affect your site's performance.
The main issue is that, with index bloat, the low-value pages outweigh the high-value ones, so search engines come to view your site as low value overall. Even the good effort you put into your best pages can be undermined by the sheer volume of weak ones.
Your primary aim with SEO is to ensure that a search engine's crawler is able to:
- discover your important pages
- crawl them efficiently, without wasting crawl budget
- index the pages that genuinely offer value to users
When a website has a high page count but many of those pages are low quality, it wastes your valuable crawl budget, which in turn can degrade your site's overall ranking in the search engines. For this reason, it is something worth monitoring regularly.
E-commerce sites are among the main sufferers of index bloat, as they typically have many product pages. Even after products are no longer available, their URLs and pages may still be indexed. Product search and filtering features can also generate hundreds or thousands of 'bloated' pages. There are many other causes of index bloat too, such as:
- tag, category and archive pages
- internal site-search result pages
- URLs with tracking or session parameters
- thin or duplicated content
- staging or test pages left accessible to crawlers
Essentially, every page listed by a search engine that does not give value to the user is index bloat. Some cannot be avoided, but the aim should be to minimize them as much as possible.
You really have two options:
1. Remove the low-value pages from your site altogether.
2. Keep them, but stop search engines from indexing them.
As simple as this sounds, it may take some time to do, and it may also take a while for the positive results of your work to show. Be assured, though, that it will pay off over time. To establish which pages need to be removed, analyse the indexation of your website, making a list of the important pages that must remain indexed, and then compare it with the pages Google has actually indexed. The excess is the index bloat you want to get rid of.
You can start by targeting the low-hanging fruit: pages in your XML sitemap that clearly shouldn't be there. Remove them from your sitemap, and/or delete them if they no longer serve any purpose.
You can identify other offending pages in several ways:
- the index coverage (Pages) report in Google Search Console
- a site: search on Google (e.g. site:example.com), which shows what is actually indexed
- a crawl of your site with a tool such as Screaming Frog, compared against your sitemap
- your server logs, which show which URLs the crawlers are requesting
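If you export the list of indexed URLs (for example from Google Search Console), a few lines of code can surface the excess for you. The snippet below is a minimal sketch only: it assumes a local sitemap.xml file and a one-column indexed_urls.csv export, and both file names are simply placeholders.

import csv
import xml.etree.ElementTree as ET

SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

def sitemap_urls(path):
    # Collect every <loc> entry from the XML sitemap.
    tree = ET.parse(path)
    return {loc.text.strip() for loc in tree.getroot().findall(".//sm:loc", SITEMAP_NS)}

def indexed_urls(path):
    # Read a one-column CSV of URLs (e.g. exported from Search Console),
    # skipping any header rows that don't look like URLs.
    with open(path, newline="") as f:
        return {row[0].strip() for row in csv.reader(f) if row and row[0].startswith("http")}

wanted = sitemap_urls("sitemap.xml")        # pages you want indexed
indexed = indexed_urls("indexed_urls.csv")  # pages actually indexed

# URLs that are indexed but missing from your sitemap are index bloat candidates.
bloat = sorted(indexed - wanted)
print(f"{len(bloat)} indexed URLs are not in the sitemap:")
for url in bloat:
    print(url)

Any URL it prints is a candidate for removal, or for a noindex directive, which we cover next.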
Whilst you can't physically prevent web crawlers from accessing a page, you can instruct them not to index it. Most search engines will obey this directive, but some may not, so it isn't a foolproof method.
If you have content that is truly confidential, you will need more advanced security measures to keep the crawlers (and everyone else) out. One of these is the .htaccess file, which can control who sees what in individual folders. However, this is a complex, technical process and not one to be undertaken by beginners!
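As a rough illustration only, and assuming an Apache server, the .htaccess directives below password-protect the folder they sit in. The AuthUserFile path and realm name are placeholders, and you would also need to create the matching .htpasswd credentials file (for example with Apache's htpasswd tool).

# Require a username and password (HTTP basic authentication) for this folder.
# Needs Apache's mod_auth_basic; the AuthUserFile path below is a placeholder.
AuthType Basic
AuthName "Restricted area"
AuthUserFile /full/path/to/.htpasswd
Require valid-user

Because the content now sits behind a login, crawlers simply receive an 'authentication required' response instead of the page.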
For ordinary, non-confidential pages, the simplest approach is to add a robots meta tag to the page's <head> section. This can be done like this:

<head>
  <meta name="robots" content="noindex">
</head>
You can add this tag manually, or via a plugin such as Yoast SEO on a WordPress site.
You can also block crawlers from specific pages or folders through your robots.txt file, for example:

User-agent: googlebot
Disallow: /testimonials/
Disallow: /checkout/
Disallow: /content-page1.htm/
Some sites also used an unofficial Noindex directive in robots.txt:

Noindex: /content-page1.htm/

Be aware, however, that Google no longer supports this directive, so don't rely on it; use the robots meta tag instead.
The primary objective of any search engine is to serve the best possible results to its users. To achieve this, they deploy significant resources to identify and discard pages (or whole websites) that don't fulfil their criteria.
This process is also continually improved and refined, which means that we, as SEO professionals and webmasters, must do our best to stay ahead of these issues.
This type of technical SEO issue should become an important part of any website's quality review. Ensure that crawlers are only seeing the best of your content!
Carrying out the fixes we described above is a key step in optimizing your SEO efforts.