August 4, 2021

What is index bloat and how to fix it

What is index bloat?

Index bloat refers to websites with a high page count that are 'bloated' with low-value pages. When these pages are indexed by the search engines, they negatively affect your site's search performance.

Why is index bloat bad for SEO?

The main issue is that with index bloat, low-value pages outnumber your high-value pages, so search engines may judge the site as a whole to be of low value. Even if you put good effort into your high-value pages, that work can be outweighed by the mass of low-value pages around them.

Your primary aim with SEO is to ensure that a search engine is able to:

  • Find the content you want it to find.
  • Rank it well in the search results.
  • Ignore content that you don't want indexed.

When a website has a high page count, but many of those pages are of low quality, it wastes your valuable crawl budget. This, of course, can then degrade the overall ranking of your site in the search engines. For this reason, it is an important element to keep an eye on regularly.

What are the causes of index bloat?

E-commerce sites are among the main sufferers of index bloat, as they typically have many product pages. Even after products are no longer available, their URLs and pages may still be indexed. Product search and filtering features can also generate hundreds or thousands of 'bloated' pages. There are many other causes of index bloat as well, such as:

  • Internal duplicate links and pagination.
  • Tracking URLs that include a query string at the end (a canonical tag can help here; see the sketch below).
  • Auto-generated user profiles.
  • Site development, migration and rebuilds also often leave behind useless test pages.
  • Blog websites frequently generate archive pages such as monthly archives, blog tags, category tags and so on. Over time these build up into substantial bloat content.
  • A poorly organized XML sitemap and internal linking. When a sitemap isn't properly thought out, crawl budget is wasted: after the crawler has worked through all of the pages in the sitemap, it will go on following internal links, resulting in a far higher crawled page count.
  • General low-value content pages, such as 'thank you' or testimonial pages. These are considered low-quality/thin content and shouldn't be indexed by search engine crawlers.

Essentially, every page listed by a search engine that does not give value to the user is index bloat. Some of it cannot be avoided, but the aim should be to minimize it as much as possible.
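
For the tracking URLs with query strings mentioned above, a common mitigation worth knowing (in addition to the fixes described later) is a rel="canonical" tag that points the parameterized variants back to the clean version of the page. A minimal sketch, with a placeholder URL:

    <head>
    	<!-- Tells search engines which URL is the preferred version of this page -->
    	<link rel="canonical" href="https://www.example.com/products/blue-widget">
    </head>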

How to fix index bloat on your website

You really have two options:

  1. You delete the unwanted pages.
  2. You tell the search engines not to index them.

As simple as this sounds, it may take some time to do, and it may also take a while for the positive results to show. Be assured, though, that the work will pay off over time. To establish which pages need to be removed, first list the important pages of your website that must be indexed, then cross-compare that list with the pages Google has actually indexed. The excess is the index bloat that you want to get rid of.
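
A quick, if rough, way to see what Google currently has in its index is the site: search operator, optionally combined with inurl: to zoom in on suspect sections (the domain and paths below are placeholders):

    site:example.com
    site:example.com inurl:/tag/
    site:example.com inurl:?

The first query gives an approximate count of all indexed pages; the other two narrow it down to tag archives and parameterized URLs respectively.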

You can start by targeting the low-hanging fruit: pages you can easily identify in your XML sitemap that shouldn't be there. Remove them from your sitemap, and/or delete them if they no longer serve any purpose.

You can identify other offending pages in several ways:

  • Use an online service, such as Labrika, to identify them for you. You can do this via our ‘User behaviour data’ report, in the section ‘pages without any traffic’. This is likely the easiest option.

  • Analyze your log files and find pages that users are visiting that perhaps you didn't know about, and other low-value pages. You may find some surprises!
  • Check the Index Coverage report in Google Search Console, which lists the pages that Google has indexed for your website.

You can also restrict access to content and prevent indexing by web crawlers

While you can't force a web crawler to stay away from a publicly accessible page, you can instruct it not to index that page. Most search engines will obey this directive, but some may not, so it isn't a foolproof method.

If you have content that is truly confidential, you need stronger access controls to keep web crawlers out altogether. One option on Apache servers is the .htaccess file, which can control who is able to access individual folders. However, this is a technical process and not one to be undertaken by beginners! A minimal sketch is shown below.
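
As a rough illustration only, an .htaccess file that password-protects a folder with HTTP Basic Authentication might look like this (the file path is a placeholder, the referenced .htpasswd file must be created separately, and the server must allow AuthConfig overrides):

    # .htaccess placed in the folder you want to protect (Apache)
    AuthType Basic
    AuthName "Restricted area"
    # Placeholder path - point this at a real .htpasswd file outside the web root
    AuthUserFile /path/to/.htpasswd
    Require valid-user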

4 easy ways to fix index bloat

  1. Delete duplicate pages, unwanted pages, old test pages and so on.
  2. Remove low-value pages from your XML sitemap and mark them with a noindex meta tag in the HTML <head> section, like so:
    <head>
    	<meta name="robots" content="noindex">
    </head>

    You can enter this manually or via a plugin such as Yoast on a WordPress site.

  3. Set a disallow directive in your robots.txt file to indicate which folders or individual pages should not be crawled. Bear in mind that disallowing only stops crawling: if other sites link to a blocked URL, search engines may still index it without its content. For pages that must stay out of the index, use the noindex meta tag instead (and don't block those pages in robots.txt, or the crawler will never see the tag).
    # Example: block Googlebot from crawling these sections (paths are illustrative)
    User-agent: googlebot
    Disallow: /testimonials/
    Disallow: /checkout/
    Disallow: /content-page1.htm
  4. Set a noindex directive in your robots.txt file. Historically, pages listed this way would be crawled but not indexed. Be aware, however, that Google stopped supporting the noindex directive in robots.txt in September 2019, so don't rely on it; use the noindex meta tag from point 2 or an X-Robots-Tag HTTP header instead (see the sketch after this list).
    Noindex: /content-page1.htm
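
For files that have no HTML <head> section (PDFs, images and so on), the same noindex signal can be sent as an X-Robots-Tag HTTP response header. A minimal sketch for an Apache server, assuming the mod_headers module is enabled (the .pdf pattern is just an example):

    # Apache config or .htaccess: ask search engines not to index any PDF files
    <FilesMatch "\.pdf$">
        Header set X-Robots-Tag "noindex"
    </FilesMatch>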

Do's and don’ts when fixing index bloat

  • Do not allow internal search result pages (generated when a visitor uses the search bar on your site) to be crawled by search engines. Otherwise searchers may click a result on a search engine only to land on yet another search results page on your website, which makes for a poor user experience. (A robots.txt sketch for this is shown after this list.)
  • If proxy services generate URLs for your website, do not allow these to be crawled.
  • Have a thorough SEO audit performed, either by an SEO specialist or by an online tool such as us here at Labrika. Our user behaviour report allows you to view pages that have no traffic and are therefore likely 'bloating' your site.
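
A minimal robots.txt sketch for keeping internal search result pages out of the crawl (the /search path and the ?s= parameter are placeholders; use whatever your own site search actually generates):

    # Keep internal search result pages out of the crawl
    User-agent: *
    Disallow: /search
    Disallow: /*?s=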

Summary: finding and fixing index bloat

The primary objective of any search engine is to serve top-quality results to its users. To achieve this, they deploy significant resources to identify and discard pages (or whole websites) that don't fulfil their criteria.

This process also continues to be improved and refined, which means we, as SEO professionals and webmasters, must do our best to stay ahead of these issues.

This type of technical SEO issue should become an important part of any website's quality review. Ensure that crawlers are only seeing the best of your content!

Carrying out the fixes we described above is a key step in optimizing your SEO efforts.
