August 4, 2021

How to avoid duplicate content

When we talk about duplicate content, we are referring to situations where a single, unique piece of content on your site is reachable at multiple URLs. All of those URLs lead to the same piece of content.

This can occur for a multitude of reasons which we will be going through. We will also discuss the best ways to fix this issue.

To note: this is different from the issue of other sites duplicating your content on their own sites, which we would call external duplicate content. That is harder to control; internal duplicate content, however, is something we can help with.

Why does duplicate content matter for SEO?

Google prioritizes providing a great user experience. When it encounters significantly similar content at several URLs, it must decide which source or URL deserves the highest ranking.

If it thinks a website is attempting to manipulate rankings to get more traffic, the site or URL may be downgraded. In extreme circumstances it may be removed from Google's index altogether. For this reason, it is an important issue to take care of.

How to detect if you have significant duplicate content issues

There are a variety of online tools that can check for duplicate content.

Here at Labrika, we offer a non-original content checker, which finds any URLs on the internet, including within your own site, that show similar (or identical) content. This makes it a quick and easy way to find duplicate content within your own website.

For external duplicate content, a site such as Copyscape is excellent. Alternatively, Siteliner (another tool from the makers of Copyscape) is useful for finding internal duplicate content. Both offer a limited free service or a paid premium service.

Note: services like these may report a higher level of duplicate content than Google would, as they tend to include every element on the page, such as the sidebars. Since Google excludes these from its analysis, such tools may give an inflated duplicate content count.

If you already have a Labrika account, you can use our non-original content checker; if not, you can sign up here.

Another method, if you have more time, is to use Google itself. There are many Google search operators but you should start with the site: and intitle: operators.

For example, say you have an article or page called: "How to fly a kite really high".

To find all the URLs that point to this, enter into Google search:

    site:mysite.com intitle:"How to fly a kite really high"

Google will then return every indexed instance of that page title within your site. Ideally it should return only one result; if it returns more, you know you have duplicate content.

Of course, this is a more long-winded process, but can be useful if you only have a very small site.

The 6 most common causes of duplicate content issues

  1. HTTP/HTTPS and WWW/ non-WWW

    Does the content have links that contain both http://mysite.com/article1 and https://mysite.com/article1?

    Does your system refer to your site as www.mysite.com or mysite.com?

    Are there links to the same content using both versions? If so, you are creating duplicate content.
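    If your site runs on Apache, one common fix is a set of rewrite rules in .htaccess that force every request onto a single protocol and hostname. This is a sketch only: mysite.com is a placeholder for your own domain, and other servers such as nginx or IIS have their own equivalents.

```apacheconf
RewriteEngine On
# Redirect any http:// or non-www request to the single https://www version.
# Replace mysite.com with your own domain.
RewriteCond %{HTTPS} off [OR]
RewriteCond %{HTTP_HOST} !^www\. [NC]
RewriteRule ^(.*)$ https://www.mysite.com/$1 [L,R=301]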

  2. Paginated comments

    Systems like WordPress offer the option to paginate comments. This avoids displaying very large pages with possibly hundreds of comments at the bottom of each article. Each page has its own URL such as:

    mysite.com/myarticle/comments-page-1 and mysite.com/myarticle/comments-page-2

    These are multiple URLs for the same piece of content, creating a duplicate content scenario.
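    If you want to keep comment pagination, one common remedy (sketched here with placeholder URLs) is a canonical tag on each comment page pointing back to the main article, so that search engines consolidate the pages:

```html
<!-- Placed in the <head> of mysite.com/myarticle/comments-page-2 (placeholder URL): -->
<link rel="canonical" href="https://mysite.com/myarticle/" />
```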

  3. Session IDs

    Session IDs are very useful for allowing a website to remember a visitor and the dynamic actions they took on your site. For example, a session ID can refer to a shopping cart containing all the products the user wants to purchase. As the user navigates the site, that unique session ID is appended to the URL of each page visited, so a brand-new URL is created for every page, once again creating duplicate content.

    In this case, cookies provide a better approach, as search engines never see them. We will get into the fixes later.
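    As a sketch of the cookie approach, the server issues the session identifier in a response header instead of the URL; the cookie name and value below are placeholders:

```http
Set-Cookie: session_id=abc123; Path=/; HttpOnly
```

    Because the identifier travels in the header rather than the URL, every visitor requests the same clean URL and no duplicate pages are created.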

  4. Printer-friendly pages

    Some systems offer printer-friendly pages as an option. Any link on the website to a printer-friendly version is picked up by search engines, causing them to detect duplicate content.

    If you want this feature, it is best to use CSS or JavaScript to generate the printable page. Alternatively, exclude the printer-friendly URLs from search engines with a noindex tag, a nofollow on the links, or a rule in robots.txt.
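    For example, if your printer-friendly pages all live under one path, a robots.txt rule can keep crawlers out of them entirely. The /print/ path here is an assumption; adjust it to wherever your system places these pages:

```
# robots.txt
User-agent: *
Disallow: /print/
```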

  5. Web developers who don't 'get it'

    A developer will view a piece of content as a record in a database with a unique reference number. But this isn't how a search engine views that content. The website software may spawn multiple URLs that link to the same piece of content in different ways, and search engines detect that multiple unique URLs retrieve the same content, indicating possible duplicate content.

    In this case, you would need to ask your developers to ensure that every piece of content is served from a single unique URL, no exceptions.

  6. URL parameters

    When a system uses parameters in the URL to identify a piece of content in the database, those parameters can often be constructed in different ways for the same content.

    For example:

    /?id=1&cat=2 might refer to a unique article, but so does:

    /?cat=2&id=1 (cat = category, id = unique database reference).

    A search engine sees two different links to the same piece of content. For this issue, Google Search Console offers a URL Parameters tool where you can indicate how parameters like these should be handled.
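    The fix on the application side is to emit parameters in one fixed order, so that only one URL form ever appears in your links. A minimal Python sketch of that normalization (illustrative only, not any particular CMS's implementation):

```python
# Sort query parameters alphabetically so every link to the same
# content uses one canonical URL form.
from urllib.parse import urlparse, parse_qsl, urlencode, urlunparse

def normalize_url(url):
    """Return the URL with its query parameters in sorted order."""
    parts = urlparse(url)
    params = sorted(parse_qsl(parts.query))
    return urlunparse(parts._replace(query=urlencode(params)))

a = normalize_url("https://mysite.com/?id=1&cat=2")
b = normalize_url("https://mysite.com/?cat=2&id=1")
print(a == b)  # True - both normalize to https://mysite.com/?cat=2&id=1
```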

The best solutions to resolve duplicate content issues

  • 301 redirect ("Permanent Redirect")

    A 301 redirect is served by your web server to a user's browser or a search engine crawler when a specific URL is requested. It tells the user or search engine that the link address is outdated and indicates the new address. It's the coding equivalent of redirecting mail when you move house!

    A 301 redirect is most commonly used when you move from one website to another (e.g. after a name change), but it can also be used to redirect multiple URLs to one 'master URL'. This helps search engines keep their indexes up to date, and helps you avoid duplicate content issues.

    Some web systems let you set up redirects in the admin settings. On Apache servers you may need to insert them manually in the .htaccess file, which is a more hands-on, technical approach, but not too difficult to do.

    A typical redirect entry might look something like this:

    Redirect 301 /old-page.html /new-page.html

  • Canonical references

    The word canonical means 'the authoritative URL' in this context: you nominate one URL as the 'canonical' version for search engines.

    It’s a simple technical solution in theory but implementing it can be a little complex. However, it solves the problem of multiple URLs pointing to the same content. It also improves your site’s SEO and has the same effect as 301 redirects without redirecting anything. Think of it as a ‘soft 301 redirect’.

    Example of a canonical tag:

    <link rel="canonical" href="https://mysite.com/my-article/" />

    The rel attribute in HTML specifies the relationship to the linked document and must be accompanied by the href attribute.
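    For non-HTML resources, such as a PDF duplicate of a page, there is no <head> to put the tag in. Google also accepts the canonical reference as an HTTP response header (the URL below is a placeholder):

```http
Link: <https://mysite.com/my-article/>; rel="canonical"
```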

  • Use boilerplate text sparingly

    Most sites have a footer that is repeated at the bottom of each page. It's not a good idea to place a lot of content here. Instead, link to a page that summarizes all the things you want users to know. This avoids text being needlessly repeated across multiple pages.

  • Reduce the occurrence of actual duplicate content

    Sometimes you may have very similar content across several pages, for example, several similar products in a range. Where possible, it's best to consolidate as much as you can into one single page, or to rewrite the copy for each product so that it is sufficiently different from the rest while still conveying the same meaning.

    This may be a lot of effort, but is worth it in the end to avoid duplicate content issues.
