Why and How to Check for Duplicate Content

Most professional writers know that Google and other search algorithms are not friendly toward duplicate content. The gold standard is fresh, high-quality writing that follows excellent search engine optimization practices and brings novel information to the table. Even when you’re writing a piece that presents the same information in a different light, it’s important that you avoid copying – or even the perception of copying – work that has come before you.

There are several reasons why it’s important to stay cognizant of this, and why it’s important you perform a duplicate content check before hitting the Publish button. These reasons range from being SEO-driven to brand-building to the ethical desire to avoid accusations of plagiarism or theft.

Whether you’re a copywriter, an online editor, or another professional creating content online, it’s critical you understand how to check for duplicate content. Let’s take a look at what it is, where it occurs, why it’s a problem, and how to keep it out of your work using today’s cutting-edge anti-plagiarism technology.

What is Duplicate Content?

Duplicate content is exactly what it sounds like: a piece of content that can be found in more than one place. Note that this isn’t restricted to verbatim text, with the same headings and formatting, found on two different websites. Duplicate content still counts when it is:

  • Internal, and found on two or more pages of the same site
  • Rearranged into a patchwork – which is still plagiarism if the work isn’t your own, and self-plagiarism if you were expected to create new content for the publication and reused your own earlier work instead
  • Published with different headings or formatting
  • Only segments of the original work

It’s important to note that internal duplicate content, which isn’t fantastic in SEO terms (discussed shortly), is much less detrimental than externally duplicated work. This has to do with how search engines organize, rank, and return different pieces of content to internet users when they type keywords into the search field.

How Does Google Decide Which Content to Rank?

Here’s how it works. Google compares duplicate documents side by side to find the one it believes to be most relevant to its audience. As far as the bots are concerned, only one piece of content is necessary. Google assesses both pieces against typical algorithmic measures of relevance and ranks one in the Google Index, while essentially archiving the other so that it does not appear in search results. The one that ranks is the “canonical” page, and it will be the one returned for those keywords from then on.

It is important to note that Google does not distinguish between the original piece of content and the one that came after. If the bots judge the non-original content to be better – due to linking, structure, headings and formatting, metadata, and other measures of success – they will assign the ranking to that version and return it in search results from then on.

Duplicate Content and the Search Experience

People routinely mistake this for a “penalty.” They assume that one piece of work is held up, while others are slashed and burned. Not so. Google simply has to choose which result to return in search rankings. If it didn’t, the algorithm would exist in a constant state of selecting which similar web address to send people to for various keywords. This would mean dividing everyone’s traffic, diluting the worth of all links with less inbound traffic, and generally reducing the impact of all web pages involved.

Admittedly, preventing this scenario leads to the best user experience. You limit identical content, return high-quality results to searchers within parameters, heighten functionality and ensure great SERPs (search engine results pages). By ditching similar content, copied content, internal duplicate content and less-optimized website content, Google keeps its results pages much cleaner and more user-friendly.

Note that this doesn’t mean the original content loses its place on the web. For instance, the non-canonical content:

  • Remains on the website where it was published, and can’t be taken down by Google
  • Can still be linked to, both internally and by other websites
  • Will be represented on a search through Wayback Machine or other digital archives

Still, once an original piece of content has been sidelined by a new piece of content, there are serious consequences for the person or organization who created that content in the first place. For now, the easiest way to understand this is with a case study.

Duplicate Content: A Case Study in Cheese

So say you wrote a blog article entitled “Mastering the Art of Dutch Cheese Tasting.” It was one of the most popular posts written by a food blogger back in the early 2000s, and it got a lot of traffic for a while, but now is a bit tired. You don’t update your site much anymore or add new articles. Not too many people are linking to the original post these days. Still, it’s yours, and it leads to decent traffic that pays you in affiliate income.

Enter a giant food blog and a writer who left their moral compass somewhere in Greece about 10 years ago. They take “Mastering the Art of Dutch Cheese Tasting,” repurpose it with different headings and a nice patchwork effect, slap on some photos and a branded color scheme, and load it into WordPress. They shamelessly fail to run a duplicate content check and they hit Publish.

Immediately, they get lots of hits, in addition to pats on the back from their bosses, who are impressed by the traffic. The post, which was always well written, gets lots of social media love and is shared widely. It gets linked to in current food blogs, and it’s even mentioned in the New York Times. Eventually, of course, the truth surfaces … but by then, the SEO consequences for the original poster have already occurred, and the wrong will take time to correct.

Naturally, this isn’t the norm. But it does happen, and not infrequently. Let’s take a look at the more common causes of duplicated content, many of them unintentional.

Causes of Duplicate Content

Most people also assume that duplicate content issues arise when a nefarious party steals a piece of original content and places it on their own webpage. While direct plagiarism does occur, it isn’t the most common form of plagiarized content.

More commonly, it arises when an individual is looking to the work of another when creating their content. When working professionally on the web, the same rules don’t apply as when you’re writing a physical paper in a classroom. For instance, duplicate content often gets created when bloggers quote from other blogs, even with proper attribution. The thing is, it doesn’t matter how much credit you’re giving to another site if you are a) taking so much content it causes Google to think there are duplicates present, or b) borrowing content that appears in the snippet (at the top of the search results page), which is critical to SEO – and therefore unfair to the organization that ranked there first.

It can also happen when people quote website content in social media posts. Each social media post on Instagram, Twitter, Facebook, Pinterest, TikTok, and so forth has its own unique URL. Copied content is copied content, and Google will treat these URLs as such, then make the decision about which is canonical.

Duplicate content can also be triggered by session IDs, multiple URLs for the same piece of content, print-only versions of webpages, and other more innocent reasons.
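To see how session IDs and print-only versions produce several URLs for one page, here is a minimal Python sketch that groups URLs likely to point at the same content. The parameter names in IGNORED_PARAMS are assumptions chosen for illustration; your own site may use different ones, and this is not how any search engine is known to do it.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Hypothetical query parameters that do not change what the page says.
IGNORED_PARAMS = {"sessionid", "sid", "print", "utm_source", "utm_medium", "utm_campaign"}


def normalize_url(url: str) -> str:
    """Return a simplified form of `url` for grouping likely duplicates."""
    parts = urlsplit(url)
    # Keep only the query parameters that are not on the ignore list.
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in IGNORED_PARAMS
    ]
    kept.sort()
    # Treat "/cheese/" and "/cheese" as the same path.
    path = parts.path.rstrip("/") or "/"
    # Lowercase the scheme and host, and drop the fragment.
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(), path, urlencode(kept), ""))


def group_duplicates(urls):
    """Map each normalized URL to the original URLs that collapse onto it."""
    groups = {}
    for url in urls:
        groups.setdefault(normalize_url(url), []).append(url)
    # Only groups with more than one member are potential duplicates.
    return {key: members for key, members in groups.items() if len(members) > 1}
```

Running group_duplicates over a list of your site's URLs gives a quick shortlist of pages that may need a single preferred address.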

Why Prevent Duplicate Content

There are three answers to the question “Why should I try to avoid duplicate content?” The first is that you can get into serious trouble for plagiarizing the work of others. It is unethical and deeply frowned upon in the academic, scientific, and general writing communities, where original work is prized. In some cases, you may find yourself at the wrong end of a lawsuit. The second is that duplicating content internally can lead to negative SEO consequences for your own site. You dilute the worth of your own pages, weakening your standing with Google and affecting your own placements on SERPs.

The final reason you should prevent duplicate content from happening is that it can lead to serious consequences for someone else.

The Downside of Duplicate Content for Original Creators

If someone else wrote and optimized the original content, a duplicate of it can negatively affect their search engine optimization as well.

Unfortunately, while the crawlers (bots that scan content) do a good job meeting searchers’ needs, they have no way of telling who wrote what. They simply line content up and note whether it is “appreciably similar.” Then, based on various metrics, they choose one piece to rank and discard the others. When the algorithm archives the original piece of content – therefore dismissing someone’s hard work – that’s a serious problem.
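To get a feel for what “appreciably similar” can mean in practice, here is a rough Python sketch that compares two texts by the three-word sequences they share. This is a common textbook technique (word shingles with Jaccard similarity), not Google's actual method, and the function names and shingle size are assumptions for illustration.

```python
import re


def shingles(text: str, size: int = 3) -> set:
    """Return the set of overlapping `size`-word sequences in `text`."""
    # Lowercase and keep only word characters so formatting does not matter.
    words = re.findall(r"[a-z0-9']+", text.lower())
    if len(words) < size:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + size]) for i in range(len(words) - size + 1)}


def jaccard_similarity(a: str, b: str, size: int = 3) -> float:
    """Share of word sequences the two texts have in common, from 0.0 to 1.0."""
    set_a, set_b = shingles(a, size), shingles(b, size)
    if not set_a and not set_b:
        return 0.0
    return len(set_a & set_b) / len(set_a | set_b)
```

Because the comparison ignores capitalization and punctuation, a copy with new headings or formatting still scores close to the original, and nothing in the score says which text came first.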

It means the original writer or webmaster may lose the credit that they earned if their page is archived and no longer returned in search results. It also means the second piece of content appears in the search rankings to be the original piece of content. This has a range of consequences, including:

  • Losing any future backlink potential, because the duplicate content is now much likelier to earn links from other sources
  • Reducing the e-commerce potential of the archived duplicate pages
  • Wasting the time and resources of a company that worked hard to optimize the original content (now considered “duplicate”) through the right formats, internal links, HTML structure, meta tags, and so on
  • Making the original creator look like the plagiarist

For obvious reasons, the last one alone makes duplicate content a huge problem. Most people, of course, are honorable and wouldn’t wish to do that. (For those who would: note that tools like plagiarism checkers and Google Search Console now make it much harder to get away with intellectual property theft on the web.)

If you want to remain in the clear, you should always create unique content, using your own words. If you’re worried about duplications, you can perform an SEO audit or use a variety of other SEO tools to ensure you’re always creating something new, compelling and meaningful to your readers. That’s the best way to grow an audience and keep your nose clean at the same time.

That, and content checkers, of course.

How to Check for Duplicate Content

Obviously, as a professional, you don’t want to be seen as copying the work of others. While you can run a duplicate content checker to check your work against one other document – much like how you’d compare two Word documents – a better alternative is to use a plagiarism checker, which searches a far larger body of published content and helps you avoid accidental duplication.
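The simple two-document comparison can be sketched with Python's standard difflib module. This is only an illustration of the idea; the function names and the five-word threshold are assumptions, and a real checker compares against far more than one source.

```python
from difflib import SequenceMatcher


def similarity_ratio(draft: str, source: str) -> float:
    """Return a 0.0-1.0 score for how much word-level overlap two texts have."""
    draft_words = draft.lower().split()
    source_words = source.lower().split()
    return SequenceMatcher(None, draft_words, source_words, autojunk=False).ratio()


def shared_passages(draft: str, source: str, min_words: int = 5) -> list:
    """List runs of at least `min_words` consecutive words found in both texts."""
    draft_words = draft.split()
    source_words = source.split()
    matcher = SequenceMatcher(
        None,
        [word.lower() for word in draft_words],
        [word.lower() for word in source_words],
        autojunk=False,
    )
    # Each matching block is a run of words that appears in both texts.
    return [
        " ".join(draft_words[block.a:block.a + block.size])
        for block in matcher.get_matching_blocks()
        if block.size >= min_words
    ]
```

Any passage returned by shared_passages is a good candidate for rewording or for a properly attributed quote.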

Additionally, keep the following guidelines in mind:

  • Where possible, only use others’ ideas as inspiration, at most paraphrasing their work
  • If it would be a disservice to the original writer not to quote, and if it does not include their Google Snippet, you may cut and paste a line or two of content from another source, with attribution
  • Try to use as many sources as possible when creating your own stories and informational articles
  • Always give credit for photos, graphics, music, or other types of art as well
  • Never quote data without attribution
  • Even with numbers, try your best to rework the content so it does not trigger duplicates

In addition to a duplicate content checker, you can also copy sections of your writing that you are worried may be too close and paste them into the Google search field. If it pulls up the source you used, your writing is still too close to the original.
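If you do this often, a small helper can pick out the passages worth pasting. The Python sketch below selects the longest sentences from a draft and wraps each in quotation marks, which asks the search engine for an exact-phrase match. The function name and the defaults are assumptions for illustration.

```python
import re


def exact_phrase_queries(text: str, count: int = 3, max_words: int = 12) -> list:
    """Return up to `count` quoted phrases taken from the longest sentences."""
    # Split on whitespace that follows sentence-ending punctuation.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    longest = sorted(sentences, key=lambda s: len(s.split()), reverse=True)[:count]
    queries = []
    for sentence in longest:
        # Trim trailing punctuation and cap the length to keep the query short.
        words = sentence.rstrip(".!?").split()[:max_words]
        queries.append('"' + " ".join(words) + '"')
    return queries
```

Paste each returned phrase into the search field, quotation marks included, and see whether your source appears.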

Closing Thoughts

Ready to ensure the best possible results in all your writing? Take these duplicate content lessons to heart, write your own content each and every time, and always use an industry-leading unique content checker on your work before hitting Publish!