Duplicate Content and SEO

Note that this is an update of a May 2, 2006 article with more data and more commentary.

Several months ago, a subsidiary of my current employer created a very robust recipe search tool with over 10,000 recipes. The tool generated some excitement among other business groups, who wanted to incorporate it into their own websites. When I heard this, I decided to monitor the situation to see what would happen with these different search tools, because the sites were, in effect, deploying duplicate content, which search engines are generally believed to dislike.

Of course, the question is: what does it mean that search engines don't like duplicate content? My take is that when they encounter duplicate content, search engines want to make sure only one copy is shown in the search results for any given search term. After all, it does the user little good to see five different sites all with exactly the same content. Confirming that two pages are duplicates is easy; that's the sort of thing computers are good at. Even if the sites look slightly different depending on the branding, I think search engines can still determine that the real content of a page is a duplicate.
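To illustrate the idea that a search engine can see past branded wrappers, here's a minimal sketch of one well-known technique for near-duplicate detection: word shingling plus Jaccard similarity. This is not how any particular search engine works internally; the page text, the shingle size, and the similarity threshold are all assumptions for the example.

```python
import re

def shingles(text, k=5):
    """Split text into overlapping k-word shingles, lowercased, punctuation stripped."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return {tuple(words[i:i + k]) for i in range(max(len(words) - k + 1, 1))}

def jaccard(a, b):
    """Jaccard similarity between two shingle sets (1.0 = identical)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

# Two hypothetical "pages": the same recipe body behind different branded headers.
page1 = ("Site A Recipes | Chocolate Cake: Mix flour sugar cocoa and eggs "
         "then bake at 350 degrees for 30 minutes")
page2 = ("Brand B Kitchen | Chocolate Cake: Mix flour sugar cocoa and eggs "
         "then bake at 350 degrees for 30 minutes")

similarity = jaccard(shingles(page1), shingles(page2))
print(round(similarity, 2))  # high similarity despite the different branding
```

Even with different site names up front, the shared recipe body produces mostly identical shingles, so the two pages score far closer to each other than two unrelated pages would.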

However, the challenge that search engines face is figuring out which copy is the original. This is important so that the original is displayed in search results while the duplicate is discarded. From what I've read, this decision is made by considering a number of factors (all speculation of course) including which copy was found first, which copy comes from an older site, and which copy comes from a more “respectable” site.

This duplicate content that I'm reporting on should provide some good insights. Why? Because these sites aren't attempting to spam the search engines. Instead, they're just typical efforts by business units to brand something on the web, with no consideration of the SEO consequences. You might also ask why, as someone involved with SEO, I'm not doing anything about it. Partly because the sites are eventually going to disappear as they are replaced by yet another recipe search, and partly because examining the data should be quite educational.

So without further ado, here's the data I've collected over the course of a few months. This first chart shows the number of pages for a particular site in Google. Note, I labeled the sites with a letter rather than the actual domain. Other things to keep in mind:

  • All sites are subdomains of a parent site, i.e., the parent site is www.something.com and these sites are subdomain.something.com.
  • The parent site of site A is the oldest and most optimized. It ranks well with Google and other search engines.
  • Other than site A, all the parent sites are well indexed, but haven't had all SEO issues addressed.
  • Site C is considered the flagship version of this recipe search tool, i.e., it gets the offline and online press.
Date   Site A   Site B   Site C   Site D   Site E   Site F
10-Mar-06   795   36   33,600   11,000   199   13,700
21-Mar-06   12,700   40   21,700   99   653   10,500
29-Mar-06   12,000   27   23,300   73   9,810   967
12-Apr-06   14,800   38   15,000   27   10,100   885
17-Apr-06   16,600   39   40,700   35   15,400   12,400
2-May-06   30,900   28   25,900   31   11,700   13,000
5-May-06   15,200   25   19,400   19   9,880   11,200
18-May-06   84,800   25   24,100   17,700   9,360   11,400
22-May-06   107,000   31   26,200   17,600   836   788
30-May-06   14,500   28   10,700   17,600   526   17,200
12-Jun-06   11,000   24   9,1200   21,600   442   20,500

There are some interesting things happening here.

  1. Even though Site A was poorly indexed at the beginning while Sites C, D, and F were well indexed, Site A still managed to have its content accepted by Google.
  2. Some strange things happened near the end of May: Site A had more pages indexed than actually exist on the site. I thought there might be a URL-parameter issue, but the problem appears to have fixed itself.
  3. Even though we consider Site C to be the flagship version of the search tool, Google has most recently decreased the number of indexed pages and seems to favor Site D and Site E.
  4. Site E seems to be the clearest indication of what happens to a site with duplicate content: for just under a month, it has had very few pages indexed.
  5. And yet, contrary to the above, Site D went from having next to no pages indexed to now being the leader.

Even though I've made some observations, I'm finding it difficult to draw any solid conclusions. Part of the problem may be that I have only three months' worth of data, though I would have expected three months to be long enough for Google to figure things out. Perhaps Google is giving each site the benefit of the doubt at this time.


1 Comment

  1. I realise this is old, but did you ever draw any further conclusions on this? Did the rankings stabilise in the subsequent months? I find there is a lot of conflicting advice about duplicate content, so it would be nice to have some actual data to base an opinion off.
