Duplicate Content and SEO
Note that this is an update of a May 2, 2006 article with more data and more commentary.
Several months ago, a subsidiary of my current employer created a very robust recipe search tool with over 10,000 recipes. The availability of this tool generated some excitement among various other business groups, who wanted to incorporate it into their websites. When I heard this, I thought I should monitor the situation to see what would happen with these different search tools, because the sites were, in effect, deploying duplicate content, which the general belief holds that search engines don't like.
Of course the question is, what does it mean that search engines don't like duplicate content? My take is that when they encounter duplicate content, the search engines will want to make sure only one copy is shown in the search engine results for any given search term. After all, it does the user little good to display 5 different sites all with the exact same content. Confirming that two pages are duplicates is easy. That's the sort of thing that computers are good at. Even if the sites have a slightly different look depending on the branding, I think search engines can still determine that the real content of a page is a duplicate.
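To make the "computers are good at this" point concrete, here is a minimal sketch of one common way to compare page text: break each page into overlapping runs of words (shingles) and measure how many runs the two pages share (Jaccard similarity). This is only an illustration. I don't know what the search engines actually use, and the function names, the shingle size of 3, and the sample text are all my own assumptions.

```python
import re


def shingles(text, k=3):
    """Return the set of k-word shingles (overlapping word windows) in text."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    if len(words) < k:
        # Text too short for a full window: treat it as a single shingle.
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets: size of intersection over size of union."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def similarity(text_a, text_b, k=3):
    """Score from 0.0 (no shared shingles) to 1.0 (identical shingle sets)."""
    return jaccard(shingles(text_a, k), shingles(text_b, k))


# Hypothetical example: the same recipe text under two different brand headers.
recipe = "Mix flour sugar and eggs then bake for thirty minutes"
page_a = "Brand A Recipes " + recipe
page_b = "Brand B Kitchen " + recipe
score = similarity(page_a, page_b)
```

The different branding lowers the score a little, but most shingles still match, which is why a different look and feel is unlikely to hide a duplicate. A real system would presumably strip navigation and templates first and work at a much larger scale.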
However, the challenge that search engines face is figuring out which copy is the original. This is important so that the original is displayed in search results while the duplicate is discarded. From what I've read, this decision is made by considering a number of factors (all speculation of course) including which copy was found first, which copy comes from an older site, and which copy comes from a more “respectable” site.
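Purely as a thought experiment, the three factors above could be combined as a simple tie-break ordering. Everything in this sketch is hypothetical: the `PageCopy` fields, the `reputation` score, and especially the order in which the factors are applied are my own assumptions, not anything a search engine has confirmed.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class PageCopy:
    url: str
    first_crawled: date   # when this copy was first found (assumed known)
    site_created: date    # age of the hosting site (assumed known)
    reputation: float     # hypothetical 0..1 "respectability" score


def pick_original(copies):
    """Pick the copy found first; break ties by older site, then higher reputation.

    The factor order is an arbitrary assumption made for illustration.
    """
    return min(
        copies,
        key=lambda c: (c.first_crawled, c.site_created, -c.reputation),
    )


# Hypothetical copies on made-up subdomains.
copies = [
    PageCopy("http://a.example.com/recipe", date(2006, 3, 1), date(1999, 1, 1), 0.9),
    PageCopy("http://c.example.com/recipe", date(2006, 2, 1), date(2003, 1, 1), 0.6),
    PageCopy("http://d.example.com/recipe", date(2006, 2, 1), date(2001, 1, 1), 0.5),
]
original = pick_original(copies)
```

With a strict ordering like this, whichever factor comes first dominates. A weighted blend would behave differently, which is part of why the outcome is hard to predict from the outside.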
This duplicate content that I'm reporting on should provide some good insights. Why? Because these sites aren't attempting to spam the search engines. Instead, they're just typical efforts by business units to brand something on the web with no consideration of the SEO consequences. You might also ask why, as someone involved with SEO, I'm not doing anything about it? Partly because the sites are eventually going to disappear as they are replaced by yet another recipe search and partly because examining the data should be quite educational.
So without further ado, here's the data I've collected over the course of a few months. This first chart shows the number of pages for a particular site in Google. Note that I labeled each site with a letter rather than its actual domain. Other things to keep in mind:
- All sites are subdomains of a parent site, i.e., the parent site is www.something.com and these sites are subdomain.something.com.
- The parent site of site A is the oldest and most optimized. It ranks well with Google and other search engines.
- Other than site A, all the parent sites are well indexed, but haven't had all SEO issues addressed.
- Site C is considered the flagship version of this recipe search tool, i.e., it gets the offline and online press.
There are some interesting things happening here.
- Even though Site A was poorly indexed at the beginning while Sites E, F, and H were well indexed, Site A still managed to have its content accepted by Google.
- Some strange things happened near the end of May in that Site A had more pages indexed than were actually on the site. I thought there might be a URL parameter issue, but it looks like the problem fixed itself.
- Even though we consider Site C to be the flagship version of the search tool, Google has most recently decreased the number of indexed pages and seems to favor Site D and Site E.
- Site E seems to be the clearest indication of what happens to a site with duplicate content: for just under a month, it has had very few pages indexed.
- And yet, contrary to the statement above, Site D went from having next to no pages indexed to now being the leader.
Even though I've made some observations, I'm finding it difficult to draw any solid conclusions. Part of the problem may be that I have only 3 months' worth of data. I would have expected 3 months to be long enough for Google to figure things out. Perhaps Google is giving each site the benefit of the doubt for now.