Detecting Cloaking Algorithmically Is Not Easy
Recently Matt Cutts posted a question on his blog asking what people thought Google's Web Spam team should focus on next. Mixed in among the answers were requests to eliminate cloaking. Some even went so far as to list offending sites. What's interesting to me is that cloaking isn't new, so is there something tricky about its detection that has kept Google from eliminating it from their results?
I can install a Firefox add-on to set my browser's user-agent to match that of the GoogleBot and spend the rest of the day identifying cloaking with near 100% accuracy. So what's been keeping the PhD-filled Google complex from wrapping what I can do into an automated process? I had no idea what the answer to that question was until I read through a research paper titled Cloaking and Redirection: A Preliminary Study by Baoning Wu and Brian D. Davison of Lehigh University.
Cloaking in Brief
Cloaking on the web, in case this is the first time you've come across the term, refers to delivering content to users that is different from what is delivered to search engines. The motivation behind doing so is to obtain rankings in search engines while driving subsequent visitors to some action without providing the promised content. For example, a site may require visitors to sign up before seeing the content, but those very same visitors wouldn't ever see the content if it didn't rank in search engines. Cloaking to the rescue -- show the full content to search engines, but swap in a registration form for human visitors.
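The registration-wall scenario above boils down to branching on the requesting user agent. Here's a minimal sketch of that server-side logic; the function names and HTML strings are my own illustration, not taken from any real site:

```python
def full_article_html() -> str:
    """The complete content, shown only to crawlers."""
    return "<html><body><article>Full promised content here.</article></body></html>"

def registration_form_html() -> str:
    """What a human visitor actually gets instead."""
    return "<html><body><form>Sign up to read this article.</form></body></html>"

def render_page(user_agent: str) -> str:
    """Cloak: serve different content based on who appears to be asking."""
    if "Googlebot" in user_agent:
        return full_article_html()
    return registration_form_html()
```

This is also why the user-agent-spoofing add-on mentioned earlier works: set your browser's user agent to Googlebot's and a naive cloaker like this serves you the crawler version.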
Methods for Detecting Cloaking
At first blush it seems that detecting a cloaked page is a simple matter of comparing the content delivered to a search engine vs. the content delivered to a browser. Unfortunately, this comparison turns out to be a non-trivial task. Some reasons for this are that some sites:
- Change their content frequently (e.g. news sites) so comparing two different copies would yield a false positive.
- Rotate through content and show something different for each request (e.g. a page that profiles different people).
- Serve "clean" versions of their content to search engines (e.g. by removing advertising). There's no malicious intent with this sort of activity.
- Include dynamic elements such as timestamps that make every version of the page content unique.
Back in 2003, M. Najork filed a patent for a system and method for identifying cloaked web servers. He proposed using a browser toolbar installed by users to compare pages to what was stored by a search engine. The problem with his proposed solution is that it doesn't take into account any of the four items above. Fortunately, I don't believe such a toolbar ever made it to the masses.
Wu and Davison, authors of the research paper I mentioned above, proposed a few alternate methods for detecting cloaking. The first looks at the terms on the page using the following algorithm:
- Capture three copies of a web page. Two by a crawler (C1, C2) and one by a browser (B1).
- Parse the HTML into terms and count the number of unique terms.
- Compare the counts for C1 and C2 (call this NCC) along with C1 and B1 (call this NBC).
- If NBC is significantly greater than NCC, mark the page as a cloaking candidate. Note that the threshold for what constitutes a significant difference between NBC and NCC can be tuned to achieve the desired balance of precision vs. recall.
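The steps above can be sketched in a few lines of Python. The tag-stripping regex and function names are my own simplification; the paper defines the counts, not this code:

```python
import re

def unique_term_count(html: str) -> int:
    """Strip tags, then count distinct terms on the page."""
    text = re.sub(r"<[^>]+>", " ", html)
    return len(set(text.lower().split()))

def is_cloaking_candidate(c1: str, c2: str, b1: str, threshold: int = 0) -> bool:
    """Flag a page when the browser copy differs from a crawler copy
    by more than normal crawler-to-crawler churn (plus a tunable threshold)."""
    ncc = abs(unique_term_count(c1) - unique_term_count(c2))  # crawler vs. crawler
    nbc = abs(unique_term_count(c1) - unique_term_count(b1))  # crawler vs. browser
    return nbc > ncc + threshold
```

Raising `threshold` trades recall for precision: a news site whose copies naturally drift apart is less likely to be flagged, but so is a subtle cloaker.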
Another approach compares the link counts on the pages, following the same steps described above for comparing terms. The results revealed that link comparisons identify fewer instances of cloaking than term comparisons, but with greater accuracy.
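Swapping term counts for link counts only changes the feature being compared. A rough sketch, using a simple anchor-tag regex as a stand-in for real link extraction:

```python
import re

def link_count(html: str) -> int:
    """Count anchor tags as a rough proxy for the number of links on the page."""
    return len(re.findall(r"<a\s", html, flags=re.IGNORECASE))

def links_suggest_cloaking(c1: str, c2: str, b1: str, threshold: int = 0) -> bool:
    """Same NCC/NBC comparison as the term method, but over link counts."""
    ncc = abs(link_count(c1) - link_count(c2))  # crawler-to-crawler drift
    nbc = abs(link_count(c1) - link_count(b1))  # crawler-to-browser gap
    return nbc > ncc + threshold
```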
The final method requires an additional copy of the browser version of the page, bringing the total to four: C1, C2, B1, and B2. The assumption behind this third method is that deliberate cloaking will return a set of specific terms (chosen by the spammer) to search engines, but never to users. So, if C1 and C2 have common terms that don't appear at all in B1 or B2, it's likely the page is cloaking. Again, a threshold would be required to reduce false positives for insignificant differences.
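This four-copy check falls out naturally from set operations: intersect the two crawler copies, then subtract everything the browser ever saw. Again a sketch under my own simplified tag-stripping, not the paper's exact implementation:

```python
import re

def terms(html: str) -> set:
    """Distinct terms on a page after stripping tags."""
    text = re.sub(r"<[^>]+>", " ", html)
    return set(text.lower().split())

def spam_terms(c1: str, c2: str, b1: str, b2: str) -> set:
    """Terms served consistently to the crawler but never to the browser."""
    return (terms(c1) & terms(c2)) - (terms(b1) | terms(b2))

def is_cloaking(c1: str, c2: str, b1: str, b2: str, threshold: int = 0) -> bool:
    return len(spam_terms(c1, c2, b1, b2)) > threshold
```

Because legitimate dynamic content (timestamps, rotating profiles) rarely repeats across both crawler copies while vanishing from both browser copies, this method tolerates the churn that trips up simple two-copy diffs.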
Unfortunately, all three of these methods are flawed (even the authors say so) and will result in many pages being marked as engaged in cloaking, because they fail to distinguish between "acceptable" cloaking and the kind that results in web spam. In addition, search engines would be required to capture three or four copies of the same page, which would add to the already daunting task of crawling and storing web content.
So what do you say? Should we cut Google a little slack?