Web Spam Taxonomy
I came across an interesting research paper the other day titled Web Spam Taxonomy. How could I resist that title!? The paper was written by Zoltan Gyongyi and Hector Garcia-Molina while in the Computer Science Department of Stanford University. The authors also acknowledge many discussions with an anonymous collaborator at a major search engine as a source of information.
In the context of this research paper, all SEOs are spammers, since their activities are intended to boost the rankings of a page without actually improving the quality of that page. The authors call out techdictionary.com as an example of web spam. They also mention SEO Inc. and Bruce Clay as important voices in the web spam arena. Curious choices.
Web spam is detrimental to search engines in two ways because it:
- reduces the quality of search results
- increases the cost of each processed query due to the storage and retrieval of useless pages
The techniques for generating web spam fall into two categories: boosting and hiding. Boosting focuses on improving rankings, while hiding focuses on preventing the search engines from detecting the boosting techniques.
Possibly the simplest form of spamming, term spamming centers around the fundamental TFIDF metric (term frequency / inverse document frequency). In practical terms, if a keyword appears 4 times in a page of 40 terms, then its TF (term frequency) is 0.1. The IDF (inverse document frequency) is related to how many documents in the collection contain the term: a keyword found in 4 out of 40 documents has an IDF of 10. The TFIDF score of a page for a given query is the sum, over all terms common to the query and the page, of the products of the TF and IDF values. With TFIDF in mind, spammers either target a small number of keywords to get a high TFIDF score or aim for a non-zero TFIDF score on a large number of terms.
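Here's a minimal sketch of that scoring in Python. It assumes IDF is the plain ratio of total documents to documents containing the term, which matches the numbers above; real search engines typically use a logarithmic variant. The function names and the toy collection are mine, not the paper's.

```python
from collections import Counter


def tf(term, doc_terms):
    """Fraction of the document's terms that equal `term`."""
    if not doc_terms:
        return 0.0
    return Counter(doc_terms)[term] / len(doc_terms)


def idf(term, collection):
    """Total documents divided by documents containing `term`.

    Assumption: a raw ratio with no logarithm, returning 0.0 when
    no document contains the term.
    """
    containing = sum(1 for doc in collection if term in doc)
    return len(collection) / containing if containing else 0.0


def tfidf_score(query_terms, doc_terms, collection):
    """Sum TF * IDF over the terms shared by the query and the document."""
    common = set(query_terms) & set(doc_terms)
    return sum(tf(t, doc_terms) * idf(t, collection) for t in common)


# Toy data: "rank" appears 4 times in a 40-term page,
# and in 4 of the 40 documents in the collection.
page = ["rank"] * 4 + ["filler"] * 36
collection = [page] + [["rank", "other"]] * 3 + [["other"]] * 36

print(tf("rank", page))
print(idf("rank", collection))
print(tfidf_score(["rank"], page, collection))
```

A query term that never appears in the page contributes nothing, which is why spammers care so much about getting at least a non-zero score.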
Page elements that make good targets for spam include the body, title, meta tag, anchor text, and URLs. Some techniques include:
- Repetition of keywords in the page elements to increase relevance for a small number of queries.
- Dumping a large number of unrelated terms to increase the number of queries (mostly long tail) that the document would be relevant for.
- Weaving, which inserts spam keywords into text copied from elsewhere. The result certainly wouldn't pass a visual inspection, but it could trick a search engine.
- Phrase stitching, which glues together snippets of content from different sources to create new content.
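To see why the first two techniques in the list work, here's a toy sketch. It treats a page as a list of terms and uses plain term frequency as the relevance signal; the helper names and the sample page are hypothetical and only meant to illustrate the effect.

```python
def term_frequency(term, terms):
    """Share of the page's terms that equal `term`."""
    return terms.count(term) / len(terms) if terms else 0.0


def repeat_keyword(terms, keyword, times):
    """Repetition: pad the page with many copies of one keyword."""
    return terms + [keyword] * times


def dump_terms(terms, unrelated):
    """Dumping: pad the page with a pile of unrelated terms."""
    return terms + list(unrelated)


def matching_queries(terms, queries):
    """Single-term queries for which the page has a non-zero term frequency."""
    present = set(terms)
    return [q for q in queries if q in present]


page = "cheap flights to paris".split()

# Repetition raises the score for one keyword.
repeated = repeat_keyword(page, "flights", 6)
print(term_frequency("flights", page), term_frequency("flights", repeated))

# Dumping makes the page match queries it had nothing to do with.
long_tail = ["hotel", "casino", "loans"]
dumped = dump_terms(page, long_tail)
print(matching_queries(page, long_tail), matching_queries(dumped, long_tail))
```

Repetition trades breadth for a higher score on one keyword, while dumping does the opposite: each dumped term gets a tiny score, but the page now shows up for far more queries.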
It doesn't take long before someone new to SEO comes across articles about how links are key to rankings. So it's no wonder that link spamming exists. For spammers, the pages that links can be placed on fall into three categories:
- inaccessible i.e. can't be modified
- accessible, but not owned i.e. can be modified indirectly such as comments in a forum
- owned i.e. complete control to modify
Targeting the HITS algorithm's emphasis on hub and authority scores is a matter of creating pages that link to high quality sites to increase that page's hub score. One common technique is to mirror a directory. Using many high hub score pages to point to a chosen target can then result in that target page obtaining a high authority score. This, as you can imagine, can be accomplished with accessible and owned pages.
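The mirrored-directory trick is easier to see with numbers. Below is a rough sketch of the HITS iteration on a made-up graph: one honest directory (`dir`) links to three authorities, and three spam pages (`s1` to `s3`) copy those links and add one to a `target` page. The graph, the page names, and the choice to normalize scores by their sum are all my assumptions for illustration.

```python
def hits(links, iterations=20):
    """Iteratively compute hub and authority scores.

    `links` maps each page to the list of pages it links to.
    Scores are normalized to sum to 1 after every round (a simplification).
    """
    pages = set(links) | {t for targets in links.values() for t in targets}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority score: sum of the hub scores of pages linking in.
        auth = {p: 0.0 for p in pages}
        for src, targets in links.items():
            for t in targets:
                auth[t] += hub[src]
        # Hub score: sum of the authority scores of pages linked to.
        hub = {p: sum(auth[t] for t in links.get(p, [])) for p in pages}
        a_total = sum(auth.values()) or 1.0
        h_total = sum(hub.values()) or 1.0
        auth = {p: v / a_total for p, v in auth.items()}
        hub = {p: v / h_total for p, v in hub.items()}
    return hub, auth


authorities = ["a1", "a2", "a3"]
links = {"dir": authorities}
for name in ("s1", "s2", "s3"):
    # Each spam page mirrors the directory and adds a link to the target.
    links[name] = authorities + ["target"]

hub, auth = hits(links)
print("hub scores:", {p: round(hub[p], 3) for p in ("dir", "s1")})
print("target authority:", round(auth["target"], 3))
```

In this toy graph the spam pages end up with hub scores at least as good as the directory they copied, and the target picks up authority purely from links the spammer controls.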
Google's PageRank algorithm, despite its success, is susceptible to manipulation as well. PageRank places great importance on a page's incoming links. Generally, the more links the better, although there is increased benefit from "high quality" links. Enter the link farm (the paper calls it a spam farm), an artificial construction of interlinked sites that in its purest form has:
- All available owned pages part of the link farm
- All accessible pages pointing to the spam farm
- All links pointing outside of the spam farm suppressed
- All pages within the farm having some outgoing links
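A small experiment shows the payoff. The sketch below runs a basic PageRank power iteration on a hypothetical mini web, once with an isolated `target` page and once with a five-page farm whose pages link only to the target while the target links only back into the farm. The damping factor of 0.85, the graph, and the even redistribution of rank from pages without outgoing links are assumptions of this sketch, not details from the paper.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Basic PageRank by power iteration.

    `links` maps each page to the list of pages it links to.
    Rank held by pages with no outgoing links is spread evenly over all pages.
    """
    pages = sorted(set(links) | {t for ts in links.values() for t in ts})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        dangling = sum(rank[p] for p in pages if not links.get(p))
        base = (1.0 - damping) / n + damping * dangling / n
        new_rank = {p: base for p in pages}
        for src, targets in links.items():
            if targets:
                share = damping * rank[src] / len(targets)
                for t in targets:
                    new_rank[t] += share
        rank = new_rank
    return rank


# Honest pages form a simple cycle; the target starts with no links at all.
honest = {"a": ["b"], "b": ["c"], "c": ["a"]}
without_farm = pagerank({**honest, "target": []})

# The farm: every farm page links to the target, the target links back,
# and no link points outside the farm.
farm_pages = [f"farm{i}" for i in range(1, 6)]
farm = {"target": farm_pages, **{p: ["target"] for p in farm_pages}}
with_farm = pagerank({**honest, **farm})

print("target without farm:", round(without_farm["target"], 3))
print("target with farm:   ", round(with_farm["target"], 3))
print("honest page 'a':    ", round(with_farm["a"], 3))
```

Because nothing leaks out of the farm and every farm page has an outgoing link, the rank the farm collects keeps circulating back to the target.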
Aside from link farms, all methods of obtaining inbound links are fair game. For example, anyone who has successfully toyed with social media has already figured out one of the ways links can be accumulated: publish useful or interesting content that attracts links. A spammer goes one step further and uses that link equity to boost low quality pages. Other techniques include infiltrating a web directory, posting links in forums and blog comments, participating in link exchanges, and buying expired domains.
Once popular, but not particularly effective anymore, is hiding content using background and foreground colors that match. Hiding links is only slightly harder and can be achieved with 1x1 pixel images. CSS brought with it a few new tricks, such as making page elements invisible or pushing text out of view with negative indents.
A more sophisticated hiding technique is cloaking, which is all about sending different content to search engines than to visitors. Detecting the user-agent and responding accordingly is one method, but IP cloaking, which keys off the crawler's IP address instead, is much harder to detect.
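User-agent cloaking boils down to a single branch, which is also why it's the easier variety to catch. The sketch below is a simplified model, not a real server: `serve` stands in for a cloaking site, the crawler tokens are an illustrative and incomplete list, and `appears_cloaked` is a hypothetical check that requests the same page under two user-agents and compares the responses.

```python
# Illustrative substrings only; real crawler identification is more involved.
CRAWLER_TOKENS = ("googlebot", "bingbot", "slurp")


def looks_like_crawler(user_agent):
    """Guess from the User-Agent header whether the client is a crawler."""
    ua = (user_agent or "").lower()
    return any(token in ua for token in CRAWLER_TOKENS)


def serve(user_agent):
    """Stand-in for a cloaking site: crawlers and visitors get different pages."""
    if looks_like_crawler(user_agent):
        return "keyword keyword keyword ... text written for the ranking algorithm"
    return "the page a human visitor actually sees"


def appears_cloaked(fetch, browser_ua, crawler_ua):
    """Flag a page whose content changes when only the user-agent changes."""
    return fetch(browser_ua) != fetch(crawler_ua)


browser_ua = "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"
crawler_ua = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"

print(appears_cloaked(serve, browser_ua, crawler_ua))
print(appears_cloaked(lambda ua: "same page for everyone", browser_ua, crawler_ua))
```

Note that this check only varies the user-agent. A site that cloaks by IP address would serve the same visitor page to both requests and slip right past it.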
This web spam taxonomy, despite being somewhat dated (the document itself has no date, but the bibliography references other papers from 2005 so it's no older than that), makes for a nice segue into a series of web spam detection posts I have in the works. Stay tuned!