Statistically Speaking, That Page Is Spam
In a previous post I covered a Microsoft Research paper that discussed how static factors could be used to improve search ranking results above and beyond what PageRank alone could do. Rolled together, these factors formed what the authors called fRank, a measure of the quality of a web page. In this post I'm going to cover another research paper that looks at the other end of the quality spectrum. That is, what can be done algorithmically to identify a given page or domain as spam? This post is based on a 2004 paper from the WebDB workshop titled Spam, Damn Spam, and Statistics.
We've all seen spam pages. More often than not we can point to a page and call it spam with a high degree of certainty. Our brains are good at quickly identifying the elements of a page that mark it as spam. Alas, computers are less capable of such acts of intuition, but a certain category of spam can be easily and methodically classified as such.
Spamming techniques are numerous, and cataloguing them all is beyond the scope of the research paper, but some common approaches include:
- Loading pages with popular, but irrelevant keywords. Commonly called keyword stuffing.
- Synthesizing many pages each with a narrow topic focus which in turn redirect to the page that needs to receive the traffic. Commonly called doorway pages.
- Synthesizing many pages knowing that each will obtain a minimum PageRank which can all be channeled to the key page.
The first approach is easily detected using term vector analysis. The other two are the subject of the research paper.
Using two datasets of 150 million URLs and 429 million URLs, respectively, the Microsoft researchers set out to demonstrate that spam pages exhibit statistical anomalies that could be used to accurately separate spam from non-spam pages. These statistical anomalies include:
- Length of host name. The longer the host name, the more likely the host is spam.
- Host name resolutions to the same IP. The more host names that resolve to the same IP, the more likely their pages are spam.
- Host-machine ratio for links. The more outbound links to hosts that converge to a small set of IPs, the more likely those pages are spam.
- The distribution of links embedded on a page (out-links) vs. those pointing to a page (in-links). Each distribution should follow a Zipfian distribution, and outliers are likely spam.
- Variation in the words used across many pages on a host while the number of words per page remains nearly constant.
- A high rate of page mutation. That is, pages that change content frequently. Note that news sites, which have just a few constantly changing pages with many otherwise static pages, do not get flagged by this measure.
- Excessive replication of content across domains.
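To make a few of these signals concrete, here is a minimal Python sketch of three of them: host name length, host names per IP, and near-constant word counts across a host's pages. It is only an illustration, not the paper's method. The researchers worked from distributions observed over hundreds of millions of pages, whereas the function names, the toy data, and the threshold values below are all assumptions made up for this example.

```python
from collections import defaultdict
from statistics import pvariance

# Hypothetical thresholds, chosen for illustration only. Real cut-offs
# would have to come from the distributions observed in an actual crawl.
MAX_HOST_LENGTH = 45
MAX_HOSTS_PER_IP = 3
MIN_WORD_COUNT_VARIANCE = 1.0


def long_host_names(hosts, max_length=MAX_HOST_LENGTH):
    """Return the host names longer than max_length characters."""
    return {host for host in hosts if len(host) > max_length}


def hosts_sharing_ip(host_to_ip, max_hosts=MAX_HOSTS_PER_IP):
    """Return hosts whose IP address serves more than max_hosts host names."""
    hosts_by_ip = defaultdict(set)
    for host, ip in host_to_ip.items():
        hosts_by_ip[ip].add(host)
    flagged = set()
    for hosts in hosts_by_ip.values():
        if len(hosts) > max_hosts:
            flagged |= hosts
    return flagged


def uniform_word_counts(word_counts_by_host,
                        min_variance=MIN_WORD_COUNT_VARIANCE,
                        min_pages=3):
    """Return hosts whose pages all have nearly the same word count."""
    flagged = set()
    for host, counts in word_counts_by_host.items():
        # Skip hosts with too few pages to say anything meaningful.
        if len(counts) >= min_pages and pvariance(counts) < min_variance:
            flagged.add(host)
    return flagged


if __name__ == "__main__":
    # Toy data; every host name and address here is invented.
    host_to_ip = {
        "a.example": "192.0.2.1",
        "b.example": "192.0.2.1",
        "c.example": "192.0.2.1",
        "d.example": "192.0.2.1",
        "news.example": "192.0.2.9",
    }
    word_counts = {
        "spam.example": [500, 500, 500, 500],
        "blog.example": [120, 900, 340],
    }
    print(sorted(hosts_sharing_ip(host_to_ip)))
    print(sorted(uniform_word_counts(word_counts)))
```

Note that each check flags a host on a single signal. A host tripping one threshold is a candidate for closer inspection, not a verdict, and a production system would presumably combine many such signals.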
For me there are two takeaways from this research. The first is that spam detection is not just about detecting what a page is, but also what it isn't. That is, a page that differs from most others by some statistical measure stands out.
The second takeaway is that this paper supports the notion that there is such a thing as over-optimizing a page, but the threshold is likely changing over time. By that I mean some overly exuberant SEO may push a site out of the statistical norms and into the "looks like spam" bucket. At the same time, the statistical norms are in flux: as more and more sites are optimized, what counts as normal shifts along with those optimization efforts.