Using Link Structures to Classify Web Spam
In an earlier post I summarized content from a research paper that provided a Web Spam Taxonomy. That paper is a few years old, but I believe it still provides a good foundation for discussions about web spam. In this post, I'm going to walk through a paper titled Improving Web Spam Classifiers Using Link Structure, written by Qingqing Gan and Torsten Suel of the CIS Department at Polytechnic University in Brooklyn, NY. In the world of information retrieval research, this paper is quite current, having been published in May 2007.
Spam, in the context of this research paper, falls into two categories: content spam and link spam. There has been a lot of research focused on dealing with web spam (some of which I will get to in other posts) such as:
- Propagating distrust by reversing links aka BadRank.
- Promoting trust from good sites in order to demote spam aka TrustRank.
- Using statistical analysis to identify spam since spam pages typically vary from the average page.
- Developing a SpamRank metric that uses PageRank value distribution in the in-coming pages.
This research paper describes a basic classifier (consisting of 20+ content and link-based features) which is then enhanced in two different ways by integrating additional neighborhood features. The base feature set includes:
- Number of words in a page.
- Average length of words in a page.
- Fraction of words drawn from globally popular words.
- Fraction of globally popular words used in a page.
- Fraction of visible content. This sounds like a code-to-content ratio.
- Number of words in the page title.
- Amount of anchor text in a page to help detect pages stuffed with links.
- Compression rate of the page, using gzip. Not sure what this one is for, but I'm guessing repetitive spam text (think keyword stuffing) compresses more than normal text does.
- Percentage of pages in most populated level.
- Top level page expansion ratio.
- In-links per page.
- Out-links per page.
- Out-links per in-link.
- Top-level in-link portion.
- Average level of in-links.
- Average level of out-links.
- Percentage of in-links to most popular level.
- Percentage of out-links from most emitting level.
- Cross-links per page.
- Top-level internal in-links per page on this site.
- Average level of page in this site.
- Number of hosts in the domain. The more hosts, the more likely the domain is spam.
- Ratio of pages in this host to pages in the domain.
- Number of hosts on the same IP address. Makes you wonder about IP sharing hosts like MediaTemple, no?
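To make the feature list above concrete, here's a minimal sketch of how a handful of the content and link features might be computed from a raw page. The function name, the tag-stripping approach, and the tiny popular-word list are my own illustrative assumptions, not the authors' exact definitions.

```python
import gzip
import re

# Illustrative stand-in for a list of globally popular words.
POPULAR_WORDS = {"the", "a", "of", "and", "to", "in", "is", "for", "on", "it"}

def page_features(html, in_links, out_links):
    """Compute a few of the paper's content/link features (sketch only)."""
    # Strip tags to approximate the visible text of the page.
    text = re.sub(r"<[^>]+>", " ", html)
    words = re.findall(r"[a-z']+", text.lower())
    anchor_text = " ".join(re.findall(r"<a[^>]*>(.*?)</a>", html, re.S | re.I))

    return {
        "num_words": len(words),
        "avg_word_len": sum(map(len, words)) / len(words) if words else 0.0,
        "frac_popular": (sum(w in POPULAR_WORDS for w in words) / len(words)
                         if words else 0.0),
        "frac_visible": len(text.strip()) / len(html) if html else 0.0,
        # Repetitive spam text tends to compress unusually well.
        "compression_rate": (len(html) / len(gzip.compress(html.encode()))
                             if html else 0.0),
        "anchor_text_frac": len(anchor_text) / max(len(text), 1),
        "in_links": len(in_links),
        "out_links": len(out_links),
        "out_per_in": len(out_links) / max(len(in_links), 1),
    }
```

In the paper, a feature vector like this would then be fed to a standard classifier (the authors use decision trees) to produce a spam/non-spam label.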
Combined, the above features provide a score for any given page, with some measure of confidence that it is or isn't spam. To refine this score, the authors also included neighborhood data: by looking at which sites a page links out to and which sites link in to it, classification accuracy improves. Spam sites generally point to a lot of other spam, and spam sites generally receive a disproportionate number of links from other spam sites. Combining the feature data with the neighborhood data results in a spam score. But the authors wanted to improve on this basic classification and did so in two ways. One of the methods is described below; the other isn't, because I couldn't wrap my head around the description. I suggest downloading the PDF if you're interested.
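One simple way to picture blending a site's own score with its neighborhood is a weighted average over the scores of in- and out-neighbors. The mixing weight here is an illustrative parameter of my own, not a value from the paper.

```python
def neighborhood_score(base_score, in_neighbor_scores, out_neighbor_scores,
                       weight=0.5):
    """Blend a site's own spam score (0 = clean, 1 = spam) with the
    average score of its link neighborhood. Sketch only: the paper
    adds neighborhood statistics as extra classifier features rather
    than averaging scores directly."""
    neighbors = in_neighbor_scores + out_neighbor_scores
    if not neighbors:
        return base_score
    neighborhood = sum(neighbors) / len(neighbors)
    return (1 - weight) * base_score + weight * neighborhood
```

The intuition matches the paper's observation: a page that looks borderline on its own features gets pushed toward spam when most of its neighbors score as spam.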
Simply put, the relabeling approach aims to flip the label of a site from non-spam to spam or vice-versa. To do this, each site in a neighborhood is given a label with a certain level of confidence; each site's own label comes from the feature set listed above. This neighborhood label is then compared to the site's label. If the labels disagree and the neighborhood's label carries more confidence, the site's label is switched to match that of the neighborhood. Otherwise the label stays the same.
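The relabeling step above can be sketched as follows. Aggregating the neighborhood by a confidence-weighted vote is my assumption for illustration; the paper's exact aggregation rule may differ.

```python
def relabel(site_label, site_conf, neighbor_labels, neighbor_confs):
    """Flip a site's label when its neighborhood disagrees with higher
    confidence. Labels are 'spam'/'non-spam'; confidences are in [0, 1].
    Sketch only, not the authors' exact rule."""
    if not neighbor_labels:
        return site_label
    # Confidence-weighted vote over the neighborhood's labels.
    votes = {}
    for label, conf in zip(neighbor_labels, neighbor_confs):
        votes[label] = votes.get(label, 0.0) + conf
    hood_label = max(votes, key=votes.get)
    hood_conf = votes[hood_label] / sum(votes.values())
    # Switch only when the neighborhood disagrees and is more confident.
    if hood_label != site_label and hood_conf > site_conf:
        return hood_label
    return site_label
```

So a site initially labeled non-spam with low confidence, surrounded by confidently-labeled spam neighbors, gets relabeled as spam.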
Boiled down, this translates into your site being judged by the friends it keeps.