Detecting Splogs with Self-Similarity Analysis
A while back Google's search results were littered with splogs (spam blogs). It was common to search for a term, click on the first result, and land on a page that had advertising fill the area above the fold and useless content below. I'm not sure when, but the problem must have reached critical mass because Google cleaned up the results. And as with most things Google, the cleanup was very likely algorithmic and automated. Ever wonder how they might have accomplished this feat? A research paper from May 8, 2007 titled Splog Detection Using Self-Similarity Analysis on Blog Temporal Dynamics may be the answer.
Blogs differ from regular websites suggesting that the splog detection process needs to be different than what is done to detect typical web spam. With web spam, content is usually static whereas splogs typically employ an automated framework to regularly generate new content. This constant change is what enables temporal analysis of both the content and link structure which in turn reveals splog-identifying patterns or anomalies.
Why go through all this trouble? Splogs not only degrade the end-users experience, but they are a significant tax on network and storage resources. Splogs have a established a significant foothold as can be seen from some data the authors collected using primary and secondary research:
- identified 2.7 million splogs out of 20.3 million (over 10% splogs)
- an average of 44 of the top 100 blogs search results in the three popular blog search engines came from splogs
- 75% of new pings come from splogs
- 50% of claimed blogs pinging the website weblogs.com are splogs
Characteristics of a Splog
When evaluating a splog, the evaluation is done at a site level rather than at a page level. Also, if spam elements such as trackback spam on an otherwise legitimate blog don't make that blog a splog. With this in mind, here are the typical characteristics of a splog:
- Content that is generated algorithmically including random gibberish, copied snippets from other sites, and weaved content.
- No added value e.g. the content is nonsensical or is a duplicate of another site.
- A hidden agenda likely one tied to generating advertising revenue. Arguably most blogs have some agenda other than the desire to share and communicate so this item alone doesn't mean a blog is a splog.
The machine-generated nature of splogs make them good candidates for statistical assessment. This assessment can include looking at:
- Post Time: Two measures that capture regularity in posting time (micro) e.g. posts go live in the morning before the blogger's “real” job as well as a macro time view e.g. a large gap in posting due to a vacation.
- Post Content: A measure of the topic drift by the blogger. Commonly a blogger will remain focused on a topic, but will sometimes write about other topics.
- Post Links: The links on the blog can be telling. A large proportion directed to a particular domain, for example, suggests a relationship with that destination domain.
The math associated with gathering this data and developing the signatures for identifying a blog vs. splog are quite complex. For details, I recommend downloading the PDF I linked to above. However, as you can see below, the authors did provide strong evidence, in the form of visuals, that the patterns that splogs produce are distinctive.