10 Graphs Reveal Web Spam Patterns
They say a picture is worth a thousand words. So here are 10,000 words' worth of web spam data from a research paper titled Detecting Spam Web Pages through Content Analysis by Alexandros Ntoulas et al.
Note: The pink line represents the probability of spam.
1. Top Level Domain
Relatively speaking, there is more spam on .biz domains than on other domains.
2. Language
And apparently spam is quite popular on French-language pages.
3. Compression Ratio
Repeated words (e.g. from keyword stuffing) compress well, so a page with an unusually high compression ratio is more likely to be spam.
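To make the idea concrete, here is a minimal Python sketch of a compression ratio check. It assumes the ratio is uncompressed size divided by compressed size and uses zlib's DEFLATE as the compressor; the function name and the sample strings are made up for illustration, and the paper's exact setup may differ.

```python
import zlib


def compression_ratio(text: str) -> float:
    """Return uncompressed size / compressed size (higher = more repetitive)."""
    raw = text.encode("utf-8")
    if not raw:
        return 0.0  # avoid dividing by the size of an empty payload
    return len(raw) / len(zlib.compress(raw))


# Hypothetical sample pages: one keyword-stuffed, one with varied tokens.
stuffed = "cheap pills buy cheap pills online " * 200
varied = " ".join(f"token{i}" for i in range(1000))

if __name__ == "__main__":
    print(f"stuffed: {compression_ratio(stuffed):.1f}")
    print(f"varied:  {compression_ratio(varied):.1f}")
```

The stuffed page should score far higher than the varied one, which is the pattern the graph is picking up on.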
4. Visible Content
Not to be confused with hidden content, visible content in this context basically refers to the text-to-code ratio: how much of a page is visible text rather than markup. Guess what? A page with a higher proportion of code is actually less likely to be spam.
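As a rough illustration, the visible content fraction can be approximated with Python's standard html.parser module. This sketch assumes the measure is the length of non-markup text divided by the total page length, and it ignores script and style contents; the class and function names are hypothetical, not taken from the paper.

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Collects text outside tags, skipping <script> and <style> bodies."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.chunks.append(data)


def visible_fraction(html: str) -> float:
    """Return len(visible text) / len(whole page), or 0.0 for an empty page."""
    if not html:
        return 0.0
    parser = TextCollector()
    parser.feed(html)
    parser.close()  # flush any buffered trailing text
    return len("".join(parser.chunks)) / len(html)


if __name__ == "__main__":
    print(visible_fraction("<p>hello</p>"))
```

A page that is almost all text scores close to 1, while a markup-heavy page scores much lower.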
5. Number of Words
Longer documents are more likely to be spam.
6. Average Word Length
Long words, generally formed by running other words together (such as freemp3), result in an above-average word length.
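A quick sketch of how average word length might be computed in Python. The tokenization rule (runs of ASCII letters and digits) is an assumption made for this example; the paper may split words differently.

```python
import re


def average_word_length(text: str) -> float:
    """Mean length of alphanumeric tokens, or 0.0 if there are none."""
    words = re.findall(r"[A-Za-z0-9]+", text)
    if not words:
        return 0.0
    return sum(len(w) for w in words) / len(words)


if __name__ == "__main__":
    print(average_word_length("free mp3"))  # two short words
    print(average_word_length("freemp3"))   # one run-together word
```

Running words together, as in freemp3, pushes the average up even though the underlying vocabulary is the same.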
7. Number of Words in Title
This one shouldn't surprise anyone. Stuffing the title with keywords is common with spammers.
8. Words in Anchor Text
A page with too large a proportion of its text inside links (anchor text) is a good indicator of web spam.
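Here is a minimal Python sketch of the anchor text measure, assuming it is the number of words inside a tags divided by the number of words on the page. The names are hypothetical, and for brevity the sketch counts script and style contents as ordinary text, which a real implementation would probably exclude.

```python
from html.parser import HTMLParser


class AnchorWordCounter(HTMLParser):
    """Counts words inside <a>...</a> versus all words on the page."""

    def __init__(self):
        super().__init__()
        self.anchor_depth = 0
        self.anchor_words = 0
        self.total_words = 0

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.anchor_depth += 1

    def handle_endtag(self, tag):
        if tag == "a" and self.anchor_depth:
            self.anchor_depth -= 1

    def handle_data(self, data):
        count = len(data.split())
        self.total_words += count
        if self.anchor_depth:
            self.anchor_words += count


def anchor_text_fraction(html: str) -> float:
    """Return anchor words / total words, or 0.0 for a page with no words."""
    counter = AnchorWordCounter()
    counter.feed(html)
    counter.close()  # flush any buffered trailing text
    if not counter.total_words:
        return 0.0
    return counter.anchor_words / counter.total_words


if __name__ == "__main__":
    page = '<p>Read <a href="/x">cheap pills</a> now</p>'
    print(anchor_text_fraction(page))
```

A link farm page made up almost entirely of links would score close to 1.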
9 & 10. Most Frequent Words in Corpus Common with Text
These two graphs show the fraction of words on a page that are contained in the set of the 200 or 500 words that occur most frequently in the English portion of the research paper authors' 105 million document corpus.
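The measure described above can be sketched in Python as follows. The tiny corpus, the tokenizer, and the function names are all stand-ins invented for this example; the paper's corpus and word-splitting rules are not reproduced here.

```python
import re
from collections import Counter


def tokenize(text: str) -> list:
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def top_words(corpus_docs, n: int) -> set:
    """Return the n most frequent words across all corpus documents."""
    counts = Counter()
    for doc in corpus_docs:
        counts.update(tokenize(doc))
    return {word for word, _ in counts.most_common(n)}


def popular_word_fraction(page_text: str, popular: set) -> float:
    """Fraction of the page's words that fall in the popular-word set."""
    words = tokenize(page_text)
    if not words:
        return 0.0
    return sum(word in popular for word in words) / len(words)


if __name__ == "__main__":
    # Toy stand-in for a large corpus; n would be 200 or 500 in the graphs.
    corpus = ["the cat sat on the mat", "the dog and the cat"]
    popular = top_words(corpus, 2)
    print(popular_word_fraction("the cat ate the fish", popular))
```

In practice the popular-word set would be built once from the whole corpus and then reused for every page being scored.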