10 Graphs Reveal Web Spam Patterns
They say a picture is worth a thousand words. So here are 10,000 words' worth of web spam data from a research paper titled Detecting Spam Web Pages through Content Analysis by Alexandros Ntoulas et al.
Note: The pink line represents the probability of spam.
1. Top Level Domain
Relatively speaking, there is more spam on .biz domains than on other domains.
![Web Spam: Top Level Domain](https://infolific.com/images/internet/web-spam-top-level-domain.png)
2. Language
And apparently spam is quite popular with the French.
![Web Spam: Language](https://infolific.com/images/internet/web-spam-language.png)
3. Compression Ratio
Repeated words (e.g. from keyword stuffing) compress well, so a page with an unusually high compression ratio is more likely to be spam.
![Web Spam: Compression Ratio](https://infolific.com/images/internet/web-spam-compression-ratio.png)
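To make the compression idea concrete, here is a minimal sketch that measures how much a block of text shrinks when compressed. It assumes the ratio is defined as uncompressed size divided by compressed size and uses Python's `zlib` as a stand-in compressor; the function name and sample strings are illustrative and not taken from the paper.

```python
import zlib


def compression_ratio(text: str) -> float:
    """Uncompressed size divided by compressed size (higher = more repetitive)."""
    raw = text.encode("utf-8")
    if not raw:
        return 0.0
    return len(raw) / len(zlib.compress(raw))


# Hypothetical samples: a keyword-stuffed string and an ordinary sentence.
stuffed = "cheap free mp3 download " * 200
varied = "The quick brown fox jumps over the lazy dog near a quiet riverbank at dawn."

if __name__ == "__main__":
    print(f"stuffed: {compression_ratio(stuffed):.1f}")
    print(f"varied:  {compression_ratio(varied):.1f}")
```

The stuffed sample should score far higher than the ordinary sentence, which is the pattern the graph describes.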
4. Visible Content
Not to be confused with hidden content, visible content in this context refers to the fraction of a page that is visible text rather than HTML markup, i.e. the code to text ratio. Guess what? A page with a higher proportion of code is actually less likely to be spam.
![Web Spam: Fraction of Visible Content](https://infolific.com/images/internet/web-spam-fraction-visible-content.png)
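One way to approximate the fraction of visible content is to strip the markup and compare the length of what remains to the length of the full page. The sketch below uses Python's standard `html.parser`; the class and function names are hypothetical, and the exact definition used in the paper may differ.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects text nodes, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.chunks.append(data)


def visible_fraction(html: str) -> float:
    """Length of whitespace-normalized visible text divided by total page length."""
    if not html:
        return 0.0
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    text = " ".join("".join(parser.chunks).split())
    return len(text) / len(html)


# Hypothetical pages: one mostly markup, one mostly text.
markup_heavy = "<div><table><tr><td><span>hi</span></td></tr></table></div>"
text_heavy = "<p>" + "word " * 50 + "</p>"

if __name__ == "__main__":
    print(visible_fraction(markup_heavy), visible_fraction(text_heavy))
```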
5. Number of Words
Longer documents are more likely to be spam.
![Web Spam: Number of Words](https://infolific.com/images/internet/web-spam-number-of-words.png)
6. Average Word Length
Long words, generally formed by running other words together (freemp3, for example), push a page's average word length above the norm.
![Web Spam: Average Word Length](https://infolific.com/images/internet/web-spam-average-word-length.png)
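Word count and average word length are simple to compute once a page's text has been extracted. This is a rough sketch that treats any run of ASCII letters and digits as a word; that tokenization rule and the function name are assumptions, not the paper's definition.

```python
import re

# Assumption: a "word" is any run of ASCII letters or digits.
WORD_RE = re.compile(r"[A-Za-z0-9]+")


def word_stats(text: str) -> tuple[int, float]:
    """Return (number of words, average word length) for a block of text."""
    words = WORD_RE.findall(text)
    if not words:
        return 0, 0.0
    return len(words), sum(len(w) for w in words) / len(words)


if __name__ == "__main__":
    print(word_stats("free mp3"))  # two short words
    print(word_stats("freemp3"))   # one long, run-together word
```

Running the two words together halves the word count and doubles the average length, which is how a page full of compound keywords drifts away from normal text.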
7. Number of Words in Title
This one shouldn't surprise anyone. Stuffing the title with keywords is common among spammers.
![Web Spam: Number of Words in Title](https://infolific.com/images/internet/web-spam-number-of-words-in-title.png)
8. Words in Anchor Text
A page whose text consists largely of link (anchor) text is a good indicator of web spam.
![Web Spam: Fraction of Text as Anchor Words](https://infolific.com/images/internet/web-spam-fraction-anchor-words.png)
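The anchor text measure can be sketched by counting the words that sit inside `<a>` tags and dividing by all the words on the page. The parser below is a minimal illustration built on Python's standard `html.parser`; the names are hypothetical and the word splitting is deliberately naive (whitespace only).

```python
from html.parser import HTMLParser


class _AnchorCounter(HTMLParser):
    """Counts all visible words and the subset that appears inside <a> tags."""

    def __init__(self):
        super().__init__()
        self.total_words = 0
        self.anchor_words = 0
        self._anchor_depth = 0
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._anchor_depth += 1
        elif tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag == "a" and self._anchor_depth > 0:
            self._anchor_depth -= 1
        elif tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth:
            return
        n = len(data.split())
        self.total_words += n
        if self._anchor_depth:
            self.anchor_words += n


def anchor_word_fraction(html: str) -> float:
    """Fraction of a page's visible words that are link text."""
    parser = _AnchorCounter()
    parser.feed(html)
    parser.close()
    if parser.total_words == 0:
        return 0.0
    return parser.anchor_words / parser.total_words


if __name__ == "__main__":
    page = '<p>Read our guide <a href="/a">cheap pills</a> today</p>'
    print(anchor_word_fraction(page))
```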
9 & 10. Most Frequent Words in Corpus Common with Text
These two graphs show the fraction of words on a page that are among the 200 or 500 most frequent words in the English portion of the paper authors' 105 million document corpus.
![Web Spam: Fraction of 200 Most Frequent Words in Corpus Common with Text](https://infolific.com/images/internet/web-spam-fraction-200-common-corpus.png)
![Web Spam: Fraction of 500 Most Frequent Words in Corpus Common with Text](https://infolific.com/images/internet/web-spam-fraction-500-common-corpus.png)
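As a rough illustration of this measure, the sketch below builds a set of the N most frequent words from a corpus and then checks what share of a page's words fall inside that set. The tokenizer, the function names, and the tiny stand-in corpus are all assumptions; the paper's corpus and word list are not reproduced here.

```python
import re
from collections import Counter

# Assumption: lowercase runs of letters, digits, and apostrophes count as words.
WORD_RE = re.compile(r"[a-z0-9']+")


def tokenize(text: str) -> list[str]:
    return WORD_RE.findall(text.lower())


def top_words(corpus_docs: list[str], n: int) -> set[str]:
    """The n most frequent words across a list of documents."""
    counts = Counter()
    for doc in corpus_docs:
        counts.update(tokenize(doc))
    return {word for word, _ in counts.most_common(n)}


def popular_word_fraction(page_text: str, popular: set[str]) -> float:
    """Fraction of the page's words that belong to the popular-word set."""
    words = tokenize(page_text)
    if not words:
        return 0.0
    return sum(w in popular for w in words) / len(words)


if __name__ == "__main__":
    # Tiny stand-in corpus; a real run would use N = 200 or 500 on a large corpus.
    popular = top_words(["the cat and the dog", "the end of the road"], 1)
    print(popular_word_fraction("the best of the free mp3", popular))
```

A page made up almost entirely of keywords would contain few of the everyday words that dominate a natural-language corpus, so its fraction would be low.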