10 Graphs Reveal Web Spam Patterns

They say a picture is worth 1,000 words. So here's 10,000 words of web spam data from a research paper titled Detecting Spam Web Pages through Content Analysis by Alexandros Ntoulas et al.

Note: The pink line represents the probability of spam.

1. Top Level Domain

Relatively speaking, there is more spam on .biz domains than on other domains.

Web Spam: Top Level Domain

2. Language

And apparently spam is quite common on French-language pages.

Web Spam: Language

3. Compression Ratio

Repeated words (e.g. from keyword stuffing) make a page more compressible, so spam pages generally show higher compression ratios.

Web Spam: Compression Ratio
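
To get a feel for why repetition shows up in this metric, here is a minimal sketch. It assumes the ratio is uncompressed size divided by compressed size, and it uses Python's zlib as a stand-in compressor; the function name and the sample strings are made up for illustration.

```python
import zlib

def compression_ratio(text: str) -> float:
    """Uncompressed size divided by compressed size, both in bytes."""
    raw = text.encode("utf-8")
    if not raw:
        return 0.0
    return len(raw) / len(zlib.compress(raw))

# Hypothetical samples: a keyword-stuffed string and an ordinary sentence.
stuffed = "cheap free mp3 download " * 200
varied = ("They say a picture is worth 1,000 words, so this post walks "
          "through ten graphs taken from a research paper on web spam.")

print(round(compression_ratio(stuffed), 1))
print(round(compression_ratio(varied), 1))
```

The stuffed string compresses far better than the ordinary sentence, which is the effect the graph picks up on.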

4. Visible Content

Not to be confused with hidden content, visible content in this context basically refers to the code to text ratio. Guess what? A page with a higher ratio of code is actually less likely to be spam.

Web Spam: Fraction of Visible Content
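
A rough way to approximate this metric yourself, assuming "visible content" means the text left after dropping markup, scripts, and styles, measured in characters against the length of the whole page. The paper may measure it differently; the class and function names here are made up.

```python
from html.parser import HTMLParser

class VisibleTextExtractor(HTMLParser):
    """Collects text nodes, ignoring anything inside <script> or <style>."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.chunks.append(data)

def visible_fraction(html: str) -> float:
    """Visible text length divided by total page length (characters)."""
    if not html:
        return 0.0
    parser = VisibleTextExtractor()
    parser.feed(html)
    parser.close()
    # Collapse runs of whitespace so indentation does not count as content.
    visible = " ".join("".join(parser.chunks).split())
    return len(visible) / len(html)
```

A page that is mostly prose scores close to 1, while a page that is mostly tags and scripts scores close to 0.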

5. Number of Words

Longer documents are more likely to be spam.

Web Spam: Number of Words

6. Average Word Length

Long words, generally formed by running other words together (e.g. freemp3), push a page's average word length above normal.

Web Spam: Average Word Length
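
Both the word count and the average word length are easy to compute. This is only a sketch: it assumes a word is any run of letters and digits, which is my simplification and not necessarily the paper's tokenizer.

```python
import re

# Assumption: a "word" is a run of ASCII letters and digits.
WORD_RE = re.compile(r"[A-Za-z0-9]+")

def word_stats(text: str) -> tuple[int, float]:
    """Return (number of words, average word length) for a piece of text."""
    words = WORD_RE.findall(text)
    if not words:
        return 0, 0.0
    return len(words), sum(len(w) for w in words) / len(words)

print(word_stats("free mp3 music"))     # (3, 4.0)
print(word_stats("freemp3 freemusic"))  # (2, 8.0)
```

Gluing the same keywords together halves the word count here and doubles the average length.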

7. Number of Words in Title

This one shouldn't surprise anyone. Stuffing the title with keywords is common among spammers.

Web Spam: Number of Words in Title
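
Counting title words takes only a few lines with Python's built-in HTML parser. This sketch assumes words are whitespace-separated; the names are made up for illustration.

```python
from html.parser import HTMLParser

class TitleExtractor(HTMLParser):
    """Collects the text found inside the <title> element."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.parts.append(data)

def title_word_count(html: str) -> int:
    """Number of whitespace-separated words in the page title."""
    parser = TitleExtractor()
    parser.feed(html)
    parser.close()
    return len("".join(parser.parts).split())

page = "<html><head><title>Cheap mp3 free mp3 download</title></head></html>"
print(title_word_count(page))  # 5
```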

8. Words in Anchor Text

A page where a large proportion of the text is link (anchor) text is a good indicator of web spam.

Web Spam: Fraction of Text as Anchor Words
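
Here is one way to estimate this fraction, assuming it means words inside `<a>` elements divided by all words on the page (scripts and styles excluded). That reading, and the names below, are my own assumptions.

```python
from html.parser import HTMLParser

class AnchorWordCounter(HTMLParser):
    """Counts all words on a page and the words that sit inside <a> tags."""

    def __init__(self):
        super().__init__()
        self.anchor_depth = 0
        self.skip_depth = 0
        self.total_words = 0
        self.anchor_words = 0

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.anchor_depth += 1
        elif tag in ("script", "style"):
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag == "a" and self.anchor_depth:
            self.anchor_depth -= 1
        elif tag in ("script", "style") and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        if self.skip_depth:
            return
        n = len(data.split())
        self.total_words += n
        if self.anchor_depth:
            self.anchor_words += n

def anchor_fraction(html: str) -> float:
    """Fraction of a page's words that appear as anchor text."""
    counter = AnchorWordCounter()
    counter.feed(html)
    counter.close()
    if counter.total_words == 0:
        return 0.0
    return counter.anchor_words / counter.total_words

print(anchor_fraction("<p>read this <a href='x'>cheap pills</a> now</p>"))  # 0.4
```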

9 & 10. Most Frequent Words in Corpus Common with Text

These two graphs plot the fraction of words on a page that appear among the 200 or 500 most frequent words in the English portion of the paper authors' 105-million-document corpus.

Web Spam: Fraction of 200 Most Frequent Words in Corpus Common with Text
Web Spam: Fraction of 500 Most Frequent Words in Corpus Common with Text
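
The same idea can be tried on a small scale. This sketch builds the set of the N most frequent words from a corpus, then measures what fraction of a page's words fall in that set. The tiny corpus, the tokenizer, and the function names are all hypothetical stand-ins for the paper's much larger setup.

```python
import re
from collections import Counter

def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into runs of letters (a simplification)."""
    return re.findall(r"[a-z]+", text.lower())

def top_words(corpus: list[str], n: int) -> set[str]:
    """The n most frequent words across all documents in the corpus."""
    counts = Counter(word for doc in corpus for word in tokenize(doc))
    return {word for word, _ in counts.most_common(n)}

def fraction_common(page_text: str, common: set[str]) -> float:
    """Fraction of the page's words that belong to the common-word set."""
    words = tokenize(page_text)
    if not words:
        return 0.0
    return sum(word in common for word in words) / len(words)

# Toy corpus for illustration only.
corpus = ["the cat and the dog", "the bird and the fish"]
common = top_words(corpus, 2)  # {"the", "and"}
print(fraction_common("the cat and the hat", common))  # 0.6
```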