fRank Takes on PageRank
Where have I been? That's the question that my readers (both of you, not including my brother) may have asked in the last couple of months. I've been where I've always been, but I've been reading much more than I've been writing. Some of that reading has been research papers of the sort put out by the International World Wide Web Conference Committee (IW3C2) or the Special Interest Group on Information Retrieval (SIGIR).
One of these papers came out of Microsoft and is called Beyond PageRank: Machine Learning for Static Ranking. Ooh. Anything with PageRank in the title must be good, right? Well, in this case, actually, yes. The paper was an excellent read and I highly recommend it despite it being from 2006. Because I have a memory like a sieve, I've decided to note some of the highlights for future reference.
A good query-independent ranking or static ranking algorithm is key for search engine success and provides:
- A general indication of the overall quality of a page.
- The ability for the search engine to quickly stop searching for results once a particular threshold of quality has been passed.
- A clue to setting the priority for what pages should be crawled first.
It's generally accepted that Google's PageRank is the best method for the static ranking of Web pages, but the authors of this paper set out to demonstrate otherwise. They state their argument as follows:
“There are a number of simple URL- or page-based features that significantly outperform PageRank (for the purposes of statically ranking web pages) despite ignoring the structure of the web. We combine these and other static features using machine learning to achieve a ranking system that is significantly better than PageRank (in pairwise agreement with human labels).”
Pretty bold statement, right?
I won't attempt to describe what is meant by a machine learning approach, but some of the benefits cited by the authors include:
- Multiple measures that make the ranking hard for malicious users to manipulate (especially if the measures are kept secret).
- A learning algorithm that can de-emphasize a feature should it become subject to manipulation.
- The ability to take advantage of advances in the machine learning field, e.g. it is apparently possible to adjust the ranking model ahead of the spammers' attempts to circumvent it.
Enter RankNet, the Microsoft researchers' implementation of a “modified standard neural network back-prop algorithm”, and with it a new measure called fRank (for feature-based ranking).
The paper includes details of various experiments which are worth reading, but the gist of the results is that fRank performs significantly better than PageRank despite a lack of information about the web graph. As a side benefit, compared to PageRank, fRank tends to favor pages that web users actually prefer rather than those preferred by web authors. I had to mull that one over for a while.
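To make the "pairwise agreement with human labels" yardstick a bit more concrete, here is a minimal sketch of how I understand such a measure to work: take every pair of pages that human judges rated differently and count how often the ranker puts them in the same order. The function name, the labels and the scores are all made up for illustration; this is not the authors' evaluation code.

```python
from itertools import combinations


def pairwise_accuracy(human_labels, scores):
    """Fraction of differently-labeled page pairs that the scores order
    the same way the human labels do. Hypothetical helper for illustration."""
    agree = 0
    total = 0
    for i, j in combinations(range(len(human_labels)), 2):
        if human_labels[i] == human_labels[j]:
            continue  # equally rated pages say nothing about ordering
        total += 1
        label_diff = human_labels[i] - human_labels[j]
        score_diff = scores[i] - scores[j]
        if label_diff * score_diff > 0:  # same sign means same order
            agree += 1
    return agree / total if total else 0.0


# Invented data: four pages with human quality ratings (higher is better)
# and the static scores that two imaginary rankers assign to them.
labels = [3, 2, 2, 0]
ranker_a = [0.9, 0.4, 0.6, 0.1]
ranker_b = [0.2, 0.8, 0.5, 0.4]

print(pairwise_accuracy(labels, ranker_a))
print(pairwise_accuracy(labels, ranker_b))
```

A ranker that orders pages at random would land around 0.5 on a measure like this, so that is the baseline any static ranking has to beat.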
And what simple measures in combination beat the almighty PageRank?
- Popularity as measured by the number of times a page was visited by users over time. The MSN Toolbar provided this data. Yes, the MSN Toolbar, as with other toolbars, could very well be a factor in rankings.
- Anchor text length and number of unique words in that text. I'm not sure what length is optimal, but I guess the authors determined such a value.
- Page elements such as number of words in the body and the frequency of the most common term.
- PageRank as computed on 5 billion pages.
- Domain-level elements such as the number of outlinks on any page and the average PageRank.
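For my own future reference, here is a rough sketch of what combining features like these with pairwise learning could look like. It is a deliberately simplified stand-in: a linear scorer trained with a RankNet-style pairwise logistic loss, not the neural network the paper describes. The feature columns, their values and the function names are all invented for illustration.

```python
import math
import random


def train_pairwise_ranker(features, labels, epochs=200, lr=0.1, seed=0):
    """Learn linear weights so that pages with higher human labels score
    higher, using a pairwise logistic loss and plain SGD.
    A simplified assumption, not the paper's actual model."""
    rng = random.Random(seed)
    weights = [0.0] * len(features[0])
    # Every (better, worse) pair according to the human labels.
    pairs = [(i, j)
             for i in range(len(labels))
             for j in range(len(labels))
             if labels[i] > labels[j]]
    for _ in range(epochs):
        rng.shuffle(pairs)
        for i, j in pairs:
            diff = [a - b for a, b in zip(features[i], features[j])]
            margin = sum(w * d for w, d in zip(weights, diff))
            # loss = log(1 + exp(-margin)); its gradient with respect to
            # the weights is -diff * 1 / (1 + exp(margin)).
            grad_scale = 1.0 / (1.0 + math.exp(margin))
            weights = [w + lr * grad_scale * d
                       for w, d in zip(weights, diff)]
    return weights


def score(weights, page_features):
    """Static score of one page: weighted sum of its features."""
    return sum(w * v for w, v in zip(weights, page_features))


# Invented, pre-normalized features per page:
# [popularity, unique anchor-text words, body word count]
features = [
    [0.9, 0.7, 0.5],
    [0.6, 0.5, 0.8],
    [0.3, 0.4, 0.2],
    [0.1, 0.1, 0.6],
]
labels = [3, 2, 1, 0]  # hypothetical human quality ratings

weights = train_pairwise_ranker(features, labels)
ranking = sorted(range(len(features)),
                 key=lambda i: score(weights, features[i]),
                 reverse=True)
print(weights)
print(ranking)
```

The appeal of the learned approach shows up in the weights: if one feature gets gamed, retraining on fresh human labels should shrink its weight without anyone having to hand-tune the formula.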