The History of Latent Semantic Indexing
It's sometimes fun (well, if you're involved with SEO) to look at how optimization theories form, seem credible, and are still being debated years afterwards. Such is the case with Latent Semantic Indexing, or LSI.
On February 3, 2005, Aaron Wall wrote this about LSI: “Latent semantic indexing allows a search engine to determine what a page is about outside of specifically matching search query text. By placing additional weight on related words in content LSI has a net effect of lowering the value of pages which only match the specific term and do not back it up with related terms.”
At that time, Orion of Search Engine Watch wasn't convinced and wrote, “No thanks. This time I prefer that others demystify LSA/LSI in connection with search engines ranking/indexing.”
Fast forward almost two years and we have Dr. Garcia reminding everyone of “all those LSI-based myths promoted by snake oil marketers, like that there is such thing as LSI-friendly documents, LSI and link popularity and the dumb notion that displaying a tag cloud of terms is evidence that a company has any LSI-like technology. I have a debunked collections of these and similar SEO tales.”
Less than two months later, on February 7, 2007, Bruce Clay posted about recent ranking changes that have been dubbed the minus-950 penalty. In his post he writes that some have speculated that this penalty is a result of a recent Google patent that describes what seems to be a “low-scale version of latent semantic indexing.” This certainly fits, since the patent abstract includes this high-level description: “Phrases are identified that predict the presence of other phrases in documents. Documents are then indexed according to their included phrases. A spam document is identified based on the number of related phrases included in a document.”
So what's the verdict on LSI? I don't know. It's certainly something worth watching. The good news is that all of the recommendations from pro-LSI folks seem to boil down to using different keywords with similar meanings when writing content, which is something that would happen anyway if you write with your users in mind. In addition, link development that keeps an eye on “looking natural” will probably survive any algorithm changes that include LSI factors.
The patent referred to by Bruce Clay is about Phrase-based Indexing and has absolutely nothing to do with LSI. Google has been granted at least 20 patents related to Phrase-based Indexing, and they have written a white paper on Semantic Topic Modeling that describes the frequently co-occurring complete and meaningful phrases that are a strong part of Phrase-based Indexing. It is something worth testing and trying out, but again has absolutely nothing to do with LSI.
Here is a nice "invitation" to SEOs selling "LSI".
http://irthoughts.wordpress.com/2007/07/09/a-call-to-seos-claiming-to-sell-lsi/
Regards
Dr. E. Garcia
Dr. Garcia,
Thanks for stopping by to provide a follow-up. This continues to be a fascinating topic despite LSI-friendliness being thoroughly debunked.
To clear things up, over the weekend I helped an SEO to understand why those who still think they can manipulate LSI for ranking purposes are wasting their time or are following these phonies. This is what I wrote at one point:
Quote starts.
"Regarding putting things in simple terms: Ironically, if you skip the math and look at the results and figures of the last tutorial of the series (#5) (SVD and LSI Tutorial 5: LSI Keyword Research and Co-Occurrence Theory),
it shows a simple example of how term weights in the LSI doc-matrix are redistributed across a small collection of just three docs. Now imagine a collection of 1 million or 1 billion docs. A large matrix, huh?
A search engine implementing LSI on such a huge matrix will redistribute weights across it, too. Then, any small change in just one term provokes a redistribution of weights in the entire matrix because of how the SVD algorithm works. There is no way for SEOs sitting behind a query box to predict such changes. They would need to have access to every doc of the collection and/or know how other publishers made changes at any given point in time to their own docs. Can they? Common sense tells me they can't.
Thus, this explains why optimization strategies cannot affect LSI or the SVD for that matter, unless SEOs have access to the full index of the search engines and have superpowers to read the intention of other publishers or editors out there. To sum up: there is no such thing as "LSI SEO optimization", "LSI-friendly", or "LSI-compatible" documents. Same with LSI and link popularity. Pure nonsense.
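The redistribution effect described above can be reproduced on a toy matrix. What follows is a minimal sketch and not the tutorial's own example: the term list and counts are made up, raw occurrences are used as weights, and the truncated SVD is reduced to a single dimension computed by power iteration (an LSI implementation would keep more dimensions and use a proper SVD routine). The helper names transpose, matvec and rank1_approx are hypothetical.

```python
# Toy illustration with hypothetical data: rows are terms, columns are docs.

def transpose(m):
    return [list(col) for col in zip(*m)]

def matvec(m, v):
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def rank1_approx(a, iters=200):
    """Best rank-1 approximation of matrix a, i.e. an SVD truncated to k = 1.

    Power iteration on (a^T a) finds the dominant right singular vector v;
    the approximation is then (a v) v^T.
    """
    at = transpose(a)
    v = [1.0] * len(at)                  # one entry per doc
    for _ in range(iters):
        v = matvec(at, matvec(a, v))     # v <- (a^T a) v
        n = sum(x * x for x in v) ** 0.5
        v = [x / n for x in v]
    u = matvec(a, v)                     # sigma * left singular vector
    return [[ui * vj for vj in v] for ui in u]

terms = ["gold", "silver", "truck", "shipment"]
before_counts = [
    [1, 0, 1],   # gold
    [0, 1, 1],   # silver
    [1, 1, 0],   # truck
    [1, 1, 1],   # shipment
]

# Another publisher adds ONE extra occurrence of "gold" to doc 3 only.
after_counts = [row[:] for row in before_counts]
after_counts[0][2] = 2

before = rank1_approx(before_counts)
after = rank1_approx(after_counts)

# Doc 1 was not edited, yet its reduced-rank weights all move.
for term, b, a in zip(terms, before, after):
    print(f"{term:9s} doc 1 weight: {b[0]:.3f} -> {a[0]:.3f}")
```

Even in this three-doc sketch, the weights of the untouched doc shift for every term, including terms it never contained. Predicting that shift requires knowing the whole matrix, which is the point made above.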
In the example, I used a primitive model wherein term weights are defined as word occurrences in docs. This model was taken from Prof Grossman and Frieder's book which I reviewed long ago.
The picture of computing term weights is actually more complex, since no current search engine defines term weights as mere occurrences (term frequencies); they use composite weights instead:
aij = Lij * Gi * Nj
where
aij = term weights
Lij = local weights
Gi = global weights
Nj = normalization weights
Some even define Gi in terms of entropy weights, but I am not going to explain this now. I have tutorials explaining entropy weights and all these equations at Mi Islita.com.
The thing is that each search engine has its own recipe for defining Lij, Gi, and Nj. Once aij is computed, these values are then used in the term-doc matrix to be decomposed in LSI."
End of Quote
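The composite weight aij = Lij * Gi * Nj from the quote above can be sketched as follows. This is only one possible recipe, chosen for illustration and not attributed to any search engine or to the tutorials: a log-dampened term frequency for Lij, an inverse document frequency for Gi, and cosine (unit-length) normalization for Nj. The function name composite_weights and the sample counts are hypothetical.

```python
import math

def composite_weights(counts):
    """Return a[i][j] = L[i][j] * G[i] * N[j] for a term-by-doc count matrix.

    counts[i][j] is the number of occurrences of term i in doc j.
    All three weighting choices below are assumptions made for illustration.
    """
    n_terms = len(counts)
    n_docs = len(counts[0])

    # Local weight L_ij: log-dampened term frequency.
    local = [[math.log(1 + c) for c in row] for row in counts]

    # Global weight G_i: inverse document frequency, offset by 1 so that
    # a term present in every doc still keeps a nonzero weight.
    glob = []
    for row in counts:
        df = sum(1 for c in row if c > 0)
        glob.append(math.log(n_docs / df) + 1.0 if df else 0.0)

    # Normalization N_j: scale each doc column to unit length.
    norm = []
    for j in range(n_docs):
        length = math.sqrt(sum((local[i][j] * glob[i]) ** 2
                               for i in range(n_terms)))
        norm.append(1.0 / length if length else 0.0)

    return [[local[i][j] * glob[i] * norm[j] for j in range(n_docs)]
            for i in range(n_terms)]

# Hypothetical counts: 3 terms (rows) in 3 docs (columns).
counts = [
    [2, 0, 1],
    [0, 1, 1],
    [1, 1, 1],
]
weights = composite_weights(counts)
for row in weights:
    print(["%.3f" % w for w in row])
```

Because Gi depends on every doc containing the term and Nj on every term in the doc, a single aij already depends on more than the one page an SEO controls, and that is before the matrix is decomposed.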
Here is another quote:
Quote starts
"Regarding some "LSI-tools" and related services, these are either fake or a caricature of what LSI-based search engines actually compute. There are specific litmus tests one can conduct to identify whether these tools fake results from a thesaurus or a word list of definitions.
Don't be fooled by these marketers, either. They will put out any argument in relation to LSI to sell whatever they sell or to promote their image as "seo experts". The sad thing is they always find their way through SES conferences and other forums to deceive the public or make a profit out of the ignorance of others."
End of Quote
To sum up, there is no such thing as "LSI-Friendly" documents or "LSI optimization". If some still want a second opinion, this is what true SEO experts have to say after I helped them to grasp the concept of LSI:
Mike Grehan
Lies, Lies, and LSI by Mike Grehan
http://www.clickz.com/clickz/column/1715046/lies-lies-lsi
Bill Slawski
http://www.seobythesea.com/2007/03/personalization-through-tracking-triplets-of-users-queries-and-web-pages/
Rand Fishkin
http://www.seomoz.org/blog/infosearch-media-contentlogic-purveyors-of-falsehoods
Lee Odden
http://www.toprankblog.com/2006/12/5-myths-about-seo/
Regards
Dr. Edel Garcia
Hmm, OK, I'm off to read the tutorial by Mr. Garcia. I definitely need to know more about this before I start saying stupid things :)
Dr. Garcia,
Thanks for stopping by. The tutorial you've linked to seems promising. I tried to read it while at a training session at SES today, but realized it's going to take a little more concentration than can be had in a crowded room!
Thanks for the quote.
And then, there is a tutorial series in which I expose and debunk the many SEO myths regarding SVD and LSI.
Indeed, many SEOs like the above who have "explained" LSI don't really have a clue or even know how to SVD a simple matrix. They just do the talking to sell something or to promote their image as "experts".
Regards
Dr. E. Garcia