Latent Semantic Indexing can improve your WordPress search results
Latent Semantic Indexing (LSI) can improve the quality of WordPress search results dramatically. Rather than just looking for any one of a set of keywords in the body of your posts, LSI builds a low-rank approximation of the relationship between your blog posts and the words you use. Because the reduced space has far fewer dimensions than the original term-document matrix, words with related semantic value (e.g., “Microsoft” and “Bill Gates”) become associated, and a search for one term will also return results that are closely related to it.
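To make the idea concrete, here is a minimal sketch in Python with NumPy. It is an illustration only, not the code behind the results in this post: the toy posts, the choice of k = 2, and the `search` helper are all assumptions made up for the example. It builds a term-document count matrix, keeps the k largest singular values, and ranks posts by cosine similarity to the query in the reduced space.

```python
import numpy as np

# Hypothetical toy posts, made up for illustration.
docs = [
    "microsoft windows software",
    "bill gates microsoft",
    "bill gates software",
    "rap music lyrics",
    "ludacris rap lyrics",
]

# Vocabulary and term-document count matrix (rows = terms, columns = posts).
vocab = sorted({word for doc in docs for word in doc.split()})
term_index = {word: i for i, word in enumerate(vocab)}
A = np.zeros((len(vocab), len(docs)))
for j, doc in enumerate(docs):
    for word in doc.split():
        A[term_index[word], j] += 1.0

# Truncated SVD: keep the k largest singular values and their vectors.
k = 2
U, s, Vt = np.linalg.svd(A, full_matrices=False)
Uk, sk, Vk = U[:, :k], s[:k], Vt[:k, :].T  # row j of Vk represents post j


def search(query):
    """Return (post index, cosine similarity) pairs, best match first."""
    q = np.zeros(len(vocab))
    for word in query.lower().split():
        if word in term_index:
            q[term_index[word]] += 1.0
    q_hat = (q @ Uk) / sk  # project the query into the k-dimensional space
    q_norm = np.linalg.norm(q_hat)
    if q_norm == 0.0:
        return []  # the query contains no known words
    sims = (Vk @ q_hat) / (np.linalg.norm(Vk, axis=1) * q_norm)
    order = np.argsort(-sims)
    return [(int(j), float(sims[j])) for j in order]


for j, score in search("windows"):
    print(f"{score:+.2f}  {docs[j]}")
```

In this toy corpus only the first post contains the word "windows", yet the two "bill gates" posts should rank alongside it because they share other terms with it, while the rap posts should score close to zero. A plain keyword match would have returned the first post only.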
Some examples:
Take a look at this query for “writing good code”. Naturally, you would like WordPress to return articles about coding practices, or at least about computer programming! However, the first three matches are Ludacris lyrics, ethical blogging, and finally something useful: Microsoft interview tales. Now, take a look at what I get back with LSI: Google Desktop Search, Heavyweight Categories plugin, and Things I want to do for Wordpress. These, to me anyway, seem a little more relevant. And, if you do try looking for “rap music” with the LSI technique, one of your results is Pot Smokers = Psychotic. Now how relevant is that?
If you need more proof, “pop culture” gives me Paris Hilton, and “sex” returns The “really big” boys get it wrong.
Some downsides:
To do LSI, you have to create a term-document matrix, which will be really big. Mine is 12,525 x 726, and takes up 40 MB of space in full form. Of course, it’s a very sparse matrix, so you can store it in a sparse structure and save most of the space. However, you still have to compute the SVD of that huge matrix, and do a number of painful multiplications and solves. In other words, LSI is a little slow for a web application. Queries on my P4 here at home take as long as a minute to run. Imagine the wait on a loaded server!
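A rough sketch of the storage argument, in plain Python. The 4 bytes per cell is an assumption (the post does not say how the full matrix was stored), and `dense_size_bytes`, `build_sparse`, and the toy posts are hypothetical names and data for illustration. At 4 bytes per cell, 12,525 x 726 = 9,093,150 cells comes to about 36 MB, the same ballpark as the 40 MB above. The sparse version keeps only the nonzero counts in a dictionary keyed by (row, column).

```python
from collections import Counter


def dense_size_bytes(rows, cols, bytes_per_entry=4):
    """Memory needed to store every cell, zeros included."""
    return rows * cols * bytes_per_entry


def build_sparse(docs):
    """Return (vocab, cells), where cells maps (term row, doc column) -> count."""
    vocab = sorted({word for doc in docs for word in doc.split()})
    row_of = {word: i for i, word in enumerate(vocab)}
    cells = {}
    for col, doc in enumerate(docs):
        for word, count in Counter(doc.split()).items():
            cells[(row_of[word], col)] = count  # zero cells are never stored
    return vocab, cells


# The matrix size quoted in the post: 12,525 terms x 726 posts.
print(dense_size_bytes(12_525, 726) / 1e6, "MB dense at 4 bytes per cell")

# Hypothetical toy posts, made up for illustration.
vocab, cells = build_sparse(["rap music lyrics", "rap lyrics rap"])
print(len(cells), "stored cells out of", len(vocab) * 2)
```

On a real blog each post uses only a small fraction of the vocabulary, so the dictionary holds far fewer entries than the full grid. This only helps with storage; the SVD still has to be computed.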
Still, the results are astounding, and the WP devs should definitely code up a hack!
This entry was posted on Wednesday, April 27th, 2005 at 8:32 am and is tagged with google desktop search, pot smokers, ludacris lyrics, application queries, paris hilton and sex, document matrix, semantic value, rap music, term space, sparse matrix, google, multiplications, bill gates, paris hilton, lsi, big boys, latent semantic indexing, ludacris, svd, computer programming.

There are actually continuous models. You can recompute an approximate SVD of the term-document matrix as new stuff comes in without too much work; it just won’t be as accurate…
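One cheap way to do this is the "folding-in" technique from the LSI literature, sketched below in Python with NumPy. Whether this is the method the comment above has in mind is an assumption. The `fold_in` helper and the small count matrix are hypothetical, made up for illustration. A new post is projected into the existing k-dimensional space and appended as a new document vector. The term vectors and singular values are left as they are, which is why accuracy drifts as more posts are added this way.

```python
import numpy as np


def fold_in(Uk, sk, Vk, new_doc_counts):
    """Append one post to a rank-k model, leaving Uk and sk untouched."""
    d_hat = (new_doc_counts @ Uk) / sk  # project the new post into the latent space
    return np.vstack([Vk, d_hat])


# Hypothetical 4-term x 3-post count matrix, made up for illustration.
A = np.array([[1.0, 0.0, 1.0],
              [0.0, 1.0, 1.0],
              [1.0, 1.0, 0.0],
              [0.0, 0.0, 1.0]])
k = 2
U, s, Vt = np.linalg.svd(A, full_matrices=False)
Uk, sk, Vk = U[:, :k], s[:k], Vt[:k, :].T  # row j of Vk represents post j

# A new post that uses terms 0 and 2, the same words as post 0.
new_post = np.array([1.0, 0.0, 1.0, 0.0])
Vk_updated = fold_in(Uk, sk, Vk, new_post)
print(Vk_updated.shape)
```

Because the new post has the same word counts as post 0, its folded-in vector should match the existing vector for post 0. A post with words the model has never seen gets no credit for them, so a full rebuild is still needed from time to time.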
[...] Elliott points towards Latent Semantic Indexing (LSI) for improving the search. It might not be a viable option today, but it might be the future of the search industry. Natural Language Processing is already being used in certain search functionalities. [...]
Typo in my previous message: I mean you “cannot” index one entry by itself.. =)
I studied LSI in my Masters and I am, too, looking for a commercial implementation of it.
One problem with LSI (based on the proposed implementation in the textbook) is that the indexing is not continuous. You can add an entry in your database and have it indexed just by itself. You have to re-index the entire database, which is a pain in the butt. And it might also cause the search results to change significantly from one build to another. So, Edward, if you can find a solution to this (making the indexing continuous), then you will be the next Bill Gates.
For some applications this may not be a problem. But in general, this is bad.
Another problem with LSI is the algorithm that finds the “nearest neighbors”. I don’t think people have a good solution to that yet. But in terms of searching, the time and space complexity is not so much of a concern.
For Google, however, the problem is the immense number of pages and keywords they have to index. The complexity goes up at least at a second-order polynomial rate. We may not have enough atoms in the universe to store all that information.
LSI is a patented technique, and is difficult to implement properly. Don’t even try it unless you read the appropriate academic papers first. I suggest http://scholar.google.com as a first resource. Even Google doesn’t use LSI, at least not yet, because of its computational complexity.
Good info on LSI. Thanks.
Am an aspiring search engine architect. Ave been building a Presale Marketplace engine for my part of the world (E. Africa). It will be a search engine leading prospects to quality products, services, places etc. Twill be the first here. I want to give it some Artificial Intelligence.
Ave worked extensively with Linux|Apache|PHP|MySQL and do hope to launch my solution on this platform. Ave recently stumbled on LSI/LSA and became very interested. I have gone through a lot of sites and documents on LSI.
Problem is, I really can’t find any straight path from where I am now to augmenting LSI on my choice platform – PHP|MySQL. Where do I go from here? Please help
Try:
http://www.semiologic.com/projects/search-reloaded/
when it is available to increase the relevance of your search results