
Tuesday, February 07, 2006

The Practical Mind 

I came across this thought-provoking blog and, whatever my own opinions, I liked it. You don't come across many things that are genuinely thought-provoking. This blog I liked reading.

Read more at o3.indiatimes.com/pract...


Saturday, February 04, 2006

Paper submitted to ACM SIGIR 2006 conference 

I submitted a paper to the SIGIR 2006 conference in Seattle on Jan 30, 2006. The work presented in this paper was done under the guidance of Dr. Xu as well as Dr. Bayrak (my advisor). Here is a brief description:

The term-by-document matrix generated by the vector space model is a nice linear-algebraic representation, but it has huge dimension as well as synonymy and polysemy issues. So what do we do? Well, we have matrix decomposition techniques like SVD (Singular Value Decomposition) or SDD (Semidiscrete Decomposition); this decomposition-based approach is called Latent Semantic Indexing/Analysis (LSI or LSA). If you don't like matrices much, your best bet is probabilistic models that are trained and perform really well, depending on how well the training set represents the test data. The two popular approaches there are probabilistic Latent Semantic Indexing (pLSI) and a somewhat superior technique called Latent Dirichlet Allocation (LDA).

LSI has been around for some time now and has known computational complexity issues. Having said that, LSI does seem to solve the synonymy problem quite well. So essentially, the papers you will find in the IR field build their indexing techniques on the backbone of one of the two approaches mentioned above. I am equally impressed by a couple of clustering techniques: spherical K-means clustering, which has well-established efficiency and accuracy, and Principal Direction Divisive Partitioning (PDDP). PDDP is hierarchical in nature and uses principal components. The spherical K-means algorithm requires a value K to be given and iteratively puts the given set of documents into K baskets (called clusters).
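To make the LSI step concrete, here is a minimal sketch (my own illustration, not code from any paper) that applies a truncated SVD to a toy term-by-document count matrix; the vocabulary, counts, and rank k are made up for this example.

import numpy as np

# Minimal LSI illustration: truncated SVD of a term-by-document matrix.
# Rows = terms, columns = documents (raw term counts); toy values only.
A = np.array([
    [2, 0, 1, 0],   # car
    [0, 2, 1, 0],   # auto   (synonym of car -> similar usage pattern)
    [1, 1, 2, 0],   # engine
    [0, 0, 0, 3],   # flower
    [0, 0, 0, 2],   # petal
], dtype=float)

k = 2                                    # reduced rank (latent dimensions)
U, s, Vt = np.linalg.svd(A, full_matrices=False)
docs_latent = np.diag(s[:k]) @ Vt[:k]    # k x n_docs: documents in latent space
print(docs_latent.round(2))

Note that "car" and "auto" never co-occur in the same document here, yet the first three documents land close together in the 2-dimensional latent space because they share "engine", which illustrates how LSI mitigates synonymy.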

Well, with that introduction, I can tell you what we have proposed. The spherical K-means algorithm is known for its efficiency and quality of clustering. LSI methods are normally computationally expensive because they must decompose a huge term-by-document matrix into three matrices. LSI's reduced-rank representation approximates the original matrix by reducing the number of rows (it maps the terms, i.e., the rows of the matrix, into a latent space for each document).

If we apply the spherical K-means algorithm to the dataset first, we can obtain concept vectors amounting to about 10% of the original columns that still represent the entire dataset; if we then apply LSI to that column-reduced matrix, we can obtain results comparable to applying LSI to the original matrix. That means we can reduce the matrix column-wise (using spherical K-means) and further reduce it row-wise (using LSI) without much affecting the result. For example, if you have 100,000 documents with a total of 500,000 unique words, you can first reduce the 500,000 x 100,000 matrix to 500,000 x 1,000 using the spherical K-means algorithm. We can then apply LSI to this new matrix to further reduce the rank to, say, 200 or 300, yielding a 300 x 1,000 matrix that is an approximate (with associated error) representation of the original 500,000 x 100,000 matrix.

The original LSI algorithm uses the Lanczos method, which has a time complexity of O(m * n * c), where m is the number of rows, n is the number of columns, and c is the number of non-zero elements of the matrix A[m x n]. With spherical K-means applied first, we reduce the time complexity to O(m * (n/10) * c'), where c' is the number of non-zero elements of the matrix A'[m x (n/10)].
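Here is a minimal sketch of the two-stage reduction as I describe it above, not the actual experimental code: spherical K-means is approximated by standard K-means on unit-normalized document vectors (which works in cosine geometry), and LSI is a truncated SVD. The function name and all parameters are illustrative assumptions.

import numpy as np
from scipy.sparse.linalg import svds
from sklearn.cluster import KMeans
from sklearn.preprocessing import normalize

def two_stage_reduction(A, n_clusters, rank):
    """A: (terms x documents) matrix. Returns a (rank x n_clusters) latent matrix."""
    # Stage 1 (column-wise reduction): cluster unit-normalized documents.
    # K-means on L2-normalized vectors approximates spherical K-means.
    docs = normalize(A.T)                       # documents as unit-length rows
    km = KMeans(n_clusters=n_clusters, n_init=10).fit(docs)
    concept = normalize(km.cluster_centers_).T  # terms x n_clusters concept vectors
    # Stage 2 (row-wise reduction): LSI, i.e., truncated SVD of the smaller matrix.
    U, s, Vt = svds(concept, k=rank)            # requires rank < n_clusters
    return np.diag(s) @ Vt                      # concept vectors in latent space

# Toy usage: 2,000 terms x 400 documents instead of 500,000 x 100,000.
rng = np.random.default_rng(0)
A = rng.poisson(0.05, size=(2000, 400)).astype(float)
latent = two_stage_reduction(A, n_clusters=40, rank=20)
print(latent.shape)   # (20, 40): far smaller than the original matrix

The point of the design is that the expensive SVD now runs on a matrix with roughly a tenth of the columns, which is where the O(m * (n/10) * c') figure above comes from.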

Science Citation Index 

I found that to benchmark quality journals in your field, you have to look up the Science Citation Index or the Science Citation Index Expanded for that field.

So I looked up the SCI for the Computer Science, Information Systems field and found that they have 23 journals listed there. Some of these journals are quarterly and some are monthly. I also need to look at some of the top conferences in my field, like SIGIR, ECIR, SIGKDD, ICDM, and CIKM.

ICMLA 2005 conference 

I attended the International Conference on Machine Learning and Applications in Los Angeles, December 15-17, 2005. I presented my research work at the conference, detailing the use of a semantic approach instead of matrix computations and statistical baggage. All in all, it was a very good experience to listen to some of the other presenters in this area. I was particularly impressed by Dr. Masood Nikravesh's keynote speech. He is at BISC in California, and his research aims at solving natural language problems using fuzzy logic (Zadeh et al.).

So what next? I think I will get back to my work with Dr. Xu and Dr. Bayrak and get some experiments conducted in the next couple of weeks.

BTW, I got to know from Anindo that he has started posting to his blog as well. He has started as a postdoc at MIT.
