Saturday, February 04, 2006
Paper submitted to ACM SIGIR 2006 conference
I submitted a paper to the SIGIR 2006 conference in Seattle on Jan 30, 2006. The work presented in this paper was done under the guidance of Dr. Xu as well as Dr. Bayrak (my advisor). Here is a brief description -
The term-vs-document matrix generated by the vector space model is a nice linear algebraic representation, but it has huge dimension as well as synonymy and polysemy issues. So what do we do? Well, we have matrix decomposition techniques like SVD (Singular Value Decomposition) or SDD (Semi-Discrete Decomposition). If you don't like matrices much, your best bet is probabilistic models that are trained and perform really well, depending on how well the training set represents the test data. The two popular probabilistic approaches are probabilistic latent semantic indexing (known as pLSI) and a somewhat superior technique called Latent Dirichlet Allocation (known as LDA). The matrix-decomposition route is the one taken by Latent Semantic Indexing/Analysis (LSI or LSA).
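To make the LSI idea concrete, here is a minimal sketch in Python of a rank-k truncated SVD on a tiny toy term-document matrix; the matrix values and the choice of k are my own illustrative assumptions, not anything from the paper.

import numpy as np

# Rows = terms, columns = documents (raw term counts, for illustration only).
A = np.array([
    [2.0, 0.0, 1.0, 0.0],   # "car"
    [1.0, 0.0, 2.0, 0.0],   # "auto"  (a synonym of "car")
    [0.0, 3.0, 0.0, 1.0],   # "python"
    [0.0, 1.0, 0.0, 2.0],   # "snake"
])

# LSI: keep only the top-k singular triplets of the term-document matrix.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]   # rank-k approximation of A
print(np.round(A_k, 2))

# Documents now live in the k-dimensional latent space diag(s[:k]) @ Vt[:k, :],
# where co-occurring synonyms such as "car"/"auto" end up close together.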
LSI has been around for some time now and has known computational complexity issues. Having said that, LSI does seem to solve the synonymy problem quite well. So essentially the papers you will find in the IR field build their indexing techniques on the backbone of one of the two approaches mentioned above. I am equally impressed by a couple of clustering techniques: spherical K-means clustering, which has well-established efficiency and accuracy, and Principal Direction Divisive Partitioning (PDDP). PDDP is hierarchical in nature and uses principal components. The spherical K-means algorithm requires a value K to be given and iteratively puts the given set of documents into K baskets (called clusters).
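Here is a rough sketch of what spherical K-means looks like on the columns of a term-document matrix; the function name and parameters are my own, and a real implementation would want better initialization and a convergence check.

import numpy as np

def spherical_kmeans(A, k, n_iter=20, seed=0):
    """Cluster the columns (documents) of A into k clusters by cosine similarity."""
    rng = np.random.default_rng(seed)
    # Put every document vector on the unit sphere (hence "spherical").
    X = A / (np.linalg.norm(A, axis=0, keepdims=True) + 1e-12)
    # Initialize the k concept vectors with k randomly chosen documents.
    centers = X[:, rng.choice(X.shape[1], size=k, replace=False)]
    for _ in range(n_iter):
        # Assign each document to the concept vector with the highest cosine.
        labels = np.argmax(centers.T @ X, axis=0)
        # Recompute each concept vector as the normalized mean of its cluster.
        for j in range(k):
            members = X[:, labels == j]
            if members.size:
                c = members.sum(axis=1)
                centers[:, j] = c / (np.linalg.norm(c) + 1e-12)
    return labels, centers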
Well, with that introduction, I can tell you what we have proposed. The spherical K-means algorithm is known for its efficiency and quality of clustering. LSI methods are normally computationally expensive because they have a huge term-vs-document matrix to decompose into three matrices. The LSI reduced-rank representation approximates the original matrix by reducing the number of rows (it maps the terms, i.e. the rows of the matrix, into a latent space for each document). If we apply the spherical K-means algorithm to the dataset first, we can obtain concept vectors amounting to about 10% of the original dataset that represent the entire dataset, and if we then apply LSI to that column-reduced matrix, we can still obtain results comparable to applying LSI to the original matrix. That means we can now reduce the matrix column-wise (using spherical K-means) and further reduce it row-wise (using LSI) without much affecting the result.

For example, if you have 100,000 documents with a total of 500,000 unique words, you can first reduce the 500,000 x 100,000 matrix to 500,000 x 1,000 using the spherical K-means algorithm. We can then apply LSI to this new matrix to further reduce the rank to, say, 200 or 300, thus reducing the matrix row-wise to 300 x 1,000, which is an approximate (with associated error) representation of the original 500,000 x 100,000 matrix. The original LSI algorithm uses the Lanczos method, which has time complexity O(m * n * c), where m is the number of rows, n is the number of columns, and c is the number of non-zero elements of the matrix A[m x n]. With spherical K-means we reduce the time complexity to O(m * (n/10) * c'), where c' is the number of non-zero elements of the matrix A'[m x (n/10)].
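Putting the two stages together, a hedged sketch of the proposed pipeline might look like the following, reusing the spherical_kmeans() helper above; the function name is my own, and the particular shapes (K concept vectors, a chosen LSI rank) simply mirror the example in the text rather than prescribing anything.

import numpy as np

def cluster_then_lsi(A, k, rank):
    # Stage 1 (column reduction): replace the n documents by k concept
    # vectors found by spherical K-means, giving an m x k matrix.
    _, concepts = spherical_kmeans(A, k)                 # m x k

    # Stage 2 (row / rank reduction): run truncated SVD (LSI) on the much
    # smaller m x k matrix instead of the original m x n matrix.
    U, s, Vt = np.linalg.svd(concepts, full_matrices=False)
    U_r, s_r = U[:, :rank], s[:rank]

    # Queries or new documents (length-m term vectors) are folded into the
    # rank-dimensional latent space with the usual LSI-style projection.
    def project(d):
        return (U_r.T @ d) / s_r

    return project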