
Monday, August 30, 2004

Reading on document clustering 

This weekend, I went home thinking about the way document clustering is done for textual data. Most text clustering algorithms have a few things in common: (1) they define a similarity measure for the text data, and (2) most of them represent the data as a matrix, so the clustering problem can be solved through matrix manipulations.
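To make those two points concrete, here is a minimal sketch (my own toy example, not from any particular paper): documents become rows of a term-frequency matrix, and similarity between two documents is the cosine of the angle between their row vectors.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two term-frequency vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    if na == 0 or nb == 0:
        return 0.0
    return dot / (na * nb)

# Toy term-document matrix: each row is a document, each column a term.
docs = [
    [2, 1, 0, 0],  # uses terms 0 and 1
    [1, 2, 0, 0],  # shares vocabulary with the first document
    [0, 0, 3, 1],  # entirely different vocabulary
]

print(cosine_similarity(docs[0], docs[1]))  # high: shared vocabulary
print(cosine_similarity(docs[0], docs[2]))  # zero: disjoint vocabulary
```

In practice the matrix would be weighted (e.g. tf-idf) and very sparse, but the geometry is the same.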
I have been reading a lot of clustering algorithms. K-means has its own advantages and disadvantages. Density-based clustering is another algorithm I am impressed with.
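For reference, the basic k-means loop (Lloyd's algorithm) is simple enough to sketch in a few lines; this toy version on 2-D points is my own illustration, not tied to any text-clustering paper:

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Minimal Lloyd's k-means on a list of 2-D points."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # pick k distinct points as seeds
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        # Assignment step: each point goes to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k),
                    key=lambda c: (p[0] - centroids[c][0]) ** 2
                                + (p[1] - centroids[c][1]) ** 2)
            clusters[i].append(p)
        # Update step: each centroid becomes its cluster's mean.
        for i, cl in enumerate(clusters):
            if cl:
                centroids[i] = (sum(p[0] for p in cl) / len(cl),
                                sum(p[1] for p in cl) / len(cl))
    return centroids, clusters

pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centroids, clusters = kmeans(pts, 2)
```

Its well-known disadvantages show up even here: k must be chosen in advance, and the result depends on the random initialization.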
I read the paper "Concept decompositions for large sparse text data using clustering" by Dr. Dhillon and Dr. Modha. These two researchers are Indians, and I feel some sense of pride reading one of the important algorithms in text document clustering.
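The core algorithm in that paper is spherical k-means: document vectors are normalized to unit length, assignment uses cosine similarity instead of Euclidean distance, and each cluster's "concept vector" is the normalized mean of its members. A rough sketch of that idea (my own simplified version, with naive initialization from the first k documents):

```python
import math

def normalize(v):
    """Scale a vector to unit length (unchanged if it is all zeros)."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v] if n else v

def spherical_kmeans(docs, k, iters=10):
    """Simplified spherical k-means: cosine assignment, normalized
    cluster means as concept vectors."""
    docs = [normalize(d) for d in docs]
    concepts = [docs[i] for i in range(k)]  # naive init: first k docs
    labels = [0] * len(docs)
    for _ in range(iters):
        # Assign each document to the most similar concept vector.
        for j, d in enumerate(docs):
            labels[j] = max(range(k),
                            key=lambda c: sum(x * y
                                              for x, y in zip(d, concepts[c])))
        # Each concept vector becomes the normalized mean of its cluster.
        for c in range(k):
            members = [docs[j] for j in range(len(docs)) if labels[j] == c]
            if members:
                mean = [sum(col) / len(members) for col in zip(*members)]
            if members:
                concepts[c] = normalize(mean)
    return labels, concepts

# Two toy topics: terms 0-1 versus terms 2-3.
labels, concepts = spherical_kmeans(
    [[1, 1, 0, 0], [2, 1, 0, 0], [0, 0, 1, 1], [0, 0, 1, 2]], 2)
```

The concept vectors themselves are the "concept decomposition": each document is approximated by its cluster's concept vector, which the paper shows works well for large sparse text collections.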
What's more, I have been working on my own idea for text document clustering. As mentioned above, I do have a similarity measure defined, and I also work with matrices; what differs is the approach to solving that matrix to club similar documents together. I will need to run a few experiments before I can start pointing out the differences from other techniques and start detailing a technical paper. I am keeping my focus on the August 2005 conference of the Special Interest Group on Information Retrieval. I am also planning to write a journal paper on semantic representation of contextual data; the ontology options in Protégé need to be explored in order to achieve this.

My priority at this moment is to focus on the proposal draft due at the end of October, and I am getting a few experiments in shape to collect some test results.

