Document and Term Clustering Introduction
Clustering is the task of dividing data points into groups such that points in the same group are more similar to one another than to points in other groups. In simple terms, the aim is to segregate groups with similar traits and assign them to clusters.
Types of Clustering
Broadly speaking, clustering can be divided into two subgroups:
- Hard Clustering: In hard clustering, each data point either belongs to a cluster completely or not at all. For example, if the customers of a retail store are segmented into 10 groups, each customer is put into exactly one of the 10 groups.
- Soft Clustering: In soft clustering, instead of assigning each data point to a single cluster, a probability or likelihood of the data point belonging to each cluster is assigned. In the retail example, each customer receives a probability of belonging to each of the 10 clusters.
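As a minimal sketch of the difference, the snippet below (assuming scikit-learn and synthetic toy data, neither of which is part of the original example) contrasts the hard labels produced by K-means with the per-cluster probabilities produced by a Gaussian mixture model.

```python
# Sketch: hard vs. soft assignments on toy 2-D data (assumes scikit-learn).
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture

X, _ = make_blobs(n_samples=200, centers=3, random_state=0)

# Hard clustering: each point gets exactly one cluster label.
hard_labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
print(hard_labels[:5])            # e.g. [2 0 0 1 2]

# Soft clustering: each point gets a probability for every cluster.
gmm = GaussianMixture(n_components=3, random_state=0).fit(X)
soft_probs = gmm.predict_proba(X)
print(soft_probs[:2].round(3))    # each row sums to 1.0
```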
Types of clustering algorithms
- Connectivity models: build clusters on the idea that nearby data points are more related than distant ones; hierarchical (agglomerative) clustering is the typical example.
- Centroid models: represent each cluster by a single central vector, as in K-means.
- Distribution models: assume the points in a cluster are drawn from the same probability distribution, as in Gaussian mixture models.
- Density models: define clusters as dense regions of points separated by sparser regions, as in DBSCAN.
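The following sketch, assuming scikit-learn and an illustrative synthetic dataset, runs one representative algorithm from each of these families; the specific parameter values are arbitrary.

```python
# Sketch: one representative algorithm per model family (assumes scikit-learn).
from sklearn.datasets import make_blobs
from sklearn.cluster import AgglomerativeClustering, KMeans, DBSCAN
from sklearn.mixture import GaussianMixture

X, _ = make_blobs(n_samples=300, centers=4, random_state=42)

connectivity = AgglomerativeClustering(n_clusters=4).fit_predict(X)    # connectivity model
centroid = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X)  # centroid model
distribution = GaussianMixture(n_components=4, random_state=0).fit_predict(X)  # distribution model
density = DBSCAN(eps=0.8, min_samples=5).fit_predict(X)                # density model (-1 = noise)
```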
Term clustering is a dual problem of Document Clustering
- it is also an unsupervised text-mining technique, but applied to terms instead of documents
- term clustering can be a good technique for dimensionality reduction
- when we use vector space models, e.g. Bag of Words, we have a term-document matrix
- its rows are documents, its columns are terms
- we can cluster the columns instead of the rows!
- clustering rows and clustering columns are closely related problems
Term Clustering groups words with a high degree of semantic relatedness
- so we can use clusters (centroids of terms) to represent terms
Li and Jain (1998)
- view semantic relatedness between words in terms of their co-occurrence and co-absence in the corpus
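One simple way to read co-occurrence and co-absence together is the matching coefficient over a binary term-document matrix; the small matrix and term names below are made up purely for illustration.

```python
# Sketch: term relatedness from co-occurrence and co-absence across documents,
# using a simple matching coefficient on a binary term-document matrix.
# The matrix B and the term names are illustrative, not from the original text.
import numpy as np

terms = ["car", "auto", "banana"]
# Rows = documents, columns = terms; 1 means the term occurs in the document.
B = np.array([[1, 1, 0],
              [1, 1, 0],
              [0, 0, 1],
              [1, 0, 0]])
n_docs = B.shape[0]

def matching_similarity(i, j):
    co_occurrence = np.sum((B[:, i] == 1) & (B[:, j] == 1))
    co_absence = np.sum((B[:, i] == 0) & (B[:, j] == 0))
    return (co_occurrence + co_absence) / n_docs

print(matching_similarity(0, 1))  # "car" vs "auto"   -> high (0.75)
print(matching_similarity(0, 2))  # "car" vs "banana" -> low  (0.0)
```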
Clustering of Terms
How to do this?
- try applying the usual row clustering techniques to the transposed term-document matrix, i.e. treat the terms (columns) as the objects to cluster
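For instance, a minimal sketch (assuming scikit-learn and a made-up toy corpus) that clusters the columns of the term-document matrix by transposing it and running ordinary K-means:

```python
# Sketch: clustering terms by applying a row-clustering algorithm to the
# transpose of the term-document matrix (assumes scikit-learn).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.cluster import KMeans

docs = ["the car is fast", "the auto was fast", "bananas are yellow fruit"]
vec = CountVectorizer()
X = vec.fit_transform(docs)          # rows = documents, columns = terms

# Transpose: now rows = terms, so ordinary row clustering groups the terms.
term_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X.T)
for term, label in zip(vec.get_feature_names_out(), term_labels):
    print(label, term)
```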
Apply Local Pattern Discovery and Frequent Pattern Mining techniques to terms:
- we can see a document as a transaction and words as items
- we want to find frequent itemsets of words in these documents
- this is called Frequent Word Patterns
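A brute-force illustration of this view is sketched below: each document is treated as a set of word items and word pairs that reach a minimum support are kept. A real frequent-itemset miner (e.g. Apriori) would replace the nested loops; the corpus and threshold are made up.

```python
# Sketch: documents as transactions, words as items; count frequent word pairs.
from itertools import combinations
from collections import Counter

docs = [
    {"data", "mining", "clustering"},
    {"data", "mining", "text"},
    {"text", "clustering", "data"},
]
min_support = 2  # minimum number of documents a pair must appear in

pair_counts = Counter()
for doc in docs:
    for pair in combinations(sorted(doc), 2):
        pair_counts[pair] += 1

frequent_pairs = {p: c for p, c in pair_counts.items() if c >= min_support}
print(frequent_pairs)   # e.g. {('data', 'mining'): 2, ('clustering', 'data'): 2, ...}
```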
Two-Phase Document Clustering
- use Mutual Information to find the best term clustering
- then use Mutual Information to find the best document clustering
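One way to read the two-phase idea is sketched below, assuming scikit-learn and a toy corpus: terms are clustered first, each document is re-represented by its term-cluster totals, and the documents are then clustered in that reduced space. The mutual-information criterion for selecting the best clusterings is not implemented here; fixed cluster counts stand in for it.

```python
# Sketch of the two-phase idea: cluster terms, then cluster documents in the
# reduced term-cluster space (assumes scikit-learn; corpus is illustrative).
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.cluster import KMeans

docs = ["cats and dogs", "dogs chase cats", "stocks and bonds", "bonds and markets"]
X = CountVectorizer().fit_transform(docs).toarray()   # documents x terms

# Phase 1: cluster the terms (columns).
term_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X.T)

# Re-represent documents: one feature per term cluster (sum of its term counts).
reduced = np.column_stack([X[:, term_labels == k].sum(axis=1) for k in range(2)])

# Phase 2: cluster the documents in the reduced space.
doc_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(reduced)
print(doc_labels)
```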
Document clustering involves the use of descriptors and descriptor extraction. Descriptors are sets of words that describe the contents within the cluster. Document clustering is generally considered to be a centralized process. Examples of document clustering include web document clustering for search users.
- The applications of document clustering can be categorized into two types, online and offline. Online applications are usually constrained by efficiency problems when compared to offline applications. Text clustering may be used for different tasks, such as grouping similar documents (news, tweets, etc.), analyzing customer/employee feedback, and discovering meaningful implicit subjects across all documents.
- In general, there are two common algorithm families. The first is hierarchical algorithms, which include single link, complete linkage, group average and Ward’s method. By aggregating or dividing, documents can be clustered into a hierarchical structure, which is suitable for browsing. However, such algorithms usually suffer from efficiency problems. The other family is built around the K-means algorithm and its variants. Generally, hierarchical algorithms produce more in-depth information for detailed analyses, while algorithms based on variants of K-means are more efficient and provide sufficient information for most purposes.
- These algorithms can further be classified as hard or soft clustering algorithms. Hard clustering computes a hard assignment – each document is a member of exactly one cluster. The assignment of soft clustering algorithms is soft – a document’s assignment is a distribution over all clusters. In a soft assignment, a document has fractional membership in several clusters. Dimensionality reduction methods can be considered a subtype of soft clustering; for documents, these include latent semantic indexing (truncated singular value decomposition on term histograms).
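As a small sketch of latent semantic indexing in this sense, assuming scikit-learn and an illustrative corpus, truncated SVD on tf-idf features gives each document a soft, low-dimensional representation:

```python
# Sketch: latent semantic indexing as truncated SVD on tf-idf features
# (assumes scikit-learn; corpus is illustrative).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

docs = ["cats chase mice", "dogs chase cats", "stocks and bonds", "markets and bonds"]
X = TfidfVectorizer().fit_transform(docs)

lsi = TruncatedSVD(n_components=2, random_state=0)
doc_topics = lsi.fit_transform(X)      # each document as a mix of 2 latent dimensions
print(doc_topics.round(2))
```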
1. Tokenization
- Tokenization is the process of parsing text data into smaller units (tokens) such as words and phrases. Commonly used tokenization methods include the bag-of-words model and the N-gram model.
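A minimal sketch of both tokenization styles, assuming scikit-learn's CountVectorizer and a made-up sentence:

```python
# Sketch: word and n-gram tokenization with CountVectorizer (assumes scikit-learn).
from sklearn.feature_extraction.text import CountVectorizer

text = ["Document clustering groups similar documents."]

bow = CountVectorizer()                        # bag-of-words: single-word tokens
bow.fit(text)
print(bow.get_feature_names_out())

bigrams = CountVectorizer(ngram_range=(2, 2))  # n-gram model: word pairs
bigrams.fit(text)
print(bigrams.get_feature_names_out())
```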
2. Stemming and lemmatization
- Different tokens might carry similar information (e.g. tokenization and tokenizing). We can avoid computing similar information repeatedly by reducing all tokens to their base forms using various stemming and lemmatization dictionaries.
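A small sketch using NLTK's Porter stemmer and WordNet lemmatizer (assuming nltk is installed and the WordNet data has been downloaded; the token list is illustrative):

```python
# Sketch: reducing tokens to a base form with NLTK (assumes nltk + wordnet data).
from nltk.stem import PorterStemmer, WordNetLemmatizer

tokens = ["tokenization", "tokenizing", "tokens"]

stemmer = PorterStemmer()
print([stemmer.stem(t) for t in tokens])          # e.g. ['token', 'token', 'token']

lemmatizer = WordNetLemmatizer()
print([lemmatizer.lemmatize(t) for t in tokens])  # lemmas depend on the assumed part of speech
```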
3. Removing stop words and punctuation
- Some tokens are less important than others. For instance, common words such as “the” might not be very helpful for revealing the essential characteristics of a text. So it is usually a good idea to eliminate stop words and punctuation marks before doing further analysis.
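For example, a minimal sketch that filters a token list against scikit-learn's built-in English stop-word list and Python's punctuation set (the token list is made up):

```python
# Sketch: dropping stop words and punctuation before feature extraction
# (assumes scikit-learn's built-in English stop-word list).
import string
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS

tokens = ["the", "cat", ",", "sat", "on", "the", "mat", "."]
cleaned = [t for t in tokens
           if t not in ENGLISH_STOP_WORDS and t not in string.punctuation]
print(cleaned)   # ['cat', 'sat', 'mat']
```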
4. Computing term frequencies or tf-idf
- After pre-processing the text data, we can then proceed to generate features. For document clustering, one of the most common ways to generate features for a document is to calculate the term frequencies of all its tokens. Although not perfect, these frequencies can usually provide some clues about the topic of the document. It is sometimes also useful to weight the term frequencies by the inverse document frequencies. See tf-idf for detailed discussions.
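A short sketch of both feature types, assuming scikit-learn and a toy corpus:

```python
# Sketch: raw term frequencies vs. tf-idf weighted features (assumes scikit-learn).
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = ["the cat sat on the mat", "the dog sat on the log", "stocks fell on monday"]

tf = CountVectorizer().fit_transform(docs)        # raw term frequencies
tfidf = TfidfVectorizer().fit_transform(docs)     # frequencies reweighted by idf
print(tf.shape, tfidf.shape)                      # same vocabulary, different weights
```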
5. Clustering
- We can then cluster different documents based on the features we have generated. See the algorithm section in cluster analysis for different types of clustering methods.
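For example, a minimal sketch (assuming scikit-learn, with an illustrative corpus and an arbitrary choice of two clusters) that runs K-means on tf-idf features:

```python
# Sketch: K-means on tf-idf features to group documents by topic (assumes scikit-learn).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

docs = ["cats and dogs", "dogs chase cats", "stocks and bonds", "markets and bonds"]
X = TfidfVectorizer().fit_transform(docs)

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(labels)   # e.g. [0 0 1 1]
```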
6. Evaluation and visualization
- Finally, the clustering models can be assessed by various metrics. It is sometimes helpful to visualize the results by plotting the clusters in a low-dimensional (e.g. two-dimensional) space. See multidimensional scaling as a possible approach.
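A possible sketch of this step, assuming scikit-learn and matplotlib, scores a K-means clustering with the silhouette coefficient and plots an MDS projection of the documents (the corpus and cluster count are illustrative):

```python
# Sketch: evaluate a clustering with the silhouette coefficient and visualize it
# with a 2-D MDS projection (assumes scikit-learn and matplotlib).
import matplotlib.pyplot as plt
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.manifold import MDS

docs = ["cats and dogs", "dogs chase cats", "stocks and bonds", "markets and bonds"]
X = TfidfVectorizer().fit_transform(docs).toarray()

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print("silhouette:", silhouette_score(X, labels))

coords = MDS(n_components=2, random_state=0).fit_transform(X)  # 2-D layout for plotting
plt.scatter(coords[:, 0], coords[:, 1], c=labels)
plt.show()
```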