Word Clustering Through Common Sentence Analysis

Open Access
- Author:
- Shi, Kelvin
- Area of Honors:
- Computer Science
- Degree:
- Bachelor of Science
- Document Type:
- Thesis
- Thesis Supervisors:
- Daniel Kifer, Thesis Supervisor
Dr. Jesse Louis Barlow, Thesis Honors Advisor
- Keywords:
- Machine Learning
Clustering
Keyword
Natural Language Processing
Artificial Intelligence
- Abstract:
- To develop a program that can interpret the semantic meaning of text, we first developed a word clustering program that determines the general topic of words. Word clustering provides contextual information about words, which can then serve as a resource for further applications. Knowing the general topic of a word helps with several problems. One is definition disambiguation: some words have multiple definitions. The word “scale,” for example, has a different meaning in each of the contexts “scale a wall,” “fish’s scale,” and “weighing on a scale.” Knowing the topical context of a word helps identify which definition applies to each instance. Word clustering could also help a program generate more fluent sentences: by determining which words tend to belong together, the program can place contextually similar words together to create more natural-sounding sentences. To perform clustering, we computed the frequency of each word relative to every other word and used these frequencies to find the characteristic relationships between words. We then clustered the words by grouping those with the strongest relationships. To evaluate our algorithm, we compared its results with those of the commonly used Latent Dirichlet Allocation (LDA) algorithm. In a double-blind test, we found that our algorithm was comparable to existing LDA techniques and is thus suitable for our purposes.
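
The clustering step described in the abstract can be illustrated with a minimal sketch. The abstract does not state the exact relatedness measure or grouping rule, so everything specific here is an assumption made for illustration: words are related through the sentences they share, the score is a Jaccard ratio over sentence counts, and words are grouped by linking every pair whose score meets a threshold. The function names (`cooccurrence_counts`, `relatedness`, `cluster_words`) and the `threshold` parameter are hypothetical and are not taken from the thesis.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(sentences):
    """Count sentences per word and sentences shared by each word pair."""
    word_counts = Counter()
    pair_counts = Counter()
    for sentence in sentences:
        # Use each distinct word once per sentence; sorting keeps pair keys canonical.
        words = sorted(set(sentence.lower().split()))
        word_counts.update(words)
        pair_counts.update(combinations(words, 2))
    return word_counts, pair_counts


def relatedness(pair, word_counts, pair_counts):
    """Assumed score: shared sentences / sentences containing either word (Jaccard)."""
    a, b = pair
    shared = pair_counts[pair]
    return shared / (word_counts[a] + word_counts[b] - shared)


def cluster_words(sentences, threshold=0.5):
    """Group words by linking every pair whose score is at least `threshold`."""
    word_counts, pair_counts = cooccurrence_counts(sentences)
    parent = {word: word for word in word_counts}

    def find(word):
        # Union-find lookup with path halving.
        while parent[word] != word:
            parent[word] = parent[parent[word]]
            word = parent[word]
        return word

    for pair in pair_counts:
        if relatedness(pair, word_counts, pair_counts) >= threshold:
            root_a, root_b = find(pair[0]), find(pair[1])
            if root_a != root_b:
                parent[root_b] = root_a

    clusters = {}
    for word in word_counts:
        clusters.setdefault(find(word), set()).add(word)
    return sorted(clusters.values(), key=lambda cluster: sorted(cluster))
```

Calling `cluster_words` on a list of sentence strings returns a list of word sets. Raising `threshold` keeps only the strongest relationships and so yields smaller, tighter clusters; a real corpus would also need tokenization and stop-word handling, which this sketch omits.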