Tweet Categorization With Pattern Diffusion

Open Access
- Author:
- Bush, Zachary David
- Area of Honors:
- Computer Science (Behrend)
- Degree:
- Bachelor of Science
- Document Type:
- Thesis
- Thesis Supervisors:
- Meng Su, Thesis Supervisor
- Ronald Lee Mccarty, Thesis Honors Advisor
- Keywords:
- PCA
- Pattern Diffusion
- Diffusion Map
- Tweets
- Twitter
- Diffusion Geometry
- Abstract:
- In the last several years, there has been an influx of socially oriented technology. Websites such as Facebook and Twitter, which are driven by users sharing information, have appeared. The field of social computing is still very new, and the information available through these interactions is only beginning to be explored. This paper proposes applying data mining methods such as PCA (Principal Component Analysis) and Diffusion Geometry to the messages that users exchange in order to categorize those messages efficiently. Additionally, this paper proposes applying known ranking methods to identify the users who cause topics to become popular. To categorize users' tweets, each tweet is represented as an attribute vector built from the frequencies of certain keywords in the tweet's content. The one hundred most frequently appearing meaningful words are chosen as the keywords by which each tweet is represented. Because the data are high-dimensional, the clusters of tweet vectors produced by k-means cannot be displayed directly. The keywords may be related, however, so the intrinsic dimension of the data set is much smaller than one hundred. Thus, the two methods (PCA and Diffusion Geometry) are used to reduce the dimensionality of the data. Our computational results on about one thousand real tweets show that these two methods are promising in their ability to categorize tweets into groups.
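
The keyword-vector and PCA steps described in the abstract can be illustrated with a minimal sketch. This is not the thesis implementation: the stop-word list, the function names (`tokenize`, `top_keywords`, `tweet_vector`, `pca_2d`), and the use of power iteration in place of a library eigen-solver are all assumptions made for illustration, and the diffusion-map step is not covered.

```python
import math
import re
from collections import Counter

# Hypothetical stop-word list (an assumption; the source does not give one).
STOP_WORDS = {"a", "an", "and", "i", "in", "is", "it", "of", "on", "the", "to"}


def tokenize(text):
    """Lowercase a tweet, split it into word tokens, and drop stop words."""
    return [w for w in re.findall(r"[a-z']+", text.lower()) if w not in STOP_WORDS]


def top_keywords(tweets, n_keywords=100):
    """Return the n most frequent non-stop-word tokens across all tweets."""
    counts = Counter(w for t in tweets for w in tokenize(t))
    # Sort by descending count, then alphabetically, so ties are deterministic.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [w for w, _ in ranked[:n_keywords]]


def tweet_vector(tweet, keywords):
    """Represent one tweet as a list of keyword counts, in keyword order."""
    counts = Counter(tokenize(tweet))
    return [counts[k] for k in keywords]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def top_component(rows, iters=200):
    """Power iteration on X^T X; returns a unit vector along the dominant direction."""
    d = len(rows[0])
    v = [1.0 / (j + 1) for j in range(d)]  # fixed, deterministic start vector
    norm = math.sqrt(dot(v, v))
    v = [x / norm for x in v]
    for _ in range(iters):
        proj = [dot(r, v) for r in rows]  # X v
        w = [sum(p * r[j] for p, r in zip(proj, rows)) for j in range(d)]  # X^T (X v)
        norm = math.sqrt(dot(w, w))
        if norm == 0.0:  # no variance left in the data
            break
        v = [x / norm for x in w]
    return v


def pca_2d(rows):
    """Project rows onto their first two principal components."""
    n = len(rows)
    means = [sum(col) / n for col in zip(*rows)]
    centered = [[x - m for x, m in zip(r, means)] for r in rows]
    v1 = top_component(centered)
    first = [dot(r, v1) for r in centered]
    # Deflation: remove the first component before finding the second one.
    deflated = [[x - s * c for x, c in zip(r, v1)] for r, s in zip(centered, first)]
    v2 = top_component(deflated)
    return [(s, dot(r, v2)) for s, r in zip(first, deflated)]
```

Under these assumptions, calling `top_keywords(tweets)` and then `pca_2d([tweet_vector(t, keywords) for t in tweets])` would yield one two-dimensional point per tweet, which can be plotted to inspect the clusters.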