Using Data Mining to Detect Hate Speech on Twitter

Open Access

Author:: Ciccarelli, Joshua
Area of Honors:: Data Sciences
Degree:: Bachelor of Science
Document Type:: Thesis
Thesis Supervisors:: Anna Cinzia Squicciarini, Thesis Supervisor
John Yen, Thesis Honors Advisor
Keywords:: Machine Learning
Twitter
Data Mining
Data Science
Cyber Aggression
Cyber Bullying
Abstract:: This experimental study aims to build a machine learning model that accurately classifies tweets as malicious or not malicious. The first set of experiments uses a dataset of 80K tweets gathered through Twitter’s API. Each row contains features such as the tweet text, follower count, reply text, and reply count. The dataset was processed to extract text features and then used as training data for a model that automatically detects cyber hate speech. Models were compared by the root mean squared error computed on the testing data. The second experiment uses a dataset of 24K tweets that includes labels and counters for each tweet. Overall, the study aims to further research on machine learning for social media and to help social media platforms detect vulgar content. Social media platforms, especially Twitter, host a large amount of malicious content, and a well-trained machine learning model can detect this content so that it can be removed.
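
The abstract compares models by the root mean squared error (RMSE) on testing data. A minimal sketch of that comparison step follows. It is an illustration only, not the thesis code: the labels, the predicted scores, and the names `model_a` and `model_b` are hypothetical, and the encoding 1 = malicious, 0 = not malicious is an assumption.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error between true labels and predictions."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))


# Hypothetical held-out labels (assumed encoding: 1 = malicious, 0 = not).
labels = [1, 0, 0, 1, 0]

# Hypothetical predicted scores from two candidate models.
model_scores = {
    "model_a": [0.9, 0.2, 0.1, 0.6, 0.3],
    "model_b": [0.6, 0.5, 0.4, 0.5, 0.5],
}

# Compute the test RMSE for each model and pick the lowest one.
errors = {name: rmse(labels, scores) for name, scores in model_scores.items()}
best_model = min(errors, key=errors.get)
```

A lower RMSE means the predicted scores sit closer to the true labels, so in this sketch `best_model` is the candidate with the smallest test error.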
