A Natural Language Processing Analysis
Open Access
- Author:
- Mason, Peter Everest
- Area of Honors:
- Computer Science
- Degree:
- Bachelor of Science
- Document Type:
- Thesis
- Thesis Supervisors:
- John O'Hara, Thesis Supervisor
Dr. John Joseph Hannan, Thesis Honors Advisor
- Keywords:
- Natural Language Processing
NLP
Reddit
Social Media
- Abstract:
- In an era where “Big Data” pervades nearly every industry with the hope of gleaning new insights or ironing out inefficiencies, one often-overlooked place is the self-proclaimed “front page of the Internet.” The news aggregator and discussion forum Reddit (www.reddit.com) boasts several hundred million monthly users, and the documented interactions between so many strangers are a potential goldmine waiting to be tapped. In this thesis we use natural language processing, a burgeoning field in computer science, to explore and analyze the comment corpus of all Reddit users between July 2013 and May 2015. We first look into Reddit-specific phenomena, particularly the frequency and distribution of textual memes and reposted comments, and find that some memes maintain their influence and popularity throughout the two-year span while variations develop and persist within smaller online communities. The rate of reposts (unoriginal content) increases over that period as well, both in absolute numbers and when adjusting for the growing number of users. We next study the privacy and sharing habits of Reddit users, looking at how often names, addresses, and phone numbers are given out. Distinguishing genuinely personal information from all mentions of names, addresses, or phone numbers proves a less tractable task, though some heuristics help us approximate an upper bound on how often each of the three categories is shared over the two years. Finally, we explore some patterns relevant to other forms of social media. We analyze the structure of high-scoring comments and overtly political comments, and using machine learning we construct predictive classifiers for both areas. We ultimately find that while sentence structure and named entity categories certainly have some impact on whether a comment scores highly or is political, we are unable to explain why that is the case.