Linear Regression, Mixture Modeling, and Gradient Boosting to Predict Box Office Revenue: Leveraging Machine Learning in Volatile Industries
Open Access
Author:
Pevner, Joseph
Area of Honors:
Finance
Degree:
Bachelor of Science
Document Type:
Thesis
Thesis Supervisors:
Lingzhou Xue, Thesis Supervisor
Brian Spangler Davis, Thesis Honors Advisor
Keywords:
Machine Learning
Finance
Film
Abstract:
Financial decision-making relies fundamentally on the ability to predict future cash flows accurately, which is especially difficult in highly volatile markets. This thesis explores the growing practice of applying regression and machine learning techniques to financial forecasting through a case study of the notoriously erratic film industry. We propose three models of increasing complexity (a multiple linear regression, a finite mixture model, and a gradient boosted model) to predict Domestic Box Office Revenue from several pre-release factors. Exploratory analysis, data wrangling, and feature engineering are applied to a high-dimensional, vendor-acquired dataset, emphasizing the importance of ensuring data quality prior to prediction. Each model is trained with five-fold cross-validation repeated five times to promote robust predictions that generalize to unseen data. Comparing evaluation metrics (the Pearson Correlation Coefficient, Spearman’s Correlation Coefficient, Mean Absolute Error, and Root Mean Squared Error) across the three models shows that correlation between predicted and actual revenue strengthens and prediction error falls as model complexity increases. We find that the gradient boosted model is the most effective at predicting revenue, approximately halving the error of the baseline linear regression model, though it is difficult to extract general insights from it. We further propose finite mixture modeling as a balanced approach that maintains algorithmic interpretability while generating accurate estimates. These findings demonstrate the ability of high-powered machine learning algorithms, such as expectation-maximization and gradient boosting, to forecast revenue in volatile financial environments.
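The evaluation procedure summarized in the abstract (five-fold cross-validation with five repetitions, scored with the four listed metrics) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the thesis's implementation: the data are synthetic, a single-feature least-squares line stands in for the baseline regression, and every function name here (`repeated_kfold`, `fit_line`, `evaluate`, and the metric helpers) is hypothetical.

```python
import math
import random

def mae(actual, predicted):
    """Mean Absolute Error."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def rmse(actual, predicted):
    """Root Mean Squared Error."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

def pearson(x, y):
    """Pearson correlation coefficient."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg
        i = j + 1
    return result

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

def repeated_kfold(n_samples, n_splits=5, n_repeats=5, seed=0):
    """Yield (train, test) index lists: n_repeats reshuffled n_splits-fold partitions."""
    rng = random.Random(seed)
    for _ in range(n_repeats):
        idx = list(range(n_samples))
        rng.shuffle(idx)
        folds = [idx[f::n_splits] for f in range(n_splits)]
        for f in range(n_splits):
            train = [i for g in range(n_splits) if g != f for i in folds[g]]
            yield train, folds[f]

def fit_line(xs, ys):
    """Ordinary least squares for one feature (stand-in baseline model)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    return slope, my - slope * mx

def evaluate(xs, ys, n_splits=5, n_repeats=5, seed=0):
    """Average each metric over all held-out folds (5 x 5 = 25 by default)."""
    scores = {"pearson": [], "spearman": [], "mae": [], "rmse": []}
    for train, test in repeated_kfold(len(xs), n_splits, n_repeats, seed):
        slope, intercept = fit_line([xs[i] for i in train], [ys[i] for i in train])
        actual = [ys[i] for i in test]
        predicted = [slope * xs[i] + intercept for i in test]
        scores["pearson"].append(pearson(actual, predicted))
        scores["spearman"].append(spearman(actual, predicted))
        scores["mae"].append(mae(actual, predicted))
        scores["rmse"].append(rmse(actual, predicted))
    return {name: sum(vals) / len(vals) for name, vals in scores.items()}

# Synthetic toy data (an assumption for illustration, not box office data).
_rng = random.Random(42)
xs = [_rng.uniform(1, 100) for _ in range(200)]
ys = [2.5 * x + _rng.gauss(0, 10) for x in xs]
results = evaluate(xs, ys)
print(results)
```

Averaging each metric over all 25 held-out folds, rather than scoring a single split, makes the comparison between models less sensitive to any one partition of the data. In practice, a library routine such as scikit-learn's `RepeatedKFold` would likely replace the hand-written splitter.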