Open Access
Soepranoto, Preston Adrian
Area of Honors:
Industrial Engineering
Bachelor of Science
Document Type:
Thesis Supervisors:
  • Soundar Rajan Tirupatikumara, Thesis Supervisor
  • Catherine Mary Harmonosky, Honors Advisor
Keywords:
  • Movie
  • Data Analytics
  • Predictive Modeling
  • Data Science
  • Big Data
  • Machine Learning
Predicting a movie’s profitability, or return on investment (ROI), is a complex problem, especially for investors looking to make a sizable investment in the hope of earning even greater returns. To predict the ROI of a movie, studios must select the right mix of factors that contribute to the movie’s success. Factors involved in a movie production include, but are not limited to, the production budget, actors, directors, genre, and rating. This research integrates the classic factors tied to movies and develops new factors deemed crucial to a movie’s success in order to predict ROI. The factors in this study are divided into four groups: “star feature”, “integrated feature”, “descriptive feature”, and “time-based feature”. This thesis aims to build a predictive model using historical film data from a paid resource and to predict a film’s success by looking at its ROI. Only movies released between 2007 and 2016 were used. A film’s ROI was considered more indicative of investors’ goals than its box office revenue. The predictive model is based on machine learning techniques, specifically supervised learning. We approached this problem by classifying movies into different classes of ROI, ranging from “low ROI” to “very good ROI”, with the goal of having the model place each film in the correct class. This first required completing the data preparation stage. The data were then used to train the model with various learning algorithms, and each resulting model was evaluated using 10-fold cross-validation. This methodology allowed us to see which algorithm performed best for our model. We then went one step further and tried to predict the numerical value of a movie’s ROI through regression analysis. The prediction accuracy (based on ROI) was used as the benchmark measure.
The model with the highest overall accuracy, area under the receiver operating characteristic curve (AUC), precision, and recall was chosen for the final application. Based on these metrics, we found that, of the learning algorithms tested, “Random Forest” performed best for our model. This analysis also gave a clear indication of the impact of the selected features and how they relate to the ROI of a film. The relationship between the features and the ROI of the movie was measured using the Gini Index. According to the Gini Index, the attributes “Total ROI for Supporting Actor”, “Total ROI for Director”, and “Total ROI for Actor” rank first, second, and third, respectively, among the features.
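The pipeline described in the abstract (ROI classes, 10-fold cross-validation, Gini-based feature ranking) can be illustrated with a minimal standard-library sketch. Everything specific here is an assumption for illustration, not taken from the thesis: the ROI formula (net profit divided by budget), the class thresholds, the intermediate class names, and the helper names `roi`, `roi_class`, `gini_impurity`, `gini_decrease`, and `k_fold_indices`. It also assumes that “Gini Index” refers to the impurity decrease that random forest implementations commonly use to rank features.

```python
from collections import Counter

# Hypothetical class boundaries as (upper ROI bound, label). The abstract names
# only the "low ROI" and "very good ROI" ends of the scale; the thresholds and
# the middle labels below are illustrative assumptions.
ROI_CLASSES = [(0.0, "low ROI"), (1.0, "moderate ROI"), (3.0, "good ROI")]
TOP_CLASS = "very good ROI"


def roi(revenue, budget):
    """Assumed definition: net profit divided by production budget."""
    return (revenue - budget) / budget


def roi_class(value):
    """Map a numeric ROI to the first class whose upper bound exceeds it."""
    for upper, label in ROI_CLASSES:
        if value < upper:
            return label
    return TOP_CLASS


def gini_impurity(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def gini_decrease(feature, labels, threshold):
    """Impurity reduction from splitting the samples at feature <= threshold."""
    left = [y for x, y in zip(feature, labels) if x <= threshold]
    right = [y for x, y in zip(feature, labels) if x > threshold]
    n = len(labels)
    weighted = (len(left) / n) * gini_impurity(left) \
        + (len(right) / n) * gini_impurity(right)
    return gini_impurity(labels) - weighted


def k_fold_indices(n, k=10):
    """Yield (train, test) index lists for k-fold cross-validation."""
    folds = [list(range(i, n, k)) for i in range(k)]
    for i in range(k):
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, folds[i]
```

A feature whose best split yields a larger `gini_decrease` separates the ROI classes more cleanly; a random forest aggregates such decreases over many trees to produce a ranking like the one reported above.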