Box Office Prediction
Team member: Tianliang Tao, Steve Zhu
Github Repo: https://github.com/TianliangTao/Project.git
Goal
In this project, our goal is that studios and investors can use forecasting methods to predict the revenue that a new movie can generate based on some given information, like budget, runtime, popularity, and so on.
In this project, the primary goal is to build a machine learning model to predict the revenue of movies according to the distributor, budget, release date, running time, genres and other information.
Required Tools
- Programming language: Python
- Compiler: Google Colab, Visual Studio Code
- Libraries involved: pandas, Numpy, Matplotlib, Sklearn, Plotly.express, Xgboost, Tensorflow, Scrapy
Flowchart
§1.Crawling the Data
In order to get the data, we use scrapy crawling the date from Box Office Mojo. We crawl the movie information from 1990 to 2021, which include the movie title, release date, runtime, budget, distibutor, gener, and so on.
import scrapy
class MovieSpider(scrapy.Spider):
name = "movie"
# urls to start scraping/crawling from
start_urls = [f"https://www.boxofficemojo.com/year/{1990 + i}" for i in range (33)]
def parse(self,response):
"""
This function is to access the website that include the list of movies for each year.
And choose 300 movies at most from each year.
Click the movie automatically and go to next page.
"""
for data in response.css("tr")[1:300]:
next_page = data.css("td.a-text-left.mojo-field-type-release.mojo-cell-wide a.a-link-normal").attrib["href"]
next_page = response.urljoin(next_page)
yield scrapy.Request(next_page, callback = self.parse_data_page)
def parse_data_page(self,response):
"""
This function is to crawl the information of each movie, which include the movie title,
domestic,international, worldwild,budget,MPAA, and so on.
"""
movie = response.css("div.a-fixed-left-grid-col.a-col-right h1.a-size-extra-large::text").get()
for grosses in response.css(".mojo-performance-summary-table"):
domestic = grosses.css("div:nth-child(2) > span:nth-child(3) > span:nth-child(1)::text").get()
international = grosses.css("div:nth-child(3) > span:nth-child(3) > a:nth-child(1) > span:nth-child(1)::text").get()
worldwild = grosses.css("div:nth-child(4) > span:nth-child(3) > a:nth-child(1) > span:nth-child(1)::text").get()
for data in response.css(".mojo-summary-values"):
distributor = data.css("div:nth-child(1) > span:nth-child(2)::text").get()
opening = data.css("div:nth-child(2) > span:nth-child(2) > span:nth-child(1)::text").get()
if data.css("div:nth-child(3) > span:nth-child(1)::text").get() == "Budget":
budget = data.css("div:nth-child(3) > span:nth-child(2) > span:nth-child(1)::text").get()
if data.css("div:nth-child(4) > span:nth-child(1)::text").get() == "Release Date":
release_date = data.css("div:nth-child(4) > span:nth-child(2) > a:nth-child(1)::text").get()
if data.css("div:nth-child(5) > span:nth-child(1)::text").get() == "MPAA":
MPAA = data.css("div:nth-child(5) > span:nth-child(2)::text").get()
if data.css("div:nth-child(6) > span:nth-child(1)::text").get() == "Running Time":
running_time = data.css("div:nth-child(6) > span:nth-child(2)::text").get()
if data.css("div:nth-child(7) > span:nth-child(1)::text").get() == "Genres":
genres = data.css("div.a-section:nth-child(7) > span:nth-child(2)::text").get()
yield{
"Movies":movie,
"Domestic":domestic,
"International":international,
"Worldwild":worldwild,
"Distributor":distributor,
"Opening":opening,
"Budget":budget,
"Release_Date":release_date,
"MPAA":MPAA,
"Runing_Time":running_time,
"Genres":genres
}
We have a 503 in the process of scraping the data because too much data need to crawl. In order to solve this error, we create download middleware that keeps retrying with new proxy the URLs that have 503
response until they are successfully scraped.
# Retry many times since proxies often fail
RETRY_TIMES = 10
# Retry on most error codes since proxies fail for different reasons
RETRY_HTTP_CODES = [500, 503, 504, 400, 403, 404, 408]
DOWNLOADER_MIDDLEWARES = {
'scrapy.downloadermiddlewares.retry.RetryMiddleware': 1,
'scrapy_proxies.RandomProxy': 400,
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 110,
}
PROXY_LIST = "/Users/10647/anaconda3/envs/movie/proxylist.txt"
# Proxy mode
# 0 = Every requests have different proxy
# 1 = Take only one proxy from the list and assign it to every requests
# 2 = Put a custom proxy to use in the settings
PROXY_MODE = 0
§2.Data Cleaning and Split
In this step, we need to check the data and clean the data.
#load the data and take a look
url = "https://github.com/TianliangTao/Project/blob/main/movie/movie.csv?raw=true"
df = pd.read_csv(url)
df.head()
We see there are some non sense symbols, so we need to
- Delete the dollar sign and comma
- Check whether the distributor name is unique
- Check how many ‘nan’ we have in our data
- Change the release date to number
- Change the running time to minutes only
- Convert genre features to list
We will use modules to break down large programs into small manageable and organized files. Modules will save the code with the file extension .py
. We create different modules for data cleaning, split dataser into training data and test data, and keras. Model. In order to access modules on colab, I mount Google Drive using an authorization code.
from google.colab import drive
import sys
drive.mount('/content/gdrive')
sys.path.append('/content/gdrive/MyDrive/box office/modules')
This part is a module for data cleaning. We will use import data_clean
import this module in our main code page. Due to Distributor, MPAA, Release_Year, Release_Month and Genres features contain various type of values, we covert them to dummy variables. In order to make our model converge to better weights, we use data normalization to norem the data.
import pandas as pd
import numpy as np
from sklearn import preprocessing
def clean_number(data):
"""
Input data contains dollar sign and comma in str format
Ouput the data in int format without nan, the dollar sign and comma
"""
df = data.copy()
# Remove comma
df = df.str.replace(',', '',regex=True)
# Remove dollar sign
df = df.str.replace('$', '',regex=True)
# Convert to int type
df = df.astype(int)
return df
def change_runningtime(data):
"""
Input data contains running time information in str format
Output the running time in minitues and in int format
"""
df = data.copy()
# Add '00 min' to the data so the index will not be out of range for the data does not contain miniutes inforamtion
df = df + ' 00 min'
df = df.str.split(' ').apply(lambda x: int(x[0]) * 60 + int(x[2]))
return df
def prepare_data(data):
"""
This function will call the function clean_number() and the function change_runningtime().
clean the data, conver certain data to dummy variables and normalize the data
"""
# Take a copy first
df = data.copy()
# Fix typo
df = df.rename(columns = {'Worldwild':'Worldwide', 'Runing_Time':'Running_Time'})
# Put 0 in the international feature of movies do not have the international box office information
df['International'] = df['International'].replace(np.nan, '0')
# Delete the movies with missing information
df = df.dropna()
# Delete the dollar sign and the comma, and convert the number in to int format
df['Domestic'] = clean_number(df['Domestic'])
df['International'] = clean_number(df['International'])
df['Worldwide'] = clean_number(df['Worldwide'])
df['Opening'] = clean_number(df['Opening'])
df['Budget'] = clean_number(df['Budget'])
# Convert the format of running_time to minutes
df['Running_Time'] = change_runningtime(df['Running_Time'])
# Extract the year and month information from release_date
df['Release_Date'] = pd.to_datetime(df['Release_Date'])
df['Release_Year'] = df['Release_Date'].dt.year
df['Release_Month'] = df['Release_Date'].dt.month
# Clean the space and \n in Genres feature and create lists contain genre information
df['Genres'] = df['Genres'].str.replace(' ','')
df['Genres'] = df['Genres'].str.split('\n\n')
# Covert Distributor, MPAA, Release_Year, Release_Month and Genres features to dummy variables
dist_dum = pd.get_dummies(df.Distributor, prefix='Distributor')
mpaa_dum = pd.get_dummies(df.MPAA, prefix="MPAA")
year_dum = pd.get_dummies(df.Release_Year, prefix="Release_Year")
month_dum = pd.get_dummies(df.Release_Month, prefix="Release_Month")
genre_dum = pd.get_dummies(df['Genres'].apply(pd.Series).stack(), prefix="Genre").groupby(level=0).sum()
df = pd.concat([df, dist_dum, mpaa_dum, year_dum, month_dum, genre_dum], axis=1)
df.drop(columns=['Movies', 'Distributor', 'MPAA', 'Release_Date', 'Release_Year', 'Release_Month', 'Genres'], inplace=True)
# Normalize the data
columns_to_normalize = ['Domestic', 'International', 'Worldwide', 'Opening', 'Budget', 'Running_Time']
mean_scaler = preprocessing.StandardScaler()
df[columns_to_normalize] = mean_scaler.fit_transform(df[columns_to_normalize])
df['Worldwide'] = df['Worldwide'].to_numpy(dtype = np.float32).reshape((-1, 1))
return df
To check the data after cleaning.
import data_clean1
df1 = data_clean1.prepare_data(df)
df1.head()
In order to predict the box office revenue, we split data in to three group. 70% data for trainng. 10% data for validation, and 20% for testing. And we also create a module for this part.
import tensorflow as tf
def make_dataset(data, shuffle=True):
"""
shuffles the elements of this dataset
Create the train, test and validation data set
"""
df = data.copy()
labels = df['Worldwide']
features = df.drop(columns = ['Worldwide'])
# get the slices of array in the form of objects
ds = tf.data.Dataset.from_tensor_slices((features, labels))
# shuffles the elements of dataset
if shuffle:
ds = ds.shuffle(buffer_size=len(ds))
train_size = int(0.7*len(ds))
val_size = int(0.1*len(ds))
# data[:train_size]
train = ds.take(train_size).batch(20)
# data[train_size : train_size + val_size]
val = ds.skip(train_size).take(val_size).batch(20)
# data[train_size + val_size:]
test = ds.skip(train_size+val_size).batch(20)
return train,val,test
§3.Modeling
(a) TensorFlow
Create a machine learning model using tensorflow. We also creat a module for this create_modle()
function. This module will assemble layers into the model, and compile the model, with loss and optimizer functions.
import tensorflow as tf
from keras import regularizers
from keras import optimizers
def create_model(input_size):
"""
Creat a sequence of layers. The first layer is input layer and the final layer is an output layer
"""
model = tf.keras.Sequential([
tf.keras.layers.Input(shape=(input_size, )),
tf.keras.layers.Dense(356,activation='relu',kernel_regularizer=regularizers.l1(.001)),
tf.keras.layers.Dropout(0.2),
tf.keras.layers.Dense(256,kernel_regularizer=regularizers.l1(.001),activation='relu'),
tf.keras.layers.Dense(units = 1)
])
# use model.compile() config the model with optimizer,losses and metrics
model.compile(optimizer=tf.keras.optimizers.RMSprop(learning_rate=.001),loss='mse', metrics=['mean_squared_logarithmic_error'])
return model
Train the model by calling the fit method.
import model
model1 = model.create_model(160)
history = model1.fit(train, validation_data = val, epochs = 100)
Plot the origin revenue and revenue of prediction.
test_predictions = model1.predict(test).flatten()
test_true = np.concatenate([y for x, y in test], axis=0)
fig = px.scatter(x = test_true, y = test_predictions)
fig.update_layout(
width = 1000,
height = 1000,
yaxis_range=[-1,7],
xaxis_range=[-1,7]
)
fig.update_yaxes(
scaleanchor = "x",
scaleratio = 1,
)
fig.show()
We see that our model performance is not good. We need to add more features into our data to improve the performance. We use TMDB API
add original_language and popularity into the dataset.
# create an empty dataset
added_df = pd.DataFrame(columns=[])
# use tmdb api search the original_language and popularity of each movie in the previous dataset
for movie_name in df['Movies']:
response = requests.get('https://api.themoviedb.org/3/search/movie?api_key=c856b5dd6e0385ee3a021059a0a6cca1&query='+movie_name)
responded = response.json()
try:
added_info = pd.DataFrame({'Original_Language': [responded['results'][0]['original_language']], 'Popularity': [responded['results'][0]['popularity']]})
except:
added_info = pd.DataFrame(np.nan, index = [0], columns=['Original_Language', 'Popularity'])
# append the data
added_df = added_df.append(added_info)
added_df = added_df.reset_index()
added_df
Add the new data into previous dataset.
df['Original_Language'] = added_df['Original_Language']
df['Popularity'] = added_df['Popularity']
We train this new dataset and model it, but it also have a poor performence. We find add more features into our data cannot improve the performance. Thus we decided to use a new model to train the data.
(b)XGBoost
We will use XGBoost to train a model to find patterns in a dataset with other features and then uses the trained model to predict box office revenue.
from xgboost import XGBRegressor,plot_importance
train, test = train_test_split(df4, test_size = 0.2, random_state = 1)
y_train = train['Worldwide']
y_test = test['Worldwide']
x_train = train.drop(columns=['Worldwide'])
x_test = test.drop(columns=['Worldwide'])
xgb_model = XGBRegressor(learning_rate=0.05,
n_estimators=10000,max_depth=4)
xgb_model.fit(x_train, y_train, early_stopping_rounds=100,
eval_set=[(x_test, y_test)], eval_metric = 'rmse')
xbg_val_predictions=xgb_model.predict(x_test)
test_predictions = xgb_model.predict(x_test)
test_true = y_test
fig = px.scatter(x = test_true, y = test_predictions)
fig.update_layout(
width = 1000,
height = 1000,
yaxis_range=[0,2000000000],
xaxis_range=[0,2000000000]
)
fig.update_yaxes(
scaleanchor = "x",
scaleratio = 1,
)
fig.show()
According to the plot we cansee that our model has a good performence. XGBoost has a higher result in terms of accuracy and it fits this model better.
Conclusion
We can predict box office revenue by using feature such as budget, popularity, runtime, etc. People in the film industry and film related can use machine learning model to predictbox office revenue by inputting the features. According to prediction result, our model has some limitations because it cannot provide accurate results. In order to improve the model performance, we need to add more data sets and add some more features. Therefor, we need more observation data to capturemore variability in our testing data set. A large number of data sets can better measure the accuracy of the model. And different machine learning frameworks can affect the accuracy of the result.