Using NLP and ML to isolate great episodes from the rest

Introduction

Game of Thrones (GoT) is a show enjoyed by millions around the world. Its fame has caught the attention of many, including advertisers who want their ads to run on HBO alongside GoT so that millions will see them. Of course, this only pays off if the show keeps putting out good episodes and keeps its audience in a good mood.

It is often difficult to determine what distinguishes a good episode from a bad one. Tools such as Natural Language Processing (NLP) and Machine Learning (ML) make this easier: they allow us to go through each episode with a fine-tooth comb and, hopefully, help identify good episodes before they air.

Using feature engineering and NLP on the scripts of Game of Thrones, I am going to try to determine what defines a good episode and what defines a bad one. This can help the creators of the show continue putting out great television, or help advertisers determine whether their ad will be played in front of a good episode or a bad one.

What defines a good episode?

Before we can contrast good episodes with bad ones, we need to decide what defines a good episode. Luckily, this has been taken care of in the form of a rating that each episode receives: on IMDb, tens of thousands of viewers have rated each episode on a scale of 0-10. If one looks at the ratings across all of the seasons, one notices that, with the exception of Season 8, the data is heavily skewed towards the top of the scale.

What happened with Season 8?

First, let’s address Season 8, as it stands out as the single worst-rated season in GoT history. A quick explanation is that the TV show was given a sudden end date, so the creators had to rush and go off-script from George R. R. Martin’s storyline, which upset many fans. Since Season 8 did not follow the books like previous seasons and is an outlier in terms of ratings, we will ignore it and look only at Seasons 1-7.

Since the rating is on a continuous scale that is heavily skewed, it will be difficult to label good and bad episodes unless we create a split. For this case I have gone with the whole number 9: it is the highest whole-number average rating an episode can realistically receive, and an episode rated 8-point-something lacked something for enough people that it did not make it into the nines. Luckily, as you will see in the next section, choosing 9 also gives us a very even split in our data, with little to no class imbalance.
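The labeling rule described above can be sketched in a few lines of pandas. This is only a minimal illustration: the ratings in the toy table are made up, the `Good` column name and the `label_episodes` helper are hypothetical, and treating a rating of exactly 9.0 as "good" is an assumption about how the boundary is handled.

```python
import pandas as pd

# Toy stand-in for the ratings data; these values are made up for illustration only.
ratings = pd.DataFrame({
    "Season": [1, 1, 7, 8, 8],
    "Episode": [1, 2, 1, 1, 2],
    "Rating": [9.1, 8.8, 9.0, 7.6, 7.9],
})

GOOD_THRESHOLD = 9.0  # the split chosen in the text


def label_episodes(df, threshold=GOOD_THRESHOLD, last_season=7):
    """Drop seasons after `last_season` and add a binary Good column.

    Assumption: an episode rated exactly at the threshold counts as good.
    """
    kept = df[df["Season"] <= last_season].copy()
    kept["Good"] = (kept["Rating"] >= threshold).astype(int)
    return kept


labeled = label_episodes(ratings)

# Share of each class; on the real data this is how one could check the class balance.
print(labeled["Good"].value_counts(normalize=True))
```

Printing the normalized value counts is a quick way to confirm whether the chosen threshold really yields a roughly even split.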

What is the Data that we’re working with?

We are starting off with two datasets that we will combine to create a new dataset, on which we will apply our machine learning techniques. Dataset 1 is called GOT_Ratings_Original.csv. It contains the ratings and viewer numbers of each episode; a detailed description is provided below. Dataset 2 is called Game_of_Thrones_Script_Original.csv. It mainly contains the dialogue that was spoken on the show, along with the title of each episode and the name of the character who said each line; a detailed description is also provided below.

GOT_Ratings_Original.csv

Column name:	Season	Episode	Rating	Viewers
Type of data:	Discrete whole numbers from 1-8	Discrete whole numbers from 1-10	Continuous real numbers from 0-10	Continuous positive real numbers
Description:	Identifies which season a data point belongs to	Identifies which episode a data point belongs to; ambiguous if used without the Season column	Gives the average rating of an episode over the thousands of presumed viewers who rated it on IMDb on a scale of 0-10	Number of people who tuned in to view the episode on the day of airing

Game_of_Thrones_Script_Original.csv

Column name:	Release Date	Season	Episode	Episode Title	Name	Sentence
Type of data:	Date Time	String	String	String	String	String
Description:	Contains the release date of the episode the data point belongs to	Contains the season of each data point, stored as a string in the format “Season #”, where the word Season is followed by the season number	Contains the episode of each data point, stored as a string in the format “Episode #”	Contains the name of the episode	Contains the name of the character who delivered the line stored in the data point	Contains the full dialogue delivered by a character. Each data point represents a single uninterrupted line delivered by a character
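Because the script file stores Season and Episode as strings while the ratings file stores them as numbers, the two tables need a common key before they can be combined. The sketch below shows one possible way to do this with pandas. The column names come from the tables above, but the rows are placeholders rather than real data, and the `build_episode_table` helper and the `Lines` and `Dialogue` columns are hypothetical names introduced for illustration.

```python
import pandas as pd

# Placeholder rows shaped like the script file; not real dialogue.
script = pd.DataFrame({
    "Season": ["Season 1", "Season 1", "Season 1"],
    "Episode": ["Episode 1", "Episode 1", "Episode 2"],
    "Name": ["character a", "character b", "character a"],
    "Sentence": ["placeholder line one", "placeholder line two", "placeholder line three"],
})

# Placeholder rows shaped like the ratings file; ratings are made up.
ratings = pd.DataFrame({
    "Season": [1, 1],
    "Episode": [1, 2],
    "Rating": [9.1, 8.8],
})


def build_episode_table(script_df, ratings_df):
    """Aggregate dialogue per episode and join it to the ratings."""
    s = script_df.copy()
    # Pull the number out of strings such as "Season 1" and "Episode 10".
    for col in ("Season", "Episode"):
        s[col] = s[col].str.extract(r"(\d+)", expand=False).astype(int)
    # One row per episode: number of lines and all dialogue joined into one string.
    per_episode = s.groupby(["Season", "Episode"], as_index=False).agg(
        Lines=("Sentence", "count"),
        Dialogue=("Sentence", " ".join),
    )
    # validate raises an error if either side has duplicate (Season, Episode) keys.
    return ratings_df.merge(
        per_episode, on=["Season", "Episode"], how="inner", validate="one_to_one"
    )


episodes = build_episode_table(script, ratings)
print(episodes[["Season", "Episode", "Rating", "Lines"]])
```

An inner join keeps only episodes present in both files, which is a design choice: it silently drops anything that fails to match, so comparing the row count before and after the merge is a sensible sanity check.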