Macrobase Diff Case Study 1: IMDB Movies

Lets analyse a known dataset: the Kaggle IMDB dataset, which contains info about some of the best rated movies. We will see what useful insights we can learn by using Macrobase Diff to explain differences between the top of the ranking and the less popular ones.

(See my original post introducing my minimal implementation of Macrobase Diff)

After fetching the dataset I transformed it a bit, most notably I collapsed fields containing JSON inside cells as this is not something Macrobase Diff can handle as of now. If you are curious how the dataset was treated before analysis see this gist.

Lets see what kind of data do we have about the IMDB movies:

'budget', 'genres', 'homepage',
'id', 'keywords', 'original_language',
'original_title', 'overview', 'popularity',
'production_companies','production_countries',
'release_date', 'revenue', 'runtime',
'spoken_languages', 'status', 'tagline',
'title', 'vote_average', 'vote_count

There are two types of columns that Macrobase Diff can use:

numeric, we can use them to build a query to find the outliers (for instance fields like popularity or budget
categorical (or attributes), those are non-numeric columns. They will be used to explain differences between the in-group and the outliers. Like country, or the title

Not all categorical columns lend themselves well to the explanation by the Diff however. Some of them have too high cardinality, most of their rows are unique, good examples of that are the "tagline" and the "homepage" columns. These values coming from these fields will never have enough support (% representation in the whole dataset) to ever be considered for explaining the differences. After excluding such columns, I ended up with:

"title" ,"spoken_languages" ,"production_companies",
"production_countries", "genres", "keywords",
"original_language", "popularity"

You might notice that I kept the title field, while this attribute has all of its values unique, this attribute will be filtered out and ignored for the explanation finding but will still be printed out when showing outliers which is of course handy.

What is Different between the most popular movies on IMDB and the rest?

In order to answer this question we can use popularity column, let's write a simple query that will mark part of the dataset as "outliers", Macrobase Diff allows us to use just one metric for now but this is enough for this simple case.

python -m mbdiff --query "popularity > 125.0" movies.csv

This query will select movies such as "Avengers: Age of Ultron", "Jurassic World", "Interstellar", "The Martian" and may other most people will immediately recognise. The selection will be printed out in the first section of the results, I will skip it because it is rather lengthy and jump to the more interesting part: the explanations, or what makes these movies different from the rest?

Macrobase Diff uses (for now) risk ratio to score explanations. Simply said, risk ratio score tells us how much more specific given attribute combination is to the outlier group as compared to the inlier group. The higher the score the better the given explanation. An explanation is synonymous with an attribute combination. See the output below, each row is one explanation.

Explanations
       score  genres    original_language    production_countries      spoken_languages
--  --------  --------  -------------------  ------------------------  ------------------
 0  0.819568  Drama     en                   United States of America  -
 1  0.817379  Drama     -                    United States of America  English
 2  0.805825  Drama     -                    United States of America  -
 3  0.721123  Drama     en                   -                         English
 4  0.687639  Drama     en                   -                         -
 5  0.677253  Drama     -                    -                         English
 6  0.67321   Comedy    -                    United States of America  English
 7  0.667347  Comedy    en                   United States of America  -
 8  0.663208  Comedy    -                    United States of America  -
 9  0.615054  Comedy    en                   -                         English
10  0.605647  Comedy    -                    -                         English
11  0.597159  Comedy    en                   -                         -
12  0.58186   Drama     -                    -                         -
13  0.572761  Comedy    -                    -                         -

Macrobase Diff reported that the most popular movies on IMDB are much more likely to be English Dramas produced in the USA. The score of 0.8 given to the top explanation is not particularly high, so I would not be too attached to it.

Let's try to refine the results by limiting amount of attributes used for generating explanations. By default Macrobase Diff uses 3 as the maximum number of attributes and considers all categorical columns initially.

python -m mbdiff --query "popularity > 125.0"  --max-order 2 movies.csv

yielding

Explanations
       score  genres    production_countries      original_language    spoken_languages
--  --------  --------  ------------------------  -------------------  ------------------
 0  0.805825  Drama     United States of America  -                    -
 1  0.687639  Drama     -                         en                   -
 2  0.677253  Drama     -                         -                    English
 3  0.663208  Comedy    United States of America  -                    -
 4  0.605647  Comedy    -                         -                    English
 5  0.597159  Comedy    -                         en                   -
 6  0.58186   Drama     -                         -                    -
 7  0.572761  Comedy    -                         -                    -

What is confounding the results in the 3-attribute explanations and is still very much visible in the 2-attribute explanations is the correlation between columns. Nearly all movies from the USA are in English and most movies in the dataset that are in English come from the USA. That provides some redundancy and the tool cannot help with that yet - this has to be taken care of during the data cleaning / ETL step.

The result from 1-attribute explanations are quite crisp however:

Explanations
       score  genres
--  --------  --------
 0  0.58186   Drama
 1  0.572761  Comedy

Confirming what we glanced from a more noisy explanation table initially, that the most popular movies tend to be Dramas and Comedies. Take note however of the lower score (risk ratio). Just having Genre equal Drama or Comedy is less specific than being an English Drama from the USA, hence it is more common among the inliers (less popular movies) and worse as an explanation.

(Originally posted on pzakrzewski.com )