Let's analyse a well-known dataset: the Kaggle IMDB dataset, which contains information about some of the best-rated movies. We will see what useful insights we can get by using Macrobase Diff to explain the differences between the movies at the top of the ranking and the less popular ones.
(See my original post introducing my minimal implementation of Macrobase Diff)
After fetching the dataset I transformed it a bit. Most notably, I collapsed the fields that contain JSON inside cells, as this is not something Macrobase Diff can handle as of now. If you are curious how the dataset was treated before the analysis, see this gist.
Let's see what kind of data we have about the IMDB movies:
'budget', 'genres', 'homepage', 'id', 'keywords', 'original_language', 'original_title', 'overview', 'popularity', 'production_companies', 'production_countries', 'release_date', 'revenue', 'runtime', 'spoken_languages', 'status', 'tagline', 'title', 'vote_average', 'vote_count'
There are two types of columns that Macrobase Diff can use:
- numeric: we can use them to build a query that finds the outliers (for instance fields like popularity or budget)
- categorical (or attributes): non-numeric columns, such as the country or the title. They are used to explain the differences between the inliers and the outliers.
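To make the distinction concrete, here is a minimal sketch of how columns could be sorted into the two types. It is not taken from Macrobase Diff: the helper `split_columns`, the sample rows, and the rule itself (a column counts as numeric when every non-empty value parses as a float) are my assumptions for illustration.

```python
def is_number(value):
    """Return True if the string parses as a float."""
    try:
        float(value)
        return True
    except ValueError:
        return False


def split_columns(rows):
    """Split column names into (numeric, categorical).

    Assumption: a column is numeric when all of its non-empty
    values parse as floats; everything else is categorical.
    """
    numeric, categorical = [], []
    for name in rows[0]:
        values = [row[name] for row in rows if row[name] != ""]
        if values and all(is_number(v) for v in values):
            numeric.append(name)
        else:
            categorical.append(name)
    return numeric, categorical


# Made-up rows (strings, as a CSV reader would return them),
# using column names from the dataset
sample_rows = [
    {"title": "A", "popularity": "150.4", "budget": "1000", "original_language": "en"},
    {"title": "B", "popularity": "12.1", "budget": "", "original_language": "fr"},
]
numeric_cols, categorical_cols = split_columns(sample_rows)
```

With these sample rows, popularity and budget end up as candidate metrics, while title and original_language end up as attributes.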
Not all categorical columns lend themselves well to explanation by the Diff, however. Some of them have too high a cardinality: most of their values are unique. Good examples are the "tagline" and "homepage" columns. Values from these fields will never have enough support (their share of rows in the whole dataset) to be considered for explaining the differences. After excluding such columns, I ended up with:
"title", "spoken_languages", "production_companies", "production_countries", "genres", "keywords", "original_language", "popularity"
You might notice that I kept the title field. All of its values are unique, so it will be filtered out and ignored when looking for explanations, but it will still be printed when showing the outliers, which is of course handy.
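The support argument can be sketched in a few lines. This is only an illustration under my own assumptions: the helper names and the 0.1 threshold are hypothetical, and the threshold Macrobase Diff actually uses is not stated here.

```python
from collections import Counter


def max_support(values):
    """Share of rows taken by the most frequent value in a column."""
    counts = Counter(values)
    return counts.most_common(1)[0][1] / len(values)


def usable_attributes(columns, min_support=0.1):
    """Keep columns where at least one value reaches min_support.

    min_support=0.1 is an illustrative threshold, not the tool's default.
    """
    return [name for name, values in columns.items()
            if max_support(values) >= min_support]


# Made-up data: a low-cardinality column and an all-unique column
columns = {
    "genres": ["Drama", "Drama", "Comedy", "Drama", "Comedy"] * 4,
    "tagline": [f"tagline {i}" for i in range(20)],
}
kept = usable_attributes(columns, min_support=0.1)
```

In this toy data every tagline covers only 1 of 20 rows, so the column can never clear the threshold, while "Drama" alone covers 12 of 20 rows.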
What is different between the most popular movies on IMDB and the rest?
To answer this question we can use the popularity column. Let's write a simple query that marks part of the dataset as "outliers". Macrobase Diff allows just one metric in a query for now, but this is enough for this simple case.
python -m mbdiff --query "popularity > 125.0" movies.csv
This query will select movies such as "Avengers: Age of Ultron", "Jurassic World", "Interstellar", "The Martian" and many others that most people will immediately recognise. The selection is printed in the first section of the results. I will skip it because it is rather lengthy and jump to the more interesting part: the explanations, or what makes these movies different from the rest.
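Conceptually, a query like the one above just splits the rows into two groups. The sketch below shows that idea only. `split_by_threshold` is a hypothetical helper, not the tool's code, and the rows are placeholders rather than real IMDB values.

```python
def split_by_threshold(rows, metric, threshold):
    """Return (outliers, inliers) for a `metric > threshold` query."""
    outliers = [row for row in rows if row[metric] > threshold]
    inliers = [row for row in rows if row[metric] <= threshold]
    return outliers, inliers


# Made-up rows; titles and popularity values are placeholders
movies = [
    {"title": "Movie A", "popularity": 310.2},
    {"title": "Movie B", "popularity": 126.0},
    {"title": "Movie C", "popularity": 125.0},
    {"title": "Movie D", "popularity": 4.7},
]
outliers, inliers = split_by_threshold(movies, "popularity", 125.0)
```

Note that the comparison is strict, so a movie with popularity exactly 125.0 stays among the inliers.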
Macrobase Diff uses (for now) the risk ratio to score explanations. Simply put, the risk ratio tells us how much more specific a given attribute combination is to the outlier group than to the inlier group. The higher the score, the better the explanation. An explanation is synonymous with an attribute combination. See the output below; each row is one explanation.
Explanations
       score  genres    original_language    production_countries      spoken_languages
--  --------  --------  -------------------  ------------------------  ------------------
 0  0.819568  Drama     en                   United States of America  -
 1  0.817379  Drama     -                    United States of America  English
 2  0.805825  Drama     -                    United States of America  -
 3  0.721123  Drama     en                   -                         English
 4  0.687639  Drama     en                   -                         -
 5  0.677253  Drama     -                    -                         English
 6   0.67321  Comedy    -                    United States of America  English
 7  0.667347  Comedy    en                   United States of America  -
 8  0.663208  Comedy    -                    United States of America  -
 9  0.615054  Comedy    en                   -                         English
10  0.605647  Comedy    -                    -                         English
11  0.597159  Comedy    en                   -                         -
12   0.58186  Drama     -                    -                         -
13  0.572761  Comedy    -                    -                         -
Macrobase Diff reported that the most popular movies on IMDB are much more likely to be English Dramas produced in the USA. The score of 0.8 given to the top explanation is not particularly high, so I would not be too attached to it.
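For reference, this is the textbook definition of the risk ratio (relative risk), sketched with made-up counts. How my implementation scales or normalizes its printed score is not covered here, so treat the function below as the general definition, not as a way to reproduce the numbers in the table.

```python
def risk_ratio(out_with, in_with, out_without, in_without):
    """Textbook risk ratio for one attribute combination.

    out_with / in_with:       outliers / inliers that have the combination
    out_without / in_without: outliers / inliers that do not have it
    """
    exposed = out_with + in_with
    unexposed = out_without + in_without
    if exposed == 0 or out_without == 0:
        raise ValueError("risk ratio is undefined for these counts")
    # Outlier rate among rows with the combination,
    # divided by the outlier rate among rows without it
    return (out_with / exposed) / (out_without / unexposed)


# Made-up counts: 20 of 100 rows with the combination are outliers,
# versus 10 of 100 rows without it
ratio = risk_ratio(out_with=20, in_with=80, out_without=10, in_without=90)
```

With these counts, rows that have the combination are about twice as likely to be outliers as rows that do not.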
Let's try to refine the results by limiting the number of attributes used for generating explanations. By default Macrobase Diff uses at most 3 attributes per explanation and initially considers all categorical columns.
python -m mbdiff --query "popularity > 125.0" --max-order 2 movies.csv
Explanations
       score  genres    production_countries      original_language    spoken_languages
--  --------  --------  ------------------------  -------------------  ------------------
 0  0.805825  Drama     United States of America  -                    -
 1  0.687639  Drama     -                         en                   -
 2  0.677253  Drama     -                         -                    English
 3  0.663208  Comedy    United States of America  -                    -
 4  0.605647  Comedy    -                         -                    English
 5  0.597159  Comedy    -                         en                   -
 6   0.58186  Drama     -                         -                    -
 7  0.572761  Comedy    -                         -                    -
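To see why the maximum order matters, here is a rough sketch of how the number of candidate column combinations grows with it. It only counts combinations of columns. The helper name is hypothetical, and I am assuming nothing here about how the real search enumerates or prunes value-level combinations.

```python
from itertools import combinations


def candidate_combinations(attributes, max_order):
    """All column combinations of size 1..max_order."""
    combos = []
    for order in range(1, max_order + 1):
        combos.extend(combinations(attributes, order))
    return combos


attributes = ["genres", "original_language",
              "production_countries", "spoken_languages"]
order2 = candidate_combinations(attributes, 2)
order3 = candidate_combinations(attributes, 3)
```

For these four columns there are 10 combinations up to order 2 and 14 up to order 3, and each of them expands further once the individual values are taken into account.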
What confounds the results in the 3-attribute explanations, and is still very much visible in the 2-attribute explanations, is the correlation between columns. Nearly all movies from the USA are in English, and most English-language movies in the dataset come from the USA. That introduces redundancy, and the tool cannot help with that yet; it has to be taken care of during the data cleaning / ETL step.
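One way to spot this kind of redundancy during data cleaning is to check how often one attribute value implies another. The sketch below is an assumption of mine about how such a check could look. It is not a feature of Macrobase Diff, and the rows are made up.

```python
def conditional_share(rows, given, then):
    """Fraction of rows matching `given` that also match `then`.

    `given` and `then` are (column, value) pairs.
    """
    (g_col, g_val), (t_col, t_val) = given, then
    matching = [row for row in rows if row[g_col] == g_val]
    if not matching:
        return 0.0
    return sum(row[t_col] == t_val for row in matching) / len(matching)


# Made-up rows, not taken from the IMDB dataset
usa = "United States of America"
rows = (
    [{"production_countries": usa, "original_language": "en"}] * 4
    + [{"production_countries": usa, "original_language": "es"}]
    + [{"production_countries": "United Kingdom", "original_language": "en"}]
    + [{"production_countries": "France", "original_language": "fr"}] * 2
)
en_given_usa = conditional_share(
    rows, ("production_countries", usa), ("original_language", "en"))
usa_given_en = conditional_share(
    rows, ("original_language", "en"), ("production_countries", usa))
```

When both shares are close to 1, the two columns carry nearly the same information, and keeping only one of them should give less repetitive explanations.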
The results from the 1-attribute explanations are quite crisp, however:
Explanations
       score  genres
--  --------  --------
 0   0.58186  Drama
 1  0.572761  Comedy
This confirms what we gleaned from the noisier explanation table initially: the most popular movies tend to be Dramas and Comedies. Take note, however, of the lower score (risk ratio). Just having the genre equal to Drama or Comedy is less specific than being an English Drama from the USA. It is more common among the inliers (less popular movies) and is therefore a worse explanation.
(Originally posted on pzakrzewski.com )