Macrobase Diff — minimal implementation
In my last post ( https://pzakrzewski.com/posts/macrobase/) I introduced Macrobase, a tool and a methodology for prioritising attention in data analysis. I said that I would try to reimplement part of the Macrobase pipeline to understand it better, and this is what today’s post is about.
Along with many other pipelines implemented in Macrobase one got its own separate publication: https://cs.stanford.edu/~matei/papers/2019/vldb_macrobase_diff.pdf
The Diff Operator belongs to the Explain part of the methodology. Tools that are supposed to provide possible clues as to what makes detected outliers/anomalies in the data special, or different than the in-group.
The diff requires the data to already be classified into two groups (traditionally outliers and inliers) then it conducts following steps in order to provide its explanations:
- filter attributes on support (i.e. how many rows contain given column-value combination)
- mine attribute combinations satisfying min support parameter
- filter on risk ratio
- present remaining attribute combinations sorted on risk ratio
The original Macrobase provides a tutorial which nicely introduces the Diff Operator. You can find it here: https://macrobase.stanford.edu/docs/sql/tutorial/
Aside from calculating risk-ratio metric, better known from epidemiology ( https://en.wikipedia.org/wiki/Relative_risk) and mining the attribute combination the Diff Operator has one more thing going for it. As you may have noticed if you followed the link to the Macrobase tutorial, the Diff is implemented to work in an SQL-like shell. It is supposed to be an extension to SQL. Allowing for building queries like:
SELECT * FROM DIFF (SELECT * FROM wiki WHERE deleted > 0.0) outliers, (SELECT * FROM wiki WHERE deleted <= 0.0) inliers ON *;
The above query (taken verbatim from the aforementioned tutorial) divides a table with data about wikipedia posts into two groups based on on value of one attribute (deleted) and will attempt to find explanations (remaining attribute combinations that are more common among outliers) within all categorical columns (this is what ` ON *`` does).
Like all Macrobase tools and methodologies it is meant to work in a streaming mode as well, this is where the side-line Fast Data come from in the original paper. But streaming implementation of the Diff Operator I will have to cover some other time.
I decided to start small and implemented a simple command line utility in Python that given a csv and an outlier query will return the Diff output.
You can find it on my GitHub: https://github.com/PiotrZakrzewski/macrobase-diff
I will continue working on it and I will try to document the process along the way. Stay tuned!