Photo by Sebastian Coman Photography on Unsplash
Analysing GO code with Souffle
A painfully basic static code analysis example
func main() {
a := 1
b := a // you know what value b has without running the program
// can another program know it?
}
When you read source code, you intuitively know what values given variables take (like in the trivial example above), you don't always need to run a piece of code to know what state the variables will end up in. The process you follow in your head is a form of Static Analysis.
Other programs can also perform static analysis, think of your linters that help you prevent type errors or nil/null/None exceptions.
How can you know the value of a variable without running the code?
A typical linter you use every day relies on some pretty advanced static analysis methods, so let's start with something really basic (or even painfully basic) like the code snippet from above. Let's write a program that can find the value of b
in the scope of the main function at the moment of its declaration.
Soufflee - a tool for static analysis
The chances are, you have not heard about Souffle, nor its parents Datalog and Prolog. But in short, it is a language (with its toolchain) specialized in querying source code. You can think of it as SQL for source code.
Just like SQL, Soufflee is declarative, you specify what you want, not how to get it.
Buckle up as there are surprisingly a lot of steps you need to take before a program can answer the same static analysis question that you do intuitively! It is an interesting journey!
Defining types
Soufflee does not understand Golang syntax, it makes no assumptions about the paradigm, as it is a general-purpose tool you could use to analyze widely different languages like Haskell, Java, Lisp or Forth. You need to provide it with at least a subset of the rules that Golang follows. To analyse our code snippet we will need at the very least to define how to declare a variable and how to assign its value.
I will roughly follow the first example from Soufflee's home page, which demonstrates Variable-Points-to analysis which is close to what we want to accomplish with our humble code snippet. The subtle difference is that we want just the value of b
and are not interested in pointers like the homepage example does.
Let's teach Souffle about GO variables first.
.type var <: symbol
.type
: This Souffle keyword is used to declare a new type.var
: This is the name of the new type being declared.<:
: This symbol indicates that the type on the left (in this case,var
) is a subset of the type on the right.symbol
: In Soufflé, thesymbol
type is a built-in type that represents strings.
The given line of code defines a new type named var
which is a subset of the symbol
type. This means that any value of type var
is also a symbol
, but not necessarily vice versa. This subset relationship allows users to define more specific or restricted versions of existing types. We will use var
just to represent variables.
Mind that the type we declared for variables does not even distinguish between different types of values it can hold, it uses string to represent its values.
Defining Assignments
Now let's teach Soufflee about assigning values to our GO var
.decl assign( a:var, b:var )
.decl
: This keyword indicates the declaration of a new relation.
assign
: This is the name of the new relation being declared.
( a:var, b:var )
: These are the attributes (or columns) of the assign
relation. There are two attributes in this relation: a
and b
both of previously defined type var
.
This will be used to represent assignment operations where one variable (a
) is assigned the value of another variable or literal(b
).
Mind that our assign
relation does not distinguish between assigning a literal value like this:
a = 2 // 2 is a literal value
or assigning a value held by another variable, like this:
a = b // b is another variable of the same type
It also does not distinguish between the different ways you can declare and assign a variable in GO.
Encode The Code Snippet in Souffle
Let's use our two definitions to encode our tiny code snippet, to represent it fully in a way that Soufflee understands. The following section can be referred to as facts as it deals with describing the concrete state of the program and not general declarations.
assign("a","1").
assign("b","a").
... and that's it. You might ask - where are the var
s? They are declared implicitly. As we know assign
takes in var
as parameters, any new var
used in a relation will be declared implicitly. We end up with three var
s in our fact base: a, b and 1. and two relations between them:
a -> 1
b -> a
Souffle what is the value of b
?
We taught Souffle enough Go to understand variable declaration and assignment and then encoded our code snippet as facts. Now we can make a query to answer the main question! What is the value of b
? In other words, can we follow the chain of assignments that leads from b
to a literal value 1
?
.decl alias( a:var, b:var ) output
This line declares a new relation named alias
with two attributes of type var
. The output
keyword indicates that the results of this relation will be outputted, i.e., it is an output relation. Output is what Soufflee will report back to us, the answer to our query
Before we can use our new output relation to find the concrete value of b
we need to tell Soufflee a bit more about how we understand alias
and what facts satisfy this relation. What is an alias?
alias(X,X) :- assign(X,_).
alias(X,X) :- assign(_,X).
alias(X,Y) :- assign(X,Y).
The above lines provide rules for what makes an alias, at least one of these rules must be satisfied by a pair of var
s to be an alias
. In human language, these rules mean:
X is an alias of itself (the first two rules)
X is an alias of Y if Y is assigned to X
The above would be enough to tell the value of a
as it is a simple assignment (there is a fact assign(a,1)
in our fact base) but not yet what is the value of b
!
For that, we will need a query that can follow the chain of assignments.
alias(X,Y) :- assign(X,Z), alias(Z,Y).
This rule is essential for capturing the transitive nature of aliasing between variables. It's a recursive rule that extends the direct aliasing relationships found in the program.
- If there exists an assignment from a variable
X
to another variableZ
, and there is already a known alias relationship fromZ
toY
, then we can conclude thatX
also aliasesY
.
In simpler terms, think of it as a chain of aliasing. If X
is assigned the value of Z
, and Z
is known to alias Y
, then it's like X
indirectly references or aliases Y
through Z
.
This rule is crucial for computing the complete set of aliasing relationships in programs where one variable might be assigned the value of another variable, which in turn might be aliased to yet another variable, and so on. By iteratively applying this rule, the program can capture both direct and indirect aliasing chains between variables.
What is also worth noting is that the approach does not take into consideration the ordering of the statements. Within the scope of this simple snippet, this does not change the outcome.
Complete Souffle Program
This is what the complete souffle program looks like:
// Basic Types, Simplified Go syntax
.type var <: symbol
.decl assign( a:var, b:var )
// Facts
assign("a","1").
assign("b","a").
// Relations and rules needed to query the facts
.decl alias( a:var, b:var ) output
alias(X,X) :- assign(X,_).
alias(X,X) :- assign(_,X).
alias(X,Y) :- assign(X,Y).
alias(X,Y) :- assign(X,Z), alias(Z,Y).
The Result!
Let's run it and finally learn what the value of b
is!
I saved the above source code as alias.dl
(the extension .dl
is shared with datalog) and ran souffle on it with souffle alias.dl
The output of this program is (by convention) saved in a file called the same output relation, alias.csv
in our case.
a a
a 1
1 1
b a
b 1
b b
The output contains all valid aliases, which includes some that a human would not consider (like that a
is an alias of a
) but crucially it also tells us that among many obvious aliases of b
there is also b -> 1
Conclusion
That was a LOT of work to get to an answer your brain could produce in an instant, I hope thanks to this blog post you will appreciate your own reasoning capabilities more! Or maybe you will realize how much complexity is hidden in linters and why they cover only the very basic error cases, while most of the remaining ones are left to human eyes scanning the code repeatedly trying to figure out what else can go wrong.