Coding Project: LLVM Analysis

For this assignment, you will create a standalone program named analysis (in the language of your choice). The code should be substantially your own. Although you may use auxiliary libraries, the fundamental tasks described must be performed by code you wrote for this assignment. You may work with 1 partner on this assignment. For credit on the assignment, each submission should be no more than incidentally similar in design to any other and each member of a submitting team must be able to explain their solution in detail. Each member of the team should be indicated in a file called COLLAB.txt included in the submitted tarball, which should also disclose any references or third-party code used as part of completing the assignment.

The analysis program will perform some basic analysis over an input file in a restricted subset of LLVM that we covered in class. In particular, the only pseudoinstructions that you need to cover are:

  • alloca
  • load
  • store
  • ret
  • icmp
  • br
  • add
  • sub
  • div
  • mul
  • call
  • define

Additionally, your analysis should accomodate programs with both global and local variable allocations. You may assume that all variables are of scalar integer type. Although they are not instructions, you should also handle phi nodes in your program!

Assignment Details

This assignment consists of two parts, CFG inference and static dataflow:

CFG Inference

The basic idea of this part is to construct the control-flow graph of each function declared in a given input file and output those control-flow graphs in the graphviz dot format. For each function in the input, the program will output a graphviz control-flow graph. The nodes have record shape and an empty label. Edges should be directed and have a label according to their position in the branch. For example, consider the input

Example 1 Input:
define i32 @main(i32 %argc) {
        %noArgs = icmp eq i32 %argc, 1
        br i1 %noArgs, label %lbl_t, label %lbl_f
lbl_t:
        %varT = add i32 1, 0
        br label %end
lbl_f:
        %varF = add i32 2, 0
        br label %end
end:
        %var = phi i32 [%varT, %lbl_t], [%varF, %lbl_f]
        ret i32 %var
}


The output would be of the form

Example 1 Output:
digraph {
        Node0 [shape=record,label=""]
        Node0 -> Node1 [label=1];
        Node0 -> Node2 [label=2];
        Node1 [shape=record,label=""]
        Node1 -> Node3 [label=1];
        Node2 [shape=record,label=""]
        Node2 -> Node3 [label=1];
        Node3 [shape=record,label=""]
}
Note that for this behavior to be correct, the output must be valid in the dot format.

Static Dataflow

This part is optional for undergraduate students. It is required for graduate students

For this part, your goal will be to detect if a flow-sensitive, static dataflow analysis over the control flow graph detects a flow from the result of invoking a function SOURCE to the (single) argument of a function SINK in any function in the input. If so, the program should exit with code 0 after printing the single line

LEAK
If there is no flow, the program should exit with code 0 after printing the single line
NO LEAK
For example, consider the following input:
define i32 @main(i32 %0, i8** %1) {
        %res = call i32 @SOURCE ()
        call i32 @SINK (i32 %res)
        ret i32 0
}
declare dso_local i32 @SOURCE()
declare dso_local i32 @SINK(i32)

This program SHOULD print LEAK.

Note that the analysis is over the control-flow graph. Thus, should tagged data from SOURCE enter any function call, it should be considered a leak. For example, consider the program:
define i32 @foo(i32 %0, i8** %1) {
        ret i32 7
}
define i32 @main(i32 %0, i8** %1) {
        %res = call i32 @SOURCE ()
        call i32 @foo (i32 %res)
        ret i32 0
}


declare dso_local i32 @SOURCE()
declare dso_local i32 @SINK(i32)

This program SHOULD print LEAK, even though there is no (interprocedural) flow.

Building and running your code

This section describes how your project will be graded.
Build scripts
Grading will proceed by downloading your submission and running the following command to build your code:
tar -xf <submission.tgz>
cd c1
make

Please ensure that this sequence of commands is sufficient for building your code on the cycle servers (if your build requires special accomodations, please contact Drew). You are certainly welcome to use a dynamic language, in which case you should include a makefile that is simply a no-op. If the above sequence results in a broken build your submission will be awarded 0 points.

For CFG Inference, your code will be graded by correctness in adherence to the following command-line format:
Your program should take the following two arguments:
  • -i <infile> where <infile> is replaced by the name of an input file in the LLVM human-readable bitcode format (i.e. a .ll file).
  • -g <outdir> where <outdir> is replaced by the name of a target output directory in the graphviz format. For each function <F> declared in <infile>, the program will add a file named <F>.dot in the directory <outdir>.
For example, imagine the content of "Example 1 input" above is placed in a file callTest.ll. Then the command-line
./analysis -i callTest.ll -g mydir
Would populate the file mydir/main.dot with the lines of "Example 1 output"
For Static Dataflow, your code will be graded by correctness in adherence to the following command-line format:
  • -i <infile> where <infile> is replaced by the name of an input file in the LLVM human-readable bitcode format (i.e. a .ll file).
  • -s

In other words, when the -s switch is provided, static analysis is performed on the input and LEAK or NO LEAK is printed to the terminal in accordance with the leaking behavior of ANY function in <infile>