600.328/428: Compiler Writing
Fall 2003 Syllabus
Course Description
600.328/428, Compilation and Program Analysis, covers a sequence of topics drawn from traditional compiler courses and more recent work on program analysis. The subject of “compilers” traditionally covers the process by which some source language, such as C, is converted to machine code for execution. This includes everything from how to read the source file (lexical analysis and parsing) to figuring out what the source code means (semantics) to making it fast (optimization) to figuring out how to run it (register allocation and code generation).
For this semester, I have chosen to adapt this course somewhat in the direction of program analysis. Instead of dealing with optimization, register allocation, and code generation, we will view the "intermediate form" of the program as a data set for exploratory analysis, and examine ways in which this analysis might be usefully performed to find bugs or validate properties of programs. This isn't really as "untraditional" as it sounds, since optimizers and code generators view the program in very much the same way that we will.
While the course deals a fair bit with models and touches on theorem proving, this is a systems course. The types of models we are concerned with are relatively simple. Here are some examples:
Example 1. Control Flow
We can describe when an operating system kernel has interrupts disabled (on entry to the kernel, on exit from the kernel, after any call to disable()) and enabled (when all calls to disable() are balanced by calls to enable()) with a fairly simple state machine. Given this state machine, we might like to ask questions like:
For every function f, is it the case that interrupts are always enabled when f is called, or always disabled when f is called, but not both (i.e. it's always one way or the other)?
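As a concrete (and entirely hypothetical) sketch of the state machine above, here is a small Java class that tracks interrupt state dynamically by counting unbalanced disable() calls. A real analysis would answer the question statically, without running the kernel; this just makes the states and transitions explicit:

```java
// Sketch of the interrupt-state machine: interrupts are enabled exactly
// when every disable() has been balanced by an enable(). Class and
// method names are invented for illustration.
public class InterruptModel {
    private int disableDepth = 0;           // 0 means interrupts enabled

    public void disable() { disableDepth++; }

    public void enable() {
        if (disableDepth == 0)
            throw new IllegalStateException("enable() without matching disable()");
        disableDepth--;
    }

    public boolean interruptsEnabled() { return disableDepth == 0; }
}
```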
Example 2. Type State
Sometimes you know that you are only supposed to call a procedure after you have called some previous procedure. For example, you don't want to update an object unless you know it is in memory. Using something called “type state”, we can describe these sorts of state transitions for variables, and answer questions like:
Did I pin this object in memory before I modified it?
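A dynamic sketch of the same idea follows (names are invented). A typestate analysis would prove at compile time that modify() is never reached in the UNPINNED state; the run-time check here just illustrates the state transitions being modeled:

```java
// Hypothetical "pinned" typestate for an object: modify() is legal
// only after pin() and before the matching unpin().
public class PinnedObject {
    private enum State { UNPINNED, PINNED }
    private State state = State.UNPINNED;

    public void pin()   { state = State.PINNED; }
    public void unpin() { state = State.UNPINNED; }

    public void modify() {
        if (state != State.PINNED)
            throw new IllegalStateException("modify() before pin()");
        // ... perform the update on the in-memory object ...
    }
}
```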
Example 3. Escape-Free Procedures
Sometimes you want to know that a procedure is “escape free”, by which we mean that nothing you pass to it will get stored anywhere, and all temporary memory will be freed by the time the procedure returns (think of this as an extended kind of “read only” procedure). This is actually a form of extended type checking.
Does this procedure modify any externally visible memory?
In this course, we are going to explore how to get answers to questions like this, and how to design programs so that the answers are useful. We will discuss program verification and why verification is considered to be difficult. Time permitting, we will try an example.
Assignments and Grading Policy
Grading in this course will be entirely based on the project assignments. Assignments will consist of several programs written in Java that implement various phases of a compiler and analysis tool for MiniJava (a simplified version of Java). On most assignments, students will work in teams of two. Students are also free to work by themselves.
For many of these assignments there is “helper” code available from the textbook materials.
In addition to the text, selected readings will be assigned on various subject areas.
Resources
Cambridge University Press maintains a web page with “course resources” for this course (http://us.cambridge.org/features/052182060X/index.html). It includes an updated MiniJava grammar and several sample input files that are written in the MiniJava language. Where there is a discrepancy between the book and the online grammar (BNF), use the online version.
For your assignments, it is a good idea to make sure that your implementation will accept all of the sample input files on the web site.
A note about the BNF grammar: the grammar that you implement in Antlr will end up slightly different, because Antlr builds LL parsers: left recursion must be rewritten as right recursion, and common prefixes must be left-factored. This will make more sense to you after we talk about parsing in week 2.
Weekly Lecture Plan
- September 8
Lexical Analysis
Tokenization. Regular languages, regular expressions, and their relationship to finite automata. Construction of deterministic finite automata from a regular expression. Discussion of why nobody uses this technique explicitly in modern compilers, and what techniques are now used instead.
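For concreteness, here is a hand-built DFA for one token pattern, identifiers matching [a-z][a-z0-9]* (a made-up example; lexer generators derive such transition tables from the regular expressions automatically):

```java
// A DFA with three states: 0 = start, 1 = inside an identifier
// (accepting), -1 = reject. Each input character drives one transition.
public class IdentDfa {
    public static int step(int state, char c) {
        boolean lower = c >= 'a' && c <= 'z';
        boolean digit = c >= '0' && c <= '9';
        switch (state) {
            case 0:  return lower ? 1 : -1;            // must start with a letter
            case 1:  return (lower || digit) ? 1 : -1; // letters or digits after that
            default: return -1;                        // reject is a sink state
        }
    }

    public static boolean accepts(String s) {
        int state = 0;
        for (int i = 0; i < s.length() && state != -1; i++)
            state = step(state, s.charAt(i));
        return state == 1;                             // state 1 is accepting
    }
}
```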
Assignment: Write a lexical analyzer for a language that we will provide in class.
Readings:
- Appel, Chapters 1 and 2
- September 15
Parsing and Abstract Syntax Trees
Context free and context sensitive languages. LR, LALR, LL, and recursive descent parsers. Why LR(k) parsers are formally preferred. Why LL(∞) parsers are better in practice for real implementations. Introduction to the Antlr (http://www.antlr.org) parser generator.
Most modern compilers do not mix semantics with parsing. Instead, they construct an “abstract syntax tree” (AST) that is processed by later phases of the compilation process. During this week we will introduce abstract syntax trees.
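As a small illustration of both ideas, here is a hypothetical recursive-descent parser for the toy grammar expr → term ('+' term)*, term → DIGIT. Note that it builds an AST rather than computing anything during the parse:

```java
// Recursive-descent parser producing a tiny AST. Grammar and class
// names are invented for illustration.
public class TinyParser {
    interface Expr {}
    static class Num implements Expr {
        final int v;
        Num(int v) { this.v = v; }
        public String toString() { return Integer.toString(v); }
    }
    static class Add implements Expr {
        final Expr l, r;
        Add(Expr l, Expr r) { this.l = l; this.r = r; }
        public String toString() { return "(" + l + " + " + r + ")"; }
    }

    private final String src;
    private int pos = 0;
    public TinyParser(String src) { this.src = src; }

    public Expr parseExpr() {
        Expr e = parseTerm();
        while (pos < src.length() && src.charAt(pos) == '+') {
            pos++;                          // consume '+'
            e = new Add(e, parseTerm());    // build left-associative tree
        }
        return e;
    }

    private Expr parseTerm() {
        char c = src.charAt(pos++);
        if (c < '0' || c > '9')
            throw new RuntimeException("expected digit at position " + (pos - 1));
        return new Num(c - '0');
    }
}
```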
Readings:
- Appel, Chapters 3 and 4
- The Antlr Getting Started page.
Assignment:
Create a parser for MiniJava using Antlr. Use the programs on the resources web page as test programs. Your parser should produce an AST, and it should be able to output the resulting AST in two forms:
- As an XML representation of the AST (suitably indented for the sake of grader sanity!)
- As a syntactically valid program that can be fed back into a compiler. Your output will not exactly match the input, because you will probably wish to generate fully parenthesized expressions (the original parentheses will not be represented in your AST). You also won't put newlines in the same places. Aside from this sort of “bracketing” issue, however, the output program and the input program should be essentially identical when compared token by token.
A word of advice: it is worth making the effort to generate this output in indented form (we recommend 2 spaces for each indentation level). You will find that referring to this output is very useful as you debug your programs, and you will be glad of the ability to read it.
Before somebody complains: yes, XML stinks, but it is a fine way of expressing this sort of tree, and requiring it simplifies the grading process considerably.
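The indented XML output might be produced along these lines (the node representation here is hypothetical; your AST classes will look different):

```java
// Recursive XML dump of a generic tree, two spaces per indentation
// level, as the assignment suggests. Node is a stand-in for real AST
// classes.
public class XmlDump {
    public static class Node {
        final String kind;
        final Node[] kids;
        public Node(String kind, Node... kids) { this.kind = kind; this.kids = kids; }
    }

    public static String emit(Node n, int depth) {
        String pad = "  ".repeat(depth);                 // 2 spaces per level
        if (n.kids.length == 0)
            return pad + "<" + n.kind + "/>\n";          // leaf: self-closing tag
        StringBuilder sb = new StringBuilder(pad + "<" + n.kind + ">\n");
        for (Node k : n.kids)
            sb.append(emit(k, depth + 1));               // children one level deeper
        return sb.append(pad + "</" + n.kind + ">\n").toString();
    }
}
```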
- September 22
Symbol Tables
Once an AST is constructed, the next phase of the compiler constructs a “symbol table”. The symbol table is used to record the types and roles of variable, type, and procedure names and to resolve references to these names. For each identifier, its type, arguments, fields, and so forth are recorded in the corresponding symbol table entry.
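One common shape for such a table (a sketch with invented names, not a prescription) is a stack of per-scope maps, with lookup searching from the innermost scope outward. This version also assigns each symbol a unique "symNNN" id of the kind the assignment's XML dump uses:

```java
import java.util.*;

// Scoped symbol table: enterScope()/exitScope() bracket a lexical
// scope; lookup() finds the innermost definition of a name.
public class SymbolTable {
    public static class Symbol {
        public final String name, id;
        Symbol(String name, String id) { this.name = name; this.id = id; }
    }

    private final Deque<Map<String, Symbol>> scopes = new ArrayDeque<>();
    private int counter = 0;

    public void enterScope() { scopes.push(new HashMap<>()); }
    public void exitScope()  { scopes.pop(); }

    public Symbol define(String name) {
        Symbol s = new Symbol(name, "sym" + counter++);  // unique id per symbol
        scopes.peek().put(name, s);
        return s;
    }

    public Symbol lookup(String name) {
        for (Map<String, Symbol> scope : scopes) {       // innermost scope first
            Symbol s = scope.get(name);
            if (s != null) return s;
        }
        return null;                                     // undeclared identifier
    }
}
```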
Assignment: Implement a symbol table and build a compiler pass that traverses your AST to populate it. For testing purposes, you should re-generate the XMLized AST with the symbol table and symbol cross-references incorporated. The symbol table should be dumped first as a set of symbols, each of which should have a unique “def="symNNN"” attribute. Each AST node that describes a defining instance should carry the corresponding “def="symNNN"” attribute, and each reference to that symbol should carry a corresponding “ref="symNNN"” attribute.
Readings:
- Appel, Chapter 5
- September 29
Intermediate Forms
Translation to expression trees as an interim step on the way to basic blocks and traces.
Readings:
- Appel, Chapters 6 and 7
Assignment: Create expression trees from the symbol-resolved AST and output these in canonical XML form.
- October 6
Basic Blocks and Traces
The basic block as a unit of execution and analysis. Programs as a directed graph (control flow graph) of basic blocks. Traces as an extension of the basic block concept. Introduction to flow analysis.
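A minimal sketch of basic-block formation, over a made-up instruction encoding (strings such as "L1:" for labels and "jump L1" for jumps): a new block starts at each label, and a block ends after each jump.

```java
import java.util.*;

// Split a straight-line instruction list into basic blocks.
public class BasicBlocks {
    public static List<List<String>> split(List<String> code) {
        List<List<String>> blocks = new ArrayList<>();
        List<String> cur = new ArrayList<>();
        for (String ins : code) {
            boolean isLabel = ins.endsWith(":");
            if (isLabel && !cur.isEmpty()) {   // a label starts a new block
                blocks.add(cur);
                cur = new ArrayList<>();
            }
            cur.add(ins);
            if (ins.startsWith("jump")) {      // a jump ends the current block
                blocks.add(cur);
                cur = new ArrayList<>();
            }
        }
        if (!cur.isEmpty()) blocks.add(cur);   // trailing block, if any
        return blocks;
    }
}
```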
Readings:
- Appel, Chapter 8
Assignment: Create traces from the symbol-resolved AST and output these in canonical XML form.
- October 13
Fall break ends October 13; classes resume on October 14.
Liveness Analysis
Liveness analysis is a special case of a general class of problems called “data flow” analysis; it determines, at each point in the program, which variables may still be used later (a related data flow problem, reaching definitions, determines whether a given definition can reach a given point). The classical motivation for liveness analysis is register allocation, which we will briefly discuss.
At this point, the course will begin to diverge from the path of the textbook.
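The standard liveness equations, in[n] = use[n] ∪ (out[n] − def[n]) and out[n] = ∪ of in[s] over successors s of n, can be solved by straightforward fixed-point iteration. The sketch below does exactly that over a tiny hand-encoded flow graph (node numbering and the per-node use/def/successor arrays are invented for illustration):

```java
import java.util.*;

// Iterative liveness: repeat the transfer functions until nothing
// changes. Returns the live-in set of each node.
public class Liveness {
    public static List<Set<String>> solve(String[][] use, String[][] def, int[][] succ) {
        int n = use.length;
        List<Set<String>> in = new ArrayList<>(), out = new ArrayList<>();
        for (int i = 0; i < n; i++) { in.add(new HashSet<>()); out.add(new HashSet<>()); }
        boolean changed = true;
        while (changed) {
            changed = false;
            for (int i = n - 1; i >= 0; i--) {          // backward problem
                Set<String> newOut = new HashSet<>();
                for (int s : succ[i]) newOut.addAll(in.get(s));   // out = union of succ in
                Set<String> newIn = new HashSet<>(newOut);
                newIn.removeAll(Arrays.asList(def[i])); // out - def
                newIn.addAll(Arrays.asList(use[i]));    // ... union use
                if (!newIn.equals(in.get(i)) || !newOut.equals(out.get(i)))
                    changed = true;
                in.set(i, newIn);
                out.set(i, newOut);
            }
        }
        return in;
    }
}
```

For example, for the three-statement program a := 0; b := a + 1; return b, the variable a is live-in at the second statement and b is live-in at the third.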
Readings:
- Appel, Chapters 9 and 10
- October 20
Control Flow Analysis
Understanding the control flow of a program. Unification across flows. Methods for heuristically resolving data-dependent control flow. MOPS and the utility of pushdown automata (PDAs) in control flow analysis.
Readings:
- Hao Chen, David Wagner. “MOPS: An Infrastructure for Examining Security Properties of Software”, Proc. 9th ACM Conference on Computer and Communications Security. Washington, DC. November 17-21, 2002. 235-244.
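The flavor of a MOPS-style check can be conveyed by running a property automaton over a single path of program events; MOPS itself checks all paths of the program's pushdown model at once. The property below (calling open() after chroot() without an intervening chdir("/")) is of the kind the paper discusses, and the event encoding is invented:

```java
import java.util.*;

// Property automaton: state 0 = safe, state 1 = after chroot() with no
// chdir("/") yet. Reaching open() in state 1 is a violation.
public class PropertyCheck {
    public static boolean violates(List<String> path) {
        int state = 0;
        for (String ev : path) {
            switch (ev) {
                case "chroot": state = 1; break;
                case "chdir/": if (state == 1) state = 0; break;
                case "open":   if (state == 1) return true;  // error state reached
            }
        }
        return false;
    }
}
```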
- October 27
Model Checking of Control Flow
Continuing the topic of the previous week.
Assignment: Implement a MOPS-style checker using the infrastructure that you have constructed to date. Test cases will be provided.
- November 3
Value Flow Analysis
As a precursor to further analysis, it is necessary to know which values reach which locations in the program. In this segment we will discuss value flow and alias analysis, and algorithms for deciding these properties.
- November 10
Typestate
Discussion of typestate analysis and specification of type states. Examples where typestate analysis provides value.
Readings:
- Robert E. Strom, Shaula Yemini. “Typestate: A Programming Language Concept for Enhancing Software Reliability”. IEEE Transactions on Software Engineering, January 1986.
- R. E. Strom, D. M. Yellin. “Extending Typestate Checking Using Conditional Liveness Analysis”. IEEE Transactions on Software Engineering, 19(5):478-485, May 1993.
- November 17
Applications of Typestate
Assignment: Enhance your parser to incorporate knowledge of type state declarations. Construct an algorithm for typestate propagation and use this to analyze a typestate-based model against selected example programs.
- November 24, December 1
Theorem Proving
The notion of program correspondence and verifying the correctness of program properties.
- December 8
Review