600.328/428: Compiler Writing
Fall 2003 Syllabus
Course Description
600.328/428, Compilation and Program Analysis, covers a sequence of topics drawn from traditional compiler courses and more recent work on program analysis. The subject of “compilers” traditionally covers the process by which some source language, such as C, is converted to machine code for execution. This includes everything from how to read the source file (lexical analysis and parsing) to figuring out what the source code means (semantics) to making it fast (optimization) to figuring out how to run it (register allocation and code generation).
For this semester, I have chosen to adapt this course somewhat in the direction of program analysis. Instead of dealing with optimization, register allocation, and code generation, we will view the "intermediate form" of the program as a data set for exploratory analysis, and examine ways in which this analysis might be usefully performed to find bugs or validate properties of programs. This isn't really as "untraditional" as it sounds, since optimizers and code generators view the program in very much the same way that we will.
While the course deals a fair bit with models and touches on theorem proving, this is a systems course. The types of models we are concerned with are relatively simple. Here are some examples:
Example 1. Control Flow
We can describe when an operating system kernel has interrupts disabled (on entry to the kernel, on exit from the kernel, after any call to disable()) and enabled (when all calls to disable() are balanced by calls to enable()) with a fairly simple state machine. Given this state machine, we might like to ask questions like:
For every function f, is it the case that interrupts are always enabled when f is called, or always disabled when f is called, but not both (i.e. it's always one way or the other)?
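As a concrete (and entirely hypothetical) sketch of the state machine above, here is a small Java class that tracks interrupt state dynamically by counting unbalanced disable() calls. A real analysis would answer the question statically, without running the kernel; this just makes the states and transitions explicit:

```java
// Sketch of the interrupt-state machine: interrupts are enabled exactly
// when every disable() has been balanced by an enable(). Class and
// method names are invented for illustration.
public class InterruptModel {
    private int disableDepth = 0;           // 0 means interrupts enabled

    public void disable() { disableDepth++; }

    public void enable() {
        if (disableDepth == 0)
            throw new IllegalStateException("enable() without matching disable()");
        disableDepth--;
    }

    public boolean interruptsEnabled() { return disableDepth == 0; }
}
```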
Example 2. Type State
Sometimes you know that you are only supposed to call a procedure after you have called some previous procedure. For example, you don't want to update an object unless you know it is in memory. Using something called “type state”, we can describe these sorts of state transitions for variables, and answer questions like:
Did I pin this object in memory before I modified it?
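A dynamic sketch of the same idea follows (names are invented). A typestate analysis would prove at compile time that modify() is never reached in the UNPINNED state; the run-time check here just illustrates the state transitions being modeled:

```java
// Hypothetical "pinned" typestate for an object: modify() is legal
// only after pin() and before the matching unpin().
public class PinnedObject {
    private enum State { UNPINNED, PINNED }
    private State state = State.UNPINNED;

    public void pin()   { state = State.PINNED; }
    public void unpin() { state = State.UNPINNED; }

    public void modify() {
        if (state != State.PINNED)
            throw new IllegalStateException("modify() before pin()");
        // ... perform the update on the in-memory object ...
    }
}
```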
Example 3. Escape-Free Procedures
Sometimes you want to know that a procedure is “escape free”, by which we mean that nothing you pass to it will get stored anywhere, and all temporary memory will be freed by the time the procedure returns (think of this as an extended kind of “read only” procedure). This is actually a form of extended type checking.
Does this procedure modify any externally visible memory?
In this course, we are going to explore how to get answers to questions like this, and how to design programs so that the answers are useful. We will discuss program verification and why verification is considered to be difficult. Time permitting, we will try an example.
Assignments and Grading Policy
Grading in this course will be entirely based on the project assignments. Assignments will consist of several programs written in Java that implement various phases of a compiler and analysis tool for MiniJava (a simplified version of Java). On most assignments, students will work in teams of two. Students are also free to work by themselves.
For many of these assignments there is “helper” code available from the textbook materials.
In addition to the text, selected readings will be assigned on various subject areas.
Resources
Cambridge University Press maintains a web page with “course resources” for this course (http://us.cambridge.org/features/052182060X/index.html). It includes an updated MiniJava grammar and several sample input files that are written in the MiniJava language. Where there is a discrepancy between the book and the online grammar (BNF), use the online version.
For your assignments, it is a good idea to make sure that your implementation will accept all of the sample input files on the web site.
A note about the BNF grammar: the grammar that you implement in Antlr will end up slightly different, because Antlr builds LL parsers: left recursion must be rewritten as right recursion, and common prefixes must be left-factored. This will make more sense to you after we talk about parsing in week 2.
Weekly Lecture Plan
- September 8
Lexical Analysis
Tokenization. Regular languages, regular expressions, and their relationship to finite automata. Construction of deterministic finite automata from a regular expression. Discussion of why nobody uses this technique explicitly in modern compilers, and what techniques are now used instead.
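For concreteness, here is a hand-built DFA for one token pattern, identifiers matching [a-z][a-z0-9]* (a made-up example; lexer generators derive such transition tables from the regular expressions automatically):

```java
// A DFA with three states: 0 = start, 1 = inside an identifier
// (accepting), -1 = reject. Each input character drives one transition.
public class IdentDfa {
    public static int step(int state, char c) {
        boolean lower = c >= 'a' && c <= 'z';
        boolean digit = c >= '0' && c <= '9';
        switch (state) {
            case 0:  return lower ? 1 : -1;            // must start with a letter
            case 1:  return (lower || digit) ? 1 : -1; // letters or digits after that
            default: return -1;                        // reject is a sink state
        }
    }

    public static boolean accepts(String s) {
        int state = 0;
        for (int i = 0; i < s.length() && state != -1; i++)
            state = step(state, s.charAt(i));
        return state == 1;                             // state 1 is accepting
    }
}
```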
Assignment: Write a lexical analyzer for a language that we will provide in class.
Readings:
- Appel, Chapters 1 and 2
- September 15
Parsing and Abstract Syntax Trees
Context free and context sensitive languages. LR, LALR, LL, and recursive descent parsers. Why LR(k) parsers are formally preferred. Why LL(∞) parsers are better in practice for real implementations. Introduction to the Antlr (http://www.antlr.org) parser generator.
Most modern compilers do not mix semantics with parsing. Instead, they construct an “abstract syntax tree” (AST) that is processed by later phases of the compilation process. During this week we will introduce abstract syntax trees.
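As a small illustration of both ideas, here is a hypothetical recursive-descent parser for the toy grammar expr → term ('+' term)*, term → DIGIT. Note that it builds an AST rather than computing anything during the parse:

```java
// Recursive-descent parser producing a tiny AST. Grammar and class
// names are invented for illustration.
public class TinyParser {
    interface Expr {}
    static class Num implements Expr {
        final int v;
        Num(int v) { this.v = v; }
        public String toString() { return Integer.toString(v); }
    }
    static class Add implements Expr {
        final Expr l, r;
        Add(Expr l, Expr r) { this.l = l; this.r = r; }
        public String toString() { return "(" + l + " + " + r + ")"; }
    }

    private final String src;
    private int pos = 0;
    public TinyParser(String src) { this.src = src; }

    public Expr parseExpr() {
        Expr e = parseTerm();
        while (pos < src.length() && src.charAt(pos) == '+') {
            pos++;                          // consume '+'
            e = new Add(e, parseTerm());    // build left-associative tree
        }
        return e;
    }

    private Expr parseTerm() {
        char c = src.charAt(pos++);
        if (c < '0' || c > '9')
            throw new RuntimeException("expected digit at position " + (pos - 1));
        return new Num(c - '0');
    }
}
```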
Readings:
- Appel, Chapters 3 and 4
- The Antlr Getting Started page.
Assignment:
Create a parser for MiniJava using Antlr. Use the programs on the resources web page as test programs. Your parser should produce an AST, and it should be able to output the resulting AST in two forms:
- As an XML representation of the AST (suitably indented for the sake of grader sanity!)
- As a syntactically valid program that can be fed back into a compiler. Your output will not exactly match the input, because you will probably wish to generate fully parenthesized expressions (the original parentheses will not be represented in your AST). You also won't put newlines in the same places. Aside from this sort of “bracketing” issue, however, the output program and the input program should be essentially identical when compared token by token.
A word of advice: it is worth making the effort to generate this output in indented form (we recommend 2 spaces for each indentation level). You will find that referring to this output is very useful as you debug your programs, and you will be glad of the ability to read it.
Before somebody complains: yes, XML stinks, but it is a fine way of expressing this sort of tree, and requiring it simplifies the grading process considerably.
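The indented XML output might be produced along these lines (the node representation here is hypothetical; your AST classes will look different):

```java
// Recursive XML dump of a generic tree, two spaces per indentation
// level, as the assignment suggests. Node is a stand-in for real AST
// classes.
public class XmlDump {
    public static class Node {
        final String kind;
        final Node[] kids;
        public Node(String kind, Node... kids) { this.kind = kind; this.kids = kids; }
    }

    public static String emit(Node n, int depth) {
        String pad = "  ".repeat(depth);                 // 2 spaces per level
        if (n.kids.length == 0)
            return pad + "<" + n.kind + "/>\n";          // leaf: self-closing tag
        StringBuilder sb = new StringBuilder(pad + "<" + n.kind + ">\n");
        for (Node k : n.kids)
            sb.append(emit(k, depth + 1));               // children one level deeper
        return sb.append(pad + "</" + n.kind + ">\n").toString();
    }
}
```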
- September 22
Symbol Tables
Once an AST is constructed, the next phase of the compiler constructs a “symbol table”. The symbol table is used to record the types and roles of variable, type, and procedure names and to resolve references to these names. For each identifier, its type, arguments, fields, and so forth are recorded in the corresponding symbol table entry.
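One common shape for such a table (a sketch with invented names, not a prescription) is a stack of per-scope maps, with lookup searching from the innermost scope outward. This version also assigns each symbol a unique "symNNN" id of the kind the assignment's XML dump uses:

```java
import java.util.*;

// Scoped symbol table: enterScope()/exitScope() bracket a lexical
// scope; lookup() finds the innermost definition of a name.
public class SymbolTable {
    public static class Symbol {
        public final String name, id;
        Symbol(String name, String id) { this.name = name; this.id = id; }
    }

    private final Deque<Map<String, Symbol>> scopes = new ArrayDeque<>();
    private int counter = 0;

    public void enterScope() { scopes.push(new HashMap<>()); }
    public void exitScope()  { scopes.pop(); }

    public Symbol define(String name) {
        Symbol s = new Symbol(name, "sym" + counter++);  // unique id per symbol
        scopes.peek().put(name, s);
        return s;
    }

    public Symbol lookup(String name) {
        for (Map<String, Symbol> scope : scopes) {       // innermost scope first
            Symbol s = scope.get(name);
            if (s != null) return s;
        }
        return null;                                     // undeclared identifier
    }
}
```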
Assignment: Implement a symbol table and build a compiler pass that traverses your AST to populate it. For testing purposes, you should re-generate the XMLized AST with the symbol table and symbol cross-references incorporated. The symbol table should be dumped first as a set of symbols, each of which should have a unique “def="symNNN"” attribute. Each AST node that describes a defining instance should carry the corresponding “def="symNNN"” attribute, and each reference to that symbol should carry a corresponding “ref="symNNN"” attribute.
Readings:
- Appel, Chapter 5
- September 29
Intermediate Forms
Translation to expression trees as an interim step on the way to basic blocks and traces.
Readings:
- Appel, Chapters 6 and 7
Assignment: Create expression trees from the symbol-resolved AST and output these in canonical XML form.
- October 6
Basic Blocks and Traces
The basic block as a unit of execution and analysis. Programs as a directed graph (control flow graph) of basic blocks. Traces as an extension of the basic block concept. Introduction to flow analysis.
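A minimal sketch of basic-block formation, over a made-up instruction encoding (strings such as "L1:" for labels and "jump L1" for jumps): a new block starts at each label, and a block ends after each jump.

```java
import java.util.*;

// Split a straight-line instruction list into basic blocks.
public class BasicBlocks {
    public static List<List<String>> split(List<String> code) {
        List<List<String>> blocks = new ArrayList<>();
        List<String> cur = new ArrayList<>();
        for (String ins : code) {
            boolean isLabel = ins.endsWith(":");
            if (isLabel && !cur.isEmpty()) {   // a label starts a new block
                blocks.add(cur);
                cur = new ArrayList<>();
            }
            cur.add(ins);
            if (ins.startsWith("jump")) {      // a jump ends the current block
                blocks.add(cur);
                cur = new ArrayList<>();
            }
        }
        if (!cur.isEmpty()) blocks.add(cur);   // trailing block, if any
        return blocks;
    }
}
```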
Readings:
- Appel, Chapter 8
Assignment: Create traces from the symbol-resolved AST and output these in canonical XML form.
- October 13
Fall break ends October 13; classes resume on October 14.
Liveness Analysis
Liveness analysis is a special case of a general class of problems called “data flow” analysis; it determines, at each point in the program, which variables may still be used later (a related data flow problem, reaching definitions, determines whether a given definition can reach a given point). The classical motivation for liveness analysis is register allocation, which we will briefly discuss.
At this point, the course will begin to diverge from the path of the textbook.
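The standard liveness equations, in[n] = use[n] ∪ (out[n] − def[n]) and out[n] = ∪ of in[s] over successors s of n, can be solved by straightforward fixed-point iteration. The sketch below does exactly that over a tiny hand-encoded flow graph (node numbering and the per-node use/def/successor arrays are invented for illustration):

```java
import java.util.*;

// Iterative liveness: repeat the transfer functions until nothing
// changes. Returns the live-in set of each node.
public class Liveness {
    public static List<Set<String>> solve(String[][] use, String[][] def, int[][] succ) {
        int n = use.length;
        List<Set<String>> in = new ArrayList<>(), out = new ArrayList<>();
        for (int i = 0; i < n; i++) { in.add(new HashSet<>()); out.add(new HashSet<>()); }
        boolean changed = true;
        while (changed) {
            changed = false;
            for (int i = n - 1; i >= 0; i--) {          // backward problem
                Set<String> newOut = new HashSet<>();
                for (int s : succ[i]) newOut.addAll(in.get(s));   // out = union of succ in
                Set<String> newIn = new HashSet<>(newOut);
                newIn.removeAll(Arrays.asList(def[i])); // out - def
                newIn.addAll(Arrays.asList(use[i]));    // ... union use
                if (!newIn.equals(in.get(i)) || !newOut.equals(out.get(i)))
                    changed = true;
                in.set(i, newIn);
                out.set(i, newOut);
            }
        }
        return in;
    }
}
```

For example, for the three-statement program a := 0; b := a + 1; return b, the variable a is live-in at the second statement and b is live-in at the third.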
Readings:
- Appel, Chapters 9 and 10
- October 20
Control Flow Analysis
Understanding the control flow of a program. Unification across flows. Methods for heuristically resolving data-dependent control flow. MOPS and the utility of pushdown automata (PDAs) in control flow analysis.
Readings:
- Hao Chen, David Wagner. “MOPS: An Infrastructure for Examining Security Properties of Software”, Proc. 9th ACM Conference on Computer and Communications Security. Washington, DC. November 17-21, 2002. 235-244.
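The flavor of a MOPS-style check can be conveyed by running a property automaton over a single path of program events; MOPS itself checks all paths of the program's pushdown model at once. The property below (calling open() after chroot() without an intervening chdir("/")) is of the kind the paper discusses, and the event encoding is invented:

```java
import java.util.*;

// Property automaton: state 0 = safe, state 1 = after chroot() with no
// chdir("/") yet. Reaching open() in state 1 is a violation.
public class PropertyCheck {
    public static boolean violates(List<String> path) {
        int state = 0;
        for (String ev : path) {
            switch (ev) {
                case "chroot": state = 1; break;
                case "chdir/": if (state == 1) state = 0; break;
                case "open":   if (state == 1) return true;  // error state reached
            }
        }
        return false;
    }
}
```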
- October 27
Model Checking of Control Flow
Continuing the topic of the previous week.
Assignment: Implement a MOPS-style checker using the infrastructure that you have constructed to date. Test cases will be provided.
- November 3
Value Flow Analysis
As a precursor to further analysis, it is necessary to know which values reach which locations in the program. In this segment we will discuss value flow and alias analysis, and algorithms for deciding these properties.
- November 10
Typestate
Discussion of typestate analysis and specification of type states. Examples where typestate analysis provides value.
Readings:
- Robert E. Strom, Shaula Yemini. “Typestate: A Programming Language Concept for Enhancing Software Reliability”. IEEE Transactions on Software Engineering, January 1986.
- R. E. Strom, D. M. Yellin. “Extending Typestate Checking Using Conditional Liveness Analysis”. IEEE Transactions on Software Engineering, 19(5):478-485, May 1993.
- November 17
Applications of Typestate
Assignment: Enhance your parser to incorporate knowledge of type state declarations. Construct an algorithm for typestate propagation and use this to analyze a typestate-based model against selected example programs.
- November 24, December 1
Theorem Proving
The notion of program correspondence and verifying the correctness of program properties.
- December 8
Review