Compilerbook PDF
Anyone is free to download and print the PDF edition of this book for personal use. Commercial distribution, printing, or reproduction without the author's consent is expressly prohibited. All other rights are reserved.

You can find the latest version of the PDF edition, and purchase inexpensive hardcover copies, at http://compilerbook.org
Contents
1 Introduction
1.1 What is a compiler?
1.2 Why should you study compilers?
1.3 What's the best way to learn about compilers?
1.4 What language should I use?
1.5 How is this book different from others?
1.6 What other books should I read?

2 A Quick Tour
2.1 The Compiler Toolchain
2.2 Stages Within a Compiler
2.3 Example Compilation
2.4 Exercises

3 Scanning
3.1 Kinds of Tokens
3.2 A Hand-Made Scanner
3.3 Regular Expressions
3.4 Finite Automata
3.4.1 Deterministic Finite Automata
3.4.2 Nondeterministic Finite Automata
3.5 Conversion Algorithms
3.5.1 Converting REs to NFAs
3.5.2 Converting NFAs to DFAs
3.5.3 Minimizing DFAs
3.6 Limits of Finite Automata
3.7 Using a Scanner Generator
3.8 Practical Considerations
3.9 Exercises
3.10 Further Reading

4 Parsing
4.1 Overview
4.2 Context Free Grammars
5 Parsing in Practice
5.1 The Bison Parser Generator
5.2 Expression Validator
5.3 Expression Interpreter
5.4 Expression Trees
5.5 Exercises
5.6 Further Reading

7 Semantic Analysis
7.1 Overview of Type Systems
7.2 Designing a Type System
7.3 The B-Minor Type System
7.4 The Symbol Table
7.5 Name Resolution
7.6 Implementing Type Checking
7.7 Error Messages
12 Optimization
12.1 Overview
12.2 Optimization in Perspective
12.3 High Level Optimizations
12.3.1 Constant Folding
12.3.2 Strength Reduction
12.3.3 Loop Unrolling
12.3.4 Code Hoisting
12.3.5 Function Inlining
12.3.6 Dead Code Detection and Elimination
12.4 Low-Level Optimizations
12.4.1 Peephole Optimizations
12.4.2 Instruction Selection
12.5 Register Allocation
12.5.1 Safety of Register Allocation
12.5.2 Priority of Register Allocation
12.5.3 Conflicts Between Variables
12.5.4 Global Register Allocation
12.6 Optimization Pitfalls
12.7 Optimization Interactions
12.8 Exercises
12.9 Further Reading
Index
Chapter 1 – Introduction
1.3 What's the best way to learn about compilers?
The best way to learn about compilers is to write your own compiler from
beginning to end. While that may sound daunting at first, you will find
that this complex task can be broken down into several stages of moder-
ate complexity. The typical undergraduate computer science student can
write a complete compiler for a simple language in a semester, broken
down into four or five independent stages.
1.4 What language should I use?

Without question, you should use the C programming language and the X86 assembly language, of course!
Ok, maybe the answer isn’t quite that simple. There is an ever-increasing
number of programming languages that all have different strengths and
weaknesses. Java is simple, consistent, and portable, albeit not high per-
formance. Python is easy to learn and has great library support, but weak
typing. Rust offers exceptional static type-safety, but is not (yet) widely used.
1.5 How is this book different from others?
Most books on compilers are very heavy on the abstract theory of scanners, parsers, type systems, and register allocation, and rather light on how the design of a language affects the compiler and the runtime. Most are designed for use in a graduate survey of optimization techniques.
This book takes a broader approach by giving a lighter dose of opti-
mization, and introducing more material on the process of engineering a
compiler, the tradeoffs in language design, and considerations for inter-
pretation and translation.
You will also notice that this book doesn’t contain a whole bunch of
fiddly paper-and-pencil assignments to test your knowledge of compiler
algorithms. (Ok, there are a few of those in Chapters 3 and 4.) If you want
to test your knowledge, then write some working code. To that end, the
exercises at the end of each chapter ask you to take the ideas in the chapter,
and either explore some existing compilers, or write parts of your own. If
you do all of them in order, you will end up with a working compiler,
summarized in the final appendix.
Chapter 2 – A Quick Tour

2.1 The Compiler Toolchain

[Figure 2.1: A typical compiler toolchain. Headers (stdio.h) feed the preprocessor and compiler; the resulting object code (prog.o) is combined by the static linker (ld) with libraries (libc.a) into an executable (prog), while dynamic libraries (libc.so) are attached by the dynamic linker (ld.so) when the executable becomes a running process.]
• The preprocessor prepares the source code for the compiler proper.
In the C and C++ languages, this means consuming all directives that
start with the # symbol. For example, an #include directive causes
the pre-processor to open the named file and insert its contents into
the source code. A #define directive causes the preprocessor to
substitute a value wherever a macro name is encountered. (Not all
languages rely on a preprocessor.)
• The compiler proper consumes the clean output of the preprocessor. It scans and parses the source code, performs typechecking and other semantic routines, optimizes the code, and then produces assembly language as the output. This part of the toolchain is the main focus of this book.
• The linker consumes one or more object files and library files and
combines them into a complete, executable program. It selects the
final memory locations where each piece of code and data will be
loaded, and then “links” them together by writing in the missing ad-
dress information. For example, an object file that calls the printf
function does not initially know the address of the function. An
empty (zero) address will be left where the address must be used.
Once the linker selects the memory location of printf, it must go
back and write in the address at every place where printf is called.
2.2 Stages Within a Compiler

In this book, our focus will be primarily on the compiler proper, which is
the most interesting component in the toolchain. The compiler itself can
be divided into several stages:
[Figure: The stages within a compiler: the scanner converts the character stream into tokens; the parser builds the abstract syntax tree; the semantic routines produce an intermediate representation; one or more optimizers transform the IR; and the code generator emits assembly code.]
• The scanner consumes the plain text of a program, and groups to-
gether individual characters to form complete tokens. This is much
like grouping characters into words in a natural language.
• The parser consumes tokens and groups them together into com-
plete statements and expressions, much like words are grouped into
sentences in a natural language. The parser is guided by a grammar
which states the formal rules of composition in a given language.
The output of the parser is an abstract syntax tree (AST) that cap-
tures the grammatical structures of the program. The AST also re-
members where in the source file each construct appeared, so it is
able to generate targeted error messages, if needed.
• The semantic routines traverse the AST and derive additional mean-
ing (semantics) about the program from the rules of the language
and the relationship between elements of the program. For exam-
ple, we might determine that x + 10 is a float expression by ob-
serving the type of x from an earlier declaration, then applying the
language rule that addition between int and float values yields
a float. After the semantic routines, the AST is often converted into
an intermediate representation (IR) which is a simplified form of
assembly code suitable for detailed analysis. There are many forms
of IR which we will discuss in Chapter 8.
• One or more optimizers can be applied to the intermediate represen-
tation, in order to make the program smaller, faster, or more efficient.
Typically, each optimizer reads the program in IR format, and then
emits the same IR format, so that each optimizer can be applied in-
dependently, in arbitrary order.
• Finally, a code generator consumes the optimized IR and transforms
it into a concrete assembly language program. Typically, a code gen-
erator must perform register allocation to effectively manage the
limited number of hardware registers, and instruction selection and
sequencing to order assembly instructions in the most efficient form.
2.3 Example Compilation

Suppose we wish to compile this simple statement:

height = (width+56) * factor(foo);

The first stage of the compiler (the scanner) will read in the text of
the source code character by character, identify the boundaries between
symbols, and emit a series of tokens. Each token is a small data structure
that describes the nature and contents of each symbol:
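For our example statement, the token stream might look like this (the token names here are illustrative):

IDENTIFIER  height
ASSIGN      =
LPAREN      (
IDENTIFIER  width
PLUS        +
INTEGER     56
RPAREN      )
MULTIPLY    *
IDENTIFIER  factor
LPAREN      (
IDENTIFIER  foo
RPAREN      )
SEMICOLON   ;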
At this stage, the purpose of each token is not yet clear. For example,
factor and foo are simply known to be identifiers, even though one is
7
8 CHAPTER 2. A QUICK TOUR
the name of a function, and the other is the name of a variable. Likewise,
we do not yet know the type of width, so the + could potentially rep-
resent integer addition, floating point addition, string concatenation, or
something else entirely.
The next step is to determine whether this sequence of tokens forms
a valid program. The parser does this by looking for patterns that match
the grammar of a language. Suppose that our compiler understands a
language with the following grammar:
Grammar G1
1. expr → expr + expr
2. expr → expr * expr
3. expr → expr = expr
4. expr → id ( expr )
5. expr → ( expr )
6. expr → id
7. expr → int
Each line of the grammar is called a rule, and explains how various
parts of the language are constructed. Rules 1-3 indicate that an expression
can be formed by joining two expressions with operators. Rule 4 describes
a function call. Rule 5 describes the use of parentheses. Finally, rules 6 and
7 indicate that identifiers and integers are atomic expressions.¹
The parser looks for sequences of tokens that can be replaced by the
left side of a rule in our grammar. Each time a rule is applied, the parser
creates a node in a tree, and connects the sub-expressions into the abstract
syntax tree (AST). The AST shows the structural relationships between
each symbol: addition is performed on width and 56, while a function
call is applied to factor and foo.
With this data structure in place, we are now prepared to analyze the
meaning of the program. The semantic routines traverse the AST and de-
rive additional meaning by relating parts of the program to each other, and
to the definition of the programming language. An important component
of this process is typechecking, in which the type of each expression is
determined, and checked for consistency with the rest of the program. To
keep things simple here, we will assume that all of our variables are plain
integers.
To generate linear intermediate code, we perform a post-order traver-
sal of the AST and generate an IR instruction for each node in the tree. A
typical IR looks like an abstract assembly language, with load/store in-
structions, arithmetic operations, and an infinite number of registers. For
example, this is a possible IR representation of our example program:
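A sketch of what such a linear IR might look like for height = (width+56) * factor(foo) follows; the opcodes and register names here are illustrative, not the book's exact notation:

LOAD  width    -> r1
LOAD  $56      -> r2
IADD  r1, r2   -> r3
ARG   foo
CALL  factor   -> r4
IMUL  r3, r4   -> r5
STOR  r5       -> height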
¹The careful reader will note that this example grammar has ambiguities. We will discuss how to handle such ambiguities in Chapter 4.
[Figure: The abstract syntax tree for the example: an ASSIGN node whose left child is the identifier height and whose right child is a MUL node; the MUL combines an ADD of width and 56 with a CALL applying factor to foo.]
Optimizers are typically written as separate modules (each reading and writing the same IR) so that they can be enabled and disabled independently. A retargetable compiler contains multiple code generators, so that
the same IR can be emitted for a variety of microprocessors.
2.4 Exercises
2. Determine how to change the optimization level for your local com-
piler. Find a non-trivial source program and compile it at multiple
levels of optimization. How does the compile time, program size,
and run time vary with optimization levels?
3. Search the internet for the formal grammars for three languages that
you are familiar with, such as C++, Ruby, and Rust. Compare them
side by side. Which language is inherently more complex? Do they
share any common structures?
Chapter 3 – Scanning
Scanning is the process of identifying tokens from the raw text source code
of a program. At first glance, scanning might seem trivial – after all, iden-
tifying words in a natural language is as simple as looking for spaces be-
tween letters. However, identifying tokens in source code requires the
language designer to clarify many fine details, so that it is clear what is
permitted and what is not.
3.1 Kinds of Tokens

Most languages will have tokens in these categories:
• Keywords are words in the language structure itself, like while or
class or true. Keywords must be chosen carefully to reflect the
natural structure of the language, without interfering with the likely
names of variables and other identifiers.
• Identifiers are the names of variables, functions, classes, and other
code elements chosen by the programmer. Typically, identifiers are
arbitrary sequences of letters and possibly numbers. Some languages
require identifiers to be marked with a sentinel (like the dollar sign
in Perl) to clearly distinguish identifiers from keywords.
• Numbers could be formatted as integers, or floating point values, or
fractions, or in alternate bases such as binary, octal or hexadecimal.
Each format should be clearly distinguished, so that the programmer
does not confuse one with the other.
• Strings are literal character sequences that must be clearly distin-
guished from keywords or identifiers. Strings are typically quoted
with single or double quotes, but also must have some facility for
containing quotations, newlines, and unprintable characters.
• Comments and whitespace are used to format a program to make it
visually clear, and in some cases (like Python) are significant to the
structure of a program.
When designing a new language, or designing a compiler for an exist-
ing language, the first job is to state precisely what characters are permit-
ted in each type of token. Initially, this could be done informally by stating,
for example, “An identifier consists of a letter followed by any number of letters
and numerals.", and then assigning a symbolic constant (TOKEN_IDENTIFIER)
for that kind of token. As we will see, an informal approach is often am-
biguous, and a more rigorous approach is needed.
3.2 A Hand-Made Scanner

Figure 3.1 shows how one might write a scanner by hand, using simple
coding techniques. To keep things simple, we consider just a few
tokens: * for multiplication, ! for logical-not, != for not-equal, and se-
quences of letters and numbers for identifiers.
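A minimal hand-made scanner along these lines might look like the following sketch (the token names and function signature are assumptions, not necessarily the book's exact figure):

#include <ctype.h>
#include <stdio.h>

typedef enum {
    TOKEN_EOF, TOKEN_MULTIPLY, TOKEN_NOT,
    TOKEN_NOT_EQUAL, TOKEN_IDENTIFIER, TOKEN_ERROR
} token_t;

token_t scan_token( FILE *fp )
{
    int c = fgetc(fp);

    if(c==EOF) return TOKEN_EOF;

    /* single-character tokens are classified immediately */
    if(c=='*') return TOKEN_MULTIPLY;

    /* '!' may stand alone, or begin the two-character token != */
    if(c=='!') {
        int d = fgetc(fp);
        if(d=='=') return TOKEN_NOT_EQUAL;
        ungetc(d,fp);   /* not part of this token: put it back */
        return TOKEN_NOT;
    }

    /* a letter begins an identifier: consume letters and digits */
    if(isalpha(c)) {
        do {
            c = fgetc(fp);
        } while(isalnum(c));
        ungetc(c,fp);   /* put back the first non-matching character */
        return TOKEN_IDENTIFIER;
    }

    return TOKEN_ERROR;
}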
The basic approach is to read one character at a time from the input
stream (fgetc(fp)) and then classify it. Some single-character tokens are
easy: if the scanner reads a * character, it immediately returns TOKEN_MULTIPLY,
and the same would be true for addition, subtraction, and so forth.
However, some characters are part of multiple tokens. If the scanner
encounters !, that could represent a logical-not operation by itself, or it
could be the first character in the != sequence representing not-equal-to.
Upon reading !, the scanner must immediately read the next character. If
the next character is =, then it has matched the sequence != and returns
TOKEN_NOT_EQUAL.
But, if the character following ! is something else, then the non-matching
character needs to be put back on the input stream using ungetc, because
it is not part of the current token. The scanner returns TOKEN_NOT and will
consume the put-back character on the next call to scan_token.
In a similar way, once a letter has been identified by isalpha(c), then
the scanner keeps reading letters or numbers, until a non-matching char-
acter is found. The non-matching character is put back, and the scanner
returns TOKEN_IDENTIFIER.
(We will see this pattern come up in every stage of the compiler: an
unexpected item doesn’t match the current objective, so it must be put
back for later. This is known more generally as backtracking.)
As you can see, a hand-made scanner is rather verbose. As more to-
ken types are added, the code can become quite convoluted, particularly
if tokens share common sequences of characters. It can also be difficult
for a developer to be certain that the scanner code corresponds to the de-
sired definition of each token, which can result in unexpected behavior on
complex inputs. That said, for a small language with a limited number of
tokens, a hand-made scanner can be an appropriate solution.
For a complex language with a large number of tokens, we need a more
formalized approach to defining and scanning tokens. A formal approach
will allow us to have a greater confidence that token definitions do not
conflict and the scanner is implemented correctly. Further, a formalized
approach will allow us to make the scanner compact and high perfor-
mance – surprisingly, the scanner itself can be the performance bottleneck
in a compiler, since every single character must be individually consid-
ered.
The formal tools of regular expressions and finite automata allow us
to state very precisely what may appear in a given token type. Then, auto-
mated tools can process these definitions, find errors or ambiguities, and
produce compact, high performance code.
(Recall the three rules for forming a regular expression: alternation s|t, concatenation st, and the Kleene closure s*.) Rule #3 is known as the Kleene closure and has the highest precedence. Rule #2 is known as concatenation. Rule #1 has the lowest precedence and is known as alternation. Parentheses can be added to adjust the order of operations in the usual way.
Here are a few examples using just the basic rules. (Note that a finite
RE can indicate an infinite set.)
Regular Expression s     Language L(s)
hello                    { hello }
d(o|i)g                  { dog, dig }
moo*                     { mo, moo, mooo, ... }
(moo)*                   { ε, moo, moomoo, moomoomoo, ... }
a(b|a)*a                 { aaa, aba, aaaa, aaba, abaa, abba, ... }
The syntax described above is entirely sufficient to write any regular expression. But it is also handy to have a few helper operations built on top of the basic syntax, such as s? to indicate that s is optional, s+ to indicate one or more occurrences of s, and character classes like [a-z] as shorthand for a range of single-character alternatives.
3.4 Finite Automata

A finite automaton (FA) is an abstract machine consisting of a set of states and a set of transitions between them, each labeled with a symbol. The FA moves between states by consuming input, following the edge labeled with the same symbol as the input symbol. Some states of the FA are known as accepting states
and are indicated by a double circle. If the FA is in an accepting state after
all input is consumed, then we say that the FA accepts the input. We say
that the FA rejects the input string if it ends in a non-accepting state, or if
there is no edge corresponding to the current input symbol.
Every RE can be written as an FA, and vice versa. For a simple regular
expression, one can construct an FA by hand. For example, here is an FA
for the keyword for:
[Figure: A four-state FA accepting the keyword for: state 0 →f→ state 1 →o→ state 2 →r→ state 3 (accepting).]
3.4.1 Deterministic Finite Automata

[Figure: A DFA for identifiers of the form [a-z][a-z0-9]*, and a DFA for integers: either 0 alone, or [1-9] followed by any number of digits [0-9].]
The transitions between states are represented by a matrix M[s, i] which encodes the next state, given the current state and input symbol. (If the transition is not allowed, we mark it with E to indicate an error.) For each input symbol, we update the current state s = M[s, i] until all the input is consumed, or an error state is reached.
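To make the matrix technique concrete, here is a minimal sketch in C that encodes the for automaton shown earlier (the array sizes and names are assumptions):

#include <stdio.h>
#include <string.h>

#define NSTATES  4
#define NSYMBOLS 256
#define E (-1)     /* error: no such transition */

int M[NSTATES][NSYMBOLS];   /* M[s][c] gives the next state from s on c */

int fa_accepts( const char *input, int start, const int *accepting )
{
    int s = start;
    while(*input) {
        s = M[s][(unsigned char)*input++];
        if(s==E) return 0;   /* no edge for this symbol: reject */
    }
    return accepting[s];     /* accept iff we end in an accepting state */
}

int main()
{
    int accepting[NSTATES] = {0,0,0,1};
    memset(M,E,sizeof(M));   /* every entry starts out as an error */
    M[0]['f'] = 1;
    M[1]['o'] = 2;
    M[2]['r'] = 3;
    printf("%d\n",fa_accepts("for",0,accepting));   /* prints 1 */
    printf("%d\n",fa_accepts("fox",0,accepting));   /* prints 0 */
    return 0;
}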
3.4.2 Nondeterministic Finite Automata

[Figure: An NFA matching [a-z]*ing: state 0 carries a self-loop on [a-z] and edges i, n, g leading through states 1 and 2 to the accepting state 3.]
Now consider how this automaton would consume the word sing. It
could proceed in two different ways. One would be to move to state 0 on
s, state 1 on i, state 2 on n, and state 3 on g. But the other, equally valid
way would be to stay in state 0 the whole time, matching each letter to the
[a-z] transition. Both ways obey the transition rules, but one results in
acceptance, while the other results in rejection.
The problem here is that state 0 allows for two different transitions on
the symbol i. One is to stay in state 0 matching [a-z] and the other is to
move to state 1 matching i.
Moreover, there is no simple rule by which we can pick one path or
another. If the input is sing, the right solution is to proceed immediately
from state zero to state one on i. But if the input is singing, then we
should stay in state zero for the first ing and proceed to state one for the
second ing.
An NFA can also have an ε (epsilon) transition, which represents the
empty string. This transition can be taken without consuming any input
symbols at all. For example, we could represent the regular expression
a*(ab|ac) with this NFA:
[Figure: An NFA for a*(ab|ac): state 0 loops on a and has ε-transitions to states 1 and 4; states 1 →a→ 2 →b→ 3 match ab, and states 4 →a→ 5 →c→ 6 match ac.]
For example, on the input aaac, the NFA would occupy these sets of states:

States             Action
0, 1, 4            consume a
0, 1, 2, 4, 5      consume a
0, 1, 2, 4, 5      consume a
0, 1, 2, 4, 5      consume c
6                  accept
In principle, one can implement an NFA in software or hardware by
simply keeping track of all of the possible states. But this is inefficient.
In the worst case, we would need to evaluate all states for all characters
on each input transition. A better approach is to convert the NFA into an
equivalent DFA, as we show below.
3.5 Conversion Algorithms
Regular expressions and finite automata are all equally powerful. For ev-
ery RE, there is an FA, and vice versa. However, a DFA is by far the most
straightforward of the three to implement in software. In this section, we
will show how to convert an RE into an NFA, then an NFA into a DFA,
and then to optimize the size of the DFA.
3.5.1 Converting REs to NFAs

The NFA for any character a is a single edge labeled a between a start state and an accepting state; the NFA for an ε transition is a single edge labeled ε.
Now, suppose that we have already constructed NFAs for the regular
expressions A and B, indicated below by rectangles. Both A and B have
a single start state (on the left) and accepting state (on the right). If we
write the concatenation of A and B as AB, then the corresponding NFA is
simply A and B connected by an ε transition. The start state of A becomes
the start state of the combination, and the accepting state of B becomes the
accepting state of the combination:
A ε B
The alternation A|B places the automata for A and B in parallel between a new start state and a new accepting state, all joined by ε transitions; similarly, the Kleene closure A* surrounds the automaton for A with ε transitions that allow it to be bypassed entirely or repeated any number of times:

[Figure: The NFA constructions for the alternation A|B and for the Kleene closure A*.]
For example, here is the step-by-step construction of an NFA for the regular expression a(cat|cow)*:

[Figure: NFAs for the strings cat and cow are built by concatenating single-character NFAs; the alternation construction joins them into cat|cow; the closure construction wraps that into (cat|cow)*; and a final concatenation with the NFA for a yields a(cat|cow)*.]
You can easily see that the NFA resulting from the construction algo-
rithm, while correct, is quite complex and contains a large number of ep-
silon transitions. An NFA representing the tokens for a complete language
could end up having thousands of states, which would be very impractical
to implement. Instead, we can convert this NFA into an equivalent DFA.
3.5.2 Converting NFAs to DFAs
Epsilon closure. ε-closure(n) is the set of NFA states reachable from NFA state n by zero or more ε transitions.
Now we define the subset construction algorithm. First, we create a start state D0 corresponding to ε-closure(N0). Then, for each outgoing character c from the states in D0, we create a new state containing the epsilon closure of the states reachable by c. More precisely:
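Spelled out, D0 = ε-closure(N0); then for each DFA state D and character c, the target state is the ε-closure of the union of M[n, c] over the NFA states n in D, and every new set discovered becomes a new DFA state. A compact sketch in C, representing each set of NFA states as a bit-mask (all type and table names here are assumptions):

#include <stdint.h>

#define MAX_NFA 32                      /* NFA states fit in one 32-bit mask   */

typedef uint32_t stateset_t;            /* bit n set => NFA state n is present */

static stateset_t eps[MAX_NFA];         /* eps[n]: states one ε edge from n    */
static stateset_t delta[MAX_NFA][256];  /* delta[n][c]: states reachable on c  */

/* Expand a set of states with everything reachable by ε transitions. */
static stateset_t eps_closure( stateset_t s )
{
    stateset_t result = s, old;
    do {
        old = result;
        for(int n=0; n<MAX_NFA; n++)
            if(result & (1u<<n))
                result |= eps[n];
    } while(result!=old);
    return result;
}

/* One step of the subset construction: the DFA state reached from the
   DFA state d (itself a set of NFA states) on input character c.      */
static stateset_t dfa_move( stateset_t d, int c )
{
    stateset_t next = 0;
    for(int n=0; n<MAX_NFA; n++)
        if(d & (1u<<n))
            next |= delta[n][c];
    return eps_closure(next);
}

The full construction keeps a work list of state sets that have not yet been processed, applying dfa_move to each until no new sets appear.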
[Figure 3.4: The NFA for a(cat|cow)*, with states numbered N0–N13: N0 →a→ N1, then ε edges lead through N2 and N3 to the two branches N8 →c→ N9 →a→ N10 →t→ N11 (cat) and N4 →c→ N5 →o→ N6 →w→ N7 (cow), with ε edges returning through N12 to the accepting state N13.]
[Figure 3.5: The DFA produced by the subset construction: D0 = {N0}; on a, D0 → D1 = {N1, N2, N3, N4, N8, N13}; on c, D1 → D2 = {N5, N9}; on o, D2 → D3 = {N6}; on w, D3 → D4 = {N7, N12, N13, N2, N3, N4, N8}; on a, D2 → D5 = {N10}; on t, D5 → D6 = {N11, N12, N13, N2, N3, N4, N8}; and D4 and D6 return to D2 on c.]
Example. Let’s work out the algorithm on the NFA in Figure 3.4. This
is the same NFA corresponding to the RE a(cat|cow)* with each of the
states numbered for clarity.
7. Remove D4 from the work list, and observe that its only outgoing transition c leads to states N5 and N9, which already exist as state D2, so simply add a transition D4 → D2 on c.

8. Remove D6 from the work list and, in a similar way, add a transition D6 → D2 on c.

9. The work list is empty, so we are done.
3.5.3 Minimizing DFAs
[Figure: A five-state DFA over inputs a and b, with state 5 accepting, and the initial partition of its states into the super-states (1,2,3,4) and (5).]
Now, we ask whether this graph is consistent with respect to all possi-
ble inputs, by referring back to the original DFA. For example, we observe
that, if we are in super-state (1,2,3,4) then an input of a always goes to
state 2, which keeps us within the super-state. So, this DFA is consistent
with respect to a. However, from super-state (1,2,3,4) an input of b can
either stay within the super-state or go to super-state (5). So, the DFA is
inconsistent with respect to b.
To fix this, we try splitting out one of the inconsistent states (4) into a
new super-state, taking the transitions with it:
[Figure: After splitting state 4 out: super-states (1,2,3), (4), and (5).]
[Figure: The final, minimal DFA: states (1,3), (2), (4), and (5).]
Again, we examine each super-state and observe that each possible in-
put is consistent with respect to the super-state, and therefore we have the
minimal DFA.
3.6 Limits of Finite Automata

Regular expressions and finite automata are powerful and effective at recognizing simple patterns in individual words or tokens, but they are not sufficient to analyze all of the structures in a program. For example, could you use a finite automaton to match an arbitrary number of nested parentheses?
It’s not hard to write out an FA that could match, say, up to three pairs
of nested parentheses, like this:
[Figure: An FA matching up to three pairs of nested parentheses, with states 0–3 linked by ( edges going forward and ) edges going backward.]
3.7 Using a Scanner Generator
%{
(C Preamble Code)
%}
(Character Classes)
%%
(Regular Expression Rules)
%%
(Additional Code)
Scanners are most often produced automatically by a tool known as a scanner generator. The program Lex, developed at AT&T, was one of the earliest examples of a scanner generator. Vern Paxson translated Lex into the C language to create Flex, which is distributed under the Berkeley license and is widely used in Unix-like operating systems today to generate scanners implemented in C or C++.
To use Flex, we write a specification of the scanner that is a mixture of
regular expressions, fragments of C code, and some specialized directives.
The Flex program itself consumes the specification and produces regular
C code that can then be compiled in the normal way.
Figure 3.6 gives the overall structure of a Flex file. The first section con-
sists of arbitrary C code that will be placed at the beginning of scanner.c,
like include files, type definitions, and similar things. Typically, this is
used to include a file that contains the symbolic constants for tokens.
The second section states character classes, which are a symbolic short-
hand for commonly used regular expressions. For example, you might
declare DIGIT [0-9]. This class can be referred to later as {DIGIT}.
The third section is the most important part. It states a regular expres-
sion for each type of token that you wish to match, followed by a fragment
of C code that will be executed whenever the expression is matched. In the
simplest case, this code returns the type of the token, but it can also be used
to extract token values, display errors, or anything else appropriate.
The fourth section is arbitrary C code that will go at the end of the
scanner, typically for additional helper functions. A peculiar requirement
of Flex is that we must define a function yywrap() which returns one to
indicate that the input is complete at the end of the file. If we wanted to
continue scanning in another file, then yywrap() would open the next file
and return zero.
The regular expression language accepted by Flex is very similar to
that of formal regular expressions discussed above. The main difference is
that characters that have special meaning with a regular expression (like
parentheses, square brackets, and asterisks) must be escaped with a back-
slash or surrounded with double quotes. Also, a period (.) can be used to
match any character at all, which is helpful for catching error conditions.
%{
#include "token.h"
%}
DIGIT [0-9]
LETTER [a-zA-Z]
%%
(" "|\t|\n) /* skip whitespace */
\+ { return TOKEN_ADD; }
while { return TOKEN_WHILE; }
{LETTER}+ { return TOKEN_IDENT; }
{DIGIT}+ { return TOKEN_NUMBER; }
. { return TOKEN_ERROR; }
%%
int yywrap() { return 1; }
#include "token.h"
#include <stdio.h>
int main() {
yyin = fopen("program.c","r");
if(!yyin) {
printf("could not open program.c!\n");
return 1;
}
while(1) {
token_t t = yylex();
if(t==TOKEN_EOF) break;
printf("token: %d text: %s\n",t,yytext);
}
}
typedef enum {
TOKEN_EOF=0,
TOKEN_WHILE,
TOKEN_ADD,
TOKEN_IDENT,
TOKEN_NUMBER,
TOKEN_ERROR
} token_t;
Figure 3.7 shows a simple but complete example to get you started.
This specification describes just a few tokens: a single character addition
(which must be escaped with a backslash), the while keyword, an iden-
tifier consisting of one or more letters, and a number consisting of one or
more digits. As is typical in a scanner, any other type of character is an
error, and returns an explicit token type for that purpose.
Flex generates the scanner code, but not a complete program, so you
must write a main function to go with it. Figure 3.8 shows a simple driver
program that uses this scanner. First, the main program must declare as
extern the symbols it expects to use in the generated scanner code: yyin
is the file from which text will be read, yylex is the function that imple-
ments the scanner, and the array yytext contains the actual text of each
token discovered.
Finally, we must have a consistent definition of the token types across
the parts of the program, so into token.h we put an enumeration describ-
ing the new type token_t. This file is included in both scanner.flex
and main.c.
Figure 3.10 shows how all the pieces come together. scanner.flex is
converted into scanner.c by invoking flex scanner.flex -o scanner.c.
Then, both main.c and scanner.c are compiled to produce object files,
which are linked together to produce the complete program.
[Figure 3.10: The build procedure: flex converts scanner.flex into scanner.c; the compiler turns scanner.c and main.c into scanner.o and main.o (both include token.h); and the linker combines them into scanner.exe.]

3.8 Practical Considerations

Handling keywords. In most languages, keywords also match the pattern for identifiers, so the scanner must distinguish between them. One approach is to give each keyword its own rule, placed before the identifier rule. Another approach is to match keywords and identifiers together with a single rule; the action associated
with that rule can compare the token text with a separate list of keywords
and return the appropriate type. Yet another approach is to treat all key-
words and identifiers as a single token type, and allow the problem to be
sorted out by the parser. (This is necessary in languages like PL/1, where
identifiers can have the same names as keywords, and are distinguished
by context.)
Tracking source locations. In later stages of the compiler, it is useful
for the parser or typechecker to know exactly what line and column num-
ber a token was located at, usually to print out a helpful error message.
(“Undefined symbol spider at line 153.”) This is easily done by having
the scanner match newline characters, and increase the line count (but not
return a token) each time one is found.
Cleaning tokens. Strings, characters, and similar token types need to
be cleaned up after they are matched. For example, "hello\n" needs to
have its quotes removed and the backslash-n sequence converted to a lit-
eral newline character. Internally, the compiler only cares about the actual
contents of the string. Typically, this is accomplished by writing a function
string_clean in the postamble of the Flex specification. The function is
invoked by the matching rule before returning the desired token type.
Constraining tokens. Although regular expressions can match tokens
of arbitrary length, it does not follow that a compiler must be prepared to
accept them. There would be little point to accepting a 1000-letter iden-
tifier, or an integer larger than the machine’s word size. The typical ap-
proach is to set the maximum token length (YYLMAX in flex) to a very large
value, then examine the token to see if it exceeds a logical limit in the ac-
tion that matches the token. This allows you to emit an error message that
describes the offending token as needed.
Error Handling. The easiest approach to handling errors or invalid
input is simply to print a message and exit the program. However, this
is unhelpful to users of your compiler – if there are multiple errors, it’s
(usually) better to see them all at once. A good approach is to match the
minimum amount of invalid text (using the dot rule) and return an explicit
token type indicating an error. The code that invokes the scanner can then
emit a suitable message, and then ask for the next token.
3.9 Exercises
1. Write regular expressions for the following entities. You may find it
necessary to justify what is and is not allowed within each expres-
sion:
3. Test the regular expressions you wrote in the previous two problems
by translating them into your favorite programming language that
has native support for regular expressions. (Perl and Python are two
good choices.) Evaluate the correctness of your program by writing
test cases that should (and should not) match.
5. Convert the NFAs in the previous problem into DFAs using the sub-
set construction method.
Chapter 4 – Parsing
4.1 Overview
4.2 Context Free Grammars
Grammar G2
1. P → E
2. E → E + E
3. E → ident
4. E → int
[Figure: Two distinct parse trees for the same sentence under Grammar G2, demonstrating that the grammar is ambiguous.]
Grammar G3
1. P → E
2. E → E + T
3. E → T
4. T → ident
5. T → int
allows a left-most derivation. But also note that it still accepts the same language as Grammar G2. That is, any sentence that can be derived by Grammar G2 can also be derived by Grammar G3, but there exists only one
derivation (and one meaning) per sentence. (Proof is left as an exercise to
the reader.)
Now suppose that we would like to add more operators to our gram-
mar. If we simply add more rules of the form E → E ∗ T and E → E ÷ T, we
would still have an unambiguous grammar, but it would not follow the
rules of precedence in algebra: each operator would be applied from left
to right.
Instead, the usual approach is to construct a grammar with multiple
levels that reflect the intended precedence of operators. For example, we
can combine addition and multiplication by expressing them as a sum of
terms (T ) that consist of multiplied factors (F ), like this:
Grammar G4
1. P → E
2. E → E + T
3. E → T
4. T → T * F
5. T → F
6. F → ident
7. F → int
Grammar G5
1. P → S
2. S → if E then S
3. S → if E then S else S
4. S → other
Do this now:
Write out the two possible parse trees for this sentence:
if E then if E then other else other.
4.3 LL Grammars
LL(1) grammars are a subset of CFGs that are easy to parse with simple
algorithms. A grammar is LL(1) if it can be parsed by considering only
one non-terminal and the next token in the input stream.
To ensure that a grammar is LL(1), we must do the following:

• Remove any ambiguity from the grammar.
• Eliminate any left recursion.
• Eliminate any common left prefixes.
Once we have taken those steps, then we can prove that it is LL(1) by
generating the FIRST and FOLLOW sets for the grammar, and using them
to create the LL(1) parse table. If the parse table contains no conflicts, then
the grammar is clearly LL(1).
Rules with left recursion have the general form:

A → A α1 | A α2 | ... | β1 | β2 | ...

Substitute with:

A → β1 A' | β2 A' | ...
A' → α1 A' | α2 A' | ... | ε
Grammar G6
1. P → E
2. E → T E'
3. E' → + T E'
4. E' → ε
5. T → ident
6. T → int
Rules with a common left prefix have the general form:

A → α β1 | α β2 | ...

Substitute with:

A → α A'
A' → β1 | β2 | ...
For example, these rules describing an identifier, array reference, and
function call all share the same prefix of a single identifier:
Grammar G7
1. P → E
2. E → id
3. E → id [ E ]
4. E → id ( E )
Grammar G8
1. P → E
2. E → id E'
3. E' → [ E ]
4. E' → ( E )
5. E' → ε
Repeat:
    For each rule X → Y1 Y2 ... Yk in a grammar G:
        Add a to FIRST(X)
            if a is in FIRST(Y1)
            or a is in FIRST(Yn) and Y1 ... Yn−1 → ε.
    If Y1 ... Yk → ε then add ε to FIRST(X).
until no more changes occur.
Repeat:
    If A → αBβ then:
        add FIRST(β) (excepting ε) to FOLLOW(B).
    If A → αB, or A → αBβ where FIRST(β) contains ε, then:
        add FOLLOW(A) to FOLLOW(B).
until no more changes occur.
Grammar G9
1. P → E
2. E → T E'
3. E' → + T E'
4. E' → ε
5. T → F T'
6. T' → * F T'
7. T' → ε
8. F → ( E )
9. F → int
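Working the two algorithms above over Grammar G9 yields these sets (derived here from the definitions; they drive the parser and the parse table below):

FIRST(P) = FIRST(E) = FIRST(T) = FIRST(F) = { (, int }
FIRST(E') = { +, ε }
FIRST(T') = { *, ε }

FOLLOW(P) = { $ }
FOLLOW(E) = FOLLOW(E') = { ), $ }
FOLLOW(T) = FOLLOW(T') = { +, ), $ }
FOLLOW(F) = { *, +, ), $ }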
Here is a hand-coded recursive descent parser for Grammar G9. It assumes that the scanner provides the helper functions scan_token, putback_token, and expect_token, and each parse function returns non-zero on success:
int parse_P() {
return parse_E() && expect_token(TOKEN_EOF);
}
int parse_E() {
return parse_T() && parse_E_prime();
}
int parse_E_prime() {
token_t t = scan_token();
if(t==TOKEN_PLUS) {
return parse_T() && parse_E_prime();
} else {
putback_token(t);
return 1;
}
}
int parse_T() {
return parse_F() && parse_T_prime();
}
int parse_T_prime() {
token_t t = scan_token();
if(t==TOKEN_MULTIPLY) {
return parse_F() && parse_T_prime();
} else {
putback_token(t);
return 1;
}
}
int parse_F() {
token_t t = scan_token();
if(t==TOKEN_LPAREN) {
return parse_E() && expect_token(TOKEN_RPAREN);
} else if(t==TOKEN_INT) {
return 1;
} else {
printf("parse error: unexpected token %s\n",
token_string(t));
return 0;
}
}
For example, here is the parse table for Grammar G9. Notice that the entries for P, E, T, and F are straightforward: each can only start with int or (, and so these tokens cause the rules to descend toward F and a choice between rule 8 (F → ( E )) and rule 9 (F → int). The entry for E' is a little more complicated: a + token results in applying E' → + T E', while ) or $ indicates E' → ε.
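Reconstructed from the FIRST and FOLLOW sets given earlier, the LL(1) parse table for Grammar G9 is as follows (each entry is the number of the rule to apply):

        int    (      )      +      *      $
P       1      1
E       2      2
E'                    4      3             4
T       5      5
T'                    7      6      7      7
F       9      8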
Now we have all the pieces necessary to operate the parser. Informally,
the idea is to keep a stack which tracks the current state of the parser. In
each step, we consider the top element of the stack and the next token on
the input. If they match, then pop the stack, accept the token and continue.
If not, then consult the parse table for the next rule to apply. If we can
continue until the end-of-file symbol is matched, then the parse succeeds.
Create a stack S.
Push $ and P onto S.
Let c be the first token on the input.
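The loop itself follows the standard table-driven procedure; a reconstruction consistent with the informal description above:

Repeat until S is empty:
    Let X be the top element of S.
    If X is the same as c:
        Pop X from the stack and let c be the next token on the input.
    Else if X is a terminal (or $):
        Halt with a parse error.
    Else if the parse table entry T[X, c] gives a rule X → α:
        Pop X from the stack and push the symbols of α, leftmost on top.
    Else:
        Halt with a parse error.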
4.4 LR Grammars
While LL(1) grammars and top-down parsing techniques are easy to work
with, they are not able to represent all of the structures found in many
programming languages. For more general-purpose programming lan-
guages, we must use an LR(1) grammar and associated bottom-up parsing
techniques.
LR(1) is the set of grammars that can be parsed via shift-reduce tech-
niques with a single character of lookahead. LR(1) is a super-set of LL(1)
and can accommodate left recursion and common left prefixes which are
not permitted in LL(1). This enables us to express many programming
constructs in a more natural way. (An LR(1) grammar must still be non-
ambiguous, and it cannot have shift-reduce or reduce-reduce conflicts,
which we will explain below.)
For example, Grammar G10 is an LR(1) grammar:
Grammar G10
1. P → E
2. E → E + T
3. E → T
4. T → id ( E )
5. T → id
          P        E             T
FIRST     { id }   { id }        { id }
FOLLOW    { $ }    { +, ), $ }   { +, ), $ }
While this example shows that there exists a derivation for the sen-
tence, it does not explain how each action was chosen at each step. For
example, in the second step, we might have chosen to reduce id to T in-
stead of shifting a left parenthesis. This would have been a bad choice,
because there is no rule that begins with T(, but that was not immediately
obvious without attempting to proceed further. To make these decisions,
we must analyze LR(1) grammars in more detail.
Kernel of State 0
P → . E
Then, we compute the closure of the state as follows. For each item
in the state with a non-terminal X immediately to the right of the dot, we
add all rules in the grammar that have X as the left hand side. The newly
added items have a dot at the beginning of the right hand side.
P → . E
E → . E + T
E → . T

Closure of State 0
P → . E
E → . E + T
E → . T
T → . id ( E )
T → . id
You can think of the state this way: It describes the initial state of the
parser as expecting a complete program in the form of an E. However, an
E is known to begin with an E or a T , and a T must begin with an id. All
of those symbols could represent the beginning of the program.
From this state, all of the symbols (terminals and non-terminals both)
to the right of the dot are possible outgoing transitions. If the automa-
ton takes that transition, it moves to a new state containing the matching
items, with the dot moved one position to the right. The closure of the new
state is computed, possibly adding new rules as described above.
For example, from state zero, E, T , and id are the possible transitions,
because each appears to the right of the dot in some rule. Here are the
states for each of those transitions:
Transition on E:
P → E .
E → E . + T

Transition on T:
E → T .

Transition on id:
T → id . ( E )
T → id .
Figure 4.3 gives the complete LR(0) automaton for Grammar G10 . Take
a moment now to trace over the table and be sure that you understand
how it is constructed.
No, really. Stop now and study the figure carefully before continuing.
[Figure 4.3: The LR(0) automaton for Grammar G10. State 0 contains P → . E, E → . E + T, E → . T, T → . id ( E ), T → . id. On E it moves to state 1 (P → E ., E → E . + T; accept on $), on id to state 4 (T → id . ( E ), T → id .), and on T to state 8 (E → T .). State 1 on + leads to state 2 (E → E + . T), which on T leads to state 3 (E → E + T .) and on id to state 4. State 4 on ( leads to state 5 (T → id ( . E ), plus its closure items), which on E leads to state 6 (T → id ( E . ), E → E . + T), on T to state 8, and on id to state 4. State 6 on ) leads to state 7 (T → id ( E ) .) and on + back to state 2.]
The LR(0) automaton tells us the choices available at any step of bot-
tom up parsing. When we reach a state containing an item with a dot at
the end of the rule, that indicates a possible reduction. A transition on a
terminal that moves the dot one position to the right indicates a possible
shift. While the LR(0) automaton tells us the available actions at each step,
it does not always tell us which action to take.¹
Two types of conflicts can appear in an LR grammar:
A shift-reduce conflict indicates a choice between a shift action and a
reduce action. For example, state 4 offers a choice between shifting a left
parenthesis and reducing by rule five:
Shift-Reduce Conflict:
T → id . ( E )
T → id .
¹The 0 in LR(0) indicates that it uses zero lookahead tokens, which is a way of saying that it does not consider the input before making a decision. While it is possible to write out a grammar that is strictly LR(0), it is not very interesting or useful, since the parse does not depend on the input!
Naturally, each state in the table can be occupied by only one action.
If following the procedure results in a table with more than one state in a
given entry, then you can conclude that the grammar is not SLR. (It might
still be LR(1) – more on that below.)
Here is the SLR parse table for Grammar G10 . Note carefully the states
1 and 4 where there is a choice between shifting and reducing. In state 1, a
lookahead of + causes a shift, while a lookahead of $ results in a reduction
P → E because $ is the only member of FOLLOW(P).
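A reconstruction of that table from the automaton of Figure 4.3 and the FOLLOW sets of Grammar G10 (sN = shift and go to state N; rN = reduce by rule N; gN = go to state N; reducing by rule 1 on $ completes the parse):

State   id     (      )      +      $      E     T
0       s4                                 g1    g8
1                            s2     r1
2       s4                                       g3
3                     r2     r2     r2
4              s5     r5     r5     r5
5       s4                                 g6    g8
6                     s7     s2
7                     r4     r4     r4
8                     r3     r3     r3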
Now we are ready to parse an input by following the SLR parsing al-
gorithm. The parse requires maintaining a stack of states in the LR(0) au-
tomaton, initially containing the start state S0 . Then, we examine the top
of the stack and the lookahead token, and take the action indicated by the
SLR parse table. On a shift, we consume the token and push the indicated
state on the stack. On a reduce by A → β, we pop states from the stack corresponding to each of the symbols in β, then take the additional step of moving to state GOTO[t, A]. This process continues until we either succeed by reducing the start symbol, or fail by encountering an error state.
Loop:
    Let s be the top of the stack.
    If ACTION[s, a] is accept:
        Parse complete.
    Else if ACTION[s, a] is shift t:
        Push state t on the stack.
        Let a be the next input token.
    Else if ACTION[s, a] is reduce A → β:
        Pop states corresponding to β from the stack.
        Let t be the top of the stack.
        Push GOTO[t, A] onto the stack.
    Otherwise:
        Halt with a parse error.
Grammar G11
1. S → V = E
2. S → id
3. V → id
4. V → id [ E ]
5. E → V
We need only build part of the LR(0) automaton to see the problem:
State 0:
S → . V = E
S → . id
V → . id
V → . id [ E ]

On id, the automaton moves to State 1:
S → id .
V → id .
V → id . [ E ]

State 1 contains two complete items, S → id . and V → id ., and the LR(0) items alone give no way to choose between the two reductions.
The closure of the start state is computed by adding the rules for V
with a lookahead of =, because = follows V in rule 1:
Closure of State 0
S → . V = E      {$}
S → . id         {$}
V → . id         {=}
V → . id [ E ]   {=}
Now you can see how the lookahead solves the reduce-reduce conflict.
When the next token on the input is $, we can only reduce by S → id.
When the next token is =, we can only reduce by V → id. By tracking
lookaheads in a more fine-grained manner than SLR, we are able to parse
arbitrary LR(1) grammars.
[Figure 4.6: The LR(1) automaton for Grammar G10, with each item annotated by its lookahead set. The portion of the automaton reachable outside parentheses carries lookaheads {$, +}, while the states reached inside parentheses (after T → id ( . E )) carry lookaheads {+, )}. As a result the automaton contains near-duplicate states (16 in all, numbered 0–15) that differ only in their lookaheads.]
Figure 4.6 gives the complete LR(1) automaton for Grammar G10 . Take
a moment now to trace over the table and be sure that you understand
how it is constructed.
One aspect of state zero is worth clarifying. When constructing the
closure of a state, we must consider all rules in the grammar, including
the rule corresponding to the item under closure. The item E → . E + T
is initially added with a lookahead of {$}. Then, evaluating that item, we
add all rules that have E on the left hand side, adding a lookahead of {+}.
So, we add E → . E + T again, this time with a lookahead of {+}, resulting
in a single item with a lookahead set of {$, +}.
An LALR(1) automaton is built by merging LR(1) states that share the same core items, taking the union of their lookahead sets. For example, these two LR(1) states:

E → . E + T   {$, +}          E → . E + T   {), +}
E → . T       {$, +}          E → . T       {), +}

merge into a single LALR state:

E → . E + T   {$, ), +}
E → . T       {$, ), +}
The resulting LALR automaton has the same number of states as the
LR(0) automaton, but has more precise lookahead information available
for each item. While this may seem a minor distinction, experience has
shown this simple improvement to be highly effective at obtaining the ef-
ficiency of SLR parsing while accommodating a large number of practical
grammars.
4.5 Grammar Classes Revisited

Now that you have some experience working with different kinds of grammars, let's step back and review how they relate to each other.
An LL(2) parser would require a row for pairs {aa, ab, ac, ...}.
4.6 The Chomsky Hierarchy

Context free languages are those described by context free grammars, where each rule has a single non-terminal on the left hand side, and a mix of terminals and non-terminals on the right hand
side. We call these “context free” because the meaning of a non-terminal is
the same in all places where it appears. As you have learned in this chap-
ter, a CFG requires a pushdown automaton, which is achieved by coupling
a finite automaton with a stack. If the grammar is ambiguous, the automa-
ton will be non-deterministic and therefore impractical. In practice, we
restrict ourselves to using subsets of CFGs (like LL(1) and LR(1)) that are
non-ambiguous and result in a deterministic automaton that completes in
bounded time.
Context sensitive languages are those described by context sensitive
grammars where each rule can be of the form αAβ → αγβ. We call these
“context sensitive” because the interpretation of a non-terminal is con-
trolled by context in which it appears. Context sensitive languages re-
quire a non-deterministic linear bounded automaton, which is bounded
in memory consumption, but not in execution time. Context sensitive lan-
guages are not very practical for computer languages.
Recursively enumerable languages are the least restrictive set of lan-
guages, described by rules of the form α → β where α and β can be any
combination of terminals and non-terminals. These languages can only be
recognized by a full Turing machine, and are the least practical of all.
4.7 Exercises
Grammar G12
1. P → S
2. P → S P
3. S → if E then S
4. S → if E then S else S
5. S → while E S
6. S → begin P end
7. S → print E
8. S → E
9. E → id
10. E → integer
11. E → E + E
(a) Point out all aspects of Grammar G12 which are not LL(1).
(b) Write a new grammar which accepts the same language, but
avoids left recursion and common left prefixes.
(c) Write the FIRST and FOLLOW sets for the new grammar.
(d) Write out the LL(1) parse table for the new grammar.
(e) Is the new grammar an LL(1) grammar? Explain your answer
carefully.
4. Consider the following grammar:
Grammar G13
1. S → id = E
2. E → E + P
3. E → P
4. P → id
5. P → ( E )
6. P → id ( E )
T                        F
T & T | F                ! ( T | F )
( F -> F ) -> T          ( T -> F ) & T
10. Write a hand-coded parser that reads in regular expressions and out-
puts the corresponding NFA, using the Graphviz [2] DOT language.
11. Write a parser-construction tool that reads in an LL(1) grammar and
produces working code for a table-driven parser as output.
Chapter 5 – Parsing in Practice
In this chapter, you will apply what you have learned about the theory of
LL(1) and LR(1) grammars in order to build a working parser for a simple
expression language. This will give you a basis to write a more complete
parser for B-Minor in the following chapter.
While LL(1) parsers are commonly written by hand, LR(1) parsers are
simply too cumbersome to do the same. Instead, we rely upon a parser
generator to take a specification of a grammar and automatically produce
the working code. In this chapter, we will give examples with Bison, a
widely-used parser generator for C-like languages.
Using Bison, we will define an LALR grammar for algebraic expres-
sions, and then employ it to create three different varieties of programs.
• A validator reads the input program and then simply informs the
user whether it is a valid sentence in the language specified by the
grammar. Such a tool is often used to determine whether a given
program conforms to one standard or another.
• An interpreter reads the input program and then actually executes
the program to produce a result. One approach to interpretation is
to compute the result of each operation as soon as it is parsed. The
alternate approach is to parse the program into an abstract syntax
tree, and then execute it.
• A translator reads the input program, parses it into an abstract syn-
tax tree, and then traverses the abstract syntax tree to produce an
equivalent program in a different format.
5.1 The Bison Parser Generator
%{
#include <stdio.h>
%}
%token TOKEN_INT
%token TOKEN_PLUS
%token TOKEN_MINUS
%token TOKEN_MUL
%token TOKEN_DIV
%token TOKEN_LPAREN
%token TOKEN_RPAREN
%token TOKEN_SEMI
%token TOKEN_ERROR
%%
program : expr TOKEN_SEMI;
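The rule section continues with the usual precedence cascade; a sketch consistent with the token names above and the factor rules shown later in this chapter:

expr : expr TOKEN_PLUS term
     | expr TOKEN_MINUS term
     | term
     ;

term : term TOKEN_MUL factor
     | term TOKEN_DIV factor
     | factor
     ;

factor : TOKEN_MINUS factor
       | TOKEN_LPAREN expr TOKEN_RPAREN
       | TOKEN_INT
       ;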
#include <stdio.h>

extern int yyparse();

int main()
{
if(yyparse()==0) {
printf("Parse successful!\n");
} else {
printf("Parse failed.\n");
}
}
Figure 5.3 shows the general build procedure for a combined program that uses Bison and Flex together. The parser specification goes in parser.bison. We assume that you have written a suitable scanner and placed it in scanner.flex. Previously, we wrote token.h by hand. Here, we will rely on Bison to generate token.h automatically from the %token declarations, so that the parser and the scanner are working from the same information. Invoke Bison like this:

bison --defines=token.h --output=parser.c parser.bison
[Figure 5.3: The build procedure: Bison converts parser.bison into parser.c (and token.h); Flex converts scanner.flex into scanner.c; the compiler turns each into an object file; and parser.o and scanner.o are linked into the final program.]
Bison will report that the grammar has one shift-reduce conflict, and
parser.output will describe each state. In the event of a conflict, Bison
will suppress one or more actions, and this is indicated by square brackets
in the following report:
state 9
2 expr: expr . TOKEN_PLUS expr
2 | expr TOKEN_PLUS expr .
5.2 Expression Validator

As written, the Bison specification in Figure 5.1 will simply evaluate whether
the input program matches the desired grammar. yyparse() will return
zero on success, and non-zero otherwise. Such a program is known as a
validator and is often used to determine whether a given program is stan-
dards compliant.
There are a variety of online validators for web-related languages like
HTML1 , CSS2 , and JSON3 . By having a strict language definition separate
from actual implementations (which may contain non-standard features)
it is much easier for a programmer to determine whether their code is
standards compliant, and therefore (presumably) portable.
1 http://validator.w3.org
2 http://css-validator.org
3 http://jsonlint.com
5.3 Expression Interpreter

To do more than simply validate the program, we must make use of se-
mantic actions embedded within the grammar itself. Following the right
side of any production rule, you may place arbitrary C code inside of curly
braces. This code may refer to semantic values which represent the values
already computed for other non-terminals. Semantic values are given by
dollar signs indicating the position of a non-terminal in a production rule.
Two dollar signs indicates the semantic value of the current rule.
For example, in the rule for addition, the appropriate semantic action is
to add the left value (the first symbol) to the right value (the third symbol):
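Sketches of these two actions (conventional Bison, not necessarily the book's exact figure; the value of an integer token is extracted with atoi(yytext)):

expr : expr TOKEN_PLUS term { $$ = $1 + $3; }
     ;

factor : TOKEN_INT { $$ = atoi(yytext); }
       ;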
(Careful: the value of the token comes from the yytext array in the
scanner, so you can only do this when the rule has a single terminal on the
right hand side of the rule.)
In the cases where a non-terminal expands to a single non-terminal, we
simply assign one semantic value to the other:
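For example:

term : factor { $$ = $1; }
     ;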
factor
: TOKEN_MINUS factor { $$ = -$2; }
| TOKEN_LPAREN expr TOKEN_RPAREN { $$ = $2; }
| TOKEN_INT { $$ = atoi(yytext); }
;
⁴Although it is verbally awkward, we are using the term "kind" rather than "type", which …
5.4 Expression Trees
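The expression structure and its constructors live in the book's expr.h; a sketch of their likely form, with field names inferred from the code that follows:

#include <stdlib.h>

typedef enum {
    EXPR_ADD, EXPR_SUBTRACT, EXPR_MULTIPLY,
    EXPR_DIVIDE, EXPR_VALUE
} expr_kind_t;

struct expr {
    expr_kind_t kind;     /* which operation this node represents */
    struct expr *left;
    struct expr *right;
    int value;            /* used only when kind is EXPR_VALUE    */
};

struct expr * expr_create( expr_kind_t kind,
                           struct expr *left, struct expr *right )
{
    struct expr *e = malloc(sizeof(*e));
    e->kind = kind;
    e->left = left;
    e->right = right;
    e->value = 0;
    return e;
}

struct expr * expr_create_value( int value )
{
    struct expr *e = expr_create(EXPR_VALUE,0,0);
    e->value = value;
    return e;
}

With these in place, the following code builds the AST for (10+20) * 30: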
struct expr *e =
expr_create(EXPR_MULTIPLY,
expr_create(EXPR_ADD,
expr_create_value(10),
expr_create_value(20)
),
expr_create_value(30)
);
[Figure: The resulting AST: a MULTIPLY node whose children are an ADD node (with children VALUE:10 and VALUE:20) and VALUE:30.]
%{
#include "expr.h"
#define YYSTYPE struct expr *
struct expr * parser_result = 0;
%}
factor
: TOKEN_MINUS factor
{ $$ = expr_create(EXPR_SUBTRACT,
expr_create_value(0),$2); }
| TOKEN_LPAREN expr TOKEN_RPAREN
{ $$ = $2; }
| TOKEN_INT
{ $$ = expr_create_value(atoi(yytext)); }
;
[Figure: How the parser handles unary minus: the input -20 is matched by the factor rule for TOKEN_MINUS, producing a SUB node with children VALUE:0 and VALUE:20.]
[Figure: Two distinct ASTs: (5+6)/7 places a DIV node at the root, while 5+(6/7) places an ADD node at the root.]
int expr_evaluate( struct expr *e )
{
    if(!e) return 0;

    int l = expr_evaluate(e->left);
int r = expr_evaluate(e->right);
switch(e->kind) {
case EXPR_VALUE: return e->value;
case EXPR_ADD: return l+r;
case EXPR_SUBTRACT: return l-r;
case EXPR_MULTIPLY: return l*r;
case EXPR_DIVIDE:
if(r==0) {
printf("runtime error: divide by zero\n");
exit(1);
}
return l/r;
}
return 0;
}
Now that we have constructed the AST, we can use it as the basis for
computation and many other operations.
The AST can be evaluated arithmetically by calling expr_evaluate,
shown in Figure 5.7. This function performs a post-order traversal of the
tree by invoking itself recursively on the left and right pointers of the node.
(Note the check for a null pointer at the beginning of the function.) Those
calls return l and r which contain the integer result of the left and right
subtrees. Then, the result of this node is computed by switching on the
kind of the current node. (Note also that we must check for division-by-
zero explicitly, otherwise expr_evaluate would crash when r is zero.)
In a similar fashion, expr_print displays the AST by an in-order traversal,
wrapping each sub-expression in parentheses:
void expr_print( struct expr *e )
{
if(!e) return;

printf("(");
expr_print(e->left);
switch(e->kind) {
case EXPR_VALUE: printf("%d",e->value); break;
case EXPR_ADD: printf("+"); break;
case EXPR_SUBTRACT: printf("-"); break;
case EXPR_MULTIPLY: printf("*"); break;
case EXPR_DIVIDE: printf("/"); break;
}
expr_print(e->right);
printf(")");
}
5.5 Exercises
4. Extend the parser and interpreter to allow for invoking several built-
in mathematical functions such as sin(x), sqrt(x) and so forth.
Before coding, think a little bit about where to put the names of the
functions. Should they be keywords in the language? Or should
any function names be simply treated as identifiers and checked in
expr_evaluate()?
5. Extend the parser and interpreter to allow for variable assignment
and use, so that you can write multiple assignment statements fol-
lowed by a single expression to be evaluated, like this:
g = 9.8;
t = 5;
g*t*t - 7*t + 10;
As its name suggests, YACC was not the first compiler construction tool,
but it remains widely used today and has led to a proliferation of simi-
lar tools written in various languages and addressing different classes of
grammars. Here is just a small selection:
6 The Abstract Syntax Tree
6.1 Overview
The abstract syntax tree (AST) is an important internal data structure that
represents the primary structure of a program. The AST is the starting
point for semantic analysis of a program. It is “abstract” in the sense that
the structure leaves out the particular details of parsing: the AST does
not care whether a language has prefix, postfix, or infix expressions. (In
fact, the AST we describe here can be used to represent most procedural
languages.)
For our project compiler, we will define an AST in terms of five C
structures representing declarations, statements, expressions, types, and
parameters. While you have certainly encountered each of these terms
while learning programming, they are not always used precisely in prac-
tice. This chapter will help you to sort those items out very clearly:
• A declaration states the name, type, and value of a symbol so that
it can be used in the program. Symbols include items such as con-
stants, variables, and functions.
6.2 Declarations
In B-Minor, a declaration gives a name, a type, and an optional value:
b: boolean;
s: string = "hello";
f: function integer ( x: integer ) = { return x*x; }
Each declaration is represented in the AST by a struct decl:
struct decl {
char *name;
struct type *type;
struct expr *value;
struct stmt *code;
struct decl *next;
};
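A factory function allocates and initializes one node; a minimal sketch,
assuming the fields above:

struct decl * decl_create( char *name, struct type *type,
                           struct expr *value, struct stmt *code,
                           struct decl *next )
{
    struct decl *d = malloc(sizeof(*d));
    d->name = name;
    d->type = type;
    d->value = value;
    d->code = code;
    d->next = next;
    return d;
}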
(You will need to write similar code for statements, expressions, etc,
but we won’t keep repeating it here.)
[Figure: the three declarations above as a linked list of struct decl
nodes. The first has name b and a boolean type; the second has name s,
a string type, and the value "hello"; the third has name f, a function
type, and code return x*x;. Each node's next field points to the
following declaration.]
Note that some of the fields point to nothing: these would be repre-
sented by a null pointer, which we omit for clarity. Also, our picture is
incomplete and must be expanded: the items representing types, expres-
sions, and statements are all complex structures themselves that we must
describe.
6.3 Statements
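Each statement is represented by a struct stmt, tagged with a kind
field. A minimal sketch, assuming the fields used in the descriptions and
figures below (STMT_PRINT and STMT_RETURN follow the same pattern):

typedef enum {
    STMT_DECL,
    STMT_EXPR,
    STMT_IF_ELSE,
    STMT_FOR,
    STMT_PRINT,
    STMT_RETURN,
    STMT_BLOCK
} stmt_t;

struct stmt {
    stmt_t kind;
    struct decl *decl;
    struct expr *init_expr;
    struct expr *expr;
    struct expr *next_expr;
    struct stmt *body;
    struct stmt *else_body;
    struct stmt *next;
};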
• STMT_DECL indicates a (local) declaration, and the decl field will
point to it.
• STMT_EXPR indicates an expression statement, and the expr field
will point to it.
• STMT_IF_ELSE indicates an if-else statement, such that the expr
field will point to the control expression, the body field to the state-
ments executed if it is true, and the else_body field to the state-
ments executed if it is false.
• STMT_FOR indicates a for-loop, such that init_expr, expr, and
next_expr are the three expressions in the loop header, and body
points to the statements in the loop.
This structure has a lot of fields, but each one serves a purpose and is
used when necessary for a particular kind of statement. For example, an
if-else statement only uses the expr, body, and else_body fields, leaving
the rest null:
[Figure: a STMT_IF_ELSE node with only expr, body, and else_body
set; the decl, init_expr, next_expr, and next fields are null.]
A for-loop uses the three expr fields to represent the three parts of the
loop control, and the body field to represent the code being executed:
for(i=0;i<10;i++) print i;
[Figure: a STMT_FOR node whose init_expr, expr, and next_expr
fields hold i=0, i<10, and i++, and whose body points to the print
statement.]
6.4 Expressions
Expressions are implemented much like the simple expression AST shown
in Chapter 5. The difference is that we need many more kinds of nodes:
one for every binary operator in the language, including arithmetic, log-
ical, comparison, and assignment, and one for every kind of leaf value,
including variable names and constant values. The name field will be set
for EXPR_NAME, the integer_value field for EXPR_INTEGER_LITERAL,
and so on. You may need to add values and types to this structure as you
expand your compiler.
Note that you can store the integer, boolean, and character literal val-
ues all in the integer_value field.
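A sketch of the expression node, assuming one enum entry per operator
and per kind of leaf value:

typedef enum {
    EXPR_ADD,
    EXPR_SUBTRACT,
    EXPR_MULTIPLY,
    EXPR_DIVIDE,
    /* ... one kind per operator: logical, comparison, assignment ... */
    EXPR_NAME,
    EXPR_INTEGER_LITERAL,
    EXPR_STRING_LITERAL
} expr_t;

struct expr {
    expr_t kind;
    struct expr *left;
    struct expr *right;
    const char *name;
    int integer_value;
    const char *string_literal;
};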
[Figure: the AST for !x: an EXPR_NOT node whose left child is
EXPR_NAME x. The AST for f(a,b,c): an EXPR_CALL node whose left
child is EXPR_NAME f and whose right child is a chain of EXPR_ARG
nodes, one per argument; each EXPR_ARG has the argument expression
(a, b, then c) on the left and the next EXPR_ARG on the right.]
Array subscripting is treated like a binary operator, such that the name
of the array is on the left side of the EXPR_SUBSCRIPT operator, and an
integer expression is on the right:
[Figure: the AST for a[b]: an EXPR_SUBSCRIPT node with EXPR_NAME a
on the left and EXPR_NAME b on the right.]
6.5 Types
A type structure encodes the type of every variable and function men-
tioned in a declaration. Primitive types like integer and boolean are
expressed by simply setting the kind field appropriately, and leaving the
other fields null. Compound types like array and function are built by
connecting multiple type structures together.
typedef enum {
TYPE_VOID,
TYPE_BOOLEAN,
TYPE_CHARACTER,
TYPE_INTEGER,
TYPE_STRING,
TYPE_ARRAY,
TYPE_FUNCTION
} type_t;
struct type {
type_t kind;
struct type *subtype;
struct param_list *params;
};
struct param_list {
char *name;
struct type *type;
struct param_list *next;
};
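For example, assuming a type_create(kind, subtype, params) factory in
the same style as the other creator functions, the type array [] integer
would be built like this:

struct type *t =
    type_create(TYPE_ARRAY,
                type_create(TYPE_INTEGER,0,0),
                0);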
[Figure: type structures. boolean is a single TYPE_BOOLEAN node;
integer is a single TYPE_INTEGER node. array [] integer is a
TYPE_ARRAY node whose subtype is TYPE_INTEGER. array [] array
[] integer is a TYPE_ARRAY node whose subtype is another
TYPE_ARRAY, whose subtype in turn is TYPE_INTEGER.]
[Figure: a function type: a TYPE_FUNCTION node whose subtype is the
return type (here TYPE_CHAR) and whose params field points to the
parameter list.]
Note that the type structures here let us express some complicated and
powerful higher order concepts of programming. By simply swapping in
complex types, you can describe an array of ten functions, each returning
an integer:
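A sketch of that type as nested structures (note that the struct type
shown above has no field for the array size, so the 10 would need to be
recorded elsewhere, such as in the declaration):

struct type *t =
    type_create(TYPE_ARRAY,
                type_create(TYPE_FUNCTION,
                            type_create(TYPE_INTEGER,0,0),
                            0),
                0);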
6.6 Putting It All Together
Now that you have seen each individual component, let’s see how a com-
plete B-Minor function would be expressed as an AST:
DECL "compute"
type value code next
TYPE_FUNCTION STMT_DECL
subtype params decl init_expr expr next_expr body else_body next
TYPE_INTEGER EXPR_INTEGER
subtype params 0
STMT_FOR
decl init_expr expr next_expr body else_body next
STMT_BLOCK
decl init_expr expr next_expr body else_body next
STMT_EXPR
decl init_expr expr next_expr body else_body next
EXPR_ASSIGN STMT_RETURN
left right decl init_expr expr next_expr body else_body next
EXPR_NAME EXPR_NAME
total i
For example, the AST for a function square, whose body returns x*x,
could be constructed manually like this:
d = decl_create(
    "square",
    type_create(TYPE_FUNCTION,
                type_create(TYPE_INTEGER,0,0),
                param_list_create(
                    "x",
                    type_create(TYPE_INTEGER,0,0),
                    0)),
    0,
    stmt_create(STMT_RETURN,0,0,
                expr_create(EXPR_MUL,
                            expr_create_name("x"),
                            expr_create_name("x")),
                0,0,0,0),
    0);
6.7 Building the AST
To build the AST for a complete program, add an action to each rule of
the Bison grammar that calls the appropriate creator function. At the
topmost rule, store the final result in the global parser_result:
program : decl_list
          { parser_result = $1; }
        ;
6.8 Exercises
4. Add the AST generator functions as action rules to your Bison gram-
mar, so that you can parse complete programs, and print them back
out again.
5. Add new functions decl_translate(), stmt_translate(), etc.
that output the B-Minor AST in a different language of your own
choosing, such as Python or Java or Rust.
6. Add new functions that emit the AST in a graphical form so you can
“see” the structure of a program. One approach would be to use the
Graphviz DOT format: let each declaration, statement, etc be a node
in a graph, and then let each pointer between structures be an edge
in the graph.
7 Semantic Analysis
7.1 Overview of Type Systems
A type system can be classified along several dimensions:
• safe or unsafe
• static or dynamic
• explicit or implicit
For example, C is an unsafe language: this code writes past the end of
the ten-element array a, and the language does not require the error to
be caught:
/* This is C code */
int i;
int a[10];
for(i=0;i<100;i++) a[i] = i;
A dynamic type system tracks the type of each value at runtime, which
permits code that can explicitly examine the type of a variable; for exam-
ple, the instanceof operator in Java allows one to test for types explicitly.
In contrast, C is implicitly and unsafely typed in places. This assign-
ment silently truncates the floating point value to an integer:
/* This is C code */
int x = 32.5;
And this assignment between incompatible pointer types is accepted by
many C compilers with, at most, a warning:
/* This is C code */
int *i;
float *f = i;
7.2 Designing a Type System
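For example, C++ can infer the type of a variable from its initializer by
way of the auto keyword:

/* This is C++ code */
auto x = 32.5;
cout << x << endl;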
The compiler determines that 32.5 has type double, and therefore x
must also have type double. In a similar way, the output operator << is
defined to have a certain behavior on integers, another behavior on strings,
and so forth. In this case, the compiler already determined that the type of
x is double and so it chooses the variant of << that operates on doubles.
Some languages go further and allow new scalar types that restrict an
existing type. In Pascal, for example, one can define types that accept
only a limited range of integers:
(* This is Pascal code *)
type Month = 1..12;
     Day   = 1..31;
This is useful because variables and functions dealing with days and
months are now kept separate, preventing you from accidentally assigning
one to another, or from giving the value 13 to a variable of type Month.
C has a similar feature, but it is much weaker: typedef declares a new
name for a type, but doesn’t have any means of restricting the range, and
doesn’t prevent you from making assignments between types that share
the same base type:
/* This is C code */
typedef int Month;
typedef int Day;
/* Assigning m to d is allowed in C,
because they are both integers. */
Month m = 10;
Day d = m;
Most languages also provide compound types that combine several
values into one, like this structure in Go:
/* This is Go code */
type coordinates struct {
latitude float64
longitude float64
}
Less frequently used are union types, in which multiple fields oc-
cupy the same memory. For example, in C, you can declare a union type
number that contains an overlapping float and integer:
/* This is C code */
union number {
int i;
float f;
};
union number n;
n.i = 10;
n.f = 3.14;
In this case, n.i and n.f occupy the same memory. If you assign 10
to n.i and read it back, you will see 10 as expected. However, if you
assign 10 to n.i and read back n.f, you will likely observe a garbage
value, depending on how exactly the two values are mapped into memory.
Union types are occasionally handy when implementing operating system
features such as device drivers, because hardware interfaces often re-use
the same memory locations for multiple purposes.
• Perform a bitwise copy. If the two variables have the same under-
lying storage size, the unlike assignment could be accomplished by
just copying the bits in one variable to the location of the other. This
is usually a bad idea, since there is no guarantee that one data type
has any meaning in the other context. But it does happen in a few
select cases, such as when assigning different pointer types in C.
7.3 The B-Minor Type System
The B-Minor type system is safe, static, and explicit. As a result, it is fairly
compact to describe and straightforward to implement, and it will elimi-
nate a large number of programming errors. However, it may be more
strict than other languages you are used to, so there are going to be a large
number of errors that we must detect.
B-Minor has four atomic types (integer, boolean, char, and string)
along with array and function types built from them. The typechecking
rules include the following:
• All binary operators must have the same type on the left and right
hand sides.
• The comparison operators < <= >= > may only be applied to integer
values and always return boolean.
• The boolean operators ! && || may only be applied to boolean
values and always return boolean.
7.4 The Symbol Table
The symbol table records all of the information that we need to know
about every declared variable (and other named items, like functions) in
the program. Each entry in the table is a struct symbol which is shown
in Figure 7.1.
The kind field indicates whether the symbol is a local variable, a global
variable, or a function parameter. The type field points to a type structure
indicating the type of the variable. The name field gives the name (obvi-
ously), and the which field gives the ordinal position of local variables
and parameters. (More on that later.)
As with all the other data structures we have created so far, we must
have a factory function like this:
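/* a minimal sketch, assuming the fields of struct symbol
   shown in Figure 7.1 */
struct symbol * symbol_create( symbol_t kind,
                               struct type *type, char *name )
{
    struct symbol *s = malloc(sizeof(*s));
    s->kind = kind;
    s->type = type;
    s->name = name;
    return s;
}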
Conceptually, the symbol table is just a map between the name of each
variable and the symbol structure that describes it.
However, it’s not quite that simple, because most programming lan-
guages allow the same variable name to be used multiple times, as long
as each definition is in a distinct scope. In C-like languages (including B-
Minor ) there is a global scope, a scope for function parameters and local
variables, and then nested scopes everywhere curly braces appear.
For example, the following B-Minor program defines the symbol x
three times, each with a different type and storage class. When run, the
program should print 10 hello false.
x: integer = 10;
[Figure: the symbol table as a stack of scope tables. The innermost scope
table (at the stack top) maps x to a symbol LOCAL(0) of type boolean.
The function scope below maps x to PARAM(0) of type string. The
global scope at the bottom maps x to a GLOBAL of type integer, f to a
GLOBAL function, and main to a GLOBAL function.]
Scopes are entered and left as the AST is traversed, using functions like
these:
void scope_enter();
void scope_exit();
int scope_level();
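The name resolution code that follows also assumes functions for binding
a name in the current scope and for looking a name up through all enclos-
ing scopes; their prototypes would look something like this:

void scope_bind( const char *name, struct symbol *sym );
struct symbol * scope_lookup( const char *name );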
7.5 Name Resolution
With the symbol table in place, we are now ready to match each use of a
variable name to its matching definition. This process is known as name
resolution. To implement name resolution, we will write a resolve method
for each of the structures in the AST, including decl_resolve(),
stmt_resolve(), and so forth.
Collectively, these methods must iterate over the entire AST, looking
for variable declarations and uses. Wherever a variable is declared, it must
be entered into the symbol table and the symbol structure linked into the
AST. Wherever a variable is used, it must be looked up in the symbol table,
and the symbol structure linked into the AST. Of course, if a symbol is
declared twice in the same scope, or used without declaration, then an
appropriate error message must be emitted.
We will begin with declarations, as shown in Figure 7.4. Each struct decl
represents a variable declaration of some kind, so decl_resolve will cre-
ate a new symbol and then bind it to the name of the declaration in the
current scope. If the declaration has an initializing expression (d->value
is not null) then that expression should be resolved. If the declaration
represents a function (d->code is not null) then we must create a new
scope and resolve the parameters and the code.
Figure 7.4 gives some sample code for resolving declarations. As al-
ways in this book, consider this starter code in order to give you the basic
idea. You will have to make some changes in order to accommodate all
the features of the language, handle errors cleanly, and so forth.
In a similar fashion, we must write resolve methods for each structure
in the AST. stmt_resolve() (not shown) must simply call the appropri-
ate resolve on each of its sub-components. In the case of a STMT_BLOCK,
it must also enter and leave a new scope. param_list_resolve() (also
not shown) must enter a new variable declaration for each parameter of a
function, so that those definitions are available to the code of a function.
To perform name resolution on the entire AST, you may simply invoke
decl_resolve() once on the root node of the AST. This function will
traverse the entire tree by calling the necessary sub-functions.
void decl_resolve( struct decl *d )
{
if(!d) return;

/* assuming a symbol_t enum of SYMBOL_GLOBAL, SYMBOL_LOCAL,
   and SYMBOL_PARAM; the global scope is level 1 */
symbol_t kind = scope_level() > 1 ?
    SYMBOL_LOCAL : SYMBOL_GLOBAL;

d->symbol = symbol_create(kind,d->type,d->name);
expr_resolve(d->value);
scope_bind(d->name,d->symbol);
if(d->code) {
scope_enter();
param_list_resolve(d->type->params);
stmt_resolve(d->code);
scope_exit();
}
decl_resolve(d->next);
}
void expr_resolve( struct expr *e )
{
if(!e) return;

if( e->kind==EXPR_NAME ) {
e->symbol = scope_lookup(e->name);
} else {
expr_resolve( e->left );
expr_resolve( e->right );
}
}
7.6 Implementing Type Checking
Type checking works from the bottom of the expression tree upward. A
leaf node, such as a variable name, links to the symbol structure, which
contains the type. This type is copied and returned to the parent node.
For interior nodes of the expression tree, we must compare the type
of the left and right subtrees, and determine if they are compatible with
the rules indicated in Section 7.3. If not, we emit an error message and
increment a global error counter. Either way, we return the appropriate
type for the operator. The types of the left and right branches are no longer
needed and can be deleted before returning.
Here is the basic code structure:
struct type * expr_typecheck( struct expr *e )
{
if(!e) return 0;

struct type *lt = expr_typecheck(e->left);
struct type *rt = expr_typecheck(e->right);

struct type *result;

switch(e->kind) {
    case EXPR_INTEGER_LITERAL:
        result = type_create(TYPE_INTEGER,0,0);
        break;
    case EXPR_STRING_LITERAL:
        result = type_create(TYPE_STRING,0,0);
        break;
    /* ... and so on for each kind of expression ... */
}

type_delete(lt);
type_delete(rt);

return result;
}
Let’s consider the cases for a few operators in detail. Arithmetic oper-
ators can only be applied to integers, and always return an integer type:
case EXPR_ADD:
if( lt->kind != TYPE_INTEGER || rt->kind!=TYPE_INTEGER ) {
/* display an error */
}
result = type_create(TYPE_INTEGER,0,0);
break;
The equality operators can be applied to two values of any matching
type, except for compound types:
case EXPR_EQ:
case EXPR_NE:
if(!type_equals(lt,rt)) {
/* display an error */
}
if(lt->kind==TYPE_VOID ||
lt->kind==TYPE_ARRAY ||
lt->kind==TYPE_FUNCTION) {
/* display an error */
}
result = type_create(TYPE_BOOLEAN,0,0);
break;
Array subscripting (here EXPR_DEREF) should yield the subtype of the
array:
case EXPR_DEREF:
if(lt->kind==TYPE_ARRAY) {
if(rt->kind!=TYPE_INTEGER) {
/* error: index not an integer */
}
result = type_copy(lt->subtype);
} else {
/* error: not an array */
/* but we need to return a valid type */
result = type_copy(lt);
}
break;
In a similar fashion, each of the other typecheck methods must compute
the type of expressions, and then check them against declarations and
other constraints as needed.
For example, decl_typecheck simply confirms that variable declara-
tions match their initializers, and otherwise typechecks the body of func-
tion declarations:
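A sketch of decl_typecheck(), assuming a stmt_typecheck() written
in the same style:

void decl_typecheck( struct decl *d )
{
    if(!d) return;

    if(d->value) {
        struct type *t = expr_typecheck(d->value);
        if(!type_equals(t,d->type)) {
            /* display an error */
        }
        type_delete(t);
    }

    if(d->code) {
        stmt_typecheck(d->code);
    }

    decl_typecheck(d->next);
}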
7.7 Error Messages
Clear error messages require a little care. Suppose that the user writes
this erroneous program:
s: string = "hello";
b: boolean = false;
i: integer = s + (b<5);
A simple compiler might report nothing more than a generic type error,
but your project compiler can very easily produce much more detailed
error messages that name the offending constructs. It's just a matter of
taking some care in printing out each of the expressions and types in-
volved when a problem is found.
7.8 Exercises
8 Intermediate Representations
8.1 Introduction
8.2 Abstract Syntax Tree
First, we will point out that the AST itself can be a usable IR, if the goal is
simply to emit assembly language without a great deal of optimization or
other transformations. Once typechecking is complete, simple optimiza-
tions like strength reduction and constant folding can be applied to the
AST itself. Then, to generate assembly language, you can simply perform
a post-order traversal of the AST and emit a few assembly instructions
corresponding to each node. 1
1 This is the approach we use in a one-semester course to implement a project compiler,
since there is a limited amount of time to get to the final goal of generating assembly lan-
guage.
typedef enum {
DAG_ASSIGN,
DAG_DEREF,
DAG_IADD,
DAG_IMUL,
...
DAG_NAME,
DAG_FLOAT_VALUE,
DAG_INTEGER_VALUE
} dag_kind_t;
struct dag_node {
dag_kind_t kind;
struct dag_node *left;
struct dag_node *right;
union {
const char *name;
double float_value;
int integer_value;
} u;
};
8.3 Directed Acyclic Graph
The directed acyclic graph (DAG) is one step simplified from the AST. A
DAG is similar to the AST, except that it can have an arbitrary graph struc-
ture, and the individual nodes are greatly simplified, so that there is little
or no auxiliary information beyond the type and value of each node. This
requires that we have a greater number of node types, each one explicit
about its purpose. For example, Figure 8.1 shows a definition of a DAG
data structure that would be compatible with our project compiler.
[Figure: two DAGs for x=(a+10)*(a+10). In the integer case, both
children of the MUL node are one shared ADD(a,10) subtree. In the
floating-point case, the operators become FADD and FMUL, with an ITOF
node converting the constant 10 to floating point.]
[Figure: two DAGs for x=a[i]. At a high level: ASSIGN(x,
LOOKUP(a,i)). Lowered to address arithmetic with 4-byte elements:
ASSIGN(x, DEREF(IADD(a, IMUL(i,4)))).]
[Figure: the same DAG with a and i as local variables accessed relative to
the frame pointer: ASSIGN(x, DEREF(IADD(DEREF(IADD(FP,16)),
IMUL(DEREF(IADD(FP,20)),4)))).]
Obviously, searching the table for equivalent nodes every time a new
node gets added makes construction quadratic in the number of nodes.
However, the absolute sizes stay relatively small, as long as each individ-
ual expression has its own DAG.
ASSIGN(x,DEREF(IADD(DEREF(IADD(FP,16)),
                    IMUL(DEREF(IADD(FP,20)),4))))
Clearly, this sort of code would not be easy for a human to read and
write manually, but it is trivial to print and trivial to parse, making it easy
to pass between compiler stages for analysis and optimization.
Now, what sort of optimizations might you do with the DAG? One
easy optimization is constant folding. This is the process of reducing an
expression consisting of only constants into a single value. 2 This capabil-
ity is handy, because the programmer may wish to leave some expressions
in an explicit form for the sake of readability or maintainability while still
having the performance advantage of a single constant in the executable
code.
ConstantFold( DagNode n ):
    If n is a leaf:
        return n;
    Else:
        n.left = ConstantFold(n.left);
        n.right = ConstantFold(n.right);
        If n.left and n.right are both constants:
            compute the result of n.operator on the two values;
            delete n.left and n.right;
            replace n with a leaf node containing the result;
        return n;
2 Constant folding is a narrow example of the more general technique of partial execution
in which some parts of the program are executed at compile time, while the rest is left for
runtime.
[Figure: constant folding applied to secs = days*24*60*60. The
constant subexpressions are folded step by step (60*60 becomes 3600,
then 3600*24 becomes 86400) until the assignment becomes
secs = days*86400.]
8.4 Control Flow Graph
A control flow graph makes the order of execution explicit: each node
contains straight-line statements, and directed edges indicate the possible
transfers of control, with each conditional having a true edge and a false
edge. For example, a loop that tests if(i<10) and branches on
if(i%2==0) inside the body has this graph:
[Figure: the control flow graph. An initial node i=0; leads to the test
if(i<10). Its true edge leads to if(i%2==0), whose true and false
edges execute different statements before rejoining at print "\n"; and
i++;, which loops back to if(i<10). The false edge of if(i<10) leads
to return;.]
Note that the control flow graph has a different structure than the AST.
The AST for a for-loop would have each of the control expressions as an
immediate child of the loop node, whereas the control flow graph places
each one in the order it would be executed in practice. Likewise, the if
statement has edges from each branch of the conditional to the following
node, so that one can easily trace the flow of execution from one compo-
nent to the next.
8.5 Static Single Assignment Form
In static single assignment (SSA) form, every variable is assigned exactly
once, with each assignment creating a new numbered version of the vari-
able. For example, this code:
int x = 1;
int a = x;
int b = a + 10;
x = 20 * b;
x = x + 30;
Becomes this:
int x_1 = 1;
int a_1 = x_1;
int b_1 = a_1 + 10;
x_2 = 20 * b_1;
x_3 = x_2 + 30;
Where control flow branches and later rejoins, the competing versions
of a variable must be merged by a special phi function, which selects
whichever version was actually assigned. This code:
if(y<10) {
x=a;
} else {
x=b;
}
Becomes this:
if(y_1<10) {
x_2=a;
} else {
x_3=b;
}
x_4 = phi(x_2,x_3);
8.6 Linear IR
A linear IR is an ordered sequence of abstract instructions, resembling
a generalized assembly language. Because the order is explicit, the com-
piler may reorder instructions to improve performance, as long as the val-
ues an instruction reads are not moved above their definitions, and the
values it writes are not moved below their uses. Moving instructions can
reduce the number of physical registers needed in code generation, as
well as reduce execution time in a pipelined architecture.
8.7 Stack Machine IR
A stack machine IR expresses a computation with an implicit operand
stack: operands are pushed, and each operator pops its arguments and
pushes its result. For example, the floating-point form of
x=(a+10)*(a+10) could be expressed like this, where COPY duplicates
the value on top of the stack:
PUSH a
PUSH 10
ITOF
FADD
COPY
FMUL
POP x
8.8 Examples
For example, GCC's internal GIMPLE representation reduces a function
computing y = a*x*x + b*x + 100; to a sequence of simple assignments
to compiler-generated temporaries:
D.1597 = (float) a;
D.1598 = D.1597 * x;
D.1599 = D.1598 * x;
D.1600 = (float) b;
D.1601 = D.1600 * x;
D.1602 = D.1599 + D.1601;
y = D.1602 + 1.0e+2;
D.1603 = y;
return D.1603;
}
The Java Virtual Machine is a stack machine; compiled to JVM bytecode,
the same function looks like this:
0: iload 1
1: i2f
2: fload 3
4: fmul
5: fload 3
7: fmul
8: iload 2
9: i2f
10: fload 3
12: fmul
13: fadd
14: ldc #2
16: fadd
17: fstore 4
19: fload 4
21: freturn
8.9 Exercises
1. Add a step to your compiler to convert the AST into a DAG by per-
forming a post-order traversal and creating one or more DAG nodes
corresponding to each AST node.
2. Write the code to export a DAG in a simple external representation as
shown in this chapter. Extend the DAG suitably to represent control
flow structures and function definitions.
3. Write a scanner and parser to read in the DAG external representa-
tion and reconstruct it as a data structure. Think carefully about the
grammar class of the IR, and choose the simplest implementation
that works.
9 Memory Organization
9.1 Introduction
[Figure: from the program's perspective, memory is a single flat range of
addresses running from 0 to 4GB.]
In principle, the CPU is free to use memory in any way it sees fit. Code
and data could be scattered and intermixed across memory in any order
that is convenient. It is even technically possible for a CPU to modify the
memory containing its code while it is running. It goes without saying that
programming in this fashion would be complex, confusing, and difficult
to debug.
9.2 Logical Segmentation
Instead, program memory is commonly laid out by separating it into
logical segments. Each segment is a sequential address range, dedicated
to a particular purpose within the program. The segments are typically
laid out in this order:
[Figure: program memory from address 0 to 4GB, divided into code, data,
heap (growing upward), unused space, and stack (growing downward).]
• The code segment (also known as the text segment) contains the ma-
chine code of the program, corresponding to the bodies of functions
in a C program.
• The data segment contains the global data of the program, corre-
sponding to the global variables in a C program. The data seg-
ment may further be sub-divided into read-write data (variables)
and read-only data (constants).
• The heap segment contains the heap, which is the area of memory
that is managed dynamically at runtime by malloc and free in a
C program, or new and delete in other languages. The top of the
heap is historically known as the break.
• The stack segment contains the stack, which records the current ex-
ecution state of the program as well as the local variables currently
in use.
Typically, the heap grows “up” from lower addresses to higher ad-
dresses, while the stack grows “down” from higher to lower. In between
the two segments is an invalid region of memory that is unused until over-
taken by one segment or the other.
On a simple computer such as an embedded machine or microcon-
troller, logical segments are nothing more than an organizational conven-
tion: nothing stops the program from using memory improperly. If the
heap grows too large, it can run into the stack segment (or vice versa), and
the program will crash (if you are lucky) or suffer silent data corruption
(if you are unlucky).
On a computer with an operating system that employs multiprogram-
ming and memory protection, the situation is better. Each process running
in the OS has its own private memory space, with the illusion of starting
at address zero and extending to a high address. As a result, each process
can access its own memory arbitrarily, but is prevented from accessing or
modifying the memory of other processes. Within its own space, each process lays out its
own code, data, heap, and stack segments.
In some operating systems, when a program is initially loaded into
memory, permissions are set on each range of memory corresponding to
its purpose: memory corresponding to each segment can be set appropri-
[Figure: multiple processes, each with its own virtual address space from
0 to 4GB, and each arranging its own code, data, heap, and stack segments
within that space.]
ately: read-write for data, heap, and stack; read-only for constants; read-
execute for code; and none for the unused area.
The permissions on the logical segments also protect a process from
damaging itself in certain ways. For example, code cannot be modified
at runtime because it is marked read/execute, while items on the heap
cannot be executed, because they are marked read/write. (To be clear, this
only prevents against accidents, not malice; a program can always ask the
operating system to change the permissions on one of its own pages. For
an example, look up the mprotect system call in Unix.)
If a process attempts to access memory in a prohibited way or attempts
to access the unused area, a page fault occurs. This forces a transfer of
control to the OS, which considers the process and the faulting address. If
the access indicates a violation of the logical segmentation of the program,
then the process is forcibly killed with the error message segmentation
fault.1
Initially, a process is given a small amount of memory for the heap
segment, which it manages internally to implement malloc and free. If
this area is exhausted and the program still needs more, it must explicitly
request it from the operating system. On traditional Unix systems, this
is done with the brk system call, which requests that the heap segment
be extended to a new address. If the OS agrees, then it will allocate new
pages at the beginning of the invalid area, effectively extending the heap
segment. If the OS does not agree, then brk will return an error code,
causing malloc to return an error (indicated as a null pointer) which the
program must handle.
The stack has a similar problem, in that it must be able to grow down.
It is not so easy for a program to determine exactly when more stack is
needed, because that happens whenever a new function is invoked, or new
local variables are allocated. Instead, modern operating systems maintain
a guard page at the top of the invalid area, adjacent to the current stack.
When a process attempts to extend the stack into the invalid area, a page
fault occurs, and control is transferred to the OS. If the OS sees that the
faulting address is in the guard page, it can simply allocate more pages
for the stack, set the page permissions appropriately, and move the guard
page down to the new top of the invalid area.
Of course, there are limits on how big the heap and the stack may grow;
1 And now you understand this mysterious Unix term!
9.3 Heap Management
The heap is typically managed as a linked list of chunks, each marked as
either free or in use, using a header like this:
struct chunk {
enum { FREE, USED } state;
int size;
struct chunk *next;
struct chunk *prev;
char data[0];
};
Suppose that the user calls malloc(100) to allocate 100 bytes of mem-
ory. malloc will see that the (single) chunk of memory is free, but much
larger than the requested size. So, it will split it into one small chunk of
100 bytes and one larger chunk with the remainder. This is accomplished
by simply writing a new chunk header into the data area after 100 bytes,
and then connecting the two chunks together into a linked list.
Once the list has been modified, malloc returns the address of the
data element within the chunk, so that the user can access it directly. It
doesn’t return the linked list node itself, because the user doesn’t need to
know about the implementation details. If there is no chunk available that
is large enough to satisfy the current request, then the process must ask
the OS to extend the heap by calling brk.
When the user calls free on a chunk of memory, the state of the chunk
in the linked list is marked FREE, and then merged with adjacent nodes, if
they are also free.
(Incidentally, now you can see why it is dangerous for a program to
modify memory carelessly outside a given memory chunk. Not only could
it affect other chunks, but it could damage the linked list itself, resulting
in wild behavior on the next malloc or free!)
If the program always frees memory in the opposite order that it was
allocated, then the heap will be nicely split into allocated and free memory.
However, that isn’t what happens in practice: memory can be allocated
and freed in any order. Over time, the heap can degenerate into a mix
of oddly sized chunks of allocated and freed memory. This is known as
memory fragmentation.
Excessive fragmentation can result in waste: if there are many small
chunks available, but none of them large enough to satisfy the current
malloc, then the process has no choice but to extend the heap, leaving
the small pieces unused. This increases pressure on total virtual memory
consumption in the operating system.
In a language like C, memory chunks cannot be moved while in use,
and so fragmentation cannot be fixed after it has already occurred. How-
ever, the memory allocator has some limited ability to avoid fragmenta-
tion by choosing the location of new allocations with care. Some simple
strategies are easy to imagine and have been studied extensively:
• Best Fit. On each allocation, search the entire linked list and find the
smallest free chunk that is larger than the request. This tends to leave
large spaces available, but generates tiny leftover free fragments that
are too small to be used.
• Worst Fit. On each allocation, search the entire linked list and find
the largest free chunk that is larger than the request. Somewhat coun-
terintuitively, this method tends to reduce fragmentation by avoid-
ing the creation of tiny unusable fragments.
• First Fit. On each allocation, search the linked list from the begin-
ning, and find the first fragment (large or small) that satisfies the
request. This performs less work than Best Fit or Worst Fit, but per-
forms an increasing amount of work as the linked list increases in
size.
• Next Fit. On each allocation, search the linked list from the last ex-
amined location, and find the next fragment (large or small) that sat-
isfies the request. This minimizes the amount of work done on each
allocation, while distributing allocations throughout the heap.
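As an illustration, here is a sketch of First Fit over the chunk list shown
earlier; split_chunk is a hypothetical helper that writes a new header
after size bytes and links it into the list:

struct chunk * first_fit( struct chunk *head, int size )
{
    struct chunk *c;
    for(c=head;c;c=c->next) {
        if(c->state==FREE && c->size>=size) {
            split_chunk(c,size);  /* hypothetical helper */
            c->state = USED;
            return c;
        }
    }
    return 0;  /* no fit: the heap must be extended with brk */
}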
9.4 Stack Management
The stack is used to record the current state of the running program. Most
CPUs have a specialized register – the stack pointer – which stores the
address where the next item will be pushed or popped. Because the stack
grows down from the top of memory, there is a confusing convention:
pushing an item on the stack causes the stack pointer to move to a lower
numbered address, while popping an item off the stack causes the stack
pointer to move to a higher address. The “top” of the stack is actually at
the lowest address!
Each invocation of a function occupies a range of memory in the stack,
known as a stack frame. The stack frame contains the parameters and
the local variables used by that function. When a function is called, a
new stack frame is pushed; when the function returns, the stack frame
is popped, and execution continues in the caller’s stack frame.
Another specialized register known as the frame pointer (or some-
times base pointer) indicates the beginning of the current frame. Code
within a function relies upon the frame pointer to identify the location of
the current parameters and local variables.
For example, suppose that the main function calls function f, and then
f calls g. If we stop the program in the middle of executing g, the stack
would look like this:
Stack Frame Parameters to main
for main: Old Frame Pointer
Local Variables
Return Address
Stack Frame Parameters to f
for f: Old Frame Pointer
Local Variables
Return Address
Stack Frame Parameters to g ← Frame Pointer
for g: Old Frame Pointer
Local Variables ← Stack Pointer
↓ (stack grows down) ↓
The order and details of the elements in a stack frame differ some-
what between CPU architectures and operating systems. As long as both
the caller and the callee agree on what goes in the stack frame, then any
function may call another, even if they were written in different languages,
or built by different compilers.
The agreement on the contents of the activation record is known as a
calling convention. This is typically written out in a detailed technical
document that is used by the designers of compilers, operating systems,
and libraries to ensure that code is mutually interoperable.
There are two broad categories of calling conventions, with many op-
portunities for variation in between. One is to put the arguments to a
function call on the stack, and the other is to place them in registers.
In the stack calling convention, the caller pushes the arguments in re-
verse order, and the CALL instruction pushes the return address; the called
function f then saves the old frame pointer and allocates its locals:
[Figure: the resulting stack frame, top to bottom: 2nd argument, 1st
argument, return address, old frame pointer (where the frame pointer
points), then local variables, with the stack pointer below and the stack
growing down.]
To access its arguments or local variables, f must load them from mem-
ory relative to the frame pointer. As you can see, the function arguments
are found at fixed positions above the frame pointer, while local variables
are found below the frame pointer. 2
When f begins executing, it still must save the old frame pointer and
make room for local variables. It doesn’t have to load arguments from the
stack; it simply expects the values in %R10 and %R11 and can compute
on them right away. This could confer a significant speed advantage by
avoiding memory accesses.
But, what if f is a complex function that needs to invoke other func-
tions? It will still need to save the current values of the argument registers,
in order to free them up for its own use.
To allow for this possibility, the stack frame for f must leave space for
the arguments, in case they must be saved. The calling convention must
define the location of the arguments, and they are typically stored below
the return address and old frame pointer, like this:
2 The arguments are pushed in reverse order in order to allow the possibility of a variable
number of arguments. Argument 1 is always two words above the frame pointer, argument
2 is always three words above, and so on.
Return Address
Old Frame Pointer ← Frame Pointer
1st Argument (10)
2nd Argument (20)
Local Variables
← Stack Pointer
↓ (stack grows down) ↓
What happens if the function has more arguments than there are reg-
isters set aside for arguments? In this case, the additional arguments are
pushed on to the stack, as in the stack calling convention.
In the big picture, the choice between stack and register calling conven-
tions doesn’t matter much, except that all parties must agree on the details
of the convention. The register calling convention has the slight advan-
tage that a leaf function (a function that does not call other functions) can
compute a simple result without accessing memory. Typically, the regis-
ter calling convention is used on architectures that have a large number of
registers that might otherwise go unused.
It is possible to mix the conventions in a single program, as long as
both caller and callee are informed of the distinction. For example, the
Microsoft X86 compilers allow keywords in function prototypes to select a
convention: cdecl selects the stack calling convention, while fastcall
uses registers for the first two arguments.
9.5 Locating Data
A safer approach is to store the length of the array at the base address
of the array itself. Then, the compiler may generate code that checks the
actual index against the array bounds before generating the address. This
prevents any sort of runtime accident by the programmer. However, the
downside is performance. Every time the programmer mentions a[i],
the resulting code must contain this:
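A sketch of that check, assuming integer elements with the length stored
in the first word of the array:

/* hypothetical code emitted for a[i] */
int length = ((int*)a)[0];
if( i < 0 || i >= length ) {
    printf("runtime error: index out of bounds\n");
    exit(1);
}
/* the element itself is at a + sizeof(int)*(1+i) */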
Structures are laid out by placing each field at consecutive offsets
within the record. Consider this pair of C declarations:
struct card {
int suit;
int rank;
};
struct deck {
int is_shuffled;
struct card cards[52];
};
struct deck d;
d.cards[10].rank = 10;
To generate code for the assignment, the compiler computes the address
of the field from the base of the structure, the offset of the array field, the
size of the array elements, and the offset of the member:
address(d.cards[10].rank) =
    address(d) + offset(cards) + sizeof(struct card)*10
    + offset(rank)
9.6 Program Loading
A classic executable format is Unix a.out, which consists of a compact
header followed by the program's code and data. The header structure
itself is just a few bytes that allow the operating system to interpret the
rest of the file:
• Magic Number
• Text Section Size
• Data Section Size
• BSS Size
• Symbol Table Size
• Entry Point
The magic number is a unique integer that clearly defines the file as an
executable: if the file does not begin with this magic number, the OS will
not even attempt to execute it. Different magic numbers are defined for
executables, unlinked object files, and shared libraries. The text size field
indicates the number of bytes in the text section that follows the header.
The data size field indicates the amount of initialized data that appears in
the file, while the BSS size field indicates the amount of uninitialized data. 4
The uninitialized data need not be stored in the file. Instead it is simply
allocated in memory as part of the data segment when the program is
loaded. The symbol table in the executable lists each of the variable and
function names used in the program along with their locations in the code
and data segment; this permits a debugger to interpret the meaning of
addresses. Finally, the entry point gives the address of the starting point of
the program (typically main) in the text segment. This allows the starting
point to be something other than the first address in the program.
The a.out format is a big improvement over a binary blob, and is still
supported and usable today in many operating systems. However, it isn’t
quite powerful enough to support many of the features needed by modern
languages, particularly dynamically loaded libraries.
The Extensible Linking Format (ELF) is widely used today across many
operating systems to represent executables, object files, and shared libraries.
Like a.out, an ELF file has multiple sections representing code, data, and
bss, but it can have an arbitrary number of additional sections for debug-
ging data, initialization and finalization code, metadata about the tools
used, and so forth. The number of sections in the file outnumbers the seg-
ments in memory, and so a section table in the ELF file indicates how mul-
tiple sections are to be mapped into a single segment.
4 BSS stands for “Block Started by Symbol” and first appeared in an assembler for IBM 704
in the 1950s.
[Figure: the layout of an ELF file: a file header, a program header, a code
section, a data section, a read-only section, any number of additional
sections, and a section header.]
10 Assembly Language
10.1 Introduction
A given assembly language can have multiple dialects for the same CPU,
depending on whether one uses the assembler provided by the chip ven-
dor, or other open source tools. For consistency, we will give examples
in the assembly dialect supported by the GNU compiler and assembler,
which are known as gcc and as (or sometimes gas.)
10.2 Open Source Assembler Tools
A good way to get started is to view the assembler output of the com-
piler for a C program. To do this, run gcc with the -S flag, and the
compiler will produce assembly output rather than a binary program. On
Unix-like systems, assembly code is stored in files ending with .s, which
indicates a "source" file.
If you run gcc -S hello.c -o hello.s on this C program:
#include <stdio.h>

int main( int argc, char *argv[] )
{
    printf("hello %s\n","world");
    return 0;
}

then you will get hello.s, which contains something like this:
.file "test.c"
.data
.LC0:
.string "hello %s\n"
.LC1:
.string "world"
.text
.globl main
main:
PUSHQ %rbp
MOVQ %rsp, %rbp
SUBQ $16, %rsp
MOVQ %rdi, -8(%rbp)
MOVQ %rsi, -16(%rbp)
MOVQ $.LC0, %rax
MOVQ $.LC1, %rsi
MOVQ %rax, %rdi
MOVQ $0, %rax
CALL printf
MOVQ $0, %rax
LEAVE
RET
(There are many valid ways to compile hello.c and so the output of
your compiler may be somewhat different.)
Regardless of the CPU architecture, the assembly code has three differ-
ent kinds of elements:
Directives begin with a dot and indicate structural information use-
ful to the assembler, linker, or debugger, but are not in and of themselves
assembly instructions. For example, .file simply records the name of
the original source file to assist the debugger. .data indicates the start of
the data segment of the program, while .text indicates the start of the
program segment. .string indicates a string constant within the data
section, and .globl main indicates that the label main is a global sym-
bol that can be accessed by other code modules.
Labels end with a colon and indicate by their position the association
between names and locations. For example, the label .LC0: indicates that
the immediately following string should be called .LC0. The label main:
indicates that the instruction PUSHQ %rbp is the first instruction of the
main function. By convention, labels beginning with a dot are temporary
local labels generated by the compiler, while other symbols are user-visible
functions and global variables. The labels do not become part of the result-
ing machine code per se, but they are present in the resulting object code
for the purposes of linking, and in the eventual executable, for purposes
of debugging.
Instructions are the actual assembly code like (PUSHQ %rbp), typically
indented to visually distinguish them from directives and labels. Instruc-
tions in GNU assembly are not case sensitive, but we will generally upper-
case them, for consistency.
To take this hello.s and turn it into a runnable program, just run
gcc, which will figure out that it is an assembly program, assemble it, and
link it with the standard library:
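For example (the exact command and output may vary by system):

% gcc hello.s -o hello
% ./hello
hello world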
It is also interesting to compile the assembly code into object code, and
then use the nm utility to display the symbols (”names”) present in the
code:
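The output will look something like this:

% gcc -c hello.s -o hello.o
% nm hello.o
0000000000000000 T main
                 U printf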
Note that main is listed as a global symbol defined in the text section (T),
while printf is undefined (U), since it must be obtained from the stan-
dard library. But none of the labels like .LC0 appear, because they were
not declared as .globl.
As you are learning assembly language, take advantage of an existing
compiler: write some simple functions to see what gcc generates. This can
give you some starting points to identify new instructions and techniques
to use.
10.3 X86 Assembly Language
This book uses the AT&T syntax of the GNU assembler, in which registers
are prefixed with percent signs and the destination operand comes last, as
in MOVQ %rsp, %rbp. In other places (such as the Intel manual), you will see the Intel syntax,
which (among other things) dispenses with the percent signs and reverses
the order of the arguments. For example, this is the same instruction in the
Intel syntax:
MOV RBP, RSP
When reading manuals and web pages, be careful to determine whether
you are looking at AT&T or Intel syntax: look for the percent signs!
The X86-64 architecture provides sixteen general-purpose 64-bit integer
registers. The names of the lower eight registers indicate the purpose for
which each was originally intended: for example, %rax is the accumulator.
As the design developed, new instructions and addressing modes were
added to make the various registers almost equal. A few remaining in-
structions, particularly related to string processing, require the use of %rsi
and %rdi. In addition, two registers are reserved for use as the stack
pointer (%rsp) and the base pointer (%rbp). The final eight registers are
numbered and have no specific restrictions.
The architecture has expanded from 8 to 64 bits over the years, and so
each register has some internal structure. The lowest 8 bits of the %rax
register are an 8-bit register %al, and the next 8 bits are known as %ah.
The low 16 bits are collectively known as %ax, the low 32-bits as %eax,
and the whole 64 bits as %rax.
8 bits:  %al (low byte), %ah (next byte)
16 bits: %ax
32 bits: %eax
64 bits: %rax
The numbered registers %r8-%r15 have the same structure, but a slightly
different naming scheme: %r8b is the low byte, %r8w the low 16 bits,
%r8d the low 32 bits, and %r8 the full 64 bits.
To keep things simple, we will focus our attention on the 64-bit regis-
ters. However, most production compilers use a mix of modes: a byte can
represent a boolean; a longword is usually sufficient for integer arithmetic,
since most programs don't need integer values above 2^32; and a quadword
is needed to represent a memory address, enabling up to 16EB (exa-bytes)
of virtual memory.
MOVB moves a byte (8 bits), MOVW moves a word (16 bits), MOVL moves
a long (32 bits), and MOVQ moves a quad-word (64 bits). Generally, the
size of the locations you are moving
to and from must match the suffix. In some cases, you can leave off the
suffix, and the assembler will infer the right size. However, this can have
unexpected consequences, so we will make a habit of using the suffix.
The arguments to MOV can have one of several addressing modes.
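A representative set of modes (a sketch, not an exhaustive list):

Mode            Example
Global Symbol   MOVQ x, %rax
Immediate       MOVQ $56, %rax
Register        MOVQ %rbx, %rax
Indirect        MOVQ (%rsp), %rax
Base-Relative   MOVQ -8(%rbp), %rax
Complex         MOVQ -16(%rbx,%rcx,8), %rax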
For the most part, the same addressing modes may be used to store
data into registers and memory locations. There are some exceptions. For
example, it is not possible to use base-relative for both arguments of MOV:
MOVQ -8(%rbx), -8(%rbx). To see exactly what combinations of ad-
dressing modes are supported, you must read the manual pages for the
instruction in question.
In some cases, you may want to load the address of a variable instead of
its value. This is handy when working with strings or arrays. For this pur-
pose, use the LEA (load effective address) instruction, which can perform
the same address computations as MOV:
Mode Example
Global Symbol LEAQ x, %rax
Base-Relative LEAQ -8(%rbp), %rax
Complex LEAQ -16(%rbx,%rcx,8), %rax
Basic arithmetic instructions are two-operand and destructive: for exam-
ple, ADDQ %rbx, %rax
adds %rbx to %rax, and places the result in %rax, overwriting what might
have been there before. This requires a little care, so that you don’t acci-
dentally clobber a value that you might want to use later. For example,
you could translate c = a+b+b; like this:
MOVQ a, %rax
MOVQ b, %rbx
ADDQ %rbx, %rax
ADDQ %rbx, %rax
MOVQ %rax, c
The one-operand IMULQ instruction implicitly multiplies its argument
by %rax, placing the low 64 bits of the result in %rax and the high 64 bits
in %rdx. For example, c = (a+b)*b; could be translated like this:
MOVQ a, %rax
MOVQ b, %rbx
ADDQ %rbx, %rax
IMULQ %rbx
MOVQ %rax, c
The IDIV instruction does the same thing, except backwards: it starts
with a 128 bit integer value whose low 64 bits are in %rax and high 64 bits
in %rdx, and divides it by the value given in the instruction. The quotient
is placed in %rax and the remainder in %rdx. (If you want to implement
the modulus instruction instead of division, just use the value of %rdx.)
To set up a division, you must make sure that both registers have the
necessary sign-extended value. If the dividend fits in the lower 64 bits, but
is negative, then the upper 64 bits must all be ones to complete the twos-
complement representation. The CQO instruction serves the very specific
purpose of sign-extending %rax into %rdx for division.
For example, to divide a by five:
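A sketch (note that IDIV takes a register or memory operand, not an
immediate, so the divisor must be loaded first):

MOVQ a, %rax     # load the low 64 bits of the dividend
CQO              # sign-extend %rax into %rdx
MOVQ $5, %rbx    # place the divisor in a register
IDIVQ %rbx       # quotient in %rax, remainder in %rdx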
The instructions AND, OR, and XOR perform destructive bitwise boolean
operations on two values. Bitwise means that the operation is applied to
each individual bit in the operands, and stored in the result.
So, the AND of 0101 and 0110 would yield the result 0100. In a similar
way, the NOT instruction inverts each bit in the operand. For example,
the bitwise C expression c = (a & ~b); could be translated like this:
MOVQ a, %rax
MOVQ b, %rbx
NOTQ %rbx
ANDQ %rax, %rbx
MOVQ %rbx, c
2 Alternatively, you could use the bitwise operators as logical operators if you give
true the integer value -1 (all ones) and false the integer value zero.
Instruction Meaning
JE Jump if Equal
JNE Jump if Not Equal
JL Jump if Less
JLE Jump if Less or Equal
JG Jump if Greater
JGE Jump if Greater or Equal
The CMP instruction compares two values and sets the condition flags,
which the conditional jumps then consult. For example, the statement
if(x>0) { y=10; } else { y=20; } can be translated like this:
MOVQ x, %rax
CMPQ $0, %rax
JLE .L1
.L0:
MOVQ $10, %rbx
JMP .L2
.L1:
MOVQ $20, %rbx
.L2:
MOVQ %rbx, y
Note that jumps require the compiler to define target labels. These
labels must be unique and private within one assembly file, but cannot be
seen outside the file unless a .globl directive is given. Labels like .L0,
.L1, etc, can be generated by the compiler on demand.
The PUSH instruction stores a value at the top of the stack (decrementing
the stack pointer), while POP removes a value (incrementing it):
PUSHQ %rax
POPQ %rax
Note that, in 64-bit code, PUSH and POP are limited to working with
64-bit values, so a manual MOV and ADD must be used if it is necessary to
move smaller items to/from the stack.
• The first six integer arguments (including pointers and other values
that fit in 64 bits) are placed in the registers %rdi, %rsi, %rdx,
%rcx, %r8, and %r9, in that order.
• The first eight floating point arguments are placed in the registers
%xmm0-%xmm7, in that order.
• Arguments in excess of those registers are pushed onto the stack.
• If the function takes a variable number of arguments (like
printf) then the %rax register must be set to the number of
floating point arguments.
• The return value of the function is placed in %rax.
In addition, we also need to know how the remaining registers are han-
dled. A few are caller saved, meaning that the calling function must save
those values before invoking another function. Others are callee saved,
meaning that a function, when called, must save the values of those regis-
ters, and restore them on return. The argument and result registers need
not be saved at all. Figure 10.4 shows these requirements.
To invoke a function, we must first compute the arguments and place
them in the desired registers. Then, we must push the two caller-saved
registers (%r10 and %r11) on the stack, to save their values. We then issue
the CALL instruction, which pushes the current instruction pointer on to
the stack then jumps to the code location of the function. Upon return
from the function, we pop the two caller-saved registers off of the stack,
and look for the return value of the function in the %rax register.
3 Note that there is nothing stopping you from writing a compiler that uses a stack call-
ing convention. But if you want to invoke functions compiled by others (like the standard
library) then you need to stick to the convention already in use.
int x=0;
int y=10;
int main()
{
x = printf("value: %d",y);
}
.data
x:
.quad 0
y:
.quad 10
str:
.string "value: %d\n"
.text
.globl main
main:
MOVQ $str, %rdi # first argument in %rdi: string
MOVQ y, %rsi # second argument in %rsi: y
MOVQ $0, %rax # there are zero float args
PUSHQ %r10 # save the caller-saved registers
PUSHQ %r11
CALL printf # invoke the function
POPQ %r11 # restore the caller-saved registers
POPQ %r10
MOVQ %rax, x # save the result of printf into x
MOVQ $0, %rax # return zero from main
RET
A leaf function that takes its arguments in registers can compute its
result without touching memory at all. Here is a complete square function:
.global square
square:
MOVQ %rdi, %rax # copy first argument to %rax
IMULQ %rax # multiply it by itself
# result is already in %rax
RET # return to caller
A more complex function must establish its own stack frame. Here is
the beginning of such a function func:
.globl func
func:
pushq %rbp # save the base pointer
movq %rsp, %rbp # set new base pointer
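The remainder of the prologue would save the arguments and callee-saved
registers and allocate locals, producing the frame layout shown below; a
sketch, assuming three arguments and two local variables:

pushq %rdi       # save the first argument on the stack
pushq %rsi       # save the second argument
pushq %rdx       # save the third argument
subq  $16, %rsp  # allocate two more local variables
pushq %rbx       # save the callee-saved registers
pushq %r12
pushq %r13
pushq %r14
pushq %r15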
There is a lot to keep track of here: the arguments given to the function,
the information necessary to return, and space for local computations. For
this purpose, we use the base pointer register %rbp. Whereas the stack
pointer %rsp points to the end of the stack where new data will be pushed,
the base pointer %rbp points to the start of the values used by this func-
tion. The space between %rbp and %rsp is known as the stack frame for
this function call.
Contents Address
old %rip register 8(%rbp)
old %rbp register (%rbp) ← %rbp points here
argument 0 -8(%rbp)
argument 1 -16(%rbp)
argument 2 -24(%rbp)
local variable 0 -32(%rbp)
local variable 1 -40(%rbp)
saved register %rbx -48(%rbp)
saved register %r12 -56(%rbp)
saved register %r13 -64(%rbp)
saved register %r14 -72(%rbp)
saved register %r15 -80(%rbp) ← %rsp points here
Note that the base pointer (%rbp) locates the start of the stack frame.
So, within the body of the function, we may use base-relative address-
ing against the base pointer to refer to both arguments and locals. The
arguments to the function follow the base pointer, so argument zero is at -
8(%rbp), argument one at -16(%rbp), and so forth. Past those are local vari-
ables to the function at -32(%rbp) and then saved registers at -48(%rbp).
The stack pointer points to the last item on the stack. If we use the stack
for additional purposes, data will be pushed to further negative values.
(Note that we have assumed all arguments and variables are 8 bytes large:
different types would result in different offsets.)
Here is a complete example that puts it all together. Suppose that you
have a B-minor function defined as follows:
10.4 ARM Assembly
The ARM processor architecture has a history almost as long as that of the
X86 architecture. It originated as the 32-bit Acorn RISC Machine used in
the Acorn Archimedes personal computer in 1987, and has since evolved
into a widely used low-power CPU employed in many embedded and
mobile systems, now known as the Advanced RISC Machine (ARM). The
evolving architecture has been implemented by a number of chip vendors
working from a common architecture definition. The most recent versions
of the architecture are ARMv7-A (32-bit) and ARMv8-A (64-bit). This
chapter will focus on the 32-bit architecture, with some remarks on dif-
ferences in the 64-bit architecture at the end.
ARM is an example of a Reduced Instruction Set Computer (RISC)
rather than a Complex Instruction Set Computer (CISC). Compared to
X86, ARM relies on a smaller set of instructions which are simpler to
pipeline or run in parallel, reducing the complexity and energy consump-
tion of the chip. ARM is sometimes considered only "partially" RISC due
to a few exceptions. For example, some instructions take a variable num-
ber of cycles to execute, which makes the pipeline imperfect; the inclusion
of a barrel shifter for preprocessing operands brings back more complex
instructions; and conditional execution of individual instructions reduces
the number of branches executed, and so reduces the energy used by the
processor. We will mainly be focusing on the elements of the instruction
set which allow us to do the most in a compiler, leaving the more complex
aspects and optimizations for later.
The processor also maintains a set of condition flags that describe the pro-
cessor state. A user-level program cannot access these directly, but they
can be set as a side effect of some operations.
ARM uses the following suffixes to indicate data sizes. Note that these
have different meanings than in X86 assembly! If no suffix is given, the
assembler assumes an unsigned word operand. Signed types are used to
provide appropriate sign-extension when loading a small data type into a
larger register. There is no register naming structure for anything below a
word.

Data Type        Suffix  Size
Byte             B       8 bits
Halfword         H       16 bits
Word             W       32 bits
Double Word      -       64 bits
Signed Byte      SB      8 bits
Signed Halfword  SH      16 bits
Signed Word      SW      32 bits
The MOV instruction copies an immediate value or the contents of one
register into another register:

Mode       Example
Immediate  MOV r0, #3
Register   MOV r1, r0

The mnemonic letter for a data type can be appended to an instruction
to make clear how much data is being transferred; if no suffix is given,
the assembler assumes an entire word.
To move values in and out of memory, use the Load (LDR) and Store
(STR) instructions, both of which indicate the source or destination register
as the first argument, and the source or destination memory address as the
second argument. In the simplest case, the address is given by a register
and indicated in brackets:
LDR r1, =x
LDR r2, [r1]
Post-indexing modes do the same thing, but in the reverse order. First,
the load is performed from the base register, then the base register is incre-
mented:
LDR r1, [r2], #4 ; Load from r2 then r2 += 4
LDR r1, [r2], r3 ; Load from r2 then r2 += r3
These complex pre-indexing and post-indexing modes make it possi-
ble to have single-instruction implementations of idiomatic C operations
like b = a++. The corresponding modes are also available to the STR in-
struction.
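For example, the analogous store forms look like this:

STR r1, [r2, #4]!   @ pre-index: store at r2+4, then r2 += 4
STR r1, [r2], #4    @ post-index: store at r2, then r2 += 4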
Absolute addresses (and other large literals) are somewhat more com-
plicated in ARM. Because every instruction must fit into a 32-bit word,
it is not possible to fit a 32-bit address into an instruction, alongside the
opcode. Instead, large literals must be stored in a literal pool, which is
a small region of data inside the code section of the program. A literal
can be loaded from a pool with a PC-relative load instruction, which can
reference ±4096 bytes from the loading instruction. This results in several
small literal pools being scattered throughout the program, so that each
one is close to the load instruction that uses it.
The ARM assembler generally hides this complexity from the user.
When a label or large literal is prefixed with an equals sign, this indicates
to the assembler that the marked value should be placed into a literal pool,
and a corresponding PC-relative instruction emitted instead.
For example, the following instructions load the address of x into r1,
and then the value of x into r2:

LDR r1, =x
LDR r2, [r1]

They will be expanded into a load of the address of x from an adjacent
literal pool, followed by a load of the value of x:
LDR r1, .L1
LDR r2, [r1]
B .end
.L1:
.word x
.end:
Instruction Example
Add without carry ADD Rd, Rm, Rn
Add with carry ADC Rd, Rm, Rn
Subtract without carry SUB Rd, Rm, Rn
Subtract with carry SBC Rd, Rm, Rn
Multiplication works much the same way, except that multiplying two
32-bit numbers could result in a 64-bit value. The ordinary MUL discards
the high bits of the result, while UMULL puts the 64-bit result in two 32-bit
registers. The signed variant SMULL will sign-extend the high register as
needed.
Instruction Example
Multiplication MUL Rd, Rm, Rn
Unsigned Long Multiplication UMULL RdHi, RdLo, Rm, Rn
Signed Long Multiplication SMULL RdHi, RdLo, Rm, Rn
Instruction Example
Bitwise-And AND Rd, Rm, Rn
Bitwise-Or ORR Rd, Rm, Rn
Bitwise-Xor EOR Rd, Rm, Rn
Bitwise-Bit-Clear BIC Rd, Rm, Rn
Move-Not MVN Rd, Rn

Comparisons are performed by the CMP instruction, which subtracts its
second operand (a register or an immediate value) from the first, sets the
condition flags, and discards the result:

CMP Rd, Rn
CMP Rd, #imm
Opcode  Meaning
B       Branch Always
BL      Branch and Link
BX      Branch and Exchange
BLX     Branch-Link-Exchange
BEQ     Equal
BNE     Not Equal
BGT     Greater Than
BGE     Greater Than or Equal
BLT     Less Than
BLE     Less Than or Equal
BMI     Negative
BPL     Positive or Zero
BVS     Overflow Set
BVC     Overflow Clear
BHI     Higher (unsigned >)
BHS     Higher or Same (unsigned >=)
BLO     Lower (unsigned <)
BLS     Lower or Same (unsigned <=)
For example, this loop counts up to five:

MOV r0, #0          @ i = 0
loop:
ADD r0, r0, #1      @ i = i + 1
CMP r0, #5          @ compare i against 5
BLT loop            @ loop again while i < 5
Conditional branches are typically paired with a comparison. For example,
this fragment sets y to 10 if x is greater than zero, and to 20 otherwise:

LDR r0, =x          @ load the address of x
LDR r0, [r0]        @ load the value of x
CMP r0, #0
BGT .L1             @ branch if x > 0
.L0:
MOV r0, #20         @ false branch
B .L2
.L1:
MOV r0, #10         @ true branch
.L2:
LDR r1, =y
STR r0, [r1]        @ store the result into y
ARM also permits conditional execution of ordinary instructions: a
condition suffix makes an instruction take effect only when the flags match.
For example, after a comparison, this sequence increments whichever of
r0 and r1 is smaller:

CMP r0, r1
ADDLT r0, r0, #1    @ executes only if r0 < r1
ADDGE r1, r1, #1    @ executes only if r0 >= r1
Multiple registers can be pushed onto the stack in a single instruction by
giving a register list in braces:

PUSH {r0,r1,r2}
POP {r0,r1,r2}
Unlike X86, any data item ranging from a byte to a double-word can be
pushed onto the stack, as long as data alignment is respected.

The ARM procedure calling convention includes these rules, among others:

• The first four arguments are placed in registers r0, r1, r2, and r3.
int x=0;
int y=10;
int main() {
x = printf("value: %d\n",y);
}
.data
x: .word 0
y: .word 10
S0: .ascii "value: %d\012\000"
.text
main:
LDR r0, =S0 @ Load address of S0
LDR r1, =y @ Load address of y
LDR r1, [r1] @ Load value of y
.global square
square:
MUL r0, r0, r0 @ multiply argument by itself
BX lr @ return to caller
func:
PUSH {fp} @ save the frame pointer
MOV fp, sp @ set the new frame pointer
PUSH {r0,r1,r2} @ save the arguments on the stack
SUB sp, sp, #8 @ allocate two more local variables
PUSH {r4-r10} @ save callee-saved registers
This sequence saves all of the register values for which the function is
responsible, so that no data is lost across the call. Once it is complete,
the stack looks very similar to an X86 stack frame, just with some
additional callee-saved registers stored on the stack.

Here is a complete example that puts it all together. Suppose that you
have a B-Minor function defined as follows:
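For instance, the same illustrative function sketched in the X86
discussion:

f: function integer ( a: integer, b: integer, c: integer ) = {
    x: integer = a + b;
    y: integer = x * c;
    return y;
}

A sketch of an ARM prologue and epilogue consistent with the frame
layout shown below would be:

.global f
f:
    PUSH {ip}           @ save scratch register r12
    PUSH {lr}           @ save the return address
    PUSH {fp}           @ save the old frame pointer
    MOV  fp, sp         @ set the new frame pointer
    PUSH {r0,r1,r2}     @ save the arguments on the stack
    SUB  sp, sp, #8     @ allocate two local variables
    PUSH {r4-r10}       @ save the callee-saved registers

    @ ... body of the function ...

    POP  {r4-r10}       @ restore the callee-saved registers
    MOV  sp, fp         @ reset the stack pointer
    POP  {fp}           @ restore the old frame pointer
    POP  {lr}           @ restore the return address
    POP  {ip}           @ restore r12
    BX   lr             @ return to the caller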
Contents Address
Saved r12 [fp, #8]
Old LR [fp, #4]
Old Frame Pointer [fp] ← fp points here
Argument 2 [fp, #-4]
Argument 1 [fp, #-8]
Argument 0 [fp, #-12]
Local Variable 1 [fp, #-16]
Local Variable 0 [fp, #-20]
Saved r10 [fp, #-24]
Saved r9 [fp, #-28]
Saved r8 [fp, #-32]
Saved r7 [fp, #-36]
Saved r6 [fp, #-40]
Saved r5 [fp, #-44]
Saved r4 [fp, #-48] ← sp points here
it does not know beforehand whether the local will be used as a return
value. We will explore these issues later in Chapter 12 on optimization.
This chapter has given you a brief orientation to the core features of the
X86 and ARM architectures, enough to write simple programs in the most
direct way. However, you will certainly need to look up specific details
of individual instructions in order to better understand their options and
limitations. Now you are ready to read the detailed reference manuals for
each architecture, and to keep them handy while you construct your compiler.
Chapter 11 – Code Generation
11.1 Introduction
Congratulations, you have made it to the final stage of the compiler! After
scanning and parsing the source code, constructing the AST, performing
type checking, and generating an intermediate representation, we are now
ready to generate some code.
To start, we are going to take a naïve approach to code generation, in
which we consider each element of the program in isolation. Each expression
and statement will be generated as a standalone unit, without reference
to its neighbors. This is simple and straightforward, but conservative,
and it will lead to a large amount of suboptimal code. But it will work,
and it will give you a starting point for thinking about more sophisticated
techniques.
The examples will focus on X86-64 assembly code, but they are not
hard to adapt to ARM and other assembly languages as needed. As with
previous stages, we will define a method for each element of a program.
decl_codegen will generate code for a declaration, calling stmt_codegen
for a statement, expr_codegen for an expression, and so on. These
relationships are shown in Figure 11.1.
Once you have learned this basic approach to code generation, you
will be ready for the following chapter, in which we consider more complex
methods for generating more highly optimized code.
(Figure 11.1: decl_codegen calls stmt_codegen, which in turn calls
expr_codegen, each handling one kind of program element.)
On X86-64, some registers are reserved for special purposes such as stack
management, and some are available for scratch values. Take those scratch
registers and put them into a table like this:

r     0     1     2     3     4     5     6
name  %rbx  %r10  %r11  %r12  %r13  %r14  %r15
inuse X     X
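A sketch of the scratch register routines over this table might look like
the following; the handling of register exhaustion here is only a simple
placeholder:

#include <stdio.h>
#include <stdlib.h>

struct scratch_reg {
    const char *name;
    int inuse;
};

struct scratch_reg scratch_table[] = {
    { "%rbx", 0 }, { "%r10", 0 }, { "%r11", 0 },
    { "%r12", 0 }, { "%r13", 0 }, { "%r14", 0 }, { "%r15", 0 }
};

#define NSCRATCH (sizeof(scratch_table)/sizeof(*scratch_table))

int scratch_alloc()
{
    int r;
    for(r=0;r<(int)NSCRATCH;r++) {
        if(!scratch_table[r].inuse) {
            scratch_table[r].inuse = 1;
            return r;
        }
    }
    /* a more sophisticated compiler would spill a value to the stack */
    fprintf(stderr,"error: out of scratch registers\n");
    exit(1);
}

void scratch_free( int r )
{
    scratch_table[r].inuse = 0;
}

const char * scratch_name( int r )
{
    return scratch_table[r].name;
}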
You will also need a facility for creating and naming anonymous labels
to serve as jump targets:

int label_create();
const char * label_name( int label );
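One common implementation is a simple global counter (a sketch; note
that label_name returns a pointer to a static buffer, so each result should
be printed before the next call):

#include <stdio.h>

int label_count = 0;

int label_create()
{
    return label_count++;
}

const char * label_name( int label )
{
    static char buf[32];
    snprintf(buf,sizeof(buf),".L%d",label);  /* e.g. .L0, .L1, ... */
    return buf;
}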
11.3 Generating Expressions
Suppose we want to generate X86 code for the following DAG, where
a, b and c are global integers, computing c = (a + 3) - b:

        ASSIGN
        /    \
     ISUB     c
     /   \
  IADD    b
  /  \
 a    3

A postorder traversal of the DAG yields this sequence of instructions:

1. MOVQ a, R0
2. MOVQ $3, R1
3. ADDQ R0, R1
4. MOVQ b, R0
5. SUBQ R0, R1
6. MOVQ R1, c
1. Visit the a node. Call scratch_alloc to allocate a new register (0)
and save that in node->reg. Then emit the instruction MOVQ a, R0
to load the value into register zero. 1

2. Visit the 3 node. Call scratch_alloc to allocate a new register (1)
and emit the instruction MOVQ $3, R1.
3. Visit the IADD node. By examining the two children of this node,
we can see that their values are stored in registers R0 and R1, so we
emit an instruction that adds them together: ADDQ R0, R1. This is
a destructive two-address instruction which leaves the result in R1.
R0 is no longer needed, so we call scratch_free(0).
4. Visit the b node. Call scratch_alloc to allocate a new register (0)
and emit MOVQ b, R0.
5. Visit the ISUB node. Emit SUBQ R0, R1, leaving the result in R1,
and freeing register R0.
6. Visit the c node, but don’t emit anything, because it is the target of
the assignment.
7. Visit the ASSIGN node and emit MOVQ R1, c.
1 Note that the actual name of register R0 is scratch_name(0), which is %rbx. To keep
the example clear, we will call them R0, R1, etc. for now.
And here is the same code with the actual register names provided by
scratch name:
MOVQ a, %rbx
MOVQ $3, %r10
ADDQ %rbx, %r10
MOVQ b, %rbx
SUBQ %rbx, %r10
MOVQ %r10, c
Multiplication is a special case: the one-operand form of IMUL implicitly
multiplies by %rax and leaves its double-width result in %rdx:%rax. Of
course, this also means that %rax and %rdx cannot be used for other
purposes while a multiply is in progress. Since we have a good number
of scratch registers to work with, we will simply not use %rdx for any
other purpose in our basic code generator.
Here is an excerpt of expr_codegen showing a few representative cases:

void expr_codegen( struct expr *e )
{
	if(!e) return;

	switch(e->kind) {
case EXPR_NAME:
e->reg = scratch_alloc();
printf("MOVQ %s, %s\n",
symbol_codegen(e->symbol),
scratch_name(e->reg));
break;
case EXPR_ADD:
expr_codegen(e->left);
expr_codegen(e->right);
printf("ADDQ %s, %s\n",
scratch_name(e->left->reg),
scratch_name(e->right->reg));
e->reg = e->right->reg;
scratch_free(e->left->reg);
break;
case EXPR_ASSIGN:
expr_codegen(e->left);
printf("MOVQ %s, %s\n",
scratch_name(e->left->reg),
symbol_codegen(e->right->symbol));
e->reg = e->left->reg;
break;
. . .
}
}
11.4 Generating Statements

Statements are generated one kind at a time: stmt_codegen dispatches
on the kind of statement, generates code for any embedded expressions,
and then continues with the next statement in the list:
void stmt_codegen( struct stmt *s )
{
	if(!s) return;

	switch(s->kind) {
case STMT_EXPR:
...
break;
case STMT_DECL:
...
break;
...
}
stmt_codegen(s->next);
}
Now consider each kind of statement in turn, starting with the easy
cases. If the statement describes a declaration (STMT_DECL) of a local
variable, then just delegate by calling decl_codegen:
case STMT_DECL:
decl_codegen(s->decl);
break;
A statement consisting of a bare expression (STMT_EXPR) just evaluates
the expression and then frees the register holding the result:

case STMT_EXPR:
	expr_codegen(s->expr);
	scratch_free(s->expr->reg);
	break;
A return statement must evaluate its expression, move the result into
%rax, and then jump to the function epilogue:

case STMT_RETURN:
expr_codegen(s->expr);
printf("MOV %s, %%rax\n",scratch_name(s->expr->reg));
printf("JMP .%s_epilogue\n",function_name);
scratch_free(s->expr->reg);
break;
(The careful reader will notice that this code needs to know the name
of the function that contains this statement. You will have to figure out a
way to pass that information down.)
Control flow statements are more interesting. It’s useful to first con-
sider what we want the output assembly language to look like, and then
work backwards to get the code for generating it.
Here is a template for a conditional statement:
if ( expr ) {
true-statements
} else {
false-statements
}
And here is the corresponding assembly language template:

expr
CMP register, $0
JE false-label
true-statements
JMP done-label
false-label :
false-statements
done-label :
Once you have a skeleton of the desired code, writing the code generator
is easy. First, generate two new labels, then call expr_codegen for each
expression and stmt_codegen for each statement, and emit the few
additional instructions needed to create the overall structure.
case STMT_IF: {
	int else_label = label_create();
	int done_label = label_create();
	expr_codegen(s->expr);
	printf("CMP %s, $0\n",scratch_name(s->expr->reg));
	scratch_free(s->expr->reg);
	printf("JE %s\n",label_name(else_label));
	stmt_codegen(s->body);
	printf("JMP %s\n",label_name(done_label));
	printf("%s:\n",label_name(else_label));
	stmt_codegen(s->else_body);
	printf("%s:\n",label_name(done_label));
	break;
}
A for loop is handled in a similar way. Here is the corresponding assembly
template: first, evaluate the initializing expression. Then, upon each
iteration of the loop, evaluate the control expression. If false, jump to the
end of the loop. If not, execute the body of the loop, then evaluate the
next expression.
init-expr
top-label:
expr
CMP register, $0
JE done-label
body-statements
next-expr
JMP top-label
done-label :
Writing the code generator is left as an exercise for the reader. Just keep
in mind that each of the three expressions in a for-loop can be omitted. If
the init-expr or the next-expr are omitted, they have no effect. If the expr is
omitted, it is assumed to be true. 2
Many languages have loop-control constructs like continue; and break;.
In these cases, the compiler must keep track of the labels associated with
the current loop being generated, and convert these statements into a JMP
to the top-label or the done-label, respectively.
The print statement in B-Minor is a special case of an imperative
statement with variant behavior depending on the type of the expression
to be printed. For example, the following print statement must generate
slightly different code for the integer, boolean, and string to be printed:
i: integer = 10;
b: boolean = true;
s: string = "\n";
print i, b, s;
The compiler translates this into a sequence of calls to a small runtime
library that accompanies the generated code:

print_integer(i);
print_boolean(b);
print_string(s);
2 Yes, each of the three components of a for-loop is an expression. It is customary that the
first has a side effect of initialization (i=0), the second is a comparison (i<10), and the third
has a side effect to generate the next value (i++), but they are all just plain expressions.
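As a sketch, the corresponding case in stmt_codegen might look like this.
(The next field used to walk the list of expressions, and the reuse of
expr_typecheck to recover each type, are assumptions about the AST
rather than fixed requirements:)

case STMT_PRINT: {
	struct expr *e;
	for(e=s->expr;e;e=e->next) {  /* assumed linked list of expressions */
		expr_codegen(e);
		struct type *t = expr_typecheck(e);  /* recover the type of e */
		printf("MOVQ %s, %%rdi\n",scratch_name(e->reg));
		if(t->kind==TYPE_INTEGER) {
			printf("CALL print_integer\n");
		} else if(t->kind==TYPE_BOOLEAN) {
			printf("CALL print_boolean\n");
		} else if(t->kind==TYPE_STRING) {
			printf("CALL print_string\n");
		}
		/* note: a robust generator must also preserve any caller-saved
		   scratch registers that are live across the CALL */
		scratch_free(e->reg);
	}
	break;
}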
11.5 Conditional Expressions
Now that you know how to generate control flow statements, we must re-
turn to one aspect of expression generation. Conditional expressions (less-
than, greater-than, equals, etc) compare two values and return a boolean
value. They most frequently appear in control flow expressions but can
also be used as simple values, like this:
b: boolean = x < y;

To implement this, compare the two values, and then set the result
register to either true or false:

left-expr
right-expr
CMP left-register, right-register
JLT true-label
MOV false, result-register
JMP done-label
true-label:
MOV true, result-register
done-label:
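One way to realize this template inside expr_codegen is sketched below
for the less-than operator. (In AT&T syntax, CMPQ %right, %left computes
left minus right, so JL branches exactly when left < right.)

case EXPR_LT: {
	int true_label = label_create();
	int done_label = label_create();
	expr_codegen(e->left);
	expr_codegen(e->right);
	printf("CMPQ %s, %s\n",
		scratch_name(e->right->reg),
		scratch_name(e->left->reg));
	printf("JL %s\n",label_name(true_label));
	printf("MOVQ $0, %s\n",scratch_name(e->right->reg)); /* false */
	printf("JMP %s\n",label_name(done_label));
	printf("%s:\n",label_name(true_label));
	printf("MOVQ $1, %s\n",scratch_name(e->right->reg)); /* true */
	printf("%s:\n",label_name(done_label));
	e->reg = e->right->reg;
	scratch_free(e->left->reg);
	break;
}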
11.6 Generating Declarations

Global variable declarations with constant initializers translate directly
into data definitions. For example, these declarations:
i: integer = 10;
s: string = "hello";
b: array [4] boolean = {true, false, true, false};
would generate this data section:

.data
i: .quad 10
s: .string "hello"
b: .quad 1, 0, 1, 0
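A sketch of the matching logic in decl_codegen for global scalars follows.
(The d->value->literal_value and d->value->string_literal fields
are assumptions about the AST, named here only for illustration:)

if(d->symbol->kind==SYMBOL_GLOBAL) {
	if(d->type->kind==TYPE_INTEGER) {
		/* emit an initialized (or zero) quad word */
		printf("%s: .quad %d\n",
			d->name, d->value ? d->value->literal_value : 0);
	} else if(d->type->kind==TYPE_STRING) {
		/* emit a null-terminated string constant */
		printf("%s: .string \"%s\"\n",
			d->name, d->value->string_literal);
	}
}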
11.7 Exercises
1. Write a legal expression that would exhaust the six available scratch
registers, if using the technique described in this chapter. In general,
how many registers are needed to generate code for an arbitrary ex-
pression tree?
2. When using a register calling convention, why is it necessary to gen-
erate values for all the function arguments before moving the values
into the argument registers?
3. Can a global variable declaration have a non-constant initializing ex-
pression? Explain why.
4. Suppose B-Minor included a switch statement. Sketch out two dif-
ferent assembly language templates for implementing a switch.
5. Write the complete code generator for the X86-64 architecture, as out-
lined in this chapter.
6. Write several test programs to exercise all aspects of B-Minor, then
use your compiler to build, test, and run them.
7. Compare the assembly output of your compiler on your test pro-
grams to the output of a production compiler like gcc on equivalent
programs written in C. What differences do you see?
8. Add an extra code generator to your compiler that emits a different
assembly language like ARM or an intermediate representation like
LLVM. Describe any changes in approach that were necessary.
Chapter 12 – Optimization
12.1 Overview
Using the basic code generation strategy shown in the previous chapter,
you can build a compiler that will produce perfectly usable, working code.
However, if you examine the output of your compiler, there are many
ways in which you can see obvious inefficiencies. This stems from the
fact that the basic code generation strategy considers each program ele-
ment in isolation, and must use the most conservative strategy to connect
them together.
In the early days of high level languages, before optimization strate-
gies were widespread, code produced by compilers was widely seen to
be inferior to that which was hand-written by humans. Today, a modern
compiler has many optimization techniques and very detailed knowledge
of the underlying architecture, and so compiled code is usually (but not
always) superior to that written by humans.
Optimizations can be applied at multiple stages of the compiler. It’s
usually best to solve a problem at the highest level of abstraction possible.
For example, once we have generated concrete assembly code, about the
most we can do is eliminate a few redundant instructions. But, if we work
with a linear IR, we can speed up a long code sequence with smart reg-
ister allocation. And if we work at the level of a DAG or an AST, we can
eliminate entire chunks of unused code.
Optimizations can occur at different scopes within a program. Local op-
timizations refer to changes that are limited to a single basic block, which
is a straight-line sequence of code without any flow control. Global opti-
mizations refer to changes applied to the entire body of a function (or pro-
cedure, method, etc), consisting of a control-flow-graph where each node
is a basic block. Interprocedural optimizations are even larger, and take
into account the relationships between different functions. Generally, op-
timizations at larger scopes are more challenging, but have more potential
to improve the program.
This chapter will give you a tour of some common code optimization
techniques that you can either implement in your project compiler, or ex-
plore by implementing them by hand. But this is just an introduction: code
optimization is a very large topic that could occupy a whole second text-
book, and is still an area of active research today. If this chapter appeals to
you, then check out some of the more advanced books and articles refer-
enced at the end of the chapter.
One way to evaluate an optimization is to measure the running time of a
critical section of code, before and after the change:
#include <sys/time.h>

struct timeval start, stop, elapsed;

gettimeofday(&start,0);
for(i=0;i<1000000;i++) {
	x = x * y;
}
gettimeofday(&stop,0);
timersub(&stop,&start,&elapsed);
The end result is the same (86400) but the code is much clearer about
the purpose and origin of that number. However, if translated literally, the
program would contain three excess constants, several memory lookups,
and two multiplications to obtain the same result. If done in the inner
loop of a complex program, this could be a significant waste. Ideally, the
compiler should evaluate such constant expressions itself at compile time
and substitute the result, a technique known as constant folding. The heart
of a folding pass over the AST looks something like this, collapsing an
operator node with two constant children into a single constant node:
if( e->left->kind == EXPR_CONSTANT &&
    e->right->kind == EXPR_CONSTANT ) {
	f = expr_create( EXPR_CONSTANT );
	f->value = /* e's operator applied to
	              e->left->value and e->right->value */;
	expr_delete(e->left);
	expr_delete(e->right);
	return f;
} else {
	return e;
}
}
While the effects of constant folding may seem minimal, it is often the
first step in enabling a chain of further optimizations.
A related optimization is loop unrolling, which reduces the overhead of
a loop by performing more work per iteration. For example, this loop:
for(i=0;i<400;i++) {
a[i] = i*2 + 10;
}
can be unrolled so that each iteration does four units of work, like this:

for(i=0;i<400;i+=4) {
	a[i] = i*2 + 10;
	a[i+1] = (i+1)*2 + 10;
	a[i+2] = (i+2)*2 + 10;
	a[i+3] = (i+3)*2 + 10;
}
1 While there is a logic to this sort of optimization, it does seem like an unseemly level of
familiarity between the compiler and the standard library, which may have different devel-
opers and evolve independently.
Or this:
for(i=0;i<400;i++) {
a[i] = i*2 + 10;
i++;
a[i] = i*2 + 10;
i++;
a[i] = i*2 + 10;
i++;
a[i] = i*2 + 10;
}
Increasing the work per loop iteration saves some unnecessary eval-
uations of i<400, and it also eliminates branches from the instruction
stream, which avoids pipeline stalls and other complexities within the mi-
croprocessor.
But how much should a loop be unrolled? The unrolled loop could
contain 4, 8, 16 or even more items per iteration. In the extreme case,
the compiler could eliminate the loop entirely and replace it with a finite
sequence of statements, each with a constant value:
a[0] = 0 + 10;
a[1] = 2 + 10;
a[2] = 4 + 10;
. . .
Sometimes a computation inside a loop is loop-invariant: it produces the
same value on every iteration, and so the code can be moved to the block
preceding the loop, which is
known as code hoisting. For example, the array index in this example is
constant throughout the loop, and can be computed once before the loop
body:
for(i=0;i<400;i++) {
	a[x*y] += i;
}

becomes:

t = x*y;
for(i=0;i<400;i++) {
	a[t] += i;
}
Function inlining replaces an invocation of a small function with a copy
of the function body, with the arguments substituted in. For example,
suppose that the function quadratic is defined as:

int quadratic( int a, int b, int x ) {
	return a*x*x + b*x + 30;
}

and is invoked inside a loop:

for(i=0;i<1000;i++) {
	y = quadratic(10,20,i*2);
}
The overhead of setting up the parameters and invoking the function
likely exceeds the cost of doing the handful of additions and multiplies
within the function itself. By inlining the function code into the loop, we
can improve the overall performance of the program.
Function inlining is most easily performed on a high level represen-
tation such as an AST or a DAG. First, the body of the function must be
duplicated, then the parameters of the invocation must be substituted in.
Note that, at this level of evaluation, the parameters are not necessarily
constants, but may be complex expressions that contain unbound values.
For example, the invocation of quadratic above can be substituted
with the expression (a*x*x+b*x+30) under the binding of a=10, b=20,
and x=i*2. Once this substitution is performed, unbound variables such
as i are relative to the scope where quadratic was called, not where it
was defined. The resulting code looks like this:
for(i=0;i<1000;i++) {
y = 10*(i*2)*(i*2) + 20*(i*2) + 30;
}
Constant folding and algebraic simplification can then reduce this to:

for(i=0;i<1000;i++) {
y = 40*i*i + 40*i + 30;
}
A related high-level technique is to detect unreachable code by building
the control flow graph (CFG) of each function and determining which
statements can actually be reached from the entry point. Consider this
code fragment and its control flow graph:

if( x<y ) {
	return 10;
} else {
	print "hello";
}
print "goodbye";
return 30;

(Control flow graph: the test x<y branches to return 10 when true and
to print "hello" when false; the false path falls through to
print "goodbye" and then return 30.)
A return statement causes an immediate termination of the function
and (from the perspective of the control flow graph) is the end of the ex-
ecution path. Here, the true branch of the if statement immediately re-
turns, while the false branch falls through to the next statement. For some
values of x and y, it is possible to reach every statement.
However, if we make a slight change, like this:
if( x<y ) {
	return 10;
} else {
	return 20;
}
print "goodbye";
return 30;

(Control flow graph: both branches of the test now end in a return, so
the nodes for print "goodbye" and return 30 have no incoming path
from the entry point.)
Then, both branches of the if statement terminate in a return, and it
is not possible to reach the final print and return statements. This is
(likely) a mistake by the programmer and should be flagged.
Once the control flow graph is created, determining reachability is sim-
ple: perform a traversal of the CFG, starting from the entry point of the
function, marking each node as it is visited. Once the traversal is com-
plete, if there are any nodes unmarked, then they are unreachable and the
compiler may either generate a suitable error message or simply not gen-
erate code for the unreachable portion. 3
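A sketch of such a traversal over a hypothetical basic-block structure:

struct basic_block {
	int visited;
	struct basic_block *on_true;   /* successor when the branch is taken */
	struct basic_block *on_false;  /* fall-through successor (or null) */
};

void block_mark_reachable( struct basic_block *b )
{
	if(!b || b->visited) return;   /* stop at nulls and graph cycles */
	b->visited = 1;
	block_mark_reachable(b->on_true);
	block_mark_reachable(b->on_false);
}

/* after calling block_mark_reachable(entry), any block whose visited
   field is still zero is unreachable and may be flagged or dropped */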
Reachability analysis becomes particularly powerful when combined
with other forms of static analysis. In the example above, suppose that
variables x and y are defined as constants 100 and 200, respectively. Con-
stant folding can reduce x<y to simply true, with the result that the false
branch of the if statement is never taken and therefore unreachable.
Now, don’t get carried away with this line of thinking. If you were
paying attention in your theory-of-computing course, this may sound sus-
piciously like the Halting Problem: can we determine whether an arbi-
trary program written in a Turing-complete language will run to comple-
tion, without actually executing it? The answer is, of course, no, not in the
general case. Reachability analysis simply determines, in some limited
cases, that certain code can never be reached.
3 A slight variation on this technique can be used to evaluate whether every code path
through a function results in a suitable return statement. This is left as an exercise to the
reader.
12.4 Low-Level Optimizations
All of the optimizations discussed so far can be applied to the high level
structure of a program, without taking into account the particular target
machine. Low-level optimizations focus more on the translation of the
program structure in a way that best exploits the peculiarities of the un-
derlying machine.
One important low-level technique is instruction selection by tree
matching. Each machine instruction is described by a template: a small
tree of operations that the instruction can implement, paired with the
instruction itself. For example, one template states that an IADD of a
register Ri and a constant Cj can be implemented by ADDQ $Cj, Ri,
leaving the result in Ri.
Figure 12.3 gives a few more examples of X86 instructions that can be
represented as tree templates. The simple instructions at the top simply
substitute one entity in the DAG for another: MOV $Cj, Ri converts a
constant into a register, while MOV Mx, Ri converts a memory location
into a register. Richer instructions have more structure: the complex load
MOV Cd(Rc,Ra,8), Ri can be used to represent a combination of add,
multiply, and dereference.
Of course, Figure 12.3 is not the complete X86 instruction set. To de-
scribe even a significant subset would require hundreds of entries, with
multiple entries per instruction to capture the multiple variations on each
instruction. (For example, you would need one template for an ADDQ on
two registers, and another for a register-memory combination.) But this
is a feasible task and perhaps easier to accomplish than hand-writing a
complete code generator.
With the complete library of templates written, the job of the code gen-
erator is to examine the tree for sub-trees that match an instruction tem-
plate. When one is found, the corresponding instruction is emitted (with
appropriate substitutions for register numbers, etc) and the matching por-
tion of the tree replaced with the right side of the template.
For example, suppose we wish to generate X86 code for the statement
a[i] = b + 1; Let us suppose that b is a global variable, while a and
i are local variables at positions 40 and 32 above the base pointer, respec-
tively. Figure 12.4 shows the steps of tree rewriting. In each DAG, the box
indicates the subtree that matches a rule in Figure 12.3.

(Figure 12.3: a small library of tree templates, each pairing a tree
pattern with the X86 instruction that implements it:
1. constant Cj => Ri, via MOVQ $Cj, Ri
2. global Mx => Ri, via MOVQ Mx, Ri
3. DEREF(Rj) => Ri, via MOVQ (Rj), Ri
4. IADD(Ri, Cj) => Ri, via ADDQ $Cj, Ri
5. DEREF(IADD(Rk, Cj)) => Ri, via MOVQ Cj(Rk), Ri
6. IADD(Rk, Cj) => Ri, via LEAQ Cj(Rk), Ri
7. ASSIGN(DEREF(IADD(Rk, IMUL(Rl, 8))), Ri) => Ri,
   via MOVQ Ri, (Rk, Rl, 8) )

Step 1: The left IADD should be executed first to compute the value
of the expression. Looking at our table of templates, there is no IADD that
can directly add a memory location to a constant. So, we instead select rule
(2), which emits the instruction MOVQ b, %R0 and converts the left-hand
side of the template (a memory location) into the right hand side (register
%R0).
Step 2: Now we can see an IADD of a register and a constant, which
matches the template of rule (4). We emit the instruction ADDQ $1, %R0
and replace the IADD subtree with the register %R0.
Step 3: Now let’s look at the other side of the tree. We can use rule
(5) to match the entire subtree that loads the variable i from %RBP+32 by
emitting the instruction MOVQ 32(%RBP), %R1 and replacing the subtree
with the register %R1.
Step 4: In a similar way, we can use rule (6) to compute the address
of a by emitting LEAQ 40(%RBP), %R2. Notice that this is, in effect, a
three-address addition specialized for use with a register and a constant.
Step 5: Finally, template rule (7) matches most of what is remaining. We
can emit MOVQ %R0, (%R2,%R1,8), which stores the value in %R0 into the
computed array address of a. The left side of the template is replaced with
the right-hand side, leaving nothing but the register %R0. With the tree
reduced to a single register, the process is complete.

(Figure 12.4: at each step, the boxed subtree of the DAG for a[i] = b + 1
is matched against a template and replaced by the register on the
template's right-hand side, until only a single register remains.)

The complete sequence of emitted instructions is:

MOVQ b,%R0
ADDQ $1,%R0
MOVQ 32(%RBP),%R1
LEAQ 40(%RBP),%R2
MOVQ %R0,(%R2,%R1,8)
Note that some of these cases are more difficult to detect than others!
Globally shared variables are already known to the compiler, but the other
three cases are (often) not reflected in the language itself. In the C lan-
guage, one can mark a variable with the volatile keyword to indicate
that it may be changed by some method unknown to the compiler, and
therefore clever optimizations should not be undertaken. Low level code
found in operating systems or parallel programs is often compiled without
such optimizations, so as to avoid these problems.
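For example, a flag set by an interrupt handler might be declared and
polled like this (a hypothetical snippet):

volatile int data_ready = 0;  /* set asynchronously by an interrupt */

void wait_for_data()
{
	/* volatile forces a fresh load of data_ready on each iteration;
	   without it, the compiler could hoist the load out of the loop */
	while(!data_ready) { }
}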
12.5 Register Allocation
Guthrie conjectured that only four colors were necessary to color a planar graph, while at-
tempting to color a map of Europe. He brought this problem to Augustus De Morgan, who
popularized it, leading to several other mathematicians who published several (incorrect)
proofs in the late 1800s. In 1891, Percy John Heawood proved that no more than five colors
were sufficient, and that’s where things stood for the next 85 years. In 1976, Kenneth Appel
and Wolfgang Haken produced a computer-assisted proof of the four-color theorem, but the
proof contained over 400 pages of case-by-case analysis which had to be painstakingly ver-
ified by hand. This caused consternation in the mathematical community, partly due to the
practical difficulty of verifying such a result, but also because this proof was unlike any that
had come before. Does it really count as a “proof” if it cannot be easily contemplated and
verified by a human? [2]
For example, consider this program fragment and its control flow graph,
annotated with the set of blocks in which each variable is live:

a=10;
for(i=0;i<10;i++) {
	a = a*3;
}
x = a*2;
print x;
print a;

(Control flow graph and live ranges: Block A performs a=10 and i=0;
Block B tests i<10; Block C performs a=a*3 and i++; Block D performs
x=a*2, print x, and print a. Variable i is live in blocks {A,B,C},
a is live in {A,B,C,D}, and x is live in {D}.)
12.6 Optimization Pitfalls

Now that you have seen some common optimization techniques, you
should be aware of some pitfalls common to all of them.
For example, it is tempting to assume that arithmetic in the program
obeys the ordinary rules of algebra, so that an expression like (a+b)/5
could be freely interchanged with a/5 + b/5. Unfortunately, these two
expressions are not the same in the concrete world of limited-precision
mathematics. Suppose that a, b, and x are 32-bit signed integers, with a
range of [-2^31, +2^31). If both a and b have the value 2,000,000,000,
then a/5 is 400,000,000 and a/5+b/5 is 800,000,000. However, a+b
overflows a 32-bit register and wraps around to a negative value, so that
(a+b)/5 is -58,993,459! An optimizing compiler must be extremely
cautious that any code transformation produces the same results under
all possible values in the program.
12.7 Optimization Interactions
Multiple optimizations can interact with each other in ways that are
unpredictable. Sometimes, these interactions cascade in positive ways:
constant folding can support reachability analysis, resulting in dead code
elimination. On the other hand, optimizations can interact in negative
ways: function inlining can result in more complex expressions, resulting
in less efficient register allocation. What's worse is that one set of
optimizations may be very effective on one program, but counterproductive
on another.

A modern optimizing compiler can easily have fifty different optimization
techniques of varying complexity. If it is only a question of turning
each optimization on or off, then there are (only) 2^50 combinations. But
if the optimizations can be applied in any order, there are 50! permutations!
How is the user to decide which optimizations to enable?
Most production compilers define a few discrete levels of optimization.
For example, gcc defines -O0 as the fastest compile speed, with no opti-
mization, -O1 enables about thirty optimizations with modest compile-
time expense and few runtime drawbacks (e.g. dead code elimination),
-O2 enables another thirty optimizations with greater compile-time ex-
pense (e.g. code hoisting), and -O3 enables aggressive optimizations that
may or may not pay off (e.g. loop unrolling). On top of that, individual
optimizations may be turned on or off manually.
But is it possible to do better with finer-grained control? A number of
researchers have explored methods of finding the best combinations of op-
timizations, given a benchmark program that runs reasonably quickly. For
example, the CHiLL [8] framework combines parallel execution of bench-
marks with a high level heuristic search algorithm to prune the overall
search space. Another approach is to use genetic algorithms [9] in which
a representative set of configurations is iteratively evaluated, mutated, and
recombined until a strong configuration emerges.
12.8 Exercises
12.9 Further Reading
2. R. Wilson, “Four Colors Suffice: How the Map Problem Was Solved”,
Princeton University Press, 2013.
A history of the long winding road from the four-color conjecture in the nineteenth
century to its proof by computer in the twentieth century.
Appendix A – Sample Course Project
Construct a scanner for B-Minor which reads in a source file and produces
a listing of each token one by one, annotated with the token kind (iden-
tifier, integer, string, etc) and the location in the source. If invalid input
is discovered, produce a message, recover from the error, and continue.
Create a set of complete tests to exercise all of the tricky corner cases of
comments, strings, escape characters, and so forth.
Building on the scanner, construct a parser for B-Minor using Bison (or an-
other appropriate tool) which reads in a source file and determines whether
the input is valid according to the grammar, and indicates success or failure. Use the diagnostic
features of Bison to evaluate the given grammar for ambiguities and work
to resolve problems such as the dangling-else. Create a set of complete
tests to exercise all of the tricky corner cases.
Next, use the parser to construct the complete AST for the source program.
To verify the correctness of the AST, print it back out as an equivalent
source program, but with all of the whitespace arranged nicely so that it is
pleasant to read. This will result in some interesting discussions with the
instructor about what constitutes an “equivalent” program. A necessary
(but not sufficient) requirement is that the output program should be re-
parseable by the same tool. This requires attention to some details with
comments, strings, and general formatting. Again, create a set of test cases.
Next, add methods which walk the AST and perform semantic analysis
to determine the correctness of the program. Symbol references must be
resolved to definitions, the type of expressions must be inferred, and the
compatibility of values in context must be checked. You are probably used
to encountering incomprehensible error messages from compilers: this is
your opportunity to improve upon the situation. Again, create a set of test
cases.
The most exciting step is to finally emit working assembly code. Straight-
forward code generation is most easily performed on the AST itself, or a
DAG derived from the AST in the optional IR assignment, following the
procedure in Chapter 11. For the first attempt, it’s best not to be concerned
about the efficiency of the code, but allow each code block to conserva-
tively stand on its own. It is best to start with some extremely simple
programs (e.g. return 2+2;) and gradually add complexity bit by bit.
Here, your practice in constructing test cases will really pay off, because
you will be able to quickly verify how many test programs are affected by
one change to the compiler.
A.7 Optional: Extend the Language

In the final step, you are encouraged to develop your own ideas for ex-
tending B-Minor itself with new data types or control structures, to create
a new backend targeting a different CPU architecture, or to implement one
or more optimizations described in Chapter 12.
Appendix B – The B-Minor Language
B.1 Overview
B.2 Tokens
B.3 Types
B-Minor has four atomic types: integers, booleans, characters, and strings.
A variable is declared as a name followed by a colon, then a type and an
optional initializer. For example:
x: integer;
y: integer = 123;
b: boolean = false;
c: char = ’q’;
s: string = "hello world\n";
An integer is always a signed 64-bit value. boolean can take the lit-
eral values true or false. char is a single 8-bit ASCII character. string
is a double-quoted constant string that is null-terminated and cannot be
modified. (Note that, unlike C, string is not an array of char, it is a
completely separate type.)
Both char and string may contain the following backslash codes: \n
indicates a linefeed (ASCII value 10), \0 indicates a null (ASCII value zero),
and a backslash followed by anything else indicates exactly the following
character. Both strings and identifiers may be up to 256 characters long.
B-Minor also allows arrays of a fixed size. They may be declared with
no value, which causes them to contain all zeros:
a: array [5] integer;
Or, the entire array may be given specific values:
a: array [5] integer = {1,2,3,4,5};
B.4 Expressions
B-Minor has many of the arithmetic operators found in C, with the same
meanings and levels of precedence.
B-Minor is strictly typed. This means that you may only assign a value
to a variable (or function parameter) if the types match exactly. You can-
not perform many of the fast-and-loose conversions found in C. For ex-
ample, arithmetic operators can only apply to integers. Comparisons can
be performed on arguments of any type, but only if they match. Logical
operations can only be performed on booleans.
Following are examples of some (but not all) type errors:
x: integer = 65;
y: char = ’A’;
if(x>y) ... // error: x and y are of different types!
f: integer = 0;
if(f) ... // error: f is not a boolean!
Following are some (but not all) examples of correct type assignments:
b: boolean;
x: integer = 3;
y: integer = 5;
b = x<y; // ok: the expression x<y is boolean
f: integer = 0;
c: char = ’a’;
if(c==’a’) ... // ok: c and ’a’ are both chars
In B-Minor, you may declare global variables with optional constant ini-
tializers, function prototypes, and function definitions. Within functions,
you may declare local variables (including arrays) with optional initial-
ization expressions. Scoping rules are identical to C. Function definitions
may not be nested.
Within functions, basic statements may be arithmetic expressions, return
statements, print statements, if and if-else statements, for loops, or
code within inner groups. B-Minor does not have switch statements,
while-loops, or do-while loops, since those are easily represented as spe-
cial cases of for and if.
The print statement is a little unusual because it is a statement and
not a function call. print takes a list of expressions separated by commas,
and prints each out to the console, like this:

print "The temperature is ", temp, " degrees\n";
B.6 Functions
Functions are declared in the same way as variables, except giving a type
of function followed by the return type, arguments, and code:

square: function integer ( x: integer ) = {
	return x*x;
}
The return type of a function must be one of the four atomic types,
or void to indicate no type. Function arguments may be of any type.
integer, boolean, and char arguments are passed by value, while string
and array arguments are passed by reference. As in C, arrays passed by
reference have an indeterminate size, and so the length is typically passed
as an extra argument:
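(The following declaration is an illustrative sketch, not part of the
language specification:)

printarray: function void ( a: array [] integer, size: integer ) = {
	i: integer;
	for( i=0; i<size; i++ ) {
		print a[i], "\n";
	}
}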
A function prototype states the existence and type of the function, but
includes no code. This must be done if the user wishes to call an external
function linked by another library. For example, to invoke the C function
puts:

puts: function void ( s: string );
Appendix C – Coding Conventions
C has been the language of choice for implementing low level systems like
compilers, operating systems, and drivers since the 1980s. However, it is
fair to say that C does not enforce a wide variety of good programming
practices, in comparison to other languages. To write solid C, you need to
exercise a high degree of self-discipline. 1
For many students, a compilers course in college is the first place where
you are asked to create a good sized piece of software, refining it through
several development cycles until the final project is reached. This is a good
opportunity for you to pick up some good habits that will make you more
productive.
To that end, here are the coding conventions that I ask my students to
observe when writing C code. Each of these recommendations requires a
little more work up front, but will save you headaches in the long run.
Use a version control system. There are a variety of nice open source
systems for keeping track of your source code. Today, Git, Mercurial, and
Subversion are quite popular, and I’m sure next year will bring a new one.
Pick one, learn the basic features, and shape your code gradually by mak-
ing small commits.2
Go from working to working. Never leave your code in a broken state.
Begin by checking in the simplest possible sketch of your program that
compiles and works, even if it only prints out “hello world”. Then, add
to the program in a minor way, make sure that it compiles and runs, and
check it in. 3
Eliminate dead code. Students often pick up the habit of commenting
out one bit of code while they attempt to change it and test it. While this is
1 Why not use C++ to address some of these disciplines? Although C++ is a common
part of many computer science curricula, I generally discourage the use of C++ by students.
Although it has many features that are attractive at first glance, they are not powerful enough
to allow you to dispense with the basic C mechanisms. (For example, even if you use the C++
string class, you still need to understand basic character arrays and pointers.) Further, the
language is so complex that very few people really understand the complete set of features
and how they interact. If you stick with C, what you see is what you get.
2 Some people like to spend endless hours arguing about the proper way to use arcane
features of these tools. Don’t be one of those people: learn the basic operations and spend
your mental energy on your code instead.
3 This advice is often attributed as one of Jim Gray's "Laws of Data Engineering" in slide
presentations.
a reasonable tactic to use for a quick test, don’t allow this dead code to pile
up in your program, otherwise your source code will quickly become in-
comprehensible. Remove unused code, data, comments, files, or anything
else that is unnecessary to the program, so that you can clearly see what it
does now. Trust your version control system to allow you to go back to a
previously working version, if needed.
Use tools to handle indenting. Don’t waste your time arguing about
indenting style; find a tool that does it for you automatically, and then
forget about it. Your editor probably has a mode to indent automatically.
If not, use the standard Unix tool indent.
Name things consistently. In this book, you will see that every func-
tion name consists of a noun and a verb: expr_typecheck, decl_codegen,
etc. Each one is used consistently: expr is always used for expressions,
codegen is always used for code generation. Every function dealing with
expressions is in the expr module. It may be tempting to take shortcuts
or make abbreviations in the heat of battle, but this will come back to bite
you. Do it right the first time.
Put only the interface in a header file. In C, a header file (like expr.h)
is used to describe the elements needed to call a function: function proto-
types and the types and constants necessary to invoke those functions. If a
function is only used within one module, it should not be mentioned in the
header file, because nobody outside the module needs that information.
Put only the implementation in a source file. In C, a source file (like
expr.c) is used to provide the definitions of functions. In the source
file, you should include the corresponding header (expr.h) so that the
compiler can check that your function definitions match the prototypes.
Any function or variable that is private to the module should be declared
static.
Be lazy and recursive. Many language data structures are hierarchi-
cally nested. When designing an algorithm, take note of the nested data
structures, and pass responsibility to other functions, even if you haven’t
written them yet. This technique generally results in code that is simple,
compact, and readable. For example, to print out a variable declaration,
break it down into printing the name, then the type, then the value, with
some punctuation in between:
printf("%s:\n",d->name);
type_print(d->type);
printf(" = ");
expr_print(d->value);
printf(" ;\n");
Then proceed to writing type_print and expr_print, if you haven't
done them already.
Use a Makefile to build everything automatically. Learn how to write
a Makefile, if you haven’t already. The basic syntax of Make is very simple.
The following rule says that expr.o depends upon expr.c and expr.h,
and can be built by running the command gcc:
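# an illustrative rule; the compiler flags shown are only an example
expr.o: expr.c expr.h
	gcc -Wall -c expr.c -o expr.o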
There are many variations of Make that include wildcards, and pattern
substitution, and all manner of other things that can be confusing to the
uninitiated. Just start by writing plain old rules whose meaning is clear.
Null pointers are your friends. When designing a data structure, use
null pointers to indicate when nothing is present. You cannot dereference
a null pointer, of course, and so you must check before using it. This can
lead to code cluttered with null checks everywhere, like this:
if(e->left) expr_codegen(e->left,output);
if(e->right) expr_codegen(e->right,output);
. . .
}
You can eliminate many of them by simply placing the check at the
beginning of the function, and programming in a recursive style:
void expr_codegen( struct expr *e, FILE *output )
{
	if(!e) return;

	expr_codegen(e->left,output);
	expr_codegen(e->right,output);
	. . .
}