Song Y. Yan
Principles of Compilers
A New Approach to Compilers
Including the Algebraic Method
ISBN 978-7-04-030577-7
Higher Education Press, Beijing
The compiler is one of the most important pieces of system software. When any computer user develops a computer program, he or she must use some programming language rather than the computer instruction set. This implies that a compiler of that programming language must have been installed on the computer one uses; otherwise the developed program cannot be run.
There are some differences between a compiler and a programming language. Once a language is designed, it must be kept unchanged (except when it contains a mistake that has to be corrected), while the techniques for implementing compilation may change over time. Hence people constantly explore newer and more efficient techniques to raise the quality of compilers.
A course similar to “The Principles of Compilers” has become one of the most important courses in computer science in institutes of higher education. To our knowledge, compilation techniques evolve in two directions. One is the improvement of the compilation techniques for existing languages. The other is the research and development of compilation techniques for new languages. These new languages include object-oriented languages, distributed languages, parallel languages, etc. This book introduces the newest knowledge in the field and explores the compilation techniques suitable for these languages and computations. It relates the compilation of programming languages to the translation of natural languages in the human brain so that the reader can understand the principles of compilers more easily. Meanwhile, it introduces the algebraic method of compilation, which belongs to formal techniques.
This book consists of 16 chapters. Chapter 1, Introduction, outlines the process of compilation and relates the compilation of programming languages to the comprehension and generation of natural languages in the human brain. Chapter 2 introduces grammars and languages. The generation of a language is based on its grammar, and grammars and languages are the fundamentals of the compilation process. Chapter 3 introduces finite automata and regular languages; together with Chapter 4, it is devoted to lexical analysis, the first task of the analysis stage. Chapter 3 may be regarded as the theoretical preparation for lexical analysis, while Chapter 4 is the concrete practice of
Yunlin Su
Song Y. Yan
March 2011
Contents
Chapter 1 Introduction  1
1.1 Language and Mankind  1
1.2 Language and Computer  3
1.3 Compilation of Programming Languages  12
1.4 Number of Passes of Compiler  17
1.5 An Example of Compilation of a Statement  19
1.6 Organization of the Book  21
Problems  23
References  23
References  155
Index  451
Chapter 1 Introduction
Steve Pinker
If you have read the text above, you must have been engaging in one of the mind’s most enchanting processes — the way one mind influences another through language. However, there is a precondition: you have to know English, otherwise the text has no influence on you at all. There are so many languages in the world that no one can tell exactly how many there are. Therefore, there is a need for a bridge that connects different languages so that people can understand each other. That bridge is translation. The subject of this book is the translation between formal languages and machine languages, that is, compilation.
What is a compiler, or compilation program? Simply speaking, it is a program whose function is to translate programs written in a programming language into machine code that can be run on the kind of machine the code belongs to. In order to explain what lies behind this, we need to discuss it further.
Language is the main means of human communication and the way in which most information is exchanged. By language, people link up with each other, express their intentions and feelings, and describe matters or express their understanding [1]. It is a kind of intelligence, or a product of intelligence. However, in the long process of human evolution, there was a long period without language. Gradually, humans invented oral language to meet the needs of living. Therefore, oral language can be considered the first breakthrough in language; it was also a breakthrough in human civilization. The transition from oral language to written language took an even longer time. The
has become a heated profession. Take colloquial translation, or interpretation, as an example: it involves three persons, i.e., two persons A and B who want to converse with each other for some purpose, and a third person C who helps them with the matter. Suppose that A speaks language X and B speaks language Y. Obviously, in order for A and B to understand each other, the task of C is to interpret the words spoken in X by A into language Y and, meanwhile, to interpret the words spoken in Y by B into language X. Therefore, C must be bilingual in this circumstance. This situation is shown in Fig. 1.1.
computer; especially, it is too much for the common users of computers. It is something like requiring a person who wants to watch television to understand the principles of the television set and to operate its internal components.
When the computer was just invented, there was no other language available for running the computer. The instruction set of the computer was the only language which people could use to develop programs. This historical period is called the period of manual programming. An instruction commonly contains the operation code that indicates the operation it performs and the addresses of the data on which the operation is performed, as well as the control codes. At that time, only very few people, the designers or developers of the computers, wrote programs. For them, building programs using the instruction set was not a problem, though it required them to work hard and spend lots of time. As computers became more and more popular, the users were no longer those who were very familiar with the principles inside the computers; they were just ordinary users, no different from the users of televisions. They wanted to use the computer freely to solve their various problems. In this circumstance, no doubt, the machine language became their stumbling block in using computers [2]. In order to break away from the constraints of the machine language, soon after the invention of the computer people started searching for a solution to the problem. The first step was to replace
the operation codes of the instructions with easily memorized mnemonic notations, e.g., to use ADD to represent addition, SUB to represent subtraction, MUL to represent multiplication, and DIV to represent division, or more simply, just to use +, −, ×, and / (or ÷) to represent the arithmetic operators. Then they used symbolic addresses to take the place of real binary addresses, etc. Of course, the language transformed in this way is no longer the computer language, or the original computer instruction set, although it basically corresponds to the computer instruction set and the two are completely equivalent. This was the first step by which people broke away from the computer language. Though it was a minor step, it was crucial. It indicates that people need not be confined to the computer instruction set; they may use a more convenient language to develop programs for computers. This kind of language is called assembly language.
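As a rough illustration of the idea (and not of any particular real machine), the following Python sketch shows how a toy assembler might map mnemonic operation codes and symbolic addresses to numeric codes; the opcode values and the one-word instruction layout are invented for the example.

OPCODES = {"ADD": 0x01, "SUB": 0x02, "MUL": 0x03, "DIV": 0x04,
           "LOAD": 0x10, "STORE": 0x11}   # invented numeric operation codes

def assemble(lines):
    symbols = {}        # symbolic address -> numeric address
    program = []
    for line in lines:
        op, operand = line.split()
        if operand not in symbols:              # allocate an address on first use
            symbols[operand] = len(symbols)
        # one "machine word": the operation code and the operand address
        program.append((OPCODES[op], symbols[operand]))
    return program

print(assemble(["LOAD x", "ADD y", "STORE z"]))   # [(16, 0), (1, 1), (17, 2)]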
Here the model given above is still suitable. As shown in Fig. 1.2, the left side of the bottom edge of the triangle represents any program written in assembly language, which we call the source program, and the right side of the bottom edge is the totally equivalent program written in the computer instruction set, which is produced by the assembler on the top of the triangle and bears the name of the object program or target program. The assembler plays the role of the compiler, whose duty is to translate the source program into the executable object program written in machine language. Therefore, the assembler itself must also be executable on the computer, and through its operation it produces the object program as its output.
Hence the assembler is an early version of the compiler. As the language in which source programs were written was assembly language, it was only a
simple adaptation of the machine instruction set (e.g., the operation code was a mnemonic code for the original one). Hence it is also called a low-level language. Here the word low means that it is machine-oriented (low-level) rather than mankind-oriented (high-level). The assembler is also a low-level form of compiler, as it does not use many of the compilation principles which are used in compilers for high-level programming languages.
After the success of assembly languages and their assemblers, people started designing mankind-oriented, high-level programming languages. The common feature of these languages is that they broke away from the restriction of the computer instruction set. They adopted a subset of a commonly used natural language (in general, English) and established a grammar to describe the statements or elements which people use to develop programs. These languages are called procedure-oriented languages, or simply procedural languages, or imperative languages. The earliest programming languages include FORTRAN (which stands for FORmula TRANslation; it was first designed as early as 1954 [3]), ALGOL 60 [4], and COBOL (which stands for COmmon Business Oriented Language; it was first designed in 1959, and its success was strongly influenced by the United States Department of Defense). In terms of the emergence of programming languages, the 1960s were stormy. It was said that in that period over two thousand languages were developed, but only thirteen of them ever became significant either in terms of concept or usage. Among them, APL (which stands for A Programming Language, developed by Dr. Kenneth Iverson at IBM [5]) is an interactive language. It devised a powerful yet compact notation for computation which incorporated many concepts from mathematics. PL/1 (which stands for Programming Language/1) is suitable for scientific computation; with the 1 in its name it probably intended to be number one in terms of its great deal of functionality. LISP (which stands for LISt Processing) was developed by McCarthy and his co-workers as a conceptually simple language for handling symbolic expressions, with artificial intelligence as its domain [6]. PROLOG (which stands for PROgramming in LOGic) is another effort aimed at artificial intelligence. SNOBOL (developed in the mid-1960s at Bell Telephone Laboratories [7]) is a language whose main strength is in processing string data. As the name SIMULA 67 indicates, SIMULA was designed in 1967 and had simulation as its major appli-
cation domain. It was later refined in CLU, Euclid, and MODULA [8]. GPSS and SIMSCRIPT [9] provided examples showing that conventional programming languages can be, and have been, augmented so that simulations can be easily described. A later development of programming languages was the coming of the general-purpose language called ADA [10], named in honor of Ada Augusta, Countess of Lovelace, the daughter of the famous poet Byron. She collaborated with Charles Babbage (1792 – 1871), who between 1820 and 1850 designed two machines for computation. One relied on the theory of finite differences, so he called it the Difference Engine. The other embodied many of the principles of a modern digital computer, and he called it the Analytical Engine. Ada, as the collaborator of Charles Babbage, helped him develop programs for the Analytical Engine. Therefore she has recently been recognized as the first programmer. Another language that later became very popular is C [11]. It was initially used for writing the kernel of the operating system UNIX.
Apart from a few (if any) exceptions, the languages mentioned above are basically all procedure-oriented languages. After the software crisis that took place in the late 1960s, the structured programming method was proposed, and it hastened the birth of Pascal (named in honor of the French mathematician Blaise Pascal and designed by the Swiss computer scientist Niklaus Wirth [12]). Another methodology proposed to solve the software crisis is the object-oriented software design method, and it led to the production of object-oriented languages. For example, based upon the C language, C++ was developed, and soon afterwards Java was also designed based upon C. In addition, SMALLTALK [13] is also of this kind.
As hardware develops unceasingly, it also puts forward new requirements for software. New computer architectures such as distributed systems, parallel computer systems, computer networks, etc. all pose new requirements and challenges to computer programming. New languages that meet these needs come out in succession.
No matter how languages change, one thing remains unchanged: the source programs written in these languages must be compiled first before they become executable object programs on computers. That is to say, they obey the model shown in Fig. 1.3.
There are two triangles in Fig. 1.4. At first the second triangle is put to run. After its run, it yields the compiler’s object program, which in turn replaces the compiler on the top of the first triangle and is executable. Via its run on the computer, it also yields the object program that corresponds to the source program. That is what we really need.
The model can be extended further. For example, one uses language A to write the source program, and the compiler is written in B. Obviously this compiler cannot be executed before it is compiled to machine language. Now the compiler in B can be regarded as a source program, and its compiler is written in C. Once again, the compiler in C is taken as a source program; it is compiled by a really executable compiler in machine language.
The sequence of programs works backward. The compiler in machine language first translates the compiler in C into an executable compiler. Then, in its turn, that compiler translates the compiler in B into machine language. Finally, by its run, it translates the source program into an executable object program.
The process can be extended to any number of levels. As long as the last level is executable, the backward process can continue to transform the previous level into an executable one, which in turn transforms its own previous level, until the first level can be run. Then the whole compilation process ends.
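The backward chain can be pictured with a small Python sketch; the language names, and the idea of marking a language “runnable” once its compiler can be executed, are purely illustrative assumptions.

# Each compiler is described by (language it translates, language it is written in).
# Only programs written in MACHINE code can run directly.
chain = [
    ("A", "B"),        # compiler for language A, written in B
    ("B", "C"),        # compiler for language B, written in C
    ("C", "MACHINE"),  # compiler for language C, already executable
]

def bootstrap(chain):
    runnable = {"MACHINE"}                  # languages whose programs can already run
    for src, impl in reversed(chain):       # work backward from the last level
        assert impl in runnable, f"the compiler for {src} cannot run yet"
        print(f"run the {impl}-coded compiler: programs in {src} become compilable")
        runnable.add(src)
    return runnable

bootstrap(chain)    # in the end, source programs written in A can be compiled and run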
In designing a programming language, a number of criteria must be kept in mind in order to make the language welcomed by users and qualified as a quality language.
way that syntactic and logical errors are both discouraged and easily discovered. Comments both enhance comprehensibility and help the reader of the program check its correctness. Hence any designer of a language should provide this facility in the language he/she designs. The programmer, when designing a program, should also make use of the facility to enhance the reliability of the program.
5. Machine independence
A high-level programming language is intended to be used on a variety of machines; of course, one of its goals is the ability to move programs from one machine to another.
6. Generality
The idea of generality is that all features should be composed from different aspects of a few basic concepts. That also means that related concepts should be unified into a single framework, as the class does in object-oriented languages.
7. Extensibility
The reason for taking this as a desirable feature is that translators are usually forced to choose one representation for all objects of a given type, and this can be very inefficient. A programming language should allow extension of itself via simple, natural and elegant mechanisms.
Almost every language provides a definition mechanism for subroutines or subprograms. In developing large programs, a great part of the programmer's task can be regarded as extending the language, since he/she has to decide how to use the primitive features of the language to simulate the data structures of the problem. Hence, from the conceptual point of view, this amounts to extending the original language to include the simulated structures. Moreover, the hardware environment that has rapidly developed and changed recently also requires software to change to match it; in particular, parallel computer systems, clusters, distributed computer systems, etc. all require programming languages suited to them. Later in the book we will discuss the computer architecture for explicitly parallel instruction computing (EPIC), which is an extension of the very long instruction word (VLIW) architecture. These developments in hardware all require new programming languages and compilers suited to them.
8. Provability
It is desirable that when a program is developed, there is also a mechanism to carry out the verification of the program. Is there a formal definition of all features of the language? If so, this will permit formal verification of programs. However, the cost of the process of formal verification is very high, whether unaided or aided by machine, and it requires a high-level
From the discussion above, we know that programs written in high-level programming languages need to be compiled before they can be run or executed on a computer. Therefore, writing programs is somewhat like the situation in which the communication between two persons needs a third person to translate. In the process, people are concerned about efficiency — the execution of programs. In practice, the focuses or goals may vary, and if the goals are different, the results may also be different. For example, in a course on compilers, in view of the nature of the course, we are naturally concerned about the compilers themselves. The goals of compilers may be the following:
1. To develop a compiler as small as possible
According to the energy-saving principle, of course we may take it as our goal to produce a compiler as small as possible. In other words, it is all right as long as the compiler can handle the basic tasks of compilation. Such a compiler, however, may not be complete, as it may not be able to handle sophisticated situations. The case that corresponds to this situation may be a subset of the language: the compiler considers only the basic elements of the language rather than the whole language. This kind of compiler may be assigned as a student project with the purpose of providing training practice for students. For training students to master the basic skills of compilers, this kind of compiler may serve well, but in practical applications it is far from complete and is unqualified to fulfill practical compilation tasks.
2. To create a compiler that possesses a better ability of diagnosis and recovery from failures
A compiler should be able to discover the errors within the source programs written by the users, not only static errors but also dynamic errors. After an error is found, it should also determine the source of the error and subsequently present a hint for correction. Only this kind of compiler can be considered user-friendly. It is also very helpful for users in guaranteeing the correctness of their programs. However, it does not guarantee that the object programs produced are efficient. This kind of compiler is suitable for the teaching environment, as its compilation and hinting ability is instructive for students. It belongs to the so-called dirty compiler category. It can also be used as a preliminary compiler, after which a clean or optimizing compiler may work again on the source programs and produce optimized object programs with high efficiency.
3. To produce a compiler that will compile flexible and efficient object programs
Beyond producing correct object programs from source programs, this goal requires that the object programs have higher efficiency. Therefore, apart from the compilation, the compiler also implements the optimization
of object programs.
If we pay attention to the process by which the object programs are yielded, or to the final product, the object programs, the possible goals can be as follows.
1) The time spent by the compiler to translate the source program is the least. If this is the goal, we must require that the compiler compiles the source program and confirms its correctness at the fastest speed, and then generates the object program at the fastest speed as well. As for the efficiency of the object program, that is not its business and is out of its consideration.
2) The object program which the compiler generates is the most efficient. This is in contrast to the first goal, as its focus is the efficiency of the object program rather than the speed or efficiency of the compilation.
3) The size of the object program which the compiler generates should be as small as possible. Notice that 2) was concerned with time efficiency, while here the concern is space efficiency; therefore, the two are not equivalent. Of course, in general, the shorter the program is, the faster it runs. However, it is not always so. For example, a program may be short but contain lots of loops; then it may be time-consuming. Hence the goal here is pursued mainly from the point of view of the space which the object program occupies. If the memory space of the computer is limited, one may take this as the goal.
From the discussion above, we can see that the goals of writing compilers may vary, and it is impossible to require that compilers written by different groups of people, or written for different purposes, simultaneously meet all the same requirements. As with developing other systems, we can only realize a compromise among the different goals. Now we focus on the compiler and the compilation process. As the specific goal of a compiler is to translate programs written in some language into the target language of a kind of computer, the compiler is used to establish the correspondence between the two sides — the source language and the computer. In other words, for a programming language and a kind of computer, there needs to be a compiler for the language that runs on that computer. If there are m programming languages and n computer systems, according to the correspondence between language and computer given above, we need to develop m × n compilers. Of course, this is not the situation we look for, as it implies a tremendous workload. In order to reduce the workload, the approach we take is to find out which parts of the compiler are computer-related and which parts are computer-independent. The computer-independent parts we make shared by all compilers; only the computer-related parts do we design separately for each different computer. For this very reason, the compiler we develop is not written directly in the computer instruction set; only in this way can it be unrelated to a specific computer. The work that needs to relate to the specific computer should be postponed as late as possible (for example, let it happen when the compiler is really working on compilation, rather than when it was de-
veloped). By this effort, the number of compilers for m languages and n computers may be reduced from m × n to m + n. Here we just briefly introduce the idea; we will expound on it in detail later in the book.
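The m + n figure can be made concrete with a small Python sketch in which every language has a front end producing a shared, computer-independent intermediate representation (IR) and every computer has a back end consuming it; all names and the IR format here are invented for illustration.

# 3 front ends + 2 back ends = 5 components, yet all 3 x 2 = 6 pairings work.
front_ends = {                              # computer-independent parts
    "Lang1": lambda src: f"IR({src})",
    "Lang2": lambda src: f"IR({src})",
    "Lang3": lambda src: f"IR({src})",
}
back_ends = {                               # computer-related parts
    "MachineA": lambda ir: f"codeA[{ir}]",
    "MachineB": lambda ir: f"codeB[{ir}]",
}

def compile_for(language, machine, source):
    ir = front_ends[language](source)       # shared analysis phase
    return back_ends[machine](ir)           # machine-specific synthesis phase

print(compile_for("Lang2", "MachineB", "x := 1"))   # codeB[IR(x := 1)]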
Further, we will explain the working principles of compilers. We begin our discussion with languages in general. When we study a language, no matter whether it is our native language or a foreign language, the first thing we study is the words, i.e., the individual words that stand for things. This includes the spelling, the writing, the pronunciation, etc. Then we study the grammar with which the individual words may be linked together to form a meaningful sentence with correct grammar. As for the compiler, its working process contains two phases, the analysis phase and the synthesis phase. The analysis phase again includes two parts: lexical analysis and syntactic analysis. Lexical analysis starts from the input of the source program, which is considered as an input character stream. The lexical analysis has to distinguish the words in sentences; they include identifiers, constants, keywords, variable names, operators, punctuation symbols, etc. At the same time, it also has to check the correctness of the spelling or writing of the words. Only when they are all correct may the next analysis, i.e., the syntactic analysis, be called on. And in order for the syntactic analysis to work more easily, all the characters in the input form are transformed into an intermediate code form. In this respect, it is somewhat similar to the product
of language understanding in the mind. The question now is: in order for the neural system in our brain to process utterances, what representations result in memory when listeners understand utterances or texts? What, for example, would be stored in memory when you hear “The kid is on the bed”? Research has suggested that the meaning representation begins with basic units called propositions [17, 18]. Propositions are the main ideas of an utterance. They are a kind of intermediate code form that is easy to process and that produces the understanding. For “The kid is on the bed”, the main idea is that something is on something else. When one reads the utterance, he/she extracts the proposition On and understands the relationship which it expresses between the kid and the bed. Often propositions are written like this: On(kid, bed). Many utterances contain more than one proposition. Consider
“The dog watches the kid playing on the ground board”. We have as the first component proposition On(kid, ground board). From that, we build up
Playing(kid, On(kid, ground board)).
Finally we get to
Watch(dog, Playing(kid, On(kid, ground board))).
The intermediate code form gives every language unit the same format, which we call a token. The tokens are linked together to represent the original sentence.
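In programming terms, such nested propositions are simply nested records; the following Python fragment is only a toy illustration of the idea, with the field names chosen arbitrarily.

from collections import namedtuple

Prop = namedtuple("Prop", ["relation", "args"])   # one proposition: a relation and its arguments

on      = Prop("On", ("kid", "ground board"))
playing = Prop("Playing", ("kid", on))
watch   = Prop("Watch", ("dog", playing))         # Watch(dog, Playing(kid, On(kid, ground board)))

print(watch)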
The syntactic analysis takes the token sequence as input; then it analyzes each sentence based upon the grammar of the programming language. If,
after this, it does not find any error in the program (the sentence sequence), it further transforms the source program into an intermediate code representation so that the subsequent synthesis phase may work on the representation and transform it into the target program. Hence the working process may be shown as in Fig. 1.5.
The symbol table of Fig. 1.5 represents the structure in which a record is kept for each identifier; the constant table, likewise, represents the structure in which a record is kept for each constant. In the symbol table, apart from the identifier and the address allocated to it (not the real memory address, only a relative address), there are also fields for its various attributes. This kind of data structure may accelerate the search for the record of each identifier, and it can also accelerate storing the identifier into the record or retrieving the identifier from it. When the lexical analysis is working and meets an identifier for the first time, if the lexical analysis confirms that it is an identifier, this is called the defining occurrence of the identifier; a later occurrence is called an applied occurrence. When the defining occurrence appears, the compiler puts the identifier into the symbol table and allocates an address to it according to the order in which it occurred; the allocated address is also stored in the table. Meanwhile, based on the definition or declaration made for it in the source program, the relevant attributes are also put into the table. On an applied occurrence, the identifier is transformed into intermediate form according to the record obtained from the defining occurrence, and it is also necessary to check whether the attributes implied by the applied occurrence are consistent with those of the defining occurrence. If they are not consistent, the lexical analysis will report an error there.
The constant table is similar to the symbol table. For a constant, lexical analysis first needs to transform each character that represents a digit (if the constant represents a signed integer, it may contain a sign + or −; if it is a real or floating-point number, it may also contain +, −, a decimal point, and an exponent symbol) into the corresponding numeric value. In the process, it is also required to check whether the constant is correct or not. After the correctness is confirmed, the constant is put into the constant table, an address is assigned to it, and its attributes are put into the table as well.
For more concrete details of the symbol table and constant table, we will explain them further later in the book.
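As a rough sketch (not the book's actual data structure), a symbol table of the kind just described can be modelled in Python as a dictionary keyed by the identifier's name; the field names and the error handling are assumptions made for illustration.

class SymbolTable:
    """Toy symbol table: one record per identifier, addressed in order of definition."""
    def __init__(self):
        self.records = {}              # name -> {"address": ..., "attribute": ...}

    def define(self, name, attribute):
        # defining occurrence: allocate a relative address and store the attributes
        if name in self.records:
            raise KeyError(f"{name} is already defined")
        self.records[name] = {"address": len(self.records), "attribute": attribute}

    def apply(self, name, expected_attribute):
        # applied occurrence: look the record up and check attribute consistency
        record = self.records[name]    # a KeyError here means an undeclared identifier
        if record["attribute"] != expected_attribute:
            raise TypeError(f"attribute mismatch for {name}")
        return record["address"]

table = SymbolTable()
table.define("n", "int")
table.define("a", "real")
print(table.apply("a", "real"))        # 1, the relative address allocated to a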
Error handling may be carried out in both the lexical analysis phase and the syntactic analysis phase, and even in the later synthesis stage (including semantic analysis). Actually, the lexical analysis and syntactic analysis usually handle the majority of the errors detected by the compiler. The errors that can be found by lexical analysis include such errors as input characters that cannot be linked to form any symbol of the language, while the errors that can usually be found by syntactic analysis include such errors as a token stream that violates the grammar rules or structural rules of the language. During the semantic stage, the compiler intends to detect the following kind of construction: one that is correct in syntactic structure but simply has no meaning for the operation concerned. For example, we may want to perform an addition of two identifiers, but one identifier may be the name of an array while the other is the name of a procedure. Error handling should not stop the working of the compiler after it discovers an error, so that the compiler can continue the compilation and continue to find more errors (if any). Definitely, the user prefers knowing about more errors in his/her program to knowing about only one.
We have outlined the analysis stage in the preceding part. As we mentioned before, in the complete compilation process, after the analysis stage has been finished, the next stage is the synthesis stage, whose tasks may be divided into the generation of intermediate code, the optimization of the code, and the generation of target code. Fig. 1.6 shows the process.
a := (1 + r)↑n × a
var n : int,
a, r : real,
a := 1,
i.e.,
a := (1 + r) ↑ n × a.
The compilation of the statement begins with the input of the character stream that constitutes the program. Then the lexical analysis works first. It transforms the characters into tokens. It confirms that var, int, real are all keywords of the language, and that n, a, r are identifiers. The identifier n has the attribute integer, while a and r have the attribute real number. There is also the constant 1, which is an integer. Under this assumption, the symbol table and constant table are shown in Tables 1.1 and 1.2.
Table 1.1 Symbol table
Symbol name Intermediate code Attribute
n id1 int
a id2 real
r id3 real
With the symbol table, we can draw the process of the compilation of the statement according to the compilation process given in Fig. 1.6. Since there is no error in the program, we omit the error handling part.
In Fig. 1.7, for the sake of simplicity, we use EXP in place of the exponentiation operation id1 ↑ n.
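To make the token idea concrete, the statement a := (1 + r)↑n × a might be encoded as a token list roughly like the following Python fragment, using the intermediate codes of Table 1.1; the token format and the code c1 for the constant 1 are assumptions made for illustration.

# a := (1 + r) ↑ n × a, tokenized with the intermediate codes of Table 1.1
tokens = [
    ("id", "id2"),        # a
    ("assign", ":="),
    ("lparen", "("),
    ("const", "c1"),      # the integer constant 1 (assumed constant-table code)
    ("op", "+"),
    ("id", "id3"),        # r
    ("rparen", ")"),
    ("op", "↑"),          # exponentiation
    ("id", "id1"),        # n
    ("op", "×"),
    ("id", "id2"),        # a
]
print(len(tokens))        # 11 tokens for the whole statement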
graph on the field does not cover these topics, it is hard to regard it as a book on the field any more, or it can only be regarded as an obsolete book on the field. However, as these fields are still growing and not yet mature, we can only introduce the current state of the art.
Problems
Problem 1.1 For the compilation of programming languages, why are the
two phases — analysis and synthesis necessary? For the translation of
natural languages, what phase do you consider important?
Problem 1.2 From the design of programs, expound the necessity of lan-
guages for thinking.
Problem 1.3 According to your understanding of the text, analyze and compare the pros and cons of the single-pass and multi-pass scanning techniques.
References
[1] Pinker S (1994) The language instinct: How the mind creates language.
Morrow, New York.
[2] Ritchie DM et al (1978) The C programming language. Bell Syst Tech J, 57,
6, 1991 – 2020.
[3] Backus JW et al (1957) The FORTRAN automatic coding system. Proc
Western Jt Comp Conf AIEE (now IEEE), Los Angeles.
[4] Naur P (ed) (1963) Revised report on the algorithmic language ALGOL 60.
Comm ACM 6(1): 1 – 17.
[5] Iverson K (1962) A programming language. Wiley, New York.
[6] McCarthy J et al (1965) LISP 1.5 programmer’s manual, 2nd edn. MIT
Press, Cambridge.
[7] Farber DJ et al (1964) SNOBOL, a string manipulation language. JACM,
11(1): 21 – 30.
[8] Wirth N (1977) MODULA, a language for modular programming. Softw Prac
Exp. 7: 3 – 35.
[9] Kiviat P et al (1969) The SIMSCRIPT II programming language. Prentice
Hall, Englewood Cliffs.
[10] Wirth N (1971) The programming language pascal. Acta Inf, 1(1): 35 – 63.
[11] Knuth DE (1964) The remaining trouble spots in ALGOL 60. Comm ACM,
7(5): 273 – 283.
[12] Sammet J (1969) Programming Languages: History and fundamentals. Pren-
tice Hall, Englewood Cliffs.
[13] Goldberg et al (1980) Smalltalk-80: The language and its implementation.
Addison-Wesley, Boston.
[14] Horowitz E (1983) Fundamentals of programming languages. Springer,
Berlin.
[15] United States Department of Defense (1980) The Ada Language Reference
Manual, Washington D. C.
[16] Knuth DE (1974) Structured programming with GOTO statement. Comp
Surveys, 6(4): 261 – 301.
[17] Clark HH, Clark EV (1977) Psychology and language: An introduction to psycholinguistics. Harcourt Brace Jovanovich, New York.
[18] Kintsch W (1974) The representation of meaning in memory. Erlbaum, Hillsdale.
Chapter 2 Grammars and Languages
In the development of human language, the language itself was created first, without the establishment of a grammar. As the knowledge of mankind was enriched and developed, grammar was created to help the study of the language and to make the language normalized. As any natural language is very complicated and its grammar was founded after the language, no matter what the language is, no grammar can totally describe all the phenomena of the language. In addition, there exist ambiguities in natural languages. For human beings, in general, these phenomena of ambiguity can be handled by humans themselves. For computers, however, it is hard to accept and even to understand ambiguity. Programming languages are different from natural languages in that the language and its grammar are created almost at the same time. The purpose of the grammar is to help the users of the language avoid any ambiguity and express their meaning correctly. A program should be correctly written in order to be run on a computer with correct results. Therefore, the research on compilers should start with the discussion of the relation between grammars and languages.
als, punctuation symbols, arithmetic operators, the Greek alphabet, etc. all are characters. The character, like the point in geometry, is not defined further; we suppose that it is well-known common sense. In the following, we just use lower-case letters to denote characters, while Latin letters are used for character lists.
Definition 2.2 Alphabet. A finite set of characters. In general, if the Latin alphabet is taken as the alphabet, then upper-case letters are used for the purpose. For example, we have A = {a, b, c, 1, 2}.
Definition 2.3 Character string. Any string that consists of 0 or more characters is called a string. The string that consists of 0 characters is called the empty string, denoted “ε”; it indicates that there is no character in the string. If A is defined as an alphabet as aforementioned, then a, 1, 1a, abc, 1ba, . . . all are character strings over A, or briefly strings. Usually strings are denoted by Greek letters such as α, β, etc.
Definition 2.4 The operations on strings. Given A = {a, b, c, 1, 2}, the strings over A are determined. There are three kinds of operations on strings.
1) Concatenation or juxtaposition. For example, a and 1 are strings; then a1 and 1a are concatenations, or juxtapositions, of them. In general, if α and β are strings, then αβ and βα are strings too.
2) Disjunction, or the operation of selecting one. If α and β are strings, α | β represents selecting one of the two; the result is still a string. Obviously, the operation satisfies the commutative law, i.e., α | β = β | α.
3) Closure. Given a string α, we can define the closure operation as follows.
α∗ = ε | α | αα | ααα | · · ·
   = ε | α¹ | α² | α³ | · · · .    (2.1)
This is also called the Kleene closure. We can also define the positive closure as follows,
α⁺ = α | αα | ααα | · · ·
   = α | α² | α³ | · · · .    (2.2)
A∗ = ε | A | A² | A³ | · · ·
   = ε | (α | β | γ) | (α | β | γ)(α | β | γ) | (α | β | γ)(α | β | γ)(α | β | γ) | · · · .    (2.3)
A⁺ = A | A² | A³ | · · ·
   = (α | β | γ) | (α | β | γ)(α | β | γ) | (α | β | γ)(α | β | γ)(α | β | γ) | · · · .    (2.4)
α∗ = ε | α⁺,    (2.5)
A∗ = ε | A⁺.    (2.6)
We need to point out the difference between the empty string ε and the empty set ∅. The empty string is a string without any character inside, while the empty set is a set without any element. The two share the fact that both contain nothing, but they are different, as one has no characters in it and the other has no elements (which may be characters or something else). For sets, we may also define closure.
Definition 2.5 The closure of set. Let A = {a, b, c} be a set. The closure
operation of set A is defined as
A∗ = ε ∪ A ∪ AA ∪ AAA ∪ · · · = ε ∪ A ∪ A² ∪ A³ ∪ · · ·    (2.7)
Similarly, we have
A⁺ = A ∪ A² ∪ A³ ∪ · · ·    (2.8)
What we get in this way is still a set, but it can be regarded as string too,
the set of strings.
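A closure is an infinite set, but its members built from a bounded number of elements can be enumerated mechanically; the following Python sketch merely illustrates Definition 2.5 for A = {a, b, c}.

from itertools import product

def closure_up_to(A, max_factors):
    """Enumerate the part of A* obtained by concatenating at most max_factors elements of A."""
    result = {""}                          # "" plays the role of the empty string ε
    for k in range(1, max_factors + 1):
        for combo in product(sorted(A), repeat=k):
            result.add("".join(combo))
    return result

print(sorted(closure_up_to({"a", "b", "c"}, 2)))
# ['', 'a', 'aa', 'ab', 'ac', 'b', 'ba', 'bb', 'bc', 'c', 'ca', 'cb', 'cc']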
Definition 2.6 Regular expression. Given a set, e.g., A = {a, b, c}, the regular expressions over A are defined as follows:
1) Each element of A is a regular expression.
2) If p and q are regular expressions, then the results of the following operations are still regular expressions:
(1) concatenation, i.e., pq, pp, and qq;
(2) disjunction, i.e., p | q or q | p;
(3) closure, i.e., p∗ or q∗.
3) Returning to 2), starting from the regular expressions obtained by 1) or 2) and repeatedly performing the operations in 2), everything we get is a regular expression.
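These three operations correspond directly to the operators of practical regular-expression libraries; the small Python sketch below (using the standard re module) shows one such pattern, chosen purely as an example.

import re

# Over A = {a, b, c}: disjunction (a|b), concatenation with c*, closure of c.
pattern = re.compile(r"(a|b)c*")

print(bool(pattern.fullmatch("acc")))   # True:  an a followed by two c's
print(bool(pattern.fullmatch("cab")))   # False: starts with c, not with a or b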
2.3 Grammar
α → β, (2.9)
where α is called the left part of the production while β is called the right
part, and
α ∈ (VN ∪ VT )+ ,
β ∈ (VN ∪ VT )∗ .
That means that α is a nonempty string that consists of terminals and non-
terminals, while β is a string that also consists of terminals and nonterminals
but it may be empty.
Notice that the left part of productions cannot consist of terminals alone
as we have mentioned that terminal cannot be used for derivation.
Definition 2.8 Derivation. Given a grammar G = (VN , VT , P, S), a deriva-
tion means the following step:
If
α → uwTvx (2.10)
is a production in P, where u, w, v, x ∈ (VN ∪ VT )∗ , and
T→y∪z (2.11)
In these productions, the only one that shrinks is BSC → BC; for it we have |α| > |β|. Apart from this one (excluding those that have empty right parts),
or
A → b.
It may be as
A → Ba (2.17)
or
A → b.
2.4 Language
In Section 2.3, we defined grammars and pointed out that the only operation for a grammar is derivation. The purpose of derivation is to generate strings that are composed of terminals. All the strings that can be derived from the grammar form the language; it is the language generated, or accepted, by the grammar. In this section, we will exemplify a few languages generated by their corresponding grammars [2].
Example 2.6 Consider the language generated by the grammar given in
Example 2.1:

S → ABSCD,
BA → AB,
DC → CD,
A → aA,
B → bB,
C → cC,
D → dD,            (2.24)
BSC → BC,
A → ε,
B → ε,
C → ε,
D → ε,
S → ε.
At first, we have S → ABSCD → ABCD → · · · → ε, applying BSC → BC and then A → ε, B → ε, C → ε, D → ε.
Hence we have ε ∈ L(G), where L(G) means the language L generated by the grammar G. Furthermore, S → ABSCD → ABCD → aABCD → aBCD → · · · → a, so a ∈ L(G). Similarly, we may get b, c, d ∈ L(G). In addition,

S → ABSCD → ABABSCDCD
  → ABABCDCD
  → AABBCCDD
  → · · ·
  → a^m b^n c^p d^q.
In the process the three productions are important for obtaining the form we
desire to have. These productions are
CA → AC,
BA → AB,
CB → BC.
If two IDs are related via |–M, then we say that the second ID is yielded from the movement of the first ID. If an ID is produced from a finite number (including zero) of movements of another ID, then we say that they are related via |–*M. When no confusion will happen, either |–M or |–*M may be simplified as |– or |–*.
K = {q0 , q1 , q2 , q3 , q4 },
Γ = {0, 1, x, y, b},
Σ = {0, 1},
H = {q4 }.
At the beginning, let the tape contain 0^n 1^n, followed by an infinite number of b's, i.e., blanks. Taking 0^2 1^2 as an example, according to the definition of δ, we have the following computation:
q0. Under the state q0, M moves right, looks for the leftmost 0 and begins another cycle. When M moves right to look for a 1 under the state q1, if it meets b or x before it meets 1, that means that the numbers of 0's and 1's are not consistent; hence the string is rejected. As for state q3, it is the state that replaces q0 when M finds y on the tape (notice that y is not an input symbol; the same holds for x; they are temporary symbols introduced during the computation). The state q3 is used for scanning the y's and checking whether there is any 1 left. If there is no 1 left, which means that b follows the y's, then q3 is changed to q4, the computation comes to an end, and the input string is accepted. Otherwise the computation cannot finish, or the input string cannot be accepted. Therefore, state q4 is the final state.
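Since the transition table δ itself is not reproduced in the fragment above, the following Python sketch gives one possible set of transitions consistent with the description (mark a 0 with x, mark the matching 1 with y, and repeat); it is an illustration, not necessarily the book's exact machine.

def run_tm(tape_str):
    """Simulate a Turing machine recognizing {0^n 1^n | n >= 1}."""
    # One possible transition table; entries are (new state, written symbol, head move).
    delta = {
        ("q0", "0"): ("q1", "x", +1),   # mark a 0 with x, go looking for a 1
        ("q1", "0"): ("q1", "0", +1),
        ("q1", "y"): ("q1", "y", +1),
        ("q1", "1"): ("q2", "y", -1),   # mark the matching 1 with y, head back left
        ("q2", "0"): ("q2", "0", -1),
        ("q2", "y"): ("q2", "y", -1),
        ("q2", "x"): ("q0", "x", +1),   # found the last marker, begin another cycle
        ("q0", "y"): ("q3", "y", +1),   # no unmarked 0 left, check the trailing y's
        ("q3", "y"): ("q3", "y", +1),
        ("q3", "b"): ("q4", "b", +1),   # only blanks follow: enter the final state q4
    }
    tape = list(tape_str) + ["b"]       # b plays the role of the blank symbol
    state, head = "q0", 0
    while state != "q4":
        key = (state, tape[head])
        if key not in delta:            # no move defined: the input is rejected
            return False
        state, tape[head], move = delta[key]
        head += move
    return True

print(run_tm("0011"), run_tm("001"))    # True False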
Example 2.11 Consider a Turing machine M = (K, Γ, Σ, δ, q0 , b, {q2 }),
where
K = {q0 , q1 , q2 },
Γ = {u, a, b}, (u is the left boundary symbol of the tape)
Σ = {a},
H = {q2 }.
where
K = {q0 , q1 },
Γ = {u, a, b}, (u stands for the left boundary symbol)
Σ = {a},
H = {q1 }.
Therefore, in this circumstance the Turing machine only moves back and forth between the left boundary and the first non-blank character a; it can never move onto the blank character where it would enter the final state. However, if the input string is ubabb. . . , which means that the blank character follows the left boundary symbol, then we have
By moving only one cell, it has encountered the blank character, and it immediately enters the final state.
We have given the definition of the Turing machine and three examples of such machines. From these we may see that a Turing machine carries out the computation on the input on its input tape until it enters the final state. The input that leads the Turing machine to enter the final state is regarded as a sentence of the language the machine recognizes or accepts. Therefore, we can now define the language which a Turing machine recognizes or accepts [4].
Definition 2.19 The language accepted or recognized by the Turing ma-
chine L(M). The set of those words in Σ∗ that cause M to enter a final
state when M starts its operation with state q0 and the tape head of M
is initially located at the leftmost cell. Formally, the language accepted by
M = (K, Γ, Σ, δ, q0 , b, H) is
{w | w ∈ Σ∗ ∧ q0 w |–∗ α1 pα2 ∧ p ∈ H ∧ α1, α2 ∈ Γ∗}.
According to the definition, it is not hard to see that the three examples of Turing machines above recognize their respective languages. The first Turing machine recognizes {0^n 1^n | n ≥ 1}. The language which the second Turing machine recognizes is {a^n | n ≥ 0}. The third Turing machine recognizes the language {b}. From these examples, we can also see that, given a Turing machine that recognizes a language L, without loss of generality we can assume that whenever the input is accepted, the machine enters a final state and no longer moves. There is another possibility, however: for a character or statement which it does not recognize, it may never stop. The third example belongs to this case.
Definition 2.20 Recursive language. Let M = (K, Γ, Σ, δ, b, q0 , H) be a
Turing machine, where H = {y, n} consists of two discriminable final states
y (stands for yes) and n (stands for no). Any terminating instantaneous
configuration uhb, i.e., it terminates after it deletes the contents of the tape.
As we have mentioned, any Turing machine that semi-decides a language can be transformed into an equivalent Turing machine that satisfies the same condition. We need to define a grammar that generates the language L ⊆ (Σ − {u, b})∗ which M semi-decides. It is G = (VN, Σ − {u, b}, P, S). Now we need to specify the components of G.
The nonterminal symbol set VN of G consists of all the states of K, including the start symbol S (the initial state q0 may be used as S), and, in addition, the left boundary symbol u, the blank symbol, and the termination token ↑. Perceivably, the derivations of G will simulate the computation of M backward. We will simulate the computation back to the initial configuration. Consider the string uvaqw↑. Then the productions of G are the actions that simulate M backward. For each q ∈ K and each a ∈ Σ, depending on δ(q, a) there are the following rules:
1) If for some p ∈ K and b ∈ Σ, δ(q, a) = (p, b), then in G there will be the production bp → aq.
2) If for some p ∈ K, δ(q, a) = (p, R), then the corresponding productions in G will be: for all c ∈ Σ, acp → aqc and abq → aq↑ (this rule reverses the extension of the tape to the right via a blank symbol).
3) If for some p ∈ K and a = b ∈ Σ, δ(q, a) = (p, L), then G will have the corresponding production pa → aq.
4) If for some p ∈ K, δ(q, b) = (p, L), then for all c ∈ Σ, G will have the corresponding productions pab → aqc and p↑ → bp↑. This rule reversely deletes the excessive blanks.
We have to point out here that many books in the field usually assert the equivalence of the transition function and the productions merely through the transformation from the former to the latter. They do not indicate the difference between these two things. Actually, however, in a grammar, a complete derivation should be one in which the final string contains no nonterminal, i.e., no element of K; in the string there are only symbols of Σ. Otherwise the derivation is called a dirty derivation.
The productions obtained from the function δ, however, unavoidably contain the nonterminal symbols in K. Therefore, in order to define G completely, it is necessary to include extra productions that are used for deleting the nonterminals. Hence, we stipulate that G also contains the productions used for the transformation of the start of the computation (the termination of the derivation) and the termination of the computation (the start of the derivation). The production
S → ubh↑
forces an acceptable computation to finish the derivation precisely at the termination place. In addition, the production
ubs → ε
will delete the last part on the left side of the input, and ↑ → ε will delete the termination token, leaving only the input string.
The following assertion makes more precise the idea that the grammar G backwardly simulates the computation of M.
Assertion For two arbitrary configurations u1q1a1w1 and u2q2a2w2 of M, u1q1a1w1 −→ u2q2a2w2 holds in M if and only if u2q2a2w2↑ −→G u1q1a1w1↑ holds in G.
The proof of the assertion is omitted here; we just point out that the proof is a case analysis of the properties of the actions of M. We now almost come to the end of the proof of the theorem. We need to prove that for all w ∈ (Σ − {u, b})∗, M terminates on w if and only if w ∈ L(G); and w ∈ L(G) if and only if

S −→G ubh↑ −→∗G ubsw↑ −→∗G w↑ −→∗G w.
From the design of the multi-track Turing machine, we can see that the Turing machine with k tracks is not much different from a Turing machine with k tapes. Actually, in the theorem we just proved above, we used the same idea. We also have the following definition.
Definition 2.29 The Turing machine with a two-way infinite tape. A Turing machine with a two-way infinite tape is denoted by M = (K, Γ, Σ, δ, b, q0, H) as in the original model. As the name implies, the tape is infinite both to the left and to the right. The way we denote an ID of such a Turing machine is the same as for the one-way (to the right) infinite Turing machine. We imagine, however, that there is an infinity of blank cells both to the left and to the right of the current nonblank portion of the tape.
The relation −→M, which relates two IDs if the ID on the right is obtained from the one on the left by a single move, is defined as in the original model, with the exception that if δ(q, X) = (p, Y, L), then qXα −→M pbYα (in the original model, no move could be made in this situation), and if δ(q, X) = (p, b, R), then qXα −→M pα (in the original model, the b would appear to the left of p).
The initial ID is qw. While there is a left end of the tape in the original model, for this Turing machine there is no left end of the tape to “fall off”, so it can proceed left as far as it wishes. The trick behind the construction is to use two tracks on a semi-infinite tape. The upper track represents the cells of the original Turing machine that are at or to the right of the initial position. The lower track represents the positions left of the initial position, but in reverse order. The exact arrangement can be as shown in Fig. 2.4.
In Fig. 2.5, the two-way infinite tape has been transformed into a one-way, two-track infinite tape. The first cell of the tape holds a special marker symbol in its lower track, indicating that it is the leftmost cell, and the following lower-track symbols are the symbols to the left of the initial position, in right-to-left order. The finite control of the Turing machine tells whether it is scanning a symbol appearing on the upper track (corresponding to the original right side of the two-way infinite tape) or on the lower track (corresponding to the original left side of the two-way infinite tape).
We now give a formal construction of M1 = (K1, Γ1, Σ1, δ1, b, q1, H1). The states in K1 are all objects of the form [q, U] or [q, D], where q ∈ K, and the symbol q1 is in K1 too. The second component indicates whether M1 is working on the upper (U stands for upper) or lower (D stands for lower) track. The tape symbols in Γ1 are all objects of the form [X, Y], where X and Y ∈ Γ; in addition, Y may be a special symbol not in Γ. Σ1 consists of all symbols [a, b], where a is in Σ. H1 is {[q, U], [q, D] | q ∈ H}. It should be evident that M1 can simulate M in the sense that while M moves to the right of the initial position of its input head, M1 works on the upper track, and while M moves to the left of its initial tape head position, M1 works on its lower track, moving in the direction opposite to the direction in which M moves. The input symbols of M1 are the input symbols of M on the upper track with a blank on the lower track; such a symbol can be identified with the corresponding input symbol of M, and b is identified with [b, b].
We summarize the idea in the following theorem and omit the formal proof.
Theorem 2.3 Language L is recognized by a Turing machine with a two-
way infinite tape if and only if it is recognized by a Turing machine with a
one-way infinite tape.
We now come almost to the end of the discussion of the Turing machine. In the discussion, we introduced the original concept of the Turing machine, which is one-tape and deterministic, and then we successively modified or extended it to nondeterministic and multitape Turing machines, including extending it from a one-way infinite tape to a two-way infinite tape. However, finally we found that all these extensions or modifications do not change or extend the functions of the original Turing machine. Therefore, the basic fact remains Theorem 2.1: a language is generated by a grammar if and only if it is recursively enumerable, if and only if it is accepted by a Turing machine. In this way, the Turing machine can be used
in syntactic analysis. This is the reason why we introduce the concept of the Turing machine in this compilation-oriented book. The following result is also important for us. Based on the Turing machine, we can identify grammars with a more useful computational model.
Definition 2.30 The grammar computation function. Let G = (VN, VT, P, S) be a grammar, and let f: Σ∗ → Σ∗ be a function. We say that G computes f if for all w, v ∈ Σ∗, the following holds:

SwS −→∗G v

if and only if v = f(w). That is, the string that consists of the input w with a start symbol of G on both sides of w generates a string of Σ∗ under G, and this string is exactly the value v of f(w).
A function f: Σ∗ → Σ∗ is called grammatically computable [6, 7] if and only if there exists a grammar G that computes it. Similar to Theorem 2.1, we have the following theorem.
Theorem 2.4 A function f: Σ∗ → Σ∗ is recursive if and only if it is grammatically computable.
Problems
Problem 2.1 Prove that any grammar can be transformed into an equiv-
alent grammar that has the form uAv → uwv of production rule, where
A ∈ VN and u, v, w ∈ (VN ∪ VT )∗ .
Problem 2.2 Prove Theorem 2.4. For the “only if” direction, given a grammar, construct a Turing machine such that, when it has input w, it outputs a string u ∈ Σ∗ with SwS −→∗G u, if such a string u exists. For the “if” direction, use a proof that is similar to the proof of Theorem 2.1 but in the forward (rather than backward) direction.
Problem 2.3 Design and completely write a Turing machine that scans to the right until it finds two consecutive 0's. The set of characters on the tape is {0, 1, b, u}, and the input set is {0, 1}.
Problem 2.4 Find grammars that generate the following languages:
1) {ww | w ∈ {a, b}∗};
2) {(x^2)↑n | n ≥ 0};
3) {(a^n)↑2 | n ≥ 0};
4) {a^i | i is not a prime}.
Problem 2.5 Under what condition is the Kleene closure of a language L equal to its positive closure?
References
[1] Chomsky N (1956) Three models for the description of language. IRE Trans
Inf Theory 2(3): 113 – 124.
[2] Hopcroft J E, Ullman J D (1969) Formal languages and their relation to automata. Addison-Wesley, Reading, Mass.
[3] Hopcroft J E, Ullman J D (2007) Introduction to Automata Theory, Lan-
guages and Computation. Addison-Wesley, Reading, Mass.
[4] Knuth D E, Trabb Pardo L (1977) Early development of programming lan-
guages. In Dekker M (ed) Encyclopedia of computer science and technology
7. Marcel Dekker, New York.
[5] Ledgard H F (1971) Ten mini-languages: a study of topical issues in program-
ming languages. Computing Surveys 3(3): 115 – 146.
[6] Simon M (1999) Automata theory. World Scientific, Singapore.
[7] Simovici D A, Tenney R L (1999) Theory of formal languages with applica-
tions, World Scientific, Singapore.
Chapter 3 Finite State Automata and Regular
Languages
Example 3.1 Let Σ = {a, b, c}. Then ω1 = acb and ω2 = aababc are two
words over Σ, with |ω1| = 3 and |ω2| = 6. Let ω = λ; then |ω| = 0. If
ω = ab, then λab = abλ = ab.
Definition 3.2 Let Σ be an alphabet, and λ the empty word containing no
symbols. Then Σ∗ is defined to be the set of words obtained by concatenating
zero or more symbols from Σ. If the set does not contain λ, then we denote
it by Σ+ . That is,
Σ+ = Σ∗ − {λ}. (3.1)
A language over an alphabet Σ is a subset of Σ∗ .
Example 3.2 Let Σ = {a, b}. Then
Σ∗ = {λ, a, b, aa, ab, ba, bb, aaa, aab, aba, baa, abb, bab, bba, bbb, . . .},
Σ+ = {a, b, aa, ab, ba, bb, aaa, aab, aba, baa, abb, bab, bba, bbb, . . .}.
Both Σ∗ and Σ+ are languages over Σ. (Here and below, N denotes the set of
positive integers; Z+ is also used for the same set.)
Definition 3.3 Let ω1 and ω2 be two words, and L1 , L2 and L be sets of
words.
1) The concatenation of two words is formed by juxtaposing the symbols
that form the words.
2) The concatenation of L1 and L2 , denoted by L1 L2 , is the set of all
words formed by concatenating a word from L1 and a word from L2 . That is,
L1 L2 = {ω1 ω2 : ω1 ∈ L1 , ω2 ∈ L2 }. (3.2)
3) The complement of L, denoted by L̄, is the set of all words over Σ that
are not in L. That is,
L̄ = Σ∗ − L. (3.3)
4) The star-closure of L, denoted by L∗, is defined by
L∗ = L^0 ∪ L^1 ∪ L^2 ∪ · · · = ⋃_{i=0}^{∞} L^i , (3.4)
and L+, the positive closure of L, is defined by
L+ = L^1 ∪ L^2 ∪ L^3 ∪ · · · = ⋃_{i=1}^{∞} L^i . (3.5)
Example 3.4 If Σ = {0, 1} and L = {0, 10}, then L∗ consists of the empty
word λ and all the words that can be formed using 0 and 10 with the property
that every 1 is followed by a 0.
Definition 3.5 A grammar G is defined as a quadruple
G = (V, T, S, P),
where
V is a finite set of objects called variables;
T is a finite set of objects called terminal symbols;
S ∈ V is a special symbol called the start variable;
P is a finite set of productions.
Definition 3.6 Let G = (V, T, S, P) be a grammar. Then the set
L(G) = {w ∈ T∗ : S =⇒∗ w} (3.7)
is the language generated by G, where S =⇒∗ w denotes a derivation from S
to w in an unspecified number of steps (including zero; if zero steps are
excluded, we write S =⇒+ w).
Example 3.5 Find a grammar that generates the language L = {a^n b^n : n ≥ 0}.
The productions
S → aSb,
S → λ
generate this language.
a string, the accepter either accepts (recognises) the string or rejects it. A
more general automaton, capable of producing strings of symbols as output,
is called a transducer.
There are essentially two different types of automata: deterministic automata
and nondeterministic automata. In a deterministic automaton, each move is
uniquely determined by the current internal state, the current input symbol
and the information currently in the temporary storage. In a nondeterministic
automaton, on the other hand, we cannot predict the exact future behaviour of
the automaton, but only a set of possible actions. One of the very important
objectives of this chapter and the next is to study the relationship between
deterministic and nondeterministic automata of various types (e.g., finite
automata, push-down automata, and, more generally, Turing machines).
Finite-state automata, or finite automata for short, are the simplest automata
(see Fig. 3.2). In this and the subsequent sections, we shall study the ba-
sic concepts and results of finite automata (both deterministic and non-
deterministic), and the properties of regular languages, with an emphasis
on the relationship between finite automata and regular languages.
A deterministic finite automaton (DFA) M is defined as a quintuple
M = (Q, Σ, δ, q0 , F),
where
Q is a finite set of internal states;
Σ is the input alphabet;
q0 ∈ Q is the initial state;
F ⊆ Q is the set of final states, or accepting states;
and δ is the transition function:
δ : Q × Σ → Q. (3.9)
Remark: The above DFA is defined without output; we can, of course,
define it with additional output as follows:
M = (Q, Σ, U, δ, σ, q0 , F),
where
U is the output alphabet;
σ is the output function
σ : Q × Σ → U. (3.10)
Example 3.6 Consider the DFA
M = (Q, Σ, δ, q0 , F)
= ({A, B, C}, {0, 1}, δ, A, {B}),
where δ is given by
δ(A, 0) = A δ(A, 1) = B
δ(B, 0) = C δ(B, 1) = B
δ(C, 0) = C δ(C, 1) = C
The corresponding transition table is:
state/symbol 0 1
A A B
B C B
C C C
Initial state: A Final state: B
Then the DFA can be represented by a directed graph shown in Fig. 3.3,
where the initial state A has a starting right arrow, and the final state B has
been double circled.
The machine defined above can read a given finite input tape containing
a word and either accepts the word or rejects it. The word is accepted if after
reading the tape, the machine is in any one of the accepting states.
Example 3.7 Consider the machine defined in Example 3.6. Suppose now
that the machine reads the word 00011. Then the following are the actions
of the machine as it reads 00011:
Since the machine is in the final state B after having read the input word, then
the word 00011 is accepted by this machine. However, the machine cannot
accept the word 000110, because
That is, the machine does not stop at the final state B after having read the
word 000110. In fact, it stopped at the state C which is not a final state.
There are several other ways to describe the actions of an automaton. One
very useful way can be described as follows (for the same automaton defined
above and the same word 00011):
It is plain to verify that the automaton described in Fig. 3.3 can accept
the following words:
1,
01,
001,
011,
0000011,
00111111111,
··· ,
0^m 1^n , with m ≥ 0 and n ≥ 1.
In set notation, the set of words L that can be accepted by the DFA is
L = {0^m 1^n : m ≥ 0, n ≥ 1}.
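The action of such a DFA is easy to simulate with a transition table in a program. The following C sketch (illustrative only; the state and function names are our own) encodes the DFA of Example 3.6 and reproduces the acceptance of 00011 and the rejection of 000110 discussed above:

#include <stdio.h>

enum state { A, B, C };

/* Transition table of the DFA in Example 3.6: rows are states A, B, C,
   columns are the input symbols 0 and 1. */
static const enum state delta[3][2] = {
    /* 0  1 */
    {  A, B },   /* from A */
    {  C, B },   /* from B */
    {  C, C }    /* from C */
};

/* Run the DFA on the word w; return 1 if it stops in the final state B. */
int accepts(const char *w)
{
    enum state q = A;                    /* initial state */
    for (; *w != '\0'; w++)
        q = delta[q][*w - '0'];          /* assumes w contains only 0's and 1's */
    return q == B;
}

int main(void)
{
    printf("%d %d\n", accepts("00011"), accepts("000110"));   /* prints: 1 0 */
    return 0;
}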
Example 3.8 Fig. 3.4 shows another example of a DFA, M, which has two
final states.
Fig. 3.4 A DFA that accepts strings with two consecutive 0's or 1's.
M = (Q, Σ, δ, q0 , F)
= ({A, B, C, D, E}, {0, 1}, δ, A, {D, E}),
state/symbol 0 1
A B C
B D C
C B E
D D D
E E E
Initial state: A Final states: D and E
For example, the DFA accepts the following words, each of which contains
two consecutive 0's or two consecutive 1's:
00,
11,
0011110,
01001111000,
110111101001010,
1010101010101010100,
0101010101010101011.
However, it rejects the following words, in which 0's and 1's strictly alternate:
01,
10,
010101010101010101,
0101010101010101010,
1010101010101010101.
An automaton is finite in the sense that there are only finite states within
the automaton. For example, in the automaton in Fig. 3.3, there are only
three states: A, B and C. A finite automaton is deterministic in the sense
that for a given state and a given input, the next state of the automaton
is completely determined. For example, again in the automaton in Fig. 3.3,
given state A and input 0, the next state can only be A.
The transition table of the NFA of Fig. 3.5 is:
state/symbol 0 1
A {A, C} A
B B {B, D}
C E λ
D λ E
E E E
Initial state: A Final state: E
Fig. 3.5 An NFA that accepts strings with two consecutive 0's or 1's.
A grammar G = (V, T, S, P) is said to be right-linear if all productions are
of the form
A → xB, (3.13)
A → x,
where A, B ∈ V and x ∈ T∗ .
A grammar G = (V, T, S, P) is said to be left-linear if all productions are
of the form
A → Bx, (3.14)
A → x.
For example, the grammar G1 with the productions
S → abS,
S → a
is right-linear, and the grammar G2 with the productions
S → S1 ab,
S1 → S1 ab,
S1 → S2 ,
S2 → a
is left-linear. Both G1 and G2 are regular grammars.
By G1 , we can have the following derivations:
S =⇒ abS
=⇒ aba
S =⇒ abS
=⇒ ababS
=⇒ ababa
=⇒ (ab)2 a
S =⇒∗ ababS
=⇒ abababS
=⇒ abababa
=⇒ (ab)^3 a
...
=⇒ (ab)^n a, for n ≥ 1.
S =⇒ S1 ab
=⇒ S2 ab
=⇒ aab
S =⇒ S1 ab
=⇒ S1 abab
=⇒ S2 abab
=⇒ aabab
=⇒ a(ab)2
S =⇒∗ S1 abab
=⇒ S1 ababab
=⇒ S2 ababab
=⇒ aababab
=⇒ a(ab)^3
...
=⇒ a(ab)^n , for n ≥ 1.
x = uvw, (3.15)
|v| > 0, (3.16)
|uv| ≤ N, (3.17)
u v^i w ∈ L, ∀i ≥ 0. (3.18)
The number N is called the pumping number for the regular language L.
To see how the theorem is used, suppose that the language L = {a^n b^n : n ≥ 0}
is regular, and let N be the pumping number for L. We must show that, no
matter what N is, we can find an x with |x| ≥ N that produces a contradiction.
Let x = a^N b^N . According to Theorem 3.5, there are strings u, v, and w such
that Eqs. (3.15) – (3.18) in the theorem hold. From Eqs. (3.15) and (3.17) we
can see that uv = a^k for some k ≤ N, and from Eq. (3.16) we have v = a^j for
some j > 0. Then Eq. (3.18) says that u v^m w ∈ L for all m ≥ 0. But
u v^m w = (uv) v^(m−1) w
= a^k (a^j)^(m−1) a^(N−k) b^N (since w = a^(N−k) b^N and uv = a^k)
= a^(N+j(m−1)) b^N
= a^(N+t) b^N , (letting t = j(m − 1), so t > 0 when m > 1)
so for m > 1 there are t more a's than b's in u v^m w. This string is therefore
not of the form a^n b^n and hence not in L, contradicting Eq. (3.18). Hence L
is not regular.
Finally we present some closure properties for regular languages.
Theorem 3.6 The family of regular languages is closed under the operations
of union, intersection, difference, concatenation, right-quotient, complementation,
and star-closure.
Example 3.14 Suppose that we wish to design a lexical analyzer for iden-
tifiers; an identifier is defined to be a letter followed by any number of letters
or digits, i.e.,
identifier = {{letter}{letter, digit}∗}.
It is easy to see that the DFA in Fig. 3.6 will accept the above defined
identifier. The corresponding transition table for the DFA is given as follows:
state/symbol letter digit
A B C
B B B
C C C
Initial state: A Final state: B
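This table translates directly into a table-driven recognizer. The following C sketch (our own illustration, with the library functions isalpha and isdigit standing in for the letter and digit classes) implements the identifier DFA above:

#include <ctype.h>
#include <stdio.h>

/* A sketch of the identifier DFA of Example 3.14: state A is the start,
   B accepts, C is the dead state reached when the first symbol is a digit. */
int is_identifier(const char *s)
{
    char q = 'A';
    if (*s == '\0') return 0;                 /* the empty string is not an identifier */
    for (; *s != '\0'; s++) {
        int letter = isalpha((unsigned char)*s);
        int digit  = isdigit((unsigned char)*s);
        if (!letter && !digit) return 0;      /* symbol outside the DFA's alphabet */
        switch (q) {
        case 'A': q = letter ? 'B' : 'C'; break;
        case 'B': q = 'B'; break;             /* letters and digits both stay in B */
        case 'C': q = 'C'; break;
        }
    }
    return q == 'B';
}

int main(void)
{
    printf("%d %d %d\n", is_identifier("sum1"), is_identifier("x"), is_identifier("2abc"));
    /* prints: 1 1 0 */
    return 0;
}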
For example, all the elements in set S1 are acceptable identifiers by the DFA,
whereas all the elements in set S2 are unacceptable identifiers:
Example 3.15 Suppose that we now want to design a lexical analyzer for
real numbers; a real number can be either in decimal form (e.g., 45.79) or in
exponential form (e.g., 34.0E-9). The DFA described in Fig. 3.7 will accept
the real numbers just defined. The corresponding transition table for the DFA
is given as follows:
State/symbol Digit · E + −
1 2
2 2 3 5
3 4
4 4 5
5 7 6 6
6 7 5
7 7
Initial state: 1 Final states: 4 and 7
Consider, for example, a stochastic automaton with
Σ = {a, b}
Q = {A, B, C}, q0 = A, F = {C}
δ(A, a, A) = 0.7 δ(B, a, A) = 1 δ(C, a, C) = 1
δ(A, a, C) = 0.1 δ(B, b, B) = 0.6 δ(C, b, C) = 1
δ(A, a, B) = 0.2 δ(B, b, C) = 0.4
δ(A, b, B) = 0.9
δ(A, b, C) = 0.1
where Σ_{q′∈Q} δ(q, a, q′) = 1 for each state q and symbol a. This stochastic
automaton can be diagrammati-
cally shown in Fig. 3.8. Suppose that we now wish to calculate the probability
that the automaton will go to state C from A given instructions a and b:
δ∗(A, ab, C) = Σ_{q∈Q} δ(A, a, q) · δ(q, b, C)
= δ(A, a, A) · δ(A, b, C) + δ(A, a, B) · δ(B, b, C) +
δ(A, a, C) · δ(C, b, C)
= 0.7 × 0.1 + 0.2 × 0.4 + 0.1 × 1
= 0.25.
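The same calculation can be carried out mechanically by summing over the intermediate states. The C sketch below (our own illustration; transition probabilities not listed in the text are assumed to be 0) reproduces the value 0.25:

#include <stdio.h>

/* Transition probabilities of the stochastic automaton, indexed as
   delta[from][symbol][to] with states A=0, B=1, C=2 and symbols a=0, b=1.
   Entries not given in the text are assumed to be 0. */
static const double delta[3][2][3] = {
    /* from A: on a, on b */
    { {0.7, 0.2, 0.1}, {0.0, 0.9, 0.1} },
    /* from B: on a, on b */
    { {1.0, 0.0, 0.0}, {0.0, 0.6, 0.4} },
    /* from C: on a, on b */
    { {0.0, 0.0, 1.0}, {0.0, 0.0, 1.0} }
};

int main(void)
{
    /* Probability of reaching C from A on input ab:
       sum over the intermediate state q of delta(A,a,q) * delta(q,b,C). */
    double p = 0.0;
    int q;
    for (q = 0; q < 3; q++)
        p += delta[0][0][q] * delta[q][1][2];
    printf("%.2f\n", p);            /* prints 0.25 */
    return 0;
}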
Fuzzy Automata
In stochastic automata, the uncertainty is modelled by probability. We now
introduce another, similar kind of automaton in which the uncertainty is
modelled by fuzziness rather than by probability. A fuzzy automaton is again
similar to a nondeterministic automaton in that several destination states may
be entered simultaneously; however, it is also similar to a stochastic automaton
in that there is a measure, between 0 and 1, of the degree to which the
automaton transitions between states.
Definition 3.14 A fuzzy automaton, M, is a six-tuple
M = (Q, Σ, δ, q0 , F, V),
where
Q is a finite set of states;
q0 ∈ Q is the initial state;
Σ is a finite set of inputs or instructions;
F ⊆ Q is the set of final states or accepting states, denoted by a double
circle;
V is the set of membership values (a subset of [0, 1]); and
δ is the fuzzy transition function
δ : Q × Σ × Q → V. (3.25)
Consider, for example, a fuzzy automaton with
Σ = {a, b}
Q = {A, B, C}, q0 = A, F = {C}
δ(A, a, A) = 0.8 δ(B, a, C) = 0.9 δ(C, b, B) = 0.4
δ(A, a, B) = 0.7
δ(A, a, C) = 0.5
δ(A, b, C) = 0.4
Then M can be graphically described as in Fig. 3.9. Note that a fuzzy
automaton is not necessarily stochastic; for instance, Σ_{q′∈Q} δ(C, b, q′) = 0.4 ≠ 1.
Suppose that
now we also wish to calculate the certainty that the automaton will go to
state C from A given instructions a and b:
δ∗(A, ab, C) = ⋁_{q∈Q} [δ(A, a, q) ∧ δ(q, b, C)]
= [δ(A, a, A) ∧ δ(A, b, C)] ∨ [δ(A, a, B) ∧ δ(B, b, C)] ∨
[δ(A, a, C) ∧ δ(C, b, C)]
= (0.8 ∧ 0.4) ∨ (0.7 ∧ 0.4) ∨ (0.5 ∧ 0.7)
= 0.4 ∨ 0.4 ∨ 0.5
= 0.5.
Note that “∧” (resp. “∨”) means that the minimum (resp. maximum) is being
taken over all the possible states.
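Computationally, the only difference from the stochastic case is that sum and product are replaced by maximum (∨) and minimum (∧). The following C fragment (our own illustration, using the three membership values quoted above) reproduces the value 0.5:

#include <stdio.h>

static double min2(double x, double y) { return x < y ? x : y; }
static double max2(double x, double y) { return x > y ? x : y; }

int main(void)
{
    /* The three terms of the fuzzy composition for delta*(A, ab, C),
       using the membership values quoted in the text. */
    double terms[3][2] = { {0.8, 0.4}, {0.7, 0.4}, {0.5, 0.7} };
    double certainty = 0.0;
    int q;
    for (q = 0; q < 3; q++)
        certainty = max2(certainty, min2(terms[q][0], terms[q][1]));
    printf("%.1f\n", certainty);    /* prints 0.5 */
    return 0;
}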
Problems
S → Aa
A→B
B → Aa.
L = {ai bj : i = j, i, j ∈ Z+ }
and
For example, the patterns 123abd, doc311, d1f22 are all accepted by the
automaton.
Chapter 4 Lexical Analysis
Robert L. Solso (1990) showed that the number of words a person knows is
about 20 000 to 40 000, and that recognition memory covers many times that
number. With these words in mind, a person is able to know the meaning of a
string of words if he/she also knows the arrangement of these words. There-
fore, understanding a language starts from understanding words. A language
is composed of sentences and each sentence is the string of words arranged
according to some existing rules. For written language, the hierarchy of a
sentence is lexeme → word (morphology) → phrase → sentence. For a
sentence expressed in sound, the hierarchy is phoneme → syllable → spoken
word → spoken sentence. At each layer the units are bound by the
grammar rules. Therefore, according to the modern linguists, to understand
a language involves five layers: phonetic analysis, lexical analysis, syntac-
tic analysis, semantic analysis and pragmatical analysis. Phonetic analysis
means that according to the phoneme rules the independent phonemes are
separated one by one from the speech sound stream. Then according to
phoneme morphological rules, the syllable and its corresponding lexeme or
words are found one by one. As for the analysis of the sentence of written
language, the phonetic analysis is not necessary, because the lexical analysis
is done via the reading in order for one to understand the meaning. When
a person reads a language which he/she is familiar with, the understanding
layers are what we mentioned above, excluding the layer of phonetic anal-
ysis. When one wants to understand oral language, the phonetic analysis
must be included. Therefore, the phonetic analysis is the essential basis for
understanding oral language.
Take English as an example. In English, there are approximately 45 dif-
ferent phonemes. For example, when you hear some one saying “right” and
“light”, if you are English native speaker, you will not have any difficulty
in discerning between phonemes r and l. But if the native language of the
speaker is Japanese, then it is likely that he/she could not pronounce them
clearly. Since in Chinese there are many words that have the same pronuncia-
tion, the same situation is likely to happen. Only when the analysis is carried
out for the whole context, may the discerning of these words be possible.
Lexical analysis, therefore, is an essential step for language understanding,
and likewise for compilation, because it is the basis for understanding programs.
This is why we have this chapter, and why we regard lexical analysis as the
first step of compilation.
To discuss the role of the lexical analyzer, we should first discuss the role of
the compiler, since the lexical analyzer is part of it. The role of the compiler
is to compile, or translate, one kind of language into another, usually into a
language executable on a computer. Suppose, for example, that the following
C source text has been read, character by character, into the input buffer:
/*C program-1*/main(){printf("c program-1\n");}
(The buffer holds these characters one by one, without any grouping.)
The main task of the lexical analyzer is to read these characters one by
one from the buffer, then group them into tokens according to different
situations [3]. These tokens will be encoded. In this way, the original charac-
ter string now becomes the token string or token stream, providing the input
to the syntax analyzer. Later we will see how the lexical analyzer works on
the input above, to form the token string from it.
Apart from its main task of transforming the input character stream into a
token string, the lexical analyzer may also perform certain secondary tasks at
the user interface [4]. One such task is to strip comments and white space, in
the form of blank, tab, and newline characters, out of the source program.
Another task is to correlate error messages from the compiler with the source
program. For example, the lexical analyzer may keep track of the number of
newlines seen, so that a line number can be associated with an error message.
In some cases, the lexical analyzer is in charge of making a copy of the source
program with the error messages marked in it. If the source language supports
macro-preprocessor functions, these preprocessor functions may also be carried
out as the lexical analysis [5] takes place.
Some compilers divide the lexical analyzer into two phases, the first called
scanning and the second lexical analysis proper. The scanner is in charge of
the simpler tasks, while the lexical analyzer proper performs the more complex
operations.
Now we return to our example above. The lexical analyzer reads the input
characters one by one. Then, according to the lexical grammar of the C
language, the character stream is grouped into different tokens, and so it
becomes a stream of tokens. In C, the tokens can be keywords, which are
usually reserved for specific uses and not allowed to be used as identifiers;
then identifiers, integers, real numbers, single-character symbols, comments,
and character strings (which the user uses for printing), etc. [6].
The lexical analyzer starts its work by reading the first character of the
input. It reads the "/" and knows that this is not a letter but an operator.
However, an expression cannot occur here, so it continues reading the second
character to see whether it is "*" or not. If it is not "*", then something is
wrong. Otherwise it knows that the combination of "/" and "*" forms the
identification of the start of a comment, and that all the characters up to the
identification of the end of the comment (the combination of "*" and "/")
form the comment. Regarding this, we have definitions that state that
#define is_comment_starter(ch1, ch2) ((ch1) == '/' && (ch2) == '*')
and
#define is_comment_stopper(ch1, ch2) ((ch1) == '*' && (ch2) == '/')
These two definitions specify the starter and the stopper of a comment,
"/" "*" and "*" "/". Comments are written by the programmer to provide
information or reminders to other readers of the program. Hence comments
need not be passed on to the compiler. But when the compiler prints a listing
of the program, it is necessary to print each comment as it occurs, in its
original form. Therefore, the comments should be stored, together with the
places where they occur, in their original order. Hence the comment
area needs to store the contents of the comments; in addition, it should also
retain the places or addresses where they are located, so that they can later
"go home". In current programming languages, comments may occur at any
place, so a program may contain many comments.
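As an illustration of how the analyzer can both skip a comment and remember where it began, the following C sketch (function and variable names are our own, not the book's) reads characters until the comment stopper is found and records the starting position of the comment:

#include <stdio.h>

/* Skip a comment that started with the two characters slash-star (already
   consumed) and ends with star-slash; record where the comment began so
   that it can be listed in its original place later. */
void skip_comment(FILE *src, long *start_pos)
{
    int ch, prev = 0;
    *start_pos = ftell(src) - 2;          /* the starter has already been read */
    while ((ch = fgetc(src)) != EOF) {
        if (prev == '*' && ch == '/')      /* the comment stopper */
            return;
        prev = ch;
    }
    fprintf(stderr, "error: comment not closed before end of input\n");
}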
As mentioned above, the main task of the lexical analyzer is to group the
input character string; it also needs to decide where the input ends and to
discern the layout characters [7]. Therefore, the lexical grammar which the
lexical analyzer consults in doing its job also needs to define the upper-case
letters, the lower-case letters, and so on. In the following, we list part of it
for demonstration.
#define is_end_of_input(ch) ((ch) == '\0')
#define is_layout(ch) (!is_end_of_input(ch) && (ch) <= ' ')
where the first one defines the identification of the end of input while the
second defines the layout symbol.
#define is_uc_letter(ch) ('A' <= (ch) && (ch) <= 'Z')
#define is_lc_letter(ch) ('a' <= (ch) && (ch) <= 'z')
#define is_letter(ch) (is_uc_letter(ch) || is_lc_letter(ch))
#define is_digit(ch) ('0' <= (ch) && (ch) <= '9')
#define is_letter_or_digit(ch) (is_letter(ch) || is_digit(ch))
These are the definitions of letters and digits. The first one defines the
upper case of the letters and the second defines the lower case; the third one
defines the upper case or lower case; the fourth defines digits; and the last
one defines letters or digits.
#define is_underscore(ch) ((ch) == '_')
#define is_operator(ch) (strchr("+-*/", (ch)) != 0)
#define is_separator(ch) (strchr(";(){}", (ch)) != 0)
The lexical analyzer needs to discern such tokens as keywords (or reserved
words), identifiers, relational operators, comments, etc. Every programming
language has a dictionary of its reserved words, the words used in its
statements. The dictionary also contains the codes
of these words in the machine. When the lexical analyzer finds a reserved word
in the input, it checks the correctness of the word (its spelling, its existence)
against the dictionary. If it finds the same word in the dictionary, the
correctness is confirmed. Then, from the position of the word in the dictionary,
the machine code used for the word is also found. This code will take the
place of the word in the intermediate form of the program. One advantage of
the intermediate form is that every identifier, as well as every reserved word,
has the same length and carries identity bits that show its attributes.
Otherwise they would have different lengths and would take more time to
handle. Let us have a look at the dictionary of the C language. It contains
the following words (the list is not exhaustive):
auto break case char const continue
default do double else enum extern
float for goto if int long
register return short signed sizeof static
struct switch typedef union unsigned void
volatile while
Apart from these reserved words, there are also other data-type declaration
words, main(), opt, #include, #define, #line, #error, #pragma, etc., that
need to be discerned. Notice that the dictionary must be complete, i.e., it
must contain all the legal words used in C as statement names and so on.
Otherwise, if some word is missing and the lexical analyzer finds it in a
program, the analyzer will not recognize it and will treat it as an error.
Notice also that the list above is not the real list as it occurs in the machine,
since in the machine each word also has its code. Sorting these words is not
really necessary, as the dictionary does not contain many words, so searching
it will not be time-consuming. Even with a sequential search, the efficiency
is still acceptable.
We know from the previous discussion that an identifier is a string that
must start with a letter and be followed by any number (though in a concrete
implementation the number is limited) of letters, digits, and underscores. It
can be described by a regular expression. We can describe the identifiers of
the C language by
[a-zA-Z_][a-zA-Z_0-9]∗
Then the grammar that generates the regular expressions may be written as
follows:
letter → a|b|...|z|A|B|...|Z|_
digit → 0|1|...|9
id → [a-zA-Z_]A
A → [a-zA-Z_0-9]A
A → ε
Note that the nonterminal "A" here is not the same as the letter "A". The
regular expression can also be written as
letter (letter | digit)∗
Having the expression, we can check the correctness of an identifier by
matching it against the expression to see whether its structure conforms. But,
as we have mentioned, in practice an identifier differs somewhat from the
expression in that the number of its components is limited. Hence, when the
lexical analyzer scans the string of an identifier, it must count the number of
components. When the length of an actual identifier exceeds the limit, the
analyzer either cuts it off (in which case it must check that the truncated
identifier is still unique) or declares an error: the identifier is too long.
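A sketch of such a length-checking scanner is given below; the limit MAX_ID_LEN is an assumed implementation constant, not a value fixed by the book:

#include <ctype.h>

#define MAX_ID_LEN 32   /* assumed implementation limit */

/* Scan an identifier (letter followed by letters, digits or underscores)
   from s into buf (of size MAX_ID_LEN + 1); return its length,
   0 if s does not begin with an identifier, or -1 if the limit is exceeded. */
int scan_identifier(const char *s, char *buf)
{
    int n = 0;
    if (!isalpha((unsigned char)s[0]) && s[0] != '_')
        return 0;                                   /* not an identifier at all */
    while (isalnum((unsigned char)s[n]) || s[n] == '_') {
        if (n >= MAX_ID_LEN)
            return -1;                              /* identifier too long */
        buf[n] = s[n];
        n++;
    }
    buf[n] = '\0';
    return n;
}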
After checking the correctness of each identifier, it is stored in the table
specially used for the identifiers. It is called the symbol table in which the
identifiers are stored and each is assigned an address for storing its value.
The identifier is also replaced by its intermediate code. In general the lexical
analyzer encodes the identifiers according to the order that they occur in
the program. For example, the first identifier may be encoded as I1 (where
I stands for an identifier) and correspondingly the second one is encoded as
I2, etc. Usually, the identity bits occupy two bits of the words. For example,
we use λ to denote the identity, and respectively, we use λ = 00 to represent
reserved words, λ = 01 to represent identifiers, λ = 10 to represent integers
and λ = 11 to represent real constants, other bits in the word represent the
number of the identifiers or the address in the memory that stores the value
of the item.
In the lexical analysis of the identifiers, there is an important task, i.e.,
to decide the declaration of the identifier. For most programming languages,
there is such requirement, that is, the declaration of the identifier (it is called
the definition occurrence of the identifier) should precede the occurrence of it
in the program (it is called the application occurrence). Therefore, the prin-
ciple is called the definition occurrence precedes the application occurrence.
The principle also implies that an identifier needs exactly one declaration;
declaring it more than once commits the error of repeated definition. In
particular, two definitions with different types are not allowed. It is also not
allowed to use an identifier without a definition. If these rules are violated,
the lexical analyzer will handle the situation as an error.
digit M | digit
M → digit M | digit
The productions above indeed generate the regular expression of real numbers,
but they may allow leading zeros and trailing zeros. The lexical analyzer may
rely on this grammar to handle real numbers [9]. The handling method is to
group together the digits that represent an integer or real number to form a
digit string, then translate the number into binary and put it in the constant
table [10]. The order of handling is the order in which these constants occur.
That is, after the first number has been handled in this way and stored in the
constant table, when the second number occurs it is first checked to see
whether it is already in the table. If it is already there, it does not need to be
stored in the constant table again. Otherwise it is handled like the first number
and put into the constant table.
As with identifiers, however, constant numbers also have properties that are
not regular. In any programming language, because of the limited capacity of
memory, an integer can in general occupy only one word of memory (32 bits
or 64 bits), and a real number occupies double that, i.e., two words. Therefore,
when spelling out and converting a constant number, the size or value of the
number is checked. When the length or size of the number exceeds the
restriction, the lexical analyzer reports an error.
In order to obtain an expression for real numbers without leading zeros or
trailing zeros, we modify the expression. We introduce the new terminal
digit1 to represent [1..9], while the original digit still represents [0..9]. The
modified regular expression of real numbers then becomes
(+|-|)(0|digit1 digit∗).(0|digit1|digit1 digit|digit digit1∗)(e(+|-|)digit digit1)?
The reader may check which kinds of real numbers are included in the
expression.
the tokens, and the place p in the input stream, then the lexical analyzer
decides which regular expression in S matches the input segment starting
from p, and decides the role of the segment in the program.
According to its task, the lexical analyzer can be constructed either manually
or automatically by computer. Both are done based on the tokens designated
by regular expressions [11]. Here we mainly explain the structure of the
manually written lexical analyzer. Fig. 4.1 shows the header file of the lexical
analyzer, lex.h, which defines 7 kinds of tokens: comment lines, identifiers,
reserved words, constants, single-character tokens, the ERRONEOUS token,
and the EOF token. A single-character token is an operator such as +, −,
× (or ∗), ÷ (or /), or a separator such as ;, ., (, ), {, }, etc. Fig. 4.1 also shows
the form of the token types with an extension field; the field records the
starting position of the token in the input, and the file also contains the
definitions of constant-like tokens.
The main program of the lexical analyzer consists of the following parts: the
declaration of the local data that manages the input files; the declaration of
the global variable Token; the routine start-lex( ) that starts the lexical
analyzer; and get-next-token( ). The routine get-next-token( ) is used to scan
the input stream, get the next token and put its data into the Token table.
Fig. 4.2 shows the data and the commencement part of the lexical analyzer.
The main program of the lexical analyzer repeatedly invokes subroutine
get-next-token( ). The get-next-token( ) and its subprogram check the current
input character to see what kind of token it belongs to, then it prints the
Fig. 4.2 Data and commencement part of manually generated lexical analyzer.
information found from the Token. When an EOF is recognized and handled,
the loop ends. Fig. 4.3 shows the main program of the lexical analyzer.
thing needed to do is to put its intermediate code into the IA area. Fig. 4.4
shows this correspondence.
When handling constants, the first step is likewise to group their components.
But one thing differs from the handling of identifiers: the constant must be
converted from its decimal form to the binary value used in the machine. The
handling process for integers is the same as for real numbers, but a real
number occupies twice as many words as an integer: since each integer
occupies one word, a real number occupies two words. Fig. 4.5 shows the
difference [14].
the table. Only the corresponding address needs to be put into intermediate
language area to replace the original occurrence in the source program.
In the following, a simple program in C and its corresponding intermediate-language
form are given. We make the presentation as intuitive as possible so that the
reader can follow the working process of the lexical analysis [15]. Fig. 4.6
shows the process by contrasting the source program with its intermediate peer.
Remark: In Fig. 4.6 above, the code 4 exceeds what two identity bits can
represent. This is only for the purpose of explanation; in practice we may need
to use more bits as the identity bits.
The lex works in the following manner: At first the source program is
S → ·ABC,
S → A · BC,
S → AB · C,
S → ABC·
S → ·ABC
A → ·Bβ
S → ·ABC
to
S → A · BC
is a state transformation. For all the productions of a grammar, we have to
write down all the state transformations. In the syntactic analysis later, we
will discuss state transformations in more detail. In lex.l the transformation
map is represented as a table, and the actions associated with the regular
expressions are given as C code; they are moved directly into lex.yy.c.
Finally, lex.yy.c is compiled into the object program a.out, and this is just
the lexical analyzer that transforms the input strings into a token sequence.
The process above is shown in Fig. 4.7.
In order for the reader to understand lex.l more deeply, we list its skeleton
in Fig. 4.8. It contains three parts: the first part is the definitions of rules,
the second part is the regular expressions and their code segments, and the
third part is code in the C language. The third part is also the one moved
directly to lex.yy.c mentioned above.
Fig. 4.8 The skeleton of lex that automatically generates a lexical analyzer.
char yytext [ ]. When the C code executes a statement with a numeric return
value, the return value is just the value of the token; it indicates the class of
the corresponding token. After the relevant handling is finished, the inner
loop of yylex ends. The class operator/separator is a single-character token,
and it is the first character of the array yytext[ ] (hence it is yytext[0]).
The third part is the C code that is actually executable. The lexical analyzer
generated by lex does not need initialization, so the routine start-lex( ) is
empty. The routine get-next-token( ) starts by invoking yylex( ). This
invocation skips over comments and layout symbols until it finds a real token.
It then returns the value of the token class and carries out the corresponding
processing for that class. When it detects the end of the input, the routine
yylex( ) returns 0. The malloc( ) statement in this part is used to allocate
space for the array yytext[ ], since this space will store the result obtained by
get-next-token( ), that is, the token just read, while the routine yywrap( ) is
used to aid the processing of the end of file.
In the last section we mentioned that, after the analysis performed by the
lexical analyzer, the source program is transformed into a program in the
intermediate language and stored in the intermediate-language area. Identifiers,
constants, and reserved words of arbitrary lengths are all replaced by tokens
with fixed lengths and with different identity marks. First, the reserved words
are replaced by their codes in the reserved-word dictionary. Second, each
identifier is replaced by an integer following the identity bits (for example
λ = 01). Meanwhile, however, when tokens take the place of identifiers, a
table is needed for storing the identifiers for later reference. Since the number
of identifiers in a program varies, and is in general larger than the number of
constants, how to build the table of identifiers is a problem of special concern
to the compiler.
The symbol table (identifier table) may be seen as an extended record array
indexed by a character string (rather than by a number), where the character
string is just the identifier. The corresponding record contains the information
collected for the identifier. The structure of a symbol-table entry may be
represented as follows:
struct identifier {
    char *a;      /* the identifier string */
    int  ptr;     /* address of the record holding its information */
};
Therefore, the basic interface of the symbol table module consists of a function
that, given an identifier, returns its entry in the table, creating a new entry if
the identifier has not been seen before.
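A minimal sketch of this interface is shown below. It builds on the struct identifier declared above; the function name identify and the linear search are our own illustrative choices (a real compiler would use hashing, cf. Problem 4.4):

#include <stdlib.h>
#include <string.h>

#define TABLE_SIZE 1024          /* an assumed, illustrative capacity */

static struct identifier table[TABLE_SIZE];   /* uses struct identifier above */
static int n_entries = 0;

/* Return the index of name in the symbol table, entering it if it is new.
   No overflow check is made, for brevity. */
int identify(const char *name)
{
    int i;
    for (i = 0; i < n_entries; i++)
        if (strcmp(table[i].a, name) == 0)
            return i;
    table[n_entries].a = malloc(strlen(name) + 1);
    strcpy(table[n_entries].a, name);
    table[n_entries].ptr = n_entries;          /* stands for the value's address */
    return n_entries++;
}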
Errors which the lexical analyzer discovers may be grouped into the following
classes, as described above:
1) Wrong spellings of reserved words.
2) Identifier errors.
3) Number errors.
4) Punctuation errors.
There are two attitudes towards errors.
One is to follow the rules (the grammar) strictly: once an error is found,
an error report is issued immediately to the user or programmer, reporting
the error type and position as precisely as possible. In general, however,
providing such information is not easy. A common lexical analyzer can only
report the likely error type; it is very difficult for it to locate the exact
position of the error.
The other treatment is more tolerant, or more humane: when an error is
found, the lexical analyzer tries to correct the error itself rather than
immediately issuing a report to the user or programmer. The aim is to save
the user's or programmer's time and to improve efficiency.
For example, if the error lies in the spelling of a reserved word, and the wrong
word is found to differ from the correct one in only one letter, then the lexical
analyzer will regard it as that reserved word and correct the wrong letter.
If the error occurs in an identifier, and it is determined that the string is not
another identifier, then the same treatment may be applied.
Problems
Problem 4.1 Write a program, in a language you are familiar with, that
recognizes real numbers and identifiers.
Problem 4.2 Someone gives the regular expression of real numbers as
(+|-|) digit∗ .digit digit∗ (e(+|-|)digit digit∗ |)
Explain what problem this regular expression causes. If the problem is to be
avoided, how should the expression be written?
Problem 4.3 Write a complete input scanner.
Problem 4.4 Suppose that your symbol table can admit 10 – 100 identifiers,
while sometimes you need to handle 100 000 000 identifiers with reasonable
efficiency. Allocating a hash table with room for 100 000 000 entries is not
consistent with the requirements of the problem. Design a suitable hash-table
algorithm to solve the problem.
References
[1] McCulloch WS, Pitts W (1943) A logical calculus of the ideas immanent
in nervous activity. Bull. Math. Biophysics 5: 115 – 133.
[2] Lesk ME, Lex - a lexical analyzer generator. Computing Science Tech
Report 39, Bell Laboratories, Murray Hill, NJ. It also appears in Vol. 2 of
the Unix Programmer's Manual, Bell Laboratories, with the same title but
with E. Schmidt as coauthor. http://dinosaur.compilertool.net/lex/index.html.
[3] Kleene SC, Representation of events in nerve nets and finite automata. In:
Shannon CE, McCarthy J (eds) Automata studies, 34, pp. 3 – 40.
[4] http://www.cs.princeton.edu/∼appel/modern/java/JLex. Accessed 12 Oct
2009.
[5] Hopcroft JE, Motwani R, Ullman JD (2006) Introduction to automata theory,
languages and computation. Addison-Wesley, Boston.
[6] Huffman DA (1954) The synthesis of sequential machines. J Franklin Inst.
257, pp 3 – 4, 161, 190, 275 – 303.
[7] http://jflex.de/. Accessed 19 Nov 2009.
[8] Aho AV, Corasick MJ (1975) Efficient string matching, an aid to bibliographic
search. Comm ACM, 18(6): 333 – 340.
[9] Free Software Foundation. http://www.gnu.org/software/flex/. Accessed 19
Nov 2009.
[10] Aho AV (1990) Algorithms for finding patterns in strings. In: Leeuwen J van
(ed) Handbook of theoretical computer science. MIT Press, Cambridge.
[11] Shannon C, McCarthy J (eds) (1956) Automata Studies. Princeton Univ
Press, NJ.
[12] Thompson K (1968) Regular expression search algorithm. Comm ACM 11
(6): 419 – 422.
[13] McNaughton R, Yamada H (1960) Regular expressions and state graphs for
automata. IRE Trans. on Electronic Computers EC-9(1): 38 – 47.
[14] Moore EF, Gedanken experiments on sequential machines. In [11], pp. 129 –
153.
[15] Knuth DE, Morris JH, Pratt VR (1977) Fast pattern matching in strings.
SIAM J. Computing 6(2): 323 – 350.
[16] McKenzie BJ, Harries R, Bell TC (1990) Selecting a hashing algorithm.
Software: Practice and Experience, 20(2): 672 – 689.
Chapter 5 Push-Down Automata and
Context-Free Languages
Push-down automata (PDA) form the most important class of automata be-
tween finite automata and Turing machines. As can be seen from the previous
chapter, deterministic finite automata (DFA) cannot accept even very simple
languages such as
{xn yn | n ∈ N},
but fortunately, there exists a more powerful machine, push-down automata,
which can accept it. Just as finite automata come in deterministic (DFA) and
nondeterministic (NFA) varieties, there are two types of push-down automata:
deterministic push-down automata (DPDA) and non-deterministic push-down
automata (NPDA). The languages which can be accepted by PDA are called
context-free languages (CFL), denoted by LCF . Diagrammatically, a PDA is a
finite state automaton (see Fig. 5.1) with memory (a push-down stack). In this
chapter, we shall study PDA and their associated languages, the context-free
languages LCF . For the sake of completeness of automata theory and formal
languages, we shall also study Turing machines and their associated languages.
Q = {q0 , q1 , q2 , q3 },
Σ = {a, b},
Γ = {0, 1},
z = 0,
F = {q3 },
Consider now, as an example, a grammar with the following productions:
S → abB,
A → aaBb,
B → bbAa,
A→λ
S =⇒ abB
=⇒ abbbAa
=⇒ abbba
S =⇒ abB
=⇒ abbbAa
=⇒ abbbaaBba
=⇒ abbbaabbAaba
=⇒ abbbaabbaba
S =⇒∗ abbbaabbAaba
=⇒ abbbaabbaaBbaba
=⇒ abbbaabbaabbAababa
=⇒ abbbaabbaabbababa
=⇒ ab(bbaa)^2 bba(ba)^2
S =⇒∗ abbbaabbaabbAababa
=⇒ abbbaabbaabbaaBbababa
=⇒ abbbaabbaabbaabbAabababa
=⇒ ab(bbaa)^3 bba(ba)^3
...
L = {a^n b^n : n ≥ 0}
is not a regular language, but it can be generated by the grammar
G = ({S}, {a, b}, S, P) with P given by S → aSb and S → λ, which is
evidently a context-free grammar. So the family of context-free languages is a
superset of the family of regular languages; indeed, the family of regular
languages is a proper subset of the family of context-free languages.
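Although we do not simulate a full PDA here, the following C sketch (our own illustration) shows why a single push-down store, here reduced to a simple counter, suffices to recognize {a^n b^n : n ≥ 0}: each a is "pushed" and each b "pops":

#include <stdio.h>

/* Recognize L = {a^n b^n : n >= 0}. The counter plays the role of the
   push-down stack: each a pushes (counts up), each b pops (counts down). */
int accepts_anbn(const char *w)
{
    long depth = 0;
    int seen_b = 0;
    for (; *w != '\0'; w++) {
        if (*w == 'a') {
            if (seen_b) return 0;    /* an a after a b: wrong shape */
            depth++;                 /* "push" */
        } else if (*w == 'b') {
            seen_b = 1;
            if (depth == 0) return 0;
            depth--;                 /* "pop" */
        } else {
            return 0;                /* symbol outside {a, b} */
        }
    }
    return depth == 0;               /* stack empty: accept */
}

int main(void)
{
    printf("%d %d %d\n", accepts_anbn(""), accepts_anbn("aaabbb"), accepts_anbn("aabbb"));
    /* prints: 1 1 0 */
    return 0;
}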
We call a string x ∈ (V ∪ T)∗ a sentential form of G if there is a derivation
S =⇒∗ x in G. Notice that there may be several variables in a sentential form;
in such a case, we have a choice of the order in which to replace the variables.
(i) S → AA,
(ii) A → AAA,
(iii) A → bA,
(iv) A → Ab,
(v) A → a.
Then we have the following three distinct derivations of the string
ababaa ∈ L(G), where the production applied at each step is indicated in
parentheses:
Derivation (1): S =⇒(i) AA =⇒(v) aA =⇒(ii) aAAA =⇒(iii) abAAA =⇒(v) abaAA =⇒(iii) ababAA =⇒(v) ababaA =⇒(v) ababaa
Derivation (2): S =⇒(i) AA =⇒(v) Aa =⇒(ii) AAAa =⇒(iii) AAbAa =⇒(v) AAbaa =⇒(iii) AbAbaa =⇒(v) Ababaa =⇒(v) ababaa
Derivation (3): S =⇒(i) AA =⇒(v) aA =⇒(ii) aAAA =⇒(v) aAAa =⇒(iii) abAAa =⇒(iii) abAbAa =⇒(v) ababAa =⇒(v) ababaa
z = uvwxy, (5.4)
Like its counter-part for regular languages, the pumping theorem for
context-free languages provides a tool for demonstrating that languages are
not context-free.
Theorem 5.4 The family of context-free languages is not closed under
intersection or complementation. That is, L1 and L2 being context-free does
not imply that L1 ∩ L2 or the complement of L1 is context-free. (5.8)
The Backus-Naur Form (BNF), named after John Backus, who invented the
method, and Peter Naur, who refined it for the programming language ALGOL,
directly corresponds to context-free grammars. In fact, many parts of
ALGOL-like or Pascal-like programming languages can be defined by
restricted forms of context-free grammars.
Example 5.4 The following grammar (context-free grammar, but using
BNF notation) defines a language of even, non-negative integers.
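The book's grammar itself is not reproduced here. Purely as an illustration of what such a BNF definition could look like (the nonterminal names below are our own, not the book's), one possible sketch is:

<even> ::= <even-digit> | <digits> <even-digit>
<digits> ::= <digit> | <digits> <digit>
<digit> ::= 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
<even-digit> ::= 0 | 2 | 4 | 6 | 8

Every derivation from <even> ends in an even digit, so every generated numeral denotes an even, non-negative integer.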
With this grammar, we can easily generate the even integers, and show
their parse trees, which are useful in syntax analysis and code generation in
compiler construction.
As we have seen, finite automata (FA) can recognise regular languages (LREG ),
but not non-regular languages such as L = {a^n b^n | n ∈ N}, which is known
to be a context-free language. PDA, however, can recognise all the context-
free languages LCF generated by context-free grammars GCF . There are lan-
guages, however, for example context-sensitive languages LCS , such as
L = {a^n b^n c^n | n ∈ N}, that cannot be generated by context-free gram-
mars. Fortunately, there are other machines, called Linear Bounded Au-
tomata (LBA), more powerful than push-down automata, that can recognise
all the languages generated by context-sensitive grammars GCS . However,
LBA cannot recognise all languages generated by phrase-structure grammars
GPS . To avoid the limitations of the above mentioned three special types
of automata, a Turing Machine (TM), named after the British mathemati-
cian Alan Turing is used. Turing machines can recognise all the languages
generated by phrase-structure grammars, called the recursively enumerable
languages LRE , which include, of course, all the regular languages, context-free
languages and context-sensitive languages. In addition, Turing machines
can also model all the computations that can be performed on any computing
machine. In this section, we shall study Turing machines and their associated
languages LRE .
A standard Turing machine (see Fig. 5.2) has the following features:
1) The Turing machine has a tape that is unbounded in both directions.
2) The Turing machine is deterministic.
3) There are no special input and output files.
M = (Q, Σ, Γ, δ, q0 , □, F) (5.9)
where
Q is a finite set of internal states;
Σ is a finite set of symbols called the input alphabet; we assume that
Σ ⊆ Γ − {□};
Γ is a finite set of symbols called the tape alphabet, with □ ∈ Γ the blank
symbol;
δ is the transition function, which is defined as
δ : Q × Γ → Q × Γ × {L, R}.
Example 5.5 Let Σ = {a, b}. Design a Turing machine that accepts the
language
L = {a^n b^n : n ≥ 1}.
As we saw in the preceding section, this language is a context-free language
and can be accepted by a push-down automaton. In this example, we shall see
that it can be accepted by a Turing machine as well. Let q0 be the initial
state, and suppose that we use x's to replace the a's and y's to replace the b's.
Then we can design the transitions as follows (see Fig. 5.3):
M = (Q, Σ, Γ, δ, q0 , □, F)
= ({q0 , q1 , q2 , q3 , q4 }, {a, b}, {a, b, x, y, □}, δ, q0 , □, {q4 }).
At this point the Turing machine halts in a final state, so the string aaabbb
is accepted by the Turing machine. The above successive instantaneous de-
scriptions can also be showed diagrammatically as follows:
Remark: The above example shows that Turing machines can accept lan-
guages that can be accepted by push-down automata. It is, of course the case
that Turing machines can accept languages that can be accepted by finite
automata. For example, the following regular language
LREG = {w ∈ {a, b}∗ : w contains the substring aba}.
can be accepted by both Turing machines and finite automata; Fig. 5.4 gives
a Turing machine and a finite automaton that accept the above language.
Example 5.6 Design a Turing machine that accepts the language
L = {a^n b^n c^n : n ≥ 1}.
As we already know, this language is not a context-free language, so it cannot
be accepted by a push-down automaton. In this example, we shall show
Fig. 5.4 A TM and a DFA that accept the language {a, b}∗ {aba}{a, b}∗ .
M = (Q, Σ, Γ, δ, q0 , , F),
where
Q = {q0 , q1 , q2 , q3 , q4 , q5 },
Σ = {a, b, c},
Γ = {a, b, c, x, y, z, □},
F = {q4 },
δ : Q × Γ → Q × Γ × {L, R} is defined by
For the particular input aabbcc, we have the following successive instan-
taneous descriptions of the designed Turing machine:
xxyyzq2 c
xxyyq3 zz
xxyq3 yzz
xxq3 yyzz
xq3 xyyzz
xxq0 yyzz
xxyq4 yzz
xxyyq4 zz
xxyyzq4 z
xxyyzzq4
xxyyzzq4
Fig. 5.5 gives a Turing machine that accepts {a^n b^n c^n : n ≥ 1}.
We could, of course, list many more different types of Turing machines. How-
ever, all the different types of Turing machines have the same power. This
establishes the following important result about the equivalence of the various
Turing machines.
x → y, (5.13)
where x, y ∈ (V ∪ T)+ and length(x) ≤ length(y) (or briefly, |x| ≤ |y|).
A language L is called a context-sensitive language, denoted by LCSL ,
if there exists a context-sensitive grammar G such that L = L(G) or
L = L(G) ∪ {λ}.
Example 5.7 Design a context-sensitive grammar to generate the context-
sensitive language
L = {an bn cn : n > 0}.
(i) S → abc
(ii) S → aAbc
(iii) A → abC
(iv) A → aAbC
(v) Cb → bC
(vi) Cc → cc.
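For example, for n = 2 the word aabbcc is obtained by applying the productions (ii), (iii), (v) and (vi) in turn:

S =⇒ aAbc =⇒ aabCbc =⇒ aabbCc =⇒ aabbcc.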
All the classes (families) of machines we have studied so far are finite (state)
machines, but some of the machines have exactly the same power (here by
the same power, we mean they accept exactly the same language), whilst
some of the machines have more power than others. For example, deter-
ministic finite automata (DFA) have the same power as nondeterministic
finite automata (NFA); nondeterministic push-down automata (NPDA) have
more power than deterministic push-down automata (DPDA); push-down
automata (PDA) with two push-down stores have more power than the push-
down automata (PDA) with only one push-down store; but push-down au-
tomata (PDA) with more than two push-down stores have the same power as
push-down automata with two push-down stores. Interestingly enough, push-
down automata with two or more push-down stores have the same power as
Turing machines; All different types of Turing machines (such as determin-
istic, nondeterministic, probabilistic, multitape and multidimensional, etc.)
have the same power. However, restricting the amount of available tape for
computation decreases the capabilities of a Turing machine; linear bounded
automata is such a type of restricted Turing machines in which the amount
of available tape is determined by the length of the input string. The relation
between the various classes of finite machines over the same alphabet Σ can
be summarized as follows.
So, there are essentially four main classes of machines: finite automata (FA),
push-down automata (PDA), linear-bounded automata (LBA) and Turing
ma-chines (TM). The hierarchy of these classes of machines can be described
as follows:
Finite Automata (FA) ⊂ Push-Down Automata (PDA) ⊂ Linear-Bounded Automata (LBA) ⊂ Turing Machines (TM).
We have already seen that languages and grammars are actually equivalent
concepts; on the one hand, given a language, we can find a grammar which
generates the language; on the other hand, given a grammar, we can find the
language generated by that grammar. Remarkably enough, languages,
grammars and machines are closely related to one another; Fig. 5.6 shows the
hierarchical relations among them.
Fig. 5.6 Hierarchical relations among various languages, grammars and their
machines.
Problems
L = {a^n b^n c^n : n ≥ 0}
Problem 5.3 Show that the Turing machine constructed in Example 5.5
cannot accept the language L = {a^n b^m : m ≥ 1, n > m}.
Problem 5.4 Construct Turing machines that accept the languages
L1 = {a^n b^(2n) : n ≥ 1} and L2 = {a^(2^n) : n ≥ 1} over Σ = {a, b}.
Problem 5.5 Construct a Turing machine that accepts the language
L = {ak bm cn : k, m, n > 0, k = m or k = n or m = n}
Chapter 6 Context-Free Grammars
In the compilation of source programs, the second phase of the process is the
syntactical analysis. Based on the lexical analysis, the syntactical analysis
checks the correctness of the source programs in terms of the grammar of
the language used. And it is well-known that most of the properties of the
programming languages are context-free. Therefore, if we want to check
whether a program is syntactically correct, we should naturally check whether
the syntax of the program is consistent with a context-free grammar, at least
for the most part. In order to do so, the basis is to know about context-free
grammars.
This chapter and Chapter 5 together form the preparation of the syntactical
analysis.
The context-free grammars generate the context-free languages, and the
context-free languages were initially introduced to provide a model of nat-
ural languages. Hence they are very important for understanding natural
language. The later investigation of programming languages will show that
they are equally important in the area of programming languages. We can
even say that their importance for programming languages is even greater
than for natural languages, because in common conversation between people
the rules of a context-free grammar are not obeyed so strictly, and yet the
other party of the conversation may still understand the meaning or intention
of the speaker. For a computer, however, this is not the case. Even a minor
mistake in a program will affect the understanding of the compiler, so that it
will not be able to compile the program correctly and the program will not
be executable. After all, at the present time, the computer is not as
intelligent as a human.
Therefore, in a word, the motivation of this chapter, together with the last
one, is to provide sufficient knowledge for the next chapter on syntax analysis.
These three chapters provide the core knowledge for the whole course.
π0 : X → xyXYx,
π1 : Y → ε,
π2 : Z → yyx
L3 ⊆ L2 ⊆ L1 ⊆ L0 . (6.4)
is {an bn | n ∈ N}.
We now prove by induction that a^n b^n ∈ L(G) for every n ≥ 0.
The case n = 0 follows from the existence of the production π0 : S → ε in G.
Suppose now that a^n b^n ∈ L(G), so S =⇒∗_G a^n b^n . Using the production
S → aSb we obtain the derivation
S =⇒_G aSb =⇒∗_G a a^n b^n b = a^(n+1) b^(n+1) ,
so a^(n+1) b^(n+1) ∈ L(G) as well.
Consider next the grammar G with the productions
π0 : S → abc, π1 : S → aXbc,
π2 : Xb → bX, π3 : Xc → Ybcc,
π4 : bY → Yb, π5 : aY → aaX,
π6 : aY → aa.
We claim that a word α contains the infix aY (which allows us to apply the
production π5 ) and S =⇒∗ α if and only if α has the form
α = a^i Y b^(i+1) c^(i+1) for some i ≥ 1. An easy argument by induction on
i ≥ 1 shows that if α = a^i Y b^(i+1) c^(i+1) , then S =⇒∗ α. We need to prove
only the converse implication. This can be done by strong induction on the
length n ≥ 3 of the derivation S =⇒∗ α.
and this word has the prescribed form. Suppose now that the condition is
satisfied for derivations shorter than n, and let S =⇒∗_G α be a derivation of
length n such that α contains the infix aY. By the induction hypothesis, the
previous word in this derivation that contains the infix aY has the form
α′ = a^j Y b^(j+1) c^(j+1) . To proceed from α′ we must apply the production
π5 and replace Y by X, obtaining a^(j+1) X b^(j+1) c^(j+1) .
Next, the symbol X must be shifted to the right using the production π2 ,
transform itself into a Y when it reaches the c's (production π3 ), and this Y
must be shifted back to the left to create the infix aY (production π4 ). This
can happen only through the applications of π2 , π3 and π4 as follows:
a^(j+1) X b^(j+1) c^(j+1) =⇒∗_{π2} a^(j+1) b^(j+1) X c^(j+1)
=⇒_{π3} a^(j+1) b^(j+1) Y b c^(j+2)
=⇒∗_{π4} a^(j+1) Y b^(j+2) c^(j+2) . (6.8)
That proves that α has the desired form. Therefore, all the words in the
language L(G) have the form a^n b^n c^n .
Although this grammar is not context-sensitive (only productions π0 , π1 ,
π5 , and π6 are context-sensitive), we will exhibit a context-sensitive grammar
for this language. Moreover, we will show that this language is not context-
free. So it will serve to show that L2 ⊆ L1 .
Now we turn our attention to real programming language.
Example 6.5 Suppose that E stands for expression, T stands for term, F
stands for factor, then the following productions will generate the arithmetic
expressions that consist of +, −, ×, /.
E → E + T,
E → E − T,
E → T,
T → T × F,
T → T/F,
T → F,
F → (E),
F → a | b | c.
The language which the grammar generates consists of all the arithmetic
expressions built from the operators +, −, × and / and the three variables
a, b, and c. That is, all such arithmetic expressions can be derived from these
productions step by step. For example, the arithmetic expression
(a × b + b × c + c × a)/(a + b + c)
can be derived as follows:
E→T
→ T/F
→ F/F
→ (E)/F
→ (E + T)/F
→ (E + T + T)/F
→ (T + T + T)/F
→ (T × F + T + T)/F
→ (F × F + T + T)/F
→ (a × F + T + T)/F
→ (a × b + T × F + T)/F
→ (a × b + F × F + T)/F
→ (a × b + b × F + T)/F
→ (a × b + b × c + T)/F
→ (a × b + b × c + F)/F
→ (a × b + b × c + c × a)/F
→ (a × b + b × c + c × a)/(E)
→ (a × b + b × c + c × a)/(E + T)
→ (a × b + b × c + c × a)/(E + T + T)
→ (a × b + b × c + c × a)/(T + T + T)
→ (a × b + b × c + c × a)/(F + T + T)
→ (a × b + b × c + c × a)/(a + T + T)
→ (a × b + b × c + c × a)/(a + F + T)
→ (a × b + b × c + c × a)/(a + b + T)
→ (a × b + b × c + c × a)/(a + b + F)
→ (a × b + b × c + c × a)/(a + b + c).
done. Each step always replaces the leftmost nonterminal, either by new
nonterminals or by terminals. A derivation carried out in this manner is called
a leftmost derivation. If, instead, the derivation always replaces the rightmost
nonterminal, it is called a rightmost derivation. To "replace" a nonterminal
means to substitute for it the right part of a production whose left part is
that nonterminal.
Using a derivation tree makes the derivation easier to grasp intuitively.
Corresponding to the derivation given above, we draw the derivation tree
shown in Fig. 6.1.
A → X1 X2 . . .Xk (6.9)
must be a production of P.
5) If a vertex n has the label ε, then n must be a leaf, and it is the only
child of its parent node.
6) All the leaf nodes of the tree from left to right form the sentence.
In the formal description of the derivation tree, we did not mention leftmost
derivations. Actually, the derivation tree may also be constructed through the
rightmost derivation. The key question is whether the leftmost derivation tree
is the same as the rightmost derivation tree; equivalently, whether the leftmost
derivation and the rightmost derivation of a sentence are "the same", since a
derivation corresponds exactly to a derivation tree.
Definition 6.7 [2] A grammar G = (VN , VT , S, P) is called unambiguous if
every sentence it generates has exactly one derivation tree.
Ambiguous grammars do exist. The following is an example of an ambiguous
grammar.
Example 6.6 The production set P of the grammar G = (VN , VT , S, P)
consists of the following productions:
E → E + E,
E → T,
T → T × F,
T → F,
F → (E),
F → a | b | c.
At first glance, the derivations for the two derivation trees look the same:
E→E+E
→E+E+E
→ T+E+E
→T×F+E+E
→ F×F+E+E
→a×F+E+E
→a×b+E+E
→a×b+T+E
→a×b+T×F+E
→a×b+F×F+E
→a×b+b×F+E
→a×b+b×c+E
→a×b+b×c+T
→a×b+b×c+T×F
→a×b+b×c+F×F
→a×b+b×c+c×F
→ a × b + b × c + c × a.
Actually, the two derivations differ in that, in the right side E + E, one replaces the first E to get E + E + E while the other replaces the second E to get E + E + E. Strictly speaking, the results are (E + E) + E and E + (E + E). Since the expression a × b + b × c + c × a thus has two derivation trees, namely the leftmost derivation tree and the rightmost derivation tree, and they are different, this is an ambiguous grammar.
Given a grammar, how can one decide whether it is ambiguous or not? We now present a sufficient condition.
Theorem 6.1 If a grammar contains a production in whose right part the nonterminal of the left part occurs consecutively two or more times, then the grammar must be ambiguous.
Proof Without loss of generality, suppose that the production is
A → AA. . ..
Applying the leftmost derivation and the rightmost derivation to the same production, we must get different derivation trees. Therefore the grammar is ambiguous.
Intuitively, the derivation tree of such an ambiguous grammar has the following form: two identical subtrees may appear either in the left subtree or in the right subtree of the whole tree, and these two positions are adjacent.
S1 → aS3 , S1 → bS2 , S2 → a,
S2 → aS1 , S2 → bS1 S1 , S3 → b,
S3 → bS1 , S3 → aS3 S3 .
S1 → Xa S3 , S1 → Xb S2 , S2 → a,
S2 → Xa S1 , S2 → Xb Z1 , Z1 → S1 S1 ,
S3 → b, S3 → Xb S1 , S3 → Xa Z2 ,
Z2 → S2 S2 , Xa → a, Xb → b
Obviously the grammar that has the nonterminal symbols S1 , S2 , S3 , Z1 , Z2 , the terminal symbols a and b, and the productions above is a context-free grammar in the Chomsky normal form. It is also equivalent to the given grammar G.
Using the Chomsky normal form we can prove an important decidability result for the class of Type 2 languages. The result relates the length of a word to the length of its derivation.
Lemma 6.1 [3] Let G = (VN , VT , S, P) be a context-free grammar in the Chomsky normal form. Then if S =⇒*α x (that is, x is derived from S by a derivation α), we have |α| ≤ 2|x| − 1.
Proof We prove a statement slightly stronger than that of the lemma, namely: if X =⇒*α x for some X ∈ VN , then |α| ≤ 2|x| − 1.
and we have |α| = |β| + |γ| + 1, because the productions used in the last two derivations are exactly the ones used in X =⇒*α x. Applying the inductive hypothesis we obtain
|α| = |β| + |γ| + 1 ≤ (2|u| − 1) + (2|v| − 1) + 1 = 2(|u| + |v|) − 1 = 2|x| − 1. (6.11)
Theorem 6.3 There is an algorithm to determine, for a context-free grammar G = (VN , VT , S, P) and a word x ∈ VT*, whether or not x ∈ L(G).
Proof Before we start our proof, we mention the following facts. First, for every context-free grammar G, there is a context-free, ε-free grammar G′ such that L(G′) = L(G) − {ε}. Furthermore, if G is a context-free grammar, then there is an equivalent context-free grammar G′ such that one of the following cases occurs:
1) if ε ∉ L(G), then G′ is ε-free;
2) if ε ∈ L(G), then G′ contains a unique erasure production S′ → ε, where S′ is the start symbol of G′, and S′ does not occur in any right part of any production of G′.
Having these facts, we may construct a grammar G′ equivalent to G such that one of the following two cases occurs:
1) If ε ∉ L(G), G′ is ε-free.
2) If ε ∈ L(G), G′ contains a unique erasure production S′ → ε, where S′ is the start symbol of G′, and S′ does not occur in any right part of any production of G′.
If x = ε, then x ∈ L(G) if and only if S′ → ε is a production in G′.
Suppose that x ≠ ε. Let G1 be a context-free grammar in the Chomsky normal form such that L(G1 ) = L(G′) − {ε}. We have x ∈ L(G1 ) if and only if x ∈ L(G). By Lemma 6.1, if S =⇒*α x, then |α| ≤ 2|x| − 1, so by listing all derivations of length at most 2|x| − 1 we can decide if x ∈ L(G).
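In practice one would not enumerate derivations; the same membership question for a grammar in the Chomsky normal form is usually answered by the Cocke-Younger-Kasami dynamic-programming algorithm. The following is only an illustrative sketch, not the book's algorithm; the dictionary-style grammar encoding, the function name cyk_member and the use of the Chomsky normal form grammar with S1, S2, S3, Z1, Z2 given above are our own assumptions.

```python
# A minimal sketch of CFG membership testing for a grammar in Chomsky
# normal form.  The grammar encoding and function name are illustrative
# assumptions, not the book's notation.

def cyk_member(start, unit_rules, binary_rules, word):
    """Decide whether `word` is in L(G) for a CNF grammar G.

    unit_rules:   list of pairs (A, a)       for productions A -> a
    binary_rules: list of triples (A, B, C)  for productions A -> B C
    """
    n = len(word)
    if n == 0:
        return False                    # CNF grammars generate no empty word
    # table[i][j] = set of nonterminals deriving word[i : i + j + 1]
    table = [[set() for _ in range(n)] for _ in range(n)]
    for i, a in enumerate(word):        # substrings of length 1
        for (A, b) in unit_rules:
            if b == a:
                table[i][0].add(A)
    for length in range(2, n + 1):      # longer substrings
        for i in range(n - length + 1):
            for split in range(1, length):
                left = table[i][split - 1]
                right = table[i + split][length - split - 1]
                for (A, B, C) in binary_rules:
                    if B in left and C in right:
                        table[i][length - 1].add(A)
    return start in table[0][n - 1]

# The Chomsky normal form grammar with S1, S2, S3, Z1, Z2 given above.
unit = [("S2", "a"), ("S3", "b"), ("Xa", "a"), ("Xb", "b")]
binary = [("S1", "Xa", "S3"), ("S1", "Xb", "S2"),
          ("S2", "Xa", "S1"), ("S2", "Xb", "Z1"), ("Z1", "S1", "S1"),
          ("S3", "Xb", "S1"), ("S3", "Xa", "Z2"), ("Z2", "S2", "S2")]
print(cyk_member("S1", unit, binary, "ab"))   # True, since S1 => Xa S3 => a b
```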
Definition 6.10 A context-free grammar G = (VN , VT , S, P) is in the Greibach normal form if all its productions are of the form X → aα, where X ∈ VN , a ∈ VT , and α ∈ VN*.
If G is in the Greibach normal form, then G is ε-free, so ε ∉ L(G).
Every ε-free context-free grammar has an equivalent grammar in the Greibach normal form. We will prove this fact in the following discussion, but in order to do so we first have to introduce some more preliminary notions.
Definition 6.11 (Left-recursive symbol) Let G = (VN , VT , S, P) be a context-free grammar. A nonterminal symbol X is left-recursive (right-recursive) if there exists a derivation X =⇒+G Xα (respectively X =⇒+G αX) for some α ∈ (VN ∪ VT )*.
Y → γ1 , . . ., Y → γn
be the list of all productions in P whose left part is Y. Then G is equivalent to the grammar G′ = (VN , VT , S, P′), where
and ω ∈ VT*. For n = 1, the statement of the lemma is immediate. Suppose that the statement holds for derivations of length less than n and let u =⇒nG ω. If the production X → αYβ is not used in this derivation, then we obviously have u =⇒*G′ ω. Otherwise this derivation can be written as
u =⇒*G ω′Xω″ =⇒G ω′αYβω″ =⇒*G ω. (6.13)
Thus, ω can be written as a product ω = u′ uα uY uβ u″, where ω′ =⇒* u′, α =⇒* uα , Y =⇒* uY , β =⇒* uβ , ω″ =⇒* u″ are derivations in G that are shorter than n. By the inductive hypothesis, we have the derivations
ω′ =⇒*G′ u′, α =⇒*G′ uα , β =⇒*G′ uβ , ω″ =⇒*G′ u″.
Also, the existence of the derivation Y =⇒*G uY entails the existence of a derivation γi =⇒*G uY for some i. By the inductive hypothesis, we obtain the derivation γi =⇒*G′ uY . Thus we obtain in G′ the derivation u =⇒*G′ ω′Xω″ =⇒G′ ω′αγi βω″ =⇒*G′ ω.
1) The left-recursive productions in PX :
X → Xα1 , . . ., X → Xαk ,
where αi ≠ ε for 1 ≤ i ≤ k.
2) The remaining productions in PX :
X → β1 , . . ., X → βl .
Then
P′ = (P − PX ) ∪ {X → βj Y, X → βj | 1 ≤ j ≤ l} ∪ {Y → αi Y, Y → αi | 1 ≤ i ≤ k}. (6.14)
The word βj αil . . .αi2 αi1 can also be derived in G′ from X using the rightmost derivation
X =⇒G′ βj Y =⇒G′ βj αil Y =⇒G′ . . . =⇒G′ βj αil . . .αi2 Y =⇒G′ βj αil . . .αi2 αi1 . (6.18)
Thus, every leftmost derivation S =⇒*G,left x corresponds to a rightmost derivation S =⇒*G′,right x, so L(G) ⊆ L(G′).
1) {X1 , . . ., Xn } ⊆ VN ;
2) the productions that have Xi as their left parts have the form Xi → aα or Xi → Xj α, where a ∈ VT , α ∈ (VN )*, and i < j;
3) if Y ∈ VN − {X1 , . . ., Xn }, then for every production Y → γ ∈ P, γ ≠ ε, all symbols in γ except the first symbol are nonterminals, and the first symbol of γ belongs to {X1 , . . ., Xn } ∪ VT .
Note that the productions that rewrite the symbol Xn are necessarily of the form Xn → aα, where a is a terminal symbol and α contains only nonterminal symbols. The right part of a production of the form Xn−1 → . . . begins with a terminal symbol or with Xn . The Xn in all productions Xn−1 → Xn . . . is then replaced by the right parts of the productions Xn → . . ..
Thus, all productions Xn−1 → . . . come to have the form Xn−1 → aγ, where γ ∈ (VN )*. This process is repeated for Xn−2 , . . ., X1 .
Finally, since the first symbol of γ in Y → γ is a terminal or a symbol
Xi , an application of Lemma 6.2 allows us to replace the productions Y →
Xi . . . with the productions whose right part begins with a terminal, thereby
obtaining a grammar in the Greibach normal form equivalent to the initial
grammar.
Based on the theorem we may design an algorithm that transforms a context-free grammar in the Chomsky normal form into an equivalent grammar in the Greibach normal form.
Algorithm T (transformation of a context-free grammar in the Chomsky normal form into an equivalent grammar in the Greibach normal form)
Input: A context-free grammar in the Chomsky normal form whose production set P consists of productions of the form Xi → Xk Xl and Xi → a.
Output: A production set in which all productions have the form Xi → aα, where a ∈ VT and α ∈ (VN′)*; VN′ consists of the original nonterminals {X1 , . . ., Xn } plus the additional nonterminals introduced during the transformation.
T1 (sorting the productions) Arrange all productions of the form Xi → Xk Xl in increasing order: the indices of the nonterminals in the left parts of the productions occur in increasing order and, for each such production, we have k > i. If a production Xj → Xu Xv has j > u, then by substituting for Xu using the productions with Xu , Xu+1 , . . . as left parts, we may transform the given production into the form Xj → Xj . . .; left recursion then occurs in this production.
Meanwhile, starting from X1 , if we have X1 → X2 . . . and X2 → a, then by replacing X2 in the right part of the first production we obtain X1 → aα, where α is a string of nonterminals. The production obtained and X2 → a are then in the Greibach normal form.
T2 (eliminating left recursion) In step T1 we may get productions of the form Xj → Xj γ, where γ ∈ (VN )*. If we have the two forms of productions Xj → Xj γ and Xj → a, then by replacing the Xj in the right part of the first production with the a of the right part of the second production, we obtain Xj → aγ, and this is a production in the Greibach normal form. In general, the left-recursive pair of productions
X → Xγ,
X → a (6.19)
is replaced by
X → aY,
Y → γY | ε. (6.20)
Yj → bl (bl ∈ VT ),
Yj → bl α (α ∈ VN*). (6.22)
X1 =⇒G X2 X3 =⇒G X1 X2 X3 . (6.24)
X2 =⇒G X1 X2 =⇒G X2 X3 X2 . (6.25)
X2 → X2 X3 X2 , X2 → a. (6.26)
X2 → aY1 , X2 → a, Y1 → X3 X2 Y1 , Y1 → X3 X2 .
X1 → X2 X3 , X2 → aY1 , X2 → a, Y1 → X3 X2 Y1 ,
Y1 → X3 X2 , X3 → X1 X3 , X3 → b.
A → BC, B → CA | b, C → AB | a.
X1 → X2 X3 , X2 → X3 X1 | b, X3 → X1 X2 | a.
Now the third production is not in increasing order with respect to the indices of the nonterminals in its left part and right part, so we need to change it. By substituting X1 using X1 → X2 X3 and then X2 using X2 → X3 X1 , we have
X3 → X1 X2 → X2 X3 X2 → X3 X1 X3 X2 .
X3 → bX3 X2 , X3 → a.
The productions with respect to Y are not in the Greibach normal form yet. But first we handle X2 ; we have
Suppose that the inequality holds for vertices of height less than n, and let v be a vertex of height n. Then the following two cases may occur:
1) If h(v) = h(v1 ), where v1 ∈ CHILDREN(v), then weight(v) = weight(v1 ) ≤ L(G)^h(v1) by the inductive hypothesis. Thus weight(v) ≤ L(G)^h(v) .
2) If h(v) > h(v1 ) for every v1 ∈ CHILDREN(v), then h(v) = 1 + max{h(v1 ) | v1 ∈ CHILDREN(v)}, and v has more than one child with positive weight. Let vi0 , . . ., vik−1 be the set of all such children of v, where k > 1. Note that we have k ≤ L(G) by the definition of derivation trees.
Therefore,
Since |ω|M equals the weight of the root vroot of the tree T and |ω|M ≥ nG , it follows that L(G)^(|AN|+1) = nG ≤ weight(vroot ) ≤ L(G)^h(vroot) . Thus, h(vroot ) ≥ |AN | + 1. Consequently, the tree contains a branch B such that each node v in B with h(v) > 0 has a child v′ such that h(v) = 1 + h(v′). Then there is a sequence of nodes (v0 , v1 , . . .) along B such that
since there are more than |AN | nodes in this sequence, there must be two nodes vi and vj with i < j that are labeled with the same nonterminal symbol X (see Fig. 6.3). Let z be the word over VT given by z = word(Tvj ). Then word(Tvi ) includes z, and so it must have the form yzu for some y, u ∈ VT*. Further, yzu is an infix of ω, so ω = xyzut for some x, t ∈ VT*, and there are the following derivations in G:
S =⇒* xXt, X =⇒* yXu, X =⇒* z. (6.28)
Corollary 6.1 For every context-free grammar G and marking set M, there exists a number nG such that if ω ∈ L(G) and |ω|M ≥ nG , then ω can be written as ω = xyzut such that |y|M ≥ 1 or |u|M ≥ 1, |yzu|M ≤ nG , and y^i zu^i ∈ PHRASE(ωi , G), where ωi = xy^i zu^i t ∈ L(G) for every i ∈ N.
Proof This follows from the proof of the theorem above.
There is another version of the pumping theorem. It corresponds to the case when the marking set M coincides with N.
Theorem 6.7 (Pumping lemma of Bar-Hillel, Perles, and Shamir) Let G be a context-free grammar. There exists a number nG ∈ N such that if ω ∈ L(G) and |ω| ≥ nG , then we can write ω = xyzut such that |y| ≥ 1 or |u| ≥ 1, |yzu| ≤ nG and xy^n zu^n t ∈ L(G) for all n ∈ N.
Proof If M = N, then |ω|M = |ω|, and the argument in the proof of Ogden's lemma holds for the current situation.
Example 6.10 Let A = {a, b, c}, and let L = {a^n b^n c^n | n ∈ N}. The language L is not context-free.
We prove this fact by using the pumping theorem of Bar-Hillel, Perles and Shamir. Suppose that L were a context-free language. Then there is nG ∈ N satisfying the property stated in the lemma of Bar-Hillel et al. Let ω = a^nG b^nG c^nG . Clearly |ω| = 3nG > nG , so ω = xyzut for some x, y, z, u, t such that |y| ≥ 1 or |u| ≥ 1, |yzu| ≤ nG and xy^n zu^n t ∈ L(G) for all n ∈ N. Note that neither y nor u may contain more than one type of symbol. Indeed, if y contained both a's and b's, then we could write y = y′a. . .ab. . .by″, since
Theorem 6.8 The class of context-free languages is not closed with respect
to intersection.
Proof Consider the context-free grammars
Suppose that the claim holds for derivations of length less than n, and let S^{qq′} =⇒nG x be a derivation of length n. If we write the first step of this derivation explicitly, we obtain
S^{qq′} =⇒G s0^{q q1} s1^{q1 q2} . . . s_{n−1}^{q_{n−1} q′} =⇒*G x. (6.31)
s0^{q q1} =⇒*G x0 ,
s1^{q1 q2} =⇒*G x1 ,
. . .
s_{i−1}^{q_{i−1} q_i} =⇒*G x_{i−1} ,
. . .
s_{n−1}^{q_{n−1} q′} =⇒*G x_{n−1} . (6.32)
S =⇒G S^{q0 q} =⇒G a0^{q0 q1} a1^{q1 q2} . . . a_{n−1}^{q_{n−1} q} (6.35)
for some final state q and any states q1 , . . ., qn−1 . We can select these intermediate states such that δ(qi , ai ) = qi+1 for 0 ≤ i ≤ n − 2 and δ(qn−1 , an−1 ) = q. Therefore, there are the following productions in P′:
a0^{q0 q1} → a0 , a1^{q1 q2} → a1 , . . ., a_{n−1}^{q_{n−1} q} → an−1 , (6.36)
This implies the existence of the derivation S =⇒*G a0 a1 . . .an−1 .
If ε ∈ R or ε ∈ L we consider the regular language R′ = R − {ε} and the context-free language L′ = L − {ε}. By the previous argument we can construct an ε-free context-free grammar G′ such that L(G′) = L′ ∩ R′. If ε ∉ L ∩ R, then L′ ∩ R′ = L ∩ R and this shows that L ∩ R is context-free. If ε ∈ L ∩ R, we have L ∩ R = (L′ ∩ R′) ∪ {ε}; then starting from G′ = (VN′ , VT , S′, P′) we construct the context-free grammar:
symbol.
In summary, a transducer J = (A, Q, B, θ, q0 , F) is
• ε-free if (q, x, y, q′) ∈ θ implies y ≠ ε;
• k-bounded if (q, x, y, q′) ∈ θ implies |x| ≤ k and |y| ≤ k;
• k-output if (q, x, y, q′) ∈ θ implies |y| = k.
Transducers can be represented by labeled directed multigraphs. Let / be a symbol such that / ∉ A ∪ B. The graph of a transducer J = (A, Q, B, θ, q0 ) is the labeled directed multigraph G(J) = (G, A*/B*, m), where G = (Q, E, s, d) is a directed multigraph whose set of vertices is the set Q of states of J and whose set of edges E contains an edge e with s(e) = q, d(e) = q′, and m(e) = x/y for every quadruple (q, x, y, q′) ∈ θ. Following the convention introduced for automata, the initial state q0 is denoted by an incoming arrow, and the final states are circled.
Example 6.12 Let Q = {q0 , q1 , q2 , q3 }, A = {a, b}, and B = {0, 1}. If θ is
the relation
θ = {(q0 , ab, 0, q1 ), (q1 , bb, 01, q3 ), (q0 , ba, 00, q2 ), (q2 , aa, 011, q3),
(q3 , ε, 100, q0 ), (q2 , aa, 0011, q1)}
F = {q3 },
then the graph of the transducer J = (A, Q, B, θ, q0 , F) is given in Fig. 6.5.
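To make the graph representation concrete, the following is a small illustrative sketch, not the book's code, that stores the transducer of Example 6.12 and lists the labeled edges x/y of its graph; the class name Transducer and the method edges are assumptions made for the example.

```python
# A minimal sketch of a transducer represented as a labeled directed
# multigraph: one edge q --x/y--> q' per quadruple in theta.

class Transducer:
    def __init__(self, A, Q, B, theta, q0, F):
        self.A, self.Q, self.B = A, Q, B
        self.theta = theta              # set of quadruples (q, x, y, q')
        self.q0, self.F = q0, F

    def edges(self):
        """Return the edges of the graph G(J), labeled x/y."""
        return [(q, q2, f"{x or 'ε'}/{y or 'ε'}") for (q, x, y, q2) in self.theta]

# The transducer of Example 6.12.
theta = {("q0", "ab", "0", "q1"), ("q1", "bb", "01", "q3"),
         ("q0", "ba", "00", "q2"), ("q2", "aa", "011", "q3"),
         ("q3", "", "100", "q0"), ("q2", "aa", "0011", "q1")}
J = Transducer(A={"a", "b"}, Q={"q0", "q1", "q2", "q3"}, B={"0", "1"},
               theta=theta, q0="q0", F={"q3"})
for edge in sorted(J.edges()):
    print(edge)          # e.g. ('q0', 'q1', 'ab/0')
```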
Problems
A → aBcC A → aBb A → aB A → a.
is inherently ambiguous.
References
Robert L. Solso
The syntax analysis is an essential step in the compilation of programs written in programming languages. In order to produce object programs executable on the computer, the source program has to be analyzed with respect to its correctness: the correctness of its lexicon, syntax and semantics. In general, the construction of the compiler puts the first two, i.e., the analysis of the lexicon and the analysis of the syntax, into one phase, the analysis phase. Of the two analyses, the syntax analysis is obviously more important than the lexical analysis, although the latter is the basis of the former. In comparison with the lexical analysis, the syntax analysis has more issues to explore. In the development history of compilers many researchers and software engineers devoted much of their time to designing methods for the syntax analysis. The syntax analysis has been regarded as the key to the success of compilers, and it was also regarded as the most arduous task in the construction of compilers. Now in the field of syntax analysis there are many successful techniques that are widely used in the construction of compilers, especially in the design of the syntax analyzer, and we should introduce and discuss them in a book devoted to the principles of compilers. The aim of this chapter is to explain the role of the syntax analysis, to introduce the important techniques used in the syntax analysis, and to compare these techniques in terms of their efficiency, ease of use, etc. We also want to point out that, though the main ideas of the methods introduced here come from the existing references, we have also made our own improvements to make them even more understandable or more efficient; we do not merely relay others' results. In the following
discussion, we will point out the role of the syntax analysis, explore the procedure of the analysis, and discuss a number of issues involved in the syntax analysis. When we explain the methods we will provide examples to show the practice. Sometimes we also provide algorithms to make the methods more operable and more precise.
The syntax analyzer, or parser as it is commonly called, takes the output of the lexical analyzer as input. In order to produce the object code that is executable on computers, it has to verify the correctness of the source program, or to point out the errors commonly occurring in source programs and to recover from these errors if possible. Therefore, the role of the syntax analyzer has a duality. On one hand, it is responsible for reporting any syntax error in an intelligible fashion; in this respect, it should also be able to recover from the errors and continue processing the remainder of its input. On the other hand, if it verifies that the source program is correct in terms of the syntax structure, its task is to transform the source program into the intermediate code so that in a further step it can be used to produce the object code. In this respect, a number of techniques have been invented to handle the problem. However, both aspects are ultimately based on consulting the grammar, or more specifically, the productions of the grammar. If some part of the source program violates the rules set by the productions, the parser issues an error report and tries to recover from the error; otherwise it confirms the correctness of this part of the source program and continues with generating the intermediate code or other relevant things. There are three general types of syntax analyzers, or parsers, for grammars. Universal parsing methods, such as the Cocke-Younger-Kasami algorithm [1] and Earley's algorithm [2], can work with any grammar. These methods, however, are too inefficient to use in production compilers. The methods commonly used in compilers are classified as either top-down or bottom-up. As their names indicate, top-down parsers build parse trees from the top (root) to the bottom (leaves) while bottom-up parsers build parse trees from the leaves and work up to the root. In both cases, the input to the parser is scanned from left to right, one symbol at a time.
In the sense of the formal grammars we have discussed above, the languages that are recognized, and most programming languages, are basically context-free. Therefore, the source programs written by users can be considered as sentences of the context-free grammar if they are correctly written. In other words, they are strings which the grammar accepts. When discussing the strings which a grammar accepts, we mentioned two approaches, the leftmost derivation and the rightmost derivation. The two approaches correspond to top-down and bottom-up parsing respectively. At the moment, we leave this topic and consider the nature of syntactic errors and general strategies for error recovery.
Syntax Error Handling
If a compiler could only process correct programs, it would not be very useful, since many programmers cannot make their programs correct the first time they write them, and a compiler that can do nothing when facing programs with errors leaves them helpless. Therefore, a good compiler should assist the programmer in identifying and locating errors. In particular, as we now discuss the syntax analysis, we are concerned with the errors that occur in the syntax. We require that the compiler be able to detect the syntactic errors. It turns out that much of the error detection and recovery in a compiler is centered around the syntax analysis phase. One reason for this is that many errors are syntactic in nature, or are exposed when the stream of tokens coming from the lexical analyzer disobeys the grammatical rules defining the programming language, such as an arithmetic expression with unbalanced parentheses. Another reason is the precision of modern parsing methods: the compiler can detect the existence of syntactic errors in programs very efficiently.
Accurately detecting the presence of semantic and logical errors at compile time is more difficult. In this chapter, however, we just focus on syntactic error handling.
The syntactical error handling has the following goals:
• It reports the existence of errors clearly and precisely.
• It recovers from each error quickly so that it can continue to detect sub-
sequent errors.
• It will not remarkably slow down the processing of correct programs.
Obviously, if the detection of errors and the recovery from them were very difficult, the realization of these goals would be a great challenge. Fortunately, in reality this is not the case. Common errors are simple, and a relatively straightforward error handler is enough. However, in some cases an error is detected long after the position where it actually occurred, and the precise nature of the error may also be difficult to determine. Therefore, in these cases one cannot expect the error handler to report the positions of errors precisely; the reports can only be taken as references. In difficult cases, the error handler may have to guess what the programmer had in mind when writing the program.
Several parsing methods such as the LL and LR (they will be discussed
soon in this chapter) detect an error as soon as possible. More precisely they
detect an occurring error as soon as they see a prefix of the input that is not
a prefix of any legal string in the language.
A common punctuation error is to use a comma in place of the semicolon
in the argument list of a function declaration, or vice versa. Others are to leave
out a mandatory semicolon at the end of a line and to put in an extraneous
semicolon at the end of a line before the word else. Perhaps the reason why
semicolon errors are so common is that the use of semicolons varies from one
language to another. Hence, in order to avoid such errors, the programmer
should refer to the manual of the programming language for the regulation.
How should an error handler report the presence of an error? A com-
mon strategy which many compilers adopt is to print the offending line with
a pointer to the position where an error is detected. However, if there is a
reasonable likelihood of what the error actually is, an informative understand-
able diagnostic message may be included, for example, “semicolon missing at
this position”.
After the errors are detected, the next step is to recover from them. From the viewpoint of program writers, they want the error handler to provide accurate information so that they can easily correct the program according to the message given. But the error handler does not always have the ability to do so. There are a number of different strategies that an error handler may adopt to recover from a syntactic error. Here we list the following strategies:
• Panic mode.
With this strategy, when an error is discovered, the parser discards input symbols one at a time until one of a designated set of synchronizing tokens is found. The synchronizing tokens are usually delimiters, such as semicolon or end, whose role in the source program is clear. The compiler designer must select the synchronizing tokens appropriate for the source language. One of the disadvantages of this strategy is that it often skips a considerable amount of input without checking it for any possible errors. Its advantage, however, is its simplicity. And, unlike some other methods to be considered later, it is guaranteed not to go into an infinite loop.
• Phrase-level recovery.
With this strategy, when an error is discovered, the parser may perform a local correction on the remaining input. That is, it may replace a prefix of the remaining input by a string that allows the parser to continue (otherwise the parser could not proceed). The choice of the local correction is up to the compiler designer. Commonly, the typical local correction is to replace a comma by a semicolon, to delete a seemingly extraneous semicolon, or to insert a missing semicolon. This strategy carries the danger that an improper correction may send the parser into an infinite loop, so we must be very careful in choosing the replacements. This can be regarded as one drawback of the strategy; the other drawback is that the actual error is likely to have occurred before the point of detection.
• Error productions.
If we have a good idea of the common errors that might be encountered, we can augment the grammar for the language at hand with productions that generate the erroneous constructs. This is the idea of the strategy. We use the grammar augmented by these error productions to construct our syntactical analyzer. If an error production is used by the analyzer,
The aim of the syntax analysis is to verify the correctness of programs in terms of the syntax of the programming language. That means that we want to know whether or not the source program can be derived from the distinguished nonterminal symbol using the productions. If at the end of these derivations we really get the source program, we can say that the source program is correct in terms of the syntax of the language. This process runs from the root of the parsing tree toward the leaves of the tree, hence it is called top-down analysis. The other way starts from the source program, i.e., the leaves of the parsing tree. Some leaves can be combined together to form the right part of a production and are then replaced by the parent of these strings. Gradually the process climbs up the tree towards the root. If in the end the process reduces the input to the root of the parsing tree, it means that the root matches the leaves, i.e., the source program. Therefore, this is called bottom-up analysis. In the analysis process, in order to ensure efficiency, we want to avoid backtracking over the path already traversed. That means that we want the process to keep going in one direction rather than going back and forth in two directions. LL(1) and LR(1) were developed from this idea. Specifically, LL(1) originates from the top-down approach while LR(1) originates from the bottom-up approach. The first "L" in LL(1) stands for scanning the input from left to right, the second "L" stands for producing a leftmost derivation, and the "1" in parentheses stands for using one input symbol of lookahead at each step to make parsing action decisions. As for LR(1), the "L" has the same meaning as in LL(1), while "R" stands for constructing a rightmost derivation in reverse, and the "1" in the parentheses stands for the number of lookahead input symbols used in making parsing decisions. Sometimes the "1" is omitted, and the default value is still 1. We discuss the two approaches separately as follows.
The top-down syntax analysis can be considered as an attempt to find
but then there are two productions with A as the left part:
A → aB, A → aC.
In this case, in order to continue the creation of the parsing tree, should we use A → aB or A → aC? Since both have a as the first character, without knowing the following one we have no idea which one is the right choice. In order to solve the problem we need to introduce the concept of FIRST(α), which will differentiate productions from each other according to their FIRST() sets.
Definition 7.1 [2] Let A → α be a production of a context-free grammar G. We define FIRST(A → α) = {a | a is the first terminal symbol occurring in α from left to right, or, starting from A → α, there is a derivation A =⇒+G a . . .}.
E → EAE | (E) | −E | id | ε,
A → + | − | × | / |↑ .
P → XY,
P → YU,
X → aX,
X → ε,
Y → c,
Y → bY,
U → c.
Similarly, we have
In order to decide which production should be used among two or more pro-
ductions, we need to consider another concept called FOLLOW(). It stands
for follower set.
Definition 7.2 Given a grammar G. For the nonterminal symbol A,
FOLLOW(A) = {x | x ∈ VT and x may occur immediately after A in the
productions of G}.
According to the definition, we point out that:
1) For the distinguished symbol S, we stipulate that FOLLOW(S) = {$ | $
is the end token located at the right end of input string}.
2) If A → αBβ is a production of G, then FOLLOW(B) contains {x | x ∈ FIRST(β) and x ≠ ε}. Here, FIRST(β) may involve a number of cases. β stands for a string of terminals and nonterminals. If its first symbol is a terminal, then that terminal is in FIRST(β). If the first symbol is a nonterminal, then we have to look at the FIRST() of the productions with that nonterminal as their left part; in this case, if the nonterminal occurs as the left part of more than one production, then FIRST(β) may have more than one element.
3) If there is a production A → αB, or A → αBβ where FIRST(β) contains ε (that is, β =⇒* ε), then FOLLOW(A) is contained in FOLLOW(B).
Example 7.2 Given the set of productions of the grammar G as follows:
S → XY,
X → PQ,
X → YU,
P → pP,
P → ε,
Q → qQ,
Q → d,
Y → yY,
Y → e,
U → uU,
U → f.
What really distinguishes one production from another is the combination of FIRST() and FOLLOW(), so it is natural to combine them. We have the following definition.
Definition 7.3 Let G be a grammar, A one of its nonterminals, and A → α one of its productions; then the derivation symbol set (abbreviated as DS) of A → α is defined as
DS(A → α) = {a | a ∈ FIRST(A → α), or α =⇒* ε and a ∈ FOLLOW(A)}.
Now we can see that if the productions with the same left part all have disjoint derivation sets, then it is deterministic which production should be used in the derivation process.
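The sets FIRST(), FOLLOW() and DS() can be computed mechanically by the usual fixed-point iteration. The following is a minimal sketch under our own grammar encoding (productions as (left part, right part) pairs, with ε as the empty tuple); the function names and the use of the grammar of Example 7.2 as input are assumptions, not the book's algorithm.

```python
# A minimal sketch of the usual fixed-point computation of FIRST and FOLLOW,
# and of DS(A -> alpha) = FIRST(A -> alpha), plus FOLLOW(A) if alpha =>* epsilon.

def first_of(seq, FIRST, nullable):
    out = set()
    for sym in seq:
        out |= FIRST.get(sym, {sym})          # a terminal's FIRST is itself
        if sym not in nullable:
            return out
    return out

def compute_sets(productions, start, end_marker="$"):
    nonterminals = {A for A, _ in productions}
    nullable = set()
    FIRST = {A: set() for A in nonterminals}
    FOLLOW = {A: set() for A in nonterminals}
    FOLLOW[start].add(end_marker)
    changed = True
    while changed:                            # iterate until nothing changes
        changed = False
        for A, rhs in productions:
            if all(s in nullable for s in rhs) and A not in nullable:
                nullable.add(A); changed = True
            f = first_of(rhs, FIRST, nullable)
            if not f <= FIRST[A]:
                FIRST[A] |= f; changed = True
            for i, s in enumerate(rhs):
                if s in nonterminals:
                    rest = rhs[i + 1:]
                    f = first_of(rest, FIRST, nullable)
                    if all(t in nullable for t in rest):
                        f |= FOLLOW[A]
                    if not f <= FOLLOW[s]:
                        FOLLOW[s] |= f; changed = True
    DS = {}
    for A, rhs in productions:
        ds = first_of(rhs, FIRST, nullable)
        if all(s in nullable for s in rhs):
            ds |= FOLLOW[A]
        DS[(A, rhs)] = ds
    return FIRST, FOLLOW, DS

# The grammar of Example 7.2: S -> XY, X -> PQ | YU, P -> pP | ε, ...
G = [("S", ("X", "Y")), ("X", ("P", "Q")), ("X", ("Y", "U")),
     ("P", ("p", "P")), ("P", ()), ("Q", ("q", "Q")), ("Q", ("d",)),
     ("Y", ("y", "Y")), ("Y", ("e",)), ("U", ("u", "U")), ("U", ("f",))]
FIRST, FOLLOW, DS = compute_sets(G, "S")
```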
However, there is a case that makes the derivation unable to proceed. We mean the situation of left recursion; for example, in
A → Aα,
A → a,
the first production belongs to this case. From the angle of the derivation it will produce
A → Aα → Aαα → . . . → Aαα. . .α.
The derivation keeps having A as the first symbol of the right part, so it can never make a breakthrough. In addition, from the angle of the derivation sets, we have
The two derivation sets are completely equal. In this case, whenever the derivation involves the decision of which production with A as the left part to use, no decision can be made, and the derivation cannot be carried out. Therefore, in order to break the impediment, the left recursion should be eliminated.
The following approach can eliminate the left recursions.
Lemma The left recursion of nonterminal A in production A → Aα can
be eliminated if there is another production that also has A as its left part:
A → β, where α and β are sequences of terminals and nonterminals that
do not start with A. This is done by rewriting the productions for A in the
following manner:
A → βB,
B → αB | ε.
The proof of the lemma is not difficult, and we do not give it formally. We just mention that every string that is derived from the original productions can also be derived from the latter productions. For example, the form of the strings derived from the former productions is βα*. By using the latter productions, we have A → βB → βαB → . . . → βα. . .αB → βα. . .α, that is, βα*. Conversely, every string that is derived from the latter productions can also be derived from the former productions. For example, the form of the strings derived from the latter productions is βα*. Then with the former productions we have
A → Aα → Aαα → . . . → Aαα. . .α.
In the final step, we replace A with β and get βα*. That means that the two sets of sentences which the two grammars generate are equal.
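The rewriting described in the lemma can be sketched in code. The following illustration, under our own encoding of productions, eliminates the immediate left recursion of one nonterminal by introducing a fresh nonterminal B, exactly as in the rule A → βB, B → αB | ε; it is only a sketch, and the function name is an assumption.

```python
# A minimal sketch of eliminating immediate left recursion:
# A -> A alpha_1 | ... | A alpha_k | beta_1 | ... | beta_l
# becomes
# A -> beta_j B,   B -> alpha_i B | epsilon   (B a fresh nonterminal).

def eliminate_left_recursion(A, productions, fresh="B"):
    recursive = [rhs[1:] for lhs, rhs in productions
                 if lhs == A and rhs[:1] == (A,)]         # the alpha_i
    others = [rhs for lhs, rhs in productions
              if lhs == A and rhs[:1] != (A,)]            # the beta_j
    rest = [p for p in productions if p[0] != A]
    if not recursive:
        return productions                                 # nothing to do
    new = [(A, beta + (fresh,)) for beta in others]
    new += [(fresh, alpha + (fresh,)) for alpha in recursive]
    new += [(fresh, ())]                                   # B -> epsilon
    return rest + new

# A -> A alpha | a  becomes  A -> a B, B -> alpha B | epsilon.
prods = [("A", ("A", "alpha")), ("A", ("a",))]
print(eliminate_left_recursion("A", prods))
```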
Here "other" stands for any other statement. This is an ambiguous grammar, as some sentences have two parse trees. But it is not difficult to rewrite it so that it becomes an unambiguous one.
Now we introduce the concept of LL(1) grammar.
Definition 7.4 In a grammar G, if for every nonterminal that occurs as the left part of two or more productions the derivation sets of these productions are pairwise disjoint, then the grammar G is called an LL(1) grammar.
The determination of LL(1) grammar.
We now know what an LL(1) grammar is, and that in order to carry out the top-down derivation deterministically we must adopt the LL(1) method, making sure that all the productions with the same nonterminal as their left parts have disjoint derivation sets. If this is the case, then the derivation process can proceed deterministically from beginning to end. If the condition is not satisfied, however, does it mean that the LL(1) method cannot be used? Not necessarily. We still have a chance if the grammar can be transformed into an LL(1) one; otherwise many grammars would be shut outside the door of LL(1) and LL(1) would not be attractive. But only some grammars can be transformed into LL(1) form; others really cannot be analyzed by the LL(1) method.
Now the problem is how to transform a grammar that is known to be non-LL(1) into LL(1). We consider the following example.
Example 7.3 Given a grammar that describes programs as follows:
PROGRAM→begin DECLIST comma STATELIST end,
DECLIST→d semi DECLIST,
DECLIST→d,
STATELIST→s semi STATELIST,
STATELIST→s.
Decide whether it is LL(1) or not. If it is not, is it possible to transform it into an LL(1) grammar?
Solution We just need to check the derivation sets for DECLIST and STATELIST, as only these two nonterminals each have two derivation sets.
We look at DECLIST first.
DS(DECLIST→d semi DECLIST)={d}
DS(DECLIST→d)={d}
So the two have completely equal derivation sets. The grammar is not an LL(1) one. Notice the difference between the two productions: in the former, the terminal d is followed by semi, while in the latter it is followed by nothing, and the right part of PROGRAM implies that it will then be followed by comma. Therefore all that is needed is to introduce a new nonterminal. We then have
PROGRAM→begin DECLIST comma STATELIST end,
DECLIST→d X,
X→semi DECLIST,
X→ ε.
A → αA′,
A′ → β1 | β2 .
A → αA′ | γ,
A′ → β1 | β2 | . . . | βn .
For example, consider the grammar
P → Qx,
P → Ry,
Q → sQm,
Q → q,
R → sRn,
R → r.
P → sQmx,
P → qx,
P → sRny,
P → ry.
Now there are derivation sets that still intersect, so we need to do the left factoring. We get
P → qx,
P → ry,
P → sP1 ,
P1 → Qmx,
P1 → Rny.
Now the first three derivation sets are disjoint. But the last two P1 productions are in a situation similar to that of the P productions before. We then handle them as we did for the P productions. This time we get
P1 → sQmmx,
P1 → qmx,
P1 → sRnny,
P1 → rny.
The derivation set of the first P1 production and that of the third one intersect again. Once again we use left factoring to deal with them. We get
P1 → qmx,
P1 → rny,
P1 → sP2 ,
P2 → Qmmx,
P2 → Rnny.
Now the situation of P2 is similar to that of P1 and of P, except that the right parts have become longer. Obviously the process cannot end. Therefore the attempt to transform an arbitrary grammar in this way by left factoring is destined to fail.
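A single left-factoring step can nevertheless be sketched in code. The following illustration (our own encoding; the helper name one_left_factoring_step is an assumption) extracts the common prefix of the alternatives of one nonterminal, assuming they all share one, and introduces a fresh nonterminal for the differing suffixes.

```python
# A minimal sketch of one left-factoring step:
# A -> gamma beta_1 | gamma beta_2 | ...
# becomes
# A -> gamma A1,   A1 -> beta_1 | beta_2 | ...

def common_prefix(seqs):
    prefix = []
    for symbols in zip(*seqs):
        if len(set(symbols)) == 1:
            prefix.append(symbols[0])
        else:
            break
    return tuple(prefix)

def one_left_factoring_step(A, productions, fresh="A1"):
    alts = [rhs for lhs, rhs in productions if lhs == A]
    rest = [p for p in productions if p[0] != A]
    prefix = common_prefix(alts)
    if len(alts) < 2 or not prefix:
        return productions                      # nothing to factor
    new = [(A, prefix + (fresh,))]
    new += [(fresh, rhs[len(prefix):]) for rhs in alts]
    return rest + new

# P -> s Q m x | s R n y  becomes  P -> s P1, P1 -> Q m x | R n y.
prods = [("P", ("s", "Q", "m", "x")), ("P", ("s", "R", "n", "y"))]
print(one_left_factoring_step("P", prods, fresh="P1"))
```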
Previously we discussed indirect left recursion; now we discuss it once again and present a concrete example:
S → RT,
R → TU,
T → SW,
U → VT,
W → RU.
Notice that the production set is not complete. Here we start from S and make the derivation S → RT → TUT → SWUT. Then the left recursion on S occurs. Therefore we can always try to transform an indirect left recursion into a direct left recursion; the following shows how. Consider the following grammar
grammar
S → Aa | b,
A → Ac | Sd | ε.
At first we transform it to be
S → Sda | b,
A → Ac | Aad | bd | ε.
S → bS ,
S → daS | ε,
A → bdA | ε,
A → cA | adA | ε.
The example shows that there are grammars with inherent left recursion, for which all the productions with the same nonterminal as their left parts contain common terminals in their derivation symbol sets. We are unable to eliminate the left recursion for this kind of grammar. There are also grammars that look very simple but are not LL(1) and cannot be transformed into LL(1) grammars.
The language which an LL(1) grammar generates is called an LL(1) language. We have now seen that a grammar without left recursion may or may not be transformable into an equivalent LL(1) grammar, that is, into an LL(1) grammar that generates the same language.
A problem now comes up: does there exist an algorithm that is able to determine whether a given language is LL(1) or not? The answer to the question is no. This kind of problem is called undecidable or unsolvable. Theoretically, it has been proven that such an algorithm does not exist; at least, there is no algorithm that applies to every case.
Having explained LL(1) grammars and LL(1) languages, we now introduce the LL(1) syntactical analysis method. It belongs to the top-down syntactical methods.
In order to perform the LL(1) syntactical analysis, we first have to determine whether the given grammar is LL(1). The specific process is as follows:
1) Check whether the grammar contains left recursion. If it does, the left recursion should be eliminated. If it cannot be eliminated, then the process ends and the syntactical analysis fails.
2) If the grammar does not contain left recursion, or the left recursion has been eliminated, then the next step is to check whether the derivation sets of the productions with the same left part are disjoint. If they are disjoint, then the LL(1) syntactical analysis can be carried out immediately. However, if they are not disjoint, then the productions need to be transformed by extracting common factors or by other methods. If the transformation fails, the syntactical analysis also ends with failure.
When the two steps above succeed, the top-down syntactical analysis of the compiler may really start. Now, in connection with Example 7.3, we introduce the syntactical analysis process. After the grammar is transformed into an LL(1) grammar, the derivation sets of its productions are as follows:
DS(DECLIST→dX)={d},
DS(X→semi DECLIST)={semi},
DS(X→ ε)={comma},
DS(Y→semi STATELIST)={semi},
DS(Y→ ε)={end}.
Using these derivation sets, we construct the LL(1) syntactical analysis table
as shown in Table 7.1.
Table 7.1 LL(1) syntactical analysis table

Nonterminal  begin                                       d            semi               comma  s              end
PROGRAM      PROGRAM→begin DECLIST comma STATELIST end
DECLIST                                                  DECLIST→d X
X                                                                     X→semi DECLIST     X→ε
STATELIST                                                                                        STATELIST→s Y
Y                                                                     Y→semi STATELIST                         Y→ε

Note: suppose that the input sentence is begin d semi d semi d comma s semi s end.
symbol is not a terminal, this means that it needs to derive further from the nonterminal until a production is reached whose right part has a terminal as its first element; or, if in the process we meet a production with ε as its right part, then we need to look at the followers of the nonterminal that occurs as the left part of that production. The reason for doing so is that the elements of DS are of two kinds. One kind is the first terminal which the nonterminal may derive from the production with it as the left part: either this terminal occurs as the first symbol of the right part of the production, or the first symbol is not a terminal but a nonterminal, and the derivation from that nonterminal yields the aforementioned terminal as its first symbol. The other kind is the follower symbol: when the production in question has ε as its right part, the DS elements of this nonterminal are the first symbols of the symbol that follows it, or, if that one also derives ε, the first symbols of the next symbol following it, or its follower symbols, and so on. When the first item of the right part of the production is not a terminal, we have to go to LA3.
LA3 (branch and return). On the production with the start symbol as the left part, the process advances over the right part one symbol after another. For a terminal symbol (such as comma in the last example) we just need to check whether the input matches the same symbol in the production. If it is a nonterminal, then we go to a subprogram that handles the nonterminal, where we use its DS to decide which alternative corresponds to the input. If in this process we meet a nonterminal once again, we go to another subprogram. In this way, the subprograms are nested. Therefore, when returning from a subprogram, we need to return to where the subprogram was invoked. When control returns to the main program that first entered the subprogram of the nonterminal, it advances to the next symbol to the right of that nonterminal in the production with the start symbol as the left part. Finally we come to matching the final symbol of the input with the final symbol of the right part of the production with the start symbol as the left part. When the process proceeds smoothly without any inconsistency, the algorithm reports success, which means that the input program is legal in terms of syntax.
Actually, the algorithm above can also be used to construct the parsing tree, as long as the transformed LL(1) productions are taken as forming a parsing tree with the start symbol as the root of the tree and the nonterminals as the inner vertices. Taking the input above as our example, we draw the parsing tree shown in Fig. 7.2.
Thus the leaves of the tree are linked to become
begin d semi d semi d comma s semi s end
Here, we invoke X→semi DECLIST twice and X→ ε once; we also invoke Y→semi STATELIST once and Y→ ε once. We go to DECLIST from X→semi DECLIST after matching the first semi, and when we handle DECLIST, we get DECLIST→d X. Then we go to subprogram X, and from X we get semi DECLIST again. So we go to DECLIST twice. We obtain
DECLIST→d X
→d semi DECLIST (we invoke X→semi DECLIST for
the first time)
→d semi d X
→d semi d semi DECLIST (X→semi DECLIST is invoked for
the second time)
→d semi d semi d X
→d semi d semi d (X→ ε is invoked once)
Similarly,
STATELIST→s Y
→s semi STATELIST (we invoke Y→semi STATELIST for
the first time)
→s semi s Y
→s semi s (Y→ ε is invoked once)
By making use of the idea of regarding a nonterminal as a subprogram, the process of top-down syntactical analysis can be seen clearly.
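The view of each nonterminal as a subprogram is exactly recursive-descent parsing. The following minimal sketch, which is our own code and not the book's, writes one subprogram per nonterminal of the transformed grammar of Example 7.3 and checks the input begin d semi d semi d comma s semi s end; the class and method names are assumptions.

```python
# A minimal recursive-descent sketch for the LL(1) grammar
#   PROGRAM -> begin DECLIST comma STATELIST end
#   DECLIST -> d X          X -> semi DECLIST | epsilon
#   STATELIST -> s Y        Y -> semi STATELIST | epsilon
# Each nonterminal becomes one parsing subprogram.

class Parser:
    def __init__(self, tokens):
        self.tokens = tokens
        self.pos = 0

    def lookahead(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else "$"

    def match(self, expected):
        if self.lookahead() != expected:
            raise SyntaxError(f"expected {expected}, got {self.lookahead()}")
        self.pos += 1

    def program(self):
        self.match("begin"); self.declist(); self.match("comma")
        self.statelist(); self.match("end"); self.match("$")

    def declist(self):
        self.match("d"); self.x()

    def x(self):                      # X -> semi DECLIST | epsilon
        if self.lookahead() == "semi":
            self.match("semi"); self.declist()
        # on comma (a follower of X) choose X -> epsilon

    def statelist(self):
        self.match("s"); self.y()

    def y(self):                      # Y -> semi STATELIST | epsilon
        if self.lookahead() == "semi":
            self.match("semi"); self.statelist()
        # on end (a follower of Y) choose Y -> epsilon

Parser("begin d semi d semi d comma s semi s end".split()).program()
print("input accepted")
```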
We consider another example, this time on arithmetic expressions. After transformation the productions become LL(1):
E → TE′,
E′ → +TE′,
E′ → ε,
T → FT′,
T′ → ∗FT′,
T′ → ε,
F → (E),
F → id.
At first, we need to evaluate the Derivation Set (DS) of each production.
They are
DS(E → TE′) = {(, id},
DS(E′ → +TE′) = { + },
DS(E′ → ε) = {$}, ($ is the end sign of the input string)
DS(T → FT′) = {(, id},
DS(T′ → ∗FT′) = { ∗ },
DS(T′ → ε) = {+, ), $},
DS(F → (E)) = {(},
DS(F → id) = {id}.
We now construct the syntactical analysis table (see Table 7.2). This time we add the symbol √ to indicate that the input symbol is also a follower symbol.
Table 7.2 LL(1) syntactical analysis table

Nonterminal  id       +          *          (        )       $
E            E→TE′                          E→TE′    √       √
E′                    E′→+TE′                        E′→ε    E′→ε
T            T→FT′    √                     T→FT′    √       √
T′                    T′→ε       T′→*FT′             T′→ε    T′→ε
F            F→id     √          √          F→(E)    √       √
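Besides the subprogram view, the LL(1) analysis can be driven directly by the table and an explicit stack. The following is a minimal sketch, our own code rather than the book's, whose table encodes the entries of Table 7.2 and which parses the input id + id * id; the names TABLE and ll1_parse are assumptions.

```python
# A minimal table-driven LL(1) sketch using the entries of Table 7.2.
# The stack holds grammar symbols; TABLE maps (nonterminal, input symbol)
# to the right part of the chosen production.

TABLE = {
    ("E",  "id"): ["T", "E'"], ("E",  "("): ["T", "E'"],
    ("E'", "+"):  ["+", "T", "E'"], ("E'", ")"): [], ("E'", "$"): [],
    ("T",  "id"): ["F", "T'"], ("T",  "("): ["F", "T'"],
    ("T'", "+"):  [], ("T'", "*"): ["*", "F", "T'"], ("T'", ")"): [], ("T'", "$"): [],
    ("F",  "id"): ["id"], ("F",  "("): ["(", "E", ")"],
}
NONTERMINALS = {"E", "E'", "T", "T'", "F"}

def ll1_parse(tokens):
    stack = ["$", "E"]                     # start symbol on top of end marker
    tokens = tokens + ["$"]
    i = 0
    while stack:
        top = stack.pop()
        if top in NONTERMINALS:
            entry = TABLE.get((top, tokens[i]))
            if entry is None:
                return False               # empty table entry: syntax error
            stack.extend(reversed(entry))  # push right part, leftmost on top
        else:
            if top != tokens[i]:
                return False               # terminal mismatch
            i += 1
    return i == len(tokens)

print(ll1_parse(["id", "+", "id", "*", "id"]))   # True
```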
Notice that in the figure, the downward arrows represent invocations while the upward ones represent returns to the caller that invoked the subprogram. From the discussion above, we can see that by contrasting the DS elements with the current input symbol, the analysis is deterministic; it does not need backtracking. This is very important for parsing methods, as only in this way can they be efficient. Whether a syntactical analysis method is acceptable in practice mainly depends on this criterion, efficiency. With this in mind, in the following we introduce another method, the bottom-up method, or its representative, the LR(1) syntactical analysis method.
nonterminal. We regard the input elements as the leaves of the parsing tree. Then we climb the tree through shift and reduce operations. The aim of the shift operations is to collect the leaves that belong to the same parent, while the reduce operations reduce the leaves to their parent. In this way, the analysis "climbs" the parsing tree until the input is reduced to the root of the tree, i.e., the start symbol of the grammar. Similar to the LL(1) method, if the analysis finishes successfully, it reports success, meaning that the input program is legal in terms of syntax; otherwise it reports failure and the input is illegal.
In order to reduce a number of terminal symbols to their parents and
then to reduce nonterminal symbols to the start symbol, or more generally
to reduce the string of nonterminals and terminals to the start symbol, the
reduction operations are essential. Before we discuss the details of reduction
operation, we introduce the concept of handle first.
We introduce the concept of handle both informally and formally. At first,
informally, a “handle” of a string is a substring that matches the right side
of a production, and whose reduction to the nonterminal on the left part of
production represents one step along the reverse of the rightmost derivation.
In many cases the leftmost substring β that matches the right part of some
production A → β is not a handle, because a reduction by the production
A → β yields a string that cannot be reduced to the start symbol.
Then formally, we have
Definition 7.5 If S is the start symbol of a context-free grammar G, and we have a rightmost derivation
S =⇒* αAω =⇒ αβω,
then A → β in the position following α is a handle of αβω. Note that the string ω to the right of the handle contains only terminal symbols.
We say a handle rather than the handle because the grammar could be
ambiguous with more than one rightmost derivation of αβω.
Example 7.4 Suppose that G is a context-free grammar with the following
productions:
S → aS,
S → As | A,
A → bSc | bc.
A rightmost derivation in this grammar is as follows:
S → aS → aAs → abScs → abAcs → abbSccs.
Here abbSccs is a right-sentential form, and bSc is a handle, as by the production A → bSc it can be reduced to A.
From the definition, we can see that the handle is very important to a reduction, because only through the handle may the reduction be carried out [5].
S → Sc,
S → SA | A,
A → aSb | ab.
of the production; that is, at the positions between two symbols, and at the positions before the first symbol and after the last symbol of the right part, we add tokens. These are called configurations. Each token contains two numbers: the first denotes the number of the production, and the second denotes the order of the position from left to right in the right part of the production. The positions are numbered from 0. So now we have the productions with numbered configurations:
1) S →(1, 0) S(1, 1) c(1, 2) ;
2) S →(2, 0) S(2, 1) A(2, 2) ;
3) S →(3, 0) A(3, 1) ;
4) A →(4, 0) a(4, 1) S(4, 2) b(4, 3) ;
5) A →(5, 0) a(5, 1) b(5, 2) .
Therefore, for grammar G its configuration set consists of {(1, 0), (1, 1), (1, 2), (2, 0), (2, 1), (2, 2), (3, 0), (3, 1), (4, 0), (4, 1), (4, 2), (4, 3), (5, 0), (5, 1), (5, 2)}.
In this set, we define a relation R as follows: the configuration (1, 0) faces S, and we define it to be equivalent to the configurations that are at the leftmost position of the right parts of the productions with S as the left part. Likewise, if a configuration faces another nonterminal, then it is equivalent to the configurations that are at the leftmost position of the right parts of the productions with this nonterminal as the left part. The equivalence is propagated transitively in this way until no configuration faces a nonterminal any more, or all the configurations that should be taken into account have been exhausted. So we have the following equivalence relations. We use ∼ to denote R, and have
On the other hand, since (1, 0) ∼ (2, 0), both face S, so their next configura-
tions are equivalent too, so (2, 1) ∼ (1, 1), hence we have
Now we see that both (4, 0) and (5, 0) belong to two equivalence classes. We define the shift function f to represent the numbering of the classes. The function f takes a configuration (or class) and a character as arguments, and yields the next configuration or its equivalence class as the value. We start with the configuration (1, 0); as
f((1, 0), −) = {(1, 0), (2, 0), (3, 0), (4, 0), (5, 0)} = 1,
f(1, S) = {(1, 1), (2, 1)}, and f((2, 1), −) = {(4, 0), (5, 0)},
so
{(1, 1), (2, 1), (4, 0), (5, 0)} = 2.
Subsequently, we define
so
{(4, 1), (2, 0), (3, 0), (4, 0), (5, 0), (5, 1)} = 6,
f(6, S) = {(4, 2), (2, 1)} = 7,
f(7, b) = {(4, 3)} = 8,
f(6, b) = {(5, 2)} = 9.
These class numbers are regarded as state numbers. In this way, configurations and states are put into correspondence. We find that different configurations may have the same state number, as they belong to the same equivalence class, while a configuration may have more than one state number. We then have the following productions with states attached to the corresponding configurations:
S →1 S2 c3 ,
S →1, 6 S2, 7 A4 ,
S →1, 6 A5 ,
A →1, 2, 6 a6 S7 b8 ,
A →1, 2, 6 a6 b9 .
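The grouping of configurations into equivalence classes and the shift function f correspond to the closure and goto operations used to build LR(0) item sets. The following is a minimal sketch, our own code and 0-based encoding rather than the book's, that computes these classes for the grammar S → Sc | SA | A, A → aSb | ab; the resulting numbering of the states may differ from the numbering 1-9 used above.

```python
# A minimal sketch of building the configuration classes (LR(0) item sets).
# A configuration is a pair (production index, position); closure() adds the
# equivalent configurations, goto_() is the shift function f.

PRODS = [("S", ("S", "c")),      # 1) S -> S c
         ("S", ("S", "A")),      # 2) S -> S A
         ("S", ("A",)),          # 3) S -> A
         ("A", ("a", "S", "b")), # 4) A -> a S b
         ("A", ("a", "b"))]      # 5) A -> a b
NONTERMINALS = {"S", "A"}

def closure(configs):
    result = set(configs)
    changed = True
    while changed:
        changed = False
        for (p, pos) in list(result):
            rhs = PRODS[p][1]
            if pos < len(rhs) and rhs[pos] in NONTERMINALS:
                for q, (lhs, _) in enumerate(PRODS):
                    if lhs == rhs[pos] and (q, 0) not in result:
                        result.add((q, 0)); changed = True
    return frozenset(result)

def goto_(configs, symbol):
    moved = {(p, pos + 1) for (p, pos) in configs
             if pos < len(PRODS[p][1]) and PRODS[p][1][pos] == symbol}
    return closure(moved) if moved else None

def build_states(start_config):
    states, worklist = [closure({start_config})], [0]
    while worklist:
        i = worklist.pop()
        for symbol in {s for (p, pos) in states[i] for s in PRODS[p][1]}:
            nxt = goto_(states[i], symbol)
            if nxt and nxt not in states:
                states.append(nxt); worklist.append(len(states) - 1)
    return states

for n, state in enumerate(build_states((0, 0)), start=1):
    print(n, sorted(state))
```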
A → ε.
Since the right part has no symbol, there is only one position in it, and it corresponds to a state.
and δα = γ, and in each case δAw really is the right-sentential form that yields γw. In this case, we may perform the bottom-up syntactical analysis of a symbol string x of L(G), which has a right-sentential form of G. When we carry out a series of shift-reduce operations, we may reach the start symbol of G without backtracking. In this way, we obtain the rightmost derivation of x.
More explicitly, as we saw, the grammar given above has 9 states. Apart from states 3, 4, 5, 8, and 9, which are respectively the final states of productions 1, 2, 3, 4, and 5, the remaining states are used only for shift operations, while 3, 4, 5, 8, and 9 are used for reductions. They are responsible for the reduction of the corresponding productions, and each reduction operation reduces the part of the input that matches the right part of the production to the nonterminal in the left part of the production. For example, states 4 and 5 reduce the right-part strings of productions 2 and 3 to the nonterminal S, while states 8 and 9 reduce the right-part strings of productions 4 and 5 to the nonterminal A.
Just because of this property of the grammar, there is no need to look ahead at any symbol; the analysis can be carried out deterministically, hence this is an LR(0) grammar.
In the description above, the states that are responsible for shift operations are not involved in reduce operations, and vice versa. This case is called shift-reduce conflict-free. Generally speaking, however, shift-reduce conflicts are likely to happen in common grammars. For example, suppose that we have
N) T →(N, 0) i(N, 1) ,
N + 1) T →(N+1, 0) i(N+1, 1) E(N+1, 2) .
Now there is a shift-reduce conflict with the state m + 1. This can be seen from the productions N) and N + 1). In the production N), the state m + 1 requires
1) S′ → S$;
2) S → T;
3) S → S + T;
4) T → (S);
5) T → i.
At first we have equivalent configuration group {(1, 0), (2, 0), (3, 0), (4, 0),
(5, 0)}, and we denote it 1. Then
S′ →1 S2 $3 ,
S →1,7 T4 ,
S →1,7 S2,8 +5 T6 ,
T →1,5,7 (7 S8 )9 ,
T →1,5,7 i10 .
In the table, the items starting with S denote shift actions while the items starting with R denote reduce actions: Si means shifting to state i, and Ri means reducing according to production number i. For example, R1 means the reduction according to production number one, i.e., the first production. Reduction means that the current string on the input is replaced by the nonterminal of the left part of the production. Actually, state 1 cannot encounter the symbol S′; however, we put the item halt in that place to indicate the successful finishing situation. The empty places in the table represent impossible situations. If, in some state, the input symbol met corresponds to an empty item in the table, that means an error has occurred and the analyzer should report failure.
For LR(0) syntactical analysis, as we have mentioned, the states in charge of shift operations and those in charge of reductions are separate, so the rows of different states have different characteristics: each row contains only one kind of operation, either shift or reduction. In this circumstance, of course, the analysis is simple.
We must point out, however, that only very few grammars are LR(0). For example, a grammar that contains a production like A → ε cannot be LR(0): if this production is contained in the production set, the state in A →n must be in conflict with the same state in P → αAβ. By the former production, no matter what symbol is coming, including β, a reduction operation should
T → i,
V → i,
and it happens that they both have the same state corresponding to two
configurations
T →n i,
V →n i.
That means that the initial configurations of the two productions have the same state. Consequently, the two configurations after i have the same state too, as f(n, i) has only one value. But how should the reduction be done? Should we reduce to T or to V? We are in conflict now, and this is a reduce-reduce conflict.
Shift-reduce conflicts and reduce-reduce conflicts are commonly seen cases. In the last section, we pointed out that the LR(0) method is too weak to be practical. This is because it does not look ahead at any input symbol; it makes its decision by checking the state only, and in practice it is very rare to have such a case. As an improvement of LR(0), the first step is the SLR(1) grammar. The S in SLR(1) means simple, so it is a simple LR(1) syntactical analysis. In this analysis, a relatively simple kind of shift-reduce conflict is allowed. When such a case occurs, it can be resolved by looking ahead one symbol. Concretely speaking, for a production with a nonterminal A as the left part and for some state in it, if the lookahead symbol does not belong to the follower symbols of A, then the reduce operation on the symbol (or symbol string) in front of the state cannot use this production; instead, a shift operation should be carried out with the lookahead symbol as the input.
We show an example of an SLR(1) grammar:
S → real IDLIST,
IDLIST → IDLIST, ID,
IDLIST → ID,
ID → A | B | C | D.
We omit the details of establishing the states and directly assign the states to the configurations. We get the productions with states attached to the corresponding
configurations:
S →1 real2 IDLIST3 ,
IDLIST →2 IDLIST3,4 ID5 ,
IDLIST →2 ID6 ,
ID →2,4 A | B | C | D7 .
With these productions with states attached, we may obtain the corresponding syntactical analysis table (see Table 7.4).
In this table, we may see that on the line of state 3 there are two different items. One is S4, which corresponds to the input ","; that means that in state 3, if the input is a comma, the state will shift to state 4. The other item corresponds to the input end mark ⊥; this time a reduce operation is performed, reducing the previous string to the left part of production 1, that is, the start symbol S. This is the first time that we see two different actions, shift and reduce, on the same line. This is a shift-reduce conflict. The conflict can be resolved easily by looking ahead one symbol.
Table 7.4 SLR(1) syntactical analysis table

State  S     IDLIST  ID   real  ,    A, B, C, D  ⊥
1      halt               S2
2            S3      S6               S7
3                               S4                R1
4                    S5               S7
5                         R2    R2    R2          R2
6                         R3    R3    R3          R3
7                         R4    R4    R4          R4
S →1 Shalt ⊥.
This is why we put halt at the intersection of column S and the row of state 1.
SLR(1) goes a step beyond LR(0): it may resolve a shift-reduce conflict by checking whether the lookahead symbol belongs to the followers of some nonterminal or not. However, the power of SLR(1) is limited. Therefore, we need to seek a more powerful method. Our solution is the LALR(1) syntactical analysis method. The "LA" here means lookahead; that is, it resolves the conflicts by looking ahead at input symbols more carefully.
1) S → T else F;
2) T → E;
3) T → i;
4) F → E;
5) E → E + i;
6) E → i.
f((1, 0), −) = 1 = {(2, 0), (3, 0), (5, 0), (6, 0)},
f(1, T) = {(1, 1)} = 2,
f(2, else) = 3 = {(4, 0), (5, 0), (6, 0)},
f(3, F) = 4 = {(1, 3)},
f(1, E) = 5 = {(2, 1), (5, 1)},
f(1, i) = 6 = {(6, 1), (3, 1)},
f(3, E) = 7 = {(4, 1), (5, 1)},
f(6, ; ) = 8 = {(3, 2)},
f(5, +) = f(7, +) = 9 = {(5, 2)},
f(9, i) = 10 = {(5, 3)},
f(4, ; ) = 11 = {(1, 4)},
f(5, ; ) = 12 = {(2, 2)}.
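The states listed above can also be computed mechanically. The sketch below is my own illustration, not the book's program: a configuration is a pair (production number, dot position), closure adds the initial configurations of every nonterminal that stands right after the dot, and goto plays the role of the transition function f. Production numbers in the code are 0-based, so the printed result [(2, 1), (5, 1)] corresponds to the book's f(1, i) = {(3, 1), (6, 1)}.

```python
PRODS = [("S", ["T", "else", "F", ";"]),   # production 1
         ("T", ["E", ";"]),                # production 2
         ("T", ["i", ";"]),                # production 3
         ("F", ["E"]),                     # production 4
         ("E", ["E", "+", "i"]),           # production 5
         ("E", ["i"])]                     # production 6
NONTERMINALS = {"S", "T", "F", "E"}

def closure(items):
    """Add the initial configurations of nonterminals that follow the dot."""
    items = set(items)
    changed = True
    while changed:
        changed = False
        for (p, d) in list(items):
            rhs = PRODS[p][1]
            if d < len(rhs) and rhs[d] in NONTERMINALS:
                for q, (lhs, _) in enumerate(PRODS):
                    if lhs == rhs[d] and (q, 0) not in items:
                        items.add((q, 0)); changed = True
    return frozenset(items)

def goto(items, symbol):
    """The transition function f: move the dot over symbol, then take the closure."""
    return closure({(p, d + 1) for (p, d) in items
                    if d < len(PRODS[p][1]) and PRODS[p][1][d] == symbol})

state1 = closure({(0, 0)})
print(sorted(goto(state1, "i")))   # [(2, 1), (5, 1)]
```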
1) S →1 T2 else3 F4 ; 11 ;
2) T →1 E5 ; 12 ;
3) T →1 i6 ; 8 ;
4) F →3 E7 ;
5) E →1, 3 E5,7 +9 i10 ;
6) E →1, 3 i6 .
Consequently, we may establish the analysis table as shown in Table 7.5:
Table 7.5 LALR(1) syntactical analysis table
State\Symbol    S    T    F    E    else    ;    +    i    ⊥
1 Halt S2 S5 S6
2 S3
3 S4 S7 S6
4 S11
5 R2 S9
6 R6 S8 /R6 R6
7 R4
8 R3
9 S10
10 R5 R5 R5
11 R1
12 R2
Notice that in the table above a shift-reduce conflict occurs on the row of state 6. The conflict is caused by the i on the right part of production 3 and the i on the right part of production 6. According to production 6, the state before i (1 or 3) arises in two different ways. If it is 1, coming from before the nonterminal T in production 1, then by production 3 the i followed by ";" should be shifted, taking the analysis to state 8. If, however, the state is 3, coming from the state after else in production 1, then the follower of i in production 6 is the follower of E; from production 4, the follower of E is the follower of F, hence it is ";". Therefore, by production 6, when the lookahead symbol that follows i is ";", the action that should be done is a reduce. This is where the conflict comes from.
Now, in order to solve the conflict, the past context should be taken into account. In productions 5 and 6, the state 1 comes from the state before T in the first production, while the state 3 comes from the configuration after "else". Therefore, if we assign a single state to the configuration following i, then a conflict must occur, and this is where the problem is. After we find the reason, we split the state of the configuration following i in production 6 into two states, 6 and 13, and obtain the following productions with states attached:
1) S →1 T2 else 3 F4 ; 11 ;
2) T →1 E5 ; 12 ;
3) T →1 i6 ; 8 ;
4) F →3 E7 ;
5) E →1, 3 E5,7 +9 i10 ;
6) E →1, 3 i6, 13 .
Now we have a new syntactical analysis table (see Table 7.6) in place of the one above, and this table solves the shift-reduce conflict.
This table can be used to analyze the LALR(1) grammar. With this method, if a shift-reduce conflict takes place, we try to distinguish the past context of a configuration, and a conflict of this kind usually can be resolved.
Table 7.6 LALR(1) syntactical analytical table that solves the conflict
State\Symbol    S    T    F    E    else    ;    +    i    ⊥
1 Halt S2 S5 S6
2 S3
3 S4 S7 S12
4 S11
5 R2 S9
6 R6 S8 R6
7 R4 S9
8 R3
9 S10
10 R5 R5 R5
11 R1
12 R2
13 R6 R6
We have mentioned that the LR(0) syntactical analysis method is suitable only for a grammar in which each state on its productions is either for a shift action or for a reduce action; it does not need to consider the lookahead symbol. Once the lookahead symbol needs to be considered, LR(0) is no longer suitable and we need a stronger method, and SLR(1) lends itself to this situation. In SLR(1), we allow a state to carry out either a shift action or a reduce action according to different lookahead symbols. This is reflected in the table by the fact that the same row (corresponding to a state) may have shift items in some columns and reduce items in others. The practicality of SLR(1), however, is also limited, because in many cases the situation is more complicated: even for the same row and the same column, depending on the past context, different treatments are required, either a shift action or a reduce action. Therefore, it is necessary to split a state into two states, one for the shift action and another for the reduce action; in some cases one state is even split into more states. In the example of the last section, we did just that, and this is the LALR(1) syntactical analysis method.
Now we must say again that the power of LALR(1) is still limited. For more complicated grammars, it fails to work. If that is the case, we have to try a stronger method, the general LR(1) syntactical analysis method.
The LR(1) syntactical analysis method differs from LR(0), SLR(1), and LALR(1) in that it has a much larger number of states, while the other three have basically the same number of states and their syntactical analysis tables are almost the same size. For LALR(1), the number of states is slightly larger than for SLR(1), as some states of SLR(1) are split into two. The number of states of LR(1), however, increases remarkably, since a configuration may become a different state depending on the input, hence the number of states is much larger than the number of configurations.
In the following, while introducing the general LR(1) syntactical analysis method, we will also introduce another version of the analytical method. This method makes the state transitions work like an automaton. Consider the grammar with the following productions (see Example 7.6).
Example 7.6 A grammar is given via the following productions:
1) S → A;
2) S → xb;
3) A → aAb;
4) A → B;
5) B → x.
At first we add a production S' → S so that the determination of the states is consistent with our original practice. This is called the incremental grammar of the original grammar. Now we have:
1) S' → S{⊥};
2) S → A{⊥};
3) S → xb{⊥};
4) A → aAb{⊥};
5) A → B{⊥};
6) B → x{⊥}.
Here the symbol inside { } is the follower symbol; it is the end mark following the input string. A configuration is indicated as before, similar to the (m, n) notation we used earlier. From this point, as different inputs come, we will have different states, and what we get is similar to the previous automaton.
We establish all the states first and it is shown as follows:
S' →0 S1 ,
S →0 A2 ,
S →0 x3 b4 ,
A →0, 5 a13, 5 A7, 8 b9, 10 ,
A →0, 5 B6, 11 ,
B →0, 5 x3, 12 .
We can see that all the cores of the LR(1) states correspond to the states of SLR(1). The reason is that the cores are determined by the symbols which other states allow to be shifted in. Therefore, if we do not consider the lookahead symbols, a core is an LR(0) state. When a state is transformed, a new LR(1) state is generated while its core is still an LR(0) state. Therefore, LR(1) states are the result of splitting LR(0) states.
The power of LR(1) comes from this splitting of states. Depending on the split alone, problems which SLR(1) or LALR(1) cannot solve can be solved satisfactorily by the LR(1) method. Of course, not all problems are solvable by LR(1).
However, not every such split is necessary. Comparing with Fig. 7.5 we can see that state 6 and state 2 can be combined into a new single state S2,6, as each of them consists of only one configuration. Through
further analysis, we can find that more states can be combined. After such a
combination we get the LALR(1) automaton as shown in Fig. 7.6.
Fig. 7.6 LALR (1) automaton obtained from combination of states of LR (1)
automaton of Fig. 7.5.
In Figs. 7.5 and 7.6, the states located at the rightmost position of a production are for reduce actions, while the other states are for shift actions. The symbols beside the arrows outside the productions are the input symbols.
From the introduction above we know that the number of states for LR(1) is much larger than that of LALR(1), though this is not obvious in the example above. For practical programming languages, the number of states used by LR(1) syntactical analysis is several orders of magnitude higher than that used by the corresponding LALR(1) analysis. Here are the statistics given in the references: an SLR(1) syntactical analysis table of a programming language after compression takes a number of kilobytes of memory on average, while an LR(1) table needs a number of megabytes; when constructing such a table, the memory needed is probably several tens of times that amount. Fortes Galvez implemented a different realization of LR(1) in 1992 that slightly reduced the size of the LR(1) syntactical analysis table. On the other hand, most programming languages just need the LALR(1) syntactical analysis table, so we do not need to worry about the size of the LR(1) table, as we rarely use it.
After considering LR(1) we naturally want to know about LR(k) syntactical analysis for k ≥ 2: is it more powerful than LR(1)? [7] Studies affirm that LR(k) (k ≥ 2) syntactical analysis is indeed slightly more powerful than LR(1), but at the expense of a much bigger analysis table. People originally hoped that when a grammar is not LR(1), it might be analyzed via LR(2). However, it turned out that the probability of such a grammar being LR(2) is very low. The conclusion is no doubt depressing, as it is not like the earlier situation where, when LR(0) cannot solve the problem, we use SLR(1) instead, or further use LALR(1) before we use LR(1). When LR(1) fails to work, LR(2) probably does not work either. Therefore, LR(2) has some theoretical significance, but it has never been used in practice so far.
We now illustrate the practical procedure of syntactical analysis via LALR(1). We consider the grammar above:
S' → S,
S → A,
S → xb,
A → aAb,
A → B,
B → x,
and suppose that the input string is aaaaxbbbb. At first, we construct the
productions with states attached. They are as follows:
S' →0 S1 ,
S →0 A2 ,
S →0 x3 b4 .
A →0, 5 a13, 5 A8, 7 b10, 9 ,
A →0, 5 B6, 11 ,
B →0, 5 x3, 12 .
With these productions and states, we may construct the syntactical analysis
table as shown in Table 7.7.
Table 7.7 LR(1) syntactical analysis table
State\Symbol    S'    S    A    B    x    a    b    ⊥
0 Halt S1 S2 S6 S3
1 R1
2 R2
3 S4 R6
4 R3
5 S7 S11 S12
6 R5
7 S9
8 S10
9 R4
10 R4
11 R5
12 R6
13 S8
Before we really start the practical analysis of the input string, we should notice that the analysis needs two stacks, one for symbols and another for states. Besides, the handling of terminals and nonterminals is different: for terminals, we simply put them onto the symbol stack and may or may not change the state, while for nonterminals, when we put one into the symbol stack, we sometimes change the symbol as well as the state. Therefore, some books differentiate the two kinds of handling as "goto" and "action".
Initially, the two stacks are empty, but in the state stack we put 0 to indicate the empty situation, while the input symbol stack is kept empty.
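The analysis loop itself can be sketched as follows; this is my own rough illustration, not the book's program. ACTION and GOTO are assumed to be dictionaries read off a table such as Table 7.7: ACTION maps (state, terminal) to ("shift", state) or ("reduce", production number), and GOTO maps (state, nonterminal) to a state, or to "halt" for the entry of state 0 and S'.

```python
def lr_parse(tokens, ACTION, GOTO, productions):
    symbols, states = [], [0]            # symbol stack and state stack; 0 marks "empty"
    tokens = tokens + ["$"]              # "$" plays the role of the end mark
    pos = 0
    while True:
        kind, arg = ACTION[(states[-1], tokens[pos])]   # a missing entry is a syntax error
        if kind == "shift":              # push the terminal and the new state
            symbols.append(tokens[pos]); states.append(arg); pos += 1
        else:                            # reduce by production  A -> alpha
            left, right = productions[arg]
            if right:                    # pop |alpha| symbols and states
                del symbols[-len(right):]; del states[-len(right):]
            nxt = GOTO[(states[-1], left)]
            if nxt == "halt":            # e.g. state 0 meeting S' in Table 7.7
                return True              # the input is a sentence of the grammar
            symbols.append(left); states.append(nxt)
```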
Input Symbol Stack State Stack
a 0
a 5
a 0
Until all four a's have been put into the stack, the state stack does not change.
Input Symbol Stack State Stack
a 5
a 0
Then x enters the symbol stack; according to the analysis table, the state 5 changes to the state 12.
x 12
a 0
The state 12 is for reduce; it reduces x to B, and the state returns to 5.
Now in the symbol stack, from top to bottom, we have Baaaa, and in the state stack, also from top to bottom, we have 5 0. When the state 5 meets the symbol B, it shifts to the state 11. Therefore, we have the following situation.
Input Symbol Stack State Stack
B 11
a 0
Under the state 11, B is reduced to A; the state 11 is removed, and we have 5 and 0 in the state stack.
Input Symbol Stack State Stack
A 5
a 0
When the state 5 meets A, it shifts to the state 7, so the situation changes again as follows.
A 7
a 0
Under the state 7, b comes and enters the symbol stack; the state 7 shifts to the state 9 and the situation becomes as follows.
Input Symbol Stack State Stack
b 9
A 0
Now in the symbol stack, from top to bottom, what we see is bAa; this is nothing but the right part of the production with A as the left part. Since the state is 9, it reduces the three symbols to A. The situation becomes as follows.
Input Symbol Stack State Stack
A 5
a 0
The state 5 meets A and shifts to 7. Hence we have the following situation.
A 7
a 0
As the second b enters the symbol stack, once again we have bAa from top to bottom and the state changes to 9. It reduces the three symbols to A. The situation repeats twice more until the end.
Input Symbol Stack State Stack
b 9
A 0
This time the three symbols bAa will be reduced to A and in the state stack
only the state 0 remains.
Input Symbol Stack State Stack
A 0
However, according to the analysis table, when the rest of the symbol stack is empty the state below is 0 rather than 5, and when the state 0 meets the symbol A it shifts to the state 2. The state 2 is for reduce: it reduces A to S and the state becomes 0 again.
Input Symbol Stack State Stack
S 0
S 1
0
Now the state 1 reduces S to S' and the state becomes 0. So finally we have the following situation.
Input Symbol Stack State Stack
S 0
This declares the success of the analysis. It also means that the input string is a sentence of the grammar.
orb stands for an open round bracket, EC stands for an enquiry clause, SC stands for a sequential clause, crb stands for a closed round bracket, semi stands for a semicolon, u stands for a unit, and PLU stands for a possibly labeled unit.
The problem comes from the fact that the unit in the conditional clause cannot have a label, while the unit in the sequential clause of the closed clause may have one. This language is LL(1) if the rules of the closed clause and the conditional clause are combined together (the two productions of SC must be transformed, otherwise their DS's would intersect). The grammar obtained after the transformation is denoted with different notation, and we have grammar G2 as follows:
S→orb T,
T→lab V|u W,
V→lab V|u X,
W→crb|semi T|stick SC stick SC crb,
X→ semi V|crb,
SC→PLU Y,
Y→semi PLU Y|z,
PLU→lab PLU|u.
Problems
S → Ai bi , 1 ≤ i ≤ n,
Ai → aj Ai | aj , 1 ≤ i, j ≤ n but j ≠ i.
References
[1] Grune D, Jacobs CJH (1990) Parsing Techniques: A Practical Guide. Ellis Horwood, New York.
[2] Sippu S, Soisalon-Soininen E (1989/1990) Parsing Theory, vol. II: LL(k) Parsing and LR(k) Parsing. Springer, Berlin.
[3] Hunter R (1988) Compilers: Their Design and Construction Using Pascal. Cambridge University Press, Cambridge.
[4] Aho AV, Ullman JD (1973) The Theory of Parsing, Translation, and Compiling, vol. II: Compiling. Prentice-Hall, Englewood Cliffs, New Jersey.
[5] Chen Huowang, Qian Jiahua, Sun Yongqiang (1980) Compiler Principles. Defense Industry Press, Beijing.
[6] Jin Chengzhi (1981) Compilers: The Principles and Implementation. Higher Education Press, Beijing.
[7] Aho AV, Ullman JD (1972) The Theory of Parsing, Translation, and Compiling, vol. I: Parsing. Prentice-Hall, Englewood Cliffs, New Jersey.
[8] Grune D, Bal HE, Jacobs CJH (2007) Modern Compiler Design. Addison-Wesley, Reading, MA.
[9] Chapman NP (1987) LR Parsing: Theory and Practice. Cambridge University Press, Cambridge.
Chapter 8 Attribute Grammars and Analysis
The attribute grammar has been used in the definition of the syntax of many languages. To translate programs written in any programming language, a compiler may need to keep track of many quantities besides the code generated for the program. For example, the compiler may need to know the type of an identifier, the location of the first instruction in the target code, or the number of instructions generated. Therefore we talk abstractly about attributes associated with the constituents of the program. By attribute, we mean any quantity, e.g., a type, a string, a memory location, or anything else. However, in the context of this chapter, we mainly mean the attributes that the context-free grammar cannot describe. Since an attribute grammar is constructed on top of a context-free grammar, it can handle the computation required by the context processing and supplement the syntax analysis of the context-free grammar. As an illustration, we use a context-free grammar to define part of the Pascal language. Then we extend the grammar, defining the non-context-free aspects of the language using the attribute grammar (the definition of this part of Pascal is given here according to the work of Watt (1977) and McGettrick (1980)) [3].
At first we point out which attributes are not context-free. Suppose, for example, that we construct the parsing tree of an input sentence via the top-down method. In the parsing tree, the leaves represent the terminals, but they are just the terminals only; they do not contain other characteristics of the terminals, i.e., the attributes. As a leaf of the abstract syntax tree (AST), a terminal may have its initial attribute, e.g., an identifier that represents an integer has the type attribute "integer", but it has no value yet. Its value will be provided by the input of the program, and this is not defined in the context-free part of the grammar. A token that represents an identifier has its initial attribute value "identifier", but does not yet have the address value and the numerical value stored at that location. They are also provided by later input and
not defined by the context-free grammar; all of this will be done by the attribute grammar.
The lexical analysis and the parsing work together to complete the analysis of the context-free part of source programs. Such analytic features may be local or embedded. Other cases, for example checking the number of the formal parameters at the entry of a routine to see if it is consistent with the number which the declaration stipulates, are not context-free.
In order to check the imperative context conditions and to collect the information of a specific language needed to handle the semantics, we need to handle the context. In some sense, the attribute grammar is a supplement to the context-free grammar that directs its attention at semantic analysis. In a compiler that purely deals with the compilation of programs, the context handling is divided into two phases. First, it checks all the context relations in the language; only when the check is passed can the input be regarded as correct. Second, it collects other information, called the attributes; this information is stored in the nodes of the abstract syntax tree. The context handling is done by checking all the context relations and evaluating all the attributes of the nodes.
Therefore, in simple words, the computation required by the context processing may be described along with the syntactical analysis of the context-free grammar, and this gives rise to the attribute grammar. In order to meet this need, the context-free grammar is extended in two directions: one is the data, and the other is the evaluation.
Each grammar symbol, no matter whether it is a terminal or a nonterminal, is stipulated to have zero or more attributes. Each attribute has its name and type; this is called a formal attribute. A formal attribute will be realized as a real attribute, a specific value consistent with the formal type. The attributes are used for keeping the semantic information of specific nodes. Therefore, all the nodes that correspond to the same grammar symbol S have the same formal attributes, but their real values, the real attributes, are not the same.
For every production rule like A → M1 M2 . . . Mn, there is a series of related computation rules, the attribute computations. They stipulate how to compute the attributes of A according to the attributes of the members Mi (1 ≤ i ≤ n) on the right part. These computation rules check the context conditions and issue warnings and error messages in case some sort of error occurs. They are related to production rules rather than associated with nonterminals, because a computation rule refers to the attributes of the members Mi, and the members Mi are determined by the production rules.
The attributes of every grammar symbol are categorized as synthetic attributes and inherited attributes. If a variable occurs two or more times in a production, then each occurrence has the same attribute values. In this way information may be transmitted from the start symbol down to the sentence or program generated. An attribute used in this way is called an inherited
attribute. Of course, it may also work in the reverse way, i.e., the attribute value is transmitted from where it is obtained in the sentence up to the start symbol; attributes of this kind are called synthetic. At the beginning, only terminal symbols have synthetic attributes, whose values are directly obtained from the program text. The synthetic attributes of a child node may be accessed by the computation rules of the parent node, and further computation is allowed to take place at the parent node. Notice that a computation can only be carried out when all the values it depends on have been determined. The result of the computation at the parent node not only becomes the synthetic attribute of that node but may also become the inherited attributes of its children.
Now we add the attributes to the productions of the context-free gram-
mar [4], where the upwards arrow denotes the synthetic attribute while the
downwards arrow denotes the inherited attribute.
<PROGRAM>::=program <name>↑NAME
(<PROGRAM PARAMETERS>)
<BLOCK>↓STANDARDENV↓{ }↓{ }
The uppercase letters following the arrows represent the attribute variables. From the representation, we see that NAME is a synthetic attribute whose value is obtained from the lexical analysis, while STANDARDENV is an inherited attribute whose value is obtained from the set of standard identifiers and their meanings.
There are two more inherited attributes that belong to the block, namely any formal parameters and any global labels; they are empty at the moment. We now supplement the rule for the block as follows:
<BLOCK>↓GLOB↓FORM↓GLOBLAB ::=<LABEL DECLARATION>↑LOCLAB
<CONSTANT DEFINITION>↓GLOB↓FORM↑NEWLOC1
<TYPE DEFINITION> ↓GLOB↓NEWLOC1↑NEWLOC2
<VARIABLE DECLARATION>↓GLOB↓NEWLOC2↑NEWLOC3
<PROCEDURE and FUNCTION DECLARATION>↓GLOB↓NEWLOC3↑LABELS↑
NEWLOC
<STATEMENT PART>↓ENV↓LABELS↑STMLAB
Now we elaborate on these attributes. GLOB, FORM, and GLOBLAB represent, respectively, global variables, formal parameters, and global labels; these are inherited attributes. The local labels that belong to <LABEL DECLARATION> obtain their values from the program, and they are synthetic attributes. The global properties and formal parameters of <CONSTANT DEFINITION> are inherited attributes, while its NEWLOC1 represents the new local constants, whose values are obtained from the program, so they are also synthetic attributes. As we mentioned above, synthetic attributes not only can be the synthetic attributes of the parent node, but can also become the inherited attributes of its children nodes. This is shown in our explanation above. NEWLOC1 originally was a synthetic
In the last section, we mentioned that for every production rule like A → M1 M2 . . . Mn there is a series of evaluation rules (i.e., attribute evaluation rules) for evaluating every attribute. This is intimately related to the dependency graph, which is used to describe the evaluation rules; therefore, we need to define the dependency graph first.
Definition 8.1 Dependency graph. In the parsing tree that corresponds to production rule A → M1 M2 . . . Mn, if the attribute b of a node depends on attribute c, then at that node the semantic rule evaluation for b must be carried
out after the evaluation for c. In a parsing tree, the interdependency relation between synthetic attributes and inherited attributes may be described via a directed graph. This directed graph is called the dependency graph.
Fig. 8.1 shows a simple and yet practical attribute grammar rule. It presents the constant definition declaration via the nonterminals Defined-identifier and Expression.
There are two more points that need explanation about Fig. 8.1. First, in the function Update symbol table, the nonterminal Defined-identifier, not merely the identifier, is used. This is because the two are significantly different. The occurrence of an identifier definition, i.e., the bare identifier, presents only one piece of information, namely its name, while the occurrence of identifier application, i.e., the Defined-identifier, presents much other information besides its name, such as range information, type, category (constant, variable, parameter, segment, selector, etc.), value, allocation information, etc. Secondly, in the function Update symbol table, Checked type of constant definition (Expression.type) is used, not merely Expression.type. This is because the context check of the constant type requires calling a function rather than directly using the value, and the check is necessary. If the check succeeds, it returns the original Expression type, which is also what we need; otherwise it issues error information, and the routine returns a special value Erroneous-type.
Besides, we also need to distinguish between the data stream and the dependency relation. In Fig. 8.2, the arrows represent the data stream rather than the dependency relation. If the data stream flows from variable a to variable b, then b depends on a. The data dependency is sometimes denoted by pairs; for example, (a, b) implies that b depends on a. It also implies that "data flow from a to b", or that "a is a precondition of b". Simply speaking, the attribute dependency graph actually contains the arrow heads of the data stream.
Now we add the secondary attributes of Expression to constant definition
of Fig. 8.1, as shown in Fig. 8.3. In this way, we create the complete data
flow for the constant definition.
If CONST Pi = 3.141 592 65 is added, then the result is shown in Fig. 8.4. Usually, the semantics of an expression depends on the contents of the symbol table, and the symbol table is provided in advance; therefore, we say that the symbol table is an inherited attribute of Expression. The semantics of a number, however, is independent of the symbol table, which is the reason why there is no arrow from the number to the symbol table.
From Fig. 8.5 it may be seen that the data stream flows from the inherited attribute of A to the synthetic attribute of A, and the same applies to nodes B, C, and D. Hence, under their respective evaluation rules, the data stream flows from each inherited attribute to the corresponding synthetic attribute. Meanwhile, the evaluation rules of A also cause the data stream to flow from the synthetic attribute to the inherited attribute of B, and the same holds for C and D. The same rule also works for node A itself; therefore, the data stream also flows from the synthetic attribute of A to its inherited attribute, although this stream is not shown in the figure. Generally speaking, an inherited attribute may be regarded as an input parameter while a synthetic attribute may be regarded as an output parameter. There is a time order between input and output: in general, input should precede output, but there are exceptions, and some synthetic attributes may acquire values before inherited attributes [8].
The following is the general method for the attribute evaluation:
1) Create the corresponding abstract syntax tree.
2) Construct the attribute dependency graph.
3) Allocate space for the attributes of each node of the tree.
4) Fill the attributes of the terminals of the tree with the values acquired from the representation of the terminals.
5) Topologically sort the nodes of the dependency graph. Then, following this order, execute the evaluation rules to assign values to the attributes until no more new values can be assigned. Make sure that an attribute value is used only after it has been determined, and that each attribute obtains only one value each time (a small sketch of this step follows the list).
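As a concrete illustration of steps 2) to 5), here is a minimal Python sketch of my own; the attribute names and the tiny constant-definition example stand in for the nodes of Fig. 8.4 and are hypothetical, not the book's data structures.

```python
from graphlib import TopologicalSorter

# Step 2): dependency graph, each attribute mapped to the attributes it depends on.
deps = {
    "Expression.type": set(),
    "Expression.value": set(),
    "Defined_identifier.name": set(),
    "Constant_definition.symbol_table": {
        "Defined_identifier.name", "Expression.type", "Expression.value"},
}

# Step 4): the attributes of the terminals get their values from the program text.
values = {"Expression.type": "real",
          "Expression.value": 3.14159265,
          "Defined_identifier.name": "Pi"}

# Evaluation rules for the remaining attributes.
rules = {
    "Constant_definition.symbol_table":
        lambda v: {v["Defined_identifier.name"]:
                   (v["Expression.type"], v["Expression.value"])},
}

# Step 5): evaluate the attributes in topological order of the dependency graph.
for attr in TopologicalSorter(deps).static_order():
    if attr not in values:
        values[attr] = rules[attr](values)

print(values["Constant_definition.symbol_table"])
# {'Pi': ('real', 3.14159265)}
```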
For the attribute syntax tree of Fig. 8.4, we may perform the evaluation
according to the evaluation method specified above. The order of the attribute
evaluation may be determined according to the direction of the data stream.
The attribute syntax tree obtained after the attribute evaluation is shown in
Fig. 8.6.
Fig. 8.6 Attribute syntax tree of Fig. 8.4 obtained after the attribute evaluation.
The simple attribute evaluation method only allows the value assignments
in the following form:
attribute1 =func1 (attribute1,1 , attribute1,2 ,...)
attribute2 =func2 (attribute2,1 , attribute2,2 ,...)
......
More complex attribute evaluation allows some features of a practical programming language to be used in the rule part, for example the statements if, while, case, etc., and local variables, which are called local attributes.
A simple and most general method to realize attribute evaluation is just to realize the data stream machine. The method is: visit all the nodes of the data stream graph and perform all possible assignments at each node; repeat this procedure until all the synthetic attributes of the root have obtained values. An assignment can be carried out only when all the attributes it needs already have values. This method is called dynamic attribute evaluation, because the order on which the evaluation depends is determined at run time of the compiler.
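A minimal sketch of such a data stream machine (my own illustration; rules maps each attribute to the attributes it depends on together with an evaluation function):

```python
def evaluate_dynamically(rules, values, root_attrs):
    """Repeatedly sweep the nodes and perform every assignment whose operands
    are already available, until the root's synthetic attributes have values."""
    changed = True
    while changed and not all(a in values for a in root_attrs):
        changed = False
        for attr, (deps, func) in rules.items():
            if attr not in values and all(d in values for d in deps):
                values[attr] = func(*(values[d] for d in deps))
                changed = True
    return values
```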
The role of the attribute grammar is that it can transmit information from any place of the parsing tree to other places in a controllable way. In order to show the attribute evaluation method, we illustrate it via a simple attribute grammar; it actually uses dynamic attribute evaluation. For example, it may be used to compute the code of letters in ASCII (American Standard Code for Information Interchange) or in EBCDIC (Extended Binary-Coded Decimal Interchange Code).
Continued
Decimal code Character Decimal code Character Decimal code Character
111 o 117 u 123 {
112 p 118 v 124 |
113 q 119 w 125 }
114 r 120 x 126 ∼
115 s 121 y 127 DEL
116 t 122 z
For EBCDIC, we do not list the specific codes in detail, but we show its format. EBCDIC consists of eight bits too, and the eight bits are divided into two parts: the first four bits constitute the zone, and the last four bits constitute the digit. Together, the zone and the digit constitute the code of a character in EBCDIC.
8 4 2 1 8 4 2 1
0 0 1 1 0 1 0 1
zone digit
The digits shown on the top represent the weight of each bit. Fig. 8.7 below
presents the grammar of ASCII and EBCDIC.
The grammar defines the syntax of ASCII and EBCDIC codes as well as the attributes of the elements. In the last production, A stands for ASCII while E stands for EBCDIC. If the Base-Tag A follows the series of digits, then it is the code of a character in ASCII; if E follows the series of digits, then it is the code of a character in EBCDIC. The key point here is that the evaluation of the character code depends on the Base-Tag (A or E). But for the sake of simplicity, we omit the details and just use the real Digit-Seq value.
Fig. 8.8 shows more concrete attribute grammar of ASCII and EBCDIC.
From Fig. 8.8 it is easy to draw the dependency graph of Code, Digit-
Seq, Digit and Base-Tag. But we omit it as it mainly involves the issues of
implementations rather than the principles.
In order to implement the data stream machine by the method specified above, we must visit all the nodes of the dependency graph. When visiting these nodes, one should avoid infinite loops. There is a simple way to avoid loops: link these nodes to the parsing tree and visit them through it, since the parsing tree has no loops. Then recursively traversing all the nodes in the parsing tree automatically visits all the nodes of the dependency graph. At every node, we complete as many assignments as possible according to the evaluation rules, then traverse the children nodes, and attempt the assignments again according to the rules when returning from them. The assignments before the traversal transmit inherited attributes downwards, while the assignments after the traversal acquire synthetic attributes and transmit them upwards.
Since the attribute evaluation, starting from some node, traverses the children of that node, it may eventually return to the same node; in this case a loop occurs. If the loop continues infinitely, the evaluation never terminates. Therefore, we must prevent loops from happening, and in order to do so it is necessary to detect the existence of a loop. This may be done through either dynamic loop detection or static loop detection.
In dynamic loop detection, a loop is detected during the attribute evaluation of a practical parsing tree when the loop exists in that tree. Static loop detection deduces from the attribute grammar itself whether a parsing tree can contain a loop; what it checks is all the parsing trees the grammar generates. Therefore, if dynamic loop detection does not find any loop in a specific parsing tree, we may conclude only that this specific parsing tree has no loops, whereas if static loop detection finds no loop in an attribute grammar, we may conclude that all the parsing trees which the grammar generates have no loops. Therefore, static loop detection is more valuable than dynamic loop detection, but also more difficult.
Now we further analyze these two detection methods.
For dynamic loop detection, there is a slightly rough method that counts the number of evaluation rounds. If the parsing tree has m attributes but we find that the evaluation runs for more than m rounds, then we can confirm that the parsing tree contains a loop, because if the parsing tree has no loops, each round evaluates at least one attribute; therefore, after at most m rounds all the evaluation should have finished. If it does not stop, the tree must contain a loop. A small sketch of this check is given below.
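The sketch reuses the repeated-pass evaluator idea shown earlier; the helper names are my own. If, after as many rounds as there are attributes, some attribute is still unassigned, a loop is reported.

```python
def evaluate_or_detect_loop(rules, values):
    m = len(rules) + len(values)          # total number of attributes
    for _ in range(m):                    # without loops, m rounds always suffice
        for attr, (deps, func) in rules.items():
            if attr not in values and all(d in values for d in deps):
                values[attr] = func(*(values[d] for d in deps))
    if any(attr not in values for attr in rules):
        raise RuntimeError("dependency loop detected in the parsing tree")
    return values
```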
For static loop detection, we need to seek the reasons for the existence of a loop in the production rules. Obviously, a loop cannot be generated within the dependency graph of a single production rule R, because the attribute evaluation rules assign values to one attribute set (the inherited attributes of R's children nodes and the synthetic attributes of R) while using another attribute set (the synthetic attributes of R's children nodes and the inherited attributes of R). Since the two sets are disjoint, they have no common elements, so they cannot form a loop. If there exists an attribute dependency loop in a parsing tree, then the data stream must leave the original node, traverse some part of the tree, and then come back to the node. The process may move in a roundabout way until it returns to the original attribute node. For example, it departs from an inherited attribute of a node N, goes down into the tree below N, at the bottom travels to one subtree twice and to another subtree once, then goes up to a synthetic attribute of N, then continues into the rest of the tree, where it passes through the left brother node of N, and then passes through the right
The research on attribute grammars is mainly motivated by the need for specifying the non-context-free features of programming languages. It involves evaluation rules. The implementation of these evaluation rules is also an object of our research, because some special features of the evaluation rules are likely to bring conveniences or advantages. In this section, we will discuss L attributes and S attributes, which are exactly such features.
Definition 8.2 A class of syntax-directed definitions is called L attributed definitions if their attributes can always be evaluated in depth-first search order. L here stands for left, because attribute information appears to flow from left to right. Therefore, the attribute evaluation may be done while traversing the parsing tree from left to right [10].
A syntax-directed definition is L attributed if each inherited attribute of Xj, 1 ≤ j ≤ n, on the right side of A → X1 X2 . . . Xn depends only on:
1) the attributes of the symbols X1, X2, . . . , Xj−1 to the left of Xj in the production;
2) the inherited attributes of A.
Hence the characteristic of L attributes is that the inherited attribute of a subnode of a nonterminal A depends only on the synthetic attributes of the subnodes to its left in the production and on the inherited attributes of A itself. That means that the data dependency graph of any production has no data stream arrow from a subnode to itself or to a subnode on its left.
Many programming languages are L attributed, as their intrinsic left-to-right data flow helps programmers read and understand programs. For example, in the dependency graph of Constant-definition there is no data stream from Expression to Defined-identifier (from right to left). The example of ASCII and EBCDIC, however, is not an L attribute grammar, as it contains a data stream from right to left.
Fig. 8.9 shows part of the analytical tree of an L attribute grammar.
Fig. 8.9 The data stream in part of analytical tree of an L attribute grammar.
In the figure above, every node has two boxes, one on each side: the left one represents the inherited attribute while the right one represents the synthetic attribute, and the name of the node is in between. The node A has five subnodes B1, B2, B3, B4, and B5, while C1, C2, C3, and C4 are subnodes of B3. The upward arrows represent the synthetic attribute data streams of the subnodes; they all point to the right or to the synthetic attributes of the parent node. When the attribute evaluation starts working on a node, all the inherited attributes of the node have already been acquired, and these attributes may be transferred to any of its subnodes that need them. In the figure, they are shown via dotted lines with arrows.
Suppose that the attribute evaluation is working on the node C3; there are only two attribute sets that participate in the process:
• All the attributes of the nodes on the path from the top to the node being processed. These are C3, B3, and A.
• The synthetic attributes of the left sibling nodes of those nodes. These are C1, C2, B1, and B2.
The right siblings of C3, B3, and A do not participate, as their synthetic attributes play no role here.
One thing is hidden in the figure: the inherited attributes remain in the nodes where they belong. Their values are transferred along the path from the top to the node being processed (for example, in the constant definition described before). This structure is exactly what a top-down analysis provides.
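As an illustration of how naturally this structure arises in top-down analysis, here is a small recursive-descent sketch of my own for a hypothetical declaration grammar D → T L, T → int | real, L → id {, id}: T synthesizes a type and L inherits it.

```python
# Recursive-descent parsing with L-attributed evaluation (hypothetical grammar).
def parse_D(tokens, symtab):
    t = parse_T(tokens)            # synthetic attribute T.type
    parse_L(tokens, t)             # passed down as the inherited attribute of L

def parse_T(tokens):
    return tokens.pop(0)           # 'int' or 'real'

def parse_L(tokens, inh_type):
    symtab[tokens.pop(0)] = inh_type            # first identifier
    while tokens and tokens[0] == ',':
        tokens.pop(0)
        symtab[tokens.pop(0)] = inh_type        # further identifiers

symtab = {}
parse_D(['real', 'x', ',', 'y'], symtab)
print(symtab)                      # {'x': 'real', 'y': 'real'}
```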
The attribute evaluation of an L attribute grammar may be conveniently embedded in the top-down analysis, although applying it in bottom-up analysis entails some tricks to complete the evaluation. The key problem is that the inherited attributes must be transferred from a parent node to its subnodes. In the bottom-up analysis, however, only when all the subnodes have been processed can anything on the parent node be defined and created; therefore, when an inherited attribute is needed, there is not yet any place from which to transfer it down.
The bottom-up analysis program, however, has a stack used to shift in terminals and to reduce nonterminals. We establish a correspondence between this stack and an attribute stack: the attribute stack keeps the attributes of the stack elements in the same order as the elements. In this way, it will be all right
A → B {C.inh-attr := f(B.syn-attr);} C
where the part in braces is the action we need to execute, the assignment of the inherited attribute of C. In order to do so, we introduce an ε-production:
A→B A-actional C
A-actional → ε{C.inh-attr:=f(B.syn-attr);}
Fig. 8.10 shows the attribute value streams of the abstract syntax tree to summarize the L attribute grammars and the S attribute grammars.
From the figure, we can see that in L attribute grammars the attribute values flow down along one branch and then flow up again along the next branch, whereas in S attribute grammars the attribute values flow in only one direction, from bottom to top.
We now finish the discussion of attribute grammars. They are a useful supplement to context-free grammars; especially when we handle the non-context-free features, we obviously need them to help the compilation process.
Problems
Problem 8.1 Let synthesized attribute val give the value of the binary
number generated by S in the following grammar. For example, on input
110.011, S.val = 6.375
S→L.L|L,
L→LB|B,
B→0|1.
1) Inherited attribute.
2) Synthesized attribute.
3) Attribute evaluation rule.
4) Dependence graph.
5) IS-SI graph.
6) The nodes of abstract parser tree.
7) The subnode pointers of the abstract parser tree.
Problem 8.4 Consider the following attribute grammar, construct the IS-
SI graph of A, and point out the loop contained in the grammar.
S(SYN S)→
A(i1 ,s1 )
ATTRIBUTE RULES:
SET i1 TO s1 ;
SET s TO s1 ;
A(INH i1 , SYN s1 ) →
A(i2 ,s2 ) ‘a’
ATTRIBUTE RULES:
SET i2 TO s1 ;
SET s1 TO s2 ;
|
B(i2 ,s2 )
ATTRIBUTE RULES:
SET i2 TO s1 ;
SET s1 TO s2 ;
B(INH I, SYN s)→
‘b’
ATTRIBUTE RULES: SET s TO c;
References
[1] Irons ET (1961) A syntax directed compiler for Algol 60. Comm ACM, 4(1):
51 – 55.
[2] Knuth DE (1968) Semantics of context-free languages. Mathematical Systems Theory 2(2): 127 – 145. Errata 5(1): 95 – 96.
[3] Reps TW (1984) Generating Language-Based Environments. MIT Press,
Cambridge.
[4] Lewis PM, Rosenkrantz DJ, Stearns RE (1974) Attributed translations. J Computer and System Sciences, 9(3): 279 – 307.
[5] Mayoh BH (1981) Attribute grammars and mathematical semantics. SIAM J Computing 10(3): 503 – 518.
[6] Kennedy K, Ramanathan J (1979) A deterministic attribute grammar eval-
uator based on dynamic sequencing. TOPLAS 1(1): 142 – 160.
[7] Engelfriet J (1984) Attribute evaluation methods. In: Lorho B (ed), pp 103 – 138.
[8] Cohen R, Harry E (1979) Automatic generation of near-optimal linear-time
translators for non-circular attribute grammars. Sixth ACM Symposium on
Principles of Programming Languages, pp 121 – 134.
[9] Kastens U (1980) Ordered attribute grammars. Acta Informatica, 13(3):
229 – 256.
[10] Bochmann GV (1976) Semantics evaluation from left to right. Comm ACM,
19(2): 55 – 62.
Chapter 9 Algebraic Method of Compiler
Design
C.A.R. Hoare
G. Polya
This chapter is independent of the last several chapters. It introduces a brand new method for the design of compilers for procedure-oriented programming languages. The method is based on the fact that these languages satisfy algebraic laws. The practical strategy is to reduce any source program to a canonical form through a series of algebraic transformations, and the canonical form precisely reflects the features of the object machine. The outstanding characteristic of the method is that the correctness of the compiler is ensured by that of each algebraic transformation, while the correctness of these transformations is proven from more basic laws.
The aims of introducing new methods in compiler design are as follows. At first, we want to broaden the readers' horizons: when they study the methods of compiler design, and especially when they are engaged in the design of a compiler, they should know that apart from the methods introduced before, there are many different ones. Furthermore, we want to show the frontier of the field, especially the effort of associating correctness with the translation process. No doubt, correctness is essential for any software [1]; without it, software has no value at all, and the assertion is absolutely right for compilers. This is why, since the 1960s, a large number of approaches have been suggested to tackle the problem of compiler correctness. We cannot include all the approaches in this book; we choose the current method because it has some advantages over other existing methods. It also benefits from the view described by others: the compilation process is completely characterized within a uniform framework of a procedural language whose semantics is given by algebraic laws. We will refer to this language as a reasoning language. The source language is a subset of this language, but we also supplement additional specification operators, such as constructions to model assumptions and assertions. By doing so, the approach develops the compiler while providing the proof of its correctness: as long as the algebraic transformations are correct, the compiler derived from these transformations in the canonical form is also correct. Finally, we want to emphasize the importance of the formal method to the reader. Denotational, algebraic, and axiomatic methods all belong to the category of formal methods; many researchers are working on these methods and have accomplished great achievements. Therefore, the reader is encouraged to make an effort in this direction. One thing that should be pointed out is that this chapter emphasizes the algebraic approach to compilation [2], rather than the translation between particular pairs of languages. It only involves the code generation phase of the compiler, instead of the entire development process of the compiler. Therefore, though this chapter is independent of the previous chapters, it has an intimate relation with the following chapters on intermediate code generation and object code generation.
We first introduce the source language, which is a subset of the reasoning language. We need to describe it because our goal is to translate programs written in the source language into the object language. The language can be considered as an extension of the guarded command language proposed by E.W. Dijkstra with procedures and general recursion.
The operators of the source programming language are listed in Fig. 9.1, in which we use x to stand for an arbitrary program variable, e for an expression, b for a Boolean expression, p and q for programs, and X for a program identifier.
We now explain these operators.
• skip When the operator is executed, it produces no change for the
program and terminates successfully.
• x := e The assignment starts with the evaluation of the expression e. Its
value is then assigned to x. For the sake of simplicity, we assume that the
evaluation of e always works without failure. So the assignment always
terminates.
• p; q The program p; q runs as usual sequential composition does, that
is, if the execution of p successfully terminates, then the execution of q
follows that of p.
• p ⊓ q The program p ⊓ q runs either like p or like q. This nondeterminism is called demonic, because if p or q fails∗, p ⊓ q fails; at least in principle this is no better than the situation where p ⊓ q always fails, because it cannot be relied on at all.
• p ◁ b ▷ q This is a conditional statement. Its execution starts with the evaluation of the boolean expression b. If b holds, then p is executed, otherwise q is executed.
• b ∗ p This is an iteration statement that stands for a loop. It starts with the evaluation of the boolean expression b; if it holds, p is executed and this is followed by the same iteration until b no longer holds. If b does not hold from the beginning, the statement just behaves like skip. Although iteration is a special case of recursion, it is convenient to name it as an independent operation.
Note: We consider that a program fails if it diverges (or aborts); the operator ⊥ has the same meaning.
• dec x · p This is a declaration that declares the variable x for use in the program p (the scope of the declaration). Here there is a difference from common practice: we do not enforce that a variable be declared before it is used. Undeclared (or global) variables can be considered as playing the role of input and output commands: the initial values of these variables are taken as the input to the program and their final values as the output yielded by the program. Our intention is to simplify the handling of type information.
• proc X ∼= p · q This introduces a non-recursive procedure with name X and body p. The program q following the symbol · is the scope of the procedure. Occurrences of X in q are interpreted as calls of the procedure X. We separate procedures from recursion with the intention of reducing complexity.
• μ X · p This is a recursive program. It has the name X and the body p.
Similar to proc X ∼ = p · q above, occurrences of X in p are interpreted as
recursive calls of X.
It needs to be pointed out that the source language allows arbitrary nesting of variable and procedure declarations, as well as recursive definitions. For the purpose of generality, we avoid defining the syntax of expressions; we use uop and bop as operators of the source language to stand for arbitrary unary and binary operators, respectively. In accordance with the practical situation, we assume that the target machine has instructions that directly implement these operators.
The source language is embedded in a specification space that includes the constructions presented in Fig. 9.2. As in Fig. 9.1, x stands for an arbitrary program variable, b for a boolean expression, and p and q for programs (or specifications).
as a source operator. But in this chapter we just use it for reasoning. Although
some of the above constructs are not strictly necessary, as some of them can
be defined in terms of others, each one represents a helpful concept both for
specification and for reasoning.
We will use the algebraic laws to present the semantics of the specification
(reasoning) language. Most of the laws are expressed as equations of the form
p = q. Informally the equation means that p and q have the same behavior:
for an arbitrary initial state s, p terminates if and only if q does, and the
final state produced by p starting in s is the same as the one produced by
q. The programs p and q may consume different amounts of resources (for example, memory) and run at different speeds, but what the equation really means is that an external observer (who is unable to see the internal states) cannot distinguish between them. Therefore, we can replace p with q, or vice versa, in any context.
It is also possible to attach a boolean condition to a law, meaning that the law is guaranteed to hold only if the condition yields a true value. Furthermore, the laws can be inequations (rather than equations); these use the refinement relation ⊆. For the purpose of illustration, a few algebraic laws are given below. They describe the fact that ⊆ is a lattice ordering (if the reader is interested in knowing more about lattice theory, he/she is encouraged to refer to modern algebra books). For all programs p, q and r we have
p ⊆ ⊤, (miracle is the top of the lattice)
⊥ ⊆ p, (abort is the bottom)
(r ⊆ p ∧ r ⊆ q) ≡ r ⊆ (p ⊓ q), (⊓ is the greatest lower bound)
(p ⊆ r) ∧ (q ⊆ r) ≡ (p ⊔ q) ⊆ r. (⊔ is the least upper bound)
An additional and extremely important fact about the refinement relation is
that all the operators of the reasoning language are monotonic with respect
to it. This means that if q refines p, then the replacement of p with q in
any context leads to a refinement of the entire context. More formally, for an
arbitrary context F:
p ⊆ q ⇒ F(p) ⊆ F(q).
After introducing the operators of the source language and the specification language, we may now present an overview of the approach to compilation. But first we need to define the machine used as the target of our compiler. The
target machine is very simple. It consists of four components:
P a sequential register (program counter)
A a general purpose register (accumulator)
M a store for variables (RAM)
m a store for instructions (ROM)
We need to emphasize that the essential feature of our approach to compila-
tion is the embedding of the target language within the reasoning language.
We represent the machine components as program variables and design the
instructions as assignments that update the machine state. We define the instructions of our simple machine as follows:
load(n) ⇔ (def as) A, P := M[n], P + 1.
(As we mentioned above, we use a multiple assignment here: A is assigned the value M[n] and, at the same time, the program counter is increased by one.)
store(n) ⇔ (def as) M, P := (M % {n → A}), P + 1.
We use the value of A to update the memory at position n; the program counter is increased by one.
bop − A(n) ⇔ (def as) A, P := (A bop M[n]), P + 1.
The value in A and the value in memory at position n are combined by a binary operation and the result is kept in A; the program counter is increased by one.
uop − A ⇔ (def as) A, P := (uop A), P + 1.
A unary operation is applied to the value in A and the result is kept in A; the program counter is increased by one.
jump(k) ⇔ (def as) P := k.
The value k is assigned to the program counter, so the next instruction to be executed will be at position k.
cjump(k) ⇔ (def as) P := (P + 1 ◁ A ▷ k).
If the condition in A holds, the next instruction to be executed is at position P + 1, which means that the machine executes consecutively; otherwise it jumps to execute the instruction at position k.
Here we use map overriding (%) to update M at position n with the value of A. Of course, we could use the more conventional notation M[n] := A as well, but we do not, because it is not suitable for reasoning, since M[n] is not really a variable. In particular, we will need to define operators such as non-freeness and substitution that regard two variables as different if they are syntactically different, whereas M[e] and M[f] will be the same if e and f evaluate to the same value even when they are syntactically different.
Recall that we mentioned that we do not deal with type information.
But in this context we assume that P is an integer variable, and that M is
an array variable. The conditional assignment that defines cjump is just an
abbreviation of the conditional
(P := P + 1) ◁ A ▷ (P := k).
A similar strategy can be adopted to model the components and instructions
of other target machines.
The normal form for describing the behavior of our simple machine that
executes a stored program is an iterated execution of instructions taken from
the memory m at location P:
dec P, A · P := s; (s ≤ P < f) ∗ m[P]; (P = f)⊥ ,
where s is the intended start address and f is the finish address of the code to
be executed. The requirement to start at the right instruction is expressed by
the initial assignment P := s, and the requirement to terminate at the right
place is expressed by the final assertion (P = f)⊥. The iteration program (s ≤ P < f) ∗ m[P] means that while the value of P is between s and f, the execution of m[P] is carried out.
dec P, A · P := s;
(s ≤ P < s + 2) ∗ [(P = s) → load(Ψy), (P = s + 1) →
store(Ψx)];
(P = s + 2)⊥ .
The whole expression is now the program we get; it indicates that these instructions must be loaded into the memory m at positions s and s + 1, completing the overall process. Note that the product of the compilation process is just a program in the same language (the normal form), from which we can easily obtain the sequence of generated instructions of the target language.
Note: Ψx is the address of x in the memory M, whereas M[Ψx] represents
the memory cell (location) that holds this value.
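To make the normal form concrete, the following is a small Python sketch of my own, an informal interpretation of the machine rather than the book's formal model: the components P, A, M and m become object fields, each instruction is an assignment that updates them, and run(s, f) mirrors dec P, A · P := s; (s ≤ P < f) ∗ m[P]; (P = f)⊥. The example executes the two instructions load(Ψy); store(Ψx) obtained above for x := y, with Ψy = 0 and Ψx = 1 chosen arbitrarily.

```python
class Machine:
    def __init__(self, m, memory_size=16):
        self.m = m                      # store for instructions (ROM)
        self.M = [0] * memory_size      # store for variables (RAM)
        self.A = 0                      # general purpose register (accumulator)
        self.P = 0                      # program counter

    # the instructions, each updating the machine state as defined in the text
    def load(self, n):  self.A, self.P = self.M[n], self.P + 1
    def store(self, n): self.M[n], self.P = self.A, self.P + 1
    def bop_A(self, op, n): self.A, self.P = op(self.A, self.M[n]), self.P + 1
    def uop_A(self, op):    self.A, self.P = op(self.A), self.P + 1
    def jump(self, k):  self.P = k
    def cjump(self, k): self.P = self.P + 1 if self.A else k

    def run(self, s, f):
        self.P = s                      # start at the right instruction
        while s <= self.P < f:          # iterated execution of m[P]
            self.m[self.P](self)
        assert self.P == f              # the final assertion (P = f)

# the compiled form of x := y: load(0) at position 0, store(1) at position 1
code = [lambda mc: mc.load(0), lambda mc: mc.store(1)]
mc = Machine(code)
mc.M[0] = 42                            # initial value of y
mc.run(0, 2)
print(mc.M[1])                          # 42: x now holds the value of y
```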
Example 9.2 (A simple conditional statement) Consider the following conditional statement that assigns x or y to z depending on whether or not the relation bop holds between x and y:
(z := x) ◁ x bop y ▷ (z := y).
The conditional statement may express a maximum (or minimum) finding program, in which case bop would stand for ≥ (or ≤). But we prefer sticking to the notation of the source language, where bop stands for an arbitrary binary operator. We might derive the normal form of the statement step by step, as we did for the assignment statement, but since it is very similar to the first one, we omit the details and directly present its normal form:
In this section we will first briefly describe a theoretical basis for the kind of refinement algebra we will be using. Then we will introduce the reasoning language based on the refinement algebra; the source language we use is a subset of the reasoning language. Meanwhile, in this section we will give examples of the refinement calculi based on these ideas and address the problem of data refinement. We will also link our approach to compilation to the more general task of deriving programs from specifications. Therefore, the contents of this section are crucial to the approach to compilation which we study.
p = (x := 2 ⊓ x := 3),
a = (x = 2),
b = (x = 3).
Similarly, one can show that p(b) = false, and consequently, that p(a)
∨p(b) = false. It is also trivial to show that p(a ∨ b) = true. Therefore,
we conclude that the fourth property does not hold in general for nondeter-
ministic programs. Instead, the implication
p(a) ∨ p(b) ⇒ p(a ∨ b)
does hold. The complete lattice PredTran includes predicate transformers useful for specification purposes; they are not implementable in general. Among the above properties, only monotonicity is satisfied by all the predicate transformers in PredTran: ⊤ trivially breaks the law of the excluded miracle. The fact that greatest lower bounds over arbitrary sets are allowed implies that the assumption of bounded nondeterminism (and therefore continuity)
Our language is untyped, and types are considered here only for the present discussion.
design of the compiler this invariant would establish the relationship between
the data space of the source program and that of the target program. As
exemplified in the previous section, the concrete representation of a variable
y in the memory M of our simple target machine is denoted by M[Ψ y], where
Ψ is a symbol table that maps identifiers to their respective addresses in M.
For a list of global variables y1 , . . . , yn representing the states of the source
program, the relevant coupling invariant would be
y1 = M[Ψy1 ] ∧ . . . ∧ yn = M[Ψyn ].
In this approach to data refinement, a new relation between programs is defined to express that program p′ (operating on variables x′) is a data refinement of the program p (operating on variables x) under coupling invariant I. This is written p ⊑I,x,x′ p′ and is formally defined by
p ⊑I,x,x′ p′ ⇔ (def as) (∃x · I ∧ p(a)) ⇒ p′(∃x · I ∧ a)
for all a not containing x′ (considering programs as predicate transformers).
Broadly, the antecedent requires that the initial values of the concrete
variables couple to some set of abstract values for which the abstract program
will succeed in establishing postcondition a; the consequent requires that the
concrete program yields new concrete values that also couple to an acceptable
abstract state.
To illustrate data refinement with a simple example, consider the coupling invariant given above, which relates the data space of the source program to that of the target program. We can show that the program
x := y
is data refined by the program
M := M ⊕ {Ψx → M[Ψy]}.
Note that this captures the desired intention: the effect of the latter program is to update the memory at position Ψx (the address of x) with M[Ψy], the value stored in the memory cell with address Ψy (the address of y). More generally, we can prove that an arbitrary source program operating on the source variables is data refined by the corresponding target program which operates on M.
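The following Python sketch (our own illustration, with an assumed symbol table Psi and assumed initial values) shows what the coupling invariant and the data refinement of x := y amount to operationally: an abstract state over source variables and a concrete memory M are related by yi = M[Ψyi], and the memory update M := M ⊕ {Ψx → M[Ψy]} re-establishes that relation.

```python
# Abstract state: a mapping from source variable names to values.
# Concrete state: a memory M (address -> value) plus a symbol table Psi (name -> address).
Psi = {"x": 0, "y": 1}                      # assumed symbol table

def coupled(abstract, M):
    """Coupling invariant: v = M[Psi[v]] for every source variable v."""
    return all(abstract[v] == M[Psi[v]] for v in abstract)

def assign_x_y(abstract):
    """Abstract program: x := y."""
    a = dict(abstract); a["x"] = a["y"]; return a

def update_memory(M):
    """Concrete program: M := M (+) {Psi[x] -> M[Psi[y]]} (map overriding)."""
    m = dict(M); m[Psi["x"]] = m[Psi["y"]]; return m

abstract = {"x": 7, "y": 42}
M = {Psi["x"]: 7, Psi["y"]: 42}
assert coupled(abstract, M)                              # invariant holds initially
assert coupled(assign_x_y(abstract), update_memory(M))   # and is re-established afterwards
```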
This definition is chosen for two main reasons. The first is that it guarantees the characteristic of data refinement given above, that is,
if (p ⊑I,x,x′ p′) then (dec x : Tx · p) ⊑ (dec x′ : Tx′ · p′).
The second reason is that it distributes through the program constructors, thus allowing data refinement to be carried out piecewise. For example, the distribution through sequential composition is given by
if (p ⊑I,x,x′ p′) and (q ⊑I,x,x′ q′) then (p; q) ⊑I,x,x′ (p′; q′).
But we can also adopt another approach, which avoids the need to define the data refinement relation: the algorithmic refinement relation can be used both to characterize data refinement and to carry out the calculations. The basic idea is to introduce an encoding program, say ψ, that computes abstract states from concrete states, and a decoding program, say Φ, that computes concrete states from abstract states. Then, for a given abstract program p, the task is to find a concrete program p′ such that
ψ; p; Φ ⊑ p′.
With the aid of specification features, it is possible to give very high-level definitions for ψ and Φ. Using the same convention adopted above, that x stands for the abstract variables, x′ for the concrete variables and I for the coupling invariant, ψ is defined by
ψ ⇔ (def as) var x; x :∈⊥ I; end x′.
The meaning of this expression is that it introduces the abstract variables x and assigns them values such that the invariant is satisfied, and then removes the concrete variables from the data space. The use of ⊥ as an annotation in the above generalized assignment command means that it aborts if I cannot be established. Similarly, we have the definition of Φ:
Φ ⇔ (def as) var x′; x′ :∈⊤ I; end x.
This one introduces the concrete variables x′ and assigns them values such that the invariant is satisfied, and removes the abstract variables from the data space. But in this case the generalized assignment command results in a miracle if I cannot be established.
It needs to be pointed out that these two kinds of generalized assignment commands are introduced only for the purpose of the present discussion; henceforth we will still use the previous notation x :∈ b, instead of x :∈⊤ b, as they are the same by definition.
Note that having separate commands to introduce and end the scope of variables is an essential feature for defining the encoding and decoding programs: the first introduces x and ends the scope of x′; the second introduces x′ and ends the scope of x.
In this approach, data refinement can also be performed piecewise, as we can prove distributivity properties such as
ψ; (p; q); Φ ⊑ (ψ; p; Φ); (ψ; q; Φ),
which illustrates that both algorithmic and data refinement can be carried out within the framework of one common relation.
We have mentioned previously that the task of compilation is program refinement. In this sense, we can establish a connection between our view of compiler design and the more general task of deriving programs
from specifications [8]. Henceforth we will refer to deriving programs sim-
ply as derivation. In both cases, a programming language is extended with
specification features, so that a uniform framework is built and the interface
between programs and specifications (when expressed in terms of distinct
formalisms) is avoided.
In a derivation, the idea is to start with an arbitrary specification and
end with a program formed solely from constructs that can be executed by
computer. In our case, the initial object is an arbitrary source program and
the final product is its normal form. But the tools used for achieving the goals
in both cases are identical in nature: transformations leading to refinement
in the sense already discussed.
Derivation entails two main tasks: the control and data refinement. We
also split the design of the compiler into these two main phases. However,
meaning that the law is guaranteed to hold only if the condition evaluates to true. Furthermore, the laws can be inequations (rather than equations), expressing refinement.
It is possible to select a small subset of our language and define the
additional operators in terms of the more basic ones. But this is not our
concern here. What we emphasize is the algebraic laws that will be used in
the process of designing a compiler. However, we do illustrate how a few
operators can be defined from others. In particular, iteration is defined as a
special case of the recursion and all the laws about the iteration are proven.
They deserve such special attention because of their role in the proof of the
normal form reduction theorems.
Another concern is the correctness of the laws of the basic operators. To
achieve this we need to link the algebraic semantics of the language with
a suitable mathematical model in which the basic operators can be defined
and their laws verified. At the end of this section, we will further discuss this issue and argue that the existence of nontrivial models for the reasoning language shows that in some sense the reasoning language and its algebraic laws are
consistent.
As we have explained before both the programming and specification
operators of the reasoning language have the same status in that they can be
viewed as predicate transformers. In this uniform framework, there is no need
to distinguish between programs and specifications. We will refer to both of
them as “programs”. Another remark is that programs have both a syntactic
and a semantic existence. On one hand, we perform syntactic operations on
them, such as substitution. On the other hand, the algebraic laws relating
language operators express semantic properties. Strictly speaking, we should
distinguish between these two natures of programs. But it is not convenient
to do so and it will be clear from the context which view we are taking.
1. Concepts and Notation
1) Name conventions
For the purpose of the convenience of the following discussion, we define
some conventions as regards the names used to denote program terms:
X, Y, Z variables denoting programs
p, q, r arbitrary but given programs
x, y, z list of variables
a, b, c boolean expressions
e, f, g list of expressions
We also use subscripts in addition to the above conventions. For example,
b0 , b1 , . . . stand for boolean expressions (also referred to as conditions). We
use comma for list concatenation: x, y stands for the concatenation of lists
x and y. Further conventions are explained when necessary.
2) Precedence rules
In order to clarify the priority order and to reduce the number of brackets
The notation f[x ← e] denotes the substitution of the list of expressions e for the (equal-length) list of variables x in the list of expressions f.
We also allow the substitution of programs for program identifiers:
p[X ← q].
4. Demonic nondeterminism
The program p ⊓ q denotes the demonic choice of programs p and q; either p or q is selected, the choice being totally arbitrary. The abort command already allows completely arbitrary behavior, so an offer of further choice makes no difference to it.
Law 9.5 p ⊓ ⊥ = ⊥ (⊓-⊥ zero)
On the other hand, the miracle command offers no choice at all.
Law 9.6 p ⊓ ⊤ = p (⊓-⊤ unit)
When the two alternatives are the same program, the choice becomes vacuous: ⊓ is idempotent.
Law 9.7 p ⊓ p = p (⊓ idemp)
The order in which a choice is offered is immaterial: ⊓ is commutative.
Law 9.8 p ⊓ q = q ⊓ p (⊓ comm)
Demonic choice is associative.
Law 9.9 (p ⊓ q) ⊓ r = p ⊓ (q ⊓ r) (⊓ assoc)
5. Angelic nondeterminism
The angelic choice of two programs p and q is denoted by p ⊔ q. Informally, it is a program that may act like p or q, whichever is more suitable in a given context.
As we have mentioned above, ⊥ is totally unpredictable, and therefore the least suitable program for all purposes.
Law 9.10 ⊥ ⊔ p = p (⊔-⊥ unit)
On the other hand, ⊤ suits any situation.
Law 9.11 ⊤ ⊔ p = ⊤ (⊔-⊤ zero)
Like ⊓, angelic choice ⊔ is idempotent, commutative and associative.
Law 9.12 p ⊔ p = p (⊔ idemp)
Law 9.13 p ⊔ q = q ⊔ p (⊔ comm)
Law 9.14 (p ⊔ q) ⊔ r = p ⊔ (q ⊔ r) (⊔ assoc)
6. Ordering Relation
Here we define the ordering relation ⊑ on programs: p ⊑ q holds whenever the program q is at least as deterministic as p or, alternatively, whenever q offers only a subset of the choices offered by p. In this case, q is at least as predictable as p. This coincides with the meaning we adopt for refinement. Thus p ⊑ q can be read as “p is refined by q”.
We define ⊑ in terms of ⊓. Informally, if the demonic choice of p and q always yields p, one can be sure that p is worse than q in all situations.
Definition 9.9 (The ordering relation) p ⊑ q ⇔ (def as) (p ⊓ q) = p
In the final section, we prove that this ordering coincides with the ordering on the lattice of predicate transformers described in the beginning of this chapter. Alternatively, the ordering relation could have been defined in terms of ⊔.
Law 9.15 p ⊑ q ≡ (p ⊔ q) = q (⊑-⊔)
From Definition 9.9 and the laws of ⊓, we conclude that ⊑ is a partial ordering on programs.
Law 9.16 p ⊑ p (⊑ reflexivity)
Law 9.17 (p ⊑ q) ∧ (q ⊑ p) ⇒ (p = q) (⊑ antisymmetry)
Law 9.18 (p ⊑ q) ∧ (q ⊑ r) ⇒ (p ⊑ r) (⊑ transitivity)
Moreover, ⊑ is a lattice ordering. The bottom and top elements are ⊥ and ⊤, respectively; the meet (greatest lower bound) and join (least upper bound) operators are ⊓ and ⊔, in this order. These are also consequences of the definition of ⊑ and the laws of ⊓ and ⊔.
Law 9.19 ⊥ ⊑ p (⊑-⊥ bottom)
Law 9.20 p ⊑ ⊤ (⊑-⊤ top)
Law 9.21 (r ⊑ p ∧ r ⊑ q) ≡ r ⊑ (p ⊓ q) (⊑-⊓ glb)
Law 9.22 (p ⊑ r) ∧ (q ⊑ r) ≡ (p ⊔ q) ⊑ r (⊑-⊔ lub)
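Since the programs of the reasoning language can be viewed as predicate transformers, the laws above can be checked directly on a small model. The Python sketch below is an illustration only: the finite state space and the particular sample programs are our own assumptions. It represents a predicate as a set of states and a program as a weakest-precondition transformer from postconditions to preconditions, and verifies a few of the laws together with Definition 9.9.

```python
from itertools import combinations

STATES = frozenset({0, 1, 2})
def predicates():
    xs = list(STATES)
    return [frozenset(c) for r in range(len(xs) + 1) for c in combinations(xs, r)]

# Programs as weakest-precondition transformers: postcondition -> precondition.
bottom = lambda post: frozenset()            # abort: can guarantee nothing
top    = lambda post: STATES                 # miracle: guarantees everything
skip   = lambda post: post
def demonic(p, q): return lambda post: p(post) & q(post)    # p meet q
def angelic(p, q): return lambda post: p(post) | q(post)    # p join q
def refined_by(p, q):                                       # p refined by q
    return all(p(post) <= q(post) for post in predicates())
def equal(p, q):
    return all(p(post) == q(post) for post in predicates())

assign1 = lambda post: STATES if 1 in post else frozenset() # "x := 1" on states {0,1,2}
for p in (bottom, top, skip, assign1):
    assert equal(demonic(p, bottom), bottom)        # Law 9.5
    assert equal(demonic(p, top), p)                # Law 9.6
    assert equal(angelic(bottom, p), p)             # Law 9.10
    assert equal(angelic(top, p), top)              # Law 9.11
    assert refined_by(bottom, p) and refined_by(p, top)   # Laws 9.19, 9.20
    # Definition 9.9: refinement holds exactly when the demonic choice collapses
    assert refined_by(p, skip) == equal(demonic(p, skip), p)
print("all checked")
```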
In order to be able to use the algebraic laws to transform subcomponents of compound programs, it is crucial that p ⊑ q implies F(p) ⊑ F(q), for all contexts F (functions from programs to programs). This is equivalent to saying that F (and consequently, all the operators of our language) must be monotonic with respect to ⊑. Then we have the following law.
Law 9.23 If p ⊑ q, then
1) (p ⊓ r) ⊑ (q ⊓ r) (⊓ monotonic)
2) (r; p) ⊑ (r; q) and (p; r) ⊑ (q; r) (; monotonic)
We will not state monotonicity laws explicitly for the remaining operators of our language.
7. Unbounded Nondeterminism
Here we generalize the operators ⊓ and ⊔ to take an arbitrary set of programs, say P, as argument. ⊔P denotes the least upper bound of P; its definition is given below.
Definition 9.10 (Least upper bound) (⊔P ⊑ p) ≡ (∀X | X ∈ P · X ⊑ p)
The above definition states that p refines the least upper bound of the set P if and only if, for all X ∈ P, p refines X.
The greatest lower bound of P, denoted by ⊓P, is defined in a similar way.
Definition 9.11 (Greatest lower bound) (p ⊑ ⊓P) ≡ (∀X | X ∈ P · p ⊑ X)
Let U be the set of all programs, and ∅ be the empty set. Then we have
⊔∅ = ⊥ = ⊓U,
⊓∅ = ⊤ = ⊔U.
From the above we can easily show that sequential composition does not distribute rightward through the least upper bound or the greatest lower bound in general, since we have
⊥; ⊓∅ = ⊥ ≠ ⊓∅,
⊤; ⊔∅ = ⊤ ≠ ⊔∅.
The rightward distribution of sequential composition through these operators is used below to define Dijkstra's healthiness conditions. However, the leftward distribution is valid in general, and can be verified by considering programs as predicate transformers. In the following, the notation {X | b · F(X)} should be read as: the set of elements F(X) for all X in the range specified by b.
Law 9.24
1) (⊔P); p = ⊔{X | X ∈ P · (X; p)} (;-⊔ left dist)
2) (⊓P); p = ⊓{X | X ∈ P · (X; p)} (;-⊓ left dist)
It is also possible to verify that the lattice of programs (considered as predicate transformers) is distributive.
Law 9.25
1) (⊔P) ⊓ p = ⊔{X | X ∈ P · (X ⊓ p)} (⊓-⊔ dist)
2) (⊓P) ⊔ p = ⊓{X | X ∈ P · (X ⊔ p)} (⊔-⊓ dist)
As discussed before, among all predicate transformers Dijkstra singles out the implementable ones by certain healthiness conditions. Here we formulate these conditions as equations relating operators of our language.
1) p; ⊥ = ⊥ (p is non-miraculous)
2) p; (⊓P) = ⊓{X | X ∈ P · (p; X)} (p is conjunctive)
   for all (non-empty) sets of programs P
3) p; (⊔P) = ⊔{X | X ∈ P · (p; X)} (p is disjunctive)
   for all (non-empty) sets of programs P
4) p; ⊔{i | i ≥ 0 · qi} = ⊔{i | i ≥ 0 · (p; qi)} (p is continuous)
   provided qi ⊑ qi+1 for all i ≥ 0
We say that a program p is universally conjunctive if the second equation above holds for all sets of programs P (possibly empty). Similarly, if the third equation holds for all P, we say that p is universally disjunctive.
8. Recursion
Let X stand for the name of the recursive program we wish to construct,
and let F(X) define the intended behavior of the program, for a given context
F. If F is defined solely in terms of the notations introduced already, it follows
by structural induction that F is monotonic:
p & q ⇒ F(p) & F(q).
Actually, this will remain true for the commands that will be introduced
9. Approximate Inverse
Let F and G be functions on programs such that, for all programs X and Y,
F(X) = Y ≡ X = G(Y).
Then G is the inverse of F, and vice versa. Therefore, G(F(X)) = X = F(G(X)), for all X. It is well known, however, that a function has an inverse if and only if it is bijective. As the set of bijective functions is relatively small, this makes the notion of inverse rather limited. The standard approach is to generalize the notion of inverse functions as follows.
Definition 9.12 (Approximate inverse) Let F and F⁻¹ be functions on programs such that, for all X and Y,
F(X) ⊑ Y ≡ X ⊑ F⁻¹(Y).
Then we call F the weakest inverse of F⁻¹, and F⁻¹ the strongest inverse of F. The pair (F, F⁻¹) is called a Galois connection.
The left and right weakest inverses of sequential composition are defined together with a calculus of program development. Broadly speaking, the aim is to decompose a task (specification) r into two subtasks p and q, such that
r ⊑ p; q.
The method allows one to calculate the weakest specification that must be satisfied by one of the components p and q when the other one is known, and then the problem is completely solved. For example, one can calculate the weakest specification of p from q and r; this is called the weakest prespecification. Dually, r/p is the weakest specification of the component q satisfying r ⊑ p; (r/p); it is named the weakest postspecification.
The strongest inverse of language constructs is less commonly used, perhaps because it exists only for operators that are universally disjunctive. A method to reason about recursion based on the notion of the strongest inverse, there called weak-op-inverse, is suggested in [8]. We list some of the properties of strongest inverses below.
Before presenting these properties, we review two basic definitions. F is universally conjunctive if, for all sets (possibly empty) P,
F(⊓P) = ⊓{X | X ∈ P · F(X)}.
Similarly, F is universally disjunctive if, for all sets (possibly empty) P,
F(⊔P) = ⊔{X | X ∈ P · F(X)}.
10. Simulation
Here we consider the inverses of programs themselves. An inverse of the program S is a program T that satisfies
S; T = skip = T; S.
That means that running S followed by T, or T followed by S, is the same as not running any program at all, since skip has no effect whatsoever.
The inversion of programs has been previously discussed by Dijkstra and Gries. A more formal approach to program inversion is given in [9], which defines proof rules for inverting programs written in Dijkstra's language. A common feature of these works is the use of the notion of exact inverse given above. But this notion of inverse is rather limited, hence we adopt a weaker definition of program inversion.
Definition 9.13 (Simulation) Let S and S⁻¹ be programs such that
(S; S⁻¹) ⊑ skip ⊑ (S⁻¹; S).
Then the pair (S, S⁻¹) is called a simulation; S⁻¹ is the strongest inverse of S, whereas S is the weakest inverse of S⁻¹.
A very simple example of a simulation is the pair (⊥, ⊤), since
(⊥; ⊤) = ⊥ ⊑ skip ⊑ ⊤ = (⊤; ⊥).
Further examples of simulations will appear later.
Simulations are useful for calculation in general. When carrying out program transformation, it is not uncommon to meet situations in which a program followed by its inverse (that is, S; S⁻¹ or S⁻¹; S) appears as a subterm of the program being transformed. Then, from the definition of simulations, it is possible to eliminate a subterm of this form by replacing it with skip (of course, this is valid only for inequational reasoning). This will be illustrated in many proofs in the next two sections, where we give further examples of simulations.
But the most valuable uses for the concept of simulations are for data
refinement. This was discussed in some detail in the previous section where
we introduced the concepts of encoding and decoding programs that form
a simulation pair. The distributivity properties of simulations given below
are particularly useful to prove the correctness of the change of the data
representation phase of the compilation process, where the abstract space of
the source program is replaced by the concrete state of the target machine.
The appropriate encoding and decoding programs will be defined when the
need arises.
Now we present some of the properties of simulations.
Theorem 9.3 (Simulation) Let S be a program. The following properties hold:
1) S⁻¹ is unique if it exists.
2) S⁻¹ exists if and only if S is universally disjunctive.
3) S⁻¹ is universally conjunctive if it exists.
We define the following abbreviations.
Definition 9.14 (Simulation functions) Let (S, S⁻¹) be a simulation. We use S and S⁻¹ themselves as functions, defined by
S(X) ⇔ (def as) S; X; S⁻¹,
S⁻¹(X) ⇔ (def as) S⁻¹; X; S.
The next theorem shows that the concepts of simulation and approximate inverse are closely related.
Theorem 9.4 (Lift of simulation)
Let S and S⁻¹ be simulation functions as defined above. Then S⁻¹ is the strongest inverse of S. Furthermore, from Theorem 9.2 we have
S(S⁻¹(X)) ⊑ X ⊑ S⁻¹(S(X)).
The following theorem shows how simulation functions distribute through all the language operators introduced so far, with possible improvement in the distributed result.
Theorem 9.5 (Distributivity of simulation functions)
1) S(⊥) = ⊥;
2) S(⊤) ⊑ ⊤;
3) S(skip) ⊑ skip;
Proof
RHS = (a∨b)→(a→p) ⊓ (a∨b)→(b→q)   {Law 9.42 (guard-⊓ dist)}
= (a∨b)→(a→p ⊓ b→q)   {Law 9.41 (→ guard conjunction)}
= LHS
When p and q above are the same program, we have the following law.
Law 9.44 (a→p ⊓ b→p) = (a∨b)→p (→ guard disjunction 2)
Perhaps surprisingly, this is not a consequence of the previous one.
Sequential composition distributes leftward through guarded commands.
Law 9.45 (b→p); q = b→(p; q) (;-→ left dist)
The conditional p ◁ b ▷ q may be read as “if b then p else q”. It can also be defined in terms of more basic operators.
Definition 9.16 (Conditional)
(p ◁ b ▷ q) ⇔ (def as) (b → p ⊓ ¬b → q).
The most basic property of a conditional is that its left branch is executed if the condition holds initially; otherwise its right branch is executed.
Law 9.47 (a ∧ b)⊤; (p ◁ b ▷ q) = (a ∧ b)⊤; p (◁▷ true cond)
Law 9.48 (a ∧ ¬b)⊤; (p ◁ b ▷ q) = (a ∧ ¬b)⊤; q (◁▷ false cond)
The left branch of a conditional can always be preceded by an assumption of the condition. Similarly, preceding the right branch by an assumption of the negation of the condition has no effect.
Law 9.49 (b⊤; p) ◁ b ▷ q = (p ◁ b ▷ q) = p ◁ b ▷ ((¬b)⊤; q) (◁▷ void ⊤)
If the two branches are the same program, the conditional can be eliminated.
Law 9.50 p ◁ b ▷ p = p (◁▷ idemp)
Guard distributes through the conditional.
Law 9.51 a → (p ◁ b ▷ q) = (a → p) ◁ b ▷ (a → q) (guard-◁▷ dist)
Sequential composition distributes leftward through the conditional.
Law 9.52 (p ◁ b ▷ q); r = (p; r) ◁ b ▷ (q; r) (;-◁▷ left dist)
The following two laws allow the elimination of nested conditionals in certain cases.
Law 9.53 p ◁ b ▷ (p ◁ c ▷ q) = p ◁ b∨c ▷ q (◁▷ cond disjunction)
Law 9.54 (p ◁ b ▷ q) ◁ c ▷ q = p ◁ b∧c ▷ q (◁▷ cond conjunction)
We have considered assumptions and assertions as primitive commands and have defined guarded commands and conditionals in terms of them. The following equations show that an alternative could be to consider the conditional as a constructor and regard assumptions, assertions and guarded commands as special cases. These are not stated as laws because they are not needed in our proofs.
b⊥ = skip ◁ b ▷ ⊥,
b⊤ = skip ◁ b ▷ ⊤,
b → p = p ◁ b ▷ ⊤.
15. Assignment
The command x := e stands for a multiple assignment where x is a list
of distinct variables, and e is an equal-length list of expressions. The com-
ponents of e are evaluated and simultaneously assigned to the corresponding
(b ∗ p); q
= ((p; (b ∗ p)) ◁ b ▷ skip); q   {Definition 9.10 (Iteration) and Law 9.26 (μ fixed point)}
= (p; ((b ∗ p); q)) ◁ b ▷ q   {Law 9.62 (;-◁▷ left dist) and Law 9.1 (;-skip unit)}
⊒ μX·((p; X) ◁ b ▷ q) = RHS   {Law 9.26 (μ least fixed point)}.
For the other direction, let q⁻¹ denote the strongest inverse of sequential composition with q, so that (RHS; q⁻¹); q ⊑ RHS by Lemma 9.1 (strongest inverse of ;). Then
RHS = μX·((p; X) ◁ b ▷ q)
= (p; μX·((p; X) ◁ b ▷ q)) ◁ b ▷ q   {Law 9.26 (μ fixed point)}
⊒ ((p; ((μX·((p; X) ◁ b ▷ q)); q⁻¹)); q) ◁ b ▷ q   {Lemma 9.1 (strongest inverse of ;)}
= ((p; ((μX·((p; X) ◁ b ▷ q)); q⁻¹)) ◁ b ▷ skip); q   {Law 9.62 (;-◁▷ left dist) and Law 9.1 (;-skip unit)}.
Then from Definition 9.12 (Approximate inverses) we have
RHS; q⁻¹ = (μX·((p; X) ◁ b ▷ q)); q⁻¹
⊒ (p; ((μX·((p; X) ◁ b ▷ q)); q⁻¹)) ◁ b ▷ skip
⊒ b ∗ p   {Law 9.26 (μ least fixed point) and Definition 9.12 (Approximate Inverses)}.
So RHS ⊒ LHS, according to Definition 9.12.
The following law is surprisingly important, mainly in proving the correctness of the normal form reduction of sequential composition.
Law 9.78 (b ∗ p); ((b∨c) ∗ p) = (b∨c) ∗ p (∗ sequence)
Proof The proof is again carried out in two directions, RHS ⊒ LHS and LHS ⊒ RHS. First, RHS ⊒ LHS:
RHS = RHS ◁ b ▷ RHS   {Law 9.50 (◁▷ idemp)}
= ((p; RHS) ◁ b∨c ▷ skip) ◁ b ▷ ((b∨c) ∗ p)   {Definition 9.10 (Iteration) and Law 9.26 (μ fixed point)}
= ((p; RHS) ◁ b∨c ▷ ((b∨c) ∗ p)) ◁ b ▷ ((b∨c) ∗ p)   {Law 9.49 (◁▷ void ⊤) and Law 9.72 (∗ elim)}
= (p; RHS) ◁ b ▷ ((b∨c) ∗ p)   {Law 9.54 (◁▷ cond conjunction)}
⊒ μX·((p; X) ◁ b ▷ ((b∨c) ∗ p))   {Law 9.26 (μ least fixed point)}
= (b ∗ p); ((b∨c) ∗ p)   {Law 9.77 (∗-μ tail recursion)}
= LHS.
Then, LHS ⊒ RHS:
LHS = ((p; (b ∗ p)) ◁ b ▷ skip); ((b∨c) ∗ p)   {Definition 9.10 (Iteration) and Law 9.26 (μ fixed point)}
= (p; LHS) ◁ b ▷ RHS   {Law 9.52 (;-◁▷ left dist) and Law 9.1 (;-skip unit)}
= (p; LHS) ◁ b ▷ ((p; RHS) ◁ b∨c ▷ skip)   {Law 9.26 (μ fixed point)}
⊒ (p; LHS) ◁ b∨c ▷ skip   {Law 9.55 (◁▷ cond disjunction) and LHS ⊑ RHS, from the first part},
hence, by Law 9.26 (μ least fixed point),
LHS ⊒ (b∨c) ∗ p = RHS.
18. Static Declaration
The notation dec x · p declares the list x of distinct variables for use in the program p (the scope of the declaration). Local blocks of this form may appear wherever a program is expected.
It does not matter whether variables are declared in one list or singly.
Law 9.79 If x and y have no variables in common,
dec x · (dec y · p) = dec x, y · p (dec assoc)
The order in which variables occur does not matter either.
Law 9.80 dec x·(dec y · p)= dec y · (dec x · p) (dec commu)
If a declared variable is never used, its declaration has no effect.
Law 9.81 If x is not free in p,
dec x · p= p (dec elim)
One may change the name of a bound variable, provided that the new
name is not used for a free variable.
Law 9.82 If y is not free in p, then
dec x · p = dec y · p[x←y] (dec rename)
The value of a declared variable is totally arbitrary. Therefore, the ini-
tialization of a variable may reduce nondeterminism.
Law 9.83
1) dec x · p ⊑ dec x · x := e; p (dec-:= initial value)
2) dec x · p ⊑ dec x · x :∈ b; p (dec-:∈ initial value)
An assignment to a variable just before the end of its scope is irrelevant. But a generalized assignment cannot be completely ignored, since it may result in a miracle.
Law 9.84
1) dec x · p = dec x · p; x := e (dec-:= final value)
2) dec x · p ⊑ dec x · p; x :∈ b (dec-:∈ final value)
The scope of a variable may be increased without effect, provided that
it does not interfere with the other variables with the same name. Thus
each of the programming constructs has a distribution law with declaration.
Note that it is possible to deal with cases where x is declared in only one of the branches (and is not free in the other ones) by using Law 9.81 (dec elim).
Declaration can also be moved outside an iteration, possibly reducing nondeterminism. As shown below, this law can be derived from more basic ones.
Law 9.89 b ∗ (dec x · p) ⊑ dec x · (b ∗ p) (dec-∗ dist)
Proof Our proof starts with the right-hand side.
RHS = dec x · ((p; (b ∗ p)) ◁ b ▷ skip)   {Definition 9.10 (Iteration) and Law 9.26 (μ fixed point)}
= (dec x · (p; (b ∗ p))) ◁ b ▷ skip   {Law 9.88 (dec-◁▷ dist) and Law 9.81 (dec elim)}
⊒ ((dec x · p); (dec x · (b ∗ p))) ◁ b ▷ skip   {Law 9.86 (dec-; dist)},
hence, by Law 9.26 (μ least fixed point),
RHS ⊒ b ∗ (dec x · p) = LHS.
In this section, we will exemplify our system to see how it works in producing a compiler. Of course, it is a simple one and not practical, as its purpose is only to show that our approach to compiler design works. We first describe the normal form of a model of an arbitrary executing mechanism. The normal form theorems in this section are concerned with control
elimination: the reduction of the nested control structure of the source pro-
gram to a single flat iteration. These theorems are largely independent of a
particular target machine.
Then we design and prove the correctness of a compiler for a subset of
our source language, not including procedures or recursion. The construc-
tions considered here are skip, assignment, sequential composition, demonic
nondeterminism, conditional, iteration and local declaration.
As mentioned earlier, the compilation process is split into three main phases: simplification of expressions, control elimination and data refinement (the conversion from the abstract space of the source program to the concrete state of the target machine).
Each of these generic transformations has the status of a theorem. The more specific transformations that illustrate the compilation process for a particular target machine have the status of rules. Each rule describes a transformation that brings the source program closer to a normal form with the same structure as the specific target machine. Taken collectively, these rules can be used to carry out the compilation task.
It is necessary to emphasize that one should notice the different roles
played by the algebraic laws of the last section and these reduction rules: the
laws express general properties of the language operators, whereas the rules
b1 → p1 ⊓ . . . ⊓ bn → pn,
b1 ∨ . . . ∨ bn,
as an abbreviation of
v : [a, (b1 ∨ . . . ∨ bn) → (b1 → p1 ⊓ . . . ⊓ bn → pn), c].
= LHS.
The following normal form representations of skip and assignment are instantiations of the above lemma. The one for skip is further simplified by the fact that skip is an identity of sequential composition. The operational interpretation is that skip can be implemented by a jump.
Theorem 9.7 (Skip)
skip ⊑ v : [a, (a → v :∈ c), c]
Theorem 9.8 (Assignment)
x := e ⊑ v : [a, a → (x := e; v :∈ c), c]
The reduction of sequential composition assumes that both arguments are
already in the normal form, and that the final state of the left argument
coincides with the initial state of the right argument. The guarded command
set of the resulting normal form combines the original guarded commands.
First we consider the particular case where the guarded command set of the
right argument includes that of the left argument.
Lemma 9.3 (Sequential composition)
Proof
RHS ⊒ v : [a, b1 → p, c0]; v : [c0, (b1 → p ⊓ b2 → q), c]   {Lemma 9.3 (sequential composition)}
⊒ v : [a, b1 → p, c0]; v : [c0, b2 → q, c]   {Lemma 9.4 (eliminate guarded command)}
= LHS.
The following lemma shows how to eliminate a conditional command when its branches are normal form programs with identical components, except for the initial state. The first action to be executed in the resulting normal form program determines which of the original initial states should be activated.
Lemma 9.5 (Conditional)
If v is not free in b, then
v : [a1, R, c] ◁ b ▷ v : [a2, R, c] ⊑ v : [a, R, c]
where R = (a → (v :∈ a1 ◁ b ▷ v :∈ a2) ⊓ b1 → p).
Proof
RHS = dec v · v :∈ a; (v :∈ a1 ◁ b ▷ v :∈ a2); (a ∨ b1) ∗ R; c⊥   {Law 9.74 (∗ unfold)}
= dec v · v :∈ a; ((v :∈ a1; (a ∨ b1) ∗ R; c⊥) ◁ b ▷ (v :∈ a2; (a ∨ b1) ∗ R; c⊥))   {Law 9.52 (;-◁▷ left dist)}
= (dec v · v :∈ a; v :∈ a1; (a ∨ b1) ∗ R; c⊥) ◁ b ▷ (dec v · v :∈ a; v :∈ a2; (a ∨ b1) ∗ R; c⊥)   {Law 9.68 (:∈ right dist) and Law 9.88 (dec-◁▷ dist)}
⊒ v : [a1, R, c] ◁ b ▷ v : [a2, R, c] = LHS   {Law 9.93 (dec-:∈ initial value)}.
The above lemma is useful for intermediate calculations. It is used in the proof of the normal form reduction of conditional and iteration commands.
Theorem 9.10 (Conditional)
If v does not occur in b, then
v : [a1, b1 → p, c1] ◁ b ▷ v : [a2, b2 → q, c] ⊑ v : [a, R, c]
where R = (a → (v :∈ a1 ◁ b ▷ v :∈ a2) ⊓ b1 → p ⊓ c1 → v :∈ c ⊓ b2 → q).
Proof
RHS = v : [a, R, c]
⊒ v : [a1, R, c] ◁ b ▷ v : [a2, R, c]   {Lemma 9.5 (conditional)}
The compiler which we design produces code for a simple target machine that consists of four components:
A   accumulator
P   a sequential register (program counter)
M   a store for data (RAM)
m   a store for instructions (ROM)
The idea is to regard the machine components as program variables and design the instructions as assignments that update the machine state.
P and A will be represented by single variables. Although we do not deal with types explicitly, P will be assigned integer expressions, standing for locations in ROM. A will be treated as an ordinary source variable; it will play an important role in the decomposition of expressions, which is the subject of the next section. M will be modeled as a map (finite function) from the addresses of variables to the values stored at those addresses, and m as a map from instruction locations in ROM to the instructions stored there.
In order to model M and m, we need to extend our language to allow map variables. We use the following operators on maps:
{x → e}    singleton map
m1 ∪ m2    union
m1 ⊕ m2    overriding
m[x]       application
Perhaps the least familiar of these operators is overriding: m1 ⊕ m2 contains all the pairs in m2, plus each pair in m1 whose domain element is not in the domain of m2. For example,
{x → e, y → f} ⊕ {y → g, z → h} = {x → e, y → g, z → h}.
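As a quick illustration (our own, not part of the formal development), a finite map and the overriding operator behave like a Python dictionary merge in which the right-hand operand wins:

```python
def override(m1, m2):
    """m1 (+) m2: all pairs of m2, plus the pairs of m1 whose key is not in m2."""
    result = dict(m1)
    result.update(m2)        # keys of m2 replace those of m1
    return result

m1 = {"x": "e", "y": "f"}
m2 = {"y": "g", "z": "h"}
print(override(m1, m2))      # {'x': 'e', 'y': 'g', 'z': 'h'}
```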
Problems
Problem 9.1 Using the derivation language of this chapter, express the following computation:
a × (1 + p)ⁿ − a.
Problem 9.2 Using the derivation language of this chapter, express the
following computation
Area = (s(s − a)(s − b)(s − c))^(1/2)
where s = (a + b + c)/2.
Problem 9.3 Using the derivation language of this chapter, write a program to compare three numbers a, b, and c, and find the biggest one.
Problem 9.4 Using the derivation language of this chapter, find the smallest n such that
1 + 2 + · · · + n ≥ 500
and compute the actual sum.
Problem 9.5 Using the derivation language of this chapter, write two versions of a program for finding the greatest common divisor. The greatest common divisor of two integers x and y is defined by
gcd(x, y) = y, if y ≤ x and x mod y = 0,
gcd(x, y) = gcd(y, x), if y > x,
gcd(x, y) = gcd(y, x mod y), otherwise.
Problem 9.6 Using the derivation language of this chapter, convert an input decimal integer into its representation in an arbitrary base.
Problem 9.7 Using the derivation language of this chapter, write a program that reads characters from the keyboard and then sorts them in increasing order, deleting replicated characters (if any).
Problem 9.8 Using the derivation language, write a program to compute the following propositional formula:
(P → (Q ∧ R)) ∧ (∼P → ∼Q) ∧ ∼R.
Problem 9.9 Write a program that realizes the function of an accumulator: it accumulates the user's input until the input is zero, and then outputs the accumulated result. Then, using the methods provided in this chapter, compile the program.
References
Chapter 10 Generation of Intermediate Code
After finishing the first phase, i.e., the analytical phase of compilation, we naturally enter the second phase, i.e., the synthetic phase, whose main task is to generate the target code. Why don't we directly generate the target code instead of bothering to generate intermediate code first? This question will be answered in this chapter.
Theoretically there is no difficulty in generating the target code directly after finishing the analysis of the source program and storing all the information (data). However, generating intermediate code first, rather than generating the target code directly, is beneficial for producing efficient target code.
The so-called intermediate code is the program rewritten from the source program in an intermediate language. Actually, if we compile the source program in a number of passes, an intermediate language already exists: we use it as the medium that transits from one pass to the next, that is, the output of a pass is the input of the next pass, and both are expressed in the intermediate language. After lexical analysis, a rewritten version of the source is produced. The lexical analysis transforms the identifiers, variables, constants and reserved words of the language into a machine-oriented form of fixed length. Therefore, after lexical analysis, the
program text will consist of a series of symbols of fixed length. These are the replacements of the original symbols, which have variable lengths. The simplest form is integers, some of which correspond to the reserved words, while the others represent pointers to the identifier table or constant table and so on. We may call this form of the program the intermediate code. The difference between this kind of intermediate form and the source text is that the intermediate code no longer contains the original information which the source text contains; it carries all the information only when combined with the various tables which the lexical analyzer generates.
Therefore, the intermediate code which we discuss here is the kind which the analytical phase generates in its last pass. It is closer to the machine code, though it is independent of the machine. In general the analyzer generates syntax trees or parser trees as the intermediate form, and that is not yet the intermediate code. The parser trees still retain traces of the source language and of the programming paradigm to which they belong, which is not convenient for generating the target code. Therefore, we hope that, by generating the intermediate code, the specific set of nodes is reduced to a small set of general-purpose concept nodes. Such a set is easier to implement on real machines, and thus it is also easier to generate the target code from it. The intention of this chapter is to cover all issues concerning the generation of the intermediate code.
Besides, we want to point out that there is another advantage of the intermediate code: an optimization program that is independent of the machine may be applied to it [1]. Fig. 10.1 shows the positions of the intermediate code generator and the code optimizer.
Fig. 10.1 The position of the intermediate code generator and code optimizer.
the indication of module structures, etc. The management nodes may contain the evaluation of expressions, e.g., the evaluation of the address of an array element. In the normal situation, they seldom have code that corresponds to the target code. However, in other cases, for example for modules, the compiler must generate code to execute the initialization of the modules. Even so, the code which management nodes need is very short.
Control stream nodes [2] cover many constructs: the branch caused by a conditional statement, the multiple selection drawn from a switch statement, the computational goto statement, function invocation, abnormal (exception) handling, method application, Prolog rule selection and remote procedure invocation. The characteristics of the code which control stream nodes need depend on the category to which the source program belongs. Correspondingly, the code needed by control stream nodes is not much either.
Expressions occur in all the categories. Expressions explicitly occur in the
codes of almost all languages. Expressions are the main object that the inter-
mediate code needs to handle. Therefore, we need to consider how to select
the intermediate language. In principle, the intermediate language should
satisfy the following conditions:
1) it is independent of the machine;
2) it is simple;
3) it is easy to generate the intermediate code, and the intermediate code is also easy to translate further into machine code;
4) it is convenient for the code-optimizing program to handle.
According to these requirements on the intermediate language, through extensive theoretical research and massive practice, people have gradually formed intermediate languages that are rather mature and widely accepted. They are the acyclic directed graph (ADG), postfix expressions and three-address code.
In the following, we will successively introduce the ways of generating the
intermediate code by these languages.
E → E1 + T,
E → T,
T → T1 × F,
T → F,
F → (E),
F → id,
F → num.
Notice that in these productions, E1 and T1 are not new nonterminals; actually they are E and T respectively, and we introduce them only for convenience. Meanwhile, id and num represent identifiers and numbers respectively, so they are terminals. According to these production rules, we may obtain
the semantic rules as shown in Fig. 10.3.
Fig. 10.3 Syntax driven definition of parser tree for assignment statement.
According to the syntax driven definition given above, the parser tree in
Fig. 10.2 (a) is formed and a series of function calls is carried out:
p1 := mkleaf(id, entry a);
p2 := mkleaf(id, entry b);
p3 := mkleaf(id, entry c);
p4 := mknode(‘+’, p2, p3);
p5 := mknode(‘−’, p2, p3);
p6 := mknode(‘×’, p4, p5);
p7 := mkleaf(num, 1/3);
p8 := mknode(‘×’, p2, p3);
p9 := mknode(‘×’, p7, p8);
p10 := mknode(‘−’, p6, p9);
p11 := mknode(‘:=’, p1, p10);
We now explain the function calls and the relevant symbols. In the semantic rules, nptr denotes a node pointer. mknode constructs an interior node from separate components: the first component is the label of the node, while the second and third components are the left child and the right child. mkleaf constructs a leaf; it has two components, the first of which is the identification or type, while the second is the identifier. If the leaf is a constant, then the first component represents the type and the second one is the value. The remaining symbol is id.place; it points to the corresponding pointer or address of the identifier in the symbol table.
The acyclic directed graph is also generated from the same syntax-driven definition. When the function mknode(op, left, right) or mkunode(op, child) (which constructs a node with only one subnode) is encountered, then before constructing the node it is checked whether a node with the same label and the same children already exists. If it exists, the only thing that needs to be done is to return the pointer that points to the existing node; otherwise the node itself has to be created. In this way, the acyclic directed graph in Fig. 10.2 is generated.
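A minimal Python sketch of mkleaf and mknode is given below; it is only an illustration of the idea, assuming nodes are tuples and using a table of already-built nodes so that repeated subexpressions are shared, which turns the parser tree into the acyclic directed graph.

```python
# Node-constructing functions with sharing: identical (op, left, right)
# combinations are built only once, so the result is a DAG rather than a tree.
built = {}                           # construction key -> node

def mkleaf(kind, value):
    key = (kind, value)
    return built.setdefault(key, key)

def mknode(op, left, right):
    key = (op, id(left), id(right))
    return built.setdefault(key, (op, left, right))

# a := (b + c) * (b - c) - (1/3) * (b * c)
a, b, c = mkleaf("id", "a"), mkleaf("id", "b"), mkleaf("id", "c")
plus  = mknode("+", b, c)
minus = mknode("-", b, c)
prod1 = mknode("*", plus, minus)
third = mkleaf("num", 1/3)
prod2 = mknode("*", b, c)
root  = mknode(":=", a, mknode("-", prod1, mknode("*", third, prod2)))
# Requesting (b + c) again returns the very same node, not a fresh one:
assert mknode("+", b, c) is plus
```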
There are two ways to represent the parser tree of Fig. 10.2 (a), as shown in Fig. 10.4.
In Fig. 10.4 (a), each interior node is a record in which the leftmost field is the operator and the last two fields are pointers to the left child and the right child of the operator, respectively. Leaf nodes contain two fields: one is the type and the other is the name, or the value if it is a number. In Fig. 10.4 (b), the nodes are allocated from a record array, and the index or position of a node in the array is taken as the pointer to the node. Starting from the root node located at position 15 and following the pointers, all the nodes of the parser tree may be visited.
Fig. 10.4 Two representations of the parser tree of Fig. 10.2 (b).
For example, consider the parser tree in Fig. 10.2 (b) that corresponds to the expression
(b + c) × (b − c) − (b × c)/3.
The binary tree that represents the expression is shown in Fig. 10.6. If we traverse it with preorder, we get
− × + b c − b c × (1/3) × b c,
where each operator occurs before its operands, and so it is called preorder. Preorder is also called Polish notation, as the inventor of the notation, Łukasiewicz, was Polish. If we traverse it with inorder, what we get is just
(b + c) × (b − c) − (b × c)/3,
where the parentheses indicate the priority of the operators; it is just the original expression. Since the operators are always located between their operands, it is called inorder.
Finally, if we traverse it with postorder, we have
b c + b c − × (1/3) b c × × −,
where each operator occurs after its operands, hence the name. It is also called reverse Polish notation. For the assignment statement of Fig. 10.2, whose right-hand side is this expression, the postorder form is
a b c + b c − × (1/3) b c × × − := .
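For concreteness, the three traversal orders can be sketched in a few lines of Python (an illustration only; the tuple encoding of the tree is our own assumption):

```python
# An expression tree as nested tuples: (op, left, right) for operators, strings for leaves.
tree = ("-",
        ("*", ("+", "b", "c"), ("-", "b", "c")),
        ("*", "1/3", ("*", "b", "c")))

def preorder(t):
    return [t] if not isinstance(t, tuple) else [t[0]] + preorder(t[1]) + preorder(t[2])

def inorder(t):
    return [t] if not isinstance(t, tuple) else inorder(t[1]) + [t[0]] + inorder(t[2])

def postorder(t):
    return [t] if not isinstance(t, tuple) else postorder(t[1]) + postorder(t[2]) + [t[0]]

print(" ".join(preorder(tree)))    # - * + b c - b c * 1/3 * b c
print(" ".join(inorder(tree)))     # b + c * b - c - 1/3 * b * c   (parentheses are lost)
print(" ".join(postorder(tree)))   # b c + b c - * 1/3 b c * * -
```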
Postorder is very suitable for a stack machine, which also works with the postorder form of expressions. The so-called stack machine is a machine that uses a stack to store and operate on values; it has no registers. It has two kinds of instructions: one kind copies or moves values between the top of the stack and other places; the other kind performs operations upon the element on the top of the stack and other stack elements. The kind of machine that contrasts with this one is the pure register machine, which has one memory (in which values are stored) and a set of registers. This kind of machine also has two types of instructions: one kind copies values between memory and registers; the other kind performs operations upon two values in registers and then stores the result into one of the registers.
In order to generate code for the stack machine, the only thing that needs to be done is to read in the expression in postorder and then perform the following actions (a sketch of such a generator is given after the next paragraph):
1) when a variable is met, generate code that pushes its value onto the top of the stack;
2) when a binary operator is met, generate code that performs the operation of the operator upon the two top elements of the stack and replaces the second element from the top with the result (the top of the stack is popped);
3) when a unary operator is met, generate code that performs the operation of the operator upon the top element of the stack and then replaces the top element with the result.
All these actions are based on the assumption that we can distinguish unary operators from binary operators. The code that pushes the value of a variable onto the top of the stack will include the computation that transforms the variable address of the compile-time form
(number of block layer, offset)
into the real address at run time.
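The sketch below illustrates such a generator in Python; the instruction names PUSH, UNOP and BINOP and the operator tables are assumptions for the purpose of illustration, not a fixed instruction set. It reads the expression in postorder and emits one instruction per symbol, following rules 1)–3) above.

```python
BINARY = {"+", "-", "*", "/"}
UNARY = {"neg"}

def gen_stack_code(postfix_symbols):
    """Translate a postorder expression into stack-machine instructions."""
    code = []
    for sym in postfix_symbols:
        if sym in BINARY:
            code.append(("BINOP", sym))   # operate on the two topmost elements,
                                          # pop them and push the result
        elif sym in UNARY:
            code.append(("UNOP", sym))    # operate on the topmost element in place
        else:
            code.append(("PUSH", sym))    # a variable or a constant: push its value
    return code

# b c + b c - *  corresponds to (b + c) * (b - c)
for instr in gen_stack_code(["b", "c", "+", "b", "c", "-", "*"]):
    print(instr)
```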
−a = 3,
a + b = 5,
b + 1 = 3,
a and b = 2.
The expression
(b + c) × (b − c) − (b × c)/3
is translated into the following quadruples:
b + c = 1,    or expressed as    (+ b c) = 1,
b − c = 2,    (− b c) = 2,
1 × 2 = 3,    (× 1 2) = 3,
b × c = 4,    (× b c) = 4,
#(1/3) × 4 = 5,    (× #(1/3) 4) = 5,
3 − 5 = 6,    (− 3 5) = 6,
where the integers correspond to the temporary names which the compiler assigns to the intermediate results, and constants are preceded by #. This code is also called three-address code, in which two addresses are used for the operands and the other address is used for the result. We assume here that the operators are binary; a unary operator can be considered as a special case of a binary one, needing only one operand address and one result address, so if the operator is unary, the second operand is empty. Quadruples may be replaced by triples, or two-address code. Each triple consists of two operand addresses and an operator. Obviously, the operands of the two-address code are the same as the operands of the quadruples. Apart from being variables and constants, the operands may be the indices of other triples. For example, the
triple code of
(b + c) × (b − c) − b × c/3
is
position triple
1 b+c
2 b−c
3 @1 × @2∗
4 b×c
5 (1/3) × @4
6 @3 − @5
Note∗: we have used @ to precede an index referring to a triple, so numbers need not be preceded by # again. In the following, we present a triple code from which one may easily derive the computation it performs.
position triple
1 a+1
2 @1 × @1
3 b + @2
4 @3 + c
5 @4 =: x
6 @1 + 1
7 d × @6
8 @7 =: y
The computation performed is
x := b + (a + 1)² + c,
y := d × (a + 2).
Quadruples and triples are widely used as intermediate code. Therefore, we should transform the parser tree generated by the analytical phase into quadruple form; only then can the target code that is equivalent to the source program, i.e., the target code expected by the source code, finally be formed.
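As an illustration of how quadruples can be produced from a parser tree, the short Python sketch below walks the tree bottom-up and numbers each intermediate result, in the style of the examples above (the tuple encoding of the tree and the function name gen_quads are our own assumptions):

```python
def gen_quads(node, quads):
    """Return the 'address' of node's value, emitting quadruples into quads."""
    if not isinstance(node, tuple):          # a variable or a constant
        return node
    op, left, right = node
    l = gen_quads(left, quads)
    r = gen_quads(right, quads)
    quads.append((op, l, r))                 # (operator, operand1, operand2)
    return len(quads)                        # intermediate results are referred to by number

quads = []
# (b + c) * (b - c) - (1/3) * (b * c)
expr = ("-",
        ("*", ("+", "b", "c"), ("-", "b", "c")),
        ("*", "#(1/3)", ("*", "b", "c")))
gen_quads(expr, quads)
for i, q in enumerate(quads, start=1):
    print(i, q)
# 1 ('+', 'b', 'c')
# 2 ('-', 'b', 'c')
# 3 ('*', 1, 2)
# 4 ('*', 'b', 'c')
# 5 ('*', '#(1/3)', 4)
# 6 ('-', 3, 5)
```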
Now we analyze the quadruple form of intermediate code generated by
various statements:
1. Assignment statement
The assignment statement has the following form:
v := e
where v is a variable, and e is an expression. We have introduced the quadruple form of intermediate code previously in the chapter. Now the issue is the variable v, which may be a simple variable or an indexed variable. If it is a simple variable, then the intermediate code of the assignment statement is
<v := e> ≡ the quadruple code that evaluates e, followed by (:= v <<e>>),
where <e> denotes the address that stores the value of the expression, so <<e>> denotes the value of the expression.
2. The address evaluation for the elements of the arrays
We have mentioned that in the assignment statement the variable on the left-hand side may be an indexed variable. If that is the case, then the assignment statement needs to assign the value of the expression on the right-hand side to the address of the indexed variable, so it involves evaluating the address of the indexed variable from its index. Besides, in the evaluation of the expression of the assignment statement, indexed variables may also be involved, hence we also need to evaluate their addresses from their indices. Therefore, addressing array elements is an issue which the compiler cannot avoid.
Arrays may be one-dimensional, two-dimensional, or of even higher dimension. For the simplest, one-dimensional case, the general form is
array (num) of T,
where num indicates the number of elements. The count may start from 0; if so, the real number of elements is num + 1; if the count starts from 1, then the number is just num itself. T is the type of the elements; it may be integer or float, etc. The type determines the length or width of the elements, which is denoted by w.
In order to make access to the elements more convenient, in general the elements of the array are allocated in a contiguous block, sitting one by one, and the first element is put at the start location, which is called the base address. In order to make the number of elements more explicit, num is written as low..up, that is,
array (low..up) of T.
Suppose that the ith element is to be visited; then it needs to be checked whether i satisfies
low ≤ i ≤ up.
If the condition does not hold, the element which we want to access is not within the range, and the compiler reports an error. If the condition holds, and the width of each array element is w, then the ith element of the array begins at location
base + (i − low) × w,
where low is the lower bound of the subscript and base is the relative address of the storage allocated for the array. That is, base is the relative address of A[low], where A is the name of the array.
or simply
(× i w) = 1,
(+ c 1) = 2.
The last quadruple implies that base − low × w has already been saved in c. This handling of c is called compilation-time precalculation.
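A small Python sketch of this address computation (purely illustrative, with assumed values for base, low, up and w) shows the compile-time precalculation of c and the part left for run time:

```python
# array (low..up) of T, element width w, allocated contiguously from `base`.
base, low, up, w = 1000, 1, 10, 4

# Compilation-time precalculation: c = base - low * w
c = base - low * w

def element_address(i):
    if not (low <= i <= up):
        raise IndexError("subscript out of range")   # the compiler-inserted range check
    # Run-time part: only i * w remains to be computed, then added to c.
    return c + i * w                                  # equals base + (i - low) * w

print(element_address(1))   # 1000: the first element sits at the base address
print(element_address(4))   # 1012
```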
This practice can also be applied to the address calculation for elements of multi-dimensional arrays. A two-dimensional array has the following form:
array A(l1..u1, l2..u2),
for example, array A(1..2, 1..3) of integer. This declares that A is an array of integers with 2 rows and 3 columns; there are 6 elements in A. A two-dimensional array is normally stored in one of two forms, either row-major, i.e., row by row, or column-major, i.e., column by column. Fig. 10.6 shows
where, once again, w is the width of an element. The calculation can also be partially evaluated if it is rewritten appropriately: Expression (10.5) can be split into two parts, a part fixed at compilation time and a part that changes as the subscripts i, j and k change.
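As a small worked illustration of the row-major calculation (the bounds, base address and element width here are assumed values, not taken from the text): for array A(1..2, 1..3) of integer with w = 4 and base = 100, we have n2 = 3, and the usual row-major formula base + ((i − l1) × n2 + (j − l2)) × w places A[2, 3] at relative address 100 + ((2 − 1) × 3 + (3 − 1)) × 4 = 100 + 5 × 4 = 120, the last of the six elements 100, 104, . . . , 120.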
We can generalize the row-major (or column-major) form to multi-dimensional arrays. The generalization of the row-major (column-major) form is to store the elements in such a way that, as we scan down a block of storage, the rightmost (leftmost) subscripts appear to vary fastest. Expression (10.3) generalizes to the following expression for the relative address of A[i1, i2, . . . , ik]:
(base − ((. . .((l1 × n2 + l2) × n3 + l3) . . .) × nk + lk) × w) +
((. . .((i1 × n2 + i2) × n3 + i3) . . .) × nk + ik) × w,   (10.6)
where, for all j, nj = uj − lj + 1; since the bounds are assumed fixed, the first term of the expression (10.6) can be computed at compilation time and saved with the symbol table entry for A.
However, the code for the computation of array addresses can also be generated during syntactical analysis. In order to do so, the production rules of the grammar need to be transformed; that is, actions are inserted into the production rules so that the corresponding quadruples are generated. We illustrate the procedure with the grammar of expressions. The production rules for expressions are as follows:
S → EXP,
EXP → TERM |
      EXP + TERM,
TERM → FACT |
       TERM × FACT,
FACT → −FACT |
       ID |
       (EXP).
In order to describe the actions, we need a stack, kept as an array of records. Each of its entries contains an integer or a character (when quad = true it stores qno, the number of a quadruple; otherwise it stores idop, an operator or identifier). ptr is the stack pointer. quadno is a variable whose value is the number of the quadruple allocated most recently. Initially the values of both ptr and quadno are 0. Suppose that the character (or symbol) just read is kept in “in”. The following are the production rules with their attached actions:
1) S → EXP {ptr := ptr − 1 (the stack grows bottom-up)}
2) EXP → TERM |
   EXP + {ptr := ptr + 1;           (as explained above: if quad = true
          stack[ptr].quad := false;  the entry stores qno,
          stack[ptr].idop := in}     otherwise it stores an operator)
   TERM {for (i := ptr − 2; i <= ptr; i++)
           if stack[i].quad then
             emit(stack[i].qno)
           else emit(stack[i].idop);
         quadno := quadno + 1;
         emit(‘=’, quadno);
         (/* print ‘=’ and the quadruple number */)
         ptr := ptr − 2;
         stack[ptr].quad := true;
         stack[ptr].qno := quadno;}
3) TERM → FACT |
   TERM × {ptr := ptr + 1;
           stack[ptr].quad := false;
           stack[ptr].idop := in;}
   FACT {for (i := ptr − 2; i <= ptr; i++)
           if stack[i].quad then
             emit(stack[i].qno)
           else emit(stack[i].idop);
         quadno := quadno + 1;
         emit(‘=’, quadno);
         ptr := ptr − 2;
         stack[ptr].quad := true;
         stack[ptr].qno := quadno;}
4) FACT → − {ptr := ptr + 1;
             stack[ptr].quad := false;
             stack[ptr].idop := in}
   FACT {for (i := ptr − 1; i <= ptr; i++)
           if stack[i].quad then
             emit(stack[i].qno)
           else emit(stack[i].idop);
         quadno := quadno + 1;
         emit(‘=’, quadno);}
   | ID {ptr := ptr + 1;
         stack[ptr].quad := false;
         stack[ptr].idop := in}
| (EXP)
5) ID → a | b| c | d |. . . |z
For the places in the productions above without any attached action, no action is needed.
Now, in order to do the same for arrays, that is, to attach actions to the production rules so that the quadruple form is also generated when syntactical analysis is performed on array accesses, we need to set up the production rules for arrays. First, in order that the elements of an array may be visited wherever id appears in the expression grammar, we introduce a nonterminal called N to replace id. We then have
N → id[Elist] | id,
Elist → Elist, E | E.
However, in order to make use of the bounds of the various dimensions when we combine the subscript expressions to form Elist, we attach the array name to the leftmost subscript expression when forming N, rather than connecting it with Elist; in this way the inserted actions can be written more easily. Therefore, the productions above are rewritten as
N → Elist] | id,
Elist → Elist, E | id[E.
In this way, the pointer to the symbol table entry of the array may be transmitted as a synthesized attribute of Elist.
After rewriting the production rules above, we may now define the grammar with arrays, in which an array element may occur either on the left-hand side or on the right-hand side of an assignment. The production rules are as follows:
1) S → N := E;
2) E → E + E;
3) E → (E);
4) E → N;
5) N → Elist];
6) N → id;
7) Elist → Elist, E;
8) Elist → id[.
In order to attach actions to these production rules, we need to explain the
symbols to be used. First, Elist.ndim records the number of dimensions in
Elist, i.e., the number of subscript expressions. The function limit(array, j)
returns the value of nj , the number of elements in the j-th dimension of the
array pointed to by ‘array’, taken from its symbol table entry. Finally,
Elist.place denotes the temporary that stores the value computed from the
subscript expressions.
When accessing the array element A[i1 , i2 , . . . , ik ] [3], the actions in the
production rules use the following recurrence relations:
e1 := i1 ,
em := em−1 × nm + im , m = 2, . . . , k (10.7)
to compute, for the first m indices of a k-dimensional array, the value
(. . . ((i1 × n2 + i2 ) × n3 + i3 ) . . .) × nm + im . (10.8)
Thus, when m = k, a multiplication by the width w is all that is needed
to compute the second term of Expression (10.6). Note that the ij ’s here may
themselves be values of expressions. The first term of Expression (10.6) is fixed
and is stored as an entry in the symbol table of array A.
In passing, we note that the grammar above is ambiguous and contains left
recursion. We may transform it to remove the left recursion, obtaining the
following grammar:
1) S → N:=E;
2) E → (E);
3) E → N;
4) E → (E)A;
5) E → NA;
6) A → +E;
7) A → +EA;
8) N → Elist ];
9) N → id;
10) Elist → id[;
11) Elist → id[E B;
12) B →, E;
13) B →, E B.
Notice that this is not yet an LL(1) grammar, as several productions with
the same left part begin with common symbols. It is not difficult to transform
it into LL(1) form, but we do not do so here.
Strictly speaking, we should insert the quadruple-generating actions into
these production rules. However, since the address computation for arrays
does not, in general, cause any ambiguity, for simplicity we still attach the
actions to the original productions:
1) S → N:=E
   {if N.offset=null then /* N is a simple id */
       emit(N.place ‘:=’ E.place);
    else
       emit(N.place ‘[’ N.offset ‘]’ ‘:=’ E.place)}
2) E → E1 +E2
   {E.place:=newtemp;
    emit(E.place ‘:=’ E1.place ‘+’ E2.place)}
3) E → (E1 )
   {E.place:=E1.place}
We have mentioned before that E1 and E2 are not new nonterminals; they
are actually E itself. We use the subscripts only to distinguish them from the
E on the left-hand side.
4) E → N
   {if N.offset=null then /* N is a simple id */
       E.place:=N.place
    else begin
       E.place:=newtemp;
       emit(E.place ‘:=’ N.place ‘[’ N.offset ‘]’)
    end}
In the case of an indexed variable, we want the value of some element of
the array; N.place[N.offset] is just the value of the element to which the
index corresponds.
5) N → Elist ]
{N.place:=newtemp;
N.offset:=newtemp;
emit(N.place‘:=’c(Elist.array));
emit(N.offset‘:=’Elist.place‘*’width(Elist.array))}
Here N.offset is a new temporary that represents the second term of
Expression (10.6); the function width(Elist.array) returns w in Expres-
sion (10.6). N.place represents the first term of Expression (10.6), returned
by the function c(Elist.array).
6) N → id
{N.place:=id.place;
N.offset:=null}
A null offset indicates a simple name, so this corresponds to a simple
variable.
7) Elist → Elist1 , E
   {t:=newtemp;
    m:=Elist1.ndim+1;
    emit(t ‘:=’ Elist1.place ‘∗’ limit(Elist1.array, m));
    emit(t ‘:=’ t ‘+’ E.place);
    Elist.array:=Elist1.array;
    Elist.place:=t;
    Elist.ndim:=m}
Here the actions are produced using the recurrence relation above, where
Elist1 .place corresponds to em−1 of Expression (10.7) while Elist.place corre-
sponds to em .
8) Elist → id[E
{Elist.array:=id.place;
Elist.place:=E.place;
Elist.ndim:=1}
Here E.place holds both the value of expression E and the value of Expres-
sion (10.8) for m = 1.
quadruple code              three-address code
y ∗ 25 = (1)                t1 := y ∗ 25
(1) + z = (2)               t2 := t1 + z
baseA − #104 = (3)          t3 := baseA − #104
(2) ∗ #4 = (4)              t4 := t2 ∗ #4
(3)[(4)] = (5)              t5 := t3[t4]
(5) := x                    x := t5
u:=newtemp;
emit(u ‘:=’ ‘inttoreal’ E2.place);
emit(E.place ‘:=’ E1.place ‘real+’ u);
E.type:=real
end
else
E.type:=type-error;
We have only presented the conversions between integers and reals for addition
within assignments; for subtraction, multiplication and division the cases are
similar. The semantic action stated above uses two attributes, E.place and
E.type, for the nonterminal E. As the number of types subject to conversion
increases, the number of cases that arise grows quadratically (or worse, if
there are operators with more than two arguments). Therefore, with large
numbers of types, careful organization of the semantic actions becomes more
critical.
For example,
x:=y × z+a × b,
where x, y, and z are reals while a and b are integers. Then the output of the
intermediate code is as follows:
t1 :=a int × b,
t2 :=y real × z,
u:= inttoreal t1 ,
t2 :=t2 real+u,
x:=t2 .
4. Conditional statements
Previously, we have encountered the problem regarding the generation of
intermediate codes of conditional statements. Now we will discuss it again in
more detail.
The form of conditional statements is as follows:
if e then S1 else S2
To execute the statement, the value of the expression e is calculated first;
hence there will be a series of quadruples for this calculation, which for
simplicity we denote <e>. Then the value of e is tested. If it is true, the
statement (or statements) represented by S1 is (or are) executed; these follow
the test of e, and when their execution finishes control exits from the
sequence. Otherwise, if the value of e is false, the statement (or statements)
represented by S2 is (or are) executed. Therefore, the quadruple code that
corresponds to the conditional statement is
<e>                  (a series of quadruples that evaluates the expression e)
(then, <<e>>, 0, 0)  (if e is true, sequentially execute the following statement)
<S1 >                (the code of the statement S1 )
(goto, , , )         (branch address after S1 is executed)
(else, , , )         (if e is false, branch to here)
<S2 >                (the code of the statement S2 )
(if end, , , )       (the end of the conditional statement).
Example 10.2 Suppose that a nested conditional statement is
if E1 then S1 else
if E2 then S2 else S3
5. Loop statements
Consider the loop statement of while type
while B do S
The treatment is similar to that of the conditional statement. First, B has
to be evaluated. We introduce a label L to denote the branch address,
writing L:<B>, where L represents the first address of the code that
evaluates B. Hence we have
L: <B>
(while, <<B>>, 0, 0)
<S>
(goto, L, 0, 0)
(wfend, 0, 0, 0) (when B is false, branch to the
end of while statement)
Another form of the loop statement is
for (i = a; i ≤ b; i++)
do S
(:=, i, a, 0)
L: <S>              (the intermediate code sequence, where i should be replaced
                     by its current value and L represents the first quadruple)
(+, i, 1, i)
(≤, i, b, 0)
(goto, L, 0, 0)     (when i ≤ b, branch back to continue the execution of S)
(forend, 0, 0, 0)   (when the bound has been exceeded, the loop ends)
6. Procedural statements [5]
We use the following grammar to represent the procedure calls
S → call id(Elist)
Elist → E
Elist → E, Elist
where there should be other production rules corresponding to the nonter-
minal E that generate arithmetic or boolean expressions. Since we are only
concerned with the generation of intermediate code for procedure calls, we
omit those productions here.
S → call id(Elist)  {evaluate every E in the Elist queue;
Elist → E            then execute the assignment E.place := <E>;
Elist → E, Elist     then execute emit(‘actpar’, E.place);
                     emit(‘call’, id.place, n)}
For example, if the procedure statement is
G(E1 , E2 , E3 , E4 , E5 )
Then the quadruple code is as follows:
E1.place := <E1>   (<E1> represents the code for evaluating E1 ; the
                    assignment indicates that the value evaluated is
                    stored in E1.place, and similarly for the following)
E2.place := <E2>
E3.place := <E3>
E4.place := <E4>
E5.place := <E5>
(actpar, E1.place, 0, 0)
(actpar, E2.place, 0, 0)
(actpar, E3.place, 0, 0)
(actpar, E4.place, 0, 0)
(actpar, E5.place, 0, 0)
(call, G.place, 0, n)
where actpar is used specifically to indicate an actual parameter; it can also
indicate the mode of transfer of the actual parameters. The final (call, G.place, 0,
n) implements the procedure call, where n indicates the number of parameters.
Example 10.3 Given a procedure statement g(x × 3, y+z), the quadruple
code that corresponds to the statement is
(×, 3, x, T1 )
(+, y, z, T2 )
(actpar, T1 .place, 0, 0)
(actpar, T2 . place, 0, 0)
(call, g.place, 0, 2)
Example 10.4 Suppose that there is a procedure statement
G(x × 3, g(x+2)× 2)
(∧ a b)=1
(= x 0)=2
(∧ 2 d)=3
(∨ b 3)=4
(∧ c 4)=5
(∨ 1 5)=6
8. Switch statements
The “switch” or “case” statement is available in a variety of languages.
It provides the possibility of a multi-way choice on a condition. For example,
suppose that in a competition four awards are set up: the champion, the
runner-up, the third place and the rearguard. These four will be awarded
different levels of bonus. This can be processed with a case statement with
five cases that correspond to first, second, third, fourth, and a last one that
receives no bonus. There are several variants of the switch statement; the
following is its general form:
switch E
begin
case V1 : S1
case V2 : S2
...
case Vn−1 : Sn−1
default: Sn
end
There is a selector expression which is to be evaluated, followed by n constant
values that the expression might take, including a default “value” that always
matches the expression if no other value does. The intended translation of a
switch statement is as follows [6]:
1) Evaluate the expression.
2) Find which value in the list of cases is the same as the value of the
expression. Recall that the default value matches the expression if none of
the values explicitly mentioned in cases does.
3) Execute the statement Si associated with the value found Vi .
According to the requirements of execution, there are two ways to generate
the intermediate code for the statement. The first is that after the value of e
is evaluated, it is stored in, say, location t; the code then branches to the
tests on t, and according to the result of the tests, statement Si is selected
and executed.
<the intermediate code of evaluation of e>
t:= <e>
goto test
L1 : intermediate code of S1
goto exit
L2 : intermediate code of S2
goto exit
...
Ln−1 : intermediate code of Sn−1
goto exit
Ln : intermediate code for default value
goto exit
test: if t=V1 then goto L1
if t= V2 then goto L2
...
if t=Vn−1 then goto Ln−1
goto Ln
exit:
The other method is that after E is evaluated and stored in t, the tests are
done successively. First, check whether t is equal to V1 ; if it is, execute the
corresponding statement S1 ; otherwise check whether it is equal to V2 , and
so on, until a matching value is met, or t is not equal to any value (in this
case it is regarded as matching the default Vn ). Note that when V1 , V2 , . . . ,
Vn−1 are arranged, those Vi ’s that the expression is more likely to take should
precede the less likely ones, so that the efficiency is higher. The intermediate
code is as follows:
<the intermediate code of the evaluation of e>
t := <e>
      if t ≠ V1 goto L1
      the intermediate code of <S1 >
      goto exit
L1 :  if t ≠ V2 goto L2
      the intermediate code of <S2 >
      goto exit
L2 :
...
Ln−1 : if t ≠ Vn−1 goto Ln
      the intermediate code of <Sn−1 >
      goto exit
Ln :  the intermediate code of the default
exit:
So far, for almost all the statements of programming languages, we have shown
how to generate the corresponding intermediate code. One more thing should
be noted: we have not considered how to make the generated code more
efficient, for example how to make it shorter or how to use less storage. We
have considered only its implementation.
Problems
References
[1] Davidson JW, Fraser CW (1984) Code selection through object code opti-
mization, TOPLAS 6(4): 505 – 526.
[2] Tanenbaum AS, van Staveren H, Keizer EG, et al (1983) A practical tool for
making portable compilers. Comm. ACM 26(9): 654 – 660.
[3] Laverett TW, Cattell RGG, Hobbs SO, et al (1980) An overview of the
production-quality compiler-compiler project. Computer 13(8): 39 – 40.
[4] Fraser CW, Hanson DR (1982) A machine-independent linker. Software —
Practice and Experience 12, pp 351 – 366.
[5] Nori KV, Ammann U, Jensen K, et al (1981) Pascal implementation notes
in Barron, pp 125 – 170.
[6] Nawey MC, Waite WM (1985) The robust implementation of sequence-
controlled iteration. Software — Practice and Experience 15(7): 655 – 668.
Chapter 11 Debugging and Optimization
The errors that occur in programs are mainly of three types: mistyping or
misspelling in the input, syntax errors, and semantic errors.
Mistyping and misspelling often happen in the input, especially when the
person who types the program is not its writer, or is not good at English;
then the error rate over the whole program can be high.
We can classify the errors into two kinds. One kind consists of isolated errors
that have limited impact on the whole program, and the other consists of global
errors,
however, it does not affect much. For example, the expression
x := a/((b + c) − d ∗ (e + f))
is mistakenly written as
x := a/((b + c) d ∗ (e + f)).
Through analysis, the mistake can be found easily, so this still belongs to the
isolated kind of mistake.
However, if in the expression above the right parenthesis after (b+c) becomes
a left parenthesis, the mistake has a global effect. Obviously, we can find out
that there is a mismatching of parentheses, but it is hard to deal with it prop-
erly. Of course, through careful analysis we can guess that the left parenthesis
that follows c should be a right parenthesis. After we spot it and correct it,
the problem can be solved. We also find that the multiplication of
−d and (e+f) has a ∗ in between, while the multiplication of c and (−d ∗ (e + f))
has no corresponding ∗. However, since such an inconsistency is not always
regarded as a problem, it causes difficulty for checking.
In current fast interactive systems, usually when a mistake is found during
analysis, the procedure immediately stops and informs the user or programmer
of the mistake, letting him/her make the correction. Once this is done, the
procedure resumes. However, from the standpoint of the user or programmer,
he/she obviously wants to know more about the mistakes; it is most welcome
if all the mistakes are found. Therefore, he/she wants the checking procedure,
after a problem is found, to continue its work until it really cannot go further
or has found all the problems. When a mistake really affects the global
situation, the procedure cannot do anything more unless the program has been
recovered from the previous mistake, even if the recovery is only presumed.
Therefore, there are two strategies for recovery from mistakes. One is the
correction of mistakes. This strategy makes continued analysis possible
through modification of the input symbol stream or of the internal states of
the parsing program. But the strategy is very prone to make the analysis
deviate and to yield many spurious error messages.
The other strategy is called non-correction of mistakes. It does not modify
the input stream, but deletes all the information generated by the analysis
program so far and uses the “remaining” program to continue the analysis of
the remaining part of the input. If the analysis succeeds, there is no further
mistake; otherwise, there must be other mistakes, which means that a correct
parsing tree cannot be generated.
Actually, when we introduced the parsing of LL(1) and LR(1) grammars, we
already considered the cases in which mistakes occur, and we correspondingly
established mechanisms for detection and recovery. Considering the situations
in which mistakes exist makes the compiler run more friendly towards users.
P → D; S
D → D; D| id: T
T → integer| real| char| boolean| array[num] of T| ↑ T
We postulate that when no type error takes place, the type checking program
returns void, and void is idempotent, i.e., void void = void; hence any number
of void’s linked together equals a single void. But if a type error is found, a
type-error message is immediately issued. Hence we have the type grammar
with the semantic action rules inserted.
P → D; S {P.type := if S.type = void then void
           else type-error (notice that there may be
           more than one S.type)}
D → D; D
D → id: T {addtype(id.entry, T.type)}
T → integer {T.type := integer}
T → real {T.type := real}
T → char {T.type := char}
T → boolean {T.type := boolean}
T → array[num] of T1 {T.type := array[1..num.val, T1.type]}
We now consider the handling of and recovery from errors for an LL(1)
parser. In a previous chapter, we discussed which cases are regarded as errors
in an LL(1) parser. Here what we want to discuss is how to deal with an error
once it happens. We are mainly concerned with two points:
1) How should we avoid an infinite loop? An infinite loop would make the
analysis procedure run without termination.
2) How do we avoid generating a parsing tree with errors? If the parsing tree
generated contains errors, the goal expected of the compilation naturally
cannot be achieved.
Therefore, we need a good strategy: avoid the infinite loop by removing at
least one input symbol, and avoid generating a parsing tree with errors by not
throwing away the next expected symbol and by inserting a symbol where
necessary. The expected symbols are the sequence of symbols that the LL(1)
parser predicts will match the input stream. We put these expected symbols
into a stack called the guess stack.
The allowable set method is a framework that systematically constructs a
safe method for recovery from errors. The key to the method is to construct
the allowable set; the following three steps are included and are executed
when an error is found:
1) Using some appropriate algorithm, construct the allowable set A
according to the state of the parser, where A should contain the end-of-file
symbol (EOF) and the symbols in the follower sets of the nonterminals.
2) Throw away symbols from the input stream that are not acceptable to
A until the first symbol tA that is acceptable to some symbol in A is reached.
3) Using a suitable algorithm, make the parser go on so that after tA is
processed, the guess stack and the input stream can advance simultaneously.
There is an improved version of this method. It also contains three steps
(a sketch is given after this list):
1) Construct the allowable set.
2) Skip over the non-acceptable symbols. In this way, zero or more symbols
will be thrown away in sequence until a symbol in the allowable set is met.
As the symbol EOF is always in A, this skipping step always terminates.
3) Once again, make the guess stack and the input stream advance simul-
taneously.
Then let the modified parser continue the analysis. It first attempts the
normal prediction or matching shift. If this succeeds, the parser runs normally
again. If the shift fails, then for a nonterminal at the top of the stack it
predicts the shortest candidate production, and for a terminal, the top symbol
of the guess stack is inserted before the input. Step 3) is then repeated until
the shift succeeds. In this way, the parser is once again made to proceed.
In LR(1) analysis, recovery from errors is rather difficult, as most of the
information the parser collects has the character of an assumption. When
a syntax error is found, the LR(1) parser is in some state sx , the current
input is tx , and the entry in the analysis table that corresponds to (sx , tx ) is
empty; this corresponds to an error. In order to recover from errors, we need to
select a nonterminal as the error-recovery nonterminal. We denote it R and
add a candidate form erroneous R to it. Suppose that the original items
are
N → α . Rβ
R → . GHI
Now we add
R → . erroneous R.
The pseudo-terminal erroneous R represents a dummy node that is allowed as
a replacement for R. The process of recovery from errors starts by popping
the elements from the top of the stack one by one, until an error-recovery
state is found. Suppose that the error-recovery state is sv . Because we
constructed the dummy node erroneous R for R, the entry that corresponds
to (sv , erroneous R) cannot be empty, and it is just the symbol that is
allowed by sv .
We denote the resulting state tz . Once again, we remove input symbols one by
one until a symbol that is in the allowable set of tz is found. The purpose
of this process is to remove the remaining part of the production of R from
the input, in order to avoid repeating the loop. But this measure may
not succeed in avoiding the generation of a parsing tree with errors. The
error-recovery algorithm for LR(1) is very complicated; we do not discuss it
further here.
The lexical error checks and syntax error checks that we discussed previously
are all carried out before the source program runs (actually it is not yet able
to run); hence they may be called static checks. Semantic checking is mainly
dynamic checking, because the following errors cannot be found by static
checks.
1) Division by zero; for example in a/b, b may take the value zero at run
time.
techniques [5]:
The criteria of local optimization are:
1) An optimization step must preserve the purpose of the program. That
is, an optimization must not change the output produced by a program for a
given input. If the so-called optimization causes an error, such as a division
by zero, that is not optimization at all, it is simply a deterioration.
2) An optimization should make the program more efficient. That means
it may run faster or take less space. If it cannot make an improvement in
either of these two aspects, we cannot call it an optimization either. At the
least, we should improve speed at the price of space, or vice versa.
3) An optimization must be worth the effort. The result of the optimization
should justify the effort; that is, what we gain from the optimization should
outweigh what it costs us. Otherwise, if we make a big effort but get only a
little benefit, why should we pay the cost?
The following are techniques that can be adopted in local optimization.
Pre-processing of expressions
(1) Constant folding [6].
For expressions, the most widely used pre-processing methods for opti-
mization are constant folding and arithmetic simplification. Constant folding
is the traditional term for the compile-time evaluation of constant expres-
sions. For example, most compilers will translate the following function
char lower-case-from-capital (char ch) {
return ch+(‘a’-‘A’);
}
into
char lower-case-from-capital (char ch) {
return ch+32;
}
as ‘a’ has integer value 97 and ‘A’ has integer value 65 in ASCII.
Constant folding is one of the simplest and most efficient optimizations.
Although the programmer rarely writes constant expressions directly, they
may come from character constants, macros, symbolic constants and
intermediate code generation.
(2) Arithmetic simplification
The arithmetic simplification method replaces higher-cost arithmetic opera-
tions with lower-cost ones; this is where the profit comes from. The possible
substitutions are listed as follows:
Operation → Substitution
E ∗ 2∗∗n → E << n
2 ∗ V → V + V
3∗V → (V<<1)+V
V ∗∗ 2 → V∗V
E+0 → E
E∗1 → E
E ∗∗ 1 → E
1 ∗∗ E → 1
In this table, E stands for a (sub)expression, V stands for a variable, <<
stands for the left-shift operator, and ∗∗ stands for the exponentiation
operator. It is assumed that the cost of multiplication is higher than that of
addition and shifting, but lower than that of exponentiation, which is true for
most computers. Replacing an operation of higher cost with one of lower cost
is called strength reduction. If an operation can be removed entirely, the
transformation is called nullification. A sketch of such a rewriter is given
below.
The following is a fragment of a Java program that draws 3D text; it will be
used to show the optimization of a program.
// decide where to place the text...
if (!isStyleSet(CAPTION)) {
    switch (text_placement) {
    case CENTER:
        xx = (bounds.width/2) - (fm.stringWidth(text)/2);
        yy = (bounds.height/2) - (fm.getHeight()/2);
        break;
    case LEFT:
        xx = thickness + TEXT_OFFSET;
        yy = (bounds.height/2) - (fm.getHeight()/2);
        break;
    case RIGHT:
        xx = bounds.width - thickness - TEXT_OFFSET - fm.stringWidth(text);
        yy = (bounds.height/2) - (fm.getHeight()/2);
        break;
    }
}
else {
    int spacer = fm.charWidth('i');
    xx = thickness + TEXT_OFFSET + spacer;
    yy = 0;
    // fill a rectangle in bounding space of string...
    g.setColor(getBackground());
    g.fillRect(xx, yy,
        fm.stringWidth(text) + (spacer*2),
        fm.getHeight());
    xx += spacer;
}
After we transform it into an abstract syntax tree the following interme-
diate code may be obtained:
1) if CAPTION = StyleSet then goto 30
2) t := text_placement
3) if t <> 1 goto 9
4) xx := bounds.width/2
5) xx := xx − fm.stringwidth(text)/2
6) yy := bounds.height/2
7) yy := yy − fm.getheight()/2
8) goto
9) if t <> 2 goto 15
10) xx := thickness
11) xx := xx + TEXT_OFFSET
12) yy := bounds.height/2
13) yy := yy − fm.getheight()/2
14) goto
15) xx := bounds.width
16) xx := xx − TEXT_OFFSET
17) xx := xx − fm.stringwidth(text)
18) yy := bounds.height/2
19) yy := yy − fm.getheight()/2
20) goto
21) w := fm.charwidth(‘i’)
22) spacer := realpoint(w)
23) xx := thickness
24) xx := xx + TEXT_OFFSET
25) xx := xx + spacer
26) yy := 0
27) t1 := g.setcolor(getbackground())
28) t2 := g.fillrect(xx, yy,
       fm.stringwidth(text) + (spacer*2),
       fm.getheight())
29) xx := xx + spacer
30) (the exit of the program)
Notice that in the program above some goto’s have no destinations yet, that
every case uses fm.getheight(), and that apart from LEFT, the other cases
also use fm.stringwidth(text). Therefore, we may optimize the program to
obtain:
1) if CAPTION = StyleSet then goto 23
2) t := fm.getheight()
3) t1 := text_placement
4) goto 34
5) xx := thickness
6) xx := xx + TEXT_OFFSET
7) yy := bounds.height/2
8) yy := yy − t/2
9) goto 34
10) xx := bounds.width/2
11) xx := xx − t2/2
12) yy := bounds.height/2
13) yy := yy − t/2
14) goto 34
15) xx := bounds.width
16) xx := xx − thickness
17) xx := xx − t2
18) yy := bounds.height/2
19) yy := yy − t/2
20) goto 34
21) if t1 = 2 goto 5
22) t2 := fm.stringwidth(text)
23) if t1 = 1 goto 8
24) if t3 = 3 goto 13
25) t2 := fm.stringwidth(text)
26) t3 := fm.charwidth(‘i’)
27) spacer := realpoint(t3)
28) xx := thickness
29) xx := xx + TEXT_OFFSET
30) xx := xx + spacer
31) yy := 0
32) t2 := fm.stringwidth(text)
33) t3 := g.setcolor(getbackground())
34) t4 := g.fillrect(xx, yy, t2, (spacer*2), t)
35) xx := xx + spacer
36) (the exit of the program)
This program is longer than the original one, but it is indeed an improvement
on it. The reader may check this for himself or herself.
a[t4 ] := t3
goto B2
The idea of copy propagation is that, after the copy assignment a := b, we
use b instead of a wherever possible.
Loop programs represent a situation in which the execution time is not pro-
portional to the length of the program: the program text may be short, but
it may be repeated many times, so the running time can still be very long.
We will especially pay attention to the inner loops
where programs tend to spend the bulk of their time. The running time of
a program may be improved if we decrease the number of instructions in an
inner loop, even if we increase the amount of code outside that loop. We re-
gard that the following three techniques are important for loop optimization:
code motion that moves code outside a loop; induction-variable elimination
which we apply to eliminate variables from the inner loops; and reduction
in strength that replaces an expensive operation by a cheaper one, such as a
multiplication by an addition. In the following, we explain these techniques
one by one.
Code motion
The factors that determine the length of the execution time are the length of
the loop body and the number of executions of the loop. Therefore, an
important modification that decreases the amount of code in a loop is code
motion, especially when the number of executions of the loop is fixed.
The computations that should remain in the loop are those that really need
to be executed in the loop. If an expression does not change its value while the
loop executes, it is called loop invariant. Loop-invariant computations do not
need to be kept inside the loop, where time would be spent recomputing them.
For example, if we have
u := a
v := b
...
while (i < (u + v)(u − v))
If the while-loop body does not change the values of u and v, then (u+v)(u−v)
is a loop-invariant expression and we do not need to evaluate it repeatedly.
So we may perform code motion as follows:
u := a
v := b
t := (u+v)(u−v)
while (i < t) . . .
Induction-variable elimination
We have mentioned before that, for the computations really needed in a
loop, we should make them as efficient as possible. Induction-variable elimi-
nation and strength reduction belong to this category.
Now we consider induction-variable elimination. When there are two
or more induction variables in a loop, it may be possible to get rid of all but
one by the process of induction-variable elimination.
For example, we have the following program with loop
i := n-1
j := m
t1 := 5*j
v := a[t1 ]
B2:
...
B3 : j := j-1
t1 := 4*j
t2 := a[t1 ]
if t2 > v goto B4 else B3
B4 : if i j goto B6 else B5
B5 : goto B3
B6 :
In this program, B3 forms an inner loop and both j and t1 are induction
variables. Since t1 changes in step with j, and t1 is used in B3 while j is used
in B4 , we cannot get rid of either of them completely. However, we can modify
the code so as to partly reduce the strength of the computation; with further
analysis, we can even achieve true induction-variable elimination.
Problems
References
[1] Graham SL, Haley CB, Joy WN (1979) Practical LR error recovery. ACM
SIGPLAN Notices 14(8): 168 – 175.
[2] Aho AV, Sethi R, Ullman JD (2003) Compilers: principles, techniques and
tools. Prentice-Hall, Englewood Cliffs.
[3] Giegerich R (1983) A formal framework for the derivation of machine-specific
optimizers. TOPLAS 5(3): 422 – 448.
[4] Cocke J, Kennedy K (1977) An algorithm for reduction of operator strength.
Comm. ACM 20(11): 850 – 856.
[5] Graham SL (1984) Code generation and optimization. In B Lorho (ed)
Methods and Tools for Compiler Construction: An Advanced course, pp,
251 – 288.
[6] Cocke J, Markstein J (1980) Measurement of code improvement algorithms.
Information Processing 80, 221 – 228.
[7] Allen FE, Cocke J, Kennedy K (1981) Reduction of operator strength. In:
Muchnick S, Jones N (eds) Program Flow Analysis: Theory and Applications.
Prentice-Hall, Englewood Cliffs, pp 79 – 101.
Chapter 12 Storage Management
Roger S. Pressman
The task of a compiler is to translate source programs into target pro-
grams. Therefore, strictly speaking, a compiler has nothing to do with
storage management; storage management is not its task. However, a com-
piler can only work while it resides in memory, and the target code
it generates is also in memory. During compilation, the compiler
should consider the layout of the source program, the various tables and the
placement of the intermediate code and target code, etc. If the layout is not
appropriate, the compiler will not be able to access them efficiently, and its
work cannot be efficient either. Therefore, in this sense, the compiler has an
intimate relation with storage management.
We now know that compilers are related to storage in many ways.
It is the task of this chapter to describe the storage management that affects
the process of compilation. We will explain what storage management
means for compilers, how one should realize storage management so that
compilers can run efficiently, what strategies should be adopted to realize it,
and so on.
Suppose that the compiler obtains a block of storage from the operating
system for the compiled program to run in. Of course, before this the compiler
needs to compile the source program and build up a number of symbol tables,
which are stored in memory too. Through lexical analysis, syntax
analysis, the generation of intermediate code, and so on, we finally obtain
the compiled program, called the target code. During these phases, how
does storage management work? This chapter explains these issues in detail.
tion is interrupted and information about the status of the machine, such as
the value of the program counter and machine register, is saved on the stack,
along with other information associated with the activation.
Stack allocation is based on the idea of a control stack; storage is organized
as a stack, and activation records are pushed and popped as activations begin
and end, respectively. Storage for the locals in each call of a procedure is
contained in the activation record for that call. Thus locals are bound to
fresh storage in each activation, because a new activation record is pushed
onto the stack when a call is made. Furthermore, the values of locals are
deleted when the activation ends; that is, the values are lost because the
storage for locals disappears when the activation record is popped.
We now describe a form of stack allocation in which sizes of all activation
records are known at compile-time. Situations in which incomplete informa-
tion about the size is available at compile-time are considered below.
Suppose that register top marks the top of the stack. At run-time, an ac-
tivation record can be allocated and deallocated by incrementing and decre-
menting top, respectively, by the size of the record. If procedure q has an
activation record of size a, then top is incremented by a just before the tar-
get code of q is executed. When control returns from q, top is decremented
by a.
A separate area of run-time memory, called a heap, holds all other infor-
mation. Some languages provide facilities for the dynamic allocation
of storage for data under program control. Storage for such data is usually
taken from a heap. The stack allocation strategy cannot be used if either of
the following is possible:
1) The value of local names must be retained when an activation ends.
2) A called activation outlives the caller. This possibility cannot occur
for those languages where activation trees correctly depict the flow of control
between procedures.
In each of the above cases, the deallocation of activation records need not
occur in a last-in-first-out fashion, so storage cannot be organized as a stack.
Heap allocation parcels out pieces of the contiguous storage, as needed for
activation records or other objects. Pieces may be deallocated in any order,
so over time the heap will consist of alternate areas that are free and in use.
The difference between heap and stack allocation of an activation record
is that the record for an activation of a procedure, say r, is retained when the
activation ends. The record for the new activation, say q(1, 9), therefore,
cannot follow that for s physically. Now if the retained activation record for
r is deallocated, there will be free space in the heap between the activation
records for s and q(1, 9). It is left to the heap manager to make use of this
space.
The sizes of the stack and heap can change as the program executes, so
we show these at opposite ends of memory where they can grow towards
each other if need be. By convention, stacks grow down. That is, the ‘top’ of
the stack is drawn towards the bottom of the page. Since memory addresses
the procedure [3], so as far as the front end is concerned, the size of the field is
unknown. In the general activation record, we therefore show this field after
that for local data, where change in its size will not affect the offsets of data
object relative to the fields in the middle.
Since each call has its own actual parameters, the caller usually evaluates
actual parameters and communicates them to the activation record of the
callee. Methods for passing parameters will be discussed in the next section.
In the run-time stack, the activation record of the caller is just below
that for the callee. There is an advantage to placing the fields for parameters
and a potential returned value next to the activation record of the caller.
The caller can then access these fields using offsets from the end of its own
activation record, without knowing the complete layout of the record for the
callee. In particular, there is no reason for the caller to know about the local
data or temporaries of the callee. A benefit of this information hiding is
that procedures with variable numbers of arguments can be handled. Some
programming languages require arrays local to a procedure to have a length
that can be determined at the compile-time. More often, the size of a local
array may depend on the value of a parameter passed to the procedure. In
that case, the size of all the data local to the procedure cannot be determined
until the procedure is called.
A common strategy for handling variable-length data is somewhat different
from the handling of fixed-length data. Suppose that procedure p has four
local arrays. The storage for these arrays is not part of the activation record
for p; only a pointer to the beginning of each array appears in the activation
record. The relative addresses of these pointers are known at compile-time,
so the target code can access array elements through the pointers.
Suppose that there is a procedure q that is called by p. The activa-
tion record for q begins after the arrays of p, and the variable-length arrays
of q begin beyond that.
Access to data on the stack is through two pointers, top and top-sp. The
first of these marks the actual top of the stack; it points to the position at
which the next activation record will begin. The second is used to find local
data.
Suppose that top-sp points to the end of the machine-status field; in
particular, it points to the end of this field in the activation record for q.
Within the field is a control link to the previous value of top-sp, when control
was in the calling activation of p.
The code to reposition top and top-sp can be generated at compile-
time, using the size of the fields in the activation records. When q returns,
the new value of top is top-sp minus the length of the machine-status and
parameter fields in q’s activation record. After adjusting top, the new value
of top-sp can be copied from the control link of q.
Having introduced the details of handling the communication between a
procedure and the procedures it calls, we can now introduce the call sequence
and return sequence, that is, the algorithms for doing so. Once again, we should
make it clear that register top-sp points to the end of the machine-status field in
an activation record. This position is known to the caller, so the caller can be
made responsible for setting top-sp before control flows to the called procedure.
offsets from top-sp.
The call sequence is as follows:
1) The caller evaluates the actual parameters.
2) The caller stores a return address and old value of top-sp into the
callee’s activation record. Then the caller increments top-sp to the position
with the new pointer value. That is, top-sp is moved past the caller’s local
data and temporaries and the callee’s parameter and status fields.
3) The callee saves register values and other status information.
4) The callee initializes its local data and begins execution.
As for return sequence, it is likely to be as follows:
• The callee places a return value next to the activation record of the caller.
• Using the information in the status field, the callee restores top-sp and
other registers and branches to a return address in the caller’s code.
• Although top-sp has been decremented, the caller can copy the returned
value into its own activation record and use it to evaluate an expression.
The above calling sequences allow the number of arguments of the called
procedure to depend on the call. Note that, at compile-time, the target code of
the caller knows the number of arguments it is supplying to the callee. Hence
the caller knows the size of the parameter field. However, the target code of
the callee must be prepared to handle other calls as well, so it waits until
it is called, and then examines the parameter field. Using the organization
described above, information describing the parameters must be placed next
to status field so the callee can find it.
In this situation, we organize the blocks into a linked list. Allocation and
deallocation can be done quickly with little or no storage overhead.
Suppose that blocks are to be drawn from a contiguous area of storage.
Initialization of the area is done by using a portion of each block for a link
to the next block. A pointer called available points to the first block.
Allocation consists of taking a block off the list and deallocation consists
of putting the block back on the list.
The compiler routines that manage blocks do not need to know the type
of object that will be held in the block by the user program. We can treat
each block as a variant record, with the compiler routines viewing the block
as consisting of a link to the next block and the user program viewing the
block as being of some other type. Thus there is no space overhead because
the user program can use the entire block for its own purposes.
When the block is returned, the compiler routines use some of the space
from the block itself to link it into the list of available blocks.
With the variable-sized blocks, when they are allocated and deallocated,
storage can become fragmented; that is, the heap may consist of alternate
blocks that are free and in use.
For example, if a program allocates five blocks and then deallocates the
second and fourth, then the fragmentation is formed. Fragmentation is of no
consequence if blocks are of fixed size, but if they are of variable size, then it
will be a problem, because we could not allocate a block larger than any one
of the free block, even though the space is available in principle.
One method for allocating variable-sized blocks is called the first-fit
method. When a block of size s is requested, we search for the first free
block that is of size f ≥ s. This block is then subdivided into a used block of
size s and a free block of size f − s. Note that allocation incurs a time overhead
because we must search for a free block that is large enough.
When a block is deallocated, we check to see if it is next to a free block.
If possible, the deallocated block is combined with a free block next to it to
create a larger block. Combining adjacent free blocks into a larger free block
prevents further fragmentation from occurring. There are a number of subtle
details concerning how free blocks are allocated, deallocated, and maintained
in an available list or lists. There are also several tradeoffs between time,
space, and availability of large blocks. The reader is referred to articles [5]
and [6] for a discussion of these issues.
until the offices are full of waste. Similarly, if the storage has become so full
that the garbage collector has no room to work inside it, how can it carry out
the reclamation? Therefore, the work of garbage collection should be done
regularly or periodically; it cannot be postponed until the garbage has
occupied the whole storage space.
The second problem is even more important: in order to carry out the work
of collection correctly, we need to make very clear what garbage is. For
this purpose, two concepts have been proposed that are close to each other
but actually different. One is “the set of storage fragments that no pointer
points to”, and the other is “the set of fragments not reachable from the
non-heap program data”. The data in these two sets are obviously
not accessible to any program. They give rise to the techniques on which
garbage collectors depend: the first leads to the technique called reference
counting, and the second leads to mark-and-scan and to two-space copying.
• Reference counting. This approach directly identifies wasted fragments. It
is relatively simple and efficient, but it requires that all pointer activities
be monitored while the program runs, and it may not cover all the wasted
fragments.
• Token and scan. This approach determines the reachable fragments, and
the rest is regarded as garbage. It is rather efficient and needs no pointer
monitoring, but it is rather complicated. It can cover all the reclaimable
space.
• Dual-space copying. This approach copies the reachable fragments of
the storage area within the so-called source space into the storage area
of the so-called target space. The rest of the target space consists of free
fragments. This approach is also very efficient and needs no pointer
monitoring, but it is also complicated, and it wastes about half of the space.
Once the wasted fragments have been determined through these approa-
ches, they should be turned into free space available for use. Those that
are discovered by reference counting or by token and scan should be returned
to the free-space list via a specific algorithm. With dual-space copying, a new
free-space list is created automatically, a single storage fragment that
contains all the free storage space.
Regarding storage allocation, we need to further explain the concept of a
fragment. As mentioned before, if a program in execution requires a
storage space of length s, the allocator finds a space of length f in the
free-storage list, where f > s. It then allocates a part of the storage
of length s to the program, leaving storage of length f − s.
This remainder is a fragment.
Gradually, these fragments scatter over the storage space. Though each
fragment may be of limited length, their total may be quite large. Therefore,
if a program requests a space whose length is bigger than the biggest storage
fragment currently in the list, the allocator must fail to satisfy the request.
However, if a compacting technique is adopted, that is, the fragments are
combined and moved to one side of the storage to form a single independent
free space, this space can be used to satisfy the request of the program.
It may be seen from this that compaction is the best approach for collecting
the storage space not in use.
The compacting technique is not perfect either. The main problem with it is
that it needs to move the reachable fragments around, and these fragments
are likely to contain pointers to other fragments that also need to be moved.
Therefore, careful design is required in order to move them correctly.
Garbage collection algorithms are of three kinds:
• Working at once. After the garbage collector is started, it takes complete
control of all the storage fragments until it finishes running and returns.
After the processing, the state of the storage is improved: wasted fragments
no longer scatter over the storage. Since this kind of garbage collector
completely controls the storage fragments while it runs, the situation is
simpler. But if an unexpected activation happens, there will be some
damage. This is possibly a problem inside a compiler, though usually not
in application programs.
• Dynamic (also called incremental). Such a garbage collector starts working
when the procedure malloc or free is called. These activities locally modify
the structure of the storage fragments in order to improve the ability to
find free fragments. The dynamic garbage collector is more complex in
structure than the working-at-once kind, but its execution is more stable
and causes less damage. When it cannot meet the requirements, it may
need help from the former kind.
• Concurrent. In this kind, the garbage collector works concurrently with
the application program. They run on different processors, in parallel,
and each carries out its own task.
Garbage collectors need a lot of help from the compiler, and it is the aim of
this book to explain this.
First, we need to point out that a storage fragment is reachable for a program
only if the program has a pointer pointing to it directly, or a pointer pointing
to it indirectly. Where the pointers available to the program are located
depends on the specific implementation; they may be in different places such
as global variables, local variables, routine parameters, registers, and others.
We call the non-heap storage space that the program code may access
directly the area of program data, and the set of all pointers in the area of
program data the root set. Notice that the root set is only a general concept
rather than a data structure. It is the set of all pointers in the area of program
data, not a list of their values. The root set usually cannot be implemented
directly; it occurs only conceptually inside the program code of the garbage
collector.
The pointers in the root set may point to storage fragments that are con-
trolled by the garbage collector; those fragments are therefore reachable. The
reachable fragments in the heap may in turn contain pointers to other
fragments, and the fragments pointed to by those pointers are also reachable.
The compiler must provide the root set and the pointer-layout information of
each storage fragment to the garbage collector; meanwhile, it must ensure
that when the garbage collector is activated, all the reachable pointers in
the area of program data and in the heap are valid. Based on this support
from the compiler, garbage collectors solve the following three problems:
1) To determine the root set by finding all the pointers and their types
in the area of program data.
2) To find all the pointers and their types in a given storage fragment.
3) To find all the reachable storage fragments using 1) and 2).
From this one may see that, without support from the compiler, a
garbage collector cannot complete the task of collecting garbage. The compiler
completely controls the layout of pointers in storage fragments,
so the problem is how to transmit this information to the garbage collector.
The following are some methods by which the compiler provides the garbage
collector with the pointer-layout information.
1) The compiler generates a bit map for every fragment type to indicate which
fields of fragments of that type contain pointers to other fragments. When this
method is used, the fragments must be self-describing, so that wherever there
is a pointer the garbage collector is guaranteed to be able to follow it. Therefore,
every storage fragment must contain its own bit map, or a pointer that points
to the bit map.
2) The compiler generates, for each fragment type, a specific routine that calls
the garbage collector and passes each pointer in the fragment as a param-
eter. This method avoids interpreting the bit map at run time
and also avoids the requirement for self-description, because the code itself can
transmit the fragment types and pointers to the garbage collector.
3) The compiler organizes each fragment so that it begins with an array
containing all its internal pointers, followed by the other data. With this
organization, the garbage collector can start working as long as it knows
the address of the pointer array and the number of pointers in the
fragment.
The token-and-scan garbage collection algorithm may be divided into two
processes. The first is the marking process, which marks all
reachable blocks. The second is the scanning process, which scans the
allocated storage space and treats the blocks that have not been marked
reachable as free blocks, so that they may be reused. The token-and-scan
garbage collection algorithm is sometimes also called the mark-and-sweep
garbage collection algorithm. In comparison with the reference-counting
garbage collection algorithm introduced before and the dual-space
copy algorithm to be introduced soon, the efficiency of token and scan is
highest in the sense that it can reclaim all the storage space that can be
reclaimed, whereas the reference-counting algorithm cannot reclaim cyclic
structures, and the dual-space copy algorithm leaves half of the storage
space unavailable.
1. Tokens
Token marking is based on two principles. One is that storage blocks
reachable from the root set are reachable. The other is that any storage block
reachable from a pointer in a reachable block is also reachable.
Suppose that the root set resides in the program data area or at the highest
end of the activation record; its data type description has been constructed and
is accessible to the compiler. Starting from the root set, the program data area
is reachable, and by its data type description the internal pointers may be found
and followed. If the recursive procedure finds a block that has no pointers, or a
block that has already been marked, it backtracks and uses the next pointer to
continue the recursion. As the number of reachable blocks is limited and each
storage block is processed only a bounded number of times (in general only
once, occasionally more), the depth-first marking algorithm terminates, and
the time it spends grows linearly with the number of reachable fragments.
Besides the free bit, the marking process needs another auxiliary bit, the
marking bit, in the management field at the head of each storage block.
Initially, this bit is in the “cleared” state.
2. Scan and reclamation
The reclamation of unreachable fragments is relatively easy. Using the
length recorded in each fragment, we traverse the storage fragments one by
one. For every fragment, we check whether it is marked as reachable. If it
has been marked reachable, we clear its marking bit; otherwise we set its
free bit.
Adjacent free fragments can also be combined during a left-to-right scan.
We keep a pointer to the first free fragment met and record its size. As long
as we keep meeting free fragments, we accumulate their sizes, until we meet
an occupied block or reach the end of the storage. At that moment, the size
in the management field of the first free fragment is just the total size of the
run of free fragments, so we have created a bigger free space. The scan then
continues, and whenever a free fragment is met the process repeats, and so
forth.
The outcome of the marking and scanning process is a heap in which all
the blocks marked occupied are reachable and every occupied block lies
between free fragments. This is the best method for implementing the
reclamation of fragments when the fragments must not be moved. If, in
addition, a compaction process is used to combine all the free fragments into
one bigger free block, the efficiency of execution may be raised further.
A sketch of the algorithm is given below.
The marking process of the mark and scan garbage collection algorithm only
involves reachable blocks, but the scanning process involves all the storage
blocks. When the algorithm runs, most of the heap usually consists of garbage,
so its workload is large. To address this problem, the dual space copy
algorithm avoids scanning all the storage blocks: it only visits the reachable
blocks. It therefore saves time, though its storage requirement almost doubles.
As the price of storage keeps decreasing, trading storage for time in this way
is worthwhile. The idea of the dual space copy algorithm is to divide the
available heap into two equal parts: the source space and the target space.
During ordinary computation a new storage block is obtained in the source space
simply by moving an allocation pointer forward. When the source space is used
up, all the reachable blocks are copied to the empty target space by the
garbage collector.
The operation of the dual space copy starts by copying the source-space blocks
that are referred to by the pointers in the root set; the copies are placed in
the target space starting from its beginning. The original blocks in the source
space are then marked "copied", and in each of them a forwarding pointer is set
that points to the corresponding copy in the target space; once the copy is
finished, the rest of the original content may be destroyed. During this
copying no pointers are updated, so the pointers inside the copies still point
to blocks in the source space. Subsequently, a "scanning pointer" is used to
scan the storage blocks of the target space from left to right, searching for
pointers inside them. Suppose that a pointer in a target-space block R points
to a storage block S in the source space; then there are two possibilities: S
is marked "copied", or S has not been marked "copied". In the first case, S
contains a forwarding pointer, which is used to update the pointer in R; in the
second case, S is immediately copied, marked "copied", and its content is
replaced by the forwarding pointer to the copy, which is then used to update
the pointer in R. This process is repeated until the target space no longer
contains any pointer that points to a storage block in the source space.
Finally, all the reachable blocks in the source space have been copied to the
target space, and all the pointers have been updated to point into the target
space. Now the roles of the two spaces are exchanged and the computation
continues.
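The whole dual space copy can be sketched in C roughly as follows. The block
layout (a "copied" flag, a forwarding pointer, a size, and an explicit list of
pointer fields) is assumed purely for the illustration and is not the book's
own representation.

#include <stddef.h>
#include <string.h>

/* Assumed block layout for the sketch. */
typedef struct Obj {
    int copied;              /* marked "copied" once moved             */
    struct Obj *forward;     /* forwarding pointer into target space   */
    size_t size;             /* total size of the object in bytes      */
    size_t n_ptrs;           /* number of pointer fields               */
    struct Obj *ptrs[];      /* the pointer fields themselves          */
} Obj;

static char *free_ptr;       /* allocation pointer in the target space */

/* Copy one object into the target space (if not copied yet) and return its
   new address; afterwards the old copy only holds a forwarding pointer. */
static Obj *evacuate(Obj *o) {
    if (o == NULL) return NULL;
    if (!o->copied) {
        Obj *copy = (Obj *)free_ptr;
        memcpy(copy, o, o->size);
        free_ptr += o->size;
        o->copied = 1;
        o->forward = copy;
    }
    return o->forward;
}

/* Dual space copy: evacuate the roots, then scan the target space from left
   to right, evacuating whatever the already copied objects point to. */
void copy_collect(Obj **roots, size_t n_roots, char *target_space) {
    free_ptr = target_space;
    char *scan = target_space;
    for (size_t i = 0; i < n_roots; i++)
        roots[i] = evacuate(roots[i]);
    while (scan < free_ptr) {                   /* the scanning pointer */
        Obj *o = (Obj *)scan;
        for (size_t i = 0; i < o->n_ptrs; i++)
            o->ptrs[i] = evacuate(o->ptrs[i]);  /* update pointer in place */
        scan += o->size;
    }
    /* when scan catches up with free_ptr, the two spaces exchange roles */
}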
By using a copying garbage collector, the fragmentation and the time overhead
incurred by the mark and scan garbage collection algorithm may both be avoided.
The key point is that moving the storage blocks keeps the free space
consecutive, so allocation can be done by moving a single global pointer
forward. The main drawback of dual space copying is that it may waste half of
the heap space, and when the heap is nearly full the performance of the
algorithm deteriorates. The dual space copy algorithm is thus a highly
efficient algorithm with rigorous requirements: when running efficiency matters
more than space efficiency, it is a good choice.
12.4.6 Compaction
Compaction proceeds in three steps: the first computes the new position of
every storage block, the second updates the existing pointers to point to the
new positions, and the third really moves the storage blocks.
1) Address computation. Scanning the storage blocks from bottom to top, we
compute the new position of every storage block after compaction. The address
of the new position is kept in the management field of the storage block. Since
we know the new position of the first occupied storage block (the lower end of
the storage) and we know the size of every storage block, computing the
addresses poses no problem.
2) Update of pointers. Scanning the program data area and the storage blocks,
we search for pointers into the heap; every pointer that points to a storage
block is updated to the new position recorded in the management field of that
block.
3) Move of the storage blocks. The compaction program scans the storage blocks
from low addresses to high. Every occupied block is moved to its new position,
which is found in the management field of the storage block. Since storage
blocks can only be moved to the left (towards the low end) or kept still, the
work can be done by a single scan from left to right. After the move, all the
pointers in the blocks again refer to the storage blocks they pointed to at the
beginning of the compaction.
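The three passes might be sketched in C as follows, assuming a simple header
that stores the block size, the free bit, and a field for the computed new
address; the pointer-update pass is shown only for a single pointer, since
enumerating all pointers depends on the data type descriptions.

#include <stddef.h>

/* Assumed layout: occupied and free blocks lie one after another; each block
   header records its size, a free bit, and a slot for the new address. */
typedef struct BHeader {
    size_t size;
    int free;
    char *new_addr;          /* filled in by the address-computation pass */
} BHeader;

/* Pass 1: compute the new (compacted) address of every occupied block. */
void compute_addresses(char *heap_start, char *heap_end) {
    char *dest = heap_start;                 /* next position after compaction */
    for (char *p = heap_start; p < heap_end; p += ((BHeader *)p)->size) {
        BHeader *h = (BHeader *)p;
        if (!h->free) {
            h->new_addr = dest;
            dest += h->size;
        }
    }
}

/* Pass 2 (sketch): replace one pointer into the heap by the new address
   recorded in the management field of the block it points to. */
void update_pointer(char **pp) {
    BHeader *h = (BHeader *)*pp;
    *pp = h->new_addr;
}

/* Pass 3: move every occupied block to its new position, scanning from low
   addresses to high; blocks only move to the left, so one pass suffices. */
void move_blocks(char *heap_start, char *heap_end) {
    char *p = heap_start;
    while (p < heap_end) {
        BHeader *h = (BHeader *)p;
        size_t n = h->size;              /* read before the block is overwritten */
        if (!h->free && h->new_addr != p) {
            char *dst = h->new_addr;
            for (size_t i = 0; i < n; i++)
                dst[i] = p[i];           /* dst <= p, so a forward copy is safe */
        }
        p += n;
    }
}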
a[i] := a[i]
the occurrence of a[i] on the right-hand side denotes a value (an r-value),
while the occurrence on the left-hand side denotes the address (an l-value) at
which the value is to be stored.
12.5.1 Call-by-Value
This is, in a sense, the simplest possible method of passing parameters. The
actual parameters are evaluated, and their r-values are passed to the called
procedure. Call-by-value can be implemented as follows:
1) A formal parameter is treated just like a local name, so the storage for
the formal parameters is in the activation record of the called procedure.
2) The caller evaluates the actual parameters and places their r-values in
the storage for the formal parameters.
A distinguishing feature of call-by-value is that operations on the formal
parameters do not affect values in the activation record of the caller. A pro-
cedure called by value can affect its caller through nonlocal names or through
pointers that are explicitly passed as values.
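The following small C program illustrates the point: the callee cannot change
the caller's variable through a value parameter, but it can do so through a
pointer whose r-value (an address) was passed.

#include <stdio.h>

/* The formal parameter n is a local of the callee: the actual parameter's
   r-value is copied into the callee's activation record, so changes to n are
   invisible to the caller.  A pointer passed by value still lets the callee
   reach the caller's data through it. */
void by_value(int n)      { n = n + 1; }
void via_pointer(int *p)  { *p = *p + 1; }

int main(void) {
    int x = 5;
    by_value(x);          /* x is still 5: only the copy was changed       */
    printf("%d\n", x);
    via_pointer(&x);      /* x becomes 6: the pointer's r-value (an        */
    printf("%d\n", x);    /* address) was copied, not the object itself    */
    return 0;
}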
12.5.2 Call-by-Reference
12.5.3 Copy-Restore
12.5.4 Call-by-Name
Problems
argument and returns one value, either a function or a real. The operator
· stands for composition of functions, that is, (f · g)(x) = f(g(x)).
1) What value is printed by main?
2) Suppose that whenever a procedure p is created and returned, its
activation record becomes a child of the activation record of the function
returning p. The environment of p can then be maintained by
keeping a tree of activation records rather than a stack. What does the
tree of activation records look like when a is computed by main in the program?
3) Alternatively, suppose an activation record for p is created when
p is activated, and made a child of the activation record for the pro-
cedure calling p. This approach can be used to maintain the activation
environment for p. Draw snapshots of the activation records and their
parent-child relationships as the statements in main are executed. Is a
stack sufficient to hold activation records when this approach is used?
function f(x: function);
var y: function;
y := x · h; /* creates y when executed */
return y
end {f};
function h();
return sin
end {h};
function g(z: function);
var w: function;
w := arctan · z; /* creates w when executed */
return w
end {g};
function main();
var a: real;
    u, v: function;
v := f(g);
u := v();
a := u(π/2);
print a
end {main}
Problem 12.5 Write a procedure that inserts a new entry into a linked list,
passing the list head pointer as a parameter.
Problem 12.6 List the characteristics of garbage collection that make
concurrent garbage collection algorithms very hard to implement.
Problem 12.7 A possible way to reduce the stop-and-go behavior of the mark
and scan algorithm is to make the scan procedure incremental. After the
marking procedure we do not scan all the storage at once; instead, we
change the code of the allocation routine so that, when a free fragment of
suitable size has been scanned, the scan procedure stops and the original
program continues. Outline the modified incremental version.
A carefully selected algorithm for the design of code generators will produce
highly efficient code far more easily than one rushed out in a short time.
In this chapter we shall discuss the issues related to the design of target
code generators. Though many details of target code generation depend on the
object language and the operating system, problems such as storage management,
the selection of instructions and registers, and the order of computation
confront almost all target code generators. Discussing them here will help the
reader to better comprehend the essence of the problems explained in the
following.
The output of the code generator is the target program. Like the intermediate
code, the output may take a variety of forms, such as absolute machine code,
relocatable machine code, or assembly language code.
The advantage of generating programs in absolute machine language as output is
that the programs may be stored at fixed positions of the storage and can be
executed immediately. A small program may thus be compiled and executed
quickly. The compilers for a number of "student programs", such as WATFIV and
PL/C, generate exactly such absolute machine code.
The advantage of generating programs in relocatable machine language as output
(target modules) is that it allows the independent compilation of subprograms.
A group of relocatable target modules may be linked together by a linker and
then loaded into storage for execution. Generating relocatable target modules
adds the overhead of linking and loading, but because subprograms can be
compiled independently and a target module can call other previously compiled
programs, the flexibility is significantly increased. If the target machine
cannot handle relocation automatically, the compiler must provide explicit
relocation information to the loader so that the independently compiled program
segments can be linked together.
Generating programs in assembly language as output makes the process of code
generation somewhat easier, since we may generate symbolic instructions and
make use of the macro facilities provided by the assembler to help the code
generation. The price to be paid for this benefit is an extra assembly pass
after code generation, but generating assembly code does not repeat the whole
work of the assembler. This is a reasonable choice, especially for machines
with a small storage, on which the compiler must make multiple passes anyway.
In this chapter, for the sake of readability, we also choose assembly language
as the target language. We want to emphasize, however, that as long as
addresses can be computed from offsets and other information in the symbol
tables, the code generator can generate relocatable addresses or absolute
addresses of names just as easily as it generates symbolic addresses.
Mapping the names in source programs to the addresses of data objects in
storage is part of the job of the front end of the compiler, and it is done in
cooperation with the code generator.
We have assumed that a name referenced in a three-address statement refers to
the name's entry in the symbol table, and that the entries of the symbol table
were created when the declarations of the procedure were checked. The type in a
declaration determines the width of the type, that is, the amount of storage
needed to store a value of that type. From the information in the symbol
tables, the relative address of the name in the data area of the procedure can
be determined.
If machine code is to be generated, the labels in the three-address statements
need to be transformed into instruction addresses. This is done with the
back-patching technique introduced in the previous chapter. Suppose that a
label refers to a quadruple number. As we scan the quadruples in turn, we can
derive the position of the instructions generated for each quadruple; all we
need is a count of the number of words used by the generated instructions, and
this count can be stored in the array of quadruples (in a spare field). When a
reference is met, say j: goto i, and i is less than j, we simply emit a jump
instruction whose target address is the location of the first instruction
generated for quadruple i. But if the jump is forward, that is, i exceeds j, we
record in a table the location of the jump that refers to i. When we later
reach quadruple i and the location of its first instruction is known, we go
back and, by back-patching, replace all the recorded references to i by that
location.
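A minimal sketch of this back-patching bookkeeping, with invented names and
fixed-size tables, might look like this in C; the patch callback stands for
whatever mechanism writes the final address into the already emitted jump.

#include <stddef.h>

#define MAXQ   1000
#define MAXFWD 1000

/* quad_loc[i] records the address of the first instruction generated for
   quadruple i (filled in as the quadruples are scanned in order). */
static long quad_loc[MAXQ];

/* One pending forward reference: the jump emitted at address 'patch_at'
   still needs the address of quadruple 'target'. */
struct Fixup { int target; long patch_at; };
static struct Fixup fixups[MAXFWD];
static int n_fixups = 0;

/* Called when quadruple j contains "goto i"; 'here' is the address at which
   the jump instruction is being emitted. */
long emit_goto(int i, int j, long here) {
    if (i < j)                       /* backward jump: target already known */
        return quad_loc[i];
    fixups[n_fixups].target = i;     /* forward jump: remember it for later */
    fixups[n_fixups].patch_at = here;
    n_fixups++;
    return 0;                        /* placeholder address, patched below  */
}

/* Called when the first instruction of quadruple i is emitted at 'here':
   record its location and back-patch every jump that was waiting for it. */
void define_quad(int i, long here, void (*patch)(long at, long target)) {
    quad_loc[i] = here;
    for (int k = 0; k < n_fixups; k++)
        if (fixups[k].target == i)
            patch(fixups[k].patch_at, here);
}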
The nature of the instruction set of the target machine determines the
difficulty of instruction selection. Uniformity and completeness are important
properties of an instruction set: if the target machine does not support each
data type in a uniform manner, every irregular case needs special treatment.
The speed of the instructions and their suitability for common usage are also
important factors. If we do not care about the efficiency of the target
program, instruction selection is very simple: for every type of three-address
statement we may design a code frame that describes the target code to be
generated from it. For example, every three-address statement of the form
c := a + b, where a, b, and c are statically allocated, may be translated into
the following code sequence:
MOV a, R1 /* load a into register R1 */
ADD b, R1 /* add b to R1 */
MOV R1, c /* store R1 to c */
But generating code statement by statement in this way often produces very poor
code. For example, suppose we translate the following sequence with the
statement-by-statement method:
x := y + z,
u := x + z.
Then we obtain the following code as the result of the translation:
MOV y, R1
ADD z, R1
MOV R1, x
MOV x, R1
ADD z, R1
MOV R1, u
Instead, if we consider the translation as a whole (and x is not needed
afterwards), we may generate more compact code that saves two instructions:
MOV y, R1
ADD z, R1
ADD z, R1
MOV R1, u
The quality of the generated code is judged by its running time and by the
length of the program. A target machine with a rich instruction set may realize
a given operation in many ways, and the costs of the different ways may differ
dramatically.
Some translations of the intermediate code may be correct but very inefficient.
For example, if the machine has an INC (increment) instruction, the
three-address statement x := x + 1 may be realized efficiently by that single
instruction. In contrast, if we adopt the following sequence:
MOV x, R1
ADD #1, R1
MOV R1, x
then the code will naturally be much slower. But in order to determine which
implementation is better and which one runs faster, one needs to understand the
costs of executing the instructions.
regulations about the usage of the registers, the problem becomes even more
complicated.
The order of computation may also affect the efficiency of the target code. As
we have seen, by adopting a suitable order of computation the number of
registers needed to hold intermediate results may decrease. Choosing the best
order is another NP-complete problem; therefore, to begin with, we simply keep
the order in which the intermediate code generator produced the three-address
statements, thereby avoiding the problem.
No doubt the most important principle of code generation is that it must
generate correct code. As a code generator may encounter many special cases,
the principle of correctness is extremely important. After correctness, the
important design goals are ease of implementation, of debugging, and of
maintenance.
then take their mean, weighting each machine equally. The evaluation is as
follows:
(Cray I + IBM 801 + RISC II + Clipper C300 + AMD 29k + Motorola 88k + IBM 601
+ Intel i960 + Alpha 21164 + POWER 2 + MIPS R4000 + Hitachi SuperH4
+ StrongARM 110 + Sparc 64)/14 = 2009
MMIX works with 0s and 1s, and usually operates on 64 bits at a time. For
convenience we partition the 64 bits into groups of four, each group
representing one hexadecimal digit. For example, the binary number
1001111000110111011110011011100101111111010010100111110000010110   (13.2)
corresponds to the hexadecimal number
#9e3779b97f4a7c16.   (13.3)
IEC 10646 UCS-2, known informally as Unicode UTF-16. It not only contains the
Greek letters such as Σ and σ (#03a3 and #03c3), Cyrillic letters such as Щ and
щ (#0429 and #0449), Armenian letters (#0547 and #0577), Hebrew letters such as
ש (#05e9), Arabic letters such as ش (#0634), and Indic letters (#0936, #09b6,
#0b36, #0bb7), etc., but also tens of thousands of East Asian ideographs, such
as the Chinese character for mathematics and computing, 算 (#7b97). It even
contains special codes for Roman numerals, such as MMIX = #216f216f21602169.
By simply adding a leading 0 byte to each character, the ordinary ASCII and
Latin-1 characters may be represented; for example, the Unicode encoding of
pâté is #007000e2007400e9.
We will use the convenient term word to describe characters of width up to
16 bits, such as Unicode characters, since two-byte quantities are very
important in practice. Quantities of 4 bytes or 8 bytes will be called double
words and four words respectively, hence
2 bytes = 1 word
2 words = 1 double word
2 double words = 1 four word
According to D. E. Knuth, 2 bytes are also called 1 wyde, 2 wydes are called
1 tetra, and 2 tetras are called 1 octa. One octabyte equals four wydes equals
eight bytes equals sixty-four bits. Of course, a quantity consisting of one or
more bytes may also represent alphanumeric characters.
From the point of view of a programmer, an MMIX computer has 2^64 memory bytes
and 2^8 = 256 general registers, in addition to 32 special-purpose registers
(see Fig. 13.1) [3]. Data are transferred from memory to the registers, the
operations are performed in the registers, and the results are transferred back
to memory. The memory bytes are called M[0], M[1], ..., M[2^64 − 1]; hence, if
x is any four word (octabyte) value, M[x] is a byte of the memory.
Fig. 13.1 The MMIX computer as seen by a programmer: 256 general-purpose
registers and 32 special-purpose registers, along with 2^64 bytes of virtual
memory; each register holds 64 bits.
Generally speaking, if x is a four word (octabyte) value, then the notations
M_2[x], M_4[x], and M_8[x] represent the word, double word, and four word that
contain the byte M[x]. When referring to M_t[x] we ignore the lg t least
significant bits of x. For completeness we also write M_1[x] = M[x], and when
x < 0 or x ≥ 2^64 we define M[x] = M[x mod 2^64].
The 32 special-purpose registers of MMIX are denoted rA, rB, ..., rZ, rBB,
rTT, rWW, rXX, rYY, and rZZ. Like the general registers, each of them holds a
four word (octabyte) value. Their usage will be explained later; for example,
we will see that rA controls the arithmetic interruptions while rR holds the
remainder after division operations.
13.3.3 Instructions
The memory of MMIX contains not only data but also instructions. An
instruction, or "order", is a double word (tetrabyte) whose four bytes are
conveniently called OP, X, Y, and Z. OP is the operation code and X, Y, and Z
are the operands. For example, #20010203 is an instruction with OP = #20,
X = #01, Y = #02, and Z = #03; its meaning is to put the sum of registers $2
and $3 into register $1. The operand bytes are regarded as unsigned integers.
Since an MMIX operation code is one byte long, there are 256 operations in
total. Each of the 256 operators has a convenient mnemonic form; for example,
the operator #20 is denoted ADD. Hereafter we will use the symbolic forms of
the operators, and we will also provide a complete instruction table. X, Y, and
Z have symbolic representations as well, consistent with the assembly language
we will discuss later. For example, #20010203 may conveniently be written as
"ADD $1,$2,$3", and in general the addition instruction is written as
"ADD $X,$Y,$Z". Most instructions have three operands, but some have two and a
few have only one. When there are two operands, the first is X and the second
is the two-byte number YZ, and the symbolic notation contains only one comma;
for example, the instruction "INCL $X,YZ" increases register $X by the quantity
YZ. When there is only one operand, it is the unsigned three-byte number XYZ,
and the symbolic notation contains no comma at all.
For example, the instruction "JMP @+4*XYZ" tells MMIX that the next instruction
is obtained by jumping forward XYZ double words (tetrabytes). The instruction
"JMP @+1000000" has the hexadecimal form #f003d090, since JMP = #f0 and
250000 = 1000000/4 = #03d090.
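The packing of OP, X, Y, and Z into one instruction can be checked with a few
lines of C; the two calls below reproduce the encodings #20010203 and #f003d090
mentioned in the text.

#include <stdint.h>
#include <stdio.h>

/* Pack the four instruction bytes OP, X, Y, Z into one tetrabyte. */
static uint32_t mmix_instr(uint8_t op, uint8_t x, uint8_t y, uint8_t z) {
    return ((uint32_t)op << 24) | ((uint32_t)x << 16) |
           ((uint32_t)y << 8)  |  (uint32_t)z;
}

/* For one-operand instructions such as JMP, the 24-bit field XYZ holds the
   relative target measured in tetrabytes. */
static uint32_t mmix_instr_xyz(uint8_t op, uint32_t xyz) {
    return ((uint32_t)op << 24) | (xyz & 0xffffff);
}

int main(void) {
    printf("%08x\n", (unsigned)mmix_instr(0x20, 0x01, 0x02, 0x03)); /* ADD $1,$2,$3 -> 20010203 */
    printf("%08x\n", (unsigned)mmix_instr_xyz(0xf0, 1000000 / 4));  /* JMP @+1000000 -> f003d090 */
    return 0;
}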
We will see that the 256 instructions of MMIX may be classified into a number
of groups. We start with the instructions that transfer information between the
registers and the memory.
In each of the following instructions, the memory address A is the sum of $Y
and $Z.
In this case, the value of rH is just the difference obtained by subtracting
the original number #9e3779b97f4a7c16 from 2^64. This is no coincidence: if we
put the binary point at the left of the number, it is approximately the golden
ratio φ^(−1) = φ − 1; after squaring we get the approximation
φ^(−2) = 1 − φ^(−1), with the binary point at the left of rH.
The eight-byte quotient and remainder of the division of a 16-byte dividend by
an 8-byte divisor are obtained with the instruction DIVU. The upper half of the
dividend appears in the special-purpose dividend register rD, whose initial
value at the beginning of a program is zero. Through the instruction
PUT rD,$Z, to be described later, this register can be set to any desired
value. If the value of rD is greater than or equal to that of the divisor, then
DIVU $X,$Y,$Z simply sets $X←rD and rR←$Y (this is always the case when $Z is
zero). DIVU never causes an integer division check.
According to Definition (13.7), the instruction ADDU evaluates a memory
address A; therefore it is sometimes given the alternative name LDA.
The following related instructions are helpful for address evaluation:
• 2ADDU $X, $Y, $Z (times two and add, unsigned):
u($X)←(u($Y)×2+u($Z)) mod 2^64
• 4ADDU $X, $Y, $Z (times four and add, unsigned):
u($X)←(u($Y)×4+u($Z)) mod 2^64
• 8ADDU $X, $Y, $Z (times eight and add, unsigned):
u($X)←(u($Y)×8+u($Z)) mod 2^64
• 16ADDU $X, $Y, $Z (times sixteen and add, unsigned):
u($X)←(u($Y)×16+u($Z)) mod 2^64
If overflow is not a problem, the instruction 2ADDU $X,$Y,$Y executes faster
than a multiplication by 3 and may be used in its place. The result of
executing this instruction is u($X)←3u($Y) mod 2^64.
Similarly, we may regard a four word (octabyte) x as a vector b(x) of 8 bytes,
each an integer between 0 and 255; or as a vector w(x) of 4 words, or as a
vector t(x) of 2 double words. The following operations handle all the
components at once.
• BDIF $X, $Y, $Z (byte difference): b($X)←b($Y) ∸ b($Z)
• WDIF $X, $Y, $Z (word difference): w($X)←w($Y) ∸ w($Z)
• TDIF $X, $Y, $Z (double word difference): t($X)←t($Y) ∸ t($Z)
• ODIF $X, $Y, $Z (four word difference): o($X)←o($Y) ∸ o($Z)
Here the operation ∸ is saturating subtraction (or "dot minus"):
y ∸ z = max(0, y − z).
These operations have important applications in computer graphics (when bytes
or words represent the values of picture elements) and in text processing.
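For instance, the effect of BDIF on the eight bytes of two octabytes
corresponds to the following C function (a plain scalar loop written only to
illustrate the semantics, not how hardware would do it):

#include <stdint.h>

/* Bytewise saturating ("dot") subtraction, the effect of BDIF applied to the
   eight bytes of two octabytes: each result byte is max(0, y_byte - z_byte). */
uint64_t byte_dot_subtract(uint64_t y, uint64_t z) {
    uint64_t x = 0;
    for (int i = 0; i < 8; i++) {
        int yb = (int)((y >> (8 * i)) & 0xff);
        int zb = (int)((z >> (8 * i)) & 0xff);
        int d  = yb - zb;
        if (d < 0) d = 0;                 /* saturate at zero */
        x |= (uint64_t)d << (8 * i);
    }
    return x;
}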
We can also regard a four word (octabyte) as an 8×8 Boolean matrix, i.e., an
8×8 array of 0s and 1s. Let m(x) denote such a matrix, whose rows from top to
bottom are the bytes of x from left to right, and let m^T(x) be the transposed
matrix, whose columns are the bytes of x. For example, if
x = #9e3779b97f4a7c16, then
m(x) =  1 0 0 1 1 1 1 0        m^T(x) =  1 0 0 1 0 0 0 0
        0 0 1 1 0 1 1 1                  0 0 1 0 1 1 1 0
        0 1 1 1 1 0 0 1                  0 1 1 1 1 0 1 0
        1 0 1 1 1 0 0 1                  1 1 1 1 1 0 1 1
        0 1 1 1 1 1 1 1                  1 0 1 1 1 1 1 0        (13.10)
        0 1 0 0 1 0 1 0                  1 1 0 0 1 0 1 1
        0 1 1 1 1 1 0 0                  1 1 0 0 1 1 0 1
        0 0 0 1 0 1 1 0                  0 1 1 1 1 0 0 0
This interpretation of a four word (octabyte) suggests two operations that are
very similar to those of ordinary matrix algebra, but we now define them from
the beginning. If A is an m×n matrix, B is an n×s matrix, and ◦ and · are
binary operations, then the extended matrix multiplication A ·◦ B is defined as
Notice that if each row of A contains at most one 1, then each entry of
Expressions (13.12) or (13.13) has at most one nonzero term; the same holds if
every column of B contains at most one 1. In these situations the results of
A |× B and A ⊕× B are the same as the ordinary matrix product A +× B = AB.
• MOR $X, $Y, $Z (multiple OR): m^T($X)←m^T($Y) |× m^T($Z); equivalently,
m($X)←m($Z) |× m($Y)
• MXOR $X, $Y, $Z (multiple XOR): m^T($X)←m^T($Y) ⊕× m^T($Z); equivalently,
m($X)←m($Z) ⊕× m($Y)
In effect, these instructions examine the bits of each byte of $Z and use them
to select bytes of $Y; the selected bytes are then combined by OR or XOR to
form the corresponding byte of $X. For example, if we have
$Z = #0102 0408 1020 4080,
it can be written as the following matrix:
m($Z) =  0 0 0 0 0 0 0 1
         0 0 0 0 0 0 1 0
         0 0 0 0 0 1 0 0
         0 0 0 0 1 0 0 0        (13.14)
         0 0 0 1 0 0 0 0
         0 0 1 0 0 0 0 0
         0 1 0 0 0 0 0 0
         1 0 0 0 0 0 0 0
Therefore, with this $Z both the MOR and the MXOR instruction reverse the bytes
of $Y: the kth byte of $X from the left is set to the kth byte of $Y from the
right, for 1 ≤ k ≤ 8. If, on the other hand,
$Z = #00000000000000ff,
both MOR and MXOR set all the bytes of $X to zero, except the rightmost byte,
which becomes the OR or the XOR of all eight bytes of $Y.
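One way to read the MOR/MXOR semantics is the byte-selection rule just
described; the following C sketch implements that reading, with byte and bit
numbering chosen (an assumption of the sketch) so that the anti-diagonal $Z of
(13.14) indeed reverses the bytes of $Y.

#include <stdint.h>
#include <stdio.h>

/* MOR/MXOR as byte selection: byte j of the result combines, by OR or XOR,
   those bytes i of y for which bit i of byte j of z is 1.  Byte 0 and bit 0
   are the least significant here. */
static uint64_t mor(uint64_t y, uint64_t z, int use_xor) {
    uint64_t x = 0;
    for (int j = 0; j < 8; j++) {
        uint8_t zbyte = (uint8_t)(z >> (8 * j));
        uint8_t acc = 0;
        for (int i = 0; i < 8; i++) {
            if (zbyte & (1u << i)) {
                uint8_t ybyte = (uint8_t)(y >> (8 * i));
                acc = use_xor ? (uint8_t)(acc ^ ybyte) : (uint8_t)(acc | ybyte);
            }
        }
        x |= (uint64_t)acc << (8 * j);
    }
    return x;
}

int main(void) {
    uint64_t y = 0x9e3779b97f4a7c16ULL;
    uint64_t z = 0x0102040810204080ULL;   /* the anti-diagonal matrix (13.14) */
    printf("%016llx\n", (unsigned long long)mor(y, z, 0)); /* 167c4a7fb979379e: bytes of y reversed */
    return 0;
}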
Floating point operations. MMIX contains a complete implementation of the
famous IEEE/ANSI (Institute of Electrical and Electronics Engineers / American
National Standards Institute) Standard 754 for floating point arithmetic. Every
four word (octabyte) x determines a floating binary number f(x): the leftmost
bit of x is the sign (0 = '+', 1 = '−'), the next 11 bits are the exponent E,
and the remaining 52 bits are the fraction F. The value represented is:
• ±0.0, if E = F = 0 (zero)
• ±2^(−1074) F, if E = 0 and F ≠ 0 (denormal)
• ±2^(E−1023) (1 + F/2^52), if 0 < E < 2047 (normal)
• ±∞, if E = 2047 and F = 0 (infinity)
• ±NaN(F/2^52), if E = 2047 and F ≠ 0 (Not-a-Number)
The "short" floating number f(t) represented by a double word (tetrabyte) is
similar, but its exponent consists of only 8 bits and its fraction of only
23 bits; the normal case (0 < E < 255) represents ±2^(E−127) (1 + F/2^23).
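The field layout just described can be inspected with a short C program; it
merely splits an octabyte into sign, exponent E, and fraction F and classifies
the value according to the five cases above.

#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Split an octabyte into the sign, the 11-bit exponent E and the 52-bit
   fraction F of the floating binary number f(x) described above. */
static void decode(double d) {
    uint64_t x;
    memcpy(&x, &d, sizeof x);                 /* reinterpret the 64 bits */
    unsigned sign = (unsigned)(x >> 63);
    unsigned E    = (unsigned)((x >> 52) & 0x7ff);
    uint64_t F    = x & 0xfffffffffffffULL;
    const char *kind;
    if (E == 0)          kind = (F == 0) ? "zero"     : "denormal";
    else if (E == 2047)  kind = (F == 0) ? "infinity" : "NaN";
    else                 kind = "normal";     /* value = (-1)^sign * 2^(E-1023) * (1 + F/2^52) */
    printf("sign=%u E=%u F=%013llx  (%s)\n", sign, E, (unsigned long long)F, kind);
}

int main(void) {
    decode(1.0);          /* sign=0, E=1023, F=0: 2^0 * (1+0) */
    decode(-0.0);
    decode(1.0 / 3.0);
    return 0;
}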
• FADD $X, $Y, $Z (float add): f($X)←f($Y)+f($Z)
• FSUB $X, $Y, $Z (float subtract): f($X)←f($Y)−f($Z)
• FMUL $X, $Y, $Z (float multiplication): f($X)←f($Y)×f($Z)
• FDIV $X, $Y, $Z (float division): f($X)←f($Y)/f($Z)
• FREM $X, $Y, $Z (float remainder): f($X)←remainder of f($Y)/f($Z)
• FSQRT $X, $Z or FSQRT $X, Y, $Z (float square root): f($X)←f($Z)^(1/2)
• FINT $X, $Z or FINT $X, Y, $Z (float integer): f($X)←int f($Z)
• FCMP $X, $Y, $Z (float comparison): s($X)←[f($Y)>f($Z)]−[f($Y)<f($Z)]
• FEQL $X, $Y, $Z (float equality): s($X)←[f($Y) = f($Z)]
• FUN $X, $Y, $Z (float unordered): s($X)←[f($Y) ∥ f($Z)]
• FCMPE $X, $Y, $Z (float comparison with respect to ε):
s($X)←[f($Y)>f($Z) (f(rE))]−[f($Y)<f($Z) (f(rE))]
• FEQLE $X, $Y, $Z (float equality with respect to ε): s($X)←[f($Y)≈f($Z) (f(rE))]
• FUNE $X, $Y, $Z (float unordered with respect to ε): s($X)←[f($Y) ∥ f($Z) (f(rE))]
• FIX $X, $Z or FIX $X, Y, $Z (float number is transformed to fixed point
number): s($X)←int f($Z)
• FIXU $X, $Z or FIXU $X, Y, $Z (float number is transformed to fixed point
number without sign): u($X)←(int f($Z)) mod 2^64
• FLOT $X, $Z or FLOT $X, Y, $Z (fixed point number is transformed to float
number): f($X)←s($Z)
• PBNZ $X, RA (probable branch if nonzero): if s($X) ≠ 0 then set @←RA
• PBNP $X, RA (probable branch if nonpositive): if s($X) ≤ 0 then set @←RA
• PBEV $X, RA (probable branch if even): if s($X) mod 2 = 0 then set @←RA
A high speed computer usually works fastest when it can predict in advance
whether it will take a branch, because such knowledge helps it to look ahead
and prepare for the instructions to come. Therefore MMIX encourages the
programmer to indicate whether a branch is likely to be taken: whenever the
chance that the branch will be taken is greater than one half, the programmer
should use the probable branch instructions rather than the ordinary branch
instructions.
hides the local register that hasn’t been popped and restores its original
value.
• SAVE $X,0 (save process state): u($X)←contents
• UNSAVE $X (restore process state): contents←u($X)
The SAVE instruction stores all the current registers at the top of the
register stack in memory and puts the address of the topmost four word
(octabyte) into u($X). The register $X must be a global register, i.e., X must
be greater than or equal to rG. All the current local and global registers are
saved, along with rA, rD, rG, rH, rM, rR, and several registers that we have
not discussed so far. The UNSAVE instruction takes such a topmost four word
address and restores the corresponding contents; it essentially undoes what a
previous SAVE did. The value of rL is set to zero by SAVE and then restored by
UNSAVE.
MMIX has special registers called stack offset register (rO) and stack
pointer register (rS), they control the operations of PUSH, POP, and UN-
SAVE.
So far we have introduced the main MMIX instructions that operate on registers;
these are the instructions that will appear in the target code generated by a
compiler. For completeness, however, we will also list the instructions
intended for high speed and parallel operation of MMIX implementations, and the
instructions used for handling interruptions.
System considerations. Here we list the instructions that advanced users may
wish to use for high speed and parallel operation of the MMIX architecture.
These operations are similar to the probable branch instructions, in the sense
that they tell the machine how to plan ahead for maximal efficiency. Apart from
possibly using the instruction SYNCID, most programmers do not need these
instructions.
• LDUNC $X, $Y, $Z (load four word uncached): s($X)←s(M_8[A])
• STUNC $X, $Y, $Z (store four word uncached): s(M_8[A])←s($X)
These instructions perform the same operations as LDO and STO, but they also
inform the machine that the loaded or stored four word and its neighbors will
probably not be read or written in the near future.
• PRELD X, $Y, $Z (preload data)
It states that many of the bytes from M[A] to M[A+X] will probably be loaded or
stored in the near future.
• PREST X, $Y, $Z (prestore data)
It states that all of the bytes from M[A] to M[A+X] will definitely be written
(stored) before they are next read (loaded).
• PREGO X, $Y, $Z (prefetch to go)
It states that many of the bytes from M[A] to M[A+X] will probably be used as
instructions in the near future.
• SYNCID X, $Y, $Z (synchronize instructions and data)
It states that all of bytes M[A] through M[A+X] must be fetched again
before being interpreted as instructions. MMIX is allowed to assume that a
program’s instructions do not change after the program has begun, unless
the instructions have been prepared by SYNCID.
• SYNCD X, $Y, $Z (synchronize data)
It states that all of bytes M[A] through M[A+X] must be brought up
to date in the physical memory, so that other computer and input/output
devices can read them.
• SYNC XYZ (synchronization)
By restricting parallel activities, this instruction allows different
processors to cooperate reliably.
• CSWAP $X, $Y, $Z (compare and swap four word bytes)
If u(M_8[A]) = u(rP), where rP is the special prediction register, set
u(M_8[A])←u($X) and u($X)←1; otherwise set u(rP)←u(M_8[A]) and u($X)←0. This is
an atomic (indivisible) operation, used when a number of computers share a
common memory.
• LDVTS $X, $Y, $Z (load virtual translation status)
This instruction is provided only for the operating system; the details are
omitted.
13.3.11 Interruptions
The normal flow of control from one instruction (double word) to the next may
be changed not only by jump and branch instructions but also by unpredictable
events such as overflow or external signals. Real computers must also cope with
matters such as security violations and hardware failures. MMIX distinguishes
two kinds of interruptions: trips and traps. A trip sends control to a trip
handler that is part of the user program; a trap sends control to a trap
handler that is part of the operating system. When MMIX performs arithmetic,
eight kinds of exceptional conditions may arise: integer divide check (D),
integer overflow (V), float-to-fix overflow (W), invalid floating operation
(I), floating overflow (O), floating underflow (U), floating division by zero
(Z), and floating inexact (X). The special arithmetic status register rA
maintains current information about these conditions. The 8 bits of its
rightmost byte are called its event bits; they appear in the order DVWIOUZX and
are called D_BIT (#80), V_BIT (#40), ..., X_BIT (#01). The 8 bits to the left
of the event bits in rA are called the enable bits; they appear in the same
order DVWIOUZX. When an arithmetic operation is executed and one of these
conditions occurs, MMIX consults the corresponding enable bit before proceeding
to the next instruction. If the enable bit is 0, the corresponding event bit is
set to 1; otherwise the machine trips to location #10
the virtual translation register rV, which defines the mapping from "virtual"
64-bit addresses to the actual physical locations installed in the memory.
These special registers help make MMIX a complete machine that can
realistically be constructed.
• GETA $X, RA (get address): u($X)←RA
This instruction uses the same conventions as the branch instructions to put a
relative address into register $X. For example, "GETA $0,@" sets $0 to the
address of the instruction itself.
• SWYM X, Y, Z or SWYM X, YZ or SWYM XYZ (SWYM means "sympathize with your
machinery")
This is the last of the 256 instructions, and fortunately it is the simplest.
It is usually called the empty operation (no-op), because it does nothing; it
merely keeps the machine running smoothly. Here X, Y, and Z are ignored.
Timing
We have mentioned that, when generating target code, we need to compare
different generation schemes to decide which one is better. Apart from
comparing the amount of memory they use, the other index is the time they
consume. In other words, we may compare target programs that solve, or were
generated for, the same problem, and see which one runs faster. Generally
speaking, however, such comparisons are not easy to carry out, because the MMIX
architecture can be implemented in many different ways. The running time of a
program depends not only on the clock period but also on the number of
functional units that can be active simultaneously and the degree to which they
are pipelined; it depends on the size of the random-access memory that provides
the illusion of 2^64 virtual bytes; and it depends on the sizes, replacement
strategies, and allocation strategies of the caches and other buffers, and so
on.
To be pragmatic, the running time of an MMIX program is often estimated on the
basis of what a high performance machine with a large main memory would
achieve: an approximate cost is assigned to every operation, and in this way
satisfactory estimates can be given. We suppose that each operation takes an
integer number of units υ, where υ represents the clock cycle time of a
pipelined implementation. Even though the value of υ keeps decreasing as
technology advances, we can always keep up with the latest value instead of
measuring in nanoseconds. In our estimates we will also assume that the running
time depends on the number of memory references (mems) that a program makes,
that is, the number of load and store instructions executed. For example, we
will assume that each LDO (load four word) instruction costs μ + υ, where μ
represents the average cost of a memory reference. The total running time of a
program might then be reported as, say, 35μ + 1000υ, meaning 35 memory
references plus 1000 time units. For many years the ratio μ/υ has been
gradually increasing, and nobody knows whether the trend will continue; but
experience has shown that the values of μ and υ
When generating target code we could, of course, directly generate MMIX
instructions, but then many of the details introduced above would be involved
and errors would easily creep in. Therefore we prefer to use the symbolic
language provided for MMIX, i.e., the assembly language called MMIXAL. It is an
extension of the mnemonic notation of the instructions. By using it, MMIX
programs can be written and read more easily, and the programmer does not need
to worry about the tedious details that often lead to errors. Its main
characteristics are that it uses mnemonic symbols to represent numbers, and
that it uses a label field to associate names with memory locations and
register numbers.
We could introduce MMIXAL by listing each of its conventions, but if we did so
the reader would still not know how to use it. Therefore we prefer a simple
example in which various symbols of MMIXAL occur. The following code is part of
a larger program; it finds the maximum of n elements x[1], ..., x[n]. The idea
of the algorithm is this: assume at first that the maximal element is the last
one, and keep the current maximum in a special location; then compare it in
turn with the elements to its left. If it is greater than an element, it stays
where it is; if it is smaller, the winner of the comparison takes its place and
becomes the new occupant of the location. When all n elements have been
examined, the one residing in the location is the maximal value.
Program M (find the maximum). At the beginning, n is in register $0, and the
address of x[0] is in register x0 (a global register that is defined
elsewhere).
Assembly code        Row  Label    Operator  Expression  Times  Note
                     01   j        IS        $0                 j
                     02   m        IS        $1                 m
                     03   kk       IS        $2                 8k
                     04   xk       IS        $3                 X[k]
                     05   t        IS        $255               temporary storage
                     06            LOC       #100
#100: #39 02 00 03   07   Maximum  SL        kk,$0,3     1      M1. Initialization. k←n, j←n
#104: #8c 01 fe 02   08            LDO       m,x0,kk     1      m←X[n]
#108: #f0 00 00 06   09            JMP       DecrK       1      goto M2 with k←n−1
#10c: #8c 03 fe 02   10   Loop     LDO       xk,x0,kk    n−1    M3. Comparison
#110: #30 ff 03 01   11            CMP       t,xk,m      n−1    t←[X[k]>m]−[X[k]<m]
#114: #5c ff 00 03   12            PBNP      t,DecrK     n−1    If X[k]≤m, then goto M5
#118: #c1 01 03 00   13   ChangeM  SET       m,xk        A      M4. Change m. m←X[k]
#11c: #3d 00 02 03   14            SR        j,kk,3      A      j←k
#120: #25 02 02 08   15   DecrK    SUB       kk,kk,8     n      M5. Decrease k. k←k−1
#124: #55 02 ff fa   16            PBP       kk,Loop     n      M2. All tested? If k>0, return to M3
#128: #f8 02 00 00   17            POP       2,0         1      Return to the main program
This program also shows the use of the relevant notations of MMIXAL.
1) The columns "label", "operator", and "expression" are the most interesting,
as they contain the program written in MMIXAL, the symbolic machine language.
2) The column "assembly code" lists the actual numerical machine language that
corresponds to the MMIXAL program. The translation is usually done by the
so-called assembler, or by another version of the assembler that is another
the maximal value into $1 and storing its address into $2, it will return to
instruction “STO $1, Max”.
Now let us look at a complete program, instead of only a subprogram. We call
the following program "Hello" because the message it prints is "Hello, world";
then it stops.
Program H (Hello)
Assembly code     Row  Label   Operator  Expression       Note
                  01   argv    IS        $1               argument vector
                  02           LOC       #100
#100: #8fff0100   03   Main    LDOU      $255,argv,0      $255←the name of the program
#104: #00000701   04           TRAP      0,Fputs,StdOut   print the name
#108: #f4ff0003   05           GETA      $255,String      $255←the address of ", world"
#10c: #00000701   06           TRAP      0,Fputs,StdOut   print the string
#110: #00000000   07           TRAP      0,Halt,0         stop
#114: #2c20776f   08   String  BYTE      ", world",#a,0   the string with line feed and
                                                          end symbol
#118: #726c640a   09
#11c: #00         10
NNIX sets up the string of every argument so that its characters start at a
four word (octabyte) boundary; in general, however, a string may start at any
position within a four word.
In line 03, the first instruction of program H puts the string pointer M_8[$1]
into register $255; the string is the name of the program, "Hello". Line 04 is
a special TRAP instruction that asks the operating system to write the string
pointed to by $255 to the standard output file. Similarly, lines 05 and 06 ask
NNIX to contribute ", world" and a line feed to the standard output. The symbol
Fputs is predefined to be 7, while the symbol StdOut is predefined to be 1.
Line 07, "TRAP 0,Halt,0", is the usual way to end a program; it is another
special trap instruction.
The string output by lines 05 and 06 is generated by the BYTE instruction in
line 08. BYTE is a pseudo-operator of MMIXAL rather than an operator of MMIX,
but it differs from pseudo-operators like IS and LOC in that it does assemble
data into storage. Generally speaking, BYTE assembles a list of expressions
into one-byte constants. The construct ", world" in line 08 is an abbreviation
for the seven one-character constants
",", " ", "w", "o", "r", "l", "d"
The constant #a in line 08 is the ASCII line feed character; when it appears in
a printed file it causes a new line. The final 0 in line 08 terminates the
string. Line 08 is therefore a list of nine expressions, and it produces the
nine bytes shown at the left of lines 08-10.
A summary of the language. Now that we have seen examples that demonstrate what
can be done in MMIXAL, it is time to discuss the rules more carefully, and
especially to investigate what cannot be done in the language. The following
few rules define the language.
1) A symbol is a string that starts with a letter and is followed by letters
and digits. For the purposes of this definition, the underline character "_" is
regarded as a letter, and all Unicode characters whose code value exceeds 126
are also regarded as letters. Examples: PRIME1, Data_Segment, Main, pâté.
The special constructions dH, dF, and dB, where d is a single digit, are
effectively replaced by unique symbols according to the convention of "local
symbols" explained above.
2) A constant is:
(1) A decimal constant. It consists of one or more decimal digits
carried out from left to right; hence a/b/c is (a/b)/c while a-b+c is (a-b)+c.
Example [5]: #ab<<32+k&~(k-1) is an expression. It is the sum of the term
#ab<<32 and the term k&~(k-1); the latter term is the bitwise AND of the main
terms k and ~(k-1). The latter main term is the complement of (k-1), a
parenthesized expression that is the difference of the term k and the term 1;
1 is also a main term, namely a decimal constant. If the symbol k happens to be
equivalent to #cdef00, then the whole expression #ab<<32+k&~(k-1) is equivalent
to #ab00000100.
Binary operations may be performed only on pure numbers, except in special
cases such as $1+2 = $3 and $3−$1 = 2. A future reference cannot be combined
with anything else; an expression such as 2F+1 is always illegal, because 2F
never corresponds to a previously defined symbol.
6) An instruction consists of three fields:
(1) LABEL field. It is either blank, or is a symbol.
(2) OP field. It is either an MMIX operator or a virtual operator of
MMIXAL.
(3) EXPR field. It is a list of one or more expressions, separated by
commas in between. The EXPR field may also be blank, in which case it is
equivalent to the expression 0.
7) The assembly of an instruction is carried out in the following three steps:
(1) If needed, the current location @ is aligned by increasing it to a
multiple of
8, if the operator is OCTA;
4, if the operator is TETRA or an operator of MMIX;
2, if the operator is WYDE.
(2) If there is a symbol in LABEL, it is defined as @, unless OP = IS or
OP = GREG.
(3) If OP is a virtual operator, see rule 8; otherwise OP is an MMIX
instruction and, as explained earlier in this chapter, the OP and EXPR fields
define a tetrabyte (double word), so @ is increased by 4. Some MMIX
instructions have three operands in the EXPR field, others have two, and a few
have only one operand.
If OP is, say, ADD, MMIXAL expects three operands and checks that the first and
second operands are register numbers. If the third operand is a pure number,
MMIXAL changes the operation code from #20 (add) to #21 (immediate add) and
checks that the immediate value is less than 256.
If OP is SETH, MMIXAL expects two operands; the first should be a register
number, and the second a number less than 65536.
An OP like BNZ takes two operands: a register number and a pure number. The
pure number should be expressible as a relative address, i.e., its value should
be expressible as @+4k where −65535 ≤ k ≤ 65536.
Accessing memory. An OP like LDB or GO has two forms: either two operands
$X, A, or three operands $X, $Y, $Z or $X, $Y, Z. The two-operand form may be
used when the memory address A can be expressed as the sum $Y+Z of a base
address and a one-byte value.
8) MMIXAL contains the following virtual operations:
(1) OP = IS. The EXPR field should be a single expression; if there is a symbol
in LABEL, it is made equivalent to the value of the expression.
(2) OP = GREG. The EXPR field is a single expression with a pure equivalent
value x. If there is a symbol in LABEL, it is made equivalent to the largest
register number that has not yet been allocated; moreover, when the program
starts, this global register will contain x. If x ≠ 0, then x is regarded as a
base address, and the program should not change the global register.
(3) OP = LOC. The EXPR field should be an expression with a pure equivalent
value x; the value of @ is set to x. For example, the instruction "T LOC
@+1000" defines T to be the address of the first of a series of 1000 bytes, and
advances @ to the byte after the series.
(4) OP = BYTE, WYDE, TETRA, or OCTA. The EXPR field should be a list of pure
expressions, each of which fits in 1, 2, 4, or 8 bytes, respectively.
9) MMIXAL restricts future references so that the assembly process can be
completed in a single pass over the program. A future reference is allowed only
in the following cases:
(1) in a relative address, such as the operand of JMP, or the second operand of
a branch, probable branch, PUSHJ, or GETA;
(2) in an expression that is assembled by OCTA.
MMIXAL also has some additional features related to system programming [6],
which we do not introduce here. A complete description of the details of the
language appears in the document MMIXware, together with a complete working
assembler.
Having chosen the language of the target code, our next task is to discuss how
to translate the various kinds of intermediate code into target code [7]. The
intermediate codes we have introduced include the reverse Polish form, the
triple form, the quadruple form, and the syntax tree form. No matter which one
is adopted, what we need to translate are expressions and statements, including
sequential statements, conditional statements, and loop statements.
No matter which intermediate code is used, the algorithm for translating it
into target code pushes the intermediate code to be processed onto a stack and
then takes the items out one by one to translate them; in the course of this
analysis the corresponding target code is produced.
In the following we use the reverse Polish form of expressions to illustrate
the translation procedure and to further explain our algorithm; in this way we
will not need to repeat the explanation later. In order to generate the target
code for expressions in reverse Polish form we need a stack, called the
component stack, which stores the intermediate syntactic tokens that constitute
the expression; its elements are denoted s[i]. The rough translation procedure
is as follows. The tokens of the token string are taken from the reverse Polish
area and pushed onto the stack. If the token scanned is a variable, then at the
same time as it is pushed onto the stack it is also loaded into a register,
ready for the operation to be performed. If the token scanned is an operator,
we check what kind of operator it is: a unary operator is applied to the single
operand preceding it, and a binary operator is applied to the two operands
preceding it. Similarly, an n-ary operator would be applied to the preceding n
components, but so far we have not met any n-ary operators. Hence, for the
translation of expressions in reverse Polish form, the essential step is to put
the variables into registers; when an operator is met, the corresponding
instructions then follow directly.
For example, suppose that the expression is (x ∗ y + w ∗ z) ∗ x − y/z;
then its reverse Polish form is
xy ∗ wz ∗ + x ∗ yz/ −.
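A toy C sketch of this stack-driven translation, applied to the reverse Polish
string just shown, is given below; the register numbering and the output
mnemonics (LOAD, ADD, ...) are illustrative placeholders rather than real
MMIXAL.

#include <stdio.h>
#include <ctype.h>

/* A toy translator for single-letter reverse Polish expressions.  Each
   variable is loaded into a fresh "register" when it is pushed; a binary
   operator pops its two operands and emits one instruction whose result
   stays in the first operand's register. */
static int stack[64], top = 0, next_reg = 1;

static void translate_rpn(const char *rpn) {
    for (const char *p = rpn; *p; p++) {
        if (isalpha((unsigned char)*p)) {
            int r = next_reg++;
            printf("LOAD  $%d,%c\n", r, *p);   /* put the variable in a register */
            stack[top++] = r;
        } else {                               /* binary operator: +, -, *, / */
            int rz = stack[--top];
            int ry = stack[--top];
            const char *op = (*p == '+') ? "ADD" : (*p == '-') ? "SUB"
                           : (*p == '*') ? "MUL" : "DIV";
            printf("%-5s $%d,$%d,$%d\n", op, ry, ry, rz);
            stack[top++] = ry;                 /* the result replaces the operands */
        }
    }
}

int main(void) {
    translate_rpn("xy*wz*+x*yz/-");   /* (x*y + w*z)*x - y/z */
    return 0;
}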
((x ∗ y + u) − w ∗ y)/x.
x ∗ (y + u ∗ v − y/w).
The other form of intermediate code is the syntax tree, which we introduced in
preceding sections [8]. Not only expressions but also the various kinds of
statements can be represented in syntax tree form. In order to translate
expressions given as syntax trees into MMIXAL target code, we can either
transform the syntax trees into reverse Polish form (which is just a post-order
traversal of the trees), or transform them into triples or quadruples and then
use the methods introduced above to translate those into MMIXAL target code.
Besides, we can also translate the syntax trees directly into the required
target code, visiting the nodes in post-order. For example, given the syntax
tree shown in Fig. 13.2, its target code in MMIXAL is as follows:
Problems
3) (*,u,(2))
4) (+,(1),(3))
5) (*,(4),x)
Problem 13.3 Translate the following intermediate code in quadruple form
into MMIXAL target code:
1) (+,x,y,T1 )
2) (*,x,y,T2 )
3) (/,T2 ,u,T3 )
4) (+,T1 ,T3 ,T4 )
5) (*,T4 ,X,T5 )
Problem 13.4 Write down the corresponding MMIXAL target code of the
following expression:
(x*y+w*(x+y)-u*y/x)*y
Problem 13.5 Write down the corresponding MMIXAL target code of the
following statement:
if x<5 then x := x+1 else x := x−8
Problem 13.6 Write down the MMIXAL target code that realizes the fol-
lowing function:
S = { x + 3x^2 + 5x^3,  if x < 0
    { x + 4x^2 + 6x^3,  if x ≥ 0
Problem 13.7 Write down the MMIXAL target program that realizes the
following statement:
while i ≤ 100 do s := s+3
Problem 13.8 Given the following program
void f(x, a, y)
float x[][4], a[][4], y[];
{
    int i, j;
    float s;
    for (i = 0; i < 3; i++) {
        s = 0;
        for (j = 0; j < 4; j++) {
            s = s + x[i][j];
            a[i][j] = x[i][j] * x[i][j];
        }
        y[i] = s;
    }
}
Translate it into MMIXAL target code.
Problem 13.9 Write down the corresponding MMIXAL target code of the
following relation expression:
a b V a > 0 ∨ b 0 ∧ (c d)
Problem 13.10 Write down the corresponding MMIXAL target code of
the following expression:
a∧b ∨ c ∧ (b ∧ y = 3.14 ∧ a = 2.18)
parallel and distributed languages are more important than functional and
logic programming languages. Therefore, as one of the frontiers in the field of
compiler principles, we will discuss the compilation of object-oriented
languages first, and subsequently we will discuss what we also regard as
frontiers. For example, in Chapter 16 we will briefly introduce the newest
technique, grid computing, and the compilation of the language (if any) used to
write programs for grid computing.
Object-oriented languages and procedure-oriented (or imperative) languages are
similar in many respects, apart from the fact that the target code they
generate may be at different levels: the former tend to generate lower-level
code, while the latter tend to generate assembly-level target code. Therefore,
in this chapter we introduce the special characteristics of object-oriented
languages and the corresponding special handling techniques in compilers.
hidden parameter.
1) Method identification. Before a call of an object's method can be
translated, it is necessary to identify which method is to be called. For
ordinary routine calls there is also an identification problem, but it is much
simpler and can be solved by semantic checking. For objects, in many cases the
method to be called must be found at run time by looking it up in a dispatch
table.
2) Message passing. Many people describe object-oriented programming as objects
plus message passing, because object-oriented programming is data-centered. The
basic units of activity are objects, and all the activities of objects are
message-driven; therefore message passing is the only means of communication
between objects.
Methods, as we have said, are the operations on objects, and they are invoked
via message passing. Therefore the so-called method identification is actually
message identification: when some operation is to be performed on an object, a
message is sent to the object; upon receiving the message and deciding which
operation it requires, the corresponding method is called and performed to
completion. From that point on, there is no difference from an ordinary routine
call.
In most object-oriented languages objects also have constructor functions and
destructor functions, which are called when objects are created and released,
respectively. Their calls do not differ in principle from the calls of other
methods.
Suppose that our objects are just as described. Then the compilation of objects
is very simple. Suppose that we have a class A that has methods m1 and m2, as
well as fields a1 and a2. Then the run-time representation of an object of
class A consists of the fields a1 and a2, as shown in Fig. 14.1.
In addition, at compile time the compiler maintains the method table of class
A, as shown in Fig. 14.2.
Fig. 14.1 The fields of object. Fig. 14.2 The method table of object.
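In C-like terms the situation might be sketched as follows; the method names
m1_A_A and m2_A_A follow the naming convention used later in this chapter, and
everything else is illustrative.

#include <stdio.h>

/* Run-time object record of class A: just the fields (Fig. 14.1). */
struct class_A { int a1, a2; };

static void m1_A_A(struct class_A *this_) { printf("m1 on %d\n", this_->a1); }
static void m2_A_A(struct class_A *this_) { printf("m2 on %d\n", this_->a2); }

typedef void (*method_t)(struct class_A *);

/* Compile-time method table of class A (Fig. 14.2): method -> code address. */
static method_t method_table_A[2] = { m1_A_A, m2_A_A };

int main(void) {
    struct class_A obj = { 1, 2 };
    method_table_A[1](&obj);          /* "call method m2 on obj" */
    return 0;
}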
There are innumerable objects in the world, each with its own characteristics;
it would be impossible to investigate them all individually. Fortunately,
besides their peculiarities, many objects have common characteristics, and
through these common characteristics we may group such objects together. We can
then concentrate on investigating these classes, which makes our study much
more convenient.
We take the division of books as an example. Fig. 14.3 shows the division
of books.
In the hierarchical structure of classes, the lower nodes all have the
characteristics of the upper nodes, but in comparison with the upper nodes they
also have new characteristics of their own. That the lower nodes possess the
characteristics of the upper nodes is called inheritance, and inheritance is
one of the most important characteristics of objects.
14.3.1 Inheritance
is also called type extension: class B can extend class A with zero or more
additional fields and methods. Class A is called the parent class of B, and
class B is called a child class of A. Now suppose that class B extends class A
with a method m3 and a field b1; then the representation of class B at run time
is as shown in Fig. 14.4, and Fig. 14.5 shows the method table of class B at
compilation time.
Method overriding affects the compile-time tables. In the example above, class
B redefines the method m2, which was already declared and defined in class A.
Hence the method table of class A is as shown in Fig. 14.6, and the method
table of class B is as shown in Fig. 14.7.
Fig. 14.6 Method table of class A. Fig. 14.7 Method table of class B.
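Continuing the sketch above, the following C fragment shows one possible run-time representation of class B (the A-part as a prefix, followed by b1) and the two method tables of Figs. 14.6 and 14.7: B inherits m1, overrides m2 with m2_A_B, and adds m3_B_B. All names and types are illustrative assumptions.

    #include <stdio.h>

    struct class_A { int a1; int a2; };                   /* as in Fig. 14.1 */
    struct class_B { struct class_A A_part; int b1; };    /* A-prefix + b1   */

    typedef void (*method_t)(void *);                     /* generic method  */

    void m1_A_A(void *obj) { printf("m1_A_A, a1 = %d\n", ((struct class_A *)obj)->a1); }
    void m2_A_A(void *obj) { printf("m2_A_A, a2 = %d\n", ((struct class_A *)obj)->a2); }
    void m2_A_B(void *obj) { printf("m2_A_B, b1 = %d\n", ((struct class_B *)obj)->b1); }
    void m3_B_B(void *obj) { (void)obj; printf("m3_B_B\n"); }

    /* Method table of class A (Fig. 14.6) and of class B (Fig. 14.7):
       B inherits m1, overrides m2 with m2_A_B, and adds m3_B_B.         */
    method_t method_table_A[] = { m1_A_A, m2_A_A };
    method_t method_table_B[] = { m1_A_A, m2_A_B, m3_B_B };

    int main(void) {
        struct class_B b = { { 1, 2 }, 3 };
        method_table_B[1](&b);        /* "b.m2()" dispatches to m2_A_B */
        return 0;
    }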
14.3.3 Polymorphism
A link between a method call and the corresponding method body code is called a binding. As we have just mentioned, the same method name may apply to different objects on the class chain, which means that a method call may correspond to the method body code of different classes.
There are two modes of binding, viz. static binding and dynamic binding. Static binding means that at compilation time it is already known which class's method body should be called. For example, if the source program calls method m2 on an object and, from the example above, the candidates are m2_A_A and m2_A_B, then under static binding the callee is determined at compilation time; here it is m2_A_A.
Dynamic binding means that the correspondence between the method name and the method body code is established at run time. When the program runs, a piece of code is executed before the method body is invoked; according to the type of the object and its position on the class chain, this code determines which class's method body should be called — in the example above, m2_A_A or m2_A_B. Dynamic binding therefore involves the following cases:
1) There are multiple types of A (polymorphism); all the classes on the class chain can be regarded as A. In the example above there are two A's: one is the “true” A, the other is the A “embedded in B”. For the true A we must use m2_A_A, and for the A embedded in B we must use m2_A_B. Therefore the code that the compiler generates for the method call needs to discern whether the object is an A or a B, based on the dynamic type information.
2) Method m2_A_B needs a pointer to B so that it can access all the fields of B; if there are several subtypes, every subtype needs such a pointer. However, the call may be made through a pointer whose static type is pointer-to-A. Hence we need another operation, called (re)subtyping, which reconstructs a pointer to B from a pointer to A. For example, the method call p → m2(A), where p is a pointer of static type pointer to class A, can be translated as follows:
switch (dynamic_type_of(p)) {
    case Dynamic_class_A: m2_A_A(p); break;
    case Dynamic_class_B:
        m2_A_B(convert_ptr_to_A_to_ptr_to_B(p));
        break;
}
Here the dynamic type information is an enumeration type with the two values Dynamic_class_A and Dynamic_class_B. When the type of p is known statically, the call p → m2(A) may be translated directly into a call of m2_A_B.
Notice that this code is consistent with the declarations void m2_A_A(class A *this, int i) and void m2_A_B(class B *this, int i).
Apart from the method described above, the following approach may also be used. The switch statement that selects the routine to be called is essentially a function over a small range (the dynamic type), so it may be evaluated in advance. After this evaluation, we move the conversion of the pointer to A into a pointer to B inside the routine m2_A_B itself; the routine then accepts a pointer to A:
void m2_A_B(class A *this_A, int i) {
    class B *this = convert_ptr_to_A_to_ptr_to_B(this_A);
    /* the body of method m2_A_B;
       it accesses any object field x through this -> x */
}
More generally, in the translation of every method m_x_y, the first parameter is a pointer to class x. That pointer may be converted to a pointer to class y by applying convert_ptr_to_x_to_ptr_to_y(). If x and y are the same, the conversion may be omitted.
With this modification of m2_A_B(), the method call p → m2(A) may be translated to the following form:
(dynamic_type_of(p) == Dynamic_class_A ? m2_A_A : m2_A_B)(p);
The expression above computes a function, which is then called with p as its parameter, where p is a pointer of static type pointer to class A. Rather than evaluating this expression every time an operation is executed on p, the resulting routine address can be merged into the dynamic type information of p. The type information of objects of class B then becomes a record with three selectors
Notice the operations of supertyping and subtyping: the former converts a pointer to a child class into a pointer to its parent class, while the latter converts a pointer to a parent class into a pointer to a child class.
This order is correct for objects of class E and for “C in E”, but how about “D in E”? The compiler must determine the representation of objects of class D when it compiles class D, without being aware of C and E. Suppose that an object of D consists of the pointer to its delivery table, followed by the fields of D. This layout will not work when D is embedded in an object of type E: the fields that D inherits from A then precede the delivery-table pointer by some distance, while the fields of D follow it. Consequently, when the generated code accesses the fields of objects, the compiler may be unable to locate them. The problem is usually solved with run-time descriptors. For the pointer that points to the object itself, the descriptor allows the methods of the object to find the fields of the object: for every field, the descriptor must record the offset from the position the object pointer currently points to. Since we can enumerate the fields of the objects, the enumeration indices can serve as indices into this offset table. So, in addition to the delivery-table pointer, the representation of the object must contain a pointer to the offset table. Because it is unknown in advance which classes will be involved in multiple inheritance, all objects should follow this two-pointer scheme.
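The following minimal C sketch illustrates the idea just described: field access goes through a per-class offset table, so generated code can find a field even when its distance from the current object pointer is not fixed. The enumeration names, the word type, and the layout are assumptions for illustration; the offsets are taken from the example below.

    #include <stdio.h>

    /* Hypothetical field enumeration for class E; the enumeration indices
       are used as indices into the offset table.                          */
    enum { FIELD_a1, FIELD_a2, FIELD_c1, FIELD_c2, FIELD_e1, NUM_FIELDS };

    /* Offset (in words) of each field, counted from the pointer through
       which the object is currently viewed (here: the object pointer of E,
       using the example offsets 2, 3, 4, 5 and 9 from the text).           */
    static const int offsets_E[NUM_FIELDS] = { 2, 3, 4, 5, 9 };

    typedef long word;   /* an object is just an array of words in this sketch */

    /* What the generated code does to reach a field: indexed access through
       the offset table instead of a fixed compile-time offset.              */
    word *field_addr(word *obj, const int *offset_table, int field) {
        return obj + offset_table[field];
    }

    int main(void) {
        word object_E[10] = { 0 };   /* words 0 and 1 would hold the delivery-table
                                        and offset-table pointers                   */
        *field_addr(object_E, offsets_E, FIELD_c1) = 42;
        printf("c1 = %ld\n", *field_addr(object_E, offsets_E, FIELD_c1));
        return 0;
    }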
Take Fig. 14.13 as an example, and suppose that the size of every pointer and field is 1. Then, measured from the object pointer of E, the offset of field a1 is 2, the offset of a2 is 3, the offsets of c1 and c2 are 4 and 5, and the offset of e1 is 9. Therefore we have the index (offset) table of class E:
2 3 4 5 8 9
−4 −3 −2
Notice that for an object of class E, both m1 and m3 are ambiguous. When they are applied to an object of class E, the rules of the language or the programmer must specify clearly which m1 and which m3 are meant.
Because multiple inheritance causes complexity and adds overhead to method calls, some languages (e.g., Java) do not allow it.
Problems
References
Chapter 15 Compilation of Parallel Languages
This chapter is rather different from the previous ones, because so far there is no truly new, widely used parallel programming language; what people have are only parallel facilities grafted onto existing languages. The discussions so far have mainly focused on the compilation of sequential programming languages; in this chapter and the following one we discuss the compilation of parallel languages. Is this important? The answer is definitely yes. In the new century parallel computers will no doubt dominate the computer market, so developing programs for this kind of computer will be a must for anyone who wants to be a programmer of the new century; and, of course, he or she also needs to know how such programs are compiled by the corresponding compilers. It is no exaggeration to say that parallel compilation will become a main theme in the compiler field.
The exploration and pursuit of science and technology never cease, and this is also true of computer science and technology. In 1991 the U.S.A. proposed the plan for developing high performance computing and communication (HPCC), and many countries around the world correspondingly invested huge funds in the development of high-performance supercomputers. So far, the United States has successfully developed supercomputers with speeds of about one thousand trillion operations per second, and China has recently announced that its scientists and engineers have also successfully developed a supercomputer with a speed of one thousand trillion operations per second. Other countries, such as Germany, Japan, and Russia, are also making efforts to develop computers of this kind.
The speed of single-processor computers has almost reached its limit — the information flow inside such a computer cannot exceed 300 thousand kilometers per second. Hence, to develop high-performance supercomputers the only approach is parallelism — building clusters of large numbers of computers working in parallel. For example, if each unit of a cluster has a speed of 10 million operations per second, and ten thousand units are connected and assembled well, then the speed of the cluster will be on the order of a hundred billion operations per second. It may be said that high-performance computers are parallel in structure. Parallel processing modes have gone through several phases, from single instruction stream, multiple data stream (SIMD) machines, parallel vector processors (PVP), memory-sharing symmetric multiprocessors (SSSM), and massively parallel processors (MPP) to clusters. These parallel architectures may roughly be divided into five classes [1]:
1) Single instruction stream, multiple data stream (SIMD) array processors: They consist of thousands of processors with very simple functions. Data flow through the processors in certain patterns and are processed on the way. SIMD parallel computers played an important role in stimulating the development of parallel computers. However, with the development of microprocessor chips, SIMD parallel computers used in scientific and technological computation had basically retreated from the stage after the 1990s.
2) Parallel vector processors: In addition to scalar registers and scalar function units, vector processor computers have special vector registers and pipelined vector function units that can handle vector computations quickly.
3) Shared-memory multiprocessor systems: Multiple processors share one centralized memory and possess special multiprocessor synchronization and communication components that support the exploitation of data parallelism and control parallelism. But when there are too many processors, the channels that link each processor with the central memory become a bottleneck, which constrains the growth of computers of this kind. People then turned to investigate large-scale parallel computers.
4) Distributed-memory multiprocessor systems: They are computers composed of many nodes. Each node has its own processors and memory, and an interconnection network links the nodes. They mainly support the exploitation of data parallelism, as well as control parallelism.
5) Computer cluster systems: They are sets of computer nodes physically connected to each other by high-performance networks or local networks. Usually each computer node is a symmetric multiprocessor server, a workstation (WS), or a PC. The nodes may be homogeneous or heterogeneous. The number of processors generally is several
bodies and the extremely large distances between them (generally measured in light years), the size of the computation is hard to imagine, and it is easy to imagine that parallel computation can be applied to it. As another example, in order to increase the precision of numerical weather forecasts, it is estimated that, based on longitude, latitude, and the atmosphere layers, at least 200 × 100 × 200 = 4 000 000 grid nodes need to be taken into account, and the parallel computation is then performed on these grid nodes.
2) Data-intensive applications: numerical libraries, data warehouses, data mining, visual computation, etc., all involve massive amounts of data. Data-intensive processing also entails high-level parallel computation.
3) Network-intensive applications: First we explain the concept of network parallel computation. The combination of computer science and technology with communication technology is driving computer systems toward networks. Various technologies make computer networks increasingly wide-area, international, broadband, low-delay, multimedia, synthesized, and intelligent. Thereby the concept of network parallel computation has been proposed. So-called network parallel computation is based on the computation of individual computers, plus message passing between computers, to form computation at a higher level. It makes full use of the conditions provided by high-speed information networks (the so-called information highway), implementing resource sharing, inter-communication, and message-passing services. Hence network-intensive computation is computation performed in such an intensive network environment.
From the viewpoint of implementation, parallel computations can be classified as:
1) Synchronous parallel computation: two or more computations are required to begin at the same time and also to finish at the same time.
2) Asynchronous parallel computation: the computations are not required to begin and finish at the same times; it is only required that the parallel computations keep up the necessary and normal communication and information interchange.
We can also classify parallel computation by the memory it uses:
1) Shared-memory parallel computation: Under this mode, several processors (each of which may or may not have its own memory) share a memory. In this case, access to the shared memory may be constrained in some ways.
2) Distributed-memory parallel computation: Under this mode, there is no such constraint on memory access. But a data item is likely to have several copies, so the effective total amount of memory may be reduced.
In the following we introduce practical applications of parallel computation in more detail, in order to give the reader a deeper impression of the concept. Many challenging topics in application fields call for parallel computation. For example, in the field of magnetic recording technology, there is research on the computation and simulation of static magnetism and mutual induction with the aim of reducing the high noise of graduated discs;
into 8 pairs to be compared; they can then be formed into 4 groups with each group containing 4 numbers. Subsequently, the sort is performed in parallel on two groups, each containing 8 numbers. It can be seen that the size of the granules changes as the program proceeds. Therefore, even if there is a parallel language that can explicitly express the parallelism, the programmer still needs to use his or her own professional talent to really find the parallelism and to express it correctly.
Explicitly expressed parallelism usually takes one of two forms, viz. cobegin . . . coend and parbegin . . . parend; the ellipsis between the two keywords indicates that there may be arbitrarily many parallel program segments. In this case, the responsibility for finding and organizing the parallelism falls on the programmer. The work of the compiler is then rather simple: what it needs to do is to assign the programs or program segments that are to run in parallel to different processors. This can be done in master/servant (master/worker) fashion, in which the parallel computation consists of one controller (the master process) and several subordinates (also processes). The responsibilities of the master process are to maintain the global data structures, to partition the tasks, and to handle the interface with the user, including receiving tasks, initiating the computation, and collecting the results. The responsibility of each subordinate is to carry out the computation assigned to it, including local initialization, computation, communication between modules, and finally returning its result to the master process.
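As an illustration of how explicitly marked parallel segments can be mapped onto processors, here is a small sketch using POSIX threads (one possible target, not the only one): each segment between cobegin and coend becomes a thread, and coend becomes a join. The segment bodies are placeholders.

    #include <pthread.h>
    #include <stdio.h>

    /* Placeholder program segments that the source marked as parallel. */
    void *segment1(void *arg) { (void)arg; printf("segment 1 running\n"); return NULL; }
    void *segment2(void *arg) { (void)arg; printf("segment 2 running\n"); return NULL; }

    int main(void) {
        pthread_t t1, t2;

        /* cobegin */
        pthread_create(&t1, NULL, segment1, NULL);
        pthread_create(&t2, NULL, segment2, NULL);
        /* coend: wait until every parallel segment has finished */
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);

        printf("after coend\n");
        return 0;
    }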
Hidden (implicit) parallelism in programming languages is a more difficult and challenging field. Here, the determination and partitioning of the parallel tasks fall on the compiler. Working from the source program, the compiler has to find out which parts can be executed in parallel and then organize them so that the program can run in parallel. The compiler also has to generate the communication code automatically.
In the rest of this section we introduce five types of parallel programming. Then in Section 15.4 we concentrate on hidden parallel programming, mainly exploring instruction-level parallelism (ILP), and especially how target code is generated for very long instruction word (VLIW) and superscalar machines.
A process may create many processes to obtain parallelism. The address spaces of the different processes overlap at least partly, so many processes may access the same variable. This is what we call a shared variable. A shared variable may be read or written by many or even all processes, and it becomes an important mechanism of communication.
In this type of programming, an important issue is synchronizing access to shared variables. Synchronization has been explained before, but now we need to introduce the concept of mutual-exclusion synchronization. Let us look at an example first. Suppose that two processes concurrently execute the following program in an attempt to increase the value of the shared variable X:
X: shared integer;
X := X + 2;
Suppose that the initial value of X is 6. Obviously, if both processes add 2 to X, the value of X should become 10. But what may really happen? If the two processes both read 6 initially, each adds 2 and writes the result back to X, then the result is 8, because the second addition of 2 is lost. This is obviously not what we expect. The reason is that the two processes performed the addition simultaneously. In order to avoid this situation, we adopt mutual-exclusion synchronization.
Mutual-exclusion synchronization means that at any given time only one process may access the shared variable. To implement this simple form of synchronization, the primitive operations may use a lock variable. A lock variable has two indivisible operations: acquire and release. Indivisible means that each operation is an integral, atomic action. After some process acquires the lock variable, other processes cannot acquire it until that process releases it; the situation is as if the other processes were locked out by the lock variable. When the process completes its task, it releases the lock, and the other processes may then compete to acquire it. But again only one process can acquire the lock and continue its execution. Therefore the function of the lock variable is to enforce the constraint that at any one time only one process can access the shared data structure. With a lock variable, the example above may be written as:
X: shared integer;
X-lock: lock;
Acquire-lock(X-lock);
X := X + 2;
Release-lock(X-lock);
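For comparison, the same protected update can be written with POSIX threads, where pthread_mutex_lock and pthread_mutex_unlock play the roles of Acquire-lock and Release-lock. This is only a sketch of the idea; the shared variable and the two processes are modelled here as threads.

    #include <pthread.h>
    #include <stdio.h>

    int X = 6;                                           /* shared variable */
    pthread_mutex_t X_lock = PTHREAD_MUTEX_INITIALIZER;  /* lock variable   */

    void *add_two(void *arg) {
        (void)arg;
        pthread_mutex_lock(&X_lock);      /* Acquire-lock(X-lock) */
        X = X + 2;
        pthread_mutex_unlock(&X_lock);    /* Release-lock(X-lock) */
        return NULL;
    }

    int main(void) {
        pthread_t p1, p2;
        pthread_create(&p1, NULL, add_two, NULL);
        pthread_create(&p2, NULL, add_two, NULL);
        pthread_join(p1, NULL);
        pthread_join(p2, NULL);
        printf("X = %d\n", X);            /* always 10, never 8 */
        return 0;
    }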
Now we have ensured that at any given time only one process executes the operation of adding 2 to X. The problem with this approach is that it is inefficient and error-prone: if somewhere a statement is not protected properly, the program will fail. A more structured, higher-level solution to mutual
The parallel programming model with shared variables has a problem: it is based on machines with physically shared memory. This model is not suitable for multicomputer systems, because multicomputer systems have no shared memory that could hold shared variables. Therefore another model of parallel programming has been proposed — the message passing model. The message passing model is suitable not only for multicomputer systems without shared memory but also for computer systems with shared memory.
In parallel computer systems using the message passing model, the data exchange between processes is realized through message passing. This can be done by two primitive operations, send and receive, used as follows:
process1:   send(process2, message);
process2:   receive(process1, message);
This represents process1 sending a message to process2. The format of the message depends on the programming system. At the low level, a message is usually a byte array; in a high-level language, a message may be similar to a record value with fields of different types. After process2 calls receive( ), it stays blocked until the message arrives.
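The message passing library MPI realizes this model in C; the following sketch mirrors the send/receive pair above, with process ranks 0 and 1 standing in for process1 and process2. It is one possible realization, not the notation used in this section.

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char *argv[]) {
        int rank, message;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) {                  /* plays the role of process1 */
            message = 42;
            MPI_Send(&message, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {           /* plays the role of process2 */
            MPI_Recv(&message, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);  /* blocks until the message arrives */
            printf("process2 received %d\n", message);
        }

        MPI_Finalize();
        return 0;
    }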
There are many basic models of send/receive. The problem that needs to be solved is how to establish the connection between the send side and the receive side. Obviously this requires that both sides know the address and communication mode of their counterpart. A more flexible mode is to let the receive side receive messages sent by any process in the program; for example, on the Internet one can receive messages sent by anyone. This mode is very useful when the receive side does not know in advance who will send messages to it, but its disadvantage is that junk messages may be received. Another way to increase flexibility is to use indirect process names rather than naming processes directly. For example, the port names that many languages provide realize this function: the send side just puts the message into a port, and any process that issues a receive statement on that port may receive the message. The matching of send side and receive side is done by the system.
Another problem is when the send side may continue after sending. In asynchronous message passing, the send side and the receive side proceed at their own pace, so the send side may continue as soon as it finishes the send operation. In synchronous message passing the requirements are stricter: the sender continues only when it is certain that the message has safely arrived at its destination. This mode has its own advantages.
In some languages the receive side can control the type and the length of the messages it receives. For example,
receive print(size, text) such that size < 4096;
// only print messages with length less than 4096 can be received
or
receive print(size, text) by size;
// print messages are received in increasing order of size
What we introduced above is receiving messages explicitly, by a receive statement. Corresponding to this there is another mode, called hidden (implicit) reception, which creates a new thread for every message received.
What is a thread? Simply speaking, a thread is a lightweight sub-process that possesses its own program counter and stack. We explain the concept further. The most basic kind of process is the sequential process, that is, the activity resulting from the execution of a program on a sequential computer. A process consists of program and data, both of which reside in memory; it also contains a program counter and a stack. The program counter points to the instruction currently being executed, while the stack records the calling order of nested functions. We call sub-processes that possess their own program counter and stack threads. Fig. 15.1 shows the relation between a process and its threads.
In Fig. 15.1 there are two threads, thread 1 and thread 2. They execute the same code of process p. Each thread is identified by its own program counter (pc1 and pc2) and its own calling stack (s1 and s2). The data area of the process is shared by the two threads. This type of thread was used as early as the 1980s, by the Mesa language; its form was simple, and that is why it is called “lightweight”. We still use the name. A thread executes in the context of a process: it has no address space of its own, but it can access the address space of the process in which it resides. In hidden reception, a thread executes a message handler, a routine defined by the programmer to handle the various types of messages.
Therefore a new thread is created for every message received, and it terminates when the message handler completes. These threads can access the global variables of the process.
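A sketch of hidden reception with POSIX threads: a hypothetical runtime routine on_message_received() creates a new thread for each incoming message, and the thread runs the programmer-defined message handler and then ends. The message structure and the routine names are assumptions for illustration.

    #include <pthread.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    /* Hypothetical message type; in a real system it would arrive over the net. */
    struct message { char text[64]; };

    void *message_handler(void *arg) {          /* programmer-defined handler   */
        struct message *m = arg;
        printf("handling message: %s\n", m->text);
        free(m);
        return NULL;                            /* thread ends with the handler */
    }

    /* Called by the runtime whenever a message is received: it creates a
       new (detached) thread to run the handler for this message.          */
    void on_message_received(const char *text) {
        struct message *m = malloc(sizeof *m);
        strncpy(m->text, text, sizeof m->text - 1);
        m->text[sizeof m->text - 1] = '\0';
        pthread_t t;
        pthread_create(&t, NULL, message_handler, m);
        pthread_detach(t);
    }

    int main(void) {
        on_message_received("print(1024, \"hello\")");
        on_message_received("print(2048, \"world\")");
        pthread_exit(NULL);    /* let the handler threads finish */
    }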
In hidden reception, a process may thus activate multiple threads. Threads can also be used for other purposes. For example, if a process needs to send a request to a remote process, it can create a thread to send the message and wait for the result; in the meantime the original process (e.g., the main program, which can be regarded as a thread too) can continue its work. The thread has become an important concept in parallel programming systems.
We note here that threads execute in a pseudo-parallel mode: each thread executes sequentially on the same processor as its process. This is also called concurrent execution. In addition, in a shared-memory multiprocessor system several processors can be assigned to one process, and in this way the threads can execute in parallel in the real sense.
Shared variables and message passing are low-level models of parallelism; they directly reflect the shared-memory and distributed-memory architectures. Many other parallel programming languages are designed on the basis of more abstract models. Such languages include parallel functional, logic, and object-oriented languages. Here we only sketch the situation of object-oriented languages.
The important idea of object-oriented languages is to “encapsulate” data into objects; the data in an object can only be accessed via the operations (or methods) defined for the object. As we introduced in Chapter 14, other important constructs include classes, inheritance, and polymorphism. The biggest advantage of object-oriented programming is that it supports writing well-structured programs. From a technological point of view, it is suitable for writing large-scale programs and makes reusable software packages possible. Whether for parallel programming or for sequential programming, this advantage is extremely important, and this is why people are interested in parallel object-oriented languages.
Parallelism is usually introduced by letting a number of objects execute at the same time. It can also be introduced by allowing several processes to execute on the same processor at the same time. Communication between objects is expressed via operation requests: one object may invoke an operation of another object on the same processor or on a different processor. Operation requests are similar to message passing, but in terms of language semantics they are better integrated; meanwhile, there may be many alternative implementations. Many parallel object-oriented languages allow the internal activity of an object to consist of multiple threads of control. One common mode is to use one thread as the main process of an object and let that thread dynamically create an additional thread to handle each operation.
This manner realizes hidden reception. The synchronization of the threads may be expressed using monitors, and many parallel object-oriented languages are based on monitors.
Another method of obtaining a higher level of abstraction in the programming model is to use a suitable communication data structure; the tuple space (also called the meta array space) designed as part of the Linda system [2] is such a structure. Linda is a small set of simple primitive operations. These primitive operations may be added to an existing sequential language so that the language becomes a parallel language. The idea has been applied to many base languages to produce parallel languages; for example, C/Linda, Fortran/Linda, and Lisp/Linda are all parallel languages.
The tuple space is a kind of shared memory that is addressed associatively. It holds tuples, which may be regarded as records. No matter on which processor a process runs, these tuples can be accessed by all the processes in the program. In this sense the tuple space is a shared memory. However, the tuple space can also be realized efficiently on distributed-memory systems.
Three operations are defined on the tuple space:
1) out: adds a tuple to the tuple space;
2) read: reads a matching tuple from the tuple space;
3) in: reads a matching tuple from the tuple space and at the same time removes it from the tuple space.
For example, in C/Linda the call
out(‘‘item’’, 5, 3.12);
generates a tuple with three components (a string, an integer, and a floating-point number) and adds the tuple to the tuple space. The read( ) and in( ) operations search the tuple space for a matching tuple. Each component of the request is either an actual parameter (an expression, passed by value) or a formal parameter (a variable preceded by a “?”, which receives a value).
The “actual” and “formal” parameters here are different from the actual and formal parameters of procedural languages. For example, the call
float f;
in (‘‘item’’, 5, ?&f)
specifies that a matching tuple with three components is to be read from the tuple space: the first component is the string “item”, the second is the integer 5, and the third is a floating-point number. If the tuple found in the tuple space is exactly the (“item”, 5, 3.12) that was added with
out(‘‘item’’, 5, 3.12)
it will be read by the in operation and then removed from the tuple space. If, besides (“item”, 5, 3.12), the tuple space also contains (“item”, 5, 2.69) and (“item”, 5, 4.65), then several tuples match the request, and one of them is chosen arbitrarily. But if no tuple fits the requirement, then in( ) or read( ) blocks: the calling process is suspended until some other process adds a suitable tuple to the tuple space.
The primitive operations described above are atomic, i.e., indivisible: either an operation executes to completion, or it blocks and does not run. Therefore, if two processes attempt to execute an operation on the same tuple at the same time, only one of them will succeed; the other will fail and be blocked.
In Linda there is no primitive operation that modifies an existing tuple in the tuple space. If one wants to modify an existing tuple, the tuple must be taken out of the tuple space (through in( )), modified, and finally added back to the tuple space. If two or more processes execute this segment of code simultaneously, only one of them succeeds in the in operation; the rest are blocked until the tuple is put back into the tuple space.
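As an illustration of this read–modify–write pattern, the following C/Linda-style sketch updates a counter tuple. The in() and out() primitives (and the ? notation for formal parameters) are assumed to be provided by the Linda runtime; the exact C/Linda syntax may differ.

    /* Updating a "counter" tuple atomically, C/Linda style. */
    void increment_counter(void)
    {
        int value;

        /* Take the counter tuple out of the tuple space; if another process
           holds it at this moment, the call blocks until the tuple is back. */
        in("counter", ?&value);

        /* Modify the value locally and put the tuple back, making it visible
           to the other processes again.                                      */
        out("counter", value + 1);
    }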
The Linda model is similar to the shared-variable model; the difference is that the tuple space is addressed associatively. A tuple has no address; instead, read( ) and in( ) give a specification of the desired tuple, and the system matches the tuples against that specification.
to the region scheduling algorithm are added. Besides, special edges may be added to the graph to mark prohibited, illegal, or unsuitable code motions; these edges prevent illegal code movements, which would violate the control flow and for which the scheduler could not compensate.
3) Schedule these operations.
4) Either along the trace or in a sequential pass, carry out the necessary corrections. These corrections take the form of additional operations that compensate for transformations made illegal by the new positions of the scheduled operations. For example, an operation that should occur along some code path may have been moved elsewhere; in this case, a copy of the operation must be placed back on its original path. Such additional operations are called compensation (patching) operations.
5) Return to step 1) and repeat the steps above until no unscheduled code remains.
The steps above constitute the kernel algorithm of region scheduling; it was introduced by the original trace scheduling technique. In the last two decades other region scheduling algorithms have also appeared. They differ from the trace scheduling algorithm — i.e., from the kernel algorithm above — in various steps; generally speaking, the main differences lie in step 1), the selection of regions, and in step 3), the complexity of traversing the structure being scheduled.
Although region scheduling usually deals with regions of loop-free code, most of the algorithms handle loops in some manner. For example, many compilers unroll important loops in order to enlarge the size of the regions; but even then, the region that is really scheduled is still loop-free.
In order to schedule loops, a set of techniques called software pipelining was created. The earliest software pipelining technique was developed in the 1970s for the Fortran compiler of the CDC 6600, and it was used by IBM for the compilers of its high-performance CPUs during the same period. Under software pipelining, the operations of several loop iterations are organized into a new loop [3]. The new loop combines operations from different iterations into a single loop body, mixing operations of different iterations so as to fill the “gaps” (on different processors). At the entry and exit of the new loop, similarly to the entries and exits of regions in region scheduling, operations are placed in new positions that let the original loop iterations perform the work that a single iteration of the new loop cannot do.
Interestingly, some researchers proposed the idea of combining region scheduling and pipelining. Conceptually, the idea is to unroll the loop indefinitely and apply region scheduling until a clearly repeating schedule pattern emerges; this pattern is then used to form a new loop, establishing the software pipeline. The technique is called perfect pipelining.
In the following we discuss three problems concerning region scheduling:
1) The main region types.
2) Once a region type has been chosen, how the actual regions are formed from the source program.
3) The construction of the schedule.
Most region scheduling algorithms differ in how they define the shape of a region, and they are usually named after the shapes of their regions. Therefore we first introduce the most commonly used regions.
1) The basic block. The basic block is the degenerate form of a region. It generally consists of a sequence of statements with only one entry and one exit.
2) The trace. The trace is another type of region; it is built from basic blocks. A trace is a linear path through the program code.
Suppose that B0, B1, . . . , Bn are basic blocks of the program, given in this order. A trace formed from the operations of these basic blocks has the following properties:
(1) Every basic block is the predecessor of the following block in the sequence; that is, for k = 0, . . . , n − 1, Bk+1 is the successor of Bk, or Bk branches to Bk+1.
(2) The code of the trace does not contain a loop, except when the whole region is part of some enclosing loop; viz., for any i and j there exists no path Bj → Bi → Bj. A trace does not, however, prohibit forward branches within the region, or flow that leaves the region and returns to it later.
In Fig. 15.3, B1, B2, B3, B4, B5, B6, B7 are basic blocks, and B1 → B2 → B4 → B6 → B7 is a trace that crosses the region (the shaded part). Similarly, B1 → B2 → B5 → B7 and B3 → B4 → B6 → B7 are also traces.
3) Super blocks. A super block is actually also a trace, but with an extra constraint: apart from branches into the first block, branches into the other blocks are not allowed. A super block therefore consists of the operations of a series of basic blocks B0, B1, . . . , Bn and possesses the same properties as a trace. Properties (1) and (2) above hold for super blocks, so we have:
• Each basic block is the predecessor of the following block in the series: for each k = 0, 1, . . . , n − 1, Bk+1 follows Bk or Bk branches to Bk+1.
• A super block contains no loop, except when the whole region is part of some enclosing loop; viz., for any i and j there exists no path Bj → Bi → Bj.
But the super block has one more property:
• Within the region, apart from B0, no block is the target of a branch from outside the region.
In the literature on super block regions, such illegal branches are called side doors (side entrances).
4) Tree regions. A tree region is a tree-shaped region in the control flow of the program, built from basic blocks. A tree region consists of the operations of a series of basic blocks B0, B1, . . . , Bn and has the following properties:
• Apart from B0, each block has exactly one predecessor; that is, the predecessor of Bj is some basic block Bi with i < j, and Bi is also the parent node of Bj. This implies that any path through the tree region forms a super block, i.e., a trace without side-door entries.
• For any i and j there exists no path Bj → Bi → Bj, except through B0; the region contains no loop, except when the whole region is part of a loop that surrounds it.
As with super blocks, the constraint of not allowing side doors may be removed by tail duplication and other expansion techniques. If a region has only one control-flow path, it is called a linear region. In this sense, traces and super blocks are linear regions, while tree regions are non-linear regions.
5) Other regions. Besides the regions described above, some experts have proposed other region shapes; for example, trace 2 in Fig. 15.3 is a non-linear region with one entry. It is a bit like the tree region, but without the side-door constraint; its implementation, however, is very difficult. The large block is also a region type: it is a region with a single entry and multiple exits, it has internal control flow, it is a variant of the super block containing some predicated operations, and it can be scheduled as a unit.
Once the shape of the regions has been determined, two problems arise: how to partition the program into regions of the chosen shape, and how to construct a schedule for them. The first problem is called region formation, and the second is called schedule construction.
We discuss the first one. In order to form regions, the whole control flow of the program must be divided into well-defined pieces that the schedule constructor can consume and manage. Hence region formation and the efficiency of schedule construction together are crucial for performance. Wisely selected regions will cover the control-flow graph of the program in such a manner that the execution of the program follows the paths the scheduled code predicted. The important thing, therefore, is to make sure that the schedule constructor knows what it has to deal with.
Region formation aims at finding parts of the program that can be executed simultaneously and combining them into the same region. If two parts of the program can be executed together but lie in different regions, instruction scheduling suffers. The designers of region formation are therefore confronted with three problems: (1) which parts of the program are executed frequently? (2) how does one know that two parts may be executed together? (3) how do the shape of the region and the two problems above interact?
The traditional answers to the first two problems either use profiles to measure the execution frequency of each program part or use heuristics to estimate it. Both the heuristic method and the profile-based method assign execution frequencies to program parts such as the nodes and edges of the control-flow graph. If heuristics are used, attention must be paid to how the statistics are collected and how they are maintained when the program is modified by different parts of the compiler. Much innovation has occurred in the last decade in profile types and in the techniques for collecting such statistics.
Once a set of statistics is available, the remaining problem is how to use them to form the regions.
Region formation usually involves selecting good regions from the existing control-flow graph. It may also involve copying parts of the control-flow graph to improve the quality of the regions; such copying certainly affects the length of the program, so many different algorithms and heuristics are applied to make the trade-off. Region formation must also deliver regions in a form that the schedule constructor can use effectively; this may require additional bookkeeping and program transformation.
The formation of regions is related to the context-free grammar of programs, and only at the level of the context-free grammar can the concept of a region be formalized, so that region formation can be clarified.
Recall the definition of a context-free grammar (CFG) discussed before, and the language L(G) accepted by a context-free grammar G. We have pointed out that most parts of all programming languages belong to context-free grammars. Therefore, most of the programs developed
Theorem 15.1 In a grammar in Greibach normal form, if in the production A → bα the string α contains nonterminals A and B that do not stand in the relation to each other, then each of them constitutes a different region, and the regions are disjoint.
Proof Suppose that we start from the start symbol S and that the production with S as its left part is S → aα. Then in α the first nonterminal, apart from the leading terminal, is B. We continue the derivation and must obtain a string whose elements are terminals and nonterminals; among them there are nonterminals that do not stand in the relation to each other. Therefore no conflict can happen, and they form different regions.
Theorem 15.2 In a context-free grammar, if in some production a nonterminal occurs twice or more on the right-hand side, then each occurrence forms a region of its own.
S→dBA|(CBA|aA|CcA|aB|CcB|a—CC
T→aB|cCB|a|Cc
A→+S
B→*T
C→aBAD|CBAD|aAD|cACD|aBD|cDC|aD
D→)
where each production can be numbered and the nonterminals are numbered in the order of their occurrences. According to the relation, the nonterminals that do not stand in the relation may form different regions, as shown in Fig. 15.4.
Theorem 15.3 In a conditional statement, the two branches may form regions that can be compiled in parallel; at run time the condition then decides which branch is executed.
Proof When the compilation proceeds, both branches need to be compiled so that at run time the program can decide which branch to execute. In the compilation phase the compilation of the two (or more) branches causes no conflict. Even in the running phase it is feasible to let the two branches start in parallel and then, once the condition has been decided, let only one of them continue.
For example, for mergesort we have the program flow graph shown in Fig. 15.5, in which the left and right parts enclosed by dashed lines form regions that can be compiled in parallel. After compilation, the execution of the program may start, as shown in Figs. 15.6 and 15.7.
Theorem 15.4 In Shellsort, the comparisons for the different increments can be compiled in parallel as disjoint regions.
Proof In a Shellsort with increments, say, 8, 4, 2, and 1, if the number of elements to be sorted is large, then at the start the pairs 1 and 9, 17 and 25, 33 and 41, . . . may be compared. These comparisons obviously may proceed in parallel, both in compilation and in execution. Similarly, when the increment becomes 4, the pairs 1 and 5, 9 and 13, . . . may again be handled in parallel, and the same holds for the increments 2 and 1.
Fig. 15.7 The execution of a program that runs in parallel first and then sequentially.
Theorem 15.5 In the evaluation of expressions (either numerical or propositional), if the expression contains two disjoint sub-expressions that exchange no values, then they can be regarded as regions that can be compiled in parallel.
For example, in A_k = (k^(n−k) k!) − (k − 1)^(n+k−1) (k − 1)!/n!, the sub-expressions k^(n−k) k! and
Problems
References
[1] Proceedings of the IEEE (2001) Special issue on microprocessor architecture and compiler technology.
[2] Grune D, Bal H E, Jacobs C J H et al (2001) Modern compiler design. Pearson Education, Singapore.
[3] Su Y L (2005) The intelligenization of synchronization languages in embed-
ded systems. The Journal of Guangxi Academy of Science 21(4): 236 – 238.
[4] Su Y L (2004) On the issues about region schedule in parallel compilation.
The Journal of Guangxi Academy of Science 20(4): 220 – 224.
Chapter 16 Compilation of Grid Computing
The rise of grid computing is undoubtedly a new phenomenon of the end of the last century and the beginning of this one, and it has brought many new things with it. As it becomes a reality and gradually occupies a certain share of the market, everything involved should be taken into account, especially since it is not yet mature. For the problems that are still open, we should also make an effort to solve them and contribute to their solutions. This chapter presents only a brief introduction to grid computing; we believe that more knowledge about it will accumulate as time goes on, and that the contents of the topic will become richer in the future.
resources worked together; hence the demonstrated system covered the exploration of many concepts of the early stage of grid computing.
The success of I-Way urged DARPA (the U.S. Defense Advanced Research Projects Agency) to invest funds in a project to create fundamental tools for distributed computing. This research project was led jointly by Ian Foster of Argonne National Laboratory and Carl Kesselman of the University of California, and it was named Globus. The project team created a set of tools that became the basis of grid computing research activity in the academic field. At the supercomputing conference held in 1997, running software based on the Globus toolkit at about 80 sites around the world was linked together.
These efforts were called grid computing because of the analogy with the electric power grid: just as the electric grid supplies electricity to billions of appliances, grid computing lets anyone, at any time, use tremendous computing power in a truly transparent manner.
In the academic field, the problem of most concern is still the establishment of an effective grid framework so that distributed high-performance computing can play its role. Since, during the same period, the Internet developed quickly and the capability of personal computers was steadily enhanced, scientists made many attempts to establish powerful distributed computing systems by networking personal computers. In 1997 the Entropia network was established to harness idle computers worldwide to solve scientific problems of interest; subsequently Entropia accumulated up to 30,000 computers with an aggregate speed of 10^10 operations per second. A brand-new field of philanthropic (volunteer) computing emerged, in which users volunteered their computers for analyzing patients' responses to chemotherapy, for discovering medicines for AIDS, and for other therapeutic research.
Although the projects mentioned above did not receive investment from companies or become real products, they received more attention from the media than almost any other scientific research plan at a comparably early stage. Since the end of 2000, articles on grid computing have moved from trade journals to popular newspapers and magazines, and major newspapers around the world report developments in the field.
Nowadays big companies like IBM, Sun Microsystems, Intel, and HP, and smaller companies like Platform Computing, Avaki, Entropia, Datasynapse, and United Devices, all invest funds in research on grid computing, but their focus is on commercial applications rather than on scientific research.
Now we may consider what grid computing is intended to be. Common definitions of grid computing include:
• Flexible, secure, and coordinated resource sharing among dynamic collections of individuals, institutions, and resources.
• Transparent, secure, and coordinated resource sharing and cooperation across sites.
• The ability to form virtual cooperative organizations that work together in an open, heterogeneous server environment to share applications and data, so that they can solve common problems.
• The ability to accumulate a great amount of physically separated computing resources in order to solve large-scale problems and workloads, just as if all the servers and resources were located at one site.
• A fundamental structure of hardware and software that provides reliable, consistent, ubiquitous, and cheap access to computing resources.
• The network provides us with information; the grid allows us to process it.
Each of the definitions listed above has its unique features and captures some characteristics of grid computing. But we tend to use the following definition, which defines the grid more broadly and so may describe grid systems better.
Definition Grid computing. In a situation in which there is no central location, no central control, and no pre-existing, ubiquitous trust relation among the virtual organizations pursuing their common targets, the set of resources, including hardware and software, that enables these organizations to share resources is called a grid (grid computing).
The virtual organization in this definition covers a very wide range: from small companies to big companies spread around the world, with many people coming from different organizations. A virtual organization may be big or small, static or dynamic; one may also be set up temporarily for a special purpose and dismissed when the goal is attained.
Some examples of virtual organizations are:
• The accounting department of a company.
• The score-statistics department of the South Africa soccer games.
• The emergency response team organized to handle the oil leak in the Gulf of Mexico.
A resource is an entity to be shared. It may be of computing type, e.g., personal digital assistants, laptop computers, desktop computers, workstations, servers, clusters, and supercomputers; it may also be of storage type, e.g., hard disk drives, redundant arrays of inexpensive disks, and other storage devices. Sensors are another type of resource, and bandwidth is also a kind of resource used for the various activities of virtual organizations. That there is no central location and no central control implies that there is no need to set up a specific central site for the management of grid resources. Finally, a crucial point to note is that, in a network environment, resources do not need to know about each other and do not need to have predefined security relations.
Fig. 16.1 Grid network and linear array that link computers.
Here we assume that the links between processors are bidirectional. In a linear array, processor 1 has no left neighbour and processor p has no right neighbour; for any other processor i, processor i − 1 is its left neighbour and processor i + 1 is its right neighbour. A √p × √p grid network has several subgraphs, and each subgraph contains a linear array of √p processors.
Communication between processors is done with the help of the communication links. Any two processors connected by a link can communicate with each other in one unit of time; if there is no link between two processors, their communication must rely on a path that connects them, and the time it takes depends on the length of the path (at least for small amounts of information). Suppose that in one unit of time a processor may execute one local computation or communicate with at most its four neighbours.
In this grid network, the processors whose first coordinates are equal form a row of the processor grid, and the processors whose second coordinates are equal form a column. Each row and each column consists of √p processors and thus forms a linear array. Usually a grid-network algorithm consists of steps performed on rows or on columns.
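A small C sketch of the addressing convention just described: processor k in a √p × √p grid gets the coordinates (k / √p, k mod √p), so equal first coordinates give a row and equal second coordinates give a column. The value of p is an arbitrary example.

    #include <stdio.h>
    #include <math.h>

    /* Processor k (0 <= k < p) of a sqrt(p) x sqrt(p) grid network gets the
       coordinates (row, column) = (k / side, k % side) with side = sqrt(p).
       Processors with equal row index form a row (a linear array of sqrt(p)
       processors); processors with equal column index form a column.         */
    int main(void) {
        int p = 16;                                 /* example processor count */
        int side = (int)(sqrt((double)p) + 0.5);

        for (int k = 0; k < p; k++)
            printf("processor %2d -> (row %d, column %d)\n",
                   k, k / side, k % side);
        return 0;
    }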
Suppose that a wins. Meanwhile, at t = 2, packets c and d have also advanced one edge towards their destinations, and they meet b (as shown in Fig. 16.3). At t = 3, since b has higher priority than c and d, b goes first. At t = 4, packets c and d compete again for advancement; since they have the same priority, the tie is resolved arbitrarily. Suppose that d is the winner. Then packet c takes two more steps, i.e., is delayed two more steps, before reaching its destination. Finally every packet arrives at its destination.
The moving distance of c is 4, and it was queued twice (competing with b and with d), so it was delayed by 2, and the total time consumed is 6. What is the result if a different priority scheme is used? If the furthest-destination-first scheme is adopted, then at t = 4 packet c has the highest priority, so it is advanced first; in that case its running time is 5.
Fig. 16.3 Packets moving from left to right and packets moving from right to left are independent.
Lemma 16.1 Suppose that each source address holds only one packet. In a linear array with p processors, assume that the destinations are arbitrary. Without loss of generality we consider only the case of moving from left to right. If packet q has its source at processor i and wants to reach processor j, then it needs to traverse j − i edges to reach its destination.
Notice that a packet can traverse only one edge at a time. Since q does not meet any other packet on the way, it is never delayed. For all packets, the routing takes at most p − 1 time. The queue length of the algorithm is the largest number of packets destined for any single node.
Lemma 16.2 In a linear array of p processors, suppose that each processor i (i = 1, 2, . . . , p) initially holds ki packages, with k1 + k2 + . . . + kp = p, and that each processor is the destination of exactly one package. If the priority scheme of furthest destination first is adopted, then the time taken by a package with source processor i does not exceed the distance from processor i to the end of the array towards which it moves. In other words, if the package moves from left to right, the time does not exceed p − i; if it moves from right to left, the time does not exceed i − 1.
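As a concrete reading of this bound (the numbers here are our own, chosen only for illustration): in an array of p = 8 processors, a package with source processor i = 3 moving to the right is guaranteed to arrive within p − i = 5 time units, wherever its destination to the right may lie, even though it may be delayed by other packages on the way.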
Proof Suppose that package q comes from processor i and that its destination is processor j. Without loss of generality, suppose that the package moves from left to right, and suppose that each package has selected the shortest path from its source address to its destination address. If package q is delayed, the delay can only be caused by packages whose destinations are greater than j and whose sources lie to the left of j. Suppose that the numbers of such packages initially at processors 1, 2, . . . , j − 1 are k1, k2, . . . , kj−1; notice that k1 + k2 + . . . + kj−1 ≤ p − j.
Suppose that m (m ≤ j − 1) is such that km−1 ≥ 2 while km, km+1, . . . , kj−1 are all at most 1; the sequence km, km+1, . . . , kj−1 is then called a free sequence. The packages in the free sequence can no longer be delayed by the remaining packages, because according to the priority rule the package at the left has higher priority to be selected than the package at the right.
Moreover, in every step at least one package joins the free sequence. Fig. 16.4 shows an example; the numbers represent the number of packages on the nodes. For example, when t = 0 there are 3 packages on node i, and at that time 0, 1, 0, 1, 1 is a free sequence. Notice that as time passes the number of packages in the sequence changes too. For example, when t = 1 one package joins the sequence, and when t = 2 four more packages join it.
Therefore, after p − j steps all the packages that may delay package q are in the free sequence. Package q then needs at most j − i further steps to reach its destination (see Fig. 16.4). The case of packages moving from right to left is similar.
Lemma 16.3 In a linear array of p processors, suppose that each processor may send more than one package and may also be the destination of more than one package, and that the number of packages sent from processors 1, 2, . . . , j is no more than j + f(p) for any j, where f is some chosen
function. Then, under the priority scheme of furthest destination first, the routing of these packages can be completed in p + f(p) steps.
Proof Suppose that q is a package with source i and destination j (j lies to the right of i). Package q can be delayed by at most i + f(p) packages, because the packages that can delay it have their source addresses among 1, 2, . . . , i and have higher priority than q. Each such package delays q at most once: after a higher-priority package has delayed q, it will not delay q again. Hence the total delay of q is at most i + f(p). Since q needs only j − i steps of movement to reach its destination, the total time that q takes is at most j + f(p). In summary, over all the packages the maximum time is p + f(p).
Example 16.3 Fig. 16.5 demonstrates the proof of Lemma 16.3. In this example there are 8 packages a, b, c, d, e, f, g, h. Suppose that g is the package we are concerned with. Package g can only be delayed by a, b, c, d, e, f, and h. When t = 9, package g arrives at its destination: it has travelled a distance of 2 and its delay is 7. In this figure, the packages that have passed node j are not shown.
So far there is no formal programming language in use specifically for grid computing, so we cannot yet explore the compilation of such a language. However, there has already been much research on the compilation of grid computing programs and on compilation across procedures [2]. These results have not been put on the market, and in particular very little work has been done on compilation across documents. One of the reasons is that
compiling whole programs makes the compilation very complicated, even though some researchers have stated that it is very beneficial to the correctness and performance of programs; users usually do not want a long compilation time.
In the future, however, the tasks of compilation will be so heavy that interprocedural compilation becomes a necessity rather than a luxury, especially for the compilation of grid computing on distributed heterogeneous computers.
For cross-procedure (interprocedural) compilation, the problems that need to be solved are as follows:
1) Cross-procedure compilation that supports a grid application framework must be able to integrate the whole-application performance evaluation program and the mapping program. Therefore, in order to construct an execution model of the whole program, interprocedural cross compilation is necessary.
2) The management of binary components (this is also a necessary function at program development time). In order to avoid the expensive phase of partitioning binary components, it is important to link the parts of the program with shared components residing on remote computing resources; parts of these programs have been stored on the remote computing resources in advance. Optimizing for such contingencies is a crucial function of interprocedural cross compilation.
3) In order to avoid excessively long compilation times, the recompilation of the documents (files) in a program should be managed. Although there has been some research on recompilation analysis, it is not yet available on the market.
4) If one wants to feed previous running-time analysis into the decision making of the current compilation, then the management of recompilation becomes even more complicated; the compilation environment must be sophisticated enough to manage this process.
5) Some interprocedural cross-compilation analyses need to be completed at link time and at running time. How to manage this process effectively is still an open problem.
Research on compilers has produced two general techniques for handling the long latency of storage and communication on parallel computers. One is latency hiding, which overlaps data communication with computation. The other is latency reduction, which reorganizes the program so that the data in local storage can be reused more effectively. In practice, these two techniques are very effective.
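As a rough illustration of the first technique, latency hiding, consider the following sketch. It is added here for illustration only; the function names and the simulated latency are our own assumptions, not the book's method. A long-latency transfer is issued asynchronously, independent local computation proceeds while the transfer is in flight, and the program waits for the data only when it is actually needed.

import time
from concurrent.futures import ThreadPoolExecutor

def fetch_remote_block(block_id):
    """Stand-in for a long-latency communication, e.g., fetching remote grid data."""
    time.sleep(0.5)               # simulated network latency
    return [block_id] * 1000      # simulated payload

def local_compute():
    """Stand-in for computation that does not depend on the remote data."""
    return sum(i * i for i in range(100000))

with ThreadPoolExecutor(max_workers=1) as pool:
    future = pool.submit(fetch_remote_block, 7)   # start the transfer early
    partial = local_compute()                     # overlap: compute while data is in flight
    remote = future.result()                      # block only when the data is really needed
    print(partial, len(remote))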
In compilers for grid computing, the implementation of these two techniques is very sophisticated, and latency hiding is especially problematic, because the latency in grid computing is large and variable. Therefore, if we want to determine how to extract the values of the variables, more time must be spent estimating running time and communication delay. This also means that latency-tolerant algorithms are
more important.
Running time compilation
The kernel of the compilation of grid computing is its parallel implementation. The important related problem is the scheme of automatic load balancing in the grid. This requires certain information, e.g., the upper bounds of loops and the sizes of arrays. For many applications, however, this information is unknown before running. The lack of this information also makes it difficult to define problems on irregular grids, and it makes the implementation of parallelism very difficult even on homogeneous parallel computers.
Running-time compilation takes many forms. It may be as simple as reconsidering a decision after the scalar data have been put into memory, but it can also be as complicated as planning communication in an irregular computation, because before the crucial data structures are defined, the underlying grid and the location of the computation are unknown.
For the grid, it is necessary to reconfigure the whole program and to perform load balancing while it is running, which may be an extremely time-consuming process. A strategy is therefore needed that minimizes the overhead of these steps. In general, research is needed on how to minimize this overhead, with running time as the main emphasis and a complicated decision factor, because the cases in which such requirements arise will become more and more common in the future.
In order to resolve the aforementioned problem, some researchers designed the reviewer/executer method of running-time compilation. In this method, the compiler partitions the key computing part into two parts: the reviewer and the executer. The former is executed only once, after the running-time data becomes available, to establish the plan that will be executed efficiently on the parallel computer. The latter, the executer, is called in every iteration of the computation, and its execution carries out the plan defined by the reviewer.
The idea of the scheme is to amortize the cost of the running-time computation over the many time steps of a complicated computation. When the upper bounds of loops are unknown, the reviewer may partition a loop into several small loops; once the upper bounds become known, these small loops can be matched to the power of the target machines, while the executer only needs to perform the correct computation on the small loops on each machine. The reviewer must follow rules for achieving balance in complicated and irregular problems. The tasks of the reviewer and the executer are both quite complicated.
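To make the reviewer/executer idea concrete, here is a minimal sketch under our own assumptions; the names and the simple block partitioning are illustrative and are not the book's definition. The reviewer runs once, after the run-time data (here the loop bound and the number of machines) is known, and produces a plan; the executer is called at every time step and merely carries the plan out.

def reviewer(n, machines):
    """Run once at run time: partition iterations 0..n-1 into one chunk per machine."""
    chunk = (n + machines - 1) // machines        # ceiling division
    return [range(m * chunk, min((m + 1) * chunk, n)) for m in range(machines)]

def executer(plan, data):
    """Run at every time step: carry out the plan built by the reviewer."""
    for machine_id, iterations in enumerate(plan):
        for i in iterations:                      # in a real grid each machine runs its own chunk
            data[i] = data[i] * 2 + machine_id

n = 10                                            # loop bound, known only at run time
data = list(range(n))
plan = reviewer(n, machines=3)                    # the reviewer's cost is paid once
for step in range(4):                             # ... and amortized over many time steps
    executer(plan, data)
print(data)

In a real grid system the reviewer would also have to weigh the measured speeds of heterogeneous machines when forming the chunks, which is exactly the balancing task the text describes.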
Running-time compilation is a powerful tool for tailoring programs so that they are suitable for execution on parallel computers; it is especially crucial for distributed, heterogeneously structured computers.
For grid computing, the compiler and the parallelization tools reside at the middle level of the grid.
At the current stage, what we know about the compilation of grid
computing is indeed limited. We may predict, however, that before long people will gain a deeper understanding of it, and consequently there will be more concrete achievements in the research of this field.
Problems
Problem 16.1 Design an algorithm for sorting numbers in a grid with √p × √p nodes.
Problem 16.2 Suppose that on a √p × √p grid each processor is the source address of exactly one package and also the destination of exactly one package. Design a deterministic routing algorithm to solve the routing problem for the packages. Your algorithm should have complexity O(√p) and queue size O(1). Hint: use the sorting algorithm.
Problem 16.3 On a √p × √p grid, sorting of the rows is executed first, and then sorting of the columns is performed. Prove that the rows are still in order.
Problem 16.4 Using the idea of Problem 16.2, design a deterministic group permutation routing algorithm. The complexity of your algorithm should be O(√p) and the size of the queue O(1).
References
[1] Foster I, Kesselman C (2004) The Grid, 2nd edn. Elsevier, Amsterdam.
[2] Abbas A (2004) Grid Computing: A Practical Guide to Technology and Applications. Charles River Media.
Index