0% found this document useful (0 votes)
5 views108 pages

compiler-notes-of-unit-1-2

The document provides comprehensive lecture notes on Compiler Design for Uttarakhand Technical University, covering key topics such as lexical analysis, syntax analysis, type checking, code generation, and optimization. It outlines the course syllabus, modules, and phases of compilation, including the roles of various components like preprocessors, compilers, assemblers, and interpreters. Additionally, it discusses the structure of compilers, error handling, and includes references and textbooks for further reading.

Uploaded by

sharmaashu779977
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
5 views108 pages

compiler-notes-of-unit-1-2

The document provides comprehensive lecture notes on Compiler Design for Uttarakhand Technical University, covering key topics such as lexical analysis, syntax analysis, type checking, code generation, and optimization. It outlines the course syllabus, modules, and phases of compilation, including the roles of various components like preprocessors, compilers, assemblers, and interpreters. Additionally, it discusses the structure of compilers, error handling, and includes references and textbooks for further reading.

Uploaded by

sharmaashu779977
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 108

lOMoARcPSD|54605199

Compiler notes OF UNIT 1 & 2

Compiler Design (Uttarakhand Technical University)

Scan to open on Studocu

Studocu is not sponsored or endorsed by any college or university


Downloaded by Piryanshu Sharma ([email protected])
lOMoARcPSD|54605199

COMPILER DESIGN LECTURE


NOTES
COURSE SYLLABUS
Compiler Design (BCST-602)
L T P
3 1 0

Syllabus CO BL
Module –I Introduction to compiling & Lexical Analysis CO-01, CO-02 2
Introduction of Compiler, Major data Structure in compiler, types
of Compiler, Front-end and Back- end of compiler, Compiler
structure: analysis-synthesis model of compilation, various phases of a
compiler, Lexical analysis: Input buffering , Specification &
Recognition of Tokens,Design of a Lexical Analyzer Generator, LEX.
Module –II Syntax Analysis &Syntax Directed Translation CO-02,CO-03, 3
Syntax analysis: CFGs, Top down parsing, Brute force approach,
recursive descent parsing, transformation on the grammars, predictive
parsing, bottom up parsing, operator precedence parsing, LR parsers
(SLR,LALR, LR),Parser generation. Syntax directed definitions:
Construction of Syntax trees, Bottom up evaluation of S-attributed
definition, L-attribute definition, Top down translation, Bottom Up
evaluation of inherited attributes Recursive Evaluation, Analysis of
Syntax directed definition.
Module –III Type Checking & Run Time Environment CO-03 2
Type checking: type system, specification of simple type checker,
equivalence of expression, types, type conversion, overloading of
functions and operations, polymorphic functions. Run time
Environment: storage organization, Storage allocation strategies,
parameter passing, dynamic storage allocation, Symbol table, Error
Detection & Recovery, Ad-Hoc and Systematic Methods.
Module –IV Code Generation CO-04 2
Intermediate code generation: Declarations, Assignment statements,
Boolean expressions, Case statements, Back patching, Procedure calls
Code Generation: Issues in the design of code generator, Basic block
and flow graphs, Register allocation and assignment, DAG
representation of basic blocks, peephole optimization, generating code
from DAG.
Module –V Code Optimization CO-05 2
Introduction to Code optimization: sources of optimization of basic
blocks, loops in flow graphs, dead code elimination, loop
optimization, Introduction to global data flow analysis, Code
Improving transformations ,Data flow analysis of structure flow graph
Symbolic debugging of optimized code.

Downloaded by Piryanshu Sharma ([email protected])


lOMoARcPSD|54605199

TEXT BOOKS:

1 C. Siva Ram Murthy & B. S. Manoj: Ad-hoc Wireless Networks, 2nd Edition,
Pearson Education, 2011.
2 Deploying Wireless networks, Andy Wilton, Tim charity, Cambridge university
press
3 Fundamental of Wireless Networking, Ron Price, TMH.
4 3G Wireless Networks, Clint Smity, TMH.

REFERENCES:
1. Ozan K. Tonguz and Gianguigi Ferrari: Ad-hoc Wireless Networks, John Wiley,
2007.
2. Xiuzhen Cheng, Xiao Hung, Ding-Zhu Du: Ad-hoc Wireless Networking, Kluwer
Academic Publishers, 2004.
3. C.K. Toh: Ad-hoc Mobile Wireless Networks- Protocols and Systems, Pearson
Education, 2002.

Downloaded by Piryanshu Sharma ([email protected])


lOMoARcPSD|54605199

Module -I
Introduction to Compiling:
1.1 INTRODUCTION OF LANGUAGE PROCESSING SYSTEM

Fig 1.1: Language Processing System


Preprocessor

A preprocessor produce input to compilers. They may perform the following functions.

1. Macro processing: A preprocessor may allow a user to define macros that are short hands
for longer constructs.
2. File inclusion: A preprocessor may include header files into the program text.
3. Rational preprocessor: these preprocessors augment older languages with more modern flow-
of- control and data structuring facilities.
4. Language Extensions: These preprocessor attempts to add capabilities to the language by
certain amounts to build-in macro

COMPILER

Compiler is a translator program that translates a program written in (HLL) the source program and
translate it into an equivalent program in (MLL) the target program. As an important part of a
compiler is error showing to the programmer.

Fig 1.2: Structure of Compiler

Downloaded by Piryanshu Sharma ([email protected])


lOMoARcPSD|54605199

Executing a program written n HLL programming language is basically of two parts. the source
program must first be compiled translated into a object program. Then the results object program
is loaded into a memory executed.

Fig 1.3: Execution process of source program in Compiler

ASSEMBLER
Programmers found it difficult to write or read programs in machine language. They begin to use a
mnemonic (symbols) for each machine instruction, which they would subsequently translate into
machine language. Such a mnemonic machine language is now called an assembly language.
Programs known as assembler were written to automate the translation of assembly language in to
machine language. The input to an assembler program is called source program, the output is a
machine language translation (object program).

INTERPRETER
An interpreter is a program that appears to execute a source program as if it were machine language.

Fig1.4: Execution in Interpreter

Languages such as BASIC, SNOBOL, LISP can be translated using interpreters. JAVA also uses
interpreter. The process of interpretation can be carried out in following phases.
1. Lexical analysis
2. Synatx analysis
3. Semantic analysis
4. Direct Execution

Advantages:
Modification of user program can be easily made and implemented as execution proceeds.
Type of object that denotes a various may change dynamically.
Debugging a program and finding errors is simplified task for a program used for interpretation.
The interpreter for the language makes it machine independent.
Disadvantages:
The execution of the program is slower.
Memory consumption is more.

LOADER AND LINK-EDITOR:

Once the assembler procedures an object program, that program must be placed into memory and
executed. The assembler could place the object program directly in memory and transfer control to it,

Downloaded by Piryanshu Sharma ([email protected])


lOMoARcPSD|54605199

thereby causing the machine language program to be execute. This would waste core by leaving the
assembler in memory while the user’s program was being executed. Also the programmer would
have to retranslate his program with each execution, thus wasting translation time. To over come this
problems of wasted translation time and memory. System programmers developed another
component called loader
“A loader is a program that places programs into memory and prepares them for execution.” It would
be more efficient if subroutines could be translated into object form the loader could”relocate”
directly behind the user’s program. The task of adjusting programs o they may be placed in arbitrary
core locations is called relocation. Relocation loaders perform four functions.

1.2 TRANSLATOR
A translator is a program that takes as input a program written in one language and produces as
output a program in another language. Beside program translation, the translator performs another
very important role, the error-detection. Any violation of d HLL specification would be detected and
reported to the programmers. Important role of translator are:
1 Translating the HLL program input into an equivalent ml program.
2 Providing diagnostic messages wherever the programmer violates specification of the HLL.

1.3 LIST OF COMPILERS


1. Ada compilers
2 .ALGOL compilers
3 .BASIC compilers
4 .C# compilers
5 .C compilers
6 .C++ compilers
7 .COBOL compilers
8 .Common Lisp compilers
9. ECMAScript interpreters
10. Fortran
compilers 11 .Java
compilers
12. Pascal compilers
13. PL/I compilers
14. Python compilers
15. Smalltalk compilers

1.4 STRUCTURE OF THE COMPILER DESIGN

Phases of a compiler: A compiler operates in phases. A phase is a logically interrelated operation


that takes source program in one representation and produces output in another representation. The
phases of a compiler are shown in below
There are two phases of compilation.
a. Analysis (Machine Independent/Language Dependent)
b. Synthesis(Machine Dependent/Language independent)

Compilation process is partitioned into no-of-sub processes called ‘phases’.


Lexical Analysis:-
LA or Scanners reads the source program one character at a time, carving the source program into a
sequence of automic units called tokens.

Downloaded by Piryanshu Sharma ([email protected])


lOMoARcPSD|54605199

Fig 1.5: Phases of Compiler

Syntax Analysis:-
The second stage of translation is called Syntax analysis or parsing. In this phase expressions,
statements, declarations etc… are identified by using the results of lexical analysis. Syntax analysis is
aided by using techniques based on formal grammar of the programming language.

Intermediate Code Generations:-


An intermediate representation of the final machine language code is produced. This phase bridges
the analysis and synthesis phases of translation.

Code Optimization :-
This is optional phase described to improve the intermediate code so that the output runs faster and
takes less space.

Code Generation:-
The last phase of translation is code generation. A number of optimizations to reduce the length of
machine language program are carried out during this phase. The output of the code generator is
the machine language program of the specified computer.

Downloaded by Piryanshu Sharma ([email protected])


lOMoARcPSD|54605199

Table Management (or) Book-keeping:- This is the portion to keep the names used by the
program and records essential information about each. The data structure used to record this
information called a ‘Symbol Table’.

Error Handlers:-
It is invoked when a flaw error in the source program is detected. The output of LA is a stream of
tokens, which is passed to the next phase, the syntax analyzer or parser. The SA groups the tokens
together into syntactic structure called as expression. Expression may further be combined to form
statements. The syntactic structure can be regarded as a tree whose leaves are the token called as
parse trees.

The parser has two functions. It checks if the tokens from lexical analyzer, occur in pattern that are
permitted by the specification for the source language. It also imposes on tokens a tree-like structure
that is used by the sub-sequent phases of the compiler.

Example, if a program contains the expression A+/B after lexical analysis this expression might
appear to the syntax analyzer as the token sequence id+/id. On seeing the /, the syntax analyzer
should detect an error situation, because the presence of these two adjacent binary operators violates
the formulations rule of an expression. Syntax analysis is to make explicit the hierarchical structure
of the incoming token stream by identifying which parts of the token stream should be grouped.

Example, (A/B*C has two possible interpretations.)


1, divide A by B and then multiply by C or
2, multiply B by C and then use the result to divide A.
each of these two interpretations can be represented in terms of a parse tree.

Intermediate Code Generation:-


The intermediate code generation uses the structure produced by the syntax analyzer to create a
stream of simple instructions. Many styles of intermediate code are possible. One common style uses
instruction with one operator and a small number of operands. The output of the syntax analyzer is
some representation of a parse tree. the intermediate code generation phase transforms this parse tree
into an intermediate language representation of the source program.

Code Optimization
This is optional phase described to improve the intermediate code so that the output runs faster and
takes less space. Its output is another intermediate code program that does the some job as the
original, but in a way that saves time and / or spaces.
a. Local Optimization:-
There are local transformations that can be applied to a program to make an improvement.
For example,
If A > B goto L2

Downloaded by Piryanshu Sharma ([email protected])


lOMoARcPSD|54605199

Goto L3
L2 :

This can be replaced by a single statement


If A < B goto L3

Another important local optimization is the elimination of common sub-expressions


A := B + C + D
E := B + C + F
Might be evaluated as

T1 := B + C
A := T1 + D
E := T1 + F
Take this advantage of the common sub-expressions B + C.

b. Loop Optimization:-
Another important source of optimization concerns about increasing the speed of loops. A
typical loop improvement is to move a computation that produces the same result each
time around the loop to a point, in the program just before the loop is entered.

Code generator :-
Code Generator produces the object code by deciding on the memory locations for data, selecting
code to access each datum and selecting the registers in which each computation is to be done. Many
computers have only a few high speed registers in which computations can be performed quickly. A
good code generator would attempt to utilize registers as efficiently as possible.

Table Management OR Book-keeping :-


A compiler needs to collect information about all the data objects that appear in the source program.
The information about data objects is collected by the early phases of the compiler-lexical and
syntactic analyzers. The data structure used to record this information is called as Symbol Table.

Error Handing :-
One of the most important functions of a compiler is the detection and reporting of errors in the
source program. The error message should allow the programmer to determine exactly where the
errors have occurred. Errors may occur in all or the phases of a compiler.

Whenever a phase of the compiler discovers an error, it must report the error to the error handler,
which issues an appropriate diagnostic msg. Both of the table-management and error-Handling
routines interact with all phases of the compiler.

Downloaded by Piryanshu Sharma ([email protected])


lOMoARcPSD|54605199

Example:

Fig 1.6: Compilation Process of a source code through phases

Downloaded by Piryanshu Sharma ([email protected])


lOMoARcPSD|54605199

2. A simple One Pass Compiler:

2.0 INTRODUCTION: In computer programming, a one-pass compiler is a compiler that


passes through the parts of each compilation unit only once, immediately translating each part
into its final machine code. This is in contrast to a multi-pass compiler which converts the
program into one or more intermediate representations in steps between source code and
machine code, and which reprocesses the entire compilation unit in each sequential pass.

2.1 OVERVIEW

 Language Definition
o Appearance of programming language :
Vocabulary : Regular expression
Syntax : Backus-Naur Form(BNF) or Context Free Form(CFG)
o Semantics : Informal language or some examples

 Fig 2.1. Structure of our compiler front end

2.2 SYNTAX DEFINITION

 To specify the syntax of a language : CFG and BNF


oExample : if-else statement in C has the form of statement →if ( expression )
statement else statement
 An alphabet of a language is a set of symbols.
o Examples : {0,1} for a binary number system(language)={0,1,100,101,...}
{a,b,c} for language={a,b,c, ac,abcc..}
{if,(,),else ...} for a if statements={if(a==1)goto10, if--}
 A string over an alphabet
o is a sequence of zero or more symbols from the alphabet.
o Examples : 0,1,10,00,11,111,0202 ... strings for a alphabet {0,1}
o Null string is a string which does not have any symbol of alphabet.
 Language
o Is a subset of all the strings over a given alphabet.
o Alphabets Ai Languages Li for Ai
A0={0,1} L0={0,1,100,101,...}
A1={a,b,c} L1={a,b,c, ac, abcc..}
A2={all of C tokens} L2= {all sentences of C program }
 Example 2.1. Grammar for expressions consisting of digits and plus and
minus signs.
o Language of expressions L={9-5+2, 3-1, ...}
o The productions of grammar for this language L are:

Downloaded by Piryanshu Sharma ([email protected])


lOMoARcPSD|54605199

list → list +
digit list → list -
digit list → digit
digit → 0|1|2|3|4|5|6|7|8|9
o list, digit : Grammar variables, Grammar symbols
o 0,1,2,3,4,5,6,7,8,9,-,+ : Tokens, Terminal symbols
 Convention specifying grammar
o Terminal symbols : bold face string if, num, id
o Nonterminal symbol, grammar symbol : italicized names, list, digit ,A,B

 Grammar G=(N,T,P,S)
o N : a set of nonterminal symbols
o T : a set of terminal symbols, tokens
o P : a set of production rules
o S : a start symbol, S∈N
o
 Grammar G for a language L={9-5+2, 3-1, ...}
o G=(N,T,P,S)
N={list,digit}
T={0,1,2,3,4,5,6,7,8,9,-,+}
P : list -> list + digit
list -> list -
digit list ->
digit
digit -> 0|1|2|3|4|5|6|7|8|9
S=list

 Some definitions for a language L and its grammar G


 Derivation :
A sequence of replacements S⇒α1⇒α2⇒…⇒αn is a derivation of αn.
Example, A derivation 1+9 from the grammar G
 left most derivation
list ⇒ list + digit ⇒ digit + digit ⇒ 1 + digit ⇒ 1 + 9
 right most derivation
list ⇒ list + digit ⇒ list + 9 ⇒ digit + 9 ⇒ 1 + 9
 Language of grammar L(G)
L(G) is a set of sentences that can be generated from the grammar G.
L(G)={x| S ⇒* x} where x ∈ a sequence of terminal symbols
 Example: Consider a grammar
G=(N,T,P,S): N={S} T={a,b}
S=S P ={S → aSb | ε }
 is aabb a sentecne of L(g)? (derivation of string aabb)
S⇒aSb⇒aaSbb⇒aaεbb⇒aabb(or S⇒* aabb) so,
aabbεL(G)
 there is no derivation for aa, so aa∉L(G)
 note L(G)={anbn| n≧0} where anbn meas n a's followed by n b's.

 Parse Tree

Downloaded by Piryanshu Sharma ([email protected])


lOMoARcPSD|54605199

A derivation can be conveniently represented by a derivation tree( parse tree).


o The root is labeled by the start symbol.
o Each leaf is labeled by a token or ε.
o Each interior none is labeled by a nonterminal symbol.
o When a production A→x1… xn is derived, nodes labeled by x1… xn are made as
children
nodes of node labeled by A.
 root : the start symbol
 internal nodes : nonterminal
 leaf nodes : terminal

o Example G:
list -> list + digit | list - digit | digit
digit -> 0|1|2|3|4|5|6|7|8|9
 left most derivation for 9-5+2,
list ⇒ list+digit ⇒ list-digit+digit ⇒ digit-digit+digit ⇒ 9-digit+digit
⇒ 9-5+digit ⇒ 9-5+2
 right most derivation for 9-5+2,
list ⇒ list+digit ⇒ list+2 ⇒ list-digit+2 ⇒ list-5+2
⇒ digit-5+2 ⇒ 9-5+2

parse tree for 9-5+2

Fig 2.2. Parse tree for 9-5+2 according to the grammar in Example

Ambiguity
 A grammar is said to be ambiguous if the grammar has more than one parse tree for
a given string of tokens.
 Example 2.5. Suppose a grammar G that can not distinguish between lists and digits as
in Example 2.1.
 G : string → string + string | string - string |0|1|2|3|4|5|6|7|8|9

Downloaded by Piryanshu Sharma ([email protected])


lOMoARcPSD|54605199

Fig 2.3. Two Parse tree for 9-5+2


 1-5+2 has 2 parse trees => Grammar G is ambiguous.

Associativity of operator
A operator is said to be left associative if an operand with operators on both sides of it is
taken by the operator to its left.
eg) 9+5+2≡(9+5)+2, a=b=c≡a=(b=c)
 Left Associative Grammar :
list → list + digit | list - digit
digit →0|1|…|9
 Right Associative Grammar :
right → letter = right | letter
letter → a|b|…|z

Fig 2.4. Parse tree left- and right-associative operators.

Precedence of operators
We say that a operator(*) has higher precedence than other operator(+) if the operator(*) takes
operands before other operator(+) does.
 ex. 9+5*2≡9+(5*2), 9*5+2≡(9*5)+2
 left associative operators : + , - , * , /
 right associative operators : = , **

Downloaded by Piryanshu Sharma ([email protected])


lOMoARcPSD|54605199

 Syntax of full expressions


operator associative precedence

+,- left 1 low


*,/ left 2 heigh

 expr → expr + term | expr - term | term


term → term * factor | term / factor | factor
factor → digit | ( expr )
digit → 0 | 1 | … | 9

 Syntax of statements
o stmt → id = expr ;
| if ( expr ) stmt ;
| if ( expr ) stmt else stmt ;
| while ( expr ) stmt ;
expr → expr + term | expr - term | term
term → term * factor | term / factor | factor
factor → digit | ( expr )
digit → 0 | 1 | … | 9
2.3 SYNTAX-DIRECTED TRANSLATION(SDT)
A formalism for specifying translations for programming language constructs.
( attributes of a construct: type, string, location, etc)
 Syntax directed definition(SDD) for the translation of constructs
 Syntax directed translation scheme(SDTS) for specifying translation
Postfix notation for an expression E
 If E is a variable or constant, then the postfix nation for E is E itself ( E.t≡E ).
 if E is an expression of the form E1 op E2 where op is a binary operator
o E1' is the postfix of E1,
o E2' is the postfix of E2
o then E1' E2' op is the postfix for E1 op E2
 if E is (E1), and E1' is a postfix
then E1' is the postfix for E

eg) 9-5+2⇒95-2+

9 - (5 + 2) ⇒9 5 2 + -

Syntax-Directed Definition(SDD) for translation


 SDD is a set of semantic rules predefined for each productions respectively
for translation.
 A translation is an input-output mapping procedure for translation of an input X,
o construct a parse tree for X.
o synthesize attributes over the parse tree.

Downloaded by Piryanshu Sharma ([email protected])


lOMoARcPSD|54605199

 Suppose a node n in parse tree is labeled by X and X.a denotes the


value of attribute a of X at that node.
 compute X's attributes X.a using the semantic rules associated with

X. Example 2.6. SDD for infix to postfix translation

Fig 2.5. Syntax-directed definition for infix to postfix translation.

An example of synthesized attributes for input X=9-5+2

Fig 2.6. Attribute values at nodes in a parse tree.

Syntax-directed Translation Schemes(SDTS)


 A translation scheme is a context-free grammar in which program fragments
called translation actions are embedded within the right sides of the production.
productions(postfix) SDD for postfix to SDTS
infix notation
list → list + term list.t = t.t || term.t || "+" list → list + term
lis {print("+")}

 {print("+");} : translation(semantic) action.


 SDTS generates an output for each sentence x generated by underlying grammar by
executing actions in the order they appear during depth-first traversal of a parse tree for
x.

Downloaded by Piryanshu Sharma ([email protected])


lOMoARcPSD|54605199

1. Design translation schemes(SDTS) for translation


2. Translate :
a) parse the input string x and
b) emit the action result encountered during the depth-first traversal of parse tree.

Fig 2.7. Example of a depth-first traversal of a tree. Fig 2.8. An extra leaf is constructed for a semantic action.

Example 2.8.
 SDD vs. SDTS for infix to postfix translation.

productions SDD SDTS


expr → list + term expr.t = list.t || term.t || expr → list + term
expr → list + term "+" expr.t = list.t || term.t || printf{"+")}
expr → term "-" expr.t = term.t expr → list + term printf{"-")}
term → 0 term.t = "0" expr → term
term → 1 term.t = "1" term → 0 printf{"0")}
… … term → 1 printf{"1")}
term → 9 term.t = "9" …
term → 9 printf{"0")}

 Action translating for input 9-5+2

Fig 2.9. Actions translating 9-5+2 into 95-2+.


1) Parse.
2) Translate.
Do we have to maintain the whole parse tree ?
No, Semantic actions are performed during parsing, and we don't need the nodes (whose
semantic actions done).

Downloaded by Piryanshu Sharma ([email protected])


lOMoARcPSD|54605199

2.4 PARSING
if token string x ∈ L(G), then parse tree
else error message
Top-Down parsing
1. At node n labeled with nonterminal A, select one of the productions whose left part is
A and construct children of node n with the symbols on the right side of that
production.
2. Find the next node at which a sub-tree is to be
constructed. ex. G: type → simple
|↑id
|array [ simple ] of type
simple → integer
|char
|num dotdot num

Fig 2.10. Top-down parsing while scanning the input from left to right.

Downloaded by Piryanshu Sharma ([email protected])


lOMoARcPSD|54605199

Fig 2.11. Steps in the top-down construction of a parse tree.


 The selection of production for a nonterminal may involve trial-and-error.
=> backtracking

 G : { S->aSb | c | ab }
According to topdown parsing procedure, acb , aabb∈L(G)?
 S/acb⇒aSb/acb⇒aSb/acb⇒aaSbb/acb ⇒ X
(S→aSb) move (S→aSb) backtracking
⇒aSb/acb⇒acb/acb⇒acb/acb⇒acb/acb
(s→c) move move
so, acb∈ L(G)
Is is finished in 7 steps including one backtracking.

 S/aabb⇒aSb/aabb⇒aSb/aabb⇒aaSbb/aabb⇒aaSbb/aabb⇒aaaSbbb/aabb ⇒ X
(S→aSb) move (S→aSb) move (S→aSb) backtracking
⇒aaSbb/aabb⇒aacbb/aabb ⇒ X
(S→c) backtracking
⇒aaSbb/aabb⇒aaabbb/aabb⇒ X
(S→ab) backtracking
⇒aaSbb/aabb⇒ X
backtracking
⇒aSb/aabb⇒acb/aabb
(S→c) bactracking
⇒aSb/aabb⇒aabb/aabb⇒aabb/aabb⇒aabb/aabb⇒aaba/aabb
(S→ab) move move move
so, aabb∈L(G)
but process is too difficult. It needs 18 steps including 5 backtrackings.

Downloaded by Piryanshu Sharma ([email protected])


lOMoARcPSD|54605199

 procedure of top-down parsing


let a pointed grammar symbol and pointed input symbol be g, a respectively.
o if( g ∈ N ) select and expand a production whose left part equals to g next to
current production.
else if( g = a ) then make g and a be a symbol next to current symbol.
else if( g ≠a ) back tracking
 let the pointed input symbol a be the symbol that moves back to
steps same with the number of current symbols of underlying
production
 eliminate the right side symbols of current production and let the
pointed symbol g be the left side symbol of current production.

Predictive parsing (Recursive Decent Parsing,RDP)


 A strategy for the general top-down parsing
Guess a production, see if it matches, if not, backtrack and try another.

 It may fail to recognize correct string in some grammar G and is tedious in processing.

 Predictive parsing
o is a kind of top-down parsing that predicts a production whose derived terminal
symbol is equal to next input symbol while expanding in top-down paring.
o without backtracking.
o Procedure decent parser is a kind of predictive parser that is implemented by
disjoint recursive procedures one procedure for each nonterminal, the procedures
are patterned after the productions.
 procedure of predictive parsing(RDP)
let a pointed grammar symbol and pointed input symbol be g, a respectively.
o if( g ∈ N )
 select next production P whose left symbol equals to g and a set of first
terminal symbols of derivation from the right symbols of the production
P includes a input symbol a.
 expand derivation with that production P.
o else if( g = a ) then make g and a be a symbol next to current symbol.
o else if( g ≠a ) error

 G : { S→aSb | c | ab } => G1 : { S->aS' | c S'->Sb | ab }


According to predictive parsing procedure, acb , aabb∈L(G)?
o S/acb⇒ confused in { S→aSb, S→ab }
o so, a predictive parser requires some restriction in grammar, that is, there should
be only one production whose left part of productions are A and each first
terminal symbol of those productions have unique terminal symbol.
 Requirements for a grammar to be suitable for RDP: For each nonterminal either
1. A → Bα, or
2. A → a1α1 | a2α2 | … | anαn
1) for 1 ≦ i, j ≦ n and i≠ j, ai ≠ aj
2) A ε may also occur if none of ai can follow A in a derivation and if we have A→ε

Downloaded by Piryanshu Sharma ([email protected])


lOMoARcPSD|54605199

 If the grammar is suitable, we can parse efficiently without


backtrack. General top-down parser with backtracking

Recursive Descent Parser without backtracking

Picture Parsing ( a kind of predictive parsing ) without backtracking

Left Factoring
 If a grammar contains two productions of
form S→ aα and S → aβ
it is not suitable for top down parsing without backtracking. Troubles of this form can
sometimes be removed from the grammar by a technique called the left factoring.
 In the left factoring, we replace { S→ aα, S→ aβ } by
{ S → aS', S'→ α, S'→ β } cf. S→ a(α|β)
(Hopefully α and β start with different symbols)
 left factoring for G { S→aSb | c | ab }
S→aS' | c cf. S(=aSb | ab | c = a ( Sb | b) | c ) → a S' | c
S'→Sb | b
 A concrete example:
<stmt> → IF <boolean> THEN <stmt> |
IF <boolean> THEN <stmt> ELSE
<stmt> is transformed into
<stmt>→ IF <boolean> THEN <stmt> S'
S' → ELSE <stmt> | ε

 Example,
ofor G1 : { S→aSb | c | ab }
According to predictive parsing procedure, acb , aabb∈L(G)?
 S/aabb⇒ unable to choose { S→aSb, S→ab ?}
o According for the feft factored gtrammar G1, acb ,
aabb∈L(G)? G1 : { S→aS'|c S'→Sb|b} <= {S=a(Sb|b) | c }
o S/acb⇒aS'/acb⇒aS'/acb ⇒ aSb/acb ⇒ acb/acb ⇒ acb/acb⇒ acb/acb
(S→aS') move (S'→Sb⇒aS'b) (S'→c) move move
so, acb∈ L(G)
It needs only 6 steps whithout any backtracking.
cf. General top-down parsing needs 7 steps and I backtracking.
o S/aabb⇒aS'/aabb⇒aS'/aabb⇒aSb/aabb⇒aaS'b/aabb⇒aaS'b/aabb⇒aabb/aabb⇒ ⇒
(S→aS') move (S'→Sb⇒aS'b) (S'→aS') move (S'→b) move move
so, aabb∈L(G)
but, process is finished in 8 steps without any backtracking.
cf. General top-down parsing needs 18 steps including 5 backtrackings.

Left Recursion
 A grammar is left recursive iff it contains a nonterminal A, such
that A⇒+ Aα, where is any string.
o Grammar {S→ Sα | c} is left recursive because of S⇒Sα
o Grammar {S→ Aα, A→ Sb | c} is also left recursive because of S⇒Aα⇒ Sbα
 If a grammar is left recursive, you cannot build a predictive top down parser for it.

Downloaded by Piryanshu Sharma ([email protected])


lOMoARcPSD|54605199

1) If a parser is trying to match S & S→Sα, it has no idea how many times S must
be applied
2) Given a left recursive grammar, it is always possible to find another grammar
that generates the same language and is not left recursive.
3) The resulting grammar might or might not be suitable for RDP.

 After this, if we need left factoring, it is not suitable for RDP.


 Right recursion: Special care/Harder than left recursion/SDT can handle.

Eliminating Left Recursion


Let G be S→ S A | A
Note that a top-down parser cannot parse the grammar G, regardless of the order the productions
are tried.
⇒ The productions generate strings of form AA …A
⇒ They can be replaced by S→A S' and S'→A S'|ε

Example :
 A → Aα∣β
=>
A → βR
R → αR | ε

Fig 2.12. Left-and right-recursive ways of generating a string.

 In general, the rule is that


o If A→ Aα1 | Aα2 | … | Aαn and
A→ β1 | β2 | … | βm (no βi's start with A),
then, replace by
A → β1R | β2R| … | βmR and
Z → α1R | α2R | … | αnR | ε

Exercise: Remove the left recursion in the following grammar:


expr → expr + term | expr - term
expr → term
solution:
expr → term rest
rest → + term rest | - term rest | ε

Downloaded by Piryanshu Sharma ([email protected])


lOMoARcPSD|54605199

2.5 A TRANSLATOR FOR SIMPLE EXPRESSIONS


 Convert infix into postfix(polish notation) using SDT.
 Abstract syntax (annotated parse tree) tree vs. Concrete syntax tree

 Concrete syntax tree : parse tree.


 Abstract syntax tree: syntax tree
 Concrete syntax : underlying grammar

Adapting the Translation Scheme


 Embed the semantic action in the production
 Design a translation scheme
 Left recursion elimination and Left factoring
 Example
3) Design a translate scheme and eliminate left recursion
E→ E + T {'+'} E→ T {} R
E→ E - T {'-'} R→ + T{'+'} R
E→ T {} R→ - T{'-'} R
T→ 0{'0'}|…|9{'9'} R→ ε
T→ 0{'0'}…|9{'9'}
4)Translate of a input string 9-5+2 : parsing and SDT

Result: 9 5 – 2 +

Downloaded by Piryanshu Sharma ([email protected])


lOMoARcPSD|54605199

Example of translator design and execution


 A translation scheme and with left-recursion.
Initial specification for infix-to-postfix with left recursion eliminated
translator
expr → expr + term {printf{"+")} expr → term rest
expr → expr - term {printf{"-")} rest → + term {printf{"+")} rest
expr → term rest → - term {printf{"-")} rest
term → 0 {printf{"0")} rest → ε
term → 1 {printf{"1")} term → 0 {printf{"0")}
… term → 1 {printf{"1")}
term → 9 {printf{"0")} …
term → 9 {printf{"0")}

Fig 2.13. Translation of 9 – 5 +2 into 95-2+.

Procedure for the Nonterminal expr, term, and rest

Fig 2.14. Function for the nonterminals expr, rest, and term.

Downloaded by Piryanshu Sharma ([email protected])


lOMoARcPSD|54605199

Optimizer and Translator

2.6 LEXICAL ANALYSIS


 reads and converts the input into a stream of tokens to be analyzed by parser.
 lexeme : a sequence of characters which comprises a single token.
 Lexical Analyzer →Lexeme / Token → Parser
Removal of White Space and Comments
 Remove white space(blank, tab, new line etc.) and comments
Contsants
 Constants: For a while, consider only integers
 eg) for input 31 + 28, output(token
representation)? input : 31 + 28
output: <num, 31> <+, > <num, 28>
num + :token
31 28 : attribute, value(or lexeme) of integer token num
Recognizing
 Identifiers
o Identifiers are names of variables, arrays, functions...
o A grammar treats an identifier as a token.
o eg) input : count = count + increment;
output : <id,1> <=, > <id,1> <+, > <id, 2>;
Symbol table
tokens attributes(lexeme)

0
1 id count
2 id increment
3
 Keywords are reserved, i.e., they cannot be used as identifiers.

Downloaded by Piryanshu Sharma ([email protected])


lOMoARcPSD|54605199

Then a character string forms an identifier only if it is no a keyword.


 punctuation symbols
o operators : + - * / := < > …

Interface to lexical analyzer

Fig 2.15. Inserting a lexical analyzer between the input and the parser

A Lexical Analyzer

Fig 2.16. Implementing the interactions in Fig. 2.15.

 c=getchcar(); ungetc(c,stdin);
 token representation
o #define NUM 256
 Function lexan()
eg) input string 76 + a
input , output(returned value)
76 NUM, tokenval=76 (integer)
+ +
A id , tokeval="a"

 A way that parser handles the token NUM returned by laxan()


o consider a translation
scheme factor → ( expr )
| num { print(num.value) }
#define NUM 256

Downloaded by Piryanshu Sharma ([email protected])


lOMoARcPSD|54605199

...
factor() {
if(lookahead == '(' ) {
match('('); exor(); match(')');
} else if (lookahead == NUM) {
printf(" %f ",tokenval); match(NUM);
} else error();
}
 The implementation of function lexan
1) #include <stdio.h>
2) #include <ctype.h>
3) int lino = 1;
4) int tokenval = NONE;
5) int lexan() {
6) int t;
7) while(1) {
8) t = getchar();
9) if ( t==' ' || t=='\t' ) ;
10) else if ( t=='\n' ) lineno +=1;
11) else if (isdigit(t)) {
12) tokenval = t -'0';
13) t = getchar();
14) while ( isdigit(t)) {
15) tokenval = tokenval*10 + t - '0';
16) t =getchar();
17) }
18) ungetc(t,stdin);
19) retunr NUM;
20) } else {
21) tokenval = NONE;
22) return t;
23) }
24) }
25) }

2.7 INCORPORATION A SYMBOL TABLE


 The symbol table interface, operation, usually called by parser.
o insert(s,t): input s: lexeme
t: token
output index of new entry
o lookup(s): input s: lexeme
output index of the entry for string s, or 0 if s is not found in the symbol
table.
 Handling reserved keywords
1. Inserts all keywords in the symbol table in
advance. ex) insert("div", div)

Downloaded by Piryanshu Sharma ([email protected])


lOMoARcPSD|54605199

insert("mod", mod)
2. while parsing
 whenever an identifier s is encountered.
if (lookup(s)'s token in {keywords} ) s is for a keyword; else s is for a identifier;

 example
o preset
insert("div",div);
insert("mod",mod);
o while parsing
lookup("count")=>0 insert("count",id);
lookup("i") =>0 insert("i",id);
lookup("i") =>4, id
llokup("div")=>1,div

Fig 2.17. Symbol table and array for storing strings.

2.8 ABSTRACT STACK MACHINE


o An abstract machine is for intermediate code generation/execution.
o Instruction classes: arithmetic / stack manipulation / control flow
 3 components of abstract stack machine
1) Instruction memory : abstract machine code, intermediate code(instruction)
2) Stack
3) Data memory
 An example of stack machine operation.
o for a input (5+a)*b, intermediate codes : push 5 rvalue 2 ....

Downloaded by Piryanshu Sharma ([email protected])


lOMoARcPSD|54605199

L-value and r-value


 l-values a : address of location a
 r-values a : if a is location, then content of location
a if a is constant, then value a
 eg) a :=5 + b;
lvalue a⇒2 r value 5 ⇒ 5 r value of b ⇒ 7

Stack Manipulation
 Some instructions for assignment operation
o push v : push v onto the stack.
o rvalue a : push the contents of data location a.
o lvalue a : push the address of data location a.
o pop : throw away the top element of the stack.
o := : assignment for the top 2 elements of the stack.
o copy : push a copy of the top element of the stack.

Translation of Expressions
 Infix expression(IE) → SDD/SDTS → Abstact macine codes(ASC) of postfix expression for
stack machine evaluation.
eg)
o IE: a + b, (⇒PE: a b + ) ⇒ IC: rvalue a
rvalue b
+
o day := (1461 * y) div 4 + (153 * m + 2) div 5 + d
(⇒ day 1462 y * 4 div 153 m * 2 + 5 div + d +
:=)
⇒1) lvalue day 6) div 11) push 5 16) :=
2) push 1461 7) push 153 12) div
3) rvalue y 8) rvalue m 13) +
4) * 9) push 2 14) rvalue d
5) push 4 10) + 15) +
 A translation scheme for assignment-statement into abstract astack machine code e can
be expressed formally In the form as follows:
stmt → id := expr
{ stmt.t := 'lvalue' || id.lexeme || expr.t || ':=' }
eg) day :=a+b ⇒ lvalue day rvalue a rvalue b + :=

Downloaded by Piryanshu Sharma ([email protected])


lOMoARcPSD|54605199

Control Flow
 3 types of jump instructions :
o Absolute target location
o Relative target location( distance :Current ↔Target)
o Symbolic target location(i.e. the machine supports labels)
 Control-flow instructions:
o label a: the jump's target a
o goto a: the next instruction is taken from statement labeled a
o gofalse a: pop the top & if it is 0 then jump to a
o gotrue a: pop the top & if it is nonzero then jump to a
o halt : stop execution

Translation of Statements
 Translation scheme for translation if-statement into abstract machine
code. stmt → if expr then stmt1
{ out := newlabel1)
stmt.t := expr.t || 'gofalse' out || stmt1.t || 'label' out }

Fig 2.18. Code layout for conditional and while statements.

 Translation scheme for while-statement ?

Emitting a Translation
 Semantic Action(Tranaslation Scheme):
1. stmt → if
expr { out := newlabel; emit('gofalse', out) }
then
stmt1 { emit('label', out) }
2. stmt → id { emit('lvalue', id.lexeme) }
:=
expr { emit(':=') }
3. stmt → i
expr { out := newlabel; emit('gofalse', out) }
then
stmt1 { emit('label', out) ; out1 := newlabel; emit('goto', out`1); }

Downloaded by Piryanshu Sharma ([email protected])


lOMoARcPSD|54605199

else
stmt2 { emit('label', out1) ; }
if(expr==false) goto out
stmt1 goto out1
out : stmt2
out1:

Implementation
 procedure stmt()
 var test,out:integer;
 begin
o if lookahead = id then begin
 emit('lvalue',tokenval);
match(id); match(':='); expr();
emit(':=');
o end
o else if lookahead = 'if' then begin
 match('if');
 expr();
 out := newlabel();
 emit('gofalse', out);
 match('then');
 stmt;
 emit('label', out)
o end
o else error();
 end

Control Flow with Analysis


 if E1 or E2 then S vs if E1 and E2 then S
E1 or E2 = if E1 then true else E2
E1 and E2 = if E1 then E2 else
false
 The code for E1 or E2.
o Codes for E1 Evaluation result: e1
o copy
o gotrue OUT
o pop
o Codes for E2 Evaluation result: e2
o label OUT

 The full code for if E1 or E2 then S ;


o codes for E1
o copy
o gotrue OUT1
o pop
o codes for E2
o label OUT1

Downloaded by Piryanshu Sharma ([email protected])


lOMoARcPSD|54605199

o gofalse OUT2
o code for S
o label OUT2
 Exercise: How about if E1 and E2 then
S;
o if E1 and E2 then S1 else S2;

2.9 Putting the techniques together!


 infix expression ⇒ postfix expression
eg) id+(id-id)*num/id ⇒ id id id - num * id /
+

Description of the Translator


 Syntax directed translation scheme
(SDTS) to translate the infix
expressions into the postfix expressions,
Fig 2.19. Specification for infix-to-postfix translation

Structure of the translator,

Fig 2.19. Modules of infix to postfix translator.

o global header file "header.h"

The Lexical Analysis Module lexer.c


o Description of tokens
+ - * / DIV MOD ( ) ID NUM DONE

Downloaded by Piryanshu Sharma ([email protected])


lOMoARcPSD|54605199

Fig 2.20. Description of tokens.

The Parser Module parser.c

SDTS
|| ← left recursion elimination
New SDTS

Fig 2.20. Specification for infix to postfix translator & syntax directed translation scheme after
eliminating left-recursion.

Downloaded by Piryanshu Sharma ([email protected])


lOMoARcPSD|54605199

The Emitter Module emitter.c


emit (t,tval)

The Symbol-Table Modules symbol.c and init.c


Symbol.c
data structure of symbol table Fig 2.29 p62
insert(s,t)
lookup(s)

The Error Module error.c


Example of execution
input 12 div 5 + 2
output 12
5
div
2
+

Downloaded by Piryanshu Sharma ([email protected])


lOMoARcPSD|54605199

3. Lexical Analysis:

3.1 OVER VIEW OF LEXICAL ANALYSIS


 To identify the tokens we need some method of describing the possible tokens that can
appear in the input stream. For this purpose we introduce regular expression, a notation
that can be used to describe essentially all the tokens of programming language.
 Secondly , having decided what the tokens are, we need some mechanism to recognize
these in the input stream. This is done by the token recognizers, which are designed using
transition diagrams and finite automata.

3.2 ROLE OF LEXICAL ANALYZER


The LA is the first phase of a compiler. It main task is to read the input character and produce as
output a sequence of tokens that the parser uses for syntax analysis.

Fig. 3.1: Role of Lexical analyzer

Upon receiving a ‘get next token’ command form the parser, the lexical analyzer reads
the input character until it can identify the next token. The LA return to the parser representation
for the token it has found. The representation will be an integer code, if the token is a simple
construct such as parenthesis, comma or colon.
LA may also perform certain secondary tasks as the user interface. One such task is
striping out from the source program the commands and white spaces in the form of blank, tab
and new line characters. Another is correlating error message from the compiler with the source
program.

3.3 TOKEN, LEXEME, PATTERN:


Token: Token is a sequence of characters that can be treated as a single logical entity.
Typical tokens are,
1) Identifiers 2) keywords 3) operators 4) special symbols 5)constants
Pattern: A set of strings in the input for which the same token is produced as output. This set
of strings is described by a rule called a pattern associated with the token.
Lexeme: A lexeme is a sequence of characters in the source program that is matched by the
pattern for a token.

Downloaded by Piryanshu Sharma ([email protected])


lOMoARcPSD|54605199

Fig. 3.2: Example of Token, Lexeme and Pattern

3.4. LEXICAL ERRORS:


Lexical errors are the errors thrown by your lexer when unable to continue. Which means that
there's no way to recognise a lexeme as a valid token for you lexer. Syntax errors, on the other
side, will be thrown by your scanner when a given set of already recognised valid tokens don't
match any of the right sides of your grammar rules. simple panic-mode error handling system
requires that we return to a high-level parsing function when a parsing or lexical error is
detected.

Error-recovery actions are:


i. Delete one character from the remaining input.
ii. Insert a missing character in to the remaining input.
iii. Replace a character by another character.
iv. Transpose two adjacent characters.

3.5. REGULAR EXPRESSIONS


Regular expression is a formula that describes a possible set of string. Component of regular
expression..
X the character x
. any character, usually accept a new
line [x y z] any of the characters x, y, z, …..
R? a R or nothing (=optionally as R)
R* zero or more occurrences…..
R+ one or more occurrences ……
R1R2 an R1 followed by an R2
R1|R1 either an R1 or an R2.
A token is either a single string or one of a collection of strings of a certain type. If we view the
set of strings in each token class as an language, we can use the regular-expression notation to
describe tokens.
Consider an identifier, which is defined to be a letter followed by zero or more letters or digits.
In regular expression notation we would write.
Identifier = letter (letter | digit)*

Downloaded by Piryanshu Sharma ([email protected])


lOMoARcPSD|54605199

Here are the rules that define the regular expression over alphabet .
 is a regular expression denoting { € }, that is, the language containing only the empty
string.
 For each ‘a’ in Σ, is a regular expression denoting { a }, the language with only one string
consisting of the single symbol ‘a’ .
 If R and S are regular expressions, then

(R) | (S) means L(r) U L(s)


R.S means L(r).L(s)
R* denotes L(r*)

3.6. REGULAR DEFINITIONS


For notational convenience, we may wish to give names to regular expressions and to define
regular expressions using these names as if they were symbols.
Identifiers are the set or string of letters and digits beginning with a letter. The following regular
definition provides a precise specification for this class of string.
Example-1,
Ab*|cd? Is equivalent to (a(b*)) | (c(d?))
Pascal identifier
Letter - A | B | ……| Z | a | b |……| z|
Digits - 0 | 1 | 2 | …. | 9
Id - letter (letter / digit)*

Recognition of tokens:
We learn how to express pattern using regular expressions. Now, we must study how to take the
patterns for all the needed tokens and build a piece of code that examins the input string and
finds a prefix that is a lexeme matching one of the patterns.
Stmt →if expr then stmt
| If expr then else stmt

Expr →term relop term
| term
Term →id
|number
For relop ,we use the comparison operations of languages like Pascal or SQL where = is “equals”
and < > is “not equals” because it presents an interesting structure of lexemes.
The terminal of grammar, which are if, then , else, relop ,id and numbers are the names of tokens
as far as the lexical analyzer is concerned, the patterns for the tokens are described using regular
definitions.
digit → [0,9]
digits →digit+
number →digit(.digit)?(e.[+-]?digits)?
letter → [A-Z,a-z]
id →letter(letter/digit)*
if → if
then →then

Downloaded by Piryanshu Sharma ([email protected])


lOMoARcPSD|54605199

else →else
relop →< | > |<= | >= | = = | < >

In addition, we assign the lexical analyzer the job stripping out white space, by recognizing the
“token” we defined by:
WS → (blank/tab/newline)+
Here, blank, tab and newline are abstract symbols that we use to express the ASCII characters of
the same names. Token ws is different from the other tokens in that ,when we recognize it, we do
not return it to parser ,but rather restart the lexical analysis from the character that follows the
white space . It is the following token that gets returned to the parser.

Lexeme Token Name Attribute Value


Any WS - -
if if -
then then -
else else -
Any id Id Pointer to table entry
Any number number Pointer to table entry
< relop LT
<= relop LE
== relop EQ
<> relop NE

3.7. TRANSITION DIAGRAM:


Transition Diagram has a collection of nodes or circles, called states. Each state represents a
condition that could occur during the process of scanning the input looking for a lexeme that
matches one of several patterns .
Edges are directed from one state of the transition diagram to another. each edge is labeled by a
symbol or set of symbols.
If we are in one state s, and the next input symbol is a, we look for an edge out of state s labeled
by a. if we find such an edge ,we advance the forward pointer and enter the state of the transition
diagram to which that edge leads.
Some important conventions about transition diagrams are
1. Certain states are said to be accepting or final .These states indicates that a lexeme has
been found, although the actual lexeme may not consist of all positions b/w the
lexeme Begin and forward pointers we always indicate an accepting state by a double
circle.
2. In addition, if it is necessary to return the forward pointer one position, then we
shall additionally place a * near that accepting state.
3. One state is designed the state ,or initial state ., it is indicated by an edge labeled “start”
entering from nowhere .the transition diagram always begins in the state before any input
symbols have been used.

Downloaded by Piryanshu Sharma ([email protected])


lOMoARcPSD|54605199

Fig. 3.3: Transition diagram of Relational operators

As an intermediate step in the construction of a LA, we first produce a stylized flowchart,


called a transition diagram. Position in a transition diagram, are drawn as circles and are
called as states.

Fig. 3.4: Transition diagram of Identifier

The above TD for an identifier, defined to be a letter followed by any no of letters or digits.A
sequence of transition diagram can be converted into program to look for the tokens specified
by the diagrams. Each state gets a segment of code.

3.8. FINITE AUTOMATON


 A recognizer for a language is a program that takes a string x, and answers “yes” if x is a
sentence of that language, and “no” otherwise.
 We call the recognizer of the tokens as a finite automaton.
 A finite automaton can be: deterministic (DFA) or non-deterministic (NFA)
 This means that we may use a deterministic or non-deterministic automaton as a lexical
analyzer.
 Both deterministic and non-deterministic finite automaton recognize regular sets.
 Which one?
– deterministic – faster recognizer, but it may take more space
– non-deterministic – slower, but it may take less space
– Deterministic automatons are widely used lexical analyzers.
 First, we define regular expressions for tokens; Then we convert them into a DFA to get
a lexical analyzer for our tokens.

Downloaded by Piryanshu Sharma ([email protected])


lOMoARcPSD|54605199

3.9. Non-Deterministic Finite Automaton (NFA)


 A non-deterministic finite automaton (NFA) is a mathematical model that consists of:
o S - a set of states
o Σ - a set of input symbols (alphabet)
o move - a transition function move to map state-symbol pairs to sets of states.
o s0 - a start (initial) state
o F- a set of accepting states (final states)
 ε- transitions are allowed in NFAs. In other words, we can move from one state
to another one without consuming any symbol.
 A NFA accepts a string x, if and only if there is a path from the starting state to one of
accepting states such that edge labels along this path spell out x.
Example:

3.10. Deterministic Finite Automaton (DFA)

 A Deterministic Finite Automaton (DFA) is a special form of a NFA.


 No state has ε- transition
 For each symbol a and state s, there is at most one labeled edge a leaving s. i.e.
transition function is from pair of state-symbol to state (not set of states)

Example:

Downloaded by Piryanshu Sharma ([email protected])


lOMoARcPSD|54605199

3.11. Converting RE to NFA


 This is one way to convert a regular expression into a NFA.
 There can be other ways (much efficient) for the conversion.
 Thomson’s Construction is simple and systematic method.
 It guarantees that the resulting NFA will have exactly one final state, and one start state.
 Construction starts from simplest parts (alphabet symbols).
 To create a NFA for a complex regular expression, NFAs of its sub-expressions
are combined to create its NFA.
 To recognize an empty string ε:

 To recognize a symbol a in the alphabet Σ:

 For regular expression r1 | r2:

N(r1) and N(r2) are NFAs for regular expressions r1 and r2.

Downloaded by Piryanshu Sharma ([email protected])


lOMoARcPSD|54605199

 For regular expression r1 r2

Here, final state of N(r1) becomes the final state of N(r1r2).


 For regular expression r*

Example:
For a RE (a|b) * a, the NFA construction is shown below.

3.12. Converting NFA to DFA (Subset Construction)


We merge together NFA states by looking at them from the point of view of the input characters:
 From the point of view of the input, any two states that are connected by an –transition
may as well be the same, since we can move from one to the other without consuming
any character. Thus states which are connected by an -transition will be represented by
the same states in the DFA.
 If it is possible to have multiple transitions based on the same symbol, then we can regard
a transition on a symbol as moving from a state to a set of states (ie. the union of all those
states reachable by a transition on the current symbol). Thus these states will be
combined into a single DFA state.
To perform this operation, let us define two functions:
 The -closure function takes a state and returns the set of states reachable from it based on
(one or more) -transitions. Note that this will always include the state itself. We should
be able to get from a state to any state in its -closure without consuming any input.
 The function move takes a state and a character, and returns the set of states reachable
by one transition on this character.

Downloaded by Piryanshu Sharma ([email protected])


lOMoARcPSD|54605199

We can generalise both these functions to apply to sets of states by taking the union of the
application to individual states.

For Example, if A, B and C are states, move({A,B,C},`a') = move(A,`a') move(B,`a')


move(C,`a').
The Subset Construction Algorithm is a follows:

put ε-closure({s0}) as an unmarked state into the set of DFA (DS)


while (there is one unmarked S1 in DS) do
begin
mark S1
for each input symbol a do
begin
S2 ← ε-closure(move(S1,a))
if (S2 is not in DS) then
add S2 into DS as an unmarked state
transfunc[S1,a] ← S2
end
end

a state S in DS is an accepting state of DFA if a state in S is an accepting state of NFA


 the start state of DFA is ε-closure({s0})

3.13. Lexical Analyzer Generator

3.18. Lex specifications:


A Lex program (the .l file ) consists of three parts:
declarations
%%
translation rules
%%
auxiliary procedures

Downloaded by Piryanshu Sharma ([email protected])


lOMoARcPSD|54605199

1. The declarations section includes declarations of variables,manifest constants(A manifest


constant is an identifier that is declared to represent a constant e.g. # define PIE 3.14),
and regular definitions.
2. The translation rules of a Lex program are statements of the form :

p1 {action 1}
p2 {action 2}
p3 {action 3}
……
……
Where, each p is a regular expression and each action is a program fragment describing
what action the lexical analyzer should take when a pattern p matches a lexeme. In Lex
the actions are written in C.
3. The third section holds whatever auxiliary procedures are needed by the
actions.Alternatively these procedures can be compiled separately and loaded with the
lexical analyzer.

Note: You can refer to a sample lex program given in page no. 109 of chapter 3 of the book:
Compilers: Principles, Techniques, and Tools by Aho, Sethi & Ullman for more clarity.

3.19. INPUT BUFFERING

The LA scans the characters of the source pgm one at a time to discover tokens. Because of large
amount of time can be consumed scanning characters, specialized buffering techniques have been
developed to reduce the amount of overhead required to process an input character.
Buffering techniques:
1. Buffer pairs
2. Sentinels

The lexical analyzer scans the characters of the source program one a t a time to discover tokens.
Often, however, many characters beyond the next token many have to be examined before the
next token itself can be determined. For this and other reasons, it is desirable for thelexical
analyzer to read its input from an input buffer. Figure shows a buffer divided into two haves of,
say 100 characters each. One pointer marks the beginning of the token being discovered. A look
ahead pointer scans ahead of the beginning point, until the token is discovered .we view the
position of each pointer as being between the character last read and thecharacter next to be read.
In practice each buffering scheme adopts one convention either apointer is at the symbol last
read or the symbol it is ready to read.

Token beginnings look ahead pointerThe distance which the lookahead pointer may have to
travel past the actual token may belarge. For example, in a PL/I program we may see:

Downloaded by Piryanshu Sharma ([email protected])


lOMoARcPSD|54605199

DECALRE (ARG1, ARG2… ARG n) Without knowing whether DECLARE is a keyword or an


array name until we see the character that follows the right parenthesis. In either case, the token
itself ends at the second E. If the look ahead pointer travels beyond the buffer half in which it
began, the other half must be loaded with the next characters from the source file. Since the
buffer shown in above figure is of limited size there is an implied constraint on how much look
ahead can be used before the next token is discovered. In the above example, ifthe look ahead
traveled to the left half and all the way through the left half to the middle, we could not reload
the right half, because we would lose characters that had not yet been groupedinto tokens. While
we can make the buffer larger if we chose or use another buffering scheme,we cannot ignore the
fact that overhead is limited.

Downloaded by Piryanshu Sharma ([email protected])


lOMoARcPSD|54605199

SYNTAX ANALYSIS

4.1 ROLE OF THE PARSER :


Parser for any grammar is program that takes as input string w (obtain set of strings tokens
from the lexical analyzer) and produces as output either a parse tree for w , if w is a valid
sentences of grammar or error message indicating that w is not a valid sentences of given
grammar. The goal of the parser is to determine the syntactic validity of a source string is
valid, a tree is built for use by the subsequent phases of the computer. The tree reflects the
sequence of derivations or reduction used during the parser. Hence, it is called parse tree. If
string is invalid, the parse has to issue diagnostic message identifying the nature and cause of
the errors in string. Every elementary subtree in the parse tree corresponds to a production of
the grammar.
There are two ways of identifying an elementry sutree:

1. By deriving a string from a non-terminal or


2. By reducing a string of symbol to a non-terminal.

The two types of parsers employed are:


a. Top down parser: which build parse trees from top(root) to
bottom(leaves)
b. Bottom up parser: which build parse trees from leaves and work up the
root.

Fig . 4.1: position of parser in compiler model.


4.2 CONTEXT FREE GRAMMARS
Inherently recursive structures of a programming language are defined by a context-free
Grammar. In a context-free grammar, we have four triples G( V,T,P,S).
Here , V is finite set of terminals (in our case, this will be the set of tokens)
T is a finite set of non-terminals (syntactic-variables)

Downloaded by Piryanshu Sharma ([email protected])


lOMoARcPSD|54605199

P is a finite set of productions rules in the following form


A → α where A is a non-terminal and α is a string of terminals and non-terminals
(including the empty string)
S is a start symbol (one of the non-terminal symbol)
L(G) is the language of G (the language generated by G) which is a set of sentences.
A sentence of L(G) is a string of terminal symbols of G. If S is the start symbol of G then
ω is a sentence of L(G) iff S ⇒ω whereω is a string of terminals of G. If G is a context-
free grammar, L(G) is a context-free language. Two grammar G1 and G2 are equivalent, if
they produce same grammar.
Consider the production of the form S ⇒α, If α contains non-terminals, it is called as a
sentential form of G. If α does not contain non-terminals, it is called as a sentence of G.
4.2.1 Derivations
In general a derivation step is
αAβ α⇒γβ is sentential form and if there is a production rule A→γ in our
grammar. where α and β are arbitrary strings of terminal and non-terminal symbols α1
⇒α2 ⇒... ⇒ αn (αn derives from α1 or α1 derives αn ). There are two types of derivaion
1At each derivation step, we can choose any of the non-terminal in the sentential form of G
for the replacement.
2If we always choose the left-most non-terminal in each derivation step, this derivation is
called as left-most derivation.
Example:
E→E+E|E–E|E*E|E/E|-
EE→(E)
E → id
Leftmost derivation :
E→E+E
→E * E+E →id* E+E→id*id+E→id*id+id
The string is derive from the grammar w= id*id+id, which is consists of all terminal
symbols
Rightmost derivation
E→E+E
→E+E * E→E+ E*id→E+id*id→id+id*id
Given grammar G : E → E+E | E*E | ( E ) | - E |
id Sentence to be derived : – (id+id)

Downloaded by Piryanshu Sharma ([email protected])


lOMoARcPSD|54605199

LEFTMOST DERIVATION RIGHTMOST DERIVATION


E→-E E→-E
E→-(E) E→-(E)
E → - ( E+E ) E → - (E+E )
E → - ( id+E ) E → - ( E+id )
E → - ( id+id ) E → - ( id+id )
 String that appear in leftmost derivation are called left sentinel forms.
 String that appear in rightmost derivation are called right sentinel forms.
Sentinels:
 Given a grammar G with start symbol S, if S → α , where α may contain non-
terminals or terminals, then α is called the sentinel form of G.
Yield or frontier of tree:
 Each interior node of a parse tree is a non-terminal. The children of node can be a
terminal or non-terminal of the sentinel forms that are read from left to right. The
sentinel form in the parse tree is called yield or frontier of the tree.
4.2.2 PARSE TREE
 Inner nodes of a parse tree are non-terminal symbols.
 The leaves of a parse tree are terminal symbols.
 A parse tree can be seen as a graphical representation of a derivation.

Ambiguity:
A grammar that produces more than one parse for some sentence is said to be ambiguous
grammar.

Downloaded by Piryanshu Sharma ([email protected])


lOMoARcPSD|54605199

Example : Given grammar G : E → E+E | E*E | ( E ) | - E | id

The sentence id+id*id has the following two distinct leftmost derivations:
E → E+ E E → E* E
E → id + E E→E+E*E
E → id + E * E E → id + E * E
E → id + id * E E → id + id * E
E → id + id * id E → id + id * id

The two corresponding parse trees are :

Example:
To disambiguate the grammar E → E+E | E*E | E^E | id | (E), we can use precedence of
operators as follows:
^ (right to left)
/,* (left to right)
-,+ (left to right)
We get the following unambiguous grammar:
E → E+T | T
T → T*F | F
F → G^F | G
G → id | (E)
Consider this example, G: stmt → if expr then stmt |if expr then stmt elsestmt | other
This grammar is ambiguous since the string if E1 then if E2 then S1 else S2 has the
following

Downloaded by Piryanshu Sharma ([email protected])


lOMoARcPSD|54605199

Two parse trees for leftmost derivation :

To eliminate ambiguity, the following grammar may be used:


stmt → matched_stmt | unmatched_stmt
matched_stmt → if expr then matched_stmt else matched_stmt | other
unmatched_stmt → if expr then stmt| if expr then matched_stmt else unmatched_stmt
Eliminating Left Recursion:
A grammar is said to be left recursive if it has a non-terminal A such that there is a derivation
A=>Aα for some string α. Top-down parsing methods cannot handle left-recursive grammars.
Hence, left recursion can be eliminated as follows:

Downloaded by Piryanshu Sharma ([email protected])


lOMoARcPSD|54605199

If there is a production A → Aα | β it can be replaced with a sequence of two


productions
A → βA’
A’ → αA’ | ε
Without changing the set of strings derivable from A.
Example : Consider the following grammar for arithmetic expressions:
E → E+T | T
T → T*F | F
F → (E) | id
First eliminate the left recursion for E as
E → TE’
E’ → +TE’ |ε
Then eliminate for T as
T → FT’
T’→ *FT’ | ε
Thus the obtained grammar after eliminating left recursion is
E → TE’
E’ → +TE’ |ε
T → FT’
T’ → *FT’ | ε
F → (E) | id
Algorithm to eliminate left recursion:

1. Arrange the non-terminals in some order A1, A2 . . . An.


2.for i := 1 to n do begin
for j := 1 to i-1 do begin
replace each production of the form Ai → Aj γ
by the productions Ai → δ1 γ | δ2γ | . . . | δk γ
where Aj→ δ1 |δ2 | . . . |δk are all the current Aj-productions;
end
eliminate the immediate left recursion among the Ai-productions
end

Downloaded by Piryanshu Sharma ([email protected])


lOMoARcPSD|54605199

Left factoring:

Left factoring is a grammar transformation that is useful for producing a grammar suitable for
predictive parsing. When it is not clear which of two alternative productions to use to expand
a non-terminal A, we can rewrite the A-productions to defer the decision until we have seen
enough of the input to make the right choice.
If there is any production A → αβ1 | αβ2 , it can be rewritten as
A → αA’
A’ → β1 | β2
Consider the grammar , G : S→iEtS | iEtSeS | a
E→b
Left factored, this grammar becomes
S → iEtSS’ | a
S’ → eS | ε
E→b
TOP-DOWN PARSING
It can be viewed as an attempt to find a left-most derivation for an input string or an
attempt to construct a parse tree for the input starting from the root to the leaves.
Types of top-down parsing :
1. Recursive descent parsing
2. Predictive parsing
1. RECURSIVE DESCENT PARSING
 Recursive descent parsing is one of the top-down parsing techniques that uses a set of
recursive procedures to scan its input.
 This parsing method may involve backtracking, that is, making repeated scans of the
input.
Example for backtracking :
Consider the grammar G : S→cAd
A → ab | a
and the input string w=cad.
The parse tree can be constructed using the following top-down approach :
Step1:
Initially create a tree with single node labeled S. An input pointer points to ‘c’, the first
symbol of w. Expand the tree with the production of S.

Downloaded by Piryanshu Sharma ([email protected])


lOMoARcPSD|54605199

Step2:
The leftmost leaf ‘c’ matches the first symbol of w, so advance the input pointer to the second
symbol of w ‘a’ and consider the next leaf ‘A’. Expand A using the first alternative.

Step3:
The second symbol ‘a’ of w also matches with second leaf of tree. So advance the input
pointer to third symbol of w ‘d’. But the third leaf of tree is b which does not match with the
input symbol d.
Hence discard the chosen production and reset the pointer to second position. This is called
backtracking.
Step4:
Now try the second alternative for A.

Now we can halt and announce the successful completion of parsing.

Downloaded by Piryanshu Sharma ([email protected])


lOMoARcPSD|54605199

Example for recursive decent parsing:


A left-recursive grammar can cause a recursive-descent parser to go into an infinite loop.
Hence, elimination of left-recursion must be done before parsing.
Consider the grammar for arithmetic expressions
E → E+T | T
T → T*F | F
F → (E) | id
After eliminating the left-recursion the grammar becomes,
E → TE’
E’ → +TE’ |ε
T → FT’
T’ → *FT’ | ε
F → (E) | id
Now we can write the procedure for grammar as follows:
Recursive procedure:
Procedure E()
begin
T( );
EPRIME( );
End
Procedure EPRIME( )
begin
If input_symbol=’+’ then
ADVANCE( );
T( );
EPRIME( );
end
Procedure T( )
begin
F( );
TPRIME( );
End

Downloaded by Piryanshu Sharma ([email protected])


lOMoARcPSD|54605199

Procedure TPRIME( )
begin
If input_symbol=’*’ then
ADVANCE( );
F( );
TPRIME( );
end
Procedure F( )
begin
If input-symbol=’id’ then
ADVANCE( );
else if input-symbol=’(‘ then
ADVANCE( );
E( );
else if input-symbol=’)’ then
ADVANCE( );
end

else ERROR( );
Stack implementation:
PROCEDURE INPUT STRING
E( ) id+id*id
T( ) id+id*id
F( ) id+id*id
ADVANCE( ) id id*id
TPRIME( ) id id*id
EPRIME( ) id id*id
ADVANCE( ) id+id*id
T( ) id+id*id
F( ) id+id*id
ADVANCE( ) id+id*id
TPRIME( ) id+id*id
ADVANCE( ) id+id*id
F( ) id+id*id
ADVANCE( ) id+id*id
TPRIME( ) id+id*id

Downloaded by Piryanshu Sharma ([email protected])


lOMoARcPSD|54605199

2. PREDICTIVE PARSING
 Predictive parsing is a special case of recursive descent parsing where no
backtracking is required.
 The key problem of predictive parsing is to determine the production to be applied
for a non-terminal in case of alternatives.
Non-recursive predictive parser

The table-driven predictive parser has an input buffer, stack, a parsing table and an output
stream.
Input buffer:
It consists of strings to be parsed, followed by $ to indicate the end of the input string.
Stack:
It contains a sequence of grammar symbols preceded by $ to indicate the bottom of the stack.
Initially, the stack contains the start symbol on top of $.
Parsing table:
It is a two-dimensional array M[A, a], where ‘A’ is a non-terminal and ‘a’ is a terminal.
Predictive parsing program:
The parser is controlled by a program that considers X, the symbol on top of stack, and a, the
current input symbol. These two symbols determine the parser action. There are three
possibilities:
1. If X = a = $, the parser halts and announces successful completion of parsing.
2. If X = a ≠ $, the parser pops X off the stack and advances the input pointer to
the next input symbol.
3. If X is a non-terminal , the program consults entry M[X, a] of the parsing table
M. This entry will either be an X-production of the grammar or an error entry.

Downloaded by Piryanshu Sharma ([email protected])


lOMoARcPSD|54605199

If M[X, a] = {X → UVW},the parser replaces X on top of the stack by UVW


If M[X, a] =error, the parser calls an error recovery routine.
Algorithm for nonrecursive predictive parsing:
Input : A string w and a parsing table M for grammar G.
Output : If w is in L(G), a leftmost derivation of w; otherwise, an error indication.
Method : Initially, the parser has $S on the stack with S, the start symbol of G on top, and w$
in the input buffer. The program that utilizes the predictive parsing table M to produce a parse
for the input is as follows:
set ip to point to the first symbol of w$;
repeat
letX be the top stack symbol andathe symbol pointed to by ip;
if X is a terminal or $then
if X = a then
popX from the stack and advance ip
else error()
else/* X is a non-terminal */
if M[X, a] = X →Y1Y2 … Yk then begin
pop X from the stack;
push Yk, Yk-1, … ,Y1 onto the stack, with Y1 on top;
output the production X → Y1 Y2 . . . Yk
end
elseerror()
until X = $
Predictive parsing table construction:
The construction of a predictive parser is aided by two functions associated with a grammar
G:
1. FIRST
2. FOLLOW
Rules for first( ):
1. If X is terminal, then FIRST(X) is {X}.
2. If X → ε is a production, then add ε to FIRST(X).
3. If X is non-terminal and X → aα is a production then add a to FIRST(X).

Downloaded by Piryanshu Sharma ([email protected])


lOMoARcPSD|54605199

4. If X is non-terminal and X → Y 1 Y2…Yk is a production, then place a in FIRST(X) if for


some i, a is in FIRST(Yi), and ε is in all of FIRST(Y1),…,FIRST(Yi-1); that is, Y1,….Yi-
1
=> ε. If ε is in FIRST(Yj) for all j=1,2,..,k, then add ε to FIRST(X).
Rules for follow( ):
1. If S is a start symbol, then FOLLOW(S) contains $.
2. If there is a production A → αBβ, then everything in FIRST(β) except ε is placed
in follow(B).
3. If there is a production A → αB, or a production A → αBβ where FIRST(β) contains ε,
then everything in FOLLOW(A) is in FOLLOW(B).
Algorithm for construction of predictive parsing table:
Input : GrammarG
Output : Parsing table M
Method :
1. For each production A → α of the grammar, do steps 2 and 3.
2. For each terminal a in FIRST(α), add A → α to M[A, a].
3. If ε is in FIRST(α), add A → α to M[A, b] for each terminal b in FOLLOW(A). If ε is
in FIRST(α) and $ is in FOLLOW(A) , add A → α to M[A, $].
4. Make each undefined entry of M be error.
Example:
Consider the following grammar :
E → E+T | T
T→T*F | F
F → (E) | id
After eliminating left-recursion the grammar is
E → TE’
E’ → +TE’ |ε
T → FT’
T’ → *FT’ | ε
F → (E) | id
First( ) :
FIRST(E) = { ( , id}
FIRST(E’) ={+ ,ε}
FIRST(T) = { ( , id}
FIRST(T’) = {*, ε }
FIRST(F) = { ( , id }
Follow( ):
FOLLOW(E) = { $, ) }
FOLLOW(E’) = { $, ) }

Downloaded by Piryanshu Sharma ([email protected])


lOMoARcPSD|54605199

FOLLOW(T) = { +, $, ) }
FOLLOW(T’) = { +, $, ) }
FOLLOW(F) = {+, * , $ , ) }

LL(1) grammar:
The parsing table entries are single entries. So each location has not more than one entry.
This type of grammar is called LL(1) grammar.
Consider this following grammar:
S → iEtS | iEtSeS | a
E→b

Downloaded by Piryanshu Sharma ([email protected])


lOMoARcPSD|54605199

After eliminating left factoring, we have


S→iEtSS’ | a
S’→eS |ε
E→b
To construct a parsing table, we need FIRST() and FOLLOW() for all the non-terminals.
FIRST(S) = { i, a }
FIRST(S’) = {e,ε}
FIRST(E) = { b}
FOLLOW(S) = { $ ,e }
FOLLOW(S’) = { $ ,e }
FOLLOW(E) = {t}

Since there are more than one production, the grammar is not LL(1) grammar.
Actions performed in predictive parsing:
1. Shift
2. Reduce
3. Accept
4. Error
Implementation of predictive parser:
1. Elimination of left recursion, left factoring and ambiguous grammar.
2. Construct FIRST() and FOLLOW() for all non-terminals.
3. Construct predictive parsing table.
4. Parse the given input string using stack and parsing table.
BOTTOM-UP PARSING
Constructing a parse tree for an input string beginning at the leaves and going towards the
root is called bottom-up parsing.
A general type of bottom-up parser is a shift-reduce parser.

SHIFT-REDUCE PARSING
Shift-reduce parsing is a type of bottom-up parsing that attempts to construct a parse tree for
an input string beginning at the leaves (the bottom) and working up towards the root (the
top).
Example:
Consider the grammar:
S → aABe
A → Abc | b
B→d
The sentence to be recognized is abbcde.

Downloaded by Piryanshu Sharma ([email protected])


lOMoARcPSD|54605199

REDUCTION (LEFTMOST) RIGHTMOST DERIVATION

abbcde (A → b) S→ aABe
aAbcde (A → Abc) → aAde
aAde (B → d) → aAbcde
aABe(S → aABe) → abbcde
S
The reductions trace out the right-most derivation in reverse.

Handles:

A handle of a string is a substring that matches the right side of a production, and whose
reduction to the non-terminal on the left side of the production represents one step along the
reverse of a rightmost derivation.

Example:

Consider the grammar:

E → E+E
E → E*E
E → (E)
E → id

And the input string id1+id2*id3

The rightmost derivation is :

E →E+E
→ E+E*E
→ E+E*id3
→ E+id2*id3
→id 1+id2*id3

In the above derivation the underlined substrings are calledhandles.

Handle pruning:

A rightmost derivation in reverse can be obtained by “handle pruning”.


th
(i.e.) ifwis a sentence or string of the grammar at hand, thenw= y n, where yn is then right-
sentinel form of some rightmost derivation.

Downloaded by Piryanshu Sharma ([email protected])


lOMoARcPSD|54605199

Stack implementation of shift-reduce parsing :

Stack Input Action


$ id1+id2*id3 $ shift

$ id1 +id2*id3 $ reduce by E→id

$E +id2*id3 $ shift

$ E+ id2*id3 $ shift

$ E+id2 *id3 $ reduce by E→id

$ E+E *id3 $ shift

$ E+E* id3 $ shift

$ E+E*id3 $ reduce by E→id

$ E+E*E $ reduce by E→ E *E

$ E+E $ reduce by E→ E+E

$E $ accept

Actions in shift-reduce parser:


• shift – The next input symbol is shifted onto the top of the stack.
• reduce – The parser replaces the handle within a stack with a non-terminal.
• accept – The parser announces successful completion of parsing.
• error– The parser discovers that a syntax error has occurred and calls an error
recovery routine.

Conflicts in shift-reduce parsing:

There are two conflicts that occur in shift shift-reduce parsing:

1. Shift-reduce conflict: The parser cannot decide whether to shift or to reduce.

2. Reduce-reduce conflict: The parser cannot decide which of several reductions to make.

1. Shift-reduce conflict:

Example:

Consider the grammar:

E→E+E | E*E | id and input id+id*id

Downloaded by Piryanshu Sharma ([email protected])


lOMoARcPSD|54605199

Stack Input Action Stack Input Action


$ E+E *id $ Reduce by $E+E *id $ Shift
E→E+E
$E *id $ Shift $E+E* id $ Shift

$ E* id $ Shift $E+E*id $ Reduce by


E→id
$ E*id $ Reduce by $E+E*E $ Reduce by
E→id E→E*E
$ E*E $ Reduce by $E+E $ Reduce by
E→E*E E→E*E
$E $E

2. Reduce-reduce conflict:

Consider the grammar:

M → R+R | R+c | R
R→c
and input c+c

Stack Input Action Stack Input Action


$ c+c $ Shift $ c+c $ Shift

$c +c $ Reduce by $c +c $ Reduce by
R→c R→c
$R +c $ Shift $R +c $ Shift

$ R+ c$ Shift $ R+ c$ Shift

$ R+c $ Reduce by $ R+c $ Reduce by


R→c M→R+c
$ R+R $ Reduce by $M $
M→R+R
$M $

Downloaded by Piryanshu Sharma ([email protected])


lOMoARcPSD|54605199

Viable prefixes:
➢ a is a viable prefix of the grammar if there iswsuch that awis a right sentinel form.
➢ The set of prefixes of right sentinel forms that can appear on the stack of a shift-reduce parser
are called viable prefixes.
➢ The set of viable prefixes is a regular language.

OPERATOR-PRECEDENCE PARSING

An efficient way of constructing shift-reduce parser is called operator-precedence parsing.

Operator precedence parser can be constructed from a grammar called Operator-grammar. These
grammars have the property that no production on right side is ε or has two adjacent non-
terminals.

Example:

Consider the grammar:

E → EAE | (E) | -E | id
A→+|-|*|/|↑

Since the right side EAE has three consecutive non-terminals, the grammar can be written as
follows:

E → E+E | E-E | E*E | E/E | E↑E | -E | id

Operator precedence relations:


There are three disjoint precedence relations namely
< . - less than
=- equal to
.
>- greater than
The relations give the following meaning:
a< . b – a yields precedence to b
a = b – a has the same precedence as b
a . >b – a takes precedence over b

Rules for binary operations:


1. If operator θ1 has higher precedence than operator θ2, then make
θ1 . >θ 2 and θ2 < . θ1

2. If operators θ1 and θ2, are of equal precedence, then make


θ1 . >θ 2 and θ2 . >θ 1 if operators are left associative
θ1 < . θ2 and θ2 < . θ1 if right associative

3. Make the following for all operators θ:


θ< . id , id . >θ
θ< . ( , (< . θ
) . >θ , θ . >)
θ . >$ , $< . θ

Downloaded by Piryanshu Sharma ([email protected])


lOMoARcPSD|54605199

Also make
( = ) , (< . ( , ) . >) , (< .
id , id . >) , $< .
id , id . >$ , $< .
( , ) . >$

Example:

Operator-precedence relations for the grammar

E → E+E | E-E | E*E | E/E | E↑E | (E) | -E | id is given in the following table assuming

1. ↑ is of highest precedence and right-associative


2. * and / are of next higher precedence and left-associative, and
3. + and - are of lowest precedence and left-associative
Note that theblanksin the table denote error entries.

TABLE : Operator-precedence relations


+ - * / ↑ id ( ) $
+ .
> .
> <. <. <. <. <. .
> .
>
- .
> .
> <. <. <. <. <. .
> .
>
* .
> .
> .
> .
> <. <. <. .
> .
>
/ .
> .
> .
> .
> <. <. <. .
> .
>
↑ .
> .
> .
> .
> <. <. <. .
> .
>
id .
> .
> .
> .
> .
·>
.
> .
>
( <. <. <. <. <. <. <. =
) .
> .
> .
> .
> .
> .
> .
>
$ < .
< .
< .
< .
< .
< .
< .

Operator precedence parsing algorithm:

Input :An input stringwand a table of precedence relations.


Output :Ifwis well formed, askeletalparse tree,with a placeholder non-terminal E labeling all
interior nodes; otherwise, an error indication.
Method :Initially the stack contains $ and the input buffer the stringw$. To parse, we execute
the following program :

(1) Setipto point to the first symbol ofw$;


(2)repeat forever
(3)if$ is on top of the stack andippoints to $then
(4)return
else begin
(5) letabe the topmost terminal symbol on the stack
and letbbe the symbol pointed to byip;
.
(6)ifa< bora=bthen begin
(7) pushbonto the stack;
(8) advanceipto the next input symbol;
end;

Downloaded by Piryanshu Sharma ([email protected])


lOMoARcPSD|54605199

.
(9)else ifa >bthen /*reduce*/
(10)repeat
(11) pop the stack
.
(12)untilthe top stack terminal is related by <
to the terminal most recently popped
(13)elseerror( )
end

Stack implementation of operator precedence parsing:


Operator precedence parsing uses a stack and precedence relation table for its
implementation of above algorithm. It is a shift-reduce parsing containing all four actions shift,
reduce, accept and error.
The initial configuration of an operator precedence parsing is
STACK INPUT
$ w$

where w is the input string to be parsed.

Example:

Consider the grammar E → E+E | E-E | E*E | E/E | E↑E | (E) | id. Input string isid+id*id.The
implementation is as follows:

STACK INPUT COMMENT


$ <· id+id*id $ shift id
$ id ·> +id*id $ pop the top of the stack id
$ <· +id*id $ shift +
$+ <· id*id $ shift id
$ +id ·> *id $ pop id
$+ <· *id $ shift *
$+* <· id $ shift id
$ + * id ·> $ pop id
$+* ·> $ pop *
$+ ·> $ pop +
$ $ accept

Advantages of operator precedence parsing:


1. It is easy to implement.
2. Once an operator precedence relation is made between all pairs of terminals of a grammar ,
the grammar can be ignored. The grammar is not referred anymore during implementation.

Disadvantages of operator precedence parsing:


1. It is hard to handle tokens like the minus sign (-) which has two different precedence.
2. Only a small class of grammar can be parsed using operator-precedence parser.

Downloaded by Piryanshu Sharma ([email protected])


lOMoARcPSD|54605199

LR PARSERS
An efficient bottom-up syntax analysis technique that can be used to parse a large class of
CFG is called LR(k) parsing. The ‘L’ is for left-to-right scanning of the input, the ‘R’ for
constructing a rightmost derivation in reverse, and the ‘k’ for the number of input symbols.
When ‘k’ is omitted, it is assumed to be 1.

Advantages of LR parsing:
✓ It recognizes virtually all programming language constructs for which CFG can be
written.
✓ It is an efficient non-backtracking shift-reduce parsing method.
✓ A grammar that can be parsed using LR method is a proper superset of a grammar that
can be parsed with predictive parser.
✓ It detects a syntactic error as soon as possible.

Drawbacks of LR method:
It is too much of work to construct a LR parser by hand for a programming language
grammar. A specialized tool, called a LR parser generator, is needed. Example: YACC.

Types of LR parsing method:


1. SLR- Simple LR
▪ Easiest to implement, least powerful.
2. CLR- Canonical LR
▪ Most powerful, most expensive.
3. LALR- Look-Ahead LR
▪ Intermediate in size and cost between the other two methods.

The LR parsing algorithm:

The schematic form of an LR parser is as follows:

INPUT a1 ai an $
… …

Sm LR parsing program OUTPUT


Xm
Sm-1
Xm-1
… ac琀椀on goto
S0

STACK

Downloaded by Piryanshu Sharma ([email protected])


lOMoARcPSD|54605199

It consists of : an input, an output, a stack, a driver program, and a parsing table that has two
parts (actionandgoto).

➢ The driver program is the same for all LR parser.

➢ The parsing program reads characters from an input buffer one at a time.

➢ The program uses a stack to store a string of the form s 0 X1s1X2s2…Xmsm, where sm is on
top. Each Xi is a grammar symbol and each si is a state.

➢ The parsing table consists of two parts :actionandgotofunctions.

Action: The parsing program determines s m, the state currently on top of stack, and ai, the
current input symbol. It then consultsaction[s m,ai] in the action table which can have one of four
values :

1. shift s, where s is a state,


2. reduce by a grammar production A → β,
3. accept, and
4. error.

Goto: The function goto takes a state and grammar symbol as arguments and produces a state.

LR Parsing algorithm:

Input: An input stringwand an LR parsing table with functionsactionandgotofor grammar G.

Output: Ifwis in L(G), a bottom-up-parse forw; otherwise, an error indication.

Method: Initially, the parser has s 0 on its stack, where s0 is the initial state, andw$ in the input
buffer. The parser then executes the following program :

setipto point to the first input symbol ofw$;


repeat forever begin
letsbe the state on top of the stack and
athe symbol pointed to byip;
ifaction[s,a] = shifts’then begin
pushathens’ on top of the stack;
advanceipto the next input symbol
end
else ifaction[s,a] = reduce A→βthen begin
pop 2* | β | symbols off the stack;
lets’ be the state now on top of the stack;
push A thengoto[s’, A] on top of the stack;
output the production A→ β
end
else ifaction[s,a] = acceptthen
return
elseerror( )
end

Downloaded by Piryanshu Sharma ([email protected])


lOMoARcPSD|54605199

CONSTRUCTING SLR(1) PARSING TABLE:

To perform SLR parsing, take grammar as input and do the following:


1. Find LR(0) items.
2. Completing the closure.
3. Computegoto(I,X), where, I is set of items and X is grammar symbol.

LR(O) items:
AnLR(O) itemof a grammar G is a production of G with a dot at some position of the
right side. For example, production A → XYZ yields the four items :

A →.XYZ
A → X.YZ
A → XY.Z
A → XYZ.

Closure operation:
If I is a set of items for a grammar G, then closure(I) is the set of items constructed from I
by the two rules:

1. Initially, every item in I is added to closure(I).


2. If A → a . Bβ is in closure(I) and B → y is a production, then add the item B → . y to I , if it
is not already there. We apply this rule until no more new items can be added to closure(I).

Goto operation:
Goto(I, X) is defined to be the closure of the set of all items [A→ aX . β] such that
[A→ a . Xβ] is in I.

Steps to construct SLR parsing table for grammar G are:

1. Augment G and produce G’


2. Construct the canonical collection of set of items C for G’
3. Construct the parsing action functionactionandgotousing the following algorithm that
requires FOLLOW(A) for each non-terminal of grammar.

Algorithm for construction of SLR parsing table:

Input: An augmented grammar G’


Output: The SLR parsing table functionsactionandgotofor G’
Method:
1. Construct C = {I0, I1, …. In}, the collection of sets of LR(0) items for G’.
2. Stateiis constructed from I i.. The parsing functions for stateiare determined as follows:
(a) If [A→a·aβ] is in Ii and goto(Ii,a) = Ij, then setaction[i,a] to “shift j”. Hereamust be
terminal.
(b) If [A→a·] is in Ii , then setaction[i,a] to “reduce A→a” for allain FOLLOW(A).
(c) If [S’→S.] is in Ii, then setaction[i,$] to “accept”.

If any conflicting actions are generated by the above rules, we say grammar is not SLR(1).

Downloaded by Piryanshu Sharma ([email protected])


lOMoARcPSD|54605199

3. Thegototransitions for stateiare constructed for all non-terminals A using the rule:
Ifgoto(I i,A) = Ij, thengoto[i,A] =j.
4. All entries not defined by rules (2) and (3) are made “error”
5. The initial state of the parser is the one constructed from the set of items containing
[S’→.S].

Example for SLR parsing:


Construct SLR parsing for the following grammar :
G:E→E+T|T
T→T*F|F
F → (E) | id

The given grammar is :


G : E → E + T ------ (1)
E →T ------ (2)
T → T * F ------ (3)
T→F ------ (4)
F → (E) ------ (5)
F → id ------ (6)

Step 1 :Convert given grammar into augmented grammar.


Augmented grammar :
E’ → E
E→E+T
E→T
T→T*F
T→F
F → (E)
F → id

Step 2 :Find LR (0) items.

I0 : E’ →.E
E →.E + T
E →.T
T →.T * F
T →.F
F →.(E)
F →.id

GOTO ( I0 , E) GOTO ( I4 , id )
I1 : E’ → E. I5 : F → id.
E → E.+ T

Downloaded by Piryanshu Sharma ([email protected])


lOMoARcPSD|54605199

GOTO ( I6 , T )
GOTO ( I0 , T) I9 : E → E + T.
I2 : E → T. T → T.* F
T → T.* F
GOTO ( I6 , F )
GOTO ( I0 , F) I3 : T → F.
I3 : T → F.
GOTO ( I6 , ( )
I4 : F → (.E )

GOTO ( I0 , ( ) GOTO ( I6 , id)


I4 : F → (.E) I5 : F → id.
E →.E + T
E →.T GOTO ( I7 , F )
T →.T * F I10 : T → T * F.
T →.F
F →.(E) GOTO ( I7 , ( )
F →.id I4 : F → (.E )
E →.E + T
GOTO ( I0 , id ) E →.T
I5 : F → id. T →.T * F
T →.F
GOTO ( I1 , + ) F →.(E)
I6 : E → E +.T F →.id
T →.T * F
T →.F GOTO ( I7 , id )
F →.(E) I5 : F → id.
F →.id
GOTO ( I8 , ) )
GOTO ( I2 , * ) I11 : F → ( E ).
I7 : T → T *.F
F →.(E) GOTO ( I8 , + )
F →.id I6 : E → E +.T
T →.T * F
GOTO ( I4 , E ) T →.F
I8 : F → ( E.) F →.( E )
E → E.+ T F →.id

GOTO ( I4 , T) GOTO ( I9 , *)
I2 : E →T. I7 : T → T *.F
T → T.* F F →.( E )
F →.id
GOTO ( I4 , F)
I3 : T → F.

Downloaded by Piryanshu Sharma ([email protected])


lOMoARcPSD|54605199

GOTO ( I4 , ( )
I4 : F → (.E)
E →.E + T E →.T
T →.T * F T →.F
F →.(E)
F → id

FOLLOW (E) = { $ , ) , +)
FOLLOW (T) = { $ , + , ) , * }
FOOLOW (F) = { * , + , ) , $ }

SLR parsing table:

ACTION GOTO

id + * ( ) $ E T F

IO s5 s4 1 2 3

I1 s6 ACC

I2 r2 s7 r2 r2

I3 r4 r4 r4 r4

I4 s5 s4 8 2 3

I5 r6 r6 r6 r6

I6 s5 s4 9 3

I7 s5 s4 10

I8 s6 s11

I9 r1 s7 r1 r1

I1O r3 r3 r3 r3

I11 r5 r5 r5 r5

Blank entries are error entries.

Stack implementation:

Check whether the inputid + id * idis valid or not.

Downloaded by Piryanshu Sharma ([email protected])


lOMoARcPSD|54605199

STACK INPUT ACTION

0 id + id * id $ GOTO ( I0 , id ) = s5 ;shift

0 id 5 + id * id $ GOTO ( I5 , + ) = r6 ;reduceby F→id

0F3 + id * id $ GOTO ( I0 , F ) = 3
GOTO ( I3 , + ) = r4 ;reduceby T → F

0T2 + id * id $ GOTO ( I0 , T ) = 2
GOTO ( I2 , + ) = r2 ;reduceby E → T

0E1 + id * id $ GOTO ( I0 , E ) = 1
GOTO ( I1 , + ) = s6 ;shift

0E1+6 id * id $ GOTO ( I6 , id ) = s5 ;shift

0 E 1 + 6 id 5 * id $ GOTO ( I5 , * ) = r6 ;reduceby F → id

0E1+6F3 * id $ GOTO ( I6 , F ) = 3
GOTO ( I3 , * ) = r4 ;reduceby T → F

0E1+6T9 * id $ GOTO ( I6 , T ) = 9
GOTO ( I9 , * ) = s7 ;shift

0E1+6T9*7 id $ GOTO ( I7 , id ) = s5 ;shift

0 E 1 + 6 T 9 * 7 id 5 $ GOTO ( I5 , $ ) = r6 ;reduceby F → id

0 E 1 + 6 T 9 * 7 F 10 $ GOTO ( I7 , F ) = 10
GOTO ( I10 , $ ) = r3 ;reduceby T → T * F

0E1+6T9 $ GOTO ( I6 , T ) = 9
GOTO ( I9 , $ ) = r1 ;reduceby E → E + T

0E1 $ GOTO ( I0 , E ) = 1
GOTO ( I1 , $ ) =accept

Downloaded by Piryanshu Sharma ([email protected])


lOMoARcPSD|54605199

Downloaded by Piryanshu Sharma ([email protected])


lOMoARcPSD|54605199

Downloaded by Piryanshu Sharma ([email protected])


lOMoARcPSD|54605199

MODULE-4 INTERMEDIATE CODE GENERATION

INTRODUCTION

The front end translates a source program into an intermediate representation from
which the back end generates target code.

Benefits of using a machine-independent intermediate form are:

1. Retargeting is facilitated. That is, a compiler for a different machine can be created
by attaching a back end for the new machine to an existing front end.

2. A machine-independent code optimizer can be applied to the intermediate representation.

Position of intermediate code generator

parser sta琀椀c intermediate intermediate code


checker code generator generator
code

INTERMEDIATE LANGUAGES

Three ways of intermediate representation:

 Syntax tree

 Postfix notation

 Three address code

The semantic rules for generating three-address code from common programming language
constructs are similar to those for constructing syntax trees or for generating postfix notation.

Graphical Representations:

Syntax tree:

A syntax tree depicts the natural hierarchical structure of a source program. Adag
(Directed Acyclic Graph)gives the same information but in a more compact way because
common subexpressions are identified. A syntax tree and dag for the assignment statementa : =
b * - c + b * - care as follows:

Downloaded by Piryanshu Sharma ([email protected])


lOMoARcPSD|54605199

assign assign

a + a +

* * *
b uminus b uminus b uminus

c c c

(a) Syntax tree (b) Dag

Postfix notation:

Postfix notation is a linearized representation of a syntax tree; it is a list of the nodes of


the tree in which a node appears immediately after its children. The postfix notation for the
syntax tree given above is

a b c uminus * b c uminus * + assign

Syntax-directed definition:

Syntax trees for assignment statements are produced by the syntax-directed definition.
Non-terminal S generates an assignment statement. The two binary operators + and * are
examples of the full operator set in a typical language. Operator associativities and precedences
are the usual ones, even though they have not been put into the grammar. This definition
constructs the tree from the input a : = b * - c + b* - c.

PRODUCTION SEMANTIC RULE

Sid : = E S.nptr : = mknode(‘assign’,mkleaf(id, id.place), E.nptr)

EE 1 + E2 E.nptr : = mknode(‘+’, E1.nptr, E2.nptr )

EE 1 * E2 E.nptr : = mknode(‘*’, E1.nptr, E2.nptr )

E-E 1 E.nptr : = mknode(‘uminus’, E1.nptr)

E( E 1 ) E.nptr : = E1.nptr

Eid E.nptr : = mkleaf( id, id.place )

Syntax-directed definition to produce syntax trees for assignment statements

Downloaded by Piryanshu Sharma ([email protected])


lOMoARcPSD|54605199

The tokenidhas an attributeplacethat points to the symbol-table entry for the identifier.
A symbol-table entry can be found from an attributeid.name, representing the lexeme associated
with that occurrence ofid.If the lexical analyzer holds all lexemes in a single array of
characters, then attributenamemight be the index of the first character of the lexeme.

Two representations of the syntax tree are as follows. In (a) each node is represented as a
record with a field for its operator and additional fields for pointers to its children. In (b), nodes
are allocated from an array of records and the index or position of the node serves as the pointer
to the node. All the nodes in the syntax tree can be visited by following pointers, starting from
the root at position 10.

Two representations of the syntax tree

aaaaaaaaaaaaa
assign
0 id b

1 id c
id a
2
2 uminus 1

3 * 0 2
+
4 id b

5 id c
*
*
6 uminus 5

id b id b 7 * 4 6

uminus uminus 8 + 3 7

id a
9
id c id c
assign 9 8
10

(b)
(a)

Three-Address Code:

Three-address code is a sequence of statements of the general form

x : = yopz

where x, y and z are names, constants, or compiler-generated temporaries;opstands for any


operator, such as a fixed- or floating-point arithmetic operator, or a logical operator on boolean-
valued data. Thus a source language expression like x+ y*z might be translated into a sequence

t1 : = y * z
t2 : = x + t1

where t1 and t2 are compiler-generated temporary names.

Downloaded by Piryanshu Sharma ([email protected])


lOMoARcPSD|54605199

Advantages of three-address code:

 The unraveling of complicated arithmetic expressions and of nested flow-of-control


statements makes three-address code desirable for target code generation and
optimization.

 The use of names for the intermediate values computed by a program allows three-
address code to be easily rearranged – unlike postfix notation.

Three-address code is a linearized representation of a syntax tree or a dag in which


explicit names correspond to the interior nodes of the graph. The syntax tree and dag are
represented by the three-address code sequences. Variable names can appear directly in three-
address statements.

Three-address code corresponding to the syntax tree and dag given above

t1 : = - c t1 : = -c

t2 : = b * t1 t2 : = b * t1

t3 : = - c t5 : = t2 + t2

t4 : = b * t3 a : = t5

t5 : = t2 + t4

a : = t5

(a) Code for the syntax tree (b) Code for the dag

The reason for the term “three-address code” is that each statement usually contains three
addresses, two for the operands and one for the result.

Types of Three-Address Statements:

The common three-address statements are:

1. Assignment statements of the formx : = yopz, whereopis a binary arithmetic or logical


operation.

2. Assignment instructions of the formx : =opy, whereopis a unary operation. Essential unary
operations include unary minus, logical negation, shift operators, and conversion operators
that, for example, convert a fixed-point number to a floating-point number.

3.Copy statementsof the formx : = ywhere the value ofyis assigned tox.

4. The unconditional jump goto L. The three-address statement with label L is the next to be
executed.

5. Conditional jumps such asifx relop ygoto L. This instruction applies a relational operator (
<, =, >=, etc. ) toxandy, and executes the statement with label L next ifxstands in relation

Downloaded by Piryanshu Sharma ([email protected])


lOMoARcPSD|54605199

relop to y. If not, the three-address statement following ifx relop ygoto L is executed next,
as in the usual sequence.

6.param xandcall p, nfor procedure calls andreturn y, where y representing a returned value
is optional. For example,
param x1
param x2
...
param xn
call p,n
generated as part of a call of the procedure p(x1, x2, …. ,xn ).

7. Indexed assignments of the form x : = y[i] and x[i] : = y.

8. Address and pointer assignments of the form x : = &y , x : = *y, and *x : = y.

Syntax-Directed Translation into Three-Address Code:

When three-address code is generated, temporary names are made up for the interior
nodes of a syntax tree. For example,id : =Econsists of code to evaluateEinto some temporary
t, followed by the assignmentid.place: =t.

Given input a : = b * - c + b * - c, the three-address code is as shown above. The


synthesized attributeS.coderepresents the three-address code for the assignmentS.
The nonterminalEhas two attributes :
1.E.place, the name that will hold the value ofE, and
2.E. code, the sequence of three-address statements evaluatingE.

Syntax-directed definition to produce three-address code for assignments

PRODUCTION SEMANTIC RULES

S  id : = S.code : = E.code||gen(id.place ‘:=’ E.place)

E E  E1 + E.place := newtemp;
E.code := E1.code||E 2.code||gen(E.place ‘:=’ E 1.place ‘+’ E2.place)
E2
E.place := newtemp;
E.code := E1.code || E2.code || gen(E.place ‘:=’ E1.place ‘*’ E2.place)
E  E1 *
E.place := newtemp;
E.code := E1.code || gen(E.place ‘:=’ ‘uminus’ E1.place)
E2 E  - E1
E.place : =
E1.place; E.code :
E  ( E1 )
= E1.code

E.place : =
E  id
id.place; E.code :
=‘‘

Downloaded by Piryanshu Sharma ([email protected])


lOMoARcPSD|54605199

Semantic rules generating code for a while statement

S.begin:

E.code

if E.place = 0 goto S.after

S1.code

goto S.begin

S.after:. . .

PRODUCTION SEMANTIC RULES

S whileEdoS 1 S.begin :=
newlabel; S.after :=
newlabel;
S.code := gen(S.begin ‘:’)
|| E.code ||
gen ( ‘if’ E.place ‘=’ ‘0’ ‘goto’
S.after)|| S1.code ||
gen ( ‘goto’ S.begin)
|| gen ( S.after ‘:’)

 The functionnewtempreturns a sequence of distinct names t 1,t2,….. in response to


successive calls.
 Notationgen(x ‘:=’ y ‘+’ z)is used to represent three-address statement x := y + z.
Expressions appearing instead of variables likex, yandzare evaluated when passed to
gen, and quoted operators or operand, like ‘+’ are taken literally.
 Flow-of–control statements can be added to the language of assignments. The code forS
whileEdoS 1 is generated using new attributesS.beginandS.afterto mark the first
statement in the code forEand the statement following the code for S, respectively.
 The functionnewlabelreturns a new label every time it is called.
 We assume that a non-zero expression represents true; that is when the value ofE
becomes zero, control leaves the while statement.

Implementation of Three-Address Statements:

A three-address statement is an abstract form of intermediate code. In a compiler,


these statements can be implemented as records with fields for the operator and the operands.
Three such representations are:

Downloaded by Piryanshu Sharma ([email protected])


lOMoARcPSD|54605199

 Quadruples

 Triples

 Indirect triples

Quadruples:

 A quadruple is a record structure with four fields, which are,op, arg1, arg2andresult.

 Theopfield contains an internal code for the operator. The three-address statementx : =
y op zis represented by placingyinarg1,zinarg2andxinresult.

 The contents of fields arg1, arg2 and result are normally pointers to the symbol-table
entries for the names represented by these fields. If so, temporary names must be entered
into the symbol table as they are created.

Triples:

 To avoid entering temporary names into the symbol table, we might refer to a temporary
value by the position of the statement that computes it.

 If we do so, three-address statements can be represented by records with only three fields:
op, arg1andarg2.

 The fieldsarg1andarg2, for the arguments ofop, are either pointers to the symbol table
or pointers into the triple structure ( for temporary values ).

 Since three fields are used, this intermediate code format is known astriples.

op arg1 arg2 result op arg1 arg2

(0) uminus c t1 (0) uminus c

(1) * b t1 t2 (1) * b (0)

(2) uminus c t3 (2) uminus c

(3) * b t3 t4 (3) * b (2)

(4) + t2 t4 t5 (4) + (1) (3)

(5) := t3 a (5) assign a (4)

(a) Quadruples (b) Triples

Quadruple and triple representation of three-address statements given above

Downloaded by Piryanshu Sharma ([email protected])


lOMoARcPSD|54605199

A ternary operation like x[i] : = y requires two entries in the triple structure as shown as below
while x : = y[i] is naturally represented as two operations.

op arg1 arg2 op arg1 arg2

(0) []= x i (0) =[] y i

(1) assign (0) y (1) assign x (0)

(a) x[i] : = y (b) x : = y[i]

Indirect Triples:

 Another implementation of three-address code is that of listing pointers to triples,


rather than listing the triples themselves. This implementation is called indirect
triples.

 For example, let us use an array statement to list pointers to triples in the desired
order. Then the triples shown above might be represented as follows:

statement op arg1 arg2

(0) (14) (14) uminus c


(1) (15) (15) * b (14)
(2) (16) (16) uminus c
(3) (17) (17) * b (16)
(4) (18) (18) + (15) (17)
(5) (19) (19) assign a (18)

Indirect triples representation of three-address statements

DECLARATIONS

As the sequence of declarations in a procedure or block is examined, we can lay out


storage for names local to the procedure. For each local name, we create a symbol-table entry
with information like the type and the relative address of the storage for the name. The relative
address consists of an offset from the base of the static data area or the field for local data in an
activation record.

Downloaded by Piryanshu Sharma ([email protected])


lOMoARcPSD|54605199

Declarations in a Procedure:
The syntax of languages such as C, Pascal and Fortran, allows all the declarations in a
single procedure to be processed as a group. In this case, a global variable, sayoffset, can
keep track of the next available relative address.

In the translation scheme shown below:

 Nonterminal P generates a sequence of declarations of the formid :T.

 Before the first declaration is considered,offsetis set to 0. As each new name is seen ,
that name is entered in the symbol table with offset equal to the current value ofoffset,
andoffsetis incremented by the width of the data object denoted by that name.

 The procedureenter( name, type, offset) creates a symbol-table entry forname, gives its
typetypeand relative addressoffsetin its data area.

 Attributetyperepresents a type expression constructed from the basic typesintegerand


realby applying the type constructorspointerandarray. If type expressions are
represented by graphs, then attributetypemight be a pointer to the node representing a
type expression.

 The width of an array is obtained by multiplying the width of each element by the
number of elements in the array. The width of each pointer is assumed to be 4.

Computing the types and relative addresses of declared names

P D { offset : = 0 }

DD;D

D  id : T { enter(id.name, T.type,
offset); offset : = offset +
T.width }

T  integer { T.type : =
integer; T.width :
=4}

T  real { T.type : = real;


T.width : = 8 }

T  array [ num ] of T1 { T.type : = array(num.val,


T1.type); T.width : = num.val X
T1.width }

T  ↑T 1 { T.type : = pointer
( T1.type); T.width : = 4 }

Downloaded by Piryanshu Sharma ([email protected])


lOMoARcPSD|54605199

Keeping Track of Scope Information:

When a nested procedure is seen, processing of declarations in the enclosing procedure is


temporarily suspended. This approach will be illustrated by adding semantic rules to the
following language:

PD

D  D ; D |id: T |proc id; D ; S

One possible implementation of a symbol table is a linked list of entries for names.

A new symbol table is created when a procedure declarationD  proc idD 1;Sis seen,
and entries for the declarations in D1 are created in the new table. The new table points back to
the symbol table of the enclosing procedure; the name represented by id itself is local to the
enclosing procedure. The only change from the treatment of variable declarations is that the
procedureenteris told which symbol table to make an entry in.

For example, consider the symbol tables for proceduresreadarray, exchange, and
quicksortpointing back to that for the containing proceduresort, consisting of the entire
program. Sincepartitionis declared withinquicksort, its table points to that ofquicksort.

Symbol tables for nested procedures

sort
nil header
a
x
readarray
to
exchange readarray
quicksort to
exchange
readarra exchang quicksort
hd y hd e hd
k
v
par琀椀琀椀on

par琀椀琀椀o
hd n
i
j

Downloaded by Piryanshu Sharma ([email protected])


lOMoARcPSD|54605199

The semantic rules are defined in terms of the following operations:

1. mktable(previous)creates a new symbol table and returns a pointer to the new table. The
argumentpreviouspoints to a previously created symbol table, presumably that for the
enclosing procedure.

2. enter(table, name, type, offset)creates a new entry for namenamein the symbol table pointed
to bytable.Again,enterplaces typetypeand relative addressoffsetin fields within the entry.

3. addwidth(table, width)records the cumulative width of all the entries in table in the header
associated with this symbol table.

4. enterproc(table, name, newtable)creates a new entry for procedurenamein the symbol table
pointed to bytable. The argumentnewtablepoints to the symbol table for this procedure
name.

Syntax directed translation scheme for nested procedures

PMD { addwidth ( top( tblptr) , top


(offset)); pop (tblptr); pop (offset) }

Mɛ { t : = mktable (nil);


push (t,tblptr); push (0,offset) }

D  D1 ; D2

D  proc id ; N D1 ; S { t : = top (tblptr);


addwidth ( t, top
(offset)); pop (tblptr);
pop (offset);
enterproc (top (tblptr), id.name, t) }

D  id : T { enter (top (tblptr), id.name, T.type, top


(offset)); top (offset) := top (offset) + T.width }

Nɛ { t := mktable (top (tblptr));


push (t, tblptr); push (0,offset) }

 The stacktblptris used to contain pointers to the tables forsort, quicksort,andpartition


when the declarations inpartitionare considered.

 The top element of stackoffsetis the next available relative address for a local of
the current procedure.

 All semantic actions in the subtrees for B and C

in ABC {action A}

are done beforeaction A at the end of the production occurs. Hence, the action associated
with the marker M is the first to be done.
Downloaded by Piryanshu Sharma ([email protected])
lOMoARcPSD|54605199

 The action for nonterminal M initializes stacktblptrwith a symbol table for the
outermost scope, created by operationmktable(nil).The action also pushes relative
address 0 onto stack offset.

 Similarly, the nonterminal N uses the operationmktable(top(tblptr))to create a new


symbol table. The argumenttop(tblptr)gives the enclosing scope for the new table.

 For each variable declarationid:T, an entry is created foridin the current symbol table.
The top of stack offset is incremented by T.width.

 When the action on the right side ofD  proc id; ND1; Soccurs, the width of all
declarations generated by D1 is on the top of stack offset; it is recorded
usingaddwidth. Stackstblptrandoffsetare then popped.
At this point, the name of the enclosed procedure is entered into the symbol table of its
enclosing procedure.

ASSIGNMENT STATEMENTS

Suppose that the context in which an assignment appears is given by the following grammar.

PM D

Mɛ

DD ; D |id: T |proc id; N D ; S

Nɛ

Nonterminal P becomes the new start symbol when these productions are added to those in the
translation scheme shown below.

Translation scheme to produce three-address code for assignments

Sid : = E { p : = lookup (id.name);


ifp≠nilthen
emit( p ‘ : =’ E.place)
elseerror }

EE 1 + E2 { E.place : = newtemp;


emit( E.place ‘: =’ E1.place ‘ + ‘ E2.place ) }

EE 1 * E2 { E.place : = newtemp;


emit( E.place ‘: =’ E1.place ‘ * ‘ E2.place ) }

E-E 1 { E.place : = newtemp;


emit ( E.place ‘: =’ ‘uminus’ E1.place ) }

E( E 1 ) { E.place : = E1.place }

Downloaded by Piryanshu Sharma ([email protected])


lOMoARcPSD|54605199

Eid { p : = lookup (id.name);

ifp≠nilthen
E.place : = p
elseerror }

Reusing Temporary Names

 The temporaries used to hold intermediate values in expression calculations tend to


clutter up the symbol table, and space has to be allocated to hold their values.

 Temporaries can be reused by changingnewtemp. The code generated by the rules for E
E 1 + E2 has the general form:

evaluate E1 into t1
evaluate E2 into t2
t : = t1 + t2

 The lifetimes of these temporaries are nested like matching pairs of balanced parentheses.

 Keep a count c , initialized to zero. Whenever a temporary name is used as an operand,


decrement c by 1. Whenever a new temporary name is generated, use $c and increase c
by 1.

 For example, consider the assignment x := a * b + c * d – e * f

Three-address code with stack temporaries

statement value of c

0
$0 := a * b 1
$1 := c * d 2
$0 := $0 + $1 1
$1 := e * f 2
$0 := $0 - $1 1
x := $0 0

Addressing Array Elements:

Elements of an array can be accessed quickly if the elements are stored in a block of
consecutive locations. If the width of each array element isw, then theith element of array A
begins in location

base + ( i – low )xw

wherelowis the lower bound on the subscript andbaseis the relative address of the storage
allocated for the array. That is,baseis the relative address of A[low].

Downloaded by Piryanshu Sharma ([email protected])


lOMoARcPSD|54605199

The expression can be partially evaluated at compile time if it is rewritten as

ixw+ (base–lowxw)

The subexpressionc = base – lowxwcan be evaluated when the declaration of the array is seen.
We assume that c is saved in the symbol table entry for A , so the relative address of A[i] is
obtained by simply addingixwtoc.

Address calculation of multi-dimensional arrays:

A two-dimensional array is stored in of the two forms :

 Row-major (row-by-row)

 Column-major (column-by-column)

Layouts for a 2 x 3 array

A[ 1 1 ] A[11]
first column
first row A[ 1,2 ] A[21]
A[ 1 3 ] A [ 1,2 ]
A[ 2,1 ] A [ 2,2 ] second column
second row A[ 2 2 ] A[13]
A[ 2,3 ] A [ 2,3 ] third column

(a) ROW-MAJOR (b) COLUMN-MAJOR

In the case of row-major form, the relative address of A[ i1 , i2] can be calculated by the formula

base + ((i1 – low1)xn 2 + i2 – low2)xw

where,low 1 andlow 2 are the lower bounds on the values ofi 1 andi 2 andn 2 is the number of
values thati 2 can take. That is, ifhigh 2 is the upper bound on the value ofi 2, thenn 2 = high2 –
low2 +1.

Assuming that i1 and i2 are the only values that are known at compile time, we can rewrite the
above expression as

(( i1 xn 2 ) + i2 )xw + ( base – (( low 1 xn 2 ) + low2 )xw)

Generalized formula:

The expression generalizes to the following expression for the relative address of A[i1,i2,…,ik]

(( . . . (( i1n2 + i2 ) n3 + i3) . . . ) nk + ik )xw + base – (( . . .((low 1n2 + low2)n3 + low3) . . .)


nk + lowk)xw

for all j, nj = highj – lowj + 1

Downloaded by Piryanshu Sharma ([email protected])


lOMoARcPSD|54605199

The Translation Scheme for Addressing Array Elements :

Semantic actions will be added to the grammar :

(1) S  L:=E
(2) E  E+E
(3) E  (E)
(4) E  L
(5) LElist]
(6) L  id
(7) Elist  Elist , E
(8) Elist  id[E

We generate a normal assignment ifLis a simple name, and an indexed assignment into the
location denoted byLotherwise :

(1) S  L : = E{ifL.offset =null then/ * L is a simpleid*/


emit(L.place ‘: =’ E.place) ;
else
emit(L.place ‘[‘L.offset ‘]’ ‘: =’ E.place) }

(2) E  E1 + E2 {E.place : = newtemp;


emit(E.place ‘: =’ E 1.place ‘ +’ E2.place) }

(3) E  ( E1 ){E.place : = E 1.place}

When an array referenceLis reduced toE, we want ther-value ofL. Therefore we use indexing
to obtain the contents of the locationL.place[L.offset] :

(4) E  L{ifL.offset =null then/* L is a simpleid* /


E.place : = L.place
else begin
E.place : = newtemp;
emit(E.place ‘: =’ L.place ‘[‘L.offset‘]’)
end}

(5) L  Elist] {L.place : = newtemp;


L.offset : =
newtemp;
emit(L.place ‘: =’ c( Elist.array ));
emit(L.offset ‘: =’ Elist.place ‘*’ width (Elist.array)) }

(6) L  id{L.place :=id.place;


L.offset :=null}

(7) Elist  Elist1 , E{t := newtemp;


m : = Elist1.ndim +1;
emit(t ‘: =’ Elist 1.place ‘*’ limit(Elist 1.array,m));
emit(t ‘: =’ t ‘+’ E.place);
Elist.array : = Elist1.array;
Downloaded by Piryanshu Sharma ([email protected])
lOMoARcPSD|54605199

Elist.place : = t;
Elist.ndim : =
m}

(8) Elist  id[E{Elist.array : =id.place;

Elist.place : =
E.place; Elist.ndim :
=1 }

Type conversion within Assignments :

Consider the grammar for assignment statements as above, but suppose there are two
types – real and integer , with integers converted to reals when necessary. We have another
attributeE.type, whose value is eitherrealorinteger. The semantic rule forE.typeassociated
with the productionE  E + Eis :

E  E + E{E.type: =
ifE 1.type = integerand
E2.type = integertheninteger
elsereal}

The entire semantic rule forE  E + Eand most of the other productions must be
modified to generate, when necessary, three-address statements of the form x : = inttoreal y,
whose effect is to convert integer y to a real of equal value, called x.

Semantic action forE  E1 + E2

E.place := newtemp;
ifE 1.type = integerandE 2.type = integerthen begin
emit( E.place ‘: =’ E1.place ‘int +’ E2.place);
E.type : = integer
end
else ifE 1.type = realandE 2.type = realthen begin
emit( E.place ‘: =’ E1.place‘real +’E
2.place); E.type : = real

end
else ifE 1.type = integerandE 2.type = realthen begin
u : = newtemp;
emit( u ‘: =’‘inttoreal’E 1.place);
emit( E.place ‘: =’ u‘ real +’E 2.place);
E.type : = real
end
else ifE 1.type = realandE 2.type =integerthen begin
u : = newtemp;
emit( u ‘: =’‘inttoreal’E 2.place);
emit( E.place ‘: =’ E1.place ‘real +’
u); E.type : = real
end
else
Downloaded by Piryanshu Sharma ([email protected])
lOMoARcPSD|54605199

E.type : = type_error;

Downloaded by Piryanshu Sharma ([email protected])


lOMoARcPSD|54605199

For example, for the input x : = y + i * j


assumingxandyhave typereal, and i and j have typeinteger, the output would look like

t1 : = i int* j
t3 : = inttoreal t1
t2 : = y real+ t3
x : = t2

BOOLEAN EXPRESSIONS

Boolean expressions have two primary purposes. They are used to compute logical
values, but more often they are used as conditional expressions in statements that alter the flow
of control, such as if-then-else, or while-do statements.

Boolean expressions are composed of the boolean operators (and, or,andnot) applied
to elements that are boolean variables or relational expressions. Relational expressions are of the
formE 1 relopE 2, where E1 and E2 are arithmetic expressions.

Here we consider boolean expressions generated by the following grammar :

EEorE | EandE |notE | ( E ) |id relop id | true | false

Methods of Translating Boolean Expressions:

There are two principal methods of representing the value of a boolean expression. They are :

 To encode true and falsenumericallyand to evaluate a boolean expression analogously


to an arithmetic expression. Often, 1 is used to denote true and 0 to denote false.

 To implement boolean expressions byflow of control, that is, representing the value of a
boolean expression by a position reached in a program. This method is particularly
convenient in implementing the boolean expressions in flow-of-control statements, such
as the if-then and while-do statements.

Numerical Representation

Here, 1 denotes true and 0 denotes false. Expressions will be evaluated completely from
left to right, in a manner similar to arithmetic expressions.

For example :

 The translation for


aorband notc
is the three-address sequence
t1 : =notc
t2 : = bandt 1
t3 : = aort 2

 A relational expression such as a < b is equivalent to the conditional


statement if a < b then 1 else 0

Downloaded by Piryanshu Sharma ([email protected])


lOMoARcPSD|54605199

which can be translated into the three-address code sequence (again, we arbitrarily start
statement numbers at 100) :

100 : if a < b goto


103 101 : t : = 0
102 : goto 104
103 : t : = 1
104 :

Translation scheme using a numerical representation for booleans

E  E1 orE 2 { E.place : = newtemp;


emit( E.place ‘: =’ E1.place ‘or’ E2.place ) }
E  E1 andE 2 { E.place : = newtemp;
emit( E.place ‘: =’ E1.place ‘and’ E2.place ) }
E  notE 1 { E.place : = newtemp;
emit( E.place ‘: =’ ‘not’ E1.place ) }
E  ( E1 ) { E.place : = E1.place }
E  id1 relop id2 { E.place : = newtemp;
emit( ‘if’id 1.placerelop.opid 2.place ‘goto’ nextstat
+3); emit( E.place ‘: =’ ‘0’ );
emit(‘goto’nextstat +2);
emit( E.place ‘: =’ ‘1’)
}
E  true{ E.place : = newtemp;
emit( E.place ‘: =’ ‘1’)
} E false{ E.place : = newtemp;
emit( E.place ‘: =’ ‘0’) }

Short-Circuit Code:

We can also translate a boolean expression into three-address code without generating
code for any of the boolean operators and without having the code necessarily evaluate the entire
expression. This style of evaluation is sometimes called “short-circuit” or “jumping” code. It is
possible to evaluate boolean expressions without generating code for the boolean operatorsand,
or,andnotif we represent the value of an expression by a position in the code sequence.

Translation of a < b or c < d and e < f

100 : if a < b goto 103 107 : t2 : = 1

101 : t1 : = 0 108 : if e < f goto 111

102 : goto 104 109 : t3 : = 0

103 : t1 : = 1 110 : goto 112

104 : if c < d goto 107 111 : t3 : = 1

105 : t2 : = 0 112 : t4 : = t2 and t3

106 : goto 108 113 : t5 : = t1 or t4


Downloaded by Piryanshu Sharma ([email protected])
lOMoARcPSD|54605199

Flow-of-Control Statements

We now consider the translation of boolean expressions into three-address code in the
context of if-then, if-then-else, and while-do statements such as those generated by the following
grammar:

SifEthenS 1
|ifEthenS 1 elseS 2

| whileEdoS 1

In each of these productions,Eis the Boolean expression to be translated. In the translation, we


assume that a three-address statement can be symbolically labeled, and that the function
newlabelreturns a new symbolic label each time it is called.

 E.true is the label to which control flows if E is true, and E.false is the label to which
control flows if E is false.

 The semantic rules for translating a flow-of-control statement S allow control to flow
from the translation S.code to the three-address instruction immediately following
S.code.

 S.next is a label that is attached to the first three-address instruction to be executed after
the code for S.

Code for if-then , if-then-else, and while-do statements

toE.true
E.code
toE.fals
E.code S1.code
e

S1.code toE.true E.true:


gotoS.next
E.true: toE.false S2.code

E.false:

E.false: ... ...

S.next:

(a) if-then (b) if-then-else

S.begin: E.code (c) while-do

E.true: S1.code

gotoS.begin
E.false:. .
.
Downloaded by Piryanshu Sharma ([email protected])
lOMoARcPSD|54605199

Downloaded by Piryanshu Sharma ([email protected])


lOMoARcPSD|54605199

Syntax-directed definition for flow-of-control statements

PRODUCTION SEMANTIC RULES

SifEthenS 1 E.true : =
newlabel; E.false :
= S.next; S1.next :
= S.next;
S.code : = E.code || gen(E.true ‘:’) || S1.code
SifEthenS 1 elseS 2
E.true : = newlabel;
E.false : =
newlabel; S1.next :
= S.next; S2.next : =
S.next;
S.code : = E.code || gen(E.true ‘:’) || S1.code ||
gen(‘goto’ S.next) ||
SwhileEdoS 1
gen( E.false ‘:’) || S2.code

S.begin : =
newlabel; E.true : =
newlabel; E.false : =
S.next; S1.next : =
S.begin;
S.code : = gen(S.begin ‘:’)|| E.code
|| gen(E.true ‘:’) || S1.code
|| gen(‘goto’ S.begin)

Control-Flow Translation of Boolean Expressions:

Syntax-directed definition to produce three-address code for booleans

PRODUCTION SEMANTIC RULES

EE 1 orE 2 E1.true : = E.true;


E1.false : =
newlabel; E2.true : =
E.true; E2.false : =
E.false;
E.code : = E1.code || gen(E1.false ‘:’) || E2.code
EE 1 andE 2 E.true : =
newlabel; E1.false :
= E.false; E2.true :
= E.true; E2.false :
= E.false;
E.code : = E1.code || gen(E1.true ‘:’) || E2.code
EnotE 1 E1.true : =
E.false; E1.false :
= E.true; E.code :
Downloaded by Piryanshu Sharma ([email protected])
lOMoARcPSD|54605199

= E1.code
E( E1 ) E1.true : = E.true;

Downloaded by Piryanshu Sharma ([email protected])


lOMoARcPSD|54605199

E1.false : =
E.false; E.code : =
E1.code
Eid 1 relop id2
E.code : = gen(‘if’id 1.placerelop.opid 2.place
‘goto’E.true) || gen(‘goto’E.false)
Etrue
E.code : = gen(‘goto’E.true)
Efalse
E.code : = gen(‘goto’ E.false)

CASE STATEMENTS

The “switch” or “case” statement is available in a variety of languages. The switch-statement


syntax is as shown below :
Switch-statement syntax

switchexpressio
n begin
casevalue:statemen
t
casevalue:statemen
t
...
casevalue:statemen
t default :statement
end

There is a selector expression, which is to be evaluated, followed bynconstant values


that the expression might take, including a default “value” which always matches the expression
if no other value does. The intended translation of a switch is code to:

1. Evaluate the expression.


2. Find which value in the list of cases is the same as the value of the expression.
3. Execute the statement associated with the value found.

Step (2) can be implemented in one of several ways :

 By a sequence of conditionalgotostatements, if the number of cases is small.


 By creating a table of pairs, with each pair consisting of a value and a label for the code
of the corresponding statement. Compiler generates a loop to compare the value of the
expression with each value in the table. If no match is found, the default (last) entry is
sure to match.
 If the number of cases s large, it is efficient to construct a hash table.
 There is a common special case in which an efficient implementation of the n-way branch
exists. If the values all lie in some small range, say imin to imax, and the number of
different values is a reasonable fraction of imax - imin, then we can construct an array of
labels, with the label of the statement for value j in the entry of the table with offset j -
Downloaded by Piryanshu Sharma ([email protected])
lOMoARcPSD|54605199

imin and the label for the default in entries not filled otherwise. To perform switch,

Downloaded by Piryanshu Sharma ([email protected])


lOMoARcPSD|54605199

evaluate the expression to obtain the value ofj, check the value is within range and
transfer to the table entry at offset j-imin .

Syntax-Directed Translation of Case Statements:

Consider the following switch statement:

switchE
begin
caseV 1 :S 1

caseV 2 :S 2

...
caseV n-1 :S n-1

default :S n

end

This case statement is translated into intermediate code that has the following form :

Translation of a case statement

code to evaluateEinto t
goto test
L1 : code forS 1
goto next
L2 : code forS 2
goto next
. . .
Ln-1 : code forS n-1
goto next
Ln : code forS n
goto next
test : if t =V 1 goto L1
if t =V 2 goto L2
. . .
if t =V n-1 goto Ln-1
goto Ln
next :

To translate into above form :

 When keywordswitchis seen, two new labelstestandnext,and a new temporarytare


generated.

 As expressionEis parsed, the code to evaluateEintotis generated. After processingE,


the jumpgoto testis generated.

 As eachcasekeyword occurs, a new label L is created and entered into the symbol table.
i

A pointer to this symbol-table entry and the valueV i of case constant are placed on a
stack (used only to store cases).

Downloaded by Piryanshu Sharma ([email protected])


lOMoARcPSD|54605199

 Each statementcaseV i: Si is processed by emitting the newly created label Li, followed
by the code forS i , followed by the jumpgoto next.

 Then when the keywordendterminating the body of the switch is found, the code can be
generated for the n-way branch. Reading the pointer-value pairs on the case stack from
the bottom to the top, we can generate a sequence of three-address statements of the form

caseV 1 L1
caseV 2 L2
...
caseV n-1 Ln-1
case t Ln
label next

where t is the name holding the value of the selector expressionE, and L n is the label for
the default statement.

BACKPATCHING

The easiest way to implement the syntax-directed definitions for boolean expressions is
to use two passes. First, construct a syntax tree for the input, and then walk the tree in depth-first
order, computing the translations. The main problem with generating code for boolean
expressions and flow-of-control statements in a single pass is that during one single pass we may
not know the labels that control must go to at the time the jump statements are generated. Hence,
a series of branching statements with the targets of the jumps left unspecified is generated. Each
statement will be put on a list of goto statements whose labels will be filled in when the proper
label can be determined. We call this subsequent filling in of labelsbackpatching.

To manipulate lists of labels, we use three functions :

1. makelist(i) creates a new list containing onlyi, an index into the array of quadruples;
makelistreturns a pointer to the list it has made.
2. merge(p 1,p2) concatenates the lists pointed to byp 1 andp 2,and returns a pointer to the
concatenated list.
3. backpatch(p,i)inserts i as the target label for each of the statements on the list pointed to
byp.

Boolean Expressions:

We now construct a translation scheme suitable for producing quadruples for boolean
expressions during bottom-up parsing. The grammar we use is the following:

(1) EE 1 orM E 2

(2) |E 1 andM E 2
(3) |notE 1
(4) | (E 1)
(5) |id 1 relop id2
(6) |true
(7) |false
(8)Mɛ

Downloaded by Piryanshu Sharma ([email protected])


lOMoARcPSD|54605199

Synthesized attributestruelistandfalselistof nonterminalEare used to generate jumping code


for boolean expressions. Incomplete jumps with unfilled labels are placed on lists pointed to by
E.truelistandE.falselist.

Consider productionEE 1 andM E 2.IfE 1 is false, thenEis also false, so the statements on
E1.falselistbecome part ofE.falselist. IfE 1 is true, then we must next testE 2, so the target for the
statementsE 1.truelistmust be the beginning of the code generated forE 2. This target is obtained
using marker nonterminalM.

AttributeM.quadrecords the number of the first statement ofE 2.code. With the production M
ɛwe associate the semantic action

{M.quad : = nextquad}

The variablenextquadholds the index of the next quadruple to follow. This value will be
backpatched onto theE 1.truelistwhen we have seen the remainder of the productionE  E1 and
M E2. The translation scheme is as follows:

(1) EE 1 orM E 2 {backpatch(E 1.falselist,M.quad);


E.truelist: =merge(E 1.truelist, E2.truelist);
E.falselist: =E 2.falselist}

(2) EE 1 andM E 2 {backpatch(E 1.truelist, M.quad);


E.truelist: =E 2.truelist;
E.falselist: =merge(E 1.falselist, E2.falselist) }

(3) EnotE 1 {E.truelist: =E 1.falselist;


E.falselist: =E 1.truelist; }

(4) E(E 1 ) {E.truelist: =E 1.truelist;


E.falselist: =E 1.falselist; }

(5) Eid 1 relop id2 {E.truelist: =makelist(nextquad);


E.falselist: =makelist(nextquad + 1);
emit(‘if’id 1.placerelop.opid 2.place ‘goto_’)
emit(‘goto_’) }

(6)Etrue{E.truelist: =makelist(nextquad);
emit(‘goto_’) }

(7)Efalse{E.falselist: =makelist(nextquad);
emit(‘goto_’) }

(8)Mɛ{M.quad: =nextquad}

Downloaded by Piryanshu Sharma ([email protected])


lOMoARcPSD|54605199

Flow-of-Control Statements:

A translation scheme is developed for statements generated by the following grammar :

(1)SifEthenS
(2) |ifEthenSelseS
(3) |whileEdoS
(4) |beginLend
(5) |A
(6) LL ; S
(7) |S

HereSdenotes a statement,La statement list,Aan assignment statement, and E a boolean


expression. We make the tacit assumption that the code that follows a given statement in
execution also follows it physically in the quadruple array. Else, an explicit jump must be
provided.

Scheme to implement the Translation:

The nonterminal E has two attributesE.truelistandE.falselist.LandSalso need a list of


unfilled quadruples that must eventually be completed by backpatching. These lists are pointed
to by the attributesL..nextlistandS.nextlist.S.nextlistis a pointer to a list of all conditional and
unconditional jumps to the quadruple following the statement S in execution order, andL.nextlist
is defined similarly.

The semantic rules for the revised grammar are as follows:

(1)SifEthenM 1 S1 NelseM 2 S2

{backpatch(E.truelist,M 1.quad);
backpatch(E.falselist,M 2.quad);
S.nextlist: =merge(S 1.nextlist,merge(N.nextlist,S 2.nextlist)) }

We backpatch the jumps whenEis true to the quadrupleM 1.quad, which is the beginning of the
code for S1. Similarly, we backpatch jumps whenEis false to go to the beginning of the code for
S2. The listS.nextlistincludes all jumps out of S 1 and S2, as well as the jump generated byN.

(2)Nɛ{N.nextlist: =makelist(nextquad);
emit(‘goto_’) }

(3)Mɛ{M.quad: =nextquad}

(4) SifEthenM S 1 {backpatch(E.truelist,M.quad);


S.nextlist: =merge(E.falselist,S 1.nextlist) }

(5) SwhileM 1 EdoM 2 S1 {backpatch(S 1.nextlist,M 1.quad);


backpatch(E.truelist,M 2.quad);
S.nextlist: =E.falselist
emit( ‘goto’M 1.quad) }

(6)SbeginLend{S.nextlist: =L.nextlist}

Downloaded by Piryanshu Sharma ([email protected])


lOMoARcPSD|54605199

(7)SA{S.nextlist: =nil}

The assignmentS.nextlist: =nilinitializesS.nextlistto an empty list.

(8)LL1;M S{backpatch(L .nextlist,M.quad); 1

L.nextlist: =S.nextlist}

The statement followingL 1 in order of execution is the beginning ofS. Thus theL1.nextlistlist is
backpatched to the beginning of the code forS, which is given byM.quad.

(9)LS{L.nextlist: =S.nextlist}

PROCEDURE CALLS

The procedure is such an important and frequently used programming construct that it is
imperative for a compiler to generate good code for procedure calls and returns. The run-time
routines that handle procedure argument passing, calls and returns are part of the run-time
support package.

Let us consider a grammar for a simple procedure call

statement (1)S  call id(Elist)


(2) Elist  Elist , E
(3) Elist  E

Calling Sequences:

The translation for a call includes a calling sequence, a sequence of actions taken on
entry to and exit from each procedure. The falling are the actions that take place in a calling
sequence :

 When a procedure call occurs, space must be allocated for the activation record of the
called procedure.

 The arguments of the called procedure must be evaluated and made available to the
called procedure in a known place.

 Environment pointers must be established to enable the called procedure to access data in
enclosing blocks.

 The state of the calling procedure must be saved so it can resume execution after the call.

 Also saved in a known place is the return address, the location to which the called
routine must transfer after it is finished.

 Finally a jump to the beginning of the code for the called procedure must be generated.

For example, consider the following syntax-directed

translation (1)S  call id(Elist)


{foreach itemponqueuedo
emit(‘param’ p);
Downloaded by Piryanshu Sharma ([email protected])
lOMoARcPSD|54605199

emit(‘call’id.place) }
(2)Elist  Elist , E

{ appendE.placeto the end ofqueue}

(3)ElistE
{ initializequeueto contain onlyE.place}

 Here, the code for S is the code forElist, which evaluates the arguments, followed by a
parampstatement for each argument, followed by acallstatement.

 queueis emptied and then gets a single pointer to the symbol table location for the name
that denotes the value of E.

Downloaded by Piryanshu Sharma ([email protected])


lOMoARcPSD|54605199

Downloaded by Piryanshu Sharma ([email protected])


lOMoARcPSD|54605199

INTRODUCTION

Downloaded by Piryanshu Sharma ([email protected])

You might also like