0% found this document useful (0 votes)
2 views

CD Unit II Part II LR Parsers

LR parsing is a non-recursive, bottom-up syntax analysis technique that efficiently handles a wide class of context-free grammars. It includes various algorithms like SLR, CLR, and LALR, each differing in complexity and efficiency. The LR parser uses a state-based approach to make shift-reduce decisions and can detect syntactic errors promptly, although constructing an LR parser by hand is labor-intensive, often requiring tools like Yacc.
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
2 views

CD Unit II Part II LR Parsers

LR parsing is a non-recursive, bottom-up syntax analysis technique that efficiently handles a wide class of context-free grammars. It includes various algorithms like SLR, CLR, and LALR, each differing in complexity and efficiency. The LR parser uses a state-based approach to make shift-reduce decisions and can detect syntactic errors promptly, although constructing an LR parser by hand is labor-intensive, often requiring tools like Yacc.
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 16

UNIT – II (Part II)

Syntax Analysis

2.6 Introduction to LR Parsing:


The LR parser is a non-recursive, shift-reduce, bottom-up parser. It uses a wide class of
context-free grammar which makes it the most efficient syntax analysis technique. LR
parsers are also known as LR(k) parsers, where
 L stands for left-to-right scanning of the input stream;
 R stands for the construction of right-most derivation in reverse, and
 k denotes the number of lookahead symbols to make decisions.
There are three widely used algorithms available for constructing an LR parser:
 SLR() – Simple LR Parser:
 Works on smallest class of grammar
 Few number of states, hence very small table
 Simple and fast construction
 CLR() – Canonical LR Parser:
 Works on complete set of LR(1) Grammar
 Generates large table and large number of states
 Slow construction
 LALR() – Look-Ahead LR Parser:
 Works on intermediate size of grammar
 Number of states are same as in SLR(1)

2.6.1 Why LR Parsers?


LR parsing is attractive for a variety of reasons:
 LR parsers can be constructed to recognize virtually all programming language
constructs for which context-free grammars can be written.
 The LR-parsing method is the most general non backtracking shift-reduce parsing
method known.
 An LR parser can detect a syntactic error as soon as it is possible to do so on a
left-to-right scan of the input.

The principal drawback of the LR method is that it is too much work to construct an LR
parser by hand for a typical programming-language grammar. A specialized tool, an LR
parser generator, is needed. Most commonly used is Yacc. Such a generator takes a
context-free grammar and automatically produces a parser for that grammar. If the
grammar contains ambiguities or other constructs that are difficult to parse in a left-to-
right scan of the input, then the parser generator locates these constructs and provides
detailed diagnostic messages.

2.7. Simple LR (SLR)


1 Items and the LR(0) Automaton
An LR parser makes shift-reduce decisions by maintaining states to keep track of where
we are in a parse. States represent sets of "items." An LR(0) item (item for short) of a
grammar G is a production of G with a dot at some position of the body. Thus,
production A  XYZ yields the four items

Page 1
A  .XYZ
A  X.YZ
A  XY.Z
A  XYZ.
The production A  ε generates only one item, A  . .

Intuitively, an item indicates how much of a production we have seen at a given point in
the parsing process. For example, the item A  .XYZ indicates that we hope to see a
string derivable from XYZ next on the input. Item A  X.YZ indicates that we have just
seen on the input a string derivable from X and that we hope next to see a string
derivable from YZ. Item A  XYZ. indicates that we have seen the body XYZ and that it
may be time to reduce XYZ to A.

One collection of sets of LR(0) items, called the canonical LR(0) collection, provides the
basis for constructing a deterministic finite automaton that is used to make parsing
decisions. Such an automaton is called an LR(0) automaton.

In particular, each state of the LR(0) automaton represents a set of items in the
canonical LR(0) collection. The automaton for the expression grammar (2.1), shown in
Fig. 2.18, will serve as the running example for discussing the canonical LR(0) collection
for a grammar.

Figure 2.18: LR(O) automaton for the expression grammar (4.1)

Page 2
To construct the canonical LR(0) collection for a grammar, we define an augmented
grammar and two functions, CLOSURE and GOTO. If G is a grammar with start symbol
S, then G', the augmented grammar for G, is G with a new start symbol S’ and
production S' —> S. The purpose of this new starting production is to indicate to the
parser when it should stop parsing and announce acceptance of the input. That is,
acceptance occurs when and only when the parser is about to reduce by S'  S.

 Closure of Item Sets


If I is a set of items for a grammar G, then CLOSURE(I) is the set of items constructed
from I by the two rules:
1. Initially, add every item in I to CLOSURE(I).
2. If A  α.Bβ is in CLOSURE(I) and B  ϒ is a production, then add the item B  .ϒ to
CLOSURE(I), if it is not already there. Apply this rule until no more new items can be
added to CLOSURE(I).

The closure can be computed as .

 The Function GOTO


The second useful function is GOTO (I, X) where I is a set of items and X is a grammar
symbol. GOTO ( I , X) is defined to be the closure of the set of all items [A  αX.β] such
that [A  α.Xβ] is in I. Intuitively, the GOTO function is used to define the transitions in
the LR(0) automaton for a grammar. The states of the automaton correspond to sets of
items, and GOTO(I,X) specifies the transition from the state for I under input X.

The algorithm to construct C, the canonical collection of sets of LR(0) items for an
augmented grammar G' is given by:

void items(G') {
C = CLOSURE({[S' -> S]});
repeat
for ( each set of items I in C )
for ( each grammar symbol X )
if ( GOTO (I, X) is not empty and not in C )
add GOTO(I, X) to C;
until no new sets of items are added to C on a round;
}

Page 3
The LR-Parsing Algorithm
A schematic of an LR parser is shown in Fig. 2.20. It consists of an input, an output, a
stack, a driver program, and a parsing table that has two parts (ACTION and GOTO). The
driver program is the same for all LR parsers; only the parsing table changes from one
parser to another. The parsing program reads characters from an input buffer one at a
time. Where a shift-reduce parser would shift a symbol, an LR parser shifts a state. Each
state summarizes the information contained in the stack below it.

Figure 2.20: Model of an LR parser

The stack holds a sequence of states, s0s1…sm, where sm is on top. In the SLR method,
the stack holds states from the LR(0) automaton; the canonical- LR and LALR methods
are similar. By construction, each state has a corresponding grammar symbol. Recall
that states correspond to sets of items, and that there is a transition from state i to
state j if GOTO(Ii,X) = Ij. All transitions to state j must be for the same grammar symbol
X. Thus, each state, except the start state 0, has a unique grammar symbol associated
with it.

 Structure of the LR Parsing Table


The parsing table consists of two parts: a parsing-action function ACTION and a goto
function GOTO.
1. The ACTION function takes as arguments a state i and a terminal a (or $, the
input endmarker). The value of ACTION[i,a] can have one of four forms:
(a) Shift j, where j is a state. The action taken by the parser effectively shifts
input a to the stack, but uses state j to represent a.
(b) Reduce A β. The action of the parser effectively reduces β on the top of
the stack to head A.
(c) Accept. The parser accepts the input and finishes parsing.
(d) Error. The parser discovers an error in its input and takes some corrective
action.
2. We extend the GOTO function, defined on sets of items, to states: if GOTO[Ii, A]
= Ij, then GOTO also maps a state i and a nonterminal A to state j .

 LR-Parser Configurations
To describe the behavior of an LR parser, it helps to have a notation representing the
complete state of the parser: its stack and the remaining input. A configuration of an LR
parser is a pair:
(s0s1…sm, a,ai+1…an$)

Page 4
where the first component is the stack contents (top on the right), and the second
component is the remaining input. This configuration represents the right-sentential
form
X1X2…Xmaaiai+1…an
in essentially the same way as a shift-reduce parser would; the only difference is that
instead of grammar symbols, the stack holds states from which grammar symbols can
be recovered. That is, Xi is the grammar symbol represented by state si. Note that s0, the
start state of the parser, does not represent a grammar symbol, and serves as a bottom-
of-stack marker, as well as playing an important role in the parse.

 Behavior of the LR Parser


The next move of the parser from the configuration above is determined by reading ai,
the current input symbol, and sm , the state on top of the stack, and then consulting the
entry ACTION[sm , ai] in the parsing action table. The configurations resulting after each
of the four types of move are as follows:

1. If ACTION[sm , ai] = shift s, the parser executes a shift move; it shifts the next state s
onto the stack, entering the configuration
(s0s1…sm, a,ai+1…an$)
The symbol ai need not be held on the stack, since it can be recovered from s, if needed
(which in practice it never is). The current input symbol i s now ai+1.
2. If ACTION sm , ai] = reduce A  β, then the parser executes a reduce move, entering
the configuration
(s0s1…sm-rs, a,ai+1…an$)
where r is the length of β, and s = GOTO[sm-r, A]. Here the parser first popped r state
symbols off the stack, exposing state sm-r. The parser then pushed s, the entry for
GOTO[sm-r, A], onto the stack. The current input symbol is not changed in a reduce
move. For the LR parsers we shall construct, Xm-r+1…Xm, the sequence of grammar
symbols corresponding to the states popped off the stack, will always match β, the right
side of the reducing production.

The output of an LR parser is generated after a reduce move by executing the semantic
action associated with the reducing production.
3. If ACTlON[sm , ai] = accept, parsing is completed.
4. If ACTlON[sm , ai] = error, the parser has discovered an error and calls an error
recovery routine.

The LR-parsing algorithm is summarized below. All LR parsers behave in this fashion; the
only difference between one LR parser and another is the information in the ACTION
and GOTO fields of the parsing table.

Algorithm : LR-parsing algorithm.


INPUT: An input string w and an LR-parsing table with functions ACTION and GOTO for a
grammar G.
OUTPUT: If w is in L(G), the reduction steps of a bottom-up parse for w; otherwise, an
error indication.
METHOD: Initially, the parser has s0 on its stack, where s0 is the initial state, and w$ in
the input buffer. The parser then executes the program.

Page 5
let a be the first symbol of w$;
while(1) { /* repeat forever */
let s be the state on top of the stack;
if ( ACTION[s, a] = shift t ) {
push t onto the stack;
let a be the next input symbol;
} else if ( ACTION [s, a] = reduce A β) {
pop |β| symbols off t he stack;
let state t now be on top of the stack;
push GOTO[t, A] onto the stack;
output the production A β;
} else if ( ACTION[s,a] = accept ) break; /* parsing is done */
else call error-recovery routine;
}

Example 2.15: Figure 2.21 shows the ACTION and GOTO functions of an LR-parsing table
for the expression grammar productions numbered:
1) E  E + T / T 4) T  F
2) E  T 5) F  (E) / id
3) T  T * F / F 6) F  id
The codes for the actions are:
1. si means shift and stack state i,
2. rj means reduce by the production numbered j,
3. acc means accept,
4. blank means error.

Figure 2.21: Parsing table for expression grammar

On input id * id + id, the sequence of stack and input contents is shown in Fig. 2, 22.
Also shown for clarity, are the sequences of grammar symbols corresponding to the
states held on the stack. For example, at line (1) the LR parser is in state 0, the initial
state with no grammar symbol, and with id the first input symbol. The action in row 0
and column id of the action field of Fig. 2.21 is s5, meaning shift by pushing state 5. That
is what has happened at line (2): the state symbol 5 has been pushed onto the stack,
and id has been removed from the input.

Page 6
Figure 4.38: Moves of an LR parser on id * id + id

Then, * becomes the current input symbol, and the action of state 5 on input * is to
reduce by F id. One state symbol is popped off the stack. State 0 is then exposed. Since
the goto of state 0 on F is 3, state 3 is pushed onto the stack. We now have the
configuration in line (3). Each of the remaining moves is determined similarly.

Page 7
2.8 More Powerful LR Parsers
2.8.1 Canonical LR(1) Items
The most general technique for constructing an LR parsing table from a grammar called
"canonical-LR" or just "LR" method, which makes full use of the lookahead symbol(s).
This method uses a large set of items, called the LR(1) items.

Formally, we say LR(1) item [A  α. β, a] is valid for a viable prefix ϒ if there is a


derivation
S =rm*> ᵟAꙍ=rm*> ᵟαβꙍ, where
1. ϒ = ᵟa, and
2. Either a is the first symbol of w, or w is ε and a is $.

Example 3.1: Let us consider the grammar


SBB
BaB|b
*
There is a rightmost derivation S =rm > aaBab =rm> aaaBab. We see that item [B  α. β,
a] is valid for a viable prefix ϒ = aaa by letting ᵟ = aa, A = B, w = ab, a = a, and β= B in the
above definition. There is also a rightmost derivation S=rm*> BaB=rm> BaaB. From this
derivation we see that item [B a.B,$] is valid for viable prefix Baa.

Constructing LR(1) Sets of Items

Figure 3.1: Sets-of-LR(l)-items construction for grammar G'

Algorithm 3.1: Construction of the sets of LR(1) items.


INPUT: An augmented grammar G'.

Page 8
OUTPUT: The sets of LR(1) items that are the set of items valid for one or more viable
prefixes of G'.
METHOD: The procedures CLOSURE and GOTO and the main routine items for
constructing the sets of items were shown in Fig. 3.1

Example 3.2: Consider the following augmented grammar.


S'  S
S  CC …. (3.1)
C cC | d
We begin by computing the closure of {[S'  .S,$]}. To close, we match the item [S' 
.S,$] with the item [A  α. Bβ, a] in the procedure CLOSURE. That is, A = S', a = ε, B = S,
β = ε, and a = $. Function CLOSURE tells us to add [B .ϒ,b] for each production B -> ϒ
and terminal b in FIRST (βa). In terms of the present grammar, B  ϒ must be S  CC,
and since β is ε and
a is $, b may only be $. Thus we add [S .CC,$].

We continue to compute the closure by adding all items [C  .ϒ, b] for b in FIRST( C$ ) .
That is, matching [S  .CC, $] against [A  α. Bβ, a], we have A = S, α = ε, B = C, β = C,
and a = $. Since C does not derive the empty string, FIRST(C$) = FIRST( C ) . Since
FIRST(C) contains terminals c and d, we add items [C  .cC,c], [C  .cC, d], [C  .d, c]
and [C  .d, d]. None of the new items has a nonterminal immediately to the right of
the dot, so we have completed our first set of LR(1) items. The initial set of items is
I0 : S  .S,$
S  .CC. $
C  .cC, c/d
C  .d, c/d
We now give the LR(1) sets of items construction. Figure 3.2 shows the ten sets of items
with their goto's.

Figure 3.2: The GOTO graph for grammar (3.1)

Page 9
Canonical LR(1) Parsing Tables
The rules for constructing the LR(1) ACTION and GOTO functions from the sets of LR(1)
items. These functions are represented by a table, as before. The only difference is in
the values of the entries.

Algorithm 3.2: Construction of canonical-LR parsing tables.


INPUT: An augmented grammar G'.
OUTPUT: The canonical-LR parsing table functions ACTION and GOTO for G'.
METHOD:
1. Construct C' = {I0,I1,…, In}, the collection of sets of LR(1) items for G'
2. State i of the parser is constructed from Ii. The parsing action for state i is
determined as follows.
(a) If [A  α.aβ,b] is in Ii and GOTO(Ii,a) = Ij, then set ACTION[i, a] to "shift j .
" Here a must be a terminal.
(b) If [A  α.,a] is in Ii, A ≠ S', then set ACTI0N[i, a] to "reduce A -> a."
(c) If [S' S., $] is in Ii, then set ACTION[i, a] to "accept."
If any conflicting actions result from the above rules, we say the grammar is not
LR(1). The algorithm fails to produce a parser in this case.
3. The goto transitions for state i are constructed for all nonterminals A using the
rule: If GOTO(Ij, A) = Ij, then GOTO[i, A] = j.
4. All entries not defined by rules (2) and (3) are made "error."
5. The initial state of the parser is the one constructed from the set of items
containing
[S' .-S, $].

The table formed from the parsing action and goto functions produced by Algorithm 3.2
is called the canonical LR(1) parsing table. An LR parser using this table is called a
canonical-LR(1) parser. If the parsing action function has no multiply defined entries,
then the given grammar is called an LR(1) grammar. As before, we omit the "(1)" if it is
understood.

Example 3.3: The canonical parsing table for grammar (3.1) is shown in Fig. 3.3.
Productions 1, 2, and 3 are S  CC, C  cC, and C  d, respectively.
Every SLR(1) grammar is an LR(1) grammar, but for an SLR(1) grammar the canonical LR
parser may have more states than the SLR parser for the same grammar.

Figure 3.3: Canonical parsing table for grammar (3.1)

Page 10
2.8.2 Constructing LALR Parsing Tables
The last parser construction method, called, the LALR (lookahead- LR) technique. This
method is often used in practice, because the tables obtained by it are considerably
smaller than the canonical LR tables, yet most common syntactic constructs of
programming languages can be expressed conveniently by an LALR grammar.

Algorithm 3.3: Construction of Look Ahead -LR parsing tables.


INPUT: An augmented grammar G'.
OUTPUT: The LALR parsing-table functions ACTION and GOTO for G'.
METHOD:
1. Construct C = { I0,I1,…, In} , the collection of sets of LR(1) items.
2. For each core present among the set of LR(1) items, find all sets having that core,
and replace these sets by their union.
3. Let C = {J0,J1,... ,Jm} be the resulting sets of LR(1) items. The parsing actions for
state i are constructed from Ji in the same manner as in Algorithm 3.2. If there is
a parsing action conflict, the algorithm fails to produce a parser, and the
grammar is said not to be LALR(l).
4. The GOTO table is constructed as follows. If J is the union of one or more sets of
LR(1) items, that is, J = I1 U I2 U … Ik, then the cores of GOTO(I1, X), GOTO(I2 , X),...
, GOTO(Ik , X) are the same, since I1, I2 …, Ik all have the same core. Let K be the
union of all sets of items having the same core as GOTO(I1,X). Then GOTO(J,X) =
K.

The table produced by Algorithm 3.3 is called the LALR parsing table for G. If there are
no parsing action conflicts, then the given grammar is said to be an LALR(1) grammar.
The collection of sets of items constructed in step (3) is called the LALR(1) collection.

Example 3.4: Again consider grammar (3.1) whose GOTO graph was shown in Fig. 3.2.
As we mentioned, there are three pairs of sets of items that can be merged. I3 and I6 are
replaced
by their union:
I36 : C  c.C, c/d/$
C  .cC, c/d/$
C  .d, c/c/$
I4 and I7 are replaced by their union:
I47 : C  d., c/d/$
and I8 and I9 are replaced by their union:
I89 : C  cC., c/d/$
The LALR action and goto functions for the condensed sets of items are shown in Fig.
3.4.

Page 11
Figure 3.4: LALR parsing table for the grammar of Example3.1

2.9 Using Ambiguous Grammars


Ambiguous grammars are quite useful in the specification and implementation of
languages. For language constructs like expressions, an ambiguous grammar provides a
shorter, more natural specification than any equivalent unambiguous grammar. Another
use of ambiguous grammars is in isolating commonly occurring syntactic constructs for
special-case optimization. With an ambiguous grammar, we can specify the special-case
constructs by carefully adding new productions to the grammar.

Precedence and Associativity to Resolve Conflicts


 Associativity
If an operand has operators on both sides, the side on which the operator takes this
operand is decided by the associativity of those operators. If the operation is left-
associative, then the operand will be taken by the left operator or if the operation is
right-associative, the right operator will take the operand.
Example
o Operations such as Addition, Multiplication, Subtraction, and Division
are left associative. If the expression contains:
id op id op id
it will be evaluated as:
(id op id) op id
For example, (id + id) + id

o Operations like Exponentiation are right associative, i.e., the order of


evaluation in the same expression will be:
id op (id op id)
For example, id ^ (id ^ id)

 Precedence
If two different operators share a common operand, the precedence of operators
decides which will take the operand. That is, 2+3*4 can have two different parse trees,
one corresponding to (2+3)*4 and another corresponding to 2+(3*4). By setting
precedence among operators, this problem can be easily removed. As in the previous

Page 12
example, mathematically * (multiplication) has precedence over + (addition), so the
expression 2+3*4 will always be interpreted as:
2 + (3 * 4)
These methods decrease the chances of ambiguity in a language or its grammar.

Page 13
2.10 Error Recovery in LR Parsing
An LR parser will detect an error when it consults the parsing action table and find a
blank or error entry. Errors are never detected by consulting the goto table.
An LR parser will detect an error as soon as there is no valid continuation for the portion
of the input thus far scanned.
 A canonical LR parser will not make even a single reduction before announcing
the error.
 SLR and LALR parsers may make several reductions before detecting an error,
but they will never shift an erroneous input symbol onto the stack.
 We can implement two Modes of Error Recovery :
1. Panic-mode Error Recovery
2. Phrase-level Recovery

Panic-mode Error Recovery


We can implement panic-mode error recovery by scanning down the stack until a state
s with a goto on a particular nonterminal A is found.
 Zero or more input symbols are then discarded until a symbol a is found that can
legitimately follow A.
 The parser then stacks the state GOTO(s, A) and resumes normal parsing.
 The situation might exist where there is more than one choice for the
nonterminal A.
 Normally these would be nonterminals representing major program pieces, e.g.
an expression, a statement, or a block.

For example, if A is the nonterminal stmt, a might be semicolon or }, which marks the
end of a statement sequence.

This method of error recovery attempts to eliminate the phrase containing the
syntactic error.

 The parser determines that a string derivable from A contains an error.


 Part of that string has already been processed, and the result of this processing
is a sequence of states on top of the stack.
 The remainder of the string is still in the input, and the parser attempts to skip
over the remainder of this string by looking for a symbol on the input that can
legitimately follow A.
 By removing states from the stack, skipping over the input, and pushing GOTO(s,
A) on the stack, the parser pretends that if has found an instance of A and
resumes normal parsing.

Phrase-level Recovery
Phrase-level recovery is implemented by examining each error entry in the LR action
table and deciding on the basis of language usage the most likely programmer error
that would give rise to that error.
An appropriate recovery procedure can then be constructed;
 presumably the top of the stack and/or first input symbol would be modified in a
way deemed appropriate for each error entry.
 In designing specific error-handling routines for an LR parser, we can fill in each
blank entry in the action field with a pointer to an error routine that will take the
appropriate action selected by the compiler designer.

Page 14
 The actions may include insertion or deletion of symbols from the stack or the
input or both, or alteration and transposition of input symbols.
 We must make our choices so that the LR parser will not get into an infinite loop.
 A safe strategy will assure that at least one input symbol will be removed or
shifted eventually, or that the stack will eventually shrink if the end of the input
has been reached.
 Popping a stack state that covers a nonterminal should be avoided, because this
modification eliminates from the stack a construct that has already been
successfully parsed.

2.11 Parser Generators


YACC is an automatic tool that generates the parser program
1. YACC stands for Yet Another Compiler Compiler.
2. YACC provides a tool to produce a parser for a given grammar.
3. YACC is a program designed to compile a LALR (1) grammar.
4. It is used to produce the source code of the syntactic analyzer of the language
produced by LALR (1) grammar.
5. The input of YACC is the rule or grammar and the output is a C program.

A translator can be constructed using YACC.


 First, a file, say translate.y, containing a YACC specification of the translator is
prepared. The UNIX system command yacc translate.y
 transforms the file translate.y into a C program called y.tab.c using the LALR
method.
 The program y.tab.c is a representation of an LALR parser written in C, along
with other C routines that the user may have prepared.

 By compiling y.tab.c along with the ly library that contains the LR parsing
program using the command.
 cc y.tab.c -ly
 We obtain the desired object program a.out that performs the translation
specified by the original Yacc program.
 If other procedures are needed, they can be compiled or loaded
with y.tab.c, just as with any C program.

Page 15
A YACC source program has three parts:
Declarations
%%
translation rules
%%
auxiliary functions

Page 16

You might also like