CD Unit II Part II LR Parsers
CD Unit II Part II LR Parsers
Syntax Analysis
The principal drawback of the LR method is that it is too much work to construct an LR
parser by hand for a typical programming-language grammar. A specialized tool, an LR
parser generator, is needed. Most commonly used is Yacc. Such a generator takes a
context-free grammar and automatically produces a parser for that grammar. If the
grammar contains ambiguities or other constructs that are difficult to parse in a left-to-
right scan of the input, then the parser generator locates these constructs and provides
detailed diagnostic messages.
Page 1
A .XYZ
A X.YZ
A XY.Z
A XYZ.
The production A ε generates only one item, A . .
Intuitively, an item indicates how much of a production we have seen at a given point in
the parsing process. For example, the item A .XYZ indicates that we hope to see a
string derivable from XYZ next on the input. Item A X.YZ indicates that we have just
seen on the input a string derivable from X and that we hope next to see a string
derivable from YZ. Item A XYZ. indicates that we have seen the body XYZ and that it
may be time to reduce XYZ to A.
One collection of sets of LR(0) items, called the canonical LR(0) collection, provides the
basis for constructing a deterministic finite automaton that is used to make parsing
decisions. Such an automaton is called an LR(0) automaton.
In particular, each state of the LR(0) automaton represents a set of items in the
canonical LR(0) collection. The automaton for the expression grammar (2.1), shown in
Fig. 2.18, will serve as the running example for discussing the canonical LR(0) collection
for a grammar.
Page 2
To construct the canonical LR(0) collection for a grammar, we define an augmented
grammar and two functions, CLOSURE and GOTO. If G is a grammar with start symbol
S, then G', the augmented grammar for G, is G with a new start symbol S’ and
production S' —> S. The purpose of this new starting production is to indicate to the
parser when it should stop parsing and announce acceptance of the input. That is,
acceptance occurs when and only when the parser is about to reduce by S' S.
The algorithm to construct C, the canonical collection of sets of LR(0) items for an
augmented grammar G' is given by:
void items(G') {
C = CLOSURE({[S' -> S]});
repeat
for ( each set of items I in C )
for ( each grammar symbol X )
if ( GOTO (I, X) is not empty and not in C )
add GOTO(I, X) to C;
until no new sets of items are added to C on a round;
}
Page 3
The LR-Parsing Algorithm
A schematic of an LR parser is shown in Fig. 2.20. It consists of an input, an output, a
stack, a driver program, and a parsing table that has two parts (ACTION and GOTO). The
driver program is the same for all LR parsers; only the parsing table changes from one
parser to another. The parsing program reads characters from an input buffer one at a
time. Where a shift-reduce parser would shift a symbol, an LR parser shifts a state. Each
state summarizes the information contained in the stack below it.
The stack holds a sequence of states, s0s1…sm, where sm is on top. In the SLR method,
the stack holds states from the LR(0) automaton; the canonical- LR and LALR methods
are similar. By construction, each state has a corresponding grammar symbol. Recall
that states correspond to sets of items, and that there is a transition from state i to
state j if GOTO(Ii,X) = Ij. All transitions to state j must be for the same grammar symbol
X. Thus, each state, except the start state 0, has a unique grammar symbol associated
with it.
LR-Parser Configurations
To describe the behavior of an LR parser, it helps to have a notation representing the
complete state of the parser: its stack and the remaining input. A configuration of an LR
parser is a pair:
(s0s1…sm, a,ai+1…an$)
Page 4
where the first component is the stack contents (top on the right), and the second
component is the remaining input. This configuration represents the right-sentential
form
X1X2…Xmaaiai+1…an
in essentially the same way as a shift-reduce parser would; the only difference is that
instead of grammar symbols, the stack holds states from which grammar symbols can
be recovered. That is, Xi is the grammar symbol represented by state si. Note that s0, the
start state of the parser, does not represent a grammar symbol, and serves as a bottom-
of-stack marker, as well as playing an important role in the parse.
1. If ACTION[sm , ai] = shift s, the parser executes a shift move; it shifts the next state s
onto the stack, entering the configuration
(s0s1…sm, a,ai+1…an$)
The symbol ai need not be held on the stack, since it can be recovered from s, if needed
(which in practice it never is). The current input symbol i s now ai+1.
2. If ACTION sm , ai] = reduce A β, then the parser executes a reduce move, entering
the configuration
(s0s1…sm-rs, a,ai+1…an$)
where r is the length of β, and s = GOTO[sm-r, A]. Here the parser first popped r state
symbols off the stack, exposing state sm-r. The parser then pushed s, the entry for
GOTO[sm-r, A], onto the stack. The current input symbol is not changed in a reduce
move. For the LR parsers we shall construct, Xm-r+1…Xm, the sequence of grammar
symbols corresponding to the states popped off the stack, will always match β, the right
side of the reducing production.
The output of an LR parser is generated after a reduce move by executing the semantic
action associated with the reducing production.
3. If ACTlON[sm , ai] = accept, parsing is completed.
4. If ACTlON[sm , ai] = error, the parser has discovered an error and calls an error
recovery routine.
The LR-parsing algorithm is summarized below. All LR parsers behave in this fashion; the
only difference between one LR parser and another is the information in the ACTION
and GOTO fields of the parsing table.
Page 5
let a be the first symbol of w$;
while(1) { /* repeat forever */
let s be the state on top of the stack;
if ( ACTION[s, a] = shift t ) {
push t onto the stack;
let a be the next input symbol;
} else if ( ACTION [s, a] = reduce A β) {
pop |β| symbols off t he stack;
let state t now be on top of the stack;
push GOTO[t, A] onto the stack;
output the production A β;
} else if ( ACTION[s,a] = accept ) break; /* parsing is done */
else call error-recovery routine;
}
Example 2.15: Figure 2.21 shows the ACTION and GOTO functions of an LR-parsing table
for the expression grammar productions numbered:
1) E E + T / T 4) T F
2) E T 5) F (E) / id
3) T T * F / F 6) F id
The codes for the actions are:
1. si means shift and stack state i,
2. rj means reduce by the production numbered j,
3. acc means accept,
4. blank means error.
On input id * id + id, the sequence of stack and input contents is shown in Fig. 2, 22.
Also shown for clarity, are the sequences of grammar symbols corresponding to the
states held on the stack. For example, at line (1) the LR parser is in state 0, the initial
state with no grammar symbol, and with id the first input symbol. The action in row 0
and column id of the action field of Fig. 2.21 is s5, meaning shift by pushing state 5. That
is what has happened at line (2): the state symbol 5 has been pushed onto the stack,
and id has been removed from the input.
Page 6
Figure 4.38: Moves of an LR parser on id * id + id
Then, * becomes the current input symbol, and the action of state 5 on input * is to
reduce by F id. One state symbol is popped off the stack. State 0 is then exposed. Since
the goto of state 0 on F is 3, state 3 is pushed onto the stack. We now have the
configuration in line (3). Each of the remaining moves is determined similarly.
Page 7
2.8 More Powerful LR Parsers
2.8.1 Canonical LR(1) Items
The most general technique for constructing an LR parsing table from a grammar called
"canonical-LR" or just "LR" method, which makes full use of the lookahead symbol(s).
This method uses a large set of items, called the LR(1) items.
Page 8
OUTPUT: The sets of LR(1) items that are the set of items valid for one or more viable
prefixes of G'.
METHOD: The procedures CLOSURE and GOTO and the main routine items for
constructing the sets of items were shown in Fig. 3.1
We continue to compute the closure by adding all items [C .ϒ, b] for b in FIRST( C$ ) .
That is, matching [S .CC, $] against [A α. Bβ, a], we have A = S, α = ε, B = C, β = C,
and a = $. Since C does not derive the empty string, FIRST(C$) = FIRST( C ) . Since
FIRST(C) contains terminals c and d, we add items [C .cC,c], [C .cC, d], [C .d, c]
and [C .d, d]. None of the new items has a nonterminal immediately to the right of
the dot, so we have completed our first set of LR(1) items. The initial set of items is
I0 : S .S,$
S .CC. $
C .cC, c/d
C .d, c/d
We now give the LR(1) sets of items construction. Figure 3.2 shows the ten sets of items
with their goto's.
Page 9
Canonical LR(1) Parsing Tables
The rules for constructing the LR(1) ACTION and GOTO functions from the sets of LR(1)
items. These functions are represented by a table, as before. The only difference is in
the values of the entries.
The table formed from the parsing action and goto functions produced by Algorithm 3.2
is called the canonical LR(1) parsing table. An LR parser using this table is called a
canonical-LR(1) parser. If the parsing action function has no multiply defined entries,
then the given grammar is called an LR(1) grammar. As before, we omit the "(1)" if it is
understood.
Example 3.3: The canonical parsing table for grammar (3.1) is shown in Fig. 3.3.
Productions 1, 2, and 3 are S CC, C cC, and C d, respectively.
Every SLR(1) grammar is an LR(1) grammar, but for an SLR(1) grammar the canonical LR
parser may have more states than the SLR parser for the same grammar.
Page 10
2.8.2 Constructing LALR Parsing Tables
The last parser construction method, called, the LALR (lookahead- LR) technique. This
method is often used in practice, because the tables obtained by it are considerably
smaller than the canonical LR tables, yet most common syntactic constructs of
programming languages can be expressed conveniently by an LALR grammar.
The table produced by Algorithm 3.3 is called the LALR parsing table for G. If there are
no parsing action conflicts, then the given grammar is said to be an LALR(1) grammar.
The collection of sets of items constructed in step (3) is called the LALR(1) collection.
Example 3.4: Again consider grammar (3.1) whose GOTO graph was shown in Fig. 3.2.
As we mentioned, there are three pairs of sets of items that can be merged. I3 and I6 are
replaced
by their union:
I36 : C c.C, c/d/$
C .cC, c/d/$
C .d, c/c/$
I4 and I7 are replaced by their union:
I47 : C d., c/d/$
and I8 and I9 are replaced by their union:
I89 : C cC., c/d/$
The LALR action and goto functions for the condensed sets of items are shown in Fig.
3.4.
Page 11
Figure 3.4: LALR parsing table for the grammar of Example3.1
Precedence
If two different operators share a common operand, the precedence of operators
decides which will take the operand. That is, 2+3*4 can have two different parse trees,
one corresponding to (2+3)*4 and another corresponding to 2+(3*4). By setting
precedence among operators, this problem can be easily removed. As in the previous
Page 12
example, mathematically * (multiplication) has precedence over + (addition), so the
expression 2+3*4 will always be interpreted as:
2 + (3 * 4)
These methods decrease the chances of ambiguity in a language or its grammar.
Page 13
2.10 Error Recovery in LR Parsing
An LR parser will detect an error when it consults the parsing action table and find a
blank or error entry. Errors are never detected by consulting the goto table.
An LR parser will detect an error as soon as there is no valid continuation for the portion
of the input thus far scanned.
A canonical LR parser will not make even a single reduction before announcing
the error.
SLR and LALR parsers may make several reductions before detecting an error,
but they will never shift an erroneous input symbol onto the stack.
We can implement two Modes of Error Recovery :
1. Panic-mode Error Recovery
2. Phrase-level Recovery
For example, if A is the nonterminal stmt, a might be semicolon or }, which marks the
end of a statement sequence.
This method of error recovery attempts to eliminate the phrase containing the
syntactic error.
Phrase-level Recovery
Phrase-level recovery is implemented by examining each error entry in the LR action
table and deciding on the basis of language usage the most likely programmer error
that would give rise to that error.
An appropriate recovery procedure can then be constructed;
presumably the top of the stack and/or first input symbol would be modified in a
way deemed appropriate for each error entry.
In designing specific error-handling routines for an LR parser, we can fill in each
blank entry in the action field with a pointer to an error routine that will take the
appropriate action selected by the compiler designer.
Page 14
The actions may include insertion or deletion of symbols from the stack or the
input or both, or alteration and transposition of input symbols.
We must make our choices so that the LR parser will not get into an infinite loop.
A safe strategy will assure that at least one input symbol will be removed or
shifted eventually, or that the stack will eventually shrink if the end of the input
has been reached.
Popping a stack state that covers a nonterminal should be avoided, because this
modification eliminates from the stack a construct that has already been
successfully parsed.
By compiling y.tab.c along with the ly library that contains the LR parsing
program using the command.
cc y.tab.c -ly
We obtain the desired object program a.out that performs the translation
specified by the original Yacc program.
If other procedures are needed, they can be compiled or loaded
with y.tab.c, just as with any C program.
Page 15
A YACC source program has three parts:
Declarations
%%
translation rules
%%
auxiliary functions
Page 16