S2 BottomUpParsing
S2 BottomUpParsing
Naranker Dulay
[email protected]
https://www.doc.ic.ac.uk/~nd/compilers
IF
LPAREN if
INT(24)
RPAREN
IDENT('A') 24 A B
ELSE
IDENT('B')
n.dulay LR (Bottom-up) Parsing 2
LR Parsing - main idea
Abstract Syntax
Tokens LR Parser
Tree (AST)
DFA to
S’ ● S, $ S’ S ●, $ 1
S ● id, $
S ● V = E, $ id S id ●, $
2
Parsing
V ● id, = V id ●, =
V
= E
V ● = E, $ S V = ● E, $ V = E ●, $
Table
S S 5
E ● int, $ int
3 E ● V, $ E int ●, $ 6
V ● id, $ V
4 E V ●, $ 7
id
V id ●, $ 8
to DFA
id int = $ E V S
0 S2 G3 G1
1 A
2 R3 R1
3 S4
r1: S ® id 4 S8 S6 G5 G7
r2: S ® V ‘=’ E 5 R2
r3: V ® id
6 R5
r4: E ® V
7 R4
r5: E ® int
8 R3
There are several methods for generating LR parsing tables. The methods
essentially define parsers/grammars of different expressive power, e.g. LR(0),
SLR(1), LALR(1), LR(1), LR(2), LR(3) etc. Less powerful methods may fail to
produce a parsing table for a given context-free grammar.
The parsers resulting from these methods are similar - in each case their states
consist of so-called LR(k) items of the grammar, but vary in the number of
states and/or in the size of their lookahead sets.
LR(0) parsers are the simplest LR parsers. They don’t use the current token in
order to perform a reduction.
LR(0) items are instances of the rules of the grammar with a dot • in some
position on the right hand side of the rule.
An item indicates how much of a rule has been seen, e.g. the item
E ® E + • int indicates that we have seen an expression and a plus sign
and hope to encounter an integer next. LR(0) items thus indicate
intermediate steps in the recognition of the RHS of a rule.
LR(0) items are used as states of a finite automaton that maintains information
about the progress of a shift-reduce parse.
We build a NFA from the LR(0) items and from this NFA we can build a DFA
using the subset construction.
X → A • BC
ε B→•D
For each state X containing the item R' ® .. • add P [X, $] = a (accept)
Stack
Top State
Parsing Table
Top-1 State LR Parser (shift, reduce,
goto, accept,
error)
Operation: We push state 0 onto the stack and then repeatedly perform:
SWITCH ParsingTable[Stack[Top],CurrentToken]
shift Sn: Push state n onto the Stack, Advance Current Token
reduce Rn: Remove L elements from the Stack where L=length of RHS of rule n.
Push ParsingTable[Stack[Top],LHS of R]; i.e. the “Goto” action
reduce also generates an AST node for the rule
accept a: Accept input stream (i.e. parse was successful)
error: Report Error
goto Gn: This case is not selected directly. Looked-up in the reduce case
above.
n.dulay LR (Bottom-up) Parsing 15
Example
For LR, we add an auxiliary rule
Grammar E’ ® E $ with an end-of-input symbol $. $
is implied if omitted
rule r1: E ® E ‘+’ int -- rules labelled for parsing table
rule r2: E ® int
int + $ E
0 s2 g1
1 s3 a
2 r2 r2
3 s4
4 r1 r1
Abbreviations: s (shift), r (reduce), a (accept), g (goto), blank (error)
Note: s and g are followed by a state number, r is followed by a rule number.
n.dulay LR (Bottom-up) Parsing 16
Example Contd LR(1) Parsing Table
State ACTION GOTO
int + $ E
E’® E $ 0 s2 g1
r1: E ® E ‘+’ int 1 s3 a
r2: E ® int 2 r2 r2
3 s4
4 r1 r1
LR(1) is the most powerful of these methods and sufficient for most programming
languages. Its disadvantage is that it requires large parsing tables. LALR(1) is a
modification of LR(1) that retains virtually all the benefits of LR(1), while producing
much smaller parsing tables. We’ll look at LR(1) first, then LALR(1).
SLR(1) parsers are similar to LR(0) parsers except that we only add reduce
actions into a SLR(1) parsing table “if appropriate”:
An SLR(1) parsing table has the same states and gotos as an LR(0)
parsing table, only the actions differ.
This change increases the expressiveness of SLR(1) parsers over LR(0) parsers
considerably and many programming language constructs can be parsed by
SLR(1) that LR(0) cannot parse. We won’t look at SLR parsers in any further
detail but rather consider the more powerful LR(1) and LALR(1) parsers.
The FIRST set for a sequence of rules (non-terminals) and tokens (terminals)
a, is the set of all tokens that could start a derivation of a, plus ε if a could
derive ε.
The FOLLOW set for a rule (non-terminal) A, is the set of all tokens
(terminals) that could immediately follow A, plus $ if A can end the input.
The FIRST set for a rule A, is the set of all tokens that could start a derivation
of A, plus ε if A could derive ε
foreach alternative of A of the form A ® b1 b2 ... bn
Include FIRST(b1) - {ε} in FIRST(A)
If ε in FIRST(b1)
Include FIRST (b2) - {ε} in FIRST(A)
If ε in FIRST(b2)
...
Include FIRST (bn) - {ε} in FIRST(A)
If ε in FIRST(bn)
Include ε in FIRST(A)
The FIRST set for a sequence ABC.. is formed as for the rule case above.
FIRST(Factor) = { (, id }
FIRST(Term2) = { *, ε }
FIRST(Term) = FIRST(Factor) = { (, id }
FIRST(Expr2) = { +, ε }
FIRST(Expr) = FIRST(Term) = { (, id }
The FOLLOW set for a rule A, is the set of all tokens that could immediately
follow A. Note: ε is never in the FOLLOW set.
FOLLOW(Expr) = { ), $ }
FOLLOW(Expr2) = FOLLOW(Expr) + FOLLOW(Expr2) = { ), $ }
FOLLOW(Term) = FIRST(Expr2) + FOLLOW(Expr) + FOLLOW(Expr2)
= { +, ), $ }
FOLLOW(Term2) = FOLLOW(Term) + FOLLOW(Term2) = { +, ), $ }
FOLLOW(Factor) = FIRST(Term2) + FOLLOW(Term) + FOLLOW(Term2)
= { *, +, ), $ }
n.dulay LR (Bottom-up) Parsing 24
LR(1) Items
[ X ® A • B C, t ] [ X ® A • B C, t ]
[X → A • BC, t]
ε [B → • D, u]
We include the initial item for the auxiliary rule i.e. [X’ ® •X, $] to start
the construction of the DFA, which proceeds as before.
Grammar Expanded
ε
V ➝ ● id, =
S’® S $
S ® id
S ® V ‘=’ E
V ® id 0
S’➝ ● S, $
E ® V S ➝ ● id, $
E ® int S ➝ ● V = E, $
V ➝ ● id, =
0 S
S’® S $ S’➝ ● S, $ S’➝ S ●, $ 1
S ® id S ➝ ● id, $
S ➝ ● V = E, $ id S ➝ id ●, $
S ® V ‘=’ E V ➝ ● id, = V ➝ id ●, = 2
V ® id V
=
E ® V S ➝ V ● = E, $ S ➝ V = ● E, $
E ® int ε
3
E ➝ ● int, $
E ➝ ● V, $ 4
ε
V ➝ ● id, $
0 S
S’® S $ S’➝ ● S, $ S’➝ S ●, $ 1
S ® id S ➝ ● id, $
S ➝ ● V = E, $ id S ➝ id ●, $
S ® V ‘=’ E V ➝ ● id, = V ➝ id ●, = 2
V ® id V
=
E ® V S ➝ V ● = E, $ S ➝ V = ● E, $
E ® int E ➝ ● int, $
3 E ➝ ● V, $
V ➝ ● id, $
4
0 S
S’➝ ● S, $ S’➝ S ●, $ 1
S ➝ ● id, $
S ➝ ● V = E, $ id S ➝ id ●, $
V ➝ ● id, = V ➝ id ●, = 2
V
= E
S ➝ V ● = E, $ S ➝ V = ● E, $ S ➝ V = E ●, $ 5
E ➝ ● int, $ int
3 E ➝ ● V, $ E ➝ int ●, $ 6
V ➝ ● id, $ V
4 E ➝ V ●, $
Grammar 7
id
S’® S $ V ➝ id ●, $ 8
S ® id
S ® V ‘=’ E
V ® id
E ® V
E ® int
n.dulay LR (Bottom-up) Parsing 30
Parsing Table for Example
r1: S ® id Remember for LR(1) we only perform a reduction
r2: S ® V ‘=’ E X ® A•, t if the current token is t
r3: V ® id
r4: E ® V
r5: E ® int
State ACTION GOTO
id int = $ E V S
0 s2 g3 g1
1 a
2 r3 r1
3 s4
4 s8 s6 g5 g7
5 r2
6 r5
7 r4
8 r3
LALR(1) parsers are like LR(1) parsers but we perform an optimisation where we
merge LR(1) states that have the same LR(0) items but differ in their look-ahead
token.
An LALR item thus consists of an LR(0) item and a set of look-ahead tokens:
In the following example, LR(1) states 2 and 3 have the same LR(0) items. We
can therefore combine them
int X ➝ E ● T, =
2
V ➝ id ●, + int
X ➝ E ● T, =,$
Y 23
V ➝ id ●, +,/
Y X ➝ E ● T, $
V ➝ id ●, / 3
A ® id
A ® id ‘[‘ Expr ’]’ Do we shift or do we reduce? => Shift-Reduce conflict
Expr ® Expr ‘+’ Expr | Expr ‘*’ Expr | ‘(‘ Expr ‘)’ | int
On encountering an error entry, we can either stop and report a single error to the
programmer, or we can attempt to recover from the error and so allow parsing to
continue and further errors to be reported.
Many error recovery schemes exist, some attempt recovery by scanning forward in the
token stream, some modify the parse stack, others do both.
The above rules specify that if a syntax error is encountered in an expression that the
parser should skip to the next ) or the next ; In YACC, error is a special terminal
symbol but has shift actions entered into the parsing table as if it were a normal
token. See Appel p79 or Aho p264 for more details.
IF
LPAREN
RPAREN If-Statement
IDENT("A") Statement
ELSE
Else-Part
IDENT("B") Statement
INT(24)
IDENT("A") If-Statement
IDENT("B")
program ® statement_sequence
%{
#include <string>
#include "astnode.h"
#include "parser.hpp"
%}
digit [0-9]
number {digit}+
letter [a-zA-Z]
identifier {letter}+
whitespace [ \t\n]+
comment //[^\n]*\n
• For this example we’ll use C++ code with our bison grammar.
astnode.h will hold C++ class definitions for our AST nodes.
• parser.hpp is the include file generated by bison-c++.
%%
if return IF;
then return THEN;
else return ELSE;
while return WHILE;
do return DO;
":=" return BECOMES;
"=" return EQUALS;
"<" return LESSTHAN;
"+" return PLUS;
"-" return MINUS;
"(" return LPAREN;
")" return RPAREN;
";" return SEMICOLON;
• To keep our flex code very simple, token attributes (e.g. for numbers) will be
generated in bison. We’ll just pass a C++ string.
%{
#include "astnode.h"
#include <cstdio>
#include <cstdlib>
%}
%union {
int token;
std::string *string;
Identifier *id;
Expression *expression;
Statement *statement;
StatSeq *statseq;
ASTnode *node;
}
/* Associativity of operators */
%left PLUS MINUS
program:
statement_sequence
{ ast = $1; }
;
statement_sequence:
statement
{ $$ = new StatSeq();
$$->statements.push_back($1);
}
| statement_sequence SEMICOLON statement
{ $1->statements.push_back($3);
}
;
statement:
if_statement { $$ = $1; }
| while_statement { $$ = $1; }
| assign_statement { $$ = $1; }
;
n.dulay LR (Bottom-up) Parsing 51
parser.y
if_statement:
IF expression THEN statement_sequence END
{ $$ = new IfStatement(*$2, *$4); }
;
while_statement:
WHILE expression DO statement_sequence END
{ $$ = new WhileStatement(*$2, *$4); }
;
assign_statement:
identifier BECOMES expression
{ $$ = new Assignment(*$1, *$3); }
;
expression:
identifier
{ $$ = $1; }
| number
{ $$ = $1; }
| expression PLUS expression
{ $$ = new Operator(*$1, $2, *$3); }
| expression MINUS expression
{ $$ = new Operator(*$1, $2, *$3); }
| expression comparator expression
{ $$ = new Operator(*$1, $2, *$3); }
| LPAREN expression RPAREN
{ $$ = $2; }
;
identifier:
IDENTIFIER
{ $$ = new Identifier(*$1);
delete $1;
}
;
number:
INTEGER
{ $$ = new Number(atol($1->c_str()));
delete $1;
}
;
comparator:
EQUALS
| LESSTHAN
;
#include <vector>
#include <iostream>
class ASTnode {
public:
virtual ~ASTnode() {}
virtual void semantic_check () {}
};
#include <iostream>
#include "astnode.h"
all: parser
OBJS = scanner.o parser.o main.o
CPPFLAGS =
parser.cpp: parser.y
bison -d –v -o $@ $^
parser.hpp: parser.cpp
%.o: %.cpp
g++ -c $(CPPFLAGS) -o $@ $<
parser: $(OBJS)
g++ -o $@ $(OBJS)