0% found this document useful (0 votes)
20 views

S2 BottomUpParsing

Uploaded by

xsh7p9fjk8
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
20 views

S2 BottomUpParsing

Uploaded by

xsh7p9fjk8
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 59

LR (Bottom-up) Parsing:

Tokens to AST Tree

Naranker Dulay
[email protected]

https://www.doc.ic.ac.uk/~nd/compilers

n.dulay LR (Bottom-up) Parsing 1


Parsing - main idea
Parsing (syntax analysis) transforms a sequence of tokens into an abstract
syntax tree based on the (typically context free) grammar of the language

Context Free Grammar (CFG)


CFG rules take the form rule ® (rule|token)*

Tokens Parser Abstract Syntax Tree (AST)

IF
LPAREN if
INT(24)
RPAREN
IDENT('A') 24 A B
ELSE
IDENT('B')
n.dulay LR (Bottom-up) Parsing 2
LR Parsing - main idea

LR parsing is also LR Grammar


known as:
(i) bottom-up parsing,
reflecting the direction
the AST is constructed,
LR Parser Generator
(ii) shift-reduce
parsing, reflecting the
two main actions
performed by an LR Parsing Table
parser.

Abstract Syntax
Tokens LR Parser
Tree (AST)

n.dulay LR (Bottom-up) Parsing 3


0 S

DFA to
S’ ● S, $ S’ S ●, $ 1
S ● id, $
S ● V = E, $ id S id ●, $
2

Parsing
V ● id, = V id ●, =
V
= E
V ● = E, $ S V = ● E, $ V = E ●, $
Table
S S 5
E ● int, $ int
3 E ● V, $ E int ●, $ 6
V ● id, $ V
4 E V ●, $ 7
id
V id ●, $ 8

Grammar State ACTION GOTO

to DFA
id int = $ E V S
0 S2 G3 G1
1 A
2 R3 R1
3 S4
r1: S ® id 4 S8 S6 G5 G7
r2: S ® V ‘=’ E 5 R2
r3: V ® id
6 R5
r4: E ® V
7 R4
r5: E ® int
8 R3

n.dulay LR (Bottom-up) Parsing 4


Parsers for Context-Free Grammars
LL and LR grammars are LL(k): Left-to-right scan, Leftmost-derivation,
subsets of CFGs that we k-token look-ahead
can parse in O(n) time LL(0) No such thing.
instead of O(n3) for full
CFGs. LL(1) Most popular LL. Can implement as an
automaton or as a recursive descent parser
k=2 Sometimes useful
k>2 Rarely needed
LL - Top-Down Parsers
LR(k): Left-to-right scan, Rightmost-derivation (in
n Build AST from root reverse), k-token look-ahead
to leaves.
LR(0) Weak. Useful for learning about LR.
SLR(1) Stronger, superseded by LALR(1)
LALR(1) Fast and popular. Close in power to LR(1)
LR - Bottom-Up Parsers
LR(1) Powerful -- needs more memory
n Build AST from leaves k => 2 Possible, but rarely needed/used
to root.
LR(n) is a superset of an LL(n) and hence more powerful.

n.dulay LR (Bottom-up) Parsing 5


Chomsky Hierarchy
In the following
R,S are non-terminals (the name of a rule/production, non-leaf node in AST)
t is a sequence of terminals (tokens from lexical analyser, leaf nodes in AST)
a, b, j are sequences of terminals and non-terminals

type 3 – regular grammars (Deterministic Finite Automata)


R ® t

type 2 – context free grammars (Pushdown Automata)


R ® a

type 1 – context sensitive grammars (Linear Bounded Automata)


a R b ® a j b

type 0 – unrestricted grammars/recursively enumerable (Turing Machines)


a ® b

n.dulay LR (Bottom-up) Parsing 6


LR Parsers

Most programming languages can be described by a Context-Free


Grammar (CFG) and virtually all CFGs for programming languages can
be described by an LR grammar and implemented with an LR parser.

LR Parsers are too complex to build by hand. We need an LR parser-


generator e.g. yacc/bison, SableCC, CUP, Beaver. The principles of LR
operation are “relatively easy” to understand however.

LR Parsers can be implemented efficiently in O(n) time.

n.dulay LR (Bottom-up) Parsing 7


LR Grammar to Parsing Table

There are several methods for generating LR parsing tables. The methods
essentially define parsers/grammars of different expressive power, e.g. LR(0),
SLR(1), LALR(1), LR(1), LR(2), LR(3) etc. Less powerful methods may fail to
produce a parsing table for a given context-free grammar.

The parsers resulting from these methods are similar - in each case their states
consist of so-called LR(k) items of the grammar, but vary in the number of
states and/or in the size of their lookahead sets.

LR(k) Grammar LR(k) Items

NFA DFA ParsingTable

n.dulay LR (Bottom-up) Parsing 8


LR(0) Items

LR(0) parsers are the simplest LR parsers. They don’t use the current token in
order to perform a reduction.

LR(0) items are instances of the rules of the grammar with a dot • in some
position on the right hand side of the rule.

An item indicates how much of a rule has been seen, e.g. the item
E ® E + • int indicates that we have seen an expression and a plus sign
and hope to encounter an integer next. LR(0) items thus indicate
intermediate steps in the recognition of the RHS of a rule.

Example: The rule X ® ABC has 4 LR(0) items

X ® • ABC Initial item


X ® A • BC
X ® AB • C
X ® ABC • Complete (or Reduce) Item

n.dulay LR (Bottom-up) Parsing 9


Example
For LR, we add an auxiliary rule
The Grammar with an end-of-input symbol $. $ is
implied if omitted.
E’® E $
rule r1: E ® E ‘+’ int -- rules labelled for parsing table
rule r2: E ® int

has 8 LR(0) items:

E’® • E Auxiliary rule, Initial item, $ is implied


E’® E • Complete Item or Reduce Item
E ® • E + int Initial item
E ® E • + int
E ® E + • int
E ® E + int • Complete Item
E ® • int
E ® int • Complete Item

n.dulay LR (Bottom-up) Parsing 10


LR(0) to NFA

LR(0) items are used as states of a finite automaton that maintains information
about the progress of a shift-reduce parse.

We build a NFA from the LR(0) items and from this NFA we can build a DFA
using the subset construction.

NFA Transitions: Given a state with item X ® A • BC, add a B transition


and a new state as follows:
B
X → A • BC X → AB • C

In addition, if B is a rule (non-terminal) then foreach initial item B ® • D


add an ε transition and a new state as follows:

X → A • BC
ε B→•D

n.dulay LR (Bottom-up) Parsing 11


Example
Grammar
E’ → • E
E
E’® E
E’ → E • ε
r1: E ® E ‘+’ int
r2: E ® int
E → • E + int ε
E
Items
E → E • + int ε
E’® • E +
E’® E • E → E + • int ε
E ® • E + int int
E ® E • + int E → E + int •
E ® E + • int
E ® E + int • E → • int
E ® • int
int
E ® int •
E → int •

n.dulay LR (Bottom-up) Parsing 12


NFA to DFA
E’ → • E E’ → • E 0
E E → • E + int
E’ → E • ε E → • int
E
1
E → • E + int ε E’ → E •
E E → E • + int
+ int
E → E • + int ε 2
E → E + • int
+
E → E + • int ε int
3
E → E + int •
int
E → E + int •
4
E → int •
E → • int
int
Note: Since it’s easy to determine
E → int • the ε-closures from items, we'll Number states
construct the DFA directly from
now on. from 0
n.dulay LR (Bottom-up) Parsing 13
DFA to LR(0) Parsing Table
T
For each terminal transition X ® Y add P [X, T] = sY (shift Y)
N
For each non-terminal transition X ® Y add P [X, N] = gY (goto Y)

For each state X containing the item R' ® .. • add P [X, $] = a (accept)

For each state X containing an item R ® .. • add P [X, T] = rN (reduce)


for every terminal T where N is R’s rule number

Note: State ACTION GOTO


For LR(0) parsers if there is more
int + $ E
than one action for a table cell
then the grammar is not LR(0), 0 s4 g1
e.g. we may have a shift and a 1 s2 a
reduce (shift-reduce conflict) or 2 s3
two reduces (reduce-reduce
conflict). Blank cells indicate an 3 r1 r1 r1
error. 4 r2 r2 r2

n.dulay LR (Bottom-up) Parsing 14


Model of an LR parser
Tokens T0 T1 ... Tk ... Tn $

Stack
Top State
Parsing Table
Top-1 State LR Parser (shift, reduce,
goto, accept,
error)

Bottom State Note: This a Pushdown Automaton for LR(1) parsing

Operation: We push state 0 onto the stack and then repeatedly perform:
SWITCH ParsingTable[Stack[Top],CurrentToken]
shift Sn: Push state n onto the Stack, Advance Current Token
reduce Rn: Remove L elements from the Stack where L=length of RHS of rule n.
Push ParsingTable[Stack[Top],LHS of R]; i.e. the “Goto” action
reduce also generates an AST node for the rule
accept a: Accept input stream (i.e. parse was successful)
error: Report Error
goto Gn: This case is not selected directly. Looked-up in the reduce case
above.
n.dulay LR (Bottom-up) Parsing 15
Example
For LR, we add an auxiliary rule
Grammar E’ ® E $ with an end-of-input symbol $. $
is implied if omitted
rule r1: E ® E ‘+’ int -- rules labelled for parsing table
rule r2: E ® int

LR(1) Parsing Table

STATE ACTION (Terminals) GOTO (Non-Terminals)

int + $ E
0 s2 g1
1 s3 a
2 r2 r2
3 s4
4 r1 r1
Abbreviations: s (shift), r (reduce), a (accept), g (goto), blank (error)
Note: s and g are followed by a state number, r is followed by a rule number.
n.dulay LR (Bottom-up) Parsing 16
Example Contd LR(1) Parsing Table
State ACTION GOTO

int + $ E
E’® E $ 0 s2 g1
r1: E ® E ‘+’ int 1 s3 a
r2: E ® int 2 r2 r2
3 s4
4 r1 r1

Stack Tokens Action


0 int + int $ T[0,int] = s2
E
0 2 + int $ T[2,+] = r2, pop 1 elem, push T[0,E] r1
0 1 + int $ T[1,+] = s3
E
0 1 3 int $ T[3,int] = s4
r2
0 1 3 4 $ T[4,$] = r1, pop 3 elems, push T[0,E]
0 1 $ T[1,$] = a int + int $
n.dulay LR (Bottom-up) Parsing 17
LR(1) and LALR(1) Parsers

LR parsers essentially differ in how they perform reduction:

LR(0) A reduce item X ® A• always causes a reduction (the current token is


not used in LR(0) parsing)
SLR(1) A reduce item X ® A• causes a reduction only if the current token is
in FOLLOW(X)
LR(1) A reduce item X ® A•, t causes a reduction only if the current token
is equal to the look-ahead token t
LALR(1) Like LR(1) but we combine LR(1) states that differ in the look-ahead
token of the item only.

LR(1) is the most powerful of these methods and sufficient for most programming
languages. Its disadvantage is that it requires large parsing tables. LALR(1) is a
modification of LR(1) that retains virtually all the benefits of LR(1), while producing
much smaller parsing tables. We’ll look at LR(1) first, then LALR(1).

n.dulay LR (Bottom-up) Parsing 18


SLR(1) Optional Material

SLR(1) parsers are similar to LR(0) parsers except that we only add reduce
actions into a SLR(1) parsing table “if appropriate”:

SLR(1): Add a reduce action for an item A ® B • only if the current


token can FOLLOW A somewhere in the grammar.

An SLR(1) parsing table has the same states and gotos as an LR(0)
parsing table, only the actions differ.

This change increases the expressiveness of SLR(1) parsers over LR(0) parsers
considerably and many programming language constructs can be parsed by
SLR(1) that LR(0) cannot parse. We won’t look at SLR parsers in any further
detail but rather consider the more powerful LR(1) and LALR(1) parsers.

n.dulay LR (Bottom-up) Parsing 19


FIRST and FOLLOW Sets

The FIRST set for a sequence of rules (non-terminals) and tokens (terminals)
a, is the set of all tokens that could start a derivation of a, plus ε if a could
derive ε.

FIRST(a) = {t:(a Þ* tb)} È (if a Þ* ε then {ε} else {})

Þ* zero or more derivations


Þ+ one or more derivations
We say that a non-terminal A is NULLABLE if A Þ* ε

The FOLLOW set for a rule (non-terminal) A, is the set of all tokens
(terminals) that could immediately follow A, plus $ if A can end the input.

FOLLOW(A) = {t:(S Þ+ aAtb)} È (if S Þ* aA then {$} else {})

where S is the start symbol


n.dulay LR (Bottom-up) Parsing 20
FIRST Set

The FIRST set for a token T is {T}, for ε is {ε}

The FIRST set for a rule A, is the set of all tokens that could start a derivation
of A, plus ε if A could derive ε
foreach alternative of A of the form A ® b1 b2 ... bn
Include FIRST(b1) - {ε} in FIRST(A)
If ε in FIRST(b1)
Include FIRST (b2) - {ε} in FIRST(A)
If ε in FIRST(b2)
...
Include FIRST (bn) - {ε} in FIRST(A)
If ε in FIRST(bn)
Include ε in FIRST(A)

The FIRST set for a sequence ABC.. is formed as for the rule case above.

n.dulay LR (Bottom-up) Parsing 21


FIRST Set Example
Expr’ ® Expr $
Expr ® Term Expr2
Expr2 ® ‘+’ Term Expr2 | ε
Term ® Factor Term2
Term2 ® ‘*’ Factor Term2 | ε
Factor ® ‘(’ Expr ‘)’ | id

FIRST(Factor) = { (, id }
FIRST(Term2) = { *, ε }
FIRST(Term) = FIRST(Factor) = { (, id }
FIRST(Expr2) = { +, ε }
FIRST(Expr) = FIRST(Term) = { (, id }

n.dulay LR (Bottom-up) Parsing 22


FOLLOW Sets

The FOLLOW set for a rule A, is the set of all tokens that could immediately
follow A. Note: ε is never in the FOLLOW set.

(i) Foreach derivation of the form B ® C A D


Include FIRST(D) - {ε} in FOLLOW(A)
if D Þ* ε include FOLLOW(B) in FOLLOW(A)

This last if can be written


if {ε} in FIRST(D) include FOLLOW(B) in FOLLOW(A)

Note: C & D can be empty. If D is empty then we also include FOLLOW(B)


in FOLLOW(A)

(ii) If a rule A can end the input we include $ in FOLLOW(A)


Recall we use $ as an End-of-Input token.

n.dulay LR (Bottom-up) Parsing 23


FOLLOW Set Example
Expr’ ® Expr $
Expr ® Term Expr2
Expr2 ® ‘+’ Term Expr2 | ε
Term ® Factor Term2
Term2 ® ‘*’ Factor Term2 | ε
Factor ® ‘(’ Expr ‘)’ | id

FOLLOW(Expr) = { ), $ }
FOLLOW(Expr2) = FOLLOW(Expr) + FOLLOW(Expr2) = { ), $ }
FOLLOW(Term) = FIRST(Expr2) + FOLLOW(Expr) + FOLLOW(Expr2)
= { +, ), $ }
FOLLOW(Term2) = FOLLOW(Term) + FOLLOW(Term2) = { +, ), $ }
FOLLOW(Factor) = FIRST(Term2) + FOLLOW(Term) + FOLLOW(Term2)
= { *, +, ), $ }
n.dulay LR (Bottom-up) Parsing 24
LR(1) Items

An LR(1) item is a pair [ LR(0)item, t ]

(i) a LR(0) item as before, plus

(ii) a look-ahead token t

Intuitively the LR(1) item

[ X ® A • B C, t ] [ X ® A • B C, t ]

indicates that A is on top of the stack and that we only want to


recognise B if B is followed by a string derivable from Ct i.e. if the
current token is in FIRST(Ct)

n.dulay LR (Bottom-up) Parsing 25


LR(1) NFA Transitions
Given a state with the following LR(1) item: [X ® A•BC, t] add the
following NFA transition and state
B
[X → A • BC, t] [X → AB • C, t]

Also if B is a rule (non-terminal), then for every rule B ® D add an ε-


transition and a new state for every token u in FIRST(Ct) i.e add a transition
iff the string following B is derivable from Ct.

[X → A • BC, t]
ε [B → • D, u]

We include the initial item for the auxiliary rule i.e. [X’ ® •X, $] to start
the construction of the DFA, which proceeds as before.

n.dulay LR (Bottom-up) Parsing 26


Example
Grammar
0
S’➝ ● S, $
S ® id | V ‘=’ E ε
V ® id
E ® V | int S ➝ ● id, $
S ➝ ● V = E, $

Grammar Expanded
ε

V ➝ ● id, =
S’® S $
S ® id
S ® V ‘=’ E
V ® id 0
S’➝ ● S, $
E ® V S ➝ ● id, $
E ® int S ➝ ● V = E, $
V ➝ ● id, =

n.dulay LR (Bottom-up) Parsing 27


Example contd (1)
Grammar

0 S
S’® S $ S’➝ ● S, $ S’➝ S ●, $ 1
S ® id S ➝ ● id, $
S ➝ ● V = E, $ id S ➝ id ●, $
S ® V ‘=’ E V ➝ ● id, = V ➝ id ●, = 2
V ® id V
=
E ® V S ➝ V ● = E, $ S ➝ V = ● E, $
E ® int ε
3

E ➝ ● int, $
E ➝ ● V, $ 4
ε

V ➝ ● id, $

n.dulay LR (Bottom-up) Parsing 28


Example contd (2)
Grammar

0 S
S’® S $ S’➝ ● S, $ S’➝ S ●, $ 1
S ® id S ➝ ● id, $
S ➝ ● V = E, $ id S ➝ id ●, $
S ® V ‘=’ E V ➝ ● id, = V ➝ id ●, = 2
V ® id V
=
E ® V S ➝ V ● = E, $ S ➝ V = ● E, $
E ® int E ➝ ● int, $
3 E ➝ ● V, $
V ➝ ● id, $
4

n.dulay LR (Bottom-up) Parsing 29


Example Completed

0 S
S’➝ ● S, $ S’➝ S ●, $ 1
S ➝ ● id, $
S ➝ ● V = E, $ id S ➝ id ●, $
V ➝ ● id, = V ➝ id ●, = 2
V
= E
S ➝ V ● = E, $ S ➝ V = ● E, $ S ➝ V = E ●, $ 5
E ➝ ● int, $ int
3 E ➝ ● V, $ E ➝ int ●, $ 6
V ➝ ● id, $ V
4 E ➝ V ●, $
Grammar 7
id
S’® S $ V ➝ id ●, $ 8
S ® id
S ® V ‘=’ E
V ® id
E ® V
E ® int
n.dulay LR (Bottom-up) Parsing 30
Parsing Table for Example
r1: S ® id Remember for LR(1) we only perform a reduction
r2: S ® V ‘=’ E X ® A•, t if the current token is t
r3: V ® id
r4: E ® V
r5: E ® int
State ACTION GOTO
id int = $ E V S
0 s2 g3 g1
1 a
2 r3 r1
3 s4
4 s8 s6 g5 g7
5 r2
6 r5
7 r4
8 r3

n.dulay LR (Bottom-up) Parsing 31


LALR(1) Parsers

LALR(1) parsers are like LR(1) parsers but we perform an optimisation where we
merge LR(1) states that have the same LR(0) items but differ in their look-ahead
token.
An LALR item thus consists of an LR(0) item and a set of look-ahead tokens:

An LALR(1) item is a pair [ LR(0) item, {t0, .., tn} ]

Because of this merging, it is possible for LALR(1) parsing to produce spurious


reductions before an error is detected, whereas LR(1) parsing would detect the error
immediately, for example, for the grammar:
A ® ‘(‘ A ‘)’
A ® ‘+’
The input string +) will be caught immediately after shifting + in LR(1) while
LALR(1) will shift + and then do the reduction A ® + before catching the error if
the look-ahead set for A ® +• is { $, ) }

n.dulay LR (Bottom-up) Parsing 32


LALR(1) Example

In the following example, LR(1) states 2 and 3 have the same LR(0) items. We
can therefore combine them

int X ➝ E ● T, =
2
V ➝ id ●, + int
X ➝ E ● T, =,$
Y 23
V ➝ id ●, +,/
Y X ➝ E ● T, $
V ➝ id ●, / 3

LR(1) States LALR(1) State

Note: It is possible to construct the DFA of LALR(1) items directly from


the DFA of LR(0) items by using a technique known as propagating
lookaheads. For details see Aho p240 (Optional).

n.dulay LR (Bottom-up) Parsing 33


Ambiguity and Conflicts
Ambiguous grammars (i.e. where we can derive more than 1 parse tree
for some input) cannot be LR(k) for any k. An LR(k) parser will reach a
state where given the entire stack contents and the next k input tokens it
cannot decide whether to shift or reduce.
Ambiguity can be removed by rewriting the grammar, or handled by
augmenting the grammar with additional features that allow the parser to
resolve the ambiguity, e.g. by defining the associativity and precedence of
operators.

A ® id
A ® id ‘[‘ Expr ’]’ Do we shift or do we reduce? => Shift-Reduce conflict

Expr ® id Which reduce should we perform?


Var ® id => Reduce-Reduce conflict

n.dulay LR (Bottom-up) Parsing 34


Shift-Reduce Conflicts
Consider the following grammar: S ® if E then S else S
S ® if E then S
S ® id

What is the interpretation for the input: if a then if b then c else d


In a LR parsing table we will have a state with LR(0) items:
[S ® if E then S • ] and
[S ® if E then S • else S ]

If we shift our interpretation will be: if a then if b then c else d

On the other hand if we reduce our interpretation becomes:

if a then if b then c else d

LR parser generators give priority to shifting. This is analogous to matching the


longest sequence in Lexical Analysis. If this is not appropriate or we want to
remove the shift-reduce conflict, then we need to rewrite the grammar.

n.dulay LR (Bottom-up) Parsing 35


Shift-Reduce Conflicts Contd.

We can remove a shift-reduce conflict by rewriting the grammar e.g.:

Consider the grammar:


S ® if E then S else S
S ® if E then S
S ® other
E ® 0 | 1

We can rewrite this grammar as:


S ® MatchedS
MatchedS ® if E then MatchedS else MatchedS
MatchedS ® other
S ® UnMatchedS
UnMatchedS ® if E then S
UnMatchedS ® if E then MatchedS else UnMatchedS

This works by permitting only a matched statement to come before an else in an


if statement, forcing all else-parts to become matched as soon as possible.
n.dulay LR (Bottom-up) Parsing 36
Reduce-Reduce Conflicts

A reduce-reduce conflict occurs if there is a choice of reducing by 2 (or more) rules.

A common disambiguating rule used by LR parser generators for reduce-reduce


conflicts is to give priority to the earliest rule listed in the grammar source file,
alternatively we can rewrite the grammar.

Precedence. The following grammar is ambiguous because it doesn’t capture the


associatively or precedence of + and *

Expr ® Expr ‘+’ Expr | Expr ‘*’ Expr | ‘(‘ Expr ‘)’ | int

We can rewrite this unambiguously as:

Expr ® Expr ‘+’ Term | Term


Term ® Term ‘*’ Factor | Factor
Factor ® ‘(‘ Expr ‘)’ | int

n.dulay LR (Bottom-up) Parsing 37


Error Recovery (Optional Material)

On encountering an error entry, we can either stop and report a single error to the
programmer, or we can attempt to recover from the error and so allow parsing to
continue and further errors to be reported.
Many error recovery schemes exist, some attempt recovery by scanning forward in the
token stream, some modify the parse stack, others do both.

Example. YACC (Yet-Another Compiler-Compiler, Bison in GNU) supports extra


Error-Recovery Rules that tell the parser how to resynchronise on error, e.g.

Expr ® ‘(‘ error ‘)’


Expressions ® error ‘;’ Expression

The above rules specify that if a syntax error is encountered in an expression that the
parser should skip to the next ) or the next ; In YACC, error is a special terminal
symbol but has shift actions entered into the parsing table as if it were a normal
token. See Appel p79 or Aho p264 for more details.

n.dulay LR (Bottom-up) Parsing 38


Parse Trees
In a parse tree, we build leaf nodes for each input token (when we shift) and
non-leaf nodes for each rule (when we reduce).

IF

LPAREN

INT(24) Factor Term Expr

RPAREN If-Statement

IDENT("A") Statement

ELSE
Else-Part
IDENT("B") Statement

n.dulay LR (Bottom-up) Parsing 39


Abstract Syntax Trees (ASTs)
Although parse trees fully represent the syntactic structure of the input, they
hold unnecessary information and are inconvenient to work with. For example
we don’t normally care about keywords and bracketing symbols in the later
stages (semantic analysis, code generation) of the compiler. An Abstract
Syntax Tree (AST) is a refined version of the parse tree that retains only the
information that is useful for later stages of the compiler, or for error reporting
or at runtime. For example the AST for an if statement might be:

INT(24)

IDENT("A") If-Statement

IDENT("B")

n.dulay LR (Bottom-up) Parsing 40


Abstract Syntax Trees Contd.
ASTs can be constructed either with a separate pass, or by attaching AST
construction code directly to grammar rules, e.g.

Statement ® print ‘(‘ Expr:E ‘)’ | return new PrintNode (E)


Statement ® ident:D ‘=‘ Expr:E | return new AssignNode(D,E)
Statement ® if ‘(‘ Expr:E ’)’ Statement:S | return new IfNode(E, S, null)
Statement ® if ‘(‘ Expr:E ‘)’ Statement:S1 else Statement:S2
| return new IfNode (E, S1, S2)
Statement ® Statement:S1 ‘;’ Statement:S2 | return new StatListNode (S1, S2)
Expr ® Expr:E Op:O Term:T | return new ExprNode (E, O, T)
Expr ® Term:T | return T
Term ® ‘(‘ Expr:E ‘)’ | return E
Term ® ident:D | return new IdNode (D.stringvalue)
Term ® int:N | return new IntNode (N.intvalue)
Op ® ‘+’ | return new ArithOpNode (‘+’)
Op ® ‘-’ | return new ArithOpNode (‘-’)

n.dulay LR (Bottom-up) Parsing 41


Summary

Bottom-up parsing is a very powerful technique for recognising the syntax


of programming languages. Although more complicated to understand than
top-down parsers, it is popular since LR grammars are usually simpler and
more natural than LL grammars, and LR parser generators can produce
efficient parsers with good error reporting.

For more details on bottom up parsing and further exercises see


[Cooper – Chapter 3]
[Appel – Chapter 3]
[Aho – Chapter 4]

n.dulay LR (Bottom-up) Parsing 42


Flex-Bison (C++) Example
(OPTIONAL MATERIAL)

n.dulay LR (Bottom-up) Parsing 43


Grammar

program ® statement_sequence

statement_sequence ® statement | statement_sequence ';' statement

statement ® if_statement | while_statement | assign_statement

assign_statement ® identifier ':=' expression

if_statement ® 'if' expression 'then' statement_sequence 'end'

while_statement ® 'while' expression 'do' statement_sequence 'end’

expression ® expression '+' expression


| expression '-' expression
| expression '*' expression
| expression '/' expression
| '(' expression ')’
| identifier
| number

n.dulay LR (Bottom-up) Parsing 44


scanner.l

%{
#include <string>
#include "astnode.h"
#include "parser.hpp"
%}

digit [0-9]
number {digit}+
letter [a-zA-Z]
identifier {letter}+
whitespace [ \t\n]+
comment //[^\n]*\n

• For this example we’ll use C++ code with our bison grammar.
astnode.h will hold C++ class definitions for our AST nodes.
• parser.hpp is the include file generated by bison-c++.

n.dulay LR (Bottom-up) Parsing 45


scanner.l

%%

if return IF;
then return THEN;
else return ELSE;
while return WHILE;
do return DO;
":=" return BECOMES;
"=" return EQUALS;
"<" return LESSTHAN;
"+" return PLUS;
"-" return MINUS;
"(" return LPAREN;
")" return RPAREN;
";" return SEMICOLON;

n.dulay LR (Bottom-up) Parsing 46


scanner.l

{number} yylval.string = new std::string(yytext,yyleng);


return INTEGER;
{identifier} yylval.string = new std::string(yytext,yyleng);
return IDENTIFIER;
{whitespace} { }
. return ERROR;

• To keep our flex code very simple, token attributes (e.g. for numbers) will be
generated in bison. We’ll just pass a C++ string.

n.dulay LR (Bottom-up) Parsing 47


parser.y

%{

#include "astnode.h"
#include <cstdio>
#include <cstdlib>

extern int yylex(); /* Lexical analyser generated by flex */

void yyerror(const char *message) {


std::printf("Error: %s\n", message);
std::exit(1);
}

StatSeq *ast; /* Pointer to root of Abstract Syntax Tree */

%}

n.dulay LR (Bottom-up) Parsing 48


parser.y

%union {
int token;
std::string *string;
Identifier *id;
Expression *expression;
Statement *statement;
StatSeq *statseq;
ASTnode *node;
}

%token <token> IF THEN ELSE WHILE DO READ WRITE


%token <token> BECOMES EQUALS LESSTHAN PLUS MINUS
%token <token> LPAREN RPAREN SEMICOLON
%token <string> INTEGER ERROR
%token <id> IDENTIFIER

n.dulay LR (Bottom-up) Parsing 49


parser.y

%type <id> identifier


%type <statseq> program statement_sequence
%type <statement> statement begin_statement assign_statement
%type <statement> if_statement while_statement
%type <expression> expression number
%type <token> comparator

/* Associativity of operators */
%left PLUS MINUS

/* Start symbol. Will default to first non-terminal symbol */


%start program

n.dulay LR (Bottom-up) Parsing 50


parser.y

program:
statement_sequence
{ ast = $1; }
;

statement_sequence:
statement
{ $$ = new StatSeq();
$$->statements.push_back($1);
}
| statement_sequence SEMICOLON statement
{ $1->statements.push_back($3);
}
;

statement:
if_statement { $$ = $1; }
| while_statement { $$ = $1; }
| assign_statement { $$ = $1; }
;
n.dulay LR (Bottom-up) Parsing 51
parser.y

if_statement:
IF expression THEN statement_sequence END
{ $$ = new IfStatement(*$2, *$4); }
;

while_statement:
WHILE expression DO statement_sequence END
{ $$ = new WhileStatement(*$2, *$4); }
;

assign_statement:
identifier BECOMES expression
{ $$ = new Assignment(*$1, *$3); }
;

n.dulay LR (Bottom-up) Parsing 52


parser.y

expression:
identifier
{ $$ = $1; }
| number
{ $$ = $1; }
| expression PLUS expression
{ $$ = new Operator(*$1, $2, *$3); }
| expression MINUS expression
{ $$ = new Operator(*$1, $2, *$3); }
| expression comparator expression
{ $$ = new Operator(*$1, $2, *$3); }
| LPAREN expression RPAREN
{ $$ = $2; }
;

n.dulay LR (Bottom-up) Parsing 53


parser.y

identifier:
IDENTIFIER
{ $$ = new Identifier(*$1);
delete $1;
}
;

number:
INTEGER
{ $$ = new Number(atol($1->c_str()));
delete $1;
}
;

comparator:
EQUALS
| LESSTHAN
;

n.dulay LR (Bottom-up) Parsing 54


astnode.h

#include <vector>
#include <iostream>

class ASTnode {
public:
virtual ~ASTnode() {}
virtual void semantic_check () {}
};

class Expression : public ASTnode {};


typedef std::vector<Expression*> ExpressionList;

class Statement : public ASTnode {};


typedef std::vector<Statement*> StatementList;

class StatSeq: public Statement {


public:
StatementList statements;
StatSeq () {}
};
n.dulay LR (Bottom-up) Parsing 55
astnode.h

class IfStatement : public Statement {


public:
Expression& expr;
StatSeq& thenS;

IfStatement(Expression& expr, StatSeq& thenS)


: expr(expr), thenS(thenS) {}
};

class WhileStatement : public Statement {


public:
Expression& expr;
StatSeq& doS;

WhileStatement(Expression& expr, StatSeq& doS)


: expr(expr), doS(doS) {}
};

n.dulay LR (Bottom-up) Parsing 56


astnode.h

class Assignment : public Statement {


public:
Identifier& var;
Expression& expr;
Assignment(Identifier& var, Expression& expr)
: var(var), expr(expr) {}
};

class Number : public Expression {


public:
int value;
Number(int value) : value(value) {}
};

class Identifier : public Expression {


public:
std::string id;
Identifier(const std::string& id) : id(id) {}
};

n.dulay LR (Bottom-up) Parsing 57


main.cpp

#include <iostream>
#include "astnode.h"

using namespace std;

extern int yyparse();

extern StatSeq* ast;

int main(int argc, char **argv) {


yyparse();
return 0;
}

n.dulay LR (Bottom-up) Parsing 58


makefile

all: parser
OBJS = scanner.o parser.o main.o
CPPFLAGS =

parser.cpp: parser.y
bison -d –v -o $@ $^

parser.hpp: parser.cpp

scanner.cpp: scanner.l parser.hpp


flex -o $@ $^

%.o: %.cpp
g++ -c $(CPPFLAGS) -o $@ $<

parser: $(OBJS)
g++ -o $@ $(OBJS)

n.dulay LR (Bottom-up) Parsing 59

You might also like