Bottom-up Parsing
Introduction:
In computer science, LR parsers are a type of bottom-up parser that efficiently
handles deterministic context-free languages in guaranteed linear time. The LALR parsers and
the SLR parsers are common variants of LR parsers. LR parsers are often mechanically
generated from a formal grammar for the language by a parser generator tool. They are very
widely used for the processing of computer languages, more than other kinds of generated
parsers.
• The name LR is an initialism. The L means that the parser reads input text in one direction without backing up; that direction is typically Left to right within each line, and top to bottom across the lines of the full input file. (This is true for most parsers.)
• The R means that the parser produces a reversed Rightmost derivation; it does
a bottom-up parse, not a top-down LL parse or ad-hoc parse.
• The name LR is often followed by a numeric qualifier, as in LR(1) or sometimes LR(k). To avoid backtracking or guessing, the LR parser is allowed to peek ahead at k lookahead input symbols before deciding how to parse earlier symbols. Typically k is 1 and is not mentioned.
• The name LR is often preceded by other qualifiers, as in SLR and LALR.
• LR parsers are deterministic; they produce a single correct parse without guesswork or backtracking, in linear time. This is ideal for computer languages. But LR parsers are not suited for human languages, which need more flexible but slower methods.
• Other parser methods (CYK algorithm, Earley parser, and GLR parser) that backtrack or yield multiple parses may take O(n²), O(n³), or even exponential time when they guess badly.
• The above properties of L, R, and k are actually shared by all shift-reduce parsers, including precedence parsers. But by convention, the LR name stands for the form of parsing invented by Donald Knuth, and excludes the earlier, less powerful precedence parsers.
• LR parsers can handle a larger range of languages and grammars than precedence
parsers or top-down LL parsing. This is because the LR parser waits until it has seen
an entire instance of some grammar pattern before committing to what it has found.
• An LL parser has to decide or guess what it is seeing much sooner, when it has only
seen the leftmost input symbol of that pattern. LR is also better at error reporting. It
detects syntax errors as early in the input stream as possible.
Example:
• An LR parser scans and parses the input text in one forward pass over the text. The parser builds up the parse tree incrementally, bottom up, and left to right, without guessing or backtracking.
• At every point in this pass, the parser has accumulated a list of subtrees or phrases of the input text that have been already parsed.
• Those subtrees are not yet joined together because the parser has not yet reached the right end of the syntax pattern that will combine them.
• At step 6 in the example parse, only "A*2" has been parsed, incompletely. Only the shaded lower-left corner of the parse tree exists. None of the parse tree nodes numbered 7 and above exist yet.
• Nodes 3, 4, and 6 are the roots of isolated subtrees for variable A, operator *, and number 2, respectively. These three root nodes are temporarily held in a parse stack. The remaining unparsed portion of the input stream is "+ 1". (Please refer to the image below.)
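To make the idea of a parse stack of isolated subtrees concrete, the state just described can be sketched in Python; the tuple representation and labels below are illustrative assumptions, not taken from the figure:

# Each parsed phrase is a small tree: (root_symbol, children).
parse_stack = [
    ("A", []),   # node 3: root of the subtree for variable A
    ("*", []),   # node 4: root of the subtree for the operator *
    ("2", []),   # node 6: root of the subtree for the number 2
]
unscanned = "+ 1"   # the remaining, not yet parsed, portion of the input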
Shift and reduce actions:
As with other shift-reduce parsers, an LR parser works by doing some combination of Shift steps and Reduce steps.
• A Shift step advances in the input stream by one symbol. That shifted symbol becomes a new single-node parse tree.
• A Reduce step applies a completed grammar rule to some of the recent parse trees, joining them together as one tree with a new root symbol.
If the input has no syntax errors, the parser continues with these steps until all of the input has been consumed and all of the parse trees have been reduced to a single tree representing an entire legal input.
LR parsers differ from other shift-reduce parsers in how they decide when to reduce, and how to pick between rules with similar endings. But the final decisions and the sequence of shift or reduce steps are the same. Much of the LR parser's efficiency is from being deterministic.
To avoid guessing, the LR parser often looks ahead (rightwards) at the next scanned symbol before deciding what to do with previously scanned symbols. The lexical scanner works one or more symbols ahead of the parser. The lookahead symbols are the 'right-hand context' for the parsing decision.
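As an illustration only, the effect of the two step kinds on a stack of trees can be sketched in Python roughly as follows; the (symbol, children) tuple representation is an assumption made for this sketch, not something the parser prescribes:

# A parse-tree node is sketched here as a (symbol, children) tuple.
def shift(stack, symbol):
    # Shift: the scanned symbol becomes a new single-node parse tree.
    stack.append((symbol, []))

def reduce_by(stack, lhs, rhs_len):
    # Reduce: join the top rhs_len trees under a new root labelled lhs.
    children = stack[len(stack) - rhs_len:]
    del stack[len(stack) - rhs_len:]
    stack.append((lhs, children))

stack = []
shift(stack, "id")               # stack now holds one single-node tree
reduce_by(stack, "Value", 1)     # apply Value <- id; the id tree becomes a child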
Example:
• At every parse step, the entire input text is divided into parse stack, current lookahead
symbol, and remaining unscanned text.
• The parser's next action is determined by the rightmost stack symbol(s) and the
lookahead symbol.
• The action is read from a table containing all syntactically valid combinations of stack
and lookahead symbols.
Step | Parse Stack                  | Look Ahead | Unscanned | Parser Action
1    | id                           | =          | B + C*2   | Shift
2    | id =                         | id         | + C*2     | Shift
3    | id = id                      | +          | C*2       | Reduce by Value ← id
4    | id = Value                   | +          | C*2       | Reduce by Products ← Value
5    | id = Products                | +          | C*2       | Reduce by Sums ← Products
6    | id = Sums                    | +          | C*2       | Shift
7    | id = Sums +                  | id         | *2        | Shift
8    | id = Sums + id               | *          | 2         | Reduce by Value ← id
9    | id = Sums + Value            | *          | 2         | Reduce by Products ← Value
10   | id = Sums + Products         | *          | 2         | Shift
11   | id = Sums + Products *       | int        | eof       | Shift
12   | id = Sums + Products * int   | eof        |           | Reduce by Value ← int
13   | id = Sums + Products * Value | eof        |           | Reduce by Products ← Products * Value
Grammar Examples:
A grammar is the set of patterns or syntax rules for the input language. It doesn't cover all
language rules, such as the size of numbers, or the consistent use of names and their
definitions in the context of the whole program. Shift-reduce parsers use a context-free
grammar that deals just with local patterns of symbols.
The example grammar used here is a tiny subset of the Java or C language:
Assign ← id = Sums
Sums ← Sums + Products
Sums ← Products
Products ← Products * Value
Products ← Value
Value ← int
Value ← id
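For concreteness, the same seven rules can also be written down as plain data, for example in Python; this encoding is purely illustrative and is not the input format of any particular parser generator:

# Each rule is (left-hand-side nonterminal, list of right-hand-side symbols).
RULES = [
    ("Assign",   ["id", "=", "Sums"]),
    ("Sums",     ["Sums", "+", "Products"]),
    ("Sums",     ["Products"]),
    ("Products", ["Products", "*", "Value"]),
    ("Products", ["Value"]),
    ("Value",    ["int"]),
    ("Value",    ["id"]),
]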
• The grammar's terminal symbols are the multi-character symbols or 'tokens' found in the input stream by a lexical scanner. Here these include = + * and int for any integer constant, and id for any identifier name.
• The grammar doesn't care what the int values or id spellings are, nor does it care about blanks or line breaks. The grammar uses these terminal symbols but does not define them. They are always at the bottom bushy end of the parse tree.
• The capitalized terms like Sums are nonterminal symbols. These are names for concepts or patterns in the language. They are defined in the grammar and never occur themselves in the input stream. They are always above the bottom of the parse tree.
• They only appear as a result of the parser applying some grammar rule. Some nonterminals are defined with two or more rules; these are alternative patterns. Rules can refer back to themselves.
• This grammar uses recursive rules to handle repeated math operators. Grammars for complete languages use recursive rules to handle lists, parenthesized expressions, and nested statements.
• Any given computer language can be described by several different grammars.
• A table-driven parser has all of its knowledge about the grammar encoded into
unchanging data called parser tables. The parser's program code is a simple generic
loop that applies unchanged to many grammars and languages.
• The tables may be worked out by hand for precedence methods. For LR methods, the
complex tables are mechanically derived from a grammar by some parser
generator tool like Bison.
• The parser tables are usually much larger than the grammar. In other parsers that are not table-driven, such as recursive descent, each language construct is parsed by a different subroutine, specialized to the syntax of that one construct.
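For contrast, here is a rough sketch of that recursive-descent style for the expression part of the example grammar, written in Python purely for illustration; the function names are assumptions, and the left-recursive rules are rewritten as loops, as recursive descent requires:

def parse_sums(toks, pos):
    # Sums <- Products ('+' Products)*   -- one subroutine per construct
    node, pos = parse_products(toks, pos)
    while pos < len(toks) and toks[pos] == "+":
        right, pos = parse_products(toks, pos + 1)
        node = ("+", node, right)
    return node, pos

def parse_products(toks, pos):
    # Products <- Value ('*' Value)*
    node, pos = parse_value(toks, pos)
    while pos < len(toks) and toks[pos] == "*":
        right, pos = parse_value(toks, pos + 1)
        node = ("*", node, right)
    return node, pos

def parse_value(toks, pos):
    # Value <- id | int ; here a token is simply returned as a leaf
    return toks[pos], pos + 1

# parse_sums(["B", "+", "C", "*", "2"], 0) returns (('+', 'B', ('*', 'C', '2')), 5)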
LR and LALR Parser
The LR parser is a non-recursive, shift-reduce, bottom-up parser. It uses a wide class of context-free grammars, which makes it the most efficient syntax analysis technique. LR parsers are also known as LR(k) parsers, where L stands for left-to-right scanning of the input stream, R stands for the construction of a rightmost derivation in reverse, and k denotes the number of lookahead symbols used to make decisions.
There are three widely used algorithms available for constructing an LR parser:
• SLR(1) – Simple LR parser
• LR(1) – Canonical LR parser
• LALR(1) – Look-ahead LR parser
LR Parsing Algorithm
Here we describe a skeleton algorithm of an LR parser:
token = next_token()

repeat forever
    s = top of stack

    if action[s, token] = "shift si" then
        PUSH token
        PUSH si
        token = next_token()

    else if action[s, token] = "reduce A ::= β" then
        POP 2 * |β| symbols
        s = top of stack
        PUSH A
        PUSH goto[s, A]

    else if action[s, token] = "accept" then
        return

    else
        error()
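Below is a small runnable sketch of the same skeleton in Python. To keep the ACTION and GOTO tables small enough to write and check by hand, it uses a tiny stand-in grammar, S → ( S ) | x, rather than the Assign/Sums/Products grammar above; for a realistic grammar the tables would be produced by a generator such as Yacc or Bison. All names here are illustrative assumptions, not part of any tool's API.

# Stand-in grammar, with numbered rules:   1: S -> ( S )    2: S -> x
GRAMMAR = {
    1: ("S", ["(", "S", ")"]),
    2: ("S", ["x"]),
}

# ACTION[state][lookahead] is ("shift", next_state), ("reduce", rule_no) or ("accept",).
ACTION = {
    0: {"(": ("shift", 2), "x": ("shift", 3)},
    1: {"$": ("accept",)},
    2: {"(": ("shift", 2), "x": ("shift", 3)},
    3: {")": ("reduce", 2), "$": ("reduce", 2)},
    4: {")": ("shift", 5)},
    5: {")": ("reduce", 1), "$": ("reduce", 1)},
}

# GOTO[state][nonterminal] gives the state entered after a reduction.
GOTO = {0: {"S": 1}, 2: {"S": 4}}

def lr_parse(tokens):
    tokens = list(tokens) + ["$"]   # "$" marks end of input
    pos = 0
    stack = [0]                     # stack of states; the skeleton above also pushes
                                    # symbols (hence POP 2 * |β|), here we keep states
                                    # only, so a reduce pops |β| entries
    while True:
        state, token = stack[-1], tokens[pos]
        act = ACTION.get(state, {}).get(token)
        if act is None:
            raise SyntaxError(f"unexpected {token!r} in state {state}")
        if act[0] == "shift":       # consume the token and enter the new state
            stack.append(act[1])
            pos += 1
        elif act[0] == "reduce":    # apply the rule, then follow GOTO on its left side
            lhs, rhs = GRAMMAR[act[1]]
            del stack[len(stack) - len(rhs):]
            stack.append(GOTO[stack[-1]][lhs])
            print("reduce by", lhs, "<-", " ".join(rhs))
        else:                       # accept: the whole input matched the start symbol
            return True

lr_parse("((x))")   # prints three reductions, then returns True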
LL vs. LR

LL: Starts with the root nonterminal on the stack.
LR: Ends with the root nonterminal on the stack.

LL: Uses the stack for designating what is still to be expected.
LR: Uses the stack for designating what is already seen.

LL: Builds the parse tree top-down.
LR: Builds the parse tree bottom-up.

LL: Continuously pops a nonterminal off the stack, and pushes the corresponding right-hand side.
LR: Tries to recognize a right-hand side on the stack, pops it, and pushes the corresponding nonterminal.

LL: Reads the terminals when it pops one off the stack.
LR: Reads the terminals while it pushes them onto the stack.

LL: Pre-order traversal of the parse tree.
LR: Post-order traversal of the parse tree.
Error Recovery in Parsing:
A parser should be able to detect and report any error in the program. It is expected that when an error is encountered, the parser should be able to handle it and carry on parsing the rest of the input. The parser is mostly expected to check for syntax errors, but errors may be encountered at various stages of the compilation process. A program may have the following kinds of errors at various stages: lexical errors (such as a misspelled identifier name), syntactical errors (such as a missing semicolon or unbalanced parentheses), semantic errors (such as an incompatible value assignment), and logical errors (such as unreachable code or an infinite loop).
There are four common error-recovery strategies that can be implemented in the parser to deal with errors in the code.
Panic Mode
When a parser encounters an error anywhere in the statement, it ignores the rest of the statement by not processing input from the erroneous input up to a delimiter, such as a semicolon. This is the easiest way of error recovery, and it also prevents the parser from developing infinite loops.
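A minimal sketch of this strategy in Python, assuming the input is already a list of tokens and that ';' is the statement delimiter (both are assumptions made for illustration):

def panic_mode_recover(tokens, pos, delimiter=";"):
    # After a syntax error at position pos, discard tokens up to and including
    # the next delimiter, so parsing can resume at the start of the next statement.
    while pos < len(tokens) and tokens[pos] != delimiter:
        pos += 1
    return min(pos + 1, len(tokens))   # index just past the delimiter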
Statement Mode
When a parser encounters an error, it tries to take corrective measures so that the rest of the input in the statement allows the parser to parse ahead. For example, inserting a missing
semicolon or replacing a comma with a semicolon, etc. Parser designers have to be careful here because one wrong correction may lead to an infinite loop.
Error Productions
Some common errors that may occur in the code are known to the compiler designers. In addition, the designers can create an augmented grammar to be used, with productions that generate the erroneous constructs when these errors are encountered.
Global correction
The parser considers the program in hand as a whole and tries to figure out what the program is intended to do, and tries to find out a closest match for it which is error-free. When an erroneous input (statement) X is fed, it creates a parse tree for some closest error-free statement Y. This may allow the parser to make minimal changes in the source code, but due to the complexity (time and space) of this strategy, it has not been implemented in practice yet.
Abstract Syntax Trees
Parse tree representations are not easy for the compiler to work with, as they contain more details than are actually needed. Take the following parse tree as an example:
If we look closely, we find that most of the leaf nodes are the single child of their parent nodes. This information can be eliminated before feeding the tree to the next phase. By hiding the extra information, we can obtain a tree as shown below:
The abstract tree can be represented more simply; ASTs are more compact than a parse tree and can be easily used by a compiler.
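For illustration, such an abstract tree could be represented in code roughly as follows; the class names are assumptions for this sketch, not a representation defined by the text:

from dataclasses import dataclass

@dataclass
class Num:        # leaf: an integer literal
    value: int

@dataclass
class Var:        # leaf: an identifier
    name: str

@dataclass
class BinOp:      # interior node: a binary operator with two operands
    op: str
    left: object
    right: object

# AST for the expression B + C*2 from the earlier example; the chains of
# single-child nodes that a full parse tree would contain are gone.
ast = BinOp("+", Var("B"), BinOp("*", Var("C"), Num(2)))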
YACC:
• YACC is an acronym for "Yet Another Compiler Compiler" and is a computer program for the Unix operating system. It is an LALR parser generator: it generates a parser, the part of a compiler that tries to make syntactic sense of the source code, based on an analytic grammar written in a notation similar to BNF.
• It was originally developed in the early 1970s by Stephen C. Johnson at AT&T Corporation and written in the B programming language, but soon rewritten in C. It appeared as part of Version 3 Unix, and a full description of Yacc was published in 1975.
• The input to Yacc is a grammar with snippets of C code (called "actions") attached to its rules. Its output is a shift-reduce parser in C that executes the C snippets associated with each rule as soon as the rule is recognized.
• Typical actions involve the construction of parse trees. Using an example from Johnson, if the call node(label, left, right) constructs a binary parse tree node with the specified label and children, then a rule such as expr : expr '+' expr { $$ = node('+', $1, $3); } recognizes summation expressions and constructs nodes for them. The special identifiers $$, $1 and $3 refer to items on the parser's stack.
• Yacc and similar programs (largely reimplementations) have been very popular. Yacc
itself used to be available as the default parser generator on most Unix systems,
though it has since been supplanted as the default by more recent, largely compatible,
programs such as Berkeley Yacc, GNU bison, MKS Yacc and Abraxas PCYACC.
• The parser generated by Yacc requires a lexical analyzer to turn the raw input into a stream of tokens; lexical analysis therefore forms a first stage, which is then followed by the parsing stage proper.
• Lexical analyzer generators, such as Lex or Flex, are widely available. The IEEE POSIX P1003.2 standard defines the functionality and requirements for both Lex and Yacc.
• Some versions of AT&T Yacc have become open source. For example, source code (for different implementations) is available with the standard distributions of Plan 9 and OpenSolaris.