Formal Methods of Describing Syntax - PPL
Formal language-generation mechanisms, usually called grammars, are commonly used
to describe the syntax of programming languages.
In the middle to late 1950s, two men, Noam Chomsky and John Backus, in unrelated research
efforts, developed the same syntax description formalism, which subsequently became the most
widely used method for describing programming language syntax.
Context-Free Grammars
Two of Chomsky's grammar classes, named context-free and regular, turned out to be useful
for describing the syntax of programming languages.
The forms of the tokens of programming languages can be described by regular
grammars.
The syntax of whole programming languages, with minor exceptions, can be
described by context-free grammars.
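For instance, the form of an identifier token might be captured by a regular-style rule such as (an illustrative sketch, not one of the numbered examples)
identifier → letter ( letter | digit )*
whereas nested constructs such as parenthesized expressions require a context-free rule like <expr> → ( <expr> ), which no regular grammar can express.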
A simple Java assignment statement, for example, can be represented by the abstraction
<assign>, whose actual definition might be given by the rule
<assign> → <var> = <expression>
The text on the left side of the arrow, which is aptly called the left-hand side (LHS), is
the abstraction being defined.
The text to the right of the arrow is the definition of the LHS. It is called the
right-hand side (RHS) and consists of some mixture of tokens, lexemes, and references to
other abstractions.
Altogether, the definition is called a rule, or production.
In the example rule just given, the abstractions <var> and <expression> must be defined
for the <assign> definition to be useful.
This particular rule specifies that the abstraction <assign> is defined as an instance of the
abstraction <var>, followed by the lexeme =, followed by an instance of the abstraction
<expression>.
The abstractions in a BNF description, or grammar, are often called nonterminal
symbols, or simply nonterminals, and the lexemes and tokens of the rules are called
terminal symbols, or simply terminals.
A BNF description, or grammar, is a collection of rules.
Nonterminal symbols can have two or more distinct definitions, representing two or
more possible syntactic forms in the language.
Multiple definitions can be written as a single rule, with the different definitions
separated by the symbol |, meaning logical OR.
For example, a Java if statement can be described with the rules
<if_stmt> → if ( <logic_expr> ) <stmt>
<if_stmt> → if ( <logic_expr> ) <stmt> else <stmt>
or, using the | symbol, as the single rule
<if_stmt> → if ( <logic_expr> ) <stmt>
          | if ( <logic_expr> ) <stmt> else <stmt>
Describing Lists
Variable-length lists, such as lists of identifiers, are described in BNF with recursion, as in the rules
<ident_list> → identifier
             | identifier, <ident_list>
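Applying the recursive rule repeatedly generates a list of any length; for example, a three-identifier list derives as:
<ident_list> => identifier, <ident_list>
             => identifier, identifier, <ident_list>
             => identifier, identifier, identifier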
The language described by the grammar of Example 3.1 has only one statement form:
assignment.
A program consists of the special word begin, followed by a list of statements
separated by semicolons, followed by the special word end.
An expression is either a single variable or two variables separated by either a + or -
operator.
The only variable names in this language are A, B, and C.
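Putting these statements together, the rules of Example 3.1 (reconstructed here from the description above, using the conventional nonterminal names) are:
<program> → begin <stmt_list> end
<stmt_list> → <stmt>
            | <stmt> ; <stmt_list>
<stmt> → <var> = <expression>
<var> → A | B | C
<expression> → <var> + <var>
             | <var> - <var>
             | <var>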
This derivation, like all derivations, begins with the start symbol, in this case
<program>.
The symbol => is read “derives.”
Each successive string in the sequence is derived from the previous string by
replacing one of the nonterminals with one of that nonterminal's definitions.
Each of the strings in the derivation, including <program>, is called a
sentential form.
In this derivation, the replaced nonterminal is always the leftmost nonterminal
in the previous sentential form.
Derivations that use this order of replacement are called leftmost derivations.
The derivation continues until the sentential form contains no nonterminals.
That sentential form, consisting of only terminals, or lexemes, is the generated
sentence.
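As an illustration (a sketch using the rules reconstructed above), a leftmost derivation of the one-statement program begin A = B + C end is:
<program> => begin <stmt_list> end
          => begin <stmt> end
          => begin <var> = <expression> end
          => begin A = <expression> end
          => begin A = <var> + <var> end
          => begin A = B + <var> end
          => begin A = B + C end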
Example 3.2 is another example of a grammar for part of a typical programming language.
<assign> → <id> = <expr>
<id> → A | B | C
<expr> → <id> + <expr>
       | <id> * <expr>
       | ( <expr> )
       | <id>
The grammar of Example 3.2 describes assignment statements whose right sides are
arithmetic expressions with multiplication and addition operators and parentheses. For
example, the statement
A = B * ( A + C )
is generated by the following leftmost derivation:
<assign> => <id> = <expr>
=> A = <expr>
=> A = <id> * <expr>
=> A = B * <expr>
=> A = B * ( <expr> )
=> A = B * ( <id> + <expr> )
=> A = B * ( A + <expr> )
=> A = B * ( A + <id> )
=> A = B * ( A + C )
Parse Trees
One of the most attractive features of grammars is that they naturally describe the hierarchical
syntactic structure of the sentences of the languages they define. These hierarchical structures are
called parse trees.
For example, the parse tree in Figure 3.1 shows the structure of the assignment statement
derived previously.
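A sketch of that structure (drawn here as an indented outline rather than the figure's graphical tree) for the statement A = B * ( A + C ):
<assign>
├── <id>  (A)
├── =
└── <expr>
    ├── <id>  (B)
    ├── *
    └── <expr>
        ├── (
        ├── <expr>
        │   ├── <id>  (A)
        │   ├── +
        │   └── <expr>
        │       └── <id>  (C)
        └── )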
Every internal node of a parse tree is labeled with a nonterminal symbol;
every leaf is labeled with a terminal symbol.
Every subtree of a parse tree describes one instance of an abstraction in the sentence.
Ambiguity
A grammar that generates a sentential form for which there are two or more distinct parse
trees is said to be ambiguous. Consider the grammar shown in Example 3.3, which is a
minor variation of the grammar shown in Example 3.2.
<assign> → <id> = <expr>
<id> → A | B | C
<expr> → <expr> + <expr>
       | <expr> * <expr>
       | ( <expr> )
       | <id>
The grammar of Example 3.3 is ambiguous because the sentence
A = B + C * A
has two distinct parse trees.
The ambiguity occurs because the grammar specifies slightly less syntactic structure than does
the grammar of Example 3.2.
Rather than allowing the parse tree of an expression to grow only on the right, this grammar
allows growth on both the left and the right.
The grammar thus produces two distinct parse trees for this sentence, and the compiler cannot determine from the grammar alone which one is intended.
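Schematically, the expression parts of the two trees are as follows. One tree groups the expression as B + (C * A):
      +
     / \
    B   *
       / \
      C   A
The other groups it as (B + C) * A:
      *
     / \
    +   A
   / \
  B   C
If, for instance, A = 3, B = 1, and C = 2, the first tree yields 1 + (2 * 3) = 7 while the second yields (1 + 2) * 3 = 9, so the two trees assign different meanings to the same sentence.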
Syntactic ambiguity of language structures is a problem because compilers often base the
semantics of those structures on their syntactic form.
Specifically, the compiler chooses the code to be generated for a statement by examining
its parse tree. If a language structure has more than one parse tree, then the meaning of the
structure cannot be determined uniquely.
There are two other characteristics of a grammar that are useful in determining
whether it is ambiguous: a grammar is ambiguous
(1) if it generates a sentence with more than one leftmost derivation, or
(2) if it generates a sentence with more than one rightmost derivation.
Some parsing algorithms can be based on ambiguous grammars. When such a parser
encounters an ambiguous construct, it uses nongrammatical information provided by the
designer to construct the correct parse tree.
Operator Precedence
In some cases, an ambiguous grammar yields conflicting parse trees: one tree suggests
one precedence order for the operators and another tree suggests a different one.
To resolve this, we design a grammar that ensures consistent precedence regardless of the
order of operators in the expression.
Identify the operators in the language and assign precedence levels to them. In this
case, let's assume that * has higher precedence than +.
Instead of using a single nonterminal (<expr>) for both operands of both + and *,
introduce three nonterminals to represent operands: <expr>, <term>, and <factor>.
Define production rules that reflect the desired precedence order. For instance:
<expr> generates only + operators, with <term> as the right operand.
<term> generates * operators, with <factor> as its right operand.
Ensure that the grammar forces different operators to different levels in the parse
tree. This is achieved by defining the production rules such that * operators
always appear lower in the parse tree than + operators, as the rules sketched below show.
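Following this recipe, the rules of Example 3.4 can be reconstructed (the derivations that follow confirm each rule) as:
<assign> → <id> = <expr>
<id> → A | B | C
<expr> → <expr> + <term>
       | <term>
<term> → <term> * <factor>
       | <factor>
<factor> → ( <expr> )
         | <id>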
The grammar in Example 3.4 generates the same language as the grammars of Examples 3.2 and
3.3, but it is unambiguous and it specifies the usual precedence order of multiplication and
addition operators. The following derivation of the sentence A = B + C * A uses the grammar of
Example 3.4:
<assign> => <id> = <expr>
=> A = <expr>
=> A = <expr> + <term>
=> A = <term> + <term>
=> A = <factor> + <term>
=> A = <id> + <term>
=> A = B + <term>
=> A = B + <term> * <factor>
=> A = B + <factor> * <factor>
=> A = B + <id> * <factor>
=> A = B + C * <factor>
=> A = B + C * <id>
=> A = B + C * A
The unique parse tree for this sentence, using the grammar of Example 3.4, is shown in Figure
3.3.
The connection between parse trees and derivations is very close: Either can easily
be constructed from the other.
Every sentence generated by an unambiguous grammar has a unique parse tree, although
that tree can be represented by different derivations.
For example, the following derivation of the sentence A = B + C * A is different from the
derivation of the same sentence given previously. This is a rightmost derivation, whereas
the previous one is leftmost. Both of these derivations, however, are represented by the
same parse tree.
<assign> => <id> = <expr>
=> <id> = <expr> + <term>
=> <id> = <expr> + <term> * <factor>
=> <id> = <expr> + <term> * <id>
=> <id> = <expr> + <term> * A
=> <id> = <expr> + <factor> * A
=> <id> = <expr> + <id> * A
=> <id> = <expr> + C * A
=> <id> = <term> + C * A
=> <id> = <factor> + C * A
=> <id> = <id> + C * A
=> <id> = B + C * A
=> A = B + C * A
Associativity of Operators
Left Associativity: Left associativity means that when operators have the same precedence, they
are evaluated from left to right. This is typical for addition and multiplication in mathematics. In
the context of parsing, left associativity is implied when the left addition operator is lower than
the right addition operator in the parse tree.
This means that in expressions like "A + B + C" or "D * E * F", the operations are performed
from left to right, regardless of their position within the expression.
Parsing the Expression: When parsing this expression, the parse tree will reflect left
associativity. The parse tree might look like this:
      +
     / \
    +   C
   / \
  A   B
Right Associativity: Right associativity, on the other hand, implies that operators are evaluated
from right to left. This is often the case for the exponentiation operator, where A ** B ** C is
interpreted as A ** (B ** C). Right recursion in the grammar rules can be used to specify right
associativity.
Parsing the Expression: When parsing this expression, right associativity dictates
that the operation closest to the rightmost operand is evaluated first. The parse
tree reflects this by placing the rightmost operation (B ** C) lower in the tree
than the operation to its left, so the whole expression groups as A ** (B ** C).
The parse tree might look like this:
      **
     /  \
    A    **
        /  \
       B    C
Consider the sentence
A = B + C + A
The parse tree for this sentence, as defined with the grammar of Example 3.4, is shown in Figure 3.4.
When a grammar rule has its LHS also appearing at the beginning of its RHS, the
rule is said to be left recursive.
This left recursion specifies left associativity. For example, the left recursion of
the rules of the grammar of Example 3.4 causes it to make both addition and
multiplication left associative.
In most languages that provide it, the exponentiation operator is right associative.
A grammar rule is right recursive if the LHS appears at the right end of the
RHS. Rules such as
<factor> → <exp> ** <factor>
         | <exp>
<exp> → ( <expr> )
      | id
could be used to describe exponentiation as a right-associative operator.
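For example, a leftmost derivation of id ** id ** id with these rules makes the grouping explicit; because the recursive <factor> sits to the right of the ** in the rule, the second operator becomes the right operand of the first:
<factor> => <exp> ** <factor>
         => id ** <factor>
         => id ** <exp> ** <factor>
         => id ** id ** <factor>
         => id ** id ** <exp>
         => id ** id ** id
The resulting parse tree groups the expression as id ** ( id ** id ), that is, right-associatively.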