Formal Methods of Describing Syntax - PPL

Formal methods of describing syntax:

Formal language-generation mechanisms, usually called grammars, are commonly used to describe the syntax of programming languages.

Backus-Naur Form and Context-Free Grammars

In the middle to late 1950s, two men, Noam Chomsky and John Backus, working in unrelated research
efforts, developed the same syntax description formalism, which subsequently became the most
widely used method for describing programming language syntax.

Context-Free Grammars

 Chomsky described four classes of generative devices, or grammars, that define four classes of languages.
 Two of these grammar classes, named context-free and regular, turned out to be useful for describing the syntax of programming languages.
 The forms of the tokens of programming languages can be described by regular grammars.
 The syntax of whole programming languages, with minor exceptions, can be described by context-free grammars (a small example follows this list).
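
As an illustrative sketch (not part of the original notes), the form of an identifier token, a letter followed by letters or digits, can be described by a regular grammar such as

<id> → letter | letter <id_tail>
<id_tail> → letter | digit | letter <id_tail> | digit <id_tail>

whereas arbitrarily nested constructs, such as parenthesized expressions, require a context-free rule of the form <expr> → ( <expr> ), which no regular grammar can express.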

Origins of Backus-Naur Form

 ALGOL 58 was presented by John Backus, a prominent member of the ACM-GAMM group, at an international conference in 1959 (Backus, 1959).
 This paper introduced a new formal notation for specifying programming language
syntax.
 The new notation was later modified slightly by Peter Naur for the description of ALGOL 60 (Naur, 1960).
 This revised method of syntax description became known as Backus-Naur Form, or simply BNF.
 BNF is a natural notation for describing syntax.
 In fact, something similar to BNF was used by Panini to describe the syntax of Sanskrit
several hundred years before Christ.
 Although the use of BNF in the ALGOL 60 report was not immediately accepted by computer users, it soon became, and still is, the most popular method of concisely describing programming language syntax.
 BNF is nearly identical to Chomsky’s generative devices for context-free languages,
called context-free grammars.
Fundamentals

 A metalanguage is a language that is used to describe another language.
 BNF is a metalanguage for programming languages.
 BNF uses abstractions for syntactic structures.

A simple Java assignment statement, for example, can be represented by the abstraction
<assign>

(pointed brackets are often used to delimit names of abstractions).

 The actual definition of <assign> can be given by

<assign> → <var> = <expression>

 The text on the left side of the arrow, which is aptly called the left-hand side (LHS), is
the abstraction being defined.
 The text to the right of the arrow is the definition of the LHS. It is called the right-
hand side (RHS) and consists of some mixture of tokens, lexemes, and references to
other abstractions.
 Altogether, the definition is called a rule, or production.

In the example rule just given, the abstractions <var> and <expression> must be defined
for the <assign> definition to be useful.

 This particular rule specifies that the abstraction <assign> is defined as an instance of the
abstraction <var>, followed by the lexeme =, followed by an instance of the abstraction
<expression>.

One example sentence whose syntactic structure is described by the rule is

total = subtotal1 + subtotal2

 The abstractions in a BNF description, or grammar, are often called nonterminal symbols, or simply nonterminals, and the lexemes and tokens of the rules are called terminal symbols, or simply terminals.
 A BNF description, or grammar, is a collection of rules.
 Nonterminal symbols can have two or more distinct definitions, representing two or more possible syntactic forms in the language.
 Multiple definitions can be written as a single rule, with the different definitions
separated by the symbol |, meaning logical OR.
For example, a Java if statement can be described with the rules

<if_stmt> → if ( <logic_expr> ) <stmt>


<if_stmt> → if ( <logic_expr> ) <stmt> else <stmt>
or with the rule
<if_stmt> → if ( <logic_expr> ) <stmt>
| if ( <logic_expr> ) <stmt> else <stmt>
In these rules, <stmt> represents either a single statement or a compound statement.
 Although BNF is simple, it is sufficiently powerful to describe nearly all of the syntax of
programming languages.

Describing Lists

 Variable-length lists in mathematics are often written using an ellipsis (. . .); 1, 2, . . . is an example.
 BNF does not include the ellipsis, so an alternative method is required for describing lists
of syntactic elements in programming languages (for example, a list of identifiers
appearing on a data declaration statement).
 For BNF, the alternative is recursion.
 A rule is recursive if its LHS appears in its RHS.
The following rules illustrate how recursion is used to describe lists:

<ident_list> → identifier

| identifier, <ident_list>

This defines <ident_list> as either a single token (identifier) or an identifier followed by a comma and another instance of <ident_list>.

 Recursion is used to describe lists (a small parsing sketch follows).
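
The connection between a recursive rule and a recursive procedure can be made concrete. The following is a minimal sketch (not from the original notes) of a recursive-descent recognizer for the <ident_list> rule above; it assumes the input has already been broken into classified tokens such as "id" and ",".

def parse_ident_list(tokens, pos=0):
    # <ident_list> -> identifier | identifier , <ident_list>
    if pos >= len(tokens) or tokens[pos] != "id":
        raise SyntaxError("identifier expected")
    pos += 1                                        # consume the identifier
    if pos < len(tokens) and tokens[pos] == ",":
        return parse_ident_list(tokens, pos + 1)    # recurse for the rest of the list
    return pos                                      # position just past the list

parse_ident_list(["id", ",", "id", ",", "id"])      # accepts a three-identifier list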

Grammars and Derivations


 A grammar is a generative device for defining languages.
 The sentences of the language are generated through a sequence of applications of
the rules, beginning with a special nonterminal of the grammar called the start symbol.
 This sequence of rule applications is called a derivation.
 In a grammar for a complete programming language, the start symbol represents a
complete program and is often named <program>.
The simple grammar shown in Example 3.1 is used to illustrate derivations.

 The language described by the grammar of Example 3.1 has only one statement form:
assignment.
 A program consists of the special word begin, followed by a list of statements
separated by semicolons, followed by the special word end.
 An expression is either a single variable or two variables separated by either a + or -
operator.
 The only variable names in this language are A,B, and C.
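
A grammar with this shape (a reconstruction of Example 3.1 from the description above, not necessarily the textbook's exact wording) is:

<program> → begin <stmt_list> end
<stmt_list> → <stmt> | <stmt> ; <stmt_list>
<stmt> → <var> = <expression>
<var> → A | B | C
<expression> → <var> + <var> | <var> - <var> | <var>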

A derivation of a program in this language follows:
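For instance, using the reconstructed grammar above, one leftmost derivation of the program begin A = B + C end is:

<program> => begin <stmt_list> end
=> begin <stmt> end
=> begin <var> = <expression> end
=> begin A = <expression> end
=> begin A = <var> + <var> end
=> begin A = B + <var> end
=> begin A = B + C end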

 This derivation, like all derivations, begins with the start symbol, in this case
<program>.
 The symbol => is read “derives.”
 Each successive string in the sequence is derived from the previous string by replacing one of the nonterminals with one of that nonterminal's definitions.
 Each of the strings in the derivation, including <program>, is called a sentential form.
 In this derivation, the replaced nonterminal is always the leftmost nonterminal in the previous sentential form.
 Derivations that use this order of replacement are called leftmost derivations.
 The derivation continues until the sentential form contains no nonterminals.
 That sentential form, consisting of only terminals, or lexemes, is the generated
sentence.

Example 3.2 is another example of a grammar for part of a typical programming language.

EXAMPLE 3.2 A Grammar for Simple Assignment Statements

<assign> → <id> = <expr>

<id> → A | B | C

<expr> → <id> + <expr>

| <id> * <expr>

| ( <expr> )

| <id>

 The grammar of Example 3.2 describes assignment statements whose right sides are
arithmetic expressions with multiplication and addition operators and parentheses. For
example, the statement

A = B * ( A + C ) is generated by the leftmost derivation:

<assign> => <id> = <expr>

=> A = <expr>

=> A = <id> * <expr>

=> A = B * <expr>

=> A = B * ( <expr> )
=> A = B * ( <id> + <expr> )

=> A = B * ( A + <expr> )

=> A = B * ( A + <id> )

=> A = B * ( A + C )

Parse Trees

One of the most attractive features of grammars is that they naturally describe the hierarchical syntactic structure of the sentences of the languages they define. These hierarchical structures are called parse trees.

For example, the parse tree in Figure 3.1 shows the structure of the assignment statement derived previously.
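
A sketch of that parse tree (in place of Figure 3.1), with children indented under their parent node:

<assign>
    <id> A
    =
    <expr>
        <id> B
        *
        <expr>
            (
            <expr>
                <id> A
                +
                <expr>
                    <id> C
            )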

 Every internal node of a parse tree is labeled with a nonterminal symbol; every leaf is labeled with a terminal symbol.
 Every subtree of a parse tree describes one instance of an abstraction in the sentence.
Ambiguity
 A grammar that generates a sentential form for which there are two or more distinct parse
trees is said to be ambiguous. Consider the grammar shown in Example 3.3, which is a
minor variation of the grammar shown in Example 3.2.

EXAMPLE 3.3 An Ambiguous Grammar for Simple Assignment Statements

<assign> → <id> = <expr>

<id> → A | B | C

<expr> → <expr> + <expr>

| <expr> * <expr>

| ( <expr> )

| <id>

The grammar of Example 3.3 is ambiguous because the sentence

A=B+C*A

has two distinct parse trees, as shown in Figure 3.2.
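
Sketches of the two trees (in place of Figure 3.2) follow; only the right-side <expr> subtree is shown, with children indented under their parent node.

Tree 1, grouping the right side as (B + C) * A:

<expr>
    <expr>
        <expr>
            <id> B
        +
        <expr>
            <id> C
    *
    <expr>
        <id> A

Tree 2, grouping the right side as B + (C * A):

<expr>
    <expr>
        <id> B
    +
    <expr>
        <expr>
            <id> C
        *
        <expr>
            <id> A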

The ambiguity occurs because the grammar specifies slightly less syntactic structure than does
the grammar of Example 3.2.

Rather than allowing the parse tree of an expression to grow only on the right, this grammar
allows growth on both the left and the right.
Because the grammar can produce two distinct parse trees for the same sentence, the compiler cannot determine which structure was intended.

 Syntactic ambiguity of language structures is a problem because compilers often base the
semantics of those structures on their syntactic form.
 Specifically, the compiler chooses the code to be generated for a statement by examining
its parse tree. If a language structure has more than one parse tree, then the meaning of the
structure cannot be determined uniquely.

There are other characteristics of a grammar that are sometimes useful in determining whether it is ambiguous.

They include the following:

(1) the grammar generates a sentence with more than one leftmost derivation, and

(2) the grammar generates a sentence with more than one rightmost derivation.

 Some parsing algorithms can be based on ambiguous grammars. When such a parser encounters an ambiguous construct, it uses nongrammatical information provided by the designer to construct the correct parse tree.

Operator precedence

Operator precedence determines the order in which operations are performed in an expression. For example, in the expression "x + y * z", operator precedence determines whether we do the addition first and then the multiplication, or the multiplication first and then the addition.
To do this, we use a grammar, which is like a set of rules for constructing valid expressions.
The grammar ensures that the correct order of operations is followed when evaluating an
expression.

 In some cases, there can be conflicts in the parse tree, where one tree suggests one
precedence order and another suggests a different one.
 To resolve this, we design a grammar that ensures consistent precedence regardless of the
order of operators in the expression.
 Identify the operators in the language and assign precedence levels to them. In this
case, let's assume that * has higher precedence than +.
 Instead of using a single nonterminal (<expr>) for both operands of both + and *, introduce three nonterminals to represent operands: <expr>, <term>, and <factor>.
 Define production rules that reflect the desired precedence order. For instance:
 <expr> generates only + operators, with <term> as the right operand.
 <term> generates * operators, with <factor> as its right operand.
 Ensure that the grammar forces different operators to different levels in the parse
tree. This is achieved by defining production rules such that * operators are
always lower in the parse tree compared to + operators.

For instance, consider the expression A + B * C:

 <expr> generates + operator, with <term> as the right operand.


 <term> generates * operator, with <factor> as its right operand.
 So, * is lower in the parse tree, indicating it should be evaluated first.
 Regardless of the order of appearance of operators in an expression, the grammar ensures
consistent precedence by structuring the parse tree such that operators with higher
precedence are lower in the tree.
 The resulting grammar is unambiguous and specifies a consistent precedence of
operators, ensuring that expressions are evaluated correctly according to the defined
precedence rules.
The grammar of Example 3.4 is such a grammar.
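
A grammar with this shape (a reconstruction of Example 3.4 consistent with the description above and with the derivation that follows) is:

<assign> → <id> = <expr>
<id> → A | B | C
<expr> → <expr> + <term>
       | <term>
<term> → <term> * <factor>
       | <factor>
<factor> → ( <expr> )
         | <id>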

The grammar in Example 3.4 generates the same language as the grammars of Examples 3.2 and
3.3, but it is unambiguous and it specifies the usual precedence order of multiplication and
addition operators. The following derivation of the sentence A = B + C * A uses the grammar of
Example 3.4:

<assign> => <id> = <expr>

=> A = <expr>

=> A = <expr> + <term>

=> A = <term> + <term>

=> A = <factor> + <term>

=> A = <id> + <term>

=> A = B + <term>

=> A = B + <term> * <factor>

=> A = B + <factor> * <factor>

=> A = B + <id> * <factor>

=> A = B + C * <factor>

=> A = B + C * <id>

=> A = B + C * A

The unique parse tree for this sentence, using the grammar of Example 3.4, is shown in Figure
3.3.
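
A sketch of that unique parse tree (in place of Figure 3.3), with children indented under their parent node; note that the higher-precedence * ends up lower in the tree than +:

<assign>
    <id> A
    =
    <expr>
        <expr>
            <term>
                <factor>
                    <id> B
        +
        <term>
            <term>
                <factor>
                    <id> C
            *
            <factor>
                <id> A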
 The connection between parse trees and derivations is very close: Either can easily
be constructed from the other.
 Every derivation with an unambiguous grammar has a unique parse tree, although
that tree can be represented by different derivations.
 For example, the following derivation of the sentence A = B + C * A is different from the
derivation of the same sentence given previously. This is a rightmost derivation, whereas
the previous one is leftmost. Both of these derivations, however, are represented by the
same parse tree.
<assign> => <id> = <expr>
=> <id> = <expr> + <term>
=> <id> = <expr> + <term> * <factor>
=> <id> = <expr> + <term> * <id>
=> <id> = <expr> + <term> * A
=> <id> = <expr> + <factor> * A
=> <id> = <expr> + <id> * A
=> <id> = <expr> + C * A
=> <id> = <term> + C * A
=> <id> = <factor> + C * A
=> <id> = <id> + C * A
=> <id> = B + C * A
=> A = B + C * A
Associativity of Operators:

When an expression contains multiple operators of the same precedence level, such as multiplication and division, the associativity rule specifies the order in which these operators should be evaluated. For example, in the expression A / B * C, associativity determines whether the division operation is performed before or after the multiplication operation.

Left Associativity: Left associativity means that when operators have the same precedence, they
are evaluated from left to right. This is typical for addition and multiplication in mathematics. In
the context of parsing, left associativity is implied when the left addition operator is lower than
the right addition operator in the parse tree.

This means that in expressions like "A + B + C" or "D * E * F", the operations are performed
from left to right, regardless of their position within the expression.

Consider an example expression: "A + B + C".

Parsing the Expression: When parsing this expression, the parse tree will reflect left
associativity. The parse tree might look like this:

      +
     / \
    +   C
   / \
  A   B

Right Associativity: Right associativity, on the other hand, implies that operators are evaluated
from right to left. This is often the case for the exponentiation operator, where A ** B ** C is
interpreted as A ** (B ** C). Right recursion in the grammar rules can be used to specify right
associativity.

For example expression: A ** B ** C.

Parsing the Expression: When parsing this expression, the right associativity dictates that the operation closest to the rightmost operand is evaluated first. Therefore, the parse tree reflects this by placing the rightmost operation (B ** C) lower in the tree than the leftmost **, so the expression is interpreted as A ** (B ** C). The parse tree might look like this:
      **
     /  \
    A    **
        /  \
       B    C

Now consider the sentence A = B + C + A.

The parse tree for this sentence, as defined with the grammar of Example 3.4, is shown in Figure 3.4.
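
A sketch of that parse tree (in place of Figure 3.4), with children indented under their parent node and single-step chains written on one line:

<assign>
    <id> A
    =
    <expr>
        <expr>
            <expr> <term> <factor> <id> B
            +
            <term> <factor> <id> C
        +
        <term> <factor> <id> A

The left + operator is lower in the tree than the right +, so B + C is added first; this is left associativity.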

 When a grammar rule has its LHS also appearing at the beginning of its RHS, the
rule is said to be left recursive.
 This left recursion specifies left associativity. For example, the left recursion of the rules of the grammar of Example 3.4 causes it to make both addition and multiplication left associative.
 In most languages that provide it, the exponentiation operator is right associative.

 To indicate right associativity, right recursion can be used.

 A grammar rule is right recursive if the LHS appears at the right end of the RHS. Rules such as

<factor> → <exp> ** <factor>
         | <exp>
<exp> → ( <expr> )
      | id

could be used to describe exponentiation as a right-associative operator.
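
Using these rules, the expression A ** B ** C (where A, B, and C stand for id) gets a parse tree in which the rightmost ** is lower, so the expression is interpreted as A ** (B ** C). A sketch, with children indented under their parent node:

<factor>
    <exp> id (A)
    **
    <factor>
        <exp> id (B)
        **
        <factor>
            <exp> id (C)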

EXTENDED BACKUS-NAUR FORM (EBNF)


 The extended version of BNF is called Extended BNF, or simply EBNF.
 Three extensions are commonly included in the various versions of EBNF.
 The first of these denotes an optional part of an RHS, which is delimited by
brackets.
For example, an if-else statement can be described as

<if_stmt> → if (<expression>) <statement> [else <statement>]

Without the brackets, the syntactic description of this statement would require the following two rules:
<if_stmt> → if (<expression>) <statement>
| if (<expression>) <statement> else <statement>
 The second extension is the use of braces in an RHS to indicate that the enclosed
part can be repeated indefinitely or left out altogether. This extension allows lists to
be built with a single rule, instead of using recursion and two rules.
 For example, lists of identifiers separated by commas can be described by the following
rule:
<ident_list> → <identifier> { , <identifier> }
This is a replacement of the recursion by a form of implied iteration; the part enclosed
within braces can be iterated any number of times.
 The third common extension deals with multiple-choice options. When a single
element must be chosen from a group, the options are placed in parentheses and
separated by the OR operator, |. For example,
<term> → <term> ( * | / | % ) <factor>
In BNF, a description of this would require the following three rules:
<term> → <term> * <factor>
| <term> / <factor>
| <term> % <factor>
 The brackets, braces, and parentheses in the EBNF extensions are metasymbols, which means they are notational tools and not terminal symbols in the syntactic entities they help describe.
 EXAMPLE: BNF and EBNF Versions of an Expression Grammar
BNF:
<expr> → <expr> + <term>
       | <expr> - <term>
       | <term>
<term> → <term> * <factor>
       | <term> / <factor>
       | <factor>
<factor> → <exp> ** <factor>
         | <exp>
<exp> → ( <expr> )
      | id
EBNF:
<expr> → <term> { ( + | - ) <term> }
<term> → <factor> { ( * | / ) <factor> }
<factor> → <exp> { ** <exp> }
<exp> → ( <expr> )
      | id
 The BNF rule
<expr> → <expr> + <term>
clearly specifies—in fact forces—the + operator to be left associative. However, the
EBNF version,
<expr> → <term> { + <term> }
does not imply the direction of associativity.
 This problem is overcome in a syntax analyzer based on an EBNF grammar for
expressions by designing the syntax analysis process to enforce the correct associativity.
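
A minimal sketch (not from the original notes) of how such a syntax analyzer can enforce left associativity for the EBNF rule <expr> → <term> { ( + | - ) <term> }: the parser loops over the repeated part and folds each new <term> into the tree from the left. The token handling and the trivial parse_term stub below are assumptions made only for illustration.

def parse_expr(tokens):
    # <expr> -> <term> { (+ | -) <term> }, built left-associatively
    left = parse_term(tokens)
    while tokens and tokens[0] in ("+", "-"):
        op = tokens.pop(0)            # consume the operator
        right = parse_term(tokens)
        left = (op, left, right)      # earlier operators end up deeper in the tree
    return left

def parse_term(tokens):
    # for this sketch, a <term> is just a single identifier token
    return tokens.pop(0)

print(parse_expr(["A", "+", "B", "-", "C"]))   # ('-', ('+', 'A', 'B'), 'C')

Because each new operator node takes the previously built tree as its left operand, A + B - C is grouped as (A + B) - C, which is exactly the left associativity the BNF rule <expr> → <expr> + <term> would force.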
 Some variations on BNF and EBNF have appeared. Among these are the following:
 In place of the arrow, a colon is used and the RHS is placed on the next line.
 Instead of a vertical bar to separate alternative RHSs, they are simply placed on
separate lines.
 In place of square brackets to indicate something being optional, the subscript opt is used. For example,
ConstructorDeclarator → SimpleName ( FormalParameterList_opt )
 Rather than using the | symbol in a parenthesized list of elements to indicate a
choice, the words “one of ” are used.
 For example,
AssignmentOperator → one of  = *= /= %= += -= <<= >>= &= ^= |=
