Compiler Design Study Material Unit 1
Compiler:-
A compiler translates the code written in one language to some other language
without changing the meaning of the program. It is also expected that a compiler
should make the target code efficient and optimized in terms of time and space.
Types of linker:-
Dynamic Linker:- Many OS environments allow dynamic linking, that is,
postponing the resolution of some undefined symbols or library functions until
the program is run.
Static Linker:- Static linking is the result of the linker copying all library
routines used in the program into the executable image. This may require more
disk space and memory than dynamic linking.
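For example, on a typical Linux system the gcc driver links dynamically by default, while a flag requests static linking (an illustrative sketch; exact behaviour varies by toolchain):
gcc main.c -o prog            (dynamic: library symbols resolved at load/run time)
gcc -static main.c -o prog    (static: library routines copied into the executable)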
Loader:-
Loader is a part of the operating system and is responsible for loading executable
files into memory and executing them. It calculates the size of a program
(instructions and data) and creates memory space for it. It also initializes various
registers to initiate execution.
Phases of Compiler:-
Example: float position, initial, rate;
position = initial + rate * 60
Lexical Analysis:-
This phase scans the source code as a stream of characters and converts it into
meaningful lexemes. Lexical analyzer represents these lexemes in the form of
tokens.
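For the running example, the lexical analyzer would produce a token stream along these lines, where each ⟨id, n⟩ refers to an entry in the symbol table (the exact notation is illustrative):
position = initial + rate * 60  →  ⟨id,1⟩ ⟨=⟩ ⟨id,2⟩ ⟨+⟩ ⟨id,3⟩ ⟨*⟩ ⟨60⟩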
Syntax Analysis:-
It takes the token produced by lexical analysis as input and generates a parse tree
(or syntax tree). In this phase, token arrangements are checked against the source
code grammar, i.e. the parser checks if the expression made by the tokens is
syntactically correct.
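For the running example, the parser builds a syntax tree in which * binds more tightly than +, sketched here as indented text:
=
├─ ⟨id,1⟩ (position)
└─ +
   ├─ ⟨id,2⟩ (initial)
   └─ *
      ├─ ⟨id,3⟩ (rate)
      └─ 60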
Semantic Analysis:-
Semantic analysis checks whether the parse tree constructed follows the rules of
the language. For example, it checks that values are assigned between compatible
data types, and reports errors such as adding a string to an integer. Also, the
semantic analyzer keeps track of identifiers, their types and expressions, and
whether identifiers are declared before use. The semantic analyzer produces an
annotated syntax tree as an output.
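In the running example, position, initial and rate are declared float while 60 is an integer, so the semantic analyzer annotates the tree with a type conversion, conceptually: position = initial + rate * inttofloat(60).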
Intermediate Code Generation:-
After semantic analysis the compiler generates an intermediate code of the source
program for the target machine. It represents a program for some abstract machine
and lies in between the high-level language and the machine language. The
intermediate code should be generated in such a way that it is easy to translate
into the target machine code.
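A common intermediate form is three-address code; for the running example it could look like this (the temporaries t1–t3 are illustrative):
t1 = inttofloat(60)
t2 = id3 * t1
t3 = id2 + t2
id1 = t3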
Code Optimization:-
The next phase performs code optimization on the intermediate code. Optimization
removes unnecessary code lines and rearranges the sequence of statements in order
to speed up the program execution without wasting resources (CPU, memory).
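In the running example, the optimizer can perform the conversion of 60 at compile time and eliminate the redundant temporary t3:
t1 = id3 * 60.0
id1 = id2 + t1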
Code Generation:-
In this phase, the code generator takes the optimized representation of the
intermediate code and maps it to the target machine language. The code generator
translates the intermediate code into a sequence of (generally) relocatable
machine code. This sequence of machine instructions performs the same task as
the intermediate code would.
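For the running example, the generated target code might look like the following (the mnemonics and registers are illustrative, not a particular real instruction set):
MOVF id3, R2
MULF #60.0, R2
MOVF id2, R1
ADDF R2, R1
MOVF R1, id1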
Symbol Table:-
It is a data-structure maintained throughout all the phases of a compiler. All the
identifier's names along with their types are stored here. The symbol table makes
it easier for the compiler to quickly search the identifier record and retrieve it. The
symbol table is also used for scope management.
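A minimal sketch of what a symbol-table entry and lookup could look like in C (the field names, sizes and linear search are illustrative simplifications; real compilers typically use hash tables):
#include <string.h>

struct symbol {
    char name[32];   /* identifier name                 */
    char type[16];   /* e.g. "float", "int"             */
    int  scope;      /* scope level of the declaration  */
};

static struct symbol table[100];
static int n_symbols = 0;

/* Return the index of an identifier's record, or -1 if undeclared. */
int lookup(const char *name)
{
    int i;
    for (i = 0; i < n_symbols; i++)
        if (strcmp(table[i].name, name) == 0)
            return i;
    return -1;
}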
Analysis Phase:-
The analysis phase of the compiler reads the source program, divides it into core
parts and then checks for lexical, syntax and semantic errors. The analysis phase
generates an intermediate representation of the source program.
Synthesis Phase:-
The synthesis phase generates the target program with the help of intermediate
source code representation and symbol table.
A compiler can have many phases and passes.
Pass : A pass refers to the traversal of a compiler through the entire
program.
Phase : A phase of a compiler is a distinguishable stage, which takes input
from the previous stage, processes and yields output that can be used as
input for the next stage. A pass can have more than one phase.
Compiler Vs Interpreter:-
• A compiler scans and translates the entire program in one go, before execution;
an interpreter translates and executes the program one statement at a time.
• A compiler reports the errors of the whole program after scanning it; an
interpreter stops at the first error it encounters.
• Compiled programs generally run faster; interpreted programs start running
sooner and are easier to debug, but execute more slowly.
Question:-
A Pascal translator written in the C language that takes Pascal code and
produces C as output. Create a Pascal translator in C++ for the same.
Sol:- (In the original notes the solution was worked out with T-diagrams,
which are not reproduced here.)
Finite Automata:-
A finite automaton is a state machine that takes a string of symbols as input and
changes its state accordingly. Finite automata act as recognizers for regular
expressions. When a string is fed into a finite automaton, it changes its state for
each literal. If the input string is successfully processed and the automaton
reaches a final state, the string is accepted, i.e., the string just fed is said to be a
valid token of the language in hand.
The mathematical model of a finite automaton consists of:
• A finite set of states (Q)
• A finite set of input symbols (Σ)
• A transition function (δ) that maps a state and an input symbol to a state
• A start state (q0)
• A set of final (accepting) states (F)
The basic cases for building an NFA from a regular expression R (Thompson's
construction; the original notes showed the NFA fragment for each case as a
diagram):
Case 1: R = a (a single transition on a from the start state to the final state)
Case 2: R = b (likewise, a single transition on b)
Case 3: R = a + b, also written R = a | b (the fragments for a and b placed in
parallel, joined by ε-transitions)
Case 4: R = ab (the fragments for a and b placed in series)
Case 5: R = a* (ε-transitions allow the fragment for a to be skipped entirely or
repeated any number of times)
Case 6: R = a+ (like a*, except the fragment for a must be traversed at least once)
Conversion of NFA to DFA (Subset Construction):-
Step 1:- Add ε-closure(s0) as an unmarked state to the set of DFA states.
Step 2:- while (there is an unmarked state s1 in the DFA)
begin
    mark s1
    for each input symbol a do
    begin
        s2 ← ε-closure(move(s1, a))
        if (s2 is not in the DFA)
            add s2 to the DFA as an unmarked state
        Dtrans[s1, a] ← s2
    end
end
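Since ε-closure(T) drives the whole algorithm, here is a small self-contained C sketch of it, using a bitmask as the state set and a stack for the worklist (the four-state NFA and its ε-moves below are invented purely for illustration):
#include <stdio.h>

#define NSTATES 4

/* eps[s] is a bitmask of the states reachable from s on one ε-move. */
static unsigned eps[NSTATES] = {
    /* 0 */ (1u << 1) | (1u << 2),   /* 0 --ε--> 1, 0 --ε--> 2 */
    /* 1 */ 0,
    /* 2 */ (1u << 3),               /* 2 --ε--> 3 */
    /* 3 */ 0,
};

/* ε-closure(T): every state reachable from T using only ε-moves. */
unsigned eps_closure(unsigned T)
{
    unsigned closure = T;
    int stack[NSTATES], top = 0, s, u;

    for (s = 0; s < NSTATES; s++)          /* push all states of T */
        if (T & (1u << s)) stack[top++] = s;

    while (top > 0) {
        s = stack[--top];
        for (u = 0; u < NSTATES; u++)
            if ((eps[s] & (1u << u)) && !(closure & (1u << u))) {
                closure |= 1u << u;        /* newly reached state  */
                stack[top++] = u;
            }
    }
    return closure;
}

int main(void)
{
    unsigned c = eps_closure(1u << 0);     /* ε-closure({0}) */
    int s;
    for (s = 0; s < NSTATES; s++)
        if (c & (1u << s)) printf("%d ", s);  /* prints: 0 1 2 3 */
    printf("\n");
    return 0;
}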
The lexical analyzer scans the characters of the source program one at a time to
discover tokens. Often, however, many characters beyond the next token may
have to be examined before the next token itself can be determined. For this and
other reasons, it is desirable for the lexical analyzer to read its input from an input
buffer. The figure (omitted here) shows a buffer divided into two halves of, say,
100 characters each. One pointer marks the beginning of the token being
discovered. A look-ahead pointer scans ahead of the beginning point, until the
token is discovered. We view the position of each pointer as being between the
character last read and the character next to be read. In practice each buffering
scheme adopts one convention: either a pointer is at the symbol last read, or at the
symbol it is ready to read.
The distance which the look-ahead pointer may have to travel past the actual
token may be large. For example, in a PL/I program we may see:
DECLARE (ARG1, ARG2, ..., ARGn)
without knowing whether DECLARE is a keyword or an array name until we see
the character that follows the right parenthesis.
Ex:-
abc = pqr*xyz
1. A buffer is divided into two halves, each holding N characters of input. One
pointer marks the beginning of the token, while another pointer, the look-ahead
pointer, scans ahead of the beginning pointer until the token is discovered.
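A minimal C sketch of this two-buffer scheme, using a sentinel character at the end of each half so the look-ahead pointer can detect half boundaries with a single test (the half size, file name and helper names are illustrative; real lexers use a distinct eof sentinel rather than '\0'):
#include <stdio.h>

#define N 8                         /* small half size for demonstration */

static char buf[2 * N + 2];         /* each half ends in a sentinel slot */
static char *forward;               /* look-ahead pointer */
static FILE *src;

static void load_half(char *half)   /* refill one half and set sentinel  */
{
    size_t got = fread(half, 1, N, src);
    half[got] = '\0';               /* '\0' acts as the eof sentinel     */
}

static int next_char(void)
{
    if (*forward == '\0') {                       /* hit a sentinel      */
        if (forward == buf + N) {                 /* end of first half   */
            load_half(buf + N + 1);
            forward = buf + N + 1;
        } else if (forward == buf + 2 * N + 1) {  /* end of second half  */
            load_half(buf);
            forward = buf;
        } else {
            return EOF;                           /* real end of input   */
        }
        if (*forward == '\0') return EOF;         /* refill was empty    */
    }
    return (unsigned char)*forward++;
}

int main(void)
{
    int c;
    src = fopen("input.txt", "r");
    if (!src) return 1;
    load_half(buf);
    forward = buf;
    while ((c = next_char()) != EOF)
        putchar(c);                               /* echo the input      */
    fclose(src);
    return 0;
}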
Pattern: A set of strings in the input for which the same token is produced as
output. This set of strings is described by a rule called a pattern associated with the
token.
Formal Grammar:
• Formal grammar is a set of rules. It is used to identify correct or incorrect
strings of tokens in a language. The formal grammar is represented as G.
• Formal grammar is used to generate all possible strings over the alphabet
that are syntactically correct in the language.
• Formal grammar is used mostly in the syntactic analysis phase (parsing),
particularly during compilation.
Formal grammar G is written as follows:
G = <V, N, P, S>
Where:
V describes a finite set of terminal symbols.
N describes a finite set of non-terminal symbols.
P describes a set of production rules.
S is the start symbol.
Example:
S → a A |b
A→aA|a
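For instance, the string aaa can be derived from this grammar as S ⇒ aA ⇒ aaA ⇒ aaa, and the string b directly as S ⇒ b.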
• Backus Normal Form (also called Backus-Naur Form, BNF) is a notation
technique for context-free grammars.
• It is often used to describe the syntax of languages used in computing, such as
computer programming languages, document formats and so on.
• BNF notations are applied wherever exact descriptions of languages are needed.
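A small illustrative BNF fragment for arithmetic expressions (the rule names are our own):
<expr>   ::= <expr> + <term> | <term>
<term>   ::= <term> * <factor> | <factor>
<factor> ::= ( <expr> ) | <digit>
<digit>  ::= 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9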
Lex:-
The Lex compiler is a tool that allows one to specify a lexical analyzer from
regular expressions.
Structure of Lex Program:-
%{
Declarations/Definition Section
%}
%%
Rule Section
%%
User Defined Section
Definition Section:-
The definition section creates an environment for the lexer in the form of C code.
This area of the lex specification is delimited by “%{” and “%}”. It contains C
statements such as global declarations, commands, library includes and other
declarations, which are copied into the lexical analyzer when the specification is
passed through the lex tool.
%{
#include<stdio.h>
%}
Rule Section:-
It contains the pattern and actions that specify the lex specification. A pattern is in
the form of a regular expression to match the string.
Once a pattern is matched, the corresponding action part is invoked. The action
part contains normal C language statements.
Actions are enclosed within braces “{” and “}”; if there is more than one
statement, the braces group the component statements into a single block.
%%
[a] {printf(“Alphabet a”);}
%%
User Defined Section:-
This section contains any valid C code. Lex copies the contents of this section
into the generated lexical analyzer.
Lex itself does not produce an executable program; instead, it translates the lex
specification into a file containing a C subroutine called yylex.
All the rules in the rule section are automatically converted into C statements by
the lex tool and placed inside the function yylex(). This is why we can call the
function yylex() in the main function even though we have not defined it
anywhere in the program.
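Putting the three sections together, a minimal complete Lex specification along these lines would be (the token names printed here are our own choice):
%{
#include <stdio.h>
%}

%%
[a-zA-Z][a-zA-Z0-9]*   { printf("IDENTIFIER: %s\n", yytext); }
[0-9]+                 { printf("NUMBER: %s\n", yytext); }
[ \t\n]                { /* skip whitespace */ }
.                      { printf("OTHER: %s\n", yytext); }
%%

int yywrap(void) { return 1; }   /* no further input files */

int main(void)
{
    yylex();    /* run the generated scanner on stdin */
    return 0;
}
Running lex on this file produces lex.yy.c, which is then compiled with a C compiler to obtain the scanner.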
Ambiguity:-
• A grammar can have more than one parse tree generating a given string of
terminals. Such a grammar is said to be ambiguous.
• To show that a grammar is ambiguous all we need to do is find a terminal
string that is the yield of more than one parse tree.
• Since a string with more than one parse tree usually has more than one
meaning, we need to design unambiguous grammars for compiling
applications, or to use ambiguous grammars with additional rules to resolve
the ambiguities.
Example: Let us consider a grammar G with the production rule
E→I
E→E+E
E→E*E
E → (E)
I → ε | 0 | 1 | 2 | ... | 9
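For example, the string 1 + 2 * 3 has two distinct parse trees under this grammar, one per choice of the production applied at the root:
E ⇒ E + E ⇒ ... ⇒ 1 + (2 * 3)   (the * grouped first)
E ⇒ E * E ⇒ ... ⇒ (1 + 2) * 3   (the + grouped first)
The parentheses only indicate the grouping; since one terminal string has two parse trees, G is ambiguous.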