100% found this document useful (1 vote)

39 views530 pages

Let's Build A Compiler

Uploaded by

Safeer Ullah

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

100% found this document useful (1 vote)

39 views530 pages

Let's Build A Compiler

Uploaded by

Safeer Ullah

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 530

Free eBook

Let's Build A Compiler

eBooks Home Page | Virus Programming | Nostradamus

Prophecies

Let's Build A Compiler

1. Introduction
2. Expression Parsing
3. More Expressions
4. Interpreters
5. Control Constructs
6. Boolean Expressions
7. Lexical Scanning
8. A Little Philosophy
9. A Top View
10. Introduction to "TINY"
11. Lexical Scan Revisited
12. Miscellany
13. Procedures
14. Types
15. Back To The Future
16. Unit Construction

Book Review :

This book, by Jack W. Crenshaw, is good for those people, who

want their own compiler ready in about a month or so, like
students preparing for projects. This book has a pragmatic
approach towards compiler programming. It doesn't disgusts
you with the boring compiler theory, rather it teaches you by
doing it.

I am also writing a tutorial / ebook on the same topic, in which

I am explaining the code in C++. This is actually my OOPS
project in my V Semester of Engineering in Computer Science.
I will upload the whole Project ( + Project Report on this site ) .
All this is prepared, but I need to format it before uploading on
the website. If you want it before it is uploaded, please mail me
at http://amitmathur.8m.com/mailme.html?
Pls_Send_Me_your_OOPS_Project .

Rating :

This book is hosted on Amit Mathur's website : http://amitmathur.8m.com .

LET'S BUILD A COMPILER!
By
Jack W. Crenshaw, Ph.D.
24 July 1988
Copyright : Jack W. Crenshaw
This book can be distributed as it is with above information

[ eBooks Homepage ] [ This Book Homepage ] [ Discussion Forum ]

Part I: INTRODUCTION

INTRODUCTION

This series of articles is a tutorial on the theory and practice

of developing language parsers and compilers. Before we are
finished, we will have covered every aspect of compiler
construction, designed a new programming language, and built a
working compiler.

Though I am not a computer scientist by education (my Ph.D. is in

a different field, Physics), I have been interested in compilers
for many years. I have bought and tried to digest the contents
of virtually every book on the subject ever written. I don't
mind telling you that it was slow going. Compiler texts are
written for Computer Science majors, and are tough sledding for
the rest of us. But over the years a bit of it began to seep in.
What really caused it to jell was when I began to branch off on
my own and begin to try things on my own computer. Now I plan to
share with you what I have learned. At the end of this series
you will by no means be a computer scientist, nor will you know
all the esoterics of compiler theory. I intend to completely
ignore the more theoretical aspects of the subject. What you
_WILL_ know is all the practical aspects that one needs to know
to build a working system.

This is a "learn-by-doing" series. In the course of the series I

will be performing experiments on a computer. You will be
expected to follow along, repeating the experiments that I do,
and performing some on your own. I will be using Turbo Pascal
4.0 on a PC clone. I will periodically insert examples written
in TP. These will be executable code, which you will be expected
to copy into your own computer and run. If you don't have a copy
of Turbo, you will be severely limited in how well you will be
able to follow what's going on. If you don't have a copy, I urge
you to get one. After all, it's an excellent product, good for
many other uses!

Some articles on compilers show you examples, or show you (as in

the case of Small-C) a finished product, which you can then copy
and use without a whole lot of understanding of how it works. I
hope to do much more than that. I hope to teach you HOW the
things get done, so that you can go off on your own and not only
reproduce what I have done, but improve on it.

This is admittedly an ambitious undertaking, and it won't be done

in one page. I expect to do it in the course of a number of
articles. Each article will cover a single aspect of compiler
theory, and will pretty much stand alone. If all you're
interested in at a given time is one aspect, then you need to
look only at that one article. Each article will be uploaded as
it is complete, so you will have to wait for the last one before
you can consider yourself finished. Please be patient.

The average text on compiler theory covers a lot of ground that

we won't be covering here. The typical sequence is:

o An introductory chapter describing what a compiler is.

o A chapter or two on syntax equations, using Backus-Naur Form

(BNF).

o A chapter or two on lexical scanning, with emphasis on

deterministic and non-deterministic finite automata.

o Several chapters on parsing theory, beginning with top-down

recursive descent, and ending with LALR parsers.

o A chapter on intermediate languages, with emphasis on P-code

and similar reverse polish representations.

o Many chapters on alternative ways to handle subroutines and

parameter passing, type declarations, and such.

o A chapter toward the end on code generation, usually for some

imaginary CPU with a simple instruction set. Most readers
(and in fact, most college classes) never make it this far.
o A final chapter or two on optimization. This chapter often
goes unread, too.

I'll be taking a much different approach in this series. To

begin with, I won't dwell long on options. I'll be giving you
_A_ way that works. If you want to explore options, well and
good ... I encourage you to do so ... but I'll be sticking to
what I know. I also will skip over most of the theory that puts
people to sleep. Don't get me wrong: I don't belittle the
theory, and it's vitally important when it comes to dealing with
the more tricky parts of a given language. But I believe in
putting first things first. Here we'll be dealing with the 95%
of compiler techniques that don't need a lot of theory to handle.

I also will discuss only one approach to parsing: top-down,

recursive descent parsing, which is the _ONLY_ technique that's
at all amenable to hand-crafting a compiler. The other
approaches are only useful if you have a tool like YACC, and also
don't care how much memory space the final product uses.

I also take a page from the work of Ron Cain, the author of the
original Small C. Whereas almost all other compiler authors have
historically used an intermediate language like P-code and
divided the compiler into two parts (a front end that produces
P-code, and a back end that processes P-code to produce
executable object code), Ron showed us that it is a
straightforward matter to make a compiler directly produce
executable object code, in the form of assembler language
statements. The code will _NOT_ be the world's tightest code ...
producing optimized code is a much more difficult job. But it
will work, and work reasonably well. Just so that I don't leave
you with the impression that our end product will be worthless, I
_DO_ intend to show you how to "soup up" the compiler with some
optimization.

Finally, I'll be using some tricks that I've found to be most

helpful in letting me understand what's going on without wading
through a lot of boiler plate. Chief among these is the use of
single-character tokens, with no embedded spaces, for the early
design work. I figure that if I can get a parser to recognize
and deal with I-T-L, I can get it to do the same with IF-THEN-
ELSE. And I can. In the second "lesson," I'll show you just
how easy it is to extend a simple parser to handle tokens of
arbitrary length. As another trick, I completely ignore file
I/O, figuring that if I can read source from the keyboard and
output object to the screen, I can also do it from/to disk files.
Experience has proven that once a translator is working
correctly, it's a straightforward matter to redirect the I/O to
files. The last trick is that I make no attempt to do error
correction/recovery. The programs we'll be building will
RECOGNIZE errors, and will not CRASH, but they will simply stop
on the first error ... just like good ol' Turbo does. There will
be other tricks that you'll see as you go. Most of them can't be
found in any compiler textbook, but they work.

A word about style and efficiency. As you will see, I tend to

write programs in _VERY_ small, easily understood pieces. None
of the procedures we'll be working with will be more than about
15-20 lines long. I'm a fervent devotee of the KISS (Keep It
Small Simple) school of software development. I try to never
do something tricky or complex, when something simple will do.
Inefficient? Perhaps, but you'll like the results. As Brian
Kernighan has said, FIRST make it run, THEN make it run fast.
If, later on, you want to go back and tighten up the code in one
of our products, you'll be able to do so, since the code will be
quite understandable. If you do so, however, I urge you to wait
until the program is doing everything you want it to.

I also have a tendency to delay building a module until I

discover that I need it. Trying to anticipate every possible
future contingency can drive you crazy, and you'll generally
guess wrong anyway. In this modern day of screen editors and
fast compilers, I don't hesitate to change a module when I feel I
need a more powerful one. Until then, I'll write only what I
need.

One final caveat: One of the principles we'll be sticking to here

is that we don't fool around with P-code or imaginary CPUs, but
that we will start out on day one producing working, executable
object code, at least in the form of assembler language source.
However, you may not like my choice of assembler language ...
it's 68000 code, which is what works on my system (under SK*DOS).
I think you'll find, though, that the translation to any other
CPU such as the 80x86 will be quite obvious, though, so I don't
see a problem here. In fact, I hope someone out there who knows
the '86 language better than I do will offer us the equivalent
object code fragments as we need them.

THE CRADLE

Every program needs some boiler plate ... I/O routines, error
message routines, etc. The programs we develop here will be no
exceptions. I've tried to hold this stuff to an absolute
minimum, however, so that we can concentrate on the important
stuff without losing it among the trees. The code given below
represents about the minimum that we need to get anything done.
It consists of some I/O routines, an error-handling routine and a
skeleton, null main program. I call it our cradle. As we
develop other routines, we'll add them to the cradle, and add the
calls to them as we need to. Make a copy of the cradle and save
it, because we'll be using it more than once.

There are many different ways to organize the scanning activities

of a parser. In Unix systems, authors tend to use getc and
ungetc. I've had very good luck with the approach shown here,
which is to use a single, global, lookahead character. Part of
the initialization procedure (the only part, so far!) serves to
"prime the pump" by reading the first character from the input
stream. No other special techniques are required with Turbo 4.0
... each successive call to GetChar will read the next character
in the stream.

{--------------------------------------------------------------}
program Cradle;

{--------------------------------------------------------------}
{ Constant Declarations }

const TAB = ^I;

{--------------------------------------------------------------}
{ Variable Declarations }

var Look: char; { Lookahead Character }

{--------------------------------------------------------------}
{ Read New Character From Input Stream }

procedure GetChar;
begin
Read(Look);
end;

{--------------------------------------------------------------}
{ Report an Error }

procedure Error(s: string);

begin
WriteLn;
WriteLn(^G, 'Error: ', s, '.');
end;

{--------------------------------------------------------------}
{ Report Error and Halt }

procedure Abort(s: string);

begin
Error(s);
Halt;
end;

{--------------------------------------------------------------}
{ Report What Was Expected }

procedure Expected(s: string);

begin
Abort(s + ' Expected');
end;

{--------------------------------------------------------------}
{ Match a Specific Input Character }

procedure Match(x: char);

begin
if Look = x then GetChar
else Expected('''' + x + '''');
end;

{--------------------------------------------------------------}
{ Recognize an Alpha Character }

function IsAlpha(c: char): boolean;

begin
IsAlpha := upcase(c) in ['A'..'Z'];
end;

{--------------------------------------------------------------}

{ Recognize a Decimal Digit }

function IsDigit(c: char): boolean;

begin
IsDigit := c in ['0'..'9'];
end;
{--------------------------------------------------------------}
{ Get an Identifier }

function GetName: char;

begin
if not IsAlpha(Look) then Expected('Name');
GetName := UpCase(Look);
GetChar;
end;

{--------------------------------------------------------------}
{ Get a Number }

function GetNum: char;

begin
if not IsDigit(Look) then Expected('Integer');
GetNum := Look;
GetChar;
end;

{--------------------------------------------------------------}
{ Output a String with Tab }

procedure Emit(s: string);

begin
Write(TAB, s);
end;

{--------------------------------------------------------------}
{ Output a String with Tab and CRLF }

procedure EmitLn(s: string);

begin
Emit(s);
WriteLn;
end;

{--------------------------------------------------------------}
{ Initialize }

procedure Init;
begin
GetChar;
end;

{--------------------------------------------------------------}
{ Main Program }

begin
Init;
end.
{--------------------------------------------------------------}

That's it for this introduction. Copy the code above into TP and
compile it. Make sure that it compiles and runs correctly. Then
proceed to the first lesson, which is on expression parsing.

Copyright Notice of this book

GET YOUR DOMAIN:
type here .com
Get It Now >>

Free eBook
Let's Build A Compiler
Copyright Info.

LET'S BUILD A COMPILER!

By
Jack W. Crenshaw, Ph.D.
24 July 1988
Copyright (C) 1989 Jack W. Crenshaw. All rights reserved.
This book can be distributed as it is with above information

This book is hosted on Amit Mathur's website : http://amitmathur.8m.com .

GET YOUR DOMAIN:
type here .com
Get It Now >>
[ Amit Mathur Online Homepage ]

LET'S BUILD A COMPILER!

By
Jack W. Crenshaw, Ph.D.
24 July 1988
Copyright : Jack W. Crenshaw
This book can be distributed as it is with above information

[ eBooks Homepage ] [ This Book Homepage ] [ Discussion Forum ]

Part II: EXPRESSION PARSING

INTRODUCTION

GETTING STARTED

If you've read the introduction document to this series, you will

already know what we're about. You will also have copied the

cradle software into your Turbo Pascal system, and have compiled

it. So you should be ready to go.

The purpose of this article is for us to learn how to parse and

translate mathematical expressions. What we would like to see as

output is a series of assembler-language statements that perform

the desired actions. For purposes of definition, an expression

is the right-hand side of an equation, as in

x = 2*y + 3/(4*z)

In the early going, I'll be taking things in _VERY_ small steps.

That's so that the beginners among you won't get totally lost.
There are also some very good lessons to be learned early on,

that will serve us well later. For the more experienced readers:

bear with me. We'll get rolling soon enough.

SINGLE DIGITS

In keeping with the whole theme of this series (KISS, remember?),

let's start with the absolutely most simple case we can think of.

That, to me, is an expression consisting of a single digit.

Before starting to code, make sure you have a baseline copy of

the "cradle" that I gave last time. We'll be using it again for

other experiments. Then add this code:

{---------------------------------------------------------------}

{ Parse and Translate a Math Expression }

procedure Expression;

begin

EmitLn('MOVE #' + GetNum + ',D0')

end;

{---------------------------------------------------------------}

And add the line "Expression;" to the main program so that it

reads:

{---------------------------------------------------------------}

begin

Init;

Expression;

end.
{---------------------------------------------------------------}

Now run the program. Try any single-digit number as input. You

should get a single line of assembler-language output. Now try

any other character as input, and you'll see that the parser

properly reports an error.

CONGRATULATIONS! You have just written a working translator!

OK, I grant you that it's pretty limited. But don't brush it off

too lightly. This little "compiler" does, on a very limited

scale, exactly what any larger compiler does: it correctly

recognizes legal statements in the input "language" that we have

defined for it, and it produces correct, executable assembler

code, suitable for assembling into object format. Just as

importantly, it correctly recognizes statements that are NOT

legal, and gives a meaningful error message. Who could ask for

more? As we expand our parser, we'd better make sure those two

characteristics always hold true.

There are some other features of this tiny program worth

mentioning. First, you can see that we don't separate code

generation from parsing ... as soon as the parser knows what we

want done, it generates the object code directly. In a real

compiler, of course, the reads in GetChar would be from a disk

file, and the writes to another disk file, but this way is much

easier to deal with while we're experimenting.

Also note that an expression must leave a result somewhere. I've

chosen the 68000 register DO. I could have made some other

choices, but this one makes sense.

BINARY EXPRESSIONS

Now that we have that under our belt, let's branch out a bit.

Admittedly, an "expression" consisting of only one character is

not going to meet our needs for long, so let's see what we can do

to extend it. Suppose we want to handle expressions of the form:

1+2

or 4-3

or, in general, <term> +/- <term>

(That's a bit of Backus-Naur Form, or BNF.)

To do this we need a procedure that recognizes a term and leaves

its result somewhere, and another that recognizes and

distinguishes between a '+' and a '-' and generates the

appropriate code. But if Expression is going to leave its result

in DO, where should Term leave its result? Answer: the same

place. We're going to have to save the first result of Term

somewhere before we get the next one.

OK, basically what we want to do is have procedure Term do what

Expression was doing before. So just RENAME procedure Expression

as Term, and enter the following new version of Expression:

{---------------------------------------------------------------}
{ Parse and Translate an Expression }

procedure Expression;

begin

Term; mov a,getnum

EmitLn('MOVE D0,D1'); mov b,a

case Look of

'+': Add;

'-': Subtract;

else Expected('Addop');

end;

{--------------------------------------------------------------}

Next, just above Expression enter these two procedures:

{--------------------------------------------------------------}

{ Recognize and Translate an Add }

procedure Add;

begin

Match('+');

Term;

EmitLn('ADD D1,D0'); ADD

end;

{-------------------------------------------------------------}

{ Recognize and Translate a Subtract }

procedure Subtract;

begin

Match('-');

Term;

EmitLn('SUB D1,D0');

end;

{-------------------------------------------------------------}

When you're finished with that, the order of the routines should

be:

o Term (The OLD Expression)

o Add

o Subtract

o Expression

Now run the program. Try any combination you can think of of two

single digits, separated by a '+' or a '-'. You should get a

series of four assembler-language instructions out of each run.

Now try some expressions with deliberate errors in them. Does

the parser catch the errors?

Take a look at the object code generated. There are two

observations we can make. First, the code generated is NOT what

we would write ourselves. The sequence

MOVE #n,D0

MOVE D0,D1

is inefficient. If we were writing this code by hand, we would

probably just load the data directly to D1.

There is a message here: code generated by our parser is less

efficient than the code we would write by hand. Get used to it.

That's going to be true throughout this series. It's true of all

compilers to some extent. Computer scientists have devoted whole

lifetimes to the issue of code optimization, and there are indeed

things that can be done to improve the quality of code output.

Some compilers do quite well, but there is a heavy price to pay

in complexity, and it's a losing battle anyway ... there will

probably never come a time when a good assembler-language pro-

grammer can't out-program a compiler. Before this session is

over, I'll briefly mention some ways that we can do a little op-

timization, just to show you that we can indeed improve things

without too much trouble. But remember, we're here to learn, not

to see how tight we can make the object code. For now, and

really throughout this series of articles, we'll studiously

ignore optimization and concentrate on getting out code that

works.

Speaking of which: ours DOESN'T! The code is _WRONG_! As things

are working now, the subtraction process subtracts D1 (which has

the FIRST argument in it) from D0 (which has the second). That's

the wrong way, so we end up with the wrong sign for the result.

So let's fix up procedure Subtract with a sign-changer, so that

it reads

{-------------------------------------------------------------}

{ Recognize and Translate a Subtract }

procedure Subtract;
begin

Match('-');

Term;

EmitLn('SUB D1,D0'); SUB A,B

EmitLn('NEG D0'); NEG A

end;

{-------------------------------------------------------------}

Now our code is even less efficient, but at least it gives the

right answer! Unfortunately, the rules that give the meaning of

math expressions require that the terms in an expression come out

in an inconvenient order for us. Again, this is just one of

those facts of life you learn to live with. This one will come

back to haunt us when we get to division.

OK, at this point we have a parser that can recognize the sum or

difference of two digits. Earlier, we could only recognize a

single digit. But real expressions can have either form (or an

infinity of others). For kicks, go back and run the program with

the single input line '1'.

Didn't work, did it? And why should it? We just finished

telling our parser that the only kinds of expressions that are

legal are those with two terms. We must rewrite procedure

Expression to be a lot more broadminded, and this is where things

start to take the shape of a real parser.

GENERAL EXPRESSIONS

In the REAL world, an expression can consist of one or more

terms, separated by "addops" ('+' or '-'). In BNF, this is

written

<expression> ::= <term> [<addop> <term>]*

We can accomodate this definition of an expression with the

addition of a simple loop to procedure Expression:

{---------------------------------------------------------------}

{ Parse and Translate an Expression }

procedure Expression;

begin

Term;

while Look in ['+', '-'] do begin

EmitLn('MOVE D0,D1');

case Look of

'+': Add;

'-': Subtract;

else Expected('Addop');

end;

{--------------------------------------------------------------}
NOW we're getting somewhere! This version handles any number of

terms, and it only cost us two extra lines of code. As we go on,

you'll discover that this is characteristic of top-down parsers

... it only takes a few lines of code to accomodate extensions to

the language. That's what makes our incremental approach

possible. Notice, too, how well the code of procedure Expression

matches the BNF definition. That, too, is characteristic of the

method. As you get proficient in the approach, you'll find that

you can turn BNF into parser code just about as fast as you can

type!

OK, compile the new version of our parser, and give it a try. As

usual, verify that the "compiler" can handle any legal

expression, and will give a meaningful error message for an

illegal one. Neat, eh? You might note that in our test version,

any error message comes out sort of buried in whatever code had

already been generated. But remember, that's just because we are

using the CRT as our "output file" for this series of

experiments. In a production version, the two outputs would be

separated ... one to the output file, and one to the screen.

USING THE STACK

At this point I'm going to violate my rule that we don't

introduce any complexity until it's absolutely necessary, long

enough to point out a problem with the code we're generating. As

things stand now, the parser uses D0 for the "primary" register,

and D1 as a place to store the partial sum. That works fine for
now, because as long as we deal with only the "addops" '+' and

'-', any new term can be added in as soon as it is found. But in

general that isn't true. Consider, for example, the expression

1+(2-(3+(4-5)))

If we put the '1' in D1, where do we put the '2'? Since a

general expression can have any degree of complexity, we're going

to run out of registers fast!

Fortunately, there's a simple solution. Like every modern

microprocessor, the 68000 has a stack, which is the perfect place

to save a variable number of items. So instead of moving the term

in D0 to D1, let's just push it onto the stack. For the benefit

of those unfamiliar with 68000 assembler language, a push is

written

-(SP)

and a pop, (SP)+ .

So let's change the EmitLn in Expression to read:

EmitLn('MOVE D0,-(SP)'); PUSH Ax

and the two lines in Add and Subtract to

EmitLn('ADD (SP)+,D0') POP B (16 bit) / ADD AL,BL

and EmitLn('SUB (SP)+,D0'),

respectively. Now try the parser again and make sure we haven't

broken it.

Once again, the generated code is less efficient than before, but

it's a necessary step, as you'll see.

MULTIPLICATION AND DIVISION

Now let's get down to some REALLY serious business. As you all

know, there are other math operators than "addops" ...

expressions can also have multiply and divide operations. You

also know that there is an implied operator PRECEDENCE, or

hierarchy, associated with expressions, so that in an expression

2 + 3 * 4,

we know that we're supposed to multiply FIRST, then add. (See

why we needed the stack?)

In the early days of compiler technology, people used some rather

complex techniques to insure that the operator precedence rules

were obeyed. It turns out, though, that none of this is

necessary ... the rules can be accommodated quite nicely by our

top-down parsing technique. Up till now, the only form that

we've considered for a term is that of a single decimal digit.

More generally, we can define a term as a PRODUCT of FACTORS;

i.e.,

<term> ::= <factor> [ <mulop> <factor> ]*

What is a factor? For now, it's what a term used to be ... a

single digit.

Notice the symmetry: a term has the same form as an expression.

As a matter of fact, we can add to our parser with a little

judicious copying and renaming. But to avoid confusion, the

listing below is the complete set of parsing routines. (Note the

way we handle the reversal of operands in Divide.)

{---------------------------------------------------------------}

{ Parse and Translate a Math Factor }

procedure Factor;

begin

EmitLn('MOVE #' + GetNum + ',D0') MOV AL,<GETNUM>

end;

{--------------------------------------------------------------}

{ Recognize and Translate a Multiply }

procedure Multiply;

begin

Match('*'); POP BX

Factor; MOV AL,<NEXTNUM>

EmitLn('MULS (SP)+,D0'); MULS AX,BX

end;

{-------------------------------------------------------------}

{ Recognize and Translate a Divide }

procedure Divide;

begin

Match('/');

Factor;

EmitLn('MOVE (SP)+,D1');

EmitLn('DIVS D1,D0');

end;
{---------------------------------------------------------------}

{ Parse and Translate a Math Term }

procedure Term;

begin

Factor; MOV AL,<NUM>

while Look in ['*', '/'] do begin

EmitLn('MOVE D0,-(SP)'); PUSH AX

case Look of

'*': Multiply;

'/': Divide;

else Expected('Mulop');

end;

{--------------------------------------------------------------}

{ Recognize and Translate an Add }

procedure Add;

begin

Match('+');

Term;

EmitLn('ADD (SP)+,D0');

end;
{-------------------------------------------------------------}

{ Recognize and Translate a Subtract }

procedure Subtract;

begin

Match('-');

Term;

EmitLn('SUB (SP)+,D0');

EmitLn('NEG D0');

end;

{---------------------------------------------------------------}

{ Parse and Translate an Expression }

procedure Expression;

begin

Term;

while Look in ['+', '-'] do begin

EmitLn('MOVE D0,-(SP)');

case Look of

'+': Add;

'-': Subtract;

else Expected('Addop');

end;

{--------------------------------------------------------------}
Hot dog! A NEARLY functional parser/translator, in only 55 lines

of Pascal! The output is starting to look really useful, if you

continue to overlook the inefficiency, which I hope you will.

Remember, we're not trying to produce tight code here.

PARENTHESES

We can wrap up this part of the parser with the addition of

parentheses with math expressions. As you know, parentheses are

a mechanism to force a desired operator precedence. So, for

example, in the expression

2*(3+4) ,

the parentheses force the addition before the multiply. Much

more importantly, though, parentheses give us a mechanism for

defining expressions of any degree of complexity, as in

(1+2)/((3+4)+(5-6))

The key to incorporating parentheses into our parser is to

realize that no matter how complicated an expression enclosed by

parentheses may be, to the rest of the world it looks like a

simple factor. That is, one of the forms for a factor is:

<factor> ::= (<expression>)

This is where the recursion comes in. An expression can contain a

factor which contains another expression which contains a factor,

etc., ad infinitum.

Complicated or not, we can take care of this by adding just a few

lines of Pascal to procedure Factor:

{---------------------------------------------------------------}

{ Parse and Translate a Math Factor }

procedure Expression; Forward;

procedure Factor;

begin

if Look = '(' then begin

Match('(');

Expression;

Match(')');

end

else

EmitLn('MOVE #' + GetNum + ',D0');

end;

{--------------------------------------------------------------}

Note again how easily we can extend the parser, and how well the

Pascal code matches the BNF syntax.

As usual, compile the new version and make sure that it correctly

parses legal sentences, and flags illegal ones with an error

message.

UNARY MINUS

At this point, we have a parser that can handle just about any

expression, right? OK, try this input sentence:

-1

WOOPS! It doesn't work, does it? Procedure Expression expects

everything to start with an integer, so it coughs up the leading

minus sign. You'll find that +3 won't work either, nor will

something like

-(3-2) .

There are a couple of ways to fix the problem. The easiest

(although not necessarily the best) way is to stick an imaginary

leading zero in front of expressions of this type, so that -3

becomes 0-3. We can easily patch this into our existing version

of Expression:

{---------------------------------------------------------------}

{ Parse and Translate an Expression }

procedure Expression;

begin

if IsAddop(Look) then

EmitLn('CLR D0')

else

Term;

while IsAddop(Look) do begin

EmitLn('MOVE D0,-(SP)');

case Look of

'+': Add;

'-': Subtract;

else Expected('Addop');

end;
end;

end;

{--------------------------------------------------------------}

I TOLD you that making changes was easy! This time it cost us

only three new lines of Pascal. Note the new reference to

function IsAddop. Since the test for an addop appeared twice, I

chose to embed it in the new function. The form of IsAddop

should be apparent from that for IsAlpha. Here it is:

{--------------------------------------------------------------}

{ Recognize an Addop }

function IsAddop(c: char): boolean;

begin

IsAddop := c in ['+', '-'];

end;

{--------------------------------------------------------------}

OK, make these changes to the program and recompile. You should

also include IsAddop in your baseline copy of the cradle. We'll

be needing it again later. Now try the input -1 again. Wow!

The efficiency of the code is pretty poor ... six lines of code

just for loading a simple constant ... but at least it's correct.

Remember, we're not trying to replace Turbo Pascal here.

At this point we're just about finished with the structure of our

expression parser. This version of the program should correctly

parse and compile just about any expression you care to throw at
it. It's still limited in that we can only handle factors

involving single decimal digits. But I hope that by now you're

starting to get the message that we can accomodate further

extensions with just some minor changes to the parser. You

probably won't be surprised to hear that a variable or even a

function call is just another kind of a factor.

In the next session, I'll show you just how easy it is to extend

our parser to take care of these things too, and I'll also show

you just how easily we can accomodate multicharacter numbers and

variable names. So you see, we're not far at all from a truly

useful parser.

A WORD ABOUT OPTIMIZATION

Earlier in this session, I promised to give you some hints as to

how we can improve the quality of the generated code. As I said,

the production of tight code is not the main purpose of this

series of articles. But you need to at least know that we aren't

just wasting our time here ... that we can indeed modify the

parser further to make it produce better code, without throwing

away everything we've done to date. As usual, it turns out that

SOME optimization is not that difficult to do ... it simply takes

some extra code in the parser.

There are two basic approaches we can take:

o Try to fix up the code after it's generated

This is the concept of "peephole" optimization. The general

idea it that we know what combinations of instructions the

compiler is going to generate, and we also know which ones

are pretty bad (such as the code for -1, above). So all we

do is to scan the produced code, looking for those

combinations, and replacing them by better ones. It's sort

of a macro expansion, in reverse, and a fairly

straightforward exercise in pattern-matching. The only

complication, really, is that there may be a LOT of such

combinations to look for. It's called peephole optimization

simply because it only looks at a small group of instructions

at a time. Peephole optimization can have a dramatic effect

on the quality of the code, with little change to the

structure of the compiler itself. There is a price to pay,

though, in both the speed, size, and complexity of the

compiler. Looking for all those combinations calls for a lot

of IF tests, each one of which is a source of error. And, of

course, it takes time.

In the classical implementation of a peephole optimizer,

it's done as a second pass to the compiler. The output code

is written to disk, and then the optimizer reads and

processes the disk file again. As a matter of fact, you can

see that the optimizer could even be a separate PROGRAM from

the compiler proper. Since the optimizer only looks at the

code through a small "window" of instructions (hence the

name), a better implementation would be to simply buffer up a

few lines of output, and scan the buffer after each EmitLn.

o Try to generate better code in the first place

This approach calls for us to look for special cases BEFORE

we Emit them. As a trivial example, we should be able to

identify a constant zero, and Emit a CLR instead of a load,

or even do nothing at all, as in an add of zero, for example.

Closer to home, if we had chosen to recognize the unary minus

in Factor instead of in Expression, we could treat constants

like -1 as ordinary constants, rather then generating them

from positive ones. None of these things are difficult to

deal with ... they only add extra tests in the code, which is

why I haven't included them in our program. The way I see

it, once we get to the point that we have a working compiler,

generating useful code that executes, we can always go back

and tweak the thing to tighten up the code produced. That's

why there are Release 2.0's in the world.

There IS one more type of optimization worth mentioning, that

seems to promise pretty tight code without too much hassle. It's

my "invention" in the sense that I haven't seen it suggested in

print anywhere, though I have no illusions that it's original

with me.

This is to avoid such a heavy use of the stack, by making better

use of the CPU registers. Remember back when we were doing only

addition and subtraction, that we used registers D0 and D1,

rather than the stack? It worked, because with only those two

operations, the "stack" never needs more than two entries.

Well, the 68000 has eight data registers. Why not use them as a

privately managed stack? The key is to recognize that, at any

point in its processing, the parser KNOWS how many items are on

the stack, so it can indeed manage it properly. We can define a

private "stack pointer" that keeps track of which stack level

we're at, and addresses the corresponding register. Procedure

Factor, for example, would not cause data to be loaded into

register D0, but into whatever the current "top-of-stack"

register happened to be.

What we're doing in effect is to replace the CPU's RAM stack with

a locally managed stack made up of registers. For most

expressions, the stack level will never exceed eight, so we'll

get pretty good code out. Of course, we also have to deal with

those odd cases where the stack level DOES exceed eight, but

that's no problem either. We simply let the stack spill over

into the CPU stack. For levels beyond eight, the code is no

worse than what we're generating now, and for levels less than

eight, it's considerably better.

For the record, I have implemented this concept, just to make

sure it works before I mentioned it to you. It does. In

practice, it turns out that you can't really use all eight levels

... you need at least one register free to reverse the operand

order for division (sure wish the 68000 had an XTHL, like the

8080!). For expressions that include function calls, we would

also need a register reserved for them. Still, there is a nice

improvement in code size for most expressions.

So, you see, getting better code isn't that difficult, but it

does add complexity to the our translator ... complexity we can

do without at this point. For that reason, I STRONGLY suggest

that we continue to ignore efficiency issues for the rest of this

series, secure in the knowledge that we can indeed improve the

code quality without throwing away what we've done.

Next lesson, I'll show you how to deal with variables factors and

function calls. I'll also show you just how easy it is to handle

multicharacter tokens and embedded white space.

Get a domain & site:
type here .com
Get It Now >>
[ Amit Mathur Online Homepage ]

LET'S BUILD A COMPILER!

By
Jack W. Crenshaw, Ph.D.
24 July 1988
Copyright : Jack W. Crenshaw
This book can be distributed as it is with above information

[ eBooks Homepage ] [ This Book Homepage ] [ Discussion Forum ]

Part III: MORE EXPRESSIONS

INTRODUCTION

In the last installment, we examined the techniques used to parse

and translate a general math expression. We ended up with a

simple parser that could handle arbitrarily complex expressions,

with two restrictions:

o No variables were allowed, only numeric factors

o The numeric factors were limited to single digits

In this installment, we'll get rid of those restrictions. We'll

also extend what we've done to include assignment statements

function calls and. Remember, though, that the second

restriction was mainly self-imposed ... a choice of convenience

on our part, to make life easier and to let us concentrate on the

fundamental concepts. As you'll see in a bit, it's an easy

restriction to get rid of, so don't get too hung up about it.
We'll use the trick when it serves us to do so, confident that we

can discard it when we're ready to.

VARIABLES

Most expressions that we see in practice involve variables, such

b*b+4*a*c

No parser is much good without being able to deal with them.

Fortunately, it's also quite easy to do.

Remember that in our parser as it currently stands, there are two

kinds of factors allowed: integer constants and expressions

within parentheses. In BNF notation,

<factor> ::= <number> | (<expression>)

The '|' stands for "or", meaning of course that either form is a

legal form for a factor. Remember, too, that we had no trouble

knowing which was which ... the lookahead character is a left

paren '(' in one case, and a digit in the other.

It probably won't come as too much of a surprise that a variable

is just another kind of factor. So we extend the BNF above to

read:

<factor> ::= <number> | (<expression>) | <variable>

Again, there is no ambiguity: if the lookahead character is a

letter, we have a variable; if a digit, we have a number. Back

when we translated the number, we just issued code to load the

number, as immediate data, into D0. Now we do the same, only we

load a variable.

A minor complication in the code generation arises from the fact

that most 68000 operating systems, including the SK*DOS that I'm

using, require the code to be written in "position-independent"

form, which basically means that everything is PC-relative. The

format for a load in this language is

MOVE X(PC),D0

where X is, of course, the variable name. Armed with that, let's

modify the current version of Factor to read:

{---------------------------------------------------------------}

{ Parse and Translate a Math Factor }

procedure Expression; Forward;

procedure Factor;

begin

if Look = '(' then begin

Match('(');

Expression;

Match(')');

end

else if IsAlpha(Look) then

EmitLn('MOVE ' + GetName + '(PC),D0')

else

EmitLn('MOVE #' + GetNum + ',D0');

end;
{--------------------------------------------------------------}

I've remarked before how easy it is to add extensions to the

parser, because of the way it's structured. You can see that

this still holds true here. This time it cost us all of two

extra lines of code. Notice, too, how the if-else-else structure

exactly parallels the BNF syntax equation.

OK, compile and test this new version of the parser. That didn't

hurt too badly, did it?

FUNCTIONS

There is only one other common kind of factor supported by most

languages: the function call. It's really too early for us to

deal with functions well, because we haven't yet addressed the

issue of parameter passing. What's more, a "real" language would

include a mechanism to support more than one type, one of which

should be a function type. We haven't gotten there yet, either.

But I'd still like to deal with functions now for a couple of

reasons. First, it lets us finally wrap up the parser in

something very close to its final form, and second, it brings up

a new issue which is very much worth talking about.

Up till now, we've been able to write what is called a

"predictive parser." That means that at any point, we can know

by looking at the current lookahead character exactly what to do

next. That isn't the case when we add functions. Every language

has some naming rules for what constitutes a legal identifier.

For the present, ours is simply that it is one of the letters

'a'..'z'. The problem is that a variable name and a function

name obey the same rules. So how can we tell which is which?

One way is to require that they each be declared before they are

used. Pascal takes that approach. The other is that we might

require a function to be followed by a (possibly empty) parameter

list. That's the rule used in C.

Since we don't yet have a mechanism for declaring types, let's

use the C rule for now. Since we also don't have a mechanism to

deal with parameters, we can only handle empty lists, so our

function calls will have the form

x() .

Since we're not dealing with parameter lists yet, there is

nothing to do but to call the function, so we need only to issue

a BSR (call) instead of a MOVE.

Now that there are two possibilities for the "If IsAlpha" branch

of the test in Factor, let's treat them in a separate procedure.

Modify Factor to read:

{---------------------------------------------------------------}

{ Parse and Translate a Math Factor }

procedure Expression; Forward;

procedure Factor;

begin

if Look = '(' then begin

Match('(');

Expression;
Match(')');

end

else if IsAlpha(Look) then

Ident

else

EmitLn('MOVE #' + GetNum + ',D0');

end;

{--------------------------------------------------------------}

and insert before it the new procedure

{---------------------------------------------------------------}

{ Parse and Translate an Identifier }

procedure Ident;

var Name: char;

begin

Name := GetName;

getip();

if Look = '(' then begin

Match('(');

Match(')');

EmitLn('BSR ' + Name);

end

else

EmitLn('MOVE ' + Name + '(PC),D0')

end;
{---------------------------------------------------------------}

OK, compile and test this version. Does it parse all legal

expressions? Does it correctly flag badly formed ones?

The important thing to notice is that even though we no longer

have a predictive parser, there is little or no complication

added with the recursive descent approach that we're using. At

the point where Factor finds an identifier (letter), it doesn't

know whether it's a variable name or a function name, nor does it

really care. It simply passes it on to Ident and leaves it up to

that procedure to figure it out. Ident, in turn, simply tucks

away the identifier and then reads one more character to decide

which kind of identifier it's dealing with.

Keep this approach in mind. It's a very powerful concept, and it

should be used whenever you encounter an ambiguous situation

requiring further lookahead. Even if you had to look several

tokens ahead, the principle would still work.

As long as we're talking philosophy, there's another important

issue to point out: error handling. Notice that although the

parser correctly rejects (almost) every malformed expression we

can throw at it, with a meaningful error message, we haven't

really had to do much work to make that happen. In fact, in the

whole parser per se (from Ident through Expression) there are

only two calls to the error routine, Expected. Even those aren't
necessary ... if you'll look again in Term and Expression, you'll

see that those statements can't be reached. I put them in early

on as a bit of insurance, but they're no longer needed. Why

don't you delete them now?

So how did we get this nice error handling virtually for free?

It's simply that I've carefully avoided reading a character

directly using GetChar. Instead, I've relied on the error

handling in GetName, GetNum, and Match to do all the error

checking for me. Astute readers will notice that some of the

calls to Match (for example, the ones in Add and Subtract) are

also unnecessary ... we already know what the character is by the

time we get there ... but it maintains a certain symmetry to

leave them in, and the general rule to always use Match instead

of GetChar is a good one.

I mentioned an "almost" above. There is a case where our error

handling leaves a bit to be desired. So far we haven't told our

parser what and end-of-line looks like, or what to do with

embedded white space. So a space character (or any other

character not part of the recognized character set) simply causes

the parser to terminate, ignoring the unrecognized characters.

It could be argued that this is reasonable behavior at this

point. In a "real" compiler, there is usually another statement

following the one we're working on, so any characters not treated

as part of our expression will either be used for or rejected as

part of the next one.

But it's also a very easy thing to fix up, even if it's only
temporary. All we have to do is assert that the expression

should end with an end-of-line , i.e., a carriage return.

To see what I'm talking about, try the input line

1+2 <space> 3+4

See how the space was treated as a terminator? Now, to make the

compiler properly flag this, add the line

if Look <> CR then Expected('Newline');

in the main program, just after the call to Expression. That

catches anything left over in the input stream. Don't forget to

define CR in the const statement:

CR = ^M;

As usual, recompile the program and verify that it does what it's

supposed to.

ASSIGNMENT STATEMENTS

OK, at this point we have a parser that works very nicely. I'd

like to point out that we got it using only 88 lines of

executable code, not counting what was in the cradle. The

compiled object file is a whopping 4752 bytes. Not bad,

considering we weren't trying very hard to save either source

code or object size. We just stuck to the KISS principle.

Of course, parsing an expression is not much good without having

something to do with it afterwards. Expressions USUALLY (but not

always) appear in assignment statements, in the form

We're only a breath away from being able to parse an assignment

statement, so let's take that last step. Just after procedure

Expression, add the following new procedure:

{--------------------------------------------------------------}

{ Parse and Translate an Assignment Statement }

procedure Assignment;

var Name: char;

begin

Name := GetName;

Match('=');

Expression;

EmitLn('LEA ' + Name + '(PC),A0'); // mov name,ax

EmitLn('MOVE D0,(A0)')

end;

{--------------------------------------------------------------}

Note again that the code exactly parallels the BNF. And notice

further that the error checking was painless, handled by GetName

and Match.

The reason for the two lines of assembler has to do with a

peculiarity in the 68000, which requires this kind of construct

for PC-relative code.

Now change the call to Expression, in the main program, to one to

Assignment. That's all there is to it.

Son of a gun! We are actually compiling assignment statements.

If those were the only kind of statements in a language, all we'd

have to do is put this in a loop and we'd have a full-fledged

compiler!

Well, of course they're not the only kind. There are also little

items like control statements (IFs and loops), procedures,

declarations, etc. But cheer up. The arithmetic expressions

that we've been dealing with are among the most challenging in a

language. Compared to what we've already done, control

statements will be easy. I'll be covering them in the fifth

installment. And the other statements will all fall in line, as

long as we remember to KISS.

MULTI-CHARACTER TOKENS

Throughout this series, I've been carefully restricting

everything we do to single-character tokens, all the while

assuring you that it wouldn't be difficult to extend to multi-

character ones. I don't know if you believed me or not ... I

wouldn't really blame you if you were a bit skeptical. I'll

continue to use that approach in the sessions which follow,

because it helps keep complexity away. But I'd like to back up

those assurances, and wrap up this portion of the parser, by

showing you just how easy that extension really is. In the

process, we'll also provide for embedded white space. Before you

make the next few changes, though, save the current version of

the parser away under another name. I have some more uses for it

in the next installment, and we'll be working with the single-

character version.
Most compilers separate out the handling of the input stream into

a separate module called the lexical scanner. The idea is that

the scanner deals with all the character-by-character input, and

returns the separate units (tokens) of the stream. There may

come a time when we'll want to do something like that, too, but

for now there is no need. We can handle the multi-character

tokens that we need by very slight and very local modifications

to GetName and GetNum.

The usual definition of an identifier is that the first character

must be a letter, but the rest can be alphanumeric (letters or

numbers). To deal with this, we need one other recognizer

function

{--------------------------------------------------------------}

{ Recognize an Alphanumeric }

function IsAlNum(c: char): boolean;

begin

IsAlNum := IsAlpha(c) or IsDigit(c);

end;

{--------------------------------------------------------------}

Add this function to your parser. I put mine just after IsDigit.

While you're at it, might as well include it as a permanent

member of Cradle, too.

Now, we need to modify function GetName to return a string

instead of a character:
{--------------------------------------------------------------}

{ Get an Identifier }

function GetName: string;

var Token: string;

begin

Token := '';

if not IsAlpha(Look) then Expected('Name');

while IsAlNum(Look) do begin

Token := Token + UpCase(Look);

GetChar;

end;

GetName := Token;

end;

{--------------------------------------------------------------}

Similarly, modify GetNum to read:

{--------------------------------------------------------------}

{ Get a Number }

function GetNum: string;

var Value: string;

begin

Value := '';

if not IsDigit(Look) then Expected('Integer');

while IsDigit(Look) do begin

Value := Value + Look;

GetChar;

end;

GetNum := Value;

end;

{--------------------------------------------------------------}

Amazingly enough, that is virtually all the changes required to

the parser! The local variable Name in procedures Ident and

Assignment was originally declared as "char", and must now be

declared string[8]. (Clearly, we could make the string length

longer if we chose, but most assemblers limit the length anyhow.)

Make this change, and then recompile and test. _NOW_ do you

believe that it's a simple change?

WHITE SPACE

Before we leave this parser for awhile, let's address the issue

of white space. As it stands now, the parser will barf (or

simply terminate) on a single space character embedded anywhere

in the input stream. That's pretty unfriendly behavior. So

let's "productionize" the thing a bit by eliminating this last

restriction.

The key to easy handling of white space is to come up with a

simple rule for how the parser should treat the input stream, and

to enforce that rule everywhere. Up till now, because white

space wasn't permitted, we've been able to assume that after each
parsing action, the lookahead character Look contains the next

meaningful character, so we could test it immediately. Our

design was based upon this principle.

It still sounds like a good rule to me, so that's the one we'll

use. This means that every routine that advances the input

stream must skip over white space, and leave the next non-white

character in Look. Fortunately, because we've been careful to

use GetName, GetNum, and Match for most of our input processing,

it is only those three routines (plus Init) that we need to

modify.

Not surprisingly, we start with yet another new recognizer

routine:

{--------------------------------------------------------------}

{ Recognize White Space }

function IsWhite(c: char): boolean;

begin

IsWhite := c in [' ', TAB];

end;

{--------------------------------------------------------------}

We also need a routine that will eat white-space characters,

until it finds a non-white one:

{--------------------------------------------------------------}

{ Skip Over Leading White Space }

procedure SkipWhite;

begin

while IsWhite(Look) do

GetChar;

end;

{--------------------------------------------------------------}

Now, add calls to SkipWhite to Match, GetName, and GetNum as

shown below:

{--------------------------------------------------------------}

{ Match a Specific Input Character }

procedure Match(x: char);

begin

if Look <> x then Expected('''' + x + '''')

else begin

GetChar;

SkipWhite;

end;

{--------------------------------------------------------------}

{ Get an Identifier }

function GetName: string;

var Token: string;

begin
Token := '';

if not IsAlpha(Look) then Expected('Name');

while IsAlNum(Look) do begin

Token := Token + UpCase(Look);

GetChar;

end;

GetName := Token;

SkipWhite;

end;

{--------------------------------------------------------------}

{ Get a Number }

function GetNum: string;

var Value: string;

begin

Value := '';

if not IsDigit(Look) then Expected('Integer');

while IsDigit(Look) do begin

Value := Value + Look;

GetChar;

end;

GetNum := Value;

SkipWhite;

end;

{--------------------------------------------------------------}

(Note that I rearranged Match a bit, without changing the

functionality.)

Finally, we need to skip over leading blanks where we "prime the

pump" in Init:

{--------------------------------------------------------------}

{ Initialize }

procedure Init;

begin

GetChar;

SkipWhite;

end;

{--------------------------------------------------------------}

Make these changes and recompile the program. You will find that

you will have to move Match below SkipWhite, to avoid an error

message from the Pascal compiler. Test the program as always to

make sure it works properly.

Since we've made quite a few changes during this session, I'm

reproducing the entire parser below:

{--------------------------------------------------------------}

program parse;

{--------------------------------------------------------------}

{ Constant Declarations }

const TAB = ^I;

CR = ^M;

{--------------------------------------------------------------}
{ Variable Declarations }

var Look: char; { Lookahead Character }

{--------------------------------------------------------------}

{ Read New Character From Input Stream }

procedure GetChar;

begin

Read(Look);

end;

{--------------------------------------------------------------}

{ Report an Error }

procedure Error(s: string);

begin

WriteLn;

WriteLn(^G, 'Error: ', s, '.');

end;

{--------------------------------------------------------------}

{ Report Error and Halt }

procedure Abort(s: string);

begin

Error(s);

Halt;

end;

{--------------------------------------------------------------}

{ Report What Was Expected }

procedure Expected(s: string);

begin

Abort(s + ' Expected');

end;

{--------------------------------------------------------------}

{ Recognize an Alpha Character }

function IsAlpha(c: char): boolean;

begin

IsAlpha := UpCase(c) in ['A'..'Z'];

end;

{--------------------------------------------------------------}

{ Recognize a Decimal Digit }

function IsDigit(c: char): boolean;

begin

IsDigit := c in ['0'..'9'];

end;

{--------------------------------------------------------------}

{ Recognize an Alphanumeric }

function IsAlNum(c: char): boolean;

begin

IsAlNum := IsAlpha(c) or IsDigit(c);

end;
{--------------------------------------------------------------}

{ Recognize an Addop }

function IsAddop(c: char): boolean;

begin

IsAddop := c in ['+', '-'];

end;

{--------------------------------------------------------------}

{ Recognize White Space }

function IsWhite(c: char): boolean;

begin

IsWhite := c in [' ', TAB];

end;

{--------------------------------------------------------------}

{ Skip Over Leading White Space }

procedure SkipWhite;

begin

while IsWhite(Look) do

GetChar;

end;

{--------------------------------------------------------------}

{ Match a Specific Input Character }

procedure Match(x: char);

begin
if Look <> x then Expected('''' + x + '''')

else begin

GetChar;

SkipWhite;

end;

{--------------------------------------------------------------}

{ Get an Identifier }

function GetName: string;

var Token: string;

begin

Token := '';

if not IsAlpha(Look) then Expected('Name');

while IsAlNum(Look) do begin

Token := Token + UpCase(Look);

GetChar;

end;

GetName := Token;

SkipWhite;

end;

{--------------------------------------------------------------}

{ Get a Number }

function GetNum: string;

var Value: string;

begin

Value := '';

if not IsDigit(Look) then Expected('Integer');

while IsDigit(Look) do begin

Value := Value + Look;

GetChar;

end;

GetNum := Value;

SkipWhite;

end;

{--------------------------------------------------------------}

{ Output a String with Tab }

procedure Emit(s: string);

begin

Write(TAB, s);

end;

{--------------------------------------------------------------}

{ Output a String with Tab and CRLF }

procedure EmitLn(s: string);

begin

Emit(s);

WriteLn;

end;
{---------------------------------------------------------------}

{ Parse and Translate a Identifier }

procedure Ident;

var Name: string[8];

begin

Name:= GetName;

if Look = '(' then begin

Match('(');

Match(')');

EmitLn('BSR ' + Name);

end

else

EmitLn('MOVE ' + Name + '(PC),D0');

end;

{---------------------------------------------------------------}

{ Parse and Translate a Math Factor }

procedure Expression; Forward;

procedure Factor;

begin

if Look = '(' then begin

Match('(');

Expression;

Match(')');

end

else if IsAlpha(Look) then

Ident

else

EmitLn('MOVE #' + GetNum + ',D0');

end;

{--------------------------------------------------------------}

{ Recognize and Translate a Multiply }

procedure Multiply;

begin

Match('*');

Factor;

EmitLn('MULS (SP)+,D0');

end;

{-------------------------------------------------------------}

{ Recognize and Translate a Divide }

procedure Divide;

begin

Match('/');

Factor;

EmitLn('MOVE (SP)+,D1');

EmitLn('EXS.L D0');

EmitLn('DIVS D1,D0');

end;

{---------------------------------------------------------------}
{ Parse and Translate a Math Term }

procedure Term;

begin

Factor;

while Look in ['*', '/'] do begin

EmitLn('MOVE D0,-(SP)');

case Look of

'*': Multiply;

'/': Divide;

end;

{--------------------------------------------------------------}

{ Recognize and Translate an Add }

procedure Add;

begin

Match('+');

Term;

EmitLn('ADD (SP)+,D0');

end;

{-------------------------------------------------------------}

{ Recognize and Translate a Subtract }

procedure Subtract;

begin
Match('-');

Term;

EmitLn('SUB (SP)+,D0');

EmitLn('NEG D0');

end;

{---------------------------------------------------------------}

{ Parse and Translate an Expression }

procedure Expression;

begin

if IsAddop(Look) then

EmitLn('CLR D0')

else

Term;

while IsAddop(Look) do begin

EmitLn('MOVE D0,-(SP)');

case Look of

'+': Add;

'-': Subtract;

end;

{--------------------------------------------------------------}

{ Parse and Translate an Assignment Statement }

procedure Assignment;
var Name: string[8];

begin

Name := GetName;

Match('=');

Expression;

EmitLn('LEA ' + Name + '(PC),A0');

EmitLn('MOVE D0,(A0)')

end;

{--------------------------------------------------------------}

{ Initialize }

procedure Init;

begin

GetChar;

SkipWhite;

end;

{--------------------------------------------------------------}

{ Main Program }

begin

Init;

Assignment;

If Look <> CR then Expected('NewLine');

end.

{--------------------------------------------------------------}
Now the parser is complete. It's got every feature we can put in

a one-line "compiler." Tuck it away in a safe place. Next time

we'll move on to a new subject, but we'll still be talking about

expressions for quite awhile. Next installment, I plan to talk a

bit about interpreters as opposed to compilers, and show you how

the structure of the parser changes a bit as we change what sort

of action has to be taken. The information we pick up there will

serve us in good stead later on, even if you have no interest in

interpreters. See you next time.

GET YOUR DOMAIN:
type here .com
Get It Now >>
[ Amit Mathur Online Homepage ]

LET'S BUILD A COMPILER!

By
Jack W. Crenshaw, Ph.D.
24 July 1988
Copyright : Jack W. Crenshaw
This book can be distributed as it is with above information

[ eBooks Homepage ] [ This Book Homepage ] [ Discussion Forum ]

Part IV: INTERPRETERS

INTRODUCTION

In the first three installments of this series, we've looked at

parsing and compiling math expressions, and worked our way grad-

ually and methodically from dealing with very simple one-term,

one-character "expressions" up through more general ones, finally

arriving at a very complete parser that could parse and translate

complete assignment statements, with multi-character tokens,

embedded white space, and function calls. This time, I'm going

to walk you through the process one more time, only with the goal

of interpreting rather than compiling object code.

Since this is a series on compilers, why should we bother with

interpreters? Simply because I want you to see how the nature of

the parser changes as we change the goals. I also want to unify

the concepts of the two types of translators, so that you can see
not only the differences, but also the similarities.

Consider the assignment statement

x=2*y+3

In a compiler, we want the target CPU to execute this assignment

at EXECUTION time. The translator itself doesn't do any arith-

metic ... it only issues the object code that will cause the CPU

to do it when the code is executed. For the example above, the

compiler would issue code to compute the expression and store the

results in variable x.

For an interpreter, on the other hand, no object code is gen-

erated. Instead, the arithmetic is computed immediately, as the

parsing is going on. For the example, by the time parsing of the

statement is complete, x will have a new value.

The approach we've been taking in this whole series is called

"syntax-driven translation." As you are aware by now, the struc-

ture of the parser is very closely tied to the syntax of the

productions we parse. We have built Pascal procedures that rec-

ognize every language construct. Associated with each of these

constructs (and procedures) is a corresponding "action," which

does whatever makes sense to do once a construct has been

recognized. In our compiler so far, every action involves

emitting object code, to be executed later at execution time. In

an interpreter, every action involves something to be done im-

mediately.

What I'd like you to see here is that the layout ... the struc-

ture ... of the parser doesn't change. It's only the actions
that change. So if you can write an interpreter for a given

language, you can also write a compiler, and vice versa. Yet, as

you will see, there ARE differences, and significant ones.

Because the actions are different, the procedures that do the

recognizing end up being written differently. Specifically, in

the interpreter the recognizing procedures end up being coded as

FUNCTIONS that return numeric values to their callers. None of

the parsing routines for our compiler did that.

Our compiler, in fact, is what we might call a "pure" compiler.

Each time a construct is recognized, the object code is emitted

IMMEDIATELY. (That's one reason the code is not very efficient.)

The interpreter we'll be building here is a pure interpreter, in

the sense that there is no translation, such as "tokenizing,"

performed on the source code. These represent the two extremes

of translation. In the real world, translators are rarely so

pure, but tend to have bits of each technique.

I can think of several examples. I've already mentioned one:

most interpreters, such as Microsoft BASIC, for example, trans-

late the source code (tokenize it) into an intermediate form so

that it'll be easier to parse real time.

Another example is an assembler. The purpose of an assembler, of

course, is to produce object code, and it normally does that on a

one-to-one basis: one object instruction per line of source code.

But almost every assembler also permits expressions as arguments.

In this case, the expressions are always constant expressions,

and so the assembler isn't supposed to issue object code for

them. Rather, it "interprets" the expressions and computes the

corresponding constant result, which is what it actually emits as

object code.

As a matter of fact, we could use a bit of that ourselves. The

translator we built in the previous installment will dutifully

spit out object code for complicated expressions, even though

every term in the expression is a constant. In that case it

would be far better if the translator behaved a bit more like an

interpreter, and just computed the equivalent constant result.

There is a concept in compiler theory called "lazy" translation.

The idea is that you typically don't just emit code at every

action. In fact, at the extreme you don't emit anything at all,

until you absolutely have to. To accomplish this, the actions

associated with the parsing routines typically don't just emit

code. Sometimes they do, but often they simply return in-

formation back to the caller. Armed with such information, the

caller can then make a better choice of what to do.

For example, given the statement

x = x + 3 - 2 - (5 - 4) ,

our compiler will dutifully spit out a stream of 18 instructions

to load each parameter into registers, perform the arithmetic,

and store the result. A lazier evaluation would recognize that

the arithmetic involving constants can be evaluated at compile

time, and would reduce the expression to

x=x+0.

An even lazier evaluation would then be smart enough to figure

out that this is equivalent to

x=x,

which calls for no action at all. We could reduce 18 in-

structions to zero!

Note that there is no chance of optimizing this way in our trans-

lator as it stands, because every action takes place immediately.

Lazy expression evaluation can produce significantly better

object code than we have been able to so far. I warn you,

though: it complicates the parser code considerably, because each

routine now has to make decisions as to whether to emit object

code or not. Lazy evaluation is certainly not named that because

it's easier on the compiler writer!

Since we're operating mainly on the KISS principle here, I won't

go into much more depth on this subject. I just want you to be

aware that you can get some code optimization by combining the

techniques of compiling and interpreting. In particular, you

should know that the parsing routines in a smarter translator

will generally return things to their caller, and sometimes

expect things as well. That's the main reason for going over

interpretation in this installment.

THE INTERPRETER

OK, now that you know WHY we're going into all this, let's do it.

Just to give you practice, we're going to start over with a bare

cradle and build up the translator all over again. This time, of

course, we can go a bit faster.

Since we're now going to do arithmetic, the first thing we need

to do is to change function GetNum, which up till now has always

returned a character (or string). Now, it's better for it to

return an integer. MAKE A COPY of the cradle (for goodness's

sake, don't change the version in Cradle itself!!) and modify

GetNum as follows:

{--------------------------------------------------------------}

{ Get a Number }

function GetNum: integer;

begin

if not IsDigit(Look) then Expected('Integer');

GetNum := Ord(Look) - Ord('0');

GetChar;

end;

{--------------------------------------------------------------}

Now, write the following version of Expression:

{---------------------------------------------------------------}

{ Parse and Translate an Expression }

function Expression: integer;

begin

Expression := GetNum;

end;

{--------------------------------------------------------------}
Finally, insert the statement

Writeln(Expression);

at the end of the main program. Now compile and test.

All this program does is to "parse" and translate a single

integer "expression." As always, you should make sure that it

does that with the digits 0..9, and gives an error message for

anything else. Shouldn't take you very long!

OK, now let's extend this to include addops. Change Expression

to read:

{---------------------------------------------------------------}

{ Parse and Translate an Expression }

function Expression: integer;

var Value: integer;

begin

if IsAddop(Look) then

Value := 0

else

Value := GetNum;

while IsAddop(Look) do begin

case Look of

'+': begin

Match('+');
Value := Value + GetNum;

end;

'-': begin

Match('-');

Value := Value - GetNum;

end;

Expression := Value;

end;

{--------------------------------------------------------------}

The structure of Expression, of course, parallels what we did

before, so we shouldn't have too much trouble debugging it.

There's been a SIGNIFICANT development, though, hasn't there?

Procedures Add and Subtract went away! The reason is that the

action to be taken requires BOTH arguments of the operation. I

could have chosen to retain the procedures and pass into them the

value of the expression to date, which is Value. But it seemed

cleaner to me to keep Value as strictly a local variable, which

meant that the code for Add and Subtract had to be moved in line.

This result suggests that, while the structure we had developed

was nice and clean for our simple-minded translation scheme, it

probably wouldn't do for use with lazy evaluation. That's a

little tidbit we'll probably want to keep in mind for later.

OK, did the translator work? Then let's take the next step.
It's not hard to figure out what procedure Term should now look

like. Change every call to GetNum in function Expression to a

call to Term, and then enter the following form for Term:

{---------------------------------------------------------------}

{ Parse and Translate a Math Term }

function Term: integer;

var Value: integer;

begin

Value := GetNum;

while Look in ['*', '/'] do begin

case Look of

'*': begin

Match('*');

Value := Value * GetNum;

end;

'/': begin

Match('/');

Value := Value div GetNum;

end;

Term := Value;

end;
{--------------------------------------------------------------}

Now, try it out. Don't forget two things: first, we're dealing

with integer division, so, for example, 1/3 should come out zero.

Second, even though we can output multi-digit results, our input

is still restricted to single digits.

That seems like a silly restriction at this point, since we have

already seen how easily function GetNum can be extended. So

let's go ahead and fix it right now. The new version is

{--------------------------------------------------------------}

{ Get a Number }

function GetNum: integer;

var Value: integer;

begin

Value := 0;

if not IsDigit(Look) then Expected('Integer');

while IsDigit(Look) do begin

Value := 10 * Value + Ord(Look) - Ord('0');

GetChar;

end;

GetNum := Value;

end;

{--------------------------------------------------------------}

If you've compiled and tested this version of the interpreter,

the next step is to install function Factor, complete with pa-

renthesized expressions. We'll hold off a bit longer on the

variable names. First, change the references to GetNum, in

function Term, so that they call Factor instead. Now code the

following version of Factor:

{---------------------------------------------------------------}

{ Parse and Translate a Math Factor }

function Expression: integer; Forward;

function Factor: integer;

begin

if Look = '(' then begin

Match('(');

Factor := Expression;

Match(')');

end

else

Factor := GetNum;

end;

{---------------------------------------------------------------}

That was pretty easy, huh? We're rapidly closing in on a useful

interpreter.

A LITTLE PHILOSOPHY

Before going any further, there's something I'd like to call to

your attention. It's a concept that we've been making use of in

all these sessions, but I haven't explicitly mentioned it up till

now. I think it's time, because it's a concept so useful, and so

powerful, that it makes all the difference between a parser

that's trivially easy, and one that's too complex to deal with.

In the early days of compiler technology, people had a terrible

time figuring out how to deal with things like operator prece-

dence ... the way that multiply and divide operators take

precedence over add and subtract, etc. I remember a colleague of

some thirty years ago, and how excited he was to find out how to

do it. The technique used involved building two stacks, upon

which you pushed each operator or operand. Associated with each

operator was a precedence level, and the rules required that you

only actually performed an operation ("reducing" the stack) if

the precedence level showing on top of the stack was correct. To

make life more interesting, an operator like ')' had different

precedence levels, depending upon whether or not it was already

on the stack. You had to give it one value before you put it on

the stack, and another to decide when to take it off. Just for

the experience, I worked all of this out for myself a few years

ago, and I can tell you that it's very tricky.

We haven't had to do anything like that. In fact, by now the

parsing of an arithmetic statement should seem like child's play.

How did we get so lucky? And where did the precedence stacks go?

A similar thing is going on in our interpreter above. You just

KNOW that in order for it to do the computation of arithmetic

statements (as opposed to the parsing of them), there have to be

numbers pushed onto a stack somewhere. But where is the stack?

Finally, in compiler textbooks, there are a number of places

where stacks and other structures are discussed. In the other

leading parsing method (LR), an explicit stack is used. In fact,

the technique is very much like the old way of doing arithmetic

expressions. Another concept is that of a parse tree. Authors

like to draw diagrams of the tokens in a statement, connected

into a tree with operators at the internal nodes. Again, where

are the trees and stacks in our technique? We haven't seen any.

The answer in all cases is that the structures are implicit, not

explicit. In any computer language, there is a stack involved

every time you call a subroutine. Whenever a subroutine is

called, the return address is pushed onto the CPU stack. At the

end of the subroutine, the address is popped back off and control

is transferred there. In a recursive language such as Pascal,

there can also be local data pushed onto the stack, and it, too,

returns when it's needed.

For example, function Expression contains a local parameter

called Value, which it fills by a call to Term. Suppose, in its

next call to Term for the second argument, that Term calls

Factor, which recursively calls Expression again. That "in-

stance" of Expression gets another value for its copy of Value.

What happens to the first Value? Answer: it's still on the

stack, and will be there again when we return from our call

sequence.
In other words, the reason things look so simple is that we've

been making maximum use of the resources of the language. The

hierarchy levels and the parse trees are there, all right, but

they're hidden within the structure of the parser, and they're

taken care of by the order with which the various procedures are

called. Now that you've seen how we do it, it's probably hard to

imagine doing it any other way. But I can tell you that it took

a lot of years for compiler writers to get that smart. The early

compilers were too complex too imagine. Funny how things get

easier with a little practice.

The reason I've brought all this up is as both a lesson and a

warning. The lesson: things can be easy when you do them right.

The warning: take a look at what you're doing. If, as you branch

out on your own, you begin to find a real need for a separate

stack or tree structure, it may be time to ask yourself if you're

looking at things the right way. Maybe you just aren't using the

facilities of the language as well as you could be.

The next step is to add variable names. Now, though, we have a

slight problem. For the compiler, we had no problem in dealing

with variable names ... we just issued the names to the assembler

and let the rest of the program take care of allocating storage

for them. Here, on the other hand, we need to be able to fetch

the values of the variables and return them as the return values

of Factor. We need a storage mechanism for these variables.

Back in the early days of personal computing, Tiny BASIC lived.

It had a grand total of 26 possible variables: one for each

letter of the alphabet. This fits nicely with our concept of

single-character tokens, so we'll try the same trick. In the

beginning of your interpreter, just after the declaration of

variable Look, insert the line:

Table: Array['A'..'Z'] of integer;

We also need to initialize the array, so add this procedure:

{---------------------------------------------------------------}

{ Initialize the Variable Area }

procedure InitTable;

var i: char;

begin

for i := 'A' to 'Z' do

Table[i] := 0;

end;

{---------------------------------------------------------------}

You must also insert a call to InitTable, in procedure Init.

DON'T FORGET to do that, or the results may surprise you!

Now that we have an array of variables, we can modify Factor to

use it. Since we don't have a way (so far) to set the variables,

Factor will always return zero values for them, but let's go

ahead and extend it anyway. Here's the new version:

{---------------------------------------------------------------}

{ Parse and Translate a Math Factor }

function Expression: integer; Forward;

function Factor: integer;

begin

if Look = '(' then begin

Match('(');

Factor := Expression;

Match(')');

end

else if IsAlpha(Look) then

Factor := Table[GetName]

else

Factor := GetNum;

end;

{---------------------------------------------------------------}

As always, compile and test this version of the program. Even

though all the variables are now zeros, at least we can correctly

parse the complete expressions, as well as catch any badly formed

expressions.

I suppose you realize the next step: we need to do an assignment

statement so we can put something INTO the variables. For now,

let's stick to one-liners, though we will soon be handling

multiple statements.
The assignment statement parallels what we did before:

{--------------------------------------------------------------}

{ Parse and Translate an Assignment Statement }

procedure Assignment;

var Name: char;

begin

Name := GetName;

Match('=');

Table[Name] := Expression;

end;

{--------------------------------------------------------------}

To test this, I added a temporary write statement in the main

program, to print out the value of A. Then I tested it with

various assignments to it.

Of course, an interpretive language that can only accept a single

line of program is not of much value. So we're going to want to

handle multiple statements. This merely means putting a loop

around the call to Assignment. So let's do that now. But what

should be the loop exit criterion? Glad you asked, because it

brings up a point we've been able to ignore up till now.

One of the most tricky things to handle in any translator is to

determine when to bail out of a given construct and go look for

something else. This hasn't been a problem for us so far because

we've only allowed for a single kind of construct ... either an

expression or an assignment statement. When we start adding

loops and different kinds of statements, you'll find that we have

to be very careful that things terminate properly. If we put our

interpreter in a loop, we need a way to quit. Terminating on a

newline is no good, because that's what sends us back for another

line. We could always let an unrecognized character take us out,

but that would cause every run to end in an error message, which

certainly seems uncool.

What we need is a termination character. I vote for Pascal's

ending period ('.'). A minor complication is that Turbo ends

every normal line with TWO characters, the carriage return (CR)

and line feed (LF). At the end of each line, we need to eat

these characters before processing the next one. A natural way

to do this would be with procedure Match, except that Match's

error message prints the character, which of course for the CR

and/or LF won't look so great. What we need is a special proce-

dure for this, which we'll no doubt be using over and over. Here

it is:

{--------------------------------------------------------------}

{ Recognize and Skip Over a Newline }

procedure NewLine;

begin

if Look = CR then begin

GetChar;
if Look = LF then

GetChar;

end;

{--------------------------------------------------------------}

Insert this procedure at any convenient spot ... I put mine just

after Match. Now, rewrite the main program to look like this:

{--------------------------------------------------------------}

{ Main Program }

begin

Init;

repeat

Assignment;

NewLine;

until Look = '.';

end.

{--------------------------------------------------------------}

Note that the test for a CR is now gone, and that there are also

no error tests within NewLine itself. That's OK, though ...

whatever is left over in terms of bogus characters will be caught

at the beginning of the next assignment statement.

Well, we now have a functioning interpreter. It doesn't do us a

lot of good, however, since we have no way to read data in or

write it out. Sure would help to have some I/O!

Let's wrap this session up, then, by adding the I/O routines.

Since we're sticking to single-character tokens, I'll use '?' to

stand for a read statement, and '!' for a write, with the char-

acter immediately following them to be used as a one-token

"parameter list." Here are the routines:

{--------------------------------------------------------------}

{ Input Routine }

procedure Input;

begin

Match('?');

Read(Table[GetName]);

end;

{--------------------------------------------------------------}

{ Output Routine }

procedure Output;

begin

Match('!');

WriteLn(Table[GetName]);

end;

{--------------------------------------------------------------}

They aren't very fancy, I admit ... no prompt character on input,

for example ... but they get the job done.

The corresponding changes in the main program are shown below.

Note that we use the usual trick of a case statement based upon
the current lookahead character, to decide what to do.

{--------------------------------------------------------------}

{ Main Program }

begin

Init;

repeat

case Look of

'?': Input;

'!': Output;

else Assignment;

end;

NewLine;

until Look = '.';

end.

{--------------------------------------------------------------}

You have now completed a real, working interpreter. It's pretty

sparse, but it works just like the "big boys." It includes three

kinds of program statements (and can tell the difference!), 26

variables, and I/O statements. The only things that it lacks,

really, are control statements, subroutines, and some kind of

program editing function. The program editing part, I'm going to

pass on. After all, we're not here to build a product, but to

learn things. The control statements, we'll cover in the next

installment, and the subroutines soon after. I'm anxious to get

on with that, so we'll leave the interpreter as it stands.

I hope that by now you're convinced that the limitation of sin-

gle-character names and the processing of white space are easily

taken care of, as we did in the last session. This time, if

you'd like to play around with these extensions, be my guest ...

they're "left as an exercise for the student." See you next

time.
GET YOUR DOMAIN:
type here .com
Get It Now >>
[ Amit Mathur Online Homepage ]

LET'S BUILD A COMPILER!

By
Jack W. Crenshaw, Ph.D.
24 July 1988
Copyright : Jack W. Crenshaw
This book can be distributed as it is with above information

[ eBooks Homepage ] [ This Book Homepage ] [ Discussion Forum ]

Part V: CONTROL CONSTRUCTS

INTRODUCTION

In the first four installments of this series, we've been

concentrating on the parsing of math expressions and assignment

statements. In this installment, we'll take off on a new and

exciting tangent: that of parsing and translating control

constructs such as IF statements.

This subject is dear to my heart, because it represents a turning

point for me. I had been playing with the parsing of

expressions, just as we have done in this series, but I still

felt that I was a LONG way from being able to handle a complete

language. After all, REAL languages have branches and loops and

subroutines and all that. Perhaps you've shared some of the same

thoughts. Awhile back, though, I had to produce control

constructs for a structured assembler preprocessor I was writing.

Imagine my surprise to discover that it was far easier than the

expression parsing I had already been through. I remember

thinking, "Hey! This is EASY!" After we've finished this session,

I'll bet you'll be thinking so, too.

THE PLAN

In what follows, we'll be starting over again with a bare cradle,

and as we've done twice before now, we'll build things up one at

a time. We'll also be retaining the concept of single-character

tokens that has served us so well to date. This means that the

"code" will look a little funny, with 'i' for IF, 'w' for WHILE,

etc. But it helps us get the concepts down pat without fussing

over lexical scanning. Fear not ... eventually we'll see

something looking like "real" code.

I also don't want to have us get bogged down in dealing with

statements other than branches, such as the assignment statements

we've been working on. We've already demonstrated that we can

handle them, so there's no point carrying them around as excess

baggage during this exercise. So what I'll do instead is to use

an anonymous statement, "other", to take the place of the non-

control statements and serve as a place-holder for them. We have

to generate some kind of object code for them (we're back into

compiling, not interpretation), so for want of anything else I'll

just echo the character input.

OK, then, starting with yet another copy of the cradle, let's

define the procedure:

{--------------------------------------------------------------}

{ Recognize and Translate an "Other" }

procedure Other;

begin

EmitLn(GetName);

end;

{--------------------------------------------------------------}

Now include a call to it in the main program, thus:

{--------------------------------------------------------------}

{ Main Program }

begin

Init;

Other;

end.

{--------------------------------------------------------------}

Run the program and see what you get. Not very exciting, is it?

But hang in there, it's a start, and things will get better.

The first thing we need is the ability to deal with more than one

statement, since a single-line branch is pretty limited. We did

that in the last session on interpreting, but this time let's get

a little more formal. Consider the following BNF:

<program> ::= <block> END

<block> ::= [ <statement> ]*

This says that, for our purposes here, a program is defined as a

block, followed by an END statement. A block, in turn, consists

of zero or more statements. We only have one kind of statement,

so far.

What signals the end of a block? It's simply any construct that

isn't an "other" statement. For now, that means only the END

statement.

Armed with these ideas, we can proceed to build up our parser.

The code for a program (we have to call it DoProgram, or Pascal

will complain, is:

{--------------------------------------------------------------}

{ Parse and Translate a Program }

procedure DoProgram;

begin

Block;

if Look <> 'e' then Expected('End');

EmitLn('END')

end;

{--------------------------------------------------------------}

Notice that I've arranged to emit an "END" command to the

assembler, which sort of punctuates the output code, and makes

sense considering that we're parsing a complete program here.

The code for Block is:

{--------------------------------------------------------------}

{ Recognize and Translate a Statement Block }

procedure Block;

begin

while not(Look in ['e']) do begin

Other;

end;

{--------------------------------------------------------------}

(From the form of the procedure, you just KNOW we're going to be

adding to it in a bit!)

OK, enter these routines into your program. Replace the call to

Block in the main program, by a call to DoProgram. Now try it

and see how it works. Well, it's still not much, but we're

getting closer.

SOME GROUNDWORK

Before we begin to define the various control constructs, we need

to lay a bit more groundwork. First, a word of warning: I won't

be using the same syntax for these constructs as you're familiar

with from Pascal or C. For example, the Pascal syntax for an IF

is:

IF <condition> THEN <statement>

(where the statement, of course, may be compound).

The C version is similar:

IF ( <condition> ) <statement>

Instead, I'll be using something that looks more like Ada:

IF <condition> <block> ENDIF

In other words, the IF construct has a specific termination

symbol. This avoids the dangling-else of Pascal and C and also

precludes the need for the brackets {} or begin-end. The syntax

I'm showing you here, in fact, is that of the language KISS that

I'll be detailing in later installments. The other constructs

will also be slightly different. That shouldn't be a real

problem for you. Once you see how it's done, you'll realize that

it really doesn't matter so much which specific syntax is

involved. Once the syntax is defined, turning it into code is

straightforward.

Now, all of the constructs we'll be dealing with here involve

transfer of control, which at the assembler-language level means

conditional and/or unconditional branches. For example, the

simple IF statement

IF <condition> A ENDIF B ....

must get translated into

Branch if NOT condition to L

L: B

...

It's clear, then, that we're going to need some more procedures

to help us deal with these branches. I've defined two of them

below. Procedure NewLabel generates unique labels. This is done

via the simple expedient of calling every label 'Lnn', where nn

is a label number starting from zero. Procedure PostLabel just

outputs the labels at the proper place.

Here are the two routines:

{--------------------------------------------------------------}

{ Generate a Unique Label }

function NewLabel: string;

var S: string;

begin

Str(LCount, S);

NewLabel := 'L' + S;

Inc(LCount);

end;

{--------------------------------------------------------------}

{ Post a Label To Output }

procedure PostLabel(L: string);

begin

WriteLn(L, ':');

end;

{--------------------------------------------------------------}

Notice that we've added a new global variable, LCount, so you

need to change the VAR declarations at the top of the program to

look like this:

var Look : char; { Lookahead Character }

Lcount: integer; { Label Counter }

Also, add the following extra initialization to Init:

LCount := 0;

(DON'T forget that, or your labels can look really strange!)

At this point I'd also like to show you a new kind of notation.

If you compare the form of the IF statement above with the as-

sembler code that must be produced, you can see that there are

certain actions associated with each of the keywords in the

statement:

IF: First, get the condition and issue the code for it.

Then, create a unique label and emit a branch if false.

ENDIF: Emit the label.

These actions can be shown very concisely if we write the syntax

this way:

<condition> { Condition;

L = NewLabel;

Emit(Branch False to L); }

<block>

ENDIF { PostLabel(L) }

This is an example of syntax-directed translation. We've been

doing it all along ... we've just never written it down this way

before. The stuff in curly brackets represents the ACTIONS to be

taken. The nice part about this representation is that it not

only shows what we have to recognize, but also the actions we

have to perform, and in which order. Once we have this syntax,

the code almost writes itself.

About the only thing left to do is to be a bit more specific

about what we mean by "Branch if false."

I'm assuming that there will be code executed for <condition>

that will perform Boolean algebra and compute some result. It

should also set the condition flags corresponding to that result.

Now, the usual convention for a Boolean variable is to let 0000

represent "false," and anything else (some use FFFF, some 0001)

represent "true."

On the 68000 the condition flags are set whenever any data is
moved or calculated. If the data is a 0000 (corresponding to a

false condition, remember), the zero flag will be set. The code

for "Branch on zero" is BEQ. So for our purposes here,

BEQ <=> Branch if false // JZ

BNE <=> Branch if true // JNZ

It's the nature of the beast that most of the branches we see

will be BEQ's ... we'll be branching AROUND the code that's

supposed to be executed when the condition is true.

THE IF STATEMENT

With that bit of explanation out of the way, we're finally ready

to begin coding the IF-statement parser. In fact, we've almost

already done it! As usual, I'll be using our single-character

approach, with the character 'i' for IF, and 'e' for ENDIF (as

well as END ... that dual nature causes no confusion). I'll

also, for now, skip completely the character for the branch con-

dition, which we still have to define.

The code for DoIf is:

{--------------------------------------------------------------}

{ Recognize and Translate an IF Construct }

procedure Block; Forward;

procedure DoIf;

var L: string;
begin

Match('i');

L := NewLabel;

Condition;

EmitLn('BEQ ' + L);

Block;

Match('e');

PostLabel(L);

end;

{--------------------------------------------------------------}

Add this routine to your program, and change Block to reference

it as follows:

{--------------------------------------------------------------}

{ Recognize and Translate a Statement Block }

procedure Block;

begin

while not(Look in ['e']) do begin

case Look of

'i': DoIf;

'o': Other;

end;

{--------------------------------------------------------------}
Notice the reference to procedure Condition. Eventually, we'll

write a routine that can parse and translate any Boolean con-

dition we care to give it. But that's a whole installment by

itself (the next one, in fact). For now, let's just make it a

dummy that emits some text. Write the following routine:

{--------------------------------------------------------------}

{ Parse and Translate a Boolean Condition }

{ This version is a dummy }

Procedure Condition;

begin

EmitLn('<condition>');

end;

{--------------------------------------------------------------}

Insert this procedure in your program just before DoIf. Now run

the program. Try a string like

aibece

As you can see, the parser seems to recognize the construct and

inserts the object code at the right places. Now try a set of

nested IF's, like

aibicedefe

It's starting to look real, eh?

Now that we have the general idea (and the tools such as the

notation and the procedures NewLabel and PostLabel), it's a piece

of cake to extend the parser to include other constructs. The

first (and also one of the trickiest) is to add the ELSE clause

to IF. The BNF is

IF <condition> <block> [ ELSE <block>] ENDIF

The tricky part arises simply because there is an optional part,

which doesn't occur in the other constructs.

The corresponding output code should be

BEQ L1

<block>

BRA L2

L1: <block>

L2: ...

This leads us to the following syntax-directed translation:

<condition> { L1 = NewLabel;

L2 = NewLabel;

Emit(BEQ L1) }

<block>

ELSE { Emit(BRA L2);

PostLabel(L1) }
<block>

ENDIF { PostLabel(L2) }

Comparing this with the case for an ELSE-less IF gives us a clue

as to how to handle both situations. The code below does it.

(Note that I use an 'l' for the ELSE, since 'e' is otherwise

occupied):

{--------------------------------------------------------------}

{ Recognize and Translate an IF Construct }

procedure DoIf;

var L1, L2: string;

begin

Match('i');

Condition;

L1 := NewLabel;

L2 := L1;

EmitLn('BEQ ' + L1);

Block;

if Look = 'l' then begin

Match('l');

L2 := NewLabel;

EmitLn('BRA ' + L2);

PostLabel(L1);

Block;

end;
Match('e');

PostLabel(L2);

end;

{--------------------------------------------------------------}

There you have it. A complete IF parser/translator, in 19 lines

of code.

Give it a try now. Try something like

aiblcede

Did it work? Now, just to be sure we haven't broken the ELSE-

less case, try

aibece

Now try some nested IF's. Try anything you like, including some

badly formed statements. Just remember that 'e' is not a legal

"other" statement.

THE WHILE STATEMENT

The next type of statement should be easy, since we already have

the process down pat. The syntax I've chosen for the WHILE

statement is

WHILE <condition> <block> ENDWHILE

I know, I know, we don't REALLY need separate kinds of ter-

minators for each construct ... you can see that by the fact that

in our one-character version, 'e' is used for all of them. But I

also remember MANY debugging sessions in Pascal, trying to track

down a wayward END that the compiler obviously thought I meant to

put somewhere else. It's been my experience that specific and

unique keywords, although they add to the vocabulary of the

language, give a bit of error-checking that is worth the extra

work for the compiler writer.

Now, consider what the WHILE should be translated into. It

should be:

L1: <condition>

BEQ L2

<block>

BRA L1

L2:

As before, comparing the two representations gives us the actions

needed at each point.

WHILE { L1 = NewLabel;

PostLabel(L1) }

<condition> { Emit(BEQ L2) }

<block>

ENDWHILE { Emit(BRA L1);

PostLabel(L2) }
The code follows immediately from the syntax:

{--------------------------------------------------------------}

{ Parse and Translate a WHILE Statement }

procedure DoWhile;

var L1, L2: string;

begin

Match('w');

L1 := NewLabel;

L2 := NewLabel;

PostLabel(L1);

Condition;

EmitLn('BEQ ' + L2);

Block;

Match('e');

EmitLn('BRA ' + L1);

PostLabel(L2);

end;

{--------------------------------------------------------------}

Since we've got a new statement, we have to add a call to it

within procedure Block:

{--------------------------------------------------------------}

{ Recognize and Translate a Statement Block }

procedure Block;

begin

while not(Look in ['e', 'l']) do begin

case Look of

'i': DoIf;

'w': DoWhile;

else Other;

end;

{--------------------------------------------------------------}

No other changes are necessary.

OK, try the new program. Note that this time, the <condition>

code is INSIDE the upper label, which is just where we wanted it.

Try some nested loops. Try some loops within IF's, and some IF's

within loops. If you get a bit confused as to what you should

type, don't be discouraged: you write bugs in other languages,

too, don't you? It'll look a lot more meaningful when we get

full keywords.

I hope by now that you're beginning to get the idea that this

really IS easy. All we have to do to accomodate a new construct

is to work out the syntax-directed translation of it. The code

almost falls out from there, and it doesn't affect any of the

other routines. Once you've gotten the feel of the thing, you'll

see that you can add new constructs about as fast as you can
dream them up.

THE LOOP STATEMENT

We could stop right here, and have a language that works. It's

been shown many times that a high-order language with only two

constructs, the IF and the WHILE, is sufficient to write struc-

tured code. But we're on a roll now, so let's richen up the

repertoire a bit.

This construct is even easier, since it has no condition test at

all ... it's an infinite loop. What's the point of such a loop?

Not much, by itself, but later on we're going to add a BREAK

command, that will give us a way out. This makes the language

considerably richer than Pascal, which has no break, and also

avoids the funny WHILE(1) or WHILE TRUE of C and Pascal.

The syntax is simply

LOOP <block> ENDLOOP

and the syntax-directed translation is:

LOOP { L = NewLabel;

PostLabel(L) }

<block>

ENDLOOP { Emit(BRA L }

The corresponding code is shown below. Since I've already used

'l' for the ELSE, I've used the last letter, 'p', as the

"keyword" this time.

{--------------------------------------------------------------}

{ Parse and Translate a LOOP Statement }

procedure DoLoop;

var L: string;

begin

Match('p');

L := NewLabel;

PostLabel(L);

Block;

Match('e');

EmitLn('BRA ' + L);

end;

{--------------------------------------------------------------}

When you insert this routine, don't forget to add a line in Block

to call it.

REPEAT-UNTIL

Here's one construct that I lifted right from Pascal. The syntax

REPEAT <block> UNTIL <condition> ,

and the syntax-directed translation is:

REPEAT { L = NewLabel;

PostLabel(L) }

<block>

UNTIL

<condition> { Emit(BEQ L) }

As usual, the code falls out pretty easily:

{--------------------------------------------------------------}

{ Parse and Translate a REPEAT Statement }

procedure DoRepeat;

var L: string;

begin

Match('r');

L := NewLabel;

PostLabel(L);

Block;

Match('u');

Condition;

EmitLn('BEQ ' + L);

end;

{--------------------------------------------------------------}

As before, we have to add the call to DoRepeat within Block.

This time, there's a difference, though. I decided to use 'r'

for REPEAT (naturally), but I also decided to use 'u' for UNTIL.

This means that the 'u' must be added to the set of characters in

the while-test. These are the characters that signal an exit

from the current block ... the "follow" characters, in compiler

jargon.

{--------------------------------------------------------------}

{ Recognize and Translate a Statement Block }

procedure Block;

begin

while not(Look in ['e', 'l', 'u']) do begin

case Look of

'i': DoIf;

'w': DoWhile;

'p': DoLoop;

'r': DoRepeat;

else Other;

end;

{--------------------------------------------------------------}

THE FOR LOOP

The FOR loop is a very handy one to have around, but it's a bear

to translate. That's not so much because the construct itself is

hard ... it's only a loop after all ... but simply because it's
hard to implement in assembler language. Once the code is

figured out, the translation is straightforward enough.

C fans love the FOR-loop of that language (and, in fact, it's

easier to code), but I've chosen instead a syntax very much like

the one from good ol' BASIC:

FOR <ident> = <expr1> TO <expr2> <block> ENDFOR

The translation of a FOR loop can be just about as difficult as

you choose to make it, depending upon the way you decide to

define the rules as to how to handle the limits. Does expr2 get

evaluated every time through the loop, for example, or is it

treated as a constant limit? Do you always go through the loop

at least once, as in FORTRAN, or not? It gets simpler if you

adopt the point of view that the construct is equivalent to:

TEMP = <expr2>

WHILE <ident> <= TEMP

<block>

ENDWHILE

Notice that with this definition of the loop, <block> will not be

executed at all if <expr1> is initially larger than <expr2>.

The 68000 code needed to do this is trickier than anything we've

done so far. I had a couple of tries at it, putting both the

counter and the upper limit on the stack, both in registers,

etc. I finally arrived at a hybrid arrangement, in which the

loop counter is in memory (so that it can be accessed within the

loop), and the upper limit is on the stack. The translated code

came out like this:

<ident> get name of loop counter

<expr1> get initial value

LEA <ident>(PC),A0 address the loop counter

SUBQ #1,D0 predecrement it

MOVE D0,(A0) save it

<expr1> get upper limit

MOVE D0,-(SP) save it on stack

L1: LEA <ident>(PC),A0 address loop counter

MOVE (A0),D0 fetch it to D0

ADDQ #1,D0 bump the counter

MOVE D0,(A0) save new value

CMP (SP),D0 check for range

BLE L2 skip out if D0 > (SP)

<block>

BRA L1 loop for next pass

L2: ADDQ #2,SP clean up the stack

Wow! That seems like a lot of code ... the line containing

<block> seems to almost get lost. But that's the best I could do

with it. I guess it helps to keep in mind that it's really only
sixteen words, after all. If anyone else can optimize this

better, please let me know.

Still, the parser routine is pretty easy now that we have the

code:

{--------------------------------------------------------------}

{ Parse and Translate a FOR Statement }

procedure DoFor;

var L1, L2: string;

Name: char;

begin

Match('f');

L1 := NewLabel;

L2 := NewLabel;

Name := GetName;

Match('=');

Expression;

EmitLn('SUBQ #1,D0');

EmitLn('LEA ' + Name + '(PC),A0');

EmitLn('MOVE D0,(A0)');

Expression;

EmitLn('MOVE D0,-(SP)');

PostLabel(L1);

EmitLn('LEA ' + Name + '(PC),A0');

EmitLn('MOVE (A0),D0');

EmitLn('ADDQ #1,D0');
EmitLn('MOVE D0,(A0)');

EmitLn('CMP (SP),D0');

EmitLn('BGT ' + L2);

Block;

Match('e');

EmitLn('BRA ' + L1);

PostLabel(L2);

EmitLn('ADDQ #2,SP');

end;

{--------------------------------------------------------------}

Since we don't have expressions in this parser, I used the same

trick as for Condition, and wrote the routine

{--------------------------------------------------------------}

{ Parse and Translate an Expression }

{ This version is a dummy }

Procedure Expression;

begin

EmitLn('<expr>');

end;

{--------------------------------------------------------------}

Give it a try. Once again, don't forget to add the call in

Block. Since we don't have any input for the dummy version of

Expression, a typical input line would look something like

afi=bece

Well, it DOES generate a lot of code, doesn't it? But at least

it's the RIGHT code.

THE DO STATEMENT

All this made me wish for a simpler version of the FOR loop. The

reason for all the code above is the need to have the loop

counter accessible as a variable within the loop. If all we need

is a counting loop to make us go through something a specified

number of times, but don't need access to the counter itself,

there is a much easier solution. The 68000 has a "decrement and

branch nonzero" instruction built in which is ideal for counting.

For good measure, let's add this construct, too. This will be

the last of our loop structures.

The syntax and its translation is:

<expr> { Emit(SUBQ #1,D0);

L = NewLabel;

PostLabel(L);

Emit(MOVE D0,-(SP) }

<block>

ENDDO { Emit(MOVE (SP)+,D0;

Emit(DBRA D0,L) }

That's quite a bit simpler! The loop will execute <expr> times.
Here's the code:

{--------------------------------------------------------------}

{ Parse and Translate a DO Statement }

procedure Dodo;

var L: string;

begin

Match('d');

L := NewLabel;

Expression;

EmitLn('SUBQ #1,D0');

PostLabel(L);

EmitLn('MOVE D0,-(SP)');

Block;

EmitLn('MOVE (SP)+,D0');

EmitLn('DBRA D0,' + L);

end;

{--------------------------------------------------------------}

I think you'll have to agree, that's a whole lot simpler than the

classical FOR. Still, each construct has its place.

THE BREAK STATEMENT

Earlier I promised you a BREAK statement to accompany LOOP. This

is one I'm sort of proud of. On the face of it a BREAK seems

really tricky. My first approach was to just use it as an extra

terminator to Block, and split all the loops into two parts, just

as I did with the ELSE half of an IF. That turns out not to

work, though, because the BREAK statement is almost certainly not

going to show up at the same level as the loop itself. The most

likely place for a BREAK is right after an IF, which would cause

it to exit to the IF construct, not the enclosing loop. WRONG.

The BREAK has to exit the inner LOOP, even if it's nested down

into several levels of IFs.

My next thought was that I would just store away, in some global

variable, the ending label of the innermost loop. That doesn't

work either, because there may be a break from an inner loop

followed by a break from an outer one. Storing the label for the

inner loop would clobber the label for the outer one. So the

global variable turned into a stack. Things were starting to get

messy.

Then I decided to take my own advice. Remember in the last

session when I pointed out how well the implicit stack of a

recursive descent parser was serving our needs? I said that if

you begin to see the need for an external stack you might be

doing something wrong. Well, I was. It is indeed possible to

let the recursion built into our parser take care of everything,

and the solution is so simple that it's surprising.

The secret is to note that every BREAK statement has to occur

within a block ... there's no place else for it to be. So all we

have to do is to pass into Block the exit address of the

innermost loop. Then it can pass the address to the routine that
translates the break instruction. Since an IF statement doesn't

change the loop level, procedure DoIf doesn't need to do anything

except pass the label into ITS blocks (both of them). Since

loops DO change the level, each loop construct simply ignores

whatever label is above it and passes its own exit label along.

All this is easier to show you than it is to describe. I'll

demonstrate with the easiest loop, which is LOOP:

{--------------------------------------------------------------}

{ Parse and Translate a LOOP Statement }

procedure DoLoop;

var L1, L2: string;

begin

Match('p');

L1 := NewLabel;

L2 := NewLabel;

PostLabel(L1);

Block(L2);

Match('e');

EmitLn('BRA ' + L1);

PostLabel(L2);

end;

{--------------------------------------------------------------}

Notice that DoLoop now has TWO labels, not just one. The second

is to give the BREAK instruction a target to jump to. If there

is no BREAK within the loop, we've wasted a label and cluttered

up things a bit, but there's no harm done.

Note also that Block now has a parameter, which for loops will

always be the exit address. The new version of Block is:

{--------------------------------------------------------------}

{ Recognize and Translate a Statement Block }

procedure Block(L: string);

begin

while not(Look in ['e', 'l', 'u']) do begin

case Look of

'i': DoIf(L);

'w': DoWhile;

'p': DoLoop;

'r': DoRepeat;

'f': DoFor;

'd': DoDo;

'b': DoBreak(L);

else Other;

end;

{--------------------------------------------------------------}

Again, notice that all Block does with the label is to pass it

into DoIf and DoBreak. The loop constructs don't need it,
because they are going to pass their own label anyway.

The new version of DoIf is:

{--------------------------------------------------------------}

{ Recognize and Translate an IF Construct }

procedure Block(L: string); Forward;

procedure DoIf(L: string);

var L1, L2: string;

begin

Match('i');

Condition;

L1 := NewLabel;

L2 := L1;

EmitLn('BEQ ' + L1);

Block(L);

if Look = 'l' then begin

Match('l');

L2 := NewLabel;

EmitLn('BRA ' + L2);

PostLabel(L1);

Block(L);

end;

Match('e');

PostLabel(L2);

end;

{--------------------------------------------------------------}
Here, the only thing that changes is the addition of the

parameter to procedure Block. An IF statement doesn't change the

loop nesting level, so DoIf just passes the label along. No

matter how many levels of IF nesting we have, the same label will

be used.

Now, remember that DoProgram also calls Block, so it now needs to

pass it a label. An attempt to exit the outermost block is an

error, so DoProgram passes a null label which is caught by

DoBreak:

{--------------------------------------------------------------}

{ Recognize and Translate a BREAK }

procedure DoBreak(L: string);

begin

Match('b');

if L <> '' then

EmitLn('BRA ' + L)

else Abort('No loop to break from');

end;

{--------------------------------------------------------------}

{ Parse and Translate a Program }

procedure DoProgram;

begin

Block('');
if Look <> 'e' then Expected('End');

EmitLn('END')

end;

{--------------------------------------------------------------}

That ALMOST takes care of everything. Give it a try, see if you

can "break" it <pun>. Careful, though. By this time we've used

so many letters, it's hard to think of characters that aren't now

representing reserved words. Remember: before you try the

program, you're going to have to edit every occurence of Block in

the other loop constructs to include the new parameter. Do it

just like I did for LOOP.

I said ALMOST above. There is one slight problem: if you take a

hard look at the code generated for DO, you'll see that if you

break out of this loop, the value of the loop counter is still

left on the stack. We're going to have to fix that! A shame ...

that was one of our smaller routines, but it can't be helped.

Here's a version that doesn't have the problem:

{--------------------------------------------------------------}

{ Parse and Translate a DO Statement }

procedure Dodo;

var L1, L2: string;

begin
Match('d');

L1 := NewLabel;

L2 := NewLabel;

Expression;

EmitLn('SUBQ #1,D0');

PostLabel(L1);

EmitLn('MOVE D0,-(SP)');

Block(L2);

EmitLn('MOVE (SP)+,D0');

EmitLn('DBRA D0,' + L1);

EmitLn('SUBQ #2,SP');

PostLabel(L2);

EmitLn('ADDQ #2,SP');

end;

{--------------------------------------------------------------}

The two extra instructions, the SUBQ and ADDQ, take care of

leaving the stack in the right shape.

CONCLUSION

At this point we have created a number of control constructs ...

a richer set, really, than that provided by almost any other pro-

gramming language. And, except for the FOR loop, it was pretty

easy to do. Even that one was tricky only because it's tricky in

assembler language.

I'll conclude this session here. To wrap the thing up with a red

ribbon, we really should have a go at having real keywords

instead of these mickey-mouse single-character things. You've

already seen that the extension to multi-character words is not

difficult, but in this case it will make a big difference in the

appearance of our input code. I'll save that little bit for the

next installment. In that installment we'll also address Boolean

expressions, so we can get rid of the dummy version of Condition

that we've used here. See you then.

For reference purposes, here is the completed parser for this

session:

{--------------------------------------------------------------}

program Branch;

{--------------------------------------------------------------}

{ Constant Declarations }

const TAB = ^I;

CR = ^M;

{--------------------------------------------------------------}

{ Variable Declarations }

var Look : char; { Lookahead Character }

Lcount: integer; { Label Counter }

{--------------------------------------------------------------}

{ Read New Character From Input Stream }

procedure GetChar;

begin

Read(Look);

end;

{--------------------------------------------------------------}

{ Report an Error }

procedure Error(s: string);

begin

WriteLn;

WriteLn(^G, 'Error: ', s, '.');

end;

{--------------------------------------------------------------}

{ Report Error and Halt }

procedure Abort(s: string);

begin

Error(s);

Halt;

end;

{--------------------------------------------------------------}

{ Report What Was Expected }

procedure Expected(s: string);

begin

Abort(s + ' Expected');

end;

{--------------------------------------------------------------}

{ Match a Specific Input Character }

procedure Match(x: char);

begin

if Look = x then GetChar

else Expected('''' + x + '''');

end;

{--------------------------------------------------------------}

{ Recognize an Alpha Character }

function IsAlpha(c: char): boolean;

begin

IsAlpha := UpCase(c) in ['A'..'Z'];

end;

{--------------------------------------------------------------}

{ Recognize a Decimal Digit }

function IsDigit(c: char): boolean;

begin

IsDigit := c in ['0'..'9'];

end;

{--------------------------------------------------------------}

{ Recognize an Addop }

function IsAddop(c: char): boolean;

begin
IsAddop := c in ['+', '-'];

end;

{--------------------------------------------------------------}

{ Recognize White Space }

function IsWhite(c: char): boolean;

begin

IsWhite := c in [' ', TAB];

end;

{--------------------------------------------------------------}

{ Skip Over Leading White Space }

procedure SkipWhite;

begin

while IsWhite(Look) do

GetChar;

end;

{--------------------------------------------------------------}

{ Get an Identifier }

function GetName: char;

begin

if not IsAlpha(Look) then Expected('Name');

GetName := UpCase(Look);

GetChar;

end;
{--------------------------------------------------------------}

{ Get a Number }

function GetNum: char;

begin

if not IsDigit(Look) then Expected('Integer');

GetNum := Look;

GetChar;

end;

{--------------------------------------------------------------}

{ Generate a Unique Label }

function NewLabel: string;

var S: string;

begin

Str(LCount, S);

NewLabel := 'L' + S;

Inc(LCount);

end;

{--------------------------------------------------------------}

{ Post a Label To Output }

procedure PostLabel(L: string);

begin
WriteLn(L, ':');

end;

{--------------------------------------------------------------}

{ Output a String with Tab }

procedure Emit(s: string);

begin

Write(TAB, s);

end;

{--------------------------------------------------------------}

{ Output a String with Tab and CRLF }

procedure EmitLn(s: string);

begin

Emit(s);

WriteLn;

end;

{--------------------------------------------------------------}

{ Parse and Translate a Boolean Condition }

procedure Condition;

begin

EmitLn('<condition>');

end;

{--------------------------------------------------------------}
{ Parse and Translate a Math Expression }

procedure Expression;

begin

EmitLn('<expr>');

end;

{--------------------------------------------------------------}

{ Recognize and Translate an IF Construct }

procedure Block(L: string); Forward;

procedure DoIf(L: string);

var L1, L2: string;

begin

Match('i');

Condition;

L1 := NewLabel;

L2 := L1;

EmitLn('BEQ ' + L1);

Block(L);

if Look = 'l' then begin

Match('l');

L2 := NewLabel;

EmitLn('BRA ' + L2);

PostLabel(L1);

Block(L);

end;
Match('e');

PostLabel(L2);

end;

{--------------------------------------------------------------}

{ Parse and Translate a WHILE Statement }

procedure DoWhile;

var L1, L2: string;

begin

Match('w');

L1 := NewLabel;

L2 := NewLabel;

PostLabel(L1);

Condition;

EmitLn('BEQ ' + L2);

Block(L2);

Match('e');

EmitLn('BRA ' + L1);

PostLabel(L2);

end;

{--------------------------------------------------------------}

{ Parse and Translate a LOOP Statement }

procedure DoLoop;

var L1, L2: string;

begin

Match('p');
L1 := NewLabel;

L2 := NewLabel;

PostLabel(L1);

Block(L2);

Match('e');

EmitLn('BRA ' + L1);

PostLabel(L2);

end;

{--------------------------------------------------------------}

{ Parse and Translate a REPEAT Statement }

procedure DoRepeat;

var L1, L2: string;

begin

Match('r');

L1 := NewLabel;

L2 := NewLabel;

PostLabel(L1);

Block(L2);

Match('u');

Condition;

EmitLn('BEQ ' + L1);

PostLabel(L2);

end;

{--------------------------------------------------------------}
{ Parse and Translate a FOR Statement }

procedure DoFor;

var L1, L2: string;

Name: char;

begin

Match('f');

L1 := NewLabel;

L2 := NewLabel;

Name := GetName;

Match('=');

Expression;

EmitLn('SUBQ #1,D0');

EmitLn('LEA ' + Name + '(PC),A0');

EmitLn('MOVE D0,(A0)');

Expression;

EmitLn('MOVE D0,-(SP)');

PostLabel(L1);

EmitLn('LEA ' + Name + '(PC),A0');

EmitLn('MOVE (A0),D0');

EmitLn('ADDQ #1,D0');

EmitLn('MOVE D0,(A0)');

EmitLn('CMP (SP),D0');

EmitLn('BGT ' + L2);

Block(L2);

Match('e');

EmitLn('BRA ' + L1);

PostLabel(L2);

EmitLn('ADDQ #2,SP');

end;

{--------------------------------------------------------------}

{ Parse and Translate a DO Statement }

procedure Dodo;

var L1, L2: string;

begin

Match('d');

L1 := NewLabel;

L2 := NewLabel;

Expression;

EmitLn('SUBQ #1,D0');

PostLabel(L1);

EmitLn('MOVE D0,-(SP)');

Block(L2);

EmitLn('MOVE (SP)+,D0');

EmitLn('DBRA D0,' + L1);

EmitLn('SUBQ #2,SP');

PostLabel(L2);

EmitLn('ADDQ #2,SP');

end;
{--------------------------------------------------------------}

{ Recognize and Translate a BREAK }

procedure DoBreak(L: string);

begin

Match('b');

EmitLn('BRA ' + L);

end;

{--------------------------------------------------------------}

{ Recognize and Translate an "Other" }

procedure Other;

begin

EmitLn(GetName);

end;

{--------------------------------------------------------------}

{ Recognize and Translate a Statement Block }

procedure Block(L: string);

begin

while not(Look in ['e', 'l', 'u']) do begin

case Look of

'i': DoIf(L);

'w': DoWhile;

'p': DoLoop;

'r': DoRepeat;

'f': DoFor;
'd': DoDo;

'b': DoBreak(L);

else Other;

end;

{--------------------------------------------------------------}

{ Parse and Translate a Program }

procedure DoProgram;

begin

Block('');

if Look <> 'e' then Expected('End');

EmitLn('END')

end;

{--------------------------------------------------------------}

{ Initialize }

procedure Init;

begin

LCount := 0;

GetChar;

end;
{--------------------------------------------------------------}

{ Main Program }

begin

Init;

DoProgram;

end.

{--------------------------------------------------------------}
GET YOUR DOMAIN:
type here .com
Get It Now >>
[ Amit Mathur Online Homepage ]

LET'S BUILD A COMPILER!

By
Jack W. Crenshaw, Ph.D.
24 July 1988
Copyright : Jack W. Crenshaw
This book can be distributed as it is with above information

[ eBooks Homepage ] [ This Book Homepage ] [ Discussion Forum ]

Part VI: BOOLEAN EXPRESSIONS

INTRODUCTION

In Part V of this series, we took a look at control constructs,

and developed parsing routines to translate them into object

code. We ended up with a nice, relatively rich set of

constructs.

As we left the parser, though, there was one big hole in our

capabilities: we did not address the issue of the branch

condition. To fill the void, I introduced to you a dummy parse

routine called Condition, which only served as a place-keeper for

the real thing.

One of the things we'll do in this session is to plug that hole

by expanding Condition into a true parser/translator.

THE PLAN
We're going to approach this installment a bit differently than

any of the others. In those other installments, we started out

immediately with experiments using the Pascal compiler, building

up the parsers from very rudimentary beginnings to their final

forms, without spending much time in planning beforehand. That's

called coding without specs, and it's usually frowned upon. We

could get away with it before because the rules of arithmetic are

pretty well established ... we know what a '+' sign is supposed

to mean without having to discuss it at length. The same is true

for branches and loops. But the ways in which programming

languages implement logic vary quite a bit from language to

language. So before we begin serious coding, we'd better first

make up our minds what it is we want. And the way to do that is

at the level of the BNF syntax rules (the GRAMMAR).

THE GRAMMAR

For some time now, we've been implementing BNF syntax equations

for arithmetic expressions, without ever actually writing them

down all in one place. It's time that we did so. They are:

<expression> ::= <unary op> <term> [<addop> <term>]*

<term> ::= <factor> [<mulop> factor]*

<factor> ::= <integer> | <variable> | ( <expression> )

(Remember, the nice thing about this grammar is that it enforces

the operator precedence hierarchy that we normally expect for

algebra.)
Actually, while we're on the subject, I'd like to amend this

grammar a bit right now. The way we've handled the unary minus

is a bit awkward. I've found that it's better to write the

grammar this way:

<expression> ::= <term> [<addop> <term>]*

<term> ::= <signed factor> [<mulop> factor]*

<signed factor> ::= [<addop>] <factor>

<factor> ::= <integer> | <variable> | (<expression>)

This puts the job of handling the unary minus onto Factor, which

is where it really belongs.

This doesn't mean that you have to go back and recode the

programs you've already written, although you're free to do so if

you like. But I will be using the new syntax from now on.

Now, it probably won't come as a shock to you to learn that we

can define an analogous grammar for Boolean algebra. A typical

set or rules is:

<b-expression>::= <b-term> [<orop> <b-term>]*

<b-term> ::= <not-factor> [AND <not-factor>]*

<not-factor> ::= [NOT] <b-factor>

<b-factor> ::= <b-literal> | <b-variable> | (<b-expression>)

Notice that in this grammar, the operator AND is analogous to

'*', and OR (and exclusive OR) to '+'. The NOT operator is

analogous to a unary minus. This hierarchy is not absolutely

standard ... some languages, notably Ada, treat all logical

operators as having the same precedence level ... but it seems

natural.

Notice also the slight difference between the way the NOT and the

unary minus are handled. In algebra, the unary minus is

considered to go with the whole term, and so never appears but

once in a given term. So an expression like

a * -b

or worse yet,

a - -b

is not allowed. In Boolean algebra, though, the expression

a AND NOT b

makes perfect sense, and the syntax shown allows for that.

RELOPS

OK, assuming that you're willing to accept the grammar I've shown

here, we now have syntax rules for both arithmetic and Boolean

algebra. The sticky part comes in when we have to combine the

two. Why do we have to do that? Well, the whole subject came up

because of the need to process the "predicates" (conditions)

associated with control statements such as the IF. The predicate

is required to have a Boolean value; that is, it must evaluate to

either TRUE or FALSE. The branch is then taken or not taken,

depending on that value. What we expect to see going on in

procedure Condition, then, is the evaluation of a Boolean

expression.

But there's more to it than that. A pure Boolean expression can

indeed be the predicate of a control statement ... things like

IF a AND NOT b THEN ....

But more often, we see Boolean algebra show up in such things as

IF (x >= 0) and (x <= 100) THEN ...

Here, the two terms in parens are Boolean expressions, but the

individual terms being compared: x, 0, and 100, are NUMERIC in

nature. The RELATIONAL OPERATORS >= and <= are the catalysts by

which the Boolean and the arithmetic ingredients get merged

together.

Now, in the example above, the terms being compared are just

that: terms. However, in general each side can be a math

expression. So we can define a RELATION to be:

<relation> ::= <expression> <relop> <expression> ,

where the expressions we're talking about here are the old

numeric type, and the relops are any of the usual symbols

=, <> (or !=), <, >, <=, and >=

If you think about it a bit, you'll agree that, since this kind

of predicate has a single Boolean value, TRUE or FALSE, as its

result, it is really just another kind of factor. So we can

expand the definition of a Boolean factor above to read:

<b-factor> ::= <b-literal>

| <b-variable>

| (<b-expression>)

| <relation>

THAT's the connection! The relops and the relation they define

serve to wed the two kinds of algebra. It is worth noting that

this implies a hierarchy where the arithmetic expression has a

HIGHER precedence that a Boolean factor, and therefore than all

the Boolean operators. If you write out the precedence levels

for all the operators, you arrive at the following list:

Level Syntax Element Operator

0 factor literal, variable

1 signed factor unary minus

2 term *, /

3 expression +, -

4 b-factor literal, variable, relop

5 not-factor NOT

6 b-term AND

7 b-expression OR, XOR

If we're willing to accept that many precedence levels, this

grammar seems reasonable. Unfortunately, it won't work! The

grammar may be great in theory, but it's no good at all in the

practice of a top-down parser. To see the problem, consider the

code fragment:

IF ((((((A + B + C) < 0 ) AND ....

When the parser is parsing this code, it knows after it sees the

IF token that a Boolean expression is supposed to be next. So it

can set up to begin evaluating such an expression. But the first

expression in the example is an ARITHMETIC expression, A + B + C.

What's worse, at the point that the parser has read this much of

the input line:

IF ((((((A ,

it still has no way of knowing which kind of expression it's

dealing with. That won't do, because we must have different

recognizers for the two cases. The situation can be handled

without changing any of our definitions, but only if we're

willing to accept an arbitrary amount of backtracking to work our

way out of bad guesses. No compiler writer in his right mind

would agree to that.

What's going on here is that the beauty and elegance of BNF

grammar has met face to face with the realities of compiler

technology.

To deal with this situation, compiler writers have had to make

compromises so that a single parser can handle the grammar

without backtracking.

FIXING THE GRAMMAR

The problem that we've encountered comes up because our

definitions of both arithmetic and Boolean factors permit the use

of parenthesized expressions. Since the definitions are

recursive, we can end up with any number of levels of

parentheses, and the parser can't know which kind of expression

it's dealing with.

The solution is simple, although it ends up causing profound

changes to our grammar. We can only allow parentheses in one

kind of factor. The way to do that varies considerably from

language to language. This is one place where there is NO

agreement or convention to help us.

When Niklaus Wirth designed Pascal, the desire was to limit the

number of levels of precedence (fewer parse routines, after all).

So the OR and exclusive OR operators are treated just like an

Addop and processed at the level of a math expression.

Similarly, the AND is treated like a Mulop and processed with

Term. The precedence levels are

Level Syntax Element Operator

0 factor literal, variable

1 signed factor unary minus, NOT

2 term *, /, AND

3 expression +, -, OR

Notice that there is only ONE set of syntax rules, applying to

both kinds of operators. According to this grammar, then,

expressions like

x + (y AND NOT z) DIV 3

are perfectly legal. And, in fact, they ARE ... as far as the

parser is concerned. Pascal doesn't allow the mixing of

arithmetic and Boolean variables, and things like this are caught

at the SEMANTIC level, when it comes time to generate code for

them, rather than at the syntax level.

The authors of C took a diametrically opposite approach: they

treat the operators as different, and have something much more

akin to our seven levels of precedence. In fact, in C there are

no fewer than 17 levels! That's because C also has the operators

'=', '+=' and its kin, '<<', '>>', '++', '--', etc. Ironically,

although in C the arithmetic and Boolean operators are treated

separately, the variables are NOT ... there are no Boolean or

logical variables in C, so a Boolean test can be made on any

integer value.

We'll do something that's sort of in-between. I'm tempted to

stick mostly with the Pascal approach, since that seems the
simplest from an implementation point of view, but it results in

some funnies that I never liked very much, such as the fact that,

in the expression

IF (c >= 'A') and (c <= 'Z') then ...

the parens above are REQUIRED. I never understood why before,

and neither my compiler nor any human ever explained it very

well, either. But now, we can all see that the 'and' operator,

having the precedence of a multiply, has a higher one than the

relational operators, so without the parens the expression is

equivalent to

IF c >= ('A' and c) <= 'Z' then

which doesn't make sense.

In any case, I've elected to separate the operators into

different levels, although not as many as in C.

<b-expression> ::= <b-term> [<orop> <b-term>]*

<b-term> ::= <not-factor> [AND <not-factor>]*

<not-factor> ::= [NOT] <b-factor>

<b-factor> ::= <b-literal> | <b-variable> | <relation>

<relation> ::= | <expression> [<relop> <expression]

<expression> ::= <term> [<addop> <term>]*

<term> ::= <signed factor> [<mulop> factor]*

<signed factor>::= [<addop>] <factor>

<factor> ::= <integer> | <variable> | (<b-expression>)

This grammar results in the same set of seven levels that I

showed earlier. Really, it's almost the same grammar ... I just

removed the option of parenthesized b-expressions as a possible

b-factor, and added the relation as a legal form of b-factor.

There is one subtle but crucial difference, which is what makes

the whole thing work. Notice the square brackets in the

definition of a relation. This means that the relop and the

second expression are OPTIONAL.

A strange consequence of this grammar (and one shared by C) is

that EVERY expression is potentially a Boolean expression. The

parser will always be looking for a Boolean expression, but will

"settle" for an arithmetic one. To be honest, that's going to

slow down the parser, because it has to wade through more layers

of procedure calls. That's one reason why Pascal compilers tend

to compile faster than C compilers. If it's raw speed you want,

stick with the Pascal syntax.

THE PARSER

Now that we've gotten through the decision-making process, we can

press on with development of a parser. You've done this with me

several times now, so you know the drill: we begin with a fresh

copy of the cradle, and begin adding procedures one by one. So

let's do it.

We begin, as we did in the arithmetic case, by dealing only with

Boolean literals rather than variables. This gives us a new kind

of input token, so we're also going to need a new recognizer, and

a new procedure to read instances of that token type. Let's

start by defining the two new procedures:

{--------------------------------------------------------------}

{ Recognize a Boolean Literal }

function IsBoolean(c: char): Boolean;

begin

IsBoolean := UpCase(c) in ['T', 'F'];

end;

{--------------------------------------------------------------}

{ Get a Boolean Literal }

function GetBoolean: Boolean;

var c: char;

begin

if not IsBoolean(Look) then Expected('Boolean Literal');

GetBoolean := UpCase(Look) = 'T';

GetChar;

end;

{--------------------------------------------------------------}

Type these routines into your program. You can test them by

adding into the main program the print statement

WriteLn(GetBoolean);
OK, compile the program and test it. As usual, it's not very

impressive so far, but it soon will be.

Now, when we were dealing with numeric data we had to arrange to

generate code to load the values into D0. We need to do the same

for Boolean data. The usual way to encode Boolean variables is

to let 0 stand for FALSE, and some other value for TRUE. Many

languages, such as C, use an integer 1 to represent it. But I

prefer FFFF hex (or -1), because a bitwise NOT also becomes a

Boolean NOT. So now we need to emit the right assembler code to

load those values. The first cut at the Boolean expression

parser (BoolExpression, of course) is:

{---------------------------------------------------------------}

{ Parse and Translate a Boolean Expression }

procedure BoolExpression;

begin

if not IsBoolean(Look) then Expected('Boolean Literal');

if GetBoolean then

EmitLn('MOVE #-1,D0')

else

EmitLn('CLR D0');

end;

{---------------------------------------------------------------}

Add this procedure to your parser, and call it from the main
program (replacing the print statement you had just put there).

As you can see, we still don't have much of a parser, but the

output code is starting to look more realistic.

Next, of course, we have to expand the definition of a Boolean

expression. We already have the BNF rule:

<b-expression> ::= <b-term> [<orop> <b-term>]*

I prefer the Pascal versions of the "orops", OR and XOR. But

since we are keeping to single-character tokens here, I'll encode

those with '|' and '~'. The next version of BoolExpression is

almost a direct copy of the arithmetic procedure Expression:

{--------------------------------------------------------------}

{ Recognize and Translate a Boolean OR }

procedure BoolOr;

begin

Match('|');

BoolTerm;

EmitLn('OR (SP)+,D0');

end;

{--------------------------------------------------------------}

{ Recognize and Translate an Exclusive Or }

procedure BoolXor;

begin

Match('~');

BoolTerm;

EmitLn('EOR (SP)+,D0');

end;

{---------------------------------------------------------------}

{ Parse and Translate a Boolean Expression }

procedure BoolExpression;

begin

BoolTerm;

while IsOrOp(Look) do begin

EmitLn('MOVE D0,-(SP)');

case Look of

'|': BoolOr;

'~': BoolXor;

end;

{---------------------------------------------------------------}

Note the new recognizer IsOrOp, which is also a copy, this time

of IsAddOp:

{--------------------------------------------------------------}
{ Recognize a Boolean Orop }

function IsOrop(c: char): Boolean;

begin

IsOrop := c in ['|', '~'];

end;

{--------------------------------------------------------------}

OK, rename the old version of BoolExpression to BoolTerm, then

enter the code above. Compile and test this version. At this

point, the output code is starting to look pretty good. Of

course, it doesn't make much sense to do a lot of Boolean algebra

on constant values, but we'll soon be expanding the types of

Booleans we deal with.

You've probably already guessed what the next step is: The

Boolean version of Term.

Rename the current procedure BoolTerm to NotFactor, and enter the

following new version of BoolTerm. Note that is is much simpler

than the numeric version, since there is no equivalent of

division.

{---------------------------------------------------------------}

{ Parse and Translate a Boolean Term }

procedure BoolTerm;

begin

NotFactor;
while Look = '&' do begin

EmitLn('MOVE D0,-(SP)');

Match('&');

NotFactor;

EmitLn('AND (SP)+,D0');

end;

{--------------------------------------------------------------}

Now, we're almost home. We are translating complex Boolean

expressions, although only for constant values. The next step is

to allow for the NOT. Write the following procedure:

{--------------------------------------------------------------}

{ Parse and Translate a Boolean Factor with NOT }

procedure NotFactor;

begin

if Look = '!' then begin

Match('!');

BoolFactor;

EmitLn('EOR #-1,D0');

end

else

BoolFactor;

end;

{--------------------------------------------------------------}
And rename the earlier procedure to BoolFactor. Now try that.

At this point the parser should be able to handle any Boolean

expression you care to throw at it. Does it? Does it trap badly

formed expressions?

If you've been following what we did in the parser for math

expressions, you know that what we did next was to expand the

definition of a factor to include variables and parens. We don't

have to do that for the Boolean factor, because those little

items get taken care of by the next step. It takes just a one

line addition to BoolFactor to take care of relations:

{--------------------------------------------------------------}

{ Parse and Translate a Boolean Factor }

procedure BoolFactor;

begin

if IsBoolean(Look) then

if GetBoolean then

EmitLn('MOVE #-1,D0')

else

EmitLn('CLR D0')

else Relation;

end;

{--------------------------------------------------------------}

You might be wondering when I'm going to provide for Boolean

variables and parenthesized Boolean expressions. The answer is,

I'm NOT! Remember, we took those out of the grammar earlier.

Right now all I'm doing is encoding the grammar we've already

agreed upon. The compiler itself can't tell the difference

between a Boolean variable or expression and an arithmetic one

... all of those will be handled by Relation, either way.

Of course, it would help to have some code for Relation. I don't

feel comfortable, though, adding any more code without first

checking out what we already have. So for now let's just write a

dummy version of Relation that does nothing except eat the

current character, and write a little message:

{---------------------------------------------------------------}

{ Parse and Translate a Relation }

procedure Relation;

begin

WriteLn('<Relation>');

GetChar;

end;

{--------------------------------------------------------------}

OK, key in this code and give it a try. All the old things

should still work ... you should be able to generate the code for

ANDs, ORs, and NOTs. In addition, if you type any alphabetic

character you should get a little <Relation> place-holder, where

a Boolean factor should be. Did you get that? Fine, then let's

move on to the full-blown version of Relation.

To get that, though, there is a bit of groundwork that we must

lay first. Recall that a relation has the form

<relation> ::= | <expression> [<relop> <expression]

Since we have a new kind of operator, we're also going to need a

new Boolean function to recognize it. That function is shown

below. Because of the single-character limitation, I'm sticking

to the four operators that can be encoded with such a character

(the "not equals" is encoded by '#').

{--------------------------------------------------------------}

{ Recognize a Relop }

function IsRelop(c: char): Boolean;

begin

IsRelop := c in ['=', '#', '<', '>'];

end;

{--------------------------------------------------------------}

Now, recall that we're using a zero or a -1 in register D0 to

represent a Boolean value, and also that the loop constructs

expect the flags to be set to correspond. In implementing all

this on the 68000, things get a a little bit tricky.

Since the loop constructs operate only on the flags, it would be

nice (and also quite efficient) just to set up those flags, and
not load anything into D0 at all. This would be fine for the

loops and branches, but remember that the relation can be used

ANYWHERE a Boolean factor could be used. We may be storing its

result to a Boolean variable. Since we can't know at this point

how the result is going to be used, we must allow for BOTH cases.

Comparing numeric data is easy enough ... the 68000 has an

operation for that ... but it sets the flags, not a value.

What's more, the flags will always be set the same (zero if

equal, etc.), while we need the zero flag set differently for the

each of the different relops.

The solution is found in the 68000 instruction Scc, which sets a

byte value to 0000 or FFFF (funny how that works!) depending upon

the result of the specified condition. If we make the

destination byte to be D0, we get the Boolean value needed.

Unfortunately, there's one final complication: unlike almost

every other instruction in the 68000 set, Scc does NOT reset the

condition flags to match the data being stored. So we have to do

one last step, which is to test D0 and set the flags to match it.

It must seem to be a trip around the moon to get what we want: we

first perform the test, then test the flags to set data into D0,

then test D0 to set the flags again. It is sort of roundabout,

but it's the most straightforward way to get the flags right, and

after all it's only a couple of instructions.

I might mention here that this area is, in my opinion, the one

that represents the biggest difference between the efficiency of

hand-coded assembler language and compiler-generated code. We

have seen already that we lose efficiency in arithmetic

operations, although later I plan to show you how to improve that

a bit. We've also seen that the control constructs themselves

can be done quite efficiently ... it's usually very difficult to

improve on the code generated for an IF or a WHILE. But

virtually every compiler I've ever seen generates terrible code,

compared to assembler, for the computation of a Boolean function,

and particularly for relations. The reason is just what I've

hinted at above. When I'm writing code in assembler, I go ahead

and perform the test the most convenient way I can, and then set

up the branch so that it goes the way it should. In effect, I

"tailor" every branch to the situation. The compiler can't do

that (practically), and it also can't know that we don't want to

store the result of the test as a Boolean variable. So it must

generate the code in a very strict order, and it often ends up

loading the result as a Boolean that never gets used for

anything.

In any case, we're now ready to look at the code for Relation.

It's shown below with its companion procedures:

{---------------------------------------------------------------}

{ Recognize and Translate a Relational "Equals" }

procedure Equals;

begin

Match('=');

Expression;
EmitLn('CMP (SP)+,D0');

EmitLn('SEQ D0');

end;

{---------------------------------------------------------------}

{ Recognize and Translate a Relational "Not Equals" }

procedure NotEquals;

begin

Match('#');

Expression;

EmitLn('CMP (SP)+,D0');

EmitLn('SNE D0');

end;

{---------------------------------------------------------------}

{ Recognize and Translate a Relational "Less Than" }

procedure Less;

begin

Match('<');

Expression;

EmitLn('CMP (SP)+,D0');

EmitLn('SGE D0');

end;
{---------------------------------------------------------------}

{ Recognize and Translate a Relational "Greater Than" }

procedure Greater;

begin

Match('>');

Expression;

EmitLn('CMP (SP)+,D0');

EmitLn('SLE D0');

end;

{---------------------------------------------------------------}

{ Parse and Translate a Relation }

procedure Relation;

begin

Expression;

if IsRelop(Look) then begin

EmitLn('MOVE D0,-(SP)');

case Look of

'=': Equals;

'#': NotEquals;

'<': Less;

'>': Greater;

end;

EmitLn('TST D0');

end;

end;
{---------------------------------------------------------------}

Now, that call to Expression looks familiar! Here is where the

editor of your system comes in handy. We have already generated

code for Expression and its buddies in previous sessions. You

can copy them into your file now. Remember to use the single-

character versions. Just to be certain, I've duplicated the

arithmetic procedures below. If you're observant, you'll also

see that I've changed them a little to make them correspond to

the latest version of the syntax. This change is NOT necessary,

so you may prefer to hold off on that until you're sure

everything is working.

{---------------------------------------------------------------}

{ Parse and Translate an Identifier }

procedure Ident;

var Name: char;

begin

Name:= GetName;

if Look = '(' then begin

Match('(');

Match(')');

EmitLn('BSR ' + Name);

end

else

EmitLn('MOVE ' + Name + '(PC),D0');

end;

{---------------------------------------------------------------}

{ Parse and Translate a Math Factor }

procedure Expression; Forward;

procedure Factor;

begin

if Look = '(' then begin

Match('(');

Expression;

Match(')');

end

else if IsAlpha(Look) then

Ident

else

EmitLn('MOVE #' + GetNum + ',D0');

end;

{---------------------------------------------------------------}

{ Parse and Translate the First Math Factor }

procedure SignedFactor;

begin

if Look = '+' then

GetChar;

if Look = '-' then begin

GetChar;

if IsDigit(Look) then

EmitLn('MOVE #-' + GetNum + ',D0')

else begin

Factor;

EmitLn('NEG D0');

end;

end

else Factor;

end;

{--------------------------------------------------------------}

{ Recognize and Translate a Multiply }

procedure Multiply;

begin

Match('*');

Factor;

EmitLn('MULS (SP)+,D0');

end;

{-------------------------------------------------------------}

{ Recognize and Translate a Divide }

procedure Divide;

begin
Match('/');

Factor;

EmitLn('MOVE (SP)+,D1');

EmitLn('EXS.L D0');

EmitLn('DIVS D1,D0');

end;

{---------------------------------------------------------------}

{ Parse and Translate a Math Term }

procedure Term;

begin

SignedFactor;

while Look in ['*', '/'] do begin

EmitLn('MOVE D0,-(SP)');

case Look of

'*': Multiply;

'/': Divide;

end;

{---------------------------------------------------------------}

{ Recognize and Translate an Add }

procedure Add;

begin

Match('+');
Term;

EmitLn('ADD (SP)+,D0');

end;

{---------------------------------------------------------------}

{ Recognize and Translate a Subtract }

procedure Subtract;

begin

Match('-');

Term;

EmitLn('SUB (SP)+,D0');

EmitLn('NEG D0');

end;

{---------------------------------------------------------------}

{ Parse and Translate an Expression }

procedure Expression;

begin

Term;

while IsAddop(Look) do begin

EmitLn('MOVE D0,-(SP)');

case Look of

'+': Add;

'-': Subtract;
end;

end;

{---------------------------------------------------------------}

There you have it ... a parser that can handle both arithmetic

AND Boolean algebra, and things that combine the two through the

use of relops. I suggest you file away a copy of this parser in

a safe place for future reference, because in our next step we're

going to be chopping it up.

MERGING WITH CONTROL CONSTRUCTS

At this point, let's go back to the file we had previously built

that parses control constructs. Remember those little dummy

procedures called Condition and Expression? Now you know what

goes in their places!

I warn you, you're going to have to do some creative editing

here, so take your time and get it right. What you need to do is

to copy all of the procedures from the logic parser, from Ident

through BoolExpression, into the parser for control constructs.

Insert them at the current location of Condition. Then delete

that procedure, as well as the dummy Expression. Next, change

every call to Condition to refer to BoolExpression instead.

Finally, copy the procedures IsMulop, IsOrOp, IsRelop, IsBoolean,

and GetBoolean into place. That should do it.

Compile the resulting program and give it a try. Since we

haven't used this program in awhile, don't forget that we used

single-character tokens for IF, WHILE, etc. Also don't forget

that any letter not a keyword just gets echoed as a block.

Try

ia=bxlye

which stands for "IF a=b X ELSE Y ENDIF".

What do you think? Did it work? Try some others.

ADDING ASSIGNMENTS

As long as we're this far, and we already have the routines for

expressions in place, we might as well replace the "blocks" with

real assignment statements. We've already done that before, so

it won't be too hard. Before taking that step, though, we need

to fix something else.

We're soon going to find that the one-line "programs" that we're

having to write here will really cramp our style. At the moment

we have no cure for that, because our parser doesn't recognize

the end-of-line characters, the carriage return (CR) and the line

feed (LF). So before going any further let's plug that hole.

There are a couple of ways to deal with the CR/LFs. One (the

C/Unix approach) is just to treat them as additional white space

characters and ignore them. That's actually not such a bad

approach, but it does sort of produce funny results for our

parser as it stands now. If it were reading its input from a

source file as any self-respecting REAL compiler does, there

would be no problem. But we're reading input from the keyboard,

and we're sort of conditioned to expect something to happen when

we hit the return key. It won't, if we just skip over the CR and

LF (try it). So I'm going to use a different method here, which

is NOT necessarily the best approach in the long run. Consider

it a temporary kludge until we're further along.

Instead of skipping the CR/LF, We'll let the parser go ahead and

catch them, then introduce a special procedure, analogous to

SkipWhite, that skips them only in specified "legal" spots.

Here's the procedure:

{--------------------------------------------------------------}

{ Skip a CRLF }

procedure Fin;

begin

if Look = CR then GetChar;

if Look = LF then GetChar;

end;

{--------------------------------------------------------------}

Now, add two calls to Fin in procedure Block, like this:

{--------------------------------------------------------------}

{ Recognize and Translate a Statement Block }

procedure Block(L: string);

begin

while not(Look in ['e', 'l', 'u']) do begin

Fin;

case Look of

'i': DoIf(L);

'w': DoWhile;

'p': DoLoop;

'r': DoRepeat;

'f': DoFor;

'd': DoDo;

'b': DoBreak(L);

else Other;

end;

Fin;

end;

{--------------------------------------------------------------}

Now, you'll find that you can use multiple-line "programs." The

only restriction is that you can't separate an IF or WHILE token

from its predicate.

Now we're ready to include the assignment statements. Simply

change that call to Other in procedure Block to a call to

Assignment, and add the following procedure, copied from one of

our earlier programs. Note that Assignment now calls

BoolExpression, so that we can assign Boolean variables.

{--------------------------------------------------------------}

{ Parse and Translate an Assignment Statement }

procedure Assignment;

var Name: char;

begin

Name := GetName;

Match('=');

BoolExpression;

EmitLn('LEA ' + Name + '(PC),A0');

EmitLn('MOVE D0,(A0)');

end;

{--------------------------------------------------------------}

With that change, you should now be able to write reasonably

realistic-looking programs, subject only to our limitation on

single-character tokens. My original intention was to get rid of

that limitation for you, too. However, that's going to require a

fairly major change to what we've done so far. We need a true

lexical scanner, and that requires some structural changes. They

are not BIG changes that require us to throw away all of what

we've done so far ... with care, it can be done with very minimal

changes, in fact. But it does require that care.

This installment has already gotten pretty long, and it contains

some pretty heavy stuff, so I've decided to leave that step until
next time, when you've had a little more time to digest what

we've done and are ready to start fresh.

In the next installment, then, we'll build a lexical scanner and

eliminate the single-character barrier once and for all. We'll

also write our first complete compiler, based on what we've done

in this session. See you then.

[ Amit Mathur Online Homepage ]

LET'S BUILD A COMPILER!

By
Jack W. Crenshaw, Ph.D.
24 July 1988
Copyright : Jack W. Crenshaw
This book can be distributed as it is with above information

[ eBooks Homepage ] [ This Book Homepage ] [ Discussion Forum ]

Part VII: LEXICAL SCANNING

INTRODUCTION

In the last installment, I left you with a compiler that would

ALMOST work, except that we were still limited to single-

character tokens. The purpose of this session is to get rid of

that restriction, once and for all. This means that we must deal

with the concept of the lexical scanner.

Maybe I should mention why we need a lexical scanner at all ...

after all, we've been able to manage all right without one, up

till now, even when we provided for multi-character tokens.

The ONLY reason, really, has to do with keywords. It's a fact of

computer life that the syntax for a keyword has the same form as

that for any other identifier. We can't tell until we get the

complete word whether or not it IS a keyword. For example, the

variable IFILE and the keyword IF look just alike, until you get
to the third character. In the examples to date, we were always

able to make a decision based upon the first character of the

token, but that's no longer possible when keywords are present.

We need to know that a given string is a keyword BEFORE we begin

to process it. And that's why we need a scanner.

In the last session, I also promised that we would be able to

provide for normal tokens without making wholesale changes to

what we have already done. I didn't lie ... we can, as you will

see later. But every time I set out to install these elements of

the software into the parser we have already built, I had bad

feelings about it. The whole thing felt entirely too much like a

band-aid. I finally figured out what was causing the problem: I

was installing lexical scanning software without first explaining

to you what scanning is all about, and what the alternatives are.

Up till now, I have studiously avoided giving you a lot of

theory, and certainly not alternatives. I generally don't

respond well to the textbooks that give you twenty-five different

ways to do something, but no clue as to which way best fits your

needs. I've tried to avoid that pitfall by just showing you ONE

method, that WORKS.

But this is an important area. While the lexical scanner is

hardly the most exciting part of a compiler, it often has the

most profound effect on the general "look & feel" of the

language, since after all it's the part closest to the user. I

have a particular structure in mind for the scanner to be used

with KISS. It fits the look & feel that I want for that
language. But it may not work at all for the language YOU'RE

cooking up, so in this one case I feel that it's important for

you to know your options.

So I'm going to depart, again, from my usual format. In this

session we'll be getting much deeper than usual into the basic

theory of languages and grammars. I'll also be talking about

areas OTHER than compilers in which lexical scanning plays an

important role. Finally, I will show you some alternatives for

the structure of the lexical scanner. Then, and only then, will

we get back to our parser from the last installment. Bear with

me ... I think you'll find it's worth the wait. In fact, since

scanners have many applications outside of compilers, you may

well find this to be the most useful session for you.

LEXICAL SCANNING

Lexical scanning is the process of scanning the stream of input

characters and separating it into strings called tokens. Most

compiler texts start here, and devote several chapters to

discussing various ways to build scanners. This approach has its

place, but as you have already seen, there is a lot you can do

without ever even addressing the issue, and in fact the scanner

we'll end up with here won't look much like what the texts

describe. The reason? Compiler theory and, consequently, the

programs resulting from it, must deal with the most general kind

of parsing rules. We don't. In the real world, it is possible

to specify the language syntax in such a way that a pretty simple

scanner will suffice. And as always, KISS is our motto.

Typically, lexical scanning is done in a separate part of the

compiler, so that the parser per se sees only a stream of input

tokens. Now, theoretically it is not necessary to separate this

function from the rest of the parser. There is only one set of

syntax equations that define the whole language, so in theory we

could write the whole parser in one module.

Why the separation? The answer has both practical and

theoretical bases.

In 1956, Noam Chomsky defined the "Chomsky Hierarchy" of

grammars. They are:

o Type 0: Unrestricted (e.g., English)

o Type 1: Context-Sensitive

o Type 2: Context-Free

o Type 3: Regular

A few features of the typical programming language (particularly

the older ones, such as FORTRAN) are Type 1, but for the most

part all modern languages can be described using only the last

two types, and those are all we'll be dealing with here.

The neat part about these two types is that there are very

specific ways to parse them. It has been shown that any regular

grammar can be parsed using a particular form of abstract machine

called the state machine (finite automaton). We have already

implemented state machines in some of our recognizers.

Similarly, Type 2 (context-free) grammars can always be parsed

using a push-down automaton (a state machine augmented by a

stack). We have also implemented these machines. Instead of

implementing a literal stack, we have relied on the built-in

stack associated with recursive coding to do the job, and that in

fact is the preferred approach for top-down parsing.

Now, it happens that in real, practical grammars, the parts that

qualify as regular expressions tend to be the lower-level parts,

such as the definition of an identifier:

<ident> ::= <letter> [ <letter> | <digit> ]*

Since it takes a different kind of abstract machine to parse the

two types of grammars, it makes sense to separate these lower-

level functions into a separate module, the lexical scanner,

which is built around the idea of a state machine. The idea is to

use the simplest parsing technique needed for the job.

There is another, more practical reason for separating scanner

from parser. We like to think of the input source file as a

stream of characters, which we process right to left without

backtracking. In practice that isn't possible. Almost every

language has certain keywords such as IF, WHILE, and END. As I

mentioned earlier, we can't really know whether a given

character string is a keyword, until we've reached the end of it,

as defined by a space or other delimiter. So in that sense, we

MUST save the string long enough to find out whether we have a

keyword or not. That's a limited form of backtracking.

So the structure of a conventional compiler involves splitting up

the functions of the lower-level and higher-level parsing. The

lexical scanner deals with things at the character level,

collecting characters into strings, etc., and passing them along

to the parser proper as indivisible tokens. It's also considered

normal to let the scanner have the job of identifying keywords.

STATE MACHINES AND ALTERNATIVES

I mentioned that the regular expressions can be parsed using a

state machine. In most compiler texts, and indeed in most

compilers as well, you will find this taken literally. There is

typically a real implementation of the state machine, with

integers used to define the current state, and a table of actions

to take for each combination of current state and input

character. If you write a compiler front end using the popular

Unix tools LEX and YACC, that's what you'll get. The output of

LEX is a state machine implemented in C, plus a table of actions

corresponding to the input grammar given to LEX. The YACC output

is similar ... a canned table-driven parser, plus the table

corresponding to the language syntax.

That is not the only choice, though. In our previous

installments, you have seen over and over that it is possible to

implement parsers without dealing specifically with tables,

stacks, or state variables. In fact, in Installment V I warned

you that if you find yourself needing these things you might be

doing something wrong, and not taking advantage of the power of

Pascal. There are basically two ways to define a state machine's

state: explicitly, with a state number or code, and implicitly,

simply by virtue of the fact that I'm at a certain place in the

code (if it's Tuesday, this must be Belgium). We've relied

heavily on the implicit approaches before, and I think you'll

find that they work well here, too.

In practice, it may not even be necessary to HAVE a well-defined

lexical scanner. This isn't our first experience at dealing with

multi-character tokens. In Installment III, we extended our

parser to provide for them, and we didn't even NEED a lexical

scanner. That was because in that narrow context, we could

always tell, just by looking at the single lookahead character,

whether we were dealing with a number, a variable, or an

operator. In effect, we built a distributed lexical scanner,

using procedures GetName and GetNum.

With keywords present, we can't know anymore what we're dealing

with, until the entire token is read. This leads us to a more

localized scanner; although, as you will see, the idea of a

distributed scanner still has its merits.

SOME EXPERIMENTS IN SCANNING

Before getting back to our compiler, it will be useful to

experiment a bit with the general concepts.

Let's begin with the two definitions most often seen in real

programming languages:

<ident> ::= <letter> [ <letter> | <digit> ]*

<number ::= [<digit>]+

(Remember, the '*' indicates zero or more occurences of the terms

in brackets, and the '+', one or more.)

We have already dealt with similar items in Installment III.

Let's begin (as usual) with a bare cradle. Not surprisingly, we

are going to need a new recognizer:

{--------------------------------------------------------------}

{ Recognize an Alphanumeric Character }

function IsAlNum(c: char): boolean;

begin

IsAlNum := IsAlpha(c) or IsDigit(c);

end;

{--------------------------------------------------------------}

Using this let's write the following two routines, which are very

similar to those we've used before:

{--------------------------------------------------------------}

{ Get an Identifier }

function GetName: string;

var x: string[8];

begin

x := '';

if not IsAlpha(Look) then Expected('Name');

while IsAlNum(Look) do begin

x := x + UpCase(Look);

GetChar;

end;

GetName := x;
end;

{--------------------------------------------------------------}

{ Get a Number }

function GetNum: string;

var x: string[16];

begin

x := '';

if not IsDigit(Look) then Expected('Integer');

while IsDigit(Look) do begin

x := x + Look;

GetChar;

end;

GetNum := x;

end;

{--------------------------------------------------------------}

(Notice that this version of GetNum returns a string, not an

integer as before.)

You can easily verify that these routines work by calling them

from the main program, as in

WriteLn(GetName);

This program will print any legal name typed in (maximum eight

characters, since that's what we told GetName). It will reject

anything else.

Test the other routine similarly.

WHITE SPACE

We also have dealt with embedded white space before, using the

two routines IsWhite and SkipWhite. Make sure that these

routines are in your current version of the cradle, and add the

the line

SkipWhite;

at the end of both GetName and GetNum.

Now, let's define the new procedure:

{--------------------------------------------------------------}

{ Lexical Scanner }

Function Scan: string;

begin

if IsAlpha(Look) then

Scan := GetName

else if IsDigit(Look) then

Scan := GetNum

else begin

Scan := Look;

GetChar;

end;

SkipWhite;

end;

{--------------------------------------------------------------}
We can call this from the new main program:

{--------------------------------------------------------------}

{ Main Program }

begin

Init;

repeat

Token := Scan;

writeln(Token);

until Token = CR;

end.

{--------------------------------------------------------------}

(You will have to add the declaration of the string Token at the

beginning of the program. Make it any convenient length, say 16

characters.)

Now, run the program. Note how the input string is, indeed,

separated into distinct tokens.

STATE MACHINES

For the record, a parse routine like GetName does indeed

implement a state machine. The state is implicit in the current

position in the code. A very useful trick for visualizing what's

going on is the syntax diagram, or "railroad-track" diagram.

It's a little difficult to draw one in this medium, so I'll use

them very sparingly, but the figure below should give you the

idea:

|-----> Other---------------------------> Error

Start -------> Letter ---------------> Other -----> Finish

|<----- Letter <---------|

|<----- Digit <----------

As you can see, this diagram shows how the logic flows as

characters are read. Things begin, of course, in the start

state, and end when a character other than an alphanumeric is

found. If the first character is not alpha, an error occurs.

Otherwise the machine will continue looping until the terminating

delimiter is found.

Note that at any point in the flow, our position is entirely

dependent on the past history of the input characters. At that

point, the action to be taken depends only on the current state,

plus the current input character. That's what make this a state

machine.

Because of the difficulty of drawing railroad-track diagrams in

this medium, I'll continue to stick to syntax equations from now

on. But I highly recommend the diagrams to you for anything you
do that involves parsing. After a little practice you can begin

to see how to write a parser directly from the diagrams.

Parallel paths get coded into guarded actions (guarded by IF's or

CASE statements), serial paths into sequential calls. It's

almost like working from a schematic.

We didn't even discuss SkipWhite, which was introduced earlier,

but it also is a simple state machine, as is GetNum. So is their

parent procedure, Scan. Little machines make big machines.

The neat thing that I'd like you to note is how painlessly this

implicit approach creates these state machines. I personally

prefer it a lot over the table-driven approach. It also results

is a small, tight, and fast scanner.

NEWLINES

Moving right along, let's modify our scanner to handle more than

one line. As I mentioned last time, the most straightforward way

to do this is to simply treat the newline characters, carriage

return and line feed, as white space. This is, in fact, the way

the C standard library routine, iswhite, works. We didn't

actually try this before. I'd like to do it now, so you can get

a feel for the results.

To do this, simply modify the single executable line of IsWhite

to read:

IsWhite := c in [' ', TAB, CR, LF];

We need to give the main program a new stop condition, since it

will never see a CR. Let's just use:

until Token = '.';

OK, compile this program and run it. Try a couple of lines,

terminated by the period. I used:

now is the time

for all good men.

Hey, what happened? When I tried it, I didn't get the last

token, the period. The program didn't halt. What's more, when I

pressed the 'enter' key a few times, I still didn't get the

period.

If you're still stuck in your program, you'll find that typing a

period on a new line will terminate it.

What's going on here? The answer is that we're hanging up in

SkipWhite. A quick look at that routine will show that as long

as we're typing null lines, we're going to just continue to loop.

After SkipWhite encounters an LF, it tries to execute a GetChar.

But since the input buffer is now empty, GetChar's read statement

insists on having another line. Procedure Scan gets the

terminating period, all right, but it calls SkipWhite to clean

up, and SkipWhite won't return until it gets a non-null line.

This kind of behavior is not quite as bad as it seems. In a real

compiler, we'd be reading from an input file instead of the

console, and as long as we have some procedure for dealing with

end-of-files, everything will come out OK. But for reading data

from the console, the behavior is just too bizarre. The fact of

the matter is that the C/Unix convention is just not compatible

with the structure of our parser, which calls for a lookahead

character. The code that the Bell wizards have implemented

doesn't use that convention, which is why they need 'ungetc'.

OK, let's fix the problem. To do that, we need to go back to the

old definition of IsWhite (delete the CR and LF characters) and

make use of the procedure Fin that I introduced last time. If

it's not in your current version of the cradle, put it there now.

Also, modify the main program to read:

{--------------------------------------------------------------}

{ Main Program }

begin

Init;

repeat

Token := Scan;

writeln(Token);

if Token = CR then Fin;

until Token = '.';

end.

{--------------------------------------------------------------}
Note the "guard" test preceding the call to Fin. That's what

makes the whole thing work, and ensures that we don't try to read

a line ahead.

Try the code now. I think you'll like it better.

If you refer to the code we did in the last installment, you'll

find that I quietly sprinkled calls to Fin throughout the code,

wherever a line break was appropriate. This is one of those

areas that really affects the look & feel that I mentioned. At

this point I would urge you to experiment with different

arrangements and see how you like them. If you want your

language to be truly free-field, then newlines should be

transparent. In this case, the best approach is to put the

following lines at the BEGINNING of Scan:

while Look = CR do

Fin;

If, on the other hand, you want a line-oriented language like

Assembler, BASIC, or FORTRAN (or even Ada... note that it has

comments terminated by newlines), then you'll need for Scan to

return CR's as tokens. It must also eat the trailing LF. The

best way to do that is to use this line, again at the beginning

of Scan:

if Look = LF then Fin;

For other conventions, you'll have to use other arrangements.

In my example of the last session, I allowed newlines only at

specific places, so I was somewhere in the middle ground. In the

rest of these sessions, I'll be picking ways to handle newlines

that I happen to like, but I want you to know how to choose other

ways for yourselves.

OPERATORS

We could stop now and have a pretty useful scanner for our

purposes. In the fragments of KISS that we've built so far, the

only tokens that have multiple characters are the identifiers and

numbers. All operators were single characters. The only

exception I can think of is the relops <=, >=, and <>, but they

could be dealt with as special cases.

Still, other languages have multi-character operators, such as

the ':=' of Pascal or the '++' and '>>' of C. So while we may

not need multi-character operators, it's nice to know how to get

them if necessary.

Needless to say, we can handle operators very much the same way

as the other tokens. Let's start with a recognizer:

{--------------------------------------------------------------}

{ Recognize Any Operator }

function IsOp(c: char): boolean;

begin

IsOp := c in ['+', '-', '*', '/', '<', '>', ':', '='];

end;

{--------------------------------------------------------------}
It's important to note that we DON'T have to include every

possible operator in this list. For example, the paretheses

aren't included, nor is the terminating period. The current

version of Scan handles single-character operators just fine as

it is. The list above includes only those characters that can

appear in multi-character operators. (For specific languages, of

course, the list can always be edited.)

Now, let's modify Scan to read:

{--------------------------------------------------------------}

{ Lexical Scanner }

Function Scan: string;

begin

while Look = CR do

Fin;

if IsAlpha(Look) then

Scan := GetName

else if IsDigit(Look) then

Scan := GetNum

else if IsOp(Look) then

Scan := GetOp

else begin

Scan := Look;

GetChar;

end;
SkipWhite;

end;

{--------------------------------------------------------------}

Try the program now. You will find that any code fragments you

care to throw at it will be neatly broken up into individual

tokens.

LISTS, COMMAS AND COMMAND LINES

Before getting back to the main thrust of our study, I'd like to

get on my soapbox for a moment.

How many times have you worked with a program or operating system

that had rigid rules about how you must separate items in a list?

(Try, the last time you used MSDOS!) Some programs require

spaces as delimiters, and some require commas. Worst of all,

some require both, in different places. Most are pretty

unforgiving about violations of their rules.

I think this is inexcusable. It's too easy to write a parser

that will handle both spaces and commas in a flexible way.

Consider the following procedure:

{--------------------------------------------------------------}

{ Skip Over a Comma }

procedure SkipComma;

begin

SkipWhite;
if Look = ',' then begin

GetChar;

SkipWhite;

end;

{--------------------------------------------------------------}

This eight-line procedure will skip over a delimiter consisting

of any number (including zero) of spaces, with zero or one comma

embedded in the string.

TEMPORARILY, change the call to SkipWhite in Scan to a call to

SkipComma, and try inputting some lists. Works nicely, eh?

Don't you wish more software authors knew about SkipComma?

For the record, I found that adding the equivalent of SkipComma

to my Z80 assembler-language programs took all of 6 (six) extra

bytes of code. Even in a 64K machine, that's not a very high

price to pay for user-friendliness!

I think you can see where I'm going here. Even if you never

write a line of a compiler code in your life, there are places in

every program where you can use the concepts of parsing. Any

program that processes a command line needs them. In fact, if

you think about it for a bit, you'll have to conclude that any

time you write a program that processes user inputs, you're

defining a language. People communicate with languages, and the

syntax implicit in your program defines that language. The real

question is: are you going to define it deliberately and

explicitly, or just let it turn out to be whatever the program

ends up parsing?

I claim that you'll have a better, more user-friendly program if

you'll take the time to define the syntax explicitly. Write down

the syntax equations or draw the railroad-track diagrams, and

code the parser using the techniques I've shown you here. You'll

end up with a better program, and it will be easier to write, to

boot.

GETTING FANCY

OK, at this point we have a pretty nice lexical scanner that will

break an input stream up into tokens. We could use it as it

stands and have a servicable compiler. But there are some other

aspects of lexical scanning that we need to cover.

The main consideration is <shudder> efficiency. Remember when we

were dealing with single-character tokens, every test was a

comparison of a single character, Look, with a byte constant. We

also used the Case statement heavily.

With the multi-character tokens being returned by Scan, all those

tests now become string comparisons. Much slower. And not only

slower, but more awkward, since there is no string equivalent of

the Case statement in Pascal. It seems especially wasteful to

test for what used to be single characters ... the '=', '+', and

other operators ... using string comparisons.

Using string comparison is not impossible ... Ron Cain used just

that approach in writing Small C. Since we're sticking to the

KISS principle here, we would be truly justified in settling for

this approach. But then I would have failed to tell you about

one of the key approaches used in "real" compilers.

You have to remember: the lexical scanner is going to be called a

_LOT_! Once for every token in the whole source program, in

fact. Experiments have indicated that the average compiler

spends anywhere from 20% to 40% of its time in the scanner

routines. If there were ever a place where efficiency deserves

real consideration, this is it.

For this reason, most compiler writers ask the lexical scanner to

do a little more work, by "tokenizing" the input stream. The

idea is to match every token against a list of acceptable

keywords and operators, and return unique codes for each one

recognized. In the case of ordinary variable names or numbers,

we just return a code that says what kind of token they are, and

save the actual string somewhere else.

One of the first things we're going to need is a way to identify

keywords. We can always do it with successive IF tests, but it

surely would be nice if we had a general-purpose routine that

could compare a given string with a table of keywords. (By the

way, we're also going to need such a routine later, for dealing

with symbol tables.) This usually presents a problem in Pascal,

because standard Pascal doesn't allow for arrays of variable

lengths. It's a real bother to have to declare a different

search routine for every table. Standard Pascal also doesn't

allow for initializing arrays, so you tend to see code like

Table[1] := 'IF';

Table[2] := 'ELSE';

Table[n] := 'END';

which can get pretty old if there are many keywords.

Fortunately, Turbo Pascal 4.0 has extensions that eliminate both

of these problems. Constant arrays can be declared using TP's

"typed constant" facility, and the variable dimensions can be

handled with its C-like extensions for pointers.

First, modify your declarations like this:

{--------------------------------------------------------------}

{ Type Declarations }

type Symbol = string[8];

SymTab = array[1..1000] of Symbol;

TabPtr = ^SymTab;

{--------------------------------------------------------------}

(The dimension used in SymTab is not real ... no storage is

allocated by the declaration itself, and the number need only be

"big enough.")

Now, just beneath those declarations, add the following:

{--------------------------------------------------------------}

{ Definition of Keywords and Token Types }

const KWlist: array [1..4] of Symbol =

('IF', 'ELSE', 'ENDIF', 'END');

{--------------------------------------------------------------}

Next, insert the following new function:

{--------------------------------------------------------------}

{ Table Lookup }

{ If the input string matches a table entry, return the entry

index. If not, return a zero. }

function Lookup(T: TabPtr; s: string; n: integer): integer;

var i: integer;

found: boolean;

begin

found := false;

i := n;

while (i > 0) and not found do

if s = T^[i] then

found := true

else

dec(i);

Lookup := i;

end;

{--------------------------------------------------------------}

To test it, you can temporarily change the main program as

follows:

{--------------------------------------------------------------}

{ Main Program }

begin

ReadLn(Token);

WriteLn(Lookup(Addr(KWList), Token, 4));

end.

{--------------------------------------------------------------}

Notice how Lookup is called: The Addr function sets up a pointer

to KWList, which gets passed to Lookup.

OK, give this a try. Since we're bypassing Scan here, you'll

have to type the keywords in upper case to get any matches.

Now that we can recognize keywords, the next thing is to arrange

to return codes for them.

So what kind of code should we return? There are really only two

reasonable choices. This seems like an ideal application for the

Pascal enumerated type. For example, you can define something

SymType = (IfSym, ElseSym, EndifSym, EndSym, Ident, Number,

Operator);

and arrange to return a variable of this type. Let's give it a

try. Insert the line above into your type definitions.

Now, add the two variable declarations:

Token: Symtype; { Current Token }

Value: String[16]; { String Token of Look }

Modify the scanner to read:

{--------------------------------------------------------------}

{ Lexical Scanner }

procedure Scan;

var k: integer;

begin

while Look = CR do

Fin;

if IsAlpha(Look) then begin

Value := GetName;

k := Lookup(Addr(KWlist), Value, 4);

if k = 0 then

Token := Ident

else

Token := SymType(k - 1);

end

else if IsDigit(Look) then begin

Value := GetNum;

Token := Number;

end

else if IsOp(Look) then begin

Value := GetOp;
Token := Operator;

end

else begin

Value := Look;

Token := Operator;

GetChar;

end;

SkipWhite;

end;

{--------------------------------------------------------------}

(Notice that Scan is now a procedure, not a function.)

Finally, modify the main program to read:

{--------------------------------------------------------------}

{ Main Program }

begin

Init;

repeat

Scan;

case Token of

Ident: write('Ident ');

Number: Write('Number ');

Operator: Write('Operator ');

IfSym, ElseSym, EndifSym, EndSym: Write('Keyword ');

end;

Writeln(Value);

until Token = EndSym;

end.

{--------------------------------------------------------------}

What we've done here is to replace the string Token used earlier

with an enumerated type. Scan returns the type in variable Token,

and returns the string itself in the new variable Value.

OK, compile this and give it a whirl. If everything goes right,

you should see that we are now recognizing keywords.

What we have now is working right, and it was easy to generate

from what we had earlier. However, it still seems a little

"busy" to me. We can simplify things a bit by letting GetName,

GetNum, GetOp, and Scan be procedures working with the global

variables Token and Value, thereby eliminating the local copies.

It also seems a little cleaner to move the table lookup into

GetName. The new form for the four procedures is, then:

{--------------------------------------------------------------}

{ Get an Identifier }

procedure GetName;

var k: integer;

begin

Value := '';

if not IsAlpha(Look) then Expected('Name');

while IsAlNum(Look) do begin

Value := Value + UpCase(Look);

GetChar;

end;

k := Lookup(Addr(KWlist), Value, 4);

if k = 0 then

Token := Ident

else

Token := SymType(k-1);

end;

{--------------------------------------------------------------}

{ Get a Number }

procedure GetNum;

begin

Value := '';

if not IsDigit(Look) then Expected('Integer');

while IsDigit(Look) do begin

Value := Value + Look;

GetChar;

end;

Token := Number;

end;

{--------------------------------------------------------------}

{ Get an Operator }

procedure GetOp;
begin

Value := '';

if not IsOp(Look) then Expected('Operator');

while IsOp(Look) do begin

Value := Value + Look;

GetChar;

end;

Token := Operator;

end;

{--------------------------------------------------------------}

{ Lexical Scanner }

procedure Scan;

var k: integer;

begin

while Look = CR do

Fin;

if IsAlpha(Look) then

GetName

else if IsDigit(Look) then

GetNum

else if IsOp(Look) then

GetOp

else begin

Value := Look;

Token := Operator;
GetChar;

end;

SkipWhite;

end;

{--------------------------------------------------------------}

RETURNING A CHARACTER

Essentially every scanner I've ever seen that was written in

Pascal used the mechanism of an enumerated type that I've just

described. It is certainly a workable mechanism, but it doesn't

seem the simplest approach to me.

For one thing, the list of possible symbol types can get pretty

long. Here, I've used just one symbol, "Operator," to stand for

all of the operators, but I've seen other designs that actually

return different codes for each one.

There is, of course, another simple type that can be returned as

a code: the character. Instead of returning the enumeration

value 'Operator' for a '+' sign, what's wrong with just returning

the character itself? A character is just as good a variable for

encoding the different token types, it can be used in case

statements easily, and it's sure a lot easier to type. What

could be simpler?

Besides, we've already had experience with the idea of encoding

keywords as single characters. Our previous programs are already

written that way, so using this approach will minimize the

changes to what we've already done.

Some of you may feel that this idea of returning character codes
is too mickey-mouse. I must admit it gets a little awkward for

multi-character operators like '<='. If you choose to stay with

the enumerated type, fine. For the rest, I'd like to show you

how to change what we've done above to support that approach.

First, you can delete the SymType declaration now ... we won't be

needing that. And you can change the type of Token to char.

Next, to replace SymType, add the following constant string:

const KWcode: string[5] = 'xilee';

(I'll be encoding all idents with the single character 'x'.)

Lastly, modify Scan and its relatives as follows:

{--------------------------------------------------------------}

{ Get an Identifier }

procedure GetName;

begin

Value := '';

if not IsAlpha(Look) then Expected('Name');

while IsAlNum(Look) do begin

Value := Value + UpCase(Look);

GetChar;

end;

Token := KWcode[Lookup(Addr(KWlist), Value, 4) + 1];

end;
{--------------------------------------------------------------}

{ Get a Number }

procedure GetNum;

begin

Value := '';

if not IsDigit(Look) then Expected('Integer');

while IsDigit(Look) do begin

Value := Value + Look;

GetChar;

end;

Token := '#';

end;

{--------------------------------------------------------------}

{ Get an Operator }

procedure GetOp;

begin

Value := '';

if not IsOp(Look) then Expected('Operator');

while IsOp(Look) do begin

Value := Value + Look;

GetChar;

end;

if Length(Value) = 1 then

Token := Value[1]
else

Token := '?';

end;

{--------------------------------------------------------------}

{ Lexical Scanner }

procedure Scan;

var k: integer;

begin

while Look = CR do

Fin;

if IsAlpha(Look) then

GetName

else if IsDigit(Look) then

GetNum

else if IsOp(Look) then begin

GetOp

else begin

Value := Look;

Token := '?';

GetChar;

end;

SkipWhite;

end;

{--------------------------------------------------------------}
{ Main Program }

begin

Init;

repeat

Scan;

case Token of

'x': write('Ident ');

'#': Write('Number ');

'i', 'l', 'e': Write('Keyword ');

else Write('Operator ');

end;

Writeln(Value);

until Value = 'END';

end.

{--------------------------------------------------------------}

This program should work the same as the previous version. A

minor difference in structure, maybe, but it seems more

straightforward to me.

DISTRIBUTED vs CENTRALIZED SCANNERS

The structure for the lexical scanner that I've just shown you is

very conventional, and about 99% of all compilers use something

very close to it. This is not, however, the only possible

structure, or even always the best one.

The problem with the conventional approach is that the scanner

has no knowledge of context. For example, it can't distinguish

between the assignment operator '=' and the relational operator

'=' (perhaps that's why both C and Pascal use different strings

for the two). All the scanner can do is to pass the operator

along to the parser, which can hopefully tell from the context

which operator is meant. Similarly, a keyword like 'IF' has no

place in the middle of a math expression, but if one happens to

appear there, the scanner will see no problem with it, and will

return it to the parser, properly encoded as an 'IF'.

With this kind of approach, we are not really using all the

information at our disposal. In the middle of an expression, for

example, the parser "knows" that there is no need to look for

keywords, but it has no way of telling the scanner that. So the

scanner continues to do so. This, of course, slows down the

compilation.

In real-world compilers, the designers often arrange for more

information to be passed between parser and scanner, just to

avoid this kind of problem. But that can get awkward, and

certainly destroys a lot of the modularity of the structure.

The alternative is to seek some way to use the contextual

information that comes from knowing where we are in the parser.

This leads us back to the notion of a distributed scanner, in

which various portions of the scanner are called depending upon

the context.

In KISS, as in most languages, keywords ONLY appear at the

beginning of a statement. In places like expressions, they are

not allowed. Also, with one minor exception (the multi-character

relops) that is easily handled, all operators are single

characters, which means that we don't need GetOp at all.

So it turns out that even with multi-character tokens, we can

still always tell from the current lookahead character exactly

what kind of token is coming, except at the very beginning of a

statement.

Even at that point, the ONLY kind of token we can accept is an

identifier. We need only to determine if that identifier is a

keyword or the target of an assignment statement.

We end up, then, still needing only GetName and GetNum, which are

used very much as we've used them in earlier installments.

It may seem at first to you that this is a step backwards, and a

rather primitive approach. In fact, it is an improvement over

the classical scanner, since we're using the scanning routines

only where they're really needed. In places where keywords are

not allowed, we don't slow things down by looking for them.

MERGING SCANNER AND PARSER

Now that we've covered all of the theory and general aspects of

lexical scanning that we'll be needing, I'm FINALLY ready to back

up my claim that we can accomodate multi-character tokens with

minimal change to our previous work. To keep things short and

simple I will restrict myself here to a subset of what we've done

before; I'm allowing only one control construct (the IF) and no
Boolean expressions. That's enough to demonstrate the parsing of

both keywords and expressions. The extension to the full set of

constructs should be pretty apparent from what we've already

done.

All the elements of the program to parse this subset, using

single-character tokens, exist already in our previous programs.

I built it by judicious copying of these files, but I wouldn't

dare try to lead you through that process. Instead, to avoid any

confusion, the whole program is shown below:

{--------------------------------------------------------------}

program KISS;

{--------------------------------------------------------------}

{ Constant Declarations }

const TAB = ^I;

CR = ^M;

LF = ^J;

{--------------------------------------------------------------}

{ Type Declarations }

type Symbol = string[8];

SymTab = array[1..1000] of Symbol;

TabPtr = ^SymTab;

{--------------------------------------------------------------}

{ Variable Declarations }

var Look : char; { Lookahead Character }

Lcount: integer; { Label Counter }

{--------------------------------------------------------------}

{ Read New Character From Input Stream }

procedure GetChar;

begin

Read(Look);

end;

{--------------------------------------------------------------}

{ Report an Error }

procedure Error(s: string);

begin

WriteLn;

WriteLn(^G, 'Error: ', s, '.');

end;

{--------------------------------------------------------------}

{ Report Error and Halt }

procedure Abort(s: string);

begin

Error(s);

Halt;

end;

{--------------------------------------------------------------}

{ Report What Was Expected }

procedure Expected(s: string);

begin

Abort(s + ' Expected');

end;

{--------------------------------------------------------------}

{ Recognize an Alpha Character }

function IsAlpha(c: char): boolean;

begin

IsAlpha := UpCase(c) in ['A'..'Z'];

end;

{--------------------------------------------------------------}

{ Recognize a Decimal Digit }

function IsDigit(c: char): boolean;

begin

IsDigit := c in ['0'..'9'];

end;

{--------------------------------------------------------------}

{ Recognize an AlphaNumeric Character }

function IsAlNum(c: char): boolean;

begin

IsAlNum := IsAlpha(c) or IsDigit(c);

end;

{--------------------------------------------------------------}

{ Recognize an Addop }
function IsAddop(c: char): boolean;

begin

IsAddop := c in ['+', '-'];

end;

{--------------------------------------------------------------}

{ Recognize a Mulop }

function IsMulop(c: char): boolean;

begin

IsMulop := c in ['*', '/'];

end;

{--------------------------------------------------------------}

{ Recognize White Space }

function IsWhite(c: char): boolean;

begin

IsWhite := c in [' ', TAB];

end;

{--------------------------------------------------------------}

{ Skip Over Leading White Space }

procedure SkipWhite;

begin

while IsWhite(Look) do

GetChar;

end;
{--------------------------------------------------------------}

{ Match a Specific Input Character }

procedure Match(x: char);

begin

if Look <> x then Expected('''' + x + '''');

GetChar;

SkipWhite;

end;

{--------------------------------------------------------------}

{ Skip a CRLF }

procedure Fin;

begin

if Look = CR then GetChar;

if Look = LF then GetChar;

SkipWhite;

end;

{--------------------------------------------------------------}

{ Get an Identifier }

function GetName: char;

begin

while Look = CR do

Fin;

if not IsAlpha(Look) then Expected('Name');

Getname := UpCase(Look);

GetChar;

SkipWhite;

end;

{--------------------------------------------------------------}

{ Get a Number }

function GetNum: char;

begin

if not IsDigit(Look) then Expected('Integer');

GetNum := Look;

GetChar;

SkipWhite;

end;

{--------------------------------------------------------------}

{ Generate a Unique Label }

function NewLabel: string;

var S: string;

begin

Str(LCount, S);

NewLabel := 'L' + S;

Inc(LCount);

end;

{--------------------------------------------------------------}
{ Post a Label To Output }

procedure PostLabel(L: string);

begin

WriteLn(L, ':');

end;

{--------------------------------------------------------------}

{ Output a String with Tab }

procedure Emit(s: string);

begin

Write(TAB, s);

end;

{--------------------------------------------------------------}

{ Output a String with Tab and CRLF }

procedure EmitLn(s: string);

begin

Emit(s);

WriteLn;

end;

{---------------------------------------------------------------}

{ Parse and Translate an Identifier }

procedure Ident;

var Name: char;

begin
Name := GetName;

if Look = '(' then begin

Match('(');

Match(')');

EmitLn('BSR ' + Name);

end

else

EmitLn('MOVE ' + Name + '(PC),D0');

end;

{---------------------------------------------------------------}

{ Parse and Translate a Math Factor }

procedure Expression; Forward;

procedure Factor;

begin

if Look = '(' then begin

Match('(');

Expression;

Match(')');

end

else if IsAlpha(Look) then

Ident

else

EmitLn('MOVE #' + GetNum + ',D0');

end;
{---------------------------------------------------------------}

{ Parse and Translate the First Math Factor }

procedure SignedFactor;

var s: boolean;

begin

s := Look = '-';

if IsAddop(Look) then begin

GetChar;

SkipWhite;

end;

Factor;

if s then

EmitLn('NEG D0');

end;

{--------------------------------------------------------------}

{ Recognize and Translate a Multiply }

procedure Multiply;

begin

Match('*');

Factor;

EmitLn('MULS (SP)+,D0');

end;

{-------------------------------------------------------------}
{ Recognize and Translate a Divide }

procedure Divide;

begin

Match('/');

Factor;

EmitLn('MOVE (SP)+,D1');

EmitLn('EXS.L D0');

EmitLn('DIVS D1,D0');

end;

{---------------------------------------------------------------}

{ Completion of Term Processing (called by Term and FirstTerm }

procedure Term1;

begin

while IsMulop(Look) do begin

EmitLn('MOVE D0,-(SP)');

case Look of

'*': Multiply;

'/': Divide;

end;

{---------------------------------------------------------------}

{ Parse and Translate a Math Term }

procedure Term;

begin
Factor;

Term1;

end;

{---------------------------------------------------------------}

{ Parse and Translate a Math Term with Possible Leading Sign }

procedure FirstTerm;

begin

SignedFactor;

Term1;

end;

{---------------------------------------------------------------}

{ Recognize and Translate an Add }

procedure Add;

begin

Match('+');

Term;

EmitLn('ADD (SP)+,D0');

end;

{---------------------------------------------------------------}

{ Recognize and Translate a Subtract }

procedure Subtract;

begin

Match('-');
Term;

EmitLn('SUB (SP)+,D0');

EmitLn('NEG D0');

end;

{---------------------------------------------------------------}

{ Parse and Translate an Expression }

procedure Expression;

begin

FirstTerm;

while IsAddop(Look) do begin

EmitLn('MOVE D0,-(SP)');

case Look of

'+': Add;

'-': Subtract;

end;

{---------------------------------------------------------------}

{ Parse and Translate a Boolean Condition }

{ This version is a dummy }

Procedure Condition;

begin

EmitLn('Condition');

end;
{---------------------------------------------------------------}

{ Recognize and Translate an IF Construct }

procedure Block;

Forward;

procedure DoIf;

var L1, L2: string;

begin

Match('i');

Condition;

L1 := NewLabel;

L2 := L1;

EmitLn('BEQ ' + L1);

Block;

if Look = 'l' then begin

Match('l');

L2 := NewLabel;

EmitLn('BRA ' + L2);

PostLabel(L1);

Block;

end;

PostLabel(L2);

Match('e');

end;

{--------------------------------------------------------------}
{ Parse and Translate an Assignment Statement }

procedure Assignment;

var Name: char;

begin

Name := GetName;

Match('=');

Expression;

EmitLn('LEA ' + Name + '(PC),A0');

EmitLn('MOVE D0,(A0)');

end;

{--------------------------------------------------------------}

{ Recognize and Translate a Statement Block }

procedure Block;

begin

while not(Look in ['e', 'l']) do begin

case Look of

'i': DoIf;

CR: while Look = CR do

Fin;

else Assignment;

end;

{--------------------------------------------------------------}

{ Parse and Translate a Program }

procedure DoProgram;

begin

Block;

if Look <> 'e' then Expected('END');

EmitLn('END')

end;

{--------------------------------------------------------------}

{ Initialize }

procedure Init;

begin

LCount := 0;

GetChar;

end;

{--------------------------------------------------------------}

{ Main Program }

begin

Init;

DoProgram;

end.

{--------------------------------------------------------------}

A couple of comments:

(1) The form for the expression parser, using FirstTerm, etc.,

is a little different from what you've seen before. It's

yet another variation on the same theme. Don't let it throw

you ... the change is not required for what follows.

(2) Note that, as usual, I had to add calls to Fin at strategic

spots to allow for multiple lines.

Before we proceed to adding the scanner, first copy this file and

verify that it does indeed parse things correctly. Don't forget

the "codes": 'i' for IF, 'l' for ELSE, and 'e' for END or ENDIF.

If the program works, then let's press on. In adding the scanner

modules to the program, it helps to have a systematic plan. In

all the parsers we've written to date, we've stuck to a

convention that the current lookahead character should always be

a non-blank character. We preload the lookahead character in

Init, and keep the "pump primed" after that. To keep the thing

working right at newlines, we had to modify this a bit and treat

the newline as a legal token.

In the multi-character version, the rule is similar: The current

lookahead character should always be left at the BEGINNING of the

next token, or at a newline.

The multi-character version is shown next. To get it, I've made

the following changes:

o Added the variables Token and Value, and the type definitions

needed by Lookup.

o Added the definitions of KWList and KWcode.

o Added Lookup.

o Replaced GetName and GetNum by their multi-character versions.

(Note that the call to Lookup has been moved out of GetName,

so that it will not be executed for calls within an

expression.)

o Created a new, vestigial Scan that calls GetName, then scans

for keywords.

o Created a new procedure, MatchString, that looks for a

specific keyword. Note that, unlike Match, MatchString does

NOT read the next keyword.

o Modified Block to call Scan.

o Changed the calls to Fin a bit. Fin is now called within

GetName.

Here is the program in its entirety:

{--------------------------------------------------------------}

program KISS;

{--------------------------------------------------------------}

{ Constant Declarations }

const TAB = ^I;

CR = ^M;

LF = ^J;

{--------------------------------------------------------------}

{ Type Declarations }

type Symbol = string[8];

SymTab = array[1..1000] of Symbol;

TabPtr = ^SymTab;
{--------------------------------------------------------------}

{ Variable Declarations }

var Look : char; { Lookahead Character }

Token : char; { Encoded Token }

Value : string[16]; { Unencoded Token }

Lcount: integer; { Label Counter }

{--------------------------------------------------------------}

{ Definition of Keywords and Token Types }

const KWlist: array [1..4] of Symbol =

('IF', 'ELSE', 'ENDIF', 'END');

const KWcode: string[5] = 'xilee';

{--------------------------------------------------------------}

{ Read New Character From Input Stream }

procedure GetChar;

begin

Read(Look);

end;

{--------------------------------------------------------------}

{ Report an Error }

procedure Error(s: string);

begin

WriteLn;

WriteLn(^G, 'Error: ', s, '.');

end;
{--------------------------------------------------------------}

{ Report Error and Halt }

procedure Abort(s: string);

begin

Error(s);

Halt;

end;

{--------------------------------------------------------------}

{ Report What Was Expected }

procedure Expected(s: string);

begin

Abort(s + ' Expected');

end;

{--------------------------------------------------------------}

{ Recognize an Alpha Character }

function IsAlpha(c: char): boolean;

begin

IsAlpha := UpCase(c) in ['A'..'Z'];

end;

{--------------------------------------------------------------}

{ Recognize a Decimal Digit }

function IsDigit(c: char): boolean;

begin
IsDigit := c in ['0'..'9'];

end;

{--------------------------------------------------------------}

{ Recognize an AlphaNumeric Character }

function IsAlNum(c: char): boolean;

begin

IsAlNum := IsAlpha(c) or IsDigit(c);

end;

{--------------------------------------------------------------}

{ Recognize an Addop }

function IsAddop(c: char): boolean;

begin

IsAddop := c in ['+', '-'];

end;

{--------------------------------------------------------------}

{ Recognize a Mulop }

function IsMulop(c: char): boolean;

begin

IsMulop := c in ['*', '/'];

end;

{--------------------------------------------------------------}

{ Recognize White Space }

function IsWhite(c: char): boolean;

begin

IsWhite := c in [' ', TAB];

end;

{--------------------------------------------------------------}

{ Skip Over Leading White Space }

procedure SkipWhite;

begin

while IsWhite(Look) do

GetChar;

end;

{--------------------------------------------------------------}

{ Match a Specific Input Character }

procedure Match(x: char);

begin

if Look <> x then Expected('''' + x + '''');

GetChar;

SkipWhite;

end;

{--------------------------------------------------------------}

{ Skip a CRLF }

procedure Fin;

begin
if Look = CR then GetChar;

if Look = LF then GetChar;

SkipWhite;

end;

{--------------------------------------------------------------}

{ Table Lookup }

function Lookup(T: TabPtr; s: string; n: integer): integer;

var i: integer;

found: boolean;

begin

found := false;

i := n;

while (i > 0) and not found do

if s = T^[i] then

found := true

else

dec(i);

Lookup := i;

end;

{--------------------------------------------------------------}

{ Get an Identifier }

procedure GetName;

begin

while Look = CR do
Fin;

if not IsAlpha(Look) then Expected('Name');

Value := '';

while IsAlNum(Look) do begin

Value := Value + UpCase(Look);

GetChar;

end;

SkipWhite;

end;

{--------------------------------------------------------------}

{ Get a Number }

procedure GetNum;

begin

if not IsDigit(Look) then Expected('Integer');

Value := '';

while IsDigit(Look) do begin

Value := Value + Look;

GetChar;

end;

Token := '#';

SkipWhite;

end;

{--------------------------------------------------------------}

{ Get an Identifier and Scan it for Keywords }

procedure Scan;

begin

GetName;

Token := KWcode[Lookup(Addr(KWlist), Value, 4) + 1];

end;

{--------------------------------------------------------------}

{ Match a Specific Input String }

procedure MatchString(x: string);

begin

if Value <> x then Expected('''' + x + '''');

end;

{--------------------------------------------------------------}

{ Generate a Unique Label }

function NewLabel: string;

var S: string;

begin

Str(LCount, S);

NewLabel := 'L' + S;

Inc(LCount);

end;

{--------------------------------------------------------------}

{ Post a Label To Output }

procedure PostLabel(L: string);

begin
WriteLn(L, ':');

end;

{--------------------------------------------------------------}

{ Output a String with Tab }

procedure Emit(s: string);

begin

Write(TAB, s);

end;

{--------------------------------------------------------------}

{ Output a String with Tab and CRLF }

procedure EmitLn(s: string);

begin

Emit(s);

WriteLn;

end;

{---------------------------------------------------------------}

{ Parse and Translate an Identifier }

procedure Ident;

begin

GetName;

if Look = '(' then begin

Match('(');

Match(')');
EmitLn('BSR ' + Value);

end

else

EmitLn('MOVE ' + Value + '(PC),D0');

end;

{---------------------------------------------------------------}

{ Parse and Translate a Math Factor }

procedure Expression; Forward;

procedure Factor;

begin

if Look = '(' then begin

Match('(');

Expression;

Match(')');

end

else if IsAlpha(Look) then

Ident

else begin

GetNum;

EmitLn('MOVE #' + Value + ',D0');

end;

{---------------------------------------------------------------}

{ Parse and Translate the First Math Factor }

procedure SignedFactor;

var s: boolean;

begin

s := Look = '-';

if IsAddop(Look) then begin

GetChar;

SkipWhite;

end;

Factor;

if s then

EmitLn('NEG D0');

end;

{--------------------------------------------------------------}

{ Recognize and Translate a Multiply }

procedure Multiply;

begin

Match('*');

Factor;

EmitLn('MULS (SP)+,D0');

end;

{-------------------------------------------------------------}

{ Recognize and Translate a Divide }

procedure Divide;

begin
Match('/');

Factor;

EmitLn('MOVE (SP)+,D1');

EmitLn('EXS.L D0');

EmitLn('DIVS D1,D0');

end;

{---------------------------------------------------------------}

{ Completion of Term Processing (called by Term and FirstTerm }

procedure Term1;

begin

while IsMulop(Look) do begin

EmitLn('MOVE D0,-(SP)');

case Look of

'*': Multiply;

'/': Divide;

end;

{---------------------------------------------------------------}

{ Parse and Translate a Math Term }

procedure Term;

begin

Factor;

Term1;

end;
{---------------------------------------------------------------}

{ Parse and Translate a Math Term with Possible Leading Sign }

procedure FirstTerm;

begin

SignedFactor;

Term1;

end;

{---------------------------------------------------------------}

{ Recognize and Translate an Add }

procedure Add;

begin

Match('+');

Term;

EmitLn('ADD (SP)+,D0');

end;

{---------------------------------------------------------------}

{ Recognize and Translate a Subtract }

procedure Subtract;

begin

Match('-');

Term;

EmitLn('SUB (SP)+,D0');

EmitLn('NEG D0');
end;

{---------------------------------------------------------------}

{ Parse and Translate an Expression }

procedure Expression;

begin

FirstTerm;

while IsAddop(Look) do begin

EmitLn('MOVE D0,-(SP)');

case Look of

'+': Add;

'-': Subtract;

end;

{---------------------------------------------------------------}

{ Parse and Translate a Boolean Condition }

{ This version is a dummy }

Procedure Condition;

begin

EmitLn('Condition');

end;

{---------------------------------------------------------------}

{ Recognize and Translate an IF Construct }

procedure Block; Forward;

procedure DoIf;

var L1, L2: string;

begin

Condition;

L1 := NewLabel;

L2 := L1;

EmitLn('BEQ ' + L1);

Block;

if Token = 'l' then begin

L2 := NewLabel;

EmitLn('BRA ' + L2);

PostLabel(L1);

Block;

end;

PostLabel(L2);

MatchString('ENDIF');

end;

{--------------------------------------------------------------}

{ Parse and Translate an Assignment Statement }

procedure Assignment;

var Name: string;

begin

Name := Value;
Match('=');

Expression;

EmitLn('LEA ' + Name + '(PC),A0');

EmitLn('MOVE D0,(A0)');

end;

{--------------------------------------------------------------}

{ Recognize and Translate a Statement Block }

procedure Block;

begin

Scan;

while not (Token in ['e', 'l']) do begin

case Token of

'i': DoIf;

else Assignment;

end;

Scan;

end;

{--------------------------------------------------------------}

{ Parse and Translate a Program }

procedure DoProgram;

begin

Block;

MatchString('END');
EmitLn('END')

end;

{--------------------------------------------------------------}

{ Initialize }

procedure Init;

begin

LCount := 0;

GetChar;

end;

{--------------------------------------------------------------}

{ Main Program }

begin

Init;

DoProgram;

end.

{--------------------------------------------------------------}

Compare this program with its single-character counterpart. I

think you will agree that the differences are minor.

CONCLUSION

At this point, you have learned how to parse and generate code

for expressions, Boolean expressions, and control structures.

You have now learned how to develop lexical scanners, and how to
incorporate their elements into a translator. You have still not

seen ALL the elements combined into one program, but on the basis

of what we've done before you should find it a straightforward

matter to extend our earlier programs to include scanners.

We are very close to having all the elements that we need to

build a real, functional compiler. There are still a few things

missing, notably procedure calls and type definitions. We will

deal with those in the next few sessions. Before doing so,

however, I thought it would be fun to turn the translator above

into a true compiler. That's what we'll be doing in the next

installment.

Up till now, we've taken a rather bottom-up approach to parsing,

beginning with low-level constructs and working our way up. In

the next installment, I'll also be taking a look from the top

down, and we'll discuss how the structure of the translator is

altered by changes in the language definition.

See you then.

GET YOUR DOMAIN:
type here .com
Get It Now >>
[ Amit Mathur Online Homepage ]

LET'S BUILD A COMPILER!

By
Jack W. Crenshaw, Ph.D.
24 July 1988
Copyright : Jack W. Crenshaw
This book can be distributed as it is with above information

[ eBooks Homepage ] [ This Book Homepage ] [ Discussion Forum ]

Part VIII: A LITTLE PHILOSOPHY

INTRODUCTION

This is going to be a different kind of session than the others

in our series on parsing and compiler construction. For this

session, there won't be any experiments to do or code to write.

This once, I'd like to just talk with you for a while.

Mercifully, it will be a short session, and then we can take up

where we left off, hopefully with renewed vigor.

When I was in college, I found that I could always follow a

prof's lecture a lot better if I knew where he was going with it.

I'll bet you were the same.

So I thought maybe it's about time I told you where we're going

with this series: what's coming up in future installments, and in

general what all this is about. I'll also share some general
thoughts concerning the usefulness of what we've been doing.

THE ROAD HOME

So far, we've covered the parsing and translation of arithmetic

expressions, Boolean expressions, and combinations connected by

relational operators. We've also done the same for control

constructs. In all of this we've leaned heavily on the use of

top-down, recursive descent parsing, BNF definitions of the

syntax, and direct generation of assembly-language code. We also

learned the value of such tricks as single-character tokens to

help us see the forest through the trees. In the last

installment we dealt with lexical scanning, and I showed you

simple but powerful ways to remove the single-character barriers.

Throughout the whole study, I've emphasized the KISS philosophy

... Keep It Simple, Sidney ... and I hope by now you've realized

just how simple this stuff can really be. While there are for

sure areas of compiler theory that are truly intimidating, the

ultimate message of this series is that in practice you can just

politely sidestep many of these areas. If the language

definition cooperates or, as in this series, if you can define

the language as you go, it's possible to write down the language

definition in BNF with reasonable ease. And, as we've seen, you

can crank out parse procedures from the BNF just about as fast as

you can type.

As our compiler has taken form, it's gotten more parts, but each

part is quite small and simple, and very much like all the
others.

At this point, we have many of the makings of a real, practical

compiler. As a matter of fact, we already have all we need to

build a toy compiler for a language as powerful as, say, Tiny

BASIC. In the next couple of installments, we'll go ahead and

define that language.

To round out the series, we still have a few items to cover.

These include:

o Procedure calls, with and without parameters

o Local and global variables

o Basic types, such as character and integer types

o Arrays

o Strings

o User-defined types and structures

o Tree-structured parsers and intermediate languages

o Optimization

These will all be covered in future installments. When we're

finished, you'll have all the tools you need to design and build

your own languages, and the compilers to translate them.

I can't design those languages for you, but I can make some

comments and recommendations. I've already sprinkled some

throughout past installments. You've seen, for example, the

control constructs I prefer.

These constructs are going to be part of the languages I build.

I have three languages in mind at this point, two of which you

will see in installments to come:

TINY - A minimal, but usable language on the order of Tiny

BASIC or Tiny C. It won't be very practical, but it will

have enough power to let you write and run real programs

that do something worthwhile.

KISS - The language I'm building for my own use. KISS is

intended to be a systems programming language. It won't

have strong typing or fancy data structures, but it will

support most of the things I want to do with a higher-

order language (HOL), except perhaps writing compilers.

I've also been toying for years with the idea of a HOL-like

assembler, with structured control constructs and HOL-like

assignment statements. That, in fact, was the impetus behind my

original foray into the jungles of compiler theory. This one may

never be built, simply because I've learned that it's actually

easier to implement a language like KISS, that only uses a subset

of the CPU instructions. As you know, assembly language can be

bizarre and irregular in the extreme, and a language that maps

one-for-one onto it can be a real challenge. Still, I've always

felt that the syntax used in conventional assemblers is dumb ...

why is

MOVE.L A,B

better, or easier to translate, than

B=A ?

I think it would be an interesting exercise to develop a

"compiler" that would give the programmer complete access to and

control over the full complement of the CPU instruction set, and
would allow you to generate programs as efficient as assembly

language, without the pain of learning a set of mnemonics. Can

it be done? I don't know. The real question may be, "Will the

resulting language be any easier to write than assembly"? If

not, there's no point in it. I think that it can be done, but

I'm not completely sure yet how the syntax should look.

Perhaps you have some comments or suggestions on this one. I'd

love to hear them.

You probably won't be surprised to learn that I've already worked

ahead in most of the areas that we will cover. I have some good

news: Things never get much harder than they've been so far.

It's possible to build a complete, working compiler for a real

language, using nothing but the same kinds of techniques you've

learned so far. And THAT brings up some interesting questions.

WHY IS IT SO SIMPLE?

Before embarking on this series, I always thought that compilers

were just naturally complex computer programs ... the ultimate

challenge. Yet the things we have done here have usually turned

out to be quite simple, sometimes even trivial.

For awhile, I thought is was simply because I hadn't yet gotten

into the meat of the subject. I had only covered the simple

parts. I will freely admit to you that, even when I began the

series, I wasn't sure how far we would be able to go before

things got too complex to deal with in the ways we have so far.

But at this point I've already been down the road far enough to
see the end of it. Guess what?

THERE ARE NO HARD PARTS!

Then, I thought maybe it was because we were not generating very

good object code. Those of you who have been following the

series and trying sample compiles know that, while the code works

and is rather foolproof, its efficiency is pretty awful. I

figured that if we were concentrating on turning out tight code,

we would soon find all that missing complexity.

To some extent, that one is true. In particular, my first few

efforts at trying to improve efficiency introduced complexity at

an alarming rate. But since then I've been tinkering around with

some simple optimizations and I've found some that result in very

respectable code quality, WITHOUT adding a lot of complexity.

Finally, I thought that perhaps the saving grace was the "toy

compiler" nature of the study. I have made no pretense that we

were ever going to be able to build a compiler to compete with

Borland and Microsoft. And yet, again, as I get deeper into this

thing the differences are starting to fade away.

Just to make sure you get the message here, let me state it flat

out:

USING THE TECHNIQUES WE'VE USED HERE, IT IS POSSIBLE TO

BUILD A PRODUCTION-QUALITY, WORKING COMPILER WITHOUT ADDING

A LOT OF COMPLEXITY TO WHAT WE'VE ALREADY DONE.

Since the series began I've received some comments from you.
Most of them echo my own thoughts: "This is easy! Why do the

textbooks make it seem so hard?" Good question.

Recently, I've gone back and looked at some of those texts again,

and even bought and read some new ones. Each time, I come away

with the same feeling: These guys have made it seem too hard.

What's going on here? Why does the whole thing seem difficult in

the texts, but easy to us? Are we that much smarter than Aho,

Ullman, Brinch Hansen, and all the rest?

Hardly. But we are doing some things differently, and more and

more I'm starting to appreciate the value of our approach, and

the way that it simplifies things. Aside from the obvious

shortcuts that I outlined in Part I, like single-character tokens

and console I/O, we have made some implicit assumptions and done

some things differently from those who have designed compilers in

the past. As it turns out, our approach makes life a lot easier.

So why didn't all those other guys use it?

You have to remember the context of some of the earlier compiler

development. These people were working with very small computers

of limited capacity. Memory was very limited, the CPU

instruction set was minimal, and programs ran in batch mode

rather than interactively. As it turns out, these caused some

key design decisions that have really complicated the designs.

Until recently, I hadn't realized how much of classical compiler

design was driven by the available hardware.

Even in cases where these limitations no longer apply, people

have tended to structure their programs in the same way, since

that is the way they were taught to do it.

In our case, we have started with a blank sheet of paper. There

is a danger there, of course, that you will end up falling into

traps that other people have long since learned to avoid. But it

also has allowed us to take different approaches that, partly by

design and partly by pure dumb luck, have allowed us to gain

simplicity.

Here are the areas that I think have led to complexity in the

past:

o Limited RAM Forcing Multiple Passes

I just read "Brinch Hansen on Pascal Compilers" (an

excellent book, BTW). He developed a Pascal compiler for a

PC, but he started the effort in 1981 with a 64K system, and

so almost every design decision he made was aimed at making

the compiler fit into RAM. To do this, his compiler has

three passes, one of which is the lexical scanner. There is

no way he could, for example, use the distributed scanner I

introduced in the last installment, because the program

structure wouldn't allow it. He also required not one but

two intermediate languages, to provide the communication

between phases.

All the early compiler writers had to deal with this issue:

Break the compiler up into enough parts so that it will fit

in memory. When you have multiple passes, you need to add

data structures to support the information that each pass

leaves behind for the next. That adds complexity, and ends
up driving the design. Lee's book, "The Anatomy of a

Compiler," mentions a FORTRAN compiler developed for an IBM

1401. It had no fewer than 63 separate passes! Needless to

say, in a compiler like this the separation into phases

would dominate the design.

Even in situations where RAM is plentiful, people have

tended to use the same techniques because that is what

they're familiar with. It wasn't until Turbo Pascal came

along that we found how simple a compiler could be if you

started with different assumptions.

o Batch Processing

In the early days, batch processing was the only choice ...

there was no interactive computing. Even today, compilers

run in essentially batch mode.

In a mainframe compiler as well as many micro compilers,

considerable effort is expended on error recovery ... it can

consume as much as 30-40% of the compiler and completely

drive the design. The idea is to avoid halting on the first

error, but rather to keep going at all costs, so that you

can tell the programmer about as many errors in the whole

program as possible.

All of that harks back to the days of the early mainframes,

where turnaround time was measured in hours or days, and it

was important to squeeze every last ounce of information out

of each run.
In this series, I've been very careful to avoid the issue of

error recovery, and instead our compiler simply halts with

an error message on the first error. I will frankly admit

that it was mostly because I wanted to take the easy way out

and keep things simple. But this approach, pioneered by

Borland in Turbo Pascal, also has a lot going for it anyway.

Aside from keeping the compiler simple, it also fits very

well with the idea of an interactive system. When

compilation is fast, and especially when you have an editor

such as Borland's that will take you right to the point of

the error, then it makes a lot of sense to stop there, and

just restart the compilation after the error is fixed.

o Large Programs

Early compilers were designed to handle large programs ...

essentially infinite ones. In those days there was little

choice; the idea of subroutine libraries and separate

compilation were still in the future. Again, this

assumption led to multi-pass designs and intermediate files

to hold the results of partial processing.

Brinch Hansen's stated goal was that the compiler should be

able to compile itself. Again, because of his limited RAM,

this drove him to a multi-pass design. He needed as little

resident compiler code as possible, so that the necessary

tables and other data structures would fit into RAM.

I haven't stated this one yet, because there hasn't been a

need ... we've always just read and written the data as

streams, anyway. But for the record, my plan has always

been that, in a production compiler, the source and object

data should all coexist in RAM with the compiler, a la the

early Turbo Pascals. That's why I've been careful to keep

routines like GetChar and Emit as separate routines, in

spite of their small size. It will be easy to change them

to read to and write from memory.

o Emphasis on Efficiency

John Backus has stated that, when he and his colleagues

developed the original FORTRAN compiler, they KNEW that they

had to make it produce tight code. In those days, there was

a strong sentiment against HOLs and in favor of assembly

language, and efficiency was the reason. If FORTRAN didn't

produce very good code by assembly standards, the users

would simply refuse to use it. For the record, that FORTRAN

compiler turned out to be one of the most efficient ever

built, in terms of code quality. But it WAS complex!

Today, we have CPU power and RAM size to spare, so code

efficiency is not so much of an issue. By studiously

ignoring this issue, we have indeed been able to Keep It

Simple. Ironically, though, as I have said, I have found

some optimizations that we can add to the basic compiler

structure, without having to add a lot of complexity. So in

this case we get to have our cake and eat it too: we will
end up with reasonable code quality, anyway.

o Limited Instruction Sets

The early computers had primitive instruction sets. Things

that we take for granted, such as stack operations and

indirect addressing, came only with great difficulty.

Example: In most compiler designs, there is a data structure

called the literal pool. The compiler typically identifies

all literals used in the program, and collects them into a

single data structure. All references to the literals are

done indirectly to this pool. At the end of the

compilation, the compiler issues commands to set aside

storage and initialize the literal pool.

We haven't had to address that issue at all. When we want

to load a literal, we just do it, in line, as in

MOVE #3,D0

There is something to be said for the use of a literal pool,

particularly on a machine like the 8086 where data and code

can be separated. Still, the whole thing adds a fairly

large amount of complexity with little in return.

Of course, without the stack we would be lost. In a micro,

both subroutine calls and temporary storage depend heavily

on the stack, and we have used it even more than necessary

to ease expression parsing.

o Desire for Generality

Much of the content of the typical compiler text is taken up

with issues we haven't addressed here at all ... things like

automated translation of grammars, or generation of LALR

parse tables. This is not simply because the authors want

to impress you. There are good, practical reasons why the

subjects are there.

We have been concentrating on the use of a recursive-descent

parser to parse a deterministic grammar, i.e., a grammar

that is not ambiguous and, therefore, can be parsed with one

level of lookahead. I haven't made much of this limitation,

but the fact is that this represents a small subset of

possible grammars. In fact, there is an infinite number of

grammars that we can't parse using our techniques. The LR

technique is a more powerful one, and can deal with grammars

that we can't.

In compiler theory, it's important to know how to deal with

these other grammars, and how to transform them into

grammars that are easier to deal with. For example, many

(but not all) ambiguous grammars can be transformed into

unambiguous ones. The way to do this is not always obvious,

though, and so many people have devoted years to develop

ways to transform them automatically.

In practice, these issues turn out to be considerably less

important. Modern languages tend to be designed to be easy

to parse, anyway. That was a key motivation in the design

of Pascal. Sure, there are pathological grammars that you

would be hard pressed to write unambiguous BNF for, but in

the real world the best answer is probably to avoid those

grammars!

In our case, of course, we have sneakily let the language

evolve as we go, so we haven't painted ourselves into any

corners here. You may not always have that luxury. Still,

with a little care you should be able to keep the parser

simple without having to resort to automatic translation of

the grammar.

We have taken a vastly different approach in this series. We

started with a clean sheet of paper, and developed techniques

that work in the context that we are in; that is, a single-user

PC with rather ample CPU power and RAM space. We have limited

ourselves to reasonable grammars that are easy to parse, we have

used the instruction set of the CPU to advantage, and we have not

concerned ourselves with efficiency. THAT's why it's been easy.

Does this mean that we are forever doomed to be able to build

only toy compilers? No, I don't think so. As I've said, we can

add certain optimizations without changing the compiler

structure. If we want to process large files, we can always add

file buffering to do that. These things do not affect the

overall program design.

And I think that's a key factor. By starting with small and

limited cases, we have been able to concentrate on a structure

for the compiler that is natural for the job. Since the
structure naturally fits the job, it is almost bound to be simple

and transparent. Adding capability doesn't have to change that

basic structure. We can simply expand things like the file

structure or add an optimization layer. I guess my feeling is

that, back when resources were tight, the structures people ended

up with were artificially warped to make them work under those

conditions, and weren't optimum structures for the problem at

hand.

CONCLUSION

Anyway, that's my arm-waving guess as to how we've been able to

keep things simple. We started with something simple and let it

evolve naturally, without trying to force it into some

traditional mold.

We're going to press on with this. I've given you a list of the

areas we'll be covering in future installments. With those

installments, you should be able to build complete, working

compilers for just about any occasion, and build them simply. If

you REALLY want to build production-quality compilers, you'll be

able to do that, too.

For those of you who are chafing at the bit for more parser code,

I apologize for this digression. I just thought you'd like to

have things put into perspective a bit. Next time, we'll get

back to the mainstream of the tutorial.

So far, we've only looked at pieces of compilers, and while we

have many of the makings of a complete language, we haven't

talked about how to put it all together. That will be the

subject of our next two installments. Then we'll press on into

the new subjects I listed at the beginning of this installment.

See you then.

Get a domain & site:
type here .com
Get It Now >>
[ Amit Mathur Online Homepage ]

LET'S BUILD A COMPILER!

By
Jack W. Crenshaw, Ph.D.
24 July 1988
Copyright : Jack W. Crenshaw
This book can be distributed as it is with above information

[ eBooks Homepage ] [ This Book Homepage ] [ Discussion Forum ]

Part IX: A TOP VIEW

INTRODUCTION

In the previous installments, we have learned many of the

techniques required to build a full-blown compiler. We've done

both assignment statements (with Boolean and arithmetic

expressions), relational operators, and control constructs. We

still haven't addressed procedure or function calls, but even so

we could conceivably construct a mini-language without them.

I've always thought it would be fun to see just how small a

language one could build that would still be useful. We're

ALMOST in a position to do that now. The problem is: though we

know how to parse and translate the constructs, we still don't

know quite how to put them all together into a language.

In those earlier installments, the development of our programs

had a decidedly bottom-up flavor. In the case of expression

parsing, for example, we began with the very lowest level

constructs, the individual constants and variables, and worked

our way up to more complex expressions.

Most people regard the top-down design approach as being better

than the bottom-up one. I do too, but the way we did it

certainly seemed natural enough for the kinds of things we were

parsing.

You mustn't get the idea, though, that the incremental approach

that we've been using in all these tutorials is inherently

bottom-up. In this installment I'd like to show you that the

approach can work just as well when applied from the top down ...

maybe better. We'll consider languages such as C and Pascal, and

see how complete compilers can be built starting from the top.

In the next installment, we'll apply the same technique to build

a complete translator for a subset of the KISS language, which

I'll be calling TINY. But one of my goals for this series is

that you will not only be able to see how a compiler for TINY or

KISS works, but that you will also be able to design and build

compilers for your own languages. The C and Pascal examples will

help. One thing I'd like you to see is that the natural

structure of the compiler depends very much on the language being

translated, so the simplicity and ease of construction of the

compiler depends very much on letting the language set the

program structure.

It's a bit much to produce a full C or Pascal compiler here, and

we won't try. But we can flesh out the top levels far enough so
that you can see how it goes.

Let's get started.

THE TOP LEVEL

One of the biggest mistakes people make in a top-down design is

failing to start at the true top. They think they know what the

overall structure of the design should be, so they go ahead and

write it down.

Whenever I start a new design, I always like to do it at the

absolute beginning. In program design language (PDL), this top

level looks something like:

begin

solve the problem

end

OK, I grant you that this doesn't give much of a hint as to what

the next level is, but I like to write it down anyway, just to

give me that warm feeling that I am indeed starting at the top.

For our problem, the overall function of a compiler is to compile

a complete program. Any definition of the language, written in

BNF, begins here. What does the top level BNF look like? Well,

that depends quite a bit on the language to be translated. Let's

take a look at Pascal.

THE STRUCTURE OF PASCAL

Most texts for Pascal include a BNF or "railroad-track"

definition of the language. Here are the first few lines of one:

<program> ::= <program-header> <block> '.'

<program-header> ::= PROGRAM <ident>

<block> ::= <declarations> <statements>

We can write recognizers to deal with each of these elements,

just as we've done before. For each one, we'll use our familiar

single-character tokens to represent the input, then flesh things

out a little at a time. Let's begin with the first recognizer:

the program itself.

To translate this, we'll start with a fresh copy of the Cradle.

Since we're back to single-character names, we'll just use a 'p'

to stand for 'PROGRAM.'

To a fresh copy of the cradle, add the following code, and insert

a call to it from the main program:

{--------------------------------------------------------------}

{ Parse and Translate A Program }

procedure Prog;

var Name: char;

begin

Match('p'); { Handles program header part }

Name := GetName;

Prolog(Name);
Match('.');

Epilog(Name);

end;

{--------------------------------------------------------------}

The procedures Prolog and Epilog perform whatever is required to

let the program interface with the operating system, so that it

can execute as a program. Needless to say, this part will be

VERY OS-dependent. Remember, I've been emitting code for a 68000

running under the OS I use, which is SK*DOS. I realize most of

you are using PC's and would rather see something else, but I'm

in this thing too deep to change now!

Anyhow, SK*DOS is a particularly easy OS to interface to. Here

is the code for Prolog and Epilog:

{--------------------------------------------------------------}

{ Write the Prolog }

procedure Prolog;

begin

EmitLn('WARMST EQU $A01E');

end;

{--------------------------------------------------------------}

{ Write the Epilog }

procedure Epilog(Name: char);

begin
EmitLn('DC WARMST');

EmitLn('END ' + Name);

end;

{--------------------------------------------------------------}

As usual, add this code and try out the "compiler." At this

point, there is only one legal input:

px. (where x is any single letter, the program name)

Well, as usual our first effort is rather unimpressive, but by

now I'm sure you know that things will get more interesting.

There is one important thing to note: THE OUTPUT IS A WORKING,

COMPLETE, AND EXECUTABLE PROGRAM (at least after it's assembled).

This is very important. The nice feature of the top-down

approach is that at any stage you can compile a subset of the

complete language and get a program that will run on the target

machine. From here on, then, we need only add features by

fleshing out the language constructs. It's all very similar to

what we've been doing all along, except that we're approaching it

from the other end.

FLESHING IT OUT

To flesh out the compiler, we only have to deal with language

features one by one. I like to start with a stub procedure that

does nothing, then add detail in incremental fashion. Let's

begin by processing a block, in accordance with its PDL above.

We can do this in two stages. First, add the null procedure:

{--------------------------------------------------------------}

{ Parse and Translate a Pascal Block }

procedure DoBlock(Name: char);

begin

end;

{--------------------------------------------------------------}

and modify Prog to read:

{--------------------------------------------------------------}

{ Parse and Translate A Program }

procedure Prog;

var Name: char;

begin

Match('p');

Name := GetName;

Prolog;

DoBlock(Name);

Match('.');

Epilog(Name);

end;

{--------------------------------------------------------------}

That certainly shouldn't change the behavior of the program, and

it doesn't. But now the definition of Prog is complete, and we

can proceed to flesh out DoBlock. That's done right from its BNF

definition:

{--------------------------------------------------------------}

{ Parse and Translate a Pascal Block }

procedure DoBlock(Name: char);

begin

Declarations;

PostLabel(Name);

Statements;

end;

{--------------------------------------------------------------}

The procedure PostLabel was defined in the installment on

branches. Copy it into your cradle.

I probably need to explain the reason for inserting the label

where I have. It has to do with the operation of SK*DOS. Unlike

some OS's, SK*DOS allows the entry point to the main program to

be anywhere in the program. All you have to do is to give that

point a name. The call to PostLabel puts that name just before

the first executable statement in the main program. How does

SK*DOS know which of the many labels is the entry point, you ask?

It's the one that matches the END statement at the end of the

program.

OK, now we need stubs for the procedures Declarations and

Statements. Make them null procedures as we did before.

Does the program still run the same? Then we can move on to the

next stage.

DECLARATIONS

The BNF for Pascal declarations is:

<declarations> ::= ( <label list> |

<constant list> |

<type list> |

<variable list> |

<procedure> |

<function> )*

(Note that I'm using the more liberal definition used by Turbo

Pascal. In the standard Pascal definition, each of these parts

must be in a specific order relative to the rest.)

As usual, let's let a single character represent each of these

declaration types. The new form of Declarations is:

{--------------------------------------------------------------}

{ Parse and Translate the Declaration Part }

procedure Declarations;

begin

while Look in ['l', 'c', 't', 'v', 'p', 'f'] do

case Look of

'l': Labels;
'c': Constants;

't': Types;

'v': Variables;

'p': DoProcedure;

'f': DoFunction;

end;

{--------------------------------------------------------------}

Of course, we need stub procedures for each of these declaration

types. This time, they can't quite be null procedures, since

otherwise we'll end up with an infinite While loop. At the very

least, each recognizer must eat the character that invokes it.

Insert the following procedures:

{--------------------------------------------------------------}

{ Process Label Statement }

procedure Labels;

begin

Match('l');

end;

{--------------------------------------------------------------}

{ Process Const Statement }

procedure Constants;

begin
Match('c');

end;

{--------------------------------------------------------------}

{ Process Type Statement }

procedure Types;

begin

Match('t');

end;

{--------------------------------------------------------------}

{ Process Var Statement }

procedure Variables;

begin

Match('v');

end;

{--------------------------------------------------------------}

{ Process Procedure Definition }

procedure DoProcedure;

begin

Match('p');

end;

{--------------------------------------------------------------}

{ Process Function Definition }

procedure DoFunction;

begin

Match('f');

end;

{--------------------------------------------------------------}

Now try out the compiler with a few representative inputs. You

can mix the declarations any way you like, as long as the last

character in the program is'.' to indicate the end of the

program. Of course, none of the declarations actually declare

anything, so you don't need (and can't use) any characters other

than those standing for the keywords.

We can flesh out the statement part in a similar way. The BNF

for it is:

<statements> ::= <compound statement>

<compound statement> ::= BEGIN <statement>

(';' <statement>) END

Note that statements can begin with any identifier except END.

So the first stub form of procedure Statements is:

{--------------------------------------------------------------}

{ Parse and Translate the Statement Part }

procedure Statements;

begin

Match('b');
while Look <> 'e' do

GetChar;

Match('e');

end;

{--------------------------------------------------------------}

At this point the compiler will accept any number of

declarations, followed by the BEGIN block of the main program.

This block itself can contain any characters at all (except an

END), but it must be present.

The simplest form of input is now

'pxbe.'

Try it. Also try some combinations of this. Make some

deliberate errors and see what happens.

At this point you should be beginning to see the drill. We begin

with a stub translator to process a program, then we flesh out

each procedure in turn, based upon its BNF definition. Just as

the lower-level BNF definitions add detail and elaborate upon the

higher-level ones, the lower-level recognizers will parse more

detail of the input program. When the last stub has been

expanded, the compiler will be complete. That's top-down

design/implementation in its purest form.

You might note that even though we've been adding procedures, the

output of the program hasn't changed. That's as it should be.

At these top levels there is no emitted code required. The

recognizers are functioning as just that: recognizers. They are

accepting input sentences, catching bad ones, and channeling good

input to the right places, so they are doing their job. If we

were to pursue this a bit longer, code would start to appear.

The next step in our expansion should probably be procedure

Statements. The Pascal definition is:

<statement> ::= <simple statement> | <structured statement>

<simple statement> ::= <assignment> | <procedure call> | null

<structured statement> ::= <compound statement> |

<if statement> |

<case statement> |

<while statement> |

<repeat statement> |

<for statement> |

These are starting to look familiar. As a matter of fact, you

have already gone through the process of parsing and generating

code for both assignment statements and control structures. This

is where the top level meets our bottom-up approach of previous

sessions. The constructs will be a little different from those

we've been using for KISS, but the differences are nothing you

can't handle.

I think you can get the picture now as to the procedure. We

begin with a complete BNF description of the language. Starting

at the top level, we code up the recognizer for that BNF

statement, using stubs for the next-level recognizers. Then we

flesh those lower-level statements out one by one.

As it happens, the definition of Pascal is very compatible with

the use of BNF, and BNF descriptions of the language abound.

Armed with such a description, you will find it fairly

straightforward to continue the process we've begun.

You might have a go at fleshing a few of these constructs out,

just to get a feel for it. I don't expect you to be able to

complete a Pascal compiler here ... there are too many things

such as procedures and types that we haven't addressed yet ...

but it might be helpful to try some of the more familiar ones.

It will do you good to see executable programs coming out the

other end.

If I'm going to address those issues that we haven't covered yet,

I'd rather do it in the context of KISS. We're not trying to

build a complete Pascal compiler just yet, so I'm going to stop

the expansion of Pascal here. Let's take a look at a very

different language.

THE STRUCTURE OF C

The C language is quite another matter, as you'll see. Texts on

C rarely include a BNF definition of the language. Probably

that's because the language is quite hard to write BNF for.

One reason I'm showing you these structures now is so that I can

impress upon you these two facts:

(1) The definition of the language drives the structure of the

compiler. What works for one language may be a disaster for

another. It's a very bad idea to try to force a given

structure upon the compiler. Rather, you should let the BNF

drive the structure, as we have done here.

(2) A language that is hard to write BNF for will probably be

hard to write a compiler for, as well. C is a popular

language, and it has a reputation for letting you do

virtually anything that is possible to do. Despite the

success of Small C, C is _NOT_ an easy language to parse.

A C program has less structure than its Pascal counterpart. At

the top level, everything in C is a static declaration, either of

data or of a function. We can capture this thought like this:

<program> ::= ( <global declaration> )*

<global declaration> ::= <data declaration> |

In Small C, functions can only have the default type int, which

is not declared. This makes the input easy to parse: the first

token is either "int," "char," or the name of a function. In

Small C, the preprocessor commands are also processed by the

compiler proper, so the syntax becomes:

<global declaration> ::= '#' <preprocessor command> |

'int' <data list> |

'char' <data list> |

<ident> <function body> |

Although we're really more interested in full C here, I'll show

you the code corresponding to this top-level structure for Small

{--------------------------------------------------------------}

{ Parse and Translate A Program }

procedure Prog;

begin

while Look <> ^Z do begin

case Look of

'#': PreProc;

'i': IntDecl;

'c': CharDecl;

else DoFunction(Int);

end;

{--------------------------------------------------------------}

Note that I've had to use a ^Z to indicate the end of the source.

C has no keyword such as END or the '.' to otherwise indicate the

end.

With full C, things aren't even this easy. The problem comes

about because in full C, functions can also have types. So when

the compiler sees a keyword like "int," it still doesn't know

whether to expect a data declaration or a function definition.

Things get more complicated since the next token may not be a

name ... it may start with an '*' or '(', or combinations of the

two.

More specifically, the BNF for full C begins with:

<program> ::= ( <top-level decl> )*

<top-level decl> ::= <function def> | <data decl>

<data decl> ::= [<class>] <type> <decl-list>

<function def> ::= [<class>] [<type>] <function decl>

You can now see the problem: The first two parts of the

declarations for data and functions can be the same. Because of

the ambiguity in the grammar as written above, it's not a

suitable grammar for a recursive-descent parser. Can we

transform it into one that is suitable? Yes, with a little work.

Suppose we write it this way:

<top-level decl> ::= [<class>] <decl>

<decl> ::= <type> <typed decl> | <function decl>

<typed decl> ::= <data list> | <function decl>

We can build a parsing routine for the class and type

definitions, and have them store away their findings and go on,

without their ever having to "know" whether a function or a data

declaration is being processed.

To begin, key in the following version of the main program:

{--------------------------------------------------------------}

{ Main Program }

begin

Init;

while Look <> ^Z do begin

GetClass;

GetType;

TopDecl;

end;

end.

{--------------------------------------------------------------}

For the first round, just make the three procedures stubs that do

nothing _BUT_ call GetChar.

Does this program work? Well, it would be hard put NOT to, since

we're not really asking it to do anything. It's been said that a

C compiler will accept virtually any input without choking. It's

certainly true of THIS compiler, since in effect all it does is

to eat input characters until it finds a ^Z.

Next, let's make GetClass do something worthwhile. Declare the

global variable

var Class: char;

and change GetClass to do the following:

{--------------------------------------------------------------}

{ Get a Storage Class Specifier }

Procedure GetClass;

begin

if Look in ['a', 'x', 's'] then begin

Class := Look;

GetChar;

end

else Class := 'a';

end;

{--------------------------------------------------------------}

Here, I've used three single characters to represent the three

storage classes "auto," "extern," and "static." These are not

the only three possible classes ... there are also "register" and

"typedef," but this should give you the picture. Note that the

default class is "auto."

We can do a similar thing for types. Enter the following

procedure next:

{--------------------------------------------------------------}

{ Get a Type Specifier }

procedure GetType;

begin
Typ := ' ';

if Look = 'u' then begin

Sign := 'u';

Typ := 'i';

GetChar;

end

else Sign := 's';

if Look in ['i', 'l', 'c'] then begin

Typ := Look;

GetChar;

end;

{--------------------------------------------------------------}

Note that you must add two more global variables, Sign and Typ.

With these two procedures in place, the compiler will process the

class and type definitions and store away their findings. We can

now process the rest of the declaration.

We are by no means out of the woods yet, because there are still

many complexities just in the definition of the type, before we

even get to the actual data or function names. Let's pretend for

the moment that we have passed all those gates, and that the next

thing in the input stream is a name. If the name is followed by

a left paren, we have a function declaration. If not, we have at

least one data item, and possibly a list, each element of which

can have an initializer.

Insert the following version of TopDecl:

{--------------------------------------------------------------}

{ Process a Top-Level Declaration }

procedure TopDecl;

var Name: char;

begin

Name := Getname;

if Look = '(' then

DoFunc(Name)

else

DoData(Name);

end;

{--------------------------------------------------------------}

(Note that, since we have already read the name, we must pass it

along to the appropriate routine.)

Finally, add the two procedures DoFunc and DoData:

{--------------------------------------------------------------}

{ Process a Function Definition }

procedure DoFunc(n: char);

begin

Match('(');

Match(')');

Match('{');

Match('}');
if Typ = ' ' then Typ := 'i';

Writeln(Class, Sign, Typ, ' function ', n);

end;

{--------------------------------------------------------------}

{ Process a Data Declaration }

procedure DoData(n: char);

begin

if Typ = ' ' then Expected('Type declaration');

Writeln(Class, Sign, Typ, ' data ', n);

while Look = ',' do begin

Match(',');

n := GetName;

WriteLn(Class, Sign, Typ, ' data ', n);

end;

Match(';');

end;

{--------------------------------------------------------------}

Since we're still a long way from producing executable code, I

decided to just have these two routines tell us what they found.

OK, give this program a try. For data declarations, it's OK to

give a list separated by commas. We can't process initializers

as yet. We also can't process argument lists for the functions,

but the "(){}" characters should be there.

We're still a _VERY_ long way from having a C compiler, but what

we have is starting to process the right kinds of inputs, and is

recognizing both good and bad inputs. In the process, the

natural structure of the compiler is starting to take form.

Can we continue this until we have something that acts more like

a compiler. Of course we can. Should we? That's another matter.

I don't know about you, but I'm beginning to get dizzy, and we've

still got a long way to go to even get past the data

declarations.

At this point, I think you can see how the structure of the

compiler evolves from the language definition. The structures

we've seen for our two examples, Pascal and C, are as different

as night and day. Pascal was designed at least partly to be easy

to parse, and that's reflected in the compiler. In general, in

Pascal there is more structure and we have a better idea of what

kinds of constructs to expect at any point. In C, on the other

hand, the program is essentially a list of declarations,

terminated only by the end of file.

We could pursue both of these structures much farther, but

remember that our purpose here is not to build a Pascal or a C

compiler, but rather to study compilers in general. For those of

you who DO want to deal with Pascal or C, I hope I've given you

enough of a start so that you can take it from here (although

you'll soon need some of the stuff we still haven't covered yet,

such as typing and procedure calls). For the rest of you, stay

with me through the next installment. There, I'll be leading you

through the development of a complete compiler for TINY, a subset

of KISS.
See you then.
GET YOUR DOMAIN:
type here .com
Get It Now >>

[ Amit Mathur Online Home Page ]

LET'S BUILD A COMPILER!

By
Jack W. Crenshaw, Ph.D.
24 July 1988
Copyright : Jack W. Crenshaw
This book ca be distributed as it is with above information

[ eBooks Homepage ] [ This Book Homepage ] [ Discussion Forum ]

Part X : INTRODUCTION TO "TINY"

INTRODUCTION

In the last installment, I showed you the general idea for the
top-down development of a compiler. I gave you the first few
steps of the process for compilers for Pascal and C, but I
stopped far short of pushing it through to completion. The
reason was simple: if we're going to produce a real, functional
compiler for any language, I'd rather do it for KISS, the
language that I've been defining in this tutorial series.

In this installment, we're going to do just that, for a subset of

KISS which I've chosen to call TINY.

The process will be essentially that outlined in Installment IX,

except for one notable difference. In that installment, I
suggested that you begin with a full BNF description of the
language. That's fine for something like Pascal or C, for which
the language definition is firm. In the case of TINY, however,
we don't yet have a full description ... we seem to be defining
the language as we go. That's OK. In fact, it's preferable,
since we can tailor the language slightly as we go, to keep the
parsing easy.

So in the development that follows, we'll actually be doing a

top-down development of BOTH the language and its compiler. The
BNF description will grow along with the compiler.

In this process, there will be a number of decisions to be made,

each of which will influence the BNF and therefore the nature of
the language. At each decision point I'll try to remember to
explain the decision and the rationale behind my choice. That
way, if you happen to hold a different opinion and would prefer a
different option, you can choose it instead. You now have the
background to do that. I guess the important thing to note is
that nothing we do here is cast in concrete. When YOU'RE
designing YOUR language, you should feel free to do it YOUR way.

Many of you may be asking at this point: Why bother starting over
from scratch? We had a working subset of KISS as the outcome of
Installment VII (lexical scanning). Why not just extend it as
needed? The answer is threefold. First of all, I have been
making a number of changes to further simplify the program ...
changes like encapsulating the code generation procedures, so
that we can convert to a different target machine more easily.
Second, I want you to see how the development can indeed be done
from the top down as outlined in the last installment. Finally,
we both need the practice. Each time I go through this exercise,
I get a little better at it, and you will, also.

GETTING STARTED

Many years ago there were languages called Tiny BASIC, Tiny
Pascal, and Tiny C, each of which was a subset of its parent full
language. Tiny BASIC, for example, had only single-character
variable names and global variables. It supported only a single
data type. Sound familiar? At this point we have almost all the
tools we need to build a compiler like that.

Yet a language called Tiny-anything still carries some baggage

inherited from its parent language. I've often wondered if this
is a good idea. Granted, a language based upon some parent
language will have the advantage of familiarity, but there may
also be some peculiar syntax carried over from the parent that
may tend to add unnecessary complexity to the compiler. (Nowhere
is this more true than in Small C.)

I've wondered just how small and simple a compiler could be made
and still be useful, if it were designed from the outset to be
both easy to use and to parse. Let's find out. This language
will just be called "TINY," period. It's a subset of KISS, which
I also haven't fully defined, so that at least makes us
consistent (!). I suppose you could call it TINY KISS. But that
opens up a whole can of worms involving cuter and cuter (and
perhaps more risque) names, so let's just stick with TINY.

The main limitations of TINY will be because of the things we

haven't yet covered, such as data types. Like its cousins Tiny C
and Tiny BASIC, TINY will have only one data type, the 16-bit
integer. The first version we develop will also have no
procedure calls and will use single-character variable names,
although as you will see we can remove these restrictions without
much effort.

The language I have in mind will share some of the good features
of Pascal, C, and Ada. Taking a lesson from the comparison of
the Pascal and C compilers in the previous installment, though,
TINY will have a decided Pascal flavor. Wherever feasible, a
language structure will be bracketed by keywords or symbols, so
that the parser will know where it's going without having to
guess.

One other ground rule: As we go, I'd like to keep the compiler
producing real, executable code. Even though it may not DO much
at the beginning, it will at least do it correctly.

Finally, I'll use a couple of Pascal restrictions that make

sense: All data and procedures must be declared before they are
used. That makes good sense, even though for now the only data
type we'll use is a word. This rule in turn means that the only
reasonable place to put the executable code for the main program
is at the end of the listing.

The top-level definition will be similar to Pascal:

<program> ::= PROGRAM <top-level decl> <main> '.'

Already, we've reached a decision point. My first thought was to

make the main block optional. It doesn't seem to make sense to
write a "program" with no main program, but it does make sense if
we're allowing for multiple modules, linked together. As a
matter of fact, I intend to allow for this in KISS. But then we
begin to open up a can of worms that I'd rather leave closed for
now. For example, the term "PROGRAM" really becomes a misnomer.
The MODULE of Modula-2 or the Unit of Turbo Pascal would be more
appropriate. Second, what about scope rules? We'd need a
convention for dealing with name visibility across modules.
Better for now to just keep it simple and ignore the idea
altogether.

There's also a decision in choosing to require the main program

to be last. I toyed with the idea of making its position
optional, as in C. The nature of SK*DOS, the OS I'm compiling
for, make this very easy to do. But this doesn't really make
much sense in view of the Pascal-like requirement that all data
and procedures be declared before they're referenced. Since the
main program can only call procedures that have already been
declared, the only position that makes sense is at the end, a la
Pascal.

Given the BNF above, let's write a parser that just recognizes
the brackets:

{--------------------------------------------------------------}
{ Parse and Translate a Program }

procedure Prog;
begin
Match('p');
Header;
Prolog;
Match('.');
Epilog;
end;
{--------------------------------------------------------------}
The procedure Header just emits the startup code required by the
assembler:

{--------------------------------------------------------------}
{ Write Header Info }

procedure Header;
begin
WriteLn('WARMST', TAB, 'EQU $A01E');
end;
{--------------------------------------------------------------}

The procedures Prolog and Epilog emit the code for identifying
the main program, and for returning to the OS:

{--------------------------------------------------------------}
{ Write the Prolog }

procedure Prolog;
begin
PostLabel('MAIN');
end;

{--------------------------------------------------------------}
{ Write the Epilog }

procedure Epilog;
begin
EmitLn('DC WARMST');
EmitLn('END MAIN');
end;
{--------------------------------------------------------------}

The main program just calls Prog, and then looks for a clean
ending:

{--------------------------------------------------------------}
{ Main Program }

begin
Init;
Prog;
if Look <> CR then Abort('Unexpected data after ''.''');
end.
{--------------------------------------------------------------}

At this point, TINY will accept only one input "program," the
null program:

PROGRAM . (or 'p.' in our shorthand.)

Note, though, that the compiler DOES generate correct code for
this program. It will run, and do what you'd expect the null
program to do, that is, nothing but return gracefully to the OS.

As a matter of interest, one of my favorite compiler benchmarks

is to compile, link, and execute the null program in whatever
language is involved. You can learn a lot about the
implementation by measuring the overhead in time required to
compile what should be a trivial case. It's also interesting to
measure the amount of code produced. In many compilers, the code
can be fairly large, because they always include the whole run-
time library whether they need it or not. Early versions of
Turbo Pascal produced a 12K object file for this case. VAX C
generates 50K!

The smallest null programs I've seen are those produced by

Modula-2 compilers, and they run about 200-800 bytes.

In the case of TINY, we HAVE no run-time library as yet, so the

object code is indeed tiny: two bytes. That's got to be a
record, and it's likely to remain one since it is the minimum
size required by the OS.

The next step is to process the code for the main program. I'll
use the Pascal BEGIN-block:

<main> ::= BEGIN <block> END

Here, again, we have made a decision. We could have chosen to

require a "PROCEDURE MAIN" sort of declaration, similar to C. I
must admit that this is not a bad idea at all ... I don't
particularly like the Pascal approach since I tend to have
trouble locating the main program in a Pascal listing. But the
alternative is a little awkward, too, since you have to deal with
the error condition where the user omits the main program or
misspells its name. Here I'm taking the easy way out.

Another solution to the "where is the main program" problem might

be to require a name for the program, and then bracket the main
by

BEGIN <name>
END <name>

similar to the convention of Modula 2. This adds a bit of

"syntactic sugar" to the language. Things like this are easy to
add or change to your liking, if the language is your own design.

To parse this definition of a main block, change procedure Prog

to read:

{--------------------------------------------------------------}
{ Parse and Translate a Program }

procedure Prog;
begin
Match('p');
Header;
Main;
Match('.');
end;
{--------------------------------------------------------------}

and add the new procedure:

{--------------------------------------------------------------}
{ Parse and Translate a Main Program }

procedure Main;
begin
Match('b');
Prolog;
Match('e');
Epilog;
end;
{--------------------------------------------------------------}

Now, the only legal program is:

PROGRAM BEGIN END . (or 'pbe.')

Aren't we making progress??? Well, as usual it gets better. You

might try some deliberate errors here, like omitting the 'b' or
the 'e', and see what happens. As always, the compiler should
flag all illegal inputs.

DECLARATIONS

The obvious next step is to decide what we mean by a declaration.

My intent here is to have two kinds of declarations: variables
and procedures/functions. At the top level, only global
declarations are allowed, just as in C.

For now, there can only be variable declarations, identified by

the keyword VAR (abbreviated 'v'):

<top-level decls> ::= ( <data declaration> )*

<data declaration> ::= VAR <var-list>

Note that since there is only one variable type, there is no need
to declare the type. Later on, for full KISS, we can easily add
a type description.

The procedure Prog becomes:

{--------------------------------------------------------------}
{ Parse and Translate a Program }

procedure Prog;
begin
Match('p');
Header;
TopDecls;
Main;
Match('.');
end;
{--------------------------------------------------------------}

Now, add the two new procedures:

{--------------------------------------------------------------}
{ Process a Data Declaration }

procedure Decl;
begin
Match('v');
GetChar;
end;

{--------------------------------------------------------------}
{ Parse and Translate Global Declarations }

procedure TopDecls;
begin
while Look <> 'b' do
case Look of
'v': Decl;
else Abort('Unrecognized Keyword ''' + Look + '''');
end;
end;
{--------------------------------------------------------------}

Note that at this point, Decl is just a stub. It generates no

code, and it doesn't process a list ... every variable must occur
in a separate VAR statement.

OK, now we can have any number of data declarations, each

starting with a 'v' for VAR, before the BEGIN-block. Try a few
cases and see what happens.

DECLARATIONS AND SYMBOLS

That looks pretty good, but we're still only generating the null
program for output. A real compiler would issue assembler
directives to allocate storage for the variables. It's about
time we actually produced some code.

With a little extra code, that's an easy thing to do from

procedure Decl. Modify it as follows:

{--------------------------------------------------------------}
{ Parse and Translate a Data Declaration }

procedure Decl;
var Name: char;
begin
Match('v');
Alloc(GetName);
end;
{--------------------------------------------------------------}

The procedure Alloc just issues a command to the assembler to

allocate storage:

{--------------------------------------------------------------}
{ Allocate Storage for a Variable }

procedure Alloc(N: char);

begin
WriteLn(N, ':', TAB, 'DC 0');
end;
{--------------------------------------------------------------}

Give this one a whirl. Try an input that declares some

variables, such as:

pvxvyvzbe.

See how the storage is allocated? Simple, huh? Note also that
the entry point, "MAIN," comes out in the right place.

For the record, a "real" compiler would also have a symbol table
to record the variables being used. Normally, the symbol table
is necessary to record the type of each variable. But since in
this case all variables have the same type, we don't need a
symbol table for that reason. As it turns out, we're going to
find a symbol necessary even without different types, but let's
postpone that need until it arises.

Of course, we haven't really parsed the correct syntax for a data

declaration, since it involves a variable list. Our version only
permits a single variable. That's easy to fix, too.

The BNF for <var-list> is

<var-list> ::= <ident> (, <ident>)*

Adding this syntax to Decl gives this new version:

{--------------------------------------------------------------}
{ Parse and Translate a Data Declaration }

procedure Decl;
var Name: char;
begin
Match('v');
Alloc(GetName);
while Look = ',' do begin
GetChar;
Alloc(GetName);
end;
end;
{--------------------------------------------------------------}

OK, now compile this code and give it a try. Try a number of
lines of VAR declarations, try a list of several variables on one
line, and try combinations of the two. Does it work?

INITIALIZERS

As long as we're dealing with data declarations, one thing that's

always bothered me about Pascal is that it doesn't allow
initializing data items in the declaration. That feature is
admittedly sort of a frill, and it may be out of place in a
language that purports to be a minimal language. But it's also
SO easy to add that it seems a shame not to do so. The BNF
becomes:

<var-list> ::= <var> ( <var> )*

<var> ::= <ident> [ = <integer> ]

Change Alloc as follows:

{--------------------------------------------------------------}
{ Allocate Storage for a Variable }

procedure Alloc(N: char);

begin
Write(N, ':', TAB, 'DC ');
if Look = '=' then begin
Match('=');
WriteLn(GetNum);
end
else
WriteLn('0');
end;
{--------------------------------------------------------------}

There you are: an initializer with six added lines of Pascal.

OK, try this version of TINY and verify that you can, indeed,
give the variables initial values.

By golly, this thing is starting to look real! Of course, it

still doesn't DO anything, but it looks good, doesn't it?

Before leaving this section, I should point out that we've used
two versions of function GetNum. One, the earlier one, returns a
character value, a single digit. The other accepts a multi-digit
integer and returns an integer value. Either one will work here,
since WriteLn will handle either type. But there's no reason to
limit ourselves to single-digit values here, so the correct
version to use is the one that returns an integer. Here it is:
{--------------------------------------------------------------}
{ Get a Number }

function GetNum: integer;

var Val: integer;
begin
Val := 0;
if not IsDigit(Look) then Expected('Integer');
while IsDigit(Look) do begin
Val := 10 * Val + Ord(Look) - Ord('0');
GetChar;
end;
GetNum := Val;
end;
{--------------------------------------------------------------}

As a matter of fact, strictly speaking we should allow for

expressions in the data field of the initializer, or at the very
least for negative values. For now, let's just allow for
negative values by changing the code for Alloc as follows:

{--------------------------------------------------------------}
{ Allocate Storage for a Variable }

procedure Alloc(N: char);

begin
if InTable(N) then Abort('Duplicate Variable Name ' + N);
ST[N] := 'v';
Write(N, ':', TAB, 'DC ');
if Look = '=' then begin
Match('=');
If Look = '-' then begin
Write(Look);
Match('-');
end;
WriteLn(GetNum);
end
else
WriteLn('0');
end;
{--------------------------------------------------------------}

Now you should be able to initialize variables with negative

and/or multi-digit values.

THE SYMBOL TABLE

There's one problem with the compiler as it stands so far: it

doesn't do anything to record a variable when we declare it. So
the compiler is perfectly content to allocate storage for several
variables with the same name. You can easily verify this with an
input like

pvavavabe.

Here we've declared the variable A three times. As you can see,
the compiler will cheerfully accept that, and generate three
identical labels. Not good.

Later on, when we start referencing variables, the compiler will

also let us reference variables that don't exist. The assembler
will catch both of these error conditions, but it doesn't seem
friendly at all to pass such errors along to the assembler. The
compiler should catch such things at the source language level.

So even though we don't need a symbol table to record data types,

we ought to install one just to check for these two conditions.
Since at this point we are still restricted to single-character
variable names, the symbol table can be trivial. To provide for
it, first add the following declaration at the beginning of your
program:

var ST: array['A'..'Z'] of char;

and insert the following function:

{--------------------------------------------------------------}
{ Look for Symbol in Table }

function InTable(n: char): Boolean;

begin
InTable := ST[n] <> ' ';
end;
{--------------------------------------------------------------}

We also need to initialize the table to all blanks. The

following lines in Init will do the job:

var i: char;
begin
for i := 'A' to 'Z' do
ST[i] := ' ';
...

Finally, insert the following two lines at the beginning of

Alloc:

if InTable(N) then Abort('Duplicate Variable Name ' + N);

ST[N] := 'v';

That should do it. The compiler will now catch duplicate

declarations. Later, we can also use InTable when generating
references to the variables.

EXECUTABLE STATEMENTS

At this point, we can generate a null program that has some data
variables declared and possibly initialized. But so far we
haven't arranged to generate the first line of executable code.
Believe it or not, though, we almost have a usable language!
What's missing is the executable code that must go into the main
program. But that code is just assignment statements and control
statements ... all stuff we have done before. So it shouldn't
take us long to provide for them, as well.

The BNF definition given earlier for the main program included a
statement block, which we have so far ignored:

<main> ::= BEGIN <block> END

For now, we can just consider a block to be a series of

assignment statements:

<block> ::= (Assignment)*

Let's start things off by adding a parser for the block. We'll
begin with a stub for the assignment statement:

{--------------------------------------------------------------}
{ Parse and Translate an Assignment Statement }

procedure Assignment;
begin
GetChar;
end;

{--------------------------------------------------------------}
{ Parse and Translate a Block of Statements }

procedure Block;
begin
while Look <> 'e' do
Assignment;
end;
{--------------------------------------------------------------}

Modify procedure Main to call Block as shown below:

{--------------------------------------------------------------}
{ Parse and Translate a Main Program }

procedure Main;
begin
Match('b');
Prolog;
Block;
Match('e');
Epilog;
end;
{--------------------------------------------------------------}
This version still won't generate any code for the "assignment
statements" ... all it does is to eat characters until it sees
the 'e' for 'END.' But it sets the stage for what is to follow.

The next step, of course, is to flesh out the code for an

assignment statement. This is something we've done many times
before, so I won't belabor it. This time, though, I'd like to
deal with the code generation a little differently. Up till now,
we've always just inserted the Emits that generate output code in
line with the parsing routines. A little unstructured, perhaps,
but it seemed the most straightforward approach, and made it easy
to see what kind of code would be emitted for each construct.

However, I realize that most of you are using an 80x86 computer,

so the 68000 code generated is of little use to you. Several of
you have asked me if the CPU-dependent code couldn't be collected
into one spot where it would be easier to retarget to another
CPU. The answer, of course, is yes.

To accomplish this, insert the following "code generation"

routines:

{---------------------------------------------------------------}
{ Clear the Primary Register }

procedure Clear;
begin
EmitLn('CLR D0');
end;

{---------------------------------------------------------------}
{ Negate the Primary Register }

procedure Negate;
begin
EmitLn('NEG D0');
end;

{---------------------------------------------------------------}
{ Load a Constant Value to Primary Register }

procedure LoadConst(n: integer);

begin
Emit('MOVE #');
WriteLn(n, ',D0');
end;

{---------------------------------------------------------------}
{ Load a Variable to Primary Register }

procedure LoadVar(Name: char);

begin
if not InTable(Name) then Undefined(Name);
EmitLn('MOVE ' + Name + '(PC),D0');
end;
{---------------------------------------------------------------}
{ Push Primary onto Stack }

procedure Push;
begin
EmitLn('MOVE D0,-(SP)');
end;

{---------------------------------------------------------------}
{ Add Top of Stack to Primary }

procedure PopAdd;
begin
EmitLn('ADD (SP)+,D0');
end;

{---------------------------------------------------------------}
{ Subtract Primary from Top of Stack }

procedure PopSub;
begin
EmitLn('SUB (SP)+,D0');
EmitLn('NEG D0');
end;

{---------------------------------------------------------------}
{ Multiply Top of Stack by Primary }

procedure PopMul;
begin
EmitLn('MULS (SP)+,D0');
end;

{---------------------------------------------------------------}
{ Divide Top of Stack by Primary }

procedure PopDiv;
begin
EmitLn('MOVE (SP)+,D7');
EmitLn('EXT.L D7');
EmitLn('DIVS D0,D7');
EmitLn('MOVE D7,D0');
end;

{---------------------------------------------------------------}
{ Store Primary to Variable }

procedure Store(Name: char);

begin
if not InTable(Name) then Undefined(Name);
EmitLn('LEA ' + Name + '(PC),A0');
EmitLn('MOVE D0,(A0)')
end;
{---------------------------------------------------------------}
The nice part of this approach, of course, is that we can
retarget the compiler to a new CPU simply by rewriting these
"code generator" procedures. In addition, we will find later
that we can improve the code quality by tweaking these routines a
bit, without having to modify the compiler proper.

Note that both LoadVar and Store check the symbol table to make
sure that the variable is defined. The error handler Undefined
simply calls Abort:

{--------------------------------------------------------------}
{ Report an Undefined Identifier }

procedure Undefined(n: string);

begin
Abort('Undefined Identifier ' + n);
end;
{--------------------------------------------------------------}

OK, we are now finally ready to begin processing executable code.

We'll do that by replacing the stub version of procedure
Assignment.

We've been down this road many times before, so this should all
be familiar to you. In fact, except for the changes associated
with the code generation, we could just copy the procedures from
Part VII. Since we are making some changes, I won't just copy
them, but we will go a little faster than usual.

The BNF for the assignment statement is:

<assignment> ::= <ident> = <expression>

<expression> ::= <first term> ( <addop> <term> )*

<first term> ::= <first factor> <rest>

<term> ::= <factor> <rest>

<rest> ::= ( <mulop> <factor> )*

<first factor> ::= [ <addop> ] <factor>

<factor> ::= <var> | <number> | ( <expression> )

This version of the BNF is also a bit different than we've used
before ... yet another "variation on the theme of an expression."
This particular version has what I consider to be the best
treatment of the unary minus. As you'll see later, it lets us
handle negative constant values efficiently. It's worth
mentioning here that we have often seen the advantages of
"tweaking" the BNF as we go, to help make the language easy to
parse. What you're looking at here is a bit different: we've
tweaked the BNF to make the CODE GENERATION more efficient!
That's a first for this series.

Anyhow, the following code implements the BNF:

{---------------------------------------------------------------}
{ Parse and Translate a Math Factor }

procedure Expression; Forward;

procedure Factor;
begin
if Look = '(' then begin
Match('(');
Expression;
Match(')');
end
else if IsAlpha(Look) then
LoadVar(GetName)
else
LoadConst(GetNum);
end;

{--------------------------------------------------------------}
{ Parse and Translate a Negative Factor }

procedure NegFactor;
begin
Match('-');
if IsDigit(Look) then
LoadConst(-GetNum)
else begin
Factor;
Negate;
end;
end;

{--------------------------------------------------------------}
{ Parse and Translate a Leading Factor }

procedure FirstFactor;
begin
case Look of
'+': begin
Match('+');
Factor;
end;
'-': NegFactor;
else Factor;
end;
end;

{--------------------------------------------------------------}
{ Recognize and Translate a Multiply }

procedure Multiply;
begin
Match('*');
Factor;
PopMul;
end;
{-------------------------------------------------------------}
{ Recognize and Translate a Divide }

procedure Divide;
begin
Match('/');
Factor;
PopDiv;
end;

{---------------------------------------------------------------}
{ Common Code Used by Term and FirstTerm }

procedure Term1;
begin
while IsMulop(Look) do begin
Push;
case Look of
'*': Multiply;
'/': Divide;
end;
end;
end;

{---------------------------------------------------------------}
{ Parse and Translate a Math Term }

procedure Term;
begin
Factor;
Term1;
end;

{---------------------------------------------------------------}
{ Parse and Translate a Leading Term }

procedure FirstTerm;
begin
FirstFactor;
Term1;
end;

{--------------------------------------------------------------}
{ Recognize and Translate an Add }

procedure Add;
begin
Match('+');
Term;
PopAdd;
end;

{-------------------------------------------------------------}
{ Recognize and Translate a Subtract }
procedure Subtract;
begin
Match('-');
Term;
PopSub;
end;

{---------------------------------------------------------------}
{ Parse and Translate an Expression }

procedure Expression;
begin
FirstTerm;
while IsAddop(Look) do begin
Push;
case Look of
'+': Add;
'-': Subtract;
end;
end;
end;

{--------------------------------------------------------------}
{ Parse and Translate an Assignment Statement }

procedure Assignment;
var Name: char;
begin
Name := GetName;
Match('=');
Expression;
Store(Name);
end;
{--------------------------------------------------------------}

OK, if you've got all this code inserted, then compile it and
check it out. You should be seeing reasonable-looking code,
representing a complete program that will assemble and execute.
We have a compiler!

BOOLEANS

The next step should also be familiar to you. We must add

Boolean expressions and relational operations. Again, since
we've already dealt with them more than once, I won't elaborate
much on them, except where they are different from what we've
done before. Again, we won't just copy from other files because
I've changed a few things just a bit. Most of the changes just
involve encapsulating the machine-dependent parts as we did for
the arithmetic operations. I've also modified procedure
NotFactor somewhat, to parallel the structure of FirstFactor.
Finally, I corrected an error in the object code for the
relational operators: The Scc instruction I used only sets the
low 8 bits of D0. We want all 16 bits set for a logical true, so
I've added an instruction to sign-extend the low byte.
To begin, we're going to need some more recognizers:

{--------------------------------------------------------------}
{ Recognize a Boolean Orop }

function IsOrop(c: char): boolean;

begin
IsOrop := c in ['|', '~'];
end;

{--------------------------------------------------------------}
{ Recognize a Relop }

function IsRelop(c: char): boolean;

begin
IsRelop := c in ['=', '#', '<', '>'];
end;
{--------------------------------------------------------------}

Also, we're going to need some more code generation routines:

{---------------------------------------------------------------}
{ Complement the Primary Register }

procedure NotIt;
begin
EmitLn('NOT D0');
end;
{---------------------------------------------------------------}
.
.
.
{---------------------------------------------------------------}
{ AND Top of Stack with Primary }

procedure PopAnd;
begin
EmitLn('AND (SP)+,D0');
end;

{---------------------------------------------------------------}
{ OR Top of Stack with Primary }

procedure PopOr;
begin
EmitLn('OR (SP)+,D0');
end;

{---------------------------------------------------------------}
{ XOR Top of Stack with Primary }

procedure PopXor;
begin
EmitLn('EOR (SP)+,D0');
end;

{---------------------------------------------------------------}
{ Compare Top of Stack with Primary }

procedure PopCompare;
begin
EmitLn('CMP (SP)+,D0');
end;

{---------------------------------------------------------------}
{ Set D0 If Compare was = }

procedure SetEqual;
begin
EmitLn('SEQ D0');
EmitLn('EXT D0');
end;

{---------------------------------------------------------------}
{ Set D0 If Compare was != }

procedure SetNEqual;
begin
EmitLn('SNE D0');
EmitLn('EXT D0');
end;

{---------------------------------------------------------------}
{ Set D0 If Compare was > }

procedure SetGreater;
begin
EmitLn('SLT D0');
EmitLn('EXT D0');
end;

{---------------------------------------------------------------}
{ Set D0 If Compare was < }

procedure SetLess;
begin
EmitLn('SGT D0');
EmitLn('EXT D0');
end;
{---------------------------------------------------------------}

All of this gives us the tools we need. The BNF for the Boolean
expressions is:

<bool-expr> ::= <bool-term> ( <orop> <bool-term> )*

<bool-term> ::= <not-factor> ( <andop> <not-factor> )*

<not-factor> ::= [ '!' ] <relation>

<relation> ::= <expression> [ <relop> <expression> ]

Sharp-eyed readers might note that this syntax does not include
the non-terminal "bool-factor" used in earlier versions. It was
needed then because I also allowed for the Boolean constants TRUE
and FALSE. But remember that in TINY there is no distinction
made between Boolean and arithmetic types ... they can be freely
intermixed. So there is really no need for these predefined
values ... we can just use -1 and 0, respectively.

In C terminology, we could always use the defines:

#define TRUE -1
#define FALSE 0

(That is, if TINY had a preprocessor.) Later on, when we allow

for declarations of constants, these two values will be
predefined by the language.

The reason that I'm harping on this is that I've already tried
the alternative, which is to include TRUE and FALSE as keywords.
The problem with that approach is that it then requires lexical
scanning for EVERY variable name in every expression. If you'll
recall, I pointed out in Installment VII that this slows the
compiler down considerably. As long as keywords can't be in
expressions, we need to do the scanning only at the beginning of
every new statement ... quite an improvement. So using the
syntax above not only simplifies the parsing, but speeds up the
scanning as well.

OK, given that we're all satisfied with the syntax above, the
corresponding code is shown below:

{---------------------------------------------------------------}
{ Recognize and Translate a Relational "Equals" }

procedure Equals;
begin
Match('=');
Expression;
PopCompare;
SetEqual;
end;

{---------------------------------------------------------------}
{ Recognize and Translate a Relational "Not Equals" }

procedure NotEquals;
begin
Match('#');
Expression;
PopCompare;
SetNEqual;
end;
{---------------------------------------------------------------}
{ Recognize and Translate a Relational "Less Than" }

procedure Less;
begin
Match('<');
Expression;
PopCompare;
SetLess;
end;

{---------------------------------------------------------------}
{ Recognize and Translate a Relational "Greater Than" }

procedure Greater;
begin
Match('>');
Expression;
PopCompare;
SetGreater;
end;

{---------------------------------------------------------------}
{ Parse and Translate a Relation }

procedure Relation;
begin
Expression;
if IsRelop(Look) then begin
Push;
case Look of
'=': Equals;
'#': NotEquals;
'<': Less;
'>': Greater;
end;
end;
end;

{---------------------------------------------------------------}
{ Parse and Translate a Boolean Factor with Leading NOT }

procedure NotFactor;
begin
if Look = '!' then begin
Match('!');
Relation;
NotIt;
end
else
Relation;
end;

{---------------------------------------------------------------}
{ Parse and Translate a Boolean Term }
procedure BoolTerm;
begin
NotFactor;
while Look = '&' do begin
Push;
Match('&');
NotFactor;
PopAnd;
end;
end;

{--------------------------------------------------------------}
{ Recognize and Translate a Boolean OR }

procedure BoolOr;
begin
Match('|');
BoolTerm;
PopOr;
end;

{--------------------------------------------------------------}
{ Recognize and Translate an Exclusive Or }

procedure BoolXor;
begin
Match('~');
BoolTerm;
PopXor;
end;

{---------------------------------------------------------------}
{ Parse and Translate a Boolean Expression }

procedure BoolExpression;
begin
BoolTerm;
while IsOrOp(Look) do begin
Push;
case Look of
'|': BoolOr;
'~': BoolXor;
end;
end;
end;
{--------------------------------------------------------------}

To tie it all together, don't forget to change the references to

Expression in procedures Factor and Assignment so that they call
BoolExpression instead.

OK, if you've got all that typed in, compile it and give it a
whirl. First, make sure you can still parse an ordinary
arithmetic expression. Then, try a Boolean one. Finally, make
sure that you can assign the results of relations. Try, for
example:

pvx,y,zbx=z>ye.

which stands for:

PROGRAM
VAR X,Y,Z
BEGIN
X=Z>Y
END.

See how this assigns a Boolean value to X?

CONTROL STRUCTURES

We're almost home. With Boolean expressions in place, it's a

simple matter to add control structures. For TINY, we'll only
allow two kinds of them, the IF and the WHILE:

<if> ::= IF <bool-expression> <block> [ ELSE <block>] ENDIF

<while> ::= WHILE <bool-expression> <block> ENDWHILE

Once again, let me spell out the decisions implicit in this

syntax, which departs strongly from that of C or Pascal. In both
of those languages, the "body" of an IF or WHILE is regarded as a
single statement. If you intend to use a block of more than one
statement, you have to build a compound statement using BEGIN-END
(in Pascal) or '{}' (in C). In TINY (and KISS) there is no such
thing as a compound statement ... single or multiple they're all
just blocks to these languages.

In KISS, all the control structures will have explicit and unique
keywords bracketing the statement block, so there can be no
confusion as to where things begin and end. This is the modern
approach, used in such respected languages as Ada and Modula 2,
and it completely eliminates the problem of the "dangling else."

Note that I could have chosen to use the same keyword END to end
all the constructs, as is done in Pascal. (The closing '}' in C
serves the same purpose.) But this has always led to confusion,
which is why Pascal programmers tend to write things like

end { loop }

or end { if }

As I explained in Part V, using unique terminal keywords does

increase the size of the keyword list and therefore slows down
the scanning, but in this case it seems a small price to pay for
the added insurance. Better to find the errors at compile time
rather than run time.

One last thought: The two constructs above each have the non-
terminals
<bool-expression> and <block>

juxtaposed with no separating keyword. In Pascal we would expect

the keywords THEN and DO in these locations.

I have no problem with leaving out these keywords, and the parser
has no trouble either, ON CONDITION that we make no errors in the
bool-expression part. On the other hand, if we were to include
these extra keywords we would get yet one more level of insurance
at very little cost, and I have no problem with that, either.
Use your best judgment as to which way to go.

OK, with that bit of explanation let's proceed. As usual, we're

going to need some new code generation routines. These generate
the code for conditional and unconditional branches:

{---------------------------------------------------------------}
{ Branch Unconditional }

procedure Branch(L: string);

begin
EmitLn('BRA ' + L);
end;

{---------------------------------------------------------------}
{ Branch False }

procedure BranchFalse(L: string);

begin
EmitLn('TST D0');
EmitLn('BEQ ' + L);
end;
{--------------------------------------------------------------}

Except for the encapsulation of the code generation, the code to

parse the control constructs is the same as you've seen before:

{---------------------------------------------------------------}
{ Recognize and Translate an IF Construct }

procedure Block; Forward;

procedure DoIf;
var L1, L2: string;
begin
Match('i');
BoolExpression;
L1 := NewLabel;
L2 := L1;
BranchFalse(L1);
Block;
if Look = 'l' then begin
Match('l');
L2 := NewLabel;
Branch(L2);
PostLabel(L1);
Block;
end;
PostLabel(L2);
Match('e');
end;

{--------------------------------------------------------------}
{ Parse and Translate a WHILE Statement }

procedure DoWhile;
var L1, L2: string;
begin
Match('w');
L1 := NewLabel;
L2 := NewLabel;
PostLabel(L1);
BoolExpression;
BranchFalse(L2);
Block;
Match('e');
Branch(L1);
PostLabel(L2);
end;
{--------------------------------------------------------------}

To tie everything together, we need only modify procedure Block

to recognize the "keywords" for the IF and WHILE. As usual, we
expand the definition of a block like so:

<block> ::= ( <statement> )*

where

<statement> ::= <if> | <while> | <assignment>

The corresponding code is:

{--------------------------------------------------------------}
{ Parse and Translate a Block of Statements }

procedure Block;
begin
while not(Look in ['e', 'l']) do begin
case Look of
'i': DoIf;
'w': DoWhile;
else Assignment;
end;
end;
end;
{--------------------------------------------------------------}

OK, add the routines I've given, compile and test them. You
should be able to parse the single-character versions of any of
the control constructs. It's looking pretty good!

As a matter of fact, except for the single-character limitation

we've got a virtually complete version of TINY. I call it, with
tongue planted firmly in cheek, TINY Version 0.1.

LEXICAL SCANNING

Of course, you know what's next: We have to convert the program

so that it can deal with multi-character keywords, newlines, and
whitespace. We have just gone through all that in Part VII.
We'll use the distributed scanner technique that I showed you in
that installment. The actual implementation is a little
different because the way I'm handling newlines is different.

To begin with, let's simply allow for whitespace. This involves

only adding calls to SkipWhite at the end of the three routines,
GetName, GetNum, and Match. A call to SkipWhite in Init primes
the pump in case there are leading spaces.

Next, we need to deal with newlines. This is really a two-step

process, since the treatment of the newlines with single-
character tokens is different from that for multi-character ones.
We can eliminate some work by doing both steps at once, but I
feel safer taking things one step at a time.

Insert the new procedure:

{--------------------------------------------------------------}
{ Skip Over an End-of-Line }

procedure NewLine;
begin
while Look = CR do begin
GetChar;
if Look = LF then GetChar;
SkipWhite;
end;
end;
{--------------------------------------------------------------}

Note that we have seen this procedure before in the form of

Procedure Fin. I've changed the name since this new one seems
more descriptive of the actual function. I've also changed the
code to allow for multiple newlines and lines with nothing but
white space.

The next step is to insert calls to NewLine wherever we decide a

newline is permissible. As I've pointed out before, this can be
very different in different languages. In TINY, I've decided to
allow them virtually anywhere. This means that we need calls to
NewLine at the BEGINNING (not the end, as with SkipWhite) of the
procedures GetName, GetNum, and Match.

For procedures that have while loops, such as TopDecl, we need a

call to NewLine at the beginning of the procedure AND at the
bottom of each loop. That way, we can be assured that NewLine
has just been called at the beginning of each pass through the
loop.

If you've got all this done, try the program out and verify that
it will indeed handle white space and newlines.

If it does, then we're ready to deal with multi-character tokens

and keywords. To begin, add the additional declarations (copied
almost verbatim from Part VII):

{--------------------------------------------------------------}
{ Type Declarations }

type Symbol = string[8];

SymTab = array[1..1000] of Symbol;

TabPtr = ^SymTab;

{--------------------------------------------------------------}
{ Variable Declarations }

var Look : char; { Lookahead Character }

Token: char; { Encoded Token }
Value: string[16]; { Unencoded Token }

ST: Array['A'..'Z'] of char;

{--------------------------------------------------------------}
{ Definition of Keywords and Token Types }

const NKW = 9;
NKW1 = 10;

const KWlist: array[1..NKW] of Symbol =

('IF', 'ELSE', 'ENDIF', 'WHILE', 'ENDWHILE',
'VAR', 'BEGIN', 'END', 'PROGRAM');

const KWcode: string[NKW1] = 'xilewevbep';

{--------------------------------------------------------------}

Next, add the three procedures, also from Part VII:

{--------------------------------------------------------------}
{ Table Lookup }

function Lookup(T: TabPtr; s: string; n: integer): integer;

var i: integer;
found: Boolean;
begin
found := false;
i := n;
while (i > 0) and not found do
if s = T^[i] then
found := true
else
dec(i);
Lookup := i;
end;
{--------------------------------------------------------------}
.
.
{--------------------------------------------------------------}
{ Get an Identifier and Scan it for Keywords }

procedure Scan;
begin
GetName;
Token := KWcode[Lookup(Addr(KWlist), Value, NKW) + 1];
end;
{--------------------------------------------------------------}
.
.
{--------------------------------------------------------------}
{ Match a Specific Input String }

procedure MatchString(x: string);

begin
if Value <> x then Expected('''' + x + '''');
end;
{--------------------------------------------------------------}

Now, we have to make a fairly large number of subtle changes to

the remaining procedures. First, we must change the function
GetName to a procedure, again as we did in Part VII:

{--------------------------------------------------------------}
{ Get an Identifier }

procedure GetName;
begin
NewLine;
if not IsAlpha(Look) then Expected('Name');
Value := '';
while IsAlNum(Look) do begin
Value := Value + UpCase(Look);
GetChar;
end;
SkipWhite;
end;
{--------------------------------------------------------------}

Note that this procedure leaves its result in the global string
Value.

Next, we have to change every reference to GetName to reflect its

new form. These occur in Factor, Assignment, and Decl:

{---------------------------------------------------------------}
{ Parse and Translate a Math Factor }

procedure BoolExpression; Forward;

procedure Factor;
begin
if Look = '(' then begin
Match('(');
BoolExpression;
Match(')');
end
else if IsAlpha(Look) then begin
GetName;
LoadVar(Value[1]);
end
else
LoadConst(GetNum);
end;
{--------------------------------------------------------------}
.
.
{--------------------------------------------------------------}
{ Parse and Translate an Assignment Statement }

procedure Assignment;
var Name: char;
begin
Name := Value[1];
Match('=');
BoolExpression;
Store(Name);
end;
{---------------------------------------------------------------}
.
.
{--------------------------------------------------------------}
{ Parse and Translate a Data Declaration }

procedure Decl;
begin
GetName;
Alloc(Value[1]);
while Look = ',' do begin
Match(',');
GetName;
Alloc(Value[1]);
end;
end;
{--------------------------------------------------------------}

(Note that we're still only allowing single-character variable

names, so we take the easy way out here and simply use the first
character of the string.)

Finally, we must make the changes to use Token instead of Look as

the test character and to call Scan at the appropriate places.
Mostly, this involves deleting calls to Match, occasionally
replacing calls to Match by calls to MatchString, and Replacing
calls to NewLine by calls to Scan. Here are the affected
routines:

{---------------------------------------------------------------}
{ Recognize and Translate an IF Construct }
procedure Block; Forward;

procedure DoIf;
var L1, L2: string;
begin
BoolExpression;
L1 := NewLabel;
L2 := L1;
BranchFalse(L1);
Block;
if Token = 'l' then begin
L2 := NewLabel;
Branch(L2);
PostLabel(L1);
Block;
end;
PostLabel(L2);
MatchString('ENDIF');
end;

{--------------------------------------------------------------}
{ Parse and Translate a WHILE Statement }

procedure DoWhile;
var L1, L2: string;
begin
L1 := NewLabel;
L2 := NewLabel;
PostLabel(L1);
BoolExpression;
BranchFalse(L2);
Block;
MatchString('ENDWHILE');
Branch(L1);
PostLabel(L2);
end;

{--------------------------------------------------------------}
{ Parse and Translate a Block of Statements }

procedure Block;
begin
Scan;
while not(Token in ['e', 'l']) do begin
case Token of
'i': DoIf;
'w': DoWhile;
else Assignment;
end;
Scan;
end;
end;

{--------------------------------------------------------------}
{ Parse and Translate Global Declarations }
procedure TopDecls;
begin
Scan;
while Token <> 'b' do begin
case Token of
'v': Decl;
else Abort('Unrecognized Keyword ' + Value);
end;
Scan;
end;
end;

{--------------------------------------------------------------}
{ Parse and Translate a Main Program }

procedure Main;
begin
MatchString('BEGIN');
Prolog;
Block;
MatchString('END');
Epilog;
end;

{--------------------------------------------------------------}
{ Parse and Translate a Program }

procedure Prog;
begin
MatchString('PROGRAM');
Header;
TopDecls;
Main;
Match('.');
end;

{--------------------------------------------------------------}
{ Initialize }

procedure Init;
var i: char;
begin
for i := 'A' to 'Z' do
ST[i] := ' ';
GetChar;
Scan;
end;
{--------------------------------------------------------------}

That should do it. If all the changes got in correctly, you

should now be parsing programs that look like programs. (If you
didn't make it through all the changes, don't despair. A
complete listing of the final form is given later.)

Did it work? If so, then we're just about home. In fact, with a
few minor exceptions we've already got a compiler that's usable.
There are still a few areas that need improvement.

MULTI-CHARACTER VARIABLE NAMES

One of those is the restriction that we still have, requiring

single-character variable names. Now that we can handle multi-
character keywords, this one begins to look very much like an
arbitrary and unnecessary limitation. And indeed it is.
Basically, its only virtue is that it permits a trivially simple
implementation of the symbol table. But that's just a
convenience to the compiler writers, and needs to be eliminated.

We've done this step before. This time, as usual, I'm doing it a
little differently. I think the approach used here keeps things
just about as simple as possible.

The natural way to implement a symbol table in Pascal is by

declaring a record type, and making the symbol table an array of
such records. Here, though, we don't really need a type field
yet (there is only one kind of entry allowed so far), so we only
need an array of symbols. This has the advantage that we can use
the existing procedure Lookup to search the symbol table as well
as the keyword list. As it turns out, even when we need more
fields we can still use the same approach, simply by storing the
other fields in separate arrays.

OK, here are the changes that need to be made. First, add the
new typed constant:

NEntry: integer = 0;

Then change the definition of the symbol table as follows:

const MaxEntry = 100;

var ST : array[1..MaxEntry] of Symbol;

(Note that ST is _NOT_ declared as a SymTab. That declaration is

a phony one to get Lookup to work. A SymTab would take up too
much RAM space, and so one is never actually allocated.)

Next, we need to replace InTable:

{--------------------------------------------------------------}
{ Look for Symbol in Table }

function InTable(n: Symbol): Boolean;

begin
InTable := Lookup(@ST, n, MaxEntry) <> 0;
end;
{--------------------------------------------------------------}

We also need a new procedure, AddEntry, that adds a new entry to

the table:
{--------------------------------------------------------------}
{ Add a New Entry to Symbol Table }

procedure AddEntry(N: Symbol; T: char);

begin
if InTable(N) then Abort('Duplicate Identifier ' + N);
if NEntry = MaxEntry then Abort('Symbol Table Full');
Inc(NEntry);
ST[NEntry] := N;
SType[NEntry] := T;
end;
{--------------------------------------------------------------}

This procedure is called by Alloc:

{--------------------------------------------------------------}
{ Allocate Storage for a Variable }

procedure Alloc(N: Symbol);

begin
if InTable(N) then Abort('Duplicate Variable Name ' + N);
AddEntry(N, 'v');
.
.
.
{--------------------------------------------------------------}

Finally, we must change all the routines that currently treat the
variable name as a single character. These include LoadVar and
Store (just change the type from char to string), and Factor,
Assignment, and Decl (just change Value[1] to Value).

One last thing: change procedure Init to clear the array as

shown:

{--------------------------------------------------------------}
{ Initialize }

procedure Init;
var i: integer;
begin
for i := 1 to MaxEntry do begin
ST[i] := '';
SType[i] := ' ';
end;
GetChar;
Scan;
end;
{--------------------------------------------------------------}

That should do it. Try it out and verify that you can, indeed,
use multi-character variable names.

MORE RELOPS
We still have one remaining single-character restriction: the one
on relops. Some of the relops are indeed single characters, but
others require two. These are '<=' and '>='. I also prefer the
Pascal '<>' for "not equals," instead of '#'.

If you'll recall, in Part VII I pointed out that the conventional

way to deal with relops is to include them in the list of
keywords, and let the lexical scanner find them. But, again,
this requires scanning throughout the expression parsing process,
whereas so far we've been able to limit the use of the scanner to
the beginning of a statement.

I mentioned then that we can still get away with this, since the
multi-character relops are so few and so limited in their usage.
It's easy to just treat them as special cases and handle them in
an ad hoc manner.

The changes required affect only the code generation routines and
procedures Relation and friends. First, we're going to need two
more code generation routines:

{---------------------------------------------------------------}
{ Set D0 If Compare was <= }

procedure SetLessOrEqual;
begin
EmitLn('SGE D0');
EmitLn('EXT D0');
end;

{---------------------------------------------------------------}
{ Set D0 If Compare was >= }

procedure SetGreaterOrEqual;
begin
EmitLn('SLE D0');
EmitLn('EXT D0');
end;
{---------------------------------------------------------------}

Then, modify the relation parsing routines as shown below:

{---------------------------------------------------------------}
{ Recognize and Translate a Relational "Less Than or Equal" }

procedure LessOrEqual;
begin
Match('=');
Expression;
PopCompare;
SetLessOrEqual;
end;

{---------------------------------------------------------------}
{ Recognize and Translate a Relational "Not Equals" }
procedure NotEqual;
begin
Match('>');
Expression;
PopCompare;
SetNEqual;
end;

Continued to part b of this chapter

GET YOUR DOMAIN:
type here .com
Get It Now >>

[ Amit Mathur Online Home Page ]

LET'S BUILD A COMPILER!

By
Jack W. Crenshaw, Ph.D.
24 July 1988
Copyright : Jack W. Crenshaw
This book can be distributed as it is with above information

[ eBooks Homepage ] [ This Book Homepage ] [ Discussion Forum ]

Part X : INTRODUCTION TO "TINY"

Part - B

... continution of part A

{---------------------------------------------------------------}
{ Recognize and Translate a Relational "Less Than" }

procedure Less;
begin
Match('<');
case Look of
'=': LessOrEqual;
'>': NotEqual;
else begin
Expression;
PopCompare;
SetLess;
end;
end;
end;

{---------------------------------------------------------------}
{ Recognize and Translate a Relational "Greater Than" }

procedure Greater;
begin
Match('>');
if Look = '=' then begin
Match('=');
Expression;
PopCompare;
SetGreaterOrEqual;
end
else begin
Expression;
PopCompare;
SetGreater;
end;
end;
{---------------------------------------------------------------}

That's all it takes. Now you can process all the relops. Try
it.

INPUT/OUTPUT

We now have a complete, working language, except for one minor

embarassment: we have no way to get data in or out. We need some
I/O.

Now, the convention these days, established in C and continued in

Ada and Modula 2, is to leave I/O statements out of the language
itself, and just include them in the subroutine library. That
would be fine, except that so far we have no provision for
subroutines. Anyhow, with this approach you run into the problem
of variable-length argument lists. In Pascal, the I/O statements
are built into the language because they are the only ones for
which the argument list can have a variable number of entries.
In C, we settle for kludges like scanf and printf, and must pass
the argument count to the called procedure. In Ada and Modula 2
we must use the awkward (and SLOW!) approach of a separate call
for each argument.

So I think I prefer the Pascal approach of building the I/O in,

even though we don't need to.

As usual, for this we need some more code generation routines.

These turn out to be the easiest of all, because all we do is to
call library procedures to do the work:

{---------------------------------------------------------------}
{ Read Variable to Primary Register }

procedure ReadVar;
begin
EmitLn('BSR READ');
Store(Value);
end;

{---------------------------------------------------------------}
{ Write Variable from Primary Register }

procedure WriteVar;
begin
EmitLn('BSR WRITE');
end;
{--------------------------------------------------------------}

The idea is that READ loads the value from input to the D0, and
WRITE outputs it from there.

These two procedures represent our first encounter with a need

for library procedures ... the components of a Run Time Library
(RTL). Of course, someone (namely us) has to write these
routines, but they're not part of the compiler itself. I won't
even bother showing the routines here, since these are obviously
very much OS-dependent. I _WILL_ simply say that for SK*DOS,
they are particularly simple ... almost trivial. One reason I
won't show them here is that you can add all kinds of fanciness
to the things, for example by prompting in READ for the inputs,
and by giving the user a chance to reenter a bad input.

But that is really separate from compiler design, so for now I'll
just assume that a library call TINYLIB.LIB exists. Since we now
need it loaded, we need to add a statement to include it in
procedure Header:

{--------------------------------------------------------------}
{ Write Header Info }

procedure Header;
begin

WriteLn('WARMST', TAB, 'EQU $A01E');

EmitLn('LIB TINYLIB');
end;
{--------------------------------------------------------------}

That takes care of that part. Now, we also need to recognize the
read and write commands. We can do this by adding two more
keywords to our list:

{--------------------------------------------------------------}
{ Definition of Keywords and Token Types }

const NKW = 11;

NKW1 = 12;

const KWlist: array[1..NKW] of Symbol =

('IF', 'ELSE', 'ENDIF', 'WHILE', 'ENDWHILE',
'READ', 'WRITE', 'VAR', 'BEGIN', 'END',
'PROGRAM');

const KWcode: string[NKW1] = 'xileweRWvbep';

{--------------------------------------------------------------}

(Note how I'm using upper case codes here to avoid conflict with
the 'w' of WHILE.)

Next, we need procedures for processing the read/write statement

and its argument list:
{--------------------------------------------------------------}
{ Process a Read Statement }
procedure DoRead;
begin
Match('(');
GetName;
ReadVar;
while Look = ',' do begin
Match(',');
GetName;
ReadVar;
end;
Match(')');
end;

{--------------------------------------------------------------}
{ Process a Write Statement }

procedure DoWrite;
begin
Match('(');
Expression;
WriteVar;
while Look = ',' do begin
Match(',');
Expression;
WriteVar;
end;
Match(')');
end;
{--------------------------------------------------------------}

Finally, we must expand procedure Block to handle the new

statement types:

{--------------------------------------------------------------}
{ Parse and Translate a Block of Statements }

procedure Block;
begin
Scan;
while not(Token in ['e', 'l']) do begin
case Token of
'i': DoIf;
'w': DoWhile;
'R': DoRead;
'W': DoWrite;
else Assignment;
end;
Scan;
end;
end;
{--------------------------------------------------------------}

That's all there is to it. _NOW_ we have a language!

CONCLUSION

At this point we have TINY completely defined. It's not much ...
actually a toy compiler. TINY has only one data type and no
subroutines ... but it's a complete, usable language. While
you're not likely to be able to write another compiler in it, or
do anything else very seriously, you could write programs to read
some input, perform calculations, and output the results. Not
too bad for a toy.

Most importantly, we have a firm base upon which to build further

extensions. I know you'll be glad to hear this: this is the last
time I'll start over in building a parser ... from now on I
intend to just add features to TINY until it becomes KISS. Oh,
there'll be other times we will need to try things out with new
copies of the Cradle, but once we've found out how to do those
things they'll be incorporated into TINY.

What will those features be? Well, for starters we need

subroutines and functions. Then we need to be able to handle
different types, including arrays, strings, and other structures.
Then we need to deal with the idea of pointers. All this will be
upcoming in future installments.

See you then.

For references purposes, the complete listing of TINY Version 1.0

is shown below:

{--------------------------------------------------------------}
program Tiny10;

{--------------------------------------------------------------}
{ Constant Declarations }

const TAB = ^I;

CR = ^M;
LF = ^J;

LCount: integer = 0;
NEntry: integer = 0;

{--------------------------------------------------------------}
{ Type Declarations }

type Symbol = string[8];

SymTab = array[1..1000] of Symbol;

TabPtr = ^SymTab;

{--------------------------------------------------------------}
{ Variable Declarations }

var Look : char; { Lookahead Character }

Token: char; { Encoded Token }
Value: string[16]; { Unencoded Token }
const MaxEntry = 100;

var ST : array[1..MaxEntry] of Symbol;

SType: array[1..MaxEntry] of char;

{--------------------------------------------------------------}
{ Definition of Keywords and Token Types }

const NKW = 11;

NKW1 = 12;

const KWlist: array[1..NKW] of Symbol =

('IF', 'ELSE', 'ENDIF', 'WHILE', 'ENDWHILE',
'READ', 'WRITE', 'VAR', 'BEGIN', 'END',
'PROGRAM');

const KWcode: string[NKW1] = 'xileweRWvbep';

{--------------------------------------------------------------}
{ Read New Character From Input Stream }

procedure GetChar;
begin
Read(Look);
end;

{--------------------------------------------------------------}
{ Report an Error }

procedure Error(s: string);

begin
WriteLn;
WriteLn(^G, 'Error: ', s, '.');
end;

{--------------------------------------------------------------}
{ Report Error and Halt }

procedure Abort(s: string);

begin
Error(s);
Halt;
end;

{--------------------------------------------------------------}
{ Report What Was Expected }

procedure Expected(s: string);

begin
Abort(s + ' Expected');
end;

{--------------------------------------------------------------}
{ Report an Undefined Identifier }

procedure Undefined(n: string);

begin
Abort('Undefined Identifier ' + n);
end;

{--------------------------------------------------------------}
{ Recognize an Alpha Character }

function IsAlpha(c: char): boolean;

begin
IsAlpha := UpCase(c) in ['A'..'Z'];
end;

{--------------------------------------------------------------}
{ Recognize a Decimal Digit }

function IsDigit(c: char): boolean;

begin
IsDigit := c in ['0'..'9'];
end;

{--------------------------------------------------------------}
{ Recognize an AlphaNumeric Character }

function IsAlNum(c: char): boolean;

begin
IsAlNum := IsAlpha(c) or IsDigit(c);
end;

{--------------------------------------------------------------}
{ Recognize an Addop }

function IsAddop(c: char): boolean;

begin
IsAddop := c in ['+', '-'];
end;

{--------------------------------------------------------------}
{ Recognize a Mulop }

function IsMulop(c: char): boolean;

begin
IsMulop := c in ['*', '/'];
end;

{--------------------------------------------------------------}
{ Recognize a Boolean Orop }

function IsOrop(c: char): boolean;

begin
IsOrop := c in ['|', '~'];
end;

{--------------------------------------------------------------}
{ Recognize a Relop }

function IsRelop(c: char): boolean;

begin
IsRelop := c in ['=', '#', '<', '>'];
end;

{--------------------------------------------------------------}
{ Recognize White Space }

function IsWhite(c: char): boolean;

begin
IsWhite := c in [' ', TAB];
end;

{--------------------------------------------------------------}
{ Skip Over Leading White Space }

procedure SkipWhite;
begin
while IsWhite(Look) do
GetChar;
end;

{--------------------------------------------------------------}
{ Skip Over an End-of-Line }

procedure NewLine;
begin
while Look = CR do begin
GetChar;
if Look = LF then GetChar;
SkipWhite;
end;
end;

{--------------------------------------------------------------}
{ Match a Specific Input Character }

procedure Match(x: char);

begin
NewLine;
if Look = x then GetChar
else Expected('''' + x + '''');
SkipWhite;
end;

{--------------------------------------------------------------}
{ Table Lookup }

function Lookup(T: TabPtr; s: string; n: integer): integer;

var i: integer;
found: Boolean;
begin
found := false;
i := n;
while (i > 0) and not found do
if s = T^[i] then
found := true
else
dec(i);
Lookup := i;
end;

{--------------------------------------------------------------}
{ Locate a Symbol in Table }
{ Returns the index of the entry. Zero if not present. }

function Locate(N: Symbol): integer;

begin
Locate := Lookup(@ST, n, MaxEntry);
end;

{--------------------------------------------------------------}
{ Look for Symbol in Table }

function InTable(n: Symbol): Boolean;

begin
InTable := Lookup(@ST, n, MaxEntry) <> 0;
end;

{--------------------------------------------------------------}
{ Add a New Entry to Symbol Table }

procedure AddEntry(N: Symbol; T: char);

begin
if InTable(N) then Abort('Duplicate Identifier ' + N);
if NEntry = MaxEntry then Abort('Symbol Table Full');
Inc(NEntry);
ST[NEntry] := N;
SType[NEntry] := T;
end;

{--------------------------------------------------------------}
{ Get an Identifier }

procedure GetName;
begin
NewLine;
if not IsAlpha(Look) then Expected('Name');
Value := '';
while IsAlNum(Look) do begin
Value := Value + UpCase(Look);
GetChar;
end;
SkipWhite;
end;

{--------------------------------------------------------------}
{ Get a Number }

function GetNum: integer;

var Val: integer;
begin
NewLine;
if not IsDigit(Look) then Expected('Integer');
Val := 0;
while IsDigit(Look) do begin
Val := 10 * Val + Ord(Look) - Ord('0');
GetChar;
end;
GetNum := Val;
SkipWhite;
end;

{--------------------------------------------------------------}
{ Get an Identifier and Scan it for Keywords }

procedure Scan;
begin
GetName;
Token := KWcode[Lookup(Addr(KWlist), Value, NKW) + 1];
end;

{--------------------------------------------------------------}
{ Match a Specific Input String }

procedure MatchString(x: string);

begin
if Value <> x then Expected('''' + x + '''');
end;

{--------------------------------------------------------------}
{ Output a String with Tab }

procedure Emit(s: string);

begin
Write(TAB, s);
end;

{--------------------------------------------------------------}
{ Output a String with Tab and CRLF }

procedure EmitLn(s: string);

begin
Emit(s);
WriteLn;
end;

{--------------------------------------------------------------}
{ Generate a Unique Label }

function NewLabel: string;

var S: string;
begin
Str(LCount, S);
NewLabel := 'L' + S;
Inc(LCount);
end;
{--------------------------------------------------------------}
{ Post a Label To Output }

procedure PostLabel(L: string);

begin
WriteLn(L, ':');
end;

{---------------------------------------------------------------}
{ Clear the Primary Register }

procedure Clear;
begin
EmitLn('CLR D0');
end;

{---------------------------------------------------------------}
{ Negate the Primary Register }

procedure Negate;
begin
EmitLn('NEG D0');
end;

{---------------------------------------------------------------}
{ Complement the Primary Register }

procedure NotIt;
begin
EmitLn('NOT D0');
end;

{---------------------------------------------------------------}
{ Load a Constant Value to Primary Register }

procedure LoadConst(n: integer);

begin
Emit('MOVE #');
WriteLn(n, ',D0');
end;

{---------------------------------------------------------------}
{ Load a Variable to Primary Register }

procedure LoadVar(Name: string);

begin
if not InTable(Name) then Undefined(Name);
EmitLn('MOVE ' + Name + '(PC),D0');
end;

{---------------------------------------------------------------}
{ Push Primary onto Stack }

procedure Push;
begin
EmitLn('MOVE D0,-(SP)');
end;

{---------------------------------------------------------------}
{ Add Top of Stack to Primary }

procedure PopAdd;
begin
EmitLn('ADD (SP)+,D0');
end;

{---------------------------------------------------------------}
{ Subtract Primary from Top of Stack }

procedure PopSub;
begin
EmitLn('SUB (SP)+,D0');
EmitLn('NEG D0');
end;

{---------------------------------------------------------------}
{ Multiply Top of Stack by Primary }

procedure PopMul;
begin
EmitLn('MULS (SP)+,D0');
end;

{---------------------------------------------------------------}
{ Divide Top of Stack by Primary }

procedure PopDiv;
begin
EmitLn('MOVE (SP)+,D7');
EmitLn('EXT.L D7');
EmitLn('DIVS D0,D7');
EmitLn('MOVE D7,D0');
end;

{---------------------------------------------------------------}
{ AND Top of Stack with Primary }

procedure PopAnd;
begin
EmitLn('AND (SP)+,D0');
end;

{---------------------------------------------------------------}
{ OR Top of Stack with Primary }

procedure PopOr;
begin
EmitLn('OR (SP)+,D0');
end;
{---------------------------------------------------------------}
{ XOR Top of Stack with Primary }

procedure PopXor;
begin
EmitLn('EOR (SP)+,D0');
end;

{---------------------------------------------------------------}
{ Compare Top of Stack with Primary }

procedure PopCompare;
begin
EmitLn('CMP (SP)+,D0');
end;

{---------------------------------------------------------------}
{ Set D0 If Compare was = }

procedure SetEqual;
begin
EmitLn('SEQ D0');
EmitLn('EXT D0');
end;

{---------------------------------------------------------------}
{ Set D0 If Compare was != }

procedure SetNEqual;
begin
EmitLn('SNE D0');
EmitLn('EXT D0');
end;

{---------------------------------------------------------------}
{ Set D0 If Compare was > }

procedure SetGreater;
begin
EmitLn('SLT D0');
EmitLn('EXT D0');
end;

{---------------------------------------------------------------}
{ Set D0 If Compare was < }

procedure SetLess;
begin
EmitLn('SGT D0');
EmitLn('EXT D0');
end;

{---------------------------------------------------------------}
{ Set D0 If Compare was <= }
procedure SetLessOrEqual;
begin
EmitLn('SGE D0');
EmitLn('EXT D0');
end;

{---------------------------------------------------------------}
{ Set D0 If Compare was >= }

procedure SetGreaterOrEqual;
begin
EmitLn('SLE D0');
EmitLn('EXT D0');
end;

{---------------------------------------------------------------}
{ Store Primary to Variable }

procedure Store(Name: string);

begin
if not InTable(Name) then Undefined(Name);
EmitLn('LEA ' + Name + '(PC),A0');
EmitLn('MOVE D0,(A0)')
end;

{---------------------------------------------------------------}
{ Branch Unconditional }

procedure Branch(L: string);

begin
EmitLn('BRA ' + L);
end;

{---------------------------------------------------------------}
{ Branch False }

procedure BranchFalse(L: string);

begin
EmitLn('TST D0');
EmitLn('BEQ ' + L);
end;

{---------------------------------------------------------------}
{ Read Variable to Primary Register }

procedure ReadVar;
begin
EmitLn('BSR READ');
Store(Value[1]);
end;

{ Write Variable from Primary Register }

procedure WriteVar;
begin
EmitLn('BSR WRITE');
end;

{--------------------------------------------------------------}
{ Write Header Info }

procedure Header;
begin
WriteLn('WARMST', TAB, 'EQU $A01E');
end;

{--------------------------------------------------------------}
{ Write the Prolog }

procedure Prolog;
begin
PostLabel('MAIN');
end;

{--------------------------------------------------------------}
{ Write the Epilog }

procedure Epilog;
begin
EmitLn('DC WARMST');
EmitLn('END MAIN');
end;

{---------------------------------------------------------------}
{ Parse and Translate a Math Factor }

procedure BoolExpression; Forward;

procedure Factor;
begin
if Look = '(' then begin
Match('(');
BoolExpression;
Match(')');
end
else if IsAlpha(Look) then begin
GetName;
LoadVar(Value);
end
else
LoadConst(GetNum);
end;

{--------------------------------------------------------------}
{ Parse and Translate a Negative Factor }

procedure NegFactor;
begin
Match('-');
if IsDigit(Look) then
LoadConst(-GetNum)
else begin
Factor;
Negate;
end;
end;

{--------------------------------------------------------------}
{ Parse and Translate a Leading Factor }

procedure FirstFactor;
begin
case Look of
'+': begin
Match('+');
Factor;
end;
'-': NegFactor;
else Factor;
end;
end;

{--------------------------------------------------------------}
{ Recognize and Translate a Multiply }

procedure Multiply;
begin
Match('*');
Factor;
PopMul;
end;

{-------------------------------------------------------------}
{ Recognize and Translate a Divide }

procedure Divide;
begin
Match('/');
Factor;
PopDiv;
end;

{---------------------------------------------------------------}
{ Common Code Used by Term and FirstTerm }

procedure Term1;
begin
while IsMulop(Look) do begin
Push;
case Look of
'*': Multiply;
'/': Divide;
end;
end;
end;
{---------------------------------------------------------------}
{ Parse and Translate a Math Term }

procedure Term;
begin
Factor;
Term1;
end;

{---------------------------------------------------------------}
{ Parse and Translate a Leading Term }

procedure FirstTerm;
begin
FirstFactor;
Term1;
end;

{--------------------------------------------------------------}
{ Recognize and Translate an Add }

procedure Add;
begin
Match('+');
Term;
PopAdd;
end;

{-------------------------------------------------------------}
{ Recognize and Translate a Subtract }

procedure Subtract;
begin
Match('-');
Term;
PopSub;
end;

{---------------------------------------------------------------}
{ Parse and Translate an Expression }

procedure Expression;
begin
FirstTerm;
while IsAddop(Look) do begin
Push;
case Look of
'+': Add;
'-': Subtract;
end;
end;
end;

{---------------------------------------------------------------}
{ Recognize and Translate a Relational "Equals" }
procedure Equal;
begin
Match('=');
Expression;
PopCompare;
SetEqual;
end;

{---------------------------------------------------------------}
{ Recognize and Translate a Relational "Less Than or Equal" }

procedure LessOrEqual;
begin
Match('=');
Expression;
PopCompare;
SetLessOrEqual;
end;

{---------------------------------------------------------------}
{ Recognize and Translate a Relational "Not Equals" }

procedure NotEqual;
begin
Match('>');
Expression;
PopCompare;
SetNEqual;
end;

{---------------------------------------------------------------}
{ Recognize and Translate a Relational "Less Than" }

procedure Less;
begin
Match('<');
case Look of
'=': LessOrEqual;
'>': NotEqual;
else begin
Expression;
PopCompare;
SetLess;
end;
end;
end;

{---------------------------------------------------------------}
{ Recognize and Translate a Relational "Greater Than" }

procedure Greater;
begin
Match('>');
if Look = '=' then begin
Match('=');
Expression;
PopCompare;
SetGreaterOrEqual;
end
else begin
Expression;
PopCompare;
SetGreater;
end;
end;

{---------------------------------------------------------------}
{ Parse and Translate a Relation }

procedure Relation;
begin
Expression;
if IsRelop(Look) then begin
Push;
case Look of
'=': Equal;
'<': Less;
'>': Greater;
end;
end;
end;

{---------------------------------------------------------------}
{ Parse and Translate a Boolean Factor with Leading NOT }

procedure NotFactor;
begin
if Look = '!' then begin
Match('!');
Relation;
NotIt;
end
else
Relation;
end;

{---------------------------------------------------------------}
{ Parse and Translate a Boolean Term }

procedure BoolTerm;
begin
NotFactor;
while Look = '&' do begin
Push;
Match('&');
NotFactor;
PopAnd;
end;
end;
{--------------------------------------------------------------}
{ Recognize and Translate a Boolean OR }

procedure BoolOr;
begin
Match('|');
BoolTerm;
PopOr;
end;

{--------------------------------------------------------------}
{ Recognize and Translate an Exclusive Or }

procedure BoolXor;
begin
Match('~');
BoolTerm;
PopXor;
end;

{---------------------------------------------------------------}
{ Parse and Translate a Boolean Expression }

procedure BoolExpression;
begin
BoolTerm;
while IsOrOp(Look) do begin
Push;
case Look of
'|': BoolOr;
'~': BoolXor;
end;
end;
end;

{--------------------------------------------------------------}
{ Parse and Translate an Assignment Statement }

procedure Assignment;
var Name: string;
begin
Name := Value;
Match('=');
BoolExpression;
Store(Name);
end;

{---------------------------------------------------------------}
{ Recognize and Translate an IF Construct }

procedure Block; Forward;

{--------------------------------------------------------------}
{ Parse and Translate a WHILE Statement }

procedure DoWhile;
var L1, L2: string;
begin
L1 := NewLabel;
L2 := NewLabel;
PostLabel(L1);
BoolExpression;
BranchFalse(L2);
Block;
MatchString('ENDWHILE');
Branch(L1);
PostLabel(L2);
end;

{--------------------------------------------------------------}
{ Process a Read Statement }

procedure DoRead;
begin
Match('(');
GetName;
ReadVar;
while Look = ',' do begin
Match(',');
GetName;
ReadVar;
end;
Match(')');
end;

{--------------------------------------------------------------}
{ Process a Write Statement }

procedure DoWrite;
begin
Match('(');
Expression;
WriteVar;
while Look = ',' do begin
Match(',');
Expression;
WriteVar;
end;
Match(')');
end;

{--------------------------------------------------------------}
{ Parse and Translate a Block of Statements }

procedure Block;
begin
Scan;
while not(Token in ['e', 'l']) do begin
case Token of
'i': DoIf;
'w': DoWhile;
'R': DoRead;
'W': DoWrite;
else Assignment;
end;
Scan;
end;
end;

{--------------------------------------------------------------}
{ Allocate Storage for a Variable }

procedure Alloc(N: Symbol);

begin
if InTable(N) then Abort('Duplicate Variable Name ' + N);
AddEntry(N, 'v');
Write(N, ':', TAB, 'DC ');
if Look = '=' then begin
Match('=');
If Look = '-' then begin
Write(Look);
Match('-');
end;
WriteLn(GetNum);
end
else
WriteLn('0');
end;

{--------------------------------------------------------------}
{ Parse and Translate a Data Declaration }

procedure Decl;
begin
GetName;
Alloc(Value);
while Look = ',' do begin
Match(',');
GetName;
Alloc(Value);
end;
end;
{--------------------------------------------------------------}
{ Parse and Translate Global Declarations }

procedure TopDecls;
begin
Scan;
while Token <> 'b' do begin
case Token of
'v': Decl;
else Abort('Unrecognized Keyword ' + Value);
end;
Scan;
end;
end;

{--------------------------------------------------------------}
{ Parse and Translate a Main Program }

procedure Main;
begin
MatchString('BEGIN');
Prolog;
Block;
MatchString('END');
Epilog;
end;

{--------------------------------------------------------------}
{ Parse and Translate a Program }

procedure Prog;
begin
MatchString('PROGRAM');
Header;
TopDecls;
Main;
Match('.');
end;

{--------------------------------------------------------------}
{ Initialize }

procedure Init;
var i: integer;
begin
for i := 1 to MaxEntry do begin
ST[i] := '';
SType[i] := ' ';
end;
GetChar;
Scan;
end;

{--------------------------------------------------------------}
{ Main Program }
begin
Init;
Prog;
if Look <> CR then Abort('Unexpected data after ''.''');
end.
{--------------------------------------------------------------}
GET YOUR DOMAIN:
type here .com
Get It Now >>

[ Amit Mathur Online Home Page ]

LET'S BUILD A COMPILER!

By
Jack W. Crenshaw, Ph.D.
24 July 1988
Copyright : Jack W. Crenshaw
This book can be distributed as it is with above information

[ eBooks Homepage ] [ This Book Homepage ] [ Discussion Forum ]

Part XI : LEXICAL SCAN REVISITED

INTRODUCTION

I've got some good news and some bad news. The bad news is that
this installment is not the one I promised last time. What's
more, the one after this one won't be, either.

The good news is the reason for this installment: I've found a
way to simplify and improve the lexical scanning part of the
compiler. Let me explain.

BACKGROUND

If you'll remember, we talked at length about the subject of

lexical scanners in Part VII, and I left you with a design for a
distributed scanner that I felt was about as simple as I could
make it ... more than most that I've seen elsewhere. We used
that idea in Part X. The compiler structure that resulted was
simple, and it got the job done.

Recently, though, I've begun to have problems, and they're the

kind that send a message that you might be doing something wrong.

The whole thing came to a head when I tried to address the issue
of semicolons. Several people have asked me about them, and
whether or not KISS will have them separating the statements. My
intention has been NOT to use semicolons, simply because I don't
like them and, as you can see, they have not proved necessary.

But I know that many of you, like me, have gotten used to them,
and so I set out to write a short installment to show you how
they could easily be added, if you were so inclined.

Well, it turned out that they weren't easy to add at all. In

fact it was darned difficult.

I guess I should have realized that something was wrong, because

of the issue of newlines. In the last couple of installments
we've addressed that issue, and I've shown you how to deal with
newlines with a procedure called, appropriately enough, NewLine.
In TINY Version 1.0, I sprinkled calls to this procedure in
strategic spots in the code.

It seems that every time I've addressed the issue of newlines,

though, I've found it to be tricky, and the resulting parser
turned out to be quite fragile ... one addition or deletion here
or there and things tended to go to pot. Looking back on it, I
realize that there was a message in this that I just wasn't
paying attention to.

When I tried to add semicolons on top of the newlines, that was

the last straw. I ended up with much too complex a solution. I
began to realize that something fundamental had to change.

So, in a way this installment will cause us to backtrack a bit

and revisit the issue of scanning all over again. Sorry about
that. That's the price you pay for watching me do this in real
time. But the new version is definitely an improvement, and will
serve us well for what is to come.

As I said, the scanner we used in Part X was about as simple as

one can get. But anything can be improved. The new scanner is
more like the classical scanner, and not as simple as before.
But the overall compiler structure is even simpler than before.
It's also more robust, and easier to add to and/or modify. I
think that's worth the time spent in this digression. So in this
installment, I'll be showing you the new structure. No doubt
you'll be happy to know that, while the changes affect many
procedures, they aren't very profound and so we lose very little
of what's been done so far.

Ironically, the new scanner is much more conventional than the

old one, and is very much like the more generic scanner I showed
you earlier in Part VII. Then I started trying to get clever,
and I almost clevered myself clean out of business. You'd think
one day I'd learn: K-I-S-S!

THE PROBLEM

The problem begins to show itself in procedure Block, which I've

reproduced below:

{--------------------------------------------------------------}
{ Parse and Translate a Block of Statements }

As you can see, Block is oriented to individual program

statements. At each pass through the loop, we know that we are
at the beginning of a statement. We exit the block when we have
scanned an END or an ELSE.

But suppose that we see a semicolon instead. The procedure as

it's shown above can't handle that, because procedure Scan only
expects and can only accept tokens that begin with a letter.

I tinkered around for quite awhile to come up with a fix. I

found many possible approaches, but none were very satisfying. I
finally figured out the reason.

Recall that when we started with our single-character parsers, we

adopted a convention that the lookahead character would always be
prefetched. That is, we would have the character that
corresponds to our current position in the input stream fetched
into the global character Look, so that we could examine it as
many times as needed. The rule we adopted was that EVERY
recognizer, if it found its target token, would advance Look to
the next character in the input stream.

That simple and fixed convention served us very well when we had
single-character tokens, and it still does. It would make a lot
of sense to apply the same rule to multi-character tokens.

But when we got into lexical scanning, I began to violate that

simple rule. The scanner of Part X did indeed advance to the
next token if it found an identifier or keyword, but it DIDN'T do
that if it found a carriage return, a whitespace character, or an
operator.

Now, that sort of mixed-mode operation gets us into deep trouble

in procedure Block, because whether or not the input stream has
been advanced depends upon the kind of token we encounter. If
it's a keyword or the target of an assignment statement, the
"cursor," as defined by the contents of Look, has been advanced
to the next token OR to the beginning of whitespace. If, on the
other hand, the token is a semicolon, or if we have hit a
carriage return, the cursor has NOT advanced.

Needless to say, we can add enough logic to keep us on track.

But it's tricky, and makes the whole parser very fragile.

There's a much better way, and that's just to adopt that same
rule that's worked so well before, to apply to TOKENS as well as
single characters. In other words, we'll prefetch tokens just as
we've always done for characters. It seems so obvious once you
think about it that way.

Interestingly enough, if we do things this way the problem that

we've had with newline characters goes away. We can just lump
them in as whitespace characters, which means that the handling
of newlines becomes very trivial, and MUCH less prone to error
than we've had to deal with in the past.

THE SOLUTION

Let's begin to fix the problem by re-introducing the two

procedures:

{--------------------------------------------------------------}
{ Get an Identifier }

procedure GetName;
begin
SkipWhite;
if Not IsAlpha(Look) then Expected('Identifier');
Token := 'x';
Value := '';
repeat
Value := Value + UpCase(Look);
GetChar;
until not IsAlNum(Look);
end;

{--------------------------------------------------------------}
{ Get a Number }

procedure GetNum;
begin
SkipWhite;
if not IsDigit(Look) then Expected('Number');
Token := '#';
Value := '';
repeat
Value := Value + Look;
GetChar;
until not IsDigit(Look);
end;
{--------------------------------------------------------------}

These two procedures are functionally almost identical to the

ones I showed you in Part VII. They each fetch the current
token, either an identifier or a number, into the global string
Value. They also set the encoded version, Token, to the
appropriate code. The input stream is left with Look containing
the first character NOT part of the token.

We can do the same thing for operators, even multi-character

operators, with a procedure such as:

{--------------------------------------------------------------}
{ Get an Operator }
procedure GetOp;
begin
Token := Look;
Value := '';
repeat
Value := Value + Look;
GetChar;
until IsAlpha(Look) or IsDigit(Look) or IsWhite(Look);
end;
{--------------------------------------------------------------}

Note that GetOp returns, as its encoded token, the FIRST

character of the operator. This is important, because it means
that we can now use that single character to drive the parser,
instead of the lookahead character.

We need to tie these procedures together into a single procedure

that can handle all three cases. The following procedure will
read any one of the token types and always leave the input stream
advanced beyond it:

{--------------------------------------------------------------}
{ Get the Next Input Token }

procedure Next;
begin
SkipWhite;
if IsAlpha(Look) then GetName
else if IsDigit(Look) then GetNum
else GetOp;
end;
{--------------------------------------------------------------}

***NOTE that here I have put SkipWhite BEFORE the calls rather
than after. This means that, in general, the variable Look will
NOT have a meaningful value in it, and therefore we should NOT
use it as a test value for parsing, as we have been doing so far.
That's the big departure from our normal approach.

Now, remember that before I was careful not to treat the carriage
return (CR) and line feed (LF) characters as white space. This
was because, with SkipWhite called as the last thing in the
scanner, the encounter with LF would trigger a read statement.
If we were on the last line of the program, we couldn't get out
until we input another line with a non-white character. That's
why I needed the second procedure, NewLine, to handle the CRLF's.

But now, with the call to SkipWhite coming first, that's exactly
the behavior we want. The compiler must know there's another
token coming or it wouldn't be calling Next. In other words, it
hasn't found the terminating END yet. So we're going to insist
on more data until we find something.

All this means that we can greatly simplify both the program and
the concepts, by treating CR and LF as whitespace characters, and
eliminating NewLine. You can do that simply by modifying the
function IsWhite:
{--------------------------------------------------------------}
{ Recognize White Space }

function IsWhite(c: char): boolean;

begin
IsWhite := c in [' ', TAB, CR, LF];
end;
{--------------------------------------------------------------}

We've already tried similar routines in Part VII, but you might
as well try these new ones out. Add them to a copy of the Cradle
and call Next with the following main program:

{--------------------------------------------------------------}
{ Main Program }

begin
Init;
repeat
Next;
WriteLn(Token, ' ', Value);
until Token = '.';
end.
{--------------------------------------------------------------}

Compile it and verify that you can separate a program into a

series of tokens, and that you get the right encoding for each
token.

This ALMOST works, but not quite. There are two potential
problems: First, in KISS/TINY almost all of our operators are
single-character operators. The only exceptions are the relops
>=, <=, and <>. It seems a shame to treat all operators as
strings and do a string compare, when only a single character
compare will almost always suffice. Second, and much more
important, the thing doesn't WORK when two operators appear
together, as in (a+b)*(c+d). Here the string following 'b' would
be interpreted as a single operator ")*(."

It's possible to fix that problem. For example, we could just

give GetOp a list of legal characters, and we could treat the
parentheses as different operator types than the others. But
this begins to get messy.

Fortunately, there's a better way that solves all the problems.

Since almost all the operators are single characters, let's just
treat them that way, and let GetOp get only one character at a
time. This not only simplifies GetOp, but also speeds things up
quite a bit. We still have the problem of the relops, but we
were treating them as special cases anyway.

So here's the final version of GetOp:

{--------------------------------------------------------------}
{ Get an Operator }
procedure GetOp;
begin
SkipWhite;
Token := Look;
Value := Look;
GetChar;
end;
{--------------------------------------------------------------}

Note that I still give the string Value a value. If you're truly
concerned about efficiency, you could leave this out. When we're
expecting an operator, we will only be testing Token anyhow, so
the value of the string won't matter. But to me it seems to be
good practice to give the thing a value just in case.

Try this new version with some realistic-looking code. You

should be able to separate any program into its individual
tokens, with the caveat that the two-character relops will scan
into two separate tokens. That's OK ... we'll parse them that
way.

Now, in Part VII the function of Next was combined with procedure
Scan, which also checked every identifier against a list of
keywords and encoded each one that was found. As I mentioned at
the time, the last thing we would want to do is to use such a
procedure in places where keywords should not appear, such as in
expressions. If we did that, the keyword list would be scanned
for every identifier appearing in the code. Not good.

The right way to deal with that is to simply separate the

functions of fetching tokens and looking for keywords. The
version of Scan shown below does NOTHING but check for keywords.
Notice that it operates on the current token and does NOT advance
the input stream.

{--------------------------------------------------------------}
{ Scan the Current Identifier for Keywords }

procedure Scan;
begin
if Token = 'x' then
Token := KWcode[Lookup(Addr(KWlist), Value, NKW) + 1];
end;
{--------------------------------------------------------------}

There is one last detail. In the compiler there are a few places
that we must actually check the string value of the token.
Mainly, this is done to distinguish between the different END's,
but there are a couple of other places. (I should note in
passing that we could always eliminate the need for matching END
characters by encoding each one to a different character. Right
now we are definitely taking the lazy man's route.)

The following version of MatchString takes the place of the

character-oriented Match. Note that, like Match, it DOES advance
the input stream.
{--------------------------------------------------------------}
{ Match a Specific Input String }

procedure MatchString(x: string);

begin
if Value <> x then Expected('''' + x + '''');
Next;
end;
{--------------------------------------------------------------}

FIXING UP THE COMPILER

Armed with these new scanner procedures, we can now begin to fix
the compiler to use them properly. The changes are all quite
minor, but there are quite a few places where changes are
necessary. Rather than showing you each place, I will give you
the general idea and then just give the finished product.

First of all, the code for procedure Block doesn't change, though
its function does:

{--------------------------------------------------------------}
{ Parse and Translate a Block of Statements }

Remember that the new version of Scan doesn't advance the input
stream, it only scans for keywords. The input stream must be
advanced by each procedure that Block calls.

In general, we have to replace every test on Look with a similar

test on Token. For example:

{---------------------------------------------------------------}
{ Parse and Translate a Boolean Expression }

procedure BoolExpression;
begin
BoolTerm;
while IsOrOp(Token) do begin
Push;
case Token of
'|': BoolOr;
'~': BoolXor;
end;
end;
end;
{--------------------------------------------------------------}

In procedures like Add, we don't have to use Match anymore. We

need only call Next to advance the input stream:

{--------------------------------------------------------------}
{ Recognize and Translate an Add }

procedure Add;
begin
Next;
Term;
PopAdd;
end;
{-------------------------------------------------------------}

Control structures are actually simpler. We just call Next to

advance over the control keywords:

{---------------------------------------------------------------}
{ Recognize and Translate an IF Construct }

procedure Block; Forward;

procedure DoIf;
var L1, L2: string;
begin
Next;
BoolExpression;
L1 := NewLabel;
L2 := L1;
BranchFalse(L1);
Block;
if Token = 'l' then begin
Next;
L2 := NewLabel;
Branch(L2);
PostLabel(L1);
Block;
end;
PostLabel(L2);
MatchString('ENDIF');
end;
{--------------------------------------------------------------}

That's about the extent of the REQUIRED changes. In the listing

of TINY Version 1.1 below, I've also made a number of other
"improvements" that aren't really required. Let me explain them
briefly:

(1) I've deleted the two procedures Prog and Main, and combined
their functions into the main program. They didn't seem to
add to program clarity ... in fact they seemed to just
muddy things up a little.

(2) I've deleted the keywords PROGRAM and BEGIN from the
keyword list. Each one only occurs in one place, so it's
not necessary to search for it.

(3) Having been bitten by an overdose of cleverness, I've

reminded myself that TINY is supposed to be a minimalist
program. Therefore I've replaced the fancy handling of
unary minus with the dumbest one I could think of. A giant
step backwards in code quality, but a great simplification
of the compiler. KISS is the right place to use the other
version.

(4) I've added some error-checking routines such as CheckTable

and CheckDup, and replaced in-line code by calls to them.
This cleans up a number of routines.

(5) I've taken the error checking out of code generation

routines like Store, and put it in the parser where it
belongs. See Assignment, for example.

(6) There was an error in InTable and Locate that caused them
to search all locations instead of only those with valid
data in them. They now search only valid cells. This
allows us to eliminate the initialization of the symbol
table, which was done in Init.

(7) Procedure AddEntry now has two arguments, which helps to

make things a bit more modular.

(8) I've cleaned up the code for the relational operators by

the addition of the new procedures CompareExpression and
NextExpression.

(9) I fixed an error in the Read routine ... the earlier value
did not check for a valid variable name.

CONCLUSION

The resulting compiler for TINY is given below. Other than the
removal of the keyword PROGRAM, it parses the same language as
before. It's just a bit cleaner, and more importantly it's
considerably more robust. I feel good about it.

The next installment will be another digression: the discussion

of semicolons and such that got me into this mess in the first
place. THEN we'll press on into procedures and types. Hang in
there with me. The addition of those features will go a long way
towards removing KISS from the "toy language" category. We're
getting very close to being able to write a serious compiler.

TINY VERSION 1.1

{--------------------------------------------------------------}
program Tiny11;

{--------------------------------------------------------------}
{ Constant Declarations }

const TAB = ^I;

CR = ^M;
LF = ^J;

LCount: integer = 0;
NEntry: integer = 0;

{--------------------------------------------------------------}
{ Type Declarations }

type Symbol = string[8];

SymTab = array[1..1000] of Symbol;

TabPtr = ^SymTab;

{--------------------------------------------------------------}
{ Variable Declarations }

var Look : char; { Lookahead Character }

Token: char; { Encoded Token }
Value: string[16]; { Unencoded Token }

const MaxEntry = 100;

var ST : array[1..MaxEntry] of Symbol;

SType: array[1..MaxEntry] of char;

{--------------------------------------------------------------}
{ Definition of Keywords and Token Types }

const NKW = 9;
NKW1 = 10;

const KWlist: array[1..NKW] of Symbol =

('IF', 'ELSE', 'ENDIF', 'WHILE', 'ENDWHILE',
'READ', 'WRITE', 'VAR', 'END');

const KWcode: string[NKW1] = 'xileweRWve';

{--------------------------------------------------------------}
{ Read New Character From Input Stream }

procedure GetChar;
begin
Read(Look);
end;

{--------------------------------------------------------------}
{ Report an Error }
procedure Error(s: string);
begin
WriteLn;
WriteLn(^G, 'Error: ', s, '.');
end;

{--------------------------------------------------------------}
{ Report Error and Halt }

procedure Abort(s: string);

begin
Error(s);
Halt;
end;

{--------------------------------------------------------------}
{ Report What Was Expected }

procedure Expected(s: string);

begin
Abort(s + ' Expected');
end;

{--------------------------------------------------------------}
{ Report an Undefined Identifier }

procedure Undefined(n: string);

begin
Abort('Undefined Identifier ' + n);
end;

{--------------------------------------------------------------}
{ Report a Duplicate Identifier }

procedure Duplicate(n: string);

begin
Abort('Duplicate Identifier ' + n);
end;

{--------------------------------------------------------------}
{ Check to Make Sure the Current Token is an Identifier }

procedure CheckIdent;
begin
if Token <> 'x' then Expected('Identifier');
end;

{--------------------------------------------------------------}
{ Recognize an Alpha Character }

function IsAlpha(c: char): boolean;

begin
IsAlpha := UpCase(c) in ['A'..'Z'];
end;
{--------------------------------------------------------------}
{ Recognize a Decimal Digit }

function IsDigit(c: char): boolean;

begin
IsDigit := c in ['0'..'9'];
end;

{--------------------------------------------------------------}
{ Recognize an AlphaNumeric Character }

function IsAlNum(c: char): boolean;

begin
IsAlNum := IsAlpha(c) or IsDigit(c);
end;

{--------------------------------------------------------------}
{ Recognize an Addop }

function IsAddop(c: char): boolean;

begin
IsAddop := c in ['+', '-'];
end;

{--------------------------------------------------------------}
{ Recognize a Mulop }

function IsMulop(c: char): boolean;

begin
IsMulop := c in ['*', '/'];
end;

{--------------------------------------------------------------}
{ Recognize a Boolean Orop }

function IsOrop(c: char): boolean;

begin
IsOrop := c in ['|', '~'];
end;

{--------------------------------------------------------------}
{ Recognize a Relop }

function IsRelop(c: char): boolean;

begin
IsRelop := c in ['=', '#', '<', '>'];
end;

{--------------------------------------------------------------}
{ Recognize White Space }

function IsWhite(c: char): boolean;

begin
IsWhite := c in [' ', TAB, CR, LF];
end;
{--------------------------------------------------------------}
{ Skip Over Leading White Space }

procedure SkipWhite;
begin
while IsWhite(Look) do
GetChar;
end;

{--------------------------------------------------------------}
{ Table Lookup }

function Lookup(T: TabPtr; s: string; n: integer): integer;

var i: integer;
found: Boolean;
begin
found := false;
i := n;
while (i > 0) and not found do
if s = T^[i] then
found := true
else
dec(i);
Lookup := i;
end;

{--------------------------------------------------------------}
{ Locate a Symbol in Table }
{ Returns the index of the entry. Zero if not present. }

function Locate(N: Symbol): integer;

begin
Locate := Lookup(@ST, n, NEntry);
end;

{--------------------------------------------------------------}
{ Look for Symbol in Table }

function InTable(n: Symbol): Boolean;

begin
InTable := Lookup(@ST, n, NEntry) <> 0;
end;

{--------------------------------------------------------------}
{ Check to See if an Identifier is in the Symbol Table }
{ Report an error if it's not. }

procedure CheckTable(N: Symbol);

begin
if not InTable(N) then Undefined(N);
end;

{--------------------------------------------------------------}
{ Check the Symbol Table for a Duplicate Identifier }
{ Report an error if identifier is already in table. }

procedure CheckDup(N: Symbol);

begin
if InTable(N) then Duplicate(N);
end;

{--------------------------------------------------------------}
{ Add a New Entry to Symbol Table }

procedure AddEntry(N: Symbol; T: char);

begin
CheckDup(N);
if NEntry = MaxEntry then Abort('Symbol Table Full');
Inc(NEntry);
ST[NEntry] := N;
SType[NEntry] := T;
end;

{--------------------------------------------------------------}
{ Get an Identifier }

procedure GetName;
begin
SkipWhite;
if Not IsAlpha(Look) then Expected('Identifier');
Token := 'x';
Value := '';
repeat
Value := Value + UpCase(Look);
GetChar;
until not IsAlNum(Look);
end;

{--------------------------------------------------------------}
{ Get a Number }

procedure GetNum;
begin
SkipWhite;
if not IsDigit(Look) then Expected('Number');
Token := '#';
Value := '';
repeat
Value := Value + Look;
GetChar;
until not IsDigit(Look);
end;

{--------------------------------------------------------------}
{ Get an Operator }

procedure GetOp;
begin
SkipWhite;
Token := Look;
Value := Look;
GetChar;
end;

{--------------------------------------------------------------}
{ Get the Next Input Token }

procedure Next;
begin
SkipWhite;
if IsAlpha(Look) then GetName
else if IsDigit(Look) then GetNum
else GetOp;
end;

{--------------------------------------------------------------}
{ Scan the Current Identifier for Keywords }

procedure Scan;
begin
if Token = 'x' then
Token := KWcode[Lookup(Addr(KWlist), Value, NKW) + 1];
end;

{--------------------------------------------------------------}
{ Match a Specific Input String }

procedure MatchString(x: string);

begin
if Value <> x then Expected('''' + x + '''');
Next;
end;

{--------------------------------------------------------------}
{ Output a String with Tab }

procedure Emit(s: string);

begin
Write(TAB, s);
end;

{--------------------------------------------------------------}
{ Output a String with Tab and CRLF }

procedure EmitLn(s: string);

begin
Emit(s);
WriteLn;
end;

{--------------------------------------------------------------}
{ Generate a Unique Label }

function NewLabel: string;

var S: string;
begin
Str(LCount, S);
NewLabel := 'L' + S;
Inc(LCount);
end;

{--------------------------------------------------------------}
{ Post a Label To Output }

procedure PostLabel(L: string);

begin
WriteLn(L, ':');
end;

{---------------------------------------------------------------}
{ Clear the Primary Register }

procedure Clear;
begin
EmitLn('CLR D0');
end;

{---------------------------------------------------------------}
{ Negate the Primary Register }

procedure Negate;
begin
EmitLn('NEG D0');
end;

{---------------------------------------------------------------}
{ Complement the Primary Register }

procedure NotIt;
begin
EmitLn('NOT D0');
end;

{---------------------------------------------------------------}
{ Load a Constant Value to Primary Register }

procedure LoadConst(n: string);