Re: parser
- From: thomas.mertes@xxxxxx
- Date: Wed, 17 Sep 2008 04:17:18 -0700 (PDT)
On 9 Sep., 21:15, "Bill Cunningham" <nos...@xxxxxxxxx> wrote:
What would be an example of a very simple parser? I have read about
parsers and that they come in 2 steps lexical and syntactic. Basically how
are they designed is my question; in a tree design.
Having written scanners (for lexical analysis) and parsers myself
several times for different purposes (including for the interpreter
of the Seed7 language) I think I have some suggestions. I did not
read all answers to your question. Please excuse me if I repeat the
obvious.
If you have the choice to determine the syntax to be parsed, I
suggest you choose a LL(1) one (for the lexical symbols and for the
syntax of the whole language). LL(1) means that all decisions can
be made based on the current symbol (or character for lexical
analysis). Pascal was designed to use a syntax which is parseable
with LL(1) methods. This design decision made Pascal easy to parse
and Pascal compilers slim. So maybe you take a look at the syntax of
Pascal (the one defined by Wirth. Some of the Borland extensions are
not so easy to parse). Other languages like FORTRAN, COBOL, C, C++
and Java were not designed with a concept how to parse them in mind.
A LL(1) lexical scanner is a function like 'getSymbol' which allows
to get the (character) input symbol by symbol. The scanner needs two
things to get its input:
1. A variable holding current character (often called just 'ch',
in the examples below I call it inFile.bufferChar).
2. A function to read the next character and to store it
in the current character variable (often called nextch, or
getchar, in the examples below
inFile.bufferChar := getc(inFile);
is used as the next character function).
By inspecting the current character the scanner does it's decisions
like:
while inFile.bufferChar in white_space_char do
inFile.bufferChar := getc(inFile);
end while;
As you can see character classifications such as white_space_char
are used to specify which characters the scanner should skip. If the
symbols are choosed wisely the scanner can also make other decisions
easily:
case inFile.bufferChar of
when {'a' ..'z'} | {'A' .. 'Z'}:
... read identifier
when {'+', '-', '*', '/'}:
... read operator symbol
when {'(', ')', '[', ']', '{', '}'}:
... read parenthesis
when {'0' .. '9'}:
... read numeric literal
when {'''}:
... read character literal
when {'"'}:
... read string literal
when {'#'}:
... skip comment and then read the next symbol
end case;
After the scanner has read a symbol (identifier, operator symbol,
parenthesis, or literal) the inFile.bufferChar (variable holding the
current character) should always contain the next character after
the symbol (which is possibly the first character of the next
symbol). The details of the scanned symbol are holded in some
variables (Pascal compilers even use global variables for this).
Usually you will have a symbol category (enumeration with values
like BEGIN, END, FUNCTION, IDENTIFIER, OPERATOR, PAREN,
NUMERIC_LITERAL, CHAR_LITERAL, STRING_LITERAL, END_OF_FILE). The
scanner can also have the job to recognize reserved words like
'begin' and to use the symbol category BEGIN to describe it. The
symbol category END_OF_FILE is used to signal EOF. Other variables
hold numeric character and string literal values.
I wrote also something about scanning functions here:
http://seed7.sourceforge.net/manual/file.htm#Scanning_a_file
The source code of some simple scanner functions is here:
http://seed7.sourceforge.net/prg/scanfile.htm
Since it is sometimes helpful to scan directly from a string
(without interfacing the string as file) there are also
scanner functions defined for strings:
http://seed7.sourceforge.net/prg/scanstri.htm
A recursive descend parser works similar as a scanner. It consists
of functions like 'ifstatement', 'whilestatement', 'expression',
'term' This functions use the current symbol to do the decisions
and call the scanner (getSymbol) to advance to the next symbol.
When a parser function is called it assumes that the first symbol
to be processed is already in the current symbol variable. When
a parser function is left the first symbol after the construct
processed is already in the current symbol variable. A parser
function to process a Pascal while statement is:
const proc: whilestatement is func
begin
insymbol;
expression;
if symbol.category = DOSY then
insymbol;
else
error(54);
end if;
statement(fsys);
end func; (* whilestatement *)
The parser function 'statement' (which has the job of parsing Pascal
statements) may contain the following case statement:
case symbol.category of
when {IDENT}:
insymbol;
if symbol.category = BECOMES then
assignment;
else
call;
end if;
when {BEGINSY}:
compoundstatement;
when {GOTOSY}:
gotostatement;
when {IFSY}:
ifstatement;
when {CASESY}:
casestatement;
when {WHILESY}:
whilestatement;
when {REPEATSY}:
repeatstatement;
when {FORSY}:
forstatement;
when {WITHSY}:
withstatement;
end case;
if symbol.category = SEMICOLON then
insymbol;
elsif symbol.category not in {ENDSY, ELSESY, UNTILSY} then
error(6);
end if;
If you think that LL(1) parsing is not powerful enough I can tell
you that Seed7 is processed with a LL(1) scanner and a table driven
recursive descend LL(1) parser.
While writing scanners and parsers is an interesting task, there
are also other possibilities. If you want to parse some programming
language you have also the possibility to hook at an existing
programming language. Seed7 provides several mechanisms to extend
the language syntactically and semantically. So you might take a
look at the following examples:
http://seed7.sourceforge.net/examples/declstat.htm
http://seed7.sourceforge.net/examples/for_decl.htm
http://seed7.sourceforge.net/examples/operator.htm
Seed7 provides also an interface to get the list/tree representation
of the parsed program. This way function calls are represented as
lists (of type ref_list) where the first element is the function
and the other list elements are parameters. There are several types
(category, reference, ref_list, program) which are used to access
this representation of a program. This types are described here:
http://seed7.sourceforge.net/manual/types.htm#category
http://seed7.sourceforge.net/manual/types.htm#reference
http://seed7.sourceforge.net/manual/types.htm#ref_list
http://seed7.sourceforge.net/manual/types.htm#program
Greetings Thomas Mertes
Seed7 Homepage: http://seed7.sourceforge.net
Seed7 - The extensible programming language: User defined statements
and operators, abstract data types, templates without special
syntax, OO with interfaces and multiple dispatch, statically typed,
interpreted or compiled, portable, runs under linux/unix/windows.
.
- References:
- parser
- From: Bill Cunningham
- parser
- Prev by Date: Re: new multicore programming docs for GCC and Visual Studio
- Next by Date: Re: new multicore programming docs for GCC and Visual Studio
- Previous by thread: Re: parser
- Next by thread: Blancpain Leman Tourbillon Mens Watch 2925-3642-53B Recommendation Discount Watches
- Index(es):
Relevant Pages
|