Re: Parsing s-expression trees
- From: "xahlee@xxxxxxxxx" <xahlee@xxxxxxxxx>
- Date: Mon, 1 Sep 2008 11:32:54 -0700 (PDT)
On Aug 31, 11:24 am, Carter <carterch...@xxxxxxxxx> wrote:
Hi,
I am trying to design a language interpreter which takes text --> s-
expressions --> lisp. The first transformation for me for me is
reasonably well understood and i suspect I have enough knowledge to
know how to go about it but the second causes me some difficulties
because I am unsure how to construct a parser generator for trees (I
would prefer this since I expect my language specification to change
as I proceed). Are there existing tools which can walk and match s-
expressions efficiently or perhaps in the worse case some references
on how to parse tree grammars so I can generate my own?
Tangential to your issue, but note that lisp's s-expression is
irregular, this makes writing a parser for it non-trivial.
For detail, see:
★ Fundamental Problems of Lisp
http://xahlee.org/UnixResource_dir/writ/lisp_problems.html
★ The Concepts and Confusions of Prefix, Infix, Postfix and Fully
Nested Notations
http://xahlee.org/UnixResource_dir/writ/notations.html
also, in another post of recent, i detailed a isomorphism between tree
data structure and tree in textual form (like sexp). This isomorphism
has significant impact on the issues of parsing and much of the lang's
power. I haven't yet put that post up on my website.
In the following, is parts of the essay of one of the above article,
followed by the article on isomorphism between tree data structure and
tree syntax.
---------------------------
Fundamental Problems of Lisp
Xah Lee, 2008-07
In this article, we discuss 2 problems that are rooted in lisp. One is
the irregularity in its often cited regular syntax. The other is the
language's use of “cons” for list.
Syntax Irregularities
Lisp family of languages, in particular, Common Lisp, Scheme Lisp,
Emacs Lisp, are well know for its syntax's regularity, namely,
“everything” is of the form “(f x1 x2 ...)”.. However, it is little
talked about that there are several irregularities in its syntax. Here
are some examples of the syntax irregularity.
The comment syntax of semicolon to end of line “;”.
The dotted notation for cons cell “(1 . 2)”.
The single quote syntax used to hold evaluation, e.g. “'(1 2 3)”.
The backquote and comma syntax used to hold but evaluate parts of
expression, e.g. “(setq x 1) (setq myVariableAndValuePair `(x ,x))”.
The “,@” for inserting a list as elements into another list.. e.g.
“(setq myListX (list 1 2)) (setq myListY (list 3 4)) (setq myListXY
`(,@ myListX ,@ myListY))”
There are various others in Common Lisp or Scheme Lisp. For example,
the char “#” and “#|”. In Scheme's R6RS, it has introduced a few new
ones.
In the following, we detail how these irregularities hamper the power
of regular syntax, and some powerful features and language
developments that lisp have missed that may be due to it.
Confusing
Lisp's irregular syntax are practically confusing. For example, the
difference between “(list 1 2 3)”, “'(1 2 3)”, “(quote (1 2 3))” is a
frequently asked question. The use of “` , ,@” etc are esoteric. If
all these semantics use the regular syntactical form “(f args)”, then
much confusion will be reduced and people will understand and use
these features better. For example, The “'(1 2 3)” might be changed to
“(' 1 2 3)”, and
(setq myListXY `(,@ myListX ,@ myListY))
could have been:
(setq myListXY (eval-parts (splice myListX) (splice myListY)))
or with sugar syntax for typing convenience:
(setq myListXY (` (,@ myListX) (,@ myListY)))”
Syntax-Semantics Correspondence
A regular nested syntax has a one-to-one correspondence to the
language's abstract syntax tree, and to a large extent the syntax has
some correspondence to the language's semantics. The irregularities in
syntax break this correspondence.
For example, programers can pretty much tell what piece of source code
“(f x1 x2 x3 ...)” do by just reading the name that appears as first
element in the paren. As a contrast, in syntax soup languages such as
Java, Perl, the programmer must be familiar with each of its tens of
syntactical forms. (e.g. “if (...) {...}”, “for (....; ...; ...)
{...}”, “(some? this: that)”, “x++”, “myList = [1, 2, 3]” etc.) As a
example, if lisp's “'(1 2 3)” is actually “(quote 1 2 3)” or shortcut
form “(' 1 2 3)”, then it is much easier to understand.
Source Code Transformation
Lisp relies on a regular nested syntax. Because of such regularity of
the syntax, it allows transformation of the source code by a simple
lexical scan. This has powerful ramification. (lisp's macros is one
example) For example, since the syntax is regular, one could easily
have alternative, easier to read syntaxes as a layer. (the concept is
somewhat known in early lisp as M-expression) Mathematica took this
advantage (probably independent of lisp's influence), so that you
really have easy to read syntax, yet fully retain the regular form
advantages. In lisp history, such layer been done and tried here and
there in various forms or langs ( CGOL↗, Dylan↗), but never caught on
due to largely social happenings. Part of these reasons are political
and lisper's sensitivity to criticism of its nested parens.
In lisp communities, it is widely recognized that lisp's regular
syntax has the property that “code is data; data is code”. However,
what does that mean exactly is usually not clearly understood in the
lisp community. Here is its defining characteristics:
A regular nested syntax, makes it possible to do source code
transformations trivially with a lexical scan.
The benefits of a regular syntax has become widely recognized since
mid 2000s, by the XML language. The XML language, due to its strict
syntactical regularity, has developed into many transformation
technologies such XSLT, XQuery, STX etc. See XML transformation
languages↗.
Automatic, Uniform, Universal, Source Code Display
One of the advantage of pure fully functional syntax is that a
programer should never need to format his source code (i.e. pressing
tabs, returns) in coding, and save the hundreds hours of labor,
guides, tutorials, advices, publications, editor tools, on what's
known as “coding style convention”, because the editor can reformat
the source code on the fly based on a simple lexical scan.
Because lisp's syntax has lots of nested parenthesis, the source code
formatting is much more labor intensive than syntax soup languages
such as Perl, even when using a dedicated lisp editor such as emacs
that contain large number editing commands on nested syntax.
The lisp community, established a particular way of formatting lisp
code as exhibited in emacs's lisp modes and written guides of
conventions. The recognition of such convention further erode any
possibility and awareness of automatic, uniform, universal,
formatting. (e.g. the uniform and universal part of advantage is
exhibited by Python)
As a example, the Mathematica language features a pure nested syntax
similar to lisp but without irregularities. So, in that language,
since version 3 released in 1996, the source code in its editor are
automatically formatted on the fly as programer types, much in the
same way paragraphs are automatically wrapped in a word processor
since early 1990s
Note the phrase “automatic, uniform, universal, source code display”.
By “automatic”, it means that any text editor can format your code on
the fly or by request, and this feature can be trivially implemented.
By “uniform”, it means there is one simple and mechanical heuristic,
to determine a canonical way to format any lisp code for human-
readable display. By “universal” is meant that all programers, will
recognize and habituated with this one canonical way, as a standard.
(they can of course set their editor to display it in other ways)
The “uniform” and “universal” aspect is a well-known property of
Python lang's source code. The reason Python's source code has such
uniform and universal display formatting is because it is worked into
the language's semantics. i.e. the semantics of the code depends on
the formatting (i.e. where you press tabs and returns). But also note,
Python's source code is not and cannot be automatically formatted,
precisely because the semantics and formatting is tied together. A
strictly regular nested syntax, such as Mathematica's, can, and is
done, since 1996. Lisp, despite its syntax irregularities, i think it
still can have a automatic formatting at least to a large, practical,
extent. Once lisp has automatic on-the-fly formatting (think of
emacs's fill-sexp, auto-fill-sexp), then lisp code will achieve
uniform and universal source code formatting display.
The advantage of having a automatic, uniform, universal, source code
display for a language is that it gets rids of the hundreds of hours
on the labor, tools, guides, arguments, about how one should format
his code. (this is partly the situation of Python already) But more
importantly, by having such properties, it will actually have a impact
on how programer codes in the language. i.e. what kind of idioms they
choose to use, what type of comments they put in code, and where.
This, further influences the evolution of the language, i.e. what kind
of functions or features are added to the lang. For some detail on
this aspect, see: The Harm of Manual Code Formating
Syntax As Markup Language
One of the power of such pure nested syntax is that you could build up
layers on top of it, so that the source code can function as markup of
conventional mathematical notations (i.e. MathML) and or as a word-
processing-like file that can contain structures, images (e.g.
Microsoft Office Open XML↗), yet lose practical nothing.
This is done in Mathematica in 1996 with release of Mathematica
version 3. (e.g. think of XML, its uniform nested syntax, its diverse
use as a markup lang, then, some people are adding computational
semantics to it now (i.e. a computer language with syntax of xml. e.g.
O:XML↗). You can think of Mathematica going the other way, by starting
with a computer lang with a regular nested syntax, then add new but
inert keywords to it with markup semantics. The compiler will just
treat these inert keywords like comment syntax when doing computation.
When the source code is read by a editor, the editor takes the markup
keywords for structural or stylistic representation, with title,
chapter heading, tables, images, animations, hyperlinks, typeset math
expression (e.g. think of MathML↗) etc. The non-marked-up keywords are
shown as one-dimensional textual source code just like source code is
normally shown is most languages.)
Frequently Asked Questions
You say that lisp syntax irregularities “reduce such syntax's power”.
What you mean by “syntax's power”?
Here are some concrete examples of what i mean by power of syntax.
In lisp, the comment is done by the char “;” running to end of line.
Note that this does not allow nested comment. So for example, if you
have multi-line code, and you want to comment out them all, you have
to prepend each line by a semicolon. However, if you have nested
comment syntax, one could just bracket the block of code to comment it
out. This, is a simple, perhaps trivial, example of “power of a
syntax”.
In Python, the formatting is part of the lang's syntax. Some
programers may not like it, but it is well accepted that due to
Python's syntax, python code is very easy to read, and it much done
away about programer preferences and argument about code formatting.
This is example of power of a syntax.
Let me give another, different example. You know that perl's syntax,
often the function's arguments do not necessarily need to have a paren
around it. For example, “print (3);” and “print 3;” are the same
thing. This is a example of power of syntax, when considered as a
flexibility or save of typing, for good or bad. Similarly, in
javascript for example, ending semicolon is optional when it is at end
of a line. (for sample perl and python code, see Xah's Perl and Python
Tutorial.)
In Mathematica, the language has a syntax system such that you can use
fully regular nested notation (e.g. “f[g[x]]”), postfix notation (e.g.
“x//g//f”), prefix notation (e.g. “f@g@x”), infix notation (e.g.
“1~Plus~2”), for ANY function in the language, and you can mix all of
the above. (prefix and postfix by nature are limited to functions with
just 1 arg, and infix notation by nature are limited to functions with
2 args) This is a example of power of syntax.
(for detailed explanation of Mathematica syntax and comparison to
lisp's, see: The Concepts and Confusions of Prefix, Infix, Postfix and
Fully Functional Notations )
In general, a computer lang has a syntax. The syntax, as text written
from left to right, has various properties and characteristics. Ease
of input (think of APL as counter example), succinctness (e.g. Perl,
APL), variability (Perl, Mathematica), readability (Python),
familiarity (C, Java, Javascript, ...), 2-dimensional notation (e.g.
traditional math notation) (Mathematica), ease of parsing (lisp),
regularity (APL, Mathematica, Lisp, XML), flexibility (Mathematica)...
etc. Basically, you can look at syntax, and programer's need to type
them, and how the textual structure can be mapped to the semantic
space, from many perspectives. The good qualities, such as ease of
input, ease of reading, ease of parsing, ease of transformation,
simplicity, flexibility, etc, can be considered as elements of the
syntax's power.
As a example of syntax of little power, think of a lang using just the
symbol “0” and “1” as its sole char set.
Many of lisp's sugar syntax are designed to reduce nested paren. Why
using a more consistent, but more annoying sugar syntax?
Ultimately, you have to ask why lisper advocate nested syntax in the
first place.
If lispers love the nested syntax, then, the argument that there
should not be irregularities, has merit. If lispers think occasional
irregularities of non parenthesized syntax is good, then there's the
question of how many, or what form. You might as well introduce “++i”
for “(setq i (1+ i))”.
--------------------------------------------------------------
2008-08-23
someone wrote:
«
(setf bin-tree '(4 (2 1 3) (6 5 7)))
Thus in my view the only fringe nodes (leaves) are 1, 3, 5, and 7.
»
Yes, you are right.
also note, when in a language where there's isomorphism between the
syntax and the nested list, there's a concept of head.
For example, in pseudo lisp:
(list (list 1 3) (list 5 7))
In this list, the 1,3,5,7 are leafs. Each having index {1,1}, {1,2},
{2,1}, {2,2}.
Now, the three appearances of “list”, are non-leaf nodes in the tree,
having indexes of: {0}, {1,0}, {2,0}. These positions are called Head
in Mathematica, and the notion of head also appear in lisp.
Now, consider this pseudo lisp: (4 (2 1 3) (6 5 7))
which is closer to what you gave above. Now, the element at index {0}
is 4, and at {1,0} is 2, and at {2,0} is 6.
The whole expression itself, is still a tree or a nested list. In this
way, we can see that there is a isomorphism between the textual
representation of a tree, and what is actually considered a list
datatype in the lisp-like language.
So, suppose you invented a lisp language, so that there's no cons, but
the symbol “list” represent a list datatype (the implementation of the
language may actually just be linked list as cons). So, in this
language, the expression
(list (list 1 3) (list 5 7))
would represent a list datatype. However, the expression
(4 (2 1 3) (6 5 7))
would not be a list datatype, however, the expression's structure is
identical to the previous one, and still a tree. In this language,
when the head of a expression does not have a valid definition, such
as being a integer, it is simply left unevaluated.
So, in this lang, both
(list (list 1 3) (list 5 7)) and (4 (2 1 3) (6 5 7))
are valid expressions, and of identical structure. The expression
itself represent a tree. The 2 expression can be easily transformed
into each other, by simply doing a replacement of the atom “list” to
one of integer, or vice versa. (e.g. replacing by pattern matching or
actually apply a function to the head positions)
Now, the thing about languages with a pure nested syntax is that, the
head itself can be a nested expression. For example, you can have
((f x) 1 2 3)
So, when you have a expression such as
(x (y 1 3) (z 5 7))
The indexes at {0}, {1,0}, {2,0}, i.e. the x, y, z, needs not to be
atoms themselves. They can be arbitrary expressions (tree).
So this means, in this language the head, or non-leaf nodes of a tree,
can hold data, not just the leafs.
In most lang that supports nested list, such as perl, php, python,
only the leaf nodes holds data. But as you can see the above, in this
lang with regular nested syntax, not only leaf nodes can hold data,
but any node, including non-leaf nodes (heads), can hold arbitrarily
nested data.
As a illustration to intermediat, in XML, the none-leaf nodes can hold
a limited amount of data, called its attributes.
actually, lang such as perl, php, python, javascript, you can actually
have a nested list where the non-leaf nodes also holds arbitrary
data. All you have to do, is to consider that the first element of a
list as the non-leaf nodes (i.e. lisp's “head”).
In langs such as perl etc, assuming 1st element of list as non-leaf
node is not a problem. But in lang with a purely nested syntax, by
assuming the 1st element of list as non-leaf node, i think it
necessarily introduces a more complex model of interpretation if you
still want a isomorphism between the syntax, tree, and list datatype.
--------------------------
Now, in lisp, because of the cons, combined with the fact that its
syntax is irregular, truely, fucked up all the beauty and power of
list processing.
The problem of the cons business is well known. For example, your
surprise at the various definition of leaf is just one Frequent
puzzle. The other thing about its irregular syntax, such as
'(1 2 3)
(quote (1 2 3))
(list 1 2 3)
adds more complexity to the cons problem. For example, if you read
again the above exposition about the isomorphism between the purely
nested uniform syntax, tree, and list datatype, and try to apply it to
Common Lisp, Emacs Lisp, or Scheme Lisp, you'll find all sort of
problems, and really see how lisp is crippled, in the sense that it
could have been far more consistent, simpler, powerful.
The above pseudo lisp lang explanation is based on my knowledge of
Mathematica. i.e. it is basically how Mathematica is.
Further readings:
• Trees
http://xahlee.org/tree/tree.html
• “The Concepts and Confusions of Prefix, Infix, Postfix and Fully
Functional Notations”
http://xahlee.org/UnixResource_dir/writ/notations.html
• Fundamental Problems of Lisp
http://xahlee.org/UnixResource_dir/writ/lisp_problems.html
Xah
∑ http://xahlee.org/
☄
.
- Prev by Date: Re: Parsing s-expression trees
- Next by Date: Re: Kenny does the Web! <spit>
- Previous by thread: Re: Parsing s-expression trees
- Index(es):