O(n) Parsing For General Context-Free Grammars & Transductions
- From: markwh04@xxxxxxxxx
- Date: 1 Feb 2007 14:06:40 -0800
Within the "algebraic approach" to formal language and automata
theory, a theory and calculus generalizing the "regular expressions"
higher up the Chomsky hierarchy for expressions capable of
representing context-free languages, transductions or other processes
described by syntax-directed definitions; expressions capable of even
representing general recursively specified processes, can be
developed.
The parsing problem in the classical theory involves that of
delineating one or a set of structures (parse trees, parse forests,
translations, translation sets) to represent the result of the process
of translation based on a context-free or syntax-direct specification.
Over the years, compact representational formats had evolved, notably
including the "packed parse forest" framework of Tomita et. al. in the
GLR parsing setting. However, all suffer the common malady of
essentially reducing to more or less compact ways of representing
grammars (i.e. the grammar that describes the entire set of possible
translations corresponding a given input); all essentially reducing.
to a well-known O(n^3) algorithm or a variant thereof.
The most important insight, however, comes from noting that for a
given grammar (in general) not only may the set of translations or
parses for a given input be infinite, but it may be a context-free
language or subset in its own right. Therefore, any method for
representing such sets in this general setting must amount to the
equivalent of an algebra or notation for "context-free expressions".
In fact, the process of converting a given input to a context-free
expression denoting its translation set is O(n) in the input. The
representation, moreover, can be formulated in such a way that certain
other ancillary processes (e.g. determining the emptiness of the set,
enumerating the elements of the set) can be efficiently formulated;
the particular case in point being the reduction to the emptiness test
for translation sets (a fancy way of defining CFL recognition) to that
of expressing "pure operator" expressions in the underlying algebra
for context-free expressions; this in turn reducing to the equivalent
of a familiar "shortest paths" variant of the O(n^log(7)) Viterbi
algorithm.
(1) Preliminaries: Parsing as translation
A context-free transduction with input and output alphabets X and Y,
respectively, is a context-free subset of the direct product monoid X*
x Y*. to
A monoid, if you're not familiar with the term is an algebra
consisting of a product (a,b -> ab, concatenation) and identity (1,
playing the role of the empty word) such that a(bc)=(ab)c; a1=a=1a.
The direct product M x N of two monoids M, N is defined by combining
M, N and posing the relations mn = nm for m in M, n in N. Formally, it
can be embodied by the Cartesian product M x N with identity (1,1) and
product (m,n)(m',n') = (m m', n n'), with correspondences m <-> (m,1)
and n <-> (1,n) for m in M, n in N and mn <-> (m,1)(1,n) = (m,n) and
nm <-> (1,n)(m,1) = (m,n).
Thought it is not widely known, the formalism for grammars, automata,
languages, etc. apply to monoids in general. A grammar over a monoid M
has a set V of non-terminals, a start symbol S in V, and a set, P, of
rules drawn from M[V] x M[V]. The monoid M[V] is the extension of M
obtained by freely adding the elements of V to M (called the "free
extension" of M by the set V). For the case M = X*, this gives you the
more familiar M[V] = (X union V)*, when X and V are disjoint.
A configuration is an element of M[V], and one may define a transition
by the rule (ayb) -> (azb) for each rule (y,z) in P, and a, b in
M[V]. Then the generating set for each configuration a in M[V] is
defined by
<a> = { m in M: a ->* m }.
The language recognized by the grammar is just <S>, and is a subset of
M.
For the case M = X* x Y*, this produces what is also known as a simple
syntax directed translation, whose corresponding automaton is called a
push down transduction.
Similarly, the rational subsets (those given by regular expressions)
of X* x Y* are precisely the translations produced by finite state
machines with inputs from X, outputs from Y. This is properly the
domain in which a utility, like LEX, actually operates in.
All parsing theory pertains to push down transductions and syntax
directed translations, not merely to context-free grammars.
For the more general case of a monoid M x N, the subsets may be
regarded as translations between words drawn from M to words drawn
from N. If T is a subset of M x N, then the translation set T[m] for
each word m in M may be defined by
T[m] = { n in N: (m,n) in T }.
With this preliminary out of the way, we can begin to address the
problem at hand and make the necessary fine distinctions that often
get conflated.
(2) Parsing vs. Recognition; Top-Down vs. Bottom-Up
(2.1) Clearing Up Some Misconceptions About What Parsing Is
So, let T be a translator over M and N (i.e., T a subset of M x N).
The 3 natural problems associated with translation may be posed as
follows:
(A) EMPTINESS: Determine whether T[m] = {} for a given m in M.
(B) ENUMERATION: Determine an enumeration
{ T[m,0], T[m,1], ... } of the set T[m].
For any given m in M, and n = 0,1,2,..., find whether n <
size of T[m] and (if so), find T[m,n].
(C) TRANSLATION: Given m in M, find the set T[m], itself.
The translation T is DETERMINISTIC if (size of T[m]) < 2 for all m in
M, otherwise it is NON-DETERMINISTIC.
Neither of these attributes apply directly to languages or grammars --
and this is the major misconception that had to be cleared here. They
apply ONLY through a given association of the former with
translations.
This association is given more or less at the outset by defining a
parser as a translator which produces, as its output, parse structures
of some sort. Since parse structures are normally cast in graphical
and non-linear form (as parse trees, forests, etc.), this attribution
can only be understood via some sort of linearization. The two
natural, canonical, linearizations are TOP-DOWN and BOTTOM-UP,
associated, respectively, with prefix and suffix ordering.
Given a grammar, G = (V, S, P) over a monoid M, define the top-down
and bottom-up translations canonically associated with G as follows:
V' = V
N = ({z} union P)*
P' is defined as follows:
for the start symbol S, add to P':
(S' -> S z) for bottom-up
(S' -> z S) for top-down
for each rule (a->b) in P, add to P':
(a -> b(a->b)) for bottom-up
(a -> (a->b)b) for bottom-up
For context-free grammars, the left-hand side of each rule consists
only of a non-terminal. Therefore, for this case there is a natural
interpretation of the output alphabet Y = {z} union P, with the output
(a->b) interpreted as a "reduce" action (bottom-up) or "build node"
action (top-down); and the output z interpreted as an "accept" action
(bottom-up) or "build root node" action (top-down).
Let G+ and G- denote, respectively the top-down and bottom-up
translators. There is a one-to-one correspondence between parse
trees, top-down and bottom-up constructions sequences, so that G+[m]
and G-[m] both have the same number of elements, that number being the
number of parses associated with the input word m in M.
Hence, a grammar is called AMBIGUOUS if either of its canonical
extensions is non-deterministic. A context-free language over a monoid
M is then called ambiguous if every grammar that generates it is
ambiguous.
The grammars, the best-known complexity (so far) of the emptiness test
is that incorporated by the Valiant Algorithm or any of its variants,
which gives you O(|w|^{log_2(7)) for (G+[w] == {}) or (G-[w] == {}).
The enumeration problem can be no less complex since it contains
within it a provision for checking the size of T[w], which requires at
least as much as the emptiness test.
On the other hand, the complexity of parsing depends entirely on what
manner of representation is chosen for denoting T[w].
(2.2) Parse Structures and Context-Free Expressions
The natural way to represent parses for T[w] is as a list of parse
trees, called a PARSE FOREST. For efficiency it is often desired to
remove redundancies by sharing nodes in the forest and even in the
individual trees.
It may surprise you to fully understand just what this is getting at,
ultimately. For, in fact, a parse tree is -- itself -- nothing more
and nothing less than a grammar provided in graphic form which has the
following properties:
(A) No non-terminal appears on the left-hand side (LHS) more
than once anywhere.
(B) No non-terminal appears on the right-hand side (RHS)
more than once anywhere.
(C) The non-terminals are minimal in number.
Each node in the tree is a non-terminal, each branch a rule in the
grammar, the root node is the start non-terminal.
The process of producing a parse tree effectively amounts to winnowing
down the original context free grammar by replicating non-terminals
from the original grammar (as need be) and replicating its rules in
like fashion; and doing so in such a way that the resulting grammar
qua "parse tree" produces only one word -- the input word.
A parse forest relaxes (A) by allowing the start symbol to appear more
than once on the LHS. It appears, in fact, one for each tree in the
forest.
The parse structure incorporates SHARING of nodes, if a non-terminal
may appear more than once on the RHS in the corresponding grammar.
The parse structure has NODE PACKING, if nodes other than the start
symbol may appear more than once on the LHS in the corresponding
grammar.
The parse structure is CYCLIC (i.e. cyclic as a labeled directed
graph) if the corresponding grammar has recursion in it.
Hence, the most general structure (a cyclic packed parse forest with
sharing) is nothing more and nothing less than just a fancy way of
writing a context free grammar, itself!
As related above, a grammar is ambiguous, if |G+[m]| = |G-[m]| > 1 for
any input word m in M. In the most general case, for context-free
grammars, the grammar may also be cyclic, in which case, |G+/-[m]| may
be infinite.
In fact, the subset G+/-[m] will generally be a CONTEXT-FREE subset of
the output monoid N.
Ultimately, this is why the notation for parse structures seems to
naturally converge onto a general notation for context-free grammars,
themselves. A parse structure of the most general type must serve as a
kind of generalization of the heretofore well-known and well-
understood notation of regular expressions which resides at Level 3
(regular grammars) in the Chomsky Hierarchy, to one able to work at
Level 2 (context-free grammars) or higher in the Chomsky Hierarchy.
More generally, any notation suitable for representing parse
structures in the most general case of a context-free grammar or (more
generally still) for arbitrary context-free translators, must be
tantamount to a notation for CONTEXT-FREE EXPRESSIONS.
So now, back to the original issue ...
When the translate set T[m] is represented in terms of context-free
expressions, the parsing problem may indeed reduce to an O(n) process!
However, this does NOT imply that either of the other two problems
(EMPTINESS, ENUMERATION) commonly locked up and confused with the
(PARSING) problem will likewise be O(n)!
In other words, contrary to what you may intuit, parsing is
substantially simpler than either recognition or enumeration.
You still have to enumerate and perform emptiness test on whatever
representation you draw for T[m]. However, what happens here is that
you're disentangling these two issues of recognition (T[w] != {}) vs.
the actual production of T[w] from one another.
This represents the ultimate logical conclusion to the process that
Tomita had started back in the mid 1980's with his packed parse forest
approach to general context-free parsing on an LR basis.
And we will return to the realm of LR parsing -- within the framework
of the algebraic approach.
(3) The Algebraic Approach
(3.1) Preliminaries
You may hear, often, the phrase in parsing theory that "such and such
grammars are equivalent, but the transformation between the two may
not be particularly meaningful". Underlying this is a degree of
vagueness borne of the misconception that runs rampant in the theory
that it's context-free languages, rather than context-free
translations, that parsing theory concerns itself with. What's
actually being referred to is that two grammars that recognize the
same language may have completely different canonical top-down
translations (or bottom-up translations). Transformations performed
in the grammar need not preserve these.
However, the transformations performed on grammars have, underlying
them, algebraic rules borne of the algebra which these various objects
reside in (context-free closed dioids), and the rules and all the
formalism thereof apply generally to context-free subsets of ANY
monoid -- including M x N (i.e. to context-free transductions).
You may not freely transform grammars; but in parsing you're not
working with grammars to begin with, but with translations. These, you
may freely transform, consistently with the rules of the underlying
algebra.
The ALGEBRAIC APPROACH is a general enveloping framework, of a purely
algebraic nature, which provides a common setting for all the various
items seen in formal language and automata theory, which renders the
equivalence between them natural and transparent.
Regular expressions provide the earliest precursor of the algebraic
approach. A regular language may be regarded as being an object in an
underlying Kleene algebra of regular expressions.
More generally, a language (or translation or more general objects
besides) in the algebraic approach is an element of an underlying
algebra.
The algebra contains the monoid operations associated with
concatenation and the identity (empty word), and so comprises a
monoid.
Grammars provide a system of relations between elements of the
algebra. The "->" relation is naturally interpreted as an inequality.
Therefore, the underlying monoid structure is partially ordered.
A grammar is a system of inequalities over a suitable algebra, which
will be delineated further below. The "languages" generated by the
various non-terminals are identified as those comprising the LEAST
SOLUTION to the system so defined.
For context-free grammars, it suffices to work within context-free
dioids; the latter providing the natural framework for CONTEXT-FREE
EXPRESSIONS, just as a Kleene algebra does for regular expressions.
An automaton (or machine or translator) is, likewise, rendered as a
system of inequalities, with each state treated as an expression, and
the transition rules as the inequalities themselves; e.g.
for the transition x: a -> b from state a to state b one has
the inequality a -> x b (for a generator) or a x -> b (for a reader).
For a translator with transition (x,y): a -> b (with input x, output
y), one has an inequality a x -> y b.
(3.2) Algebraic Formalisms
Further, the operations (union, empty set) naturally occur, these
being regarded as the least upper bounds under the ordering
a + b = l.u.b. {a,b}
0 = l.u.b. {}.
Thus, a least upper bound may be defined for any finite set.
The operation is distributive, so we may write:
a (b + c) d = abd + acd; A 0 D = 0
or more generally
l.u.b. (AB) = (l.u.b. A) (l.u.b. B), for finite sets A, B.
Equivalently, this may be cast as an algebra with product and +; and
identities 1 and 0 with the properties:
a(bc) = (ab)c; a1 = a = 1a
a+(b+c) = (a+b)+c; a+0 = a = 0+a
a(b+c)d = abc + acd; a0d = 0
a+b=b+a; a+a=a
In either case, the resulting system is called a DIOID.
Just as X* may be regarded as the free monoid produced from a set X,
the family F(X*) of finite subsets of X* (i.e., finite languages over
X) is the free dioid produced from X. The finite subsets F(M) of a
monoid M is the result of freely extending the monoid M to a dioid.
In more general cases, one may allow for a least upper bound to be
defined infinite sets, e.g.:
(1) A dioid K is a rational dioid if
(a) every rational subset of K has a least upper bound
(b) lub(AB) = lub(A) lub(B) for rational subsets A, B of K.
(2) A dioid K is a QUANTALE if
(a) every subset of K has a least upper bound
(b) lub(AB) = lub(A) lub(B) for all subsets A, B of K.
Other definitions include "context-free dioid", "turing dioid",
"countable dioid" if one replaces "rational subset" above respectively
by "context-free subset", "turing enumerable subset", "countable
subset".
A rational dioid has a Kleene star operation in it:
a* = l.u.b. { 1, a, aa, aaa, ... }
and can be equivalently defined in terms of it by the properties:
(1) a* -> 1 + a* a a*
(2) if x -> ab, acb, accb, acccb, ... then x -> a c* b.
A slight generalization of property (2) gives you a definition of a
KLEENE ALGEBRA
(2') if x -> a + bx + xc then x -> b* a c*.
It is known (and has been since the mid 1990's) that (1), (2') suffice
as additions to the properties of dioids to prove the equality of any
two regular expressions.
Conway had considered equivalents of "rational dioid", "countably
closed dioid" and "quantale" in his 1970's paper which attempted
various formulations of algebras for regular expressions. Kozen had
originally come up with the more general concept of Kleene Algebras
some time later. I arrived at it, independently, in the process of
trying to determine what would be required to find the least solution
to the inequality:
x -> a + bx + xc + xdx
(which is x = b* a c* (d b* a c*)*). The two axioms
(3) 0* = 1
(4) b* a c* (d b* a c*)* is the least solution to
x -> a + bx + xc + xdx.
The term Quantale comes out of Physics and Non-Linear systems in the
1980's. The people involved in these field believed (and still
believe) that they discovered the concept and are generally quite
unaware of the formal language ramifications of these concepts, and
equally unaware that Conway had already come up with it all more than
a decade before.
There is no reference (as of yet) in any literature anywhere to
anything like context-free dioids.
(3.3) The Algebraic Chomsky-Schuetzenberger Theorem and Context-Free
Expressions and Turing Expressions
Having previous pointed out the following characteristics:
free monoid generated from set X = X*
free dioid generated from set X = F(X*)
free dioid generated from monoid M = F(M)
where F(M) denotes the family of finite subsets of M (i.e. the family
of "finite languages"), one can go one further with this:
free rational dioid generated from set X = R(X*)
free rational dioid generated from monoid M = R(M)
where R(M) denotes the rational subsets of the monoid M. The Kleene
algebra R(X*) is one and the same as the algebra of regular
expressions over the alphabet X.
Denoting C(M), T(M), W(M) respectively as the context-free, turing,
countable subsets of M, and P(M) as the power set of M, one has:
C(X*) = free context-free dioid generated by X
T(X*) = free turing dioid generated by X
W(X*) = free countably closed dioid generated by X
P(X*) = free quantale generated by X
and
C(M) = free extension of M to a context-free dioid
T(M) = free extension of M to a turing dioid
W(M) = free extension of M to a countably closed dioid
P(M) = free extension of M to a quantale
The families C(X*), T(X*) are the context-free and turing languages
over alphabet X; and C(X* x Y*), T(X* x Y*), the context-free and
turing translations with input alphabet X and output alphabet Y. The
Kleene algebra R(X* x Y*) is the algebra that represents the actions
of a finite state machine with input alphabet X and output alphabet Y.
The algebra C(X* x Y*) provides the natural setting for parsing
theory.
One might also inquire whether and how one can define or talk about
extensions of the lower-grade algebras to those of the higher grades;
for instance, can a dioid be made rationally closed; or can a rational
dioid be made closed under context-free subsets?
There is a general construction (which, alas, won't be dealt with here
in detail) which, in fact, produces the desired chain of relations
between the various sorts of algebras. The two processes of interest
here are:
C: rational dioid -> context-free dioid.
For instance, for the regular languages R(X*) over X, what this does
is produce the closure under least solutions to context-free grammars,
extending the set R(X*) with the set C(X*) of all context-free
languages. More generally, C will convert R(M) to C(M).
The process
R: context-free dioid -> rational dioid
does nothing other than to eliminate the additional structure
associated with a context-free dioid (i.e., it "forgets" the ability
to take l.u.b.'s of context-free subsets). For C(X*), R transforms
C(X*) into itself, but now with C(X*) regarded merely as a rational
dioid.
The two conversions together:
RC: rational dioid -> closure thereof -> larger rational dioid
then effects the closure of a rational dioid under solutions to
context-free grammars, thus producing a Kleene algebra, cast in the
mould of regular expressions, for the corresponding context-free
expressions.
The earliest precursor to this process is none other than the Chomsky-
Schuetzenberger Theorem formulated in the 1960's. What this theorem
did was to provide a means to denote context-free languages by a kind
of operator algebra.
Unfortunately, it never fully realized its own algebraic underpinning,
and so represents (at best) an incomplete work. That situation can be
rectified here.
The theorem states the following:
Let X be an alphabet, and L a context-free language (L in
C(X*)). Then one may define a regular language R over an alphabet
X' = X union D
D = { u1, ..., un; v1, ..., vn }
which produces L by a process of filtering and erasure.
The filtering is done with respect to the Dyck language D_n(X) over X
defined by the grammar:
D_n -> 1 = empty word
D_n -> ui D_n vi D_n; i = 1, ..., n
Thus, the u's and v's play the role of bracketing symbols.
The regular expression is a general expression which incorporates the
original alphabet X, plus the extra symbols from D. It is filtered by
taking its intersection with D_n(X):
R --> R intersect D_n(X).
Then one performs the erasure:
f: X'* -> X*
f(1) = 1;
f(x w) = x f(w), for x in X; w in X'*
f(d w) = f(w), for d in D; w in X'*
writing
R intersect D_n(X) --> f(R intersect D_n(X)).
The result, if R is chosen right, is the language L.
The theorem asserts that every context-free language L has a
corresponding regular expression R including some number, n, of
bracketing symbols, which when so converted results in L, itself.
In fact, n = 2 will suffice. For n = 1, one obtains a strictly weaker
family of languages by this process which (I believe) will just be the
one-counter-languages, themselves.
The regular expressions developed by the theorem are the 1960's
embryonic form of what would have (and should have) become a general
formalism for context-free expressions. However, since the algebraic
underpinning of this result was never fully recognized as such, the
growth in development was stunted nearly at birth.
A similar theorem exists for turing languages, relating these as well
to regular languages -- hence showing that the process we're going to
describe can be used to arrive at a general "regular expression"
notation for ALL computable languages/translations/etc. Alas, this
won't be done here.
What the filtering operation does is effect the axioms:
ui vi = 1
ui vj = 0 if i, j are distinct.
The actual "= 1" part of the first equation is implemented via the
erasure operation; and the "= 0" part of the last equation via the
intersection operation.
In order to move the operator symbols freely throughout the words,
we'll also need the rule:
dx = dx, for d in D, x in X.
The erasure operator completely removes all the operator symbols, and
this, too, has to be effected in some fashion. A natural way to
implement this is to add in an extra term "p" with the properties:
p vj = 0 = ui p; p p = p.
Let the resulting algebra be called P_n. When combined with the set X,
let the corresponding Kleene algebra be called P_n(X*). For more
general monoids, one may define P_n(M) as the rational dioid that
provides an extension of P_n and of M in which the members of M
commute with those of P_n.
Then, if we regard R as the 1960's primordial expression with the
extra symbols remain inert, and R* as the same expression within the
various u's and v's activated by the algebraic relations above:
p P* p = f(R intersect D_n(X)) p.
The algebra P_n(M) represents the simultaneous incorporation of P_n
and R(M) such that the members of R(M) and of P_n commute with one
another.
More generally, we may define a "tensor product" for each type of
dioid, such that K x L will be the simultaneous extension of K and L
in which
kl = lk for k in K, l in L.
This will generalize the direct product of monoids, so that:
R(M x N) = R(M) x R(N)
C(M x N) = C(M) x C(N)
T(M x N) = T(M) x T(N)
W(M x N) = W(M) x W(N)
P(M x N) = P(M) x P(N)
and it will be transparent with respect to conversions:
C(K x L) = C(K) x C(L), for rational diods K, L
R(K x L) = R(K) x R(L), for context-free diods K, L.
Hence, the translations carried out by a finite-state machine, push-
down transducer, or turing transducer may be represented respectively
as:
R(X*) x R(Y*) -- for finite state machines
C(X*) x C(Y*) -- for push-down transducers
T(X*) x T(Y*) -- for turing transducers.
These algebras provide the natural framework for parsing theory,
giving us a natural notation for "translation expressions".
Thus, returning to the Chomsky-Schuetznberger Theorem, what we have is
that P_n(X) is just P_n x R(X*), so that the theorem, when rendered in
algebraic form, becomes:
C(X*) p = p (P_n x R(X*)) p.
Given the automata-theoretic nature of the proof to the theorem (and
the fact that at no place is the fact that X* is a free monoid
actually used anywhere), one may readily generalize the result to:
C(M) p = p (P_n x R(M)) p
for any monoid.
In terms of the conversion operators, C and R, this becomes:
R C R(M) p = p (P_n x R(M)) p.
What we will assert here, is that the result generalizes to ALL
rational dioids, K:
(R C K) p = p (P_n x K) p; for any n > 1.
this providing an effective implementation of the C operator.
This is the ALGEBRAIC CHOMSKY-SCHUETZENBERGER THEOREM.
The algebra P_n incorporates a notation for 1-stack machines with the
interpretations:
p = test for empty stack
ui = push i
vi = pop and test for equality to i.
Under this interpretation, one also has the identity,
Completeness Property: p + v1 u1 + ... + vn un = 1
which is useful to assert, though not required for the above
development.
All the foregoing can be cleaned up a little, but adding in an extra
set of "bracket" symbols (u0, v0) and writing p = v0 u0. This leads to
the definition of the algebra
C_{n+1}: consisting of u0,...,un,v0,...,vn
with
ui vi = 1; ui vj = 0 if i, j distinct.
The Completeness Property then becomes
v0 u0 + ... + vn un = 1.
Owing to the natural analogy to a notation seen in quantum theory (the
bra-ket notation of Dirac), it is then natural to write these
operators as a kind of labeled brackets:
the ui's get written as bra's: <i|
the vi's get written as ket's: |i>.
[It turns out that the correct analogy is NOT to quantum theory, but
actually to classical physics! One also encounters operators just like
these when discussing what is called the Maxwell-Boltzmann Fock space,
which is the space in classical physics naturally associated to an
assembly of particles.
If H is a Hilbert space with basis {1,2,...,n}, then its Maxwell-
Boltzmann Fock space MB(H) is identical to the space whose basis
consists of all the possible words formed from {1,2,...,n} (i.e. stack
words). The operators can then be defined by <i|: w -> wi; |i>: wi ->
w; |i>: wj -> 0 if i, j distinct; <0|: w -> 1 if w is the empty word;
<0|: w -> 0 otherwise; and |0> = <0| = p. Under this representation,
C_{n+1} = P_n.]
When cast in terms of the algebra C_n, one does not had direct
recourse to bracketing the words between p's. However, one can still
effect the erasure of operators by requiring that the words commute
with all the operators in C_n. Thus, one arrives at an alternate form
of the Algebraic Chomsky-Schuetzenberger Theorem:
For any n > 1:
C(X*) = Z(R(X*) x C_n)
C(M) = Z(R(M) x C_n), for any monoid
C R K = Z(K x C_n), for any rational dioid
where
Z(A) = { a in A: a commutes with all of C_n }.
In contrast to the rendition involving P_n, this provides a direct
definition of the language family itself, without the extra p dangling
around.
(4) LR Parsing and Context-Free Expressions
(4.1) The "Characteristic Automaton"
LR parsing is associated with bottom-up parsing. So, for a given
grammar G, the corresponding LR parser is a transducer for the bottom-
up canonical extension G-.
The parser is an automaton that operates in association with a stack
machine. Thus, there is a natural venue for not only bringing in the
full machinery of context-free expressions, but of even algebraizing
the whole edifice of LR parsing, itself, removing it from the thick
cobwebs of its ossified 1960's framework.
In the process, you'll also see that we could easily go much, much
further than mere LR parsing theory with the more general algebra,
theory and notation for context-free expressions. In particular, we
can go much further and directly process grammars whose rules allow
general regular expressions and even context-free expressions (!) on
the right-hand side.
Given a grammar G = (V, S, P) over a monoid M = X*, the LR parser is
defined in terms of the characteristic finite automaton, which is
itself given by the following indexed system:
(1) S -> S[z], where S is the start symbol
(2) N[X] -> N X, for each non-terminal N in V
(3) N[X] -> w(N->w), for each rule (N->w) in P, where w is in M
(4) N[X] -> w N'[z(N->wN'z)], for each rule (N->wN'z) in P,
where w is in M, N' another non-terminal and z in M[V].
The "terminals" in this system consist of the rules, of z, of the
terminals and non-terminals of the original system. These are
interpreted, respectively, as the "reduce", "accept", "shift" and
"goto" actions.
The indexed-grammar, itself, is infinite, but only a finite number of
its non-terminals are reached from S[z]. When converted to a minimal
deterministic automaton, the result is the LR parser, itself.
* The CLOSURE operation removes all the lambda transitions from the
above.
* The ITEM SET operation does the conversion to a minimal DFA. In
fact, the "item set" operation generalizes to one that converts NFA's
in general to minimal DFA's, so is not something that specifically
pertains to LR parsing at all.
Example grammar:
M = X*; X = {a,b,c,d,e,f}
S -> T; S -> S a T
T -> U; T -> T b U
U -> c U; U -> d; U -> e S f
associate these rules, respectively, with outputs s, t, u, v, w, x and
y so that we can write the canonical bottom-up translator as:
N = Y*; Y = {s,t,u,v,w,x,y,z}
start -> S z
S -> T s; S -> S a T t
T -> U u; T -> T b U v
U -> c U w; U -> d x; U -> e S f y
The characteristic automaton is then given by
start -> S[z]
S[X] -> S X; S[X] -> T[s]; S[X] -> S[a T[t]]
T[X] -> T X; T[X] -> U[u]; T[X] -> T[b U[v]]
U[X] -> U X; U[X] -> c U[w]; U[X] -> d x; U[X] -> e S[f y].
(4.2) Closure and "Item Sets" in Algebraic Form
Eliminating the lambda transitions, algebraically, consists of
substituting and eliminating all left-recursions. In a rule like
N -> N | ...
then N can be eliminated since part of the prescription underlying the
algebraic approach is that a grammar is a system of inequalities. The
inequality N -> N is redundant and therefore can be added or removed
without changing the system's solution set.
Likewise, rules involving the same non-terminal on the LHS may be
combined, by addition, on the right, since a+b means l.u.b. {a,b}:
start -> S[z]
S[X] -> S X + T[s] + S[a T[t]]
T[X] -> T X + U[u] + T[b U[v]]
U[X] -> U X + c U[w] + d x + e S[f y].
After eliminating the lambda transitions we get:
start -> S[z]
S[X] -> S (X + a T[t]) + T (s + b U[v]) + U u + d x + e S[f y]
T[X] -> T (X + b U[v]) + U u + d x + e S[fy]
U[X] -> U X + c U[w] + d x + e S[fy]
This is the closure operation commonly seen in LR parser construction,
done in a purely algebraic fashion.
The terms generated from S[z] can be enumerated by simply tracing
through the grammar (judiciously numbered):
#0 = S[z] -> S #5 + T #7 + U #9 + c #4 + d #A + e #1
#5 = z + a T[t] -> z + a #2
#7 = s + b U[v] -> s + b #3
#9 = u -> u
#A = x -> x
#1 = S[f y] -> S #6 + T #7 + U #9 + c #4 + d #A + e #1
#2 = T[t] -> T #8 + U #9 + c #4 + d #A + e #1
#3 = U[v] -> U #B + c #4 + d #A + e #1
#6 = f y + a T[t] -> f #C + a #2
#8 = t + b U[v] -> t + b #3
#B = v -> v
#4 = U[w] -> U #D + c #4 + d #A + e #1
#C = y -> y
#D = w -> w
The shifts are as follows
a: #5,#6 -> #2
b: #7,#8 -> #3
c: #0,#1,#2,#3,#4 -> #4
d: #0,#1,#2,#3,#4 -> #A
e: #0,#1,#2,#3,#4 -> #1
f: #6 -> #C
The gotos are as follows:
S: #0 -> #5; #1 -> #6
T: #0,#1 -> #7; #2 -> #8
U: #0,#1,#2 -> #9; #3 -> #B; #4 -> #D
The reduce actions (s-y) and accept (z) are as follows:
s in #7; t in #8; u in #9; v in #B
w in #D; x in #A; y in #C; z in #5
As you're about to see, the reduce actions and accept are IRRELEVANT.
We don't need any of that information! It's already implied by the
shift's and goto's.
(4.3) The Algebraic Form of the Parser & Elimination of Reduce Actions
This, in fact, applies within the more general framework of context-
free grammars where the right-hand sides may be regular expressions;
whereas the usual framework for LR parsing does not readily extend.
Indeed, this is where the algebraic nature behind the notation would
become of critical importance, you simply won't be able to get by
without it (particularly the Kleene star operator used in conjunction
with the operators).
The automaton, by itself, is only an incomplete specification since it
must work in tandem with a stack machine. However, since the formalism
was developed long before any of the foregoing material ever existed,
the cobwebs developed and ossification set it around the particulars
of a given implementation with no ability to raise the rest of the
machinery up to the algebraic or computation level in any direct or
simple fashion.
The underlying stack machine starts out by pushing state #0. Upon each
shift, it checks against the top of the stack to determine if the
shift can be applied and (if so), pushes the state associated with the
transition. For symbol "a" above, if the top of the stack before the
shift had a #5 or #6 on it, then afterwards, a #2 would be sitting on
top of it.
A reduce action involves removing, in reverse order, the last N states
from the stack, if the corresponding rule had a word of length N on
its RHS; and then performing the goto associated with the non-terminal
of the LHS of the rule. The accept action pops the last stack, and
removes the #0 which must be directly underneath it and halts the
parser.
Within the algebra C_n, a push is <i|, a pop |i>, and we will extend
this to the following notations:
<i1 ... ik| = <i1| ... <ik|
|i1 ... ik> = |ik> ... |i1>
[z] = |z><z|, where z = i1,...,ik
[z1,...,zn] = [z1] + ... + [zn]
The comma notation in the last item may be embedded, e.g.
[0(1,2),1,2] = [01] + [02] + [1] + [2], etc.
The shifts and goto's are represented by the following definitions:
<a| = [5,6]<2|
<b| = [7,8]<3|
<c| = [0,1,2,3,4]<4|
<d| = [0,1,2,3,4]<A|
<e| = [0,1,2,3,4]<1|
<f| = [6]<C|
<S| = [0]<5| + [1]<6|
<T| = [0,1]<7| + [2]<8|
<U| = [0,1,2]<9| + [3]<B| + [4]<D|
The generalization of this to other LR parsers should be clear from
this, so it won't be elaborated on any further.
In the following, define the ADJOINT operator as the operator which
reverses produces and the direction of the operator symbols. This
operator has the properties:
<i|' = |i>; |i>' = <i|
(xy)' = y' x'
(x+y)' = x' + y'
(x*)' = (x')*
and so on. Note, in particular, that [z]' = [z].
For each terminal or non-terminal <x|, define |x> = <x|'.
The reduce actions (s-y) and accept (z) are read directly off the
original rules:
s: S -> T ======> <s> = |T><S| = |7><S|
t: S -> S a T ==> <t> = |T>|a>|S><S| = |2 8>|S><S|
u: T -> U ======> <u> = |U><T| = |9><T|
v: T -> T b U ==> <v> = |U>|b>|T><T| = |3 B>|T><T|
w: U -> c U ====> <w> = |U>|c><U| = |4 D><U|
x: U -> d ======> <x> = |d><U| = |A><U|
y: U -> e S f ==> <y> = |f>|S>|e><U| = |1 6 C><U|
z: accept S ====> <z> = |S>|0> = |0 5>
Again, this readily generalizes to other LR parsers, so it won't be
elaborated on in detail any further here.
The context-free expression corresponding to the parser may thus be
directly written out as:
<0| (Sh + Re)* <z>
Sh = a<a| + b<b| + c<c| + d<d| + e<e| + f<f|
Re = s<s> + t<t> + u<u> + v<v> + w<w> + x<x> + y<y>
The general form of the context-free expression will have:
Sh = (sum x<x|: x in X)
Re = (sum r<r>: r in P).
The corresponding automaton is written as the following system:
S -> <0| A
A -> a [5,6]<2| A
A -> b [7,8]<3| A
A -> c [0,1,2,3,4]<4| A
A -> d [0,1,2,3,4]<A| A
A -> e [0,1,2,3,4]<1| A
A -> f [6]<C| A
A -> s |7><S| A
A -> t |2 8>|S><S| A
A -> u |9><T| A
A -> v |3 B>|T><T| A
A -> w |4 D><U| A
A -> x |A><U| A
A -> y |1 6 C><U| A
A -> z |0 5>
Because the LR parsing framework is lacking its algebraic foundation,
this is as far as it can go. Anything more has to be posed as an
extension (e.g. optimization rules).
(4.4) Algebraic Manipulation of the Parser
But now this is just a system of inequalities, and so can be
manipulated directly. Define S_F = <S| A; T_F = <T| A; U_F = <U| A.
Also, define
V = (c <4| + d <A| + e <1|) A
Then we may write:
S -> <0| A
A -> a [5,6]<2| A
A -> b [7,8]<3| A
A -> c [0,1,2,3,4]<4| A
A -> d [0,1,2,3,4]<A| A
A -> e [0,1,2,3,4]<1| A
A -> f [6]<C| A
A -> s |7> S_F
A -> t |2 8>|S> S_F
A -> u |9> T_F
A -> v |3 B>|T> T_F
A -> w |4 D> U_F
A -> x |A> U_F
A -> y |1 6 C> U_F
A -> z |0 5>
Thus
S -> <0| A -> <0| (c <4| + d <A| + e <1|) A = <0| V
S_F = <S| A = [0]<5| A + [1]<6| A
-> a ([0]<5 2| + [1]<6 2|) A + z |0> + f [1]<6 C| A
T_F = <T| A = [0,1]<7| A + [2]<8| A
-> b ([0,1]<7 3| + [2]<8 3|) A + s [0,1] S_F + t |2>|S>
S_F
U_F = <U| A = [0,1,2]<9| A + [3]<B| A + [4]<D| A
= u [0,1,2] T_F + v |3>|T> T_F + w |4> U_F
Also,
<4| A -> <4| (c <4| A + d <A| A + e <1| A) = <4| V
<1| A -> <1| (c <4| A + d <A| A + e <1| A) = <1| V
<A| A -> x U_F
hence
V = c <4| A + d <A| A + e <1| A
-> c <4| V + d x U_F + e <1| V
Since <2| A = <2| V, <3| A = <3| V, in a similar way that the
reductions for <1| A and <4| A were arrived at, we find that
S_F -> a ([0]<5 2| + [1]<6 2|) V + z |0> + f [1]<6 C| A
T_F -> b ([0,1]<7 3| + [2]<8 3|) V + s [0,1] S_F + t |2>|S>
S_F
Finally, from
[1]<6 C| A -> y |1> U_F,
we get the closed system:
S -> <0| V
V -> c <4| V + d x U_F + e <1| V
U_F -> u [0,1,2] T_F + v |3>|T> T_F + w |4> U_F
T_F -> b ([0,1]<7 3| + [2]<8 3|) V + s [0,1] S_F + t |2>|S>
S_F
S_F -> a ([0]<5 2| + [1]<6 2|) V + z |0> + f y |1> U_F
or, after substituting in the goto expressions |S>, |T>:
S -> <0| V
V -> c <4| V + d x U_F + e <1| V
U_F -> u [0,1,2] T_F + v (|7 3>[0,1] + |8 3>[2]) T_F + w |4> U_F
T_F -> b ([0,1]<7 3| + [2]<8 3|) V + s [0,1] S_F + t (|5 2>[0] + |
6 2>[1]) S_F
S_F -> a ([0]<5 2| + [1]<6 2|) V + z |0> + f y |1> U_F
Since the operators for 5 2, 6 2, 7 3, 8 3 always occur together, they
can be eliminated in favor (say) of just 5, 6, 7, 8. In this case, [2]
becomes [5,6]. Thus, the simplification,
S -> <0| V
V -> c <4| V + d x U_F + e <1| V
U_F -> u [0,1,5,6] T_F + v (|7>[0,1] + |8>[5,6]) T_F + w |4> U_F
T_F -> b ([0,1]<7| + [5,6]<8|) V + s [0,1] S_F + t (|5>[0] + |
6>[1]) S_F
S_F -> a ([0]<5| + [1]<6|) V + z |0> + f y |1> U_F
Since [0,1,5,6] T_F = T_F, and [0,1] S_F = S_F, under this system, we
may further reduce this to:
S -> <0| V
V -> c <4| V + d x U_F + e <1| V
U_F -> u T_F + v (|7>[0,1] + |8>[5,6]) T_F + w |4> U_F
T_F -> b ([0,1]<7| + [5,6]<8|) V + s S_F + t (|5>[0] + |6>[1])
S_F
S_F -> a ([0]<5| + [1]<6|) V + z |0> + f y |1> U_F
Further, in this context, the distinction between 5 and 6; and between
7 and 8 is no longer relevant, so they can be collapsed into (say) 5
and 7, respectively:
S -> <0| V
V -> c <4| V + d x U_F + e <1| V
U_F -> u T_F + v |7>[0,1,5,6] T_F + w |4> U_F
T_F -> b [0,1,5,6]<7| V + s S_F + t |5>[0,1] S_F
S_F -> a [0,1]<5| V + z |0> + f y |1> U_F
and, again, after eliminating the redundant element,
S -> <0| V
V -> c <4| V + d x U_F + e <1| V
U_F -> u T_F + v |7> T_F + w |4> U_F
T_F -> b [0,1,5,6]<7| V + s S_F + t |5> S_F
S_F -> a [0,1]<5| V + z |0> + f y |1> U_F.
(4.5) Quasi-Deterministic Form and O(n) Parsing
Since the O(n) proposition is still conjectural, this part of the
development cannot yet be readily generalized beyond this or other
examples.
In quasi-deterministic form, the parser will has on the right-hand
side a sum of terms, each starting with a unique input symbol. This
entails introducing more general expressions on the right which may
make use of the Kleene star operator. The rules involving U_F and T_F
may thus be reduced to the form:
U_F -> (w |4>)* (u + v |7>) T_F
T_F -> b [0,1,5,6]<7| V + (s + t |5>) S_F
After substituting and pulling any input symbols out to the front, the
rule for T_F becomes:
T_F -> b [0,1,5,6]<7| V + a (s + t |5>)[0,1]<5| V
+ (s + t |5>)z |0> + f (s + t |5>) y |1> U_F.
Along with the rules for S and S_F
S -> <0| V
S_F -> a [0,1]<5| V + z |0> + f y |1> S_F
this renders the automaton in quasi-deterministic form, with the
elimination of U_F:
S -> <0| V
V -> c <4| V + d x (w |4>)* (u + v |7>) T_F + e <1| V
S_F -> a [0,1]<5| V + z |0> + f y |1> (w |4>)* (u + v |7>) T_F
T_F -> b [0,1,5,6]<7| V + a (s + t |5>)[0,1]<5| V + (s + t |5>)z |
0> + f (s + t |5>) y |1> (w |4>)* (u + v |7>) T_F
Thus, for instance, the parse for an input word, such as cedf, can be
directly read off:
<0|
<4|
<1|
x (w |4>)* (u + v |7>)
(s + t |5>) y |1> (w |4>)* (u + v |7>)
(s + t |5>) z |0>
which reduces to the word xusywusz, which (in turn) encodes the bottom-
up parse sequence:
x: reduce U1 <= d
u: reduce T1 <= U1
s: reduce S1 <= T1
y: reduce U3 <= e S1 f
w: reduce U4 <= c U3
u: reduce T2 <= U4
s: reduce S2 <= T2
z: accept S2
(5) An Example Involving Cyclic Grammars
Here, we'll look at the grammar
S -> 1; S -> SS; S -> a
whose canonical bottom-up translation is
start -> S z
S -> w + SS x + a y
it's characteristic automaton
start -> S[z]
S[X] -> S X + w + S[S[x]] + a y
or, after eliminating lambda transitions,
start -> S[z]
S[X] -> S (X + S[x]) + w + a y.
The minimal state set is:
#0 = S[z] ------> S #1 + w + a #3
#1 = z + S[x] --> z + S #2 + w + a #3
#3 = y ---------> y
#2 = x + S[x] --> x + S #2 + w + a #3.
The shift's and goto's are:
<a| = [0,1,2]<3|
<S| = [0]<1| + [1,2]<2|
and reduces and accept action:
<w> = <S|
<x> = |S>|S><S| = |2>|S><S|
<y> = |a><S| = |3><S|
<z> = |S>|0> = |0 1>
The resulting translator is:
S -> <0| A
A -> a [0,1,2]<3| A
A -> w <S| A
A -> x |2>|S><S| A
A -> y |3><S| A
A -> z |0 1>
or defining S_F = <S| A,
S -> <0| A
A -> a [0,1,2]<3| A
A -> w S_F
A -> x |2>|S> S_F = x (|1 2>[0] + |2 2>[1,2]) S_F
A -> y |3> S_F
A -> z |0 1>
S_F = <S| A -> [0]<1| A + [1,2]<2| A
Substituting for <3| A and eliminating the 5th rule yields for the 2nd
rule,
A -> a [0,1,2]<3| A -> a y [0,1,2] S_F -> a y S_F
which combines with the 3rd rule to yield:
A -> (w + ay) S_F,
resulting in the system
S -> <0| A
A -> (w + ay) S_F
A -> x (|1 2>[0] + |2 2>[1,2]) S_F
A -> z |0 1>
S_F = [0]<1| A + [1,2]<2| A
-> z |0> + x (|1>[0] + |2>[1,2]) S_F + (w + ay)([0]<1| +
[1,2]<2|) S_F
which, after substituting for S results in the reduced system:
S -> (w + ay) <0| S_F
S_F -> z |0> + x (|1>[0] + |2>[1,2]) S_F + (w + ay)([0]<1| +
[1,2]<2|) S_F.
The operators [0]<1| + [1,2]<2| and |1>[0] + |2>[1,2], along with <0|
and |0>, produce an algebra isomorphic to C_2, so they can be replaced
by <1| and |1> respectively, resulting in the system,
S -> (w + ay) <0| S_F
S_F -> z |0> + x |1> S_F + (w + ay)<1| S_F.
In quasi-deterministic form, this becomes
S -> w <0| (w<1| + x|1>)*z|0>
S -> a w <0| (w<1| + x|1>)*y<1| S_F
S -> a y <0| S_F
S_F -> (w<1| + x|1>)*z|0> + a (w<1| + x|1>)*y<1| S_F.
The expression U = (w<1| + x|1>)* is a bona-fide context-free
expression and cannot be represented by any finite sum of words or
even a regular expression. Defining the Dyck language D_2 by the
grammar
D_2 -> 1; D_2 -> w D_2 x D_2
the expression may be reduced to a normal form:
U = (w<1| + x|1>)* = (D_2 x |1>)* D_2 (w <1| D_2)*.
The system above, expressed in terms of U, becomes
S -> w <0|Uz|0>
S -> a <0| [wU<1|] y S_F
S_F -> Uz|0> + a Uy<1| S_F.
The word a^n has an infinite number of parse trees. The parses forms a
context-free set when linearized in terms of the canonical bottom-up
translator. They may be written as:
n = 0: w <0|U|0> z = w D_2 z
n > 0: w <0|[wU<1|] (Uy<1|)^{n-1} U|0> z
which is linear in the length of the input word.
The equivalent grammar for this context-free expression will be O(n^3).
.
- Follow-Ups:
- Re: O(n) Parsing For General Context-Free Grammars & Transductions
- From: David Wagner
- Re: O(n) Parsing For General Context-Free Grammars & Transductions
- Prev by Date: Re: division by 7efficiently
- Next by Date: Re: division by 7efficiently
- Previous by thread: division by 7efficiently
- Next by thread: Re: O(n) Parsing For General Context-Free Grammars & Transductions
- Index(es):
Relevant Pages
|