CYK & Context-Free Expressions
- From: Rock Brentwood <markwh04@xxxxxxxxx>
- Date: Fri, 25 Apr 2008 15:21:03 -0700 (PDT)
On Mar 27, 5:28 pm, Tegiri Nenashi <TegiriNena...@xxxxxxxxx> wrote:
Next, the CYK algorithm is essentuially a transitive
closure of a matrix. Given that matrix methods enjoy a lot of coverage
in Mr. Hopkins theory, perhaps there is a connection?
CYK and Earley are, respectively, the dynamic programming algorithms
for the so-called "bottom up parsing" and "top down parsing" -- which,
as per the terminology of the site you were referring to -- as the
special instances of the general translation problem for the canonical
bottom-up and top-down extensions of a grammar.
For an unambiguous grammar (that is: a grammar whose canonical
extensions are unambiguous), the output in both cases is, in effect,
itself a grammar; i.e. a grammar that recognizes the single word that
corresponds to the top-down or bottom-up construction sequence.
Extensions to ambiguous grammars generalize the output format so that
it can embody multiple sequences (i.e. multiple trees). Invariably,
this involves what's referred to as "node packing" and (for ambiguous
grammars) "cyclicity". The resulting format is equivalent to a context-
free grammar, itself. This corresponds to the fact that in the most
general case, the output upon translation of a single word is an
infinite set that, itself, is a context-free language. That is, the
(generally infinite) set of parses of a given word is a context-free
language, in its own right.
The generalization of CYK to ambiguous grammars, based on general LR
parsing (GLR), was described a long while back in my article on the
(generalized) Tomita algorithm -- which you may still be able to find
in the comp.compilers archive. The algebraic translation of GLR is
described more fully in the supplementary article
"Algebraic LR Parsing"
under the link http://federation.g3z.com/CompSci/index.htm#Algebraic
I've also included various supplements under
http://federation.g3z.com/CompSci/index.htm#Algebraic4
that provide more fully descriptions of everything (particularly the
algebraization of the LR method; hence also the encapsulation of CYK),
along with illustrations.
LR-based methods (or anything else that gives you grammars (or parse
forests) as outputs) have you O(n^3) complexity for (the canonical
extensions of) Chomsky normal form grammars.
A more direct representation in terms of context-free expressions of
the output set T[w] for a word w and translation T is easy to come up
with that is O(n) in the length of the word. This, of course, includes
the cases where T is the canonical top-down and bottom-up extensions
of a grammar, as special cases.
If T is a context-free subset of the monoid X* x Y*, where X and Y are
finite, then the context-free expression for T is simply
T = <0| (X' + Y' + H')* |0>
where
X' = sum_{x in X} (x <x|)
Y' = sum_{y in Y} (y <y|)
H' = sum_{q -> v1...vk} (|vk>...|v1><q|)
over the bra-ket algebra C_{x+y+q} where x = |X|, y = |Y| and q = |Q|,
Q being the non-terminals. This reduces to
T = <0| N (X' N)* |0>
where
N = (Y' + H')*.
For a given word w = x1 x2 ... xn, the corresponding parse structure
is simply the context-free expression
T[w] = <0| N <x1| N <x2| ... N <xn| N |0>.
The recognition problem is obtained directly from this under by monoid
homomorphism h: (Y union Q)* -> Q* that maps h: Y -> {1}. The
expression h(T[w]) becomes a pure bra-ket expression that reduces to 0
or 1. It reduces to 1 if and only if T[w] is not empty. The actual
process of reducing the bra-ket expression is closely linked to the
shortest paths algorithm and Boolean matrix multiplication. As best
known at present, this is O(n^log 7) in complexity.
As you can see by the various refinements (and the mere need to have
such refinements), the current state of the art in parsing theory is
still at a primordal state rife with imprecision of teminology and
confusion, notwithstanding the passage of 30+ years since the time of
Knuth (also as exemplified by the person originally responding to
you). Much of what went into the supplements was, in fact, to clear up
this confusion and lay out the more precise terminology.
For instance, the misconception is still quite prevalent that parsing
is something that applies in reference to context-free grammars. It
does not. It applies to simple syntax directed translations (SSDT's)
and can only be applied to grammars, themselves, via an extension of
the grammar (e.g., the bottom-up, top-down canonical extensions
referred to above) to a SSDT. It is a somewhat specialized instance of
the translation problem; that of determining the set T[A] = { n in N:
(m,n) in T for some m in A } of translations for a subset A of a
monoid M to a monoid N.
So, one sees (even in the literature, such as it exists), a failure to
distinguish between CFGs and SSDT's; the notion that algebraic methods
can't be "applied" because "the result may destroy equivalency" (in
fact, the "equivalency" it destroys is where algebraic methods are
applied to a CFG since the canonical extensions are not preserved by
such transformations; but algebraic methods can certainly be applied
to the SSDT's that the CFG's represent); one sees a failure to
distinguish between the various aspects of what are (mistakenly)
lumped under the single header of "parsing" (that is: enumeration,
choice and emptiness test/recognition); etc.
At present, I don't know how much this generalizes to non-free
monoids. The key problem at hand is to determine whether property
"A8" (transductions) holds for the "context-free" monad, C. That is:
Let M, N be monoids and T a context-free subset of M x N (that is, T
in C(M x N)):
A8_w: Weak transduction: for w in M, T[w] is in C(N).
A8_s: Transduction: for a context-free subset L in M (i.e., L in
C(M)), T'L is in C(N),
where
T'L = union { T[w]: w in L } = { n in N: (m, n) in T for some m in
L }.
Much of the ability to push everything over into a purely algebraic
context, removed entirely from free monoids, hinges on whether this
property is true or not. I suspect it's not, unless either M or N is
restricted to a free monoid: either M = X* or N = Y*.
.
- Follow-Ups:
- Re: CYK & Context-Free Expressions
- From: Tegiri Nenashi
- Re: CYK & Context-Free Expressions
- From: Chris F Clark
- Re: CYK & Context-Free Expressions
- Prev by Date: Re: Mark W. Hopkins theory perspective on parser engine technology?
- Next by Date: Re: Is there a name and theory for "arrangement/packing" problems?
- Previous by thread: Re: Mark W. Hopkins theory perspective on parser engine technology?
- Next by thread: Re: CYK & Context-Free Expressions
- Index(es):
Relevant Pages
|