Parsing and other fun things involving (no) DCGs




Some time ago, a task like the following was posted here:

Transform an arithmetic expression given as a list of numbers,
atoms, brackets and operators "+", "-", "*", "/" and "^" (power)
to an abstract syntax tree. Priority of operations must be
respected: ^ is top priority, * and / middle, + and - low.

Examples:
[1,+,x,+,2] should give add(add(num(1),var(x)), num(2))
[1,/,'(',1,+,2,^,x,')',+,x,*,y] should give add(div(num(1),
add(num(1),pow(num(2), var(x)))), mul(var(x),var(y)))

DCGs are suitable for parsing such lists (of tokens):

expression_ast(E, A) :- phrase(expression(A), E).

expression(E) --> term(T), expression_r(T, E).
expression_r(E0, E) --> [+], term(T), expression_r(add(E0,T), E).
expression_r(E0, E) --> [-], term(T), expression_r(sub(E0,T), E).
expression_r(E, E) --> [].

term(T) --> power(P), term_r(P, T).
term_r(T0, T) --> [*], power(P), term_r(mul(T0,P), T).
term_r(T0, T) --> [/], power(P), term_r(div(T0,P), T).
term_r(T, T) --> [].

power(P) --> factor(F), power_r(F, P).
power_r(P0, pow(P0, P)) --> [^], factor(P1), power_r(P1, P).
power_r(P, P) --> [].

factor(num(N)) --> [N], { number(N) }.
factor(var(A)) --> [A], { atom(A) }.
factor(E) --> ['('], expression(E), [')'].

Notice the structure of one group of clauses per precedence
level. Each level parses as far as it can, possibly involving several
subterms with operators of higher precedence. Examples:

%?- expression_ast([1,+,x,+,2], AST).
%@% Ast = add(add(num(1), var(x)), num(2)) ;
%@% No

%?- expression_ast([1,/,'(',1,+,2,^,x,')',+,x,*,y], AST).
%@% AST = add(div(num(1), add(num(1), pow(num(2), var(x)))),
%@% mul(var(x), var(y))) ;
%@% No

Notice also that + associates to the left, and ^ to the right:

%?- expression_ast([0,^,0,^,0], AST).
%@% AST = pow(num(0), pow(num(0), num(0))) ;
%@% No

<-><-><-><-><-><-><-><-><-><->

Instead of arithmetic expressions, let us now consider *regular*
expressions. We restrict ourselves to a simple subset of POSIX regular
expressions for this example. Using DCGs to build syntax trees from
the common string notation for regular expressions, we obtain:

regexp_ast(Rs, T) :- regexp(T, Rs, []).

regexp(R) --> regexp_conc(R0), regexp_r(R0, R).
regexp(empty) --> [].
regexp_r(R0, R) --> [0'|], regexp_conc(R1), regexp_r(or(R0, R1), R).
regexp_r(R, R) --> [].

regexp_conc(R) --> regexp_term(R0), regexp_conc_r(R0, R).
regexp_conc_r(R0, R) --> regexp_term(R1),
regexp_conc_r(concat(R0,R1), R).
regexp_conc_r(R, R) --> [].

regexp_term(R) --> regexp_factor(R0), regexp_term_r(R0, R).
regexp_term_r(R0, R) --> [0'*], regexp_term_r(times(R0), R).
regexp_term_r(R0, R) --> [0'?], regexp_term_r(optional(R0), R).
regexp_term_r(R0, R) --> [0'+], regexp_term_r(plus(R0), R).
regexp_term_r(R, R) --> [].

regexp_factor(literal(L)) --> [0'\\], [L0], {atom_codes(L, [L0])}.
regexp_factor(literal(L)) -->
[L0],
{ literal(L0), atom_codes(L, [L0]) }.
regexp_factor(R) --> [0'(], regexp(R), [0')].
regexp_factor(LR) --> [0'[], literals_and_ranges(LR), [0']].

digit(D) :- between(0'0, 0'9, D).
lower(L) :- between(0'a, 0'z, L).
upper(U) :- between(0'A, 0'Z, U).

literal(L) :- digit(L) ; lower(L) ; upper(L).

literals_and_ranges([literal(L)|LRs]) -->
[L], { literal(L) }, literals_and_ranges(LRs).
literals_and_ranges([range(L1,L2)|LRs]) -->
[L1], { literal(L1) }, [0'-], [L2], { literal(L2) },
literals_and_ranges(LRs).
literals_and_ranges([]) --> [].


This grammar is a bit different: There's now only one (explicit)
binary operator, "|". Concatenation is implicit and has higher
precedence. Also, there are *postfix* operators now. Examples:


%?- regexp_ast("ab|c", AST).
%@% AST = or(concat(literal(a), literal(b)), literal(c));
%@% No

%?- regexp_ast("a*b\\|c", AST).
%@% AST = concat(concat(concat(times(literal(a)), literal(b)),
%@% literal(('|'))), literal(c)) ;
%@% No


Let us now perform *matching* using such regular expressions. That is,
we want to answer questions of the form: "Given a string S and a
regular expression E, is any substring of S an element of the language
generated by E?". As we are again describing lists (of Unicode
codepoints), we can again use DCGs:

regexp_match(R, S) :-
regexp_ast(R, T),
append(_, M, S),
regexp_match(T, M, _).

regexp_match(empty) --> [].
regexp_match(literal(L0)) --> [L], { atom_codes(L0, [L]) }.
regexp_match(concat(A,B)) --> regexp_match(A), regexp_match(B).
regexp_match(optional(O)) --> [] ; regexp_match(O).
regexp_match(times(_)) --> [].
regexp_match(times(T)) --> regexp_match(T), regexp_match(times(T)).
regexp_match(plus(P)) --> regexp_match(P).
regexp_match(plus(P)) --> regexp_match(P), regexp_match(plus(P)).
regexp_match(or(A,B)) --> regexp_match(A) ; regexp_match(B).
regexp_match([R|Rs]) -->
[L], { range_or_literal_match([R|Rs], L) }.

range_or_literal_match(RLs, L) :- memberchk(literal(L), RLs).
range_or_literal_match(RLs, L) :-
member(range(A,B), RLs),
between(A, B, L).

Examples:

%?- regexp_match("(a|b)+", "zaaaaaaaaaaaaaabaaa").
%@% Yes

%?- regexp_match("a*b\\|c", "xyb|cz").
%@% Yes

%?- regexp_match("a*b\\|c", "xybcz").
%@% No


<-><-><-><-><-><-><-><-><-><->

Consider now a different task:

Combining nine 9s with any number of the operators +,-,*,/,(,),
what is the smallest positive integer that cannot be expressed?
Hints:

1)The answer isn't zero. You can express zero like this:
(9 - 9) * (9 + 9 + 9 + 9 + 9 + 9 + 9)
Also, zero isn't a positive integer.

2) The answer isn't one. You can express one like this:
9 - (9 * 9 - 9)/9 + 9 - 9 + 9 - 9

3) It's not a trick question.

4) Be sure to handle parentheses correctly.

Notes:

* You cannot exponentiate.
* You cannot concatenate (for example, put two 9s together to
make 99).
* The - operator can be used in either its binary or unary form.
* Assume base 10.

4) is a trick hint. Forget parentheses altogether! We *won't* generate
symbolic syntax trees of expressions this time, let alone lists of
concrete tokens. Instead, we will non-deterministically build lists of
9s and operators, and interpret them as instructions for a reverse
Polish notation calculator. As we are describing lists again, we could
again use DCGs. In this case however, the list of operations can be
built quite nicely by instantiating a single variable more and more:

make_and_eval_expr(N, Es, Val) :-
length(Nines, N),
Nines = [_|Ops],
make_and_eval_expr(Nines, Ops, [], [Val], Es).

make_and_eval_expr([], [], S, S, []).
make_and_eval_expr([_|Nines], Ops, S0, S, [9|Rest]) :-
make_and_eval_expr(Nines, Ops, [9|S0], S, Rest).
make_and_eval_expr(Nines, [_|Ops], [B,A|S0], S, [Op|Rest]) :-
op_a_b_r(Op, A, B, R),
make_and_eval_expr(Nines, Ops, [R|S0], S, Rest).

op_a_b_r(+, A, B, R) :- R is A + B.
op_a_b_r(-, A, B, R) :- R is A - B.
op_a_b_r(*, A, B, R) :- R is A * B.
op_a_b_r(/, A, B, R) :- B =\= 0, R is A rdiv B.

Try:

%?- between(2,inf,N), \+((make_and_eval_expr(9,_,N0),abs(N0)=:=N)).

to obtain the answer in ~15 minutes with SWI Prolog and a 1.6GHz CPU.
Faster versions are possible; good luck!

All the best,
Markus
.



Relevant Pages

  • Re: EE Student, Edit, Proposal Masters, Help (concepts of functional programming, symbolic programmi
    ... > "No, expressions are not lists. ... This is a specialty of Matlab lists. ... > represent such a data structure is with an array of pointers and when ... This is referred to as lazy evaluation. ...
    (sci.math.symbolic)
  • Re: How to get the terms of a compund term (predicate: simplfy)?
    ... > to show the real format of one expressions, ... expression into two lists and then recombining the lists ... if bla bla Head is something bla bla ... ... that Head and Tail variables is not instantiated and ...
    (comp.lang.prolog)
  • Re: comments on my design of a new language?
    ... Trees, Lists, Regular ... expressions back to library status to keep things as clean as possible. ... arrays, and functions. ...
    (comp.lang.misc)
  • Re: CVF 6.6C bug?
    ... > around expressions using defined operators. ... the standard ... there was a special rule for I/O lists. ... but extra parens in a list. ...
    (comp.lang.fortran)
  • Re: Question about Sun JAVAC
    ... I discuss this problem in my book "Design Your Own .Net Language and ... My book contains a discussion of regular expressions which attempts to ... I was last spring engaged in work which involved frequently searching ...
    (comp.programming)