Parsing and other fun things involving (no) DCGs
- From: Markus Triska <triska@xxxxxxxx>
- Date: Fri, 16 Mar 2007 21:09:56 +0100
Some time ago, a task like the following was posted here:
Transform an arithmetic expression given as a list of numbers,
atoms, brackets and operators "+", "-", "*", "/" and "^" (power)
to an abstract syntax tree. Priority of operations must be
respected: ^ is top priority, * and / middle, + and - low.
Examples:
[1,+,x,+,2] should give add(add(num(1),var(x)), num(2))
[1,/,'(',1,+,2,^,x,')',+,x,*,y] should give add(div(num(1),
add(num(1),pow(num(2), var(x)))), mul(var(x),var(y)))
DCGs are suitable for parsing such lists (of tokens):
expression_ast(E, A) :- phrase(expression(A), E).
expression(E) --> term(T), expression_r(T, E).
expression_r(E0, E) --> [+], term(T), expression_r(add(E0,T), E).
expression_r(E0, E) --> [-], term(T), expression_r(sub(E0,T), E).
expression_r(E, E) --> [].
term(T) --> power(P), term_r(P, T).
term_r(T0, T) --> [*], power(P), term_r(mul(T0,P), T).
term_r(T0, T) --> [/], power(P), term_r(div(T0,P), T).
term_r(T, T) --> [].
power(P) --> factor(F), power_r(F, P).
power_r(P0, pow(P0, P)) --> [^], factor(P1), power_r(P1, P).
power_r(P, P) --> [].
factor(num(N)) --> [N], { number(N) }.
factor(var(A)) --> [A], { atom(A) }.
factor(E) --> ['('], expression(E), [')'].
Notice the structure of one group of clauses per precedence
level. Each level parses as far as it can, possibly involving several
subterms with operators of higher precedence. Examples:
%?- expression_ast([1,+,x,+,2], AST).
%@% Ast = add(add(num(1), var(x)), num(2)) ;
%@% No
%?- expression_ast([1,/,'(',1,+,2,^,x,')',+,x,*,y], AST).
%@% AST = add(div(num(1), add(num(1), pow(num(2), var(x)))),
%@% mul(var(x), var(y))) ;
%@% No
Notice also that + associates to the left, and ^ to the right:
%?- expression_ast([0,^,0,^,0], AST).
%@% AST = pow(num(0), pow(num(0), num(0))) ;
%@% No
<-><-><-><-><-><-><-><-><-><->
Instead of arithmetic expressions, let us now consider *regular*
expressions. We restrict ourselves to a simple subset of POSIX regular
expressions for this example. Using DCGs to build syntax trees from
the common string notation for regular expressions, we obtain:
regexp_ast(Rs, T) :- regexp(T, Rs, []).
regexp(R) --> regexp_conc(R0), regexp_r(R0, R).
regexp(empty) --> [].
regexp_r(R0, R) --> [0'|], regexp_conc(R1), regexp_r(or(R0, R1), R).
regexp_r(R, R) --> [].
regexp_conc(R) --> regexp_term(R0), regexp_conc_r(R0, R).
regexp_conc_r(R0, R) --> regexp_term(R1),
regexp_conc_r(concat(R0,R1), R).
regexp_conc_r(R, R) --> [].
regexp_term(R) --> regexp_factor(R0), regexp_term_r(R0, R).
regexp_term_r(R0, R) --> [0'*], regexp_term_r(times(R0), R).
regexp_term_r(R0, R) --> [0'?], regexp_term_r(optional(R0), R).
regexp_term_r(R0, R) --> [0'+], regexp_term_r(plus(R0), R).
regexp_term_r(R, R) --> [].
regexp_factor(literal(L)) --> [0'\\], [L0], {atom_codes(L, [L0])}.
regexp_factor(literal(L)) -->
[L0],
{ literal(L0), atom_codes(L, [L0]) }.
regexp_factor(R) --> [0'(], regexp(R), [0')].
regexp_factor(LR) --> [0'[], literals_and_ranges(LR), [0']].
digit(D) :- between(0'0, 0'9, D).
lower(L) :- between(0'a, 0'z, L).
upper(U) :- between(0'A, 0'Z, U).
literal(L) :- digit(L) ; lower(L) ; upper(L).
literals_and_ranges([literal(L)|LRs]) -->
[L], { literal(L) }, literals_and_ranges(LRs).
literals_and_ranges([range(L1,L2)|LRs]) -->
[L1], { literal(L1) }, [0'-], [L2], { literal(L2) },
literals_and_ranges(LRs).
literals_and_ranges([]) --> [].
This grammar is a bit different: There's now only one (explicit)
binary operator, "|". Concatenation is implicit and has higher
precedence. Also, there are *postfix* operators now. Examples:
%?- regexp_ast("ab|c", AST).
%@% AST = or(concat(literal(a), literal(b)), literal(c));
%@% No
%?- regexp_ast("a*b\\|c", AST).
%@% AST = concat(concat(concat(times(literal(a)), literal(b)),
%@% literal(('|'))), literal(c)) ;
%@% No
Let us now perform *matching* using such regular expressions. That is,
we want to answer questions of the form: "Given a string S and a
regular expression E, is any substring of S an element of the language
generated by E?". As we are again describing lists (of Unicode
codepoints), we can again use DCGs:
regexp_match(R, S) :-
regexp_ast(R, T),
append(_, M, S),
regexp_match(T, M, _).
regexp_match(empty) --> [].
regexp_match(literal(L0)) --> [L], { atom_codes(L0, [L]) }.
regexp_match(concat(A,B)) --> regexp_match(A), regexp_match(B).
regexp_match(optional(O)) --> [] ; regexp_match(O).
regexp_match(times(_)) --> [].
regexp_match(times(T)) --> regexp_match(T), regexp_match(times(T)).
regexp_match(plus(P)) --> regexp_match(P).
regexp_match(plus(P)) --> regexp_match(P), regexp_match(plus(P)).
regexp_match(or(A,B)) --> regexp_match(A) ; regexp_match(B).
regexp_match([R|Rs]) -->
[L], { range_or_literal_match([R|Rs], L) }.
range_or_literal_match(RLs, L) :- memberchk(literal(L), RLs).
range_or_literal_match(RLs, L) :-
member(range(A,B), RLs),
between(A, B, L).
Examples:
%?- regexp_match("(a|b)+", "zaaaaaaaaaaaaaabaaa").
%@% Yes
%?- regexp_match("a*b\\|c", "xyb|cz").
%@% Yes
%?- regexp_match("a*b\\|c", "xybcz").
%@% No
<-><-><-><-><-><-><-><-><-><->
Consider now a different task:
Combining nine 9s with any number of the operators +,-,*,/,(,),
what is the smallest positive integer that cannot be expressed?
Hints:
1)The answer isn't zero. You can express zero like this:
(9 - 9) * (9 + 9 + 9 + 9 + 9 + 9 + 9)
Also, zero isn't a positive integer.
2) The answer isn't one. You can express one like this:
9 - (9 * 9 - 9)/9 + 9 - 9 + 9 - 9
3) It's not a trick question.
4) Be sure to handle parentheses correctly.
Notes:
* You cannot exponentiate.
* You cannot concatenate (for example, put two 9s together to
make 99).
* The - operator can be used in either its binary or unary form.
* Assume base 10.
4) is a trick hint. Forget parentheses altogether! We *won't* generate
symbolic syntax trees of expressions this time, let alone lists of
concrete tokens. Instead, we will non-deterministically build lists of
9s and operators, and interpret them as instructions for a reverse
Polish notation calculator. As we are describing lists again, we could
again use DCGs. In this case however, the list of operations can be
built quite nicely by instantiating a single variable more and more:
make_and_eval_expr(N, Es, Val) :-
length(Nines, N),
Nines = [_|Ops],
make_and_eval_expr(Nines, Ops, [], [Val], Es).
make_and_eval_expr([], [], S, S, []).
make_and_eval_expr([_|Nines], Ops, S0, S, [9|Rest]) :-
make_and_eval_expr(Nines, Ops, [9|S0], S, Rest).
make_and_eval_expr(Nines, [_|Ops], [B,A|S0], S, [Op|Rest]) :-
op_a_b_r(Op, A, B, R),
make_and_eval_expr(Nines, Ops, [R|S0], S, Rest).
op_a_b_r(+, A, B, R) :- R is A + B.
op_a_b_r(-, A, B, R) :- R is A - B.
op_a_b_r(*, A, B, R) :- R is A * B.
op_a_b_r(/, A, B, R) :- B =\= 0, R is A rdiv B.
Try:
%?- between(2,inf,N), \+((make_and_eval_expr(9,_,N0),abs(N0)=:=N)).
to obtain the answer in ~15 minutes with SWI Prolog and a 1.6GHz CPU.
Faster versions are possible; good luck!
All the best,
Markus
.
- Follow-Ups:
- Re: Parsing and other fun things involving (no) DCGs
- From: Markus Triska
- Re: Parsing and other fun things involving (no) DCGs
- Prev by Date: Re: Problem with SWI assertions (>= 5.6.28) and Autotest
- Next by Date: comp.lang.prolog Frequently Asked Questions
- Previous by thread: Problem with SWI assertions (>= 5.6.28) and Autotest
- Next by thread: Re: Parsing and other fun things involving (no) DCGs
- Index(es):
Relevant Pages
|