Re: RegEx: finding a string that does not contain /<(w ...(-...)?|c ...)>/
- From: info@xxxxxxxxxxxx (D. Bolliger)
- Date: Tue, 5 Sep 2006 16:43:01 +0200
Stefan Th. Gries wrote on Tuesday, 5 September 2006, 14:20:
Hi all
Hello Stefan
I have a regex question I can't solve. I know this is a really long posting
but in order to explain the problem, I first say what I can do and then
what I can't. Any ideas, pointers, snippets of code etc. would be really
appreciated ... Thx,
STG
As you can see from the mail date, I didn't spend days on the answer :-)
What I will present is a script to
- generate regexes (to be used in R)
- test them
- demonstrate the building of complex regexes from parts
The regexes might not be exactly correct and the names could be better chosen; I didn't pay much attention to capturing parentheses, the x modifier, comments, etc.
I couldn't find a way without lookahead.
But the regexes select the cases you wish.
--------------------
I. This I can do ...
--------------------
I have an array @a with character strings:
@a=("<w AT0>a <w CJC>and <w DT0>that<c PUN>.",
"<w AT0>a <w CJC>and <w AJ0>hungry <c PUN>,<w DT0>that<c PUN>.",
"<w AT0>a <w CJC>and <w AJ0>hungry <w AJ0>hungry <c PUN>,<w DT0>that<c PUN>.")
The defining characteristic of the character strings in the array is that
every word and every punctuation mark is preceded by a tag with the
following structure: /<(w ...(-...)?|c ...)>/
(a) I want to retrieve the sequence of
- a word tagged as <w CJC>, immediately followed by
- a word tagged as <w DT0>.
Since every tag starts with /</, I use this regex:
/<w CJC>[^<]*?<w DT0>[^<]*/,
which works just fine by retrieving only @a[0].
(b) I want to retrieve the sequence of
- a word tagged as <w CJC>, followed by
- between 0 and 2 words and their tags (again, looking like this:
/<(w ...(-...)?|c ...)>/), followed by
- a word tagged as <w DT0>.
I use this regex:
/<w CJC>[^<]*?(<[wc] (...|...-...)>[^<]*?){0,2}<w DT0>[^<]*/,
which works just fine by retrieving only @a[0:1]. (I know I could use "?:"
to avoid the capturing for the backreference but I don't care about that at
the moment.)
----------------------
II. This I can't ...
----------------------
I have an array @b with character strings:
@b=("<w AT0>a <w CJC>and <w DT0>that<c PUN>.",
"<w AT0>a <w CJC>and <w DT0>that <w NN2>cars",
"<w AT0>a <w CJC>and <ptr target=KB2LC003><w DT0>that<c PUN>.",
"<w AT0>a <w CJC>and <ptr target=KB2LC003> <w DT0>that<c PUN>.",
"<w AT0>a <w CJC>and <ptr target=KB2LC003> <ptr target=KB2LC004> <w DT0>that<c PUN>.",
"<w AT0>a <w CJC>and <p tr target=KB2LC003> <ptr target=KB2LC004> <w DT0>that<c PUN>.",
"<w AT0>a <w CJC>and <wtr target=KB2LC003><w DT0>that<c PUN>.",
"<w AT0>a <w CJC>and <ctr target=KB2LC003><w DT0>that<c PUN>.",
"<w AT0>a <w CJC>and <ptr target=KB2LC003><c PUN>, <w DT0>that<c PUN>.",
"<w AT0>a <w CJC>and <ptr target=KB2LC003><w NN2-VVZ>cars <w DT0>that<c PUN>.",
"<w AT0>a <w CJC>and <w AJ0>hungry <w DT0>that<c PUN>.",
"<w AT0>a <w CJC>and <w AJ0>hungry <c PUN>,<w DT0>that<c PUN>.")
I basically want to do the same things as above, but the complication is
that there are now additional kinds of tags -- tags that are not
/<(w ...(-...)?|c ...)>/ -- and my problem is how to skip them, to disregard
them for the match. Thus,
(a) I want to retrieve those elements of @b in which "<w CJC>" and
"<w DT0>" are
- directly adjacent, or
- not interrupted by any word with its tag (again, looking like this:
/<(w ...(-...)?|c ...)>/).
That is, I need to say something like "return everything from /<w CJC>/ to
/<w DT0>/, but if there is any /<(w ...(-...)?|c ...)>/ in between the two,
return nothing". Thus, of the array @b I would like to get back the first
eight elements, but not the last four elements:
@b[0]: yes, because only separated by a space
@b[1]: yes, because only separated by a space
@b[2]: yes, because not interrupted by /<(w ...(-...)?|c ...)>/, only by /<ptr[^>]+>/
@b[3]: yes, because not interrupted by /<(w ...(-...)?|c ...)>/, only by /<ptr[^>]+>/
@b[4]: yes, because not interrupted by /<(w ...(-...)?|c ...)>/, only by /<ptr[^>]+>/
@b[5]: yes, because not interrupted by /<(w ...(-...)?|c ...)>/, only by /<p tr[^>]+>/ and /<ptr[^>]+>/
@b[6]: yes, because not interrupted by /<(w ...(-...)?|c ...)>/, only by /<w[^>]+>/
@b[7]: yes, because not interrupted by /<(w ...(-...)?|c ...)>/, only by /<c[^>]+>/
@b[8]: no, because interrupted by, among other things, /<c PUN>/
@b[9]: no, because interrupted by, among other things, /<w NN2-VVZ>/
@b[10]: no, because interrupted by, among other things, /<w AJ0>hungry/
@b[11]: no, because interrupted by, among other things, /<w AJ0>/ and /<c PUN>/
I do not use Perl, but R, so the regex
- *must* involve Perl-compatible regular expressions;
- would ideally work without lookaround (but if lookaround is absolutely
needed, so be it).
The best I came up with was this (again, I don't care about putting in "?:"):
/<w CJC>[^<]+(<[^wc].*?>.*?)*<w DT0>[^<]*?/
but this does of course not work for @b[6:7], because the relevant part of
the regex only rules out /<[wc]/, whereas I need to rule out all of
/<(w ...(-...)?|c ...)>/.
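One way to say "any tag except a word or punctuation tag" is a negative lookahead placed right after the opening "<". The following is a minimal sketch in Perl, not a tested solution for a full corpus; the names $other, $re_2a and matches_2a are made up for illustration. The pattern uses only Perl-compatible syntax, so it should carry over to R with perl=TRUE, although that has not been checked here.

```perl
use strict;
use warnings;

my $text  = qr/[^<>]*/;   # text between tags
# Any tag that is NOT of the form <w ...>, <w ...-...> or <c ...>:
# the lookahead rejects the word/punctuation tags, the rest is consumed.
my $other = qr/<(?!(?:w ...(?:-...)?|c ...)>)[^<>]*>/;

# <w CJC>, then only text and "other" tags, then <w DT0>.
my $re_2a = qr/<w CJC>$text(?:$other$text)*<w DT0>/;

sub matches_2a { my ($str) = @_; return $str =~ $re_2a ? 1 : 0 }

my @b = (
    "<w AT0>a <w CJC>and <w DT0>that<c PUN>.",
    "<w AT0>a <w CJC>and <w DT0>that <w NN2>cars",
    "<w AT0>a <w CJC>and <ptr target=KB2LC003><w DT0>that<c PUN>.",
    "<w AT0>a <w CJC>and <ptr target=KB2LC003> <w DT0>that<c PUN>.",
    "<w AT0>a <w CJC>and <ptr target=KB2LC003> <ptr target=KB2LC004> <w DT0>that<c PUN>.",
    "<w AT0>a <w CJC>and <p tr target=KB2LC003> <ptr target=KB2LC004> <w DT0>that<c PUN>.",
    "<w AT0>a <w CJC>and <wtr target=KB2LC003><w DT0>that<c PUN>.",
    "<w AT0>a <w CJC>and <ctr target=KB2LC003><w DT0>that<c PUN>.",
    "<w AT0>a <w CJC>and <ptr target=KB2LC003><c PUN>, <w DT0>that<c PUN>.",
    "<w AT0>a <w CJC>and <ptr target=KB2LC003><w NN2-VVZ>cars <w DT0>that<c PUN>.",
    "<w AT0>a <w CJC>and <w AJ0>hungry <w DT0>that<c PUN>.",
    "<w AT0>a <w CJC>and <w AJ0>hungry <c PUN>,<w DT0>that<c PUN>.",
);

# One digit per element of @b: 1 = retrieved, 0 = ignored.
print join("", map { matches_2a($_) } @b), "\n";   # prints 111111110000
```

The first eight elements are retrieved and the last four are not, because <wtr ...> and <ctr ...> lack the space that the lookahead requires after "w" or "c".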
(b) I want to retrieve the sequence of
- a word tagged as <w CJC>, followed by
- between 0 and 2 words and their tags (again, looking like this:
/<(w ...(-...)?|c ...)>/), followed by
- a word tagged as <w DT0>.
Again, the regex
- *must* involve Perl-compatible regular expressions;
- would ideally work without lookaround (but if lookaround is absolutely
needed, so be it).
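A possible approach to (b), offered as a sketch only: allow up to two word or punctuation tags between <w CJC> and <w DT0>, and let any number of other tags appear around them without counting toward the limit. It assumes that the extra tags should simply be ignored when counting; the names $generic, $other, $re_2b and matches_2b are hypothetical, and the last test string is invented to show a rejection.

```perl
use strict;
use warnings;

my $text    = qr/[^<>]*/;                                  # text between tags
my $generic = qr/<(?:w ...(?:-...)?|c ...)>/;              # word/punctuation tag
my $other   = qr/<(?!(?:w ...(?:-...)?|c ...)>)[^<>]*>/;   # any other tag
my $skip    = qr/(?:$other$text)*/;                        # other tags to ignore

# <w CJC>, ignorable tags, 0 to 2 tagged words (each possibly
# followed by ignorable tags), then <w DT0>.
my $re_2b = qr/<w CJC>$text$skip(?:$generic$text$skip){0,2}<w DT0>/;

sub matches_2b { my ($str) = @_; return $str =~ $re_2b ? 1 : 0 }

my @sample = (
    # 0 words in between, one <ptr> tag skipped
    "<w AT0>a <w CJC>and <ptr target=KB2LC003><w DT0>that<c PUN>.",
    # 1 word in between, one <ptr> tag skipped
    "<w AT0>a <w CJC>and <ptr target=KB2LC003><w NN2-VVZ>cars <w DT0>that<c PUN>.",
    # 2 tagged items in between
    "<w AT0>a <w CJC>and <w AJ0>hungry <c PUN>,<w DT0>that<c PUN>.",
    # 3 tagged items in between (invented example): too many
    "<w AT0>a <w CJC>and <w AJ0>hungry <ptr target=KB2LC003><w AJ0>hungry <c PUN>,<w DT0>that<c PUN>.",
);

print join("", map { matches_2b($_) } @sample), "\n";   # prints 1110
```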
#!/usr/bin/perl
use strict;
use warnings;
my $w_CJC =qr/(?:<w CJC>)/;
my $w_DT0 =qr/(?:<w DT0>)/;
my $generic1=qr/(?:<(w ...(-...)?|c ...)>)/;
my $ptr =qr/(?:<ptr[^>]+>)/;
my $p_tr =qr/(?:<p tr[^>]+>)/;
my $re_w =qr/(?:<w[^ ][^>]+>)/; # NOTE: [^ ] distinguishes this from $generic1
my $re_c =qr/(?:<c[^ ][^>]+>)/; # ditto
my $text =qr/(?:[^<>]*)/; # the text that follows a tag
my $disregard =qr/$text|$ptr|$p_tr/; # not used below
my $not_generic1=qr/(?:$w_CJC|$w_DT0|$ptr|$p_tr|$re_w|$re_c)$text/;
# Report which strings a regex retrieves, just to check that the selection is OK
sub retrieve {
    my ($aref, $regex)=@_;
    for my $str (@$aref) {
        if ($str=~/$regex/) {warn "retrieved: $str\n";}
        else {warn "ignored: $str\n";}
    }
}
my @a=("<w AT0>a <w CJC>and <w DT0>that<c PUN>.",
"<w AT0>a <w CJC>and <w AJ0>hungry <c PUN>,<w DT0>that<c PUN>.",
"<w AT0>a <w CJC>and <w AJ0>hungry <w AJ0>hungry <c PUN>,<w DT0>that<c PUN>.");
my @b=("<w AT0>a <w CJC>and <w DT0>that<c PUN>.",
"<w AT0>a <w CJC>and <w DT0>that <w NN2>cars",
"<w AT0>a <w CJC>and <ptr target=KB2LC003><w DT0>that<c PUN>.",
"<w AT0>a <w CJC>and <ptr target=KB2LC003> <w DT0>that<c PUN>.",
"<w AT0>a <w CJC>and <ptr target=KB2LC003> <ptr target=KB2LC004> <w DT0>that<c PUN>.",
"<w AT0>a <w CJC>and <p tr target=KB2LC003> <ptr target=KB2LC004> <w DT0>that<c PUN>.",
"<w AT0>a <w CJC>and <wtr target=KB2LC003><w DT0>that<c PUN>.",
"<w AT0>a <w CJC>and <ctr target=KB2LC003><w DT0>that<c PUN>.",
"<w AT0>a <w CJC>and <ptr target=KB2LC003><c PUN>, <w DT0>that<c PUN>.",
"<w AT0>a <w CJC>and <ptr target=KB2LC003><w NN2-VVZ>cars <w DT0>that<c PUN>.",
"<w AT0>a <w CJC>and <w AJ0>hungry <w DT0>that<c PUN>.",
"<w AT0>a <w CJC>and <w AJ0>hungry <c PUN>,<w DT0>that<c PUN>.");
my $re_1a=qr/$w_CJC$text$w_DT0$text/;
my $re_1b=qr/$w_CJC$text(?:$generic1$text){0,2}$w_DT0$text/;
my $re_not_interrupted_by_generic=qr/($not_generic1?(?!(?:$generic1$text)+)?)*?/;
my $re_2a=qr/$w_CJC$text$re_not_interrupted_by_generic$w_DT0$text/;
warn "\n*** 1a /$re_1a/\n\n";
retrieve(\@a, $re_1a);
warn "\n*** 1b /$re_1b/\n\n";
retrieve(\@a, $re_1b);
warn "\n*** 2a /$re_2a/\n\n";
retrieve(\@b, $re_2a);
__END__
The output is:
*** 1a /(?-xism:(?-xism:(?:<w CJC>))(?-xism:(?:[^<>]*))(?-xism:(?:<w DT0>))(?-xism:(?:[^<>]*)))/
retrieved: <w AT0>a <w CJC>and <w DT0>that<c PUN>.
ignored: <w AT0>a <w CJC>and <w AJ0>hungry <c PUN>,<w DT0>that<c PUN>.
ignored: <w AT0>a <w CJC>and <w AJ0>hungry <w AJ0>hungry <c PUN>,<w DT0>that<c PUN>.
*** 1b /(?-xism:(?-xism:(?:<w CJC>))(?-xism:(?:[^<>]*))(?:(?-xism:(?:<(w ...(-...)?|c ...)>))(?-xism:(?:[^<>]*))){0,2}(?-xism:(?:<w DT0>))(?-xism:(?:[^<>]*)))/
retrieved: <w AT0>a <w CJC>and <w DT0>that<c PUN>.
retrieved: <w AT0>a <w CJC>and <w AJ0>hungry <c PUN>,<w DT0>that<c PUN>.
ignored: <w AT0>a <w CJC>and <w AJ0>hungry <w AJ0>hungry <c PUN>,<w DT0>that<c PUN>.
*** 2a /(?-xism:(?-xism:(?:<w CJC>))(?-xism:(?:[^<>]*))(?-xism:((?-xism:(?:(?-xism:(?:<w CJC>))|(?-xism:(?:<w DT0>))|(?-xism:(?:<ptr[^>]+>))|(?-xism:(?:<p tr[^>]+>))|(?-xism:(?:<w[^ ][^>]+>))|(?-xism:(?:<c[^ ][^>]+>)))(?-xism:(?:[^<>]*)))?(?!(?:(?-xism:(?:<(w ...(-...)?|c ...)>))(?-xism:(?:[^<>]*)))+)?)*?)(?-xism:(?:<w DT0>))(?-xism:(?:[^<>]*)))/
retrieved: <w AT0>a <w CJC>and <w DT0>that<c PUN>.
retrieved: <w AT0>a <w CJC>and <w DT0>that <w NN2>cars
retrieved: <w AT0>a <w CJC>and <ptr target=KB2LC003><w DT0>that<c PUN>.
retrieved: <w AT0>a <w CJC>and <ptr target=KB2LC003> <w DT0>that<c PUN>.
retrieved: <w AT0>a <w CJC>and <ptr target=KB2LC003> <ptr target=KB2LC004> <w DT0>that<c PUN>.
retrieved: <w AT0>a <w CJC>and <p tr target=KB2LC003> <ptr target=KB2LC004> <w DT0>that<c PUN>.
retrieved: <w AT0>a <w CJC>and <wtr target=KB2LC003><w DT0>that<c PUN>.
retrieved: <w AT0>a <w CJC>and <ctr target=KB2LC003><w DT0>that<c PUN>.
ignored: <w AT0>a <w CJC>and <ptr target=KB2LC003><c PUN>, <w DT0>that<c PUN>.
ignored: <w AT0>a <w CJC>and <ptr target=KB2LC003><w NN2-VVZ>cars <w DT0>that<c PUN>.
ignored: <w AT0>a <w CJC>and <w AJ0>hungry <w DT0>that<c PUN>.
ignored: <w AT0>a <w CJC>and <w AJ0>hungry <c PUN>,<w DT0>that<c PUN>.
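A closer look at $re_not_interrupted_by_generic suggests that the negative lookahead is followed by "?", which makes it optional, so it probably never rejects anything; the selection appears to come from the list of allowed tags in $not_generic1. If that reading is right, case 2a can be handled without lookaround. The sketch below assumes that every tag starting with "w " or "c " counts as a word or punctuation tag and that any other tag may be skipped; $other, $re_plain and matches_plain are illustrative names, not part of the script above.

```perl
use strict;
use warnings;

my $text  = qr/[^<>]*/;   # text between tags
# A tag that does not start with "w " or "c ": either its first
# character is not w/c, or the w/c is not followed by a space.
my $other = qr/<(?:[^wc<>][^<>]*|[wc][^ <>][^<>]*)>/;

# <w CJC>, then only text and skippable tags, then <w DT0>.
my $re_plain = qr/<w CJC>$text(?:$other$text)*<w DT0>/;

sub matches_plain { my ($str) = @_; return $str =~ $re_plain ? 1 : 0 }

my @sample = (
    "<w AT0>a <w CJC>and <wtr target=KB2LC003><w DT0>that<c PUN>.",
    "<w AT0>a <w CJC>and <ctr target=KB2LC003><w DT0>that<c PUN>.",
    "<w AT0>a <w CJC>and <ptr target=KB2LC003><c PUN>, <w DT0>that<c PUN>.",
    "<w AT0>a <w CJC>and <ptr target=KB2LC003><w NN2-VVZ>cars <w DT0>that<c PUN>.",
);

# 1 = retrieved, 0 = ignored
print join("", map { matches_plain($_) } @sample), "\n";   # prints 1100
```

The trade-off is that this version treats every tag beginning with "w " or "c " as a blocker, whatever follows the space, whereas the original tag description requires exactly three characters (optionally followed by a hyphen and three more).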
Hope this helps a bit :-)
Dani