Re: Negative lookahead regex clarification needed

From: Anno Siegel (anno4000_at_lublin.zrz.tu-berlin.de)
Date: 19 Jan 2005 21:28:23 GMT

shifty <shifty_MyU@yahoo.com> wrote in comp.lang.perl.misc:
> Hi,
>
> I'm trying to hack my way through a regex for a chunk of code I'm going
> to use. I've been using a Regex Coach to run through this and I think
> I have correct syntax.

If the syntax weren't correct it wouldn't compile. What you are asking is
whether it does what you want it to do, which is about semantics.
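If you only want to know whether a pattern compiles, you can ask Perl directly. A minimal sketch (compiles_ok is a made-up helper name, and the two patterns are toy examples, not your regex):

```perl
use strict;
use warnings;

# compiles_ok is a hypothetical helper, not part of any module.
sub compiles_ok {
    my ($pattern) = @_;
    my $re = eval { qr/$pattern/ };   # qr// dies at run time on a syntax error
    return defined $re ? 1 : 0;
}

print compiles_ok('m+[i1]+c+') ? "compiles\n" : "syntax error\n";
print compiles_ok('m+[i1+c+')  ? "compiles\n" : "syntax error\n";   # unterminated [
```

That settles the syntax question and nothing more. Whether the pattern matches what you intend still has to be checked against sample data.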

> I am trying to find any one of several 'hacked' variants of the word
> "microsoft" (ex: m1cr0s0ft, miçr0§0ft, etc.), but NOT match on the
> actual word "microsoft". I need the regex to be case sensitive.
>
> This is my regex - it seems to work, but I don't know if the syntax is
> honestly correct and I don't want it to break later:
>
> (?i).*\b(?:(?!microsoft)m+[i1l\\\|!¡îíìï]+[Cç]+r+[o0öøõôóòð]+[s§]+[o0öøõôóòð]+f+[t\+]+)\b.*

That string is mangled. It appears to contain literal backspaces or other
control characters that make it hard to analyze. It may well not compile.

Is there any reason why you want to use lookahead to exclude unaltered
strings like "microsoft"? Just skip those strings using an extra regex,
and concentrate on matching the altered variants.

To do this in a maintainable way, I'd first build a hash of possible
replacement characters. For "microsoft", it might look like this:

    my %repla = (
        m => 'm',
        i => 'i1',
        c => 'cç',
        r => 'r',
        o => 'o0',
        s => 's5§',
        f => 'f',
        t => 't+',
    );
    $_ = quotemeta for values %repla; # make regex-safe

Add more characters to cover other words besides "microsoft".

Then build your regex from the replacement strings in a systematic
way:

    my $re = join '', map "[$_]", @repla{ split //, 'microsoft'};
    $re = qr/$re/i; # made case-insensitive here

To test it, run

    for ( qw( microsoft miçr0§0ft m1cros0f+ m1crosaft) ) {
        next if /^microsoft$/i;   # skip the unaltered word
        print "$_\n" if $_ =~ $re;
    }

It prints only the middle two examples.
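To cover other words, the same steps can be wrapped in a sub. This is only a sketch under a few assumptions: variant_re is a name I made up, the table is an ASCII-only subset (which sidesteps source-encoding questions), the '!' and '|' entries and the word "linux" are illustrative, and letters without a table entry simply stand for themselves.

```perl
use strict;
use warnings;

# ASCII-only subset of the replacement table; '!' and '|' are
# illustrative additions, not taken from the original hash.
my %repla = (
    i => 'i1!|',
    o => 'o0',
    s => 's5',
    t => 't+',
);

# variant_re is a hypothetical helper name.
sub variant_re {
    my ($word) = @_;
    my $pattern = join '', map {
        my $chars = exists $repla{$_} ? $repla{$_} : $_;  # fall back to the letter itself
        '[' . quotemeta($chars) . ']';                    # one character class per letter
    } split //, lc $word;
    return qr/$pattern/i;
}

for my $word (qw( microsoft linux )) {
    my $re = variant_re($word);
    for my $candidate (qw( m1cr0s0f+ l!nux linus )) {
        next if lc $candidate eq $word;     # skip the unaltered spelling
        print "$candidate looks like $word\n" if $candidate =~ $re;
    }
}
```

This reports "m1cr0s0f+" for microsoft and "l!nux" for linux. "linus" is rejected because 's' is not an allowed stand-in for 'x'.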

If you really need to do everything in one regex (yes, it does make a slight
difference), you can introduce negative lookahead by changing the line
containing qr// to

    $re = qr/(?!microsoft)$re/i;
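One place where the two approaches can behave differently is on strings longer than the bare word. The sketch below compares them. $base is a hand-written ASCII stand-in for the generated pattern, and the sample strings are my own, so take it as an illustration and not a complete account of the difference.

```perl
use strict;
use warnings;

# Hand-written ASCII stand-in for the pattern built from %repla (assumption).
my $base = '[m][i1][c][r][o0][s5][o0][f][t\+]';

my $plain     = qr/$base/i;                  # used together with a separate skip test
my $lookahead = qr/(?!microsoft)$base/i;     # everything in one regex

for my $text ('microsoft', 'Microsoft Word', 'm1cr0s0ft', 'microsoft or m1cr0s0ft') {
    # Two-step: skip only if the whole string is the unaltered word.
    my $two_step = ($text !~ /^microsoft$/i && $text =~ $plain) ? 'hit' : 'miss';
    # One-step: the lookahead rejects the unaltered word at each match position.
    my $one_step = ($text =~ $lookahead) ? 'hit' : 'miss';
    printf "%-24s two-step: %-4s lookahead: %s\n", $text, $two_step, $one_step;
}
```

Only "Microsoft Word" comes out differently: the anchored skip test lets it through to the plain pattern, which then matches the unaltered word, while the lookahead version rejects it. Dropping the anchors from the skip test would instead discard "microsoft or m1cr0s0ft" entirely, so the two forms are not interchangeable on multi-word input.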

Working this way, there is little doubt about what the code does, and it
will be easy to modify and extend. There is also no need for a "Regex
Coach" with dubious I/O habits.

Anno