Re: Negative lookahead regex clarification needed
From: Anno Siegel (anno4000_at_lublin.zrz.tu-berlin.de)
Date: 01/19/05
- Next message: Hue-Bond: "Re: The world's shortest 'Hello World!' program: a proposal"
- Previous message: osmo: "Re: locale problem"
- In reply to: shifty: "Negative lookahead regex clarification needed"
- Next in thread: shifty: "Re: Negative lookahead regex clarification needed"
- Reply: shifty: "Re: Negative lookahead regex clarification needed"
- Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]
Date: 19 Jan 2005 21:28:23 GMT
shifty <shifty_MyU@yahoo.com> wrote in comp.lang.perl.misc:
> Hi,
>
> I'm trying to hack my way through a regex for a chunk of code I'm going
> to use. I've been using a Regex Coach to run through this and I think
> I have correct syntax.
If the syntax weren't correct it wouldn't compile. What you are asking is
whether it does what you want it to do, which is about semantics.
> I am trying to find any one of several 'hacked' variants of the word
> "microsoft" (ex: m1cr0s0ft, miçr0§0ft, etc.), but NOT match on the
> actual word "microsoft". I need the regex to be case sensitive.
>
> This is my regex - it seems to work, but I don't know if the syntax is
> honestly correct and I don't want it to break later:
>
> (?i).*\b(?:(?!microsoft)m+[i1l\\\|!¡îíìï]+[Cç]+r+[o0öøõôóòð]+[s§]+[o0öøõôóòð]+f+[t\+]+)\b.*
That string is mangled. It appears to contain literal backspaces or other
control characters that make it hard to analyze. It may well not compile.
Is there any reason why you want to use lookahead to exclude unaltered
strings like "microsoft"? Just skip those strings using an extra regex,
and concentrate on matching the altered variants.
To do this in a maintainable way, I'd first build a hash of possible
replacement characters. For "microsoft", it might look like this:
    my %repla = (
        m => 'm',
        i => 'i1',
        c => 'cç',
        r => 'r',
        o => 'o0',
        s => 's5§',
        f => 'f',
        t => 't+',
    );
    $_ = quotemeta for values %repla; # make regex-safe
Add more characters to cover other words besides "microsoft".
Then build your regex from the replacement strings in a systematic
way:
    my $re = join '', map "[$_]", @repla{ split //, 'microsoft'};
    $re = qr/$re/i; # made case-insensitive here
To test it, run
    for ( qw( microsoft miçr0§0ft m1cros0f+ m1crosaft) ) {
        next if /^microsoft$/i;   # skip the unaltered word
        print "$_\n" if $_ =~ $re;
    }
It prints only the middle two examples.
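For reference, the pieces above can be put together into one script. This is only a sketch under a few assumptions that are not in the fragments above: the source file is saved as UTF-8 (hence "use utf8" and the binmode call), and variant_regex is a hypothetical helper name chosen for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use utf8;                               # assumption: this file is UTF-8
binmode STDOUT, ':encoding(UTF-8)';     # so that ç and § print correctly

# Replacement characters for each letter, as in the hash above.
my %repla = (
    m => 'm',
    i => 'i1',
    c => 'cç',
    r => 'r',
    o => 'o0',
    s => 's5§',
    f => 'f',
    t => 't+',
);
$_ = quotemeta for values %repla;       # make regex-safe

# Hypothetical helper: build one character class per letter of $word.
# Every letter of $word must have an entry in %repla.
sub variant_regex {
    my ($word) = @_;
    my $re = join '', map "[$_]", @repla{ split //, $word };
    return qr/$re/i;
}

my $re = variant_regex('microsoft');

# Keep the altered variants, skipping the unaltered word itself.
my @hits = grep { !/^microsoft$/i && $_ =~ $re }
           qw( microsoft miçr0§0ft m1cros0f+ m1crosaft );

print "$_\n" for @hits;
```

Without "use utf8", a UTF-8 encoded ç or § in the source would be seen as two separate bytes each, and a character class would match only one of those bytes at a time, so the accented variants would probably not match.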
If you really need to do everything in one regex (yes, it does make a slight
difference), you can introduce negative lookahead by changing the line
containing qr// to
    $re = qr/(?!microsoft)$re/i;
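A small sketch of how the single-regex form behaves. The character classes are written out by hand here for brevity, and the names $variants, $single, @words and @matched are made up for the example; it also assumes a UTF-8 source file.

```perl
use strict;
use warnings;
use utf8;                               # assumption: this file is UTF-8

# Hand-written equivalent of the generated character classes.
my $variants = qr/m[i1][cç]r[o0][s5§][o0]f[t+]/i;

# The lookahead is tested at the position where the variant match would
# start, so it rejects exactly the attempts that begin with "microsoft"
# (in any letter case, because of /i).
my $single = qr/(?!microsoft)$variants/i;

my @words   = qw( microsoft Microsoft miçr0§0ft m1cros0f+ m1crosaft );
my @matched = grep { $_ =~ $single } @words;

print scalar(@matched), " of ", scalar(@words), " words matched\n";
```

Here the two plain spellings are rejected by the lookahead, and m1crosaft fails because "a" is not in the class for "o", which leaves the same two altered variants as before.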
Working this way, there is little doubt about what the code does, and it
will be easy to modify and extend. There is also no need for a "Regex
Coach" with dubious I/O habits.
Anno