Re: Regex: Why is overreaching necessary?



On Feb 20, 3:07 am, anno4...@xxxxxxxxxxxxxxxxxxxxxx wrote:
Shannon Jacobs <Shannon.Jacobs.nos...@xxxxxxxxx> wrote in comp.lang.perl.misc:

On Feb 17, 10:58 am, anno4...@xxxxxxxxxxxxxxxxxxxxxx wrote:
Shannon Jacobs <sha...@xxxxxxxxxxxx> wrote in comp.lang.perl.misc:

[...]

I currently have .* in my first version above) so that it only considers 4
digits at a time. Here is some sample data from the file.

The Brethren 20010210282239 Fa
Gorilla, My Love 19810211042240 HF
KeitaiDenwaNoHimitsu 200102110722412242 JaChCS
Harry Potter and the Philosopher's Stone199702111722362243 Fa

In this example the first and fourth lines are proper matches against 2239
and 2243, respectively, but the third line is an undesired match against
1224. The problem as I see it is that the two things I'm thinking about
inserting should communicate with each other so that they always consume a
total of 8 characters, thereby forcing the target to consider only four
characters at a time.
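
The false match can be reproduced in isolation. What follows is only a minimal sketch: it assumes the 12-character field of the third sample line contains the digits 110722412242 (the exact column positions are not recoverable from the sample as posted), and `first_offset` is a hypothetical helper, not part of the original script.

```perl
use strict;
use warnings;

# Hypothetical helper: offset of the first unanchored match of $target
# in $field, or -1 if there is none. \Q...\E treats the target literally.
sub first_offset {
    my ($field, $target) = @_;
    return $field =~ /\Q$target\E/ ? $-[0] : -1;
}

# Assumed field contents for the third sample line (KeitaiDenwaNoHimitsu).
my $field  = '110722412242';
my $offset = first_offset($field, '1224');

if ($offset < 0) {
    print "1224 not found\n";
}
else {
    # An offset that is not a multiple of 4 straddles two four-digit groups.
    printf "1224 found at offset %d (%s)\n", $offset,
        $offset % 4 == 0 ? 'on a group boundary' : 'straddles two groups';
}
```

With that assumed field the match starts at offset 7, which is not a multiple of four, so it spans the end of 2241 and the start of 2242.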

Try this variant:

@foo2 = grep substr( $_, 50, 12 ) =~
    /^(?:\d{4}){0,2}$form_values{'a_SEARCH_VALUE'}/,
    @foo1;

Essentially that ties the pattern to the beginning of the substring,
then allows zero to two groups of four digits before a match.
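
A small sketch of how the anchored pattern behaves on the sample digits. The field strings are assumptions (the digit runs from the sample lines, taken as what `substr` would return), `matches_on_boundary` is a hypothetical wrapper, and `\Q...\E` is an addition so the search value is treated as literal text.

```perl
use strict;
use warnings;

# Hypothetical wrapper around the pattern above: the target may only start
# at offset 0, 4 or 8 of the field.
sub matches_on_boundary {
    my ($field, $target) = @_;
    return $field =~ /^(?:\d{4}){0,2}\Q$target\E/ ? 1 : 0;
}

# Assumed field contents, based on the digit runs in the sample lines.
my @cases = (
    [ '10282239',     '2239' ],  # target is the second group
    [ '111722362243', '2243' ],  # target is the third group
    [ '110722412242', '1224' ],  # target straddles groups two and three
);

for my $case (@cases) {
    my ($field, $target) = @$case;
    printf "%-12s %s => %s\n", $field, $target,
        matches_on_boundary($field, $target) ? 'match' : 'no match';
}
```

Under these assumptions the first two cases match and the third does not, because the regex engine can only skip whole groups of four digits before trying the target.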

Anno

Sorry, but that doesn't work. I think it's because it picks up the
false matches when it has no groups of four digits before the match.

It doesn't pick up false matches from the sample you supplied.

Somehow it needs to be limited to considering only four source digits
at a time, or to think that there is a non-digit boundary between the
two groups of four digits.

(I don't think it matters, and I tested it both ways, but I think it
should be

@foo2 = grep substr( $_, 50, 12 ) =~
    /^(?:.{4}){0,2}$form_values{'a_SEARCH_VALUE'}/,
    @foo1;

rather than your version. The data file may have spaces,

Then your sample data should have included such a case.

and I think
that \d wouldn't count them at that point.)

You are more permissive than the data requires. If you want to allow
blanks, allow blanks:

/^(?:[\d ]{4}){0,2}$form_values{'a_SEARCH_VALUE'}/
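
To illustrate the difference between the two character classes, here is a hedged sketch. The blank-padded field is hypothetical (no such line appears in the posted sample), and both helper names are invented for the example.

```perl
use strict;
use warnings;

# Hypothetical helpers: the same anchored pattern with two different
# classes for the skipped four-character groups.
sub match_digits_only {
    my ($field, $target) = @_;
    return $field =~ /^(?:\d{4}){0,2}\Q$target\E/ ? 1 : 0;
}

sub match_digits_or_blanks {
    my ($field, $target) = @_;
    return $field =~ /^(?:[\d ]{4}){0,2}\Q$target\E/ ? 1 : 0;
}

# Hypothetical field in which the first group is four blanks.
my $padded = '    2239';

printf "digits only:      %s\n",
    match_digits_only($padded, '2239') ? 'match' : 'no match';
printf "digits or blanks: %s\n",
    match_digits_or_blanks($padded, '2239') ? 'match' : 'no match';
```

The digits-only pattern cannot skip the blank group, while the `[\d ]` version can; both still reject a target that straddles two groups.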

Anno

You are correct, but the problem is apparently in the particular data
sample which I provided. When tested against the full data file it
still has the problem of the false matches. I was in a hurry to
acknowledge my error, but I don't have time this morning to do more
diagnostics.

Perhaps it is something about the presence of the third number in some
of the real data that is causing it to fail? I see that the sample I
included did not have any cases with 12 digits, but only 8.

(I did test Ilya Zakharevich's suggestion from the next post, and it
performed worse, producing additional false matches. I'm eager to study
the differences, though his approach seems more complicated than yours.)
