Re: Converting a string to multiple search patterns

From: Anno Siegel (anno4000_at_lublin.zrz.tu-berlin.de)
Date: 06/08/04


Date: 8 Jun 2004 11:53:53 GMT

Tore Aursand <tore@aursand.no> wrote in comp.lang.perl.misc:
> Hi all!
>
> I'm stumped on this one: I have an application where I need to refine the
> search mechanism. The concept is quite simple: Get a string, convert it
> to separate words, count (and "score") each word for each document, and
> then display the result based on the score;
>
> my $query = 'A B C D';
> my @words = split( /\s+/, $query );
> foreach ( @documents ) {
> # ...
> }
>
> I need to refine it, as said. I want a higher score for word sequences,
> and in a particular order. For the example above ('A B C D'), I want to
> match in this order:
>
> 1. A B C D
> 2. A B C
> 3. B C D
> 4. A B
> 5. C D
> 6. A C
> 7. B D
> 8. A D
> 9. A
> 10. B
> 11. C
> 12. D

I'm missing "A B D", "A C D", and "B C" from the collection.
Is their omission deliberate, or is the selection arbitrary?

> Anyone know of a module which can accomplish this? I really haven't tried
> anything yet, 'cause I have no clue how to do it. The closest I've
> come has been with the Algorithm::Permute module. It doesn't
> give me what I want "out of the box", though...

I'm not sure what you are asking. Is it the generation of all selections
of 1 .. 4 objects from a set of 4? These don't correspond to permutations,
but to four-digit binary numbers (so there are 2**4 - 1 = 15 of them,
not counting the empty selection). I'm sure there is a module on CPAN
to generate them, but ad-hoc solutions aren't too hard either.
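
As a rough sketch of the ad-hoc approach: count from 1 to 2**n - 1 and
let each set bit pick one word, so the words keep their original order
inside every selection. The sub name selections() is made up for
illustration (it is not from any CPAN module), and sorting the longest
selections first is only my guess at the ordering you are after.

```perl
use strict;
use warnings;

# Hypothetical helper: return every non-empty selection of @items,
# each as an array ref, preserving the original order of the items.
sub selections {
    my @items = @_;
    my @result;
    # Each number from 1 .. 2**n - 1 stands for one selection:
    # bit $i set means $items[$i] is part of it.
    for my $mask ( 1 .. 2**scalar(@items) - 1 ) {
        push @result,
            [ map { $items[$_] } grep { $mask & ( 1 << $_ ) } 0 .. $#items ];
    }
    return @result;
}

# Assumed ordering: more words first, ties broken alphabetically.
my @selections = sort { @$b <=> @$a or "@$a" cmp "@$b" }
    selections( qw( A B C D ) );

print "@$_\n" for @selections;
```

Note that this gives all 15 selections, including the three that are
missing from your list; if those really should be left out, filter
them afterwards.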

Or is the issue how to assign a score to each of a collection of
regexes and retrieve the score after each match? This can be done
using the (?{}) construct to execute code at match time.

Starting from your list (@lines, say) above, I'd generate a list @score
of pairs where each pair holds a score and a string to match:

    my @score = map [ split /\./], @lines;
    $_->[ 1] =~ tr/ //d for @score;

The second line simplifies things by deleting all blanks from the strings
to match. Your practical regexes may look different.

Build an alternation of patterns where each pattern includes code
to set a variable ($scored) to the corresponding score:

    my $rex = join '|', map "$_->[ 1](?\{ \$scored = $_->[ 0] \})", @score;

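Generate a random test string and check it:

    # 100 random letters drawn from A .. E
    my $text = join '', map qw( A B C D E)[ rand 5], 1 .. 100;

    my $scored;
    # required because $rex is interpolated at run time and contains (?{ })
    use re 'eval';
    while ( $text =~ /($rex)/g ) {
        print "score $scored: $1\n";
    }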
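
For a repeatable run, here is the same idea put together as one
stand-alone script. The shortened @lines list and the fixed $text are
assumptions chosen for the example, and I build the pattern with
sprintf instead of string interpolation, purely to avoid the
backslashes. Treat it as a sketch, not as the one true way.

```perl
use strict;
use warnings;
use re 'eval';    # the pattern is interpolated at run time and contains (?{ })

# Assumed input: a shortened version of the list, "score. words" per line.
my @lines = ( '1. A B C D', '2. A B C', '4. A B', '9. A' );

# Pairs of [ score, string to match ], with blanks removed from the string.
my @score = map [ split /\./, $_ ], @lines;
$_->[1] =~ tr/ //d for @score;

# Each alternative sets $scored to its own score when it matches.
my $rex = join '|',
    map sprintf( '%s(?{ $scored = %d })', $_->[1], $_->[0] ), @score;

my $text = 'EABCDEABEABCEA';    # fixed test string instead of a random one

my $scored;
my @hits;
while ( $text =~ /($rex)/g ) {
    push @hits, "$scored:$1";
    print "score $scored: $1\n";
}
```

The alternatives are tried in list order, so the longer strings must
come before their own prefixes; otherwise "A" would win every time.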

Anno


