Optimizing PHP Relevancy Ranking Algorithm

From: Galen (phplist_at_zinkconsulting.com)
Date: 11/28/03


To: php-general@lists.php.net
Date: Fri, 28 Nov 2003 03:26:08 -0800

I've developed some search result ranking code and it works extremely
well in terms of relevancy but needs use some help with performance.
I'm already using zend optimizer and I've done some basic things to
clean up the code. This helps performance quite a bit by itself, but
that's not enough, because I need to deal with ranking 20,000 search
results.

 From my limited benchmarking, it looks like the biggest problem is when
I work with "$_SESSION.
$_SESSION["search"]["results"][$key1]["relevancy"] = ($result_score *
-1) / $num_fields_matched;" and "usort($_SESSION["search"]["results"],
"cmp");" which use up about equal amounts of time each and account for
80% of the loop time, but I don't know how to fix it. For the first
piece of code, replacing $_SESSION with a blank array nets a huge
performance increase (virtually eliminates time spent on that line of
code). The second line of code, when commented out, causes an almost
identical performance improvement as the first.

I wonder, is there some slowness when dealing $_SESSION as compared to
a regular array? Or is the slowness related to handling large arrays?
One of these two things probably explains the first line. I think the
second troublesome line of code may be due to the same problem as the
first or it may be related to using usort instead of another sorting
option, but I'm not sure how I could use another sorting option to meet
my sorting needs for this complex situation.

Anybody got a few ideas on how to speed up these two sluggish lines of
code? I'm pretty much out of ideas. And if you have any other
suggestions to speed things up, I would really appreciate them too.

Thanks,
        -Galen P. Zink

The code:

                function cmp($a, $b)
                {
                        if($a["relevancy"] < $b["relevancy"])
                        {
                                return 1;
                        }
                        elseif($a["relevancy"] > $b["relevancy"])
                        {
                                return -1;
                        }
                        else
                        {
                                return 0;
                        }
                }

                foreach($_SESSION["search"]["results"] as $key1 => $value1)
                {
                        $num_fields_matched = 0;
                        $result_score = 0;
                        $metaphone_ratio = 0;
                        foreach($_SESSION["search"]["statements"] as $key => $value)
                        {
                                if ($value != "")
                                {
                                        $value = strtolower(trim($value));
                                        $value1[$key] = strtolower(trim(($value1[$key])));
                                        $num_fields_matched++;
                                        $levenshtein = levenshtein($value, $value1[$key], "0.5", 1, 1);
                                        $value_metaphone = metaphone($value1[$key]);
                                        $search_metaphone = metaphone($value);
                                        $search_position = strpos($value1[$key], $value);
                                        $string_count = substr_count($value1[$key], $value);
                                        
                                        if ($search_metaphone == $value_metaphone AND $value_metaphone !=
"")
                                        {
                                                $metaphone_ratio = 1;
                                        }
                                        elseif ($search_metaphone != 0)
                                        {
                                                $metaphone_ratio = 0.6 * (1 / levenshtein($search_metaphone,
$value_metaphone));
                                        }
                                        
                                        $result_score = 1; //basic math involving all above variables set
in this foreach loop goes here - I'm not able to show it due to IP
issues
                                }
                        }
                        if ($num_fields_matched == 0)
                        {
                                $num_fields_matched = 1;
                        }
                        $_SESSION["search"]["results"][$key1]["relevancy"] = ($result_score
* -1) / $num_fields_matched;
                }
                                
                usort($_SESSION["search"]["results"], "cmp");
        }



Relevant Pages

  • Re: A Fast sorting algorithm for almost sorted data
    ... far my compressor has potential but is nowhere near ready. ... It does however make heavy use of sorting. ... which I am currently calling Run sort. ... entire selected run can be added to the sorted output array. ...
    (comp.compression)
  • Re: Ranking Without Sorting
    ... >> sorting the elements of the array. ... > inverting and applying such a permutation can be done in linear time. ... > You can avoid having to sort the input array by treating the output ...
    (comp.programming)
  • Re: Ranking Without Sorting
    ... > sorting the elements of the array. ... inverting and applying such a permutation can be done in linear time. ... You can avoid having to sort the input array by treating the output ...
    (comp.programming)
  • Re: Ranking Without Sorting
    ... >> contradicted in my first post where I gave an example of how to rank ... > It seems quite clear to me that your 'ranking' array contains the exact ... An unsorted array contains the same information as the same array after sorting. ... > the same operation as the insertion loop in insertion sort... ...
    (comp.programming)
  • Re: Efficiently Extracting Identical Values From A List/Array
    ... struct SortHelper ... Now sort that array according to NodeIndex: ... running through the data structure and sorting things out. ...
    (comp.lang.cpp)