Optimizing PHP Relevancy Ranking Algorithm
From: Galen (phplist_at_zinkconsulting.com)
Date: 11/28/03
- Next message: David T-G: "Re: [PHP] Regular Expression Help: genreate alphanumeric, 8 chars"
- Previous message: Binay: "Mysql backup"
- Next in thread: David Otton: "Re: [PHP] Optimizing PHP Relevancy Ranking Algorithm"
- Reply: David Otton: "Re: [PHP] Optimizing PHP Relevancy Ranking Algorithm"
- Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]
To: php-general@lists.php.net Date: Fri, 28 Nov 2003 03:26:08 -0800
I've developed some search result ranking code and it works extremely
well in terms of relevancy but needs use some help with performance.
I'm already using zend optimizer and I've done some basic things to
clean up the code. This helps performance quite a bit by itself, but
that's not enough, because I need to deal with ranking 20,000 search
results.
From my limited benchmarking, it looks like the biggest problem is when
I work with "$_SESSION.
$_SESSION["search"]["results"][$key1]["relevancy"] = ($result_score *
-1) / $num_fields_matched;" and "usort($_SESSION["search"]["results"],
"cmp");" which use up about equal amounts of time each and account for
80% of the loop time, but I don't know how to fix it. For the first
piece of code, replacing $_SESSION with a blank array nets a huge
performance increase (virtually eliminates time spent on that line of
code). The second line of code, when commented out, causes an almost
identical performance improvement as the first.
I wonder, is there some slowness when dealing $_SESSION as compared to
a regular array? Or is the slowness related to handling large arrays?
One of these two things probably explains the first line. I think the
second troublesome line of code may be due to the same problem as the
first or it may be related to using usort instead of another sorting
option, but I'm not sure how I could use another sorting option to meet
my sorting needs for this complex situation.
Anybody got a few ideas on how to speed up these two sluggish lines of
code? I'm pretty much out of ideas. And if you have any other
suggestions to speed things up, I would really appreciate them too.
Thanks,
-Galen P. Zink
The code:
function cmp($a, $b)
{
if($a["relevancy"] < $b["relevancy"])
{
return 1;
}
elseif($a["relevancy"] > $b["relevancy"])
{
return -1;
}
else
{
return 0;
}
}
foreach($_SESSION["search"]["results"] as $key1 => $value1)
{
$num_fields_matched = 0;
$result_score = 0;
$metaphone_ratio = 0;
foreach($_SESSION["search"]["statements"] as $key => $value)
{
if ($value != "")
{
$value = strtolower(trim($value));
$value1[$key] = strtolower(trim(($value1[$key])));
$num_fields_matched++;
$levenshtein = levenshtein($value, $value1[$key], "0.5", 1, 1);
$value_metaphone = metaphone($value1[$key]);
$search_metaphone = metaphone($value);
$search_position = strpos($value1[$key], $value);
$string_count = substr_count($value1[$key], $value);
if ($search_metaphone == $value_metaphone AND $value_metaphone !=
"")
{
$metaphone_ratio = 1;
}
elseif ($search_metaphone != 0)
{
$metaphone_ratio = 0.6 * (1 / levenshtein($search_metaphone,
$value_metaphone));
}
$result_score = 1; //basic math involving all above variables set
in this foreach loop goes here - I'm not able to show it due to IP
issues
}
}
if ($num_fields_matched == 0)
{
$num_fields_matched = 1;
}
$_SESSION["search"]["results"][$key1]["relevancy"] = ($result_score
* -1) / $num_fields_matched;
}
usort($_SESSION["search"]["results"], "cmp");
}
- Next message: David T-G: "Re: [PHP] Regular Expression Help: genreate alphanumeric, 8 chars"
- Previous message: Binay: "Mysql backup"
- Next in thread: David Otton: "Re: [PHP] Optimizing PHP Relevancy Ranking Algorithm"
- Reply: David Otton: "Re: [PHP] Optimizing PHP Relevancy Ranking Algorithm"
- Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]
Relevant Pages
|